Ngoc Thanh Nguyen, Radoslaw Katarzyniak, and Shyi-Ming Chen (Eds.) Advances in Intelligent Information and Database Systems
Studies in Computational Intelligence, Volume 283 Editor-in-Chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6 01-447 Warsaw Poland E-mail:
[email protected] Further volumes of this series can be found on our homepage: springer.com Vol. 260. Edward Szczerbicki and Ngoc Thanh Nguyen (Eds.) Smart Information and Knowledge Management, 2009 ISBN 978-3-642-04583-7 Vol. 262. Jacek Koronacki, Zbigniew W. Ras, Slawomir T. Wierzchon, and Janusz Kacprzyk (Eds.) Advances in Machine Learning I, 2009 ISBN 978-3-642-05176-0 Vol. 263. Jacek Koronacki, Zbigniew W. Ras, Slawomir T. Wierzchon, and Janusz Kacprzyk (Eds.) Advances in Machine Learning II, 2009 ISBN 978-3-642-05178-4
Vol. 272. Carlos A. Coello Coello, Clarisse Dhaenens, and Laetitia Jourdan (Eds.) Advances in Multi-Objective Nature Inspired Computing, 2009 ISBN 978-3-642-11217-1 Vol. 273. Fatos Xhafa, Santi Caballé, Ajith Abraham, Thanasis Daradoumis, and Angel Alejandro Juan Perez (Eds.) Computational Intelligence for Technology Enhanced Learning, 2010 ISBN 978-3-642-11223-2 Vol. 274. Zbigniew W. Ra´s and Alicja Wieczorkowska (Eds.) Advances in Music Information Retrieval, 2010 ISBN 978-3-642-11673-5
Vol. 264. Olivier Sigaud and Jan Peters (Eds.) From Motor Learning to Interaction Learning in Robots, 2009 ISBN 978-3-642-05180-7
Vol. 275. Dilip Kumar Pratihar and Lakhmi C. Jain (Eds.) Intelligent Autonomous Systems, 2010 ISBN 978-3-642-11675-9
Vol. 265. Zbigniew W. Ras and Li-Shiang Tsay (Eds.) Advances in Intelligent Information Systems, 2009 ISBN 978-3-642-05182-1
Vol. 276. Jacek Ma´ndziuk Knowledge-Free and Learning-Based Methods in Intelligent Game Playing, 2010 ISBN 978-3-642-11677-3
Vol. 266. Akitoshi Hanazawa, Tsutom Miki, and Keiichi Horio (Eds.) Brain-Inspired Information Technology, 2009 ISBN 978-3-642-04024-5 Vol. 267. Ivan Zelinka, Sergej Celikovsk´y, Hendrik Richter, and Guanrong Chen (Eds.) Evolutionary Algorithms and Chaotic Systems, 2009 ISBN 978-3-642-10706-1 Vol. 268. Johann M.Ph. Schumann and Yan Liu (Eds.) Applications of Neural Networks in High Assurance Systems, 2009 ISBN 978-3-642-10689-7 Vol. 269. Francisco Fern´andez de de Vega and Erick Cant´u-Paz (Eds.) Parallel and Distributed Computational Intelligence, 2009 ISBN 978-3-642-10674-3 Vol. 270. Zong Woo Geem Recent Advances In Harmony Search Algorithm, 2009 ISBN 978-3-642-04316-1 Vol. 271. Janusz Kacprzyk, Frederick E. Petry, and Adnan Yazici (Eds.) Uncertainty Approaches for Spatial Data Modeling and Processing, 2009 ISBN 978-3-642-10662-0
Vol. 277. Filippo Spagnolo and Benedetto Di Paola (Eds.) European and Chinese Cognitive Styles and their Impact on Teaching Mathematics, 2010 ISBN 978-3-642-11679-7 Vol. 278. Radomir S. Stankovic and Jaakko Astola From Boolean Logic to Switching Circuits and Automata, 2010 ISBN 978-3-642-11681-0 Vol. 279. Manolis Wallace, Ioannis E. Anagnostopoulos, Phivos Mylonas, and Maria Bielikova (Eds.) Semantics in Adaptive and Personalized Services, 2010 ISBN 978-3-642-11683-4 Vol. 280. Chang Wen Chen, Zhu Li, and Shiguo Lian (Eds.) Intelligent Multimedia Communication: Techniques and Applications, 2010 ISBN 978-3-642-11685-8 Vol. 281. Robert Babuska and Frans C.A. Groen (Eds.) Interactive Collaborative Information Systems, 2010 ISBN 978-3-642-11687-2 Vol. 282. xxx Katarzyniak, and Vol. 283. Ngoc Thanh Nguyen, Radoslaw Shyi-Ming Chen (Eds.) Advances in Intelligent Information and Database Systems, 2010 ISBN 978-3-642-12089-3
Ngoc Thanh Nguyen, Radoslaw Katarzyniak, and Shyi-Ming Chen (Eds.)
Advances in Intelligent Information and Database Systems
123
Prof. Ngoc Thanh Nguyen
Prof. Shyi-Ming Chen
Institute of Informatics Wroclaw University of Technology Str. Wyb. Wyspianskiego 2750-370 Wroclaw Poland
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
E-mail:
[email protected]
Prof. Radoslaw Katarzyniak
E-mail:
[email protected]
Institute of Informatics Wroclaw University of Technology Str. Wyb. Wyspianskiego 2750-370 Wroclaw Poland E-mail:
[email protected]
ISBN 978-3-642-12089-3
e-ISBN 978-3-642-12090-9
DOI 10.1007/978-3-642-12090-9 Studies in Computational Intelligence
ISSN 1860-949X
Library of Congress Control Number: 2010922320 c 2010 Springer-Verlag Berlin Heidelberg This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India. Printed in acid-free paper 987654321 springer.com
Preface
Intelligent information and database systems are two closely related and wellestablished subfields of modern computer science. They focus on the integration of artificial intelligence and classic database technologies in order to create the class of next generation information systems. The major target of this new generation of systems is to provide end-users with intelligent behavior: simple and/or advanced learning, problem solving, uncertain and certain reasoning, selforganization, cooperation, etc. Such intelligent abilities are implemented in classic information systems to make them autonomous and user oriented, in particular when advanced problems of multimedia information and knowledge discovery, access, retrieval and manipulation are to be solved in the context of large, distributed and heterogeneous environments. It means that intelligent knowledge-based information and database systems are used to solve basic problems of large collections management, carry out knowledge discovery from large data collections, reason about information under uncertain conditions, support users in their formulation of complex queries etc. Topics discussed in this volume include but are not limited to the foundations and principles of data, information, and knowledge models, methodologies for intelligent information and database systems analysis, design, implementation, validation, maintenance and evolution. They cover a relatively broad spectrum of detailed research and design topics: user models, intelligent and cooperative query languages and interfaces, knowledge representation, integration, fusion, interchange and evolution, foundations and principles of data, information, and knowledge management, methodologies for intelligent information systems analysis, design, implementation, validation, maintenance and evolution, intelligent databases, intelligent information retrieval, digital libraries, and networked information retrieval, distributed multimedia and hypermedia information space design, implementation and navigation, multimedia interfaces, machine learning, knowledge discovery, and data mining, uncertainty management and reasoning under uncertainty. The book consists of extended chapters based on original works presented during a poster session organized within the 2nd Asian Conference on Intelligent Information and Database Systems (24-26 March 2010 in Hue, Vietnam). The book is organized into four parts. The first part is titled Information Retrieval and Management and consists of ten chapters that concentrate on many issues related to the way information can be retrieved and managed in the context of modern distributed and multimedia database systems. The second part of the book is titled Service Composition and User-Centered Approach, and consists of seven papers devoted to user centered information environments design and implementation. In some of these chapters detailed problems of effective autonomous user model creation and automation of the creation of user centered interfaces are discussed.
VI
Preface
The third part of the book is titled Data Mining and Knowledge Extraction and consists of eight chapters. In the majority of them their authors present and discuss new developments in data mining strategies and algorithms, and present examples of their application to support effective information retrieval, management and discovery. The fourth part of this volume consists of seven chapters published under one title Computational Intelligence. Their authors show how chosen computational intelligence technologies can be used to solve many optimization problems related to intelligent information retrieval and management. The editors hope that this book can be useful for graduate and PhD students in computer science as well as for mature academics, researchers and practitioners interested in merging of artificial intelligence technologies and database technologies in order to create new class of intelligent information systems. We wish to express our great attitude to Prof. Janusz Kacprzyk, the editor of this series, and Dr. Thomas Ditzinger from Springer for their interest and support for our project. Thanks are also due to Dr. Bogdan TrawiĔski, Dr. Przemysław Kazienko, Prof. Oscar Cordón for their excellent work during the organization of the Special Session on Multiple Model Approach to Machine Learning (MMAML 2010 – ACIIDS 2010), and to Prof. Le Thi Hoai An and Prof. Pham Dinh Tao for their effective organization of the Special Session on Modelling and Optimization Techniques in Information Systems, Database Systems and Industrial Systems (MOT-ACIIDS 2010). The last but not least we wish to express our great attitude to all authors who contributed to the content of this volume.
January 2010
Ngoc Thanh Nguyen Radosław Piotr Katarzyniak Shyi-Ming Chen
Table of Contents
Part I: Information Retrieval and Management A Construction of Hierarchical Rough Set Approximations in Information Systems Using Dependency of Attributes . . . . . . . . . . . . . . Tutut Herawan, Iwan Tri Riyadi Yanto, and Mustafa Mat Deris
3
Percentages of Rows Read by Queries as an Operational Database Quality Indicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pawel Lenkiewicz and Krzysztof Stencel
17
Question Semantic Analysis in Vietnamese QA System . . . . . . . . . . . . Tuoi T. Phan, Thanh C. Nguyen, and Thuy N.T. Huynh
29
Ontology-Based Query Expansion with Latently Related Named Entities for Semantic Text Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vuong M. Ngo and Tru H. Cao
41
Indexing Spatial Objects in Stream Data Warehouse . . . . . . . . . . . . . . Marcin Gorawski and Rafal Malczok
53
Real Time Measurement and Visualization of ECG on Mobile Monitoring Stations of Biotelemetric System . . . . . . . . . . . . . . . . . . . . . Ondrej Krejcar, Dalibor Janckulik, Leona Motalova, Karel Musil, and Marek Penhaker
67
A Search Engine Log Analysis of Music-Related Web Searching . . . . . Sally Jo Cunningham and David Bainbridge
79
Data Hiding Based on Compressed Dithering Images . . . . . . . . . . . . . . Cheonshik Kim
89
Reliable Improvement for Collective Intelligence on Thai Herbal Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Verayuth Lertnattee, Sinthop Chomya, and Virach Sornlertlamvanich Semantic Compression for Specialised Information Retrieval Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dariusz Ceglarek, Konstanty Haniewicz, and Wojciech Rutkowski
99
111
VIII
Table of Contents
Part II: Service Composition and User-Centered Approach Service Mining for Composite Service Discovery . . . . . . . . . . . . . . . . . . . Min-Feng Wang, Meng-Feng Tsai, Cheng-Hsien Tang, and Jia-Ying Hu
125
An Adaptive Grid-Based Approach to Location Privacy Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anh Tuan Truong, Quynh Chi Truong, and Tran Khanh Dang
133
View Driven Federation of Choreographies . . . . . . . . . . . . . . . . . . . . . . . Amirreza Tahamtan and Johann Eder
145
Semantic Battlespace Data Mapping Using Tactical Symbology . . . . . Mariusz Chmielewski and Andrzej Galka
157
A Method for Scenario Modification in Intelligent E-Learning Systems Using Graph-Based Structure of Knowledge . . . . . . . . . . . . . . Adrianna Kozierkiewicz-Hetma´ nska and Ngoc Thanh Nguyen
169
Development of the E-Learning System Supporting Online Education at the Polish-Japanese Institute of Information Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pawel Lenkiewicz, Lech Banachowski, and Jerzy Pawel Nowacki Evolutionally Improved Quality of Intelligent Systems Following Their Users’ Point of View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Barbara Begier
181
191
Part III: Data Mining and Knowledge Extraction Mining the Most Generalization Association Rules . . . . . . . . . . . . . . . . Bay Vo and Bac Le
207
Structure of Set of Association Rules Based on Concept Lattice . . . . . Tin C. Truong and Anh N. Tran
217
Some Novel Heuristics for Finding the Most Unusual Time Series Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mai Thai Son and Duong Tuan Anh
229
Using Rule Order Difference Criterion to Decide Whether to Update Class Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kritsadakorn Kongubol, Thanawin Rakthanmanon, and Kitsana Waiyamai
241
Table of Contents
IX
An Integrated Approach for Exploring Path-Type Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nai-Chieh Wei, Yang Wu, I-Ming Chao, and Shih-Kai Lin
253
A Framework of Rough Clustering for Web Transactions . . . . . . . . . . . Iwan Tri Riyadi Yanto, Tutut Herawan, and Mustafa Mat Deris
265
Fuzzy Bayesian Belief Network for Analyzing Medical Track Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rolly Intan and Oviliani Yenty Yuliana
279
An Experiment Model of Grounded Theory and Chance Discovery for Scenario Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tzu-Fu Chiu, Chao-Fu Hong, and Yu-Ting Chiu
291
Part IV: Computational Intelligence Using Tabu Search for Solving a High School Timetabling Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Khang Nguyen Tan Tran Minh, Nguyen Dang Thi Thanh, Khon Trieu Trang, and Nuong Tran Thi Hue Risk Management Evaluation Based on Elman Neural Network for Power Plant Construction Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yongli Wang, Dongxiao Niu, and Mian Xing A New Approach to Multi-criteria Decision Making (MCDM) Using the Fuzzy Binary Relation of the ELECTRE III Method and the Principles of the AHP Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Laor Boongasame and Veera Boonjing
305
315
325
A Routing Method Based on Cost Matrix in Ad Hoc Networks . . . . . Mary Wu, Shin Hun Kim, and Chong Gun Kim
337
A Fusion Approach for Multi-criteria Evaluation . . . . . . . . . . . . . . . . . . Jia-Wen Wang and Jing-Wen Chang
349
Algorithmic Aspects of the Reachability of Conflicting Chip Firing Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Le Manh Ha, Nguyen Anh Tam, and Phan Thi Ha Duong
359
Neurofuzzy Decision-Making Approach for the Next Day Portfolio Thai Stock Index Management Trading Strategies . . . . . . . . . . . . . . . . . Monruthai Radeerom and M.L. Kulthon Kasemsan
371
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
383
A Construction of Hierarchical Rough Set Approximations in Information Systems Using Dependency of Attributes Tutut Herawan1,2, Iwan Tri Riyadi Yanto1,2, and Mustafa Mat Deris1 1
FTMM, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia 2 Universitas Ahmad Dahlan, Yogyakarta, Indonesia
[email protected],
[email protected],
[email protected]
Abstract. This paper presents an alternative approach for constructing a hierarchical rough set approximation in an information system. It is based on the notion of dependency of attributes. The proposed approach is started with the notion of a nested sequence of indiscernibility relations that can be defined from the dependency of attributes. With this notion, a nested rough set approximation can be easily constructed. Then, the notion of a nested rough set approximation is used for constructing a hierarchical rough set approximation. Lastly, applications of a hierarchical rough set approximation for data classification and capturing maximal association in document collection through information systems are presented. Keywords: Information systems; Rough set theory; Dependency of Attributes; Nested rough set approximations; Hierarchical rough set approximation.
1 Introduction Rough set theory, introduced by Pawlak in 1982 [1], is a new mathematical tool to deal with vagueness (set) and uncertainty (element). To information systems, the idea of rough set theory consists of two dual set approximations called the lower and upper approximations, where it is based on indiscernibility relation of all objects [2,3]. It offers effective methods that are applicable in many branches of artificial intelligence and data mining [3,4]. One such application is in granular computing where the concept of approximation is used to solve some classification problems. A cluster (granule) usually consists of elements that are drawn together by similarity, proximity, or functionality [5−7]. In granular computing, the granules i.e., groups, concepts, categories, classes, or clusters of a universe, are used in the processes of problem solving. When a problem involves incomplete, uncertain, or vague information, it may be difficult to differentiate distinct elements and one is forced to consider granules for the purpose of differentiation. The granulated view of the universe is based on a binary relation representing the type of similarities between elements of a universe [6]. Marek and Rasiowa [7] considered gradual approximations of sets based on a descending sequence of equivalence relations. Yao [8,9] suggested the use of hierarchical granulations for N.T. Nguyen et al. (Eds.): Adv. in Intelligent Inform. and Database Systems, SCI 283, pp. 3–15. springerlink.com © Springer-Verlag Berlin Heidelberg 2010
4
T. Herawan, I.T.R. Yanto, and M.M. Deris
the study of stratified rough set approximations. To this, Yao [6] proposed hierarchical granulations induced by a special class of equivalence relations. In this paper, we extend the Yao’s hierarchical rough set approximation induced by a special class of equivalence relations based on (Pawlak) approximation space to hierarchical rough set approximations of an information system, induced by dependency of attributes. Our approach starts with the notion of a nested sequence of indiscernibility relation that can be defined from the dependency of attributes in an information system. Based on a nested sequence of indiscernibility relation, a nested rough set approximation can be constructed. Further on, the notion of a nested rough set approximation is then used for constructing hierarchical rough set approximation in an information system. With this approach, a hierarchical approximation (hierarchical granulation) can be easily constructed. In addition, we present applications of a hierarchical rough set approximation for data classification and capturing maximal association in document collection through information systems. The rest of this paper is organized as follows. Section 2 describes a fundamental concept of rough set theory. Section 3 describes a construction of a hierarchical rough set approximations using dependency of attributes and its applications. Finally, we conclude our works in section 4.
2
Rough Set Theory
An information system is a 4-tuple (quadruple), S = (U , A, V , f ) , where U is a non-empty finite set of objects, A is a non-empty finite set of attributes, V = a∈A V a , Va is the domain (value set) of attribute a, f : U × A → V is a total
∪
function such that f (u, a ) ∈ V a , for every (u, a ) ∈ U × A , called information (knowledge) function. In many information system’s applications, there is an outcome of classification that is known and categorized as posteriori knowledge. The posteriori knowledge is expressed by one (or more) distinguished attribute called decision attribute; the process is known as supervised learning [3]. This information system is called a decision system. Thus, a decision system is an information system of the form D = (U , A ∪ {d }, V , f ) , where d ∉ A is the decision attribute. The elements of A are called condition attributes.
Definition 1. Let S = (U , A, V , f ) be an information system and let B be any subset of A. Two elements x, y ∈ U are said to be B-indiscernible (indiscernible by
the set of attribute B in S) if only if f (x, a ) = f ( y, a ) , for every a ∈ B .
Obviously, every non-empty subset of A induces unique indiscernibility relation. Notice that, an indiscernibility relation induced by the set of attribute B, denoted by IND(B ) , is an equivalence relation. It is well known that, an equivalence relation induces unique partition. The partition of U induced by B in S denoted by U / B and the equivalence class in the partition U / B containing x ∈ U , denoted
A Construction of Hierarchical Rough Set Approximations in Information Systems
5
by [x ]B . The notions of lower and upper approximations of a set are given in the following definition. Definition 2. Let S = (U , A, V , f ) be an information system and let B be any subset of A and let a subset X ⊆ U . The B-lower approximation of X, denoted by
B( X ) and B-upper approximations of X, denoted by B( X ) , respectively, are defined by
{
B( X ) = x ∈ U
[x]
B
}
{
⊆ X and B( X ) = x ∈ U
[x]
B
}
∩ X ≠φ .
Definition 3. Let S = (U , A, V , f ) be an information system and let B be any subset of A. A rough approximation of a subset X ⊆ U with respect to B is defined as a pair of lower and upper approximations of X, i.e.
B( X ), B( X ) .
(1)
The upper approximation of a subset X ⊆ U can be expressed using set complement and lower approximation of X that can been seen easily as follows
B( X ) = U − B(¬X ) ,
(2)
where ¬ X denote the complement of X relative to U. The accuracy of approximation (roughness) of a subset X ⊆ U , denoted
α B ( X ) is measured by
α B (X ) =
B( X ) B( X )
,
(3)
where X denotes the cardinality of X. For empty set φ , we define α B (φ ) = 1 .
Obviously, 0 ≤ α B ( X ) ≤ 1 . If X is a union of some equivalence classes, then
α B ( X ) = 1 . Thus, the set X is crisp with respect to B, and otherwise, if α B ( X ) < 1 , X is rough with respect to B. The notion a functional dependency of attributes is given in the following definition. Definition 4. Let S = (U , A, V , f ) be an information system and let D and C be any subsets of A. Attribute D is said to be totally dependent on attribute C, denoted C ⇒ D , if all values of attributes D are uniquely determined by values of attributes C.
In other words, attribute D depends totally on attribute C, if there exist a functional dependency between values D and C. The notion a generalized dependency of attributes with a degree k is given in the following definition.
6
T. Herawan, I.T.R. Yanto, and M.M. Deris
Definition 5. Let S = (U , A, V , f ) be an information system and let D and C be any subsets of A. The dependency degree of attribute D on attributes C in k, denoted C ⇒ k D , is defined by
k=
∑
X ∈U / D
C(X )
U
.
(5)
Obviously, 0 ≤ k ≤ 1 . Attribute D is said to be totally dependent (in a degree of k) on the attribute C if k = 1 . Otherwise, D will partially depends on C. Thus, attribute D depends totally (partially) on attribute C, if all (some) elements of the universe U can be uniquely classified to equivalence classes of the partition U / D , employing C.
3 A Construction of Hierarchical Rough Set Approximation in Information Systems Using Dependency of Attributes In this section we present a construction of a hierarchical rough set approximation in an information system using dependency of attributes. To start off, the nested rough set approximation has to be based on the dependency of attributes. Further, the notion of a nested rough set approximation is used for constructing hierarchical rough set approximations. 3.1 Nested Rough Set Approximations
In this sub-section, the construction of nested rough set approximations in an information system using dependency of attributes is presented. We may consider all possible subsets of attributes to obtain different degrees of dependency of attributes. Definition 6. A sequence
sn
is a function which domain is contained in the set
of all natural numbers N and range is contained in the set of its terms, {s1 , s 2 , , s n } . Definition 7. Let S = (U , A, V , f ) be an information system and let A1 , A2 , , An be any subset of A. A sequence of indiscernibility relations induced by IND( Ai )i =1, , n is said to be nested if Ai , i = 1,2, , n , denoted
IND ( A1 ) ⊆ IND( A2 ) ⊆
⊆ IND ( An ) .
In this case, we can say that an equivalence relation induced by Ai is coarser than an equivalence relation induced by A j , where 1 ≤ j ≤ n . It is clear that, if
IND ( A1 ) ⊆ IND ( A2 ) , then U / A1 is finer than U / A2 .
A Construction of Hierarchical Rough Set Approximations in Information Systems
X ⊆ U , denoted
Definition 8. The rough approximations of a subset
(A (X ), A (X )) n
n
i =1,
is said to be nested if ,n
An ( X ) ⊆
then
⊆ A1 ( X ) ⊂ X ⊂ A1 ( X ) ⊆
⊆ An ( X ) .
S = (U , A, V , f ) be an information system and let
Proposition 9. Let
, An be any subsets of A. If Ai depends on Ai +1 , for i = 1,2,
A1 , A2 ,
(A ( X ), A ( X )) n
7
n
, n −1,
is a nested rough approximation of a subset X ⊆ U .
i =1, , n
, An ⊆ A . If Ai depends on Ai +1 , then IND( Ai +1 ) ⊆ IND( Ai )
Proof. Let A1 , A2 ,
and every equivalence class induced by IND( Ai +1 ) is a union of some equivalence
class induced by IND( Ai ) , it means that for every x ∈ U , [x ]A ⊆ [x ]A i
i +1
and thus
U / Ai +1 is coarser than U / Ai . On the other side, the dependability of Ai on Ai +1 , for i = 1,2,
IND( Ai )i =1,
, n − 1 , implies
to be a nested sequence of indis-
,n
cernibility relations. Since, a nested sequence of indiscernibility relations determines a nested rough approximations of a subset X ⊆ U ,
An ( X ), An ( X )
i =1, , n
□
then the proof is completed. Proposition
10.
(A ( X ), A ( X )) n
n
i =1, ,n
Let
S = (U , A, V , f )
be
an
information
system.
If
is a nested rough approximations of a subset X ⊆ U , then
α A ( X ) ≤ α A ( X ) , for i = 1,2, i +1
,
, n −1 .
i
Proof. It follows from Definition 8, we have An ( X ) ⊆
⊆ A1 ( X ) ⊂ X ⊂ A1 ( X ) ⊆
⊆ An ( X ) .
for every X ⊆ U . Consequently An ( X ) An ( X )
≤
An −1 ( X ) An −1 ( X )
α A (X ) ≤ α A n
n −1
≤
(X ) ≤
≤
A2 ( X ) A2 ( X )
≤
A1 ( X ) A1 ( X )
≤ α A (X ) ≤ α A (X ) . 2
1
□
Example 11. We illustrate our approach and compare it with Yao’s nested rough set approximation approach. Based on the information system as in Table 1 in [6], Yao’s rough set approximations is been presented.
Based on Table 1, Yao considered the sequence of subsets of attributes A = {A1 , A2 , A3 , A4 } , B = {A1 , A2 } , C = {A1 } , D = φ . Yao used the notion of a nested sequence of equivalence relation R A ⊆ RB ⊆ RC ⊆ RD . Thus, for each subset of attributes, Yao will then obtain the following partition (granulation structure)
8
T. Herawan, I.T.R. Yanto, and M.M. Deris
U / D = {a, b, c, d , e, f } , U / C = {{a, b, c, d }, {e, f }} , U / B = {{a}, {b, c}, {d }, {e, f }} ,
U / A = {{a}, {b}, {c}, {d }, {e}, { f }} . As for X (Class = + ) = {c, d , e, f } , the accuracy in each level granulation of the rough set approximation obtained by Yao can tabulated as in Table 2. Table 1. An information system
Object a b c d e f
A1 1 1 1 1 0 0
A2 1 2 2 3 1 1
A3 1 1 0 1 0 1
A4 1 0 0 1 1 1
Class − − + + + +
Table 2. Set approximations and their accuracy in each level partition
App of X w.r.t. A
B C
Low
{c, d , e, f } {c, d , e, f } {d , e, f } {b, c, d , e, f } U {e, f }
φ
D
Upp
U
Acc
1.000 0.600 0.333 0
However, our approach uses the notion of dependency of attributes in an information system. When compared with Yao’s approach, we only use the set of attributes A, B and C. From Table 1, we have, C ⇒ k =1 B , B ⇒ k =1 A , and A ⇒ k =1 Class . The nested rough approximations from those attributes dependen-
cies of a subset X (Class = + ) = {c, d , e, f } is given by
A( X ), A( X ) , B( X ), B( X ) , C ( X ), C ( X ) with
C ( X ) ⊆ B( X ) ⊆ A( X ) ⊆ X ⊆ A( X ) ⊆ B( X ) ⊆ C ( X ) .
Meanwhile, the accuracy in each approximation can be summarized as in Table 3. Table 3. Set approximations and their accuracy in eachlevel partition using attributes dependency
App of X w.r.t. A
B C
Low
Upp
{c, d , e, f } {c, d , e, f } {d , e, f } {b, c, d , e, f } U {e, f }
Acc
1.000 0.600 0.333
A Construction of Hierarchical Rough Set Approximations in Information Systems
9
From Table 3, with lower degree of dependency, coarser partition will be obtained, at the same time obtaining lesser accuracy in the rough approximation. Thus, with this approach, from Table 3, we can easily define the notions of a nested rough set approximation in an information system and at the same time obtain the identical accuracy of approximation as in Table 2. 3.2 Hierarchical Rough Set Approximations
In this sub-section, a construction of hierarchical rough set approximations using dependency of attributes is presented. It is based on nested rough set approximations. A hierarchy may be viewed as a successive top-down decomposition of a universe U. Alternatively, a hierarchy may also be viewed as a successive bottomup combination of smaller clusters to form larger clusters [6]. Follows Definition 7, a hierarchical rough set approximation is constructed in Figure 1. The highest level of a hierarchical rough set approximation is the partition induced by IND ( An ) and the lowest of the hierarchy is the partition induced by IND( A1 ) . Consequently, for every x ∈ U , we have the nested clusters, in this case
equivalence classes, contain x, i.e. [x ]A ⊆ [x ]A ⊆ 1
2
⊆ [x ]A . n
Let C = {X ⊆ U : X is a class in each partition U / Ai } , thus every class induced
by IND( Ai +1 ) is a union of some class induced by IND( Ai ) .
U / An
U / An
⇓ U / An −1
⇑ U / An −1
⇓
⇑
⇓
⇑
U / A1 Top-down
U / A1 Bottom-up
Fig. 1. A hierarchical rough set approximation
3.3 Example
a. We explain the application of hierarchical rough set approximation concept for data classification through an example from a simple data set derived from [10]. In Table 4, there are ten objects U = {1,2,3,4,5,6,7,8,9,10} with six attributes A = {a1 , a 2 , a3 , a 4 , a5 , a 6 } .
From Table 4, we let three subsets of A; B = {a 2 , a 3 , a 6 } , C = {a 2 , a 3 } and
D = {a 2 } . To this, we have the following partitions
10
T. Herawan, I.T.R. Yanto, and M.M. Deris
U / B = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}}
U / C = {{1}, {2}, {3,5,8}, {4}, {6,9}, {7}, {10}}
U / D = {{1,4}, {2}, {3,5,7,8}, {6,9,10}}
Table 4. An information system from [10]
U 1 2 3 4 5 6 7 8 9 10
a1 Big Medium Small Medium Small Big Small Small Big Medium
a2 Blue Red Yellow Blue Yellow Green Yellow Yellow Green Green
a3 Hard Moderate Soft Moderate Soft Hard Hard Soft Hard Moderate
a4 Indefinite Smooth Fuzzy Fuzzy Indefinite Smooth Indefinite Indefinite Smooth Smooth
a5 Plastic Wood Plush Plastic Plastic Wood Metal Plastic Wood Plastic
a6 Negative Neutral Positive Negative Neutral Positive Positive Positive Neutral Neutral
The degree dependencies of those set of attributes are B⇒ k C , where
k1 =
1
∑
X ∈U / C
B( X )
U
=
{1,2,3,4,5,6,7,8,9,10} =1 {1,2,3,4,5,6,7,8,9,10}
and C ⇒ k =1 D , where k1 =
∑
X ∈U / D
U
C(X )
=
{1,2,3,4,5,6,7,8,9,10} = 1. {1,2,3,4,5,6,7,8,9,10}
Thus, a nested rough set approximation of a set X ⊆ U can be obtained as follows B( X ), B( X ) , C ( X ), C ( X ) , D( X ), D( X )
with D( X ) ⊆ C ( X ) ⊆ B( X ) ⊆ X ⊆ B ( X ) ⊆ C ( X ) ⊆ D ( X ) .
The hierarchical of rough set approximations is obtained as follows. From Figure 2, the set of classes is given by C = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}, {1,4}, {6,9}, {3,5,8}, {6,9,10}, {3,5,7,8},U }
Therefore, from C, we can classify the objects based on all possible partitions.
A Construction of Hierarchical Rough Set Approximations in Information Systems
{1,2,3,4,5,6,7,8,9,10} {1,4} {2} {3,5,7,8} {6,9,10} {1} {4}
{3,5,8} {7} {6,9} {10}
{3} {5} {8}
{6} {9}
The 1st level
The 2nd level
The 3rd level
Fig. 2. A hierarchy of the rough set approximations
℘1 = {{1,2,3,4,5,6,7,8,9,10}}
℘2 = {{1,4}, {2}, {3,5,7,8}, {6,9,10}}
℘3 = {{1,4}, {2}, {3,5,7,8}, {6,9}, {10}}
℘4 = {{1,4}, {2}, {3,5,7,8}, {6}, {9}, {10}}
℘5 = {{1,4}, {2}, {3,5,8}, {6,9,10}, {7}}
℘6 = {{1,4}, {2}, {3,5,8}, {6,9}, {7}, {10}}
℘7 = {{1,4}, {2}, {3,5,8}, {6}, {7}, {9}, {10}}
℘8 = {{1,4}, {2}, {3}, {5}, {6,9,10}, {7}, {8}}
℘9 = {{1,4}, {2}, {3}, {5}, {6,9}, {7}, {8}, {10}}
℘10 = {{1,4}, {2}, {3}, {5}, {6}, {7}, {8}, {9}, {10}} ℘11 = {{1}, {2}, {3,5,7,8}, {4}, {6,9,10}}
℘12 = {{1}, {2}, {3,5,7,8}, {4}, {6,9}, {10}}
℘13 = {{1}, {2}, {3,5,7,8}, {4}, {6}, {9}, {10}} ℘14 = {{1}, {2}, {3,5,8}, {4}, {6,9,10}, {7}}
℘15 = {{1}, {2}, {3,5,8}, {4}, {6,9}, {7}, {10}}
℘16 = {{1}, {2}, {3,5,8}, {4}, {6}, {7}, {9}, {10}}
℘17 = {{1}, {2}, {3}, {4}, {5}, {6,9,10}, {7}, {8}}
℘18 = {{1}, {2}, {3}, {4}, {5}, {6,9}, {7}, {8}, {10}}
℘19 = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}} Fig. 3. The possible partitions
11
12
T. Herawan, I.T.R. Yanto, and M.M. Deris
From Figure 3, we can classify the objects based on a pre-defined number of classes. This is subjective and is pre-decided based either on user requirement or domain knowledge. b. We will further elaborate our approach for discovering maximal association in document collection and compare it with the rough-set based approach of [11−13]. The data is presented in a Boolean-valued in formation system derived from the widely used Reuters-21578 [14]. It is a labeled document collection, i.e. a benchmark for text categorization, as follows: Assume that there are 10 articles regarding corn which relate to the USA and Canada and 20 other articles concerning fish and the countries USA, Canada and France.
Table 5. An information system
U u1
… u 10 u 11
… u 30
US A 1 … 1 1 … 1
Canada
France
Corn
1 … 1 1 … 1
0 … 0 1 … 1
1 … 1 0 … 0
Fis h 0 … 0 1 … 1
Bi et al. [11] and Guan et al. [12,13] proposed the same approach for discovering maximal association rules using rough set theory. Their proposed approach is based on a partition on the set of all attributes in a transactional database so-called a taxonomy and categorization of items. In defining the maximal support of an association, they used the concept of set approximation. However, in defining the maximal confidence, they still used the confidence concept of the maximal association approach of [15]. In this example, we show how the hierarchical of rough set approximations using dependency of attributes can be used to capture maximal associations in document collection. Let I = {i1 , i2 , i3 , , in } be a set of items and D = {t1 , t 2 , , t m } is transaction database over I. A taxonomy T of I is a partition of I into disjoint sets T = {T1 , T2 , , Tk } . Further, each elements of T is called category. Based on Table 5,
we
can
define
a
taxonomy
T
as
T = {Countries, Topics} ,
where
Countries = {USA, Canada, France } and Topics = {Corn, Fish} . According to
[15], for a transaction t and a category Ti , an itemset X ⊆ Ti is said to be maximal in t if t ∩ Ti = X . Thus, X is maximal in t if X is the largest subset of Ti which is max
in t. A maximal association rule is a rule of the form X ⇒ Y , where X and Y are maximal subsets in distinct categories, T ( X ) and T (Y ) , respectively. The support
A Construction of Hierarchical Rough Set Approximations in Information Systems max
of
the
X ⇒Y ,
rule
denoted
by
max MSupp⎛⎜ X ⇒ Y ⎞⎟ ⎝ ⎠
is
defined
13
as
max MSupp⎛⎜ X ⇒ Y ⎞⎟ = {t : t maximal supports X ∪ Y } . The confidence of the rule ⎝ ⎠
⎛
max
⎞
max max max MSupp ⎜ X ⇒ Y ⎟ ⎝ ⎠ X ⇒ Y , denoted by C Dmax ⎛⎜ X ⇒ Y ⎞⎟ is defined as MConf ⎛⎜ X ⇒ Y ⎞⎟ = . X ⎝ ⎠ ⎝ ⎠ From Table 5, the maximal supported sets are captured as follow. From Figure 4, we have the partitions of category Countries as
U / Countries = {{1,2,
,10}, {11,12,
,30}} ,
and the partitions of category Topics as U /{Topics} = {{1,2,
{t , t , {t , t , {t , t , {t , t , 1
11
1
11
2
12
2
12
,10}, {11,12,
,30}}.
, t10 } ∩ Countries = {USA, Canada} , t 30 } ∩ Countries = {USA, Canada, France} , t10 } ∩ Topics = {Corn} , t 30 } ∩ Topics = {Fish} Fig. 4. The maximal supported sets
The hierarchical of rough set approximations are obtained as follows.
{1,2, {1,
,10}
,30}
{11,
{1,2, ,30}
{1,
Countries
,10}
,30}
{11,
,30}
Topics
Fig. 5. A hierarchy of the rough set approximations
From Figure 5, we get the maximal association between two categories as given in Figure 6.
{USA, Canada}⇒{Corn} , {Corn}⇒{USA, Canada} {USA, Canada, France}⇒{Fish} , {Fish}⇒{USA, Canada, France} max
max
max
max
Fig. 6. The maximal rules
14
T. Herawan, I.T.R. Yanto, and M.M. Deris
We notice that, the maximal rules captured are equivalent with that of [11−13,15]. The maximal supports and confidences of the rules are 100% equal with their totally dependency degree.
4 Conclusion In this paper the notion of dependency of attributes in information systems is used. We have shown that it can be used to define a nested sequence of indiscernibility relations. Subsequently, the notion of a nested sequence of indiscernibility relations can be used to define nested rough set approximation in information systems. Further, we have shown that the notion of a nested rough set approximation can be used for constructing hierarchical rough set approximations. For the applications, firstly we have presented how hierarchical rough set approximations can be applied for data classification through an information system. And, lastly we have presented an application of such hierarchy for capturing maximal association in document collection. It is shown that our approach properly capture the maximal rules.
Acknowledgement This work was supported by the FRGS under the Grant No. Vote 0402, Ministry of Higher Education, Malaysia.
References 1. Pawlak, Z.: Rough sets. International Journal of Computer and Information Science 11, 341–356 (1982) 2. Pawlak, Z.: Rough sets: A theoretical aspect of reasoning about data. Kluwer Academic Publishers, Dordrecht (1991) 3. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177(1), 3–27 (2007) 4. Pawlak, Z., Skowron, A.: Rough sets: Some extensions. Information Sciences 177(1), 28–40 (2007) 5. Yao, Y.Y.: Granular Computing Using Neighborhood Systems. In: Roy, R., Furuhashi, T., Chawdhry, P.K. (eds.) Advances in Soft Computing: Engineering Design and Manufacturing, pp. 539–553. Springer-Verlag, Heidelberg (1999) 6. Yao, Y.Y.: Information granulation and rough set approximation. International Journal of Intelligent Systems 16(1), 87–104 (2001) 7. Marek, W., Rasiovwa, H.: Gradual Approximation Sets by Means of Equivalence Relations. Bulletin of Polish Academy of Sciences 35, 233–238 (1987) 8. Yao, Y.Y.: Stratified rough sets and granular computing. In: The Proceedings of the 18th International Conference of the North American Fuzzy Information Processing Society, pp. 800–804. IEEE Press, Los Alamitos (1999)
A Construction of Hierarchical Rough Set Approximations in Information Systems
15
9. Yao, Y.Y.: Rough sets, neighborhood systems, and granular computing. In: The Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, pp. 1553–1558. IEEE Press, Edmonton (1999) 10. Parmar, D., Wu, T., Blackhurst, J.: MMR: An algorithm for clustering categorical data using rough set theory. Data and Knowledge Engineering 63, 879–893 (2007) 11. Bi, Y., Anderson, T., McClean, S.: A rough set model with ontologies for discovering maximal association rules in document collections. Knowledge-Based Systems 16, 243–251 (2003) 12. Guan, J.W., Bell, D.A., Liu, D.Y.: The Rough Set Approach to Association Rule Mining. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), pp. 529–532 (2003) 13. Guan, J.W., Bell, D.A., Liu, D.Y.: Mining Association Rules with Rough Sets. Studies in Computational Intelligence, pp. 163–184. Springer, Heidelberg (2005) 14. http://www.research.att.com/lewis/reuters21578.html 15. Feldman, R., Aumann, Y., Amir, A., Zilberstein, A., Klosgen, W.: Maximal association rules: a new tool for mining for keywords cooccurrences in document collections. In: The Proceedings of the KDD 1997, pp. 167–170 (1997)
Percentages of Rows Read by Queries as an Operational Database Quality Indicator Paweł Lenkiewicz1 and Krzysztof Stencel1,2 1
Polish-Japanese Institute of Information Technology, Warsaw, Poland
[email protected] 2 Institute of Informatics, Warsaw University, Poland
[email protected]
Abstract. Trace files generated during operation of a database provide enormous amounts of information. This information varies from one DBMS to another—some databases produce more information than the others. There are many research projects which aim at analysing workloads of databases (not necessarily the trace files). Many of them work online in parallel with the usual business of a DBMS. Such approaches exclude a holistic tackling of trace files. To date, the research on offline methods had only a partial scope. In this paper we show a comprehensive method to analyse trace files off-line. The aim of this analysis is to indicate tables and queries which are not handled well in current design of the database and the application. Next, we show case-studies performed on two dissimilar database applications which show the potential of the described method.
1 Introduction An operating database may be set to produce trace files. It is a log of all SQL statements performed by the DBMS accompanied with a number of measures like elapsed time, CPU time, number of rows touched, number of logical/physical reads of data blocks. Trace files are sources of mass information on the actual performance of the DBMS. A holistic analysis of such a source is a challenging task. There is a lot of ongoing research on the workload of databases (by the workload we mean the stream of queries performed by a database). If we limit solely to the stream we can perform some online tests to assess the current performance and the need to change the design [1,2,3]. However, it is impossible to analyse the whole trace online, since in this mode we can only assess some simple measures based on the workload stream. Compare it to the basic blood test. A small blood drop contains all information on the diseases you had, you have and most of those which you will have. However, the basic blood test consists in evaluating a relatively small number of simple indicators. This is a tremendous wasting of information: the blood probe contains the whole DNA of the examined human being, but we observe no more than fifty numbers. The same concerns database trace, which contains all the information on the problems the database system has and will have. However, currently available tools and methods allow assessing only an isolated fraction of the problem. N.T. Nguyen et al. (Eds.): Adv. in Intelligent Inform. and Database Systems, SCI 283, pp. 17–27. c Springer-Verlag Berlin Heidelberg 2010 springerlink.com
18
P. Lenkiewicz and K. Stencel
Of course we have excellent advisor methods to recommend indices [4,5,6], recommend merging of indices [7], recommend materialized views [5,8]. Biggest commercial DBMS vendors implemented special advising tools, e.g. Oracle [9], Microsoft SQL Server [10] and IBM DB2 [11]. There are also some good books [12,13] on database tuning. They are invaluable source of practical information on tuning. However, there is no proposal of comprehensive analysis of more than one aspect in the trace file. In this paper we propose one such method. We show how a thorough analysis of the trace file can show sources of potential problems long before these problems start annoying users. The method consists in applying a holistic analysis of the collected trace files and producing a report on strange behaviours of a DBMS. We verified our method in practice by applying it to two OLTP applications with dissimilar usage profiles. One of them was a web application which performs many local queries (i.e. queries always returning a small result set). The other was a thick client application with a lot of local queries and updates, but it also has some bigger reporting queries. The proposed method allowed to improve both applications significantly. The paper is organized as follows. In Section 2 we present the sketch of our method. In Section 3 we characterize the database applications used in our case study. Section 4 describes desired and undesired behaviours which can be diagnosed by our method. Section 5 presents the improvements which have been achieved as the result of applying tuning activities recommended by our method. Section 6 concludes.
2 The Method Our proposed method of database performance monitoring consists of three steps. First, we gather trace information on performed queries with statistical data like the number of rows and pages read. Second, we gather all this information in the form of histograms (one histogram for each table). Third, we clasify table histogram to one of the cases. During our research, the most challenging task was to identify these cases, find the interesting points in those histograms and check the execution plans of queries typical for each such point. It amounted that there were not so many shapes of those histograms and the main indicator is the number of "full scan" reads. In Section 4 we will present the classification of those shapes and the possible execution plan anomalies they indicated. The first step to get the number of rows read consists of collecting a database trace during the most typical operation period of a database application. SQL Server Profiler tool was used for this purpose. It was configured to catch all SQL statements and stored procedures calls in the monitored database. The trace had been saved to the database table. Here we encountered a serious problem, since the table created by SQL Server Profiler did not allow to make more detailed analysis. The trace data is very difficult for analysis. It would be necessary to parse all the statements to get more information about tables affected by this statements and then deduce the number of affected rows. The situation became more complicated, because we did not analyse individual SQL statements but groups of instructions or stored procedure
Percentages of Rows Read by Queries
19
call. Even more difficult issue was caused by a statement which fired a trigger. In his situation it would be necessary to parse all texts of related objects (stored procedures, triggers). Checking the number of rows read from the tables would be too difficult. We decided to do it another way. We’ve created a diagnostic application which uses the SQL Server query optimiser to get execution plans of all statements from the trace. In these plans we could easily find the physical data access paths (table scan, index scan, index seek) and learn the number of rows affected by each read. The application stores obtained data in simple relational database, which make further analysis much easier. This database stores information about batches, particular statements and tables affected by these statements with most inferesting value for us, i.e. the number of rows read from the table. When the trace is transformed to this form, it is possible to start the generation of histograms on the tables’ usage. We used a simple application for this. It reads all the records from the table which stores information about affected tables and using the number of affected rows and total number of rows builds the table with histogram of row reads. The last step is classifying the tables’ histogram to one of the cases basing on its shape. Characteristics of the case shows us possible tables’ anomalies. The classification is described in details in Section 4.
3 Test-Drive Applications The method has been tested on two database applications, which have Microsoft SQL Server 2005 as the backend. Both of these applications are intensively used in Polish-Japanese Institute of Information Technology. These database applications have been chosen because they represent two different, typical usage profiles. The first tested application was the main Institutes’ database system consisting of: deanery module, documents turnover module, virtual deanery, as well as the module of students payments. The majority of system modules are the Windows applications prepared for particular group of users - employees and students. Only small part of the system functionality is available on the web. The database is very often used for reporting and processing large groups of records, so we can find a lot of complex queries and stored procedures in the trace. The second system was e-learning platform used for managing Internet based studies as well as supporting of traditional learning. The system consists of many modules: content management system, discussion forum, files management module, tests, exams, lessons and others. Almost entire system functionality is available through web interface. The trace shows, that there are a lot of simple point queries and joins, what is typical in web applications. During our analysis we have omitted tables with small number of records as well as the tables which occurred in the trace sporadically. Both applications are intensively developed, so the database objects change very often and a lot of new objects are created. Databases are tuned in unsystematic way by developers and administrators.
20
P. Lenkiewicz and K. Stencel
4 Classification of Histograms We checked many queries and their execution plans from the particular points in the histogram and tried to find anomalies. Then we could classify typical histograms and describe possible anomalies for every case. As the result of analysis of histograms obtained by the proposed method, we have found the five most often occurring cases. The majority of tables could be easily classified to one of them. Every case showed or did not show some anomalies. There were also tables not adequate to these cases or showing characteristics of several cases. It can be easily noticed that the most of physical data accesses concern between 0 and 10% of the table records (mainly point queries, multi point queries, joins) as well as between 90 and 100% (full scans). Reads of number of rows from the range 10-90% are rare. It often indicates anomalies, but sometimes it occurs that they are proper, especially within the range 10-30%. The reads of this type concern mainly range queries or columns with bad selectivity. 4.1 Case 1: 100% of Operations Reading Less Than 10% Records This is the best case. The result suggests lack of anomalies regarding the table. Proper selection and use of indexes. Analysis of queries for his type of tables shows that all reads are done by the index seek physical operation. Such tables have properly designed access to the data from the application.
Fig. 1. Histogram, case 1
4.2 Case 2: Significant Predomination of Operations Reading Less Than 10% of Records, Small Amount of Reads Regarding the Whole Table (Less Than 20%), Small Amount of Reads Regarding 10-30% of Records (Less Than 20%) This case regards typical tables, intensively used in different parts of application. Significant predomination of operations reading less than 10% of records suggests the proper selection of indexes and right data access path used by the application.
Percentages of Rows Read by Queries
21
Fig. 2. Histogram, case 2
Small amount of table scan reads is normal, mainly due to execution of reporting queries. The reads regarding 10-30% may, but need not suggest anomalies. Usually they become from queries that join many tables or fetch many rows to application. Example The Student_registration table is used in deanery application very often. Record in the table contains information that a student is registered for a semester of particular type of studies. 12% of physical data access operations are full table scan reads. Some of them are reporting queries, where the full scan is necessary. The rest of them refers to queries with WHERE condition on a foreign key. All foreign keys are indexed, but some of them have small selectivity. Here is a typical query: SELECT Name, Surname, No FROM Person INNER JOIN Student ON Person.IdPerson=Student.IdPerson INNER JOIN Student_registration ON Student.IdPerson=Student_registration.IdPerson WHERE IdStudies=5 ORDER BY Surname, Name There are only about 20 types of studies, but thousands of registrations, so the selectivity is very bad. The query optimiser does not use the index on foreign key and does the full scan read. 16% of reads affect between 10 and 30% of records. These reads results from the queries which read many records to the application or join with another table, but with foreign key with better selectivity. 4.3 Case 3: Predomination of Operations Reading 100% of Records (More Than 60%) The histogram of this type suggests the lack of indexes or non-proper selection of indexes. Especially lack of indexes on foreign keys and columns occurring in WHERE clause. Analysing queries from the trace for this table type, we can observe a lot of table scan operations (lack of indexes) or clustered index scan operations (only default clustered index on the primary key) although we are dealing with point queries or joins.
22
P. Lenkiewicz and K. Stencel
Fig. 3. Histogram, case 3
Example All reads from the table Payment affect all records although there are mainly point queries and range queries. There is only one clustered index on the primary key. Typical query: SELECT Amount, Date FROM Payment WHERE IdPerson = 4537 AND Date BETWEEN ’2009-01-01’ AND ’2009-01-31’ The optimal strategy for this case is adding non clustered index on IdPerson column and clustered index on Date column. 4.4 Case 4: Predomination of Operations Reading Small Amount of Records, but Large Amount of Operations Reading 100% of Records (Up to 50%) This situation may occur in case of queries selecting data from many tables with typically created indexes on the keys: clustered index on the primary key and non clustered indexes on foreign keys. In many cases, especially regarding less selective data, the server will not use non clustered index on the foreign key but will read
Fig. 4. Histogram, case 4
Percentages of Rows Read by Queries
23
100% of records. Performance improvement may be achieved by using clustered index on a foreign key, which enforce change of the index type on the primary key to non clustered. When the table includes many foreign keys, the main problem is to find the one in which the data is less selective and create clustered index on it and the non clustered indexes for the others. The typical situation in this case is an association table with auto-numbered primary key with unused clustered index and with two foreign keys with non clustered indexes, especially when the data in the foreign key columns are badly selective. Example Group_assignment is the association between the Student_registration table and the Group table. The table consist of auto-numbered primary key with default clustered index and two foreign keys with non clustered indexes. The clustered index is never used. 38% of reads are the full scan caused by joins which use indexes with small selectivity. SELECT Name, Surname, No, GroupNo FROM Person p INNER JOIN Student s ON p.IdPerson = s.IdPerson INNER JOIN Student_registration sr ON s.IdPerson = sr.IdPerson INNER JOIN Group_assignment ga ON sr.IdStudent_registration = ga.IdStudent_registration INNER JOIN Group g ON g.IdGroup = ga.IdGroup ORDER BY GroupNo, Surname, Name
4.5 Case 5: Large Number of Full Scan Operations (More Than 15%), While Other Queries Read a Small Number of Records
This situation typically occurs when the application is badly written. A small number of full scan operations is acceptable and may be generated by reporting queries, but a larger number of full scans may indicate that the application needlessly reads entire tables, e.g. into combo boxes, lists, or data grids. In this case we can usually see that the indexes themselves are created properly.
Fig. 5. Histogram, case 5
Some tables fall into this case because of the structure of the queries issued against them. Mainly these are complex queries containing subqueries, which are difficult for the query optimiser. Rewriting such queries or stored procedures can significantly improve performance. Another example is a query that applies functions to columns in the WHERE clause, which makes the existing indexes useless. Creating a functional index or an index on a computed column helps in this case.
Example. The main module's starting screen loads all students into a data grid using the query:
SELECT Name, Surname, No
FROM Person
INNER JOIN Student ON Person.IdPerson = Student.IdPerson
ORDER BY Surname, Name
The query must read both tables with a full scan and sort the result. Most users ignore the student list loaded onto the screen at such a high cost and start their work by issuing a query for the particular student they are interested in. The full scan and external sort of the big Student table are thus performed in vain.
4.6 Other Cases
Some tables cannot be classified into any of these cases, or can be classified into several of them. During our work we found tables in which several anomalies occurred in parallel, for example the application unnecessarily reading all records combined with incorrect index selection. There were also tables that were classified incorrectly because of uncharacteristic queries in the period when the trace was collected.
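The case analysis above presupposes, for every table, a histogram of the percentage of rows read by each operation in the trace. As a rough sketch of how such a histogram can be derived once the trace has been saved to a table, assuming a hypothetical layout TraceReads(TableName, RowsRead) for the per-operation read counts and RowCounts(TableName, TotalRows) for current table cardinalities, and using coarser buckets than in the figures:

SELECT TableName, ReadBucket, COUNT(*) AS Operations
FROM (SELECT tr.TableName,
             CASE WHEN 100.0 * tr.RowsRead / rc.TotalRows >= 100 THEN '100%'
                  WHEN 100.0 * tr.RowsRead / rc.TotalRows > 30 THEN '30-100%'
                  WHEN 100.0 * tr.RowsRead / rc.TotalRows > 10 THEN '10-30%'
                  ELSE '0-10%'
             END AS ReadBucket
      FROM TraceReads AS tr
      INNER JOIN RowCounts AS rc ON rc.TableName = tr.TableName) AS b
GROUP BY TableName, ReadBucket
ORDER BY TableName, ReadBucket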
5 Achieved Improvements
The proposed method is best suited to large, intensively used and actively developed databases, where manual, unsystematic tuning is unlikely to succeed. Its big advantage is that tuning recommendations can be obtained relatively quickly, without much administrative work. The method can also be used incrementally, by adding the results of analysing later traces to the histograms. This makes it possible to analyse a database that is being developed continuously, or one that is used differently in different periods. The method does not always yield exact optimisation recommendations, so it cannot replace a database administrator, but in most cases it makes the administrator's work easier. In every case it produces the list of tables in which the number of rows read deviates from the norm. The obtained list of tables, together with the list of possible anomalies, can be used for manual tuning; it can also serve as input to other methods of automatic database tuning. Owing to the large reduction of the list of tables considered for tuning, the time required by both automatic and manual methods can be reduced substantially. Below we list the problems that have been detected.
– Lack of indexes. We detected missing indexes mainly on foreign keys and on columns used in selection criteria. The problem affected 11 tables in our test-drive applications. Adding proper indexes has a great impact on the performance of particular queries and of the whole application. Adding indexes to all detected non-indexed tables reduced the total number of physical page reads by 8% for a typical hour of database operation.
– Incorrect index selection. We found 7 tables with an improper index configuration. In many cases administrators accept the default clustered index on the primary key and use nonclustered indexes for the remaining columns. Sometimes this is a very bad strategy, mainly for foreign keys with low selectivity: the query optimiser will often not use the nonclustered index but will perform a full scan instead. Tuning all such tables gave us a 3% reduction of physical reads in a typical hour.
– Execution of heavy reporting queries during the hours of intensive database activity. Using our method, such queries can be identified and, for example, executed during the night. It is also possible to analyse the time schedule of full scan reads in order to find the optimal time for their execution. A further goal is to schedule the preparation of materialized views that help to generate the reports during the day. We identified a few reports that were good targets for such optimisation and moved part or all of their preparation to the night. This improved the system response time during peak hours and shortened the time needed to prepare some reports.
– Faultily written application. We found many places where the application reads large amounts of data unnecessarily (including the worst situation: all records of a table). This happens mainly with dictionary tables used in combo boxes, lists, data grids, etc. It was impossible to fix all these problems quickly, but fixing some of them improved the system response time and reduced the average number of physical reads per hour by 4%.
– Tables in which the number of reads is small in relation to the number of modifications. In these tables the number of indexes should be reduced, although speeding up modifications in this way will slow down queries. Deciding on the right balance between reads and writes is difficult, and good knowledge of the applications is recommended. In our test-drive applications we found a few intensively updated tables that are queried rarely. Dropping the existing indexes allowed us to save some disk space and speed up modifications (mainly insertions). We reduced the number of page writes by 1 to 4% in some hours of database activity.
– Places where creating a functional index is preferable. We found built-in functions (such as CONVERT, DATEPART, DATEADD, SUBSTRING) applied in the WHERE clause of many queries. In many such cases the query optimiser cannot use an index. To fix the problem we must rewrite the query and/or change the application, or use a functional index or an indexed computed column (a short sketch is given at the end of this section).
In some queries this allowed us to reduce the number of full scan reads and to speed up the queries significantly.
As in every trace-based tuning method, it is very important to take the workload from a typical period of database operation. If the workload contains too few reads of a table, the result for that table may not be authoritative.
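Returning to the last problem on the list above (functions in the WHERE clause), one possible shape of such a fix in SQL Server is an indexed computed column; the table, column and index names below are only illustrative:

-- assume the application filters on the year part of a date column
ALTER TABLE Payment ADD PaymentYear AS DATEPART(year, Date)
CREATE NONCLUSTERED INDEX IX_Payment_PaymentYear ON Payment (PaymentYear)
-- a predicate such as WHERE DATEPART(year, Date) = 2009 can then be rewritten as
-- WHERE PaymentYear = 2009, which the optimiser can answer from the index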
6 Conclusion
In this paper we have presented an offline approach to the analysis of database trace files. We have shown that quantitative analysis of the volumes of rows accessed in each table can indicate ill-tuned tables and queries. The case studies included in the paper show that the proposed method is effective, as it allowed us to improve the studied database applications significantly. The tuning process performed in our case studies was partly manual, with some automated tool support; however, the vast majority of the activities performed would be easy to mechanise. Future work will encompass efforts to automate the advocated approach as a new kind of database tuning advisor.
References 1. Sattler, K.U., Geist, I., Schallehn, E.: Quiet: Continuous query-driven index tuning. In: VLDB, pp. 1129–1132 (2003) 2. Schnaitter, K., Abiteboul, S., Milo, T., Polyzotis, N.: Colt: continuous on-line tuning. In: Chaudhuri, S., Hristidis, V., Polyzotis, N. (eds.) SIGMOD Conference, pp. 793–795. ACM, New York (2006) 3. Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: [6], pp. 826–835 4. Chaudhuri, S., Narasayya, V.R.: Autoadmin ’what-if’ index analysis utility. In: Haas, L.M., Tiwary, A. (eds.) SIGMOD Conference, pp. 367–378. ACM Press, New York (1998) 5. Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in sql databases. In: Abbadi, A.E., Brodie, M.L., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.Y. (eds.) VLDB, pp. 496–505. Morgan Kaufmann, San Francisco (2000) 6. Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007. IEEE (2007) 7. Bruno, N., Chaudhuri, S.: Physical design refinement: The “Merge-reduce” approach. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 386–404. Springer, Heidelberg (2006) 8. Zilio, D.C., Zuzarte, C., Lightstone, S., Ma, W., Lohman, G.M., Cochrane, R., Pirahesh, H., Colby, L.S., Gryz, J., Alton, E., Liang, D., Valentin, G.: Recommending materialized views and indexes with ibm db2 design advisor. In: ICAC, pp. 180–188. IEEE Computer Society, Los Alamitos (2004) 9. Dageville, B., Das, D., Dias, K., Yagoub, K., Zaït, M., Ziauddin, M.: Automatic sql tuning in oracle 10g. In: [14], pp. 1098–1109 (2004)
10. Agrawal, S., Chaudhuri, S., Kollár, L., Marathe, A.P., Narasayya, V.R., Syamala, M.: Database tuning advisor for microsoft sql server 2005. In: [14], pp. 1110–1121 (2004) 11. Valentin, G., Zuliani, M., Zilio, D.C., Lohman, G.M., Skelley, A.: Db2 advisor: An optimizer smart enough to recommend its own indexes. In: ICDE, pp. 101–110 (2000) 12. Shasha, D., Bonnet, P.: Database tuning: principles, experiments, and troubleshooting techniques. Morgan Kaufmann Publishers Inc, San Francisco (2003) 13. Lightstone, S.S., Teorey, T.J., Nadeau, T.: Physical Database Design: the database professional’s guide to exploiting indexes, views, storage, and more. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers Inc, San Francisco (2007) 14. Nascimento, M.A., Özsu, M.T., Kossmann, D., Miller, R.J., Blakeley, J.A., Schiefer, K.B. (eds.) (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004. Morgan Kaufmann, San Francisco (2004)
Question Semantic Analysis in Vietnamese QA System Tuoi T. Phan, Thanh C. Nguyen, and Thuy N.T. Huynh CSE Faculty, HCMC University of Technology, 268 Ly Thuong Kiet, HCMC, Vietnam {tuoi,thanh}@cse.hcmut.edu.vn,
[email protected]
Abstract. Question semantic analysis is an important step in Question Answering (QA), especially for Vietnamese questions. The paper introduces our proposed model for a Vietnamese QA system (VQAS) and an approach to Vietnamese question analysis. The model first carries out syntactic and semantic analysis of Vietnamese queries and then outputs a set of tuples related to the information in the VQAS ontology that the queries intend to retrieve. Given the features of Vietnamese, analysing the semantics of Vietnamese questions plays an important role in finding the correct result for a question. Therefore, the paper focuses on the semantic analysis and processing of Vietnamese queries. The model has performed syntactic analysis and semantic processing of hundreds of Vietnamese queries of the Yes/No, WH and selective kinds, related to an ontology of the computer science area. Keywords: Vietnamese question, question analysis, QA, Dependent Grammar.
1 Introduction
In recent years there has been much research on Question Answering in English. However, the syntactic and semantic analysis of English queries is not handled completely in this work; the processing is still largely based on matching the syntactic structure of a query against given samples. There are also projects such as S-CREAM [2] and MnM [3] that use machine learning techniques to retrieve relations between objects, but those techniques are only semi-automatic. The authors of [4] used conceptual graphs to represent relations between entities. AquaLog [6] is a QA system for English; it processes the semantics of English queries, but all of its queries are unambiguous. We use the Dependent Grammar [5][8] for the syntactic and semantic analysis of Vietnamese queries. In this grammar, the structure of a sentence is represented by the functions of its parts: agent, action and theme. The parse tree of a sentence has a root node which is the tagged verb (action), and its two branches are the phrases representing the agent and the theme (Section 3.2). Our approach is thus to use the Dependent Grammar to analyse the semantics of natural language questions and to disambiguate them with a corrected specification of the relations between entities in the relevant ontology. To date, this is one of the new approaches not only in the Question
Answering field but also in the Natural Language Processing field; it is applied here to Vietnamese questions, as introduced in the sections below. Another approach, [12], analyses the structure of questions based on grammar rules and Logical Form to build a world-model taxonomy and syntactic-semantic interpretation rules. That helps much for Italian and Danish questions but cannot resolve the more complex kinds of question in either English or Vietnamese. The works [1], [11], [13] and [14] focused on question analysis, but they did not apply the Dependent Grammar and therefore cannot recognize semantic relations among the parts of a question. Finally, Salis et al. [15] analysed English WH-questions in detail to investigate Avrutin's hypothesis over a range of WH-questions; however, as in the above cases, [15] did not base its processing on the Dependent Grammar. In this paper, our proposed VQAS model is introduced in Section 2 to provide an overview of the system. Section 3 is the most important part of the paper, since it introduces our analysis of the kinds of Vietnamese questions and their reorganization. Section 4 summarizes the experimental results for Vietnamese questions. The last section gives the conclusion and our future work.
2 A Model of a Vietnamese Question Answering System (VQAS)
To develop a QA system for Vietnamese, we introduce the model shown in Fig. 1 below. It consists of three main modules and an ontology named Vietnamese KB (VKB for short).
Fig. 1. The processing model of Vietnamese query in VQAS
The first module performs semantic analysis of the Vietnamese question in several steps: (a) word segmentation, (b) POS tagging, (c) analysis of the question according to its kind, and (d) representation of the analysed question as structured tuples and a dependency (syntax) tree. Depending on the form of the question, the result of (d) can be one or several linguistic tuples; a linguistic tuple groups the subject, object, action and theme, as described in Section 3. Based on this output, the second module looks up candidates (knowledge tuples) in the VKB that are similar to the linguistic tuples and finds relationship mappings between them, so that VQAS can determine answer candidates for the initial question. The last module selects the most appropriate candidates and generates answers in natural Vietnamese, which is friendlier
and easier for users to understand. This is an important feature of the proposed VQAS. At present, the paper focuses only on the first module, which analyses the Vietnamese question. This module takes as input a question in natural Vietnamese and returns a list of linguistic tuples, which are passed to the subsequent processing steps of the second module.
3 Syntax Analysis and Semantic Processing of Vietnamese Questions
In this section, the first part systematizes the kinds of questions in both English and Vietnamese, illustrating different Vietnamese variants for each English question. The second part introduces our analysis of those question kinds based on the Dependent Grammar.
3.1 Basic Forms of Vietnamese Questions
Vietnamese linguists have classified Vietnamese sentences by various criteria besides syntactic structure [1]. Based on the properties of Vietnamese questions and the means used to express them, Vietnamese questions are classified as:
(i) Yes/No question
(ii) WH-question
(iii) Alternative question
(iv) Tag question
The paper focuses to analyze the syntactic and semantics of Vietnamese question forms (i), (ii) and (iii). (i) Yes/No question: The Vietnamese Yes/No question uses question words in different positions in the sentence (… phải không?) or (có phải … không?) or (… có phải … không?) a) Relating two classes of objects: According to relationships between objects (Fig.2), there are able to some below question forms. The sentence of the active form has the syntax structure Subject/agent–Verb/action–Object/theme, or Subject/theme–Verb/action–Object/agent in passive form. Below list provides lexicons of Vietnamese terms and their relevant English terms with V is brief for Vietnamese and E for English:
Fig. 2. Objects and their relations
tác giả (V) ≈ author (E); nhà xuất bản (V) ≈ publisher (E); năm (V) ≈ year (E); phát hành (V) ≈ publish (E); xuất bản (V) ≈ release (E); được phát hành (V) ≈ published by (E)
There are three forms of the kind of question as below: * Form 1: Subject/agent–Verb/action?–Object/theme With sample question “Is Mr./Ms./ɸJohn an author of the Compiler book?”, there are different kinds of Vietnamese questions such as: - Ông/bà/ɸJohn là tác giả của cuốn Compiler phải không? - Ông/bà/ɸJohn có phải là tác giả của cuốn Compiler không? - Có phải ông/bà/ɸJohn là tác giả của cuốn Compiler không?
These questions have the same meaning but their interrogative in bold text are in different positions. * Form 2: Subject/theme– Verb/action?–Object/agent From the question “Is the Compiler book Mr./Ms./ɸJohn’s?”, there are below kinds of Vietnamese questions: - Compiler là của ông/bà/ɸJohn phải không? - Compiler có phải là của ông/bà/ɸJohn không?
As same as above form, two Vietnamese questions are same meaning even if having interrogative (bold text) in other positions. * Form3: Object/theme–Subject/agent –Verb/action? Given English question “Did the Compiler book/ɸ KD publisher release?” has two kinds of Vietnamese questions: - Cuốn/ɸCompiler là do nhà xuất bản KD phát hành phải không? - Cuốn/ɸCompiler là do nhà xuất bản KD phát hành?
As above forms, the meaning of these questions are same but their interrogative (bold text) in other positions or missing. b) Relating three classes of objects
Fig. 3. Relations of one object with the others
The question “Did John write the book Compiler in year/around year/year 1992?” has different kinds of Vietnamese questions such as: - John viết cuốn Compiler trong năm/khoảng năm/năm 1992 phải không? -Trong năm/khoảng năm/năm 1992 ông/bà/ɸJohn viết cuốn Compiler phải không?
Their meanings are the same even though their parts appear in different positions.
(ii) WH-Question a) Relating two classes of objects * Form 4: Subject/agent–Verb/action–Object/theme? The Vietnamese question “Ông/bà/ɸJohn là tác giả của những quyển sách nào?” has same meaning with English question “What books have author named Mr./Ms./ɸJohn?” * Form 5: Subject/agent?–Verb/action–Object/theme The question “Who is an author of the Compiler book/ɸ?” has the relevant Vietnamese question “Tác giả của cuốn/ɸCompiler là ai?” with interrogative in bold. * Form 6: Object/theme?–Subject/agent–Verb/action The question “Which book belongs to KD publisher?” is same meaning with Vietnamese questions “Những quyển sách nào là của nhà xuất bản KD?” with interrogative in bold text. * Form 7: Object/Theme–Verb/Action–Subject/Agent? The question “Which publisher does the Compiler book/ɸbelong to?” and Vietnamese question “Cuốn Compiler là của nhà xuất bản nào?” are same meaning, here Vietnamese interrogative in bold text according to “Which” in English. * Form 8: Object/Theme– Subject/Agent?– Verb?Action Given question “Who wrote the Compiler book?” has below kinds of Vietnamese questions “Cuốn Compiler là do ai viết?” or “Cuốn Compiler do ai viết?”. b) The questions relate to three classes of objects The questions in below form relate to three classes: Nhà xuất bản/Publisher – Sách/Book – Tác giả/Author. * Form 9: Subject/agent–Verb/action–Object/theme– Indirect_Object/co_theme The question “Which books of John did the KD publisher release?” is same meaning with Vietnamese question “Nhà xuất bản KD phát hành những quyển sách nào của John?” The questions in below forms relate to three classes: Tác giả/Author – Sách/Book – Năm/Year. *
Form 10: Subject/agent–Verb/action–Object/theme–Indirect_Object/co_theme?
There are two Vietnamese questions in same meaning with the question “What year did John write the Compiler book?”: - John viết cuốn Compiler năm nào? - Năm nào John viết cuốn Compiler?
They have different position of interrogative (bold) but still keeping same meaning. * Form 11: Object/theme?–Subject/agent–Verb/action– Indirect_Object/co_theme The question “What books were written by John in 1992?” is same meaning with “Những cuốn sách nào được John viết năm 1992?” with interrogative in bold. * Form 12: Object/theme–Indirect_Object/co_theme–Verb/action–Subject/agent? The Vietnamese question “Ai là tác giả của cuốn Compiler xuất bản năm 1992?” is same meaning with “Who is the author of the Compiler book published in 1992?”. The below questions relate to three classes: Nhà xuất bản/Publisher – Sách/Book – Năm/Year * Form 13: Object/theme–Indirect_Object–Verb/action–Subject/agent? The question “What publisher does the Compiler book published in 1992 belong to?” and “Cuốn Compiler xuất bản năm 1992 là của nhà xuất bản nào?” are same meaning. * Form 14: Subject/theme?–Object/agent–Verb/action–Indirect_Object There is Vietnamese question “Những cuốn sách nào được nhà xuất bản KD phát hành năm 1992?” translated from English question “What books were released by KD publisher in 1992?”. * Form 15: Subject/agent–Verb/action–Object/Indirect_Object/co_theme? The question “What year did KD publisher release the Compiler book in?”, has below Vietnamese questions: - Nhà xuất bản KD phát hành cuốn Compiler vào năm nào? - Năm nào nhà xuất bản KD phát hành cuốn Compiler?
They have the same meaning even though the positions of their interrogatives differ.
(iii) Alternative and combinative questions. For the sample question "Is John or Ullman the author of the Compiler book?" there is the Vietnamese question "John hay Ullman là tác giả của cuốn sách Compiler?", where the bold texts are the Vietnamese lexicons of the corresponding English terms.
3.2 Syntax Analysis and Semantic Processing of the Vietnamese Questions
There are two steps of Vietnamese question analysis, as follows.
(i) Preprocessing. A Vietnamese question, for instance "Ông Aho là tác giả của cuốn Compiler phải không? / Is Mr. Aho an author of the Compiler book?", is processed by the following steps: word segmentation, tagging, and then looking up the list of synonyms to determine which synonym sets the words of the sentence belong to.
For the above instance, after checking the synonym sets, the word "ông/Mr." belongs to the synonym set N_tacgia (N_author). The set N_tacgia includes the synonyms "bà/(Mrs., Ms.)", "tác giả/author", "người viết/writer", … The word "cuốn/book" belongs to the set N_tacpham (N_work), which includes "cuốn", "cuốn sách", "quyển sách", "quyển/book", "tác phẩm/work", "bài báo/article", … The term "phải không" belongs to the set of interrogatives of Yes/No questions (tdh_phaikhong), which includes "phải không?" and "có phải … không?". The list of synonyms also has other sets, such as V_phathanh (V_publishing), which includes "in/print", "xuất bản/publish", "phát hành/issue", and V_viet, which includes "viết/write", "biên soạn/compile", "sáng tác/compose", …
(ii) Syntax analysis and semantic processing. The output of preprocessing is the input of the syntax analysis and semantic processing phase. Finally, the first module (in Fig. 1) outputs linguistic tuples (groups of objects). For the instance "Aho viết cuốn Compiler phải không?/Did Aho write the Compiler book?", a Yes/No question, the output of syntax analysis and semantic processing is a linguistic tuple grouping the agent (Aho, of class N_tacgia), the action (viết/write) and the theme (Compiler, of class N_tacpham). For the WH-question "Ai viết cuốn Compiler?/Who wrote the Compiler book?", the output is an analogous tuple in which the agent is the interrogative to be resolved. The processing methods of the two question forms are identical. The processing steps for Yes/No questions are described as follows. Given the question "Aho viết cuốn Compiler phải không?/Did Aho write the Compiler book?", the preprocessor outputs the list of words with their tags: "Aho/Ne, viết/V_viet, cuốn/N_tacpham, Compiler/Ne, phải không/tdh_phaikhong". After that, the parser recognizes the question form (a Yes/No question, because the tag tdh_phaikhong is present) and outputs the dependency tree, whose leaves are the words with their semantic features. Fig. 4 shows the dependency (syntax) tree of the sentence "Aho viết cuốn Compiler phải không?/Did Aho write the Compiler book?". To output the linguistic tuple, the semantic analyzer has to identify the proper names in the sentence (Aho, Compiler) and determine which class of objects (author, work, publisher) they belong to. For this it is necessary to consider the complement that is proper to each class of objects; for instance, for the class of authors (N_tacgia) the complements are "tác giả", "ông", "bà" and "người viết", and for the class of works (N_tacpham) they are "cuốn", "tác phẩm", "bài báo", "sách" and "cuốn sách". The semantic analyzer can also use the verb of the sentence to classify a proper name. In the above sentence, the verb "viết/write" is useful for classifying the proper names (tag Ne): the proper name before "viết" belongs to the class of authors (N_tacgia) and the one after "viết" to the class of works (N_tacpham). The semantic features tagged to the leaves of the tree are given between "[" and "]" signs. The symbol "N_lớp?" asks which class of objects Ne belongs to, "N_lớp" means that N_lớp is before the correlative node on the tree, and the symbol "*" marks the position of the node in the syntactic structure. Through the analysis of the Vietnamese questions
presented in Section 3.1, we derive the following method for handling the semantics of the question forms. The verb or the possessive word ("của/of", "thuộc/belong") is used as a focus to determine the function of the other parts of the sentence.
Fig. 4. The dependent tree of the sentence “Aho viết cuốn Compiler phải không?/Did Aho write the Compiler book?”
The semantics is now handled by first classifying the proper names. The relevant nodes on the tree are the proper names, tagged Ne. Each Ne node is processed in succession by the following steps. At every step, if the specific class of objects of an Ne is determined, the node is marked as classified, so that it is not considered again in the next steps.
Step 1: Consider the word W_x immediately before Ne.
a) The form "cuốn Compiler/book Compiler", "ông Aho/Mr. Aho", "nhà xuất bản Printice Hall/publisher Printice Hall". If W_x is in one of the synonym sets N_tacpham ("cuốn"/book), N_tacgia ("ông"/Mr.) or N_nxb ("nhà xuất bản"/publisher), then the class of Ne is specified immediately. For instance, if W_x is "cuốn" then "Compiler/Ne" is in the class N_tacpham; if W_x is "ông/Mr." then "Aho/Ne" belongs to N_tacgia.
b) The sentence "Compiler có tác giả là Aho phải không?". If W_x is the verb "là/is", then the span from the verb "là" back to the proper name before "là" (marked Ne_pre, in this case "Compiler"/Ne_pre) is examined; if a word of one of the classes N_tacgia, N_tacpham or N_nxb appears there, then the function of the Ne after W_x is specified immediately. For instance, if the word "tác giả/author" occurs, then Ne is in the class N_tacgia; in the sentence above, "Aho/Ne" is in N_tacgia.
Step 2: Consider the word W_x after Ne, in the span between Ne and the next proper name (Ne_next). If Ne is the last Ne in the sentence, the semantic handler considers the span from Ne to the last word of the sentence.
a) The form "Compiler của Aho phải không?/Is the Compiler book Aho's?" or "Compiler thuộc nhà xuất bản Printice Hall phải không?/Does the Compiler book belong to the Printice Hall publisher?". If the W_x after Ne ("Compiler"/Ne) is the possessive word "của/of" or "thuộc/belong to", then Ne is in the class of works (N_tacpham).
b) The form "Aho là tác giả cuốn Compiler phải không?/Is Aho an author of the Compiler book?". If W_x is the verb "là/is" and after W_x there is a word of one of the classes N_tacpham, N_tacgia or N_nxb, then the class of Ne is specified. In the sentence above, "Aho" is in the class of authors (N_tacgia), because after "Aho" comes the word "là/is" and next comes "tác giả/author".
Step 3: If the word before Ne is the possessive word "của/of", then consider the span between Ne_pre and Ne. This is the form "Aho là tác giả của Compiler phải không?/Is Aho an author of the Compiler book?". If before "của/of" there is a word of the class of publishers (N_nxb) or of the class of authors (N_tacgia), then Ne belongs to the class of works (N_tacpham). If before "của/of" there is a word of the class of works, then Ne belongs to the class of publishers (N_nxb) or to the class of authors (N_tacgia). Fig. 5 demonstrates the semantic processing of the dependency tree of Fig. 4.
Fig.5. The parsed tree of Fig.4 is handled by semantics
In the resulting tree, the dotted links provide references to the definite agents, objects or actions for the interrogative classes (marked with a "?" sign). In this way, there are links between the tuples of a parsed (analysed) question that contains dependent phrases.
4 Evaluation
Once VQAS has specified the classes of the proper names, producing the linguistic tuple is easy. Below, some instances of the processing of Vietnamese questions and the resulting linguistic tuples are given (Fig. 1 and Fig. 2). The verb and the possessive words do not appear in the linguistic tuple: the semantic processor uses them only to determine the classes of the objects with respect to the ontology and to transform the dependency tree into the semantic tree with semantic features (Fig. 4 and Fig. 5), so they are not needed in the tuple itself.
In the initial experiment with the Question Analysis module, the first step was to implement the module in the Java programming language. In the source code, the class gate.creole.SerialAnalyserController from the GATE library [9] is used to create a corpus pipeline object; other linguistic resources from the library, such as gate.creole.Transducer with rules developed in the JAPE grammar, are then added to this object to build the processing pipeline. The pipeline processes the user's question step by step to recognize its parse and its linguistic tuples. The second step of the experiment was to build the test data set. In the initial phase, the data set was collected from two sources:
- The first source is documents of the HCMC University of Technology library, about 127 Vietnamese questions, focused on the digital library domain and covering the topics of Author, Publisher and Book mentioned in Section 3.1.
- The second source is the VN-Express newspaper (www.vnexpress.net), from which a list of about 4130 Vietnamese questions was extracted, covering many topics outside the digital library field.
All of these questions were categorized into the sixteen forms of the three question kinds, namely Yes/No questions, WH questions and alternative questions (see Section 3.1). The experiment focuses on the MAP (precision) measure for this data set; the results are summarized in the table below. The recall measure is out of the scope of this evaluation.

Table 1. Summary of question analysis

Question form                        # Question   Question Ratio   MAP
Group1: Yes/No question                     128            3.01%   87.00%
Group2: WH & alternative questions         4129           96.99%   71.00%
The evaluated data are also illustrated in Fig. 6.
Fig. 6. The summary result of question analysis (bar chart of Question Ratio and MAP for Group1 and Group2)
From the evaluation results, the MAP obtained for Group2 is only 71%, depending on the complexity of the question forms. This shows that our question analysis approach must be improved further, with different solutions for the kinds and sub-kinds of questions in Group2, so that it can return more accurate results.
5 Conclusion
The paper introduced the proposed VQAS model in Section 2, which gives an overview of the processing mechanism of the system and of the role of each module in analysing a question, recognizing answer candidates and generating the best answer for the user.
The core of the paper is the proposed approach to Vietnamese question semantic analysis for the first module, presented in Section 3. The section first provides a detailed analysis of the kinds of Vietnamese questions and their relevant forms, with both English and Vietnamese example questions, and then describes the methodology for performing syntactic analysis and semantic processing of a user's question. The approach provides a heuristic to recognize the semantics of a question and to build a dependency syntax tree with dynamic links between its nodes serving as semantic references. This is an important basis for developing the processing steps of the other modules of the model. The initial experiment achieved a high level of precision for the small group of Vietnamese Yes/No questions and a medium level of precision for the large group of the other question kinds. This shows that several improvements of the approach are needed in further research to obtain higher precision, but it also shows that the proposed approach is feasible for question semantic processing and for building VQAS.
Acknowledgments. We would like to thank the Vietnam national KC science committee and all members of the BK-NLP group of HCMC University of Technology, Vietnam, for their enthusiastic collaboration. This work was supported by the group's key project "Research and development of some multimedia information retrieval systems supporting Vietnamese".
References [1] Vuong, T.D.: Discussions of comparison between English-Vietnamese question kinds. [online], http://www.tckt.edu.vn/uploads/noisan/47/ J-Dao-R-24-27.doc [2] Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM-Semi-automatic CREAtion of Metadata. In: Proceeding of the 13th International Conference on Knowledge Engineering and Management, Siguenza, Spain, October 1–4 (2002) [3] Vargas–Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., Ciravegna, F.: MnM: Ontology driven semi–automatic supp for semantic markup. In: Proceeding of the 13th International Conference on Knowledge Engineering and Management, Siguenza, Spain, October 1–4 (2002) [4] Hong, T.D., Cao, H.T.: Auto-translating query to concept graph. The Vietnam journal of Informatics and Cybernetics 23(3), 272–283 (2007) [5] Mírovsky, J.: Netgraph Query Language for the Prague Dependency Treebank 2.0. In: The Prague bulletin of Mathematical Linguistics, December 2008, vol. (90), pp. 5–32 (2008) [6] Lopez, V., Uren, V., Motta, E., Pasin, M.: AquaLog: An ontology – driven question answering system for organizational semantic intranets. Journal of Web Semantics (March 31, 2007) [7] Nguyen, Q.C., Phan, T.T.: A Hybrid Approach to Vietnamese Part–Of–Speech Tagging. In: The Proceedings of 9th International Oriental COCOSDA 2006 Conference, Malaysia, pp. 157–160 (2006) [8] Nivre, J.: Dependency Grammar and Dependency Parsing, [online] http://vxu.se/msi/nivre/papers/05133.pdf
[9] GATE (A General Architecture for Text Engineering), [online] http://gate.ac.uk [10] Lita, L.V., Hunt, A.W., Nyberg, E.: Resource Analysis for Question Answering. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain (2004) [11] Sable, C., Lee, M., Zhu, H.R., Yu, H.: Question Analysis for Biomedical Question Answering. In: AMIA 2005 Symposium Proceedings (2005) [12] Paggio, P., Hansen, D.H., Basili, R., Pazienza, M.T., Zanzotto, F.M.: Ontology-based question analysis in a multilingual environment: the MOSES case study. In: Proceedings of Workshop OntoLex 2004, held jointly with LREC 2004 LISBON, Portugal (May 2004) [13] Ortega, F.V.: A Preliminary Analysis of Yes / No Questions in Glasgow English. In: Proceedings of Speech Prosody, Aix-en-Provence, France, pp. 683–686 (2002) [online] http://www.isca-speech.org/archive/sp2002/sp02_683.html [14] Metzler, D., Croft, W.B.: Analysis of Statistical Question Classication for Fact-based Questions. Journal of Information Retrieval 8(3), 481–504 (2005) [15] Salis, C., Edwards, S.: Comprehension of wh-questions in agrammatism: a single-case study. Reading Working Papers in Linguistics 8, 219–233 (2005)
Ontology-Based Query Expansion with Latently Related Named Entities for Semantic Text Search Vuong M. Ngo and Tru H. Cao Faculty of Computer Science and Engineering Ho Chi Minh City University of Technology Viet Nam [email protected], [email protected]
Abstract. Traditional information retrieval systems represent documents and queries by keyword sets. However, the content of a document or a query is mainly defined by both the keywords and the named entities occurring in it. Named entities have ontological features, namely their aliases, classes, and identifiers, which are hidden from their textual appearance. Besides, the meaning of a query may imply latent named entities that are related to the apparent ones in the query. We propose an ontology-based generalized vector space model for semantic text search. It exploits ontological features of named entities and their latently related ones to reveal the semantics of documents and queries. We also propose a framework that combines different ontologies to exploit their complementary advantages for semantic annotation and searching. Experiments on a benchmark dataset show that our model achieves better search quality than the other models.
1 Introduction
With the explosion of information on the World Wide Web and the emergence of e-societies where electronic documents are a key means of information exchange, Information Retrieval (IR) keeps attracting much research effort as well as social and industrial interest. There are two types of search in IR:
1. Document Retrieval: A user provides a search engine with a word, a phrase or a sentence to look for desired documents. The answer documents do not need to contain the terms of the user's query and can be ranked by their relatedness to the query. This type of searching was referred to as Navigational Search in [14].
2. Question and Answering: A user provides a search engine with a phrase or sentence to look for objects, rather than documents, as answers to the query. This type of searching was referred to as Research Search in [14].
In practice, answer objects obtained from a Question and Answering search engine can be used to search further for documents about them ([11]). Our work here is about Document Retrieval that uses objects related to a query to direct the search.
Current search engines like Yahoo and Google mainly use keywords to search for documents. Much semantic information of documents or user's queries is lost when they are represented by only ‘bags of words’. Meanwhile, people often use named entities (NE) in information search. Specifically, in the top 10 search terms by YahooSearch1 and GoogleSearch2 in 2008, there are respectively 10 and 9 ones that are NEs. Named entities are those that are referred to by names such as people, organizations, and locations ([22]) and could be described in ontologies. The precision and recall measures of an IR system could be improved by exploiting ontologies. For Question and Answering, [17] proposed methods to choose a suitable ontology among different available ontologies and choose an answer in case of having various answers from different ontologies. Meanwhile, [25] presented a method to translate a keyword-based query into a description logic query, exploiting links between entities in the query. In [6], the targeted problem was to search for named entities of specified classes associated with keywords in a query, i.e., considering only entity classes for searching. Recently, in [15] the query was converted into SPARQL (a W3C standard language for querying RDF data) and the results were ranked by a statistical language model. For Document Retrieval, the methods in [2] and [20] combined keywords with only NE classes, not considering other features of named entities and combinations of those features. In [5] and [11], a linear combination of keywords and NEs was applied, but a query had to be posted in RDQL to find satisfying NEs before the query vector could be constructed. Meanwhile, [12] proposed to enhance the content description of a document by adding those entity names and keywords in other documents that co-occurred with the entity names or keywords in that document. In [16], it was showed that normalization of entity names improved retrieval quality, which is actually what we call aliases here. As other alternative approaches, [26] and [9] respectively employed Wordnet and Wikipedia to expand query with related terms. Our motivation and focus in this work is to expand a query with the names entities that are implied by, or related to, those in the query, which were not discovered in previous works. For example, given the query to search for documents about “earthquakes in Southeast Asia”, documents about earthquakes in Indonesia or Philippines are truly relevant answers, because the two countries are part of Southeast Asia. Such named entities having relations with ones in a query are defined in an ontology being used. Intuitively, adding correct related named entities to a query should increase the recall while not sacrificing the precision of searching. In this paper, we propose a new ontology-based IR model with two key ideas. First, the system extracts latently related named entities from a query to expand it. Second, it exploits multiple ontologies to have rich sources of both NE descriptions and NE relations for semantic expansions of documents and queries. Section 2 introduces the generalized Vector Space Model adapted from [4] combining keywords with different ontological features of named entities, namely, name, class, identifier, alias, and super-class. Section 3 describes the proposed system architecture and the methods to extract related named entities and to expand 1 2
http://buzz.yahoo.com/yearinreview2008/top10/ http://www.google.com/intl/en/press/zeitgeist2008/
documents and queries. Section 4 presents an evaluation of the proposed model and a discussion of the experimental results in comparison with other models. Finally, Section 5 gives some concluding remarks and suggests future work.
2 A Generalized Vector Space Model Textual corpora, such as web pages and blogs, often contain named entities, which are widely used in information extraction, question answering, natural language processing, and mentioned at Message Understanding Conferences (MUC) in 1990s ([19]). For example, consider the following passage from BBC-News3 written on Friday, 19 December 2008: “The US government has said it will provide $17.4bn (£11.6bn) in loans to help troubled carmakers General Motors and Chrysler survive. [...] GM Chief Executive Rick Wagoner said his company would focus on: fully and rapidly implementing the restructuring plan that we reviewed with Congress earlier this month.” Here, US, General Motors, Chrysler, GM and Rick Wagoner are named entities. Each NE may be annotated with its occurring name, type, and identifier if existing in the ontology of discourse. That is, a fully recognized named entity has three features, namely, name, type, and identifier. For instance, a possible full annotation of General Motors is the NE triple (“General Motors”, Company, #Company_123), where GM and General Motors are aliases of the same entity whose identifier is #Company_123. Due to ambiguity in a context or performance of a recognition method, a named entity may not be fully annotated or may have multiple annotations. For instance, Rick Wagoner should be recognized as a person, though not existing in the ontology, hence its identifier is unknown. As a popular IR model, the Vector Space Model (VSM) has advantages as being simple, fast, and with a ranking method as good as large variety of alternatives ([1]). However, with general disadvantages of the keyword based IR, the keyword based VSM is not adequate to represent the semantics of queries referring to named entities, for instances: (1) Search for documents about commercial organizations; (2) Search for documents about Saigon; (3) Search for documents about Paris City; (4) Search for documents about Paris City, Texas, USA. In fact, the first query searches for documents containing named entities of the class Commercial Organization, e.g. NIKE, SONY, …, rather than those containing the keywords “commercial organization”. For the second query, target documents may mention Saigon City under other names, i.e., the city’s aliases, such as Ho Chi Minh City or HCM City. Besides, documents containing Saigon River or Saigon University are also suitable. In the third query, users do not expect to receive answer documents about entities that are also named “Paris”, e.g. the actress Paris Hilton, but are not cities. Meanwhile, the fourth query requests documents 3
http://news.bbc.co.uk/
about a precisely identified named entity, i.e., the Paris City in Texas, USA, not the one in France. Nevertheless, in many cases, named entities alone do not represent fully the contents of a document or a query. For example, given the query “earthquake in Indonesia”, the keyword “earthquake” also conveys important information for searching suitable documents. Besides, there are queries without named entities. Hence, it needs to have an IR model that combines named entities and keywords to improve search quality. In [4], a generalized VSM was proposed so that a document or a query was represented by a vector over a space of generalized terms each of which was either a keyword or an NE triple. As usual, similarity of a document and a query was defined by the cosine of the angle between their representing vectors. The work implemented the model by developing a platform called S-Lucene modified from Lucene4. The system automatically processed documents for NE-keyword-based searching in the following steps: 1. Removing stop-words in the documents. 2. Recognizing and annotating named entities in the documents using KIM5. 3. Extending the documents with implied NE triples. That is, for each entity named n possibly with class c and identifier id in the document, the triples (n/*/*), (*/c/*), (n/c/*), (alias(n)/*/*), (*/super(c)/*), (n/super(c)/*), (alias(n)/c/*), (alias(n)/ super(c)/*), and (*/*/id) were added for the document. 4. Indexing NE triples and keywords by S-Lucene. Here alias(n) and super(c) respectively denote any alias of n and any super class of c in the ontology and knowledge base of discourse. A query was also automatically processed in the following steps: 1. Removing stop-words in the query. 2. Recognizing and annotating named entities in the query. 3. Representing each recognized entity named n possibly with class c and identifier id by the most specific and available triple among (n/*/*), (*/c/*), (n/c/*), and (*/*/id). However, [4] did not consider latent information of the interrogative words Who, What, Which, When, Where, or How in a query. For example, given the query "Where was George Washington born?", the important terms are not only the NE George Washington and the keyword “born”, but also the interrogative word Where, which is to search for locations or documents mentioning them. The experiments on a TREC dataset in [21] showed that mapping such interrogative words to appropriate NE classes improved the search performance. For instance, Where in this example was mapped to the class Location. The mapping could be automatically done with high accuracy using the method proposed in [3]. Table1 gives some examples on mapping interrogative words to entity types, which are dependent on a query context. 4 5
http://lucene.apache.org/ http://www.ontotext.com/kim/
Table 1. Mapping interrogative words to entity types

Interrogative Word   NE Class         Example Query
Who                  Person           Who was the first American in space?
Who                  Woman            Who was the lead actress in the movie "Sleepless in Seattle"?
Which                Person           Which former Ku Klux Klan member won an elected office in the U.S.?
Which                City             Which city has the oldest relationship as a sister-city with Los Angeles?
Where                Location         Where did Dylan Thomas die?
Where                WaterRegion      Where is it planned to berth the merchant ship, Lane Victory, which Merchant Marine veterans are converting into a floating museum?
What                 CountryCapital   What is the capital of Congo?
What                 Percent          What is the legal blood alcohol limit for the state of California?
What                 Money            What was the monetary value of the Nobel Peace Prize in 1989?
What                 Person           What two researchers discovered the double-helix structure of DNA in 1953?
When                 DayTime          When did the Jurassic Period end?
How                  Money            How much could you rent a Volkswagen bug for in 1966?
3 Ontology-Based Query Expansion
3.1 System Architecture
The proposed system architecture for semantic text search is shown in Figure 1. It has two main parts. Part 1 is the generalized VSM search system implemented in [21]. Part 2 is the query expansion module, the focus of this paper, which adds implied, i.e. latently related, named entities to a query before searching. The NE Recognition and Annotation module extracts and embeds NE triples in a raw text. The text is then indexed by the contained NE triples and keywords and stored in the Extended NE-Keyword-Annotated Text Repository. Meanwhile, the InterrogativeWord-NE Recognition and Annotation module extracts and embeds the most specific NE triples in the extended query and, if an interrogative word exists, replaces it by a suitable class. Semantic document search is performed via the NE-Keyword-Based Generalized VSM module. An ontology is a formal description of the classes, entities, and relations that are assumed to exist in a world of discourse ([13], [10]). Since no single ontology is rich enough for every domain and application, merging or combining multiple ontologies is a reasonable solution ([7]). Specifically, on the one hand, our model needs an ontology with a comprehensive class catalog, a large entity population, rich entity descriptions, and an efficient accompanying NE recognition engine, for annotating documents and queries.
Fig. 1. System architecture for semantic text search
On the other hand, it needs one with many relations between entities, for expanding queries with latently related named entities. In this work we employ KIM ([18]) as Ontology_1 in the system architecture illustrated above, as an infrastructure for automatic NE recognition and semantic annotation of documents and queries. The KIM ontology is an upper-level ontology containing about 250 concepts and 100 attributes and relations. The KIM Knowledge Base (KB) contains about 77,500 entities with more than 110,000 aliases. NE descriptions are stored in an RDF(S) repository. Each entity has information about its specific type, aliases, and attributes (i.e., its own properties or relations with other named entities). However, the KIM ontology defines only a small number of relations. Therefore, we employ YAGO (Yet Another Great Ontology) ([23], [24]), which is rich in assertions of relations between named entities, as Ontology_2 in the system. It contains about 1.95 million entities, 93 different relation types, and 19 million facts that are specific relations between entities. The facts are extracted from Wikipedia and combined with WordNet using information extraction rules and heuristics. New facts are verified by the YAGO core checker before being added to the knowledge base, so the correctness of the facts is about 95%. In addition, with its logical extraction techniques and flexible architecture, YAGO can be further extended in the future. Note that, to obtain more relation types and facts, we could employ and combine some other ontologies as well.
3.2 Query Expansion
Figure 2 shows the main steps of our method to determine latently related entities for a query:
Fig. 2. The steps determining latently related entities for a query: Recognizing Relation Phrases → Determining Relations → Recognizing Entities → Determining Related Entities
1. Recognizing Relation Phrases: Relation phrases are prepositions, verbs, and other phrases representing relations, such as in, on, of, has, is, are, live in, located in, was actress in, is author of, was born. We implement relation phrase recognition using the ANNIE tool of GATE ([8]).
2. Determining Relations: Each relation phrase recognized in step 1 is mapped to a corresponding relation in Ontology_2 by a manually built dictionary. For example, "was actress in" is mapped to actedIn, "is author of" is mapped to wrote, and "nationality is" is mapped to isCitizenOf.
3. Recognizing Entities: Entity recognition is implemented with OCAT (Ontology-based Corpus Annotation Tool) of GATE.
4. Determining Related Entities: Each entity that has a relation determined in step 2 with an entity recognized in step 3 is added to the query.
In the scope of this paper, we consider expanding only queries having one relation each; however, the method can be applied straightforwardly to queries with more than one relation. After the query is expanded with the names of the latently related entities, it is processed by Part 1 of the system described above.
4 Experiment
4.1 Datasets
A test collection includes 3 parts: (1) a document collection; (2) a query collection; and (3) relevance judgements stating which documents are relevant to which queries. A document is relevant to a query if it actually conveys the enquired information, rather than just the words of the query. There are well-known standard datasets such as TREC, CISI, NTCIR, CLEF, Reuters-21578, TIME, and WBR99. We surveyed the papers of SIGIR-2007 (http://www.sigir2007.org) and SIGIR-2008 (http://www.sigir2008.org) to find out which datasets have been used most often in the information retrieval community so far. We considered only papers about text IR, excluding IR for multiple languages, pictures, music, video, and markup documents (XML, SGML, …); poster papers of SIGIR-2007 and SIGIR-2008 were not reviewed either. In total, 56 papers about text IR were examined and classified into three groups, namely papers using TREC datasets (The Text REtrieval Conference, http://trec.nist.gov), author-own datasets, and other standard datasets. TREC is annually co-organized by the National Institute of Standards and Technology (NIST) and the U.S. Department of
Table 2. Statistics about dataset usage of text retrieval papers in SIGIR 2007 and SIGIR 2008

                          Number of Papers Using a Dataset Type
SIGIR       Paper Total   Author-Own Dataset   Other Standard Dataset   TREC's Dataset
2007                 34                   11                        7               21
2008                 22                    8                        4               12
2007+2008            56            19 (~34%)                11 (~20%)        33 (~59%)
Defense, supporting research into and evaluation of large-scale information retrieval systems. Table 2 shows that 59% of the papers use TREC datasets, making them the most popular in the IR community. We chose the L.A. Times document collection, a TREC collection used by 15 of the 33 TREC-using papers of SIGIR-2007 and SIGIR-2008 in the above survey. The L.A. Times collection consists of more than 130,000 documents in nearly 500 MB. Next, we chose the 124 queries out of the 200 queries of the QA Track-1999 that have answer documents in this collection.
4.2 Testing Results
Using the chosen dataset, we evaluate the performance of the proposed model and compare it with others by the common precision (P) and recall (R) measures ([1]). In addition, our system ranks documents according to their similarity degrees to the query. Hence, P-R curves better represent the retrieval performance and allow comparing different systems. The closer a curve is to the top right corner, the better the performance it represents. In order to have an average P-R curve over all queries, the P-R curve of each query is interpolated to the eleven standard recall levels 0%, 10%, ..., 100% ([1]). Besides, a single measure combining P and R is the F-measure, computed by F = 2·P·R/(P + R). We also use average F-R curves at the eleven standard recall levels to compare system performances. We conduct experiments to compare the results obtained by three different search models:
1. Keyword Search: This search uses the Lucene text search engine.
2. NE+KW Search: This search is given in [21].
3. Semantic Search: This is the search engine proposed in this paper.
Table 3 and Figure 3 show the average precisions and F-measures of the keyword-based Lucene, the NE+KW Search, and the proposed Semantic Search at each of the standard recall levels over the 124 queries. They show that taking into account latent ontological features in queries and documents, and expanding queries using relations described in an ontology, enhances text retrieval performance.
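For reference, the measures used here can be sketched in code. The following Java fragment computes the F-measure from P and R and interpolates a single query's precision values to the eleven standard recall levels (interpolated precision at level r taken as the maximum precision observed at any recall ≥ r, following [1]); the method and variable names are ours.

/** Evaluation helpers for the P, R and F measures used in this section. */
public final class Measures {

    /** F-measure: harmonic mean of precision and recall, F = 2*P*R / (P + R). */
    public static double fMeasure(double precision, double recall) {
        return (precision + recall) == 0 ? 0.0 : 2.0 * precision * recall / (precision + recall);
    }

    /**
     * Interpolates precision to the eleven standard recall levels 0.0, 0.1, ..., 1.0.
     * recallAt[i] / precisionAt[i] are the raw values after each retrieved relevant document.
     */
    public static double[] interpolate(double[] recallAt, double[] precisionAt) {
        double[] result = new double[11];
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < recallAt.length; i++) {
                if (recallAt[i] >= r && precisionAt[i] > best) {
                    best = precisionAt[i]; // maximum precision at any recall >= r
                }
            }
            result[level] = best;
        }
        return result;
    }
}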
Table 3. The average precisions and F-measures at the eleven standard recall levels on 124 queries of the L.A. Times

Measure         Model             Recall (%):  0     10    20    30    40    50    60    70    80    90    100
Precision (%)   Lucene                         66.1  66.0  63.2  60.4  56.7  55.1  45.7  40.4  37.9  37.5  37.1
                NE+KW                          71.8  71.6  69.5  65.5  62.2  60.8  52.4  48.0  46.4  45.4  44.7
                Semantic Search                73.0  72.7  70.9  67.0  63.8  62.4  54.4  50.1  48.4  47.4  46.8
F-measure (%)   Lucene                         0.0   15.5  26.6  34.8  40.1  45.0  43.4  42.2  41.8  43.0  44.1
                NE+KW                          0.0   16.3  28.4  37.1  42.8  48.3  48.0  47.7  48.5  49.8  50.8
                Semantic Search                0.0   16.5  28.7  37.4  43.3  49.0  48.8  48.7  49.6  51.0  52.1
Fig. 3. Average P-R and F-R curves of Lucene, KW+NE and Semantic Search models on 124 queries of the L.A. Times
Among the 124 queries, only 17 queries are expanded. The other queries are not expanded because: (1) the queries have more than one relation phrase, which is out of the experimental scope of this paper (55 queries); (2) Ontology_2 does not have relation types corresponding to the relation phrases in the queries (36 queries); and (3) Ontology_2 does not have facts asserting specific relations of the named entities in the queries with others (16 queries). Table 4 and Figure 4 show the average precisions and F-measures of the three systems at each of the standard recall levels for those 17 expanded queries only. One can observe that, when all queries are expanded using our proposed method, Semantic Search clearly outperforms the other two systems. We have analyzed some typical queries for which Semantic Search is better or worse than NE+KW. For query_38, "Where was George Washington born?", Semantic Search performs better than NE+KW. Semantic Search maps the relation word born to the relation bornIn, and Ontology_2 has the fact (George_Washington bornIn Westmoreland_County). So, Westmoreland County is added to the query.
Table 4. The average precisions and F-measures at the eleven standard recall levels on 17 expanded queries of the L.A. Times

Measure         Model             Recall (%):  0     10    20    30    40    50    60    70    80    90    100
Precision (%)   Lucene                         61.9  61.9  58.4  53.6  51.9  51.9  39.6  39.1  38.3  38.0  37.6
                NE+KW                          71.6  71.6  69.5  67.7  66.9  65.8  55.2  54.9  54.8  54.7  54.7
                Semantic Search                82.2  82.2  82.2  80.1  80.1  78.9  70.3  70.3  69.7  69.7  69.7
F-measure (%)   Lucene                         0.0   14.7  25.0  32.3  37.4  42.5  38.3  40.4  41.6  43.2  44.2
                NE+KW                          0.0   16.3  28.5  37.6  44.8  51.2  50.0  52.8  55.3  57.9  58.9
                Semantic Search                0.0   17.5  31.1  40.4  48.7  56.0  55.8  59.6  62.6  66.0  67.5
Fig. 4. Average P-R and F-R curves of Lucene, KW+NE and Semantic Search models on 17 expanded queries of the L.A. Times
For query_190, "Where is South Bend?", Semantic Search maps the relation phrase where is to the relation locatedIn, and Ontology_2 has the fact (South_Bend locatedIn Indiana). However, all of the relevant documents for the query contain only Ind rather than Indiana. Although Ind is an alias of Indiana, Ontology_1 does not include it. Therefore, adding Indiana to the query makes Semantic Search perform worse than NE+KW.
5 Conclusion and Future Works
We have presented a generalized VSM that exploits ontological features for semantic text search. It covers the whole IR process, from a natural language query to a set of ranked documents. Given an ontology, we have explored latent named entities related to those in a query and enriched the query with them. We have also proposed a framework to combine multiple ontologies to take their complementary advantages for the whole semantic search process.
Besides, the system takes into account all the main features of named entities, namely, name, class, identifier, alias, and super-class, and supports various query types, such as searching by only named entities, only keywords, combined named entities and keywords, and Wh-questions. The experiments conducted on a TREC dataset have shown that appropriate ontology exploitation improves the search quality in terms of the precision, recall, and F-measures. In particular, by expanding queries with implied named entities, the proposed Semantic Search system outperforms the previous NE+KW and keyword-based Lucene ones. Our experiments on the effect of query expansion are partially eclipsed by the used Ontology_2, which does not cover all relation types and specific relations in the test query set, and by the relation recognition module of our system. For future work, we will combine more ontologies to increase the relation coverage and investigate logical methods to better recognize relations in a query. Furthermore, we will also investigate ontological similarity and relatedness between keywords. These are expected to increase the performance of the proposed model.
References
[1] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)
[2] Bast, H., Chitea, A., Suchanek, F., Weber, I.: ESTER: Efficient Search on Text, Entities, and Relations. In: Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR-2007), pp. 671–678. ACM, New York (2007)
[3] Cao, T.H., Cao, T.D., Tran, T.L.: A Robust Ontology-Based Method for Translating Natural Language Queries to Conceptual Graphs. In: Domingue, J., Anutariya, C. (eds.) ASWC 2008. LNCS, vol. 5367, pp. 479–492. Springer, Heidelberg (2008)
[4] Cao, T.H., Le, K.C., Ngo, V.M.: Exploring Combinations of Ontological Features and Keywords for Text Retrieval. In: Ho, T.-B., Zhou, Z.-H. (eds.) PRICAI 2008. LNCS (LNAI), vol. 5351, pp. 603–613. Springer, Heidelberg (2008)
[5] Castells, P., Vallet, D., Fernández, M.: An Adaptation of the Vector Space Model for Ontology-Based Information Retrieval. IEEE Transactions on Knowledge and Data Engineering 19(2), 261–272 (2007)
[6] Cheng, T., Yan, X., Chen, K., Chang, C.: EntityRank: Searching Entities Directly and Holistically. In: Proceedings of the 33rd Very Large Data Bases Conference (VLDB-2007), pp. 387–398 (2007)
[7] Choi, N., Song, I.Y., Han, H.: A Survey on Ontology Mapping. ACM SIGMOD Record 35(3), 34–41 (2006)
[8] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: Developing Language Processing Components with GATE Version 4, User Guide (2006), http://gate.ac.uk/sale/tao
[9] Elsas, J.L., Arguello, J., Callan, J., Carbonell, J.G.: Retrieval and Feedback Models for Blog Feed Search. In: Proceedings of the 31st Annual International ACM SIGIR Conference (SIGIR-2008), pp. 347–354. ACM, New York (2008)
[10] Fensel, D., van Harmelen, F., Horrocks, I.: OIL: An Ontology Infrastructure for the Semantic Web. IEEE Intelligent Systems 16(2), 38–45 (2001)
[11] Fernández, M., et al.: Semantic Search Meets the Web. In: Proceedings of the 2nd IEEE International Conference on Semantic Computing (ICSC-2008), pp. 253–260 (2008)
[12] Goncalves, A., Zhu, J., Song, D., Uren, V., Pacheco, R.: Latent Relation Discovery for Vector Space Expansion and Information Retrieval. In: Yu, J.X., Kitsuregawa, M., Leong, H.-V. (eds.) WAIM 2006. LNCS, vol. 4016, pp. 122–133. Springer, Heidelberg (2006)
[13] Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies 43(4), 907–928 (1995)
[14] Guha, R., McCool, R., Miller, E.: Semantic Search. In: Proceedings of the 12th International Conference on World Wide Web, pp. 700–709 (2003)
[15] Kasneci, G., Ramanath, M., Suchanek, F., Weikum, G.: The YAGO-NAGA Approach to Knowledge Discovery. In: Proceedings of the 28th ACM SIGMOD International Conference on Management of Data (ACM SIGMOD-2008), pp. 41–47. ACM, New York (2008)
[16] Khalid, M.A., Jijkoun, V., de Rijke, M.: The Impact of Named Entity Normalization on Information Retrieval for Question Answering. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 705–710. Springer, Heidelberg (2008)
[17] Lopez, V., Sabou, M., Motta, E.: PowerMap: Mapping the Real Semantic Web on the Fly. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 414–427. Springer, Heidelberg (2006)
[18] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing, and Retrieval. Elsevier's Journal of Web Semantics 2(1) (2005)
[19] Marsh, E., Perzanowski, D.: MUC-7 Evaluation of IE Technology: Overview of Results. In: Proceedings of the Seventh Message Understanding Conference, MUC-7 (1998)
[20] Mihalcea, R., Moldovan, D.: Document Indexing Using Named Entities. Studies in Informatics and Control 10(1) (2001)
[21] Ngo, V.M., Cao, T.H.: A Generalized Vector Space Model for Ontology-Based Information Retrieval. Vietnamese Journal on Information Technologies and Communications 22 (2009) (to appear)
[22] Sekine, S.: Named Entity: History and Future. Proteus Project Report (2004)
[23] Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO - A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In: Proceedings of the 16th International Conference on World Wide Web (WWW-2007), pp. 697–706. ACM, New York (2007)
[24] Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO - A Large Ontology from Wikipedia and WordNet. Journal of Web Semantics 6(3), 203–217 (2008)
[25] Tran, T., Cimiano, P., Rudolph, S., Studer, R.: Ontology-Based Interpretation of Keywords for Semantic Search. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 523–536. Springer, Heidelberg (2007)
[26] Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E.G.M., Milios, E.E.: Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web. In: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, pp. 10–16 (2005)
Indexing Spatial Objects in Stream Data Warehouse
Marcin Gorawski and Rafal Malczok
Silesian University of Technology, Institute of Computer Science, Akademicka 16, 44-100 Gliwice, Poland
{Marcin.Gorawski,Rafal.Malczok}@polsl.pl
Abstract. The process of adapting data warehouse solutions for application in many areas of everyday life means that data warehouses are used for storing and processing many, often far from standard, kinds of data, such as maps, videos and clickstreams, to name a few. A new type of data – stream data, generated by many types of systems like traffic monitoring or telemetry systems – created the motivation for a new concept, the stream data warehouse. In this paper we address the problem of indexing spatial objects that generate streams of data with a spatial indexing structure. Based on our motivating example, a telemetric system of integrated meter readings, and utilizing the results of our previous work, we extend the solution we created for processing long but limited aggregate lists to make it applicable to processing data streams. Then we describe the process of adapting a spatial indexing structure for usage in a stream data warehouse by modifying both the structure of the index nodes and the operation of the algorithm answering range aggregate queries. The paper also contains an experimental evaluation of the proposed solution.
1 Introduction
In recent years data warehouse systems have become more and more popular. Users find the ability to process large amounts of data in a short time very useful and convenient. This trend is supported by products of large software companies who enrich their business offer with ready-to-use data warehouse solutions integrated with broadly known database systems. A simple data warehouse can be created by a user who, by clicking, models the data warehouse structure, defines dimensions, attributes and hierarchies and finally defines what reports should be available for the final user. Systems built from scratch and dedicated to a single company are no longer the only way to create a data warehouse. The extension of the domain where data warehouse systems are applied results in the need for supporting various types of data. Very interesting are
aspects of adapting a data warehouse to be used for processing stream data. In the stream data category we include, for example, car traffic and cell phone tracking data as well as utilities consumption data. For the purpose of storing and processing stream data, stream data warehouses are being designed and implemented. The area of stream data processing and storing is an active research field. There are many projects [1,10,6] focused on designing systems which make it possible to register and evaluate continuous queries [1]. Stream data warehouse systems pose many new challenges which do not occur in standard data warehouses. One of the most important is the problem concerning the data loading process (the ETL process – Extract, Transform and Load). In standard data warehouses ETL is a batch process launched from time to time (every night, every week, etc.). In stream data warehouses the ETL process is a continuous one. Changing the nature of the ETL process from batch to continuous forces the designer of a stream data warehouse to provide efficient mechanisms for processing and managing stream data. In this paper we describe a solution allowing the processing and managing of any kind of stream data using a spatial indexing structure. The remaining part of the paper is organized as follows: in section 2 we present details of the example motivating our research and then define the problem we address in this paper. The next section contains the details of stream data managing and processing. Section 4 describes the process of spatial range query evaluation. Finally, we show experimental results and conclude the paper discussing our future plans.
2 Spatial Sensor Systems. Motivating Example
Stream data processing is, in many cases, motivated by the need of handling endless streams of sensor data [7,3]. The motivation for the research presented in this paper is a system of integrated utilities meter reading. The system monitors the consumption of utilities such as water, natural gas and electrical energy. The meters are located in some region where a telemetric installation exists. The meters send the readings via radio waves to collecting points called nodes. The nodes, using a standard TCP/IP network, transfer the readings to telemetric servers. The ETL process gathers the data from the telemetric servers. The operation of reading the utility consumption in some meter can be executed according to two different models. The first model assumes that there is some signal that causes the meter to send the readings to the collecting point. In the second model no signal is required and the meters send the readings at some time intervals. The intervals depend on the telemetric system configuration. In the motivating example the second model is applied. Considering the telemetric system operation we can assume that every single meter is an independent source generating an endless stream of readings. The intensity of a stream depends on the given meter configuration
parameters which define how often the readings are sent to the collecting point. Utilizing the motivating example we created an experimental stream data warehouse system which we named DWE (Data Warehouse Experimental). The system is coupled with a stream ETL process, which allows on-line stream processing. Apart from the meter readings, the system has access to additional information: the geographical location of the meters and collecting points, weather conditions in the region encompassed by the telemetric installation and a brief description of the meters' users. This information is used when the utilities consumption is analyzed. The basic functionality of DWE is to provide an answer to a range aggregate query. The query is defined as a set of regions R encompassing one or more utilities meters. The answer generated by the system consists of two parts. The first part contains information about the number of various kinds of meters located in the query region. The second part is a merged stream (or streams) of aggregated readings coming from the meters encompassed by the region query.
There are many spatial indexes which, in a very fast and efficient way, can answer any region query. The best known and most popular are indexes based on the R-Tree [5] spatial index. The main idea behind the R-Tree index is to build a hierarchical indexing structure where index nodes on the higher levels encompass the regions of the nodes on the lower levels of the index. Using one of the indexes derived from the R-Tree family (e.g. R*-Tree [2] or R+-Tree [9]) we can quickly obtain the first part of an answer. In order to calculate the second part of an answer we need to apply an indexing structure which, in its index nodes, stores aggregates concerning the objects located in the nodes' regions. The first proposed solution to this problem was the aR-Tree [8] (aggregation R-Tree). aR-Tree index nodes located in the higher levels of the hierarchy store the number of objects in the nodes located in the lower levels of the hierarchy. The functionality of this solution can be easily extended by adding any kind of aggregated information stored in the nodes. The idea of partial aggregates was used in a solution presented in [11], where the authors use it for creating a spatio-temporal index. Motivated by the index characteristics, the authors suggested sorted hashing tables to be used in the indexing structure nodes. The designers of the mentioned spatial aggregating indices assumed that the size of the aggregated data is well defined and small enough to fit into the computer's main memory. In the case of the presented system such an assumption cannot be made because of the endless stream nature of the data. Hence there is a need for a dedicated memory structure and a set of algorithms which, operating on the structure, can efficiently process the stream data. In this paper we focus on a solution designed for managing stream data and on its application.
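The idea of keeping aggregates in the index nodes, as in the aR-Tree, can be sketched generically as follows; this is only an illustration of the principle, not the aR-Tree data structure itself.

import java.util.ArrayList;
import java.util.List;

/** Generic sketch of an index node that stores aggregates over its subtree, in the spirit of the aR-Tree. */
class AggregateNode {
    final List<AggregateNode> children = new ArrayList<>();
    int objectCount;       // number of objects encompassed by this node's region
    double aggregateValue; // any additional aggregated information, e.g. a sum of readings

    /** Recomputes the aggregates bottom-up from the children. */
    void refresh() {
        if (children.isEmpty()) {
            return; // leaf: objectCount and aggregateValue are set directly
        }
        objectCount = 0;
        aggregateValue = 0.0;
        for (AggregateNode child : children) {
            child.refresh();
            objectCount += child.objectCount;
            aggregateValue += child.aggregateValue;
        }
    }
}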
3 Stream Data Aggregating and Processing
In order to efficiently process an endless stream of sensor readings a dedicated solution must be applied. In paper [4] we presented the Materialized Aggregates List (MAL). MAL is a combination of a memory structure and algorithms allowing endless streams of data to be processed efficiently.
3.1 Materialized Aggregates List
MAL bases its operation on the concepts of list and iterator. Mechanisms implemented in the list allow generating and then optionally materializing the calculated aggregates. Iterators are used for browsing the generated data and communicating with the list. The iterator requires a static table filled with aggregates. The iterator table is a set of logical parts called pages. Iterator table pages store some number of aggregates; the number is equal for each page. The fact that the table is logically divided into pages is used when the table is being filled with aggregates. In [4] we presented and compared three multithread page-filling algorithms; for the purpose of this research we used the SPARE algorithm. By using a static table MAL allows processing data streams of any length without imposing memory limitations. MAL also supports an optional materialization mechanism which can significantly speed up the process of aggregates recreation. In order to provide a flexible solution we designed MAL to create aggregates based on data from four various data sources (fig. 1):
1. Other MALs. This source is used when MAL works as a component of a node located on an intermediate level of an indexing structure. The structure must be hierarchical – the upper level nodes encompass the regions (and also objects) of the nodes in the lower levels and a parent node must be able to identify its child nodes.
2. Stream of sensor readings (generally, a raw data stream).
Fig. 1. Schema of the Materialized Aggregates List showing the four various types of sources used for creating aggregates and the MAL – client operation
When MAL's aggregates creating component is integrated with the stream ETL system, the aggregates can be created in on-line mode. This mode is much more efficient when compared to retrieving data from the database because no I/O operations are performed.
3. Stream history stored in the database. Utilizing standard SQL, MAL's aggregates retriever queries the database and creates aggregates. This source is used only when the stream source cannot be used for some reason (the requested aggregates are no longer available in the stream).
4. Materialized aggregates. Some of the aggregates are materialized for further use (described in the next section). Before accessing the database to retrieve aggregates, the MAL engine checks if the required aggregates are present in the dedicated table.
A single piece of information stored in MAL is called an aggregate A. The aggregate comprises a timestamp TS and a set of values VA = {Vi} (A = [TS, VA]). Every element Vi ∈ VA is of a defined type tVi. Aggregates are calculated for some time interval called the aggregate time window. The width of the window is identical for all aggregates stored in the list; its size can be adjusted to specific requirements. In the case of utilities consumption the values stored in an aggregate can be interpreted as the consumption of a utility in a time period (the aggregate time window). As a data stream we understand an endless sequence of elements of a given type. The elements can occur in the stream at arbitrary moments in time. The interval between subsequent elements is not defined. For a stream there is no concept of a stream end. As the stream beginning we consider the first element generated by the source of the stream (a utilities meter).
3.2 Merging Aggregates Streams
Calculating the second part of the range aggregate query requires aggregates stream merging. Before we define the details of the stream merging operation, we need to define the requirements that must be satisfied to add two single aggregates. An aggregate A = [TS, VA] is the smallest amount of information stored in the aggregates stream. Each aggregate has a timestamp and a list of values. The length of the values list is not limited and can be adapted to system requirements. The aggregate type is defined by the cardinality of the VA list and the types of the elements of VA. Two aggregates are of equal type if their values lists are of equal cardinality and the values located under the same indexes are of the same simple type (integer or real number). Two aggregates can be added if and only if they have equal timestamps and they are of equal type. The result of the aggregates adding operation is an aggregate whose timestamp equals the timestamps of the added aggregates and whose values list is created by adding the values of the added aggregates. The aggregates streams merging operation merges two or more aggregates streams, creating one aggregates stream. The merging operation is performed by adding aggregates with equivalent timestamps. The aggregates in the
merged streams must be calculated for the same aggregate window and must be addable. For aggregates streams merging there is a special moment in time when the operation begins. All merged streams must contain the aggregates used during the adding operation. The streams merging operation cannot be performed for streams where some required aggregates are missing. The stopping criterion of the streams merging operation depends on the particular application of the solution. There are three stopping criteria:
1. The first option is to perform the streams merging only for a given time period (in other words, the operation merges only fragments of the streams). The operation is broken when the timestamp in the first aggregate is older than the date bordering the merged stream segments.
2. The second option is to break the merging operation if, in some of the merged streams, an aggregate required for adding is missing (the operation came across the end of a stream).
3. The last possibility is, in the case when the merging operation comes across the end of some merged streams, to suspend the operation until the required aggregate appears in the stream.
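The adding and merging rules of this subsection can be sketched as follows. The sketch assumes in-memory aggregates whose streams are already aligned per aggregate window, and it implements only the second stopping criterion; the class names are our own.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Aggregate A = [TS, VA]: a timestamp and a list of values of simple types. */
class Aggregate {
    final long timestamp;          // TS
    final List<Double> values;     // VA = {Vi}

    Aggregate(long timestamp, List<Double> values) {
        this.timestamp = timestamp;
        this.values = values;
    }

    /** Two aggregates can be added iff they have equal timestamps and equal type. */
    boolean canBeAddedTo(Aggregate other) {
        return timestamp == other.timestamp && values.size() == other.values.size();
    }

    /** The sum has the same timestamp; its values are the element-wise sums. */
    Aggregate add(Aggregate other) {
        if (!canBeAddedTo(other)) throw new IllegalArgumentException("incompatible aggregates");
        List<Double> sum = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            sum.add(values.get(i) + other.values.get(i));
        }
        return new Aggregate(timestamp, sum);
    }
}

/** Merges two aggregates streams by adding aggregates with equal timestamps.
 *  Stops when either stream ends (stopping criterion 2 above). */
class StreamMerger {
    static Iterator<Aggregate> merge(Iterator<Aggregate> a, Iterator<Aggregate> b) {
        List<Aggregate> merged = new ArrayList<>();
        while (a.hasNext() && b.hasNext()) {
            merged.add(a.next().add(b.next())); // streams assumed aligned per aggregate window
        }
        return merged.iterator();
    }
}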
4 Calculating a Region Query Answer Stream
The MAL iterator uses a static table for browsing and managing an aggregates stream. The iterator tables are stored by the data warehouse system in the form of a resource pool. It is possible for an external process to use a MAL iterator only if there is at least one free table in the pool. In the case when there is no free table, the external process waits until some table is returned to the pool. By defining the number of tables stored in the pool one can very easily control the amount of memory consumed by the part of the system responsible for processing data streams. In order to use MAL as a component of spatial index structure nodes we must define an algorithm that distributes the tables present in the pool to the appropriate MALs according to set criteria.
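The pool behaviour described above (a process blocks until a table is free, and the pool size bounds the memory used for stream processing) can be sketched with a blocking queue; the table representation and sizes below are assumptions of this illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** A fixed-size pool of iterator tables; its size bounds the memory used for stream processing. */
class IteratorTablePool {
    private final BlockingQueue<double[]> pool;

    IteratorTablePool(int tableCount, int pageCount, int pageSize) {
        pool = new ArrayBlockingQueue<>(tableCount);
        for (int i = 0; i < tableCount; i++) {
            pool.add(new double[pageCount * pageSize]); // one static table, logically split into pages
        }
    }

    /** Blocks until a free table is available, as described for external processes. */
    double[] acquire() throws InterruptedException {
        return pool.take();
    }

    /** Returns a table so that waiting processes may proceed. */
    void release(double[] table) {
        pool.offer(table);
    }
}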
4.1 Finding Nodes Answering the Query
For every window from the region query set, the indexing structure is recursively browsed in order to find the nodes which provide the answer to the query. The browsing starts from the tree root and proceeds towards the tree leaves. The tree browsing algorithm checks the relation of the window region O and the node region N. The operation of the algorithm can be described by the following steps:
– if the window region and the node region share no part (O ∩ N = ∅), the node is skipped,
– if the window region entirely encompasses the node region (O ∩ N = N), the node is added to the FAN set (Full Access Node – a node which participates in answer generation with all its aggregates),
– finally, if the window region and the node region share some part (O ∩ N = O′), the algorithm performs a recursive call to the lower structure levels, passing the parameter O′ as an aggregate window.
When traversing to the lower tree levels it is possible that the algorithm reaches a node on the lowest hierarchy level. In this case, the algorithm must execute a query searching for the encompassed objects. The set of found objects is marked with the letter M. The objects in the M set are marked with a so-called query mark which is then used when the answer stream is being generated.
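The recursive browsing can be sketched as follows; Region, Node and the leaf object query are assumed abstractions of this illustration, not the actual DWE interfaces.

import java.util.List;
import java.util.Set;

/** Sketch of the recursive browsing that builds the FAN and M sets for one query window. */
class RangeQueryPlanner {

    void browse(Node node, Region window, Set<Node> fan, Set<SensorObject> m) {
        Region shared = window.intersect(node.region());
        if (shared.isEmpty()) {
            return;                                   // O ∩ N = ∅ : skip the node
        }
        if (shared.equals(node.region())) {
            fan.add(node);                            // O ∩ N = N : full access node
            return;
        }
        if (node.isLeaf()) {
            m.addAll(node.objectsInside(shared));     // lowest level: query for encompassed objects
            return;
        }
        for (Node child : node.children()) {
            browse(child, shared, fan, m);            // O ∩ N = O' : recurse with the shared part
        }
    }

    // Assumed abstractions for this sketch.
    interface Region {
        Region intersect(Region other);
        boolean isEmpty();
    }
    interface Node {
        Region region();
        boolean isLeaf();
        List<Node> children();
        List<SensorObject> objectsInside(Region r);
    }
    interface SensorObject { }
}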
4.2 Iterator Table Assignment Algorithm
For every reading category (energy, water, etc.) the data warehouse system creates a separate table pool. For every pool the system defines a maximal number of tables that can be stored in the pool. The main task of the iterator table assignment algorithm is to distribute the tables from the pool over the nodes and objects involved in the process of generating the answer in the most optimal way. The process of finding the nodes and single objects involved in query answer generation creates two sets: FAN and M.
Nodes sorting. The actual table assignment algorithm starts its operation by sorting the nodes stored in the FAN set. The nodes are sorted according to the following criteria:
1. The amount of materialized data available for a given node. The more materialized data, the higher the position of the node. Only the materialized data that can be used during answer stream generation is taken into account.
2. Generated stream materialization possibility. Every indexing structure node is located on some level. One of the data warehouse operating parameters is the materialization level, starting from which the streams generated by the nodes are materialized. For the level set to 0 all streams generated by the nodes are materialized. For the level set to -1 not only node streams are materialized, but also streams generated by single objects. The materialization possibility is a binary criterion: a node either has it or it does not. A node with the possibility is always placed higher than a node whose stream will not be materialized.
3. The last sorting criterion is the number of objects encompassed by the node's region. The more encompassed objects, the higher the position of the node.
The listing presented below shows the operation of the algorithm comparing the nodes from the FAN set. The input arguments are two nodes N1 and N2. The returned value depends on the comparison result. The function returns -1 if node N1 should take the higher position, 1 if node N2 should be placed higher than N1, and finally 0 if the nodes' comparison attributes are equal.

function compareNodes(Node N1, Node N2) return INT is
begin
  if (N1.matDataCount > N2.matDataCount) then
    return -1; /* N1 has more materialized data than N2 */
  end if;
  if (N2.matDataCount > N1.matDataCount) then
    return 1;  /* N2 has more materialized data than N1 */
  end if;
  if (N1.level >= matThsld and N2.level < matThsld) then
    return -1; /* N1's stream will be materialized, N2's will not */
  end if;
  if (N2.level >= matThsld and N1.level < matThsld) then
    return 1;  /* N2's stream will be materialized, N1's will not */
  end if;
  if (N1.objectsCount > N2.objectsCount) then
    return -1; /* N1 encompasses more objects than N2 */
  end if;
  if (N1.objectsCount < N2.objectsCount) then
    return 1;  /* N2 encompasses more objects than N1 */
  end if;
  return 0;    /* according to the attributes N1 = N2 */
end;
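Since the solution is implemented in Java and the FAN set is sorted with the default collections sort (see Section 5), the same comparison logic can be expressed as a Comparator; the field names below are assumptions mirroring the pseudocode.

import java.util.Comparator;

/** Java counterpart of compareNodes; nodes that should be placed higher compare as smaller. */
class NodeComparator implements Comparator<IndexNode> {

    private final int materializationThreshold; // matThsld in the pseudocode

    NodeComparator(int materializationThreshold) {
        this.materializationThreshold = materializationThreshold;
    }

    @Override
    public int compare(IndexNode n1, IndexNode n2) {
        // 1. more materialized data first
        if (n1.matDataCount != n2.matDataCount) {
            return Integer.compare(n2.matDataCount, n1.matDataCount);
        }
        // 2. nodes whose stream will be materialized first
        boolean m1 = n1.level >= materializationThreshold;
        boolean m2 = n2.level >= materializationThreshold;
        if (m1 != m2) {
            return m1 ? -1 : 1;
        }
        // 3. more encompassed objects first
        return Integer.compare(n2.objectsCount, n1.objectsCount);
    }
}

/** Minimal node view assumed by the comparator. */
class IndexNode {
    int matDataCount;
    int level;
    int objectsCount;
}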
The elements in the M set are sorted similarly to those in the FAN set. In the case of the M set, the only sorting criterion is the amount of available materialized data which can be used during answer stream generation.
Iterator tables assignment. After the elements in the FAN and M sets are sorted, the algorithm can assign the iterator tables to the chosen elements. Let P denote the iterator table pool, and |P| the number of tables available in the pool. If (|FAN| = 1 AND |M| = 0) OR (|FAN| = 0 AND |M| = 1) (the answer to the query is a single stream generated by a node or a single object), then one table is taken from the pool and assigned to the element generating the answer. The condition that must be fulfilled is that |P| ≥ 1 (there is at least one table in the pool). In the other case (more than one element generates the answer) the algorithm requires an additional structure called GCE (Global Collecting Element). The GCE element is of type MAL and it is used for merging the streams generated by other elements (nodes and objects). The GCE element requires one iterator table. Hence one table in the iterator table pool is enough to answer any region query. If there are some free tables in the pool they are assigned to the other elements
involved in the result stream generating process. First, the tables are assigned to the elements in the FAN set (according to the order of elements set by the sorting operation). Then, if there are still some free tables, they are assigned to single objects from the M set (also in the appropriate order). A few table assignment scenarios are possible:
– |P| ≥ |FAN| + |M| – every element involved in answer stream generation is assigned a separate table,
– |P| ≥ |FAN| AND |P| < |FAN| + |M| – every element from the FAN set is assigned a separate table and some single object streams from the M set are merged right at the moment of creating aggregates,
– |P| < |FAN| – all streams of the elements from the M set and some streams of the elements from the FAN set are merged into one stream during the process of aggregates creation.
A stream can be materialized only if the generating element is assigned a separate iterator table. If a stream of an element's aggregates is merged at an early generation stage with the streams of other elements, the materialization is not performed because it cannot be explicitly determined which element generated the stream. In general, partial answer streams, for example those merged in GCE, are not materialized. The stream generated by GCE is not materialized either, because those streams change with every query and if the system materialized every indirect stream the amount of materialized data would grow very fast.
In figure 2 we can observe an example of the table distributing algorithm operation. For the purpose of this example we assume that the number of tables in the pool is 4 (|P| = 4). The tree traversing algorithm marked nodes 4, 6 and 7 (they create the FAN set) and objects a, b and c (they create the M set). The sorting operation sorted the elements into the order presented in the params table of the GCE element. Because |FAN| ≥ 1 and |M| ≥ 1, we need to reserve one iterator table for the GCE element. The remaining tables are assigned to node 7, node 4 and the merged elements: node 6 and single objects a, b and c. Below we present a pseudocode showing the operation of the iterator
Fig. 2. Operation of the algorithm assigning iterator tables to answer stream generating nodes. All the answer stream generating elements are stored in the params table of the GCE element.
table assigning algorithm. The input parameters are the sorted sets FAN and M. At least one of those sets contains at least one element. The returned GCE object is of type MAL. At the beginning the function checks if the sum of the FAN and M set cardinalities equals 1 (line 5). If that condition (|FAN| + |M| = 1) holds, the function returns an existing element from the FAN or M set as the GCE element (lines 6-10). In the other case (|FAN| + |M| > 1) the function assigns one iterator table to the GCE element (line 12) and, if the sum of elements in the FAN and M sets is greater than the number of tables available in the pool (|FAN| + |M| > |P|), one iterator table for all merged streams (line 14). In the next step, the function assigns the remaining tables to the elements from the sorted FAN set (lines 16, 17), and then, if there are still some tables in the pool, to the elements from the sorted M set (lines 21, 22). If the sum of the FAN and M set cardinalities is greater than the number of tables available in the pool, some elements must use a shared table. Those elements are stored in the params table of the GCE element (lines 24, 27).

 1 function calcGCE(FAN Collection, M Collection) return MAL is
 2   int fcntr := 1, mcntr := 1;
 3   MAL GCE;                          /* Global Collecting Element */
 4 begin
 5   if ((|FAN|=1 and |M|=0) or (|FAN|=0 and |M|=1)) then
 6     if (|FAN| != 0) then            /* there is only one element */
 7       GCE := FAN(1);
 8     else
 9       GCE := M(1);
10     end if;
11   else                              /* there are more elements */
12     assign_table(P, GCE);           /* assign one table for GCE element */
13     if (|FAN|+|M| > |P|) then
14       assign_table(P, GCE.Shared);  /* one table must be reserved
15     end if;                            for shared table elements */
16     while (|P| > 0 and fcntr <= FAN.count) loop
17       assign_table(P, FAN(fcntr));       /* assign free tables to FANs */
18       GCE.add_to_params(FAN(fcntr++));   /* inform GCE about this FAN */
19     end loop;
20     while (|P| > 0 and mcntr <= M.count) loop
21       assign_table(P, M(mcntr));         /* assign free tables to objects */
22       GCE.add_to_params(M(mcntr++));     /* inform GCE about this object */
23     end loop;
24     while (fcntr <= FAN.count) loop
25       GCE.shared_table(FAN(fcntr++));    /* FAN will use shared table */
26     end loop;
27     while (mcntr <= M.count) loop
28       GCE.shared_table(M(mcntr++));      /* object will use shared table */
29     end loop;
30   end if;
31 end;
5 Test Results
In this section we present the experimental test results. We implemented the entire solution in Java. More implementation details of MAL can be found in [4]. For the tests we used a machine equipped with an Intel Core 2 T7200 processor, 2 GB of RAM and a 100 GB HDD (7200 rpm). The software environment was Windows XP, Sun Java 1.5 and an Oracle 9i database. The FAN and M set sorting algorithm was the modified mergesort used by default by the Java collections. This algorithm offers guaranteed n·log(n) performance. The test model contained 1000 electrical energy meters. Every meter generated a reading every 30 to 60 minutes. The aggregation window width was 120 minutes. The database contained readings from a period of one year. The table storing the readings contained over 12 million rows. We tested the system by first defining a set of query windows and then calculating the answer. The generated answer stream was browsed without aggregates processing. The times presented in the charts are sums of the answer calculation and stream browsing times. In all tested cases the answer calculation time was negligibly small when compared to the stream browsing time. The first measured times concerned answer generation and stream browsing for a query encompassing 100 objects. In order to investigate various query defining possibilities we separated four different query variants:
1. a query encompassing 2 indexing structure nodes (each node encompassing 25 objects) and 50 single objects (variant marked as 50+50),
2. a query encompassing 3 indexing structure nodes and 25 single objects (variant marked as 75+25),
3. a query encompassing 4 indexing structure nodes (marked as 4*25),
4. a query encompassing 1 indexing structure node located in a higher level of the indexing structure (a node encompassing 100 objects, marked as 1*100).
Fig. 3. Answer calculation times for various numbers of available iterator tables (no materialized data); series 50+50, 75+25, 4*25, 1*100; time [s] vs. number of available iterator tables (2, 5, 15, 30)
Fig. 4. Answer calculation times for various numbers of available iterator tables (using materialized data); series 50+50, 75+25, 4*25, 1*100; time [s] vs. number of available iterator tables (2, 5, 15, 30)
6 Conclusions and Future Plans
Our previously designed solution – the Materialized Aggregate List (MAL) – can be used as a component of any spatial aggregating index processing data streams. In order to apply MAL in an indexing structure we needed to define an algorithm that distributes the available iterator tables among the elements involved in the answer calculation process. The algorithm, using a set of criteria, sorts all the involved elements and then assigns the tables. The most important criterion is the amount of available materialized data that can be used in the answer generation process. We implemented the entire solution in Java. We carried out a set of experiments to verify the efficiency and scalability of MAL and the table distributing algorithm. The tests show that an increasing number of encompassed objects and a growing aggregation period result in linear growth of the answer calculation time. In the near future we want to check precisely when the materialization process can be omitted. We want to apply many sets of various queries to help us define a materialization threshold. Based on analyses of the test results we suppose that if some of the streams generated by the lowest level nodes and single objects are not materialized, it will cause no harm to the answer generation time but will significantly reduce the amount of materialized data stored in the database.
References
1. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data Stream Systems. In: Proceedings of the PODS Conference, pp. 1–16 (2002)
2. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the SIGMOD Conference, June 1990, pp. 322–331 (1990)
3. Bonnet, P., Gehrke, J., Seshadri, P.: Towards Sensor Database Systems. In: Mobile Data Management, pp. 3–14 (2001)
4. Gorawski, M., Malczok, R.: On Efficient Storing and Processing of Long Aggregate Lists. In: DaWaK, Copenhagen, Denmark (2005)
5. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the SIGMOD Conference, Boston, MA, June 1984, pp. 47–57 (1984)
6. Hellerstein, J., et al.: Adaptive Query Processing: Technology in Evolution. IEEE Data Eng. Bull., 7–18 (2000)
7. Madden, S., Franklin, M.J.: Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. In: ICDE 2002, pp. 555–566 (2002)
8. Papadias, D., Kalnis, P., Zhang, J., Tao, Y.: Efficient OLAP Operations in Spatial Data Warehouses. LNCS. Springer, Heidelberg (2001)
9. Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. In: VLDB 1987, pp. 507–518 (1987)
10. Terry, D., Goldberg, D., Nichols, D., Oki, B.: Continuous Queries over Append-Only Databases. In: Proceedings of the SIGMOD Conference, pp. 321–330 (1992)
11. You, B., Lee, D., Eo, S., Lee, J., Bae, H.: Hybrid Index for Spatio-temporal OLAP Operations. In: Proceedings of the ADVIS Conference, Izmir, Turkey (2006)
Real Time Measurement and Visualization of ECG on Mobile Monitoring Stations of Biotelemetric System
Ondrej Krejcar, Dalibor Janckulik, Leona Motalova, Karel Musil, and Marek Penhaker
VSB Technical University of Ostrava, Center for Applied Cybernetics, Department of Measurement and Control, Faculty of Electrical Engineering and Computer Science, 17. Listopadu 15, 70833 Ostrava Poruba, Czech Republic
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. The main area of interest of our Biotelemetric System is to provide a solution which can be used in different areas of health care and which will be available through PDAs (Personal Digital Assistants), web browsers or desktop clients. In this paper we deal with the problem of visualizing a measured ECG signal on mobile devices in Real Time, as well as with a solution to the problem of unsuccessful data processing on a desktop or server. The realized system deals with an ECG sensor connected to mobile equipment, such as a PDA or embedded device, based on the Microsoft Windows Mobile operating system. The whole system is based on the architecture of the .NET Framework, .NET Compact Framework, and Microsoft SQL Server. Visualization possibilities for ECG data are also discussed, including a WPF (Windows Presentation Foundation) solution. The project was successfully tested in a real environment in a cryogenic room (-136 °C). Keywords: Real Time, PDA, Embedded Device, Biotelemetry, ECG.
1 Introduction
The aim of the platform for patients' bio-parameter monitoring is to offer a solution providing services that help make full health care more efficient, without limitations to a specific country. Physicians and other medical staff will not be forced to do difficult manual work, including unending paperwork, but will be able to focus on the patients and their problems. All data will be accessible almost anytime, anywhere, through special applications designated for portable devices, web browsers or desktop clients, and any changes will be made immediately available to medical staff based on their security clearance. Physicians will have immediate access to the patient's newest results of accomplished examinations. In the case that an ambulance has to go to some accident, the rescue team can, thanks to portable devices, send information about the patient's health condition directly to the hospital, where the responsible doctors and staff will have
the information needed to execute an immediate operation without delays caused by the preparation of the necessary equipment. All bio-signal data are stored and automatically analyzed by a neural network. The system can evaluate the presence of critical values which could be a sign of a worsening medical condition of a patient. At the moment of crossing a border of the monitored bio-signal values predefined by the doctor, the system will inform the responsible medical staff and provide all information which could help to determine the cause and seriousness of the problem.
Fig. 1. Architecture of Guardian II platform
The basic idea is to create a system that monitors important information about the state of a wheelchair-bound person (monitoring of ECG and pulse in early phases, then other optional values like temperature or blood oxygenation, etc.), his situation in time and place (GPS) and the axis tilt of his body or wheelchair (2-axis accelerometer). Values are measured with the existing equipment, which communicates with the processing module via Bluetooth wireless communication technology. Most of the data (according to heftiness) is processed directly in the PDA or embedded equipment to a form that is acceptable for simple visualization. Two variants are possible in the case of embedded equipment – with visualization and without visualization (entity with/without LCD display). Data is continually sent over a GPRS or WiFi network to a server, where it is processed and evaluated in detail. Processing and evaluation on the server consists of: receiving data; saving data to data storage; visualization in an advanced form (the possibility to return to older graphs, zoom on a histogram (graph with historical trend), copy from the graphs, print graphs); and automatic evaluation of critical states, with the help of advanced technologies (algorithms) that use artificial intelligence, to notify the operator about a critical state and archive it. The application in the PDA or
embedded equipment is comfortable, requiring minimum time for the first configuration, but also for reconfiguration after an application crash. The problem we would like to describe concerns the processing of ECG data in Real Time. ECG data are the most complex compared to the other data mentioned previously. The problem was found in the processing of 12-channel ECG data from the BT ECG device to the database. Our current software application on the .NET platform is unable to process these data in real time to provide an online visualization, neither on a desktop nor on a mobile device. We found only two possibilities to surpass this problem. The first is an SQL server procedure; the second solution is the use of an embedded microcontroller unit (HCS08) for data preprocessing. These phenomena will be discussed in section 3.
2 Developed Parts of Platform
A complete proposition of the solution and implementation of the platform for patients' biotelemetry as described in the previous section requires determination and teamwork. Every single part of the architecture has to be designed for easy application and connectivity without extra user effort, but the user must be able to use the given solution easily and effectively. Crucial parts of the whole architecture are the network servers, database servers and client applications. Due to these crucial parts, the development is focused particularly on the proposition and implementation of mobile and desktop client applications, the database structure and some other important web services.
2.1 Server Parts
In order to run a server, an operating system supporting IIS (Internet Information Server) is needed. IIS allows users to connect to the web server by the HTTP protocol. The web service transfers data between the server and the PDA/Embedded devices. The web service also reads the data, sends acknowledgments, and stores the data in the database. The service is built upon ASP.NET 2.0 technology. The SOAP protocol is used for the transport of XML data. Methods that devices communicating with the web service can use include (a sketch of this contract follows the list):
• receiving measured data,
• receiving patient data,
• deleting a patient,
• patient data sending.
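The method set above can be pictured as a small service contract. The sketch below is written in Java only for consistency with the other examples in this volume – the actual service is an ASP.NET 2.0 SOAP web service – and the method and type names are our assumptions.

import java.util.List;

/** Hypothetical contract mirroring the four web-service methods listed above. */
interface GuardianService {
    /** Receives a batch of measured bio-signal packets from a PDA/embedded device. */
    void receiveMeasuredData(String deviceId, List<byte[]> packets);

    /** Receives (creates or updates) a patient record. */
    void receivePatientData(PatientRecord patient);

    /** Deletes a patient record. */
    void deletePatient(String patientId);

    /** Sends a patient record back to the requesting device. */
    PatientRecord sendPatientData(String patientId);
}

/** Minimal patient record assumed for this sketch. */
class PatientRecord {
    String id;
    String name;
    String surname;
}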
To observe the measured data effectively, visualization is needed. A graph of the type used in professional solutions is an ideal solution. To achieve this in a server application, the freeware ZedGraph library can be used. For data analysis, neural nets are a convenient solution. However, there are problems in the automatic detection of critical states. Every person has a specific ECG pattern. The neural net has to learn to distinguish the critical states of each patient separately.
An important part of Guardian is the central database. It stores all data of the medical staff and patients. Patient data include different records such as diagnoses, treatment progress or data which are the results of measurements by small portable devices designated for home care. These data represent the greatest problem, because their amount increases rapidly with an increasing number of patients. Due to this fact the database servers are heavily loaded.
2.2 Mobile Parts
The main part of the system is an Embedded or PDA device. The difference between the applications for the measurement units is the possibility to visualize the measured data in both a Real-time Graph and a Historical Trend Graph, which can be omitted on an embedded device. A PDA is a much better choice for Personal Healthcare, where the patient is already healthy and needs to review his condition. Embedded devices can be designed for one user, with the option to use an external display for settings or with the possibility of usage in extreme conditions. The information about the user, such as ID, name, surname, address, and the application properties are stored in the system registry (HKEY_CURRENT_USER / Software / Guardian). Working (saving, reading, finding) with the registry is easier and faster than saving this information in a file. User registry values are encrypted with a simple algorithm (shifting char ASCII values; see the sketch after Table 1). Devices of the PDA type have several limitations such as low CPU performance, low battery life or a small display, which can be solved by an embedded version of such mobile clients. We created a special Windows Mobile based embedded device. During the development process several problems occurred. One of them, and the most important, was the need to create a new operating system for our special architectural and device needs. We used the Microsoft Platform Builder for Windows CE 4.2 tools. The created operating system, based on standard Windows Mobile, has several drivers which we need to operate the communication devices and measurement devices. It is possible to connect several measurement devices with Bluetooth communication capability. In our application we use an ECG Measurement Unit (3-channel ECG Corbelt or 12-channel BlueECG) through a virtual serial port using wireless Bluetooth technology. Measured data are stored on an SD Memory Card as a database of MS SQL Server 2008 Mobile Edition. The performance of available devices seems insufficient for sequential access [Table 1]; parsing of incoming packets is heavily time-consuming. Pseudo-paralleling is strongly required. A newer operating system (Windows Mobile 6) must be used to allow the processing of data from a professional EKG due to thread count limitations.

Table 1. Mobile Devices with LCD 480x800 pixels, GSM, WiFi, BT

Mobile device         OS (WM)    Display    CPU [MHz]   SPB Benchmark Index
HTC Touch HD          6.1 Pro    LCD 3,8"   528         553
HTC Touch HD2         6.5 Pro    LCD 4,3"   1000        779
HTC Touch Diamond 2   6.5 Pro    LCD 3,2"   528         520
Samsung Omnia II      6.5 Pro    LED 3,7"   800         565
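The registry value obfuscation mentioned above (shifting character values) might look like the following sketch. The shift amount is an arbitrary assumption (the paper does not give it), and the sketch is in Java only for consistency with the other examples in this volume; the real application is .NET-based. Note that such shifting is obfuscation rather than real encryption.

/** Simple reversible obfuscation by shifting character values, as used for registry entries. */
final class RegistryCrypt {
    private static final int SHIFT = 3; // assumed shift; the actual value is not given in the paper

    static String encrypt(String plain) {
        StringBuilder sb = new StringBuilder(plain.length());
        for (char c : plain.toCharArray()) {
            sb.append((char) (c + SHIFT));
        }
        return sb.toString();
    }

    static String decrypt(String crypted) {
        StringBuilder sb = new StringBuilder(crypted.length());
        for (char c : crypted.toCharArray()) {
            sb.append((char) (c - SHIFT));
        }
        return sb.toString();
    }
}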
3 Visualization To make an ECG visualisation the measured data are needed at the beginning. The measurement is made on bipolar ECG corbel and 12 channels BlueECG. The amount0020needed to transfer from source device through a Bluetooth is in Table 2. You can compare the increased data transfer speed in case of 12 channels ECG to 1 500 bytes per second. These data amount is very small; on the other hand the data are going as packets, so the processing is needed before the real data can be accessed. Table 2. BT ECG Device -> Mobile Device measurement ECG device 3 Channels 12 Channels
Packet Size [Bytes] 100 300
Speed [Packets/s] Transfer Speed [kB/s] 3x 0,3 5x 1,5
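The transfer speeds in Table 2 follow directly from the packet sizes and rates:
3-channel ECG: 100 B/packet x 3 packets/s = 300 B/s ≈ 0,3 kB/s
12-channel ECG: 300 B/packet x 5 packets/s = 1 500 B/s = 1,5 kB/s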
Table 3. BT ECG device measurement

Measurement chain                                   Platform               Problem           Real Time
BT – Mobile device – Server – Visualization         .NET Framework         Memory overflow   Impossible
BT – Mobile device – Server – Visualization         C++                    Memory overflow   Impossible
BT – Mobile device – Server – DB – Visualization    SQL Server procedure   –                 Soft RT (2 sec deadline)
BT – MCU – Mobile device – Server – Visualization   MCU HCS08              –                 Hard RT
This process (called "parsing") takes an unacceptable time on a mobile device when processing the data in Real Time [Table 3]. The same problem grows on a desktop PC, where C# or C++ is used. In both cases a Memory Overflow is reached. The only possible way we found is the use of an SQL procedure executed on the SQL server. When the data (packets) are stored in a table, the procedure is called to execute and provide the RAW data. In such a case the data are ready for the user-consumer application within a 2 second deadline, so the Soft Real Time mode can be used [Fig. 2]. The RAW data table contains the full-size packets received from an ECG device. Only the packets with measured data are stored to the database; those packets must contain, in the packet number part, bytes with the value 0x0724 [Table 4]. The table with parsed data [Table 5] contains decimal values. Column „I" contains data from the bipolar ECG; columns „I" and „II" contain data from the 6-channel ECG. The 12-channel ECG fills, after parsing, the columns „I II V1 V2 V3 V4 V5 V6".
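A filter that keeps only the packets carrying measured samples (those with the 0x0724 marker) before they reach the RAW data table could look like the sketch below. The offset of the packet-number field and the persistence call are assumptions, since the exact packet layout is not given here, and Java is used only for consistency with the other examples in this volume.

/** Sketch: keep only ECG packets whose packet-number field carries the 0x0724 data marker. */
final class PacketFilter {
    // Assumed offset of the two-byte packet-number field; the real layout is device-specific.
    private static final int MARKER_OFFSET = 2;
    private static final int DATA_MARKER = 0x0724;

    static boolean carriesMeasuredData(byte[] packet) {
        if (packet.length < MARKER_OFFSET + 2) {
            return false;
        }
        int marker = ((packet[MARKER_OFFSET] & 0xFF) << 8) | (packet[MARKER_OFFSET + 1] & 0xFF);
        return marker == DATA_MARKER;
    }

    static void store(byte[] packet, RawDataTable table) {
        if (carriesMeasuredData(packet)) {
            table.insert(packet); // only data packets reach the RAW data table
        }
    }

    /** Minimal persistence abstraction assumed for this sketch. */
    interface RawDataTable {
        void insert(byte[] packet);
    }
}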
Table 4. RAW data table
Table 5. Data table of Parsed data
Fig. 2. Measurement chain: ECG Bluetooth device – mobile device – server - Visualization
Fig. 3. Measurement chain detail for FPGA solution inside the USB BT dongle
To get real ECG data immediately after the measurement, the following approach can be used. We can use a special microcontroller (MCU) embedded in a USB unit. This MCU unit has full speed (12 Mbit/s) USB access, and BT is connected through a serial port [Fig. 3]. The MCU unit processes all the needed parsing operations to provide a real ECG record to the database or directly to the visualizing application. In the case of the WPF application the Hard Real Time mode was reached. An example of a real ECG record is shown in [Fig. 4]. In this case (the classical Windows Forms application) only the Soft Real Time mode was reached, even when a special MCU unit was used for preprocessing
Fig. 4. Real Time visualization of bipolar ECG on desktop device in classical Windows Forms application
of data. In the next subsection the use of a WPF application is described as the only way in which Hard Real Time was reached on a Windows desktop PC (the Windows RTX extension was used).

3.1 ECG Visualization in WPF Application

WPF (Windows Presentation Foundation) provides up-to-date possibilities for visualizing ECG data on a desktop PC. We created a WPF application to provide a full scale of graphic features to the user. WPF technology runs directly on the GPU (Graphics Processing Unit) found on modern graphics cards. This fact is a key parameter for the speed of data presentation on screen; the CPU has more time to compute other tasks (e.g., ECG data analysis by neural networks). WPF has more design possibilities compared to classical Windows Forms, including 3D animation, pattern changes of arbitrary elements, etc. The WPF application allows viewing the ECG characteristic of a measured patient in Real Time, selecting a patient from the database, and viewing historical graphs. The figure [Fig. 5] shows an example of a bipolar ECG characteristic in the WPF application.
Fig. 5. Real Time visualization of bipolar ECG on desktop device in WPF application
3.2 Battery Consumption Tests

During the real tests, battery consumption tests were executed. First, a set of two monocell batteries with a nominal voltage of 2.5 V was tested, without a satisfactory time of usage: they provided only 2 hours of operation time. In the second case, a Lithium-Polymer cell with a nominal voltage of 3.7 V was used. In this case additional circuitry is needed to use a USB port for recharging the battery in
the device. Figure [Fig. 6] presents the battery test screen of the 12-channel ECG. The figure shows a voltage of 3 V (discharged battery), where the current is represented by the light trace on the oscilloscope screen and its average value is approximately 106 mA. Figure [Fig. 7] shows the same at a normally charged battery voltage level, where the average current drops to 81 mA. With Li-Pol battery usage, the operation time of the 12-channel ECG is about 10 hours.
Fig. 6. Battery test screens of 12 channels ECG. Discharged battery
Fig. 7. Battery test screens of 12 channels ECG. Charged battery
Fig. 8. Infrared camera image from the cryogenic chamber test
Fig. 9. Real image from Cryogenic chamber test
4 Conclusions

The measuring devices (bipolar Corbelt ECG and 12-channel BlueECG) were tested in extreme conditions in a cryogenic room in the spa Teplice nad Becvou (Czech Republic) (-136°C) [Fig. 8], [Fig. 9]. All developed platforms were tested during these extreme tests, with high credibility of the measured data for physicians. The
obtained experimental data will be used by physicians to make a set of recommendations for cardiac patients who are treated in this cryogenic chamber; for such patients, the time of recovery can be shortened by tens of percent. Real Time measurement and visualization was achieved in the case of WPF usage (Section 3.1). A necessary condition for comfortable measurement is the operation time. The executed battery consumption tests (Section 3.2) suggest the use of a Li-Pol battery with a nominal voltage of 3.7 V; in this case the operation time reaches a sufficient 10 hours. As a final improvement in the future, the application could include a special algorithm to recognize symptoms in the QRS curve and make the physicians' job much easier. Acknowledgment. This work was supported by the Ministry of Education of the Czech Republic under Project 1M0567.
References 1. Krejcar, O., Cernohorsky, J.: Database Prebuffering as a Way to Create a Mobile Control and Information System with Better Response Time. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part I. LNCS, vol. 5101, pp. 489–498. Springer, Heidelberg (2008) 2. Krejcar, O.: PDPT Framework - Building Information System with Wireless Connected Mobile Devices. In: CINCO 2006, 3rd International Conference on Informatics in Control, Automation and Robotics, Setubal, Portugal, Aug 01-05, 2006, pp. 162–167 (2006) 3. Krejcar, O., Cernohorsky, J.: New Possibilities of Intelligent Crisis Management by Large Multimedia Artifacts Prebuffering. In: I.T. Revolutions 2008, Venice, Italy, December 17-19, 2008. LNICST, vol. 11, pp. 44–59. Springer, Heidelberg (2009) 4. Janckulik, D., Krejcar, O., Martinovic, J.: Personal Telemetric System – Guardian. In: Biodevices 2008, Insticc Setubal, Funchal, Portugal, pp. 170–173 (2008) 5. Krejcar, O., Cernohorsky, J., Janckulik, D.: Portable devices in Architecture of Personal Biotelemetric Systems. In: 4th WSEAS International Conference on Cellular and Molecular Biology, Biophysics and Bioengineering, BIO 2008, Puerto De La Cruz, Canary Islands, Spain, December 15-17, 2008, pp. 60–64 (2008) 6. Krejcar, O., Cernohorsky, J., Czekaj, P.: Secured Access to RT Database in Biotelemetric System. In: 4th WSEAS Int. Conference on Cellular and Molecular Biology, Biophysics and Bioengineering, BIO 2008, Puerto De La Cruz, Canary Islands, Spain, December 15-17, 2008, pp. 70–73 (2008) 7. Krejcar, O., Cernohorsky, J., Janckulik, D.: Database Architecture for real-time accessing of Personal Biotelemetric Systems. In: 4th WSEAS Int. Conference on Cellular and Molecular Biology, Biophysics and Bioengineering, BIO 2008, Puerto De La Cruz, Canary Islands, Spain, December 15-17, 2008, pp. 85–89 (2008) 8. Penhaker, M., Cerny, M., Martinak, L., Spisak, J., Valkova, A.: HomeCare - Smart embedded biotelemetry system. In: World Congress on Medical Physics and Biomedical Engineering, Seoul, South Korea, August 27-September 2001, vol. 14(1-6), pp. 711–714 (2006)
9. Cerny, M., Penhaker, M.: Biotelemetry. In: 14th Nordic-Baltic Conference an Biomedical Engineering and Medical Physics, IFMBE Proceedings, Riga, Latvia, June 16-20, vol. 20, pp. 405–408 (2008) 10. Krejcar, O., Janckulik, D., Motalova, L., Kufel, J.: Mobile Monitoring Stations and Web Visualization of Biotelemetric System - Guardian II. In: Mehmood, R., et al. (eds.) EuropeComm 2009. LNICST, vol. 16, pp. 284–291. Springer, Heidelberg (2009) 11. Krejcar, O., Janckulik, D., Motalova, L.: Complex Biomedical System with Mobile Clients. In: Dössel, O., Schlegel, W.C. (eds.) IFMBE Proceedings The World Congress on Medical Physics and Biomedical Engineering 2009, WC 2009, Munich, Germany, September 07-12, 2009, vol. 25(5), Springer, Heidelberg (2009) 12. Krejcar, O., Janckulik, D., Motalova, L., Frischer, R.: Architecture of Mobile and Desktop Stations for Noninvasive Continuous Blood Pressure Measurement. In: Dössel, O., Schlegel, W.C. (eds.) The World Congress on Medical Physics and Biomedical Engineering 2009, WC 2009, Munich, Germany, September 07-12, 2009, vol. 25/5, Springer, Heidelberg (2009) 13. Cerny, M., Penhaker, M.: The HomeCare and circadian rhythm. In: conference proceedings 5th International Conference on Information Technology and Applications in Biomedicine (ITAB) in conjunction with the 2nd International Symposium and Summer School on Biomedical and Health Engineering (IS3BHE), Shenzhen, May 30-31, vol. 1(2), pp. 110–113 (2008) 14. Vasickova, Z., Augustynek, M.: New method for detection of epileptic seizure. Journal of Vibroengineering 11(2), 279–282 (2009) 15. Cerny, M., Martinak, L., Penhaker, M., et al.: Design and Implementation of Textile Sensors for Biotelemetry Applications. In: konference proceedings 14th Nordic-Baltic Conference an Biomedical Engineering and Medical Physics, Riga, LATVIA, June 16-20, vol. 20, pp. 194–197 (2008) 16. Idzkowski, A., Walendziuk, W.: Evaluation of the static posturograph platform accuracy. Journal of Vibroengineering 11(3), 511–516 (2009) 17. Cerny, M.: Movement Monitoring in the HomeCare System. In: Schleger, D. (ed.) IFMBE proceddings, (25), Springer, Berlin (2009)
A Search Engine Log Analysis of Music-Related Web Searching Sally Jo Cunningham and David Bainbridge Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand {sallyjo,davidb}@cs.waikato.ac.nz
Abstract. We explore music search behavior by identifying music-related queries in a large (over 20 million queries) search engine log, gathered over three months in 2006. Music searching is a significant information behavior: approximately 15% of users conduct at least one music search in the time period studied, and approximately 1.35% of search activities are connected to music. We describe the structural characteristics of music searches—query length and frequency for result selection—and also summarize the most frequently occurring search terms and destinations. The findings are compared to earlier studies of general search engine behavior and to qualitative studies of natural language music information needs statements. The results suggest the need for specialized music search facilities and provide implications for the design of a music information retrieval system. Keywords: query analysis, music searching, search engine logs.
1 Introduction Web log analysis is useful for developing a broad understanding of search behavior. Search engine logging provides researchers with unparalleled amounts of data on the daily information seeking activities across a broad spectrum of society. Where previously naturalistic studies of searching were restricted to direct observations of a handful of people, search engine log analysis supports researchers in examining the behavior of real users who are engaging in satisfying authentic information needs. The goal is often to translate insights gained into user preferences, strategies, and difficulties, and so forth into suggestions for improving the search experience. As discussed in Section 2, the majority of log analysis has to date examined general Web search behavior, or broad categories such as multimedia searching ([1], [2]). This paper focuses on searches for music information (audio, lyrics, music scores, etc.), with an analysis of three months of AOL search engine activity [3]. In Section 3 we describe our empirical method for identifying the music-related searches in the AOL log, and discuss the limitations of that method and its likely impact on the analysis. Section 4 examines query length,
query terms, and click-through behavior in the music searches, and Section 5 presents our conclusions.
2 Previous Research

The body of research into general Web search behavior is large, and growing (e.g., [4], [5], [6], [7]). These papers typically examine the search terms, query lengths, session characteristics (session length, number of queries per session, etc.), and query modification strategies. Results from these studies are remarkably similar over time: queries posed to search engines are typically brief (fewer than three terms, on average), sessions are short (both in the number of search interactions and average time length of session), and the majority of results selected appear on the first screen presented to the user. More fine-grained analysis of query modification strategies (e.g., [4], [5]) presents possibilities for improving relevance feedback mechanisms. A smaller number of studies address more specialized Web search needs: for example, searches for video material [8], multimedia [2], or sex-related information [7]. These studies aim to understand these types of information needs and their associated search strategies so as to suggest improved search functionalities tailored to these specific information needs—or additionally, in the case of sex-related information, to suggest ways to allow users to avoid that type of material. But how are these subject or format specific searches identified? Halvey and Keane [1] study video search through an examination of linking and search on a single website dedicated to video (YouTube); Tjondronegoro et al. [2] draw their image, audio, and video searches from the appropriate tabbed interface of Dogpile; and Spink et al. [7] manually identify sex-related queries by examining search strings. These methods are less well suited to music queries. Many of the most significantly used music sites are commercial and so would skew results by the content that they provide, or are pirate or peer-to-peer sites and so their logs are not available (or there are no logs created); music search facilities such as that supported by Dogpile emphasize audio file search over other significant music information needs ([9], [10], [11]); and the requirement for manual examination of logs dramatically reduces the amount of data available for analysis. We suggest in Section 3 a simple, heuristic method for automatically identifying music-related queries in a general Web search engine. The primary benefit of search engine query log analysis is that logs provide unprecedented amounts of data covering the activities of huge numbers of searchers and searches (for [4], 1 billion search actions over 285 million search sessions), over significant swaths of time (three months in [3], 6 weeks in [5]). As search engine usage becomes embedded in everyday life, it becomes possible to gather significantly sized logs (over 1.2 million queries) from a single day's usage of secondary functionality of a relatively little known search engine [2]. Unfortunately, this data provides no insights into the motivations and information goals of the users, and only inferential evidence as to whether the search was successful [5]. For the case of music information seeking, we can flesh out some of the log analysis results by referencing small scale, qualitative studies of natural language
music information requests (e.g., of questions posted to online forums as reported in [9] and [11], or to ask-an-expert websites [10]). The qualitative studies focus on the more difficult to answer music queries, by the nature of the data—why pose a question to an online expert if it could be easily answered by a straightforward Web search? The further value of analyzing search engine logs is that the logs are dominated by those more commonplace, easily addressed information needs that are difficult to elicit through the qualitative studies.
3 Data Collection We analyze a publicly released set of AOL search logs [3]. The logs consist of over 21 million Web queries from over 650,000 distinct users, over a period of three months (March 1 – May 31 2006). The log file contains over 36 million lines of data, each line representing either a new query and possibly a selection of an URL from the search results list (approximately 21 million lines), or a return to a search results list from an earlier query (approximately 15 million lines). Each line in the log consists of: a numeric, anonymized user identification number (ID); the query string submitted by the user, case shifted and with punctuation stripped (Query); the time and date at which the query was submitted for (Timestamp); a URL, if the user clicked on a link in the search results list (ClickURL); and the rank of the ClickURL in the search results list (Rank). Identifying a music-related search is neither straightforward nor exact, given that the AOL searcher is not able to explicitly specify that an audio document is required in the same way that, for example, a user can choose an image search or news search tab in Google. In this paper, we take the approach of labeling as music-related those queries in which either terms in the query string or elements of the destination URL indicate that the user is seeking music, a song, lyrics, a music representation that will support performance, or a common music audio file format or type—that is, either the query string or the ClickURL contain one or more of the terms music, song/songs, lyric/lyrics, a performance format term (sheet music, music score, or tablature), or an audio format term (mp3, midi, or a variant of the term ringtone). We derive these terms from earlier qualitative research into natural language music queries ([9], [11], [10]), where these terms consistently appear in statements of music information need. We recognize that this approach is both too liberal and too stringent in identifying music-related searches. A search can include the term music, for example, but not be motivated by a desire for information related to a song, genre, or artist (e.g., the queries ‘bachelors of music vs. a bachelors of arts in music’, ‘what whitworth music students say about whitworth’; these irrelevant queries are included in our analysis). Conversely, a query that does not contain any of the above terms might still be motivated by a desire for a music document—or it might not. Consider, for example, the AOL log query ‘truly deeply madly’; is this person looking for information related to the made-for-TV film Truly, Madly, Deeply, or for information related to the German eurodance group Cascada’s song by the same title? In this
Table 1. Summary of AOL log analyzed in this paper
                            | AOL log (complete) | AOL log (music related) | % of complete log
Lines of data               | 36,389,567         | 492,463                 | 1.35%
Instances of new queries    | 21,011,340         | 354,498                 | 1.69%
Click-through events        | 19,442,629         | 203,537                 | 1.05%
Unique (normalized) queries | 10,154,742         | 228,072                 | 2.24%
Unique user IDs             | 657,426            | 100,912                 | 15.34%
Table 2. Summary of music-related categories of the AOL search log
                     | Music   | Lyrics  | Song   | Mp3    | Midi  | Ringtone | Performance | All music-related
New queries          | 142,940 | 165,819 | 58,643 | 13,915 | 3,645 | 9,877    | 2,774       | 354,498
Click-through events | 122,380 | 136,728 | 45,081 | 12,399 | 2,956 | 6,633    | 2,547       | 288,926
case we can identify the query as being music-related with some confidence because the query’s ClickURL is a lyrics reference site (http://www.lyrics007.com). For less obviously named destination sites, we lose those queries from our analysis. Similarly, a query may use different/less formal terminology, for example, asking for ‘the words’ to a song rather than the lyrics (e.g., ‘words to ring of fire’); we do not attempt to capture this type of query by broadening our criteria for inclusion in our analysis, as including terms like ‘words’ would pull in too many false positives (non-music queries). Given that log analysis is intended to create a broad-stroke picture of search behavior based on large amounts of data, we argue that the inclusion of a small number of outliers and the exclusion of some variant queries is unlikely to significantly affect the analysis. A more significant issue is the question of whether the above technique for filtering out ‘music-related’ search activities is biased in its sample towards any particular kinds of queries. As we discuss in Section 4.3, there is some evidence that we are capturing a smaller proportion of the highly specific queries (that is, queries directed at finding a specific song by a named artist). These are likely the sorts of music information requests that are also being satisfied in peer-to-peer music services rather than through general Web searches—and so the analysis presented in Section 4 should be viewed as illustrative of general Web searching behavior rather than strictly representative of all music behavior, as satisfied through specialist music resources. An overview of the AOL search log and the music-related portions of that log are presented in Table 1. Music searching is a significant information behavior; while music related searches are a minority of all AOL search activities during the log period (1.35%), approximately one in seven AOL searchers (15.34%) engage in music searching.
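A minimal sketch (not the authors' code) of the labeling heuristic described in this section might look as follows; the term list mirrors the categories above, while the handling of ringtone variants and URL matching is deliberately simplified:

```python
# Sketch: a log row is labelled music-related if the query string or the clicked
# URL contains one of the music category terms described in Section 3.
import re

MUSIC_TERMS = [
    "music", "song", "songs", "lyric", "lyrics",
    "sheet music", "music score", "tablature",   # performance format terms
    "mp3", "midi", "ringtone",                   # audio format terms
]
PATTERN = re.compile("|".join(re.escape(t) for t in MUSIC_TERMS))

def is_music_related(query: str, click_url: str = "") -> bool:
    text = f"{query} {click_url}".lower()
    return PATTERN.search(text) is not None

# e.g. is_music_related("words to ring of fire") -> False (excluded by design)
#      is_music_related("truly deeply madly", "www.lyrics007.com") -> True
```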
4 Analysis

In this section we examine query length and search terms used for the 354,498 music-related queries filtered from the AOL log (Sections 4.1 and 4.2), and perform a preliminary analysis of search results selection over the 203,537 click-through events (Section 4.3). A summary of queries and click-through events by category is presented in Table 2.

Table 3. Summary of query length over music-related categories

query length | All music related | %      | music  | %      | song  | %      | lyric  | %
1            | 101687            | 20.67% | 53788  | 26.18% | 7183  | 9.11%  | 37236  | 17.16%
2            | 92665             | 18.84% | 49462  | 24.07% | 13948 | 17.69% | 28943  | 13.34%
3            | 96287             | 19.57% | 42833  | 20.85% | 15384 | 19.51% | 39057  | 18.00%
4            | 67642             | 13.75% | 25291  | 12.31% | 12797 | 16.23% | 31895  | 14.70%
5            | 47481             | 9.65%  | 14733  | 7.17%  | 9305  | 11.80% | 25601  | 11.80%
6            | 31892             | 6.48%  | 8235   | 4.01%  | 7450  | 9.45%  | 18345  | 8.46%
7            | 20024             | 4.07%  | 4647   | 2.26%  | 4344  | 5.51%  | 12693  | 5.85%
8            | 12436             | 2.53%  | 2822   | 1.37%  | 2797  | 3.55%  | 7975   | 3.68%
9            | 7388              | 1.50%  | 1446   | 0.70%  | 1957  | 2.48%  | 4925   | 2.27%
10           | 4475              | 0.91%  | 810    | 0.39%  | 1172  | 1.49%  | 3033   | 1.40%
>10          | 9916              | 2.02%  | 1402   | 0.68%  | 2522  | 3.20%  | 7261   | 3.35%
Total        | 491893            |        | 205469 |        | 78859 |        | 216964 |

query length | mp3   | %      | midi | %      | ringtone | %      | performance | %
1            | 2613  | 14.23% | 928  | 18.33% | 9152     | 34.04% | 84          | 2.00%
2            | 3317  | 18.07% | 1134 | 22.40% | 3808     | 14.16% | 468         | 11.15%
3            | 4059  | 22.11% | 1147 | 22.65% | 1363     | 5.07%  | 891         | 21.23%
4            | 2843  | 15.49% | 788  | 15.56% | 1145     | 4.26%  | 857         | 20.42%
5            | 2221  | 12.10% | 499  | 9.86%  | 798      | 2.97%  | 681         | 16.23%
6            | 1358  | 7.40%  | 252  | 4.98%  | 492      | 1.83%  | 470         | 11.20%
7            | 757   | 4.12%  | 133  | 2.63%  | 484      | 1.80%  | 275         | 6.55%
8            | 551   | 3.00%  | 96   | 1.90%  | 452      | 1.68%  | 211         | 5.03%
9            | 241   | 1.31%  | 38   | 0.75%  | 429      | 1.60%  | 106         | 2.53%
10           | 163   | 0.89%  | 22   | 0.43%  | 363      | 1.35%  | 76          | 1.81%
>10          | 236   | 1.29%  | 26   | 0.51%  | 4107     | 15.27% | 77          | 1.84%
Total        | 18359 |        | 5063 |        | 26889    |        | 4196        |
4.1 Terms Per Query Web searches are characteristically short (e.g., a reported average number of terms per query for Vivisimo was 3.14 [12], for Excite 2.21 terms [5], and for AltaVista 2.35 [4]). Analysis of searches over the complete AOL log illustrates this same behavior: the average number of terms per query for the log as whole is 2.83. In contrast, the average number of terms per query for music-related searches is 5—significantly longer. Breaking this down for the three largest categories: searches containing music in the query string averaged 4.22 terms per query, those containing lyrics averaged 5.50, and those containing song averaged 5.84. A preliminary analysis of multimedia searching on Dogpile [2] uncovered similar, though not as extreme, behavior; the Dogpile audio searches averaged 3.1 terms. Tjondronegoro et al. [2] speculated that ‘audio search queries usually contain
Table 4. Listing of 80 most frequently occurring terms, excluding music category terms Term the to free for you myspace and of i by download(s) my a in me musical video/videos love on it
Freq 39556 32681 21721 18768 18461 17554 17460 17414 17130 15483 14190 13698 12696 11749 10477 8678 8570 8093 7708 6470
Term your school what high is with i'm/im from all country be aol don’t/don't that theme new codes do space like
Freq 5495 5048 4784 4606 4548 4443 4211 4133 3952 3904 3809 3787 3744 3630 3393 3375 3300 2926 2906 2850
Term day at down listen girl how go so http now we get this when one up top guitar out know
Freq 2814 2779 2749 2711 2708 2680 2650 2635 2610 2447 2417 2416 2401 2396 2340 2329 2308 2303 2229 2203
Term gospel no about rock life christian time are want just soundtrack if can baby dance back will not bad black
Freq 2187 2164 2157 2100 2067 2061 2032 1962 1922 1899 1879 1827 1812 1811 1757 1757 1732 1696 1637 1626
terms from songs title [sic], which are generally longer.’ Manual inspection of the music-related queries from the AOL logs provides evidence for that hypothesis; they include terms apparently drawn from song titles, CD and compilation titles, and artist/group names. Table 3 shows the range of query lengths for the music-related queries. Note that a single query may appear in more than one category (e.g., a request for the lyrics for a song), so the sum of subcategory queries exceeds the total number of music-related queries. Queries of length one generally contain only the term used to filter the queries from the complete AOL logs (e.g., lyrics, ringtone, mp3, etc.). One-term queries are likely an attempt to locate a general resource such as a lyrics database or source of mp3 files. The distribution profile for the song and lyrics categories differ markedly from that of music, which is skewed far more to the 1 – 2 term queries. Manual inspection of the logs suggests that the song and lyrics queries are more frequently known-item searches for a specific song, while a one to three term query for music appears to indicate a desire for a general music resource, genre/style (e.g., irish music) or format (e.g., music videos, free music). A more extensive qualitative examination of the query strings is necessary at this point, to make a more definitive statement about apparent search motivations as embodied in shorter/longer queries. 4.2 Search Terms Table 4 presents the 80 most frequently occurring terms in the query statements, excluding the terms used to filter the music-related searches (e.g., music, song, lyrics, etc.). Striking differences exist between this list and the most frequent term
Table 5. Thirty most frequently visited sites in music-related searches

Rank | Site                | Visit freq | % of total visits
1    | azlyrics.com        | 21223      | 10.44%
2    | music.myspace.com   | 18971      | 9.33%
3    | sing365.com         | 11116      | 5.47%
4    | lyrics007.com       | 7966       | 3.92%
5    | lyricsandsongs.com  | 6110       | 3.01%
6    | lyricsdownload.com  | 5809       | 2.86%
7    | lyricsfreak.com     | 5797       | 2.85%
8    | stlyrics.com        | 5316       | 2.61%
9    | music.aol.com       | 3903       | 1.92%
10   | lyricsondemand.com  | 3684       | 1.81%
11   | music.yahoo.com     | 3289       | 1.62%
12   | seeklyrics.com      | 3109       | 1.53%
13   | cowboylyrics.com    | 3020       | 1.49%
14   | plyrics.com         | 2710       | 1.33%
15   | lyricstop.com       | 2674       | 1.32%
16   | letssingit.com      | 2651       | 1.30%
17   | amazon.com          | 2531       | 1.24%
18   | mp3.com             | 2291       | 1.13%
19   | lyrics.com          | 2279       | 1.12%
20   | metrolyrics.com     | 2111       | 1.04%
21   | allthelyrics.com    | 1833       | 0.90%
22   | lyricsmania.com     | 1798       | 0.88%
23   | contactmusic.com    | 1703       | 0.84%
24   | anysonglyrics.com   | 1597       | 0.79%
25   | lyricsdepot.com     | 1433       | 0.70%
26   | music.download.com  | 1402       | 0.69%
27   | lyrics.astraweb.com | 1307       | 0.64%
28   | dapslyrics.com      | 1245       | 0.61%
29   | musicsonglyrics.com | 1204       | 0.59%
30   | music.msn.com       | 1137       | 0.56%
lists from earlier general Web log analyses (e.g., [5], [12]); these latter lists contain fewer stopwords (e.g., for, so, on) and a higher proportion of nouns. The presence of so many terms that are not content bearing by themselves strongly suggests the presence of searches over song titles or lyrics. Looking more closely at Table 4, we find evidence of a relatively common searching habit (e.g., [12]): accessing a website by entering its URL into a search engine rather than directly entering it in the browser (for example, myspace, aol, http). Searches on download and free are strongly indicative of a desire to obtain audio rather than other formats of music information (e.g., bibliographic details, lyrics). Commonly requested styles or genres include country, gospel, rock, christian, and dance. Matching between individual and 'official' genre categories is known to be a difficult problem, but perhaps can be supported for very broad genres such as these. The frequency of the term video provides corroboration for an observation from an earlier, qualitative study of video searching [11]: that music searches can sometimes be satisfied by other multimedia (for example, a music audio search by the appropriate music video). 4.3 Destinations The complete AOL log contains 21,011,340 queries and 19,442,629 click-through events; for the log as a whole, there are over 92 clicks for every 100 queries. For music-related queries the ratio is much lower: just over 54 selections for every 100 queries (227,801 click-throughs for 354,133 queries). Earlier research indicates Web search engines provided limited or no specialized support for multimedia searching (including music search) during the period under study [2], and that keyword searching is ill-suited to satisfying many music information needs [13]. This exceptionally high failure rate for the identified music-related queries underscores the potential of specialized music searching functionalities and resources. Table 5 presents the thirty most frequently accessed websites for all music-related interactions. These thirty sites represent nearly two-thirds of click-through
Table 6. Fifteen most frequently accessed sites for lyrics, music, and song searches

Rank | lyrics             | Freq  | music                  | Freq  | song                | Freq
1    | azlyrics.com       | 21533 | music.myspace.com      | 18971 | lyricsandsongs.com  | 6110
2    | sing365.com        | 11203 | music.aol.com          | 3963  | anysonglyrics.com   | 1597
3    | lyrics007.com      | 8169  | music.yahoo.com        | 3366  | musicsonglyrics.com | 1204
4    | lyricsdownload.com | 6181  | contactmusic.com       | 1710  | songmeanings.net    | 1057
5    | lyricsfreak.com    | 5945  | music.download.com     | 1452  | amazon.com          | 905
6    | stlyrics.com       | 5611  | musicsonglyrics.com    | 1411  | songlyrics.com      | 879
7    | lyricsandsongs.com | 5276  | amazon.com             | 1367  | songwave.com        | 797
8    | lyricsondemand.com | 3807  | music.msn.com          | 1205  | stlyrics.com        | 766
9    | seeklyrics.com     | 3221  | ok.pixiesmusic.com     | 1001  | lyrics.com          | 601
10   | cowboylyrics.com   | 3217  | countrymusic.about.com | 906   | azlyrics.com        | 591
11   | lyricstop.com      | 2818  | artistdirect.com       | 807   | flysong.com         | 489
12   | plyrics.com        | 2736  | sheetmusicplus.com     | 739   | songfacts.com       | 488
13   | letssingit.com     | 2647  | musiciansfriend.com    | 720   | lyricsdownload.com  | 357
14   | lyrics.com         | 2296  | solmusical.com         | 693   | lyrics007.com       | 353
15   | metrolyrics.com    | 2238  | hidalgomusic.com       | 660   | lyrics.astraweb.com | 331
activity (64.54%). Twenty-two of the thirty identify themselves primarily as lyrics databases (though many contain other music information as well, and all contain bibliographic details about songs). Most of the lyrics sites are intended to be comprehensive in their cover (amusingly, many claim to be the largest on the Web). Only three of the lyrics sites focus on specific genres or styles of music: stlyrics.com on sound tracks, cowboylyrics.com on country, and plyrics.com on punk and associated genres. Lyrics searches might not be motivated by a desire for 'the words' of a song, per se; an individual who does not know the title or artist of a song may search on remembered snatches of lyrics to identify the song (e.g., [9], [11], [13]). Table 6 breaks down the 15 most frequently accessed websites for interactions containing the terms music, song, and lyrics in either the search term or the click-through URL. It is striking how few of the top music and song sites are specialist sites: that is, focusing by artist (ok.pixiesmusic.com, hidalgomusic.com), genre (cowboylyrics.com, countrymusic.about.com, stlyrics.com, plyrics.com), or format (sheetmusicplus.com). Musiciansfriend.com is the only site in the top 15 that does not actually serve music or lyrics (it is a commercial site for new and used music equipment), while solmusical.com consolidates data on music CDs for sale on auction and other commercial sites. One indication of the comprehensiveness of the technique used to filter out music-related queries (Section 3) is to check whether the top music destinations occur as click-throughs in queries in the complete logs that are not selected as music queries. The majority of destinations in Table 5 contain one of the music category terms in the URL name, and so associated queries are included in the music log subset. Of the remainder, Amazon.com is a generalist shopping site for books, music, and other items—and so it would be expected that many, perhaps a majority, of queries including Amazon as a destination would not be music related. The sites sing365.com and letssingit.com are primarily lyrics websites. Over three-quarters of searches including click-throughs to those sites also contain the word 'lyrics' or 'song' in the query string; the remaining searches contain a song title
(‘walking on broken glass’) and possibly an artist’s name. Inclusion of these additional lyrics queries would give further weight to the significance of lyrics searching as a music behavior, and the value of specialist support for lyrics searching. Looking at Table 6, the artistdirect.com site includes downloads, lyrics, music images, and music news and discussion; an examination of the complete AOL logs indicates that approximately 2/3 of clickthroughs to artistdirect.com do not include one of the music category terms, and so are not included in this log analysis. By inspection, a majority of these omitted searches terminating in artistdirect.com are for specific songs or compilations (e.g., ‘twisted sister come out and play’, ‘emmylou harris evangeline’) or include a named musical artist or group (e.g., ‘the beatles’, ‘ray stevens’). These ‘known item’ searches (that is, searches in which the user is seeking a specific item or set of items) are likely to be underrepresented in this present analysis, given that peer-to-peer and other specialist music services are better suited to answering these information needs than a general Web search engine.
5 Conclusion It is difficult to place crisp boundaries on music-related searches within search engine logs. Search engine log files provide a flood of data, but it is difficult to automate the selection of a semantically distinct section of those logs for analysis (Section 3). Log analysis is useful in corroborating observations drawn from small scale qualitative studies (for example, that a music information need might be satisfiable by video as well as audio; Section 4.3), or conversely, qualitative studies can provide the insights to explain behavior exhibited in the logs (for example, that lyrics searches may be aimed at identifying a song whose title and artist are not known to the user, rather than at obtaining a full set of ‘the words’; Section 4.2). We see clear evidence that music-related searches are more complex to specify than average (Section 4.1), and that music-related searches are significantly less likely to result in selection of a search result (Section 4.3). At the same time, significant proportions of searchers engage in music-related searches (Section 3)— music searching is clearly a major, though problematic, information seeking activity that demands greater attention from the research world. This study is one small step towards identifying patterns in music searching, to inform the design of musicspecific search functionality and search resources.
References 1. Halvey, M., Keane, M.T.: Analysis of online video search and sharing. In: Proceedings of Hypertext 2007, Manchester, UK, September 2007, pp. 217–226 (2007) 2. Tjondronegoro, D., Spink, A.: Multimedia Web searching on a meta-search engine. In: Twelfth Australasian Document Computing Symposium, Melbourne Zoo, Australia, 10 December (2007)
3. Pass, G., Chowdhury, A., Torgeson, C.: A Picture of Search. In: 1st International Conference on Scalable Information Systems, Hong Kong (June 2006) 4. Silverstein, C., Henzinger, M., Moricz, M.: Analysis of a very large Web search engine query log. SIGIR Forum 33(1), 6–12 (1999) 5. Jansen, B.J., Spink, A., Saracevic, T.: Real life, real users, and real needs: a study and analysis of user queries on the Web. Information Processing & Management 36, 207–227 (2000) 6. Beitzel, S.M., Jensen, E.C., Chowdhury, A., Grossman, D., Frieder, O.: Hourly analysis of a very large topically categorized Web query log. In: Proceedings SIGIR 2004, Sheffield (UK), July 2004, pp. 321–328 (2004) 7. Spink, A.: Web searching for sexual information: an exploratory study. Information Processing & Management 40, 113–123 (2004) 8. Cunningham, S.J., Nichols, D.M.: How people find videos. In: Proc. 8th ACM/IEEECS Joint Conference on Digital Libraries (JCDL), Pittsburgh, USA, June 2008, pp. 201–210 (2008) 9. Bainbridge, D., Cunningham, S.J., Downie, J.S.: How people describe their music information needs: a grounded theory analysis of music queries. In: Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2003), Baltimore, October 2003, pp. 221–222 (2003) 10. Lee, J., Downie, J.S., Cunningham, S.J.: Challenges in cross-cultural/multilingual music information seeking. In: Proceedings of ISMIR 2005 Sixth International Conference On Music Information Retrieval, London, September 2005, pp. 1–7 (2005) 11. Cunningham, S.J., Laing, S.: An analysis of lyrics questions on Yahoo! Answers: Implications for lyric / music retrieval systems. In: Proceedings of the 11th Australasian Document Computing Symposium (December 2009) 12. Koshman, S., Spink, A., Jansen, B.J.: Web searching on the Vivisimo search engine. Journal of the American Society for Information Science and Technology 57(14), 1875–1887 (2006) 13. Downie, J.S.: Music Information Retrieval. Annual Review of Information Science & Technology 37(1), 295–340 (2003)
Data Hiding Based on Compressed Dithering Images Cheonshik Kim Department of Computer Engineering, Sejong University, 98 Kunja-Dong, Kwangjin-Ku, Seoul 143-747, Korea [email protected]
Abstract. In this paper, we propose a data hiding method for halftone compressed images based on ordered dither BTC (block truncation coding) [7]. BTC [9] is a simple and efficient image compression technique; however, it yields images of unacceptable quality, and significant blocking effects are seen when the block size increases. Guo [2] improved the image quality of ODBTC using a new algorithm. A halftone image is very sensitive for data hiding, because it is a bitmap image; for this reason, there has been little research on data hiding in halftone images. The EMD [6] technique can also be used to embed digital data into compressed images, including BTC. Until now, EMD has never been applied to halftone images for data hiding. In this paper, we solve this problem, and experimental results indicate that the resulting image quality is better than that of Guo [2]. As a result, we present a new method of data hiding in a halftone image. Keywords: BTC, ODBTC, EMD, data hiding.
1
Introduction
Digital images are reproduced by a device or a number of devices. Most image reproduction devices, particularly printing devices, are restricted to a few colors, while a digital image mostly consists of millions of colors. A continuous-tone digital image is therefore transformed into a binary image consisting of 1's and 0's, i.e., a bitmap. This transformation from a continuous-tone image to a bitmap representation is called halftoning [1]. The halftone technique can be used for book or image compression. These materials should be protected from illegal users, so data hiding has been studied by many researchers in various kinds of schemes. In fact, data hiding cannot by itself guarantee the safety of messages: data hiding conceals the existence of secret messages, while cryptography protects their content. That is, the purpose of data hiding is only to conceal their existence. Many kinds of secret communication methods that can conceal messages have been proposed [2-3]. Until now, many researchers have investigated data hiding in grayscale images; halftone images have received less attention. Therefore, it is meaningful to hide
data in a halftone image. Halftone techniques can be divided into three categories: ordered dithering, error diffusion, and direct binary search. Delp and Mitchell proposed block truncation coding (BTC) in 1979 [9]. BTC has many advantages, such as simplicity, efficiency, and low computational complexity [3]. However, BTC gives lower image quality than ODBTC [2], which was introduced by Guo in 2008. Pan [3] employed statistical features of pixel block patterns to embed data and utilized HVS characteristics to reduce the introduced visual distortion. Shu-Fen Tu [4] showed that the binary image can be combined with the watermark to construct the ownership share with the aid of the XOR operation; this method does not really hide information in host images. Tseng [5] proposed an improved method for halftone image hiding, in which the hidden visual pattern can be revealed precisely; it is a kind of visual watermarking. In this paper, we use ODBTC [2] for data hiding in halftones. The problem of [2]-[5] is that they cannot hide enough data in the cover image. For this reason, we propose a novel method to hide enough data in the cover image.
2
Related Works
In this section, we explain the EMD, BTC, and ODBTC algorithms related to data hiding in a halftone compressed image. EMD is a novel method for data hiding, because it can hide more data than other methods. BTC (Block Truncation Coding) and ODBTC (Ordered Dither Block Truncation Coding) are compression methods used in publishing newspapers and books. 2.1
EMD (Exploiting Modification Direction)
EMD was proposed by Zhang and Wang [6] as a novel method for data hiding in an image by exploiting modification directions. In the EMD method, each secret digit in a (2n+1)-ary notational system is carried by n cover pixels, where n ≥ 2, and at most one pixel is increased or decreased by 1. A group of pixels is composed as (g1, g2, ..., gn). If the secret message is a binary stream, it can be segmented into pieces of L bits, and the decimal value of each secret piece is represented by K digits in a (2n+1)-ary notational system, where
L = \lfloor K \cdot \log_2 (2n+1) \rfloor    (1)
n pixels are used to carry one secret digit in the (2n+1)-ary notational system and, at most, only one pixel is increased or decreased by 1. According to a secret key, all cover pixels are permuted pseudo-randomly and divided into a series of pixel groups, each containing n pixels. A vector (g1, g2, ..., gn) in n-dimensional space is labeled with its f value, which is computed by Eq. (2) as a weighted sum modulo (2n+1):
f(g_1, g_2, \ldots, g_n) = \left[ \sum_{i=1}^{n} (g_i \cdot i) \right] \bmod (2n+1)    (2)
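For illustration, a small implementation of the extraction function in Eq. (2) and the standard EMD embedding rule (increase g_s when s ≤ n, otherwise decrease g_{2n+1-s}) can be sketched as follows; it reproduces the numerical example given later in Section 3.1:

```python
# Sketch of EMD extraction (Eq. 2) and embedding for a group of n cover values.
def emd_f(group, n):
    return sum(g * i for i, g in enumerate(group, start=1)) % (2 * n + 1)

def emd_embed(group, digit, n):
    group = list(group)
    f = emd_f(group, n)
    s = (digit - f) % (2 * n + 1)
    if s == 0:
        return group                      # digit already carried, no change
    if s <= n:
        group[s - 1] += 1                 # increase g_s by 1
    else:
        group[(2 * n + 1 - s) - 1] -= 1   # decrease g_{2n+1-s} by 1
    return group

stego = emd_embed([137, 139], 4, n=2)     # Example 1 of Section 3.1
assert stego == [136, 139] and emd_f(stego, 2) == 4
```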
2.2
BTC (Block Truncation Coding)
Delp and Mitchell [9] proposed block truncation coding, a lossy compression technique for gray-level images. The image is divided into blocks of M×N pixels and each block is processed separately. For each block, the mean value (μ) and the standard deviation (σ) are calculated, and the first two sample moments are preserved in the compression. Each original block is encoded as a two-tone block composed of '0' and '1': when a pixel value is smaller than the block mean, it is set to '0'; otherwise, it is set to '1'. For block decompression, we need the values a and b for each block of the image and the bitmap of the compressed image. Each block B is composed of '0' and '1'; the '0' bits of B are set to a and the '1' bits of B are set to b, where a and b are computed according to Eq. (3a) and (3b), q denotes the number of '1' bits, and m - q the number of '0' bits in B.
a = \mu - \sigma \cdot \sqrt{\frac{q}{m-q}}    (3a)

b = \mu + \sigma \cdot \sqrt{\frac{m-q}{q}}    (3b)
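A hedged sketch of BTC block encoding and decoding is given below; the square roots follow the standard Delp and Mitchell formulation (an assumption, since the radicals are not legible in the source), with m the number of pixels in the block and q the number of '1' bits:

```python
# Illustrative BTC block coder; not the paper's implementation.
import numpy as np

def btc_encode(block):
    mu, sigma = block.mean(), block.std()
    bitmap = (block >= mu).astype(np.uint8)     # '1' where pixel >= block mean
    m, q = block.size, int(bitmap.sum())
    if q in (0, m):                             # flat block: a = b = mean
        return bitmap, mu, mu
    a = mu - sigma * np.sqrt(q / (m - q))       # Eq. (3a), assumed sqrt form
    b = mu + sigma * np.sqrt((m - q) / q)       # Eq. (3b), assumed sqrt form
    return bitmap, a, b

def btc_decode(bitmap, a, b):
    return np.where(bitmap == 1, b, a)          # '1' bits -> b, '0' bits -> a
```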
BTC is a very simple algorithm that anybody can implement easily; however, it gives lower image quality than the ODBTC [2] dithered image. 2.3
Ordered Dither BTC
Ordered dithering [7] is an image dithering algorithm. It is commonly used by programs that need to render a continuous-tone image of many colors on a display of lower color depth. The algorithm achieves dithering by applying a threshold map to the displayed pixels, causing some of the pixels to be rendered in a different color depending on how far the color lies between the available color entries. Many researchers have worked on improving the quality of halftone images. BTC, however, yields images of unacceptable quality, and significant blocking effects are seen when the block size increases. Guo [2] solved these problems to obtain a good-quality halftone image. BTC divides the image into M × N blocks, and the maximum value is set to xmax and the minimum value to xmin. Eq. (4) is the ODBTC [2] method.
o_{i,j} = \begin{cases} x_{\max}, & \text{if } x_{i,j} \ge LUT^{(k)}_{i \bmod M,\, j \bmod N} + x_{\min} \\ x_{\min}, & \text{if } x_{i,j} < LUT^{(k)}_{i \bmod M,\, j \bmod N} + x_{\min} \end{cases}    (4)
Here o_{i,j} denotes the output pixel value, and k = xmax - xmin. A significant feature of ODBTC is the dither array LUT, where each specific dither array has 255 corresponding scaled versions. The 255 scaled versions are obtained by
LUT^{(k)}_{m,n} = k \times \frac{LUT_{m,n} - LUT_{\min}}{LUT_{\max} - LUT_{\min}}    (5)
where 1 ≤ k ≤ 255, 1 ≤ m ≤ M, and 1 ≤ n ≤ N; LUT_min and LUT_max denote the minimum and maximum values in the dither array. The dynamic range of LUT^(k) must be shifted by xmin to provide a fair threshold for the pixel values in a block. Since the dither arrays LUT^(k) can be calculated in advance, the complexity can be significantly reduced in practical application.
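An illustrative (not optimized) implementation of the ODBTC thresholding in Eq. (4) with the dither-array scaling of Eq. (5) could look like this:

```python
# Sketch of ODBTC for a single block: scale the dither array to the block's
# dynamic range k = x_max - x_min, then binarize against LUT^(k) + x_min.
import numpy as np

def odbtc_block(block, lut):
    x_min, x_max = int(block.min()), int(block.max())
    k = x_max - x_min
    lut_k = k * (lut - lut.min()) / (lut.max() - lut.min())     # Eq. (5)
    M, N = lut.shape
    out = np.empty_like(block)
    for i in range(block.shape[0]):
        for j in range(block.shape[1]):
            thr = lut_k[i % M, j % N] + x_min
            out[i, j] = x_max if block[i, j] >= thr else x_min  # Eq. (4)
    return out
```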
Fig. 1. (a) Halftone screen, (b) original grayscale Lena image of size 512×512, (c) ODBTC image of size 512×512
Fig. 1(a) is the screen used for making a halftone image from the grayscale image in (b) using the ODBTC method. Lena (c) is the halftone image produced by the ODBTC algorithm.
3
The Proposed Scheme for Data Hiding
In this section, suppose the host image is of size P×Q and is ODBTC-dithered using Eq. (4) and (5). The general algorithm is described as follows.
3.1
Encoding Algorithm
The data encoding method is described in this section. The flow chart shown in Fig. 3 depicts the following steps. (1) First, we construct a halftone image using Eq. (4), where GI and HI denote the grayscale and halftone images, respectively. M×N is the block size, which is the processing unit of halftoning. x_{m,n} is a pixel value in a 4×4 block of the grayscale image; xmax is the largest value in the block and xmin the smallest. o_{m,n} takes the value xmin or xmax depending on Eq. (4). If o_{m,n} is greater than xmin, b_{m,n} is assigned '1'; otherwise b_{m,n} is assigned '0'. Thus we obtain a halftone as in Fig. 2. (2) In order to use EMD on the bitmap, the value of each group of bitmap pixels needs to be converted from gray code to decimal, because EMD requires arithmetic operations. In Fig. 4, one can see that the conversion from gray code to decimal is done for groups of 8 pixels; in this way we get the first and second values, 247 and 212, respectively. A sketch of this conversion is given below. (3) The first decimal value is assigned to g1 and the second to g2, so that g1 and g2 form a group of pixels, i.e., [247, 212]. (4) No modification is needed if the secret digit d equals the extraction function of the original pixel group. When d ≠ f, calculate s = (d - f) mod (2n+1). If s is no greater than n, increase the value of g_s by 1; otherwise, decrease the value of g_{2n+1-s} by 1. If d is 3, the result becomes as shown in Fig. 4; that is, the group of pixels becomes [247, 213].
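Step (2) above can be sketched as follows; the bit ordering within the 8-pixel group is an assumption made for illustration:

```python
# Sketch: interpret a group of 8 bitmap pixels as a Gray code and convert it to
# a decimal value before applying EMD.
def gray_bits_to_decimal(bits):
    # bits: list of 8 ints (assumed MSB first) taken from the halftone bitmap
    binary = [bits[0]]
    for b in bits[1:]:
        binary.append(binary[-1] ^ b)      # Gray -> binary, bit by bit
    value = 0
    for b in binary:
        value = (value << 1) | b
    return value
```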
Fig. 2. (a) a block of halftone, (b) a bit pane
Fig. 3. Encoding procedure for data hiding
The quality of a halftone image is very sensitive to small changes of value; therefore we consider an LSB-based method very suitable for halftone images. If Gaussian filtering is used in ODBTC, we obtain a better quality image than before; after all, receivers always expect the received image to have good quality.
Fig. 4. The process of encoding in EMD method
When d ≠ f, calculate s = (d - f) mod (2n+1). If s is no greater than n, increase the value of g_s by 1; otherwise, decrease the value of g_{2n+1-s} by 1.
Example 1. g = [137 139], n = 2, f = 0. Let d = 4. Since s = 4, the encoder decreases the gray value of the first pixel by 1 to produce the stego-pixels [136 139]. At the decoder, f = 4; therefore the decoder can extract the secret digit 4.
Example 2. g = [141 140], n = 2, f = 1. Let d = 0. Since d - f = -1, the encoder decreases the gray value of the first pixel by 1 to produce the stego-pixels [140 140]. At the decoder, f = 0; therefore the decoder can extract the secret digit 0. 3.2
Decoding Algorithm
The message decoding method is described in this section. The flow chart in Fig. 5 represents the following steps. (1) Divide the stego image into blocks of size M×N. (2) In order for the receiver to decode the stego image, the halftone is converted into the binary bitmap as shown in Fig. 6. (3) For the EMD arithmetic operations, the bitmap groups need to be converted into decimal numbers, and these values are then applied to Eq. (2) as shown in Fig. 6. The first value is assigned to g1 and the second to g2, so g1 and g2 form a group, i.e., [246, 212]. Next, we calculate the f value using the EMD method in Eq. (2); in this case f becomes 3, i.e., '3' is the hidden message.
Fig. 5. Decoding algorithm for ODBTC image
Fig. 6. The process of decoding in EMD method
Fig. 5 shows the procedure of decoding an ODBTC image using the EMD scheme. As one can see from Fig. 5, the decoding procedure is a very simple process.
4
Experimental Results
We experimented with data hiding on nine 512×512 halftone images, which were obtained by the ODBTC algorithm [2] from 8-bit gray-level images. In order to evaluate distortion, we apply an effective quality metric, the weighted signal-to-noise ratio (WSNR). Given two versions of an image of size M × N pixels, one original (denoted x) and the other halftone (denoted y), the WSNR of the binary image is computed as follows [3]:

\mathrm{WSNR}\,(\mathrm{dB}) = 10 \log_{10} \left[ \frac{\sum_{u,v} \left( X(u,v)\, C(u,v) \right)^2}{\sum_{u,v} \left( (X(u,v) - Y(u,v))\, C(u,v) \right)^2} \right]    (6)
where X (u,v) , Y(u,v) and C(u,v) represent the DFT of the input image, output image and contrast sensitivity function (CSF), respectively, and 0 ≤ u ≤ M -1 and 0 ≤ v ≤ N -1. In the same way as SNR is defined as the ratio of average signal power to average noise power, WSNR is defined as the ratio of average weighted signal power to average weighted noise power, where the weighting is derived
from the CSF. A measure of the nonlinear HVS response to a single frequency is called the contrast threshold function (CTF). The CTF is the minimum amplitude necessary to just detect a sine wave of a given angular spatial frequency. Inverting a CTF gives a frequency response, called the contrast sensitivity function (CSF), which is a linear spatially invariant approximation to the HVS [10]. Since the halftone image attempts to preserve useful information of the gray-level image, we compare the halftone or stego image with the original grayscale one. Similar to PSNR, a higher WSNR means higher quality.
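A sketch of the WSNR computation in Eq. (6) is shown below, assuming the CSF weights C have already been sampled on the same DFT grid as the images; magnitudes are used since the DFT values are complex:

```python
# Illustrative WSNR computation (dB) for an original image and its halftone.
import numpy as np

def wsnr_db(original, halftone, csf):
    X = np.fft.fft2(original.astype(float))
    Y = np.fft.fft2(halftone.astype(float))
    signal = np.sum(np.abs(X * csf) ** 2)            # weighted signal power
    noise = np.sum(np.abs((X - Y) * csf) ** 2)       # weighted noise power
    return 10.0 * np.log10(signal / noise)
```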
Fig. 7. Comparison of WSNR between Guo's method and the proposed method (test images: Lena, Baboon, Pepper, Barbara, Airplane, Goldhill, Tiffany)
For example, Fig. 7 shows the WSNR of the proposed method and Guo's method. In this case, the quality of our proposed method is better than that of Guo's method. Moreover, the embedding capacity of the method in [2] is 824 and 3013 bits with a variable BWth of 2 and 4, respectively; BWth denotes an adjustable threshold parameter.
R = \frac{\log_2 (2n+1)}{n}    (7)
On the other hand, our proposed method is able to embed about 32,768 bits into a 512×512 image, with the ability to hide 2 bits per block. Eq. (7) was proposed by Xinpeng Zhang and Shuozhong Wang in 2006 [6] and is used to evaluate the hiding capacity of the EMD method on grayscale images. In order to show the capacity of the method proposed in this paper, we compare it with the one proposed by [3] and present the results in Table 1. As one can see from this table, our method is better than that of [3].
Table 1. Comparison of embedding capacity between [3] and our method

Image name | Capacity (bit): [3] | Capacity (bit): Proposed method
Lena       | 831                 | 32768
Airplane   | 1191                | 32768
Baboon     | 54                  | 32768
Boat       | 553                 | 32768
Pepper     | 685                 | 32768
Barbara    | 254                 | 32768
Fig. 8. The quality of nine stego images in our proposed method
Fig. 8 shows the stego images created by EMD and gray code. The quality of the halftone images looks similar to the originals when judged by the human visual system. Therefore, we can conclude that our proposed method produces a very significant result.
5
Conclusion
Many researchers have been interested in data hiding and steganography for hiding information in a halftone image, and numerous methods have been proposed in various journals. We are also interested in data hiding in halftone images, and for this reason we proposed a new method to increase image quality and to hide more bits of data. Moreover, a halftone image is very sensitive: if some pixels are changed in part of a halftone, noise blocks appear in the image. In order to reduce this problem, we applied gray code to the EMD method. As a result, our proposed method shows a very good result; that is, both image quality and embedding capacity are increased compared with [2] and [3].
References 1. Kite, T.D., Evans, B.L., Bovik, A.C.: Modeling and quality assessment of halftoning by error diffusion. IEEE Trans. Image Processing 9, 909–922 (2000) 2. Guo, J.-M.: Watermarking in dithered halftone images with embeddable cells selection and inverse halftoning. Signal Processing 88, 1496–1510 (2008) 3. Pan, J.S., Luo, H., Lu, Z.H.: Look-up Table Based Reversible Data Hiding for Error Diffused Halftone Images. INFORMATICA 18(4), 615–628 (2007) 4. Tu, S.-F., Hsu, C.-S.: A BTC-based watermarking scheme for digital images. Information & security 15(2), 216–228 (2004) 5. Tseng, H.W., Chang, C.C.: Hiding data in halftone images. Informatica 16(3), 419–430 (2005) 6. Zhang, X., Wang, S.: Efficient Steganographic Embedding by Exploiting Modification Direction. Communications Letters, IEEE 10(11), 781–783 (2006) 7. Bayer, B.: An optimum method for two-level rendition of continuous-tone pictures. IEEE International Conference on Communications 1, 11–15 (1973) 8. Floyd, R.W., Steinberg, L.: An adaptive algorithm for spatial grey scale. In: Proceedings of the Society of Information Display, vol. 17, pp. 75–77 (1976) 9. Delp, E., Mitchell, O.: Image Compression Using Block Truncation Coding. IEEE Transactions Communications 27, 1335–1342 (1979) 10. Niranjan, D.V., Thomas, D.K., Wilson, S.G., Brian, L.E., Alan, C.B.: Image Quality Assessment Based on a Degradation Model. IEEE Transactions on image processing 9(4) (2000)
Reliable Improvement for Collective Intelligence on Thai Herbal Information Verayuth Lertnattee1 , Sinthop Chomya1, and Virach Sornlertlamvanich2 1
Faculty of Pharmacy, Silpakorn University Nakorn Pathom, 73000, Thailand [email protected], [email protected], [email protected] 2 Thai Computational Linguistics Laboratory NICT Asia Research Center, Pathumthani, 12000, Thailand [email protected]
Abstract. Creating a system for collecting herbal information on the Internet is not a trivial task. With conventional techniques, it is hard to find a way in which experts can build a self-sustainable community for exchanging their knowledge. In this work, the Knowledge Unifying Initiator for Herbal Information (KUIHerb) is used as a platform for building a web community for collecting intercultural herbal knowledge with the concept of collective intelligence. With this system, images of herbs, herbal vocabulary, and medicinal usages can be collected. Due to the diversity of herbs, their geographic distribution, and their applications, one problem is the reliability of the herbal information collected with this system. In this paper, three mechanisms are utilized for improving the reliability of the system: (1) information for an herb is divided into several topics, and contributors can select the topics in which they have expertise; (2) a voting system is applied and standard source members (SSMs) are able to contribute their knowledge on text information; (3) a voting system, keywords, and comments are implemented for controlling the quality and reliability of images of an herb. With these mechanisms, herbal information on KUIHerb is more accurate and reliable.
1
Introduction
The origins of many traditional treatments in Thailand can be traced to India. The derivation has since been diversified through many cultures [1]. For example, herb names and their medicinal usages gradually spread into communities, resulting in distinctions according to cultural background. Some herbs are named differently, and the relation between the names is hardly found; some carry complementary knowledge of their usages. These herb names and the terminology of herbal medicine are useful for searching herbal information on the Internet. With conventional techniques, such as interviews with traditional doctors, it is hard to find a way in which the experts
can build a self-sustainable community for exchanging their information. The Internet is an excellent source for providing and sharing information. Web 2.0 systems offer an opportunity for a group of members to share information on a topic of interest. The Knowledge Unifying Initiator for Herbal Information (KUIHerb), a collective intelligence system for herbal medicine based on Web 2.0, is used as a platform for building a web community for collecting intercultural knowledge. However, mechanisms for ensuring that the herbal information contributed by members is accurate and reliable are still an open question. In this paper, three mechanisms are utilized for improving the reliability of the system: (1) information for an herb is divided into several topics, and contributors can select the topics in which they have expertise; (2) a voting system is applied, and standard source members (SSMs) are able to contribute their knowledge on text information; (3) a voting system, keywords and comments are implemented for controlling the quality and reliability of the images of an herb. In the rest of this paper, the concept of collective intelligence with Web 2.0 and the future Web is described in Section 2. Section 3 gives details of herbal information. Section 4 presents the four components of KUIHerb, a system for collecting herbal information. Section 5 explains the mechanisms for reliability improvement in KUIHerb. The experimental results are described in Section 6. A conclusion and future work are given in Section 7.
2 Collective Intelligence with Web 2.0 and the Future Web
In the Web 2.0 era, Internet users easily share their opinions and resources. Consequently, users can collectively contribute to the Web community and generate massive content through their virtual collaboration [2]. For a system with collective intelligence, implementing scalability can indeed be challenging, but sensibility comes at varying levels of sophistication. Several approaches deal with sensibility, e.g., user feedback, recommender systems, search engines, and mashups. As suggested by Gruber, true collective intelligence can be considered to exist when the data collected from all participants is aggregated and recombined to create new knowledge and new ways of learning that individual humans cannot achieve by themselves [3]. However, Web 2.0 provides only limited control over information. Nowadays, we are moving to a new generation of Web technology, i.e., Web 3.0 or the future Web. Although it has already received quite a number of definitions, some useful features of Web 3.0 are described as follows. It can be considered "the data Web" instead of "the document Web" of Web 2.0. The control of shared information is better, and decisions about the opinions provided in Web 3.0 are more accurate. The intelligent Web is a new
important feature of Web 3.0, whereas Web 2.0 offers only the social Web [4]. Unlike Web 2.0, in which participants are usually general Internet users, the wisdom of experts is essential for constructing more knowledge that is valuable. Given these features, Web 3.0 should provide a better collective intelligence system for building new knowledge by means of Information Technology (IT), especially medical knowledge, and herbal knowledge should be no exception.
3 Herbal Information
Herbal information is a special type of information dealing with medicinal herbs. Some topics, such as name identification and medicinal uses, which may differ among cultures, are still problematic. For instance, the same species of an herb may be known by different names in different areas. On the other hand, a certain herbal name may mean one thing in one area but something completely different in another. The relationship between herbs and their names is many-to-many, i.e., a plant may have several names while a name may refer to several plants. For example, the hard wood of Dracaena loureiri Gagnep. is used for fever and the plant is called Chan dang; in other areas of the country it is known by other names, e.g., Chan pha (northern part) and Lakka chan (central part) [5]. The lack of information about native herbs has made them more difficult to apply. Herbal specialists usually seek herbal information in a standard monograph. An herbal monograph contains information needed to determine the proper identity of a plant genus, or genus and species, including parts used, indications, methods of preparation and so on. However, these sources of information are limited. When an herb does not appear in the pharmacopoeia, it is hard to find accurate information about it. This leads general users and herbal specialists to look for information about herbs and their products on the Internet. A set of images of an herb is an excellent source for sharing knowledge about herb identity. From the images, users can discuss which species (including variety) it should be. The scientific name of an herb and its images are used for common understanding. Furthermore, users can discuss which herb should be regarded as the real herb appearing in herbal formulas [6].
4 KUIHerb: Collective Intelligence System for Herbal Information
KUIHerb has been implemented entirely with open source software components. The URL of the Thai version is http://inf.pharm.su.ac.th/~kuiherb. The scripting language is PHP, and the data are collected in a database built with MySQL. Following the concept of Web 2.0 and some features of the future Web, the system has been designed for general users as well as members who would like to participate. The four components of KUIHerb are described as follows.
4.1 Accessing Information
Information about an herb can be reached by two methods, i.e., keyword search and directory search. KUIHerb supports keyword search by a Thai common name, a Thai local name, an English name, the scientific name of an herb, as well as a family name. It also provides the ability to browse categories of parts used and symptoms. Figure 1 shows the Web page for searching information by directory and keyword.
Fig. 1. The Home Page of the KUIHerb
The scientific name of an herb and its images are used for common understanding. On this platform, not only can text content be shared among members, but images of an herb can also be uploaded to the system. This is very important for herbs whose used parts rarely appear.

4.2 Sharing Information
For the first version of KUIHerb, six topics are taken into account, i.e., general characteristics, pictures, local names, medicinal usages (i.e., parts used with their indications and methods of preparation), toxicity, and additional information. Among these topics, a poll-based system is implemented for local names and medicinal usages.
4.3 Providing Information
Two approaches are used for providing herbal information. The first approach presents current news about herbs through Web links; the administrator of KUIHerb regularly adds news about herbs, and it is easy to link to the source of the information. In the other approach, information about an herb is randomly selected from the KUIHerb database when users visit the homepage of the Web site. The homepage also provides a list of new herbs added to the database.

4.4 Web Site Statistics
In KUIHerb, hit counters roughly indicate the Web site's relative popularity and user activity. Three sets of counters are created for these purposes. The first set is for herbal database activity; the volume of user information contributed to KUIHerb can be used as an indicator of the level of user participation. For this set, three counters are used, i.e., the number of herbs, news items and topics on the Web board. The second set describes the members of the community, i.e., the number of members, the newest member, and the number of members active at that time. This set indicates the popularity of the system. The last set reports the total activity over a day, a month and a year.
5 Reliable Improvement in KUIHerb
Several mechanisms are used for improving the reliability of the system. The details of these mechanisms are described as follows.

5.1 Subdividing All Information into a Set of Topics
In fact, the amount of information about an herb is huge, and it is hard for one person to know everything. For example, some photographers like to take pictures of plants; they may not know the local names and usefulness of an herb, but they can contribute pictures of whole plants, flowers, fruit, etc. Native people of an area can contribute the local names of an herb. Pharmacists may suggest its indications and toxicity, but they may not know its local names. With this concept, the information about an herb should be subdivided into several topics, and members should contribute information on the topics in which they have some experience. Moreover, some reliable or standard references are added for finding further information.

5.2 Applying Voting System
The voting mechanism is widely used to improve the accuracy of a system, as in [7]. For more accurate information, only members (and the administrator) of the system are able to contribute and modify information.
With this system, a contributor may choose to work individually by posting his/her opinions under the topics. All opinions and suggestions are subject to voting. While opinions may differ, majority votes determine the view of the community. These features naturally support online collaborative work to create knowledge communities. The weighted score of each opinion is calculated by the formula

Wsum_ik = Σ_j w_ijk

Here, Wsum_ik is defined as the total weight of the i-th opinion of the k-th topic, and w_ijk is the weight of the i-th opinion given by the j-th member who votes on this opinion for the k-th topic. The value of w depends on the priority and the agreement of the member. The weight of a member who has contributed accurate information over a long period should be higher than that of a new one, and w needs to be updated from time to time. Furthermore, if the member agrees with the opinion, the value is positive, and vice versa. The set of higher-weighted opinions for each topic tends to be more believable. To further increase accuracy, the concept of standard source members (SSMs) is proposed. For herbal information, standard sources may be standard textbooks about herbal medicine written by experts in this area. Each standard source is represented as a member in the system. The list of these standard references is included under the references topic. The administrator creates SSMs and enters opinions from the SSMs into the system (text information only). This process is transparent to the real members. With this method, not only is the information obtained with the voting system more accurate, but it also provides information that can be used for calculating w for each member (real member or SSM) in the future. The voting system can be applied to both text and image information.
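To make the scoring rule concrete, the following Python sketch aggregates weighted votes per opinion and topic. The member weights, the sign convention for disagreement and the example votes are illustrative assumptions, not the actual KUIHerb implementation.

from collections import defaultdict

# Each vote: (opinion, topic, member_weight, agrees). Weights and votes are made up for illustration.
votes = [
    ("Chan dang", "local_name", 1.0, True),
    ("Chan dang", "local_name", 2.0, True),   # a long-standing, reliable member carries more weight
    ("Chan pha",  "local_name", 1.0, True),
    ("Chan dang", "local_name", 0.5, False),  # a disagreeing vote contributes negatively
]

def opinion_scores(votes):
    scores = defaultdict(float)
    for opinion, topic, weight, agrees in votes:
        # Wsum_ik = sum over voting members j of w_ijk (negative when the member disagrees)
        scores[(topic, opinion)] += weight if agrees else -weight
    return scores

# Opinions with a higher total weight are listed first within each topic.
for (topic, opinion), score in sorted(opinion_scores(votes).items(), key=lambda kv: -kv[1]):
    print(topic, opinion, score)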
5.3 Reliability of Images of an Herb
Due to synonyms and homonyms among the local names of an herb, the scientific name of an herb and its images are used for common understanding. In this system, images of an herb can be uploaded. The images should depict the whole plant, and the parts which have medicinal usage, such as leaves, roots, flowers, seeds, resin, root bark, inner bark (cambium), berries and sometimes the pericarp or other portions, should be included. This is very useful to visitors who would like to see the used parts of an herb. They should appear in both fresh and dried forms. To make these images more reliable, three mechanisms are provided, as follows. – Keywords and contributors' names can be given to the system: keywords indicate to visitors the focus point of the image, and contributors' names assure visitors of the quality of the images.
– The voting system: this mechanism summarizes the popularity and quality of the images. The basic idea is that images which are of high quality and/or useful for treatments should be more popular. – A comment: in case an image has some problems, e.g., an incorrect picture or an unclear image, a comment can be used as a tool to inform visitors. The administrator may then make a decision about the image.
6 Experimental Results
KUIHerb has been collecting information from members for one year, and several mechanisms have been proposed for improving the reliability of the system. In order to demonstrate the concept, six herbs are randomly selected as a sample set. The results are described as follows.

6.1 Main Topics
With careful design, the information about an herb can be divided into seven topics, i.e., general characteristics, pictures, local names, medicinal usages (i.e., parts used with their indications and methods of preparation), toxicity, additional information and references. The detailed description of each topic is shown in Table 1. Notice that some information, i.e., locations and parts used, can be selected from a list, which prevents typing errors by contributors.

6.2 Voting System
In this version, a majority voting score is applied to the topics of local names and medicinal usages. For local names, we can summarize whether an herb name is used at the level of a city, a province or a larger area. Since this is the first period of use, all members are given equal weight. A member can suggest a new opinion, which is initialized with a score of one. When other members agree with the opinion, a simple click on the "Vote" button increases the score by one; each member has only one vote per opinion. The opinion with the higher score is moved to the upper part of the window. In the case of multiple opinions, the popular vote selects the opinions preferred by the community. Figure 2 shows a list of local names for an herb with their scores. Moreover, a hierarchy of locations is provided in the list and can be selected by members. The hierarchy of locations controls the accuracy of user input and is used for summarization: if a local name is used in several provinces of a region, we can conclude that the local name belongs to that region. For medicinal usages, information about parts used, indications and methods of preparation can be contributed. In order to demonstrate the effectiveness of the SSMs, 27 standard Thai textbooks on herbal medicine are added into KUIHerb as 27 SSMs.
Table 1. Seven topics for herbal information

Topic: General Characteristics
Explanation: General characteristics of an herb, such as descriptions of leaves, flowers, fruit, etc., location and culture, are added into the system
Contributor: Admin (to give basic information, not for sharing)
Type: Text

Topic: Images
Explanation: Images of a whole plant and of parts which have some medicinal usage
Contributor: Member (general member, photographer)
Type: Images, Text

Topic: Local Names
Explanation: Local names of herbs are suggested by members. Multi-lingual names can also be applied. Lists of locations are provided at the level of city, province, region, etc.
Contributor: Member (local area people, herbal specialist)
Type: Text, List

Topic: Medicinal Usages
Explanation: A list of predefined parts which may be used for treatments is provided. A member may select the part and suggest its indications. The method of preparation can be suggested. In the case of a part with several indications and several methods of preparation, the opinion should separate the indications and methods of preparation for each part used.
Contributor: Member (traditional doctor, herbal specialist, pharmacist)
Type: Text, List

Topic: Precaution and Toxicity
Explanation: Any suggestions about precaution and toxicity will be kept for warning when someone would like to use the herb
Contributor: Member (traditional doctor, herbal specialist, pharmacist)
Type: Text

Topic: Additional Information
Explanation: Other valuable information such as cultivation may also be given
Contributor: Member (general member)
Type: Text

Topic: References
Explanation: This area can be applied for suggesting references for an opinion in order to make the opinion more reliable
Contributor: Admin
Type: Text
We randomly select six herbs, each of which has at least one image. The numbers of local names and medicinal usages for each herb before any SSM was added (no SSM) and after 27 SSMs were added (with SSM) are shown in Table 2. The maximum scores for both topics of each herb are given in parentheses.

Table 2. Comparing voting results from two patterns of membership

Scientific Name               English Name            Local Name            Medicinal Usage
                                                      no SSM    with SSM    no SSM    with SSM
Alpinia galanga (L.) Willd.   Galangal                11(1)     13(5)       10(2)     12(5)
Senna alata (L.) Roxb.        Ringworm bush           17(7)     22(10)      7(7)      12(9)
Millingtonia hortensis L.f.   Cork tree               5(7)      12(8)       4(7)      5(11)
Oroxylum indicum (L.) Kurz    Indian trumpet flower   14(2)     24(9)       8(2)      19(5)
Piper betle L.                Betel pepper            10(5)     11(6)       7(6)      9(8)
Solanum indicum L.            Sparrow's Brinjal       15(4)     25(7)       3(5)      8(10)
Fig. 2. Majority Voting System for the Thai Local Name
From the results, some observations can be made: (1) more distinct local names and medicinal usages are obtained with SSMs; (2) the maximum scores increase on both topics when SSMs are used; (3) for some herbs, a large gap between the two settings can be found. We can conclude that the SSMs are useful for the system: the opinions on each topic are more diverse, more reliable and more interesting.

6.3 Reliability of Images of an Herb
In this system, a voting system, keywords and comments are applied to the contribution of images. Keywords attach clear textual information to the images, namely the herb name and the name of the part shown. This is useful for users who are not familiar with herbal information. With the voting system, an image that is frequently clicked by visitors obtains a higher score and is shifted to the upper part of the window. For example, the images of Alpinia galanga (L.) Willd. are shown in Figure 3. Five images of this herb have been collected, i.e., stem, flower1, fruit, rhizome and flower2. At this time, the numbers of hits for these images are 5, 3, 3, 3 and 1, respectively. Keywords below an image help in identifying it. Table 3 presents the parts of each herb given by members under the image topic and the medicinal usages topic; the first part listed in each cell is the part which gained the highest score. Moreover, comments are valuable for controlling the quality of an image.
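As a minimal illustration of this re-ranking rule (the hit counts are taken from the example above; the code is not part of KUIHerb itself):

# Images of Alpinia galanga and their hit counts from the example.
images = [("stem", 5), ("flower1", 3), ("fruit", 3), ("rhizome", 3), ("flower2", 1)]

# Images with more hits are shown in the upper part of the window.
ranked = sorted(images, key=lambda item: item[1], reverse=True)
print([name for name, hits in ranked])  # ['stem', 'flower1', 'fruit', 'rhizome', 'flower2']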
Fig. 3. Images of Alpinia galanga (L.) Willd., ranked according to popularity of views
When members find that an image is incorrect and/or of low quality, they can write a message in the system to warn other users. From the results, almost all parts shown under the image topic can also be found under the medicinal usages topic, and the most popular part is quite similar in both. In addition, the information on the medicinal usages topic also suggests which images of an
Table 3. Information about parts used on the image topic and the medicinal usages topic

Scientific Name               Image                           Parts Used for Medicinal Usages
Alpinia galanga (L.) Willd.   stem, flowers, fruit, rhizome   rhizome, leaves, fruit, stem, flowers, root
Senna alata (L.) Roxb.        flowers, stem                   flowers, seed, leaves, pod, stem, root
Millingtonia hortensis L.f.   flowers, leaves                 flowers, root, heartwood
Oroxylum indicum (L.) Kurz    flowers, pod, leaves, seed      bark, root, pod, seed, root bark, leaves, stem
Piper betle L.                leaves                          leaves, root
Solanum indicum L.            fruit, flower                   fruit, root, leaves, stem
herb should be included in the system. We can summarize that the keywords and the voting system are valuable for increasing the reliability of the images topic.
7 Conclusion and Future Works
In this work, KUIHerb was used as a platform for building a web community for collecting intercultural herbal knowledge based on the concepts of Web 2.0 and some features of the future Web. Due to the diversity of herbs, their geographic distribution and their applications, one problem was the reliability of the herbal information collected through the system. Three mechanisms were utilized for improving the reliability of this information. First, the information about an herb was divided into several topics, and contributors could select the topics in which they have expertise. Second, a voting system was applied and standard source members (SSMs) could contribute their knowledge. Finally, a voting system, keywords and comments were implemented for controlling the quality and reliability of images. With these mechanisms, the herbal information on KUIHerb became more accurate and reliable, so it can be applied for medical and pharmaceutical purposes with confidence. In this first version of KUIHerb, majority voting with equal member weights was used for selecting a set of accepted opinions. However, members who have made more valuable contributions to the system should be given more weight. Furthermore, applying data mining to the collected data will be useful. These issues are left for our future work.
Acknowledgments This work has been supported by Thailand Research Fund and Commission on Higher Education (CHE) under project number MRG5080125 as well as the National Electronics and Computer Technology Center (NECTEC) via research grant NT-B-22-MA-17-50-14.
References 1. Lovell-Smith, H.D.: In defence of ayurvedic medicine. The New Zealand Medical Journal 119, 1–3 (2006) 2. Lin, K.J.: Building web 2.0. IEEE Computer 40, 101–102 (2007) 3. Gruber, T.: Collective knowledge systems: Where the social web meets the semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 6, 4–13 (2007) 4. Giustini, D.: Web 3.0 and medicine. British Medical Journal 335, 1273–1274 (2007) 5. Smitinand, T.: Thai Plant Names. The Forest Herbarium, Royal Forest Department, Thailand (2001)
6. Wu, K., Farrelly, J., Upton, R., Chen, J.: Complexities of the herbal nomenclature system in traditional Chinese medicine (TCM): lessons learned from the misuse of Aristolochia-related species and the importance of the pharmaceutical name during botanical drug product development. Phytomedicine 14, 273–279 (2007) 7. Ko, Y., Park, J., Seo, J.: Using the feature projection technique based on a normalized voting method for text classification. Information Processing and Management 40, 191–208 (2004)
Semantic Compression for Specialised Information Retrieval Systems

Dariusz Ceglarek(1), Konstanty Haniewicz(2), and Wojciech Rutkowski(3)

(1) Wyzsza Szkola Bankowa w Poznaniu, Poland, [email protected]
(2) Uniwersytet Ekonomiczny w Poznaniu, Poland, [email protected]
(3) Business Consulting Center, Poland, [email protected]
Abstract. The aim of this work is to present some of the ongoing research done as a part of the development of the Semantically Enhanced Intellectual Property Protection System (SEIPro2S). The main focus is on the description of methods that allow for the creation of more concise documents that preserve semantically the same meaning as their originals. These compacting methods are therefore denoted as semantic compression. Keywords: semantic compression, sense, intellectual property, semantic net, thought matching, natural language processing.
1 Introduction
Semantic compression is motivated by the issue of the background performance of specialised information retrieval systems. We must remember that not only the quality of results but also the processing time plays a crucial role for the end user. The latter is an effect of algorithm complexity and, naturally, of the amount of data being processed. Therefore, in contemporary IR systems based on the vector space model, it is desirable to reduce the number of dimensions (corresponding to recognized concepts) and thereby limit the amount of computation. Techniques such as stop-lists or domain lexicons have been used for many years to streamline the processing. The idea behind their usage is to identify words with low semantic value and exclude them from the following steps. Reduction of vector space dimensions is also possible by identifying synonyms and replacing them with one concept, thus introducing the notion of a descriptor. The solutions presented in this article are devised specifically for the tasks of intellectual property protection, therefore some details are not generally applicable to IR systems. All discussion of the solutions presented here stems from the work on SEIPro2S [1], which has been devised as an anti-plagiarism system for Polish. Thus, some of the discussion has to be seen as mapping the outcomes of the work already done onto general (English) cases.
First of all, the idea of semantic compression is introduced, followed by a discussion of various prerequisites for successful processing. Further, the place of semantic compression in an IR task is discussed, and another section is devoted to particular solutions. Everything is summarized in the last section, along with a discussion of future challenges.
2 Idea of Semantic Compression
Semantic compression is a process through which a reduction of the dimension space occurs. The reduction entails some information loss, but in general it must not degrade the quality of results, thus every possible improvement is considered in terms of its overall impact on quality. The dimension reduction is performed by introducing descriptors for viable terms. Descriptors are chosen to represent a set of synonyms or hyponyms in the processed passage; the decision is made taking into account the relations among the terms and their frequency in the context domain. The preceding paragraph implies that there are specific conditions a system must satisfy to be able to use semantic compression. All of these are discussed in later sections.
3 Semantic Compression in Application
As the inner workings of SEIPro2S are out of the scope of this work, only some initial information is presented on the conditions taken into consideration and the technology employed, along with a general outline of the system and its capabilities. SEIPro2S has been built as an aid in situations where some kind of intellectual asset has to be protected from unauthorized usage in the sense of copied ideas. An example of such a situation is plagiarism, which, as is commonly known, violates one's intellectual property to gain some sort of revenue. One cannot come up with a universally valid definition of plagiarism; as an example, Polish law does not define a notion of plagiarism, yet it enumerates a list of offences directed at intellectual property. SEIPro2S enables its users to perform a semiautomatic control of publicly available documents, and of those made available to it by other channels, with respect to the use of concepts which represent the user's intellectual property. This application can easily be transformed into another one: it lets us monitor a stream of documents for content similar at the concept level to a base document. One can easily imagine this as a help in the automatic building of a relatively complete reference base for some research domain. The working prototype produces a report marking the areas copied from other sources, along with a clear indicator informing the user about the ratio of copied work to the overall document content [1]. Thanks to the devised algorithms, which are based on an extensive semantic net for the Polish language and on thesauri, the prototype is able to detect copied fragments where
an insincere author has tried to hide his deeds by changing the word order or by using synonyms, hyponyms or commonly interchangeable phrases when compiling his document.
4 Compression in the Preparation Procedure
We shall give a brief overview of the document preparation procedure for information retrieval tasks, both to clearly demonstrate where a semantic solution can find application and to show how it can be viewed from an IR standpoint. First and foremost, it is necessary to discuss data representation, namely the way documents, the key objects of interest of any IR system, are processed. The vector space model deals with vectors that represent every single document taken into account. One can view such a vector as a way of depicting the degree of mutual similarity between an individual document and the set of surrounding documents to which it will be compared [5]. The vector itself is the result of a preparation procedure. This procedure is the key reference point, and thus it will be examined in greater detail. To begin with, every known approach agrees that some type of normalization procedure takes place. The aim of normalization is to come up with a set of tokens that reduces the overall number of items (i.e. terms). Whether this is done by stemming or by lemmatizing (lexing) depends on the language of the document and on the approach of the people responsible for devising the particular method. The normalization actions also deal with a number of minor tasks such as punctuation clean-up, trimming and homogenizing encodings. This is the first place where one can actively lessen the computational burden imposed on the system. Remarks on the choice of lexing over stemming, or other choices of actions, are to be considered as a reflection of the fundamental choices faced by system creators; one can easily imagine systems where the omission of minor mark-up such as superscript and subscript is unacceptable. Every term is then compared against a stop list, which allows the total number of terms to be further reduced by removing those that are most frequent yet meaningless for the task. Such words have high frequency paired with low information value both for the reader and for the system, i.e. auxiliary verbs, linkers, conjunctions [4], etc. One can argue that these words can be of some value in a number of IR tasks (e.g. the frequency of modal verbs can be a heuristic for the genre of the processed text), but a closer discussion of this topic is not the aim of this work. After these actions we come up with a cleaned set of elements that is represented as a vector of refined terms. Additional information may be stored (the number of occurrences and their positions relative to the document's beginning) when the need arises. In the context of the document-to-vector transformation, semantic compression is to be viewed as an additional step in the procedure that will reduce the total
number of terms in the vector representing a document. Thus it belongs to the same group of actions as stop-word elimination. To generalize further, one can perceive semantic compression as an advanced process of creating descriptors for synsets, where more general terms are favoured over less general ones - the less general terms can thus be omitted when the conditions specified later occur. With this kind of compression, apart from introducing a generalization that must not make the results worse, we get the additional bonus of lowered computational complexity in the later stages of the information retrieval procedure, due to the lower count of terms stored in the vectors.
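For illustration only, the following Python fragment sketches such a preparation step - lower-casing, tokenization, stop-word removal and a naive dictionary-based lemmatization - on English text. The stop list and lemma dictionary are toy assumptions, not the resources used by SEIPro2S (which works on Polish).

import re

# Toy resources; a real system would use full language-specific lexicons.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in"}
LEMMAS = {"advances": "advance", "antibiotics": "antibiotic"}

def to_term_vector(text):
    # Normalization: lower-case and tokenize, dropping punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal followed by a naive lemmatization step.
    terms = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    # Vector of refined terms with occurrence counts.
    vector = {}
    for t in terms:
        vector[t] = vector.get(t, 0) + 1
    return vector

print(to_term_vector("Recent advances in antibiotics: the antibiotics are improving."))
# {'recent': 1, 'advance': 1, 'antibiotic': 2, 'improving': 1}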
4.1 Example Scenario
It is now time to describe carefully the situation in which semantic compression can be applied. Let us envision a situation where a document is an artefact to be matched against a corpus of other documents. This can occur on a variety of occasions; one of them is an intellectual property protection system (such as SEIPro2S). To apply semantic compression it is postulated that the system is equipped with various domain-specific corpora. These corpora let the system come up with sets of word frequencies specific to some area of interest (medicine, computer science, mathematics, biology, physics, etc.). To illustrate this, let us consider the following scenario. When the system processes a document that is a piece of news concerning recent advances in antibiotics, posted in some popular magazine, we can take advantage of the domain corpora. We can extract the mentioned frequencies and reason automatically about information that is obvious to a human reader, i.e. the type of document, its potential origin and its register; as a result of this reasoning, the information becomes available to the system through appropriate statistics captured from the text. All this does not exceed the commonly observed capabilities of classification systems, yet it enables us to undertake additional steps. When both the potential reader and the system are conscious of the document type, we can use the semantic net for compressing terms. Coming back to the scenario with the news concerning advances in antibiotics, we can safely assume that it is not a highly specialized article. Thus any reference to penicillin or vancomycin is a possible point where semantic compression can be applied. The semantic net stores data on terms and their mutual relationships; a good semantic net will store data reflecting the fact that penicillin is a type of antibiotic, and so is vancomycin. The result of applying semantic compression is visible in the shortening of the global vector by two elements: instead of entries for antibiotic, penicillin and vancomycin, we can store just the first one. Analogous actions are performed for terms that are too specific in the context of the processed document.
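A minimal sketch of this replacement step is shown below; the hypernym mapping and the document vector are illustrative assumptions taken from the scenario above, not an excerpt from SEIPro2S.

# Hypothetical hypernym mapping derived from the semantic net.
hypernym_of = {"penicillin": "antibiotic", "vancomycin": "antibiotic"}

def compress_vector(term_counts, hypernym_of):
    compressed = {}
    for term, count in term_counts.items():
        descriptor = hypernym_of.get(term, term)  # replace a term by its descriptor when one exists
        compressed[descriptor] = compressed.get(descriptor, 0) + count
    return compressed

doc = {"antibiotic": 3, "penicillin": 2, "vancomycin": 1, "infection": 4}
print(compress_vector(doc, hypernym_of))
# {'antibiotic': 6, 'infection': 4} -- the vector is shorter by two dimensions, as in the scenario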
5 Knowledge Representation and Compression Process
As stated in previous sections, there are several conditions and prerequisites that must be met in order to make the whole process functional. One of the necessary elements of the system is a set of domain corpora. They are invaluable, as we can extract from them data on term frequency. To illustrate this, please consider the following example. We queried the Brown University Standard Corpus of Present-Day American English [2] for the frequency of a set of three terms: mean, average and median. The Brown corpus was chosen for this example as it is readily available to anyone and provides valuable information on the domain of the texts it contains. Of interest to us was the frequency of the terms in two different categories, i.e. news and learned (their descriptions can easily be checked in [2]). The query results are shown in Table 1.

Table 1. Example of term frequency in the Brown corpus - sample terms: mean, average and median

category          mean   average   median
adventure           19         0        0
belles lettres      24         7        0
editorial            4         8        0
fiction             13         1        0
government           5        19        0
hobbies             10        19        0
humor                7         0        0
learned             44        39        1
lore                16         9        0
mystery             13         2        0
news                 8        16        0
religion            17         3        0
reviews              3         1        0
romance             10         5        0
science fiction      6         1        0
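Counts of this kind can be reproduced, for instance, with NLTK's copy of the Brown corpus. The short sketch below is only an illustration (it requires nltk and the downloaded 'brown' corpus); raw word-form counts obtained this way may differ slightly from the figures quoted in Table 1.

import nltk
from nltk.corpus import brown  # nltk.download('brown') is needed once

terms = ["mean", "average", "median"]
for category in ["news", "learned"]:
    # Frequency distribution over lower-cased word forms in one Brown category.
    freq = nltk.FreqDist(w.lower() for w in brown.words(categories=category))
    print(category, {t: freq[t] for t in terms})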
The results serve well to highlight the importance of text corpora in our system: they provide us with statistical data on related terms. This is of essential use to any system that wants to perform semantic compression in any form, as it gives a way to deal with polysemy. There are other ways to handle this challenge [3], yet all are imperfect at some point [19]. The system we are developing is based on a variety of domain corpora. They allow us to come up with a ranking of terms, along with a ranking of the concrete senses of these terms. Referring to the example, we can safely say that when we are dealing with a scientific text we can choose the right sense for average. Here the relation was obvious, and may be perceived by specialists as biased or
flawed, but for a lay person there is no difference between average and median. One has to remember that the system, at this point, will not be smarter than an educated human. Still, if it points out a possible convergence of meanings to the system operator, a potential act of plagiarism or a leak of sensitive information can be spotted. To summarize, with the rankings of terms and their senses we obtain one of the elements enabling the system to decide which term is to be used as a descriptor in later parts of the process. From the term frequency ranking and the sense ranking, a decision to pick one term over another is made; the quantity involved shall be referred to as information capacity. We are interested in terms that represent the broadest possible meaning without deteriorating the classification results. This is only possible when the second prerequisite is met: we need a semantic net that stores a set of relationships enabling us to reason about its terms; both synonymy and hypernymy have to be available. For the sake of the discussion we will refer our work to WordNet; nevertheless, one has to remember that our system is driven by its proprietary semantic net for Polish. From a computational point of view, a hierarchical data structure is the most desirable one. If all terms are organized in a hierarchy, each can directly indicate a hypernym, i.e. a word with a more general meaning. The most desirable situation is when the semantic net used provides a disambiguation feature. WordNet does not provide it, and thus there is ongoing research in the field of term sense disambiguation, as stated before [17]. The semantic net in SEIPro2S provides disambiguation to some extent, but further research effort is needed to make it universal [8]. Taking the whole semantic net into account, one can use a graph as its depiction. For the sake of clarity we can focus on a tree, a type of graph, to represent the hierarchy. Every tree has a root node, the one from which all other nodes stem. When an edge from one node to another represents the relationship "is-more-general-than", we can say that the most general node is the root of the tree. The most general node in the tree is also the one that has the biggest information capacity. We present a first version of our algorithm for performing semantic compression. Please note that all prerequisites must be met: when one does not have statistics of term and term-sense frequencies, one is left with manual sense disambiguation or a brute-force approach.

5.1 Algorithm for Semantic Compression
Initially, we are given a list k_i of M key concepts, used to create M-dimensional vectors representing documents, and a target condition: the desired number of key concepts N (where N < M). First, the total frequency f(k_i) of each concept has to be computed over all documents in question. The second step requires assembling information on the relationships among the concepts.
Algorithm 1. Calculating cumulated concept instance frequency in document corpus C

for v ∈ V'' do
  p = card(H_v)
  for h ∈ H_v do
    l_h = l_h + l_v / p
  end for
end for

V   - vector of concepts stored in the semantic net
V'  - topologically sorted vector V
V'' - reversed vector V'
l_v - number of occurrences of concept v in corpus C
H_v - set of hypernyms of concept v

Algorithm 2. Choosing m concepts in the domain-compressed semantic net

for v ∈ V' do
  if l_v ≥ f then
    d_v = v
  else
    d_v = FMax(v)
  end if
end for

L   - vector storing the numbers of concept occurrences in document corpus C
L'  - vector L sorted in descending order
f   - occurrence number of the m-th concept in vector L'
d_v - descriptor assigned to concept v

Algorithm 3. FMax procedure - finding the descriptor of the hypernym with the highest frequency

FMax(v):
  max = 0
  x = ∅
  for h ∈ H_v do
    if d_h ≠ ∅ then
      if l_{d_h} > max then
        max = l_{d_h}
        x = d_h
      end if
    end if
  end for
  return x
Fig. 1. Selection of N concepts with top cumulative frequencies
Moving upwards in the hierarchy, we calculate a cumulative concept frequency by adding the sum of the hyponyms' frequencies to the frequency of the hypernym: cumf(k_i) = f(k_i) + Σ_j cumf(k_j), where k_i is a hypernym of k_j (in pseudocode, Algorithm 1). The cumulated frequencies are then sorted, and the N concepts with the top values are selected as the target key concepts (the descriptor list); see Algorithm 2 and Algorithm 3. Finally, we define compression mapping rules for the remaining (M − N) words, in order to handle every occurrence of k_j as its hypernym k_i in further processing. If necessary (when a hypernym has not been selected as a descriptor), mapping rules can be nested. This is essential, as it allows us to shorten individual vectors by replacing terms of lower information capacity with their descriptors. The described method of semantic compression results in a reduction of the vector space dimensions by (M − N). As an effect, a part of the specific information, proportional to the information capacity of the concepts not selected as descriptors, is lost. The process is depicted in Figure 1.
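The whole procedure can be summarised in a short Python sketch. It is an illustration only, written under simplifying assumptions: a toy hypernym map with invented frequencies, descriptor lookup resolved recursively, and none of the corner-case handling a production system would need.

from collections import defaultdict

# Toy inputs: concept frequencies in the corpus and hypernym links from the semantic net.
freq = {"penicillin": 7, "vancomycin": 2, "antibiotic": 4, "drug": 3}
hypernyms = {"penicillin": ["antibiotic"], "vancomycin": ["antibiotic"],
             "antibiotic": ["drug"], "drug": []}

def specific_first_order(hypernyms):
    # Order concepts so that every concept appears before its hypernyms.
    hyponyms = defaultdict(list)
    for c, hs in hypernyms.items():
        for h in hs:
            hyponyms[h].append(c)
    order, seen = [], set()
    def visit(c):
        if c in seen:
            return
        seen.add(c)
        for child in hyponyms[c]:
            visit(child)
        order.append(c)
    for c in hypernyms:
        visit(c)
    return order

def cumulative_frequencies(freq, hypernyms):
    cum = dict(freq)
    for c in specific_first_order(hypernyms):
        parents = hypernyms.get(c, [])
        for h in parents:
            # Pass the accumulated count of c up to each hypernym (split if there are several).
            cum[h] = cum.get(h, 0) + cum.get(c, 0) / len(parents)
    return cum

def choose_descriptors(cum, hypernyms, n):
    descriptors = set(sorted(cum, key=cum.get, reverse=True)[:n])
    def find_descriptor(c):
        if c in descriptors:
            return c
        found = [find_descriptor(h) for h in hypernyms.get(c, [])]
        found = [d for d in found if d is not None]
        return max(found, key=lambda d: cum[d]) if found else None
    # Mapping rules: every non-descriptor concept is rewritten to a descriptor ancestor.
    mapping = {c: find_descriptor(c) for c in cum if c not in descriptors}
    return descriptors, mapping

cum = cumulative_frequencies(freq, hypernyms)
print(cum)  # cumulated values, e.g. antibiotic -> 13.0, drug -> 16.0
print(choose_descriptors(cum, hypernyms, 2))
# descriptors: {'drug', 'antibiotic'}; mapping: penicillin and vancomycin -> 'antibiotic'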
6 Evaluation
In order to verify the idea of utilizing semantic compression to streamline IR/NLP tasks, an experiment has been conducted. We checked whether the number of vector space dimensions can be reduced significantly without a deterioration of the results. Two sample sets of documents (containing 780 and 900 items, respectively) have been subjected to a clustering procedure. The documents came from six categories: astronomy, ecology, IT, culture, law and sport. To verify the results,
Table 2. Evaluation of classification with semantic compression - task 1 (780 documents) results

                                Number of clustering features
                                1000     900      800      700      600      Average
Without semantic compression    93,46%   90,90%   91,92%   92,69%   89,49%   91,69%
12000 concepts                  91,92%   90,38%   90,77%   88,59%   87,95%   89,92%
10000 concepts                  93,08%   89,62%   91,67%   90,51%   90,90%   91,15%
8000 concepts                   92,05%   92,69%   90,51%   91,03%   89,23%   91,10%
6000 concepts                   91,79%   90,77%   90,90%   89,74%   91,03%   90,85%
4000 concepts                   88,33%   89,62%   87,69%   86,79%   86,92%   87,87%
2000 concepts                   86,54%   87,18%   85,77%   85,13%   84,74%   85,87%
1000 concepts                   83,85%   84,10%   81,92%   81,28%   80,51%   82,33%
Table 3. Evaluation of classification with semantic compression - task 2 (900 documents) results

                                Number of clustering features
                                1000     900      800      700      600      Average
Without semantic compression    93,78%   93,89%   93,11%   92,56%   92,11%   92,03%
12000 concepts                  93,00%   94,00%   94,00%   91,33%   90,78%   91,49%
10000 concepts                  93,33%   94,22%   93,56%   93,44%   92,22%   92,33%
8000 concepts                   92,78%   93,22%   94,22%   93,33%   90,89%   91,79%
6000 concepts                   92,56%   93,44%   92,22%   92,89%   91,00%   91,26%
4000 concepts                   92,00%   92,44%   91,22%   90,89%   90,22%   90,03%
2000 concepts                   92,33%   91,78%   89,89%   90,56%   89,67%   89,44%
1000 concepts                   92,00%   92,00%   88,33%   87,11%   83,78%   86,90%
all documents had initially been manually labelled with a category. All documents were in Polish. The clustering procedure was performed eight times. The first run was without semantic compression methods: all identified concepts (about 25000 - only about a quarter of all the concepts in our research material) were included. Then, the semantic compression algorithm was used to gradually reduce the number of concepts: it started with 12000 and proceeded with 10000, 8000, 6000, 4000, 2000 and 1000. The classification results have been evaluated by comparing them with the labels specified by the document editors: the ratio of correct classifications was calculated. The outcome is presented in Tables 2 and 3. Figure 2 presents the average evaluation results from the two classification tasks. The loss of classification quality is virtually insignificant for a semantic compression strength which reduces the number of concepts to 6000. Stronger semantic compression and a further reduction of the concept number entail a deterioration of classification quality (which can, however, still be acceptable). The conducted experiment indicates that the semantic compression algorithm can be employed in classification tasks to significantly reduce the number of
Fig. 2. Experiments’ results
concepts and the corresponding vector dimensions. As a consequence, tasks with extensive computational complexity are performed faster.
7 Summary and Future Work
The presented work is a proof of our effort in the domain of information retrieval. It presents ideas that can be applied to further improve systems such as SEIPro2S. The notion of semantic compression has been introduced and positioned among the already recognized steps of any information retrieval task. Additionally, a discussion of possible ways of coping with sense disambiguation has been presented. We are aware of a variety of challenges, such as word sense disambiguation, and one of our goals is to come up with a good general heuristic method for this issue. We will further pursue the idea of creating statistics of terms and their sense frequencies in domain corpora. We strongly believe that our work will induce feedback from the community, as the topic seems to be of high interest. Our future publications will focus on presenting results from experiments done on English corpora.
References 1. Ceglarek, D., Haniewicz, K., Rutkowski, W.: Semantically Enhanced Intellectual Property Protection System - SEIPro2S. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) ICCCI 2009. LNCS, vol. 5796, pp. 449–459. Springer, Heidelberg (2009) 2. Francis, W.N., Kucera, H.: Brown Corpus Manual to accompany A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. Department of Linguistics, Brown University, Providence, Rhode Island, 1964; Revised 1971; Revised and Amplified (1979) 3. Boyd-Graber, J., Blei, D., Zhu, X.: A Topic Model for Word Sense Disambiguation. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, June 2007, pp. 1024–1033 (2007) 4. Maron, M.E., Kuhns, J.L.: On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM (JACM) 7(3), 216–244 (1960) 5. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, Addison-Wesley Longman Publishing Co, New York (1999) 6. Baziz, M.: Towards a Semantic Representation of Documents by Ontology-Document Mapping (2004) 7. Baziz, M., Boughanen, M., Aussenac-Gilles, N.: Semantic Networks for a Conceptual Indexing of Documents in IR (2005) 8. Ceglarek, D.: Zastosowanie sieci semantycznej do disambiguacji pojec w jezyku naturalnym, red. Porebska-Miac T., Sroka H., w: Systemy wspomagania organizacji SWO 2006 - Katowice: Wydawnictwo Akademii Ekonomicznej (AE) w Katowicach (2006) 9. Frakes, W.B., Baeza-Yates, R.: Information Retrieval - Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs (1992) 10. Gonzalo, J., et al.: Indexing with WordNet Synsets can improve Text Retrieval (1998) 11. Hotho, A., Staab, S., Stumme, S.: Explaining Text Clustering Results using Semantic Structures. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 217–228. Springer, Heidelberg (2003) 12. Labuzek, M.: Wykorzystanie metamodelowania do specyfikacji ontologii znaczenia opisow rzeczywistosci, projekt badawczy KBN (2004) 13. Khan, L., McLeod, D., Hovy, E.: Retrieval effectiveness of an ontology-based model for information selection (2004) 14. Krovetz, R., Croft, W.B.: Lexical Ambiguity and Information Retrieval (1992) 15. Sanderson, M.: Word Sense Disambiguation and Information Retrieval (1997) 16. Sanderson, M.: Retrieving with Good Sense (2000) 17. Stokoe, C., Oakes, M.P., Tait, J.: Word Sense Disambiguation in Information Retrieval Revisited, SIGIR (2003) 18. Van Bakel, B.: Modern Classical Document Indexing. A linguistic contribution to knowledge-based IR (1999) 19. Gale, W., Church, K., Yarowsky, D.: A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities 26, 415–439 (1992)
Service Mining for Composite Service Discovery

Min-Feng Wang, Meng-Feng Tsai, Cheng-Hsien Tang, and Jia-Ying Hu

Department of Computer Science and Information Engineering, National Central University, Jhongli, Taiwan, 32001
[email protected], [email protected], [email protected], [email protected]
Abstract. Web service technology is being applied to organize business processes in many large-scale enterprises. The discovery of composite services has therefore become an active research area. In this paper, we utilize the PLWAP-tree algorithm to analyze the relationships among web services from a web service usage log. This method generates time-ordered sets of web services which can be integrated into a real business process. The empirical results show that the methodology is useful, flexible, and efficient, and that web services can be integrated into a composite service according to the mining results.
1 Introduction

Nowadays, numerous web services have emerged on the Internet. The major web service technologies include three open standards: WSDL [10], UDDI [11], and SOAP [9]. The Web Service Description Language (WSDL) describes how to use a web service. Universal Description, Discovery, and Integration (UDDI) provides a directory of services on the Internet and gives users and software agents information about where a service is. The Simple Object Access Protocol (SOAP) is the means of communication between a service requestor and web services; SOAP is a protocol for exchanging XML-encoded data and simulates Remote Procedure Calls over SMTP, FTP, TCP/IP, and HTTP. Currently, the major approaches to the discovery of required services are syntactic matching and semantic matching. Syntactic matching performs the matching based on data types and keywords, and it depends entirely on the completeness of the descriptions in WSDL files. This technique cannot discover a group of related web services which have no parameter-passing relationship, such as travel insurance quotes and air flight reservation. Therefore, semantic matching is exploited to expand the search ability; it adopts semantic web services [8], which represent web services with semantic descriptions of their interfaces and characteristics. Semantic matching work has been focused on the matching of functional
and non-functional properties of web services. This technique needs additional information to be added to the WSDL files and an ontology to be built, which is then used to reason about the objects within a specific domain. Based on the aforementioned issues, several research challenges are identified, including: (1) how to help a service requestor discover related services in a more efficient way; (2) how to discover related services that meet the user's requirements; (3) how to discover syntactic and semantic relationships among related services. In this paper, we utilize a sequential pattern mining method called PLWAP-tree mining [4] to help us discover time-ordered web services. It is an efficient algorithm for the discovery of sequential patterns, and it can generate sequential service patterns representing the execution order required to complete a task or process.
2 Related Work

We now present the related fields of research which have an influence on service mining. The most relevant fields are web mining and workflow mining, and we discuss the characteristics of these works. Workflow systems are exploited to manage and support business processes. The work in [1] presents the first mining algorithms for workflow logs and discovers dependencies between different activities. In [6], an algorithm is presented for learning workflow graphs that makes use of a coherent probability model. In [2], a web service interaction mining architecture (WSIM) is introduced to analyze interactions between web service consumers and providers. Web usage mining is the application of data mining methods to analyze and discover interesting patterns in users' usage data on the web. The usage data represent user behavior when users browse or make transactions on a web site. The discovered patterns help system developers understand the relationships among web pages and the behavior of particular users. In [5], the authors presented a scalable framework for web personalization based on sequential and non-sequential patterns mined from real usage data. The authors of [4] proposed an efficient approach to mining non-contiguous sequential patterns by using the WAP-tree; it eliminates the need to reconstruct a large number of intermediate WAP-trees during mining.
3 System Architecture

A brief overview of our system architecture is illustrated in Fig. 1. The goal is to provide web service users with composite services that satisfy their requirements for the completion of several tasks. Our system can be broken down into three parts: web logging, pre-processing, and pattern discovery. We set up a web service server which web service users use to invoke the required web services, and we build a monitoring program, called LogHandler, to monitor the web service server whenever a user invokes one of the web services.
Fig. 1. System architecture for discovery of composite service
Web logging is responsible for recording all events triggered by web service users; all the event log data are saved in the web usage data repository. Pre-processing is responsible for transforming the web usage data into a web service access sequence database. Finally, pattern discovery is responsible for mining the web service access sequence data and providing users with frequent service patterns. PLWAP-tree mining generates sequential patterns for web service users who tend to look for a set of web services in a sequential execution order to complete their tasks, e.g., the "AirBook then HotelBook" web services used in sequence. Finally, the pattern strings are returned as feedback to the users. We now explain how we monitor the web service server so that all events triggered by web service users are logged. In [2], the authors present five levels of web service logging, which vary in the richness of the logging information and in the additional development effort needed to implement the respective features. In our implementation of web service logging, we adopt level 3 logging, which is logging at the WS (Web Service) container level.
4 Service Mining for Composite Service Discovery

4.1 PLWAP-Tree Algorithm

The PLWAP-tree (Position Coded Pre-Order Linked WAP-tree) algorithm is more efficient than the WAP-tree algorithm [4] since it eliminates the need to re-construct intermediate WAP-trees during mining. This approach builds the frequent header node links of the original WAP-tree in a pre-order fashion and employs the position code of each node to identify the ancestor/descendant relationships between nodes of the tree.
Now, we demonstrate in the following sections how we apply PLWAP-tree mining to discover composite web services in the travel industry. There are three steps, as follows.

First, scan the web service access sequence database a first time to obtain the count of every operation. All operations with a count greater than or equal to the minsup are regarded as frequent. Table 1 lists 10 web service operations, which were generated by referencing the documents from the OTA [7]. In our working example in Table 2, after scanning the original web service access sequences we obtain the count of each operation ID. The minsup is 4 in this working example, so the non-frequent operations with IDs 3, 4, 6, and 7 are deleted from the sequences.

Table 1. An example of web services in the travel domain

Service Operation ID   Web Service Operation Name
1                      AirAvailable
2                      AirBook
3                      HotelAvailable
4                      HotelSearch
5                      HotelReservation
6                      VehicleReservation
7                      VehicleCancel
8                      InsuranceBook
9                      RailAvailable
10                     RailBook

Table 2. An example of a web service access sequence database

TID    Web Service Access Sequence   Frequent Subsequence
T100   1,2,5                         1,2,5
T200   1,2,8                         1,2,8
T300   1,2,4,5,8                     1,2,5,8
T400   9,10,4,5                      9,10,5
T500   9,10,5,6                      9,10,5
T600   9,10,1,2,6,8                  9,10,1,2,8
T700   1,2,6,8                       1,2,8
T800   1,2,8                         1,2,8
T900   9,10                          9,10
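This first step can be sketched in a few lines of Python; the data below are taken directly from Tables 1 and 2, while the code itself is only an illustration of the counting and filtering step, not the original implementation.

from collections import Counter

# The nine access sequences from Table 2, written with the operation IDs of Table 1.
sequences = [
    [1, 2, 5], [1, 2, 8], [1, 2, 4, 5, 8], [9, 10, 4, 5], [9, 10, 5, 6],
    [9, 10, 1, 2, 6, 8], [1, 2, 6, 8], [1, 2, 8], [9, 10],
]
MINSUP = 4

# Count in how many sequences each operation occurs and keep the frequent ones.
support = Counter(op for seq in sequences for op in set(seq))
frequent = {op for op, count in support.items() if count >= MINSUP}

# Frequent subsequences: drop non-frequent operations while keeping the original order.
frequent_subsequences = [[op for op in seq if op in frequent] for seq in sequences]
print(sorted(frequent))          # [1, 2, 5, 8, 9, 10]
print(frequent_subsequences[2])  # [1, 2, 5, 8] -- matches T300 in Table 2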
Second, construct the PLWAP-tree. The algorithm scans the frequent subsequences one after another and inserts them into the PLWAP-tree. The construction of the PLWAP-tree is almost the same as that of the FP-tree [3]; the difference is that a position code is assigned to each node of the PLWAP-tree during construction. Once the PLWAP-tree is completely constructed, the construction algorithm traverses the tree to build a pre-order linkage of the frequent header nodes. The complete PLWAP-tree is depicted in Fig. 2. Third, generate the sequential patterns. In this step we focus on how to mine sequential patterns from the previously constructed PLWAP-tree. PLWAP-tree mining
Fig. 2. The PLWAP-tree. Nodes are labelled ID = x : count together with their binary position codes: the left subtree of the root holds ID = 1:5 (code 1) → ID = 2:5 (11), whose children are ID = 5:2 (111) and ID = 8:3 (1110), with ID = 5:2 having the child ID = 8:1 (1111); the right subtree holds ID = 9:4 (10) → ID = 10:4 (101), whose children are ID = 5:2 (1011) and ID = 1:1 (10110) → ID = 2:1 (101101) → ID = 8:1 (1011011). Pre-order header links connect the frequent operations ID = 1, 2, 8, 5, 9, 10.
searches for access patterns with the same prefix. The algorithm starts finding frequent sequences with the frequent 1-sequences in the set of frequent operations {ID=1, ID=2, ID=5, ID=8, ID=9, ID=10}. For every frequent operation and the suffix trees of the conditional PLWAP-tree currently being mined, it follows the linkage of this operation to find its first occurrence in every current suffix tree and adds up the support counts of all these first occurrences. If the count is greater than or equal to the minsup, the operation is appended to the frequent sequence found so far. The suffix trees rooted at these first occurrences in the previously mined conditional suffix PLWAP-trees are then, in turn, used for mining the next operation. Note that the conditional suffix PLWAP-tree obtained during the mining process does not physically exist; to obtain it, we only need to remember the roots of the current suffix trees, which are stored for the next mining round. Finally, after the third step, we obtain the frequent sequence set {(1,2), (1,2,8), (1,2,5), (1,8), (1,5), (2,8), (2,5), (9,10), (9,10,5), (9,5), (10,5)} when the minsup is 2.

4.2 Pattern Analysis

PLWAP-tree mining generates frequent non-contiguous sequential patterns that assist us in analyzing more detailed information about how these web services
coordinate in sequential execution order. For example, the sequence (1,2) means that the operation ‘ID=1’ is followed by the operation ‘ID=2’. The sequential patterns with the highest support counts can be used to compose a composite service or can be organized into a business process. At the same time, we focus our attention on non-contiguous sequential patterns, since they help us filter out operations that are not used very frequently.
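As a cross-check of the patterns reported above, the frequent sequences can also be obtained by brute force. The sketch below is not the PLWAP-tree algorithm (it enumerates every order-preserving subsequence, which is exponential in the sequence length and only acceptable for tiny examples such as this one); run on the frequent subsequences of Table 2 with minsup = 2 it returns exactly the eleven patterns listed in Section 4.1.

from itertools import combinations
from collections import Counter

def frequent_subsequences(sequences, minsup):
    """Brute-force counterpart of the PLWAP mining step: count every non-contiguous
    subsequence of length >= 2 and keep those whose support (the number of sequences
    containing it) reaches minsup."""
    counts = Counter()
    for seq in sequences:
        subs = set()
        for length in range(2, len(seq) + 1):
            subs.update(combinations(seq, length))   # combinations preserve the order
        counts.update(subs)
    return {sub for sub, count in counts.items() if count >= minsup}

# Frequent subsequences of Table 2 (operations 3, 4, 6 and 7 already pruned).
db = [[1, 2, 5], [1, 2, 8], [1, 2, 5, 8], [9, 10, 5], [9, 10, 5],
      [9, 10, 1, 2, 8], [1, 2, 8], [1, 2, 8], [9, 10]]
print(sorted(frequent_subsequences(db, minsup=2)))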
5 Experiments

We simulate two cases for the discovery of frequent usage patterns and discuss how we generate the web usage log. Finally, we discuss the performance and the pattern analysis. In case 1, we prepare 30 web services in the travel domain, such as AirBook, a web service for online air booking, by referencing the documents on the OTA [7] web site. In case 2, we prepare 15 web services in the entertainment domain, such as SearchMP3Files, a web service for searching MP3 files. We generate a synthetic web usage log to evaluate the performance. These transactions simulate the transactions in a web service environment. Our model of the “real” world is that people tend to use certain sets of web services together. To create the web usage log, our synthetic data generation program takes the parameters shown in Table 3.

Table 3. Web usage log for sequential pattern mining

Name     Number of Transactions   Average size of the transactions   Size in KB
Case 1   5000                     5                                  218
Case 2   10000                    2                                  231
We compare the execution times obtained with different minimum supports and then report the number of sequential patterns found for each minsup. To assess the performance, we performed several experiments on a Pentium PC with a CPU clock rate of 3.0 GHz and 1.5 GB of main memory in dual channel mode; the data resided in the local file system. For case 1 and case 2, we show the execution time and the number of patterns in Table 4.

Table 4. Experimental results

                    Case 1                                    Case 2
minsup    Execution time (ms)   Number of patterns    Execution time (ms)   Number of patterns
0.005     734                   3599                  484                   120
0.008     515                   550                   437                   120
0.02      500                   465                   422                   15
0.025     485                   462                   422                   15
6 Conclusions

In this paper, we utilize the PLWAP-tree algorithm for discovering time-ordered service patterns. Our experiments show that the approach can generate such service patterns for web service users, and each service pattern can be regarded as a composite service for users to select. We also learn which kinds of web services are highly related to which other kinds of web services. In addition, existing business processes can be restructured according to the mined patterns, or the newly generated patterns can be organized into new business processes. The results of our implementation show that the approach is useful, efficient and feasible.
References

1. Agrawal, R., Gunopulos, D., Leymann, F.: Mining Process Models from Workflow Logs. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 469–483. Springer, Heidelberg (1998)
2. Gombotz, R., Dustdar, S.: On Web Services Workflow Mining. In: Proc. of the BPI Workshop (LNCS), pp. 216–228 (2006)
3. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Proc. ACM SIGMOD, pp. 1–12 (2000)
4. Lu, Y., Ezeife, C.I.: Position Coded Pre-Order Linked WAP-Tree for Web Log Sequential Pattern Mining. In: Proc. PAKDD, pp. 337–349 (2003)
5. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Using Sequential and Non-Sequential Patterns in Predictive Web Usage Mining Tasks. In: Proc. ICDM, pp. 669–672 (2002)
6. Silva, R., Zhang, J., Shanahan, J.G.: Probabilistic Workflow Mining. In: Proc. ACM SIGKDD, pp. 275–284 (2005)
7. http://www.opentravel.org/
8. http://www.w3.org/2001/sw
9. http://www.w3.org/TR/soap
10. http://www.w3.org/TR/wsdl
11. http://www.uddi.org/
An Adaptive Grid-Based Approach to Location Privacy Preservation Anh Tuan Truong, Quynh Chi Truong, and Tran Khanh Dang Faculty of Computer Science and Engineering Ho Chi Minh City University of Technology, Vietnam {anhtt,tqchi,khanh}@cse.hcmut.edu.vn
Abstract. Location privacy protection is a key factor in the development of location-based services. Location privacy relates to the protection of a user’s identity, position, and path. In a grid-based approach, the user’s position is obfuscated within a number of cells. However, a fixed grid does not allow users to adjust the cell size, which corresponds to a minimum privacy level; therefore, it is hard to satisfy the varying privacy requirements of different users. This paper proposes a flexible-grid-based approach as well as an algorithm to protect the user’s location privacy, in which the user can conveniently customize his grid according to his privacy requirements. The overlap-area problem is also addressed in the algorithm. By investigating our solution in depth, we also identify open research issues that must be solved to make the solution feasible in practice. Keywords: Location-based Services, Location Privacy, Privacy Preserving, Adaptive Grid, Privacy Attack Models.
1 Introduction

The rapid development of location-based services (LBS) brings both opportunities and challenges for users and service providers. LBS are services that make use of the location information of users [2, 8]. The opportunity is that users can benefit from the services while the service providers can earn more profit. However, by using these services, users face privacy problems because their private data is attractive to attackers. Location privacy can be defined as the right of individuals, groups, and institutions to determine for themselves how, when, to whom, and for which purposes their location information is used [3, 2, 5]. Consequently, service providers bear a greater responsibility to protect the user’s private information, especially location privacy. Location privacy protection is therefore an emerging topic that has attracted the interest of many researchers [7]. In [1], we proposed an approach that preserves location privacy by cloaking the user’s location in a grid-based map. However, we only designed the solution for a fixed grid, meaning that the size of each cell in the grid is predefined. This is not convenient for users, as each user has a different privacy requirement. Thus,
in this paper we improve the previous solution by proposing an algorithm that works on a flexible grid, i.e., a grid that allows users to change the cell size. The rest of this paper is organized as follows. In section 2, we briefly summarize the related work and the problems of the fixed-grid-based approach. Next, section 3 presents our improved approach for preserving privacy in LBS with an adaptive grid. We discuss various open issues of our solution in section 4. Finally, section 5 presents concluding remarks as well as our future work.
2 Related Works

2.1 Anonymity-Based Technique

Location privacy is classified into three categories: identity privacy, position privacy and path privacy [2]. Identity privacy is about protecting users’ identities from disclosure to attackers, position privacy is about hiding the true position of users from attackers, and path privacy is about protecting the information related to users’ movements. For identity privacy, there are solutions such as the anonymity-based technique and the grouping technique. In the anonymity-based technique [5, 6], the user uses a false identity to remain anonymous when he calls the services. In the grouping technique, users gather in a group and one of them acts as a deputy to send the request to the service providers. In this way, it is hard to identify the one who really issues the request. For the second category, position privacy, the main approach is to obfuscate the user’s true position. In other words, the true position of the user is blurred to decrease its accuracy. There are numerous techniques to obfuscate a position, such as enlarging, shifting, and reducing. For more details, refer to [2]. The last privacy category can be violated if an attacker discovers the user’s path by monitoring the user’s requests over a period of time. However, path privacy can be protected by applying both identity and position privacy preserving techniques, which make it hard for the attacker to infer the path by linking the requests.

2.2 Fixed Grid-Based Solution and Problems

The user may want to use the service many times. Each time he wants to use the service, he sends his true location to the service server. This location should be hidden in a region to protect the user’s privacy. Attackers may wait and catch this region. Clearly, if attackers catch more regions, they can find the user’s location more easily. In figure 1, the user uses the service three times and regions R1, R2, R3 are created. Attackers can catch the three regions and limit the area that contains the user’s location to the colored area. In [1], a fixed grid-based approach for the trusted party architecture was introduced to solve the above problem. With this approach, the user’s location is hidden in an area which consists of cells of a grid, called the anonymization rectangle. With this grid, it is simple to satisfy the required privacy level of the user.
Fig. 1. Randomization approach problem
When the user requires a higher privacy level, the anonymization rectangle is extended, and when the user requires a lower level, the rectangle becomes smaller. However, with this approach the middleware uses the same grid to anonymize every user’s location, while each user has a different required level of privacy: if the cell size is too small, it is not enough to preserve the location of the user; conversely, the quality of service suffers if the cell is too big. It is therefore difficult to decide how big a cell should be. To solve this problem, we could combine cells to form a bigger cell or split a cell into new smaller cells. For example, if the user wants a small cell, the middleware splits the default cell into several new cells; otherwise, if the user wants a big cell, some default cells are combined into a new cell. In this paper, we propose another solution: to design a grid whose cells can be resized. Each time the user wants to use the service, the grid is redesigned to meet the user’s requirement. In the next section, we discuss the details of this solution, which we call the adaptive grid-based solution.
3 Adaptive Grid-Based Solution

In the previous section, we showed the problems of the fixed grid-based solution. In this section, we introduce an adaptive grid-based approach in which the grid cell size can change. The size of the grid cell is established depending on the requirement of the user. The trusted middleware chooses an anonymization area that contains a number of grid cells, the user’s location lying in one of these cells. First, we consider the definition of this grid.

3.1 Definitions

As defined in [1], a grid G divides a map into cells; a cell is not necessarily square-shaped, it can be a rectangle, but all cells together cover the whole space. Differently from [1], the grid cell size in this solution can vary. To divide a map into cells, a starting point S is needed. Similarly to [1], an anonymization area is a square that contains some cells, and the location of the user (point U) lies in this area. Figure 2 shows a grid and an anonymization area; h and w are the height and the width of a grid cell. In this paper, h and w are equal for simplicity:
Fig. 2. Grid (a) and Anonymization area (b)
Given the starting point, the middleware creates the grid from the height and the width of the grid cell. For example, with height and width w1 the grid is as shown in Figure 3a, and with height and width w2 the grid is as shown in Figure 3b.
Fig. 3. Two grids with the starting point S
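To make the definitions concrete, the following minimal sketch shows how a middleware could map a user position to its grid cell for a grid anchored at the starting point S with square cells of size w, and form one possible k*k anonymization area around it. The function names and the (non-random) placement of the area are illustrative assumptions only.

import math

def cell_of(user_x, user_y, start_x, start_y, w):
    """(column, row) index of the grid cell containing the user, for a grid anchored
    at the starting point S = (start_x, start_y) with square cells of size w (h = w)."""
    return (math.floor((user_x - start_x) / w), math.floor((user_y - start_y) / w))

def anonymization_area(user_x, user_y, start_x, start_y, w, k):
    """One possible k*k anonymization area, as a cell-index rectangle that contains
    the user's cell; a real middleware would randomize the placement of the area."""
    col, row = cell_of(user_x, user_y, start_x, start_y, w)
    return (col, row, col + k - 1, row + k - 1)    # (min_col, min_row, max_col, max_row)

print(cell_of(12.5, 7.2, 0.0, 0.0, 5.0))                  # -> (2, 1)
print(anonymization_area(12.5, 7.2, 0.0, 0.0, 5.0, 3))    # -> (2, 1, 4, 3)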
3.2 Architecture

Similarly to [1], a middleware architecture with a trusted middleware is used to implement this solution. Figure 4 shows this architecture:
Fig. 4. Trusted Middleware Architecture
In this solution, the grid is maintained in the middleware. When the user wants to use a service, he sends his requirement information to the middleware. The requirement information includes the user’s location, the required level of privacy and the required cell size. According to the required cell size, the middleware creates a grid for this user. Then, the middleware chooses cells from this grid to form an anonymization area according to the user’s required privacy. The middleware sends this anonymization area to the service server and receives the results. Finally, the middleware filters out the relevant results and returns them to the user. In Figure 5a, the user sends his requirement information: his true location U, a required level of privacy of 9 cells and a required cell size of w1. The
middleware creates the grid as in Figure 5a, chooses 9 cells to form a 3*3 anonymization area and sends this area to the service server. The anonymization area is coloured in Figure 5a. At another time, the user uses the service again but the required cell size is w2. The grid in Figure 5b is created and the anonymization area is coloured in Figure 5b. We see that the square in Figure 5a is not equal to the square in Figure 5b although the required level of privacy is the same.
Fig. 5. Two anonymization areas with different requirement information
Additionally, the middleware should have a default grid, so the user does not need to send a required cell size if he wants to use the default grid. This case is the one considered in [1]; in other words, the case in [1] is a particular case of the adaptive grid-based solution.

3.3 Overlapping Problems with the Adaptive Grid-Based Solution

Because the fixed grid is a special case of the adaptive grid, the adaptive grid also inherits the problems of the fixed grid-based solution. These problems were mentioned in [1] and have been solved by the memorizing algorithm. Moreover, with the adaptive grid-based solution, the grid can be redesigned at different times when the user uses services, so the anonymization area can be narrowed down. See the example in figure 6: at time t1 the user uses the service and defines the grid cell with size w1. He also requires a privacy level of 9 cells. The anonymization area is R1. Later, the user uses the service again but redefines the grid cell with size w2. The required privacy level is unchanged. The middleware creates the anonymization area R2.
Fig. 6. Overlapping problem
We see that the overlapping area can limit the anonymization area to R3. Attackers can then easily determine the area containing the location of the user. As in [1], the smallest area to which the default grid can be limited is a single grid cell. However, with the adaptive grid we cannot determine the smallest area to which it can be limited, because the grid is redesigned each time the user uses the services. Clearly, if the user uses the service many times, the overlapping area can become ever smaller. In the next section, we introduce an algorithm to solve these problems. Like the memorizing algorithm, this algorithm also requires that the middleware has a database in which to save the anonymization areas.

3.4 Algorithm

Let us review the problems of the previous section. The first time, the user uses the service and the middleware chooses the anonymization area; but at the second time and later, the grid cell size can be changed, so the anonymization area may change as well. Because the grid cell can be resized, the anonymization areas may not overlap totally. This partial overlap is the cause of the problems: the smaller the overlap, the more easily attackers can find the location of the user. To solve these problems, we should combine information from the previous times the user used the service. The middleware combines information from the user’s previous uses to create the anonymization area for the current request. The mechanism of this algorithm is as follows:
- The user sends his requirement information to the middleware when he uses the service. As discussed before, the requirement information includes his true location, the required grid cell size and the required level of privacy.
- The middleware receives the user’s requirement information. Using the starting point, it creates the grid according to the required grid cell size. Then, it queries the database to check whether this is the first time the user uses this service:
  • If this is the first time, depending on the location of the user and the required level of privacy, the middleware chooses the anonymization area and saves it to the database for future reference.
  • If not, the middleware retrieves all information from previous uses and combines it with the current information to choose an appropriate anonymization area. Then, it saves the current information to the database.
- The middleware sends the anonymization area that has just been created to the service server.
- The middleware receives the returned results, chooses the acceptable results and returns them to the user.
Clearly, the anonymization area for the current request should totally overlap with the previous areas. The total overlap helps us against attackers mining information to find the user’s location. Indeed, if an anonymization area created at the second time or later does not overlap the first anonymization area totally, attackers can limit the area which contains the true location of the user. We discussed this problem in section 3.3. In the case of Figure 7a, the anonymization area R1 overlaps only partially with the anonymization area R2, so attackers
Fig. 7. Partial overlap area (a) and Total overlap area (b)
can limit the area that contains the user’s location to R3. In this case, a total overlap, as in Figure 7b, is better. However, a total overlap may not be achievable every time. As discussed before, the grid cell size may change, so the larger anonymization area may not be able to fully cover the smaller one. For more detail, consider the example in Figure 8a: at the first time, the required privacy level is 9 cells and anonymization area R1 is created; at the second time, the required privacy level is 16 cells, grid G2 is created and anonymization area R2 is chosen.
Fig. 8. Example for overlap area (a) and Maximal overlap area (b)
We see that we cannot choose the anonymization area R2 so that it overlaps R1 totally. In this case, the algorithm should choose the anonymization area R2 so that the overlap area between R2 and R1 is maximal. The maximal overlap area is R3 in Figure 8b. As seen in the mechanism above, when an anonymization area is created and sent to the service server, the middleware also saves this anonymization area to its database for future reference. However, what information needs to be saved? To choose the anonymization area, the middleware considers all previous anonymization areas and chooses the anonymization area for the current request so that the overlap area is maximal. In principle, the middleware should therefore save all previous anonymization areas. Intuitively, however, we only need the information about the last overlap area. So, the middleware chooses a new anonymization area such that the new overlap area between this anonymization area and the last overlap area is maximal.
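The overlap between two anonymization areas is an ordinary rectangle intersection; a small helper of the following kind (our own sketch, with areas given as (min_x, min_y, max_x, max_y) in map coordinates) is all that the maximal-overlap selection needs.

def overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles; 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)

# A 3x3 area and a 4x4 area that share a 2-by-3 corner overlap in an area of 6.
print(overlap_area((0, 0, 3, 3), (1, -1, 5, 3)))   # -> 6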
Fig. 9. Very small overlap area
Clearly, a partial overlap limits the space that contains the user’s location. In the examples above, the maximal overlap is acceptable because the space that contains the user’s location is still big enough. However, not every maximal overlap area is acceptable. Consider the example in Figure 9: anonymization area R1 is created the first time and R2 the second time the user uses the service. In this case, the maximal overlap area, the intersection of R1 and R2, is R3. The user actually wants the space containing his true location to be R2 at the second time; however, attackers can find out that the true location of the user lies in R3, and R3 is very small compared with R2. Intuitively, to solve this problem we could define a minimal anonymization area: when the maximal overlap area at the current time is smaller than the minimal anonymization area, the middleware chooses the previous maximal overlap area and sends that area to the service server. However, it is difficult to decide the size of this area, because we cannot know how big the minimal anonymization area should be. We also propose another approach to this problem: using a roving starting point. The idea of this approach is as follows:
- When the user uses the service for the first time, the middleware saves to its database the information about the four vertexes of the anonymization area. In Figure 10, they are the vertexes A, B, C and D.
- At the second time and later, the middleware chooses one of the four vertexes as the starting point. It creates a new grid according to the new starting point and returns the anonymization area.
Fig. 10. Roving starting point
As shown in Figure 10a, the vertex A is chosen as the new starting point. The new grid is created and the new anonymization area (R2) totally overlaps with the previous anonymization area (R1). In Figure 10b, the vertex C is chosen as the starting point and the anonymization area is R2; it also overlaps totally with R1. The details of this approach are left as future work. In short, we can describe the algorithm in pseudo code as follows:

Create the grid according to the user’s requirement information;
if (this is the first time the user uses the service) {
  Get a random anonymization area which contains the true location of the user;
  Save this anonymization area;
  Send this area to the service server;
} else {
  Query the last maximal overlap area of the user;
  Perform the overlap_area_getting function;
  Save the anonymization area which has just been found by the overlap_area_getting function;
  Save the maximal overlap area;
  Send this area to the service server;
}

In this algorithm, the overlap_area_getting() function is very important. The goal of this function is to find a new anonymization area such that the overlap area between this area and the last maximal overlap area is maximal. The mechanism of this function is as follows:
- Query the last maximal overlap area from the middleware’s database.
- Based on the grid that has just been created and the required privacy level of the user, enumerate candidate anonymization areas of the required size. The condition is that these anonymization areas must contain the true location of the user.
- Choose the candidate anonymization area whose overlap with the last maximal area is the biggest.
- Return the anonymization area that has just been found and the new maximal overlap area.
To limit the number of candidate anonymization areas, we notice that these areas must contain the location of the user. So we start at the cell containing the location of the user and go forward in four directions from this cell, as in Figure 11. In each direction, we choose the cells that are “the most suitable”. Consider the example in Figure 11a: the starting cell is cell 1. Assume that we want to get a 2*2 anonymization area. For the width, the two cells 2 and 5 are considered; we choose cell 2 because the overlap area between cell 2 and the last maximal area is bigger. For the height, the process is similar to that for the width. The anonymization area with cells 1, 2, 3, 4 is the best 2*2 anonymization area. Another example is given in Figure 11b; in this case, we want to choose a 3*3 anonymization area. In step 1, similarly to Figure 11a, cell 5 and cell 2
Fig. 11. Roving starting point
are considered and we choose cell 2. In the next step, cell 5 and cell 7 are considered and cell 5 is chosen. The process for the width then stops because three cells have been chosen. Next, the process for the height is started; it is similar to the width process. Finally, we can see that an efficient data structure is important: over time, the number of anonymization areas stored in the database grows, so a suitable data structure for saving these anonymization areas is needed.

3.5 Measures of Quality

The main requirements for location cloaking are Accuracy, Quality, Efficiency and Flexibility, as stated in [14]:
- Accuracy: the system must satisfy the requirement of the user as accurately as possible.
- Quality: the attacker cannot find out the true location of the user.
- Efficiency: the computation for the location cloaking should be simple.
- Flexibility: the user can change his privacy requirement at any time.
However, these criteria have to be traded off against each other: requiring the best quality, for example, increases the complexity of the computation, and so on. In our approach, the user can specify the level of privacy needed to protect his location. The middleware chooses the anonymization area that hides the true location of the user according to the user’s privacy level. Furthermore, the user can define the smallest area (the cell size) or use the default cell, and he can change his required level of privacy at any time he wants to use the service. Thus, the approach can easily satisfy the privacy requirement of the user. When the user wants a high level of privacy, the middleware expands the anonymization area that contains the true location of the user; conversely, the anonymization area is smaller if a lower level of privacy is required. Because the true location of the user is embedded in an area, it is difficult to find it; when the anonymization area is big enough, the attacker has to spend considerably more effort to find out the true location of the user. Moreover, we note that the overlap_area_getting() function is the main function and takes the most time to finish. This function finds the anonymization area whose overlap with the last overlap area is the largest. The function uses two loops, one for finding cells in the vertical direction and one for the horizontal direction, so the complexity of this function is O(n).
Besides, the complexity of the algorithm also depends on the database access; as discussed before, a suitable data structure for saving the anonymization areas is needed to decrease the complexity of this algorithm.
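A rough Python rendering of the overlap_area_getting() function is given below. It is only a sketch under our own assumptions: anonymization areas are axis-aligned k*k squares of the current grid expressed in map coordinates, and instead of growing the area in four directions as described in section 3.4 it simply enumerates every k*k placement that contains the user's cell (O(k^2) candidates) and keeps the one with the largest overlap; the selected area is the same.

def intersection(a, b):
    """Intersection of two axis-aligned rectangles (min_x, min_y, max_x, max_y),
    or None when they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def size(rect):
    return 0 if rect is None else (rect[2] - rect[0]) * (rect[3] - rect[1])

def overlap_area_getting(user_cell, k, w, start, last_overlap):
    """Choose the k*k anonymization area containing the user's grid cell whose
    overlap with the last stored overlap area is maximal; return both the area
    and the new maximal overlap rectangle."""
    (ucol, urow), (sx, sy) = user_cell, start
    best_area, best_ov = None, None
    for dc in range(k):                 # how many columns the area extends left of the user cell
        for dr in range(k):             # how many rows it extends below the user cell
            c0, r0 = ucol - dc, urow - dr
            cand = (sx + c0 * w, sy + r0 * w, sx + (c0 + k) * w, sy + (r0 + k) * w)
            ov = intersection(cand, last_overlap)
            if best_area is None or size(ov) > size(best_ov):
                best_area, best_ov = cand, ov
    return best_area, best_ov

# User in cell (4, 4) of a grid with w = 2; the previous overlap area was [0,10]x[0,10].
print(overlap_area_getting((4, 4), k=3, w=2.0, start=(0.0, 0.0),
                           last_overlap=(0.0, 0.0, 10.0, 10.0)))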
4 Open Research Issues

In the previous sections, we introduced a new research approach that applies an adaptive grid to the middleware architecture to preserve the privacy of the user. This new approach also opens up new research issues. As discussed before, an adaptive grid with a fixed starting point results in some problems; therefore, designing an adaptive grid with a roving starting point would allow the middleware to protect the privacy of the user sufficiently. In some cases, the chosen anonymization area will not be “big” enough to hide the location of the user. For example, assume that the anonymization area includes four cells: cell 1, cell 2, cell 3 and cell 4. If cell 1, cell 2 and cell 3 are regions where the user cannot be, for example a lake or a swamp, attackers can limit the area which contains the user’s location to cell 4. A new direction, investigating an algorithm or a method to eliminate anonymization areas which contain such “dead” regions, should be considered. The probability P of a chosen anonymity area should be:
P = Σ_{i=1}^{required_privacy_level} Pi + R        (1)
Here P is the probability that an anonymity area contains no “dead” regions, Pi is the probability that cell i is not a “dead” region, and R is the priority of this anonymity area. R should depend on the overlap between this anonymization area and the anonymization areas of previous uses. When the middleware has to choose an anonymity area, it chooses the one with the biggest value of P. Combining the grid approach with an algorithm that helps to determine Pi and R would increase the efficiency of protecting the user’s privacy. Again, we note that the time needed to carry out the algorithm also depends on the database structure, so an efficient data structure is needed. An efficient data structure helps us to save the anonymization areas compactly and reduces the time needed to retrieve them when the middleware queries the database. A new direction in designing such a data structure should also be considered.
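Read literally, formula (1) yields a simple scoring routine. The sketch below assumes that the per-cell probabilities Pi and the priority R are supplied by some external knowledge source; how to obtain them is exactly the open issue raised here.

def area_score(cell_probabilities, overlap_priority):
    """Score P of a candidate anonymity area: the sum of the probabilities that each
    of its cells is not a 'dead' region, plus the priority R, as in formula (1)."""
    return sum(cell_probabilities) + overlap_priority

def choose_area(candidates):
    """Pick the candidate with the largest P; each candidate is (Pi values, R, label)."""
    return max(candidates, key=lambda c: area_score(c[0], c[1]))[2]

# Two 4-cell candidates: the second loses some overlap priority but avoids a
# lake cell (probability 0.0), so it obtains the higher score.
candidates = [([1.0, 0.0, 1.0, 1.0], 0.9, "area A"),
              ([1.0, 1.0, 1.0, 1.0], 0.5, "area B")]
print(choose_area(candidates))   # -> area B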
5 Conclusions

In this paper, we proposed a flexible grid and an algorithm working on this grid to anonymize the location of the user. This solution gives the user the right to adjust the size of a cell, which corresponds to the minimum privacy level, to meet the user’s requirement. In the algorithm, we covered all possible situations that can occur when users resize the cells. Moreover, we also proposed a solution for the overlap-area problems.
This approach can be applied in many fields such as health care, work and personal life. In these services, users do not use the services directly; they send their requests to a trusted middleware provided by a trusted third party. The middleware is responsible for protecting the user’s location according to the user’s requirement. In the future, we will investigate all research directions discussed in the previous section to make our solution more applicable in real life.
References

1. Truong, Q.C., Truong, T.A., Dang, T.K.: Privacy Preserving through A Memorizing Algorithm in Location-Based Services. In: 7th International Conference on Advances in Mobile Computing & Multimedia (2009)
2. Ardagna, C.A., Cremonini, M., Vimercati, S.D.C., Samarati, P.: Privacy-enhanced Location-based Access Control. In: Michael, G., Sushil, J. (eds.) Handbook of Database Security – Applications and Trends, pp. 531–552. Springer, Heidelberg (2008)
3. Beresford, A.R., Stajano, F.: Location privacy in pervasive computing. IEEE Pervasive Computing, 46–55 (2003)
4. Beresford, A.R., Stajano, F.: Mix zones: User privacy in location-aware services. In: 2nd IEEE Annual Conference on Pervasive Computing and Communications Workshops (2004)
5. Bettini, C., Wang, X., Jajodia, S.: Protecting privacy against location-based personal identification. In: 2nd VLDB Workshop on Secure Data Management (2005)
6. Bettini, C., Mascetti, S., Wang, X.S.: Privacy Protection through Anonymity in Location-based Services. In: Michael, G., Sushil, J. (eds.) Handbook of Database Security – Applications and Trends, pp. 509–530. Springer, Heidelberg (2008)
7. Bugra, G., Ling, L.: Protecting Location Privacy with Personalized k-Anonymity: Architecture and Algorithms. IEEE Transactions on Mobile Computing (2008)
8. Cuellar, J.R.: Location Information Privacy. In: Srikaya, B. (ed.) Geographic Location in the Internet, pp. 179–208. Kluwer Academic Publishers, Dordrecht (2002)
9. Gidófalvi, G., Huang, X., Pedersen, T.B.: Privacy-Preserving Data Mining on Moving Object Trajectories. In: 8th International Conference on Mobile Data Management (2007)
10. Gruteser, M., Grunwald, D.: Anonymous usage of location-based services through spatial and temporal cloaking. In: 1st International Conference on Mobile Systems, Applications, and Services (2003)
11. Kupper, A.: Location-based Services - Fundamentals and Operation. John Wiley & Sons, Chichester (2005)
12. Langheinrich, M.: A Privacy Awareness System for Ubiquitous Computing Environments. In: 4th International Conference on Ubiquitous Computing, pp. 237–245 (2002)
13. Marco, G., Xuan, L.: Protecting Privacy in Continuous Location-Tracking Applications. IEEE Computer Society, Los Alamitos (2004)
14. Mohamed, F.M.: Privacy in Location-based Services: State-of-the-art and Research Directions. In: IEEE International Conference on Mobile Data Management (2007)
15. Myles, G., Friday, A., Davies, N.: Preserving Privacy in Environments with Location-Based Applications. IEEE Pervasive Computing, 56–64 (2003)
16. Panos, K., Gabriel, G., Kyriakos, M., Dimitris, P.: Preventing Location-Based Identity Inference in Anonymous Spatial Queries. IEEE Transactions on Knowledge and Data Engineering (2007)
View Driven Federation of Choreographies Amirreza Tahamtan1 and Johann Eder2 1
Vienna University of Technology, Dept. of Software Technology & Interactive Systems, Information & Software Engineering Group [email protected] 2 Alpen-Adria University of Klagenfurt, Dept. of Informatic-Systems, Austria [email protected]
Abstract. We propose a layered architecture for choreographies and orchestrations of web services. The proposed architecture uses the concept of process views. The distributed nature of the model and the concept of views improve the privacy of business partners but do not limit their interaction capabilities, an essential feature in B2B and interorganizational applications. Our approach enables description of business processes in different levels of detail with a uniform modeling language and is fully distributed. Keywords: Web Service Composition, Choreography, Orchestration, Process View, Interorganizational Process.
1 Introduction
Web Services enable application development and integration over the Web by supporting interactions within and across the boundaries of cooperating partner organizations. The two most commonly used concepts in the realm of Web Service composition are choreographies and orchestrations. An orchestration belongs to and is controlled by one partner and describes an executable process which is run by its owner. A partner’s internal logic and business know-how are contained in his orchestration. An orchestration is solely visible to its owner, and other external partners have no view on and no knowledge about this orchestration. An orchestration is a process viewed only from the perspective of its owner. Different languages such as WS-BPEL executable processes [1] or BPML [2] can be used for the definition of orchestrations. On the other hand, a choreography is a non-executable, abstract process that defines the message exchange protocol between partners. A choreography defines the collaboration among the involved partners. Exchanged messages are visible to all participants of a choreography. External parties who are not part of a choreography are not able to view and monitor the messages and have no view on the choreography. A choreography has no owner or super user in charge of control, and all involved partners are treated equally. A choreography is a process definition from a global perspective shared among all involved partners [13].
Fig. 1. A typical scenario of Web Service composition: three partners each own an orchestration, take part in the shared choreography, and realize their parts of it in their orchestrations
A typical scenario [3,5,6] of web service composition assumes one choreography shared among several partners, where each partner realizes its parts of the choreography in its orchestration. The shared choreography defines the communication among the orchestrations. This scenario is depicted in figure 1. Imagine a procurement scenario whose participants are a buyer, a seller and a shipper. Figure 2 shows such a choreography. Note that in this work choreographies and orchestrations are modeled as workflows, where nodes represent activities and edges the dependencies between activities. For the metamodel refer to section 3. This process represents a simple scenario; a real-life business process is more complex and includes, e.g., exception handling mechanisms. The partners’ orchestrations have additional activities which are not contained in the shared choreography. The buyer’s orchestration is depicted in figure 3: before making a request for quote, the buyer searches for available sellers for the requested item and then selects one, the activities Search sellers and Select seller in the buyer’s orchestration.
2 View Driven Federated Choreographies
The typical scenario explained above, one shared choreography and a set of private orchestrations, misses an important facet. The presence of only one choreography is not fully adequate for all real-life applications. Imagine a web shopping scenario. When shopping online, a buyer takes part in a choreography whose partners are the buyer, a seller company like Amazon, a credit card
Fig. 2. The shared choreography between buyer, seller and shipper
Fig. 3. The buyer’s orchestration
provider like Visa and a shipper company like FedEx. The buyer knows the following partners and steps: the buyer orders something from the seller, pays by credit card and expects to receive the items from a shipper. At the same time, the seller takes part in several other choreographies which are not visible to the buyer, e.g. the seller and the credit card company are involved in a process for handling the payment through a bank. Furthermore, the seller and the shipper realize another protocol they have agreed upon, containing other actions such as a money transfer from the seller’s bank to the shipper for balancing the shipment charges. As this example shows, more than one choreography may be needed to reach the goals of a business process. Besides, two partners involved in one choreography may also take part in another choreography that is not visible to the other partners of the first choreography, yet essential for the realization of the business goals. All these choreographies overlap in some parts but cannot be composed into a single global choreography. Moreover, such choreographies must be realized by the orchestrations of the partners that take part in them. In the above example, the seller implements an orchestration enacting the different interaction protocols with the buyer, the shipper and the credit card company. Even if the combination of all choreographies into one choreography were possible, the separation offers obvious advantages. To overcome these restrictions, a new architecture and a novel approach, called view driven federated choreographies, is proposed. This architecture extends our previous work [7] with the concept of views. The consideration of views has several advantages:
Improvement of privacy: Views improve the privacy of partners in a business process. By using views, partners can decide which parts of their internal private process are exposed to external partners; there is no need to expose the whole internal logic, which prevents cooperating partners from becoming competitors.
Serving different (groups of) partners: A single private process can have many views, and each view can be used for the interaction with a partner or a group of partners. By applying views, one single underlying private process can serve different groups of partners.
Interaction compatibility: Partners can change and modify their private processes without any effect on the interaction with other partners, as long as the views remain the same, i.e. they can be defined on the new processes as well. In this way the interaction can be performed consistently and there is no need to inform other partners about changes in the private process.
Each partner has an orchestration for the realization of its tasks and takes part in one or more choreographies. Each partner has a view on his orchestration. A view is not only a view on an orchestration but also a view on the shared choreography: it identifies the parts of the shared choreography that belong to a specific partner and that this partner is in charge of realizing. By realizing the activities that belong to him, the partner exhibits
Fig. 4. The main idea of the view driven federated choreographies
conformant behavior with respect to the agreed-upon choreography. A view shows a single partner’s perspective on the choreography and can be used as a skeleton for designing the partner’s orchestration by adding further internal tasks. In other words, views show the minimum set of tasks, as well as the structure of the tasks, that a partner’s orchestration must contain in order to be conformant with the shared choreography. For a more detailed discussion of views and of how correct views can be constructed, refer to [14]. The main idea of the view driven federated choreographies is presented in figure 4. It consists of two layers. The upper layer consists of the federated choreographies shared between different partners, e.g. in figure 4 the Purchase processing choreography is shared between buyer, seller and shipper. A choreography is composed of views of the orchestrations by which the choreography is (partially) realized. In other words, the activities contained in a choreography are only those in the views. A choreography may support another
choreography. This means that the former, the supporting choreography, contributes to the latter, the supported choreography, and partially elaborates it. E.g., the Shipment processing choreography is the supporting choreography and the Purchase processing choreography is the supported choreography. The set of activities contained in a supporting choreography is an extended subset of the activities of the supported choreography; the supporting choreography describes parts of the supported choreography in more detail. The choreography which supports no other choreography and is only supported by other choreographies is called the global choreography; in our running example, the Purchase processing choreography is the global choreography. Informally, the global choreography captures the core of a business process, and the other choreographies which support the global choreography describe parts of it in the detail needed for implementation. In figure 4 the global choreography, the Purchase processing choreography, describes how an item is sold and shipped to the buyer. It contains the activities and steps which are of interest to the buyer and which the buyer needs to know in order to take part in or initiate the business process. How the shipping of the items and the debiting of the buyer’s credit card are actually handled is described in the Shipment processing choreography and the Payment processing choreography, respectively. The bottom layer consists of the orchestrations that realize the choreographies in the upper layer. Each orchestration provides several views for the different interactions with other partners. The interactions with other partners are reflected in the choreographies; hence, an orchestration needs to provide as many views as the number of choreographies this orchestration (partially) realizes. Let figure 2 be the Purchase processing choreography. The buyer’s orchestration and its view shared with the Purchase processing choreography are presented in figures 3 and 5, respectively. Each partner provides its own internal realization of the relevant parts of the corresponding choreographies, e.g. the buyer has an orchestration which realizes its part in all three choreographies. The presented approach is fully distributed and there is no need for centralized coordination. Each partner has local models of all choreographies in which it participates, and all local models of the same choreography are identical. By having identical local models of the choreographies, partners know to which activities they have access, which activities they have to execute and in which order. In addition, partners are aware of when to expect messages and in which interval they can send messages. In other words, the knowledge about the execution of the model is distributed among the involved partners and each partner is aware of its duties in the course of the process execution. Hence, there is no need for a super-user or a central role that possesses the whole knowledge about the execution of the process; rather, this knowledge is distributed among the participants and each partner knows what he needs to know. Additionally, each partner holds and runs its own model of its orchestration. Note that if there is a link between two choreographies and/or orchestrations, either a support link between two choreographies or a realize link between a
Fig. 5. The view on the buyer’s orchestration
choreography and an orchestration, this implies that the two choreographies and/or orchestrations have at least one activity in common, i.e. their greatest common divisor is not empty. One can argue that supporting choreographies may be combined by means of composition, as described in [11,4], where existing choreography definitions can be reused and recursively combined into more complex choreographies. But, in fact, the relationship between choreographies can be more sophisticated than a mere composition. For example, the relationship between the seller and the shipper may include not only the passing of shipment details from the seller to the shipper but also the payment of shipment charges through the seller’s bank. This can be described by a separate choreography between the seller, the shipper and the bank (see the Shipment processing choreography in figure 4). This choreography has additional activities and partners which are not visible in its supported choreography. It contributes to the Purchase processing choreography and elaborates the interaction between the seller and the shipper. The Shipment processing choreography is illustrated in figure 6.

2.1 Advantages of the View Driven Federated Choreographies
View driven federated choreographies are more flexible than the typical compositional approaches used in proposals like WS-CDL [11], and they close the gap between choreographies and orchestrations by providing a coherent and integrated view on both. View driven federated choreographies offer obvious advantages such as:
Fig. 6. The Shipment processing choreography
Protection of business know-how: View driven federated choreographies improve business secrecy and protect business know-how. If the whole business process, including all involved partners, is modeled as one single choreography, all message exchanges are visible to all partners. But if the interactions are separated into different choreographies, external observers have no knowledge about the message exchanges and partners can keep the actual handling of their business private.
Avoidance of unnecessary information: The proposed approach avoids unnecessary information. Even if there is no need to protect business know-how, it is desirable to separate choreographies and limit them to the interested parties only.
Extendability: The model is extendable when such a need arises. As long as the conformance conditions are satisfied, the model can be extended without interfering with the running process and without notifying
the partners when setting up new choreographies. The conformance issues are discussed briefly in subsection 2.2.
Uniform modeling: Finally, view driven federated choreographies use a coherent and uniform modeling for both choreographies and orchestrations and eliminate the need for different modeling languages and techniques for choreographies and orchestrations. The uniform modeling technique reduces the cost of the process design phase.

2.2 Conformance Issues
The central requirement of the proposed model is inter-layer conformance (between choreographies and orchestrations) as well as intra-layer conformance, i.e. within the choreography and orchestration layers themselves. This includes structural conformance [8], temporal conformance [10,9], messaging conformance and dataflow conformance. Conformance issues are out of the scope of this work; for structural and temporal conformance please refer to the references in this subsection as well as to [14].
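Although the conformance algorithms themselves are out of scope here, the structural side of the requirement can be illustrated with a toy check: a purely sequential orchestration conforms to a view if it contains all of the view's activities in the same relative order. The sketch below ignores parallel and conditional structures and the actual tests of [8]; apart from Search sellers and Select seller, the activity names are invented for the example.

def structurally_conforms(orchestration, view):
    """Check that every activity of the view occurs in the orchestration and that
    the view's ordering is preserved (sequential workflows only, a simplification)."""
    remaining = iter(orchestration)
    return all(activity in remaining for activity in view)   # 'in' advances the iterator

# Activity names below are illustrative, not taken from the actual figures.
buyer_orchestration = ["Search sellers", "Select seller", "Request quote",
                       "Send order", "Receive invoice", "Settle invoice"]
buyer_view = ["Request quote", "Send order", "Settle invoice"]
print(structurally_conforms(buyer_orchestration, buyer_view))   # -> True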
3 Metamodel of the View Driven Federated Choreographies
Choreographies, orchestrations and views are treated as workflows. Therefore, choreographies, orchestrations and views can be modeled using typical workflow control flow structures. Moreover, our approach provides a coherent view on both choreographies and orchestrations and their mutual relationships, thus bridging the gap between abstract and executable processes. An orchestration provides several views. Choreographies can be federated into more complex ones and are composed of views. Moreover, as all choreographies are workflows, they can be composed of other choreographies by means of the complex activities and control structures available in the workflow models. The same applies to orchestrations. The metamodel allows several choreographies and orchestrations to be described on different levels of detail. Choreographies and orchestrations can share the same activities. These activities are contained in a view that is provided by the orchestration; such a view identifies which activities of the choreography must be realized in the orchestration. An activity visible in one choreography can be extended by its relationships with other activities in a federated choreography. On the other hand, an activity visible in a choreography can have a complex implementation described in an orchestration. Thus, choreographies and orchestrations, together with their activities, can be viewed on different levels of detail and in the context of different relationships. The metamodel of the view driven federated choreographies is represented in figure 7.
Fig. 7. Metamodel of view driven federated choreographies
A workflow is either a choreography, an orchestration or a workflow view. A workflow can have many views. A workflow defines views for different roles (of partners). Each role sees and accesses the workflow through the view. A workflow uses activities. An activity is either a task or a complex activity. An activity can be used to compose complex activities. An activity occurrence in such a composition is represented by an activity step. One activity can be represented by several activity steps in one or several workflows or complex activities, and each activity step belongs to exactly one activity. In other words, activity steps are placeholders for reusable activities. The same activity can occur in different workflows. The control structure of a complex activity is described by its type (seq for sequence, par for parallel and cond for conditional). An activity may be owned by a partner. Orchestrations and tasks must have an owner, whereas choreographies must not have an owner. A partner may have several roles and one role can be played by several partners. A role may take part in a workflow and call an activity step in this workflow. An activity step is provided by another role. Thus a single partner can use different roles to participate in a workflow and provide or call activity steps. A role sees and accesses a defined view on the workflow. The notion of a step is very important for the presented metamodel. Both workflows and complex activities consist of steps. Between subsequent steps there can be a transition from a predecessor to a successor, which represents control flow dependencies between steps. A complex activity may be decomposed in a given workflow into the steps that constitute this complex activity only if all of the activities corresponding to these steps are also used and visible in this workflow. Therefore, a workflow can be decomposed and analyzed on different levels of detail with complex activities disclosing their content, but without revealing protected information about the implementation of these complex activities. To allow a correct decomposition, a complex activity must have only one activity without any predecessors and only one activity without any successors. The same applies to workflows. A step can be either an activity step or a control step. As mentioned above, activity steps are placeholders for reusable activities and each activity step belongs to exactly one activity. Activity steps can be called in a workflow definition. An activity step may be used as a reply to a previous activity step. A single activity step may have several alternative replies. A control step represents a control flow element such as a split or a join. Conditional and parallel structures are allowed, i.e. the type of a control step is one of the following: par-split, par-join, cond-split or cond-join. An attribute predicate is specified only for steps corresponding to a conditional split and represents a conditional predicate. Conditional splits have XOR semantics. A split control step has a corresponding join control step, which is represented by the recursive relation is counterpart. This relation is used to represent well structured workflows [12], where each split node has a corresponding join node of the same type and vice versa.
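One way to read this metamodel is as a small set of record types. The sketch below is our own rendering in Python dataclasses; the names follow the entities described above, but the attribute set is deliberately incomplete (e.g. roles and partners are reduced to plain strings) and is meant only to make the relationships concrete.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class WorkflowKind(Enum):
    CHOREOGRAPHY = "choreography"
    ORCHESTRATION = "orchestration"
    VIEW = "view"

class ControlType(Enum):
    PAR_SPLIT = "par-split"
    PAR_JOIN = "par-join"
    COND_SPLIT = "cond-split"
    COND_JOIN = "cond-join"

@dataclass
class Activity:
    """A task or a complex activity; orchestrations and tasks must have an owner."""
    name: str
    owner: Optional[str] = None
    complex_type: Optional[str] = None     # 'seq', 'par' or 'cond' for complex activities

@dataclass
class Step:
    """Either an activity step (a placeholder for a reusable activity) or a control step."""
    activity: Optional[Activity] = None
    control: Optional[ControlType] = None
    predicate: Optional[str] = None        # only for cond-split steps (XOR semantics)

@dataclass
class Workflow:
    kind: WorkflowKind
    steps: List[Step] = field(default_factory=list)
    transitions: List[Tuple[int, int]] = field(default_factory=list)  # (predecessor, successor) step indices
    views: List["Workflow"] = field(default_factory=list)             # views defined for roles

# Illustrative use: a choreography with one activity step reusing a shared activity.
send_order = Activity("Send order", owner="buyer")        # activity name is invented
purchase = Workflow(WorkflowKind.CHOREOGRAPHY, steps=[Step(activity=send_order)])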
4 Conclusions
We introduced a layered, distributed architecture for web service composition, composed of choreographies and orchestrations. The concept of views and the distributed nature of this model allow business partners to interact and at the same time protect their business know-how and improve their privacy. Besides, business processes can be described at different levels of detail and with a uniform modeling language, as an executable process in orchestrations and as an abstract process in choreographies and views. Acknowledgments. This work is supported by the European Project WSDiamond in FP6.STREP and the Austrian projects GATiB and Secure 2.0.
References

1. Andrews, T., et al.: Business process execution language for web services (bpel4ws), ver. 1.1. BEA, IBM, Microsoft, SAP, Siebel Systems (2003)
2. Arkin, A.: Business process modeling language (bpml), ver. 1.0. Technical report, BPMI (2002), http://www.bpmi.org/downloads/spec_down_bpml.htm
3. Barros, A., Dumas, M., Oaks, P.: A critical overview of the web services choreography description language (ws-cdl). Technical report, Business Process Trends (2005)
4. Burdett, D., Kavantzas, N.: Ws choreography model overview. Technical report, W3C (2004)
5. Decker, G., Overdick, H., Zaha, J.M.: On the suitability of ws-cdl for choreography modeling. In: Proc. of EMISA 2006 (2006)
6. Dijkman, R.M., Dumas, M.: Service-oriented design: A multi-viewpoint approach. Int. J. Cooperative Inf. Syst. 13(4), 337–368 (2004)
7. Eder, J., Lehmann, M., Tahamtan, A.: Choreographies as federations of choreographies and orchestrations. In: Proc. of CoSS 2006 (2006)
8. Eder, J., Lehmann, M., Tahamtan, A.: Conformance test of federated choreographies. In: Proc. of I-ESA 2007 (2007)
9. Eder, J., Pichler, H., Tahamtan, A.: Probabilistic time management of choreographies. In: Proc. of QSWS 2008 (2008)
10. Eder, J., Tahamtan, A.: Temporal conformance of federated choreographies. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 668–675. Springer, Heidelberg (2008)
11. Kavantzas, N., et al.: Web services choreography description language (ws-cdl) 1.0. Technical report, W3C (2004)
12. Kiepuszewski, B., ter Hofstede, A.H.M., Bussler, C.: On structured workflow modelling. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000. LNCS, vol. 1789, p. 431. Springer, Heidelberg (2000)
13. Peltz, C.: Web services orchestration and choreography. IEEE Computer 36(10), 46–53 (2003)
14. Tahamtan, A.: Web Service Composition Based Interorganizational Workflows: Modeling and Verification. Suedwestdeutscher Verlag fuer Hochschulschriften (2009)
Semantic Battlespace Data Mapping Using Tactical Symbology Mariusz Chmielewski and Andrzej Gałka Cybernetics Faculty, Military University of Technology, Kaliskiego 2, 00-908 Warsaw, Poland
Abstract. Interoperability of military C4ISR systems, mainly in NATO Joint Operations, has been one of the crucial tasks. The most mature standard designed for system interoperability is the JC3IEDM, which was developed based on several NATO C4I models. Representing battlefield scenarios in the form of pure relational data requires an additional rendering mechanism to present the situation picture, which can further be reused by the decision makers. The purpose of this work is to describe the development process of JC3IEDM and APP-6A ontologies and the designed model mappings. Analysis of both standards helped us realize that transformation of data structures is not the modelling goal itself. The descriptions provided within the models carry the semantics as domain values for attributes and business rules stating their valid combinations. Through a detailed review of both standards, it was possible to identify missing descriptions both in JC3 and APP-6A and, most of all, to describe the similarities further used as a source for the definition of ontology class constructors and SWRL rules. The models produced in the conducted research have been applied as knowledge bases for decision support tools: several combat scenarios were uploaded into an operational JC3 database, which was then the source for migrating them into the semantic model. Ontology processing services using mapping rules transformed instances of battlespace objects into the symbology domain, identifying the APP-6A sign codes used by the distributed Common Operational Picture tools. A prototype subsystem, developed as a proof of concept, renders the battlespace scenario based on identified semantic bridges between the JC3IEDM (used for reflecting detailed information on the battlefield) and the symbology standard.
Keywords: military operations, warfighting symbology, common operational picture, ontology, semantic models, decision support.
1 Introduction

Development of decision support tools reuses many rendering mechanisms for presentation of GIS (Geographic Information System) data along with the current and future state of the battlespace elements. Representing combat scenarios in the form of pure relational data requires additional algorithms to present the situation picture, which can further be reused by the decision makers. Graphical icons placed over the generated digital maps help to express the current combat situation [4].
Shared situational awareness enables collaboration between sensors, actors and command centers, improving synchronization and reducing communication delays, which in result speeds up the decision process and increases mission effectiveness. NATO symbology standards cover the majority of battlespace and civilian elements. Analysis of the taxonomy helps to understand the idea of upgrading the data of an element on the battlespace, where incoming reports provide additional details which can change the semantics of the stored element and, in result, the symbol itself. Static mapping of data in the software is the approach most usually found in the presentation layer. This solution is efficient but not extendable and, without additional program logic, could not provide mechanisms for mapping and data consistency checking.
This work presents an original approach to building semantic bridges between data stored in a C4ISR [12] standard model (JC3) and APP-6A symbology using ontology and reasoning mechanisms. Such a case study helped us to develop transformation mechanisms for object oriented and relational data models, verify the abilities of such transformations and evaluate the designed ontology metrics. Our experiences, conclusions and results have been applied in the semantic mechanisms of the SOA demonstrator developed under the agenda of the KOS project referenced in the acknowledgments.
Network Centric Warfare theory defines several concepts used as a product of information flow or as a tool for decision support procedures: Situation Awareness and Common Operational Picture. Both concepts are connected with each other due to their specific domain, that is, battlespace description. Situation awareness (SAW) can be understood as the perception of environmental elements within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future. SAW combines perception and environment critical to decision-makers in complex, dynamic processes, in this case military command and control [4]. Situation awareness, as stated before, is achieved in the minds of commanders by their analysis of incoming reports and the functionalities of the C4ISR systems used. Sophisticated tools for data acquisition, data mining and decision support generate data sets representing a picture of the environment in the current combat space. Situation awareness contains data gathered from all available battlespace dimensions loaded from C4ISR systems. The migrated data is the key to understanding and describing the environment, providing battlespace perception and preparing variants for developed decisions. Lack of data, or having inadequate information and in result limited situation awareness, has been identified as one of the primary factors in accidents attributed to human error.
Situation awareness characterizes a wider concept than the Common Operational Picture, which can be described as a visualization product used by the C4I systems for Joint Operation Command Centers. A definition of COP can be derived from several sources which, after analysis, can be stated as [12]: "Common Operational Picture can be defined as a single identical display of all relevant (battlespace) information shared by more than one command. COP merges all available data providing information superiority and facilitating collaborative planning and assisting all units to achieve situational awareness." It must be stated that COP is a tool supporting existing, in many cases legacy, C4I systems using an integration layer which provides a unified means of data
recognition, filtering and fusion. Such a service or group of services in a command system delivers the current situation, consisting of both the military and the civilian domain, general pointers and guidelines in the Area of Responsibility (AOR) and the Area of Interest (AOI). Essential requirements for COP have been formulated in [15], in which it is defined as a service or a set of services providing the collection of recognized pictures and data fusion. The Common Operational Picture requires Geospatial Information Services, often based on several standardized data sources, supported by graphic and decision support functionalities. Considering the current development phase, NATO stated that the COP requirements need to be sequentially and partially extended. A battlespace view is usually a collection of information in the COP aggregated and displayed for decision support. It serves to ensure functional consistency among specific views of separate battlespace dimensions: air, ground, marine. Problems connected with achieving a perception of the situation, and therefore the COP, evolve, indicating many concepts of the battlespace and overcoming technical requirements for federated distributed real-time systems.
2 APP-6A Warfighting Symbology Standard and JC3 Interoperability Exchange Data Model Semantics

To express the semantics of the current combat scenario, commanders have used all kinds of symbology when preparing tactical maps. The standard provides common operational symbology, defining details of its display and plotting and ensuring compatibility among the semantics of the created battlespace picture. APP-6A [14] (Allied Procedural Publication) defines rules for creating and organizing object information through the use of a standard methodology for symbol hierarchy, information taxonomy and symbol identifiers. This work describes only the general elements of the standard which are directly reflected in the designed mechanisms and which the Reader should be familiar with. For detailed information on APP-6A see [14]. The standard defines a whole taxonomy of signs, considering the rules of generalisation and specialisation. Digging deeper into the sign hierarchy enables defining an object in more detail. Each sign is defined in the form of a 15-character code which stores the whole sign semantics (a parsing sketch is given after the list below). Given this code we can express:

- coding scheme [CS] (position 1) – indicates to which overall symbology set a symbol belongs;
- affiliation / exercise amplifying descriptor [AF] (position 2) – indicates the side of the conflict or a specially identified affiliation for exercise purposes;
- battle dimension [BD] (position 3) – indicates one of the dimensions in which the units/equipment can operate: air, ground, surface, subsurface, space and other;
- status [ST] (position 4) – defines the future or present indicator for the object;
- function id [FK] (positions 5-10) – important semantic information which identifies the object function-type; each position indicates an increasing level of detail and specialization;
- symbol modifier [SM] (positions 11-12) – identifies indicators present on the symbol such as object rank, feint/dummy, installation, task force, headquarters staff, and equipment mobility;
- country code [CC] (positions 13-14) – identifies the country with which a symbol is associated;
- order of battle [OB] (position 15) – provides additional information about the role of an object in the battlespace (differentiating the type of forces).
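As an illustration of the positional layout listed above, the following minimal sketch decodes the fields of a 15-character APP-6A symbol identifier. The function and field names are assumptions made for this example; the position ranges follow the list above.

```python
# Illustrative sketch: splitting an APP-6A 15-character symbol identifier
# into its positional fields (positions follow the list above; names are assumed).

def parse_app6a_code(code: str) -> dict:
    code = code.lower().ljust(15, "-")    # unknown positions may be left as "-" or "_"
    return {
        "coding_scheme":    code[0],      # position 1
        "affiliation":      code[1],      # position 2
        "battle_dimension": code[2],      # position 3
        "status":           code[3],      # position 4
        "function_id":      code[4:10],   # positions 5-10
        "symbol_modifier":  code[10:12],  # positions 11-12
        "country_code":     code[12:14],  # positions 13-14
        "order_of_battle":  code[14],     # position 15
    }


if __name__ == "__main__":
    # Example code from Table 1 below: hostile ground armoured unit (present)
    print(parse_app6a_code("shgpuca--------"))
```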
The values in each field are filled from left to right unless otherwise specified. A basic symbol requires a minimal information workload to present the coding scheme, affiliation, battle dimension and function id, due to the fact that they reflect the basic icon. The standard allows leaving unknown code positions blank or filled with the standardized symbols "-" or "_".

Table 1. Symbol hierarchy representation in the APP-6A standard, presenting information for an example hostile unit
Code              Description
shgpu----------   Hostile ground unit (present)
shgpuc---------   Hostile ground combat unit (present)
shgpuca--------   Hostile ground armoured unit (present)
shgpucaa-------   Hostile ground anti-armoured unit (present)
shgpucaaa------   Hostile ground anti-armoured armoured unit (present)
shgpucaaaw-----   Hostile ground anti-armoured armoured wheeled unit (present)
(The Sign column of the original table contained the corresponding graphical symbols and is omitted here.)
The table presents part of the sign hierarchy in which the extended coding scheme specifies the object and reflects the semantics of the combat unit presented in the scenario. APP-6A extends the image symbology with mandatory and optional sign attributes. The presented set of icons demonstrates battlefield data incompleteness, where in a series of reports the system may update and upgrade an object's attributes, reflected further in APP-6A. Fig. 1a presents the scheme for defining all basic elements of the sign: the form of all identifiers (describing their possible values, placement, development rules, and most of all the required graphical elements).
Fig. 1. (a, b) Elements of an APP-6A sign with all description blocks and elements (both graphical and text – left figure); sample Common Operational Picture reflected on the CADRG map in the designed prototype using APP-6A symbology (right figure)
Several years of collaborative work in the MIP workgroup allowed the unification of the common data model dedicated to C4ISR systems integration. The model construction required a specific approach for storing battlespace entities, domain values and value restrictions for defined enumerations. The main model purpose, data interoperability, requires a rigorously defined semantic vocabulary (domains and domain values) that is embedded in a structured context. JC3IEDM defines elements of information that form the basis for interoperability between automated Command and Control Information Systems - in the case of the Polish Armed Forces: Szafran, Dunaj, Podbiał, Łeba MCIS - accommodating the model's information structure. The standard's intent is to represent the core of the data identified for exchange across multiple functional areas of existing or legacy C4ISR systems. The model in the current version 3.1b [16] consists of: 290 entities, 1594 attributes, 533 domains (enumeration types), 12409 domain values, 259 associations and 166 hierarchy relations. Due to the large scale of the model and all available descriptions provided by MIP, we have chosen the metamodel database MIRD [17] to transform chosen entities and relations into the form of a semantic model. This operation required an extended analysis of the JC3 semantics to:

- choose the main parts of the model to be transformed;
- identify transformation rules from the JC3 relational model to the ontology;
- filter ontology classes and extract only the needed elements.
3 Semantic Model and Ontology in Project Application

In the case of the presented work we apply ontologies in model descriptions to utilize the semantics of the main parts and, in the end, to provide semantic bridges between the domain models. For model representation we chose Description Logic and an appropriate dialect for expressing our ontologies. Selecting among currently available semantic languages, we have based our models on OWL DL, which is a direct offspring of the description logic SHOIN (ALCR+HOIN) [9]. A model of formally represented knowledge is based on a conceptualization: concepts, objects and entities with relationships identified among them [1]. A conceptualization can be defined as an abstract, simplified view of the reality that the modeller wants to represent for a defined purpose. Every knowledge-based system or knowledge multiagent environment is committed to some conceptualization, explicitly or implicitly. We base our ontology design criteria on the guidelines described in [1] concerning clarity, coherence, extendibility, minimal encoding bias and minimal ontological commitment. Ontology language design criteria require that the expressiveness of the model must also be followed by reasoning capabilities. Experience taken from First Order Logic reasoning methods and algorithms, and dedicated reasoning algorithms for Description Logic (the Tableaux algorithm) [9], allowed inference tools to be provided. Those capabilities are mainly used in the design phase of each of the described ontologies and, most of all, they provide the matching algorithms for mapping data from JC3 to APP-6A. The designed models have been serialized in the form of OWL DL using the Protégé modelling tool.
We have tested two separate strategies for model design. Based on the available JC3IEDM conceptual model, we managed to define a set of transformation rules that was used to convert the relational model into the ontology. The transformation process concentrates on reflecting the semantics of the model, not its structure. Due to this fact, some of the entities defined on the level of the conceptual model have been erased and replaced by relations between concepts, e.g. associations between OBJECT-TYPEs represented by the OBJECT-TYPE-ASSOCIATION entity. The other tested strategy for ontology design assumed building it from scratch, thereby studying the details of the modelled domain. This approach has been applied to develop the two variants of the APP-6A ontology.
Fig. 2. Overview of the design process and elements of fused ontology model - Unified Battlefield Ontology
Ontology development for the COP-enabled tool required detailed C4I system data sources, especially in the aspect of the visualization mechanism and the GIS data standard used by the system. The current version of the Unified Battlespace Ontology Model (UBOM) consists of:

- the MIP JC3IEDM-based ontology, describing a wide range of military operations (actions), military units and equipment, their location and available reporting data;
- the APP-6A warfighting symbology standard ontology, containing battlespace objects and their description without references to location and extended relations between objects;
- fusion mappings, an ontology focused on concept definitions using JC3IEDM semantics (classes and relations) and reflecting them in the APP-6A concept hierarchy.
The relational model chosen as the form of JC3 representation makes it difficult to identify the semantics of the model. An analysis of the model taking into account only its structure would provide merely a model stored in OWL, not an ontology itself, due to the crucial requirements that such a model must satisfy. Relational models carry overhead in the form of elements required by the RDBMS to properly construct relations among data records. To filter out such elements we have chosen to process the JC3 metamodel MIRD instead of the JC3 physical model. Implementations of the JC3IEDM usually depend on an underlying MIP Information Resource Dictionary database. Information about the JC3IEDM entities, attributes, cardinality and subtype relationships, primary and foreign keys, domains, domain values etc. is stored in this metamodel and makes it possible to determine, under dynamic user-imposed constraints, what to replicate and how. The ontology development has been divided into two stages: in the first, all specified entities and their relationships are transformed into ontology classes and properties; the second stage verifies the model and refines the constructs, changing the model itself to bring out the semantics (discarding association tables and introducing relations with additional characteristics).
The JC3 model structure consists of entities composed of attributes. In most cases the types of attributes are defined by domains which, after a deeper overview, are the true holders of the model semantics. Domains are composed of domain values, which create a structure similar to an enumeration type. JC3 also defines business rules which are used to obtain valid domain value combinations for selected attributes. For the purpose of this work we utilize the business rules for the UNIT-TYPE entity to construct the available variants of defined units. The domain knowledge used in this process has been described in MIR Annex G - Symbology Mapping [13]. The definition of model transformation rules has mainly been aimed at:

- generic guidelines for relational model to ontology transformation;
- identification of excess relational model elements associated with the physical model representation and not its contents;
- definition of uniform naming conventions for the created semantic model elements based on their relational model predecessors;
- development of additional validation rules to identify valid domain value combinations and their reflection in the form of description logic class constructors or SWRL reasoning rules;
- definition of an optimal range of JC3IEDM elements to be transformed, providing the required ontology expressiveness with compact size and processing efficiency.
Generation of the JC3 ontology has been executed several times with different sets of transformation rules. The main cause of such an approach has been the ontology refinement process and the optimization of the OWL form of the ontology stored in the resulting file. The final set of rules defines the following transformations:

- Entity → OWLNamedClass;
- Cardinality relationship → OWLObjectProperty with OWLCardinalityRestriction;
- Subtype relationship → OWLNamedClass subtype hierarchy;
- Attribute → OWLObjectProperty in the case where the attribute type is a Domain, or OWLDatatypeProperty where the attribute type is a primitive type reflected in the ontology by an RDFSDatatype;
- DomainValue → OWLIndividual;
- Domain → OWLEnumeratedClass containing the OWLIndividuals generated from the DomainValues associated with this Domain;
- BusinessRule → OWLNamedClass with defined value restrictions (OWLHasValue, OWLAllValuesFrom, OWLSomeValuesFrom).
Fig. 3. JC3IEDM ontology transformation algorithm based on MIRD model transformation.
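To make the listed rules concrete, the following sketch shows the general shape of such a MIRD-driven transformation. It is not the authors' implementation: the input record layout, the function name and the emitted triple vocabulary are assumptions made only to illustrate how the entity/attribute/domain rules can be applied mechanically.

```python
# Illustrative sketch (assumed data structures) of the MIRD-driven
# relational-to-ontology transformation rules listed above.

def transform(entities, attributes, domains, domain_values, subtype_rels):
    """Emit (subject, predicate, object) statements for an OWL model."""
    triples = []

    for e in entities:                                    # Entity -> class
        triples.append((e["name"], "rdf:type", "owl:Class"))

    for parent, child in subtype_rels:                    # subtype -> class hierarchy
        triples.append((child, "rdfs:subClassOf", parent))

    for a in attributes:
        if a["type"] in domains:                          # Domain-typed -> object property
            triples.append((a["name"], "rdf:type", "owl:ObjectProperty"))
            triples.append((a["name"], "rdfs:range", a["type"]))
        else:                                             # primitive -> datatype property
            triples.append((a["name"], "rdf:type", "owl:DatatypeProperty"))
        triples.append((a["name"], "rdfs:domain", a["entity"]))

    for d, values in domain_values.items():               # Domain -> enumerated class
        triples.append((d, "rdf:type", "owl:Class"))
        for v in values:                                   # DomainValue -> individual
            triples.append((v, "rdf:type", d))

    return triples
```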
As stated before, each APP-6A element can be described by the 15-character code, whose positions represent specific parts of the element semantics. Reflecting these elements is the basis for the definitions of all elements. Using those elements we have defined the hierarchy of symbols as specified in the APP-6A documentation. Each sign category (Equipment, Unit, Weapon, etc.) consists of the concepts provided in the specification, extended using DL class constructors in order to use the reasoning capabilities of the ontology.

Fig. 4. Concept definition of APP-6A code elements

Fig. 5. Specification of Unit concepts in APP-6A

The Unit concept subtree allows analyzing the conditions required for the inference mechanism to correctly classify an individual into this class. Category-Code subclasses define individuals which identify the APP-6A enumeration codes used in sign coding, e.g. Affiliation-CategoryCode: A_Indiv-ACC, D_Indiv-ACC, F_Indiv-ACC, G_Indiv-ACC, H_Indiv-ACC, etc. The individual naming scheme uses the character specified in the APP-6A code concatenated with the "_Indiv" string, followed by the abbreviation of the CategoryCode subclass. For the shown example, H_Indiv-ACC stands for the individual from Affiliation-CategoryCode (ACC) indicating the "H" code, meaning Hostile. As shown in Fig. 5, the prepared taxonomy focuses on combat units because of the source data which will later be gathered from JC3. The prepared concept list tries to reflect a taxonomy based on several criteria:

- Function-ID – defining the purpose and function of the unit (Combat-Unit, CombatServiceSupport-Unit, CombatSupport-Unit);
- Affiliation – defining the unit's specified side of the conflict (Friend-Unit, Hostile-Unit, Neutral-Unit, Unknown-Unit).

Intentionally, other specified values for function-id and affiliation were not defined, in order to minimize the size of the ontology. We also concentrate on providing the mapping method rather than fully reflecting the source specifications. The definitions of the classes Unit, Combat-Unit and Infantry-Combat below show Description Logic constructor usage modelled with the APP-6A code elements defined in the ontology:
(∃hasWarfighting-CodeScheme.Warfighting-CodingScheme ⊓
∃hasUnit-FunctionID.FunctionID-CategoryCode ⊓
∃hasStatus.Status-CategoryCode ⊓
∃hasGroundDimension.Ground-BattleDimension ⊓
∃hasAffiliation.Affiliation-CategoryCode) ≡ Unit

∃hasUnit-FunctionID.{UC----_Indiv-FIDCC} ≡ Combat-Unit,   Combat-Unit ⊑ Unit

∃hasUnit-FunctionID.{UCI---_Indiv-FIDCC} ≡ Infantry-Combat,   Infantry-Combat ⊑ Combat-Unit
The presented constructions use necessary conditions in the form of ⊑ (inclusion, subsumption) or necessary and sufficient conditions in the form of ≡ (equality). The next stage of the integration process is overcoming the differences between the ontologies. The process of reconciliation of these differences is called ontology mediation, which enables the reuse of data across C4I systems and, in general, cooperation between different battlespace dimensions. Considering the context of semantic knowledge management, ontology mediation is important due to data sharing between heterogeneous knowledge bases and data reuse. The final step of UBOM [11] development is based on ontology merging (the definition of Fusion.owl), which can be considered as the creation of an ontology from several source ontologies. The new product unifies the information representation and provides equivalent concepts in the source ontologies (app6a.owl and jc3.owl). Ontology merging distinguishes two approaches [2], [3]: the overwrite method, taking the input set of ontologies and transforming them into a merged ontology which captures the original ontologies; and the append method, in which the original ontologies are not replaced but rather dynamically represented in a bridge ontology, created by importing the original ontologies and using bridge axioms. The developed solution uses the append method, employing the ontology import functionality and introducing the mapping axioms. To present the abilities of the designed models and the mapping scheme, we introduce the definition of the class MechanisedInfantry-Unit-Type, which has been defined using elements of the JC3 ontology as:
MechanisedInfantry-Unit-Type ⊑ Unit-Type

∃has-Unit-Type-Supplementary-Specialisation-Code.{Ground-UTSSC} ⊓
∃has-Unit-Type-General-Mobility-Code.{Land-Tracked-UTGMC} ⊓
∃has-Unit-Type-ARM-Category-Code.{Armour-UTACC} ⊓
∃has-Military-Organization-Type-Service-Code.{Army-MOTSC} ≡ MechanisedInfantry-Unit-Type
Fusion.owl defines class equality between jc3:MechanisedInfantry-Unit-Type and app6a:InfantryMechanised-Combat. To demonstrate the mapping mechanisms we must introduce a unit instance into the JC3 ontology, which requires adding Unit and Unit-Type individuals and setting all the required object property values used by the reasoner and specified above. The introduced equivalence axiom automatically classifies the individual as app6a:InfantryMechanised-Combat, setting all the object property values required by the app6a ontology, thus filling in the symbology semantics and, in the end, assigning the code. Table 2 presents a possible combat situation provided in the system, where reconnaissance units supply reports (Reporting Data) identifying new units discovered within the Area of Responsibility. Unit information in the following reports fills in the details on the affiliation of the unit, its equipment type and rank.
Table 2. Combat object representation improved using information from the battlefield

Incomplete information status                          Code
Unknown ground unit (present)                          sugpu----------
Unknown ground combat unit (present)                   sugpuc---------
Unknown ground infantry unit (present)                 sugpuci--------
Unknown ground armored infantry unit (present)         sugpuciz-------
Unknown ground armored infantry battalion (present)    sugpucizef-----
Hostile ground armored infantry battalion (present)    shgpucizef-----
(The Sign column of the original table contained the corresponding graphical symbols and is omitted here.)
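A minimal sketch of how a bridge (equivalence) axiom of the kind described before Table 2 could be materialized with a generic RDF library is given below. The namespaces, IRIs and the use of the rdflib package are assumptions made for illustration only and do not reflect the authors' tooling.

```python
# Illustrative sketch: a bridge (owl:equivalentClass) axiom between the JC3
# and APP-6A ontologies. Namespaces/IRIs are assumed for this example only.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

JC3 = Namespace("http://example.org/jc3#")
APP6A = Namespace("http://example.org/app6a#")

g = Graph()
g.bind("jc3", JC3)
g.bind("app6a", APP6A)

# jc3:MechanisedInfantry-Unit-Type owl:equivalentClass app6a:InfantryMechanised-Combat
g.add((JC3["MechanisedInfantry-Unit-Type"], RDF.type, OWL.Class))
g.add((APP6A["InfantryMechanised-Combat"], RDF.type, OWL.Class))
g.add((JC3["MechanisedInfantry-Unit-Type"],
       OWL.equivalentClass,
       APP6A["InfantryMechanised-Combat"]))

print(g.serialize(format="turtle"))   # Turtle text of the bridge ontology fragment
```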
4 Summary

In this paper we have described the concept of using an ontology as a mapping tool between two separate battlespace semantic standards, JC3IEDM and APP-6A. The analysis of both standards directly showed that the transformation of data structures must not be the crucial task while designing the knowledge base. The descriptions provided within the models as enumeration types, domain values and additional descriptions carry most of the model semantics. Through a detailed review of both standards, it was possible to identify missing descriptions both in JC3 and APP-6A and, most of all, to describe the similarities further used as a source for the definition of ontology class constructors and SWRL rules. To complete the designed method, we have proposed an architecture and implementation of a COP environment capable of utilizing JC3IEDM and mapping it to APP-6A symbology while presenting the currently stored battlespace scenario. Generating the battlefield picture, filtered and delivered on time to all command centers, provides a new quality in information management. The developed software environment, equipped with ontology tools, demonstrates a feasibility study of an integrated dynamic battlefield, providing large scale information resources in heterogeneous systems. Extending such a system with SOA-specific technologies allowed the integration of legacy systems and the ability to easily substitute alternative components to meet specific interoperability requirements.
Acknowledgments. This work was partially supported by Research Project No O R00 0050 06 and PBZ-MNiSW-DBO-02/I/2007.
References

1. Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies 43 (1993)
2. de Bruijn, J., Polleres, A.: Towards an ontology mapping specification language for the semantic web. Technical Report DERI-2004-06-30 (2004)
3. Ehrig, M., Sure, Y.: Ontology mapping - an integrated approach. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 76–91. Springer, Heidelberg (2004)
4. Endsley, M.R., Garland, D.J.: Situation Awareness Analysis and Measurement. Lawrence Erlbaum Associates (2000), ISBN 0805821341
5. Roman, D., Keller, U., Lausen, H., de Bruijn, J., Lara, R., Stollberg, M., Polleres, A., Feier, C., Bussler, C., Fensel, D.: Web Service Modeling Ontology. Applied Ontology (2005)
6. Davies, J., Fensel, D., Harmelen, F.: Towards the Semantic Web: Ontology-driven Knowledge Management. HPL-2003-173, John Wiley & Sons, Chichester (2003)
7. Herrmann, M., Dalferth, O., Aslam, M.A.: Applying Semantics (WSDL, WSDL-S, OWL) in Service Oriented Architectures (SOA). In: 10th Intl. Protégé Conference (2007)
8. Davies, J., Fensel, D., Harmelen, F.: Towards the Semantic Web: Ontology-driven Knowledge Management. HPL-2003-173, John Wiley & Sons, Chichester (2003)
9. Baader, F., et al.: Description Logic Handbook. Cambridge University Press, Cambridge (2003)
10. Chmielewski, M., Koszela, J.: The concept of C4I systems data integration for planning joint military operations, based on JC3 standard. In: MCC Conference (2008)
11. Chmielewski, M.: Data fusion based on ontology model for Common Operational Picture using OpenMap and Jena semantic framework. In: MCC Conference (2008)
12. DOD, JP 1-02, DOD Dictionary of Military and Associated Terms (April 2001)
13. MIP, MIR–SEAWG–ANNEX G, MIP symbology mapping rules (February 2009)
14. NATO, APP-6A, Military Symbols for Land Based Systems (October 1998)
15. NATO, NATO Common Operational Picture (NCOP) - NC3A (16.11.2006)
16. MIP, Joint C3 Information Exchange Data Model overview (2009)
17. MIP, The Joint C3 Information Exchange Data Model metamodel (JC3IEDM metamodel) (13.12.2007)
A Method for Scenario Modification in Intelligent E-Learning Systems Using Graph-Based Structure of Knowledge Adrianna Kozierkiewicz-Hetmańska and Ngoc Thanh Nguyen Institute of Informatics, Wroclaw University of Technology, Poland [email protected], [email protected]
Abstract. Intelligent e-learning systems provide direct and customized instruction to students without human intervention. Therefore, the system should be accurately planned to offer a learning path suitable for the student's knowledge state at each step of the learning process. The most important element in an e-learning system is the domain model: a proper knowledge representation allows teaching effectively. In this paper a knowledge structure and a learning scenario are defined, and we outline a method for modification of a learning scenario. The proposed algorithm will be tested in future works using a prototype of an e-learning system which is described in this paper.
1 Introduction

The main advantage of an e-learning system is the possibility to adapt learning materials to a student's preferences, learning style, interests, abilities etc. Researchers report that students achieve success if they learn in a learning environment they prefer. A typical intelligent e-learning system consists of 3 modules: a student module, a domain module and a tutor module. The first of them contains descriptions of the student's knowledge, behaviors, demographic data, learning style, interests etc. The domain module is responsible for knowledge representation. The last module, serving as the teacher does in traditional learning, controls the learning process. Methods implemented in the tutor module allow determining the opening learning scenario, modifying a learning scenario during the learning process, offering a suitable method of evaluation etc. [8].
This work is devoted to a method of representation of knowledge in e-learning systems. An appropriate and flexible knowledge structure is a very important step in designing an intelligent e-learning system because it allows offering a learning path suitable for the student's preferences and current knowledge state. In this work a graph-based knowledge structure and a learning scenario for the defined knowledge structure are presented. The knowledge structure from [7] is extended to a labeled graph. The labeled graph enables the transformation from a graph-based knowledge structure to the knowledge structure defined in [6]. For the
proposed knowledge structure the conception of modification of a learning scenario during the learning process is outlined. The conception of modification of a learning scenario described in [7] is refined, and the labeled graph is now considered. The proposed method will be tested in a prototype of an e-learning system described in this work. We have worked out educational materials (a set of learning scenarios) to conduct experimental tests.
In intelligent e-learning systems knowledge is represented in a few different ways. In paper [3] knowledge is structured as a set of topics (composed of a hypertext page and a set of questions) and a knowledge map (describing dependencies between topics). Knowledge is represented as a Bayesian network. Two different node types are distinguished: learned nodes (representing the degree of belief that a certain topic has been learned, calculated based on the amount of time spent on the topic and the number of questions answered) and show nodes (representing the degree of belief that the related topic should be learnt). Using the Bayesian network the system determines what the student should do: learn a new topic, learn the current topic thoroughly, repeat the previous topic or shallowly scan a new topic. In [4] a hierarchical rule structure is proposed. Each rule is in the form of the unique rule name, condition(s), the action or decision if the condition is satisfied, the parent rule, the rule to be followed if the current rule is matched and fired, and a rule to be followed in case of a failure. Knowledge in an intelligent tutoring system could be represented as a semantic network. The basic components are nodes for the presentation of domain knowledge objects and links for illustrating relations between pairs of objects. Additionally, properties and frames (an attribute and respective values) are used. For the proposed knowledge structure semantic primitives like is_a, subclass, a_kind_of, instance, part_of are used [11]. Ontologies and concept maps are very popular forms of knowledge representation. In [2] the ontology contains concepts and relationships among them. A narrower/broader relationship to support hierarchical links between concepts and a contrast or an extended relationship are chosen. In this paper the RDF Schema is used to describe the proposed model. A poorer version of the knowledge representation is described by the authors in [3]. The domain model consists of concepts and prerequisite relations among concepts. In [10] knowledge is represented on two levels: concepts and presentations of concepts. Linear relations and partial linear relations exist between concepts and presentations, respectively. In [6] the described knowledge structure was extended to a third level: versions of presentations. To the best of our knowledge, so far the problem of modification of learning scenarios during learning processes has not been solved in an effective way.
In the next section the knowledge structure is presented. For the proposed knowledge structure learning scenarios are defined (Section 3). Section 4 contains a model of the learning process. The conception for modification of a learning scenario is presented in Section 5. In Section 6 the conception of an experiment using a prototype of an e-learning system is proposed. Finally, conclusions and future works are described.
2 Structure of Knowledge in an Intelligent E-Learning System

In traditional learning, educational materials consist of several smaller parts; chapters and topics are the most popular partitions. In an e-learning system partitioning of the educational material is also necessary. In the following work it is assumed that the learning material is divided into lessons. By a lesson we mean an elementary and integral part of knowledge, e.g. "matrix addition" could be a lesson of an algebra coursework. Each lesson occurs in an e-learning system in different forms, such as graphical, textual, interactive etc. Some lessons should be learned before others, therefore between lessons there exist linear relations. Each such relation defines the order in which lessons should be presented to a student. A binary relation is called linear if it is reflexive, transitive, antisymmetric and total. The learning process is started by presenting the first lesson. This lesson contains information about the goals of the coursework and its requirements and does not finish with a test. An intelligent e-learning system stores data such as: the average score for each lesson, the average time of learning of each lesson and a difficulty degree of each lesson, which is measured by the number of failed tests. The described data is stored separately for each class of students and for all users registered in the system. Additionally, these parameters are different for different lessons' orders. Therefore, the data is represented by two-dimensional matrices:
ASg = [asij], i = 0,...,q, j = 1,...,q

where asij – 100% minus the average score for lesson pi which was learnt after lesson pj; i ∈ {0,...,q}, j ∈ {1,...,q}, g ∈ {1,...,G}, G – the number of classes.

AD = [adcij], i = 0,...,q, j = 1,...,q

where adcij = (∑g=1..G adij) / G, and adij is the difficulty degree of lesson pi (for a given class g), represented by the number of test failures referring to lesson pi (learnt after lesson pj) divided by the number of all tests taken by students who learnt lesson pi after lesson pj; i ∈ {0,...,q}, j ∈ {1,...,q}, g ∈ {1,...,G}.

ATg = [atij], i = 0,...,q, j = 1,...,q

where atij – the average time of learning lesson pi which was learnt after lesson pj; i ∈ {0,...,q}, j ∈ {1,...,q}, g ∈ {1,...,G}.

AT = [atcij], i = 0,...,q, j = 1,...,q

where atcij = (∑g=1..G atij) / G for i ∈ {0,...,q}, j ∈ {1,...,q}, g ∈ {1,...,G}.

We can define the weight matrix as W = [wij], i = 0,...,q, j = 1,...,q, where wij may equal asij, adcij, atij or atcij for i ∈ {0,...,q}, j ∈ {1,...,q}, g ∈ {1,...,G}. Let P be the finite set of lessons. Each lesson pi ∈ P, i ∈ {0,...,q}, is a set of different versions: vk(i) ∈ pi, k ∈ {1,...,m}, V = ∪i=0,...,q pi. RC is called a linear relation on the set P.

Definition 1. The graph-based knowledge structure is a labeled and directed graph

Gr = (P, E, μ)

where: P – the set of nodes, E – the set of edges, μ : E → L – the function assigning labels to edges, L = ∪f=1,...,card(RC) L(αf) – the set of labels, where L(αf) = (W, αf), f ∈ {1,...,card(RC)}, α ∈ RC.

This knowledge structure was described for the first time in [7]; in our paper it is extended and improved. The advantage of a knowledge structure represented as a graph is the possibility of applying well-known algorithms and the additional information stored in the graph labels. Figure 1 presents a graphical representation of the defined knowledge structure.

Fig. 1. A graph-based knowledge structure
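To make the definitions above concrete, the following sketch builds a toy instance of the weighted, labeled lesson graph. The variable names and numeric values are assumptions for illustration; only the notions of lessons with versions, an admissible order and a per-order weight matrix come from the definitions.

```python
# Illustrative sketch (assumed names/values) of the graph-based knowledge
# structure: lessons with versions, and edges labeled with a weight matrix W
# associated with a given lesson order (linear relation alpha).
lessons = {                      # each lesson p_i is a set of versions
    "p0": {"v1_text"},           # opening lesson: goals and requirements
    "p1": {"v1_text", "v2_graphic"},
    "p2": {"v1_text", "v2_interactive"},
}

# Weight matrix W for one order alpha; w[(i, j)] may hold as_ij, adc_ij, at_ij or atc_ij
W_alpha = {
    ("p1", "p0"): 35.0,          # e.g. 100% minus the average score for p1 learnt after p0
    ("p2", "p0"): 60.0,
    ("p2", "p1"): 25.0,
}

# Labeled, directed graph Gr = (P, E, mu): each edge label carries (W, alpha)
edges = [("p0", "p1"), ("p0", "p2"), ("p1", "p2")]
mu = {e: (W_alpha, "alpha_1") for e in edges}

for edge in edges:
    print(edge, "->", mu[edge][1])
```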
3 Structure of Learning Scenario

For the knowledge structure defined in Section 2, the learning scenario is defined in the following way:

Definition 2. By a Hamiltonian path based on an order α ∈ RC in the graph Gr we call a sequence of nodes hp = <p0,...,pq> where:
1. for each i ∈ {0,...,q}: pi ≠ pi+1,
2. for each e ∈ E: μ(e) ∈ L(α).

Definition 3. By a learning scenario s we call a Hamiltonian path hp based on an order α ∈ RC in which exactly one element from each node pi, i ∈ {0,...,q}, occurs:

s = <vk(0),...,vn(q)>

where vk(0) ∈ p0,..., vn(q) ∈ pq for k, n ∈ {1,...,m}.
4 Model of a Learning Process

The idea of learning in intelligent e-learning systems is based on the assumption that similar students will learn in the same or a very similar way [9]. Therefore, the learning process starts with collecting the most important information about students. The student, before starting learning, fills in a set of questionnaires and solves psychological tests. All data is stored in the student's profile. The student provides the following data: demographic data (login, name, telephone, e-mail, age, sex, educational level, IQ), learning style (related to perception, receiving, processing and understanding of information by the student), abilities (verbal comprehension, word fluency, computational ability, spatial visualization, associative memory, perceptual speed, reasoning), personal character traits (concentration, motivation, ambition, self-esteem, level of anxiety, locus of control, open mind, impetuosity, perfectionism) and interests (humanistic science, formal science, natural science, economics and law, technical science, business and administration, sport and tourism, artistic science, management and organization, education) [5].
If the e-learning system has collected all necessary information about the student, then the new learner is classified into a group of similar students. It is assumed that the classification criterion is a set of attributes selected by experts. In the next step the intelligent e-learning system has to select the best educational material in the best order and propose presentation methods adequate to the student's preferences. The opening learning scenario for a new learner is chosen from the successfully finished scenarios of students who belong to the same class. In the system there should be an implemented method of determination of a learning scenario. It is possible to apply the nearest neighbor algorithm; in this case the student is offered a learning scenario which was successfully completed by the student most similar to the new one. For the comparison of students' profiles the Hamming distance can be assumed. The methods for determination of an opening learning scenario with the proposed knowledge structure and learning scenario are described in [6] and [9].
After generating the opening learning scenario the student can start to learn. He is presented the first lesson from the indicated opening learning scenario. When he finishes, he has to pass a test. The result of the test determines the next step. If the test score is sufficient (the student achieves more than 50% of all points), the learner is presented the next lesson. Otherwise the system decides to change the presentation method and suggests relearning. The conception of modification of a learning scenario during the learning process is described in Section 5. During the learning process the system collects additional information related to each lesson. To the student's profile the following are added: ti – the time spent on reading each lesson pi, scorei – the result of the test for each lesson pi, and di – the difficulty of each lesson pi (a subjective evaluation of the difficulty of lesson pi), for i ∈ {0,...,q}. The opening and final scenarios are also stored in the student's profile. The learning process is finished when all lessons from the learning scenario are taught.
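The classification and scenario-selection step just described can be sketched as follows; the profile attributes, their encoding and the function names are assumptions made for this illustration only.

```python
# Illustrative sketch (assumed attributes): choosing an opening learning
# scenario from the most similar student's successfully finished scenario,
# using the Hamming distance between discrete profile attributes.

def hamming(profile_a: dict, profile_b: dict, attributes: list) -> int:
    return sum(1 for a in attributes if profile_a.get(a) != profile_b.get(a))


def opening_scenario(new_student: dict, finished: list, attributes: list):
    """finished: list of (profile, scenario) pairs of students who completed the course."""
    best_profile, best_scenario = min(
        finished, key=lambda pair: hamming(new_student, pair[0], attributes)
    )
    return best_scenario


if __name__ == "__main__":
    attrs = ["sex", "educational_level", "learning_style"]   # expert-selected attributes
    history = [
        ({"sex": "f", "educational_level": "msc", "learning_style": "visual"}, ["p0", "p2", "p1"]),
        ({"sex": "m", "educational_level": "bsc", "learning_style": "verbal"}, ["p0", "p1", "p2"]),
    ]
    new = {"sex": "m", "educational_level": "msc", "learning_style": "verbal"}
    print(opening_scenario(new, history, attrs))   # -> ['p0', 'p1', 'p2']
```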
5 Conception of Scenario Modification Algorithm

The learning process is closely connected with an evaluation process. If a student solved the test with a sufficient score, the intelligent e-learning system can assume that he has mastered this part of knowledge. Otherwise the system should offer a modification of the learning scenario. In traditional learning a student has several chances of passing a test; after an assumed number of failed tests the student finishes the course with an unclassified grade. In the proposed conception of modification of a learning scenario the student can take the test three times. After failing the test for the fourth time, the student finishes the learning process without receiving credit for this course. The conception of modification of a learning scenario is based on identifying the reasons for mistakes. The e-learning system distinguishes three reasons for a test failure: the student's distraction, a bad lessons' order, and a bad student classification.
The modification of a learning scenario during the learning process is conducted in three steps. After the first test failure the learner is offered a repetition of the same lesson but in a different version. It is possible that the student was not concentrated and thus did not learn well enough; he should try to learn the same lesson again. The student will be less bored if the system offers him a different version of the lesson. Most of us do not have a very strong preference for one dimension of a learning style, so offering similar versions of lessons should be effective. After another failure the system changes the lessons' order based on data about students who belong to the same class. Sometimes the explanation from one lesson can help with understanding another lesson. The third step is the final chance: the e-learning system offers a modification of the lessons' order based on all collected data. In the registration process the student could have provided false information about himself, so it might have happened that he was classified to an improper group. The proposed procedure of modification of a learning scenario uses some of the data stored in the student's profile (described in Section 5) and data collected during the functioning of the system related to lessons (described in Section 2). The procedure of modification of the learning scenario is presented as follows:

Given: ti, scorei, di for fixed i ∈ {1,...,q}; ASg, AD, AT, ATg for fixed g ∈ {1,...,G}; the list of passed lessons.
Result: s* – the modified learning scenario.
BEGIN
1. Did the student fail the test for the first time? IF NO GOTO 3.
2. Find the version of the lesson such that the following condition is satisfied:
   arg max over vk(i) of P(vk(i)) ∏ P(ti, scorei, di | vk(i)), for k ∈ {1,...,m}, i ∈ {1,...,q},
   and GOTO END.
3. Did the student fail the test for the second time? Delete from the knowledge structure the passed lessons and the relations connected with the deleted lessons. IF NO GOTO 6.
4. For W = ASg find the Hamiltonian path s* for which the sum ∑x=1..q asix is minimal for i ∈ {0,...,q} and for each e ∈ E: μ(e) ∈ L(α).
5. If you find s1* and s2* whose sums ∑x=1..q asix are equal, let W = ATg and choose the Hamiltonian path s* such that ∑x=1..q atix is minimal for i ∈ {0,...,q} and for each e ∈ E: μ(e) ∈ L(α), and GOTO END.
6. Did the student fail the test for the third time? Delete from the knowledge structure the passed lessons and the relations connected with the deleted lessons. IF NO GOTO 9.
7. For W = AD find the Hamiltonian path s* for which the sum ∑x=1..q adcix is minimal for i ∈ {0,...,q} and for each e ∈ E: μ(e) ∈ L(α).
8. If you find s1* and s2* whose sums ∑x=1..q adcix are equal, let W = AT and choose the Hamiltonian path s* such that ∑x=1..q atcix is minimal for i ∈ {0,...,q} and for each e ∈ E: μ(e) ∈ L(α), and GOTO END.
9. The student finishes the course with an unclassified grade.
10. END
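The core of steps 4-8 above is the selection, for a given weight matrix W, of the Hamiltonian path with the minimal total weight among the admissible lesson orders. A brute-force sketch of this selection is given below; the data layout is assumed for illustration, and a real implementation would additionally restrict the enumeration to the orders α ∈ RC.

```python
# Illustrative sketch: pick the Hamiltonian path (lesson order) with the minimal
# total weight w[(i, j)] of learning lesson i directly after lesson j.
# Brute force over permutations; assumed data layout, illustration only.
from itertools import permutations


def best_path(lessons, first_lesson, w):
    """w[(i, j)] = weight of learning lesson i after lesson j (e.g. as_ij or adc_ij)."""
    remaining = [p for p in lessons if p != first_lesson]
    best, best_cost = None, float("inf")
    for order in permutations(remaining):
        path = [first_lesson, *order]
        cost = sum(w.get((path[x], path[x - 1]), float("inf")) for x in range(1, len(path)))
        if cost < best_cost:
            best, best_cost = path, cost
    return best, best_cost


if __name__ == "__main__":
    W = {("p1", "p0"): 35.0, ("p2", "p0"): 60.0,
         ("p2", "p1"): 25.0, ("p1", "p2"): 55.0}
    print(best_path(["p0", "p1", "p2"], "p0", W))   # -> (['p0', 'p1', 'p2'], 60.0)
```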
6 The Concept of Experiments

In our research we want to show that a student learns more effectively (achieves better test scores and learns in a shorter period) if his profile is considered while determining the learning scenario. Some of the methods described in this work and in [5], [6] and [7] have been implemented. In the prototype of the intelligent e-learning system the learning process is started after a registration process. In the first step, data about login, password, sex, educational level and data referring to the learning style is collected (Fig. 2).
Fig. 2. Registration process in prototype of intelligent e-learning system
Fig. 3. The graphical version of educational material
Fig. 4. The interactive version of educational material
Based on student’s learning style, the system tries to choose the best educational material for the new student. In the intelligent e-learning system a prototype of only one part of the course of Rules of the road was worked out. The student could learn about intersections, roadway signs related to intersections and right-ofway laws. Twelve different learning scenarios could stand out. Versions of learning materials differ in lessons’ order and presentations’ methods (the amount of graphical, interactive and textual elements). It is assumed that a student has to pass a test after learning. The test consists of 10 questions chosen randomly from the total of 30 questions. Experiment should have limited time because too long experiments bore students so a method of modification of learning scenario during learning process is a little simplified. Students have 3 chances of passing the test. After the first test failure the student is offered the same learning scenario. If the test score is still not sufficient he is proposed the learning scenario which was the best learnt (students achieve the best results after following this learning scenario). The experiment is finished after passing the test or 3 test failures. The e-learning system prototype is available at: http://brylant.iit.pwr.wroc.pl/~kozierkiewicz/system.
7 Conclusions

The accuracy of a knowledge representation is very important in planning an e-learning system. The proposed knowledge structure is very flexible, stores a great amount of information and allows changing the lessons' order and the versions of lessons. The proposed graph-based knowledge structure makes it possible to apply well-known algorithms. The proposed method of modification of a learning scenario needs to be refined in the future. It is planned to conduct experimental tests using the prototype of an e-learning system. We want to demonstrate that a student achieves success if the learning process is adapted to the student's learning style, current knowledge state, abilities, interests etc. The experiment shall prove the correctness of our assumptions and of the worked out methods.
Acknowledgment

This research was financially supported by the European Union - European Social Fund and the Human Capital National Cohesion Strategy under grant no. II/33/2009, and by the Polish Ministry of Science and Higher Education under grant no. 0419/B/T02/2009/37.
References

1. Bouzeghoub, A., Defude, B., Ammout, S., Duitama, J.F., Lecocq, C.: A RDF Description Model For Manipulating Learning Objects. In: Proc. of International Conference on Advanced Learning Technologies, Joensuu, pp. 81–85 (2004)
2. Gamboa, H., Fred, A.: Designing Intelligent Tutoring Systems: a Bayesian Approach. In: Proc. of 3rd International Conference on Enterprise Information Systems, ICEIS 2001, pp. 452–458 (2001)
3. Günel, K., Asliyan, R.: Determining Difficulty of Questions in Intelligent Tutoring Systems. The Turkish Online Journal of Educational Technology, 14–21 (2009)
4. Hewahi, N.M.: Intelligent Tutoring System: Hierarchical Rule as a Knowledge Representation and Adaptive Pedagogical Model. Information Technology Journal, 739–744 (2007)
5. Kozierkiewicz, A.: Content and structure of learner profile in an intelligent E-learning system. In: Nguyen, N.T., Kolaczek, G., Gabrys, B. (eds.) Knowledge Processing and Reasoning for Information Society, EXIT Warsaw, pp. 101–116 (2008)
6. Kozierkiewicz, A.: Determination of Opening Learning Scenarios in Intelligent Tutoring Systems. In: Zgrzywa, A., Choroś, K., Siemiński, A. (eds.) New Trends in Multimedia and Network Information Systems, pp. 204–213. IOS Press, Amsterdam (2008)
7. Kozierkiewicz, A.: A Conception for Modification of Learning Scenario in an Intelligent E-learning System. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) ICCCI 2009. LNCS, vol. 5796, pp. 87–96. Springer, Heidelberg (2009)
8. Kukla, E.: Zarys metodyki konstruowania strategii nauczania w multimedialnych inteligentnych systemach edukacyjnych. In: Multimedialne i Sieciowe Systemy Informacyjne, Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław (2002)
9. Kukla, E., Nguyen, N.T., Daniłowicz, C., Sobecki, J., Lenar, M.: A model conception for optimal scenario determination in an intelligent learning system. ITSE - International Journal of Interactive Technology and Smart Education 1(3), 171–184 (2004)
10. Nguyen, N.T.: Advanced Methods for Inconsistent Knowledge Management, pp. 263–307. Springer, Heidelberg (2008)
11. Stankov, S., Glavinić, V., Rosić, M.: On Knowledge Representation in an Intelligent Tutoring System. In: Proceedings of INES 2000, Ljubljana, Slovenia, pp. 381–384 (2000)
Development of the E-Learning System Supporting Online Education at the Polish-Japanese Institute of Information Technology Paweł Lenkiewicz, Lech Banachowski, and Jerzy Paweł Nowacki Polish-Japanese Institute of Information Technology (PJIIT) Warsaw 02-008, ul. Koszykowa 86, Poland [email protected], [email protected], [email protected]
Abstract. The purpose of the paper is to present development over the years of the elearning system supporting online education at the Polish-Japanese Institute of Information Technology (PJIIT). In addition, we mention the coming, essential changes related to advancements of continuing education and lifelong learning within the framework of higher education. The main part of the paper is devoted to showing the necessity of exploring and optimizing the underlying database system in the areas of intelligent database methods, reporting and data mining and database engine performance. Keywords: elearning, distance learning, online learning, LMS, platform EDU, lifelong learning, continuing education, database engine, database optimization, intelligent database methods, data mining.
1 Model of Online Studies at PJIIT

The online courses run either exclusively over the Internet or in the blended mode: lectures over the Internet and laboratory classes at the Institute's premises. Each course comprises 15 units treated as lectures. The content of one lecture is mastered by students during one week. At the end of the week the students send their assignments to the instructor and take tests, which are automatically checked and graded by the system. The grades are entered into the gradebook - each student can see only his or her own grades. Besides home assignments and tests, online office hours are held, lasting two hours a week, as well as seminars and live class discussions. Bulletin boards, timetables, a discussion forum and FAQ lists are also available. It is also important that during their studies the students have remote access to the PJIIT's resources such as software, applications, databases, an ftp server and an email server. Each online student has to come to the Institute for one-week stationary sessions two or three times a year. During these visits they take examinations and participate in laboratory courses requiring specialized equipment.
2 Requirements for Online Platform 9 Years Ago

Around the year 2000, mostly CMS (Content Management Systems) were used as elearning platforms adapted to the needs of distance learning. The set of modules of the first version of our elearning platform, called Edu, built in the year 2000, comprised the following modules (for details, see [1] and [2]):

− Module enabling access to didactic materials in different file formats, with management by the lecturer.
− Gradebook.
− Multithreaded discussion forum.
− Chat with a graphical table for online contacts.
− Message board.
− List of addresses of other pages and sites related to the course content.
− Module for creating and administering simple tests and quizzes.

The structure of the Edu platform reflects the organizational structure of the academic studies. Special administrative tools were developed to support the management of courses and access privileges.
3 Using the Online Platform Edu for the Last 9 Years

At the beginning the platform was used by a small number of students and teachers. In the following years a rapid growth of interest in the application of the elearning system was observed at our university. It was caused by the growing number of online students as well as the wider use of the platform for other kinds of studies. Figure 1 presents the number of logins to the platform in consecutive years (the number for the year 2009 is estimated on the basis of data for the first half of the year). The number of logins can be assessed as not very large. However, taking into account that there are about 2000 students at our university and only 200 of them are Internet-based, the system is really used very intensively. On average we registered about one thousand logins per day in 2009. The number of courses run on the platform has risen at a similar rate. Our platform Edu, like other elearning platforms, is also widely used to support classroom-based (stationary) studies. Our platform is most often used to support class-based studies by:
making didactic materials available online to students; allowing the student to see all their current grades any time; collecting all course information and announcements in one place; helping in homework supply and delivery; helping in organizing and evaluating tests and exams.
After a few initial semesters of system usage, thanks to user feedback, we were able to define the most important directions of improvement. All tasks mentioned below were implemented and are now successfully used on the Edu platform.
Fig. 1. The number of logins to platform Edu in the consecutive years
Fig. 2. The number of courses run on the platform Edu in consecutive years
− Creation of a more sophisticated module for tests and exams. A very wide diversification of the lecturers' needs was observed; in consequence, we created a very flexible test module with a large range of options and parameters.
− Improvement of the grade module by adding new views and statistical information as well as easier grade management for lecturers.
− Development of the new module "Lessons" for the creation of interactive lessons by lecturers themselves using only a web browser. At the beginning of Internet-based studies, didactic materials were prepared outside the Edu system, usually by additionally employed persons who transformed the materials into the format accepted by the platform. In the current version the produced materials can be enriched by the lecturers themselves with interactive elements as well as tests controlling the passage to the next lesson.
− Improvement of the module for student homework management.
4 Future Developments: Support for Continuing Education New perspectives for elearning platforms have appeared in connection with the need for supporting continuing education for academic teachers, students and graduates (see [3]). Constant updating of the state of knowledge and skills in a given specialization has become a common goal of academic teachers, students and graduates. Consequently, a specific learning community is being formed around the specialization taught and pursued at the academic institution. An online learning environment supporting continuing education is to be based on a knowledge repository equipped with a search engine over keywords (tags) and words appearing in repository documents. Each participant is to have their own subsystem enabling the definition of different views on the repository documents authored by the participant, for example:
− to view and comment by all participants,
− to view, comment and grade by an academic teacher,
− to view and evaluate by a prospective employer,
− to view and evaluate by an admission commission,
− to view and evaluate by a degree awarding commission.
The online education platform Edu currently used at PJIIT is not accommodated to the new needs (which corresponds to the obvious necessity of extending the functionality of a standard LMS system with the new features postulated for the new Web). In particular, Edu requires the following additions:
− extending the two-level repository structure to a multi-level structure,
− Wiki documents (group shared WWW documents),
− blogs (besides the existing threaded forum),
− the possibility of assigning comments to places in existing repository documents,
− a search engine over all repository documents,
− two horizontal structures grouping repository documents either with respect to courses or with respect to participants. The participant should be able to grant access to their documents on a user-by-user basis. In this way the participants should be able to define various e-portfolios intended for various readers/evaluators, e.g. a prospective employer, an academic teacher, a commission, or members of the same project (a small sketch of such a per-user access model is given after this list).
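The per-user access grants and e-portfolio views postulated above can be pictured with a small data model. The following Python sketch is only an illustration of the idea − all class and field names are hypothetical and do not come from the Edu platform:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A repository document owned by one participant."""
    doc_id: int
    owner: str
    keywords: set = field(default_factory=set)
    # participants explicitly granted read access, besides the owner
    granted_to: set = field(default_factory=set)

@dataclass
class Portfolio:
    """An e-portfolio: a named selection of the owner's documents
    intended for a particular reader (employer, commission, ...)."""
    name: str
    owner: str
    doc_ids: set = field(default_factory=set)

def visible_documents(user, documents):
    """Documents a given participant may read: their own plus granted ones."""
    return [d for d in documents if d.owner == user or user in d.granted_to]

# toy usage
docs = [Document(1, "alice", {"thesis"}), Document(2, "alice", {"draft"})]
docs[0].granted_to.add("employer")
portfolio = Portfolio("job application", "alice", {1})
print([d.doc_id for d in visible_documents("employer", docs)])  # -> [1]
```

In a real repository the same checks would of course be enforced in the database layer rather than in application code.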
The required system supporting continuing studies will constitute a combination of three systems: a traditional LMS, a learning community portal (see [4]) and a system supporting personal development – see Table 1.
Table 1. Specific activities in the three kinds of elearning platforms
Learning Management: Course, academic teacher, student, homework assignment, comment; Test; Chat, Forum, message; Grade; Document, paper, article, keyword, comment, document version, annotation; Message, announcement; Project, game, virtual reality; Lectures, syllabus, comment, keyword, annotation, programmed lesson.
Learning Community: Chat; Forum, message, blog (public), entry; Shared wiki (with versioning); Publish/Subscribe subsystem (feed); Document, keyword, comment, document version, annotation; Message, announcement; Project, game, virtual reality.
Personal Learning: Blog (private, public), entry; E-portfolio (access restricted); Link to another element; Link/info to external achievements, e.g. awarded certificate or prize; Document, keyword, comment, document version, annotation; Message, announcement; Project, game, virtual reality.
5 The Need for Intelligent Database Methods in E-Learning Platform The typical elearning platform user has access to a very large amount of information. The data change over time, and some of them require user activity in a particular period. Effective management of this information became an important issue for all groups of platform users: students, lecturers and administration. Due to the rapidly growing amount of data available to users, it was very important to find automated methods of information retrieval used to personalize the data presented to users. Below we list examples of places where intelligent database methods were used. 5.1 Personalization of the Entrance Page After a student logs on, he or she should see all important information about their studies: dates of particular events, deadlines for homework deliveries, current announcements, etc. We have implemented an algorithm which searches all database tables concerning the modules enabled in the student's courses and presents the information in the form of a "student's schedule". The algorithm is based mostly on dates, which are
searched in the database and compared with previous user activity. Transact-SQL stored procedures are used for this purpose. For performance reasons it is very important to place this code on the database layer, because it processes very large amounts of data. 5.2 Algorithm of Random Test Generation and Task Assignment In our elearning platform we have implemented modules which help to prevent students from cheating. These algorithms randomly select tasks or questions from the database based on many configuration options set by the lecturer. The selection should be as fair as possible: all students should get the same level of difficulty but, if possible, a selected task or question should not be assigned to more than one student. The solution requires a sophisticated stored procedure on the platform's database server. 5.3 Automatic Suggestions for Students In our elearning platform we have implemented an experimental module for the "database systems" course which can suggest what a student should do next, based on the student's test results. It may suggest that the student repeat a particular part of the lessons or focus on some aspects. The module was very helpful for students and they strongly appreciated it.
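To give a flavour of the random, fair task assignment described in Section 5.2, the sketch below mimics the idea in Python; the actual platform implements it as a Transact-SQL stored procedure, and all names and the concrete fairness policy shown here are simplifying assumptions:

```python
import random
from collections import defaultdict

def assign_tasks(students, tasks_by_difficulty, difficulty, seed=None):
    """Assign one task of the requested difficulty to every student.

    Fairness rules sketched after Section 5.2: all students get the same
    difficulty level, and a task is reused only when the pool of tasks of
    that difficulty is smaller than the number of students.
    """
    rng = random.Random(seed)
    pool = list(tasks_by_difficulty[difficulty])
    rng.shuffle(pool)
    assignment, usage = {}, defaultdict(int)
    for i, student in enumerate(students):
        if i < len(pool):                      # an unused task is still available
            task = pool[i]
        else:                                  # pool exhausted: reuse the least-used task
            task = min(pool, key=lambda t: usage[t])
        usage[task] += 1
        assignment[student] = task
    return assignment

tasks = {"medium": ["T1", "T2", "T3"]}
print(assign_tasks(["ann", "bob", "eve", "joe"], tasks, "medium", seed=1))
```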
6 Reporting and Data Mining on E-Learning Platform Database During the 8 years of elearning platform development we have done a lot of research on applying data mining techniques to the system's database. We have also implemented some reporting solutions to support the administration of Internet-based studies. This gave us experience in what interesting information can be extracted from such a database and how. We started to create reports at the beginning of the system's usage. 8 years ago the number of students and courses was not very large, so simple reporting techniques were sufficient. The administration needed only basic reports such as the number of students and courses, reports supporting the computation of lecturers' salaries, etc. We used simple Transact-SQL scripts and stored procedures together with Crystal Reports or similar tools. A few years later the growing number of courses and students forced us to work on reports which support the administration of studies in a more complex form. The main requirement of the administration staff was to find different kinds of abnormal situations in the system. Examples of such queries are: "find courses which are not handled well by the lecturer" or "find students who need help in studying over the Internet". Answering such queries must be based on many factors concerning many of the database tables, which makes finding these factors and their weights a challenging task. Some of the factors are obvious, e.g. whether a student passed a test or not. Others are more difficult to analyze, e.g. browsing web
pages with didactic materials or entering incorrect modules. To make the analysis possible, the platform must fully register users' activity. To answer more sophisticated queries, the typical approach based on SQL scripts is not sufficient. This led us to work on data mining techniques. They allowed us to find out which modules are the most efficient for online learning. We also tried to find out whether the results achieved by students in particular courses are better or worse than in traditional, class-based learning. The main data mining techniques used for this purpose were data clustering and decision trees (discussed in [9]). The main problem at the beginning of using data mining techniques was preparing the data. In different courses the lecturers use different functions, modules, grading schemes, formats, etc., so the data had to be filtered and in some cases converted to a uniform scale. The results are promising and have helped us to improve the system, the curriculum and the methods of teaching online.
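As an illustration of the clustering step, the following hedged sketch groups students by activity features with scikit-learn; the feature set and the sample values are invented for the example and are not the data analyzed in [9]:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-student features extracted from the platform database:
# [logins per week, pages of didactic material viewed, tests passed, mean grade]
activity = np.array([
    [12, 340, 9, 4.5],
    [ 2,  40, 3, 3.0],
    [10, 300, 8, 4.0],
    [ 1,  25, 2, 2.5],
    [ 9, 280, 9, 4.5],
])

# Features use different scales (counts vs. grades), so standardize them first.
scaled = StandardScaler().fit_transform(activity)

# Two clusters as a toy setting: roughly "active" vs. "at-risk" students.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)  # students in the minority cluster may need help studying online
```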
7 Database Engine Performance Problems A few years ago performance problems of elearning platforms were left out of consideration: an average server was enough for platform installation and usage. The rapidly growing amount of data, as well as the growing number of users related to the expanding interest in elearning systems, makes ensuring an acceptable response time an essential problem. Because a typical elearning platform is based on a relational database, tuning this database becomes necessary. Due to the rapidly expanding amount of data, the lack of an appropriate optimization strategy may lead to severe performance problems. We carried out a detailed analysis of database trace files, aiming to define the profile of database use and, in consequence, to propose the best optimization strategies. We have created a tool which analyses trace files and transforms their information into a relational database format, which makes further analysis easier (see [6]). As we expected, simple SQL statements dominate: point queries, multi-point queries with small numbers of records, and joins, which is typical of web applications. Examples of typical queries are: "show announcements for a given course", "check whether a user is authorized to view a web page", "show the list of participants for a given course", etc. On the other hand, it can be remarked that even simple user activities, like browsing the platform pages, generate a considerable number of SQL queries. Therefore tuning the database is not aimed at shortening complicated database operations but rather at improving the system response time, which makes working with the system more comfortable. The following statistics support this argument: in the analyzed trace describing a typical week of platform operation, 84% of queries joined not more than 5 tables, and 92% of executed queries contained only simple WHERE conditions and returned from 1 to 100 resulting rows. Another frequent operation is sorting, because most of the data displayed on platform pages is ordered; most often the results are sorted by dates or in alphabetical order. Our research allowed us to create some guidelines for elearning platform tuning. These are commonly known strategies, but their proper use requires good knowledge of the database schema as well as of the way the platform database is used. Tuning advisor tools may be very useful here.
− Because joins are very frequent, it is strongly recommended to create indexes on foreign keys. In many cases the small selectivity of a foreign key column may become a problem; in such cases clustered indexes can be useful. Choosing one of the columns for the clustered index may be difficult because only one such index may exist per table.
− It is strongly recommended to find the columns used for sorting and to create clustered indexes on them. In most cases such columns can be easily identified.
− It is also recommended to identify the columns which occur in search conditions. In general, these columns can be easily identified and, fortunately, in a database supporting an elearning platform there are not too many of them. Because most of the queries are point queries, the selection of the index type may not be very important, but if possible we should use data structures which support this type of query, e.g. hash indexes.
− If the server has many hard drives, a good idea is to place big, often joined tables on separate hard drives.
− If the database server supports table partitioning, using it may be a good idea for big tables. In an elearning platform we can easily find data which are stored only for archiving purposes and are queried rarely; in most cases only a small subset of records is required for everyday usage. We can partition tables based on date columns, which are common in such systems. Due to the smaller size of data, full-scan reads will require a much smaller number of logical reads.
As mentioned before, the majority of queries executed on our elearning platform database are simple queries, but in some modules more complicated queries and SQL scripts occur. This happens mainly in more advanced modules and reporting applications, as well as in places where statistics based on large amounts of data are required. Improving the performance of this kind of element is connected mainly with code optimization. In our elearning platform we discovered that code optimization is very important. Elearning platforms are usually developed rapidly; many programmers involved, database schema changes and re-use of code sometimes make the code suboptimal and slow. It is worth mentioning one more optimization method which can be useful where complicated read-only transactions occur. This type of transaction appears mainly during the computation of different types of statistics. A performance improvement can be obtained by using snapshot isolation levels or similar multiversioning-based mechanisms available in most existing database servers. This reduces locking and improves the system response time. There are also performance problems connected with the file system. Elearning platforms sometimes store files outside the database, and the number of these files grows rapidly. An example could be a lecture in the form of HTML code consisting of a large number of pages and pictures. A large number of operations on small files in parallel creates a bottleneck for the operating system. Using a more efficient file system could solve this problem.
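The indexing guidelines above can be tried out on any relational engine. The sketch below uses SQLite (only because it ships with Python) to show an index on a foreign key and on a date column used for sorting; the table and column names are hypothetical, and the production platform described here uses SQL Server with Transact-SQL instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course       (course_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE announcement (ann_id    INTEGER PRIMARY KEY,
                               course_id INTEGER REFERENCES course(course_id),
                               posted_on TEXT, body TEXT);
    -- guideline: index the foreign key used in joins and the date column used for sorting
    CREATE INDEX idx_ann_course ON announcement(course_id);
    CREATE INDEX idx_ann_date   ON announcement(posted_on);
""")

# A typical point query: announcements of one course, newest first.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT a.posted_on, a.body
    FROM announcement a JOIN course c ON c.course_id = a.course_id
    WHERE a.course_id = 1
    ORDER BY a.posted_on DESC
""").fetchall()
for row in plan:
    print(row)   # the plan should mention the indexes instead of a full table scan
```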
8 Conclusions We have presented past and intended future developments of our elearning platform as a response to the needs of curricula development at our university. We have also shown the necessity of optimizing elearning systems, in particular their underlying relational databases, due to the rapidly growing amount of data and number of users. Consideration of the underlying database is also needed when implementing intelligent database methods on an elearning platform and when reporting and data mining on elearning data.
References 1. Banachowski, L., Mrówka-Matejewska, E., Lenkiewicz, P.: Teaching computer science on-line at the Polish-Japanese Institute of Information Technology. In: Proc. IMTCI, Warszawa, September 13-14 2004, PJIIT Publishing Unit (2004) 2. Banachowski, L., Nowacki, J.P.: Application of e-learning methods in the curricula of the faculty of computer science, 2007 WSEAS International Conferences. In: Banachowski, L., Nowacki, J.P. (eds.) Advances in Numerical Methods, Cairo, Egypt, December 29-31, 2007. Lecture Notes in Electrical Engineering, vol. 11, pp. 161–171. Springer, Heidelberg (2009) 3. Banachowski, L., Nowacki, J.P.: How to organize continuing studies at an academic institution? In: Proc. WSEAS Intern. Conferences, Istanbul, Turkey, May 30 - June 1 (2009) 4. How to Use Social Software in Higher Education, handbook from the iCamp project (2008), http://www.icamp-project.org 5. Shasha, D., Bonnet, P.: Database tuning: principles, experiments, and troubleshooting techniques. Morgan Kaufmann Publishers Inc, San Francisco (2003) 6. Lenkiewicz, P., Stencel, K.: Percentages of Rows Read by Queries as an Operational Database Quality Indicator (under preparation) 7. Agrawal, S., Chaudhuri, S., Kollár, L., Marathe, A.P., Narasayya, V., Syamala, M.: Database tuning advisor for Microsoft SQL Server 2005. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 1110–1121. Morgan Kaufmann, San Francisco (2004) 8. Chaudhuri, S., Narasayya, V.R.: Autoadmin ’what-if’ index analysis utility. In: Haas, L.M., Tiwary, A. (eds.) SIGMOD Conference, pp. 367–378. ACM Press, New York (1998) 9. Werłaty, K.: Raportowanie i Data Mining dla Studiów Internetowych w tym analiza logowań. M.Sc. Thesis, PJIIT (2009), (in Polish)
Evolutionally Improved Quality of Intelligent Systems Following Their Users’ Point of View* Barbara Begier Institute of Control and Information Engineering, Poznan University of Technology, pl. M. Sklodowskiej-Curie 5, 60-965 Poznan, Poland [email protected]
Abstract. Authors of various intelligent expert systems concentrate their efforts on innovative elements of the developed software. But they often ignore quality aspects, especially those concerning software usage and users' comfort of work. The decision how to improve an expert system during its evolutional development requires an instrument to determine whether the installed software tool meets the expectations of its users. One of the possible solutions, in line with agile methodologies, is to involve users in the software process. Users provide regular feedback on the considered product. In the described approach feedback is obtained in a survey by questionnaire. Guidelines are given on how to design and conduct it. The quality tree of an expert system reflects its users' point of view. Specifications formulated on the basis of the feedback allow software designers to develop improved versions of the product. The reported empirical research refers to a software system applied in civil engineering. After six iterations of its assessment and the related improvements, the level of users' satisfaction with the product is currently much better than at the beginning.
* The work was supported by Polish Ministry of Science and Higher Education under the grant nr 0014/R/2/T00/06/02.
1 Introduction Software quality referred to the information systems and customer satisfaction are based on a content and format of the presented data and usage aspects of the considered software [6, 13]. Authors of intelligent systems concentrate their efforts on innovative elements, like knowledge representation, intercommunication of software agents, inference methods, and other formal models implemented by this kind of software. Then software authors are concerned with correct implementation of devised models and technical problems. This constitutes the basis for software correctness [9] from their point of view. The product is being developed evolutionally for years to cover more and more real life cases, constraints, and exceptions. First of all, good quality of software is equivalent to a small number of defects, which leads to a minimal amount of avoidable rework [4]. This is a useful criterion from a programmer’s point of view; however, low density of defects alone does not satisfy software users any more. Nowadays the quality of a software product is * The work was supported by Polish Ministry of Science and Higher Education under the grant nr 0014/R/2/T00/06/02. N.T. Nguyen et al. (Eds.): Adv. in Intelligent Inform. and Database Systems, SCI 283, pp. 191–203. springerlink.com © Springer-Verlag Berlin Heidelberg 2010
assessed by the users rather than by the developer, and a set of measures is applied rather than a single measure. A brief review of quality aspects related to software systems, including intelligent software tools, is given in Section 2. In the presented paper the quality aspects are referred to expert systems. Their users are not professionals in computer science or software engineering; they are experts in their professional domain. They often complain about the limited number of real life cases considered by the delivered software tool. Correctness of input data usually lies far from researchers' interest, but it forms the basis of the usefulness of a given product; defects in input data lead to further failures in software usage. The quality characteristics noticed and cyclically assessed by users of the considered class of software are given in Section 3. Users' impressions and feelings related to the comfort of work, as well as ethical aspects, are added to the set of considered quality criteria in the described approach. Measures of usability also include those observed by indirect users. The problem arises of how to measure software quality and then make use of the obtained results in the evolutional development of a particular product. There is a need to learn user expectations and software quality measures from the users' point of view. To improve software quality, users' involvement in the software process is recommended. It is in line with agile methodologies [12, 17] and the principles expressed in the Agile Manifesto [16]. Users' involvement in software development definitely brings benefits [14, 18]. On the other hand, the user's role may be problematic, so some guidance for users is also required [10]. Cooperation with users especially helps to define requirements successfully [15]. But domain experts are often not available to software developers. The described proposal is to reduce users' involvement to the cyclic assessment of each developed software version. The former experiences (obtained in 2002−2005) with software product assessment [1, 2] show that the results of cyclic software assessment are useful for software developers; the intended (and confirmed in practice) product assessment makes it possible to elicit feedback from software users. The developers of expert systems are not isolated from software users this way. The presented paper is focused on guidelines on how to arrange these activities and then process their results statistically (in other words, how to make use of these results). The discussed example, described as a case study in Section 4, comes from civil engineering. Selected results of product assessments conducted in 2005−2009 are included. Specifications of software quality improvement based on the results of the survey are discussed in Section 5.
2 Software Quality Aspects and Managing User Expectations According to the standards, software quality originates in the software process. It is the core principle expressed in the family of the ISO 9000 standards and it forms the basis for the CMM model recommended by the Software Engineering Institute [5]. Developers of expert systems, especially those addressed to real organizations, should consider and accept some general facts, common to various applications:
• Software system is developed till its withdrawal − iterative-incremental model of evolutional system development is applied in the relatively long product life cycle. • Clerks, officials, lawyers, doctors, etc. but not programmers are the direct users. • The provided quality of services for indirect users (applicants, patients, doctors, customers, etc.) depends deeply on solutions implemented in the software tool. • New requirements are born after the basic needs are satisfied according to the Maslow’s hierarchy of needs. • Correctness of the input data conditions the suitable solutions and correct content of generated documents; it is a matter of great importance for users. • The existing data have to be incorporated and applied in the new system. • Maintained records support a communication across an organization. • Privacy is a problem of great importance. All cited above statements are to be considered during any software system development. Strategies and tactics have been analyzed how to manage end user expectations and to address the risk of failure in this area. The notion of consumer satisfaction applied in marketing has been extended to software development [11, 13, 19]. Software production is not only a product development but is a combination of product and service delivery to offer a solution to the users. Working with users (not at them or for them) and letting them make tough choices, and, in general, keeping users involved throughout the project are considered the successful tactics [22]. The specified criteria to assess software quality are then decomposed into a set of measures. There are usually followed the basic quality attributes given in the ISO 9126: functionality, reliability, usability, efficiency, maintainability, and portability. This set may be expanded to include other important quality attributes from software user’s point of view. Then it is used as a checklist in software development. The Web era shows that software users may abandon those applications which are too difficult or simply boring for them and may switch to other pages. The quality criteria related to software available via Web do not much differ from those formulated for other products. Mostly primary features are emphasized − software product should be technically complete, testable, maintainable, structured, efficient, secure, etc. [23]. To keep users’ attention, the quality criteria for Website excellence have been formulated [21], starting from the necessary functionality (including accessibility, speed, and easy navigation) which constitutes only 20% of the entire excellence. But the devised set still does not consider user’s comfort of work, his/her likings, and atmosphere at workplace, for example. The derived measures, devised for a particular software product, are then applied in its assessment. But it is not obvious who is predisposed to assess the product − its developers, domain experts, quality engineers, quality auditors, users? The presented approach confirmed an importance of a feedback from software users.
3 Quality Features of Expert Systems from the Users’ Point of View An expert system (ES) is an open system − it interacts with its environment (users are its primary elements). Its operation is context-dependent; 39 socio-technical dimensions of ES quality have been specified [7]. The author’s proposal is the devised set of quality attributes to represent quality of expert systems from their users’ point of view as shown in Figure 1. This set is an extension of quality characteristics recommended in the ISO 9126. Let us explain their meaning starting from quality attributes on the left side of the presented quality tree. General usefulness of the expert system for an entire organization of software purchaser and its customers (if any exist) has been introduced as the important criterion in the presented approach − users are employees and they are domain experts who know the needs better than system developers. It is been assumed that basic Functionality and Safety (presence and suitability of the expected and implemented functions) have been already confirmed during the testing phase in accordance with functional requirements. The user evaluates mainly the variety of the considered types of real life cases (building constructions and their loads, for example), convenient cooperation of the analyzed software product with other required and applied software tools (the AutoCAD system in the presented case), and data verification and correction facilities decomposed into particular types of data and cases of usage. Therefore, Functionality is decomposed into at least three subsets of measures to assess the Variety of considered real life cases, required Cooperation with other software tools, and Data correctness facilities. The last criterion has been introduced to emphasize the problem of reliable data. Safety refers mainly to the maintained data. Since the real life expert, like a civil engineer, makes a number of intermediate (design) decisions on the base of partial results, the software product should have the properties of granularity and visualization of subsequent steps of calculations, i. e. it should consists of modules or other units which generate partial results visible for the user. Specialists must be able to trace calculations, change the input data at any point and resume calculations from that point. Also an ability to repeat calculations from a selected point is often expected. Product Usability is a key quality attribute from the user’s point of view although not the only one. It refers to the comprehensible software construction (including easy navigation and access to required data and tracing facilities) which conditions its proper use, and to facilities of user interface enabling ease of learning and ease of use of an intelligent system. Thus the Usability has been decomposed into four sub-criteria specified on the quality tree. Conformity with domain terminology and applied notation (including graphics) is expected. The high level of software usability translates also into the efficiency of work (the dash line symbolizes this fact) and, in consequence, in high productivity in users’ organization. Usability is then decomposed into a rich set of detailed features including those providing ease of use.
[Figure 1 shows the quality tree: the root "Expert system quality" branches into General usefulness, Functionality and safety (variety of cases, cooperation with other software tools, data correctness), Usability (comprehension of software construction, friendliness of user interface, ease of use, ease of learning), Efficiency (of direct user's work, of services for indirect users), Reliability, Portability, Impressions & feelings, and Ethical aspects.]
Fig. 1. The quality tree of a software expert system
Efficiency in the presented approach has been decomposed into two subsets: Efficiency of direct user's work (observed by a direct user) with the provided software tool and Efficiency of services provided for indirect users (if any) − the efficiency of provided services should be assessed separately by the direct and indirect users. The specified measures of the second sub-criterion are related to an inquirer's or an applicant's service in a public organization, for example, like the following author's proposals:
− Average time of performance of each service specified for indirect users
− Time required by data acquisition procedures (if they are time-consuming)
− Total time required to find a solution (to fix a business matter, for example)
− Number of required documents/data that should be presented by an applicant
− Percentage of real life cases considered by the system
− Number of failures in data, including their incompleteness, met in one day of work
− Number of words/records/documents required to correct the particular data
− Frequency of net problems monthly (which cause delays in service)
− Paper savings monthly (number of pages generated earlier and nowadays, if any).
Reliability of software applied in civil engineering refers to the required ability to operate, stability, and savings of resources. From the user’s point of view, Portability of a software product is equivalent to its easy installation on various computer platforms. Transfer to new platforms then requires minimum of rework. The last two criteria in the presented approach are referred to so-called soft features. The criterion of User’s impressions and feelings has been introduced to consider various attitudes, behaviors, and emotions of software users who are different people than software developers. Some examples of their measures are: satisfaction with a software tool, its general assessment at work, comfort of work with the considered tool (its usage is not stressful), product impact on an atmosphere at workplace, work with the tool is rather interesting than boring, product is in accordance with user’s likings, screen views are esthetic, etc. These subjective
measures help to understand if people will keep applying the software tool instead of looking for the other one. The satisfied users make all business successful. Ethical aspects include social expectations referred to the considered tool, especially when applied in a public organization (but not limited to those cases). The Ethical aspects and social expectations are considered with respect to social interests and are not referred to a single person or one organization. Developers of an expert system may be unaware of various threats [3] the usage of the developed software product may cause, dealing with: privacy, honesty, respect to particular persons, etc. The ETHICOMP conference series provides an international forum for discussions concerning ethical and social aspects of the technological progress. Each quality criterion is then decomposed into particular quality measures.
4 Guidelines to Software Assessment by Questionnaire Survey and Its Results − the Case Study The presented guidelines to software quality assessment are described as the case study on a base of author’s experience. Its aim is to improve those software features which are poor from the users’ point of view. The devised quality tree reflects this aim. There are the following quality attributes introduced on the quality tree developed for the considered expert system [2]: Functionality (including Safety), Usability decomposed into 4 subcriteria as shown in the Figure 1 and combined with efficiency of use, Reliability, and Portability. They have been decomposed into particular measures. Usability has been decomposed into 26 measures. The General assessment has been separately added to help expressing user’s general opinion about the product. It represents the user’s satisfaction with that product − it is the main quality measure of the assessed software tool. No social expectations are associated with it. The devised set of 41 quality measures concerns the user’s point of view. Techniques developed and used in marketing, sociology, and also recommended in education [20], are applied to design a questionnaire, to specify its recipients, to perform pilot tests, and to improve the questionnaire itself. The designed questionnaire has several parts. An initial part contains five questions intended to learn users’ skills in computing. The questionnaire items of the main part are divided into 8 groups related to criteria and sub-criteria of the assessment − the questionnaire structure reflects the devised quality tree of the assessed software product. Each group of items has a name corresponding to the quality criterion (attribute) and focusing the user’s attention on a given subject. Each questionnaire item is related to the particular quality measure. Some space is left for user’s suggestions and opinions. The reported software quality assessments were performed six times – the first one in 2002 and the last one in 2009. The first four editions have been reported in [1]. The first two were a kind of pilot tests. It was necessary to improve the questionnaire because some items were interpreted by respondents in different ways. In next two editions the set of values and their explanations was attached to each measure as possible answers to avoid misunderstandings, for example:
M24. The initially required help of an experienced program user [1 – necessary, 2 – desired, 3 – partial, 4 – occasional, 5 – unnecessary]
In the last two editions of the questionnaire, the psychometric scale devised by Rensis Likert was applied. Every questionnaire item has the form of a statement, for example:
M24. The help of an experienced user is not required in the initial period of using the program.
Then a respondent is asked to indicate his/her degree of agreement with this statement, using a five-point scale (such a scale is applied at schools in Poland, where "5" is the highest mark, equivalent to the American "A", and "1" the worst one, equivalent to the "E"): 5 − I fully agree, 4 − I rather agree, 3 − I have doubts, 2 − I rather disagree, 1 − I completely disagree.
Intelligent systems are usually addressed not to thousands of people but to particular experts. So selecting a representative group (a sample) of respondents [8] is not required, and several dozen available users are the questionnaire recipients. The considered expert system applied in civil engineering has been assessed only by its direct users (from 38 up to 81 respondents). The time of an assessment is set a priori with the users at one month. There is no need to gather all respondents at the same place and time. It took ca 15 minutes for each respondent to fill in the questionnaire. Results obtained in the initial part of the questionnaire have shown in each survey that civil engineers are well skilled in computing – all of them use computers on a daily basis, and all are experienced users of popular general purpose tools like MS Excel. And all of them know and use the AutoCAD system.
The result processing procedure is applied when all respondents have already given their ratings in the questionnaires. All answers are recorded in a report sheet. Every row of the sheet concerns one quality measure and contains its number, name, and the ratings given by particular respondents. The following values are calculated for each measure (a small computational sketch of this step is given after the list):
− number of provided answers,
− their sum,
− obtained minimum value,
− obtained maximum value,
− calculated mean value,
− number of obtained "1",
− number of obtained "2",
− number of obtained "3",
− number of obtained "4",
− number of obtained "5".
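A minimal sketch of this result processing step is given below; the measure identifiers and ratings are invented sample data, not results of the reported surveys:

```python
from statistics import mean
from collections import Counter

# Hypothetical report sheet: measure id -> ratings given by respondents (1..5)
ratings = {
    "M24": [5, 4, 4, 5, 3, 5, 4],
    "M33": [3, 2, 4, 3, 5, 2, 4],
}

for measure, values in ratings.items():
    counts = Counter(values)
    print(measure,
          "answers:", len(values),
          "sum:", sum(values),
          "min:", min(values),
          "max:", max(values),
          "mean:", round(mean(values), 2),
          "histogram (1..5):", [counts.get(r, 0) for r in range(1, 6)])
```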
These data make it possible to learn:
• What is the progress in quality, namely how many measures, and which in particular, have obtained a higher mean value than before?
• What is the mean value of the general assessment representing user's satisfaction with the product and how does it compare with the ratings in previous surveys?
• What measures have obtained the highest values (more than 4.0)?
• Which measures have obtained a substantial dispersion of values? Diagrams are recommended to illustrate it.
• How many and which measures have obtained worse mean values than before?
• What measures have obtained the lowest values in the present survey?
• What measures have obtained poor ratings (i.e. values of "1" or "2")?
• What are the ratios of particular ratings to the ratings obtained in previous surveys?
It is recommended to present the results obtained in the last 3 editions of the survey in a tabular form and, in subsequent columns, the ratios of mean values related to particular measures obtained in the compared editions. These data constitute the basis of conclusions and further actions. The sample data referring to several selected measures, including the product general assessment (M41), are presented in Table 1. The analysis of the results of the last three iterations of the assessment shows that the users are satisfied with the considered product and assess it highly. In the last edition the mean value of as many as 35 out of 40 particular measures was not less than 4.0 (32 in 2007 and 30 in 2005, respectively). The mean value of M41, expressing general satisfaction with the considered product, reached 4.64 in 2009.
Table 1. Comparison of selected measures and their ratios related to previous editions
Measure ID   Mean value in 2009   Mean value in 2007   Mean value in 2005   2009/2007 ratio   2009/2005 ratio
M3           4.29                 4.12                 4.00                 1.04              1.07
M4           4.27                 4.12                 4.14                 1.04              1.03
M5           3.75                 3.93                 3.71                 0.96              1.01
M7           4.62                 4.53                 4.47                 1.02              1.03
M23          4.85                 4.82                 4.93                 1.01              0.98
M24          4.36                 3.75                 3.07                 1.16              1.42
M26          4.73                 4.59                 4.14                 1.03              1.14
M33          3.50                 2.88                 3.07                 1.22              1.14
M35          3.58                 3.18                 3.43                 1.13              1.04
M41          4.64                 4.55                 4.08                 1.02              1.14
Mean values of 29 measures were higher in 2009 than in 2007, in particular: M3.Mode of cooperation with the AutoCAD system, providing an effective way to check the correctness of the input data, M4.Possibility of presenting subsequent steps of calculations, M7.Comprehensibility of all provided software options and their use, M24.Initially required help of an experienced program user, M26.Number of mistakes made currently by the user in a time unit, compared with that at the beginning, and M35.Protection against running the program for incorrectly given load values. The mean values of three measures were the same, and nine were worse including the M5.Possibility of on-line correction of data describing the given construction. The M23.The time required to learn how to use
the program gained the highest mean value, namely 4.85. The M33.Protection against unauthorized access was assessed as the worst one, although great progress has been noticed in this case (the system is not available via the net). It is interesting that the obtained results do not always reflect the effort made to improve a particular feature.
Fig. 2. The dispersion of ratings related to the measure M5.Possibility of on-line correction of data describing the given construction in 2007 (on the left) and 2009 (on the right).
The dispersion of the obtained ratings decreases after each edition of the software assessment. The full set of available ratings, including the highest and the lowest possible mark, has been used in only five cases, related to: M5.Possibility of on-line correction of data describing the given construction (illustrated in Figure 2), M6.Possibility of on-line correction and modification of data concerning loads of the construction, M22.Ability to customize views of screen objects to the user's likings, M33.Protection against unauthorized access, and M35.Protection against running the program for incorrectly given load values. It is very important that users' ratings related to particular measures are becoming similar. Examples of value dispersion related to two selected measures, M25 and M31, are illustrated in Figures 3 and 4, respectively.
Fig. 3. Dispersion of the values of ratings obtained for the measure M25.Estimated number of mistakes made by a user in one hour of the software use after the initial period obtained in 2007 (on the left) and in 2009 (on the right).
Fig. 4. Dispersion of the values of ratings obtained for the measure M31.Feeling of comfort of work when using the product obtained in 2007 (on the left) and in 2009 (on the right).
5 Developing Specifications of Software Quality Improvement All ratings and suggestions given by users were carefully analyzed after each questionnaire survey. Fortunately, many of the polled users showed their willingness to influence the software product (although the number of expressed suggestions was decreasing in subsequent editions of the survey) − one in two respondents in 2005, one in three in 2007, and one in four in 2009 (14) gave his/her particular suggestions concerning the software improvement. The suggestions varied a lot, from using larger size of letters indicating particular points on a specified chart, to the demand to copy automatically particular views to allow a user to analyze and discuss those views later. Also the danger of losing previous data was identified when the number of analyzed points describing a given construction is changed. Users required to specify and to explain precisely the meaning of particular data used on the screen, e.g. the notion of the level of a concrete slab floor or a ceiling at the end of design activities. Specifications of improvements were formulated on the base of poor ratings and suggestions given by users in assessment of subsequent versions of the product. They were grouped into three subsets concerning, respectively: 1. Improvement of the specified software feature, including the way of presentation of screen views, 2. Close cooperation with the AutoCAD system, 3. Number and types of analyzed constructions. Three enumerated lists of suggested changes were worked out, separately for each group. Each item on the list had its initially specified attributes: identifier, name of the suggested change, date of registration, description of the required change, justification of the change with reference to the goals of the product and its quality measures. Then the weight was assigned to each change (improvement) on a base of the team leader’s experience concerning the need of the change, its possible influence on the product, and the number of users who demand it. As a result, three subsequent attributes of the change were added to each item on the list of suggested
changes: weight marked by a distinctive color, estimated cost (in days or weeks of work), and the name of a programmer responsible for introducing the change. The red color indicates changes accepted by the team leader to be introduced in the next version. The developed lists were transferred to programmers who were then obliged to value several other attributes and to document what real modifications of the system had been made. The maintained history of all changes and their justifications indicates favorite areas of programmers’ work although these efforts are not always reflected in assessment by users. What programmers do first is to improve the interface with the AutoCAD system. Several pre- and postprocessors cooperating with the AutoCAD have been subsequently developed and modified after each assessment. Software improvements are made exactly to users’ expectations expressed during the software product assessment by the questionnaire survey. The authors of the BW system have eliminated some flaws and shortcomings pointed out in the conducted questionnaire surveys. The examples of improvements are: • Ability of the improved external pre-processor to transfer not only data concerning geometry of a building but also its loads from the AutoCAD system, • Precise and detailed messages concerning errors detected in the input data, • Available analysis of shear wall construction which characterizes alternate stiffness along the height of the building, • Added printout of geometrical characteristics for the shear wall structure, • Extended printout of the mass of the considered construction, • Automated visualization of default images.
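The attributes of a registered change listed in this section can be captured in a simple record structure. The sketch below is illustrative only − the field names and the sample change are assumptions, not an excerpt from the real change list:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SuggestedChange:
    """One item on a list of suggested changes, with the attributes named in Section 5."""
    change_id: str
    name: str
    registered_on: date
    description: str
    justification: str            # reference to product goals / quality measures
    weight: Optional[str] = None  # e.g. "red" marks changes accepted for the next version
    estimated_cost_days: Optional[int] = None
    assigned_programmer: Optional[str] = None

change = SuggestedChange(
    change_id="C-07",
    name="Transfer building loads from AutoCAD",
    registered_on=date(2009, 3, 15),
    description="Extend the external pre-processor to import load data, not only geometry.",
    justification="Suggested by users; relates to measure M3 (cooperation with AutoCAD).",
    weight="red",
    estimated_cost_days=10,
    assigned_programmer="hypothetical developer",
)
print(change.name, "-", change.weight)
```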
6 Conclusions The evaluation of software quality from the users’ point of view and related product improvements are topical problems in software engineering. In the author’s opinion, only a regular and systematic feedback from users may help to solve the problem of software quality. It decidedly refers to expert systems. The user’s point of view may differ from the developer’s one. In the described approach the close cooperation of developers and users is not limited to the requirements phase. The core of continuous feedback from users is the assessment of each developed software version. Its instrument is the questionnaire survey. It enables software designers to learn the users’ point of view on a software quality. Today’s users are skilled enough to play active roles in a software process and become conscious of their rights to assess quality of software products, especially those developed for public organizations. Users, by their ratings and suggestions, influence software product development. In the presented paper the quality aspects of an expert system are referred, as the case study, to the software system supporting calculations in civil engineering. The quality criteria extend the recommendations of the ISO 9126. Accepted criteria have been decomposed into quality measures − the quality tree is specified for a given expert system. Not only stricte technical features but also some soft
features like user’s feelings and impressions are considered − it is the author’s contribution. Software developers have to accept that some quality measures may be assessed subjectively by users. The guidelines on how to conduct the survey and make use of obtained answers are described. As a result software tool is adapted to users’ needs. The author experienced that many users are eager to present their suggestions on how to improve a product and to cooperate with software authors. The evidence shows that feedback from users is easy to elicit, has been readily accepted by programmers, and resulted in successful software improvements. Ratings and suggestions given by users’ constitute the basis for specification of the required and accepted changes. Thus the directions of software product improvement exactly reflect the users’ ratings, opinions and suggestions. After six iterations of software development and product quality assessment performed by its users, the mean value of users’ satisfaction with the considered system is evidently high (4.64). And the obtained ratings of particular quality measures are also higher than in previous cases. The questionnaire itself also needs to be periodically improved. The wording in the final version of the questionnaire is clear − questionnaire items are unambiguously specified. Possible expected answers are expressed using the Likert scale. The described solution is not limited to engineering tools but also might include software expert systems used by public administration, judiciary, hospitals, etc. The presented approach is currently being used for quality evaluation of the Analyzer of Facts and Relations as a part of the Polish Technological Security Platform. There are so far no mechanisms enforcing software improvement in those areas. Despite all differences the periodical assessment of a software product may be also incorporated in the evolutional development of various business expert systems. The range of user expectations increases − users formulate new requirements after their basic needs are satisfied. This statement is also applicable to intelligent software systems, applied and assessed in a changing environment, including changes in business, legal rules, technology and the growing computing skills of software users. So the new results of assessment may be even worse than the previous ones, although many improvements have been done in the meantime. Sophisticated solutions concerning e-institutions, e-commerce or e-finance may be rejected if they do not meet user expectations. Quality improvements must respond to and also anticipate users’ needs − only close cooperation with users may help here.
References 1. Begier, B., Wdowicki, J.: Feedback from Users on a Software Product to Improve Its Quality in Engineering Applications. In: Sacha, K. (ed.) IFIP 227. Software Engineering Techniques: Design for Quality, pp. 167–178. Springer, New York (2006) 2. Begier, B.: Software quality improvement by users’ involvement in the software process (in Polish), Publishing House of Poznan University of Technology, Poznan, Poland (2007)
3. Begier, B.: Users’ involvement may help respect social and ethical values and improve software quality. Information Systems Frontiers (To appear in 2009) 4. Boehm, B., Basili, V.: Software Defect Reduction Top-10 List. Computer, 135–137 (2001) 5. Capability Maturity Model Integration (CMMISM), http://www.sei.cmu.edu/cmmi/general 6. Chen, L., Soliman, K., Mao, E., Frolick, M.N.: Measuring user satisfaction with data warehouses: an exploratory study. Information & Management 37, 103–110 (2000) 7. Conrath, D.W., Sharma, R.S.: Toward a Diagnostic Instrument for Assessing the Quality of Expert Systems. ACM SIGMIS Database 23, 37–43 (1992) 8. Consultation guidance on selecting your sample size, Clackmannanshire Council (July 2008) 9. Cooke, J.: Constructing Correct Software. The Basics. Springer, London (1998) 10. Damodaran, L.: User involvement in the systems design process - a practical guide for users. Behaviour & Information Technology 15, 363–377 (1996) 11. Doll, W.J., Torkzadeh, G.: The measurement of end-user computing satisfaction. MIS Quarterly 12(2), 259–274 (1988) 12. Highsmith, J.: Agile Project Management. Addison-Wesley, Boston (2004) 13. Ives, B., Olson, M.H., Baroudi, J.J.: The measurement of user information satisfaction. Comm. of the ACM 26, 785–793 (1983) 14. Kujala, S.: User involvement: a review of the benefits and challenges. Behaviour & Information Technology 22, 1–16 (2003) 15. Kujala, S.: Effective user involvement in product development by improving the analysis of user needs. Behaviour & Information Technology 27(6), 457–473 (2008) 16. Manifesto for Agile Software Development, Agile Alliance (2001), http://agilemanifesto.org/ 17. Martin, R.C., Martin, M.: Agile Principles, Patterns, and Practices in C#. Pearson Education and Prentice Hall, Indianapolis (2007) 18. Mattsson, J.: Exploring user-involvement in technology-based service innovation. ICEProject, Roskilde Univ. and Aalborg Univ (2009), http://www.ice-project.dk 19. McHaney, R., Hightower, R., Pearson, J.: A Validation of The End-User Computing Satisfaction Instrument In Taiwan. Information & Management 39, 503–511 (2002) 20. National Learner Satisfaction Survey: Guidance on the core methodology and core questionnaire (April 2005), http://www.lsc.gov.uk 21. Quality Criteria for Website Excellence, http://www.worldbestwebsites.com/criteria.htm 22. Petter, S.: Managing user expectations on software project: Lessons from the trenches. International Journal of Project Management 26, 700–712 (2008) 23. Saturn Quality Aspects, Web Development Company, India, 2008 (Accessible in November 2008), http://www.saturn.in/advantages/quality-aspects.shtml
Mining the Most Generalization Association Rules Bay Vo1 and Bac Le2 1
Faculty of Information Technology – Ho Chi Minh City University of Technology, Vietnam [email protected] 2 Faculty of Information Technology – University of Science National University of Ho Chi Minh City, Vietnam [email protected]
Abstract. In this paper, we present a new method for mining the smallest set of association rules by pruning rules that are redundant. Based on the theorems presented in Section 4, we develop an algorithm for pruning rules directly in the rule generation process. We use frequent closed itemsets and their minimal generators to generate rules. The smallest rule set is generated from the minimal generators of a frequent closed itemset X to X and from the minimal generators of X to a frequent closed itemset Y (where X is a subset of Y). Besides, a hash table is used to check whether the generated rules are redundant or not. Experimental results show that the number of rules generated by this method is smaller than the number of non-redundant association rules of M. Zaki and the number of minimal non-redundant rules of Y. Bastide et al. Keywords: Frequent closed itemsets, minimal generators, the most generalization association rules.
1 Introduction There are many methods developed for generating association rules, such as non-redundant association rules [14, 16] and minimal non-redundant association rules [1, 8-9]. Although these approaches are different, their common point is dividing the problem into two phases: i) mining all frequent closed itemsets (FCIs); ii) mining (minimal) non-redundant association rules from them. The authors of [1, 8-9, 14, 16] have proposed methods of mining (minimal) non-redundant¹ association rules that aim to reduce the number of rules and increase their usefulness for users. These methods are necessary to increase the effectiveness of using rules. However, the number of generated rules is still large. For example, consider the Chess database with minSup = 70%: the number of rules generated by M. Zaki's method [16] is 152074 and by the method of Y. Bastide et al. [1] is 3373625. In
¹ Other authors have different definitions.
In fact, some non-redundant rules can be inferred from other rules. Therefore, we propose a method for fast pruning of rules that can be inferred from the others. The contributions of this paper are as follows:
• We define the most generalization association rules, a new method for pruning association rules.
• We propose an algorithm for fast generation of the most generalization association rules from frequent closed itemsets.
The rest of this paper is organized as follows: Section 2 presents related work on mining association rules and on mining (minimal) non-redundant association rules. Section 3 presents concepts and definitions. Section 4 presents the theorems and the algorithm for generating the most generalization association rules. Section 5 presents experimental results, and we conclude our work in section 6.
2 Related Works
2.1 Mining Frequent Closed Itemsets (FCIs)
There are many methods for mining FCIs from a database. They can be divided into four categories [6, 17]:
1. Test-and-generate (Close [9], A-Close [8]): use a level-wise approach to discover FCIs. All of them are based on the Apriori algorithm.
2. Divide-and-conquer (Closet [10], Closet+ [13], FPClose [3]): use a compact data structure (extended from the FP-tree) to mine FCIs.
3. Hybrid (CHARM [15], CloseMiner [11]): use both "test and generate" and "divide and conquer" to mine FCIs. They rely on the vertical data format to transform the database into an item–tidlist structure and develop properties to quickly prune non-closed itemsets.
4. Hybrid without duplication (DCI-Close [7], LCM [12], PGMiner [5]): they differ from the hybrid methods in not using the subsumption-checking technique, so FCIs need not be stored in main memory and the hash table technique of CHARM is not used.
2.2 Generating Non-redundant Association Rules
Mining minimal non-redundant association rules (MNARs) was proposed in 1999 by Pasquier et al. [8-9]. First, the authors mined all FCIs by computing the closures of minimal generators. After that, they mined all MNARs by generating rules with confidence = 100% from mGs(X) to X (X is an FCI) and rules with confidence < 100% from mGs(X) to Y (X, Y are frequent closed itemsets and X ⊂ Y). In 2000, M. J. Zaki proposed a method to mine non-redundant association rules (NARs) [14]. His method is based on FCIs and their mGs: i) rules with confidence = 100%: self-rules (generated from mGs(X) to X, where X is an FCI) and down-rules (generated from mGs(Y) to mGs(X), where X, Y are FCIs, X ⊂ Y); ii) rules with confidence < 100%:
from mGs(X) to mGs(Y), where X, Y ∈ FCIs, X ⊂ Y. In 2004, he published an extended version of this work [16].
3 Concepts and Definitions
3.1 Frequent Itemset, Frequent Closed Itemset
Let D be a transaction database and X ⊆ I, where I is the set of items in D. The support of X, denoted σ(X), is the number of transactions in D containing X. X ⊆ I is called frequent if σ(X) ≥ minSup (minSup is a support threshold). Let X be a frequent itemset. X is called closed if there is no frequent itemset Y such that X ⊂ Y and σ(X) = σ(Y).
3.2 Galois Connection [16]
Let δ ⊆ I × T be a binary relation, where T is the set of transaction identifiers (TIDs). Let X ⊆ I, Y ⊆ T, and let P(S) denote the set of all subsets of S. The following two mappings between P(I) and P(T) are called a Galois connection:
i) t: P(I) → P(T), t(X) = {y ∈ T | ∀x ∈ X, xδy}
ii) i: P(T) → P(I), i(Y) = {x ∈ I | ∀y ∈ Y, xδy}
Example: Consider the database in Table 1.
Table 1. Example database

TID   Items bought
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T
t(AW) = t(A) ∩ t(W) = 1345 ∩ 12345 = 1345, and i(24) = i(2) ∩ i(4) = CDW ∩ ACDW = CDW.
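To make the two mappings concrete, the following Python sketch (an illustration only, not the authors' implementation) computes t(X) and i(Y) for the example database of Table 1; the dictionary db and the function names are hypothetical.

# Sketch of the Galois connection on the database of Table 1.
# db maps each transaction ID to its set of items (hypothetical variable name).
db = {
    1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
    4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT"),
}

def t(X):
    """Transactions containing every item of X (the mapping t: P(I) -> P(T))."""
    return {tid for tid, items in db.items() if set(X) <= items}

def i(Y):
    """Items common to every transaction of Y (the mapping i: P(T) -> P(I))."""
    common = None
    for tid in Y:
        common = db[tid] if common is None else common & db[tid]
    return common if common is not None else set().union(*db.values())

print(sorted(t("AW")))    # [1, 3, 4, 5]
print(sorted(i({2, 4})))  # ['C', 'D', 'W']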
3.3 IT-Tree and Equivalence Classes [15]
Let I be a set of items and X ⊆ I. The function p(X, k) = X[1:k] gives the k-length prefix of X, and a prefix-based equivalence relation θk on itemsets is defined as follows: ∀X, Y ⊆ I, X ≡θk Y ⇔ p(X, k) = p(Y, k). The IT-tree is a tree structure in which each vertex is an IT-pair X×t(X) (i.e., an itemset X together with the set of IDs of the transactions containing X), and an arc connects the IT-pair
X×t(X) to the IT-pair Y×t(Y) such that X ≡θk Y (where k is the length of X and the length of Y is k+1).
3.4 Minimal Generators [16]
Let X be a frequent closed itemset. X' ≠ ∅ is called a generator of X if and only if: i) X' ⊆ X, and ii) σ(X) = σ(X').
Let G(X) denote the set of generators of X. X' ∈ G(X) is called a minimal generator of X if it has no proper subset in G(X). Let mGs(X) denote the set of all minimal generators of X.
3.5 Definition 1 – General Rule
Let R1: X1→Y1 and R2: X2→Y2. Rule R1 is called more general than R2, denoted R1 ∝ R2, if and only if X1 ⊆ X2 and Y2 ⊆ Y1.
3.6 Definition 2 – Rule with Higher Precedence
Let R = {R1, R2, …, Rn} be the set of rules that satisfy minSup. Rule Ri is said to have higher precedence than rule Rj, denoted Ri ≻ Rj, if Ri ∝ Rj (i ≠ j) and: i) the confidence of Ri is greater than that of Rj, or ii) their confidences are the same, but the support of Ri is greater than the support of Rj.
3.7 Definition 3 – The Most Generalization Association Rules
Let R = {R1, R2, …, Rn} be the set of traditional association rules that satisfy minSup and minConf, and let RMG = R \ {Rj | ∃Ri ∈ R: Ri ≻ Rj}. RMG is called the set of the most generalization association rules (MGRs) of R.
From definition 3, we only generate association rules that have higher precedence, i.e., we prune a rule Rj in R if R contains a rule Ri such that Ri ≻ Rj. In fact, we do not prune rules after generating all rules; we rely on the theorems in section 4 to prune rules directly. Only a few rules need to be checked against the generated rules, using a hash table.
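The two relations of definitions 1 and 2 can be stated compactly in code. The sketch below is our own illustration (a rule is represented as a (left, right, support, confidence) tuple and the function names are hypothetical), not the authors' implementation.

def more_general(r1, r2):
    """Definition 1: r1 is more general than r2 iff X1 is a subset of X2 and Y2 a subset of Y1."""
    (l1, rhs1, _, _), (l2, rhs2, _, _) = r1, r2
    return set(l1) <= set(l2) and set(rhs2) <= set(rhs1)

def higher_precedence(ri, rj):
    """Definition 2: ri has higher precedence than rj iff ri is more general and
    has higher confidence, or equal confidence and higher support."""
    if not more_general(ri, rj):
        return False
    _, _, sup_i, conf_i = ri
    _, _, sup_j, conf_j = rj
    return conf_i > conf_j or (conf_i == conf_j and sup_i > sup_j)

# Example with two rules from section 4.5: A -> CW (sup 4, conf 1) vs. AT -> CW (sup 3, conf 1)
r_general = ("A", "CW", 4, 1.0)
r_specific = ("AT", "CW", 3, 1.0)
print(higher_precedence(r_general, r_specific))  # True: A -> CW takes precedence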
4 Mining the Most Generalization Association Rules Using FCIs
4.1 Theorem 1
Let X, Y (X ≠ Y) be two frequent closed itemsets and mGs(X) be the set of minimal generators of X. If X ⊄ Y then ∀Z ∈ mGs(X), Z ⊄ Y.
Proof: Assume ∃Z ∈ mGs(X): Z ⊂ Y ⇒ t(Y) = t(Z) ∩ t(Y\Z) ⊂ t(Z) = t(X) ⇒ t(Y) ⊂ t(X) ⇒ t(X∪Y) = t(X) ∩ t(Y) = t(Y) ⇒ Y is not closed.
From theorem 1, if X ⊄ Y, we need not generate rules from X to Y; i.e., we only generate rules from X to Y if X ⊂ Y.
4.2 Theorem 2
MGRs with confidence = 100% are only generated from X' → X (∀X' ∈ mGs(X), X is an FCI).
Proof: Because σ(X') = σ(X), where X' ∈ mGs(X), we have conf(X' → X) = 100%. To prove this theorem, we need to prove three subproblems:
i) The rule X' → Y is not an MGR (∀Y that are generators of X and Y ≠ X): because Y ⊂ X ⇒ X' → X ∝ X' → Y ⇒ X' → Y is not an MGR according to definition 3.
ii) The rule Z → X is not an MGR (∀Z that are generators of X, Z ∉ mGs(X)): because mG(X) ⊂ Z, mG(X) → X ∝ Z → X ⇒ Z → X is not an MGR.
iii) There exists no rule X' → Y' with confidence = 100% (∀X' generators of X, ∀Y' generators of Y, X, Y ∈ FCIs, X ⊂ Y): because σ(X') = σ(X), σ(Y') = σ(Y), and X, Y ∈ FCIs (X ⊂ Y) ⇒ σ(X') ≠ σ(Y') ⇒ conf(X' → Y') = σ(Y')/σ(X') = σ(Y)/σ(X) < 100%.
4.3 Theorem 3
MGRs with confidence < 100% are only generated from X' → Y (∀X' ∈ mGs(X); X, Y ∈ FCIs, X ⊂ Y).
Proof: According to theorem 1, and because X, Y ∈ FCIs and X ⊂ Y, we have σ(X') ≠ σ(Y) ⇒ conf(X' → Y) = σ(Y)/σ(X') = σ(Y)/σ(X) < 100%.
Based on theorems 1, 2 and 3, we generate MGRs from mGs(X) to X and from mGs(X) to Y, where X, Y ∈ FCIs and X ⊂ Y.
4.4 Algorithm for Generating MGRs
This section presents the algorithm for generating MGRs from FCIs. To mine the minimal generators of frequent closed itemsets efficiently, we use MG-CHARM [2]. Algorithm 1 presents the method for generating MGRs from FCIs. Initially, it sorts the frequent closed itemsets in ascending order of itemset length (line 1). After that, for each frequent closed itemset Ci, it performs the following:
- Assign RHS = ∅ (line 5).
- Generate rules with confidence = 100% from mGs(Ci) to Ci (line 6).
Input: Frequent closed itemsets (FCIs) and minConf
Output: RMG, the set of MGRs that satisfy minConf
Method: GENERATE_MGRs()
1.  Sort(FCIs)  // in ascending order according to k-itemset
2.  RMG = ∅
3.  for each Ci ∈ FCIs do
4.    superset = ∅
5.    RHS = ∅  // right-hand sides of rules generated from Ci
6.    FIND_RULES(Ci, Ci, RHS)  // find rules with the conf = 100%
7.    for each Cj ∈ FCIs, with j > i do
8.      if Cj.sup/Ci.sup ≥ minConf and Ci.itemset ⊂ Cj.itemset then
9.        superset = superset ∪ Cj
10.   Sort(superset)  // in descending order according to their support
11.   ENUMERATE_MGRs(Ci, superset, RHS)  // find rules with the conf < 100%

ENUMERATE_MGRs(C, S, RHS)
12. for all Cj ∈ S do
13.   FIND_RULES(C, Cj, RHS)

FIND_RULES(Ci, Cj, RHS)
14. for all X ∈ mGs(Ci) do
15.   Z = Cj.itemset \ (X ∪ RHS)
16.   if Z ≠ ∅ then
17.     r = {X → Z, Cj.sup, Cj.sup/Ci.sup}
18.     if CHECK_REDUNDANT(r) = False then
19.       RMG = RMG ∪ r  // add r into RMG
20.       RHS = RHS ∪ Z

Algorithm 1. Generating the most generalization association rules from FCIs
- Build superset, the set of frequent closed supersets of Ci, used to generate rules with confidence < 100% (that satisfy minConf – lines 7, 8). After building superset, the algorithm sorts the closed itemsets of superset in descending order of their support (line 10). In this way, rules are considered from high confidence to low confidence, from Ci to each Cj ∈ superset, which helps to check MGRs faster (line 18).
Function FIND_RULES generates MGRs from X ∈ mGs(Ci) to Cj (where Ci, Cj are closed itemsets and Ci.itemset ⊆ Cj.itemset). It considers the rule r whose left-hand side is X and whose right-hand side is Z = Cj.itemset \ (X ∪ RHS); its support is Cj.sup and its confidence is Cj.sup/Ci.sup. If Z ≠ ∅, it checks whether r is an MGR or not. If r is an MGR, it is added into RMG and Z is added into RHS (lines 14–20). Function CHECK_REDUNDANT uses a hash table whose keys are the values of the items in Z, i.e., each rule is checked with |Z| keys.
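The redundancy check can be sketched as follows. This is only an illustration of the hash-table idea described above, with hypothetical names (rule_index, check_redundant); the authors' actual data structures may differ. Each accepted rule X → Z is indexed under every item of Z, so a new candidate is compared only with the few stored rules that share an item in their right-hand sides.

# Hypothetical sketch of CHECK_REDUNDANT: rule_index maps an item to the
# rules already accepted whose right-hand side contains that item.
from collections import defaultdict

rule_index = defaultdict(list)   # item -> list of (lhs, rhs, sup, conf)

def check_redundant(lhs, rhs, sup, conf):
    """Return True if an already generated rule has higher precedence
    than lhs -> rhs (definitions 1 and 2), i.e. the candidate is redundant."""
    for item in rhs:
        for (l, r, s, c) in rule_index[item]:
            more_general = set(l) <= set(lhs) and set(rhs) <= set(r)
            if more_general and (c > conf or (c == conf and s > sup)):
                return True
    return False

def add_rule(lhs, rhs, sup, conf):
    if not check_redundant(lhs, rhs, sup, conf):
        for item in rhs:                       # |Z| keys, as described in the text
            rule_index[item].append((lhs, rhs, sup, conf))

# add_rule("A", "CW", 4, 1.0); add_rule("AT", "CW", 3, 1.0)  # the second rule is rejected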
4.5 Illustrations
Consider the database in Table 1. Using MG-CHARM [2] with minSup = 50%, we obtain 7 frequent closed itemsets, shown in Table 2.
Table 2. Results of MG-CHARM for the example in Table 1 (see [2] for more detail)
FCIs    sup   mGs
C       6     C
CD      4     D
CT      4     T
CW      5     W
ACW     4     A
CDW     3     DW
ACTW    3     AT, TW
Table 3. Results of algorithm 1 with minConf = 60%
FCIs   sup   mGs      Superset           RHS     MGRs satisfying minConf = 60%
C      6     C        CW, CD, CT, ACW    WDTA    C → W (5, 5/6), C → D (4, 4/6), C → T (4, 4/6), C → A (4, 4/6)
CD     4     D        CDW                CW      D → C (4, 1), D → W (3, 3/4)
CT     4     T        ACTW               CAW     T → C (4, 1), T → AW (3, 3/4)
CW     5     W        ACW, CDW, ACTW     CADT    W → C (5, 1), W → A (4, 4/5), W → D (3, 3/5), W → T (3, 3/5)
ACW    4     A        ACTW               CWT     A → CW (4, 1), A → T (3, 3/4)
CDW    3     DW       –                  –       –
ACTW   3     AT, TW   –                  –       TW → A (3, 1)
(Each rule is written as LHS → RHS (support, confidence).)
Consider the frequent closed itemset C:
• RHS = ∅.
• No rule is generated from mGs(C) to C.
• The superset (after sorting in descending order of support) is {CW, CD, CT, ACW}.
• Consider CW: we have the rule C → W (5, 5/6) ⇒ RHS = W.
• Consider CD: we have the rule C → D (4, 4/6) ⇒ RHS = WD.
• Consider CT: we have the rule C → T (4, 4/6) ⇒ RHS = WDT.
• Consider ACW: we have the rule C → A (4, 4/6) ⇒ RHS = WDTA.
After considering the closed itemset CDW, we have RMG = { C → W (5, 5/6), C → D (4, 4/6), C → T (4, 4/6), C → A (4, 4/6), D → C (4, 1), D → W (3, 3/4),
T → C (4, 1), T → AW (3, 3/4), W → C (5, 1), W → A (4, 4/5), W → D (3, 3/5), W → T (3, 3/5), A → CW (4, 1), A → T (3, 3/4) }. Consider the closed itemset ACTW: the two generated rules are AT → CW (3, 1) and TW → AC (3, 1). The rule AT → CW (3, 1) is not the most general because RMG already contains the rule A → CW (4, 1). For the rule TW → AC (3, 1), because RMG contains the rule W → C (5, 1), it becomes the rule TW → (AC − C) = TW → A (3, 1).
5 Experimental Results
All experiments described below were performed on a Centrino Core 2 Duo (2×2.0 GHz) with 1 GB RAM, running Windows XP. The algorithms were coded in C# 2005. The experimental databases were downloaded from [4]; their features are displayed in Table 4.
Table 4. Features of databases
Databases   #Trans   #Items
Chess       3196     76
Mushroom    8124     120
Pumsb*      49046    7117
Pumsb       49046    7117
Connect     67557    130
Retails     88162    16469
Accidents   340183   468
Table 5. Number of MGRs compared to non-redundant rules with minConf = 0%

Databases   minSup (%)   MGRs (1)   NARs (2)   MNARs (3)   Scale (2)/(1)   Scale (3)/(1)
Chess       80           951        27711      316057      29.14           332.34
Connect     97           101        1116       4600        11.05           45.55
Mushroom    40           283        475        1168        1.68            4.13
Pumsb*      60           126        192        611         1.52            4.85
Pumsb       95           101        267        690         2.64            6.83
The number of MGRs is always smaller than the number of non-redundant rules. For example, consider the Chess database with minSup = 80% and minConf = 0%: the number of MGRs is 951, compared to 27711 non-redundant rules [16] and 316057 minimal non-redundant rules [1]. We can see that in some cases the mining time of MGRs is longer than that of MNARs, because the algorithm must check whether some rules are MGRs or not, especially when the number of generated MGRs is large. However, it is faster than mining NARs because, in mining NARs, each generated rule must be checked against a large number of previously generated rules.
Table 6. Experimental results of MGRs with minConf = 50%
Databases   minSup (%)   #rules   time (s)
Chess       85           471      1.42
Chess       80           951      10.09
Chess       75           1761     53.22
Chess       70           3175     258.98
Connect     97           101      0.42
Connect     95           235      1.22
Connect     92           382      3.98
Connect     90           510      8.58
Mushroom    30           728      0.45
Mushroom    25           1149     0.69
Mushroom    20           2681     1.44
Mushroom    15           4135     3.06
Pumsb*      60           126      0.45
Pumsb*      55           202      0.8
Pumsb*      50           388      1.38
Pumsb*      45           1283     3.06
Pumsb       94           170      0.48
Pumsb       92           341      0.88
Pumsb       90           619      1.75
Pumsb       88           1091     5.42
Retail      1            100      1.52
Retail      0.8          152      2.42
Retail      0.6          258      4.59
Retail      0.4          525      10.95
Accidents   80           146      1.8
Accidents   70           504      4.39
Accidents   60           2117     11.61
Accidents   50           8170     43.56
6 Conclusions and Future Work
In this paper, we proposed a new method for mining association rules. The number of rules generated by our approach is smaller than that of the two previous approaches [1, 16]. The approach uses hash tables to check whether a rule is an MGR or not. Besides, the number of rules that need to be checked is small; therefore, the checking consumes little time. Recently, lattice-based approaches for reducing run-time have been developed; we will therefore use a lattice to generate MGRs in future work. Using interestingness measures to reduce the number of rules will also be considered.
References 1. Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., Lakhal, L.: Mining minimal nonredundant association rules using frequent closed itemsets. In: 1st International Conference on Computational Logic, pp. 972–986 (2000)
2. Bay, V., Bac, L.: Fast algorithm for mining minimal generators of frequent closed itemsets and their applications. In: The IEEE 39th International Conference on Computers & Industrial Engineering, Troyes, France, July 6–8,2009, pp. 1407–1411 (2009) 3. Grahne, G., Zhu, J.: Fast Algorithms for Frequent Itemset Mining Using FP-Trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005) 4. http://fimi.cs.helsinki.fi/data/ (Download on April 2005) 5. Moonestinghe, H.D.K., Fodeh, S., Tan, P.N.: Frequent Closed Itemsets Mining using Prefix Graphs with an Efficient Flow-based Pruning Strategy. In: Proceedings of 6th ICDM, Hong Kong, pp. 426–435 (2006) 6. Lee, A.J.T., Wang, C.S., Weng, W.Y., Chen, J.A., Wu, H.W.: An Efficient Algorithm for Mining Closed Inter-transaction Itemsets. Data & Knowledge Engineering 66, 68– 91 (2008) 7. Lucchese, B., Orlando, S., Perego, R.: Fast and Memory Efficient Mining of Frequent Closed Itemsets. IEEE Transaction on Knowledge and data Engineering 18(1), 21–36 (2006) 8. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering Frequent Closed Itemsets for Association Rules. In: Proc. Of the 5th International Conference on Database Theory, Jerusalem, Israel. LNCS, pp. 398–416. Springer, Heidelberg (1999) 9. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules using Closed Itemset Lattices. Information Systems 24(1), 25–46 (1999) 10. Pei, J., Han, J., Mao, R.: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In: Proc. of the 5th ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, Texas, USA, pp. 11–20 (2000) 11. Singh, N.G., Singh, S.R., Mahanta, A.K.: CloseMiner: Discovering Frequent Closed Itemsets using Frequent Closed Tidsets. In: Proc. of the 5th ICDM, Washington DC, USA, pp. 633–636 (2005) 12. Uno, T., Asai, T., Uchida, Y., Arimura, H.: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases. In: Proc. of the 7th International Conference on Discovery Science. LNCS, pp. 16–31. Springer, Padova (2004) 13. Wang, J., Han, J., Pei, J.: CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 236–245 (2003) 14. Zaki, M.J.: Generating Non-Redundant Association Rules. In: Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA, pp. 34–43 (2000) 15. Zaki, M.J., Hsiao, C.J.: Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure. IEEE Transactions on Knowledge and Data Engineering 17(4), 462– 478 (2005) 16. Zaki, M.J.: Mining Non-Redundant Association Rules, Data Mining and Knowledge Discovery. In: Kluwer Academic Publishers. Manufactured in the Netherlands, pp. 223–248 (2004) 17. Yahia, S.B., Hamrouni, T., Nguifo, E.M.: Frequent Closed Itemset based Algorithms: A thorough Structural and Analytical Survey. ACM SIGKDD Explorations Newsletter 8(1), 93–104 (2006)
Structure of Set of Association Rules Based on Concept Lattice Tin C. Truong and Anh N. Tran Faculty of Mathematics and Computer Science, University of Dalat, 01, Phu Dong Thien Vuong, Dalat, Vietnam [email protected], [email protected]
Abstract. It is important to propose effective algorithms that find basic association rules and generate all consequence association rules from those basic rules. In this paper, we propose the new concept of an eliminable itemset to show how to represent an itemset by generators and eliminable itemsets. Using an algebraic approach based on equivalence relations, we propose a new way to partition the set of association rules into basic and consequence sets. After describing their strict relations, we propose two ways to derive all consequence association rules from the basic association rules. These two ways satisfy the properties of sufficiency and confidence preservation; moreover, they do not derive repeated consequence rules. Hence, much time is saved in association rule mining. Keywords: association rule, generator, eliminable itemset, closed itemset.
1 Introduction
Consider the problem of discovering association rules from databases, introduced by Agrawal et al. (1993). It is necessary to obtain a set of basic rules from which all other consequence rules can be derived. Recently, some authors (Zaki, 2004; Pasquier et al., 2005) used the lattice of frequent closed itemsets and generators (without the need to find all frequent itemsets) for extracting basic rules; all other rules can be derived from them without accessing the database again. Zaki [8] considered the set of association rules as two subsets: rules with confidence equal to 1 and those with confidence less than 1. He showed that in each subset there are generalized (or basic) rules and consequence rules, and he showed how to find the basic rules. However, his method generates many candidate rules, and he did not present algorithms for deriving the consequence rules. Pasquier et al. [7] split the association rule set into a set of exact rules and a set of approximate rules. They proposed algorithms for finding the basic and consequence sets. However, the one for finding consequence rules is not sufficient (it cannot find all association rules) and spends much time generating repeated rules. In this paper, based on two equivalence relations on the class of itemsets and on the association rule set, and on the new concept of "eliminable itemset", we present the structures of the basic set and the consequence set and their strict relations. We also
propose a rather smooth partition of the association rule set in which the consequence set is sufficient, non-repeated, confidence-preserved and easy to use. The rest of the paper is as follows. Section 2 recalls some primitive concepts of concept lattices, frequent itemsets and association rules. Sections 3 and 4 present the structures of the class of itemsets and of the association rule set. We experimentally validate our theoretical results in section 5 and conclude in section 6.
2 Concept Lattice, Frequent Itemset and Association Rule
Let O (≠∅) be a set of objects (or records), A (≠∅) a set of items related to the objects o ∈ O, and R a binary relation in O × A. Consider the two set functions λ: 2O→2A and ρ: 2A→2O determined as follows: ∀A ⊆ A, O ⊆ O: λ(O) = {a ∈ A | (o, a) ∈ R, ∀o ∈ O}, ρ(A) = {o ∈ O | (o, a) ∈ R, ∀a ∈ A}, where 2O, 2A are the classes of all subsets of O and A. We set λ(∅) = A, ρ(∅) = O. Defining two set functions h, h' on 2A and 2O respectively by h = λ∘ρ, h' = ρ∘λ, we say that h'(O), h(A) are the closures of O and A. A ⊆ A (O ⊆ O) is a closed set if h(A) = A (h'(O) = O). If A, O are closed, A = λ(O) and O = ρ(A), then the pair C = (O, A) is called a concept. In the class of concepts C, if we define the order relation ≤ as the relation ⊇ between the subsets of O, then L ≡ (C, ≤) is the concept lattice [3]. Let minsup s0 be the minimum support. Let A ⊆ A be an itemset. The support of A is defined as sup(A) ≡ |ρ(A)|/|O|. If sup(A) ≥ s0 then A is a frequent itemset [1, 4]. Let CS be the class of all closed itemsets. Let c0 be the minimum confidence (minconf). For any frequent itemset S (with threshold s0), we take a non-empty, strict subset L of S (∅ ≠ L ⊂ S) and let R = S\L. Denote by r: L→R the rule created by L, R (or by L, S). Then, c(r) ≡ |ρ(L)∩ρ(R)|/|ρ(L)| = |ρ(S)|/|ρ(L)| is called the confidence of r. The rule r is an association rule if c(r) ≥ c0 [1]. Let AR ≡ AR(s0, c0) be the set of all association rules with threshold c0. For two non-empty itemsets G, A with G ⊆ A ⊆ A, G is a generator of A if h(G) = h(A) and ∀G' ⊂ G ⇒ h(G') ⊂ h(G) [6]. Let Gen(A) be the class of all generators of A. Algorithms for finding the frequent concept lattice can be found in [2], [4], [5], [6], etc. In this paper, we concentrate on the structure of the association rule set based on the structure of the class of itemsets. Since the size of the paper is limited, we do not show the proofs of some propositions and theorems, or the code of some algorithms, and we display the examples only briefly.
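A small sketch (illustration only; the names rho, lam, h and the toy relation are ours, not part of the paper) of the closure operator h = λ∘ρ and of the support computation on a binary relation stored as a dictionary:

# Illustrative sketch of the two Galois mappings and the closure h = lambda o rho.
# 'relation' maps each object to its set of items (a hypothetical toy dataset).
relation = {"o1": {"a", "c"}, "o2": {"a", "b", "c"}, "o3": {"b"}}
ALL_OBJECTS = set(relation)
ALL_ITEMS = set().union(*relation.values())

def rho(A):
    """rho(A): objects containing every item of A; rho({}) = O."""
    return {o for o, items in relation.items() if set(A) <= items}

def lam(O):
    """lambda(O): items shared by every object of O; lambda({}) = A."""
    return set.intersection(*(relation[o] for o in O)) if O else set(ALL_ITEMS)

def h(A):
    """Closure of an itemset A: h = lambda o rho."""
    return lam(rho(A))

def sup(A):
    """Support sup(A) = |rho(A)| / |O|."""
    return len(rho(A)) / len(ALL_OBJECTS)

print(h({"a"}))    # {'a', 'c'}  -> {'a'} is not closed
print(sup({"a"}))  # 0.666...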
3 Structure of Itemsets
In this section, we partition all itemsets into disjoint equivalence classes. Each class contains itemsets whose supports are the same; their closures are also the same. Theorem 1 in 3.3 shows how to represent an itemset by generators
and eliminable itemsets in their closure. Based only on generators, eliminable itemsets, and frequent closed itemsets (some other methods use maximal frequent itemsets), we can still generate all frequent itemsets sufficiently and quickly.
3.1 Equivalence Relation in the Class of Itemsets
Definition 1. The closed mapping h: 2A→2A generates a binary relation ~h in the class 2A: ∀A, B ⊆ A: A ~h B ⇔ h(A) = h(B).
Proposition 1. ~h is an equivalence relation ([h(A)] = [A], where [A] denotes the equivalence class containing A) and generates a partition of 2A into disjoint classes (the supports of all itemsets in a class are the same). We have 2A = ∑A∈CS [A].¹
3.2 Eliminable Itemsets
Definition 2. In the class 2A, a non-empty set R is eliminable in S if R ⊂ S and ρ(S) = ρ(S\R), i.e., when deleting the set R from S, ρ(S) does not change. Denote by N(S) the class of all eliminable itemsets in S.
Proposition 2 (Criteria for recognizing an eliminable itemset).
a. R ∈ N(S) ⇔ ρ(S\R) ⊆ ρ(R) ⇔ c(r: S\R → R) = 1 ⇔ h(S) = h(S\R).
b. N(S) = {A: ∅ ≠ A ⊆ S\GenS, GenS ∈ Gen(S)}.
Proof: a. By the properties of ρ, h, and c (see [3, 4, 6, 7]), this is easily proved.
b. - For all ∅ ≠ A ⊆ S\GenS with GenS ∈ Gen(S), we have h(GenS) ⊆ h(S\A) ⊆ h(S). Since GenS is a generator of S, h(GenS) = h(S). Thus h(S\A) = h(S), and by a), A is in N(S).
- If A is in N(S) then there exists GenS ∈ Gen(S) such that A ⊆ (S\GenS). Indeed, assume to the contrary that there exists a0 ∈ A such that a0 belongs to every generator GenS of S. Let Gen0 ∈ Gen(S\A) be a generator of S\A; then a0 is not in Gen0. Since h(Gen0) = h(S\A) = h(S), Gen0 is also a generator of S, so a0 ∈ Gen0, which is a contradiction.
3.3 Representation of Itemsets by Generators and Eliminable Itemsets
Theorem 1 (Representation theorem of itemsets). ∀∅ ≠ A ∈ CS, ∀X ∈ [A], ∃Gen_A ∈ Gen(A), X' ∈ N(A): X = Gen_A + X'.
Proof: If X ∈ Gen(A) then X' = ∅. If X ∉ Gen(A), let Gen0 ∈ Gen(X) and X' = X\Gen0 ⊆ A\Gen0; then h(Gen0) = h(X) = A, so Gen0 ∈ Gen(A), X' ∈ N(A) and X = Gen0 + X'.
¹ A + B is the union of two disjoint sets A, B; ∑i∈I Ai = ∪i∈I Ai, where Ai ∩ Aj = ∅, ∀i, j ∈ I, i ≠ j.
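Criterion a of proposition 2 gives a direct way to test eliminability with the closure operator. The following sketch is our own illustration (it reuses a closure function h such as the one sketched in section 2 and is not the authors' code):

def is_eliminable(R, S, h):
    """Proposition 2a: R is eliminable in S iff R is a non-empty proper subset
    of S and h(S) = h(S \\ R), i.e. removing R does not change the closure."""
    R, S = set(R), set(S)
    if not R or not R < S:
        return False
    return h(S) == h(S - R)

# Example with the closure h of the toy relation sketched in section 2:
# is_eliminable({"c"}, {"a", "c"}, h)  -> True, since h({a}) = h({a, c}) = {a, c}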
Example 1. Consider the database T in Table 1. Figure 1 shows the lattice of closed itemsets and generators corresponding to T; this lattice supports the examples in the rest of this paper. The attribute subsets of {1, 2, 3, 4, 5, 6, 7, 8} are partitioned into the disjoint equivalence classes [A], [B], ..., [I]. Consider the class [A]: N(A) is {37, 3, 7, 5, 35}. Since Gen(A) is {15, 17}, the itemsets in [A] are 15, 17, 135, 137, 157 and 1357. Their supports are the same.
Table 1. Database T
Record ID (object ID)   Items: 1  2  3  4  5  6  7  8
1                              1  0  1  0  1  0  1  0
2                              1  0  1  0  0  1  0  1
3                              1  0  0  1  0  1  0  1
4                              0  1  1  0  1  0  1  0

[Figure 1 shows the lattice of the closed itemsets with their generators and supports: A = 1357 (generators 15, 17; sup 0.25), B = 2357 (2; 0.25), C = 1368 (36, 38; 0.25), D = 1468 (4; 0.25), E = 357 (5, 7; 0.5), F = 13 (13; 0.5), G = 168 (6, 8; 0.5), H = 3 (3; 0.75), I = 1 (1; 0.75).]
Fig. 1. The lattice of closed itemsets and their generators. The closed itemsets are underlined, their supports are the outside numbers, and their generators are italicized.
4 Structure and Partition of the Association Rule Set
Based on the equivalence relation in 4.1, we will partition the set of association rules AR into disjoint equivalence classes. Without loss of generality, we only consider an equivalence class AR(L, S) corresponding to (L, S)³. The rules in AR(L, S) have the form ri: Li→Si\Li (where ∅ ≠ Li ⊂ Si, h(Li) = L and h(Si) = S); they all have the same confidence. AR(L, S) is then further partitioned into two disjoint sets: the basic set RAR(L, S) and the consequence set CAR(L, S). The basic rules of RAR(L, S) are ri: Li→S\Li, where Li ∈ Gen(L). In [6, 7], for deriving consequence rules from a basic rule ri, the authors delete subsets of the right-hand side of ri or move such subsets to the left-hand side of ri. This method can generate a large number of repeated consequence rules; the repeated consequence rules can be in the same equivalence class AR(L, S) or in different classes. To remove this repetition, based on the eliminable itemset concept, in 4.3 we only delete (or move to the left-hand side) those subsets of the right-hand side that are eliminable subsets of S (or of L, respectively). Our method does not generate repeated consequence rules, and it still derives all consequence rules from the basic rules in their equivalence class.
² For brevity, we replace {a1, a2, ..., ak} with "a1a2...ak", where ai ∈ A. For example, 37 is {3, 7}.
³ From here to the end of this paper, we denote by (L, S) a pair of L and S such that L, S ∈ CS and L ⊆ S.
4.1 Equivalence Relation ~r in the Rule Set AR
Definition 3. Let ~r be a binary relation in the rule set AR determined as follows: ∀L, S, Ls, Ss ⊆ A, ∅ ≠ L ⊂ S, ∅ ≠ Ls ⊂ Ss, r: L→S\L, s: Ls→Ss\Ls: s ~r r ⇔ (h(Ls) = h(L) and h(Ss) = h(S)) ⇔ (Ls ∈ [L] and Ss ∈ [S]).
Proposition 3 (Disjoint partition of the association rule set). ~r is an equivalence relation. For each (L, S), we take any rule r0: GenL→S\GenL (GenL ∈ Gen(L)) to represent the equivalence class AR(L, S), denoted [r0]~r. Hence, the relation ~r partitions the rule set AR into disjoint equivalence classes (the supports of the rules in each class are the same): AR = ∑L,S∈CS: L⊆S AR(L, S).
4.2 Basic and Consequence Rule Sets in Each Rule Class
Let RAR(L, S) be the basic rule set and BARS the algorithm that generates it. All rules in RAR(L, S) have the form GenL→S\GenL, where the minimal left-hand side GenL is a generator of L and the maximal right-hand side is S\GenL. To derive the set NRAR(L, S) containing all consequence rules of RAR(L, S), previous works used ways similar to the following. For any r: L→Right ∈ RAR(L, S):
• W1. Delete subsets of Right to create the consequence rules rd: L→R (∅ ≠ R ⊂ Right);
• W2. Move subsets R' of Right, or of R (in the result of way W1), to the left-hand side to create the consequence rules rm: L+R'→Right\R' or rm: L+R'→R\R', respectively.
Let SNRAR1 be the algorithm for finding the consequence rules in NRAR(L, S) by the two ways above. This algorithm generates the consequence rules sufficiently; however, it does not determine their confidences immediately and it generates many repeated rules. Pasquier et al. [7] presented an algorithm for finding consequence rules and their confidences (figure 9, page 50 and figure 12, page 53). Unfortunately, that algorithm does not generate the consequence rule set sufficiently. For example, consider the database T in example 1: their algorithm does not discover the consequence rules 25→7, 27→5, 23→7, etc. To overcome this disadvantage, we correct it to obtain the SNRAR2 algorithm. However, although it uses many conditional checks, SNRAR2 still derives many repeated consequence
rules. The repeat can take place in the same equivalence rule class or in the different classes. For example, the consequence rule 57→3 (is derived from the basic rule 5→37) coincides with one consequence rule of the basic rule 7→35. All of them are in the rule class [(357, 357)]~r. In 4.3, we will overcome all disadvantages of SNRAR1 and SNRAR2. 4.3 A Preserved-Confidence Non-repeated Partition of Association Rule Set AR In this section, we will present a rather-smooth partition of each equivalence rule class based on two set functions for generating consequence rules Rd, Rm. However, Rm still generates some repeats. To eliminate them, we propose proposition 6. Then, theorem 3 presents a confidence-preserved, non-repeated partition of the rule set AR. Proposition 4 (Relation about confidence of basic rules and their consequences). Suppose that ∅≠ L, R1, R2: R1 + R2 = R, S=L+R, consider rules r: L → R, rd: L → R1, rm: L+R2→R1, sup(L+R2) > 0. We have: a. c(r) = 1 ⇔ ρ(L) ⊆ ρ(R) ⇔ ρ(L) = ρ(S) ⇔ h(L) = h(S) ⇔ R ∈ N(S). b. c(rd) = c(r) ⇔ R2 ∈ N(S). c. c(rm) = c(r) ⇔ R2 ∈ N(h(L)) ⇔ R2 ⊆ (h(L)\L)∩R, R2≠ R. Definition 4. Consider two set functions from AR to 2AR for generating rules (let Wd, Wm be two ways in corresponding with them): ∀r:L→R ∈ AR: Rd(r) = {s:L→R\R’ | ∅⊂R’⊂R, R’∈N(L+R)}, Rm(r) = {s:L+R’→R\R’ | ∅⊂R’⊂R, R’∈N(h(L))}. Proposition 5. ∀r:L→R ∈ AR, two functions Rd, Rm satisfy:
Rd(r) ⊆ [r] ~r, Rm(r) ⊆ [r] ~r and RmoRd(r) ⊆ [r] ~r, RdoRm(r) ⊆ [r] ~r. Two above functions generate sufficiently and only generate consequence rules in the same equivalence rule class. Thus, their confidences are preserved. They are different totally from the consequence rules in the different equivalence rule classes. For each (L, S), let us call: RAR(L, S) ≡ {r0: GenL→S\GenL | GenL∈Gen(L)}, CAR(L, S) ≡ Rd(RAR(L, S)) + Rm(RAR(L, S)) + Rm(Rd(RAR(L, S))), and AR(L, S) ≡ {r: L’→R’ | h(L’)=L, h(L’+R’)=S}. Theorem 2 (Partition and structure of each equivalence rule class). For each (L, S): a. AR(L, S) = RAR(L, S) + CAR(L, S). b. ∀r∈CAR(L, S), ∃r0∈RAR(L, S): either r∈Rd(r0) or r∈Rm(r0) or r∈RmoRd(r0). Proof: a. - “⊇”: Obviously by definition and proposition 5. It is easy to see that RAR(L, S) is disjointed with CAR(L, S).
- “⊆”: ∀r: L’→R’∈ AR (L, S), L’ ∈ [L], (L’+R’) ∈ [S]. Consider three following cases. (1) If L’∈ Gen (L), R’=S\L’ then r ∈ RAR(L, S). (2) If L’∈Gen(L), R’⊂(S\L’) then there exists ∅≠Rd∈N(S): S\L’ = R’+Rd and r ∈ Rd(L’→S\L’) ⊆ Rd(RAR(L, S)). (3) If L’∉ Gen (L) then there exists L0∈ Gen (L), R0 ∈ N(L): L’=L0+R0. Let r0:L0→R1≡S\L0 ∈ RAR (L, S), R1 ≡ (S\L’)+(L’\L0) = (S\L’)+R0. Since R’⊆ S\L’ so S\L’ = R’+R’’, where R’’ = S\L’\R’ = S\(L’+R’) and R1=R’+R0+R’’. (3a) If R’’ = ∅ then r:L0+R0→R’ ∈ Rm(r0)⊆Rm(RAR(L,S)). (3b) If R’’ ≠ ∅ then: R’’∈N(S), rd:L0→ R’+R0 ∈ Rd(r0) and r:L0+R0→R’ ∈ Rm(rd) ⊆ Rm(Rd(RAR(L,S))). b. Proposition b is proved while we prove proposition a. Example 2. Consider (L, S)=(357, 1357) in the closed lattice in figure 1. The rule class AR(L, S) contains rules with confidence ½. The rule r1:5→137 is a basic rule. In [6, 7], for example, if deleting the subset {1} of the right-hand side of r1 (or moving it to the left-hand side) then we have the consequence rule r’:5→37 with c(r’)=1 ≠ ½ (or r’’:15→37, c(r’’)=1 ≠ ½ respectively). The rule r’ (or r’’) coincides with one rule in AR(L, L) (or AR(S, S)). The reason of this repeat is “{1} is non-eliminable in S”. For (L, S), let Rd’ (RAR(L, S)) ≡ RAR(L, S) + Rd(RAR(L, S)). We see that the rules in RAR(L, S) and Rd(RAR(L, S)) are different. However, the rules derived from set Rd’ (RAR(L, S)) by function Rm (or the way Wm) can be repeated. The following proposition 6 will overcome this last disadvantage. Let Sm(L, S) ≡ {ri:Li+R’→ R\R’ | h(Li+R)=S, Li∈Gen(L), ∅≠R’⊆R∩L, R’≠R, (i=1 or (i>1 and ∀k: 1≤ k
CAR(L, S) ≡ Rd(RAR(L, S)) + Sm(L, S). Proof: a. - “⊆”: ∀ri ∈ Sm(L,S), let r0:Li→S\Li ∈ RAR(L, S). Since Li+R ⊆ S so R’’= S\(Li+R) ∈ N(S)∪{∅} and rd:Li→R ∈ Rd’ (r0:Li→R+R’’), where R∩Li=∅. Thus, Li+R’⊆L, L=h(Li) ⊆ h(L)=L. Hence, h(Li)=h(Li+R’): ∅≠R’∈N(Li+R’). Therefore ri:Li+R’→ R\R’ ∈ Rm(rd) ⊆ Rm(Rd’ (r0)) ⊆ Rm(Rd’ (RAR(L, S))). - “⊇”: ∀r: L’→R’’’ ∈ Rm(Rd’ (RAR(L, S))): R’’’ ≠∅≠ L’⊆ L ⊆ S=h(L’+R’’’) ⊇ L’+R’’’, so R’’’⊆ S\L’. Let i be the minimum index such that: L’=Li+R’, Li∈Gen(L) and ∅≠R’⊆L\Li. Let R = R’’’+R’, we have h(Li+R) = S, R’=R\R’’’⊆ R∩L and Lk ⊄Li+R’, ∀k
b. Suppose that ∃i > j, ri: Li+Ri'→Ri\Ri', rj: Lj+Rj'→Rj\Rj' and ri = rj. Then Li+Ri' = Lj+Rj' ⊃ Lj, a contradiction. Therefore, the rules in Sm(L, S) are pairwise different.
Let us call: RAR ≡ ∑L,S∈CS: L⊆S RAR(L, S), CAR ≡ ∑L,S∈CS: L⊆S CAR(L, S).
Theorem 3 (A preserved-confidence, non-repeated partition of the association rule set). The partition AR = RAR + CAR satisfies the following properties: sufficiency (all association rules are found), non-repetition (consequence association rules derived from different basic rules are different) and confidence preservation (basic rules and their consequence rules have the same confidence).
Based on theorem 3, the following algorithm CARS (Consequence Association Rule Set) generates the set CAR from the set RAR:
Input: RAR. Output: CAR.
1)  Rd_AR = ∅; Rm_AR = ∅;
2)  forall (r0: Li → Right ∈ RAR) do {
3)    Rm_AR = Rm_AR + MA(Li, Right, c(r0));
4)    if (c(r0) = 1) then forall (∅ ≠ R ⊂ Right) do {
5)      Rd_AR = Rd_AR + {rd: Li→R, c(rd) = 1};
6)      Rm_AR = Rm_AR + MA(Li, R, 1);
7)    }
8)    else forall (∅ ≠ R ⊂ Right and R ∈ N(S)) do {  // (*)
9)      Rd_AR = Rd_AR + {rd: Li→Right\R, c(rd) = c(r0)};
10)     Rm_AR = Rm_AR + MA(Li, Right\R, c(r0));
11)   }
12) }
13) CAR = Rd_AR + Rm_AR;
14) return CAR
In which MA (Move Appropriately) is the algorithm that generates the different consequence rules by the moving way Wm'.
Example 3. Consider the rule class AR(L=357, S=1357) corresponding to (L, S) in the closed lattice in figure 1. The two basic rules corresponding to the two generators L1=5 and L2=7 of L are r1: 5→137 and r2: 7→135. By the ways Wd and Wm', the algorithm CARS generates only non-repeated consequence rules as follows:
• Consider r1: 5→137. The subsets R (R ⊂ 137) that are eliminable itemsets in S are 3, 7, and 37. Deleting each R from 137 creates the consequence rules 5→17, 5→13, and 5→1. The results of moving are: on the basic rule r1: 35→17, 57→13, 357→1; on the consequence rule 5→17: 57→1; and on 5→13: 35→1.
• Consider r2: 7→135 (with L2). The sets R (R ⊂ 135) that are eliminable itemsets in S are 3, 5, and 35. Deleting each R from 135, we have 7→15, 7→13, 7→1. The results of moving: on r2 is 37→15 (57→13, 357→1 are removed because 57 ⊇ L1, 357 ⊇ L1); on 7→15 is empty because 57 ⊇ L1; and on 7→13 is 37→1.
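The two derivation ways can be prototyped compactly. The sketch below is our own illustration (function and variable names such as derive_consequences are hypothetical): given one basic rule L→Right of the class (L, S), it deletes eliminable subsets of S from the right-hand side (way Wd) and moves subsets contained in h(L)\L to the left-hand side, following proposition 4c (the confidence-preserving core of way Wm); the extra minimal-generator index condition of Wm' is omitted for brevity.

from itertools import combinations

def proper_subsets(s):
    """All non-empty proper subsets of a set."""
    s = sorted(s)
    return [set(c) for r in range(1, len(s)) for c in combinations(s, r)]

def derive_consequences(L, Right, S, closure):
    """Sketch of ways Wd and Wm for one basic rule L -> Right with S = L ∪ Right.
    'closure' is a closure operator h (e.g. the one sketched in section 2)."""
    L, Right, S = set(L), set(Right), set(S)
    rules = []
    for R in proper_subsets(Right):
        if closure(S) == closure(S - R):        # R eliminable in S: way Wd
            rules.append((L, Right - R))
        if R <= closure(L) - L:                 # R in N(h(L)) (proposition 4c): way Wm
            rules.append((L | R, Right - R))
    return rules

# e.g. derive_consequences({"5"}, {"1", "3", "7"}, {"1", "3", "5", "7"}, h)
# with a closure h built for database T would reproduce the rules of Example 3.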
5 Experimental Results
We use four databases in [9] during these experiments. Table 2 shows their characteristics.
Table 2. Database characteristics

Database (DB)   # Records (Objects)   # Items   Average record size
Pumsb (P)       49046                 7117      74
Mushroom (M)    8124                  119       23
Connect (Co)    67557                 129       43
Chess (C)       3196                  75        37
Table 3 contains the results of finding all association rules on these databases with different minconfs (MC), by BARS+SNRAR1 (SBN1), BARS+SNRAR2 (SBN2) and BARS+CARS (SBC). In the table, FCS is the number of frequent closed itemsets (G is the number of generators), FS is the number of frequent itemsets (GFC = G/FCS), AR is the number of all association rules, and ER is the ratio of the number of basic rules to AR. For SBN1, CC is the ratio of the number of repeated consequence rules to the size of the basic rule set. For SBC, NS and RC are the ratios of the numbers of eliminable and non-moved subsets to the number of all subsets (of the right-hand sides of the basic rules). Finally, T1, T2 and T3 are the running times (in seconds) of SBN1, SBN2 and SBC, respectively. In most of the results, the number of repeated consequence rules (CC) is large (it ranges from 641% to 5069%), and the number of eliminable itemsets (NS) is small (from 5% to 51%). Hence, the time spent applying the two ways Wd, Wm' in CARS is small. Moreover, the number of generators is small (GFC ranges from 1.0 to 1.4). Therefore, checking the eliminable property (*) of a subset R of S in CARS significantly reduces not only the cost of considering the repeated rules generated by way W1 in SNRAR1, but also the cost of traversing the subsets of the set of association rules in SNRAR2. Furthermore, the time for checking the condition in MA (see the definition of Sm) is smaller than the time for redundantly generating the consequence rules (by way W2 in SNRAR1) and deleting them, or for repeatedly traversing subsets and checking repeats in SNRAR2. Experimental results show that the reduction in running time obtained by our approach ranges from a factor of 2 to 368. Recall that SNRAR1 does not determine the confidences of the consequence rules immediately.
Table 3. Experimental results of SBN1, SBN2 and SBC upon P, M, Co and Ch
DB (MS%) P (90%)
M (30%) Co (95%) Ch (87%)
MC (%) 95 50; 5 95 50 5 95 50; 5 95 50; 5
FCS (G) 1465 (2030)
FS (GFC) 2607 (1.4)
427 (558)
2735 (1.3)
811 (811) 1183 (1183)
2201 (1.0) 1553 (1.0)
AR
ER (%) 52 51 7 7 8
CC NS RC (%) (%) (%) 1040 20 50 1966 15 50 641 51 10 4290 33 11 5069 29 9
T3 (s) 22 7 4 5 5
T1/ T3 2 7 1 5 6
T2/ T3 85 368 5 85 90
46143 71474 14366 79437 94894 78376
33
3429
14
19
4
49
19963 41878
71 74
758 1731
8 5
0 129 0 10
1 4
2 16
0
6 Conclusion
The theoretical results in this paper clearly show the structures of the class of itemsets and of the association rule set, based on the proposed equivalence relations on them and on the "eliminable itemset" concept. We propose a partition of the class of itemsets into disjoint classes and show how to represent an itemset by generators and eliminable itemsets. We then also propose a rather smooth disjoint partition of the association rule set into the basic and consequence sets and describe their strict relation. As a result, we build the CARS algorithm, which derives all consequence rules from the corresponding basic rules sufficiently and quickly. This algorithm satisfies the properties of confidence preservation and non-repetition; hence, it significantly reduces the time for discovering all association rules. Moreover, the two ways Wd and Wm' used in CARS are convenient and close to the user.
References 1. Aggarwal, C.C., Yu, P.S.: Online Generation of Association Rules. In: Proceedings of the International Conference on Data Engineering, pp. 402–411 (1998) 2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th Very Large Data Bases Conference Santiago, Chile, pp. 478–499 (1994) 3. Bao, H.T.: An approach to concept formation based on formal concept analysis. IEICE trans, Information and systems E78-D(5) (1995) 4. Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., Lakhal, L.: Mining minimal NonRedundant Association Rules Using Frequent Closed Itemsets. In: 1st International Conference on Computational Logic (2000) 5. Godin, R., Missaoul, R., Alaour, H.: Incremental concept formation algorithms based on Galois lattices. Magazine of computational Intelligence, 246–247 (1995) 6. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of association rules using closed item set lattices. Information systems 24(1), 25–46 (1999)
7. Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L.: Generating a condensed representation for association rules. J. of Intelligent Information Systems 24(1), 29–60 (2005) 8. Zaki, M.J.: Mining Non-Redundant Association Rules. Data Mining and Knowledge Discovery 9, 223–248 (2004) 9. Frequent Itemset Mining Dataset Repository (2009), http://fimi.cs.helsinki.fi/data/
Some Novel Heuristics for Finding the Most Unusual Time Series Subsequences

Mai Thai Son¹ and Duong Tuan Anh²

¹ Department of Information Technology, Ho Chi Minh City University of Transport
[email protected]
² Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
[email protected]

Abstract. In this work, we introduce some novel heuristics which can enhance the efficiency of the Heuristic Discord Discovery (HDD) algorithm proposed by Keogh et al. for finding the most unusual time series subsequences, called time series discords. Our new heuristics consist of a new discord measure function, which helps to set up a range of alternative good orderings for the outer loop of the HDD algorithm, and a branch-and-bound search mechanism that is carried out in the inner loop of the algorithm. Through extensive experiments on a variety of diverse datasets, our scheme is shown to have better performance than the previous schemes, namely HOT SAX and WAT.
1 Introduction
Finding unusual patterns or discords in large time series has recently attracted a lot of attention in the research community and has vast applications in diverse domains such as medicine, finance, biology, engineering and industry. Many anomaly detection techniques for time series data have been proposed for specific areas ([2], [3], [6], [7], [8]). However, all these works did not give a clear and workable definition of the "most unusual subsequence" in a time series. The concept of time series discord, first introduced by Keogh et al., 2005 ([4]), captures the sense of the most unusual subsequence within a time series. Time series discords are subsequences of a longer time series that are maximally different from all the rest of the time series subsequences. The Brute-force Discord Discovery (BFDD) algorithm suggested by Keogh et al. ([4]) is an exhaustive search algorithm that requires O(n²) time to find discords. To reduce the time complexity of the BFDD algorithm, Keogh et al. ([4]) also proposed a generic framework, called the Heuristic Discord Discovery (HDD) algorithm, with two heuristics suggested to impose the subsequence orderings in the outer loop and the inner loop, respectively, of the BFDD algorithm. To improve the efficiency of the HDD algorithm, the input time series should first be discretized by the Symbolic Aggregate Approximation (SAX) technique into a symbolic string before applying the HDD algorithm. This algorithm is named HOT SAX by Keogh et al. ([4]). HOT SAX can run 3 to 4 orders of magnitude faster than BFDD. Bu et al., 2007 ([1]), proposed another method which is based on Haar
wavelet transform and an augmented trie to mine the top-K discords from time series data. This algorithm, called WAT, was claimed to be more effective than HOT SAX and requires fewer input parameters due to exploiting the multi-resolution features of the wavelet transformation. Yankov et al. [9] proposed a disk-aware discord discovery algorithm that can work effectively on very large time series datasets with just two linear scans of the disk. In this work, we introduce a more concrete framework for finding time series discords. In the framework, we employ some novel heuristics which can enhance the efficiency and effectiveness of the Heuristic Discord Discovery (HDD) algorithm proposed by Keogh et al. Our new heuristics consist of two components: (1) a new discord estimate function which helps to set up a range of alternative orderings for the outer loop in the HDD algorithm, and (2) a branch-and-bound search mechanism that is carried out in the inner loop of the HDD algorithm. Through extensive experiments on a variety of diverse datasets, our scheme is shown to have better performance than the previous schemes, namely HOT SAX and WAT. The rest of the paper is organized as follows. Some background related to finding time series discords is provided in Section 2. In Section 3, we introduce the proposed algorithm with our novel heuristics for outer loop ordering and inner loop ordering. The extensive experiments on the new algorithm are reported in Section 4. Finally, Section 5 presents some conclusions and remarks for future work.
2 Background 2.1 Time Series Discords Intuitively a time series discord is a subsequence that is very different from its closest matching subsequence. However, in general, the best matches of a given subsequence (apart from itself) tend to be very close to the subsequence under consideration. For example, given a certain subsequence at position p, its closest match will be the subsequence at the position q where q is far from p just a few points. Such matches are called trivial matches and are not interesting. When finding discords, we should exclude trivial matches; otherwise they impair our effort to obtain true discords since the true discord may also be similar to its closest trivial match [4]. Definition 2.1. Non-self match: Given a time series T containing a subsequence C of length n beginning at position p and a matching subsequence M beginning at the position q, we say that M is a non-self match to C if |p – q| ≥ n. Definition 2.2. Time series discord: Given a time series T, the subsequence C of length n beginning at position p is said to be a top-one discord of T if C has the largest distance to its nearest non-self match. 2.2 Symbolic Aggregate Approximation (SAX) A time series C = c1...cn of length n can be represented in a reduced w-dimensional space as another time series D = d1...dw by segmenting C into w equally-sized
segments and then replacing each segment by its mean value di. This dimensionality reduction technique is called Piecewise Aggregate Approximation (PAA). After this step, the time series D is transformed into a symbolic sequence A = a1...aw in which each real value di is mapped to a symbol ai through a table lookup. The lookup table contains the breakpoints that divide a Gaussian distribution into an arbitrary number (from 3 to 10) of equi-probable regions. This discretization is called SAX, and it is based on the assumption that the reduced time series have a Gaussian distribution. Given two time series Q and C of the same length n, if we transform the original time series into PAA representations Q' and C', we can define a lower-bounding approximation of the Euclidean distance between the original time series by:

DR(Q', C') = √(n/w) · √( ∑i=1..w (q'i − c'i)² )
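As an illustration of the PAA step and the table-lookup discretization (this is our own sketch, not the code of [4]; the breakpoints shown are the usual equi-probable Gaussian cut points for an alphabet of size 4, and all names are ours):

import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: mean of w equally sized segments."""
    series = np.asarray(series, dtype=float)
    return series.reshape(w, -1).mean(axis=1)   # assumes len(series) % w == 0

def sax(series, w, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    """Map each PAA value to a symbol via the breakpoint lookup (alphabet size 4)."""
    z = (np.asarray(series) - np.mean(series)) / np.std(series)   # z-normalize first
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa(z, w))

# e.g. sax(window_of_length_128, w=8) -> an 8-symbol SAX word (hypothetical usage)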
When we transform the data further into SAX representations, i.e., two symbolic strings Q'' and C'', we can define a MINDIST function that returns the minimum distance between the original time series of two words:

MINDIST(Q'', C'') = √(n/w) · √( ∑i=1..w dist(q''i, c''i)² )
The dist() function can be implemented using a table lookup as shown in Table 1. This table is for an alphabet a = 4. The distance between two symbols can be read off by examining the corresponding row and column. For example, dist(a, b) = 0 and dist(a, c) = 0.67. Table 1. A look-up table used by the MINDIST function
a b c d
a 0 0 0.67 1.34
b 0 0 0 0.67
c 0.67 0 0 0
d 1.34 0.67 0 0
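A possible implementation of the lookup-based dist() and of MINDIST is sketched below (our illustration, with the alphabet-4 table of Table 1 hard-coded; the names are hypothetical):

import math

# Table 1 as a nested dict: dist between two SAX symbols for alphabet size 4.
LOOKUP = {
    "a": {"a": 0.0,  "b": 0.0,  "c": 0.67, "d": 1.34},
    "b": {"a": 0.0,  "b": 0.0,  "c": 0.0,  "d": 0.67},
    "c": {"a": 0.67, "b": 0.0,  "c": 0.0,  "d": 0.0},
    "d": {"a": 1.34, "b": 0.67, "c": 0.0,  "d": 0.0},
}

def mindist(word_q, word_c, n):
    """Lower bound on the Euclidean distance of the two original subsequences
    of length n that produced the SAX words word_q and word_c (length w)."""
    w = len(word_q)
    s = sum(LOOKUP[a][b] ** 2 for a, b in zip(word_q, word_c))
    return math.sqrt(n / w) * math.sqrt(s)

print(mindist("cbc", "bab", n=128))   # 0.0 -> the two words are "similar"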
2.3 The Generic Framework for Finding Discords
Keogh et al. [4] proposed a generic framework, called the HDD algorithm, for finding discords in time series. The HDD algorithm exploits the observation that we can abandon the inner loop early whenever the subsequence in question cannot be the time series discord. HDD augments the original brute-force algorithm (BFDD) with two heuristics that impose the subsequence orderings in the outer loop and the inner loop, respectively. The two heuristics in HDD aim to reduce the run time of BFDD remarkably. The pseudocode of the generic framework, with a slight extension, is given in Table 2.
Table 2. The Generic Framework for Discord Discovery

1   Function Heuristic_Search(T, n, Outer, Inner)
2     discord_distance = 0
3     discord_location = NaN
4     for each subsequence p in T ordered by heuristic Outer do
5       if p is marked then continue endif
6       nearest_non_self_distance = infinity   // nearest neighbor distance
7       nearest_non_self_location = NaN        // nearest neighbor location
8       for each q in T ordered by heuristic Inner do
9         if |p – q| ≥ n then                  // non-self match
10          dist = EDist(Tp, Tq, n, nearest_non_self_distance)
11          if dist < nearest_non_self_distance then
12            nearest_non_self_distance = dist
13            nearest_non_self_location = q
14          endif
15          if dist < discord_distance then
16            break                            // break for loop
17          endif
18      endfor                                 // end inner for
19      if nearest_non_self_distance > discord_distance then
20        discord_distance = nearest_non_self_distance
21        discord_location = p
22      endif
23      mark nearest_non_self_location as is-not-discord
24    endfor                                   // end outer for
25    return [discord_distance, discord_location]
In the framework, EDist(Tp, Tq, n, nearest_non_self_distance) is the function that computes the Euclidean distance with an early termination mechanism: the computation of the Euclidean distance between two subsequences Tp and Tq is terminated early when it becomes greater than the distance between Tp and its nearest neighbor found so far. Lines 5 and 23 describe a further optimization. When subsequence p is considered, if there exists a subsequence q such that the distance between p and q is less than or equal to discord_distance, then p cannot be the discord and this allows early abandonment. In addition, since EDist(p, q, n) = EDist(q, p, n), q is also not the discord. We mark q as a non-discord (line 23) so that it will be bypassed in the outer loop (line 5) later.
The Data Structures for Finding Discords. To find discords of length n in a time series T, we begin by creating a SAX representation of the entire time series by sliding a window of length n across the time series T, extracting subsequences, converting them to SAX words and placing them in an array where the index refers back to the original subsequence. Once we have this ordered list of SAX words, we can place them into an augmented trie where each leaf node contains a linked list of all the subsequences that map to the same SAX word. Both data structures (the augmented trie and the array) can be constructed in time and space linear in the length of T [4]. The parameters we have to supply to the HDD algorithm consist of the length of discords n, the cardinality of the SAX alphabet a, and the SAX word size w. The two data structures are illustrated in Fig. 1.
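The early-abandoning distance can be written as follows (our sketch only; the variable names are hypothetical):

import math

def edist(tp, tq, n, best_so_far):
    """Euclidean distance between two length-n subsequences with early abandonment:
    stop as soon as the partial sum of squares already exceeds best_so_far squared."""
    limit = best_so_far * best_so_far
    acc = 0.0
    for i in range(n):
        d = tp[i] - tq[i]
        acc += d * d
        if acc > limit:          # cannot beat the nearest neighbor found so far
            return math.inf      # the value is only used to fail the "<" tests
    return math.sqrt(acc)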
Fig. 1. The two data structures used to support finding time series discords
3 The Proposed Algorithm
Now, in this work, we enhance the generic framework in Table 2 by introducing some novel heuristics to further reduce the runtime of the discord discovery algorithm. We name the proposed algorithm EHOT.
3.1 Outer Loop Heuristics
As for the outer loop, if the unusual subsequences are considered early, we have a good chance of giving a large value to the discord_distance variable early on, thus allowing more early terminations of the inner loop (see lines 15-17). To achieve this goal, HOT SAX ([4]) selects the SAX words with small numbers of occurrences to be considered first in the outer loop, and after the outer loop has run out of this set of candidates, the rest of the candidates are visited in random order. The main idea behind this outer heuristic is that unusual subsequences are very likely to map to unique or rare SAX words. However, through our experiments, this observation is valid only in some specific datasets in which the discords are likely to be the unique words after SAX discretization, while in general the discords can be found in any linked list of SAX words under the leaf nodes of the augmented trie. To enhance the capability of estimating how likely a SAX word (i.e., a subsequence) could be a discord, we propose another way of evaluation which is based on what we call the symbol frequency table. This table keeps the number of occurrences of each symbol in all the SAX words obtained after the SAX transformation.
Symbol Frequency Table. The symbol frequency table is a two-dimensional array, named pos, consisting of a rows and w columns, where a is the alphabet size
and w is the word size in the SAX transformation. The value at pos[i, j] indicates the number of occurrences of the ith symbol at the jth position of a SAX word. The symbol frequency table is given in Fig. 2. Based on the array pos, the occurrence estimate of a SAX word s over all words is computed by the following formula:

P(s) = ∏i=1..w ( pos[s[i], i] / (m − n + 1) )
where m is the length of the time series T, n is the length of the discords and s[i] is the symbol at the ith position of the word s.
Fig. 2. Symbol Frequency table
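The construction of pos and the computation of P(s) can be sketched as follows (our illustration; the variable names are hypothetical and the SAX words are assumed to be already extracted):

def build_pos(words, alphabet="abcd"):
    """pos[symbol][j] = number of SAX words having 'symbol' at position j."""
    w = len(words[0])
    pos = {c: [0] * w for c in alphabet}
    for word in words:
        for j, c in enumerate(word):
            pos[c][j] += 1
    return pos

def p_measure(s, pos, num_words):
    """P(s) = product over positions of pos[s[i]][i] / (m - n + 1),
    where num_words = m - n + 1 is the number of sliding windows."""
    prod = 1.0
    for i, c in enumerate(s):
        prod *= pos[c][i] / num_words
    return prod

# words = ["cbc", "bab", ...]   # SAX words of all sliding windows (hypothetical)
# pos = build_pos(words); p_measure("cbc", pos, len(words))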
The computation of the symbol frequency table and of P(s) for all words s takes at most O(m) time. Intuitively, a subsequence s with the lowest P(s) contains several symbols with the lowest numbers of occurrences, and thus has a high chance of being the discord.
Functions that Compare Discord Measures between Words. To estimate the probability of a word s being a discord, we adopt two evaluation criteria: 1) the word with the lowest occurrence count, C(s), has a high chance of being the discord, and 2) the word with the lowest P(s) has a high chance of being the discord. With the two criteria, we suggest some functions that compare the discord measures of two words s and t:
Sub(s,t): s has a higher discord measure than t if C(s) < C(t), an equal discord measure if C(s) = C(t), and otherwise a lower discord measure than t.
Pos(s,t): s has a higher discord measure than t if P(s) < P(t), an equal discord measure if P(s) = P(t), and otherwise a lower discord measure than t.
Com(s,t): s has a higher discord measure than t if C(s) < C(t) and P(s) < P(t), a lower discord measure if C(s) > C(t) and P(s) > P(t), and otherwise an equal discord measure.
SubPos(s,t): s has a higher discord measure than t if C(s) < C(t), or C(s) = C(t) and P(s) < P(t); a lower discord measure if C(s) > C(t), or C(s) = C(t) and P(s) > P(t); and otherwise an equal discord measure.
PosSub(s,t): s has a higher discord measure than t if P(s) < P(t), or P(s) = P(t) and C(s) < C(t); a lower discord measure if P(s) > P(t), or P(s) = P(t) and C(s) > C(t); and otherwise an equal discord measure.
Outer Loop Heuristics. To speed up the discord discovery algorithm, in the outer loop we should reorder the candidate subsequences so that the most unusual subsequences are visited early. We propose a number of alternative ways to order the candidates in the outer loop, described as follows.
• HOT: this ordering heuristic is the same as the one given by Keogh et al. in [4].
• SORT: the leaf nodes in the augmented trie are sorted in descending order of discord measure and then partitioned into groups with the same discord measure. The algorithm visits the subsequences group by group, according to the established order.
• SORT REV: similar to SORT except that the leaf nodes are sorted in ascending order of discord measure.
• SELECT: the leaf nodes are sorted in descending order of discord measure. The algorithm selects x subsequences from the first leaf nodes in the established order, and the other subsequences are then visited in random order. Here we choose x equal to half of the mean number of subsequences at the leaf nodes.
• PAR: the leaf nodes are sorted in descending order of discord measure and the subsequences are divided into three sets. The first set consists of the subsequences at the leaf nodes with the highest discord measures; all the subsequences at the other leaf nodes are partitioned into two sets of the same cardinality according to the established order of the leaf nodes. The outer loop visits the three sets in that order.
• PAR REV: similar to PAR except that the order of visiting the second set and the third set is reversed.
• HOT LIKE: the subsequences at the leaf node with the highest discord measure are visited first. After the outer loop has exhausted this set of candidates, the rest of the candidates are visited in random order.
• SEQ: the leaf nodes are sorted in descending order of discord measure and then partitioned into groups with the same discord measure; the groups are considered in that order. The algorithm selects one subsequence in each group; after visiting a group, it moves to the next group and returns to the first group after having visited the last group.
• SEQ REV: similar to SEQ except that the leaf nodes are sorted in ascending order of discord measure.
3.2 Inner Loop Heuristic
As for the inner loop, when the subsequence p is considered in the outer loop, the algorithm can break the inner loop early if it visits a subsequence q whose distance to p is less than the best-so-far discord distance; hence, the earlier the nearest
neighbor of p is found, the better the algorithm performs. For a subsequence p, to find a subsequence q close to p, the HOT SAX algorithm traverses the trie to find the leaf node containing p, and all the subsequences in that leaf node are visited first. After this step, the rest of the possible subsequences are visited in random order. The main idea behind this heuristic is that the subsequences with the same SAX encoding as the candidate subsequence p are very likely to be similar to p. However, through our experiments, we found that the chance of finding similar subsequences outside the matching leaf node is also very high, and the random mechanism cannot guarantee finding the true nearest neighbor of p quickly. To improve the chance of finding subsequences almost similar to p, after visiting all the subsequences in p's leaf node, rather than visiting the rest of the subsequences randomly, we visit all the subsequences in the leaf nodes associated with a SAX encoding similar to that of p (i.e., with a SAX distance of 0). Notice that almost-similar subsequences may have slightly different SAX encodings (see the example in Table 1). To implement this idea, we use a branch-and-bound algorithm in which we try to explore every search path from the root to a leaf node associated with a SAX encoding similar to that of p. After examining all the subsequences in such a leaf node, we continue the search on another search path. Any incomplete search path that cannot lead to a SAX encoding similar to that of p is pruned. This technique helps to reduce the overhead of the trie traversal. After this step, all the remaining search paths are traversed at random to find subsequences that might be somewhat similar to p. Fig. 3 illustrates the branch-and-bound search, in which the visited paths are shown in bold and the pruned paths in dotted lines. For example, suppose we are considering the subsequence p at position 11, and p is encoded as cbc. First, the inner loop examines all the subsequences in p's leaf node; in this case there is no such subsequence. Then we start the branch-and-bound search at the root of the trie to find any subsequence associated with an encoding similar to cbc. In this example, we find the path bab, visit all the subsequences at the leaf node of the path bab, and then move to check the next path. The process continues in this way until we cannot find any other path with a similar SAX encoding. At this point, the unvisited paths are traversed at random.
Fig. 3. Branch-and-bound search in the inner loop, the paths in bold representing the paths considered in the search
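To make the pruning step concrete, the following sketch (ours, not the authors' implementation) walks a SAX trie and yields only the leaves whose word is within SAX distance 0 of the candidate's word, discarding every prefix that already violates the similarity condition. The nested-dictionary trie, the helper names build_trie, similar_symbols and similar_leaves, and the toy word set are assumptions made for illustration.

def build_trie(word_positions):
    """word_positions: dict mapping SAX word -> list of subsequence positions."""
    root = {}
    for word, positions in word_positions.items():
        node = root
        for sym in word[:-1]:
            node = node.setdefault(sym, {})
        node.setdefault(word[-1], []).extend(positions)
    return root

def similar_symbols(a, b):
    # identical or adjacent SAX symbols have breakpoint (MINDIST) distance 0
    return abs(ord(a) - ord(b)) <= 1

def similar_leaves(node, query, prefix=""):
    """Yield (word, positions) for every leaf whose word has SAX distance 0 to query."""
    depth = len(prefix)
    if depth == len(query):
        yield prefix, node              # node is the position list stored at the leaf
        return
    for sym, child in node.items():
        if similar_symbols(sym, query[depth]):
            # this prefix can still lead to a word with SAX distance 0: keep searching
            yield from similar_leaves(child, query, prefix + sym)
        # otherwise the whole branch is pruned

# Usage with the example of Fig. 3: the candidate at position 11 is encoded as "cbc"
trie = build_trie({"cbc": [11], "bab": [3, 7], "aca": [5]})
for word, positions in similar_leaves(trie, "cbc"):
    print(word, positions)              # visits the leaves "bab" and "cbc"; "aca" is pruned

In the full heuristic, the leaves found this way are examined immediately after p's own leaf; only the remaining, pruned paths are left for the random phase.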
4 Empirical Evaluation
For the experiments, the five datasets ERP, KOSKI, random walk, power data and EEG are selected. For each dataset, we create time series of length 5000, 10000, 15000, 20000 and 25000. For each setting of dataset/length, we randomly extracted 100 subsets of the data for testing. We conducted all the experiments on a Core2Duo 2.2 GHz 1GB RAM PC. For performance comparison, we adopt the following two evaluation metrics: the number of times the distance function is called and the CPU time. For the first evaluation metric, we apply the guideline from [4]. Besides, we use the real CPU time as a supplement to the first metric. Since our algorithm uses a branch-and-bound search on the trie, the runtime of the algorithm includes the overhead of the trie traversal. This cost depends on the height and the structure of the trie. Therefore, it is insufficient if the evaluation is based only on the number of distance function calls. Each of the experiments is repeated 10 times and the average value is taken. In these experiments, we set the parameters as follows: the discord length n = {128, 256}, the word size w = {4, 8, 16, 32}, and the alphabet cardinality a = {3, 4, 5}.
4.1 Experiments on the Outer Loop Heuristics
To compare the outer loop ordering heuristics empirically, we tested all combinations of the nine ordering heuristics (HOT, SORT, SORT REV, SELECT, PAR, PAR REV, HOT LIKE, SEQ, SEQ REV) and the five discord measuring functions (Sub, Pos, SubPos, PosSub, and Com) on the five different datasets and identified the best case for each dataset. The result of this experiment may at first be surprising. It shows that the selection of the outer loop heuristic and the discord measuring function depends on the dataset. While some setting of ordering heuristic/discord measuring function is more effective on certain datasets, it may perform poorly on some other datasets. Table 3 shows the good outer loop heuristics for each of the five tested datasets, sorted from the best one. For example, on the dataset EEG, the best ordering heuristic is SEQ REV used along with the discord measuring function Com(); the second best is HOT LIKE used along with the discord measuring function Sub(). The heuristics in bold are the best ones, which we will use in the next experiments.

Table 3. Outer loop heuristics for each of the five datasets

Data Set   Heuristics (best first)
EEG        1. Com-SeqRev    2. Sub-HotLike    3. Sub-ParRev
ERP        1. Hot           2. SubPos-HotLike 3. None
Koski      1. Com-Seq       2. Com-Sort       3. Com-Par
Random     1. SubPos-Par    2. SubPos-Seq     3. Com-Par
Power      1. SubPos-Seq    2. Com-Seq        3. SubPos-Par
4.2 Experiments on the Inner Loop Heuristic
The overhead of the trie traversal is between about 5% and 12% of the cost of finding the discords in an augmented trie with a reasonable height. Obviously, if the height of
the trie is high, the cost of trie traversal becomes significant and erodes the benefit of this mechanism. If the height of the trie is too small, the number of subsequences in the leaf nodes increases and so does the number of unnecessary subsequence examinations, which also makes the trie traversal costly. The influence of the height of the trie is given in Fig. 4. Through experiments, we found that the best value for the height of the trie is w = 8. We recommend that the height of the trie be about log2 n, where n is the length of the discord. Notice that the percentage of the trie traversal overhead over the cost of finding the discords decreases when the length of the time series and the length of the discord increase. For example, when the length of the time series is about 50000, the cost of trie traversal is insignificant in our experiments.
Fig. 4. Sensitivity of algorithm efficiency to the height of the trie (in terms of the number of distance function calls and CPU times)
The alphabet size also affects the performance of the algorithm. For a small value of a, e.g. a = 3, two adjacent symbols are considered the same due to the nature of the SAX encoding, so the trie traversal becomes an exhaustive search and the algorithm is inefficient. The influence of the alphabet size on the algorithm efficiency is given in Fig. 5. The experimental results show that our algorithm performs poorly with a = 3. For a = {4, 5}, its performance improves remarkably due to the effectiveness of the trie traversal.
Fig. 5. Sensitivity of algorithm efficiency to alphabet size (in terms of the number of distance function calls and CPU times)
However, if a is large, the number of leaf nodes in the trie increases, which in turn increases the overhead of the trie traversal. From the experiments, we see that EHOT works best with a = {4, 5}.
4.3 Comparisons to HOT SAX and WAT
We compare the efficiency of EHOT with HOT SAX and WAT in terms of the number of distance function calls and CPU time, on the five datasets and with the heuristics marked in Table 3. Here, in Fig. 6 and Fig. 7, we report only the experimental results with n = 128. For the case n = 256, we obtained the same results.
Fig. 6. Efficiency comparison (EHOT, HOT SAX and WAT) in terms of the number of distance function calls
Fig. 7. CPU time comparison (EHOT, HOT SAX and WAT)
The experimental results show that EHOT outperforms HOT SAX and WAT in terms of efficiency. However, one shortcoming of EHOT is that it is sensitive to the parameter setting. As mentioned above, when we work with our datasets, the best values for the word size and the alphabet cardinality are w = 8 and a = {4, 5}.
5 Conclusions
This paper proposes a more concrete framework for finding time series discords. In the framework, we incorporate some novel heuristics which can enhance the efficiency of the Heuristic Discord Discovery (HDD) algorithm proposed by Keogh et al. [4]. Our new heuristics consist of two components: a new discord measure function, which helps to set up a range of alternative orderings for the outer loop of the HDD algorithm, and a branch-and-bound search mechanism on the augmented trie, which is employed in the inner loop of the algorithm. Through extensive experiments on a variety of diverse datasets and with different parameter settings, our scheme is shown to perform better than the previous schemes, namely HOT SAX and WAT. In the near future, we intend to investigate how our discord discovery algorithm performs when some other discretization technique is used rather than SAX.
References 1. Bu, Y., Leung, T.W., Fu, A., Keogh, E., Pei, J., Meshkin, S.: WAT: Finding Top-K Discords in Time Series Database. In: SDM (2007) 2. Dasgupta, F., Forrest, S.: Novelty Detection in Time Series Data Using Ideas from Immunology. In: Proc. of the 5th International Conference on Intelligent Systems (1996) 3. Keogh, E., Lonardi, S., Chiu, B.: Finding Surprising Patterns in a Time Series Database in Linear Time and Space. In: KDD 2002: Proc. of 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, USA, pp. 550–556 (2002) 4. Keogh, E., Lin, J., Fu, A.: HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In: Proc. of 5th IEEE Int. Conf. on Data Mining (ICDM), pp. 226– 233 (2005) 5. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In: Proc. of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2003) 6. Ma, J., Perkins, S.: Online Novelty Detection on Temporal Sequences. In: Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 614–618. ACM Press, New York (2003) 7. Salvador, S., Chan, P., Brodie, J.: Learning States and Rules for Time Series Anomaly Detection. In: Proc. of 17th International FLAIRS Conference, pp. 300–305 (2004) 8. Shahabi, C., Tian, X., Zhao, W.: TSA-tree: A Wavelet-based Approach to Improve the Efficiency of Multi-level Surprise and Trend Queries on Time-series Data. In: SSDBM 2000: Proc. of the 12th Int. Conf. on Scientific and Statistical Database Management (SSDBM 2000), p. 55. IEEE Computer Society Press, Washington (2000) 9. Yankov, D., Keogh, E., Rebbapragada, U.: Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Data Sets. J. Knowledge and Information Systems 15 (2008)
Using Rule Order Difference Criterion to Decide Whether to Update Class Association Rules Kritsadakorn Kongubol, Thanawin Rakthanmanon, and Kitsana Waiyamai Data Analysis and Knowledge Discovery Laboratory (DAKDL), Computer Engineering Department, Engineering Faculty, Kasetsart University, Bangkok, Thailand {fengtwr,kitsana.w}@ku.ac.th
Abstract. Associative classification is a well-known data classification technique. With the increasing amounts of data, maintenance of classification accuracy becomes increasingly difficult. Updating the class association rules requires large amounts of processing time, and the prediction accuracy is often not increased that much. This paper proposes an Incremental Associative Classification Framework (IACF) for determining when an associative classifier needs to be updated based on new training data. IACF uses the rule Order Difference (OD) criterion to decide whether to update the class association rules. The idea is to see how much the rank order of the associative classification rules change (based on either support or confidence) and if the change is above a predetermined threshold, then the association rules are updated. Experimental results show that IACF yields better accuracy and less computational time compared to frameworks with class association rule update and without class association rule update, for both balanced and imbalanced datasets. Keywords: Dynamic Rule Updating, Incremental Data Databases, Incremental Associative Classification, Order Difference.
1 Introduction Associative classification mining is a promising approach in data mining that utilizes the association rule discovery techniques to construct classification systems, also known as associative classifiers. To build a classifier, class association rules (CARs) are first discovered from training datasets and a subset of them is selected to form a classifier. A variety of associative classification algorithms [15, 16] have been proposed in the literature, e.g. CBA [6], CBA2 [7], CMAR [8], CPAR [19], MCAR [18] and MMAC [17]. Although the existing algorithms have improved the competitiveness over other traditional classification algorithms such as decision trees with regard to accuracy, none of them incorporates incremental associative classification. As new transactions are inserted into the database, current associative classification algorithms have to re-scan the updated training dataset and update the CARs in order to reflect changes. The re-scanning operation is computationally costly. Incremental association rule discovery
methods such as FUP [2], FUP2 [3], Negative Border [13], UWEP [12], Sliding-Window Filtering [1], and NUWEP [14] have been proposed to reduce the number of re-scans of the training dataset. However, regenerating CARs for every round of transaction insertions still requires significant computational time, and the predictive accuracy might not be significantly better than that of the original classifier model. This implies that updating the classifier model after every round of insertions is not necessary. A criterion to decide whether or not to update the classifier model so that classification accuracy is maintained is a new and important research direction. This paper proposes an Incremental Associative Classification Framework (IACF) for determining when an associative classifier needs to be updated based on new training data. IACF uses the rule Order Difference (OD) criterion to decide whether to update the class association rules when there are insertions of transactions, such that the desired classification accuracy is maintained. Rules are updated only if new data result in a significant change in the order of the rules, on the basis of some measure of interestingness such as support or confidence. Our criterion is based on the fact that, if the associations between attributes and the class values are almost the same before and after the insertion of transactions, the rules are generated from the same data distribution and hence do not need to be updated. However, determining the data distribution is a time-consuming operation. We propose a method to estimate the data distribution by analysing the change in order of size-2 rules. A size-2 rule is a class association rule which contains 2 items. The rule order difference is based on the traditional rule support or rule confidence measures. The OD criterion determines the rule order difference after the insertion of transactions. If the OD criterion exceeds a Maximum Order Difference (MOD) specified by the user, then the class association rules need to be updated. Experimental results using the CBA technique and datasets from the UCI Machine Learning Repository [11] show that using the OD criterion consumes less computational time than the update-classifier framework and gives higher accuracy than both the update-classifier framework and the no-update framework, for both balanced and imbalanced datasets. The remainder of this paper is organized as follows. Section 2 presents general associative classification frameworks for incremental databases; Section 3 presents our idea and framework for incremental associative classification; Section 4 describes the experimental results on classification accuracy and computational time; conclusions are given in Section 5.
2 Incremental Associative Classification Frameworks When a database is incrementally expanded with a large number of transactions, two frameworks are possible: with classifier update and without classifier update. The framework without classifier update (FW1) uses only one classifier even if the database is updated; the advantage is that no computational time is required for updating, but the disadvantages are that the accuracy is reduced and that it does not support the idea of incremental associative classification. The framework with classifier update (FW2) updates the classifier model after every round of incremental
updates: the advantage is that the accuracy of the classifier model is maintained, at the expense of the huge computational time needed to update the classifier model every round. To understand the two frameworks for incremental associative classification in more depth, see the following: 1. Framework 1: Framework without classifier update (FW1). FW1 is a standard incremental associative classification framework that uses the same classifier model to predict new, unseen data. The model is shown in Figure 1 and its notation is described in Table 1. For each incremental database round, the class association rules are not updated and the model accuracy depends only on the original classifier.
Fig. 1. Framework without classifier update
2. Framework 2: Framework with classifier update (FW2). FW2 updates the classifier model after every round of incremental database updates: the updated classifier model is built from the original database (D) and the cumulative incremental database updates (di). The predictive accuracy is maintained consistent with the cumulative database updates. This model is shown in Figure 2, with notation as in Table 1.
Fig. 2. Framework with classifier update
3 IACF: An Incremental Associative Classification Framework In this section, we describe the Incremental Associative Classification Framework (IACF) to dynamically update class association rules for incremental database updates. IACF uses the Order Difference (OD) criterion to decide whether to update the class association rules. An OD is the difference between size-2 rule order
before and after updating size-2 class association rules. Rules are ordered by support or confidence interestingness measures. The model for IACF is shown in Figure 3 with notation described in Table 1. For each incremental round, IACF compares OD and MOD values. Original and updated databases are merged and the classifier is rebuilt if OD is greater than MOD. Otherwise, the current classifier is used to predict unseen data.
Fig. 3. IACF framework for incremental classifier update
To illustrate the operation of IACF, consider an example of association rule discovery with minimum support 2 and minimum confidence 50%, as shown in Fig. 4a and 4b.

(a) Databases before insertions
TID   Items    Class
1     A,B,C    X
2     A,C,D    Y
3     A,B,D    X
4     B,C,D    Y
5     A,D      X

Databases after insertions
TID   Items    Class
1     A,B,C    X
2     A,C,D    Y
3     A,B,D    X
4     B,C,D    Y
5     A,D      X
6     A,B      X
7     C,D      Y
8     D        X

(b) Discovered rules before insertions
RID   Rule     Support   Confidence
1     A,B->X   2         2/2 = 100%
2     C,D->Y   2         2/2 = 100%
3     A->X     3         3/4 = 75%
4     B->X     2         2/3 = 67%
5     C->Y     2         2/3 = 67%
6     D->X     2         2/4 = 50%
7     D->Y     2         2/4 = 50%

Discovered rules after insertions
RID   Rule     Support   Confidence
1     A,B->X   3         3/3 = 100%
2     C,D->Y   3         3/3 = 100%
3     A->X     4         4/5 = 80%
4     B->X     3         3/4 = 75%
5     C->Y     3         3/4 = 75%
6     D->X     3         3/6 = 50%
7     D->Y     3         3/6 = 50%

Fig. 4. (a) Databases before and after insertions. (b) Generated rules before and after insertions.
In this example, the updated rules have new support and confidence values, but their orders before and after updating remain the same. This implies that the data distribution before and after the insertion is essentially identical. In this case, an update of the class association rules is not necessary. The data distribution can be estimated by analysing the difference, or variation, in the order of the size-2 rules before and after the database insertions. The idea is based on the fact that all the class association rules are necessarily generated from size-2 rules; thus differences of rule order can be estimated from the change in the size-2 rules.
3.1 Order Difference
The criterion to determine whether to update the classifier model when there are incremental updates is based on the ordering of the set of size-2 rules R2. The OD is the summation of the order differences of the R2 rules before and after the updates. The notation used in the equations is described in Table 1. Let f(r) and f'(r) be defined as follows:
• f(r) is the order of a rule r in the original database (D),
• f'(r) is the order of r in the updated database (Dd),
with f(r), f'(r) ∈ {1, 2, 3, …} for every rule r ∈ R2.
The rule order difference is based on the rule's support or confidence measure. The OD criterion is defined as follows. Using confidence to rank the order of the R2 rules, OD is defined as

OD = Σ_{r ∈ R2} ODr,  where ODr = |f(r) − f'(r)| if f(r) ≠ f'(r), and ODr = 0 otherwise.   (1)

Using support to rank the order of the R2 rules, OD is defined as

OD = Σ_{r ∈ R2} ODr,  where ODr = |(f(r) − f'(r))^2 · log(1 / (f(r) − f'(r))^2)| if f(r) ≠ f'(r), and ODr = 0 otherwise.   (2)

Equation (1) is defined based on the observation that confidence-based rule ranking can show a dramatic change in the order of the rules even though a very small number of transactions have been inserted. Conversely, Equation (2) is defined based on the observation that support-based rule ranking can show only a slight change in the order of the rules even though a very large number of transactions have been inserted.
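A small sketch (ours, not code from the paper) of the OD computation just defined. The function name and the dictionary-based rule ordering are assumptions; Equation (1) is implemented here as a plain sum of absolute rank differences and Equation (2) uses a base-10 logarithm, both choices being consistent with the execution example in Section 3.3.

import math

def order_difference(order_before, order_after, ranked_by="support"):
    """order_before / order_after: dict mapping each size-2 rule to its rank (1, 2, 3, ...)."""
    od = 0.0
    for rule, f in order_before.items():
        f2 = order_after[rule]
        if f == f2:
            continue                     # a rule that keeps its rank contributes 0
        if ranked_by == "confidence":
            od += abs(f - f2)            # Equation (1), as read here
        else:
            d2 = (f - f2) ** 2
            od += abs(d2 * math.log10(1.0 / d2))   # Equation (2)
    return od

# Example: AX drops from rank 1 to rank 3 while BX and CX each move up one place;
# a shift of a single position contributes 0 under Equation (2), so OD = 2.41
before = {"AX": 1, "BX": 2, "CX": 3}
after  = {"AX": 3, "BX": 1, "CX": 2}
print(round(order_difference(before, after), 2))     # 2.41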
3.2 Maximum Order Difference
The rule order difference (OD) has to exceed the Maximum Order Difference (MOD) threshold in order for IACF to perform a classifier update. MOD is defined as follows:

MOD = C · L · Y   (3)

where
• C is a coefficient specified by the user. The default value is 1.
• L is the number of items.
• Y is the number of class items.
If OD > MOD, then the classifier model is updated. If the associations between items and the class values are almost the same before and after the insertion of transactions (same data distribution), it is assumed that the order difference of each rule is at most 1. The total order difference over all rules is then equal to the number of rules in R2.
3.3 IACF Algorithm
The IACF algorithm has two phases: the initial phase calculates the rule orders of the original database; the incremental phase determines the OD of the R2 rules before and after the insertion of updates. Table 1 shows the notation used in IACF.

Table 1. Notation used in the IACF algorithm

Parameter   Description
D           Original database
d           Incremental database insertions
i           Round number, i = 1, 2, 3, …, n
di          Incremental database insertions in the i-th round
Dd          Updated database using data from both the original database and the cumulative incremental updates to compute the rules R2 (without merging both databases)
r, R2       Size-2 rules, i.e. class association rules which contain 2 items
RD          Size-2 rules from the original database
RDd         Size-2 rules from the incremented databases
α           Phase number: 0 for the initial phase; 1 for the incremental phase
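A minimal sketch (ours) of the threshold test, assuming Equation (3) takes the form MOD = C · L · Y as given above. The reading of the Nursery figures in the comments (27 non-class items and 5 class items, so MOD = 135 when C = 1) is our interpretation of Table 3.

def max_order_difference(num_items, num_class_items, coefficient=1.0):
    # Equation (3): MOD = C * L * Y (C defaults to 1)
    return coefficient * num_items * num_class_items

def classifier_needs_update(od, mod):
    # IACF rebuilds the classifier only when the order difference exceeds the threshold
    return od > mod

# Reading Nursery as 27 non-class items and 5 class items gives MOD = 135 for C = 1,
# matching the middle row of Table 3; C = 0.5 and C = 1.5 give 67.5 and 202.5.
print(max_order_difference(27, 5))                 # 135.0
print(classifier_needs_update(od=40.94, mod=12))   # True (the execution example below)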
Algorithm IACF
Input: Original database D, incremental databases di, MOD
Output: Size-2 rules of the original database RD, size-2 rules of the incremented databases RDd
Procedure IACF
1)   α = 0
2)   i = 1
3)   While (i ≤ n)
4)     If α = 0 Then                  //Initial phase
5)       BuildClassifier(D);
6)       ComputeOrder(RD);
7)       α = 1;
8)     Else                           //Incremental phase
9)       d = d ∪ di;
10)      Dd = D ∪ d;
11)      ComputeOrder(RDd);
12)      //compute OD
13)      For each r ∈ R2
14)        // if R2 is ordered by confidence then
15)        //   calculate ODr from Equation (1)
16)        // if R2 is ordered by support then
17)        //   calculate ODr from Equation (2)
18)        compute ODr;
19)      End
20)      OD = Σ ODr;
21)      If OD > MOD Then
22)        D = D ∪ d;
23)        d = ∅;
24)        α = 0;
25)        i = i + 1;
26)      Else
27)        // use the current classifier model
28)        // to predict the incremental databases
29)        ModelPredict(d);
30)        i = i + 1;
31)      End
32)    End
End
In the initial phase, as seen in lines 4-7 of the IACF algorithm, the algorithm builds the classifier from the original database D, computes the rule orders RD of the original database, and sets the next round to the incremental phase. During the incremental phase, lines 9-10 compute an incremented database. Line 11 computes the rule orders RDd of the incremented database Dd. Lines 13-20 compute the order difference between the rules RD of the original database D and the rules RDd of the incremented database Dd and sum the order differences of the rules. Lines 21-31 test whether the OD is greater than the threshold MOD; if it is greater, IACF merges the original database with the incremented database and sets the next round to the initial phase. If not, it uses the current classifier model to predict the incremental database and sets the next round to the incremental phase.
Execution Example
Figure 5 illustrates the calculation of OD, given MOD = 12 and using OD Equation (2) (rules ordered by support).
Fig. 5. Example of OD calculation
As illustrated in Figure 5, each round can be explained as follows:
1. Original database round: IACF computes each rule's support and orders the rules by descending support. Here, AX has the greatest support and its order is 1.
2. Incremental database round #1: IACF computes the new support of each rule by summing the support from the original database and the incremental database of round #1, and then orders the rules by descending support. Here, AX's support changes from 10 to 11 and its order changes from 1 to 3. So the OD of AX is |(1-3)^2 * log|1/(1-3)^2|| = 2.41. Finally, summing the OD over all rules yields OD = 2.41, which is less than MOD = 12. Hence, IACF uses the existing classifier model.
3. Incremental database round #2: IACF computes the new support of each rule by summing the support from the original database and the cumulative incremental databases (incremental database round #1 and incremental database round #2), and then orders the rules by descending support. Here, AX's support is still equal to 11, but its order changes from 3 to 5. The OD of AX between the original database round and incremental round #2 is thus |(1-5)^2 * log|1/(1-5)^2|| = 19.27. Finally, summing the OD over all rules yields OD = 40.94. Since this is greater than MOD = 12, IACF will create a new classifier model.
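The per-rule term of Equation (2) can be checked directly against the values quoted above (a small sketch; the helper name od_term is ours and a base-10 logarithm is assumed):

import math

def od_term(rank_before, rank_after):
    # per-rule contribution of Equation (2)
    if rank_before == rank_after:
        return 0.0
    d2 = (rank_before - rank_after) ** 2
    return abs(d2 * math.log10(1.0 / d2))

print(round(od_term(1, 3), 2))   # 2.41  -> AX moving from order 1 to order 3 (round #1)
print(round(od_term(1, 5), 2))   # 19.27 -> AX moving from order 1 to order 5 (round #2)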
4 Experimental Results
In this section, we report our experimental results comparing IACF with the two existing frameworks FW1 and FW2 in terms of average accuracy and average computational time. The experiments use CBA for the data classification task. All the experiments were performed on a computer with a 2.1 GHz Intel Core2 Duo CPU and 4 GB of main memory, running Microsoft Windows Vista SP1. CBA was implemented in the Java language based on the LUCS-KDD implementation of the CBA software [10] and compiled using Eclipse SDK 3.4.1 and JDK 1.6. The parameter values used in these experiments are: support threshold 5%, confidence threshold 50%, maximum number of antecedents 99, and maximum number of rules (CARs) 80,000. The MOD of each dataset is defined as in Equation (3). The performance of each framework is evaluated using six selected datasets from the UCI Machine Learning Repository, namely:
− Adult.D97.N48842.C2
− Connect4.D129.N67557.C3
− Led7.D24.N3200.C10
− Nursery.D32.N12960.C5
− PageBlocks.D46.N5473.C5
− PenDigits.D89.N10992.C10
Explanation of the characteristics of each dataset can be found in [9]. The dataset names describe the key characteristics of each data set; for example, adult.D131.N48842.C2 denotes the "adult" data set, which includes 48842 records in 2 classes. For each dataset we evaluate as follows: we split the dataset into two groups; the first group was used for 10 incremental update rounds with 200 transactions per round, and the second group was the remaining transactions. We split the second group of transactions into 70% training data and 30% testing data. When there were incremental databases and the classifier model needed to be rebuilt, the incremental databases were split 70/30: 70% of the data was merged with the 70% training data from the second group and 30% of the data was merged with the 30% test data from the second group. In the following, we first show experimental results comparing the three frameworks in terms of average accuracy and average computational time. Then, we give details on MOD tuning.

Table 2. (a) Average accuracy of the frameworks. (b) Average computational time (milliseconds) of the frameworks

(a)
Datasets     FW1 Avg. Accuracy   FW2 Avg. Accuracy   IACF Eqn (1) Avg. Accuracy   IACF Eqn (2) Avg. Accuracy
Adult        76.10               76.10               76.10                        76.10
Connect4     65.89               65.89               65.89                        65.89
Led7         73.24               73.29               73.29                        73.29
Nursery      79.32               79.95               80.46                        80.14
PageBlocks   89.78               89.78               89.78                        89.78
PenDigits    76.65               78.15               78.15                        78.15

(b)
Datasets     FW1 Avg. Comp. Time (ms)   FW2 Avg. Comp. Time (ms)   IACF Eqn (1) Avg. Comp. Time (ms) / #rebuilt classifiers   IACF Eqn (2) Avg. Comp. Time (ms) / #rebuilt classifiers
Adult        19119                      229434                     20942/0                                                    20833/0
Connect4     40082                      458752                     40672/0                                                    40482/0
Led7         123                        317                        339/10                                                     283/8
Nursery      265                        808                        437/4                                                      295/4
PageBlocks   163                        913                        698/7                                                      402/3
PenDigits    238                        1205                       1244/10                                                    1281/10
4.1 Average Accuracy and Average Computational Time Comparison
Based on the characteristics of the datasets used in our experiments, the datasets can be organized into two groups: imbalanced and balanced datasets. Imbalanced datasets are datasets for which one major class is heavily over-represented compared to the other classes, whereas balanced datasets are datasets for which every class is equally represented. The imbalanced datasets are Adult, Connect4, and PageBlocks. The balanced datasets are Nursery, Led7, and PenDigits. Tables 2a and 2b show the experimental results comparing the three frameworks in terms of average accuracy and average computational time. In the case of the imbalanced datasets, Table 2a shows that the average accuracies of all frameworks are similar. This can be explained by the fact that the associative classification rules do not need to be updated even though new transactions are inserted. Moreover, IACF's average computational time is nearly as good as that of Framework 1, which exhibits the best average computational time. In the case of the balanced datasets, IACF and Framework 2 give the best average accuracies for the Led7, PenDigits, and Nursery datasets. This can be explained by the fact that the insertions of transactions cause the classifier model to change; a classifier model update is then necessary to maintain accuracy. IACF's average computational time is better than, or almost equal to, that of Framework 2.
4.2 MOD Tuning
The MOD threshold can be tuned using Equation (3). Table 3 shows that a low value of MOD increases both the average computational time and the average accuracy. Larger values of MOD decrease the computational time; however, the average accuracy is still better than that of Framework 1. Consequently, choosing an optimum MOD is key to achieving the best accuracy.

Table 3. Average accuracy and average computational time for different values of MOD using IACF Equation (2)

Datasets   MOD     Comp. Time (ms)   Avg. Comp. Time (ms)   Avg. Accuracy (%)   # rebuilt classifiers
Nursery    67.5    4087              371.55                 80.14               7
Nursery    135     3293              299.36                 80.14               4
Nursery    202.5   2369              215.36                 79.46               3
5 Conclusion and Discussion We proposed a framework for incremental associative classification, and introduced the Order Difference criterion as the measure used to decide whether to update a set of associative classification rules or not. Rules are updated only if new data modify the order of the size-2 rules to a significant extent, on the basis of some measure such as support or confidence.
Accuracy and computational time are crucial factors in associative classification techniques. Experiments on the six selected datasets from the UCI repository indicate that IACF is highly competitive when compared with FW1 and FW2 in terms of accuracy and computational time, for both balanced and imbalanced datasets. Our future work will investigate MOD tuning to determine a suitable MOD value. Apart from support and confidence, other measures of interestingness, such as lift and conviction [4, 5], will be investigated. Moreover, we will enhance the proposed framework to support associative classification that uses multiple rules for prediction. Another interesting research direction is to incorporate generic association rules into IACF. Acknowledgment. Thanks to J. E. Brucker for his reading of and comments on this paper.
References 1. Chang, H.L., Cheng, R.L., Ming, S.C.: Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining. In: Proc. of the ACM 10th International Conference on Information and Knowledge Management (CIKM 2001), November 5-10, 2001, pp. 263–270 (2001) 2. David, W.C., Han, J., Ng, V., Wong, C.Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. In: Proc. 12th IEEE International Conference on Data Engineering (ICDE 1996), New Orleans, Louisiana, U.S.A (March 1996) 3. David, W.C., Sau, D.L., Benjamin, K.: A general incremental technique for maintaining discovered association rules. In: Proceedings of the 5th Intl. Conf. on Database Systems for Advanced Applications (DASFAA 1997), Melbourne, Australia (April 1997) 4. Geng, L., Hamilton, H.J.: Interestingness Measures for Data Mining: A Survey. ACM Computing Surveys 38(3), Article 9 (September 2006) 5. Lenca, P., Meyer, P., Vaillant, B., Lallich, S.: On selecting interestingness measures for association rules: user oriented description and multiple criteria decision aid. European Journal of Operational Research 184(2), 610–626 (2008) 6. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 80–86. AAAI Press, New York (1998) 7. Liu, B., Ma, Y., Wong, C.K.: Improving an association rule based classifier. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, pp. 504–509 (2000) 8. Li, W., Han, J., Pei, J.: CMAR: Accurate and efficient classification based on multiple-class association rule. In: Proceedings of the International Conference on Data Mining (ICDM 2001), San Jose, CA, pp. 369–376 (2001) 9. LUCS-KDD DN Example Notes, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/exmpleDNnotes.html 10. LUCS-KDD implementation of CBA, http://www.csc.liv.ac.uk/~frans/KDD/Software/CBA/cba.html 11. Merz, C., Murphy, P.: UCI repository of machine learning databases. Department of Information and Computer Science. University of California, Irvine (1996)
12. Necip, F.A., Abdullah, U.T., Erol, A.: An Efficient Algorithm to Update Large Itemsets with Early Pruning. In: ACM SIGKDD Intl. Conf. on Knowledge Discovery in Data and Data Mining (SIGKDD 1999), San Diego, California (August 1999) 13. Shiby, T., Sreenath, B., Khaled, A., Sanjay, R.: An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases. In: Proceedings of the 3rd International conference on Knowledge Discovery and Data Mining (KDD 1997), New Port Beach, California (August 1997) 14. Susan, P.I., Abdullah, U.T., Eric, P.: An Efficient Method For Finding Emerging Large Itemsets. In: The Third Workshop on Mining Temporal and Sequential Data, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (August 2004) 15. Thabtah, F.: A review of associative classification mining. In: The Knowledge Engineering Review, vol. 22(1), pp. 37–65. Cambridge University Press, Cambridge (2007) 16. Thabtah, F.: Challenges and Interesting Research Directions in Associative Classification. In: Sixth IEEE International Conference on Data Mining Workshops, ICDM Workshops 2006, December 2006, pp. 785–792 (2006) 17. Thabtah, F., Cowling, P., Peng, Y.: MMAC: A new multi-class, multi-label associative classification approach. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), Brighton, UK, pp. 217–224 (2004) 18. Thabtah, F., Cowling, P., Peng, Y.: MCAR: Multi-class classification based on association rule approach. In: Proceeding of the 3rd IEEE International Conference on Computer Systems and Applications, Cairo, Egypt, pp. 1–7 (2005) 19. Yin, X., Han, J.: CPAR: Classification based on predictive association rule. In: Proceedings of the SIAM International Conference on Data Mining, pp. 369–376. SIAM Press, San Francisco (2003)
An Integrated Approach for Exploring Path-Type Association Rules Nai-Chieh Wei1, Yang Wu2, I-Ming Chao1, and Shih-Kai Lin1 1
Department of Industrial Engineering and Management I-Shou University No. 1, Section 1, Syuecheng Rd., Dashu Township, Kaohsiung County 84001, Taiwan, R.O.C. 2 Department of Food and Beverage Management Far East University No. 49, Chung Hua Rd., Hsin-Shih, Tainan County 744, Taiwan, R.O.C. [email protected]
Abstract. This paper develops an effective approach by integrating an Artificial Neural Network–Self-Organizing Map (ANN-SOM) and Ant Theory to analyze possible path-type rules from the use of transaction data and the stocking of sections and displays. These derived rules can be viewed as popular paths taken more frequently while customers are shopping. By properly using these rules, management could benefit by further comprehending which customer groups and which products best contribute to increased profits; they should also be able to react more quickly to change than their competitors do. Keywords: Data Mining, Association Rules, Artificial Neural Net– SelfOrganizing Map, Ant Theory.
1 Introduction Studies have shown that product association rules are very important for decision makers in retailing. By knowing which items are always bought at the same time by a given customer, or by many customers, decision makers can successfully deal with marketing promotions, inventory management, and customer relations management [4, 5 & 9]. Association rules have been extensively utilized to uncover product relationships in retailing, manufacturing, telecommunications, medicine, the travel industry, and education [1, 2, 3, 11 & 12]. However, though difficult to do, management generally hopes to obtain rules which are as consistent as possible. Since such derived rules rely heavily on two measures: Support and Confidence, the resulting rules are sensitive to pre-specified values. Thus, not all rules are potentially useful, especially if they happen simply by chance [13]. In an effort to find solutions to the problems mentioned above, this research presents an integrated approach using both transaction data and stocking sections and
displays in order to explore the effectiveness of path-type association rules. This two-stage approach first performs the ANN-SOM to classify customers into certain groups based on such customer profiles as how recently they made purchases, how frequently they shopped, and how much they actually spent (RFM). After that, Ant Theory is adopted to derive path-type rules for each customer group. These path-type rules can be observed as "hot paths" indicating where each customer group travels most frequently on the floor. In addition, a hot path can reveal the customer shopping flow. By taking advantage of these path-type rules, retailers can not only rearrange the product display locations along the hot paths to attract more attention, but also identify new opportunities for product cross-selling. The remainder of the paper is organized as follows: Section 2 reviews the three related approaches, which are then used in Section 3 for a real case application. Conclusions and directions for future studies are presented in Section 4.
2 Three Related Approaches
2.1 Association Rules
Association rules describe the probability of product items being purchased simultaneously. Each rule can be derived from a data set which details the purchasing transactions of customers. Hence, association rules can be applied to explain why customers are likely to purchase both item A and item B at the same time. The formal definition of an association rule, according to Tan et al. (2006), can be summarized as follows. Let I = {i1, i2, …, im} be the set of all items and T = {t1, t2, …, tm} be the set of all transactions; each transaction ti contains an item set X that belongs to I. Each item set is then measured by Support and Confidence, which can be defined as follows:
1. Support (A ∪ B, T): the frequency rate of two items being purchased simultaneously in a given data set. Item sets with a higher support level are items deserving greater attention; namely, the probability of item A and item B appearing simultaneously can be expressed as P(A ∪ B) = support-count(A ∪ B)/N, where support-count(A, B) represents the number of purchases containing both items A and B at the same time, and N represents the total number of purchases for all possible item sets.
2. Confidence (A→B): Confidence determines how frequently item A appears and then B appears in purchasing transactions. It is the probability of item B being purchased following the purchase of item A (namely the probability of B appearing under the condition of A), and may be expressed as P(B|A) = support-count(A ∪ B)/support-count(A) = P(A ∪ B) / P(A).
There are two steps used to determine potential association rules. In examining the limits the user specifies for determining support and confidence measures, all the item sets must first satisfy the specified level of the support threshold. The second step is to determine whether they also satisfy the specified confidence threshold. When the item sets satisfy both measures, they are regarded as association rules.
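A minimal sketch (ours) of the two measures and the two-step filtering just described; the function names, the transactions and the thresholds are illustrative only.

def support_count(transactions, itemset):
    # number of transactions that contain every item of `itemset`
    return sum(1 for t in transactions if itemset <= t)

def evaluate_rule(transactions, a, b, min_support, min_confidence):
    """Return (support, confidence) of the rule A -> B, or None if a threshold fails."""
    n = len(transactions)
    union_count = support_count(transactions, a | b)
    support = union_count / n                                  # P(A U B)
    if support < min_support:
        return None                                            # step 1: support threshold
    confidence = union_count / support_count(transactions, a)  # P(B | A)
    if confidence < min_confidence:
        return None                                            # step 2: confidence threshold
    return support, confidence

# illustrative transactions, each a set of purchased items
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
print(evaluate_rule(transactions, {"A"}, {"B"}, min_support=0.4, min_confidence=0.5))
# -> (0.6, 0.75): A and B appear together in 3 of 5 transactions, and in 3 of the 4 containing A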
2.2 Self-Organizing Map
The Self-Organizing Map (SOM) is an unsupervised neural network [10] adopting two layers of neurons (nodes): an input layer and an output layer. These nodes are organized in a pre-specified grid structure such as a rectangular or hexagonal pattern. The influence of one node on the others is exerted within a neighborhood. During the training process, the SOM employs each training sample to seek a winner node and subsequently updates its neighboring nodes within a certain radius. After the training process is over, the SOM will have generated an ordered set of nodes in such a way that similar samples fall into the same output node (Tan et al., 2006) [8]. The SOM algorithm is outlined below in 8 steps. Steps 1 to 3 initialize the network parameters and topology structure. Steps 4 to 8 seek the winner neuron and update the weights of its neighboring nodes for each training sample.
Step 1: Initialize the neighborhood radius R, the learning rate η and the feature map. Also, assign the weight vector W and the dimensions of the topology, J x K.
Step 2: Establish a two-dimensional grid topology for the output-layer neurons conforming to these characteristics: net_jk = (j, k), j = 1, 2, …, J; k = 1, 2, …, K.
Step 3: Input a training sample with its reference vector X[i], i = 1, 2, 3, into the network.
Step 4: Seek the winner node Node_j*k*:

net_jk = Σ_{i=1}^{3} (X[i] − W_jk[i])^2,  j = 1, 2, …, J; k = 1, 2, …, K,

where W_jk[i] represents the weight link between the input-layer node and the output-layer nodes of the two-dimensional rectangular network, and pick the node Node_j*k* with the smallest distance net_jk:

net_j*k* = min{ net_jk },  j = 1, 2, …, J; k = 1, 2, …, K.

Step 5: Compute the vector Y_jk of the output layer: if j = j* and k = k*, then Y_jk = 1; otherwise Y_jk = 0.
Step 6: Update the weight value ΔW:

ΔW_jk[i] = η · (X[i] − W_jk[i]) · Neighborhood_jk,  i = 1, 2, 3; j = 1, 2, …, J; k = 1, 2, …, K,

where the neighborhood function is

Neighborhood_jk = exp(−r_jk / R)

and r_jk is the Euclidean distance between the output-layer node Node_jk and the winner node Node_j*k*:

r_jk = [(j − j*)^2 + (k − k*)^2]^{1/2},  j = 1, 2, …, J; k = 1, 2, …, K.

Step 7: Adjust the weight vector W: W_jk[i] = W_jk[i] + ΔW_jk[i],  i = 1, 2, 3; j = 1, 2, …, J; k = 1, 2, …, K.
Step 8: Apply the learning rate decay η = η_rate · η and the neighborhood radius decay R = R_rate · R.
Repeat Steps 3 through 8 until all the necessary training samples have been input. When the error difference between the current and the previous iterations is smaller than a given value, the computation process is concluded. Each node in the network will then represent a cluster or a group. The symbols used in the above process are listed and described below.

Table 1. Symbols in SOM

Code            Description
W               Weight vector
R               Neighborhood radius
η               Learning rate (0 < η < 1)
η_rate          Decrease in learning rate
R_rate          Decrease in neighborhood radius
X[i]            Input vectors
Y_jk            Output vectors
(j, k)          Representing the two-dimensional rectangular grid structure in J and K
Node_j*k*       Winner node
Neighborhood    Neighborhood function
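For concreteness, the following compact sketch (ours, not the Matlab implementation used later in Section 3) runs Steps 3-8 over a set of samples on a rectangular J x K map; the decay constants, epoch count and random data are illustrative.

import numpy as np

def train_som(samples, J=2, K=2, R=1.0, lr=0.5, lr_rate=0.95, R_rate=0.95, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((J, K, samples.shape[1]))                            # Step 1: weight vectors W_jk
    jj, kk = np.meshgrid(np.arange(J), np.arange(K), indexing="ij")     # Step 2: grid positions
    for _ in range(epochs):
        for x in samples:                                               # Step 3: present a sample
            net = ((x - W) ** 2).sum(axis=2)                            # Step 4: squared distances net_jk
            j_s, k_s = np.unravel_index(net.argmin(), net.shape)        # winner node
            r = np.sqrt((jj - j_s) ** 2 + (kk - k_s) ** 2)              # Step 6: grid distance r_jk
            neigh = np.exp(-r / R)                                      # neighborhood function
            W += lr * (x - W) * neigh[:, :, None]                       # Steps 6-7: weight update
        lr, R = lr_rate * lr, R_rate * R                                # Step 8: decay rate and radius
    return W

# Usage on 350 three-dimensional (RFM-like) vectors; a 2 x 2 map yields four groups
rfm = np.random.default_rng(1).random((350, 3))
weights = train_som(rfm)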
2.3 Ant Theory
Ant Theory, proposed by Marco Dorigo [6 & 7], was developed by observing the actual behavior of ants in their search for food locations/sections and for the shortest path between their formicary and the food location. Instead of vision, ants secrete pheromones along their passageways to deliver messages to other ants seeking food locations/sections. As more and more ants take the same passageway or path to a certain food location/section, more pheromone is deposited on that section and path. In turn, ants select the section and the path with a higher concentration of pheromone at a higher frequency. Hence, all ants gradually congregate at the same section for obtaining food and take the same path between their formicary and the food location. The formula for path selection probability in Ant Theory explains the rules ants use to seek food, which are similar to the rules customers use in selecting product sections. In the t-th iteration, the function p_ij^k(t) represents the probability of the k-th ant choosing to go from section i to section j, and it can be expressed as follows:
p_ij^k(t) = [τ_ij(t)]^α [η_ij]^β / Σ_{l ∈ allowed_k} [τ_il(t)]^α [η_il]^β, if j ∈ allowed_k;  p_ij^k(t) = 0, otherwise.   (1)

where
p_ij^k(t): the probability function for the k-th ant to go from section i to section j.
τ_ij: the pheromone concentration left between section i and section j.
η_ij: the visibility between section i and section j.
allowed_k: the set of sections the k-th ant can choose as the next section when it is in section i (namely, the sections the k-th ant has not yet visited).
α and β: parameters that control the relative influence of τ_ij(t) and η_ij, respectively. Both α and β are ≥ 0; a larger α favours selecting paths based on the magnitude of τ_ij(t), while a larger β favours selecting paths based on the magnitude of η_ij.
t: the number of iterations.
Based on the formula shown above, the pheromone left between two sections, τ_ij, affects which section the ants will visit next, whereas the visibility, η_ij, between two sections is the same in each iteration. However, the pheromone concentrations are renewed after all of the ants complete the journey (one iteration), and they also serve as a reference for the ants in choosing their paths between their formicary and the food location. The renewal formula for the pheromone concentration is as follows:

τ_ij(t + 1) = ρ τ_ij(t) + Δτ_ij   (2)

where
τ_ij(t): the pheromone concentration left between sections i and j in the t-th iteration; the general assumption is that the initial value is τ_ij(0) = c, where c is a constant.
ρ: the residual coefficient of the pheromone (1 − ρ is the evaporation coefficient), where the value of ρ is set between [0, 1].
Δτ_ij: the sum of the pheromone concentrations left between sections i and j by all ants. It is expressed as:
Δτ_ij = Σ_{k=1}^{m} Δτ_ij^k   (3)

where Δτ_ij^k is the pheromone concentration left by the k-th ant between sections i and j, and can be expressed as:

Δτ_ij^k = Q / L_k, if (i, j) ∈ tour done by ant k;  Δτ_ij^k = 0, otherwise.   (4)
where Q, a constant, represents the pheromone concentration secreted by each ant, and L_k is the total distance between all sections taken by the k-th ant in one iteration. Also, this research hypothesizes that the pheromone concentration on the initial paths is 0 (namely τ_ij(0) = 0). In a store, there may be thousands of stock items arranged in many different sections. When shopping in the store, customers travel from one section to another. The shopping paths can reach any section and therefore constitute a path network. On the other hand, customers' selection of shopping paths can be viewed as a probability. Some of the shopping paths have greater customer flow than others; these paths can be defined as having a higher probability of being selected or found by consumers on the store floor. As a result, the probability of customers choosing product items along these paths is also relatively higher. In this paper, a real case is studied in which customers are classified into groups based upon similarities in their profiles. For each of the specified groups, this research then aims to expose the path-type rules.
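A small sketch (ours) of the transition rule of Equation (1) and the pheromone update of Equations (2)-(4). The four-section distance matrix is illustrative; α = 1, β = 4, ρ = 0.5 and Q = 100 are the values quoted in Section 3.2, and the pheromone trail is started at a small positive constant so that the probabilities of Equation (1) are well defined.

import random

def transition_probabilities(i, allowed, tau, eta, alpha=1.0, beta=4.0):
    # Equation (1): probability of moving from section i to each section j in `allowed`
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def update_pheromone(tau, tours, tour_lengths, rho=0.5, Q=100.0):
    # Equation (2): evaporation, then Equations (3)-(4): deposits from every ant's tour
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= rho
    for tour, length in zip(tours, tour_lengths):
        for i, j in zip(tour, tour[1:]):
            tau[i][j] += Q / length                  # delta tau_ij^k = Q / L_k

# Illustrative use with 4 sections: visibility eta is 1/distance, tau starts uniform
dist = [[0, 2, 4, 3], [2, 0, 3, 5], [4, 3, 0, 2], [3, 5, 2, 0]]
eta = [[0.0 if i == j else 1.0 / d for j, d in enumerate(row)] for i, row in enumerate(dist)]
tau = [[1.0] * 4 for _ in range(4)]
probs = transition_probabilities(0, allowed=[1, 2, 3], tau=tau, eta=eta)
next_section = random.choices(list(probs), weights=list(probs.values()))[0]
update_pheromone(tau, tours=[[0, next_section, 0]], tour_lengths=[2 * dist[0][next_section]])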
3 A Case Study This research takes real transactional data from a typical domestic warehouse in Kaohsiung, in southern Taiwan. About 350 transactions targeting 35 food items were extracted over a period of four consecutive months (May to August, 2007). Extracting so called path-type rules involves two major stages: Customer Segmentations and Path Findings. 3.1 Customer Segmentations Data Pre-processing According to the customers’ most important profiles, such as RFM, customer segmentations classify customers into groups by adopting ANN-SOM. The data concerning product items purchased by each group are then examined according to the ant theory, in order to extract possible path-type rules.
In order to obtain the RFM information, the transactional data first has to be transformed into RFM mode. For example, the original transactional data (refer to Table 2) can be transformed into RFM mode for one particular customer whose transactions occurred on August 10, 2007 and August 31, 2007. These two transaction dates are twenty days apart, thus R is equal to 20. Also, F represents the frequency (number of transactions) with which the same customer came to the store during the four consecutive months, while M stands for the total accumulated amount this particular customer spent at the store. The transformed data corresponding to RFM is as follows.

Table 2. Partial transaction data

T. Date    Deal_No   Card_No         Good_No      DPS_No   DPS_Name    QTY   TOT_AMT
2006/5/1   46        2815101071051   1154060002   115406   Ball Type   1     49
2006/5/1   46        2815101071051   1161050004   116105   Fish        1     169
2006/5/1   46        2815101071051   1151040014   115104   Dumplings   1     68

Table 3. Examples of RFM
NO   Recent (R)   Frequency (F)   Amount (M)
1    33           2               5575
2    7            7               1570
3    122          1               561
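A minimal sketch (ours) of the RFM transformation described above, with field names following Table 2. R is taken here as the number of days between a customer's transaction dates, following the worked example ("twenty days apart, thus R is equal to 20"); this is one possible reading, and F simply counts transaction rows rather than grouping them by Deal_No.

import datetime as dt

def to_rfm(rows):
    """rows: list of (card_no, 'YYYY/M/D', amount) -> {card_no: (R, F, M)}."""
    by_customer = {}
    for card, date_str, amount in rows:
        y, m, d = (int(x) for x in date_str.split("/"))
        by_customer.setdefault(card, []).append((dt.date(y, m, d), amount))
    rfm = {}
    for card, visits in by_customer.items():
        dates = [v[0] for v in visits]
        r = (max(dates) - min(dates)).days      # recency measure used in the example above
        f = len(visits)                         # frequency: number of recorded transactions
        m = sum(v[1] for v in visits)           # monetary: total accumulated amount
        rfm[card] = (r, f, m)
    return rfm

# the three rows of Table 2 all belong to card 2815101071051 on the same day
print(to_rfm([("2815101071051", "2006/5/1", 49),
              ("2815101071051", "2006/5/1", 169),
              ("2815101071051", "2006/5/1", 68)]))    # {'2815101071051': (0, 3, 286)}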
Customer Grouping by SOM
Once the transaction data has been transformed into RFM mode, the next step is to classify customers into groups based upon similarities. At this step, Matlab 7.0 is used to build the SOM network. The input layer contains three neurons (nodes), corresponding to R, F and M. All 350 transactions are treated as training sample points. In order to find the best fit of the SOM in terms of the type of topology and the distance function, this research performed 9 experiments using 3 types of topology structure and 3 different distance functions. To evaluate the results of each experiment, this study used the average distance (AD) as the performance index. The performance index reflects the level of density, i.e., the distance of all sample points within a group, and it indicates the proximity of a group to its centroid. Thus, the smaller the distance the better, and the best network parameters turned out to be a two-dimensional (2x2) triangular SOM. According to this setting, the network generates four groups, and each customer group is named according to its RFM and M/F values. They are explained as follows:
First Customer Group: The characteristics of this customer group include high R, low F and low M, showing that this group had not visited the store for quite some time. Because the M value in past visits was low, these customers appear to be of low value to the store; however, their M/F average expense value was not the smallest. As a result, this segment is designated the "low value customer" group.
Second Customer Group: The characteristics of this customer group are middle R, middle F and middle M, showing that this group did not visit the store for a period of time, but both its expenditure and its frequency are the second highest among the four groups. Moreover, its M/F value is ranked the highest of the four groups. This indicates that even though the period between two subsequent visits was quite long, the buying power was high. This segment is the "potential customer" group.
Third Customer Group: The characteristics of this customer group are low R, middle F and low M, showing that even though the customers had come to the store recently, the number of subsequent visits was low; the M value is low, and the M/F average expense was the lowest among the four groups. Hence, this segment can be considered the "developing customer" group.
Fourth Customer Group: The characteristics of this customer group are low R, high F and high M, showing that these customers frequently visited the store, and the M value accounts for a great portion of the store's earnings. This segment has been named the "high value customer" group.
3.2 Path-Type Rules Extraction
3.2 Path-Type Rules Extraction

Based on the path selection probability of Ant Theory, this research explores the linkage rules of shopping paths for each customer group. The section layout of the studied store and its corresponding items are shown in Fig. 1 and Table 4. The following parameter setting is assumed: k=10, α=1, β=4, ρ=0.5, and Q=100 [14], in which k is the number of customers and t is the stopping condition (= 5). A fixed initial station is assigned at the checkout counter: all customers who enter the store through the checkout counter also leave through it. The path selection probability function of Ant Theory is applied to each customer group to extract the possible path rules. During the extraction process, the pheromone left on the paths is continuously renewed until the paths with a high probability of selection are obtained. For example, one of the paths used by the high value customer group is as follows: starting from section (36) Checkout Counter => section (9) Instant Noodles => section (28) Roast/Soy Simmered => section (31) Dairy Products => section (33) Cheese Products => returning to section (36) Checkout Counter. Figure 1 shows such a path traveled by customers on the floor.
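The following sketch illustrates a generic ant-system transition rule and pheromone update consistent with the parameters listed above (α, β, ρ, Q). It is not the authors' exact procedure, and the visibility values η used with it would be hypothetical.

```python
# Generic ant-system path selection and pheromone update (illustration only).
import random

alpha, beta, rho, Q = 1, 4, 0.5, 100

def select_next(current, candidates, tau, eta):
    """Pick the next section with probability proportional to tau^alpha * eta^beta."""
    weights = [(tau[(current, j)] ** alpha) * (eta[(current, j)] ** beta)
               for j in candidates]
    total = sum(weights)
    r, acc = random.uniform(0, total), 0.0
    for j, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return j
    return candidates[-1]

def update_pheromone(tau, tours):
    """Evaporate, then deposit Q/length on every edge used by each tour."""
    for edge in tau:
        tau[edge] *= (1 - rho)
    for tour in tours:
        for a, b in zip(tour, tour[1:]):
            tau[(a, b)] += Q / len(tour)
```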
Table 4. Stock sections and displays

1 - Other Crisps             2 - Instant Coffee        3 - Local Tobacco       4 - Milk
5 - Small Sushi              6 - Noodles               7 - Bean curd           8 - Hotpot balls
9 - Leafy Vegetables         10 - Fish Fillet          11 - Cake               12 - Ice Cream (liters)
13 - Layer Cakes             14 - Roast                15 - Olive Oil          16 - Sorghum
17 - Southeast Asian Fruits  18 - Soy Sauce Simmered   19 - Sugared Pork       20 - Toast
21 - Eel                     22 - Instant Noodles      23 - Gluten             24 - Sports Drinks
25 - Wheat products          26 - Cream Milk           27 - Dry Dog Food       28 - Pure Soy Sauce
29 - Cold Food               30 - Toast Cheese         31 - Barbecue           32 - Peach and Plum types
33 - Dried Meat and Fish     34 - Pickles              35 - Dumplings
[Figure 1 (floor layout) appears here; its area labels include Sushi, Frozen Food, Sauce, Meat Products, Cake and Bread, Dried Grocery, Seafood I, Fruits, Vegetable, Delicious Food, Meats, Seafood II, Drugs, Canned Food/Seasoning/Grocery Section, Frozen Food Section, Dining, Imported Products, Drinks, Local Biscuits, Coffee/Milk Section, Dairy Products, Presents, Cigar Area, Tobacco and Liquor, Basement Escalator, Food Special Price Area, Fish Products, Cheese Products, Pet Area, Makeup, Snack, Instant Store/Grocery Area, Silver, and Main Entrance.]
Fig. 1. Floor layout with 35 sections
Overall, 13 path rules were extracted: 3 rules for the low value customer group, 3 for the potential customer group, 3 for the developing customer group, and 4 for the high value customer group. However, when the support and confidence measures are considered, the 9 rules outside the high value customer group are less meaningful.
These 4 rules can be seen as hot paths that customers select frequently, as shown in Table 5. Each hot path, along with its product items, can be identified as having the most sales potential or as being the most popular area for customers to look around; thus decision makers can utilize the useful information hidden in the transaction database to make appropriate and efficient responses. It is also worth noting that free-sample stands located on these 4 hot paths will have high sales potential, since the product items appearing on these hot paths have greater exposure to customers.

Table 5. Four possible path rules for high value customer group
Possible Paths                       Rules                                                                            Support   Confidence
(36)=>(9)=>(28)=>(31)=>(33)=>(36)    Door => Instant Noodles => Soy Sauce Simmered => Milk => Toast Cheese => Door    0.13513   0.31681
(36)=>(12)=>(22)=>(17)=>(27)=>(36)   Door => Other Crisps => Sugared Pork => Hotpot balls => Barbecue => Door         0.15067   0.28633
(36)=>(15)=>(21)=>(19)=>(36)         Door => Pure Fermented Soy Sauce => Cold Food => Southeast Asian Fruits => Door  0.21956   0.27761
(36)=>(9)=>(31)=>(17)=>(11)=>(36)    Door => Noodles => Cream Milk => Bean curd => Instant Coffee => Door             0.18612   0.39687

4 Conclusion

With consideration of the transactional data set and the store section layout, this research developed an integrated mining approach to discover path-type product association rules. The resulting path-type rules can be viewed as hot paths, each of which indicates a path with a high frequency of customer movement. Store management can utilize the knowledge of path-type rules to make decisions on the adjustment of product mix, location, promotion, and advertising; this will not only increase the effectiveness of product arrangement but also improve the efficiency of customer flow management. However, in order to obtain more detailed analyses of certain aspects, future research should consider the following:
1. It is essential for future studies to take the quantity of customer flow into account. Also, critical information about the product display sections needs to be studied in greater depth.
2. It is desirable to apply the method developed in this research to compare past and current conditions, to determine whether the improvement has reached a satisfactory level.
3. It is also worth investigating the product association rules in terms of transitivity, that is, the possibility that a customer will purchase product B after he/she has purchased product A, or, further, the chance that a customer will purchase product C after he/she has purchased products A and B.
Acknowledgements

This research was supported in part by the National Science Council, Taiwan, Republic of China, under Grant No. NSC-94-2622-E-214-007-CC3.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216 (1993)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 478–499 (1994)
3. Berry, M.J.A., Linoff, G.: Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley and Sons, Inc., New York (1997)
4. Berzal, F., Cubero, J., Marin, N., Serrano, J.: TBAR: An efficient method for association rule mining in relational databases. Data and Knowledge Engineering 37, 47–64 (2001)
5. Bult, J.R., Scheer, H., Wansbeek, T.: Interaction between target and mailing characteristics in direct marketing, with an application to health care fund raising. International Journal of Research in Marketing 14, 301–308 (1997)
6. Dorigo, M., Maniezzo, V., Colorni, A.: The Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 1–13 (1996)
7. Dorigo, M., Di Caro, G., Stutzle, T.: Ant algorithms. Future Generation Computer Systems 16(8), v–vii (2000)
8. Hagan, M.T., Demuth, H.B., Beale, M.: Neural Network Design. Thomson Learning (1996)
9. Hui, S.C., Jha, G.: Data mining for customer service support. Information and Management 38, 1–13 (2000)
10. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1997)
11. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash based algorithm for mining association rules. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175–186 (1995)
12. Srikant, R., Agrawal, R.: Mining generalized association rules. In: Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407–419 (1995)
13. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson International Edition (2006)
14. Wei, N.-C., Liu, H.-T., Wu, Y.: A Research on Data Mining Techniques Based on Ant Theory for Path-Type Association Rules. In: Perner, P. (ed.) Poster Proceedings of the Industrial Conference on Data Mining (ICDM), IBaI CE-Report, ISSN 1617-2671, pp. 79–90, Leipzig, Germany, July 14–15 (2006)
A Framework of Rough Clustering for Web Transactions

Iwan Tri Riyadi Yanto 1,2, Tutut Herawan 1,2, and Mustafa Mat Deris 1

1 FTMM, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
  [email protected]
2 Universitas Ahmad Dahlan, Yogyakarta, Indonesia
  [email protected], [email protected]
Abstract. Grouping web transactions into clusters is important in order to obtain a better understanding of users' behavior. Currently, the rough approximation-based clustering technique has been used to group web transactions into clusters. However, processing time is still an issue, due to the high complexity of finding the similarity of upper approximations of transactions, which is used to merge two or more clusters. Furthermore, the problem of more than one transaction falling under a given threshold is not addressed. In this paper, we propose an alternative technique for grouping web transactions using rough set theory. It is based on similarity classes with a non-void intersection. Keywords: Clustering, Web transactions, Rough set theory.
1 Introduction

Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse-clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as methods to extract so-called "nuggets" (or knowledge) from a web data repository, such as content, linkage and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e., usage data, can be utilized to capture users' navigation patterns and identify user intended tasks. Once the user navigational behaviors are effectively characterized, they will provide benefits for further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining. Access transactions over the web can be expressed through two finite sets, user transactions and hyperlinks/URLs [11]. A user transaction is a sequence of items; the set U = {t1, t2, ..., tm} is formed by m users, and the set A = {hl1, hl2, ..., hln} is the set of n distinct clicks (hyperlinks/URLs) clicked by users, where every ti ∈ T ⊆ U is a non-empty subset of A. The temporal order of users' clicks
within transactions is taken into account. A user transaction t ∈ T is represented as a vector t = ⟨ut1, ut2, ..., utn⟩, where uti = 1 if hli ∈ t and uti = 0 otherwise. A well-known approach for clustering web transactions uses rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of upper approximations of transactions under a given threshold. However, several iterations must be performed to merge two or more clusters that have the same similarity upper approximation, and the technique does not describe how to handle the case in which more than one transaction falls under the given threshold. To overcome these problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11], but the proposed technique differs in how transactions are allocated to the same cluster and in how the case of more than one transaction under the given threshold is handled. The rest of the paper is organized as follows. Section 2 describes the concepts of rough set theory. Section 3 describes the work of [11]. Section 4 describes the proposed technique. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.
2 Rough Set Theory

An information system is a 4-tuple (quadruple) S = (U, A, V, f), where U = {u1, u2, u3, ..., u|U|} is a non-empty finite set of objects, A = {a1, a2, a3, ..., a|A|} is a non-empty finite set of attributes, V = ∪_{a∈A} Va, where Va is the domain (value set) of attribute a, and f : U × A → V is an information (knowledge) function such that f(u, a) ∈ Va for every (u, a) ∈ U × A. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about the objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same features. Definition 1. Two elements x, y ∈ U are said to be B-indiscernible (indiscernible by the set of attributes B ⊆ A in S) if and only if f(x, a) = f(y, a) for every a ∈ B. Obviously, every subset of A induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes B, denoted by IND(B), is an equivalence relation. The partition of U induced by IND(B) is denoted by U/B, and the equivalence class in the partition U/B containing x ∈ U is denoted by [x]_B. The notions of lower and upper approximations of a set are defined as follows. Definition 2. (See [14].) The B-lower approximation of X, denoted by $\underline{B}(X)$, and the B-upper approximation of X, denoted by $\overline{B}(X)$, are defined respectively by $\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}$ and $\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}$.
The accuracy of approximation (accuracy of roughness) of any subset X ⊆ U with respect to B ⊆ A, denoted α_B(X), is measured by

$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|}$,

where |X| denotes the cardinality of X. For the empty set φ, we define α_B(φ) = 1. Obviously, 0 ≤ α_B(X) ≤ 1. If X is a union of some equivalence classes of U, then α_B(X) = 1; thus the set X is crisp (precise) with respect to B. If X is not a union of some equivalence classes of U, then α_B(X) < 1; thus the set X is rough (imprecise) with respect to B [13]. This means that the higher the accuracy of approximation of a subset X ⊆ U, the more precise (the less imprecise) it is.
3 Related Work

In this section, we discuss the technique proposed by [11]. Given two transactions t and s, the similarity between t and s is measured by sim(t, s) = |t ∩ s| / |t ∪ s|. Obviously, sim(t, s) ∈ [0, 1], where sim(t, s) = 1 when the two transactions t and s are exactly identical and sim(t, s) = 0 when t and s have no items in common. De and Krishna [11] used a binary relation R defined on T as follows: for any threshold value th ∈ [0, 1] and any two user transactions t, s ∈ T, tRs iff sim(t, s) ≥ th. This relation R is a tolerance relation, as R is both reflexive and symmetric, but transitivity may not always hold. Definition 3. The similarity class of t, denoted by R(t), is the set of transactions that are similar to t, given by R(t) = {s ∈ T : sRt}. For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on experience to get a proper similarity class. It is clear that, for a fixed threshold th ∈ [0, 1], a transaction from a given similarity class may be similar to an object of another similarity class. Definition 4. Let P ⊆ T and let a binary tolerance relation R be defined on T for a fixed threshold th ∈ [0, 1]. The lower approximation of P, denoted by $\underline{R}(P)$, and the upper approximation of P, denoted by $\overline{R}(P)$, are respectively defined as $\underline{R}(P) = \{t \in P : R(t) \subseteq P\}$ and $\overline{R}(P) = \bigcup_{t \in P} R(t)$. They proposed a technique for clustering the clicks of user navigations, called the similarity upper approximation and denoted by Si. The set of transactions that are possibly similar to R(ti) is denoted by RR(ti). This process continues until two consecutive upper approximations for ti, i = 1, 2, 3, ..., |U|, are the same, and two or more clusters that have the same similarity upper approximation are merged at each iteration. With this technique, we need
high computational complexity to cluster the transactions, because the similarity upper approximations must be recomputed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions; a small sketch of the iterative procedure of [11] is given below for reference.
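The sketch below is a minimal Python illustration of the similarity-upper-approximation clustering of [11], not the authors' implementation; transactions are modelled as plain sets of clicked hyperlinks, and R(t) is expanded repeatedly (R, RR, RRR, ...) until two consecutive upper approximations coincide.

```python
# Iterative similarity upper approximation of [11] (illustration only).
def sim(t, s):
    return len(t & s) / len(t | s)

def similarity_class(i, transactions, th):
    return {j for j, s in transactions.items() if sim(transactions[i], s) >= th}

def similarity_upper_approximation(i, transactions, th):
    current = similarity_class(i, transactions, th)
    while True:
        expanded = set()
        for j in current:                 # RR(t) = union of R(s) for s in R(t)
            expanded |= similarity_class(j, transactions, th)
        if expanded == current:
            return current
        current = expanded
```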
4 The Proposed Technique

The proposed technique for clustering the transactions is based on all transactions that are possibly similar to the similarity class R(t) of t. The union of two similarity classes with a non-void intersection forms the same cluster. The justification that a cluster is the union of two similarity classes with a non-void intersection is presented in Proposition 6.

Definition 5. Two clusters Si and Sj, i ≠ j, are said to be the same if Si = Sj = ∪ R(ti), i = 1, 2, 3, ..., |U|.

Proposition 6. Let Si be a cluster. If ∩ R(ti) ≠ φ, then ∪ R(ti) = Si.

Proof. Suppose, to the contrary, that Si and Sj, where i ≠ j, are the same clusters while ∪ R(ti) ≠ Si. From Definition 5 we would have ∪ R(ti) ≠ Si = Sj = ∪ R(tj), so that (∪ R(ti)) ∩ (∪ R(tj)) = φ and hence ∩ R(ti) = φ, which contradicts the hypothesis.
4.1 Complexity
Suppose that in an information system S = (U, A, V, f) there are |U| objects; this means there are at most |U| similarity classes. The computation of the similarity classes R(ti) against R(tj), where i ≠ j, requires |U| × (|U| − 1) comparisons. Thus, the overall computational complexity of the proposed technique is polynomial, of order |U| × (|U| − 1).
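A minimal Python sketch of the proposed clustering (an illustration of Definition 5 and Proposition 6, not the MATLAB implementation used later in Section 5) is as follows: build the similarity classes for a threshold, then repeatedly merge classes that have a non-void intersection.

```python
# Proposed clustering: union of similarity classes with non-void intersection.
def sim(t, s):
    return len(t & s) / len(t | s)

def similarity_classes(transactions, th):
    return {i: {j for j, s in transactions.items() if sim(t, s) >= th}
            for i, t in transactions.items()}

def clusters_by_intersection(classes):
    clusters = [set(c) for c in classes.values()]
    merged = True
    while merged:                         # union classes that overlap
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Example 1 of Section 4.2 (Table 1), threshold 0.5 -> [{1}, {2, 3, 4}]
T = {1: {"hl1", "hl2"}, 2: {"hl2", "hl3", "hl4"},
     3: {"hl1", "hl3", "hl5"}, 4: {"hl2", "hl3", "hl5"}}
print(clusters_by_intersection(similarity_classes(T, 0.5)))
```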
4.2 Example
In this study, the comparison between the proposed technique and the technique of [11] is presented through two examples, in which two small data sets of transactions are considered. a. The first transaction data set is adopted from [11] and given in Table 1; it contains four objects (|U| = 4) with five hyperlinks (|A| = 5). The technique of [11] needs three main steps. The first is obtaining the measure of similarity, which gives information about the users' access patterns related to their common areas of interest through the similarity relation between two
Table 1. Data transactions

U/A  hl1  hl2  hl3  hl4  hl5
t1   1    1    0    0    0
t2   0    1    1    1    0
t3   1    0    1    0    1
t4   0    1    1    0    1

sim(t1, t2) = |t1 ∩ t2| / |t1 ∪ t2| = |{hl2}| / |{hl1, hl2, hl3, hl4}| = 0.25
sim(t1, t3) = |t1 ∩ t3| / |t1 ∪ t3| = |{hl1}| / |{hl1, hl2, hl3, hl5}| = 0.25
sim(t1, t4) = |t1 ∩ t4| / |t1 ∪ t4| = |{hl2}| / |{hl1, hl2, hl3, hl5}| = 0.25

Fig. 1. The similarity of t1 with the other transactions

R(t1) = {t1}, R(t2) = {t2, t4}, R(t3) = {t3, t4}, R(t4) = {t2, t3, t4}
RR(t1) = {t1}, RR(t2) = {t2, t3, t4}, RR(t3) = {t2, t3, t4}, RR(t4) = {t2, t3, t4}, and
RRR(t2) = {t2, t3, t4}, RRR(t3) = {t2, t3, t4}

Fig. 2. The similarity upper approximation process
R(t1) ∩ R(t2) = {t1} ∩ {t2, t4} = φ,  R(t1) ∩ R(t3) = {t1} ∩ {t3, t4} = φ,
R(t1) ∩ R(t4) = {t1} ∩ {t2, t3, t4} = φ,  R(t2) ∩ R(t3) = {t2, t4} ∩ {t3, t4} = {t4},
R(t2) ∩ R(t4) = {t2, t4} ∩ {t2, t3, t4} = {t2, t4},  R(t3) ∩ R(t4) = {t3, t4} ∩ {t2, t3, t4} = {t3, t4}

Fig. 3. The similarity relation
transactions (objects). The three calculations of the similarity of t1 with the other transactions from Table 1 are shown in Figure 1. Second, the similarity classes can be obtained for a given threshold value using Definition 3. The last step is to cluster the transactions based on the similarity upper approximations. As in Figure 1, similar calculations are performed for all the transactions: sim(t1, t2) = 0.25, sim(t1, t3) = 0.25, sim(t1, t4) = 0.25, sim(t2, t3) = 0.2, and sim(t2, t4) = 0.5. Given a threshold value of 0.5, the similarity classes are R(t1) = {t1}, R(t2) = {t2, t4}, R(t3) = {t3, t4}, R(t4) = {t2, t3, t4}. To get the clusters, [11] uses the similarity upper approximations; the process is shown in Figure 2. Here, we can see that two consecutive upper approximations for {t1}, {t2}, {t3} and {t4} are the same. Thus, we get the similarity upper approximations S1 = {t1}, S2 = {t2, t3, t4}, S3 = {t2, t3, t4}, S4 = {t2, t3, t4}, where S2 = S3 = S4 and S1 ≠ Si for i = 2, 3, 4. Finally, we get the two clusters {t1} and {t2, t3, t4}. The proposed technique, in contrast, is based on non-void intersections. According to Definition 5 and the similarity classes as used
Table 2. Data transactions

U/A  hl1  hl2  hl3  hl4  hl5  hl6
t1   1    0    1    1    0    0
t2   0    0    1    1    1    1
t3   0    1    0    0    1    1
t4   1    0    0    0    0    0
t5   0    0    0    0    1    0
t6   0    1    1    0    0    0
t7   0    0    1    1    0    0
t8   0    0    1    1    0    1
t9   0    0    1    0    0    0
t10  0    1    0    1    0    0
t11  0    0    1    1    1    0
Table 3. The similarity for all transactions

T/T  t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
t1   0.40  0     0.33  0     0.25  0.67  0.50  0.33  0.25  0.50
t2         0.40  0     0.25  0.20  0.50  0.75  0.25  0.20  0.75
t3               0.25  0.33  0.25  0     0.20  0     0.25  0.20
t4                     0     0     0     0     0     0     0
t5                           0     0     0     0     0     0.33
t6                                 0.33  0.25  0.50  0.33  0.02
t7                                       0.67  0.50  0.33  0.67
t8                                             0.33  0.25  0.50
t9                                                   0     0.33
t10                                                        0.25
in [11], only a few computations are needed to obtain the clusters. Therefore, the proposed technique clusters the transactions more efficiently than that of [11]. The calculation of the similarity relation is shown in Figure 3. Here, we can see that R(ti) ∩ R(tj) ≠ φ, i ≠ j, for i, j = 2, 3, 4. We get the clusters S1 = R(t1) = {t1} and S2 = S3 = S4 = ∪ R(ti) = {t2, t4} ∪ {t3, t4} ∪ {t2, t3, t4} = {t2, t3, t4}, i = 2, 3, 4. Hence, the two clusters are {t1} and {t2, t3, t4}. b. The second transaction data set is given in Table 2, containing eleven objects (|U| = 11) with six hyperlinks (|A| = 6). The similarities for the transactions are shown in Table 3. Given the threshold value 0.5, the similarity classes are shown in Figure 4. The process of finding the similarity upper approximation of each transaction is shown in Figure 5. In Figure 5, two consecutive upper approximations for {ti}, i = 1, 2, ..., 11, are the same. Therefore, we get the similarity upper approximations for {ti}, i = 1, 2, ..., 11, as S1 = {t1, t2, t6, t7, t8, t9, t11}, S2 = {t1, t2, t6, t7, t8, t9, t11}, S3 = {t3}, S4 = {t4}, S5 = {t5}, S6 = {t1, t2, t6, t7, t8, t9, t11}, S7 = {t1, t2, t6, t7, t8, t9, t11}, S8 = {t1, t2, t6, t7, t8, t9, t11}, S9 = {t1, t2, t6, t7, t8, t9, t11}, S10 = {t10},
R(t1) = {t1, t7, t8, t11}      R(t7) = {t1, t2, t7, t8, t9, t11}
R(t2) = {t2, t7, t8, t11}      R(t8) = {t1, t2, t7, t8, t11}
R(t3) = {t3}                   R(t9) = {t6, t7, t9}
R(t4) = {t4}                   R(t10) = {t10}
R(t5) = {t5}                   R(t11) = {t1, t2, t7, t8, t11}
R(t6) = {t6, t9}

Fig. 4. The similarity classes
R(t1) = {t1, t7, t8, t11}, R(t2) = {t2, t7, t8, t11}, R(t3) = {t3}, R(t4) = {t4}, R(t5) = {t5},
R(t6) = {t6, t9}, R(t7) = {t1, t2, t7, t8, t9, t11}, R(t8) = {t1, t2, t7, t8, t11},
R(t9) = {t6, t7, t9}, R(t10) = {t10}, R(t11) = {t1, t2, t7, t8, t11}

RR(t1) = {t1, t2, t7, t8, t9, t11}, RR(t2) = {t1, t2, t7, t8, t9, t11}, RR(t3) = {t3},
RR(t4) = {t4}, RR(t5) = {t5}, RR(t6) = {t1, t2, t6, t7, t8, t9, t11},
RR(t7) = {t1, t2, t6, t7, t8, t9, t11}, RR(t8) = {t1, t2, t7, t8, t9, t11},
RR(t9) = {t1, t2, t6, t7, t8, t9, t11}, RR(t10) = {t10}, RR(t11) = {t1, t2, t7, t8, t9, t11}

RRR(t1) = {t1, t2, t6, t7, t8, t9, t11}, RRR(t2) = {t1, t2, t6, t7, t8, t9, t11}, RRR(t3) = {t3},
RRR(t4) = {t4}, RRR(t5) = {t5}, RRR(t6) = {t1, t2, t6, t7, t8, t9, t11},
RRR(t7) = {t1, t2, t6, t7, t8, t9, t11}, RRR(t8) = {t1, t2, t6, t7, t8, t9, t11},
RRR(t9) = {t1, t2, t6, t7, t8, t9, t11}, RRR(t10) = {t10}, RRR(t11) = {t1, t2, t6, t7, t8, t9, t11}

RRRR(t1) = {t1, t2, t6, t7, t8, t9, t11}, RRRR(t2) = {t1, t2, t6, t7, t8, t9, t11}, RRRR(t3) = {t3},
RRRR(t4) = {t4}, RRRR(t5) = {t5}, RRRR(t6) = {t1, t2, t6, t7, t8, t9, t11},
RRRR(t7) = {t1, t2, t6, t7, t8, t9, t11}, RRRR(t8) = {t1, t2, t6, t7, t8, t9, t11},
RRRR(t9) = {t1, t2, t6, t7, t8, t9, t11}, RRRR(t10) = {t10}, RRRR(t11) = {t1, t2, t6, t7, t8, t9, t11}

Fig. 5. The similarity upper approximations

Table 4. The intersection of similarity classes
T/T  t1          t2          t3  t4  t5  t6     t7            t8            t9   t10
t2   {7,8,11}
t3   −           −
t4   −           −           −
t5   −           −           −   −
t6   −           −           −   −   −
t7   {1,7,8,11}  {2,7,8,11}  −   −   −   {9}
t8   {1,7,8,11}  {2,7,8,11}  −   −   −   −      {1,2,7,8,11}
t9   {7}         {7}         −   −   −   {6,9}  {7,9}         {7}
t10  −           −           −   −   −   −      −             −             −
t11  {1,7,8,11}  {2,7,8,11}  −   −   −   −      {1,2,7,8,11}  {1,2,7,8,11}  {7}  −
Table 5. The similarity for the remainder transactions

      t3    t4   t5
t4    0.25
t5    0.33  0
t10   0.25  0    0
S11 = {t1, t2, t6, t7, t8, t9, t11}. Since Si = Sj ≠ Sk, where i, j = 1, 2, 6, 7, 8, 9, 11 and k = 3, 4, 5, 10, according to [11] there are five clusters: {t3}, {t4}, {t5}, {t10} and {t1, t2, t6, t7, t8, t9, t11}. For the proposed method, the intersections of the similarity classes are summarized in Table 4. From Table 4, notice that R(ti) ∩ R(tj) ≠ φ for i ≠ j; i, j = 1, 2, 6, 7, 8, 9, 11, and R(tk) ∩ R(tl) = φ for k ≠ l; k = 1, 2, ..., 11; l = 3, 4, 5, 10. We get the clusters S1 = {t1, t2, t6, t7, t8, t9, t11}, S2 = {t1, t2, t6, t7, t8, t9, t11}, S3 = {t3}, S4 = {t4}, S5 = {t5}, S6 = {t1, t2, t6, t7, t8, t9, t11}, S7 = {t1, t2, t6, t7, t8, t9, t11}, S8 = {t1, t2, t6, t7, t8, t9, t11}, S9 = {t1, t2, t6, t7, t8, t9, t11}, S10 = {t10}, S11 = {t1, t2, t6, t7, t8, t9, t11}. The five clusters are {t1, t2, t6, t7, t8, t9, t11}, {t3}, {t4}, {t5} and {t10}. These are the same clusters as those obtained in [11], but the number of iterations is lower than for the technique of [11]. With the given threshold value, {t3}, {t4}, {t5} and {t10} remain segregated clusters, yet from the transaction data there may be related transactions among them. To address this, we propose handling the problem by giving a second threshold value.
Table 6. The intersection of the similarity classes for the remainder transactions

        R(t3)    R(t4)  R(t5)
R(t4)   −
R(t5)   {3, 5}   −
R(t10)  −        −      −

[Fig. 6. Visualization of example 2: the left panel shows the clusters obtained with the given threshold 0.5 and the right panel the clusters after giving the second threshold 0.3 (transactions vs. clusters).]
Therefore, we take {t1, t2, t6, t7, t8, t9, t11} as the first cluster under the first threshold value, and for the remainder {t3}, {t4}, {t5}, {t10} we give a second threshold value and group the similarities of the remainder transactions. Given a second threshold value of 0.3, the similarity classes are R(t3) = {t3, t5}, R(t4) = {t4}, R(t5) = {t3, t5}, R(t10) = {t10}. The intersections of the similarity classes are summarized in Table 6. Based on Table 6, we see that R(t3) ∩ R(t5) ≠ φ and R(ti) ∩ R(tj) = φ for i = 3, 4, 5, 10 and j = 4, 10, i ≠ j. We get the clusters S3 = {t3, t5}, S4 = {t4}, S5 = {t3, t5}, S10 = {t10}. Hence, the three clusters are {t3, t5}, {t4} and {t10}. Overall, for both threshold values given, we have four clusters: {t1, t2, t6, t7, t8, t9, t11}, {t3, t5}, {t4}, and {t10}.
5 Experiment Test

In order to test the proposed technique and compare it with the technique of [11], we use a data set from http://kdd.ics.uci.edu/databases/msnbc/msnbc.html. The data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and are recorded chronologically. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period, and each item in a row corresponds to a user's request for a page. Client-side cached data is not recorded, so this data contains only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The proposed technique for clustering web transactions is implemented in MATLAB version 7.6.0.324 (R2008a). The experiments are executed sequentially on an Intel Core 2 Duo processor with 1 Gigabyte of main memory under Windows XP Professional SP3.
5.1 Purity of the Clusters

The purity of the clusters was used as a measure of cluster quality. The purity of a cluster and the overall purity are defined as

$\mathrm{Purity}(i) = \frac{\text{the number of data occurring in the } i\text{th cluster under the given threshold}}{\text{the number of data in the data set}}$,

$\mathrm{Overall\ Purity} = \frac{\sum_{i=1}^{\#\,\text{clusters}} \mathrm{Purity}(i)}{\#\,\text{clusters}}$.
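A small sketch of the purity computation is given below. Since the extracted text does not spell out the reference grouping against which each cluster is compared, the dominant-category interpretation used here is an assumption.

```python
# Hedged sketch of the purity measure (dominant-category interpretation assumed).
from collections import Counter

def overall_purity(clusters, labels):
    """clusters: list of lists of item ids; labels: dict mapping item id -> category."""
    n = len(labels)
    purities = []
    for cluster in clusters:
        dominant = Counter(labels[i] for i in cluster).most_common(1)[0][1]
        purities.append(dominant / n)
    return sum(purities) / len(purities)

labels = {1: "a", 2: "b", 3: "b", 4: "b"}
print(overall_purity([[1], [2, 3, 4]], labels))   # 0.5 for this toy example
```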
5.2 Response Time
The comparisons of executing time are captured in Figure 7.
Table 7. The purity of clusters

Number of      The proposed  The technique  Improvement
Transactions   technique     of [11]
100            100%          93.0%          7.0%
200            100%          96.0%          4.0%
500            100%          95.5%          0.5%
1000           100%          95.5%          0.5%
2000           100%          99.9%          0.1%
Average                                     2.5%

Table 8. The executing time (seconds)

Number of      The proposed  The technique  Improvement
Transactions   technique     of [11]
100            1.6969        6.250          68.79%
200            9.093         6.250          66.25%
500            77.266        163.760        48.35%
1000           554.426       2205.100       65.10%
2000           3043.500      9780.900       64.97%
Average                                     62.69%
[Fig. 7. The executing time: response time in seconds of the proposed technique versus the technique of [11] for 100, 200, 500, 1000 and 2000 transactions.]
[Fig. 8. Visualization of 100 transactions: clusters at the 1st threshold 0.6 (left) and after giving the 2nd threshold 0.3 (right); axes are transactions vs. clusters.]
[Fig. 9. Visualization of 200 transactions: clusters at the 1st threshold 0.6 (left) and after giving the 2nd threshold 0.3 (right).]
[Fig. 10. Visualization of 500 transactions: clusters at the 1st threshold 0.6 (left) and after giving the 2nd threshold 0.3 (right).]
nd
threshold 0.6
after given 2 1000
900
900
800
800
700
700
600
600
Transactions
Transactions
1 1000
500 400
500 400
300
300
200
200
100
threshold 0.3
100
5
10
15
20 25 Clusters
30
35
40
45
5
10
15
20 Clusters
Fig. 11. Visualization of 1000 transactions
25
30
[Fig. 12. Visualization of 2000 transactions: clusters at the 1st threshold 0.6 (left) and after giving the 2nd threshold 0.3 (right).]
6 Conclusion

A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed an alternative technique for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the proposed technique was presented in terms of processing time and cluster purity. We evaluated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the proposed technique requires a significantly lower response time, up to 62.69% lower, compared with the technique of [11]. Meanwhile, for cluster purity it performs better by up to 2.5%.
Acknowledgement

This work was supported by the FRGS under Grant No. Vote 0402, Ministry of Higher Education, Malaysia.
References
1. Pal, S.K., Talwar, V., Mitra, P.: Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions. IEEE Transactions on Neural Networks 13(5), 1163–1177 (2002)
2. Buchner, A.G., Mulvenna, M.D.: Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining. SIGMOD Record 27(4), 54–61 (1998)
3. Cohen, E., Krishnamurthy, B., Rexford, J.: Improving end-to-end performance of the web using server volumes and proxy filters. In: Proceedings of ACM SIGCOMM 1998. ACM Press, Vancouver (1998)
4. Joachims, T., Freitag, D., Mitchell, T.: WebWatcher: A tour guide for the world wide web. In: The 15th International Joint Conference on Artificial Intelligence (IJCAI 1997), Nagoya, Japan (1997)
5. Lieberman, H.: Letizia: An agent that assists web browsing. In: Proceedings of the 1995 International Joint Conference on Artificial Intelligence. Morgan Kaufmann, Montreal (1995)
6. Mobasher, B., Cooley, R., Srivastava, J.: Creating adaptive web sites through usage based clustering of URLs. In: Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society, Los Alamitos (1999)
7. Ngu, D.S.W., Wu, X.: SiteHelper: A localized agent that helps incremental exploration of the world wide web. In: Proceedings of the 6th International World Wide Web Conference. ACM Press, Santa Clara (1997)
8. Perkowitz, M., Etzioni, O.: Adaptive Web Sites: Automatically Synthesizing Web Pages. In: Proceedings of the 15th National Conference on Artificial Intelligence. AAAI, Madison (1998)
9. Yanchun, Z., Guandong, X., Xiaofang, Z.: A Latent Usage Approach for Clustering Web Transaction and Building User Profile, pp. 31–42. Springer, Heidelberg (2005)
10. Han, E., et al.: Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results. IEEE Data Engineering Bulletin 21(1), 15–22 (1998)
11. De, S.K., Krishna, P.R.: Clustering web transactions using rough approximation. Fuzzy Sets and Systems 148, 131–138 (2004)
12. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
13. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
14. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Information Sciences 177(1), 3–27 (2007)
Fuzzy Bayesian Belief Network for Analyzing Medical Track Record

Rolly Intan and Oviliani Yenty Yuliana

Informatics Engineering Department, Petra Christian University, Surabaya, Indonesia
{rintan,ovi}@peter.petra.ac.id
Abstract. Bayesian Belief Network (BBN), one of the data mining classification methods, is used in this research for mining and analyzing medical track records from a relational data table. In this paper, the BBN concept is extended with meaningful fuzzy labels for mining fuzzy association rules. Meaningful fuzzy labels can be defined for each data domain; for example, the fuzzy labels secondary disease and complication disease are defined for disease classification. We extend the concept of Mutual Information to deal with fuzzy labels for determining the relation between two fuzzy nodes. The highest fuzzy information gain is used for mining associations among nodes. A brief algorithm is introduced to implement the proposed concept. Experimental results of the algorithm show the processing time in relation to the number of records and the number of nodes. The designed application makes a significant contribution to assisting decision makers in analyzing and anticipating disease epidemics in a certain area. Keywords: Bayesian Belief Network, Classification Data, Data Mining, Fuzzy Association Rules.
1 Introduction

Bayesian Belief Network (BBN) is a powerful knowledge representation and reasoning tool under conditions of uncertainty. A Bayesian network is a Directed Acyclic Graph (DAG) with a probability table for each node. The nodes in a Bayesian network represent propositional variables in a domain, and the arcs between nodes represent the dependency relationships among the variables. There have been several BBN-related studies. Integrating fuzzy theory into Bayesian networks by introducing conditional Gaussian models to obtain a fuzzy procedure was conducted by [1]. Integrating fuzzy logic into Bayesian networks was also proposed by [2, 3]. Learning Bayesian network structures using an information theoretic approach was proposed in [5]. The last work is very closely related to our proposed concept. In this paper, we extend the concept of Mutual Information (MI) to deal with meaningful fuzzy values in constructing a Fuzzy Bayesian Belief Network (FBBN). The result of MI is used to determine whether there is a relation between two fuzzy nodes. The direction of the arc between two nodes depends on
a comparison between the asymmetric results of their conditional probabilities. The conditional probability table can be provided during the process of generating the FBBN. Fuzzy association rules are obtained directly from the network, in which the weight of each relationship between two nodes can be considered a confidence factor of the rule. A brief algorithm is given to implement the proposed concept. The experimental results show the processing time in relation to the number of records and the number of nodes. The paper is organized as follows. Section 2, the main contribution of this paper, discusses our proposed concept and algorithm for generating the FBBN. Section 3 demonstrates the concept and algorithm with an illustrative example; experimental results on processing time are also provided in this section. Finally, a conclusion is given in Section 4.
2 Fuzzy Bayesian Belief Network (FBBN)

A Bayesian Belief Network specifies joint conditional probability distributions. It allows class conditional independencies to be defined between subsets of domains, and it provides a graphical model of causal relationships on which learning can be performed. A BBN is defined by two components, i.e., a Directed Acyclic Graph (DAG) and a Conditional Probability Table (CPT) [6]. In this section, the concept of FBBN is proposed and generated from a relational data table. Every node in the FBBN is considered a fuzzy set over a given domain in the relation. Formally, let a relation schema [7] R consist of a set of tuples, where ti represents the i-th tuple; if there are n domain attributes D, then ti = ⟨di1, di2, ..., din⟩. Here, dij is an atomic value of tuple ti restricted to the domain Dj, where dij ∈ Dj. A relation schema R is defined as a subset of the cross product D1 × D2 × ... × Dn, where D = {D1, D2, ..., Dn}. A tuple t (with respect to R) is an element of R. In general, R can be shown as in Table 1.

Table 1. A schema of relational data table
Tuples  D1   D2   ...  Dn
t1      d11  d12  ...  d1n
t2      d21  d22  ...  d2n
...     ...  ...  ...  ...
ts      ds1  ds2  ...  dsn
Now, we consider A and B as two fuzzy subsets over Dj and Di, defined by A : Dj → [0, 1] and B : Di → [0, 1], so that A ∈ Γ(Dj) and B ∈ Γ(Di), where Γ(Dj) and Γ(Di) are the fuzzy power sets over the domains Dj and Di, respectively. As defined in [4, 8], some basic operations of fuzzy sets are given by:
Complement:   (~A)(x) = 1 − A(x) for all x ∈ Dj
Intersection: (A ∩ B)(x) = min(A(x), B(x)) for all x ∈ Dj
Union:        (A ∪ B)(x) = max(A(x), B(x)) for all x ∈ Dj

In probability theory and information theory, the MI of two random variables is a quantity that measures the mutual dependence of the two variables. The MI value between two fuzzy sets A and B can be defined by the function in (1):

$MI(A, B) = P(A, B)\,\log_2\!\left(\frac{P(A, B)}{P(A) \times P(B)}\right)$    (1)
where P(A) ≠ 0 and P(B) ≠ 0. Here, P(A, B) is the probability of fuzzy sets A and B, or of the intersection between A and B. Therefore, P(A, B) can also be denoted by P(A ∩ B), as given by the following definition:

$P(A, B) = P(A \cap B) = \frac{\sum_{k=1}^{|R|} \min\left(A(d_{kj}), B(d_{ki})\right)}{|R|}$    (2)
where A(dkj), B(dki) ∈ [0, 1] are the membership degrees of dkj and dki in the fuzzy sets A and B, respectively, and |R| is the number of tuples/records in the relation R. P(A) and P(B) are defined as the probabilities of the fuzzy sets A and B, respectively, as follows:

$P(A) = \frac{\sum_{k=1}^{|R|} A(d_{kj})}{|R|} \quad \text{and} \quad P(B) = \frac{\sum_{k=1}^{|R|} B(d_{ki})}{|R|}$    (3)
It can be verified from (1) that MI(A, B) is a symmetric function (MI(A, B) = MI(B, A)). The relationship between A and B strongly depends on the quantity

$\frac{P(A, B)}{P(A) \times P(B)}$.

This quantity represents a correlation measure, one of the important measures in determining the interestingness of an association rule. There are three possible outcomes of the correlation measure, namely positive correlation (if the result is greater than 1), independence (if the result is equal to 1) and negative correlation (if the result is less than 1). It can be verified that MI(A, B) may be greater than 0 (MI(A, B) > 0), equal to 0 (MI(A, B) = 0) or less than 0 (MI(A, B) < 0). Fuzzy sets A and B are assumed to have a relationship in constructing the network if and only if A and B have a positive correlation, so that the value of MI(A, B) is greater than 0. If A and B have a relation in the network, then the direction of the relationship between A and B is
determined by comparing the values of the conditional probabilities. The conditional probability of fuzzy event A given B, denoted by P(A|B), is defined as follows [14]:

$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{\sum_{k=1}^{|R|} \min\left(A(d_{kj}), B(d_{ki})\right)}{\sum_{k=1}^{|R|} B(d_{ki})}$    (4)
Similarly, the conditional probability of fuzzy event B given A is given by P(B|A), defined as

$P(B \mid A) = \frac{P(A, B)}{P(A)} = \frac{\sum_{k=1}^{|R|} \min\left(A(d_{kj}), B(d_{ki})\right)}{\sum_{k=1}^{|R|} A(d_{kj})}$    (5)
A comparison between P(A|B) and P(B|A) is used to decide the direction of the relationship between two nodes, represented by two fuzzy sets, in constructing the Bayesian Belief Network. If P(A|B) > P(B|A), then the relationship is directed from B to A, as shown in Figure 1(a). If P(A|B) < P(B|A), then the relationship is directed from A to B, as shown in Figure 1(b). If, in a particular case, P(A|B) = P(B|A), the direction might be either from A to B or from B to A; however, it is necessary to make sure that the chosen direction does not cause any cycle in the network.

Fig. 1. Determining the direction between A and B: (a) if P(A|B) > P(B|A), the arc goes from B to A; (b) if P(B|A) > P(A|B), the arc goes from A to B
The procedure to calculate the conditional probabilities, the mutual information, and the directed arcs between two nodes is given by the following algorithm.
1. Prepare the data using relational algebra (e.g., select, project, Cartesian product).
2. Select the domains and define a node type for every domain to be analyzed (e.g., value, group of values, or fuzzy set).
3. Generate a temporary relation from the selected domains. Name the new domains with the defined labels sequentially, with a default value of 0. Fill the value of each new domain with a weight that depends on the node type defined for the selected domain. If the value or group-of-values node type is selected for the domain, set weight 1 for
every selected value and weight 0 for every unselected value. If, instead, a fuzzy set is selected for the domain, set the weight to the defined fuzzy-set weight for an alphanumeric fuzzy set, or use the fuzzy-set function for a numeric fuzzy set; set weight 0 for undefined values.
4. Calculate the total weight of every domain into a dynamic array.
5. Check the relationships and decide the arc directions:
   Max = number of domains
   For First = 2 to Max-1            {the first node}
     For Second = First + 1 to Max   {the second node}
       Calculate P(First) using equation (3)
       Calculate P(Second) using equation (3)
       Calculate P(First, Second) using equation (2)
       Calculate MI(First, Second) using equation (1)
       If MI(First, Second) > 0 Then
         Calculate P(First | Second) using equation (4)
         Calculate P(Second | First) using equation (5)
         If P(First | Second) > P(Second | First) Then
           Save Second, First, P(First | Second) into FBBN   {arc direction Second -> First}
         Else
           Save First, Second, P(Second | First) into FBBN   {arc direction First -> Second}
         End If
       End If
     End For
   End For
6. Draw the network based on the FBBN.
7. Generate a conditional probability table for each node, for each node value, with each possible combination of the parent nodes' values.
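A compact Python sketch of steps 4-5 (computing P, MI and the conditional probabilities from the weighted columns and keeping an arc only when MI > 0) is given below. It is an illustration of the algorithm, not the authors' implementation, which, as noted in Section 3, runs against an Oracle database.

```python
# Sketch of steps 4-5: derive directed arcs from weighted (fuzzy) columns.
import math

def fbbn_arcs(table):
    """table: dict node name -> list of membership degrees (one per record)."""
    n = len(next(iter(table.values())))
    P = {a: sum(col) / n for a, col in table.items()}
    arcs = []
    names = list(table)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            p_ab = sum(min(x, y) for x, y in zip(table[a], table[b])) / n
            if p_ab == 0 or P[a] == 0 or P[b] == 0:
                continue
            mi = p_ab * math.log2(p_ab / (P[a] * P[b]))
            if mi > 0:                                # positive correlation only
                p_a_given_b, p_b_given_a = p_ab / P[b], p_ab / P[a]
                if p_a_given_b > p_b_given_a:
                    arcs.append((b, a, p_a_given_b))  # arc direction B -> A
                else:
                    arcs.append((a, b, p_b_given_a))  # arc direction A -> B
    return arcs
```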
3 Illustrative Example

To make the proposed FBBN method clearly understandable, we demonstrate an illustrative example. A relational data table of patient medical records is given in Table 2. The data table consists of 10 records with several domains, namely Patient Id, Diagnose, Another Diagnose, Age, and Education. Diagnose and Another Diagnose use ICD-10 identifiers [14]. First of all, we need to define every node that will be used in constructing the FBBN. Here, every node is subjectively defined by users as a meaningful fuzzy
Table 2. A medical record example

Patient Id  Diagnose  Another Diagnose  Age  Education
806931      A09.X     D50               20   Bachelor
806932      Z51.1     A15.9             36   Master
806933      Z51.1     D50               35   Master
806934      S02.1     K56.6             59   Master
806935      Z51.1     C79.8             19   Bachelor
806936      Z51.1     A41.9             49   Bachelor
806937      Z51.1     D64.9             36   Master
806938      A09.X     K56.6             27   Master
806939      E14.9     A16.9             52   Bachelor
806940      A16.2     K74.6             56   Master
set over a given domain. Secondary and Complication nodes are arbitrarily defined as fuzzy sets on Another Diagnose domain as follows.
Secondary = { 0.85/A15.9, 0.2/A41.9, 0.5/D50, 0.2/D64.9 },
Complication = { 0.85/C79.8, 0.3/D50, 0.6/D64.9, 0.5/K56.6 }.

The above expressions mean that Secondary(A15.9) = 0.85, Secondary(A41.9) = 0.2, Complication(C79.8) = 0.85, Complication(D50) = 0.3, etc. The Chemo-Neo node has only one data value, defined on the domain Diagnose as Chemo-Neo = {Z51.1}. Similarly, the node Master is defined on the domain Education as Master = {Master}. Three nodes, Young, Middle and Old, are defined on the domain Age as numerical fuzzy sets, given by the following equations; their graphs are shown in Fig. 2. Based on the fuzzy set definitions of every node, we transform Table 2 into a temporary relation, as shown in Table 3. For instance, the values {0.30, 0.00, 0.30, 0.50, 0.85, 0.00, 0.60, 0.50, 0.00, 0.00} in the column Complication of Table 3 correspond to the Another Diagnose values {D50, A15.9, D50, K56.6, C79.8, A41.9, D64.9, K56.6, A16.9, K74.6} in Table 2, where Complication(D50) = 0.30, Complication(A15.9) = 0.00, Complication(D64.9) = 0.60, etc. The total weights of Chemo-Neo, Complication, Secondary, Young, Middle, Old, and Master are 5, 3.05, 2.25, 2.53, 5.07, 2.40, and 6, respectively. The mutual information between two nodes is calculated in order to decide whether there is a relationship between them. In addition, to decide the arc direction between two nodes, we use the conditional probabilities as described in Section 2. For example, the mutual information between the nodes Chemo-Neo and Complication is calculated by MI(Chemo-Neo, Complication) as given by (1). P(Chemo-Neo) is calculated by equation (3), i.e.
$Young(x) = \begin{cases} 1, & x < 20 \\ \dfrac{35 - x}{15}, & 20 \le x \le 35 \\ 0, & x > 35 \end{cases}$

$Middle(x) = \begin{cases} 0, & x < 20 \\ \dfrac{x - 20}{15}, & 20 \le x \le 35 \\ 1, & 35 \le x \le 45 \\ \dfrac{60 - x}{15}, & 45 \le x \le 60 \\ 0, & x > 60 \end{cases}$

$Old(x) = \begin{cases} 0, & x < 45 \\ \dfrac{x - 45}{15}, & 45 \le x \le 60 \\ 1, & x > 60 \end{cases}$
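The three age membership functions can be written out directly. The sketch below (in Python, for illustration) is a plain transcription of the piecewise definitions above, and the printed values match rows 1, 2 and 4 of Table 3.

```python
# Transcription of the Young/Middle/Old membership functions defined above.
def young(x):
    return 1.0 if x < 20 else (35 - x) / 15 if x <= 35 else 0.0

def middle(x):
    if x < 20 or x > 60:
        return 0.0
    if x <= 35:
        return (x - 20) / 15
    if x <= 45:
        return 1.0
    return (60 - x) / 15

def old(x):
    return 0.0 if x < 45 else (x - 45) / 15 if x <= 60 else 1.0

print(round(young(20), 2), round(middle(36), 2), round(old(59), 2))  # 1.0 1.0 0.93
```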
[Fig. 2. Age fuzzy sets: the membership functions Young, Middle and Old plotted over ages 0-80, with fuzzy values from 0.00 to 1.00.]
P(Chemo-Neo) = Total weight of Chemo-Neo / Total Records = 5/10 = 0.5,
P(Complication) = Total weight of Complication / Total Records = 3.05/10 = 0.305.

Furthermore, P(Chemo-Neo, Complication) is calculated by equation (2) as P(Chemo-Neo, Complication) = 1.75/10 = 0.175. Since MI(Chemo-Neo, Complication) is greater than zero, we can conclude that there is a relationship between the Chemo-Neo and Complication nodes. Similarly, the other MI values between two nodes can be calculated by equation (1), as shown in Table 4. The arc direction between two nodes can be determined by comparing the results of P(Node1 | Node2) and P(Node2 | Node1), as given by equations (4) and (5). For example, since P(Chemo-Neo | Complication) = 0.574 is greater than P(Complication | Chemo-Neo) = 0.35, the arc is directed from Complication to Chemo-Neo with conditional probability 0.574. The other directed arcs are calculated in the same way, as given in Table 5. Finally, the FBBN is generated as shown in Fig. 3.
Table 3. The weighted medical record example

Patient Id  Chemo-Neo  Complication  Secondary  Young  Middle  Old   Master
806931      0          0.30          0.50       1.00   0.00    0.00  0
806932      1          0.00          0.85       0.00   1.00    0.00  1
806933      1          0.30          0.50       0.00   1.00    0.00  1
806934      0          0.50          0.00       0.00   0.07    0.93  1
806935      1          0.85          0.00       1.00   0.00    0.00  0
806936      1          0.00          0.20       0.00   0.73    0.27  0
806937      1          0.60          0.20       0.00   1.00    0.00  1
806938      0          0.50          0.00       0.53   0.47    0.00  1
806939      0          0.00          0.00       0.00   0.53    0.47  0
806940      0          0.00          0.00       0.00   0.27    0.73  1
Total       5          3.05          2.25       2.53   5.07    2.40  6
Table 4. Mutual information between two nodes

Node1         Node2         P(Node1)  P(Node2)  P(Node1, Node2)  MI(Node1, Node2)
Chemo-Neo     Complication  0.500     0.305     0.175            0.035
Chemo-Neo     Secondary     0.500     0.225     0.175            0.112
Chemo-Neo     Young         0.500     0.253     0.100            -0.034
Chemo-Neo     Middle        0.500     0.507     0.373            0.208
Chemo-Neo     Old           0.500     0.240     0.027            -0.058
Chemo-Neo     Master        0.500     0.600     0.300            0.000
Complication  Secondary     0.305     0.225     0.080            0.018
Complication  Young         0.305     0.253     0.165            0.181
Complication  Middle        0.305     0.507     0.143            -0.016
Complication  Old           0.305     0.240     0.050            -0.027
Complication  Master        0.305     0.600     0.190            0.010
Secondary     Young         0.225     0.253     0.050            -0.009
Secondary     Middle        0.225     0.507     0.175            0.108
Secondary     Old           0.225     0.240     0.020            -0.029
Secondary     Master        0.225     0.600     0.155            0.031
Young         Middle        0.253     0.507     0.047            -0.068
Young         Old           0.253     0.240     0.000            -∞
Young         Master        0.253     0.600     0.053            -0.081
Middle        Old           0.507     0.240     0.107            -0.020
Middle        Master        0.507     0.600     0.380            0.122
Old           Master        0.240     0.600     0.167            0.036
Table 5. Generated FBBN

Node1         Node2         P(Node1|Node2)  P(Node2|Node1)  Fuzzy Association Rules
Chemo-Neo     Complication  0.574           0.350           Complication -> Chemo-Neo
Chemo-Neo     Secondary     0.778           0.350           Secondary -> Chemo-Neo
Chemo-Neo     Middle        0.736           0.746           Chemo-Neo -> Middle
Complication  Secondary     0.356           0.262           Secondary -> Complication
Complication  Young         0.651           0.541           Young -> Complication
Complication  Master        0.317           0.623           Complication -> Master
Secondary     Middle        0.345           0.778           Secondary -> Middle
Secondary     Master        0.258           0.689           Secondary -> Master
Middle        Master        0.633           0.750           Middle -> Master
Old           Master        0.278           0.696           Old -> Master
[Fig. 3. The generated FBBN for analyzing medical track records: a directed graph over the nodes Chemo-Neo, Complication, Secondary, Young, Middle, Old and Master, with the arcs and conditional probabilities of Table 5 (e.g., Complication -> Chemo-Neo 0.574, Secondary -> Chemo-Neo 0.778, Young -> Complication 0.651, Middle -> Master 0.750, Old -> Master 0.696).]
Finally, MI(Chemo-Neo, Complication) is given by

$MI(\text{Chemo-Neo}, \text{Complication}) = P(\text{Chemo-Neo}, \text{Complication})\,\log_2\!\left(\frac{P(\text{Chemo-Neo}, \text{Complication})}{P(\text{Chemo-Neo}) \times P(\text{Complication})}\right) = 0.035.$

In the last step, we generate a CPT. For example, the conditional probabilities of node Chemo-Neo given Complication as the parent node are P(Chemo-Neo | Complication) and P(~Chemo-Neo | Complication). Fig. 3 shows that P(Chemo-Neo | Complication) = 0.574. It can be verified that P(~Chemo-Neo | Complication) = 1 − P(Chemo-Neo | Complication) = 1 − 0.574 = 0.426. Similarly, the conditional probabilities of node Chemo-Neo given ~Complication as the parent node are P(Chemo-Neo | ~Complication) and P(~Chemo-Neo | ~Complication). As defined in [8], ~Complication is given by ~Complication(x) = 1 − Complication(x) for every x ∈ Another Diagnose. If the column Complication in Table 3 is replaced by ~Complication, the values {0.30, 0.00, 0.30, 0.50, 0.85, 0.00, 0.60, 0.50, 0.00, 0.00} in the column Complication
are recalculated by the above equation to {0.70, 1.00, 0.70, 0.50, 0.15, 1.00, 0.40, 0.50, 1.00, 1.00}, where ~Complication(D50) = 0.70, ~Complication(A15.9) = 1.00, ~Complication(D64.9) = 0.40, etc. It can then be calculated that P(Chemo-Neo, ~Complication) = 3.25 and P(~Complication) = 6.95. Finally, P(Chemo-Neo | ~Complication) and P(~Chemo-Neo | ~Complication) are given by P(Chemo-Neo | ~Complication) = 3.25/6.95 = 0.468 and P(~Chemo-Neo | ~Complication) = 1 − 0.468 = 0.532. All conditional probabilities of node Chemo-Neo in relation to Complication and ~Complication are shown in Table 6.

Table 6. Conditional probability for node Chemo-Neo
             Complication  ~Complication
Chemo-Neo    0.574         0.468
~Chemo-Neo   0.426         0.532
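As a quick check, the figures of this worked example can be reproduced from the Chemo-Neo and Complication columns of Table 3 with a few lines of Python; this is an illustration only and not part of the original system.

```python
# Reproducing P, MI and the conditional probabilities of the worked example.
import math

chemo = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0]
compl = [0.30, 0.00, 0.30, 0.50, 0.85, 0.00, 0.60, 0.50, 0.00, 0.00]

n = len(chemo)
p_a = sum(chemo) / n                                      # P(Chemo-Neo)
p_b = sum(compl) / n                                      # P(Complication)
p_ab = sum(min(x, y) for x, y in zip(chemo, compl)) / n   # P(Chemo-Neo, Complication)
mi = p_ab * math.log2(p_ab / (p_a * p_b))

print(round(p_a, 3), round(p_b, 3), round(p_ab, 3),
      round(mi, 3), round(p_ab / p_b, 3), round(p_ab / p_a, 3))
# -> 0.5 0.305 0.175 0.035 0.574 0.35
```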
The system has been implemented on a real database stored in an Oracle Database, using an IBM/AT-compatible PC with an AMD Turion X2 processor, 2 GB of memory and a 320 GB hard disk [15]. Table 7 shows the experimental results for the processing time required for various numbers of records and nodes. In addition, Fig. 4 graphically shows the processing time for the various numbers of nodes.

Table 7. Experimental results of processing time

Number of Records  Number of Nodes  Time
12,000 records     2 nodes          3 seconds
                   3 nodes          7 seconds
                   4 nodes          12 seconds
                   5 nodes          23 seconds
24,000 records     2 nodes          4 seconds
                   3 nodes          11 seconds
                   4 nodes          19 seconds
                   5 nodes          28 seconds
36,000 records     2 nodes          6 seconds
                   3 nodes          15 seconds
                   4 nodes          25 seconds
                   5 nodes          40 seconds
48,000 records     2 nodes          7 seconds
                   3 nodes          22 seconds
                   4 nodes          28 seconds
                   5 nodes          54 seconds
Fig. 4. The processing time of various number of nodes
4 Conclusion

In order to propose the concept of the Fuzzy Bayesian Belief Network (FBBN), we extended the concept of mutual information gain induced by fuzzy labels. The relation between two fuzzy nodes was determined by the calculation of their MI, and a comparison of the conditional probabilities between two nodes was used to decide their arc direction. This paper also introduced an algorithm to implement the generation of the FBBN. An illustrative example was discussed to clarify the proposed concept. The generated FBBN can be applied to medical diagnosis tasks involving disease semantics by doctors or hospital staff.

Acknowledgments. This work was supported by the Indonesian Higher Education Directorate under HIKOM Grant No. 25/SP2H/PP/DP2M/V/2009. The authors also thank Dwi Kristanto for implementing the proposed method. Without any of these contributions, it would have been impossible for us to conduct the experiments.
References
1. Lin, C.Y., Yin, J.X., Ma, L.H., Chen, J.Y.: Fuzzy Bayesian Network-Based Inference in Predicting Astrocytoma Malignant Degree. In: 6th World Congress on Intelligent Control and Automation, pp. 10251–10255. IEEE Press, China (2006)
2. Lin, C.Y., Yin, J.X., Ma, L.H., Chen, J.Y.: An Intelligent Model Based on Fuzzy Bayesian Networks to Predict Astrocytoma Malignant Degree. In: 2nd Cybernetics and Intelligent Systems, pp. 1–5. IEEE Press, Thailand (2006)
3. Chiu, C.Y., Lo, C.C., Hsu, Y.X.: Integrating Bayesian Theory and Fuzzy Logic with Case-Based Reasoning for Car-Diagnosing Problems. In: 4th Fuzzy Systems and Knowledge Discovery, pp. 344–348. IEEE Press, China (2007)
4. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, New Jersey (1995)
5. Cheng, J., Bell, D., Liu, W.: Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory. In: 6th Conference on Information and Knowledge Management. ACM Press, USA (1997)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series (2001)
7. Codd, E.F.: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6), 377–387 (1970)
8. Zadeh, L.A.: Fuzzy Sets and Systems. International Journal of General Systems 17, 129–138 (1990)
9. Intan, R., Yuliana, O.Y.: Mining Multidimensional Fuzzy Association Rules from a Normalized Database. In: Proceedings of the International Conference on Convergence and Hybrid Information Technology, pp. 425–432. IEEE Computer Society Press, Los Alamitos (2008)
10. Intan, R.: Generating Multi Dimensional Association Rules Implying Fuzzy Values. In: Proceedings of the International Multi-Conference of Engineers and Computer Scientists, Hong Kong, pp. 306–310 (2006)
11. Intan, R., Yuliana, O.Y.: Fuzzy Decision Tree Induction Approach for Mining Fuzzy Association Rules. In: Proceedings of the 16th International Conference on Neural Information Processing (ICONIP 2009). LNCS, vol. II, pp. 720–728. Springer, Heidelberg (2009)
12. Intan, R., Yuliana, O.Y., Handojo, A.: Mining Fuzzy Multidimensional Association Rules Using Fuzzy Decision Tree Induction Approach. International Journal of Computer and Network Security (IJCNS) 1(2), 60–68 (2009)
13. Intan, R., Mukaidono, M.: Fuzzy Conditional Probability Relations and Their Application in Fuzzy Information Systems. Knowledge and Information Systems 6(3), 345–365 (2004)
14. World Health Organization: ICD-10 Version (2007), http://apps.who.int/classifications/apps/icd/icd10online
15. Kristanto, D.: Design and Implementation of an Application for Supporting Disease Track Record Analysis Using Bayesian Belief Network. Final project, no. 01020788/INF/2009 (2009)
An Experiment Model of Grounded Theory and Chance Discovery for Scenario Exploration

Tzu-Fu Chiu1, Chao-Fu Hong2, and Yu-Ting Chiu3

1 Department of Industrial Management and Enterprise Information, Aletheia University, Taipei, Taiwan, R.O.C. [email protected]
2 Department of Information Management, Aletheia University, Taipei, Taiwan, R.O.C. [email protected]
3 Department of Information Management, National Central University, Taoyuan, Taiwan, R.O.C. [email protected]
Abstract. Exploring the scenario of a technology is a valuable task for managers and stakeholders who need to grasp the overall situation of that technology. Solar cells, a form of renewable energy, are growing at a fast pace owing to their inexhaustible and non-polluting character. In addition, patent data contains plentiful technological information from which further knowledge is worth extracting. Therefore, an experiment model has been proposed to analyze the patent data, to form the scenario, and to explore the tendency of solar cell technology. Finally, the relation patterns were identified, the directions of solar cell technology were recognized, and the active companies and vital countries in the solar cell industry were observed. Keywords: Experiment model, grounded theory, chance discovery, scenario exploration, solar cell, patent data.
1 Introduction

It is essential for a company or its stakeholders to understand the situation of a certain technology, so that the company can review the development directions of its products regarding that technology and the stakeholders can examine the suitability of their relevant investments. Within technological information, up to 80% of the disclosures in patents are never published in any other form [1]. Therefore, patent analysis has been recognized as an important task at both the government and company levels. Through appropriate analysis, technological details and relations, business trends, novel industrial solutions, or investment policies can be derived [2]. Apart from the existing methods, an experiment model combining grounded theory and chance discovery will be built for this area in order to explore the technological scenario of solar cells in the U.S.
2 Related Work

As this study attempts to explore the scenario of a technology, an experiment model is constructed from a combination of Grounded Theory and Chance Discovery. In the experiment model, grounded theory provides a procedural framework that directs the experiment from the first phase, 'data preprocessing', to the final phase, 'new findings', and preserves the qualitative-research nature of the study. Chance discovery supplies the data mining techniques for analyzing the textual data in a computer-aided way. Subsequently, the experiment model will be applied to the problem domain of solar cells. The related areas, i.e., scenario exploration, solar cell, grounded theory, and chance discovery, are described briefly in the following subsections.

2.1 Scenario Exploration

A scenario is a product that describes some possible future state and/or tells the story about how such a state might come about [3]. The main categories of scenarios are: predictive (forecasts, what-if), explorative (external, strategic), and normative (preserving, transforming) [4]. The scenario approach (i.e., scenario analysis, scenario planning, or scenario building) can comprise five stages: scenario preparation, scenario-field analysis, scenario prognostics, scenario development, and scenario transfer [5], and has been applied to perception, the framing of (top) managers' thinking, decision support, strategic management, environmental study, and some other areas [6, 7]. In this study, an experiment model derived from grounded theory and chance discovery will be applied to the scenario exploration of solar cells via patent data.

2.2 Solar Cell

A solar cell or photovoltaic (PV) cell is a device that converts light directly into electricity by the photovoltaic effect [8]. Solar cells, a sort of green energy, are clean, renewable, sustainable, and good for protecting the environment. A number of solar cell materials are currently under investigation or in mass production, including single-crystalline, poly-crystalline, amorphous (a-Si), LED (i.e., light emitting diode), TCO (i.e., transparent conductive oxide), dye-sensitized, thin-film, and compound types. In recent years (2003-2007), total PV production grew on average by almost 50% worldwide, whereas the thin-film segment grew by over 80% (from a very low level) and reached 400 MW, or 10% of total PV production, in 2007 [9]. In order to understand the development scenario of solar cells in the U.S., this study utilizes grounded theory and chance discovery to analyze the patent data of year 2007 from the USPTO (the United States Patent and Trademark Office) [10].

2.3 Grounded Theory

Grounded Theory, one of the qualitative research methods, has been developed to generate a theory from data, where data is systematically gathered and analyzed through the research process [11]. Theory derived from data is more likely to
resemble the "reality" than theory derived by putting together a series of concepts based on experience or solely through speculation (how one thinks things ought to work). The fundamental analytic processes of grounded theory are three types of coding: open, axial, and selective [12].

Open Coding: The purpose of open coding is to give the analyst new insights into the data by breaking through standard ways of thinking about or interpreting the phenomena reflected in the data. In open coding, events/actions/interactions are compared with others for similarities and differences, and they are given conceptual labels. In this way, conceptually similar events/actions/interactions are grouped together to form categories and subcategories.

Axial Coding: In axial coding, categories are related to their subcategories, and the relationships are tested against data. Through the "coding paradigm" of conditions, context, strategies (action/interaction), and consequences, subcategories are related to a category. All hypothetical relationships proposed deductively during axial coding must be considered provisional until verified repeatedly against incoming data. To be verified (that is, regarded as increasingly plausible), a hypothesis must be indicated by the data over and over again. Deductively arrived-at hypotheses that do not hold up when compared with actual data must be revised or discarded.

Selective Coding: Selective coding is the process by which all categories are unified around a "core" category, and categories that need further explication are filled in with descriptive detail. The core category represents the central phenomenon of the study. It might emerge from among the categories already identified, or a more abstract term may be needed to explain the main phenomenon. The other categories always stand in relationship to the core category as conditions, action/interactional strategies, or consequences. In this study, grounded theory will be employed together with chance discovery to construct an experiment model.

2.4 Chance Discovery

Chance discovery, originating from Data Mining, aims to make one aware of a chance and to explain its significance, especially when the chance is rare and its significance is unnoticed [13]. A chance means an event or a situation with significant impact on decision making. In addition, a chance can be conceived either as an opportunity or as a risk, where desirable effects from an opportunity should be promoted and undesirable effects from a risk should be prevented [14]. From chance discovery, two visualization tools, namely the event map and data crystallization, are chosen to support grounded theory in this study.

Event Map: An event map is a two-dimensional undirected graph, which consists of event clusters, visible events, and chances [14]. An event cluster is a group of frequent and strongly related events: the occurrence frequency of events and the co-occurrence between events within an event cluster are both large. The co-occurrence between two events is measured by the Jaccard coefficient as in Equation (1), where e_i is the i-th event in a data record (of the data set D).
Ja(e_i, e_j) = Freq(e_i ∩ e_j) / Freq(e_i ∪ e_j)    (1)
The events outside event clusters are visible events and chances. In fact, an event map without chances is similar to an 'association diagram' in the Data Mining area, a term that will be used in the following sections.

Data Crystallization: Data crystallization is a technique to detect unobservable (but significant) events by inserting these unobservable events as dummy items into the given data set and then showing them on the diagram [15]. In practice, unobservable events and their relations with other events are visualized by applying the event map [16]. A generic data crystallization algorithm can be summarized as follows [17]: (1) event identification; (2) clustering; (3) dummy event insertion; (4) co-occurrence calculation; and (5) topology analysis. Among them, the co-occurrence between a dummy event and the clusters is measured by Equation (2), where DE_i is an inserted dummy event and C is the number of clusters used to classify the event set E into vertex groups.

Co(DE_i, C) = Σ_{j=0}^{C-1} max_{e_k ∈ c_j} Ja(DE_i, e_k)    (2)
Chance discovery will be used to identify the categories as well as the relations between or among categories for grounded theory in this study.
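To make the two measures concrete, here is a minimal Python sketch (not taken from the original tools) of Equations (1) and (2), assuming each patent record is represented as a set of terms and the dummy event is a company name added to its records:

```python
def jaccard(records, e_i, e_j):
    """Equation (1): Jaccard co-occurrence of two events over the record set."""
    both = sum(1 for r in records if e_i in r and e_j in r)
    either = sum(1 for r in records if e_i in r or e_j in r)
    return both / either if either else 0.0

def crystallization_score(records, dummy, clusters):
    """Equation (2): for a dummy event, sum over clusters of the maximum
    Jaccard co-occurrence with any event in that cluster."""
    return sum(max(jaccard(records, dummy, e) for e in cluster)
               for cluster in clusters)

# Toy usage: each record is the set of terms of one patent abstract;
# "CANON" plays the role of an inserted dummy (assignee-name) event.
records = [{"metal-oxide", "thin-film", "CANON"},
           {"metal-oxide", "LED", "CANON"},
           {"dye-sensitized", "LCD"}]
clusters = [{"metal-oxide", "thin-film"}, {"dye-sensitized", "LCD"}]
print(crystallization_score(records, "CANON", clusters))
```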
3 Experiment Model for Scenario Exploration An experiment model for scenario exploration, based on grounded theory and chance discovery, has been constructed as shown in Fig. 1. It consists of five phases: data preprocessing, open coding, axial coding, selective coding, and new findings; and will be described in the following subsections.
Fig. 1. Experiment model for scenario exploration
3.1 Data Preprocessing

In the first phase, the patent data of solar cells (for a certain period of time) will be downloaded from the USPTO beforehand. As the essential parts representing a complex patent document, the abstract, assignee-name, and assignee-country fields are selected as the objects of this study. Afterward, two processes, POS tagging and data cleaning, will be executed to clean up the source textual data.
POS Tagging: An English POS tagger (i.e., a Part-Of-Speech tagger for English) from the University of Tokyo [18] will be employed to perform word segmenting and labeling on the patent documents (i.e., the abstract field). Then, a list of proper morphological features of words needs to be decided for screening out the initial words.

Data Cleaning: Upon these initial words, files of n-grams, stop words, and synonyms will be built so as to combine relevant words, to eliminate less meaningful words, and to aggregate synonymous words. Consequently, the meaningful terms obtained from this process will be passed on to the following phases.

3.2 Open Coding

The second phase conducts the open coding via a data mining tool, the association diagram (i.e., an event map in chance discovery), so as to obtain the categories.

Association Diagram Generation: An association diagram will be drawn using the term frequency and co-occurrence of the meaningful terms, so that a number of clusters will be generated through proper threshold settings for frequency and co-occurrence. Herein, these clusters are regarded as categories in open coding and will be named according to the domain knowledge.

3.3 Axial Coding

The third phase executes the axial coding using data crystallization and association analysis so as to insert dummy events and to generate a modified association diagram. Based on the diagram, the axial categories and the relations among categories and dummy events will be recognized.

Data Crystallization: According to the above association diagram, data crystallization will be triggered to draw a crystallization diagram. Firstly, a dummy event (i.e., the assignee-name field) will be added as an extra item into each patent record. Secondly, the association diagram will be redrawn with the dummy events included; the result is called a "crystallization diagram". Lastly, the nodes of dummy events as well as their corresponding links (from each dummy node to nodes in clusters) will be computed and recorded.

Modified Association Diagram: The nodes of dummy events and their corresponding links, recorded in the above crystallization diagram, will be inserted into the previous association diagram to construct a modified association diagram, so as to find out the axial categories and the relations among categories and dummy nodes (i.e., companies).

3.4 Selective Coding

In the fourth phase, based on the modified association diagram, the core category and relation patterns will be figured out according to the axial categories and the relations among categories and companies.

Pattern Recognition: In accordance with the modified association diagram, the relations among categories and companies will be applied to recognize the relation patterns by observing the linkages from a category to companies, the linkages from a
company to categories, and the linkages from a country via companies to categories (obtained by adding the assignee-country field). Firstly, the linkages from a category to companies will be used to construct the "a technique relates to multiple companies" pattern. Secondly, the linkages from a company to categories will be applied to construct the "a company relates to multiple techniques" pattern. Lastly, the linkages from a country via companies to categories will be utilized to construct the "a country relates to multiple techniques" pattern.

3.5 New Findings

The last phase intends to form a scenario for solar cells and to explore the tendency of solar cell technology based on the relation patterns.

Scenario Exploration: According to the relation patterns from the above pattern recognition and based on the domain knowledge, a scenario of solar cell technology will be formed so that different aspects of the technology can be perceived. Consequently, the scenario of solar cells can be used to depict the directions of the technology, the situation of companies, and the tendency of countries.
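As an illustration of the pattern recognition step, the sketch below derives the relation patterns from a company-to-category mapping read off a modified association diagram; the dictionaries and assignments are illustrative stand-ins, not the paper's full data set:

```python
# Pattern 2 is the company-to-categories mapping itself; patterns 1 and 3 are
# derived by inverting it directly or via the company-to-country mapping.
company_to_categories = {
    "CANON": {"metal-oxide", "LED", "intruder-detector", "spectrum"},
    "SAMSUNG": {"dye-sensitized", "LED"},
}
company_to_country = {"CANON": "JP", "SAMSUNG": "KR"}

# Pattern 1: a category relating to multiple companies
category_to_companies = {}
for comp, cats in company_to_categories.items():
    for cat in cats:
        category_to_companies.setdefault(cat, set()).add(comp)

# Pattern 3: a country relating to multiple categories (via its companies)
country_to_categories = {}
for comp, cats in company_to_categories.items():
    country_to_categories.setdefault(company_to_country[comp], set()).update(cats)

print(category_to_companies["LED"])   # companies linked to the LED category
print(country_to_categories["JP"])    # categories reached via Japanese companies
```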
4 Experimental Results and Explanation

The experiment was performed according to the experiment model. The experimental results are illustrated in the following four subsections: the result of data preprocessing, the result of association diagram generation, the result of modified association diagram generation, and the scenario of solar cell technology.

4.1 Result of Data Preprocessing

In order to explore the scenario of solar cells, the patent documents of solar cells were used as the target data for this experiment. Mainly, the abstract, assignee-name, and assignee-country fields of the patent documents were utilized in this study. Then, 81 patent records of year 2007 were collected from the USPTO, using the keywords "solar cell, or photovoltaic cell, or PV cell" on the title or abstract field. The POS tagging and data cleaning processes were then performed upon the collected 77 patent records (four patent documents had empty abstract fields) to obtain the meaningful terms.

4.2 Result of Association Diagram Generation

Using the meaningful terms from data preprocessing, the association diagram was generated as in Fig. 2. In the diagram, twenty clusters were found when the number of nodes comprising a cluster was set to no less than four. According to the domain knowledge, these clusters were named: metal-oxide, nanocrystal, telemetering-system, LCD (i.e., liquid crystal display), dye-sensitized, active-semiconductor-structure, LED (i.e., light emitting diode), monitoring-terminal, burette, TCO (i.e., transparent conductive oxide), collection-region, thin-panel-enclosure, adjacent-row, thin-film-1, thin-film-2, intruder-detector, hammock, lantern, light-intensity, and spectrum for clusters 1 to 20, respectively; these were the categories of solar cell technology.

Fig. 2. Association diagram with 20 clusters

4.3 Result of Modified Association Diagram Generation

Based on the above association diagram and using the assignee-name field (i.e., company name) as a dummy event, a crystallization diagram was drawn. Subsequently, the dummy nodes (i.e., companies) and their links (to clusters) of the crystallization diagram were recorded and inserted into the previous association diagram to form a modified association diagram (ignoring the unclustered nodes), as in Fig. 3.
Fig. 3. Modified association diagram with dummy nodes (companies)
In Fig. 3, the category (technique) 1 metal-oxide related to the companies GAS, SHOWA, CANON, HUANG, NAKATA, AKZO, KONARKA, and KENT. The company CANON related to the techniques 1 metal-oxide, 7 LED, 16 intruder-detector, and 20 spectrum. After adding the assignee-country field (i.e., country name), the country KR (Korea) related to the techniques 5 dye-sensitized, 7 LED, 16 intruder-detector, and 17 hammock (via the companies SAMSUNG and SAMSUNG-SDI). Consequently, the summarized information of Fig. 3 was used to identify relation patterns 1 to 3, which are summed up in Table 1(a) to 1(c) to facilitate the following explorations.
4.4 Scenario of Solar Cell Technology

By observing Fig. 3 and Table 1, the scenario of solar cells can be depicted as follows. According to Table 1(a), seven technical categories were recognized (from relation pattern 1) by setting the threshold on the number of related companies to greater than four. Among them, 'dye-sensitized' (12) and 'thin-film' (7+4) were the most significant categories, followed by 'metal-oxide' (8), 'LCD' (8), and 'LED' (7). Referring to Table 1(b), eleven companies were perceived (from relation pattern 2) by setting the threshold on the number of related techniques to greater than two. Among them, 'MIDWEST' (5) and 'NAKATA' (5) were the most focused companies, trailed by 'AUO' (4), 'CANON' (4), and 'MOTOROLA' (4). In accordance with Table 1(c), six countries were found (from relation pattern 3) by skipping the countries with only one company included. Among them, 'US'-United States (28) and 'JP'-Japan (16) were the most noteworthy countries, followed by 'DE'-Germany (4), 'KR'-Korea (3), 'CH'-China (2), and 'TW'-Taiwan (2). Moreover, the top three technical categories of the US were 'metal-oxide' (4), 'LCD' (4), and 'dye-sensitized' (4), while the top four technical categories of Japan were 'dye-sensitized' (6), 'metal-oxide' (3), 'LED' (3), and 'thin-film-1' (3). Both the US and Japan emphasized the 'metal-oxide' and 'dye-sensitized' categories.

The tendency of solar cell technology can be presented as follows. Firstly, the directions of solar cell technology in 2007 were the dye-sensitized, thin-film, metal-oxide, LCD, and LED topics (i.e., categories), especially dye-sensitized and thin-film. In fact, dye-sensitized and thin-film were two noticeable light-absorbing materials in the solar cell industry [19]. Secondly, the companies active in various technical topics were MIDWEST, NAKATA, AUO, CANON, and MOTOROLA, particularly MIDWEST and NAKATA. In reality, MIDWEST is an independent, not-for-profit contract research organization (with about 1.8 thousand employees) in the U.S.; it hosts a National Renewable Energy Laboratory and has been continuously involved in photovoltaic energy projects [20]. NAKATA is a well-experienced inventor of solar cell patents and the President of Kyosemi Corporation in Japan [21]. Thirdly, the countries vital in US patent applications were the US, Japan, Germany, Korea, China, and Taiwan, relating fairly evenly to the dye-sensitized, thin-film, LCD, and LED topics. Actually, most of the above countries appeared in the list of the top five PV production countries (all European countries regarded as one EU unit) [9].
Table 1. Summarized information of Fig. 3

(a) Relation pattern 1: a category relating to multiple companies

Technique | Num. | Related companies
metal-oxide | 8 | GAS, SHOWA, CANON, HAUNG, NAKATA, AKZO, KONARKA, KENT
LCD | 8 | KENT, SHARP, MOTOROLA, ASULAB, GAMASONIC, SOLYNDRA, NIAIST, AYLAIAN
dye-sensitized | 12 | AEROSPACE, MOTOROLA, ASAHI, AUO, JSR, NIAIST, SONY, AYLAIAN, DAVIS, NARA, SAMSUNG, KYOSEMI
LED | 7 | CANON, LOS, AUO, NAKATA, MOTOROLA, SAMSUNG, KYOSEMI
TCO | 5 | AKZO, SOLYNDRA, SONY, ASULAB, SPHERAL
thin-film-1; (thin-film-2) | 7; (4) | ARKANSAS-UNI, SINTON, JSR, ASAHI, HITACHI, MOTOROLA, AEROSPACE; (LOS, SANYO, MIDWEST, AUO)
spectrum | 5 | CANON, CALIFORNIA-UNI, MIDWEST, AUO, THOMPSON

(b) Relation pattern 2: a company relating to multiple categories

Company | Num. | Related techniques
AKZO | 3 | 1 metal-oxide, 8 monitoring-terminal, 10 TCO
ASULAB | 3 | 3 telemetering-system, 4 LCD, 10 TCO
AUO | 4 | 5 dye-sensitized, 7 LED, 15 thin-film-2, 20 spectrum
CANON | 4 | 1 metal-oxide, 7 LED, 16 intruder-detector, 20 spectrum
HAUNG | 3 | 1 metal-oxide, 13 adjacent-row, 19 light-intensity
MIDWEST | 5 | 2 nanocrystal, 3 telemetering-system, 15 thin-film-2, 18 lantern, 20 spectrum
MOTOROLA | 4 | 4 LCD, 5 dye-sensitized, 7 LED, 14 thin-film-1
NAKATA | 5 | 1 metal-oxide, 6 active-semiconductor-structure, 7 LED, 8 monitoring-terminal, 12 thin-panel-enclosure
NIAIST | 3 | 4 LCD, 5 dye-sensitized, 11 collection-region
SOLYNDRA | 3 | 4 LCD, 10 TCO, 17 hammock
THOMPSON | 3 | 3 telemetering-system, 8 monitoring-terminal, 20 spectrum

(c) Relation pattern 3: a country relating to multiple categories (via companies)

Country | Num. of comp. | Num. of tech. | Related techniques
CH | 2 | 3 | 3 telemetering-system, 4 LCD, 10 TCO
DE | 4 | 2 | 6 active-semiconductor-structure, 9 burette
JP | 16 | top 8 | 5 dye-sensitized, 1 metal-oxide, 7 LED, 14 thin-film-1, 4 LCD, 6 active-semiconductor-structure, 8 monitoring-terminal, 12 thin-panel-enclosure
KR | 3 | 4 | 5 dye-sensitized, 7 LED, 16 intruder-detector, 17 hammock
TW | 2 | 4 | 5 dye-sensitized, 7 LED, 15 thin-film-2, 20 spectrum
US | 28 | top 11 | 1 metal-oxide, 4 LCD, 5 dye-sensitized, 11 collection-region, 13 adjacent-row, 14 thin-film-1, 15 thin-film-2, 19 light-intensity, 20 spectrum
5 Conclusions The proposed experiment model has been applied in the solar cell technology using patent data. The experiment was performed and the experimental results were obtained. Three relation patterns were identified: “a category relating to multiple companies”, “a company relating to multiple categories”, and “a country relating to
multiple categories (via companies)". The directions of solar cell technology in 2007 were the dye-sensitized, thin-film, metal-oxide, LCD, and LED topics. The active companies were MIDWEST, NAKATA, AUO, CANON, and MOTOROLA. The vital countries were the US, Japan, Germany, Korea, China, and Taiwan, relating fairly evenly to the dye-sensitized, thin-film, LCD, and LED topics. In future work, the experiment model may be combined with other methods, such as value-focused thinking or social computing, so as to enhance the validity of the experimental results. In addition, the data source can be expanded from the USPTO to WIPO, EPO, or TIPO in order to explore the situation of solar cell technology globally. Acknowledgments. This research was supported by the National Science Council of the Republic of China under Grant NSC 97-2410-H-415-043.
References 1. Blackman, M.: Provision of patent information: a national patent office perspective. World Patent Information 17(2), 115–123 (1995) 2. Tseng, Y., Lin, C., Lin, Y.: Text mining techniques for patent analysis. Information Processing and Management 43, 1216–1247 (2007) 3. Bishop, P., Hines, A., Collins, T.: The current state of scenario development an overview of techniques. Foresight 9(1), 5–25 (2007) 4. Borjeson, L., Hojer, M., Dreborg, K., Ekvall, T., Finnveden, G.: Scenario types and techniques: towards a user’s guide. Future 38, 723–739 (2006) 5. Gausemeier, J., Fink, A., Schlake, O.: Scenario management: an approach to develop future potentials. Technological Forecasting and Social Change 59, 111–130 (1998) 6. Postma, T., Liebl, F.: How to improve scenario analysis as a strategic management tool. Technological Forecasting & Social Change 72, 161–173 (2005) 7. Mahmoud, M., Liu, Y., Hartmann, H., Stewart, S., Wagener, T., Semmens, D., Stewart, R., Gupta, H., Dominguez, D., Dominguez, F., Hulse, D., Letcher, R., Rashleigh, B., Smith, C., Street, R., Ticehurst, J., Twery, M., Delden, H., Waldick, R., White, D., Winter, L.: A formal framework for scenario development in support of environmental decision-making. Environmental Modelling & Software 24, 798–808 (2009) 8. Solar cell, http://en.wikipedia.org/wiki/Solar_cell (2009/08/30) 9. Jager-Waldau, A.: PV status report 2008: research, solar cell production and market implementation of photovoltaics. JRC Technical Notes (2008) 10. USPTO: the United States Patent and Trademark Office, http://www.uspto.gov/ (2008/10/10) 11. Strauss, A., Corbin, J.: Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory, 2nd edn. Sage, London (1998) 12. Corbin, J., Strauss, A.: Grounded theory research: procedures, canons, and evaluative criteria. Qualitative Sociology 13(1), 3–21 (1990) 13. Ohsawa, Y., Fukuda, H.: Chance discovery by stimulated groups of people: application to understanding consumption of rare food. Journal of Contingencies and Crisis Management 10(3), 129–138 (2002)
14. Maeno, Y., Ohsawa, Y.: Human-computer interactive annealing for discovering invisible dark events. IEEE Transactions on Industrial Electronics 54(2), 1184–1192 (2007) 15. Ohsawa, Y.: Data Crystallization: Chance Discovery Extended for Dealing with Unobservable Events. New Mathematics and Natural Computation 1(3), 373–392 (2005) 16. Horie, K., Maeno, Y., Ohsawa, Y.: Data crystallization applied for designing new products. Journal of Systems Science and Systems Engineering 16(1), 34–49 (2007) 17. Maeno, Y., Ohsawa, Y.: Stable deterministic crystallization for discovering hidden hubs. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 2, 1393–1398 (2006) 18. An English POS tagger, http://www-tsujii.is.s.u-tokyo.ac.jp/ ~tsuruoka/postagger/ (2008/07/28) 19. Solar cell, Light-absorbing materials, http://en.wikipedia.org/wiki/ Solar_cell#Light-absorbing_materials (2009/08/30) 20. Midwest, Energy, http://www.mriresearch.org/ResearchServices/ Energy/NREL.asp (2009/08/30) 21. Kyosemi corporation, http://www.kyosemi.co.jp/index_e.html (2009/08/30)
Using Tabu Search for Solving a High School Timetabling Problem

Khang Nguyen Tan Tran Minh1, Nguyen Dang Thi Thanh1, Khon Trieu Trang1, and Nuong Tran Thi Hue2

1 Faculty of Information Technology, Ho Chi Minh University Of Science, Vietnam
2 Faculty of Mathematics, Ho Chi Minh University Of Science, Vietnam
[email protected], [email protected], [email protected], [email protected]
Abstract. Tabu Search is known to be an efficient metaheuristic for solving various hard combinatorial problems, which include timetabling. This paper applies Tabu Search to a real-world high school timetabling problem, which involves assigning courses with different lengths to appropriate periods. The proposed algorithm has two phases: an initialization phase using a greedy algorithm and an improvement phase using Tabu Search. In the Tabu Search algorithm, three kinds of moves are used: single moves, swap moves, and block-changing moves. The algorithm's implementation has been tested efficiently on three real instances from two high schools in Vietnam. Keywords: Course timetabling, high school timetabling, metaheuristics, tabu search.
1 Introduction

In general, the high school timetabling problem involves weekly scheduling of all of a school's lectures to appropriate periods, but the differences between the requirements of various high schools make it mostly impossible to develop an algorithm that can efficiently solve the problems of all high schools. A description of the general problem, its mathematical formulation, constraint variants, and an overview of solving methods can be found in the timetabling survey of Andrea Schaerf [3]. The problem considered in this paper is a real-world problem taken from high schools in Vietnam, and our proposed solution technique focuses on solving this concrete problem. Like most educational timetabling problems, high school timetabling is known to be NP-hard. Due to the problem's complexity, in recent decades (from the 1980s on) metaheuristics have been promising candidates for solving hard combinatorial optimization problems like timetabling. Popular metaheuristics such as Genetic Algorithms, Simulated Annealing, and Tabu Search have been applied to high school timetabling with remarkable results [8]-[10]-[11]-[12].
The approach that we use to solve our problem is Tabu Search, which was first introduced by Fred Glover in 1989 [2]. Up to now, many scientists around the world have proposed various versions of Tabu Search algorithms to solve different concrete high school timetabling problems [4]-[5]-[6]-[13]. Our proposed algorithm is a simple but efficient adaptation of Tabu Search for solving the problem of assigning blocks of consecutive lectures to appropriate periods. It was tested on three real instances of two Vietnamese high schools: High school for the Gifted and Tran Phu High school. Results are obtained in reasonable time (less than fifteen minutes) and their quality is good enough to be used in practice. This paper is organized as follows: section 2 gives an overview of the Tabu Search metaheuristic, section 3 describes the details of the considered high school timetabling problem, section 4 explains the details of the proposed algorithm, section 5 presents the experimental results, and section 6 gives the conclusion.
2 Tabu Search

Tabu Search is one of the most popular metaheuristics used for finding solutions to combinatorial optimization problems. We now briefly describe its most basic components and refer to [9] for a comprehensive description of its many variants and effective additional strategies. Starting from an initial solution, Tabu Search iteratively moves from one solution to another, searching through different parts of the search space to try to find a solution that minimizes the value of the objective function, a function that evaluates the cost of a solution. At each iteration, only one solution is chosen to be the current solution, and the part of the search space examined at this iteration is generated from it; this part is called the neighborhood of the current solution. The modification that turns a solution into one of its neighbors is a move. To prevent cycling, a Tabu list is used to store information about recently applied moves. Moves that are currently stored in the Tabu list are called Tabu moves; they are forbidden as long as they are still in the Tabu list. However, some Tabu moves might be good enough to improve the current solution considerably and should still be considered, which means their Tabu status should be dropped; this is the reason why aspiration criteria are used. At each iteration, Tabu moves are checked against the aspiration criteria, and if they satisfy them, their Tabu status is dropped immediately. In many cases, the aspiration criterion used is: "if the new solution created by applying a Tabu move to the current solution has an objective function value smaller than that of the best solution found so far, this Tabu move will be removed from the Tabu list".
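As a rough illustration of these components, the following Python sketch outlines a generic Tabu Search loop with a tabu list and the aspiration criterion described above; the function names and the simple move bookkeeping are assumptions of this sketch, not the authors' implementation:

```python
def tabu_search(initial, neighbors, cost, tabu_tenure=7, max_iters=3000):
    """Generic Tabu Search skeleton. `neighbors(sol)` yields (move, new_sol)
    pairs and `cost` is the objective function to minimize."""
    current, best = initial, initial
    tabu = {}  # move -> iteration index until which the move stays tabu
    for it in range(max_iters):
        candidates = []
        for move, sol in neighbors(current):
            # aspiration criterion: a tabu move is allowed if it beats the best
            if tabu.get(move, -1) < it or cost(sol) < cost(best):
                candidates.append((cost(sol), move, sol))
        if not candidates:
            break
        _, move, current = min(candidates, key=lambda c: c[0])
        tabu[move] = it + tabu_tenure      # mark this move tabu for a while
        if cost(current) < cost(best):
            best = current
        if cost(best) == 0:                # zero cost: all constraints satisfied
            break
    return best
```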
3 Problem Description The high school timetabling problem considered in this paper is taken from two high schools in Vietnam. It involves finding a map between a set of courses and a set of periods in such a way that satisfies the high school’s requirements as much as possible. A course is a group of lectures (a lecture lasts one period) that have
the same teacher, attending classes, and subject. The main information of a course includes the teacher teaching the course, one or more attending classes, the involved subject, and the number of lectures belonging to the course. Courses that have the same teacher or the same class are grouped into a clashed-course group. There are 3 hard constraints, which must be satisfied, and 7 soft constraints, which should be satisfied as much as possible.
• 1st hard constraint (H1): The number of a course's consecutive lectures in a day should be in the range [minConsecutiveLectures, maxConsecutiveLectures].
• 2nd hard constraint (H2): Teachers and classes cannot be assigned to periods at which they are not available.
• 3rd hard constraint (H3): All the lectures of all courses must be assigned.
• 1st soft constraint (S1): Courses that belong to the same clashed-course group should not overlap.
• 2nd soft constraint (S2): The number of sessions (a session is a part of a day, e.g., morning session, afternoon session) that a course is scheduled into should be in the range [minSessions, maxSessions].
• 3rd soft constraint (S3): Courses should be assigned to their preferred periods.
• 4th soft constraint (S4): Idle time between lectures in a session for a teacher or a class should be avoided.
• 5th soft constraint (S5): The number of periods that a class is assigned in a day should be equal to or less than maxStudyingPeriods.
• 6th soft constraint (S6): Idle time between lectures in a day for a teacher or a class should be avoided.
• 7th soft constraint (S7): The number of sessions that a teacher is assigned to should be as small as possible.
(Note that each course has its own minConsecutiveLectures, maxConsecutiveLectures, minSessions, and maxSessions parameters.) The objective function f(q) is used to evaluate a timetable q:

f(q) = Σ_i w_i d_i    (1)

where w_i and d_i are the weight and the violation number of the i-th soft constraint.
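For concreteness, a minimal sketch of evaluating Equation (1), assuming the violation counts d_i have already been measured; the weight values are the ones reported later in Section 5, while the violation counts here are made up for the example:

```python
def evaluate(violations, weights):
    """Equation (1): weighted sum of soft-constraint violations, f(q) = sum_i w_i * d_i."""
    return sum(weights[c] * violations[c] for c in weights)

weights = {"S1": 100, "S2": 5, "S3": 10, "S4": 5, "S5": 5, "S6": 5, "S7": 5}
violations = {"S1": 0, "S2": 2, "S3": 1, "S4": 0, "S5": 0, "S6": 1, "S7": 3}
print(evaluate(violations, weights))  # 2*5 + 1*10 + 1*5 + 3*5 = 40
```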
4 The Algorithm

4.1 Phase 1: Building an Initial Solution Using a Greedy Algorithm

In this phase, courses are split into blocks (a block is a group of consecutive lectures); e.g., course A, whose duration is 5 periods (i.e., it includes 5 lectures), is split into two blocks: a 2-lecture block and a 3-lecture block. These blocks are then assigned to appropriate periods. All of these steps are done using a greedy algorithm. The details of each step are described as follows:
1. For each course, list all of its block splitting ways, i.e., the ways this course could be split into blocks. For example, course A, whose duration is 6 periods, whose minConsecutiveLectures is 2 and whose maxConsecutiveLectures is 3, has two block splitting ways:
• 1st block splitting way (B1): (2,2,2), i.e., course A is split into three blocks; each block includes 2 lectures (lasting two continuous periods).
• 2nd block splitting way (B2): (3,3), i.e., course A is split into two blocks; each block includes 3 lectures (lasting three continuous periods).
2. With each block splitting way Bij of course Ai, list all the period blocks (i.e., sets of consecutive periods) that the blocks in Bij could be assigned to, and insert them into a list Lij of period assigning ways. (Note that period pk could be assigned to course A if and only if course A's teacher and classes are available at period pk.)
3. Choose the course Ai which has the smallest total number of period assigning ways (based on the result of the second step).
4. From all lists Lij of course Ai, choose a period assigning way Sijk that influences the other courses as little as possible. That is, if course Ai is split according to block splitting way Bij and assigned to period assigning way Sijk, the decrease in the total number of period assigning ways of the courses that belong to the same clashed-course group as course Ai is the smallest.
5. Assign course Ai the chosen block splitting way Bij and period assigning way Sijk. If all lists Lij of course Ai are empty, the block splitting way and period assigning way used for course Ai are chosen randomly.
6. Update the information (remove course Ai and all of its assigned periods from consideration).
7. If there is no course left for consideration, move to phase 2. Otherwise, return to the 1st step.
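A small Python sketch of step 1, enumerating the block splitting ways of a course whose block sizes must lie in [minConsecutiveLectures, maxConsecutiveLectures]; this is an illustrative reconstruction, not the authors' code:

```python
def block_splitting_ways(duration, min_len, max_len):
    """Enumerate the ways a course's lectures can be split into consecutive
    blocks whose sizes lie in [min_len, max_len]."""
    if duration == 0:
        return [[]]
    ways = []
    for size in range(min_len, min(max_len, duration) + 1):
        for rest in block_splitting_ways(duration - size, min_len, max_len):
            ways.append([size] + rest)
    return ways

print(block_splitting_ways(6, 2, 3))  # [[2, 2, 2], [3, 3]] for the course A example
```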
4.2 Phase 2: Improving the Obtained Solution Using Tabu Search

In this Tabu Search phase, the main kind of move used is the single move. Moreover, two other kinds of moves, swap moves and block-changing moves, are used after a number of unimproving iterations (iterations that do not improve the best solution found so far) have passed. The details of each component of the Tabu Search are described below.

Single moves
A single move includes three components: a course Ai, a block Pij of course Ai, and a new appropriate period block Rij (i.e., course Ai's teacher and classes must be available at all periods of Rij) to which block Pij will be assigned. We only consider VSingleMove(qcur), a subset of the neighborhood NSingleMove(qcur) of the current solution qcur. VSingleMove(qcur) is created by examining all blocks of all courses and choosing
feasible solutions (solutions that satisfy all hard constraints) that are non-tabu or satisfy the aspiration criterion.

Swap moves
In some cases, single moves cannot improve the current solution and the searching process would get stuck; e.g., when the created timetable is too tight, there may be no hole to move a course's block to. Swap moves can be a good candidate for solving this problem. A swap move has five parts: two courses Ai and Aj, two blocks Pim and Pjn (the size of Pim must be greater than or equal to that of Pjn) of courses Ai and Aj, and an integer number ExchangingWay. If the size of Pim is equal to the size of Pjn, the value of ExchangingWay is always 1; otherwise, ExchangingWay is in the range [1,4]. Let Rim, Rjn be the current period blocks of Pim and Pjn, and R'im, R'jn be the new period blocks of Pim and Pjn after applying the swap move. R'im and R'jn are identified based on the value of ExchangingWay of the move:
• ExchangingWay = 1: R'im = Rjn, R'jn = Rim.
• ExchangingWay = 2: R'im = Rjn - 1, R'jn = Rim.
• ExchangingWay = 3: R'im = Rjn - 1, R'jn = Rim + 1.
• ExchangingWay = 4: R'im = Rjn, R'jn = Rim + 1.
This kind of move is applied after every nSwapMove consecutive unimproving iterations with single moves. When swap moves are applied, single moves are not considered. Swap moves are used for three consecutive iterations and are only chosen when they can improve the objective function value of the current solution qcur. After that, single moves are used again. No Tabu list and no aspiration criterion are used for swap moves.

Block-changing moves
A block-changing move changes the current block-splitting way and the periods assigned to the new blocks of a course. These moves are applied after every nBlock-changingMove consecutive unimproving iterations with single moves, and when they are applied, the other moves are not considered. Such a move is considered in one iteration and is only applied to the current solution qcur if it can improve qcur. A move of this kind contains 3 components: a course Ai, a new block splitting way Bij of course Ai, and a period assigning way Sijk of the new periods to which course Ai will be assigned. The subset VBlock-changingMoves(qcur) of the neighborhood NBlock-changingMoves(qcur), which contains feasible solutions that are non-tabu or aspiration-satisfying, is examined. Because the total number of courses is large, we only investigate the subset VSwapMove(qcur) of the swap move neighborhood NSwapMoves(qcur); VSwapMove(qcur) is generated by examining 30 percent of the available set of course pairs and choosing moves that could improve qcur. Note that swap moves have a higher priority than block-changing moves.
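The four ExchangingWay cases listed above can be transcribed directly; in this sketch a period block is represented only by its starting period index, which is an assumption made for illustration:

```python
def swapped_starts(r_im, r_jn, exchanging_way):
    """New starting periods (R'_im, R'_jn) of the two blocks after a swap move,
    following the four ExchangingWay cases."""
    if exchanging_way == 1:
        return r_jn, r_im
    if exchanging_way == 2:
        return r_jn - 1, r_im
    if exchanging_way == 3:
        return r_jn - 1, r_im + 1
    if exchanging_way == 4:
        return r_jn, r_im + 1
    raise ValueError("ExchangingWay must be in [1, 4]")
```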
Tabu list
There are two Tabu lists: one for single moves and the other for block-changing moves. An element of the single moves' Tabu list keeps three components: course Ai, block Pij, and the old period block Rij assigned to Pij. When a move is chosen to be applied to the current solution, this information is added into the Tabu list and remains there for a Tabu-tenure number of iterations. A move is Tabu whenever its course, its block, and its new periods are in the Tabu list. At each iteration, the Tabu tenure value is randomly chosen in the range [0.25Tb, 0.5Tb] (Tb is the square root of the number of courses). An element of the block-changing moves' Tabu list contains the information of course Ai and the old block-splitting way Bij of course Ai. A move is Tabu if its course and its new block-splitting way are in the Tabu list. The Tabu tenure is fixed to three.

Aspiration criterion
The aspiration criterion is the same as the one often used in many other papers [4][5][6]. A neighbor solution qnei satisfies the aspiration criterion if and only if f(qnei) < f(qbest), where qbest is the best solution found so far.

Stop criteria
The searching process is stopped when it meets one of the following criteria:
• The number of iterations passed is 3000.
• The best solution found so far satisfies all constraints.
• The number of consecutive unimproved iterations is 1000.
5 Experimental Results

The proposed algorithm is applied to three real-world instances of two high schools in Vietnam: High school for the Gifted and Tran Phu high school. The details of each instance are listed in Table 1, with the following abbreviations:
• S: the number of subjects, Cr: the number of courses, T: the number of teachers, C: the number of classes.
• Co: the conflict percent of courses,

Co = Number Of Clashed Course Pairs / Number Of Course Pairs    (2)

where Number Of Clashed Course Pairs is the total number of course pairs that belong to the same clashed-course group.
• AvT (Available Teachers): the availability percent of teachers,

AvT = ( Σ_{i=1}^{T} Number Of Available Periods Of Teacher T_i / (T × PpD × D) ) × 100    (3)

where PpD is the number of periods per day and D is the number of days in the timetable.
• AvC (Available Classes): the availability percent of classes,

AvC = ( Σ_{i=1}^{C} Number Of Available Periods Of Class C_i / (C × PpD × D) ) × 100    (4)

• AvCr (Available Courses): the availability percent of courses,

AvCr = ( Σ_{i=1}^{Cr} Number Of Available Periods Of Course Cr_i / (Cr × PpD × D) ) × 100    (5)
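A hedged sketch of how these instance statistics could be computed, assuming each course is mapped to the clash groups (per teacher and per class) it belongs to and each teacher, class, or course has a set of available periods; the data structures are illustrative:

```python
def conflict_percent(courses, clash_groups):
    """Co (Eq. 2): share of course pairs in the same clashed-course group,
    expressed as a percentage. `clash_groups[c]` is the set of groups of course c."""
    n = len(courses)
    total_pairs = n * (n - 1) // 2
    clashed = sum(1 for i in range(n) for j in range(i + 1, n)
                  if clash_groups[courses[i]] & clash_groups[courses[j]])
    return 100.0 * clashed / total_pairs if total_pairs else 0.0

def availability_percent(available_periods, periods_per_day, days):
    """AvT / AvC / AvCr (Eqs. 3-5): available periods over the whole timetable grid."""
    total = len(available_periods) * periods_per_day * days
    return 100.0 * sum(len(p) for p in available_periods.values()) / total
```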
The weights of the soft constraints used in the tests are: w(S1) = 100, w(S2) = 5, w(S3) = 10, w(S4) = 5, w(S5) = 5, w(S6) = 5, w(S7) = 5. In all of these instances, the values of maxStudyingPeriods, PpD, and D are 9, 12, and 6. We chose nSwapMove and nBlock-changingMove from the set {10, 20, 30, 40, 50, 70, 100, 200}, tested all possible pairs of these parameter values, and found that the pair {20, 30} (nSwapMove = 20 and nBlock-changingMove = 30) gives the best result. The results listed below therefore use this pair of values for these two parameters. The source code is implemented in C++ by the authors. The computer used for testing has a Core2 Duo CPU, 1 GB of RAM, and Windows XP. The algorithm is tested ten times on each instance. The general results, including the initial solution, best solution, average solution, and average running time, are shown in Table 2. Average detailed results are shown in Table 3, and Table 4 lists the violation results of the handmade timetables used in practice.

The sizes of instances 1 and 2 are medium, with 128 and 244 courses; instance 3 is somewhat large, with 401 courses. The availability percent of classes is greater than 90% in all instances, which may make the timetabling easier. However, teacher time is rather tight and the conflict between courses is not small, so in general the timetabling task might not be easy. The initial solution is much improved after the Tabu Search phase, and the average results are not much different from the best results. The time for running the proposed algorithm on these instances is not too large (less than fifteen minutes). The first soft constraint S1, which is the most important (i.e., has the highest weight), is satisfied in all instances. The violation of constraint S7 is large (25% in data 1, 80% in data 2 and data 3).
Table 1. Details of data instances

           | S  | Cr  | T   | C  | AvT | AvC | AvCr | Co
Instance 1 | 19 | 128 | 88  | 9  | 75  | 95  | 87   | 33
Instance 2 | 19 | 244 | 116 | 27 | 64  | 93  | 86   | 50
Instance 3 | 19 | 401 | 116 | 27 | 69  | 91  | 82   | 46

Table 2. General results

           | Initial result | Best result | Average result | Average time (s)
Instance 1 | 6855           | 2995        | 3015           | 821
Instance 2 | 15060          | 1915        | 1954           | 724
Instance 3 | 13755          | 1650        | 1688           | 764

Table 3. Detail results

           | S1 | S2 | S3   | S4  | S5 | S6 | S7   | Total violation
Instance 1 | 0  | 46 | 2369 | 8   | 10 | 20 | 562  | 3015
Instance 2 | 0  | 34 | 0    | 212 | 0  | 0  | 1708 | 1954
Instance 3 | 0  | 40 | 0    | 173 | 30 | 0  | 1445 | 1688

Table 4. Handmade results

           | S1 | S2 | S3   | S4  | S5 | S6 | S7   | Total violation
Instance 1 | 0  | 5  | 1660 | 120 | 40 | 20 | 1285 | 3130
Instance 2 | 0  | 0  | 80   | 200 | 35 | 25 | 2015 | 2355
Instance 3 | 0  | 0  | 105  | 355 | 65 | 40 | 1440 | 2005
The reason is that the method we use to calculate the violation of constraint S7 is only an approximate one: it is much faster than the exact method, so the obtained violation is always greater than or equal to the real violation. The timetables obtained from the proposed algorithm are compared with the handmade ones used in practice. In fact, the availability percent of teachers when timetabling by hand is greater than 85 in all instances (higher than the numbers listed in Table 1), which means the problem instances considered in handmade timetabling are easier than ours. Most of the soft constraints are violated less in the automated version, except the 2nd constraint. In general, the automated timetabling solutions are better than the manual ones. Moreover, the time needed for timetabling by hand is always larger than for the automated system, about half a day to one day.
6 Conclusion In this paper, a Tabu Search-based algorithm that can be efficiently applied to the timetabling problem of two concrete high schools is investigated. Obtained results are better than the handmade ones used in practice. This algorithm could also be extended to adapt to requirements of other high school timetabling problems.
Moreover, the authors' group has developed a web-based automated timetabling system that uses the proposed algorithm as its kernel solver. This system also provides many functions that support high schools' staff and teachers in creating and managing their institutions' timetables over the Internet. More details of this system can be found at www.fit.hcmuns.edu.vn/~nmkhang/hts Acknowledgments. We would like to express our thanks to Tran Duc Khoa, Tran Thi Ngoc Trinh, Dinh Quang Thuan, and Vo Xuan Vinh for their investigation of the problem requirements and data structures and for their other great contributions that helped us complete this research. We also want to thank Halvard Arntzen for giving us his documents and source code for our reference.
References
1. Nguyen, D.T.T., Khon, T.T.: Research on university timetabling algorithms. Bachelor Thesis, Information Technology Faculty, Ho Chi Minh University of Science, Vietnam (2007)
2. Glover, F.: Tabu Search - part I. ORSA J. Comput. 1(3), 190–206 (1989)
3. Schaerf, A.: A Survey of Automated Timetabling. Dipartimento di Informatica e Sistemistica (1999)
4. Schaerf, A.: Tabu search techniques for large high-school timetabling problems. Computer Science/Department of Interactive Systems (1996)
5. Santos, H.G., Ochi, L.S., Souza, M.J.F.: A Tabu Search Heuristic with Efficient Diversification Strategies for the Class/Teacher Timetabling Problem. ACM J. E. A 10, 1–16 (2005)
6. Alvarez, R., Crespo, E., Tamarit, J.M.: Design and implementation of a course scheduling system using Tabu Search. E. J. O. R. 137, 517–523 (2002)
7. McCollum, B., McMullan, P., Paechter, B., Lewis, R., Schaerf, A., Di Gaspero, L., Parkes, A.J., Qu, R., Burke, E.: Setting the Research Agenda in Automated Timetabling: The Second International Timetabling Competition. Technical Report, International Timetabling Competition (2007)
8. Colorni, A., Dorigo, M., Maniezzo, V.: Metaheuristics for High School Timetabling. Comput. Optim. Appl. 9, 275–298 (1998)
9. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers, New York (1997)
10. Abramson, D.: Constructing school timetables using Simulated Annealing: sequential and parallel algorithms. Management Science 37, 98–113 (1991)
11. Erben, W., Keppler, J.: A genetic algorithm solving a weekly course-timetabling problem. In: Burke, E.K., Ross, P. (eds.) PATAT 1995. LNCS, vol. 1153, pp. 21–32. Springer, Heidelberg (1996)
12. Schaerf, A., Schaerf, M.: Local search techniques for high school timetabling. In: Burke, E.K., Ross, P. (eds.) PATAT 1995. LNCS, vol. 1153, pp. 313–323. Springer, Heidelberg (1996)
13. Hertz, A.: Tabu search for large scale timetabling problems. E. J. O. R. 54, 39–47 (1991)
Risk Management Evaluation Based on Elman Neural Network for Power Plant Construction Project

Yongli Wang1, Dongxiao Niu1, and Mian Xing2

1 School of Economics and Management, North China Electric Power University, Beijing, China [email protected], [email protected]
2 Department of Mathematics and Physics, North China Electric Power University, Baoding, China [email protected]
Abstract. Risk management is very important to power plant construction. Risk management for a power plant construction project deals with the uncertain events that, throughout the project life cycle, may influence the project goals and the production and operation management and thereby cause losses. The purpose of this paper is to establish a model for risk management evaluation. Firstly, principal component analysis is used to pre-process the original indexes; it selects the factors with the greatest influence on the evaluation of power plant construction. Secondly, an Elman neural network (ENN) model is adopted to build the evaluation model and classify the evaluation results; it performs well with small sample sets. This is a new application of the Elman neural network to risk management evaluation for power plant construction projects, and, combined with cases, it can give a more objective evaluation of the contract risk in power construction projects. Compared with the AHP and fuzzy evaluation models, the new method achieves greater accuracy. Keywords: Risk Management; Elman Neural Network; Power Plant Construction; Comprehensive Evaluation.
1 Introduction

Risk management is one of the main components of project management.1 Risk identification is the basis of risk management; its fundamental purpose is to find and analyze the various risks that arise during the construction of thermal power units, and then to detail and classify the main risks found.
1 This research has been supported by the Natural Science Foundation of China (70671039). It has also been supported by the Beijing Municipal Commission of Education disciplinary construction and graduate education construction projects.
Regarding power plant construction, safety management evaluation is useful for the construction unit to find potential hazard sources, take practical precautions, and generalize lessons for the future. Many power plant construction projects are managed by the Client or the Client's representatives instead of a professional consultancy company. However, the Client only controls the milestone schedule of the project and does not strictly supervise or control the detailed work done. Under such circumstances the contractor has to undertake all the risks for engineering, procurement, construction, tests, and commissioning. Meanwhile, a large-scale project always has a long execution period, a large scale, many involved parties, and many risk factors with complicated internal relations, which makes it apparently difficult for the contractor to control the project risks. The risks exist in all stages of the whole project life cycle, and whether the risks happen or not, their sizes and the losses that follow affect the project itself and the owner to a large extent. Through expert investigation, PCA analysis, and other methods, we identify the major risks which affect project quality, schedule, investment, and other aspects.

As there are a large number of indexes in the evaluation process, it is very difficult to define their membership grades one by one; thereby, the fuzzy comprehensive evaluation model is hard to realize. Regarding the single artificial neural network model, the number of nodes in the intermediate layer is difficult to determine and its generalization performance is poor; due to these shortcomings, the model yields low calculation accuracy. In this paper, the Elman neural network algorithm is adopted for risk management evaluation. The practical example shows that the model established in this paper is more accurate.
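As a sketch of the principal component pre-processing mentioned here and in the abstract, the following numpy code projects an index matrix onto its leading components; the matrix shape and the number of retained components are illustrative assumptions, not the paper's data:

```python
import numpy as np

def pca_scores(X, k):
    """Project the (samples x indexes) matrix X onto its first k principal
    components and report the explained-variance ratios."""
    Xc = X - X.mean(axis=0)                  # center each index
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance ratio per component
    return Xc @ Vt[:k].T, explained[:k]

X = np.random.rand(20, 8)                    # 20 projects, 8 risk indexes (toy data)
scores, ratio = pca_scores(X, 3)
print(scores.shape, ratio)
```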
2 Index System Establishment

The data treatment process transforms indexes of various types into a single positive quantitative index type. According to the characteristics of each index, different methods are taken, as follows. Regarding qualitative indexes such as s13, the questionnaire method is usually used: the informants are asked to select "good", "fine", "fair", or "poor" in the questionnaire, and it is defined that the range [100, 85] belongs to "good", (85, 80] to "fine", (80, 65] to "fair", and (65, 0] to "poor". Based on the selection frequency and the grade weight, a final mark is calculated using:
x'_ij = (v_1, v_2, v_3, v_4) (w_1, w_2, w_3, w_4)^T    (1)

where x'_ij is the final mark for the j-th index of the i-th sample, and v and w are the selection frequency and the grade weight of each grade.

Regarding negative indexes such as S21 (injury rate per thousand people), a positive index can be acquired on the basis of their reciprocal:

x_ij = 1 / x'_ij    (2)

where x'_ij is a negative index, the original data for the j-th index of the i-th sample, and x_ij is the final positive index. All the data should be normalized before they are input into the model.
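A small sketch of this pre-processing, assuming the four grade weights are taken as representative scores of the bands defined above (an assumption of the example, since the paper does not fix them):

```python
def questionnaire_mark(frequencies, grade_weights):
    """Equation (1): final mark of a qualitative index as the weighted sum of
    grade selection frequencies (grades: good, fine, fair, poor)."""
    return sum(v * w for v, w in zip(frequencies, grade_weights))

def positivize(value):
    """Equation (2): turn a negative (cost-type) index into a positive one."""
    return 1.0 / value

# Illustrative grade weights chosen as midpoints of the score bands in the text
print(questionnaire_mark([0.5, 0.3, 0.15, 0.05], [92.5, 82.5, 72.5, 32.5]))
print(positivize(2.5))  # e.g., injury rate per thousand people
```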
3 Elman Neural Network Evaluation Model At present, the static feedforward neural network based on the back-propagation (BP) algorithm is applied widely in the neural network model. To identify the dynamic system with the static feedforward neural network is actually to simplify the dynamic modeling problem as the static modeling one [9]. Compared to it, the Elman neural network (ENN) is a typical dynamic neural network; it has the function to mapping the dynamic character through storing the inner condition based on the basic structure of the BP NN, so it can reflect the dynamic character of the system more vivid and direct due to its suitable ability. Therefore, this paper adopts the modified ENN considering the dynamic character of the contract risk in power construction project. 3.1 The Modified ENN Structure The ENN can be looked as a BP NN with the local memory unit and the local feedback connection. In the recurrent networks, the ENN has a simple structure and is sensitive about the historical data. Besides an input layer, an hidden layer and an output layer, there is a specific structural unit in the Elman network. In order to improve the ability of the approximation and the dynamic performance of the ENN, the modified ENN increase the self-feedback connection with a fixed plus in the structural unit, its structure appears in figure 1. The same to the multilayer feedforward network,the input layer can only be used for signal transmission, the output layer has the function of linear weighted, the hidden layer may conclude the linear or nonlinear excitation function. But the specific structural unit is used to memorize the previous output value of the hidden layer, and it can be thought as a delay operator. In addition, the dynamic characteristics of the ENN can only be supported by the inner connection, so it doesn’t need the direct use of state as the input or the training signal and this is also its superiority compared to the static feedforward neural networks. And the modified ENN has the better identification capability of the high order system.
Fig. 1. Structure of the modified ENN (input layer, correlation layer, immediate layer, output layer)
As Fig. 1 shows, suppose the exterior input is $u(k-1) \in R^r$ and the output is $y(k-1) \in R^m$; if the output of the immediate (hidden) layer is denoted $x(k) \in R^n$, then the following nonlinear state-space expressions hold:
\[ x(k) = f\big(w^1 x_c(k) + w^2 u(k-1)\big) \qquad (3) \]
\[ x_c(k) = x(k-1) + a\, x_c(k-1) \qquad (4) \]
\[ y(k) = g\big(w^3 x(k)\big) \qquad (5) \]
In the formulas, $w^1$, $w^2$ and $w^3$ are the connection weight matrices from the structural unit to the hidden layer, from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively; $f(\cdot)$ and $g(\cdot)$ are the nonlinear vector functions composed of the excitation functions of the hidden units and the output units.

3.2 The Algorithm of the Modified ENN

The basic ENN trained with the normal BP algorithm can only identify first-order linear dynamic systems. Because the normal BP algorithm uses only the first-order gradient, learning the connection weights of the structural unit is not very stable; when the order of the system or the number of hidden units increases, the corresponding learning rate becomes very small and acceptable approximation accuracy cannot be obtained [10].
However, the self-feedback gain $a$ overcomes these disadvantages and achieves a satisfactory effect. Take the global error objective function as
\[ E = \sum_{p=1}^{N} E_p \qquad (6) \]
where
\[ E_p = \tfrac{1}{2}\big(y_d(k) - y(k)\big)^T \big(y_d(k) - y(k)\big) \qquad (7) \]
For the connection weights from the hidden layer to the output layer, $w^3$:
\[ \frac{\partial E_p}{\partial w^3_{ij}} = -\big(y_{d,i}(k) - y_i(k)\big)\frac{\partial y_i(k)}{\partial w^3_{ij}} = -\big(y_{d,i}(k) - y_i(k)\big)\, g_i'(\cdot)\, x_j(k) \qquad (8) \]
Let $\delta^0_i = \big(y_{d,i}(k) - y_i(k)\big) g_i'(\cdot)$; then
\[ \frac{\partial E_p}{\partial w^3_{ij}} = -\delta^0_i x_j(k), \qquad i = 1,2,\ldots,m;\; j = 1,2,\ldots,n \qquad (9) \]
For the connection weights from the input layer to the hidden layer, $w^2$:
\[ \frac{\partial E_p}{\partial w^2_{jq}} = \frac{\partial E_p}{\partial x_j(k)} \cdot \frac{\partial x_j(k)}{\partial w^2_{jq}} = \sum_{i=1}^{m}\big(-\delta^0_i w^3_{ij}\big)\, f_j'(\cdot)\, u_q(k-1) \qquad (10) \]
Let $\delta^h_j = \sum_{i=1}^{m}\big(\delta^0_i w^3_{ij}\big) f_j'(\cdot)$; then
\[ \frac{\partial E_p}{\partial w^2_{jq}} = -\delta^h_j u_q(k-1), \qquad j = 1,2,\ldots,n;\; q = 1,2,\ldots,r \qquad (11) \]
Similarly, for the connection weights from the structural unit to the hidden layer, $w^1$, we obtain
\[ \frac{\partial E_p}{\partial w^1_{jl}} = -\sum_{i=1}^{m}\big(\delta^0_i w^3_{ij}\big) \frac{\partial x_j(k)}{\partial w^1_{jl}}, \qquad j = 1,2,\ldots,n;\; l = 1,2,\ldots,n \qquad (12) \]
When the normal BP algorithm is adopted, the modified ENN does not consider the dependence of $x_c(k)$ on the weights, so
\[ \frac{\partial x_j(k)}{\partial w^1_{jl}} = f_j'(\cdot)\, x_{c,l}(k) \qquad (13) \]
Since $x_{c,l}(k) = a\, x_{c,l}(k-1) + x_l(k-1)$, we find
\[ f_j'(\cdot)\, x_{c,l}(k) = f_j'(\cdot)\, x_l(k-1) + a f_j'(\cdot)\, x_{c,l}(k-1) \qquad (14) \]
thus
\[ \frac{\partial x_j(k)}{\partial w^1_{jl}} = f_j'(\cdot)\, x_l(k-1) + a \frac{\partial x_j(k-1)}{\partial w^1_{jl}} \qquad (15) \]
Therefore, the learning algorithm of the ENN can be summarized as
\[ \Delta w^3_{ij} = \eta\, \delta^0_i x_j(k), \qquad i = 1,2,\ldots,m;\; j = 1,2,\ldots,n \]
\[ \Delta w^2_{jq} = \eta\, \delta^h_j u_q(k-1), \qquad j = 1,2,\ldots,n;\; q = 1,2,\ldots,r \]
\[ \Delta w^1_{jl} = \eta \sum_{i=1}^{m}\big(\delta^0_i w^3_{ij}\big) \frac{\partial x_j(k)}{\partial w^1_{jl}}, \qquad j = 1,2,\ldots,n;\; l = 1,2,\ldots,n \]
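To make the above concrete, the following is a minimal NumPy sketch of the modified ENN forward pass (Eqs. (3)–(5)) and a simplified gradient update. It is our own illustration, not the authors' MATLAB implementation; the recursive term of Eq. (15) is truncated in the update for w1, and all names and parameter values are assumptions.

```python
import numpy as np

class ModifiedElmanNN:
    """Minimal sketch of the modified Elman network of Eqs. (3)-(5).

    r inputs, n hidden units, m outputs; 'a' is the fixed self-feedback
    gain of the structural (context) unit and eta the learning rate."""

    def __init__(self, r, n, m, a=0.5, eta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(n, n))   # context -> hidden
        self.w2 = rng.normal(scale=0.1, size=(n, r))   # input   -> hidden
        self.w3 = rng.normal(scale=0.1, size=(m, n))   # hidden  -> output
        self.a, self.eta = a, eta
        self.x_prev = np.zeros(n)                      # x(k-1)
        self.xc = np.zeros(n)                          # x_c(k-1)

    def forward(self, u):
        self.xc = self.x_prev + self.a * self.xc       # Eq. (4)
        x = np.tanh(self.w1 @ self.xc + self.w2 @ u)   # Eq. (3), f = tanh
        y = self.w3 @ x                                # Eq. (5), g linear
        self.x_prev = x
        return x, y

    def train_step(self, u, y_d):
        x, y = self.forward(u)
        delta0 = y_d - y                               # delta^0, g' = 1 (linear output)
        deltah = (self.w3.T @ delta0) * (1.0 - x**2)   # delta^h with f' of tanh
        # Weight updates; the recursive part of Eq. (15) is dropped here.
        self.w3 += self.eta * np.outer(delta0, x)      # ~ Eq. (9)
        self.w2 += self.eta * np.outer(deltah, u)      # ~ Eq. (11)
        self.w1 += self.eta * np.outer(deltah, self.xc)  # truncated Eq. (12)/(13)
        return 0.5 * float((y_d - y) @ (y_d - y))      # E_p of Eq. (7)
```

In the setting of Section 4, such a network would be used with r = 18 risk-factor inputs and m = 1 output, trained over the 17 training samples.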
4 Evaluation Model Based on ENN

The index system is shown in Table 1. The evaluation result is obtained through comprehensive analysis: for a power plant, if only one or a few indexes are "good", the overall evaluation result is not necessarily "good"; all indexes must be evaluated and their weights considered. For example, the scores of some selected indexes and the corresponding evaluation results can be seen in Table 2. We therefore make a comprehensive evaluation using all the indexes.
Table 1. Original Index

Index No.  Name
S1         Precautions in the dangerous area
S2         Precautions in the access area
S3         Precautions in the power installation area
S4         Precautions in the radioactive area
S5         Management organization structure
S6         Supervising organization structure
S7         Rules and regulations
S8         Safety budget
S9         The risk of contract conditions
S10        The risk of sub-contract
S11        The risk of claim
S12        The risk of nature
S13        The technical specification
S14        Technical structure
S15        Project scale and the technical capacity
S16        Experience of the Contractor
S17        The economic risk
S18        Political risk
Thirteen experts from different risk management domains are designated; using the fuzzy comprehensive evaluation and the method above, we obtain the fuzzy evaluation values and the comprehensive risk evaluation value of the various contract risk factors in these power construction projects. Seventeen groups of data are taken as training samples and another five groups as test samples. The 18 risk factors of each group are the inputs, and the single output node is the comprehensive evaluation value. Using the modified ENN algorithm, with the maximum number of training epochs set to 1000, the training goal error set to 0.01 and the other parameters left at the defaults of the MATLAB neural network toolbox, we obtain the comprehensive evaluation value of contract risk in the power construction projects; the entire process is realized in MATLAB 7.0. The network is trained with the sample data in Table 2, and the trained network is then tested to examine the training effect and its validity. If the fitting effect is good, the network model can be used for early warning of contract risk in power construction projects.
Table 2. Sample data of network training (the 19 training samples are listed column-wise for each index)

S9:        100 100 95 90 80 83 89 80 71 74 77 76 70 60 57 40 50 20 0
S10:       100 97 90 90 85 84 88 79 80 73 68 77 69 80 69 45 45 30 0
S14:       100 96 89 86 84 81 87 78 78 75 74 78 68 57 50 50 34 25 0
S21:       100 95 99 87 85 88 80 85 74 71 71 75 62 64 59 55 46 40 0
S1:        100 96 96 89 86 82 84 82 74 74 85 76 75 53 63 60 42 50 0
S11:       100 94 92 85 80 79 86 85 73 78 69 79 64 65 62 49 40 29 0
S16:       100 89 94 91 90 80 80 81 72 75 73 74 68 66 65 65 45 60 0
...
S18:       100 90 93 87 83 80 82 81 73 72 70 72 69 72 67 51 52 53 0
Gradation: Good Good Good Good Fine Fine Fine Fine Fair Fair Fair Fair Fair Poor Poor Poor Poor Poor Poor
The evaluation results with the ENN, AHP and FUZZY models are shown in Table 3.

Table 3. Comparison of the evaluation results with the ENN, AHP and FUZZY models

Power Plant  Actual Evaluation  ENN    ENN Grade  AHP    AHP Grade  FUZZY
A            Good               93.56  Good       88.56  Good       Fine
B            Fair               77.33  Fair       82.11  Fine       Fine
C            Good               95.11  Good       92.13  Good       Good
D            Fair               73.97  Fair       80.21  Fair       Fair
E            Fine               84.12  Fine       78.96  Fair       Fair
Whenever a target value of the contract is input, the value of the contract risk can be obtained, and a corresponding risk alarm can be issued according to the risk grade determined by the output value. The comparison with AHP and FUZZY is shown in Table 3: the accuracy of the evaluation results with ENN is 100%, while the accuracy of the AHP and FUZZY models is 60%. It can be seen that the ENN model is more accurate than the other models.
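As a small usage illustration (our own sketch, not code from the paper), the ENN output score can be mapped back to a risk grade, and hence an alarm level, using the score bands defined in Section 2:

```python
def risk_grade(score):
    # Score bands from Section 2: [100, 85] good, (85, 80] fine, (80, 65] fair, (65, 0] poor.
    if score >= 85:
        return "Good"
    if score > 80:
        return "Fine"
    if score > 65:
        return "Fair"
    return "Poor"

# e.g. the ENN outputs of Table 3: 93.56 -> Good, 77.33 -> Fair, 84.12 -> Fine
```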
5 Conclusion

This paper uses the Elman neural network to evaluate risk management in power plant construction projects. Evaluating risk management in power construction is a complicated system evaluation problem. The modified ENN can reflect the dynamic characteristics of the system directly: it determines the model parameters dynamically through the self-learning ability of the ENN, which overcomes the shortcoming of defining parameters purely from expert experience. It can model complicated nonlinear systems effectively and offers advantages such as fast learning and high precision. Combining fuzzy comprehensive evaluation with the modified neural network makes the evaluation easier, because the fuzzy comprehensive evaluation quantifies the fuzzy risk factors. Different training samples can be selected for different applications, so the approach has wide applicability and the trained network remains effective. Compared with the AHP and FUZZY evaluation models, the new method achieves greater accuracy; furthermore, the network sample data are easier to obtain, so the method has good prospects.
References 1. Niu, D., Wang, Y., Shen, W.: Project construction evaluation model for biomass power plant based on support vector machines. Journal of Information and Computational Science 5(1), 79–85 (2008) 2. Ming, M., Dongxiao, N.: Neural network evaluation model based on principle component analysis. Journal of North China Electric Power University 31(2), 53–56 (2004) 3. Yikun, S., Shoujian, Z.: Evaluation of construction engineering safety management effect. Jounal of Jlamusi University 21(4), 383–389 (2003) 4. Jinshun, L., Pei, H.: Method of software quality evaluation based on fuzzy neural network. Computer Technology and Development 16(2), 194–196 (2006) 5. Zixiang, X., Deyun, Z., Yiran, L.: Fuzzy neural network based on principal component. Computer Engineering and Applications, 34–36 (2006) 6. Shizhong, W., Xin, L., Guangqun, X.: Safety supervision of construction engineering. Quality Management (4), 8–10 (2005) 7. Kaiya, W.: Principal component projection applied to evaluation of regional ecological security. Chinese Soft Science (9), 123–126 (2003) 8. Shoukang, Q.: Theory and application of synthetical assessment. Publishing House of Electronics Industry, Beijing (2003) 9. Niu, D.-X., Wang, Y.-L., Wu, D.-S.D.: Power load forecasting using support vector machine and ant colony optimization. Expert Systems with Applications 37, 2531–2539 (2010) 10. Jaselsks, E.J., Anderson, S.D.: Strategies for achieving excellence in safety performance. J. Constr. Engrg. and Mgmt. (1), 61–70 (1996) 11. Qinhe, G., Sunan, W.: Identification of nonlinear dynamic system based on Elman neural network. Computer Engineering and Applications 43(31) (2007)
12. Yu, J., Zhang, A., Wang, X., Wu, B.: Adaptive Neural Network Control with Control Allocation for A Manned Submersible in Deep Sea. China Ocean Engineering 21(1), 147–161 (2007) 13. Niu, D.-x., Wang, Y.-l., Duan, C.-m., Xing, M.: A New Short-term Power Load Forecasting Model Based on Chaotic Time Series and SVM. Journal of Universal Computer Science 15(13), 2726–2745 (2009) 14. Qi, H., Haoyong, P.: The Discussion of making the post-evaluating methods target system and content of thermal power plant construction project. Journal of Taiyuan University of Technology 29(5), 492–496 (1998)
A New Approach to Multi-criteria Decision Making (MCDM) Using the Fuzzy Binary Relation of the ELECTRE III Method and the Principles of the AHP Method Laor Boongasame1 and Veera Boonjing2 1
Department of Computer Engineering, Bangkok University, Bangkok, Thailand 2 Department of Mathematics and Computer Science, King Mongkut’s Institue of Technology Ladkrabang, Bangkok, Thailand [email protected], [email protected]
Abstract. There are several methods for Multi Criteria Decision Making (MCDM), such as multiple attribute utility theory (MAUT), the analytical hierarchy process (AHP), and fuzzy AHP. However, these methods are compensatory optimization approaches, in which a bad score on some criteria can be compensated by excellent scores on other criteria. The Elimination and Choice Translating Reality III (ELECTRE III) method was proposed to solve this problem. Nevertheless, the thresholds determined by the identified experts and used in this method may be inconsistent. Therefore, this paper proposes an integrated approach, called the Consistency ELECTRE III, which employs ELECTRE III together with partial concepts of AHP. In this method, ELECTRE III is used to rank the alternatives and AHP is used to determine the consistency of the criteria thresholds within ELECTRE III. In the simulation, it is found that the threshold values of the criteria within ELECTRE III affect the ranking of the alternatives. In particular, the ranking of the Consistency ELECTRE III and that of ELECTRE III are different. Keywords: Multi Criteria Decision Making (MCDM), Analytical Hierarchy Process (AHP), Elimination and Choice Translating Reality III (ELECTRE III).
1 Introduction
Multi Criteria Decision Making (MCDM) is a technique used to evaluate alternatives against multiple criteria. (The authors gratefully recognize the financial support from Bangkok University.) The problem of MCDM
can be found widely in the real world such as management of agricultural systems and forestry resource-use problems. Therefore, several methods for solving MCDM were proposed, for example, Multiple Attributes Utility Theory (MAUT) [11], Analytical Hierarchy Process (AHP) [9,10,6], Fuzzy AHP [1,2,3,4]. However, these methods are compensatory optimization approaches. To overcome the restriction of these methods, B. Roy [7,8] proposed the concept of outranking. The performance of alternatives on each criterion in outranking is compared in pairs in the concept. Alternative a is said to outrank alternative b if it performs better on some criteria and at least as well as b on all others. Among the outranking methods, the most popular one is Elimination and Choice Translating Reality III (ELECTRE III). In the ELECTRE III method, a technique concerning threshold values is used. However, their values which affect the ranks of the alternatives may be inconsistent. Therefore, this paper proposes an integrated approach that employs ELECTRE and partial concepts of AHP together, called the Consistency ELECTRE III. In the Consistency ELECTRE III, ELECTRE III is used in ranking the alternatives and AHP is used in determining the logical consistency of the thresholds. In the simulation, it is found that threshold values of the criteria within ELECTRE III affect the ranks of alternatives and that the ranking of the Consistency ELECTREIII and that of the ELECTRE III are different. The rest of the paper is organized as follows. In section 2, a literature review is presented. In section 3, the Consistency ELECTRE III approach is explained. In section 4, the application of the Consistency ELECTRE III method is presented. In section 5, the simulation and experimental results are revealed before the discussion and conclusion in Section 6.
2 Literature Review

Multi Criteria Decision Making (MCDM) methods are divided into two types: (i) the family of multiple attribute utility theory (MAUT) methods, such as the Analytical Hierarchy Process (AHP), and (ii) the family of outranking methods.

2.1 The Analytical Hierarchy Process (AHP)
AHP [9,10,6] is a compensatory optimization approach. The AHP is based on the assumption that humans are more capable of making relative judgments than absolute judgments. It rests on three principles: (1) construction of a hierarchy, (2) priority setting, and (3) logical consistency. (1) Construction of a hierarchy: the AHP decomposes a problem into a hierarchy. (2) Priority setting: the AHP is based on pair-wise comparisons of the decision criteria. In the pair-wise comparisons, a nine-point scale is used to express the intensity of an evaluator's preference for one criterion over another, as shown in Table 1.
Table 1. Fundamental scale for pair-wise comparisons in AHP

Verbal scale                                         Numerical values
Equally important, likely or preferred               1
Moderately more important, likely or preferred       3
Strongly more important, likely or preferred         5
Very strongly more important, likely or preferred    7
Extremely more important, likely or preferred        9
Intermediate values to reflect compromise            2, 4, 6, 8
Each individual criterion must be paired against all others, and the results are displayed in matrix form. The matrix is reciprocal: for example, if criterion A is strongly more important than criterion B (i.e., a value of 5), then criterion B has a value of 1/5 compared to criterion A. (3) Logical consistency: with numerous pair-wise comparisons, logical consistency is hard to achieve. For example, if criterion A has a value of 3 compared to criterion B and criterion B has a value of 2 compared to criterion C, then perfect consistency of criterion A compared to criterion C would be a value of 3*2 = 6. If the A-to-C value assigned by the evaluator was 4 or 5, some inconsistency would occur among the pair-wise comparisons. If the degree of consistency is unacceptable, the evaluator should revise the pair-wise comparisons [9]. Fuzzy Analytical Hierarchy Process (FAHP). The FAHP method is a systematic approach to alternative selection that uses the concepts of fuzzy set theory together with the analytical hierarchy process. In the FAHP method, the pair-wise comparisons in the judgment matrix are fuzzy numbers, and fuzzy arithmetic and fuzzy aggregation operators are used. Several FAHP methods have been proposed. The earliest work in fuzzy AHP appears in [1], which compares fuzzy ratios described by triangular membership functions. Buckley [2] determines fuzzy priorities of comparison ratios whose membership functions are trapezoidal. Chang [3] introduces a new approach for handling fuzzy AHP, using triangular fuzzy numbers for the pair-wise comparison scale and the extent analysis method for the synthetic extent values of the pair-wise comparisons. Finally, Cheng [4] proposes an algorithm for evaluating naval tactical missile systems by the fuzzy analytical hierarchy process based on the grade value of the membership function.

2.2 The Family of Outranking Method
The concept of outranking was proposed by Roy [7]. In outranking, the performance of the alternatives on each criterion is compared in pairs: alternative A is said to outrank alternative B if it performs better on some criteria and at least as well as B on all others. In contrast to the AHP method, the outranking approach is only partially compensatory and does not necessarily identify a single best alternative. In addition, preference and indifference thresholds are introduced for each criterion to avoid exaggerating the significance of small differences in performance. The indifference threshold is
the distinction beneath which an evaluator has no preference, that is, a difference that is too small to be used as a basis of distinction between the two. The preference threshold is the distinction above which the evaluator strongly prefers one alternative to another. The most significant methods in the outranking methods are the ELECTRE III methods. This method is a complex decision-aid model which evaluates the number of alternatives by using a family of pseudo-criteria. The outranking relation in ELECTRE III is a fuzzy binary relation. It uses three distinct thresholds to incorporate the uncertainties that are inherent in most influence valuations. The method is relatively well known and was successfully used to solve different concrete problems [5].
3 The Consistency ELECTRE III Approach

Let A = {a_1, a_2, ..., a_n} be a set of alternatives, C = {c_1, c_2, ..., c_r} a set of criteria, W = {w_1, w_2, ..., w_r} the set of criterion weights, a_{kj} the performance value of alternative a_k on criterion c_j, and (a_k, a_l) any ordered pair of alternatives. The approach is organized into four groups of definitions: 1) thresholds, 2) pair-wise threshold comparisons, 3) the indices of concordance and discordance, and 4) the degree of outranking.

3.1 Thresholds
Definition 1. The indifference threshold of criterion j, q_j, is defined by the experts: a_k and a_l are indifferent if |a_{kj} − a_{lj}| < q_j.
Definition 2. The strict preference threshold of criterion j, p_j: a_k is strictly preferred to a_l if |a_{kj} − a_{lj}| > p_j.
Definition 3. The weak preference threshold of criterion j, p_j: a_k is weakly preferred to a_l if |a_{kj} − a_{lj}| <= p_j.
Definition 4. The veto threshold of criterion j, v_j: the hypothesis that a_k outranks a_l is rejected if |a_{kj} − a_{lj}| > v_j.

3.2 Pair-Wise Threshold Comparisons
Each criterion used in the problem is assigned a threshold using partial concepts of Analytical Hierarchy Process (AHP). Firstly, indifferent threshold q of the criterion is determined by identified experts. Then, pair-wise comparisons in the judgment matrix is formed to determine the preference p and veto v thresholds of the criterion as follows: 1) the experts make individual evaluations using the scale, provided in Table 1, to determine the values of the elements of the pair-wise comparisons in the matrix; 2) the weights of these thresholds of the criterion are calculated based on this matrix; and 3) the preference p and veto v threshold of the criterion are calculated based on the weights.
Definition 5. Let $Wt_{x_j y_j}$ be the number indicating the strength of threshold $t_{x_j}$ when compared with threshold $t_{y_j}$ on criterion $c_j$. The experts determine the values of the elements of the pair-wise comparison matrices using the scale of Table 1.
Definition 6. The matrix of these numbers is denoted by $Wt_j$, i.e. $Wt_j = (Wt_{x_j y_j})$.
Definition 7. The matrix is consistent if $Wt_{x_j y_j} = Wt_{x_j z_j} \cdot Wt_{z_j y_j}$ for all $x, y, z$, in which case the matrix $Wt_j$ is reciprocal, that is, $Wt_{x_j y_j} = 1 / Wt_{y_j x_j}$.
Definition 8. From Definition 7, it is observed that a consistent matrix is one in which the comparisons are based on exact measurements, i.e. the weights $W = \{w_1, w_2, \ldots, w_r\}$ are already known; then $Wt_{x_j y_j} = Wt_{x_j} / Wt_{y_j}$, with $x = q, p, v$ and $y = q, p, v$.
Definition 9. A $(3 \times 3)$ evaluation matrix $A$, in which every element $Wt_{x_j y_j}$ ($x, y = q, p, v$ and $j = 1, 2, \ldots, n$) is the quotient of the weights of the thresholds, is
\[
A = \begin{pmatrix} Wt_{v_j v_j} & Wt_{v_j p_j} & Wt_{v_j q_j} \\ Wt_{p_j v_j} & Wt_{p_j p_j} & Wt_{p_j q_j} \\ Wt_{q_j v_j} & Wt_{q_j p_j} & Wt_{q_j q_j} \end{pmatrix}, \qquad Wt_{x_j x_j} = 1,\; Wt_{x_j y_j} = 1 / Wt_{y_j x_j},\; Wt_{y_j x_j} \neq 0.
\]
Definition 10. $Wt_j$, the normalized pair-wise comparison matrix, is calculated by computing the sum of each column (Step I), dividing each entry of the matrix by its column sum (Step II), and averaging across the rows to obtain the relative weights (Step III).
Definition 11. The relative weights are given by the right eigenvector $Wt_j$ corresponding to the largest eigenvalue $\lambda_{max}$, i.e. $A\,Wt_j = \lambda_{max} Wt_j$.
Definition 12. The consistency index of a matrix is $CI = (\lambda_{max} - n)/(n - 1)$.
Definition 13. The consistency ratio is $CR = CI/RI$, where $RI$ is a known random consistency index [18], shown in Table 2. As a rule of thumb, a value of $CR \le 0.1$ is accepted; otherwise a re-voting of the comparison matrix has to be performed.
Definition 14. The preference threshold of criterion $j$ is $p_j = q_j / Wt_{q_j} \cdot Wt_{p_j}$.
Definition 15. The veto threshold of criterion $j$ is $v_j = q_j / Wt_{q_j} \cdot Wt_{v_j}$.
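For illustration, a minimal Python/NumPy sketch of Definitions 10–15 follows. It is our own sketch, not the authors' implementation: the weights are obtained with the column-normalization and row-averaging of Definition 10, λmax is approximated as in Definition 11, and the function and variable names are assumptions.

```python
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}  # first entries of Table 2

def threshold_weights(W):
    """Definitions 10-13 for a pairwise matrix W over the thresholds (v, p, q)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    norm = W / W.sum(axis=0)               # Steps I-II: divide by column sums
    w = norm.mean(axis=1)                  # Step III: average across rows
    lam_max = float(np.mean((W @ w) / w))  # Definition 11 (approximation)
    ci = (lam_max - n) / (n - 1)           # Definition 12
    cr = ci / RI[n]                        # Definition 13
    return w, ci, cr

def p_v_thresholds(q, W):
    """Definitions 14-15: scale p and v from q and the weights (w_v, w_p, w_q)."""
    (w_v, w_p, w_q), ci, cr = threshold_weights(W)
    if cr > 0.1:
        raise ValueError("inconsistent comparisons; re-vote the matrix")
    return q / w_q * w_p, q / w_q * w_v    # (p_j, v_j)

# Example with the matrix of criterion c1 (Table 6): q1 = 2 gives p1 ~= 3.7, v1 ~= 10.6.
W1 = [[1, 3, 5], [1/3, 1, 2], [1/5, 1/2, 1]]
p1, v1 = p_v_thresholds(2, W1)
```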
Table 2. Random Inconsistency Indices (RI) for n = 1, 2, ..., 15

n   1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
RI  0.00  0.00  0.58  0.90  1.12  1.24  1.32  1.41  1.45  1.49  1.51  1.48  1.56  1.57  1.59

3.3 The Index of Concordance and Discordance
Definition 16. The degree of concordance with the judgmental statement that $a_k$ outranks $a_l$ under the $j$-th criterion, $cr_j(k,l)$, is defined as
\[
cr_j(k,l) = \begin{cases} 1, & q_j \ge a_{lj} - a_{kj}, \\ 0, & p_j \le a_{lj} - a_{kj}, \\ \dfrac{p_j - (a_{lj} - a_{kj})}{p_j - q_j}, & \text{otherwise.} \end{cases}
\]
Definition 17. The concordance index of each ordered pair $(a_k, a_l)$ of alternatives is defined as
\[
cr(k,l) = \frac{\sum_{j=1}^{r} w_j\, cr_j(k,l)}{\sum_{j=1}^{r} w_j},
\]
where $w_j$ is the weight determining the relative importance of the $j$-th criterion.
Definition 18. The degree of discordance with the judgmental statement that $a_k$ outranks $a_l$ under the $j$-th criterion, $d_j(k,l)$, is defined as
\[
d_j(k,l) = \begin{cases} 0, & a_{lj} - a_{kj} \le p_j, \\ 1, & a_{lj} - a_{kj} \ge v_j, \\ \dfrac{(a_{lj} - a_{kj}) - p_j}{v_j - p_j}, & \text{otherwise.} \end{cases}
\]

3.4 The Degree of Outranking
Definition 19. The degree of credibility of the outranking statement that $a_k$ outranks $a_l$ is defined as
\[
s(k,l) = \begin{cases} cr(k,l), & J(k,l) = \emptyset, \\ cr(k,l) \displaystyle\prod_{j \in J(k,l)} \frac{1 - d_j(k,l)}{1 - cr(k,l)}, & \text{otherwise,} \end{cases}
\]
where $J(k,l)$ is the set of criteria for which $d_j(k,l) > cr(k,l)$. If $J(k,l) = \emptyset$, i.e. $d_j(k,l) \le cr(k,l)$ for every criterion, then $s(k,l)$ is the same as $cr(k,l)$.
Definition 20. The ranking of the alternatives is defined as
\[
\delta_k = \sum_{l=1}^{n} s(k,l) - \sum_{l=1}^{n} s(l,k), \qquad k = 1, 2, \ldots, n.
\]
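The concordance, discordance, credibility and ranking computations of Definitions 16–20 can be sketched as follows. This is our own illustration; the handling of minimized criteria by negating the performance values, the unit diagonal and all names are assumptions, not the authors' code.

```python
import numpy as np

def electre_iii_scores(perf, weights, q, p, v, minimize=True):
    """Sketch of Definitions 16-20. perf[k][j] is the performance of
    alternative k on criterion j; q, p, v are per-criterion thresholds."""
    A = np.asarray(perf, dtype=float)
    if minimize:                       # smaller is better for minimized criteria
        A = -A
    q, p, v = (np.asarray(t, dtype=float) for t in (q, p, v))
    n_alt, n_crit = A.shape
    w = np.asarray(weights, dtype=float)
    s = np.ones((n_alt, n_alt))
    for k in range(n_alt):
        for l in range(n_alt):
            if k == l:
                continue
            diff = A[l] - A[k]                               # a_lj - a_kj
            cj = np.clip((p - diff) / (p - q), 0.0, 1.0)     # Definition 16
            cr = float(w @ cj / w.sum())                     # Definition 17
            dj = np.clip((diff - p) / (v - p), 0.0, 1.0)     # Definition 18
            sk = cr                                          # Definition 19
            for j in range(n_crit):
                if dj[j] > cr:
                    sk *= (1.0 - dj[j]) / (1.0 - cr)
            s[k, l] = sk
    delta = s.sum(axis=1) - s.sum(axis=0)                    # Definition 20
    return s, delta
```

For the application of Section 4, perf would be the 5×3 performance matrix of Table 4, weights = (0.5, 0.3, 0.2), and q, p, v the thresholds of Table 12.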
4 Application
In this section, an application of the Consistency ELECTRE III method is presented. The method is used in selecting a university. Suppose that a
student wants to select a university. There are five alternatives: a1) Bangkok University, a2) Chulalongkorn University, a3) King Mongkut University of Technology Thonburi, a4) King Mongkut Institute of Technology Ladkrabang, and a5) National Institute of Demonstrative Administration. The criteria considered in selecting a university and their weights are defined for this application in Table 3. The criteria c1, c2, and c3 are all to be minimized. The performances of the alternatives are shown in Table 4.

Table 3. Criteria and Weights

Criteria  Description  Units       Weight
c1        Academic     Rank        0.5 (w1)
c2        Price        Baht        0.3 (w2)
c3        Location     Kilometers  0.2 (w3)
Table 4. Performance Matrix

     a1      a2      a3      a4      a5
c1   15      14      17      19      10
c2   308810  301006  308253  303487  307796
c3   16      13      13      18      16
This application demonstrates the Consistency ELECTRE III method as follows.

4.1 Thresholds

The criteria and their indifference thresholds q are defined for the example as in Table 5.

Table 5. Criteria and Indifference Threshold

Criteria  Indifference Threshold (q)
c1        2
c2        1500
c3        4

4.2 Pair-Wise Threshold Comparisons
The pair-wise comparison matrix formed to determine the indifference, preference and veto thresholds of criterion c1 is shown in Table 6. From Definition 10, the normalized pair-wise comparison matrix is computed, as shown in Table 7. From Definition 11, the largest eigenvalue λmax is computed as shown in Table 8.
Table 6. The matrix Wt1 of criterion c1

Thresholds  v1        p1       q1
v1          1         3        5
p1          1/3=0.33  1        2
q1          1/5=0.2   1/2=0.5  1

Table 7. Determining the Relative Weight of c1

            Step I              Step II                 Step III               wt1
Threshold   v1    p1   q1       v1    p1    q1
v1          1     3    5        0.65  0.66  0.625       (0.65+0.66+0.625)/3    0.645
p1          0.33  1    2        0.22  0.22  0.25        (0.22+0.22+0.25)/3     0.23
q1          0.2   0.5  1        0.13  0.11  0.125       (0.13+0.11+0.125)/3    0.122
Sum         1.53  4.5  8        1     1     1                                  1

Table 8. Determining the Consistency Ratio

Criterion  Step I  Step II
v1         2       3.2
p1         0.708   2.83
q1         0.375   3

Table 9. Indifference, Preference, and Veto Thresholds of c1

v1  2/0.122*0.645 = 10.572
p1  2/0.122*0.23  = 3.7
q1  2

Table 10. The matrix Wt2 of criterion c2

Thresholds  v2    p2   q2
v2          1     7    18
p2          1/7   1    2
q2          1/18  1/2  1

Table 11. The matrix Wt3 of criterion c3

Thresholds  v3   p3   q3
v3          1    2    6
p3          1/2  1    3
q3          1/6  1/3  1

Table 12. Indifference, Preference, and Veto Thresholds of all criteria

     c1                       c2                       c3
vj   2/0.122*0.645 = 10.572   3000/0.05*0.83 = 49800   5/0.1*0.6 = 30
pj   2/0.122*0.23 = 3.7       3000/0.05*0.11 = 6600    5/0.1*0.3 = 15
qj   2                        3000                     5
From Table 8, λmax = (3.2 + 2.83 + 3)/3 = 3.011. From Definitions 12 and 13, we can calculate CI = (λmax − n)/(n − 1) = (3.011 − 3)/2 = 0.005 and CR = CI/RI = 0.005/0.58 = 0.009. The preference and veto thresholds of the criteria are calculated by Definitions 14 and 15, as in Table 9. Additionally, the pair-wise comparison matrices
formed to determine the indifference, preference and veto thresholds of criteria c2 and c3 are shown in Tables 10 and 11, respectively. Table 12 shows the indifference, preference, and veto thresholds of all criteria.

4.3 Calculation of the Index of Concordance and Discordance
The concordance index for every pair of alternatives is calculated by Definitions 16–17, as shown in Table 13. For each criterion, the discordance matrix is calculated by Definition 18; the matrix for c1 is given in Table 14.

Table 13. Concordance Matrix

     a1     a2     a3     a4     a5
a1   1      1      1      1      1
a2   0.847  1      0.877  1      0.9
a3   1      1      1      1      1
a4   0.98   1      1      1      1
a5   1      1      1      1      1

Table 14. Discordance Matrix for c1

     a1  a2  a3  a4  a5
a1   0   0   0   0   0
a2   0   0   0   0   0
a3   0   0   0   0   0
a4   0   0   0   0   0
a5   0   0   0   0   0

4.4 Calculation of the Degree of Outranking
Table 15 shows the outranking matrix calculated by Definition 19. The scores δk = Σ_{l=1}^{n} s(k,l) − Σ_{l=1}^{n} s(l,k), k = 1, 2, ..., n, show the degree of preference of the alternatives and are calculated by Definition 20, as in Table 16. Thus, the final ranking of the alternatives in this application is: a1, a2, a3, a4, and a5.

Table 15. Outranking Matrix

     a1    a2    a3    a4    a5
a1   1     1     1     1     1
a2   0.84  1     0.87  1     0.9
a3   1     1     1     1     1
a4   0.98  1     1     1     1
a5   1     1     1     1     1

Table 16. The score of the alternatives

Alternative Name  Score  Rank
a1                0.17   1
a2                -0.37  5
a3                0.122  2
a4                -0.01  4
a5                -0.97  3
5 Simulation

5.1 Uncertainty in Determining Threshold Values
In this section, we revisit the application and compare simulation results of the two conditions of the ELECTRE III method as follows: 1) exact values of q, p and v of each criterion as shown in Table 17; 2) values of q, p and v of each criterion within the ranges as shown in Table 18 that are the result from 1000 runs. The weights of the criteria and the performances of alternatives used the values from Table 3 and Table 4. In Figure 1, the horizontal axis of this graph is the alternatives. The vertical axis of this graph is the rank. The comparison of the ranks of alternatives between certainty and uncertainty and range are shown in Figure 1. In this figure, it can be observed that ranks of alternatives are sensitive to the threshold values.
Table 17. Indifference, Preference and Veto Thresholds for Uncertainty in Determining Threshold Values

Criteria  Indifference Threshold (q)  Preference Threshold (p)  Veto Threshold (v)
C1        10                          30                        50
C2        5000                        10500                     47500
C3        5                           15                        25

Table 18. Indifference, Preference and Veto Threshold Ranges for Uncertainty in Determining Threshold Values

Criteria  Indifference Threshold (q)  Preference Threshold (p)  Veto Threshold (v)
          Min    Max                  Min    Max                Min    Max
C1        1      20                   21     40                 41     60
C2        1000   10000                10001  11000              40000  55000
C3        1      10                   11     20                 21     30
Fig. 1. Comparing the ranks of the alternatives between certainty and uncertainty in determining threshold values of the pair-wise method
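The threshold-uncertainty experiment of this subsection can be sketched as follows. This is our own illustration; it reuses the hypothetical electre_iii_scores helper from the sketch at the end of Section 3 and draws thresholds uniformly from per-criterion ranges such as those of Table 18.

```python
import numpy as np

rng = np.random.default_rng(0)

def ranking(delta):
    # rank 1 = largest outranking score
    return np.argsort(np.argsort(-delta)) + 1

def threshold_sensitivity(perf, weights, q_rng, p_rng, v_rng, base, runs=1000):
    """Fraction of runs in which randomly drawn thresholds change the ranking
    obtained with the fixed thresholds base = (q, p, v)."""
    _, delta0 = electre_iii_scores(perf, weights, *base)
    base_rank = ranking(delta0)
    changed = 0
    for _ in range(runs):
        q = rng.uniform(*np.transpose(q_rng))   # per-criterion [min, max] rows
        p = rng.uniform(*np.transpose(p_rng))
        v = rng.uniform(*np.transpose(v_rng))
        _, delta = electre_iii_scores(perf, weights, q, p, v)
        changed += int(not np.array_equal(ranking(delta), base_rank))
    return changed / runs

# e.g. q_rng = [[1, 20], [1000, 10000], [1, 10]] as in Table 18
```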
Table 19. Simulation Fixed Parameters

Criteria  Description  Units       Weight    Indifference Threshold (q)
c1        Academic     Rank        0.5 (w1)  10
c2        Price        Baht        0.3 (w2)  5000
c3        Location     Kilometers  0.2 (w3)  5

5.2 Logical Consistency and Inconsistency
The fixed parameters of the simulation are shown in Table 19. There are one hundred replications, and in each one, one hundred random performance matrices of the alternatives are generated. In this section, the simulation results are compared between the consistent pair-wise comparisons of the Consistency ELECTRE III method and the inconsistent pair-wise comparisons of the ELECTRE III method; each replication is performed 200 times. In these simulations, the ranking of the alternatives under the inconsistent pair-wise comparisons differs from that under the consistent pair-wise comparisons. An independent test, the Wilcoxon signed-rank statistic (a nonparametric statistical hypothesis test for the case of two related samples), was applied to the results of these simulations to determine whether a significant difference exists between the rankings of the two groups, with a significance level of α = 0.05. Each replication tests one hypothesis. In 2/3 of all generations, the hypothesis stating that the ranking of the logically consistent pair-wise comparisons is equal to that of the logically inconsistent pair-wise comparisons is rejected; correspondingly, in 2/3 of all generations the hypothesis that the two rankings are not equal is accepted.
6 Discussion and Conclusion
In this paper, the Consistency ELECTRE III approach which employs the ELECTRE III and partial concepts of AHP is proposed. The Consistency ELECTRE III method in this paper guarantees that it gives consistency threshold values within ELECTRE III by using the logical consistency pairwise comparisons method in AHP. From the simulation, it is observed that ranking of the alternatives is sensitive to the threshold values in ELECTRE III. It can also be concluded that the ranking of the Consistency ELECTREIII and that of the ELECTRE III are different. Therefore, it is better if the decision maker defines the consistency threshold values by using logical consistency pair-wise comparisons method in the Consistency ELECTRE III.
References 1. Van Laarhoven, P.J.M., Pedrycz, W.: A fuzzy extension of Saaty’s priority theory. Fuzzy Sets and Systems 11, 229–241 (1983) 2. Buckley, J.J.: Fuzzy hierarchical analysis. Fuzzy Sets and Systems 17, 233–247 (1985)
3. Chang, D.-Y.: Applications of the extent analysis method on fuzzy AHP. European Journal of Operational Research 95, 649–655 (1996) 4. Cheng, C.-H.: Evaluating naval tactical missile systems by fuzzy AHP based on the grade value of membership function. European Journal of Operational Research 96(2), 343–350 (1997) 5. Brans, J.P., Vincke, P.: A preference ranking organization method. Management Science 31, 647–656 (1985) 6. Saaty, T.L.: How to make a decision: the analytic hierarchy process. Interfaces 24, 9–43 (1994) 7. Roy, B.: Classement et choix en presence de points de vue multiples (la methode ELECTRE). RIRO, 2e annee 8, 57–75 (1968) 8. Roy, B.: Mathematiques modernes et sciences de la gestion. Revue de l’Economie du Centre-Est (52-53), 128–134 (1971) 9. Saaty, T.L.: Decision Making for Leaders; The Analytical Hierarchy Process for Decisions in a Complex World. Wadsworth, Belmont; Translated to French, Indonesian, Spanish, Korean, Arabic, Persian, and Thai, latest revised edn. RWS Publications, Pittsburgh (2000) 10. Saaty, T.L., Alexander, J.: Conflict Resolution: The Analytic Hierarchy Process. Praeger, New York (1989) 11. von Neumann, J., Morgenstern, O.: Theory of games and economic behavior. Princeton University Press, Princeton (1947)
A Routing Method Based on Cost Matrix in Ad Hoc Networks Mary Wu, Shin Hun Kim, and Chong Gun Kim* Dept. Of Computer Engineering, Yeungnam University Dae-dong, Gyeongsan, Gyeongbuk, Republic of Korea [email protected], [email protected], [email protected]
Abstract. An ad hoc network does not rely on an existing infrastructure; it is organized as a network of nodes that act as both hosts and routers to transmit packets. Because its topology changes frequently, an ad hoc network cannot rely on routing methods designed for pre-established wired networks; it requires a special routing method. In this paper, we introduce an agent-based routing algorithm. The agent node builds knowledge of the topology in the form of an adjacency cost matrix. From this adjacency cost matrix, the shortest cost matrix and the next hop matrices can be calculated, in which each entry gives the minimum cost and the route between nodes. These matrices are distributed to the other nodes by the agent. Based on the shortest cost matrix and the next hop matrices, each node determines a shortest path to a destination without a path discovery process. Because every node does not need full network topology information, the control message overhead of the proposed method is expected to be small compared with that of general table-driven routing protocols. Keywords: Topology discovery agent, adjacency cost matrix, shortest path search, ad hoc networks.
1 Introduction The ad hoc network does not rely on an existing infrastructure and consist of mobile hosts interconnected by routers that can also move[1, 2, 3]. With the frequent changes of topology, the ad hoc network requires a special routing algorithm. Routing methods used in ad hoc networks can be classified into two categories: the table-driven routing and the on-demand routing[4, 5]. The table-driven routing protocols exchange background routing information regardless of communication requests. The protocols have many desirable properties, especially for applications including real-time communications and QoS guarantees, such as low-latency route access and alternate QoS path support. *
Corresponding author.
The Destination-Sequenced Distance-Vector(DSDV) routing protocol is a table driven algorithm that modifies the Bellman-Ford routing algorithm to include timestamps that prevent loop-formation[6]. The OLSR(Optimized Link State Routing Protocol) is link state routing protocol. It periodically exchanges topology information with other nodes in the network. The protocol uses multipoint relays to reduce the number of broadcast packet retransmissions and also the size of the update packets, leading to efficient flooding of control messages in the network[7]. The table-driven routing protocols try to maintain shortest paths to all nodes in the network at all times. A consequence of maintaining shortest paths is that if the topology of the network changes rapidly, the control overhead increases dramatically[6, 7, 8]. On-demand routing protocols were designed with the aim of reducing control overhead, thus increasing bandwidth and conserving power at the mobile stations. The design follows the idea that each node tries to reduce routing overhead by only sending routing packets when a communication is awaiting. The examples include AODV (Ad hoc On-demand Distance Vector), DSR (Dynamic Source Routing), R-AODV (Reverse AODV), etc. Operation of on-demand routings react only to communication needs. The routing overhead thus relates to the discovery and maintenance of the routes in use. With light traffic and low mobility, ondemand protocols scale well. However, for heavy traffic with a large number of destinations, more sources will search for destination. Also, as mobility increase, the discovered route may break down, requiring repeated route discoveries on the way to the destination. Longer delays are expected due to the route discovery process in large mobile networks[9, 10, 11]. In this paper, we propose an cost matrix based routing algorithm. The agent collects the information of network topology and creates the shortest cost matrix and the next hop matrices using the algorithm. Each node maintains the information of neighbors using Hello messages that report the changes of neighbors and updates the changed information to the agent. The agent distributes the matrices of updates periodically. The remainder of this paper is organized as follows. Section 2 introduces related researches. Section 3 presents the creation of network topology including link costs and introduces the calculation methods for the shortest cost matrix and the next hop matrices. Section 4 analyzes the performance. Finally, the conclusions are given in Sections 5.
2 Related Research

The topology of an ad hoc network can be modeled by an undirected graph G(V, A), where V denotes the node set of the network and A is an adjacency matrix that describes the topology. The adjacency matrix A = {a_ij} of an n-node network, in which a '1' entry at (i, j) indicates a connection from node i to node j and a '0' entry indicates no connection, can be manipulated to obtain the connectivity matrix C = {c_ij}, in which the entry at (i, j) gives the minimum number of hops needed to connect node i to node j [12, 13].
In [16], Ning Li and his colleagues calculated the k-hop adjacency matrix to evaluate the complexity of the network. The k-hop adjacency is obtained from the multiplication $A^{(i+1)}_{n \times n} = A^{(i)}_{n \times n} \times A_{n \times n}$, with
\[
A^{(i+1)}(u,v) = \begin{cases} i+1, & \text{if } A^{(i)}(u,v) = 0 \text{ and } A^{(i+1)}(u,v) > 0, \\ A^{(i)}(u,v), & \text{if } A^{(i)}(u,v) > 0. \end{cases}
\]
For example,
\[
A^{(1)} = \begin{pmatrix} 1&1&0&0&0&0 \\ 1&1&1&0&1&0 \\ 0&1&1&1&0&0 \\ 0&0&1&1&0&0 \\ 0&1&0&0&1&1 \\ 0&0&0&0&1&1 \end{pmatrix}, \qquad
A^{(2)} = \begin{pmatrix} 1&1&2&0&2&0 \\ 1&1&1&2&1&2 \\ 2&1&1&1&2&0 \\ 0&2&1&1&0&0 \\ 2&1&2&0&1&1 \\ 0&2&0&0&1&1 \end{pmatrix}
\]
They presented a matrix-based fast calculation algorithm to compute the maximum number of k-hop paths from the k-hop adjacency matrix of the network and evaluated the complexity of the network using this maximum number of k-hop paths.
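For illustration (our own sketch, not code from [16]), the k-hop adjacency matrix can be computed by repeated matrix multiplication and the update rule above:

```python
import numpy as np

def k_hop_adjacency(A1, k):
    """Compute A^(k) from the 1-hop adjacency matrix A1 (1 on the diagonal)."""
    A1 = np.array(A1, dtype=int)
    Ak = A1.copy()                                   # A^(1)
    for i in range(1, k):
        reach = (Ak > 0).astype(int) @ A1            # positivity pattern of A^(i) x A
        Anew = Ak.copy()
        Anew[(Ak == 0) & (reach > 0)] = i + 1        # first reachable at i+1 hops
        Ak = Anew
    return Ak

# With the 6-node example above, k_hop_adjacency(A1, 2) reproduces A^(2).
```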
3 Cost Matrix for Shortest Route Discovery In this section, we introduce the creation of the shortest cost matrix and the next hop matrices based on the adjacency cost matrix. 3.1 An Agent and a Creation of Adjacency Cost Matrix The agent is a distinctive node designed for implementing the proposed routing method. The agent initiates the topology discovery process. First, it collects the information of costs between links in the networks and creates an adjacent matrix based on the information. Next, it creates the shortest cost matrix and the next hop matrices based on the adjacent matrix. Each node periodically broadcasts ‘Hello’ messages in order to check the connectivity with its neighbor nodes[16]. In our proposed routing method, we assume that each node maintains a neighbor table for the costs of neighbor links. In the network shown in fig. 1, node 1 and node 2 are the neighbors with cost 7 and node 1 and node 4 are the neighbors with cost 3.
Fig. 1. A network topology
Fig. 2. A network topology and link costs
To collect the information about link costs in the network, the agent broadcasts a query message to all nodes. The nodes that receive the query message send back a reply message containing the information about their own link costs. At the end of the query process, the agent can create the one-hop adjacency cost matrix of the whole network topology. For the network shown in Fig. 2, the adjacency cost matrix is
\[
A = \begin{pmatrix} 0&7&0&3&0&0 \\ 7&0&3&3&0&0 \\ 0&3&0&1&1&1 \\ 3&3&1&0&3&4 \\ 0&0&1&3&0&5 \\ 0&0&1&4&5&0 \end{pmatrix} \qquad (1)
\]
3.2 Shortest Cost Matrix and Next Hop Matrices

After establishing the adjacency cost matrix, the agent can create the shortest cost matrix and the next hop matrices from it. The 2-hop cost matrix $B^{(2)} = \{b^{(2)}_{i,j}\}$ is given by
\[
b^{(2)}_{i,j} = \begin{cases} 0, & i = j, \\ \min_k \big(a_{i,k} + a_{k,j}\big), & a_{i,k} > 0,\; a_{k,j} > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (2)
\]
The 2-hop shortest cost matrix $C^{(2)} = \{c^{(2)}_{i,j}\}$ is obtained from the values of $b^{(2)}_{i,j}$ and $a_{i,j}$:
\[
c^{(2)}_{i,j} = \begin{cases} b^{(2)}_{i,j}, & a_{i,j} = 0,\; b^{(2)}_{i,j} > 0, \\ a_{i,j}, & a_{i,j} > 0,\; b^{(2)}_{i,j} = 0, \\ \min\big(a_{i,j}, b^{(2)}_{i,j}\big), & a_{i,j} > 0,\; b^{(2)}_{i,j} > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (3)
\]
The 3-hop cost matrix $B^{(3)} = \{b^{(3)}_{i,j}\}$, which depends on the 2-hop shortest cost matrix, is given by
\[
b^{(3)}_{i,j} = \begin{cases} 0, & i = j, \\ \min_k \big(c^{(2)}_{i,k} + a_{k,j}\big), & c^{(2)}_{i,k} > 0,\; a_{k,j} > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)
\]
The 3-hop shortest cost matrix $C^{(3)} = \{c^{(3)}_{i,j}\}$ is obtained from the values of $b^{(3)}_{i,j}$ and $c^{(2)}_{i,j}$:
\[
c^{(3)}_{i,j} = \begin{cases} b^{(3)}_{i,j}, & c^{(2)}_{i,j} = 0,\; b^{(3)}_{i,j} > 0, \\ c^{(2)}_{i,j}, & c^{(2)}_{i,j} > 0,\; b^{(3)}_{i,j} = 0, \\ \min\big(c^{(2)}_{i,j}, b^{(3)}_{i,j}\big), & c^{(2)}_{i,j} > 0,\; b^{(3)}_{i,j} > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (5)
\]
In general, the m-hop cost matrix rule is
\[
b^{(m)}_{i,j} = \begin{cases} 0, & i = j, \\ \min_k \big(c^{(m-1)}_{i,k} + a_{k,j}\big), & c^{(m-1)}_{i,k} > 0,\; a_{k,j} > 0,\; m \ge 2, \\ 0, & \text{otherwise,} \end{cases} \qquad (6)
\]
and the m-hop shortest cost matrix $C^{(m)} = \{c^{(m)}_{i,j}\}$ is
\[
c^{(m)}_{i,j} = \begin{cases} b^{(m)}_{i,j}, & c^{(m-1)}_{i,j} = 0,\; b^{(m)}_{i,j} > 0, \\ c^{(m-1)}_{i,j}, & c^{(m-1)}_{i,j} > 0,\; b^{(m)}_{i,j} = 0, \\ \min\big(c^{(m-1)}_{i,j}, b^{(m)}_{i,j}\big), & c^{(m-1)}_{i,j} > 0,\; b^{(m)}_{i,j} > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (7)
\]
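The recurrences (2)–(7), together with the next-hop bookkeeping described in the example below, can be sketched in Python as follows. This is our own illustration of the computation, not the authors' implementation; 0 encodes "no link/no path" as in the matrices of this section, and nodes are numbered from 1 in the returned next-hop matrix.

```python
import numpy as np

def shortest_cost_and_next_hop(A, max_hops):
    """Return the shortest cost matrix C and a one-hop next-node matrix NH
    computed from the adjacency cost matrix A (0 = no direct link)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    C = A.copy()                              # C^(1) = A
    NH = np.zeros((n, n), dtype=int)          # first hop on the current best path
    for i in range(n):
        for j in range(n):
            if A[i, j] > 0:
                NH[i, j] = j + 1              # direct neighbour (1-indexed nodes)
    for _ in range(2, max_hops + 1):          # build C^(2), ..., C^(m), Eqs. (6)-(7)
        B = np.zeros((n, n))
        via = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                best, best_k = 0.0, 0
                for k in range(n):
                    if C[i, k] > 0 and A[k, j] > 0:
                        cost = C[i, k] + A[k, j]
                        if best == 0.0 or cost < best:
                            best, best_k = cost, k
                B[i, j], via[i, j] = best, best_k
        for i in range(n):
            for j in range(n):
                if B[i, j] > 0 and (C[i, j] == 0 or B[i, j] < C[i, j]):
                    C[i, j] = B[i, j]                 # Eq. (7)
                    NH[i, j] = NH[i, via[i, j]]       # inherit first hop of the prefix
    return C, NH
```

Applied to the adjacency cost matrix of Fig. 2 with max_hops = 3, C reproduces the matrix C^(3) of the example below; NH here keeps the first hop of the current best path for every pair, which is a simplification of the per-stage NH matrices used in the paper.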
In the topology of Fig. 2, we can calculate the shortest cost matrix. The matrices below show the results of the calculation. The adjacency cost matrix is
\[
A = \begin{pmatrix} 0&7&0&3&0&0 \\ 7&0&3&3&0&0 \\ 0&3&0&1&1&1 \\ 3&3&1&0&3&4 \\ 0&0&1&3&0&5 \\ 0&0&1&4&5&0 \end{pmatrix}
\]
The 2-hop cost matrix $B^{(2)}$ and 2-hop shortest cost matrix $C^{(2)}$ are
\[
B^{(2)} = \begin{pmatrix} 0&6&4&10&6&7 \\ 6&0&4&4&4&4 \\ 4&4&0&4&4&5 \\ 10&4&4&0&2&2 \\ 6&4&4&2&0&2 \\ 7&4&5&2&2&0 \end{pmatrix} \qquad (8)
\]
\[
C^{(2)} = \begin{pmatrix} 0&6&4&3&6&7 \\ 6&0&3&3&4&4 \\ 4&3&0&1&1&1 \\ 3&3&1&0&2&2 \\ 6&4&1&2&0&2 \\ 7&4&1&2&2&0 \end{pmatrix} \qquad (9)
\]
In the process of computing $B^{(2)}$ and $C^{(2)}$, we can obtain the one-hop nodes of the two-hop paths: $b^{(2)}_{i,j} = \min_k(a_{i,k} + a_{k,j})$ with $a_{i,k} > 0$ and $a_{k,j} > 0$, and the minimizing $k$ is the one-hop node. The one-hop next node matrix $NH^{(2)}_{(1)} = \{nh^{(2)}_{(1)\,i,j}\}$ is built from these $k$ elements during the computation of $B^{(2)}$ and $C^{(2)}$:
\[
NH^{(2)}_{(1)} = \begin{pmatrix} 0&4&4&0&4&4 \\ 4&0&0&0&3&3 \\ 4&3&0&0&0&0 \\ 0&0&0&0&3&3 \\ 4&3&0&3&0&3 \\ 4&3&0&3&3&0 \end{pmatrix} \qquad (10)
\]
where the superscript (2) is the hop count between the source and the destination and the subscript (1) denotes the one-hop distance from the source. For example, $c^{(2)}_{5,4}$ is 2 in $C^{(2)}$ and $nh^{(2)}_{(1)\,5,4}$ is 3 in $NH^{(2)}_{(1)}$, so from these matrices we know that the two-hop path from node 5 to node 4 is 5->3->4 with cost 2.

Fig. 3. The example of a two-hop path

The 3-hop matrices for the example are
\[
B^{(3)} = \begin{pmatrix} 0&6&4&5&5&5 \\ 6&0&4&4&4&4 \\ 4&4&0&4&4&5 \\ 10&4&3&0&2&2 \\ 5&4&3&2&0&2 \\ 5&4&3&2&2&0 \end{pmatrix} \qquad (11)
\]
\[
C^{(3)} = \begin{pmatrix} 0&6&4&3&5&5 \\ 6&0&3&3&4&4 \\ 4&3&0&1&1&1 \\ 3&3&1&0&2&2 \\ 5&4&1&2&0&2 \\ 5&4&1&2&2&0 \end{pmatrix} \qquad (12)
\]
In the process of computing $B^{(3)}$ and $C^{(3)}$, we can obtain the two-hop nodes of the three-hop paths: $b^{(3)}_{i,j} = \min_k(c^{(2)}_{i,k} + a_{k,j})$ with $c^{(2)}_{i,k} > 0$ and $a_{k,j} > 0$, and the minimizing $k$ is the two-hop node. The two-hop node matrix $NH^{(3)}_{(2)} = \{nh^{(3)}_{(2)\,i,j}\}$ is built from these $k$ elements:
\[
NH^{(3)}_{(2)} = \begin{pmatrix} 0&0&0&0&3&3 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 4&0&0&0&0&0 \\ 4&0&0&0&0&0 \end{pmatrix} \qquad (13)
\]
where the superscript (3) is the hop count between the source and the destination and the subscript (2) denotes the two-hop distance from the source. For example, $c^{(3)}_{5,1}$ is 5 in $C^{(3)}$ and $nh^{(3)}_{(2)\,5,1}$ is 4 in $NH^{(3)}_{(2)}$, which means that the node at two-hop distance from source node 5 on the three-hop shortest path from 5 to 1 is node 4. $NH^{(3)}_{(1)}$ then has to be calculated to obtain the one-hop distance nodes of the three-hop shortest paths: since $c^{(3)}_{5,1} = c^{(2)}_{5,4} + a_{4,1}$ and $nh^{(2)}_{(1)\,5,4}$ is 3 in $NH^{(2)}_{(1)}$, $nh^{(3)}_{(1)\,5,1}$ is updated with the value of $nh^{(2)}_{(1)\,5,4}$, the one-hop node of the previous step. From the next hop matrices $NH^{(3)}_{(1)}$ and $NH^{(3)}_{(2)}$ we obtain the path from node 5 to node 1, 5->3->4->1, and from $C^{(3)}$ its cost, 5.

Fig. 4. The example of a three-hop path

The one-hop node matrix $NH^{(3)}_{(1)} = \{nh^{(3)}_{(1)\,i,j}\}$ obtained in this way is
\[
NH^{(3)}_{(1)} = \begin{pmatrix} 0&4&4&0&4&4 \\ 4&0&0&0&3&3 \\ 4&0&0&0&0&0 \\ 0&0&0&0&3&3 \\ 3&3&0&3&0&3 \\ 3&3&0&3&3&0 \end{pmatrix} \qquad (14)
\]
4 Numerical Experiment and Analysis

To evaluate the performance of the proposed method, we compare the shortest cost matrix method with Dijkstra's algorithm and the Bellman–Ford algorithm in terms of time complexity. Dijkstra's algorithm is a graph search algorithm that solves the single-source shortest path problem, producing a shortest path tree. For a given source node, the algorithm finds the lowest-cost path between that node and every other node.
Dijkstra's algorithm assigns some initial distance values and tries to improve them step by step (a minimal implementation is sketched after this list):
a. Assign to every node a distance value: set it to zero for the initial node and to infinity for all other nodes.
b. Mark all nodes as unvisited and set the initial node as current.
c. For the current node, consider all its unvisited neighbors and calculate their distance from the initial node. If this distance is less than the previously recorded distance, overwrite the distance.
d. When all neighbors of the current node have been considered, mark it as visited. A visited node will not be checked again, and its recorded distance is final and minimal.
e. Set the unvisited node with the smallest distance as the next "current node" and continue from step c.
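A minimal O(N^2) Python version of steps a–e, operating directly on an adjacency cost matrix in which 0 means "no link" (our own sketch):

```python
import math

def dijkstra(cost, source):
    """O(N^2) Dijkstra on an adjacency cost matrix (0 = no link), as in steps a-e."""
    n = len(cost)
    dist = [math.inf] * n
    dist[source] = 0                          # step a
    visited = [False] * n                     # step b
    for _ in range(n):
        # step e: pick the unvisited node with the smallest distance
        u, best = -1, math.inf
        for i in range(n):
            if not visited[i] and dist[i] < best:
                u, best = i, dist[i]
        if u == -1:
            break
        visited[u] = True                     # step d
        for v in range(n):                    # step c: relax the neighbours of u
            if cost[u][v] > 0 and dist[u] + cost[u][v] < dist[v]:
                dist[v] = dist[u] + cost[u][v]
    return dist
```

For example, dijkstra(A, 0) on the adjacency cost matrix of Fig. 2 returns the minimum costs from node 1, matching the first row of the shortest cost matrix C^(3) above.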
The running time of the algorithm is O(N^2), where N is the number of nodes. The Bellman–Ford algorithm is similar to Dijkstra's algorithm in its basic structure, but instead of greedily selecting the minimum-weight node, it simply relaxes all the edges, and does this N − 1 times, where N is the number of nodes. The Bellman–Ford algorithm runs in O(NE) time, where N and E are the numbers of nodes and edges, respectively. The shortest cost matrix and next hop matrices method calculates the matrix C^(k) and the next hop matrices from the matrix C^(k−1) and the matrix A. The running time of the method is O(kN^2), where k is the maximum number of hops in the network topology. Compared with the other algorithms, this method requires a longer running time; however, because only the agent calculates these matrices and all nodes in the network use the information provided by the agent for routing, the overall efficiency can increase. Fig. 5 shows the running time as a function of the number of connections for Dijkstra's algorithm, the Bellman–Ford algorithm, and the shortest cost matrix and next hop matrices method in the topology of Fig. 2. The number of nodes is 6, the number of edges is 10, the maximum number of hops is 3, and the number of connections varies from 1 to 10. The result in Fig. 5 represents the total running time from the network's point of view when the nodes run an algorithm for routing. The running time of the proposed method is constant regardless of the number of connections.
Fig. 5. The amount of running time in the view of the network
Fig. 6 shows the running time depending on the number of nodes with Dijkstra's algorithm, Bellman–Ford algorithm, and the shortest cost matrix and the next hop matrices method in the view of the network. The number of nodes varies from 6 to 12, the number of edges varies from 10 to 16, the number of maximum hop is 4, and the number of connection is 4. The results of Dijkstra's algorithm and shortest cost matrix and the next hop matrices method are the same. The reason is that both are in direct proportion to the square the number of nodes and Dijkstra's algorithm is in direct proportion to the number of connections and the shortest cost matrix and the next hop matrices method is in direct proportion to the number of maximum hops.
Fig. 6. The amount of running time depending on the number of nodes
In the proposed method, the amount of computation the agent has to perform is large. Therefore, to implement the proposed method, a capable agent node is required (e.g., long battery life and sufficient memory). Normal nodes, excluding the agent, do not need to run a route discovery process to find paths to a destination, and can transfer data without the routing delay required by on-demand routing protocols. The agent sends query messages to the other nodes, receives reply messages, and distributes the shortest cost matrix and the next hop matrices. Because every node does not need full network topology information, the control message overhead of the proposed method is considered to be small compared with the overhead required by general table-driven routing protocols.
5 Conclusion A routing method based on a shortest cost matrix and the next hop matrices is proposed. The agent sends the query messages for the information of topology and link costs and receives reply messages that contain the information of the topology. The agent creates the shortest cost matrix and the next hop matrices based on the received information. Through the distribution of the shortest cost matrix and next hop matrices, every node can maintain the shortest cost matrix and use it for discovering the shortest path to destination nodes. Experiments
show the run-time performance in two scenarios. To verify the efficiency of the proposed method in terms of routing delay, the amount of control messages, and other factors, further simulations are required, and we plan to carry them out. An efficient agent election method is another topic for future study.
Acknowledgements This research was financially supported by the Ministry of Education, Science Technology(MEST) and Korea Institute for Advancement of Technology(KIAT) through the Human Resource Training Project for Regional Innovation and by the Small &Medium Business Administration(SMBA).
References 1. Corson, S., Macker, J.: Mobile Ad hoc Networking (MANET): Routing Protocol Performance. IETF RFC 2501 (January 1999) 2. IETF MANET Working Group (2004), http://www.ietf.org/html.charters/manet-charter.html 3. Perkins, C.E.: Ad hoc Networking. Addison Wesley, Reading (2000) 4. Abolhasan, M., Wysocki, T., Dutkiewicz, E.: A review of Routing Protocols for Mobile Ad Hoc Networks. Ad hoc Networks 2, 1–22 (2002) 5. Raju, J., Garcia-Luna-Aceves, J.J.: A Comparison of On-demand and Table-driven Routing for Ad Hoc Wireless Networks. In: Proceedings of IEEE ICC (June 2000) 6. Perkins, C., Bhagwat, P.: Highly Dynamic Destination-Sequenced Distance-Vector Routing(DSDV) for Mobile Computers. In: Proceedings of ACM SIGMOMM 1994 (1994) 7. Clausen, T., Jacquet, P.: Optimized Link State Routing Protocol (OLSR). IETF RFC3626 (October 2003) 8. Ogier, R.G., Lewis, M.G., Templin, F.L.: Topology Broadcast Based on Reverse-Path Forwarding. IETF RFC3684 (February 2004) 9. Das, S., Perkins, C., Royer, E.: Ad Hoc On Demand Distance Vector (AODV) Routing. IETF RFC3561 (July 2003) 10. Johnson, D.: The Dynamic Source Routing Protocol for Mobile Ad Hoc Networks (DSR). IETF RFC2026 (July 2004) 11. Kim, C., Talipov, E., Ahn, B.: A Reverse AODV Routing Protocol in Ad Hoc Mobile Networks. In: Zhou, X., Sokolsky, O., Yan, L., Jung, E.-S., Shao, Z., Mu, Y., Lee, D.C., Kim, D.Y., Jeong, Y.-S., Xu, C.-Z. (eds.) EUC Workshops 2006. LNCS, vol. 4097, pp. 522–531. Springer, Heidelberg (2006) 12. Miller, L.E.: Multihop connectivity of arbitrary networks (March 2001), http://w3.antd.nist.gov/wctg/netanal/ConCalc.pdf 13. Matouseck, J., Nesetril, J.: Invitation to discrete mathematics, pp. 109–110. Clarendon Press, Oxford (1998) 14. Li, N., Guo, Y., Zheng, S., Tian, C., Zheng, J.: A Matrix-Based Fast Calculation Algorithm for Estimating Network Capacity of MANETs. In: ICW/ICHSN/ICMCS/ SENET 2005, pp. 407–412 (2005)
15. Lee, S., Muhammad, R.M., Kim, C.: A Leader Election Algorithm within candidates on Ad Hoc Mobile Networks. In: Lee, Y.-H., Kim, H.-N., Kim, J., Park, Y.W., Yang, L.T., Kim, S.W. (eds.) ICESS 2007. LNCS, vol. 4523, pp. 728–738. Springer, Heidelberg (2007) 16. Chakeres, I.M., Belding-Royer, E.M.: The utility of hello messages for determining link connectivity. In: The 5th International Symposium on Wireless Personal Multimedia Communication (WPMC), Honolulu, Hawaii, October 2002, vol. 2 (2002)
A Fusion Approach for Multi-criteria Evaluation Jia-Wen Wang and Jing-Wen Chang Department of Electronic Commerce Management, Nanhua University 32, Chung Kcng Li, Dalin, Chiayi, 62248, Taiwan [email protected], [email protected]
Abstract. Multi-Criteria Decision Making (MCDM) methods are famous approaches to structure information and decision evaluation in problems with multiple, conflicting goals. This paper proposes a fusion approach for solving the alternative selection problem. It has three advantages described as below: (1) it uses the entropy method and the ME-OWA operator to get the value of the attribute weight; (2) it uses Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to find out the critical alternative; (3) it can deal with the dynamical weighting problem more rationally and flexibly according to the situational parameter from the user’s viewpoint. In experiments and comparisons, a case study of comparing three houses is adopted. Finally, the comparison results show that the proposed method has better performance than other methods. Keywords: Multiple Criteria Decision Making (MCDM), ME-Order Weighted Averaging (ME-OWA) operator, Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), Entropy.
1 Introduction People make decisions every day. Most of these problems are easy to solve, but as problems become more complex and involve more criteria, decisions become harder to make; this gives rise to multiple criteria decision making problems. Multi-criteria decision-making methods originated in Koopmans' concept of an efficient vector [23], and many researchers have applied MCDM methods to select or evaluate objects. In decision making, the decision maker has to deal with problems that involve multiple assessment criteria; if only a single criterion is used to evaluate an object, the result does not correspond to the real situation. Therefore, MCDM methods use the various attributes of each object to select the best object and to rank the alternatives. Decision makers usually evaluate alternatives by cost and benefit analysis, seeking for each object the minimum cost and the maximum benefit. In multi-attribute decision making assessment methods, the attribute weight values affect the assessment; that is, different attribute weight values lead to different evaluation results.
The weight factors are evaluated using the concept of entropy proposed by Shannon and Weaver [20]. Entropy theory [22] can be used to measure the amount of information in a choice set, which supports the identification of the relative importance of the decision criteria. Entropy measures deal with the interdependency of criteria and the inconsistency of subjective weights; the entropy concept is well suited for measuring the relative contrast intensities of criteria and thus the average intrinsic information transmitted to decision makers [26]. In this paper, we use (1) the entropy method and (2) the ME-OWA operator to obtain the attribute weights, and we then construct a multi-attribute decision-making model with TOPSIS. The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) finds, for each object, the distances to the positive and negative ideal solutions; we apply TOPSIS to rank the objects, integrating the ME-OWA weights assigned to the attributes of each object. The best alternative is the one closest to the positive ideal solution and farthest from the negative ideal solution. The paper uses an evaluation comprising three alternatives, four evaluation criteria, and three decision-makers to assess the cost and benefit of a company office; the result corresponds to the real situation, which indicates that the assessment method is highly practicable. The paper is organized as follows. The next section introduces the literature review and the basic definitions of MCDM, TOPSIS and entropy. In Section 3, we introduce the proposed algorithm and process. In Section 4, the proposed method is illustrated with an example. Finally, some conclusions are drawn.
2 Literary Reviews This section is divided into two parts: the first part reviews Multiple Criteria Decision Making (MCDM) and the second part reviews the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS). 2.1 Multiple Criteria Decision Making (MCDM) The development of MCDM models is often dictated by real-life problems. Most problems are easy to solve, but the more complex the problem and the more criteria involved, the harder it is to solve. Multiple criteria decisions involve alternatives which are usually evaluated on the basis of a hierarchical system of criteria. The primary goal in MCDM is to provide a set of criteria-aggregation methods that take into account the preferential system and the judgments of the decision makers. For example, when buying a car, we usually consider a number of factors, and these assessment criteria are often conflicting, such as price and security. This is a difficult task requiring the implementation of complex processes. Therefore, MCDM methods use the various attributes of each object to select the best object and rank the alternatives. Decision makers usually evaluate alternatives by
cost and benefit analysis as the evaluation criteria, seeking for each object the minimum cost or the maximum benefit. 2.2 Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) The TOPSIS method can handle the ratings of both quantitative and qualitative criteria and select the suitable case effectively. It is based on the principle that the selected alternative should have the shortest distance to the ideal solution and the farthest distance from the negative-ideal solution amongst all available alternatives. In an inter-house comparison problem, a set of houses, the alternatives (X = {X_i, i = 1, 2, ..., m}), is to be compared with respect to a set of criteria (C = {C_j, j = 1, 2, ..., n}). The computational procedure is summarized as follows: STEP 1: The decision matrix has the following format:
D = \left[ x_{ij} \right]_{m \times n}, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n    (1)

Every decision-maker compares the alternatives on these criteria, and the column sums of squares are

\sum_{i=1}^{m} x_{ij}^{2}, \quad j = 1, 2, \ldots, n    (2)

STEP 2: Applying the modified TOPSIS approach, the decision matrix needs to be normalized:

r_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{m} x_{ij}^{2}}}, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n    (3)
STEP 3: Calculate the weighted matrix V of the evaluation criteria:

V = \left[ v_{ij} \right]_{m \times n}, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n    (4)

where

v_{ij} = r_{ij} \times w_{j}    (5)

and furthermore \sum_{j=1}^{n} w_{j} = 1.

STEP 4: The algorithm computes the so-called Positive Ideal Solution (PIS) and Negative Ideal Solution (NIS) for each weighting.
Positive Ideal Solution (PIS):

A^{+} = \operatorname{Max} \left\{ v_{1}^{+}, v_{2}^{+}, \ldots, v_{n}^{+} \right\}    (6)

Negative Ideal Solution (NIS):

A^{-} = \operatorname{Min} \left\{ v_{1}^{-}, v_{2}^{-}, \ldots, v_{n}^{-} \right\}    (7)
STEP 5: We use the Euclidean distance to calculate, for each alternative, the two distances S_{i}^{+} and S_{i}^{-} of X_{i} from A^{+} and A^{-}:

S_{i}^{+} = \sqrt{ \sum_{j=1}^{n} \left( v_{ij} - v_{j}^{+} \right)^{2} }, \quad i = 1, 2, \ldots, m    (8)

S_{i}^{-} = \sqrt{ \sum_{j=1}^{n} \left( v_{ij} - v_{j}^{-} \right)^{2} }, \quad i = 1, 2, \ldots, m    (9)
STEP 6: Compute the separation index C_{i}:

C_{i} = \frac{S_{i}^{-}}{S_{i}^{+} - S_{i}^{-}}    (10)

STEP 7: The alternatives are ranked by this index; the larger the closeness coefficient C_{i}, the higher the priority of the alternative.
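To make Steps 1 to 7 concrete, the following Python sketch runs the whole TOPSIS ranking on a numeric decision matrix. It is only an illustration of the procedure described above, not the authors' implementation; the sample ratings and the equal weights in the usage example are hypothetical, and the separation index is computed exactly as in Eq. (10) above (classical TOPSIS would use S+ + S- in the denominator).

import numpy as np

def topsis_rank(X, w):
    """Rank alternatives (rows of X) against criteria (columns) with weights w,
    following Steps 1-7 above. All criteria are treated as benefit criteria."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    # Step 2: vector-normalize each column (Eq. 3)
    R = X / np.sqrt((X ** 2).sum(axis=0))
    # Step 3: weighted normalized decision matrix (Eqs. 4-5)
    V = R * w
    # Step 4: positive and negative ideal solutions (Eqs. 6-7)
    A_pos, A_neg = V.max(axis=0), V.min(axis=0)
    # Step 5: Euclidean distances to the ideal solutions (Eqs. 8-9)
    S_pos = np.sqrt(((V - A_pos) ** 2).sum(axis=1))
    S_neg = np.sqrt(((V - A_neg) ** 2).sum(axis=1))
    # Step 6: separation index as printed in Eq. (10); classical TOPSIS uses S_pos + S_neg
    C = S_neg / (S_pos - S_neg)
    # Step 7: a larger C means a higher priority
    order = np.argsort(-C)
    return C, order

if __name__ == "__main__":
    scores = [[7, 9, 9, 8], [8, 7, 8, 7], [9, 6, 8, 9]]   # hypothetical ratings for X, Y, Z
    weights = [0.25, 0.25, 0.25, 0.25]                     # equal weights
    C, order = topsis_rank(scores, weights)
    print(C, order)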
3 A Fusion Approach for Multi-criteria Evaluation This paper proposes a fusion approach for solving the alternative selection problem. It can deal with the dynamic weighting problem more rationally and flexibly according to the situational parameter chosen from the user's viewpoint. Section 3.1 describes the ME-OWA algorithm, and Section 3.2 presents the proposed approach. 3.1 Information Fusion Techniques: ME-Order Weighted Averaging (ME-OWA) Yager proposed the ordered weighted averaging (OWA) operator, which can obtain optimal attribute weights based on the rank of the weighting vectors after the aggregation process [22]. An OWA operator of dimension n is a mapping f : R^{n} \to R that has an associated weighting vector W = [w_{1}, w_{2}, \ldots, w_{n}]^{T} with the properties w_{i} \in [0, 1] for i \in I = \{1, 2, \ldots, n\} and \sum_{i \in I} w_{i} = 1, such that

f(a_{1}, a_{2}, \ldots, a_{n}) = \sum_{i \in I} w_{i} b_{i}    (11)
where b_{i} is the i-th largest element in the collection. Thus, it satisfies \operatorname{Min}_{i}[a_{i}] \le f(a_{1}, a_{2}, \ldots, a_{n}) \le \operatorname{Max}_{i}[a_{i}] [1]. Fuller and Majlender [6] transform Yager's OWA equation into a polynomial equation by using Lagrange multipliers. According to their approach, the associated weighting vector can be obtained from (12)-(14):

\ln w_{j} = \frac{j-1}{n-1} \ln w_{n} + \frac{n-j}{n-1} \ln w_{1} \;\Rightarrow\; w_{j} = \sqrt[n-1]{\, w_{1}^{\,n-j} \, w_{n}^{\,j-1} \,}    (12)

and

w_{1} \left[ (n-1)\alpha + 1 - n w_{1} \right]^{n} = \left[ (n-1)\alpha \right]^{n-1} \left[ \left( (n-1)\alpha - n \right) w_{1} + 1 \right]    (13)

if w_{1} = w_{2} = \cdots = w_{n} = \frac{1}{n} \;\Rightarrow\; \operatorname{disp}(W) = \ln n \; (\alpha = 0.5); then

w_{n} = \frac{\left( (n-1)\alpha - n \right) w_{1} + 1}{(n-1)\alpha + 1 - n w_{1}}    (14)

where w_{i} is the weighting vector, n is the number of attributes, and \alpha is the situation parameter.
3.2 The Process of the Proposed Approach
In this paper, we use (1) the entropy method and (2) the ME-OWA operator to obtain the attribute weights, and we then construct a multi-attribute decision-making model with TOPSIS. The process is as follows: STEP 1: From equations (1)-(2), the normalized performance ratings can be calculated. STEP 2: Construct the weighted normalized decision matrix based on (1) the entropy method and (2) the ME-OWA operator. STEP 3: Calculate the PIS and NIS by equations (6)-(7). STEP 4: Calculate the normalized Euclidean distances by equations (8)-(9). STEP 5: Calculate the separation index by equation (10). STEP 6: Rank the alternatives.
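As an illustration of Step 2, the sketch below computes the two kinds of attribute weights used by the proposed approach. The exact entropy formula is not spelled out in the paper, so the standard entropy-weight form (e_j = -(1/ln m) * sum_i p_ij ln p_ij, w_j proportional to 1 - e_j) is assumed here, and the ME-OWA weights are obtained by solving Eqs. (12)-(14) numerically with a simple bisection; the bracketing of the root and the handling of the boundary values of alpha are our assumptions.

import math

def entropy_weights(X):
    """Objective criterion weights via the entropy method (assumed standard form).
    X is an m x n list of strictly positive ratings (alternatives x criteria)."""
    m, n = len(X), len(X[0])
    k = 1.0 / math.log(m)
    d = []
    for j in range(n):
        col = [X[i][j] for i in range(m)]
        s = sum(col)
        e_j = -k * sum((x / s) * math.log(x / s) for x in col)
        d.append(1.0 - e_j)                      # degree of diversification
    total = sum(d)
    return [v / total for v in d]

def me_owa_weights(n, alpha, tol=1e-12):
    """Maximal-entropy OWA weights for n criteria and situation parameter alpha,
    following Eqs. (12)-(14); valid bracketing assumed for 0 < alpha < 1."""
    if abs(alpha - 0.5) < 1e-9:
        return [1.0 / n] * n                     # equal weights, disp(W) = ln n
    if alpha >= 1.0:
        return [1.0] + [0.0] * (n - 1)
    if alpha <= 0.0:
        return [0.0] * (n - 1) + [1.0]
    if alpha < 0.5:                              # symmetry: reverse the weights for 1 - alpha
        return list(reversed(me_owa_weights(n, 1.0 - alpha, tol)))
    a = (n - 1) * alpha
    f = lambda w1: w1 * (a + 1 - n * w1) ** n - a ** (n - 1) * ((a - n) * w1 + 1)  # Eq. (13)
    lo, hi = 1.0 / n + 1e-9, 1.0 - 1e-9          # the relevant root lies above 1/n
    for _ in range(200):                          # bisection on Eq. (13)
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    w1 = 0.5 * (lo + hi)
    wn = ((a - n) * w1 + 1) / (a + 1 - n * w1)   # Eq. (14)
    return [(w1 ** (n - j) * wn ** (j - 1)) ** (1.0 / (n - 1)) for j in range(1, n + 1)]  # Eq. (12)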
4 Verification and Comparison A case study comparing three houses (X, Y, Z) was conducted in a company to examine the applicability of the modified TOPSIS approach; the company has three decision-makers (D1, D2, D3) who select the most suitable house as the office. Four selection criteria, Size (C1), Transportation (C2), Condition (C3), and Finance (C4), were identified as the evaluation criteria.
The entropy values for the objective criteria are shown in Table 1. The weights of the ME-OWA operator are shown in Table 2. The Positive Ideal Solution (PIS) and Negative Ideal Solution (NIS) for each weighting are given in Table 3; the Positive Ideal Solution consists of the best values of the evaluation criteria, and the Negative Ideal Solution consists of the worst values. Table 1. The entropy under objective criteria
           C1        C2        C3        C4
X         -0.200    -0.361    -0.365    -0.367
Y         -0.299    -0.328    -0.268    -0.218
Z         -0.296    -0.110    -0.268    -0.336
e_j        0.724     0.735     0.820     0.838
1 - e_j    0.276     0.265     0.180     0.162
From Table 1, we can see that the four criteria are ranked as C1 > C2 > C3 > C4.
Table 2. The Weights of the ME-OWA operator
Weights                    C1       C2       C3       C4
α = 0.5 (equal weight)     0.25     0.25     0.25     0.25
α = 0.6                    0.417    0.233    0.131    0.074
α = 0.7                    0.494    0.237    0.114    0.055
α = 0.8                    0.596    0.252    0.106    0.045
α = 0.9                    0.764    0.182    0.043    0.010
α = 1                      1        0        0        0
Table 3. Positive Ideal Solution and Negative Ideal Solution of each weight

Weight           C1 (PIS / NIS)     C2 (PIS / NIS)     C3 (PIS / NIS)     C4 (PIS / NIS)
Equal weight     0.1929 / 0.0410    0.2184 / 0.0524    0.1675 / 0.0799    0.1867 / 0.0996
α = 0.6          0.3214 / 0.0683    0.2039 / 0.0418    0.0877 / 0.0418    0.0549 / 0.0293
α = 0.7          0.3810 / 0.0809    0.2073 / 0.0498    0.0762 / 0.0364    0.0410 / 0.0219
α = 0.8          0.4602 / 0.0978    0.2201 / 0.0528    0.0713 / 0.0340    0.0336 / 0.0179
α = 0.9          0.5895 / 0.1252    0.1591 / 0.0382    0.0291 / 0.0139    0.0077 / 0.0041
α = 1            0.7715 / 0.1639    0 / 0              0 / 0              0 / 0
Table 4. PIS

Weight           X         Y         Z
Equal weight     0.0291    0.0260    0.1478
α = 0.6          0.0283    0.0643    0.1556
α = 0.7          0.0308    0.0902    0.1956
α = 0.8          0.0367    0.1314    0.2664
α = 0.9          0.0290    0.2155    0.3737
α = 1            0.0246    0.3692    0.5952
Table 5. NIS

Weight           X         Y         Z
Equal weight     0.0264    0.0280    0.0363
α = 0.6          0.0670    0.0380    0.0262
α = 0.7          0.0930    0.0515    0.0265
α = 0.8          0.1347    0.0740    0.0294
α = 0.9          0.2173    0.1189    0.0149
α = 1            0.3692    0.2033    0
Table 6. The results of comparing the three houses

        Equal weight    α = 0.6    α = 0.7    α = 0.8    α = 0.9    α = 1
X       1               2          1          1          1          1
Y       3               3          2          2          2          2
Z       2               1          3          3          3          3
The ranking results are shown in Table 6. The Euclidean distances of each alternative from the ideal solutions, calculated for each weighting, are given in Table 4 (PIS) and Table 5 (NIS), with the PIS and NIS themselves listed in Table 3.
5 Conclusions From the MCDM point of view, although intuition and simple rules are still popular for decision making, they may be dangerously inaccurate; the development of MCDM models is therefore often dictated by real-life problems. A case study comparing three houses (X, Y, Z) was conducted in a company to examine the applicability of the modified TOPSIS approach; the company has three decision makers (D1, D2, D3) who select the most suitable house as the office, using four selection criteria: Size (C1), Transportation (C2), Condition (C3) and Finance (C4). From the results, we can see that house X is the best choice. This paper integrates the ME-OWA operator with TOPSIS. We use (1) the entropy method and (2) the ME-OWA operator to obtain the attribute weights, and we then construct a multi-attribute decision-making model with TOPSIS. The approach can deal with the dynamic weighting problem more rationally and flexibly according to the situational parameter chosen from the user's viewpoint.
Acknowledgements The first author gratefully appreciates the financial support from National Science Council, Taiwan, ROC under contract NSC98-2221-E-343-002.
References [1] Ahn, B.S.: On the properties of OWA operator weights functions with constant level of orness. IEEE Transactions on Fuzzy System 14(4) (2006) [2] Bandyopadhyay, S., Maulik, U.: An evolutionary technique based on K-Means algorithm for optimal clustering in RN. Information Sciences 146, 221–237 (2002) [3] Chen, S.M., Yu, C.A.: New method to generate fuzzy rules from training instances for handling classification problems. Cybernetics and Systems: An International Journal 34(3), 217–232 (2003) [4] Das, S.: Filters, wrappers and a boosting-based hybrid for feature selection. In: Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, pp. 74–81 (2001)
[5] Dunham, M.H.: Data mining introductory and advanced topics. Pearson Education, Inc., London (2002) [6] Fuller, R., Majlender, P.: An analytic approach for obtaining maximal entropy OWA operator weights. Fuzzy Sets and Systems 124, 53–57 (2001) [7] Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006) [8] Hong, T.P., Chen, J.B.: Finding relevant attributes and membership functions. Fuzzy Sets ond Systems 103(3), 389–404 (1999) [9] Hong, T.P., Lee, C.Y.: Induction of fuzzy rules and membership functions from training examples. Fuzzy Sets and Systems 84(1), 33–47 (1996) [10] Jain, A.K., Chandrasekaran, B.: Dimensionality and Sample Size Considerations in Pattern Recognition Practice. In: Krishnaiah, P.R., Kanal, L.N. (eds.) Handbook of Statistics 2, pp. 835–855. North-Holland, Amsterdam (1982) [11] Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997) [12] Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artificial Intelligence 159(1-2), 49–74 (2004) [13] Liu, X.: Three methods for generating monotonic OWA operator weights with given orness level. Journal of Southeast University 20(3) (2004) [14] MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observations. In: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967) [15] McClave, J.T., Benson, P.G., Sincich, T.: Statistics for business and economics, 9th edn. Prentice Hall, Englewood Cliffs (2005) [16] Nauck, D., Kruse, R.: Obtaining interpretable fuzzy classification rules from medical data. Artif. Intell. Med. 16, 149–169 (1999) [17] O’Hagan, M.: Aggregating template or rule antecedents in real-time expert systems with fuzzy set logic. In: Proc. 22nd Annu. IEEE Asilomar Conf. Signals, Systems, Computers, Pacific Grove, CA, pp. 681–689 (1988) [18] Peña-Reyes, C.A., Sipper, M.: A fuzzy genetic approach to breast cancer diagnosis. Artif. Intell. Med. 17, 131–155 (2000) [19] Setiono, R.: Generating concise and accurate classification rules for breast cancer diagnosis. Artif. Intell. Med. 18, 205–219 (2000) [20] Shannon, C.E., Weaver, W.: The mathematical theory of communication. The University of Illinois Press, Urbana (1947) [21] Tsai, F.M., Chen, S.M.: A new method for constructing membership functions and generating fuzzy mles for fuzzy classification systems. In: Proceedings of 2002 Tenth National Conference on Fuzzy Theory and its Application, Hsinchu. Taiwan Republic of China (2002) [22] Wang, T.C., Lee, H.D., Chang, M.C.S.: A fuzzy TOPSIS approach with entropy measure for decision-making problem. In: IEEE International Conference on Industrial Engineering and Engineering Management, Singapore, December 2007, pp. 124–128 (2007) [23] Wu, T.P., Chen, S.M.: A new method for constructing membership functions and fuzzy rules from training examples. IEEE Transactions on Systems, Man and Cybernetics-Part B 29(1), 25–40 (1999)
[24] Yager, R.R.: On ordered weighted averaging aggregation operators in multi-criteria decision making. IEEE Trans. on SMC 18, 183–190 (1988) [25] Yoon, K.: Systems selection by multiple attribute decision making. Ph.D. dissertation, Kansas State University Press, Manhattan (1980) [26] Zeleny, M.: The attribute-dynamic attitude model (ADAM). Management Science 23, 12–26 (1976)
Algorithmic Aspects of the Reachability of Conflicting Chip Firing Game Le Manh Ha, Nguyen Anh Tam, and Phan Thi Ha Duong Hue University’s College of Education, 32 Le Loi, Hue, Vietnam Vietnam National University, 144 Xuan Thuy str, Cau Giay Distric, Hanoi, Vietnam Institute of Mathematics, 18 Hoang Quoc Viet, Hanoi, Vietnam [email protected], [email protected], [email protected]
Abstract. Chip-firing game is a cellular automaton model on finite directed graphs often used to describe the phenomenon of selforganized criticality. Here we investigate a variation of the chip-firing game on a directed acyclic graph G = (V, E). Starting from a given chip configuration, we can fire a vertex v by sending one chip along one of its outgoing edges to the corresponding neighbors if v has at least one chip. We study the reachability of this system by considering the order structure of its configuration space. Then we propose an efficient algorithm to determine this reachability. Keywords: Conflicting chip firing game, dynamic system, energies, multi agents system, order filter, order structure, reachability, self organization.
1
Introduction
Self organization is relevant in many complex systems, for example, in physics [2], in chemistry [23], and recently in complex network [9,3]. Among many approaches to study this phenomenon, multi-agent systems can manifest self organization and complex behaviors even when the individual strategies of all their agents are simple. In this context, the Chip Firing Game was introduced in 1990 [4] and it has rapidly become an important and interesting object of study in structural combinatorics. The reason for this is partly due to its relation with the Tutte polynomial [6], the lattice theory [19] and the group theory [5], but also because of the contribution of people in theoretical physics who know it as the (Abelian) sandpile model [2]. A chip firing game [4] is defined over a (directed) multigraph G = (V, E), called the support or the base of the game. A weight w(v) is associated with each vertex v ∈ V , which can be regarded as the number of chips stored at
This work is supported in part by UMI 209 UMMISCO, MSI-IFI, Vietnam.
the site v. The CFG is then considered as a discrete dynamical system with the following rule, called the firing rule: a vertex containing at least as many chips as its out-degree (its number of outgoing edges) transfers one chip along each of its outgoing edges. A configuration of a CFG is a composition of n into V, where n is the total number of chips, which remains constant throughout the firing process. The set of all configurations reachable from an initial configuration O is called the configuration space. It is known from [19] that the configuration space of a CFG defined on a support graph G with no closed component is a graded lattice. From the first definition of CFG, many variants of this system were introduced in different domains: the game of cards [12,14] in the context of distributed systems, the rotor-router model [20,1] in random walks, the color chip firing game in lattice theory [21], and the conflicting chip firing game (CCFG) in complex networks and Petri nets [10]. The last one has the following rule: a vertex v is firable if it contains at least one chip, and its firing is carried out by sending one chip along one of its outgoing edges. The main result of our work is a study of the reachability of this CCFG: we give an algorithm determining whether a configuration is reachable from another, with a complexity that depends only on the support graph of the game and not on the total number of chips or on the size of the configuration space of the game (which may be exponential in the size of the support graph). Recall that the reachability of a system is very important in the study of complex systems because it gives information on the controllability and observability of the system; this problem is in general very difficult because the size of the configuration space of a system is very large. The paper is structured as follows. We recall in Section 2 some basic definitions of directed acyclic graph theory and of the Conflicting Chip Firing Game. Then, in Section 3, we first present the notions of order and filters, and then we introduce the notion of (a collection of) energies, which is the key to a characterization of the order structure of the configuration space of CCFG. The algorithm illustrating this characterization is presented and analyzed in Section 4.
2
Definitions and Notations
We recall here some definitions and basic results. A directed graph or digraph is a pair G = (V, E) of: – a set V , whose elements are called vertices or nodes, – a set E of ordered pairs of vertices, called arcs, directed edges, or arrows. An arc e = (x, y) is considered to be directed from x to y; y is called the head and x is called the tail of the arc, the out-degree d+ (x) of a vertex x is the number of arcs starting at x and the in-degree d− (x) is the number of arcs ending at x. A vertex with deg − (v) = 0 is called a source, and a vertex with deg + (v) = 0 is called a sink. A directed acyclic graph, occasionally called a DAG, is a directed graph without directed cycles. Throughout this paper, G = (V, E) is a DAG.
A topological sort of a DAG is an ordering v_1, v_2, ..., v_n of its vertices such that for every edge (v_i, v_j) of the graph we have i < j. We can see at once that a directed graph G has a topological sort if and only if it is acyclic. Next, to represent configurations of CCFG, we use integer compositions, whose explicit definition is given as follows: Definition 1. Let n be a positive integer and let S be a set of k elements. A composition of n into S is an ordered sequence (a_1, a_2, ..., a_k) of non-negative integers such that a_1 + a_2 + ... + a_k = n. The integer a_i is called the weight of i. It is easy to check that the number of compositions of n into S is \binom{n+k-1}{n}. Now, we introduce a variation of the chip-firing game on a directed acyclic graph G = (V, E). Definition 2. [24] The conflicting chip firing game (CCFG) on a DAG G = (V, E) with n chips, denoted by CCFG(G, n), is a dynamical model defined as follows: each configuration is a composition of n into V; an edge (u, v) of E is firable if u has at least one chip; the evolution rule (firing rule) of this game is the firing of one firable edge (u, v), which means that the vertex u gives one chip to the vertex v. We also denote by CCFG(G, n) the set of all configurations of CCFG(G, n), called the configuration space of this game. This set is exactly the set of compositions of n into V. Definition 3. [24] Given two configurations a and b of a CCFG(G, n), we say that b is reachable from a, denoted by b ≤ a, if b can be obtained from a by a firing sequence (if the firing sequence is empty, a = b). In particular, we write a → b if b is obtained from a by applying the firing rule once.
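For illustration, here is a minimal Python sketch of Definitions 2 and 3: a configuration stores the chips on each vertex, a firable edge is one whose tail holds a chip, and firing it moves a single chip along the edge. The graph and configuration used in the example are toy data, not taken from the paper.

# assumed representation: a DAG as an adjacency list {vertex: [out-neighbours]}
# and a configuration as a dict {vertex: number of chips}

def firable_edges(G, conf):
    """Edges (u, v) that may fire: the tail u holds at least one chip."""
    return [(u, v) for u, nbrs in G.items() if conf[u] >= 1 for v in nbrs]

def fire(conf, edge):
    """Apply the CCFG firing rule once: u sends one chip to v along (u, v)."""
    u, v = edge
    new_conf = dict(conf)
    new_conf[u] -= 1
    new_conf[v] += 1
    return new_conf

# toy example with 2 chips on the 3-vertex DAG a -> b, b -> c, a -> c
G = {"a": ["b", "c"], "b": ["c"], "c": []}
conf = {"a": 2, "b": 0, "c": 0}
print(firable_edges(G, conf))          # [('a', 'b'), ('a', 'c')]
print(fire(conf, ("a", "b")))          # {'a': 1, 'b': 1, 'c': 0}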
3
CCFG and the Order Structure
The goal of this section is to give an explicit definition of energy of configurations which is an important characterization to show the partial order structure of the configuration space of the game. We recall the notation of partially ordered set and some properties of order filter. For more details about order theory, see e.g [7]. Besides, we used these notations in the set of vertices of a DAG. An order relation or partial order relation is a binary relation ≤ over a set, such that for all x, y and z in this set, x ≤ x (reflexivity), x ≤ y and y ≤ z implies x ≤ z (transitivity), and x ≤ y and y ≤ x implies x = y (antisymmetry). The set is then called a partially ordered set or, for short, a poset. Let P be a poset and let Q be a subset of V . Then Q inherits an order relation from V ; given x, y ∈ Q, x ≤ y in Q if and only if x ≤ y in P . We say in these circumstances that Q has the order induced from P and call it a subposet of P . The order of vertices in a DAG G = (V, E) is defined as follows:
Definition 4. [8] A walk in a directed graph is a sequence of vertices and edges v_0, e_1, v_1, ..., e_k, v_k such that for each 1 ≤ i ≤ k, e_i goes from v_{i-1} to v_i. A (directed) trail is a walk without repeated edges, and a (directed) path is a trail without repeated vertices. Let x, y be in V. We define a binary relation ≤ on V as follows: for all x, y ∈ V, y ≤ x if and only if either x = y or there is a path from x to y. The following result is straightforward from the definition of the relation ≤ on the set of vertices of a DAG. Lemma 1. If G = (V, E) is a DAG, then (V, ≤) is a poset. Characterizing the order relation of a dynamical system is in general difficult. We will show that the configuration space of CCFG is a poset. This order also resembles Lamport's happened-before relation between events in an asynchronous distributed system [17]. The characterization of this order in CCFG is unlike that of the chip-firing game, where the order of configurations is characterized by the shot vector, that is, the number of applications of the firing rule to each vertex, and where configurations must be reachable from the same given configuration. In CCFG, the order of configurations is characterized by the energies collection, which is attached to the configurations themselves and reflects the firing process: the initial energies are maximal, and the more firings are applied, the more the energies decrease. Moreover, configurations in CCFG do not necessarily start from the same given configuration. We now recall the definitions of order and energy of configurations in CCFG(G, n). Definition 5. Let V be a poset, and let Q ⊆ V. Q is an order filter or, for short, a filter (alternative terms are increasing set or up-set) if, whenever x ∈ Q, y ∈ V and y ≥ x, we have y ∈ Q. We denote by F(V) the set of all filters of V. Definition 6. [13] Let G = (V, E) be a DAG and let a = (a_1, a_2, ..., a_{|V|}) be a composition of n on V. The energy e(A, a) of a on a subset A ⊆ V is the quantity e(A, a) = \sum_{i \in A} a_i; the set (e(A, a))_{A \in F(V)} is called the energies collection of a, and the energy E(a) of a is the quantity E(a) = \sum_{A \in F(V)} e(A, a). Next, we show that the configuration space of CCFG has an order structure, as does the configuration space of many other dynamical systems. In [13], we presented a characterization of the order structure of CCFG using the energies collection. Here we give another, shorter proof. Theorem 1. [13] Let a and b be two configurations of CCFG(G, n). Then a ≥ b in CCFG(G, n) if and only if e(A, a) ≥ e(A, b) for all filters A ∈ F(V). Proof. The necessary condition is obvious. We prove the sufficient condition by showing that there exists a firing sequence from a to b. We proceed by induction on the cardinality of V. The base case, |V| = 1, is obvious.
We consider two cases:
– Case 1: there exists A ∈ F(V), A ≠ ∅, A ≠ V, such that e(A, a) = e(A, b). On the induced graph G_1 = G[A], every filter A' ∈ F(A) is also a filter of V, i.e. A' ∈ F(V). By assumption e(A', a) ≥ e(A', b) for every filter A' ∈ F(A), so a ≥ b on G[A] by the induction hypothesis. Now consider the induced graph G_2 = G[V \ A]. Clearly, if B ∈ F(V \ A) then A ∪ B ∈ F(V). Let B be any filter of F(V \ A); since e(A ∪ B, a) ≥ e(A ∪ B, b) and e(A, a) = e(A, b), we have e(B, a) ≥ e(B, b). Therefore b is also reachable from a on the induced graph G[V \ A] by the induction hypothesis.
– Case 2: e(A, a) ≥ e(A, b) + 1 for every filter A ∈ F(V), A ≠ ∅, A ≠ V. Then there exists a sink v such that a(v) < b(v). Let u be a neighbour of v such that (u, v) ∈ E. Let c be the configuration defined by c(u) = b(u) + 1, c(v) = b(v) − 1 and c(w) = b(w) for all w ≠ u, v. It is easy to see that c → b. From this, E(c) > E(b) and then E(a) − E(c) < E(a) − E(b). It remains to prove that e(A, a) ≥ e(A, c) for all A ∈ F(V); then, by induction on E(a) − E(b), we conclude that a ≥ b. Let A ∈ F(V) be an arbitrary filter. We need only consider two cases:
+ If v ∈ A then u ∈ A, since u ≥ v in (V, ≤) and A ∈ F(V). Hence e(A, c) = e(A, b) ≤ e(A, a).
+ If v ∉ A then e(A, c) = e(A, b) + 1 if u ∈ A, and e(A, c) = e(A, b) otherwise. Therefore e(A, c) ≤ e(A, b) + 1 ≤ e(A, a) (since e(A, a) > e(A, b)).
4
Algorithms Determining the Reachability of CCF G(G, n)
In this section, we present the algorithm which takes two configurations a and b of a CCFG(G, n) and answers whether b can be obtained from a by applying the firing rule. The reachability problem of dynamic systems is in general very hard because the size of the configuration space is very large. In our case, systems related to the chip firing game model are based on a fixed support graph G of relatively small size, for example the number of nodes of a local network in the rotor-router model, or the number of machines in a distributed system. On the other hand, the total number of chips is very large, which makes the configuration space huge; it can even be exponential in the size of the graph G. Our algorithm nevertheless has a complexity that depends only on the size of the graph G and not on n: it is polynomial in the number of filters of G and constant in n. It is therefore very efficient for problems on a fixed support graph. The main algorithm is composed of two phases, the first of which, and the most important, is a pretreatment of the graph G that produces the set of all filters of V.
Fig. 1. The configuration space of a CCF G with 2 chips
Algorithm 1 (generating filters).
  Input: the adjacency matrix E = (e_ij)_{m×m} of the graph G = (V, E) with |V| = m.
  Output: a file containing the filters F(V).

Algorithm 2 (comparing two configurations of CCFG(G, n)).
  Input: two compositions of n on V: a = (a_1, ..., a_m) and b = (b_1, ..., b_m).
  Output: Yes if a ≥ b, No otherwise.

Algorithm 2 runs as follows:

  begin
    for each filter F of F(V):
      compute the energies of a and b on F;
      if e(F, a) < e(F, b) then answer No and stop;
    // after this loop, e(F, a) >= e(F, b) for every F in F(V)
    answer Yes
  end
Thus, in the pretreatment process, Algorithm 1 is run once, and for each pair (a, b) we only need to run Algorithm 2. The running time for comparing p pairs equals the running time of Algorithm 1 plus p times the running time of Algorithm 2.
4.1
Algorithm Generating Filters
To construct all filters of the poset V, we use the bijection between the set of filters F(V) and the set of antichains A(V) (recall that an antichain A of a poset V is a subset of V such that any two distinct elements are incomparable, and that a filter F can be obtained from an antichain A by taking all elements y ∈ V such that there exists x ∈ A with x ≤ y). Our Algorithm 1 is then described as follows:
- Procedure Floyd: takes the adjacency matrix M[m][m] (m = |V|) of the graph G and transforms it into a new matrix M[m][m] such that for all i, j ∈ V, M[i][j] is the shortest length of a chain from i to j, and it equals 2m if there is no chain from i to j. Note that this definition means that if M[i][j] < 2m then j ≤ i in the poset V.
- Procedure Filter(h): computes all antichains of V and, for each antichain A, computes the corresponding filter F and saves F to a file.
+ An antichain is denoted by A[]; for an integer h, u = A[h] is an element of A. We define an array index[i] for all i ∈ V as follows: index[i] = 0 means that i is incomparable with all elements of A (the default); index[i] = −u means that i ≤ u; index[i] = u means that i > u.
+ We begin with an antichain A of one element and write the corresponding filter of this antichain. After that, we add one element to A; this element must have index 0.
+ Each time we add one element to A, we index all elements comparable with an element of A as follows (procedure index):
· At the beginning, all elements i ∈ V are indexed by 0.
· Each time we add one element u to A (u = A[h]), we update the index for all i ∈ V:
if index[i] = 0 then: if i ≤ u (that is, M[u][i] < 2m) then index[i] := −u; if i > u (that is, M[i][u] < 2m) then index[i] := u.

Algorithm. The algorithm includes three subfunctions and one main function.

1) The Floyd function:

procedure Floyd
  for i := 1 to m do                       // translate the adjacency matrix into a weight matrix
    for j := 1 to m do
      if j = i then M[i][j] := 0;
      else if M[i][j] == 0 then M[i][j] := 2 * m;
    end for;
  end for;
  for k := 1 to m do
    for i := 1 to m do
      for j := 1 to m do
        if M[i][k] + M[k][j] < M[i][j] then
          M[i][j] := M[i][k] + M[k][j];
        end if;
      end for;
    end for;
  end for;
end;
2) The index function:

procedure index(integer h)
  k := AC[h]; index[k] := -k;
  for i := 1 to m do
    if index[i] == 0 then
      if M[k][i] < 2 * m then index[i] := k;
      if M[i][k] < 2 * m then index[i] := -k;
    end if;
  end for;

3) The Filter function:

procedure Filter(integer h)
  index(h);                                   // label the vertices comparable with AC[h]
  for i := 1 to m do
    if index[i] < 0 then print i;             // write the filter generated by the antichain
  l := 0;                                     // count the vertices incomparable to AC[1], ..., AC[h]
  for i := 1 to m do
    if index[i] == 0 then l := l + 1;
  if l == 0 then return;                      // all vertices are comparable: exit
  else
    for i := AC[h] + 1 to m do                // vertices incomparable to the first h
      if index[i] == 0 then                   // elements added to the array AC[]
        AC[h + 1] := i;
        Filter(h + 1);
        for j := 1 to m do
          if index[j] == -i OR index[j] == i then index[j] := 0;   // unlabel the vertices that were
        end for;                              // labelled as comparable with AC[h + 1]
      end if;
    end for;
  end else;
4) The main function:

procedure main
  Floyd();
  for i := 1 to m do                          // assign the first element of the antichain
    AC[1] := i;
    for j := 1 to m do DS[j] := 0; end for;   // reset the labels, then generate the filters
    Filter(1);                                // created by antichains whose first element is i
  end for;
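For readers who want an executable reference, the following Python sketch enumerates the order filters of Definition 5 directly, by brute force over all subsets of V instead of the antichain recursion used above, so it is exponential in m and intended only as a cross-check on small graphs. The adjacency-matrix and bit-mask representation is our own choice, not the authors' file format.

from itertools import product

def reachable(adj):
    """M[i][j] = True iff there is a path from i to j (transitive closure)."""
    m = len(adj)
    M = [[bool(adj[i][j]) for j in range(m)] for i in range(m)]
    for k in range(m):
        for i in range(m):
            for j in range(m):
                M[i][j] = M[i][j] or (M[i][k] and M[k][j])
    return M

def all_filters(adj):
    """All order filters (up-sets) of the vertex poset, where j <= i iff there is
    a path from i to j (Definition 4), so y >= x iff there is a path from y to x."""
    m = len(adj)
    M = reachable(adj)
    filters = []
    for mask in range(1 << m):
        Q = [i for i in range(m) if mask >> i & 1]
        # Q is a filter if, whenever x is in Q and y >= x (path y -> x), y is also in Q
        ok = all(not M[y][x] or (mask >> y & 1) for x in Q for y in range(m))
        if ok:
            filters.append(Q)
    return filters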
Theorem 2. Algorithm 1 writes the list of all filters of V with complexity O(m³ + m·|F(V)|), where m = |V|.
Proof. We need to show that the algorithm generates all filter or equivalently it generates all antichain and each filter (or equivalently each antichain) of V is generated exactly once. Indeed, let us analyse how an antichain A = {p1 , . . . , ph } of h elements is generated. We begin with an antichain A := {p1 } (AC[1] = p1 ), after that we add element {p2 } in the ”for loop” for i := AC[1] + 1 to m (p2 > p1 and it is incomparable to p1 ), now we have A := {p1 , p2 }. Similarly, in the ”for loop” for i := AC[h − 1] + 1 to m, we add AC[h] = ph to A. This function generates antichain A = {p1 , . . . , ph }. Moreover, the vertices in an antichain is assigned by increasing index and whenever generating a new antichain we add one element which is incomparable to elements in previous antichains. So, each filter is generated exactly once and we complete the justification of the algorithm. Now, we are analyzing the running time of algorithm. The FloydWarshall function runs in time O(m3 ), where |V | = m, because it must run three ”for loops”. The second ”for loop” in the main function has m steps and each step runs in time O(m). Whenever we call Filter function then the index function is called also in Filter function. The index function runs in time O(m) so the complexity of the algorithm is O(m3 + m2 + m.|F(V )|) and hence it equals to O(m3 +m.|F(V )|). 4.2
Algorithm Comparing Two Configurations
Data:
  1. two one-dimensional arrays a[m] and b[m] holding the two configurations a and b;
  2. a file F containing the filters;
  3. a variable C used to decide whether a ≥ b.

Algorithm:

procedure main
  C := 1;
  read F;
  while (C == 1 AND F is not the last filter)
    read the next filter A from F;
    x := 0; y := 0;                    // energies of a and b on A
    for i in A
      x := x + a[i];                   // energy of configuration a on the filter A
      y := y + b[i];                   // energy of configuration b on the filter A
    if y > x then C := -1;             // b needs more chips on A than a has: not reachable
    else if A is the last filter then C := 0;
  end while
  if C == 0 then a >= b; else b cannot be obtained from a;
The complexity of Algorithm 2 depends on the number of elements of each filter and on whether a ≥ b. In the worst case, Algorithm 2 runs in time
|V|·|F(V)|. In this case a ≥ b, so the algorithm computes the energies of a and b on every filter F ∈ F(V).
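A corresponding Python sketch of Algorithm 2 and the criterion of Theorem 1, assuming the list of filters produced by the sketch above (vertices indexed from 0, configurations given as plain lists):

def energy(filter_Q, conf):
    """e(Q, a): total number of chips of configuration `conf` on the filter Q."""
    return sum(conf[i] for i in filter_Q)

def reachable_from(a, b, filters):
    """Theorem 1 / Algorithm 2: b is reachable from a iff e(Q, a) >= e(Q, b)
    for every order filter Q."""
    return all(energy(Q, a) >= energy(Q, b) for Q in filters)

# usage with the helpers sketched above:
# adj = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]        # DAG 0 -> 1, 0 -> 2, 1 -> 2
# F = all_filters(adj)
# a, b = [2, 0, 0], [0, 1, 1]
# print(reachable_from(a, b, F))                  # True: firing (0,1) then (0,2) turns a into b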
5
Conclusion
The Chip Firing Game systems is special Petri Nets [10], particularly CCFG is a (finite) state machine which is an ordinary Petri net such that each transition has exactly one input place and exactly one output place. The reachability problem for Petri nets was soon observed by Keller [16] that many other problems were recursively equivalent to the reachability problem and so it became a central issue of net theory. The complexity of the reachability problem has been open for many years. However, tight complexity bounds of the reachability problem are known for many net classes [11], [25], [18]. The necessary and sufficient condition for reachability of state machine has been studied in [22] by using the incidence matrix and state equation. An extension version of state machine is BPP-net that is a Petri net (N, M0 ) in which every transition has exactly one input place. BPP stands for Basic Parallel Process. It can be viewed as an extended model of Conflicting Chip Firing Game and reachability is NP-complete for BPP-net [15]. In this paper, we have solved this problem in the case the support graph is a directed acyclic graph (DAG) by using the notation of energies collection of configurations on the filters order of the vertices set V (G), and we also have constructed algorithm to determine order between two configuration of CCFG. Our algorithm is very effective in the case of small support graphs.
References 1. Holroyd, A.E., Levine, L., Meszaros, K., Peres, Y., Propp, J., Wilson, D.B.: Chip-firing and rotor-routing on directed graphs. In and Out of Equilibrium 2, Progr. Probab. 60, 331–364 (2008) 2. Bak, P., Tang, C., Wiesenfeld, K.: Self-organized criticality. Phys. Rev. A 38, 364–374 (1988) 3. Bianconi, G., Marsili, M.: Clogging and self-organized criticality in complex networks. Phys. Rev. E 70, 035105(R) (2004) 4. Bjorner, A., Lovász, L., Shor, W.: Chip-firing games on graphs. E. J. Combinatorics 12, 283–291 (1991) 5. Cori, R., Rossin, D.: On the sandpile group of a graph. Eur. J. Combin. 21(4) 6. López, M., Merino, C.: Chip firing and the Tutte polynomial. Annals of Combinatorics 1(3), 253–259 (1997) 7. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order. Cambridge University Press, Cambridge (1990) 8. Diestel, R.: Graph Theory, Electronic edn., New York (2005) 9. Dixit, S.: Self-organization of complex networks applied to wireless world systems. Wirel. Pers. Commun. 29(1-2), 63–70 (2004) 10. Le, M.H., Pham, T.A., Phan, T.H.D.: On the relation between chip firing games and Petri nets. In: Proceedings of the IEEE-RIVF International Conference on Computing and Communication Technologies, pp. 328–335 (2009)
11. Lipton, R.J., Cardoza, E., Meyer, A.R.: Exponential space complete problems for petri nets and commutative semigroups. In: 8th Annual Symposium on Theory of Computing, pp. 50–54 (1976) 12. Goles, E., Morvan, M., Phan, H.D.: Lattice structure and convergence of a game of cards. Ann. of Combinatorics 6, 327–335 (2002) 13. Le, M.H., Phan, T.H.D.: Order structure and energy of conflicting chip firing game. Acta Math. Vietnam. (2008) (to appear) 14. Huang, S.-T.: Leader election in uniform rings. ACM Trans. Programming Languages Systems 15(3), 563–573 (1993) 15. Huynh, D.T.: Commutative grammars: The complexity of uniform word problems. Information and Control 57(1), 21–39 (1983) 16. Keller, R.M.: A fundamental theorem of asynchronous parallel computation. In: Tse-Yun, F. (ed.) Parallel Processing. LNCS, vol. 24, pp. 102–112. Springer, Heidelberg (1975) 17. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM (CACM) 21(7), 558–565 (1978) 18. Jones, N.D., Landweber, L.H., Lien, Y.E.: Complexity of some problems in petri nets. Theoretical Computer Science 4, 277–299 (1977) 19. Latapy, M., Phan, H.D.: The lattice structure of chip firing games. Physica D 115, 69–82 (2001) 20. Levine, L., Peres, Y.: The rotor-router shape is spherical. Math. Intelligence 27(3), 9–11 (2005) 21. Magnien, C., Phan, H.D., Vuillon, L.: Characterization of lattices induced by (extended) chip firing games. Discrete Math. Theoret. Comput. Sci. AA, 229– 244 (2001) 22. Murata, T.: Petri nets: properties, analysis and applications. Proceedings of the IEEE 77(4), 541–580 (1989) 23. Epstein, I.R., Pojman, J.A., Steinbock, O.: Introduction: Self-organization in nonequilibrium chemical systems. Chaos 2006 16, 037101 (2001) 24. Pham, T.A., Phan, T.H.D., Tran, T.T.H.: Conflicting chip firing games on directed graphs and on treese. VNU Journal of Science. Natural Sciences and Technology 24, 103–109 (2007) 25. Huynh, D., Howell, R., Rosier, L., Yen, H.: Some complexity bounds for problems concerning finite and 2-dimensional vector addition systems with states. Theoretical Computer Science 46, 107–140 (1986)
Appendix A: Algorithm Generating All Filters
The program is implemented in C++ with Code::Blocks 8.02 and contains the following files:
– source code: filter.cpp;
– executable file: filter.exe;
– input file: filter.inp (the first line is the number of vertices of the graph, and the adjacency matrix starts from the second line);
– output file: filter.out (in each line, the first number is the number of elements of the filter and the next numbers are the labels of the vertices in the filter; the last number is m + 1, used to recognize the last filter).
Fig. 2. Input and output of the program writing filter
Fig. 3. Some results of the program comparing two configurations. Left: input file; right: output
Appendix B: Algorithm Comparing Two Configurations in CCFG(G, n)
The program is implemented in C++ with Code::Blocks 8.02 and contains the following files:
– source code: compare.cpp;
– executable file: compare.exe;
– file containing the filters: filter.out;
– input file: compare.inp (the first line is the number of vertices of the graph, and the two next lines give the numbers of chips of the two configurations);
– output: printed to the screen (a ≥ b, or a is not ≥ b).
Neurofuzzy Decision-Making Approach for the Next Day Portfolio Thai Stock Index Management Trading Strategies Monruthai Radeerom and M.L. Kulthon Kasemsan Department of Information Technology, Faculty of Information Technology, Rangsit University, Pathumtani, Thailand 12000 [email protected], [email protected]
Abstract. Stock investment has become an important investment activity in Thailand. However, investors often lose money because of unclear investment objectives and blind investment. Therefore, a good investment decision support system that assists investors in making good decisions has become an important research problem. This paper introduces an intelligent decision-making model based on the application of neurofuzzy system (NFs) technology. Our proposed system is used to decide a trading strategy for the next day so as to obtain a high profit for each stock index. First, our decision-making model includes two NFs models, a predictor and a portfolio-management model. Second, the optimizing algorithm included in our proposed model is evaluated through the effect of different forms of the decision-making model, different numbers of membership functions in the NFs, different NFs inputs and different sliding windows on the profit. Finally, the experimental results show higher profit than the single NFs and the Buy & Hold models for each stock index. The results are very encouraging and can be implemented in a decision-making trading system for the next trading day.
1 Introduction The prediction of financial market indicators is a topic of considerable practical interest and, if successful, may yield substantial pecuniary rewards. People tend to invest in equity because of its high returns over time. Considerable effort has been put into the investigation of stock markets. The main objective of researchers is to create a tool which could be used for the prediction of stock market fluctuations; the main motivation for that is financial gain. In the financial marketplace, traders need fast and powerful tools for decision making in order to work efficiently and to make a profit. The use of Artificial Intelligence (AI) has had a big influence on forecasting and investment decision-making technologies. There are a number of examples of using neural networks in equity market applications, including forecasting the value of a stock index [4,5,19,21], recognition of patterns in trading charts [11,16],
rating of corporate bonds [7], estimation of the market price of options [10], and the indication of trading signals for selling and buying [3,11], etc. Even though nearly everybody agrees on the complex and nonlinear nature of economic systems, there is skepticism as to whether new approaches to nonlinear modeling, such as neural networks, can improve economic and financial forecasts. Some researchers claim that neural networks may not offer any major improvement over conventional linear forecasting approaches [8,12]. In addition, there is a great variety of neural computing paradigms, involving various architectures, learning rates, etc., and hence precise and informative comparisons may be difficult to make. In recent years, an increasing number of studies in the emerging and promising field of financial engineering have incorporated neurofuzzy approaches [6,9,15,17]. Almost all of these models focus on the prediction of stock prices. The difference of our proposed model is that we focus on decision making in stock markets, not on forecasting. Differently from our previous work [14], we do not make a direct prediction of stock markets, but work on one-day-forward decision making for buying or selling the stocks. For that we develop a decision-making model in which, besides the application of NFs, we use an optimizing algorithm based on the rate of return profit of each stock index to construct our NFs model for the decision support system. In this paper, we present a decision-making model which combines two NFs models, an NFs predictor and an NFs portfolio management model, for the Thai stock index. The NFs predictor forecasts the next day's close price, and the NFs portfolio management model decides the buy, sell or hold strategy for each stock index. The objective of this model is to analyze the daily stock returns and to make one-day-forward decisions related to the purchase of the stocks. The paper is organized as follows. Section 2 presents the neurofuzzy system and the NFs decision-making model. Section 3 is devoted to the experimental investigations and the evaluation of the decision-making model; this section gives the grounds for the selection of the different variables used in the model, as well as for the model structure. The main conclusions of the work are presented in Section 4, with remarks on future directions.
2 Neurofuzzy Approaches for Decision Making System in Portfolio Stock Index Management Neurofuzzy systems combine semantic transparency of rule-based fuzzy systems with a learning capability of neural networks. Depending on the structure of ifthen rules, two main types of fuzzy models are distinguished as mamdani (or linguistic) and takagi-sugeno models [1]. The mamdani model is typically used in knowledge-based (expert) systems, but the takagi-sugeno model used in datadriven systems. In this paper, we consider only the Takagi-Sugeno-Kang (TSK) model. Takagi Sugeno and Kang formalized a systematic approach for generating fuzzy rules from an input-output data pairs. The fuzzy if-then rules, for pure fuzzy inference system, are of the following form:
if x_{1} is A_{1} and x_{2} is A_{2} and \ldots and x_{N} is A_{N} then y = f(x),    (1)

where x = [x_{1}, x_{2}, \ldots, x_{N}]^{T}, A_{1}, A_{2}, \ldots, A_{N} are fuzzy sets in the antecedent, while y is a crisp function in the consequent part. The function f is a polynomial in the input variables x_{1}, x_{2}, x_{3}, \ldots, x_{N}; see Figure 1.
Fig. 1. An example of a first-order TSK fuzzy model with two rules [1]

[Fig. 2 block diagram: training/test data → subtractive clustering → fuzzy inference system (if–then rules, membership functions, fuzzy reasoning), with expert input → tuning by neural network and system evaluation (errors) → output]
Fig. 2. Constructing Neurofuzzy Networks [14]
In conclusion, Figure 2 summarizes the construction of the neurofuzzy network system (NFs). Process data, called the training data set, can be used to construct neurofuzzy systems; we do not need prior knowledge as in knowledge-based (expert) systems. In this way, the membership functions of the input variables are designed by the subtractive clustering method, and the fuzzy rules (including the associated parameters) are constructed from scratch using numerical data. The parameters of this model (the membership functions and consequent parameters) are then fine-tuned with process data.
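As a small illustration of the first-order TSK model of Eq. (1) and Figure 1, the following Python sketch evaluates a two-rule Sugeno system with Gaussian membership functions and weighted-average defuzzification. The membership parameters and consequent coefficients are purely illustrative, not the ones identified by subtractive clustering in this paper.

import math

def gauss(x, c, sigma):
    """Gaussian membership degree of x for a fuzzy set centred at c."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

def tsk_two_rule(x1, x2):
    """First-order TSK model with two rules (cf. Fig. 1); parameters are illustrative."""
    # Rule 1: if x1 is A1 and x2 is B1 then y1 = a1*x1 + b1*x2 + c1
    w1 = gauss(x1, c=0.0, sigma=1.0) * gauss(x2, c=0.0, sigma=1.0)
    y1 = 0.5 * x1 + 0.2 * x2 + 0.1
    # Rule 2: if x1 is A2 and x2 is B2 then y2 = a2*x1 + b2*x2 + c2
    w2 = gauss(x1, c=2.0, sigma=1.0) * gauss(x2, c=2.0, sigma=1.0)
    y2 = -0.3 * x1 + 0.8 * x2 + 0.4
    # weighted-average defuzzification of the rule consequents
    return (w1 * y1 + w2 * y2) / (w1 + w2)

print(tsk_two_rule(1.0, 1.5))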
The advantage of the TSK fuzzy system is that it provides a compact system. Therefore, some classical system identification methods, such as parameter estimation and order determination algorithms, can be developed to obtain the fuzzy inference rules from input/output data. Similar to neural networks, neurofuzzy systems are universal approximators. The TSK fuzzy inference system is therefore suitable for many complex nonlinear practical problems, such as time series data. 2.1 Decision-Making Model for Stock Market First, the NFs predictor and the NFs portfolio management model are constructed individually as the baselines, and the test data is used to verify these two models before combining them into the decision-making support model. Then, following the steps shown in Figure 3, the system can be trained and the two NFs models combined into a hybrid model. The proposed decision-making model combines two NFs that are used to calculate the one-day-forward decision (buy, sell or hold). The scenario of the decision-making model is presented in Figure 3. The model scenario represents the calculations made in one day in order to reach a decision concerning the purchase of stocks. The model realization uses historical data of daily stock returns. In the first step of the model realization, the next day's close price of the stock index is passed to the second model (see the first block of the model scenario). The second block of the decision-making model uses the second NFs to calculate the investment recommendations: sell, hold or buy. The recommendations (R) represent the relative rank of investment attraction of each stock in the interval [−1, 1]; the values −1, 0, and 1 represent the recommendations sell, hold and buy, respectively. It is important to mention that the decision-making model is based on the idea of a sliding window. The size of the sliding window shows how many times the cycle of the model has to be run in order to get the decision. For each day's decision, a new sliding window is needed.
[Fig. 3 diagram: close prices at days t−k, ..., t−2, t−1, t are fed to the NFs predictor, whose forecast close price (t+1) is passed, together with the same inputs, to the NFs portfolio management model, which outputs the trading strategy: buy, hold or sell]
Fig. 3. The scenario of decision-making model
Fig. 4. Sliding window of training data and testing data (one day)
An example of the sliding window is presented in Figure 4. As can be seen from the picture, the sliding window represents the training part of each time interval. For the training of the NFs, an optimizing algorithm is used that selects the number of membership functions and the fuzziness parameter of the NFs model. The best NFs of the day is selected, namely the NFs that has shown the best performance (the highest total profit over the selected sliding window). 2.2 Evaluating Function for NFs Forecasting and NFs Decision-Making Model There are several kinds of error functions used in financial forecasting, namely the Mean Absolute Deviation (MAD), the Mean Squared Error (MSE) and the Mean Absolute Percentage Error (MAPE). In this paper, as for a neural network model, we use two error functions for the NFs predictor and the NFs portfolio management model. First, the Percentile Variance Accounted For (VAF) [1] is selected for evaluating the NFs model. The VAF of two equal signals is 100%; if the signals differ, the VAF is lower. When y_1 and y_2 are matrices, the VAF is calculated for each column. The VAF index is often used to assess the quality of a model by comparing the true output and the output of the model. The VAF between two signals is defined as follows:

\mathrm{VAF} = 100\% \times \left[ 1 - \frac{\operatorname{var}(y_{1} - y_{2})}{\operatorname{var}(y_{1})} \right]    (2)
Second, we use the Root Mean Squared Error (RMSE), which is the square root of the MSE. Moreover, for the NFs portfolio management model, the expected returns in the stock markets are considered; the value obtained on the last investigation day is taken as the profit. The trader's profit is calculated as

\mathrm{Profit}(n) = \mathrm{Stock\ Value}(n) - \mathrm{Investment\ value}    (3)

where n is the number of trading days.
And the Rate of Return Profit (RoRP) is

\mathrm{RoRP} = \frac{\mathrm{Profit}(n)}{\mathrm{Investment\ value}} \times 100    (4)
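The two evaluation measures of Eqs. (2)-(4) are easy to compute; a short sketch follows (the numbers in the usage comment are made up):

import numpy as np

def vaf(y_true, y_model):
    """Percentile Variance Accounted For, Eq. (2)."""
    y_true, y_model = np.asarray(y_true, float), np.asarray(y_model, float)
    return 100.0 * (1.0 - np.var(y_true - y_model) / np.var(y_true))

def rorp(stock_value_n, investment):
    """Rate of Return Profit, Eqs. (3)-(4): profit after n trading days
    relative to the initial investment, in percent."""
    profit = stock_value_n - investment
    return 100.0 * profit / investment

# e.g. vaf([54, 55, 57], [54.2, 55.1, 56.7]) is about 97, and rorp(1_350_000, 1_000_000) == 35.0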
3 Experimental Methodology and Results The model realization could be run with different groups of stocks (e.g. the banking group or the energy group), indexes or other groups of securities. We use market orders, which allows the buying of stocks at the market closing time to be simulated. All the experimental investigations were run according to the scenario presented above and were focused on the estimation of the Rate of Return Profit (RoRP). At the beginning of each realization, the starting investment is assumed to be 1,000,000 Baht (approximately USD 29,412). The data set, including the Stock Exchange of Thailand (SET) index and the Bank of Ayudhya Public Company Ltd (BAY), Siam Commercial Bank (SCB) and Petroleum Authority of Thailand (PTT) stock indexes, has been decomposed into two different sets: the training data and the test data. The data for each stock index run from June 2, 2008 to July 17, 2009, 275 records in total; the first 248 records are the training data and the remaining 27 records are the test data. Moreover, the data for the stock prices include the buy-sell strategy, the close price and its technical data. Consequently, max-min normalization is used to reduce the range of the data set to values appropriate for the inputs and outputs used in the training and testing method. 3.1 Forecasted Next Day Close Price Based on Neurofuzzy Model For the neurofuzzy stock index predictor, the inputs of the neurofuzzy system are the close prices of the stock index at a number of past days (step time day ago items), and the single output is the next day's close price. We select the inputs for BAY, SCB and PTT based on the evaluation function VAF, see equation (2). Moreover, we perform a number of benchmark comparisons between a backpropagation neural network (BPN) with various training algorithms and our proposed neurofuzzy system. The learning methods of the BPN are the Levenberg-Marquardt (TRAINLM) and Scaled Conjugate Gradient (TRAINSCG) methods. The BPN model has one hidden layer with 30 nodes, and the learning iteration is 10,000 epochs. After training, an example of the comparison of the different models, BPN and the neurofuzzy model, is listed in Table 1. After optimizing the clustering method, the numbers of memberships are 6 for the inputs of BAY, 5 for SCB and 3 for PTT. After training and testing, the VAF value of the BAY model is 99.448 percent on the training data set and 86.087 percent on the testing data set. Moreover, after training for the SCB stock index, the VAF value of the model is 95.59 percent on the testing set, see Figure 5. In summary, our proposed neurofuzzy system (NFs) and data preparation method are successful and generalize well for stock prediction.
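A possible form of the max-min normalization mentioned above is sketched below; the target range [0, 1] is our assumption, as the paper does not state it.

def min_max_normalize(series, lo=0.0, hi=1.0):
    """Max-min normalization of a price series into the range [lo, hi]."""
    mn, mx = min(series), max(series)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in series]

# e.g. scaled_close = min_max_normalize(close_prices)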
[Figure: historical quotes of Siam Commercial Bank Company Ltd. (SCB); close price (approx. 52-62) plotted against tested data points (0-45), NFs prediction vs. real close price, VAF = 95.59.]

Fig. 5. Neurofuzzy close price (dashed line) and test close price (solid line) of Siam Commercial Bank (SCB)

Table 1. Example of comparison between various backpropagation networks and the neurofuzzy system
Stock Index | Training Algorithm    | Hidden Nodes / Memberships | Epochs | Elap. Time (Sec) | VAF Training Set (Accuracy %) | VAF Testing Set (Accuracy %)
Neural Network with various learning methods
SCB         | TRAINSCG              | 30                         | 10000  | 335.38           | 99.17                         | 79.17
SCB         | TRAINLM               | 30                         | 10000  | 758.69           | 99.42                         | 78.15
NeuroFuzzy with various memberships (clusters)
SCB         | Membership or Cluster | 5                          | -      | 1.39             | 99.48                         | 95.59
SCB         | Membership or Cluster | 10                         | -      | 1.76             | 99.50                         | 90.17
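The membership (cluster) counts reported above can be found by a simple search over candidate values. The sketch below is our own illustration, not the authors' optimization algorithm; it assumes a hypothetical train_nfs(X, y, n_memberships) helper that fits an NFs predictor and exposes a predict() method, with vaf() as defined in the sketch after Eq. (2).

    def select_memberships(X_train, y_train, X_test, y_test,
                           candidates=range(2, 21)):
        # Pick the number of membership functions with the highest VAF on the
        # test set; train_nfs(...) is a hypothetical stand-in for the NFs trainer.
        best_n, best_vaf = None, float("-inf")
        for n in candidates:
            model = train_nfs(X_train, y_train, n_memberships=n)
            score = float(vaf(y_test, model.predict(X_test)))
            if score > best_vaf:
                best_n, best_vaf = n, score
        return best_n, best_vaf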
3.2 Construction of the Decision-Making Model

From the previous investigation we already have a suitable NFs-based model for forecasting the next close price. In this section we focus on the construction of the decision-making model. For the decision-making model, a high profit, i.e., a high Rate of Return Profit (RoRP), is our objective. Several factors affect the construction of the decision-making model, such as the number of inputs, the number of membership functions of the NFs and the size of the sliding window over the data sets. In this paper the sliding window was set to 248 days, so we focus mainly on the number of inputs and the number of membership functions. The experiments were run with different numbers of membership functions for the input data of the NFs. In the proposed decision-making model (NFs), the inputs are close prices at delayed time steps and the output is an investment recommendation, namely
sell (−1), hold (0) or buy (1), respectively. After the recommendation is calculated, we simulate trading the stock index and calculate the profit and RoRP (a sketch of this simulation is given after Table 2). This methodology is used to evaluate an optimized decision-making NFs model for each stock index. Table 2 presents the RoRP of the decision-making model for each number of inputs (k = 1, ..., 11); it shows which k inputs have the biggest influence on RoRP and is used to select a number of inputs that gives good profit results. The benefit (RoRP) of the decision-making model differs for each number of NFs inputs and for each stock index. As a result of this method (see Table 2), the best results are achieved with k = 4 step time days ago for BAY, k = 6 for SCB and k = 4 for PTT as NFs inputs; the corresponding numbers of membership functions are 6 for BAY, 12 for SCB and 19 for PTT. Note: the maximum profit is the profit that would be possible with knowledge of the future, i.e., with perfect trading of each stock index.

Table 2. Model performance evaluation (different number of inputs)
Stock index:            BAY                          SCB                          PTT
k (step time   No. of Cluster  Max. total   No. of Cluster  Max. total   No. of Cluster  Max. total
day ago)       on max profit   profit NFs   on max profit   profit NFs   on max profit   profit NFs
1              NaN             NaN          NaN             NaN          NaN             NaN
2              14              43.69        12              30.69        6               30.74
3              12              42.15        18              30.70        21              28.63
4              6               44.51        7               32.13        19              31.47
5              14              40.68        9               28.35        9               29.23
6              11              40.63        12              32.58        10              30.86
7              2               32.91        7               32.58        3               18.79
8              11              36.59        2               32.58        10              20.97
9              1               32.91        6               32.58        6               28.09
10             9               34.13        5               32.58        7               29.15
11             4               35.33        5               31.75        14              22.64
Maximum profit: BAY 46.74 %, SCB 34.37 %, PTT 32.66 %
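As referenced above, the following is a minimal sketch (our own illustration, not the authors' code) of how the recommendations can be turned into trades and a RoRP figure; it assumes the sell(−1)/hold(0)/buy(+1) labelling of Sect. 3.2, and the buy/sell mapping can be swapped if the opposite convention is used.

    def simulate_trading(signals, close_prices, initial_investment=1_000_000):
        # signals      : NFs recommendations per test day: -1 (sell), 0 (hold), +1 (buy)
        # close_prices : close price of the stock on the same days
        cash, shares = float(initial_investment), 0.0
        for signal, price in zip(signals, close_prices):
            if signal == 1 and cash > 0:       # buy: invest all available cash
                shares, cash = cash / price, 0.0
            elif signal == -1 and shares > 0:  # sell: liquidate the position
                cash, shares = shares * price, 0.0
            # signal == 0: hold, do nothing
        stock_value = cash + shares * close_prices[-1]  # value on the last day
        profit = stock_value - initial_investment       # Eq. (3)
        rorp = profit / initial_investment * 100        # Eq. (4)
        return profit, rorp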
3.3 Evaluating the Decision-Making System Based on the Neurofuzzy Model

After developing the NFs predictor and the NFs portfolio management, we are given 1,000,000 Baht for investment at the beginning of the testing period. Translating the produced results into RoRP puts them in a more meaningful form and verifies the effectiveness of the NFs prediction model and the NFs decision model. The timing for when to buy and sell stocks is given by the NFs output: if the prediction system produces the negative signal (−1), a "Buy" position is taken; if it produces the positive signal (+1), a "Sell" position is taken; and if it produces the zero signal (0), the position is "Hold". In this paper, our proposed decision-making NFs model is compared with the buy-and-hold strategy. Buy-and-hold is a long-term investment strategy based on the view that in the long run financial markets give a good rate of return despite
periods of volatility or decline. The antithesis of buy-and-hold is day trading, in which money can be made in the short term by shorting at the peaks and buying at the lows, with greater returns coming with greater volatility. From the results in Table 3, the buy-and-hold strategy yields a 12-20 percent rate of return (the difference between the initial and final values; a sketch of this baseline is given after Table 3), whereas the NFs portfolio management model yields a rate of return of about 31-44 percent over the same period (27 testing days), close to the possible profit of each stock index (a 32-46 percent rate of return). In these experimental results, the NFs model displays a greater rate of return than the buy-and-hold model, although the differences among the stock indexes are small. Calculating losses and gains in terms of profitability in this way is a valuable exercise in practice.

Table 3. Rate of Return Profit (RoRP) gained from trading each stock index
STOCK INDEX   STOCK GROUP   Possible profit (%)   NFs Portfolio (%)   Buy & Hold (%)
BAY           Banking       46.74                 44.51               20.15
SCB           Banking       34.37                 32.58               15.71
PTT           Energy        32.66                 31.47               12.25
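For comparison, the Buy & Hold column of Table 3 can be reproduced with a baseline of the following form (a sketch under our own assumption that the entire capital buys the stock at the first test-day close and is simply held until the last test day).

    def buy_and_hold_rorp(close_prices, initial_investment=1_000_000):
        shares = initial_investment / close_prices[0]  # buy once at the start
        final_value = shares * close_prices[-1]        # stock value on the last day
        return (final_value - initial_investment) / initial_investment * 100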
4 Conclusion

In this paper we presented a decision-making model based on the application of NFs. The model was applied to make one-step-ahead decisions from historical data of daily stock returns. The experimental investigation has shown, firstly, that the best decision-making performance is achieved with (6, 4), (12, 6) and (19, 4) as the (number of membership functions, step time interval) pairs of the NFs inputs for BAY, SCB and PTT, respectively, with a sliding window of 248 days; and secondly, that the combination of two NFs, an NFs predictor that predicts the next close index price and an NFs portfolio management model that decides the trading strategy, achieves more stable results and higher profits. For future work, several issues could be considered. First, other techniques such as support vector machines and genetic algorithms can be applied for further comparisons. In addition, other stock index groups, other stock exchanges or other industries besides the electronic one can be considered for comparison.

Acknowledgments. We would like to thank the participants for their helpful comments and invaluable discussions. This work was partly supported by the Graduate Fund for Ph.D. Students, Rangsit University, Pathumthanee, Thailand.
References

1. Babuska, R.: Neuro-fuzzy methods for modeling and identification. In: Recent Advances in Intelligent Paradigms and Applications, pp. 161-186 (2002)
2. Cordón, O., Herrera, F., Villar, P.: Analysis and guidelines to obtain a good uniform fuzzy rule based system using simulated annealing. International Journal of Approximate Reasoning 25(3), 187-215 (2000)
3. Chapman, A.J.: Stock market trading systems through neural networks: developing a model. International Journal of Applied Expert Systems 2(2), 88-100 (1994)
4. Chen, A.S., Leung, M.T., Daouk, H.: Application of neural networks to an emerging financial market: forecasting and trading the Taiwan Stock Index. Computers and Operations Research 30, 901-923 (2003)
5. O'Connor, N., Madden, M.: A neural network approach to predicting stock exchange movements using external factors. Knowledge-Based Systems 19, 371-378 (2006)
6. James, N.K., Liu, Raymond, W.M., Wong, K.: Automatic extraction and identification of chart patterns towards financial forecast. Applied Soft Computing 1, 1-12 (2006)
7. Dutta, S., Shekhar, S.: Bond rating: a non-conservative application of neural networks. In: Proceedings of the IEEE International Conference on Neural Networks, pp. 124-130 (1990)
8. Farmer, J.D., Sidorowich, J.J.: Can new approaches to nonlinear modeling improve economic forecasts? In: The Economy as an Evolving Complex System, pp. 99-115 (1988)
9. Hiemstra, Y.: Modeling structured nonlinear knowledge to predict stock markets: theory, evidence and applications, pp. 163-175 (1995)
10. Hutchinson, J.M., Lo, A., Poggio, T.: A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance 49, 851-889 (1994)
11. James, N.K., Raymond, W.M., Wong, K.: Automatic extraction and identification of chart patterns towards financial forecast. Applied Soft Computing 1, 1-12 (2006)
12. LeBaron, B., Weigend, A.S.: Evaluating neural network predictors by bootstrapping. In: Proceedings of the International Conference on Neural Information Processing (ICONIP 1994), pp. 1207-1212 (1994)
13. Li, R.-J., Xiong, Z.-B.: Forecasting stock market with fuzzy neural networks. In: Proceedings of the 4th International Conference on Machine Learning and Cybernetics, pp. 3475-3479 (2005)
14. Radeerom, M., Srisaan, C.K., Kasemsan, M.L.: Prediction method for real Thai stock index based on neurofuzzy approach. In: Trends in Intelligent Systems and Computer Engineering. Lecture Notes in Electrical Engineering, vol. 6, pp. 327-347 (2008)
15. Refenes, P., Abu-Mostafa, Y., Moody, J.E., Weigend, A.S. (eds.): Neural Networks in Financial Engineering. World Scientific, Singapore (1996)
16. Tanigawa, T., Kamijo, K.: Stock price pattern matching system: dynamic programming neural network approach. In: Proceedings of the International Joint Conference on Neural Networks, vol. 2, pp. 59-69 (1992)
17. Trippi, R., Lee, K.: Artificial Intelligence in Finance & Investing. Irwin, Chicago (1996)
18. Tsaih, R., Hsu, Y.R., Lai, C.C.: Forecasting S&P 500 stock index futures with a hybrid AI system. Decision Support Systems 23, 161-174 (1998)
19. Yao, J.T., Poh, H.-L.: Equity forecasting: a case study on the KLSE index. In: Neural Networks in Financial Engineering: Proceedings of the 3rd International Conference on Neural Networks in the Capital Markets, pp. 341-353 (1995)
20. Yoo, P.D., Kim, M.H., Jan, T.: Machine learning techniques and use of event information for stock market prediction: a survey and evaluation. In: Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation and the International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC 2005), pp. 1234-1240 (2005)
21. White, H.: Economic prediction using neural networks: the case of IBM daily stock returns. In: Proceedings of the IEEE International Conference on Neural Networks, vol. 2, pp. 451-458 (1988)
Author Index
Anh, Duong Tuan 229
Bainbridge, David 79; Banachowski, Lech 181; Begier, Barbara 191; Boongasame, Laor 325; Boonjing, Veera 325
Cao, Tru H. 41; Ceglarek, Dariusz 111; Chang, Jing-Wen 349; Chao, I-Ming 253; Chiu, Tzu-Fu 291; Chiu, Yu-Ting 291; Chmielewski, Mariusz 157; Chomya, Sinthop 99; Cunningham, Sally Jo 79
Dang, Tran Khanh 133; Deris, Mustafa Mat 3, 265; Duong, Phan Thi Ha 359
Eder, Johann 145
Galka, Andrzej 157; Gorawski, Marcin 53
Ha, Le Manh 359; Haniewicz, Konstanty 111; Herawan, Tutut 3, 265; Hong, Chao-Fu 291; Hu, Jia-Ying 125; Hue, Nuong Tran Thi 305; Huynh, Thuy N.T. 29
Intan, Rolly 279
Janckulik, Dalibor 67
Kasemsan, M.L. Kulthon 371; Kim, Cheonshik 89; Kim, Chong Gun 337; Kim, Shin Hun 337; Kongubol, Kritsadakorn 241; Kozierkiewicz-Hetmańska, Adrianna 169; Krejcar, Ondrej 67
Le, Bac 207; Lenkiewicz, Pawel 17, 181; Lertnattee, Verayuth 99; Lin, Shih-Kai 253
Malczok, Rafal 53; Minh, Khang Nguyen Tan Tran 305; Motalova, Leona 67; Musil, Karel 67
Ngo, Vuong M. 41; Nguyen, Ngoc Thanh 169; Nguyen, Thanh C. 29; Niu, Dongxiao 315; Nowacki, Jerzy Pawel 181
Penhaker, Marek 67; Phan, Tuoi T. 29
Radeerom, Monruthai 371; Rakthanmanon, Thanawin 241; Rutkowski, Wojciech 111
Son, Mai Thai 229; Sornlertlamvanich, Virach 99; Stencel, Krzysztof 17
Tahamtan, Amirreza 145; Tam, Nguyen Anh 359; Tang, Cheng-Hsien 125; Thanh, Nguyen Dang Thi 305; Tran, Anh N. 217; Trang, Khon Trieu 305; Truong, Anh Tuan 133; Truong, Quynh Chi 133; Truong, Tin C. 217; Tsai, Meng-Feng 125
Vo, Bay 207
Waiyamai, Kitsana 241; Wang, Jia-Wen 349; Wang, Min-Feng 125; Wang, Yongli 315; Wei, Nai-Chieh 253; Wu, Mary 337; Wu, Yang 253
Xing, Mian 315
Yanto, Iwan Tri Riyadi 3, 265; Yuliana, Oviliani Yenty 279