Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2311
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
David Bustard Weiru Liu
Roy Sterritt (Eds.)
Soft-Ware 2002: Computing in an Imperfect World First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8-10, 2002 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors David Bustard Weiru Liu Roy Sterritt University of Ulster Faculty of Informatics School of Information and Software Engineering Jordanstown Campus, Newtownabbey, BT37 0QB, Northern Ireland E-mail: {dw.bustard/w.liu/r.sterritt}@ulster.ac.uk
Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Computing in an imperfect world : first international conference, soft ware 2002, Belfast, Northern Ireland, April 8 - 10, 2002 ; proceedings / David Bustard ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2311) ISBN 3-540-43481-X
CR Subject Classification (1998): D.2, K.6, F.1, I.2, J.1, H.2.8, H.3, H.4 ISSN 0302-9743 ISBN 3-540-43481-X Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna Printed on acid-free paper SPIN 10846571 06/3142 543210
Preface
This was the first conference of a new series devoted to the effective handling of soft issues in the design, development, and operation of computing systems. The conference brought together contributors from a range of relevant disciplines, including artificial intelligence, information systems, software engineering, and systems engineering. The keynote speakers, Piero Bonissone, Ray Paul, Sir Tony Hoare, Michael Jackson, and Derek McAuley, have interests and experience that collectively span all of these fields.

Soft issues involve information or knowledge that is uncertain, incomplete, or contradictory. Examples of where such issues arise include:
– requirements management and software quality control in software engineering,
– management of conflicting or multiple-source information in information systems,
– decision making/prediction in business management systems,
– quality control in networks and user services in telecommunications,
– modeling of human rationality in artificial intelligence,
– data analysis in machine learning and data mining,
– control management in engineering.

The concept of dealing with uncertainty became prominent in the artificial intelligence community nearly 20 years ago, when researchers realized that addressing uncertainty was an essential part of representing and reasoning about human knowledge in intelligent systems. The main methodologies that have emerged in this area are soft computing and computational intelligence. It was also about 20 years ago that the notion of hard and soft systems thinking emerged from the systems community, articulated by Checkland in his seminal work on Soft Systems Methodology (Peter Checkland, Systems Thinking, Systems Practice, Wiley, 1981). This work has influenced information system research and practice and is beginning to have an impact on systems and software engineering.

The conference gave researchers and practitioners with an interest in soft issues an opportunity to learn from each other and to identify ways of improving the development of complex computing systems. The conference had a strong industrial focus. In particular, all of the keynote speakers had both industrial and academic experience, and the conference concluded with a session taking an industrial perspective on soft issues. Also, the first day of the conference was integrated with the 2nd European Workshop on Computational Intelligence in Telecommunications and Multimedia, organized by the Technical Committee C of EUNITE, the European Network on Intelligent Technologies for Smart Adaptive Systems. This has a significant industrial membership. There were two EUNITE keynote speakers: Ben Azvine, chairman
of the Technical Committee C, and John Bigham who has many years’ experience in applying computational intelligence in telecommunications. The SS Titanic was chosen as a visual image for the conference because it represents the uncertainty associated with any engineering endeavor and is a reminder that the Titanic was built in Belfast – indeed just beside the conference venue. Coincidentally, the conference took place between the date the Titanic first set sail, 2 April 1912, and its sinking on 15 April 1912. Fortunately, the organizing committee is not superstitious! A total of 24 papers were selected for presentation at the conference. We are very grateful to all authors who submitted papers and to the referees who assessed them. We also thank Philip Houston and Paul McMenamin of Nortel Networks (Northern Ireland) whose participation in a collaborative project with the University of Ulster provided initial support and encouragement for the conference. The project, Jigsaw, was funded by the Industrial Research and Technology Unit of the Northern Ireland Department of Enterprise, Trade, and Investment. Further support for the conference was provided by other industry and government collaborators and sponsors, especially Des Vincent (CITU-NI), Gordon Bell (Liberty Technology IT), Bob Barbour and Tim Brundle (Centre for Competitiveness), Dave Allen (Charteris), and Billy McClean (Momentum). Internally, the organization of the conference benefitted from contributions by Adrian Moore, Pat Lundy, David McSherry, Edwin Curran, Mary Shapcott, Alfons Schuster, and Kenny Adamson. Adrian Moore deserves particular mention for his imaginative design and implementation of the Web site, building on the Titanic theme. We are also very grateful to Sarah Dooley and Pauleen Marshall whose administrative support and cheery manner were invaluable throughout. Finally, we thank Rebecca Mowat and Alfred Hofmann of Springer-Verlag for their help and advice in arranging publication of the conference proceedings.
February 2002
Dave Bustard Weiru Liu Roy Sterritt
Organization
SOFT-WARE 2002, the 1st International Conference on Computing in an Imperfect World, was organized by the Faculty of Informatics, University of Ulster in cooperation with EUNITE IBA C: Telecommunication and Multimedia Committee.
Organizing Committee Dave Bustard Weiru Liu Philip Houston Des Vincent Billy McClean Ken Adamson Edwin Curran Pat Lundy David McSherry Adrian Moore Alfons Schuster Mary Shapcott Roy Sterritt
General Conference Chair EUNITE Workshop Chair Nortel Networks, Belfast Labs CITU (NI) Momentum Industry Applications and Finance Local Events Full Submissions Web Short Submissions Local Arrangements Publicity and Proceedings
Program Committee
Behnam Azvine (BT, Ipswich, UK)
Salem Benferhat (IRIT, Université Paul Sabatier, France)
Keith Bennett (Durham University, UK)
Dan Berry (University of Waterloo, Canada)
Jean Bezivin (University of Nantes, France)
Prabir Bhattacharya (Panasonic Technologies Inc., USA)
Danny Crookes (Queen's University, UK)
Janusz Granat (National Institute of Telecoms, Poland)
Rachel Harrison (University of Reading, UK)
Janusz Kacprzyk (Warsaw University of Technology, Poland)
Stefanos Kollias (National Technical Univ. of Athens, Greece)
Rudolf Kruse (Otto-von-Guericke-University of Magdeburg, Germany)
Manny Lehman (Imperial College, UK)
Paul Lewis (Lancaster University, UK)
Xiaohui Liu (Brunel University, UK)
Abe Mamdani (Imperial College, UK)
Trevor Martin (University of Bristol, UK)
Stephen McKearney (University of Bournemouth, UK)
Andreas Pitsillides (University of Cyprus, Cyprus)
Simon Parsons (University of Liverpool, UK)
Henri Prade (IRIT, Université Paul Sabatier, France)
Marco Ramoni (Harvard Medical School, USA)
Alessandro Saffiotti (University of Örebro, Sweden)
Prakash Shenoy (University of Kansas, USA)
Philippe Smets (Université Libre de Bruxelles, Belgium)
Martin Spott (BT Ipswich, UK)
Frank Stowell (De Montfort University, UK)
Jim Tomayko (Carnegie Mellon University, USA)
Athanasios Vasilakos (University of Crete, Greece)
Frans Voorbraak (University of Amsterdam, The Netherlands)
Didar Zowghi (University of Technology, Sydney, Australia)
Additional Reviewers
Werner Dubitzky, Sally McClean, Mike McTear, Gerard Parr, William Scanlon, Bryan Scotney, George Wilkie
Sponsoring Institutions
– University of Ulster
– EUNITE - EUropean Network on Intelligent TEchnologies for Smart Adaptive Systems
– Nortel Networks, Belfast Labs
– IRTU - Industrial Research and Technology Unit
– Liberty IT
– INCOSE
– British Computer Society
– Momentum
– Centre for Competitiveness
– Charteris
– CSPT - Centre for Software Process Technologies
Table of Contents
Technical Session 1
Overview of Fuzzy-RED in Diff-Serv Networks . . . 1
L. Rossides, C. Chrysostomou, A. Pitsillides (University of Cyprus), A. Sekercioglu (Monash University, Australia)
An Architecture for Agent-Enhanced Network Service Provisioning through SLA Negotiation . . . 14
David Chieng (Queen's University of Belfast), Ivan Ho (University of Ulster), Alan Marshall (Queen's University of Belfast), Gerard Parr (University of Ulster)

Facing Fault Management as It Is, Aiming for What You Would Like It to Be . . . 31
Roy Sterritt (University of Ulster)

Enabling Multimedia QoS Control with Black-Box Modelling . . . 46
Gianluca Bontempi, Gauthier Lafruit (IMEC, Belgium)
Technical Session 2
Using Markov Chains for Link Prediction in Adaptive Web Sites . . . 60
Jianhan Zhu, Jun Hong, John G. Hughes (University of Ulster)

Classification of Customer Call Data in the Presence of Concept Drift and Noise . . . 74
Michaela Black, Ray Hickey (University of Ulster)

A Learning System for Decision Support in Telecommunications . . . 88
Filip Železný (Czech Technical University), Jiří Zídek (Atlantis Telecom), Olga Štěpánková (Czech Technical University)

Adaptive User Modelling in an Intelligent Telephone Assistant . . . 102
Trevor P. Martin, Behnam Azvine (BTexact Technologies)
Technical Session 3
A Query-Driven Anytime Algorithm for Argumentative and Abductive Reasoning . . . 114
Rolf Haenni (University of California, Los Angeles)

Proof Length as an Uncertainty Factor in ILP . . . 128
Gilles Richard, Fatima Zohra Kettaf (IRIT, Université Paul Sabatier, France)
Paraconsistency in Object-Oriented Databases . . . 141
Rajiv Bagai (Wichita State University, US), Shellene J. Kelley (Austin College, US)

Decision Support with Imprecise Data for Consumers . . . 151
Gergely Lukács (University of Karlsruhe, Germany)

Genetic Programming: A Parallel Approach . . . 166
Wolfgang Golubski (University of Siegen, Germany)

Software Uncertainty . . . 174
Manny M. Lehman (Imperial College, University of London, UK), J.F. Ramil (The Open University, UK)
Technical Session 4
Temporal Probabilistic Concepts from Heterogeneous Data Sequences . . . 191
Sally McClean, Bryan Scotney, Fiona Palmer (University of Ulster)

Handling Uncertainty in a Medical Study of Dietary Intake during Pregnancy . . . 206
Adele Marshall (Queen's University, Belfast), David Bell, Roy Sterritt (University of Ulster)

Sequential Diagnosis in the Independence Bayesian Framework . . . 217
David McSherry (University of Ulster)

Static Field Approach for Pattern Classification . . . 232
Dymitr Ruta, Bogdan Gabrys (University of Paisley)

Inferring Knowledge from Frequent Patterns . . . 247
Marzena Kryszkiewicz (Warsaw University of Technology, Poland)

Anytime Possibilistic Propagation Algorithm . . . 263
Nahla Ben Amor (Institut Supérieur de Gestion, Tunis), Salem Benferhat (IRIT, Université Paul Sabatier, France), Khaled Mellouli (Institut Supérieur de Gestion, Tunis)
Technical Session 5
Macro Analysis of Techniques to Deal with Uncertainty in Information Systems Development: Mapping Representational Framing Influences . . . 280
Carl Adams (University of Portsmouth, UK), David E. Avison (ESSEC Business School, France)
The Role of Emotion, Values, and Beliefs in the Construction of Innovative Work Realities . . . 300
Isabel Ramos (Escola Superior de Tecnologia e Gestão, Portugal), Daniel M. Berry (University of Waterloo, Canada), João Á. Carvalho (Universidade do Minho, Portugal)

Managing Evolving Requirements Using eXtreme Programming . . . 315
Jim Tomayko (Carnegie Mellon University, US)

Text Summarization in Data Mining . . . 332
Colleen E. Crangle (ConverSpeech, California)
Invited Speakers
Industrial Applications of Intelligent Systems at BTexact . . . 348
Behnam Azvine (BTexact Technologies, UK)

Intelligent Control of Wireless and Fixed Telecom Networks . . . 349
John Bigham (University of London, UK)

Assertions in Programming: From Scientific Theory to Engineering Practice . . . 350
Tony Hoare (Microsoft Research, Cambridge, UK)

Hybrid Soft Computing for Classification and Prediction Applications . . . 352
Piero Bonissone (General Electric Corp., Schenectady, NY, US)

Why Users Cannot 'Get What They Want' . . . 354
Ray Paul (Brunel University, UK)

Systems Design with the Reverend Bayes . . . 355
Derek McAuley (Marconi Labs, Cambridge, UK)

Formalism and Informality in Software Development . . . 356
Michael Jackson (Consultant, UK)
Industrial Panel An Industrial Perspective on Soft Issues: Successes, Opportunities, and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 Industrial Panel
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Overview of Fuzzy-RED in Diff-Serv Networks

L. Rossides¹, C. Chrysostomou¹, A. Pitsillides¹, and A. Sekercioglu²

¹ Department of Computer Science, University of Cyprus, 75 Kallipoleos Street, P.O. Box 20537, 1678 Nicosia, Cyprus. Phone: +357 2 892230, Fax: +357 2 892240.
² Centre for Telecommunications and Information Engineering, Monash University, Melbourne, Australia. Phone: +61 3 9905 3503, Fax: +61 3 9905 3454.
Abstract. The rapid growth of the Internet and increased demand to use the Internet for time-sensitive voice and video applications necessitate the design and utilization of new Internet architectures with effective congestion control algorithms. As a result, the Diff-Serv architecture was proposed to deliver (aggregated) QoS in TCP/IP networks. Network congestion control remains a critical and high-priority issue, even for the present Internet architecture. In this paper we present Fuzzy-RED, a novel approach to Diff-Serv congestion control, and compare it with a classical RIO implementation. We believe that with the support of fuzzy logic, we are able to achieve better differentiation of packet discarding behaviors for individual flows, and so provide better quality of service to different kinds of traffic, such as TCP/FTP traffic and TCP/Web-like traffic, whilst maintaining high utilization (goodput).
1 Introduction

The rapid growth of the Internet and increased demand to use the Internet for time-sensitive voice and video applications necessitate the design and utilization of new Internet architectures with effective congestion control algorithms. As a result, the Diff-Serv architecture was proposed [1] to deliver (aggregated) QoS in TCP/IP networks. Network congestion control remains a critical and high priority issue, even for the present Internet architecture. In this paper, we aim to use the reported strength of fuzzy logic (a Computational Intelligence technique) in controlling complex and highly nonlinear systems to address congestion control problems in Diff-Serv. We draw upon the vast experience, in both theoretical and practical terms, of Computational Intelligence Control (Fuzzy Control) in the design of the control algorithm [2]. Nowadays, we are faced with increasingly complex control problems, for which different (mathematical) modeling representations may be difficult to obtain. This difficulty has stimulated the development of alternative modeling and control techniques, which include fuzzy logic based ones. Therefore, we aim to exploit the well-known advantages of fuzzy logic control [2]:
• Ability to quickly express the control structure of a system using a priori knowledge.
• Less dependence on the availability of a precise mathematical model.
• Easy handling of the inherent nonlinearities.
• Easy handling of multiple input signals.

Our approach will be to adopt the basic concepts of RED [14], which was proposed to alleviate a number of problems with the current Internet congestion control algorithms, has been widely studied, and has been adapted, in many variants, for use in the Diff-Serv architecture. Despite the good characteristics shown by RED and its variants in many situations, and the clear improvement it presents against classical droptail queue management, it has a number of drawbacks, including problems with the performance of RED under different scenarios of operation, parameter tuning, linearity of the dropping function, and the need for other input signals. We expect that Fuzzy-RED, the proposed strategy, will be robust with respect to traffic modeling uncertainties and system nonlinearities, yet provide tight control (and as a result offer good service). It is worth pointing out that there is increasing empirical knowledge gathered about RED and its variants, and several 'rules of thumb' have appeared in many papers. It will be beneficial to build the Fuzzy Control rule base using this knowledge. However, in this paper we only attempt to highlight the potential of the methodology, and choose a simple rule base and simulation examples, but with realistic scenarios.
2 Issues on TCP/IP Congestion Control

As the growth of the Internet increases, it becomes clear that the existing congestion control solutions deployed in the Internet Transport Control Protocol (TCP) [3], [4] are increasingly becoming ineffective. It is also generally accepted that these solutions cannot easily scale up even with various proposed "fixes" [5], [6]. Also, it is worth pointing out that the User Datagram Protocol (UDP), the other transport service offered by the IP Internet, offers no congestion control. However, more and more users employ UDP for the delivery of real-time video and voice services. The newly developed (also largely ad-hoc) strategies [7], [8] are also not proven to be robust and effective. Since these schemes are designed with significant non-linearities (e.g. two-phase (slow start and congestion avoidance) dynamic windows, binary feedback, additive-increase multiplicative-decrease flow control, etc.), and they are based mostly on intuition, the analysis of their closed-loop behaviour is difficult, if at all possible, even for single control loop networks. Even worse, the interaction of additional non-linear feedback loops can produce unexpected and erratic behavior [9]. Empirical evidence demonstrates the poor performance and cyclic behavior of the TCP/IP Internet [10] (also confirmed analytically [11]). This is exacerbated as the link speed increases to satisfy demand (hence the bandwidth-delay product, and thus the feedback delay, increases), and also as the demand on the network for better quality of service increases. Note that for wide area networks a multifractal behavior has been observed [12], and it is suggested that this behavior (cascade effect) may be related to existing network controls [13]. Based on all these facts it is becoming clear that new approaches for congestion control must be investigated.
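For illustration, the additive-increase, multiplicative-decrease behaviour mentioned above can be reduced to a toy per-RTT window update. This is a sketch only: slow start, timeouts, and fast recovery are omitted, so it is not the exact TCP specification.

```java
// Toy sketch of additive-increase, multiplicative-decrease (AIMD) at per-RTT
// granularity; slow start, timeouts and fast recovery are deliberately omitted.
public class AimdSketch {
    static double updateCwnd(double cwnd, boolean lossDetected) {
        return lossDetected ? Math.max(1.0, cwnd / 2.0)  // multiplicative decrease on loss
                            : cwnd + 1.0;                // additive increase otherwise
    }

    public static void main(String[] args) {
        double cwnd = 10.0;
        boolean[] lossPattern = {false, false, true, false, false};
        for (boolean loss : lossPattern) {
            cwnd = updateCwnd(cwnd, loss);
            System.out.println("cwnd = " + cwnd);
        }
    }
}
```

Even this simplified update produces the characteristic saw-tooth, cyclic behaviour that the analytical and empirical studies cited above describe.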
3 The Inadequacy of RED

The most popular algorithm used for Diff-Serv implementation is RED (Random Early Discard) [14]. RED simply sets some min and max dropping thresholds for a number of predefined classes in the router queues. In case the buffer queue size exceeds the min threshold, RED starts randomly dropping packets based on a probability depending on the queue length. If the buffer queue size exceeds the max threshold, then every packet is dropped (i.e., the drop probability is set to 1) or ECN (Explicit Congestion Notification) marked. The RED implementation for Diff-Serv defines different thresholds for each class. Best-effort packets have the lowest min and max thresholds and therefore they are dropped with greater probability than packets of the AF (Assured Forwarding) or EF (Expedited Forwarding) class. Also, there is the option that if an AF class packet does not comply with the rate specified, it is reclassified as a best-effort class packet. Apart from RED, many other mechanisms such as n-RED, adaptive RED [15], BLUE [16], [17] and Three Color marking schemes were proposed for Diff-Serv queue control. In Figure 1 we can see a simple Diff-Serv scenario where RED is used for queue control. A leaky bucket traffic shaper is used to check whether the packets comply with the SLA (Service Level Agreement). If EF packets do not comply with the SLA, they are dropped. For AF class packets, if they do not comply, they are remapped into best-effort class packets. Both AF and best-effort packets share a RIO [18] queue. RIO stands for RED In/Out queue, where "In" and "Out" mean packets are in or out of the connection conformance agreement. For the AF and best-effort classes we have different min and max thresholds. EF packets use a separate high-priority FIFO queue.
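A minimal sketch of this per-class threshold behaviour is given below, using the usual linear RED ramp between the two thresholds. The threshold and maxP values are purely illustrative assumptions, not taken from a specific implementation; each Diff-Serv class would carry its own settings.

```java
// Sketch of per-class RED dropping: below minTh no early drops, between minTh
// and maxTh the drop probability rises linearly up to maxP, above maxTh every
// packet is dropped (or ECN-marked). Values below are illustrative only.
public class RedSketch {
    static double dropProbability(double queueLen, double minTh, double maxTh, double maxP) {
        if (queueLen < minTh)  return 0.0;                        // below min: no early drops
        if (queueLen >= maxTh) return 1.0;                        // above max: drop or mark all
        return maxP * (queueLen - minTh) / (maxTh - minTh);       // linear ramp in between
    }

    public static void main(String[] args) {
        // a best-effort class with lower thresholds starts dropping earlier than AF
        System.out.println("best effort: " + dropProbability(30, 10, 40, 0.10));
        System.out.println("AF class:    " + dropProbability(30, 20, 60, 0.05));
    }
}
```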
Fig. 1. Diff-Serv scenario with RED queue for control
Despite the good characteristics shown by RED in many situations and the clear improvement it presents against classical droptail queue management, it has a number of drawbacks. In cases with extremely bursty traffic sources, the active queue management techniques used by RED, unfortunately, are often defeated since queue lengths grow and shrink rapidly well before RED can react.
The inadequacy of RED can be understood more clearly by considering the operation of an ideal queue management algorithm. Consider an ideal traffic source sending packets to a sink through two routers connected via a link of capacity L Mbps (see Figure 2).
Fig. 2. Ideal Scenario
An ideal queue management algorithm should try to maintain the correct number of packets in the queue to keep the sending rate of the sources at L Mbps, thus achieving full (100%) throughput utilization. While RED can achieve performance very close to this ideal scenario, it needs a large amount of buffering and, most importantly, correct parameterization to achieve it. The correct tuning of RED implies a "global" parameterization that is very difficult, if not impossible, to achieve, as shown in [16]. The results presented later in this article show that Fuzzy-RED can provide such desirable performance and queue management characteristics without any special parameterization or tuning.
4 Fuzzy Logic Controlled RED

A novel approach to the RED Diff-Serv implementation is Fuzzy-RED, a fuzzy logic controlled RED queue. To implement it, we removed the fixed max and min queue thresholds from the RED queue for each class, and replaced them with dynamic, network-state-dependent thresholds calculated using a fuzzy inference engine (FIE), which can be considered a lightweight expert system. As reported in [16], classical RED implementations with fixed thresholds cannot provide good results in the presence of dynamic network state changes, for example changes in the number of active sources. The FIE dynamically calculates the drop probability behavior based on two network-queue state inputs: the instantaneous queue size and the queue rate of change. In our implementation we add an FIE for each Diff-Serv class of service. The FIE uses separate linguistic rules for each class to calculate the drop probability based on the input from the queue length and the queue length growth rate. Usually, two-input FIEs offer a better ability to linguistically describe the system dynamics. Therefore, we can expect to tune the system better and improve the behavior of the RED queue according to our class-of-service policy. The dynamic way in which the FIE calculates the drop probability comes from the fact that, according to the rate of change of the queue length, the current buffer size, and the class to which the packet belongs, a different set of fuzzy rules, and hence a different inference, applies. Based on these rules and inferences, the drop probability is calculated more dynamically than in the classical
RED approach. This point can be illustrated through a visualization of the decision surfaces of the FIEs used in the Fuzzy-RED scheme. An inspection of these surfaces and the associated linguistic rules provides hints on the operation of Fuzzy-RED. The rules for the “assured” class are more aggressive about decreasing the probability of packet drop than increasing it sharply. There is only one rule that results in increasing drop probability, whereas two rules set the drop probability to zero. If we contrast this with the linguistic rules of the “best effort” class packets, we see that more rules lead to an increase in drop probability, and so more packet drops than the assured traffic class. These rules reflect the particular views and experiences of the designer, and are easy to relate to human reasoning processes. We expect the whole procedure to be independent of the number of active sources and thus avoid the problems of fixed thresholds employed by other RED schemes [16]. With Fuzzy-RED, not only do we expect to avoid such situations but also to generally provide better congestion control and better utilization of the network.
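The following is a highly simplified sketch of the kind of two-input, per-class inference described above. The membership functions, linguistic labels, rule sets, and the weighted-average (Sugeno-style) defuzzification are assumptions made purely for illustration; they are not the rule files actually used by Fuzzy-RED.

```java
// Illustrative two-input fuzzy inference sketch for per-class drop probability.
// All shapes, labels and rule weights below are assumed for illustration only.
public class FuzzyRedSketch {

    // triangular membership function: degree of membership in [0, 1]
    static double tri(double x, double a, double b, double c) {
        if (x <= a || x >= c) return 0.0;
        return x < b ? (x - a) / (b - a) : (c - x) / (c - b);
    }

    // q: normalized queue length in [0, 1]; r: normalized queue growth rate in [-1, 1]
    static double dropProb(double q, double r, boolean assuredClass) {
        double qLow = tri(q, -0.5, 0.0, 0.5), qMed = tri(q, 0.0, 0.5, 1.0), qHigh = tri(q, 0.5, 1.0, 1.5);
        double rFall = tri(r, -2.0, -1.0, 0.0), rSteady = tri(r, -1.0, 0.0, 1.0), rRise = tri(r, 0.0, 1.0, 2.0);

        // each rule: {firing strength (min as AND), consequent drop level}
        double[][] rules = assuredClass
            ? new double[][] {                      // "assured": only one rule raises drops
                {Math.min(qHigh, rRise),  0.6},
                {Math.min(qLow,  rFall),  0.0},
                {Math.min(qLow,  rSteady),0.0},
                {Math.min(qMed,  rSteady),0.1}}
            : new double[][] {                      // "best effort": more rules raise drops
                {Math.min(qHigh, rRise),  1.0},
                {Math.min(qHigh, rSteady),0.7},
                {Math.min(qMed,  rRise),  0.5},
                {Math.min(qLow,  rFall),  0.0}};

        double num = 0.0, den = 0.0;
        for (double[] rule : rules) { num += rule[0] * rule[1]; den += rule[0]; }
        return den == 0.0 ? 0.0 : num / den;        // assured class ends up gentler
    }

    public static void main(String[] args) {
        System.out.printf("best effort: %.2f%n", dropProb(0.8, 0.5, false)); // ~0.75
        System.out.printf("assured:     %.2f%n", dropProb(0.8, 0.5, true));  // ~0.38
    }
}
```

Run on the same congested-queue input, the best-effort rule base yields a noticeably higher drop probability than the assured one, which is the differentiation the linguistic rules are meant to express.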
5 Simulation Results

In this section we evaluate, using simulation, the performance of Fuzzy-RED and compare it with other published results. The implementation of the traffic sources is based on the most recent version of the ns simulator (Version 2.1b8a). Three simulation scenarios are presented. Scenario 1 was a simple scenario used to make an initial evaluation and test of Fuzzy-RED. Scenario 2 compares the behavior of RED and Fuzzy-RED using only TCP/FTP traffic, and finally in Scenario 3 we introduce web traffic, reported to test the ability of RED [19], to evaluate a simple Diff-Serv implementation using RIO and Fuzzy-RED.

5.1 Scenario 1

We have done initial testing of the performance of the Fuzzy-RED queue management using the ns simulation tool with the simple network topology shown in Figure 3 (also used by other researchers for RED performance evaluation [14]). The buffer size was set to 70 packets (max packet size 1000 bytes), the min threshold (minth) for RED was 23 packets and the max threshold was 69 packets. The link between the two routers was set to 40 Mbps and the simulation lasted for 100 seconds. The scripts used for simulation Scenario 1 (and in all other simulation scenarios presented here) were based on the original scripts written in [14]. The rule files used for Scenario 1 were written without any previous study of how they affect the performance of Fuzzy-RED, so they can be seen as a random selection of sets and rules for evaluating Fuzzy-RED. After an extensive series of simulations based on the Scenario 2 topology, and analysis of their results, a new set of rule base files was created. This set of files was used in all the remaining simulation scenarios (Scenarios 2 and 3) without any change, in order to show the capabilities of Fuzzy-RED in various scenarios using different parameters.
Fig. 3. Simple network topology used for the initial simulations (Router 0 and Router 1 connected by a 40 Mb/s, 20 ms delay link; all access connections are 100 Mb/s)
From the simulation results shown in Figure 4 and Figure 5, Fuzzy-RED achieves more than 99% utilization while RED and droptail fail to achieve more than 90%. As one can see from Figure 4 and Figure 5, Fuzzy-RED presents results very close to the ideal (as presented in Figure 2).

Fig. 4. Throughput vs. time
The throughput goes up to 99.7% of the total link capacity (40 Mbps link) and the average queue size is around half the capacity of the buffer, while maintaining a sufficient number of packets in the queue to achieve this high throughput. While these are results from a very basic scenario (see Fig. 3), they demonstrate the dynamic abilities and capabilities of an FIE RED queue compared to a simple RED queue or a classical droptail queue. The results presented in the following simulation scenarios show that these characteristics and abilities are maintained under all conditions without changing any parameters of Fuzzy-RED.
near 100% throughput (99.57%), see Figure 8, while packet drops are kept at very low levels (Figure 10) compared to the number of packets sent (Figure 9).

Fig. 7. Scenario 2 - Buffer size vs. time
These results show clearly that Fuzzy-RED manages to adequately control the queue size while keeping a higher throughput than RED (97.11% for RED compared with 99.57% for Fuzzy-RED).
Fig. 8. Scenario 2 - Throughput vs. time
5.3 Scenario 3
In Scenario 3 we introduce a new network topology used in [19]. The purpose of this scenario is to investigate how Fuzzy-RED and RIO perform under a Diff-Serv scenario. To simulate a basic Diff-Serv environment we introduce a combination of web-like traffic sources and TCP/FTP sources. Half of the sources are TCP/FTP and the other half TCP/Web-like traffic. Traffic from the web-like sources is tagged as assured class traffic and the FTP traffic as best effort. We run the simulation three times for 5000 seconds each in order to enhance the validity of the results. The network topology is presented in Figure 11. We use TCP/SACK with a TCP window
of 100 packets. Each packet has a size of 1514 bytes. For the droptail queues we define a buffer size of 226 packets. We use AQM (Fuzzy-RED or RIO) in the queues of the bottleneck link between router 1 and router 2. All other links have a simple droptail queue. The importance of this scenario is that it compares and evaluates not simply the performance of an algorithm but its performance in implementing a new IP network architecture, Diff-Serv. This means that we want to check whether Fuzzy-RED can provide the necessary congestion control and differentiation and ensure acceptable QoS in a Diff-Serv network. We also attempt to compare Fuzzy-RED with RIO (a RED-based implementation of Diff-Serv).
Fig. 9. Scenario 2 - Packets transmitted
The choice of distributions and parameters is based on [19] and is summarized in Table 1. The implementation of the traffic sources is based on the most recent version of the ns simulator (Version 2.1b8a). All results presented for this scenario were extracted from the ns simulation trace file.

Table 1. Distributions and Parameters
                 Inter-page Time   Objects per page   Inter-object Time   Object Size
Distribution     Pareto            Pareto             Pareto              Pareto
Mean             50 ms             4                  0.5 ms              12 KB
Shape            2                 1.2                1.5                 1.2
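As an illustration of how such a web-traffic model can be driven, the sketch below samples Pareto-distributed values with the means and shape parameters of Table 1, using inverse-transform sampling with the scale derived from the requested mean (valid for shape greater than 1). This is an assumption-laden illustration; ns's own traffic generators, which the simulations actually use, are not shown.

```java
// Sketch: sampling Pareto-distributed web-traffic parameters (Table 1 values).
import java.util.Random;

public class ParetoTraffic {
    // Inverse-transform sample of a Pareto(shape, scale) variate.
    static double pareto(Random rng, double shape, double scale) {
        return scale / Math.pow(1.0 - rng.nextDouble(), 1.0 / shape);
    }

    // mean = shape * scale / (shape - 1)  =>  scale = mean * (shape - 1) / shape
    static double scaleForMean(double mean, double shape) {
        return mean * (shape - 1.0) / shape;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double interPageMs    = pareto(rng, 2.0, scaleForMean(50.0, 2.0)); // inter-page time (ms)
        double objectsPerPage = pareto(rng, 1.2, scaleForMean(4.0, 1.2));  // objects per page
        double objectSizeKb   = pareto(rng, 1.2, scaleForMean(12.0, 1.2)); // object size (KB)
        System.out.printf("page gap %.1f ms, %.1f objects, %.1f KB each%n",
                interPageMs, objectsPerPage, objectSizeKb);
    }
}
```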
Fig. 10. Scenario 2 - Packet drops
Fig. 11. Scenario 3 - Network topology
Although this scenario is a simple and basic one, it can be the starting point for investigating the abilities of Fuzzy-RED in a Diff-Serv network. The results presented here are limited to just two graphs, since the purpose of this paper is to present the potential of Fuzzy-RED to provide congestion control comparable to RIO in Diff-Serv networks, and not a detailed description of its performance.
From the results we see that although Fuzzy-RED and RIO show similar behaviour, Fuzzy-RED appears to control the flow rate across the network better. From Figure 12 we see the throughput behaviour. By throughput here we mean the rate at which traffic arrives at a link (in this case the link connecting router 1 with router 2) before entering any queue. So it is the total traffic arriving at router 1 (both FTP and web traffic). From the graph, RIO seems to stabilise its throughput around 10.25 Mbps (note that the link speed is limited to 10 Mbit/sec). This means that it cannot effectively control the rate at which the sources are sending traffic (according to the ideal scenario shown in Figure 2). Around t = 2500 sec we see a small ascending step. At that point the traffic increases further from 10.1 Mbps to 10.25 Mbps. This means that we have an increase in drops and therefore a decrease in goodput. We define goodput as the traffic rate traversing a link minus all dropped packets and all retransmitted packets.
Fig. 12. Scenario 3 - Throughput vs. time
From Figure 13 we see that Fuzzy-RED delivers a steady goodput around 9.9 Mbps, while RIO shows a decrease from 9.9 to 9.8 due to dropped packets that create retransmissions. The difference is not as important as the fact that Fuzzy-RED seems to provide a more stable behaviour. This result, along with the previous one, encourages us to proceed with further testing in the future.
6 Conclusions

Current TCP/IP congestion control algorithms cannot efficiently support new and emerging services needed by the Internet community. RED proposes a solution; however, in cases with extremely bursty sources (as is the case in the Internet) it fails to effectively control congestion. Diff-Serv using RIO (RED In/Out) was proposed to offer differentiation of services and control congestion. Our proposal of implementing Diff-Serv using a fuzzy logic controlled queue is a novel, effective, robust and
flexible approach and avoids the necessity of any special parameterization or tuning, apart from linguistic interpretation of the system behavior.
Fig. 13. Scenario 3 - Goodput vs. time
It can provide similar or better performance compared to RIO without any retuning or parameterization. Specifically, in Scenario 3 we see that Fuzzy-RED, using the same rules and values in the fuzzy sets (i.e. no finer tuning), has achieved equal or better performance than RIO, for which we use the optimal parameterization discussed in [19]. From this scenario we see that Fuzzy-RED can perform equally well using homogeneous or heterogeneous traffic sources (in this case TCP/FTP traffic and TCP/Web-like traffic) without any change in the way we define it or any special tuning. We believe that with further refinement of the rule base through mathematical analysis, or self-tuning, our algorithm can achieve even better results. In future work we will investigate further performance issues such as fairness among traffic classes, packet drops per class (a QoS parameter), utilization and goodput under more complex scenarios. We expect to see whether Fuzzy-RED can be used to provide the necessary QoS needed in a Diff-Serv network. From these results, and based on our past experience with successful implementations of fuzzy logic control [20], [21], we are very optimistic that this proposal will offer significant improvements in controlling congestion in TCP/IP Diff-Serv networks.
References
1. S. Blake et al., "An Architecture for Differentiated Services", RFC 2475, December 1998.
2. W. Pedrycz, A. V. Vasilakos (Eds.), Computational Intelligence in Telecommunications Networks, CRC Press, ISBN 0-8493-1075-X, September 2000.
3. V. Jacobson, Congestion Avoidance and Control, ACM SIGCOMM '88, 1988.
4. W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms", RFC 2001, January 1997.
5. W. Stevens, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley, 1994.
6. V. Jacobson, R. Braden, D. Borman, TCP Extensions for High Performance, RFC 1323, May 1992.
7. K. K. Ramakrishnan and S. Floyd, A Proposal to Add Explicit Congestion Notification (ECN) to IP, draft-kksjf-ecn-03.txt, October 1998 (RFC 2481, January 1999).
8. Braden et al., Recommendations on Queue Management and Congestion Avoidance in the Internet, RFC 2309, April 1998.
9. C. E. Rohrs, R. A. Berry, and S. J. O'Halek, A Control Engineer's Look at ATM Congestion Avoidance, IEEE Global Telecommunications Conference GLOBECOM '95, Singapore, 1995.
10. J. Martin, A. Nilsson, The Evolution of Congestion Control in TCP/IP: From Reactive Windows to Preventive Flow Control, CACC Technical Report TR-97/11, North Carolina State University, August 1997.
11. T. V. Lakshman and U. Madhow, The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss, IEEE/ACM Transactions on Networking, Vol. 5, pp. 336-350, June 1997.
12. A. Feldmann, A. C. Gilbert, W. Willinger, "Data Networks as Cascades: Investigating the Multifractal Nature of the Internet WAN Traffic", SIGCOMM '98, Vancouver, 1998.
13. A. Feldmann, A. C. Gilbert, P. Huang, W. Willinger, "Dynamics of IP Traffic: A Study of the Role of Variability and the Impact of Control", Proceedings of ACM SIGCOMM 2000.
14. S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking, 1(4), pp. 397-413, August 1993.
15. W. Feng, D. Kandlur, D. Saha, and K. Shin, "A Self-Configuring RED Gateway", IEEE INFOCOM '99, New York, March 1999.
16. Wu-chang Feng, "Improving Internet Congestion Control and Queue Management Algorithms", PhD Dissertation, University of Michigan, 1999.
17. W. Feng, D. Kandlur, D. Saha, and K. Shin, "Blue: A New Class of Active Queue Management Algorithms", tech. rep., UM CSE-TR-387-99, 1999.
18. D. Clark and W. Fang, Explicit Allocation of Best Effort Packet Delivery Service, IEEE/ACM Transactions on Networking, Vol. 6, No. 4, pp. 362-373, August 1998.
19. G. Iannaccone, C. Brandauer, T. Ziegler, C. Diot, S. Fdida, M. May, Comparison of Tail Drop and Active Queue Management Performance for Bulk-Data and Web-like Internet Traffic, 6th IEEE Symposium on Computers and Communications, Hammamet, July 2001.
20. A. Pitsillides, A. Sekercioglu, G. Ramamurthy, "Effective Control of Traffic Flow in ATM Networks Using Fuzzy Explicit Rate Marking (FERM)", IEEE JSAC, Vol. 15, Issue 2, Feb 1997, pp. 209-225.
21. A. Pitsillides, A. Sekercioglu, "Congestion Control", in Computational Intelligence in Telecommunications Networks (Eds. W. Pedrycz, A. V. Vasilakos), CRC Press, ISBN 0-8493-1075-X, September 2000.
An Architecture for Agent-Enhanced Network Service Provisioning through SLA Negotiation

David Chieng¹, Ivan Ho², Alan Marshall¹, and Gerard Parr²

¹ The Advanced Telecommunication Research Laboratory, School of Electrical and Electronic Engineering, Queen's University of Belfast, Ashby Building, Stranmillis Road, Belfast, BT9 5AH, UK. {d.chieng, a.marshall}@ee.qub.ac.uk
² Internet Research Group, School of Information and Software Engineering, University of Ulster, Coleraine, BT52 1SA, UK. {wk.ho, gp.parr}@ulst.ac.uk
Abstract. This paper focuses on two main areas. We first investigate various aspects of subscription and session Service Level Agreement (SLA) issues, such as negotiating and setting up network services with Quality of Service (QoS) and pricing preferences. We then introduce an agent-enhanced service architecture that facilitates these services. A prototype system consisting of real-time agents that represent various network stakeholders was developed. A novel approach is presented where the agent system is allowed to communicate with a simulated network. This allows the functional and dynamic behaviour of the network to be investigated under various agent-supported scenarios. This paper also highlights the effects of SLA negotiation and dynamic pricing in a competitive multi-operator network environment.
1 Introduction

The increasing demand to provide Quality of Service (QoS) over Internet-type networks, which are essentially best effort in nature, has led to the emergence of various architectures and signalling schemes such as Differentiated Services (DiffServ) [1], Multi Protocol Label Switching (MPLS) [2], IntServ's Resource Reservation Protocol (RSVP) [3], and Subnet Bandwidth Management [4]. For example, Microsoft has introduced Winsock2 GQoS (Generic QoS) APIs in their Windows OS that provide RSVP signalling, QoS policy support, and invocation of traffic control [5]. CISCO and other vendors have also built routers and switches that support RSVP, DiffServ and MPLS capabilities [6]. Over the next few years we are going to witness a rapid transformation in the functions provided by network infrastructures, from providing mere connectivity to a wider range of tangible and flexible services involving QoS. However, in today's already complex network environment, creating, provisioning and managing such services is a great challenge. First, service and network providers have to deal with a myriad of user requests that come with diverse Service Level Agreements (SLAs) or QoS requirements. These SLAs then need to be mapped to respective policy and QoS schemes, traffic engineering protocols, and so on, in
accordance with the underlying technologies. This further involves dynamic control and reconfiguration management, monitoring, as well as other higher-level issues such as service management, accounting and billing. Matching these service requirements to a set of control mechanisms in a consistent manner remains an area of weakness within the existing IP QoS architectures. These processes rely on the underlying network technologies and involve the cooperation of all network layers from top to bottom, as well as every network element from end to end. Issues regarding SLAs arise due to the need to maximize customer satisfaction and service reliability. According to [7], many end users and providers in general are still unable to specify SLAs in a way that benefits both parties. Very often, the service or network providers will overprovision their networks, which leads to service degradation, or alternatively they may fail to provide services to the best of their networks' capabilities. Shortcomings cover a wide range of issues such as content queries, QoS preferences, session preferences, pricing and billing preferences, security options, and so on. To set up a desired service from end to end, seamless SLA transactions need to be carried out between users, network providers and other service providers. We propose an agent-enhanced framework that facilitates these services. Autonomous agents offer an attractive alternative approach for handling these tasks. For example, a service provider agent can play a major role in guiding and deciphering users' requests, and is also able to respond quickly and effectively. These issues are critical, as the competitiveness of future providers relies not only on the diversity of the services they can offer, but also on their ability to meet customers' requirements. The rest of the paper is organized as follows: Section 2 discusses related research in this area. Section 3 investigates various aspects of subscription and session SLA issues for a VLL service, such as guaranteed bandwidth, service start time, session time, and pricing. A generic SLA utility model for multimedia services is also explained. Section 4 describes the architecture and the agent system implementation. Section 5 provides a brief service brokering demonstration using an industrial agent platform. In Section 6 a number of case studies on bandwidth negotiation and dynamic pricing are described.
2 Related Work

Our work is motivated by a number of research themes, especially the Open Signalling community (OPENSIG)'s standard development project, IEEE P1520 [8], and the Telecommunications Information Networking Architecture Consortium (TINA-C) [9] initiatives. IEEE P1520 is driving the concept of open signalling and network programmability. The idea is to establish an open architecture between network control and management functions. This enhances programmability for diverse kinds of networks with diverse kinds of functional requirements, interoperability and integration with legacy systems. TINA-C promotes interoperability, portability and reusability of software components so that they are independent of specific underlying technologies. The aim is to share the burden of creating and managing a complex system among different business stakeholders, such as consumers, service providers, and connectivity providers. The authors in [11] proposed a QoS management architecture that employs distributed agents to establish and maintain the QoS
requirements for various multimedia applications. Here, QoS specification is categorized into two main abstraction levels: application and system. The authors in [10] introduced the Virtual Network Service (VNS), which uses a virtualisation technique to customize and support individual Virtual Private Network (VPN) QoS levels. While this work concentrates on the underlying resource provisioning management and mechanisms, our work focuses on the aspects of service provisioning using agents on top of it, which is complementary. The capability of current RSVP signalling is extended in [12], where users are allowed to reserve bandwidth or connections in advance so that the blocking probability can be reduced. The authors in [13] offered a similar idea but used reservation agents to help clients reserve end-to-end network resources. The Resource Negotiation and Pricing Protocol (RNAP) proposed by [14] enables network service providers and users to negotiate service availability, price quotation and charging information by application. This work generally supports the use of agents and demonstrates how agents can enhance flexibility in provisioning network services. In our work we extended the research by looking at the effects of allowing SLA negotiation and the results of dynamic pricing in a competitive multi-operator network environment.
3 Service Level Agreement

The definition of SLAs or SLSs is the first step towards QoS provisioning. It is essential to specify the SLAs and SLSs between ISPs and their customers, and between their peers, with assurance that these agreements can be met. A Service Level Agreement (SLA) provides a means of quantifying service definitions. In the networking environment, it specifies what an end user wants and what a provider is committing to provide. The definitions of an SLA differ at the business, application and network level [15]. Business level SLAs involve issues such as pricing schemes and contracts. Application level SLAs are concerned with the issues of server availability (e.g., 99.9% during normal hours and 99.7% during other hours), response time, service session duration, and so on. Network level SLAs (also often referred to as Service Level Specifications (SLSs)) involve packet-level flow parameters such as throughput requirements or bandwidth, end-to-end latency, packet loss ratio/percentage, error rate, and jitter. In this work, we concentrate on the SLA issues involved in establishing a Virtual Leased Line (VLL) type of service over IP networks. The following describes some basic SLA parameters considered in our framework during a service request [16].

Guaranteed Bandwidth (b_i) for request i is the amount of guaranteed bandwidth desired by this service. b may be the min or mean guaranteed bandwidth, depending on the user's application requirement and the network provider's policy. This parameter is considered due to the ease of its configuration and also because it is the single most important factor that affects other lower-level QoS issues such as delay and jitter. For our prototype system, the guaranteed bandwidth can be quantified in units of 1 kb, 10 kb, and so on.

Reservation Start Time (Ts_i) is the time when this service needs to be activated. If a user requires an instant reservation, this parameter is simply assigned to the current time. A user can also schedule the service in advance. Any user who accesses the
network before his or her reservation start time may only be given the best-effort service.

Non-preemptable Session (T_i) is the duration required for this guaranteed service. When this duration expires, the connection automatically goes into preemptable mode, where the bandwidth can no longer be guaranteed. The user needs to renegotiate if he or she wishes to extend the reservation session. This parameter is typically used for Video on Demand (VoD) and news-broadcast type services, where the service session is known a priori.

Price (P_i). It is believed that value-added services and applications are best delivered and billed on an individual or per-transaction basis. P_i can be the maximum price a user is willing to pay for this service, which may represent the aggregated cost quoted by the various parties involved in setting up this service. For example, the total charge for a VoD service can be the sum of the connection charge imposed by the network provider, the content provider's video delivery charge, the service provider's value-added charge, etc. Alternatively, the service can also be charged on a usage basis, such as cost per kbps per minute.

Rules (Rules_i) contain a user's specified preferences and priorities. This is useful at times when not all the required SLA parameters can be granted. The user can specify which parameter has priority or which parameter is tolerable.

3.1 SLA Utility Model

The SLA Management utility model proposed by [17] is adopted in this work (Fig. 1).
Fig. 1. General Utility Model
This is a mathematical model for describing the management and control aspects of SLAs for multimedia services. The utility model formulates the adaptive SLA management problem as integer programming. It provides a unified and
18
D. Chieng et al.
computationally feasible approach to make session request/admission control, quality selection/adaptation, and resource request/allocation decisions. For a user i, a service utility can be defined as $u_i(q_i)$, where $q_i$ is the service quality requested by this session. This utility then needs to be mapped to the amount of resource usage, $r_i(q_i)$. The service quality represents the QoS level requested by a user, which then needs to be mapped to bandwidth, CPU or buffering usage. In this work, however, only the bandwidth resource is considered. The scope of the total utility U is infinite, but the total amount of resource available, i.e. the link capacity C, is finite. The objective of a service provider is to maximize the service utility objective function U:

$$U = \max \sum_{i=1}^{n} u_i(q_i) = \max \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} u_{ij} \quad (1)$$

$$\sum_{j=1}^{m} x_{ij} = 1 \;\text{ and }\; x_{ij} \in \{0, 1\} \quad (2)$$
where n = total number of service sessions and m = total number of service quality options. The total resource usage can be represented by:

$$R = \sum_{i=1}^{n} r_i(q_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} r_{ij} \le C \quad (3)$$
Equation (2) means that a user can only choose one service quality at a time. Equations (1) and (3) mean that the problem of SLA management for a multimedia service provider is to allocate the resource required by each customer, while maximizing the service utility under the resource constraint. From the SLA parameters described in the previous section, a user request i for a VLL service can be represented by the extended utility function [16]:

$$u_i(b_i, Ts_i, T_i, P_i, \mathit{Rules}_i) \quad (4)$$

After mapping to resource usage, the utility is:

$$r_i(b_i, Ts_i, T_i) \quad (5)$$

From [12], a pre-emption priority $p(t)$ is introduced. When $p(t) = 1$ the session bandwidth is guaranteed; when $p(t) = 0$ the session bandwidth is not guaranteed. This can be expressed as:

$$p_i(t) = \begin{cases} 0, & t < Ts_i \text{ or } t > Ts_i + T_i \\ 1, & Ts_i < t < Ts_i + T_i \end{cases} \quad (6)$$
If we consider the case of a single link, where C = total link capacity, the total reserved bandwidth resource at time t, R(t), can be represented as:

$$R(t) = r_1 + r_2 + r_3 + \dots + r_n = b_1 p_1(t) + b_2 p_2(t) + b_3 p_3(t) + \dots + b_n p_n(t) = \sum_{i=1}^{n} b_i\, p_i(t) \quad (7)$$
where n = number of active VLLs at time t. This allows the network provider to know the reserved bandwidth load at present and in advance. In order to provide an end-to-end bandwidth reservation facility (immediate or future), three sets of databases are required.

User Service Database (USD) stores individual customer profiles and information, particularly the agreed SLAs, on a per-service-session basis. This information is essential for the resource manager to manage individual user sessions, such as when to
guarantee and when to preempt resources. This database can also be utilized for billing purposes.

Resource Reservation Table (RRT) provides a form of lookup table for the network provider to allocate network resources (bandwidth) to new requests. It tells the network provider the current or future available resources on any link at any time. A minimum unit of session time must be defined; this can be, for example, a few minutes, one minute, or one second, depending on the network provider's policy. Similarly, a minimum unit of guaranteed bandwidth must be defined, which can be quantified in units of 1 kb, 10 kb, etc.

Path Table (PT) stores the distinct hop-by-hop information from an ingress point to an egress point and vice versa. If there is an alternative path, another path table should be created. These tables are linked to RRTs.
Fig. 2. RRT and Reservation Processes
Figure 2 illustrates the reservation table and how reservation processes take place at a link for requests x and y. It can be observed that request u_x's guaranteed bandwidth could not be honoured throughout the requested time period T_x. In this case the requester has a number of options: they can either reduce the amount of bandwidth requested, postpone the reservation start time, decrease the non-preemptable session duration, or accept the compromise that for a short period their bandwidth will drop off. This system offers greater flexibility in resource reservation compared to conventional RSVP requests. In RSVP, the reservation process will fail even if only a single criterion, such as the required bandwidth, is not available. In addition, the proposed mechanism does not pose the scalability problems experienced in RSVP, as the reservation states and SLAs are stored in a centralized domain server (the network/resource manager server) and not in the routers. Unlike existing resource reservation procedures, which are static and normally lack quantitative knowledge of traffic statistics, this approach provides a more accountable resource management scheme. Furthermore, the resource allocated in existing systems is based on initial availability and does not take into account changes in future resource availability.
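To make the reservation logic concrete, the sketch below combines the pre-emption priority of Eq. (6), the reserved-bandwidth sum R(t) of Eq. (7), and the slot-by-slot admission check illustrated in Fig. 2. Class names, the slot granularity, and the scanning loop are assumptions made for illustration; this is not the prototype's actual USD/RRT/PT implementation.

```java
// Sketch only: p_i(t) (Eq. 6), R(t) (Eq. 7) and an RRT-style admission check
// over the requested window (Fig. 2). Names and the time-slot scan are assumed.
import java.util.Arrays;
import java.util.List;

class VllSession {
    final double b;    // guaranteed bandwidth b_i
    final double ts;   // reservation start time Ts_i
    final double dur;  // non-preemptable session duration T_i

    VllSession(double b, double ts, double dur) { this.b = b; this.ts = ts; this.dur = dur; }

    // p_i(t) of Eq. (6): 1 inside the non-preemptable window, 0 outside
    int priority(double t) { return (t > ts && t < ts + dur) ? 1 : 0; }
}

public class LinkRrt {
    final double capacity;            // link capacity C
    final List<VllSession> sessions;  // reservations already accepted on this link

    LinkRrt(double capacity, List<VllSession> sessions) {
        this.capacity = capacity;
        this.sessions = sessions;
    }

    // R(t) of Eq. (7): total guaranteed bandwidth reserved at time t
    double reserved(double t) {
        double r = 0.0;
        for (VllSession s : sessions) r += s.b * s.priority(t);
        return r;
    }

    // Admission check of Fig. 2: request (b_x, Ts_x, T_x) is accepted only if the
    // extra bandwidth fits under C in every slot of the requested window.
    boolean canReserve(double b, double ts, double dur, double slot) {
        for (double t = ts; t < ts + dur; t += slot) {
            if (reserved(t) + b > capacity) return false;  // like request x: fails
        }
        return true;                                       // like request y: succeeds
    }

    public static void main(String[] args) {
        List<VllSession> existing = Arrays.asList(
                new VllSession(4.0, 0.0, 60.0),    // 4 Mb/s guaranteed from t=0 for 60 min
                new VllSession(3.0, 30.0, 60.0));  // 3 Mb/s guaranteed from t=30 for 60 min
        LinkRrt link = new LinkRrt(10.0, existing);
        System.out.println(link.canReserve(2.0, 20.0, 30.0, 1.0)); // true: 4+3+2 <= 10
        System.out.println(link.canReserve(5.0, 40.0, 10.0, 1.0)); // false: 4+3+5 > 10
    }
}
```

The options discussed above (less bandwidth, a later start time, or a shorter non-preemptable session) all amount to retrying this check with a modified request.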
4 Agent-Enhanced Service Provisioning Architecture
Software agents offer many advantages in this kind of environment. Agents can carry out expertise brokering tasks on behalf of end users, service providers and network providers. Agents can negotiate services at service start-up or even while a service is in progress. Agents are particularly suited to tasks where fast decision-making is critical. These capabilities address the two most important aspects of performance in an SLA: availability and responsiveness [15]. The following attributes highlight the capabilities of agents [18].
Autonomy. Agents can be both responsive and proactive. They are able to carry out tasks autonomously on their owners' behalf under pre-defined rules or tasks. The level of their intelligence depends on the given roles or tasks.
Communication. With this ability, negotiations can be established between two agents. FIPA's [19] Agent Communication Language (ACL) has become the standard communication language for agents. An ACL ontology for VPNs has also been defined within FIPA.
Cooperation. Agents can cooperate to achieve a single task. Hence, agents representing end users, service providers and network providers are able to cooperate to set up an end-to-end service, e.g. a VPN that spans multiple domains. Within a domain, agents also enhance coordination between nodes, which is lacking in most current systems.
Mobility. Java-based agents can migrate across heterogeneous networks and platforms. This attribute differentiates mobile agents from other forms of static agents. Mobile agents can migrate their executions and computation processes to remote hosts. This saves shared and local resources such as bandwidth and CPU usage compared to conventional client/server systems. Thus, intensive SLA negotiation processes can be migrated to the service provider's or network provider's domain.
The value-added services provided by the agent system have been developed using the Phoenix v1.3 APIs [20]. Phoenix is a framework for developing, integrating and deploying distributed, multi-platform, and highly scalable Web/Java-based applications. Being 100% Java-based, Phoenix is object-oriented in nature and platform independent. The Phoenix Core Services offer various functions required in our framework, i.e. service/user administration, session management, resource management, event management, service routing, logging, billing, template expansion, and customization. The Phoenix Engine is basically a threaded control program that provides the runtime environment for our servlet-based agents. The simplicity of the Phoenix framework allows fast prototyping and development of multi-tiered, highly customisable applications or services. Figure 3 shows our overall prototype system architecture. This represents a simplified network marketplace consisting of user, network and content provider domains. For our prototype system, the agents are built using Java 2 (JDK 1.2.2) and the J2EE Servlet API [21]. These agents communicate via HTTP v1.1 [22] between different Phoenix Engines/virtual machines (JVMs), and via local Java method calls within the same Phoenix Engine/VM. Port 8888 is reserved as the Agents' Communication Channel (ACC). HTTP is preferred due to its accessibility in the current web environment. The prototype system also incorporated some components from the Open Agent Middleware (OAM) [23] developed by Fujitsu Laboratories Ltd. that allows a
dynamic, flexible and robust operation and management of distributed service components via mediator agents. This realizes service plug-and-play, component repository management, service access brokering, dynamic customization according to user preferences, and so on.
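The agent interactions described above can be imitated with a very small HTTP exchange. The sketch below is illustrative only: it uses plain Python rather than the Phoenix/Java servlet framework, the message fields and the /aspa path are invented, and only the idea of agents exchanging requests over HTTP on the reserved ACC port 8888 is taken from the text.

```python
import json, threading, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ACC_PORT = 8888   # Agents' Communication Channel port reserved in the prototype

class ASPAHandler(BaseHTTPRequestHandler):
    """Toy stand-in for an ASPA servlet: accept a UA request and answer it."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"status": "request accepted",
                            "service": body.get("service")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):   # keep the demo quiet
        pass

server = HTTPServer(("localhost", ACC_PORT), ASPAHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# UA side: encode a service request and send it over the ACC
req = urllib.request.Request(
    f"http://localhost:{ACC_PORT}/aspa",
    data=json.dumps({"agent": "UA", "service": "VoD"}).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
server.shutdown()
```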
Fig. 3. Agent-Enhanced Architecture (user, network and content provider domains; the UA, ASPA, CPA, NA, SM, RM, RA and TA agents communicate over an HTTP Agent Communication Channel, with the network-layer functions running in a simulation environment).
In operation, the UA (User Agent) first receives requests from users to establish a service. The ASPA (Access Service Provider Agent) acts as the central contact point for all the authorized UAs, the CPA (Content Provider Agent) and the NA (Network Agent). The tasks undertaken by the ASPA include brokering, scheduling and pricing content delivery with the CPA. It also facilitates connection configuration, reconfiguration and teardown processes with the NA. OAM Mediator components are incorporated within the ASPA to provide the brokering facility. The NA is responsible for mapping user-level SLAs into network-level SLAs. The SM (Service Manager) manages individual user sessions, i.e. the USD, and the RM (Resource Manager) manages the network resources, which include the collection of PTs and RRTs (Resource Reservation Tables) within its domain. At the element layer, we introduce an RA (Router Agent) that performs router configuration, flow management and control functions such as packet classification, shaping, policing, buffering, scheduling and performance monitoring according to the required SLA. The TA (Terminal Agent) manages local hardware and software components at the end system such as display properties, RAM, resources, drivers, MAC/IP addresses, the applications library, etc. In the current prototype, most of the functions residing within the network layer and below are implemented in a simulation environment.
5 Service Brokering Demonstration Figure 4 illustrates various stakeholders involved in our prototype system scenario. In this framework, a novel approach is taken where the real-time agents are allowed to communicate with a simulated network model via TCP/IP socket interfaces.
Fig. 4. Service Brokering Scenario (UA, ASP, VoD1, VoD2, Music, Dbase, Network and Manager/Monitoring agent servlets run in the Phoenix Engine and connect via TCP/IP to the BONeS interface module, network manager module and simulated network elements).
Video on Demand (VoD), Music, and Database Agents have earlier registered their services with the ASP agent. When first contacted by the UA, the ASPA offers users a range of services to which they can subscribe. For example, if a VoD service is chosen, the ASPA will first look up its database for registered content providers that match the user's preferences. If a user chooses to 'subscribe', the ASPA will propagate the request to the target agent, which then takes the user through a VoD subscription process. The experimental VoD demonstration system allows users to select a movie title, desired quality, desired movie start time, the maximum price they are willing to pay, tolerances, and so on. Here, the service quality, qi, of a video session can be classified as high, medium or low. The chosen service quality is then mapped to the amount of resource, ri(qi), required. The UA then encodes these preferences and sends them to the ASPA. The process is summarised in Figure 5.
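As a rough illustration of this encoding step, the sketch below builds a VoD preference record and maps the chosen quality to a resource amount ri(qi). The field names and the quality-to-bandwidth mapping are assumptions, not the prototype's actual message format.

```python
# Assumed mapping from quality class to bandwidth units (1 unit ~ 64 kbps)
QUALITY_TO_UNITS = {"high": 30, "medium": 15, "low": 6}

def encode_vod_request(title, quality, start_time, max_price, bw_tolerance):
    """Build the preference record a UA would forward to the ASPA."""
    return {
        "service": "VoD",
        "title": title,
        "quality": quality,
        "resource_units": QUALITY_TO_UNITS[quality],   # r_i(q_i)
        "start_time": start_time,                      # Ts_i
        "max_price": max_price,                        # P_i
        "bw_tolerance": bw_tolerance,                  # fraction of b_i the user will give up
    }

print(encode_vod_request("A Movie", "medium", start_time=1200,
                         max_price=20, bw_tolerance=0.2))
```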
Fig. 5. Negotiation and Configuration (message sequence between UA, ASPA, VOD1 and NA: 1 - submit request/negotiate; 2 - submit request/re-negotiate; 3 - request accepted/denied; 4 - request/re-negotiate connection; 5 - connection granted/failed; 6 - acknowledge user).
5.1 Bandwidth Negotiation Evaluation
For this scenario, we assume there is only one link and that only bandwidth is negotiable. The following simulation parameters were applied:
– Link capacity, C = 100 Mbps.
– User requests, u(t), arrived according to a Poisson distribution with a given mean request arrival rate, and all requested immediate connections.
– Session durations were exponentially distributed with mean T = 300 s.
– Users' VLL requested bandwidth units, Br, followed an exponential random distribution with minimum requirement Brmin = 1 unit and a maximum cut-off limit Brmax; a request exceeding Brmax was redrawn until it satisfied the limit. If 1 unit is assumed to be 64 kbps, the requested bandwidth Br ranges between {64..10000} kbps. This emulates the demand for different qualities of voice, hi-fi music, and video telephony up to high-quality video conferencing or VoD sessions.
– Users' bandwidth tolerance was varied from 0% (not negotiable) up to 100% (accept any bandwidth, or simply best effort).
– Arrival rates, requested bandwidth and session duration were mutually independent.
The data was collected over a period of 120000 s, or 33.3 hours, of simulation time. Bandwidth negotiation only occurred at the resource limit: when the requested bandwidth exceeds the available bandwidth (ri(t) + R(t) ≥ C) over the requested session Tsi < t < Tsi + Ti, the ASPA has to negotiate the guaranteed bandwidth with the UA. For the following experiments, different levels of mean percentage offered load (requested bandwidth) were applied. These loads are defined as a percentage of link capacity, so a mean offered load of 120% means that the demand for bandwidth exceeds the available capacity. Figure 6 shows the request blocking (rejection) probability when bandwidth negotiation at different tolerance levels was applied. If the bandwidth request is tolerant, the rejection probability can clearly be reduced; 100% toleration is equivalent to a best-effort request. Figure 7 shows that the effect on the mean percentage reservation load (mean R) is almost negligible at 60% mean offered load or lower, because only those users whose requested bandwidth is not available are subjected to negotiation.
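The negotiation decision taken at the resource limit can be sketched as follows. This is not the simulation used for the results below: the available bandwidth is replaced by a crude random stand-in, and only the grant/reduce/reject logic and the exponentially distributed bandwidth demand with a cut-off are modelled.

```python
import random

random.seed(1)
CAPACITY = 100 * 1000 // 64          # link capacity in 64 kbps units (100 Mbps)

def negotiate(requested, available, tolerance):
    """Grant the requested bandwidth if possible; otherwise grant the available
    amount if it stays within the user's tolerance, else reject (return 0)."""
    if requested <= available:
        return requested
    minimum_acceptable = requested * (1 - tolerance)
    return available if available >= minimum_acceptable else 0

def draw_request(mean_units=20, b_min=1, b_max=156):
    """Exponentially distributed bandwidth demand, redrawn until below the cut-off."""
    while True:
        b = max(b_min, round(random.expovariate(1 / mean_units)))
        if b <= b_max:
            return b

# crude illustration of the effect of tolerance on rejections at a loaded link
for tolerance in (0.0, 0.4, 1.0):
    rejected = sum(
        negotiate(draw_request(), random.randint(0, CAPACITY), tolerance) == 0
        for _ in range(10_000))
    print(f"tolerance {tolerance:.0%}: rejection ratio {rejected / 10_000:.3f}")
```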
Fig. 6. Request Rejection Probability (blocking probability against % offered load for 0-100% bandwidth tolerance).
Fig. 7. Mean % Reservation Load (R) (mean % reserved load against mean % offered load for 0-100% bandwidth tolerance).
Fig. 8. Improvement on Mean % R (improvement in mean % reservation load against mean % offered load).
Fig. 9. Bandwidth Index (BWI) (BWI against mean % offered load).
In Figure 8, we observe a significant improvement in mean R when the required bandwidth is tolerant. An improvement of 1% in R means that an extra 1 Mbps of bandwidth on average was reserved/sold over the 33.3 hours. If 1 Mbps was priced at £2 per hour, this would generate an extra revenue of £66.60 for the provider. The Bandwidth Index (BWI), introduced in Figure 9, corresponds to the amount of bandwidth granted over the amount of bandwidth originally requested, bg/br. We can see that at low load most users get what they want (BWI ~ 1). At high load, however, those users who can tolerate a reduction in their bandwidth requirements have to accept lower bandwidth if they need the service.
6 Dynamic Bandwidth Pricing Evaluation
A demonstration of dynamic bandwidth pricing was carried out during the technical visit session at the Fujitsu Telecommunications Europe Ltd. headquarters in Birmingham, in conjunction with the World Telecommunication Congress (WTC2000). In each session, three volunteers were invited from the audience to assume the role of future network operators. Their common goal was to maximize their network revenue by dynamically pricing bandwidth within the agent-brokering environment. The scenario presents three competing network providers with identical network topology and resources. Each network consisted of 4 routers with 2x2 links, and all provide connections to a multimedia server.
All the possible paths from one ingress point to an egress point were precomputed and stored in the PTs (path tables). The Network Provider Agent (NPA) always offers the shortest available path to the ASPA. The access service provider agent (ASPA) acts as the mediator that aggregates the resource requirements of the various user requests and searches for the best possible deals in order to satisfy customers' requirements. A simple LAN was set up for this game/demo. The multi-operator network model, coupled with the internal agent brokering mechanisms, was run on a Sun Ultra 10 workstation. A few PCs were set up for the competing network operators and a separate monitoring screen was provided for the rest of the audience. In this game we considered some universal customer SLA preferences such as best QoS (guaranteed bandwidth) and cheapest price. We assumed most users prefer QoS as their first priority and the cheapest offered price as the second priority. Hence the user's SLA preferences are: IF bi, Tsi, Ti not negotiable (priority) AND Pi < 21 (maximum acceptable price) THEN ACCEPT min(NetA || NetB || NetC). Therefore, if the cheapest network provider could not provide the required guaranteed flow bandwidth, the service provider would opt for the second cheapest one (a small sketch of this selection policy is given after Table 1). During the game, the ASPA continually negotiated with the NPAs to obtain the required number of VLLs to a multimedia server that offered voice, audio, VoD and data services. In order to show the effects of dynamic pricing, we allowed the invited audience/acting network operators to manually change the bandwidth price. For this demo we did not provide different pricing schemes for different user classes (voice, video, etc.), although this is also possible. We also did not provide a bandwidth negotiation facility in this game, as we only wanted to focus on dynamic pricing competition between the operators. Table 1 describes the billing parameters associated with this demonstrator.
Table 1. Billing Parameters
Our Price, pi – The selling price for a bandwidth unit per minute for VLL i.
Cost Price, θi – The cost price for a bandwidth unit per minute for VLL i. This value changes according to the link's reservation load and can loosely represent the management overhead.
Operation Cost, s – The overall maintenance cost, hardware cost, labour cost, etc. per minute.
Guaranteed bandwidth, bi – The amount of guaranteed/reserved bandwidth units allocated to VLL i.
No. of links, li – The number of links used by VLL i. Since the number of links is considered in the charging equation, the shortest available path is preferred.
Session, Ti – The session length in minutes subscribed by VLL i.
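A minimal sketch of the selection policy referred to above (the cheapest provider that can still guarantee the flow, subject to the user's price ceiling); the offers and prices are hypothetical.

```python
# Hypothetical offers: (provider, bandwidth unit price, can the flow be guaranteed?)
offers = [("NetA", 12, False), ("NetB", 14, True), ("NetC", 18, True)]
MAX_PRICE = 21   # the user's maximum acceptable price P_i

def select_provider(offers, max_price):
    """Pick the cheapest provider that can guarantee b_i, Ts_i and T_i and whose
    price stays below the user's ceiling; return None if nobody qualifies."""
    feasible = [(price, name) for name, price, guaranteed in offers
                if guaranteed and price < max_price]
    return min(feasible)[1] if feasible else None

# NetA is cheaper but cannot guarantee the flow, so the second cheapest is chosen
print(select_provider(offers, MAX_PRICE))   # -> 'NetB'
```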
In this game, each user was billed at the end of their service session. The gross revenue earned from each VLL i is therefore calculated as:

Revi = (pi − θi) · bi · li · Ti = Pi, the total charge imposed on user i    (8)

The total gross revenue earned by a network operator after period t, Revgross(t), is hence:

Revgross(t) = Σi=1..n(t) Revi    (9)

where t = simulation time elapsed (0 ≤ t ≤ Tstop) in minutes and n(t) = total number of VLLs counted after t. The total net revenue Revnet(t) after t is then:

Revnet(t) = Σi=1..n(t) Revi − s·t    (10)
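Equations (8)-(10) translate directly into a few lines of code; the sketch below is illustrative only and the example VLL values are invented.

```python
def vll_revenue(p, theta, b, l, T):
    """Gross revenue from one VLL (Eq. 8): (p_i - theta_i) * b_i * l_i * T_i."""
    return (p - theta) * b * l * T

def gross_revenue(vlls):
    """Rev_gross(t): sum of Rev_i over the n(t) VLLs billed so far (Eq. 9)."""
    return sum(vll_revenue(*v) for v in vlls)

def net_revenue(vlls, s, t):
    """Rev_net(t) = Rev_gross(t) - s*t, with s the operation cost per minute (Eq. 10)."""
    return gross_revenue(vlls) - s * t

# two example VLLs: (price, cost price, bandwidth units, links, session minutes)
billed = [(15, 11, 30, 2, 45), (12, 11, 2, 1, 8)]
print(gross_revenue(billed), net_revenue(billed, s=50, t=60))
```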
Each player, or acting network operator, could monitor their competitors' offered bandwidth prices from the console. The instantaneous reservation load for the network's links, and the network topology showing current network utilisation, were displayed. The reservation load R(t) is the sum of all the independent active VLLs' reserved bandwidth. The number of active users and the network rejection statistics were also reported. For this game, the operators' revenues were generated solely from VLL sales. A monitoring window displayed the total revenues (profit/loss) generated by each network operator. We associated a link QoS level of Gold, Silver or Bronze with each link by referring to the link's reservation load. The user classes applied in the game were as follows:

User Class | Request Arrival Rate (per hour) | Guaranteed unit BW Requested Per Flow | Session Time Per VLL (mins) | Example Applications
1 | 70 | 2 | 3-10 | VOIP/Audio
2 | 15 | 30 | 10-60 | VoD/Video Conferencing
3 | 28 | 20 | 1-10 | WWW/FTP
* One bandwidth unit is roughly defined as 64 kbps.
Figure 10 shows the accumulated requested bandwidth (offered load) profile for the different classes of users. The results from one of the sessions were collected and analyzed in Figures 11 through 15. Figure 11 shows the pricing history of the three acting network operators. Here, the network operators were trying to maximise their revenues by setting different bandwidth prices over time.
Fig. 10. Offered Load at the Access Node (accumulated bandwidth units requested over time by Class 1, Class 2 and Class 3 users).
Fig. 11. Price Bidding History (bandwidth unit prices set by Net A, Net B and Net C over time).
At t ~ 20 mins, network operator A lowered its bandwidth price to 1 and caused a sharp increase in load over the measured link (see Figure 12). At t > 40 mins, network operator A increased its price dramatically, and soon became much more expensive than the others. As a result, a significant drop in reservation load was observed after t > 75 mins. This was most likely due to video subscribers leaving the network. Note also the change in load at around t = 100 mins.
Fig. 12. Link Load (link2) vs. Time (network load in bandwidth units for Net A, Net B and Net C).
Fig. 13. Revenue Generated vs. Time (revenues of Net A, Net B and Net C).
In Figure 14, we can observe a close relationship between reservation load and price. In this case, the cheapest provider earned the most revenue. However, it can be observed that network B's average price was only marginally higher than network A's. This means that network B could actually have bid a higher average price and won the game, because network C's average bandwidth price was significantly higher than network B's. Although this strategy is only applicable to this scenario, it illustrates the basic principle of how network providers can maximise their revenues in such a dynamically changing market.
Fig. 14. Average Load vs. Price (average reservation load and average bandwidth unit price for Networks A, B and C).
Fig. 15. Revenue per Time Interval (revenue earned by Networks A, B and C in 20-minute intervals from 1-20 to 180-200 mins).
Figure 15 shows the importance of setting the right price at the right time. Network A made a loss in the 20-40 minute interval due to its low price offer. However, much of the loss was compensated in the 100-120 minute interval due to bulk VLL sales at a high price. It is also observed from this simple experiment that it is more profitable to earn revenue from high-bandwidth, long-session streams such as video conferencing.
7 Conclusion
In this paper we have described an agent-enhanced service provisioning architecture that facilitates SLA negotiations and brokering. An extended utility model was devised to formulate the SLA management problem. A prototype system consisting of different types of real-time agents that interact with a simulated network was developed to demonstrate scenarios and allow the functional and dynamic behaviour of the network under various agent-supported scenarios to be investigated. Some futuristic scenarios for dynamic SLA negotiation, i.e. of bandwidth and price, were demonstrated, particularly for VLL-type services. The results show that an agent-based solution introduces much greater dynamism and flexibility in how services can be provisioned. During the dynamic pricing demonstration, we allowed the audience to manually set the bandwidth price. In the future, a pricing agent could potentially take over this task, setting the right price at the right time based on a more sophisticated pricing mechanism (e.g. different pricing schemes for different service classes). A well-defined pricing structure is not only important as a tool to enable users to match and negotiate services as a function of their requirements; it can also be a traffic control mechanism in itself, as the dynamic setting of prices can be used to control the volume of traffic carried. We learned how a straightforward cheapest-bandwidth price-bidding scenario affects the competition between the network providers. In our current prototype system, the agents only exercise the simplest forms of transactions, such as bandwidth negotiation and price comparison. Hence, the agents' activities within the network are negligible whether client-server-based agents or mobile agents are used. Nevertheless, it is anticipated that when agents
have acquired a higher level of negotiation capabilities and intelligence, this issue must be further addressed.
Acknowledgement. The authors gratefully acknowledge Fujitsu Telecommunications Europe Ltd. for funding this project, and Fujitsu Teamware Group Ltd. and Fujitsu Laboratories Japan for their software support. Special thanks to Professor Colin South, Dr. Dominic Greenwood, Dr. Keith Jones, Dr. Desmond Maguire and Dr. Patrick McParland for all their invaluable feedback.
References
1. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, Network Working Group, IETF, December 1998. (www.ietf.org)
2. E. C. Rosen, A. Viswanathan and R. Callon, "Multiprotocol Label Switching Architecture", Internet Draft, Network Working Group, IETF, July 2000. (www.ietf.org)
3. R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, Network Working Group, IETF, September 1997. (www.ietf.org)
4. Stardust.com Inc., "White Paper - QoS Protocols & Architectures", 8 July 1999. (www.qosforum.com/white-papers/qosprot_v3.pdf)
5. Microsoft Corporation, "Quality of Services and Windows", June 2000. (www.microsoft.com/hwdev/network/qos/#Papers)
6. Cisco IOS Quality of Service (QoS). (www.cisco.com/warp/public/732/Tech/qos/)
7. Fred Engel, Executive Vice President and CTO of Concord Communications, "Grasping the ASP means service level shakedown", Communications News, Aug 2000, pp. 19-20.
8. Jit Biswas et al., "IEEE P1520 Standards Initiative for Programmable Network Interfaces", IEEE Communications, Vol. 36, No. 10, October 1998, pp. 64-70.
9. Martin Chapman, Stefano Montesi, "Overall Concepts and Principles of TINA", TINA Baseline Document, TB_MDC.018_1.0_94, 17th Feb 1995. (www.tinac.com)
10. L.K. Lim, J. Gao, T.S. Eugene Ng, P. Chandra, "Customizable Virtual Private Network Service with QoS", Computer Networks Journal, Elsevier Science, Special Issue on "Overlay Networks", to appear in 2001.
11. N. Agoulmine, F. Nait-Abdesselam and A. Serhrouchni, "QoS Management of Multimedia Services Based On Active Agent Architecture", Special Issue: QoS Management in Wired & Wireless Multimedia Communication Networks, ICON, Baltzer Science Publishers, Vol 2/2-4, ISSN 1385 9501, Jan 2000.
12. M. Karsen, N. Beries, L. Wolf, and R. Steinmetz, "A Policy-Based Service Specification for Resource Reservation in Advance", Proceedings of the International Conference on Computer Communications (ICCC'99), September 1999, pp. 82-88.
13. O. Schelen and S. Pink, "Resource sharing in advance reservation agents", Journal of High Speed Networks: Special issue on Multimedia Networking, vol. 7, no. 3-4, pp. 213-228, 1998.
14. X. Wang and H. Schulzrinne, "RNAP: A Resource Negotiation and Pricing Protocol", Proc. International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV'99), New Jersey, Jun. 1999.
15. Dinesh Verma, "Supporting Service Level Agreements on IP Networks", Macmillan Technical Publishing.
16. David Chieng, Alan Marshall, Ivan Ho and Gerald Parr, "Agent-Enhanced Dynamic Service Level Agreement In Future Network Environments", IFIP/IEEE MMNS 2001, Chicago, 29 Oct - 1 Nov 2001.
17. S. Khan, K. F. Li and E. G. Manning, "The Utility Model for Adaptive Multimedia Systems", International Conference on Multimedia Modeling, Singapore, Nov 1997.
18. David Chieng, Alan Marshall, Ivan Ho and Gerald Parr, "A Mobile Agent Brokering Environment for The Future Open Network Marketplace", IS&N2000, Athens, 23-25 February 2000, pp. 3-15. (Springer-Verlag LNCS Series, Vol. 1774)
19. Foundation for Intelligent Physical Agents. (www.fipa.org)
20. White Paper, "Phoenix - The Enabling Technology Behind Pl@za", MC0000E, Teamware Group Oy, April 2001. (www.teamware.com)
21. http://java.sun.com/j2ee/tutorial/doc/Overview.html
22. R. Fielding et al., "Hypertext Transfer Protocol – HTTP/1.1", RFC 2068, Network Working Group, IETF, Jan 1997. (www.ietf.org)
23. http://pr.fujitsu.com/en/news/2000/02/15-3.html
Facing Fault Management as It Is, Aiming for What You Would Like It to Be Roy Sterritt University of Ulster School of Information and Software Engineering, Faculty of Informatics Jordanstown Campus, Northern Ireland
Abstract. Telecommunication systems are built with extensive redundancy and complexity to ensure robustness and quality of service. Such systems require complex fault identification and management tools. Fault identification and management are generally handled by reducing the number of alarm events (symptoms) presented to the operating engineer through monitoring, filtering and masking. The goal is to determine and present the actual underlying fault. Fault management is a complex task, subject to uncertainty in the symptoms presented. In this paper two key fault management approaches are considered: (i) rule discovery, to present fewer symptoms with greater diagnostic assistance within the more traditional rule-based system approach, and (ii) the induction of Bayesian Belief Networks (BBNs) for a complete 'intelligent' approach. The paper concludes that the research and development of the two target fault management systems can be complementary.
1 Introduction
It has been proposed that networks will soon be the keystone to all industries [1]. Effective network management is therefore increasingly important for profitability. Network downtime will not only result in the loss of revenue but may also lead to serious financial contractual penalties for the provider. As the world becomes increasingly reliant on computer networks, the complexity of such networks has grown in a number of dimensions [2]. The phenomenal growth of the Internet has shown a clear example of the extent to which the use of computer networks is becoming ubiquitous [3]. As users' demands and expectations become more varied and complex, so do the networks themselves. In particular, heterogeneity has become the rule rather than the exception [2]. Data of any form may travel under the control of different protocols through numerous physical devices manufactured and operated by large numbers of different vendors. Thus there is a general consensus that the trend is towards increasing complexity.
Such complexity lies in the accumulation of several factors: the increasing embedded functionality of network elements, the need for sophisticated services, and the heterogeneity challenges of customer networks [4]. This paper explores one aspect of network management in detail: fault identification. Section 2 looks at the complexity in networks, network management and fault management. Section 3 considers techniques for discovering rules for the existing rule-based systems used in fault management. Section 4 discusses other intelligent techniques that may offer solutions to the inherent problems associated with rule-based systems, and finally Section 5 concludes the paper. Throughout the sections, reference is made to a data set of fault management alarms that was gathered from an experiment on an SDH/Sonet network in Nortel Networks.
2 Uncertainty in Fault Management
Network management encompasses a large number of tasks, with various standards bodies specifying a formal organisation of these tasks. The International Standards Organization (ISO) divides network management into six areas as part of the Open Systems Interconnection (OSI) model: configuration management, fault management, performance management, security management, accounting management and directory management; these sit within a seven-layer hierarchical network structure. However, with the Internet revolution and the convergence of the Telcos and Data Communications, the trend is towards a flatter structure.
2.1 Faults and Fault Management
Essentially, network faults can be classified into hardware and software faults, which cause elements to produce outputs, which in turn cause overall failure effects in the network, such as congestion [5]. A single fault in a complex network can generate a cascade of events, potentially overloading a manager's console with information [6]. The fault management task can be characterised as detecting when network behavior deviates from normal and formulating a corrective course of action when required. Fault management can be decomposed into three tasks: fault identification, fault diagnosis and fault remediation [2]. A fourth, fault prediction, could be added as a desire or expectation of the fault management task, considered as a natural extension of fault identification [7].
2.2 An Experiment to Highlight Uncertainty in Fault Management
A simple experiment was performed to highlight some of the uncertainty that can be experienced within fault management, specifically looking at the physical network element layer, from which other management layer information is derived. The configuration was a basic network with two STM-1 network elements (NEs) and an element
manager. A simple test lasting just over 3 minutes and containing 16 commands that exercise the 2M single channel was run on the network 149 times. The network was dedicated to the test and no external configuration changes took place during the experiment. After each run the network was allowed a rest period to ensure no trailing events existed before being reset.
Fig. 1. Simple test repeated 149 times: uncertainty in the frequency of alarms raised (alarm frequency per test run).
The graph (Figure 1) displays the number of alarms raised, 24 being the lowest (run 124) and 31 being the highest (run 22). The average was 27 alarm events. This experiment on a small network highlights the variability of alarm data under fault conditions. The expectation that the same fault produces the same symptoms each time is unfortunately not true of this domain.
3 Facing Fault Management as It Is The previous experiment hints at the number of alarm events that may be raised under fault conditions. Modern networks produce thousands of alarms per day, making the task of real-time network surveillance and fault management difficult [8]. Due to the large volume of alarms, it is potentially possible to overlook or misinterpret them. Alarm correlation has become the main technique to reduce the number to consider.
3.1 Alarm Correlation Alarm correlation is a conceptual interpretation of multiple alarms, giving them a new meaning [8] and from that, potentially creating derived higher order alarms [9]. Jakob-
son and Weissman proposed correlation as a generic process of five types: compression, suppression, count, generalisation and Boolean patterns. Compression is the reduction of multiple occurrences of an alarm into a single alarm. Suppression is when a low-priority alarm is inhibited in the presence of a higher-priority alarm, generally referred to as masking. Count is the substitution of a specified number of occurrences of an alarm with a new alarm. Generalisation is when an alarm is referenced by its super-class. A Boolean pattern is the substitution of a set of alarms satisfying a Boolean pattern with a new alarm.
3.2 Alarm Monitoring, Filtering, and Masking
Within current fault management systems, as specified by ITU-T, alarm correlation is generally handled in three sequential transformations: alarm monitoring, alarm filtering and alarm masking. These mean that if the raw state of an alarm instance changes, an alarm event is not necessarily generated. Alarm monitoring takes the raw state of an alarm and produces a monitored state. Alarm monitoring is enabled/disabled for each alarm instance. If monitoring is enabled, then the monitored state is the same as the raw state; if disabled, then the monitored state is clear. Alarm filtering is also enabled/disabled for each alarm instance. An alarm may exist in any one of three states: present, intermittent or clear, depending on how long the alarm is raised. Assigning these states, by checking for the presence of an alarm within certain filtering periods, determines the alarm filtering. Alarm masking is designed to prevent unnecessary reporting of alarms. The masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the masking periods. A masking hierarchy determines the priority of each alarm type. Alarm masking is also enabled/disabled for each alarm instance. When an alarm changes state the network management system must be informed. The combination of alarm monitoring, filtering and masking makes alarm handling within the network elements relatively complex.
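The three sequential transformations just described can be summarised in a small sketch. This is illustrative Python, not the ITU-T specification or any equipment implementation; the filter periods, alarm names and masking hierarchy used here are assumptions.

```python
def monitored_state(raw, monitoring_enabled):
    """Alarm monitoring: pass the raw state through only if monitoring is enabled."""
    return raw if monitoring_enabled else "clear"

def filtered_state(raise_times, now, present_after=60, clear_after=300):
    """Alarm filtering (simplified): classify the alarm as present, intermittent or
    clear from how recently it has been raised within the (assumed) filter periods."""
    recent = [t for t in raise_times if now - t <= clear_after]
    if not recent:
        return "clear"
    return "present" if now - max(recent) <= present_after else "intermittent"

def masked(alarm, active_alarms, masking_hierarchy):
    """Alarm masking: inhibit the alarm if one of its superior alarms is active."""
    return any(sup in active_alarms for sup in masking_hierarchy.get(alarm, []))

# toy run: LP-PLM is masked while its (assumed) superior alarm is active
hierarchy = {"LP-PLM": ["AU-AIS"]}
print(monitored_state("present", monitoring_enabled=True))        # present
print(filtered_state([100, 380, 395], now=400))                    # present
print(masked("LP-PLM", active_alarms={"AU-AIS"},
             masking_hierarchy=hierarchy))                         # True
```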
3.3 Fault Management Systems
All three stages of fault management (identification, diagnosis and remediation) involve reasoning and decision making based on information about current and past states of the network [2]. Interestingly, much of the work in this area makes use of techniques from Artificial Intelligence (AI), especially expert systems and, increasingly, machine learning. The complexity of computer networks and the time-critical nature of management decisions make network management a domain that is difficult for humans [2]. It has been claimed that expert systems have achieved performance equivalent to that of human experts in a number of domains, including certain aspects of network management [10].
Most systems employing AI technologies for fault diagnosis are expert or production rule systems [9]. Some of these classic applications and the underlying theory are discussed in [13,14], covering cases such as ACE (Automated Cable Expertise) [15] and SMART (Switching Maintenance Analysis and Repair Tool) [16]. Details of several other artificial intelligence applications in telecommunications network management may also be found in [17-19]. Many of these systems have proved very successful, yet they have their limitations. Generally speaking, as Gürer outlines, rule-based expert systems:
– Cannot handle new and changing data, since rules are brittle and cannot cope when faced with unforeseen situations.
– Cannot adapt; that is, they cannot learn from past experiences.
– Do not scale well to large, dynamic, real-world domains.
– Require extensive maintenance.
– Are not good at handling probability or uncertainty.
– Have difficulty in analysing large amounts of uncorrelated, ambiguous and incomplete data.
These drawbacks support the use of different AI techniques that can overcome these difficulties, either alone or as an enhancement of expert systems [9]. There is a predicament, however, as there is doubt whether such techniques would be accepted as the engine within the fault management system.
3.4 Facing the Challenge
At the heart of alarm correlation is the determination of the cause. The alarms represent the symptoms and as such, in the global scheme, are not of general interest once the failure is determined [11]. There are two real-world concerns: (i) the sheer volume of alarm event traffic when a fault occurs; and (ii) identifying the cause from the symptoms. Alarm monitoring, filtering and masking, and their direct application in the form of rule-based systems, address concern (i), which is vital. They focus on reducing the volume of alarms but do not necessarily help with (ii), determining the actual cause—this is left to the operator to resolve from the reduced set of higher-priority alarms. Ideally, a technique that can tackle both these concerns would be best. AI offers that potential and has been, and still is, an active and worthy area of research to assist in fault management. Yet telecommunication manufacturers, understandably, have shown reluctance in incorporating AI techniques, in particular those that have an uncertainty element, directly into their critical systems. Rule-based systems have achieved acceptance largely because the decisions obtained are deterministic and can be traced and understood by domain experts. As a step towards automated fault identification, and with the domain challenges in mind, it is useful to use AI to derive rule discovery techniques that present fewer symptoms with greater diagnostic assistance. A potential flaw in data mining is that it is not user-centered. This may be alleviated by the visualisation of the data at all stages, to enable the user to gain trust in the process and hence have more confidence in the mined patterns. The transformation
from data to knowledge requires interpretation and evaluation, which also stands to benefit from multi-stage visualisation of the process [20-22], as human and computer discovery are interdependent. The aim of computer-aided human discovery is to reveal hidden knowledge, unexpected patterns and new rules from large datasets. Computer handling and visualisation techniques for vast amounts of data make use of the remarkable perceptual abilities that humans possess, such as the capacity to recognise images quickly, and to detect the subtlest changes in size, colour, shape, movement or texture—thus, potentially, discovering new event correlations in the data. Data mining (discovery algorithms) may reveal hidden patterns and new rules, yet these require human interpretation to transform them into knowledge. The human element attaches a more meaningful insight to decisions, allowing the discovered correlations to be coded as useful rules for fault identification and management.
3.5 Three-Tier Rule Discovery Process
Computer-assisted human discovery and human-assisted computer discovery can be combined in a three-tier process to provide a mechanism for the discovery and learning of rules for fault management [23]. The tiers are:
- Tier 1 – Visualisation Correlation
- Tier 2 – Knowledge Acquisition or Rule-Based Correlation
- Tier 3 – Knowledge Discovery (Data Mining) Correlation
The top tier (visualisation correlation) allows the visualisation of the data in several forms. Visualisation has a significant role throughout the knowledge discovery process, from data cleaning to mining. This allows analysis of the data with the aim of identifying other alarm correlations (knowledge capture). The second tier (knowledge acquisition or rule-based correlation) aims to define correlations and rules using more traditional knowledge acquisition techniques—utilising documentation and experts. The third tier (knowledge discovery correlation) mines the telecommunications management network data to produce more complex correlation candidates. The application of the 3-tier process is iterative and flexible. The visualisation tier may require the knowledge acquisition tier to confirm its findings. Likewise, visualisation of the knowledge discovery process could facilitate understanding of the patterns discovered. The next section uses the experimental data to demonstrate how the application of the 3-tier process may be of use in discovering new rules.
3.6 Analysing the Experimental Data: Rule-Based Systems
The experiment demonstrated how a relatively large volume of fault management data can be produced on a simple network and how the occurrence of alarm events under fault conditions is uncertain. Since the data is only concerned with the same simulated fault/test being repeated 149 times, it does not provide the right context for any mined
results to be interpreted as typical network behavior, yet it is sufficient to illustrate the analysis process. Mining the data can identify rules relating alarms that occur together, that is, potential correlations such as: IF PPI-AIS on Gate_02 THEN INT-TU-AIS on Gate_02. Through traditional knowledge acquisition the majority of these may be explained as existing knowledge—for instance with reference to the alarm masking hierarchy from ITU-T. Visualisation may help explain the data that led to the mined discoveries or allow for human discoveries themselves, as illustrated in Figure 2. The alarm life spans are displayed as horizontal Gantt bars. In this view, the alarms are listed for each of the two network elements, so alarms occurring on the same vertical path (time) are potential correlations. The two highlighted alarms are PPI-Unexpl_Signal and LP-PLM. This matches a discovery found in a different data set [23].
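The kind of rule mining referred to here can be illustrated in a few lines. The alarm sets below are invented toy data and the confidence threshold is arbitrary; the sketch only shows how candidate IF-THEN correlations can be extracted from repeated runs.

```python
from collections import Counter
from itertools import combinations

# each test run is represented by the set of alarm types raised during that run
runs = [
    {"PPI-AIS", "INT-TU-AIS", "PPI-Unexpl_Signal", "LP-PLM"},
    {"PPI-AIS", "INT-TU-AIS", "LP-RDI"},
    {"PPI-AIS", "INT-TU-AIS", "PPI-Unexpl_Signal", "LP-PLM"},
]

single_counts, pair_counts = Counter(), Counter()
for run in runs:
    single_counts.update(run)
    pair_counts.update(combinations(sorted(run), 2))

# report frequently co-occurring pairs as correlation-rule candidates
for (a, b), n in pair_counts.items():
    confidence = n / single_counts[a]
    if confidence >= 0.9:
        print(f"IF {a} THEN {b}  (support {n}/{len(runs)}, confidence {confidence:.2f})")
```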
Fig. 2. Network Element view - alarms raised on each element. Highlighted is a possible correlation in time (PPI-Unexpl_Signal and LP-PLM).
On investigating the standards specifications it is found that a PPI-Unexp-Signal has no impacts or consequent actions. LP-PLM affects traffic and can inject an AIS and LP-RDI alarm depending on configuration (consequent actions for LP-PLM can be enabled/disabled, the default being disabled). Thus there is no explicit connection defined for these two alarms. Visualising only these alarms in the alarm view (Figure 3) would tend to confirm this correlation. Each time PPI-Unexp_Signal is raised LP-PLM becomes active on
the other connected network element. This occurs both ways; that is, it is not dependent on which NE PPI-Unexp_Signal is raised. Since this discovered correlation is an unexpected pattern, it is of interest and can be coded as a rule for the fault management system or another diagnostic tool. This may be automated for the target rule system using, for example, ILOG rules. The rule would state that when PPI-Unexpl_Signal and LP-PLM occur together on the same port but with different connected multiplexers, then correlate these alarms and retract them while raising a derived alarm that specifies the correlation (a sketch of such a rule is given below). This derived alarm would be used to trigger diagnostic assistance or be correlated with further alarms to define the fault. This example shows that tools can be developed to semi-automate the rule development process. The rule in this example still has all the problems connected with handling uncertainty, however, since it would be used within existing rule-based fault management systems. The next section considers how the approach can be improved using uncertainty AI techniques in the discovery process.
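A Python rendering of the rule just described is sketched below; the prototype would express this as an ILOG rule, and the alarm record fields and derived alarm name used here are assumptions.

```python
def correlate_unexpl_signal(active_alarms):
    """If PPI-Unexpl_Signal and LP-PLM are active on the same port of two different,
    connected NEs, retract both and raise a derived alarm describing the correlation."""
    derived, remaining = [], list(active_alarms)
    for a in active_alarms:
        if a["type"] != "PPI-Unexpl_Signal":
            continue
        for b in active_alarms:
            if (b["type"] == "LP-PLM" and b["port"] == a["port"]
                    and b["ne"] != a["ne"]):
                derived.append({"type": "DERIVED-UnexplSignal/LP-PLM",
                                "port": a["port"], "nes": [a["ne"], b["ne"]]})
                remaining = [x for x in remaining if x not in (a, b)]
    return derived, remaining

alarms = [{"type": "PPI-Unexpl_Signal", "ne": "NE1", "port": 3},
          {"type": "LP-PLM", "ne": "NE2", "port": 3},
          {"type": "PPI-AIS", "ne": "NE1", "port": 1}]
print(correlate_unexpl_signal(alarms))
```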
Fig. 3. Alarm view - each time PPI-Unexp_Signal becomes active on one NE, LP-PLM becomes active on the other. This occurs on both NEs.
4 Aiming for What You Would Like Fault Management to Be Fault management is a complex task, subject to uncertainty in the ‘symptoms’ presented. It is a good candidate for treatment by an AI methodology that handles uncertainty, such as soft computing or computational intelligence [24].
Correlation serves to reduce the number of alarms presented to the operator, and an intelligent fault management system might additionally facilitate fault prediction. The role of the fault management system may be described as:
- Fault identification/diagnosis—prediction of the fault(s) that have occurred from the alarms present,
- Behaviour prediction—warning the operator beforehand of severe faults from the alarms that are presenting themselves, and
- Estimation—of a fault's likely life span.
4.1 Intelligence Research
The technology most commonly used to add significant levels of automation to network management platforms is rule-based expert systems. Yet the inherent disadvantages of such systems, discussed previously, limit how much they can be used, encouraging many researchers to seek new approaches. Problems associated with acquiring knowledge from experts and building it into a system before it is out of date (the knowledge acquisition bottleneck), together with the high manual maintenance burden for rules, have led to research into machine learning techniques. Ways of handling partial truths, evidence, causality and uncertainty were sought from statistics. Likewise, the brittleness and rigidity of the rules and their inability to learn from experience have led to research into self-adaptive techniques, such as case-based reasoning. Increasingly, these and other AI techniques are being investigated for all aspects of network management. Machine learning has been used to detect chronic transmission faults [25] and dispatch technicians to fix problems in local loops [26, 27]. Neural networks have been used to predict the overall health of the network [28] and monitor trunk occupancy [29]. Decision trees have also been used for rule discovery [30, 31] as well as data mining [32, 33], and the most recent trend is the use of agents [34, 35]. Although these techniques may address some of the problems of rule-based systems, they have disadvantages of their own. Thus an increasing trend in recent years has been to utilise hybrid systems to maximise the strengths and minimise the weaknesses of individual techniques.
4.2 Intelligent Techniques for Fault Identification
Neural networks are a key technique in both computational intelligence and soft computing, with a proven predictive performance. They have been proposed, along with case-based reasoning, as a hybrid system for a complete fault management process [9], as well as to identify faults in switching systems [38], to manage ATM networks [36] and for alarm correlation in cellular phone networks [39]. Yet they do not meet one important goal—comprehensibility [38]. This lack of explanation leads to some reluctance to use neural networks in fault management systems [32]. Kohonen self-organising maps [41] and Bayesian belief networks [42] have been offered as alternatives to such a black-box approach.
4.3 Bayesian Belief Networks for Fault Identification
The graphical structure of Bayesian Belief Networks (BBNs) more than meets the need for 'readability'. BBNs consist of a set of propositional variables represented by nodes in a directed acyclic graph. Each variable can assume an arbitrary number of mutually exclusive and exhaustive values. Directed arcs (arrows) between nodes represent the probabilistic relationships between them. The absence of a link between two variables indicates independence between them, given that the values of their parents are known. In addition to the network topology, the prior probability of each state of a root node is required. It is also necessary, in the case of non-root nodes, to know the conditional probabilities of each possible value given the states of the parent nodes or direct causes. The power of the BBN is evident whenever a change is made to one of the marginal probabilities. The effects of the observation are propagated throughout the network and the other probabilities updated. The BBN can be used for deduction in the fault management domain. For given alarm data, it will determine the most probable cause(s) of the supplied alarms, thus enabling the process to act as an expert system that handles uncertainty. For a discussion on the construction of BBNs for fault management see [43]. The next section uses the network experimental data to induce a simple BBN for fault management.
4.4 Analysing the Experimental Data with Bayesian Belief Networks
The experiment may be considered as inducing 149 instances of the same simulated fault. In order to develop a BBN for the experiment, each instance of the simulated fault (all 149 sets of data) was assigned a row in a contingency table containing its frequencies of alarm occurrences. This is performed at a much higher level of abstraction than usual. The normal procedure would be to assign a time window, of possibly as little as 1 second, and calculate the frequencies of occurrence for combinations of alarms that are present throughout the data set. It is expected that there will be fewer edges in the graph, but as a whole it should still reflect the significant alarms and the relationships between them for the simulated fault in this experiment.
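The propagation of evidence can be illustrated with a minimal two-node fragment (a fault node with an alarm as its child) and Bayes' rule. The prior and conditional probabilities below are invented for illustration; a full BBN would use a proper propagation algorithm over the induced structure.

```python
# Invented priors over the fault node and P(PPI-AIS raised | fault state)
p_fault = {"faulty_tributary_unit": 0.05, "unstructured_signal": 0.15, "none": 0.80}
p_alarm_given_fault = {"faulty_tributary_unit": 0.90,
                       "unstructured_signal": 0.95,
                       "none": 0.02}

def posterior(prior, likelihood):
    """Propagate the evidence 'alarm observed' back to the fault node (Bayes' rule)."""
    joint = {f: prior[f] * likelihood[f] for f in prior}
    total = sum(joint.values())
    return {f: p / total for f, p in joint.items()}

for fault, prob in posterior(p_fault, p_alarm_given_fault).items():
    print(f"P({fault} | PPI-AIS observed) = {prob:.2f}")
```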
Fig. 4. Induced results from the 149 simulated fault experiment.
The PowerConstructor package [44] incorporates an efficient 3-stage algorithm using a version of the mutual information approach [45]. The BBN in Figure 4 was induced for the simulated fault experiment. There was not enough data for the algorithm to distinguish the direction of the edges, although this is something an expert could provide. Only the structure, and not the values, is depicted, both for simplicity and because the small data set may explain why the figures are not very meaningful. The specifications state that if the signal is configured unstructured (that is, it does not conform to ITU-T recommendation G.732) an AIS alarm can be considered a valid part of the signal. In this case PPI-AIS's strong presence can explain the unstructured signal. PPI-AIS also has the consequence of injecting an AIS towards the tributary unit. If the alarms are separated into two data sets and BBNs induced separately, the relationships shown in Figure 5 begin to develop. The relationship between PPI_Unexp_Signal and LP_PLM (discovered earlier by visual inspection of Fig. 2) is confirmed by the induced BBNs. The variables (nodes) in a BBN can represent faults as well as alarms, which supports the aim of automated fault diagnosis and not just alarm correlation. Fault nodes are added to this example in Figure 6 and Figure 7. In the first case the fault node has three possible values: faulty tributary unit, faulty payload manager or unstructured signal. Figure 7 includes a fault node that has two levels: faulty tributary unit and cable misconnection.
Fig. 5. Inducing from split data set due to the knowledge that the signal was unstructured.
Fig. 6. Fault node added to AIS relationship.
Once the BBN is part of an expert system, the occurrence of these alarms will cause propagation of this ‘evidence’ through the network providing probability updates and predictions of fault identification. The example induced is somewhat simple but illustrates the potential of BBNs in fault management. It is important to note that even with the ability to induce (machine learn or data mine) the BBN from data it took human access to expert knowledge to find a more accurate solution. The success of the approach is also dependent on the quality and quantity of the data. To develop a fully comprehensive belief network that covers the majority of possible faults on a network would be a massive undertaking yet the benefits over rule-based systems suggest that this may be a very worthwhile task.
Fig. 7. Fault node added to Unexplained-signal relationship.
5 Summary and Conclusion
The paper first illustrated the complexity and uncertainty in fault management by showing the variability of alarms raised under the same fault conditions. It then described the standard approach to fault identification using alarm correlation, via monitoring, filtering and masking, implemented as a rule-based system. This highlighted problems with rule-based systems, and a three-tier rule discovery process was discussed and demonstrated to assist in alleviating some of these problems. Other AI techniques and methodologies that handle uncertainty, in particular the main techniques in computational intelligence and soft computing, were discussed. Belief networks were proposed and demonstrated as a technique to support automated fault identification. In each section the approaches were illustrated with data from an experimental simulation of faults on an SDH/Sonet network in Nortel Networks.
Acknowledgements. The author is greatly indebted to the Industrial Research and Technology Unit (IRTU) (Start 187–The Jigsaw Programme, 1999-2002) for funding this work jointly with Nortel Networks. The paper has benefited from discussions with other members of the Jigsaw team and with collaborators at Nortel.
References
1. M. Cheikhrouhou, P. Conti, R. Oliveira, J. Labetoulle. Intelligent agents in network management: A state-of-the-art. Networking and Information Systems J., 1, pp9-38, Jun 1998.
2. T. Oates. Fault identification in computer networks: A review and a new approach. T.R. 95113, University of Massachusetts at Amherst, Computer Science Department, 1995.
3. C. Bournellis. Internet '95. Internet World, 6(11) pp47-52, 1995.
4. M. Cheikhrouhou, P. Conti, J. Labetoulle, K. Marcus, Intelligent Agents for Network Management: Fault Detection Experiment. In Sixth IFIP/IEEE International Symposium on Integrated Network Management, Boston, USA, May 1999.
5. Z. Wang, Model of network faults, In B. Meandzija, J. Westcott (Eds.), Integrated Network Management I., North Holland, Elsevier Science Pub. B.V., 1989.
6. T. Oates. Automatically Acquiring Rules for Event Correlation From Event Logs, Technical Report 97-14, University of Massachusetts at Amherst, Computer Science Dept, 1997.
7. R. Sterritt, A.H. Marshall, C.M. Shapcott, S.I. Mcclean, Exploring Dynamic Bayesian Belief Networks For Intelligent Fault Management Systems, IEEE Int. Conf. Systems, Man and Cybernetics, V, pp3646-3652, Sept. 2000.
8. G. Jakobson and M. Weissman. Alarm correlation. IEEE Network, 7(6):52–59, Nov. 1993.
9. D. Gürer, I. Khan, R. Ogier, R. Keffer, An Artificial Intelligence Approach to Network Fault Management, SRI International, Menlo Park, California, USA.
10. R. N. Cronk, P. H. Callan, L. Bernstein, Rule based expert systems for network management and operations: An introduction. IEEE Network, 2(5) pp7-23, 1988.
11. K. Harrison, A Novel Approach to Event Correlation, HP, Intelligent Networked Computing Lab, HP Labs, Bristol. HP-94-68, July, pp. 1-10, 1994.
12. I. Bratko, S. Muggleton, Applications of Inductive Logic Programming, Communications of the ACM, Vol. 38, no. 11, pp. 65-70, 1995.
13. J. Liebowitz, (ed.) Expert System Applications to Telecommunications, John Wiley and Sons, New York, NY, USA, 1988.
14. B. Meandzija, J. Westcott, (eds.) Integrated Network Management I, North Holland/IFIP, Elsevier Science Publishers B.V., Netherlands, 1989.
15. J.R. Wright, J.E. Zielinski, E. M. Horton. Expert Systems Development: The ACE System, In [13], pp45-72, 1988.
16. Gary M. Slawsky and D. J. Sassa. Expert Systems for Network Management and Control in Telecommunications at Bellcore, In [13], pp191-199, 1988.
17. Shri K. Goyal and Ralph W. Worrest. Expert System Applications to Network Management, In [13], pp 3-44, 1988.
18. C. Joseph, J. Kindrick, K. Muralidhar, C. So, T. Toth-Fejel, MAP fault management expert system, In [14], pp 627-636, 1989.
19. T. Yamahira, Y. Kiriha, S. Sakata, Unified fault management scheme for network troubleshooting expert system. In [14], pp 637-646, 1989.
20. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press: California, pp1-34, 1996.
21. R.J. Brachman, T. Anand, The Process of Knowledge Discovery in Databases: A Human-Centered Approach, Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press: California, pp37-57, 1996.
22. R. Uthurusamy, From Data Mining to Knowledge Discovery: Current Challenges and Future Directions, Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press: California, pp 561-569, 1996.
23. R. Sterritt, Discovering Rules for Fault Management, Proceedings of IEEE International Conference on the Engineering of Computer Based Systems (ECBS), Washington DC, USA, April 17-20, pp190-196, 2001.
24. R. Sterritt, Fault Management and Soft Computing, Proceedings of the International Symposium Soft Computing and Intelligent Systems for Industry, Paisley, Scotland, UK, June 26-29, 2001.
25. R. Sasisekharan, V. Seshadri, and S. M. Weiss. Proactive network maintenance using machine learning. In Proceedings of the 1994 Workshop on Knowledge Discovery in Databases, pp 453-462, 1994.
26. A. Danyluk, F. Provost. Small disjuncts in action: Learning to diagnose errors in the telephone network local loop. In Proceedings of the Tenth International Conference on Machine Learning, 1993.
27. Foster Provost, Andrea Danyluk, A Study of Complications in Real-world Machine Learning, TR NYNEX, 1996.
28. German Goldszmidt and Yechiam Yemini. Evaluating management decisions via delegation. In H. G. Hegering and Y. Yemini, editors, Integrated Network Management, III, pp247-257. Elsevier Science Publishers B.V., 1993.
29. Rodney M. Goodman, Barry Ambrose, Hayes Latin, and Sandee Finnell. A hybrid expert system/neural network traffic advice system. In H. G. Hegering and Y. Yemini, editors, Integrated Network Management, III, pp 607-616. Elsevier Science Publishers B.V., 1993.
30. Rodney M. Goodman and H. Latin. Automated knowledge acquisition from network management databases. In I. Krishnan and W. Zimmer, editors, Integrated Network Management, II, pp 541-549. Elsevier Science Publishers B.V., 1991.
31. Shri K. Goyal. Knowledge technologies for evolving networks. In I. Krishnan and W. Zimmer, editors, Integrated Network Management, II, pp 439-461. Elsevier Science Publishers B.V., 1991.
32. K. Hatonen, M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, Knowledge Discovery from Telecommunication Network Alarm Databases, Proc. 12th Int. Conf. on Data Engineering (ICDE'96), pp.115-122, 1996.
33. Oates, T., Jensen, D., and Cohen, P. R. (1998). Discovering rules for clustering and predicting asynchronous events. In Danyluk, A. Predicting the future: AI approaches to time-series problems. Technical Report WS-98-07, AAAI Press, pp73-79, 1998.
34. M. Cheikhrouhou, P. Conti, R. Oliveira, and J. Labetoulle. Intelligent agents in network management: A state-of-the-art. Networking and Information Systems Journal, 1 pp 9-38, Jun 1998.
35. M. Cheikhrouhou, P. Conti, J. Labetoulle, K. Marcus, Intelligent Agents for Network Management: Fault Detection Experiment. In Sixth IFIP/IEEE International Symposium on Integrated Network Management (IM'99), Boston, USA, May 1999.
Facing Fault Management as It Is, Aiming for What You Would Like It to Be
45
36. Y.A. Sekercioglu, A. Pitsillides, A. Vasilakos, Computational Intelligence in Management of ATM Networks: A survey of Current Research, Proc. ERUDIT Workshop on Application of Computational Intelligence Techniques in Telecommunication, London, 1999. 37. B. Azvine, N.Azarmi, K.C. Tsui, Soft computing - a tool for building intelligent systems, BT Technology Journal, vol.14, no. 4 pp37-45, Oct. 1996. 38. T. Clarkson, Applications of Neural Networks in Telecommunications, Proc. ERUDIT Workshop on Application of Computational Intelligence Techniques in Telecommunication, London, UK, 1999. 39. H. Wietgrefe, K. Tochs, and et al. Using neural networks for alarm correlation in cellular phone networks. In the International Workshop on Applications of Neural Networks in Telecommunications, 1997. 40. Dorffner, Report for NEuroNet, http://www.kcl.ac.uk/neuronet, 1999. 41. R.D. Gardner and David A. Harle, Alarm Correrlation and Network Fault Resolution using the Kohonen Self-Organising Map, Globecom-97, 1997. 42. R. Sterritt, K. Adamson, M. Shapcott, D. Bell, F. McErlean, Using A.I. For The Analysis of Complex Systems, Proc. Int. Conf. Artificial Intelligence and Soft Computing, pp113-116, 1997. 43. R. Sterritt, W. Liu, Constructing Bayesian Belief Networks for Fault Management in Telest communications Systems, 1 EUNITE Workshop on Computational Intelligence in Telecommunications and Multimedia at EUNITE 2001, pp 149-154, Dec. 2001. 44. J. Cheng, D.A. Bell, W. Liu, An algorithm for Bayesian network construction from data. Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics (AI&STAT’97), 1997. 45. C. J. K. Chow, C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Information Theory, Vol. 14(3), pp.462-467, 1968.
Enabling Multimedia QoS Control with Black-Box Modelling
Gianluca Bontempi and Gauthier Lafruit
IMEC/DESICS/MICS, Kapeldreef 75, B-3001 Heverlee, Belgium
{Gianluca.Bontempi, lafruit}@imec.be
http://www.imec.be/mics
Abstract. Quality of Service (QoS) methods aim at trading quality against resource requirements to meet the constraints dictated by the application functionality and the execution platform. QoS is relevant in multimedia tasks since these applications are typically scalable systems. To exploit the scalability property for improving quality, a reliable model of the relation between scalable parameters and quality/resources is required. The traditional QoS approach requires a deep knowledge of the execution platform and a reasonably accurate prediction of the expected configurations. This paper proposes an alternative black-box data analysis approach. The advantage is that it requires no a priori assumptions about the correlation between quality/resources and parameters, and it can easily adapt to situations of high complexity, changing platforms and heterogeneous environments. Some preliminary experiments with the QoS modelling of the Visual Texture Coding (VTC) functionality of an MPEG-4 decoder using a local learning technique are presented to support the claim.
1 Introduction

Multimedia applications are dynamic, in the sense that they can dynamically change operational requirements (e.g., workload, memory, processing power), operational environments (e.g., mobile, hybrid), hardware support (e.g., terminals, platforms, networks) and functionality (e.g., compression algorithms). It would be highly valuable if such systems could adapt to all changes of configuration and still provide dependable service to the user. This means that the application should be able to monitor and evaluate its own behaviour online and consequently adjust it in order to meet the agreed-upon goals [24]. This is feasible for scalable systems, where the required resources and the resulting functionality can be controlled and adapted by a number of parameters. A well-known example of a multimedia standard that supports scalability is MPEG-4 [17], where processing and power requirements can be tuned in order to circumvent run-time overloads at a minimum agreed quality. The adaptation of scalable applications is a topic that is typically addressed by the Quality of Service (QoS) discipline [27]. QoS traditionally uses expert knowledge and/or domain-specific information to cope with time-varying working conditions [19,
21]. Consequently, this approach requires a careful analysis of the functionality, a deep knowledge of the execution platform and a reasonably accurate prediction of the expected configurations. This paper aims to extend the traditional QoS approach with the support of methods, techniques and tools coming from the world of intelligent data analysis [6, 28]. We intend to show that these techniques are promising in a multimedia context since they require few a priori assumptions about the system and the operating conditions, thus guaranteeing broad applicability and improved robustness. The work proposes a black-box statistical approach which, based on a set of observations, addresses two main issues. First, defining and identifying which features, within the huge set of parameters characterising a multimedia application, influence the operational requirements (e.g., execution time, memory accesses, processing power) and consequently the quality perceived by the user (e.g., responsiveness, timeliness, signal-to-noise ratio). Secondly, building a reliable predictive model of the resource requirements, taking as input the set of features selected in the previous step. To this aim, we adopt state-of-the-art linear and non-linear data mining techniques [14]. The resulting predictive model is expected to be an enabling factor for an automated QoS-aware system that should guarantee the required quality by tuning the scalable parameters of the applications [21]. Although the paper limits its experimental contribution to a modelling task, a QoS architecture integrating the proposed black-box procedure with adaptive control will be introduced and discussed. As a case study, we consider the Visual Texture Coding (VTC) algorithm [20], a wavelet-based image compression algorithm for the MPEG-4 standard. For a fixed platform, we show that it is possible to predict the execution time and the number of memory accesses of the VTC MPEG-4 decoder with reasonable accuracy, once the values of the relevant scalable parameters (e.g., the number of wavelet levels) are known. The mapping between resource requirements and scalability parameters is approximated by using a data mining approach [14]. In particular, we compare and assess linear approaches and non-linear machine learning approaches on the basis of a finite amount of data. The measurements come from a set of 21 test images, encoded and decoded by the MoMuSys (Mobile Multimedia Systems) reference code. The experimental results will show that a non-linear approach, namely the lazy learning approach [10], is able to outperform the conventional linear method. This outcome shows that modelling the resource requirements of a multimedia algorithm is a difficult task that can nevertheless be tackled successfully with the support of black-box approaches. Note that, although the paper is limited to modelling the resource requirements, we deem that the proposed methodology is generic enough to be applicable to the modelling of quality attributes. The remainder of the paper is structured as follows. Section 2 presents the quality of service problem in the framework of automatic control problems. Section 3 introduces a statistical data analysis procedure to model the mapping between the scalable parameters and the resource workload. Section 4 analyses how the VTC resource modelling problem can be tackled according to the data-driven procedure described in Section 3. The experimental setting and the results are reported in Section 5.
Conclusions and future work are discussed in Section 6.
2 A Control Interpretation of the QoS Problem This section introduces a QoS problem as a standard problem of adaptive control, where a system (e.g., the multimedia application) has to learn how to adapt to an unknown environment (e.g., the new platform, configuration or running mode) and reach a specific goal (e.g., the quality requirements). Like any adaptive control problem, a QoS problem can be decomposed in two subproblems, a modelling problem and a regulation problem, respectively. The following sections will discuss these two steps in detail. 2.1 Modelling for Quality of Service The aim of a QoS procedure is to set properly the scalable parameters in order to meet the quality requirements and satisfy the resource constraints. This requires an accurate description of these two relations: 1. The relation linking the scalable parameters characterising the application (e.g., the number of wavelets level in a MPEG-4 still texture codec) to the required resources (e.g., power, execution time, memory). 2. The relation linking the same scalable parameters to the perceived quality (e.g., the signal to noise ratio). It is important to remark how these relations are unknown a priori and that a modelling effort is required to characterise them. Two conventional methods to approach a modelling problem in the literature are: − A white-box approach, which starting from an expert-based knowledge aims at defining how parameters of the algorithm are related to the resource usage and the perceived quality [19, 20]. − A black-box statistical approach, which, based on a sufficient amount of observations, aims at discovering which scalable parameters are effective predictors of the performance of the applications [11]. Note that this approach requires less a priori knowledge and allows a continuous refinement and adaptation based on the upcoming information. Moreover, a quantitative assessment of the predictive capability can be returned together with the model. This paper will focus on the second approach and will apply it to the modelling of the relation between scalable parameters and resource requirements. To this aim, we will suggest the adoption of supervised learning methods. Supervised learning [13] addresses the problem of modelling the relation between a set of input variables, and one or more output variables, which are considered somewhat dependent on the inputs (Fig. 1), on the basis of a finite set of input/output observations. The estimated model is essentially a predictor, which, once fed with a particular value of the input variables, returns a prediction of the value of the output. The goal is to obtain a reliable generalisation, i.e. that the predictor, calibrated on the basis of a finite set of observed measures, is able to return an accurate prediction of the dependent variable when a previously unseen value of the independent vector is presented. In other words, this technique aims to discover and to assess, on the basis of observations only, potential correlations between sets of variables and use these correlations to extrapolate to new scenarios.
Fig. 1. Data-driven modelling of an input/output phenomenon.
The following section will show the relevance of black-box input/output models in the context of a QoS policy.

2.2 A QoS-Aware Adaptive Control Architecture

The discipline of automatic control offers techniques for developing controllers that adjust the inputs of a given system in order to drive the outputs to some specified reference values [15]. Typical applications include robotics, flight control and industrial plants, where the process to be controlled is represented as an input/output system. Similarly, once we model a multimedia algorithm as an input/output process, it is possible to extend control strategies to multimedia quality of service issues. In this context, the goal is to steer properly the inputs of the algorithm (e.g., the scalable parameters) to drive the resource load and the quality to the desired values. In formal terms we can represent the QoS control problem by the following notation. Let us suppose that at time $t$ we have $A$ concurrent scalable applications and $R$ constrained resources. Each resource has a finite capacity $r_{\max_i}(t)$, $i=1,\dots,R$, that can be shared, either temporally or spatially. For example, CPU and network bandwidth would be time-shared resources, while memory would be a spatially shared resource. For each application $a=1,\dots,A$ we assume the existence of the following relations between the scalable parameters $s_j^a$, $j=1,\dots,m$, and the resource usage $r_i^a$, and between $s_j^a$ and the quality metrics $q_k^a$, $k=1,\dots,K$:

$$r_i^a(t) = f(s_1^a(t), s_2^a(t), \dots, s_m^a(t), E(t)), \qquad q_k^a(t) = g(s_1^a(t), s_2^a(t), \dots, s_m^a(t), E(t)) \qquad (1)$$

where $s_j^a$ denotes the $j$th scalable parameter of the $a$th application, $K$ is the number of quality metrics for each application and $E(t)$ accounts for the remaining relevant factors, like the architecture, the environmental configuration, and so on. We assume that the quantities $q_k^a$ are strictly positive and proportional to the quality perceived by the user. At each time instant the following constraints should be met:
$$\sum_{a=1}^{A} r_i^a(t) = r_i(t) \le r_{\max_i}(t), \quad i = 1,\dots,R; \qquad q_k^a(t) \ge Q_k^a, \quad k = 1,\dots,K \qquad (2)$$

where $r_i(t)$ denotes the amount of the $i$th resource which is occupied at time $t$ by the $A$ applications and $Q_k^a$ is the lower bound threshold for the $k$th quality attribute. The goal of the QoS control policy can be quantified in the following terms: at each time $t$ maximise the quantity

$$\sum_{a=1}^{A} \sum_{k=1}^{K} w_k^a q_k^a(t) \qquad (3)$$

while respecting the constraints in (2). Note that the terms $w_k^a$ in (3) denote the weighted contributions of the different quality metrics in the different applications to the overall quality perceived by the user. In order to achieve this goal a QoS control policy should implement a control law on the scalable parameters, like
$$s_1^a(t+1) = u_1^a(s_1^a(t),\dots,s_m^a(t),\, r_1(t),\dots,r_R(t),\, q_1^a(t),\dots,q_K^a(t),\, E(t)) \qquad (4)$$
$$\dots$$
$$s_m^a(t+1) = u_m^a(s_1^a(t),\dots,s_m^a(t),\, r_1(t),\dots,r_R(t),\, q_1^a(t),\dots,q_K^a(t),\, E(t))$$

This control problem is a formidable one if we assume generic non-linear relations in (1). To make its resolution more affordable, a common approach is to decompose the control architecture into two levels [21]:
1. A resource-management level that sets the target quality $\bar{q}_k^a$ and target resource $\bar{r}_i^a$ for each application in order to maximise the global cost function (3).
2. An application-level controller which acts on the scalable parameters of each
that in front of the current configuration tunes the scalable parameters in order to adjust the resource usage and/or the perceived quality. input
Phenomenon
output
prediction error
Observations
prediction
Model
Adaptive Controller
target output
Fig. 2. Adaptive control system. The adaptive controller exploits the information returned by the learned model in order to drive the output of the I/O phenomenon to the target values.
A large number of methods, techniques and results are available in the automatic control discipline to deal with adaptive control problems [2]. An example of an adaptive control approach for QoS in telecommunications (congestion control) is proposed in [25]. The experimental part of the paper will focus only on the learning module of the QoS adaptive architecture depicted in Fig.2. In particular we will propose a dataanalysis procedure to learn the relation f in (1) based on a limited number of observations.
3 A Data Analysis Procedure for QoS Modelling

In Section 2.1, we defined a QoS model as an input/output relation. Here, we propose a black-box procedure to model the relation between scalable parameters (inputs) and resource requirements (outputs). There are two main issues in a black-box approach: (i) the selection of the subset of scalable parameters to which the resources are sensitive and (ii) the estimation of the relation f between scalable parameters and resources. In order to address them, we propose a procedure composed of the following steps (a small illustration of the linear fit in step 5 is sketched after the list):
1. Selection of benchmarks. The first step of the procedure aims to select a representative family of benchmarks in order to collect significant measurements. For example, in the case of video multimedia applications this family should cover a large spectrum of streaming formats and contents if we require a high degree of generalisation from the QoS model [28].
2. Definition of the target quantities. The designer must choose the most critical quantities (e.g., resources and/or quality metrics in (1)) to be predicted. In this paper we will focus only on the modelling of the resource requirements for a single application (A=1 in Equation (1)). Then, we denote with r the vector of target quantities.
3. Definition of the input variables. In multimedia applications the number of parameters characterising the functionality is typically very high, making the procedure extremely complex. We propose a feature selection approach [18] to tackle this problem. This means that we start with a large set of parameters that might reasonably be correlated with the targets and we select among them the ones that are statistically relevant for obtaining sufficient accuracy. We define with s the selected set of features. Note that in the literature these parameters are also called "knobs" [22] for their capacity of steering the target to the desired values.
4. Data collection. For each sample benchmark, we measure the values of the target quantities obtained by sweeping the input parameters over some predefined ranges. The full set of samples is stored in a dataset D of size N.
5. Modelling of the input/output relation on the basis of the data collected in step 4. The dataset D is used to estimate the input/output relation r = f(s). Note that for simplicity we assume here that the resource requirements depend only on the scalable parameters, that is, the vector E in (1) is empty. We propose the use and comparison of linear and non-linear models. Linear models assume the existence of a linear relation $r = \sum_{j=1}^{m} d_j s_j + d_0$ between the input s and the output r. Non-linear models r = f(s) make less strong assumptions about the analytical form of the relationship. The role of the data analysis technique is indeed to select, based on the available observations, the form of f which returns the best approximation of the unknown relation [6].
6. Validation. Once the model is estimated, it is mandatory to evaluate how the performance prediction deteriorates when the scenario changes, or in other terms how well the calibrated model is able to generalise to new scenarios.
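As a small illustration of the linear case in step 5, the least-squares fit of the relation above can be sketched as follows; the NumPy-based code and variable names are assumptions for illustration, not part of the original procedure.

```python
# A minimal sketch of the linear option in step 5, assuming the dataset D has
# been collected into a NumPy array S of scalable parameters (N x m) and a
# vector r of measured resource values (N,). The variable names are
# illustrative, not part of the original procedure.
import numpy as np

def fit_linear_model(S, r):
    """Least-squares estimate of [d_1, ..., d_m, d_0] in r = sum_j d_j s_j + d_0."""
    X = np.hstack([S, np.ones((S.shape[0], 1))])   # extra column for the intercept d_0
    coeffs, *_ = np.linalg.lstsq(X, r, rcond=None)
    return coeffs

def predict_linear(S, coeffs):
    return np.hstack([S, np.ones((S.shape[0], 1))]) @ coeffs

# Synthetic check: 50 samples of 3 parameters with a known linear relation.
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(50, 3))
r = 2.0 * S[:, 0] + 0.5 * S[:, 1] + 0.1 + rng.normal(0.0, 0.01, 50)
print(fit_linear_model(S, r).round(2))           # roughly [2.0, 0.5, 0.0, 0.1]
```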
4 The VTC/MPEG-4 Modelling Problem The procedure described in the previous section has been instantiated to a real QoS modelling problem: predicting the resource requirements of the VTC wavelet-based algorithm of the MPEG-4 decoder. VTC is basically a pipeline of a Wavelet Transform (WT), a Quantisation (Q) and a Zero-Tree (ZTR) based entropy coding (arithmetic coder) module. The Wavelet Transform represents a digital image (made of 3 color components: y, u, and v) as a hierarchy of levels, where the first one—the DC image—represents an anti-aliased downsampled version of the original image. All other levels represent additional detailed information that enables the correct upsampling of the DC image back to the original one. Quantisation is the process of thresholding wavelet coefficients, prior to entropy coding. There are three modes of
quantisation in VTC: Single Quantisation (SQ), Multi-level Quantisation (MQ) and Bi-level Quantisation (BQ). For further details on VTC we refer the reader to [20]. These are the different steps of the procedure we followed in order to model the resource requirements:
1. Benchmark selection: we chose 21 image test files in yuv format. Besides the Lena picture, the images are extracted from 4 different AVI videos: Akiyo, IMECnology, Mars, and Mother and Daughter. Table 1 reports all the images' names with the corresponding formats (width x height). The reference software version is the MoMuSys (Mobile Multimedia Systems) reference code. The reference microprocessor is an HP J7000/4 at 440 MHz.

Table 1. Benchmark images (width x height)
1-Akiyo 352x288     7-Imec6 352x288      13-Lena 256x256        19-Mars3 1248x896
2-Imec1 232x190     8-Imec7 433x354      14-Imec12 878x719      20-MoDaug 176x144
3-Imec2 528x432     9-Imec8 352x288      15-Imec13 1056x864
4-Imec3 352x288     10-Imec9 387x317     16-Imec14 1232x1008
5-Imec4 704x576     11-Imec10 317x259    17-Mars1 936x672
6-Imec5 1252x1024   12-Imec11 352x288    18-Mars2 1280x919
2. Target quantities: we define with r the vector of resource requirements composed of
• td: the decoding execution time in seconds,
• rd: the total number of decoding read memory accesses, and
• wd: the total number of decoding write memory accesses.
We will refer with ri to the i-th component of the vector r.
3. Input parameters: the initial set of inputs which is assumed to influence the value of the target quantities is made of
• w: Image width.
• h: Image height.
• l: Number of wavelet decomposition levels. In the experiments the value of l ranges over the interval [2..4].
• q: Quantization type. It assumes three discrete values: 1 for SQ, 2 for MQ, 3 for BQ.
• n: Target SNR level. In the experiments it ranges over the interval [1..3].
• y: Quantization level QDC_y. This number represents the number of levels used to quantize the DC coefficients of the y-component of the image. It assumes two discrete values: 1 and 6.
• u: Quantization level QDC_uv. This number represents the number of levels used to quantize the DC coefficients of the u-component and the v-component of the image. It assumes two discrete values: 1 and 6.
• ee: Encoding execution time. This is the time required for encoding the picture with the same reference software. This variable is considered as an input only for predicting td.
• re: Total number of encoding read memory accesses. This variable is considered as an input only for predicting rd and wd.
• we: Total number of encoding write memory accesses. This variable is considered as an input only for predicting rd and wd.
We define with s the vector [w, h, l, q, n, y, u, ee, re, we] and with S the domain of values of the vector s. We will refer with si to the i-th component of the vector s. In Section 5.1 we will present a feature selection procedure to reduce the size of the vector s, aiming at taking into consideration only that subset of s which is effectively critical to predict the target quantities.
4. Data collection. For each test image we collect 108 measurements, by sweeping the input s over the ranges defined in the previous section. The total number of measurements is N=2484. The execution time is returned by the timex command of the HP-UX operating system. The total numbers of read and write memory accesses are measured by instrumenting the code with the Atomium profiling tool [3, 4].
5. Model estimation. Two different models are estimated on the basis of the collected measurements. The first one is an input/output model m1 of the relation f between the input scalable parameters and the value of the target variables:

$$r = m_1(s) \qquad (5)$$

The second is an input/output model m2 of the relation between the change of the input parameters and the change of the value of the target variables:

$$r(t) - r(t-1) = m_2(s(t), s(t-1)) \qquad (6)$$

The model m2 is relevant in order to enable a QoS control policy. On the basis of the information returned by the model m2, the QoS controller can test what change ∆s in the parameter configuration may induce the desired change ∆r in the requirements. Both linear and non-linear models are taken into consideration to model m1 and m2. In particular we adopt the multiple linear regression technique [13] to identify the linear model. Among the large number of results in statistical non-linear regression and machine learning [6, 28], we propose the adoption of a method of locally weighted regression, called lazy learning [10]. Lazy learning is a memory-based technique that, once an input is received, extracts a prediction by locally interpolating the examples which are considered similar according to a distance metric. This method proved to be effective in many problems of non-linear data modelling [7, 10] and was successfully applied to the problem of multivariate regression proposed by the NeuroNet CoIL Competition [9].
6. Validation. We adopt a training-and-test procedure, which means that the original data set, made of N samples, is decomposed v times into two non-overlapping sets, namely
− the training set, made of Ntr samples, used to train the prediction model, and
− the test set, composed of Nts = N − Ntr samples, used to validate the prediction model according to some error criterion.
After having trained and tested the prediction model v times, the generalisation performance of the predictor is evaluated by averaging the error criterion over the v test sets. Note that the particular case where v=N, Nts=1 and Ntr=N−1 is generally denoted as leave-one-out (LOO) validation in the statistical literature [24].
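The following sketch shows one plausible way to assemble the training data for the incremental model m2 in (6) from a parameter sweep; the pandas layout and column names are assumptions, not the authors' actual data format.

```python
# A sketch of how the training data for the incremental model m2 in (6) could
# be assembled from a measurement sweep: inputs are consecutive parameter
# configurations (s(t-1), s(t)), targets are the corresponding changes in the
# measured resource. The pandas layout and column names are assumptions, not
# the authors' actual data format.
import pandas as pd

def make_delta_dataset(df, param_cols, target_col):
    """Turn a sweep of measurements into (s(t-1), s(t)) -> r(t) - r(t-1) samples."""
    prev = df.shift(1).add_suffix("_prev")
    both = pd.concat([prev, df], axis=1).dropna()
    X = both[[c + "_prev" for c in param_cols] + param_cols]
    y = both[target_col] - both[target_col + "_prev"]
    return X, y

# Toy sweep over two parameters with a fabricated decoding-time column.
sweep = pd.DataFrame({"l":  [2, 3, 4, 2, 3, 4],
                      "q":  [1, 1, 1, 2, 2, 2],
                      "td": [0.10, 0.14, 0.20, 0.12, 0.17, 0.25]})
X, y = make_delta_dataset(sweep, ["l", "q"], "td")
print(X.shape, [round(v, 2) for v in y])
```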
5 The Experimental Results

5.1 Feature Selection

The first step of the modelling procedure aims at reducing the complexity of the input vector. The goal is to select which variables among the ones contained in the vector s are effectively correlated with the target quantities. There are several popular feature selection algorithms [18]. Here, we adopt an incremental feature selection approach called forward selection [13]. The method starts with an empty set of variables and incrementally adds new variables, testing, for each of them, the predictive accuracy of the model. At each step, the variable that guarantees the largest decrease of the prediction error is added to the set. The procedure stops when the accuracy does not improve any more or deteriorates due to over-parameterisation. Figure 3 reports the estimated prediction accuracy against the set of features (represented by their numbers) for a non-linear lazy learning model that predicts the variable r1. Note that the prediction accuracy is measured by a leave-one-out procedure [24]. We choose the set made of features no. 1 (i.e., w), no. 2 (i.e., h), no. 3 (i.e., l), no. 4 (i.e., q) and no. 8 (i.e., ee), since according to the figure this set returns the lowest error in predicting r1.
Fig. 3. Feature selection. The x-axis reports the subsets of features evaluated ({8}, {8 4}, {8 4 2}, {8 4 2 3}, {8 4 2 3 1}, {8 4 2 3 1 6}, {8 4 2 3 1 6 5}) and the y-axis the Mean Squared Error (MSE) estimated by the leave-one-out procedure.
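A forward-selection loop of the kind described above can be sketched as follows; here a k-nearest-neighbour regressor stands in for the lazy learning model and scikit-learn provides the leave-one-out machinery, both substitutions for illustration rather than the authors' tooling.

```python
# A sketch of the forward-selection loop described above. A k-nearest-neighbour
# regressor stands in for the lazy learning model and scikit-learn supplies the
# leave-one-out machinery; both are substitutions for illustration only.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

def forward_select(X, y):
    remaining, chosen, best_err = list(range(X.shape[1])), [], np.inf
    while remaining:
        # leave-one-out MSE for each candidate extension of the current set
        errs = {}
        for f in remaining:
            cols = chosen + [f]
            scores = cross_val_score(KNeighborsRegressor(n_neighbors=3),
                                     X[:, cols], y,
                                     scoring="neg_mean_squared_error",
                                     cv=LeaveOneOut())
            errs[f] = -scores.mean()
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:          # accuracy no longer improves: stop
            break
        best_err, chosen = errs[f_best], chosen + [f_best]
        remaining.remove(f_best)
    return chosen, best_err

rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 5))
y = 2.0 * X[:, 0] + X[:, 3]                   # only features 0 and 3 matter
print(forward_select(X, y))
```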
Using the same procedure for the other two target quantities, we obtain that the best feature subsets are {w, h, l, q, re} for r2 and {w, h, l, q, we} for r3, respectively. These feature sets will be used in the rest of the experiments as input vectors of the prediction models.

5.2 Estimation of Model m1

We compare linear and non-linear models m1 (Equation (5)) by adopting a training-and-test setting. This means that we perform 21 experiments where each time the training set contains all the images except one, which is set aside to validate the
prediction accuracy of the method. The predictive accuracy of the models is assessed by their percentage error (PE). Table 2 reports the average of the 21 percentage errors for the linear and non-linear model.

Table 2. Predictive accuracy (in percentage error) of the linear and non-linear approach to modelling the relation m1.
Target                              Linear PE    Non-linear PE
Decoding execution time r1          15.8 %       3.8 %
Decoding read memory accesses r2    39.9 %       6.0 %
Decoding write memory accesses r3   40.3 %       4.6 %
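The evaluation protocol behind Table 2 (train on all images but one, test on the held-out image, average the percentage error) might look as follows in outline; the models, grouping column and synthetic data are placeholders, not the authors' experimental setup.

```python
# A sketch of the leave-one-image-out protocol behind Table 2: each image's
# measurements are held out in turn, a model is fitted on the rest, and the
# percentage error on the held-out image is averaged. Models, grouping column
# and synthetic data are placeholders, not the authors' experimental setup.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def leave_one_image_out_pe(X, y, image_ids, model_factory):
    errors = []
    for img in np.unique(image_ids):
        train, test = image_ids != img, image_ids == img
        model = model_factory().fit(X[train], y[train])
        pred = model.predict(X[test])
        errors.append(np.mean(np.abs(pred - y[test]) / np.abs(y[test])) * 100.0)
    return np.mean(errors)                     # average PE over held-out images

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 4))
y = np.exp(X[:, 0]) + X[:, 1] ** 2 + 0.1       # mildly non-linear synthetic target
ids = np.repeat(np.arange(20), 10)             # 20 "images", 10 configurations each
print(leave_one_image_out_pe(X, y, ids, LinearRegression))
print(leave_one_image_out_pe(X, y, ids, lambda: KNeighborsRegressor(5)))
```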
Figure 4 reports the real values of r1 (decoding execution time) and the predictions returned by the linear model for images no. 2 and no. 6. The y-axis reports the execution time for a specific input configuration. Each point on the x-axis represents a different input configuration.
Fig. 4. Prediction of the linear model for image no. 2 (left) and no. 6 (right).
Similarly, Figure 5 reports the real values of r1 (decoding execution time) and the predictions returned by the non-linear model for images no. 2 and no. 6.

5.3 Estimation of the Model m2

The predictive accuracy of the model m2 in (6) is assessed by counting the number of times that the model is not able to predict the sign of the change of the output value (r(t) − r(t−1)) for a given input change (s(t−1) → s(t)). Given the relevance of this model for the QoS control policy, we deem that the number of incorrect sign-change predictions is more relevant than the average prediction error. We adopt the same training-and-test validation procedure of the previous section. We present two different experimental settings: a non-incremental one where the training set is kept fixed and an incremental one where the training set is augmented each time a new observation is available. Note that the goal of the incremental experiment is to test the adaptive capability of the modelling algorithms when the set of observations is updated on-line.
Fig. 5. Prediction of the non-linear model for image no. 2 (left) and no. 6 (right).
Table 3 reports the results for the non-incremental case. The prediction error is measured by the percentage of times that the predictive model returns a wrong prediction of the sign of the output change. Again, linear and non-linear predictors are assessed and compared.

Table 3. Non-incremental configuration: the table reports the percentage of times that the model is not able to predict the sign of the change of the output value for a given input change.
Target                              Linear       Non-linear
Decoding execution time r1          23.5 %       7.5 %
Decoding read memory accesses r2    46.7 %       9.4 %
Decoding write memory accesses r3   19.9 %       10.2 %
Table 4 reports the results for the incremental case with the non-linear approach. It is interesting to note the improvement of the prediction accuracy for all three target quantities, although only a limited number of samples is added on-line to the initial dataset.

Table 4. Incremental configuration: the table reports the percentage of times that the model is not able to predict the sign of the change of the output value for a given input change.
Target                              Non-linear
Decoding execution time r1          3.4 %
Decoding read memory accesses r2    6.4 %
Decoding write memory accesses r3   4.7 %
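The sign-based error of Tables 3 and 4 can be computed along the following lines; the incremental variant simply refits the model after each new observation is appended. The regressor used here is a stand-in for the lazy learning method, and all names are illustrative.

```python
# A sketch of the sign-based error used in Tables 3 and 4: the percentage of
# test cases for which the predicted output change has the wrong sign. The
# incremental variant refits after appending each newly observed case. The
# regressor is a stand-in, not the lazy learning method of the paper.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def sign_error_rate(model, X_test, dy_test):
    wrong = np.sign(model.predict(X_test)) != np.sign(dy_test)
    return 100.0 * np.mean(wrong)

def incremental_sign_error(X_train, dy_train, X_test, dy_test):
    X_tr, dy_tr, wrong = list(X_train), list(dy_train), 0
    for x, dy in zip(X_test, dy_test):
        model = KNeighborsRegressor(n_neighbors=3).fit(np.array(X_tr), np.array(dy_tr))
        wrong += np.sign(model.predict([x])[0]) != np.sign(dy)
        X_tr.append(x)                 # the new observation joins the training set
        dy_tr.append(dy)
    return 100.0 * wrong / len(dy_test)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
dy = X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=60)
fixed = KNeighborsRegressor(n_neighbors=3).fit(X[:40], dy[:40])
print(sign_error_rate(fixed, X[40:], dy[40:]))
print(incremental_sign_error(X[:40], dy[:40], X[40:], dy[40:]))
```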
6 Conclusions

The paper presented the preliminary results of a modelling approach suitable for an adaptive control implementation of QoS techniques for multimedia applications. The results presented in Table 2, Table 3 and Table 4, albeit based on a limited number of examples, show that it is possible to estimate a model returning an accurate prediction of the resource requirements, both in terms of execution time and memory accesses. In particular, the experiments suggest that non-linear models outperform linear models in a significant way, showing the intrinsic complexity of the modelling task. Moreover, we show, in concordance with other published results [8], that the non-linear technique we have proposed is also robust in an adaptive setting where the number of samples increases on-line. Future work could take several directions:
• Extending the work to multiple platforms and architectures.
• Exploring prediction models of some quantitative attributes of the quality.
• Integrating the prediction models in a control architecture, responsible for negotiating online the quality demands vs. the resource constraints.
• Integrating the application-level control with a higher system-level control mechanism (e.g., a resource manager).
We are convinced that the promising results of this work will act as a driving force for future black-box approaches to QoS policies.
References 1. Abdelzaher, T.F.: An Automated Profiling Subsystem for QoS-Aware Services. In: Proceedings of Sixth IEEE Real-Time Technology and Applications Symposium, 2000. RTAS 2000 (2000). 2. Astrom, K.J.: Theory and Applications of Adaptive Control - A Survey. Automatica, 19, 5, (1983) 471-486. 3. ATOMIUM:, http://www.imec.be/atomium/. 4. Bormans, J. , Denolf, K., Wuytack, S., Nachtergaele, L., Bolsens, I. : Integrating SystemLevel Low Power Methodologies into a Real-Life Design Flow. In: PATMOS’99 Ninth International Workshop Power and Timing Modeling, Optimization and Simulation, (1999) 19-28. 5. Birattari, M ., Bontempi, G.: The Lazy Learning Toolbox, For use with Matlab. Technical Report, TR/IRIDIA/99-7, Université Libre de Bruxelles (1999) (http://iridia.ulb.ac.be/~gbonte/Papers.html). 6. Bishop, C.M.: Neural Networks for Statistical Pattern Recognition. Oxford, UK: Oxford University Press (1994). 7. Bontempi, G.: Local Learning Techniques for Modeling, Prediction and Control. PhD dissertation. IRIDIA- Université Libre de Bruxelles, Belgium, (1999). 8. Bontempi, G., Birattari, M., Bersini, H.: Lazy learning for modeling and control design. International Journal of Control, 72, 7/8, (1999) 643-658. 9. Bontempi, G., Birattari, M., Bersini, H.: Lazy Learners at work: the Lazy Learning Toolbox. In: EUFIT '99 - The 7th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany (1999). 10. Bontempi, G. , Birattari, M., Bersini, H.: A model selection approach for local learning. Artificial Intelligence Communications, 13, 1, (2000) 41-48.
11. Bontempi, G., Kruijtzer, W.: A Data Analysis Method for Software Performance Prediction. Design Automation and Test in Europe DATE 2002 (2002). 12. Brandt, S., Nutt, G., Berk, T., Mankovich, J.: A dynamic quality of service middleware agent for mediating application resource usage. In: Proceedings of the 19th IEEE RealTime Systems Symposium, (1998) 307 –317. 13. Draper, N. R., Smith, H.: Applied Regression Analysis. New York, John Wiley and Sons (1981). 14. Fayyad, U., Piatesky-Shapiro, G., Smyth, P.: The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39, 11, (1996) 27-34. 15. Franklin, G., Powell, J.: Digital control of Dynamic Systems. Addison Wesley (1981). 16. Jain, R.: Control-theoretic Formulation of Operating Systems Resource Management Policies. Garland Publishing Companies (1979). 17. Koenen, R.: MPEG-4. Multimedia for our time. IEEE Spectrum, (1999), 26-33. 18. Kohavi, R., John, G. H.: Wrappers for Feature Subset Selection. Artificial Intelligence, 97, 1-2, (1997) 273-324. 19. Lafruit, G.: Computational Graceful Degradation methodology. IMEC Technical report (2000). 20. Lafruit, G., Vanhoof, B.: MPEG-4 Visual Texture Coding: Variform, yet Temperately Complex. In: IWSSIP the 8th International Workshop on Systems, Signals and Image Processing , Romania (2001) 63-66. 21. Li, B., Nahrstedt, K.: A Control-based Middleware Framework for Quality of Service Adaptations. IEEE Journal of Selected Areas in Communications, Special Issue on Service Enabling Platforms, 17, 9, (1999) 1632-1650. 22. Li, B., Kalter, W., Nahrstedt, K.: A Hierarchical Quality of Service Control Architecture for Configurable Multimedia Applications. Journal of High-Speed Networks, Special Issue on QoS for Multimedia on the Internet, IOS Press, 9, (2000) 153-174. 23. Lu, Y., Saxena, A., Abdelzaher, T.F.: Differentiated caching services; a control-theoretical approach. In: 21st International Conference on Distributed Computing Systems, (2001) 615–622. 24. Lu, C., Stankovic, J.A., Abdelzaher, T.F., Tao, G., Sao, S.H., Marley, M.: Performance specifications and metrics for adaptive real-time systems. In: Proceedings of the 21st IEEE Real-Time Systems Symposium, (2000) 13 –23. 25. Pitsillides, A., Lambert, J.: Adaptive congestion control in ATM based networks: quality of service with high utilisation. Journal of Computer Communications, 20, (1997), 12391258. 26. Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of Royal Statistical Society B, 36, 1, (1974) 111-147. 27. Vogel, A., Kerherve, B., von Bochmann, G., Gecsei, J.: Distributed Multimedia and QoS: A Survey. IEEE Multimedia, 2, 2, (1995) 10-19. 28. Weiss, S.M., Kulikowski, C.A.: Computer systems that learn. San Mateo, California, Morgan Kaufmann (1991).
Using Markov Chains for Link Prediction in Adaptive Web Sites
Jianhan Zhu, Jun Hong, and John G. Hughes
School of Information and Software Engineering, University of Ulster at Jordanstown, Newtownabbey, Co. Antrim, BT37 0QB, UK
{jh.zhu, j.hong, jg.hughes}@ulst.ac.uk
Abstract. The large number of Web pages on many Web sites has raised navigational problems. Markov chains have recently been used to model user navigational behavior on the World Wide Web (WWW). In this paper, we propose a method for constructing a Markov model of a Web site based on past visitor behavior. We use the Markov model to make link predictions that assist new users to navigate the Web site. An algorithm for transition probability matrix compression has been used to cluster Web pages with similar transition behaviors and compress the transition matrix to an optimal size for efficient probability calculation in link prediction. A maximal forward path method is used to further improve the efficiency of link prediction. Link prediction has been implemented in an online system called ONE (Online Navigation Explorer) to assist users’ navigation in the adaptive Web site.
1 Introduction In a Web site with a large number of Web pages, users often have navigational questions, such as, Where am I? Where have I been? and Where can I go? [10]. Web browsers, such as Internet Explorer, are quite helpful. The user can check the URI address field to find where they are. Web pages on some Web sites also have a hierarchical navigation bar, which shows the current Web location. Some Web sites show the user’s current position on a sitemap. In IE 5.5, the user can check the history list by date, site, or most visited to find where he/she has been. The history can also be searched by keywords. The user can backtrack where he/she has been by clicking the “Back” button or selecting from the history list attached to the “Back” button. Hyperlinks are shown in a different color if they point to previously visited pages. We can see that the answers to the first two questions are satisfactory. To answer the third question, what the user can do is to look at the links in the current Web page. On the other hand, useful information about Web users, such as their interests indicated by the pages they have visited, could be used to make predictions on the pages that might interest them. This type of information has not been fully utilized to provide a satisfactory answer to the third question. A good Web site should be able to help its users to find answers to all three questions. The major goal of this paper is to provide an adaptive Web site [11] that changes its presentation and organization on the basis of link prediction to help users find the answer to the third question.
In this paper, by viewing the Web user’s navigation in a Web site as a Markov chain, we can build a Markov model for link prediction based on past users’ visit behavior recorded in the Web log file. We assume that the pages to be visited by a user in the future are determined by his/her current position and/or visiting history in the Web site. We construct a link graph from the Web log file, which consists of nodes representing Web pages, links representing hyperlinks, and weights on the links representing the numbers of traversals on the hyperlinks. By viewing the weights on the links as past users’ implicit feedback of their preferences in the hyperlinks, we can use the link graph to calculate a transition probability matrix containing one-step transition probabilities in the Markov model. The Markov model is further used for link prediction by calculating the conditional probabilities of visiting other pages in the future given the user’s current position and/or previously visited pages. An algorithm for transition probability matrix compression is used to cluster Web pages with similar transition behaviors together to get a compact transition matrix. The compressed transition matrix makes link prediction more efficient. We further use a method called Maximal Forward Path to improve the efficiency of link prediction by taking into account only a sequence of maximally connected pages in a user’s visit [3] in the probability calculation. Finally, link prediction is integrated with a prototype called ONE (Online Navigation Explorer) to assist Web users’ navigation in the adaptive Web site. In Section 2, we describe a method for building a Markov model for link prediction from the Web log file. In Section 3, we discuss an algorithm for transition matrix compression to cluster Web pages with similar transition behaviors for efficient link prediction. In Section 4, link prediction based on the Markov model is presented to assist users’ navigation in a prototype called ONE (Online Navigation Explorer). Experimental results are presented in Section 5. Related work is discussed in Section 6. In Section 7, we conclude the paper and discuss future work.
2 Building Markov Models from Web Log Files We first construct a link structure that represents pages, hyperlinks, and users’ traversals on the hyperlinks of the Web site. The link structure is then used to build a Markov model of the Web site. A traditional method for constructing the link structure is Web crawling, in which a Web indexing program is used to build an index by following hyperlinks continuously from Web page to Web page. Weights are then assigned to the links based on users’ traversals [14]. This method has two drawbacks. One is that some irrelevant pages and links, such as pages outside the current Web site and links never traversed by users, are inevitably included in the link structure, and need to be filtered out. Another is that the Webmaster can set up the Web site to exclude the crawler from crawling into some parts of the Web site for various reasons. We propose to use the link information contained in an ECLF (Extended Common Log File) [5] format log file to construct a link structure, called a link graph. Our approach has two advantages over crawling-based methods. Only relevant pages and links are used for link graph construction, and all the pages relevant to users’ visits are included in the link graph.
2.1 Link Graphs A Web log file contains rich records of users’ requests for documents on a Web site. ECLF format log files are used in our approach, since the URIs of both the requested documents and the referrers indicating where the requests came from are available. An ECLF log file is represented as a set of records corresponding to the page requests, WL ={( e1 , e2 ,..., em )}, where e1 , e2 ,..., em are the fields in each record. A record in an ECLF log file might look like as shown in Figure 1: 177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" Fig. 1. ECLF Log File
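A rough sketch of how ECLF records such as the one in Fig. 1 could be reduced to weighted (referrer, requested page) pairs, as described in the next paragraph, is given below; the regular expression and the success filter are simplifications assumed for illustration, not a full ECLF parser.

```python
# A rough sketch of reducing ECLF records such as the one in Fig. 1 to weighted
# (referrer, requested page) pairs, as described in the next paragraph. The
# regular expression and the success filter are simplifications assumed for
# illustration; they are not a full ECLF parser.
import re
from collections import Counter

ECLF_RE = re.compile(r'"(?:GET|POST) (?P<uri>\S+)[^"]*" (?P<status>\d+) \S+ "(?P<ref>[^"]*)"')

def link_pairs(log_lines):
    counts = Counter()
    for line in log_lines:
        m = ECLF_RE.search(line)
        if not m or not m.group("status").startswith("2"):
            continue                           # drop malformed or unsuccessful requests
        counts[(m.group("ref"), m.group("uri"))] += 1
    # the aggregated set {(r, u, w)}: referrer, requested page, number of traversals
    return [(r, u, w) for (r, u), w in counts.items()]

sample = ('177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] '
          '"GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 '
          '"http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible)"')
print(link_pairs([sample]))
```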
The records of embedded objects in the Web pages, including graphical, video, and audio files, are treated as redundant requests and removed, since every request of a Web page will initiate a series of requests of all the embedded objects in it automatically. The records of unsuccessful requests are also discarded as erroneous records, since there may be bad links, missing or temporarily inaccessible documents, or unauthorized requests etc. In our approach, only the URIs of the requested Web page and the corresponding referrer are used for link graph construction. We therefore have a simplified set WLr ={( r , u )}, where r and u are the URIs of the referrer and the requested page respectively. Since various users may have followed the same links in their visits, the traversals of these links are aggregated to get a set WLs ={( r , u , w )}, where w is the number of traversals from r to u . In most cases a link is the hyperlink from r to u . When “-“ is in the referrer field, we assume there is a virtual link from “-“ to the requested page. We call each element ( r , u , w ) in the set a link pair. Two link pairs li =( ri , ui , wi ) and l j =( rj , u j , w j ) are said to be connected if and only if ri = rj , ri = u j , ui = rj , or ui = u j . A link pair set LS m ={( ri , ui , wi )} is said to connect to another link pair set LS n ={( rj , u j , w j )} if and only if for every link pair l j ∈ LS n , there exists a link pair li ∈ LS m , so that li and l j are connected. Definition 2.1 (Maximally connected Link pair Set) Given a link pair set WLs ={( rj , u j , w j )}, and a link pair set LS m ={( ri , ui , wi )} ⊂ WLs , we say
LS n ={( rl , ul , wl )} ⊂ WLs is the Maximally connected Link pair Set (MLS) of LS m on WLs if and only if LS m connects to LS n , and for every link pair l j ∈ ( WLs - LS n ), { l j } and LS m are not connected. For a Web site with only one major entrance, the homepage, people can come to it in various ways. They might come from a page on another Web site pointing to the homepage, follow a search result returned by a search engine pointing to the
homepage. "-" in the referrer field of a page request record indicates that the user has typed the URI of the homepage directly into the address field of the browser, selected the homepage from his/her bookmarks, or clicked on a shortcut to the homepage. In all these cases the referrer information is not available. We select a set of link pairs $LS_0 = \{(r_i, u_0, w_i)\}$, where $r_i$ is "-", the URI of a page on another Web site, or the URI of a search result returned by a search engine, $u_0$ is the URI of the homepage, and $w_i$ is the weight on the link, as the entrance to the hierarchy. We then look for the Maximally connected Link pair Set (MLS) $LS_1$ of $LS_0$ in $WL_s - LS_0$ to form the second level of the hierarchy. We look for $LS_2$ of $LS_1$ in $WL_s - LS_0 - LS_1$. This process continues until we get $LS_k$, so that $WL_s - \sum_{i=0}^{k} LS_i = \{\}$ or $LS_{k+1} = \{\}$.

For a Web site with a single entrance, we will commonly finish the link graph construction with $(WL_s - \sum_{i=0}^{k} LS_i) = \{\}$, which means that every link pair has been put onto a certain level in the hierarchy. The levels in the hierarchy are from $LS_0$ to $LS_k$. For a Web site with several entrances, commonly found in multi-functional Web sites, the construction will end with $LS_{k+1} = \{\}$ while $(WL_s - \sum_{i=0}^{k} LS_i) \neq \{\}$. We can then select a link pair set forming another entrance from $(WL_s - \sum_{i=0}^{k} LS_i)$ to construct a separate link graph.

Definition 2.2 (Link Graph) The link graph of $WL_s$, a directed weighted graph, is a hierarchy consisting of multiple levels $LS_0, \dots, LS_i, \dots, LS_k$, where $LS_0 = \{(r_0, u_0, w_0)\}$, $LS_i$ is the MLS of $LS_{i-1}$ in $WL_s - \sum_{j=0}^{i-1} LS_j$, and $WL_s - \sum_{j=0}^{k} LS_j = \{\}$ or $LS_{k+1} = \{\}$.

We add the "Start" node to the link graph as the starting point of the user's visit to the Web site and the "Exit" node as the ending point of the user's visit. In order to ensure that there is a directed path between any two nodes in the link graph, we add a link from the "Exit" node to the "Start" node. Due to the influence of caching, the amount of weight on all incoming links of a page might not be the same as the amount of weight on all outgoing links. To solve this problem, we can either assign the extra incoming weight to the link to the start/exit node or distribute the extra outgoing weight to the incoming links. Figure 2 shows a link graph we have constructed using a Web log file of the University of Ulster Web site, in which the title of each page is shown beside the node representing the page.
Fig. 2. A Link Graph Constructed from a Web Log File on University of Ulster Web Site
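The level-by-level construction of Definition 2.2 can be sketched as follows; the connectivity test follows the definitions above, while the data shapes and the toy log are assumptions made for illustration.

```python
# A sketch of the level-by-level construction of Definition 2.2: starting from
# an entrance set LS0, each new level is the maximally connected link pair set
# of the previous level among the remaining pairs. The connectivity test
# follows the definitions above; the toy log below is an assumption.
def connected(p, q):
    (r1, u1, _), (r2, u2, _) = p, q
    return r1 == r2 or r1 == u2 or u1 == r2 or u1 == u2

def build_levels(wls, ls0):
    levels, remaining = [set(ls0)], set(wls) - set(ls0)
    while remaining:
        # MLS of the previous level: every remaining pair connected to some pair in it
        nxt = {q for q in remaining if any(connected(p, q) for p in levels[-1])}
        if not nxt:                    # LS_{k+1} is empty: another entrance is needed
            break
        levels.append(nxt)
        remaining -= nxt
    return levels, remaining

wls = [("-", "/", 9000), ("/", "/dept", 2700), ("/dept", "/cs", 880),
       ("/", "/info", 1800), ("/info", "/library", 600)]
ls0 = [p for p in wls if p[0] == "-"]          # virtual links into the homepage
print(build_levels(wls, ls0))
```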
2.2 Markov Models

Each node in the link graph can be viewed as a state in a finite discrete Markov model, which can be defined by a tuple $<S, Q, L>$, where $S$ is the state space containing all the nodes in the link graph, $Q$ is the probability transition matrix containing one-step transition probabilities between the nodes, and $L$ is the initial probability distribution on the states in $S$. The user's navigation in the Web site can be seen as a stochastic process $\{X_n\}$, which has $S$ as the state space. If the conditional probability of visiting page $j$ in the next step, $P_{i,j}^{(m)}$, is dependent only on the last $m$ pages visited by the user, $\{X_n\}$ is called an $m$-order Markov chain [8]. Given that the user is currently at page $i$ and has visited pages $i_{n-1},\dots,i_0$, $P_{i,j}^{(m)}$ is dependent only on pages $i, i_{n-1},\dots,i_{n-m+1}$:

$$P_{i,j}^{(m)} = P(X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1},\dots,X_0=i_0) = P(X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1},\dots,X_{n-m+1}=i_{n-m+1}) \qquad (1)$$
where the conditional probability of $X_{n+1}$ given the states of all the past events is equal to the conditional probability of $X_{n+1}$ given the states of the past $m$ events. When $m=1$, $X_{n+1}$ is dependent only on the current state $X_n$, and $P_{i,j}^{(1)} = P_{i,j} = P(X_{n+1}=j \mid X_n=i)$; this defines a one-order Markov chain, where $P_{i,j}$ is the probability that a transition is made from state $i$ to state $j$ in one step. We can calculate the one-step transition probability from page $i$ to page $j$ using a link graph as follows, by considering the similarity between a link graph and a circuit chain discussed in [7]. The one-step transition probability from page $i$ to page $j$, $P_{i,j}$, can be viewed as the fraction of traversals from $i$ to $j$ over the total number of traversals from $i$ to other pages and the "Exit" node:

$$P_{i,j} = P(X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1},\dots,X_0=i_0) = P(X_{n+1}=j \mid X_n=i) = \frac{w_{i,j}}{\sum_k w_{i,k}} \qquad (2)$$
where $w_{i,j}$ is the weight on the link from $i$ to $j$, and $w_{i,k}$ is the weight on a link from $i$ to $k$. Now a probability transition matrix, which represents the one-step transition probability between any two pages, can be formed. In a probability transition matrix, row $i$ contains the one-step transition probabilities from $i$ to all states. Row $i$ sums up to 1.0. Column $i$ contains the one-step transition probabilities from all states to $i$. The transition matrix calculated from the link graph in Figure 2 is shown in Figure 3.
Fig. 3. Transition Probability Matrix for the Link Graph in Fig. 2
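Computing the one-step transition matrix of equation (2) from the weighted link pairs amounts to row-normalising the traversal counts; a small sketch (with a made-up subset of pages, not the full graph of Fig. 2) is given below.

```python
# A sketch of equation (2): the one-step transition matrix is obtained by
# accumulating the traversal weights w_{i,j} of the link graph and normalising
# each row by its sum. The pages and weights below are a made-up subset, not
# the full graph of Fig. 2.
import numpy as np

def transition_matrix(link_pairs, pages):
    index = {p: k for k, p in enumerate(pages)}
    Q = np.zeros((len(pages), len(pages)))
    for r, u, w in link_pairs:                 # accumulate traversal counts w_{i,j}
        Q[index[r], index[u]] += w
    row_sums = Q.sum(axis=1, keepdims=True)
    return np.divide(Q, row_sums, out=np.zeros_like(Q), where=row_sums > 0)

pages = ["Start", "Home", "Dept", "Info", "Exit"]
pairs = [("Start", "Home", 9000), ("Home", "Dept", 2700), ("Home", "Info", 1800),
         ("Dept", "Exit", 2700), ("Info", "Exit", 1800), ("Exit", "Start", 9000)]
print(transition_matrix(pairs, pages).round(3))   # each non-empty row sums to 1.0
```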
3 Transition Matrix Compression

An algorithm that can be used to compress a sparse probability transition matrix while preserving the transition behaviors of the Markov model is presented in [15]. States with similar transition behaviors are aggregated together to form new states. In link prediction, we need to raise the transition matrix $Q$ to the $n$th power. For a large $Q$ this is computationally expensive. Spears' algorithm can be used to compress the original matrix $Q$ to a much smaller matrix $Q_c$ without significant errors, since accuracy experiments on large matrices have shown that $Q_c^n$ and $(Q^n)_c$ are very close to each other. Since the computational complexity of computing $Q^n$ is $O(N^3)$, by dramatically reducing $N$, the time taken by compression can be compensated for by all subsequent probability computations for link prediction [15]. We have used Spears' algorithm in our approach. The similarity metric of every pair of states is formed to ensure that pairs of states that are more similar yield less error when they are compressed [15]. Based on the similarity metric in [15], the transition similarity of two pages $i$ and $j$ is the product of their in-link and out-link similarities. Their in-link similarity is the weighted sum of the distance between column $i$ and column $j$ at each row. Their out-link similarity is the sum of the distance between row $i$ and row $j$ at each column.
$$Sim_{i,j} = Sim_{i,j}(out\text{-}link) \times Sim_{i,j}(in\text{-}link)$$
$$Sim_{i,j}(out\text{-}link) = \sum_y \alpha_{i,j}(y), \qquad Sim_{i,j}(in\text{-}link) = \sum_x \beta_{i,j}(x) \qquad (3)$$
$$\alpha_{i,j}(y) = |P_{i,y} - P_{j,y}|, \qquad \beta_{i,j}(x) = \frac{m_i \times P_{x,j} - m_j \times P_{x,i}}{m_i + m_j}, \qquad m_i = \sum_l P_{l,i}, \; m_j = \sum_l P_{l,j}$$

where $m_i$ and $m_j$ are the sums of the probabilities on the in-links of pages $i$ and $j$ respectively, $Sim_{i,j}(out\text{-}link)$ is the sum of the out-link probability differences between $i$ and $j$, and $Sim_{i,j}(in\text{-}link)$ is the sum of the in-link probability differences between $i$ and $j$. For the transition matrix in Figure 3, the calculated transition similarity matrix is shown in Figure 4. If the similarity is close to zero, the error resulting from compression is close to zero [15]. We can set a threshold $\varepsilon$, and let $Sim_{i,j} < \varepsilon$ to look for candidate pages for merging.
Fig. 4. Transition Similarity Matrix for Transition Matrix in Fig. 3 (Symmetric)
By raising ε we can compress more states, with a commensurate increase in error. Pages sharing more in-links and out-links, and having equivalent weights on them, will meet the similarity threshold. Suppose states i and j are merged together; we then need to assign transition probabilities between the new state i∨j and each remaining state k in the transition matrix. We compute the weighted average of the i-th and j-th rows and place the results in the row of state i∨j, and sum the i-th and j-th columns and place the results in the column of state i∨j.
P_{k, i∨j} = P_{k,i} + P_{k,j}
P_{i∨j, k} = (m_i × P_{i,k} + m_j × P_{j,k}) / (m_i + m_j)        (4)
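A sketch of the merging step implied by Eq. (4) follows; it is our own illustration, and in particular the diagonal entry of the merged state is obtained here by applying the row rule and then summing the two merged columns, which is our reading of Eq. (4) rather than something the paper states explicitly.

import numpy as np

def merge_states(P, i, j):
    """Merge states i and j of transition matrix P according to Eq. (4)."""
    m = P.sum(axis=0)                                  # in-link probability mass
    keep = [k for k in range(P.shape[0]) if k not in (i, j)]

    merged_row = (m[i] * P[i, :] + m[j] * P[j, :]) / (m[i] + m[j])   # P_{i∨j, k}
    merged_col = P[:, i] + P[:, j]                                   # P_{k, i∨j}

    Pc = np.empty((len(keep) + 1, len(keep) + 1))
    Pc[:-1, :-1] = P[np.ix_(keep, keep)]
    Pc[-1, :-1] = merged_row[keep]
    Pc[:-1, -1] = merged_col[keep]
    # self-transition of the merged state (row rule, then summed columns)
    Pc[-1, -1] = (m[i] * (P[i, i] + P[i, j])
                  + m[j] * (P[j, i] + P[j, j])) / (m[i] + m[j])
    return Pc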
For the similarity matrix in Figure 4, we set the similarity threshold ε = 0.10. Experiments indicated that a value of ε between 0.08 and 0.15 yielded good compression with minimal error for our link graph. The compression process is shown in Figure 5. States 2 and 4, and 5 and 6, are compressed as a result of Sim_{i,j}(in-link) = 0; states 8, 9, 10 and 12 are compressed as a result of Sim_{i,j}(out-link) = 0. The compressed matrix is shown in Figure 6. The compressed matrix is denser than the original transition matrix. When either Sim_{i,j}(out-link) = 0 or Sim_{i,j}(in-link) = 0, the compression results in no error: Error_{i,j} = 0 and Q_c^n = (Q^n)_c [15]. So there is no compression error for the transition matrix in Figure 3 and its compressed matrix in Figure 6. This may not always be the case for a transition matrix calculated from another link graph. When Sim_{i,j} is below a given threshold, the effect of compression on the transition behavior of the states ((Q^n)_c − Q_c^n) is controlled, the transition property of the matrix is preserved, and the system is compressed to an optimal size for probability computation. The compressed transition matrix is used for efficient link prediction.
Compressed state 4 into state 2 (similarity 0.000000) (states: 2 4)
Compressed state 6 into state 5 (similarity 0.000000) (states: 5 6)
Compressed state 9 into state 8 (similarity 0.000000) (states: 8 9)
Compressed state 12 into state 10 (similarity 0.000000) (states: 10 12)
Compressed state 10 into state 8 (similarity 0.000000) (states: 8 9 10 12)
Finished compression. Have compressed 14 states to 9.

Fig. 5. Compression Process for Transition Matrix in Fig. 3
[Compressed transition probability matrix over the nine states 1, (2,4), 3, (5,6), 7, (8,9,10,12), 11, Exit, and Start.]
Fig. 6. Compressed Transition Matrix for Transition Matrix in Figure 3
4 Link Prediction Using Markov Chains
When a user visits the Web site, by taking the pages he/she has already visited as a history, we can use the compressed probability transition matrix to calculate the probabilities that he/she will visit other pages or clusters of pages in the future. We view each compressed state as a cluster of pages. The calculated conditional probabilities can be used to estimate how interesting other pages and/or clusters of pages are to him/her.

4.1 Link Prediction on M-Order N-Step Markov Chains
Sarukkai [14] proposed to use the "link history" of a user to make link predictions. Suppose a user is currently at page i, and his/her visiting history is a sequence of m pages {i_{−m+1}, i_{−m+2}, ..., i_0}. We use the vector L_0 = {l_j}, where l_j = 1 when j = i and l_j = 0 otherwise, for the current page, and vectors L_k = {l_j} (k = −1, ..., −m+1), where l_j = 1 when j = i_k and l_j = 0 otherwise, for the previous pages. These history vectors are used together with the transition matrix to calculate a vector Rec_1 giving the probability of each page being visited in the next step, as follows:
Rec_1 = a_1 × L_0 × Q + a_2 × L_{−1} × Q^2 + ... + a_m × L_{−m+1} × Q^m        (5)
where a_1, a_2, ..., a_m are the weights assigned to the history vectors. The values of a_1, a_2, ..., a_m indicate the level of influence the history vectors have on the future. Normally, we let 1 > a_1 > a_2 > ... > a_m > 0, so that the closer a history vector is to the present, the more influence it has on the future. This conforms to the observed pattern of a user's navigation in the Web site. Rec_1 = {rec_j} is normalized, and the pages with probabilities above a given threshold are selected as the recommendations.
We propose a new method as an improvement to Sarukkai's method by calculating the probabilities that the user will arrive at a state in the compressed transition matrix within the next n steps. We calculate the weighted sum of the probabilities of arriving at a particular state in the transition matrix within the next n steps, given the user's history, as his/her overall probability of arriving at that state in the future. Compared with Sarukkai's method, our method can predict more steps in the future, and thus provides more insight into the future. We calculate a vector Rec_n representing the probability of each page being visited within the next n steps as follows:
Rec_n = a_{1,1} × L_0 × Q + a_{1,2} × L_0 × Q^2 + ... + a_{1,n} × L_0 × Q^n
      + a_{2,1} × L_{−1} × Q^2 + a_{2,2} × L_{−1} × Q^3 + ... + a_{2,n} × L_{−1} × Q^{n+1}
      + ...
      + a_{m,1} × L_{−m+1} × Q^m + a_{m,2} × L_{−m+1} × Q^{m+1} + ... + a_{m,n} × L_{−m+1} × Q^{m+n−1}        (6)

where a_{1,1}, a_{1,2}, ..., a_{1,n}, ..., a_{m,1}, a_{m,2}, ..., a_{m,n} are the weights assigned to the history vectors L_0, ..., L_{−m+1} for 1, 2, ..., n, ..., m, m+1, ..., m+n−1 steps into the future, respectively. Normally, we let 1 > a_{k,1} > a_{k,2} > ... > a_{k,n} > 0 (k = 1, 2, ..., m), so that for each history vector, the closer its transition is to the next step, the more important its contribution. We also let 1 > a_{1,l} > a_{2,l} > ... > a_{m,l} > 0 (l = 1, 2, ..., n), so that the closer a history vector is to the present, the more influence it has on the future. Rec_n = {rec_j} is normalized, and the pages with probabilities above a given threshold are selected as the recommendations.
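The following sketch (ours, not the ONE implementation) evaluates a prediction of this weighted form, assuming the powers of the compressed matrix have been precomputed and that history, weights and n are supplied by the caller.

import numpy as np

def recommend(Q_powers, history, weights, n):
    """Weighted n-step prediction in the spirit of Eq. (6).

    Q_powers[p] holds Q^p for p = 1 .. m+n-1 (precomputed);
    history is [i_0, i_-1, ..., i_-m+1], most recent page first;
    weights[k][l] plays the role of a_{k+1, l+1}, decreasing in both indices.
    Returns a normalised score vector over pages (illustrative only).
    """
    N = Q_powers[1].shape[0]
    rec = np.zeros(N)
    for k, page in enumerate(history):        # k = 0 is the current page
        L = np.zeros(N)
        L[page] = 1.0
        for l in range(n):
            rec += weights[k][l] * (L @ Q_powers[k + l + 1])
    return rec / rec.sum()

# Hypothetical usage, with m = 3 history pages and n = 5 look-ahead steps:
# Q_powers = {p: np.linalg.matrix_power(Qc, p) for p in range(1, 3 + 5)}
# scores = recommend(Q_powers, history=[3, 7, 2], weights=w, n=5)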
4.2 Maximal Forward Path Based Link Prediction
A maximal forward path [3] is a sequence of maximally connected pages in a user's visit. Only pages on the maximal forward path are considered as a user's history for link prediction. The effect of some backward references, which are mainly made for ease of travel, is filtered out. In Fig. 3, for instance, a user may have visited the Web pages in the sequence 1 → 2 → 5 → 2 → 6. Since the user has visited page 5 after page 2 and then gone back to page 2 in order to go to page 6, the current maximal forward path of the user is 1 → 2 → 6. Page 5 is discarded in the link prediction.
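A minimal sketch (our own) of extracting the current maximal forward path from a click sequence, reproducing the 1 → 2 → 5 → 2 → 6 example:

def maximal_forward_path(click_sequence):
    """Return the current maximal forward path of a visit.

    A backward reference (revisiting a page already on the path) truncates
    the path back to that page, so only forward moves are kept as history.
    """
    path = []
    for page in click_sequence:
        if page in path:
            # backward reference: cut the path back to the revisited page
            path = path[:path.index(page) + 1]
        else:
            path.append(page)
    return path

assert maximal_forward_path([1, 2, 5, 2, 6]) == [1, 2, 6]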
5 Experimental Results
Experiments were performed on a Web log file recorded between the 1st and 14th of October 1999 on the University of Ulster Web site; the file is 371 MB in size and contains 2,193,998 access records. After discarding the irrelevant records, we get 423,739 records. In order to rule out the possibility that some links are only interesting to individual users, we required a minimum of 10 traversals on each hyperlink and that three or more users have traversed the hyperlink. We assume each originating machine corresponds to a different user. This may not always be true when, for example, proxy servers exist, but in the absence of user tracking software the method can still provide rather reliable results. We then construct a link graph consisting of 2175 nodes and 3187 links between the nodes. The construction process takes 26 minutes on a Pentium 3 desktop with a 600 MHz CPU and 128 MB RAM. The maximum number of traversals on a link in the link graph is 101,336, which is on the link from the "Start" node to the homepage of the Web site. The maximum and average numbers of links in a page in the link graph are 75 and 1.47 respectively. The maximum number of in-links of a page in the link graph is 57. The transition matrix is 2175 × 2175 and very sparse. By setting six different thresholds for compression, we get the experimental results given in Table 1:

Table 1. Compression Results on a Transition Matrix from a Web Log File
ε      Compression Time (Minutes)   Size after compression   % of states removed
0.03   107                          1627                     25.2
0.05   110                          1606                     26.2
0.08   118                          1579                     27.4
0.12   122                          1549                     28.8
0.15   124                          1542                     29.1
0.17   126                          1539                     29.2
We can see that as ε increases, the additional compression gained becomes smaller while the compression time grows. For this matrix, we choose ε = 0.15 for a good compression rate without significant error. Experiments in [15] also show that a value of ε = 0.15 yielded good compression with minimum errors. Now we calculate Q_c^2 and use the time spent as the benchmark for Q_c^m. Since we can obtain Q_c^3, ..., Q_c^m by repeatedly multiplying by Q_c, the time spent computing Q_c^2, ..., Q_c^{m−1}, Q_c^m can be estimated as m − 1 times the time for Q_c^2. Table 2 summarises the experimental results of the computation of Q_c^2. We can see that the time needed for compression is compensated by the time saved in the computation of Q_c^m. When calculating Q^2, ..., Q^m, computational time can be further reduced: Q^2, ..., Q^m can be computed off-line and stored for link prediction, so the response time is not an issue given the fast developing computational capability of Web servers.
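A short sketch (ours) of the repeated-multiplication scheme for precomputing and storing the matrix powers off-line:

import numpy as np

def precompute_powers(Qc, max_power):
    """Precompute Qc^1 ... Qc^max_power by repeated multiplication.

    Each additional power costs one matrix product, so the whole table costs
    roughly (max_power - 1) times the cost of a single product; the table can
    be stored off-line and reused for every prediction request.
    """
    powers = {1: Qc}
    for p in range(2, max_power + 1):
        powers[p] = powers[p - 1] @ Qc
    return powers

# e.g. with m = 5 history vectors and n = 5 look-ahead steps:
# powers = precompute_powers(Qc, max_power=9)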
Table 2. Experimental Results for Q^2 and Q_c^2

Matrix Dimension                               2175   1627   1606   1579   1549   1542   1539
Computation Time for Q^2 or Q_c^2 (Minutes)    1483   618    592    561    529    521    518
Percentage of time saved (%)                   N/A    58.3   60.1   62.1   64.3   64.9   65.1
We then use the compressed transition matrix for link prediction. Link prediction is integrated with a prototype called ONE (Online Navigation Explorer) to assist users' navigation in our university Web site. ONE provides the user with informative and focused recommendations and the flexibility of being able to move around within the history and recommended pages. The average time needed for updating the recommendations is under 30 seconds, so it is suitable for online navigation, given that the response can be speeded up with the computational capability of many current commercial Web sites. We selected m = 5 and n = 5 in link prediction to take into account five history vectors in the past and five steps in the future. We computed Q^2, ..., Q^9 for link prediction. The initial feedback from our group members is very positive. They have spent less time finding the information they were interested in when using ONE than when not using ONE in our university Web site, and they have more often successfully found the information useful to them using ONE than not using ONE. So users' navigation has been effectively speeded up using ONE. ONE presents a list of Web pages as the user's visiting history along with the recommended pages, updated while the user traverses the Web site. Each time a user requests a new page, the probabilities of visiting any other Web pages or page clusters within the next n steps are calculated. Then the Web pages and clusters with the highest probabilities are highlighted in the ONE window. The user can browse the clusters and pages as in the Windows Explorer. Icons are used to represent different states of pages and clusters. Like the Windows Explorer, ONE allows the user to activate pages and expand clusters. Each page is given its title to describe its contents.
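The exact selection policy used by ONE is not spelled out in detail; a plausible sketch (ours) of picking the pages and clusters to highlight from the normalised score vector is:

def update_recommendations(scores, page_titles, threshold=0.05, top_k=5):
    """Pick the pages/clusters to highlight after each page request.

    scores is a normalised probability vector over states; only entries above
    a probability threshold are shown, at most top_k of them (sketch only).
    """
    ranked = sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
    picks = [(idx, p) for idx, p in ranked[:top_k] if p >= threshold]
    return [(page_titles.get(idx, "state %d" % idx), p) for idx, p in picks]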
6 Related Work
Ramesh Sarukkai [14] has discussed the application of Markov chains to link prediction. A user's navigation is regarded as a Markov chain for link analysis, and the transition probabilities are calculated from the accumulated access records of past users. Compared with his method, we make three major contributions. We have compressed the transition matrix to an optimal size to save the computation time of Q^{m+n−1}, which can save a lot of time and resources given the large number of Web pages on a modern Web site. We have improved the link prediction calculation by taking into account more steps in the future to provide more insight into the future. We have proposed to use the Maximal Forward Path method to improve the accuracy of link prediction results by eliminating the effect of backward references by users.
The "Adaptive Web Sites" approach has been proposed by Perkowitz and Etzioni [11]. Adaptive Web sites are Web sites which can automatically change their presentation and organization to assist users' navigation by learning from Web usage data. Perkowitz and Etzioni proposed the PageGather algorithm to generate index pages, composed of the Web pages most often associated with each other in users' visits, from Web usage data to evaluate a Web site's organization and assist users' navigation [12]. Our work is in the context of adaptive Web sites. Compared with their work, our approach has two advantages. (1) The index page is based on co-occurrence of pages in users' past visits and does not take into account a user's visiting history; the index page is a static recommendation. Our method takes the user's history into account to make link predictions, and the link prediction is dynamic, reflecting the changing interests of the users. (2) In PageGather, it is assumed that each originating machine corresponds to a single user. This assumption can be undermined by proxy servers and dynamic IP allocation, which are both common on the WWW. Our method treats a user group as a whole without identifying individual users and is thus more robust to these influences. However, computation is needed in link prediction, and the recommendations cannot respond as quickly as the index page, which can be directly retrieved from a Web server.
Spears [15] proposed a transition matrix compression algorithm based on the transition behaviors of the states in the matrix. Transition matrices calculated from systems which are being modeled in too much detail can be compressed into smaller state spaces while the transition behaviors of the states are preserved. The algorithm has been used to measure the transition similarities between pages in our work and to compress the probability transition matrix to an optimal size for efficient link prediction.
Pirolli and Pitkow [13] studied Web surfers' traversal paths through the WWW and proposed to use a Markov model for predicting users' link selections based on past users' surfing paths. Albrecht et al. [1] proposed to build three types of Markov models from Web log files for pre-sending documents. Myra Spiliopoulou [16] discussed using navigation pattern and sequence analysis mined from Web log files to personalize a Web site. Mobasher, Cooley, and Srivastava [4, 9] discussed the process of mining Web log files using three kinds of clustering algorithms for site adaptation. Brusilovsky [2] gave a comprehensive review of the state of the art in adaptive hypermedia research. Adaptive hypermedia includes adaptive presentation and adaptive navigation support [2]. Adaptive Web sites can be seen as a kind of adaptive presentation of Web sites to assist users' navigation.
7 Conclusions
Markov chains have been proven very suitable for modeling Web users' navigation on the WWW. This paper presents a method for constructing link graphs from Web log files. A transition matrix compression algorithm is used to cluster pages with similar transition behaviors together for efficient link prediction. The initial experiments show that the link prediction results, presented in the prototype ONE, can help users find information more efficiently and accurately than simply following hyperlinks in the University of Ulster Web site. Our current work has opened several fruitful directions: (1) The maximal forward path has been utilized to approximately infer a user's purpose in his/her navigation path, which might not be accurate; the link prediction can be further improved by identifying the user's goal in each visit [6]. (2) Link prediction in ONE needs to be evaluated by a larger user group. We plan to select a group of users, including students and staff in our university and people from outside it, to use ONE; their interaction with ONE will be logged for analysis. (3) We plan to use Web log files from a commercial Web site to build a Markov model for link prediction and evaluate the results on different user groups.
References
1. Albrecht, D., Zukerman, I., Nicholson, A.: Pre-sending Documents on the WWW: A Comparative Study. IJCAI99 (1999)
2. Brusilovsky, P.: Adaptive hypermedia. User Modeling and User Adapted Interaction 11 (1/2) (2001) 87-110
3. Chen, M. S., Park, J. S., Yu, P. S.: Data mining for path traversal in a web environment. In: Proc. of the 16th Intl. Conference on Distributed Computing Systems, Hong Kong (1996)
4. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, Vol. 1, No. 1 (1999)
5. Hallam-Baker, P. M., Behlendorf, B.: Extended Log File Format. W3C Working Draft WD-logfile-960323. http://www.w3.org/TR/WD-logfile (1996)
6. Hong, J.: Graph Construction and Analysis as a Paradigm for Plan Recognition. Proc. of AAAI-2000: Seventeenth National Conference on Artificial Intelligence (2000) 774-779
7. Kalpazidou, S.L.: Cycle Representations of Markov Processes. Springer-Verlag, NY (1995)
8. Kijima, M.: Markov Processes for Stochastic Modeling. Chapman & Hall, London (1997)
9. Mobasher, B., Cooley, R., Srivastava, J.: Automatic Personalization Through Web Usage Mining. TR99-010, Dept. of Computer Science, DePaul University (1999)
10. Nielsen, J.: Designing Web Usability. New Riders Publishing, USA (2000)
11. Perkowitz, M., Etzioni, O.: Adaptive web sites: an AI challenge. IJCAI97 (1997)
12. Perkowitz, M., Etzioni, O.: Towards adaptive Web sites: conceptual framework and case study. WWW8 (1999)
13. Pirolli, P., Pitkow, J. E.: Distributions of Surfers' Paths Through the World Wide Web: Empirical Characterization. World Wide Web 1: 1-17 (1999)
14. Sarukkai, R.R.: Link prediction and path analysis using Markov chains. WWW9 (2000)
15. Spears, W. M.: A compression algorithm for probability transition matrices. SIAM Matrix Analysis and Applications, Volume 20, #1 (1998) 60-77
16. Spiliopoulou, M.: Web usage mining for site evaluation: Making a site better fit its users. Comm. ACM Personalization Technologies with Data Mining, 43(8) (2000) 127-134
Classification of Customer Call Data in the Presence of Concept Drift and Noise

Michaela Black and Ray Hickey
School of Information and Software Engineering, Faculty of Informatics
University of Ulster, Coleraine, BT51 1SA, Northern Ireland
{mm.black, rj.hickey}@ulst.ac.uk
Abstract. Many of today’s real world domains require online classification tasks in very demanding situations. This work presents the results of applying the CD3 algorithm to telecommunications call data. CD3 enables the detection of concept drift in the presence of noise within real time data. The application detects the drift using a TSAR methodology and applies a purging mechanism as a corrective action. The main focus of this work is to identify from customer files and call records if the profile of customers registering for a ‘friends and family’ service is changing over a period of time. We will begin with a review of the CD3 application and the presentation of the data. This will conclude with experimental results.
1 Introduction
On-line learning systems which receive batches of examples on a continual basis and are required to induce and maintain a model for classification have to deal with two substantial problems: noise and concept drift. The effects of noise have been studied extensively and have led to noise-proofed decision tree learners such as C4.5 [13], C5 [14] and rule induction algorithms, e.g. CN2 [17]. Although considered by Schlimmer and Granger [6] and again recently by others including Widmer and Kubat [10], Widmer [9] and, from a computational learning theory perspective, by Hembold and Long [7], concept drift [1], [8], [10] has received considerably less attention. There has also been work on time dependency for association rule mining; see, for example, [11] and [12]. By concept drift we mean, essentially, that concepts or classes are subject to change over time [5], [7], [8], [9], [10]. Such change may affect one or more classes and, within a class, may affect one or more of the rules which constitute the definition of that class. Examples of this phenomenon are common in real world applications. In marketing, the definition of the concept 'likely to buy this product in the next three months' could well change several times during the lifetime of the product or even during a single advertising campaign. The same applies in fraud detection for, say, credit card or mobile phone usage: here change may be prompted by advances in technology that make new forms of fraud possible or may be the result of fraudsters altering their behaviour to avoid detection. The consequences of ignoring concept drift when mining for classification models can be catastrophic [1].
As noted by Widmer [10], distinguishing between noise and drift is a difficult task for a learner. When drift occurs, incoming examples can appear to be merely imperfect. The dilemma for the algorithm is then to decide between the two: has drift occurred and, if so, how are the learned class definitions to be updated? In the system FLORA4, concept hypotheses description sets are maintained using a statistical evaluation procedure applied when a new example is obtained. Examples deemed to be no longer valid are 'forgotten', i.e. dropped from the window of examples used for forming generalisations. A WAH (window adjustment heuristic) is employed for this purpose. The approach assumes that it is the newer examples which are relevant and thus the older examples are dropped. Although window size is dynamic throughout learning, there is a philosophy of preventing the size from becoming overly large: if, after receiving a new example, the concepts seem stable then an old example is dropped.
Drift may happen suddenly, referred to as revolutionary [1], [2], or may happen gradually over an extended time period, referred to as evolutionary; see [1] and [5]. In the former case we refer to the time at which drift occurred as the drift point. We can regard evolutionary drift as involving a series of separate drift points with very small drift occurring at each one.
In [1], [2] we proposed a new architecture for a learning system, CD3, to aid detection and correction of concept drift. Like the METAL systems presented by Widmer [9], CD3 utilizes existing learners, in our case tree or rule induction algorithms - the basic induction algorithm is really just a parameter in the system. We do not, however, seek to determine contextual clues. We provided an alternative strategy to 'windowing', called 'purging', as a mechanism for removing examples that are no longer valid. The technique aims to keep the knowledge base more up-to-date by not offering preference to the newer examples and thus retaining older valid information that has not drifted.
2 TSAR and the CD3 Algorithm
We know data will arrive in batches, where batch size may vary from one example to many. The CD3 algorithm presented in [1], [2] must accept batches of data of various sizes, providing a flexible update regime. Online incoming data may be batched according to time of arrival or in set sizes of examples. Experimental scripts produced in Prolog allow the user to specify how the data is to be processed by CD3. Batch variations may also exist within one induction process.
The central idea is that a time-stamp is associated with examples and treated as an attribute, ts, during the induction process. In effect the learning algorithm is assessing the relevance of the time-stamp attribute. This is referred to as the time-stamp attribute relevance (TSAR) principle, the implication being that if ts turns out to be relevant then drift has occurred. Specifically, the system maintains a set of examples (the example base) deemed to be valid and time-stamps these as ts=current. When a new batch arrives, the examples in it are stamped ts=new. Induction then takes place using a noise-proofed tree or rule-building algorithm—referred to as the base algorithm. The pseudocode for CD3 is presented in Figure 1.
Following the induction step, CD3 will provide us with a pruned induced tree structure that may or may not have the ts attribute present. This induced tree could be
used for classification of unseen examples by merely having their description augmented with the ts attribute and setting its value to 'new'. It will be assumed that all unseen examples presented following an induction step are generated from the current model in force when the most recent batch arrived. This assumption is held until another new batch of training examples arrives invoking an update of the tree.

CD3(Header_file, Batch_parameters, Data_file, Output_file, Purger_parameter, No_of_trials)
repeat for No_of_trials
  load file specification(Header_file);
  load training data(Data_file, Trial);
  begin
    extract first batch and mark as 'current'(Batch_parameters);
    while more batches
      extract next batch and mark as 'new';
      call ID3_Induce_tree;
      prune induced tree;
      extract rules;
      separate drifted rules;
      purge invalid training examples(Purger_parameter);
      append 'new' examples to 'current';
      test rules;
      output results(Output_file);
    end while
end repeat
end CD3
Fig. 1. Pseudocode for CD3 Induction Algorithm
However, CD3 offers another method for classification which does not require unseen examples having their description altered. From the pruned induced tree CD3 will extract rules, the rules representing all the concept paths from the root to the leaves. In the case of ts being present in some or all of the paths, CD3 will look upon those having a ts value ‘current’ as being out-of-date and now an invalid rule. Paths with the ts value ‘new’ will be viewed as up-to-date and thus valid rules. Finally, paths in which ts is not instantiated correspond to rules which were valid prior to the new batch and are still so, i.e. unchanged valid rules. These rules can now be separated i.e. valid from invalid, and the ts attribute dropped from the rule conditions. The TSAR methodology uses the ts attribute to differentiate between those parts of the data which have been affected by drift from those that have not. Thus by CD3 applying the TSAR methodology it is enabled to detect drift. Once the invalid and valid rules have been highlighted and separated the ts attribute has fulfilled its purpose and can now be removed. Classification requests can be ongoing, but in an aim to provide the most up to date classifier, CD3 must always be ready to accept these updates. This requires maintaining a correct and up-to-date database. Following the recent induction step and rule conversion, CD3’s database is currently out of date with its knowledge structure. Before CD3 can accept these new updates of training examples it must remove existing examples in the knowledge base that are covered by the most recently identified invalid rules. As presented in [1], [2] CD3 provides a removal technique i.e. ‘purging’ which can be applied to the knowledge base following the concept drift detection phase to extract the now ‘out of date’ examples thus
maintaining an up-to-date version of the database. In purging examples that are no longer relevant, the TSAR approach does not take account of age as windowing mechanisms tend to do. Rather the view is that an example should be removed if it is believed that the underlying rule, of which the example is an instance, has drifted, i.e. matches that of an invalid rule. The invalid rules can be discarded or stored for trend analysis. The valid rules are used for classification where an unseen example description can be matched against one of the rules condition to obtain its classification. However, CD3 aims to be an incremental online learning system which can therefore be updated with new information i.e. new training examples when available. A learning system as described above is said to update learning in the presence of noise and possible concept drift using the TSAR principle. The central feature here is the involvement of the ts attribute in the base learning process. Any tree or rule induction algorithm which effectively handles noise (and therefore, in theory, should prune away the ts attribute - as well as other noisy attributes - if there is no drift) can be used at the heart of a TSAR learning system. In particular, if ID3 with post-pruning is used we call the resulting system CD3 (CD = concept drift). We have implemented a version of CD3 which uses the well-known Niblett-Bratko post-pruning algorithm [18]. A TSAR system implements incremental learning as batch re-learning, i.e. the knowledge base is induced again from scratch every time a new batch arrives with the new batch being added to the existing example base. (In contrast to FLORA4 which incrementally learns on receipt of each new example and without re-learning from the existing examples). Bearing in mind that, depending on the extent and frequency (over time) of drift, many examples could be purged from the example base this is not as inefficient as it might appear. It may be possible to produce a more genuinely incremental and hence more computationally efficient implementation by exploiting the techniques used by [16] in the ITI algorithm. The TSAR approach does not require the user to manually set parameter values other than those that may be required by the base algorithm. Noise and the presence of pure noise attributes, i.e. those that are never useful for classification, will interfere with the ability of CD3 to decide whether drift has occurred. The ts attribute may be retained in a pruned tree even if there has been no drift—a ‘false positive’—and, as a consequence, some examples will be wrongfully purged. Conversely ts may be pruned away when drift has taken place with the result that invalid examples will remain in the example base and contaminate learning. CD3 allows classification tasks to continue to work and survive in very demanding environments such as fraudulent detection in the telecommunications industry, moving marketing targets and evolving customer profiling. The methodology is simple in principle and allows ease of implementation into a wide spectrum of problem areas and classification methods. Purging helps CD3 to work in these demanding environments. The data that CD3 works with will be continually updated with the changing environment. The speed at which it can work is also improved by removing unnecessary data for the task at hand. Using the purging mechanism with the TSAR methodology makes it very easy to implement. 
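To make the TSAR idea and the purging step concrete, the following sketch uses scikit-learn's pruned decision tree as a stand-in for the ID3/Niblett-Bratko learner used by CD3; it is our own illustration under that substitution, not the authors' code, and cost-complexity pruning here plays the role of the noise-proofing described above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tsar_induce_and_purge(X_current, y_current, X_new, y_new, ccp_alpha=0.01):
    """Sketch of the TSAR principle with a pruned tree learner.

    The retained example base is stamped ts=0 ('current'), the new batch
    ts=1 ('new'), and ts is offered to the learner as an ordinary attribute.
    Drift is signalled if the pruned tree uses ts at all.  An old example is
    treated as a purge candidate if its decision path passes through a node
    that splits on ts, i.e. it is covered by a time-dependent (invalid) rule.
    """
    X = np.vstack([X_current, X_new])
    y = np.concatenate([y_current, y_new])
    ts = np.concatenate([np.zeros(len(X_current)), np.ones(len(X_new))])
    X_ts = np.column_stack([X, ts])
    ts_col = X_ts.shape[1] - 1

    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha).fit(X_ts, y)
    drift = tree.feature_importances_[ts_col] > 0

    # decision paths of the 'current' examples (their ts column is 0)
    cur = np.column_stack([X_current, np.zeros(len(X_current))])
    paths = tree.decision_path(cur)                       # node-indicator matrix
    splits_on_ts = (tree.tree_.feature == ts_col).astype(int)
    purge_mask = (paths.toarray() @ splits_on_ts) > 0     # path visits a ts test
    return drift, purge_mask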
We basically rely on the ts attribute to provide two sets of rules: valid and invalid. The purging rules may be simply coded to check all examples accordingly. Having the purging mechanism separate from the drift detection enhances its
reusability aspect. If coded as a separate executable object, this would allow it to be reused, not only across a number of different classification tasks, but also across separate classification algorithms. 2.1 Refinement of the TSAR Methodology We extended the CD3 algorithm presented in [1] to allow refinement of the ts attribute being used as discussed in [2]. CD3 uses a very simple form of time stamping relying on just two values - ‘current’ and ‘new’. This is sufficient to allow it to separate valid and invalid rules and to maintain good classification rates using the former. One possible disadvantage, however, is that as mining proceeds, the examples purged in previous rounds are lost to the system. This is the case even though such purges may be false, i.e. may have occurred as a result of errors in the induction process. It was shown in [1] that false purging even under realistic noise levels could be kept to a minimum. Nevertheless, it is worth considering how the mining process could be allowed to review and maybe revoke earlier decisions. In [2] we used two versions of time stamp refinement. The first refinement of the time stamps as ‘Batch Identifiers’ resulted in the CD4 algorithm. Within this algorithm each batch is assigned and retains indefinitely its own unique batch identifier. At each round of mining, all the data from the previous batches is used together with the new batch. There is no purging process as CD4 is implemented using a new base learner C5 [14] without the purging mechanism. Instead the base learning algorithm is able to distinguish valid and invalid rules by appropriate instantiation of the time stamp attribute possibly revising decisions made in a previous induction. As presented in [2] drift will be located as being after a certain batch and will be presented within the knowledge structure as a binary split at the ts value of the batch identifier. This procedure will result in a set of invalid and valid rules. The valid rules can then be used for online classification. The second refinement procedure presents CD5, which removes the effects of batches on mining by using continuous time stamping in which each training example is has its own unique time stamp. This would be a numeric attribute. The base learning algorithm is now free to form binary splits as for CD4 but without regard to batch. Thus it can place a split point within a batch (either the new or any of the previous batches) and review these decisions at each new round of mining. Again the procedures for extraction of valid and invalid rules and maintenance of a database of currently valid examples are as described above. As with CD4, purging is not an integral part of the incremental mining process. As demonstrated in [2] both the extension to individual batch time stamps and further to individual example time stamps in algorithms CD4 and CD5 respectively appear to produce results comparable to, and possibly, slightly superior to those obtained from CD3. It was also highlighted in [2] that the simple strategy of dealing with drift by ignoring all previously received data and just using the new batch of data is effective only immediately after drift. Elsewhere it prevents growth in ACR through accumulation of data. It also, of course, denies us the opportunity to detect drift should it occur. With the benefit from the enhanced time stamps turning out to be marginal then the choice of which algorithm to deploy may depend on the characteristics of the domain
used and the nature of the on-line performance task. Without the purging process CD4 and CD5 will produce larger trees than CD3 and may take slightly longer to learn. Both CD4 and CD5 offer a benefit over CD3 since they induce trees that record the total history of changes in the underlying rules and therefore provide a basis for further analysis. However the purging mechanism of CD3 offers the application of trend analysis off-line.
3 Experimental Trial with British Telecom Call Data Until now we have developed and experimented with artificial data [1], [2]. This has allowed for strategic generation and control of parameters within the data, aiding the development of a simple yet so far effective application for detecting concept drift in the presence of noise. We are now able to extend this to an experimental trial of real world call data acquired from British telecom (BT). The data is in form of five batches based over a time period of twenty seven months, batched in accordance to when BT proposed that drift was most likely to occur: March and October. The data reflects all landline calls for 1000 customers over this period combined with customer information. We aim to prepare and process the data for CD3 with the hope that it will highlight some concept drift over the five batches. 3.1 The Content of the Data The data was initially presented for the project consisting of two files: customer file and call file as shown in Table 1, Table 2 and Table 3. These were linked via an encoded customer id. One of the main interests of BT was to train on call data and try to identify if the profile of customers who register for ‘friends and family’ (F&F) or ‘premier line’ (P_L) service has changed over time. The ‘friends and family’ service is a discount service option offered to BT customers. This experiment will involve the induction of a rule set for classification of the F&F indicator. Total usage i.e. ‘revenue’ was also of great interest and how this related to the F&F users. This was calculated using two available fields: number of calls ∗ average cost of calls, which could then be split into discrete band values. The F&F, P_L and option_15 indicator (O15) had five separate indicators, one for each of the five time periods, as shown in Table 1. These could then be translated into one indicator for each field, within individual batches, highlighting if the customer was now using the service at that time period. This also allowed for customers to register for a service and, within one of the successive time points, deregister for the service. These indicators are referred to as ffind, plind and O15ind. The binary fields of particular interest were: the friends and family indicator (F&F); O15 indicator (O15); Premier Line Indicator (P_L); single line indicator (SLIND). Other discrete valued fields of interest were: revenue, life stage indicator (LSIND) and the acorn code. (More details are available in Table 1). For the fields in Table 2 marked with an * there exist 13 sets of summarised data under the following sub headings shown in Table 3. These summarised subgroups occur within the call file in the order shown resulting in 79 columns in total. The first
is unique: encoded telephone number and then there are 13 batches of six re-occurring attributes.

Table 1. Customer File

Field              Description                                   Type
encode (telno)     encrypted telephone number                    char
distcode           district code (27 unique)                     char
startdat           customer started using no.                    dd-mon-ccyy time
acorn              residential codes from postcodes              integer
ffind              friends and family indicator                  Y/N
ffdate             first got service                             dd-mon-ccyy time
F&F in Oct 1995    had F&F service at this time                  Y/N
F&F in Mar 1996    "                                             Y/N
F&F in Oct 1996    "                                             Y/N
F&F in Mar 1997    "                                             Y/N
F&F in Oct 1997    "                                             Y/N
F&F in Mar 1998    "                                             Y/N
Plind              premier line indicator                        Y/N
Pldate             first got service                             dd-mon-ccyy time
P&L in Oct 1995    had P&L service at this time                  Y/N
P&L in Mar 1996    "                                             Y/N
P&L in Oct 1996    "                                             Y/N
P&L in Mar 1997    "                                             Y/N
P&L in Oct 1997    "                                             Y/N
P&L in Mar 1998    "                                             Y/N
O15ind             option15 ind. (fixed call amount)             Y/N
O15date            first got service                             dd-mon-ccyy time
O15 in Oct 1995    had O15 service at this time                  Y/N
O15 in Mar 1996    "                                             Y/N
O15 in Oct 1996    "                                             Y/N
O15 in Mar 1997    "                                             Y/N
O15 in Oct 1997    "                                             Y/N
O15 in Mar 1998    "                                             Y/N
Xdir               Xdirectory                                    Y/N
Mps                Mailing preference scheme                     Y/N
Tps                Telephone preference scheme                   Y/N
Dontmail           Don't mail marketing data                     Y/N
Lusind             Low user scheme - Code form                   X/H
Ccind              Charge card indicator                         Y/N
Hwind              Hard wired indicator                          Y/N
Lsind              Life stage indicator (from postcode, 1..10)   integer
Slind              More than one line                            Y/N
Postcode           Post code                                     POSTCODE
Table 2. Call File Details

Field                                    Type
Encode (telno)                           char
*no. of calls                            int
*average duration of calls               real
*variance of duration of calls           real
*average cost of calls                   real
*variance of cost of calls               real
*no. of distinct destinations phoned     int
Table 3. Call File Summarised Sub-Groups

All calls, Day-time calls, Directory Enquiry calls, International calls, ISP calls, Local calls, Long calls, Low-call calls, Mobile calls, National calls, Premium Rate calls, Short calls, Week-end calls
3.2 Pre-processing of the Data
Having been accustomed to artificial data, this real data brought with it a complete new set of challenges. The number of fields seemed overwhelming, not to mention the number of discrete values available with fields like the acorn code. The initial files were read into Microsoft Access, where they could be grouped via the customer id relationship. This also allowed the revenue field to be calculated and inserted. The aim of the field selection process was to greatly reduce the number of fields. Only those fields of particular interest for the problem stated above were selected. These are shown in Table 4. Once the Access files were complete they were then available to Clementine [15] for the second stage of the processing. Clementine allowed analysis of the data for fields like revenue, acorn and LSIND, which were required to be split into an acceptable number of bands. It became clear from the early analysis that there was a clear shift within these customers from non F&F users to F&F users over the twenty-seven month period. Our initial fears came from the first batch, where the proportion of the F&F users was quite small. We had concerns regarding classes with small coverage. Another concern was that this might be a population shift problem [4] and not one of concept drift. However, the results of the experiments would prove or disprove this. Initially the acorn field had 55 values. These values are derived from the postcode and categorise communities with respect to their location and the sub-groups of the population within that community. These can then be grouped into seven higher classification groups. These higher seven bands were used for our experiments. Similarly the life stage indicator had ten values spanning from 1, representing young people, to 10, representing retired couples. These were regrouped to five values, combining 1 and 2 to give 'ls_a', 3 and 4 to give 'ls_b' and so on. The resulting fields and their final values are shown below in Table 4. The revenue field values represent the six bands that were selected, and the number at the end of each value represents the maximum value of revenue for the category. For example 'a_12' represents all customers with revenue less than £12,000. The
value 'b_28' represents all customers with revenue values from £12,000 but less than £28,000, and so on.

Table 4. Selected and Processed Fields for Experimentation

Fields     Values
Acorn      acorn_a, acorn_b, acorn_c, acorn_d, acorn_e, acorn_f
F&F        y, n
P_L        y, n
O15        y, n
SLIND      y, n
LSIND      ls_a, ls_b, ls_c, ls_d, ls_e
Revenue    a_12, b_28, c_40, d_52, e_70, f_high
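The banding described above can be sketched as follows; this is our own illustration, the column names are assumed, and the 55-to-7 acorn regrouping is omitted because the mapping is not listed in the paper.

import pandas as pd

def prepare_fields(customers: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the field preparation: revenue banding and LSIND regrouping."""
    out = pd.DataFrame()
    # revenue = number of calls x average cost of calls, cut into the six
    # bands of Table 4 (band edges follow the figures quoted in the text)
    revenue = customers["no_of_calls"] * customers["avg_cost_of_calls"]
    out["revenue"] = pd.cut(
        revenue,
        bins=[0, 12_000, 28_000, 40_000, 52_000, 70_000, float("inf")],
        labels=["a_12", "b_28", "c_40", "d_52", "e_70", "f_high"],
        include_lowest=True,
    )
    # life stage indicator 1..10 folded pairwise into ls_a .. ls_e
    out["lsind"] = customers["lsind"].map(
        lambda v: "ls_" + "abcde"[(int(v) - 1) // 2])
    # binary indicators and the (already regrouped) acorn code are copied over
    for col in ["acorn", "p_l", "o15", "slind", "ffind"]:
        out[col] = customers[col]
    return out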
Clementine also enabled the quality of the data to be analysed. We could monitor for missing values and erroneous values. With the quantity of the data available, and because accuracy was an issue, we initially removed records with such properties. The remaining records within each batch were then split into training and test batches as shown in Table 5. The flexibility in CD3's update regime allows for the batches to be of various sizes.

Table 5. Training and Test Batches for Experimental Run

Month           Total Examples   Training Examples   Test Examples
October 1995    840              840                 -
March 1996      837              558                 279
October 1996    848              566                 282
March 1997      823              549                 274
October 1997    793              529                 264
3.3 The Experimental Trial
For all our previous experiments we had a number of trials of data available; with real data, however, there can only ever be one. We had five batches of data available spanning a period of twenty-seven months. As before, CD3 uses the first batch as its current batch and then appends additional batches, checking for drift. A header file must be created to represent the data specification, including the class values for the algorithm, as seen in Figure 2. The 'mark2' purger was used for this trial due to its prior success in [1]. The first batch, Oct_95, was applied by CD3 as its current batch. The consecutive batches were applied one by one in order of date. The algorithm recorded the ACR of CD3 based on the test set provided, the percentage purged on each iteration, and the highest position of the ts attribute within the tree. Our initial hopes were that by analysing the position of the ts attribute and the percentage purged we would be able to clearly identify drift between consecutive batches.
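For orientation, a runnable skeleton of such a batch-by-batch trial loop might look like the sketch below; scikit-learn again stands in for the CD3 base learner (an assumption of ours), and the purging step is deliberately omitted.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_trial(batches, ccp_alpha=0.01):
    """Simplified trial loop: detect drift and record test accuracy per batch.

    `batches` is a list of (X_train, y_train, X_test, y_test) tuples of
    numeric arrays in time order; the first batch seeds the 'current' base.
    """
    X_cur, y_cur = batches[0][0], batches[0][1]
    log = []
    for X_tr, y_tr, X_te, y_te in batches[1:]:
        X = np.vstack([X_cur, X_tr])
        y = np.concatenate([y_cur, y_tr])
        ts = np.concatenate([np.zeros(len(X_cur)), np.ones(len(X_tr))])
        tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha).fit(
            np.column_stack([X, ts]), y)
        drift = tree.feature_importances_[-1] > 0          # is ts relevant?
        # unseen examples are scored with their ts set to 'new', as in Sect. 2
        acc = accuracy_score(
            y_te, tree.predict(np.column_stack([X_te, np.ones(len(X_te))])))
        log.append({"drift": bool(drift), "test_accuracy": acc})
        X_cur, y_cur = X, y        # no purging in this simplified sketch
    return log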
univ(btUniv).
attributes([acorn, p_l, o15, slind, lsind, revenue]).
att_values(acorn, [acorn_a, acorn_b, acorn_c, acorn_d, acorn_e, acorn_f]).
att_values(p_l, [y, n]).
att_values(o15, [y, n]).
att_values(slind, [y, n]).
att_values(lsind, [ls_a, ls_b, ls_c, ls_d, ls_e]).
att_values(revenue, [a12, b28, c40, d52, e70, fhigh]).
classes([y, n]).
Fig. 2. BT Call Data Specification
3.4 The Final Test
The classification performance of CD3 starts well, at a peak of 85% on application of the second batch, Mar_96, as shown in Figure 3. It would seem that there is little change between Oct_95 and Mar_96; the position of the ts attribute should confirm this. Application of the next two batches, Oct_96 and Mar_97, shows a slight decline in performance. Could this be an indication of drift? Detailed study of the ts attribute will confirm this.
[Line chart: ACR% (vertical axis, 40-90) against Time Point (Mar-96, Oct-96, May-97).]
Fig. 3. ACR% for CD3 with BT Data
Figure 4 confirms our initial suspicions. The ts attribute only reaches level 2 on the application of the second batch: Mar_96 (0 is the root position: top, 1 is second top and so on). This could be false purging occurring or it could be the beginning of concept drift. CD3 still achieves a good ACR. Detailed analysis of the tree in Figure 5 shows at this early stage the revenue attribute is the most informative, and in some cases the only attribute required in determining ‘F&F’ users. The acorn attribute then becomes informative. The drift occurs within acorn value acorn_d highlighting a change in the relevance of the lsind as highlighted in Figure 5.
Again with reference to Figure 4, the position of the ts attribute after applying the next two batches, Oct_96 and May_97, confirms our suspicions about drift. The ts attribute climbs to the top of the tree, highlighting that all of the data has drifted. This drift continues into the fourth batch, May_97.
[Line chart: highest ts position (vertical axis, 0-2.5) against Time Point (Mar-96, Oct-96, May-97).]
Fig. 4. Highest ts Position for CD3 with BT Data
revenue
  a12--> n - [40,735] / 775
  b28--> n - [37,211] / 248
  c40--> n - [5,43] / 48
  d52
    acorn
      acorn_a--> n - [1,2] / 3
      acorn_b--> def - n
      acorn_c--> y - [1,1] / 2
      acorn_d
        ts
          curr
            lsind
              ls_a--> def - n
              ls_b--> y - [1,0] / 1
              ls_c--> def - n
              ls_d--> n - [0,3] / 3
              ls_e--> def - n
          new--> y - [2,0] / 2

Fig. 5. A Section of the Pruned Tree Output from CD3 after Applying Second Batch
The percentage of examples being purged as shown in Figure 6 is measured as a percentage of the current number of examples. This shows that for the second batch, i.e. Mar_96, the drift has not really started, resulting in a very low percentage purged of 0.54%. Again this could be false purging or the beginning of the change. However, things begin to change with the next two batches. After applying the third batch we see an increase in the percentage being purged to 13%. This increase accelerates to the highest rate of the experiment between the third and fourth batches: Oct_96 and May_97 from 13% to 34%.
Following this, as with the other findings above, the drift seems to be beginning to decline after applying the final batch. Although the percentage being purged still increases, it is doing so at a slower rate. We also see that the ts attribute moves down to position one in Figure 4 in accordance with this.
[Line chart: % Purged (vertical axis, 0-50) against Time Point (Mar-96, Oct-96, May-97).]
Fig. 6. % Purged for CD3 with BT Data
With close analysis of the tree in Figure 7 it becomes clear that at this final stage the revenue is once again the most informative attribute and that drift is only applicable to the lowest band of revenue, 'a12'. Customers within this revenue band and acorn categories of either acorn_d or acorn_f previously all had an F&F indicator of 'n': non-'F&F' users. (Figure 7 has had some of the binary splits of the acorn attribute under the revenue value 'a12' removed to aid readability.) However, after applying the final batch, Oct_97, these two bands undergo major change, with almost all changing to 'F&F' users. The reduction in the percentage purged, and presumably in the drift, is also highlighted by the increase in the ACR in Figure 3 to 66%.
4 Conclusion
When working with real data it is difficult to determine if drift exists within the data, and if so where it occurs. The experiments confirmed the company's suspicions: within the twenty-seven month period the profile of customers using the service has changed. The TSAR methodology allowed CD3 to locate the drift and highlight the changing properties within the customer profile. If we look closely at the percentage being purged in Figure 6, we can see that towards the end of the trial CD3 is purging almost 50% of the data. At this stage CD3 has retained 2085 examples out of a total of 2762. It would be interesting to follow this trial with another few batches after October 1997 to determine whether the drift reduces and, if so, where. Over a longer period it may reduce and then reappear. By using the TSAR approach the user can analyse the drift at each stage. It is very interesting to study the differences in the knowledge structure between what was current and what is now new. The tree structure naturally offers a very clear and readable interpretation of the drift.
Revenue
  a12
    ts
      curr
        acorn
          acorn_a
            lsind
              ls_a--> def - n
              ls_b--> y - [3,2] / 5
              ls_c--> y - [9,3] / 12
              ls_d
                p_l
                  y--> y - [1,0] / 1
                  n--> n - [23,117] / 140
              ls_e
                o15
                  y--> y - [1,0] / 1
                  n
                    p_l
                      y--> y - [1,0] / 1
                      n--> n - [17,69] / 86
          acorn_d--> n - [77,322] / 399
          acorn_f--> n - [39,205] / 244
      new
        p_l
          y--> y - [27,1] / 28
          n
            o15
              y--> y - [8,1] / 9
              n
                acorn
                  acorn_a
                    lsind
                      ls_a--> def - n
                      ls_b--> y - [2,2] / 4
                      ls_c--> y - [5,3] / 8
                  acorn_d
                    lsind
                      ls_a--> n - [2,4] / 6
                      ls_b--> y - [2,1] / 3
                      ls_c--> n - [6,7] / 13
                      ls_d--> y - [39,31] / 70
                      ls_e--> y - [6,5] / 11
                  acorn_f
                    lsind
                      ls_a--> n - [0,2] / 2
                      ls_b--> y - [8,7] / 15
                      ls_c--> n - [2,3] / 5
                      ls_d--> y - [8,3] / 11
                      ls_e--> y - [11,10] / 21

Fig. 7. The Section of the Pruned Tree Output From CD3 after Applying Final Batch
References
1. Black, M., Hickey, R.J.: Maintaining the Performance of a Learned Classifier under Concept Drift. Intelligent Data Analysis 3 (1999) 453-474
2. Hickey, R.J., Black, M.: Refined Time Stamps for Concept Drift Detection during Mining for Classification Rules. Spatio-Temporal Data Mining - TSDM2000, published in LNAI 2007, Springer-Verlag
3. Hickey, R.J.: Noise Modelling and Evaluating Learning from Examples. Artificial Intelligence 82 (1996) 157-179
4. Kelly, M.G., Hand, D.J., Adams, N.M.: The Impact of Changing Populations on Classifier Performance. In: Chaudhuri, S., Madigan, D. (eds.): Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York (1999) 367-371
5. Klenner, M., Hahn, U.: Concept Versioning: A Methodology for Tracking Evolutionary Concept Drift in Dynamic Concept Systems. In: Proceedings of the Eleventh European Conference on Artificial Intelligence. Wiley, Chichester, England (1994) 473-477
6. Schlimmer, J.C., Granger, R.H.: Incremental Learning from Noisy Data. Machine Learning 1 (1986) 317-354
7. Hembold, D.P., Long, P.M.: Tracking Drifting Concepts by Minimising Disagreements. Machine Learning 14 (1994) 27-45
8. Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (2001)
9. Widmer, G.: Tracking Changes through Meta-Learning. Machine Learning 27 (1997) 259-286
10. Widmer, G., Kubat, M.: Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23 (1996) 69-101
11. Chakrabarti, S., Sarawagi, S., Dom, B.: Mining Surprising Patterns using Temporal Description Length. In: Gupta, A., Shmueli, O., Widom, J. (eds.): Proceedings of the Twenty-Fourth International Conference on Very Large Databases. Morgan Kaufmann, San Mateo, California (1998) 606-61
12. Chen, X., Petrounias, I.: Mining Temporal Features in Association Rules. In: Zytkow, J., Rauch, J. (eds.): Proceedings of the Third European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Artificial Intelligence, Vol. 1704. Springer-Verlag, Berlin Heidelberg New York (1999) 295-300
13. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California (1993)
14. Quinlan, J.R.: See5. http://www.rulequest.com/ (1998)
15. http://www.spss.com/clemintine/
16. Utgoff, P.E.: Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning 29(1) (1997) 5-44
17. Clark, P., Boswell, R.: Rule Induction with CN2: Some Recent Improvements. In: Proceedings of the European Workshop on Learning (EWSL-91). Springer-Verlag, Berlin (1991) 151-163
18. Bratko, I.: Prolog Programming for Artificial Intelligence. Addison-Wesley, Wokingham (1990)
A Learning System for Decision Support in Telecommunications

Filip Železný¹, Jiří Zídgek², and Olga Štěpánková³

¹ Center for Applied Cybernetics, ³ The Gerstner Laboratory
¹,³ Faculty of Electrotechnics, Czech Technical University Prague, Czech Republic
Technická 2, CZ 166 27, Prague 6
{zelezny,step}@labe.felk.cvut.cz
² Atlantis Telecom s.r.o.
Žirovnická 2389, CZ 106 00, Prague 10
[email protected]
Abstract. We present a system for decision support in telecommunications. History data describing the operation of a telephone exchange are analyzed by the system to reconstruct understandable event descriptions. The event descriptions are processed by an algorithm inducing rules describing regularities in the events. The rules can be used as decision support rules (for the exchange operator) or directly to automate the operation of the exchange.
1 Introduction
In spite of the explosion of information technologies based on written communication, the most common and most frequently used tool is the telephone. Up-to-date private branch exchanges (PBX) provide comfort in managing the telephone traffic, namely regarding calls coming into an enterprise from the outside world. Communication proceeds smoothly provided that the caller knows with whom she wants to communicate and the person is available. In the opposite case, there is a secretary, receptionist, operator or colleague that can, for instance, help to find a substituting person. The operator is a person with no direct product, but with a strong impact on the productivity of other people. Despite that, a wide range of companies have cancelled the post of the telephone operator. The reason is that it is not easy to find a person who is intelligent enough to be a good operator and modest enough to be just an operator. This opens the way for computers - the computer is paid for only once, so no fixed costs set in. Moreover, the machine can work non-stop and provide additional data suitable for analysis, allowing for improvements of the telecommunication traffic. Currently there are several domains where computers are used in the PBX area (neglecting the fact that a PBX itself is a kind of computer):
– Automated attendant - a device that welcomes a caller in a unified manner and usually allows him to reach a person, or choose a person from a spoken list; in both cases the calling party is required to co-operate.
– Voice mail - a device allowing a spoken message to be left for an unavailable person; some rather sophisticated methods of delivering the messages are available. – Information service - the machine substitutes a person in providing some basic information usually organized into an information tree; the calling party is required to co-operate. The aim of the above listed tools is to satisfy a caller even if there is no human service available at the moment. But all such devices are designed in a static, simple manner - they always act the same way. The reason is simple - they do not consider who is calling nor what they usually want - as opposed to the human operator. Comparing a human operator/receptionist to a computer, we can imagine the following improvements of the automated telephony: 1. Considering who is calling (by the identified calling party number) and what number was dialled by the caller, the system can learn to determine the person most probably desired by the caller; knowledge can be obtained either from previous cases (taking into account other data like daytime, explicit information - long absence of some of the company’s employee, etc.) or by ‘observing’ the way the caller was handled by humans before; this could shorten the caller’s way to get the information she needs. 2. The caller can be informed by a machine in a spoken language about the state of the call and suggested most likely alternatives; messages should be ‘context sensitive’. Naturally, the finite goal of computerized telephony is a fully ‘duplex’ machine that can both speak and comprehend spoken language so that the feedback with the caller can proceed in a natural dialog. We present a methodology where the aim is to satisfy goal 1. The task was defined by a telecommunication company that installs PBX switchboards in various enterprizes. Our experiments are based on the PBX logging data coming from one of the enterprises. The methodology is reflected in a unified system with inductive (learning) capabilities employed to produce decision support rules based on the data describing the previous PBX switching traffic. The system can be naturally adapted to the conditions of a specific company (by including a formally defined enterprise-related background knowledge) as well as in the case of a change in the PBX firmware (again via an inductive learning process). We employ the language of Prolog [3,5] (a subset of the language of firstorder logic) as a unified formalism to represent the input data, the background knowledge, the reasoning mechanism and the output decision support rules. The reason for this is the structured nature of the data with important dependencies between individual records, and the fact that sophisticated paradigms are available for learning in first-order logic. These paradigms are known as Inductive Logic Programming (ILP) [9,7]. The fundamental goal of ILP is the induction of first-order logic theories from logic facts and background knowledge. In recent years, two streams of ILP have developed, called the normal setting (where roughly - theories with a ’predictive’ nature are sought) and the non-monotonic
setting (where the theories have a ’descriptive’ character). We employ both of the settings in the system and their brief description will be given in the respective sections. The paper is further organized as follows. The next section describes the data produced by the PBX. In Sections 3 and 4 we deal with the question of how to reconstruct events from the data, i.e. how to find out what actions the callers performed. In Section 5 we shall describe the way we induce decision support rules from the event database and appropriate background knowledge. Section 6 shows the overall interconnection of the individual learning/reasoning mechanisms into an integrated system. A rough knowledge of the syntax of Prolog clauses (rules) is needed to understand the presented examples of the learning and reasoning system parts.
2 The Exchange and Its Data
The raw logging file of the PBX (MC 7500) is an ASCII file composed of metering receipts (tickets) describing 'atomic events'. The structure of such a ticket is, e.g.,

4AB000609193638V1LO EX 0602330533    1    005000 1FEFE    12193650EDILBRDD
This ticket describes a single unanswered ring from the external number 0602 330533 on the internal line 12. To make the information carried by the ticket accessible to both the human user and the reasoning mechanisms, we convert the ticket descriptions into a relational-table form obtained with the data transformation tool Sumatra TT [2] developed at CTU Prague. A window into the relational table is shown in Figure 1. The numbered columns denote the following attributes extracted from the ticket and related to the corresponding event: 1: date, 2: starting time, 3: monitored line, 4: end time, 5: call type (E - incoming, S - outgoing), 6: release type (LB - event terminated, LI - event continues in another ticket), 7: release cause (e.g. TR - call has been transferred), 8: call setup (D - direct, A - result of a previous transfer), 11: call nature (EX - external, LO - local, i.e. between internal lines, etc.), 12: corresponding party number, 14: PBX port used, 17: unique ticket key. Attributes not mentioned are not crucial for the following explanation. A complete event, i.e. the sequence of actions (e.g. transfers between lines) starting with an answered ring from an outside party and ending with the call termination, is reflected by two or more tickets. For example, a simple event such as an external answered (non-transferred) call will produce two tickets in the database (one for the ring, another for the talk). Figure 1 contains records related to two simultaneous external calls, each of which was transferred to another line after a conversation on the originally called line. The first problem of the data analysis is apparent: tickets related to different events are mixed and not trivially separable. Moreover, although calls originating from a transfer from
[Table data of Figure 1 omitted: twelve ticket records (attributes 1–17) from 000802, covering internal lines 10, 11, 31, and 32 and the external numbers 0405353377, 85131111, and 0400000000.]
Fig. 1. A window into the PBX logging data containing two simultaneous calls.
a preceding call can be identified (those labelled A in attribute 8), it cannot be immediately seen from which call they originate. Moreover, we have to deal with the erroneous way in which the PBX logs some instances of the external numbers; e.g. the number 0400000000 in Figure 1 actually refers to the caller previously identified as 0405353377. This problem will be discussed in Section 4.1.
3 Event Extraction
The table in Figure 1 can be visualized graphically as shown in Figure 2. The figure visually distinguishes the two recorded simultaneous calls, although they are not distinguished by any attribute in the data. The two events are as follows: caller 0405353377 (EX1) is connected to the receptionist on line 32 and asks to be transferred to line 10. After a spoken notification from 32 to 10, the redirection occurs. During the transferred call between EX1 and 10, a similar event proceeds for caller 85131111 (EX2), receptionist 31 and the desired line 11. It can be seen that the duration of each of the two events (transferred calls) is covered by the durations of the external-call tickets related to the event. Furthermore, the answering port (attribute 14) recorded for each of the external-call tickets is constant within one event and different for different simultaneous events. In other words, a single external caller remains connected to one port until she hangs up, whether or not she gets transferred to different internal lines. This is an expert-formulated, generally valid rule which can be used to delimit the duration of particular events (Figure 3). The sequences of connected external-call tickets are taken as a basis for event recognition. We have implemented the event extractor both as a Prolog program and as a set of SQL queries. However, additional tickets (besides the external-call tickets) related to an event have to be found in the data to recognize the event. For instance, both of the events reflected in Figure 2 in fact contain a transfer-with-notification action (e.g. line
32 informs line 10 about the forthcoming transfer before it takes place), which can be deduced from the three tickets related to the internal-line communication within each of the events. In the experimental database, 23.35% of all tickets fall into one of the extracted events. The rest of the communication traffic thus consists of internal or outgoing calls.
Fig. 2. Visualizing the chronology of the telecommunication traffic contained in the table in Figure 1. Vertical lines denote time instants, labelled horizontal lines denote the duration and attributes given by one ticket. For each such line, the upper-left / upper-right / lower-left / lower-right attributes denote the calling line, called line, call setup attribute and release type + cause, respectively. The abbreviations EX1 and EX2 represent two different external numbers. Thin lines stand for internal call tickets while thick lines represent external call tickets. The vertical position of the horizontal lines reflects the order of the tickets in the database. For ease of insight, tickets represented by dashed lines are related to a different call than those with full lines.
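As a concrete illustration of the port-based delimitation rule above, the following Prolog sketch groups the external-call tickets seen on one answering port into a time-ordered candidate event. It is only an illustrative simplification (the predicate and fact names are assumed, and a real extractor must additionally split the sequence when the caller hangs up, i.e. at a terminating LB ticket); the paper's own Prolog and SQL implementations are not listed here.

% ticket(Id, StartTime, Port, Nature) is a hypothetical, simplified stand-in
% for the 17-attribute relational representation of Section 2.
% event_delimitation(+Port, -TicketIds): time-ordered external-call tickets
% answered on one port, taken as the base of one extracted event.
event_delimitation(Port, TicketIds) :-
    setof(Start-Id, ticket(Id, Start, Port, ex), Ordered),
    findall(Id, member(_-Id, Ordered), TicketIds).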
4 Event Reconstruction
Having obtained an event delimitation from the event extractor as a sequence of external-call tickets, we need to search the database for all other tickets related to that event. According to these tickets we can decide what sequence of actions occurred during the event, such as different kinds of call transfer (direct, with notification), their outcome (refusal, no answer, line busy), returns to the previous attendant, etc. The way such actions are reflected in the ticket database depends on the current setting of the PBX firmware, and an appropriate formal mapping {events} → {sets of inter-related tickets} is not available.
[Diagram: TICKETS → EVENT EXTRACTOR → EVENT DELIMITATIONS]
Fig. 3. The event extraction.
Such a mapping can, however, be obtained via an inductive learning process which discovers ticket patterns for individual actions from classified examples, that is, completely described events. These classified examples were produced by intentionally performing a set of actions on the PBX and storing separately the tickets generated for each of the actions. The discovered (first-order) patterns are then used to recognize transitions of an automaton formally describing the course of actions within an event.
4.1 Learning Action Patterns
The goal of the action-pattern learner is to discover ticket patterns, that is, the character of the set of tickets produced by the PBX in the logging data as a result of performing a specific action. This concerns the occurrence of individual tickets in the set, their mutual order and/or (partial) overlapping in time, and so on. For this purpose, tickets are represented in the Prolog fact syntax1 as

t(..., a_n, a_{n+1}, ...)

where a_i are the ticket attributes described earlier and the irrelevant date attribute is omitted. The constant empty stands for a blank field in the data. The learner has two inputs: the classified event examples (sets of Prolog facts) and a general background knowledge (GBK). The following is a single instance of the example set, composed of facts bound to a single event, namely two facts representing tickets and two facts representing the actions in the event: an external call from the number 0602330533 answered by the internal line 12, and the call termination caused by the external number hanging up.2

t(time(19,43,48),[1,2],time(19,43,48),e,li,empty,d,empty,empty,ex,
  [0,6,0,2,3,3,0,5,3,3],empty,anstr([0,0,5,0,0,0]),fe,fe,id(4)).
t(time(19,43,48),[1,2],time(19,43,50),e,lb,e(relcause),d,dr,06,ex,
  [0,6,0,0,0,0,0,0,0,0],empty,anstr([0,0,5,0,0,0]),fe,fe,id(5)).

1 Such a representation is obtained simply by a single pass of the Sumatra TT transformation tool on the original data.
2 Both external and internal line numbers are represented as Prolog lists to allow easy access to their substrings.
ex_ans([0,6,0,2,3,3,0,5,3,3],[1,2]).
hangsup([0,6,0,2,3,3,0,5,3,3]).
The general background knowledge GBK describes certain a priori known properties of the PBX. For example, due to a malfunction, the PBX occasionally substitutes a suffix of an identified external caller number by a sequence of zeros (such as in the second fact above). The correct and substituted numbers have to be treated as identical in the corresponding patterns. Therefore one of the rules (predicate definitions) in GBK is samenum(NUM1,NUM2), which unifies two numbers with identical prefixes and different suffixes, one of which is a sequence of zeros. To induce patterns from examples of the above form and the first-order background knowledge GBK, we constructed an ILP system working in the non-monotonic ILP setting (known also as learning from interpretations). The principle of this setting is that given a first-order theory B (background knowledge), a set of interpretations (sets of logic facts) E and a grammar G, we have to find all first-order clauses (rules) c included in the language defined by the grammar G, such that c is true in B&e for all e ∈ E.3 In our case, E is the set of classified events, B = GBK, and G is defined so that it produces rules where tickets and their mutual relations are expressed in the rule's Body and the action is identified in the rule's Head. To specify G we integrated the freely available DLAB [4] grammar-definition tool into our ILP system. Besides grammar specification, DLAB also provides methods of clausal refinement, so we could concentrate only on clause validity evaluation and on implementing the (pruning) search through the space of clauses. An example of a generated pattern found to be valid for all of the collected examples is the following, describing which combination of tickets, with their relationship specified by the equalities in the rule's body, reflects the action of answering a direct (non-transferred) external call.4

ex_ans(RNCA1,DN1) :-
    t(IT1,DN1,ET1,e,li,empty,d,EF1,FI1,ex,RNCA1,empty,ANTR1,CO1,DE1,ID1),
    IT2 = ET1, ANTR2 = ANTR1,
    t(IT2,DN2,ET2,e,lb,RC2,d,EF2,FI2,ex,RNCA2,empty,ANTR2,CO2,DE2,ID2),
    samenum(RNCA1,RNCA2).
The time order of the involved tickets is determined by the equality IT2 = ET1 in the rule (with the variables IT2 and ET1 referring to the initial time of the second ticket and the end time of the first ticket, respectively).

3 A clause c = Head :- Body is true in B&e if both B and e are stored in a Prolog database and the Prolog query ?- Body, not Head against that database does not succeed.
4 Recall that capital letters stand for universally quantified variables in Prolog syntax.
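For concreteness, the GBK predicate samenum/2 could be defined along the following lines. This is only an illustrative sketch based on the verbal description above (the all_zeros/1 helper and the non-empty-prefix condition are our assumptions), not the definition used by the authors.

% samenum(+Num1, +Num2): the two digit lists denote the same caller, either
% because they are equal or because one suffix was overwritten with zeros.
samenum(Num, Num).
samenum(Num1, Num2) :-
    append(Prefix, Suffix1, Num1),
    append(Prefix, Suffix2, Num2),
    Prefix = [_|_],                               % shared, non-empty prefix
    Suffix1 \= Suffix2,                           % differing suffixes ...
    ( all_zeros(Suffix1) ; all_zeros(Suffix2) ).  % ... one being all zeros

all_zeros([]).
all_zeros([0|Rest]) :- all_zeros(Rest).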
Using the described approach to generate rules for other actions as well, we create a database of action patterns (as shown in Figure 4). Since we already knew some of the patterns from experience with the manual data analysis, this process was both theory discovery and theory revision. The final action-pattern database is thus a combination of induction results and explicit knowledge representation. The database can be kept static as long as the PBX firmware (i.e. the exact manner of logging) remains unchanged. The process should be repeated when the firmware is modified and the logging procedures change.
[Diagram: CLASSIFIED ACTIONS and GENERAL BACKGROUND KNOWLEDGE → PATTERN LEARNER → ACTION PATTERNS]
Fig. 4. Learning action patterns.
4.2 Event Recognizing Automaton
To discover the sequence of actions in an event, we assume that every event (starting with an incoming call) can be viewed as the simple finite-state automaton shown in Figure 5. Each transition corresponds to one or more actions defined in the action-pattern database (e.g. 'Attempt to transfer' corresponds to several kinds of transfer procedure). The automaton (event reconstructor) is encoded in Prolog. It takes as input an event-delimiting sequence S (produced by the event extractor) and the action-pattern database. In parsing S, the patterns are used to recognize transitions between the states. Since the patterns may refer to GBK and also to tickets not present in S (such as transfer-with-notification patterns - see Figure 2), both GBK and the ticket database must be available to the automaton. This dataflow is depicted in Figure 6. Regarding the output, one version of the reconstructor produces human-understandable descriptions of the event, such as in the following example.

?- recognize([id(60216),id(60218),id(60224),id(60228),id(60232),id(60239)])
EVENT STARTS.
648256849 rings on 32 - call accepted,
32 attempts to transfer 0600000000 to 16 with notification, but 16 refused,
32 notifies 12 and transfers 0648256849 to 12,
12 attempts to transfer 0600000000 to 28 with notification, but 28 does not respond,
[State diagram: states RING, TALK, TRN_ATTEMPT, TERMINATED, UNAVAILABLE; transitions labelled Answered, Unanswered, Attempt to transfer, and Hang up.]
Fig. 5. The states and transitions of the event automaton. In this representation of the PBX operation, the sequence RING → Unanswered → UNAVAILABLE → Attempt to transfer cannot occur because the caller is not assisted by a person on an internal line.
12 notifies 26 and transfers 0600000000 to 26,
call terminated.
EVENT STOPS.
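The recognizer's parsing loop can be pictured roughly as follows. This is an illustrative sketch with assumed predicate names (transition/3 encoding the automaton of Figure 5, match_action/3 applying a learned action pattern to a prefix of the remaining tickets, final_state/1 marking accepting states), not the authors' code.

% recognize_event(+State, +Tickets, -Actions): parse the delimited ticket
% sequence, emitting one recognized action per automaton transition.
recognize_event(State, [], []) :-
    final_state(State).
recognize_event(State, Tickets, [Action|Actions]) :-
    match_action(Tickets, Action, RestTickets),
    transition(State, Action, NextState),
    recognize_event(NextState, RestTickets, Actions).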
An alternative version of the reconstructor produces the descriptions in the form of structured (recursive) Prolog facts of the form

incoming(DATE,TIME,CALLER,FIRST_CALLED_LINE,RESULT),

where

RESULT ∈ {talk, unavailable, transfer([t1, t2, ..., tn], RESULT)}          (1)

and t1...tn−1 denote line numbers to which unsuccessful attempts to transfer were made, while the transfer result refers to the last transfer attempt (to tn). According to this syntax, the previous example output will be encoded as

incoming(date(10,18),time(13,37,29),[0,6,4,8,2,5,6,8,4,9],[3,2],
    transfer([[1,6],[1,2]],transfer([[2,8],[2,6]],talk))).          (2)
and in this form used as the input to the inductive process described in the next section. The effectiveness of the described recognition procedure is illustrated in Figure 7. It can be seen that the method results in a very good coverage (recognition) of events containing more than 2 tickets. For the shorter events, more training examples will have to be produced and employed in learning to improve the coverage.
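To illustrate how the recursive form (1)–(2) can be processed, a small helper that follows the nested transfer terms down to the line that finally handled the call might look as follows; final_line/2 is our own illustrative predicate, not one from the paper.

% final_line(+IncomingFact, -Line): the internal line reached at the end of
% the event, obtained by descending through the nested transfer(...) terms.
final_line(incoming(_,_,_,FirstLine,Result), Line) :-
    final_line_(Result, FirstLine, Line).

final_line_(talk, Line, Line).
final_line_(unavailable, Line, Line).
final_line_(transfer(Targets, Result), _, Line) :-
    last(Targets, LastTarget),
    final_line_(Result, LastTarget, Line).

Applied to the fact (2), this yields [2,6], i.e. line 26, matching the textual description above.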
[Diagram: EVENT DELIMITATIONS, ACTION PATTERNS, TICKETS, and GENERAL BACKGROUND KNOWLEDGE → EVENT RECONSTRUCTOR → EVENTS]
Fig. 6. The event reconstruction.
5 Decision Support
Having reconstructed the events from the logging file, that is, knowing how different external callers have been handled in the past, the system can find regularities in the data, according to which some future events may be partially predicted (extrapolated) or even automated. For example, it may be found that whenever the caller identified by her number N calls the receptionist, she always desires to be transferred to line L. Then it makes sense to transfer N to L automatically upon the ring, without the assistance of the receptionist (provided that N can be transferred to another line by L if the prediction turns out to be wrong). Or, it may be a regular observation that if the person on line L1 is not available, line L2 is always provided as a substitute - this again offers an automation rule, or at least decision support advice to the receptionist. A suitable methodology for inducing predictive rules from our data is the normal inductive logic programming setting, where the goal is (typically) the following. Given a first-order theory (background knowledge) B and two sets of logic facts E+, E− (positive and negative examples), find a theory T such that
1. B&T ⊢ e+ for each e+ ∈ E+
2. B&T ⊬ e− for each e− ∈ E−
That is, we require that any positive (negative) example can (cannot) be logically derived from the background knowledge and the resulting theory. In our case, B is composed of the GBK described earlier and an enterprise-related background knowledge (EBK). EBK may describe, for example, the regular (un)availability of employees. The E+ set contains the event descriptions given
Fig. 7. Proportions of extracted events of individual lengths (number of tickets) that are recognized by the recognition automaton (left). Proportions of tickets classified into a recognized event, out of all tickets extracted into events of the respective lengths (right). The complete ticket database contains about 70,000 tickets spanning a 3-month operation of the PBX.
by the predicate incoming, such as the example (2). The negative example set is in our case substituted by integrity constraints that express, for instance, that a call cannot be transferred to two different lines, etc. The resulting theory is bound not to violate the integrity constraints. Our experiments in the decision support part of the system have been reported in detail elsewhere [10]; therefore we just mention one example of a resulting rule valid with accuracy 1 on the training set. The rule

incoming(D,T,EX,31,transfer([10|R],RES)) :-
    day_is(monday,D), branch(EX,[5,0]).
employs the predicate day_is, whose meaning is obvious, and branch, which identifies external numbers by a prefix. Both of these predicates are defined in EBK. The rule's meaning is that if a number starting with 50- calls the (reception) line 31 on a Monday, the caller always desires to be transferred to line 10 (whatever the transfer result is). See [10] for a detailed overview of the predictive-rule induction experiments. Figure 8 summarizes the data-flow in the system's decision support part. The performance of the decision-support rules has so far been tested only in the experimental environment; the implementation in the enterprise is currently under construction.
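To show how such an induced rule could be applied at call time, the following sketch queries it against hypothetical EBK definitions of day_is/2 and branch/2; the EBK shown here (calendar facts and the prefix test) is our own illustrative assumption, not taken from the paper.

% Hypothetical EBK fragment.
day_is(monday, date(10,22)).
branch(Number, Prefix) :- append(Prefix, _, Number).

% With the induced rule loaded, a suggested target line for an incoming
% call can then be read off directly, e.g.:
% ?- incoming(date(10,22), T, [5,0,1,2,3,4], 31, transfer([Target|_], _)).
% Target = 10.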
6 System Integration Overview
Figure 9 shows how the previously described individual system parts are integrated. The fundamental cycle is the following: the PBX generates data that are analyzed to produce decision support rules, which then in turn influence the operation of the PBX (with or without human assistance).
[Diagram: EVENTS, GENERAL BACKGROUND KNOWLEDGE, and ENTERPRISE BACKGROUND KNOWLEDGE → RULE GENERATOR → DECISION SUPPORT RULES]
Fig. 8. Generating decision support rules.
7 Conclusions
We have presented a system for decision support in telecommunications. The system analyzes data stored by a private branch exchange to reconstruct understandable event descriptions. For this purpose, action patterns are learned from classified examples of actions. The event descriptions are processed by an algorithm that induces rules describing event regularities, which can be used as decision support rules (for the exchange operator) or directly to automate the PBX operation. In other words, we have performed a data-mining task on the input data and tried to integrate the results into a decision support system. The methods of data mining are currently receiving a lot of attention [6], especially those allowing for intelligent mining from multiple-relation databases [9]. By employing the techniques of inductive logic programming, we are in fact conducting a multi-relational data-mining task. Although there is previous work on data mining in telecommunications [8], we are not aware of another published approach utilizing multi-relational data-mining methods in this field. The integration of data mining and decision-support systems is currently also an emerging and much-discussed topic [1], and research projects are being initiated in the scientific community to lay out a conceptual framework for such integration. We hope to have contributed to that research with this application-oriented paper.
Acknowledgements. This work has been supported by the project MSM 212300013 (Decision Making and Control in Manufacturing, Research programme funded by the Czech Ministry of Education, 1999-2003), the Czech Technical University internal grant No. 3 021 087 333 and the Czech Ministry of Education grant FRVS 23 21036 333.
[System diagram: the TELEPHONE EXCHANGE (PBX, MC 7500) produces TICKETS; the EVENT EXTRACTOR (Prolog, SQL) turns them into EVENT DELIMITATIONS; the EVENT RECONSTRUCTOR (a finite-state automaton in Prolog) uses ACTION PATTERNS and GENERAL BACKGROUND KNOWLEDGE to produce EVENTS; the RULE GENERATOR (ILP and other methods) combines these with ENTERPRISE BACKGROUND KNOWLEDGE to yield DECISION SUPPORT RULES fed back to the PBX. The PATTERN LEARNER (ILP) rebuilds the action patterns from CLASSIFIED ACTIONS when the firmware changes; humans supply enterprise knowledge when the enterprise changes.]
Fig. 9. The system integration overview. The dataflow over the upper (lower) dashed line is needed only when the enterprise conditions are (the PBX firmware is) modified, respectively. The dotted arrow represents the optional manual formulation of the expert knowledge about the action representation in the logging data. The star-labelled processes are those where learning/induction takes place. The digit in the upper-left corner of each box refers to the section where more detail on the respective process/database is given.
References
1. Workshop papers. In Integrating Aspects of Data Mining, Decision Support and Meta-Learning. Freiburg, Germany, 2001.
2. Petr Aubrecht. Sumatra Basics. Technical report GL–121/00 1, Czech Technical University, Department of Cybernetics, Technická 2, 166 27 Prague 6, December 2000.
3. Ivan Bratko. Prolog: Programming for Artificial Intelligence. Computing Series. Addison-Wesley Publishing Company, 1993. ISBN 0-201-41606-9.
4. L. Dehaspe and L. De Raedt. DLAB: A declarative language bias formalism. In Proceedings of the 10th International Symposium on Methodologies for Intelligent Systems, volume 1079 of Lecture Notes in Artificial Intelligence, pages 613–622. Springer-Verlag, 1996.
5. P.A. Flach. Simply Logical: Intelligent Reasoning by Example. John Wiley, 1994.
6. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT, 2000.
7. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
8. R. Mattison. Data Warehousing and Data Mining for Telecommunications. Artech House, 1997.
9. S. Džeroski and N. Lavrač, editors. Relational Data Mining. Springer-Verlag, Berlin, September 2001.
10. F. Železný, P. Mikšovský, O. Štěpánková, and J. Zídek. ILP for automated telephony. In J. Cussens and A. Frisch, editors, Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, pages 276–286, 2000.
Adaptive User Modelling in an Intelligent Telephone Assistant

Trevor P. Martin1 and Behnam Azvine2

1 University of Bristol, Bristol, BS8 1TR, UK
[email protected]
2 BTexact Technologies, Adastral Park, Ipswich, IP5 3RE, UK
[email protected]
Abstract. With the burgeoning complexity and capabilities of modern information appliances and services, user modelling is becoming an increasingly important research area. Simple user profiles already personalise many software products and consumer goods such as digital TV recorders and mobile phones. A user model should be easy to initialise, and it must adapt in the light of interaction with the user. In many cases, a large amount of training data is needed to generate a user model, and adaptation is equivalent to retraining the system. This paper briefly outlines the user modelling problem and work done at BTexact on an Intelligent Personal Assistant (IPA) which incorporates a user profile. We go on to describe FILUM, a more flexible method of user modelling, and show its application to the Telephone Assistant component of the IPA, with tests to illustrate its usefulness.
1 Introduction

We can recognise a strongly growing strand of interest in user modelling arising from research into intelligent interfaces. In this context, we can identify three different outcomes of user modelling:
• Changing the way in which some fixed content is delivered to the user.
• Changing the content that is delivered to the user.
• Changing the way in which the device is used.
Each of these is discussed in turn below. The first is more a property of the device that is displaying content to a user. For example, a WAP browser must restrict graphical content. There is little room for user likes and dislikes, although [12] describe a system which implements different interfaces for different users. Those who have more difficulty navigating through the system use a menu-based interface whereas those with a greater awareness of the system contents are given an interface using a number of shortcut keys. The second category—improving information content—is perhaps the most common. Examples abound in Internet-related areas, with applications to
• Deliver only “interesting” news stories to an individual’s desktop. The pointcast news delivery systems are a first step (e.g. www.pointcast.com/products/pcn/ and cnn.com/ads/advertiser/pointcast2.0/); see also [11] and IDIoMS [13].
• Remove unwanted emails.
• Identify interesting web pages—for example Syskill &Webert [24] uses an information-theoretic approach to detect “informative” words on web pages. These are used as features, and user ratings of web pages (very interesting, interesting, not interesting, etc.) creates a training data set for a naive Bayesian classifier. A similar approach can be used for the retrieval of documents from digital libraries, using term frequency/inverse document frequency [30] to select keywords and phrases as features. A user model can be constructed in terms of these features, and used to judge whether new documents are likely to be of interest. This is a very active area of web development—for example, W3C’s Metadata Activity [32] is concerned with ways to model and encode metadata, that is, information on the kind of information held in a web page, and to document the meaning of the metadata. The primary reason behind this effort is to enable computers to search more effectively for relevant data; however this presupposes some method for the system to know the user’s interests, that is, some kind of user profile. With the incorporation of powerful embedded computing devices in consumer products, there is a blurring of boundaries between computers and other equipment, resulting in a convergence to information appliances or information devices. Personalisation, which is equivalent to user modelling, is a key selling point of this technology—for example, to personalise TV viewing (www.tivo.com, 1999): − “With TiVo, getting your favorite programs is easy. You just teach it what shows you like, and TiVo records them for you automatically. − As you’re watching TV, press the Thumbs Up or Thumbs Down button on the TiVo remote to teach TiVo what you like − As TiVo searches for shows you’ve told it to record, it will also look for shows that match your preferences and get those for you as well...” Sony have implemented a prototype user modelling system [34] which predicts a viewing timetable for a user, on the basis of previous viewing and programme classification. Testing against a database of 606 individuals, 108 programme categories and 45 TV channels gave an average prediction accuracy of 60-70%. We will not discuss social or collaborative filtering systems here. These are used to recommend books (e.g. amazon.com), films, and so on, and are based on clustering the likes and dislikes of a group of users. The third category—changing the way in which the device is used—can also be illustrated by examples. Microsoft’s Office Assistant is perhaps the best known example of user modelling, and aims to provide appropriate help when required, as well as a “tip of the day” that is intended to identify and remedy gaps in the user’s knowledge of the software. The Office Assistant was developed from the Lumiere [16] project, which aimed to construct Bayesian models for reasoning about the timevarying goals of computer users from their observed actions and queries. Although it can be argued that the Office Assistant also fits into the previous category (changing the content delivered to the user), its ultimate aim is to change the way the user works so that the software is employed more effectively. The system described by [19] has similar goals but a different approach. User modelling is employed to disseminate expertise in use of software packages (such as Microsoft Word) within an organisation. 
By creating an individual user model and comparing it to expert models, the system is able to identify gaps in knowledge and offer individualised tips as well as feedback on how closely the user matches expert use of the package. The key difference from the Office Assistant is that this system
monitors all users and identifies improved ways of accomplishing small tasks; this expertise can then be spread to other users. The Office Assistant, on the other hand, has a static view of best practice. Hermens and Schlimmer [14] implemented a system which aided a user filling in an electronic form, by suggesting likely values for fields in the form, based on the values in earlier fields. The change in system behaviour may not be obvious to the user. Lau and Horvitz [18] outline a system which uses a log of search requests from Yahoo, and classifies users’ behaviour so that their next action can be predicted using a Bayesian net. If it is likely that a user will follow a particular link, rather than refining or reformulating their query, then the link can be pre-fetched to improve the perceived performance of the system. This approach generates canonical user models, describing the behaviour of a typical group of users rather than individual user models. There are two key features in all these examples:
• The aim is to improve the interaction between human and machine. This is a property of the whole system, not just of the machine, and is frequently a subjective judgement that cannot be measured objectively.
• The user model must adapt in the light of interaction with the user.
Additionally, it is desirable that the user model
− Be gathered unobtrusively, by observation or with minimal effort from the user.
− Be understandable and changeable by the user - both in terms of the knowledge held about the user and in the inferences made from that knowledge.
− Be correct in actions taken as well as in deciding when to act.
2 User Models—Learning, Adaptivity, and Uncertainty

The requirement for adaptation puts user modelling into the domain of machine learning (see [17] and [33]). A user model is generally represented as a set of attribute-value pairs—indeed the w3c proposals [31] on profile exchange recommend this representation. This is ideal for machine learning, as the knowledge representation fits conveniently into a propositional learning framework. To apply machine learning, we need to gather data and identify appropriate features plus the desired attribute for prediction. To make this concrete, consider a system which predicts the action to be taken on receiving emails, using the sender’s identity and words in the title field. Most mail readers allow the user to define a kill file, specifying that certain emails may be deleted without the user seeing them. A set of examples might lead to rules such as
if title includes $ or money then action = delete
if sender = boss then action = read, and subsequently file
if sender = mailing list then action = read and subsequently delete
This is a conventional propositional learning task, and a number of algorithms exist to create rules or decision trees on the basis of data such as this [4], [5], [7], [26], [27]. Typically, the problem must be expressed in an attribute-value format, as above; some feature engineering may be necessary to enable efficient rules to be induced. Rule-based knowledge representation is better than (say) neural nets due to the better understandability of the rules produced - the system should propose rules which the
user can inspect and alter if necessary. See [23] for empirical evidence of the importance of allowing the user to remain in control. One problem with propositional learning approaches is that it is difficult to extract relational knowledge. For example:
if several identical emails arrive consecutively from a list server, then delete all but one of them
Also, it can be difficult to express relevant background knowledge such as:
if a person has an email address at acme.com then that person is a work colleague
These problems can be avoided by moving to relational learning, such as inductive logic programming [22], although this is not without drawbacks as the learning process becomes a considerable search task. Possibly more serious issues relate to the need to update the user model, and to incorporate uncertainty. Most machine learning methods are based on a relatively large, static set of training examples, followed by a testing phase on previously unseen data. New training examples can normally be addressed only by restarting the learning process with a new, expanded, training set. As the learning process is typically quite slow, this is clearly undesirable. Additionally, in user modelling it is relatively expensive to gather training data - explicit feedback is required from the user, causing inconvenience. The available data is therefore more limited than is typical for machine learning. A second problem relates to uncertainty. User modelling is inherently uncertain—as [15] observes, “Uncertainty is ubiquitous in attempts to recognise an agent’s goals from observations of behaviour,” and even strongly logic-based methods [25] acknowledge the need for “graduated assumptions.” There may be uncertainty over the feature definitions. For example:
if the sender is a close colleague then action = read very soon
(where close colleague and very soon are fuzzily defined terms), or over the applicability of rules. For example:
if the user has selected several options from a menu and undone each action, then it is very likely that the user requires help on that menu
where the conclusion is not always guaranteed to follow. It is an easy matter to say that uncertainty can be dealt with by means of a fuzzy approach, but less easy to implement the system in a way that satisfies the need for understandability. The major problem with many uses of fuzziness is that they rely on intuitive semantics, which a sceptic might translate as “no semantics at all.” It is clear from the fuzzy control literature that the major development effort goes into adjusting membership functions to tune the controller. Bezdek [9], [10] suggests that membership functions should be “adjusted for maximum utility in a given situation.” However, this leaves membership functions with no objective meaning—they are simply parameters to make the software function correctly. For a fuzzy knowledge-based system to be meaningful to a human, the membership functions should have an interpretation which is independent of the machine operation—that is, one which does not require the software to be executed in order to determine its meaning. Probabilistic representations of uncertain data have a strictly defined interpretation, and the approach adopted here uses Baldwin’s mass assignment theory and voting model semantics for fuzzy sets [3], [8].
3 The Intelligent Personal Assistant

BTexact’s Intelligent Personal Assistant (IPA) [1], [2] is an adaptive software system that automatically performs helpful tasks for its user, helping the user achieve higher levels of productivity. The system consists of a number of assistants specialising in time, information, and communication management:
• The Diary Assistant helps users schedule their personal activities according to their preferences.
• Web and Electronic Yellow Pages Assistants meet the user’s needs for timely and relevant access to information and people.
• The RADAR assistant reminds the user of information pertaining to the current task.
• The Contact Finder Assistant puts the user in touch with people who have similar interests.
• The Telephone and Email Assistants give the user greater control over incoming messages by learning priorities and filtering unwanted communication.
As with any personal assistant, the key to the IPA’s success is an up-to-date understanding of the user’s interests, priorities, and behaviour. It builds this profile by tracking the electronic information that a user reads and creates over time—for example, web pages, electronic diaries, e-mails, and word processor documents. Analysis of these information sources and their timeliness helps the IPA understand the user’s personal interests. By tracking diaries, keyboard activity, gaze, and phone usage, the IPA can build up a picture of the habits and preferences of the user. We are particularly interested in the Telephone and E-mail assistants for communication management, used respectively for filtering incoming calls and prioritising incoming e-mail messages. The Telephone Assistant maintains a set of priorities of the user’s acquaintances, and uses these in conjunction with the caller’s phone number to determine the importance of an incoming call. The E-mail Assistant computes the urgency of each incoming message based on its sender, recipients, size and content. Both assistants use Bayesian networks for learning the intended actions of the user, and importantly, the system continually adapts its behaviour as the user’s priorities change over time. The telephone assistant handles incoming telephone calls on behalf of the user with the aim of minimising disruption caused by frequent calls. For each incoming call, the telephone assistant determines whether to interrupt the user (before the phone rings) based on the importance of the caller and on various contextual factors such as the frequency of recent calls from that caller and the presence of a related entry in the diary (e.g. a meeting with the caller). When deciding to interrupt the user, the telephone assistant displays a panel indicating that a call has arrived; the user has the option of accepting or declining to answer the call. The telephone assistant uses this feedback to learn an overall user model for how the user weights the different factors in deciding whether or not to answer a call. Although this model has been effective, its meaning is not obvious to a user, and hence it is not adjustable. To address this issue, the FILUM [20], [21] approach has been applied to the telephone assistant.
4 Assumptions for FILUM

We consider an interaction between a user and a software or hardware system in which the user has a limited set of choices regarding his/her next action. For example, given a set of possible TV programmes, the user will be able to select one to watch. Given an email, the user can gauge its importance and decide to read it immediately, within the same day, within a week, or maybe classify it as unimportant and discardable. The aim of user modelling is to be able to predict accurately the user’s decision and hence improve the user’s interaction with the system by making such decisions automatically. Human behaviour is not generally amenable to crisp, logical modelling. Our assumption is that the limited aspect of human behaviour to be predicted is based mainly on observable aspects of the user’s context—for example, in classifying an email the context could include features such as the sender, other recipients of the message, previously received messages, current workload, time of day, and so on. Of course, there are numerous unobservable variables - humans have complex internal states, emotions, external drives, and so on. This complicates the prediction problem and motivates the use of uncertainty modelling—we can only expect to make correct predictions “most” of the time. We define a set of possible output values
B = {b1, b2, …, bj}, which we refer to as the behaviour, and a set of observable inputs I = {i1, i2, …, im}. Our assumption is that the (n+1)th observation of the user’s behaviour is predictable by some function of the current observables and all previous inputs and behaviours:

b_{n+1} = f(I_1, b_1, I_2, b_2, …, I_n, b_n, I_{n+1})

The user model, including any associated processing, is equivalent to the function f. This is assumed to be relatively static; within FILUM, addition of new prototypes would correspond to a change in the function. We define a set of classes implemented as Fril++ [6], [28] programs,

C = {c1, c2, …, ck}.

A user model is treated as an instance that has a probability of belonging to each class according to how well the class behaviour matches the observed behaviour of the user. The probabilities are expressed as support pairs, and updated each time a new observation of the user’s behaviour is made. We aim to create a user model m which correctly predicts the behaviour of a user. Each class ci must implement the method Behaviour, giving an output in B (this may be expressed as supports over B). Let S_n(m ∈ c_i) be the support for the user model m belonging to the ith class before the nth observation of behaviour. Initially,
S_1(m ∈ c_i) = [0, 1] for all classes c_i, representing complete ignorance.
Each time an observation is made, every class makes a prediction, and the support for the user model being a member of that class is updated according to the predictive success of the class:

S_{n+1}(m ∈ c_i) = ( n · S_n(m ∈ c_i) + S(c_i.Behaviour_{n+1} == b_{n+1}) ) / (n + 1)          (1)
where S(c_i.Behaviour_{n+1} == b_{n+1}) represents the (normalised) support for class c_i predicting the correct behaviour (from the set B) on iteration n+1. It is necessary to normalise the support update to counteract the swamping effect that could occur if several prototypes predict the same behaviour. Clearly, as n becomes large, supports change relatively slowly. [29] discuss an alternative updating algorithm. The accuracy of the user model at any stage is the proportion of correct predictions made up to that point—this metric can easily be changed to use a different utility function, for example, if some errors are more serious than others.

4.1 Testing

In order to test any user modelling approach, data is needed. Initial studies generated data using an artificial model problem, the n-player iterated prisoner’s dilemma (n-IPD). A Fril++ system was developed to run n-IPD tournaments, allowing a choice of strategies to be included in the environment, with adjustable numbers of players using each strategy. The user model aimed to reproduce the behaviour of a player by means of some simple prototypes. The user models converged after 10-12 iterations, that is, the supports for the models belonging to each class did not change significantly after this. Overall predictive success rates were good, typically 80%-95%, although random strategies were difficult to predict, as would be expected.

4.2 User Models in the Telephone Assistant

The FILUM approach has also been applied to the prediction of user behaviour in the telephone assistant. The following assumptions have been made:
• The user model must decide whether to divert the call to voicemail or pass it through to be answered.
• The user is available to answer calls.
• Adaptive behaviour is based on knowing the correct decision after the call has finished.
• A log of past telephone activity and the current diary are available.
• The identity of all callers is known.
A sample set of user prototypes is shown in Table 1.
Table 1. User Prototypes

Prototype   | Identifying Characteristic                                                             | Behaviour
Talkative   | none                                                                                   | always answer
Antisocial  | none                                                                                   | always divert to voicemail
Interactive | recent calls or meetings involving this caller                                         | answer
Busy        | small proportion of free time in next working day (as shown by diary)                  | answer if caller is brief, otherwise divert to voicemail
Overloaded  | small proportion of free time in next working day (as shown by diary)                  | divert to voicemail
Selective   | none                                                                                   | answer if caller is a member of a selected group, else divert to voicemail
Regular     | large proportion of calls answered at particular times of the day, e.g. early morning  | answer if this is a regular time
This approach assumes that all activities are planned and recorded accurately in an electronically accessible format. Other ways of judging a user’s activity would be equally valid and may fit in better with a user’s existing work pattern—for example, the IPA system also investigated the use of keyboard activity, gaze tracking, and monitoring currently active applications on a computer. There is a need to model callers using a set of caller prototypes, since a user can react in different ways to different callers in a given set of circumstances. For example, the phone rings when you are due to have a meeting with the boss in five minutes. Do you answer if (a) the caller is the boss or (b) the caller is someone from the other side of the office who is ringing to talk about last night’s football results while waiting for a report to print? The sample set of caller prototypes is shown in Table 2. The user and caller prototypes are intended to illustrate the capabilities of the system rather than being a complete set; it is hoped that they are sufficiently close to real behaviour to make detailed explanation unnecessary. Terms in italics are fuzzy definitions that can be changed to suit a user. Note that support pairs indicate the degree to which a user or caller satisfies a particular prototype - this can range from uncertain (0 1) to complete satisfaction (1 1) or its opposite (0 0), through to any other probability interval.

Table 2. Caller Prototypes

Prototype | Identifying Characteristic
Brief     | always makes short calls to user
Verbose   | always makes long calls to user
Frequent  | calls user frequently
Reactive  | calls following a recent voicemail left by user
Proactive | calls prior to a meeting with user
A sample diary is shown in Figure 1. Note that the diary is relatively empty at the beginning and end of the week but relatively full in the middle of the week. The busy and overloaded prototypes are written to be applicable when there is a small proportion of free time in the immediate future, that is, during the latter part of Tuesday and Wednesday.
[Figure: weekly diary timeline plotting activity codes 1 design_review, 2 seminar, 3 research, 4 programming, 5 home against time in 15-minute intervals.]
Fig. 1. Sample of diary. The window for the working day has been defined as 7:00 am - 8:00 pm, and diaried activities for each fifteen minute period within the window are shown; unassigned slots represent free time which can be used as appropriate at the time.
Fig. 2. Performance of the user model on individual calls (correct prediction =1, incorrect prediction = 0) and as a cumulative success rate.
Figure 2 shows the success rate of the user model in predicting whether a call should be answered or diverted to voicemail. The drop in performance on the second day occurs because the busy and overloaded prototypes become active at this time, due to the full diary on Wednesday and Thursday. It takes a few iterations for the system to increase the membership of the user model in the busy and overloaded classes; once this has happened, the prediction rate increases again. The necessary and possible supports for membership of the user model in the busy class are shown in Figure 3, where the evolution of support can be seen on the third and fourth days where this prototype is applicable. At the start, the identifying characteristics (full diary) are not satisfied and support remains at unknown (0 1). In the middle of the week, the conditions are satisfied. Initially the user does not behave as predicted by this prototype and possible support drops (i.e. support against increases);
subsequently, the user behaves as predicted and necessary support increases. At the end of the week, once again the identifying characteristics are not satisfied. By repeating the week’s data, there is relatively little change in the support pairs—this suggests that the learning has converged, although additional testing is necessary. Evolution of caller models can also be followed within the system, and good convergence to a stable caller model is observed. It should be emphasised that the supports for each prototype can be adjusted by the user at any stage. The user modelling software has been tested in several diary and call log scenarios, with good rates of prediction accuracy. Further testing is needed to investigate the user model evolution over longer periods.
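The support evolution just described follows the update rule (1). A minimal sketch of one such update step is given below; it is an illustrative reading only, and in particular the assumption that (1) is applied to both components of a support pair is ours, not spelled out in the paper.

% update_support(+N, +OldPair, +PredictionPair, -NewPair):
% OldPair is the support pair [Nec, Pos] for "model m is in class c_i" after
% N observations; PredictionPair is the (normalised) support that the class
% predicted the observed behaviour on observation N+1.
update_support(N, [Nec, Pos], [PredNec, PredPos], [Nec1, Pos1]) :-
    N1 is N + 1,
    Nec1 is (N * Nec + PredNec) / N1,
    Pos1 is (N * Pos + PredPos) / N1.

% ?- update_support(1, [0, 1], [1, 1], P).   % from ignorance, one correct prediction
% P = [0.5, 1].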
Fig. 3. Evolution of support pair for the “busy” prototype in the user model.
5 Summary

The aim of user modelling is to increase the quality of interaction—this is almost always a subjective judgement, and it can be difficult to discuss the success (or otherwise) of user modelling. We have developed an experimental testbed based on the iterated prisoner’s dilemma, allowing generation of unlimited data. Prediction success rates vary between 80-95% for non-random behaviours in the testbed, and accuracy of over 80% has been obtained in a series of simulated tests of the telephone assistant. The user model changes with each observation, and there is very little overhead in updating the user model. This approach depends on having a “good” set of prototypes, giving reasonable coverage of possible user behaviour. It is assumed that a human expert is able to provide such a set; however, it is possible that (for example) inductive logic programming could generate new prototype behaviours. This is an interesting avenue for future research.
References 1. Azvine, B., D. Djian, K.C. Tsui and W. Wobcke, The Intelligent Assistant: An Overview. Lecture Notes in Computer Science, 2000(1804): p. 215-238. 2. Azvine, B. and W. Wobcke, Human-centred intelligent systems and soft computing. Bt Technology Journal, 1998. 16(3): p. 125-133. 3. Baldwin, J.F., The Management of Fuzzy and Probabilistic Uncertainties for Knowledge Based Systems, in Encyclopedia of AI, S.A. Shapiro, Editor. 1992, John Wiley. p. 528-537. 4. Baldwin, J.F., J. Lawry and T.P. Martin, A mass assignment based ID3 algorithm for decision tree induction. International Journal of Intelligent Systems, 1997. 12(7): p. 523552. 5. Baldwin, J.F., J. Lawry and T.P. Martin. Mass Assignment Fuzzy ID3 with Applications. in Proc. Fuzzy Logic - Applications and Future Directions. 1997. pp 278-294 London: Unicom. 6. Baldwin, J.F. and T.P. Martin, Fuzzy classes in object-oriented logic programming, in Fuzz-Ieee ‘96 - Proceedings of the Fifth Ieee International Conference On Fuzzy Systems, Vols 1-3. 1996. p. 1358-1364. 7. Baldwin, J.F. and T.P. Martin. A General Data Browser in Fril for Data Mining. in Proc. EUFIT-96. 1996. pp 1630-1634 Aachen, Germany. 8. Baldwin, J.F., T.P. Martin and B.W. Pilsworth, FRIL - Fuzzy and Evidential Reasoning in AI. 1995, U.K.: Research Studies Press (John Wiley). 391. 9. Bezdek, J.C., Fuzzy Models. IEEE Trans. Fuzzy Systems, 1993. 1(1): p. 1-5. 10. Bezdek, J.C., What is Computational Intelligence?, in Computational Intelligence Imitating Life, J.M. Zurada, R.J. Marks II, and C.J. Robinson, Editors. 1994, IEEE Press. p. 1-12. 11. Billsus, D. and M.J. Pazzani. A Hybrid User Model for News Story Classification. in Proc. User Modeling 99. 1999. pp 99-108 Banff, Canada. 12. Bushey, R., J.M. Mauney and T. Deelman. The Development of Behavior-Based User Models for a Computer System. in Proc. User Modeling 99. 1999. pp 109-118 Banff, Canada. 13. Case, S.J., N. Azarmi, M. Thint and T. Ohtani, Enhancing e-Communities with AgentBased Systems. IEEE Computer, 2001. 33(7): p. 64. 14. Hermens, L.A. and J.C. Schlimmer, A Machine-Learning Apprentice for The Completion of Repetitive Forms. IEEE Expert-Intelligent Systems & Their Applications, 1994. 9(1): 2833. 15. Horvitz, E., Uncertainty, action, and interaction: in pursuit of mixed-initiative computing. Ieee Intelligent Systems & Their Applications, 1999. 14(5): p. 17-20. 16. Horvitz, E., J. Breese, et al. The Lumière Project: Bayesian User Modeling for Inferring the Goals and Needs of Software USers. in Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence. 1998. pp 256-265 (see also http://www.research. microsoft.com/research/dtg/horvitz/lum.htm ) Madison, WI. 17. Langley, P. User Modeling in Adaptive Interfaces. in Proc. User Modeling 99. 1999. See http://www.cs.usask.ca/UM99/Proc/invited/Langley.pdf Banff, Canada. 18. Lau, T. and E. Horvitz. Patterns of Search: Analyzing and Modeling Web Query Refinement. in Proc. User Modeling 99. 1999. pp 119-128 Banff, Canada. 19. Linton, F., A. Charron and D. Joy, Owl: A Recommender System for Organization-Wide Learning. 1998, MITRE Corporation. http://www.mitre.org/ technology/tech_tats/modeling/owl/Coaching_Software_Skills.pdf. 20. Martin, T.P. Incremental Learning of User Models - an Experimental Testbed. in Proc. IPMU 2000. 2000. pp 1419-1426 Madrid. 21. Martin, T.P. and B. Azvine. Learning User Models for an Intelligent Telephone Assistant. in Proc. IFSA-NAFIPS 2001. 2001. pp 669 -674 Vancouver: IEEE Press. 22. Muggleton, S., Inductive Logic Programming. 
1992: Academic Press. 565.
23. Nunes, P. and A. Kambil, Personalization? No Thanks. Harvard Business Review, 2001. 79(4): p. 32-34. 24. Pazzani, M. and D. Billsus, Learning and revising user profiles: The identification of interesting Web sites. Machine Learning, 1997. 27(3): p. 313-331. 25. Pohl, W., Logic-based representation and reasoning for user modeling shell systems. User Modeling and User-Adapted Interaction, 1999. 9(3): p. 217-282. 26. Quinlan, J.R., Induction of Decision Trees. Machine Learning, 1986. 1: p. 81-106. 27. Quinlan, J.R., C4.5: Programs for Machine Learning. 1993: Morgan Kaufmann. 28. Rossiter, J.M., T. Cao, T.P. Martin and J.F. Baldwin. A Fril++ Compiler for Soft Computing Object-Oriented Logic Programming. in Proc. IIZUKA2000, 6th International Conference on Soft Computing. 2000. pp 340 - 345 Japan. 29. Rossiter, J.M., T.H. Cao, T.P. Martin and J.F. Baldwin. User Recognition in Uncertain Object Oriented User Modelling. in Proc. Tenth IEEE International Conference On Fuzzy Systems (Fuzz-IEEE 2001). 2001. pp Melbourne. 30. Salton, G. and G. Buckley, Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 1988. 24: p. 513-523. 31. W3C, Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation. 1999. 32. W3C, Metadata and Resource Description. 1999. 33. Webb, G., Preface to Special Issue on Machine Learning for User Modeling. User Modeling and User-Adapted Interaction, 1998. 8(1): p. 1-3. 34. Yiming, Z. Fuzzy User Profiling for Broadcasters and Service Providers. in Proc. Computational Intelligence for User Modelling. 1999. http://www.fen.bris.ac.uk/engmaths/research/aigroup/martin/ci4umProc/Zhou.pdf Bristol, UK.
A Query-Driven Anytime Algorithm for Argumentative and Abductive Reasoning Rolf Haenni Computer Science Department, University of California, Los Angeles, CA 90095
[email protected], http://haenni.shorturl.com
Abstract. This paper presents a new approximation method for computing arguments or explanations in the context of logic-based argumentative or abductive reasoning. The algorithm can be interrupted at any time returning the solution found so far. The quality of the approximation increases monotonically when more computational resources are available. The method is based on cost functions and returns lower and upper bounds.1
1 Introduction
The major drawback of most qualitative approaches to uncertainty management is the relatively high time and space complexity of their algorithms. To overcome this difficulty, appropriate approximation methods are needed. In the domain of argumentative and abductive reasoning, a technique called cost-bounded approximation has been developed for probabilistic argumentation systems [16,17]. Instead of computing intractably large sets of minimal arguments for a given query, the idea is to compute only the most relevant arguments, those not exceeding a certain cost bound. This is extremely useful and has many applications in different fields [1,3]. In model-based diagnostics, for example, computing arguments corresponds to determining minimal conflict sets and minimal diagnoses. Very often, intractably many conflict sets and diagnoses exist. The method presented in [16] is an elegant solution for such difficult cases. However, the question of choosing appropriate cost bounds and the problem of judging the quality of the approximation remain. The approach presented in this paper is based on the same idea of computing only the most relevant arguments. However, instead of choosing the cost bound first and then computing the corresponding arguments, the algorithm starts immediately by computing the most relevant arguments. It terminates as soon as no more computational resources (time or space) are available and returns the cost bound reached during its run. The result is a lower approximation that is sound but not complete. Furthermore, the algorithm returns an upper approximation that is complete but not sound. The difference between lower and upper approximation allows the user to judge the quality of the approximation.
Research supported by scholarship No. 8220-061232 of the Swiss National Science Foundation.
The algorithm is designed such that the cost bound (and therefore the quality of the approximation) increases monotonically when more resources are available. It can therefore be considered as an anytime algorithm that provides a result at any time. Note that this is very similar to the natural process of how people collect information from their environment. In legal cases, for example, resources (time, money, etc.) are limited, and the investigation therefore focuses on finding the most relevant and most obvious evidence, such that the corresponding costs remain reasonable. Another important property of the algorithm is the fact that the actual query is taken into account during its run. This ensures that those arguments which are of particular importance for the user's actual query are returned first. Such query-driven behavior corresponds to the natural way in which humans gather relevant information from different sources.
2 Probabilistic Argumentation Systems
The theory of probabilistic argumentation systems is based on the idea of combining classical logic with probability theory [17,15]. It is an alternative approach for non-monotonic reasoning under uncertainty. It allows open questions (hypotheses) about the unknown or future world to be judged in the light of the given knowledge. From a qualitative point of view, the problem is to find arguments in favor of and against the hypothesis of interest. An argument can be seen as a defeasible proof for the hypothesis. It can be defeated by counter-arguments. The strength of an argument is weighted by considering its probability. In this way, the credibility of a hypothesis can be measured by the total probability that it is supported or rejected by such arguments. The resulting degree of support and degree of possibility correspond to (normalized) belief and plausibility in Dempster-Shafer's theory of evidence [24,27,19]. A quantitative judgement is sometimes more useful and can help to decide whether a hypothesis should be accepted, rejected, or whether the available knowledge does not permit a decision. The technique of probabilistic argumentation systems generalizes de Kleer's and Reiter's original concept of assumption-based truth maintenance systems (ATMS) [6,7,8,23,18,10] by (1) removing the restriction to Horn clauses and (2) by adding probabilities in a similar way as Provan [22] or Laskey and Lehner [2]. Approximation techniques for intractably large sets of arguments have been proposed for ATMS by Forbus and de Kleer [14,9], by Collins and de Coste [5], and by Bigham et al. [4]. For the construction of a probabilistic (propositional) argumentation system, consider two disjoint sets A = {a1, . . . , am} and P = {p1, . . . , pn} of propositions. The elements of A are called assumptions. LA∪P denotes the corresponding propositional language. If ξ is an arbitrary propositional sentence in LA∪P, then a triple (ξ, P, A) is called (propositional) argumentation system. ξ is called knowledge base and is often specified by a conjunctively interpreted set Σ = {ξ1, . . . , ξr} of sentences ξi ∈ LA∪P or, more specifically, clauses
ξi ∈ DA∪P, where DA∪P denotes the set of all (proper) clauses over A ∪ P. We use Props(ξ) ⊆ A ∪ P to denote all the propositions appearing in ξ. The assumptions play an important role in expressing uncertain information. They are used to represent uncertain events, unknown circumstances and risks, or possible outcomes. Conjunctions of literals of assumptions are of particular interest. They represent possible scenarios or states of the unknown or future world. CA denotes the set of all such conjunctions. Furthermore, NA = {0, 1}|A| represents the set of all possible configurations relative to A. The elements s ∈ NA are called scenarios. The theory is based on the idea that one particular scenario ŝ ∈ NA is the true scenario. Consider now the case where a second propositional sentence h ∈ LA∪P called hypothesis is given. Hypotheses represent open questions or uncertain statements about some of the propositions in A ∪ P. What can be inferred from ξ about the possible truth of h with respect to the given set of unknown assumptions? Possibly, if some of the assumptions are set to true and others to false, then h may be a logical consequence of ξ. In other words, h is supported by certain scenarios s ∈ NA or corresponding arguments α ∈ CA. Note that counter-arguments refuting h are arguments supporting ¬h. More formally, let ξ←s be the formula obtained from ξ by instantiating all the assumptions according to their values in s. We can then decompose the set of scenarios NA into three disjoint sets
IA(ξ) = {s ∈ NA : ξ←s |= ⊥},   (1)
SPA(h, ξ) = {s ∈ NA : ξ←s |= h, ξ←s ⊭ ⊥},   (2)
RFA(h, ξ) = {s ∈ NA : ξ←s |= ¬h, ξ←s ⊭ ⊥} = SPA(¬h, ξ),   (3)
of inconsistent, supporting, and refuting scenarios, respectively. Furthermore, if NA(α) ⊆ NA denotes the set of models of a conjunction α ∈ CA, then we can define corresponding sets of supporting and refuting arguments of h relative to ξ by
SP(h, ξ) = {α ∈ CA : NA(α) ⊆ SPA(h, ξ)},   (4)
RF(h, ξ) = {α ∈ CA : NA(α) ⊆ RFA(h, ξ)},   (5)
respectively. Often, since SP (h, ξ) and RF (h, ξ) are upward-closed sets, only corresponding minimal arguments are considered. So far, hypotheses are only judged qualitatively. A quantitative judgment of the situation becomes possible if every assumption ai ∈ A is linked to a corresponding prior probability p(ai ) = πi . Let Π = {π1 , . . . , πm } denote the set of all prior probabilities. We suppose that the assumptions are mutually independent. This defines a probability distribution p(s) over the set NA of scenarios2 . Note that independent assumptions are common in many practical applications [1]. A quadruple (ξ, P, A, Π) is then called probabilistic argumentation system [17]. 2
In cases where no set of independent assumptions exists, the theory may also be defined on an arbitrary probability distribution over NA.
In order to judge h quantitatively, consider the conditional probability that the true scenario ŝ is in SPA(h, ξ) but not in IA(ξ). In the light of this remark,
dsp(h, ξ) = p(ŝ ∈ SPA(h, ξ) | ŝ ∉ IA(ξ))   (6)
is called degree of support of h relative to ξ. It is a value between 0 and 1 that represents quantitatively the support that h is true in the light of the given knowledge. Clearly, dsp(h, ξ) = 1 means that h is certainly true, while dsp(h, ξ) = 0 means that h is not supported (but h may still be true). Note that degree of support is equivalent to the notion of (normalized) belief in the Dempster-Shafer theory of evidence [24,27]. It can also be interpreted as the probability of the provability of h [21,26]. A second way of judging the hypothesis h is to look at the corresponding conditional probability that the true scenario ŝ is not in RFA(h, ξ). It represents the probability that ¬h cannot be inferred from the knowledge base. In such a case, h remains possible. Therefore, the conditional probability
dps(h, ξ) = p(ŝ ∉ RFA(h, ξ) | ŝ ∉ IA(ξ)) = 1 − dsp(¬h, ξ)   (7)
is called degree of possibility of h relative to ξ. It is a value between 0 and 1 that represents quantitatively the possibility that h is true in the light of the given knowledge. Clearly, dps(h, ξ) = 1 means that h is completely possible (there are no counter-arguments against h), while dps(h, ξ) = 0 means that h is false. Degree of possibility is equivalent to the notion of plausibility in the Dempster-Shafer theory. We have dsp(h, ξ) ≤ dps(h, ξ) for all h ∈ LA∪P and ξ ∈ LA∪P. Note that the particular case of dsp(h, ξ) = 0 and dps(h, ξ) = 1 represents total ignorance relative to h. An important property of degree of support and degree of possibility is that they behave non-monotonically when new information is added. More precisely, if ξ′ represents a new piece of information, then nothing can be said about the new values dsp(h, ξ ∧ ξ′) and dps(h, ξ ∧ ξ′). Compared to dsp(h, ξ) and dps(h, ξ), the new values may either decrease or increase; both cases are possible. This reflects a natural property of how a human's conviction or belief can change when new information is given. Non-monotonicity is therefore a fundamental property for any mathematical formalism for reasoning under uncertainty. Probabilistic argumentation systems show that non-monotonicity can be achieved in classical logic by adding probability theory in an appropriate way. This has already been noted by Mary McLeish in [20].
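To make the two measures concrete, the following is a minimal brute-force sketch (not taken from the paper): it enumerates all scenarios and all worlds of a tiny argumentation system, classifies each scenario as inconsistent, supporting, or refuting, and evaluates (6) and (7) directly. The encoding of ξ and h as Python predicates over truth assignments, and all function names, are our own illustrative choices; this is only feasible for very small A and P.

from itertools import product

def degree_of_support(xi, h, assumptions, props, prior):
    """Brute-force dsp/dps for a tiny probabilistic argumentation system.

    xi, h : functions mapping a dict {symbol: bool} to a bool
    assumptions, props : lists of symbol names (the sets A and P)
    prior : dict assumption -> probability of being true
    """
    p_support = p_refute = p_inconsistent = 0.0
    for values in product([False, True], repeat=len(assumptions)):
        s = dict(zip(assumptions, values))
        # probability of this scenario under independent assumptions
        p_s = 1.0
        for a in assumptions:
            p_s *= prior[a] if s[a] else 1.0 - prior[a]
        # models of xi restricted to this scenario
        models = []
        for pvals in product([False, True], repeat=len(props)):
            world = dict(s, **dict(zip(props, pvals)))
            if xi(world):
                models.append(world)
        if not models:                        # s is in IA(xi)
            p_inconsistent += p_s
        elif all(h(w) for w in models):       # s is in SPA(h, xi)
            p_support += p_s
        elif all(not h(w) for w in models):   # s is in RFA(h, xi) = SPA(not h, xi)
            p_refute += p_s
    norm = 1.0 - p_inconsistent
    return p_support / norm, 1.0 - p_refute / norm

# toy example: xi = (a1 -> x) and (a2 -> not x), hypothesis h = x
xi = lambda w: (not w["a1"] or w["x"]) and (not w["a2"] or not w["x"])
h = lambda w: w["x"]
print(degree_of_support(xi, h, ["a1", "a2"], ["x"], {"a1": 0.7, "a2": 0.2}))

For this toy knowledge base with priors 0.7 and 0.2, the sketch returns dsp ≈ 0.65 and dps ≈ 0.93.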
3 Computing Arguments
From a computational point of view, the main problem of dealing with probabilistic argumentation systems is to compute the set µQS(h, ξ) of minimal quasisupporting arguments with QS(h, ξ) = {α ∈ CA : α ∧ ξ |= h}. The term “quasi” expresses the fact that some quasi-supporting arguments of h may be in contradiction with the given knowledge. Knowing the sets µQS(h, ξ), µQS(¬h, ξ), and
µQS(⊥, ξ) allows then to derive supporting and refuting arguments, as well as degree of support and degree of possibility [17]. We use QSA (h, ξ) = NA (QS(h, ξ)) = {s ∈ NA : ξ←s |= h}
(8)
to denote corresponding sets of quasi-supporting scenarios. Suppose that the knowledge base ξ ∈ LA∪P is given as a set of clauses Σ = {ξ1, . . . , ξr} with ξi ∈ DA∪P and ξ = ξ1 ∧ · · · ∧ ξr. Furthermore, let H ⊆ DA∪P be another set of clauses such that ∧H ≡ ¬h. ΣH = µ(Σ ∪ H) then denotes the corresponding minimal clause representation of ξ ∧ ¬h obtained from Σ ∪ H by dropping subsumed clauses.
3.1 Exact Computation
The problem of computing minimal quasi-supporting arguments is closely related to the problem of computing prime implicants. Quasi-supporting arguments for h are conjunctions α ∈ CA for which α ∧ ξ |= h holds. This condition can be rewritten as ξ∨¬h |= ¬α or ΣH |= ¬α, respectively. Quasi-supporting arguments are therefore negations of implicates of ΣH which are in DA . In other words, if δ ∈ DA is an implicate of ΣH , then ¬δ is a quasi-supporting argument for h. We use PI(ΣH ) to denote the set of all prime implicates of ΣH . If ¬Ψ is the set of conjunctions obtained from a set of clauses Ψ by negating the corresponding clauses, then µQS(h, ξ) = ¬(PI(ΣH ) ∩ DA ).
(9)
Since computing prime implicates is known to be NP-complete in general, the above approach is only feasible when ΣH is relatively small. However, when A is small enough, many prime implicates of ΣH are not in DA and are therefore irrelevant for the minimal quasi-support. Such irrelevant prime implicates can be avoided by the method described in [17]. The procedure is based on two operations ConsQ (Σ) = Consx1 ◦ · · · ◦ Consxq (Σ),
(10)
ElimQ (Σ) = Elimx1 ◦ · · · ◦ Elimxq (Σ),
(11)
where Σ is an arbitrary set of clauses and Q = {x1 , . . . , xq } a subset of propositions appearing in Σ. Both operations repeatedly apply more specific operations Consx (Σ) and Elimx (Σ), respectively, where x denotes a proposition in Q. Let Σx denote the clauses of Σ containing x as a positive literal, Σx¯ the clauses containing x as a negative literal, and Σx˙ the clauses not containing x. Furthermore, if ρ(Σx , Σx¯ ) = {ϑ1 ∨ ϑ2 : x ∨ ϑ1 ∈ Σx , ¬x ∨ ϑ2 ∈ Σx¯ }
(12)
denotes the set of all resolvents of Σ relative to x, then Consx (Σ) = µ(Σ ∪ ρ(Σx , Σx¯ )) and Elimx (Σ) = µ(Σx˙ ∪ ρ(Σx , Σx¯ )).
Thus, ConsQ (Σ) computes all the resolvents (consequences) of Σ relative to the propositions in Q and adds them to Σ. Note that if Q contains all the proposition in Σ, then ConsQ (Σ) = PI(Σ). In contrast, ElimQ (Σ) eliminates all the propositions in Q from Σ and returns a new set of clauses whose set of models corresponds to the projection of the original set of models. Note that from a theoretical point of view, the order in which the propositions are eliminated is irrelevant [17], whereas from a practical point of view, it critically influences the efficiency of the procedure. Note that the elimination process is a particular case of Shenoy’s fusion algorithm [25] as well as of Dechter’s bucket elimination procedure [13]. The set of the minimal quasi-supporting arguments can then be computed in two different ways by µQS(h, ξ) = ¬ConsA (ElimP (ΣH )) = ¬ElimP (ConsA (ΣH )).
(13)
Note that in many practical applications, computing the consequences relative to the propositions in A is trivial. In contrast, the elimination of the propositions in P is usually much more difficult and becomes infeasible as soon as ΣH reaches a certain size.
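As an illustration of the operators of this subsection, here is a small propositional sketch in Python (our own encoding, not the authors' implementation): clauses are frozensets of (symbol, polarity) literals, µ drops subsumed clauses, and quasi_supports follows equation (13) by eliminating the propositions of P and then closing under resolution over A before negating the resulting clauses.

def mu(clauses):
    """Drop subsumed clauses (keep only set-minimal ones)."""
    clauses = set(clauses)
    return {c for c in clauses
            if not any(d < c for d in clauses)}   # d < c: d properly subsumes c

def resolvents(pos, neg, x):
    """All non-tautological resolvents on proposition x."""
    out = set()
    for c in pos:
        for d in neg:
            r = (c - {(x, True)}) | (d - {(x, False)})
            if not any((s, not p) in r for (s, p) in r):
                out.add(frozenset(r))
    return out

def elim(clauses, x):
    """Elim_x: eliminate proposition x (one bucket-elimination step)."""
    pos = {c for c in clauses if (x, True) in c}
    neg = {c for c in clauses if (x, False) in c}
    return mu((clauses - pos - neg) | resolvents(pos, neg, x))

def cons(clauses, x):
    """Cons_x: add all resolvents on x, keeping the original clauses."""
    pos = {c for c in clauses if (x, True) in c}
    neg = {c for c in clauses if (x, False) in c}
    return mu(clauses | resolvents(pos, neg, x))

def quasi_supports(sigma_h, assumptions, props):
    """muQS via equation (13): eliminate P, close over A, negate the clauses over A."""
    clauses = mu(frozenset(c) for c in sigma_h)
    for p in props:
        clauses = elim(clauses, p)
    for a in assumptions:
        clauses = cons(clauses, a)
    return [{(s, not pol) for (s, pol) in c} for c in clauses
            if all(s in assumptions for (s, _) in c)]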
3.2 Cost-Bounded Approximation
A possible approximation technique is based on cost functions c : CA → IR+. Conjunctions α with low costs c(α) are preferred and therefore more relevant. We require that α ⊆ α′ implies c(α) ≤ c(α′). This condition is called the monotonicity criterion. Examples of common cost functions are:
– the length of the conjunction (number of literals): c(α) = |α|,
– the probability of the negated conjunction: c(α) = 1 − p(ŝ ∈ NA(α)).
The idea of using the length of the conjunctions as cost function is that short conjunctions are usually more weighty arguments. Clearly, if α is a conjunction in CA, then an additional literal ℓ is a supplementary condition to be satisfied, and α ∧ ℓ is therefore less probable than α. From this point of view, the length of a conjunction expresses somehow its probability. However, if probabilities are assigned to the assumptions, then it is possible to specify the probability of a conjunction more precisely. That is the idea behind the second suggestion. Let β ∈ IR+ be a fixed bound for a monotone cost function c(α). A conjunction α ∈ CA is called β-relevant if and only if c(α) < β. Otherwise, α is called β-irrelevant. The set of all β-relevant conjunctions for a fixed cost bound β is denoted by CAβ = {α ∈ CA : c(α) < β}.
(14)
Note that CAβ is a downward-closed set. This means that α ∈ CAβ implies that every (shorter) conjunction α′ ⊆ α is also in CAβ. Evidently, CA0 = ∅ and CA∞ = CA. An
approximated set of minimal quasi-supporting arguments can then be defined by µQS(h, ξ, β) = µQS(h, ξ) ∩ CAβ .
(15)
The corresponding set of scenarios is denoted by QSA(h, ξ, β). Note that µQS(h, ξ, β) is sound but not complete since µQS(h, ξ, β) ⊆ µQS(h, ξ). It can therefore be seen as a lower approximation of the exact set µQS(h, ξ). In order to compute µQS(h, ξ, β) efficiently, corresponding downward-closed sets are defined over the set of clauses DA∪P. Obviously, every clause ξ ∈ DA∪P can be split into sub-clauses ξA and ξP by
ξ = ℓ1 ∨ · · · ∨ ℓk ∨ ℓk+1 ∨ · · · ∨ ℓm = ξA ∨ ξP, with ℓ1, . . . , ℓk ∈ A± and ℓk+1, . . . , ℓm ∈ P±,   (16)
where A± and P± are the sets of literals of A and P, respectively. Such a clause can also be written as an implication ¬ξA → ξP where Arg(ξ) = ¬ξA is a conjunction in CA. The set of clauses ξ for which the corresponding conjunction Arg(ξ) is in CAβ can then be defined by
DA∪Pβ = {ξ ∈ DA∪P : Arg(ξ) ∈ CAβ}.
(17)
A new elimination procedure called β-elimination can then be defined by
ElimQβ(Σ) = Elimx1β ◦ · · · ◦ Elimxqβ(Σ),   (18)
where the clauses not belonging to DA∪Pβ are dropped at each step of the process by Elimxβ(Σ) = Elimx(Σ) ∩ DA∪Pβ. In this way, the approximated set of quasi-supporting arguments can be computed by
µQS(h, ξ, β) = ¬ElimPβ(ConsA(ΣH)) = ¬ElimPβ(ConsA(ΣH) ∩ DA∪Pβ).   (19)
See [17] for a more detailed discussion and the proof of the above formula. Two major problems remain. First, it is difficult to choose a suitable cost bound β in advance (if β is too low, then the result may be unsatisfactory; if β is too high, then the procedure risks getting stuck). Second, there is no means of judging the quality of the approximation.
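The two cost functions suggested above can be written down directly; the following small sketch (names and encoding ours) also shows the β-filter of equations (14)–(15), assuming a conjunction is represented as a set of (assumption, polarity) literals and that the prior probabilities of the assumptions are given.

def cost_length(alpha):
    """Cost = number of literals of the conjunction alpha (first suggestion)."""
    return len(alpha)

def cost_improbability(alpha, prior):
    """Cost = 1 - p(s in NA(alpha)) for independent assumptions (second suggestion)."""
    p = 1.0
    for (a, positive) in alpha:
        p *= prior[a] if positive else 1.0 - prior[a]
    return 1.0 - p

def beta_relevant(conjunctions, cost, beta):
    """Keep only the beta-relevant conjunctions, i.e. those with c(alpha) < beta."""
    return [alpha for alpha in conjunctions if cost(alpha) < beta]

Both functions satisfy the monotonicity criterion: adding a literal can only increase the length and can only decrease the probability of the conjunction.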
4 Anytime Algorithm
The algorithm introduced below helps to overcome the difficulties mentioned at the end of the previous section. Instead of first choosing the cost bound and then computing the corresponding arguments, the algorithm starts immediately by computing the most relevant arguments and terminates as soon as no more computational resources (usually time) are available. Finally, it returns two minimal sets LB and UB of (potential) minimal arguments with
NA(LB) ⊆ QSA(h, ξ) ⊆ NA(UB) and a cost bound β with LB ⊇ µQS(h, ξ, β). Obviously, the set LB is considered as a lower bound (sound but not complete) while UB is considered as an upper bound (complete but not sound). From this it follows immediately that
p(ŝ ∈ NA(LB)) ≤ p(ŝ ∈ QSA(h, ξ)) ≤ p(ŝ ∈ NA(UB)).
(20)
Furthermore, if LB′, UB′, and β′ are the corresponding results for the same input parameters (knowledge base, hypothesis, cost function) but with more computational resources, then NA(LB) ⊆ NA(LB′), NA(UB) ⊇ NA(UB′), and β ≤ β′. The quality of the approximation thus increases monotonically during the execution of the algorithm. If the algorithm terminates before all computational resources are used or if unlimited computational resources are available, then the algorithm returns the exact result µQS(h, ξ) = LB = UB and β = ∞.
The idea for the algorithm comes from viewing the procedure described in the previous section from the perspective of Dechter's bucket elimination procedure [13]. From this point of view, the clauses contained in ConsA(ΣH) are initially distributed among an ordered set of buckets. There is exactly one bucket for every proposition in P. If a clause contains several propositions from P, then the first appropriate bucket of the sequence is selected. In a second step, the elimination procedure takes place among the sequence of buckets. The idea now is similar. However, instead of processing the whole set of clauses at once, the clauses are now introduced iteratively one after another, starting with those having the lowest cost. At each step of the process, possible resolvents are computed and added to the list of remaining clauses. Subsumed clauses are dropped. As soon as a clause containing only propositions from A is detected, it is considered as a possible result. For a given sequence of buckets, this produces exactly the same set of resolvents as in the usual bucket elimination procedure, but in a different order. It guarantees that the most relevant arguments are produced first. The algorithm works with different sets of clauses:
– Σ ⇒ the remaining set of clauses, initialized to ConsA(ΣH) \ DA,
– Σ0 ⇒ the results, initialized to ConsA(ΣH) ∩ DA,
– Σ1, . . . , Σn ⇒ the corresponding sequence of buckets for all propositions in P = {p1, . . . , pn}, all initialized to ∅.
The details of the whole procedure are described below. The process terminates as soon as Σ = ∅ or when no more resources are available.
[01] Function Quasi-Support(P, A, Σ, h, c);
[02] Begin
[03]   Select H ⊆ DA∪P such that ∧H ≡ ¬h;
[04]   ΣH ← µ(Σ ∪ H);
[05]   Σ ← ConsA(ΣH) \ DA;
[06]   Σ0 ← ConsA(ΣH) ∩ DA;
[07]   Σi ← ∅ for all i = 1, . . . , n;
[08]   Loop While Σ ≠ ∅ And Resources() > 0 Do
[09]   Begin
[10]     Select ξ ∈ Σ such that c(Arg(ξ)) = min({c(Arg(ξ′)) : ξ′ ∈ Σ});
[11]     Σ ← Σ \ {ξ};
[12]     k ← min({i ∈ {1, . . . , n} : pi ∈ Props(ξ)});
[13]     If pk ∈ ξ
[14]     Then R ← ρ({ξ}, {ξ′ ∈ Σk : ¬pk ∈ ξ′});
[15]     Else R ← ρ({ξ′ ∈ Σk : pk ∈ ξ′}, {ξ});
[16]     Σ ← Σ ∪ (R \ DA);
[17]     Σ0 ← Σ0 ∪ (R ∩ DA);
[18]     Σk ← Σk ∪ {ξ};
[19]     S ← µ(Σ ∪ Σ0 ∪ · · · ∪ Σn);
[20]     Σ ← Σ ∩ S;
[21]     Σi ← Σi ∩ S for all i = 1, . . . , n;
[22]   End;
[23]   LB ← {Arg(ξ) : ξ ∈ Σ0};
[24]   UB ← µ{Arg(ξ) : ξ ∈ Σ0 ∪ Σ};
[25]   β ← min({c(Arg(ξ)) : ξ ∈ Σ} ∪ {∞});
[26]   Return (LB, UB, β);
[27] End.
At each step of the iteration, the following operations take place:
– line [10]: select a clause ξ from Σ such that the corresponding cost c(Arg(ξ)) is minimal;
– line [11]: remove ξ from Σ;
– line [12]: select the first proposition pk ∈ P with pk ∈ Props(ξ) and
  • lines [13]–[15]: compute all resolvents of ξ with Σk,
  • lines [16] and [17]: add the resolvents either to Σ or Σ0,
  • line [18]: add ξ to Σk,
  • lines [19]–[21]: remove subsumed clauses from Σ, Σ0, Σ1, . . . , Σn.
Finally, LB is obtained from Σ0 and β from the clauses remaining in Σ. Furthermore, UB can be derived from Σ0 and Σ. Note that the procedure is a true anytime algorithm, giving progressively better solutions as time goes on and also giving a response however little time has elapsed [12,11]. In fact, it satisfies most of the basic requirements of anytime algorithms [28]:
– measurable quality: the precision of the approximate result is known,
– monotonicity: the precision of the result is growing in time,
– consistency: the quality of the result is correlated with the computation time and the quality of the inputs,
– diminishing returns: the improvement of the solution is larger at the early stages of computation and it diminishes over time,
– interruptibility: the process can be interrupted at any time and provides some answer,
– preemptability: the process can be suspended and continued with minimal overhead.
The proofs of correctness will appear in one of the author's forthcoming technical reports.
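The following Python sketch mirrors the pseudocode above, using the same clause encoding as in the earlier sketch (frozensets of (symbol, polarity) literals). It is a schematic re-implementation, not the author's code: Resources() is crudely replaced by a maximum number of iterations, ΣH is assumed to be given already (lines [03]–[04] and the initial ConsA step are omitted), and the tautology and subsumption tests are naive.

def anytime_quasi_support(sigma_h, assumptions, props, cost, max_steps):
    """Anytime computation of quasi-supporting arguments; returns (LB, UB, beta)."""
    def arg(clause):   # Arg(xi): negation of the assumption part of the clause
        return frozenset((s, not pol) for (s, pol) in clause if s in assumptions)

    def only_assumptions(clause):
        return all(s in assumptions for (s, _) in clause)

    def subsumed(c, family):
        return any(d < c for d in family)

    sigma = {frozenset(c) for c in sigma_h if not only_assumptions(c)}
    results = {frozenset(c) for c in sigma_h if only_assumptions(c)}
    buckets = {p: set() for p in props}

    for _ in range(max_steps):                               # stand-in for Resources()
        if not sigma:
            break
        xi = min(sigma, key=lambda c: cost(arg(c)))          # line [10]
        sigma.remove(xi)                                     # line [11]
        k = next(p for p in props if (p, True) in xi or (p, False) in xi)
        pol = (k, True) in xi                                # lines [12]-[15]
        new = set()
        for d in (c for c in buckets[k] if (k, not pol) in c):
            r = (xi - {(k, pol)}) | (d - {(k, not pol)})
            if not any((s, not q) in r for (s, q) in r):     # drop tautologies
                new.add(frozenset(r))
        buckets[k].add(xi)                                   # line [18]
        sigma |= {r for r in new if not only_assumptions(r)} # line [16]
        results |= {r for r in new if only_assumptions(r)}   # line [17]
        everything = sigma | results | set().union(*buckets.values())
        sigma = {c for c in sigma if not subsumed(c, everything)}
        results = {c for c in results if not subsumed(c, everything)}
        for p in props:                                      # lines [19]-[21]
            buckets[p] = {c for c in buckets[p] if not subsumed(c, everything)}

    lb = {arg(c) for c in results}                           # line [23]
    cand = {arg(c) for c in results | sigma}
    ub = {a for a in cand if not any(b < a for b in cand)}   # line [24]
    beta = min((cost(arg(c)) for c in sigma), default=float("inf"))
    return lb, ub, beta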
4.1 Example
In order to illustrate the algorithm introduced in the previous subsection, consider a communication network with 6 participants (A, B, C, D, E, F ) and 9 connections. The question is whether A is able to communicate with F or not. This is expressed by h = A → F . It is assumed that connection 1 (between A and B) works properly with probability 0.1, connection 2 with probability 0.2, etc.
Fig. 1. A simple Communication Network.
The original knowledge base Σ consists of 9 clauses ξ3 , ξ5 , ξ6 , ξ8 , ξ10 , ξ11 , ξ13 , ξ17 and ξ18 . Furthermore, H = {A, ¬F } = {ξ1 , ξ2 } and ΣH = µ(Σ ∪ H) = (Σ ∪ H) \ {ξ11 }, since ξ11 is subsumed by ξ1 . The following table shows all the clauses produced during the process (ordered according to their probabilities).
Clause                                       Prob.
ξ1  (H)   A                                  1.0
ξ2  (H)   ¬F                                 1.0
ξ3  (Σ)   ¬E ∨ F ∨ ¬a9                       0.9
ξ4        ¬E ∨ ¬a9                           0.9
ξ5  (Σ)   ¬D ∨ E ∨ ¬a8                       0.8
ξ6  (Σ)   ¬C ∨ F ∨ ¬a7                       0.7
ξ7        ¬C ∨ ¬a7                           0.7
ξ8  (Σ)   ¬B ∨ F ∨ ¬a6                       0.6
ξ9        ¬B ∨ ¬a6                           0.6
ξ10 (Σ)   ¬E ∨ B ∨ ¬a5                       0.5
ξ11 (Σ)   ¬E ∨ A ∨ ¬a4                       0.4
ξ12       ¬E ∨ ¬a5 ∨ ¬a6                     0.3
ξ13 (Σ)   ¬A ∨ D ∨ ¬a3                       0.3
ξ14       D ∨ ¬a3                            0.3
ξ15       E ∨ ¬a3 ∨ ¬a8                      0.24
ξ16 *     ¬a3 ∨ ¬a8 ∨ ¬a9                    0.216
ξ17 (Σ)   ¬B ∨ C ∨ ¬a2                       0.2
ξ18 (Σ)   ¬A ∨ B ∨ ¬a1                       0.1
ξ19       B ∨ ¬a1                            0.1
ξ20       ¬E ∨ C ∨ ¬a2 ∨ ¬a5                 0.1
ξ21 *     ¬a3 ∨ ¬a5 ∨ ¬a6 ∨ ¬a8              0.072
ξ22       ¬E ∨ ¬a2 ∨ ¬a5 ∨ ¬a7               0.07
ξ23 *     ¬a1 ∨ ¬a6                          0.06
ξ24       C ∨ ¬a1 ∨ ¬a2                      0.02
ξ25 *     ¬a2 ∨ ¬a3 ∨ ¬a5 ∨ ¬a7 ∨ ¬a8        0.017
ξ26 *     ¬a1 ∨ ¬a2 ∨ ¬a7                    0.014
(Σ) marks the initial clauses of Σ, (H) the clauses of H, and * the clauses yielding the minimal quasi-supporting arguments.
The minimal quasi-supporting arguments for h = A → F are finally obtained from the clauses ξ16, ξ21, ξ23, ξ25, and ξ26 (marked with * above): µQS(h, Σ) = {¬ξ16, ¬ξ21, ¬ξ23, ¬ξ25, ¬ξ26} = {a3 ∧ a8 ∧ a9, a3 ∧ a5 ∧ a6 ∧ a8, a1 ∧ a6, a2 ∧ a3 ∧ a5 ∧ a7 ∧ a8, a1 ∧ a2 ∧ a7}. Note that every minimal quasi-supporting argument in µQS(h, Σ) corresponds to a minimal path from node A to node F in the communication network. If we take A, F, B, D, C, E as the elimination sequence (order of the buckets), then the complete run of the algorithm can be described as in the table shown below (where c(α) = 1 − p(ŝ ∈ NA(α)) serves as cost function).
[Table: complete run of the algorithm for the elimination sequence A, F, B, D, C, E. Columns: step (1–20), selected clause, remaining clauses in Σ, bucket proposition pi, bucket contents Σ1, . . . , Σ6, resolvents R produced at that step, results added to Σ0 (ξ16, ξ21, ξ23, ξ25, ξ26), and the current value of 1 − β, which decreases from 1.00 down to −∞.]
Every row describes a single step. The 2nd column shows the most probable clause in Σ (the one with the lowest cost) that is selected for the next step. The 3rd column contains the remaining clauses in Σ. The 4th column indicates the first proposition in the given sequence that appears in the selected clause. This determines the corresponding bucket Σi into which the selected clause is added (columns 5 to 10). Cross-marked clauses are subsumed by others and can
be dropped. Then, column 11 shows the resolvents produced at that step. Resolvents containing only propositions from A are added to column 12. Finally, the last column indicates the current value of the bound (1 − β). At step 1, ξ1 = A is selected and added to Σ1. There are no resolvents and no subsumed clauses. Σ1 then contains the single clause ξ1 while Σ0 is still empty. At step 3, for example, ξ3 = ¬E ∨ F ∨ ¬a9 is selected and added to Σ2, which already contains ξ2 = ¬F from step 2. A new resolvent ξ4 = ¬E ∨ ¬a9 can then be derived from ξ2 and ξ3. Since ξ3 is subsumed by ξ4, it can be dropped. The new clause ξ4 is then added to Σ. Σ0 is still empty. Later, for example at step 14, ξ15 = E ∨ ¬a3 ∨ ¬a8 is selected and added to Σ6, which contains two clauses ξ4 = ¬E ∨ ¬a9 and ξ12 = ¬E ∨ ¬a5 ∨ ¬a6 from previous steps. Two new resolvents ξ16 = ¬a3 ∨ ¬a8 ∨ ¬a9 and ξ21 = ¬a3 ∨ ¬a5 ∨ ¬a6 ∨ ¬a8 are produced and added to Σ0. These are the first two results. After step 20, Σ is empty and the algorithm terminates. Observe that the clauses representing the query H = {A, ¬F} = {ξ1, ξ2} are processed first. This considerably influences the rest of the run of the algorithm and guarantees that those arguments which are of particular importance for the user's actual query are returned first. Such query-driven behavior is an important property of the algorithm. It corresponds to the natural way in which a human gathers relevant information from a knowledge base.
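The correspondence between minimal arguments and paths can be checked mechanically. Reading each clause of Σ as a directed link (e.g. ξ18 = ¬A ∨ B ∨ ¬a1 as A ∧ a1 → B — our reading of the example, since the figure itself is not reproduced here), a simple path enumeration recovers exactly the five argument sets listed above:

# Directed links taken from the clauses of Sigma: X and ai imply Y is the edge X --ai--> Y.
EDGES = {
    "A": [("B", "a1"), ("D", "a3")],
    "B": [("C", "a2"), ("F", "a6")],
    "C": [("F", "a7")],
    "D": [("E", "a8")],
    "E": [("A", "a4"), ("B", "a5"), ("F", "a9")],
    "F": [],
}

def paths(node, goal, used=frozenset(), visited=frozenset()):
    """Enumerate the sets of working connections along simple paths node -> goal."""
    if node == goal:
        yield used
        return
    for nxt, link in EDGES[node]:
        if nxt not in visited:
            yield from paths(nxt, goal, used | {link}, visited | {node})

print(sorted(sorted(p) for p in paths("A", "F")))
# -> {a1,a6}, {a1,a2,a7}, {a3,a8,a9}, {a3,a5,a6,a8}, {a2,a3,a5,a7,a8}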
4.2 Experimental Results
This section discusses the results of testing the algorithm on a problem of more realistic size. The discussion focusses on lower and upper bounds and compares them to the exact results. The knowledge base consists of 26 propositions, 39 assumptions and 74 initial clauses. It describes a communication network like the one used in the previous subsection. The exact solution for a certain query consists of 1,008 minimal arguments (shortest paths). For a given elimination sequence, the complete procedure generates 211,828 resolvents. The corresponding degree of support is 0.284. Figure 2 shows how the approximated solution monotonically approaches the exact value during the process.
Fig. 2. Left, the complete run of the algorithm for Example 1; right, the first 2,000 resolvents.
Observe that after generating approximately 1,000 resolvents, the algorithm has found the first 8 arguments and returns a numerical lower bound that agrees with the exact solution in the first two digits after the decimal point. These first 8 arguments are found in less than 1 second (instead of approximately 15 minutes for the 211,828 resolutions of the complete solution). The upper bound converges a little more slowly. This is a typical behavior that has been observed for many other examples from different domains [1].
5 Conclusion
This paper introduces a new algorithm for approximated assumption-based reasoning. The advantages over other existing approximation methods are twofold: (1) it is an anytime algorithm that monotonically increases the quality of the result as more computational resources become available; (2) the algorithm produces not only a lower but also an upper approximation without significant additional computational cost. These two improvements are extremely useful and can be considered as an important step towards the practical applicability of logic-based abduction and argumentation in general, and probabilistic argumentation systems in particular.
References 1. B. Anrig, R. Bissig, R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumentation systems: Introduction to assumption-based modeling with ABEL. Technical Report 99-1, Institute of Informatics, University of Fribourg, 1999. 2. K. B. Laskey and P. E. Lehner. Assumptions, beliefs and probabilities. Artificial Intelligence, 41(1):65–77, 1989. 3. D. Berzati, R. Haenni, and J. Kohlas. Probabilistic argumentation systems and abduction. In C. Baral and M. Truszczynski, editors, Proccedings of the 8th International Workshop on Non-Monotonic Reasoning, Breckenridge Colorado, 2000. 4. J. Bigham, Z. Luo, and D. Banerjee. A cost bounded possibilistic ATMS. In Christine Froidevaux and J¨ urg Kohlas, editors, Proceedings of the ECSQARU Conference on Symbolic and Quantitive Approaches to Reasoning and Uncertainty, volume 946, pages 52–59, Berlin, 1995. Springer Verlag. 5. J. W. Collins and D. de Coste. CATMS: An ATMS which avoids label explosions. In Kathleen McKeown and Thomas Dean, editors, Proceedings of the 9th National Conference on Artificial Intelligence, pages 281–287. MIT Press, 1991. 6. J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28:127–162, 1986. 7. J. de Kleer. Extending the ATMS. Artificial Intelligence, 28:163–196, 1986. 8. J. de Kleer. Problem solving with the ATMS. Artificial Intelligence, 28:197–224, 1986. 9. J. de Kleer. Focusing on probable diagnoses. In Kathleen Dean, Thomas L.; McKeown, editor, Proceedings of the 9th National Conference on Artificial Intelligence, pages 842–848. MIT Press, 1991. 10. J. de Kleer. A perspective on assumption-based truth maintenance. Artificial Intelligence, 59(1–2):63–67, 1993.
11. T. Dean and M. Boddy. An analysis of time-dependent planning. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), pages 49–54. MIT Press, 1988. 12. T. L. Dean. Intrancibility and time-dependent planning. In M. P. Geofrey and A .L. Lansky, editors, Proceedings of the 1986 Workshop on Reasoning about Actions and Plans. Morgan Kaufmann Publishers, 1987. 13. R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1–2):41–85, 1999. 14. K. D. Forbus and J. de Kleer. Focusing the ATMS. In Tom M. Smith and Reid G. Mitchell, editors, Proceedings of the 7th National Conference on Artificial Intelligence, pages 193–198, St. Paul, MN, 1988. Morgan Kaufmann. 15. R. Haenni. Modeling uncertainty with propositional assumption-based systems. In S. Parson and A. Hunter, editors, Applications of Uncertainty Formalisms, Lecture Notes in Artifical Intelligence 1455, pages 446–470. Springer-Verlag, 1998. 16. R. Haenni. Cost-bounded argumentation. International Journal of Approximate Reasoning, 26(2):101–127, 2001. 17. R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumentation systems. In J. Kohlas and S. Moral, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 5: Algorithms for Uncertainty and Defeasible Reasoning. Kluwer, Dordrecht, 2000. 18. K. Inoue. An abductive procedure for the CMS/ATMS. In J. P. Martins and M. Reinfrank, editors, Truth Maintenance Systems, Lecture Notes in A.I., pages 34–53. Springer, 1991. 19. J. Kohlas and P. A. Monney. A Mathematical Theory of Hints. An Approach to the Dempster-Shafer Theory of Evidence, volume 425 of Lecture Notes in Economics and Mathematical Systems. Springer, 1995. 20. M. McLeish. Nilsson’s probabilistic entailment extended to Dempster-Shafer theory. In L. N. Kanal, T. S. Levitt, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, volume 8, pages 23–34. North-Holland, Amsterdam, 1987. 21. J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. 22. G. M. Provan. A logic-based analysis of Dempster-Shafer theory. International Journal of Approximate Reasoning, 4:451–495, 1990. 23. R. Reiter and J. de Kleer. Foundations of assumption-based truth maintenance systems: Preliminary report. In Kenneth Forbus and Howard Shrobe, editors, Proceedings of the Sixth National Conference on Artificial Intelligence, pages 183– 188. American Association for Artificial Intelligence, AAAI Press, 1987. 24. G. Shafer. The Mathematical Theory of Evidence. Princeton University Press, 1976. 25. P. P. Shenoy. Binary join trees. In Eric Horvitz and Finn Jensen, editors, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 492–499, San Francisco, 1996. Morgan Kaufmann Publishers. 26. Ph. Smets. Probability of provability and belief functions. In M. Clarke, R. Kruse R., and S. Moral, editors, Proceedings of the ECSQARU’93 conference, pages 332– 340. Springer-Verlag, 1993. 27. Ph. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 66:191–234, 1994. 28. S. Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, Fall:71–83, 1996.
Proof Length as an Uncertainty Factor in ILP Gilles Richard and Fatima Zohra Kettaf IRIT - UMR CNRS 5505 118 Rte de Narbonne 31062 Toulouse cedex 4 {kettaf, richard}@irit.fr
Abstract. A popular idea is that the longer the proof the riskier the truth prediction. In other words, the uncertainty degree over a conclusion is an increasing function of the length of its proof. In this paper, we analyze this idea in the context of Inductive Logic Programming. Some simple probabilistic arguments lead to the conclusion that we need to reduce the length of the clause bodies to reduce uncertainty degree (or to increase accuracy). Inspired by the boosting technique, we propose a way to implement the proof reduction by introducing weights in a well-known ILP system. Our preliminary experiments confirm our predictions.
1 Introduction
It is a familiar idea that, when a proof is long, it is difficult to verify and the number of potential errors increases. This is our starting point and we apply this evidence to Inductive Logic Programming (ILP). Recall that the aim of an ILP machine is to produce explanations H (i.e. logical formulas) for observable phenomena described by data E (a set of examples). We hope E is "provable" from H. Following the previous popular assertion, the uncertainty degree concerning the truth of a given fact increases with the length of its proof. So, we decided to measure this uncertainty degree by the length of the proof and to weight a given fact with this number. At the end of a learning process, we have an associated weight for each training example. But what can we do with this weight to improve the learning process? Here we take our inspiration from a machine-learning technique known as boosting [10]. The idea is to repeat a full learning process, modifying the weight of each training example between turns. At the end of the process, we get a finite set of classifiers: the final classifier is a linear combination of the intermediate ones. We follow these lines to improve the behavior of a well-known ILP system: Progol [8]. The combination of boosting and ILP has already been proposed in [9,6], but our idea is rather different. For each training example, we compute its uncertainty degree (which is not an error degree as in [9]) by using a trick in the Progol machinery. This degree is considered as a weight for a given example. We see, in our experimentation, that the behavior of "Progol with loop" is, of course, better than Progol alone but, and this is more convincing, better than "Progol with randomly chosen weights". In section 2, we briefly describe the ILP framework. In
section 3, we use some simple probabilistic arguments to validate the initial idea concerning the relationship between uncertainty and the length of the proofs. Section 4 is devoted to the target ILP system we want to optimize by introducing weighted examples. In section 6, we describe our implementation and we give the results of some practical experiments. Finally, we discuss in section 7 and conclude in section 8.
2 ILP: A Brief Survey
We assume some familiarity with the usual notions of first order logic and logic programming (see for instance [7] for a complete development). In standard inductive logic programming, a concept c is a Herbrand interpretation, that is, a subset of the Herbrand base, the full set of ground atoms. The result of the learning process is a logic program H. The question may be asked: what does it mean to say that we have learned c with H? Informally, this means that the "implicit" information deducible from (or coded by) H "covers" c in some sense. If we stay within the pure setting where programs do not involve negation, it is widely admitted that this implicit information is the least Herbrand model of H, denoted MH, which is again a subset of the Herbrand base. An ideal situation would be c = MH but, generally, MH (or H) is just an "approximation" of c. To be more precise, an ILP machine takes as input:
– a finite proper subset E = ⟨E+, E−⟩ (the training set in Instance Based Learning terminology), where E+ are the positive examples, i.e. the things known as being true, and is a subset of c, and E− are the negative examples, a subset of the complement of c;
– a logic program usually denoted B (as background knowledge) representing a basic knowledge we have concerning the concept to approximate. This knowledge satisfies two natural conditions: it does not explain the positive examples, B ⊭ E+, and it does not contradict the negative ones, B ∪ E− ⊭ ⊥.
To achieve the ILP task, one of the most popular methods is to build H such that H ∪ B |= E+ and H ∪ B ∪ E− ⊭ ⊥. In that case, since E+ ⊆ c and E− lies outside c, it is expected that ∀e ∈ c, H ∪ B |= e, i.e. H ∪ B |= c. Thus H, associated with B, could be considered as an explanation for c. Of course, as explained in the previous section, an ILP machine could behave as a classifier. Back to the introduction, the sample set S = {(x1, y1), . . . , (xi, yi), . . . , (xn, yn)} is represented as a finite set of Prolog facts class(xi, yi) constituting the set E+. The ILP machine will provide a hypothesis H. Given a query q, we get an answer with the program H ∪ B by running the standard Prolog machinery (H ∪ B is in fact a Prolog program). In the simple case of clustering, for instance, we get the class y of a given element x by giving the query class(x, Y)? to a Prolog interpreter, H having previously been consulted. Back to our main purpose, we want to evaluate, in some sense, the uncertainty relative to the answer for a given query q?. This is the object of the next section.
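For a ground (propositional) program, the two conditions on H can be checked with a naive forward-chaining computation of the least Herbrand model. The sketch below is our own illustration, with atoms encoded as plain strings and "consistency with E−" read, as is usual for definite programs, as "no negative example is derivable".

def least_model(clauses):
    """Least Herbrand model of a ground definite program.
    clauses: list of (head, [body atoms]); facts have an empty body."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def explains(hypothesis, background, e_pos, e_neg):
    """Every positive example must be covered, no negative example derivable."""
    m = least_model(hypothesis + background)
    return all(e in m for e in e_pos) and not any(e in m for e in e_neg)

# toy run with hand-grounded clauses (hypothetical atoms, for illustration only)
B = [("bird_tweety", []), ("bird_opus", []), ("penguin_opus", [])]
H = [("flies_tweety", ["bird_tweety"]), ("flies_opus", ["bird_opus"])]
print(explains(H, B, e_pos=["flies_tweety"], e_neg=["flies_opus"]))  # False: a negative is covered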
3 A Simple Probabilistic Analysis
Our learning machine builds a hypothesis H, which is a logic program and, as such, could be identified with its least Herbrand model MH. To formalize the problem, we follow the lines of [5,1] by considering a probability measure µ over the full set I of all possible interpretations (the set of "worlds" in the [5,1] terminology). Then the degree of validity of a given ground formula F is the measure of its set of models: roughly speaking, the more models F has, the truer F is. Since we have a measure over I, we can consider any ground formula F as a random variable from I with range in {t, f}, by putting F(I) = the truth value of F in interpretation I, i.e. F : I −→ {t, f}, I → I(F). Now, µ(F = v) has a standard meaning: µ({I ∈ I | F(I) = v}). To abbreviate, we denote µ(F) = µ(F = t). We may notice that, in the classical bi-valued setting, µ(F = t) = 1 − µ(F = f). In a more general three-valued logic, this equality does not hold. Nevertheless, the remainder of this section would still be valid in such a context. A logic program H could be considered as a conjunction of ground Horn clauses and so µ(H) is defined. Of course, we are interested in the relationship between this probability measure µ and the |= relation. We have the following lemma:
Lemma 1. If H1 |= H2 then
– µ(H1) ≤ µ(H2),
– µ(H2/H1) = 1.
Proof: i) By definition of H1 |= H2, if H1(I) = I(H1) = t then H2(I) = I(H2) = t; this implies the first relation. ii) µ(H2/H1) = µ(H1 ∧ H2)/µ(H1), but µ(H1 ∧ H2) = µ({I ∈ I | I |= H1 ∧ H2}) = µ({I ∈ I | I |= H1} ∩ {I ∈ I | I |= H2}) = µ({I ∈ I | I |= H1}) by hypothesis, and we are done.
Let now a ∉ E be a new instance of our problem which is known to be true. Our question is "what is our chance that one of the two hypotheses covers a, i.e. is such that H |= a?". So we are interested in comparing the probability that a holds knowing that b → a holds with the probability that a holds knowing that b ∧ c → a holds. We can easily deduce from the previous lemma:
Proposition 1. If m ≤ n: µ(a | ∧i∈[1,n] bi → a) ≤ µ(a | ∧i∈[1,m] bi → a).
Proof: We lose no generality by dealing with m = 1 and n = 2. Since we have a |= (b → a) and (b → a) |= (b ∧ c → a), we deduce, using the Bayes formula and part ii) of the previous lemma:
µ(a | b → a) = µ(b → a | a)µ(a) / µ(b → a) = µ(a) / µ(b → a)   (1)
and
µ(a | b ∧ c → a) = µ(b ∧ c → a | a)µ(a) / µ(b ∧ c → a) = µ(a) / µ(b ∧ c → a).   (2)
Note that b → a |= b ∧ c → a, so we can now use part i) of the previous lemma: µ(b → a) ≤ µ(b ∧ c → a) (3). So we get the expected result: µ(a | b ∧ c → a) ≤ µ(a | b → a). We can conclude here that it is probably better to justify a with a clause with a short body. If we consider the length of a proof for a given instance a, this number is thus a reasonable measure of the uncertainty we have concerning the fact that a is true. This number could be viewed as a weight for a and, inspired by standard techniques in the machine learning field, we shall try to introduce these weights into the ILP process to guide the search for relevant hypotheses. The remainder of our paper is devoted to taking the previous analysis into account in an ILP process and to reducing uncertainty by reducing the length of proofs. The target machinery is Progol ([8]), which we briefly describe in the following section.
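Proposition 1 can also be checked numerically. The following small Monte Carlo sketch (our own, assuming independent, uniformly distributed atoms a, b, c) estimates the two conditional probabilities; it should show µ(a | b → a) ≈ 2/3 dominating µ(a | b ∧ c → a) ≈ 4/7.

import random

def estimate(trials=200_000, seed=0):
    rng = random.Random(seed)
    n_short = a_short = n_long = a_long = 0
    for _ in range(trials):
        a, b, c = (rng.random() < 0.5 for _ in range(3))
        if (not b) or a:            # world satisfies b -> a
            n_short += 1
            a_short += a
        if (not (b and c)) or a:    # world satisfies b and c -> a
            n_long += 1
            a_long += a
    return a_short / n_short, a_long / n_long

short, long_ = estimate()
print(f"mu(a | b->a) ~ {short:.3f}   mu(a | b^c->a) ~ {long_:.3f}")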
4 The Progol System
Back to the standard ILP process: instead of searching for consequences, we search for premises, so it is rather natural to reverse standard deductive inference mechanisms. That is the case for Progol, which uses the so-called inverse entailment mechanism ([8]). We only give a simple algorithm1 schematizing its behavior in figure 1.
Initialize: E′ = E (initial set of examples), H = ∅ (initial hypothesis)
While E′ ≠ ∅ do
  Choose e ∈ E′
  Compute a covering clause C for e
  H = H ∪ {C}
  Compute Cov = {e′ | e′ ∈ E′, B ∪ H |= e′}
  E′ = E′ \ Cov
End While
Fig. 1. General Progol scheme
At a glance, it is clear that no weight is taken into account in the previous algorithm. But let us examine in the next subsection how the covering clause is chosen. 1
see http://www.cs.york.ac.uk/mlg/progol.html where a full and clear description is given.
4.1 The Choice of the Covering Clause
It is clear that there is an infinite number of clauses covering e, and so Progol needs to restrict the search in this set. The idea is thus to compute a clause Ce such that if C covers e, then necessarily C |= Ce (C is more general than Ce). Since, in theory, Ce could be an infinite disjunction, Progol restricts the construction of Ce using mode declarations and some other settings (like the number of resolution inferences allowed, etc.). Mode declarations imply that some variables are considered as input variables and others as output variables: this is a standard way to restrict the search tree for a Prolog interpreter. At last, when we have a suitable Ce, it suffices to search for clauses C which θ-subsume Ce since this is a particular case which validates C |= Ce. Thus, Progol begins by building a finite set of θ-subsuming clauses, C1, . . . , Cn. For each of these clauses, Progol computes a natural number f(Ci) which expresses the quality of Ci: this number measures in some sense how well the clause explains the examples and is combined with some other compression requirement. Given a clause Ci extracted to cover e, we have f(Ci) = p(Ci) − (c(Ci) + h(Ci) + n(Ci)) where:
– p(Ci) = #({e | e ∈ E, B ∪ {Ci} |= e}), i.e. the number of covered examples,
– n(Ci) = #({e | e ∈ E, B ∪ {Ci} ∪ {e} |= ⊥}), i.e. the number of incorrectly covered examples,
– c(Ci) is the length of the body of the clause Ci,
– h(Ci) is the minimal number of atoms of the body of Ce we have to add to the body of Ci to ensure output variables have been instantiated. The evaluation of h(Ci) is done by static analysis of Ce.
Then, Progol chooses a clause C = Ci0 ≡ arg maxCi f(Ci) (i.e. such that f(Ci0) = max{f(Cj) | j ∈ [1, n]}). We may notice that, in the formula computing the number f(Ci) for a given clause covering e, there is no distinction between the covered positive examples. So p(Ci) is just the number of covered positive examples. The same remark is valid for the computation of n(Ci), and so success and failure examples could be considered as equally weighted. In the next section, we shall explain how we introduce weights to distinguish between examples.
4.2 A Simulation of the Weights
Given a problem instance a covered by H, there is a deduction, using only the resolution rule, such that H ∪ B |= a. Back to our previous example, to deduce a knowing a ← b, c (using Prolog standard notation), you have first to prove (or to remove in Prolog terminology) b then c with standard Prolog strategy. It becomes clear that the number of resolution steps used to prove a is an increasing function of the length of the clauses defining a in H. Starting from this observation, we argue that this number is likely a good approximation of the difficulty to cover
an instance of a which would not be in the training set. We infer that training examples with long proofs will likely generate errors in the future and so we have to focus on such examples during the learning process. So, we decide to give each training example a weight equal to the length of the proof it needs in order to be covered by the current hypothesis. Among the finite set of possible proofs, we choose the shortest length. Let us recall that Progol is not designed to readily employ weighted training instances and we have to simulate the previous weights in some sense. Nevertheless, in the definition of the f function used to choose the best clause for the current instance, the parameter we are interested in is p(C): this is the number of training instances in E covered by the clause we are dealing with. Let us suppose that, instead of considering E as a set, we consider E as a multiset (or bag) and we include in the current E multiple occurrences of some existing examples. Then we force the ILP machine to give great importance to these instances and to choose the associated covering clauses just because of the great value of p(C). Let us give an example to highlight our view. E = {e1, e2, e3, e4}. Starting from e1, the Progol machine finds C1, C2 and C3 such that C1 covers e1, e2 and e3, C2 covers e1, e2 and e4, C3 covers e1, e3 and e4. At this step, we have p(C1) = p(C2) = p(C3) = 3. So these clauses could not be distinguished only with the covered positive examples. Let us suppose now that we give e2 the weight 2 and e3 the weight 3, e1 and e4 keeping their implicit weight of 1. Now we have p(C1) = 1 + 2 + 3 = 6, p(C2) = 1 + 2 + 1 = 4 and
p(C3 ) = 1 + 3 + 1 = 5.
From the viewpoint of the positive covered examples, C1 is the best clause and will be added to the current hypothesis H: the clause C2 , which does not cover e3 , is eliminated and we choose among the clauses covering e3 . We understand that the Progol machinery will then choose the clauses covering the “heavy” instances.
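The multiset trick amounts to replacing the count p(C) by a weighted sum. The following few lines (ours, with coverage given as explicit sets) reproduce the worked example above:

def weighted_p(covered, weight):
    """p(C) when the training set is read as a multiset: sum of example weights."""
    return sum(weight[e] for e in covered)

coverage = {                        # the example of Section 4.2
    "C1": {"e1", "e2", "e3"},
    "C2": {"e1", "e2", "e4"},
    "C3": {"e1", "e3", "e4"},
}
weight = {"e1": 1, "e2": 2, "e3": 3, "e4": 1}

scores = {c: weighted_p(cov, weight) for c, cov in coverage.items()}
print(scores)                       # {'C1': 6, 'C2': 4, 'C3': 5} -> C1 is preferred
best = max(scores, key=scores.get)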
5 Progol with Weights
Now we have to manage the weights in the Progol system: we take our inspiration from a technique from the field of classification, namely boosting. We give the main lines of this algorithm in the next subsection.
5.1 The Boosting Technique
This technique is issued from the field of classification: the main idea is to combine simple classifiers Ct , t ∈ [1, T ], into a new one C whose predictive accuracy
is better than those of the Ct's. So C = ψ({Ct | t ∈ [1, T]}) and the problem is to cleverly define the aggregating function ψ. We focus here on the boosting technique as initially described in [3,4] and improved in [10]. We suppose given a finite set of training instances S = {(x1, y1), . . . , (xi, yi), . . . , (xn, yn)} where yi denotes the class of xi (to simplify the presentation, we suppose a binary classification where yi ∈ {−1, +1}). The "boosted" system will now repeat T times the construction of a classifier Ct for weighted training instances, starting from a distribution D1 where the initial weight w1(i) of instance i is 1/n. At the end of turn t of the algorithm, the distribution Dt is updated into Dt+1: roughly speaking, the idea is to increase the weight of a misclassified instance, i.e. if Ct(xi) ≠ yi then wt+1(i) > wt(i). Of course, the difficulty is to cleverly define the new weight. One of the main results of [10] is to give choice criteria and also to prove important properties of such boosted algorithms. For our aim, it is not necessary to go further and so we give in figure 2 a simplified version of the boosting algorithm.
Initialize t = 1
For i = 1 to n do Dt(i) = 1/n
For t = 1 to T do
  build a classifier Ct with S weighted by Dt
  update Dt
End For
C = ψ({Ct | t ∈ [1, T]})
Fig. 2. General boosting scheme
Generally, the ψ functional is a linear combination of
the Ct's, considered as real-valued functions. To predict the class of a given x, it suffices to compute C(x): if C(x) > 0 then +1, else −1. [10] proves that the training error of the classifier C has a rather simple upper bound (when Dt is conveniently chosen), and some nice methods to reduce this error are described.
5.2 Introduction of the Weights in Progol
Now it is relatively easy to introduce a kind of weight management in the Progol machinery: it suffices to consider the training set as a multiset, the weight ω(e) of an example e being its multiplicity order. So we start with equally weighted examples: each instance appears one time in E. Introducing T times a change of the function ω in a learning loop, then we will get distinct clauses and then distinct final hypothesis in each turn of the loop. To abbreviate, we shall denote P rogol(B, E, f ) the output program H currently given by the Progol machine with input B as background knowledge, E as training set (or bag) and using function f to chose the relevant clauses. E t is the bag associated to weight ω t during turn t and let us now show in figure 3 the scheme we use. It clearly appears that, with regard to the behavior of Progol, the program P rogol(B, E t , f ) will be probably different from the program P rogol(B, E t+1 , f ). This is an exact
Given E = {e1, . . . , em} a set of positive examples, B a background knowledge (a set of Horn clauses)
/* Initialization */
E1 = E;
/* main loop */
For t = 1 to T do
  Ht = Progol(B, Et, f);
  For each e ∈ E, compute ωt+1(e) = length of the proof B ∪ Ht |= e;
  Update Et into Et+1 using ωt+1
/* end main loop */
Fig. 3. A boosting-like scheme for Progol
simulation of a weight-handling Progol system where the function f would be defined with f(C) = p(C) − (h(C) + c(C) + n(C)), where p(C) = Σ{e ∈ Et : B ∪ {C} |= e} ω(e). So we get a set of hypotheses {Ht | t ∈ [1, T]} and, since we deal with symbolic values, we aggregate these programs by a plurality vote procedure, i.e. the answer for a query q is just the majority vote of the components Ht. This is exactly the solution of [9]. Doing so, we may notice that we get a classifier, but this is not a simple logic program. Nevertheless, we do not focus on this aspect for classification tasks. We examine in the next section how we proceed and the behavior of our machine from the viewpoint of predictive accuracy.
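The scheme of figure 3 can be sketched as a generic loop around an abstract induction procedure standing in for Progol; induce, proof_length and covers below are placeholders we introduce for illustration, not Progol's actual interface. The halting condition on a repeated hypothesis anticipates the remark made in the discussion section.

def boosted_ilp(examples, background, induce, proof_length, turns=10):
    """Boosting-like loop of Section 5.2 (schematic).

    induce(background, multiset_of_examples) -> hypothesis   (stands in for Progol)
    proof_length(hypothesis, background, e) -> int            (shortest proof of e)
    Returns the list of hypotheses H^1 ... H^T.
    """
    weights = {e: 1 for e in examples}              # E^1 = E: every example once
    hypotheses = []
    for _ in range(turns):
        bag = [e for e in examples for _ in range(max(1, weights[e]))]
        h = induce(background, bag)                 # learn on the multiset E^t
        if h in hypotheses:                         # repeated hypothesis: halt
            break
        hypotheses.append(h)
        # omega^{t+1}(e) = length of the shortest proof of e from B and H^t
        weights = {e: proof_length(h, background, e) for e in examples}
    return hypotheses

def majority_vote(hypotheses, covers, query):
    """Plurality vote of the individual hypotheses on a query."""
    votes = sum(1 if covers(h, query) else -1 for h in hypotheses)
    return votes > 0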
6 Implementation and Experiments
Our experiments were made using the UCI database ([2]). We chose 3 databases, namely the ZOO (102 examples, 18 attributes), HEPATITIS (155 examples, 20 attributes) and MORAL-REASONER (202 examples) databases. The first two are standard classification data using symbolic and numeric attribute values. The third one is fully relational, dealing only with symbolic values. The main boosting loop is implemented as a Unix script with a) additional Prolog predicates to generate the weights and b) some C programs to process the different outcome files. The running time is rather long for data with a large number of attributes: about 2 hours for HEPATITIS on a standard PC (500 MHz). This is the main reason why we chose restricted sets of data. A full C implementation would likely allow us to deal with more complex data and relations. But, as usual in this domain, we are not primarily concerned with real-time constraints. The main difficulty is due to the computation of the weights at the end of a learning turn:
we need to run a Prolog interpreter for each example and to record the length of each successful proof in order to compute the shortest one.
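For ground definite clauses, the shortest-proof weight can also be obtained by a simple fixpoint computation instead of instrumenting a Prolog interpreter; the sketch below (ours) counts clause applications, which is one reasonable reading of "length of the proof".

import math

def shortest_proof_lengths(clauses):
    """Minimal number of clause applications needed to derive each ground atom.

    clauses: list of (head, [body atoms]); a fact (empty body) has proof length 1.
    Atoms absent from the result are not derivable."""
    length = {}
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if all(b in length for b in body):
                candidate = 1 + sum(length[b] for b in body)
                if candidate < length.get(head, math.inf):
                    length[head] = candidate
                    changed = True
    return length

program = [
    ("b", []), ("c", []),
    ("a", ["b", "c"]),      # a needs 3 steps: one for itself, one per body atom
    ("d", ["a", "b"]),
]
print(shortest_proof_lengths(program))   # {'b': 1, 'c': 1, 'a': 3, 'd': 5}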
6.1 Results for ZOO and HEPATITIS Databases
Results for ZOO Database (in Accuracy Rate %)

Attribute name   Progol   Random Weights   +Boosting
eggs             88.5     90               94.2
hair             96       96               96
feathers         100      100              100
airbone          76       80               84
aquatic          80       86               86
toothed          72       72               72
backbone         98       98               98
breathed         90       93               98
We fix the number of loops to 10 (as in [9]). We run standard Progol and our "boosted" system over the previous database with fixed training sets of size 50, randomly sampled from the full one. We also compare our machine with a "random boosting" system, i.e. one where the weights are randomly generated between two consecutive turns of the main loop. Then we test the resulting hypotheses over the full set of examples (this is an easy task with Progol since a built-in predicate test\1 does the job) and we compute the accuracy rate simply by dividing the number of correctly classified data by the full number of data. A majority vote is implemented for computing the answers of the "boosted" systems. That is why we have 4 columns in our tables. The first one indicates the class we are focusing on; the next ones indicate the respective accuracy rates with Progol only, the random weight loop, and our weight function. Each line of the tables is identified by the name of the attribute we focus on: in such a line, we consider that the value of the attribute has to be guessed by the system (i.e. this attribute is the concept to learn).

Results for HEPATITIS Database (in Accuracy Rate %)
Results for the HEPATITIS Database (accuracy rate, %)

Attribute name   Progol   Random Weights   +Boosting
alive            83.7     84               86.1
sex              88.5     90               94.2
steroids         41       41               44
antiviral        20       20               25
anorexia         80       83               83
liverBig         75       76               80
liverFirm        53.6     60               63.4
spleenpalpable   41.6     45               51.2
spiders          65.8     70               73
ascite           87.8     90               92.6
varices          90.2     95               95.1
The results we get are described in the above tables. We may remark that the random method is, in general, better than standard Progol, which is an expected result. Moreover, our method is often better than the random one.
6.2
Results for the MORAL-REASONER Database
One of the main points of interest we found in this database is the huge number of predicate symbols involved. Recall that we are interested in the length of clause bodies, so we focus on theories with many predicate symbols: this increases the range of possible clause bodies. For instance, on the BALLOON database (UCI), the Progol machine produces only clauses with empty bodies (because the concept to learn is rather simple), so there is no possibility to improve the mechanism with our scheme. In the MORAL-REASONER database, we have 49 predicate symbols to define the guilty/1 target predicate. A little more than 4750 clauses constitute the background knowledge. The full training set contains 102 positive instances S+ and 100 negative ones S−. Since this database is a rather logical one, we adopt a different protocol to test our machine. The protocol used in that case is the following: training sets E = E+ ∪ E− of sizes 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 were randomly sampled from the full training set. Each training set is equally divided into positive and negative instances (i.e., 5 positive examples in E+ and 5 negative examples in E− for the first test, and
so on). First of all, we may notice the behavior of standard Progol, whose accuracy rate is not a strictly increasing function of the number of examples: this is due to the fact that the given training sets do not constitute an increasing sequence. But the main observation is that "boosted Progol" has, in all cases except the first one, a better accuracy rate over the full sample set than simple Progol. We believe that this exception is not really informative because of the small size of the sample set. With only 10 training instances, we are simply fortunate to obtain a well-performing classifier (accuracy = 81.19%) as the first hypothesis. In the subsequent boosting loops we get a more standard behavior with a large deviation, and the majority vote reflects this fact.
Fig. 4. Comparison Progol/Progol with weight
7
Discussion
Our method for computing weights is generally better than the random weight method, which is itself better than simple Progol. So, we can infer that boosting with the length of proofs improves the behavior of the Progol machine. We may notice that the first hypothesis we get with the boosting scheme is exactly the one produced by the standard Progol machine. During the boosting process, if we get a hypothesis H which has already been produced, this is a halting condition.
The reason is that the weights to be updated have already been computed and we will not get more information. Since we use a plurality vote, we do not have a single logic program at the end of the process, and so we lose an interesting feature of ILP: the intelligibility of the resulting hypothesis, that is, the possibility for a non-expert to understand why we get a given result. Of course, each individual hypothesis remains understandable, but the behavior of the whole system cannot be reduced to one hypothesis unless we exhibit an equivalent logic program. So, instead of using a voting procedure, it would be interesting to combine the hypotheses Pt to get a single full logic program. The simplest solution would be to take the union of the Pt's: P = ∪_{t∈[1,T]} Pt. From a logical point of view, we are sure to get a consistent theory since we only deal with Horn clauses, and so each Pt has only positive consequences. We examine this possibility now. Let us consider the union P of two logic programs P1 and P2, considered as unordered finite sets of Horn clauses. Then the success set of P, ss(P) (see [7] for a complete description), is a superset of the union of the two success sets: ss(P1) ∪ ss(P2) ⊆ ss(P). So, from the viewpoint of the covered positive examples, we increase the accuracy rate. But let us focus now on the negative examples. We recall that a negative example is covered if and only if it belongs to the finite failure set of P, ff(P) (see [7]). Unfortunately, we have the dual inclusion ff(P) ⊆ ff(P1) ∩ ff(P2), highlighting the fact that negation as failure is a non-monotonic logical rule. A partial solution would be to consider the accuracy rate over positive examples only. Nevertheless, two other problems occur: 1. First, we increase the non-determinism of the program, since we probably get several answers for one given instance. This is problematic when we deal with a single-class clustering task. 2. Secondly, we probably get a very redundant system where each covered instance leads to a lot of different proofs, thus reducing the time efficiency of the final program. So a more clever aggregation method has to be found. This is an open issue.
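As a small illustration of this dual inclusion (our example, not from the original text), take P1 = {p ← q} and P2 = {q ←}. Then ff(P1) = {p, q} and ff(P2) = {p}, so ff(P1) ∩ ff(P2) = {p}; but in the union P = P1 ∪ P2 both p and q succeed, so ff(P) = ∅. A negative example such as p, covered by each program separately, is no longer covered by their union.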
8
Conclusion
In this paper, we provide a way to introduce weights into an ILP machine, Progol. This idea is not completely new. As far as we know, [9] was the first to build such a mechanism, for the FOIL machine, but there are two main differences from our work:
– FOIL is a weight-handling system, so the novelty there was only the boosting mechanism. In our case, Progol does not handle weighted examples. As a consequence, we have to search for reasonable weights and to find a method to deal with them.
– In FOIL, the weights are derived from the training error, but our weights rely on the length of proofs, since the training error is always null.
The length of a proof is viewed as a numerical quantification of the degree of uncertainty, and a simple probabilistic argument justifies this claim. We implement this idea by simulating weights using bags instead of sets as training bases, the multiplicity of an element being considered as its "weight". This is a way to focus on potentially problematic examples and to force the learning machine, Progol, to favor hypotheses covering these examples. The final aggregation is just a plurality vote to get the answer. Our practical results improve on the standard Progol behavior and validate our idea over a restricted set of data. A more extensive experimentation would be interesting. Among the different ideas to develop, we have to introduce negative examples, since they are available in such first-order systems. But in that case, matters are not so simple. To deal with negative information, the standard way in logic programming is to start with the Closed World Assumption and to implement a negation-as-failure mechanism. Unfortunately, negation as failure is not a logical inference rule but only an operational device, and in that case the length of a proof is of no help in designing a weight function.
References
1. F. Bacchus, A.J. Grove, J.Y. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. Artificial Intelligence, (87), pp. 75–143, 1997.
2. C.L. Blake, C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
3. Y. Freund, R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the 13th Int. Conf., pp. 148–156, 1996.
4. Y. Freund, R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, vol. 55(1), pp. 119–139, 1997.
5. J.Y. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, (46), pp. 311–350, 1990.
6. S. Hoche, S. Wrobel. Using constrained confidence-rated boosting. In 11th Int. Conf. on ILP, Strasbourg, France, pp. 51–64, 2001.
7. J.W. Lloyd. Foundations of Logic Programming. Symbolic Computation Series. Springer-Verlag, 1997 (revised version).
8. S. Muggleton. Inverse entailment and Progol. New Gen. Comput., 12, pp. 245–286, 1994.
9. J.R. Quinlan. Boosting first-order learning. In Proc. 7th Int. Workshop on Algorithmic Learning Theory (ALT'96). Springer-Verlag, 1996.
10. R.E. Schapire, Y. Singer. Improving boosting algorithms using confidence-rated predictions. In Proc. 11th Ann. Conf. on Computational Learning Theory, 1998.
Paraconsistency in Object-Oriented Databases

Rajiv Bagai¹ and Shellene J. Kelley²

¹ Department of Computer Science, Wichita State University, Wichita, KS 67260-0083, USA
² Department of Mathematics & Computer Science, Austin College, Sherman, TX 75090-4440, USA
Abstract. Many database application areas require an ability to handle incomplete and/or inconsistent information. Such information, called paraconsistent information, has been the focus of some recent research. In this paper we present a technique for representing and manipulating paraconsistent information in object-oriented databases. Our technique is based on two new data types for such information. These data types are generalizations of the boolean and bag data types of the Object Data Management Group (ODMG) standard. Algebraic operators with a 4-valued paraconsistent semantics are introduced for the new data types. A 4-valued operational semantics is also presented for the select expression, with an example illustrating how such a semantics can be used effectively to query an object-oriented database containing paraconsistent information. To our knowledge, our technique is the first treatment of inconsistent information in object-oriented databases.
1
Introduction
Employing classical sets to capture some underlying property of their elements requires the membership status of each possible element to be completely determinable, either positively or negatively. For bags, determining the multiplicity of membership of their elements is necessary as well. While it is often possible to determine that underlying property of possible elements of sets or bags, there are numerous applications in which that property can at best be only guessed at by employing one or more tests or sensors. By itself, such a test or sensor is usually not foolproof, making it necessary to take into account the outcomes, sometimes contradictory, of several such tests. As an example, Down's Syndrome is an abnormality in some fetuses, leading to defective mental or physical growth. Whether or not the fetus inside a pregnant woman has Down's Syndrome can only be guessed at by subjecting the mother to certain tests, e.g. serum screening, ultrasound, amniocentesis diagnosis, etc. Such tests are often carried out simultaneously and contradictory
This research has been partially supported by the National Science Foundation research grant no. IRI 96-28866.
results are possible. Decisions are made based on information obtained from all sensors, even contradictory. As another example, target identification in a battlefield is often carried out by employing different sensors for studying an observed vehicle’s radar image shape, movement pattern, gun characteristics, etc. Gathered information is often incomplete, and sometimes even inconsistent. Astronomical and meteorological databases are other examples of domains rich in such information. For the same reasons, combinations of databases [4,9] or databases containing beliefs of groups of people often have incomplete, or more importantly, inconsistent information. Paraconsistent information is information that may be incomplete or inconsistent. Given the need for handling such information in several application areas, some techniques have recently been developed. Bagai and Sunderraman [1] proposed a data model for paraconsistent information in relational databases. The model, based on paraconsistent logic studied in detail by da Costa [7] and Belnap [5], has been shown by Tran and Bagai in [10] to require handling infinite relations, with an efficient algebraic technique presented in [11]. Also, some elegant model computation methods for general deductive databases have been developed using this model in [2,3]. In the context of object-oriented databases, however, we are not aware of any prior work on handling paraconsistency. In this paper we present a technique, motivated by the above model, for representing and manipulating paraconsistent information in object-oriented databases. We present two new data types for the object model of Cattell and Barry [6]. Our data types are generalizations of the boolean and bag data types. We also introduce operators over our new data types and provide a 4-valued semantics for the operators. In particular, we provide a richer, 4-valued operational semantics for the select expression on our new data types. Equipped with our semantics, the operators become an effective language for querying object-oriented databases containing paraconsistent information. The remainder of this paper is organized as follows. Section 2 introduces the two new data types for storing paraconsistent information. This section also gives some simple operators on these data types. Section 3 presents a 4-valued operational semantics for the select expression, followed by an illustrative example of the usage and evaluation of this construct. Finally, Section 4 concludes with a summary of our main contributions and a mention of some of our future work directions.
2
Paraconsistent Data Types
In this section we present two new data types, namely pboolean and pbag<t>, that are fundamental to handling paraconsistent information in an object-oriented database. These data types are, respectively, generalizations of the data types boolean and bag<t> of the ODMG 3.0 object data standard of [6].
2.1
The Data Type pboolean
The data type pboolean (for paraconsistent boolean) is a 4-valued generalization of the ordinary data type boolean. The new generalized type consists of four literals: true, false, unknown and contradiction. As in the boolean case, the literal true is used for propositions whose truth value is believed to be true, and the literal false for ones whose truth value is believed to be false. If the truth value of a proposition is not known, we record this fact by explicitly using the literal unknown as its value. Finally, if a proposition has been observed to be both true as well as false (for example, by different sensors), we use the literal contradiction as its value.

Paraconsistent Operators pand, por & pnot. We also generalize the well-known operators and, or and not on the type boolean to their paraconsistent counterparts pand, por and pnot, respectively, on the type pboolean. The following table defines these generalized operators:

P              Q              P pand Q       P por Q        pnot P
contradiction  contradiction  contradiction  contradiction  contradiction
contradiction  true           contradiction  true           contradiction
contradiction  false          false          contradiction  contradiction
contradiction  unknown        false          true           contradiction
true           contradiction  contradiction  true           false
true           true           true           true           false
true           false          false          true           false
true           unknown        unknown        true           false
false          contradiction  false          contradiction  true
false          true           false          true           true
false          false          false          false          true
false          unknown        false          unknown        true
unknown        contradiction  false          true           unknown
unknown        true           unknown        true           unknown
unknown        false          false          unknown        unknown
unknown        unknown        unknown        unknown        unknown
Except for the cases when one of P and Q is contradiction and the other is unknown, all values in the above table are fairly intuitive. For example, false pand unknown should be false, regardless of what that unknown value is. And false por unknown is unknown as false is the identity of por. The two cases when one of P and Q is contradiction and the other is unknown will become clear later. However, at this stage it is worthwhile to observe that the 4-valued operators pand, por and pnot are monotonic under the no-more-informed lattice ordering (≤), where unknown < true < contradiction, and unknown < false < contradiction.
Also, the duality of pand and por is evident from the above table. Moreover, the following algebraic laws can easily be shown to be exhibited by the above 4-valued operators:
1. Double Complementation Law: pnot (pnot P) = P
2. Identity and Idempotence Laws: P pand true = P pand P = P; P por false = P por P = P
3. Commutativity Laws: P pand Q = Q pand P; P por Q = Q por P
4. Associativity Laws: P pand (Q pand R) = (P pand Q) pand R; P por (Q por R) = (P por Q) por R
5. Distributivity Laws: P pand (Q por R) = (P pand Q) por (P pand R); P por (Q pand R) = (P por Q) pand (P por R)
6. De Morgan Laws: pnot (P pand Q) = (pnot P) por (pnot Q); pnot (P por Q) = (pnot P) pand (pnot Q)
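As an illustration only (this encoding is not part of the ODMG extension described here), the four literals and the three operators can be realized by a pair of flags recording evidence for truth and evidence for falsity; the sketch below reproduces the truth table above, including its two non-obvious entries.

    # Minimal sketch: pboolean as a pair (believed_true, believed_false);
    # true=(T,F), false=(F,T), unknown=(F,F), contradiction=(T,T).
    from typing import NamedTuple

    class PBool(NamedTuple):
        bt: bool   # some evidence that the proposition is true
        bf: bool   # some evidence that the proposition is false

    TRUE, FALSE = PBool(True, False), PBool(False, True)
    UNKNOWN, CONTRADICTION = PBool(False, False), PBool(True, True)

    def pand(p: PBool, q: PBool) -> PBool:
        # evidence for truth needs both; evidence for falsity needs either
        return PBool(p.bt and q.bt, p.bf or q.bf)

    def por(p: PBool, q: PBool) -> PBool:
        # evidence for truth needs either; evidence for falsity needs both
        return PBool(p.bt or q.bt, p.bf and q.bf)

    def pnot(p: PBool) -> PBool:
        # negation swaps the evidence for truth and falsity
        return PBool(p.bf, p.bt)

    # pand(CONTRADICTION, UNKNOWN) == FALSE and por(CONTRADICTION, UNKNOWN) == TRUE,
    # matching the two "non-obvious" entries of the table.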
2.2
The Data Type pbag<t>
If t is any data type, then bag<t> is an unordered collection of objects of type t, with possible duplicates. We define pbag<t> (for paraconsistent bag) to be a collection data type such that any element p of that type is an ordered pair ⟨p+, p−⟩, where p+ and p− are elements of type bag<t>. For example, ⟨{1, 1, 2, 2, 2, 3}, {1, 3, 3}⟩ is an element of type pbag<short>. Intuitively, just like a bag, a paraconsistent bag captures some underlying property of the elements that (may) occur in it. An element a of type t occurs in p+ as many times as there are evidences of it having the underlying property; similarly, a occurs in p− as many times as there are evidences of it not having the underlying property. Since inconsistent information is possible, an element may be simultaneously present (in fact, multiple times) in both the positive and the negative part of a paraconsistent bag. Also, an element may be absent from one or both parts. As a more realistic example, suppose a patient in a hospital is tested for some symptoms s1, s2 and s3. Often the testing methods employed by hospitals are not guaranteed to be foolproof. Thus, for any given symptom, the patient is tested repeatedly over a period of time, with possibly contradictory results. A possible collection of test results captured as a paraconsistent bag is: ⟨{s1, s2, s2, s3}, {s1, s3, s3}⟩. The above contains the information that this patient tested positive once each for symptoms s1 and s3, and twice for symptom s2. Also, the patient tested negative once for symptom s1 and twice for s3.
Membership of an Element. Membership of an element in a paraconsistent bag is a paraconsistent notion. If a is an expression of type t and p of type pbag<t>, then a pin p is an expression of type pboolean given by:

a pin p = true           if a ∈ p+ and a ∉ p−,
          false          if a ∉ p+ and a ∈ p−,
          unknown        if a ∉ p+ and a ∉ p−,
          contradiction  if a ∈ p+ and a ∈ p−.
The above corresponds to the main intuition behind the four literals of the pboolean type.
Paraconsistent Operators punion, pintersect & pexcept. We now define the union, intersection and difference operators on paraconsistent bags as generalizations of the union, intersect and except operators, respectively, on ordinary bags. Let t1 and t2 be compatible types and t their smallest common supertype. If p1 is of type pbag<t1> and p2 is of type pbag<t2>, then p1 punion p2 is the following paraconsistent bag of type pbag<t>:

⟨p1+ union p2+, p1− intersect p2−⟩.
Once again, the punion operator is best understood by interpreting bags (both ordinary and paraconsistent) as collections of evidences for elements having their respective underlying properties, and p1 punion p2 as the "either-p1-or-p2" property. Thus, since p1+ and p2+ are the collections of positive evidences for the properties underlying p1 and p2, respectively, the collection of positive evidences for the property "either-p1-or-p2" is clearly p1+ union p2+. Similarly, since p1− and p2− are the collections of negative evidences for the properties underlying p1 and p2, respectively, the collection of negative evidences for the property "either-p1-or-p2" is p1− intersect p2−. The generalized operators for the intersection and difference of paraconsistent bags should be understood similarly. p1 pintersect p2 is the following paraconsistent bag of type pbag<t>:

⟨p1+ intersect p2+, p1− union p2−⟩
and p1 pexcept p2 is the following paraconsistent bag of type pbag<t>:

⟨p1+ intersect p2−, p1− union p2+⟩.
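The definitions above can be mirrored in a short sketch (ours; it assumes that bag union adds multiplicities and that bag intersection takes their minimum, which is how the example evaluation in Section 3 behaves):

    # Minimal sketch of pbag<t> with Counter as the underlying bag type; pin returns
    # one of the four pboolean literals as a string so the sketch is self-contained.
    from collections import Counter

    class PBag:
        def __init__(self, pos=(), neg=()):
            self.pos, self.neg = Counter(pos), Counter(neg)   # positive / negative evidence

        def pin(self, a):
            t, f = self.pos[a] > 0, self.neg[a] > 0
            return ("contradiction" if t and f else
                    "true" if t else "false" if f else "unknown")

        def punion(self, other):      # <p1+ union p2+, p1- intersect p2->
            return PBag(self.pos + other.pos, self.neg & other.neg)

        def pintersect(self, other):  # <p1+ intersect p2+, p1- union p2->
            return PBag(self.pos & other.pos, self.neg + other.neg)

        def pexcept(self, other):     # <p1+ intersect p2-, p1- union p2+>
            return PBag(self.pos & other.neg, self.neg + other.pos)

    # e.g. the patient's test results from above:
    p = PBag(["s1", "s2", "s2", "s3"], ["s1", "s3", "s3"])
    print(p.pin("s2"), p.pin("s1"))   # true contradiction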
Type Conversion. For any collection C, we define pbagof(C) as the following paraconsistent bag:

pbagof(C)+ = the bag of all elements of C   if C is a list or set,
             C                              if C is a bag,
             C+                             if C is a paraconsistent bag,

pbagof(C)− = ∅                              if C is a list, set or bag,
             C−                             if C is a paraconsistent bag.
3
A Paraconsistent Select Expression
We now introduce a paraconsistent select expression construct that can be used to query paraconsistent information from a database. The syntax of our construct is similar to the select construct of OQL, but we provide a paraconsistent operational semantics for evaluating the construct, resulting in a paraconsistent bag. The general form of the select expression is as follows:

select [distinct] g(v1, v2, ..., vk, x1, x2, ..., xn)
from x1 pin f1(v1, v2, ..., vk),
     x2 pin f2(v1, v2, ..., vk, x1),
     x3 pin f3(v1, v2, ..., vk, x1, x2),
     ...
     xn pin fn(v1, v2, ..., vk, x1, x2, ..., xn−1)
[where p(v1, v2, ..., vk, x1, x2, ..., xn)]

where v1, v2, ..., vk are free variables that have to be bound to evaluate the query, and the expressions f1, f2, ..., fn result in collections of types with extents, say, e1, e2, ..., en, respectively. (The extent of a type is the set of all current instances of that type.) And p is an expression of type pboolean. The result of the query will be of type pbag<t>, where t is the type of the result of g. The query is evaluated as follows:
1. The result of the from clause, Φ(n), is a paraconsistent bag of n-tuples, defined recursively as Φ(1) = pbagof(f1(v1, v2, ..., vk)) and, for 2 ≤ i ≤ n,
Φ(i)+ = ∪_{⟨a1, ..., ai−1⟩ ∈ Φ(i−1)+} { ⟨a1, ..., ai−1, t⟩ : t ∈ pbagof(fi(v1, ..., vk, a1, ..., ai−1))+ }

Φ(i)− = (Φ(i−1)− × ei) ∪ ∪_{⟨a1, ..., ai−1⟩ ∈ Φ(i−1)+} { ⟨a1, ..., ai−1, t⟩ : t ∈ pbagof(fi(v1, ..., vk, a1, ..., ai−1))− }
The subexpression Φ(i−1)− × ei appears in the definition of Φ(i)− due to the fact that if a tuple does not have the Φ(i−1) property, then no extension of it can have the Φ(i) property.
2. If the where clause is present, obtain from Φ(n) the following paraconsistent bag Θ:

Θ+ = Φ(n)+ ∩ { ⟨a1, ..., an⟩ ∈ e1 × e2 × ··· × en : p(v1, ..., vk, a1, ..., an) is true or contradiction }

Θ− = Φ(n)− ∪ { ⟨a1, ..., an⟩ ∈ e1 × e2 × ··· × en : p(v1, ..., vk, a1, ..., an) is false or contradiction }

Intuitively, Θ is obtained from the paraconsistent bag of n-tuples, Φ(n), by performing a pintersect operation with the paraconsistent bag defined by p, also a bag of n-tuples.
3. If g is just "*", keep the result of step (2). Otherwise, replace each tuple ⟨a1, ..., an⟩ in it by g(v1, ..., vk, a1, ..., an).
4. If the keyword distinct is present, eliminate duplicates from each of the two bag components of the result of step (3).

An Example. Let us now look at an example illustrating some paraconsistent computations for a query that requires them. Consider again a hospital ward where patients are tested for symptoms. Let a class Patient contain the following relationship declaration in its definition:

relationship pbag<Symptom> test;

where Symptom is another class. The above declaration states that, for any patient, the relationship test is essentially a paraconsistent bag containing, explicitly, the symptoms for which that patient has tested positive or negative. Let the sets {P1, P2} and {s1, s2, s3} be the current extents of the classes Patient and Symptom, respectively. We also suppose that in the current state of the database, P1.test and P2.test are the following relationships:

P1.test = ⟨{s1, s1, s2}, {s3}⟩,
P2.test = ⟨{s1, s3, s3}, {s1, s1}⟩.

In other words, P1 was tested positive for s1 (twice) and s2 (once), and negative for s3 (once). Also, P2 was tested positive for s1 (once) and s3 (twice), and negative for s1 (twice). Now consider the query: What patients showed contradictory test results for some symptom?
This query clearly acknowledges the presence of inconsistent information in the database. More importantly, it attempts to extract such information for useful purposes. A select expression for this query is:

select distinct p
from p pin Patient, s pin p.test
where pnot(s pin p.test)

In the ordinary 2-valued OQL, a similar select expression would produce an empty bag of patients, as the stipulations "s in p.test" and "not(s in p.test)" are negations of each other, and no patient-symptom combination can thus simultaneously satisfy both of them. However, our 4-valued semantics of paraconsistent operators like pnot and pin enables us to extract useful, in this case contradictory, information from the database. We first compute the result of the from clause, Φ(2), a paraconsistent bag of patient-symptom ordered pairs, by performing the following evaluations:

Φ(1) = ⟨{P1, P2}, ∅⟩,
Φ(2)+ = {⟨P1, s1⟩, ⟨P1, s1⟩, ⟨P1, s2⟩, ⟨P2, s1⟩, ⟨P2, s3⟩, ⟨P2, s3⟩},
Φ(2)− = {⟨P1, s3⟩, ⟨P2, s1⟩, ⟨P2, s1⟩}.

The condition of the where clause is then evaluated for all possible patient-symptom ordered pairs:

⟨p, s⟩      pnot(s pin p.test)
⟨P1, s1⟩    false
⟨P1, s2⟩    false
⟨P1, s3⟩    true
⟨P2, s1⟩    contradiction
⟨P2, s2⟩    unknown
⟨P2, s3⟩    false

resulting in the following paraconsistent bag Θ:

Θ+ = {⟨P2, s1⟩},
Θ− = {⟨P1, s1⟩, ⟨P1, s2⟩, ⟨P1, s3⟩, ⟨P2, s1⟩, ⟨P2, s1⟩, ⟨P2, s1⟩, ⟨P2, s3⟩}.

Finally, projecting the patients from the above and removing duplicates results in the following answer to the select expression:

⟨{P2}, {P1, P2}⟩.

The result states that P2 showed contradictory test results for some symptom (actually s1) but not for all symptoms (for example, not for s3), and P1 did not show contradictory results for any symptom.
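A small script (ours; the example data is hard-coded rather than read from an ODMG database, and bag union is taken as additive) can replay this evaluation and reproduce the answer ⟨{P2}, {P1, P2}⟩:

    from collections import Counter

    patients, symptoms = ["P1", "P2"], ["s1", "s2", "s3"]
    # p.test for each patient: (positive bag, negative bag), as given above
    test = {"P1": (Counter({"s1": 2, "s2": 1}), Counter({"s3": 1})),
            "P2": (Counter({"s1": 1, "s3": 2}), Counter({"s1": 2}))}

    def pin(s, bags):
        pos, neg = bags
        return {(True, True): "contradiction", (True, False): "true",
                (False, True): "false", (False, False): "unknown"}[(pos[s] > 0, neg[s] > 0)]

    def pnot(v):
        return {"true": "false", "false": "true"}.get(v, v)  # unknown/contradiction unchanged

    # Phi(2): result of the from clause, a paraconsistent bag of (patient, symptom) pairs
    phi_pos = Counter((p, s) for p in patients for s in test[p][0].elements())
    phi_neg = Counter((p, s) for p in patients for s in test[p][1].elements())

    # Theta: restrict/extend with the where clause pnot(s pin p.test)
    cond = {(p, s): pnot(pin(s, test[p])) for p in patients for s in symptoms}
    theta_pos = phi_pos & Counter(ps for ps, v in cond.items() if v in ("true", "contradiction"))
    theta_neg = phi_neg + Counter(ps for ps, v in cond.items() if v in ("false", "contradiction"))

    # project the patient and remove duplicates
    answer = (sorted({p for p, _ in theta_pos}), sorted({p for p, _ in theta_neg}))
    print(answer)   # (['P2'], ['P1', 'P2'])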
4
Conclusions and Future Work
The existing ODMG 3.0 Object Data Standard [6] is not capable of handling incomplete and/or inconsistent information in an object-oriented database. For many applications that is a severe limitation, as they abound in such information, called paraconsistent information. We have presented a technique for representing and manipulating paraconsistent information in object-oriented databases. Our technique is based upon two new data types, pboolean and pbag<t>, that are generalizations of the boolean and bag<t> data types, respectively, of [6]. We also presented operators over these data types that have a 4-valued semantics. Most importantly, we presented a 4-valued operational semantics for the select expression of OQL, which makes it possible to query contradictory information contained in the database. We have recently completed a first prototype implementation of our results. Kelley [8] describes in detail an object-oriented database management system capable of storing incomplete and/or inconsistent information and answering queries based on the 4-valued semantics presented in this paper. To the best of our knowledge, our work is the first treatment of inconsistent information in object-oriented databases. Such information occurs often in applications where, to determine some fact, many sensors may be employed, some of which may contradict each other. Examples of such application areas include medical information systems, astronomical systems, belief systems, meteorological systems, and military and intelligence systems. Due to space limitations we have presented a minimal but complete extension to the standard of [6] for handling paraconsistent information. An exhaustive extension would have to provide generalized versions of other operators and features, such as for all, exists, andthen, orelse, < some, >= any, etc. We have left that as a future extension of the work presented in this paper. Some other future directions in which we plan to extend this work are to develop query languages and techniques for databases that contain quantitative paraconsistency (a finer notion of paraconsistency with real values for belief and doubt factors) and temporal paraconsistency (dealing with paraconsistent information that evolves with time).
References
1. R. Bagai and R. Sunderraman. A paraconsistent relational data model. International Journal of Computer Mathematics, 55(1):39–55, 1995.
2. R. Bagai and R. Sunderraman. Bottom-up computation of the Fitting model for general deductive databases. Journal of Intelligent Information Systems, 6(1):59–75, 1996.
3. R. Bagai and R. Sunderraman. Computing the well-founded model of deductive databases. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 4(2):157–175, 1996.
4. C. Baral, S. Kraus, and J. Minker. Combining multiple knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 3(2):208–220, 1991.
5. N. D. Belnap. A useful four-valued logic. In G. Epstein and J. M. Dunn, editors, Modern Uses of Many-Valued Logic, pages 8–37. Reidel, Dordrecht, 1977.
6. R. G. G. Cattell and D. K. Barry. The Object Data Standard: ODMG 3.0. Morgan Kaufmann Publishers, 2000.
7. N. C. A. da Costa. On the theory of inconsistent formal systems. Notre Dame Journal of Formal Logic, 15:497–510, 1974.
8. S. J. Kelley. A Paraconsistent Object-Oriented Database Management System. MS thesis, Department of Computer Science, Wichita State University, November 2001.
9. V. S. Subrahmanian. Amalgamating knowledge bases. ACM Transactions on Database Systems, 19(2):291–331, 1994.
10. N. Tran and R. Bagai. Infinite Relations in Paraconsistent Databases. Lecture Notes in Computer Science, 1691:275–287, 1999.
11. N. Tran and R. Bagai. Efficient Representation and Algebraic Manipulation of Infinite Relations in Paraconsistent Databases. Information Systems, 25(8):491–502, 2000.
Decision Support with Imprecise Data for Consumers

Gergely Lukács

Universität Karlsruhe, Institute for Program Structures and Data Organization, Am Fasanengarten 5, 76128 Karlsruhe, Germany
Abstract. Only imperfect data is available in many decision situations, and such data therefore plays a key role in the decision theory of economic science. It is also of key interest in computer science, among other areas when integrating autonomous information systems: the information in one system is often imperfect from the view of another system. The case study for the present work combines the two issues: the goal of the information integration is to provide decision support for consumers, the public. By integrating an electronic timetable for public transport with a geographically referenced database, for example of rental apartments, it is possible to choose alternatives, for example rental apartments from the database, that have a good transport connection to a given location. However, if the geographic references in the database are not sufficiently detailed, the quality of the public transport connections can only be characterized imprecisely. This work focuses on two issues: the representation of imprecise data and the sort operation for imprecise data. The proposed representation combines intervals and imprecise probabilities. When the imprecise data is only used for decision making with the Bernoulli principle, a more compact representation is possible without restricting the expressive power. The key operation for decision making, the sorting of imprecise data, is discussed in detail. The new sort operation is based on so-called π-cuts, and is particularly suitable for consumer decision support.
1
Introduction
There is great potential in Business-to-Consumer Electronic Commerce (B2C EC) for information-based marketing, and for integrating independent product evaluations of, for example, consumer advisory centres with the offers of Web-shops. In both cases, in fact, a decision support system for consumers has to be built. As an example, it would be possible to show not only the retail price of household machines, but also the cost of the energy they consume over their lifetime. In traditional stores, there is little possibility to communicate the latter to the consumer, but in Web-shops customized calculations could easily be carried out. Interestingly, many important possibilities for climate change mitigation have negative net costs, i.e., although there is usually a higher initial investment
required, the reduced energy consumption and energy costs more than make up for these costs¹. In decision support, however, imperfect data is often unavoidable. Imperfections occur, e.g., because the future cannot be predicted with certainty, or because the decision alternatives or the preferences of the decision maker are not known perfectly. In many cases, non-negligible uncertainties result from these sources. Therefore, the handling of uncertainties is a key issue in the theory and practice of professional decision making. If we want to offer decision support for consumers, appropriate ways to describe and to operate on imperfect data are necessary, and information systems, in particular database management systems, have to be extended correspondingly. The handling of imperfect data in information systems is far from being a straightforward issue; even simple operations cannot easily be extended for imperfect data. It is not surprising, then, that this issue has been studied very extensively in the research literature. A corresponding bibliography [1], almost comprehensive up to the year 1997, lists over 400 papers in this field. Even though there is an obvious correspondence between decision making and imperfect data, only a few papers mention decision making. That is where this paper sets in: we study the handling of imperfect data from a decision-theoretic point of view. Furthermore, our goal is to provide decision support not for professionals but for consumers, i.e., a very wide user group, whose characteristics have to be considered in the way imperfect data is handled. There are many possible criteria consumers want to consider in their decisions. As opposed to professional decision making, it is not possible to support all criteria in the system; some of them can only be considered by the users. We call these criteria external criteria. The system thus shall give the users the freedom to consider their external criteria. It shall not select some good alternatives, but rather sort all available alternatives, so that the user can pick an appropriate alternative from the sorted list. For this reason, the sort operation plays a key role in decision support for consumers. The rest of the paper is organized as follows. In Section 2 the case study and in Section 3 the necessary background in decision theory and probability theory are introduced. Related work is analysed in Section 4. The new contributions of the paper, a very powerful representation of imprecise values and a sort operation for imprecise values, are described in Sections 5 and 6. Conclusions and future work are discussed in Section 7.
2
Case Study
In the long term we envision applications like those mentioned previously, i.e., B2C Web-shops where the consumer can readily consider the (customized) running energy costs of his various consumer decisions, such as the selection of household machines.
1 IPCC. Climate change 2001: Mitigation. Available at: http://www.ipcc.ch.
Currently, we are working with a seemingly different case study, geographic information retrieval (GIR). There is a database with geographically referenced data, e.g., rental apartments, restaurants or hotels, on the one side, and an electronic timetable for public transport, including digital maps, on the other side. Queries such as "find rental apartments close by public transport to my new working place in Karlsruhe" can be carried out by the integrated system. A particular challenge is that some data objects only have incomplete geographic references, e.g., only the settlement is known, or the house number of an otherwise complete address is missing, which is very common in classified advertisements. The current GIR case study and decision support for consumers in Web-shops may give the impression of not having much in common. However, they share the following features, making them similar for our investigations. First, from a decision-theoretic point of view, in both cases an alternative has to be selected from a set of alternatives, where the evaluations of the alternatives are imprecise (due to the incomplete address or due to, e.g., unknown technical data or imprecisely predictable energy prices). Second, in both cases the user group is potentially very large, so that, as opposed to professional decision making, a solid mathematical background of the decision maker cannot be assumed. (Lastly, the GIR case study also has an energy/cost saving and climate change mitigating potential.)
3
Background Issues
3.1
Decision Making under Imprecision
The consequences of a decision often depend on chance: for example, there may be two alternatives with uncertain outcomes, characterized by probability variables. (For the sake of simplicity we assume over the rest of the paper that a smaller outcome is better.) In a simple decision model one could consider the expected values of the two probability variables and select the alternative with the lower expected value. As an example, let there be two alternatives Alt1 and Alt2 whose outcomes are characterized by the probability variables w1 and w2. Using this simple decision model, one would prefer the first alternative over the second if the expected value of the first outcome is smaller than the expected value of the second outcome:

Alt1 > Alt2 ⇔ E(w1) < E(w2)    (1)
However, considering the expected value is often unsatisfactory, because e.g., it would not explain buying a lottery ticket or signing an insurance contract — both actions reasonable under special circumstances. Decision makers often value potentially very large wins more and potentially very large losses less than the expected values would suggest. The names of the corresponding decision behaviours are “risk-sympathy” and “risk-aversion”, the decision strategy considering only the expected values corresponds to “risk-neutrality”.
The risk strategy of a decision maker describes how he behaves when the outcomes of the alternatives are probabilistic variables. The Bernoulli principle [2] from the decision theory of the economic sciences is concerned with decision making under imprecision. It uses a utility function u that describes the risk strategy of the decision maker. The utility function is applied to the (uncertain) outcomes of the alternatives, so that instead of the outcomes the corresponding utilities are considered. In the case of risk-aversion, the utility function associates comparatively smaller utilities with larger outcomes, i.e., it is a concave function. In the case of risk-sympathy, relatively higher utilities are associated with high outcomes, i.e., the utility function is convex. When comparing two alternatives, instead of the expected values of their outcomes, the expected values of their utilities, calculated with the utility function u, are compared:

Alt1 > Alt2 ⇔ E(u(w1)) < E(u(w2))    (2)
The utility function in fact plays two roles: it describes the risk preference, but also the height preference of the decision maker. The height preference is the relative utility of an outcome in comparison to another outcome, which also plays a role in risk-free decisions. According to many decision theorists, the double role of the utility function does not result in any limitation of the approach. If the decision maker prefers a larger outcome to a smaller outcome (or vice versa), which is very often the case, then the height preference function is monotonic. In this case, the utility function considering both the height preference and the risk preference is also monotonic for all reasonable risk behaviors.
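As a small numerical illustration of Equations (1) and (2) (ours, with made-up travel times and a simple convex utility standing in for a risk-averse attitude towards large outcomes, since smaller outcomes are assumed to be better):

    # Sketch: comparing two alternatives by expected utility (Bernoulli principle).
    def expected_utility(outcome, u):
        # outcome: list of (value, probability) pairs of a discrete K-probability
        return sum(pr * u(x) for x, pr in outcome)

    def prefer(alt1, alt2, u=lambda x: x):
        # Eq. (2): the alternative with the smaller expected utility of its (cost-like)
        # outcome is preferred; u = identity gives Eq. (1), i.e. risk neutrality.
        e1, e2 = expected_utility(alt1, u), expected_utility(alt2, u)
        return "Alt1" if e1 < e2 else "Alt2" if e2 < e1 else "indifferent"

    alt1 = [(30, 1.0)]                 # a certain 30-minute connection
    alt2 = [(10, 0.5), (48, 0.5)]      # a gamble: 10 or 48 minutes
    print(prefer(alt1, alt2))                        # Alt2 (expected values 30 vs. 29)
    print(prefer(alt1, alt2, u=lambda x: x ** 2))    # Alt1 (large outcomes penalized more)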
3.2
Probability Theory
Probability theory offers a powerful way of describing imprecise values. Classical probability theory can be defined on the basis of the Kolmogorov axioms (for simplicity we only handle discrete domains):

Definition 1. (K-Probability) A domain of values dom(A) and the power set 𝒜 of the domain are given. The set function Pr on 𝒜 is a K-probability if it satisfies the Kolmogorov axioms:

0 ≤ Pr(A) ≤ 1, A ∈ 𝒜    (3)

Pr(dom(A)) = 1    (4)

If A1, A2, ..., Am (Ai ∈ 𝒜) satisfy Ai ∩ Aj = ∅ when i ≠ j:

Pr(∪k Ak) = Σk Pr(Ak)    (5)
An important feature of K-probabilities is that if the function Pr is known for some particular events, i.e., subsets of dom(A), e.g., for the events having only one element, the function can be calculated for all other events, too.
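A one-line illustration of this feature (ours): on a discrete domain, singleton probabilities plus axiom (5) determine Pr for every event.

    # Pr of an arbitrary event from singleton probabilities, using additivity (axiom 5)
    def event_probability(singleton_pr, event):
        return sum(singleton_pr[a] for a in event)

    singleton_pr = {"a1": 0.2, "a2": 0.5, "a3": 0.3}      # sums to 1 (axiom 4)
    print(event_probability(singleton_pr, {"a1", "a3"}))  # 0.5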
For classical probability theory, precise probabilities are required. This is often seen as a major limitation, since precise probabilities can often not be determined. Therefore, much effort has been put into finding other, more general theories for describing imprecise data. The most widely known such theories are fuzzy measures, possibility and necessity measures [3,4], belief and plausibility measures (also called the Dempster-Shafer theory of evidence) [5,6], lower and upper probabilities [7] and lower and upper previsions [8]. The most powerful and general of these is the theory of lower and upper previsions [9,10]. However, it does not only describe possible alternatives, but also their utilities to the decision maker. This is unnecessary for our purposes, as we will explain later, because we do not expect the users to make their utility functions explicit. The next most powerful theory is the theory of lower and upper probabilities, which is sufficiently general for our purposes. Lower and upper probabilities, also called R- and F-probabilities, interval-valued probabilities, imprecise probabilities, or robust probabilities (see e.g., [10], [11], [8]), are very powerful in describing as much, or as little, information as is available. Our brief overview is based on [12] and [10].

Definition 2. (R-Probability) An interval-valued set function PR on 𝒜 is an R-probability if it satisfies the following two axioms:

PR(A) = [PrL(A); PrU(A)], 0 ≤ PrL(A) ≤ PrU(A) ≤ 1, ∀A ∈ 𝒜    (6)

The set M of K-probabilities Pr over 𝒜 fulfilling the following conditions (also called the structure of the R-probability) is not empty:

PrL(A) ≤ Pr(A) ≤ PrU(A), ∀A ∈ 𝒜    (7)
The name "R-probability" stands for "reasonable", meaning that the lower and upper probabilities do not contradict each other, and that there is at least one K-probability function fulfilling the boundary conditions defined by the lower and upper probabilities. An R-probability can very well be redundant, i.e., some of the lower and upper probabilities could be dropped without changing the structure of the R-probability, i.e., its information content. An F-probability is a special type of R-probability:

Definition 3. (F-Probability) An R-probability fulfilling the following axioms is called an F-probability:

inf_{Pr∈M} Pr(A) = PrL(A),  sup_{Pr∈M} Pr(A) = PrU(A),  ∀A ∈ 𝒜    (8)
In an F-probability, the boundaries PrL and PrU are not too wide; they and the corresponding structure M implicate each other. An F-probability is therefore called "representable" [11] or coherent [8]. The name "F-probability" refers to the word "feasible", meaning that for all lower and upper probabilities there exists a K-probability from the structure realizing those values.
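The following sketch (ours) illustrates this relationship for a finite stand-in for the structure M: the lower and upper probabilities of the resulting F-probability are the pointwise infimum and supremum, here simply min and max, over the K-probabilities in M, as in Equation (8).

    from itertools import chain, combinations

    def envelope(structure, domain):
        # structure: a finite list of K-probabilities, each given by its singleton probabilities
        subsets = chain.from_iterable(combinations(domain, r) for r in range(1, len(domain) + 1))
        bounds = {}
        for ev in subsets:
            values = [sum(pr[a] for a in ev) for pr in structure]
            bounds[frozenset(ev)] = (min(values), max(values))    # (PrL(ev), PrU(ev))
        return bounds

    M = [{"a1": 0.2, "a2": 0.8}, {"a1": 0.5, "a2": 0.5}]          # two K-probabilities
    print(envelope(M, ["a1", "a2"])[frozenset({"a1"})])           # (0.2, 0.5)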
K-, R- and F-probabilities associate probabilities, or lower and upper probabilities, with every possible subset of dom(A). A partially defined R- or F-probability is defined only on some, but not on all, possible subsets of dom(A). We only give the definition for the partially defined F-probability:

Definition 4. (Partially defined F-probability) Two subsets 𝒜L and 𝒜U of 𝒜 are given:

𝒜′ = 𝒜 \ {dom(A), ∅}    (9)

𝒜L ⊆ 𝒜′, 𝒜U ⊆ 𝒜′, 𝒜L ∩ 𝒜U ⊂ 𝒜′, 𝒜L ∪ 𝒜U ≠ ∅    (10)

A partially defined F-probability associates with all A ∈ 𝒜L a lower bound PrL(A) and with all A ∈ 𝒜U an upper bound PrU(A), such that a non-empty structure M of K-probabilities Pr fulfilling the following conditions exists:

PrL(A) ≤ Pr(A), ∀A ∈ 𝒜L
Pr(A) ≤ PrU(A), ∀A ∈ 𝒜U
inf_{Pr∈M} Pr(A) = PrL(A), ∀A ∈ 𝒜L
sup_{Pr∈M} Pr(A) = PrU(A), ∀A ∈ 𝒜U    (11)
A very significant difference between precise (K-) and imprecise (R- and F-) probabilities is that additivity holds for the first, but not for the second theory. This has the following important consequence. A K-probability can be fully described with probabilities on only a relatively small subset of 𝒜, e.g., all events with a single element. From these probabilities, using additivity, the probabilities of all other events can be calculated. This does not hold for R- and F-probabilities. Since handling all possible events is too costly in most cases, partially determined R- and F-probabilities have to be worked with, and it is a very critical issue which events are selected and associated with lower and upper probabilities.
4
Related Work
Approaches to handling imperfect data in information systems, or more specifically in database management systems, are summarized in this section. We concentrate on the description of imprecise values and on the sort operation, the research foci of the present work.
4.1
Description of Imprecise Values
Commercial database management systems only support imprecise values by “NULL”-values [13,14,15,16]. That is, if a value is not precisely known, all partial information shall be ignored, the value shall be declared unknown, and the corresponding attribute shall get the value “NULL”. The approach presented in [17] supports imprecise values by allowing a description by a set of possible values or by intervals (for ordinal domains). As an
example, an attribute can get the value "25 − 50", meaning that the actual value is between these two values. Probability-theory-based extensions to different data models, such as the relational or the object-oriented one, have been researched over the last two decades [18,19,20,21,22,23,24,25,26,27]. Both the description of imprecise values and operations on them are investigated. In the following, we focus on how imprecise values are described, and how the sort operation, essential in decision support, is extended for imprecise values. The approach in [25] describes a classical (K-)probability in a relation. Each tuple has an associated probability, expressing that the information in the tuple is true. The sum of the probabilities in the relation equals 1. The major drawback of the approach is that a separate relation is required for every single imprecise value. The approach in [19,20] is similar to the previous one, with the difference that the constraint referring to the sum of the probabilities in a relation is dropped. It thus becomes possible to describe several imprecise values in a single relation. The approach in [18,24] supports imprecise probabilities, rather than only precise probabilities. Imprecise probabilities can be associated with any possible set of values from the domain of the attribute in question. However, no further criteria are set, ensuring, e.g., the conditions for an R- or F-probability. Hence, information in these probabilistic relations can be contradictory or redundant, which we consider a major drawback of the approach. All of the previously introduced approaches associate probabilities with tuples. This has the disadvantage that pieces of information on a single imprecise value are scattered over several tuples. A handier approach is to encapsulate all pieces of information on an imprecise value in a single (compound) attribute value. A corresponding approach with precise probabilities is presented in [28]. The approach in [22,23], too, handles the probabilities inside compound attribute values. It also allows "missing" probabilities, a very special case of partially determined F-probabilities. The argument for supporting "missing" probabilities is that some relational operations can result in them. However, by supporting only this special case of partially determined F-probabilities, the expressive power of the data model suffers. The approach in [21] associates lower and upper probabilities only with single elements of the domain and not with each or any possible subsets of the domain. This approach therefore has, from the point of view of imprecise probabilities, a very limited expressive power, and no explanation justifying this limitation is given. The approach in [26] supports both an interval description and, optionally, a description with K-probabilities. The approach in [27] supports imperfections both on the tuple and on the attribute level. The underlying formalism is, however, the Dempster-Shafer theory of evidence, known also as belief and plausibility measures, which has a much more limited expressive power than the theory of imprecise probabilities. Summarizing the above arguments, there is no known approach in which both intervals and imprecise probabilities are supported. Furthermore, all approaches supporting imprecise probabilities only allow very special forms of imprecise
probabilities with a limited expressive power, and they do this without giving a plausible explanation for this restriction.
4.2
Sort Operation on Imprecise Data
One major application area of imperfect data is decision support, where the sort operation is essential for ranking the alternatives. Still, surprisingly, there is little work on extending the sort operation, or the closely related comparison operation, in database management systems for imprecise data. The approach in [26] gives a proposal for a comparison operation. The result of the comparison is determined by whether it is more probable that the first imprecise value is larger, or that the second imprecise value is larger. This approach results in an unsatisfactory semantics, as the following example shows; the reason is that the possible values themselves are not considered, only their relation to each other. Consider, e.g., the case that the first alternative has a probability of 51% of being 1 minute better than the second alternative, and the second alternative has a probability of 49% of being 99 minutes better. The comparison operation considers the first alternative better, even though in most decision situations it is more reasonable to consider the second alternative to be better. The approach in [17] only considers binary criteria. In a comparison of two alternatives, the one with the larger probability of fulfilling the binary criterion is considered to be better. The semantics of this approach, too, is often not satisfactory, as the following example shows. Let us consider the binary criterion that the travel time is under 30 minutes. The first alternative fulfills this criterion with a probability of 55%, the second alternative with a probability of 50%. Hence, the first alternative will win over the second one. It is easily possible, however, that this is an unreasonable choice, e.g., when the value of the first alternative is definitely over 29 minutes, and the second alternative has a value of under 10 minutes with a probability of 50%. Summarizing, the sort operation is a much neglected issue in the proposals for extending database technology for imprecise data, and the few available proposals have an unsatisfactory semantics for decision support purposes.
5
Description of Imprecise Values
Our first task is to specify a formalism for imprecise data with a very high expressive power. The formalism shall allow us to accommodate as much, or as little, information about the actual value as is available. It shall not force us to ignore partially available information or give the impression that more information is available than really is. The following description fulfills these requirements. We denote the attribute in question by A and the domain of the possible actual values by dom(A). The first possibility to specify an imprecise value for attribute A is to specify the set of possible values, i.e., a subset of dom(A). We denote this set of possible
values by PV; e.g., if there is an imprecise value w for attribute A, the set of possible actual values is w.PV (where w.PV ⊂ dom(A)). The second possibility to specify an imprecise value is an imprecise probability (e.g., an R-, F- or partially defined F-probability). In this case, there are sets of possible values and associated lower and upper probabilities. That is, both the subsets of dom(A) for which lower probabilities are available and the lower probabilities themselves have to be specified; this applies to the upper probabilities, too. We introduce the following notation. The subsets of dom(A) for which lower probabilities are available are denoted by SSPrL. The lower probability for a particular event e ∈ SSPrL is denoted by Pr_L^e. For the upper probabilities a similar notation is used, with the letter L replaced by U. All possibilities to describe an imprecise attribute value (the set of possible actual values, and lower and upper probabilities for some subsets of the possible actual values) are optional. Hence, the expressive power of the formalism is maximal: as much, or as little, information on the actual value can be described as is available.
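As a data-structure sketch (ours; the field names are illustrative and not taken from the paper), the general description of an imprecise value can be held as an optional set of possible values plus two dictionaries of event bounds; the sample value loosely follows Fig. 1a below.

    from dataclasses import dataclass, field
    from typing import Dict, FrozenSet, Optional

    @dataclass
    class ImpreciseValue:
        pv: Optional[FrozenSet[int]] = None        # w.PV, the set of possible values, if known
        lower: Dict[FrozenSet[int], float] = field(default_factory=dict)  # Pr_L^e, e in SSPrL
        upper: Dict[FrozenSet[int], float] = field(default_factory=dict)  # Pr_U^e, e in SSPrU

    w = ImpreciseValue(pv=frozenset({10, 20, 40}),
                       lower={frozenset({20}): 0.1},
                       upper={frozenset({10, 20}): 0.8})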
5.1
Description for Decision Support
The formalism introduced to describe imprecise values is very general, and its expressive power is very high. This has the drawback that supporting it in a real system is expensive in terms of realization effort, storage space and time. It is especially awkward having to describe arbitrary subsets of dom(A), needed both for the set of possible values and for the lower and upper probabilities. If we have ordinal data, and we only want to use imprecise data for a special purpose, e.g., decision making with the Bernoulli principle, we can define a less general and less costly formalism. By appropriately choosing the restricted formalism, the decisions made will not be affected, and hence the expressive power will be unrestricted from the point of view of decision making. We assume in the following that the utility function u is monotonic, which is reasonable in most cases, as discussed previously. First, we give a definition of when two imprecise values are considered to be equivalent.

Definition 5. Two imprecise values w1 and w2 are equivalent if the minimum and the maximum of their expected utilities (calculated according to the Bernoulli principle) do not differ from each other for any monotonic utility function.

The rationale of this definition is that if there are no differences in the minimum and the maximum values of the expected utilities, there is no difference in the information content relevant to the decision making: either the one or the other imprecise value is preferred, or no preference relation can be set up between the two imprecise values. The following theorems state the existence of an equivalent imprecise value in a simple form for an arbitrary imprecise value. The first theorem refers to the case when the imprecise values are characterized by sets of possible values only (i.e., no imprecise probabilities are available):
Theorem 1. For any imprecise value w1 characterized by a set of possible values w1.PV, there exists an equivalent imprecise value w2 characterized by a special set of possible values, a closed interval over dom(A).

The proof of the theorem and the construction of the equivalent imprecise value are straightforward. We denote the limits of the closed interval by L and U, e.g., for the value w2 by w2.L and w2.U. We now turn to the case when an imprecise value is characterized by lower and upper probabilities for arbitrary subsets of dom(A), the most general description with imprecise probabilities.

Theorem 2. Let w1 be an imprecise value characterized by lower and upper probabilities for the subsets w1.SSPrL and w1.SSPrU. There exists an equivalent imprecise value w2, where w2.SSPrL = w2.SSPrU = {{a1}, {a1, a2}, ..., {a1, ..., an}} (the ai denote the values in dom(A): dom(A) = {a1, ..., an} and ai < aj when i < j) and the lower and upper probabilities represent a partially defined F-probability.

The theorem states that for an imprecise value described with a very general imprecise probability, an equivalent imprecise value of a very simple form can be defined, where only special subsets of dom(A) are considered. The proof of the theorem and the algorithm for constructing the equivalent imprecise value are beyond the space limits of this paper. The notation for this special form is as follows. The set of subsets for which lower and upper probabilities are available follows directly from dom(A) and consequently does not need a special notation. The lower and upper probabilities for the subset {a1, ..., ai} are denoted by Pr_L^{≤ai} and Pr_U^{≤ai}.
b: Equivalent imprecise value for decision support
   w.L = 10
   w.Pr_L^{<=10} = 0      w.Pr_U^{<=10} = 0.7
   w.Pr_L^{<=20} = 0.1    w.Pr_U^{<=20} = 0.8
   w.Pr_L^{<=30} = 0.2    w.Pr_U^{<=30} = 1
   w.U = 40
Fig. 1. Equivalent imprecise values
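As a concrete illustration, the restricted form of Figure 1(b) can be held in a very small data structure: the interval [L, U] plus lower and upper probabilities for the cumulative events "value <= a_i". The following Python sketch is only illustrative (the class and field names are ours, not the paper's); it simply encodes the values shown in Figure 1(b).

from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ImpreciseValue:
    lower: float    # w.L, lower end of the interval of possible values
    upper: float    # w.U, upper end of the interval of possible values
    # cumulative bounds: threshold a_i -> (Pr_L^{<=a_i}, Pr_U^{<=a_i})
    cum_bounds: Dict[float, Tuple[float, float]] = field(default_factory=dict)


# The equivalent imprecise value of Figure 1(b).
w = ImpreciseValue(
    lower=10,
    upper=40,
    cum_bounds={10: (0.0, 0.7), 20: (0.1, 0.8), 30: (0.2, 1.0)},
)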
6 Sorting Imprecise Values
The sort operation has a central role in decision making: the alternatives have to be sorted on the basis of their utility, and the best alternative has to be chosen from the sorted list. If there are external criteria to be considered, beyond those considered by the system when sorting the alternatives, then it may be reasonable to choose not the first but another alternative from the sorted list. We only consider single attributes now, i.e., an alternative is described by a single, though potentially imprecise, value, and our goal is to define a decision-theoretically founded sort operation for imprecise values. On the basis of the Bernoulli principle an exact value, the expected utility, could be associated with each of the imprecise values, and the sort operation would be straightforward. There are two preconditions for applying this simple approach: (1) the utility function u has to be known; (2) the imprecise values have to be characterized by precise (K-)probabilities. These preconditions do not hold in our case. First, while professional decision makers may have the necessary mathematical background and know their utility function explicitly, consumers cannot be expected to describe their preferences by such mathematical means, even if they do have such preferences implicitly. Second, our imprecise values are characterized by an interval of possible values and imprecise probabilities, rather than only by precise (K-)probabilities. For these reasons, the straightforward approach to the sort operation cannot be applied. Our sort operation has to be defined in a way that our decision makers, consumers, do not need any special mathematical background and can handle it according to their implicitly available risk strategy, which can be any reasonable risk strategy. Furthermore, the sort operation has to be defined on our general description of imprecise values, optimised for decision making. There are only two special cases where a preference relation can be set up between two imprecise values, even if the utility function is the same for both values. The first special case is when the intervals of possible values do not overlap with each other, i.e., w1.U < w2.L. In this case, the first actual value is definitely smaller than the second actual value, and the first value (alternative) has to be chosen. The second special case is when the structure of the imprecise probability of the first imprecise value dominates the structure of the imprecise probability of the second imprecise value, i.e., for all a_i in dom(A): w1.Pr_U^{<=a_i} < w2.Pr_L^{<=a_i}. In this case it is reasonable to choose the first value based on the available information, but we are actually not sure which actual value is larger. In most cases, however, these special cases do not hold, and no preference relation can be set up between the two alternatives. Consequently, we can only determine a partial sorting of the values (alternatives), which is complicated to present to the user (in comparison to a sorted list) and gives little support for the decision. Furthermore, these preference relations can only be applied if we assume that the same utility function holds for both imprecise values. However, it is possible that the utility functions are different for different imprecise values. The key concept of our approach is the so-called π-cut (where "π" stands for probability). A π-cut consists of a qualifier q and a probability value v,
where q ∈ {≤, ≥} and 0 ≤ v ≤ 1. A π-cut determines with which value and, consequently, in which position an imprecise value should appear in the result of the sort operation. We call this value the π-cut value of the imprecise value for the particular π-cut. The π-cut value of an imprecise value is the value that is exceeded by the actual value of the imprecise value with the probability specified by the probability value of the π-cut. For example, if the probability value v of a π-cut is 0.1 (or 10%), the π-cut value of an imprecise value will be exceeded by the actual value of the imprecise value with a probability of 0.1. The qualifier of the π-cut is necessary to handle imprecise (F-)probabilities instead of only precise (K-)probabilities. The qualifier ≤ means that the probability value v of the π-cut should not be exceeded by the probability of the imprecise value, the qualifier ≥ that it should be exceeded. If the difference between the lower and upper probabilities is too large, it may occur that no appropriate π-cut value can be found on the basis of the above principles. In this case the lower and upper ends of the interval of possible values are used. The formal definition is as follows:
Definition 6. (π-cut value) The π-cut value w.PC^{π-c} of a given imprecise value w for a given π-cut π-c = (q; v) is defined as

  w.PC^{(q;v)} =
    a_j, where j = max{ i : w.Pr_U^{<=a_i} <= v },   if q = "≤" and there is an i with w.Pr_U^{<=a_i} <= v
    w.L,                                             if q = "≤" and there is no such i
    a_j, where j = min{ i : w.Pr_L^{<=a_i} <= v },   if q = "≥" and there is an i with w.Pr_L^{<=a_i} <= v
    w.U,                                             if q = "≥" and there is no such i            (12)
An imprecise value occurs as many times in the result of the sort operation as there are π-cuts defined. An imprecise value appears in the result with its π-cut value and, in addition, the corresponding π-cut. The result of this extended sort operation is indeed very easy to interpret, even for users with very little mathematical background, e.g., "This alternative (imprecise value) has a chance of less than 10% (more than 90%) of reaching this value". Furthermore, the approach allows the user to apply any risk strategy, provided sufficiently many π-cuts are set. In most cases only a few predefined π-cuts give a sufficiently detailed result for the users; for example, the two π-cuts (≤; 0.1) and (≥; 0.9) are often sufficient. For advanced users, the setting of arbitrary π-cuts should also be allowed.
Example 2. There are four alternatives available. Each of the alternatives is characterized by a key ID, a value W and a description. Two alternatives have precise and two imprecise values (see Fig. 2). We use the π-cuts (≤; 0.1) and (≥; 0.9). The π-cut values of alternative Alt3 are PC^(≤;0.1) = 40 and PC^(≥;0.9) = 50, those of Alt4 are PC^(≤;0.1) = 20 and PC^(≥;0.9) = 60. The result of the sort operation is represented in Fig. 3.
ID    W                                         Description
Alt1  10                                        Karl-W. Str. 5: 2 bedroom, 45 qm, sunny . . .
Alt2  30                                        Klosterweg 28: 3 bedroom, top floor . . .
Alt3  L = 30                                    Karlsruhe Waldstadt: 2 bedroom, ground floor . . .
      Pr_L^{<=30} = 0       Pr_U^{<=30} = 0.06
      Pr_L^{<=40} = 0.30    Pr_U^{<=40} = 0.60
      U = 50
Alt4  L = 20                                    Karlsruhe: 2 bedroom, 2nd floor . . .
      Pr_L^{<=20} = 0.01    Pr_U^{<=20} = 0.15
      Pr_L^{<=30} = 0.15    Pr_U^{<=30} = 0.30
      Pr_L^{<=40} = 0.15    Pr_U^{<=40} = 0.45
      Pr_L^{<=50} = 0.50    Pr_U^{<=50} = 0.70
      U = 60

Fig. 2. Alternatives with imprecise values

ID    W (PC^{π-c})   π-c
Alt1  10
Alt4  20             (≤; 0.1)
Alt2  30
Alt3  40             (≤; 0.1)
Alt3  50             (≥; 0.9)
Alt4  60             (≥; 0.9)

Fig. 3. Result of sorting imprecise values with π-cuts
The alternatives with precise values, Alt1 and Alt2, occur once, and the alternatives with imprecise values, Alt3 and Alt4, occur twice in the sorted result list. The user of the system now looks for an appropriate alternative. Alt1 and Alt2 are not suitable because of the external criteria. Alt3 with a π-cut of ≤ 10%, i.e., (≤; 0.1), is the next item in the list. The user thinks that even though this alternative has some chance of being good considering the external criteria, it is too risky to choose it. So he goes further, finds that Alt3, now with the π-cut (≥; 0.9), is a very good alternative considering the external criteria, and decides to choose it.
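The construction of the sorted result list of Figure 3 can be sketched as follows. The Python fragment below is illustrative only (the variable names are ours); it takes the π-cut values as stated in Example 2 rather than recomputing them via Definition 6, adds one entry per precise value and one entry per π-cut for each imprecise value, and sorts the entries by their (π-cut) value.

# Precise alternatives contribute one entry each.
precise = {"Alt1": 10, "Alt2": 30}

# Imprecise alternative -> {pi-cut: pi-cut value}, taken from Example 2.
pi_cut_values = {
    "Alt3": {("<=", 0.1): 40, (">=", 0.9): 50},
    "Alt4": {("<=", 0.1): 20, (">=", 0.9): 60},
}

rows = [(w, alt, None) for alt, w in precise.items()]
rows += [
    (w, alt, pi_cut)
    for alt, cuts in pi_cut_values.items()
    for pi_cut, w in cuts.items()
]
rows.sort(key=lambda r: r[0])          # sort ascending by (pi-cut) value

for w, alt, pi_cut in rows:            # reproduces the order of Fig. 3
    print(alt, w, pi_cut or "")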
7 Summary and Future Work
We presented a group of applications where decision support is offered for consumers by integrating information from autonomous information systems, and where the handling of imprecise data is required. The central issues in such a setting are a sufficiently powerful description of the imprecise data and the extension of the join and sort operations for imprecise data.
In the current paper we presented a formalism with very high expressive power to describe imprecise data. We also presented a restricted formalism, much easier to handle, for the case when imprecise data is used for decision making, in other words when a sort operation has to be defined over imprecise data. The restricted formalism does not lose expressive power in this application. We also defined a sort operation appropriate for consumer decisions on imprecise data. The main concept of the sort operation is the so-called π-cut. Future work is required on extending the join operation (known in database management) to handle imprecise data. The join operation has a central role, too, for two reasons. First, imprecise data often occurs when integrating (or joining) data from autonomous information systems. Second, if there are several probabilistically dependent imprecise values, they still preferably have to be described in separate relations, and the dependencies have to be considered in the join operation. At this point, the theory of Bayesian networks, intensively researched and applied in Artificial Intelligence, could be adapted to data models.
Acknowledgements. The author is partially supported by the Hungarian National Fund for Scientific Research (T 030586). Earlier discussions with Peter C. Lockemann and Gerd Hillebrand contributed significantly to the ideas presented.
References 1. Curtis E. Dyreson. A bibliography on uncertainty management in information systems. In Amihai Motro and Philippe Smets, editors, Uncertainty management in information systems, pages 413–458. Kluwer Academic Publishers, 1997. 2. von J. Neumann and O. Morgenstern. Theory of games and economic behavior. Princeton, 1944. 3. Didier Dubois and Henri Prade. Possibility theory : an approach to computerized processing of uncertainty. Plenum Pr., New York, 1988. Original: Th´eorie des possibilit´es. 4. L.A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3–28, 1978. 5. A.P. Dempster. Upper and lower probabilities induced by a multivalued mapping. The annals of mathematical statistics: the off. journal of the Institute of Mathematical Statistics, 38:325–339, 1967. 6. Glenn Shafer. A mathematical theory of evidence. Princeton Univ. Pr., Princeton, New Jersey, 1976. 7. C.A.B. Smith. Consistency in statistical inference and decision (with discussion). Journal of the Royal Statistical Society, 1961. 8. Peter Walley. Statistical Reasoning with Imprecise Probabilities. Chapmann and Hall, 1991. 9. Peter Walley. Towards a unified theory of imprecise probability. In Proceedings of the First International Symposium on Imprecise Probabilities and their Applications, Ghent, Belgium, 29 June - 2 July, 1999. 10. Kurt Weichselberger. The theory of interval-probability as a unifying concept for uncertainty. In Proceedings of the First International Symposium on Imprecise Probabilities and their Applications, Ghent, Belgium, 29 June - 2 July, 1999.
11. Peter J. Huber. Robust statistics. John Wiley & Sons, 1981. 12. Kurt Weichselberger. Axiomatic foundations of the theory of interval-probability. In Symposia Gaussiana, Proceedings of the 2nd Gauss Symposium, pages 47–64, 1995. 13. Edgar F. Codd. Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4):397–434, December 1979. 14. Witold Lipski. On semantic issues connected with incomplete information databases. ACM Transactions on Database Systems, 4(3):262–296, 1979. 15. Joachim Biskup. A foundation of Codd’s relational maybe-operations. ACM Transactions on Database Systems, 8(4):608–636, December 1983. 16. Tomasz Imielinski and Witold Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, October 1984. 17. Joan M. Morrissey. Imprecise information and uncertainty in information systems. ACM Transactions on Information Systems, 8(2):159–180, April 1990. 18. Michael Pittarelli. Probabilistic databases for decision analysis. International Journal of Intelligent Systems, 5:209–236, 1990. 19. Debabrata Dey and Sumit Sarkar. A probabilistic relational data model and algebra. ACM Transactions on Database Systems, 21(3):339–369, September 1996. 20. Debabrata Dey and Sumit Sarkar. PSQL: A query language for probabilistic relational data. Data and Knowledge Engineering, 28:107–120, 1998. 21. Laks V.S. Lakshmanan, Nicola Leone, Robert Ross, and V.S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, September 1997. 22. Daniel Barbara, Hector Garcia-Molina, and Daryl Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5):478–502, 1992. 23. Daniel Barbara, Hector Garcia-Molina, and Daryl Porter. A probabilistic relational data model. In F. Bancilhon, C. Thanos, and D. Tsichritzis, editors, Advances in Database Technology, EDBT’90, International Conference on Extending Database Technology, Venice, Italy, March, 1990, volume 416 of Lecture Notes in Computer Science, pages 60–74. Springer-Verlag, 1990. 24. Michael Pittarelli. An algebra for probabilistic databases. IEEE Transactions on Knowledge and Data Engineering, 6(2):293–303, April 1994. 25. Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases. In Proceedings of the 13th VLDB Conference, Brigthon, pages 71–81, 1987. 26. Curtis E. Dyreson and Richard T. Snodgrass. Supporting valid-time indeterminacy. ACM Transactions on Database Systems, 23(1):1–57, March 1998. 27. Ee-Peng Lim, Jaideep Srivastava, and Shashi Shekhar. An evidential reasoning approach to attribute value conflict resolution in database integration. IEEE Transactions on Knowledge and Data Engineering, 8(5):707–723, 1996. 28. Yoram Kornatzky and Solomon Eyal Shimony. A probabilistic object-oriented data model. Data and Knowledge Engineering, 12(2):143–166, 1994.
Genetic Programming: A Parallel Approach

Wolfgang Golubski
University of Siegen, Department of Electrical Engineering and Computer Science
Hölderlinstr. 3, 57068 Siegen, Germany
[email protected]
Abstract. In this paper we introduce a parallel master-worker model for genetic programming where the master and each worker have their own equal-sized populations. The workers execute in parallel starting with the same population and are synchronized after a given interval where all worker populations are replaced by a new one. The proposed model will be applied to symbolic regression problems. Test results on two test series are presented.
1 Introduction
The importance of regression analysis in economics, the social sciences and other fields is well known. The objective of solving a symbolic regression problem is to find an unknown function f which best fits given data (x_i, y_i), i = 1, . . . , k, for a fixed and finite k ∈ N. With this function f the output y for an arbitrary x not belonging to the data set can be estimated. The genetic programming approach [5] is one of the successful methods for solving regression problems. But genetic programming (abbreviated as GP) can be a very slow process. Hence parallelization could be a possible way to speed up the evaluation process. Small computer networks consisting of a handful of computers (e.g. five machines) are accessible to and manageable by many people. Therefore the starting point of our investigation was the question whether a simple parallelization model of the GP approach can lead to significant improvements in the solution search process. The aim of this paper is to answer this question and to present a parallel master-worker model applied to a basic (sequential) genetic programming algorithm in order to find suitable real functions, and to show that it can work better than a basic GP. The most important quality feature is the success rate, that is, the measure of how often the algorithm evaluates an acceptable solution. The parallel model presented here works with multiple populations where the master and each worker have their own equal-sized populations. The workers execute in parallel starting with the same population and are synchronized after a given interval, at which point all worker populations are replaced by a new one. To show the quality of our model we present test results of 6 test series. Our proposed parallel model is applied to 42 different regression problems on
polynomial functions of degree not higher than 10. The presented results are promising. The paper is structured as follows. In the next section we present the basic genetic programming algorithm which is used for finding good fitting functions in the sequential case. In the third section we describe the parallel master-worker model. Then in Section 4 we present our test results. Before finishing the paper with a conclusion we give a literature review in the fifth section.
2 Basic Genetic Programming
Koza [5] used the idea of a genetic algorithm to evolve LISP programs solving a given problem. Genetic algorithms manipulate the encoded problem representation, which is normally a bit string of fixed length. Crossover, mutation and selection are operators of evolution applied to these bit string representations. Over several generations these operators modify the set of bit strings and lead to an optimization with respect to the given fitness function. So genetic algorithms perform a directed stochastic search. In order to apply a genetic algorithm to the problem of regression we need to know the structure of the solution, which is given by the function itself. In genetic programming, Koza used tree structures instead of the string representation. He wanted to evolve a LISP program itself for a given task. LISP programs are represented by their syntactic tree structure. Starting with a set of LISP programs, new programs are evolved by recombination and selection until a sufficiently good solution has been found. Crossover operators operate on the tree structures by randomly selecting and exchanging sub-trees between individuals. This way a new tree, which stands for a new program, is generated. Each individual is evaluated according to a given fitness function. Out of the set of individuals of the same generation the best programs are chosen to build the offspring of the next generation. This process is also called selection. In contrast to evolutionary or genetic algorithms, no mutation operators are used in the genetic programming approach. So the tree structure representation of each individual in genetic programming overcomes the problem of fixed-length representation in genetic algorithms. Let us now describe the way of representing a real function as a tree. The prefix notation of a real function can easily be transformed into a tree structure where arithmetic operators such as +, · and − as well as pdiv stand for the nodes of the tree and the variables of the function are the leaves of the tree. As an example we consider the function x^3 + x^2 + 2x. The corresponding prefix notation has the form +(+(∗(∗(x, x), x), ∗(x, x)), +(x, x)). The corresponding tree is drawn in Figure 1. The application of genetic programming to regression problems is well known, see [5,3]. Therefore we call the approach presented in this section basic genetic programming. The main parts are described next in more detail:
1. Initialization: The genetic algorithm is initialized with a sufficiently large set of real functions which are randomly generated. The size (depth of each tree) is initially
Fig. 1. Tree representation of the function x^3 + x^2 + 2x
restricted so that simple functions are in the initial generation. The leaves of the tree are randomly selected out of a given interval.
2. Selection: The fitness of each member of the previously generated population is computed, i.e., the total error over all sample points. The functions with the smallest errors are the fittest of their generation.
3. Stopping Criterion: If there exists a real function (tree) with an error smaller than a predefined fitness threshold, then the genetic program has found an appropriate solution and stops. Otherwise the algorithm goes on to step 4.
4. Recombination: A recombination process takes place to generate 90% of the population of the next generation. Two of the fittest functions are randomly chosen and recombined. For this, one node of the first tree is randomly selected as its crossover point. In the same way a crossover point of the second function is chosen. Both crossover nodes including their subtrees are exchanged, resulting in two new trees; see Figure 2 for an example.
5. Reproduction: 10% of the generation are chosen directly from the fittest trees of the previous generation. With reproduction and recombination the whole new generation is generated.
6. Our genetic program proceeds with step 2 until a given number of generations is reached.
The parameter settings are summarized in Tables 1 and 2.
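To make the tree representation and the fitness computation concrete, the following Python sketch shows trees as nested tuples, their evaluation, random tree growth and the absolute-error fitness of Eq. (2). It is only an illustrative sketch, not the authors' implementation; in particular, pdiv is assumed here to be the protected division commonly used in GP, which the paper does not define, and the sample points are evenly spaced rather than drawn randomly from [-1, 1].

import operator
import random


def pdiv(a, b):
    # Protected division: returns 1.0 when the divisor is (close to) zero.
    return a / b if abs(b) > 1e-12 else 1.0


FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "pdiv": pdiv}


def evaluate(tree, x):
    # Evaluate a tree given as nested tuples, e.g. ('+', ('*', 'x', 'x'), 'x').
    if tree == "x":
        return x
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))


def ab_error(tree, data):
    # Fitness of Eq. (2): sum of absolute errors over the sample points.
    return sum(abs(evaluate(tree, x) - y) for x, y in data)


def random_tree(depth):
    # Grow a random tree of limited depth; the terminal set is just {x}.
    if depth == 0 or random.random() < 0.3:
        return "x"
    op = random.choice(list(FUNCTIONS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))


# x^3 + x^2 + 2x in the tree form of Fig. 1
target = ("+", ("+", ("*", ("*", "x", "x"), "x"), ("*", "x", "x")), ("+", "x", "x"))
data = [(x / 25.0 - 1.0, evaluate(target, x / 25.0 - 1.0)) for x in range(50)]
print(ab_error(target, data))   # 0.0 for the exact tree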
3 Parallel Genetic Programming
The starting point of our investigation was the question whether a simple parallelization of the basic genetic programming approach would lead to a significant improvement of the success rate and of the solution search process.
Table 1. Basic Parameter Settings

Parameter                  Value
# test functions           42
max. generation            50
fitness type               AbError, see Eq. (2)
fitness threshold          10^-6
recombination probability  90%
reproduction probability   10%
function set               {×, +, −, pdiv}
terminal set               {x}
Strongly influenced by the master-worker or master-slave paradigm of parallel and distributed systems [2], we have implemented the genetic programming approach as master-worker programs (processes). The master process fulfills two tasks: (1) the management of the worker communication, including synchronization, and (2) the handling of the fittest individual sets. That means that the master stores the population (more precisely, the fittest individuals). Each worker unit is responsible for two (sub-)processes: (1) maintaining the communication with the master and (2) executing the basic GP algorithm. That means that each worker stores its own population and executes the GP operations such as recombination, reproduction and fitness evaluation. Without the communication part a worker behaves like a usual GP, e.g. as described in Section 2. In more detail, the master-worker parallel genetic programming model works as follows, see Figure 3. The following numbers are related to this figure. The master and the worker processes must be started, see 1. The master waits until all workers are ready to work. Then the master sends the parameter settings (i.e. the parameter settings of the basic GP and the synchronization number) to each worker. The synchronization number represents the number of generation steps to be executed on a worker without a break, unless a solution is found; as soon as a solution is found, the whole execution process is stopped. Each worker initializes the basic GP algorithm with the received parameter settings and creates the start population. Then the worker performs as many generation steps as given by the synchronization number, see 2. Now the basic GP is interrupted and the current population is transmitted to the master process, see 3. During this time the master is waiting for the workers' responses in order to collect all workers' fittest individual sets. Once all worker transmissions have been received, the master picks the best individuals, i.e. the workers' fittest individual sets are merged and sorted by fitness value and the resulting set is reduced so that the sizes of the fittest individual sets on master and workers are the same. Next the master broadcasts the new set of fittest individuals to each worker process, see 4. Each worker replaces its fittest individual set by the received one and resumes the GP execution. The steps just described are
Fig. 2. Crossover between two Programs where the dashed circles are the crossover points (Ci and Di are variables)
repeated until one worker finds a successful solution or the prescribed maximal number of generation steps has been reached. Our model is implemented with Java and Java RMI.
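A single synchronization round of the model can be sketched as follows. This Python fragment is purely illustrative (the authors' system is written in Java with Java RMI, and all function and parameter names below are ours): each worker runs the basic GP for the given number of synchronization steps on its own population, the master merges the returned fittest individual sets, keeps the best ones so that the set has the same size as on the workers, and broadcasts that set back.

def worker_round(population, sync_steps, evolve_one_generation, fitness):
    # Run `sync_steps` GP generations locally (recombination, reproduction,
    # fitness evaluation) and return the population sorted by fitness,
    # smallest absolute error first.
    for _ in range(sync_steps):
        population = evolve_one_generation(population)
    return sorted(population, key=fitness)


def master_round(worker_results, fitness, n_fittest):
    # Merge the fittest individuals reported by all workers, sort them by
    # fitness and truncate, so that the broadcast set has the same size as
    # each worker's local set of fittest individuals.
    merged = [ind for result in worker_results for ind in result[:n_fittest]]
    merged.sort(key=fitness)
    return merged[:n_fittest]      # broadcast to every worker, which replaces its
                                   # fittest individuals and resumes the basic GP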
4 Results
In this section we will discuss how our parallel genetic programming approach is tested. The system has been applied to symbolic regression problems. We have chosen 42 polynomial functions of degree not higher than 10,

    f(x) = Σ_{i=1}^{10} a_i ∗ x^i                                    (1)
where a_i ∈ Z and Z denotes the set of integers. Two functions (the quintic and sextic polynomials) are taken from the literature [5] and the others are randomly generated. A data set is generated for each of these functions by randomly choosing real numbers x^i in the predefined interval [−1, 1]. The outputs Y^i are computed as f(x^i) = Y^i for each of the previously chosen functions. In our tests we use 50 data pairs per function. For each of these functions we run 100 differently initialized genetic programs in order to see how well our method performs. The fitness function is defined as the absolute error, i.e.
Fig. 3. The Master-Worker Genetic Programming Model
    AbError = Σ_{i=1}^{50} |f(x^i) − Y^i|,                           (2)
If a function has an AbError smaller than the given threshold value, then the function represents a successful solution of the problem. The complete parameter settings of the genetic programming algorithm are listed in Tables 1 and 2. Six parameter settings, more precisely two parameter settings differing in the population size, have been applied in each case to the basic genetic programming approach, the parallel model with two workers and the parallel model with four workers. The synchronization interval is set to 5 worker generation steps. The obtained results are also listed in Table 2. Regarding the results it can be seen that our proposed method performs quite well. Both test series (T1xx and T2xx) show the same behavior. The basic version T1 does not have a good success rate. But the underlying genetic programming algorithm was neither extensively tested to obtain acceptable parameter settings nor optimized as done in [3]. Applying the parallel model to these poor parameter settings already leads to a dramatic improvement of the success rate. The 4-worker version T1-W4 finds a problem solution in 67% of all cases. The T2 test series behaves similarly to the T1 series. The parallel model leads to a significant improvement although the success rate of the basic version was already acceptable (i.e. 75%). In the 4-worker system most runs (i.e. 96%) deliver an acceptable function description. Regarding the total number of generation steps used, a similar pattern can be seen: the basic versions use many more generation steps than the parallel ones.
Table 2. Additional Parameter Settings and Test Results

Parameter                     T1      T1-W2   T1-W4   T2      T2-W2   T2-W4
population size               100     100     100     500     500     500
# fittest individuals         14      14      14      72      72      72
Only for parallel:
# workers                     -       2       4       -       2       4
# synchronization steps       -       5       5       -       5       5
total # executed generations  161773  134385  102130  78400   49345   32450
max. # executed generations   210000  210000  210000  210000  210000  210000
generation rate               77%     64%     48%     37%     23%     15%
successful runs               1378    1985    2678    3158    3685    4028
max. # runs                   4200    4200    4200    4200    4200    4200
success rate                  31%     47%     67%     75%     88%     96%
It is also interesting that, in most cases where a genetic program stopped by reaching the fitness threshold, the algorithm only needed a small number of generations (< 20) to find a suitable real function. So it looks like our method performs quite well. However, we are running more tests in order to verify this for other, more complicated functions as well, e.g. fuzzy functions [4].
5 Comparison to Existing Approaches
Let us now review the literature. There are numerous papers on parallel genetic algorithms (see [1,7] for a good overview with further literature references) but only a few on parallel genetic programming [8,6]. First we summarize in a few words the most important approaches. Usually parallel genetic algorithms are divided into three main categories: (1) global single-population master-slave algorithms, where the fitness evaluation is parallelized; (2) single-population fine-grained algorithms suited for massively parallel computers, where each processor holds one individual; and (3) multiple-population coarse-grained algorithms, which are more sophisticated and where each population exchanges individuals with the others at a fixed exchange rate. The connection of populations is strongly influenced by the underlying network topology (e.g. hyper-cubes). A population can only exchange individuals with a population in its neighborhood. The parallel genetic programming approaches [8,6] are both of category (3). To sum up, the parallel model presented in Section 3 can be characterized by (i) the master-worker paradigm, (ii) multiple populations, where master and workers have their own equal-sized populations, (iii) workers executing in parallel starting with the same population, except at the initialization phase, (iv) synchronization after a given interval, at which all worker populations are replaced by a new one, and (v) the fact that the proposed algorithm does not behave like the basic GP.
The proposed parallel master-worker model is obviously different from parallel genetic algorithms of kinds (1) and (2). In contrast to (1), each worker has its own population. The massively parallel approach of kind (2) is not comparable at all. In some sense our model can be regarded as an exotic version of the multiple-population approach (3). But usually the underlying topology is not a star, nor is there central management of the population exchange (performed here by the master), that is, the construction of the new worker population is missing. [8] is the approach most similar to ours, but the points just stated (i.e. topology and exchange) still hold and the implementation technique (MPI) is a different one. Our model is rather influenced by the client-server and master-worker paradigms known from distributed systems.
6 Conclusions and Further Work
We presented a master-worker model suited for executing genetic programming in parallel. The proposed method shows quite good results on solving regression problems. What has to be done next is to run more tests on more complicated problems.
References
1. E. Cantu-Paz: A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systems Repartis, Vol. 10, No. 2, Paris: Hermes (1998) 141-171
2. G. Coulouris, J. Dollimore and T. Kindberg: Distributed Systems - Concepts and Design (3rd Edition). Addison-Wesley (2000)
3. J. Eggermont and J.I. van Hemert: Adaptive Genetic Programming Applied to New and Existing Simple Regression Problems. Proceedings of the Fourth European Conference on Genetic Programming (EuroGP2001), Lecture Notes in Computer Science Vol. 2038, Springer (2001)
4. W. Golubski and T. Feuring: Genetic Programming Based Fuzzy Regression. Proceedings of KES2000, 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, Brighton (2000) 349-352
5. J.R. Koza: Genetic Programming II. Cambridge/MA: MIT Press (1994)
6. M. Oussaidene, B. Chopard, O. Pictet and M. Tomassini: Parallel Genetic Programming and its Application to Trading Model Induction. Parallel Computing, 23 (1997) 1183-1198
7. M. Tomassini: Parallel and Distributed Evolutionary Algorithms: A Review. Evolutionary Algorithms in Engineering and Computer Science, J. Wiley and Sons, Chichester, K. Miettinen, M. Mäkelä, P. Neittaanmäki and J. Periaux (editors) (1999) 113-133
8. M. Tomassini, L. Vanneschi, L. Bucher and F. Fernandez: A Distributed Computing Environment for Genetic Programming using MPI. Recent Advances in Parallel Virtual Machine and Message Passing Interface, J. Dongarra, P. Kaksuk and N. Podhorszki (Eds), Lecture Notes in Computer Science Vol. 1908, Springer (2000) 322-329
Software Uncertainty

Manny M. Lehman (1) and J.F. Ramil (2)
(1) Department of Computing, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK
[email protected]
(2) Computing Dept., Faculty of Maths and Computing, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK
[email protected]
Abstract. This paper presents reasoning implying that the outcome of the execution of an E-type program or E-type software system (software for short), of whatever class, is not absolutely predictable. It is intrinsically uncertain. Some of the sources of that uncertainty are identified and it is argued that the phenomenon qualifies as a Principle of Software Uncertainty. The latter represents an example of an assertion in a Theory of Software Evolution which is ripe for development based on empirical generalisations identified in previous research, most recently in the FEAST projects. The paper briefly discusses some practical implications of uncertainty, and of the other concepts presented, for evolution technology and software processes. Though much of what is presented here has previously been discussed, its presentation as a cohesive whole provides a new perspective.
1 Program Specification

When computers first came into general use it was taken for granted, and may still be in some circles, that, subject to correct implementation, the results of computations not violating the domain limits set by the specification would be absolutely predictable. The paper argues that, as a consequence of the inevitable changes in the real world in which a computing system operates, the outcome of program execution in the real world cannot and should not be taken for granted. In general, a specification [34] prescribes domain boundaries and defines properties of the required and acceptable program behaviour when executed and the nature of the inputs to and outputs from the various operations that the program is to perform. It also addresses the quality of the results of execution, in terms of, for example, numerical precision. The need for a specification is generally accepted as basic for the initial development process. Specifications can provide many benefits, particularly when stated in a formal notation [11]. Whether formal or not, they provide developers with an explicit statement of the purpose the program is to address and of precisely what is required, and a guide for potential users of the software to the facilities available from the program, its capabilities and limitations. It is also realised by some that a specification is important as a means for facilitating the inevitable evolution, enhancement and extension of the software.
Meeting these expectations requires that the system and software requirements were satisfactorily identified and stated in the first place and as subsequently modified during development, usage and evolution. It is also necessary that the software and the larger system that includes the designated hardware on which it is to run, other associated software such as an operating system, have been appropriately expressed, adequately validated (for example by comprehensive testing) and documented. The role and influence of any other elements, including humans, involved in system operation must also be taken into account. Finally, it is necessary that the software has been correctly installed and that the hardware is operating without fault. However, in his 1972 Turing Lecture, Dijkstra [6] pointed out that testing can, at best, demonstrate the presence of faults, never their absence. To ensure satisfaction of the specification, the software must be verified by demonstrating its correctness, in the full mathematical sense of the term [9], [10], relative to that specification. This, of course, requires that it has been derived from a formal specification or otherwise constructed in a way that enables and facilitates proofs of correctness or other forms of verification. Only where this is possible is the term correctness meaningful in its mathematical sense. Specifications may take many forms. To be appropriate in the context of verification it must be formal, complete in the sense that it reflects all properties identified as critical for application success, free of inconsistencies and at a level of granularity and explicitness that permits implementation with a minimum of assumptions; a topic further considered below. It is also desirable that the formalisms used facilitate representation of the domains involved. These prerequisities are not easily met for all types of applications and programs. They are clearly met by programs of type S, as described below. These operate in an isolated, abstract and closed, for example mathematical, domain. The need for software systems of this type is somewhat restricted. They are exemplified by programs used in programming lectures and textbooks and, more significantly, to address completely defined problems or situations in a well understood and fully defined domain [33]. The latter would include mathematical function or model evaluation in the physical sciences, for example, in calculating the focal length of a lens to be used in a prescribed and precisely defined instrument. This class of programs stands in sharp contrast to those that operate in and interact with a real world domain and termed E-type systems. The outputs of the latter are used in some practical application or problem solution within some prescribed domain. Such programs, that have become pervasive at all levels of business and other organisations, are therefore, of universal interest. They do, however, present major challenges. In the present discussion, it is their behavioural properties that are of particular relevance. The behaviour of E-type software cannot be fully specified and formalised as required for demonstrations of correctness. Aspects of their behaviour can be formally specified. This produces, at best, a partial specification. Moreover, when such formal partial specifications are available, obstacles to the demonstration of correctness would be likely to arise as a result of system size and various sources of complexity, constraints that cannot further considered here. 
In discussing uncertainty in software, what is of interest is the obstacle to complete specification and formalisation that arises from the fact that the application domains involved are essentially unbounded in the number of their attributes and that many of these attributes are subject to change. Moreover, when humans are involved in system operation the unpredictability of individual human behaviour also makes appropriate formalisation at best difficult, more generally
impossible. Thus correctness, in the formal sense, is meaningless in the E-type context. One can merely consider how successful the system is in meeting its intended purpose. But need this be a matter for concern? Demonstrations of correctness, even of those parts of a system whose specification can be adequately formalised, are likely to be of little interest to real world users or other stakeholders, who are, by the way, often likely to hold inconsistent viewpoints [8]. Some increase in confidence that the software as a whole will satisfy its stakeholders may result from a demonstration of correctness against a partial specification and/or that parts of the program are correct with respect to their specifications. But the concern of stakeholders will generally be with the results of execution. It is those that are assessed and applied in the context of the problem being solved or the application being pursued in the operational domain. What matters is the validity and applicability of the results obtained and the consequential behaviour when the outputs of execution are used. Whether system execution is successful will be judged in terms of whatever criteria had previously been explicitly or implicitly determined or the behaviour that was desired and expected. This is the ultimate criterion of software system acceptability in the real world. A program classification briefly described below is useful for clarification of these concepts.
2 SPE Classification Scheme 2.1 Type S The SPE software classification scheme has been described and discussed many times [15], [17], [30]. Two views of S-type programs have been given in the various discussions of the term. The first considers S-type programs as those operating in an abstract and closed, for example mathematical, domain. Their behaviour may be fully and absolutely specified [33]. As already suggested, this situation has only limited practical relevance. The second view is, however, of wider relevance for computing in the real world. It identifies S-type programs as those for which the sole criteria for accepting satisfactory completion of a development contract (whatever its nature) is that the completed product satisfies a specification reflecting the behavioural properties expected from program execution. As stated above, this presupposes existence of an appropriate formal specification, accepted by the customer as completely defining the need to be fulfilled or the computational solution desired. Such a specification becomes a contract between a supplier and the representative of prospective users. Whether the results of execution are useful in some sense, whether they provide a solution to the specified problem, will be of concern to both users and producers. However, once the specification has been contractually accepted and the product has been shown to satisfy it, the contract has been fulfilled, by definition. Thus, if the results do not live up to expectations or need to be adjusted, that is, if their properties need to be redefined, rectification of the situation requires a new, revised, specification to be drawn up and a new program to be developed. Depending on the details of changes required, new versions may be developed from scratch or obtained by modification of those rejected. However achieved, it is a new program.
2.2 Type E Type E software has been briefly described in section 1. As, by definition, a program used to solve real world problems or support real world activity, it is the type of most general interest and importance. Conventionally, its development is initiated by means of eliciting requirements [e.g. 27]. At best, only parts of the statement of requirements, mathematical functions for example, can be fully described, defined and formalised. The criterion of correctness is replaced by validity or acceptability of the results of execution in real world application. If there are any that do not meet the need for which the development was triggered in the first place, the system is likely to be sent back for modification. Whether the source of dissatisfaction originates in system conception, specification, design and/or implementation is, in the context of the present discussion, irrelevant though different procedures to identify the source of dissatisfaction and to implement the necessary changes or take other action must be adopted in each situation. Stakeholders have several choices. For example, other software may be acquired, the activity to be supported or its procedures may be changed to be compatible with the system as supplied or the latter may be changed. If the third approach is taken, one may be faced with the alternatives of changing code or documentation or both. Since the real world and activities within it are dynamic, always changing, such changes are not confined to the period when the system is first accepted and used. As stated in the first law of software evolution [12], [14], [15], briefly described as “the law of continuing change”, similar choices and the need for action will arise throughout the system’s active lifetime. The system must be evolved, adapted to changing needs, opportunities and operational domain characteristics. The latter may be the result of exogenous changes or of installation and use of the system, actions that change the operational domain and are likely to change the application implemented or the activity supported. Evolution is, therefore, inevitable, intrinsic to the very being of an E-type system. This is illustrated by Figure 1 which presents a simplified, high level, view of the software evolution process. Note that the number and nature of the steps in the figure merely typifies a sequence of activities that make up a software process and is not intended to be definitive. A more detailed view would show it to be a multi-level, multi-loop, multi-agent feedback system [2], [14], [22]. We briefly return to this point later in the paper. 2.3 Type P P-type software was included in the original schema to make the latter as inclusive as possible. They cover programs intended to provide a solution to a problem that can be formally stated even though approximation and consideration of the real world issues are essential for its solution. It was suggested that this type could, in general, be treated as one or other of the other two. But there are classes of problems, chess players and some decision support programs, for example, that do not satisfactorily fit into the other classes and for completeness the type P is retained. However, when type P programs are used in the real world they acquire type E properties and, with regard to the issue of uncertainty, for example, they inherit the properties of that type. Hence they need not be separately considered in the present paper.
Fig. 1. Stepwise program development omitting feedback loops
2.4 S-type as a System Brick S-type programs are clearly of theoretical importance in exposing concepts such as complete specification, correctness and verification. Examples studied, and even used, in isolation are often described as toy programs. Their principle value is pedagogical. However, where the behaviour and properties of a large system component can be formally, specified, it may, in isolation be defined to be of type S, giving it a property that can play an important role in achieving desired system properties. The fact that its properties are fully defined and verifiable indicates the contribution it can make to overall system behaviour. The specification, though complete in itself, is generally, based on assumptions about the domain within which it executes. However, when so integrated into a host E-type system, that system becomes the execution domain. As is discussed further below, this leads to a degree of uncertainty in system behaviour which is, in turn, reflected in component behaviour. A component of type S when considered in isolation, acquires the properties of type E once it is integrated into and executed within an E-type system and domain. Clearly the S-type property can make contribution to achieving required software behaviour. It cannot guarantee such behaviour. It has a vital role to play in system building that will become ever more important as the use of component based architectures, COTS and reuse becomes more widespread. Knowledge of the properties of a component and of any assumptions made, explicitly or implicitly, during its conception, definition, design and implementation, whether ab initio development for direct use, as a COTS offer or development for organisational reuse, is vital for subsequent integration into a real world system. It has long been recognised that formal specification is a powerful tool for recording, revealing and
understanding the content and behaviour of a program [11], provided that the latter is not too complex or large. It would appear essential for COTS and reuse units to be so specified, considered and processed as S-type programs. The most general view of Stype programs sees them as bricks in large systems construction. The S-type concept is significant for another reason, one that has immediate practical implications. Development of elemental units for integration into a system is normally assigned to one individual, or to a small group. The activity they undertake generally involves assumptions, conscious or unconscious, that resolve on-the-spot issues that arise as the design and code or documentation text is evolved. Resolution of issues that are adopted will generally be based on time and space-local views of the problem and/or the solution being implemented and may well be at odds with design or implementation of other parts of the system. Even if then justified, the assumption(s) that resolution requires will be reflected in the system and can become a source of failure at a later time when changes in, for example, the application domain invalidate it (them). Assumptions become a seed for uncertainty. Basing work assignments to an individual or small group on a formal specification strictly limits their freedom of choice, of uncertainty in the assignment that forces local decisions that are candidates for subsequent invalidation. Specifications of practical real world systems such as those already in general use or to be developed in the future cannot, in general, be fully formalised. Those of many of the low level elements, the modules that are the bricks from which the system is to be constructed, can. The principle that follows is that, wherever possible, work assignments to individuals, implementers or first line managers should be of type S. Application of the S-type program concept can simplify management of the potential conflicts and the seeds of invalid assumptions that must arise when groups of people implement large systems concurrently with individuals taking decisions in relative isolation. But even S-type bricks are not silver bullets [3]. No matter how detailed the definition when an S-type task is assigned, issues demanding implementation-time decisions will still be identified by the implementer(s) during the development process. In principle, they must be resolved by revision of the specification and/or documentation. However, explicit or implicit, conscious or unconscious adoption of assumptions by omission or commission will inevitably happen, may remain unrecorded and are eventually forgotten. Even when conscious, they are adopted by implementers who conclude that the assumptions are justified on the basis of a local view of the unit, the system as a whole and its intended application. With the passage of time, however, and the application and domain changes that must be expected, some of these assumptions are likely to become invalid. Thus even S-type bricks carry seeds of uncertainty that can cause unanticipated behaviour or invalid results. The use of S-type bricks minimises the likelihood of failure, the uncertainty at the lowest level of implementation, where sources of incorrect behaviour are most difficult to identify. It does not reduce it to zero. As described below, the assumptions issues raised are inescapable, a major source of program execution uncertainty. 
It is worth noting that in practice the formal specification of each of the S-type bricks does not tend to be a one-off activity. As implementation progresses and new understanding emerges, the specification itself must be updated to reflect properties (e.g., bounds of variables) that only become apparent when the low-level-of-abstraction issues are tackled and/or when emergent properties of the larger system are identified. This issue is recognised at the S-type level, for example, by a recently proposed retrenchment method [1], which acknowledges the fact and
provides tools to help overcome the fact that not only the program, but also its formal specification are likely to require change and adaptation as implementation progresses and usage and subsequent program evolution take over. 2.5 The Wider SPE Concept As described so far the SPE classification relates only to programs and to integrated and interacting assemblies of such programs into what has been loosely termed systems. The first step in generalising the concept, particularly in relation to E-type systems is to extend the class to include embedded real-world systems, that is total systems which include hardware that is tightly coupled to and usually controlled by software. In such systems the software must be regarded as, at least part of, the means whereby the system achieves its purpose. Neither the term real world nor embedded has been defined here but they are used as in common usage. This suggests, that whatever type of software is installed in the system, once embedded, the hardware/software system as such is also appropriately designated as of type E. Its hardware elements operate as part of the real world supporting other real world activity. Thus such systems must be expected to share at least some of the properties of E-type software and some of its evolutionary characteristics [17]. They will evolve, but the pattern of evolution, rates of change and other time dependent behaviour are likely to be quite different since, unlike software, system evolution is not achieved by applying change upon change to a uniquely identifiable soft artefact – code and/or documentation - but by replacement of elements or parts of the system. Moreover, its physical parts are subject to physical laws, with material characteristics, such as size, weight, energy consumption, processing speed, memory size, and fitness in the proposed environments rapidly become constraints for the evolution of the hardware parts of the embedded system. Given this extension, it is then appropriate to refer to E-type systems in general, even where software does not play a dominant, or even any, role in driving and controlling system behaviour and evolution. One of the major sources of the intrinsic need of E-type software for continual evolution, stems from the fact that the software evolution process is a feedback system [2], [14], [22]. This is true for real world systems and, in particular, for those in which human organisations or individuals are involved, whether as drivers, as driven or as controllers as exemplified by social organisations such as cities [12]. Hence the concept of E-type can be further extended to include this wider class. Whether such generalisation is useful and what, other than intrinsic evolution, are the behavioural similarities between these various types of systems remains to be investigated. If successful, such investigation should contribute to understanding of how artificial systems evolve and to mastery of their design and evolution [32]. It is, however, certainly appropriate to include general systems such as computer applications in the E-type class. As indicated above, at the centre of the arguments being presented in this paper, a computer application and the software implementing and supporting it are interdependent, the evolution of one triggers that of the other, as illustrated by the closed loop of Figure 2. Hence, the concept of E-type applications is also useful.
Fig. 2. Installation, operation and evolution
2.6 S-type as Part of a Real World System So far the concept of S-type has been treated as a property of a program in isolation. However programs are seldom developed to stand alone except, perhaps, when studying programming methodology or by students of programming. In so far as industry is concerned, a program is, in general, used in association with other software or, in the case of, for example embedded systems, hardware. It is either integrated into such software and becomes part of the larger system or it provides output used elsewhere. And even if, in rare instances a program is used in total isolation, the results it produces are, in general, intended for application in some other activity. In all these situations, the S-type program has become part of, an element in, an E-type system, application or activity. It will require the same attention, as would any E-type element in the same system. The specification of the S type program may, for example, need to be evolved to adapt it to the many sources of change already discussed. The S-type categorisation applies only to the program in isolation, a matter of major significance when the concepts of reuse, COTS and component-based architectures are considered [23].
3 Computer Applications and Software, Their Operational Domain, and Specifications Linking Them

The remainder of this paper considers one aspect of the lifetime behaviour of E-type software and applications. Underlying the phenomena to be considered is a view,
illustrated in Figure 3, of the relationships between a program, its specification and the application it implements or supports.

Fig. 3. An early view [16] (figure: the formal specification, an abstract entity, is related by abstraction and reification to the application concept and the operational system in the concrete, real world)
This view is a direct application of the mathematical concepts of theory and its models in the software context [16], [33], [34], [35]. A more recent and detailed depiction is provided by Figure 4.
Fig. 4. A more recent and detailed view (figure: the specification is obtained by abstraction from the application in its real world domain and is reified into the E-type program; the real world domain and the program are each a model of the specification, and the relationships are maintained by verification and validation)
A specification can be seen as a theory of the application in its real world operational domain [33], obtained by abstraction and successive refinement [36] from knowledge and understanding of both. It may, for example, be presented as a statement of requirements, with formal parts where possible. Conversely, the real world and the program are both models of the theory or, equivalently, of the specification [33]. The executable program is obtained by reification based on successive refinement or an equivalent process. Program elements reflecting formal specification elements should be verified and the program, in part and as a whole, validated against the specification. This can ensure that, within bounds set by the specification and to the extent to which the validation process covers the possible states of the operational environment, the program will meet the purpose of the intended application as
expressed in the specification. These checks are, however, not sufficient. The system must also be validated under actual operational conditions, since it will have properties not addressed in the specification. The checks should be repeated, in whole or in part, whenever changes are subsequently made to the program, to maintain the validity of the real world/specification/program relationship.
4 Inevitability of Assumptions in Real World Computing

The real world per se and the bounded real world operational sub-domain have an unbounded number of properties. Since a statement of requirements and a program specification are, of necessity, finite, an unbounded number of properties will have been excluded. Such exclusions will include properties that were unrecognised or assumed irrelevant with respect to the sought-for solution. Every exclusion implies one or more assumptions that, explicitly or implicitly, by omission or commission, knowingly or unknowingly, become part of the specification. It will be up to the subsequent verification and validation processes to confirm the validity of the set of known assumptions in the context of the application and its operational domain. The conscious and unconscious bounding of the latter, and of functional content and its detailed properties, during the entire development process determines to a great extent the resultant unbounded assumption set.

Validation of the specification with respect to the real world domain is required to ensure that it satisfies the needs of the application in its operational domain. Since the real world is dynamic, validation, in whole or in part, must be periodically repeated to ensure that, as far as possible, assumptions invalidated by changes in the external world or the system itself are corrected by appropriate changes. Assumptions revealed by such changes must be recorded and included in the known set for subsequent validation. Figure 5 expresses the need for the program to be periodically (ideally continually) validated with respect to the real world domain over the system lifetime. A desirable complementary goal is to maintain the specification as an abstraction of the real world and the program; we have termed this goal maintaining the system (program, specification, documentation) as a model-like reflection of the real world.

The issue of assumptions does not arise only from the relationship between the specification and the intended application in its domain. The reification process also adds assumptions. This is exemplified by decisions taken to resolve issues not addressed in the specification, which implies that they were overlooked or were considered irrelevant in generating the specification. Moreover, the abstraction and reification processes are, generally, carried out independently, and assumptions adopted in the two phases may well be incompatible. Real world validation of the completed program is, therefore, necessary. Testing over a wide range of input parameters and conditions is a common means of establishing the validity of the program. But the conclusions reached from tests are still subject to Dijkstra’s stricture [6] and, in a dynamic world, are, at most, valid at the time when the testing is undertaken or in relation to the future as then foreseen. Hence the overall validity of the assumption set relates to real world properties at the time of validation.

As indicated in Figure 5, assumption relationships are mutual. The specification is necessarily based on assumptions about the application and its real world operational
domain because the latter are unbounded in the number of their properties and the specification is bounded. Moreover, in general, the application has potential for many more features than can be included based on available budgets and the time available to some designated completion point. Hence the specification also reflects assumptions about the application and implementation. But users, in general, cannot be fully aware of the details and complexities of the specification.

Fig. 5. The role of assumptions (figure: the specification, the application in its real world domain and the E-type program each embody assumptions about one another; these must remain compatible with one another and be subject to continual verification and validation)
Even a disciplined and well-managed application will be based on assumptions about the system. Similarly, as they pursue their various responsibilities, system implementers will make interpretations and assumptions about the specification, particularly its informally expressed parts. And, probably to a lesser extent, those responsible for developing the specification will make assumptions about the program, its content, performance and the system domain (hardware and human) in which it executes. Assumption relationships between the three entities, the specification, the application in its real world domain and the program are clearly mutual. To the extent that they are recognised and fully understood, both individual assumptions and the mutual compatibility of the set will be validated. But those that have been adopted unconsciously or arise from omission will not knowingly have been validated. Nor will validity bounds have been determined or even recognised. All these factors represent sources of invalid assumptions reflected in the program. Moreover, the real world is always changing. Hence even previously valid assumptions and their reflection in program and documentation texts, structure and timing may become invalid.
5 A Principle of Software Uncertainty

5.1 A Principle

Given the above background, we are now in a position to introduce a Principle of Software Uncertainty in a revised formulation of a statement first published some years ago [18], [19], [20], [21] on the basis of insights developed during earlier
studies of software evolution [17]. The Principle may be stated in short as: “The outcome of the execution of E-type software entails a degree of uncertainty; the outcome of execution cannot be absolutely predicted”, or more fully, “Even if the outcome of past execution of an E-type program has previously been admissible, the outcome of further execution is inherently uncertain; that is, the program may display inadmissible behaviour or invalid results”.

This statement makes no reference to the source of the uncertainty. Clearly the presence in the software of assumptions, known or implied, that may be invalid is sufficient to create uncertainty. There may be other sources. As stated, the principle refers explicitly to E-type software; its relevance with respect to S-type programs requires additional discussion and is not pursued here. Use of the terms admissible and inadmissible in the statement is not intended to reflect individual or collective human judgement about the results of execution within the scope of the principle, though such judgement is certainly an issue, a possible source of uncertainty. Such use addresses the issue of whether the results of execution fulfil the objective purpose of the program. The principle is stated in terms of admissibility rather than in terms of 'satisfactory' to avoid any ambiguity which might arise from the mathematical meaning of satisfy [29], which, in the Computer Science context, is used to address the relationship between a formal specification and a program which has been verified with respect to it.

5.2 Interpretations

Though the above statement about uncertainty of the admissibility of a program execution was not intended to include uncertainty about human judgement, the principle still applies if views, opinions, desires or expectations of human stakeholders are considered in determining the admissibility of E-type program executions. Since the circumstances of execution and other exogenous factors are likely to affect those views, and all are subject to change [13], the level of uncertainty increases in circumstances where they have a role.

Admissible execution is possible only when the critical assumption set reflected in the program is valid. For E-type programs, the validity of the latter cannot be proven, if only because one cannot identify all members of that set. And even those that can be identified may become invalid. Thus, if changes have occurred in the application or the domain and rectifying changes have not been made in the system, the program may display unsuccessful or unacceptable behaviour no matter how admissible executions were in the past. Uncertainty may even be present in a program verified as correct with respect to a formal specification, since the latter may cease to be an adequate abstraction of the real world as a consequence of changes in the execution domain.

Finally, consider effort-related interpretations of admissibility. Sustained admissibility of results requires that the critical assumption set is maintained valid. This requires human effort, analysis and implementation. Now in all organisations human effort is bounded. Since organisations responsible for maintaining the assumption set and the software have their own priorities, one can never be certain that enough effort will be available to achieve timely adjustments that keep critical assumptions valid.
Thus, even when invalid assumptions are identified, it is not certain that changes will be in place in time to guarantee continual admissibility of program execution. This further aspect of uncertainty, however, is a prediction about
future behaviour rather than being relevant to current execution, so it need not be regarded as part of the principle.

There may well be other sources of uncertainty. In terms of understanding of software evolution and the principle, the most important source is that illustrated by Figure 5 and qualified by the observation of the unboundedness of the number of properties of real world domains and, hence, of the assumption set. It is the only one that is not, in a sense, self-evident, at least until after having been stated. Nevertheless, the others must be explicitly stated, since uninformed purchasers of software systems need to consciously accept that future inadmissible behaviour of one sort or another cannot be prevented. This reasoning is more than sufficient to justify identification of software uncertainty as a principle. But the likelihood and impact of such uncertainty can be reduced if it is recognised that its sources relate to the inevitability of change in the real world and in the humans that populate it. This implies that satisfaction (in the conventional sense) with future computational results cannot be guaranteed, however pleasing past results have been. Though perhaps self-evident to the informed, for a society increasingly dependent on correctly behaving computers it is vital that this fact is more widely recognised, not least by politicians and other policy makers.

5.3 Relationship to Heisenberg’s Uncertainty Principle?

It is natural to enquire whether this principle is related in any way to Heisenberg’s Principle of Uncertainty. The present authors have not been able to make any such connection and see it, at most, as an analogue. The late Gordon Scarrott, however, saw it in a different light and left behind a series of papers in which he argued, inter alia, that the Software Principle was an instance of the Heisenberg Principle. His thesis is presented in several papers, including one entitled "Physical Uncertainty, Life and Practical Affairs" [31]. It appears that the authors’ copy is an early version of a paper presented by Scarrott in a Royal Society lecture.
6 Empirical Evidence for the Software Uncertainty Principle?

6.1 General Observation

Recent studies [7] have indirectly reinforced the reasoning that led to formulation of the Principle of Software Uncertainty as here stated. Unfortunately, sufficient relevant data, such as histories of fault reports relating to our collaborators’ systems, to permit even an initial meaningful estimate of the degree of satisfaction or acceptability of the individual stakeholders involved with the systems we studied, were not available. Nor was such a study part of the identified FEAST studies. It was, however, clear that continual change was present in all the systems observed, and that a portion of such changes addressed progressive invalidation of reflected assumptions. More work is needed to assess, for example, the proportion of changes due to invalidation of assumptions versus changes triggered by other reasons. The evolution process is a feedback system in which the deployment of successive versions or releases and the consequent stakeholder reaction, learning and
familiarisation or even dislike, for example, play a role. Such studies are, therefore, not straightforward, and ascertaining the ultimate trigger of particular change requests is difficult; it may not even be possible. Despite these difficulties and the fact that direct investigation is required, the FEAST studies supported the conclusions that underlie the present discussion and, in particular, that continual evolution is inevitable for real world applications. This, in turn, indicates the need to maintain the validity of the assumption set and its reflections in the software. That is believed to be an important component of software maintenance activity.

6.2 Approach to a Theory of Evolution

As indicated above, the FEAST projects have reinforced the reasoning that led to formulation of the Principle of Software Uncertainty. For practical reasons, the studies were limited to the traditional, that is, for example, non Object Oriented and non-component based paradigms. Plans are underway to extend the FEAST studies to the latter. However, subject to the preceding limitation, the overall results of the projects included identification of behavioural invariants from data on the evolution of a wide spectrum of industrially developed and evolved systems.

Previous discussions of the basic concepts and insights presented in this paper can be found in earlier publications [7], [17]. The contribution made here is to bring together, explicitly and implicitly, to the extent that this can be done in a conference paper, the observations, models, laws and empirical generalisations [5] that have been accumulated over a period of more than 30 years into a cohesive whole. It now appears that these provide observations at the immediate behavioural level for further study of software evolution by providing a wide phenomenological base. The latter is an important input for theory formation, whose eventual output would be a formal theory of software evolution. If it can be formed, such a theory would, inter alia, help clarify formulations such as that of the Principle of Software Uncertainty and put them on a more solid, coherent and explicit foundation. It would appear that the Principle of Software Uncertainty, which has been the focus of the present paper in the context of the Soft-Ware Conference on Software Uncertainty, is a candidate for a theorem in the proposed theory. An informal outline proof of the principle has been developed in [24]. It is, at best, an outline, since definitions of many of the terms used need refinement, steps in the reasoning need to be filled in and some may not conform to the style accepted by those more experienced than the present authors in theory development and its formalisation. It does, however, convey the intent of what is planned in the SETh project [25]. It had been intended to include an improved, though not yet complete, version in the present paper, but time has not permitted preparation of a proof that satisfies us. We hope, however, that our presentation at the workshop, together with the concepts presented here and in other FEAST publications, will encourage others to work in this area.
7 Practical Implications

Given the reasoning that underlies its formulation, it might be thought that the Principle of Software Uncertainty is a curio of theoretical interest but of little practical
value. Consider, for example, the statement that a real world domain has an uncountable number of properties and cannot, therefore, be totally covered by an information base of finite capacity. The resultant incompleteness represents an uncountable number of assumptions that underlie use of the system and impact the results achieved. The reader may well think, "so what?" Clearly, the overwhelming majority of these assumptions are not relevant in the context of the software in use or contemplated. Neither do they have any impact on behaviour or on the results of execution. It is, for example, quite safe to assume that one may ignore the existence of black holes in outer space when developing the vast majority of programs. It will certainly not lead to problems unless one is working in the area of cosmology or, perhaps, astronomy. On the other hand, after a painful search for the cause of errors during the commissioning of a particle accelerator, a tacit assumption that one may ignore the influence of the moon on earth-bound problems was shown to be wrong. It was discovered that, as a result of the increased size of the accelerator, the gravitational pull of the moon was the basic source of experimental error [4]. Many more examples (see, for example, [28] for a discussion of the Ariane 501 destruction) could be cited of essentially software failures due to the invalidation of reflected assumptions as a result of changes external to the software.

The above should convince the reader that degradation of the quality, precision or completeness of the assumption set reflected by a program, its specification and other documents represents a major societal threat as the ever wider, more penetrating and integrated use of such systems spreads. There is room here for serious methodological and process research, development and improvement. At least one exercise has been undertaken to show that the observations, inferences and conclusions achieved in the FEAST studies, as exemplified by the examples in the present paper, are more than of just theoretical interest. At the request of the FEAST collaborators a report was prepared to present some of the direct practical measures suggested by the FEAST results. That report has been, or is shortly to be, published [26].
There follows a list of some of the practical recommendations that address, directly or indirectly, the implications of the Principle:
(a) when developing a computer application and associated systems, estimate and document the likelihood of change in the various areas of the application domains and their spread through the system, to simplify subsequent detection of assumptions that may have become invalid as a result of changes;
(b) seek to capture, by all means available, assumptions made in the course of program development or change;
(c) store the appropriate information in a structured form, related possibly to the likelihood of change as in (a), to facilitate detection in periodic reviews of any assumptions that have become invalid (a minimal sketch of such an assumptions register is given at the end of this section);
(d) assess the likelihood or expectation of change in the various categories of catalogued assumptions, as reflected in the database structure, to facilitate such review;
(e) review the assumptions database by categories, as identified in (c) and as reflected in the database structure, at intervals guided by the expectation or likelihood of change or as triggered by events;
(f) develop and provide methods and tools to facilitate all of the above;
(g) when possible, separate validation and implementation teams to improve questioning and control of assumptions;
(h) provide for ready access by the evolution teams to all appropriate domain specialists with in-depth knowledge and understanding of the application domain.

A more general consequence of the Principle is that, just as computer users, in the broadest sense of the term, must learn to treat the results of computation with care, so must software users. It must never be assumed that information is correct simply because its source is a computer. This realisation calls for careful thought and a general educational process in government, industry and the educational system that maintains faith in the computer but, at the same time, ensures adequate care in how the results of their use are managed. It should not be beyond the wit of society to achieve this.
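As a purely illustrative aid to recommendations (a)-(e) above, the following minimal Python sketch shows one possible shape for a structured assumptions register with change-likelihood categories and review triggers. It is not taken from the FEAST reports; all names, fields and intervals are hypothetical.

```python
# Hypothetical sketch of a structured assumptions register (recommendations (a)-(e)).
# Nothing here is prescribed by the FEAST work; it only illustrates recording
# assumptions with a likelihood-of-change category and a review interval.
from dataclasses import dataclass, field
from datetime import date, timedelta

REVIEW_INTERVALS = {              # review frequency guided by expected rate of change
    "high": timedelta(days=30),
    "medium": timedelta(days=180),
    "low": timedelta(days=365),
}

@dataclass
class Assumption:
    identifier: str
    statement: str                        # the assumption as currently understood
    domain_area: str                      # part of the application domain it concerns
    change_likelihood: str                # "high", "medium" or "low"
    last_validated: date
    still_valid: bool = True
    affected_components: list = field(default_factory=list)

    def review_due(self, today: date) -> bool:
        """An assumption is due for review once its interval has elapsed."""
        return today - self.last_validated >= REVIEW_INTERVALS[self.change_likelihood]

# Example: a domain assumption whose validity should be rechecked frequently.
a1 = Assumption("A-001", "Exchange rates are updated at most once per day",
                "pricing", "high", date(2002, 1, 10), affected_components=["billing"])
print(a1.review_due(date(2002, 4, 8)))    # True -> schedule a validation review
```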
References – * indicates reprint as a chapter in Lehman and Belady 1985

1. Banach R and Poppleton MR, Retrenchment, Refinement and Simulation, in J.P. Bowen, S.E. Dunne, A. Galloway and S. King (eds.), ZB2000: Formal Specification and Development in Z and B, Springer, 2000, 525 pp.
2.* Belady LA and Lehman MM, An Introduction to Program Growth Dynamics, in W. Freiburger (ed.), Statistical Computer Performance Evaluation, Academic Press, New York, pp. 503 - 511
3. Brooks FP, No Silver Bullet - Essence and Accidents of Software Engineering, Information Processing 86, Proc. IFIP Congress 1986, Dublin, Sept. 1 - 5, Elsevier Science Publishers (BV), North Holland, pp. 1069 - 1076
4. CERN, The Earth Breathes on LEP and LHC, CERN Bulletin 09/98, 23 February 1998, http://bulletin.cern.ch/9809/art1/Text_E.html
5. Coleman JS, Introduction to Mathematical Sociology, The Free Press of Glencoe, Collier-Macmillan Limited, London, 1964, 554 pp.
6. Dijkstra EW, The Humble Programmer, ACM Turing Award Lecture, CACM, v. 15, n. 10, Oct. 1972, pp. 859 - 866
7. FEAST Projects web site: http://www.doc.ic.ac.uk/~mml/feast/ – includes a list of Project FEAST and the authors' papers, with PDF versions of those recent papers not restricted by copyright transfers
8. Finkelstein A, Gabbay D, Hunter A, Kramer J and Nuseibeh B, Inconsistency Handling in Multi-Perspective Specifications, IEEE Trans. on Softw. Eng., v. 20, n. 8, Aug. 1994, pp. 569 - 578
9. Hoare CAR, An Axiomatic Basis for Computer Programming, CACM, v. 12, n. 10, Oct. 1969, pp. 576 - 583
10. id., Proof of a Program: FIND, CACM, v. 14, n. 1, Jan. 1971
11. Van Lamsweerde A, Formal Specification: a Roadmap, in A. Finkelstein (ed.), The Future of Software Engineering, 22nd ICSE, Limerick, Ireland, 2000, ACM Order N. 592000-1, pp. 149 - 159
12.* Lehman MM, Programs, Cities, Students—Limits to Growth, Imp. Col. Inaug. Lect. Ser., v. 9, 1970 - 1974, pp. 211 - 229; also in Gries, 1978
13.* id, Human Thought and Action as an Ingredient of System Behaviour, in R. Duncan and M. Weston-Smith (eds.), The Encyclopædia of Ignorance, Pergamon Press, London, 1977, pp. 347 - 354
14.* id, Laws of Program Evolution - Rules and Tools for Programming Management, Proc. Infotech State of the Art Conference, Why Software Projects Fail, April 9 - 11, 1978, pp. 1V1 - 1V25
15.* id, Program Life Cycles and Laws of Software Evolution, Proc. IEEE, Spec. Iss. on Softw. Eng., Sept. 1980, pp. 1060 - 1076
16. id, A Further Model of Coherent Programming Process, Proc. Softw. Proc. Worksh., Egham, Surrey, 6 - 8 Feb. 1984, IEEE Cat. no. 84 CH 2044-6, pp. 27 - 35
17. Lehman MM and Belady LA, Program Evolution—Processes of Software Change, Academic Press, London, 1985
18. Lehman MM, Uncertainty in Computer Application and its Control Through the Engineering of Software, J. of Software Maintenance: Research and Practice, v. 1, n. 1, Sept. 1989, pp. 3 - 27
19. id, Software Engineering as the Control of Uncertainty in Computer Application, SEL Software Engineering Workshop, Goddard Space Centre, MD, 29 Nov. 1989, publ. 1990
20. id, Uncertainty in Computer Application, CACM, v. 33, n. 5, May 1990, pp. 584 - 586
21. id, Uncertainty in Computer Applications is Certain - Software Engineering as a Control, Proc. CompEuro 90, Int. Conf. on Computer Systems and Software Engineering, Tel Aviv, 7 - 9 May 1990, IEEE Comp. Soc. Press, n. 2041, pp. 468 - 474
22. id, Feedback in the Software Evolution Process, Keynote Address, CSR Eleventh Annual Workshop on Software Evolution: Models and Metrics, Dublin, 7 - 9 Sept. 1994; Workshop Proc., Information and Software Technology, sp. iss. on Software Maintenance, v. 38, n. 11, Elsevier, 1996, pp. 681 - 686
23. Lehman MM and Ramil JF, Software System Maintenance and Evolution in an Era of Reuse, COTS and Component Based Systems, Joint Keynote Lecture, Int. Conference on Software Maintenance and Int. Workshop on Empirical Studies of Software Maintenance WESS 99, Oxford, 3 Sept. 1999
24. id, Towards a Theory of Software Evolution - And Its Practical Impact, invited talk, Proc. ISPSE 2000, Int. Symposium on the Principles of Software Evolution, Kanazawa, Japan, Nov. 1 - 2, 2000, IEEE CS Press, pp. 2 - 11
25. Lehman MM, SETh – Towards a Theory of Software Evolution, EPSRC Proposal, Case for Support Part 2, Dept. of Comp., ICSTM, 5 Jul. 2001
26. Lehman MM and Ramil JF, Rules and Tools of Software Evolution Planning, Management and Control, Annals of Software Engineering, Spec. Iss. on Softw. Management, v. 11, issue 1, 2001, pp. 15 - 44
27. Nuseibeh B, Kramer J and Finkelstein A, A Framework for Expressing the Relationships Between Multiple Views in Requirements Specification, IEEE Trans. on Software Engineering, v. 20, n. 10, Oct. 1994, pp. 760 - 773
28. Nuseibeh B, Ariane 5: Who Dunnit?, IEEE Software, May/June 1997, pp. 15 - 16
29. The Compact Oxford English Dictionary, 2nd, Micrographically Reduced Edition, Oxford Univ. Press, 1989
30. Pfleeger S, Software Engineering – The Production of Quality Software, Macmillan Pub. Co., 1987
31. Scarrott G, Copies of various relevant papers, published and unpublished, including the one referenced, can be obtained from one of the authors (mml) of this paper
32. Simon HA, The Sciences of the Artificial, MIT Press, Cambridge, MA, 1969; 2nd ed., 1981
33. Turski WM, Specification as a Theory with Models in the Computer World and in the Real World, in P. Henderson (ed.), System Design, Infotech State of the Art Rep., se. 9, n. 6, 1981, pp. 363 - 377
34. Turski WM and Maibaum TSE, The Specification of Computer Programs, Addison-Wesley, Wokingham, 1987
35. Turski WM, An Essay on Software Engineering at the Turn of the Century, in T. Maibaum (ed.), Fundamental Approaches to Software Engineering, Proc. Third Int. Conf. FASE 2000, March/April 2000, LNCS 1783, Springer-Verlag, Berlin, pp. 1 - 20
36. Wirth N, Program Development by Stepwise Refinement, CACM, v. 14, n. 4, Apr. 1971, pp. 221 - 227
Temporal Probabilistic Concepts from Heterogeneous Data Sequences

Sally McClean, Bryan Scotney, and Fiona Palmer

School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland
{si.mcclean, bw.scotney}@ulst.ac.uk; [email protected]
Abstract. We consider the problem of characterisation of sequences of heterogeneous symbolic data that arise from a common underlying temporal pattern. The data, which are subject to imprecision and uncertainty, are heterogeneous with respect to classification schemes, where the class values differ between sequences. However, because the sequences relate to the same underlying concept, the mappings between values, which are not known ab initio, may be learned. Such mappings relate local ontologies, in the form of classification schemes, to a global ontology (the underlying pattern). On the basis of these mappings we use maximum likelihood techniques to handle uncertainty in the data and learn local probabilistic concepts represented by individual temporal instances of the sequences. These local concepts are then combined, thus enabling us to learn the overall temporal probabilistic concept that describes the underlying pattern. Such an approach provides an intuitive way of describing the temporal pattern while allowing us to take account of inherent uncertainty using probabilistic semantics.
1 Background

It is frequently the case that data mining is carried out in an environment that contains noisy and missing data, and the provision of tools to handle such imperfections in data has been identified as a challenging area for knowledge discovery in databases. Generalised databases have been proposed to provide intelligent ways of storing and retrieving data. Frequently, data are imprecise, i.e. we are uncertain about the specific value of an attribute and know only that it takes a value that is a member of a set of possible values. Such data have been discussed previously as a basis of attribute-oriented induction for data mining [12, 13]. This approach has been shown to provide a powerful methodology for the extraction of different kinds of patterns from relational databases. It is therefore important that appropriate functionality is provided for database systems to handle such information.

A database model that is based on partial values [3, 4, 28] has been proposed to handle such imprecise data. Partial values may be thought of as a generalisation of null values, where rather than not knowing anything about a particular attribute value, as is the case for null values, we may be more specific and identify the attribute value
as belonging to a set of possible values. A partial value is therefore a set such that exactly one of the values in the set is the true value. Most previous work has concentrated on providing functionality that extends relational algebra with a view to executing traditional queries on uncertain or imprecise data. However, for such imperfect data we often require aggregation operators that provide information on patterns in the data. Thus, while traditional query processing is tuple-specific, where we need to extract individual tuples of interest, processing of uncertain data is often attribute-driven, where we need to use aggregation operators to discover properties of attributes of interest. Thus we might want to aggregate over individual tuples to provide summaries which describe relationships between attributes. The derivation of such aggregates from imprecise data is a difficult task for which, in our approach, we rely on the EM algorithm [5] for aggregation of the partial value model. Such a facility is an important requirement in providing a database with the capability to perform the operations necessary for knowledge discovery in an imprecise and uncertain environment.

In this paper we are concerned, in particular, with identifying temporal probabilistic concepts from heterogeneous data sequences. Such heterogeneity is common in distributed databases, which typically have developed independently. Here, for a common concept we may have heterogeneous classification schemes. Local ontologies, in the form of such classification schemes, may be mapped onto a global ontology. The resolution of such conflicts remains an important problem for database integration in general [16, 25, 26] and for database clustering in particular [21]. More recently, learning database schema for distributed data has become an active research topic in the database literature [9].

A problem area of significant current research interest in which heterogeneous data sequences occur is that of gene expression data, where each sequence corresponds to a different gene [7]. Previous work, e.g. [23], has clustered discretised gene expression sequences using mutual entropy. However, such an approach is unable to capture the full temporal semantics of dynamic sequences [20]. Our aim here is to develop a method that captures the temporal semantics of sequences via temporal probabilistic concepts. In the context of gene expression data this approach therefore provides a means of associating a set of genes, via their expression sequences, with an underlying temporal concept along with its accompanying dynamic behaviour. For example, we may associate a gene sequence cluster with an underlying growth process that develops through a number of stages that are paralleled by the associated genes. We return to the context of gene expression data in the illustrative example in Section 5.

The general solution that we propose involves several distinct tasks. We assume that the data comprise heterogeneous sequences that have an underlying similar temporal pattern. Such data may have been produced by the use of a prior clustering algorithm, e.g. using mutual entropy. Since we are concerned with temporally clustering heterogeneous sequences, we must first determine the mappings between the states of each sequence in a cluster and the global concept; for this we compute the possible mappings.
Then, on the basis of these mappings, we use maximum likelihood techniques to learn the probabilistic description of local probabilistic concepts represented by individual temporal instances of the expression sequences. This stage is followed by one in which we learn the global temporal concept. Finally,
we use the concept description to determine the most probable pathway for each concept. Such an approach has a number of advantages: it provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data (such segmental models have considerable potential for sequence data); it provides a way of mapping heterogeneous sequences; it allows us to take account of natural variability via probabilistic semantics; it allows sequences to be characterised in a temporal probabilistic concept model; and concepts may then be matched with known processes in the data environment.

We build on our previous work on integration [19, 25, 26] and clustering [21] of multinomial data that are heterogeneous with respect to classification schemes. In our previous papers we have assumed that the schema mappings are made available by the data providers. In this paper the novelty partly resides in the fact that we must now learn the schema mappings as well as the underlying concepts. Such schema mapping problems for heterogeneous data are becoming increasingly important as more databases become available on the Internet, providing opportunities for knowledge discovery from open data sources. An additional novel aspect of this paper is the learning of temporal probabilistic concepts (TPCs) from such sequences.
2 The Problem

In Table 1 we present four such sequences which have been identified, on the basis of mutual entropy clustering, as having similar temporal semantics. We note that the codes (0, 1, or 2) should be regarded as symbolic rather than numerical data, and we re-label them in Table 2 to emphasise this.

Table 1. Raw sequence data

time          t1  t2  t3  t4  t5  t6  t7  t8  t9
Sequence 1     0   0   1   1   1   1   1   1   0
Sequence 2     0   0   0   1   1   1   1   2   2
Sequence 3     0   0   0   2   2   2   2   1   1
Sequence 4     0   0   0   2   2   1   1   1   1

Table 2. Re-labelled sequence data

Sequence 1     A   A   B   B   B   B   B   B   A
Sequence 2     C   C   C   D   D   D   D   E   E
Sequence 3     F   F   F   H   H   H   H   G   G
Sequence 4     I   I   I   K   K   J   J   J   J
Examples of possible mappings are presented in Table 3, where L, M and N are the global labels that we are learning. The global sequence (L, L, L, M, M, M, M, N, N) could therefore characterise the temporal behaviour of the global ontology underpinning these data. We note that these mappings are not exact in all cases; e.g., in Sequence 1 the ontology is coarser than the underlying global ontology, and neither Sequence 1 nor Sequence 4 exactly map onto the global sequence. This highlights the
necessity for building probabilistic semantics into the temporal concept. Although, in some circumstances, such schema mappings may be known to the domain expert, typically they are unknown and must be discovered by an algorithm.

Table 3. Schema mappings for Table 2

Sequence 1: A→L, B→M, A→N
Sequence 2: C→L, D→M, E→N
Sequence 3: F→L, G→N, H→M
Sequence 4: I→L, J→N, K→M

Definition 2.1: We define a sequence to be a set S = {s1,…,sL}, where L is the (variable) length of the sequence and the si, i = 1,…,L, are members of a set A comprising the letters of a finite alphabet. In what follows we refer to such letters as the values of the sequence.

Malvestuto [17] has discussed classification schemes that partition the values of an attribute into a number of categories. A classification P is defined to be finer than a classification Q if each category of P is a subset of a category of Q; Q is then said to be coarser than P. Such classification schemes may be specified by the database schema or may be identified by appropriate algorithms. The relationship between two classification schemes is described by a correspondence graph [17, 18], where nodes represent classes and arcs indicate that associated classes overlap. For heterogeneous distributed data it is frequently the case that there is a shared ontology that specifies how the local semantics correspond to the global meaning of the data; these ontologies are encapsulated in the classification schemes. The mappings between the heterogeneous local and global schema are then described by a correspondence graph represented by a correspondence table. In our case we envisage data with heterogeneous schema that may have arisen because either the sequences represent different variables that are related through a common latent variable, or the data may be physically distributed and related through a common ontology. We define a correspondence table for a set of sequences to be a representation of the schema mappings between the sequence and hidden variable ontologies. The correspondence table for Table 3 is presented in Table 4; the symbolic values in each sequence are numbered alphabetically. It is this that we must learn in order to determine the mappings between heterogeneous sequences.

Table 4. The correspondence table for the sequence data in Table 1

Global ontology   1  2  3
Sequence 1        1  2  3
Sequence 2        1  2  3
Sequence 3        1  3  2
Sequence 4        1  3  2
3 Clustering Heterogeneous Sequence Data

The general solution that we propose involves several distinct tasks. We assume that the data comprise heterogeneous sequences that have an underlying similar temporal
pattern. Since we are concerned with clustering heterogeneous sequences, the first step is to determine the mappings between the values of each sequence in a cluster and the global ontology; for this we compute the possible mappings. We are trying to find mappings between the heterogeneous sequences in order to identify homogeneous clusters; this involves identification of where the symbols in the respective (sequence) alphabets co-occur. Finding these schema mappings involves searching over the possible set of mappings. Such a search may be carried out using a heuristic approach, for example a genetic algorithm, to minimise the divergence between the mapped sequences. In order to restrict the search space, we may limit the types of mapping that are permissible. For example, we may allow only order-preserving mappings; the fitness function may also be penalised to prohibit trivial mappings, e.g. where every symbol in the sequence is mapped onto the same symbol of the global ontology.

The schema mappings from each local ontology to the global ontology envisaged in this paper may serve one, or both, of the following two functions: re-labelling the symbols of a local ontology to the symbols of the global ontology; and changing the granularity, because the granularity of a local ontology is coarser than that of the global ontology. Each value of the local ontology is mapped to a set of values of the global ontology; these may be singleton sets having one element, or they may be sets of more than one element, referred to as partial values. Partial values are defined formally in Section 4.

In this paper we present a simple algorithm in which each local ontology value is mapped to the global ontology value to which it most frequently corresponds. Where the most frequently corresponding global ontology value is not unique, the local ontology value is mapped to the set of most frequently corresponding values, i.e. a partial value. This mapping may be generalised by using fuzzy logic: the most frequently corresponding global ontology value may be considered to be a fuzzy concept, resulting in the use of partial values where a local ontology value maps to a set of global ontology values to each of which it frequently corresponds.

Since the search space for optimal mappings is potentially very large, we propose an ad hoc approach that can be used to initialise a heuristic hill-climbing method such as a genetic algorithm. Our objective is to minimise the distance between local sequences and the global sequence once the mapping has been carried out. However, since the global sequence is a priori unknown, we propose to approximate this function by the distance between mapped sequences. Our initialisation method then finds a mapping, as summarised in Figure 1. If we wish to provide a number of solutions, say to form the initial mating pool for a genetic algorithm, we can choose a number of different sequences to act as proxies for the global ontology in the first step of the initialisation method.
Choose one of the sequences whose number of symbols is maximal (S* say); these symbols act as a proxy for values of the global ontology. For each remaining sequence S^i, of length L, let

    h_{ijru} = 1 if (s*_j = u and s^i_j = r), and h_{ijru} = 0 otherwise, for all i and j = 1,…,L.

Here s^i_j, the j'th value in the local sequence, is a value in the alphabet of S^i; s*_j, the corresponding value in the global ontology, is a value in the alphabet of S*.

Then, for all i and r, compute h_{iru} = Σ_{j=1}^{L} h_{ijru}, and find u_{ir} such that h_{i r u_{ir}} = max_u (h_{iru}).

In the i'th sequence, the value r is then mapped to u_{ir}. If u_{ir} is not unique, r is mapped to the partial value {u_{ir}}.

Fig. 1. Algorithm for mapping allocation
Example 3.1: We consider the data in Table 2. Here we select sequence 2 as the proxy for the global ontology, as it is at the finest granularity of any of the sequences. Using the algorithm above to determine the mappings from sequence 1 to sequence 2 gives h_AC = 2, h_AE = 1; h_BC = 1, h_BD = 4, h_BE = 1. We thus induce the mappings A→C and B→D. Similarly, from sequence 3 to sequence 2 we have h_FC = 3; h_GE = 2; h_HD = 4, and we thus induce the mappings F→C, G→E and H→D. From sequence 4 to sequence 2 we have h_IC = 3; h_JD = 2, h_JE = 2; h_KD = 2, and we thus induce the mappings I→C, J→{D,E} and K→D.
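The following is a minimal Python sketch, not taken from the paper, of the mapping-allocation initialisation of Figure 1; the function name and data layout are our own. Each local value is mapped to the proxy value with which it most frequently co-occurs, and ties are returned as partial-value sets, reproducing the mappings of Example 3.1.

```python
# Illustrative sketch of the Fig. 1 mapping allocation: each local value is mapped
# to the proxy (global) value with which it most frequently co-occurs; ties give
# a partial value (a set of equally frequent candidates).
from collections import Counter

def allocate_mapping(local_seq, proxy_seq):
    counts = {}                                   # counts[r][u] = co-occurrences of r with u
    for r, u in zip(local_seq, proxy_seq):
        counts.setdefault(r, Counter())[u] += 1
    mapping = {}
    for r, c in counts.items():
        best = max(c.values())
        winners = {u for u, n in c.items() if n == best}
        mapping[r] = winners.pop() if len(winners) == 1 else winners
    return mapping

proxy = "CCCDDDDEE"                               # Sequence 2 of Table 2 (the proxy)
print(allocate_mapping("AABBBBBBA", proxy))       # {'A': 'C', 'B': 'D'}
print(allocate_mapping("FFFHHHHGG", proxy))       # {'F': 'C', 'H': 'D', 'G': 'E'}
print(allocate_mapping("IIIKKJJJJ", proxy))       # {'I': 'C', 'K': 'D', 'J': {'D', 'E'}}
```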
4 Concept Learning

4.1 Concept Definitions

A concept is defined over a set X of instances; training examples are members x of X [24]. Given a set of training examples T of a target concept C, the learner must estimate C from T. The concepts we are concerned with here may be thought of as symbolic objects which are described in terms of discrete-valued features X_i, i = 1,…,n, where X_i has domain D_i = {v_i1,…,v_im_i}. A symbolic object is then defined in terms of feature values as: O = {[X1 = v1]; …; [Xn = vn]} [27]. A logical extension of this definition is to define the object attribute values to lie in subsets of the domain, that is, O = {[X1 ∈ S1]; …; [Xn ∈ Sn]}, where S_i ⊆ D_i for i = 1,…,n. Each set S_i represents a partial value of the set of domain values D_i [4, 28].

Definition 4.1: A partial value is determined by a set of possible attribute values of an attribute X, of which one and only one is the true value. We denote a partial value η by η = (a_r,…,a_s), corresponding to a set of h possible values {a_r,…,a_s} of the same domain, in which exactly one of these values is the true value of η. Here, h is
the cardinality of η; (a_r,…,a_s) is a subset of the domain set {a_1,…,a_k} of attribute X, and h ≤ k.

Example 4.1: Consider the features {expression level, function} with respective domains {low, medium, high} and {growth, control}. Then examples of concepts are:
C1 = {[expression level = high]; [function = growth]}
C2 = {[expression level = (low, medium)]; [function = control]}
In concept C2, {low, medium} is an example of a partial value. In this case we know that the expression level is either low or medium.

We have previously defined a partial probability distribution, which assigns probabilities to each partial value of a partition of the set of possible values [20].

Definition 4.2: A partial probability distribution is a vector of probabilities φ(η) = (p_1,…,p_r) which is associated with a partition formed by partial values η = (η_1,…,η_r) of attribute A. Here p_i is the probability that the attribute value is a member of partial value η_i, and Σ_{i=1}^{r} p_i = 1.

Example 4.2: An example of a partial probability distribution on the partition of the domain values of expression level given by ({low, medium}, {high}) is then: φ({low, medium}, {high}) = (0.99, 0.01). This distribution means that the probability of having either a low or medium expression level is 0.99, and the probability of having a high expression level is 0.01.

Probabilistic concepts have been used to extend the definition of a concept to uncertain situations where we must associate a probability with the values of each feature vector [10, 11, 27]. For example, a probabilistic concept might be:
C3 = {[expression level = high]:0.8, [expression level = medium]:0.2, [function = growth]:1.0}.
This means that the concept is characterised by having high expression level with probability 0.8, medium expression level with probability 0.2, and function growth.

We are concerned with learning concepts that encapsulate both probabilistic and temporal semantics from heterogeneous data sequences. Some previous work has addressed the related problem of identifying concept drift in response to changing contexts [6, 14, 15]. However, our current problem differs in that, rather than seeking to learn concepts which then change with time, our focus is on learning temporal concepts where the temporal semantics are an intrinsic part of the concept. This is achieved by regarding time as one of the attributes of the symbolic object that represents the concept. We regard time as being measured at discrete time points T = {t1,…,tk}. A time interval is then a subset S of T such that the elements of S are contiguous. A local probabilistic concept (LPC) may then be defined on a time interval of T.

Example 4.3: Let S = {t1, t2} be a time interval of T. Then we may have a local probabilistic concept
C4 = {[Time = S], [expression level = high]:0.8, [expression level = medium]:0.2},
That is, during time interval S there is a high expression level with probability 0.8 and a medium expression level with probability 0.2.

From these local probabilistic concepts we must then learn the global temporal concept. A global temporal concept is the union of local probabilistic concepts that relate to time intervals that form a partition of the time domain T.

Definition 4.3: A temporal probabilistic concept (TPC) is defined in terms of a time attribute with domain T = {t_1,…,t_k} and discrete-valued features X_j, where X_j has domain D_j = {v_j1,…,v_jm_j}, for j = 1,…,n. Then we define a partition of T as {T_1,…,T_q}, where T_i ⊆ T, T_i ∩ T_j = ∅ for i ≠ j, and ∪_{i=1}^{q} T_i = T. For each feature X_j, let S_ij = {S_ij1,…,S_ijr_ij} be a partition of D_j in time interval T_i, where S_iju ⊆ D_j for u = 1,…,r_ij. A local probabilistic concept for interval T_i is then defined as

    LPC_i = {T_i, S_i1 : (p_i11,…,p_i1r_i1), …, S_in : (p_in1,…,p_inr_in)}

where Σ_{s=1}^{r_ij} p_ijs = 1 for j = 1,…,n. The TPC is then defined as TPC = ∪_{i=1}^{q} LPC_i.
Example 4.4: Let Time have domain T = {t1, t2, t3} that is partitioned into two time intervals T1 = {t1, t2} and T2 = {t3}; X1 = expression level has domain D1 = {low, medium, high}. Typical local probabilistic concepts are LPC1 = {T1, D1 : (0, 0.2, 0.8)} and LPC2 = {T2, D1 : (0.9, 0.1, 0)}. The corresponding TPC is then the union LPC1 ∪ LPC2. Hence, during time interval T1 there is a high expression level with probability 0.8 and a medium expression level with probability 0.2; during time interval T2 there is a low expression level with probability 0.9 and a medium expression level with probability 0.1.

4.2 Learning Local Probabilistic Concepts

In this section we describe an algorithm for learning local probabilistic concepts which takes account of the fact that the schema mappings discussed in Section 3 may map a value in the local ontology onto a set of possible values in the global ontology. We have previously developed an approach that allows us to aggregate attribute values for any attribute with values that are expressed as a partial probability distribution [20, 22]. Such partial probability distributions correspond in our current context to local probabilistic concepts.

Notation 4.1: We consider an attribute X_j with corresponding global ontology domain D_j = {v_1,…,v_k} which has instances x_1,…,x_m. Then the value of the r'th instance of X_j is a (partial value) set given by S_r^(j), r = 1,…,m. We further define:

    q_irj = 1 if v_i ∈ S_r^(j), and q_irj = 0 otherwise, for i = 1,…,k.
Definition 4.4: The partial value aggregate of a number of partial values on attribute X_j for time interval T_i, denoted pvagg(A_ij), is defined as a vector-valued function pvagg(A_ij) = (p_ij1,…,p_ijk), where the p_ijs are computed from the iterative scheme:

    p_ijs^(n) = p_ijs^(n-1) ( Σ_{r=1}^{m} [ q_srj / Σ_{u=1}^{k} p_iju^(n-1) q_urj ] ) / m,  for s = 1,…,k.

Here p_ijs^(n) is the value of p_ijs at the n'th iteration and the p_ijs's are the probabilities associated with the respective values v_1,…,v_k of attribute X_j in time interval T_i. This formula may be regarded as an iterative scheme which at each step apportions the data to the (partial) values according to the current values of the probabilities. The initial values are taken from the uniform distribution. We can show [20, 22] that this formula produces solutions for the p_ijs which minimise the Kullback-Leibler information divergence; this is equivalent to maximising the likelihood. It is in fact an application of the EM algorithm [5, 29]. We illustrate the algorithm using the data presented in Table 1 and the mappings learned in Example 3.1.

Example 4.5: Applying the mappings learned in Example 3.1, we induce the mapped sequences presented in Table 5.

Table 5. Mapped sequence data
time          t1  t2  t3  t4  t5  t6     t7     t8     t9
Sequence 1     C   C   D   D   D   D      D      D      C
Sequence 2     C   C   C   D   D   D      D      E      E
Sequence 3     C   C   C   D   D   D      D      E      E
Sequence 4     C   C   C   D   D   {D,E}  {D,E}  {D,E}  {D,E}
Then, for example, using only the data at the eighth time point (column t8 of Table 5) we obtain:

    p_C^(n) = p_C^(n-1) ( 0 / p_C^(n-1) ) / 4
    p_D^(n) = p_D^(n-1) ( 1/p_D^(n-1) + 1/(p_D^(n-1) + p_E^(n-1)) ) / 4
    p_E^(n) = p_E^(n-1) ( 2/p_E^(n-1) + 1/(p_D^(n-1) + p_E^(n-1)) ) / 4

Iteration yields the solution p_C = 0, p_D = 1/3, p_E = 2/3. Similarly, using only the data at the ninth time point (column t9 of Table 5) we obtain:

    p_C^(n) = p_C^(n-1) ( 1/p_C^(n-1) ) / 4
    p_D^(n) = p_D^(n-1) ( 1/(p_D^(n-1) + p_E^(n-1)) ) / 4
    p_E^(n) = p_E^(n-1) ( 2/p_E^(n-1) + 1/(p_D^(n-1) + p_E^(n-1)) ) / 4

In this case, iteration yields the solution p_C = 1/4, p_D = 0, p_E = 3/4. These are both examples of local probabilistic concepts.
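As an illustrative check, the following short Python sketch (our own, not from the paper) implements the apportionment step of Definition 4.4 for a single time point and reproduces the t8 solution above; the function name and data layout are assumptions.

```python
# Minimal sketch of the partial-value aggregation (EM) iteration of Definition 4.4
# for one time point. Each observation is a partial value: a set of possible
# global-ontology values, exactly one of which is true.

def pv_aggregate(observations, domain, iterations=200):
    m = len(observations)
    p = {v: 1.0 / len(domain) for v in domain}        # uniform initial distribution
    for _ in range(iterations):
        new_p = {v: 0.0 for v in domain}
        for s in observations:
            denom = sum(p[u] for u in s)              # current mass of this partial value
            if denom > 0:
                for v in s:
                    new_p[v] += p[v] / denom          # apportion the observation to v
        p = {v: new_p[v] / m for v in domain}
    return p

# Column t8 of Table 5: D (Sequence 1), E (Sequences 2 and 3), {D,E} (Sequence 4).
column_t8 = [{'D'}, {'E'}, {'E'}, {'D', 'E'}]
print(pv_aggregate(column_t8, ['C', 'D', 'E']))
# -> approximately {'C': 0.0, 'D': 0.333, 'E': 0.667}, i.e. (p_C, p_D, p_E) = (0, 1/3, 2/3)
```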
4.3 Learning Temporal Probabilistic Concepts

Once we have learned the local probabilistic concepts, the next task is to learn the temporal probabilistic concept for the combined sequences. This is carried out using temporal clustering. The algorithm is described in Figure 2. The similarity metric for the distance between two local probabilistic concepts used here for clustering is given by d12 = ℓ1 + ℓ2 − ℓ12, where ℓ is the log-likelihood function for a local probabilistic concept, given by Definition 4.5. Here ℓ1 is the log-likelihood for LPC1, ℓ2 is the log-likelihood for LPC2, and ℓ12 is the log-likelihood for LPC1 and LPC2 combined. Then we can use a chi-squared test to decide whether LPC1 and LPC2 can be combined to form a new LPC.

Definition 4.5: The log-likelihood of the probabilistic partial value (p_1,…,p_k) in a local probabilistic concept, as defined in Section 4.1, is given by:

    ℓ = Σ_{r=1}^{m} log( Σ_{i=1}^{k} q_ir p_i ),  subject to Σ_{i=1}^{k} p_i = 1.

Here the p_i's are first found using the iterative algorithm in Section 4.2.

Input: A set of sequences that have been aligned using the schema mappings
Clustering contiguous time periods:
  Beginning at the first time point, test for similarity of contiguous local probabilistic concepts (LPCs)
  If the LPCs are similar then combine; else, compare with previous LPC clusters and combine if similar
  If the LPC is not similar to any previous cluster, then start a new cluster
Characterisation of temporal clusters:
  For each cluster: identify its local probabilistic concept
Combine optimal clusters to provide the temporal probabilistic concept (TPC)
Output: Temporal probabilistic concept (TPC)

Fig. 2. Temporal clustering for mapped heterogeneous sequences
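To make the control flow of Figure 2 concrete, here is a minimal Python sketch of the clustering loop; it is our own illustration, assuming a user-supplied similar() test (for example, the likelihood-ratio test described above) and per-time-point observation columns.

```python
# Illustrative sketch of the temporal clustering of Fig. 2. Each cluster is a list
# of time-point indices; `similar` is assumed to implement the likelihood-ratio
# (chi-squared) test between the data of two candidate clusters.

def temporal_clusters(columns, similar):
    clusters = [[0]]                              # start with the first time point
    for t in range(1, len(columns)):
        if similar(clusters[-1], [t], columns):   # contiguous LPCs: try to extend
            clusters[-1].append(t)
            continue
        for c in clusters[:-1]:                   # otherwise compare with earlier clusters
            if similar(c, [t], columns):
                c.append(t)
                break
        else:
            clusters.append([t])                  # no similar cluster: start a new one
    return clusters
```

The loop mirrors Figure 2: contiguous time points are merged first, and a time point that does not fit the current cluster is compared with earlier clusters before a new cluster is started.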
Example 4.6: We consider clustering for the LPCs in Example 4.5. Here the values for the first two time points (columns) are identical, so the distance d12 is zero and we combine LPC1 and LPC2 to form LPC12. We now must decide whether LPC12 should be combined with LPC3 or whether LPC3 is part of a new local probabilistic concept. In this case:

    ℓ12 = 8 log p_C, where p_C = 1, p_D = p_E = 0, so ℓ12 = 0,
    ℓ3 = 3 log p_C + log p_D, where p_C = 3/4, p_D = 1/4, p_E = 0, so ℓ3 = -2.249,
    ℓ123 = 11 log p_C + log p_D, where p_C = 11/12, p_D = 1/12, p_E = 0, so ℓ123 = -3.442.
The distance between LPC12 and LPC3 is then 0 − 2.249 + 3.442 = 1.193. Since twice this value is inside the chi-squared threshold with 1 degree of freedom (3.84), we decide to combine LPC12 and LPC3.
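The following few lines of Python (again our own sketch, with hypothetical function names) reproduce the numbers in Example 4.6 for the simple case where all observations are definite, so that the maximum-likelihood probabilities are plain proportions.

```python
# Sketch of the likelihood-ratio distance d12 = l1 + l2 - l12 of Section 4.3,
# applied to the counts of Example 4.6 (complete observations only).
import math

def log_likelihood(counts):
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)

def distance(counts1, counts2):
    pooled = {v: counts1.get(v, 0) + counts2.get(v, 0)
              for v in set(counts1) | set(counts2)}
    return log_likelihood(counts1) + log_likelihood(counts2) - log_likelihood(pooled)

lpc12 = {'C': 8}              # columns t1 and t2 combined: eight C's
lpc3 = {'C': 3, 'D': 1}       # column t3: three C's and one D
d = distance(lpc12, lpc3)     # 0 + (-2.249) - (-3.442) = 1.193
print(round(d, 3), 2 * d < 3.84)   # 1.193 True -> combine LPC12 and LPC3
```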
5 An Illustrative Example

We illustrate our discussion using sequences of gene expression data. These data are analysed in a number of papers, e.g. Michaels et al. [23] and D'haeseleer et al. [8], and are available at: http://stein.cshl.org/genome_informatics/expression/somogyi.txt. The data contain sequences of 112 gene expressions for rat cervical spinal cord measured at nine developmental time points (E11, E13, E15, E18, E21, P0, P7, P14, A), where E = Embryonic, P = Postnatal, and A = Adult. The continuous gene expressions were discretised by partitioning the expressions into three equally sized bins. This effects a smoothing of the time series without (hopefully) masking the underlying pattern. Associations between these gene expression time series were then identified using mutual entropy; the clusters based on this distance metric are described in detail in Michaels et al. [23]. We now use each cluster to learn the mapping that characterises it. Once we have succeeded in mapping the local sequence ontologies to a global ontology, we can derive local probabilistic concepts and a temporal probabilistic concept for each cluster.

We consider cluster 2 to illustrate the approach. In this case the gene expression sequences are presented in Table 6. We note that codes 0, 1, and 2 are local references and may have different meanings in different genes.

Table 6. Gene expression sequences for Cluster 2

Gene                   E11 E13 E15 E18 E21 P0 P7 P14  A
MgluR7 RNU06832          0   0   0   2   2   2  1   2  2
L1 S55536                0   0   1   2   2   2  1   1  1
GRa2 (Ý)                 0   0   1   2   2   2  2   2  2
GRa5 (#)                 0   0   1   2   2   2  2   2  2
GRg3 RATGABAA            0   0   1   2   2   2  2   2  1
MgluR3 RATMGLURC         0   0   1   2   2   2  1   2  2
NMDA2B RATNMDA2B         0   0   1   2   2   2  1   2  2
Statin RATPS1            0   0   1   2   2   2  2   2  2
MAP2 RATMAP2             0   0   2   2   2   2  2   2  2
Pre-GAD67 RATGAD67       0   0   2   2   2   2  2   2  2
GAT1 RATGABAT            0   0   2   2   2   2  2   2  2
NOS RRBNOS               0   0   2   2   2   2  2   2  1
GRa3 RNGABAA             0   0   2   2   2   2  1   2  2
GRg2 (#)                 0   0   2   2   2   1  1   1  2
MgluR8 MMU17252          0   0   2   2   2   2  1   1  1
TrkB RATTRKB1            0   0   2   1   1   1  1   1  1
Neno RATENONS            0   1   2   2   2   2  2   2  2
GRb3 RATGARB3            0   1   2   2   2   2  2   2  1
TrkC RATTRKCN3           0   1   2   2   2   2  2   2  2
GAP43 RATGAP43           1   1   2   2   2   2  2   2  2
NAChRd RNZCRD1           1   2   0   0   0   0  0   0  0
Keratin RNKER19          2   0   0   0   1   0  0   0  0
Ins1 RNINS1              2   0   0   0   0   0  0   0  0
GDNF RATGDNF             2   2   0   0   1   1  1   0  0
SC6 RNU19140             2   2   0   0   0   0  0   0  0
Brm (I I)                2   2   1   1   1   1  1   1  0
The mappings for the data in Table 6 were then learned using the algorithm in Figure 1 in Section 3, and these mappings are shown in Table 7. Using these mappings the sequences were transformed to the global ontology, as illustrated in Table 8. Using the iteration algorithm in Definition 4.4, partial value aggregates may be computed for each of the nine time points, and these are shown in Table 9, along with the log likelihood values computed as in Definition 4.5. The clustering algorithm described in Figure 2 is then applied, and the likelihood ratio test is used at each stage to determine the similarity of the local probabilistic concepts. The first LPCs considered are E11 and E13. The combined data for E11 and E13 give the probabilistic partial value (p0, p1, p2) = (0.961, 0, 0.039) and a corresponding log likelihood value of ℓ12 = −8.438. The distance between E11 and E13 is then measured as d12 = ℓ1 + ℓ2 − ℓ12, where ℓ1 = 0 and ℓ2 = −6.969, giving 2 × d12 = 2.937 < 5.99 (the critical value for the chi-squared test with 2 degrees of freedom at a 95% significance level). Hence E11 and E13 are found to be not significantly different from each other, and are therefore combined to form a new cluster.

Table 7. The mappings to the global ontology for the sequences in Table 6.
Global Ontology
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 * 2 2 2 2 2 2
Local Ontology 1 1 2 0 0 {0,2} {0,1} {0,1} 0 * * * 2 1 2 2 2 0 {0,2} 0 0 0 2 * 2 * 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 0 0 0 0 0 0
At the next stage, E15 is compared with this new cluster. In this case we must then compute the probabilistic partial value for the combined data from E11, E13 and E15, giving (p0, p1, p2) = (0.743, 0, 0.257) and a corresponding log likelihood value of ℓ12 = −43.808. The distance between the cluster {E11, E13} and E15 is then d12 = ℓ1 + ℓ2 − ℓ12 = −8.438 − 14.824 + 43.808, giving 2 × d12 = 41.093 > 5.99. Hence E15 is found to be significantly different from the cluster {E11, E13} and is not combined. At the next
stage, therefore, E18 is compared with E15, and found to be significantly different. E18 is also significantly different from the {E11, E13} cluster, and hence forms the start of a new cluster. Subsequently, {E18, E21, P0, P14, A} and {P7} are found to be clusters, along with {E11, E13} and {E15}. These clusters are then characterised by the local probabilistic concepts {{E11, E13}, (0.961, 0, 0.039)}, {E15, (0.28, 0, 0.72)}, {{E18, E21, P0, P14, A}, (0, 0, 1)} and {P7, (0, 0.154, 0.846)} respectively.
6 Summary and Further Work We have described a methodology for describing and learning temporal concepts from heterogeneous sequences that have the same underlying temporal pattern. The data are heterogeneous with respect to classification schemes where the class values differ between sequences. However, because the sequences relate to the same underlying concept, the mappings between values may be learned. On the basis of these mappings we use statistical learning methods to describe the local probabilistic concepts. A temporal probabilistic concept that describes the underlying pattern is then learned. This concept may be matched with known genetic processes and pathways. Table 8. The sequences in Table 6 mapped using the transformations in Table 7 Gene mGluR7_RNU06832 L1_S55536 GRa2_(Ý) GRa5_(#) GRg3_RATGABAA mGluR3_RATMGLURC NMDA2B_RATNMDAB statin_RATPS1 MAP2_RATMAP2 pre-GAD67_RATGAD67 GAT1_RATGABAT NOS_RRBNOS GRa3_RNGABAA GRg2_(#) mGluR8_MMU17252 trkB_RATTRKB1 neno_RATENONS GRb3_RATGARB3 trkC_RATTRKCN3 GAP43_RATGAP43 nAChRd_RNZCRD1 keratin_RNKER19 Ins1_RNINS1 GDNF_RATGDNF SC6_RNU19140 Brm_(I_I)
E11 E13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 {0,2} 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0 0
E15 E18 0 2 2 2 0 2 0 2 {0,2} 2 {0,1} 2 {0,1} 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
E21 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
P0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
P7 P14 A 1 2 2 2 2 2 2 2 2 2 2 2 2 2 {0,2} {0,1} 2 2 {0,1} 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 {0,2} 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Table 9. Partial probability aggregates and corresponding log likelihood values

                  E11     E13      E15       E18   E21   P0      P7       P14   A
p0                1       0.92     0.28      0     0     0       0        0     0
p1                0       0        0         0     0     0       0.154    0     0
p2                0       0.08     0.72      1     1     1       0.846    1     1
Log likelihood    0      -6.969   -14.824    0     0     0      -11.162   0     0
The approach is illustrated using data of gene expression sequences. Although this is a modest dataset, it serves to explain our approach and demonstrates the necessity of considering the possibility of a temporal concept for such problems, where there is an underlying temporal process involving staged development. For the moment we have not considered performance issues since the problem we have identified is both novel and complex. Our focus, therefore, has been on defining terminology and providing a preliminary methodology. In addition to addressing such performance issues, future work will also investigate the related problem of associating clusters with explanatory data; for example our gene expression sequences could be related to the growth process. Acknowledgement. This work was partially supported by the MISSION (Multi-agent Integration of Shared Statistical Information over the (inter)Net) project, IST project number 1999-10655, which is part of Eurostat’s EPROS initiative funded by the European Commission.
References 1. Bassett, D.E. Jr., Eisen, M.B., Boguski, M.S.: Gene Expression Informatics - it’s All in Your Mine. Nature genetics supplement 21 (1999) 51-55 2. Cadez, I., Gaffney, S., Smyth, P.: A General Probabilistic Framework for Clustering Individuals. In: Proc. ACM SIGKDD (2000) 140-149 3. Chen, A.L.P., Tseng, F.S.C.: Evaluating Aggregate Operations over Imprecise Data. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 273-284 4. Demichiel, L.G.: Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering 4 (1989) 485-493 5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion). J. R. Statist. Soc. B 39 (1977) 1-38 6. Devaney, M., Ram, A.: Dynamically Adjusting Concepts to Accommodate Changing Contexts. In: Proc. ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy (1996) 7. D’haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R.: Mining the Gene Expression Matrix: Inferring Gene Relationships from Large-Scale Gene Expression Data. In: Paton, R.C., Holcombe, M. (eds.): Information Processing in Cells and Tissues. Plenum Publishing (1998) 203-323 8. D’haeseleer, P., Liang S., Somogyi, R.: Gene Expression Data Analysis and Modelling. In: Tutorial at the Pacific Symposium on Biocomputing (1999) 9. Doan, A.H., Domingues, P., Levy, A.: Learning Mappings between Data Schemes. In: Proc. AAAI Workshop on Learning Statistical Models from Relational Data, AAAI '00, Austin, Texas, Technical Report WS00006 (2000) 1-6
10. Fisher, D.H.: Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning 2 (1987) 139-172 11. Fisher, D.: Iterative Optimisation and Simplification of Hierarchical Clusterings. Journal of AI Research 4 (1996) 147-179 12. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., Zaiane O.: DBMiner: A System for Mining Knowledge in Large Relational Databases. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.): Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon (1996) 250-255 13. Han, J., Fu, Y.: Exploration of the Power of Attribute-oriented Induction in Data Mining. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusay, R. (eds.): Advances in Knowledge Discovery, AAAI Press / The MIT Press (1996) 399-421 14. Harries, M., Horn K., Sammut, C.: Extracting Hidden Context. Machine Learning 32(2) (1998) 101-126 15. Harries, M., Horn, K.: Learning Stable Concepts in a Changing World. In: Antoniou, G., Ghose, A., Truczszinski, M. (eds.): Learning and Reasoning with Complex Representations. Lecture Notes in AI, Vol. 1359. Springer-Verlag (1998) 106-122 16. Lim, E.-P., J. Srivastava, Shekher, S.: An Evidential Reasoning Approach to Attribute Value Conflict Resolution in Database Management. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 707-723 17. Malvestuto, F.M.: The Derivation Problem for Summary Data. In: Proc. ACM-SIGMOD Conference on Management of Data, New York, ACM (1998) 87-96 18. Malvestuto, F.M.: A Universal-Scheme Approach to Statistical Databases containing Homogeneous Summary Tables. ACM Transactions on Database Systems 18 (1993) 678708 19. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Aggregation of Imprecise and Uncertain Information for Knowledge Discovery in Databases. In: Proc. 4th International Conference on Knowledge Discovery in Databases (KDD'98) (1998) 269-273 20. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Incorporating Domain Knowledge into Attribute-oriented Data Mining. Journal of Intelligent Systems 6 (2000) 535-548 21. McClean, S.I., Scotney, B.W., Greer, K.R.C.: Clustering Heterogenous Distributed Databases. In: Kargupta, H., Ghosh, J., Kumar, V. Obradovic, Z. (eds.): Proc. KDD Workshop on Knowledge Discovery from Parallel and Distributed Databases (2000) 20-29 22. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Aggregation of Imprecise and Uncertain Information in Databases. Accepted, IEEE Trans. Knowledge and Data Engineering (2001) 23. Michaels, G.S., Carr, D.B., Askenazi, M., Fuhrman S., Wen, X., Somogyi, R.: Cluster Analysis and Data Visualisation of Large-Scale Gene Expression Data. Pacific Symposium on Biocomputing 3 (1998) 42-53 24. Mitchell, T.: Machine Learning. New York: McGraw Hill (1997) 25. Scotney, B.W., McClean, S.I.: Efficient Knowledge Discovery through the Integration of Heterogeneous Data. Information and Software Technology (Special Issue on Knowledge Discovery and Data Mining) 41 (1999) 569-578 26. Scotney B.W., McClean, S.I, Rodgers, M.C.: Optimal and Efficient Integration of Heterogeneous Summary Tables in a Distributed Database. Data and Knowledge Engineering 29 (1999) 337-350 27. Talavera, L., Béjar, J.: Generality-based Conceptual Clustering with Probabilistic Concepts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2) (2001) 196-206 28. Tseng, F.S.C., Chen, A.L.P., Yang, W-P.: Answering Heterogeneous Database Queries with Degrees of Uncertainty. 
Distributed and Parallel Databases 1 (1993) 281-302 29. Vardi, Y., Lee, D.: From Image Deblurring to Optimal Investments: Maximum Likelihood Solutions for Positive Linear Inverse Problems (with discussion). J. R. Statist. Soc. B (1993) 569-612
Handling Uncertainty in a Medical Study of Dietary Intake during Pregnancy Adele Marshall1, David Bell2, and Roy Sterritt2 1
Department of Applied Mathematics and Theoretical Physics Queen’s University of Belfast, Belfast, BT9 5AH, UK [email protected] 2 School of Information and Software Engineering, Faculty of Informatics University of Ulster, Jordanstown Campus, Newtownabbey, BT37 0QB {da.bell, r.sterritt} @ulst.ac.uk
Abstract. This paper is concerned with handling uncertainty as part of the analysis of data from a medical study. The study is investigating connections between the birth weight of babies and the dietary intake of their mothers. Bayesian belief networks were used in the analysis. Their perceived benefits include (i) an ability to represent the evidence emerging from the evolving study, dealing effectively with the inherent uncertainty involved; (ii) providing a way of representing evidence graphically to facilitate analysis and communication with clinicians; (iii) helping in the exploration of the data to reveal undiscovered knowledge; and (iv) providing a means of developing an expert system application.
1 Introduction
This paper is concerned with handling uncertainty as part of the analysis of data from a medical study. The study is recording and analysing details of pregnant women at 28 weeks gestation to examine the relationship between the dietary intake of a mother and the birth weight of her baby. The study is an extension of a major new medical research programme, HAPO (Hyperglycaemia and Adverse Pregnancy Outcome), currently underway at the Royal Victoria Hospital, Belfast. This paper describes the development of a decision support system (expert system) that will deal with (i) uncertainty in the dietary and associated data; (ii) preliminary analysis of the data; and (iii) collection of the data in the medical study itself.

1.1 Background and Previous Work
Consumption of a nutritionally adequate diet is of particular importance during pregnancy and has a considerable influence on birth outcome. Previous research has shown that sudden and severe restriction of energy and protein intake during pregnancy reduces birth weight by as much as 300g [1]. However, studies have also shown that an adequate energy and protein consumption may not be sufficient without the accompaniment of vitamins, minerals [2] and even fatty acids [3]. There is substantial evidence in the literature to suggest a strong association between low birth weight and increased incidence of neonatal mortality [4] and higher neonatal morbidity [5]. Impairment of foetal growth and low birth weight due to inadequate maternal and foetal nutrition can also increase the risk of chronic diseases in adulthood [6]. Maternal diet not only influences the immediate outcome of pregnancy but also the longer-term health of her offspring [7], possibly influencing susceptibility to diseases such as ischaemic heart disease, hypertension and Type 2 diabetes. Additionally, there is evidence in the literature to suggest a relationship between high birth weight, caused by excessive foetal growth (macrosomia), and potential risks such as difficulties in childbirth and intensive postnatal care. These problems are most commonly associated with diabetic pregnancy. This has led to numerous attempts to relate complications in pregnancy and birth outcome to the level of maternal glycaemia [8], [9]. However, these studies have not been able to identify a possible threshold level of glycaemia above which there is a high risk of macrosomia.

1.2 The HAPO Medical Study
The US National Institute of Health (NIH) has recently approved an extensive international HAPO (Hyperglycaemia and Adverse Pregnancy Outcome) study in 16 key centres around the world. The Royal Victoria Hospital is one of these centres. Each centre will recruit 1500 pregnant women in the study over a two year period. Research is primarily concerned with the following hypothesis used to investigate the association between maternal blood glucose control and pregnancy outcome. Hypothesis: Hyperglycaemia in pregnancy is associated with an increased risk of adverse maternal, foetal and neonatal outcomes. In addition to the specified HAPO variables, the Royal Victoria Hospital will be collecting information on dietary intake of the pregnant women in the study. Information gathered will include:
– anthropometric measures
– socio-economics
– family history
– metabolic status
– food frequency and diet
– measures of pregnancy outcomes
The details are recorded at 28 weeks gestation during an interview where a food frequency questionnaire is completed. Additional information is gathered about the pregnancy outcome and details recorded on the newborn baby. The calendar of events for the HAPO study is shown in Figure 1. It shows the initial recruitment and data collection using the food frequency questionnaire followed
by the pregnancy outcome variables and home visit follow-up. A follow-up is also scheduled to record information on the general health of the baby when 2-3 years old.
Fig. 1. HAPO event calendar.
The food frequency questionnaire was designed to gather relevant information on the dietary intake of the pregnant mothers. It collects information on how frequently the women eat foods in each of the various food groups. There is a high level of complexity and uncertainty involved in determining the nutritional value and content of various different foods, in addition to capturing the effect of variations between manufacturers and products. After consultation with experts, it was decided that the food frequency questionnaire should deal with this problem by focusing on groups of food types such as cereals, meat/fish and so on, as summarised in Figure 2.
Fig. 2. Food groups analysed in the HAPO study.
The food groups may be associated with different nutritional levels. For example, meat and fish will have a strong association with the nutritional influence of proteins, whereas cereals will be strongly associated with folic acid, a nutrient that plays an important role in the prevention of neural tube defects. There are five parts to the study, evaluating:
1. Dietary intake in pregnancy compared to non-pregnant women of childbearing age.
2. Links between diet and lifestyle/socio-economic factors.
3. Links between diet and maternal/foetal glucose.
4. Links between dietary intake and pregnancy outcome.
5. Possible follow-up of links between dietary intake in pregnancy and both maternal and baby outcomes at two years.
Part 1 is a control to assess whether there is a significant difference in the diet of pregnant women in relation to other women. Parts 2 to 5 can be supported with the introduction and development of a decision support (expert) system for assessing the dietary variables for pregnant women.

1.3 Expert Systems
An expert system is a computer program that is designed to solve problems at a level comparable to that of a human expert in a given domain [10]. Rule-based expert systems have achieved particular success in medical applications [11], largely because the decisions that are made can be traced and understood by domain experts. However, their development has suffered due to the difficulty of gathering relevant information, an inability to handle uncertainty adequately, and the maintenance problem of keeping the stored knowledge up-to-date [12]. Another possible AI technique that can be used within an expert system is neural networks. Unfortunately, clinicians and healthcare providers have been reluctant to accept this approach because neural networks provide little insight into how they draw their conclusions, leading to a lack of confidence in the decisions made [13]. Bayesian belief networks (BBNs) are another possibility. A key advantage is their ability to represent potentially causal relationships between variables, making them ideal for representing causal medical knowledge. The graphical nature of BBNs also makes them attractive for communication between medical and non-medical researchers. BBNs allow clinicians a better insight into the workings of the model, thereby improving their confidence in the process and the decisions made. Examples of relevant work in this area include:
– research into the graphical conditional independence representation of BBNs as an expert system for medical applications [14];
– the application of BBNs in the treatment of kidney stones [15];
– the diagnosis of patients suffering from dyspnoea (shortness of breath) [16];
– the representation of medical knowledge [17];
– assistance in the assessment of breast cancer [21];
– assessing the likelihood of a baby suffering from congenital heart disease [22].
In addition, various other applications have been developed in biology for prediction and diagnosis [13].
2 Research and Methodology: Handling Uncertainty
2.1 Bayesian Belief Networks
Bayesian belief networks (BBNs) are probabilistic models which represent potential causal relationships in the form of a graphical structure [18], [19]. There has been much discussion and concern over the use of the term causality. In the context of BBNs, causal is used in the representation of relationships which have the potential to be causal in nature: when one variable directly influences another. Cox and Wermuth [20] explain this reasoning by stating that it is rare that firm conclusions about causality can be drawn
from one study, but rather the objective is to provide representations of data that are potentially causal: those that are consistent with, or suggestive of, causal interpretation. For the potential causal relationships to be considered truly causal, external intervention is required to attach an overall understanding to the whole process of generating the data and the deeper concepts involved. A BBN is a set of variables and a set of relationships between them, where each variable is a set of mutually exclusive events. Relationships are represented in graphical form. The graphical models use Bayes' Theorem to update the probability of an event given that some other event has already been observed. Uncertainty in the data can therefore be handled appropriately by the probabilistic nature of the BBN. The structure of the BBN is formed by nodes and arrows that represent the variables and causal relationships, presented as a directed acyclic graph. An arrow or directed arc, in which two nodes are connected by an edge, indicates that one directly influences the other. Attached to these nodes are the probabilities of various events occurring. This ability to capture potential causality, supported by the probabilities of Bayes' Theorem, makes the BBN appealing to use.

2.2 Development Process of the BBN
There are many ways of constructing a BBN according to the amount of expert knowledge and data available. If there is significant uncertainty in the expert knowledge available, learning of the BBNs may only occur through induction from the data. Alternatively, a high contribution of expert advice may lead to the initial development of the BBN structure originating entirely from the experts, with probabilities attached from the data. A purely expert-driven approach is unattractive because of the basic difficulty in acquiring knowledge from experts. Such a problem can be alleviated by supplying experts with data-induced relationships. A pure data approach would often seem ideal, providing the opportunity of discovering new knowledge along with the potential for full automation. However, the data approach will not always capture every possible circumstance and still demands the attention of expert or human assistance for interpretation purposes: a discovery only becomes knowledge with human interpretation. It is between these two extremes that most developments lie, but that in turn makes the development process unclear. A BBN combines a mixture of expert consultation and learning from data. The objective is to achieve the best of both worlds and minimise the disadvantages of each source, thus reducing the uncertainty. Generally, the approach consists of three main stages:
HUMAN: Probabilities and structure defined by consultation with experts and literature.
SYSTEM: Learning of structure and probabilities from data.
COMBINE: Knowledge base amended with discoveries and probabilities obtained from the data, or the structure induced from data is adjusted to include expert reasoning.
Handling Uncertainty in a Medical Study of Dietary Intake during Pregnancy
211
As the intention is to investigate the nature of BBNs as a research development technique, one possible approach is to run the system and human stages in parallel without any collaboration and compare the outcomes in the combine stage. Alternatively, BBNs may be developed using any combination of the three development components on the first data set. Then, as the data collection continues, the process of developing BBNs can also continue, with various BBNs developed for the growing data set. This exercise will indicate the benefits of using BBNs as a research development technique as more and more discoveries on the data are obtained. The focus of this paper is the system component of the development process, in which the structure and probabilities are derived from analysing the study questionnaires for implicit causal relationships. This will be followed by a second stage derivation of a BBN from the experts involved in the study, with the final outcome being a comparison of the resulting BBNs. In reality, the process, human-system-combine or system-human-combine, is similar to an evolutionary development process since a re-examination or re-learning of the system will be required throughout the study at significant stages to include new data. This process should facilitate fine-tuning in the development of the causal network.

2.3 Expected BBN Formulated from Literature
The research literature reveals various potential causal relationships that can be represented in a BBN structure. For example, one simple causal relationship [1] is the direct influence of the level of proteins in the diet of the mother on the final birth weight of the baby. This may be represented as part of a BBN model as shown in Figure 3.
Fig. 3. BBN representation of a causal relationship.
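To make the probabilistic reading of such an arc concrete, the sketch below encodes a single protein-intake → birth-weight arc as a conditional probability table and answers a causal query (the marginal over birth weight) and a diagnostic query via Bayes' theorem. The numbers and variable names are purely illustrative assumptions of ours, not estimates from the HAPO study.

```python
# Hypothetical numbers, for illustration only -- not HAPO estimates.
p_protein = {'low': 0.3, 'adequate': 0.7}            # P(protein intake)
p_weight_given_protein = {                           # P(birth weight | protein)
    'low':      {'low': 0.40, 'normal': 0.55, 'high': 0.05},
    'adequate': {'low': 0.10, 'normal': 0.75, 'high': 0.15},
}

# Causal query: marginal distribution of birth weight.
p_weight = {w: sum(p_protein[g] * p_weight_given_protein[g][w] for g in p_protein)
            for w in ('low', 'normal', 'high')}

# Diagnostic query via Bayes' theorem: P(protein = low | birth weight = low).
posterior = (p_protein['low'] * p_weight_given_protein['low']['low']) / p_weight['low']

print({w: round(p, 2) for w, p in p_weight.items()}, round(posterior, 2))
# {'low': 0.19, 'normal': 0.69, 'high': 0.12} 0.63
```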
The hypothetical BBN in Figure 4 represents potential causal relationships induced from consultation of the literature and the study domain. The model considers the foods classified into their basic food type. The arrows indicate direct influences. For example, in this model, the food groups proteins, vitamins and minerals and fatty acids will all have potential causal influences on the birth weight of the baby.
Fig. 4. A hypothetical BBN for the HAPO study.
The birth weight of the baby will in turn have a direct influence on its survival. For example, a baby with a very low birth weight may be more likely to die. Also taken into account in the BBN structure is the influence of baby weight on variables obtained from the follow-up study, for example, the occurrence of adult chronic diseases such as ischaemic heart disease. At the current stage in the data collection, it is only possible to hypothesise about such relationships, as follow-up variables will not be available to the research team until at least 2003 and throughout the babies' development into children and adulthood.
3 Preliminary Results The medical study has now started with the first set of participants’ details being recorded. Currently, the study has recruited and interviewed 294 women, 108 of whom have had their babies. Preliminary statistical analysis has been carried out on this initial data set. The pregnancy outcomes include measures recorded for the new-born baby along with additional information such as delivery type and mothers condition. Baby information includes the baby weight, length, head circumference and sex. In this sample of data, the birth weights of the 108 babies ranged from 1.9kg to 4.93kg (4.18-10.85 lbs.) with an average of 3.51kg (7.7 lbs.). Variables identified as having a direct influence on baby weight include the frequencies at which pasta, bread and potatoes are consumed. Other potential influencing factors are the number of cigarettes smoked, the number of children that the mother has already, and whether there are any relatives who have diabetes. The statistical analysis performed on the data set has indicated some useful observations on the variables that have a possible influence on the baby outcomes. To investigate these relationships further, it would be useful to construct a BBN.
3.1 Resulting BBNs The focus of this paper is the development of an initial BBN in which the structure and probabilities are derived from analysing the data collected from the study questionnaires. It is hoped that this will be followed by a second stage derivation of a BBN from the experts involved in the study. The BBNs are constructed using the PowerConstructor package [23]. PowerConstructor takes advantage of Chow and Liu's algorithm [24], which uses mutual information for learning causal relationships, and enhances the method with the addition of further procedures to form a three-stage process of structure learning from the data. The first phase (drafting) of the PowerConstructor software utilises the Chow-Liu algorithm for identifying strong dependencies between variables by the calculation of mutual information. The second stage (thickening) performs conditional independence (CI) tests on pairs of nodes that were not included in the first stage. Stage 3 (thinning) then performs further CI tests to ensure that all edges that have been added are necessary. This three-stage approach manages to keep to one CI test per decision on an edge throughout each stage and has a favourable time complexity of O(N²), unlike many of its competitors which have exponential complexity. Preliminary analysis was carried out using the PowerConstructor package on the food frequency variables for the first 108 pregnant mothers along with the outcome variable, the baby's birth weight. The BBN in Figure 5 was induced from the data.
Fig. 5. Initial BBN for the birth weight outcome and food frequency variables using [23].
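As a rough illustration of the drafting phase only, the sketch below computes pairwise mutual information and keeps a maximum-weight spanning tree (Chow and Liu's dependence tree [24]). It is not the PowerConstructor implementation, the thickening and thinning CI-test phases are omitted, and the toy columns are hypothetical rather than HAPO data.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_edges(data):
    """Drafting phase only: rank variable pairs by mutual information and keep
    a maximum-weight spanning tree (the Chow-Liu dependence tree)."""
    variables = list(data)
    weight = {(a, b): mutual_information(data[a], data[b])
              for a, b in combinations(variables, 2)}
    tree, connected = [], {variables[0]}
    while len(connected) < len(variables):
        # Pick the heaviest edge crossing the cut between connected and the rest.
        (a, b), w = max(((e, w) for e, w in weight.items()
                         if (e[0] in connected) != (e[1] in connected)),
                        key=lambda ew: ew[1])
        tree.append((a, b, round(w, 3)))
        connected.update((a, b))
    return tree

# Purely hypothetical toy columns (0/1/2 codes), not the study data.
data = {'bread':  [0, 1, 2, 2, 1, 0, 2, 1],
        'pasta':  [0, 1, 2, 1, 1, 0, 2, 1],
        'weight': [0, 1, 2, 2, 1, 0, 2, 2]}
print(chow_liu_edges(data))
```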
As before, the variables are represented by ovals in the structure while the edges between the ovals represent potential causal relationships between the variables. From
inspection of the BBN, it is apparent that there are many inter-relationships between the dietary intake variables, but already at this early stage in the data collection a relationship is emerging for the baby’s birth weight. This may be made clear by removing some of the less significant variables and repeating the induction of the BBN. The resulting BBN is shown in Figure 6.
Fig. 6. BBN representing the birth weight outcome along with food type variables using [23].
The relationship emerging from the BBN is that the baby’s birth weight seems to be directly influenced by the variety of bread the mother consumes during pregnancy. In fact if the probabilities are considered it is evident that the greater the variety of bread consumed, the greater the probability of a larger baby. In addition to this, the variable bread is in turn influenced by the consumption of puddings which is in turn influenced by the consumption of fruit and vegetables, potatoes, pasta, rice which are influenced by meat and cereal. The BBN in Figure 6 captures some of the significant relationships on baby weight. However, as the data set grows, it is expected that the number of edges in the BBN will also increase. It is hoped that the BBNs will not only be a tool for the representation of the evolving model but also as a research development technique to aid discussion as the study progresses, helping to identify further causal relationships.
4 Conclusions and Observations
The paper has discussed the use of Bayesian belief networks (BBNs) for handling uncertainty in a medical study. The study is concerned with modelling the influencing factors of a mother’s dietary intake during pregnancy on the final birth outcomes of the baby. In particular, the baby’s birth weight is important as this may cause further complications for both the mother and baby.
A food frequency questionnaire designed to record the frequencies of consumption of various different foods is being used to collect dietary information. There is a high level of complexity and uncertainty involved in determining the nutritional value and content of various different foods and in capturing the effect of variations between manufacturers and products. In addition to developing a system that can handle such uncertainty, the development project itself is a research project with undiscovered knowledge and unproven hypotheses. The objectives are to develop a system that can represent relationships between dietary and outcome variables, to be able to handle uncertainty, while also producing results in such a way that can be understood by clinicians. BBNs seem to be an appropriate technique for such a challenge. Advantages of using them include their ability to represent potentially causal relationships, their visual graphical representation and their capability of dealing with uncertainty using probability theory. The systems or software aspect of this project is to engineer an intelligent system. Ideally this would involve acquiring or learning from an environment with proven hypotheses; however, the medical research in this project is running in parallel so there is uncertainty in the expert knowledge available. Thus, the process to engineer the system should also assist in expressing the evidence contained in the evolving study to the medical experts and the system designers as well as exploring the study data as it is gathered for undiscovered knowledge. The study is currently at the data collection stage and further evolution of the BBNs will follow. It is hoped that an analysis of this evolution will also provide interesting insights into the BBN development process.
Acknowledgements. The authors are greatly indebted to our collaborators Mrs. A. Hill (Nutrition and Dietetics, Royal Hospital Trust), Professor P.R. Flatt and Dr. M.J. Eaton-Evans (Faculty of Science, University of Ulster, Coleraine), Dr. D. McCance and Professor D.R. Hadden (Metabolic Unit, Royal Victoria Hospital).
References 1. Stein, Z., Fusser, M.: The Dutch Famine, 1944-45, and the Reproductive Process. Pediatric Research 9 (1975) 70-76 2. Doyle, W. Crawford, M.A., Wynn, A.H.A., Wynn, S.W.: The Association Between Maternal Diet and Birth Dimensions. Journal of Nutrition and Medicine 1 (1990) 9-17 3. Olsen, S.F., Olsen, J., Frische, G.: Does Fish Consumption During Pregnancy Increase Fetal Growth? International Journal of Epidemiology 19 (1990) 971-977 4. Bakketeig, L.S., Hoffman, H.J., Titmuss Oakley, A.R.: Perinatal Mortality. In: Bracken M.B. (ed.): Perinatal Epidemiology, Oxford University Press, New York, Oxford (1984) 99-151 5. Walther, F.J., Raemaekers, L.H.J.: Neonatal Morbidity of SGA Infants in Relation to their Nutritional Status at Birth. Acta Paediatric Research Scandia 71 (1982) 437-440
6. Barker, D.J.P.: The Fetal and Infant Origins of Adult Disease. British Medical Journal, London (1992) 7. Mathews, F., Neil, H.A.W.: Nutrient Intakes During Pregnancy in a Cohort of Nulliparous Women. Journal of Human Nutrition and Dietetics 11(1998) 151-161 8. Pettitt, D.J., Knowler, W.C., Baird, H.R., Bennett, P.H.: Gestational Diabetes: Infant and rd Maternal Complications in Relation to 3 Trimester Glucose Tolerance in Pima Indians. Diabetes Care, 3 (1980) 458-464 9. Sermer, M., Naylor, C.D., Gare, D.J.: Impact of Increasing Carbohydrate Intolerance on Maternal-fetal Intrauterine Growth Retardation. Human Nutrition and Clinical Nutrition 41C (1995) 193-197 10. Cooper, G.F.: Current Research Directions in the Development of Expert Systems Based on Belief Networks. Applied Stochastic Models and Data Analysis 5 (1989) 39-52 11. Millar, R.A.: Medical Diagnostic DSS - Past, Present and Future. JAMIA 1 (1994) 8-27 12. Bratko, I., Muggleton, S.: Applications of Inductive Logic Programming. Comm. ACM 38(11) (1995) 65-70 13. Lisboa, P.J.G., Ifeachor, E.C., Szczepaniak P.S. (eds.): Artificial Neural Networks in Biomedicine. Springer (2000) 14. Andersen, L.R., Krebs, J.H., Damgaard, J.: STENO: An Expert System for Medical Diagnosis Based on Graphical Models and Model Search. Journal of Applied Statistics 18 (1991) 139-153 15. Madigan, D.: Bayesian Graphical Models for Discrete Data. Technical Report, University of Washington, Seattle (1993) 16. Lauritzen, S.L., Spiegelhalter, D.J.: Local Comparisons with Probabilities on Graphical Structures and their Application to Expert Systems. Journal Royal Statistical Society B 50(2) (1988) 157- 224 17. Korver, M., Lucas, P.J.F.: Converting a Rule-Based Expert System into a Belief Network. Med. Inform. 18(3) (1993) 219-241 18. Buntine, W.: Graphical Models for Discovering Knowledge. In: Fayyad, U.M., PiatetskyShapiro, G., Smyth, P., Uthurusay, R. (eds.): Advances in Knowledge Discovery, AAAI Press / The MITPress (1996) 59-82 19. Ramoni, M., Sebastiani, P.: Bayesian Methods for Intelligent Data Analysis. In: Berthold, M., Hand, D.J. (eds.): Intelligent Data Analysis: An Introduction, Springer, New York (1999) 20. Cox, D.R., Wermuth N.: Multivariate Dependencies, Chapman and Hall (1996) 21. Hojsgaard, S., Thiesson, B., Greve, J., Skjoth, F.: BIROST - A Program for Inducing Block Recursive Models from a Complete Database, Institute of Electronic Systems, Department of Mathematics and Computer Science, Aalborg University, Denmark (1992) 22. Spiegelhalter, D.J., Dawid, A.P., Lauritzen, S.L., Cowell, R.G.: Bayesian Analysis in Expert Systems, Statistical Science 8(3) (1993) 219-283 23. Cheng, J., Bell, D.A., Liu, W.: An Algorithm for Bayesian Network Construction from Data. 6th International Workshop on AI and Stats (1997) 83-90 24. Chow, C. J. K., Liu, C. N.: Approximating Discrete Probability Distributions with Dependence Trees, IEEE Trans. Information Theory, Vol. 14(3), (1968) 462-467
Sequential Diagnosis in the Independence Bayesian Framework David McSherry School of Information and Software Engineering, University of Ulster Coleraine BT52 1SA, Northern Ireland [email protected]
Abstract. We present a new approach to test selection in sequential diagnosis (or classification) in the independence Bayesian framework that resembles the hypothetico-deductive approach to test selection used by doctors. In spite of its relative simplicity in comparison with previous models of hypothetico-deductive reasoning, the approach retains the advantage that the relevance of a selected test can be explained in strategic terms. We also examine possible approaches to the problem of deciding when there is sufficient evidence to discontinue testing, and thus avoid the risks and costs associated with unnecessary tests.
1 Introduction In sequential diagnosis, tests are selected on the basis of their ability to discriminate between competing hypotheses that may account for an observed symptom or fault [1], [2], [3], [4]. In spite of the strong assumptions on which it is based, the independence Bayesian approach to diagnosis (also known as Naïve Bayes) often works well in practice, for example in domains such as the diagnosis of acute abdominal pain [5], [6]. Early approaches to test selection in the independence Bayesian framework were based on the method of minimum entropy, in which priority is given to the test that minimises the expected entropy of the distribution of posterior probabilities [2]. A similar strategy is the basis of the theory of measurements developed by de Kleer and Williams, who combined model-based reasoning with sequential diagnosis in a system for localising faults in digital circuits [1]. However, the absence of a specific goal in the minimum entropy approach means that the relevance of selected tests can be difficult to explain in terms that are meaningful to users. In previous work, we have argued that for the relevance of tests to be explained in terms that are meaningful to users, the evidence-gathering process should reflect the approach used by human experts [7]. It was this aim that motivated the development of Hypothesist [3], [8], an intelligent system for sequential diagnosis in the independence Bayesian framework in which test selection is based on the evidence-gathering strategies used by doctors. Doctors are known to rely on hypothetico-deductive reasoning, selecting tests on the basis of their ability to confirm a target
diagnosis, eliminate an alternative diagnosis, or discriminate between competing diagnoses [9], [10], [11]. Hypothesist is goal driven in that the test it selects at any stage depends on the target hypothesis it is currently pursuing, which is continuously revised in the light of new evidence. In order of priority, its main test-selection strategies are confirm (confirm the target hypothesis in a single step), eliminate (eliminate the likeliest alternative hypothesis), and validate (increase the probability of the target hypothesis). A major advantage of the approach is that the relevance of a test can be explained in terms of the strategy it is selected to support. However, there is considerable complexity associated with the co-ordination of Hypothesist's test-selection strategies. For example, one reason for giving priority to the confirm strategy is that the measure used to select the most useful test in the validate strategy can be applied only to tests that do not support the confirm strategy. A potential drawback of this policy is that the system may sometimes be forced to select from tests that only weakly support the confirm strategy when strategies of lower priority may be more strongly supported by the available tests. Similar problems have been identified in a multiple-strategy approach to attribute selection in decision-tree induction [12]. In this paper, we present a new approach to test selection in sequential diagnosis and classification that avoids the complexity of Hypothesist's multiple-strategy approach while retaining the advantage that the relevance of a selected test can be explained in strategic terms. We also examine possible approaches to the problem of deciding when to discontinue the testing process. The ability to recognise when there is sufficient evidence to support a working diagnosis, and thus avoid the risks and costs associated with unnecessary tests, is an important aspect of diagnostic reasoning [10]. Other good reasons for minimising the length of problem-solving dialogues in intelligent systems include avoiding frustration for the user, minimising network traffic in Web-based applications, and reducing the length of explanations of how a conclusion was reached [13], [14], [15]. On the other hand, minimising dialogue length in sequential diagnosis must be balanced against the risk of accepting a diagnosis before it is fully verified, an error sometimes referred to as premature closure [10], [16]. In Section 2, we present an intelligent system prototype called VERIFY in which Hypothesist's multiple-strategy approach to test selection is replaced by the single strategy of increasing the probability of a target hypothesis. In Section 3, we examine VERIFY's ability to explain the relevance of tests when applied to a well-known classification task. In Section 4, we describe how the trade-off between unnecessary prolongation of the testing process and avoiding the risk of premature closure is addressed in VERIFY. Related work is discussed in Section 5 and our conclusions are presented in Section 6.
2 Sequential Diagnosis in VERIFY The model of hypothetico-deductive reasoning on which VERIFY is based can be applied to any domain of diagnosis (or classification) for which the following are available: a list of hypotheses to be discriminated (i.e. possible diagnoses or outcome classes) and their prior probabilities, a list of relevant tests (or attributes), and the
conditional probabilities of their results (or values) in each hypothesis. VERIFY repeats a diagnostic cycle of selecting a target hypothesis, selecting the most useful test, asking the user for the result of the selected test, and updating the probabilities of all the competing hypotheses in the light of the new evidence. The approach is based on the usual assumptions of the independence Bayesian framework. The hypotheses H1, H2, ..., Hn to be discriminated are assumed to be mutually exclusive and exhaustive, and the results of tests are assumed to be conditionally independent in each hypothesis. The test selected by VERIFY at any stage of the evidence-gathering process depends on the target hypothesis (diagnosis or outcome class) it is currently pursuing. The target diagnosis is the one that is currently most likely based on the evidence previously reported. As in Hypothesist, the target diagnosis is continually revised in the light of new evidence. As new evidence is obtained, the probability of the target hypothesis Ht is sequentially updated according to the independence form of Bayes' theorem:

p(H_t \mid E_1, E_2, ..., E_r) = \frac{p(E_1 \mid H_t)\, p(E_2 \mid H_t) \cdots p(E_r \mid H_t)\, p(H_t)}{\sum_{i=1}^{n} p(E_1 \mid H_i)\, p(E_2 \mid H_i) \cdots p(E_r \mid H_i)\, p(H_i)}

where E1 is the most recently reported test result and E2, ..., Er are previously reported test results. The probability of each competing hypothesis is similarly updated.

2.1 Evidential Power
The measure of attribute usefulness used in VERIFY to select the attribute that most strongly supports its strategy of increasing the probability of the target hypothesis is called evidential power.

Definition 1. The evidential power of an attribute A in favour of a target hypothesis Ht is

l(A, H_t, x) = \sum_{i=1}^{n} p(A = v_i \mid H_t)\, p(H_t \mid A = v_i, x)

where v1, v2, ..., vn are the values of A and x is the evidence, if any, provided by previous test results. Of course, we do not suggest that such a probabilistic measure is used by doctors. Where no test results have yet been reported, we will simply write l(A, Ht) instead of l(A, Ht, x). At each stage of the evidence-gathering process, the attribute selected by VERIFY is the one whose evidential power in favour of the target hypothesis is greatest. While increasing the probability of the target hypothesis is one of the attribute-selection strategies used in Hypothesist, the latter uses a different measure of attribute usefulness.
2.2 Example Domain
The contact lenses data set is based on a simplified version of the optician's real-world problem of selecting a suitable type of contact lenses, if any, for an adult spectacle wearer [17]. A point we would like to emphasise is that the problem of contact lens selection is used here merely as an example, and alternatives to Bayesian classification such as decision-tree induction may well give better results on the contact lenses data set, for example in terms of predictive accuracy. Outcome classes in the data set are no contact lenses, soft contact lenses, and hard contact lenses. Attributes in the data set and conditional probabilities of their values are shown in Table 1. The prior probabilities of the outcome classes are also shown.

Table 1. Prior probabilities of outcome classes and conditional probabilities of attribute values in the contact lenses data set

Contact lens type:            None    Soft    Hard
Prior probability             0.63    0.21    0.17
Age of patient
  young                       0.27    0.40    0.50
  pre-presbyopic              0.33    0.40    0.25
  presbyopic                  0.40    0.20    0.25
Astigmatism
  present                     0.53    0.00    1.00
  absent                      0.47    1.00    0.00
Spectacle prescription
  hypermetrope                0.53    0.60    0.25
  myope                       0.47    0.40    0.75
Tear production rate
  normal                      0.20    1.00    1.00
  reduced                     0.80    0.00    0.00
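As a concrete, hedged illustration (our own sketch, not the VERIFY implementation), Table 1 can be held in dictionaries and the sequential update and Definition 1 computed directly; the attribute keys are our shorthand. Rounded to two decimals the printed values agree with those derived in the text below, apart from tiny differences (for example 0.34 rather than 0.33 for hard contact lenses) caused by the two-decimal probabilities in Table 1.

```python
# Table 1, transcribed into dictionaries (attribute keys are our shorthand).
priors = {'none': 0.63, 'soft': 0.21, 'hard': 0.17}
cond = {  # cond[attribute][value][hypothesis] = p(value | hypothesis)
    'age': {'young':          {'none': 0.27, 'soft': 0.40, 'hard': 0.50},
            'pre-presbyopic': {'none': 0.33, 'soft': 0.40, 'hard': 0.25},
            'presbyopic':     {'none': 0.40, 'soft': 0.20, 'hard': 0.25}},
    'astigmatism': {'present': {'none': 0.53, 'soft': 0.00, 'hard': 1.00},
                    'absent':  {'none': 0.47, 'soft': 1.00, 'hard': 0.00}},
    'prescription': {'hypermetrope': {'none': 0.53, 'soft': 0.60, 'hard': 0.25},
                     'myope':        {'none': 0.47, 'soft': 0.40, 'hard': 0.75}},
    'tears': {'normal':  {'none': 0.20, 'soft': 1.00, 'hard': 1.00},
              'reduced': {'none': 0.80, 'soft': 0.00, 'hard': 0.00}},
}

def posterior(evidence):
    """Independence Bayes update over all hypotheses; evidence = {attribute: value}."""
    score = dict(priors)
    for att, val in evidence.items():
        for h in score:
            score[h] *= cond[att][val][h]
    total = sum(score.values())
    return {h: s / total for h, s in score.items()}

def evidential_power(att, target, evidence):
    """Definition 1: l(A, Ht, x) = sum_i p(A = v_i | Ht) p(Ht | A = v_i, x)."""
    return sum(cond[att][v][target] * posterior({**evidence, att: v})[target]
               for v in cond[att])

print({a: round(evidential_power(a, 'none', {}), 2) for a in cond})
# {'age': 0.64, 'astigmatism': 0.63, 'prescription': 0.63, 'tears': 0.85}
print({h: round(p, 2) for h, p in posterior({'tears': 'normal'}).items()})
# {'none': 0.25, 'soft': 0.42, 'hard': 0.34}
```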
Figure 1 below shows an example consultation in VERIFY in which the evidence-gathering process is allowed to continue until any hypothesis is confirmed or no further attributes remain. Initially the target hypothesis is no contact lenses, and the user is informed when it changes to soft contact lenses. In Section 4, we describe techniques for recognising when there is sufficient evidence to discontinue the evidence-gathering process, thus reducing the number of tests required, on average, to reach a solution. When VERIFY is applied to the example domain, the target hypothesis, with a probability of 0.63, is initially no contact lenses. From Table 1,

p(age = young | no contact lenses) = 0.27
p(age = pre-presbyopic | no contact lenses) = 0.33
p(age = presbyopic | no contact lenses) = 0.40
By Bayes' theorem,

p(no contact lenses | age = young) = (0.27 × 0.63) / (0.27 × 0.63 + 0.40 × 0.21 + 0.50 × 0.17) = 0.50
Similarly, p(no contact lenses | age = pre-presbyopic) = 0.63
and p(no contact lenses | age = presbyopic) = 0.75
The evidential power of age in favour of the target hypothesis is therefore: l(age, no contact lenses) = 0.27 × 0.50 + 0.33 × 0.63 + 0.40 × 0.75 = 0.64
VERIFY: The target hypothesis is no contact lenses
        What is the tear production rate?
User:   normal
VERIFY: The target hypothesis is soft contact lenses
        Is astigmatism present?
User:   no
VERIFY: What is the age of the patient?
User:   young
VERIFY: What is the spectacle prescription?
User:   hypermetrope
VERIFY: The surviving hypotheses and their posterior probabilities are:
        soft contact lenses (0.86)
        no contact lenses (0.14)

Fig. 1. Example consultation in VERIFY
Similarly, the evidential powers of astigmatism, spectacle prescription and tear production rate in favour of no contact lenses are 0.63, 0.63, and 0.85. According to VERIFY, the most useful attribute is therefore tear production rate. When the user reports that the tear production rate is normal in the example consultation, the revised probabilities of no contact lenses, soft contact lenses, and hard contact lenses are 0.25, 0.42, and 0.33 respectively. For example,

p(soft contact lenses | tear production rate = normal) = (1 × 0.21) / (0.20 × 0.63 + 1 × 0.21 + 1 × 0.17) = 0.42
The target hypothesis therefore changes to soft contact lenses. VERIFY now chooses the most useful among the three remaining attributes. For example, the probabilities required to compute the evidential power of astigmatism in favour of soft contact lenses are now:
p(astigmatism = present | soft contact lenses) = 0
p(astigmatism = absent | soft contact lenses) = 1
p(soft contact lenses | astigmatism = present, tear production rate = normal) = 0
p(soft contact lenses | astigmatism = absent, tear production rate = normal) = (1 × 1 × 0.21) / (0.47 × 0.20 × 0.63 + 1 × 1 × 0.21 + 0 × 1 × 0.17) = 0.78
So, l(astigmatism, soft contact lenses, tear production rate = normal) = 0 × 0 + 1 × 0.78 = 0.78
Similarly, the evidential powers of age and spectacle prescription in favour of soft contact lenses are now 0.43 and 0.45 respectively, so the next question that VERIFY asks the user is whether astigmatism is present. The remaining questions in the example consultation (age and spectacle prescription) are similarly selected on the basis of their evidential powers in favour of soft contact lenses.

tear production rate = reduced : no contact lenses (1)
tear production rate = normal
  astigmatism = present
    spectacle prescription = myope
      age of patient = young : hard contact lenses (0.88)
      age of patient = pre-presbyopic : hard contact lenses (0.75)
      age of patient = presbyopic : hard contact lenses (0.72)
    spectacle prescription = hypermetrope
      age of patient = young : hard contact lenses (0.69)
      age of patient = pre-presbyopic : no contact lenses (0.53)
      age of patient = presbyopic : no contact lenses (0.58)
  astigmatism = absent
    age of patient = young
      spectacle prescription = myope : soft contact lenses (0.82)
      spectacle prescription = hypermetrope : soft contact lenses (0.86)
    age of patient = pre-presbyopic
      spectacle prescription = myope : soft contact lenses (0.79)
      spectacle prescription = hypermetrope : soft contact lenses (0.83)
    age of patient = presbyopic
      spectacle prescription = myope : soft contact lenses (0.6)
      spectacle prescription = hypermetrope : soft contact lenses (0.67)

Fig. 2. Consultation tree for the example domain
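Reusing posterior() and evidential_power() from the sketch after Table 1, the consultation of Fig. 1 can be replayed as a greedy loop; this is our reconstruction of the selection cycle, not the VERIFY code itself. It asks the four questions in the order shown in Fig. 1 and ends with soft contact lenses at probability 0.86.

```python
def consult(case):
    """Greedy VERIFY-style dialogue: repeatedly ask the attribute with the
    highest evidential power for the current target hypothesis."""
    evidence, asked = {}, []
    while len(evidence) < len(cond):
        post = posterior(evidence)
        target = max(post, key=post.get)
        best = max((a for a in cond if a not in evidence),
                   key=lambda a: evidential_power(a, target, evidence))
        asked.append(best)
        evidence[best] = case[best]
    post = posterior(evidence)
    return asked, max(post, key=post.get), round(max(post.values()), 2)

case = {'tears': 'normal', 'astigmatism': 'absent',
        'age': 'young', 'prescription': 'hypermetrope'}
print(consult(case))
# (['tears', 'astigmatism', 'age', 'prescription'], 'soft', 0.86)
```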
2.3 Consultation Trees
The questions sequentially selected by VERIFY, the user's responses, and the final solution can be seen to generate a path in a virtual decision tree of all possible consultations. The complete decision tree, shown in Fig. 2, provides an overview of VERIFY's problem-solving behaviour in the example domain. The posterior probabilities of the solutions reached by VERIFY are shown in brackets. An important point to note is that such an explicit decision tree, which we refer to as a consultation tree, is not constructed by VERIFY. The example consultation tree was constructed off-line by a process that resembles top-down induction of decision trees [18] except that there is no partitioning of the data set. It can be seen from Fig. 2 that in contrast to Cendrowska's rule-based approach to classification in the example domain [17], only no contact lenses can be confirmed with certainty in the independence Bayesian framework. Test results on the path followed by VERIFY in the example consultation are shown in italics. Although the solution is soft contact lenses, its probability is only 0.86. It can also be seen from Fig. 2 that this is the maximum probability for soft contact lenses over all possible consultations. Similarly the maximum probability of hard contact lenses over all consultations is 0.88. When applied to the contact lenses data set, VERIFY correctly classifies all but one example, a case of no contact lenses which it misclassifies as soft contact lenses with a probability of 0.6.

2.4 Findings That Always Support a Given Hypothesis
An important phenomenon in probabilistic reasoning is that certain test results may sometimes increase and sometimes decrease the probability of a given hypothesis, depending on the evidence provided by other test results [7], [16], [19]. For example, although the likelier spectacle prescription in no contact lenses is hypermetrope, it does not follow that this finding always increases the probability of no contact lenses. In the absence of other evidence, a spectacle prescription of hypermetrope does in fact increase the probability of no contact lenses from its prior probability of 0.63, since by Bayes' theorem:

p(no contact lenses | spectacle prescription = hypermetrope) = 0.66
On the other hand, if astigmatism is known to be absent, then a spectacle prescription of hypermetrope decreases the probability of no contact lenses:

p(no contact lenses | astigmatism = absent) = 0.58
p(no contact lenses | spectacle prescription = hypermetrope, astigmatism = absent) = 0.55
The relevance of a test result whose effect on the probability of a target hypothesis varies not only in magnitude but also in sign is difficult to explain in other than case-specific terms. However, it is known that a test result always increases the probability of a target hypothesis, regardless of the evidence provided by other test results, if its conditional probability in the target hypothesis is greater than in any competing hypothesis [7], [16]. Such a test result is called a supporter of the target hypothesis [16]. The supporters of the hypotheses in the example domain can easily be identified from Table 1. For example, spectacle prescription = hypermetrope is a supporter of
soft contact lenses, while spectacle prescription = myope always increases the probability of hard contact lenses. Table 2 shows the supporters of each hypothesis in the example domain. As we shall see in Section 3, the ability to identify findings that always increase the probability of a given hypothesis helps to improve the quality of explanations in VERIFY. Certain test results may have more dramatic effects, such as confirming a target hypothesis in a single step or eliminating a competing hypothesis. A finding will confirm a hypothesis in a single step (and always has this effect) if it occurs only in that hypothesis. In medical diagnosis, such findings are said to be pathognomonic for the diseases they confirm [11]. It can be seen from Table 1 that while a reduced tear production rate always confirms no contact lenses, neither of the other hypotheses in the example domain can be confirmed in a single step. A finding eliminates a given hypothesis (and always has this effect) if it never occurs in that hypothesis. The presence of astigmatism can be seen to eliminate soft contact lenses, while the absence of astigmatism eliminates hard contact lenses.
Table 2. Supporters of the hypotheses in the example domain
______________________________________________________
no contact lenses: age = presbyopic, tear production rate = reduced
soft contact lenses: age = pre-presbyopic, astigmatism = absent, spectacle prescription = hypermetrope
hard contact lenses: age = young, astigmatism = present, spectacle prescription = myope
______________________________________________________
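As an illustrative sketch (not part of VERIFY itself), the role of a finding with respect to each hypothesis can be read off a table of conditional probabilities in the way just described; the function below assumes such a table is available as a dictionary cond[(finding, h)] = p(finding | h) covering every pair.

def classify_finding(finding, hypotheses, cond):
    """Return the role of a finding for each hypothesis, given cond[(finding, h)] = p(finding | h).
    Sketch only: 'confirms' = occurs only in h (pathognomonic), 'eliminates' = never occurs in h,
    'supporter' = always increases p(h) regardless of other evidence."""
    roles = {}
    for h in hypotheses:
        p_h = cond[(finding, h)]
        others = [cond[(finding, g)] for g in hypotheses if g != h]
        if p_h > 0 and all(p == 0 for p in others):
            roles[h] = "confirms"
        elif p_h == 0:
            roles[h] = "eliminates"
        elif all(p_h > p for p in others):
            roles[h] = "supporter"
        else:
            roles[h] = "context-dependent"
    return roles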
3 Explanation in VERIFY Before answering any question, the user can ask why it is relevant. VERIFY differs from Hypothesist in the way it responds to such requests for explanation. The explanation provided by Hypothesist depends on the strategy that the attribute (or test) was selected to support. In the confirm, eliminate and validate strategies, the user is shown the value of the attribute that will confirm the target hypothesis, eliminate the likeliest alternative hypothesis, or increase the probability of the target hypothesis [3]. Though having only a single strategy on which to base its explanations of attribute relevance, VERIFY is capable of providing explanations that closely resemble its predecessor’s. It always shows the user the value of the selected attribute that maximises the probability of the target hypothesis. If this value happens to be one that always confirms the target hypothesis, the user is informed of this effect. The user is similarly informed if the value that maximises the probability of the target hypothesis is one that always eliminates a competing hypothesis. If the maximising value has neither of these effects, VERIFY simply informs the user that the value
increases the probability of the target hypothesis and whether it always has this effect; that is, whether it is a supporter of the target hypothesis. Fig. 3 shows an example consultation in which VERIFY explains the relevance of its first two questions by showing the answers that will confirm the target hypothesis or eliminate a competing hypothesis. In the second half of the example consultation, VERIFY identifies two findings (age = young and spectacle prescription = hypermetrope) that will increase the probability of soft contact lenses, but only the latter can always be guaranteed to increase the probability of soft contact lenses. This distinction is reflected in the explanations that VERIFY provides. ___________________________________________________________________________
VERIFY: The target hypothesis is no contact lenses
        What is the tear production rate?
User:   why
VERIFY: If tear production rate is reduced, this finding always confirms no contact lenses
        What is the tear production rate?
User:   normal
VERIFY: The target hypothesis is soft contact lenses
        Is astigmatism present?
User:   why
VERIFY: If astigmatism is absent, this finding always eliminates hard contact lenses
        Is astigmatism present?
User:   no
VERIFY: What is the age of the patient?
User:   why
VERIFY: If the age of the patient is young, this will increase the probability of soft contact lenses
        What is the age of the patient?
User:   young
VERIFY: What is the spectacle prescription?
User:   why
VERIFY: If the spectacle prescription is hypermetrope, this finding always increases the probability of soft contact lenses
        What is the spectacle prescription?
User:   hypermetrope
VERIFY: The surviving hypotheses and their posterior probabilities are:
        soft contact lenses (0.86)
        no contact lenses (0.14)
____________________________________________________________________
Fig. 3. Explanation of reasoning in VERIFY
4 When Can Testing Be Discontinued? In this section we examine possible approaches to the problem of deciding when there is sufficient evidence to discontinue the testing process in VERIFY. One simple strategy is to discontinue the evidence-gathering process when the probability of the leading (most likely) hypothesis reaches a predetermined level. However, a potential drawback in this approach is that the evidence provided by omitted tests may dramatically reduce the probability of the leading hypothesis [7], [16]. To illustrate this problem, Fig. 4 shows the consultation tree for the example domain that results if testing is discontinued in VERIFY when the probability of the leading hypothesis reaches 0.70. This simple criterion has the advantage that the user is asked at most two questions before a solution is reached. However, it can be seen from Fig. 2 that if testing is allowed to continue at the node shown in italics there is one combination of results of the omitted tests (spectacle prescription = hypermetrope, age = presbyopic) that would reduce the probability of hard contact lenses to 0.42, making no contact lenses the most likely classification with a posterior probability of 0.58. Another combination of results reduces the probability of hard contact lenses to 0.47, once again with the additional evidence favouring no contact lenses. The trade-off for the reduction in consultation length is therefore that two cases of no contact lenses in the contact lenses data set are now misclassified as hard contact lenses.
____________________________________________________________________
tear production rate = reduced : no contact lenses (1)
tear production rate = normal
    astigmatism = present : hard contact lenses (0.71)
    astigmatism = absent : soft contact lenses (0.78)
Fig. 4. Consultation tree for the example domain with testing discontinued when the probability of the leading hypothesis reaches 0.70
As this example illustrates, a more reliable approach may be to discontinue testing only when the evidence in favour of the leading hypothesis is such that its probability cannot be less than an acceptable level regardless of the evidence that additional tests might provide [16]. In practice, what is regarded as an acceptable level is likely to depend on the cost associated with an incorrect diagnosis. The approach we have implemented in VERIFY is to discontinue testing only when both of the following conditions are satisfied: (a) the probability of the leading hypothesis is at least 0.5 (b) its probability can never be less than 0.5 if the consultation is allowed to continue
Although a higher threshold may be appropriate in certain domains, (b) ensures that no competing hypothesis can ever be more likely regardless of the results of additional tests that might be selected by VERIFY if the consultation is allowed to continue. Finding the minimum probability of the leading hypothesis involves a breadth-first search of the space of all possible question-answer sequences that can occur if the consultation is allowed to continue. Condition (a) ensures that this look-ahead search is attempted only if the leading hypothesis is a sufficiently promising candidate. As soon as a node is reached at which the probability of the target hypothesis falls below
0.5, the search is abandoned and the next most useful question is selected in the usual way. On the other hand, if the search reveals that the probability of the leading hypothesis can never fall below 0.5, VERIFY selects this hypothesis as the solution to the problem presented by the user. It shows the user the current probability of the leading hypothesis as well as its minimum possible probability as determined by look-ahead search. Figure 5 shows an example consultation illustrating VERIFY’s ability to recognise when there is sufficient evidence to discontinue testing. Although the leading hypothesis is initially no contact lenses with a prior probability of 0.63, VERIFY need only look ahead a single step to determine that its probability will be reduced to 0.25 if the tear production rate is normal. Look-ahead search is not attempted when the user reports that the tear production rate is in fact normal, as soft contact lenses, now the leading hypothesis, has a probability of only 0.42. Following the user’s answer to the next question, the leading hypothesis changes to hard contact lenses as its probability has increased to 0.71 (as we can see from Fig. 4). Look-ahead search for the minimum possible probability of hard contact lenses is therefore initiated and reveals that its probability will fall to 0.42 if the spectacle prescription is hypermetrope and the age is presbyopic (as we can see from Fig. 2). As there is not yet sufficient evidence to discontinue the testing process, VERIFY now asks the user about the spectacle prescription. When the user reports that the spectacle prescription is myope, this increases the probability of hard contact lenses to 0.80 and look-ahead search now reveals that the age of the patient can at worst reduce its probability to 0.72. VERIFY therefore ends the consultation and reports that the likeliest solution is hard contact lenses.
____________________________________________________________________
VERIFY: The target hypothesis is no contact lenses
        What is the tear production rate?
User:   normal
VERIFY: The target hypothesis is soft contact lenses
        Is astigmatism present?
User:   yes
VERIFY: The target outcome class is hard contact lenses
        What is the spectacle prescription?
User:   myope
VERIFY: The surviving hypotheses and their probabilities are:
        hard contact lenses (0.80)
        no contact lenses (0.20).
        The probability of hard contact lenses cannot be less than 0.72.
____________________________________________________________________
Fig. 5. Recognising when there is sufficient evidence to discontinue testing
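A minimal sketch of this stopping rule is given below. It assumes a posterior(h, evidence) function implementing the independence Bayesian update, and it enumerates the remaining answer combinations exhaustively rather than via the breadth-first search used by VERIFY, abandoning the check as soon as the posterior of the leading hypothesis drops below the threshold.

from itertools import product

def can_stop(leading, evidence, remaining_attrs, values, posterior, threshold=0.5):
    """Return True if the posterior of `leading` stays >= threshold for every possible
    combination of answers to the remaining attributes (condition (b) in the text).
    `values[a]` lists the possible results of attribute a; `posterior(h, evidence)` is
    assumed to implement the independence Bayesian update."""
    if posterior(leading, evidence) < threshold:       # condition (a): not promising enough
        return False
    for combo in product(*(values[a] for a in remaining_attrs)):
        extended = dict(evidence, **dict(zip(remaining_attrs, combo)))
        if posterior(leading, extended) < threshold:   # check abandoned: testing must continue
            return False
    return True                                        # solution can be reported with its guaranteed minimum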
Figure 6 shows the consultation tree for the example domain with testing discontinued in VERIFY only when conditions (a) and (b) are satisfied. The tree was constructed off line by a process that uses look-ahead search to determine if further testing may affect the solution reached by VERIFY. The reduced consultation tree has an average path length of 3 nodes compared with 3.8 for the full consultation tree.
5 Related Work PURSUE is an algorithm for strategic induction of decision trees in which attribute selection is based on evidential power in the current subset of the data set [20]. However, the conditional probabilities on which an attribute’s evidential power is based are continuously revised in PURSUE as the data set is partitioned. In contrast, there is no partitioning of the data set in VERIFY and no updating of conditional probabilities. VERIFY also differs from PURSUE in requiring no access to the data, if any, from which the prior and conditional probabilities were derived. It can therefore be applied to diagnosis and classification tasks in which the only available probabilities are subjective estimates provided by a domain expert.
____________________________________________________________________
tear production rate = reduced : no contact lenses (1)
tear production rate = normal
    astigmatism = present
        spectacle prescription = myope : hard contact lenses (0.8)
        spectacle prescription = hypermetrope
            age of patient = young : hard contact lenses (0.69)
            age of patient = pre-presbyopic : no contact lenses (0.53)
            age of patient = presbyopic : no contact lenses (0.58)
    astigmatism = absent : soft contact lenses (0.78)
____________________________________________________________________ Fig. 6. Consultation tree for the example domain with testing discontinued when the probability of the leading hypothesis cannot be less than 0.5
It is interesting to note the analogy between deciding when to discontinue the evidence-gathering process in sequential diagnosis and pruning a decision tree in the context of decision-tree induction. The trade-off between accuracy and simplicity of induced decision trees has been the subject of considerable research effort [13], [21], and lessons learned from this research may have important implications for sequential diagnosis in the independence Bayesian framework. For example, loss of accuracy resulting from drastic reductions of decision-tree size is often surprisingly small, while more conservative reductions can sometimes produce small but worthwhile improvements in accuracy [13]. Though tending to favour conservative reductions in dialogue length, VERIFY’s policy of maintaining consistency with the solutions obtained in full-length dialogues also ensures that only reductions which have no effect on accuracy are allowed. Whether improvements in accuracy may be possible by relaxing the consistency requirement is an important issue to be addressed by further research. In decision-tree induction, the choice of splitting criterion is known to affect the size of the induced decision tree before simplification, for example as measured by its average path length [13], [15]. Similarly, the test-selection strategy used in sequential diagnosis is likely to affect the number of tests required, on average, to reach a solution. Another issue to be addressed by further research is how evidential power
compares with alternative measures such as entropy [2] in terms of average dialogue length before any reduction based on look-ahead search. There is increasing awareness of the need for intelligent systems to support mixedinitiative dialogue [4], [22]. For example, intelligent systems are unlikely to be acceptable to doctors if they insist on asking the questions and ignore the user’s opinion as to which symptoms are most relevant. In VERIFY, the user can volunteer data at any stage of the consultation without waiting to be asked. VERIFY updates the probabilities of each hypothesis in the light of the reported evidence, revises its target hypothesis if necessary, and proceeds to select the next most useful test in the usual way. Another issue to be addressed by system designers is the problem of incomplete data. Often in real-world applications there are questions that the user is unable to answer, for example because the answer requires an expensive or difficult test that the user is reluctant or unable to perform [4]. As in Hypothesist, a simple solution to the problem of incomplete data in VERIFY is to select the next most useful test if the user is unable (or declines) to answer any question. Efficiency of test selection in VERIFY is unlikely to deteriorate significantly when the system is applied to domains with larger numbers of tests than the example domain. However, a practical limitation of the proposed strategy for discontinuing the testing process is that look-ahead search is likely to be feasible in real time only if the number of available tests is relatively small. In previous work we have presented techniques for reducing the computational effort in a search for the minimum probability of a given hypothesis [16]. However, the ability to recognise combinations of test results that can be eliminated in a search for the minimum probability depends on the assumption that p(E | H) > 0 for every test result E and hypothesis H. For example, this condition ensures that no combination of test results that minimises the probability of a hypothesis can include a supporter of the hypothesis [16]. Similar techniques could be used to reduce the complexity of look-ahead search in VERIFY when applied to domains in which this condition is satisfied (or imposed). In this paper, we have avoided this assumption as it means that no hypothesis can be confirmed or eliminated with certainty. It is also worth noting that the need for look-ahead search at problem-solving time could be eliminated by constructing and simplifying an explicit consultation tree off line and using the simplified tree to guide test selection in future consultations. However, this would compromise the system’s abilities to support mixed-initiative dialogue and tolerate incomplete data, both of which rely on freedom from commitment to an explicit decision tree. By selecting tear production rate (the most expensive of the available tests) as the most useful test in the example domain, VERIFY reveals its unawareness of the relative costs of tests. One way to address this limitation would be to constrain the selection of tests so that priority is given to those with negligible cost. A more challenging issue to be addressed by further research, though, is the system’s insensitivity to differences in the costs associated with different misclassification errors. As shown by research in cost-sensitive learning, an approach to classification that takes account of such differences is often essential for optimal decision making [23].
6 Conclusions We have presented a new approach to test selection in sequential diagnosis in the independence Bayesian framework that resembles the hypothetico-deductive approach to evidence gathering used by doctors. The approach avoids the complexity associated with the co-ordination of multiple test-selection strategies in previous models of hypothetico-deductive reasoning, while retaining the advantage that the relevance of a selected test can be explained in strategic terms. We have also examined two approaches to the problem of deciding when to discontinue the evidence-gathering process. In the first approach, testing is discontinued when the probability of the leading hypothesis reaches a predetermined threshold. In the second approach, testing is discontinued only if the solution would remain the same if testing were allowed to continue. The second approach tends to favour more conservative reductions in the length of problem-solving dialogues, while ensuring that only reductions that can have no effect on accuracy are accepted. Whether improvements in accuracy on unseen test cases may be possible by relaxing this constraint is one of the issues to be addressed by further research.
References 1. de Kleer, J, Williams, B.C.: Diagnosing Multiple Faults. Artificial Intelligence 32, 97-130, 1987. 2. Gorry, G.A., Kassirer, J.P., Essig, A., Schwartz, W.B.: Decision Analysis as the Basis for Computer-Aided Management of Acute Renal Failure. American Journal of Medicine 55, 473-484, 1973. 3. McSherry, D.: Hypothesist: a Development Environment for Intelligent Diagnostic Systems. In: Keravnou, E., Garbay, C., Baud, R., Wyatt, J. (eds): Artificial Intelligence in Medicine. LNAI, Vol. 1211. Springer-Verlag, Berlin Heidelberg, 223-234, 1997. 4. McSherry, D.: Interactive Case-Based Reasoning in Sequential Diagnosis. Applied Intelligence 14, 65-76, 2001. 5. Adams, I.D., Chan, M., Clifford, P.C. et al.: Computer Aided Diagnosis of Acute Abdominal Pain: a Multicentre Study. British Medical Journal 293, 800-804, 1986. 6. Provan, G.M., Singh, M.: Data Mining and Model Simplicity: a Case Study in Diagnosis. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, California, 57-62, 1996. 7. McSherry, D. Intelligent Dialogue Based on Statistical Models of Clinical Decision Making. Statistics in Medicine 5 497-502, 1986. 8. McSherry, D.: Dynamic and Static Approaches to Clinical Data Mining. Artificial Intelligence in Medicine 16, 97-115, 1999. 9. Elstein, A.S., Schulman, L.A., Sprafka, S.A.: Medical Problem Solving: an Analysis of Clinical Reasoning. Harvard University Press, Cambridge, Massachusetts, 1978 10. Kassirer, J.P., Kopelman, R.I.: Learning Clinical Reasoning. Williams and Wilkins, Baltimore, Maryland 1991 11. Shortliffe, E.H., Barnett, G.O.: Medical Data: Their Acquisition, Storage and Use. In: Shortliffe, E.H. and Perreault, L.E. eds: Medical Informatics: Computer Applications in Health Care. Addison-Wesley, Reading, Massachusetts 37-69, 1990.
12. McSherry, D.: A Case Study of Strategic Induction: the Roman Numerals Data Set. In: Bramer, M., Preece, A., Coenen, F. eds: Research and Development in Intelligent Systems XVII. Springer-Verlag, London, 48-61, 2000 13. Breslow L.A., Aha D.W. Simplifying Decision Trees: a Survey. Knowledge Engineering Review 12,1-40, 1997. 14. Doyle, M., Cunningham, P.: A Dynamic Approach to Reducing Dialog in On-Line Decision Guides. In: Blanzieri, E., Portinale, L. (eds.) Advances in Case-Based Reasoning. LNAI, Vol. 1898. Springer-Verlag, Berlin Heidelberg, 49-60, 2000. 15. McSherry, D.: Minimizing Dialog Length in Interactive Case-Based Reasoning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 993-998, 2001. 16. McSherry, D.: Avoiding Premature Closure in Sequential Diagnosis. Artificial Intelligence in Medicine 10, 269-283, 1997. 17. Cendrowska, J.: PRISM: an Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies 27, 349-370, 1998. 18. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1, 81-106, 1998. 19. Szolovits, P., Pauker, S.G.: Categorical and Probabilistic Reasoning in Medical Diagnosis. Artificial Intelligence 11, 115-144, 1978. 20. McSherry, D.: Explanation of Attribute Relevance in Decision-Tree Induction. In: Bramer, M., Coenen, F., Preece, A. (eds.) Research and Development in Intelligent Systems XVIII. Springer-Verlag, London 2001 39-52 21. Bohanec, M., Bratko, I.: Trading Accuracy for Simplicity in Decision Trees. Machine Learning 15, 223-250, 1994. 22. Berry, D.C., Broadbent, D.E. Expert Systems and the Man-Machine Interface. Part Two: The User Interface. Expert Systems 4, 18-28, 1987. 23. Elkan, C.: The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978, 2001.
Static Field Approach for Pattern Classification Dymitr Ruta and Bogdan Gabrys Applied Computational Intelligence Research Unit Division of Computing and Information Systems, University of Paisley High Street, Paisley PA1-2BE, UK {ruta-ci0, gabr-ci0}@paisley.ac.uk
Abstract. Recent findings in pattern recognition show that dramatic improvement of the recognition rate can be obtained by application of fusion systems utilizing many different and diverse classifiers for the same task. Apart from a good individual performance of individual classifiers the most important factor is the useful diversity they exhibit. In this work we present an example of a novel non-parametric classifier design, which shows a substantial level of diversity with respect to other commonly used classifiers. In our approach inspiration for the new classification method has been found in the physical world. Namely we considered training data as particles in the input space and exploited the concept of a static field acting upon the samples. Specifically, every single data point used for training was a source of a central field, curving the geometry of the input space. The classification process is presented as a translocation in the input space along the local gradient of the field potential generated by the training data. The label of a training sample to which it converged during the translocation determines the eventual class label of the new data point. Based on selected simple fields found in nature, we show extensive examples and visual interpretations of the presented classification method. The practical applicability of the new model is examined and tested using well-known real and artificial datasets.
1 Introduction Research in pattern recognition proves that no individual method can be shown to be the best for all classification problems [1], [2], [3]. Instead, an interesting alternative is to construct a number of diverse, generally well performing classification methods, and combine them on different levels of classification process. Combining classifiers has been shown to offer a significant classification improvement for some non-trivial pattern classification problems [2], [3]. However, the highest improvement of a multiple classifier system is subject to the diversity exhibited among the component classifiers [4], [5], [6]. In this paper we propose an example of such a novel classifier that performs well individually and can be shown diverse with respect to other commonly used classifiers. In designing the new classifier we exploit a notion of a static field generated by a set of samples treated as physical particles. Our approach is closely D. Bustard, W. Liu, and R. Sterritt (Eds.): Soft-Ware 2002, LNCS 2311, pp. 232–246, 2002. © Springer-Verlag Berlin Heidelberg 2002
related to the idea of an information field, recently emerging from the studies in information theory, where increasingly deep analogies are drawn with the physical world [7], [8]. Shannon entropy representing probabilistic interpretation of information content is an example of a direct counterpart to the thermodynamic entropy related to physical particles. Information or its uncertainty is quite often compared to energy, with all its aspects [8]. The latest findings led even to the formulation of the quantum information theory based on well-developed quantum physics [7]. The mathematical concept of a field so commonly observed in nature, has hardly been exploited for the data analysis. In [9], Hochreiter and Mozer use electric field metaphor to Independent Component Analysis (ICA) problem where joint and factorial density estimates are treated as a distribution of positive and negative charges. In [10], Principe et al introduce the concept of information potentials and forces aroused from the interactions between the samples, which Torkkola and Campbell [11] used further for transformation of the data attempting to maximize their mutual information. Inspired by these tendencies we adapt directly the concept of the field to the data, which can be seen as particles and field sources. The type of the field is uniquely defined by the definition of potential and can be absolutely arbitrary, chosen according to the purpose of the data processing. For classification purposes the idea is to assign a previously unseen sample to one of the classes of the data, the topology of which should be learnt from the training data. As a response to this demand we propose to use an attracting action between the samples similar to the result of gravity acting among masses. Training data are fixed to their positions and represent the sources of the static field acting dynamically on loose testing samples. The field measured in a particular point of the input space is a result of a superposition of the local fields coming from all the sources. Thus the positions of the training data uniquely determine the field in the whole input space and in this way define the paths of the translocations of the testing data along the forces aroused from the local changes of the field. The ending point of such transposition has to be one of the training samples, which in turn determines the label of the classified sample. Such a static field classification (SFC) resembles, to a certain degree, non-parametric density estimation based approaches for classification [1]. The remainder of the paper is organized as follows. Section 2 explains the way in which the data is used as sources of the field, including implementation details. In Section 3 we show how the field generated upon the labeled data can be used for classification of previously unseen data. The next section provides the results from the experiments with real datasets. Finally, conclusions and plans for the future work in this area are presented briefly.
2 Data as Field Sources Inspired by the field properties of the physical world one can consider each data point as a source of a certain field affecting the other data in the input space. In general the choice of a field definition is virtually unrestricted. However, for the classification purposes considered in this paper, we use a central field with a negative potential
increasing with the distance from a source. An example of such a field is the omnipresent gravity field or electrostatic field. Given the training data acting as field sources, every point of the input space can be uniquely described by the field properties measured as a superposition of the influences from all field sources. In this paper we consider a static field in a sense that the field sources are fixed to their initial positions in the input space. All dynamic aspects imposed by the field are ignored in this work. Given a training set of N_S m-dimensional data: X_S = {x_1, x_2, ..., x_{N_S}}, let each sample be the source of a field defined by a potential:

U_j = −c s_i f(r_ij)     (1)

where c represents the field constant, s_i stands for a source charge of the data point x_i, and f(r_ij) is a certain non-negative function decreasing with an increasing length of the vector r_ij describing the distance between the source x_i and the point x_j in the input space. Note that the potential is always negative, which decides about the attracting properties of the data. In this work we adopt the gravitational field for which:

f(r_ij) = 1 / |r_ij|     (2)

For notational simplicity we write r_ij also for the length |r_ij| of this vector. The overall potential in a certain point x_j of the input space is a superposition of potentials coming from all the sources:

U_j = −c ∑_{i=1}^{N} s_i / r_ij     (3)

Considering a new data point in such a field, we can immediately associate with it the energy equal to:

E_j = −c s_j ∑_{i=1}^{N} s_i / r_ij     (4)

Simplifying the model further we can assume that all data points are equally important and have the same source charge equal to unit, s_i = 1, thus eliminating source charges from equations (3) and (4). Another crucial field property is its intensity, which is simply a gradient of the potential and can be formally expressed by:

E_j = −∇U_j = −( ∂U_j/∂x_j1, ∂U_j/∂x_j2, ..., ∂U_j/∂x_jm )     (5)

Solution of equation (5) leads to the following form of the field vector:

E_j = −c ( ∑_{i=1}^{N} (x_j1 − x_i1)/r_ij^3, ..., ∑_{i=1}^{N} (x_jm − x_im)/r_ij^3 ) = −c ∑_{i=1}^{N} (x_j − x_i)/r_ij^3     (6)
A vector of field intensity, or shortly a field vector, shows the direction and the magnitude of maximum fall of the field potential. By analogy, in the presence of a new data point we can introduce the force, which is a result of the field acting upon the data. Due to our simplifications the force vector and field vector have identical values and directions:

F_j = E_j     (7)

The only difference between them is a physical unit, which is of no importance in our application. The concept of field forces will be directly exploited for the classification process described in section 3. The field constant c does not affect the directions of forces but only decides about their magnitudes. As previously, without any loss of generality we can assume its unit value and in that way free all the field equations from any parameters, apart from the definition of the distance itself. 2.1 Field Generation From a perspective of classification, the generation of a field could represent the training process of a classifier. However, as the training data uniquely determine the field and all its properties, the training process may be omitted. All the calculations required to classify new data are carried out online during the classification process, avoiding any imprecision caused by approximations that might have been applied otherwise. It is very similar to the generation and operation of the very well known k-nearest neighbor classifier. In the case of a large amount of data to be classified, another option is available, although not as precise as in the previous case. Namely, one can split the input space into small hyper boxes and calculate all the field properties required in the center of each hyper box. The field can be approximated that way in any point just by local aggregation procedures. The training process would be substantially prolonged, but the classification phase would require calculations related to just one or a couple of points from the neighborhood and therefore can be drastically shortened. For both methods the critical factor is calculation of the distances between the examined points and all the sources. Using a matrix formulation of the problem and mathematical software this task can be carried out rapidly even for thousands of sources. In this case it was achieved using the P3 processor and Matlab 5, with calculations of all distances between 1000 10-D points and 1000 10-D sources taking less than 1 second. Let X, of size N × m, denote the matrix of N m-dimensional data points we want to examine the field at, and X_S, of size N_S × m, be the matrix of the N_S training data – field sources. The task is to obtain the matrix D, of size N × N_S, of all the distances between the examined data and the training data. As opposed to a time-consuming double-loop implementation, introducing a matrix formulation leads to significant savings in terms of code length and processing time. Namely, D can be calculated as simply as:

D = √( (X ∘ X) • 1(m,1) • 1(1,N_S) − 2 • X • X_S^T + 1(N,1) • 1(1,m) • (X_S^T ∘ X_S^T) )     (8)
where “∘” denotes the operator of element-wise matrix multiplication (multiplication of corresponding elements), “•” represents standard matrix multiplication, and 1(n, m) stands for a matrix of size (n, m) with all elements equal to one. A Matlab implementation of the above rule results in a calculation time about 20 times shorter than that of the double-loop algorithm. Given the distance matrix D, all the properties of the field can be obtained by simple algebra performed on D. To avoid numerical problems, the distances have been limited from below by an arbitrary threshold d, preventing division by zero.
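For illustration, the same vectorised computation can be written in a few lines of NumPy; this is a sketch in the spirit of eq. (8) rather than the authors' Matlab code, with the floor d on the distances playing the role of the threshold mentioned above.

import numpy as np

def distance_matrix(X, XS, d=0.01):
    """Pairwise Euclidean distances between query points X (N x m) and field
    sources XS (N_S x m), floored at d to avoid division by zero (cf. eq. (8))."""
    sq = (X * X).sum(axis=1)[:, None] - 2.0 * X @ XS.T + (XS * XS).sum(axis=1)[None, :]
    D = np.sqrt(np.maximum(sq, 0.0))   # clip tiny negatives caused by round-off
    return np.maximum(D, d)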
2.2 Numerical Example As an example we generated 50 random 2-D points from a range (0,1) in both dimensions. Figures 1 and 2 provide the visualization of the field arising from the training data. Not surprisingly, the potential is the lowest in the maximum data concentrations and generally in the middle of the regions occupied by the data. This phenomenon nicely correlates with the classification objective, as the highly concentrated regions should have greater chance of data interception. The same applies to the dramatic local decrease of the potential around the field sources. Presence of the field can be also interpreted as a specific curvature of the input space imposed by the presence of field sources, that is, the data. Each point in such a curved input space will be forced to move along the force vectors ultimately ending up in a position of one of the field sources. In this way the field built upon the data has the ability to uniquely transform the input space. For the classification purposes, such a transformation leads to the split of the whole input space into the subspaces labeled according to the labels of field sources intercepting the data from these subspaces. Figures 1 and 2 show a visualization of the static data field generated by 50 random 2-D points from the range of (0,1) in each dimension.
Fig. 1. 3-D visualization of the potential generated by 2-D data.
Fig. 2. Vector plot of the field pseudo-intensity (‘pseudo’ as the vectors point only in the true direction of the field but their lengths are here fixed for visualization clarity).
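The quantities visualized above can be reproduced in spirit with a short script. The sketch below is an illustrative re-implementation (not the authors' code) that evaluates the potential of eq. (3) and the field vector of eq. (6) for 50 random 2-D sources, with c = 1 and unit source charges.

import numpy as np

def potential_and_field(x, sources, eps=1e-12):
    """Potential U (eq. 3) and field vector E (eq. 6) at point x, with c = 1 and
    unit source charges; eps guards against division by zero at a source."""
    diff = x - sources                                   # x_j - x_i for every source
    r = np.maximum(np.linalg.norm(diff, axis=1), eps)
    U = -np.sum(1.0 / r)
    E = -np.sum(diff / r[:, None] ** 3, axis=0)
    return U, E

rng = np.random.default_rng(0)
sources = rng.random((50, 2))                            # 50 random 2-D sources, as in Sect. 2.2
U, E = potential_and_field(np.array([0.5, 0.5]), sources)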
3 Classification Process Given the field, classification process is very straightforward and is simply reduced to the gradient descent source finding. The slide begins from the position of a new data point to be classified. For algorithm stability we ignore the actual values of the field or force vectors, following just the direction of maximum decrease of the potential. The sample is shifted along this direction by a small step d and the whole field is recalculated again in the new position. This procedure is repeated until the unknown sample approaches any source at the distance lower or equal d . If this happens the sample is labeled according to the source it was intercepted by. Parameter d corresponds to the length of the shift vector and if fixed, could cause different relative shifts for differently scaled dimensions. To avoid this problem we normalize the input space to cover all the training data within the range (0,1) in all dimensions and set the step d to the fixed value: d = 0.01 . During classification process the new data is transformed to the normalized input space and even if its position falls outside the range (0,1) in any dimension, the step d remains well scaled. The step d has been deliberately denoted by the same symbol as lower limit of the distance introduced in previous section. This ensures that the sample never misses the source on its trajectory and additionally two parameters are reduced to just one. To speed up the classification process one can extend the step as long as the data size is small enough and the distances between the sources remain larger than d .
3.1 Matrix Implementation Rather than classifying samples one by one, we used Matlab capabilities to classify samples simultaneously. Given the distance matrix D obtained by (8), the matrix of forces F, of size N × m, can be immediately obtained by (6). Exploiting the triangular relation between forces and shifts and given the constant d, the matrix of shifts ∆X, of size N × m, can be calculated by the formula:

∆X = (d • F) / √( (F ∘ F) • 1(m,1) • 1(1,m) )     (9)

where the division and the square root are applied element-wise.
The full SFC algorithm can be expressed in the following sequence of steps:
1. Given the sources – training data X_S, and data to be classified X, calculate the matrix of distances D according to (8).
2. Calculate the matrix of field forces at the positions of unlabeled data to be classified.
3. Given a fixed step, calculate the shifts of the samples according to (9).
4. Shift the samples to the new locations calculated in the step 3.
5. For all samples check if the distance to any source is less or equal to the step d. If yes, classify these samples with the same labels as the sources they were intercepted by and remove them from the matrix X.
6. If matrix X is empty finish else go to step 1. Transformation as presented above leads to the split of the whole input space into the subspaces labeled according to the labels of field sources intercepting the data from these subspaces. Figure 3 presents graphical interpretation of the classification process for an artificial dataset with 8 classes. One can notice that the information about the labels of the training data is not used till the very end of the classification process. This property makes the method an interesting candidate for an unsupervised clustering technique. The class boundary diagram reveals an interesting effect of the presented classification method. Namely, occasionally one can observe a narrow strip of one class getting deep into the area of another class. This is the case of the potential ridge, which is balanced from both sides by the data causing the field vector to go inbetween, sometimes even reaching another class. Although this phenomenon is not particularly desirable for an individual classifier, as we show in the experiments it contributes to the satisfactory level of the diversity the SFC classifier exhibits with other classifiers. 3.2 Comparison with Other Classifiers The static field classification presented in this paper share some similarities with other established classifier designs. The process of field generation can be seen in fact as indirect parametric estimation of the data density where kernels are defined by potentials generated by each training data. Although technically similar, the two approaches have diametrically different meaning. Rather than data density we calculate potential, which in our approach imposes specific curvature of the input space used further for
classification purposes. This part of our method represents purely unsupervised approach, as the information about the labels is not used at any point.
Fig. 3. Visualization of the static field based classification process performed on the 8-classes 2D artificial data of 160 samples. 3a: Scatter plot of the training data. 3b: Vector plot of the field pseudo-intensity. 3c: Trajectories of exemplary testing data sliding down along potential gradients. 3d: Class boundaries diagram.
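For concreteness, the six-step procedure of Sect. 3.1 can be sketched as follows; this is an illustrative re-implementation under simplifying assumptions (integer class labels, data already normalised to the (0, 1) range), not the authors' Matlab code. Points that fail to converge within max_iter keep the label -1.

import numpy as np

def sfc_classify(X_train, y_train, X_test, d=0.01, max_iter=10_000):
    """Gradient-descent interception of test samples by field sources (steps 1-6).
    Each test point slides by a fixed step d along the local field direction until it
    comes within d of a training sample, whose label it then inherits."""
    X = X_test.astype(float).copy()
    labels = np.full(len(X), -1)                            # -1 marks points not yet intercepted
    active = np.arange(len(X))
    for _ in range(max_iter):
        if active.size == 0:
            break
        diff = X[active, None, :] - X_train[None, :, :]     # (n_active, N_S, m)
        D = np.maximum(np.linalg.norm(diff, axis=2), d)     # distances, floored at d (step 1)
        hits = D.min(axis=1) <= d                           # interception test (step 5)
        labels[active[hits]] = y_train[D[hits].argmin(axis=1)]
        diff, D, active = diff[~hits], D[~hits], active[~hits]
        if active.size == 0:
            break
        F = -(diff / D[:, :, None] ** 3).sum(axis=1)        # field forces, eq. (6) (step 2)
        F_norm = np.linalg.norm(F, axis=1, keepdims=True)
        X[active] += d * F / np.maximum(F_norm, 1e-12)      # shift by step d (steps 3-4)
    return labels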
The classification process of falling into potential well of a single source resembles knearest neighbor classification where k=1. However rather than joining based on the least distance, we apply a specific translocation leading to the nearest field source met on the trajectory determined by maximum fall of a field potential. Some similarities can be also found with Bayesian classifiers, which pick a class with maximum a posteriori probability calculated on the basis of assumed or estimated class probability density functions. In our approach, instead of probability and probability distribution we apply potential, which we use in a less restrictive manner. Furthermore all training data regardless of the class take part in forming the decision landscape for each data to be classified. No comparison is made after the matching procedure, the result of which is that a classifier designed in this way is able to produce only binary outputs. It is worth mentioning that the labels of any data are not used during classification process either as the testing samples are intercepted purely on the basis of the field built upon the training data and not their labels. One can say that classification process
is hidden until the labels of the sources are revealed. What the SFC classifier does afterwards is passing the labels from the field generators to the captured data. Once the labels of the sources are known, any data can be classified according to these labels. However if not, the method is simply matching the data with sources according to the descending potential rule, which can be potentially exploited for clustering algorithms
4 Diversity Diversity among classifiers is the notion describing the level, to which classifiers vary in data representation, concepts, strategy etc. This multidimensional perception of diversity results however in a simple effect observed at the outputs of classifiers: they tend to make errors for different input data. This phenomenon has been shown to be crucial for effective and robust combining methods [3], [4], [5]. The diversity can be measured in a variety of ways [4], [5], [6], but the most effective turned out to be the measures evaluating directly disagreement to errors among classifiers [5], [6]. In [6] we investigated the usefulness and potential applicability of a variety of pairwise and non-pairwise diversity measures operating on binary outputs (correct/incorrect). For the purpose of this paper we will be using a Double Fault (DF) measure, which turned out to be the best among analyzed pairwise measures. Recalling the definition of the DF measure, the idea is to calculate the ratio of the number of samples misclassified by both classifiers n^11 to the total number of samples n:

F = n^11 / n     (10)

Using this simple measure one can effectively assess the diversity between all pairs of classifiers as well as quite reasonably evaluate the combined performance [6].
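A sketch of the DF computation on binary correct/incorrect outputs is given below; the predictions shown are dummy values for illustration only.

import numpy as np

def double_fault(pred_a, pred_b, y_true):
    """Double Fault (DF) diversity measure of eq. (10): the fraction of samples
    misclassified by both classifiers. Lower values indicate a more diverse pair."""
    both_wrong = (pred_a != y_true) & (pred_b != y_true)
    return both_wrong.mean()

# Example with dummy predictions (illustrative values only)
y = np.array([0, 1, 1, 0, 1])
a = np.array([0, 1, 0, 0, 0])
b = np.array([0, 0, 1, 0, 0])
print(double_fault(a, b, y))   # 0.2: one sample misclassified by both classifiers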
5 Experiments The static field based classification method requires a number of evaluation procedures. First of all one need to check its performance over typical real and artificial datasets and compare it against other classifiers. Secondly, as we mentioned above, we intended to develop a classifier, which would be diverse to other commonly used classifiers. It is important for a good classifier to meet both these conditions on a satisfactory level to be successfully used in combining schemes. 5.1 Experiment 1 In this experiment we used 4 well-established datasets to check an individual performance of the SFC and compare it against the performances of another 9 typically used classifiers. Table 1 shows the details of datasets picked and a list of the other classifiers used. For all but one case the datasets have been split into two equal parts used for
training and testing respectively. For the Phoneme dataset, due to its large size, we used 200 samples for training and 1000 for testing. Classification runs have been repeated 10 times for different random splits. Table 2 shows averaged individual performances of classifiers from this experiment. Although the performance of the SFC classifier is never the best result, it remains close to the best. This makes the SFC a valid candidate for combining provided it exhibits a satisfactory level of diversity.

Table 1. Description of datasets and classifiers used for experiments.

Table 2. Individual performances of classifiers applied to the datasets: Iris, Phoneme, Conetorus, and Synthetic. The results are obtained for 10 random splits and averaged.
5.2 Experiment 2 The diversity properties of the SFC classifier are evaluated in the next experiment. As mentioned in Section 4 we decided to apply the DF measure to examine the diversity among all pairs of classifiers. The results have been obtained for all datasets mentioned above. Due to a large size of the DF measures obtained for all pairs of considered classifiers we present the results graphically in the form of diversity diagrams as shown in Figure 4.

Fig. 4. Diversity diagrams obtained for the 4 considered datasets (panels: Iris, Phoneme, Conetorus, Mixture of Gaussians).
The coordinates of each small square correspond to the indices of classifiers for which the DF measure is calculated. Note that the diagrams are diagonally symmetrical. The shade of the squares reflects the magnitude of the DF measure values. The lower the DF measure, the lighter the square and the more diverse the corresponding pair of classifiers. To support a single value, which would reflect diversity properties of individual classifiers, we averaged the DF measures between a considered classifier and all remaining classifiers. This is shown numerically in Table 3. Both the diagrams from Figure 4 and the averaged results from Table 3 show very good diversity properties of the SFC classifier. Only for the Phoneme dataset, the SFC demonstrates quite average diversity level. For the remaining 3 datasets SFC turned out to be the most diverse despite quite average individual performance.

Table 3. Averaged values of DF measures between individual classifiers and all the remaining classifiers. The DF values have been expressed as percentages of the occurrences of pairwise coincident errors to the total number of samples.

Dataset [%]    Loglc    fisherc    ldc      nmc      persc    pfsvc    qdc      Parzenc    rbnc     Sfc
Iris           0.48     0.36       0.45     0.44     0.12     0.21     0.53     0.47       0.37     0.11
Phoneme        5.61     6.98       6.08     7.85     6.11     6.80     6.95     6.81       5.88     6.70
Conetorus      10.82    11.42      12.39    11.05    11.46    10.99    12.62    12.40      12.70    10.80
Gaussians      2.73     2.33       3.44     2.55     3.73     3.18     2.68     3.27       2.83     1.74
5.3 Experiment 3 In the last experiment we investigated a parametrical variability of the presented classifier. Recalling the force definition (7), the only parameter of the field having a potential influence on the classification results is the type of distance appearing in the potential definition (1). For the Conetorus dataset we applied the SFC with different powers of the distance appearing in the denominator in (3). Additionally, we examined a simple exponential function with one parameter as an alternative definition of potential. Table 4 shows all configurations of the SFC examined in this experiment, as well as individual performances obtained for the Conetorus dataset for a single split. Visual results, including field images and class boundaries are shown in Figures 5 and 6. For both functions the results depict a clear meaning of the parameter a. Namely it accounts for the balance between local and global interactions among the samples.

Table 4. Individual performances of various configurations of SFC classifier. The results obtained for Conetorus dataset using single random split (50% for training, 50% for testing).

Potential definition    Rational: U(x) = −∑_{i=1}^{N} 1/r_i^a              Exponential: U(x) = −∑_{i=1}^{N} e^(−a r_i)
Parameter a             0.1      0.6      2        5                       10       20       50       100
Performance             72.86    76.88    82.41    81.42                   69.85    76.38    80.90    81.91
Fig. 5. Potential with rational function of the distance for different values of the parameter a (panels: a = 0.1, 0.6, 2, 5).
The larger the value of a , the more local the field, so that virtually only the nearest neighbors influence the field in a particular point of the input space. For smaller a the field becomes more global and below a certain critical value some training samples are no longer able to intercept any testing samples.
Fig. 6. Exponential dependence on the distance for different values of parameter a (panels: a = 10, 20, 50, 200).
Technically it is the case when a single source cannot curve the geometry strong enough to create closed enclave of higher potential around itself. The presented SFC classifier will not be able to classify such samples, which may just find local minimum of the potential rather than the field source. The critical value of parameter a seems to be the function of the number of samples and its optimization could be included into
the classifier design. However the results in Table 4 suggest that more local fields tend to result in a better performance and therefore it is safer to apply larger values of a .
6 Conclusions In this paper, we introduced a novel non-parametric classification method based on the static data field adopted from the physical field phenomena. The meaning of training data has been reformulated as sources of a central static field with the negative potential increasing with the distance from the source. Attracting force among the data defines a specific complex potential landscape resembling the joint potential wells. The classification process has been proposed as a gradient descent translocation of the unlabelled sample ultimately forced to approach one of the sources and inherit its label. The Static Field Classifier (SFC) has been implemented using an efficient matrix formulation suitable for Matlab application. Extensive graphical content has been used to depict different geometrical interpretations of the SFC as well as fully visualize the classification process. The presented SFC has been evaluated in a number of ways. An Individual performance has been examined on a number of datasets and compared to other well performing classifiers. The results showed relatively average performance of the SFC if applied individually. However it has shown the highest level of diversity with other classifiers for 3 out of 4 datasets, making it a very good candidate for classifier combination purposes. Various types of fields have also been examined within the general SFC definition. The conducted experiments suggested the use of local fields for the best performance as well as for boundaries invariability, but further experiments are required for fuller interpretation of these results. The properties mentioned above as well as the results from the presented experiments allow considering the SFC as an alternative non-parametric approach for the classification, particularly useful for combining with other classifiers.
References 1. Duda R.O., Hart P.E., Stork D.G.: Pattern Classification. John Wiley & Sons, New York (2001). 2. Bezdek J.C.: Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer Academic, Boston (1999). 3. Sharkey A.J.C.: Combining Artificial Neural Nets: Ensemble and Modular Multi-net Systems. Springer-Verlag, Berlin Heidelberg New York (1999). 4. Sharkey A.J.C., Sharkey N.E.: Combining Diverse Neural Nets. The Knowledge Engineering Review 12(3) (1997) 231-247. 5. Kuncheva L.I., Whitaker C.J.: Ten Measures of Diversity in Classifier Ensembles: Limits for Two Classifiers. IEE Workshop on Intelligent Sensor Processing, Birmingham (2001) 10/1-10/6
6. Ruta D., Gabrys B.: Analysis of the Correlation Between Majority Voting Errors and the Diversity Measures in Multiple Classifier Systems. International Symposium on Soft Computing, Paisley (2001). 7. Zurek W.H.: Complexity, Entropy and the Physics of Information. Proc. of the Workshop on Complexity, Entropy, and the Physics of Information. Santa Fe (1989). 8. Klir G.J., Folger T.A.: Fuzzy Sets, Uncertainty, and Information. Prentice-Hall International Edition (1988). 9. Hochreiter S., Mozer M.C.: An Electric Approach to Independent Component Analysis. Proc. of the Second International Workshop on Independent Component Analysis and Signal Separation, Helsinki (2000) 45-50. 10. Principe J., Fisher III, Xu D.: Information Theoretic Learning. In S. Haykin (Ed.): Unsupervised Adaptive Filtering. New York NY (2000). 11. Torkkola K., Campbell W.: Mutual Information in Learning Feature Transformations. Proc. of International Conference on Machine Learning, Stanford CA (2000).
Inferring Knowledge from Frequent Patterns Marzena Kryszkiewicz Institute of Computer Science, Warsaw University of Technology Nowowiejska 15/19, 00-665 Warsaw, Poland [email protected] Abstract. Many knowledge discovery problems can be solved efficiently by means of frequent patterns present in the database. Frequent patterns are useful in the discovery of association rules, episode rules, sequential patterns and clusters. Nevertheless, there are cases when a user is not allowed to access the database and can deal only with a provided fraction of knowledge. Still, the user hopes to find new interesting relationships. In the paper, we offer a new method of inferring new knowledge from the provided fraction of patterns. Two new operators of shrinking and extending patterns are introduced. Surprisingly, a small number of patterns can be considerably extended into the knowledge base. Pieces of the new knowledge can be either exact or approximate. In the paper, we introduce a concise lossless representation of the given and derivable patterns. The introduced representation is exact regardless the character of the derivable patterns it represents. We show that the discovery process can be carried out mainly as an iterative transformation of the patterns representation.
1 Introduction Let us consider the following scenario that is typical for collaborating (e.g. telecom) companies: a company T1 requests some services offered by a company T2. To this end T2 must collect some knowledge about T1. T1 however may not wish to reveal some facts to T2 unintentionally. Therefore, it is important for T1 to be aware of all the consequents derivable from the required information. Awareness of what can be derived from a fraction of knowledge can be crucial for the security of the company. On the other hand, methods that enable reasoning about knowledge can be very useful also in the case when the information available within the company is incomplete. The problem of inducing knowledge from the provided set of association rules was first addressed in [4]. It was offered there how to use the cover operator (see [3]) and extension operator (see [4]) in order to augment the original knowledge. The cover operator does not require any information on statistical importance (support) of rules and produces at least as good rules as original ones; the extension operator requires information on support of original rules. In [5] a different method of indirectly inducing knowledge from the given rule set was proposed. It was proved there that it is better first to transform the provided rule set into corresponding patterns P, then to augment P with new patterns (bounded by patterns in P), and finally to apply old and new patterns for association rules discovery. The set of rules obtained this way is guaranteed to be a superset of the rules obtained by indirect “mining around rules”. Additionally, it was shown in [5]
how to test the consistency of the rule set and patterns as well as how to extract consistent subset of rules and patterns. In this paper we follow the idea formulated in [5] that patterns are a better form for knowledge derivation than association rules. We treat patterns as an intermediate form for derivation of other forms of knowledge. The more patterns we are able to derive from itemsets, the more association rules we are able to discover. This paper extends considerably the original method of deriving new patterns from the provided fraction of patterns. In particular, we propose here how to obtain extended and shrunken patterns that are not bounded by known itemsets. In addition, we define a lossless concise representation of patterns and show how to use it in order to obtain all derivable itemsets. The introduced representation is an adapted version of generators and closed itemsets representation developed recently as an efficient representation of all frequent patterns present in the database [6,7]. The original representation of patterns is defined in terms of database transactions, while the adapted version is defined in terms of available patterns. Though both representations are conceptually close, the related problems differ considerably – in the case of original representation the main problem is its discovery from the transactional database; in the case of our representation – the issue we mainly address is restriction of an intended pattern augmentation to the representation transformation. The layout of the paper is as follows: In Section 2 we introduce the basic notions of patterns (itemsets) and association rules. Section 3 reminds the results obtained in [5] related to discovering association rules from the given rule set by means of bounded itemsets. In Section 4 we propose a new method of augmenting the number of patterns by means of two new operators of shrinking and extending. In Section 5 we offer a notion of a concise lossless representation of available and derivable patterns. It is offered in Section 6, how to restrict the overall pattern derivation process by applying the concise representation of patterns. Section 7 concludes the results.
2 Frequent Itemsets, Association Rules, Closed Itemsets, and Generators The problem of association rules was introduced in [1] for sales transaction database. The association rules identify sets of items that are purchased together with other sets of items. For example, an association rule may state that 90% of customers who buy butter and bread buy milk as well. Let us recollect the problem more formally: Let I = {i1, i2, ..., im} be a set of distinct literals, called items. In general, any set of items is called an itemset. Let D be a set of transactions, where each transaction T is a subset of I. An association rule is an expression of the form X ⇒ Y, where ∅ ≠ X,Y ⊂ I and X ∩ Y = ∅. Support of an itemset X is denoted by sup(X) and defined as the number (or the percentage) of transactions in D that contain X. Property 1 [1]. Let X,Y⊆I. If X⊂Y, then sup(X)≥sup(Y). The property below is an immediate consequence of Property 1. Property 2. Let X,Y,Z⊆I. If X⊆Y⊆Z and sup(X)=sup(Z), then sup(Y)=sup(X).
Property 3 [6]. Let X,Y,Z⊆I. If X⊆Z and sup(X)=sup(Z), then X∪Y⊆Z∪Y and sup(X∪Y)=sup(Z∪Y). Support of the association rule X ⇒ Y is denoted by sup(X ⇒ Y) and defined as sup(X ∪ Y). Confidence of X ⇒ Y is denoted by conf(X ⇒ Y) and defined as sup(X ∪ Y) / sup(X). (In terms of sales transactions, conf(X ⇒ Y) determines the conditional probability of purchasing items Y when items X are purchased.) The problem of mining association rules is to generate all rules that have support greater than a user-defined threshold minSup and confidence greater than a threshold minConf. Association rules that meet these conditions are called strong. Discovery of strong association rules is usually decomposed into two subprocesses [1], [2]: Step 1. Generate all itemsets whose support exceeds the minimum support minSup. The itemsets of this property are called frequent. Step 2. From each frequent itemset generate association rules as follows: Let Z be a frequent itemset and ∅ ≠ X ⊂ Z. Then X ⇒ Z\X is a strong association rule provided sup(Z)/sup(X) > minConf. The number of both frequent itemsets and strong association rules can be huge. In order to obey this problem, several concise representations of knowledge have been proposed in the literature (see e.g. [3], [6], [7]). In particular, frequent closed itemsets, which constitute a closure system, are one of the basic lossless representations of frequent itemsets [7]. The closure of an itemset X (denoted by γ(X)) is defined as the greatest (w.r.t. set inclusion) itemset that occurs in all transactions in D in which X occurs. The itemset X is defined closed if γ(X)=X. Another basic representation of frequent itemsets is based on a notion of a generator. A generator of an itemset X (denoted by G(X)) can be defined as a minimal (w.r.t. set inclusion) itemset that occurs in all transactions in D in which X occurs.
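To make the two-step rule discovery above concrete, the following minimal sketch (my own illustration in Python, not code from the paper; the toy transaction set is invented) computes supports and generates the strong rules of Step 2 from a frequent itemset.

```python
from itertools import combinations

# Toy transaction database D; each transaction is a set of items.
D = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]

def sup(itemset, transactions):
    """Support of an itemset: number of transactions containing it."""
    return sum(1 for t in transactions if itemset <= t)

def strong_rules(Z, transactions, min_conf):
    """Step 2: from a frequent itemset Z, generate rules X => Z\\X with conf > min_conf."""
    rules = []
    items = sorted(Z)
    for r in range(1, len(items)):
        for X in map(frozenset, combinations(items, r)):
            conf = sup(Z, transactions) / sup(X, transactions)
            if conf > min_conf:
                rules.append((set(X), set(Z) - set(X), sup(Z, transactions), conf))
    return rules

# Assuming Z = {a, c} was found frequent in Step 1:
print(strong_rules(frozenset({"a", "c"}), D, min_conf=0.6))
```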
3 Known Itemsets, Bounded Itemsets, and Association Rules In the previous section we outlined how to calculate association rules from a set of all frequent itemsets occurring in the database. Here, we assume that the database is not available to a user, however supports of some itemsets are known to him/her. In the sequel, the itemsets the supports of which are known will be called known itemsets and will be denoted by K. The purpose of the user is to generate as many strong association rules as possible based on known itemsets. The simplest way to generate strong association rules from K is to apply the procedure described as Step 2 in Section 2. The set of derivable association rules may be increased considerably if we know how to construct new itemsets and estimate their supports based on K. Here we remind the approach proposed in [5]: A pair of itemsets Y,Z in K will be called bounding for X, if Y⊆X⊆Z. The itemset X will be called bounded (also called derivable in [5]) in K if there is a pair of bounding itemsets Y, Z∈K for X. The support of any (unknown) itemset X can be estimated by supports of bounding itemsets as follows: min{sup(Y)| Y∈K ∧ Y⊆X} ≥ sup(X) ≥ max{sup(Z)| Z∈K ∧ X⊆Z}.
The set of all bounded itemsets in K will be denoted by BIS(K), that is: BIS(K) = {X⊆I| ∃Y,Z∈K, Y⊆X⊆Z}. Obviously, BIS(K) ⊇ K. Pessimistic support (pSup) and optimistic support (oSup) of an itemset X∈BIS(K) w.r.t. K are defined as follows: pSup(X,K) = max{sup(Z)| Z∈K ∧ X⊆Z}, oSup(X,K) = min{sup(Y)| Y∈K ∧ Y⊆X}. The real support of X∈BIS(K) belongs to [pSup(X,K), oSup(X,K)]. Clearly, if X∈K, then sup(X)=pSup(X,K)=oSup(X,K). In fact, it may happen for a bounded itemset X not present in K that its support can be precisely determined. This will happen when pSup(X,K)=oSup(X,K). Then, sup(X)=pSup(X,K). In the sequel, we will call such itemsets exact bounded ones. The set of all exact bounded itemsets will be denoted by EBIS(K), EBIS(K) = {X∈BIS(K)| pSup(X,K)=oSup(X,K)}. Example 1. Let K = {{ac}[20], {acf}[20], {d}[20], {ad}[20], {cd}[20], {f}[30], {cf}[30], {aef}[15], {def}[15]} (the values provided in square brackets in the subscript denote the supports of itemsets). We note that the itemset {df}∉K is bounded by two known subsets: {d} and {f}, and by the known superset: {def}. Hence, pSup({de}, K) = max{sup({def})} = 15, and oSup({de},K) = min{sup({d}),sup({f})} = 20. In our example, BIS(K) = K ∪ {{de}[15,20], {df}[15,20], {ef}[15,30]} (the values provided in square brackets in the subscript denote pessimistic and optimistic supports of itemsets, respectively). Since no newly derived itemset (i.e. itemset in BIS(K) \ K) has equal pessimistic and optimistic supports, then EBIS(K) = K. Property 4. Let X,Y∈BIS(K) and X⊂Y. Then: a) pSup(X,K) ≥ pSup(Y,K), b) oSup(X,K) ≥ oSup(Y,K). Property 5. Let X∈BIS(K). Then: a) max{pSup(Z,K)| Z∈BIS(K) ∧ X⊆Z} = pSup(X,K), b) min{oSup(Y,K)| Y∈BIS(K) ∧ Y⊆X} = oSup(X,K). Knowing BIS(K) one can induce (approximate) rules X⇒Y provided X∪Y ∈ BIS(K) and X ∈ BIS(K). The pessimistic confidence (pConf) of induced rules is defined as follows: pConf(X⇒Y,K) = pSup(X∪Y,K) / oSup(X,K). The approximate association rules derivable from K are called theory for K and are denoted by T: T(K) = {X⇒Y| X, X∪Y ∈ BIS(K)}.
It is guaranteed for every rule r∈T(K) that its real support is not less than pSup(r,K) and its real confidence is not less than pConf(r,K). (Please, see [5] for the GenTheory algorithm computing T(K)). Now, let us assume that the user is not provided with known itemsets, but with association rules R the support and confidence of which are known. According to [5] the knowledge on rules should be first transformed into the knowledge on itemsets as follows: Let r∈R be a rule under consideration. Then the support of the itemset from which r is built equals to sup(r) and the support of the itemset that is the antecedent of r is equal to sup(r) / conf(r). All such itemsets are defined as known itemsets for R and are denoted by KIS(R), that is KIS(R) = {X∪Y| X⇒Y∈R} ∪ {X| X⇒Y∈R}. Now, it is sufficient to extract frequent known itemsets K from KIS(R), and then to compute T(K).
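A small sketch of the bounding computation above (my own illustrative Python, not code from the paper): given the known itemsets K with their supports, it returns the pessimistic and optimistic supports of a candidate itemset and the pessimistic confidence of an induced rule.

```python
# Known itemsets K: frozenset -> support (the data of Example 1).
K = {frozenset("ac"): 20, frozenset("acf"): 20, frozenset("d"): 20,
     frozenset("ad"): 20, frozenset("cd"): 20, frozenset("f"): 30,
     frozenset("cf"): 30, frozenset("aef"): 15, frozenset("def"): 15}

def p_sup(X, K):
    """Pessimistic support: maximal support of a known superset of X (None if unbounded above)."""
    sups = [s for Z, s in K.items() if X <= Z]
    return max(sups) if sups else None

def o_sup(X, K):
    """Optimistic support: minimal support of a known subset of X (None if unbounded below)."""
    sups = [s for Y, s in K.items() if Y <= X]
    return min(sups) if sups else None

def p_conf(X, Y, K):
    """Pessimistic confidence of the rule X => Y: pSup(X u Y, K) / oSup(X, K)."""
    return p_sup(X | Y, K) / o_sup(X, K)

X = frozenset("df")                                # bounded by {d}, {f} below and {def} above
print(p_sup(X, K), o_sup(X, K))                    # 15 20
print(p_conf(frozenset("d"), frozenset("f"), K))   # pConf(d => f) = 15 / 20 = 0.75
```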
4 Deriving Unbounded Itemsets In this section we will investigate if and under which conditions the given set of known itemsets K can be augmented by itemsets that are not bounded in K. Let us start with the two fundamental propositions: Proposition 1. Let X,Y,Z⊆I such that Z,Y⊇X and sup(X)=sup(Z). Then: Y’ ⊆ Y ⊆ Y” and sup(Y’) = sup(Y) = sup(Y”), where Y’ = X∪(Y\Z), Y” = Y∪Z. Proof: Let X,Y,Z⊆I, Z,Y⊇X, sup(X) = sup(Z) and V = Y\Z. By Property 3, X∪V ⊆ Z∪V and sup(X∪V) = sup(Z∪V). Since Y⊇X and V = Y\Z, then X∪V = X∪(Y\Z) ⊆ Y ⊆ Y∪Z = Z∪(Y\Z) = Z∪V and sup(X∪(Y\Z)) = sup(Y∪Z). Now, by Property 2, sup(Y) = sup(X∪(Y\Z)). Proposition 1 states that each itemset Y in K can be shrunken into subset Y’ of the same support as sup(Y) and extended into superset Y” of the same support as sup(Y), if there is a pair of itemsets X,Z∈K, such that Z,Y⊇X and sup(X)=sup(Z). Example 2. Let X = {f}[30], Z = {cf}[30], Y = {acf}[20], Y’ = {aef}[15]. Then, by Proposition 1, we can shrink Y into new exact subset X∪(Y\Z) = {af}[20], and we can extend Y’ into new exact superset Y’∪Z = {acef}[15]. The proposition below shows how to perform even stronger itemset shrinking than that proposed in Proposition 1. Proposition 2. Let X,Y,Z⊆I such that Z,Y⊇X and sup(X)=sup(Z). ∀V∈K such that V⊆Y, sup(Y)=sup(V), the following holds: Y’ ⊆ Y and sup(Y’) = sup(Y), where Y’ = X∪(V\Z).
Proof: Let X,Y,V,Z⊆I, Z,Y⊇X, sup(X)=sup(Z), V⊆Y, sup(Y)=sup(V). As X∪V ⊇ X, then by Proposition 1, X∪((X∪V)\Z) ⊆ X∪V and sup(X∪((X∪V)\Z)) = sup(X∪V). In addition, by Property 3, X∪V ⊆ X∪Y and sup(X∪V) = sup(X∪Y). Taking into account that, (X∪V)\Z = V\Z (since Z⊇X) and X∪Y = Y (since Y⊇X), we obtain: X∪(V\Z) = X∪((X∪V)\Z) ⊆ X∪V ⊆ X∪Y = Y and sup(X∪(V\Z)) = sup(X∪((X∪V)\Z)) = sup(X∪V) = sup(X∪Y) = sup(Y). Proposition 2 states that each itemset Y in K can be shrunken into subset Y’=X∪(V\Z) of the same support as sup(Y), if X,Z,V∈K and Z,Y⊇X, sup(X)=sup(Z), V⊆Y, sup(Y)=sup(V). The extended and shrunken itemsets w.r.t. K will be denoted by EIS(K) and SIS(K), respectively and are defined as follows: EIS(K) = {Y∪Z| ∃X,Y,Z∈K such that Z,Y⊇X and sup(X)=sup(Z)}, SIS(K) = {X∪(V\Z)| ∃X,Y,V,Z∈K such that Z,Y⊇X, sup(X)=sup(Z), V⊆Y and sup(Y)=sup(V)}. The support of each shrunken and extended itemset equals to the support of some known itemset in K. However, K augmented with such shorter and longer itemsets derived by EIS(K) and SIS(K) will bound greater number of itemsets than BIS(K), i.e., EBIS(K∪EIS(K)∪SIS(K))⊇EBIS(K) and BIS(K∪EIS(K)∪SIS(K))⊇BIS(K). Example 3. Let K be the known itemsets from Example 1. Then newly derived itemsets (EIS(K) ∪ SIS(K)) \ K = {{af}[20], {acd}[20], {acef}[15], {adef}[15], {cdef}[15]}. If (EIS(K) ∪ SIS(K)) \ K ≠ ∅, then further augmentation of (exact) bounded itemsets is feasible. In order to obtain maximal knowledge on itemsets we should apply the operators EIS and SIS as many times as no new exact bounded itemsets can be derived. * By EBIS (K) we will denote all exact bounded itemsets that can be derived from K by multiple use of the EIS and SIS operators. More formally, *
EBIS*(K) = Ek(K), where
• E1(K) = EBIS(K),
• En(K) = EBIS(En-1(K) ∪ EIS(En-1(K)) ∪ SIS(En-1(K))), for n ≥ 2,
• k is the least value of n such that En(K) = En+1(K).
Example 4. Let K be the known itemsets from Example 1 and K’ = EBIS*(K) (see Fig. 1 for K and K’). In our example, the number of known itemsets increased from 9 itemsets in K to 23 exact itemsets in K’. Some of the newly derived itemsets could be found as bounded earlier (e.g. {de}[15,20], {df}[15,20] ∈ BIS(K)); however, the process of shrinking and extending itemsets additionally enabled precise determination of their supports (e.g. sup({de})=15, sup({df})=20).
Fig. 1. All exact itemsets K’ = EBIS*(K) derivable from the known itemsets K (underlined in the figure) by multiple use of the SIS and EIS operators; closed itemsets and generators in K’ are bolded. K’ consists of: {acdef}[15]; {acdf}[20], {acef}[15], {adef}[15], {cdef}[15]; {acf}[20], {acd}[20], {adf}[20], {cdf}[20], {ace}[15], {aef}[15], {ade}[15], {def}[15], {cde}[15]; {ac}[20], {af}[20], {ad}[20], {cd}[20], {df}[20], {cf}[30], {de}[15]; {d}[20], {f}[30].
By BIS*(K) we will denote all bounded itemsets that can be derived from K by multiple use of the EIS and SIS operators. More formally, BIS*(K) = BIS(EBIS*(K)).
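The following sketch (my own Python, not the paper's implementation) illustrates the EIS and SIS operators of Propositions 1 and 2 and a simplified fixpoint iteration that augments K with the derived itemsets; the full EBIS* iteration would additionally close the result under exact bounded itemsets, which is omitted here for brevity.

```python
def eis(K):
    """Extended itemsets: Y u Z for X, Y, Z in K with X a subset of both Y and Z and
    sup(X) = sup(Z); by Proposition 1, sup(Y u Z) = sup(Y)."""
    new = {}
    for X, sx in K.items():
        for Z, sz in K.items():
            if X <= Z and sx == sz:
                for Y, sy in K.items():
                    if X <= Y:
                        new[Y | Z] = sy
    return new

def sis(K):
    """Shrunken itemsets: X u (V \\ Z) for X, V, Y, Z in K with X a subset of Y and Z,
    sup(X) = sup(Z), V a subset of Y and sup(V) = sup(Y); by Proposition 2 the support is sup(Y)."""
    new = {}
    for X, sx in K.items():
        for Z, sz in K.items():
            if X <= Z and sx == sz:
                for Y, sy in K.items():
                    if X <= Y:
                        for V, sv in K.items():
                            if V <= Y and sv == sy:
                                new[X | (V - Z)] = sy
    return new

def augment(K):
    """Naive O(|K|^4) fixpoint: add extended and shrunken itemsets until nothing new appears."""
    current = dict(K)
    while True:
        candidates = {**eis(current), **sis(current)}
        added = {I: s for I, s in candidates.items() if I not in current}
        if not added:
            return current
        current.update(added)
```

On the K of Example 1, augment(K) produces, among others, {af}[20] (shrinking {acf} with X={f}, Z={cf}) and {acef}[15] (extending {aef} with the same X, Z), as in Example 2.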
5 Lossless Representation of Known and Bounded Itemsets
In this section we offer a concise representation of known and bounded itemsets in K that allows deriving each item in BIS(K) and determining its pessimistic and optimistic support without error. In our approach we follow the idea of applying the concepts of closed itemsets and generators of itemsets. We will however extend these concepts in order to cover the additional aspect of possible imprecision of support determination for bounded itemsets, which was not considered so far in the context of closures and generators. A closure of an itemset X∈BIS(K) in K is defined to be a maximal (w.r.t. set inclusion) known superset of X that has the same support as pSup(X,K). The set of all closures of X in K is denoted by γ(X,K), that is:
γ(X,K) = MAX{Y∈K| Y⊇X ∧ sup(Y)=pSup(X,K)}. Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K). The union of closures of the itemsets B in K will be denoted by C(B,K), that is: C(B,K) = ∪X∈B γ(X,K). For B=K, C(B,K) will be denoted briefly by C(K). An itemset X∈K is defined closed in K iff γ(X,K)={X}. A generator of an itemset X∈BIS(K) in K is defined to be a minimal (w.r.t. set inclusion) known subset of X that has the same support as oSup(X,K). The set of all generators of X in K is denoted by G(X,K), that is: G(X,K) = MIN{Y∈K| Y⊆X ∧ sup(Y)=oSup(X,K)}. Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K). The union of generators of the itemsets B in K will be denoted by G(B,K), that is:
G(B,K) = ∪X∈B G(X,K). For B=K, G(B,K) will be denoted briefly by G(K). An itemset X∈K is defined a key generator in K iff G(X,K)={X}. Example 5. Let K’ be the set of exact itemsets in Fig. 1. Let X = {acd} (note {acd}∈K’). Then: γ(X,K’) = {{acdf}[20]} and G(X,K’) = {{ac}[20], {d}[20]}. Now, let X’={ef} (note {ef}∈BIS(K’)\K’). Then, pSup(X’,K’) = 15, oSup(X’,K’) = 30. Thus: γ(X’,K’) = {{acdef}[15]}, G(X’,K’) = {{f}[30]}. The union of closures of itemsets in K’: C(K’) = {{cf}[30], {acdf}[20], {acdef}[15]} and the union of generators of itemsets in K’: G(K’) = {{f}[30], {d}[20], {ac}[20], {af}[20], {de}[15], {ace}[15], {aef}[15]} (see Fig. 1). Thus, G(K’) ∪ C(K’) consists of 10 itemsets out of 23 itemsets present in K’. Clearly, G(K) ∪ C(K) ⊆ K. Supports of known itemsets can be determined either from supports of closures in K or from supports of generators in K as follows: Property 6. Let X∈K. Then: a) sup(X) = max{sup(Z)| Z∈C(K) ∧ X⊆Z}, b) sup(X) = min{sup(Y)| Y∈G(K) ∧ Y⊆X}. Proof: Ad. a) For each Z∈γ(X,K): Z is a known superset of X in C(K) and sup(Z)=pSup(X,K)=sup(X). On the other hand, supports of known supersets of X are not greater than sup(X) (by Property 1). Thus, sup(X) = max{sup(Z)| Z∈C(K) ∧ X⊆Z}. Ad. b) Analogous to that for the case a. The next property states that the pessimistic (optimistic) support of a bounded itemset is the same when calculated both w.r.t. K and w.r.t. C(K) (w.r.t. G(K)). In addition, the same closures (generators) of a bounded itemset will be found both in K and in C(K) (in G(K)). Property 7. Let X∈BIS(K). Then: a) pSup(X,K) = pSup(X, C(K)), b) γ(X,K) = γ(X, C(K)), c) oSup(X,K) = oSup(X, G(K)), d) G(X,K) = G(X, C(K)). Proof: Ad. a) pSup(X,K) = max{sup(Z)| Z∈K ∧ X⊆Z} = /* by Property 6a */ = max{sup(Z)| Z∈C(K) ∧ X⊆Z} = pSup(X, C(K)). Ad. b) By definition, γ(X,K) = MAX{Y∈K| Y⊇X ∧ sup(Y)=pSup(X,K)} = /* by Property 7a */ = MAX{Y∈K| Y⊇X ∧ sup(Y)=pSup(X,C(K))} = MAX{Y∈C(K)| Y⊇X ∧ sup(Y)=pSup(X,C(K))} = γ(X,C(K)). Ad. c-d) Analogous to those for the cases a-b, respectively. The lemma below states that each closure in K is a closed itemset and each generator in K is a key generator. Lemma 1. Let X∈BIS(K). Then: a) If Y∈γ(X,K), then γ(Y,K)={Y},
b) If Y∈G(X,K), then G(Y,K)={Y}. Proof: Ad. a) Let Y∈γ(X,K). By definition of closure, Y is a maximal known superset of X such that sup(Y)=psup(X,K). Hence, no known proper superset of Y has the same support as Y. In addition, as Y is a known itemset, then sup(Y)=psup(Y,K). Thus we conclude that Y is the only known maximal (improper) superset of Y such that sup(Y)=psup(Y,K). Therefore and by definition of closure, γ(Y,K)={Y}. Ad. b) Analogous to that for the case a. Property 8. a) C(K) = {X∈K| γ(X,K)={X}}, b) G(K) = {X∈K| G(X,K)={X}}. Proof: Ad. a) We will prove an equivalent statement: X∈C(K) iff γ(X,K)={X}. (⇒) By definition of C, X∈C(K) implies ∃Y∈K such that X∈γ(Y,K). Hence by Lemma 1a, γ(X,K)={X}. (⇐) By definition of C, C(K) ⊇ γ(X,K). As γ(X,K) = {X}, then X∈C(K). Ad. b) Analogous to that for the case a. By Property 8, all closures in K are all closed itemsets in K and all generators in K are all key generators in K. Therefore, further on we will use the notions of closed itemsets and closures in K interchangeably. Similarly, we will use the notions of key generators and generators in K interchangeably. The proposition below specifies a way of determining closed itemsets and generators based on supports of known itemsets. Proposition 3. Let X∈K. a) C(K) ={X∈K| ∀Y∈K, if Y⊃X, then sup(X)≠sup(Y)}, b) G(K) ={X∈K| ∀Y∈K, if Y⊂X, then sup(X)≠sup(Y)}. Proof: Ad. a) Let X∈K. By Property 8a, X∈C(K) iff γ(X,K) = {X} iff MAX{Y∈K| Y⊇X ∧ sup(Y)=pSup(X,K)} = {X} iff MAX{Y∈K| Y⊇X ∧ sup(Y)=sup(X)} = {X} iff ∀Y∈K, if Y⊃X, then sup(X)≠sup(Y). Ad. b) Analogous to that for the case a. The lemma below states an interesting fact, that the union of closures (generators) of any subset of bounded itemsets containing all known itemsets is equal to the closed itemsets (key generators) in K. Lemma 2. Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K). a) C(B,K) = C(K), b) G(B,K) = G(K). Proof: Ad. a) C(B,K) =
∪X∈B γ(X,K) ⊇ ∪X∈K γ(X,K) = C(K). On the other hand,
C(B,K) = ∪X∈B γ(X,K) = {Y∈K| X∈B, Y∈γ(X,K)} = /* by Lemma 1a */ = {Y∈K| X∈B, Y∈γ(X,K), γ(Y,K)={Y}} ⊆ {Y∈K| γ(Y,K)={Y}} = /* by Property 8a */ = C(K). Since C(B,K) ⊇ C(K) and C(B,K) ⊆ C(K), then C(B,K) = C(K). Ad. b) Analogous to that for the case a.
The immediate conclusion from Lemma 2 is that the union of closures (generators) of all exact bounded itemsets and all bounded itemsets is equal to the closed itemsets (key generators) in K. Proposition 4. a) C(EBIS(K)) = C(K), b) C(BIS(K)) = C(K), c) C(C(K)) = C(K), d) G(EBIS(K)) = G(K), e) G(BIS(K)) = G(K), f) G(G(K)) = G(K). Proof: Ad. a,b) Immediate by Lemma 2a. Ad. c) By Property 8a, C(K) = {X∈K| γ(X,K)={X}}. Now, the following equation is trivially true: C(K) = {X∈C(K)| γ(X,K)={X}}. In addition, by Property 7b, we have γ(X, C(K))=γ(X, K) for any X∈C(K). Hence, C(K) = {X∈C(K)| γ(X, C(K))={X}} = /* by Property 8a */ = C(C(K)). Ad. d-f) Analogous to those for the cases a-c, respectively. Proposition 5 claims that each itemset in BIS(K) is bounded by a subset that is a key generator in K and by a superset that is a closed itemset in K. Proposition 5. a) BIS(K) = {X⊆I| ∃Y∈G(K), Z∈C(K), Y⊆X⊆Z}, b) EBIS(K) = {X⊆I| ∃Y∈G(K), Z∈C(K), Y⊆X⊆Z, pSup(X,K)=oSup(X,K)}. Proof: Ad. a) Let X⊆I. By definition of bounded itemsets, X∈BIS(K) iff ∃Y,Z∈K, Y⊆X⊆Z iff ∃Y,Z∈K such that Y⊆X⊆Z and ∃Y’∈G(K), ∃Z’∈C(K) such that Y’∈G(Y,K) ∧ Z’∈γ(Z,K) iff ∃Y,Z∈K, ∃Y’∈G(K), ∃Z’∈C(K), such that Y’⊆Y⊆X⊆Z⊆Z’ iff /* by the facts: G(K)⊆K and C(K)⊆K */ ∃Y’∈G(K), ∃Z’∈C(K), such that Y’⊆X⊆Z’. Ad. b) Immediate by definition of EBIS and Proposition 5a. By Proposition 5 and Property 7a,c, the pair (G(K),C(K)) allows determination of all exact bounded itemsets and their supports as well as determination of all bounded itemsets and their pessimistic and optimistic supports. Hence, (G(K),C(K)) is a lossless representation of EBIS(K) and BIS(K). In the sequel, the union of G(K) and C(K) will be denoted by R(K), that is: R(K) = G(K) ∪ C(K). Proposition 6. a) C(R(K)) = C(K), b) G(R(K)) = G(K), c) R(R(K)) = R(K), d) R(EBIS(K)) = R(K), e) R(BIS(K)) = R(K), f) EBIS(K) = EBIS(R(K)), g) BIS(K) = BIS(R(K)).
Proof: Ad. a) By definition of C, C(C(K)) = ∪{γ(X, C(K))| X∈C(K)}, C(R(K)) =
∪{γ(X, R(K))| X∈R(K)}, and C(K) = ∪{γ(X,K)| X∈K}. Since, C(K) ⊆ R(K) ⊆ K and γ(X, C(K)) = γ(X, R(K)) = γ(X, K), then C(C(K)) ⊆ C(R(K)) ⊆ C(K). However, we have C(C(K)) = C(K) (by Proposition 4c). Hence, C(R(K)) = C(K). Ad. b). Analogous to that for the case a. Ad. c). R(R(K)) = G(R(K)) ∪ C(R(K)) = /* by Proposition 6a-b */ = G(K) ∪ C(K) = R(K). Ad. d). R(EBIS(K)) = G(EBIS(K)) ∪ C(EBIS(K)) = /* by Proposition 4a,d */ = G(K) ∪ C(K) = R(K). Ad. e). Follows by Proposition 4b,e; analogous to that for the case d. Ad. f). By Proposition 5b. Ad. g). By Proposition 5a. As follows from Proposition 6a-b, G(K) and C(K) are determined uniquely by R(K), hence R(K) is a lossless representation of EBIS(K) and BIS(K). Clearly, if G(K) ∩ C(K) ≠ ∅, then R(K) is less numerous than the (G(K),C(K)) representation.
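A short sketch (my own Python illustration, not from the paper) of Proposition 3: the closed itemsets and key generators of K, and hence the representation R(K), can be read off directly from the supports of the known itemsets.

```python
def closed_itemsets(K):
    """C(K): known itemsets with no known proper superset of equal support (Proposition 3a)."""
    return {X: s for X, s in K.items()
            if not any(X < Y and s == sy for Y, sy in K.items())}

def key_generators(K):
    """G(K): known itemsets with no known proper subset of equal support (Proposition 3b)."""
    return {X: s for X, s in K.items()
            if not any(Y < X and s == sy for Y, sy in K.items())}

def representation(K):
    """R(K) = G(K) u C(K): a lossless representation of EBIS(K) and BIS(K)."""
    return {**key_generators(K), **closed_itemsets(K)}
```

Applied to the 23 exact itemsets K’ of Fig. 1, representation(K’) returns the 10 itemsets listed in Example 5.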
6 Lossless Representation of Derivable Itemsets
Anticipating that the concise lossless representation of BIS*(K) and EBIS*(K) can be significantly less numerous than the corresponding (exact) bounded itemsets, we will investigate how to determine BIS*(K) and EBIS*(K) efficiently by manipulating mainly the concise lossless representation of known itemsets. Let K’ be the set of known itemsets augmented by extending and shrinking, i.e. K’ = K ∪ EIS(K) ∪ SIS(K). The proposition below claims that seeking for C(K’) among shrunken itemsets SIS(K) is useless. Similarly, seeking for G(K’) among extended itemsets EIS(K) is useless. Proposition 7. a) C(K ∪ EIS(K) ∪ SIS(K)) = C(K ∪ EIS(K)), b) G(K ∪ EIS(K) ∪ SIS(K)) = G(K ∪ SIS(K)), c) R(K ∪ EIS(K) ∪ SIS(K)) = C(K∪EIS(K)) ∪ G(K∪SIS(K)). Proof: Ad. a) Let K’ = K∪EIS(K)∪SIS(K) and K” = K∪EIS(K). By Proposition 2, each itemset Y’∈SIS(K) is a subset of some itemset Y∈K and sup(Y’)=sup(Y). Hence, if Y’⊂Y, then Y’ is neither closed in K’ nor in K” (by Proposition 3a). If however, Y’=Y, then Y’∈K. Hence, itemsets in SIS(K) either are not closed in K’ and in K” or belong to K. Ad. b) Follows from the definition of EIS(K), Proposition 1 and Proposition 3b; can be proved analogously to the proof for the case a. Ad. c) Follows immediately from Proposition 7a-b. Now, we will address the following issue: Is the representation R(K ∪ EIS(K) ∪ SIS(K)) derivable from the representation R(K) without referring to the knowledge on supports of itemsets in K \ R(K)?
The lemma below states that the closed itemsets in the union of known itemsets K1 and K2 are equal to the closed itemsets in the union of C(K1) and C(K2), and the key generators in K1 ∪ K2 are equal to the key generators in the union of G(K1) and G(K2). Lemma 3. Let K1,K2 be subsets of known itemsets. a) C(K1 ∪ K2) = C(C(K1) ∪ C(K2)), b) G(K1 ∪ K2) = G(G(K1) ∪ G(K2)). Proof: In Appendix.
Lemma 3 implies that the knowledge on closed itemsets and generators of known, extended and shrunken itemsets can be directly applied for determining C(K∪EIS(K)) and G(K∪SIS(K)). Corollary 1. a) C(K ∪ EIS(K)) = C(C(K) ∪ C(EIS(K))), b) G(K ∪ SIS(K)) = G(G(K) ∪ G(SIS(K))). Proof: Ad. a-b) Follow immediately from Lemma 3a,b, respectively.
Thus, by Proposition 7c and Corollary 1, R(K ∪ EIS(K) ∪ SIS(K)) is directly derivable from the representation R(K) and the sets: C(EIS(K)), G(SIS(K)). The proposition below states further that the two sets: C(EIS(K)) and G(SIS(K)) can also be determined directly from R(K). Proposition 8. a) C(EIS(K)) = C({Y∪Z| ∃X∈G(K), ∃Y,Z∈C(K) such that Z,Y⊇X and sup(X)=sup(Z)}), b) C(EIS(K)) = C(EIS(R(K))), c) G(SIS(K)) = G({X∪(V\Z)| ∃X,V∈G(K), ∃Y,Z∈C(K) such that Z,Y⊇X, sup(X)=sup(Z), V⊆Y and sup(Y)=sup(V)}), d) G(SIS(K)) = G(SIS(R(K))). Proof: In Appendix.
Hence, we conclude that R(K ∪ EIS(K) ∪ SIS(K)) is directly derivable from R(K); the knowledge on itemsets in K \ R(K) is superfluous. Let us define the operator R* as follows:
R*(K) = Rk(K), where
• R1(K) = R(K),
• Rn(K) = R(Rn-1(K) ∪ EIS(Rn-1(K)) ∪ SIS(Rn-1(K))), for n ≥ 2,
• k is the least value of n such that Rn(K) = Rn+1(K).
R* iteratively transforms only the concise lossless representation of known itemsets by means of the EIS and SIS operators. The lemma below shows a direct correspondence between the auxiliary operators Rk (used for defining R*(K)) and Ek (used for defining EBIS*(K) in Section 4), namely that Rk is a concise lossless representation of Ek.
Lemma 4. a) Ek(K) = EBIS(Rk(K)), b) R(Ek(K)) = Rk(K). Proof: In Appendix.
The immediate conclusion from Lemma 4 is that R*(K) is a lossless representation of EBIS*(K).
Proposition 9. a) EBIS*(K) = EBIS(R*(K)), b) BIS*(K) = BIS(R*(K)), c) R(EBIS*(K)) = R*(K), d) R(BIS*(K)) = R*(K).
Proof: Ad. a) Follows immediately by Lemma 4a. Ad. b) BIS*(K) = /* by definition */ = BIS(EBIS*(K)) = /* by Proposition 9a */ = BIS(EBIS(R*(K))). Let K’ = R*(K). Then, BIS*(K) = BIS(EBIS(K’)) = /* by Proposition 6g */ = BIS(R(EBIS(K’))) = /* by Proposition 6d */ = BIS(R(K’)) = BIS(R(R*(K))) = /* by definition of R* */ = BIS(R*(K)). Ad. c) Follows immediately by Lemma 4b. Ad. d) R(BIS*(K)) = /* by Proposition 9b */ = R(BIS(R*(K))) = /* by Proposition 6e */ = R(R*(K)) = /* by definition of R* */ = R*(K).
Proposition 9 states that R*(K) is not only a lossless representation of EBIS*(K), but also a lossless representation of BIS*(K).
7 Conclusions
In this paper we proposed how to generate a maximal amount of knowledge in the form of frequent patterns and association rules from a given set of known itemsets or association rules. Unlike the earlier work in [5], we enabled the generation of patterns that are not bounded by the given sample of known itemsets. We determined the cases when the original set of known itemsets can be augmented with new patterns that are supersets (EIS(K)) and subsets (SIS(K)) of the known itemsets K, such that their support can be precisely determined. The procedure of augmenting the set of known or exact derivable itemsets can be repeated until all exact itemsets are discovered. In order to avoid superfluous calculations, we introduced the lossless concise representation (R(K)) of a given sample of itemsets. We proposed a method of performing the pattern augmentation procedure as an iterative transformation of a concise lossless pattern representation into new ones. When no changes occur as a result of a consecutive transformation, the obtained representation is considered final (R*(K)). We proved that all derivable bounded (BIS*(K)) and exact bounded (EBIS*(K)) patterns can be derived by bounding only the itemsets present in the final representation.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In: Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C. (1993) 207-216
2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining. AAAI, Menlo Park, California (1996) 307-328
3. Kryszkiewicz, M.: Representative Association Rules. In: Proc. of PAKDD ’98, Melbourne, Australia. LNAI 1394. Springer-Verlag (1998) 198-209
4. Kryszkiewicz, M.: Mining with Cover and Extension Operators. In: Proc. of PKDD ’00, Lyon, France. LNAI 1910. Springer-Verlag (2000) 476-482
5. Kryszkiewicz, M.: Inducing Theory for the Rule Set. In: Proc. of RSCTC ’00, Banff, Canada (2000) 353-360
6. Kryszkiewicz, M.: Concise Representation of Frequent Patterns Based on Disjunction-free Generators. In: Proc. of ICDM ’01. IEEE Computer Society Press (2001)
7. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems 24 (1999) 25-46
Appendix: Proofs In the appendix we will prove Lemma 3, Proposition 8, and Lemma 4. In the proof of Lemma 3, we will apply the beneath lemma. Lemma 5. a) ∀Y∈K, if Y⊃X, then sup(X)≠sup(Y) iff ∀Y∈C(K), if Y⊃X, then sup(X)≠sup(Y). b) ∀Y∈K, if Y⊂X, then sup(X)≠sup(Y) iff ∀Y∈G(K), if Y⊂X, then sup(X)≠sup(Y). Proof: Ad. a) (⇒) Trivial as K ⊇ C(K). (⇐) (by contradiction). Let ∀Y∈C(K), if Y⊃X, then sup(X)≠sup(Y). Let Y’∈K be an itemset such that Y’⊃X and sup(X)=sup(Y’). Let Y∈γ(Y’,K). Then, Y∈C(K), Y’⊆Y and sup(Y)=pSup(Y’,K)=sup(Y’). Hence, Y⊇Y’⊃X and sup(Y)=sup(Y’)=sup(X), which contradicts the assumption. Ad. b) Analogous to that for the case a.
Proof of Lemma 3: Ad. a) Let W(X,Y) denote condition: Y⊃X implies sup(X)≠sup(Y). By Proposition 3a: • C(K1∪K2) = {X∈K1∪K2| ∀Y∈K1∪K2, W(X,Y)} = {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} ∪ {X∈K2 | ∀Y∈K1∪K2, W(X,Y)}. Now, {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} = {X∈K1 | ∀Y∈K1, W(X,Y)} ∩ {X∈K1 | ∀Y∈K2, W(X,Y)} = /* by Proposition 3a */ = C(K1) ∩ {X∈K1 | ∀Y∈K2, W(X,Y)} = /* by Lemma 5a */ = C(K1) ∩ {X∈K1 | ∀Y∈C(K2), W(X,Y)} = C(K1) ∩ {X∈C(K1)| ∀Y∈C(K2), W(X,Y)} = /* by Proposition 4c */ C(C(K1)) ∩ {X∈C(K1)| ∀Y∈C(K2), W(X,Y)} = /* by Proposition 3a */ = {X∈C(K1)| ∀Y∈C(K1), W(X,Y)} ∩ {X∈C(K1)| ∀Y∈C(K2), W(X,Y)} =
{X∈C(K1)| ∀Y∈C(K1)∪C(K2), W(X,Y)}. Thus, we proved that: (*) {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} = {X∈C(K1)| ∀Y∈C(K1)∪C(K2), W(X,Y)}. Similarly, one can prove that: (**) {X∈K2 | ∀Y∈K1∪K2, W(X,Y)} = {X∈C(K2)| ∀Y∈C(K1)∪C(K2), W(X,Y)}. By (*) and (**) we obtain: C(K1∪K2) = {X∈C(K1)| ∀Y∈C(K1)∪C(K2), W(X,Y)} ∪ {X∈C(K2)| ∀Y∈C(K1)∪C(K2), W(X,Y)} = {X∈C(K1)∪C(K2)| ∀Y∈C(K1)∪C(K2), W(X,Y)} = /* by Proposition 3a */ = C(C(K1)∪C(K2)). Ad. b) Analogous to that for the case a.
Proof of Proposition 8: Ad. a) By definition, an itemset W∈EIS(K) provided W=Y∪Z, where Y,Z∈K, and ∃X∈K, such that Z,Y⊇X and sup(X)=sup(Z). Let us assume X,Y,Z,W are such itemsets and in addition W∈C(EIS(K)). We will show that there are itemsets X’∈G(K), and Y’,Z’∈C(K) such that Z’,Y’⊇X’ and sup(X’)=sup(Z’) and W=Y’∪Z’. Let X’∈G(X,K), Y’∈γ(Y,K), Z’∈γ(Z,K). Hence, X’∈G(K), Y’,Z’∈C(K), X’⊆X, Y’⊇Y, Z’⊇Z, and sup(X’)=sup(X)=sup(Z)=sup(Z’), sup(Y’)=sup(Y). Now, since Z,Y⊇X, then we conclude Z’,Y’⊇X’. We deduce further: Y’∪Z’∈EIS(K). In addition, W = Y∪Z ⊆ Y∪Z’ ⊆ Y’∪Z’ and sup(W) = sup(Y∪Z) = /* by Property 3 */ = sup(Y∪Z’) = /* by Property 3 */ = sup(Y’∪Z’). Hence, W ⊆ Y’∪Z’ and sup(W) = sup(Y’∪Z’). Since W∈C(EIS(K)), then by Proposition 3a: ∀V∈EIS(K), V⊃W implies sup(X)≠sup(Y)}. Thus, there is no proper superset of W in EIS(K) the support of which would be equal to sup(W). This implies W = Y’∪Z’. As W was chosen arbitrarily, we proved that any closed itemset in C(EIS(K)) can be built solely from R(K). Ad. b) Follows from (the proof of) Proposition 8a. Ad. c-d) Analogous to those for the cases a-b, respectively.
Proof of Lemma 4. Ad. a) (by induction) Let k=1. • E1(K) = EBIS(K) = /* by Proposition 6f */ = EBIS(R(K)) = EBIS(R1(K)). Hence, the lemma is satisfied for k=1. Now, we will apply the following induction hypothesis for k>1: For every i
G( G(K’) ∪ G(SIS(R(K’))) ) = /* by Proposition 8d */ = G( G(K’)) ∪ G(SIS(K’)) ) = /* by Lemma 3b */ = G( K’ ∪ SIS(K’) ) = G( Rk-1(K)) ∪ SIS(Rk-1(K)) ). Thus, we proved that: (*) G( Ek-1(K) ∪ SIS(Ek-1(K)) ) = G( Rk-1(K) ∪ SIS(Rk-1(K)) ). Anologously, one can prove that: (**) C( Ek-1(K) ∪ EIS(Ek-1(K)) ) = C( Rk-1(K) ∪ EIS(Rk-1(K)) ). By (*) and (**) we obtain: • Ek(K) = EBIS( G( Rk-1(K) ∪ SIS(Rk-1(K)) ) ∪ C( Rk-1(K) ∪ EIS(Rk-1(K)) ) ) = /* by Proposition 7c */ = EBIS(R( Rk-1(K) ∪ SIS(Rk-1(K)) ∪ EIS(Rk-1(K)) )) = EBIS(Rk(K)). Ad. b) R(Ek(K)) = /* by Lemma 4a */ = R(EBIS(Rk(K))) = /* by Proposition 6d */ = * R(Rk(K)) = /* by definition of R */ = Rk(K).
Anytime Possibilistic Propagation Algorithm
Nahla Ben Amor (1), Salem Benferhat (2), and Khaled Mellouli (1)
(1) Institut Supérieur de Gestion de Tunis, {nahla.benamor, khaled.mellouli}@ihec.rnu.tn
(2) Institut de Recherche en Informatique de Toulouse (I.R.I.T), [email protected]
Abstract. This paper proposes a new anytime possibilistic inference algorithm for min-based directed networks. Our algorithm departs from a direct adaptation of probabilistic propagation algorithms since it avoids the transformation of the initial network into a junction tree, which is known to be a hard problem. The proposed algorithm is composed of several local stabilization procedures. Stabilization procedures aim to guarantee that the local distributions defined on each node are coherent with respect to the ones of its parents. We provide experimental results which, for instance, compare our algorithm with the ones based on a direct adaptation of probabilistic propagation algorithms.
1 Introduction
In possibility theory there are two different ways to define the counterpart of Bayesian networks. This is due to the existence of two definitions of possibilistic conditioning: product-based and min-based conditioning [5] [7] [14]. When we use the product form of conditioning, we get a possibilistic network close to the probabilistic one, sharing the same features and having the same theoretical and practical results [4]. However, this is not the case with min-based networks [10] [12]. This paper focuses on min-based possibilistic directed graphs, by proposing a new algorithm for propagating uncertain information in a possibility theory framework. Our algorithm is an anytime algorithm. It is composed of several steps, which progressively get close to the exact possibility degrees of a variable of interest. The first step is to transform the initial possibilistic graph into an equivalent undirected graph. Each node in this graph contains a node from the initial graph and its parents, and will be quantified by their local joint distribution instead of the conditional one. Then, different stability procedures are used in order to guarantee that joint distributions on a given node are in agreement with those of its parents. The algorithm successively applies one-parent stability, two-parents stability, ..., n-parents stability, which respectively check the stability with respect to only one parent, two parents, ..., all parents. We show that the more parents we consider, the better the results, compared with the exact possibility degrees of the variables of interest. We also provide experimental results showing the merits of our algorithm.
Section 2 gives a brief background on possibility theory. Section 3 introduces min-based possibilistic graphs and briefly recalls a standard propagation algorithm. Section 4 presents our new algorithm. Section 5 considers the case where new evidence is taken into account. Lastly, Section 6 gives some experimental results.
2 Basics of Possibility Theory
Let V = {A1, A2, ..., AN} be a set of variables. We denote by DA = {a1, .., an} the domain associated with the variable A. By a we denote any instance of A. Ω = ×Ai∈V DAi denotes the universe of discourse, which is the Cartesian product of all variable domains in V. Each element ω ∈ Ω is called a state of Ω. ω[A] denotes the instance in ω of the variable A. In the following, we only give a brief reminder on possibility theory; for more details see [7]. A possibility distribution π is a mapping from Ω to the interval [0, 1]. It represents a state of knowledge about a set of possible situations, distinguishing what is plausible from what is less plausible. Given a possibility distribution π defined on the universe of discourse Ω, we can define a mapping grading the possibility measure of an event φ ⊆ Ω by Π(φ) = max{π(ω) : ω ∈ φ}. A possibility distribution π is said to be α-normalized if h(π) = max{π(ω) : ω ∈ Ω} = α. If α = 1, then π is simply said to be normalized. h(π) is called the consistency degree of π.
Possibilistic conditioning: In the possibilistic setting, conditioning consists in modifying our initial knowledge, encoded by a possibility distribution π, by the arrival of a new sure piece of information φ ⊆ Ω. In possibility theory there are two well-known definitions of conditioning:
- min-based conditioning, proposed in an ordinal setting [7] [14]:
Π(ψ | φ) = Π(ψ ∧ φ) if Π(ψ ∧ φ) < Π(φ), and 1 otherwise.   (1)
- product-based conditioning, proposed in a numerical setting, which is a direct counterpart of probabilistic conditioning:
Π(ψ |p φ) = Π(ψ ∧ φ) / Π(φ) if Π(φ) ≠ 0, and 0 otherwise.   (2)
Possibilistic independence: There are several definitions of independence relations in the possibilistic framework [2] [5] [11]. In particular, two definitions have been used in the perspective of possibilistic networks.
- the Non-Interactivity independence [16], defined by:
Π(x ∧ y | z) = min(Π(x | z), Π(y | z)), ∀x, y, z.   (3)
- the Product-based independence relation, defined by:
Π(x |p y ∧ z) = Π(x |p z), ∀x, y, z.   (4)
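As a concrete illustration of the two conditioning rules (1) and (2), here is a small Python sketch of my own (not code from the paper), with a possibility distribution stored as a dictionary over states and applied state-wise:

```python
def poss(event, pi):
    """Possibility measure: Pi(event) = max of pi over the states in the event."""
    return max((pi[w] for w in event), default=0.0)

def min_conditioning(pi, phi):
    """Min-based conditioning (equation (1)): best states of phi are raised to 1."""
    p_phi = poss(phi, pi)
    return {w: (1.0 if w in phi and pi[w] == p_phi else (pi[w] if w in phi else 0.0))
            for w in pi}

def product_conditioning(pi, phi):
    """Product-based conditioning (equation (2)): states of phi are rescaled by 1/Pi(phi)."""
    p_phi = poss(phi, pi)
    return {w: (pi[w] / p_phi if w in phi and p_phi != 0 else 0.0) for w in pi}

# Example: states are pairs (A, B); phi is the event "A = a1".
pi = {("a1", "b1"): 0.7, ("a1", "b2"): 0.4, ("a2", "b1"): 1.0, ("a2", "b2"): 0.2}
phi = {w for w in pi if w[0] == "a1"}
print(min_conditioning(pi, phi))      # best a1-state raised to 1, other a1-state kept, a2-states set to 0
print(product_conditioning(pi, phi))  # a1-states divided by Pi(phi) = 0.7
```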
3 Min-Based Possibilistic Graphs
This section defines min-based possibilistic graphs and briefly recalls the direct adaptation of the probabilistic propagation algorithm in a possibility theory framework.
3.1 Basic Definitions
A min-based possibilistic graph over a set of variables V, denoted by ΠG, consists of:
- a graphical component, which is a DAG (Directed Acyclic Graph) where nodes represent variables and edges encode the links between the variables. The parent set of a node A is denoted by UA.
- a numerical component, which quantifies the different links. For every root node A (UA = ∅), uncertainty is represented by the a priori possibility degree Π(a) of each instance a ∈ DA, such that maxa Π(a) = 1. For the rest of the nodes (UA ≠ ∅), uncertainty is represented by the conditional possibility degree Π(a | uA) of each instance a ∈ DA and uA ∈ DUA. These conditional distributions satisfy the following normalization condition: maxa Π(a | uA) = 1, for any uA.
The set of a priori and conditional possibility degrees induces a unique joint possibility distribution defined by:
Definition 1 Given the a priori and conditional possibilities, the joint distribution, denoted by πm, is expressed by the following min-based chain rule:
πm(A1, .., AN) = min{Π(Ai | UAi) : i = 1..N}   (5)
Example 1 Let us consider the min-based possibilistic network ΠG composed of the DAG of Figure 1 and the initial distributions given in Tables 1 and 2.
Table 1. Initial distributions
Π(a1) = 1, Π(a2) = 0.9
Π(b1 | a1) = 1, Π(b1 | a2) = 0, Π(b2 | a1) = 0.4, Π(b2 | a2) = 1
Π(c1 | a1) = 0.3, Π(c1 | a2) = 1, Π(c2 | a1) = 1, Π(c2 | a2) = 0.2
These a priori and conditional possibilities encode the joint distribution relative to A, B, C and D using (5) as follows: ∀a, b, c, d, πm(a ∧ b ∧ c ∧ d) = min(Π(a), Π(b | a), Π(c | a), Π(d | b ∧ c)). For instance, πm(a1 ∧ b2 ∧ c2 ∧ d1) = min(1, 0.4, 1, 1) = 0.4. Moreover, we can check that h(πm) = 1.
Table 2. Initial distributions
Π(d1 | b1 ∧ c1) = 1, Π(d1 | b1 ∧ c2) = 1, Π(d1 | b2 ∧ c1) = 1, Π(d1 | b2 ∧ c2) = 1
Π(d2 | b1 ∧ c1) = 1, Π(d2 | b1 ∧ c2) = 0, Π(d2 | b2 ∧ c1) = 0.8, Π(d2 | b2 ∧ c2) = 1
Fig. 1. Example of a Multiply Connected DAG (nodes A, B, C, D; edges A→B, A→C, B→D, C→D)
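The min-based chain rule (5) for this network can be computed directly. The following sketch (illustrative Python of my own, not from the paper) reproduces πm(a1 ∧ b2 ∧ c2 ∧ d1) = 0.4 and h(πm) = 1 from Tables 1 and 2.

```python
from itertools import product

# Conditional possibility distributions of Example 1 (Tables 1 and 2).
P_A = {"a1": 1.0, "a2": 0.9}
P_B = {("b1", "a1"): 1.0, ("b1", "a2"): 0.0, ("b2", "a1"): 0.4, ("b2", "a2"): 1.0}
P_C = {("c1", "a1"): 0.3, ("c1", "a2"): 1.0, ("c2", "a1"): 1.0, ("c2", "a2"): 0.2}
P_D = {("d1", "b1", "c1"): 1.0, ("d1", "b1", "c2"): 1.0, ("d1", "b2", "c1"): 1.0, ("d1", "b2", "c2"): 1.0,
       ("d2", "b1", "c1"): 1.0, ("d2", "b1", "c2"): 0.0, ("d2", "b2", "c1"): 0.8, ("d2", "b2", "c2"): 1.0}

def joint(a, b, c, d):
    """Min-based chain rule (5): the joint degree is the minimum of the local degrees."""
    return min(P_A[a], P_B[(b, a)], P_C[(c, a)], P_D[(d, b, c)])

print(joint("a1", "b2", "c2", "d1"))   # 0.4, as in Example 1

# Consistency degree h(pi_m): maximum of the joint over all states (here 1.0).
print(max(joint(a, b, c, d)
          for a, b, c, d in product(P_A, ["b1", "b2"], ["c1", "c2"], ["d1", "d2"])))
```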
3.2 Possibilistic Propagation in Junction Trees
This section summarizes a direct adaptation of probabilistic propagation algorithm [15] in the possibilistic framework. For more details see [4]. The principle of this propagation method is to transform the initial DAG into a junction tree and then to perform the propagation on this new graph. Given a min-based possibilistic network, the construction of its corresponding junction tree is performed in the same manner as in the probabilistic case [15]. In a first step, the DAG is moralized by adding undirected edges between the parents and by dropping the direction of existing edges. Then, the moral graph is triangulated which means that every cycle of length four or greater contains an edge that connects two non-adjacent nodes in the cycle. It is possible to have different triangulations of a moral graph. In particular, we can simply construct a unique cluster containing all the variables. However such triangulation is not interesting since it does not allow local computations. The task of finding an optimal triangulation is stated as an NP-complete problem [6]. Finally, the triangulated graph is transformed into a junction tree where each node represents a cluster of variables and each edge is labeled with a separator corresponding to the intersection of its adjacent clusters. Once the junction tree is constructed, it will be initialized using the initial conditional distributions and the observed nodes. Then, the propagation process starts via a message passing mechanism between different clusters after choosing an arbitrary cluster to be the pivot of the propagation. Similarly to the probabilistic networks, the message flow is divided into two phases: - a collect-evidence phase in which each cluster passes a message to its neighbor
in the pivot direction, beginning with the clusters farthest from the pivot. - a distribute-evidence phase in which each cluster passes messages to its neighbors away from the pivot direction, beginning with the pivot itself. These two message passes ensure that the potential of each cluster corresponds to its local distribution. Thus, we can compute the possibility measure of any variable of interest by simply marginalizing any cluster potential containing it.
4 Anytime Possibilistic Propagation Algorithm
4.1 Basic Ideas
The product-based possibilistic networks are very close to Bayesian networks since conditioning is defined in the same way in the two frameworks. This is not the case for min-based networks, since the minimum operator has different properties, such as idempotency. Therefore, we propose a new propagation algorithm for such networks which is not a direct adaptation of probabilistic propagation algorithms. In particular, we will avoid the transformation of the initial network into a junction tree. Given a min-based possibilistic network ΠG, the proposed algorithm locally computes, for any instance a of a variable of interest A, the possibility degree Πm(a) inferred from ΠG. Note that computing Πm(a) corresponds to possibilistic inference with no evidence. The more general problem of computing Πm(a | e), where e is the total evidence, is addressed in Section 5. The basic steps of the propagation algorithm are:
– Initialization. Transforms the initial network into an equivalent secondary structure, also called here for simplicity moral graph, composed of clusters of variables obtained by adding to each node its parents. Then quantifies the graph using the initial conditional distributions. Lastly, incorporates the instance a of the variable of interest A.
– One-parent stability. Ensures that any cluster agrees with each of its parents on the distributions defined on common variables.
– n-parents stability. Ensures that any cluster agrees on the distributions defined on common variables computed from 2, 3, .., n parents.
– n-best-parents stability. Ensures that only the best instances in the distribution of each cluster agree with the best instances in the distribution computed from the parent set.
The proposed algorithm is an anytime algorithm since the longer it runs, the closer to the exact marginals we get.
4.2 Initialization
From conditional to local joint distributions. The first step in the initialization procedure is to transform the initial network into an equivalent secondary
structure, also called moral graph, and denoted MG. Moral graphs are obtained by adding to each node its parent set and by dropping the direction of existing edges. Each node in MG is called a cluster and denoted by Ci. Each edge in MG is labeled with the intersection of its clusters Ci and Cj, called separator, and denoted by Sij. Note that, contrary to the classical construction of separators, a link is not necessary between clusters sharing the same parents. The initial conditional distributions are transformed into local joints. Namely, for each cluster Ci of MG, we assign a local joint distribution relative to its variables, called potential and denoted by πCi. We denote by ci and sij the possible instances of the cluster Ci and the separator Sij, respectively. ci[A] denotes the instance in ci of the variable A. The outline of this first phase of the initialization procedure is as follows:
Algorithm 1 From conditional to local joint distributions
Begin
1. Building the moral graph:
- For each variable Ai, form a cluster Ci = {Ai} ∪ UAi
- For each edge connecting two nodes Ai and Aj, form an undirected edge in the moral graph between the clusters Ci and Cj, labeled with a separator Sij corresponding to their intersection.
2. Quantify the moral graph: For each cluster Ci: πCi(Ai ∧ UAi) ← Π(Ai | UAi)
End
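A minimal sketch of Algorithm 1 (my own Python, assuming the network is given by a parent map and conditional possibility tables indexed by (value, parent values); the incorporation of the variable of interest is omitted here):

```python
def build_moral_graph(parents, cond):
    """Algorithm 1: each variable A gives a cluster {A} u U_A quantified by Pi(A | U_A);
    each arc Ai -> Aj gives an undirected edge labeled by the clusters' intersection."""
    clusters = {A: frozenset({A}) | frozenset(parents[A]) for A in parents}
    potentials = {A: dict(cond[A]) for A in parents}     # local joint = conditional table
    edges = {}                                           # (Ai, Aj) -> separator
    for Aj, pa in parents.items():
        for Ai in pa:
            edges[(Ai, Aj)] = clusters[Ai] & clusters[Aj]
    return clusters, potentials, edges

# The network of Example 1.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
cond = {"A": {("a1",): 1.0, ("a2",): 0.9},
        "B": {("b1", "a1"): 1.0, ("b1", "a2"): 0.0, ("b2", "a1"): 0.4, ("b2", "a2"): 1.0},
        "C": {("c1", "a1"): 0.3, ("c1", "a2"): 1.0, ("c2", "a1"): 1.0, ("c2", "a2"): 0.2},
        "D": {("d1", "b1", "c1"): 1.0, ("d2", "b1", "c1"): 1.0, ("d1", "b1", "c2"): 1.0, ("d2", "b1", "c2"): 0.0,
              ("d1", "b2", "c1"): 1.0, ("d2", "b2", "c1"): 0.8, ("d1", "b2", "c2"): 1.0, ("d2", "b2", "c2"): 1.0}}
clusters, potentials, edges = build_moral_graph(parents, cond)
print(clusters["D"])        # frozenset({'B', 'C', 'D'})
print(edges[("B", "D")])    # separator {'B'}
```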
From MG, we can associate a unique possibility distribution defined by:
Definition 2 The joint distribution associated with MG, denoted πMG, is expressed by:
πMG(A1, .., AN) = min{πCi : i = 1..N}   (6)
Example 2 Let us consider the ΠG given in Example 1. The moral graph corresponding to ΠG is represented in Figure 2. The initial distributions are transformed into joint ones as shown by Table 3.
Table 3. Initialized potentials of A, AB, AC and BCD
πA: πA(a1) = 1, πA(a2) = 0.9
πAB: πAB(a1 ∧ b1) = 1, πAB(a1 ∧ b2) = 0.4, πAB(a2 ∧ b1) = 0, πAB(a2 ∧ b2) = 1
πAC: πAC(a1 ∧ c1) = 0.3, πAC(a1 ∧ c2) = 1, πAC(a2 ∧ c1) = 1, πAC(a2 ∧ c2) = 0.2
πBCD: πBCD(b1 ∧ c1 ∧ d1) = 1, πBCD(b1 ∧ c1 ∧ d2) = 1, πBCD(b1 ∧ c2 ∧ d1) = 1, πBCD(b1 ∧ c2 ∧ d2) = 0, πBCD(b2 ∧ c1 ∧ d1) = 1, πBCD(b2 ∧ c1 ∧ d2) = 0.8, πBCD(b2 ∧ c2 ∧ d1) = 1, πBCD(b2 ∧ c2 ∧ d2) = 1
Fig. 2. Moral graph of the DAG in Figure 1: the clusters A, AB, AC and BCD, where the separator A links cluster A with AB and with AC, the separator B links AB with BCD, and the separator C links AC with BCD
Incorporating the variable of interest. Let A be the variable of interest and let a be any of its instances. We are interested in computing Πm(a). We first define a new possibility distribution πa from πm as follows:
πa(ω) = πm(ω) if ω[A] = a, and 0 otherwise.   (7)
Proposition 1 Let πm be a possibility distribution obtained by (5) and πa be a possibility distribution computed from πm using (7). Then, Πm(a) = h(πa) = max{πa(ω) : ω ∈ Ω}.
This proposition means that the possibility degree Πm(a) is equal to the consistency degree of πa. The incorporation of the instance a in MG should be such that the possibility distribution obtained from the moral graph is equal to πa. This can be obtained by modifying the potential of the cluster Ci containing A as follows:
πCi(ci) ← πCi(ci) if ci[A] = a, and 0 otherwise.
Example 3 Suppose that we are interested in the value of Πm(D = d2). Table 4 represents the potential of the cluster BCD after incorporating this variable.
Table 4. Initialized potential of BCD after incorporating the evidence D = d2
πBCD(b1 ∧ c1 ∧ d1) = 0, πBCD(b1 ∧ c1 ∧ d2) = 1, πBCD(b1 ∧ c2 ∧ d1) = 0, πBCD(b1 ∧ c2 ∧ d2) = 0
πBCD(b2 ∧ c1 ∧ d1) = 0, πBCD(b2 ∧ c1 ∧ d2) = 0.8, πBCD(b2 ∧ c2 ∧ d1) = 0, πBCD(b2 ∧ c2 ∧ d2) = 1
Proposition 2 shows that the moral graph obtained by incorporating the variable of interest A leads, indeed, to the possibility distribution πa.
Proposition 2 Let ΠG be a min-based possibilistic network. Let MG be the moral graph corresponding to ΠG given by the initialization procedure. Let πa be the joint distribution given by (7) (which is obtained after incorporating the instance a of the variable of interest A). Let πMG be the joint distribution encoded by MG (given by (6)) after the initialization procedure. Then πa = πMG.
The following subsections present several stabilizing procedures which aim to approach the exact value of h(πa) (hence Πm(a)). They are based on the notion of stability, which means that adjacent clusters agree on the marginal distributions defined on common variables.
4.3 One-Parent Stability
One-parent stability means that any cluster agrees with each of its parents on the distributions defined on common variables. More formally,
Definition 3 Let Ci and Cj be two adjacent clusters and let Sij be their separator. The separator Sij is said to be one-parent stable if:
max_{Ci\Sij} πCi = max_{Cj\Sij} πCj   (8)
where max_{Ci\Sij} πCi (resp. max_{Cj\Sij} πCj) is the marginal distribution of Sij defined from πCi (resp. πCj).
A moral graph MG is said to be one-parent stable if all of its separators are one-parent stable. The one-parent stability procedure is performed via a message passing mechanism between the different clusters. Each separator collects information from its corresponding clusters, then diffuses it to each of them, in order to update them by taking the minimum between their initial potential and the one diffused by their separator. This operation is repeated until there is no modification of the clusters' potentials. The potentials of any adjacent clusters Ci and Cj (with separator Sij) are updated as follows:
– Collect evidence (update separator):
πSij ← min(max_{Ci\Sij} πCi, max_{Cj\Sij} πCj)   (9)
– Distribute evidence (update clusters):
πCi ← min(πCi, πSij)   (10)
πCj ← min(πCj, πSij)   (11)
These two steps are repeated until reaching one-parent stability on all clusters. At each level of the stabilizing procedure, the moral graph encodes the same joint distribution:
Proposition 3 Let MG be a moral graph and let MG' be the moral graph resulting from the modification of two adjacent clusters Ci and Cj using equations (9), (10) and (11). Then πMG = πMG'.
It can be shown that one-parent stability is reached after a finite number of message passes, and hence it is a polynomial procedure. The following proposition shows that if a moral graph is stabilized at one parent, then the maximum value of all its clusters' potentials is the same.
Proposition 4 Let MG be a stabilized moral graph. Then there is a degree α such that, for every cluster Ci, max πCi = α.
From Propositions 2 and 3 we deduce that, from the initialization to the one-parent stability level, the moral graph encodes the same joint distribution, i.e. πa = πMG.
Example 4 Let us consider the moral graph initialized in Examples 2 and 3. Note first that this moral graph is not one-parent stabilized. For instance, the separator A between the two clusters AB and A is not one-parent stable since max_{AB\A} πAB(a2) = 1 ≠ πA(a2) = 0.9. At one-parent stability, reached after two message passes, we obtain the potentials given in Table 5. The maximum potential is the same in the four clusters, i.e. max πA = max πAB = max πAC = max πBCD = 0.9.
Table 5. Stabilized potentials
πA: πA(a1) = 0.9, πA(a2) = 0.9
πAB: πAB(a1 ∧ b1) = 0.9, πAB(a1 ∧ b2) = 0.4, πAB(a2 ∧ b1) = 0, πAB(a2 ∧ b2) = 0.9
πAC: πAC(a1 ∧ c1) = 0.3, πAC(a1 ∧ c2) = 0.9, πAC(a2 ∧ c1) = 0.9, πAC(a2 ∧ c2) = 0.2
πBCD: πBCD(b1 ∧ c1 ∧ d1) = 0, πBCD(b1 ∧ c1 ∧ d2) = 0.9, πBCD(b1 ∧ c2 ∧ d1) = 0, πBCD(b1 ∧ c2 ∧ d2) = 0, πBCD(b2 ∧ c1 ∧ d1) = 0, πBCD(b2 ∧ c1 ∧ d2) = 0.8, πBCD(b2 ∧ c2 ∧ d1) = 0, πBCD(b2 ∧ c2 ∧ d2) = 0.9
Note that one-parent stability does not guarantee that the degree α corresponds to the exact degree Πm(a) = h(πa), since the equality h(πMG) = α is not always verified. Indeed, we can check in the previous example that h(πMG) = 0.8 ≠ 0.9. Nevertheless, as we will see later, experimentations show that, in general, this equality holds.
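A sketch of the collect/distribute loop of equations (9)-(11) (my own Python, not the authors' implementation; cluster potentials are represented as dictionaries from frozensets of (variable, value) pairs to degrees, and the two-cluster example is taken from Example 4):

```python
def marg_max(potential, sep_vars):
    """Max-marginalize a cluster potential onto the separator variables."""
    out = {}
    for assignment, degree in potential.items():
        proj = frozenset((v, x) for v, x in assignment if v in sep_vars)
        out[proj] = max(out.get(proj, 0.0), degree)
    return out

def one_parent_stabilize(potentials, edges):
    """Repeat collect (9) and distribute (10)-(11) until no potential changes."""
    changed = True
    while changed:
        changed = False
        for (ci, cj), sep in edges.items():
            mi, mj = marg_max(potentials[ci], sep), marg_max(potentials[cj], sep)
            pi_sep = {s: min(mi.get(s, 0.0), mj.get(s, 0.0)) for s in set(mi) | set(mj)}  # (9)
            for c in (ci, cj):                                                            # (10)-(11)
                for assignment, degree in potentials[c].items():
                    proj = frozenset((v, x) for v, x in assignment if v in sep)
                    new = min(degree, pi_sep.get(proj, 0.0))
                    if new != degree:
                        potentials[c][assignment] = new
                        changed = True
    return potentials

# Clusters A and AB of Example 4, with separator {A}.
potentials = {
    "A":  {frozenset({("A", "a1")}): 1.0, frozenset({("A", "a2")}): 0.9},
    "AB": {frozenset({("A", "a1"), ("B", "b1")}): 1.0, frozenset({("A", "a1"), ("B", "b2")}): 0.4,
           frozenset({("A", "a2"), ("B", "b1")}): 0.0, frozenset({("A", "a2"), ("B", "b2")}): 1.0},
}
edges = {("A", "AB"): {"A"}}
one_parent_stabilize(potentials, edges)
print(potentials["AB"][frozenset({("A", "a2"), ("B", "b2")})])   # lowered from 1.0 to 0.9, as in Table 5
```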
4.4 N-Parents Stability
As noted before, one-parent stability does not always guarantee local computations (from clusters) of the possibility measure Πm(A) of any variable of interest A. Thus, our idea is to improve the resulting possibility degree by considering stability with respect to a greater number of parents. Therefore, we will increase the number of parents by first considering two parents, then three parents, until reaching n parents, where n is the cardinality of the parent set relative to each cluster. To illustrate this procedure we only present the two-parents stability. The principle of this procedure is to ensure, for each cluster having at least two parents, its stability with respect to each pair of parents. More formally:
Definition 4 Let Ci be a cluster in a moral graph MG, and let Cj and Ck be two parents of Ci. Let Sij be the separator between Ci and Cj and Sik be the separator between Ci and Ck. Let C = Cj ∪ Ck and let S = Sij ∪ Sik. Let πC be the joint distribution computed from πCj and πCk. The cluster Ci is said to be stable with respect to its two parents Cj and Ck if: max_{Ci\S} πCi = max_{C\S} πC, where max_{Ci\S} πCi (resp. max_{C\S} πC) is the marginal distribution of S defined from πCi (resp. πC).
In a similar way, a cluster Ci is said to be two-parents stable if it is stable with respect to each pair of its parents. Then, a moral graph MG is said to be two-parents stable if all of its clusters are two-parents stable.
The following procedure ensures the stability of Ci with respect to Cj and Ck:
Algorithm 2 Stabilize a cluster Ci with respect to two parents Cj and Ck
Begin
- Compute πC using πCj and πCk: πC ← min(πCj, πCk)
- Compute πS using πC: πS ← max_{C\S} πC
- Update πCi using πS: πCi ← min(πCi, πS)
End
The two-parents stability preserves the joint distribution encoded by the moral graph:
Proposition 5 Let πa be the joint distribution given by (7). Let πMG be the joint distribution encoded by MG after the two-parents stability procedure. Then, πa = πMG.
The following proposition shows that the two-parents stability improves the one-parent stability:
Proposition 6 Let α1 be the maximal degree generated by the one-parent stability (which is unique, cf. Proposition 4). Let α2 be the maximal degree generated by the two-parents stability. Then, α1 ≥ α2 ≥ Πm(a).
Example 5 Let us consider the one-parent stabilized (but still inconsistent) moral graph of Example 4. The two-parents stabilized potential of the cluster BCD with respect to its two parents AB and AC is given by Table 6. Note, for instance, that the potential of b2 ∧ c2 ∧ d2 decreases from 0.9 to 0.4. Thus, we should re-stabilize the moral graph at one parent (see Table 7). We can check that the resulting moral graph is two-parents stabilized. Moreover, we have h(πMG) = 0.8; in other words, we have reached the consistency degree of πa.
Table 6. Two-parents stabilized potential of BCD
πBCD(b1 ∧ c1 ∧ d1) = 0, πBCD(b1 ∧ c1 ∧ d2) = 0.3, πBCD(b1 ∧ c2 ∧ d1) = 0, πBCD(b1 ∧ c2 ∧ d2) = 0
πBCD(b2 ∧ c1 ∧ d1) = 0, πBCD(b2 ∧ c1 ∧ d2) = 0.8, πBCD(b2 ∧ c2 ∧ d1) = 0, πBCD(b2 ∧ c2 ∧ d2) = 0.4
Table 7. One-parent re-stabilized potentials
πA: πA(a1) = 0.4, πA(a2) = 0.8
πAB: πAB(a1 ∧ b1) = 0.3, πAB(a1 ∧ b2) = 0.4, πAB(a2 ∧ b1) = 0, πAB(a2 ∧ b2) = 0.8
πAC: πAC(a1 ∧ c1) = 0.3, πAC(a1 ∧ c2) = 0.4, πAC(a2 ∧ c1) = 0.8, πAC(a2 ∧ c2) = 0.2
πBCD: πBCD(b1 ∧ c1 ∧ d1) = 0, πBCD(b1 ∧ c1 ∧ d2) = 0.3, πBCD(b1 ∧ c2 ∧ d1) = 0, πBCD(b1 ∧ c2 ∧ d2) = 0, πBCD(b2 ∧ c1 ∧ d1) = 0, πBCD(b2 ∧ c1 ∧ d2) = 0.8, πBCD(b2 ∧ c2 ∧ d1) = 0, πBCD(b2 ∧ c2 ∧ d2) = 0.4
4.5 N-Best-Parents Stability
Ideally, we want to perform an n-parents stability where n is the cardinality of the parent set relative to each cluster. In other terms, each cluster will be stabilized with respect to the whole set of its parents. However, this can be impossible especially when a cluster has an important number of parents since we should compute their cartesian product. In order to avoid this problem, we will relax the n-parents stability by only computing the best instances in this cartesian product called best global instances. The main motivation in n-best parents stability, is that our aim is to compute the exact value of h(πa ), and
not the whole distribution πa . The idea is to cover for any cluster Ci , its n parents by only saving the best instances (i.e having the maximum degree) of each cluster and by combining them while eliminating the incoherent instances. Once the best global instances is constructed, we can compute the best instances relative to the n separators existing between Ci and its parents and compare them with the ones obtained from Ci . If some instances in Ci are incoherent with those computed from the parents, then we will decrease their degrees. This is illustrated by the following example, Example 6 Let us consider the cluster CEF G in Figure 3, having three parents ABC, CDE and F . The Figure shows the best instances in each cluster (for instance the best instance in the cluster F is f1 ). From the cartesian product of the best instances (i.e. best global instances) we can check that the best instances relative to the three separators C, E and F are c1 ∧ e1 ∧ f1 and c1 ∧ e2 ∧ f1 . However, from the cluster CEF G, the best instances relative to the separators are c1 ∧ e1 ∧ f1 and c2 ∧ e1 ∧ f2 . Thus, we should decrease the degree of the instance c2 ∧ e1 ∧ f2 ∧ g1 from α to the next degree in ABC, CDE and F and re-stabilize the moral graph at one-parent.
Fig. 3. Example of n-best-parents stability: the cluster CEFG has three parents ABC, CDE and F; the best instances of the parents (e.g. a1 b1 c1, a1 b1 c2 and a2 b2 c1 for ABC, c1 d2 e2 and c1 d2 e1 for CDE, and f1 for F) are combined into the best global instances, from which the best instances of the separators C, E and F are read off and compared with those of CEFG (whose best instances are c1 e1 f1 g1 and c2 e1 f2 g1)
5 Handling the Evidence
The proposed propagation algorithm can be easily extended in order to take into account new evidence e, which corresponds to a set of instantiated variables. The computation of Πm(a | e) is performed via two calls of the above propagation algorithm, in order to compute successively Πm(e) and Πm(a ∧ e). Then, using min-based conditioning, we get:
Πm(a | e) = Πm(a ∧ e) if Πm(a ∧ e) < Πm(e), and 1 otherwise.
The computation of Πm(e) needs a slight transformation of the initialization procedure, since the evidence can bear on several variables. More precisely, the phase of incorporation of the instance of interest is replaced by: incorporating the instance a1 ∧ .. ∧ aM of the variables of interest A1, .., AM, i.e.:
∀i ∈ {1, .., M}, πCi(ci) ← πCi(ci) if ci[Ai] = ai, and 0 otherwise.
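A tiny sketch of this two-call scheme (illustrative Python of my own; `propagate` is a hypothetical placeholder for the anytime procedure described above, assumed to take a dictionary of instantiated variables and return the computed possibility degree):

```python
def conditional_possibility(propagate, a, e):
    """Compute Pi_m(a | e) from Pi_m(a and e) and Pi_m(e) using min-based conditioning."""
    p_ae = propagate(evidence={**a, **e})   # first call: incorporate both a and e
    p_e = propagate(evidence=e)             # second call: incorporate e only
    return p_ae if p_ae < p_e else 1.0

# Example usage on the network of Example 1 (with a suitable propagate function):
# conditional_possibility(propagate, a={"D": "d2"}, e={"B": "b2"})
```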
6 Experimental Results
The experimentation is performed on random possibilistic networks generated as follows: Graphical component: we used two DAG’s structures generated as follows: – In the first structure the DAGs are generated randomly, by just varying three parameters: the number of nodes, their cardinality and the maximum number of parents. – In the second one, we choose special cases of DAGs where nodes are partioned into levels such that nodes of level i only receive arcs either from nodes of the same level, or from level i − 1. For instance the DAG of Figure 4 has 4 levels: the first contains 5 nodes, the second 7 nodes, the third 3 nodes and the fourth 5 nodes. Note that if we consider only two levels by omitting the intra-levels links, this structure corresponds to the QMR (Quick Medical Reference) network [13]. Numerical component: Once the DAG structure is fixed, we generate random conditional distributions of each node in the context of its parents. Then, we generate random variable of interest. 6.1
6.1 Stability vs. Consistency
In the first experimentation, we propose to test the quality of the stability with respect to the consistency degree h(πa) (i.e. Πm(a)). Regarding the first structure, we have noted that the one-parent stability and the two-parents stability provide, respectively, 99% and 99.999% of exact results. That is why we tested the second structure, considering 19 levels from 2 to 20.
Fig. 4. Example of a DAG with 4 levels
At each level, we generate 300 networks with a number of nodes varying between 40 and 60, since we are limited, in some cases, by the junction tree algorithm², which is unable to treat complex networks with a great number of nodes. Table 8 gives the parameters of this experimentation.

Table 8. Parameters of the experimentation of stability vs consistency

levels | 2   3   4   5   6   7   8   9   10  11
nodes  | 45  40  45  40  40  40  40  40  40  40
links  | 68  80  88  90  85  81  86  85  84  85

levels | 12  13  14  15  16  17  18  19  20
nodes  | 40  49  50  49  48  51  54  57  60
links  | 83  106 107 102 99  105 112 120 125
Figure 5 shows the results of this experimentation. At each level (from 2 to 20), the first (resp. second, third, fourth) bar from the left represents the percentage of networks where the one-parent (resp. two-parents, three-parents, n-best-parents) stability leads to consistency (i.e. generates the exact marginals). It is clear that the higher the number of parents considered in the stability procedure, the better the quality of the results. Moreover, this figure shows that the stability degree, even at one-parent, is a good estimate of the consistency degree (96.11%). In addition, we remark that the quality of the estimate depends on the number of levels in the DAG, since with a small number of levels (2, 3 and 4) the one-parent stability is sufficient to reach consistency. These results are interesting since they show that for networks with complex structures and a great number of nodes we can use the one-parent stability, which is a polynomial procedure. Indeed, as we will see later, in such cases the exact algorithm can generate huge clusters where local computations are impossible. Figure 6 shows the running times of the different stability procedures for DAGs of 50 nodes and 100 links on average. It is clear that the one-parent stability is the fastest, while the n-best-parents stability is the slowest.

² The junction tree algorithm (cf. Subsection 3.2) is used to provide the exact values of Π(a).
Fig. 5. Stability vs Consistency (overall: one-parent stability 96.11%, two-parents stability 99.18%, three-parents stability 99.46%, n-best-parents stability 99.51%)
Fig. 6. Running time (in seconds) of the different stability procedures
6.2 Correlation between Exact Marginals and Stability Degrees
We are now interested in the correlation between the exact marginals and the ones generated by the stabilization procedure. This experimentation is performed on 100 random networks with 20 levels and 60 nodes. For each network, the evidence, the variable of interest and the instance of interest are fixed randomly. Then, we compare the possibility degree of the instance of interest generated by the junction tree algorithm (exact marginals) with the one generated by the one-parent stability procedure. Figure 7 shows the results of this experimentation. Again, we confirm that the one-parent stability is a good estimation of consistency. Indeed, it is clear that in the cases where the equality between exact marginals and those obtained from the stability procedure does not hold, the gap is small.
Fig. 7. Correlation plot between exact marginals (x-axis) and one-parent stability degrees (y-axis)
6.3 Comparing Junction Tree Algorithm with One-Parent Stability
We have also compared experimentally the junction tree algorithm with one-parent stability. In this experimentation, using the first structure, we vary the ratio Links/Nodes in order to test the limits of the junction tree algorithm. For instance, with networks containing 40 (resp. 50, 60) nodes, the junction tree algorithm is blocked from the ratio 3.55 (resp. 2.72, 1.78), while the one-parent stability provides a result in a few seconds. When the junction tree algorithm is not blocked, it is faster than the one-parent stability. However, the difference does not exceed a few seconds.
7 Conclusion
This paper has proposed an anytime propagation algorithm for min-based directed networks. The stability procedures improve those presented in [1] since we use more than one-parent stability. Moreover, this paper contains experimental results which are very encouraging, since they show that consistency is reached in most cases and that our algorithm can be used in situations where the junction tree algorithm is limited. Due to lack of space, we have not presented the consistency procedure which provides exact values; a first version of this procedure is outlined in [1]. Future work will be to improve it and to compare our algorithm with the ones used in possibilistic logic [8] and in FCSP [9].
References
1. N. Ben Amor, S. Benferhat, K. Mellouli, A New Propagation Algorithm for Min-Based Possibilistic Causal Networks, Procs. of ECSQARU'2001, 2001.
2. N. Ben Amor, S. Benferhat, D. Dubois, H. Geffner and H. Prade, Independence in Qualitative Uncertainty Frameworks, Procs. of KR'2000, 2000.
3. S. Benferhat, D. Dubois, L. Garcia and H. Prade, Possibilistic logic bases and possibilistic graphs, Procs. of UAI'99, 1999.
4. C. Borgelt, J. Gebhardt, and R. Kruse, Possibilistic Graphical Models, Procs. of ISSEK'98 (Udine, Italy), 1998.
5. L. M. de Campos and J. F. Huete, Independence concepts in possibility theory, Fuzzy Sets and Systems, 1998.
6. G. F. Cooper, Computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, 393-405, 1990.
7. D. Dubois and H. Prade, Possibility Theory: An Approach to Computerized Processing of Uncertainty, Plenum Press, New York, 1988.
8. D. Dubois, J. Lang and H. Prade, Possibilistic logic, in Handbook of Logic in Artificial Intelligence and Logic Programming, Oxford University Press, Vol. 3, 439-513, 1994.
9. H. Fargier, Problèmes de satisfaction de contraintes flexibles - application à l'ordonnancement de production, Thèse de l'Université P. Sabatier, Toulouse, France, 1994.
10. P. Fonck, Propagating uncertainty in a directed acyclic graph, IPMU'92, 17-20, 1992.
11. P. Fonck, Conditional independence in possibility theory, Uncertainty in Artificial Intelligence, 221-226, 1994.
12. J. Gebhardt and R. Kruse, Background and perspectives of possibilistic graphical models, Procs. of ECSQARU/FAPR'97, Berlin, 1997.
13. D. Heckerman, A tractable inference algorithm for diagnosing multiple diseases, Procs. of UAI'89, 1989.
14. E. Hisdal, Conditional possibilities independence and non interaction, Fuzzy Sets and Systems, Vol. 1, 1978.
15. F. V. Jensen, Introduction to Bayesian Networks, UCL Press, 1996.
16. L. A. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems, 1, 3-28, 1978.
Macro Analysis of Techniques to Deal with Uncertainty in Information Systems Development: Mapping Representational Framing Influences

Carl Adams¹ and David E. Avison²

¹ Department of Information Systems, University of Portsmouth, UK
² Department SID, ESSEC Business School, Cergy-Pontoise, France
Abstract. Development methods and techniques provide structure, directed tasks and cognitive tools with which to collect, collate, analyze and represent information about system requirements and attributes. These methods and techniques provide support for developers when learning about a system. Each development technique has its own unique set of characteristics distinguishing it from other techniques. Consequently, different development techniques can represent the same set of requirements or a problem situation differently. A new classification of techniques is developed based on representational characteristics. To understand whether these different representations are likely to impact problem and requirement understanding, this paper draws upon the framing effect of prospect theory. The classification is applied to works from the cognitive psychology literature which indicate how specific technique attributes may influence problem understanding. This classification is applied to approximately 100 development techniques.
1 Introduction

Development methods and techniques provide structure, directed tasks and cognitive tools with which to collect, collate, analyze and represent information about system requirements and attributes. These methods and techniques provide support for developers when learning about system requirements and dealing with many of the uncertainties of development. Each development technique has its own unique set of characteristics, which distinguishes it from other techniques. Consequently, different development techniques can represent, or frame, the same set of requirements or a problem situation in a different way. According to the framing effect [46], [47], [48], suggested by prospect theory [28], different representations of essentially the same situation will result in a different preferred 'prospect' or choice: people's understanding of a problem is profoundly influenced by how the problem is presented. This paper aims to map framing influences of techniques used within information systems development. The focus of this paper is on the effects on problem cognition due to distinct characteristics of development methods, or more typically their component techniques. Drawing on the cognitive psychology literature enables an analysis of how specific characteristics of techniques may influence problem understanding. A new
classification is developed based on a 'natural' grouping of representational characteristics [51]. The classification also defines the problem/solution space for different types of techniques. This classification is applied to approximately 100 development techniques (see Appendix). The structure of the rest of this paper is as follows. First there is a discussion on methodologies and techniques and an examination of the main characteristics of techniques used in information systems development. These characteristics are used to develop a 'natural' classification based on the representational attributes of techniques. The paper then examines the background to prospect theory and the framing effect, and further works on framing influences. These framing influences are applied to the developed classification to indicate how particular aspects of techniques may influence problem cognition.
1.1 Development Methods and Techniques

Wynekoop and Russo [55] assess the use of development methods and conclude, 'there is little empirical insight into why some methodologies might be better than others in certain situations' (p69). Interestingly, Wynekoop and Russo cite several studies indicating that development methods are adapted considerably by organizations and even individual projects. Keyes maintains that there are no methods, just techniques; this is not a common view within the IS literature, but it highlights the prominent role of 'techniques' in development practice. In Wynekoop and Russo's work, development techniques are seen as component parts of methodologies, collected together within a particular philosophical framework. The selection and use of techniques distinguish one development method or approach from another. For instance, in Fitzgerald's [18] postal survey investigating the use of methodologies, 60% of respondents were not using a formalised commercial development methodology and very few (6%) rigorously followed a development methodology. Many of the respondents from Fitzgerald's survey using a formal development methodology tended to adapt the methodology to specific development needs. In a later study, Fitzgerald [19] found that considerable tailoring of methodologies was common practice, with the tailoring involving the use of additional techniques or method steps and/or missing out specific techniques and/or method steps. From this discussion, development techniques may be classed as a 'lowest common denominator' between methodologies and play an influential role in how an information system is developed. When examining the use of techniques in information systems development, one is struck by the variety of 'different' techniques available. For general problem and business analysis (an integral part of many information systems development approaches) there is a wealth of available techniques: for instance, Jantsch [27] examined over 100 techniques for general business and technological forecasting; in the Royal Society's work on Risk Assessment [39] numerous techniques from several business areas are examined; Bicheno [6] examined 50 techniques and business tools focusing on improving quality; Couger [10], Adams [2] and de Bono [12], [13], [14] between them examined many techniques to improve creativity, innovation and lateral thinking in problem solving; and Obolensky [34] examined a range of techniques suitable for business re-engineering. There is also a range of techniques aimed at
specific information systems development activities, for instance, techniques to help conduct a feasibility study, analyse requirements, design a system and develop, test and monitor systems (e.g. [3], [5], [15], [16], [20], [22], [26], [56]). New technologies and applications give rise to new techniques and new tools to support development (e.g. [36]). Seemingly, therefore, there is an abundance of different techniques available to developers. However, there is much similarity between many of the different techniques. A closer examination of the items listed by Bicheno [6] reveals that they are often heavily based on previous ones, with many newly-claimed techniques being adaptations or compilations of other techniques.
1.2 What Techniques Offer

Given that development techniques play such an influential role in how an information system is developed, it would be useful to consider what is gained from using a development technique. An initial list may include the following:
• Reduces the 'problem' to a manageable set of tasks.
• Provides guidance on addressing the problem.
• Adds structure and order to tasks.
• Provides focus and direction to tasks.
• Provides cognitive tools to address, describe and represent the 'problem'.
• Provides the basis for further analysis or work.
• Provides a communication medium between interested parties.
• Provides an output of the problem-solving activity.
• Provides general support for problem-solving activities.
These items can be considered as aiding developers in understanding the problems and requirements of an information system. This is supported by Wastell [52], [53] who, examining the use of development techniques, identified two concepts which describe this learning support behaviour: 'social defence' against the unknown and 'transitional objects and space'.
'We argue that the operation of these defences can come to paralyse the learning processes that are critical to effective IS development … These social defences refer to modes of group behaviour that operate primarily to reduce anxiety, rather than reflecting genuine engagement with the task at hand … (Transitional) spaces have two important aspects: a supportive psychological climate and a supply of appropriate transitional objects (i.e. entities that provide a temporary emotional support).' [53, p3]
The social defences concept is used to describe how developers follow methods, techniques and other rituals of development as a means to cope with the stresses and uncertainties of the development environment. A negative aspect of these social defences is the potential for the rules of the methods and techniques to become paramount rather than addressing the 'real' problems of information systems development. This concept of supporting the learning process is consistent with the findings of Fitzgerald's [19, p342] study, which found that there was considerable tailoring of development methodologies and that tailoring was more likely to be conducted by experienced developers. Inexperienced developers tend to rely more heavily on following a development method or technique rigorously. Inexperienced developers
require more guidance and support in the development process and look to the method, or collection of techniques, for that support. The key elements here are that techniques play an important role within information systems development, influencing the learning and understanding process of developers (about the information system requirements), and that there are potential negative influences when developers engage in the rituals of a technique at the expense of problem understanding. The next section examines more closely the characteristics of techniques to identify further possible influences on problem cognition.

1.3 Characteristics of Techniques

By examining a variety of techniques, certain attributes become apparent, for example:
• Visual attributes, e.g. visual representation and structure of technique output.
• Linguistic attributes, e.g. terminology and language used – not just English language, but also others such as mathematical and diagrammatical [2, p103].
• Genealogy attributes, e.g. history of techniques, related techniques.
• Process/procedure attributes, e.g. description and order of tasks.
• People attributes, e.g. roles of people involved in tasks.
• Goal attributes, e.g. aims and focus of techniques.
• Paradigm attributes, e.g. discourse, taken-for-granted elements, cultural elements.
• Biases, e.g. particular emphasis, items to consider, items not considered.
• Technique or application-specific attributes.
Some characteristics of a technique are explicit, for instance where a particular visual representation is prescribed. Other characteristics might be less obvious, such as the underlying paradigm. Many of the characteristics are interwoven, for instance the visual and linguistic attributes might be closely aligned with the genealogy of a technique. The next section will develop a classification based on the main representational characteristics of techniques. Literature from the cognitive psychology field will be used to examine how specific visual attributes of techniques are likely to affect cognition.
2 Classification of Techniques by Representational Characteristics

This section develops an initial classification of techniques, based on Waddington's [51] 'natural' attributes for grouping items. A similar classification of techniques by natural attributes is described in [1]. Waddington discusses our 'basic', or natural, methods of ordering complex systems, the most basic of which relies on identifying simple relationships, hierarchies, patterns and similarities of characteristics. The natural grouping for development techniques is based on the linguistic attributes (e.g. generic names) of a technique and on the final presentation of a technique (i.e. grouping techniques with similar looking presentations together). The result is six groups: (i) Brainstorming Approaches, (ii) Relationship Approaches, (iii) Scenario
Approaches, (iv) Reductionist Approaches, (v) Matrix Approaches and (vi) Conflict Approaches.
• Brainstorming Approaches. This group is defined by the generic name 'brainstorming'. Representations for brainstorming techniques vary, but usually contain lists of items and/or some relationship diagram (e.g. a mind map). Brainstorming is probably the most well known, well used and most modified of the techniques. Brainstorming is often associated with De Bono (e.g. [13], [14]) who covered it as one of a set of lateral thinking techniques, though others seem to have earlier claims, such as [8, p262]. It is a group activity to generate a cross stimulation of ideas.
• Relationship Approaches. This group is defined mainly by the final presentation of the techniques, which is typically based on diagrams representing a defined structure or relationships between component parts. Included in this grouping are Network Diagrams (e.g. [6, p40], [32]) and Cognitive Mapping [17], which some might argue are quite different techniques; however, the final output presentations are topologically very similar. A further characteristic is the use of a diagram to present and model the situation.
• Scenario Approaches. This group is defined by generic linguistic attributes based around scenarios. These techniques involve getting participants to consider different possible futures for a particular area of interest. Representation in these approaches can vary from lists of items to diagrams.
• Reductionist Approaches. This group is defined by the use of generic linguistic attributes (i.e. similar terminology revolving around reducing the 'problem' area into smaller component parts) and visual attributes based on well-defined structures. Once a problem has been 'reduced', the component parts are addressed in turn before scaling back up to the whole problem again.
• Matrix Approaches. This group is defined by the final presentation, that of a matrix or list structure, though often the generic name 'matrix' is also used. Using some form of matrix or list approach for structuring and making decisions is widely known and frequently used (e.g. [27, p211]). A list of factors is compared or analysed against another list of factors.
• Conflict Approaches. This grouping is defined by the generic name 'conflict'. It underlines an approach of viewing the problem from different and conflicting perspectives.
Each group can be considered in terms of 'social defence' against the unknown [52]. As discussed earlier, 'social defence' in this context represents organisational or individual activities and rituals that are used to deal with anxieties and uncertainties. It is argued that the more quantitatively rigorous and detailed (depth of study) the technique, the higher the potential for it being a social defence mechanism. Another useful concept with which to consider techniques is the problem/solution space [38], which can be used to represent the scope of possible solutions offered by a technique. These concepts are developed and applied to the six groups; a summary is given in Table 1, and problem/solution space diagrams are presented in Figure 1. This initial classification is applied to approximately 100 techniques, the results of which are shown in the Appendix.
Fig. 1. Problem solution space for each of the natural groups
Though providing a different vista on the classification of techniques, with possible social defence attributes and problem/solution spaces mapped out, this grouping proves too simplistic in that it does not address ‘how’ techniques in a particular group would affect problem cognition. The next section draws upon psychology literature to inform how distinct representational attributes will affect problem cognition.
Table 1. Characteristics of Natural Grouping for Techniques to Deal with Uncertainty

Group         | Quantitatively rigorous / Depth of study | Potential for social defence | Area of problem/solution space covered
Brainstorming | LOW         | LOW                 | SCATTERED
Relationship  | HIGH        | MEDIUM-HIGH         | LOCALISED CLUSTERS
Scenario      | MEDIUM      | MEDIUM              | SCATTERED CLUSTERS
Reductionist  | VERY HIGH   | HIGH                | LOCALISED
Matrix        | HIGH        | MEDIUM-HIGH or HIGH | LOCALISED CLUSTERS
Conflict      | MEDIUM-HIGH | MEDIUM-HIGH         | VERY LOCALISED
3 Impact on Problem Understanding: Lessons from Cognitive Psychology

3.1 Prospect Theory and the Framing Effect

For understanding cognitive influences on problem understanding we are initially drawn to prospect theory [28], which was developed as a descriptive model of decision-making under risk. Prospect theory was also presented as a critique of expected utility theory (EUT) [50] and collated the major violations of EUT for choices between risky 'prospects' with a small number of outcomes. The main characteristics of Kahneman and Tversky's [28] original prospect theory are the framing effect, a hypothetical 'S' shaped value function with corresponding weighting function, and a two-phase process consisting of an editing phase and an evaluation phase. The focus of this paper is on the framing effect, a concept that was described as 'preferences may be altered by different representations of the probabilities' (p273), i.e. different representations of essentially the same situation will result in a different preferred 'prospect' or choice. This was made more explicit as a framing effect in their later work [48]. They, along with others (e.g. [41]), have demonstrated several different types of framing influences. There are some limitations of prospect theory, particularly regarding the implied cognitive processes. The theory seems to be good at describing what decisions people will make and what items may influence those decisions; however, it is lacking in describing how people reach these decisions. (See [11] for a description of some of the weaknesses and alternative cognitive process models.) Another possible limitation centres on the laboratory-based research methods and artificial scenarios used to develop the theory. However, there is considerable support for prospect theory, and the cornerstone of prospect theory, the framing effect, is robust and likely to represent some key influences on decision-making [41]. The next section explores some key areas of framing influences relevant to information systems development techniques.
3.2 Visual Influences: Gestalt Psychologists

One of the earliest and most influential movements of cognitive psychology was that of the Gestalt psychologists, initiated by Max Wertheimer, Wolfgang Kohler and Kurt Koffka [23], [25], [54]. 'In Gestalt theory, problem representation rests at the heart of problem solving – the way you look at the problem can affect the way you solve the problem. … The Gestalt approach to problem solving has fostered numerous attempts to improve creative problem solving by helping people represent problems in useful ways.' [31, p68] The key element here is that the way in which a problem is represented will affect the understanding of the problem, which is consistent with prospect theory. Relating this to techniques, one can deduce that the visual, linguistic and other representation imposed by a technique will impact on problem cognition. The Gestalt movement in cognitive psychology has a (comparatively) long history and has had a big impact on the understanding of problem solving. The movement has spawned various strands of techniques such as lateral thinking and some creative techniques. Gillam [23] gives a more current examination of Gestalt theorists and works, particularly in the area of perceptual grouping (i.e. how people understand and group items). Gillam shows that perceptual coherence (i.e. grouping) is not the outcome of a single process (as originally proposed by Gestalt theory) but may be best regarded as a domain of perception (i.e. the grouping process is likely to be more complex, influenced by context and other aspects) (ibid p161). The Gestalt psychologists indicate a potentially strong influence on problem understanding, that of functional fixedness: 'prior experience can have negative effects in certain new problem-solving situations … the idea that the reproductive application of past habits inhibits problem solving' [31]. The implication is that habits 'learnt' using previous techniques and problems would bias the application of new techniques and problems. This could be particularly relevant given the glut of 'new' techniques and may explain why many techniques are in reality rehashes of older techniques. Another major way in which a technique can influence cognition can be deduced from support theory, which indicates that support for an option will increase the more that the option is broken down into smaller component parts, with each part being considered separately. The more specific the description of an event, the more likely the event will seem. The implication is that the more a technique breaks a situation down into component parts or alternatives and considers each of them, the more apparent the situation will become.

3.3 Structure Influences

A prescriptive structure is also likely to exert influence on problem cognition. For instance, hierarchy and tree structures are likely to exert some influence on problem cognition in binding attributes together (e.g. on the same part of a tree structure) and limiting items to the confines of the imposed structure. In cognitive psychology this is known as category inclusion. 'One enduring principle of rational inference is category inclusion: categories inherit the properties of their superordinates' [42]. The implication is that techniques dictating hierarchical structures will force a (self-perpetuating) category inclusion bias. An element in one branch of a hierarchical
structure will automatically have different properties to an element in another branch of the hierarchical structure. For instance, take a functional breakdown of an organization (such as that described in [56]). One might conclude from category inclusion that a task in an accounting department will always be different to a task in a personnel department, which clearly may not be the case, as both departments will have some similar tasks, such as ordering the stationery. However, this category inclusion is not universally the case. Sloman [42] found that the process is likely to be more complex. In his study, participants frequently did not apply the category inclusion principle: 'instead, judgments tended to be proportional to the similarity between premise and conclusion', and he concluded 'arbitrary hierarchies can always be constructed to suit a particular purpose. But those hierarchies are apparently less central to human inference than logic suggests' (p31). The initial premise surrounding a situation is likely to be related to the underlying paradigm. Dictating a hierarchical structure in itself may not result in category inclusion biases. However, coupled with an underlying paradigm of closed hierarchical properties, it will more likely result in category inclusion biases. Along the same theme are proximity influences and biases. The understanding of items can be influenced by the characteristics of other items represented in close proximity.

3.4 Order and Discourse Influences

Perceptual processing is profoundly influenced by the order of information presented and the relational constructs of information [33]. The order and number of items in a list will influence how people will understand (and recall) items and how people will categorize them. The implications are that the language and order of describing a problem situation, the questions asked and how they are asked, and the implied relationships (all of which are usually prescribed by a technique) will bias problem understanding, e.g. by forcing 'leading questions' or 'leading processes'. The discourse and language used to describe a problem is likely to play a role in problem understanding. Adams [2] discusses various different types of 'languages of thought' used in representing and solving problems. People can view problems using mathematical symbols and notation, drawings, charts, pictures and a variety of natural verbal language constructs such as analogies and scenarios. Further, people switch consciously and unconsciously between different modes of thought using the different languages of thought (p72). The information systems development environment is awash with technical jargon and language constructs. In addition, different application areas have their own set of jargon and specific language. Individual techniques have their own peculiar discourse consisting of particular language, jargon and taken-for-granted constructs, all of which may exert influence. For instance, the initial discourse used affects understanding of a problem situation, particularly in resolving ambiguities [30], by setting the context with which to consider the situation. Resolving ambiguous requirements is a common task in information systems development [21]. Effectively, techniques have the potential for leading questions and processes. In addition, the cognitive psychology literature indicates that there will be a different weight attached to normative than to descriptive representations and results of techniques.
The basis for this is the ‘understanding/acceptance principle’ [43], which states that ‘the deeper the understanding of a normative principle, the greater the tendency to respond in accordance with it’ [44, p349].
Language aspects highlight another set of possible influences, that of communication between different groups of people (such as between analysts and users). Differences of perspective between different groups of people in the development process have been discussed within the IS field under the heading of 'softer' aspects or as the organizational or people issues (e.g. [7], [21], [29], [40]). Identifying differences and inconsistencies can be classed as a useful task in identifying and dealing with requirements [21]. From cognitive psychology there are also other considerations. Teigen's [45] work on the language of uncertainty shows that there is often more than the literal meaning implied in the use of a term, such as contextual and relational information or some underlying 'other' message. The use of language is very complex. The implications are that even if a technique prescribes a set of 'unambiguous' language and constructs, there may well be considerable ambiguity when it is used.

3.5 Preference Influences

There are also likely to be individual preferences, and corresponding biases, for some techniques or specific tasks within techniques, as Puccio [37, p171] relates: 'The creative problem solving process involves a series of distinct mental operations (i.e. collecting information, defining problems, generating ideas, developing solutions, and taking action) and people will express different degrees of preference for these various operations'. Couger [9, p5] has noted similar preferences: 'It is not surprising that technical people are predisposed towards the use of analytical techniques and behaviorally orientated people towards the intuitive techniques'. In addition, there may be some biases between group and individual tasks, a point taken up by Poole [35], who notes that group interaction on such tasks is likely to be complex with many influences. The theme was also taken up by Kerr et al. [57], who investigated whether individual activities are better than group activities (i.e. have fewer errors or less bias), but their findings were inconclusive: 'the relative magnitude of individual and group bias depends upon several factors, including group size, initial individual judgement, the magnitude of bias among individuals, the type of bias, and most of all, the group-judgment process …. It is concluded that there can be no simple answer to the question "which are more biased, individuals or groups?"' [57, p687]. To address the potential individual/group biases, many authors suggesting techniques recommend some consideration of the make-up of the different groups using them (e.g. [6], [9]), though they give limited practical guidance on doing so.

3.6 Goal Influences

Goal or aim aspects also profoundly influence problem understanding by providing direction and focus for knowledge compilation [3]. Goals influence the strategies people undertake to acquire information and solve problems. Further, when there is a lack of clear goals, people are likely to take support from a particular learning strategy, which will typically be prescribed by the technique: 'The role of general methods in learning varies with both the specificity of the problem solver's goal and the systematicity of the strategies used for testing hypothesis about rules. In the absence of a specific goal people are more likely to use
a rule-induced learning strategy, whereas provision of a specific goal fosters use of difference reduction, which tends to be a non-rule-induction strategy’ [49]. The implications are that techniques with clear task goals will impact the focus and form of information collection (e.g. what information is required and where it comes from, along with what information is not deemed relevant) and how the information is to be processed. Further, if there are no clear goals then people are likely to rely more heavily on the learning method prescribed by the technique. Technique attributes are likely to dictate representations used.
3.7 Potential Blocks to Problem Cognition

In addition to specific representational attributes, framing can be considered in terms of providing conceptual blocking. From creative, innovative and lateral thinking perspectives, Groth and Peters [24, p183] examined barriers to creative problem solving amongst managers. They identified a long list of perceived barriers to creativity including: fear of failure, lack of confidence, environmental factors, fear of success and its consequences, fear of challenge, routines, habits, paradigms, preconceived notions, rules, standards, tunnel sight, internal barriers, structure, socialization, external barriers, money, rebellion, health and energy, mood, attitudes, desire, time. They grouped the perceived barriers into 'self imposed', 'professional environment' and 'environmentally imposed' categories. Fear of some sort seems to be the predominant barrier, at least for these managers. For more general barriers, Adams [2] identifies four main areas of conceptual blocks; these are presented in Table 2. These 'blocks' indicate that techniques could have a variety of adverse influences on problem cognition, including 'blinkered' perception from a particular perspective, lack of emotional support as a transitional object, providing a flawed approach and logic, and not providing appropriate cognitive tools [2].
3.8 Summary of Representational Framing Influences

The framing influences discussed above indicate that any framing effect due to the characteristics of a technique is likely to be complex and interwoven. However, there are some main themes that emerge. The visual, structure and linguistic aspects can be combined under a general 'representational' heading. Arguably, the more prescribed and structured a technique is, the more likely it is that 'predictable' framing influences can be ascribed. Overall, the works from the cognitive psychology field give several indications about how the characteristics of a technique are able to exert some influence on problem cognition.
Table 2. Four main areas of conceptual blocks

Perceptual Blocks
• Seeing what you expect to see – stereotyping
• Difficulty in isolating the problem
• Tendency to delimit the problem area too closely (i.e. imposing too many constraints on the problem)
• Inability to see the problem from various viewpoints
• Saturation (e.g. disregarding seemingly unimportant or less 'visible' aspects)
• Failure to utilize all sensory inputs

Emotional Blocks
• Fear of taking risks
• No appetite for chaos
• Judging rather than generating ideas
• Inability to incubate ideas
• Lack of challenge and excessive zeal
• Lack of imagination

Cultural and Environmental Blocks
Cultural blocks could include:
• Taboos
• Seeing fantasy and reflection as a waste of time
• Seeing reasons, logic, numbers, utility, practicality as good; and feeling, intuition, qualitative judgments as bad
• Regarding tradition as preferable to change
Environmental blocks could include:
• Lack of cooperation and trust among colleagues
• Having an autocratic boss
• Distractions

Intellectual and Expressive Blocks
• Use of appropriate cognitive tools and problem solving language
4 Applying Framing Influences

As the previous discussion shows, the literature from cognitive psychology indicates that framing influences are likely to be complex and involved. Attributing the likelihood of framing influences to techniques is likely to be somewhat subjective. In addition, the earlier discussion on the use and adaptation of methodologies and techniques indicates that there is likely to be considerable variation in applying a technique. However, applying the identified framing influences to the main representational attributes of techniques enables some likely framing effects to be identified. These are summarised in Table 3. In addition, there are likely to be further influences, such as individual biases towards different types of techniques (or tasks within them), negative versus positive framing, and a range of perceptual blocks.

4.1 Summary

This paper has contended that techniques influence problem understanding during information systems development. The influences can be considered under certain representational attributes. The cognitive psychology literature indicates how these attributes are likely to affect problem understanding. In prospect theory this is known as the framing effect. By classifying the characteristics of techniques this paper has tried to indicate how different types of technique are likely to influence problem cognition, and in doing so has tried to map the framing effect of techniques.
Some potential biases and blocks to cognition were identified. These biases become more prominent when one considers that the results of a technique (i.e. diagrams, tables etc.) may be used by different groups of people from those that produced them (e.g. analysts may produce some charts and tables which will be used by designers), and this is likely to perpetuate such biases throughout the development process.

Table 3. Potential for framing influences applied to natural classification of techniques
(Columns: Group; Structure influences (e.g. functional fixedness); Order influences; Discourse influences; Prescribed goal influences (e.g. rule-induced learning); Normative/Analytical biases. Rows: Brainstorming, Relationship, Scenario, Reductionist, Matrix and Conflict, with each influence rated from Low to High.)
References

1. Adams C. (1996) Techniques to deal with uncertainty in information systems development, 6th annual conference of Business Information Systems (BIT'96), Manchester Metropolitan University.
2. Adams J. (1987) Conceptual blockbusting, a guide to better ideas. Penguin, Harmondsworth, Middlesex.
3. Anderson J.R. (1987) Skill acquisition: compilation of weak-method problem solutions. Psychological Review, 94, pp192-210.
4. Anderson R.G. (1974) Data processing and management information systems. MacDonald and Evans, London.
5. Avison D.E. and Fitzgerald G. (1995) Information systems development: methodologies, techniques and tools, 2nd ed., McGraw-Hill, Maidenhead.
6. Bicheno J. (1994) The quality 50, a guide to gurus, tools, wastes, techniques and systems. PICSIE, Buckingham.
7. Checkland P. (1981) Systems thinking, systems practice. Wiley, Chichester.
8. Clark C. (1958) Brainstorming - the dynamic new way to create successful ideas. Doubleday, Garden City, NY.
9. Couger D., Higgins L. and McIntyre S. (1993) (Un)Structured creativity in information systems organisations, MIS Quarterly, December, pp375-397.
10. Couger D. (1995) Creative problem solving and opportunity. Boyd and Fraser, Massachusetts.
11. Crozier R. and Ranyard R. (1999) Cognitive process models and explanations of decision-making. In: Decision-making cognitive models, Ranyard R., Crozier R. and Svenson O. (eds), Routledge, London.
12. de Bono E. (1969) The mechanism of mind. Penguin, Harmondsworth, Middlesex.
13. de Bono E. (1970) The use of lateral thinking. Penguin, Harmondsworth, Middlesex.
14. de Bono E. (1977) Lateral thinking: a textbook of creativity. Penguin, Harmondsworth, Middlesex.
15. de Marco T. (1979) Structured analysis and systems specification, Prentice Hall, Englewood Cliffs, NJ.
16. Downs E., Clare P. and Coe I. (1988) Structured Systems Analysis and Design Method, Prentice Hall, London.
17. Eden C. (1992) Using cognitive mapping for strategic options development and analysis (SODA). In: Rosenhead J. (ed) (1992) Rational analysis for a problematic world, problem structuring methods for complexity, uncertainty and conflict. Wiley, Chichester.
18. Fitzgerald B. (1996) An investigation of the use of systems development methodologies in practice. In: Coelho J. et al. (eds) Proceedings of the 4th ECIS, Lisbon, pp143-162.
19. Fitzgerald B. (1997) The nature of usage of systems development methodologies in practice. In: Avison D.E. (ed), Key Issues in Information Systems, McGraw-Hill, Maidenhead.
20. Flynn D.J. (1992) Information systems requirements: determination and analysis. McGraw-Hill, London.
21. Gabbay D. and Hunter A. (1991) Making inconsistency respectable: a logical framework for inconsistency reasoning, Lecture Notes in Artificial Intelligence, 535, Imperial College, London, pp19-32.
22. Gane C. and Sarson T. (1979) Structured systems analysis, Prentice Hall, Englewood Cliffs, NJ.
23. Gillam B. (1992) The status of perceptual grouping: 70 years after Wertheimer, Australian Journal of Psychology, 44, 3, pp157-162.
24. Groth J. and Peters J. (1999) What blocks creativity? A managerial perspective. Creativity and Innovation Management, 8, 3, pp179-187.
25. Honderich T. (ed) (1995) The Oxford companion to philosophy, OUP, Oxford.
26. Jackson M.A. (1983) Systems development, Prentice-Hall, Englewood Cliffs, NJ.
27. Jantsch E. (1967) Technological forecasting in perspective. A report for the Organisation for Economic Co-operation and Development (OECD).
28. Kahneman D. and Tversky A. (1979) Prospect theory: an analysis of decision under risk. Econometrica, 47, pp263-291.
29. Lederer A. and Nath R. (1991) Managing organizational issues in information system development, Journal of Systems Management, 42, 11, pp23-39.
30. Martin C., Vu H., Kellas G. and Metcalf K. (1999) Strength of discourse context as a determinant of the subordinate bias effect. The Quarterly Journal of Experimental Psychology, 52A, 4, pp813-839.
31. Mayer R.E. (1996) Thinking, problem solving, cognition, 2nd ed. Freeman, NY.
32. Mizuno S. (ed) (1988) Management for quality improvement: the 7 new QC tools. Productivity Press.
33. Mulligan N.W. (1999) The effects of perceptual inference at encoding on organization and order: investigating the roles of item-specific and relational information. Journal of Experimental Psychology, 25, 1, pp54-69.
34. Obolensky N. (1995) Practical business re-engineering; tools and techniques for achieving effective change. Kogan Page, London.
35. Poole M.S. (1990) Do we have any theories of group communication? Communication Studies, 41, 3, pp237-247.
36. Proctor T. (1995) Computer produced mind-maps, rich pictures and charts as aids to creativity. Creativity and Innovation Management, 4, pp43-50.
37. Puccio G. (1999) Creative problem solving preferences: their identification and implications, Creativity and Innovation Management, 8, 3, pp171-178.
38. Rosenhead J. (ed) (1992) Rational analysis for a problematic world, problem structuring methods for complexity, uncertainty and conflict. Wiley, Chichester.
39. Royal Society (1992) Risk analysis perception and management, Royal Society, London.
40. Sauer C. (1993) Why information systems fail: a case study approach, Alfred Waller, Henley.
41. Schneider S.L. (1992) Framing and conflict: aspiration level contingency, the status quo and current theories of risky choice. Journal of Experimental Psychology: Learning Memory and Cognition, 18, pp104-57.
42. Sloman S.A. (1998) Categorical inference is not a tree: the myth of inheritance hierarchies, Cognitive Psychology, 35, pp1-33.
43. Slovic P. and Tversky A. (1974) Who accepts Savage's axiom? Behaviour Science, 19, pp368-373.
44. Stanovich K.E. and West R.F. (1999) Discrepancies between normative and descriptive models of decision-making and the understanding/acceptance principle, Cognitive Psychology, 38, pp349-385.
45. Teigen K.H. (1988) The language of uncertainty. Acta Psychologica, 68, pp27-38.
46. Tversky A. and Kahneman D. (1973) Availability: a heuristic for judging frequency and probability. Cognitive Psychology, 5, pp207-232.
47. Tversky A. and Kahneman D. (1974) Judgement under uncertainty: heuristics and biases. Science, 185, pp1124-1131.
48. Tversky A. and Kahneman D. (1981) The framing of decisions and the rationality of choice. Science, 221, pp453-458.
49. Vollmeyer R., Burns B.D. and Holyoak K.J. (1996) The impact of goal specificity on strategy use and the acquisition of problem structure. Cognitive Science, 20, pp75-100.
50. Von Neumann J. and Morgenstern O. (1944) Theory of games and economic behaviour. Princeton University Press, Princeton.
51. Waddington C.H. (1977) Tools for thought. Paladin Frogmore, St Albans.
52. Wastell D. (1996) The fetish of technique: methodology as a social defence, Information Systems Journal, 6, 1, pp25-40.
53. Wastell D. (1999) Learning dysfunctions in information systems development: overcoming the social defences with transitional objects, MIS Quarterly.
54. Wertheimer M. (1923) Untersuchungen zur Lehre von der Gestalt. Psychologische Forschung, 4, pp301-350.
55. Wynekoop J.L. and Russo N.L. (1995) Systems development methodologies. Journal of Information Technology, Summer, pp65-73.
56. Yourdon E. and Constantine L.L. (1979) Structured design: fundamentals of a discipline of computer program and systems design, Prentice-Hall, Englewood Cliffs, NJ.
Appendix: Developed Classification of Techniques

The classification has six groups:
1. Brainstorming Approaches
2. Relationship Approaches
3. Scenario Approaches
4. Reductionist Approaches
5. Matrix Approaches
6. Conflict Approaches
The classification has been applied to approximately 100 techniques, the results of which are represented in the following table.

Affinity Diagram: This is a brainstorming technique aimed at aiding idea generation and grouping. It seems to be particularly good at identifying commonalities in thinking within the group. Relies heavily on a facilitator to run the session.
Analytic Hierarchy Process: Saaty's AHP uses a (3 level) hierarchy to represent the relationships. Used in analysis of spare parts for manufacturing.
Attribute Association: Works from the premise that all ideas originate from previous ideas (i.e. they are just modified ideas). Based on lists of characteristics or attributes of a problem or product. Each characteristic is changed and the result discussed. (A cross between brainstorming and matrix?)
Association/Images technique: Tries to link and find associations between processes (& items).
Boundary Examination: Defining and stating assumptions about the problem boundary.
Brainstorming: Aimed at idea generation. See also lateral thinking.
Brainwriting (Shared Enhancements Variation): Similar to brainstorming, but gets participants to record ideas themselves.
Bug List: Gets participants to list items that 'bug' them about the system. Aims to get a consensus on what the problem areas are.
Cognitive Mapping: Develops a model of inter-relationships between different features.
Common Cause Failures (CCFs): More an engineering tool to identify common causes for possible failures.
Critical Path Analysis (CPA), Critical Path Method (CPM): See network techniques.
Critical Success Factors (CSF): Looks at the critical factors which will influence the success of an IS or, from a strategic view, all the organisation's IS. It is a matrix type technique with the characteristics of the technique down one axis and the factors on the other axis. Note: this looks like it may also be appropriate to examine Critical Failure Factors.
Cross-Impact Matrices: See Matrix techniques.
Decision Matrices: See Matrix techniques.
Decision Trees: See tree techniques.
Decomposable Matrices: The components of each sub-system are listed and arranged within a matrix and the interactions between elements are weighted. Relationships between components can then be focused on. [A cross between matrix and relationship.]
Delphi: Aims to get a consensus view, or long term forecast, from a group of experts by iteratively polling them. Developed by Helmer & Dalkey at the RAND corporation in the 1960s.
Dimensional Analysis: Aims to explore and clarify the dimensions and limits of a problem/situation. It examines five elements of a problem: substantive, spatial, temporal, qualitative and quantitative dimensions.
External Dependencies: A summary list of external items that affect the project. Obolensky states these "need not be planned or detailed. However, they do need to be summarised to remind the project team that there are activities outside of the project which they need to be aware of".
Fagan Reviews: Effectively just getting a group of peers to critically review an analysis, design or code module.
Failure Modes & Effect Analysis (FMEA): Examines the various ways a product, or system, can fail and analyses what the effect of each fail mode would be.
Fault Tree Analysis: A tree approach to relating potential fault causes.
Five 'Cs' and 'Ps': Checklists of things to consider. (The Cs are Context, Customers, Company, Competition & Costs; the Ps are Product, Place, Price, Promotion, People.)
Five Whys: Invented by Toyota, it is basically developing a questioning attitude, to probe behind the initial given answers. There is also a 'Five Hows' along the same principles. These are very similar to the earlier lateral thinking 'Challenging Assumptions' and the examine stage of a Method Study.
Five Ws and the H: Who-what-where-when-why and how. Brainstorming techniques answering these questions.
Force Field Analysis: Idea generation and list technique to identify 'forces' pulling or pushing towards an ideal situation.
Future Analysis: A technique specifically aimed at IS development, it examines possible future scenarios which an IS would have to operate in.
Gaming, Game Theory: Several gaming techniques to deal with competitive or conflict situations. See also Metagames and Hypergames.
Hazard and Operability Studies (HAZOP): A systematic technique to assess the potential hazards of a project, system or process. Usually associated with the chemical industry.
Hazards Analysis and Critical Control Points (HACCP): Identifies critical points in the work processing which need controls or special attention. Usually associated with production, particularly food production.
Hypergames: A variation on game theory which develops a 'game' from the perspective of the different stakeholders.
Influence Diagrams, Interrelationship Diagrams: Similar to Cognitive Mapping, it generates logical relationships between events or activities.
'Johari' window of knowledge: The technique, named after its inventors (Joe Luft & Harry Ingham), tries to identify areas of understanding and lack of understanding.
Lateral Thinking Techniques: Several techniques, including: The Generation of Alternatives, Challenging assumptions, Suspended judgement, Dominant ideas and crucial factors, Fractionation, The reversal method, Brainstorming, Analogies and Random stimulation. Arguably these types of techniques would be suited to early analysis and problem identification. Equally, some of the techniques could be used in the later stages of systems development. For instance, Fractionation and Challenging assumptions could be used in a design situation. Many of these lateral thinking techniques have been modified and combined to make 'new' techniques.
Maintainability Analysis: Examines the component parts of a system and analyses them, in probability terms, for ease of maintenance. Usually associated with engineering product design.
Markov Chains, Markov Analysis: Uses probability to model different states within a system.
Matrix Techniques, Matrix Analysis: There are several 'matrix' techniques which aim to represent and compare requirements or features in a matrix format. Some techniques use weighting or ranking of the requirements or features.
McKinsey 7 S Framework: A diagnostic tool to identify the interactions within an organisation.
Metagames: A variation of game theory which attempts to analyse the processes of cooperation and conflict between different 'actors'.
Morphological Approaches: This takes a systematic approach to examining solutions to a problem. It does this by identifying the important problem characteristics and looks at the solutions for each of those characteristics. First developed by Zwicky, a Swiss astronomer, in 1942.
Network Techniques: There are several diagramming techniques that can be classed as network techniques. Some, like CPM and PERT, are very quantitative, relying heavily on numbers. Others, like Interrelationship diagrams, rely more on subjective logical relationships or connections.
Nominal Group Technique (NGT): Similar to Brainstorming.
Opposition-Support Map: A representation of opposition and support for particular actions.
Options Matrix: See matrix techniques.
Planning Assistance Through Technical Evaluation of Relevance Numbers (PATTERN): Developed by Honeywell, it is the first large scale application of Relevance Trees to numerical analysis, and makes use of computing support.
Precedence Diagramming Method (PDM) network: Similar to PERT, but has 4 relationships (FS finish - start, SS start - start, FF finish - finish, SF start - finish).
Preliminary Hazard Analysis (PHA)
Program Evaluation and Review Technique (PERT): A networking technique similar to Critical Path Analysis, but addresses uncertainty in calculating the task times.
Rapid Ranking: Technique aims to list and rank the important issues to a problem.
RBO - Rational Bargaining Overlaps: Technique used in negotiating situations.
Relevance Trees, Reliance Trees: Relevance Trees (or Reliance Trees) are sometimes referred to as hierarchical models or systems, probably first proposed by Churchman, Ackoff & Arnoff (1957).
Reliability Networks: These are representations of the reliability dependencies between components of a system. Similar to CPA/PERT type networks, but represent 'dependencies' rather than order of events. Once the networks are drawn, estimates for failure rates of each component can be evaluated. Similar to Relevance/Reliance Trees.
Requirements, Needs and Priorities (RNP): Based on lists and matrices, aims to understand the impact of an application on the organisation prior to development. Top management play a key role.
Risk Assessment / Engineering / Management: Attempts to identify and, where possible, quantify the risks in a project. Usually associated with large scale engineering projects but the principles can be appropriate to smaller scale situations.
Robustness Analysis: The aim is to "keep the options open". It does this by identifying and analysing a range of scenarios and examining which actions are most 'robust' in those scenarios.
Scenario Writing / Analysis: Scenario planning gets participants to consider different possible futures for a particular area of interest.
Shareholder Value Analysis (SVA): Tries to identify the key values and needs of the shareholders and how those needs are currently being met.
Stakeholder Analysis: Tries to understand the needs of the key stakeholders and how those needs are currently being met. It is effectively a range of techniques where different techniques are used for different stakeholder groups, e.g. use VCA for analysing the supplier stakeholder group and SVA for analysing the shareholder stakeholders.
Simulation: The features and workings of a complex situation are simulated. The model can then be changed (either the inputs or workings of the model) to observe what will happen. Good for developing a deeper understanding of the problem area.
Strategic Choice: Aims to deal with the interconnections of decisions/problems. Focuses attention on alternative ways of managing uncertainty.
Strategic Options Development and Analysis (SODA): Though it has a 'strategic' title, it is aimed at getting consensus actions in messy situations.
Soft Systems Methods (SSM): A well known method aimed at problem identification and representing views of a problem from different stakeholders' perspectives - a theme common in many of the subjective techniques.
2
3
4
5
6
* *
*
*
*
*
*
*
*
*
*
*
Macro Analysis of Techniques to Deal with Uncertainty Technique SIL -suggested integration of problem elements Synergistic Contingency Evaluation and Review Technique (SCERT) Systems Failure Method (SFM) SWAT analysis (Strengths, Weaknesses, Opportunities & Threats) Tree Analysis Value Engineering/ Management Value Chain Analysis Wildest Idea
Description
299
Classification groups
1 A German-developed brainstorming technique, gets * participants to write down ideas, then pairs of ideas are compared to integrate and interrogate the ideas. Risk assessment technique used in oil processing installations, power plants and large engineering projects.
2
3
4
5
6
*
Looks at three level of influence: organisation, team and individual. Examines potential failure from these three levels. Generates perceptions of how customers (or others) view the organisation (or problem situation).
*
*
*
See decision trees. Similar to Value Chain Analysis.
*
Analyses the supply chain within an organisation and tries to identify (and usually quantify) when extra ’value’ is added to a product or service. Tries to get people to come up with a wild idea to * address a problem. With this as a starting point the group continue to generate ideas.
*
* *
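As a concrete illustration of the matrix family above (Decision Matrices, Matrix Techniques, Options Matrix), a minimal weighted-scoring sketch in Python follows; the criteria, weights, and option scores are invented for illustration and are not taken from the paper.

# Illustrative weighted decision matrix: each option is scored against
# weighted criteria and the weighted totals are compared. All names and
# numbers below are hypothetical.
weights = {"cost": 0.4, "risk": 0.35, "time to build": 0.25}

options = {
    "package A": {"cost": 3, "risk": 4, "time to build": 5},
    "package B": {"cost": 5, "risk": 2, "time to build": 3},
}

def weighted_score(scores):
    # Higher score means the option is judged better on that criterion.
    return sum(weights[c] * scores[c] for c in weights)

for name, scores in sorted(options.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.2f}")

The ranking is only as good as the subjective weights and scores fed into it, which is exactly the kind of soft, uncertain input these techniques are meant to make explicit.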
The Role of Emotion, Values, and Beliefs in the Construction of Innovative Work Realities

Isabel Ramos1, Daniel M. Berry2, and João A. Carvalho3

1 Escola Superior de Tecnologia e Gestão de Viana do Castelo, Viana do Castelo, Portugal [email protected]
2 Department of Computer Science, University of Waterloo, Waterloo, ON, Canada [email protected]
3 Departamento de Informática, Universidade do Minho, Guimarães, Portugal [email protected]
Abstract. Traditional approaches to requirements elicitation stress systematic and rational analysis and representation of organizational context and system requirements. This paper argues that (1) for an organization, a software system implements a shared vision of a future work reality and that (2) understanding the emotions, feelings, values, beliefs, and interests that drive organizational human action is needed in order to invent the requirements of such a software system. This paper debunks some myths about how organizations transform themselves through the adoption of Information and Communication Technology; describes the concepts of emotion, feeling, value, and belief; and presents some constructionist guidelines for the process of eliciting requirements for a software system that helps an organization to fundamentally change its work patterns.
1 Introduction
Before the 90s, software systems were used mainly for automating existing tasks or for collecting or delivering information. With the rapid development of Information and Communication Technology (ICT), software systems became a driver for innovative work practices and for new models of management and organization [12]. Terms like “globalization”, “knowledge management”, “organizational learning”, “collaborative work”, “value creation”, “extended enterprise”, “client relationship management”, and “enterprise resource planning”, among others, are creating a new understanding of human action in organizations. We are learning that action is enabled, empowered, or extended by ICT. As a consequence, individuals and organizations can now be more creative, flexible, and adaptive. We have more complex and volatile environments and organizations. Change is presented as inevitable. Holistic approaches to change management and the
development of software systems are seen as imperative in order to cope with organizational complexity. Nearly everyone seems to accept that environments in which change emerges or is induced are incredibly complex. Thus, the software systems that are supposed to help organizations adapt or transform themselves are inherently complex. Their essence, their requirements, defy rapid or systematic understanding. Yet, the development of these software systems is still expected to occur in a more orderly systematic fashion, at a well defined point in time, and to be as cheap and quick as possible. The goals that drive the process are often economic and structural. These goals include the improvement of organizational efficiency and effectiveness, the reduction of costs, and the improvement of individual or group performance. The lofty goals notwithstanding, it is very difficult to get these software systems to be used successfully and effectively [27], [18]. People in organizations resist the changes. They resist using the systems, misuse them, or reject them. As a result, the goals are not achieved, intended changes are poorly implemented, and development budgets and schedules are not respected. Misplaced emotions, values, and beliefs are often offered as the causes of these problems. Accordingly, this paper
– debunks some myths about how organizations transform themselves through the adoption of ICT applications;
– describes the concepts of emotion, value, and belief and how they affect development and acceptance of software systems; and
– presents some constructionist guidelines for the process of eliciting requirements for software systems that help organizations to fundamentally change their work patterns.
2 Organizational Transformation Supported by the Adoption of Innovative Software Systems
Organizational transformation (OT) is the process of fundamentally changing an organization’s processes in order to allow it to better meet new challenges. It is often accompanied by the introduction of new software systems that make the new process possible. OT in an organization is often prompted when it begins to consider how it might automate its process. The organization realizes that just automating current processes is a waste of computing resources. The current processes were designed over the years to allow the organization to function in an unautomated, paper-driven environment. Data on paper are often accurate only to the day or longer. Automating current processes maintains these manual, paper-driven processes, when a computer and its software have the potential of providing a highly dynamic, automated, paper-free process with information accurate to the second rather than to the day or longer. OT is connected not only to automating an organization’s processes. Even an organization with fully automated processes may engage in OT. Other triggers
for OT include implementing a new management or business model, adopting a new best practice, desiring to satisfy clients better, creating a new internal or external image, promoting a new social order, obeying environmental rules, fostering collaborative practices, etc. Sometimes, OT is triggered as a consequence of internal political fights.
We must explain the use, in this paper, of the word “transformation” instead of “change”. Both words mean change, in the general sense, but the technical term “transformation” means radical change and the technical term “change” means evolutionary change. Evolutionary change refers to efficiency improvements, local quality improvements, change in procedures, and all kinds of localized change that have minor impact on the overall organization. Transformation implies fundamental changes of meanings and practices relevant to individual workers, groups, or the organization. Transformation means a change of identity. In whatever social order it occurs, it will have a big internal and environmental impact on the organization.

2.1 Rhetoric and Myths about Organizational Transformation
Some authors [7,16,19,23,5,8] present OT as a process that can be planned, managed, and controlled. According to these authors, OT is a rational and controllable process that can be systematically implemented using well-tested methods and techniques to guide it. Consequently, OT can be made predictable, quick, and reasonably cheap. OT is best led by consulting firms that are experts in the field. OT is often directed to the organizational structure: goals and strategies, processes, tasks and procedures, formal communication channels, co-ordination and control of activities, work needs, and authority levels. Finally, OT is expected to have impact on relevant concepts and practices and on political relations. Resistance to transformation of meanings and practices is often expected. This resistance is seen as a problem to solve or minimise as soon as possible. Individuals are expected to adhere to values such as flexibility, creativity, collaboration, and continuous learning. They are expected to be motivated to immediately, effectively, and creatively use the delivered system. Every planned OT is seen initially as positive. In the end, the OT may fail. Since the OT is often justified by economic or political reasons, the failure is considered critical to the organization. Thus, there must be blame for the failure. The failure is often blamed on the leaders of the failed process, the consulting firms that failed to implement it, or the individuals and groups that failed to change. Ethical and moral considerations about the way the process was led and about the obtained results are rarely considered, let alone reported. ICT applications are often seen as drivers of the intended OTs. They are adopted to foster collaborative work, improve organizational learning, make knowledge management effective, and so on. This brief description of the rhetoric surrounding OT processes implicitly exposes several myths about the process and about people as agents and beneficiaries or victims of the transformation.
2.2 Organizations as Separate Entities
We tend to see an organization as a separate entity with its own goals, strategies, potentialities, and constraints. However, an organization is the people that bring it to existence [13]. Goals and strategies emerge from the sense-making processes that continually reshape how an individual perceives herself and the others in the organization. This understanding leads to two main insights:
1. The idea of an organization being a separate entity with its own goals and strategies serves mainly management interests. Traditionally, management responsibilities involve the co-ordination and control of individual and subgroup efforts, in order to guarantee the economic, social, and political success of the organization. The strategy is to limit emotions, interests, values, and beliefs that could reduce the probability of achieving the goals and to implement strategies that management has defined as the best for the organization.
2. Each of us has interests, beliefs, and, sometimes, values that are not in tune with the organizational identity that, maybe, someone else is trying to solidify. Of course, this potential conflict is why participation in decision processes is so important a theme in the social sciences. Nevertheless, when consensus is not possible, there is the possibility of negotiation. There is always the possibility of giving up some interests and beliefs in exchange for other advantages. The imposition of decisions by powerful individuals or groups should be the last resort.
Both negotiation and imposed decisions may lead to the emergence of negative emotions such as frustration, fear, anger, and depression. They may appear on the surface, or they may be held in silence. They may have unpredictable consequences for the development of organizational identity and for organizational success.

2.3 Emotions, Values and Beliefs, and Change
It is useful to define the three concepts used to construct the core ideas in this paper: (1) emotions, (2) values and beliefs, and (3) change.
Emotions. According to Damásio [9], there are two types of emotions, (1) background emotions and (2) social emotions. Background emotions include the sensations of well being and malaise; calmness and tension; pain and pleasure; enthusiasm and depression. The social emotions include shame; jealousy; guilt; and pride. There is a biological foundation that is shared by all these emotions:
– Emotions are complex sets of chemical and neuronal responses that emerge in patterns. Their purpose is to help preserve the life of the organism.
– Even if learning processes and culture are responsible for different expressions of emotions and for attaching different social meanings to them, emotions are biologically determined. They depend on cerebral devices that are innate and founded in the long evolutionary history of life on Earth.
– The cerebral devices upon which emotions depend may be activated without awareness of the stimulus or without the exercise of will. – Emotions are responsible for profound modifications of body and mind. These modifications give rise to neuronal patterns that are at the basis of the feelings of emotion. – When someone experiences an emotion and expresses and transforms it into an image, it can be said that she is feeling the emotion. She will know that the feeling is an emotion when the process of consciousness is added to the processes of emoting and feeling. This complex notion of what are emotions, feelings, and awareness of feelings helps us to understand why OT will never be instrumental, quick, or without high costs. The notions of value and belief also reinforce this understanding. Values and Beliefs. Human values are socially constructed concepts of right and wrong that we use to judge the goodness or badness of concepts, objects, and actions and their outcomes [17]. The beliefs that a person holds about the reality in which he lives define for him the nature of that reality, his position in it, and the range of possible relationships to that reality and its parts [25]. As is easily seen, the physiological and sociological nature of emotions and the fact that values and beliefs are deeply rooted in personal and human history challenges the myth that OTs can be fully planned, managed, and controlled, i.e., instrumentally implemented. OT concepts and practices often lead to radical changes in cherished and long-held beliefs and values. A radical change of the way in which we understand our reality and our roles and actions in it will trigger background and social emotions that need to be carefully dealt with by creating trustful spaces of interaction, patiently over time. An anecdote illustrates this issue. Suppose that Joe dislikes the color blue. No one can force him to like it. He can be forced to show some appearance of liking it, but then there is no transformation in his color preferences, and the forcing only increases his dislike for blue. He could be brainwashed, but brainwashing would hardly be considered an enlightened technique. Joe may be convinced of the advantages of liking blue, thus ensuring his motivation to cooperate with the transformation process. However, not even Joe can guarantee the transformation of his color preferences. Nevertheless, if Joe is motivated to cooperate there are, in effect, some strategies to improve the chances of a successful transformation: – by conjuring emotionally positive experiences from Joe’s past involving blue sensations, e.g., a peaceful, leisurely sunny summer afternoon with a crystalclear blue sky spent with his girlfriend wearing a blue bathing suit, or – by constructing pleasant views of Joe’s future involving blue sensations, e.g, a peaceful, leisurely sunny summer afternoon with a crystal-clear blue sky spent with his girlfriend wearing a blue bathing suit. This anecdote shows also that an OT process can never be without a high expenditure of the resources needed to improve the chances of making it a success [29].
From everything said so far, it becomes clear also that resistance to change is natural in human beings. Because a transformation’s implementers are as human as the target group, they also need to find the roots of their own resistance. That is, when the implementers are trying to minimise resistance, they may end up resisting the arguments by the process subjects that suggest changes in the implementers’ thinking, strategies, and plans.
Change. Nowadays, there are many mythological OTs fostered by management and ICT gurus. In the name of so-called best practices, there is little consideration for their ethical and moral implications. The implementation of complex systems, such as Enterprise Resource Planning systems, is rarely preceded by considerations about [4], [30], [41], [34], [20]:
– the system’s degradation of the employees’ quality of work life, by reducing job security and by increasing stress and uncertainty in pursuing task and career interests;
– the system’s impact on the informal communication responsible for friendship, trust, feeling of belonging, and self respect;
– the power imbalances the system will cause; and
– the employees’ loss of work and life meaning, which leads to depression.

2.4 Summary
In summary, this section has addressed some myths about organizational transformation in order to advance the idea that OT that challenges meaning structures is difficult, resource consuming, and influenced by emotions in situations that require trust between the participants of the OT process [6], [3], [24]. Because most actual OTs draw with them the adoption of complex software systems that support new work concepts and practices, the elicitation of the requirements of those systems must include the understanding of the involved emotions, values, beliefs, and interests. The next section presents a constructionist perspective [21], [33] of requirements elicitation that takes into account emotions, values and beliefs, and change. Some general guidelines are offered to understand the structural, social, political, and symbolic work dimensions [4] in which values, beliefs, and interests are expressed. The section also includes guidelines for reading the emotions that elicitors and participants express in the informal and formal dialogues that occur during the process.
3 Requirements Elicitation
Traditionally, requirements engineering assumes a strong reality [11], [10], [39], [28], [40], [2], [22], [37], [42], [31]. The requirements engineer elicits information from this strong reality and proceeds systematically to a requirements specification.
3.1 Socially Constructed Reality
Deviating from this tradition and viewing reality as socially constructed implies several epistemological and methodological assumptions, including [38], [1]:
1. Reality is constructed through purposeful human action and interaction.
2. The aim of knowledge creation is to understand the individual and shared meanings that define the purpose of human action.
3. Knowledge creation is informed by a variety of social, intellectual, and theoretical explorations. Tools and techniques used to support this activity should foster such explorations.
4. Valid knowledge arises from the relationship between the members of some stakeholding community. Agreements on validity may be the subject of community negotiations regarding what will be accepted as truth.
5. To make our experience of the world meaningful, we invent concepts, models, and schemes, and we continually test and modify these constructions in the light of new experience. This construction is historically and socio-culturally informed.
6. Our interpretations of phenomena are constructed upon shared understandings, practices, and language.
7. The meaning of knowledge representations is intimately connected with the authors’ and the readers’ historical and social contexts.
8. Representations are useful if they emerge out of the process of questioning the status-quo, in order to create a genuinely new way of thinking and acting.
9. The criteria by which to judge the validity of knowledge representations include that the representations [26]
– are plausible for those who were involved in the process of creating them,
– can be related to the individual and shared interpretations from which they emerged,
– express the views, perspectives, claims, concerns, and voices of all stakeholders,
– raise awareness of one’s own and others’ mental constructions,
– prompt action on the part of people involved in the process of knowledge creation, and
– empower that action.
The social construction of reality emerges from four main social processes: subjectification, externalization, objectification, and internalization [1]. Subjectification is the process by which an individual creates her own experiences. How an individual interprets what is happening is related to the reality she perceives. This reality is shaped by her subjective conceptual structures of meaning. Externalization is the process by which people communicate their subjectifications to others, through a common language. By making something externally available, we enable others to react to our previously subjective experiences and thoughts. By means of this communication, humans may transform the original content of a thought and formulate another that is new, refined, changed or
developed. The mutual relation with others is dialectical and leads to continuous reinterpretation and change of meanings. Surrounding reality is created by externalization. Objectification is the process by which an externalized human act might attain the characteristic of objectivity. Objectification happens after several reflections, reinterpretations, and changes in the original subjective thoughts, when the environment has generally started to accept the externalization as meaningful. This process can be divided into phases: institutionalization and legitimization. Internalization is the process by which humans become members of the society. It is a dialectic process that enables humans to take over the world in which others already live. This is achieved through socialization occurring during childhood, and in learning role-specific knowledge and the professional language associated with it.

3.2 A Constructionist Perspective of Requirements Elicitation
These core ideas have implications for practice of requirements engineering. Specifically for requirements elicitation, which is the focus of this paper, these implications are summarized in Table 1, found after the bibliographical references. This table works on three subprocesses of requirements elicitation: 1. the creation of knowledge about the current work situation, perceived problems or expectations, and the vision of a new work situation that includes the use of a software system that supports or implements innovative work concepts and practices; 2. the representation of the created knowledge; and 3. the joint invention by all stakeholders of requirements for a system that acceptably meets all stakeholder’s needs, expectations, or interests. These subprocesses, of course, are interconnected processes that are described here independently to simplify their analyses. The table has one column for each of these subprocesses. The rows represent the constructionist perspective on project goals; the process structure; the final product; the use of theoretical frameworks; methods, techniques, and tools; the role of the participants; and the reuse of previous product. According to the constructionist perspective, knowledge is a social product, actively constructed and reconstructed through direct interaction with the environment. In this sense, knowledge is a real-life experience. As such, it is personal, sharable through interaction, and its nature is both rational and emotional. Knowledge representation is intimately connected with the knower–teacher and the learner. Knowledge representation is never complete or accurate since it can never replace the experience from which it is derived. However, a knowledge representation can be useful if it makes ideas tangible and enables communication and the negotiation of meanings. A system requirement is a specific form of knowledge representation.
Table 1. Practical implications for elicitation of constructionist assumptions

Goals
– Knowledge creation: Understand (1) human action and interaction that will be supported by the software system and (2) the meanings behind that action. Question and re-create those meanings.
– Knowledge representation: Express a multivoiced account of the reality that we construct socially. It includes the voice of the elicitor and all stakeholders of the system.
– Requirements invention: Reinvent the work reality through the adoption of a software system.

Process structure
– Process structure is the result of the joint effort of system’s stakeholders and elicitors for emancipation, fairness, and community empowerment. Its shape is situational, i.e., it varies with organizational history and culture, and resources involved.

Product
– Knowledge creation: Reformulation of mental constructions, recreation of shared meanings, awareness of contradictions and paradoxes of concepts and practices. Development of a common and local language to express feelings, perceptions and conceptions.
– Knowledge representation: Expression of individual and shared experience.
– Requirements invention: Shared interpretations of adequate support of work. Cannot be disconnected from historical and social contexts of requirements creators.

Theoretical frameworks
– Inform the process with the values and beliefs held by elicitors and the system’s stakeholders.

Methods, techniques, and tools
– Knowledge creation: Inform the process with the values held by elicitors and the system’s stakeholders. Help create graphical and textual elements of a common language.
– Knowledge representation: Define the organization of knowledge representations.
– Requirements invention: Define the format in which requirements are expressed.
– Have the potential of bias towards some stakeholders’ voices and of forcing a foreign language.

Role of participants
– Knowledge creation: Co-creators of knowledge, jointly nominate the questions of interest.
– Knowledge representation: Co-creators of a language to represent knowledge, jointly design outlets for knowledge to be shared more widely within and outside the site.
– Requirements invention: Co-inventors of a common future.

Reuse of product
– Knowledge creation: Created knowledge is local, transferable only for sites where people have similar experiences and beliefs.
– Knowledge representation: Representations are connected with the context in which they were created. If transposed to a different location, they may invoke different mental constructions in readers.
– Requirements invention: Reuse of stakeholders’ requirements is problematic because of their historical and sociocultural dimension.

3.3 Integration of Organizational Theory into Requirements Elicitation
Recently, a number of authors, e.g., Bolman and Deal [4], Morgan [30], and Palmer and Hardy [32], attempted systematizations of organizational theory. Ramos investigated the usefulness of integrating this organizational theory into
the requirements engineering process [35]. She described the importance of the structural, social, political, and symbolic dimensions of work in determining requirements. One result of this work is a set of guidelines for understanding the meaning of human action and interaction. These guidelines are summarized in Tables 2 and 3, found after the bibliographical references.

Table 2. Work aspects that should guide the choice of participants

Structural. Participants should be representative of: formal roles; tasks; skills; levels of authority; accessed/produced information.
Social. Participants should be representative of: communication skills; negotiation skills; informal roles; degrees of motivation to change work practices; participation in the shaping of organizational history; willingness and experience in decision making processes; professional status; knowledge.
Political. Participants should be representative of: individual interests; form of power held: organizational authority, control of scarce resources, control of the definition of formal arrangements, restricted access to key information, control of organizational borders, control of core activities, member of a strong coalition, charisma.
Symbolic. Participants should be representative of: use of jargon; use of proverbs, slogans or metaphors; relevant beliefs and superstitions; use of humor; story telling; responsibilities for symbolic events; ways of instigating social routines and taken-for-granted techniques to perform a task; ways of conceiving the work space.
Table 2 helps decide which stakeholders should be consulted during elicitation, that is, which participants should be chosen to represent the various work dimensions. For each dimension of work, the table lists the properties of the chosen individuals that must be considered. Table 3 shows for each dimension of work, the human actions and interactions that can be relevant to requirements.
Table 3. Dimensions of human action in organizations

Structural: Relevant organizational goals, objectives, and strategies; tasks, processes, rules, regulations, and procedures; communication channels and exchanged information; coordination and control; formal roles; how authority is distributed; needs of system support to work; relevant organizational and technological knowledge to be able to perform tasks.
Social: Shared goals and objectives; performance expectations; rewards or punishments for performance; motivation factors; informal roles and communication; personal knowledge and its impact on work concepts, practices, and relationships; fostered participation in decision making; use of individual and group skills.
Political: Personal interests relating to performed tasks, career progression, and private life; coalitions; individual or group power plays; conflict of interests; negotiation processes: concepts and practices.
Symbolic: Symbols used to deal with ambiguity and uncertainty; shared values and beliefs; common language; relevant myths, stories, and metaphors; rituals and ceremonies; relevant messages about the organization, work, or system stakeholders; legitimized ways of expressing emotions.
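To suggest how Tables 2 and 3 might be operationalized, the following minimal Python sketch (hypothetical participant profiles and dimension tags, not part of the authors' guidelines) checks whether a candidate participant set covers all four work dimensions before elicitation begins.

# Hypothetical coverage check for the four work dimensions of Tables 2 and 3.
DIMENSIONS = {"structural", "social", "political", "symbolic"}

# Each participant is tagged with the dimensions of work they can speak for;
# the names and tags are invented for illustration.
participants = {
    "line manager":    {"structural", "political"},
    "senior operator": {"social", "symbolic"},
    "union delegate":  {"political", "social"},
}

covered = set().union(*participants.values())
missing = DIMENSIONS - covered
print("all dimensions covered" if not missing else f"missing: {sorted(missing)}")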
3.4 Towards a Constructionist Requirements Elicitation Process
During requirements elicitation, all created knowledge should be represented and continually consulted about how previous and actual historical, social, and cognitive experiences have been shaping the process of its creation. While creating the knowledge elements included in Tables 2 and 3, elicitors and the system’s stakeholders participate in conversations. In these conversations, the processes of subjectification, externalization, objectification, and internalization are occurring continually, and their interplay creates a common reality for elicitors and stakeholders. Logic and emotion, awareness and unawareness, explicit and tacit are everpresent elements in the interactions, shaping thinking and action. Emotions, feelings, unconscious experience, and knowledge can be accessed only indirectly
through the actions and reactions of the participants in requirements elicitation and through their use of language in its most general sense [20]: – vocal characterizers (noises one talks through, e.g., laughing, whispering, yelling, crying); – vocal segregates (sounds used to signal agreement, confusion, understanding, e.g., “hmm-hmm”, “Huh?!”, “Ah!”, “Nu?!”); – voice qualities (modifications of language and vocalizations, e.g., pitch, articulation, rhythm, resonance, tempo); – idiom (dialect, colloquialism, jargon, slang); – discourse markers (“well”, “but”, “so”, “okay”); – stylistic devices (use of repetition, formulatic expressions, parallelism, figurative language, onomatopoeia, pauses, silences, mimicry); – facial expressions (smile, frown, furrowed brow); – gestures (nodding, arm, motions, hand movements); – shifts in posture; – alterations in positioning from intimacy (touching) to social or public distance; – performance spaces (an allocated room or impromptu meeting in a corridor, rearranged seating, etc.); – props (especially for ceremonial oratory); and – clothing, cosmetics, and even fragrance. During the knowledge construction process, the elicitors should reflect critically on themselves as practitioners. This reflection has mainly three dimensions: 1. What theories and practical experience has been shaping our practice as elicitors? What are the alternatives? Why should we stick to our usual ways of thinking and acting? 2. What frameworks will we be using to guide our actions in the present situation? Why? What goals will guide the interaction with members of this community? What ethical considerations are we assuming? 3. How effective is our communication with the system’s stakeholders? What feelings have been present in interactions with them? What have we learned? In which way are our and others’ understandings and practice changing? These guidelines are derived from case studies carried out by the first author for her Ph.D. dissertation [35], [36]. Two case studies were carried out in order to identify what needs, expectations, and beliefs were sustaining specific OTs in which ICT applications (1) were being adopted to foster use of practices that the senior management thought to be the best and (2) were being locally developed in opposition to work concepts and practices that senior management thought were best. In each case, the OT was carried out successfully.
4 Conclusions and Future Work
This paper was written with the primary aim of addressing the implications of emotions, values, beliefs, and interests in the conception and adoption of software
systems that support new work realities. The secondary aim of the paper is to advance some general guidelines to understand the emotions, values, beliefs, and interests relevant to requirements’ elicitation. The approach to requirements elicitation implicit in the guidelines is lengthy and resource intensive. The transformation of values, beliefs and interests, and the emotions and feelings attached to them is difficult and uncertain. It requires patience and trust. At the end of a successful OT process that includes the adoption of ICT applications, stakeholders and requirements engineers will find themselves transformed in some way. In a joint effort, they will have conceived the support of a new work reality that will be implemented. This new reality must be nurtured until it solidifies close to the way it was originally envisioned. Addressing only the structural, political, and economic aspects of the process would mean to ignore that emotions and feelings are present even in our most rational and objective decisions [9]. In the elicitation of requirements, emotions and feelings are present in the choice of the problem to address, the choice of techniques and tools to gather information about business goals and work practices, the choice of stakeholders and the needs they express, the abstractions and partitions of reality, the knowledge we find relevant, the requirements we elicit, and the formats in which we choose to represent knowledge and requirements. In future research, the guidelines will be made more detailed so that engineers can choose the ones they will integrate into their preferred methods for elicitation. It is already planned to do several cases studies in which, by studying the implementation of the same ready-to-use package of software in different organizations, the differences in the historical and socio-cultural backgrounds will be mapped into differences in the implementations. The basic assumptions of the constructionist perspective, from which Table 1 is derived, are already implicitly integrated into the Soft Systems Methodology (SSM). Authors in requirements engineering have been emphasising the interconnectedness of science, society, and technology [15] and the relevance of ethnographic techniques for eliciting requirements in their context [28], [14]. However, few specific guidelines have been provided to deal with the impact of emotions, beliefs, and values of the whole team involved in a requirements elicitation. There is also a shortage of guidelines to help elicit emotions, beliefs, and values from the visible and shared constructions of human action and interaction that occur in organizations. Finally, almost no ideas have been provided to structure requirements elicitation around the social dynamics of a learning process. In the future, the authors intend to develop an approach that will structure requirements elicitation around the four processes that mold socially created realities and that will make use of the above guidelines and of strategies to effectively influence the transformation of emotions, values, and beliefs. An initial version of this approach has already been developed and tested in the field, but it needs to be improved in future action research projects. The authors do not intend to invent new techniques or a new method to guide requirements
elicitation. Rather, they intend to provide a general framework in which existing methods and techniques could be integrated or reconstructed.
References 1. Arbnor, I., Bjerke, B.: Methodology for Creating Business Knowledge. Sage, Thousand Oaks, CA (1997) 2. Berry, D.M., Lawrence, B.: Requirements Engineering. IEEE Software 15:2 (March 1998) 26–29 3. Boje, D.M., Gephardt, R., Thatchenkery, T.J.: Postmodern Management and Organization Theory. Sage, Thousand Oaks, CA (1997) 4. Bolman, L.G., Deal, T.E.: Reframing Organizations: Artistry, Choice, and Leadership. Second Edition. Jossey-Bass, San Francisco, CA (1997) 5. Burke, W.W.: Organization Change: What We Know, What We Need to Know. Journal of Management Inquiry 4:2 (1995) 158–171 6. Cialdini, R.B.: Influence: Science and Practice. Harper Collins College, New York, NY (1993) 7. Cummings, T.G., Worley, C.G.: Essentials of Organization Development and Change. South-Western College Press, Mason, OH (2000) 8. Dahlbom, B., Mathiassen, L.: Computers in Context: The Philosophy and Practice of Systems Design. Blackwell, Oxford, UK (1993) 9. Dam´ asio, A.: The Feeling of What Happens: Body and Emotion in the Making of Consciousness. Harcourt Brace, New York, NY (1999) 10. Davis, A., Hsia, P.: Giving Voice to Requirements. IEEE Software 11:2 (March 1994) 12–16 11. Davis, A.M.: Software Requirements: Analysis and Specification. Prentice-Hall, Englewood Cliffs, NJ (1990) 12. Dickson, G.W., DeSanctis, G.: Information Technology and the Future Enterprise: New Models for Managers. Prentice Hall, Englewood Cliffs, NJ (2000) 13. Espejo, R., Schuhmannn, W., Schwaninger, M., Bilello, U.: Organizational Transformation and Learning: A Cybernetic Approach to Management. Jossey-Bass, Chicester, UK (1996) 14. Goguen, J.A., Jirotka, M.: Requirements Engineering: Social and Technical Issues. Academic Press, London, UK (1994) 15. Goguen, J.A.: Towards a Social, Ethical Theory of Information. In: Bowker, G., Gasser, L., Star, L., Turner, W.: Social Science Research, Technical Systems and Cooperative Work. Erlbaum, Mahwah, NJ (1997) 27–56 16. Greenwood, R., Hinings, C.R.: Understanding Strategic Change: the Contribution of Archetypes. Academy of Management Journal 36:5 (1993) 1052–1081 17. Hirschheim, R., Klein, H.K., Lyytinen, K.: Information Systems Development and Data Modeling: Conceptual and Philosophical Foundations. Cambridge University Press, Cambridge, UK (1995) 18. Iivari, J., Hirschheim, R., Klein, H.K.: A Paradigmatic Analysis Contrasting Information Systems Development Approaches and Methodologies. Information Systems Research 9:2 (1998) 164–193 19. Jick, T.D.: Accelerating change for competitive advantage. Organizational Dynamics 24:1 (1995) 77–82 20. Jones, M.O.: Studying Organizational Symbolism. Sage, Thousand Oaks, CA (1996)
21. Kafai, Y., Resnick, M.: Constructionism in Practice: designing, thinking, and learning in a digital world. Erlbaum, Mahwah, NJ (1996) 22. Kotonya, G., Sommerville, I.: Requirements Engineering. John Wiley & Sons, West Sussex, UK (1998) 23. Kotter, J.P.: Leading Change. Harvard Business School Press, Cambridge, MA (1996) 24. Kramer, R.M., Neale, M.A.: Power and Influence in Organizations. Sage, Thousand Oaks, CA (1998) 25. Lincoln, Y.S., Guba, E.G.: Competing Paradigms in Qualitative Research. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (1994) 105–117 26. Lincoln, Y.S., Guba, E.G.: Paradigmatic Controversies, Contradictions, and Emerging Confluences. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (2000) 163–188 27. Lyytinen, K., Mathiassen, L., Ropponen, J.: Attention Shaping and Software Risk—A Categorical Analysis of Four Classical Risk Management Approaches. Information Systems Research 9:3 (1998) 233–255 28. Macaulay, L.A.: Requirements Engineering. Springer, London, UK (1996) 29. Marion, R.: The Edge of Organization: Chaos and Complexity Theories of Formal Social Systems. Sage, Thousand Oaks, CA (1999) 30. Morgan, G.: Images of Organization. Sage, Thousand Oaks, CA (1997) 31. Nuseibeh, B., Easterbrook, S.: Requirements Engineering: A Roadmap. In: Finkelstein, A.: The Future of Software Engineering 2000. ACM, Limerick, Ireland (June 2000) 32. Palmer, I., Hardy, C.: Thinking about management. Sage, Thousand Oaks, CA (2000) 33. Papert, S.: Introduction. In: Harel, I.: Constructionist Learning. MIT Media Laboratory, Cambridge, MA (1990) 34. Parker, S., Wall, T.: Job and Work Design: Organizing Work to Promote WellBeing and Effectiveness. Sage, Thousand Oaks, CA (1998) 35. Ramos, I.M.P.: Aplica¸co ˜es das Tecnologias de Informa¸c˜ ao que suportam as dimens˜ oes estrutural, social, pol´ıtica e simb´ olica do trabalho. Ph.D. Dissertation Departamento de Inform´ atica, Universidade do Minho, Guimar˜ aes, Portugal (2000) ´ Computer-Based Systems that Support the Structural, 36. Santos, I., Carvalho, J.A.: Social, Political and Symbolic Dimensions of Work. Requirements Engineering 3:2 (1998) 138–142 37. Robertson, S., Robertson, J.: Mastering the Requirements Process. AddisonWesley, Harlow, England (1999) 38. Schwandt, T.A.: Three Epistemological Stances for Qualitative Inquiry: Interpretivism, Hermeneutics, and Social Constructionism. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (2000) 189–213 39. Siddiqi, J., Shekaran, M.C.: Requirements Engineering: The Emerging Wisdom. IEEE Software 9:2 (March 1996) 15–19 40. Sommerville, I., Sawyer, P.: Requirements Engineering, A Good Practice Guide. John Wiley & Sons, Chichester, UK (1997) 41. Spector, P.E.: Job Satisfaction: Application, Assessment, Causes, and Consequences. Sage, Thousand Oaks, CA (1997) 42. van Lamsweerde, A.: Requirements Engineering in the Year 00: A Research Perspective. Proceedings of 22nd International Conference on Software Engineering. ACM, Limerick, Ireland (June 2000)
Managing Evolving Requirements Using eXtreme Programming Jim Tomayko Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA [email protected]
Abstract. One problem of moving development at “Internet speed” is the volatility of requirements. Even in a supposedly stable project like that described here, requirements change as the client sees “targets of opportunity.” That is one of the unintended side effects of having the client on-site frequently, although it does increase user satisfaction because they are not prevented from adding functionality. This paper is an account of using an agile method, eXtreme Programming, to survive and manage rapid requirement changes without sacrificing quality.
1 Introduction
One of the most prevalent problems in software development is changing requirements, either because all of the requirements are unknown at the beginning of a project, or the clients simply changed their minds during its course, or some combination of the two. The way that requirements are managed in eXtreme Programming (XP), and other “agile” or “lightweight” development processes can ameliorate the effects of requirements uncertainty. In fact, the strongest undercurrent of these methods is the phrase “embrace change.” As Jim Highsmith and Martin Fowler have written [3]: “For a start, we don’t expect a detailed set of requirements to be signed off at the beginning of the project; rather, we see a high-level view of requirements that is subject to frequent change.” This paper tries to show how XP can adjust the development process to keep up with most changes.
2 eXtreme Programming
XP is one of the growing number of lightweight methods now becoming popular for software development [1]. An initial glance at XP reveals places where its processes can be extended by the addition of selected practices from more heavyweight methods (Table 1). Note that the only XP practice that cannot be extended is the 40-hour week. Perhaps the greater than 40-hour week is a direct result of requirements evolution. Actually the “40-hour week” is a metaphor for “the developers are alert and rested.”
The other requirements-oriented practices of XP can essentially be the means of preserving a normal working load.

Table 1. XP processes and related standard practices.

Planning Game: Iterative estimation; COCOMO II
Small Releases: Rapid Application Development
Metaphor: Problem Frames, Prototypes
Simple Design: Software Architectural Styles
Testing: Statistical Testing
Refactoring: Software Architectural Styles
Pair Programming: Inspections
Collective Ownership: Open Source
Continuous Integration: Continuous Verification
40-hour Week: (none; cannot be extended)
On-site Customer: Use Cases
Coding Standard: Personal Software Process
Requirements per se are not mentioned in the list of XP practices. However, the XP practices of metaphor, simple design, refactoring, on-site customer, testing, collective ownership, and continuous integration are all requirements related. This paper discusses how these XP practices can be used to control requirements evolution. Along the way we will point out where we can use the fundamental XP values of simplicity, communication, feedback, and courage.

2.1 Metaphor
Metaphor n. a figure of speech containing an implied comparison, in which a word or phrase ordinarily and primarily used of one thing is applied to another [12]. This definition is expanded in XP to encompass the entire initial customer and developers’ understanding of the system’s story. As such, it is a substitute for the architecture, which keeps development focused [1]. Therefore, as the metaphor is better understood, so are the requirements. For example, let the “Voyager probe computer network” be a metaphor for the system we want to build. We will not implement the system as a duplicate of Voyager’s configuration. We will just match its functionality. Similarly, the use of Michael Jackson’s Problem Frames is a way of fleshing out the metaphor [5]. An example is given in [10] and one section is paraphrased here. Let us say that we have the following user story:

A probe has a Command Computer, Attitude Control Computer and a Data Processing Computer. It also has three experiment computers, each of which has a limited version of the software residing in a primary computer as a backup. If one of the primary computers fails, the backup software will act as the primary until it can determine if there are sufficient resources to either run a more robust version of the software, or that version will have to be spread over several processors.
There is a problem frame for a controller that implements some Required Behavior. There is another that commands Controlled Behavior. In an earlier version of Jackson’s work, these two were combined in the Control Frame. However, in the spacecraft example, they are more effective when separated (Figures 1 and 2).
Fig. 1. Required behavior. (Problem frame: Control Machine = Attitude Control Computer; Controlled Domain = Attitude Jets, Star Tracker; Required Behavior = 3-axis Inertial Orientation.)
The important thing to note when applying these problem frames is that their proper use requires filling out the metaphor “Voyager computer system.” There is no mention of thrusters, star trackers, or 3-axis inertial orientation in the original user story. It is only when completing the domains that these come up. However, these are hardly requirement changes. They are more like requirement refinements. But, they have the same effect, and coupling problem frames to the metaphor discovers them earlier and simplifies the evolution of requirements (Figure 1).
Fig. 2. Controlled behavior. (Problem frame: Control Machine = Command Computer; Controlled Domain = Maneuver Thrusters, Experiment Processors, Heartbeat; Commanded Behavior = Course Corrections, Experiment On/Off, Fault Tolerance Management; Operator = Ground Controller.)
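To make the two frames concrete, the following minimal Python sketch (assumed field and instance names, not taken from [5] or [10]) records each frame as plain data; writing the frames down is what forces the team to name domain elements such as the attitude jets and star tracker that the bare metaphor leaves implicit.

# Minimal data representation of the two problem frames in Figures 1 and 2.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProblemFrame:
    control_machine: str
    controlled_domain: List[str]
    behavior: str                   # required or commanded behavior
    operator: Optional[str] = None  # only the commanded-behavior frame has one

required_behavior = ProblemFrame(
    control_machine="Attitude Control Computer",
    controlled_domain=["Attitude Jets", "Star Tracker"],
    behavior="3-axis inertial orientation",
)

commanded_behavior = ProblemFrame(
    control_machine="Command Computer",
    controlled_domain=["Maneuver Thrusters", "Experiment Processors", "Heartbeat"],
    behavior="Course corrections, experiment on/off, fault-tolerance management",
    operator="Ground Controller",
)

if __name__ == "__main__":
    for frame in (required_behavior, commanded_behavior):
        print(frame)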
In contrast, consider naming a metaphor that does not match: financial software is called a “checkbook,” but it maintains a budget, investments, and several accounts, besides allowing check writing. Perhaps “accountant” would be better. The original metaphor is quite limiting.
2.2 Simple Design and Refactoring
Simple design and refactoring are discussed together because following the intent of the former makes the latter easier. Regardless of whether refactoring (redesigning) is used, simple designs fit the primary XP goal of simplicity. This enables the value of courage, as it gives the client the courage to ask for something new that may have just come to light and it gives the developers the courage to add functionality. Basically, this is the courage to change, a central tenet of requirements evolution. Simple designs are the product of much work at the front end. When XP developers start exploring the solution space, they derive simple designs by keeping modularity and abstraction prevalent through object-orientation. Refactoring for the long view (i.e. global variables versus local, variables versus constants, classes and object instantiations versus individual objects, etc.) results in simplifications and reuse. All are capable of making the adding of requirements graceful. “Simplicity” is not another word for “poor” or “haphazard.” Simple designs facilitate change. Fowler and Highsmith again: “Agile approaches emphasize quality of design, because design quality is essential to maintaining agility.” [3]

2.3 On-Site Customer
Perhaps the strongest positive influence on evolving requirements and uncertainty is the XP practice of having the customer present. Many times their presence can be handy in simplifying the product. Once when a team was implementing the spacecraft software discussed above, they were having trouble re-booting processors that had failed. The client pointed out that a failed processor would have a low likelihood of a restart in space conditions. This seemingly obvious information changed the requirement for fault detection and tolerance when the team was about to go into the agony of implementing the requirement as they saw it. It can be that the chief means of controlling uncertainty over requirements is to have a representative of the group that will use the software present to say yea or nay before any change. This avoids travel down false paths and shortens the development lead-time of new functions.

2.4 Testing
One of the XP values is feedback. This is assured to be accomplished because XP software development is predicated on testing, and passed tests tell the developers whether they have successfully implemented a function. Testing is also a way of evolving requirements, as a test must exist before a function is added or changed. If it is impossible to develop the test, it is probably impossible to implement the change. Therefore, one check upon requirements expansion is this need to develop a test for any implementation. If it all works out and there is an adequate code/test pair that proves the added functionality and does not add complexity, as long as there is budget to cover it, who will care?
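A minimal sketch of this test-first check, with hypothetical function and timeout values rather than anything from the project described later, might look as follows: the test that encodes the new behaviour is written before the behaviour itself, so a requirement that cannot be tested is exposed immediately.

# Test-first sketch: the tests are written before the function they constrain.
import unittest

def node_status(last_heartbeat_age, timeout=5.0):
    """Assumed implementation, written only after the tests below existed."""
    return "FAILED" if last_heartbeat_age > timeout else "OK"

class TestHeartbeatTimeout(unittest.TestCase):
    def test_node_marked_failed_after_timeout(self):
        self.assertEqual(node_status(last_heartbeat_age=6.0, timeout=5.0), "FAILED")

    def test_node_ok_within_timeout(self):
        self.assertEqual(node_status(last_heartbeat_age=1.0, timeout=5.0), "OK")

if __name__ == "__main__":
    unittest.main()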
2.5 Collective Ownership

One problem of evolving requirements is distributing their implementation according to developer expertise. Often the wrong engineer is assigned an implementation because of management ignorance. Collective ownership prevents this by permitting developers to choose which components to build and which to avoid. It can also prevent developers from learning new things, but training is not an essential part of the process. The principle of collective ownership allows engineers both to gravitate toward their areas of expertise and to fix a naïve implementation. In this way they can help keep the effects of evolving requirements under control. Collective ownership also means more, and hopefully better, refactorings and a simpler design, since everyone must be able to understand them.

2.6 Continuous Integration

Continuous integration is a powerful technique, and not only in XP; Microsoft practices it, and other organizations have tried to copy its success. Microsoft reportedly has a rule that the day is not over until the software under development is successfully compiled. That is difficult to believe in the case of operating systems, but not for most applications using modern compilers. The point is that the software is in a constant state of pseudo-completion. If it is developed in the XP fashion, it already represents some value to the customer. Adding or changing a requirement does not affect its availability to perform the baseline application.
3 A Case Study

In order to see how these XP practices and values are applied to evolving requirements, the story of how a team developed an application in an atmosphere of uncertainty is illustrative. The XP team was two pairs. Its job was to build a simulation of a deep-space craft’s computer systems for a study of fault tolerance. Previously, fault tolerance was often accomplished by redundancy [9]. The problem with that method is that the redundant hardware constitutes an additional drain on resources. The chief constraints on spacecraft are size, power, and weight; eliminating redundant hardware would benefit all three.

During a meeting of the High Dependability Computing Consortium (HDCC) in early 2001 at the National Aeronautics and Space Administration’s (NASA) Ames Research Center, it occurred to me that, thanks to Moore’s Law, even the relatively old processors likely to be chosen for a probe are more powerful than the experiments truly require. Therefore, the primary command, attitude, and data computers could be backed up by a kernel process running in the experiment computers, instantiated if one or more of the primary computers fail. The primary computers, even on spacecraft that have been in flight for decades, have never failed; thus it makes sense to try this scheme. The problem was cast as that of a Voyager-type spacecraft rather than the reconfigurable software on Galileo [11]. Not every detail of the requirements was delivered prior to beginning development. The first thing that the XP team did was to use the metaphor and problem frames to make a reasonably correct prototype.
The prototype addressed communication between computers and processes. This proved to be possible, so the substance of the metaphor was sound: the primary computers could communicate with the others. Until this was determined to be possible, nothing could really be done. The overall story was written and divided into a series of small stories on cards. The client, who was available throughout development, ordered the stories into something that would deliver value with each cycle of development. Essentially, this was three cycles: all the functionality of the command computer, all that of the attitude control computer, and finally that of the data computer. The developers seemed relieved by this ordering, as the most difficult task of the entire system is redundancy management.

The pairs initially thought that, since redundancy management is the job of the command computer, any logic for that purpose must reside on that computer. It turned out to be quite easy to send a “heartbeat” to the other computers. If they did not respond to it in a fixed amount of time, the node was considered failed. However, a failed computer could not be restarted unless it was completely rebooted. The client saw this as a misunderstanding: since this was a simulation of a deep-space craft, there would be no possibility of repair, so the reboot capability was not needed. When the client removed this functionality, the team could reconsider the direction of the fault tolerance signal. Now the peripheral computers could send a heartbeat to the command computer; when a time-out occurred, the offending computer was declared failed. The kernel process for the command computer only did the heartbeat. The developers realized that once the command computer and its kernel were figured out, development was essentially over.

Near that point, they offered the code to the client for refactoring. Normally, the client would have been involved directly in helping to shape the functionality in general, not the details. However, this client had some expertise and was allowed to exercise the privileges of refactoring and common ownership. This sort of relationship can be common in aerospace, since the prime contractor usually has some experience in the field. It turned out the refactorings were delivered to the developers as suggestions for change, so the client did not have to learn the development environment or configuration management system; the developers handled the changes. Most were minor, such as replacing constants with variables.

Looking at the code, the client noted that the developers were preparing to have the attitude control computer accept an orientation value from the ground. This was not explicitly part of the requirements at first, but the developers seemed impressed with the elucidation of the three-axis inertial orientation requirement. In this way, both client and developers can contribute to requirement refinement. The client discovered this additional requirement as part of the refactoring/open ownership of the code. Usually, implementing a requirement that a client has not specified has been brushed off as a “feature” by the developers and “gold plating” by the clients. The usual rationale is that since the additional code does not affect the actual requirements, if the client gets something extra, so much the better. This ignores both added complexity and added difficulty in maintenance.
Specific to this case, it violates the XP value of simplicity if both do not agree. In this case, allowing the attitude control computer to align the spacecraft (virtually) along its zero axis added a reasonable function. The client was worried that
the attitude and data computers only had the “I’m alive” function of fault detection. By adding this functionality to the software, the developers accomplished getting the attitude control computer more involved with the system. The client then introduced a requirements change to make acknowledgement of the attitude change command be routed through the data computer, thus giving it an additional function. Later, three additional and specific functions were added to simulate load. Note that in the paradigmatic waterfall-like development life cycle, the client would most probably have first seen this feature at the acceptance test phase. That would be much too late to add it to the code elegantly, or to veto it. As it turned out, refactoring the code to remove the attitude change functionality would have been simple, as the developers used abstraction and information hiding well. But the client wanted to keep the additional function and add one to the data computer. Therefore, this combination of techniques enabled requirements evolution, not merely a reaction to it. Having the client on site was the most important practice contributing to this result.

3.1 Further Results

At about this time, the client asked for a written description of the JUnit test system for possible later use. The development team took this opportunity to produce a test plan for integration as well (Appendix I). The team had done well independently using JUnit for component testing, as there was no need for interaction among them. However, integration would be different. When the time came for integration, they developed the plan to avoid interfering with each other. Note that the team restated the requirements as they understood them after problem frames analysis.

Also at about this time in development, the team had the client sit down and try the software, in order to get direct experiential feedback. One area that had not been given any emphasis was the user interface. This is because the main purpose of the interface was to turn computing resources on and off, so the client did not specify anything special, essentially leaving the format up to the development team. Now, even though integration was not finished, the interface was. The developers had made a simple interface of selections from a hierarchical set of menus. The messages at the lower levels were identical, and with the speed of the processor they were redrawn so quickly that the overwriting was indiscernible; the user perceived that they did not change. Small evolutions like this were found in the process of the client starting and stopping the three main spacecraft computers using the software.

Finally, the team finished the original cards. They were asked by the client to add some experiment computer functions, and this was done. When the client asked for additional capabilities, the team turned him down, because their experience with the software indicated that they could not finish before the allocated time ran out.
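The project's code is not reproduced in the paper, but the heartbeat scheme of Section 3 can be sketched roughly as follows. The class, its method names, and the use of wall-clock time are assumptions made purely for illustration; only the ten-second heartbeat period is taken from the test plan in the appendix.

import java.util.HashMap;
import java.util.Map;

// Rough, hypothetical sketch of the fault-detection scheme described above:
// peripheral computers send periodic "I am alive" heartbeats to the command
// computer, and a computer not heard from within the time-out is declared
// failed, after which its backup kernel can be instantiated on an experiment
// processor. Not the project's actual code.
public class HeartbeatMonitor {

    // Assumed ten-second heartbeat period (the figure used in the appendix).
    private static final long TIMEOUT_MS = 10000;

    private final Map<String, Long> lastHeard = new HashMap<>();

    // Called whenever an "I am alive" message arrives from a peripheral computer.
    public synchronized void heartbeatReceived(String computer) {
        lastHeard.put(computer, System.currentTimeMillis());
    }

    // Polled by the command computer: true if the computer has missed its
    // heartbeat and its backup kernel should be instantiated.
    public synchronized boolean hasFailed(String computer) {
        Long last = lastHeard.get(computer);
        return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
    }
}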
4 Summary

The claims of XP advocates that the approach is relatively resilient to requirements changes seem to have been justified in this project [1].
One developer shared reasons why: “Having short release cycles and testing continuously aided us, the XP team, to find defects quickly and also gave me a sense of accomplishment once we finish a requirement on the targeted date.” [2] A reason why the client shared this enthusiasm for the method’s effectiveness in handling requirements changes was stated by another developer: “Everything that is developed serves the interests of the customer.” [8]

We made the assertion that quality remained high using agile methods. According to Jerry Weinberg, “Quality is value to some person.” What we value is lack of defects. A corresponding team using the Team Software Process (TSP)1 on the same problem reported making 19.15 defects/KLOC [4], while a contemporary team with the same process had 20.07 defects/KLOC [6]. The XP team injected 9.56 defects/KLOC, or about half the relative number. The TSP team was measured at formal inspections and integration test; the XP team was measured during integrations and while executing their test plan. TSP is just not built for speed. Productivity figures bear this out: the XP team wrote 5,334 lines of Java that implemented all the requirements, while the TSP team wrote 2,133 lines that implemented roughly half the requirements. This is because the XP team was implementing almost from the beginning, while the TSP team was writing the vision document, design document, and so on.

The XP team was almost documentless, except for the coding standard and the integration test plan. If the project had continued, other documents would probably have been added. There was no desire for documentation of requirements changes, but rather of the resulting design. A third developer mentioned: “I believe that formal design documentation should be more emphasized so that even if developers come in and go there is a point of reference.” [7] Recapturing the design is one of the biggest problems of maintenance, and this remark represents good insight on the part of the developer.

Therefore, we can see that in a project trying to manage asynchronous requirements change, agile methods are useful. They enable the client to change their mind due to business pressure and still have the developers respond easily to the change.
References

1. Kent Beck. eXtreme Programming Explained. Addison-Wesley, 2000.
2. Maida Felix. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.
3. Martin Fowler and Jim Highsmith. The Agile Manifesto, Software Development, V.9 No.8, 2001.
4. Watts Humphrey. Team Software Process. Addison-Wesley, Boston, 2000.
5. Michael Jackson. Problem Frames. Addison-Wesley, 2001.
6. Michelle Krysztopik. Quality Report, manuscript, Carnegie Mellon University, 2001.
7. Azifarwe Mahwasane. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.
8. Beryl Mbeki. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.

1 The Team Software Process and TSP are service marks of Carnegie-Mellon University.
9. James Tomayko. Achieving Reliability: The Evolution of Redundancy in American Manned Spacecraft Computers, Journal of the British Interplanetary Society, V.38, 1985.
10. Jim Tomayko. Adapting Problem Frames to eXtreme Programming, manuscript, Carnegie Mellon University, 2001.
11. James Tomayko. Computers in Spaceflight: The NASA Experience, Contractor Report 182505, NASA, Washington, DC, Ch. 5 and 6, 1988.
12. Webster’s Dictionary, 3rd College Ed., Webster’s New World, 1988.
Appendix: XP Project Test Plan
Test Plan, Version 1.0, July 3, 2001
Maida Felix, Azwifarwi Mahwasane, Beryl Mbeki, Thembakazi Zola
Introduction

This section introduces the Space Probe Simulation System, including its requirements, assumptions, and scenarios. It also describes the objective of this test plan.

Purpose

The objective of this test plan is to provide guidance for the Space Probe Simulation System test. The goal is to provide a framework for developers and testers so that they can plan and execute the necessary tests in a timely and cost-effective manner. This system test plan outlines test procedures to verify system-level operations through user scenarios. This system testing is intended to test all portions of the Space Probe System. This test plan follows the steps discussed in the book “Object-Oriented Software Engineering” by Bernd Bruegge and Allen Dutoit.

System Overview

A space probe has three primary computers: the Command Computer, the Attitude Control Computer, and the Data Processing Computer. Each primary computer has backup software that resides in an experiment computer. The space probe interacts with the ground station computer in order to receive commands. The computers communicate with each other over a LAN. The Command Computer receives and decodes the commands from the ground computer in order to change the attitude of the space probe or turn on experiments. It also receives the heartbeat of each computer that is connected to it. The Command Computer is able to tell if any of the computers has failed and shall instantiate the kernel of a failed computer.
The experiment computers shall be on standby until they get a “turn-on” message from the Command Computer. The Attitude Control Computer has an interface to the Command Computer for changing the attitude of the space probe and for fault tolerance. It generates a status report and sends it to the Data Processing Computer for formatting. The kernel of the Attitude Control Computer’s functionality shall reside on an experiment processor. The Data Processing Computer gets raw data from each experiment processor. Once the data is in the Data Processing Computer, it will be translated into a format that is acceptable to the ground station computer. The kernel of the Data Processing Computer’s functionality shall reside on an experiment processor.

Requirements

The requirements of the Space Probe Simulation System consist of the following:
1. The Command computer shall be able to try to restore a failed computer’s functionality.
2. The Command computer shall be able to instantiate the kernel of a failed computer, even itself.
3. The Command computer shall be able to turn experiments on and off.
4. If one of the primary computers has failed, then the kernel of the failed machine’s functionality shall run as a second job in an experiment processor.
5. The Attitude computer shall keep the spacecraft pointing to earth unless commanded otherwise.
6. The Command computer will send messages to the Attitude computer to change the position in which the spacecraft is pointing.
7. The Attitude Control computer shall have a sensor that senses the position of the spacecraft. Once the Command computer has issued a command, the Attitude computer responds by first finding out what position the spacecraft is at. If it is not at an acceptable position, then following the command will rectify the position.
8. By loading the functionality software on the Attitude Experiment computer, it will be possible to have the functionality of the Attitude Control computer in the kernel of the Experiment computer.
9. The Attitude computer will send power codes as a “heartbeat” to inform the Command computer of its on or off status.
10. The Attitude computer and its Experiment shall gather data and pass it to the Data computer.
11. The Data computer gets raw data from each processor. Once the data is in the Data computer, it translates the data into a format that is acceptable to the Ground computer.
12. The Data computer shall format data for downlink.
13. A kernel of the Data computer’s functionality shall reside on an experiment processor.

Assumptions

The Space Probe simulation holds the following assumptions:
1. The software shall run on seven (7) virtual machines within a single LAN. This is an enclosed environment and no future changes are envisioned.
2. The main computers, except the Ground computer, are in space and once a failure occurs they cannot be rebooted (reinstantiated).
3. All commands are sent from the Ground computer only.

System Scenarios

The following steps describe how the system functions. We have described the details of establishing a connection between the Ground computer and the Command computer as well as a connection between the other computers (Attitude, Data, Attitude Experiment, Command Experiment and Data Experiment).

1. Establish a connection.
Assumption: all computers are on standby waiting for commands. A user starts the Ground computer, stating which computer to connect to, e.g. MSEPC 26. This computer represents the Command computer. Once a connection has been established, a menu appears giving the user options to choose from. The user can choose one of the following options:
1. Switch on a computer.
2. Switch off a computer.
3. Change the attitude.
Or type “hangup” to exit the system.

2. Sending commands.
2.1 Switch on a computer. The user enters “1” to switch on a computer. A submenu appears asking which one of the computers should be switched on. The user enters her choice and Command switches on the specified computer.
2.2 Switch off a computer. The user enters “2” to switch off a computer. A submenu appears asking which one of the computers should be switched off. The user enters her choice and Command switches off the specified computer.
2.3 Exit. The user types “hangup” to exit the system.

Testing Information

This section provides information on the features to be tested, pass and fail criteria, approach, testing materials, test cases, and testing schedules.

Unit Testing

All code will be tested to ensure that the individual unit (class) performs the required functions and outputs the proper results and data. Proper results will be determined by system requirements, design specification, and input from the on-site client.
Unit testing is typical white-box testing. This testing will help ensure proper operation of a module because tests are generated with knowledge of the internal workings of the module. The developers will perform unit testing on all components. Tests based on the requirements were written before coding using JUnit. JUnit is a small testing framework written in Java by Kent Beck and Erich Gamma. Each method in a test case exercises a different aspect of the tested software. The aim, as is customary in software testing, is not to show that the software works fine, but to show that it doesn’t. Interaction with the JUnit framework is achieved through assertions, which are method calls that check (assert) that given conditions hold at specified points in the test execution. JUnit allows for several ways to run the different tests, either individually or in batches, but the simplest one by far is to let the test suite collect the tests using Java introspection:

import junit.framework.*;

public class TestOptionMatch extends TestCase {

    Optionmatch M;

    // constructor
    public TestOptionMatch(String name) {
        super(name);
    }

    // Initialize the fixture state
    public void setUp() {
        M = new Optionmatch();
    }

    // add all test methods to the test suite
    public static Test suite() {
        return new TestSuite(TestOptionMatch.class);
    }

    // performing all tests
    public void test() {
        assertTrue(M.choices(1) == 0);
    }

    public void test1() {
        assertTrue(M.choices(2) == 45);
    }

    public void test2() {
        assertTrue(M.choices(3) == 90);
    }

    // This method starts the text interface and runs all the tests.
    public static void main(String[] args) {
        junit.textui.TestRunner.run(suite());
    }
}
Integration Testing

There are two levels of integration testing. One level is the process of testing a software capability, e.g. being able to send a message, upon completion of the integration of the developed system. During this level, each module is treated as a black box, while conflicts between functions or classes are resolved. Test cases must provide unexpected parameter values when design documentation does not explicitly specify calling requirements for client functions. A second level of integration testing occurs when sufficient modules have been integrated to demonstrate a scenario, e.g. the ability to queue and receive commands. Both hardware and software coding is fixed, or documentation is reworked, as necessary. In the case of the Space Probe Simulation System, the developers performed integration testing to ensure that the test cases work as desired on a specific computer. If a computer cannot communicate with its corresponding partner, then, depending on the problem, the developers may have to fix the code to ensure that the system works properly.

System Testing

System testing will test communication functionality between the Ground computer and the space probe computers. The purpose of system testing is to validate and verify the system in order to assure a quality product. This is the responsibility of the system developers. The developers and the client will ensure that the system tests are completed according to the specified test cases listed below. The developers will ensure that the results of the system tests are addressed if the results show that the functionality does not meet requirements. System testing is actually a series of different tests intended to fully exercise the computer-based system. Each test may have a different purpose, but all work to expose system limitations. System testing will follow formal test procedures based on hardware and software requirements.

Testing Features/Test Cases

Displaying menu options on the Ground computer is a feature that will be tested. The following lists the test case scenarios that are most applicable to the Space Probe Simulation System.
1. Invalid entries.
2. Verify that a computer is on before sending a command to it.
3. Verify that the Command computer receives heartbeats from other designated computers every 10 seconds.
4. When a primary computer fails, the backup (Experiment) computer is instantiated and takes over the functionality of the failed computer.
5. Verify that the Attitude computer and its Experiment receive and display the proper attitude.
6. The Data computer formats and sends attitude to the Ground computer.

Pass/Fail Criteria

The test is considered passed if all of the following criteria are met:
- The Command computer is connected to the Ground computer; the Attitude, Data, and Experiment computers are connected to the Command computer.
- The Command Experiment computer is connected to the Ground computer.
- Experiments are capable of taking over the functionality of the failed computer when one of the primary computers fails.
- All computers in “space” perform their functions automatically without any user interference.
- The Ground computer is user-friendly so that it is easy to use.

The test is considered failed if:
- The different computers have difficulty connecting or communicating with each other.
- The Experiments do not perform the functionality of a failed computer when one of the primary computers fails.
- The computers cannot perform all their functions automatically.

Testing Approach

This section describes the scenarios before the system testing and the step-by-step instructions on how to conduct the system testing.

Before System Testing

The following compile and run procedures should be followed on all seven (7) virtual machines.
To compile: open DOS, change the working directory to the directory containing all of the system’s source files by typing cd directory name, and type javac filename.java.
To run: type java filename.

Conducting System Testing

The system testing consists of 7 user scenarios corresponding to the descriptions below. Each of the following sections provides test procedures for the Space Probe Simulation System. Each table lists the test procedure for a given user scenario. All the test procedures assume that a user has already established a connection. Table 2 shows a typical test procedure for checking invalid host name entry.

Table 2. Case 1 – Invalid Hostname
Purpose: To test invalid host name.
Step 1: Start the Ground computer. View/Result: Show host name and IP address and request for hostname.
Step 2: Enter host name in host label. Example: Type “MSEPC XX”. View/Result: Confirmation of connection and first menu.

Table 3 shows a typical test procedure for an invalid menu entry. Assume that the Ground computer is on and a menu is being displayed on the screen.
Table 3. Test Case 2 – Invalid Menu Entry
Purpose: To test invalid Menu entry.
Step 1: Enter Menu option, e.g. 1, 2, 3. View/Result: A corresponding submenu appears with more options.
Step 2: Enter submenu option, e.g. A, B, C etc. View/Result: A command is sent and a confirmation is received.
Result (user enters wrong choice): Error message is displayed and user gets another chance to enter an option.
Table 4 shows a test procedure for exit.

Table 4. Test Case 3 – Exit
Purpose: To test if the user can end the session without problems.
Step 1: Type “hangup”. View/Result: Exit.
Table 5 shows a typical test procedure for checking if a computer is on when a command is sent to it.

Table 5. Test Case 4 – Verify if Attitude computer is on
Purpose: To verify if the Attitude computer is on.
Step 1: Enter Menu option “3”. View/Result: Submenu with the possible attitudes is displayed.
Step 2: Choose an attitude. View/Result: Command sent to Attitude computer.
Result (Attitude is off): Send notification to Ground computer to switch on Attitude computer first.
Table 6 shows the test procedure for checking if the Data computer is on when a command is sent to it.

Table 6. Test Case 5 – Verify if Data computer is on
Purpose: To verify if the Data computer is on.
Step 1: Attitude computer sends message to Data computer. View/Result: Data formats the message and sends it to the Ground computer.
Result (Data computer is off): Send notification to Ground computer to switch on Data computer before the command can be sent.
Table 7 shows a test procedure to verify that the Command computer receives heartbeats.

Table 7. Test Case 6 – Command receives heartbeats
Purpose: To verify that Command receives heartbeats every ten (10) seconds.
Step 1: Send heartbeat to Command computer. View/Result: Command computer receives an “I am alive” message.
Result (no heartbeat): The specified computer has failed; instantiate the kernel of the failed computer.
Table 8 shows a typical test procedure for instantiating an experiment computer.

Table 8. Test Case 7 – Instantiate backup computer
Purpose: To test whether an experiment computer switches on when its primary has failed.
Step 1: Send a “SWITCH ON” message. View/Result: Experiment takes over functionality of the failed computer.
Result (Experiment does not instantiate or experiment computer fails): System failure.
Table 9. Test Case 8 – Attitude receives and displays proper attitude
Purpose: To test whether the Attitude computer and Attitude Experiment receive and display the right attitude.
Step 1: Send attitude message, e.g. “B”. View/Result: Attitude and Attitude Experiment display “Move 45 degrees”.
Result (no attitude command received): No change in attitude.
Result (Attitude and experiment display wrong attitude): Attitude data will be wrong.
Table 10. Test Case 9 – Data computer formatting and sending attitude message
Purpose: To test whether the Data computer sends and formats the attitude properly.
Step 1: Attitude computer or the Attitude Experiment sends attitude change to Data computer. View/Result: Data computer receives, formats the attitude and sends it to the Ground computer.
Result (no attitude sent to Data from Attitude computer, or no attitude sent to Ground from Data computer): No data sent to Ground.
Testing Schedules

The developers must ensure that the test data are accurate, and the product should not be deployed until unit testing, integration testing, and system testing are properly performed. As part of the extreme programming testing strategy, we will frequently run the tests we wrote beforehand. Therefore we are always in the testing stage, but our final testing will be done as scheduled below:

Testing Items: Space Probe Simulation System
Test Date: 7/30/2001 – 8/03/2001
Responsible Person: Maida Felix, Azwifarwi Mahwasane, Nonzaliseko Mbeki, Thembakazi Zola
Text Summarization in Data Mining

Colleen E. Crangle
ConverSpeech LLC, 60 Kirby Place, Palo Alto, California 94301, USA
[email protected]
www.converspeech.com
Abstract. Text summarizers automatically construct summaries of a natural-language document. This paper examines the use of text summarization within data mining, identifying the potential summarizers have for uncovering interesting and unexpected information. It describes the current state of the art in commercial summarization and current approaches to the evaluation of summarizers. The paper then proposes a new model for text summarization and suggests a new form of evaluation. It argues that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted from the text. The paper uses two news articles to illustrate its points.
1 Introduction

To summarize a piece of writing is to present the main points in a concise form. Work on automated text summarization began over 40 years ago [1]. The growth of the Internet has invigorated this work in recent years [2], and summarization systems are beginning to be applied in areas such as healthcare and digital libraries [3]. Several commercially available text summarizers are now on the market. Examples include Capito from Semiotis, Inxight’s summarizer, the Brevity summarizer from LexTek International, the Copernic summarizer, TextAnalyst from Megaputer, and Whiskey™ from ConverSpeech. These programs work by automatically extracting selected sentences from a piece of writing. A true summary, however, succinctly expresses the gist of a document, revealing the essence of its content.

This paper examines the use of text summarization within data mining for uncovering interesting and unexpected information. It describes the current state of the art in summarization systems and current approaches to the evaluation of summarizers. The paper then proposes a new model for text summarization and suggests a new form of evaluation. It argues that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted. Such summarizers offer a potential not yet exploited in data mining.
2 Summarizers in Data Mining

Much of the information crucial to an organization exists in the form of unstructured text data. That is, the information does not reside in a database with well-defined methods of organization and access, but is expressed in natural language and is contained within various documents such as web pages, e-mail messages, and other electronic documents.
The process of identifying and extracting valuable information from such data repositories is known as text data mining. Tools to do the job must go beyond simple keyword indexing and searching. They must determine, at some level, what a document is about.

2.1 Text Data Mining

Keyword indexing and searching can provide a specific answer to a specific question, such as “What is the deepest lake in the United States?” with the answer being found in a piece of text such as: “Crater Lake, at 1,958 feet (597 meters) deep, is the seventh deepest lake in the world and the deepest in the United States.” Keyword indexing and searching can also provide answers to more complex questions, such as “What geological processes formed the three deepest lakes in the world?” Several sources will probably have to be consulted, their information fused, interpretations made (what counts as a geological process versus a human intervention), and conclusions drawn. But standard keyword indexing and searching will probably suffice to find the pieces of text needed.

Text data mining goes beyond question answering. It seeks to uncover interesting and useful patterns in large repositories of text, answering questions that may not yet have been posed. The focus is on discovery, not simply finding what is sought. The focus is on uncovering unexpected content in text and unexpected relationships between pieces of text, not simply the text itself.

2.2 Summaries in Text Data Mining

Summaries aid text data mining in at least the following ways:
- An information analyst—whether a social scientist, a member of the intelligence community, or a market researcher—uses summaries to guide her examination of data repositories that are so large she cannot possibly read everything or even browse the repository adequately. Summaries suggest what documents should be read in their entirety, which should be read together or in sequence, and so on.
- Summaries of the individual documents in a collection can reveal similarities in their content. The summaries then form the basis for clustering the documents or categorizing them into specified groups. Applications include Internet portal management, evaluating free-text responses to survey questions, help-desk automation for responses to customer queries, and so on.
- The very process of categorizing or clustering two document summaries into the same group can reveal an unexpected relationship between the documents.
- The summary of a collection of related documents taken together can reveal aggregated information that exists only at the collection level. In biomedicine, for example, Swanson has used summaries together with additional information-extraction techniques to form a new and interesting clinical hypothesis [4].

An interesting and significant form of indeterminacy creeps into summarization. It results from the inherent indeterminacy of meaning in natural language. Summaries,
whether produced by a human abstractor or a machine, are generally thought to be good if they capture the author’s intent, that is, succinctly present the main points the author intended to make. (There are other kinds of summarization in which sentences are extracted relative to a particular topic, a technique that is a form of information extraction.) So-called neutral summaries, however, those that aim to capture the author’s intent, can succeed only to the extent that the author had a clear intent and expressed it adequately. What if the author had no clear intent or was an inadequate writer? Poor writers abound, and most short written communications, such as e-mail messages or postings to electronic bulletin boards, are messy in content and execution. Do automated text summarizers reveal anything useful in these cases? If the summarization technique is itself valid, the answer is that the summary reveals what the piece of text is really about, whether the author intended it or not. Various studies have explored the indeterminacy of meaning in language, and the extent to which meaning depends on the context in which language is used [5, 6]. Author’s intent does not bound meaning nor fully determine the content of a document. When documents are pulled together and their collective content is examined, there generally is no single author anyway whose intent could dominate. An automated summarizer that reveals what a text is really about, independent of authorial intent, is a powerful tool in data mining. It has the potential to reveal new and interesting information in a document or a collection of documents. The pressing and practical concern is how to evaluate any given summarizer; that is, how do we know whether or not it produces good summaries? What counts as a good summary, and does that judgment depend on the purpose the summary is to serve? Within data mining, for example, summaries that revealed unexpected content or unexpected relationships between documents would be of the greatest value. The next section looks at current work in summarization evaluation.
3 Evaluating Summarizers

A group representing academic, U.S. government, and commercial interests has been working over the past few years to draw up guidelines for the design and evaluation of summarization systems. This work arose out of the TIDES Program (Translingual Information Detection, Extraction, and Summarization) sponsored by the Information Technology Office (ITO) of the U.S. Defense Advanced Research Projects Agency (DARPA). In a related effort, the National Institute of Standards and Technology (NIST) of the U.S. government has initiated a new evaluation series in text summarization. Called the Document Understanding Conference (DUC), this initiative has resulted in the production of reference data—documents and summaries—for summarizer training and testing. (See http://www-nlpir.nist.gov/projects/duc/ for further information.)

A key task accomplished by these initiatives was the compilation of sets of test documents. In these sets the important sentences (or sentence fragments) within each document are annotated by human evaluators, and/or for each document, human-generated abstracts of various lengths are provided.

An early example of a summary data set consisted of eight news articles published by seven news providers, New York Times, CNN, CBS, Fox News, BBC, Reuters, and Associated Press, on June 3rd, 2000, the eve of the meeting between Presidents Clinton and Putin. One of these articles is used below.
Several approaches to summarizer evaluation have been identified. They include:
- Using the annotated sentences. For each document, counting how many of the annotated (i.e., important) sentences are included in the summary. A simple measure of percent agreement can be applied, or the traditional measures of recall and precision.1
- Using the abstracts. Counting how many of the sentences in the human-generated abstracts are represented by sentences in the summaries. A simple measure of percent agreement can be applied, or the traditional measures of recall and precision.
- Using a question-answering task. For a given set of pre-determined questions, counting how many of the questions can be answered using the summary. The more questions that can be answered, the better the summary.
- Using the utility method of Radev [7].
- Using the content-based measures of Donaway [8].

To illustrate a simple evaluation, consider the following test document. The underlined sentences are those considered important by the human evaluators.

CLINTON TAKES STAR WARS PLAN TO RUSSIA
- US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin. The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
- High on the agenda will be the United State’s plans to build a missile shield in Alaska. Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
- Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.
- The Russian leader has said that he will suggest an alternative to the US system.
- Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory.
- “We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow. “It won’t be long now.” Accompanying the President is US Secretary of State Madeline Albright. “What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said. “We will discuss it.”
- Arms control will not be the only potentially troublesome issue. US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom.

Here is an automatically generated summary of this text:
1 Recall refers to the number of annotated sentences correctly included in the summary, divided by the total number of annotated sentences. Precision refers to the number of annotated sentences correctly included in the summary, divided by the total number of sentences (correctly or incorrectly) included in the summary.
CLINTON TAKES STAR WARS PLAN TO RUSSIA
- US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.
- The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
- High on the agenda will be the United State’s plans to build a missile shield in Alaska.
- Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
- Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.

This extraction summary has three of the five important sentences, and of its six sentences (including the heading) three are considered important. Simple recall and precision figures of 60% and 50% result.

All the current evaluation approaches assume that a summary is produced by extracting sentences. Are there other ways to think about summarization? Are there also new ways to think about evaluating summarizers? This author would argue yes, particularly in the context of data mining. In data mining, we are interested in discovering, not merely finding, information. We may need to dig beneath the surface of a text to make such discoveries. §6 returns to these questions, after a brief review of the state of the art in text summarization and presentation of a new model for summarization.
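As a worked restatement of those recall and precision figures, using the definitions given in the footnote, with A the set of annotated (important) sentences and S the set of sentences in the extraction summary:

\[
\mathrm{recall} = \frac{|A \cap S|}{|A|} = \frac{3}{5} = 60\%,
\qquad
\mathrm{precision} = \frac{|A \cap S|}{|S|} = \frac{3}{6} = 50\%.
\]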
4 Text Summarization: The State of the Art

Current summarizers work by extracting key sentences from a document. As yet, there is no summarizer on the market, or even within the research community, that truly fuses information to create a set of new sentences to represent the document’s content. In general, summarizers simply extract sentences. They differ in the methods they use to select those sentences. There are two main kinds of methods involved, which may be used separately or in combination:
1. Heuristic methods, based largely on insight into how human, professional abstractors work. Many of these heuristics exploit document organization. So, for example, sentences in the opening and closing paragraphs are more likely to be in the summary. Some heuristics exploit the occurrence of cue phrases such as “in conclusion” or “important.”
2. Methods based on identifying key words, phrases, and word clusters. The document is analyzed using statistical and/or linguistic techniques to identify the words, phrases, or word clusters that by their frequency and co-occurrence are thought to represent the content of the document. Then sentences containing or related to these words and phrases are selected (a much-simplified, hypothetical sketch of this frequency idea is given at the end of this section).

The techniques commercial summarizers use to identify key words and phrases are often proprietary and can only be inferred from the extracted sentences. What is readily seen, however, is whether or not the method identifies concepts in the text. Concepts are expressed using words and phrases that may or may not appear within the text. Concept identification, as opposed to key word and phrase identification, is a crucial differentiating factor between summarizers.
Summaries that contain true abstractions from the text are more likely to reveal unexpected, sometimes hidden, information within documents and surprising relationships between documents. A true abstraction summarizer can be a powerful tool for text data mining.

It is important from a scientific point of view to devise objective measures to evaluate summarizers. However, given that the output of a summarizer is itself natural-language text, some human judgment is inescapable. The DUC initiative relies heavily on human evaluators. Based on informal testing of several dozen documents of various kinds—business and marketing documents (regulatory filing, product description, business news article), personal communications (fax, e-mail, letter), non-technical pieces (long essay, short information piece, work of fiction), scientific articles, and several documents that pose specific challenges (threaded bulletin board messages, enumerations in text, program code in text)—what follows is an intuitive judgment of the state of commercially available summarizers.

Current summarizers are able to produce adequate sentence-extraction summaries of articles that have the following characteristics:
- The article is well written and well organized.
- It is on one main topic.
- It is relatively short (600-2,000 words).
- It is informational, for example, a newspaper article or a technical article in an academic journal. It is not a work of the imagination, such as fiction, or an opinion piece or general essay.
- It is devoid of layout features such as enumerations, indented quotations, or blocks of program code. (Although some summarizers use heuristics that take headings into account, for example, summarizers typically ignore or strip a document of most of its layout features.)

Some summarizers perform limited post-processing “smoothing” on the sentences they list in an attempt to give coherence and fluency to the summary. This post-processing includes:
- Removing inappropriate connecting words and phrases. If a sentence in the document begins with a connecting phrase—for example, “Furthermore” or “Although”—and that sentence is selected for the summary, the connecting phrase must be removed from the summary because it probably no longer plays the connecting role it was meant to.
- Resolving anaphora and co-reference. When a sentence is selected for inclusion in the summary, the pronouns (and other referring phrases) in it have to be resolved. That is, the summarizer has to make clear what each pronoun (or referring phrase) refers to. For example, suppose a document contains the following sentences: “Newcompany Inc. has recently reported record losses. If it continues to lose money, it risks strong shareholder reaction. The company yesterday announced new measures to…” If the second sentence is selected for the summary, the word “it” has to be resolved to refer to Newcompany Inc.; otherwise, the reader will have no idea what “it” is, and will naturally relate “it” to whatever entity is named in the preceding sentence of the summary. Ideally, the summary sentence should appear as: “If it [Newcompany Inc.] continues to lose money, it risks strong shareholder reaction.”
If the third sentence is selected for the summary, the phrase “the company” should similarly be identified as referring to Newcompany Inc. and not any other company that may be named in a preceding sentence of the summary. Anaphoric and co-reference resolution is very difficult; not surprisingly, current commercial summarizers incorporate very few, if any, of these techniques. Current research into summarization has a strong emphasis on post-processing techniques.
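The key-word techniques of commercial summarizers are proprietary, as noted above, but the frequency idea behind the second kind of method can be sketched in a few lines of Java. The class below is a deliberately simplified, hypothetical illustration, not any vendor's algorithm; real systems add phrase detection, co-occurrence analysis, and linguistic processing.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified illustration of frequency-based key-word identification:
// rank the words of a document by how often they occur, ignoring a
// handful of function words.
public class KeywordFinder {

    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for", "on"));

    // Returns the howMany most frequent non-stop words, most frequent first.
    public static List<String> topKeywords(String text, int howMany) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((w1, w2) -> counts.get(w2) - counts.get(w1));
        return words.subList(0, Math.min(howMany, words.size()));
    }
}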
5 A New Model for Summarization

The standard model of summary production is represented by the sequence shown in Figure 1.

Input the text document
    |
Identify key words and phrases in the text
    |
Extract sentences containing those words and phrases
    |
Perform post-processing “smoothing” on the extracted sentences

Fig. 1. Standard model of summary production
What if the summarizer is able to identify key concepts and not just key words and phrases? Not only can the key concepts by themselves stand as an encapsulated summary of the document, but the concepts can also provide a better basis for selecting sentences to be extracted. An enhanced model results, as depicted in Figure 2. A summary that provides information not immediately evident from a surface reading of the text is of potentially great value in data mining.

To test the assumption that concepts can provide a better basis for selecting sentences to be extracted, and to understand the significance of this enhanced model, the action of three different commercially available summarizers on the news article in Appendix I is considered. The first summarizer simply produces sentences. The second additionally displays key words and phrases from the text along with the extracted sentences. The third, ConverSpeech’s Whiskey, abstracts concepts and uses those concepts to extract sentences.

Input the text document
    |
Identify key concepts expressed in the text
    |
    |-- Present key concepts as encapsulated summary
    |-- Extract sentences and do post-processing “smoothing”

Fig. 2. Enhanced, concept-based, model of summary production
The first summarizer produced the following sentences extracted from the text:
- About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies.
- Five local companies were already bumped off last year, and a sixth – PlanetRx.com Inc., a former South San Francisco health care company – was just delisted.
- “The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon.
- With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.
- While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: The fall of a company’s stock price below $1 for 30 consecutive trading days.

The second summarizer produced the following key words and phrases and sentences extracted from the text:
- Nasdaq, stock, delisting, firms, investors, stock price, stock exchange, San, officer, Edison
- It’s the company version of the pink slip in the mail – get your act together, or you’re fired from Nasdaq.
- About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies.
- Five local companies were already bumped off last year, and a sixth – PlanetRx.com Inc., a former South San Francisco health care company – was just delisted.
- While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money.
- “The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon.
- Once booted, companies usually end up in the netherworlds of the stock market, where only a few brave investors venture.
- This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements.
- “We’re working on strategic partnerships that will have a major impact on the stock,” says Nadyne Edison, chief marketing officer for the company.

The third summarizer produced a high-level abstraction, a listing of the key concepts expressed in the text, and a list of extracted sentences. The number after each sentence is its score, calculated on the basis of how many occurrences of the words in the concept list appear in the sentence, optionally normalized for sentence length. Those sentences that receive the top 75% scores are selected for inclusion. This percentage is set as a parameter. Notice that the concept of business has been extracted from the text even though the word “business” appears only once in the text, in the last sentence. Note also that the word “time” does not occur frequently in the text but the concept of time does.
time, company, day, business, Nasdaq, working, capital, stock, share
- About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies. (378)
- Five local companies were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted. (352)
- Nationwide, Nasdaq has either sent notices or is close to notifying at least 200 other companies, many of whom offered stocks to the public for the first time last year. (368)
- While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money. (400)
- “The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon. (438)
- While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: the fall of a company’s stock price below $1 for 30 consecutive trading days. (404)
- Autoweb.com Inc., a Santa Clara Internet company that specializes in auto consumer services, has about 40 days left under the 90-day rule, but is busy scrambling to avoid a hearing. (390)

The three summaries have three sentences in common, and the third summary has one additional sentence in common with each of the first and second summaries.
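A minimal sketch of the scoring rule just described, with the concept-list words counted per sentence, optional normalization by sentence length, and the 75% figure treated as a parameter, might look as follows. The names are invented and the reading of the cut-off is one plausible interpretation; this is not the Whiskey implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration of concept-based sentence selection: score each sentence by
// the number of concept-list words it contains, optionally normalized by
// sentence length, and keep sentences whose score is at least cutoff
// (e.g. 0.75) times the best score.
public class ConceptScorer {

    public static List<String> selectSentences(List<String> sentences,
                                               Set<String> concepts,
                                               double cutoff,
                                               boolean normalize) {
        double[] scores = new double[sentences.size()];
        double best = 0.0;
        for (int i = 0; i < sentences.size(); i++) {
            String[] words = sentences.get(i).toLowerCase().split("[^a-z]+");
            int hits = 0;
            for (String w : words) {
                if (concepts.contains(w)) {
                    hits++;
                }
            }
            scores[i] = (normalize && words.length > 0) ? (double) hits / words.length : hits;
            best = Math.max(best, scores[i]);
        }
        // One plausible reading of "top 75% scores": keep sentences scoring
        // at least cutoff times the best score.
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (best > 0 && scores[i] >= cutoff * best) {
                selected.add(sentences.get(i));
            }
        }
        return selected;
    }
}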
6 A New Evaluation Method

How good are these three summaries and the summarizers that produced them? Any of the evaluation methods mentioned in §3 could be applied to assess the value of the extracted sentences. However, in the context of data mining, a new evaluation method for summarizers is proposed here. It asks the following question: how sensitive is a summarizer to surface perturbations in the text, such as in word choice or sentence order? Specifically, this method asks what happens if synonyms are substituted for words and phrases in the text. Does the summarizer give a different summary, selecting sentences that differ markedly in content from the previously selected ones? Similarly, if the order of some of the sentences is changed, does that markedly alter what gets identified as key sentences? This test gives a good indication of the robustness of the summarizer and the soundness of the methods used to identify the content of the document. If simple changes in word choice or sentence order produce different summaries, it could be argued that the summarizer is not getting at the core of the document’s content.
The news article in Appendix I uses the words “firm” and “company” interchangeably, with 23 occurrences of “company” (the more familiar word) and four occurrences of “firm.” If we substitute “firm” for “company” in key sentences in the text, what happens? Two tests were performed. In the first, two substitutions were made (the replacement word is shown in square brackets after each substituted word):

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies [firms]. Five local companies [firms] were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted.

In the second test there were three additional substitutions:

“The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company [firm] that could get kicked off Nasdaq if it doesn’t boost its stock price soon. …With stock prices down and the economy slowing, companies [firms] are falling short of the standards Nasdaq sets for its some 3,802 companies [firms].

The results obtained were as follows:

First Summarizer. With the first round of substitutions, one sentence from the original summary was removed and a different sentence was inserted.
- Out: Five local firms were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted.
- In: When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again.
With the second round of substitutions, another sentence from the original summary was removed and a different sentence from the text was inserted.
- Out: With stock prices down and the economy slowing, firms are falling short of the standards Nasdaq sets for its some 3,802 firms.
- In: If a company sold things on the Web—cars, pet food, you name it—it was almost guaranteed a spot on the stock exchange.
Second Summarizer. The two rounds of substitutions produced only one change—the removal of the following sentence after the second round of substitutions:
- Out: This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements.
Third Summarizer. The two rounds of substitutions produced the same sentences. The word “firm” was added to the list of concepts for both rounds.
To further test the second and third summarizers, which appeared comparably robust, they were run on two more versions of the article with several further substitutions of “firm” for “company.” Both summarizers produced stable sets of sentences for these changes: the second summarizer retained the same altered set of sentences as for the other substitutions, and the third summarizer continued to select the same sentences throughout.
These two summarizers were also run on the Clinton/Putin test article given earlier, and on two variations of that article. The first variation was obtained by substituting
“anti-missile device” for the following phrases which, in the context, were synonymous with “anti-missile device”: “missile shield,” “shield,” and “system” (substitutions again shown as original → substitute).
High on the agenda will be the United State’s plans to build a missile shield → an anti-missile device in Alaska. Russia opposes the shield → anti-missile device as it contravenes a pact signed by the two countries in 1972 that bans any anti-missile devices. … The Russian leader has said that he will suggest an alternative to the US system → anti-missile device.
A second variation was obtained by further substituting the presidents’ last names (“Putin” and “Clinton” respectively) for the referring expressions “the Russian leader” and “the President” in the following sentences:
The Russian leader → Putin has said that he will suggest an alternative to the US anti-missile device. … Accompanying the President → Clinton is US Secretary of State Madeline Albright.
These were the results obtained.
Second Summarizer. For the original article, the following words and phrases and extracted sentences had been produced:
Clinton, Putin, Russia, president, Moscow, STAR WARS PLAN, missile shield, business, informal dinner, heads
CLINTON TAKES STAR WARS PLAN TO RUSSIA
- US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.
- The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
- High on the agenda will be the United State’s plans to build a missile shield in Alaska.
- Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
- Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.
The following lists of key words and phrases were produced for the two altered versions of the article:
- Clinton, Putin, Russia, president, anti-missile device, Moscow, STAR WARS PLAN, business, informal dinner, heads
- Clinton, Putin, Russia, anti-missile device, Moscow, president Bill Clinton, STAR WARS PLAN, business, informal dinner, heads
For both of the two altered versions, the following sentence was dropped from the summary, with no other sentence being substituted:
- Out: High on the agenda will be the United State’s plans to build an anti-missile device in Alaska.
Third Summarizer. The same set of sentences was extracted for the original article and the two variations. The following listing of abstracted concepts preceded the original summary. It is notable that the concept of country was identified as significant in the article even though the word “country” does not itself appear in the text.
state business president Putin (Vladmir Putin) US Clinton (Bill Clinton) country missile (missile shield, system, missile devices)
The concepts abstracted for both of the two altered versions of the article were the following:
state business president Putin (Vladmir Putin) US Clinton (Bill Clinton) device missile (missile shield, missile devices)
The sentences extracted for all three versions of the article were as follows (with scores omitted):
- US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.
- The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
- High on the agenda will be the United State’s plans to build a missile shield in Alaska.
- Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
- Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.
- Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory.
- “What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said.
- US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom.
Once again the second summarizer, while not as stable as the third, concept-based summarizer, performed with relative robustness. Only one sentence was eliminated from the summaries for the two versions of the article containing substitutions for synonymous terms.
However, the second and third summarizers differed markedly in their behavior when they were tested on articles with re-ordered sentences. To illustrate, the same Clinton/Putin article was used. (Similar results were obtained with the news story on the Nasdaq delistings.) The Clinton/Putin article was rearranged to begin at the following sentence, with the displaced first two paragraphs tacked on at the end. (See Appendix II.)
Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.
These were the results obtained.
Second Summarizer. It selected the following key words and phrases for the permuted article. There were seven in common with the original summary and three that were different (“Albright,” “Russian leader,” “concession”).
- Clinton, Putin, Russia, president, STAR WARS PLAN, missile shield, State Madeline Albright, Moscow, Russian leader, concession.
The real limitation of the summarizer, however, is revealed in the sentences it selected for extraction. It had only four sentences in common with the original summary, eliminating two and adding three different ones:
- Out: The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday. Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
- In: The Russian leader has said that he will suggest an alternative to the US system. “We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow. Accompanying the President is US Secretary of State Madeline Albright.
This summarizer most likely uses a heuristic that is commonly employed in summarizing algorithms: a sentence is given greater weight the nearer it is to the beginning of the article. (A variation of this heuristic assigns a greater weight only to the first sentence of the article or to the sentences in the first paragraph.) However, there is something fundamentally mistaken about over-reliance on this heuristic, even though it may improve results under some of the other evaluation methods. A sentence is placed at the beginning of an article because it is important; it is not important because it is at the beginning of an article. Over-reliance on the heuristic confuses these two points.
Third Summarizer. In marked contrast, the concept-based summarizer produced exactly the same results for the permuted article (and for all other permuted articles it was tested on).
The alterations in the summaries produced by the second summarizer, resulting simply from sentence reordering, suggest that the summarizing technique lacks robustness. Similarly, the alterations in the summaries produced by the first summarizer, resulting simply from synonym substitution, also suggest a lack of robustness. What is essentially different about the third summarizer is that it abstracts from the words and phrases that appear in the text, and relies on those abstracted concepts to extract sentences.
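The perturbation test used here is straightforward to automate. The following Python sketch is illustrative only: the function names, the naive sentence splitter, and the toy frequency-based extractor are assumptions of this presentation, not part of any of the three summarizers evaluated above. The sketch builds a synonym-substituted variant and a sentence-reordered variant of a text, runs a given extractor on all three versions, and reports the overlap between the selected sentences, identified by their position in the original article.

import random
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; adequate for a sketch.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def frequency_extractor(sentences, n=3):
    # Toy stand-in for a summarizer under test: each sentence is scored by
    # the summed corpus frequency of its words.
    freq = Counter(w for s in sentences for w in tokens(s))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w] for w in tokens(sentences[i])),
                    reverse=True)
    return set(ranked[:n])  # indices of the selected sentences

def substitute(sentences, synonyms):
    # Apply whole-word substitutions sentence by sentence, preserving indices.
    def sub_one(s):
        for old, new in synonyms.items():
            s = re.sub(r'\b%s\b' % re.escape(old), new, s)
        return s
    return [sub_one(s) for s in sentences]

def reorder(sentences, seed=0):
    # Shuffle the sentence order, remembering each sentence's original index.
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)
    return [sentences[i] for i in order], order

def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))

def robustness(extract, text, synonyms, n=3):
    sentences = split_sentences(text)
    base = extract(sentences, n)
    subbed = extract(substitute(sentences, synonyms), n)
    shuffled, order = reorder(sentences)
    permuted = {order[i] for i in extract(shuffled, n)}  # map back to original indices
    return {'synonym substitution': jaccard(base, subbed),
            'sentence reordering': jaccard(base, permuted)}

# Example, using the substitutions of the Nasdaq test:
# robustness(frequency_extractor, article_text,
#            {'company': 'firm', 'companies': 'firms'})

An overlap of 1.0 means the extract is unchanged. For the tests reported above, the third summarizer would score 1.0 under both perturbations, while the first and second summarizers, which dropped or swapped sentences, would score lower.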
7 Conclusion
To capture the essence of a document, regardless of authorial intent, a summarizer must do more than identify key words and phrases in the text and extract sentences on that basis. It must also identify concepts expressed in the text. Summarizers that offer this level of abstraction appear to get at the essence of a text more reliably, showing a greater tolerance for superficial changes in the input text. Such summarizers are potentially powerful tools in data mining, uncovering information that lies beneath the surface of the words and phrases of the text.
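To make this concluding point concrete, the sketch below shows one simple way of adding such abstraction; it is not the third summarizer’s actual algorithm, and the concept table and function names are illustrative assumptions. Surface words and phrases are first normalized to canonical concept labels, and sentences are then scored over concepts rather than over raw words.

import re
from collections import Counter

# Hypothetical concept table: surface forms mapped to canonical concept labels.
CONCEPTS = {
    'company': 'COMPANY', 'companies': 'COMPANY',
    'firm': 'COMPANY', 'firms': 'COMPANY',
    'missile shield': 'MISSILE_DEVICE', 'anti-missile device': 'MISSILE_DEVICE',
    'shield': 'MISSILE_DEVICE',
}

def to_concepts(sentence):
    # Replace known surface forms (longest phrases first) with their concept labels.
    s = sentence.lower()
    for phrase in sorted(CONCEPTS, key=len, reverse=True):
        s = re.sub(r'\b%s\b' % re.escape(phrase), CONCEPTS[phrase], s)
    return re.findall(r"[A-Za-z_']+", s)

def concept_extractor(sentences, n=3):
    # Frequency scoring as before, but over abstracted concepts.
    freq = Counter(c for s in sentences for c in to_concepts(s))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[c] for c in to_concepts(sentences[i])),
                    reverse=True)
    return set(ranked[:n])

Provided the substituted terms are covered by the concept table, “company” and “firm” contribute to the same concept count, so the synonym substitutions of Section 6 leave the sentence ranking, and therefore the extract, unchanged; plugged into the robustness check sketched above, such an extractor scores 1.0 under synonym substitution.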
References
1. H.P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 1958.
2. Marti Hearst. Untangling Text Data Mining. Proceedings of ACL ’99: 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 1999.
3. Kathleen R. McKeown, et al. PERSIVAL, a System for Personalized Search and Summarization over Multimedia Healthcare Information. Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 2001.
4. Don R. Swanson and N.R. Smalheiser. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence, 91, 183-203, 1997.
5. C.E. Crangle. What words mean: some considerations from the theory of definition in logic. Journal of Literary Semantics, Vol. XXI, No. 1, 17-26, 1992.
6. C.E. Crangle and P. Suppes. Language and Learning for Robots. CSLI Publications, Stanford. Distributed by Cambridge University Press, 1994.
7. Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. Proceedings of the ANLP/NAACL-2000 Workshop on Automatic Summarization, pp. 21-30, Seattle, WA, 2000.
8. Robert L. Donaway, Kevin K. Drummey, and Laura A. Mather. A Comparison of Rankings Produced by Summarization Evaluation Measures. Proceedings of the ANLP/NAACL-2000 Workshop on Automatic Summarization, pp. 69-78, Seattle, WA, May 2000.
Appendix I: News Article 20 area firms face delisting by Nasdaq, by Matt Marshall, Jan. 24, 2001. Copyright © 2001 San Jose Mercury News. All rights reserved. Reproduced with permission. Use of this material does not imply endorsement of the San Jose Mercury News. It’s the company version of the pink slip in the mail—get your act together, or you’re fired from Nasdaq. About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies. Five local companies were already bumped off last year, and a sixth— PlanetRx.com Inc., a former South San Francisco health care company—was just delisted. Nationwide, Nasdaq has either sent notices or is close to notifying at least 200 other companies, many of whom offered stocks to the public for the first time last year. While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money. For shareholders, a Nasdaq delisting sounds like a chilling death knoll – the value of their stock could all but implode. Some delisted companies, like Pets.com, simply close their doors. “The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon. “The emotion was too much. Things just snapped.” This round of delistings is the ignominious end to a year of decadence now coming back to haunt us. Most of these companies had no profits, and many had hardly any sales, when investor enthusiasm created a wave of new stock offerings last year. If a company sold things on the Web—cars, pet food, you name it—it was almost guaranteed a spot on the stock exchange.
But in less than a year, many of the same investors have abandoned their former darlings. With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies. While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: The fall of a company’s stock price below $1 for 30 consecutive trading days. When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again. If it fails to do so—for 10 consecutive days—the firm has one last resort: an appeal to Nasdaq. That involves a trek to Washington, D.C., and a quick hearing at a room in the St. Regis Hotel, where Nasdaq’s three-person panel grills executives. Unless there’s good reason to prolong the struggle, the company’s Nasdaq days are over. Once booted, companies usually end up in the netherworlds of the stock market, where only a few brave investors venture. First, it’s the Over The Counter Bulletin Board, which is considerably more risky and yields lower return to investors. However, even the OTCBB has requirements. Failing that, the next step down is the so-called Pink Sheets, named for the color of the paper they used to be traded on. This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements. “They’re the wild, wild West,” says Nasdaq spokesman Mark Gundersen. Autoweb.com Inc., a Santa Clara Internet company that specializes in auto consumer services, has about 40 days left under the 90-day rule, but is busy scrambling to avoid a hearing. “We’re working on strategic partnerships that will have a major impact on the stock,” says Nadyne Edison, chief marketing officer for the company. On Tuesday, Edison was in Detroit, busy opening a new office near the nation’s auto capital. Edison says the firm is considering moving its headquarters to Detroit to be nearer its clients. Other companies that got delisting notices are trying layoffs. Take Mountain View-based Network Computing Devices, which provides networking hardware and software to large companies. Its sales have been pinched as the personal computer industry slows down, so it has laid off people. “We’ve had to downsize, downsize, downsize,” says Chief Financial Officer Michael Garner. Women.com, a San Mateo-based Internet site devoted to women, has laid off 25 percent of the workforce recently to avoid delisting. Becca Perata-Rosati, vice president of communications, says the site isn’t being fairly rewarded by Wall Street. The company is the 29th most heavily visited Web site in the world, she says. One trick that doesn’t seem to work is the so-called “reverse stock split,” which PlanetRx.com tried on Dec. 1. By converting every eight shares into one, PlanetRx.com hoped each share price would be boosted eightfold. But the move was seen by investors as a sign of desperation, and the stock plunged from $1 to 53 cents. Out of alternatives, PlanetRx didn’t even show up for its hearing with Nasdaq. It is now trading on the OTCBB after a recent move to Memphis and faces an uncertain future. At least one executive says he doesn’t mind the prospect of going to the OTCBB. Talk City’s Friedman says his company is growing, and expects its $9 million in service fee revenue to double this year. Even if he’s forced off the Nasdaq, he has hopes of returning.
“I’d like to stay on the Nasdaq,” he says. “If we get off, we’ll build a business. Then we’ll go back on.” Contact Matt Marshall at [email protected] or (408) 920-5920.
Appendix II: Permuted Clinton/Putin News Article CLINTON TAKES STAR WARS PLAN TO RUSSIA Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin. The Russian leader has said that he will suggest an alternative to the US system. Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory. “We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow. “It won’t be long now.” Accompanying the President is US Secretary of State Madeline Albright. “What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said. “We will discuss it.” Arms control will not be the only potentially troublesome issue. US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom. US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin. The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday. High on the agenda will be the United State’s plans to build a missile shield in Alaska. Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 that bans any anti-missile devices.
Industrial Applications of Intelligent Systems at BTexact
Keynote Address
Behnam Azvine
BTexact—BT Advanced Communication Technology Centre, Adastral Park, Martlesham, Ipswich, IP5 3RE, UK
[email protected]
Abstract. Soft computing techniques are beginning to penetrate into new application areas such as intelligent interfaces, information retrieval and intelligent assistants. The common characteristic of all these applications is that they are human-centred. Soft computing techniques are a natural way of handling the inherent flexibility with which humans communicate, request information, describe events or perform actions. Today, people use computers as useful tools to search for information and communicate electronically. There have been a number of ambitious projects in recent years including one at BTexact known as the Intelligent Personal Assistant (IPA), with the aim of revolutionising the way we use computers in the near future. The aim remains to build systems that act as our assistants and go beyond being just useful tools. The IPA is an integrated system of intelligent software agents that helps the user with communication, information and time management. The IPA includes specialist assistants for e-mail prioritisation and telephone call filtering (communication management), Web search and personalisation (information management), and calendar scheduling (time management). Each such assistant is designed to have a model of the user and a learning module for acquiring user preferences. In this talk I focus on the IPA, its components and how we used computational intelligence techniques to develop the system.
Biography. Behnam Azvine holds a BSc in Mechanical Engineering, an MSc and a PhD in Control Systems, all from the University of Manchester. After a number of academic appointments, he joined BTexact Technologies (formerly BT Labs) in 1995 to set up and lead a research programme in intelligent systems and soft computing, and is currently the head of the Computational Intelligence Group at BTexact. He holds the British Computer Society medal for IT for his team’s work on the digital personal assistant project and is a visiting fellow at Bristol University. He has edited a book, contributed to more than 60 publications, has 15 international patents, and regularly gives presentations at international conferences and workshops on the application of AI in telecommunications. Currently, he is the co-chairman of the European Network of Excellence for Uncertainty Techniques (EUNITE). His research interests include the application of Soft Computing and AI techniques to human-centred computing and adaptive software systems.
Intelligent Control of Wireless and Fixed Telecom Networks
Keynote Address
John Bigham
Department of Electronic Engineering, Queen Mary College, University of London, UK
[email protected]
Abstract. Different agent systems that have been developed for the control of resources in telecoms are studied and assessed. The applications considered include the control of admission to ATM networks, resource management in the wider marketplace promised by 3G mobile networks, and location-aware services. The designs of these systems share a common pattern that differs significantly from the traditional management and control planes used in telecommunications: instead, columns of control with communication between them are built, where the degree of communication depends on the layers and the corresponding response time. In this talk it will also be shown that such a design provides increased flexibility, allowing the best policy to be applied according to the current demand.
Biography. Dr. Bigham has many years’ experience in applying computational intelligence in telecommunications. He is currently a Reader at Queen Mary College, University of London, researching Artificial Intelligence and Reasoning Under Uncertainty within the Intelligent Systems & Multimedia research group.
Assertions in Programming: From Scientific Theory to Engineering Practice
Keynote Address
Sir Tony Hoare
Microsoft Research, Cambridge, UK
Abstract. An assertion in a computer program is a logical formula (Boolean expression) which the programmer expects to evaluate to true on every occasion that program control reaches the point at which it is written. Assertions can be used to specify the purpose of a program, and to define the interfaces between its major components. An early proponent of assertions was Alan Turing (1948), who suggested their use in establishing the correctness of large routines. In 1967, Bob Floyd revived the idea as the basis of a verifying compiler that would automatically prove the correctness of the programs that it compiled. After reading his paper, I became a member of a small research school devoted to exploring the idea as a theoretical foundation for a top-down design methodology of program development. I did not expect the research to influence industrial practice until after my retirement from academic life, thirty years ahead. And so it has been. In this talk, I will describe some of the ways in which assertions are now used in Microsoft programming practice. Mostly they are used as test oracles, to detect the effects of a program error as close as possible to its origin. But they are beginning to be exploited also by program analysis tools and even by compilers for optimisation of code. One purpose that they are never actually used for is to prove the correctness of programs. This story is presented as a case study of the way in which scientific research into ideals of accuracy and correctness can find unexpected application in the essentially softer and more approximative tasks of engineering.
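As a minimal illustration (an example added here, not one taken from the talk), the short Python routine below states a precondition and a postcondition as assertions; run under test, a violated postcondition acts as a test oracle, reporting the failure close to its origin rather than in a distant caller. The routine and its names are hypothetical.

def normalize(scores):
    """Scale a non-empty list of non-negative scores so that they sum to 1."""
    # Precondition: the caller must supply usable input.
    assert scores and all(s >= 0 for s in scores), 'non-empty, non-negative input expected'
    total = sum(scores)
    assert total > 0, 'at least one positive score expected'
    result = [s / total for s in scores]
    # Postcondition as test oracle: an arithmetic error above is detected here,
    # close to its origin, rather than surfacing later elsewhere.
    assert abs(sum(result) - 1.0) < 1e-9
    return result

# normalize([2, 3, 5]) evaluates to [0.2, 0.3, 0.5]

In CPython such checks are active by default and are removed when the interpreter is run with -O, which mirrors the practice of enabling assertions during testing and, where desired, disabling them in released code.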
Biography. Professor Hoare studied Philosophy, Latin, and Greek at Oxford University in the early fifties, Russian during his National Service in the Royal Navy, and the machine translation of languages as a graduate student at Moscow State University (1959). One outcome of the latter work was the discovery of the Quicksort algorithm. On returning to England in 1960, he worked as a programmer for Elliott Brothers, and led a team in the development of the first commercial compiler for the programming language Algol 60. In 1968, he took up a Chair in Computing Science at the Queen’s University, Belfast. There his output included a series of papers on the use of assertions in program proving. In 1977, he moved to Oxford University, where ’provable correctness’ was again a focus of his research. Well-known results of this work included the Z specification language, and the CSP concurrent-programming model. Recently, he has been investigating the unification of a diverse range of theories that apply to different programming languages, paradigms, and implementation technologies. Throughout his academic career, Tony has maintained strong contacts with industry, through consultancy, teaching, and collaborative
research projects. He has taken a recent interest in legacy systems, where assertions can play an important role in system testing. In 1999, on reaching retirement age at Oxford, Tony moved back to industry as a Senior Researcher with Microsoft Research in Cambridge, England. In March 2000, he received a knighthood from the Queen for services to Computing Science.
Hybrid Soft Computing for Classification and Prediction Applications
Keynote Address
Piero Bonissone
General Electric Corporation, Research and Development Centre, Schenectady, New York
Abstract. Soft computing (SC) is an association of computing methodologies that includes as its principal members fuzzy logic (FL), neural computing (NC), evolutionary computing (EC), and probabilistic computing (PC). These methodologies allow us to deal with imprecise, uncertain data, and incomplete domain knowledge that are encountered in real-world applications. We will describe the advantages of using SC techniques, and in particular we will focus on the synergy derived from the use of hybrid SC systems. This hybridization allows us to integrate knowledge-based and data-driven methodologies to construct models for classification, prediction, and control applications. In this presentation we will describe three real-world SC applications: the prediction of time-to-break margins in paper machines; the automated underwriting of insurance applications; and the development and tuning of raw-mix proportioning controllers for cement plants. The first application is based on a model that periodically predicts the amount of time left before an unscheduled break of the web in a paper machine. The second application is based on a discrete classifier, which assigns a vector of real-valued and attribute-valued inputs, representing an insurance applicant’s vital data, to a rate class, representing the correct insurance premium. The third application is based on a hierarchical fuzzy controller, which determines the correct proportion of the raw material to maintain certain properties in a cement plant. The similarity among these applications is the common process with which their models were constructed. In all three cases, we held knowledge engineering sessions (to capture the expert knowledge) and we collected, scrubbed, and aggregated process data (to define the inputs for the models). Then we encoded the expert domain knowledge using fuzzy rule-based or case-based systems. Finally, we tuned the fuzzy system parameters using either local or global search methods (NC and EC, respectively) to determine the parameter values that minimize prediction, classification, and control errors.
Biography. Dr Bonissone has a BS in Electrical and Mechanical Engineering from the University of Mexico City (1975) and an MS and PhD in Electrical Engineering and Computer Sciences from UC Berkeley (1976 and 1979). He has been a computer scientist at General Electric Corporate Research and Development Centre since 1979, carrying out research in expert systems, approximate reasoning, pattern recognition, decision analysis, and fuzzy sets. In 1993, he received the Coolidge Fellowship Award from General Electric for overall technical accomplishments. In 1996, he became a Fellow of the American Association for Artificial Intelligence (AAAI) and has
been the Editor-in-Chief of the International Journal of Approximate Reasoning for the past seven years. He has co-edited four books, including the Handbook of Fuzzy Computation (1998), published over one hundred articles, and registered nineteen patents. He is the 2001 President-Elect of the IEEE Neural Network Council.
Why Users Cannot ‘Get What They Want’
Keynote Address
Ray Paul
Brunel University, UK
Abstract. The notion that users can ‘get what they want’ has caused a planning blight in information systems development, with the resultant plethora of information slums that require extensive and expensive maintenance. This paper will outline why the concept of ‘user requirements’ has led to a variety of false paradigms for information systems development, with the consequent creation of dead systems that are supposed to work in a living organisation. It is postulated that what is required is an architecture for information systems that is designed for breathing, for adapting to inevitable and unknown change. Such an architecture has less to do with ‘what is wanted’, and more to do with the creation of a living space within the information system that enables the system to live.
Biography. Professor Paul spent twenty-one years at the London School of Economics (1971-92), as a Lecturer in Operational Research and Senior Lecturer in Information Systems, before taking up a Chair in Simulation Modelling at Brunel University in 1992. He was Head of Department for five years (1993-98) and was then appointed Dean of the Faculty of Science in 1999. He has been a Visiting Professor in the Department of Community Medicine, Hong Kong University since 1992 and an Associate Research Fellow in the Centre for Research into Innovation, Culture and Technology (CRICT), Brunel University. He is a Director and Founder of the Centre for Living Information Systems Thinking (LIST) and the Centre for Applied Simulation Modelling (CASM) at Brunel. He has acted as a consultant for various government departments, including Health and Defence, as well as a plethora of commercial organisations and charitable bodies. Ray is co-founder of the European Journal of Information Systems, launched in January 1991, and was Chair of its Editorial Board from 1991 to 1999. He is currently on the editorial board of Computers and Information Technology, Journal of Intelligent Systems, Logistics and Information Management, and Journal of Simulation Systems Science and Technology.
Systems Design with the Reverend Bayes
Keynote Address
Derek McAuley
Marconi Research Laboratories, Cambridge, UK
Abstract. A computer viewed as a technological encapsulation of logic is a fine ideal. However, given the laws of physics actually tell us that computers will make mistakes, roll in “to err is human” in software development, and we’re some way from the ideal. Perhaps we might review how probabilistic reasoning, originally expounded in the early 18th century by the Revd Bayes and adopted now as the underpinning of machine learning, could be brought to bear on software and systems design. Tales from the trenches and some views for the future.
Biography. Professor McAuley joined Marconi in January 2001 to establish the new Marconi Labs in Cambridge. He obtained his B.A. in Mathematics from the University of Cambridge in 1982 and his Ph.D. addressing issues in interconnecting heterogeneous ATM networks in 1989. After a further five years at the University of Cambridge Computer Laboratory as a lecturer he moved in 1995 to a chair at the University of Glasgow Department of Computing Science. He returned to Cambridge in July 1997, to help found the Cambridge Microsoft Research facility. His research interests include networking, distributed systems and operating systems. Recent work has concentrated on the support of time dependent mixed media types in both networks and operating systems.
Formalism and Informality in Software Development
Keynote Address
Michael Jackson
Consultant, UK
Abstract. Because the machines we build are essentially formal we are obliged to formalise every problem to which we apply them. But the world in which the problem exists is rarely, if ever, formal. Formalisation of an informal reality is therefore a fundamental—though somewhat neglected—activity in software development. In this talk some aspects of the formalisation task are discussed, some techniques for finding better approximations to reality are sketched, and some inevitable limitations of formal models are asserted.
Biography. Professor Jackson has over forty years’ experience in the software development industry. He created the JSD method of system development and the JSP method of program design, which became a government standard. He has held visiting chairs at several universities and received various honours, including the Stevens Award, the IEE Achievement Medal, the BCS Lovelace Medal and the ACM SIGSOFT Award for Outstanding Research. He is on the editorial board of four journals: Requirements Engineering, Automated Software Engineering, Science of Computer Programming and ACM Transactions on Software Engineering and Methodology. He now works as an independent consultant and as a part-time researcher at AT&T Research.
An Industrial Perspective on Soft Issues: Successes, Opportunities and Challenges
Panel
Gordon Bell (Chair), Managing Director, Liberty Information Technology
Bob Barbour, Chief Executive, Centre for Competitiveness
Paul McMenamin, Vice President (Belfast Labs), Nortel Networks
Dave Allen, Principal Consultant, Charteris
Maurice Mulvenna, Co-Founder and Chief Executive, LUMIO
Summary. Panel members have been invited to share their experiences of handling soft issues in system design, development and operation—either personal experiences, or those of their organisation. In addition, they have been encouraged to identify a few ‘grand challenges’ areas for future research.
Author Index
Adams, Carl 280
Amor, Nahla Ben 263
Avison, David E. 280
Azvine, Behnam 102, 348
Bagai, Rajiv 141
Bell, David 206
Benferhat, Salem 263
Berry, Daniel M. 300
Bigham, John 349
Black, Michaela 74
Bonissone, Piero 352
Bontempi, Gianluca 46
Carvalho, João A. 300
Chieng, David 14
Chrysostomou, C. 1
Crangle, Colleen E. 332
Gabrys, Bogdan 232
Golubski, Wolfgang 166
Haenni, Rolf 114
Hickey, Ray 74
Ho, Ivan 14
Hoare, Tony 350
Hong, Jun 60
Hughes, John G. 60
Jackson, Michael 356
Kelley, Shellene J. 141
Kettaf, Fatima Zohra 128
Kryszkiewicz, Marzena 247
Lafruit, Gauthier 46
Lehman, Manny M. 174
Lukács, Gergely 151
Marshall, Adele 206
Marshall, Alan 14
Martin, Trevor P. 102
McAuley, Derek 355
McClean, Sally 191
McSherry, David 217
Mellouli, Khaled 263
Palmer, Fiona 191
Parr, Gerard 14
Paul, Ray 354
Pitsillides, A. 1
Ramil, J.F. 174
Ramos, Isabel 300
Richard, Gilles 128
Rossides, L. 1
Ruta, Dymitr 232
Scotney, Bryan 191
Sekercioglu, A. 1
Štěpánková, Olga 88
Sterritt, Roy 31, 206
Tomayko, Jim 315
Železný, Filip 88
Zídek, Jiří 88
Zhu, Jianhan 60