This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1507
3 Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Tok Wang Ling Sudha Ram Mong Li Lee (Eds.)
Conceptual Modeling - ER ’98 17th International Conference on Conceptual Modeling Singapore, November 16-19, 1998 Proceedings
13
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors Tok Wang Ling Mong Li Lee National University of Singapore School of Computing, Department of Computer Science 55 Science Drive 2, Singapore 117599 E-mail: {lingtw,leeml}@comp.nus.edu.sg Sudha Ram University of Arizona, Department of Management Information Systems 430J McClelland Hall, College of BPA Tuscon, AZ 85721, USA E-mail: [email protected]
Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Conceptual modeling : proceedings / ER ’98, 17th International Conference on Conceptual Modeling, Singapore, November 16 - 19, 1998. Tok Wang Ling ; Sudha Li Lee (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1998 (Lecture notes in computer science ; Vol. 1507) ISBN 3-540-65189-6
CR Subject Classification (1991): H.2, H.4, F.1.3, F.4.1, I.2.4, H.1, J.1 ISSN 0302-9743 ISBN 3-540-65189-6 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. c Springer-Verlag Berlin Heidelberg 1998 Printed in Germany Typesetting: Camera-ready by author SPIN 10639013 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Foreword
I would like to welcome you to Singapore and the 17th International Conference on Conceptual Modeling (ER’98). This conference provides an international forum for technical discussion on conceptual modeling of information systems among researchers, developers and users. This is the first time that this conference is held in Asia, and Singapore is a very exciting place to host ER’98. We hope that you will find the technical program and workshops useful and stimulating. The technical program of the conference was selected by the distinguished program committee consisting of two co-chairs and 83 members. Credit for the excellent final program is due to Tok Wang Ling and Sudha Ram. Special thanks to Frederick H. Lochovsky for selecting interesting panels, and Alain Pirotte for preparation of attractive tutorials. I would also like to thank Yong Meng Teo (Publicity Chair), and the region co-ordinators, Alberto Laender, Erich Neuhold, Shiwei Tang, and Masaaki Tsubaki, for taking care of publicity. The following three workshops are also organized to discuss specific topics of data modeling and databases: “International Workshop on Data Warehousing and Data Mining” organized by Sham Navathe (Workshop chair) and Mukesh Mohania (Program Committee Chair), “International Workshop on New Database Technologies for Collaborative Work Support and Spatio-Temporal Data Management” organized by Yoshifumi Masunaga, and “International Workshop on Mobile Data Access” organized by Dik L. Lee. Ee Peng Lim took care of all detailed work related to the workshops. I would like to thank all these people who organized the workshops as well as the members of program committees. The workshop proceedings will be published jointly after the workshop. I would also like to express my appreciation to other organizing committee members, Chuan Heng Ang (Publication), Hock Chuan Chan (Registration), Mong Li Lee and Danny Poo (Local Arrangements), Cheng Hian Goh (Treasurer), and Kian Lee Tan (Industrial Chair). Special thanks to Tok Wang Ling who worked as the central member of the organizing committee, and who made my job very easy. Last, but not least, I would like to express thanks to the members of the Steering Committee, especially to Stefano Spaccapietra (Chair), Bernhard Thalheim (Vice Chair), and Peter Chen (Chair, Emeritus) who invented the widely used ER model and started this influential conference. Finally I would like to thank all the sponsors and attendees of the conference, and hope that you will enjoy the conference, the workshops, and Singapore to the utmost extent.
November 1998
Yahiko Kambayashi Conference Chair
Program Chairs’ Message
The 17th International Conference on Conceptual Modeling (ER’98) is aimed at providing an international forum for technical discussion among researchers, developers, practitioners, and users whose major emphasis is on conceptual modeling. This conference was originally devoted to the Entity-Relationship (ER) model, but has long since expanded to include all types of semantic data modeling, behavior and process modeling, and object-oriented systems modeling. This year’s conference embraces all phases of software development including analysis, specification, design, implementation, evolution, and reengineering. Our emphasis this year has been to bring together industry and academia to provide a unique blend of original research and contributions related to practical system design using conceptual modeling. We have an exciting agenda focusing on emerging topics ranging from conceptual modeling for Web based information systems to data warehousing and industrial case studies on the use of conceptual models. The conference attracted 95 papers from authors in 31 different countries. Both industry and academic contributions were solicited. Similarly high standards were applied to evaluating both types of submissions. Of the submissions, 32 were accepted for presentation at the conference based on extensive reviews from the Program Committee and external reviewers. The program consists of 26 research papers and 6 industrial papers representing 17 different countries from around the globe. The entire submission and reviewing process was handled electronically, which proved to be a challenge and a blessing at the same time. A conference of this magnitude is the work of many people. The program committee with the help of external reviewers worked under a tight schedule to provide careful, written evaluations of each paper. Mong Li Lee, Chuan Heng Ang, Choon Leong Chua, and Sew Kiok Toh helped to coordinate the review of our electronic submission and review system, tabulated the scores and distributed reviews to authors. Since the program co-chairs are from two different continents, great coordination was required and achieved through the use of the Internet. Jinsoo Park from the University of Arizona and Mong Li Lee from the National University of Singapore did an outstanding job of assisting the Program Co-Chairs. On behalf of the entire ER’98 committee, we would like to express our appreciation to all the people who helped with the conference. Finally, our thanks to all of you for attending the conference here in Singapore. We wish you a week of fun in the enchanting garden city of Singapore! November 1998
Tok Wang Ling and Sudha Ram Program Co-Chairs
Conference Organization
Conference Chair: Yahiko Kambayashi (Kyoto University, Japan) Program Co-Chairs: Tok Wang Ling (National University of Singapore, Singapore) Sudha Ram (University of Arizona, USA) Panel Chair: Frederick H. Lochovsky (HK University of Science & Technology, Hong Kong) Tutorial Chair: Alain Pirotte (University of Louvain, Belgium) Publication Chair: Chuan Heng Ang (National University of Singapore, Singapore) Registration Chair: Hock Chuan Chan (National University of Singapore, Singapore) Finance Chair: Cheng Hian Goh (National University of Singapore, Singapore) Local Arrangements Co-Chairs: Mong Li Lee (National University of Singapore, Singapore) Danny Poo (National University of Singapore, Singapore) Workshop Chair: Ee Peng Lim (Nanyang Technological University, Singapore) Industrial Chair: Kian Lee Tan (National University of Singapore, Singapore) Publicity Chair: Yong Meng Teo (National University of Singapore, Singapore) Steering Committee Representatives: Stefano Spaccapietra (Swiss Federal Institute of Technology, Switzerland) Bernhard Thalheim (Cottbus Technical University, Germany) Peter Chen (Louisiana State University, USA)
VIII
Conference Organization
Region Co-ordinators: Alberto Laender (Federal University of Minas Gerais, Brazil) Erich Neuhold (German National Research Center for Information Technology, Germany) Masaaki Tsubaki (Data Research Institute, Japan) Shiwei Tang (Peking University, China)
Tutorials Multimedia Information Retrieval, Categorisation and Filtering by Carlo Meghini and Fabrizio Sebastini (CNR Pisa, Italy) Co-design of Structures, Processes and Interfaces for Large-Scale Reactive Information Systems by Bettina Schewe, Klaus-Dieter Schewe and Bernhard Thalheim (Germany) Advanced OO Modeling: Metamodels and Notations for the Next Millenium by Brian Henderson-Sellers, Rob Allen, Danni Fowler, Don Firesmith, Dilip Patel, and Richard Due Modeling Information Security - Scope, State-of-the-Art, and Evaluation of Techniques by Gunther Pernul and Essen (Germany) Spatio-Temporal Information Systems: a Conceptual Perspective by Christine Parent, Stefano Spaccapietra, and Esteban Zimanyi (EPFL Lausanne, Switzerland)
Workshops Data Warehousing and Data Mining Chair: Sham Navathe (Georgia Institute of Technology, USA) Program Chair: Mukesh Mohania (University of South Australia, Australia) Mobile Data Access Chair: Dik L. Lee (HK University of Science and Technology, Hong Kong) New Database Technologies for Collaborative Work Support and SpatioTemporal Data management Chair: Yoshifumi Masunaga (University of Library and Info. Science, Japan)
Conference Organization
Program Committee Peter Apers, The Netherlands Akhilesh Bajaj, USA Philip Bernstein, USA Elisa Bertino, Italy Glenn Browne, USA Stefano Ceri, Italy Hock Chuan Chan, Singapore Chin-Chen Chang, Taiwan Arbee L. P. Chen, Taiwan Roger Hsiang-Li Chiang, Singapore Joobin Choobineh, USA Phillip Ein-Dor, Israel Ramez Elmasri, USA David W. Embley, USA Tetsuya Furukawa, Japan Georges Gardarin, France Cheng Hian Goh, Singapore Wil Gorr, USA Terry Halpin, USA Igor Hawryszkiewycz, Australia Alan Hevner, USA Uwe Hohenstein, Germany Sushil Jajodia, USA Ning Jing, China Leonid Kalinichenko, Russia Hannu Kangassalo, Finland Jessie Kennedy, UK Hiroyuki Kitagawa, Japan Ramayya Krishnan, USA Gary Koehler, USA Prabhudev Konana, USA Uday Kulkarni, USA Akhil Kumar, USA Takeo Kunishima, Japan Alberto Laender, Brazil Laks V. S. Lakshmanan, Canada Per-Ake Larson, USA Dik-Lun Lee, China Mong Li Lee, Singapore Suh-Yin Lee, Taiwan Qing Li, China Stephen W. Liddle, USA
Ling Liu, USA Pericles Loucopoulos, UK Leszek A. Maciaszek, Australia Stuart E. Madnick, USA Kia Makki, USA Salvatore March, USA Heinrich C Mayr, Austria Vojislav Misic, China David E. Monarchi, USA Shamkant Navathe, USA Erich Neuhold, Germany Peter Ng, USA Dan O’Leary, USA Maria E Orlowska, Australia Aris Ouksel, USA Mike Papazoglou, The Netherlands Jeff Parsons, Canada Joan Peckham, USA Niki Pissinou, USA Calton Pu, USA Sandeep Purao, USA Sury Ravindran, USA Arnon Rosenthal, USA N L Sarda, India Sumit Sarkar, USA Arun Sen, USA Peretz Shoval, Israel Keng Leng Siau, USA Il-Yeol Song, USA Stefano Spaccapietra, Switzerland Veda Storey, USA Toby Teorey, USA Bernhard Thalheim, Germany A Min Tjoa, AUSTRIA Alex Tuzhilin, USA Ramesh Venkataraman, USA Yair Wand, Canada Kyu-Young Whang, Korea Carson Woo, Canada Jian Yang, Australia Masatoshi Yoshikawa, Japan
IX
X
Conference Organization
External Referees Iqbal Ahmed Hiroshi Asakura H. Balsters Linda Bird Jan W. Buzydlowski Sheng Chen Yam San Chee Sheng Chen Wan-Sup Cho Eng Huang Cecil Chua Peter Fankhauser Thomas Feyer George Giannopoulos Spot Hua Gerald Huck Hasan M. Jamil Panagiotis Kardasis Justus Klingemann Suk-Kyoon Lee Wegin Lee
Ki-Joune Li Jun Li Hui Li Weifa Liang Ee Peng Lim P. Louridas Sam Makki Elisabeth M´etais Wilfred Ng E. K. Park Ivan Radev Rodolfo Resende Klaus-Dieter Schewe Takeyuki Shimura Kian-Lee Tan Thomas Tesch Chiou-Yann Tsai Christelle Vangenot R Wilson
Conference Organization
Organized By School of Computing, National University of Singapore Sponsored By ACM The ER Institute
In Cooperation with School of Applied Science, Nanyang Technological University Singapore Computer Society Information Processing Society of Japan
Corporate Sponsors Beacon Information Technology Inc., Japan CSA Automated Pte Ltd Digital Equipment Asia Pacific Pte Ltd Fujitsu Computers (Singapore) Pte Ltd IBM Data Management Competency Center (Singapore) Lee Foundation NSTB(National Science and Technology Board) Oracle Systems S.E.A. (S) Pte Ltd Sybase Taknet Systems Pte Ltd
XI
Table of Contents
Keynote 1: The Rise, Fall and Return of Software Industry in Japan . . . . . . . . . . . . . . . . . . . . 1 Yoshioki Ishii (Beacon Information Technology Inc., Japan)
Session 1: Conceptual Modeling and Design Conceptual Design and Development of Information Services . . . . . . . . . . . . . . . . 7 Thomas Feyer, Klaus-Dieter Schewe, Bernhard Thalheim, Germany An EER-Based Conceptual Model and Query Language for Time-Series Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Jae Young Lee, Ramez A. Elmasri, USA Chrono: A Conceptual Design Framework for Temporal Entities . . . . . . . . . . . .35 Sonia Bergamaschi, Claudio Sartori, Italy
Session 2: User Interface Modeling Designing Well-Structured Websites: Lessons to Be Learned from Database Schema Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Olga De Troyer, The Netherlands Formalizing the Informational Content of Database User Interfaces . . . . . . . . . 65 Simon R. Rollinson, Stuart A. Roberts, UK
Session 3: Information Retrieval on the Web A Conceptual-Modeling Approach to Extracting Data from the Web . . . . . . . 78 D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, Y.-K. Ng, D.W. Quass, R.D. Smith, USA Information Coupling in Web Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Sourav S. Bhowmick, Wee-Keong Ng, Ee-Peng Lim, Singapore Structure-Based Queries over the World Wide Web . . . . . . . . . . . . . . . . . . . . . . . 107 Tao Guan, Miao Liu, Lawrence V. Saxton, Canada
Panel 1: Realizing Next Generation Internet Applications: Are There Genuine Research Problems, or Is It Advanced Product Development? . . . . . . . . . . . . . . . . . . . . . . .164 Chairpersons: Kamalakar Karlapalem and Qing Li, Hong Kong
Keynote 2: Web Sites Need Models and Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Paolo Atzeni, Universit` a di Roma Tre, Italy
Session 5: Conceptual Modeling Tools ARTEMIS: A Process Modeling and Analysis Tool Environment . . . . . . . . . . 168 S. Castano, V. De Antonellis, M. Melchiori, Italy From Object Oriented Conceptual Modeling to Automated Programming in Java* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Oscar Pastor, Vicente Pelechano, Emilio Insfr´ an, Jaime G´ omez, Spain An Evaluation of Two Approaches to Exploiting Real-World Knowledge by Intelligent Database Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Shahrul Azman Noah, Michael Lloyd-Williams, UK
Session 6: Quality and Reliability Metrics Metrics for Evaluating the Quality of Entity Relationship Models . . . . . . . . . 211 Daniel L. Moody, Australia
Panel 2: Do We Need Information Modeling for the Information Highway? . . . . . . . . . 348 Panel chair: Bernhard Thalheim, Germany
XVI
Table of Contents
Session 8: Data Warehousing Design and Analysis of Quality Information for Data Warehouses* . . . . . . . . 349 Manfred A. Jeusfeld, The Netherlands, Christoph Quix, Matthias Jarke, Germany Data Warehouse Schema and Instance Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Dimitri Theodoratos, Timos Sellis, Greece Reducing Algorithms for Materialized View Updates . . . . . . . . . . . . . . . . . . . . . . 377 Tetsuya Furukawa, Fei Sha, Japan
Industrial Session 2: Industrial Case Studies Reengineering Conventional Data and Process Models with Business Object Models: A Case Study Based on SAP R/3 and UML . . . . . . . . . . . . . . . . . . . . . . 393 Eckhart v. Hahn, Barbara Paech, Germany, Conrad Bock, USA An Active Conceptual Model for Fixed Income Securities Analysis for Multiple Financial Institutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407 Allen Moulton, St´ephane Bressan, Stuart E. Madnick, Michael D. Siegel, USA An Entomological Collections Database Model for INPA . . . . . . . . . . . . . . . . . . 421 J. Sonderegger, P. Petry, J.L. Campos dos Santos, N.F. Alves, Brazil
Session 9: Object-Oriented Approaches A Global Object Model for Accommodating Instance Heterogeneities . . . . . 435 Ee-Peng Lim, Roger H.L. Chiang, Singapore On Formalizing the UML Object Constraint Language OCL . . . . . . . . . . . . . . 449 Mark Richters, Martin Gogolla, Germany Derived Horizontal Class Partitioning in OODBs: Design Strategies, Analytical Model and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465 Ladjel Bellatreche, Kamalakar Karlapalem, Qing Li, Hong Kong
The Rise, Fall, and Return of Software Industry in Japan Yoshioki Ishii Beacon Information Technology Inc. Shinjuku L Tower, 7F. 1-6-1, Nishi-shinjuku Shinjuku-ku, Tokyo 163-1507, Japan
Abstract. The Software Industry in Japan grew extraordinarily only in the field of custom software, and fell after the collapse of the “bubble economy” in 1991. In Japan, the field of packaged software is still at an early stage of development. Why did this happen? On the other hand, Japan surpassed the U.S.A in the game software field, and became No. 1 in the world. Why is this? Can Japanese packaged software survive in the future? Or, will Western packaged software made by Microsoft, SAP etc. conquer the Japanese market? I will state my opinion based on my own experience in Software Industry during the past 30 years.
1 Introduction I have been working in the development of DBMS since the late 1960s. In the process, I have provided consulting expertise on Database products for a wide range of customers beginning in 1968. In 1973, I introduced ADABAS to the Japanese market and have been supplying this product to the IT market ever since. There are presently 800 corporate customers of ADABAS in Japan alone. I am fortunate to say that many of my peers recognize me as a pioneer in Database related developments in Japan. In parallel to my activities on the industry side, I have also been an active participant in the academic circle of Information Processing. I presented many research papers since 1973 at Database Research Group in the Information Processing Society of Japan. When ACM SIGMOD Japan was first established in 1993, I honorably accepted the post of chairperson and worked as the first chairperson of the organization from 1993 to 1995. About 10 years ago, I also started to focus on the Multi-Dimensional Model. I am presently also providing sales and development oriented consulting on Multidimensional DBMS (Essbase). Based on all of these experiences, I published my first book titled “Data Warehouse” in 1996 in Japan. Those ideas were also presented at the VLDB ’96.
T.W. Ling, S. Ram, and M.L. Lee (Eds.): ER’98, LNCS 1507, pp. 1−6, 1998. Springer-Verlag Berlin Heidelberg 1998
2
Y. Ishii
From my perspective as a technical person in the Database arena, and an executive who has managed a successful software company for the last 30 years, I will briefly speak on the Software Industry in Japan. I will relate my observations through a road that takes us from the Rise, through the Fall and to the Return of the Software Industry in Japan. I will elaborate a while on the cause of the Fall. The roots of the Software Industry in Japan trace back to 1964. As was the case in the US, computers were installed at computer centers and leased for usage by the hour. Software companies started appearing in 1968. After the first ten years, annual revenues exceeded 400 Billion Japanese Yen in 1978, and it was officially recognized as an industry. Please refer to the following chart (Fig. 1.) that shows “The Rise, Fall and Return of the Software Industry in Japan”. Billion Yen 7000
6000
5000
4000
3000
2000
1000
0 64
70
80
90
98
Fig. 1. Software Industry Growth in Japan
2 Period of Rise During this period, companies focused on computerization of back office activities relying mainly on mainframes. Custom software was developed using Cobol, Fortran and PL/I, for various private corporations, national, prefecture and local governments. Most of the development work was outsourced to software companies. Due to this reason, growth in the custom software field was abnormally high and that of packaged software was relatively low in Japan as compared to the rest of the world. Please refer to the following chart (Fig. 2), which shows a comparison of the share of custom
The Rise, Fall, and Return of Software Industry in Japan
3
software and packaged software for Japan, Europe and the US in 1988, which was also the end of this period of rise.
($Billion) User Expenditures
Source: Input
30
20
10
Custom Packaged
0
U.S.
Europe
Japan Market Overview
Fig. 2. Custom Software Development vs. Software Products, 1988
The extraordinary growth of the software industry in Japan, actually backfired and became a serious cause of its subsequent fall. The main reason was the collapse of Japanese “bubble economy” in 1991.
3 Period of Fall The computer hardware industry of Japan grew mainly on the strength in mainframe technology in the 1980s, and there even arose a possibility of surpassing the successes of the industry in the US. In order to wrestle the initiative in the 1990’s and beyond, the Japanese Government started an ambitious project called the Fifth Generation Computer Project in 1983. This project was to range over 10 years and was concentrated mainly on AI. Resources for this project were pooled not only from scientists in University Laboratories but were also recruited from the technical staff of Japanese six major companies, such as Fujitsu and Hitachi, but not IBM Japan. The scale of this project was truly massive and a lot of time, money and resources were allocated. The project classified the existing computers as 3rd generation computers and aimed to completely skip the next, 4th generation of computer technology by focusing on AI
4
Y. Ishii
technology to achieve the advanced functionality of 21st century computing. This was termed 5th generation technology and future computers were termed 5th generation computers. Around 1990, both the Japanese Government and mainframers had illusions of the coming of the 5th generation computing era and that it would arrive soon. Japan was at the peak of enjoying prosperity that accompanied the bubble economy. On the other hand, the US went through a period of recession in the latter half of 1980’s. Riding of the “Downsizing” wave, growth was seen in the sales of UNIX machines and personal computers. These technologies were an extension of 3rd generation technology. In other words, 4th generation technology made steady progress in the US. Meanwhile Japan was consistently aiming much efforts at 5th generation computing, which never materialized. The 4th generation finally did arrive in Japan. But by then, due to this strategic planning failure, computer related technologies in Japan were going in the wrong directions, and the strength of computing in Japan went down considerably compared to the US. As I had earlier mentioned the software industry in Japan concentrated disproportionately in the area of custom software development. The collapse of the Japanese “bubble economy” in 1991 had a drastic effect, and abruptly private corporations altogether stopped custom software development projects for mainframes. As a result, growth in the Japanese software industry was greatly reduced. (Fig. 1.) During the earlier years, corporations in Japan had a strong tendency for developing application software exclusively and for internal usage only. This, as one could imagine, was prohibitively expensive. With the collapse of the “bubble economy”, these private development efforts virtually stopped. To reduce costs and advance in usage of information technology, these corporations turned their attention towards UNIX on business applications for the first time. The adoption of UNIX technology in Japan therefore lagged the US by at least 3 years. Since the software industry in Japan was mainframe centered and there were few companies with the technology and experience in the UNIX arena, the customers’ demands for downsizing could not be met and the growth of the industry suffered heavily. Fujitsu, Hitachi and NEC, which had also grown mainly on the strength of the mainframe, were stagnant for several years in a similar manner to IBM. Except for a few exceptions, Japan did not have local access to 4th generation technology and UNIX in particular. Japan was misdirecting its efforts for several years. Between 1992 and 1996, what Japan could only do was to concentrate on learning US technology, which was itself quite arduous. During this period, a number of technical experts moved from software industry to other fields. Custom software is usually made as “the only one of its kind” in the world and runs on a specific site. Therefore, it is almost impossible to evaluate the quality or excellence of the application as being good or otherwise. Under this environment, even those technicians who did not produce quality software would be incorrectly perceived as technical experts. On the other hand, actual end users usually evaluate packaged software and only high quality software survives and that with inferior quality often disappears. As a result, the abilities of technicians in the package software field improved dramatically. 
But since an overwhelming majority of technicians in Japan grew up in the custom software field, I think that many of these technicians have not been able to excell..
The Rise, Fall, and Return of Software Industry in Japan
5
4 Return The Software industry in Japan and the six Japanese computer manufacturers entered into difficult times since 1991. After that, however, there occurred a big change and the Software industry in Japan has returned almost completely. Please refer to the following Fig. 3 “Worldwide 1997 Software Revenue” and Fig. 4 “Japan 1997 Software Revenue”. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Company IBM Microsoft Computer Associates Oracle Hitachi* SAP Fujitsu* DEC SUN Siemens Nixdorf Parametric Tech Intel Novell Adobe Sybase
Company Microsoft Japan Oracle Japan SAP Japan Lotus Japan Just System* Ashisuto* Beacon IT* Novell Japan Informix Japan Sybase Japan CA Japan BSP* Baan Japan
* Japanese company Fig. 4. Japan 1997 Software Revenue
6
Y. Ishii
Computer manufacturers are included in Fig. 3, but not in Fig. 4. NEC was not ranked in the Fig. 3. It seems that NEC did not report their revenue to the research group. Hitachi, Fujitsu and NEC came to be ranked highly in the worldwide ranking. Also, in the genuine Japanese software product field, various Japan-made software products were developed. These products were not only used in the Japanese market, but some of them are also being exported. Japan is positioned to become the third axis, next to the US and Europe, in the software product field as we move into the future.
Conceptual Design and Development of Information Services 1
Thomas Feyer1 , Klaus-Dieter Schewe2 , and Bernhard Thalheim1 Computer Science Institute, Brandenburg Technical University at Cottbus, P.O. Box 101344, 03013 Cottbus, FRG 2 Computer Science Institute, Clausthal Technical University, Erzstr. 1, 38678 Clausthal-Zellerfeld, FRG Abstract. Due to the development of the internet and cable nets information services are going to be widely used. On the basis of projects for the development of information services like regional information services or shopping services we develop a method for the creation of information systems and services to be used through different nets. The approach is based on the codesign of data, dialogues and presentations. The main concepts are information units and information containers. These basic concepts for information services are presented in this paper. Information units are defined by generalized views with enabled functions for retrieving, summarizing and restructuring of information. Information containers are used for transfering information to users according to their current needs and the corresponding dialogue step. Size and presentation of information containers depend on the restrictions of the users environment.
1
Background
The ‘Internet’ is currently one of the main buzzwords in journals and newspapers. However, to become a vital and fruitful source for information presentation, extraction, acquisition and maintenance fundamental design concepts are required but still missing. To bridge this gap we investigate information units and information containers and their integration with database-backed infrastructures. These are based on practical experience from projects for the development of a regional information and shopping service and meant to be the main concepts for the development of integrated information services. 1.1
Information Services
Currently we can observe a common and still increasing interest in information services to be used with the internet. Unfortunately, there is no systematic or even commonly accepted approach in building these services. On the other hand it is the goal of conceptual modelling to provide adequate methods for this task. Then at least the following problems have to be met: – – – –
Conceptual understanding of information services; Integration of different information systems; Maintenance of information quality; Evaluation of information quality;
T.W. Ling, S. Ram, and M.L. Lee (Eds.): ER’98, LNCS 1507, pp. 7–20, 1998. c Springer-Verlag Berlin Heidelberg 1998
8
T. Feyer, K.-D. Schewe, and B. Thalheim
– User-adequate presentation of information collections; – Ressource-adequate transmission through networks. Due to the large variety and diversity of information services it is advisable to concentrate either on specific aspects or characteristics. Our experience on information service development is based on two large industrial cooperations: In the project FuEline [7,18] an online database service has been developed for collecting, managing, trading and intelligently retrieving data on industrial and university research facilities. The system realizes a client-server architecture using the retrieval-oriented database machine ODARS. The system is currently used as the base system for technology transfer institutes in Brandenburg. The project Cottbus net (since 1995) aims at the development of intelligent information services that are simple to capture and simple to use. These should be available for around 90.000 households in the Lausitian region through the internet and the cable nets using either computers or TV set-top boxes. Up to now the project has developed a number of information services for travel information, shopping, regional information, industry, administration and booking services. Several architectures and suggestions from the literature have been tested including multidimensional database architectures [12,26]. Unfortunately, it has shown to be too weak and to provide unacceptable performance. The proposals very recently made in [2,8] are similar trials but different in scope, application area, and devoted to different users. The ideas presented below have been used in the development of different information services for Cottbus net. Currently, two other architectures (multi-tier architectures with fat or thin clients [18,20, 21]) are tested in parallel. Both projects have shown to us that the ad-hoc development of information services (such as web.sql) as the state of the art for most internet-based services is not acceptable due to maintenance and development costs. In both projects the information service is characterizable by the access to large amounts of almost structured data accessible through databases. The conceptual understanding of information services is based on conceptual modeling of the underlying databases, modelling of functionality and user intentions. The latter ones can be modelled by dialogues. In particular, the integration of different information systems is enabled. Careful modeling can increase information quality. Therefore, we concentrate on two main concepts for user-adequate presentation and delivery of information: information unit and information container . Information containers are transmitted through the network according to the necessary amount of information. They transfer only those data which are necessary for the current dialogue step. Technically, this optimization of data transmission is achieved by careful integration of data modeling with supplied functions and dialogues. The approach has been used to develop a platform which is now in use for Cottbus net. 1.2 Database-Backed Information Services Besides the various approaches to grasp the meaning of ‘information’ [24] and the large number of books on ‘information systems’ it is generally accepted that information needs a carrier in the form of data. For our purposes we may assume
Conceptual Design and Development of Information Services
9
that these data are structured, formatted, filtered and summarized, meet the needs and current interests of its receiver and is going to be selected, arranged and processed by him/her on the basis of his/her interests, experience, intuition, knowledge etc. Within this context we can assume that information services are systems that are based on database machines and use a certain communication infrastructure. Loosely spoken, information can be extracted from filtered and summarized data collections, where filtration is similar to view generation and summarization of selected data can be performed on the basis of the computational functionality of the given machine. Finally, information presentation respects environmental conditions and user needs. A large number of information service applications is out of the scope of our particular research. For example, travel guidance systems are based on information which is relatively stable. They are usually made commercially available on CD-ROMs. Database systems are used whenever data has a good update rate and the information service requires actuality. The technical embedding of database systems into information services can be based on middleware solutions. In general, database-backed information services can be integrated into DBMSs, although database vendors do not yet offer such a fully integrated solution. A large number of tools for retrieval and manipulation of databases through the internet has been developed. These tools use specific protocols and are mainly designed for specific DBMSs. For this reason each information service has to be based on several interfaces to databases, whilst the information service itself uses specific databases. Thus, these databases can be adapted to the needs of information services. In this case, the design and development of information services subsumes some of the ordinary design and development tasks. Additional requirements are implied by the variety of used displays. 1.3
Codesign of Information Service Applications
As outlined so far, many information service applications are based on information systems. This renders conceptual modelling, especially database design, a fundamental task in information service development. This task subsumes the design of database structure with corresponding static integrity constraints, database processes with corresponding dynamic integrity constraints and user interfaces. Conceptually, there are two dimensions: static/dynamic and global/local. The global static component is usually modelled by database schemata and the global dynamic component by processes implemented as transactional application programs. The local static component is often modelled by information units and the local dynamic component by the user interface which depends on the information units and the processes. Although views filter and summarize data from the database, the local static component for information services is more complex. Information units are computed by computational rules, condensed by abstraction and rebuilding rules and finally scaled by customizing and building a facetted representation. In Sect. 2 we shall discuss this process. Similarly, the local dynamic component is much more complex than the user
10
T. Feyer, K.-D. Schewe, and B. Thalheim
interface. It captures all aspects of user-driven processing on different application layers. Therefore, we prefer to talk of a dialogue component. Each dialogue consists of elementary dialogue steps corresponding to actions selected by the user [19]. Their order depends on the application story and its underlying business processes. Thus, dialogues generalize ‘use cases’. In general we can model dialogues for groups of actors or roles as stated in [14,24,27]. Since we do not intend to discuss codesign in detail, we refer the interested reader to [4]. 6 local
information containers -
information units
6 filtration summarization scaling
dialogues
6 enabled manipulation requests
supplied processes
global
database schema
enabled processes
static
- processes dynamic
-
Fig. 1. Information Services Codesign: Data and Process Flow Perspective
Information units can be the input for dialogues using either the formation at run-time according the actual environment and the user request or predefined data collections. The first approach is more general but seldom computationally tractable. The second approach is simpler and can be based on results of conceptual design. Information containers are obtained by the application of formation and wrapping rules to collections of information units. In Sect. 3 the complete definition of containers is given. Containers are constructed from information units according to the user needs and their environment. The chosen approach to create information services is illustrated in Fig. 1.
2
Modelling Information Units
Information units depend on the database schema. They represent data in a standard, intuitive framework that allow high-performance access. Information units modelling can be compared with the modelling of semistructured data. Then information units turn out to be generalized views on the database [3, 15]. The generalization should support data condensation and supplementary facilities to enable an adequate representation to the user. We restrict the rule system used for generating units from the database to the smallest possible system. The rule system can be extended by inclusion of different analysis systems to enable a detailed analysis of data sets. Other extensions can be included, since the rule system is considered to be an open system. In order to define the rule system, we discuss first the modelling process.
Conceptual Design and Development of Information Services
11
2.1 Modelling Process Since we are interested in the support of information services we use the most general definition. Thus, the computation of information units is separated into three consecutive steps: Filtration by computational rules results in a view in the usual sense. In general, a view has its own schema, the simplest case being a subschema of the given database schema. Summarization by abstraction and rebuilding rules is the abstraction and construction of preinformation from the filtered data. The result will be called a raw information unit. In this step the demanded data condensation applies. Scaling by scaling rules is a process of customizing and building a facetted representation of information based on user interests, profiles etc. It uses typestructured queries and satisfies the requirement for supplementary facilities. (a) HERM subdiagram HH promoted company HHon I @ @ 6 ? @ H HH H organizes trading location HH HH 6@ I @ ? ? ? HH @ H belongs held H - event HHto HHon 6 ? HH has person site HH role
(b) raw information unit obtained by filtering and summarizing promotion period
selling period
3 Q k Q Q Q sport HH - location organizing H event H site ∈ Cottbus ?
hosting club
Fig. 2. Subschema for cultural, sport etc. events
No matter, whether views are materialized or not, raw information units and information units depend on the application and the functionality attached to the information containers. Example 1. The database schema in Fig. 2a representing data on events is a simplification of the schema used in Cottbus net. We use the higher-order ER model which allows relationship types to be defined over relationship types as their components, e.g. consider the type has role. Suppose that filtration is based on selecting sport events, companies which are clubs and locations residing in Cottbus. The filtration rule is expressible by a nested Select-From-Where-statement in ERQL. Alternatively, we may use the generalized ER-QBE discussed in [9,25]. Then a simplified ER-QBE-table for this query is the following one: organizes trading promoted on belongs to event company date kind held on ... event ... site location kind name kind ... location ... sport n club hosting l n Cottbus l
t u
12
2.2
T. Feyer, K.-D. Schewe, and B. Thalheim
Abstraction and Rebuilding Rules
Since filtration is defined by views we concentrate on the rules for summarization and scaling. Views are used for representation of various aspects in the application, but it is often claimed that the data consumed by different processes cannot be consistently represented in the database at the same time. This problem can be solved on the basis of event-condition semantics [23]. Derived views considered so far do not introduce new values as needed for condensation. This is achieved by abstraction and rebuilding rules, e.g. for summarization of numeric values, and extends aggregation formulae in SQL. Many information service operations (comparisons with aggregation, multiple aggregation, reporting features) are hard or impossible to express in SQL. Further, other query techniques like scanning, horizontal and vertical partitioning, parallel query processing, optimization of nested subqueries, or commutation of group by (cube) and join cannot be applied. Abstraction and rebuilding rules result in raw information units which need further to be adapted to the user’s needs, requirements and capabilities. We remark that on the basis of the specification of units a certain database functionality is enabled. Example 2. The events database in Fig. 2a keeps data on ongoing cultural or sport events etc. Our aim is to define an information unit which is used to obtain information on sport events organized in Cottbus by hosting clubs with information for picking up tickets and advertisement. Thus, we summarize the filtered data from Example 1 according to the schema in Fig. 2b. t u 2.3
Scaling Rules
Information units are obtained from raw information units by supplement rules and completion with functions: – Measure rules are used for translation of different scales used for the domain of values. Measure rules are useful especially for numerical values, prices etc. – Ordering rules apply to the ordering of objects in the information unit which depends on the application scenario. They are useful for the determination of the correct order in the presentation during dialogues. – Adhesion rules specify the coherence of objects that are put together into one unit. Adhesion rules are used for detecting disapproved decompositions. Objects with a high adhesion should be displayed together. – Hierarchy metarules express hierarchies among data which can be either linear or fanned. The rules can be used for computation of more compact presentations of data summaries. Example 3. In our event example from Fig. 2 a preordering is given by hosting club ' sport event selling period promotion period location. The preorder can be defined on different levels of abstraction. For example the attributes within entity hosting club are preordered by club name kind founded size remark. The adhesion of clubs to events is higher than the one of locations and time
Conceptual Design and Development of Information Services
13
to event, although two functional dependencies hold, and one is not preferred above the other. To represent adhesion we state the matrix which contains proximity between entities, where 0 indicates no adhesion and 1 indivisible adhesion (similar to ordering, adhesion can be additionally defined on attribute level): Adhesion proximity hosting club ...
hosting club 1.0
sport event 0.7
selling period 0.5
promotion period 0.5
location 0.3
Finally, several hierarchies exist such as the time hierarchy (year, month, week, day, daytime) and the location hierarchy (region, town, village, street). t u Besides the pure static aspects of information units described so far, functions from the following (not yet complete) list can be attached to information units: – Generalization functions are used for generation of aggregated data. They are useful in the case of insufficient space or for the display of complementary, generalized information after terminating a task. Hierarchy rules are used for the specification of applicability of generalization functions. The roll-up function in [1], slicing, and grouping are special generalization functions. – Specialization functions are used for querying the database in order to obtain more details for aggregated data. The user can obtain more specific information after he has seen the aggregated data. Hierarchy rules are used for the specification of applicability of specialization functions. The drill-down function used in the data warehouse approach is a typical example. – Reordering functions are used for the rearrangement of units. The pivoting, dimension destroying, pull and push functions [1] and the rotate function are special reordering functions. – Browsing functions are useful in the case that information containers are too small for the presentation of the complete information. – Sequentialization functions are used for the decomposition of sets or sequences of information. – Linking functions are useful whenever the user is required to imagine the context or link structure of units. – Survey functions are used for the graphical visualization of unit contents. – Searching functions can be attached to units in order to enable the user for computation of add-hoc aggregates. – Join functions are used for the construction of more complex units from units on the basis of the given metaschema. Example 4. Depending on the time granularity opening hours of organizers are presented by time intervals, weekly opening hours, or single dates. Generalization and specialization functions may swap between these representations. By applying reordering functions content of event data will be tailored to users needs. Event data includes, for example, either the event, its location and visualized map coordinates or the event, its hosting club and contact information. If additional information as detailed description or visualized directions do not fit into one container, browsing functions distribute data into several containers. They are provided by appropriate context and linking information. t u
14
T. Feyer, K.-D. Schewe, and B. Thalheim
Finally, we derive an interchange format for the designed information units which is used for the packing of units into containers. Identifiers are used for the internal representation of units. The formal context interchange format represents the order-theoretic formal contexts of units. The context interchange format is specified for each unit by the unit identifier, the type of context, the subsequent units, and the incident units. Example 5. In the event example, the order is either specified by the scenario of the workflow or by the order of information presentation. For example, it is assumed that information on actual sport events is shown before information on previous sport events is given. An advantage of the approach is the consideration of rule applicability to raw units. For this reason almost similar looking, simple units are generated. t u 2.4
Differences between Views and Information Units
Our intention behind the introduction of information units is to provide a standard, intuitive framework for information representation that enables high-performance access. Summarization and compactification should be supported by appropriate software as well as methods for the analysis of information. ER schemata turn out to be unsuitable for this purpose, since end-users, especially casual users, cannot understand nor remember complex ER schemata nor navigate through them. Thus, the task of query specification is getting too hard for the user. Therefore, for the development of information services we need – a standard and predicatable framework that allows for high-performance access, navigation, understanding, creation of reports, and queries, – a high-performance ‘browsing’ facility across the components within an information unit and – an intuitive understanding of constraints, especially cardinality constraints as guidance of user behaviour. At first glance this list of requirements looks similar to the one for data warehouses or multidimensional databases[12,17]. However, the requirements for information services are harder to meet, since the conceptual schema splits into multiple external views. The ER design techniques seek to remove redundancy in data and in the schema. Multidimensional databases often can handle redundancy for the price of inefficiency, infeasibility and modification complexity. Incremental modification, however, is a possible approach to information units and hence for information containers. By developing simple and intuitively usable information units user behaviour becomes more predictable. We can attach an almost similar functionality to the information units. This advantage preserves the genericity property of relational databases where the operations like insert are defined directly with the specification of the relations. Since information containers are composed from units, containers also maintain their main properties. Since user behaviour is encorporated and additional functionality is added, containers have additional
Conceptual Design and Development of Information Services
15
properties that will be discussed in Sect. 3 and compared with other concepts in anticipation to the following table: ER schemata MultiER-based ER-based with external dimensional information information views databases units containers redundancy + + + schema modification + + + navigation through subschemata (+) + + relationship-based subschemata (+) + + + coexistence of subschemata (±) (+) + + additional functionality + + compositionality + genericity (±) (∓) + + TA-efficiency + + + materialization + ± -
3
Information Containers
In internet applications it is commonly assumed that pages can be arbitrarily large. Psychological studies, however, show that typical users only scan that part of a page that is currently displayed leaving vertical browsing facilities through mouse or pointers untouched [10]. This limited use of given functionality is even worse for cable net users, since the browsing devices are even harder to use. For this reason we have to take into consideration the limitations of display. The concept of information containers solves this problem. They can be considered as flexible generalizations of dialogue objects [19]. Containers can be large as in the case of mouse-based browsers or tiny as in the case of TV displays. Similar to approaches in information retrieval we distinguish between the logical structure described by container parameters, the semantical content given by container instantiations and layout defined by container presentations. 3.1
Defining Information Containers
Since the data transferred in information containers are semistructured we may adapt the concept of tuple space [6]. The tuple space of containers is defined as a multiset of tuples, i.e., sequences of actual fields, which can be expressions, values or multi-typed variables. Variables can be used for the presentation of information provided by information units. The loading procedure for a container includes the assigment of types to variables. The assigment itself considers the display type (especially the size) of the variables. Pattern-matching is used to select tuples in a tuple space. Two tuples match if they have the same values in those fields which are common in both. Variables match any value of the same display type, and two values match only if they are identical. The operations to be discussed below for information containers are based on this general framework. Information containers are defined by: – Capacity of containers restricts the size and the display types of variables in the tuple space of the container.
16
T. Feyer, K.-D. Schewe, and B. Thalheim
– Loadability of containers parametrizes the computational functionality for putting information into one container. Functions like preview, precomputation, prefetching, or caching are useful especially in the case when capacity of containers is low. – Unloadability of containers specify readability, scannability and surveyability attached to containers. Instantiation of information containers is guided by the rules and the supported functions of the information units from which the container is loaded. Whether supported functions are enabled or not depends on the application and the rules of the units. The operations defined on tuple spaces are used for instantiation of containers by values provided by the information units. The size parameters limit the information which can be loaded into the containers. Figure 3 shows three different variants of containers. The “middle” container allows us to ‘see’ the general information on a selected meeting and the information on organizers and sales agents. The dialogues for information services we are currently developing are still rather simple. Dialogue steps can be modelled by graphs, in some cases even by trees. The typing system can be changed, if dialogues are more complex or the modelling of complex workflow is intended. Thus, the dialog itself can be in a certain state which corresponds to a node in the graph or to a subgraph of the dialogue graph. Information containers are used in dialogues to deliver the information for dialogue states. Layout of containers is expressible through style rules depending on the container parameters. Additional style rules can be used for deriving container layout according to style guides developed for different applications. 3.2
Modelling the Information Content for Dialogs
Information containers support dialogues and dialogue states. Therefore, an escort information is attached to each container which depends on its instantiation (see Fig. 3). This information is used to guide the user in the current state and the next possible states and to provide additional background information. In internet pages this information is often displayed through frames, but frames are very limited in their expressibility and often misleading. For this reason we prefer the explicit display of escort information. Then we can use two different modes: complete information displays the graph with all nodes from which the current state can be reached; minimal information displays at least the path which is used through the application graph to reach the current node. One important aim in the Cottbus net project is the development of dialogs with a self-descriptive and self-explainable information content. In order to achieve this goal, dialogs are modeled on the basis of their suported processes, their enabled manipulation operations and especially on the basis of the enabled information units with attached functionality. Dialogs are constructed from dialog steps. Each dialog step has its information content and its context information. The composition of dialogs from dialog steps can be used to separate the information which needs to be displayed in
(Figure: the subgraph around “interest in sport” — Cottbus information, sports organizations, sport events, sports club, commercial provider, sports enthusiast, time schedule, kinds of sport, meetings — together with the “small”, “middle”, and “large” containers and their complete or partial escort information.)
Fig. 3. The subgraph of interest in sport
the single step from the information which belongs to the step but has already been displayed in previous steps. For example, in the middle container the sport club information can be separated into necessary information which has to be displayed for the container and into escort information which can be shown upon request. The separation we have used is based on the functions defined for tuple spaces [6] like selective insert, cascaded insert, conditional insert, etc.
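The two display modes for escort information introduced in Sect. 3.2 (complete versus minimal) amount to simple computations on the dialogue graph. The following Python sketch shows one possible reading of these modes; the graph encoding and the function names are our own and are not taken from the Cottbus net implementation.

# dialogue graph as adjacency lists: node -> list of successor nodes
GRAPH = {"Cottbus information": ["interest in sport"],
         "interest in sport": ["sport events", "sports organizations"],
         "sport events": ["meetings"],
         "sports organizations": ["sports club"],
         "meetings": [], "sports club": []}

def complete_escort(graph, current):
    # all nodes from which the current state can be reached (fixpoint over predecessors)
    nodes = set()
    changed = True
    while changed:
        changed = False
        for node, succs in graph.items():
            if node in nodes:
                continue
            if current in succs or any(s in nodes for s in succs):
                nodes.add(node)
                changed = True
    return nodes

def minimal_escort(graph, path_taken):
    # at least the path actually used to reach the current node
    return list(path_taken)

print(complete_escort(GRAPH, "meetings"))
print(minimal_escort(GRAPH, ["Cottbus information", "interest in sport",
                             "sport events", "meetings"]))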
3.3 Formation and Wrapping Rules
Formally, instantiation of information containers is the process of assigning values from information units to variables of the tuple space. Furthermore, information containers have functions for loading the container and functions for searching within the container. On the basis of enabled computational functions (generalization, specialization, reordering, browsing, linking, surveying, search-
ing, joining) for analysis and interpretation of data in the used units, the general functionality which can be used for information containers is derivable. Additionally, the user profile is taken into consideration. In order to handle these requirements we use two different rule sets: Formation rules are used to instantiate the container depending on the necessary information and functionality. Information containers are similar to containers used in transportation. They can be unloaded only in a certain order and with certain instruments. Thus, depending on the necessary size, information containers can be loaded with different information units. The loading process is based on the structure of the dialog and on the properties of units, such as the association of information in different units. Based on the design of units, the set of available information containers and the design of dialogs, we can infer the presentation scenario. It contains the description of units, their association (adhesion, cohesion) and their enabled functionality for each dialog step in a certain dialog. The presentation scenario is used to describe the different approaches enabled for the user to extract information from the container. Browsing, linking and join functions of the exploited units can be used for achieving flexibility in dialog steps. Since the variety of possible sets of enabled functions can be very high, we use different models for the computation of data. These models are based on application scenarios and include operations like aggregation and prediction as well as analysis operations for generating status reports and comparing different variants of data. A typical status data type is the shopping basket. In the sports example, users are enabled to store several variants of shopping data and schedules. The sport example has only one presentation scenario. However, there is a large variety of generated links and a browsing functionality. Wrapping rules are used to pack the containers depending on the user’s needs and the dialog steps in the current dialog. The application of wrapping rules depends also on the properties of containers. The wrapping rules can be changed whenever different style rules or display rules are going to be applied [21,20]. This flexibility is also necessary if the communication channel is currently overloaded and does not allow the transportation of large containers. The transportation of container contents can be dynamically adapted to the order of dialog steps. Special wrapping rules are used for labeling the container’s content. If such rules are applicable then the user can ask for a summary of the container’s content before requesting to send the complete container. The label of each container is generated on the basis of survey functions defined for the units of the container. Thus, this approach enables an intuitive data manipulation in the style users know from spreadsheets. Further, wrapping rules can be developed for restructuring the information presentation in accordance with the repeatedly visited steps of the dialogs. Also, reordering and sequentialization functions defined for units can be used for better flexibility in information containers. Style rules are used for wrapping the instantiated information container. Information containers are based on user profiles, their preferences, their expectations, and their environment. Thus, handling of containers, loading and
reloading should be as flexible as possible. Customizing containers to the dialogue environment can be done on the basis of customization rules. In our sports example, wrapping rules can be used for the display style of (escort) information, for the placement of information on the screens, for enabling different functions and for displaying the content of the container.
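As an illustration of how such rules can be evaluated, the following Python sketch applies a capacity-bound formation rule and a simple wrapping (labelling) rule to a container. It is only a schematic reading of the rule sets described above; the unit structure, the priority field and the summary text are assumptions made for the example, not part of the Cottbus net system.

from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    size: int          # display size needed by the unit
    priority: int      # importance of the unit for the current dialog step
    summary: str       # produced by a survey function of the unit

def formation_rule(units, capacity):
    # load the most important units first until the capacity of the container is reached
    loaded, used = [], 0
    for unit in sorted(units, key=lambda u: -u.priority):
        if used + unit.size <= capacity:
            loaded.append(unit)
            used += unit.size
    return loaded

def wrapping_rule(loaded_units):
    # label the container with a summary generated from the survey functions of its units
    return {"label": "; ".join(u.summary for u in loaded_units),
            "content": [u.name for u in loaded_units]}

units = [Unit("meeting", 40, 3, "general meeting data"),
         Unit("organizer", 30, 2, "organizer and selling agents"),
         Unit("details", 50, 1, "additional details and location")]
print(wrapping_rule(formation_rule(units, capacity=80)))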
4 Conclusion
There is a need for the systematic construction of information services. The approach outlined in this paper is based on information units which are used in a larger number of dialogue steps. Information units are composed into information containers which form the basis for the human interface to the services. Information containers allow a certain functionality tailored to the user needs. Based on this functionality, a user can send manipulation requests to an underlying database. The whole approach is rule-based with execution semantics given by lazy evaluation. Currently, conflicts are just monitored, but at a later stage this will also be done systematically. The approach presented so far has proven sufficient for several large information service projects. In order to meet further requirements we have developed an open architecture. Thus, additional rules can be added to each of the presented steps. Furthermore, the addition of new components to derived views, raw units, information units and, to a certain extent, also to containers can be handled in a simple fashion. This makes the approach easily extensible in the case of unexpected changes. We are not interested in developing a general framework for information handling. Our aim so far is the development of a platform which enables the conceptual design of information services such as business information services, administration information services, online inhabitants’ services, educational information services, shopping services, etc. The topics discussed in this paper are currently used in our information services projects and are still under investigation. Therefore, there is a large number of open questions, such as incremental updates of units and containers. Nevertheless, the approach has been flexible enough for the inclusion of solutions to new requirements. Thus, the presented method can be considered a viable approach to the development of database-backed information services. Acknowledgement. We would like to thank the members of the FuEline and Cottbus net project teams for their stimulating discussions and their effort to implement our ideas.
References
1. R. Agrawal, A. Gupta, S. Sarawagi, Modeling multidimensional databases. Proc. Data Engineering Conference, 232–243, Birmingham, 1997.
2. P. Atzeni, G. Mecca, P. Merialdo, Design and maintenance of data-intensive web sites. EDBT 98, Valencia, 1998, LNCS 1377, 436–450.
3. P. Bretherton, P. Singley, Metadata: A user’s view. IEEE Bulletin, February, 1994.
4. W. Clauss, B. Thalheim, Abstraction layered structure-process codesign. D. Janaki Ram, editor, Management of Data, Narosa Publishing House, New Delhi, 1997. 5. L.M.L. Delcambre, D. Maier, R. Reddy, L. Anderson, Structured maps: Modeling explicit semantics over a universe of information. Int. Journal of digital Libraries, 1997, 1(1), 20–35. 6. R. De Nicola, G.L. Ferrari, R. Pugliese, KLAIM: a kernel language for agents interaction and mobility. Report, Dipartimento di Sistemi e Informatica, Universit` a di Firenze, Florence, 1997. 7. F. Fehler, Planing and development of online-systems for enterprise-wide information exchange. PhD Thesis, BTU Cottbus, 1996 (In German). 8. P. Fraternali, P. Paolini, A conceptual model and a tool environment for developing more scalable, dynamic, and custumizable web applications. EDBT 98, Valencia, 1998, LNCS 1377, 422–435. 9. J. Grant, T.W. Ling, and M. L. Lee, ERL: Logic for entity-relationship databases. Journal of Intelligent Information Systems, 1993, 2, 115–147. 10. J. Hasebrock, Multimedia psychology. Spektrum, Berlin, 1995. 11. R.E. Kent, C. Neuss, Conceptual analysis of hypertext. Intelligent Hypertext (Eds. C. Nicholas, J. Mayfield), LNCS 1326, Springer, 1997, 70–91. 12. R. Kimball, A dimensional modeling manifesto. DBMS, July 1996, 51–56. 13. M.W. Lansdale, T.C. Ormerod, Understandig interfaces. Academic Press, 1995. 14. J. Lewerenz, Dialogs as a mechanism for specifying adaptive interaction in database application design. Submitted for publication, Cottbus, 1998. 15. A. Motro, Superviews: Virtual integration of multiple databases. IEEE ToSE, 13, 7, July, 1987. 16. K. Parsaye, M. Chignell, Intelligent database tools and applications. John Wiley & Sons, Inc., New York, 1995. 17. N. Pendse, The olapreport. Available through www.olapreport.com, 1997. 18. M.Roll, B Thalheim, The surplus value service system FOKUS. INFO’95, Information technologies for trade, industry and administration, Potsdam, 355–366, 1995. (in German). 19. K.-D. Schewe, B. Schewe, View-centered conceptual modelling - an object-oriented approach. ER’96, LNCS 1157, Cottbus, 1996, 357–371. 20. T. Schmidt, Requirements, concepts, and solutions for the development of a basic technology of information services - The client. Master Thesis, BTU Cottbus, 1998 (In German). 21. R. Schwietzke, Requirements, concepts, and solutions for the development of a basic technology of information services - The server. Master Thesis, BTU Cottbus, 1998 (In German). 22. C.T. Talcott, Composable semantic models for actor theories. TAPSOFT, 1997. 23. B. Thalheim, Event-conditioned semantics in databases. OO-ER-94, (Ed. P. Loucopoulos), LNCS 881, 171–189, Manchester, 1994. 24. B. Thalheim, Development of database-backed information services for Cottbus net. Preprint CS-20-97, Computer Science Institute, BTU Cottbus, 1997. 25. B. Thalheim, The strength of ER modeling. Workshop ‘Historical Perspectives and New Directions of Conceptual Modeling’, Los Angeles, 1997, LNCS, 1998. 26. E. Thomson, OLAP solutions: Building multidimensional information systems. John Wiley & Sons, Inc., New York, 1997. 27. E.S.K. Yu, J. Mylopoulos, From E-R to ”A-R” - Modelling strategic actor relationships for business process reengineering. ER’94, LNCS 881, 548-565, Manchester, 1994.
An EER-Based Conceptual Model and Query Language for Time-Series Data

Jae Young Lee and Ramez A. Elmasri
Computer Science and Engineering Department, University of Texas at Arlington
Arlington, TX 76019-0015, U.S.A.
{jlee, elmasri}@cse.uta.edu
Abstract. Temporal databases provide a complete history of all changes to a database and include the times when changes occurred. This permits users to query the current status of the database as well as the past states, and even future states that are planned to occur. Traditional temporal data models concentrated on describing temporal data based on versioning of objects, tuples or attributes. However, this approach does not effectively manage time-series data that is frequently found in real-world applications, such as sales, economic, and scientific data. In this paper, we first review and formalize a conceptual model that supports time-series objects as well as the traditional version-based objects. The proposed model, called the integrated temporal data model (ITDM), is based on EER. It incorporates the concept of time and provides the necessary constructs for modeling all different types of objects. We then propose a temporal query language for ITDM that treats both version-based and time-series data in a uniform manner.
1. Introduction

Objects in the real world can be classified into the following three different types according to their temporal characteristics:
1. Time-invariant objects: These objects are constrained not to change their values in the application being modeled. An example is the SSN of an employee.
2. Time-varying objects (or version-based objects): The value of an object may change with an arbitrary frequency. An example is the salary of an employee.
3. Time-series objects: Objects can change their values, and the change of values is tightly associated with a particular pattern of time. Examples are daily stock price and scientific data sampled periodically.
Most traditional temporal databases [3,10,15,18,19] concentrated on the management of version-based objects. There have been specialized time-series management systems [1,4,5,6,16,17] reported in the literature. However, the main
focus of these systems was time-series objects only. In [14], the integrated temporal data model (ITDM) was first proposed, which integrates all different types of objects. Based on ITDM, various techniques to implement time-series data were also studied in [7]. In this paper, we first formalize the ITDM, then propose a query language for ITDM, with which we can query time-series data as well as version-based data. We show how time-series querying constructs and version-based querying constructs can be integrated within the same query language constructs. The paper is organized as follows. Section 2 briefly reviews ITDM. Section 3 discusses the syntax and semantics of path expressions along with the concepts of temporal projection and temporal selection. The proposed query language is discussed in detail in Section 4. Related work is discussed in Section 5, and Section 6 concludes the paper. Due to the lack of space, we do not include a formal description of ITDM and the query language in this paper. Interested readers are referred to [13].
2. Overview of ITDM

2.1 Basic Concept of Time Series and Time Representation

A time series is a sequence of observations made over time. The pattern of time according to which the observations are made is specified by a calendar [2,12]. So, each time series has associated with it a particular calendar. Typically a time series is represented as an ordered set of pairs: TS = {(t1, v1), (t2, v2), …, (tn, vn)}, where ti is the time when the data value vi is observed. Sometimes a time series has two or more data values observed at each ti, and is represented as TS = {(t1, (v(1,1), v(1,2), …, v(1,k))), (t2, (v(2,1), v(2,2), …, v(2,k))), …, (tn, (v(n,1), v(n,2), …, v(n,k)))}. To represent a specific subset of the time dimension, in general, we use a temporal element [11]. A temporal element is a finite union of time intervals: T = I1 ∪ I2 ∪ … ∪ In. Each time interval is an ordered set of consecutive time units, Ii = {t1, t2, …, tk}, and is represented as [t1, tk]. For example, assuming the granularity Day, a temporal element T = {[1, 3], [8, 9]} is equivalent to T = {Day1, Day2, Day3, Day8, Day9}. For convenience, however, we will use a conventional notation in this paper, such as {[1/1/98, 1/20/98], [2/3/98, 2/15/98]}.
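To make the preceding notation concrete, the following Python sketch shows one straightforward in-memory representation of a time series and of a temporal element. It is only an illustration of the definitions above; the class and attribute names are ours and are not part of ITDM.

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TimeSeries:
    calendar: str                 # name of the associated calendar, e.g. "Quarters"
    observations: list            # ordered list of (time, value) pairs

@dataclass
class Interval:
    start: date
    end: date                     # an interval [t1, tk] of consecutive time units

    def chronons(self, granularity=timedelta(days=1)):
        t, units = self.start, []
        while t <= self.end:
            units.append(t)
            t += granularity
        return units

# a temporal element is a finite union of intervals
temporal_element = [Interval(date(1998, 1, 1), date(1998, 1, 20)),
                    Interval(date(1998, 2, 3), date(1998, 2, 15))]

dividend = TimeSeries("Quarters",
                      [(date(1997, 3, 31), 0.35), (date(1997, 6, 30), 0.35),
                       (date(1997, 9, 30), 0.40), (date(1997, 12, 31), 0.40)])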
2.2 Overview of ITDM

ITDM is based on the Enhanced ER (EER) model [8]. An entity represents an independent object or concept in the real world. An entity type represents a collection of entities that have the same properties. The properties are represented by a set of attributes that are associated with the corresponding entity type. Two relationships are supported: a named relationship type and an IS_A relationship. A named relationship type models an association among different entity types, and is defined in terms of participating entity types. An IS_A relationship represents generalization and
specialization processes. Attributes represent the properties of either an entity type or a named relationship type. In ITDM, three different types of objects are recognized and modeled as attributes: time-invariant attributes, time-varying attributes, and time-series attributes. An attribute A is a tuple (TD, VD, AV), where TD(A) is the temporal domain, VD(A) is the value domain, and AV(A) is the attribute value of A. An entity type E is a tuple (EA, EP). Here, EA(E) is a set of attributes and EP(E) is the population of E; EA(E) = {A1, A2, …, Ak}, EP(E) = {e1, e2, …, en}. An entity ei is a tuple (surrogate, lifespan, EV). The surrogate is a system-generated unique identifier of each entity. The lifespan represents the time interval(s) during which the corresponding entity existed or the entity was of interest to the database. EV, denoting the value of an entity, is a set of attribute values: EV(ei) = {AV(Aj(ei)) | 1 ≤ j ≤ k}, where AV(Aj(ei)): TD(Aj(ei)) → VD(Aj(ei)) and TD(Aj(ei)) ⊆ lifespan(ei). Note that an attribute value of an entity is defined to be a function from the temporal domain of the attribute to the value domain of the attribute. Such a function is referred to as a temporal assignment. A named relationship type is modeled as R = (RE, RA, RP), where
− RE(R) = {(E1, ro1, c1), (E2, ro2, c2), …, (Em, rom, cm)}, where Ei is a participating entity type, roi is the role name of Ei, and ci is the structural constraint on Ei, represented by ci = (mini, maxi).
− RA(R) = {A1, A2, …, Ak}, a (possibly empty) set of attributes.
− RP(R) = {r1, r2, …, rn}, a set of relationship instances, where ri = (PE, lifespan, RV) such that
  − PE(ri) is represented as (surrogate(e1), surrogate(e2), …, surrogate(em)), where each ej ∈ EP(Ej), 1 ≤ j ≤ m, participates in ri.
  − lifespan(ri) ⊆ ∩_{j=1..m} lifespan(ej), with each ej participating in ri.
  − RV(ri) = {AV(Ai(ri)) | 1 ≤ i ≤ k}, where AV(Ai(ri)): TD(Ai(ri)) → VD(Ai(ri)) and TD(Ai(ri)) ⊆ lifespan(ri).

2.3 An Example

An example ITDM schema is shown in Fig. 1, which will be used in the following sections to illustrate queries. A time-invariant attribute is represented by an oval. A time-varying attribute is distinguished by a rectangle inside an oval. A time-series attribute has, in addition to a rectangle inside an oval, an associated calendar connected to it by an arrow. In the schema diagram, for example, dividend, price, ticks and population are time-series attributes and Quarters, BusinessWeek, WorkHours, and Years are, respectively, their associated calendars. The calendar Quarters specifies quarterly time units when dividends are paid. The attribute price represents daily high and low prices of a stock. It also has a nested time-series attribute ticks, which records hourly prices. The calendar BusinessWeek specifies 5 days a week (Monday through Friday) except all holidays when stock markets are closed. The calendar WorkHours specifies 9:00 AM to 5:00 PM market hours. A part of the database instances of the example schema is also shown in Fig. 2.
Fig. 2. (continued) Part of database instances of the example database
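A temporal assignment, i.e. the value of a time-varying or time-series attribute as a function from its temporal domain to its value domain, can be sketched as follows in Python. The interval representation as (start, end) pairs is our own choice for illustration; the salary values are taken from the example used later in Sect. 3.4.

from datetime import date

# a temporal assignment maps intervals of the temporal domain to values
salary_of_john = {
    (date(1995, 1, 1), date(1996, 6, 30)): 45000,
    (date(1996, 7, 1), date(1997, 12, 31)): 52000,
    (date(1998, 1, 1), date.max): 55000,      # date.max stands for "now"
}

def value_at(temporal_assignment, t):
    # look up the value that holds at time unit t, if any
    for (start, end), value in temporal_assignment.items():
        if start <= t <= end:
            return value
    return None

# an ITDM entity: surrogate, lifespan (set of intervals), and attribute values
employee = {
    "surrogate": 1001,
    "lifespan": [(date(1995, 1, 1), date.max)],
    "EV": {"name": "John", "salary": salary_of_john},
}

print(value_at(employee["EV"]["salary"], date(1997, 5, 1)))   # -> 52000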
3. Path Expressions, Temporal Projection, and Temporal Selection

This section describes three basic components of the temporal query language for ITDM, namely path expressions, temporal projection, and temporal selection.
3.1 Path Expression

Path expressions [9] are used to navigate entity types through relationship types, and to specify join conditions on entity types. A path expression is a rooted tree, such that
− the root is an entity type;
− if a node p has a child node c, then
  − if p is an entity type: c is an attribute or a role name connected to p;
  − if p is an attribute: p is a composite attribute and c is a component attribute of p;
  − if p is a role name: let p be the role name of an entity type E1 that participates in a relationship type R and E2 be another participating entity type, i.e., RE(R) = {(E1, p, c1), (E2, ro2, c2)}; then c is an attribute of R or E2, or a role name connected to E2;
− a role name may have a restricting predicate attached to it.
Figure 3 shows some valid path expressions on the database schema of Fig. 1.
(Figure: six example path expression trees (a)–(f), rooted at CUSTOMER, CITY, and MARKET, with children such as name, market, stocks, issuer, and shares; in (f) the role name stocks carries the restricting predicate [issuer = 'IBM'].)
Fig. 3. Example path expressions
A path expression can alternatively be represented as a text string. Starting from the root node, which is an entity type, we append its child separated by a dot. If the root has two or more children, the children nodes are enclosed in a pair of angled brackets and commas separate them. Then, we recursively apply the same rules to all of its children. The textual representations and the interpretations of the path expressions are given below:
− (a) CUSTOMER: All customers (including all attributes).
− (b) CUSTOMER.name: Names of all customers.
− (c) CITY.market: For each city, list of stock markets (i.e., their surrogate values) located in the city.
− (d) MARKET.<name, stocks>: For each market, the name of the market and all stocks traded in the market.
− (e) CUSTOMER.<name, stocks.<issuer, shares>>: For each customer, the name of the customer, and issuer and share of all stocks the customer owns.
− (f) CUSTOMER.<name, stocks[issuer = 'IBM'].<issuer, shares>>: The same as the path expression (e), but only for IBM stocks. Here, the restricting predicate is attached to the role name stocks to select only IBM stocks.
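The textual representation rules above (dots for single children, angled brackets and commas for multiple children, square brackets for restricting predicates) can be rendered by a small recursive function. The following Python sketch is illustrative only; the tuple encoding of the tree is an assumption made here, not part of the ITDM definition.

def to_text(node):
    # node = (label, restricting_predicate_or_None, list_of_children)
    label, predicate, children = node
    text = label + (f"[{predicate}]" if predicate else "")
    if not children:
        return text
    if len(children) == 1:
        return text + "." + to_text(children[0])
    return text + ".<" + ", ".join(to_text(c) for c in children) + ">"

# path expression (f): CUSTOMER.<name, stocks[issuer = 'IBM'].<issuer, shares>>
pe_f = ("CUSTOMER", None, [
    ("name", None, []),
    ("stocks", "issuer = 'IBM'", [("issuer", None, []), ("shares", None, [])]),
])
print(to_text(pe_f))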
3.2 Nontemporal Queries

A nontemporal query is one that accesses the current database state. The syntax of nontemporal queries is:

GET     p1, p2, …
FROM    E1 e1, E2 e2, …
WHERE   pr
Here, pi is a path expression, Ei is an entity type and ei is a variable that ranges over the entities in EP(Ei), and pr is a predicate. The semantics of a query is as follows. First, form the Cartesian product of all entity types specified in the FROM clause. For each element in the Cartesian product, the predicate specified in the WHERE clause is evaluated. If the predicate evaluates to true, then the element, which is a tuple of entities, is selected. Then, from these entities, only the information specified in the GET clause is displayed. An example nontemporal query is given below.

Query 1: List the names of customers who own all the stocks that are traded in the market located in the same city in which the customer lives.

GET     c.name
FROM    CUSTOMER c, MARKET m
WHERE   (c.city.name = m.city.name) AND ((c.stocks) INCLUDE (m.stocks))
3.3 Temporal Projection

Temporal projection restricts the information to be displayed to a particular time interval(s). The syntax of a temporal projection is p: T, where p is a path expression and T is a temporal element. Assume that a path expression p returns the following: {[1/1/97, 5/31/97] → Robert, [6/1/97, 10/31/97] → Richard, [11/1/97, now] → Robert}. Then, p: [3/1/97, 12/31/97] will return the following: {[3/1/97, 5/31/97] → Robert, [6/1/97, 10/31/97] → Richard, [11/1/97, 12/31/97] → Robert}. The following example illustrates the use of temporal projection in the GET clause.
Query 2: Show, for all the stocks William has owned, the list of the issuers and the corresponding shares during the time period [1/1/97, 12/31/97].

GET     c.stocks.<issuer, shares>: [1/1/97, 12/31/97]
FROM    CUSTOMER c
WHERE   c.name = ‘William’
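The effect of a temporal projection p: T can be sketched as an intersection of a temporal assignment with a temporal element. The following Python fragment is a minimal illustration of that semantics, reusing the date-pair representation of intervals from the earlier sketch; it is not part of the ITDM query processor.

from datetime import date

def project(temporal_assignment, temporal_element):
    # restrict each (interval -> value) pair to its overlap with the temporal element
    result = {}
    for (start, end), value in temporal_assignment.items():
        for (t_start, t_end) in temporal_element:
            lo, hi = max(start, t_start), min(end, t_end)
            if lo <= hi:
                result[(lo, hi)] = value
    return result

mayor = {(date(1997, 1, 1), date(1997, 5, 31)): "Robert",
         (date(1997, 6, 1), date(1997, 10, 31)): "Richard",
         (date(1997, 11, 1), date.max): "Robert"}

print(project(mayor, [(date(1997, 3, 1), date(1997, 12, 31))]))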
3.4 Temporal Selection

A predicate on an entity evaluates to either true or false when the entity assumes a nontemporal value. If, however, an entity assumes a temporal value, the result of applying a predicate to the entity returns a temporal assignment whose codomain is {true, false}. Assume that an attribute salary of an entity John assumes the following value in a temporal database: {[1/1/95, 6/30/96] → 45000, [7/1/96, 12/31/97] → 52000, [1/1/98, now] → 55000}. Then, if we apply the predicate (salary > 50000) to John, the result will be {[1/1/95, 6/30/96] → false, [7/1/96, now] → true}. The application of a predicate pr to a temporal entity e is denoted by pr(e), which is referred to as a temporal predicate. The true time of a temporal predicate pr on e, denoted by [[ pr(e) ]], is a temporal element during which the predicate evaluates to true. So, the true time of the predicate (salary > 50000) applied to John is {[7/1/96, now]}. A temporal selection predicate is a Boolean expression that compares two temporal elements using the set comparison operators {=, ≠, ⊆, ⊇}, where at least one of the operands is the true time of a temporal predicate. An example query, which uses the temporal selection predicate in the WHERE clause, is shown below.

Query 3: Names of customers who owned IBM stock during [1/1/97, 6/30/97].

GET     c.name
FROM    CUSTOMER c
WHERE   [[ c.stocks.issuer = ‘IBM’ ]] ⊇ [1/1/97, 6/30/97]
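A possible reading of temporal predicates and their true times, continuing the toy interval representation used in the earlier sketches, is shown below in Python. The helper names are ours, and the merging of adjacent intervals is simplified to day granularity.

from datetime import date, timedelta

def true_time(temporal_assignment, predicate):
    # apply the predicate to every (interval -> value) pair, keep the intervals
    # on which it evaluates to true, and merge adjacent true intervals
    intervals = sorted(iv for iv, v in temporal_assignment.items() if predicate(v))
    merged = []
    for start, end in intervals:
        if merged and (start - merged[-1][1]) <= timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

salary = {(date(1995, 1, 1), date(1996, 6, 30)): 45000,
          (date(1996, 7, 1), date(1997, 12, 31)): 52000,
          (date(1998, 1, 1), date.max): 55000}

# true time of (salary > 50000): {[7/1/96, now]}, with date.max standing for "now"
print(true_time(salary, lambda v: v > 50000))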
4. Temporal Queries

4.1 Basic Temporal Queries

The syntax of basic temporal queries is shown below. We will extend this syntax later to include aggregate functions and time series operations.

GET     p1:T1, p2:T2, …
FROM    E1 e1, E2 e2, …
WHERE   predicate
Here, Ti is a temporal element for temporal projection and predicate is a Boolean expression on the attributes of entities or relationship instances that may include a temporal selection predicate. Some example temporal queries are given below.

Query 4: Names of customers who owned more than 1000 shares of IBM stock during the whole period of 1997.

GET     c.name
FROM    CUSTOMER c
WHERE   [[ c.stocks[issuer = ‘IBM’].shares > 1000 ]] ⊇ [1/1/97, 12/31/97]
In this query, the restricting predicate [issuer = ‘IBM’] restricts the relationship instances between CUSTOMER and STOCK to only IBM stock. If we issue this query to the example database, the result will be: {Tracy}.

Query 5: Names of customers who owned more than 1000 shares of IBM stock any time during 1997.

GET     c.name
FROM    CUSTOMER c
WHERE   NOT EMPTY([[ c.stocks[issuer = ‘IBM’].shares > 1000 ]] ∩ [1/1/97, 12/31/97])
The result of this query when applied to the example database is: {William, Tracy}.
4.2 Query Language Constructs for Time-Series Attributes

4.2.1 Aggregate Functions and Granularity Conversion

Nontemporal aggregate functions compute the aggregation over a set of data values. Typical aggregate functions are: COUNT, EXISTS, SUM, AVERAGE, MAX, MIN, etc. On the other hand, temporal aggregate functions compute the aggregation over
the time dimension. We use the following temporal aggregate functions: TCOUNT, TEXISTS, TSUM, TMAX, TMIN, etc. The following queries show the use of temporal aggregate functions.

Query 6: Compute the average population of New York city between 1990 and 1997.

GET     TAVERAGE(t.population): [1990, 1997]
FROM    CITY t
WHERE   t.name = ‘New York’
We also define a special type of true time that is applied to aggregate functions. The true time [[ f(A): [tl, tu] ]] returns the time when the attribute A assumes the value specified by the aggregate function f during the time interval [tl, tu]. Here, f is either TMIN or TMAX. The following query illustrates the usage of this type of true time (the GET TIME will be discussed in more detail in Section 4.2.2).

Query 7: When did the daily high price of IBM stock reach its highest price in November 1997?

GET TIME [[ TMAX(s.price.high): [11/1/97, 11/30/97] ]]
FROM    STOCK s
WHERE   s.issuer = ‘IBM’

In applications that include time-series data, sometimes it is necessary to convert the granularity of a time series. A granularity conversion may be into a coarser granularity or into a finer granularity. The conversion to a coarser granularity is specified in a query by attaching to the aggregate function ‘BY target granularity’ as shown in the following example.

Query 8: List the weekly high price of GE stock during 1997.

GET     TMAX(s.price.high) BY WEEK: [1/1/97, 12/31/97]
FROM    STOCK s
WHERE   s.name = ‘GE’
This query converts the granularity of the time series high from Day to Week. In a query that requires the conversion of a granularity to a finer one, we need to specify a particular interpolation function to be used as well as the target granularity as shown in the following example:

Query 9: Show the month-by-month population of the city Boston.

GET     t.population BY MONTH(function): [1/1/97, now]
FROM    CITY t
WHERE   t.name = ‘Boston’

Here, function is an interpolation function provided by a DBMS. It may be a linear function, spline, etc.
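Conversion to a coarser granularity, as in Query 8, essentially groups the observations of the time series by the target granularity and applies the temporal aggregate to each group. The following Python sketch illustrates this idea; grouping by ISO year/week number is our own simplification of the calendar mechanism described in the paper.

from collections import defaultdict
from datetime import date

def to_coarser(time_series, group_key, aggregate):
    # group the (time, value) observations by the target granularity
    groups = defaultdict(list)
    for t, v in time_series:
        groups[group_key(t)].append(v)
    return {g: aggregate(vs) for g, vs in sorted(groups.items())}

daily_high = [(date(1997, 1, 6), 101.0), (date(1997, 1, 7), 103.5),
              (date(1997, 1, 8), 102.0), (date(1997, 1, 13), 99.5),
              (date(1997, 1, 14), 100.25)]

# TMAX(...) BY WEEK, using ISO (year, week) pairs as the coarser time units
weekly_high = to_coarser(daily_high, lambda t: t.isocalendar()[:2], max)
print(weekly_high)   # {(1997, 2): 103.5, (1997, 3): 100.25}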
4.2.2 Time Selection Functions

Sometimes, it is necessary to extract a particular time interval or a time unit from a given temporal element. For this purpose, we define two time selection functions. An interval selection function I_SELECT(i, T) returns the i-th interval from the temporal element T. Here, i is an integer, FIRST, or LAST. When used as the value of i, FIRST and LAST return the first and last interval, respectively, from T. If T is the true time of a temporal predicate on a time-series attribute, it returns the i-th time unit. A time unit selection function T_SELECT(i, I) returns the i-th time unit from the interval I. Again, i may be an integer, FIRST, or LAST. An example query is shown below.

Query 10: List the names of customers who lived in New York during the first tenure of Mayor Robert.

GET     c.name
FROM    CUSTOMER c, CITY t
WHERE   (t.name = ‘New York’) AND ([[ c.city.name = ‘New York’ ]] ⊇ I_SELECT(FIRST, [[ t.mayor = ‘Robert’ ]]))
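On the list-of-intervals representation used in the earlier sketches, the two time selection functions have a very direct reading. The following Python fragment is an illustration only; treating FIRST and LAST as the indices 0 and -1, and assuming day granularity, are our own simplifications.

from datetime import date, timedelta

FIRST, LAST = 0, -1

def i_select(i, temporal_element):
    # I_SELECT(i, T): the i-th interval of the temporal element T
    return temporal_element[i]

def t_select(i, interval):
    # T_SELECT(i, I): the i-th time unit of the interval I (day granularity assumed)
    start, end = interval
    if i == LAST:
        return end
    return start + timedelta(days=i)

tenures_of_robert = [(date(1994, 1, 1), date(1997, 12, 31)),
                     (date(2002, 1, 1), date(2005, 12, 31))]
print(t_select(FIRST, i_select(FIRST, tenures_of_robert)))   # -> 1994-01-01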
We can also use the time selection function in the GET clause to extract a particular time. In this case we use GET TIME instead of GET. An example is shown below.

Query 11: When did Robert become a mayor of New York for the first time?

GET TIME T_SELECT(FIRST, I_SELECT(FIRST, [[ t.mayor = ‘Robert’ ]]))
FROM    CITY t
WHERE   t.name = ‘New York’

4.2.3 Representation of Temporal Windows and Moving Window

A temporal window is specified using one of the following representations: [t1, t2], [t, % i %], [t, % iG %]. Here, t is a time unit, i is an integer, and G is a granularity. The first component of a temporal window is called the window reference, and the second component is called the window end. The interpretation and examples of the different types of temporal windows are given below.
− Type 1 ([t1, t2]): Specifies all data values between t1 and t2, including the values at both ends if they exist. Example: [1/1/97, 12/31/97].
− Type 2 ([t, % i %]): Specifies i consecutive data values starting from t, including the value at time t if it exists. If no data value exists at t when applied to a time-series attribute, then it starts with the next data value in the time series. Example: [5/1/98, %10%].
− Type 3 ([t, % iG %]): Specifies all data values between t and t + iG, including the values at both ends if they exist. Example: [3/1/98, %14Day%], which is equivalent to [3/1/98, 3/15/98]. If no data value exists at t when applied to a time-series attribute, then it starts with the next data value in the time series. If no data value exists for the window end, the data value that exists in the time series immediately before the window end is used.
Type 1 temporal windows were used in the previous query examples. The following example shows the use of a Type 2 temporal window.

Query 12: Show the 10 consecutive daily high prices of SEARS stock starting from 3/1/1998.

GET     s.price.high: [3/1/98, %10%]
FROM    STOCK s
WHERE   s.issuer = ‘SEARS’
A moving window is used to specify a series of temporal windows, each of which provides a time interval for an aggregate function. A moving window is specified by attaching two time durations to a temporal window of Type 1. A time duration is represented by: % i %, % iG %, or % name of a calendar %. The first two durations have the same meaning as they do with the temporal windows above. The third duration specifies the period of a periodic calendar. An example moving window is: [1/1/97, 12/31/97] FOR %10% INCREMENT %3Day%. Here, the keyword FOR specifies the size of the window and the keyword INCREMENT specifies the increment by which the window moves. The example specifies a moving window, where a window has 10 consecutive data values and moves with an increment of 3 days in the time interval [1/1/97, 12/31/97]. The following is an example:

Query 13: Show the 10-day moving average of the daily high price of SEARS stock with an increment of 5 days during 1997.

GET     TAVERAGE(s.price.high): [1/1/97, 12/31/97] FOR %10Day% INCREMENT %5Day%
FROM    STOCK s
WHERE   s.issuer = ‘SEARS’
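The moving-window construct of Query 13 corresponds to sliding a fixed-size window over the observations of a time series and applying the temporal aggregate inside each window position. A minimal Python sketch follows; it assumes day granularity and synthetic price data, and is not part of the ITDM implementation.

from datetime import date, timedelta

def moving_aggregate(time_series, start, end, size, increment, aggregate):
    # slide a window of length `size` over [start, end] with step `increment`
    # and apply the aggregate to the observations falling into each window
    results = []
    w_start = start
    while w_start <= end:
        w_end = min(w_start + size, end)
        values = [v for t, v in time_series if w_start <= t <= w_end]
        if values:
            results.append(((w_start, w_end), aggregate(values)))
        w_start += increment
    return results

daily_high = [(date(1997, 1, 1) + timedelta(days=i), 100.0 + i) for i in range(30)]
average = lambda vs: sum(vs) / len(vs)

# 10-day moving average with a 5-day increment, cf. Query 13
for window, value in moving_aggregate(daily_high, date(1997, 1, 1), date(1997, 1, 31),
                                      timedelta(days=10), timedelta(days=5), average):
    print(window, round(value, 2))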
4.2.4 Using Relative Time for Data Selection

Consider the following query. “What was the daily high price of IBM stock 5 days after it reached the highest price in November 1997?” Here, we want to know the value of a time-series attribute at a time that is specified as a relative time. We define two more temporal window types to specify relative times as follows:
− Type 4 ([t; % i %]): Specifies the i-th data value from t. All other semantics is the same as that of Type 2.
− Type 5 ([t; % iG %]): Specifies the data value at time t + iG. All other semantics is the same as that of Type 3.
Then, the above query can be written using a Type 5 temporal window as follows:

Query 14:
GET     s.price.high: [ [[ TMAX(s.price.high): [11/1/97, 11/30/97] ]] ; %5Day% ]
FROM    STOCK s
WHERE   s.issuer = ‘IBM’
5. Related Work

Very little study has been reported in the literature regarding query languages on time-series data. CALANDA [4,5,6] was implemented based on an object-oriented model. So, data retrieval is performed through method invocation. Common operations on time series are predefined as methods of the time series root class. Timestamp selection and granularity conversion are examples of common operations. Operations for a particular class of time series are defined in that class definition. In [17], a time series is modeled as a regular time sequence. This model defines high-level operations to manipulate time sequences. Basic retrieval operations include select, aggregate, accumulate, etc. Informix provides the TimeSeries DataBlade module, which provides support for time series and calendars, and offers over 40 predefined functions to manage them. The module is a user-installable extension of the Illustra server, which is an object-relational DBMS. The query language is an extension of SQL, but the extensions are object-based.
6. Conclusion

A time series is a special type of time-varying object. The change of the value of a time series is tightly associated with a predefined pattern of time called a calendar. Time-series objects also require different types of operations on them. Various data models have been proposed for temporal databases and time series management. However, the integration of time-varying (or version-based) objects and time-series objects has rarely been studied. We formalized a conceptual model based on EER, called ITDM (Integrated Temporal Data Model), that incorporates all different types of objects [14]. We then presented a query language for ITDM. We showed that time-series querying constructs and version-based querying constructs could be integrated within the same query language constructs.
References 1. R. Chandra and A. Segev, “Managing Temporal Financial Data in and Extensible Database,” Proc. 19th Int’l Conf. on VLDB, 1993, pp. 302-313. 2. R. Chandra, A. Segev, and M. Stonebraker, “Implementing Calendars and Temporal Rules in Next Generation Databases,” Proc. 3rd Int’l Conf. on Data Engineering, 1994, pp. 264-273. 3. U. Dayal and G. Wuu, “A uniform Approach to Processing Temporal Queries,” Proc. 18th VLDB Conf., 1992, pp. 407-418. 4. W. Dreyer, A.K. Dittrich, and D. Schmidt, “An Object-Oriented Data Model for a Time Series Management System,” Proc. 7th Int’l Working Conf. on Scientific and Statistical Database Management, 1994, pp. 186-195. 5. W. Dreyer, A.K. Dittrich, and D. Schmidt, “Research Perspectives for Time Series Management Systems,” ACM SIGMOD Record, Vol. 23, No. 1, 1994, pp. 10-15. 6. W. Dreyer, A.K. Dittrich, and D. Schmidt, “Using the CALANDA Time Series Management Systems,” Proc. ACM SIGMOD Int’l Conf., 1995, pp. 489-499. 7. R. Elmasri and J.Y. Lee, “Implementation Options for Time-Series Data,” Temporal Databases: Research and Practice, O. Etzion et. al. (Eds), LNCS No. 1399, 1998, pp. 115127. 8. R. Elmasri and S. Navathe, "Fundamentals of Database Systems," 2nd Edition, Benjamin/Cummings, 1994. 9. R. Elmasri and J. Wiederhold, “GORDAS: A Formal High-Level Query Language for the ER Model,” Proc. 2nd Entity-Relationship Conference, 1981, pp. 49-72. 10. R. Elmasri and G. Wuu, “A Temporal Model and Query Language for ER Database," Proc. 6th Int’l Conf. on Data Engineering, 1990, pp. 76-83. 11. S. Gadia and C. Yeung, “A Generalized Model for a Relational Temporal Database," Proc. ACM SIGMOD Conf., 1988, pp. 251-259. 12. A. Kurt and M. Ozsoyoglu, “Modelling Periodic Time and Calendars,” Proc. Int'l Conf. on Application of Databases, 1995, pp. 221-234. 13. J.Y. Lee, “Database Modeling and Implementation Techniques for Time-Series Data,” Ph.D. Dissertation, Computer Science and Engineering Department, University of Texas at Arlington, May 1998. 14. J.Y. Lee, R. Elmasri, and J. Won, “An Integrated Temporal Data Model Incorporating Time Series Concept,” Data and Knowledge Engineering, Vol. 24, No. 3, 1998, pp. 257276. 15. E. Rose and A. Segev, “TOODM – A Temporal Object-Oriented Data Model with Temporal Constraints,” Proc. 10th Int’l Conf. on the Entity-Relationship approach, 1991. 16. D. Schmidt, A.K. Dittrich, W. Dreyer, and R. Marti, “Time Series, a Neglected Issue in Temporal Database Research?” Proc. Int’l Workshop on Temporal Databases, 1995, pp. 214-232. 17. A. Segev and A. Shoshani, “Logical Modeling of Temporal Data,” Proc. ACM SIGMOD Int’l Conf., 1987, pp. 454-466. 18. A.U. Tansel, “Temporal Relational Data Model,” IEEE Tans. on Knowledge and Data Engineering, Vol. 9, No. 3, 1997, pp. 464-479. 19. G. Wuu and U. Dayal, “A Uniform Model for Temporal Object-Oriented Databases,” Proc. 8th Int’l Conf. on Data Engineering, 1992, pp. 584-593.
Chrono: A Conceptual Design Framework for Temporal Entities*

Sonia Bergamaschi 1,3 and Claudio Sartori 2,3
1 DSI - University of Modena, Italy, [email protected]
2 DEIS - University of Bologna, Italy, [email protected]
3 CSITE - CNR, Bologna, Italy, Viale Risorgimento, 2 - 40136 Bologna, Italy
* With the contribution of Gruppo Formula S.p.A., Bologna, Italy, http://www.formula.it
Abstract. Database applications are frequently faced with the necessity of representing time-varying information and, particularly in the management of information systems, a few kinds of behavior in time can characterize a wide class of applications. A great amount of work in the area of temporal databases, aiming at the definition of a standard representation and manipulation of time, mainly in the relational database environment, has been presented in recent years. Nevertheless, conceptual design of databases with temporal aspects has not yet received sufficient attention. The purpose of this paper is twofold: to propose a simple temporal treatment of information at the initial conceptual phase of database design; and to show how the chosen temporal treatment can be exploited in time integrity enforcement by using standard DBMS tools, such as referential integrity and triggers. Furthermore, we present a design tool implementing our data model and constraint generation technique, obtained by extending a commercial design tool.
1 Introduction and Motivations
Database applications are frequently faced with the necessity of representing time-varying information and, particularly in management information systems, a few kinds of behavior in time can characterize a wide class of applications. In recent years we have observed a great amount of work in the area of temporal databases, aiming at the definition of a standard representation and manipulation of time, mainly with reference to the relational model. Nevertheless, in our opinion, the design of databases with temporal aspects has not yet received sufficient attention: the decisions about the temporal treatment of information are a matter of the conceptual database design phase and, starting from there, a set of related design choices is strictly consequent. For instance, if we consider two related pieces of information and we decide that temporal treatment is needed only for the first one, can we assume that the decision of temporal treatment for
the second one is completely independent or are there necessary clear constraints to preserve information integrity? In order to obtain an answer to the above question, we have to start facing the problem of supporting referential integrity constraints in this more complex scenario. In fact, when we add temporal treatment to a piece of information we limit the validity of this information in time and a reference to it can be made only during this validity interval; therefore, the kind of allowed/required temporal treatment for related information is strictly inter-dependent. As a consequence, in order to support temporal treatment we must extend the referential integrity concept, which plays a central role in databases: a reference is valid if the referenced object is valid through all the valid time of the referencing object. Another assumption of our work is related to the general framework of database design: when we are faced with the design of non-trivial applications we cannot avoid the usage of methodologies and tools to support some design phases [1]. The Entity-Relationship (ER) model and its extensions have proved to be useful in the conceptual design phase, and many design tools allow the user to draw conceptual schemata and to automatically generate relational database schemata. In addition, some design tools are also able to generate code for constraint enforcement, such as referential integrity and triggers for different relational DBMSs. For this reason, if we design temporal treatment directly at the conceptual level and extend a design tool in this direction we obtain two major advantages:
– temporal treatment is documented at a high level as a first-class feature and it is dealt with in a standard fashion,
– the integrity constraints deriving from temporal treatment can be automatically translated into constraint enforcement code at the DBMS level.
The first choice we have to make is the selection of the conceptual model to be extended. In order to obtain a general approach and to easily come to the implementation of temporal treatment on top of an existing design tool, we refer to an industry-standard conceptual model, the IDEF1X model [23]. IDEF1X is an accepted standard for the USA government and is adopted by some conceptual database design tools [18,17]. The second choice is which kind of time is to be supported. The past decade of research on temporal databases led to the presentation of a high number of temporal data models, and the book [25] gives a comprehensive account and systematization of this activity. In particular, many extensions of the relational model have been proposed to represent the time dimension with different meanings and complexity. At present, there exists a general consensus on the bi-temporal models, such as BCDM [14], where two orthogonal temporal dimensions are considered: valid time (i.e. the time when the fact is true in the modelled reality) and transaction time (i.e. the time when the fact is current in the database and may be retrieved). According to the assumption that supporting referential integrity is a major issue, it is mandatory for us to support at least the valid time dimension.
In this work we restrict ourselves to considering only the valid time, since the transaction time dimension does not affect the enforcement of referential integrity. In other words, if an application requires both the temporal dimensions, it is possible to perform the conceptual design with respect to the valid time dimension, and then add independently the transaction time representation. The third choice refers to the granularity of detail to which conceptual elements have temporal treatment, ranging from a single attribute value to an entire entity instance. Our choice is in favor of the entity level granularity: all the attributes of an entity have the same temporal treatment, and this applies to all its instances. This apparently coarse granularity is well suited for most practical applications, and does not constitute a severe limitation, since a different temporal treatment for two subsets of attributes of the same entity can be easily modeled with vertical partitioning. The fourth choice is the type of time modeling suitable for a database application. The most intuitive notions are those of event, which happens at a time point, and state, which holds during a time interval, described by its starting and ending time points. The research community has reached a quite wide consensus on the notion of temporal element, which is a finite union of n-dimensional time intervals [9] (note that the simple notion of time interval would not guarantee the closure property with respect to the usual operations on time intervals, such as union, intersection, etc.). Special cases of temporal elements include valid-time elements, transaction-time elements and bi-temporal elements. As explained above, we consider only one-dimension temporal elements, i.e. valid-time elements. The straightforward modeling of such a temporal element as a non-normalized entity attribute would lead to an inefficient implementation in a relational environment. Being aimed at producing a practical design tool, our choice is to constrain a given entity to have normalized time attributes, i.e. only one of the following kinds of temporal elements: single chronon, finite set of chronons, single interval, finite set of intervals. This means that the designer has to decide in advance which type of temporal treatment is best suited for the modeling of a given entity. On the basis of the above choices on the temporal treatment of information, we define an extension of an industry-standard conceptual model, IDEF1X, and develop the necessary set of integrity constraints at the conceptual schema level to preserve information consistency. The result of these extensions is a uniform way to deal with time at the conceptual schema level. Since some design tools provide the automatic mapping of a conceptual schema into a relational one and generate code for constraint enforcement, an effective architectural choice is the extension of a tool like this at the conceptual level. The logical schema and the integrity constraints and triggers to ensure a correct evolution of the database contents with respect to time attributes will thus be automatically generated at the database level. In order to prove the feasibility and the usefulness of our approach, we developed a software layer, called Chrono, on top of the database design CASE tool ErWin [18]. With Chrono, the conceptual design activity is modified as follows:
1. design the conceptual schema abstracting from the temporal aspects
2. select the appropriate temporal treatment for the entities
3. Chrono automatically converts the schema obtained into a standard IDEF1X schema, adding the temporal attributes and the necessary set of integrity constraints.
The paper is organized as follows: Section 2 introduces the Chrono conceptual data model as an extension of IDEF1X with temporal modeling features. Section 3 discusses the design constraints generated by the dependencies between related temporal entities. Section 4 examines the integrity constraints that rule temporal entities. Section 5 shows the architecture of the design tool based on Chrono. Finally, Sect. 6 discusses some related work.
2 The Chrono Conceptual Data Model
Let us briefly recall the modeling principles of the IDEF1X model, partly drawn from the official F.I.P.S. documents [23]. The IDEF1X model is derived from the Entity-Relationship (E/R) model [3] and its well-known extensions [1]. The main difference with respect to E/R is the proposal of a conceptual model “closer” to the logical relational data view. The main extension is the distinction between independent and dependent entities, the latter being identified with the contribution of another entity via an identifying relationship. The relationships are either connection or categorization relationships. Connection relationships (also referred to as parent-child relationships) are the standard way to express semantic relationships between entities and can be identifying, non-identifying without nulls and non-identifying with nulls. The cardinality is one to one or one to many. The model also allows the non-specific relationship, corresponding to the many to many relationship, but its usage is intended only for the initial development of the schema, to be refined in later development phases and substituted by entities and connection relationships, as explained at the end of this section. A categorization relationship represents a generalization hierarchy and is a relationship between one entity, referred to as the generic entity, and another entity, referred to as a category entity (i.e. the specialization). A category cluster is a set of one or more categorization relationships and represents an exclusive hierarchy. The Chrono conceptual data model is a temporal extension of IDEF1X. In analogy with the Entity–Relationship model, it assumes that each entity instance must be uniquely identified: by way of some internal attributes if the entity is independent, and by other connected entities if the entity is dependent. We will consider the identifier of an entity as time-invariant, while the other attributes can be time-variant. Entities are either absolute, if they do not require temporal treatment, or temporal, if they are subject to time support, with explicit representation of their time attributes. Absolute entities will eventually be subject
to insertions, deletions and updates, as is usual in a database environment. In contrast, insertions, deletions and updates of temporal entities will be subject to particular rules and restrictions, as will be shown in the following. According to [12], we assume that all the non-key attributes of an entity have the same kind of behavior in time: in this way, temporal treatment can be done at the entity level and not at the attribute level. This is not a real limitation, since if the assumption does not hold for an entity E, attributes can be clustered according to a uniform behavior in time and the entity can be vertically partitioned into entities, say E1 and E2, linked by a one-to-one identifying relationship. In a way similar to the extension of the relational model with time proposed in [22], we translate the temporal treatment proposed for relations to entities: a temporal entity instance e of the entity type E is associated with a temporal element T(e), giving the lifespan of e. In addition, to simplify the integrity constraint enforcement, we accept the notion of temporal normal form proposed in [13] and consider only conceptual schemata in third normal form, by an easy extension of the well-known notion of relational normal forms to the conceptual level, as suggested in [19]. Instead of allowing temporal elements composed of any possible mix of chronons and intervals, we consider four kinds of temporal elements: single chronon, finite set of chronons, single interval, finite set of intervals. For a given temporal entity type only one kind of temporal element is allowed. This led us to extend the IDEF1X model with five kinds of temporal entities representing either events, when the allowed temporal element is a chronon or a set of chronons, or states, when the allowed temporal element is an interval or a set of intervals. An entity instance can have a single lifespan or a history when its temporal element is a set. The semantics of states is that the state is true during its interval; therefore, when state history is represented, the database must satisfy the constraint that the intervals of states of the same entity with different attribute values cannot overlap. To conclude, an additional constraint is available to the designer: if the instance of an entity must always exist inside a given interval, but its state is allowed to change and the history is relevant, then the intervals of the various states of the same entity must always be contiguous. On the basis of the above discussion, we can say that the possible types of entities with temporal treatment are the following five: SP, CMP, MP, E, EV. In the following, the five Chrono types are explained together with their mapping into IDEF1X entities, as shown in Table 1. The mapping consists in the addition of attributes for the representation of time, possibly extending the identifier. Section 4 will discuss the integrity constraints added by this mapping.
SP single period: the entity represents a single state and has an associated time interval; the time interval is represented as a couple of chronons, say Ts (Tstart) and Te (Tend).
CMP consecutive multi-period: the entity is continuously valid during a period, but some of its aspects change over time, in connection with specific
time points; therefore, the evolution of the entity can be seen as a succession of states valid during consecutive periods; its temporal element is a set of contiguous time intervals, and a single, absolute, entity instance generates many entity state instances; in order to represent the entity in terms of attributes and keys as required by the underlying conceptual model, we change the entity identifier, say I, to include an extreme of the time interval, say Ts; in this way, the instances of the temporal entity are different versions of the instances of the original absolute entity.
MP non-consecutive multi-period: the entity is valid during a set of periods, without any constraint of being consecutive; the representation is the same as for the consecutive multi-period type.
E event: the entity represents an event which took place at a specific time point; its time element is a single time point and can be represented by a single attribute Ts.
EV event with versions: the entity represents an event which resulted in a tuple of values of its attributes; the history of the changes to the attribute values is maintained, each change being identified by its specific time point; the time element is a set of chronons and the representation is obtained, as for the consecutive multi-period case, by including the time attribute Ts in the identifier.
Let us consider a set of short examples about the classification above. A human being, from the registry office point of view, is “valid” from his birth, and can be classified as type SP, while a living person, which is a specialization of human being, ends his validity with his death and is of type SP too. A company’s employee changes his description over time, including salary, duties, address, and so on. Each change starts at a specific time and holds up to the following change; therefore the employee description requires a time representation of type CMP. A patient can be hospitalized many times, each time with a specific starting and ending time point, and can be represented with a type MP entity. The documents for the management of an organization are usually marked with a time (for instance, when an officer wrote the document). Provided that no historical record of the document changes is needed, a document can be represented as a type E entity. On the other hand, if the application needs to record the different versions of such documents, the type EV temporal treatment can be used. Up to now we have considered an entity in isolation, but in practical cases we always have many entities linked with various semantic relationships, such as aggregation and generalization hierarchies. In this case the following questions arise: is the temporal treatment of an entity independent from that of the entities related to it? And more, which integrity constraints, if any, govern the time values of a single entity and of related entities? In order to give an answer to the above questions, let us examine the relationships expressed in IDEF1X.
Table 1. Mapping from the Chrono concepts to the IDEF1X concepts

Temporal treatment of entity E | Chrono representation | IDEF1X representation
Single Period (SP) | entity E tagged SP, with identifier Id | identifier Id; time attributes Ts, Te
Consecutive Multi-Period (CMP) | entity E tagged CMP, with identifier Id | identifier (Id, Ts); time attribute Te
Multi-Period (MP) | entity E tagged MP, with identifier Id | identifier (Id, Ts); time attribute Te
Event (E) | entity E tagged E, with identifier Id | identifier Id; time attribute Ts
Event with Versions (EV) | entity E tagged EV, with identifier Id | identifier (Id, Ts)
Same Validity Period identifying relationship (SVP) | identifying relationship marked "=" | plain identifying relationship; the same-validity constraint is enforced at the extensional level
In categorization and identifying relationships a child makes a mandatory reference to the parent, and therefore when it is valid the parent must be valid as well. In summary, the validity of both categorization and identifying relationships is the same as the validity of the child, and therefore they cannot have any temporal treatment of their own. On the other hand, in principle it is acceptable that a parent instance has a validity period wider than that of its child instances. Therefore, the choice of the constraints to be enforced depends also on whether the validity period of the child instance can be contained in that of its parent instance or it must be forced to be the same as that of its corresponding parent instance. We consider as the default case the less constrained one, that is when the validity period can be contained, and introduce as a new design element the SVP identifying relationship (same validity period), as shown in the last row of Table 1. In the IDEF1X mapping the constraint is simply removed at the graphical level, being translated into a constraint at the extensional level. The only constraint on a general non-identifying relationship is that the child instance must be valid inside the validity of its parent instance.
[Fig. 1. Example of relationship with temporal treatment: (a) IDEF1X representation, Employee (Emp#) related to Room (Room#) by the relationship occupancy; (b) Chrono representation, with the reified entity Occupancy (Emp#, Room#) of type MP, related to Employee by an identifying relationship and to Room by a non-identifying relationship.]
Consider, for instance, the cyclic relationship "parent-offspring". It can be represented as a non-identifying relationship between human beings, and the constraint that the parent's birth date is lower than the offspring's birth date must be enforced, while the end of the validity period is infinite and does not give rise to any problem. Non-identifying relationships, either with or without nulls, can hold during an arbitrary interval, contained in the intersection of the validity intervals of the related entities. Thus the relationship can have a kind of temporal treatment of its own. On the other hand, when complex aspects of a relationship need to be described, such as generalization between relationships, the standard conceptual design procedure is reification. A reified relationship is promoted to an entity which is the child in a pair of relationships with the two original entities2. Let us consider, for example, how employees are associated with their office rooms. In a snapshot view, the employee is assigned to exactly one room, as shown in the schema of Fig. 1.a. If we want to model the fact that an employee can change his room over time, and we want to keep track of this history, we have to modify the schema as follows:
1. reify the relationship occupancy, producing the dependent entity Occupancy;
2. select for Occupancy one of the Chrono temporal entities, say MP, to specify that an employee in a given time period is assigned to at most one room;
3. specify the proper cardinality for the relationships of Occupancy: the relationship with Employee derives from the child side and is identifying, while the relationship with Room is non-identifying.
The modified schema is shown in Fig. 1.b. For the intensional level we will define the allowed temporal treatment combinations for related entities, and for the extensional level we will state the constraints that rule the insert, update and delete operations on entities. These constraints can be translated into constraints on relational database operations by many design tools.
2
This choice could be considered as pertaining to the logical level, rather than the conceptual level, but it is coherent with the philosophy of IDEF1X, which is a compromise between a conceptual and a logical data model.
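As an illustration of the reification step just described, the following sketch (not taken from the paper; Python is used only for exposition, and everything beyond the Employee/Occupancy/Room names and the key (Emp#, Ts) is an assumption) shows the shape of the reified Occupancy entity once the Chrono MP treatment has extended its identifier with the starting time:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Employee:
    emp_no: str                      # identifier Emp#

@dataclass(frozen=True)
class Room:
    room_no: str                     # identifier Room#

@dataclass(frozen=True)
class Occupancy:
    """Reified 'occupancy' relationship, given the Chrono type MP.

    The IDEF1X mapping extends the identifier with Ts, so the key is
    (emp_no, ts); te closes the validity interval."""
    emp_no: str                      # from the identifying relationship with Employee
    ts: date                         # Tstart, added to the key by the MP mapping
    te: date                         # Tend
    room_no: str                     # from the non-identifying relationship with Room

# One employee occupying different rooms over non-contiguous periods:
history = [
    Occupancy("E1", date(1996, 1, 1), date(1996, 6, 30), "R12"),
    Occupancy("E1", date(1997, 3, 1), date(1997, 12, 31), "R15"),
]
```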
3 Design Constraints on Temporal Entities
When multiple versions of an entity instance are allowed (i.e. for Chrono entities of types CMP, MP and EV) the mapping from a Chrono entity into an IDEF1X entity gives rise to an extension of the entity identifier Id with one of the time attributes (say the starting time), thus obtaining the key K=(Id,Ts). In this way the entity instance uniqueness is preserved by the uniqueness of the identifier, which is supposed to be time invariant, and of the times at which the versions start: it is not allowed to have different versions starting from the same time point. In the following we will refer both to the Chrono representation of entities and to their corresponding IDEF1X representation. In particular, we define a version of an entity instance (or briefly, a version) as an instance of an IDEF1X entity: an instance of an entity of type CMP, MP or EV may correspond to several versions, sharing the same Id but with different time elements. In a standard snapshot database, a foreign key constraint has a straightforward implication: an instance of a child entity must have a valid reference to an instance of a parent entity (or a null reference if it is allowed). In a temporal database the validity of a reference must also take time into account. Therefore it is necessary to extend the notion of referential integrity, by ensuring that the validity times of two instances of related entities overlap. Otherwise, it could be the case that, at some time point, a child instance refers to a parent instance which is not valid at that time point. A relationship between two entities implies two major consequences:
– at the intensional level, the allowed temporal treatments of the entities are constrained: the constraints take into account the relationship type and the compatibility between the different temporal treatments;
– at the extensional level, the temporal interaction between two instances of connected entities is subject to additional integrity constraints, which have to be enforced.
The following subsections examine the constraints to be enforced for each kind of IDEF1X relationship; the constraints to be enforced when insert, update and delete operations are performed are a consequence of the ones of this section and will be examined in Sect. 4.
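A minimal sketch of these temporal referential-integrity tests, assuming closed validity intervals represented as (start, end) pairs (the code and names are illustrative, not part of Chrono):

```python
def overlaps(a, b):
    """True if two closed intervals share at least one time point,
    the weakest condition for a temporally valid reference."""
    (s1, e1), (s2, e2) = a, b
    return s1 <= e2 and s2 <= e1

def contained(child, parent):
    """Stricter test used for identifying relationships (Sect. 3.2):
    the child's validity must lie entirely inside the parent's."""
    (cs, ce), (ps, pe) = child, parent
    return ps <= cs and ce <= pe

assert overlaps((5, 10), (8, 20))
assert contained((6, 9), (5, 10)) and not contained((6, 12), (5, 10))
```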
3.1 Categorization Relationship
A complete category cluster specifies that each instance of the parent is related to exactly one instance of one of the children in the cluster, and vice-versa. When time is considered, this constraint must hold for any snapshot. Let τ ∈ {SP, CMP, MP, E, EV} be a type of temporal treatment, Ep the parent entity and Eci the i-th child entity. At the intensional level, the following constraints hold:
1. if Ep is of type τ, then all of its children Eci must be of the same type τ;
Table 2. Identifying parent-child relationship: allowed combinations (rows: parent type P; columns: child type C)

Case a
P \ C   SP    CMP   MP    E     EV
SP      yes   yes   yes   no    no
CMP     yes   yes   yes   no    no
MP      yes   yes   yes   no    no
E       no    no    no    yes   no
EV      no    no    no    yes   yes

Case b
P \ C   SP    CMP   MP    E     EV
SP      yes   yes   no    no    no
CMP     yes   yes   no    no    no
MP      yes   yes   yes   no    no
E       no    no    no    yes   no
EV      no    no    no    no    yes
2. if Ep is absolute, then at least one of its children must be absolute, since otherwise there could exist a snapshot and an instance ep ∈ Ep such that there does not exist a valid instance eci ∈ Eci for any i3.
When the category cluster is incomplete, constraint 2 above does not hold.
3.2 Identifying Relationship
Each child instance eci ∈ Eci is completely dependent on a parent instance ep ∈ Ep. Thus, the validity period of an instance eci must be contained in the validity period of its related parent ep. Otherwise, there would exist at least one point in time at which the child instance violates referential integrity. On the other hand, in principle it is acceptable that a parent instance has a validity period wider than that of its child instances. Therefore, the choice of the constraints to be enforced depends also on an application-dependent requirement:
a) the validity period of the child instance can be contained in that of its parent instance;
b) the validity period of the child instance must be forced to be the same as that of its corresponding parent instance.
Case a - validity period of child instance included in validity period of parent instance. In this case, the child can have a temporal treatment even if the parent is an absolute entity. On the contrary, if the parent has a temporal treatment then the child too must have one (for a less restrictive choice see footnote 3). Table 2, case a, shows the allowed combinations; let us briefly comment on some of them. When the parent entity is of type SP, a single parent instance can be connected to many different versions of child instances, and since the child validity can be included in the parent validity, every kind of temporal treatment is allowed for children. Vice-versa, when the parent is of type E, there is no room for different versions of a child or for a child validity spanning over an interval.
3
A less restrictive choice could be to move this constraint to the extensional level, ensuring that the lifespan of a parent instance is covered by the union of the lifespans of all its children.
Case b - same validity period. In this case it is not possible that one entity is absolute and the other has a temporal treatment. Nevertheless, the temporal treatments of parent and child entities are not constrained to be of the same type, even though the combination of types is not arbitrary. Table 2, case b, shows the allowed combinations; let us briefly comment on some of them. When the parent is of type SP, its child can be either of type SP or CMP. In fact, a single parent instance ep with validity [t1, t3] can correspond to two versions of child instances, say ec[t1,t2] and ec[t2,t3], which together cover the parent validity interval. When the parent is of type CMP, its child can be either of type SP or CMP. The second case is straightforward, while the first one is analogous to that of the previous paragraph, by exchanging parent and child. When the parent is an event-with-versions entity, a parent is represented by many versions with different validity intervals and can be related to one or more child instances. Therefore, the child must be of type E or EV, since the "period" types could not ensure the coincidence of validity intervals.
3.3 Non-identifying Relationship without Nulls
The only difference between this case and the identifying relationship is that the child key is now independent of the parent key. Apart from that, child instances must be related to parent instances; therefore their validity must lie inside the parent validity, and the same constraints as in Sect. 3.2 apply.
3.4 Non-identifying Relationship with Nulls
In this case, child instances can have a null reference to the parent. Vice-versa, when the reference is not null, we require it to be valid, i.e. the child validity is included in the parent validity4. The allowed combinations of temporal treatment are the same as those discussed for case a.
4 Constraints on Temporal Entity Instances
This section examines the intra-entity and the inter-entity constraints which must be enforced to guarantee the consistent evolution of a database deriving from a Chrono conceptual schema. In the following we refer interchangeably to a Chrono conceptual schema and to its mapped IDEF1X representation. The constraints considered include both rules on the values of the time attributes and prerequisite and integrity-maintenance actions for the insert, delete and update operations. As a necessary premise, we must consider the influence of time-related operations. We will consider separately the constraints related to single entities and those deriving from referential integrity. In particular, let us consider the following two basic time-related operations [24]:
4
Because of the weaker parent-child relationship it is not worth considering the equality case as in Sect. 3.2 case b.
coalesce: given n value-equivalent instances of the same entity with adjacent time intervals, generate a single entity instance with the maximal time interval;
split: given an entity instance and a time point inside its time interval, generate a pair of value-equivalent entity instances with adjacent time intervals.
We assume that the above operations are allowed and that the database is always kept in a coalesced state (i.e. no further coalescing is possible). Therefore insertions and updates on the database will trigger correction actions when necessary.
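The two operations can be sketched as follows (an illustrative fragment, not part of Chrono; it assumes integer chronons and closed intervals, with adjacency meaning that one interval ends exactly one chronon before the next starts, and it leaves value-equivalence checking to the caller):

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # (Ts, Te), closed, over integer chronons

def coalesce(versions: List[Interval]) -> List[Interval]:
    """Merge value-equivalent versions whose intervals are adjacent
    (or overlapping) into maximal intervals."""
    out: List[Interval] = []
    for ts, te in sorted(versions):
        if out and ts <= out[-1][1] + 1:          # adjacent or overlapping
            out[-1] = (out[-1][0], max(out[-1][1], te))
        else:
            out.append((ts, te))
    return out

def split(version: Interval, t: int) -> Tuple[Interval, Interval]:
    """Split one version at an internal time point t into two
    value-equivalent versions with adjacent intervals."""
    ts, te = version
    if not (ts < t <= te):
        raise ValueError("split point must fall inside the interval")
    return (ts, t - 1), (t, te)

print(coalesce([(1, 3), (4, 9)]))   # -> [(1, 9)]
print(split((1, 9), 5))             # -> ((1, 4), (5, 9))
```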
4.1 Single Period
In this case, we have only the obvious intra-instance constraint that Ts ≤ Te must always hold, i.e. the time interval is non-empty. The constraint is to be enforced when attempting an insert or update operation.
4.2 Consecutive Multi-Period
The constraints which rule instances of this kind of temporal entity depend on the kind of operations that are accepted at the application level. In particular, we consider the impact of the availability of the operations of coalescing and splitting (or split). If these operations are allowed, then the following constraints are enforced:
Insert: if the time interval of the new entity instance is non-empty and there does not exist an instance with the same value of Id, the insert is accepted; otherwise only the following cases are acceptable:
1. if the new instance starting time Ts meets the ending time of the most recent version of the same instance, the insert represents a new version of an existing instance and is accepted5;
2. if the new instance ending time meets the starting time of the oldest version of the same instance, the insert constitutes an extension of the validity into the past and is accepted;
3. if the new instance ending time Te meets the ending time of an existing version, but the new starting time is greater than the old one, the insert corresponds to a splitting operation and is accepted; as a consequence, the starting and ending times of the new and old versions are updated, to preserve adjacency.
Update: if the time interval of the new entity instance is non-empty and there does not exist an instance with the same value of Id, the update is accepted; otherwise only the following cases are acceptable:
1. if the update does not affect time attributes, the update is valid;
2. if the update modifies the starting time Ts of the oldest version or the ending time of the most recent version, it is accepted;
5
As an alternative, the most recent version could be open-ended.
3. if the update modifies the time interval to cover exactly the time intervals of two or more consecutive versions of the same instance, these versions have to be eliminated (coalescing).
Delete: for each instance selected for deletion with identifier component Idi, there are two possible cases:
1. there exists only one instance with Idi, or the deleted instance is the oldest or the most recent version, and the deletion is accepted;
2. the instance is deleted and a coalescing operation is required to preserve the adjacency of the remaining versions.
If the application logic does not allow automatic coalescing and splitting, the cases requiring such operations, namely the last one presented for each of the above operations, are not accepted.
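For concreteness, the insert rules for a CMP entity could be checked along the following lines (an illustrative sketch, not the Chrono implementation; it reuses the closed-interval, integer-chronon convention of the previous fragment and omits the correction actions that a real trigger would also perform):

```python
def cmp_insert_allowed(existing, new_ts, new_te):
    """existing: list of (ts, te) versions already stored for the same Id."""
    if new_ts > new_te:
        return False                                  # empty time interval
    if not existing:
        return True                                   # first version of this Id
    oldest_ts = min(ts for ts, _ in existing)
    newest_te = max(te for _, te in existing)
    if new_ts == newest_te + 1:
        return True                                   # case 1: new most recent version
    if new_te == oldest_ts - 1:
        return True                                   # case 2: extension into the past
    # case 3: the new version splits an existing one (the old version's
    # times must then be adjusted to preserve adjacency, not shown here)
    return any(new_te == te and new_ts > ts for ts, te in existing)

print(cmp_insert_allowed([(1, 10)], 11, 20))          # case 1 -> True
print(cmp_insert_allowed([(1, 10)], 5, 10))           # case 3 -> True
```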
4.3 Non-consecutive Multi-Period
In this case, we lose the constraint that the ending time of a version meets the starting time of the consecutive version, and the only requirement is that the versions must be non-overlapping. Given an instance to be inserted, if the time interval of the new instance is non-empty and there does not exist an instance with the same value of the entity identifier component Id, the insert is accepted; otherwise the insert is acceptable only if the time interval of the new instance does not overlap the time intervals of the already existing instances with the same identifier Id. For the update and delete operations, the constraints are the same as for the insert.
4.4 Event and Event with Versions
In this case, a single time attribute is sufficient for the temporal representation. Single-event entities do not need any special constraint checking, while event-with-versions entities extend the key with the time attribute; therefore the key uniqueness must be verified on insert and update operations.
4.5 Categorization Relationship
When Ep is of type τ, each instance eci ∈ Eci is related to an instance ep and the two instances must be valid at the same time. Therefore, at the extensional level the following constraints hold:
1. the time attribute Ts (and Te if τ ∈ {SP, CMP, MP}), together with the parent identifier component Id, have the semantics of a foreign key from Eci to Ep;
2. for each parent instance ep there must exist a single instance eci in exactly one Eci with the same value of Id and Ts (and Te if τ ∈ {SP, CMP, MP}).
The above constraints hold for any τ. Insert, update and delete operations, both on the parent and on the child, must be performed in accordance with them. When the category cluster is incomplete, constraint 2 above does not hold and only parent update and delete and child insert and update are subject to constraint enforcement.
4.6 Identifying Relationship
In this case, parent insert, child insert and child update must check the inclusion of the validity period of the child in that of the parent. The attributes (Id, Ts, Te) have the semantics of a foreign key from the child to the parent, and the operations parent update, parent delete, child insert and child update must enforce it.
4.7 Non-identifying Relationship
The constraints take into account the possibility of a null reference from the child to the parent, and check the validity interval when the reference is not null.
5 Implementation of Chrono
The most direct method for the implementation of Chrono would be to extend a CASE tool for conceptual design in order to deal with time attributes and to generate the appropriate integrity constraints. At present it was not possible either to build a CASE tool from scratch or to access the source code of an existing tool. For this reason we had to build an additional software layer on top of a CASE tool, but this was sufficient to prove the feasibility and the usefulness of the project. We chose the CASE tool ERwin (from Logic Works), which works in the MS Windows environment and is based on the IDEF1X model. ERwin has a graphic interface to support conceptual schema design and is able to automatically generate relational schemata and triggers for many popular RDBMSs. The triggers are generated starting from trigger templates, with some peculiarities for the various supported DBMSs. Some trigger templates are added by Chrono to perform time-related checks and actions. Chrono operates by reading and modifying the files generated by ERwin describing a conceptual schema. The flow of the design activity consists of the following steps:
1. the designer prepares a usual IDEF1X conceptual schema with ERwin;
2. the schema is stored in a file with extension .ERX;
3. Chrono reads the .ERX file and interprets the schema description;
4. the designer adds the temporal treatment to the entities of the conceptual schema;
5. Chrono writes a modified .ERX file including the schema modifications necessary for the representation of time and the triggers for the preservation of data integrity for time-related attributes;
6. ERwin imports the modified .ERX file for possible schema refinement.
Report [2] shows more details on the architecture of Chrono.
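Steps 4 and 5 of this flow, deriving the schema changes for a chosen temporal treatment, can be outlined as below. This is only a hypothetical sketch, not the actual Chrono code: the in-memory schema structure is invented for illustration, and the reading and writing of the .ERX files is deliberately omitted.

```python
# Hypothetical sketch: how a temporal treatment could be turned into the
# attribute and key changes of Table 1 (not the real Chrono/ERwin API).
TIME_RULES = {
    "SP":  {"add": ["Ts", "Te"], "extend_key": False},
    "CMP": {"add": ["Ts", "Te"], "extend_key": True},
    "MP":  {"add": ["Ts", "Te"], "extend_key": True},
    "E":   {"add": ["Ts"],       "extend_key": False},
    "EV":  {"add": ["Ts"],       "extend_key": True},
}

def apply_temporal_treatment(schema, treatments):
    """schema: {entity: {"attributes": [...], "key": [...]}} (assumed shape);
    treatments: {entity: "SP" | "CMP" | "MP" | "E" | "EV"}."""
    for entity, kind in treatments.items():
        rule = TIME_RULES[kind]
        schema[entity]["attributes"] += rule["add"]
        if rule["extend_key"]:
            schema[entity]["key"] = schema[entity]["key"] + ["Ts"]   # K = (Id, Ts)
        # the trigger templates enforcing the constraints of Sect. 4 would
        # be attached to the entity here as well
    return schema

schema = {"Occupancy": {"attributes": ["Room#"], "key": ["Emp#"]}}
print(apply_temporal_treatment(schema, {"Occupancy": "MP"}))
```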
6 Discussion and Conclusions
We introduced a conceptual data model extending IDEF1X for the representation of time and discussed both the conceptual design constraints and the integrity constraints introduced by the representation of time. The conceptual design constraints introduce limitations on the possible kinds of temporal treatment for related entities, while the integrity constraints are used to guarantee consistent database states with reference to the temporal treatment. The idea of extending conceptual models in order to support time representation has received some attention in the literature. The report [10] provides a comprehensive survey on the topic and identifies nineteen design criteria to compare the effectiveness of the surveyed models [8,15,16,21,6,7,4,5,26,20,11]. All the models above are extensions of the Entity-Relationship model, but can easily be compared to Chrono, since IDEF1X too is strictly related to the Entity-Relationship model. Report [2] compares Chrono with the models above. To summarize, we can say that Chrono couples a significant expressive power with the reuse of the existing available technology, obtaining a feasible and low-cost approach to the effective representation of time in a database. The most significant evolution of our work would be a deep analysis of the design and constraint issues deriving from a more sophisticated time model, for instance the bi-temporal conceptual model.
Acknowledgements. Thanks to Gruppo Formula S.p.A. for the support, to Paolo Pellizzardi for the suggestions and discussions and to Giorgio Ferrara for the programming effort.
References
1. C. Batini, S. Ceri, and S. B. Navathe. Conceptual Database Design: an Entity-Relationship Approach. The Benjamin/Cummings Publishing Company, 1992.
2. S. Bergamaschi and C. Sartori. Chrono: a conceptual design framework for temporal entities. Technical Report CSITE-011-98, CSITE - CNR, 1998. ftp://wwwdb.deis.unibo.it/pub/reports/CSITE-011-98.pdf.
3. P. Chen. The Entity-Relationship model - towards a unified view of data. ACM Trans. on Database Systems, 1(1):9–36, 1976.
4. R. Elmasri, I. El-Assal, and V. Kouramajian. Semantics of temporal data in an extended ER model. In 9th Int. Conf. on the Entity-Relationship Approach, pages 239–254, Lausanne, Switzerland, 1990.
5. R. Elmasri and V. Kouramajian. A temporal query language for a conceptual model. Lecture Notes in Computer Science, 759:175–??, 1993.
6. R. Elmasri and G. Wuu. A temporal model and query language for ER databases. In Proc. IEEE CS Intl. Conf. No. 6 on Data Engineering, Feb. 1990.
7. R. Elmasri, G. T. J. Wuu, and V. Kouramajian. A temporal model and query language for EER databases. In Tansel et al. [25], chapter 9, pages 212–229.
8. S. Ferg. Modeling the time dimension in an entity-relationship diagram. In 4th International Conference on the Entity-Relationship Approach, pages 280–286, Silver Spring, MD, 1985. IEEE Computer Society Press.
9. S. K. Gadia and C. S. Yeung. A generalized model for a relational temporal database. In ACM SIGMOD, pages 251–259, 1988.
10. H. Gregersen and C. S. Jensen. Temporal entity-relationship models - a survey. Technical Report TR-3, Time Center, January 1997. http://www.cs.auc.dk/research/DBS/tdb/TimeCenter/publications.html.
11. J. L. Guynes, V. S. Lai, and J. P. Kuilboer. Temporal Databases: Model Design and Commercialization Prospects. Database, 25(3), Aug. 1994.
12. C. S. Jensen and R. T. Snodgrass. Semantics of time-varying information. Information Systems, 21(4):311–352, 1996.
13. C. S. Jensen, R. T. Snodgrass, and M. D. Soo. Extending existing dependency theory to temporal databases. IEEE Transactions on Knowledge and Data Engineering, 8(4):563–582, 1996.
14. C. S. Jensen, M. D. Soo, and R. T. Snodgrass. Unifying temporal models via a conceptual model. Information Systems, 19(7):513–547, 1994.
15. M. R. Klopprogge. TERM: An approach to include the time dimension in the entity-relationship model. In Proceedings of the Second International Conference on the Entity Relationship Approach, pages 477–512, Washington, DC, Oct. 1981.
16. M. R. Klopprogge and P. C. Lockemann. Modelling information preserving databases: Consequences of the concept of time. In M. Schkolnick and C. Thanos, editors, Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 399–416, Florence, Italy, 1983.
17. Knowledge Based Systems, Inc. SmartER - information and data modeling and database design. Technical report, Knowledge Based Systems, Inc., Austin, USA, 1997. http://www.kbsi.com/products/smarter.html.
18. Logic Works, Inc. ERwin/ERX. Technical report, Logic Works, Inc., 1997. http://www.logicworks.com/products/erwinerx/index.asp.
19. H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison-Wesley, 1993.
20. P. McBrien, A. H. Selveit, and B. Wangler. An entity-relationship model extended to describe historical information. In International Conference on Information Systems and Management of Data (CISMOD'92), pages 244–260, Bangalore, India, July 1992.
21. A. Narasimhalu. A data model for object-oriented databases with temporal attributes and relationships. Technical report, National University of Singapore, 1988.
22. S. Navathe and R. Ahmed. Temporal extensions to the relational model and SQL. In Tansel et al. [25], chapter 4, pages 92–109.
23. Federal Information Processing Standards Publication. Integration definition for information modeling (IDEF1X). Technical Report 184, National Institute of Standards and Technology, Gaithersburg, MD 20899, 1993.
24. R. T. Snodgrass. The temporal query language TQuel. ACM Trans. Database Syst., 12(2):247–298, 1987.
25. A. U. Tansel et al., editors. Temporal Databases: Theory, Design and Implementation. Benjamin/Cummings, 1993.
26. C. Theodoulidis, P. Loucopoulos, and B. Wangler. A conceptual modelling formalism for temporal database applications. Information Systems, 16(4):401–416, 1991.
Designing Well-Structured Websites: Lessons to Be Learned from Database Schema Methodology
Olga De Troyer
Tilburg University, INFOLAB, Tilburg, The Netherlands
[email protected]
Abstract. In this paper we argue that many of the problems one may experience while visiting websites today may be avoided if their builders adopt a proper methodology for designing and implementing the site. More specifically, introducing a systematic conceptual design phase for websites, similar in purpose and technique to the conceptual design phase in database systems, proves to be effective and efficient. However, certain differences such as adopting a user-centered view are essential for this. Existing database design techniques such as ER, ORM, OMT are found to be an adequate basis for this approach. We show how they can be extended to make them appropriate for website design. We also indicate how conceptual schemes may be usefully deployed in future automation of site creation and upkeep. Furthermore, by including parts of such a conceptual schema inside the site, a new generation of search engines may emerge.
1 Introduction
The World Wide Web (WWW) offers a totally revolutionary medium for asynchronous computer-based communication among humans, and among their institutions. As its primary use evolves towards commercial purposes, competition for the browser's attention, often split-second, is now a dominating issue. This has forced the focus of website design towards visual sophistication. Websites must be 'cool, hip, killer'. Most of the literature on website design therefore appears to deal with graphics, sound, animation, or implementation aspects. The content almost seems to be of less importance. Most 'web designers' have certainly never been schooled in traditional design principles nor in fundamental communication techniques. They have 'learned' to design websites by looking at other websites and by following a 'trial-and-error' principle. In addition, the Web is constantly in evolution, outdating itself nearly daily. The combination of all these factors for an individual website easily leads to problems of maintenance but also of elementary usability. Indeed, as any database designer knows, if the represented information is not structured properly, maintenance problems occur which are very similar to those in databases: redundancy, inconsistency, incompleteness and obsolescence. This is not surprising as websites as well as databases may provide (large) amounts of information which need to be maintained. The same aspects also lead to usability problems.
These are particularly obnoxious as they are problems experienced by the target audience of the website:
• Redundancy. Information which is needlessly repeated during navigation is annoying to most users.
• Inconsistency. If information on the site is found to be inconsistent, the user will probably distrust the whole site.
• Incompleteness. Stale and broken links fall into this category, but incompleteness is also experienced by users who cannot find the information which they expect to be available on a site.
• Actuality. Organizations and information are often changing so quickly that the information provided on websites soon becomes out of date. If a website has visibly not been updated for a while, confidence of users in the information provided is likely not to be very high.
Other usability problems are caused by:
• Lack of a mission statement. If the website has no declared goal, that goal, quite simply, cannot be reached. The key question, therefore, that must be answered by its owner first is "What do I want to get out of my site?". This mission statement is the basis for any evaluation of the effectiveness of the site.
• Lack of a clearly identified target audience. The target audience is the audience which will be interested in the site. If one does not have a clear understanding of one's target audience, it is quite difficult to create a compelling and effective site.
• Information overload. Users typically are not interested in wading through pages and pages of spurious "information". Also, attention spans tend to be short.
• The lost-in-hyperspace syndrome [11]. Hypertext requires users to navigate through the provided information. If this navigation process is not well structured or guided, users may easily get lost. This makes it more difficult and time-consuming to locate the desired information.
The use of a proper design method could help solve some of these problems. A number of researchers have already recognized the lack of a design method for websites, or, more generally, for web-based information systems, and have proposed methods: HDM [7] and its successors HDM2 [6] and OOHDM [13], RMM [9], W3DT [17], the method for analysis and design of websites in [15], and SOHDM [10]. Older methods (HDM, OOHDM, RMM) were originally designed for hypertext or hypermedia applications and do not deal comfortably with web-specific issues. In addition, these methods are very much data-driven or implementation-oriented. Some have their origin in database design methods like the E-R method [1] or object-oriented (OO) methods such as OMT [12]. These methods may be able to solve maintenance problems to some extent, but they do not address the other usability problems mentioned above. In [4], we have proposed a website design method, called WSDM, which is 'user-centered' rather than 'data-driven'. In a data-driven method the data available in the organization is the starting point of the modeling approach. In our approach, however, the starting point is the target audience of the website. The issues related to this target audience run through the method like a continuous thread. We will explain the differences between data-driven and user-centered in more detail in section 3.1.
We argue that our approach results in websites which are more tailored to their users and therefore have a higher usability and greater satisfaction coefficient. WSDM also makes a clear distinction between the conceptual design and the design of the actual presentation. The conceptual design, as in database design, is free from implementation details and concentrates on the content and the structuring of the website. The design of the presentation takes into consideration the implementation language used, the grouping in pages, and the actual ‘look and feel’ of the website. This distinction is comparable to the distinction made in database design between the conceptual design (e.g. an E-R schema [1]) and the logical design (e.g. a relational schema). The purpose of this paper is to explain the concept of a conceptual schema within the context of a website design method (section 3) and to identify the different roles it plays in the life cycle of the website (section 4). In section 2 we give a short overview of the different phases of our WebSite Design Method. Section 5 concludes the paper.
2 The WebSite Design Method (WSDM)
We only present a brief overview of WSDM; a more detailed description can be found in [4] and [5]. The method currently concentrates on kiosk websites. A kiosk website [9] mainly provides information and allows users to navigate through that information. An application website is a kind of interactive information system where the user interface is formed by a set of web pages. The core of the method consists of the following phases: User Modeling, Conceptual Design, Implementation Design and the actual Implementation (see Fig. 1 for an overview). We suppose that the mission statement for the website has been formulated before the start of the User Modeling phase. The mission statement should describe the subject and the purpose of the website as well as the target audience. Without giving due consideration to these issues, there is no proper basis for decision making, or for the evaluation of the effectiveness of the website. As an example we consider the mission statement of a typical university department website. It can be formulated as follows: "Provide information about the available educational programmes and the ongoing research to attract more students, researchers and companies, and enhance the internal communication between students and staff members". The User Modeling phase consists of two sub-phases: User Classification and User Class Description. In the User Classification we identify the future users or visitors of the website and classify them into user classes. The mission statement will give an indication of the target audience, but this has to be refined. One way of doing this is by looking at the organization or the business process which the website should support. Each organization or business process can be divided into a number of activities. Each activity involves people. These people are potential users/visitors of the site. In our method, a user class is a subset of all the potential users who are similar in terms of their information requirements. Users from the same user class have the same information requirements. As an example, the user classes of our university example are: Candidate Students, Enrolled Students,
Researchers, Staff Members and Companies. User classes need not be disjoint. The same person may be in different user classes depending on the different roles he plays in the organizational environment. For example, a person can be an enrolled student as well as a staff member.
[Fig. 1. Overview of the WSDM phases: User Modeling (User Classification, User Class Description), Conceptual Design (Object Modeling, Navigational Design), Implementation Design, Implementation.]
In the User Class Description, the identified user classes are analyzed in more detail. We not only describe (informally) the information requirements of the different user classes, but also their usability requirements and characteristics. Some examples of user characteristics are: level of experience with websites in general, language issues, education/intellectual abilities, age. Some of the characteristics may be translated into usability requirements, while others may be used later on in the implementation phase to guide the design of the 'look and feel' of the website, e.g. younger people tend to be more visually oriented than older people. Although all users from a single user class potentially have the same information requirements, they may diverge with respect to their characteristics and usability requirements. For example, within the user class Enrolled Students we may distinguish between local students and exchange students. They have the same information requirements (detailed information on courses) but have different characteristics and usability requirements. Local students are young (between 18 and 28), and are familiar with the university jargon, the university rules and customs. They have a good level of experience with the WWW. They prefer the local language for communication, but have in general a good understanding of English. On the other hand, all communication with exchange students is done in English. We may not presume that they are familiar with the university jargon and customs, or with the WWW. To support different characteristics and usability requirements within a single user class, we use perspectives. A perspective is a kind of user subclass. We define a perspective as all users in a user class with the same characteristics and usability
requirements. For the user class Enrolled Students we may distinguish two perspectives: Local Students and Exchange Students. The Conceptual Design phase also consists of two sub-phases: the Object Modeling and the Navigational Design. During Object Modeling the information requirements of the different user classes and their perspectives are formally described in a number of conceptual schemes. How this is done is described in section 3. During the Navigational Design we describe how the different users will be able to navigate through the website. For each perspective a separate navigation track will be designed. It is precisely the approach taken in the Object Modeling and Navigational Design, based on user classes and perspectives, that constitutes the user-centered approach of WSDM and its departure from purely classic information system modeling. In the Implementation Design we essentially design the 'look and feel' of the website. The aim is to create a consistent, pleasing and efficient look and feel for the conceptual design made in the previous phase. If the information provided by the website will be maintained by a database, then the implementation design phase will also include the logical design of this database. The last phase, Implementation, is the actual realization of the website using the chosen implementation environment, e.g. HTML.
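As a concrete reading of these phases, the user classes and perspectives of the running example could be captured by data structures along the following lines (an illustrative sketch only; the class and field names are not part of WSDM):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Perspective:
    name: str
    characteristics: List[str] = field(default_factory=list)
    usability_requirements: List[str] = field(default_factory=list)

@dataclass
class UserClass:
    name: str
    information_requirements: List[str] = field(default_factory=list)
    perspectives: List[Perspective] = field(default_factory=list)

enrolled_students = UserClass(
    name="Enrolled Students",
    information_requirements=["detailed information on courses"],
    perspectives=[
        Perspective("Local Students",
                    characteristics=["age 18-28", "knows university jargon",
                                     "experienced WWW user"],
                    usability_requirements=["local language preferred"]),
        Perspective("Exchange Students",
                    characteristics=["no assumed jargon or WWW experience"],
                    usability_requirements=["communication in English"]),
    ],
)
```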
3 The Conceptual Design of a Website
During the User Modeling phase, the requirements and the characteristics of the users are identified and different user classes and perspectives are recognized. The aim of the Conceptual Design phase is to turn these requirements into a high-level, formal description which can be used later on to generate (automatically or semi-automatically) effective websites. During Conceptual Design, we concentrate on the conceptual 'what and how' rather than on the visual 'what and how'. This means that, as in database design, we describe what kind of information will be presented (object types and relationships; the conceptual 'what'), but unlike in database design we also describe how users will be able to navigate through the information (the conceptual 'how'). This is needed because navigating through the information space is an essential characteristic of websites. If the navigation is not (well) designed or not adapted to the target audience, serious usability problems occur. The conceptual 'what' is covered by the Object Modeling step, the conceptual 'how' by the Navigational Design.
3.1 Object Modeling in a User-Centered Approach
In WSDM, the Conceptual Object Modeling results in several different conceptual schemes, rather than in a single one as in classical database design. This is because we have opted for a user-centered approach. In a data-driven approach, as used in database design, the starting point of a conceptual design is the information available in the organization: designers first model the application domain, and subsequently they associate information with each class of users (e.g. by means of views or
external schemes). However, the data and the way it is organized in the application domain may not reflect the user's requirements. A good example of such a mismatch can be found in the current website of our university1. The structure of this website completely reflects the internal organizational structure of our university. This structure is completely irrelevant to, and unknown to, most users of this site. As an example, if you want to look on the web for the products offered by the Computer-shop of our university (called PC-shop), you must know that the PC-shop is part of the Computer Center (actually it is one of the 'External Services' of the Computer Center), which itself is a 'Service Department' of the University. You will not find it under 'Facilities' like the Restaurant, the Copy-shop or the Branch Bank. In our user-centered approach we start by modeling the information requirements of the different types of users (user classes). Note that we make a distinction between a user-centered approach and a user-driven approach. In a user-driven approach the users are actively involved in the design process, e.g. through interviews, during scenario analysis, prototyping and evaluation. This is not possible for kiosk websites on the internet because most of the users are unknown and cannot be interviewed in advance or be involved in the design process. However, we can fairly well identify the different types of users and investigate their requirements. After all, the main goal of a kiosk site is to provide information. Therefore, for each user class a conceptual schema is developed expressing the information needs of that type of user. We call these conceptual schemes user object models. Like an "ordinary" conceptual schema, a user object model (UOM) is expressed in terms of the business objects of the organization.
[Fig. 2. User object model for Enrolled Students (OMT notation): the OTs Course (Id, Name, Description, Newsgroup, Exam Type, Required Reading, Programme Year), Exam (Date, Room, Time, Duration), Course Material (Id, Name, Price, Date of Issue) and Lecturer (Name, Title, Room, Tel, E-Mail), related among others by giving/given by, requiring, prerequisite, using/used for and written by/author of.]
In [5] we explain how a user object model is constructed from the information requirements expressed in a user class description.
1
http://www.kub.nl/ (in Dutch).
For each requirement a so-called object chunk is constructed. Next, the object chunks of one user class are merged into a single model. In conceptual modeling in general, object models describe the different object types (OTs), the relationships between these OTs, and rules or constraints. OO models also describe behavior. For our purpose (modeling kiosk websites), modeling behavior is not (yet) needed. The traditional conceptual modeling methods like E-R [1], the Object-Role Model [8], [16], [2], or "true" OO methods like OMT [12] are therefore all suitable. Figure 2 shows the UOM (in OMT notation) developed for the user class Enrolled Students of our university department example.
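To make the merging step concrete, a deliberately simplified sketch is given below; it represents an object chunk as a mapping from OT names to attribute sets and ignores relationships and constraints (all of which WSDM of course also covers):

```python
def merge_chunks(chunks):
    """Merge per-requirement object chunks into one user object model.
    Each chunk maps an OT name to a set of attribute names; the merge is
    simply the union of OTs and of their attributes."""
    uom = {}
    for chunk in chunks:
        for ot, attributes in chunk.items():
            uom.setdefault(ot, set()).update(attributes)
    return uom

chunk_exams = {"Course": {"Id", "Name"}, "Exam": {"Date", "Room", "Time"}}
chunk_reading = {"Course": {"Id", "Required Reading"},
                 "Course Material": {"Id", "Name", "Price"}}
print(merge_chunks([chunk_exams, chunk_reading]))
```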
3.2 Object Type Variants
As explained, the same user class may include different perspectives expressing different usability requirements and characteristics. It is possible that this also results in slightly different information requirements. In WSDM, we model this by means of variants for OTs. A variant of some OT corresponds largely with the original OT but has some small differences (variations). Consider as an example the OT Course for the user class Enrolled Students. See Fig. 2 for a graphical representation. About a course, enrolled students in general need the following information: the identification number of the course, the name of the course, a description of the content of the course, the prerequisites for the course, specification of the required reading, the type of exam of the course, the name of the newsgroup of the course and the programme year in which the course may be followed. However, for the subgroup (perspective) Local Students we want to offer this information in the local language, while for the subgroup Exchange Students the information must be provided in English. Also the programme year is not relevant for exchange students and (the actual value of) the prerequisites, the required reading and the exam type may differ between exchange students and local students. Indeed, local students may have required reading written in the local language while for the exchange students the required reading must be written in English. In implementation terms, this means that for most (but not all) attributes of the OT Course we will need to maintain two variants; an English one and a local language one. The recognition of these differences is essential for a user-centered approach and therefore they should be modeled in an early phase. Some people may argue that the language is a representation issue and therefore it should not be considered in the conceptual phase but left to the implementation design. However, in this example, the language issue is an important user requirement which also influences the actual information that will be provided. If we do not recognize this during conceptual design, the information provided for a course, except for the language, would be the same for local students and exchange students.
To model the differences we introduce two variants for the OT Course: Course/Local Students and Course/Exchange Students.
Course/Local Students:
• the identification number of the course;
• the local-language name of the course;
• a description of the content of the course in the local language;
• the prerequisites for the course for the local students, in the local language;
• the specification of the required reading for the local students, in the local language;
• the type of exam of the course for the local students, in the local language;
• the name of the newsgroup of the course;
• the programme year in which the course may be followed.
Course/Exchange Students:
• the identification number of the course;
• the English name of the course;
• a description of the content of the course in English;
• the prerequisites for the course for the exchange students, in English;
• the specification of the required reading for the exchange students, in English;
• the type of exam of the course for the exchange students, in English;
• the name of the newsgroup of the course.
Graphically, we use a parent-child notation to represent variants (see Fig. 3). The parent OT is variant independent; each child OT is a variant of the parent OT. The name of a variant OT is composed of the name of the parent OT followed by the variant identification, e.g. Course/Exchange Students. A variant OT can have fewer attributes than its parent OT. Semantically, this means that the omitted attributes are not meaningful for the variant. E.g. the Programme Year attribute is omitted in Course/Exchange Students because it is not meaningful for exchange students. Note that in this respect variants are clearly different from the notion of subtype. Subtypes can in general not be used to model variants. Attributes may also have variants. Name/English Name and Name/Dutch Name are two variants of the attribute Name. To relate the attribute variant to the original attribute in the parent OT, the name of the original attribute precedes the name of the attribute variant. In some cases, it is possible that the original attribute will never have a value of its own, but only serves as a means to indicate that the underlying variant attributes have the same semantics. This is comparable to the concept of abstract OT in object-oriented modeling. By analogy, we call this an abstract attribute. In the OT Course, the attributes Name, Description, Exam Type and Required Reading are abstract attributes. A variant OT cannot include or refer to attributes which are not defined in the parent OT. This is to prohibit the addition of completely new information (attributes) to a variant, in which case it would not be a variant anymore.
[Fig. 3. Variants for the OT Course: the parent OT Course (Id, Name, Description, Newsgroup, Exam Type, Required Reading, Programme Year) with the variant OTs Course/Local Students (Id, Name/Dutch Name, Description/Dutch Description, Newsgroup, Exam Type/Exam Type Dutch, Required Reading/Dutch Req. Reading, Programme Year) and Course/Exchange Students (Id, Name/English Name, Description/English Description, Newsgroup, Exam Type/Exam Type English, Required Reading/English Req. Reading).]
In WSDM, information differences between the perspectives of a single user class are modeled by means of OT variants. For each OT in the UOM of a user class, and for each perspective of this user class, a variant may be defined to reflect the possible information differences. To derive the conceptual schema for a perspective, called a perspective object model (POM), it suffices to replace the OTs in the corresponding UOM by the corresponding perspective variants. If an OT has no variant for the perspective, the OT is kept as it is.
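The derivation just described can be sketched as follows (illustrative code only; OTs are reduced to attribute lists and attribute-level variants such as Name/English Name are left out for brevity):

```python
def derive_pom(uom, variants):
    """Derive a perspective object model from a user object model by
    replacing each OT with its perspective variant, when one exists.
    A variant may omit attributes of its parent OT but may not add new ones."""
    pom = {}
    for ot, attributes in uom.items():
        variant = variants.get(ot)
        if variant is None:
            pom[ot] = list(attributes)             # no variant: keep the OT
        else:
            assert set(variant) <= set(attributes), \
                "a variant cannot introduce new attributes"
            pom[ot] = list(variant)
    return pom

uom_enrolled = {"Course": ["Id", "Name", "Description", "Newsgroup",
                           "Exam Type", "Required Reading", "Programme Year"],
                "Lecturer": ["Name", "Title", "Room", "Tel", "E-Mail"]}
exchange_variants = {"Course": ["Id", "Name", "Description", "Newsgroup",
                                "Exam Type", "Required Reading"]}
print(derive_pom(uom_enrolled, exchange_variants))
```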
3.3 Linking the Conceptual Models
As explained, the Object Modeling starts by building the user object models, one for each user class. Subsequently, these models are refined using perspective variants to derive the perspective object models (if a user class has no perspectives then the user object model acts as perspective object model). In what follows we call OTs from a perspective object model perspective OTs (POTs). Perspective object models of a single user class are related by means of their user object model. However, the different user object models are (still) independent. This is not desirable, especially not when several user classes share the same information. It would result in an uncontrollable redundancy. Therefore, the different user object models must be related. To do this we use an overall object model, the business object model (BOM). This model is a conceptual description of the information (business objects) available in the organization. It is independent of any type of user. Such a business object model may already have been developed for the organization or the application domain. If not, or if it is not available in a shape usable for our purpose, it
must be (re-)developed. The classical information analysis methods mentioned earlier may be used for this. For this model, a data-driven approach is not a problem; on the contrary, it is preferred. Next, the different user object models are expressed as (possibly complex) views on the BOM. Note that it is possible that during this step it turns out that the (existing) BOM is incomplete. This is the case if information modeled in a user object model cannot be expressed as information modeled in the BOM. In such a case it is necessary to re-engineer the BOM. Figure 4 illustrates how the different types of conceptual schemes developed during Object Modeling relate to each other.
[Fig. 4. Relationship between the different types of object models: the business object model describes the application domain; each user object model is a view on the business object model, based on the corresponding user class description; each perspective object model is a variant of a user object model.]
3.4 Navigational Design
Once the Object Modeling is done, a conceptual navigation model is constructed. The navigation model expresses how the different user types will be able to navigate through the available information. Navigational models are usually described in terms of components and links. We distinguish between information components, navigation components and external components (see Fig. 5). Information components represent information. An information component may be linked to other components to allow navigation. A navigation component can be seen as a grouping of links, and so contains no real information but allows the user to navigate. An external component is actually a reference to a component in another site. Following our user-centered approach, we design an independent navigation track for each perspective. To derive the navigation model, it is sufficient to connect the different navigation tracks by a navigation component. In a nutshell, a navigation track for a perspective may be constructed as follows: information components are
derived from the POTs and links are used to represent the relationships between POTs. This forms the information layer of the navigation track. Next, a navigation layer, built up of navigation components, is designed to provide different access paths to the information components in the information layer. The top of a navigation track is a single navigation component which provides access to the different navigation components in the navigational layer. When the different navigation tracks are composed, these top level components form the context layer of the navigation model. Figure 6 shows the navigation track for the POM Exchange Students. Figure 7 shows how the different navigation tracks are composed to make up the navigation model.
[Fig. 5. Graphical representation of the navigation model concepts: navigation component, information component, external component, link.]
[Fig. 6. Navigation track for the perspective Exchange Students: the context layer (Exchange Students Perspective), a navigation layer (Courses by Name, Exams by Course, Lecturers by Name, Course Materials, Course Materials by Id, Course Materials by Course) and an information layer (Course/Exchange Students, Exam, Lecturer, Course Material).]
[Fig. 7. Composition of navigation tracks into a navigation model: a top-level University Department component in the context layer connects the Researchers, Local Students and Exchange Students perspective tracks, each with its own navigation layer and information layer.]
In the rest of this paper we will use the term conceptual schema (CS) to denote the result of the Conceptual Design: the UOMs, POMs, BOM and the navigation model.
4 Roles of the CS in the Website Life Cycle
The life cycle of a website contains many of the phases of a traditional Information System (IS) life cycle, such as planning, analysis, design and implementation, but also phases which are specific to web systems. The development process of a website is more open-ended because a website is often not as permanently fixed as a traditional IS. Designing a website is an ongoing process. Maintenance includes activities such as monitoring new technologies, monitoring users, and adapting the website accordingly. It is a continuous process of improvement. To emphasize this distinction, the maintenance phase is sometimes called Innovation [3]. The typical Installation phase is replaced by a Promotion phase in which the existence of the website is made public (by publicity, references from other websites, etc.). In this section we explain what role the CS may play in the Implementation phase, the Promotion phase and the Innovation phase, and we explain how the CS may be exploited even more inside the website. During Implementation Design, the 'look and feel' of the website is developed. The starting point for this is the navigation model. Through the use of graphical design principles and visual communication techniques, taking into account the characteristics of the different perspectives, the navigation model will be translated into a presentation model (content of pages and their layout). Again, this is in some respects similar to the mapping of a conceptual data schema into a logical data schema (e.g. a relational one). Indeed, during Implementation Design one may decide to group information components and links (from the navigational model) together and to present them to the user as single packages of information. (In fact, we are developing algorithms and tools to support this.) Separating the conceptual and the implementation design for websites has the same advantage as in database design: it offers the flexibility needed for designing large websites. As explained, designing a website is a continuous process. By separating the conceptual design from the implementation design, we obtain the flexibility required to support this incremental and evolving design process. Different implementation designs may be built (e.g. as prototypes) and evaluated. Changes and additions to the content are localized to the conceptual level, and the impact on the implementation design can easily be traced. Adding a new user class only involves adding a new UOM with its associated perspectives and navigation tracks. Changes to the presentation only influence the implementation design. The actual implementation can be automated using available tools and environments for assisting in e.g. HTML implementations.
Because different perspectives may offer the same information (possibly presented differently), we need to provide means to maintain this information and keep it consistent. The obvious way of doing this is by maintaining the underlying information (or parts of it) in a database. This need not be a full-fledged database, but it is in any case a single storage place for information shared between different perspectives. As all information presented in the website is ultimately related to the business object model (BOM) (by means of the POMs and UOMs), this BOM provides the conceptual schema for the underlying database. From this BOM a logical database schema is then generated (using appropriate database development tools) or manually built. The queries needed to extract the information for building the pages can then be derived from the POMs because they are already expressed as views on the BOM.

To reduce the lost-in-hyperspace syndrome, many sites contain an index page or site map, which gives a (hierarchical) overview of the website and provides a central point from which the user can locate a page in the website. We may instead consider replacing it with a representation of (parts of) the conceptual schema, which is much richer in information than an index page. Each navigation track could contain a suitable representation of its corresponding POM. This will not only allow the user to locate information directly but will also help him/her to build a mental model of the site, and ultimately provide an on-line repository of meta-information which may be queried. The availability of the CS literally ‘in-site’ may also be exploited by the many different types of search engines to enhance their search effectiveness. In this way promotion benefits as well.2
5 Conclusions
In this paper we have explained the need for a conceptual design phase in website design similar to the conceptual design phase in database systems. Based on early experience with our method WSDM, we argued that a user-centered approach is more appropriate for websites than the traditional data-centered approach used for database design. As a consequence, the conceptual schema of a website cannot be seen as a single schema but as a collection of schemes; each user perspective has its own conceptual schema. To relate the different schemes and to control the redundancy possibly introduced in this way, a business object model is used. To capture variations between perspective schemes, so-called OT variants are introduced. Because navigation is an essential characteristic of websites, the conceptual schema also includes a navigation model, which describes how users will be able to navigate through the website; it is a collection of navigation tracks, one for each user perspective. We have also shown that separation of the conceptual and the implementation design for websites has the same advantages as in database design.
2 Note that this does not lead to the redundancy mentioned as a usability problem in the introduction, because a user only follows one perspective and, within one perspective, redundancy is avoided.
As for database design, it is possible to deploy the conceptual schema technology in the future automation of site creation and upkeep. CASE-type tools generating well-structured websites from user requirements and business domain models are the next logical step. In addition, (parts of) the conceptual schema may be represented and queried inside the website to reduce the lost-in-hyperspace syndrome. New generations of search engines may exploit such additional structural knowledge, e.g. by interpreting the meta-information present in a website and acting on its semantics.

Acknowledgments. Many thanks go to Wim Goedefroy and Robert Meersman for the interesting discussions on and the contributions to this research work.
References

1. P.P. Chen, The Entity-Relationship Model: Towards a Unified View of Data, ACM Transactions on Database Systems, Vol 1, No 1, 1976, 471-522.
2. O.M.F. De Troyer, A formalization of the Binary Object-Role Model based on Logic, Data & Knowledge Engineering 19, pp. 1-37, 1996.
3. J. December, M. Ginsberg, HTML & CGI Unleashed, Sams.net Publishing, 1995.
4. O.M.F. De Troyer, C.J. Leune, WSDM: a User-Centered Design Method for Web Sites, in Proceedings of the WWW7 Conference, Brisbane, April 1997.
5. W. Goedefroy, R. Meersman, O. De Troyer, UR-WSDM: Adding User Requirement Granularity to Model Web Based Information Systems, Proceedings of the 1st Workshop on Hypermedia Development, Pittsburgh, USA, June 20-24, 1998.
6. F. Garzotto, P. Paolini, L. Mainetti, Navigation patterns in hypermedia databases, Proceedings of the 26th Hawaii International Conference on System Science, IEEE Computer Society Press, pp. 370-379, 1993.
7. F. Garzotto, P. Paolini, D. Schwabe, HDM - A Model-Based Approach to Hypertext Application Design, ACM Transactions on Information Systems, Vol 11, No 1, pp. 1-26, 1993.
8. T. Halpin, Conceptual Schema and Relational Database Design, second edition, Prentice Hall Australia, 1995.
9. T. Isakowitz, E.A. Stohr, P. Balasubramanian, RMM: A Methodology for Structured Hypermedia Design, Communications of the ACM, Vol 38, No 8, pp. 34-43, 1995.
10. H. Lee, C. Lee, C. Yoo, A Scenario-Based Object-Oriented Methodology for Developing Hypermedia Information Systems, Proc. of HICSS '98.
11. H. Maurer, Hyper-G - The Next Generation Web Solution, Addison-Wesley, 1996.
12. J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, W. Lorensen, Object Oriented Modeling and Design, Prentice Hall Inc., 1991.
13. D. Schwabe, G. Rossi, The Object-Oriented Hypermedia Design Model, Communications of the ACM, Vol 38, No 8, 1995.
14. D. Schwabe, G. Rossi, S.D.J. Barbosa, Systematic Hypermedia Application Design with OOHDM, http://www.cs.unc.edu/barman/HT96/P52/section1.html.
15. K. Takahashi, E. Liang, Analysis and Design of Web-based Information Systems, Sixth International World Wide Web Conference, 1997, http://www6.nttlabs.com/papers/PAPER245/Paper245.html.
16. J.J. Wintraecken, The NIAM Information Analysis Method - Theory and Practice, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.
17. M. Bichler, S. Nusser, W3DT - The Structural Way of Developing WWW-sites, Proceedings of ECIS'96, 1996.
Formalizing the Informational Content of Database User Interfaces

Simon R. Rollinson and Stuart A. Roberts
School of Computer Studies, University of Leeds, Leeds, LS2 9JT, UK
{sime, sar}@scs.leeds.ac.uk
Abstract. The work described in this paper addresses the problem of modelling the informational content of a graphical user interface (GUI) to a database. The motivation is to provide a basis for tools that allow customisation of database interfaces generated using model-based techniques. We focus on a particular class of user interface, forms-based GUIs, and explore the similarities between these types of interfaces and a semantic data model. A formalism for translating between forms-based interfaces and a semantic data model is presented. The translation takes account of the context in which each control on the GUI is employed, and accommodates the need to map distinct GUI elements to the same semantic concepts.
1 Introduction
Forms-based user interfaces have remained popular as a means of data entry and update for a number of years, especially in business-oriented applications. Over the years these interfaces have evolved from character-based systems to present-day graphical user interfaces (GUIs). GUIs offer a much wider scope for interaction, which has led to the move away from hierarchic menu structures to network-like structures utilising multi-modal navigation between forms. The structure of this style of interface corresponds quite closely with that of a semantic data model: forms can be identified with entities; controls on forms with attributes; and links between forms with relationships between entities. This correspondence was used in [19] for the purpose of interface generation. Although a user interface appears similar to a semantic model, there is an important difference: the semantics of user interface controls. In particular, a control's semantics can differ depending upon its usage. For example, in one interface a control may be used to represent an attribute of some entity, whilst in another the same control may be used to represent an entity. To be able to transform user interfaces to a semantic data model it is necessary to understand the semantics of each control in its different uses. The aim of this paper, therefore, is to describe an investigation into identifying the semantics of user interface components with respect to their different uses, classifying each component and its use(s) in terms of familiar semantic modelling concepts.
The motivation for this work comes from research into model-based interface generation. This has been popular for a number of years in the database community, where data models are often utilised along with task models to automatically generate user interfaces for database applications. For example, the MACIDA [17], GENIUS [13] and TRIDENT [3] systems have all used some form of Entity-Relationship model [6] as part of their input. A feature lacking in most of these systems is the ability to customise the generated interface to suit a user's personal needs. To ensure that no information is lost during customisation (by removing a control, for example), it is necessary to validate the customised interface against the generated interface, thus highlighting any missing information in the customised interface. To enable such a facility two things are needed: a means by which the informational content of the interface can be modelled, and an equivalence metric between interface models. Work already exists to do this on an individual-form basis (see [2]). We seek to undertake a similar study but work with networks of forms, identifying and addressing the issues posed by such interfaces.

The paper is structured as follows. Section 2 examines related work and Sect. 3 introduces the types of user interfaces focused on in this work. The formalism for representing user interfaces is described in Sect. 4, along with the transformations that map user interface elements to semantic modelling concepts. Section 5 briefly describes a prototype implementation of the mappings and Sect. 6 looks at applications of the work. Finally, Sect. 7 concludes the paper.
2 Related Work
In [9] the ERMIA (Entity-Relationship Modelling for Information Artifacts) formalism is presented. ERMIA is based on an extended entity-relationship modelling approach and is employed in the evaluation of user interfaces to complement methods such as GOMS [5] and UAN [11]. The significance of ERMIA to this work is its recognition of the close relationship between the structure of information in a user interface and the structure of information in a database. The work described in this paper, however, focuses on establishing a link between user interface components and semantic modelling concepts, whereas ERMIA is concerned with stripping away the ‘renderings’ of information to reveal the underlying structure for evaluation purposes.

Abiteboul and Hull in [2] describe a formalism for representing and restructuring hierarchical database objects, including office forms, based on the IFO model [1]. They focus on data-preserving transformations from one hierarchic structure to another, to allow, for example, equivalence tests between different forms. Abiteboul and Hull treat each form in isolation, whereas we aim to take into account the network-like structure of modern GUIs. Our work can be seen, in part, to be an extension of the work of Abiteboul and Hull, enabling tests for equivalence of complete interfaces comprising many linked forms. Furthermore,
Formalizing the Informational Content of Database User Interfaces
67
Table 1. User interface constructs

Form: A form is a window that allows other constructs to be placed on it.
Groupbox: The groupbox groups together related constructs and is named to reflect the grouped information.
Listbox: The listbox shows either a single- or multi-column scrollable list of alphanumeric information.
Grid: The grid allows alphanumeric information to be entered or shown in a tabular format.
Checkbox: The checkbox is a rectangle which holds either an "X" or a blank space (i.e. a boolean value).
Radio Button: Radio buttons cannot be used singly; two or more must be used together. Each holds either a "•" or a blank space (i.e. a boolean value), and in a group of n buttons exactly one must contain a "•" at any given time, representing the fact that one of the options is true and the rest false.
Textbox: The textbox is a rectangle that can be used to enter or show alphanumeric information.
Combobox: The combobox has two states, normal and extended. In its normal state the combobox is a rectangle (similar in appearance to the textbox) and allows alphanumeric information to be entered/shown. In the extended state the combobox presents a list of alphanumeric information, from which an item can be selected to be shown in the normal state of the combobox. In the extended state the combobox appears similar to a listbox.
Button: The button allows actions to be performed. A typical action might be to show another form, effectively linking together two forms.
Row: The row is part of (i.e. contained in) the grid, listbox and combobox constructs.
Column: One or more columns form the contents of a row construct.
[2] assumes a mapping between user interface controls on forms and the IFO constructs that is too simple for our purposes. In [10] Güting et al also consider the hierarchic nature of office documents and have developed an algebra, based on relational algebra, for manipulating these structures; again, each form is treated separately. Our work can also be classed as a form of reverse engineering, since we extract a semantic data model from a database application; in this respect it resembles [20], which considers the reverse engineering of database applications for the purpose of creating an object-oriented view of a relational database. However, we take a higher-level view, concentrating only on the user interface, whereas Vermeer and Apers [20] are concerned with examining the program and query language statements.
Fig. 1. An example user interface
Other work of interest is that of Mitchell et al [15] who have shown how an object-oriented data language can be used not only to describe a database but also its user interface.
3 User Interfaces
In this section we introduce, in more detail, the type of user interfaces that have been studied. The interfaces comprise linked forms (i.e. windows). Forms are linked and information displayed/entered using interface controls (hereafter referred to as constructs). Interface constructs can be generalised into several broad types; each is described in Table 1. To constrain the type of user interface examined in this study to those of a forms-based nature, we opted to use a model-based user interface development environment (MB-UIDE). For this work the TRIDENT system [3] was chosen as a means of limiting the interfaces. Several modifications are required to TRIDENT: the removal of constructs such as thermometers and dials, not supported in our formalism; the introduction of the row and column constructs; and a rationalisation of constructs where the differences between two constructs (e.g. with or without scrollbars) are immaterial to our method. The precise method of defining interfaces used in our study is given in Appendix A. To end this section we present a user interface that will be used as an example throughout the remainder of the paper. Figure 1 shows two forms, person and car, which both contain textbox and listbox controls. The arc from the car row to the car form indicates a ‘clickable’ link between the row of the listbox and the car form. In this case the link ‘opens’ the car form.
4 The User Interface Model
By examining the user interfaces of database applications it was possible to extract the semantics of each interface construct used within different contexts. Each interface construct was classified as one or more of three modelling concepts depending on the context in which it was used (see Table 2). The concepts are:
Table 2. Classification of interface constructs

Abstract Types
1 Forms.
2 Groupboxes containing one or more textboxes, comboboxes, listboxes, grids, groupboxes and checkboxes, but not just checkboxes.
3 Rows containing more than one column.
4 Rows containing a single column that have the same name as a form, or a groupbox (as defined in 2).
5 Comboboxes with more than one column.
6 Comboboxes with one column that have the same name as a form, or a groupbox (as defined in 2).
7 Textboxes that have the same name as a form, or a groupbox (as defined in 2).
8 A groupbox containing only radiobuttons where the groupbox has the same name as a form or a groupbox (as defined in 2).
9 A groupbox containing only checkboxes where the groupbox has the same name as a form or a groupbox (as defined in 2).

Lexical Types
10 Textboxes and columns.
11 A single checkbox.
12 Groupboxes containing two or more radiobuttons.
13 Rows with one column that do not have the same name as a form.
14 Comboboxes with one column that do not have the same name as a form.
15 Groupboxes that contain only checkboxes.

Groupings
16 Groupboxes that contain only checkboxes.
17 Listboxes and Grids.
abstract type, lexical type and grouping, which are taken from the generalised semantic model (GSM) [12]. A key element in our formalism is the use of labels for naming and identifying constructs. We adopt a scheme similar to that of Buneman et al [4] in which different constructs that have the same name are interpreted as representing the same real world concept. In Fig. 1 for example, the car form and car row are interpreted as representing the same entity. Likewise with the columns (reg., make) of the row which are interpreted as representing the same attributes as the reg and make textboxes. Notice, however, that name does not appear in the car row. Again, we adopt the approach of Buneman et al and form the union of the textbox and column labels to obtain the attributes for car. We also exploit the use of plural labels to indicate a repeating group. The entity being repeated is identified by the singular of the plural label. This idea has been described by Rock-Evans [18] as part of a data analysis activity.
4.1 User Interface Model - A Formal Description
To build the necessary base for the transformations we first define a user interface formally.

Definition 1. A user interface (UI) is a five-tuple I = ⟨SL, PL, T, N, C⟩ where:
– SL is a finite set of singular labels
– PL is a finite set of plural labels
– T is the set of interface construct types, i.e. T = {FORM, GROUPBOX, COMBO, CHECKBOX, COLUMN, ROW, GRID, LISTBOX, TEXTBOX, RADIOBUTTON}
– N = (SL × T) ∪ (PL × T) is the set of labelled constructs (nodes)
– C = N × N is the set of pairs of nodes
Definition 2. An instance of a user interface is a set of directed trees U = {T1, . . . , Tn}. Each tree Ti = (Vi, Ei), where 0 < i ≤ n, Vi ⊂ N is a set of vertices and Ei ⊂ C is a set of edges.

Several functions are defined to operate on instances of user interfaces and their vertices:
– ρ(Ti) : Ti → N returns the root node of a tree
– ψ(v) : N → N returns the parent of vertex v
– τ(v) : N → T returns the type of vertex v
– σ(v) : N → N′ ⊂ N returns the (possibly empty) set of children of vertex v
– λ(v) : N → SL ∪ PL returns the label of vertex v
– Σ(v) : N → SL returns the singular form of vertex v's plural label
– singlecolumn(v) returns true if a combobox or row v has exactly one column
– multicolumn(v) returns true if a combobox or row v has two or more columns
– checkboxgroup(v) returns true if a groupbox v contains only checkboxes
– radiobuttongroup(v) returns true if a groupbox v contains only radiobuttons
Fig. 2. An example instance of a user interface consisting of two forms: (a) T1, (b) T2
– constructgroup(v) returns true if a groupbox v contains any construct in T

Figure 2 shows an example user interface instance containing two forms, as represented by the sets V1, E1 and V2, E2 below.

V1 = {v1 = (person, Form), v2 = (name, Textbox), v3 = (cars, Listbox), v4 = (car, Row), v5 = (reg, Column), v6 = (make, Column)}
E1 = {e1 = (v1, v2), e2 = (v1, v3), e3 = (v3, v4), e4 = (v4, v5), e5 = (v4, v6)}
V2 = {v8 = (car, Form), v9 = (reg, Textbox), v10 = (make, Textbox), v11 = (name, Textbox), v12 = (ownedBy, Listbox), v13 = (person, Row), v14 = (name, Column)}
E2 = {e7 = (v8, v9), e8 = (v8, v10), e9 = (v8, v11), e10 = (v8, v12), e11 = (v12, v13), e12 = (v13, v14)}
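For readers who prefer running code to set notation, the following sketch (ours, not part of the paper's formalism; the tuple layout and the helper names are assumptions) encodes the example instance above and the access functions of Definition 2 in Python.

    FORM, GROUPBOX, COMBO, CHECKBOX, COLUMN, ROW, GRID, LISTBOX, TEXTBOX, RADIOBUTTON = range(10)

    # Tree T1 (the person form) and tree T2 (the car form) of Fig. 2,
    # encoded as (label, type) nodes plus parent-child edges.
    V1 = {"v1": ("person", FORM), "v2": ("name", TEXTBOX), "v3": ("cars", LISTBOX),
          "v4": ("car", ROW), "v5": ("reg", COLUMN), "v6": ("make", COLUMN)}
    E1 = [("v1", "v2"), ("v1", "v3"), ("v3", "v4"), ("v4", "v5"), ("v4", "v6")]

    V2 = {"v8": ("car", FORM), "v9": ("reg", TEXTBOX), "v10": ("make", TEXTBOX),
          "v11": ("name", TEXTBOX), "v12": ("ownedBy", LISTBOX),
          "v13": ("person", ROW), "v14": ("name", COLUMN)}
    E2 = [("v8", "v9"), ("v8", "v10"), ("v8", "v11"),
          ("v8", "v12"), ("v12", "v13"), ("v13", "v14")]

    V = {**V1, **V2}   # all vertices
    E = E1 + E2        # all edges

    def lam(v): return V[v][0]                        # λ(v): label of vertex v
    def tau(v): return V[v][1]                        # τ(v): construct type of vertex v
    def sigma(v): return [c for p, c in E if p == v]  # σ(v): children of vertex v
    def psi(v):                                       # ψ(v): parent (None for a root)
        parents = [p for p, c in E if c == v]
        return parents[0] if parents else None

    def singlecolumn(v): return len(sigma(v)) == 1
    def multicolumn(v): return len(sigma(v)) >= 2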
4.2 Transformations
A GSM schema is a directed graph G = (N, E), where N is a set of nodes and E is a set of edges. Each node is of a particular type (abstract type, lexical type, or grouping), resulting in three subsets of N representing abstract types, lexical types and groupings. We label these sets A, Π and Γ respectively. If U = {T1, . . . , Tn} is a UI model instance, and Ti = (Vi, Ei) is a tree in U, we define two sets V and E as follows:

V = V1 ∪ . . . ∪ Vn,   E = E1 ∪ . . . ∪ En
Transforming an instance of a UI into a GSM schema is a matter of mapping the set of nodes V to the sets of nodes A, Π and Γ that comprise N, and the set of UI edges E to the set of edges E of the GSM schema. We start by mapping V to N. Mapping V to N means classifying each v ∈ V as a member of A, Π or Γ. This can be achieved by defining membership conditions for the three sets. Each set can have several membership conditions, so we divide them into subsets, with one subset for each membership condition (e.g. A has nine membership conditions, so we define the sets α1 to α9, the union of which forms A). The same effect is achievable by defining the membership condition for A, Π and Γ as the disjunction of the membership conditions of their subsets; the former method is used for clarity. Definitions three to five below show the membership conditions for each of the sets.

Notice that in these definitions we place the labels of nodes, rather than the nodes themselves, into A, Π and Γ. The reason for this is that it is possible for two nodes to represent the same concept. Placing the nodes into the sets directly would result in two concepts with the same name, when in fact we have a single concept used twice (recall from Sect. 4 that if two constructs represent the same thing then they have the same name). Thus, if we have two nodes v1 = (person, form) and v2 = (person, row) and they are classified as belonging to A, then only the label person is placed in A.

Definition 3. The set of abstract type labels A is defined as A = α1 ∪ . . . ∪ α9, where:
α1 = {l : l = λ(v) ∧ τ(v) = FORM}
α3 = {l : l = λ(v) ∧ τ(v) = ROW ∧ multicolumn(v)}
α4 = {l : l = λ(v) ∧ τ(v) = ROW ∧ singlecolumn(v) ∧ λ(v) ∈ α1 ∪ α2}
α5 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ multicolumn(v)}
α6 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ singlecolumn(v) ∧ λ(v) ∈ α1 ∪ α2}
α7 = {l : l = λ(v) ∧ τ(v) = TEXTBOX ∧ λ(v) ∈ α1 ∪ α2}
α8 = {l : l = λ(v) ∧ radiobuttongroup(v) ∧ λ(v) ∈ α1 ∪ α2}
α9 = {l : l = Σ(λ(v)) ∧ checkboxgroup(v) ∧ λ(v) ∈ α1 ∪ α2}

Definition 4. The set of lexical type labels is defined as Π = π1 ∪ . . . ∪ π6, where:
π1 = {l : l = λ(v) ∧ (τ(v) = TEXTBOX ∨ τ(v) = COLUMN)}
π2 = {l : l = λ(v) ∧ τ(v) = CHECKBOX ∧ ¬checkboxgroup(ψ(v))}
π3 = {l : l = λ(v) ∧ radiobuttongroup(v)}
π4 = {l : l = λ(v) ∧ τ(v) = ROW ∧ singlecolumn(v) ∧ (τ(ψ(v)) = LISTBOX ∨ τ(ψ(v)) = GRID) ∧ λ(v) ∉ A}
π5 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ singlecolumn(v) ∧ λ(v) ∉ A}
π6 = {l : l = Σ(λ(v)) ∧ checkboxgroup(v)}

Definition 5. The set of grouping labels is defined as Γ = γ1 ∪ γ2, where:
γ1 = {l : l = λ(v) ∧ checkboxgroup(v)}
γ2 = {l : l = λ(v) ∧ (τ(v) = LISTBOX ∨ τ(v) = GRID) ∧ |σ(v)| = 1 ∧ τ(σ(v)) = ROW}

The previous step resulted in sets of labels representing the nodes of the GSM schema. To connect the nodes, we must map the set of UI edges E, which connects nodes, to the set of GSM edges E, which connects node labels. More precisely, for every edge e = (m, n) ∈ E connecting two nodes m and n, we need an edge e′ = (λ(m), λ(n)) in E connecting the labels of the two nodes. To do this we form equivalence classes of V.

Definition 6. An equivalence class N(x) is defined as: N(x) = {y : y ∈ V ∧ λ(x) = λ(y)}

This results in an equivalence class for each distinct name appearing in V; thus each equivalence class represents a node in the GSM schema. To connect the GSM nodes we need to establish relationships between equivalence classes. We do this by taking a pair of equivalence classes and looking for elements from each of them that, when paired, occur as an edge in E. An edge (N(x), N(y)) in the GSM schema is given by:

(N(x), N(y)) ⟺ ∃x′ ∈ N(x) ∧ ∃y′ ∈ N(y) ∧ (x′, y′) ∈ E

It is possible, in a GSM schema derived using the above method, for a lexical type to exist which has several incoming arcs. It is uncommon, although not invalid, for such a situation to exist within a GSM schema. When this occurs we replace the lexical type with n copies of itself, where n is the number of arcs terminating at the lexical type. Figure 3a shows an example of this, with car and person both having arcs terminating at the lexical type name. Figure 3b shows the GSM schema after name has been split.
Fig. 3. (a) The GSM schema of the UI model instance in Fig. 2, and (b) the schema of (a) after the splitting of name
4.3 Example Transformation
If we return to the UI model instance in Fig. 2 we can define the sets V = {v1, . . . , v14} and E = {e1, . . . , e12}. Testing the elements of V against the membership conditions of A, Π and Γ gives:

A = {person, car}, Π = {name, reg, make}, Γ = {cars, ownedBy}

and we arrive at the following set of edges:

E = {e1 = (person, name), e2 = (person, cars), e3 = (car, reg), e4 = (car, make), e5 = (car, name), e6 = (car, ownedBy), e7 = (ownedBy, person), e8 = (cars, car)}

Figure 3a shows a graphical representation of the GSM schema derived from the UI model.
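The example transformation can also be traced mechanically. The sketch below (our illustration, simplified to the membership conditions this example actually exercises; it is not the authors' implementation) classifies the labels of the example instance and derives the GSM edges from the equivalence classes of Definition 6.

    FORM, TEXTBOX, LISTBOX, ROW, COLUMN = "FORM", "TEXTBOX", "LISTBOX", "ROW", "COLUMN"

    # vertex id -> (label, type); edges as (parent id, child id) pairs
    V = {1: ("person", FORM), 2: ("name", TEXTBOX), 3: ("cars", LISTBOX),
         4: ("car", ROW), 5: ("reg", COLUMN), 6: ("make", COLUMN),
         8: ("car", FORM), 9: ("reg", TEXTBOX), 10: ("make", TEXTBOX),
         11: ("name", TEXTBOX), 12: ("ownedBy", LISTBOX),
         13: ("person", ROW), 14: ("name", COLUMN)}
    E = [(1, 2), (1, 3), (3, 4), (4, 5), (4, 6),
         (8, 9), (8, 10), (8, 11), (8, 12), (12, 13), (13, 14)]

    def label(v): return V[v][0]
    def typ(v): return V[v][1]
    def children(v): return [c for p, c in E if p == v]

    # Abstract type labels: forms (alpha1), plus rows that have several
    # columns or that share a name with a form (alpha3/alpha4).
    A = {label(v) for v in V if typ(v) == FORM}
    A |= {label(v) for v in V if typ(v) == ROW
          and (len(children(v)) > 1 or label(v) in A)}

    # Lexical type labels: textboxes and columns (pi1) that are not abstract.
    Pi = {label(v) for v in V if typ(v) in (TEXTBOX, COLUMN)} - A

    # Grouping labels: listboxes whose single child is a row (gamma2).
    Gamma = {label(v) for v in V
             if typ(v) == LISTBOX and len(children(v)) == 1
             and typ(children(v)[0]) == ROW}

    # GSM edges connect equivalence classes (labels) whenever some UI edge does.
    gsm_edges = {(label(p), label(c)) for p, c in E}

    print(sorted(A))        # ['car', 'person']
    print(sorted(Pi))       # ['make', 'name', 'reg']
    print(sorted(Gamma))    # ['cars', 'ownedBy']
    print(sorted(gsm_edges))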
5 Prototype Tool
A prototype tool has been developed to test the transformations. It has been implemented mainly in Prolog and uses the work described in [7] to allow user interface model instances to be specified graphically. Output from the prototype is of a form that can be visualised using the XVCG tool [14]. Thus, the system operates completely in the graphical domain.
The system allows the user to specify a user interface model instance as an Xfig diagram, comprising a tree for each form in the user interface. This is converted into Prolog predicates using the visual parser [7]. The transformations are then applied to the predicates resulting in a GRL (graph representation language) specification of the GSM representation of the interface. Finally the XVCG tool is invoked on the GRL specification allowing the GSM schema to be visualised.
6 Applications
The work presented here has several applications in the design and development of information systems. The first is to provide an ‘informational’ consistency check to validate modifications, made by end-users, to interfaces generated using MB-UIDEs. To this end, we are currently exploring equivalence metrics to compare GSM schemata. A second application is as part of a reverse engineering toolkit. The transformations presented in this paper provide a suitable mechanism for translating the user interface of a forms-based application into a GSM schema. This initial model of the application's data can then be augmented and refined using other reverse engineering techniques. Such a tool would be of use in situations where there is no underlying database. A further application of our work is as the basis for a data modelling tool. In [16] Moody reports how novices found conventional data models difficult to work with and identified the need for more ‘user-friendly’ methods of data modelling. Embley [8] describes forms as being well understood [by their users], structuring information according to well-established and longstanding conventions. As part of our ongoing work we shall investigate the use of our transformations to provide a more friendly approach to data modelling.
7 Conclusions & Further Work
This paper has presented a formalism for the representation of forms in a graphical user interface environment and their transformation to semantic modelling concepts. In this work the formalism has been applied to database user interfaces, although it could be applied more generally to other forms-based interfaces. We make two extensions to the work of Abiteboul and Hull [2]. Firstly, our formalism takes into account the context in which the user interface controls have been used. Secondly, we are able to transform complete forms-based interfaces because we identify the network-like structuring of forms, and the multi-modal navigation between them, allowed by graphical user interface environments. This work represents the first step towards transforming network-structured forms interfaces to GSM schemata. In future work we intend to extend the mapping to include the transformation of data manipulation operations supported by the interface, thus providing a more complete mapping.
References

1. S. Abiteboul and R. Hull. IFO: A formal semantic database model. ACM Transactions on Database Systems, 12(4):525–565, 1987.
2. S. Abiteboul and R. Hull. Restructuring hierarchical database objects. Theoretical Computer Science, 62:3–38, 1988.
3. F. Bodart et al. Architecture elements for highly-interactive business-oriented applications. In L. Bass et al., editors, Lecture Notes in Computer Science, volume 753 of LNCS, pages 83–104. Springer-Verlag, 1993.
4. P. Buneman, S. Davidson, and A. Kosky. Theoretical aspects of schema merging. In A. Pirotte, C. Delobel, and G. Gottlob, editors, Advances in Database Technology, Proceedings of the Third International Conference on Extending Database Technology, volume 580 of LNCS, pages 152–167. Springer-Verlag, 1992.
5. S. K. Card, T. P. Moran, and A. Newell. The Psychology of Human-Computer Interaction. Lawrence Erlbaum, Hillsdale, NJ, 1983.
6. P. P. Chen. The entity-relationship model - towards a unified view of data. ACM Transactions on Database Systems, 1(1):9–36, 1976.
7. A. de Graaf. Levis: Lexical scanning for visual languages. Master's thesis, University of Leiden, The Netherlands, July 1996.
8. D. W. Embley. NFQL: The natural forms query language. ACM Transactions on Database Systems, 14(2):168–211, 1989.
9. T. R. G. Green and D. R. Benyon. The skull beneath the skin: entity-relationship models of information artifacts. International Journal of Human-Computer Studies, 44(6):801–828, 1996.
10. R. H. Güting, R. Zicari, and D. M. Choy. An algebra for structured office documents. ACM Transactions on Office Information Systems, 7(4):123–157, 1989.
11. H. R. Hartson and A. Dix. Toward empirically derived methodologies and tools for human-computer interface development. International Journal of Human-Computer Studies, 31:477–494, 1989.
12. R. Hull and R. King. Semantic database modelling: Survey, applications, and research issues. ACM Computing Surveys, 19(3):201–260, 1987.
13. C. Janssen, A. Weisbecker, and J. Ziegler. Generating user interfaces from data models and dialogue net specifications. In S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel, and T. White, editors, Proceedings of INTERCHI'93, pages 418–423, 1993.
14. I. Lemke and G. Sander. Visualization of compiler graphs. Technical Report D3.12.1-1, Universität des Saarlandes, FB 14 Informatik, 1993.
15. K. J. Mitchell, J. B. Kennedy, and P. J. Barclay. Using a conceptual data language to describe a database and its interface. In C. Goble and J. Keane, editors, Advances in Databases, Proceedings of the 13th British National Conference on Databases, volume 940 of LNCS, pages 79–100. Springer-Verlag, 1995.
16. D. L. Moody. Graphical entity-relationship models: Towards a more user understandable representation of data. In B. Thalheim, editor, Proceedings of the 15th International Conference on Conceptual Modelling, Cottbus, Germany, volume 1157 of LNCS, pages 227–244, 1996.
17. I. Petoud and Y. Pigneur. An automatic and visual approach for user interface design. In Engineering for Human-Computer Interaction, North-Holland, pages 403–420, 1990.
18. R. Rock-Evans. A Simple Introduction to Data and Activity Analysis. Computer Weekly Publications, 1989.
19. S. R. Rollinson and S. A. Roberts. A mechanism for automating database interface design, based on extended E-R modelling. In C. Small et al., editors, Advances in Databases, Proceedings of the 15th British National Conference on Databases, volume 1271 of LNCS, pages 133–134. Springer-Verlag, 1997.
20. M. W. W. Vermeer and P. M. G. Apers. Reverse engineering of relational database applications. In Proceedings of OO-ER'95, Fourteenth International Conference on Object-Oriented and Entity-Relationship Modelling, volume 1021 of LNCS, pages 89–100. Springer-Verlag, 1995.
A Constraints for Composing Interface Controls
1 All constructs must have a label; this includes the rows of grids, listboxes and comboboxes as well as the columns of each row.
2 All constructs must be placed on a form, with the exception of rows and columns.
3 Grid, listbox and combobox constructs can contain only row constructs.
4 Row constructs, when used as the row of a grid, can contain only column and combobox constructs, and must contain at least one column or combobox.
5 Row constructs, when used as the row of a list- or combobox, can contain only column constructs and must contain at least one column.
6 A single checkbox does not have to be grouped with a groupbox.
7 A radiobutton must be associated with at least one other radiobutton.
8 Two or more checkboxes/radiobuttons must be grouped using a groupbox.
9 A listbox must have an associated form that has a construct for at least every column in the listbox.
10 A combobox with more than one column must have an associated form that has a construct for at least every column of the combobox.
A Conceptual-Modeling Approach to Extracting Data from the Web

D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, Y.-K. Ng, D.W. Quass, R.D. Smith

Department of Computer Science and School of Accountancy and Information Systems
Brigham Young University, Provo, Utah 84602, U.S.A.
{embley,campbell,jiang,ng,smithr}@cs.byu.edu; {liddle,quass}@byu.edu
Abstract. Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Keywords: data extraction, data structuring, unstructured data, data-rich document, World-Wide Web, ontology, ontological conceptual modeling.
1 Introduction
The amount of data available on the Web has been growing explosively during the past few years. Users commonly retrieve this data by browsing and keyword searching, which are intuitive, but present severe limitations [2]. To retrieve Web data more efficiently, some researchers have resorted to ideas taken from database techniques. Databases, however, require structured data, and most Web data is unstructured and cannot be queried using traditional query languages. To attack this problem, various approaches for querying the Web have been suggested.
Research funded in part by Novell, Inc.
Research funded in part by Faneuil Research Group.
These techniques basically fall into one of two categories: querying the Web with Web query languages (e.g., [3]) and generating wrappers for Web pages (e.g., [4]). In this paper, we discuss an approach to extracting and structuring data from documents posted on the Web that differs markedly from those previously suggested. Our proposed data extraction method is based on conceptual modeling, and, as such, we also believe that this approach represents a new direction for research in conceptual modeling. Our approach specifically focuses on unstructured documents that are data rich and narrow in ontological breadth. A document is data rich if it has a number of identifiable constants such as dates, names, account numbers, ID numbers, part numbers, times, currency values, and so forth. A document is narrow in ontological breadth if we can describe its application domain with a relatively small ontology. Neither of these definitions is exact, but they express the idea that the kinds of Web documents we are considering have many constant values and have small, well-defined domains of interest.
Brian Fielding Frost

Our beloved Brian Fielding Frost, age 41, passed away Saturday morning, March 7, 1998, due to injuries sustained in an automobile accident. He was born August 4, 1956 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Jordan (9), Travis (8), Bryce (6); parents, three brothers, Donald Glade (Lynne), Kenneth Wesley (Ellen), Alex Reed, and two sisters, Anne (Dale) Elkins and Sally (Kent) Britton. A son, Michael Brian Frost, preceded him in death. Funeral services will be held at 12 noon Friday, March 13, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7 p.m. Thursday at Wasatch Lawn Mortuary, 3401 S. Highland Drive, and at the Stake Center from 10:45-11:45 a.m. Friday. Interment at Wasatch Lawn Memorial Park.

Fig. 1. A sample obituary.

As an example, the unstructured documents we have chosen for illustration in this paper are obituaries. Figure 1 shows an example.1 An obituary is data rich, typically including several constants such as name, age, death date, and birth date of the deceased person; a funeral date, time, and address; viewing and interment dates, times, and addresses; names of related people and family relationships. The information in an obituary is also narrow in ontological breadth, having data about a particular aspect of genealogical knowledge that can be described by a small ontological model instance.

Specifically, our approach consists of the following steps. (1) We develop the ontological model instance over the area of interest. (2) We parse this ontology to generate a database scheme and to generate rules for matching constants and keywords. (3) To obtain data from the Web, we invoke a record extractor that separates an unstructured Web document into individual
1 To protect individual privacy, this obituary is not real. It is based on an actual obituary, but it has been significantly changed so as not to reveal identities. Obituaries used in our experiment reported later in this paper are real, but only summary data and isolated occurrences of actual items of data are reported.
record-size chunks, cleans them by removing markup-language tags, and presents them as individual unstructured documents for further processing. (4) We invoke recognizers that use the matching rules generated by the parser to extract from the cleaned individual unstructured documents the objects and relationships expected to populate the model instance. (5) Finally, we populate the generated database scheme by using heuristics to determine which constants populate which records in the database scheme. These heuristics correlate extracted keywords with extracted constants and use cardinality constraints in the ontology to determine how to construct records and insert them into the database scheme. Once the data is extracted, we can query the structure using a standard database query language. To make our approach general, we fix the ontology parser, Web record extractor, keyword and constant recognizer, and database record generator; we change only the ontology as we move from one application domain to another. Thus, the effort required to apply our suggested technique to a new domain depends only on the effort required to construct a conceptual model for the new domain.

In an earlier paper [10], we presented some of these ideas for extracting and structuring data from unstructured documents. We also presented results of experiments we conducted on two different types of unstructured documents taken from the Web, namely, car ads and job ads. In those experiments, our approach attained recall ratios in the range of 90% and precision ratios near 98%. These results were very encouraging; however, the ontology we used was very narrow, essentially only allowing single constants or single sets of constants to be associated with a given item of interest (i.e., a car or a job).

In this paper we enrich the ontology (the conceptual model) and we choose an application that demands more attention to this richer ontology. For example, our earlier model supported only binary relationship sets, but our current approach supports n-ary relationship sets. Furthermore, we enhance the ontology in two significant ways. (1) We adopt "data frames" as a way to encapsulate the concept of a data item with all of its essential properties [8]. (2) We include lexicons to enrich our ability to recognize constants that are difficult to describe as simple patterns, such as names of people. Together, data frames and lexicons enrich the expressiveness of an ontological model instance. This paper also extends our earlier work by adding an automated tool for detecting and extracting unstructured records from HTML Web documents. We are thus able to fully automate the extraction process once we have identified a Web document from which we wish to extract data. Further enhancements are still needed to locate documents of interest with respect to the ontology and to handle sets of related documents that together provide the data for a given ontology. Nevertheless, the extensions we do add in this paper significantly enhance the approach presented earlier [10].
2 Related Work
Of the two approaches to extracting Web data (Web query languages and wrappers), the approach we take falls into the category of extracting data using
wrappers. A wrapper for extracting data from a text-based information source generally consists of two parts: (1) extracting attribute values from the text, and (2) composing the extracted values for attributes into complex data structures. Wrappers have been written either fully manually [5,11,12], or with some degree of automation [1,4,7,13,16]. The work on automating wrapper writing focuses primarily on using syntactic clues, such as HTML tags, to identify attribute values and to direct their extraction and composition. Our work differs fundamentally from this approach to wrapper writing because it focuses on conceptual modeling to identify and direct extraction and composition (although we do use syntactic clues to detect record boundaries in unstructured documents). In our approach, once the conceptual-model instance representing the application ontology has been written, wrapper generation is fully automatic.

A large body of research exists in the area of information extraction using natural-language understanding techniques [6]. The goal of these natural-language techniques is to extract conceptual information from the text through the use of lexicons identifying important keywords combined with sentence analysis. In comparison, our work does not attempt to extract such a deep level of understanding of the text, but neither does it depend upon complete sentences, as their work does. We believe our approach to be more appropriate for Web pages and classified ads, which often do not contain complete sentences.

The work closest to ours is [15]. In this work, the authors explain how they extract information from text-based data sources using a notion of "concept definition frames," which are similar to the "data frames" in our conceptual model. An advantage of our approach is that our conceptual model is richer, including, for example, cardinality constraints, which we use in the heuristics for composing extracted attribute values into object structures.
3 Web Data Extraction and Structuring
Figure 2 shows the overall process we use for extracting and structuring Web data. As depicted in the figure, the input (upper left) is a Web page, and the output (lower right) is a populated database. The figure also shows that the application ontology is an independent input. This ontology describes the application of interest. When we change applications, for example from car ads, to job ads, to obituaries, we change the ontology, and we apply the process to different Web pages. Significantly, everything else remains the same: the routines that extract records, parse the ontology, recognize constants and keywords, and generate the populated database instance do not change. In this way, we make the process generally applicable to any domain.
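As a schematic of the process in Fig. 2, the following toy sketch shows what flows where; every function body is a trivial stub of our own (a made-up matching rule, a made-up scheme, and <hr> assumed as record separator), not the actual components described in the sections that follow.

    import re

    def parse_ontology(ontology_text):
        # Stub: a single toy matching rule and a toy database description.
        matching_rules = {"Name": r"[A-Z][a-z]+ [A-Z][a-z]+"}
        db_description = {"Deceased": ["Name"]}
        return matching_rules, db_description

    def extract_records(web_page):
        # Stub record extractor: pretend <hr> was chosen as the separator.
        return [chunk for chunk in web_page.split("<hr>") if chunk.strip()]

    def recognize(record_text, matching_rules):
        # Build a descriptor/string/position table, sorted on begin position.
        table = [(descriptor, m.group(), m.start(), m.end() - 1)
                 for descriptor, pattern in matching_rules.items()
                 for m in re.finditer(pattern, record_text)]
        return sorted(table, key=lambda entry: entry[2])

    def generate_tuples(table, db_description):
        # Stub database-instance generator: no conflict-resolution heuristics.
        return [{"Name": value} for descriptor, value, begin, end in table
                if descriptor == "Name"]

    page = "<h4>Brian Frost ...</h4><hr><h4>Leonard Gunther ...</h4>"
    rules, scheme = parse_ontology("... application ontology ...")
    for record in extract_records(page):
        print(generate_tuples(recognize(record, rules), scheme))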
3.1 Ontological Specification
As Fig. 2 shows, the application ontology consists of an object-relationship model instance, data frames, and lexicons. An ontology parser takes all this information as input and produces constant/keyword matching rules and a database description as output.
Fig. 2. Data extraction and structuring process (application ontology with object-relationship model instance, data frames, and lexicons; ontology parser producing constant/keyword matching rules and a database description; record extractor producing unstructured record documents from a Web page; constant/keyword recognizer producing a data-record table; database-instance generator producing the populated database).
Figure 3 gives the object-relationship model instance for our obituary application in a graphical form. We use the Object-oriented Systems Model (OSM) [9] to describe our ontology. In OSM rectangles represent sets of objects. Dotted rectangles represent lexical object sets (those such as Age and Birth Date whose objects are strings that represent themselves), and solid rectangles represent nonlexical object sets (those such as Deceased Person and Viewing whose objects are object identifiers that represent nonlexical real-world entities). Lines connecting rectangles represent sets of relationships. Binary relationship sets have a verb phrase and reading-direction arrow (e.g., Funeral is on Funeral Date names the relationship set between Funeral and Funeral Date), and n-ary relationships have a diamond and a full descriptive name that includes the names of its connected object sets. Participation constraints near connection points between object and relationship sets designate the minimum and maximum number of times an object in the set participates in the relationship. In OSM a colon (:) after an object-set name (e.g., Birth Date: Date) denotes that the object set is a specialization (e.g., the set of objects in Birth Date is a subset of the set of objects in the implied Date object set).
Fig. 3. Sample object-relationship model instance.
For our ontologies, an object-relationship model instance gives both a global view (e.g., across all obituaries) and a local view (e.g., for a single obituary). We express the global view as previously explained and specialize it for a particular obituary by imposing additional constraints. We denote these specializing constraints in our notation by a "becomes" arrow (->). In Fig. 3, for example, the Deceased Person object set becomes a single object, as denoted by "-> •", and the 1..* participation constraint on both Deceased Name and Relative Name becomes 1. We thus declare in our ontology that an obituary is for one deceased person and that a name either identifies the deceased person or the family relationship of a relative of the deceased person. From these specializing constraints, we can also derive other facts about individual obituaries, such as that there is only one funeral and one interment, although there may be several viewings and several relatives. A model-equivalent language has been defined for OSM [14]. Thus, we can faithfully write any OSM model instance in an equivalent textual form. We use the textual representation for parsing. In the textual representation, we can determine whether an object set is lexical or nonlexical by whether it has an associated data frame that describes a set of possible strings as objects for the object set. In general a data frame describes everything we wish to know about an object set. If the data frame is for a lexical object set, it describes the string
patterns for its constants (member objects). Whether lexical or nonlexical, an associated data frame can describe context keywords that indicate the presence of an object in an object set. For example, we may have "died" or "passed away" as context keywords for Death Date and "buried" as a context keyword for Interment. A data frame for lexical object sets also defines conversion routines to and from a common representation and other applicable operations, but our main emphasis here is on recognizing constants and context keywords.

In Fig. 4 we show as examples part of the data frames for Name and Relative Name. A number in brackets designates the longest expected constant for the data frame; we use this number to generate upper bounds for "varchar" declarations in our database scheme. Inside a data frame we declare constant patterns, keyword patterns, and lexicons of constants. We can declare patterns to be case sensitive or case insensitive and switch back and forth as needed. We write all our patterns using Perl 5 regular expression syntax. The lexicons referenced in Name in Fig. 4 are external files consisting of a simple list of names: first.dict contains 16,167 first names from "aaren" to "zygmunt" and last.dict contains 16,522 last names from "aalders" to "zywiel". We use these lexicons in patterns by referring to them respectively as First and Last. Thus, for example, the first constant pattern in Name matches any one of the names in the first-name lexicon, followed by one or more white-space characters, followed by any one of the names in the last-name lexicon. The other pattern matches a string of letters starting with a capital letter (i.e., a first name, not necessarily in the lexicon), followed by white space, optionally followed by a capital-letter/period pair (a middle initial) and more white space, and finally a name in the last-name lexicon.

    ...
    Name matches [80] case sensitive
      constant
        { extract First, "\s+", Last; },
        ...
        { extract "[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?", Last; },
        ...
      lexicon
        { First case insensitive; filename "first.dict"; },
        { Last case insensitive; filename "last.dict"; };
    end;

    Relative Name matches [80] case sensitive
      constant
        { extract First, "\s*\(", First, "\)\s*", Last;
          substitute "\s*\([^)]*\)" -> ""; },
        ...
    end;
    ...

Fig. 4. Sample data frames.
The Relative Name data frame in Fig. 4 is a specialization of the Name data frame. In many obituaries, spouse names of blood relatives appear parenthetically inside names. In Fig. 1, for example, we find “Anne (Dale) Elkins”. Here,
Anne Elkins is the sister of the deceased, and Dale is the husband of Anne. To extract the name of the blood relative, the Relative Name data frame applies a substitution that discards the parenthesized name, if any, when it extracts a possible name of a relative. Besides extract and substitute, a data frame may also have context and filter clauses, which respectively tell us what context we must have for an extraction and what we filter out when we do the extraction.
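To make the role of the lexicons and patterns concrete, here is a small Python rendering of the Fig. 4 data frames (the paper's patterns are Perl 5; the tiny in-memory name lists, the alternation-based lexicon expansion, and the uniform case-insensitive matching are our own simplifications; real lexicons would be read from first.dict and last.dict).

    import re

    # Toy stand-ins for the first.dict and last.dict lexicons of Fig. 4.
    FIRST = ["brian", "susan", "anne", "dale"]
    LAST = ["frost", "fox", "elkins"]

    def lexicon(names):
        # Expand a lexicon into a regular-expression alternation.
        return "(?:" + "|".join(re.escape(n) for n in names) + ")"

    First, Last = lexicon(FIRST), lexicon(LAST)

    # The two constant patterns of the Name data frame, assembled from their pieces.
    name_patterns = [
        re.compile(First + r"\s+" + Last, re.IGNORECASE),
        re.compile(r"[A-Z][a-zA-Z]*\s+(?:[A-Z]\.\s+)?" + Last, re.IGNORECASE),
    ]
    print(bool(name_patterns[0].search("Susan Fox married ...")))   # True

    # The Relative Name pattern, with the substitute clause that drops a
    # parenthesized spouse name such as "(Dale)".
    relative_name = re.compile(First + r"\s*\(" + First + r"\)\s*" + Last,
                               re.IGNORECASE)

    def extract_relative_name(text):
        m = relative_name.search(text)
        return None if m is None else re.sub(r"\s*\([^)]*\)", "", m.group())

    print(extract_relative_name("two sisters, Anne (Dale) Elkins and Sally ..."))
    # prints: Anne Elkins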
3.2 Unstructured Record Extraction
As mentioned earlier, we leave for future work the problem of locating Web pages of interest and classifying them as a page containing exactly one record, a page containing many records, or a part of a group of pages containing one record. Assuming we have a page containing many records, we report here on our implementation of one possible approach to the problem of separating these records and feeding them one at a time to our data-extraction routines. The approach we take builds a tree of the page’s structure based on HTML, heuristically searches the tree for the subtree most likely to contain the records, and then heuristically finds the most likely separator among the siblings in this subtree of records. We explain the details in succeeding paragraphs. There are other approaches that may work as well (e.g., we can preclassify particular HTML tags as likely separators or match the given ontology against probable records), but we leave these for future work. HTML tags define regions within an HTML document. Based on the nested structure of start- and end-tags, we build a tree called a tag-tree. Figure 5(a) gives part of a sample obituary HTML document, and Fig. 5(b) gives its corresponding tag-tree. As Fig. 5(a) shows, the tag-pair - surrounds the entire document and thus html becomes the root of the tag-tree. Similarly, we have title nested within head, which is nested within html, and as a sibling of head we have body with its nested structure. The leaves nested within the
-
pair are the ordered sequence of sibling nodes h1, h4, hr, h4, ... . A node in a tag-tree has two fields: (1) the first tag of each start-tag/end-tag pair or a lone tag (when there is no closing tag), and (2) the associated text. We do not show the text in Fig. 5(b), but, for example, the text field for the title node is “Classifieds” and the text field for the first h4 field following the first ellipsis in the leaves is the obituary for Brian Fielding Frost. Using the tag-tree, we find the subtree with the largest fan-out—td in Fig. 5(b). For documents with many records of interest, the subtree with the largest fan-out should contain these records; other subtrees represent global headers or trailers. To find the record separators within the highest fan-out subtree, we begin by counting the number of appearances of each sibling tag below the root node of the subtree (the number of appearances of h1, h4, and hr for our example). We ignore tags with relatively few appearances (h1 in our example) and concentrate on dominant tags, tags with many appearances (h4 and hr in our example). For the dominant tags, we apply two heuristics: a Most-Appearance (MA) heuristic and a Standard-Deviation (SD) heuristic. If there is only one dominant tag, the MA heuristic selects it as the separator. If there are several dominant tags, the
[Fig. 5. An HTML document and its tag-tree: (a) a sample obituary HTML document ("Classifieds", "Funeral Notices", obituaries for Lemar K. Adamson, Brian Fielding Frost, Leonard Kenneth Gunther, ..., "All material is copyrighted."); (b) the corresponding tag-tree, rooted at html, with a head/title branch and a body/table/tr/td branch whose leaf siblings are h1, h4, hr, h4, hr, ..., h4, hr, h4, hr, ..., h4.]
MA heuristic checks to see whether the dominant tags all have the same number of appearances or are within one of having the same number of appearances. If so, the MA heuristic selects any one of the dominant tags as the separator. If not, we apply the SD heuristic. For the SD heuristic, we first find the length of each text segment between identical dominant tags (e.g., the lengths of the text segments between each successive pair of hr tags and between each successive pair of h4 tags). We then calculate the standard deviation of these lengths for each tag. Since the records of interest often all have approximately the same length, we choose the tag with the least standard deviation to be the separator. Once we know the separator, it is easy to separate the unstructured records and feed them individually to downstream processes.
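A minimal sketch of how the MA and SD heuristics might be combined to choose a separator tag is given below. The (tag, text) representation of the subtree's siblings and the 50% dominance threshold are our own illustrative assumptions, not the authors' implementation.

```python
import statistics
from collections import Counter

def choose_separator(siblings):
    """Choose a record-separator tag among the sibling (tag, text) pairs of the
    highest fan-out subtree, following the MA and SD heuristics described above."""
    counts = Counter(tag for tag, _ in siblings)
    # Dominant tags: relatively many appearances (threshold is an assumption).
    dominant = [t for t, c in counts.items() if c >= 0.5 * max(counts.values())]

    # MA heuristic: a single dominant tag, or all dominant tags within one
    # appearance of each other -- pick any of them.
    if len(dominant) == 1:
        return dominant[0]
    if max(counts[t] for t in dominant) - min(counts[t] for t in dominant) <= 1:
        return dominant[0]

    # SD heuristic: prefer the tag whose between-occurrence text segments have
    # the most uniform length (records tend to be roughly the same size).
    def segment_length_sd(tag):
        lengths, current = [], None
        for t, text in siblings:
            if t == tag:
                if current is not None:
                    lengths.append(len(current))
                current = text          # start a new segment at this occurrence
            elif current is not None:
                current += text
        return statistics.pstdev(lengths) if len(lengths) > 1 else float("inf")

    return min(dominant, key=segment_length_sd)
```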
3.3 Database Record Generation
With the output of the ontology parser and the record extractor in hand, we proceed with the problem of populating the database. To populate the database, we iterate over two basic steps for each unstructured record document. (1) We produce a descriptor/string/position table consisting of constants and keywords recognized in the unstructured record. (2) Based on this table, we match attributes with values and construct database tuples. As Fig. 2 shows, the constant/keyword recognizer applies the generated matching rules to an unstructured record document to produce a data-record table. Figure 6 gives the first several lines of the data-record table produced from our sample obituary in Fig. 1. Each entry (a line in the table) describes either a constant or a keyword. We separate the fields of an entry by a bar (|). The first field is a descriptor: for constants the descriptor is an object-set name to which the constant may belong, and for keywords the descriptor is KEYWORD(x) where x is an object-set name to which the keyword may apply. The second field is the constant or keyword found in the document, possibly transformed as it is extracted according to substitution rules provided in a data frame. The last two fields give the position as the beginning and ending character count for the first and last characters of the recognized constant or keyword.
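Read as data, each bar-separated line of this table is straightforward to parse. A tiny sketch follows; the sample entry and its character positions are invented for illustration, since Fig. 6 is not reproduced here.

```python
def parse_entry(line):
    """Parse one bar-separated line of the data-record table into
    (descriptor, string, start, end), with the field meanings described above."""
    descriptor, string, start, end = line.split("|")
    return descriptor, string, int(start), int(end)

print(parse_entry("KEYWORD(DeathDate)|passed away|370|380"))
# -> ('KEYWORD(DeathDate)', 'passed away', 370, 380)
```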
To facilitate later processing, we sort this table on the third field, the beginning character position of the recognized constant or keyword. A careful consideration of Fig. 6 reveals some interesting insights into the recognition of constants and keywords and also into the processing required by the database-instance generator. Notice in the first four lines, for example, that the string “Brian Fielding Frost” is the same and that it could either be the name of the deceased or the name of a relative of the deceased. To determine which one, we must heuristically resolve this conflict. Since there is no keyword here for Deceased Person, no keyword directly resolves this conflict for us. However, we know that the important item in a record is almost always introduced at the beginning, a strong indication that the name is the name of the deceased, not the name of one of the deceased’s relatives. More formally, since the constraints on DeceasedName within a record require a one-to-one correspondence between DeceasedName and DeceasedPerson and since DeceasedName is not optional, the first name that appears is almost assuredly the name of the deceased person. Keyword resolution of conflicts is common. In Fig. 6, for example, consider the resolution of the death date and the birth date. Since the various dates are all specializations of Date, a particular date, without context, could be any one of the different dates (e.g., “March 7, 1998” might be any one of five possible
kinds of date). Notice, however, that "passed away", a keyword for DeathDate, is only 20 characters away from the beginning of "March 7, 1998", giving a strong indication that it is the death date. Similarly, "born", a keyword for BirthDate, is within two characters of "August 4, 1956". Keyword proximity easily resolves these conflicts for us. Continuing with one more example, consider the phrase "born August 4, 1956 in Salt Lake City, to", which is particularly interesting. Observe in Fig. 6 that the recognizer tags this phrase as a keyword for Relationship and also in the next line as a constant for Relationship, with "parent" substituted for the longer phrase. The regular expression that the recognizer uses for this phrase matches "born to" with any number of intervening characters. Since we have specified in our Relationship data frame that "born to" is a keyword for a family relationship and is also a possible constant value for the Relationship object set, with the substitution "parent", we emit both lines as shown in Fig. 6. Observe further that we have "parent" close by (two characters away from) the beginning of the name Donald Fielding and close by (twenty-two characters away from) the beginning of the name Helen Glade Frost, which are indeed the parents of the deceased.
The database-instance generator takes the data-record table as input along with a description of the database and constructs tuples for the extracted raw data. The heuristics applied in the database-instance generator are motivated by observations about the constraints in the record-level description. We classify these constraint-based heuristics as singleton heuristics, functional-group heuristics, and nested-group heuristics.
– Singleton Heuristics. For values that should appear at most once, we use keyword proximity to find the best match, if any, for the value (e.g., we match DeathDate with "March 7, 1998" and BirthDate with "August 4, 1956" as explained earlier; a small sketch of this matching appears at the end of this subsection). For values that must appear at least once, if keyword proximity fails to find a match, we choose the first appearance of a constant belonging to the object set whose value must appear. If no such value appears, we reject the record. For our ontology, only the name of the deceased must be found.
– Functional-Group Heuristics. An object set whose objects can appear several times, along with its functionally dependent object sets, constitutes a functional group. In our sample ontology, Viewing and its functionally dependent attributes constitute such a group. Keywords that do not pertain to the item of interest provide boundaries for context switches. For our example (see Fig. 1), we have a Funeral context before the viewing information and an Interment context after the viewing information. Within this context we search for ViewingDate / ViewingAddress / BeginningTime / EndingTime groups.
– Nested-Group Heuristics. We use nested-group heuristics to process n-ary relationship sets (for n > 2). Writers often produce these groups by a nesting structure in which one value is given followed by its associated values, which may be nested, and so forth. Indeed, the obituaries we considered consistently
follow this pattern. In Fig. 1 we see “sons” followed by “Jordan”, “Travis”, and “Bryce”; “brothers” followed by “Donald”, “Kenneth”, and “Alex”; and “sisters” followed by “Anne” and “Sally”. The result of applying these heuristics to an unstructured obituary record is a set of generated SQL insert statements. When we applied our extraction process to the obituary in Fig. 1, the values extracted were quite accurate, but not perfect. For example, we missed the second viewing address, which happens to have been correctly inserted as the funeral address, but not also as the viewing address for the second viewing. Our implementation currently does not allow constants to be inserted in two different places, but we plan to have future implementations allow for this possibility. Also, we obtained neither of the viewing dates, both of which can be inferred from “Thursday” and “Friday” in the obituary. We also did not obtain the full name for some of the relatives, such as sons of the deceased, which can be inferred by common rules for family names. At this point our implementation only finds constants that actually appear in the document. In future implementations, we would like to add procedures to our data frames to do the calculations and inferences needed to obtain better results.
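As promised above, here is a small sketch of the keyword-proximity matching used by the singleton heuristics. The data-record entries and character positions are invented for illustration, and the proximity measure is our own simplification of what the system does.

```python
def nearest_value(entries, keyword_for, value_set):
    """Among recognized constants of `value_set` (e.g. "Date"), pick the one
    closest to a keyword for the target object set (e.g. KEYWORD(DeathDate)).
    `entries` are (descriptor, string, start, end) rows of the data-record table."""
    keywords = [e for e in entries if e[0] == f"KEYWORD({keyword_for})"]
    values = [e for e in entries if e[0] == value_set]
    best, best_dist = None, None
    for _, v, vs, ve in values:
        for _, _, ks, ke in keywords:
            dist = min(abs(vs - ke), abs(ks - ve))   # character proximity
            if best_dist is None or dist < best_dist:
                best, best_dist = v, dist
    return best

table = [
    ("KEYWORD(DeathDate)", "passed away", 370, 380),
    ("Date", "March 7, 1998", 401, 413),
    ("KEYWORD(BirthDate)", "born", 430, 433),
    ("Date", "August 4, 1956", 435, 448),
]
print(nearest_value(table, "DeathDate", "Date"))   # -> 'March 7, 1998'
```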
4 Results
For our test data, we took 38 obituaries from a Web page provided by the Salt Lake Tribune (www.sltrib.com) and 90 obituaries from a Web page provided by the Arizona Daily Star (www.azstarnet.com). When we ran our extraction processor on these obituaries, we obtained the results in Table 1 for the Salt Lake Tribune and in Table 2 for the Arizona Daily Star. As Tables 1 and 2 show, we counted the number of facts (attribute-value pairs) in the test-set documents. Consistent with our implementation, which only extracts explicit constants, we counted a string as being correct if we extracted the constant as it appeared in the text. With this understanding, counting was basically straightforward. For names, however, we often only obtained partial names. Because our name lexicon was incomplete and our name-extraction expressions were not as rich as possible, we sometimes missed part of a name or split a single name into two. We list the count for these cases after the + in the Declared Correctly column. We noted that this also accounted for most of the incorrectly identified relatives. With a more accurate and complete lexicon and with richer name-extraction expressions, we believe that we could achieve much higher precision.
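Assuming the usual definitions (the tables themselves do not spell them out), the reported ratios can be read as:

```latex
\text{recall} \;=\; \frac{\#\,\text{facts declared correctly (incl.\ partially correct)}}{\#\,\text{facts in source}},
\qquad
\text{precision} \;=\; \frac{\#\,\text{facts declared correctly}}{\#\,\text{facts declared correctly} \;+\; \#\,\text{facts declared incorrectly}}
```

For example, under this reading the RelativeName row of Table 1 gives a recall of (322 + 75)/453 ≈ 88%.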
5 Conclusions
We described a conceptual-modeling approach to extracting and structuring data from the Web. A conceptual model instance, which we called an ontology, provides the relationships among the objects of interest, the cardinality constraints
Table 1. Salt Lake Tribune Obituaries

Fact              Facts in Source   Declared Correctly (+ Partially Correct)   Declared Incorrectly   Recall Ratio
DeceasedPerson    38                38                                          0                      100%
DeceasedName      38                23+15                                       0                      100%
Age               22                20                                          1                      91%
BirthDate         30                30                                          1                      100%
DeathDate         33                31                                          0                      94%
FuneralDate       24                22                                          0                      92%
FuneralAddress    25                24                                          1                      96%
FuneralTime       29                28                                          0                      97%
IntermentDate     0                 0                                           0                      NA
IntermentAddress  4                 4                                           0                      100%
Viewing           29                27                                          1                      93%
ViewingDate       10                7                                           0                      70%
ViewingAddress    17                13                                          0                      76%
BeginningTime     32                28                                          0                      88%
EndingTime        29                26                                          0                      90%
Relationship      453               359+9                                       29                     81%
RelativeName      453               322+75                                      159                    88%

[The Precision Ratio column of Table 1 is not legible in this copy.]
for these relationships, a description of the possible strings that can populate various sets of objects, and possible context keywords expected to help match values with object sets. To prepare unstructured documents for comparison with the ontology, we also proposed a means to identify the records of interest on a Web page. With the ontology and record extractor in place, we were able to extract records automatically and feed them one at a time to a processor that heuristically matched them with the ontology and populated a database with the extracted data.
The results we obtained for our obituary example are encouraging. Because of the richness of the ontology, we had initially expected much lower recall and precision ratios. Achieving about 90% recall and 75% precision for names and 95% precision elsewhere was a pleasant surprise.
Information Coupling in Web Databases*
Sourav S. Bhowmick, Wee-Keong Ng, and Ee-Peng Lim
Center for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798, SINGAPORE
{sourav,wkn,aseplim}@cais.ntu.edu.sg
Abstract. Web information coupling refers to an association of topically related web documents. This coupling is initiated explicitly by a user in a web warehouse specially designed for web information. Web information coupling provides the means to derive additional, useful information from the WWW. In this paper, we discuss and show how two web operators, i.e., global web coupling and local web coupling, are used to associate related web information from the WWW and also from multiple web tables in a web warehouse. This paper discusses various issues in web coupling such as coupling semantics, coupling-compatibility, and coupling evaluation.
1 Introduction
Given the high rate of growth of the volume of data available on the WWW, locating information of interest in such an anarchic setting becomes a more difficult task every day. Thus, there is a pressing need for effective and efficient tools for information consumers, who must be able to easily locate and manipulate information on the Web. Currently, web information may be discovered primarily by two mechanisms: browsers and search engines. This form of information access on the Web has a few shortcomings:
• While web browsers fully exploit hyperlinks among web pages, search engines have so far made little progress in exploiting link information. Not only do most search engines fail to support queries on the Web utilizing link information, they also fail to return link information as part of a query's result.
• From the query results returned by search engines, a user may wish to couple a set of related Web documents together for reference. Presently, he may only do so manually by visiting and downloading these documents as files on the user's hard disk. However, this method is tedious, and it does not allow the user to retain the coupling framework.
* This work was supported in part by the Nanyang Technological University, Ministry of Education (Singapore) under Academic Research Fund #4-12034-5060, #4-12034-3012, #4-12034-6022. Any opinions, findings, and recommendations in this paper are those of the authors and do not reflect the views of the funding agencies.
[Fig. 1. Coupling framework (query graph) of 'Symptoms': starting from http://www.virtualdisease.com/ (node x), an 'Issues' page y with links labelled 'Symptoms' (e) to node z and 'Treatment' (f) to node w.]
• The set of downloaded documents can be refreshed (or updated) only by repeating the above procedure manually.
• If a user successfully coupled a set of Web documents together, he may wish to know if there are other Web documents satisfying the same coupling framework. Presently, the only way is to request the same or other search engines for further Web documents and probe these documents manually.
• Over a period of time, there will be a number of coupled collections of Web documents created by the user. As each of these collections exists simply as a set of files on the user's system, there is no convenient way to organize, manage and infer further useful information from them.
In this paper, we introduce the concept of Web Information Coupling (WIC) to help overcome the limitations of present search engines. WIC enables us to efficiently manage and manipulate coupled information extracted from the Web. We use coupling because it is a convenient way to relate information located separately on the WWW. In this paper, we discuss two types of coupling: global and local web coupling. Global coupling enables a user to retrieve a set of collections of inter-related documents satisfying a coupling framework regardless of the locations of the documents in the Web. To initiate global coupling, a user specifies the coupling framework in the form of a query graph. The actual coupling is performed by the WIC system and is transparent to the user. The result of such user-driven coupling is a set of related documents materialized in the form of a web table. Thus, global web coupling eliminates the problem of manually visiting and downloading Web documents as files on the user's hard disk. Coupling is not limited to the extraction of related information directly from the WWW. Local coupling can be performed on web tables [15] materialized by global coupling. This form of web coupling is achieved locally without resorting to the WWW. Given two web tables, local coupling is initiated explicitly by specifying a pair(s) of web documents and a set of keyword(s) to relate them. The result of local web coupling is a web table consisting of a set of collections of inter-related Web documents from the two input tables. The following example briefly illustrates global and local web coupling.
Example 1. Suppose Bill wishes to find a list of diseases with their symptoms and treatments, and a list of drugs and their side effects on diseases, on the WWW. Assume that there are web sites at http://www.virtualdisease.com/
[Fig. 2. Coupling framework (query graph) of 'Drug list': starting from http://www.virtualdrug.com/ (node a), a 'Drug List' page b leads to an 'Issues' page c with a link labelled 'Side effects' (t) to node d.]
[Fig. 3. Partial view of the 'Symptoms' web table: web tuples for Cancer, Breast Cancer, Diabetes, and AIDS, each linking an issues page to its symptoms and treatment pages.]
and http://www.virtualdrug.com/ which integrate disease-related and drug-related information from various web sites respectively. Bill figured that there could be hyperlinks with anchor labels 'symptoms' and 'treatments' in the web site at http://www.virtualdisease.com/ and labels 'side effects' in the web site at http://www.virtualdrug.com/ that might be useful. In order to initiate global web coupling (i.e., to couple this related information from the WWW), Bill constructs coupling frameworks (query graphs) as shown in Figs. 1 and 2. The global web coupling operator is applied to retrieve those sets of related documents that match the coupling frameworks. Each set of inter-linked documents retrieved for a coupling framework is a connected, directed graph (also called a web tuple) and is materialized in the web tables Symptoms and Drug list respectively. A small portion of these web tables is shown in Figs. 3 and 4. Each web tuple in Symptoms and Drug list contains information about the symptoms and treatments of a particular disease, and the side effects of a drug on the disease, respectively. Suppose a user wants to extract information related to the symptoms and treatments of cancer and AIDS, and a list of drugs with their side effects on them. Clearly, this information is already stored in the tables Symptoms and Drug list.
[Fig. 4. Partial view of the 'Drug list' web table: web tuples for drugs such as Beta Carotomel, Docetaxel, Anastrozole, and Indavir, each linking a drug-list page to the drug's issues page and its side-effects page.]
The local web coupling operator enables us to extract this related information from the two web tables. A user may indicate the documents (say y and b) in the coupling frameworks of Symptoms and Drug list and the keywords (in this case "cancer" and "AIDS") based on which local web coupling will be performed. A portion of the coupled web table is shown in Fig. 5. A Web Information Coupling (WIC) system is a database system for managing and manipulating coupled information extracted from the Web. To realize this system, we first propose a data model called the Web Information Coupling Model (WICM) to describe and abstract web objects. We then introduce the operators to perform global and local coupling.
2 Web Information Coupling Model
We proposed a data model for a web warehouse in [5,15]. The data model consists of a hierarchy of web objects. The fundamental objects are Nodes and Links. Nodes correspond to HTML or plain text documents and links correspond to hyper-links interconnecting the documents in the World Wide Web. We define a Node type and a Link type to refer to these two sets of distinct objects. These objects consist of a set of attributes as shown below: Node = [url, title, format, size, date, text] Link = [source-url, target-url, label, link-type]
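A minimal sketch of these two object types follows; the attribute names mirror the model, but the Python rendering and the field types are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Attributes of the Node type in WICM.
    url: str
    title: str
    format: str
    size: int
    date: str
    text: str

@dataclass
class Link:
    # Attributes of the Link type in WICM.
    source_url: str
    target_url: str
    label: str
    link_type: str
```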
WICM supports structured or topological querying; different sets of keywords may be specified on the nodes and additional criteria may be defined for the hyperlinks among the nodes. Thus, the query is a graph-like structure and is used to match portions of the WWW satisfying the conditions. In this way, the query result is a set of directed graphs (called web tuples) instantiating the query graph. Formally, a web tuple w = ⟨Nw, Lw, Vw⟩ is a triplet where Nw is a set of nodes in web tuple w, Lw is a set of links in web tuple w, and Vw is the set of connectivities (next section). A collection of these web tuples is called a
web table. If the web table is materialized, we associate a name with the table. The web schema of the web table is the query graph that is used to derive the table. It is defined as a 4-tuple M = ⟨Xn, Xℓ, C, P⟩ where Xn is a set of node variables, Xℓ is a set of link variables, C is a set of connectivities in DNF, and P is a set of predicates in DNF. A set of web tables constitutes a web database. We illustrate the concept of web schema with the following examples. Consider the query graphs (Figs. 1 and 2) in Example 1. The schemas of these query graphs are given below:

Example 2. Produce a list of diseases with their symptoms and treatments, starting from the web site at http://www.virtualdisease.com/. We may express the schema of the above query by Mi = ⟨Xi,n, Xi,ℓ, Ci, Pi⟩ where Xi,n = {x, y, z, w}, Xi,ℓ = {e, f, -}, Ci ≡ ki1 ∧ ki2 ∧ ki3 such that ki1 = x⟨-⟩y, ki2 = y⟨e⟩z, ki3 = y⟨f⟩w, and Pi ≡ pi1 ∧ pi2 ∧ pi3 ∧ pi4 ∧ pi5 ∧ pi6 such that pi1(x) ≡ [x.url EQUALS "http://www.virtualdisease.com/"], pi2(y) ≡ [y.title CONTAINS "issues"], pi3(e) ≡ [e.label CONTAINS "symptoms"], pi4(z) ≡ [z.title CONTAINS "symptoms"], pi5(f) ≡ [f.label CONTAINS "treatments"], pi6(w) ≡ [w.title CONTAINS "treatments"].

Example 3. Produce a list of drugs and their side effects, starting from the web site at http://www.virtualdrug.com/. The schema of the above query is Mj = ⟨Xj,n, Xj,ℓ, Cj, Pj⟩ where Xj,n = {a, b, c, d}, Xj,ℓ = {t, -}, Cj ≡ kj1 ∧ kj2 ∧ kj3 such that kj1 = a⟨-⟩b, kj2 = b⟨-⟩c, kj3 = c⟨t⟩d, and Pj ≡ pj1 ∧ pj2 ∧ pj3 ∧ pj4 ∧ pj5 such that pj1(a) ≡ [a.url EQUALS "http://www.virtualdrug.com/"], pj2(b) ≡ [b.title CONTAINS "Drug List"], pj3(c) ≡ [c.title CONTAINS "Issues"], pj4(d) ≡ [d.title CONTAINS "side effects"], pj5(t) ≡ [t.label CONTAINS "side effects"].

The query graphs (web schemas) described in Examples 2 and 3 express Bill's need to extract a set of inter-linked documents related to the symptoms and treatments of diseases, and the side effects of drugs on these diseases, from the WWW. Since conventional search engines cannot accept a query graph as input and return the inter-linked documents as the query result, a global web coupling operator is required. The global web coupling operator matches those portions of the WWW that satisfy the query graphs. The result of global web coupling is a collection of sets of related Web documents materialized in the form of a web table. Although global web coupling retrieves data directly from the WWW, the full potential of web coupling lies in the fact that it can couple related information residing in two different web tables in a web database. Suppose Bill wishes to know the symptoms and treatments associated with cancer and AIDS, and a list of drugs with their side effects on them. There are two methods in a web database to gather the composite information:
[Fig. 5. Web coupling: partial view of the coupled web table, in which web tuples from 'Symptoms' and 'Drug list' concerning the same disease (e.g., cancer or AIDS) are combined.]
1. Bill may construct a new web query for this purpose. The disadvantage of this method is that the information (stored in web tables) created by the queries in Examples 2 and 3 is not being used for this query.
2. Browse the web tables of the queries in Examples 2 and 3, select those tuples containing information related to cancer and AIDS, and then compare the results manually. However, there may be many matching web tuples, making the user's task of going over them tedious.
This motivates us to design a local web coupling operator that allows us to gather related information from the two web tables in a web database.
3 Global Web Coupling
In this section, we discuss global web coupling, a mechanism to couple related information from the WWW. We begin by formally defining the global web coupling operator. Next we explain how a coupled web table is created.
3.1 Definition
The global web coupling operator Γ takes in a query (expressed as a schema M) and extracts a set of web tuples from the WWW satisfying the schema. Let Wg be the resultant table; then Wg = Γ(M). Each web tuple matches a portion of the WWW satisfying the conditions described in the schema. These related web tuples are coupled together and stored in a web table. Each web tuple in the web table is structurally identical to the schema of the table. Some computability issues arise when applying the global web coupling operator to the WWW. The global web coupling operator is bound if and only if all
variables that begin a connectivity in the schema specified for the operator are bound. A query which embeds a bound Γ operator is always computable. Let us see why. Suppose a web query with schema M is posed against the WWW, i.e., we wish to compute Γ(M). Intuitively, the Γ operator is evaluable when there are starting points in the WWW from which we can begin our search. With current web technology, there are two methods to locate a web resource; we either know its URL and access the resource directly, or we go through a search engine by supplying keywords to obtain the URLs. Let x be a node variable; then a predicate such as [x.url EQUALS "a-url-here"] in a query allows us to use the URL specified to locate the document corresponding to x. The second method is embedded by predicates such as [x.text CONTAINS "some-keywords"], [x.title EQUALS "a-title-here"], and [e.label CONTAINS "some-keywords"]. Here, x and e are the bound variables. When a node or link variable is bound, we can access the resource it corresponds to either directly or through a web search engine. Variables that begin connectivities and are bound provide the starting point in the WWW for retrieving web tuples. Hence, queries with such variables are computable.
3.2 Web Table Creation
We now discuss briefly how to create the coupled web table. Given a web schema (query graph), Γ extracts a set of web tuples satisfying the query graph. Our approach to determine the set of web tuples from the WWW is as follows:
1. Check if the given web schema is computable. If it is, then obtain a set of URL(s) as the starting point of traversal by analyzing the predicates in the schema.
2. Get the node variables representing these start URL(s) and identify connectivities which contain the start node variables. Note that the start node variable will always be on the left-hand side of a connectivity.
3. Download the documents from the WWW that satisfy the predicates for the nodes and that contain links that satisfy the link predicates for the outgoing edges of this node.
4. Get the web documents (nodes) pointed to by the links and check whether these documents satisfy the predicates of the node in the schema. Repeat this until we reach the right-hand side of the connectivity.
5. Repeat the above two steps for all the connectivities in the schema.
6. Once all the web documents are collected by the above procedure, create individual web tuples by matching the set of nodes and links with the schema.
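The steps above can be read as a guided traversal. The sketch below is only an approximation: it handles a chain-shaped schema, leaves out full tuple assembly for branching connectivities (step 6), and the fetch/links_of/holds helpers are assumed rather than part of the WIC system.

```python
def global_web_coupling(start_url, path, fetch, links_of, holds):
    """Sketch of the traversal in steps 1-6 for a chain-shaped schema.

    `path` is a list of (link_predicate, node_predicate) pairs derived from the
    schema's connectivities; `fetch` downloads a document, `links_of` lists a
    document's outgoing links, and `holds` evaluates a predicate."""
    start = fetch(start_url)                 # steps 1-2: start from a bound URL
    tuples = [[start]]
    for link_pred, node_pred in path:        # steps 3-5: walk each connectivity
        extended = []
        for t in tuples:
            for link in links_of(t[-1]):
                if not holds(link_pred, link):
                    continue
                doc = fetch(link.target_url)
                if holds(node_pred, doc):
                    extended.append(t + [link, doc])
        tuples = extended
    return tuples                            # each list holds one tuple's nodes and links
```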
4 Local Web Coupling
Once we have the ability to couple useful information directly from the WWW using the global web coupling operator, we need to introduce an additional operator to facilitate local web coupling, i.e., extracting useful information locally from two web tables.
[Fig. 6. Web cartesian product: partial view of the web table obtained by concatenating each web tuple of 'Symptoms' with each web tuple of 'Drug list', including combinations (e.g., a cancer drug tuple paired with the diabetes or AIDS tuple) whose coupling nodes are unrelated.]
4.1 Definition
The local web coupling operator combines two web tables by integrating web tuples of one web table with web tuples of another table whenever there exist coupling nodes. Let Wi and Wj be two web tables with schemas Mi = ⟨Xi,n, Xi,ℓ, Ci, Pi⟩ and Mj = ⟨Xj,n, Xj,ℓ, Cj, Pj⟩ respectively. Suppose we want to couple Wi and Wj on node variables nci and ncj as they both contain information about diseases, and we want to correlate web tuples of Wi and Wj related to cancer. Let wi and wj be two web tuples from Wi and Wj respectively, and nc(wi) and nc(wj) be instances of nci and ncj respectively. Suppose documents at http://www.virtualdisease.com/cancer/index.html (represented by node nc(wi)) and http://www.virtualdrug.com/cancerdrugs/index.html (represented by node nc(wj)) respectively contain information related to cancer and appear in wi and wj respectively. Tuples wi and wj are coupling-compatible locally on nc(wi) and nc(wj) since they both contain similar information (information related to cancer). Thus, the coupling nodes are nc(wi) and nc(wj). We store the coupled web tuple in a separate web table. Note that the coupling-compatibility of two web tuples depends on the pair(s) of node variables and keyword(s) specified explicitly by the user in the local coupling query. We now formally define coupling-compatibility.
Definition 1. Let K(n, w, W) denote the set of keywords appearing in a web document (represented by node n) in web tuple w of web table W. Two web tuples wi and wj of web tables Wi and Wj are coupling-compatible locally on the node pair (nc(wi), nc(wj)) based on some keyword set Kc if and only if the following conditions are true: nc(wi) ∈ Nwi, nc(wj) ∈ Nwj, Kc ⊆ K(nc(wi), wi, Wi) and Kc ⊆ K(nc(wj), wj, Wj).
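Definition 1 amounts to a keyword-set containment test. A minimal sketch follows; the keywords_of helper, which plays the role of K(n, w, W), is an assumed function rather than part of the model.

```python
def coupling_compatible(node_i, node_j, coupling_keywords, keywords_of):
    """Definition 1 as a predicate: the tuples are coupling-compatible locally on
    (node_i, node_j) with respect to keyword set Kc iff Kc is contained in the
    keyword sets of both documents."""
    kc = set(coupling_keywords)
    return kc <= keywords_of(node_i) and kc <= keywords_of(node_j)
```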
The new web tuple w derived from the coupling of wi and wj is defined as: Nw = Nwi ∪ Nwj, Lw = Lwi ∪ Lwj, and Vw = Vwi ∪ Vwj. We express local web coupling between two web tables as follows:

W = Wi ⊗({⟨node pair⟩, ⟨keyword(s)⟩}) Wj

where Wi and Wj are the two web tables participating in the coupling operation and W is the coupled web table created by the coupling operation, satisfying schema M = ⟨Xn, Xℓ, C, P⟩. In this case, ⟨node pair⟩ specifies a pair of coupling node variables from Wi and Wj, and ⟨keyword(s)⟩ specifies a list of keyword(s) on which the similarity between the coupling node variable pair is evaluated. Note that in order to couple the two web tables, the keyword(s) should be present in at least one instance of the coupling node variable pair. Furthermore, there may be more than one pair of coupling node variables on which local web coupling can be performed. Local web coupling is a combination of two web operations: a web cartesian product followed by a web select based on some selection condition on the coupling nodes. Like its relational counterpart, a web cartesian product (denoted by ×) is a binary operation that combines two web tables by concatenating a web tuple of one web table with a web tuple of the other. If Wi and Wj have n and m web tuples respectively, then the resulting web cartesian product has n × m web tuples. The schema of the resultant web table W′ is given as M′ = ⟨Xn′, Xℓ′, C′, P′⟩ where Xn′ = Xi,n ⊎ Xj,n, Xℓ′ = Xi,ℓ ⊎ Xj,ℓ, C′ = Ci ⊎ Cj and P′ = Pi ⊎ Pj. The symbol ⊎ refers to the disambiguation [5,15] of node and link variables. Let us now illustrate web cartesian product with an example.

Example 4. Consider the web tables Symptoms and Drug list in Figs. 3 and 4. The web cartesian product of these two web tables is shown in Fig. 6. Due to space limitations, we only show a small portion of the resultant web table. The schema of the resultant web table is M′ = ⟨Xn′, Xℓ′, C′, P′⟩ where Xn′ = Xi,n ⊎ Xj,n = {x, y, z, w, a, b, c, d}, Xℓ′ = Xi,ℓ ⊎ Xj,ℓ = {t, e, f, -}, C′ = Ci ⊎ Cj ≡ k1′ ∧ k2′ ∧ k3′ ∧ k4′ ∧ k5′ ∧ k6′ such that k1′ = x⟨-⟩y, k2′ = y⟨e⟩z, k3′ = y⟨f⟩w, k4′ = a⟨-⟩b, k5′ = b⟨-⟩c, k6′ = c⟨t⟩d, and P′ = Pi ⊎ Pj ≡ p1′ ∧ p2′ ∧ p3′ ∧ p4′ ∧ p5′ ∧ p6′ ∧ p7′ ∧ p8′ ∧ p9′ ∧ p10′ ∧ p11′ such that p1′(x) ≡ [x.url EQUALS "http://www.virtualdisease.com/"], p2′(y) ≡ [y.title CONTAINS "issues"], p3′(e) ≡ [e.label CONTAINS "symptoms"], p4′(z) ≡ [z.title CONTAINS "symptoms"], p5′(f) ≡ [f.label CONTAINS "treatments"], p6′(w) ≡ [w.title CONTAINS "treatments"], p7′(a) ≡ [a.url EQUALS "http://www.virtualdrug.com/"], p8′(b) ≡ [b.title CONTAINS "Drug List"], p9′(c) ≡ [c.title CONTAINS "Issues"], p10′(d) ≡ [d.title CONTAINS "side effects"], p11′(t) ≡ [t.label CONTAINS "side effects"]. A web select operation is performed after the web cartesian product to filter out web tuples where the specified nodes cannot be related based on the keyword(s) conditions. These conditions impose additional constraints on the node variables participating in local web coupling. We denote this sequence of operations as local web coupling, and we can replace the two operations
W′ = Wi × Wj
W = σ(⟨node pair⟩, ⟨keyword condition(s)⟩)(W′)

with W = Wi ⊗({⟨node pair⟩, ⟨keyword(s)⟩}) Wj. The symbol σ denotes web selection. The result of a local web coupling operation is a web table having one web tuple for each combination of web tuples (one from Wi and one from Wj) whenever there exist coupling nodes. Let us illustrate web coupling with an example.

Example 5. Consider the web tables Symptoms and Drug list as depicted in Examples 2 and 3. Suppose Bill wishes to find the symptoms and treatment details of "Cancer" and "AIDS" and the list of drugs with their side effects on these diseases. The coupled web table is shown in Fig. 5. Note that the third and fourth web tuples in Fig. 6 are excluded from the coupled web table since they do not satisfy the keyword conditions. The schema of the coupled web table is M = ⟨Xn, Xℓ, C, P⟩ where Xn = Xn′, Xℓ = Xℓ′, C = C′ and P = P′. The construction details of the coupled schema and the coupled web table will be explained in Sect. 4.3.
4.2 Terminology
We introduce some terms we shall be using to explain local web coupling in this paper.
• Coupling nodes: Two web tuples wi and wj of web tables Wi and Wj respectively can be coupled if there exists at least one pair of nodes nc(wi) and nc(wj) in wi and wj which can be coupled with each other based on similar information content. We refer to these nodes as coupling nodes. We express the coupling nodes of wi and wj as coupling pairs since they cannot exist as a single node. Formally, (nc(wi), nc(wj)) is a coupling pair where node nc(wi) is coupled with nc(wj) of wj. The attributes of nc(wi) and nc(wj) are called coupling attributes. For example, the coupling nodes of the first web tuples in Figs. 3 and 4 are y0 and b0 respectively. The coupling pair for these nodes is (y0, b0). The coupling attributes of y0 and b0 are text, title, etc.
• Coupling-activating links: All the incoming links to the coupling nodes nc(wi) and nc(wj) are called coupling-activating links. Formally, ℓnc(wi) is the coupling-activating link of the coupling node nc(wi). For example, the link g0 in Fig. 3 is the coupling-activating link of node y0.
• Coupling keywords: The keyword condition(s) specified by the user, based on which coupling between node variables is performed, are called coupling keywords.
4.3 Web Table Creation
We now discuss the process of deriving the coupled web table from two input web tables. Given two web tables, a set of coupling keyword(s), and pair(s) of
node variables, we first construct the schema of the coupled web table and then proceed to create the table itself. Let web tables Wi and Wj with schemas Mi and Mj participate in the local web coupling process. Let the coupled web table be W with schema M = ⟨Xn, Xℓ, C, P⟩.

Construction of the coupled schema. We now determine the four components of M in the following steps:
Step 1: Determine the node set. Node variables in Xi,n and Xj,n can either be nominally distinct from one another or there may exist at least one pair of node variables from Xi,n and Xj,n which are identical to one another. If the node variables are not nominally distinct, we disambiguate one of the identical node variable(s). The node set of the coupled schema is given as: Xn = Xi,n ⊎ Xj,n.
Step 2: Determine the link set. Similarly, we disambiguate the identical link variables in Xi,ℓ and Xj,ℓ if necessary, and the link set of the coupled schema is given as: Xℓ = Xi,ℓ ⊎ Xj,ℓ.
Step 3: Determine the connectivity set. If the node and link variables are not nominally distinct, we replace any one of the identical variables in Ci or Cj with the disambiguated value. The connectivity set of the coupled schema is given as: C = Ci ⊎ Cj.
Step 4: Determine the predicate set. Our approach to determine the predicate set of the coupled schema is similar to the above. If the node and link variables are not nominally distinct, we replace any one of the identical node variables in Pi or Pj with the disambiguated value. The predicate set of the coupled schema is given as: P = Pi ⊎ Pj.

Construction of the coupled web table. The coupled web table is created by integrating the two input web tables. We describe the steps below:
Step 1: Given two web tables, we first perform a web cartesian product on the two web tables.
Step 2: For each web tuple in the web table created by the web cartesian product, the specified nodes are inspected to determine whether the web tuple is coupling-compatible locally (based on the coupling keyword(s) provided by the user). In order to be coupling-compatible, the specified pair of nodes in the web tuple must satisfy some coupling-compatibility conditions. We determine these conditions in the next section. We inspect each web tuple in the web table created by the web cartesian product to determine if the specified pair(s) of nodes satisfy any one of the coupling-compatibility conditions.
Step 3: If a pair of nodes satisfies none of the conditions, the corresponding web tuple is rejected. If the nodes satisfy at least one of the above conditions, the web tuple is stored in a separate web table (the coupled web table).
Table 1. Node attributes of y and b.
Node  URL                                             Title             Text
y0    http://www.virtualdisease.com/cancerindex.html  Cancer Issues     Cancer
b0    http://www.virtualdrug.com/cancer.html          Cancer Drug List  Cancer
Table 2. Link attributes of g and s.
Link  From Node  To Node  Label   Link Type
g0    x0         y0       Cancer  local
s0    a0         b0       Cancer  local
Step 4: Repeat steps 2 and 3 for other web tuples in the resultant web table created by web cartesian product.
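Putting the four steps together, local web coupling can be sketched as a filtered web cartesian product. The dictionary representation of web tuples and the three-argument compatibility test are simplifying assumptions (in particular, variable disambiguation is ignored).

```python
from itertools import product

def local_web_coupling(table_i, table_j, node_pair, keywords, compatible):
    """Steps 1-4 above: a web cartesian product followed by a web select.

    `table_i`/`table_j` are lists of web tuples (dicts from node variable to
    document), `node_pair` names the coupling node variables, and `compatible`
    is a coupling-compatibility test such as the one sketched after Definition 1."""
    var_i, var_j = node_pair
    coupled = []
    for wi, wj in product(table_i, table_j):            # step 1: cartesian product
        if compatible(wi[var_i], wj[var_j], keywords):  # steps 2-3: keep or reject
            coupled.append({**wi, **wj})                # concatenated web tuple
    return coupled                                      # step 4: all tuples processed
```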
4.4 Coupling-Compatibility Conditions
Local web coupling-compatibility conditions may be based on node attributes of the instances of the specified node variables and/or attributes of the instances of incoming link variables of the specified node variables (coupling-activating links). Let us define some terms to facilitate our exposition. Given a web tuple w of web table W with schema M = ⟨Xn, Xℓ, C, P⟩, let n(w) be a node of w and ℓn(w) be the incoming link to node n(w) such that:
• attr(n(w)) ∈ {url, text, title, format, date, size} is a node attribute;
• attr(ℓn(w)) ∈ {source url, target url, label, link type} is a link attribute; and
• val(n(w)) and val(ℓn(w)) are the values of attr(n(w)) and attr(ℓn(w)) respectively.
For example, consider Tables 1 and 2, which depict some of the attributes of node variables b, y and link variables g, s. For node b0, attr(b0) = title and val(b0) = Cancer Drug List. For link s0 (the incoming link to node b0), attr(s0) = label and val(s0) = Cancer. Let nci and ncj be node variables in schemas Mi and Mj of web tables Wi and Wj respectively participating in the local web coupling, and let Kc be the coupling keywords. Let wi and wj be two web tuples of Wi and Wj such that nc(wi) and nc(wj) are instances of nci and ncj respectively. Moreover, let the web cartesian product of Wi and Wj be W′ and let w′ be a web tuple in W′ which is the cartesian product of wi and wj. Web documents represented by nodes nc(wi) and nc(wj) can be coupling nodes (that is, web tuples wi and wj are coupling-compatible) if they satisfy at least one of the coupling-compatibility conditions given below:
1. The title of the web documents is equal to Kc or contains the coupling keyword Kc, i.e., attr(nc(wi)) = attr(nc(wj)) = title, and val(nc(wi)) and val(nc(wj)) are equal to Kc or contain Kc.
2. The text of the web documents contains Kc, i.e., attr(nc(wi)) = attr(nc(wj)) = text, and val(nc(wi)) and val(nc(wj)) contain Kc.
3. The coupling keyword Kc is contained in the text of one web document and in the title of the other document, i.e., attr(nc(wi)) = text, attr(nc(wj)) = title, val(nc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
4. The coupling keyword is contained in the file name of the URL of the web documents, i.e., attr(nc(wi)) = attr(nc(wj)) = url.filename, and val(nc(wi)) and val(nc(wj)) contain Kc.
5. The coupling keyword is contained in the text of one web document and in the file name of the URL of the other document, i.e., attr(nc(wi)) = text, attr(nc(wj)) = url.filename, and val(nc(wi)) and val(nc(wj)) contain Kc.
6. The coupling keyword is contained in the file name of the URL of one document and in the title of the other document, i.e., attr(nc(wi)) = url.filename, attr(nc(wj)) = title, val(nc(wi)) contains Kc and val(nc(wj)) contains or is equal to Kc.
7. The labels of the incoming links ℓnc(wi) and ℓnc(wj) to the web documents contain the coupling keyword Kc, i.e., attr(ℓnc(wi)) = attr(ℓnc(wj)) = label, and val(ℓnc(wi)) and val(ℓnc(wj)) are equal to or contain Kc.
8. The label of the incoming link ℓnc(wi) and the title of node nc(wj) contain or are equal to Kc, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = title, and val(ℓnc(wi)) and val(nc(wj)) are equal to or contain Kc.
9. The label of the incoming link to one document contains or is equal to Kc and the text of the other web document contains the coupling keyword, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = text, val(ℓnc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
10. The label of the incoming link contains or is equal to Kc and the file name of the URL of the other web document contains Kc, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = url.filename, val(ℓnc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
5 Related Work
We would like to briefly survey the web data retrieval and manipulation systems proposed so far, and compare them with web information coupling. There has been considerable work on data models and query languages for the World Wide Web [9], [11], [12], [13]. To the best of our knowledge, we are not aware of any work which deals with web information coupling in web databases. Mendelzon, Mihaila and Milo [13] proposed the WebSQL query language based on a formal calculus for querying the WWW. The result of a WebSQL query is a set of web tuples flattened immediately into linear tuples. This limits the expressiveness of queries to some extent, as complex queries involving operators such as local web coupling are not possible. Konopnicki and Shmueli [11] proposed a high-level querying system called W3QS for the WWW whereby users may specify content and structure queries on the WWW and maintain the results of queries as database views of the WWW. In W3QL, queries are always made to the WWW.
Past query results are not used for the evaluation of future queries. This limits the use of web operators like local web coupling to derive additional information from past queries. Fiebig, Weiss and Moerkotte [9] extended relational algebra to the World Wide Web by augmenting the algebra with new domains (data types) and functions that apply to the domains. The extended model is known as RAW (Relational Algebra for the Web). Only two low-level operators on relations, scan and index-scan, have been proposed, to expand a URL address attribute in a relation and to rank results returned by web search engine(s) respectively. RAW makes minor improvements on the existing relational model to accommodate and manipulate web data, and there is no notion of a coupling operation similar to the one in WICM. Inspired by concepts in declarative logic, Lakshmanan, Sadri and Subramanian [12] designed WebLog to be a language for querying and restructuring web information. But there is no formal definition of web operations such as web coupling. Other proposals, namely Lorel [1] and UnQL [8], aim at querying heterogeneous and semistructured information. These languages adopt a lightweight data model to represent data, based on labeled graphs, and concentrate on the development of powerful query languages for these structures. Moreover, in both proposals there is no notion of a web coupling operation similar to the one in WICM. Website restructuring systems like Araneus [4] and Strudel [10] exploit the knowledge of a website's structure to define alternative views over its content. Neither of these models focuses on web information coupling similar to the one in WICM. The WebOQL system [3] supports a general class of data restructuring operations in the context of the Web. It synthesizes ideas from query languages for the Web, semistructured data, and web site restructuring. The data model proposed in WebOQL is based on ordered trees, where a web is a graph of trees. This model enables us to navigate, query and restructure graphs of trees. In this system, the concatenate operator allows us to juxtapose two trees, which can be viewed as a manipulation of trees. But there is no notion of a web coupling operation similar to ours.
6 Summary and Future Work
In this paper, we have motivated the need for coupling useful information residing in the WWW and in multiple web tables from a web database. We have introduced the notion of global web coupling and local web coupling that enable us to couple useful related information from the WWW and associate related information residing in different web tables by combining web tuples whenever they are coupling-compatible. We have shown how to construct the coupled web table globally and locally from the WWW and two input web tables respectively. Presently, we have implemented the global web coupling operator and have interfaced it with other web operators. The current global web coupling operator can be used efficiently for simple web queries. We are in the process of implementing the local web coupling operator and finding ways to optimize web coupling.
References
1. S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener. The Lorel Query Language for Semistructured Data. Journal of Digital Libraries, 1(1):68-88, April 1997.
2. S. Abiteboul, V. Vianu. Queries and Computation on the Web. Proceedings of the 6th International Conference on Database Theory, Greece, 1997.
3. G. Arocena, A. Mendelzon. WebOQL: Restructuring Documents, Databases and Webs. Proceedings of ICDE 98, Orlando, Florida, February 1998.
4. P. Atzeni, G. Mecca, P. Merialdo. Semistructured and Structured Data in the Web: Going Back and Forth. Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, May 1997.
5. S. S. Bhowmick, W.-K. Ng, E.-P. Lim. Join Processing in Web Databases. Proceedings of the 9th International Conference on Database and Expert Systems Applications (DEXA'98), Vienna, Austria, August 24-28, 1998.
6. S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim. Web Bags: Are They Useful in A Web Warehouse? Submitted for publication.
7. S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim. Semi Web Join in WICS. Submitted for publication.
8. P. Buneman, S. Davidson, G. Hillebrand, D. Suciu. A Query Language and Optimization Techniques for Unstructured Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Canada, June 1996.
9. T. Fiebig, J. Weiss, G. Moerkotte. RAW: A Relational Algebra for the Web. Workshop on Management of Semistructured Data (PODS/SIGMOD'97), Tucson, Arizona, May 16, 1997.
10. M. Fernandez, D. Florescu, A. Levy, D. Suciu. A Query Language and Processor for a Web-Site Management System. Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, May 1997.
11. D. Konopnicki, O. Shmueli. W3QS: A Query System for the World Wide Web. Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.
12. L. V. S. Lakshmanan, F. Sadri, I. N. Subramanian. A Declarative Language for Querying and Restructuring the Web. Proceedings of the Sixth International Workshop on Research Issues in Data Engineering, February 1996.
13. A. O. Mendelzon, G. A. Mihaila, T. Milo. Querying the World Wide Web. Proceedings of the International Conference on Parallel and Distributed Information Systems (PDIS'96), Miami, Florida, 1996.
14. W.-K. Ng, E.-P. Lim, S. S. Bhowmick, S. K. Madria. An Overview of A Web Warehouse. Submitted for publication.
15. W.-K. Ng, E.-P. Lim, C.-T. Huang, S. Bhowmick, F.-Q. Qin. Web Warehousing: An Algebra for Web Information. Proceedings of the IEEE International Conference on Advances in Digital Libraries (ADL'98), Santa Barbara, California, April 22-24, 1998.
Structure-Based Queries over the World Wide Web
Tao Guan, Miao Liu, and Lawrence V. Saxton
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
{guan,lium,saxton}@cs.uregina.ca
Abstract. With the increasing importance of the World Wide Web as an information repository, how to locate documents of interest becomes more and more significant. The current practice is to send keywords to search engines. However, these search engines lack the capability to take the structure of the Web into consideration. We thus present a novel query language, NetQL, and its implementation, for accessing the World Wide Web. Rather than working on global full-text search, NetQL is designed for local structure-based queries. It not only exploits the topology of web pages given by hyperlinks, but also supports queries involving information inside pages. A novel approach to extract information from web pages is presented. In addition, the methods to control the complexity of query processing are also addressed in this paper.
1 Introduction
The World Wide Web provides a huge information repository based on the Internet. It is a big problem to find documents of interest in this system. The current practice mostly depends on sending a keyword or a combination of keywords to search engines such as AltaVista and Yahoo. Although this is successful in some cases, there are still limitations in this approach. For example, (1) the search is limited to page content, which is viewed as unstructured text, so that the inner structural information is ignored; (2) the accuracy of results is low and garbage exists in the output. On the other hand, structure-based query languages [3,11,13,12,16] have been proposed to exploit the link structures between the pages. Most of these works are based on the metaphor of the Web as a database. However, the nature of the Web is fundamentally different from traditional databases. The main characteristics are its global nature and the loosely textual, semi-structured information it holds. Although these languages can solve the problems of search engines to some extent, they usually suffer from the following drawbacks: (1) They focus on the hyperlinks, so that page contents are simplified as atomic objects (i.e., strings) or relations with specific attributes (i.e., URL, title, text, type, etc.). The inner structure, which is valuable for many queries, is ignored.
This limits the expressive power of the languages. For example, the following queries cannot be expressed:
– List the name and e-mail address of all professors at the University of Regina;
– Find hotels in Hong Kong whose price is less than US$100.
The problem with these two queries is that the information on price or e-mail address is, in most cases, kept inside a web page. If languages view a page as an atomic object and only support operations like keyword matching, it is hard to exploit the valuable data inside a page. The main difficulty may be that it is too tough to obtain this information from a Web page, since pages usually are irregular. Therefore, how to extract the structural information is a key point. Although the new XML technology provides self-describing structures, valuable information hidden inside semi-structured textual lines is still useful for users. The technique of mining this kind of data is thus important. It has actually been studied in [1,2,8,17]. Here we present a novel approach to deal with it.
(2) How to control the complexity of query processing is not addressed. Since structure-based queries are evaluated on original, distributed data, the communication costs of accessing remote data may be huge. Therefore, a blind search is inefficient. Methods to control the run time should be investigated.
Our contributions. This paper presents an intelligent query language over the WWW, called NetQL. Our purpose is not to give birth to yet another powerful language. Instead, we focus on the problems mentioned above. NetQL follows the approach of structure-based queries; however, it attempts to overcome the problems unsolved by other languages. First, a novel approach to mine information from web pages is presented so that queries which involve information or structures inside pages can be issued. Secondly, various methods are provided to control the complexity of query processing. Rather than representing a web page as a labeled graph or as relations as in current practice [3,11,13,12,16], our mining algorithm extracts the desired information from irregular pages directly by keywords or patterns. We assume: (1) the important information is always highlighted by keywords or is meaningfully semi-structured, since most web data is summarized and condensed (except online news or newspapers); (2) some common patterns exist in English, e.g. the word after “Dr.” or “Mr.” should be a name; (3) similar structures or patterns occur in the web pages of an institute, since most public web pages are written by the same professional webmasters and thus a similar style (or even simple copies) is employed. Therefore, a set of heuristic rules or patterns can be used to identify this information. Our experiments show that this novel approach of extracting information from the unstructured web is more effective than conventional ones, which depend on syntax (i.e. HTML tags) or declarative languages [2,8,17].
In addition, the complexity of query processing is controlled in NetQL at two levels. Firstly, users are given various choices to control run time. For example, they can specify a more exact path if they have partial knowledge of the structure of the searched site, or simply limit the evaluation of queries to local data or a
fixed number of returned results. Secondly, an effective optimizing technique, based on semantic similarity, is developed to guide the search in the most promising direction. The remainder of this paper is organized as follows. In Sect. 2, we briefly introduce our query language, NetQL. We then discuss how to mine information from web pages in Sect. 3. Section 4 presents methods to control the complexity of queries. Experimental results are shown in Sect. 5. Finally, conclusions and references are presented.
2 The Language NetQL
We briefly introduce our query language NetQL in this section. A web site is modeled as a rooted, edge-labeled graph as in semistructured databases. Each node represents a page; each page has a unique URL and can be viewed as either a semistructured textual string or a set of textual lines, each of which consists of a few fields (one or more words; the definition is given later). Figure 1 is an example modeling a portion of the web site of the CS department at the University of Regina.
Fig. 1. A Sample web site and page (the site is rooted at http://www.cs.uregina.ca, with hyperlink labels such as History, Information, People, Staff, Faculty, Graduate Student, Research, Class Files and Publication, and a sample page with sections on the university's history, its programs and its research)
The syntax of NetQL is similar to the standard SQL SELECT statement, but the data are web documents instead of relational databases. The general grammar is as follows:

select variables
from startingpage→path
contain keywords
match patterns
where conditions
restricted specification
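To make the clause structure concrete, the following Python sketch (our own illustration, not part of NetQL's published implementation) shows one way a parsed NetQL query could be represented; the field names simply mirror the clause names of the grammar above.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NetQLQuery:
    # A parsed NetQL query; each field mirrors one clause of the grammar.
    select: List[str]                                   # keyword and pattern variables
    start_page: Optional[str] = None                    # starting page of the from clause
    path: Optional[str] = None                          # path expression, e.g. "people.faculty" or "*"
    contain: List[str] = field(default_factory=list)    # keywords the page must contain
    match: List[str] = field(default_factory=list)      # string or structure patterns
    where: Optional[str] = None                         # condition over the extracted variables
    restricted: Optional[str] = None                    # complexity-control specification

# The query of Example 2.1 below, expressed in this representation:
example_2_1 = NetQLQuery(
    select=["Name", "E-mail"],
    start_page="http://www.uregina.ca/",
    path="*",
    contain=["professor"],
    match=["[Dr. Name]"],
)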
The select clause contains a list of variables which indicate what information is finally extracted from the chosen pages. There are two kinds of variables in NetQL. One is called the keyword variable, whose value is mined from pages directly by a set of heuristic rules. The other is the pattern variable, which must appear in the string or structure patterns of the match clause; its value is obtained when the pattern is matched against a portion of the content of a page (see details in the next section). The from clause specifies where web pages are reached. If absent, the default case is all pages located at the web server where the user sends the query. Otherwise, the clause specifies a starting page and a further path expression. (The latter is not mandatory; it is mainly used to improve the performance of queries.) A path expression is usually a set of predicates starting from the specified page, following certain hyperlinks satisfying the predicates and arriving at other pages. While “→” is used to separate the starting page from the following hyperlinks, “.” represents a hyperlink from one page to another. For example, http://www.cs.uregina.ca/→people.faculty is a path from the CS page to the faculty page through the people link. In addition, the wildcard “∗” represents an arbitrary length of path and “-” means the path only goes one level deeper. When the pages specified in the from clause are reached, NetQL first checks if they contain the keywords given in the contain clause. If not, the pages are discarded. Otherwise, the desired information is mined by the keywords in the select clause or the pages are matched against the patterns in the match clause. The obtained information is then used to evaluate the conditions in the where clause. If they are true, the values assigned to the variables in the select clause are returned to users. Finally, the restricted clause is used to control the complexity of query processing. We discuss this further in Sect. 4.
Example 2.1. Find the name and e-mail address of all professors at the University of Regina.

select Name, E-mail
from http://www.uregina.ca/→∗
contain professor
match [Dr. Name]

In this case, Name and E-mail are variables used to indicate what we are looking for. E-mail is a keyword variable whose value is mined directly from the web pages specified in the from clause (all pages containing the word professor at the site http://www.uregina.ca). In contrast, Name is a variable occurring in the pattern [Dr. Name]. The query first finds the pages containing the keyword professor at the site http://www.uregina.ca, then locates the constant string “Dr.” in the returned pages and assigns the first noun phrase after it to the variable Name. For example, if the string “Dr. Jack Boan is ....” is found, then “Jack Boan” is assigned to Name. When more than one possible value is found for a variable, the conflict is resolved by the rules given in Sect. 3. The final result is shown in Fig. 2.
Fig. 2. The Final Result for Example 2.1
Of course, the result does not cover all the desired information and errors also appear (e.g. the e-mail address for Dr. Bryan Austin cannot be [email protected]). However, the results are much more precise than those of search engines (84460 links were returned when the keywords professor, university and regina were sent to Yahoo).
Example 2.2. Find hotels in Hong Kong whose price is less than US$100.
This example seems difficult since we do not know where we can find information on hotels in Hong Kong. However, we can solve the problem by sending the words hotel and Hong Kong to a search engine, i.e. Yahoo, and getting a page as in Fig. 3. There are 44 hotels returned and we may browse them manually to find what we need. However, if the number of hotels is large, it will be difficult to search manually. Fortunately, the following NetQL query can deal with it when we know of a homepage, as in Fig. 3, at http://www.asia-hotels.com/HongKong.asp.

select X, Z
from http://www.asia-hotels.com/HongKong.asp
match {X, Y, Z<100}

This query matches the structure pattern {X, Y, Z<100} against the content of the page whose URL is http://www.asia-hotels.com/HongKong.asp. It treats a web page as a set of textual lines, each of which consists of a number of fields. (The fields are separated by delimiters, which are defined as two or more spaces or any HTML tag.)
Fig. 3. The Sample Page for Example 2.2
For example, the following line has three fields:

Anne Black Guest House (YWCA)   Special offer   US$ 43 to 101

The structure pattern {X, Y, Z<100} is matched against a textual line (or a row in a table) which has exactly three fields (or columns) and whose third field contains a number that is less than 100. Therefore, for the above line, we have X = “Anne Black Guest House (YWCA)”, Y = “Special offer” and Z = “US$ 43 to 101”. Since the value of the variable Z should be a number, the heuristic rule is to choose the first number in the string, that is, 43 is assigned to Z. Therefore, the line matches the pattern and the values of the variables X and Z are output.
The above examples highlight the main novelty of NetQL; that is, keywords and patterns are used to extract information from web pages. We describe this further in the next section. Another feature, complexity control and the restricted clause, is discussed in Sect. 4.
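As a rough illustration of how a structure pattern such as {X, Y, Z<100} could be evaluated over a textual line, consider the following Python sketch; the helper names are ours, and the splitting rule (runs of two or more spaces) is a simplification of the delimiter definition given above.

import re

def split_fields(line):
    # Split a textual line into fields on runs of two or more spaces.
    return [f for f in re.split(r"\s{2,}", line.strip()) if f]

def first_number(text):
    # The value of a number field is the first number occurring in the string.
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def match_xyz_lt(line, limit):
    # Match {X, Y, Z<limit}: exactly three fields, the third containing a number below limit.
    fields = split_fields(line)
    if len(fields) != 3:
        return None
    z = first_number(fields[2])
    if z is None or z >= limit:
        return None
    return {"X": fields[0], "Y": fields[1], "Z": fields[2]}

line = "Anne Black Guest House (YWCA)   Special offer   US$ 43 to 101"
print(match_xyz_lt(line, 100))   # binds X, Y and Z; the number extracted from Z is 43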
3 Page Mining
We discuss how to mine information from web pages in this section. Our idea is similar to the human approach in locating desired information. When a person is looking for something, he/she always follows two approaches: (1) use a keyword to recognize desired information. For example, a keyword Email: or Tel: indicates
that the string after it is an e-mail address or telephone number; (2) use semantic knowledge or a pattern to recognize objects. For example, most people know that the string “3400 Rae St. Regina, SK, Canada” is an address although there is no keyword address or contact before it. Since semantic knowledge is hard for a computer to acquire, NetQL currently only supports keyword-based mining and string or structure pattern-based mining. We discuss them in the following sections.
3.1 Keyword-Based Mining
Keyword-based mining is used to extract the values associated with keywords, e.g. E-mail, Publication, or Research interests. When a keyword is given, the system first looks for the keyword in pages. If it is located, the following heuristic rules are applied to mine the corresponding value automatically:
– If the word is in the label of a hyperlink, then the value is the content of the page pointed to by the link. For example, the information on publications must be in the pointed-to page if there is a hyperlink label containing publications.
– If the word is a title (an HTML heading), then the value is the string between this title and the next title. If this is the last title in the HTML file, then the value ends when (1) a blank line appears or (2) a separating HTML tag appears. For example, in Fig. 1, the value for the keyword History is the string after it until the next title Programs.
– If the word is an item of a list, then the value is the string after it until the next list item or the end of the list.
– If the word is a field in a table (a table cell), then the value of a field in the first column (except the table head) of a two-column table is the field on its right. Otherwise, it is the field under it. For example, in the following table on the left, the value for Single room is 80. But for the same field in the table on the right, it is 50.
Price          $
Single room    80
Double room    140
Extra bed      30

Single room    Double room    Extra bed
50             90             20

Fig. 4. Two Tables In the Web
– If the word is at the beginning of a textual line which itself forms an independent paragraph (e.g. HTML tags separate it from the preceding and
following text), and there are HTML tags or more than two spaces separating it from the following words, then the value is the string after it until the end of the line. For example, in the following line,

Office: CW308.2

the value for Office is CW308.2, since the keyword is separated from the value by HTML tags in the page. If the keyword denotes a number (in a conditional expression, as in Example 2.2), then the value is the first number occurring in the string obtained by the above rules.
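The rule for keywords at the beginning of a line can be sketched in a few lines of Python; this is only an approximation of the heuristic described above (HTML tags are treated as separators, and the helper name is ours).

import re

def mine_keyword_value(lines, keyword):
    # If the keyword starts a line and is separated from the rest by HTML tags
    # or two or more spaces, the value is the remainder of the line.
    for line in lines:
        text = re.sub(r"<[^>]+>", "  ", line)     # treat HTML tags as separators
        m = re.match(rf"\s*{re.escape(keyword)}\s{{2,}}(.+)", text)
        if m:
            return m.group(1).strip()
    return None

page = ["<b>Office:</b> <i>CW308.2</i>", "Phone: 585-0000"]
print(mine_keyword_value(page, "Office:"))   # -> CW308.2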
3.2 String Pattern Mining
A string pattern contains a number of constant words and variables (variables are those indicated in the select or where clauses) and is delimited by a pair of brackets. When it is given, the system first locates strings in pages that match the constant words of the pattern. If successful, the noun phrase or number corresponding to a variable is assigned to that variable. For example, the pattern [Dr. Name] can be matched against a string starting with “Dr.”, and the first noun phrase after “Dr.” is then assigned to the variable Name. For the string “Dr. Jack Boan is ...”, “Jack Boan” is assigned to Name. The reason why we focus on noun phrases and numbers is that most information involved in a query is represented by noun phrases or numbers (verbs usually indicate an action or state). The definition of noun phrases is as follows:

NP  → NP2 | Det NP2 | NP Conj NP
NP2 → Noun | Noun NP2 | Adj NP2 | NP2 PP
PP  → Prep NP

where NP denotes a noun phrase, Det denotes a determiner, Conj denotes a conjunction, Adj denotes an adjective and Prep denotes a preposition.
Two or more patterns may be linked using boolean operators, e.g. [Mr. Name] or [Ms Name]. The word matched against any one of the patterns can be assigned to the variable Name. Of course, it is possible that more than one word matches a variable. If that happens, we count the number of matches in the page and choose the one with the highest frequency, or the first if two or more words have the same maximum frequency count. In addition, wildcards can be used in string patterns, i.e. “∗” and “-”, where “-” denotes one word and “∗” represents any number of words. For example, the pattern [Dr. Name received Degree from ∗ in Year] extracts information for the variables Name, Degree and Year, while the university is ignored.
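A string pattern such as [Dr. Name] can be approximated with a regular expression; the sketch below (ours, with a deliberately crude noun-phrase approximation of a run of capitalised words) also shows the frequency-based conflict resolution mentioned above.

import re
from collections import Counter

# Crude stand-in for the NP grammar: a noun phrase is a run of capitalised words.
NOUN_PHRASE = r"((?:[A-Z][a-z]+\s?)+)"

def mine_string_pattern(text, prefix):
    # Collect every noun phrase following the constant prefix (e.g. "Dr.") and
    # resolve conflicts by choosing the most frequent match.
    matches = [m.strip() for m in re.findall(re.escape(prefix) + r"\s+" + NOUN_PHRASE, text)]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]

page_text = "Dr. Jack Boan is a professor ... please contact Dr. Jack Boan by e-mail."
print(mine_string_pattern(page_text, "Dr."))   # -> Jack Boan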
3.3 Structure Pattern Mining
Structure pattern mining means that a structure pattern is given to match a textual line or a row of a table (defined by the tags <tr> and </tr>) in web pages. NetQL treats each textual line or table row as a set of fields, and thus the syntax of a structure pattern is
{f1, f2, ..., fn}

where fi denotes a field, which may be:
– a variable, e.g. X, Y or Z;
– a constant, which may contain wildcards as in string patterns, e.g. “Hong Kong”, 10, “University of ∗”;
– a simple expression of the form “variable operator constant”, where the operator may be <, =, >, ≤, or ≥, for example Z < 100;
– “-”, a field whose value can be ignored; or
– “∗”, any number of fields whose values can be ignored.
For example, the structure pattern {X, “Canada”, ∗, Z < 20} is matched against a textual line or a row of a table in which the second field is the string “Canada” and the last field is a number less than 20. The processing of a structure pattern finds the values of all fields of a textual line or row and then matches them against the pattern. For textual lines, fields are separated by delimiters, which are defined as two or more spaces or an HTML tag. The value of a string field is the textual string (without HTML tags) between two delimiters and the value of a number field is the first number between two delimiters. For example, consider the following two lines:

Order  Country  Gold  Silver  Bronze  Total
4      Canada   6     5       7       18
There are 6 fields in each line. The second line matches the pattern {X, “Canada”, ∗, Z < 20}; thus the value of X is 4 and the value of Z is 18. As for tables, the fields of each row are delimited by the tags <td> and </td> (each row lies between <tr> and </tr>). The value of a string field is the string with all tags (enclosed in <>) eliminated. Similarly, if it is a number field, we take the first number in the string as the value. For example, consider the row of the HTML table in Fig. 3 that describes the hotel of Example 2.2.
There are three fields. The value of the first field is “Anne Black Guest House” and the second is “Special offer” and the third is “US$43 to 101”. This example matches the pattern in Example 2.2.
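For table rows, the same idea applies after stripping the markup; a possible Python sketch is shown below (the sample markup is our own illustration in the spirit of the hotel page, not the original HTML).

import re

def row_fields(row_html):
    # Take the text of each <td> cell with all tags removed, as described above.
    cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, flags=re.I | re.S)
    return [re.sub(r"<[^>]+>", "", c).strip() for c in cells]

row = ("<tr><td><a href='#'>Anne Black Guest House</a></td>"
       "<td>Special offer</td><td><b>US$ 43 to 101</b></td></tr>")
print(row_fields(row))   # ['Anne Black Guest House', 'Special offer', 'US$ 43 to 101']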
We should note again that structure-based patterns cannot handle all structural information, e.g. complex tables. However, 2-dimensional relational tables, in which each row has the same number of fields, are the most popular in practice. Therefore, our method is feasible in many cases.
4 Complexity Control
Complexity is a big challenge for structure-based web queries, since they are evaluated on original, distributed data. NetQL deals with complexity at two levels. Firstly, users are given various methods to control complexity. Secondly, an effective optimization technique is provided to guide the search to the closest or most promising path so that the expected results can be obtained as soon as possible. In this section, we discuss these two levels in turn.
4.1 Users’ Role
If users have partial knowledge about the structure of the searched site, they can give more specific path information to reduce blind search. The more information users specify, the more efficient the query is. In Example 2.1, if they know that all professors are listed under the hyperlink faculty, then the path expression in the from clause can be updated to http://www.cs.uregina.ca/→faculty.∗. The query then only checks the pages which are under the hyperlink faculty and contain the word professor. The run time is thus reduced significantly, especially when the query is issued at a remote site. Partial knowledge is possible for users since: (1) they may have visited the site before; (2) web sites on a similar topic have similar structures, e.g. the structure of one university's site can be derived if other universities' sites were visited before. Of course, if users know nothing about the site, they can limit queries in the following aspects:
– Restrict the search to local data. The search only follows the links inside the web server where the starting page is located.
– Restrict the search to a certain number of returned results. If the specified number of results is exceeded, the search stops.
– Restrict the search to a certain amount of time. When the time reaches its limit, the search stops and returns the results found.
For example,

select Publications
from http://www.cs.uregina.ca/→faculty.∗
contain professor
where Publications CONTAINS database
restricted LOCAL and RESULTS < 10

This approach trades a portion of the results for fast response. It is useful in cases where inaccurate and incomplete results are acceptable to users. However, if some users hope for an exhaustive search, internal optimization will be applied in this situation.
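The three restrictions can be honoured by a straightforward bounded traversal; the following Python sketch (our own, with page_matches and out_links standing in for page fetching and parsing) illustrates the idea.

import time
from collections import deque
from urllib.parse import urlparse

def restricted_search(start_url, page_matches, out_links,
                      local_only=True, max_results=10, max_seconds=60.0):
    # Breadth-first traversal that honours LOCAL, a bound on returned RESULTS
    # and a TIME limit; page_matches(url) and out_links(url) are assumed helpers.
    start_host = urlparse(start_url).netloc
    deadline = time.monotonic() + max_seconds
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_results and time.monotonic() < deadline:
        url = queue.popleft()
        if page_matches(url):
            results.append(url)
        for link in out_links(url):
            if link in seen:
                continue
            if local_only and urlparse(link).netloc != start_host:
                continue
            seen.add(link)
            queue.append(link)
    return results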
4.2 Optimization
Our idea is inspired by heuristic search in problem-solving programs in AI. Rather than pruning the search at certain sub-graphs, our method attempts to guide the search to the closest or most promising path so that the expected results can be obtained as soon as possible. Queries restricted by time or by the number of results benefit directly. Our algorithm uses the semantic information of hyperlink labels for heuristic search. The semantic similarity between the current set of links and the goal can help the optimizer to decide which link is preferred for the next step. Actually, humans navigate this way when they browse web pages manually. For example, suppose the starting point has the following structure:
Fig. 5. A Portion of web site (the page http://www.cs.uregina.ca with hyperlinks labeled Information, People, Research and Class Files)
When we need to locate the pages containing professor, which hyperlink has the highest priority? Obviously, it should be People, since this label is more similar to professor than any other. The key point of this method is how to compute the semantic distances between words or noun phrases. This problem has been widely studied in the fields of Natural Language Processing (NLP) and Information Retrieval (IR). Various methods have been presented in [9,10,19]. In this paper, we follow the approach of word-word similarities based on WordNet presented in [9,19]. In WordNet, conceptual similarity is considered in terms of synset (a set of synonyms) similarity. The similarity between two synsets is approximated by the maximum information content of the super synsets in the hierarchy that subsume both synsets. The information content of a synset is quantified as the negative log likelihood, -log P(s), and in our case P(s) is computed simply as a relative frequency:

P(s) = Σ count(w) / N,  w ∈ synset(s),

where count(w) is the number of occurrences of the word w in the corpus (we use noun frequencies from the Brown Corpus of American English [5]) and N is the total number of words observed. Thus the similarity of two synsets can be expressed as

sim(s1, s2) = max over s of [IC(s)] = max over s of [-log P(s)],  s ∈ Sup(s1, s2),

where Sup(s1, s2) is the set of super synsets that subsume both s1 and s2. If there are no super synsets for s1 and s2, then the similarity is 0.
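A small worked example may help; the Python sketch below uses a toy taxonomy and toy word counts in place of WordNet synsets and the Brown Corpus frequencies (each toy synset contains a single word), and computes the similarity exactly as defined above.

import math

# Toy taxonomy (word -> parent) and corpus counts standing in for WordNet and Brown.
PARENT = {"professor": "educator", "educator": "person", "people": "person",
          "person": "entity", "research": "activity", "activity": "entity"}
COUNT = {"professor": 10, "educator": 40, "person": 4000, "people": 900,
         "research": 300, "activity": 2000, "entity": 10000}
N = sum(COUNT.values())

def supersets(s):
    # The synset itself plus all of its super synsets in the hierarchy.
    chain = [s]
    while s in PARENT:
        s = PARENT[s]
        chain.append(s)
    return chain

def information_content(s):
    # P(s) = sum of count(w) for w in synset(s), divided by N; here each toy
    # synset contains a single word, so P(s) = COUNT[s] / N.
    return -math.log(COUNT[s] / N)

def sim(s1, s2):
    # Maximum information content over the common super synsets.
    common = set(supersets(s1)) & set(supersets(s2))
    return max((information_content(s) for s in common), default=0.0)

print(sim("people", "professor") > sim("research", "professor"))   # True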
Assume that a set of labels of candidate hyperlinks L = {l1, l2, ..., ln} is matched with a predicate p; the following algorithm is applied for optimization:
– A stoplist is used to remove the common words (and, the, in, etc.) from each label li. The result is denoted as keywordset(li).
– The similarity of each li and p is computed as sim'(li, p) = max(sim(synset(w), synset(p))), where w ∈ keywordset(li).
The following heuristic rule is then used.
Rule: The link whose label has the maximum similarity with the predicate p is selected first for the next search.
For example, in Fig. 5, we have
sim'("Information", "Professor") = 0;
sim'("People", "Professor") = 8.11;
sim'("Research", "Professor") = 0;
sim'("Class Files", "Professor") = 0.
Therefore, the link with the label People is selected first for the next search.
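Using any such word-word similarity, the heuristic rule itself is simple to state in code; the sketch below (ours) removes stopwords from each label and ranks the candidate links, with a toy similarity standing in for the WordNet-based measure.

STOPLIST = {"and", "the", "in", "of", "a", "on"}

def keywordset(label):
    # Remove common words from a hyperlink label.
    return [w for w in label.lower().split() if w not in STOPLIST]

def best_link(labels, predicate, word_sim):
    # Heuristic rule: select the link whose label is most similar to the predicate.
    def score(label):
        return max((word_sim(w, predicate.lower()) for w in keywordset(label)), default=0.0)
    return max(labels, key=score)

toy_sim = lambda w, p: 8.11 if {w, p} == {"people", "professor"} else 0.0
labels = ["Information", "People", "Research", "Class Files"]
print(best_link(labels, "professor", toy_sim))   # -> People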
5 Experiments
The most important facility in NetQL is information mining from multiple web pages. Our first attempt was to extract structural information from web pages based on their syntactic or semantic structure. The method was to transform a page into a labeled graph as in semistructured databases [2,6,7] and then obtain the desired information from the graph. However, this approach failed in our experiments, since HTML provides many flexible constructors and thus most web pages are quite irregular. We therefore tried another kind of information mining, based on keywords, patterns and structure. Although this approach cannot guarantee 100% success in mining the desired information, our initial experience shows that it is effective in practice. In Example 2.1, the average recall (the percentage of desired instances obtained by the system) for keyword-based mining (i.e. the variable E-mail) was 93% and the average precision (the percentage of correct instances) was 85%. However, the recall for pattern-based mining was a little lower: it was only 21% for the pattern [Dr. Name]; nevertheless, the precision was 95%. From the experiments, the following observations are made: (1) There is often no keyword before the desired information, such as names, titles or addresses in personal homepages (humans can recognize them from the context or by semantic knowledge). This problem may be solved by pattern matching to some extent. For example, some names always have a title before them, such as Dr., Mr., Ms etc. Of course, if there is no obvious pattern, it is hard to handle the problem. Also, simple concepts, e.g. name, address, place or prices, could be easily identified by NLP techniques. But complex concepts, e.g. bibliography, would be hard. (2) The information denoted by a keyword may be a complex concept, so that our program cannot mine it completely. For example, the rate of a hotel usually
includes rates for single rooms, double rooms, adults or children. The mining of these kinds of data also requires semantic knowledge and NLP techniques. (3) Some information is represented as images and complex tables. These cannot be easily handled automatically (at least not by the present NetQL). In reality, only a highly intelligent human could recognize the relationships among such data. In short, we find that mining pages from a local or remote site is easier than global search, since the web pages of a single site are usually organized by the same institution and thus exhibit very similar stylistic properties. The irregular cases are web pages designed by different individuals; such web pages can be presented in very different styles. Despite this, many common structures and styles do appear in web sites which share the same theme in practice. For example, most professors' homepages have the following structure: name, title, address, biography, teaching, research interests, projects and publications. Therefore, information mining over the Internet is not impossible. Our second experiment focuses on the performance of local and remote search. The results show that the optimizing technique is useful in improving performance [14]. Although it is not effective in all cases, it out-performs exhaustive blind searching in almost all cases. Usually, a local search over 10,000 pages or more can be done in a couple of minutes. Under this time frame, the desired information can be found after navigating 5 pages from the starting point on average. If optimization is applied, the search time is only a few minutes with an average navigation length of 8. For a remote search, performance depends heavily on the speed and load of the Internet. Our experience shows that about 1000 pages are accessed from the US and Canada within a couple of minutes without optimization. On average, only 3 pages are visited from the starting point for a medium-size site which contains around 10,000 pages. However, if the optimizing technique is used, it is possible to obtain data at level 5 or 6 in a few minutes.
6 Conclusion and Further Work
An intelligent query language over the WWW, NetQL, and its implementation are presented in this paper. Rather than developing yet another powerful language, we focus on the problems ignored by other languages. The main contributions of NetQL are: (1) it provides a novel approach to extract information from irregular textual web pages; (2) it supports various methods to control the complexity of queries. Future work will focus on the following questions:
– Is it possible to extract information by semantic knowledge, e.g. name, address or biography?
– Are there other heuristic rules for web queries?
– Which method is the best for semantic similarity in the context of the WWW?
In short, structure-based web querying is a new area and current solutions are incomplete. There is a lot of room for further research.
References
1. B. Adelberg: NoDOSE - A tool for semi-automatically extracting structured and semistructured data from text documents. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1998
2. N. Ashish and C. Knoblock: Wrapper generation for semi-structured Internet sources. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
3. P. Atzeni, G. Mecca and P. Merialdo: Semistructured and structured data in the Web: going back and forth. In 1st Workshop on Management of Semistructured Data, 1997
4. M. Costantino, R.G. Morgan, R.J. Collingham and R. Garigliano: Natural language processing and information extraction: Qualitative analysis of financial news articles. In Proc. of the Conf. on Computational Intelligence for Financial Engineering, 1997
5. W.N. Francis and H. Kucera: Frequency analysis of English usage: lexicon and grammar. Houghton Mifflin, 1982
6. M. Fernandez and D. Suciu: Query optimizations for semi-structured data using graph schema. In ICDE’98
7. R. Goldman and J. Widom: Interactive query and search in semistructured databases. Technical Report, Stanford University, 1998
8. J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha and A. Crespo: Extracting semistructured information from the Web. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
9. J. Jiang and D. Conrath: Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of Int’l Conf. on Research on Computational Linguistics, Taiwan, 1997
10. H. Kozima and T. Furugori: Similarity between words computed by spreading activation on an English dictionary. In Proc. of EACL-93 (Utrecht), pp. 232-239, 1993
11. D. Konopnicki and O. Shmueli: W3QS: A query system for the world wide web. In VLDB’95, Zurich, 1995, pages 54-65
12. Z. Lacroix, A. Sahuguet, R. Chandrasekar and B. Srinivas: A novel approach to querying the Web: Integrating Retrieval and Browsing. ER’97 Workshop on Conceptual Modeling for Multimedia Information Seeking, 1997
13. L.V.S. Lakshmanan, F. Sadri and I.N. Subramanian: A declarative language for querying and restructuring the Web. In Proc. of 6th International Workshop on Research Issues in Data Engineering, RIDE’96, New Orleans, February 1996
14. M. Liu: NetQL: an intelligent web query language. Master’s Thesis, University of Regina
15. G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller: Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 1993
16. A. Mendelzon, G. Mihaila and T. Milo: Querying the World Wide Web. In 1st Int. Conf. on Parallel and Distributed Information Systems, 1996
17. D. Smith and M. Lopez: Information extraction for semi-structured documents. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
18. S. Soderland: Learning to extract text-based information from the world wide web. In Proc. of 3rd International Conf. on Knowledge Discovery and Data Mining (KDD-97), 1997
19. A.F. Smeaton and I. Quigley: Experiments on using semantic distances between words in image caption retrieval. In SIGIR’96
Integrated Approach for Modelling of Semantic and Pragmatic Dependencies of Information Systems
Remigijus Gustas
Department of Information Technology, University of Karlstad
S-651 88 Karlstad, Sweden
[email protected]
Abstract. Traditional semantic models are based on entity notations provided with several kinds of links. Links are established to capture semantic detail about relationships among concepts. The ability to describe a process in a clear and sufficiently rich way is acknowledged as crucial to conceptual modelling. Current workflow models used in business process re-engineering offer limited analytical capabilities. Entity-Relationship models and Data Flow Diagrams are closer to the technical system development stage and, therefore, they do not capture organisational aspects. Although object-oriented models are quite comprehensible for users, they are not provided with rules of reasoning or a complete integration between static and dynamic diagrams. The ultimate objective of this paper is to introduce principles of integration for different classes of semantic and pragmatic representations.
1 Introduction
Any information system activity needs to be defined in the context of organisational processes. Thus, two levels of information system models are necessary [5]. The organisational level model defines an ideal system structure; further on, it will be referred to as an enterprise model. The implementation level determines data processing needs for a specific application. Most conventional semantic data models are heavily centred around the implementation level. The way in which organisational activity is conceptualised will define what information system is appropriate. Activities in an organisational system can be expressed in terms of actions of communication and collaboration between actors of an information system. This kind of knowledge is crucial to reason about purposeful implications of organisational processes. Most semantic models that have been used in traditional information system modelling approaches neglect such essential aspects of communication.
Enterprise engineering is a branch of requirements engineering which deals with an early phase of integrated information system development. At the same time, it can be viewed as an extension and generalisation of the system analysis activity. Enterprise
modelling takes place in the early and middle phases of the information system development life cycle. The most difficult part of it is arriving at a complete, integrated and consistent description of a new system, which is sometimes known as a conceptual, semantic or requirements model. Despite the apparent clarity of the semantic models used by various Object-Oriented methods and CASE tools, research has shown that a large part of maintenance costs can be attributed to improper enterprise modelling or to misconception of the real requirements [15].
Various graphical diagrams [19] are used to define the semantics of information systems. It is obvious that all notations have been designed to describe one or a few, but not all, aspects of information systems. This means that information system models should comprise a combination of several notations, each for some particular aspect. This may lead to a difficult question: ‘how is it possible to use several notations in a complementary way to develop clear models?’ [14]. More importantly, it is often possible to employ a notation to describe some other aspects than those it has been designed for. The solution could be found in the identification of a set of basic semantic and pragmatic modelling primitives that are adequate to analyse static and dynamic aspects of processes.
In this study we present and analyse a set of abstractions that can be considered a necessary basis to build an integrated enterprise model. The focus of this approach is on modelling primitives which not only take into account the semantic models of traditional approaches, but also put the communication aspect into the foreground of information system modelling. Integration of semantic, pragmatic and non-traditional communication dependencies is considered the most important feature of the suggested framework. Such enterprise models can be useful for the purpose of understanding and reasoning, which is critical to the success of conceptual engineering activity in many areas.
2 Pragmatic Dependencies
The starting point in business process re-engineering research is a set of initial requirement statements that express the wishes of stakeholders about a new organisation of the system. These initial requirements are usually presented as a natural language text that is often ambiguous, incomplete and inconsistent. Although processes of information system adaptation can be driven by these pragmatic statements, traditional semantic models usually do not take into account dependencies between activities and goals. Moreover, interdependencies between goal models and process models are usually defined in a very fuzzy way. Some requirements engineering methodologies have already identified the problem of making system requirements precise, unambiguous, complete and consistent. The process of bridging goals to information system specifications was entitled ‘from fuzzy to formal’ [7]. The predominance of fuzzy thinking in goal modelling has led to a serious lack of interaction between semantic and pragmatic descriptions of processes. More often, a process goal is merely postulated rather than expressed in terms of semantic diagrams.
Such ignorance of the power of goals has been recognised in the area of cognitive modelling. Goals are usually understood as states of affairs or situations that should be reached or at least striven for. Situations result from actions [6]. Goals can be defined as desirable situations that are interpreted by an actor as final. Such pragmatic notions as objective, vision, goal, etc. express the wishes and desires of actors concerning the system they design or manage. A goal hierarchy can be formed of interconnected goals on different levels of abstraction, ranging from high-level business objectives to low-level operational goals. Usually, objectives at the bottom level are situations that can be defined in terms of various semantic dependencies. On neighbouring levels of decomposition, goals are related by the composition dependency.
The opposite of a goal is a problem. A problem describes a situation which is not desirable. The notion of a problem is used to refer to a problematic situation of an actor. Semantic specification of the problem is regarded as a part of the actual specification. This means that the problem cannot be identified without stating the goal. If the designer has no predefined goal, then the problem does not make sense [9]. Usually, problematic situations denote restrictions that actors try to avoid.
A pragmatic link between an actor and a desired situation is referred to as a goal dependency (g). The problem link (p) is used to refer to a problematic situation of an actor. The goal and problem dependencies can be used to refer to desirable or undesirable states or situations. In the following chapters, two pragmatic dependencies of influence between goals will be formally defined. They are referred to as the negative influence dependency (-) and the positive influence dependency (+). The negative influence dependency from A to B (A - B) indicates that the goal A hinders the achievement of the goal B. The positive influence dependency (+) between two goals means that the achievement of the first goal would contribute to the achievement of the second. The negative and positive influences between goals are imposed by the conflicting interests of actors.
3 Semantic Dependencies
Most semantic modelling techniques are based on entity notations provided with several kinds of links. Links are established to capture semantic detail about static and dynamic relationships. Typically, semantic constraints have to be general enough to specify dependencies of a system in different perspectives such as the "why", "what", "who", "where", "when" and "how" [16]. Semantic constraints can be described by using intensional and extensional [13] dependencies of various kinds. The semantics of static intensional dependencies can be defined as cardinalities, represented by the minimum and maximum numbers of individuals of concepts. Extensional dependencies usually specify constraints between classes and instances. Static dependencies of concepts stem from various semantic data models. Graphical notations of several associations in Martin/Odell’s style [13] are represented in Fig. 2.1.
Fig. 2.1. Graphical notation of cardinality constraints: a) an association between A and B with cardinality (0,1;?,?); b) (1,1;?,?); c) (0,*;?,?); d) (1,*;?,?). Note: the meaning of * is ‘many’ (i.e. more than one) and the meaning of ? is ‘not defined’.
Notations that are commonly used in the initial phase of concept modelling have to provide a clear understanding of the cardinality constraints in both directions. The most common static dependencies that may be specified between any two concepts A and B are as follows:
– (1,1;0,1) - Injection dependency, denoted by A ⇒ B;
– (1,1;1,1) - Bijection dependency;
– (1,1;0,*) - Total functional dependency;
– (1,*;1,1) - Surjection dependency;
– (1,*;0,1) - Surjective partial functional dependency (A ⇒> B);
– (1,*;1,*) - Mutual multivalued dependency;
– (1,*;0,*) - Total multivalued dependency;
– (0,1;0,1) - Partial injectional dependency (A |⇒ B);
– (0,1;0,*) - Functional (partial) dependency;
– (0,*;0,*) - Multivalued (partial) dependency.
Many concepts have common constraints. The similarities can be shared between concepts by extracting them and attaching them to a more general concept. In such a way, similar constraints can be inherited by several concepts. One way to represent generalisation hierarchies is by using the inheritance dependency, which is denoted by a solid line arrow. By means of inheritance, similarities of concepts are shown. Aggregation is a conceptual operation which is useful for the formation of a concept, interpreted as a whole, from other concepts that may be viewed as its component parts. Aggregation can be specified by a composition dependency. In the area of artificial intelligence, composition is sometimes referred to as a ‘part of’ relation [17]. The semantics of the composition link can be completed by cardinality constraints.
Most behavioural diagrams put into the foreground a dynamic link, which is very similar to the state transition of a finite state machine. Such a transition link constitutes a modelling basis of various object-oriented diagrams that are used for the specification of object behaviour. In our approach, a state transition is defined in terms of two states. If two states are connected by the transition dependency, then by the action an object can be transferred from the actual state to the next state. Actual static constraints define a set of conditions for an object in the current state. The
expected state defines a set of conditions for an object in the desired state [11]. The graphical illustration of the transition dependency is presented in Fig. 2.2.
Fig. 2.2. Transition dependency between actual and desired situation (ACTUAL STATE → ACTION → NEW STATE). Actions are represented by ellipses.
States result from actions [6]. The specification of actual and desired states is crucial to the understanding of action semantics. Any state transition dependency indicates a possibility to change a state and, vice versa, a possibility to accomplish an action can be specified by a state transition dependency, i.e. (ACTUAL STATE) → (NEW STATE). Communication dependencies between two actors involved in a particular action describe the "who" perspective. Such a dependency link between two actors (agent and recipient) indicates that one actor depends on the other for some flow. The agent can be any actor who is able to send a flow, for example an individual, group, role, position, organisation, machine, information system, etc. The graphical notation of the flow dependency between an agent and a recipient (AGENT FLOW RECIPIENT) is represented in Fig. 2.3.
Fig. 2.3. Flow dependency
The flow dependency represents a transfer of the ownership right for a particular object of FLOW. Before the flow is sent, it is owned by the AGENT; later, depending on whether the flow is accepted or not, the ownership is transferred to the RECIPIENT. Flows can be decisions, information or material. Recipients, by depending on agents, are able to achieve their goals.
4 Interaction between Static and Behavioural Dependencies
Any flow dependency between two actors may imply a communicative action as well. It is then considered to be both an action and a communication flow. It should be noted that many approaches in the area of business process re-engineering do not view actions in these two different perspectives [8]. The cohesion of action and flow results in a more complex abstraction. Therefore, the flow dependency link between two actors specifies that a recipient depends on an agent not only for a specific flow, but also for an action. Actors are specific sub-systems of the overall system. The semantic link from actor to action (ACTOR → ACTION) indicates that the action can be initiated
by any individual which belongs to the class ACTOR. The presented dependency may define co-ordination, decision, control, etc. The dependency link from action to actor (ACTION → ACTOR) means that an actor will be affected by the executed action. Often this dependency link is combined with the flow that is desired by the dependent actor. The graphical notation of the communicative action dependency is represented in Fig. 4.1.
Fig. 4.1. Communicative action dependency between two actors (an AGENT performs an ACTION that sends a FLOW to a RECIPIENT)
The underlying concepts and dependencies play an important role in various business modelling approaches which are based on communication [21]. A typical action workflow loop can be defined in terms of two communicative action dependencies. Sequences of communicative actions in workflow models [1] may serve as a basis to define obligations, authorisations and contracts [20]. An example of a typical workflow loop, defined in terms of two communicative actions in opposite directions, is represented in Fig. 4.2.
Fig. 4.2. Action workflow loop between Customer and Supplier (the Customer sends an ORDER to the Supplier by the Order action, and the Supplier sends an ITEM to the Customer by the Supply action)
This graphical example shows that a customer is authorised to send an order to a supplier by using the predefined ordering action. If this order is accepted, then the supplier is obliged to supply an item. The supplier is also responsible for following a contract, which can be defined in terms of relationships between the incoming (ORDER) and outgoing (ITEM) flows. The contract in the presented example can be defined as follows: if the CUSTOMER sends an ORDER to the SUPPLIER, then the SUPPLIER sends an ITEM to the CUSTOMER.
An agent carries out a specific action in order to achieve a predefined state. The existence of an object x in some state also imposes the fulfilment of a set of static dependencies between states. Two kinds of changes, disconnection and connection, occur concerning the associations of an object during a transition of the object from one state to another. A disconnection removes an existing association from existence and a connection adds a new association. The disconnected and connected associations are represented by entirely different relationships of the two states. The definition of such a noteworthy difference between the current and new state of an action is very important to understand the nature of the action. Only those processes that conclude with a state change, expressed by a disconnection or connection event, can
be interpreted as actions. For instance, a graphical example of the noteworthy difference among three different states, which are important to understand the actions of customer and supplier, is illustrated by Fig. 4.3.
Fig. 4.3. Constraints to Order and to Supply (states such as Product, Product Needs to be Ordered, Ordered Product, Product Needs not to be Ordered, Not Supplied Item and Supplied Item, connected by the Order and Supply actions and the Order and Item flows between Customer and Supplier). Note: inheritance in this paper is used in a non-traditional way. A strict definition of the inheritance dependency is presented in chapter 7.
A transition from the state ‘Product Needs to be Ordered’ to the state ‘Ordered Product’ can be performed by the action of Ordering. The existence of an object x in the state ‘Ordered Product’ implies that it may have one or many ‘Not Supplied Item’ objects associated with the Product x. The noteworthy changes performed by actions are important for actors, because they create a need to react. The reaction mechanism is represented in terms of communication flows. For instance, if a ‘Product Needs to be Ordered’, then a Customer has to react appropriately. In this particular situation, the Customer is supposed to send an Order to a Supplier. If an item is a ‘Not Supplied Item’, then the Supplier is supposed to Supply the Item to the Customer. The semantic difference between the two states ‘Not Supplied Item’ and ‘Supplied Item’ is defined in terms of two semantic links (see ‘Ordered Product’ and ‘Product Needs not to be Ordered’).
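The reaction mechanism can be made concrete with a small sketch; the Python below is a toy model of the Order/Supply loop (the names and state sets are ours), in which each action disconnects one association and connects another, in the spirit of the description above.

from dataclasses import dataclass, field
from typing import Set

@dataclass
class LoopState:
    # Toy model of the Customer/Supplier loop.
    needs_order: Set[str] = field(default_factory=set)    # 'Product Needs to be Ordered'
    ordered: Set[str] = field(default_factory=set)         # 'Ordered Product'
    not_supplied: Set[str] = field(default_factory=set)    # 'Not Supplied Item'

def order(state, product):
    # The Order action: disconnect 'needs to be ordered', connect 'ordered'
    # and create a 'not supplied item' the Supplier has to react to.
    state.needs_order.discard(product)
    state.ordered.add(product)
    state.not_supplied.add(product)

def supply(state, product):
    # The Supply action: the item is no longer a 'not supplied item'.
    state.not_supplied.discard(product)

s = LoopState(needs_order={"printer"})
order(s, "printer")    # the Customer reacts to 'Product Needs to be Ordered'
supply(s, "printer")   # the Supplier reacts to 'Not Supplied Item'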
5 Dependencies in an Extended Action Workflow Loop
A typical action workflow loop includes two communication flows sent in opposite directions. The customer is the actor who initiates the workflow loop to achieve his goal. The receiver of a flow is a performer. Flow dependencies in two opposite directions imply that certain relationships are established between the two actors. In
reality, it represents either a commitment or a contract [8] between the customer and the performer. Any business process defines a set of responsibilities as well as a set of requests that a customer can make of a performer. Usually, the customer is the actor who initiates the action in order to achieve his goal, which is referred to by a desired situation (DS2). The goal corresponds to the final situation in a communicative action workflow loop. By using a flow in the forward direction, a customer asks a performer for some action. If the request corresponds to a contract, it will always create a situation that is an opportunity for the performer. Pragmatic dependencies are represented graphically in Fig. 5.1.
Cust omer
o Problematic Situation (PS1)
Action of Customer
Desired Situation (DS1)
g Flow 2
Flow 1
Problematic Situation (PS2)
p
Perfor mer
o Action of Supplier
Desired Situation (DS2)
Fig. 5.1. Graphical representation of pragmatic dependencies
A pragmatic link between an actor and his desired situation is referred to as a goal dependency (g). The problem link (p) is used to refer to a problematic situation of an actor. An opportunity link (o) refers to an intermediate situation between a problematic and a desired one. If an actor has the social power to activate an action that changes a situation from the problematic to the intermediate one, then this intermediate situation may help some other actor to create new desirable situations. According to the presented schema, a customer has the possibility to initiate the action by sending Flow 1 to a performer in order to avoid a problem denoted by a problematic situation (Customer p PS1). If the performer is satisfied by the flow (it is accepted), then the problematic situation is replaced, by the action of the customer, with a desired situation (DS1). In the next step, the performer, by sending Flow 2, has the possibility to change his problematic situation (PS2) to the desired situation (DS2), which is regarded as a goal of the customer (Customer g DS2). A graphical example of the semantic and pragmatic dependencies in a typical action workflow loop is depicted in Fig. 5.2.
Fig. 5.2. Example of dependencies in an action workflow loop (Customer and Supplier with states such as Not Ordered Product, Ordered Product, Available Item and Supplied Item, the Order and Supply actions, and the problem (p), opportunity (o) and goal (g) links)
The satisfaction of actors is closely related to their goals and problems. In order to activate an action, an agent has to know about the opportunities available to a recipient. If a recipient views the intermediate situation as an opportunity to achieve his goal or to avoid a problem, then the flow that has been sent by the agent will be acceptable to the recipient.
6 Dependencies of Positive and Negative Influence
Any two goals can be contradictory if one of them is interpreted as a problem for reaching the other goal. A contradictory goal can influence negatively, or hinder, the achievement of a desirable situation. This means that the interpretation of goals and problems is relative and depends on actor objectives. The same situation can be interpreted as a goal by one actor and as a problem by another. The negative influence dependency has been introduced in F3 [7] to specify contradictions between goals. The negative influence dependency from A to B (A - B) indicates that goal A hinders the achievement of goal B. Conditions for the existence of a negative influence dependency between two situations are as follows: if ACTOR p S1 and ACTOR g S4, then S1 - S4. This axiom specifies that a problematic situation (S1) hinders the achievement of a desired situation (S4). Moreover, in the context of an opportunity (S2), the following axiom is true: if ACTOR p S1, ACTOR o S2 and ACTOR g S4, then S2 - S1 and S2 + S4. According to this definition, an opportunity must influence negatively a problematic situation and influence positively a desired situation. The positive influence dependency between two goals means that the achievement of one goal would contribute to the achievement of the second. It should be noted that
130
R. Gustas
anything that influences negatively a problematic situation may be considered as an opportunity for some actor, i.e. if ACTOR p S3 and S2 - S3, then ACTOR o S2. Anything that influences positively a desired situation may be considered as an opportunity as well, i.e. if ACTOR g S5 and S4 + S5, then ACTOR o S4. The negative influence dependency is useful to express contradictions between the goals of various actors. If one of the goals hinders the achievement of the other, then these goals are in conflict. The positive influence dependency between two goals indicates that the achievement of one goal helps to achieve the second. It should be noted that additional pragmatic dependencies are derived according to the following inference rules:
if A + B and B - C then A - C,
if A - B and B - C then A + C,
if A + B and B + C then A + C,
if A is a component of B (the goal composition dependency [12]) then A + B.
Semantic and pragmatic dependencies in an action workflow loop are very important to analyse the viability of business processes. It should be noted that the viability of a single communicative action dependency between two actors guarantees that the desired situations create new possibilities for the recipients of flows. The customer tries to initiate an action because he wants to avoid a problematic situation or to achieve a new desired situation. By depending on a performer, a customer is able to achieve a situation that cannot be reached without the involvement of that specific performer. At the same time, if the performer fails to deliver the flow to the customer, then the customer becomes vulnerable to the failure. The negative and positive dependencies between various situations are imposed by the different intentions of actors. If influences in the action workflow loop are incompatible, i.e. A + B and A - B, then this situation is referred to as a contradiction. By using a set of influences between situations from the point of view of different actors, the contradictory goals can be identified. It is not difficult to see that in the example of the previous chapter the pragmatic dependencies are not contradictory.
An overall set of semantic and pragmatic dependencies of a particular process constitutes a formal basis for inconsistency analysis. Inconsistencies may be eliminated through negotiation among actors, by disregarding some of the goals, or by disregarding some of the actors. If a cost-effective development action exists, inconsistency among goals may serve as a driving force for business process re-engineering. Inconsistencies between actor goals in the context of the same process mean that one of the actors is vulnerable to failure in the achievement of his goal. Consistency of the pragmatic dependencies in the action workflow loop guarantees that the interests of the two actors are not conflicting. If the interests of actors are in conflict, then the action workflow loop may not be viable. The viability of action workflow loops can be studied in terms of semantically complete diagrams.
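The first three inference rules can be applied mechanically; the following Python sketch (ours; the composition rule is omitted) closes a set of influence triples under them and reports contradictions of the form A + B together with A - B.

from itertools import product

def derive_influences(influences):
    # Close a set of (goal, sign, goal) triples under the rules:
    # '+ then -' gives '-', '- then -' gives '+', '+ then +' gives '+'.
    compose = {("+", "-"): "-", ("-", "-"): "+", ("+", "+"): "+"}
    closed = set(influences)
    changed = True
    while changed:
        changed = False
        for (a, s1, b), (b2, s2, c) in product(list(closed), repeat=2):
            sign = compose.get((s1, s2))
            if b == b2 and sign and (a, sign, c) not in closed:
                closed.add((a, sign, c))
                changed = True
    return closed

def contradictions(influences):
    # A contradiction is a pair A + B and A - B in the closure.
    closed = derive_influences(influences)
    return {(a, b) for (a, s, b) in closed if s == "+" and (a, "-", b) in closed}

# The influences of the workflow loop example: the opportunity S2 hinders the
# problem S1 and helps the goal S4, while S1 hinders S4.
facts = {("S2", "-", "S1"), ("S2", "+", "S4"), ("S1", "-", "S4")}
print(contradictions(facts))   # set(): the loop is not contradictory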
7 Semantic Incompleteness of Diagrams
The way people normally analyse systems is by reasoning on the basis of a model of a particular part of the system. Many analysts in the area of information system development define their systems in terms of initial conceptual models, and later extend them by making a whole lot of assumptions. Very often these conceptual models are quite vague, for several reasons. Sometimes, the dependencies of conceptual diagrams are not defined strictly and can be interpreted ambiguously. Even if the diagrams developed by experts are presented in a formal way, the system model may still not be clear enough. This can happen because the description of the system is incomplete. Elimination of semantic incompleteness in a diagram, by refinement of the relations among concepts, is important if system analysts wish to reason automatically about the expected scenarios, contingent actions and opportunities available in a particular business process. An applicable set of dependencies allows us to avoid semantic holes [10].
In this study of conceptual dependencies, we concentrate on a particular subset of semantic links, which is referred to as totally applicable dependencies. Applicability of dependencies can be achieved through appropriate transformations of concept diagrams. Some methods call this process of change strengthening or restricting [2], [3]. In the object-oriented approach, the transformation process is referred to as sharpening the meaning of concepts [13]. It improves the ability to understand and communicate conceptual models. Transformations of diagrams have mostly been studied in the context of static dependencies. In this approach, we have introduced a common basis to deal with both the static and the behavioural parts of representations. It means that semantic transformations can take both aspects of the information system description into account.
Semantic relationships of an information system can be specified by using two kinds of abstractions: aggregation and generalisation. The abstraction of aggregation is based on the presented set of totally applicable binary dependencies. These links are as follows: the injection dependency (⇒), the surjective partial functional dependency (⇒>), the total functional dependency, the composition dependency, the communication dependency and the transition dependency. These dependencies will be referred to as basic. The generalisation abstraction is based on the inheritance dependency. Inheritance links can be of various kinds. Although the inheritance constructs are well understood, there is no complete agreement on the interplay between the inheritance dependency and the other types of basic constraints of semantic models. For example, some researchers understand inheritance in the same way as the inclusion dependency. To eliminate this ambiguity, we define the inheritance dependency in terms of the presented basic constraints. Let Ad be the set of static and dynamic semantic links that are specified for the concepts dependent on A. The inheritance dependency from concept A to B holds if and only if A ⊆ B and Bd ⊆ Ad. The inheritance is characterised by seven axioms, each stating that a dependency of B is propagated to A; for instance,
2) if A inherits B and B ⇒ C, then A ⇒ C;
5) if A inherits B and B ⇒> C, then A ⇒> C;
and, analogously, axioms 3), 4), 6) and 7) state that the total functional, composition, communication and transition dependencies are propagated along inheritance in the same way: if A inherits B and B is linked to C by one of these basic dependencies, then A is linked to C by the same dependency.

Inheritance is defined in terms of two abstractions: intensional and extensional. If A inherits B, then the structure (intension) of concept B must be included as part of the intensional structure of concept A. In the extensional sense, inheritance is defined as follows: if x ∈ B and B inherits C, then x ∈ C. It should be noted that the presented definition is more general than the one assumed in the object-oriented approach. For instance, concepts A and B can be interpreted as categories of states and actors.

Any communicative action is defined unambiguously if and only if it is expressed in terms of applicable constraints. The two states of a semantically unambiguous action must be connected to other concepts by total static dependencies. As far as actor communication links are concerned, the flow dependency from an agent to a recipient is considered an applicable constraint as well. The same condition holds for a current state that is linked by a transition dependency to a new state. It means that any object which belongs to the current state is applicable for the specified transition link. The presented set of totally applicable dependencies is useful for assessing the semantic ambiguity [10] of specifications and for reasoning about a particular part of the system. Very often information system specifications are quite vague because some of the semantic dependencies are optional. If the diagrams can be defined in terms of basic dependencies, then certain formal inference rules may be applied to derive additional semantic links that can be used to check the semantic consistency of information system requirements at the enterprise level.
8 Conclusions

The interplay between semantic and pragmatic dependencies lies at the foreground of the suggested framework. In this approach, we have shown how to bring various semantic diagrams together and how to combine insights from the point of view of different goals. The generic framework attempts to bridge the goals of various actors and the way various business processes are described. Goals and desires of actors constitute an important part of knowledge about business processes. The intentional relationships among actors are viewed with potentially common and conflicting interests. The presented pragmatic dependencies can explain the freedom of a specific actor and the extent to which actors are exposed to a danger. The usefulness of a great number of semantic dependencies in the area of information system design is an open problem. For instance, many dependencies introduced in database theory encounter problems with missing attribute values. These problems result from the fact that either the instance of the attribute is
temporarily unknown but applicable, or that its value can never be known because the attribute is not applicable for a specific instance [4]. For the unambiguous definition of concepts, only applicable dependencies can be used. Despite the strictness of this requirement, it allows us to discover contingent actions in the semantic diagrams and to introduce rules of reasoning [10] at the enterprise level. The presented actor dependencies constitute a unified basis for modelling dynamic relationships and can be regarded as an extension and integration of state transition and interaction diagrams. Actor goal dependencies constitute a unified basis for modelling pragmatic relations that are able to define actor intentions. Goals justify and explain the presence of the semantic dependencies, which are used to specify components of information system requirements. Thus, such an integrated approach offers a novel perspective on the semantic analysis of information systems. The main difference between this framework and the approach of Yu and Mylopoulos [18] is that any actor dependency may be considered at the same time to be both an action and a goal dependency. The action dependency is defined in terms of state transition and flow dependencies. It has also been shown how the negative and positive influence dependencies, introduced in F3 [7], can be formally defined in terms of the semantic and pragmatic dependencies.

There is a growing interest in integrating information system development methodologies from different areas such as requirements engineering, method engineering, workflow management, business process modelling, the object-oriented approach, etc. Many system analysts recognise that it is not enough to describe the semantics of an information system by concentrating on just one of the methods. When re-engineering information systems, most models tend to neglect the communication aspects among the several actors of an organisation. Dependencies of communication and co-operation between actors and their actions describe a very important part of knowledge about business processes. Unfortunately, communication approaches often neglect some behavioural aspects of system modelling that are basic in information system engineering. The presented set of totally applicable dependencies can be viewed as an integrated semantic modelling technique to specify the deeper structures of relationships among concepts. The suggested generic framework focuses on the modelling of static and dynamic constraints, where several actors co-operate to achieve new desired states. The basic dependencies provide a uniform formal basis in the area of concurrent business process modelling, analysis and integration. The purpose of such a basis is that eventually information system diagrams can be used as a tool to assist reasoning and to validate enterprise models before they are implemented.
References 1. Action Technologies. Action Workflow Analysis Users Guide. Action Technologies, 1993. 2. A T Borgida. Generalisation/Specialisation as a Basis for Software Specification. On Conceptual Modelling, M Brodie, J Mylopoulos, J W Schmidt (eds.), Springer-Verlag, New York, 1984, pp.87-112.
3. R Brachman, J G Schmolze. An Overview of the KLONE Knowledge Representation System. Cognitive science, 9(2), pp. 171-212, 1985. 4. E F Codd. The Relational Model for Database Management. Addison-Wesley Publ. Co., 1990. 5. G B Davis, M Olson. Management Information Systems. McGraw Hill, New York, 1985. 6. E D Falkenberg et al. A Framework of Information System Concepts. The Report of the IFIP WG8.1 Task Group FRISCO, 1996. 7. F3 Consortium. 'F3 Reference Manual (Esprit III Project 6612)', SISU, Kista, Sweden, 1994. 8. G Goldkuhl. Information as Action and Communication. The Infological Equation, Goteborg University, Sweden, pp. 63-79, 1995. 9. R Gustas, J Bubenko jr., B Wangler. Goal Driven Enterprise Modelling: Bridging Pragmatic and Semantic Descriptions of Information Systems. Information Modelling and Knowledge Bases VII, Y Tanaka, H Kangassalo, H Jaakola, A Yamamoto (eds.), IOS Press, 1996 , pp. 73 -91. 10. R Gustas. Towards Understanding and Formal Definition of Conceptual Constraints. Proc. of the European-Japanese seminar on Information Modelling and Knowledge Bases VI, 1994, IOS Press, pp. 381-399. 11. R Gustas. A Basis for Integration within Enterprise Modelling. Second Int. Conference on Concurrent Engineering: Research and Applications, Washington, DC Area, August 23-25, 1995, pp. 107-120. 12. R Gustas. A Framework for Description of Pragmatic Dependencies and Inconsistency of Goals. Proc. of the second Int. conference on the Design of Cooperative Systems, June 1214, 1996, Juan-Les-Pins, France, pp. 625-643. 13. J Martin, J J Odell. Object-Oriented Methods: Foundation. Prentice-Hall, New Jersey, 1995. 14. W E Riddle. Fundamental Process Modelling Concepts. NSF Workshop on Workflow and Process Automation in Information Systems, May 8-10, 1996. 15. K Siau, Y Wand, I Benbasat. The Relative Importance of Structural Constraints and Surface Semantics in Information Modelling. Information Systems, Vol. 22, No 2/3, pp 155-170, 1997. 16. J F Sowa, J A Zachman. Extending and Formalizing the Framework for Information Systems Architecture. IBM Systems Journal, 31(3), pp. 590 - 616, 1992. 17. V C Storey. Understanding Semantic Relationships. VLDB Journal, F Marianski (ed.), Vol.2, pp.455-487, 1993. 18. E Yu, J Mylopoulos. from E-R to 'A-R' - Modelling Strategic Actor Relationships for Business Process Reengineering. 13th Int. Conf. on the Entity - Relationship Approach, P Loucopoulos (ed.), Manchester, U.K., 1994. 19. E Yourdon. Modern Structured Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1989. 20. H Weigand, E Verharen, F Dignum. Dynamic Business Models as a basis for Interoperable Transaction Design. Information Systems, Vol. 22, No 2/3, pp 139-154, 1997. 21. T Winograd, F Flores. Understanding Computers and Cognition: A New Foundation for Design. Ablex Norwood, NJ, 1986.
Inference of Aggregate Relationships through Database Reverse Engineering Christian SOUTOU Université de Toulouse II, IUT ‘B’ Groupe de Recherche ICARE 1 Place Georges Brassens, 31703 Blagnac, FRANCE
Abstract. This paper presents a process to improve the reverse engineering of relational databases. Our process extracts the current aggregate relationships from a relational database through a combination of data dictionary, data schema and data instance analysis. The process we propose can refine the conceptual diagrams produced by commercial tools with reverse engineering options, such as Power AMC (Sybase) or Designer (Oracle).
1 Introduction

Reverse engineering is of increasing interest today in the context of migrating legacy database systems, in the development of multidatabase systems that integrate a variety of existing systems under a common data model interface, and in database evolution. The goal of a reverse engineering process is to produce a conceptual description of a given database, where the input may consist of any combination of source code description, a data dictionary, a database instance and application programs. Algorithms for converting a relational schema into the original Entity-Relationship (ER) model [4] can be found in [7]. [15] proposes a more powerful approach that considers inheritance and provides a very detailed classification of relations and attributes. Other approaches [1,10,11,12] make it possible to define an Extended ER diagram from a relational schema. Further approaches for reverse engineering of relational databases consider binary-relationship [18] or object-oriented models as target semantic data models [3,9,14,17,27]. Some tools exist [6,8,17,18]. Recent relational database reverse engineering methodologies use instances [5,16,19,22].

The majority of these existing approaches do not take into account the extraction of n-ary relationships of degree higher than two. Aggregate relationships are not well studied either in commercial tools or in reverse engineering methodologies. A reason may be that aggregate relationships would complicate already complex and large conceptual schemas. We do not think this is the case, because aggregate relationships can refine the n-ary relationships extracted in a reverse engineering process, so that the semantics of the
conceptual diagram resulting from our reverse engineering process is enhanced. [5] takes into account one kind of aggregate relationship in its process; we will see in Section 3 that several kinds of aggregate relationships exist. If we assume that the starting point of a reverse engineering process can be any combination of data description and application programs, the results may be inaccurate, as the data themselves would not be taken into account. We believe that, as the relational model is a data-based model, the very thing to start with should be considering instances together with data definition statements. This work continues the analysis in [23], which deals with the principles for reverse engineering n-ary relations (n ≥ 2).
2. Aggregate Relationships

The first paper dealing with aggregate relationships seems to be [20]. Aggregation is an abstraction which turns a relationship between objects into an aggregate object: a relationship between objects is regarded as a higher-level object. We consider here the aggregate relationships extracted from n-ary relationships.
2.1 Example

Let us consider the relational database in Fig. 1. Foreign key attributes are in brackets; we also use an arrow from the foreign key attribute(s) to the primary key attribute(s).

soft[softno#, softname]
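For concreteness, one possible DDL for this example is sketched below. Fig. 1 only lists the soft relation explicitly; the remaining attribute names are taken from Figs. 2-4, the key and foreign-key choices for owner and installation are assumptions consistent with the cardinalities in Fig. 2 and the aggregate inferred in Fig. 4, and the datatypes are placeholders.

CREATE TABLE soft   (softno  VARCHAR2(10) PRIMARY KEY, softname VARCHAR2(30));
CREATE TABLE server (servno  VARCHAR2(10) PRIMARY KEY, servtype VARCHAR2(40));
CREATE TABLE dept   (deptno  VARCHAR2(10) PRIMARY KEY, deptname VARCHAR2(30));

CREATE TABLE owner (
  softno VARCHAR2(10) REFERENCES soft,
  deptno VARCHAR2(10) REFERENCES dept,
  PRIMARY KEY (softno, deptno)
);

CREATE TABLE installation (
  softno   VARCHAR2(10),
  deptno   VARCHAR2(10),
  servno   VARCHAR2(10) NOT NULL REFERENCES server,
  dateinst DATE,
  PRIMARY KEY (softno, deptno),                  -- assumed key; reflects the 1,1 towards Server in Fig. 2
  FOREIGN KEY (softno, deptno) REFERENCES owner  -- assumed composite reference behind the aggregate of Fig. 4
);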
Although Chen was the first to publish his contribution in the ACM TODS journal [4], another paper also proposed a conceptual model [13]. The fundamental difference between these two approaches is the interpretation of cardinality constraints, particularly for n-ary relationships. The approach of Chen and its extensions [5,17,22,26] consider cardinality constraints based on the identifiers of relationships. The second approach [10,15,24] considers cardinality constraints based on participation constraints between entities and relationships. This approach is used by the majority of commercial tools and methodologies [2,25]. These two different approaches are neither complementary nor opposite [23]. We can note that the formalism of Chen [4] is more precise on the semantics of n-ary relationships. Indeed, each couple of cardinality constraints depends on the interaction with the other entities, whereas the other approach does not take this into account. Though the cardinality constraints
of binary relationships need only be inverted between the two different conceptual models, for n-ary relationships the semantics are not the same. See Figs. 2 and 3 for the conceptual diagrams produced with our previous methodology [23] from the instances in Appendix 1. Chen's formalism of the relationship 'Installation' indicates that for a given pair (department, soft) there is only one instance of server, for a given pair (soft, server) there are one or many departments, and for a given pair (server, department) there are one or many softs. The participation constraints of the relationship 'Installation' indicate that a soft and a server can be implicated in one or many installations, while a department can be implicated in many installations or in no installation at all.
Fig. 2. Chen's formalism
Fig. 3. Participation constraints
However, these conceptual diagrams do not represent any semantics between the relationship 'Installation' and the relationship 'Owner'. The binary relationship is semantically a subset of the ternary relationship. Song calls this type of binary relationship a Semantically Constraining Binary Relationship [21]. By looking at the data in Appendix 1, we can infer an aggregate relationship between these two relationships, see Fig. 4. We use the formalism based on participation constraints to describe aggregate relationships because this approach is used by the majority of commercial tools and methodologies. The aggregate relationship 'Installation' links the entity 'Server' to the relationship 'Owner', which we call the aggregate. It expresses the fact that an installation of a soft on a server is valid only if the department is the owner of the installed soft. This result improves the semantics of the conceptual schemes of Fig. 2 and Fig. 3.
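On the instance level, this containment can be checked with a query of the following kind (a sketch only; the column names are the ones assumed in the DDL sketch above). An empty result supports inferring the aggregate relationship of Fig. 4.

-- every (softno, deptno) pair occurring in installation must also occur in owner
SELECT i.softno, i.deptno
FROM   installation i
WHERE  NOT EXISTS (SELECT 1
                   FROM   owner o
                   WHERE  o.softno = i.softno
                   AND    o.deptno = i.deptno);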
2.2 Taxonomy
We use the following notations to describe aggregate relationships (Fig. 5): the aggregate relationship (A2) links an entity (E3) to an aggregate (A1); the identifiers of entities (e.g. a1#, b1#, c1#); the properties of entities (e.g. a2, ..., b2, ...); the properties of relationships (e.g. d1, ..., p1, ...). Minimum cardinality constraints (x, y, z, v) can be 0, 1 or N. Maximum cardinality constraints between the aggregate relationship and the aggregate (Z, V) can be 1 or N.
We can divide the aggregate relationships into four families according to the maximum cardinality constraint between the aggregate (A1) and the entity (E3). These families are N-N, 1-N, N-1 and 1-1. Indeed, the structure of the relations which represent the aggregate and the aggregate relationship has no influence upon the minimum cardinality constraints. Our process takes all these cases into account.
Fig. 4. Example of Aggregate Relationship and Aggregate
Fig. 5. Aggregate Relationship and Aggregate
2.3 Inverse Reference
An important factor to take into account is inverse references. An inverse reference exists between two relations when the first includes a foreign key referencing the second and vice versa. We can note that relational schemas which include inverse references represent, in the majority of cases, two aggregate relationships instead of one. Let us consider the following relational schema including an inverse reference, where the foreign key (driver) of mission corresponds to arrow 2 and the foreign key (caremp, datemp) of employee to arrow 1.

mission[carnumber#, datem#, km, (driver)]
employee[empno#, empname, (caremp, datemp)]

Fig. 6. Relational schema with one inverse reference
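As a hedged illustration (datatypes are placeholders; the paper gives only the relation schemas), declaring such an inverse reference requires one of the two foreign keys to be added after both tables exist, since each references the other:

CREATE TABLE employee (
  empno   VARCHAR2(10) PRIMARY KEY,
  empname VARCHAR2(30),
  caremp  VARCHAR2(10),
  datemp  DATE
);

CREATE TABLE mission (
  carnumber VARCHAR2(10),
  datem     DATE,
  km        NUMBER,
  driver    VARCHAR2(10) REFERENCES employee,
  PRIMARY KEY (carnumber, datem)
);

ALTER TABLE employee
  ADD FOREIGN KEY (caremp, datemp) REFERENCES mission;  -- the inverse reference

Because the schema itself does not fix how the two references interact, the data dictionary alone cannot tell whether it hides one or two aggregate relationships; this is exactly why the instances of Appendices 2 and 3 are needed.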
According to the instances in Appendix 2, this schema can represent either one aggregate relationship or two distinct aggregate relationships. The instances in Appendix 2 enable us to deduce two distinct aggregate relationships, 'AR1' and 'AR2', shown in Fig. 7. We could call 'AR1' 'occasionalpassenger' (a passenger who participates in only one mission) and 'AR2' 'driver'.
Fig. 7. Aggregate Relationships deduced from instances appendix 2
The instances in Appendix 3 enable us to deduce only one aggregate relationship, 'AR1'. We can see that data are required to infer a valid current conceptual schema.
Fig. 8. Aggregate Relationship deduced from instances appendix 3
3. Inferring Aggregate Relationships

For a given relational schema, we will see that there can exist many potential aggregate relationships. We call them potential because only the instances make it possible to infer the correct cardinality constraints. Conversely, for a given aggregate relationship, different relational schemas are possible.
3.1 Maximum Cardinality Constraints
Table 1 describes the maximum cardinality constraints of aggregate relationships derived from two relations. Only case (2) is taken into account in [5], as we said in the introduction. Case (3) is illustrated by Fig. 6; see the resulting aggregate relationships in Fig. 7 and Fig. 8. Case (3) can lead to five sub-cases: four with two aggregate relationships and one with a single aggregate relationship. We can note that the names of the aggregate relationships must be provided by human intervention.
Table 1. Maximum cardinality constraints of aggregate relationships from two relations
(maximum cardinality given as A1/E3)

(1) A1[a1#,b1#,d1...], E3[c1#,c2...,(a1,b1)]
    (a1,b1) non unique in E3   ->  N-1
    (a1,b1) unique in E3       ->  1-1

(2) A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...]
    c1 non unique in A1        ->  1-N
    c1 unique in A1            ->  1-1

(3) A1[a1#,b1#,d1..,(c1)], E3[c1#,c2...,(a1,b1)]  (inverse reference)
    Two aggregate relationships, one per arrow:
      relationship analogous to case (2):  c1 non unique in A1 -> 1-N;  c1 unique in A1 -> 1-1
      relationship analogous to case (1):  (a1,b1) non unique in E3 -> N-1;  (a1,b1) unique in E3 -> 1-1
      (the four combinations of these conditions give the four double sub-cases)
    Single aggregate relationship:
      (a1,b1,c1) are simultaneously in E3 and A1 -> 1-1
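The conditions of Table 1 are evaluated directly on the instances. A minimal sketch of such checks, with the generic relation and attribute names of the table (placeholders, not names from the running example):

-- is (a1, b1) unique in E3?  no rows returned means unique (1-1), otherwise N-1
SELECT a1, b1
FROM   E3
GROUP  BY a1, b1
HAVING COUNT(*) > 1;

-- is c1 unique in A1?  no rows returned means unique (1-1), otherwise 1-N
SELECT c1
FROM   A1
GROUP  BY c1
HAVING COUNT(*) > 1;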
Table 2 describes the maximum cardinality constraints of aggregate relationships derived from three relations. Due to space limitations we do not consider here the possible inverse references, but our process includes these cases. We can note that the name of the aggregate relationship is the name of A2. As an example of a combination of cases (5) and (3), let us consider the relation 'usualpassenger' added to the previous schema:

employee[empno#, empname, (caremp, datemp)]
mission[carnumber#, datem#, km, (driver)]
usualpassenger[(caruspg#, dateuspg#), (empuspg)#]
Fig. 9. Relational schema with inverse references
The instances of the relations 'employee' and 'mission' are given in Appendix 3, and we consider an example of instances for the relation 'usualpassenger'. We can then deduce the aggregate relationships of Fig. 10.
Table 2. Maximum cardinality constraints of aggregate relationships from three relations
(maximum cardinality given as A1/E3)

Schema with (a1,b1) in E3 and (c1) in A2:
    (a1,b1) non unique in E3   ->  N-1
    (a1,b1) unique in E3       ->  1-1
    c1 non unique in A2        ->  1-N
    c1 unique in A2            ->  1-1

(5) A2[(a1#,b1#),(c1#),p1..]:
    c1 non unique in A2 for a given (a1,b1) AND (a1,b1) non unique in A2 for a given c1  ->  N-N
    c1 unique in A2 for a given (a1,b1) AND (a1,b1) non unique in A2 for a given c1      ->  N-1
    c1 non unique in A2 for a given (a1,b1) AND (a1,b1) unique in A2 for a given c1      ->  1-N
    c1 unique in A2 for a given (a1,b1) AND (a1,b1) unique in A2 for a given c1          ->  1-1
Fig. 10. Aggregate Relationships deduced from instances appendix 3
3.2 Minimum Cardinality Constraints
Table 3 describes the minimum cardinality constraints of aggregate relationships derived from two relations. The sub-case of case (3) where the aggregate relationship is deduced from arrow 2 can be illustrated with the instances of Appendix 2; Fig. 7 shows 'AR2' as 'driver'. We can see from the instances that the minimum cardinality constraint on the side of 'E3' must be 0, because there exists a c1 (here 'driver') in 'E3' (here 'employee') that is not in 'A1' (here 'mission'): for example the employee 'A01'. According to the instances, the minimum cardinality constraint on the side of 'A1' must be 1, because there does not exist a NULL c1 (here 'driver') in 'A1' (here 'mission').
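Such conditions are the kind of checks that the final step of our process generates as SQL queries (Sect. 5). A sketch for the 'driver' example, with the attribute names of Fig. 6:

-- E3-side minimum is 0 if some employee never occurs as a driver
-- (for example employee 'A01' in Appendix 2); no rows here would give minimum 1 or N
SELECT e.empno
FROM   employee e
WHERE  NOT EXISTS (SELECT 1 FROM mission m WHERE m.driver = e.empno);

-- A1-side minimum is 0 only if a NULL driver occurs in mission; an empty result gives minimum 1
SELECT carnumber, datem
FROM   mission
WHERE  driver IS NULL;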
Table 3. Minimum cardinality constraints of aggregate relationships from two relations
(rows give the minimum cardinality and the condition on instances that yields it)

(1) A1[a1#,b1#,d1...], E3[c1#,c2...,(a1,b1)]
    E3 side:  0  It exists (a1,b1) NULL in E3
              1  It doesn't exist (a1,b1) NULL in E3
    A1 side:  0  It exists (a1,b1) in A1 which is not in E3
              1  Every (a1,b1) in A1 is also in E3
              N  (a1,b1) is neither NULL nor unique in E3

(2) A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...]
    E3 side:  0  It exists c1 in E3 and not in A1
              1  Every c1 in E3 is also in A1
              N  c1 is neither NULL nor unique in A1
    A1 side:  0  It exists c1 NULL in A1
              1  It doesn't exist c1 NULL in A1

(3) Two aggregate relationships, A1[a1#,b1#,d1..,(c1)], E3[c1#,c2...,(a1,b1)]
    relationship deduced from arrow 1:  cf. case (1)
    relationship deduced from arrow 2:  cf. case (2)

(3) Single aggregate relationship, A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...,(a1,b1)]
    E3 side:  0  It exists (a1,b1) NULL in E3
              1  It doesn't exist (a1,b1) NULL in E3
    A1 side:  0  It exists c1 NULL in A1
              1  It doesn't exist c1 NULL in A1
              N  Impossible: the maximum constraint is 1
Table 4 describes the minimum cardinality constraints of aggregate relationships derived from three relations. Case (6) can be illustrated with the instances of Appendix 2 and Appendix 3. Fig. 10 shows the aggregate relationship 'usualpassenger'. We can see from the instances that the minimum cardinality constraint on the side of 'E3' must be 0, because there exists a c1 (here 'empuspg') in 'E3' (here 'employee') that is not in 'A2' (here 'usualpassenger'): for example the employee 'A04'. According to the instances, the minimum cardinality constraint on the side of 'A2' must be 1, because c1 (here 'empuspg') is unique in A2 for a given (a1,b1) (here ('caruspg', 'dateuspg')): see the last two rows, for example.
4. Process

We suppose that there are no constraints on the uniqueness of attribute names, because we consider the constraint identifier in the data dictionary instead of the name of the attribute. Though we use Oracle, the method can be adopted for other relational database management systems having a data dictionary.
Table 4. Minimum cardinality constraints of aggregate relationships from three relations
(rows give the minimum cardinality and the condition on instances that yields it)

Schema with E3[c1#,c2...,(a1,b1)]:
    0  It exists (a1,b1) NULL in E3
    1  It doesn't exist (a1,b1) NULL in E3
    0  It exists (a1,b1) in A2 which is not in E3
    1  Every (a1,b1) in A2 is also in E3
    N  (a1,b1) is neither NULL nor unique in E3
    0  It exists c1 in E3 and not in A2
    1  Every c1 in E3 is also in A2
    N  c1 is neither NULL nor unique in A2
    0  It exists c1 NULL in A2
    1  It doesn't exist c1 NULL in A2

Schema where c1 belongs to the primary key of A2:
    0  It exists c1 in E3 and not in A2
    1  Every c1 in E3 is also in A2
    N  Every c1 in E3 is also in A2 AND (a1,b1) never unique in A2 for a given c1
    0  Impossible: c1 is primary key
    1  c1 unique in A2 for a given (a1,b1)
    N  c1 never unique in A2 for a given (a1,b1)
4.1 Data Dictionary in Input

The first step of our process consists in automatically selecting the aggregate relations from the data dictionary. We use a view of the data dictionary, which we call 'cross_references', as input for extracting the relations which compose the aggregate relationships. This view enables us to join attributes and tables with their foreign and primary keys. For our running example (Appendix 4), the content of this view is described in Appendix 5.
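One plausible definition of this view, assuming it is built over Oracle's user_constraints and user_cons_columns dictionary views, is sketched below; the column names approximate those used by the queries of Appendix 6 (the column called 'constraint' there appears as constraint_id here only to avoid the reserved word).

CREATE OR REPLACE VIEW cross_references AS
SELECT c.table_name        AS relation,
       c.constraint_name   AS constraint_id,
       c.constraint_type   AS type,            -- 'P' = primary key, 'R' = foreign key
       c.r_constraint_name AS ref_constraint,  -- for a foreign key, the referenced key
       cc.column_name,
       cc.position
FROM   user_constraints  c,
       user_cons_columns cc
WHERE  c.constraint_name = cc.constraint_name
AND    c.constraint_type IN ('P', 'R');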
4.2 Extraction of Relations Composing the Aggregate Relationships
The extraction of the relations which compose the aggregate relationships is made with one query for each of the six cases of aggregate relationships we take into account in this paper (see Section 3). These queries inspect the view 'cross_references'. We can note that these queries are written once and work for every relational schema. Each query populates (or not) the table 'aggregate_results', which records the relations that compose the aggregate relationships. Due to space limitations we cannot show each of these queries. As an example, consider the query in Appendix 6, which extracts the relations composing case (6) of aggregate relationship; the fact extracted by each predicate is indicated alongside it. For our running example this query inserts only one row into the table 'aggregate_results'. This row is written in bold in Fig. 11.
4.3 Extraction of the Aggregate Relationships
The following query describes the final step of the extraction of aggregate relationships. For our running example, three of the queries have been successful. The following figure shows the final result of the extraction.

select A2, A1, E3, aggregate from aggregate_results;

A2              A1              E3              aggregate
--------------- --------------- --------------- ---------
                MISSION         EMPLOYEE        3
INSTALLATION    OWNER           SERVER          5
USUALPASSENGER  MISSION         EMPLOYEE        6
Fig. 11. Contents of the table ‘aggregate_results’
5. Conclusion and Further Research

An automatic process which improves relational database reverse engineering has been presented. This process is based on a combination of data dictionary, data schema and data instance analysis. For SQL92 relational databases (where foreign key and primary key clauses exist in the schema definition) this process is fully automated. For legacy systems (databases which do not support foreign key definitions), human intervention is required to propose potential foreign key attributes. We investigate a snapshot of the database extension taken at the beginning of the reverse engineering process to extract six cases of aggregate relationships. As our process examines data, it is true that if additional data were given, the conclusions about the cardinality constraints of n-ary relationships could be different. Though the results of our process do not tell us anything definite about the equivalent conceptual schema, they make explicit the semantics expressed in n-ary relationships. Our process can be included in a complete database reverse engineering methodology, or it can refine the results of commercial tools with reverse engineering options. Commercial tools produce a diagram taking as input either a file describing the relational schema or the database itself; the data dictionary is taken into account but not the data. The first step of our process consists in automatically extracting the n-ary relations from the data dictionary. The second step provides the inference of the current minimum and maximum cardinality constraints of n-ary relationships. Our process generates a set of SQL queries for each n-ary relation of the database studied. The final step of our process takes as input the table 'aggregate_results' and produces as output the current cardinality constraints of each aggregate relationship extracted. The cardinality constraints are inferred according to the conditions on instances described in detail in Tables 1, 2, 3 and 4. A Pro*C program will generate an adequate SQL query for each case of aggregate relationship.
References

[1] Andersson, M. Extracting an Entity Relationship Schema from a Relational Database through Reverse Engineering, in Proceedings of the 13th Int. Conference on Entity-Relationship Approach, (ed P. Loucopoulos), Springer Verlag, 881, (1994) 403-419.
[2] Batini, C., Ceri, S., Navathe, S.B. Conceptual Database Design: an Entity Relationship Approach, Benjamin Cummings, Redwood City (1992).
[3] Castellanos, M. A Methodology for Semantically Enriching Interoperable Databases, in Proceedings of the 11th British National Conference on Databases, (1993) 58-75.
[4] Chen, P.P. The Entity-Relationship Model: Towards a Unified View of Data, ACM Transactions on Database Systems, 1, 1, (Mar. 1976) 2-36.
[5] Chiang, R., Barron, T., Storey, V.C. Reverse engineering of relational databases: Extraction of an EER model from a relational database, Journal of Data and Knowledge Engineering, 12, 2, (1994) 107-142.
[6] Comyn-Wattiau, I., Akoka, J. Reverse Engineering of Relational Database Physical Schemas, in Proceedings of the 15th Int. Conference on Entity-Relationship Approach, (ed B. Thalheim), Springer Verlag, 1157, (Oct. 1996) 372-391.
[7] Davis, K.H., Arora, A.K. Converting a Relational Database Model into an Entity Relationship Model, in Proceedings of the 6th International Conference on Entity-Relationship Approach, (Nov. 1987) 243-257.
[8] Englebert, V., Henrard, J., Hick, J.M., Roland, D., Hainaut, J.L. DB-MAIN: a database oriented CASE Tool, Engineering of Information Systems, 4, 1, (1996) 87-116.
[9] Gardarin, G. Translating relational to object databases, Engineering of Information Systems, 2, 3, (1994) 317-346.
[10] Hainaut, J.L., Tonneau, C., Joris, M., Chandelon, M. Transformation-based Database Reverse Engineering, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 364-375.
[11] Johanneson, P. A method for Translating Relational Schemas into Conceptual Schemas, in Proceedings of the 10th Int. Conference on Data Engineering, (1994) 190-201.
[12] Markowitz, K.M., Makowsky, J.A. Identifying Extended Entity-Relationship Object Structures in Relational Schemas, IEEE Transactions on Software Engineering, 16, 8, (Aug. 1990) 777-790.
[13] Moulin, P., Randon, J., Savoysky, S., Spaccapietra, S., Tardieu, H., Teboul, M. Conceptual model as database design tool, Proceedings of the IFIP Working Conference on Modelling in Database Management Systems, G.M. Nijssen (ed.), North-Holland, 1976.
[14] Narasimham, B., Navathe, S.B., Jayaraman, S. On Mapping ER and Relational Models into OO Schemas, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 402-413.
[15] Navathe, S.B., Awong, H. Abstracting Relational and Hierarchical Data with a Semantic Data Model, in Proceedings of the 6th International Conference on Entity-Relationship Approach, (Nov. 1987) 277-305.
[16] Petit, J.M., Kouloumdjian, J., Boulicaut, J.F., Toumani, F. Using Queries to Improve Database Reverse Engineering, Proceedings of the 13th Int. Conference on Entity-Relationship Approach, Springer Verlag, 881, (1994) 369-386.
[17] Premerlani, W.J., Blaha, M.R. An Approach for Reverse Engineering of Relational Databases, in Proceedings of the IEEE Working Conference on Reverse Engineering, Baltimore, (Nov. 1993) 151-160.
[18] Shoval, P., Shreiber, N. Database Reverse engineering : From the Relational to the Binary Relationship Model, Journal of Data and Knowledge Engineering, 10, (1993) 293-315. [19] Signore, O., Loffredo, M., Gregori, M. Cima, M. Reconstruction of ER Schema from Database Application : a Cognitive Approach, in Proceedings of the 13th Int. Conference on Entity-Relationship Approach, Springer Verlag, 881, (1994), 387-402. [20] H.A. Smith, D.C.P. Smith, Database Abstractions : Aggregation and Generalization, ACM Transactions on Database Systems, Vol 2, N°2, pp 105-133, 1977. [21] Song, Y.I., Jones, T.H. Analysis of Binary Relationships within Ternary Relationships in ER Modeling, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 271-282. [22] Soutou, C. Extracting N-ary Relationships through Database Reverse Engineering, in Proceedings of the 15th Int. Conference on Entity-Relationship Approach, (ed B. Thahleim), Springer Verlag, 1157, (Oct. 1996) 392-405. [23] Soutou, C. Relational Database Reverse Engineering : Extraction of Cardinality Constraints, to appear in Journal of Data and Knowledge Engineering. [24] Spaccapietra, S., Parent, C. ERC+ :an Object Based Entity Relationship Approach", in Conceptual Modeling, Databases and CASE : An Integrated View of Information Systems Development, (Ed P. Loucopoulos and R. Zicari), John Wiley (1993). [25] Tardieu., H., Rochfeld, A., Colleti, R. La méthode MERISE, Les Editions d’Organisation, Paris, (1986). [26] Teorey, T.J., Yang, D., Fry, J.P. A logical design methodology for relational databases using the extended entity-relationship model, ACM Computing Surveys, 18, 12, (June 1986), 197-222. [27] Vermeer, M.W.W., Apers, P.M.G. Reverse Engineering of Relational Database Applications, in Proceedings of the 14th Int. Conference on Entity-Relationship Approach, (ed M.P. Papazoglou), Springer Verlag, 1021, (1995) 89-100.
Appendix

Appendix 1. Instances

server:
servno#    servtype
icare0     Sun Sparc 5
serviut    Pentium 100, NT 3.51
Scoserv    Pentium 200, Unix SCO
Appendix 6. Query extracting the relations for aggregate case (6)

insert into aggregate_results
select distinct NULL, c0.relation, c2.relation, c4.relation, NULL, NULL, '6'
from cross_references c0, cross_references c2,
     cross_references c3, cross_references c4
where c0.constraint in                       -- A2[a1#,b1#,c1#... : A2 has a three-column primary key
      (select c1.constraint from cross_references c1
       where c1.type = 'P'
       group by c1.constraint having count(*) = 3)
  and c0.relation = c3.relation
  and exists                                 -- A2[(a1,b1),.. A1[... : a two-column foreign key of A2
      (select c1.constraint from cross_references c1
       where c1.type = 'R' and c1.relation = c0.relation
         and c1.constraint = c3.constraint
       group by c1.ref_constraint having count(*) = 2)
  and c3.ref_constraint = c2.constraint      -- c2 refers to A1
  and exists                                 -- A1[a1#,b1#,... : A1 has a two-column primary key
      (select c1.constraint from cross_references c1
       where c1.constraint = c2.constraint and c1.type = 'P'
       group by c1.constraint having count(*) = 2)
  and c4.constraint =                        -- c4 refers to E3
      (select c1.ref_constraint from cross_references c1
       where c1.relation = c0.relation and c1.type = 'R'
       group by c1.ref_constraint having count(*) = 1)
  and not exists                             -- E3[c1#,... : E3 has a single-column primary key
      (select c1.constraint from cross_references c1
       where c1.relation = c4.relation and c1.type = 'P'
       group by c1.constraint having count(*) > 1)
  and exists                                 -- A2[..., (c1)... E3[c1#,... : a foreign key of A2 references E3
      (select c3.constraint from cross_references c3
       where c3.relation = c0.relation and c3.type = 'R'
         and c3.ref_constraint = c4.constraint);
On the Consistency of Int-cardinality Constraints

Sven Hartmann

FB Mathematik, Universität Rostock, 18051 Rostock, Germany
Abstract. In the entity-relationship model, cardinality constraints are frequently used to specify dependencies between entities and relationships. They impose lower and upper bounds on the cardinality of relationships an instance of a fixed type may participate in. However, for certain applications it is not enough to prescribe only bounds; it is necessary to specify the exact set of permitted cardinalities. This leads to the concept of int-cardinality constraints as proposed by Thalheim [14]. Different from ordinary cardinality constraints, this concept allows gaps in the sets of permitted cardinalities. Our objective is to investigate the consistency of a set of int-cardinality constraints for a database scheme, i.e. the question whether there exists a fully-populated database instance satisfying all the given int-cardinality constraints.
1 Introduction
In database design, great attention is devoted to the modeling of semantics. If we consider a database as a set of tuples over certain domain values, then semantics are usually given by integrity constraints. They specify the way in which data are associated with each other. Hence, integrity constraints help us to decide whether a database is meaningful for an application or not. During the last few decades, plenty of different classes of integrity constraints have been discussed and actually used in database design. There are several books and monographs which give an overview of semantics in databases (cf. [10,11,13]). A general approach towards integrity constraints is developed in [14], and uses extensions from [1]. Within this paper, we use the entity-relationship model (ERM) to express database schemes. In this approach, cardinality constraints are among the most popular classes of integrity constraints. They impose lower and upper bounds on the number of relationships an entity of a given type may be involved in. Thus, cardinality constraints limit the possible structure of a database. This makes cardinality constraints a very powerful class of constraints. However, for certain applications cardinality constraints are not even powerful enough to allow a straightforward specification of the desired semantics. To illustrate this, we present the following small example.
Example. A new travel agency is going to organize sight-seeing tours through Europe. Each of the offered tours visits a number of Europe’s most popular cities. Using the entity-relationship approach, the itineraries of the tours are planned according to the database scheme in Fig. 1 containing two entity types (tour and city) as well as two relationship types (visits and starts). Relationships of type starts determine in which city a certain tour starts, relationships of type visits specify that a certain city is visited during a given tour.
Fig. 1. Entity-relationship diagram for the travel agency in the example.
Obviously, each tour has exactly one starting point. Further, the management decided that every tour visits 3 or 4 cities. In addition, from time to time the organizers want to offer special tours visiting 7 cities to attract new customers. However, demand and financial limitations have to be taken into consideration. Hence, every week 2 or 3 tours shall start in each of the cities in the catalogue; only during the long vacations does the management intend to offer 5 or 6 tours per week. Finally, every city shall be visited by 1 to 3 tours per week. However, the organizers would also accept 6 tours visiting certain cities, since they are then able to negotiate better rates with the hotels.

Modeling the specified restrictions with the help of ordinary cardinality constraints would obviously cause some trouble. For example, each city is allowed to be involved in 1, 2, 3 or 6 relationships of the type visits, but not in 4 or 5. Hence, it is not enough to give lower and upper bounds for the number of participations; the complete list of permitted values has to be specified. This leads to a generalization of the ordinary concept of cardinality constraints, namely to int-cardinality constraints as proposed by Thalheim [12]. A formal definition of these constraints will be given in the sequel. Of course, our travel agency wants to know first of all whether it is possible to construct a database instance, i.e. a catalogue of tours, satisfying all the specified rules. Reasoning about such integrity constraints belongs to the fundamental tasks in database design. When reasoning about a set of constraints, one is frequently interested in whether this set is consistent and whether it implies further constraints. Given a database scheme, a set of integrity constraints defined on
the scheme is said to be consistent iff it admits at least one fully-populated database satisfying all these constraints. Obviously, consistency is a basic requirement for the correctness of the chosen scheme, i.e. representation of the modeled real world. The question whether a set of ordinary cardinality constraints is consistent has been considered e.g. in [9,14]. Our objective is to investigate this problem for the larger class of int-cardinality constraints. This paper is organized as follows. In Sect. 2, we briefly describe the data model to be used. All our considerations are carried out in the entity-relationship model (ERM) of Chen. In Sect. 3, we give a formal definition of int-cardinality constraints. How to check the consistency of a set of such constraints is studied in Sect. 4. Finally in Sects. 5 and 6, we will discuss two variations of the usual idea of consistency for int-cardinality constraints. Our results can be exploited to detect dummy values in int-cardinality constraints and for scheme correction as suggested by Thalheim [14].
2 The Data Model
In the sequel, we will use a particular data model, namely the entity-relationship model (ERM) which goes back to Chen [2]. This approach is based on a simple graphical representation. Using the diagram technique, even complex schemes can be understood and handled. The entity-relationship model has been so successful that it is used at present as a standard tool in conceptual database design and, in addition, in several other branches of computer science (cf. [3]). Let us briefly introduce the basic concepts of the ERM.

Let E be a non-empty, finite set. In the context of our approach, the elements of E are called entity types. With each entity type e we associate a finite set e^t called the domain or population of the type e (at moment t). The members of e^t are entity instances or entities, for short. Intuitively, entities can be seen as real-world objects which are of interest for some application or purpose. By classifying them and specifying their significant properties (attributes), we obtain entity types which are frequently used to model the objects in their domains.

A relationship type r is a sequence (e_1, ..., e_k) of elements from E. Relationship types are used to model associations between real-world objects, i.e. entities. A relationship or instance of type r is an element of the cartesian product e_1^t × ... × e_k^t, where e_1^t, ..., e_k^t are the domains of e_1, ..., e_k, respectively, at moment t. A finite set r^t of such relationships forms the population of r at moment t. The relationship types considered so far are often called relationship types of order 1 (cf. [14]). Analogously, relationship types of higher order may be defined hierarchically.
Suppose now we are given entity and/or relationship types of order less than i > 0. A sequence (q_1, ..., q_k) of them forms a relationship type r of order i. As above, we define relationships of type r as elements of the cartesian product q_1^t × ... × q_k^t for a given moment t. In a relationship type r = (q_1, ..., q_k), each of the entity or relationship types q_1, ..., q_k is said to be a component type of r. Additionally, each of the pairs (r, q_j) is called a link.

Let S = {r_1, ..., r_n} be a set of entity and relationship types such that, with each relationship type r, all its component types belong to S, too. Then S is called a database scheme. Replacing each type q in S by its population q^t at moment t, we obtain a database or instance S^t of S. A database S^t is fully-populated iff none of the sets q^t is empty. In the sequel, we are only interested in fully-populated databases.

From the graph-theoretical point of view, a database scheme S can be considered as a finite digraph ERD = (S, L) with vertex set S and a multiset L of arcs. In ERD there shall be an arc from r to e whenever e is a component type of the relationship type r. Hence, the arcs in ERD are just the links in the database scheme. The digraph ERD is also called the entity-relationship diagram of S. Usually, the entity types are represented graphically by rectangles, the relationship types by diamonds. Figure 1 shows the diagram of the database scheme used by our travel agency. It contains two entity types (tour, city) and two binary relationship types (starts, visits). However, in more complex applications one will probably find relationship types with more than two component types, too.
3 Int-cardinality Constraints
Now we are ready to give a formal definition of int-cardinality constraints. For a relationship type r in S and a component type e of r, let D(r, e) be a given set of nonnegative integers. The int-cardinality constraint comp(r, e) = D(r, e) specifies that in each database state the number of instances of type r that an instance e of type e is involved in belongs to D(r, e). Therefore, whenever an instance e of type e participates in exactly d relationships of type r, then d should be a member of the set D(r, e) of permitted cardinalities. Hence, comp(r, e) = D(r, e) holds iff for all database states S^t and all e ∈ e^t we have |{r ∈ r^t : r(e) = e}| ∈ D(r, e), where r(e) denotes the restriction of the relationship r to the component type e. Note that an int-cardinality constraint is an ordinary cardinality constraint, too, iff the set D(r, e) is an interval, i.e. a set of consecutive integers. The difference
of both concepts is that we now allow gaps in the set D(r, e) of permitted cardinalities. If no int-cardinality constraint is defined for a link (r, e), we may assume comp(r, e) = {0} ∪ N, where N denotes the set {1, 2, ...} of positive integers. It is easy to see that this does not represent a real constraint, but is more or less a technical agreement. In our example from Sect. 1, the specified requirements can be expressed with the help of the following int-cardinality constraints:

comp(starts, tour) = {1},
comp(starts, city) = {2, 3, 5, 6},
comp(visits, tour) = {3, 4, 7},
comp(visits, city) = {1, 2, 3, 6}.

Only the first constraint may be interpreted as an ordinary cardinality constraint. All the other sets on the right-hand side have gaps.

Let C be a set of int-cardinality constraints defined for each link in the given database scheme S. Every instance of S satisfying all the int-cardinality constraints in C is said to be legal. It is easy to check that the empty database is always legal. However, the travel agency is of course not interested in catalogues without cities or tours. For the same reason, we are looking only for fully-populated instances of S. By SAT(S, C) we denote the set of fully-populated legal database instances of S. The given set C of int-cardinality constraints is consistent iff it admits at least one fully-populated legal instance of S.
Fig. 2. Entity-relationship diagram with labels for the int-cardinality constraints.
Ordinary cardinality constraints are often reflected graphically in entity-relationship diagrams. For int-cardinality constraints this seems to be somewhat more difficult, in particular if the sets D(r, e) are large and have lots of gaps. Nevertheless, we propose to label the link (r, e) by the set D(r, e) when comp(r, e) = D(r, e) is given. For our example, the entity-relationship diagram together with these labels is shown in Fig. 2.
4 Consistent Sets of Int-cardinality Constraints
To characterize consistent sets C of int-cardinality constraints, we propose to use suitable systems of linear diophantine equations. These systems are chosen in such a way that the consistency of C is equivalent to the existence of an integral solution to the associated systems. Assume C admits a legal database instance S^t of S. The number of instances of a type q in the database scheme shall be denoted by g(q). Obviously, these numbers have to meet strict requirements:

Fact 1. Let S be a database scheme and C be a set of int-cardinality constraints defined on S. Then C is consistent iff there exists a function g : S → N such that for every link (r, e) ∈ L there are nonnegative integers x_d, d ∈ D(r, e), with
\[
\sum_{d \in D(r,e)} x_d = g(e), \qquad \sum_{d \in D(r,e)} d\,x_d = g(r). \qquad (1)
\]

Remark. For proofs of the results presented in this paper we refer to [8].

The question arises whether it is possible to find a function g with the properties claimed in Fact 1. The following observation gives a first answer to this question.

Fact 2. Let (r, e) ∈ L be a link, and let g(e) as well as g(r) be positive integers. There exist nonnegative rational values x_d, d ∈ D(r, e), satisfying (1) iff
\[
\min D(r,e) \;\le\; \frac{g(r)}{g(e)} \;\le\; \max D(r,e) \qquad (2)
\]
holds.

Although Fact 2 does not explicitly guarantee the existence of an integral solution to system (1), it can easily be exploited to ensure such a solution. Combining Facts 1 and 2 we finally obtain a new characterization of admissible functions g which is somewhat easier to handle.

Fact 3. Let S be a database scheme and C a set of int-cardinality constraints defined on S. Then C is consistent iff there exists a function g : S → N such that (2) holds for every link (r, e) ∈ L.

Of course, the main problem is to find a function g that satisfies the inequalities (2) simultaneously for all the links (r, e). In the sequel we shall use shortest-path methods in suitable digraphs for this purpose. Let G = (S, L ∪ L^{-1}) be the symmetric digraph which we obtain from the entity-relationship diagram ERD = (S, L) by adding to each link L = (r, e) its reverse
L^{-1} = (e, r). In the sequel, we use the term link only for the elements of L and arc for an element of L ∪ L^{-1}. On the arcs of G we define a weight function w : L ∪ L^{-1} → Q ∪ {∞} by
\[
w(L) =
\begin{cases}
\infty & \text{if } 0 \in D(r,e),\\
\dfrac{1}{\min D(r,e)} & \text{otherwise},
\end{cases}
\qquad\text{and}\qquad
w(L^{-1}) = \max D(r,e), \qquad (3)
\]
where L is the original link (r, e) ∈ L and L^{-1} is its reverse. Special interest is devoted to directed cycles. A directed cycle Z is a sequence of consecutive arcs A_1, ..., A_k in the digraph. It is said to be critical whenever its weight w(Z) = w(A_1) · · · w(A_k) is less than 1.
Fig. 3. The digraph G associated to the ERD from our example.
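For reference, the arc labels shown in Fig. 3 follow directly from (3) and the constraints given in Sect. 3 (the values below are computed here from that definition):
\[
\begin{aligned}
&w(\mathit{visits},\mathit{tour}) = \tfrac{1}{3}, &\quad &w\bigl((\mathit{visits},\mathit{tour})^{-1}\bigr) = 7,\\
&w(\mathit{visits},\mathit{city}) = 1, &\quad &w\bigl((\mathit{visits},\mathit{city})^{-1}\bigr) = 6,\\
&w(\mathit{starts},\mathit{tour}) = 1, &\quad &w\bigl((\mathit{starts},\mathit{tour})^{-1}\bigr) = 1,\\
&w(\mathit{starts},\mathit{city}) = \tfrac{1}{2}, &\quad &w\bigl((\mathit{starts},\mathit{city})^{-1}\bigr) = 6.
\end{aligned}
\]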
Figure 3 shows the digraph G obtained from the entity-relationship diagram ERD for our travel agency. Here, we have labeled the arcs by their weights according to (3). For an arbitrary link L = (r, e) ∈ L of the database scheme, let g(e) and g(r) be given positive integers. By the definition of the weight function w, inequality (2) holds iff we have both
\[
\frac{g(e)}{g(r)} \le w(L) \qquad\text{and}\qquad \frac{g(r)}{g(e)} \le w(L^{-1}).
\]
Thus, to decide the consistency of a set of int-cardinality constraints, we are looking for a function g : S → N defined on the vertex set of the digraph G such that
\[
\frac{g(v)}{g(u)} \le w(A) \qquad (4)
\]
holds for every arc A = (u, v) in G. Note that the arc A might be either a link or its reverse. Functions satisfying (4) have already been used when reasoning about sets of ordinary cardinality constraints (cf. [6]). We shall call them feasible or admissible. Admissible functions and their relation to database design have been considered
in [4,6]. In particular, admissible functions exist iff there is no critical cycle in G. For further properties of admissible functions, we refer to [6].

Theorem 4. Let S be a database scheme and C a set of int-cardinality constraints defined on S. Then C is consistent iff the digraph G has no critical cycle.

Sketch of the proof. As pointed out above, a function g satisfies (2) for every link L = (r, e) of the database scheme iff it is admissible with respect to the weight function (3). On the other hand, as proved in [6], there exists such a function g iff the digraph G admits no critical cycle. Hence, the claim follows by Fact 3. ⊓⊔

In [6] a polynomial-time algorithm is proposed to construct admissible functions using shortest-path methods (namely a variation of the well-known Bellman-Ford algorithm). Therefore, the question whether a set C of int-cardinality constraints is consistent or not can be decided in polynomial time. The statement of Theorem 4 is especially remarkable, since exactly the same claim holds for ordinary cardinality constraints, too (see e.g. [9,6,14]).

Obviously, a database instance satisfying an int-cardinality constraint comp(r, e) = D(r, e) also meets the relaxed cardinality constraint comp(r, e) = (a, b) = {a, a + 1, ..., b}, where a and b denote min D(r, e) and max D(r, e), respectively. Of course, the converse is usually not true. Nevertheless, replacing the weaker cardinality constraint by the stronger int-cardinality constraint does not affect the consistency of the constraints. The set C of int-cardinality constraints in our example for the travel agency is consistent iff the relaxed set C′ of ordinary cardinality constraints

comp(starts, tour) = (1, 1) = {1},
comp(starts, city) = (2, 6) = {2, ..., 6},
comp(visits, tour) = (3, 7) = {3, ..., 7},
comp(visits, city) = (1, 6) = {1, ..., 6}

is consistent, too. This observation is rather astonishing, since SAT(S, C) is usually a proper subset of SAT(S, C′). However, if SAT(S, C′) is empty, then SAT(S, C) is as well. We record this result as

Corollary 5. Let S be a database scheme. A set C of int-cardinality constraints defined on S is consistent iff the relaxed set C′ of ordinary cardinality constraints is consistent, too.
5 Restricted Consistency
It is easy to check that the digraph G in Fig. 3 contains no critical cycle. Hence, the set of int-cardinality constraints in our example is consistent. This enables us to construct a fully-populated legal database instance for the scheme in Fig. 2.

However, for practical purposes it is often not enough to prove the mere existence of legal databases. The travel agency in our example would surely not be interested in databases with hundreds of thousands of cities or tours. Due to economic limitations the number of entity and relationship instances must be bounded from above. Thus, the question arises whether there exists a fully-populated legal database of reasonable size. Suppose we are given upper bounds N(q) for the numbers of instances of the types q in the database scheme. Again, an upper bound ∞ does not express a real constraint. We call a set C of int-cardinality constraints restricted consistent if there exists a fully-populated legal database instance S^t of S such that for every type q the number g(q) of its instances is bounded by N(q). Unfortunately, it happens to be difficult to decide this question, as we shall see in the sequel.

Theorem 6. Let S be a database scheme and C a set of int-cardinality constraints defined on S. It is NP-complete to decide whether C is restricted consistent (with respect to given bounds N(q) ∈ N ∪ {∞}).

In order to see this, one has to verify that the following decision problem is NP-complete:

Restricted consistency. Does there exist a function g : S → N such that there is a nonnegative integral solution to the system (1) for every link (r, e) ∈ L and such that g(q) ≤ N(q) holds for every type q ∈ S, where N(q) is some integer (from the input)?

Clearly, the problem belongs to NP, since for a guessed function g and integers x_d, d ∈ D(r, e) and (r, e) ∈ L, the equalities in the systems (1) can be tested in polynomial time. The proof of the NP-completeness uses a reduction from Integer Knapsack, which is well known to be NP-complete. Theorem 6 shows that asking for restricted consistency may result in a considerable increase of the complexity of the appropriate decision problem. In general, solvers of integer linear programming problems seem to be the only way to tackle the Restricted consistency problem.
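Written out purely as a restatement of the problem above, restricted consistency asks for an integral solution of the following feasibility system (one block of equations per link):
\[
\begin{aligned}
&\sum_{d \in D(r,e)} x^{(r,e)}_d = g(e), \qquad \sum_{d \in D(r,e)} d\,x^{(r,e)}_d = g(r) \qquad \text{for every link } (r,e) \in L,\\
&1 \le g(q) \le N(q) \ \text{ for every type } q \in S, \qquad x^{(r,e)}_d \in \mathbb{N}_0,\quad g(q) \in \mathbb{N},
\end{aligned}
\]
which is the form an integer linear programming solver would be given.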
6 Strong Consistency
Figure 4 shows a catalogue of tours offered by our travel agency this week. It provides a fully-populated legal database instance to the scheme in Fig. 2.
tour  starts in  visits
1     Paris      Prague, Vienna, Rome
2     Paris      Rome, London, Geneve
3     Vienna     Prague, Rome, Paris
4     Vienna     London, Geneve, Prague
5     Prague     Vienna, Rome, London
6     Prague     Geneve, Vienna, Paris
7     Rome       Prague, Vienna, Geneve
8     Rome       London, Geneve, Paris
9     London     Paris, Vienna, Rome
10    London     Geneve, Prague, Paris
11    Geneve     Prague, Rome, London
12    Geneve     Paris, London, Vienna

Fig. 4. A catalogue of tours offered by the travel agency in our example.
Clearly, this week every tour visits exactly 3 cities. But, a customer might wish to book a tour visiting 7 cities as promised by the travel agency. Should he wait to book until next week? Will there ever be a tour visiting 7 cities? As we shall see shortly, the answer is no. The chosen int-cardinality constraints do not allow the construction of databases with tours visiting more than 3 cities. This is of course an unpleasant fact, in particular for the customer. However, there is no way out unless the management of the travel agency changes its policy, i.e. the chosen set of int-cardinality constraints.

In our example, the permitted cardinalities 4 and 7 in the set D(visits, tour) are dummy values. They might be deleted without affecting the set SAT(S, C) of fully-populated legal databases. The problem is of course how to find those cardinalities which will never occur in a legal database instance.

Let S be a database scheme and (r, e) a link of S. We call a consistent set C of int-cardinality constraints strongly consistent iff for every value d in the given set D(r, e) there exists a fully-populated legal database S^t containing an instance e of type e which participates in exactly d relationships of type r, i.e. satisfies |{r ∈ r^t : r(e) = e}| = d. We call this database S^t a certificate for the value d in D(r, e). Hence, in a strongly consistent set C of int-cardinality constraints none of the sets D(r, e) contains dummy values. In this section we shall show how to detect such dummy cardinalities and, consequently, how to check the strong consistency of a given set C.

It is easy to see that the union of two legal database instances is again legal. Hence, if all the sets D(r, e) in C are finite, then C is strongly consistent iff there exists a fully-populated database S^t which is a certificate for all values d ∈ D(r, e) and all links (r, e): just choose a certificate for each of these values d and join them all. This provides again a legal database, which is the claimed common certificate. If there is an infinite set D(r, e), too, the argument has to be slightly changed. Obviously, a database which forms a certificate for all values d in D(r, e) would
be infinite. However, for any choice of a finite subset D⁺(r, e) ⊂ D(r, e) the last observation remains true. From Fact 1, we obtain

Fact 7. Let S be a database scheme and C be a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff there exists a function g : S → N such that for every link (r, e) ∈ L there are nonnegative integers x_d, d ∈ D(r, e), satisfying (1) such that x_d is positive for every

    d ∈ D(r, e)     if D(r, e) is finite,
    d ∈ D⁺(r, e)    if D(r, e) is infinite,

where D⁺(r, e) is an arbitrary finite subset of D(r, e).

As in Sect. 4, this result can be used to prove the following observation.

Fact 8. Let (r, e) ∈ L be a link with a finite set D(r, e) of permitted cardinalities. Further, let g(e) and g(r) be positive integers. There exists a positive rational solution to system (1) iff

    min D(r, e) < g(r)/g(e) < max D(r, e)    (5)

holds, or D(r, e) is of size one and we have g(r)/g(e) ∈ D(r, e).
Combining Facts 7 and 8, we obtain a characterization of strong consistency.

Fact 9. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff there exists a function g : S → N such that (5) holds for every link (r, e) ∈ L whose set D(r, e) of permitted cardinalities is of size at least two.

Recall the digraph G = (S, L ∪ L⁻¹) introduced in Sect. 4. A directed cycle Z in G is said to be subcritical if its weight w(Z) equals 1. In the sequel, we will show how to use subcritical cycles to check strong consistency. An arc A = (u, v) lies on such a subcritical cycle Z iff

    w(A) = g(v)/g(u)    (6)

holds for every admissible function g. This helps us to derive the following consequences.

Fact 10. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Further, let L = (r, e) ∈ L be a link of S. If the link L itself lies on a subcritical cycle in G, then all permitted cardinalities different from min D(r, e) are dummy values. If the reverse arc L⁻¹ of the link lies on a subcritical cycle in G, then all permitted cardinalities different from max D(r, e) are dummy values.
The digraph G obtained from the entity-relationship diagram of the database scheme for our travel agency contains two directed cycles. One of them is subcritical as Fig. 5 shows. On this cycle, we find for example the link (visits, tour). According to Fact 10, we may in particular conclude that every tour visits exactly 3 cities. The values 4 and 7 in comp(visits, tour) are dummy values.
Fig. 5. A subcritical cycle in the digraph G from our example.
Theorem 11. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff the set D(r, e) of permitted cardinalities for a link (r, e) ∈ L is of size one whenever the link itself or its reverse arc lies on a subcritical cycle in G.

Sketch of the proof. The necessity of the claim immediately follows from Fact 10. It remains to verify the sufficiency. As mentioned above, there exists an admissible function g such that (6) holds for an arc A = (u, v) iff A lies on a subcritical cycle in G. This function g satisfies the preconditions of Fact 9. ⊓⊔

Whenever a consistent set C of int-cardinality constraints is not strongly consistent, then there must be dummy values in at least one of the sets D(r, e) of permitted cardinalities. In [12], Thalheim suggests deleting these values from the sets D(r, e). This process is also called scheme correction. It reduces the amount of information necessary to describe the legal databases: There will never be a database state using any of the dummy values. Hence, it is unnecessary to store these cardinalities.
7 Algorithmic Aspects
Our investigations in Sects. 4 and 6 provide characterizations of consistent and strongly consistent sets of int-cardinality constraints, respectively. As claimed, both properties can be tested in polynomial time by applying methods from combinatorial optimization. According to Theorem 4, C is consistent iff the digraph G contains no critical cycles with respect to the weight function (3). The existence of critical cycles,
i.e. directed cycles of weight smaller than 1, can be tested using shortest-path methods. In [8], we present a variation of the well-known Floyd-Warshall algorithm (cf. [5]) to decide the existence of critical cycles. Its complexity is cubic in the size of the database scheme S. If a consistent set C of int-cardinality constraints is not strongly consistent, then a slight modification of this algorithm can be used to delete dummy values from the sets D(r, e) of permitted cardinalities.

When deleting all dummy values in our example according to Fact 10, we obtain the int-cardinality constraints

comp(starts, tour) = {1}, comp(starts, city) = {2},
comp(visits, tour) = {3}, comp(visits, city) = {6}.

Hence, it is no longer surprising that in the database in Fig. 4 every tour visits exactly 3 cities. Due to Theorem 11 this is a consequence of the chosen set of constraints. Thus, scheme corrections result in a considerable decrease of the complexity of information.
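The algorithm of [8] is not reproduced here, but the underlying idea can be sketched: on logarithmic weights a critical cycle becomes a negative cycle and a subcritical cycle a zero-weight cycle, so an all-pairs shortest-path computation (Floyd-Warshall) can both test consistency and, via Fact 10, prune dummy values. The weight assignment below is the same assumption as before (max D on the arc from the entity type to the relationship type, 1/min D on the reverse arc), and the D sets are illustrative; this is not the variation presented in [8] itself.

```python
import math

def all_pairs_log_shortest(nodes, arcs):
    """Floyd-Warshall on log-weights.  arcs: {(u, v): w} with w > 0."""
    INF = float("inf")
    dist = {(u, v): (0.0 if u == v else INF) for u in nodes for v in nodes}
    for (u, v), w in arcs.items():
        dist[u, v] = min(dist[u, v], math.log(w))
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return dist

def prune_dummy_values(nodes, arcs, D):
    """Assumes consistency (no critical cycle).  An arc (u, v) lies on a
    subcritical (weight-1) cycle iff log w(u, v) + shortest(v -> u) == 0;
    Fact 10 then keeps only min (link on such a cycle) or max (reverse arc)."""
    dist = all_pairs_log_shortest(nodes, arcs)
    eps = 1e-9
    corrected = {}
    for (r, e), values in D.items():
        vals = set(values)
        if abs(math.log(arcs[r, e]) + dist[e, r]) < eps:   # link itself on a subcritical cycle
            vals = {min(values)}
        if abs(math.log(arcs[e, r]) + dist[r, e]) < eps:   # reverse arc on a subcritical cycle
            vals = {max(values)}
        corrected[r, e] = vals
    return corrected

# Travel-agency example with the assumed weight assignment and hypothetical D sets.
nodes = ["tour", "city", "starts", "visits"]
arcs = {("starts", "tour"): 1.0, ("tour", "starts"): 1.0,
        ("starts", "city"): 0.5, ("city", "starts"): 6.0,
        ("visits", "tour"): 1 / 3, ("tour", "visits"): 7.0,
        ("visits", "city"): 1.0, ("city", "visits"): 6.0}
D = {("starts", "tour"): {1}, ("starts", "city"): {2, 3, 4, 5, 6},
     ("visits", "tour"): {3, 4, 7}, ("visits", "city"): {1, 2, 3, 4, 5, 6}}
print(prune_dummy_values(nodes, arcs, D))
```

On these data the permitted sets collapse to {1}, {2}, {3} and {6}, in line with the corrected constraints quoted above.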
8 Conclusions
In this paper, we have developed a theory for int-cardinality constraints which generalize the well-known concept of cardinality constraints. We have shown that significant properties of sets of int-cardinality constraints can be recognized with methods from integer and combinatorial optimization. In particular, we proved that the consistency of such sets can be checked by looking for critical cycles in an associated digraph. It is worth mentioning that a set of int-cardinality constraints is consistent whenever the corresponding relaxed set of ordinary cardinality constraints is consistent, too. Hence, gaps in the sets of permitted cardinalities do not affect the consistency of those constraints. This is of special interest for database designers as int-cardinality constraints often allow a straightforward modeling of semantics, even in more involved database applications.

In addition, we introduced two variations of the consistency problem, namely restricted consistency and strong consistency. For practical reasons it is often necessary to ensure the existence of legal databases with a bounded number of entities and relationships. However, this small modification of the original question leads to a dramatic increase of the complexity of such a problem. Checking the strong consistency of int-cardinality constraints allows us to detect dummy values in the sets of permitted cardinalities. To give a realistic impression of the structure of legal databases, these values should be deleted via scheme corrections. Again, this can be managed within polynomial time.
References

1. A.P. Buchmann, R.S. Carrera and M.A. Vazquez-Galindo, A generalized constraint and exception handler for an object-oriented CAD-DBMS, IEEE conf. (1986) 38-49.
2. P. Chen, The Entity-Relationship Model: Towards a unified view of data, ACM TODS 1,1 (1976) 9-36.
3. P. Chen and H.-D. Knoell, Der Entity-Relationship-Ansatz zum logischen Systementwurf (BI-Wissenschaftsverlag, Mannheim, 1991).
4. K. Engel and S. Hartmann, Constructing realizers of semantic entity relationship schemes, Preprint 95/3, Universität Rostock (1995).
5. M. Gondran and M. Minoux, Graphs and algorithms (Wiley, New York, 1984).
6. S. Hartmann, Graph-theoretic methods to construct entity-relationship databases, in: M. Nagl (ed.), Graph-theoretic concepts in computer science, LNCS 1017 (Springer, Berlin, 1995) 131-145.
7. S. Hartmann, Über die Charakterisierung und Konstruktion von Entity-Relationship-Datenbanken mit Kardinalitätsbedingungen, Ph.D. thesis, Universität Rostock (1996).
8. S. Hartmann, Int-cardinality constraints in data modeling, Preprint, Universität Rostock (1998).
9. M. Lenzerini and P. Nobili, On the satisfiability of dependency constraints in Entity-Relationship schemata, Information Systems 15 (1990) 453-461.
10. D. Maier, The theory of relational databases (Computer Science Press, Rockville/MD, 1983).
11. J. Paredaens, P. de Bra, M. Gyssens and D. van Gucht, The structure of the relational database model (Springer, Berlin, 1989).
12. B. Thalheim, Fundamentals of cardinality constraints, in: G. Pernul and A.M. Tjoa (eds.), Entity-relationship approach, LNCS 645 (Springer, Berlin, 1992) 7-23.
13. B. Thalheim, A survey on Database Constraints, Reihe Informatik I-8, Universität Cottbus (1994).
14. B. Thalheim, Fundamentals of Entity-Relationship Models (Springer, Berlin, 1997).
Realizing Next Generation Internet Applications: Are There Genuine Research Problems, or Is It Advanced Product Development? Chairpersons: Kamalakar Karlapalem (HKUST) and Qing Li (CUHK) Panelists: Dik Lee (HKUST), Mukesh Mohania (University of South Australia), and John Mylopoulos (University of Toronto/CUHK)
The aim of this panel is to be both educational, in providing pointers to characteristics of the Next Generation Internet Applications (NGIA), and a forum for debating the relevance of these characteristics in generating new research directions. Over the last three years the Internet-driven applications industry has seen one of the highest growth rates. Almost every week new "start-up" companies are being set up, and new applications are released into the market. This growth rate is going to continue well into the next millennium to cater to NGIA. One of the critical aspects of these new applications is the development time from conceptualization to the release of the product. This could be as short as a weekend. The panel will discuss the role of academic researchers in this high-paced application development environment.
The NGIA range from content providers (multimedia, push/pull scenarios) and electronic commerce (transactions, workflow) to Internet-based virtual database systems. An overview of these applications will be the starting point of this debate. The debate will concentrate on distinguishing between research issues and development issues in each of the following aspects (though not limited to only these):
– Role of data semantics in realizing NGIA. Do we need more modeling power than what we have now? Even if we do, will we use it?
– Role of design methodologies in NGIA. We need efficient and fast design methodologies. And we need design methodologies that generate efficient and lean application programs. Are there any research issues here?
– Role of component oriented application deployment. The end-users may just buy different highly efficient functional application components, and assemble them to deploy new applications. How does this change the way applications are designed and developed?
– Role of integrating appliances into the realm of user applications. There is a need for external software that manages persistent data in the appliances and smart cards. What kind of system development issues arise?
– Role of standard middle-ware in NGIA. Will this help? Are there any research issues (meta-data,?) that come up in developing this middle-ware?
Web Sites Need Models and Schemes

Paolo Atzeni
Dipartimento di Informatica e Automazione, Università di Roma Tre
Via della Vasca Navale, 79, 00146 Roma, Italy
http://www.dia.uniroma3.it/~atzeni/
[email protected]
The World Wide Web is likely to become the standard platform for future generation applications. Specifically, it will be the uniform interface for sharing data in networks, both Internet and intranet. Referring mainly to "data-intensive" Web sites, which have the publication of information and data as their main goal, we can say that, in most cases, they do not satisfy the users' needs: the information kept is poorly organized and difficult to access; also it is often out-of-date, because of obsolete content and broken links. In general, this is a consequence of difficulties in the management of the site, both in terms of maintaining the structure and of updating the information. Many Web sites exist that have essentially been abandoned.

We believe that this situation is caused by the absence of a sound methodological foundation, as opposed to what is now standard for traditional information systems. In fact, Web sites are complex systems, and, in the same way as every other complex system, they need to be developed in a disciplined way, organized in phases, each devoted to a specific aspect of the system. In a Web site, there are at least three components: the information to be published (possibly kept in a database); the hypertextual structure (describing pages and access paths); the presentation (the graphical layout of pages).

It is widely accepted that data are described by means of models and schemes, at various levels, for example conceptual and logical. In a data-intensive Web site, it is common to have large sets of pages that contain data with the same structure (coming from tuples of the same relation, if there is a database): therefore, we argue for the relevance of the notions of model and scheme also for the description of hypertexts. Given the various facets that can arise, we also believe that hypertexts have, in the same way as data, both a conceptual and a logical level.

These issues have led us to develop a methodology (Atzeni et al. [1]) that is based on a clear separation among three well distinguished and yet tightly interconnected design tasks: the database design, the hypertext design, and the presentation design. Both database and hypertext design have a conceptual phase followed by a logical one. Figure 1 shows the phases, the precedences among them, and their major products (schemes according to appropriate models).
Fig. 1. The Araneus Design Methodology (the final phase, 6. Hypertext to DB Mapping and Page Generation, produces the Web site in HTML).
The originality of the methodology is in the conceptual and logical design of hypertexts, which make use of specific models, developed in this framework: – ncm, the Navigation Conceptual Model , a conceptual model for hypertexts, which is essentially a variation of the er Model suitable to describe hypertextual features in an implementation independent way; – adm, the Araneus Data Model , a logical model for hypertexts (Atzeni et al. [2]), whose main construct is that of page scheme, used to describe the common features of similar pages. It is worth noting that other proposals have been recently published that present some similarities, though with a less detailed articulation of models (P. Fraternali and P. Paolini [4], Fernandez et al. [3]). The origins of the methodological aspects can be traced back to previous work on hypermedia design (Garzotto et al. [5], Isakowitz et al. [6], Schwabe and Rossi [7]).
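As a very rough illustration of what a logical "page scheme" captures, namely one description shared by many structurally similar pages, consider the sketch below. It is not adm syntax, which is defined in Atzeni et al. [2]; all names, attribute kinds and the example site are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# One scheme describes the common structure of many similar pages
# (e.g. one page per tuple of a relation).  All names are hypothetical.

@dataclass
class Attribute:
    name: str
    kind: str                              # e.g. "text", "image", "link-list"
    target_scheme: Optional[str] = None    # for links: the scheme they point to

@dataclass
class PageScheme:
    name: str
    url_pattern: str
    attributes: List[Attribute] = field(default_factory=list)

# A single scheme stands for the whole set of author pages of a hypothetical site.
author_page = PageScheme(
    name="AuthorPage",
    url_pattern="/authors/{author_id}.html",
    attributes=[
        Attribute("name", "text"),
        Attribute("photo", "image"),
        Attribute("publications", "link-list", target_scheme="PaperPage"),
    ],
)
print(author_page.name, [a.name for a in author_page.attributes])
```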
Acknowledgments. I would like to thank Gianni Mecca and Paolo Merialdo, together with whom most of the concepts mentioned here are being developed.
References

1. P. Atzeni, G. Mecca, and P. Merialdo. Design and Maintenance of Data-Intensive Web Sites. Advances in Database Technology—EDBT'98, Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.
2. P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In International Conf. on Very Large Data Bases (VLDB'97), Athens, Greece, August 26-29, 1997.
3. M. F. Fernandez, D. Florescu, J. Kang, A. Y. Levy, D. Suciu. Catching the Boat with Strudel: Experiences with a Web-Site Management System. In Proc. of ACM SIGMOD Int'l Conference on Management of Data, 1998.
4. P. Fraternali and P. Paolini. A conceptual model and a tool environment for developing more scalable, dynamic, and customizable Web applications. Advances in Database Technology—EDBT'98, Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.
5. F. Garzotto, P. Paolini, and D. Schwabe. HDM – a model-based approach to hypertext application design. ACM Transactions on Information Systems, 11(1):1–26, January 1993.
6. T. Isakowitz, E. A. Stohr, and P. Balasubramanian. RMM: A methodology for structured hypermedia design. Communications of the ACM, 38(8):34–44, August 1995.
7. D. Schwabe and G. Rossi. The Object-Oriented Hypermedia Design Model. Communications of the ACM, 38(8):45–46, August 1995.
ARTEMIS: A Process Modeling and Analysis Tool Environment

S. Castano¹, V. De Antonellis², and M. Melchiori²

¹ University of Milano, DSI - via Comelico, 39 - 20135 Milano, Italy
[email protected]
² University of Brescia, DEA - via Branze, 38 - 25123 Brescia, Italy
{deantone,melchior}@ing.unibs.it
Abstract. To support business process understanding and reengineering, techniques and tools for process modeling and analysis are required. The paper presents the ARTEMIS tool environment for business process modeling and analysis. Process analysis is performed according to an organizational structure perspective and an operational structure perspective, to capture the degree of autonomy/dependency of organization units in terms of coupling, and the inter-process semantic correspondences, in terms of data and operation similarity, respectively. Processes are modeled as workflows and techniques developed for workflow analysis are presented in the context of a pilot application involving the Italian Ministry of Justice.
1 Introduction
Most private and public organizations have recently turned their attention to the process by which they operate, to improve service and product quality and customer satisfaction [13]. To support business process understanding and reengineering, techniques and tools for process modeling and analysis are studied [12,9,14]. Moreover, reverse engineering techniques to reconstruct conceptual models of existing applications and databases are proposed for analysis purposes [1]. Process analysis is generally performed following an information processing viewpoint [10], focusing on input/output data and on the process structure and execution modalities [2,15]. In [6], we have presented a process analysis approach according to an inherently data-oriented perspective, that mainly focuses on characteristics of data manipulated and exchanged by processes and on related operations. In this paper, we extend the approach to the analysis of process structure and execution modalities described by workflow specifications, and we present the ARTEMIS (Analysis of Requirements: Tool Environment for Multiple Information Sources) tool environment for process modeling and analysis. The analysis techniques of ARTEMIS rely on workflow descriptions of processes and allow the analyst to discover and classify critical situations requiring reengineering interventions, according to operational structure and organizational structure perspectives.
Functionalities provided by ARTEMIS are illustrated together with results of their application to the international adoption processes of the Italian Juvenile Court of the Ministry of Justice, in the context of the PROGRESS (PROcess Guided REengineering Support System) project. PROGRESS is a research project, funded by the Italian National Research Council (CNR) and by the Italian National Consortium for Informatics (CINI), which aims at reengineering data and processes of Italian Public Administration information systems. The paper is organized as follows. In Sect. 2, we present an overview of the ARTEMIS functionalities. In Sect. 3, we describe process workflow modeling, while, in Sect. 4, we describe the ARTEMIS analysis functionalities, with application to selected processes of the Juvenile Court. Finally, Sect. 5 draws some conclusions and describes future work.
2 Functionalities of the ARTEMIS Tool Environment
ARTEMIS provides the following analysis functionalities to support reengineering activities:
– Process form cataloging. Process forms in PROGRESS provide textual description of organization units and related processes, giving preliminary information on their input/output and composing tasks. Once such forms are filled-in with information on processes to be analyzed and modeled, they are stored and properly classified by means of keywords to facilitate their subsequent retrieval through ARTEMIS.
– Process workflow modeling. In ARTEMIS, processes to be analyzed are modeled as workflows, according to the WIDE workflow model, presented in Sect. 3. The WIDE model has been developed in the framework of the Esprit Project WIDE (Workflow on Intelligent Distributed database Environment) [3], and allows the representation of operational and organizational aspects relevant for workflow analysis. Process workflow modeling is accomplished by interfacing the WIDE workflow specification tool, called FORO designer.
– Process workflow analysis. These are the core functionalities of ARTEMIS, which operate according to the following analysis perspectives.
Analysis of the operational structure. Processes are analyzed with respect to their input/output information entities and their functionality, in order to identify situations of replication/redundancy/overlapping of activities and to evaluate the relevance and repetitiveness of processes within a given unit. The analysis is based on the following similarity coefficients: i) entity-based similarity coefficient, to evaluate the degree of similarity of two processes with respect to their input/output information entities. It can be a point of reference to evaluate the adequacy of data usage in the operational structure; ii) functionality-based similarity coefficient, to analyze process relationships due to operation commonality. It can be a point of reference to check the adequacy of production/manipulation of information in the operational structure.
Analysis of the organizational structure. Processes are analyzed with respect to their exchanged information flows in order to evaluate the degree of interdependency between them. On the basis of the number of exchanged information flows, we can measure the degree of coupling between different processes. The analysis is based on the following coupling coefficients: i) actual coupling coefficient, to evaluate the degree of coupling of processes in the same or different organization units, based on the analysis of information flows involving identical entities; ii) potential coupling coefficient, to evaluate the degree of coupling, based on the analysis of information flows involving entities that are similar according to the semantic dictionary contents. Coupling coefficients can be a point of reference to understand the information flow network and its implications to determine, at an aggregated level, the information flows among the separate involved processes and envisage possible regrouping of processes to simplify or expedite the flow and improve global effectiveness.
To support the process analysis, two further functionalities are provided in ARTEMIS:
– Interactive construction of a semantic dictionary. The evaluation of similarity and coupling coefficients requires the comparison of information entities and of operations of different processes. A semantic dictionary is exploited to handle possible synonyms and other terminological relationships between entity and operation names. The semantic dictionary is semi-automatically built before starting the analysis, where information entity names and operation names are stored as terms and organized by generalization and aggregation mechanisms. Issues related to the construction of the semantic dictionary are presented in [6].
– Interactive extraction of process descriptors. Process descriptors provide summary information on processes to be analyzed, for the evaluation of similarity and coupling coefficients. They are interactively extracted from workflow specifications produced with the modeling functionality.
In the following, we will focus on functionalities for process workflow analysis, with application to the International Adoption processes of the Italian Juvenile Court. In particular, after a general description of the process workflow model, we discuss the analysis perspectives and the associated coefficients and present results of their application to selected processes.
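The actual dictionary construction and affinity functions are described in [6] and are not reproduced here. Purely as an illustration of the mechanism just described, terms linked by terminological relationships, with affinity decreasing with the length and strength of the connecting path, one could imagine something like the following sketch, where the relationship strengths and the example terms are invented.

```python
from collections import defaultdict
import heapq

# Hypothetical relationship strengths in (0, 1]; the real values come from [6].
STRENGTH = {"SYN": 1.0, "BT/NT": 0.8}     # synonymy, broader/narrower term

class SemanticDictionary:
    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, t1, t2, rel):
        s = STRENGTH[rel]
        self.edges[t1].append((t2, s))
        self.edges[t2].append((t1, s))

    def affinity(self, t1, t2):
        """A(t1, t2) in [0, 1]: product of strengths along the best path,
        so it decreases with path length and with weaker relationships."""
        if t1 == t2:
            return 1.0
        best = {t1: 1.0}
        heap = [(-1.0, t1)]
        while heap:
            a, t = heapq.heappop(heap)
            a = -a
            if t == t2:
                return a
            if a < best.get(t, 0.0):
                continue                      # stale heap entry
            for nxt, s in self.edges[t]:
                if a * s > best.get(nxt, 0.0):
                    best[nxt] = a * s
                    heapq.heappush(heap, (-a * s, nxt))
        return 0.0

d = SemanticDictionary()
d.add("dossier", "file", "SYN")
d.add("file", "document", "BT/NT")
print(d.affinity("dossier", "document"))   # 0.8 via dossier -> file -> document
```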
3 Process Workflow Modeling
In ARTEMIS, processes are modeled as workflows, using the WIDE workflow model [3]. In WIDE, a workflow schema is the model of a (business) process. It is a collection of tasks, which are the elementary work units that collectively achieve the workflow goal. Tasks are organized into a flow structure defining the
execution dependencies among tasks, i.e., the order in which tasks are executed. Sequential, parallel and conditional executions can be specified. A workflow case is an execution of a workflow schema, i.e., an instance of the process. Multiple executions of the same process may be active at the same time, and are denoted by a different activation number. Each task within a workflow schema has associated a set of characteristics, among which: the name; a textual description; the actions, either manual or a sequence of statements of the description language defining how both temporary and persistent workflow data are manipulated by the task; roles which may perform the task, and a set of constraints concerning task assignment to agents; the task information (documents, forms, dossier, accessed databases), classified as input and output to be used/produced when achieving the task; exceptions, that can be specified to handle abnormal situations that can occur during the execution of the task and need to be managed properly. The flow structure is specified by means of a set of constructs allowing sequence, alternative, and parallelism. Two tasks may be directly connected by an edge to denote sequence: as soon as the first one ends, the second one is scheduled for execution. More complex execution dependencies are specified by means of the fork connectors (for initiating concurrent execution) and join connectors (for synchronizing after concurrent execution) [3]. A case is executed by scheduling tasks (as defined by the flow structure) and by assigning them for execution to a human or an automated agent. As a case is started, the first task (the successor of the start symbol) is activated. As a task connected to the stop symbol is completed, the case is also completed. The WIDE workflow model includes also an advanced construct of supertask to enable modularization in a workflow specification. It is a composite task, composed of elementary (atomic) tasks or of other supertasks. As the predecessor of the supertask is completed, the first task in the supertask is activated. As the last task in the supertask is completed, the successor of the supertask is scheduled for execution. For example, the International Adoption process is modeled in WIDE as a set of supertasks (shadowed boxes in Fig. 1), describing the composite activities involved in the process. The process starts when a couple submits an application to be declared eligible for an international adoption. After receiving the submitted application, international qualifying is attested for the couple, waiting for a foreign judge action (this condition is represented by means of the oval box preceding Foreign Action Procedure supertask in the figure). Depending on the validity of the action, the Revoke Fostering or the Adoption task is executed. If the action is valid, after one year and if no other events occur in the meantime, the Adoption supertask can start. If the foreign action is invalid, the Revoke Fostering activity is started, which is also executed when a notification arrives stating that the fostering is not valid. This situation is modeled using a conditional fork (diamond symbol after Foreign Action Procedure in the figure). The workflow can terminate in two different cases. In the first case, termination occurs if the fostering is not valid, or no adoption is allowed (this
Fig. 1. The International Adoption process in WIDE (supertasks: Application, Intern. Qualifying, Foreign Action Procedure, Revoke Fostering, Adoption).
Fig. 2. The Foreign Action Procedure activity in WIDE (tasks and responsible units: Attorney Consultation (Public Prosecutor), Update EDP (EDP), Action Requirement (Official Receiver), Trial and Decisions (Chamber of Council), Getting Results (Chancellery), Insert Results (EDP)).
is modeled by using a conditional join -circle symbol- before the end WF symbol on the left side), and the National Adoption workflow can be started. In the second case, the workflow terminates with successful adoption. Each supertask of the International Adoption is in turn expanded into a workflow, to model its corresponding component tasks and executing agents. In Fig. 2, part of the expansion of the supertask Foreign Action Procedure into elementary tasks is shown. For each task, the organization unit responsible for task execution is reported, together with the input and output documents.
Moreover, if database operations are performed from within the task, these are listed, by specifying the type (Select, Update, Insert, Delete) and the involved attributes of database entities and/or relationships.
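The WIDE specification language and the FORO designer's internal format are not shown in this paper. Purely as a hypothetical rendering of the kind of task record just described (executing unit, input and output documents, database operations with their type and attributes), one might use a structure like the following, populated with data taken from the Insert Results task of Fig. 2; field names are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DBOperation:
    op_type: str                 # "S"elect, "U"pdate, "I"nsert, "D"elete
    target: str                  # entity or relationship name
    attributes: List[str] = field(default_factory=list)

@dataclass
class Task:
    name: str
    description: str
    organization_unit: str       # role/agent responsible for execution
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    db_operations: List[DBOperation] = field(default_factory=list)
    exceptions: List[str] = field(default_factory=list)

insert_results = Task(
    name="Insert Results",
    description="Update Database and Print Results",
    organization_unit="EDP",
    inputs=["international adoption dossier"],
    outputs=["international adoption dossier",
             "validation action of international fostering or adoption"],
    db_operations=[DBOperation("I", "ACTION", ["all attributes"]),
                   DBOperation("S", "ACTION", ["all attributes"])],
)
print(insert_results.organization_unit, insert_results.name)
```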
4 Process Workflow Analysis
In ARTEMIS, process workflow analysis is performed using descriptors. A process descriptor gives a summary, structured representation of the features of a process that are relevant for the application of the analysis coefficients. A descriptor provides the following information for a process: i) the name of the organizational unit responsible for process execution; ii) the set IN of input information entities; iii) the set OUT of output information entities; iv) the set OP of operations performed by the process, formally described as triplets ⟨action, constitutive entities (CST), circumstantial entities (CSM)⟩.
In ARTEMIS, descriptors are interactively extracted from WIDE workflow specifications, by analyzing the internal representation generated by the FORO designer tool. Since workflows describe complex processes (i.e., business processes), descriptors are extracted from workflow tasks, to allow for a fine-grained analysis of performed activities. In particular, the organization unit field of the descriptor corresponds to the agent that performs the task (graphically shown in the upper right hand corner of the task diagram, see Fig. 1); the IN and OUT sets of information entities correspond to input and output documents associated with the task (graphically they are labeled as INPUT and OUTPUT in the figure); the set OP of operations that the task performs is derived from the task description, by recognizing the action (i.e., a verb), the set CST of constitutive entities required by the operation, and the set CSM of circumstantial entities involved in the operation. The operations are interactively extracted, starting from the task description, with tool assistance. As an example, in Fig. 3, the descriptor for the task Getting Results within the Foreign Action Procedure workflow is shown.
Process descriptor
TASK: Getting Results
ORGANIZATION UNIT: Chancellery
INPUT: {international adoption application, code of the minor, code of the couple, role number, dossier of the couple, dossier of the minor, action of delegation assignment, authorization of the Ministry of the Internal and Foreign Affairs, action request, result of trial}
OUTPUT: {communication to the Ministry of the Internal and Foreign Affairs, registers}
OPERATIONS: { }
Fig. 3. An example of descriptor for the task Getting Results
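Descriptor extraction in ARTEMIS is interactive and tool-assisted; the sketch below only fixes the shape of a descriptor and the mechanical part of the extraction (copying the unit and the IN/OUT sets from a task specification), leaving the operation triples to the analyst. The field names, the dictionary-style task input and the example operation triple are illustrative, not ARTEMIS's internal format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class ProcessDescriptor:
    task: str
    organization_unit: str
    IN: Set[str] = field(default_factory=set)
    OUT: Set[str] = field(default_factory=set)
    # each operation is <action, constitutive entities CST, circumstantial CSM>
    OP: List[Tuple[str, Set[str], Set[str]]] = field(default_factory=list)

def extract_descriptor(task: dict) -> ProcessDescriptor:
    """Mechanical part of the extraction: unit, IN and OUT are copied from
    the task specification; the OP triples are added interactively."""
    return ProcessDescriptor(
        task=task["name"],
        organization_unit=task["unit"],
        IN=set(task.get("inputs", [])),
        OUT=set(task.get("outputs", [])),
    )

getting_results = extract_descriptor({
    "name": "Getting Results",
    "unit": "Chancellery",
    "inputs": ["international adoption dossier", "result of trial"],
    "outputs": ["registers", "communication of the result of trial"],
})
# Hypothetical operation triple supplied by the analyst:
getting_results.OP.append(("update", {"registers"}, {"result of trial"}))
print(getting_results)
```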
4.1 Analysis of the Operational Structure with Experimentation in PROGRESS
Following this perspective, processes are analyzed according to similarity criteria and classified into families to evaluate possible situations of redundancy, replication, or inconsistency with respect to manipulated information entities and performed operations. Process analysis and classification is based on the following similarity coefficients. Entity-based similarity coefficient. The Entity-based similarity coefficient of two processes Pi and Pj , denoted by ESim(Pi , Pj ), is evaluated by comparing the input/output information entities in their corresponding descriptors, that is,
ESim(Pi, Pj) = 2 · Atot(IN(Pi), IN(Pj)) / (|IN(Pi)| + |IN(Pj)|) + 2 · Atot(OUT(Pi), OUT(Pj)) / (|OUT(Pi)| + |OUT(Pj)|)
where Atot(IN(Pi), IN(Pj)) (respectively, Atot(OUT(Pi), OUT(Pj))) denotes the total value of "affinity" between the pairs of input (respectively, output) information entities in Pi and Pj, and | | denotes the cardinality of a given set. Atot is obtained by summing up the affinity values of all the pairs of input/output information entities that have affinity in the semantic dictionary. Proper "affinity functions" are available on the dictionary which, given two names, determine their affinity value A() ∈ [0, 1] based on the existence of a path of terminological relationships (e.g., synonymy, hypernymy) between them, on its length and on the strength of the involved relationships. High affinity values (i.e., A() ∈ [0.7, 1]) denote that the corresponding information entities can be considered for the evaluation of the ESim coefficient.
ESim can assume values in the range [0, 2]. It is 0 when no pairs of entities with affinity are found in IN and OUT, while it has value 2 when each input and output entity of Pi has affinity with an input and output entity of Pj and vice versa. Intermediate values are proportional to the number of pairs of information entities with affinity in Pi and Pj and to their affinity value.
Functionality-based similarity coefficient. The Functionality-based similarity coefficient of two processes Pi and Pj, denoted by FSim(Pi, Pj), is evaluated by comparing the operations in their corresponding descriptors. Also in this case, the comparison is based on the semantic dictionary, that is,

FSim(Pi, Pj) = 2 · Atot(OP(Pi), OP(Pj)) / (|OP(Pi)| + |OP(Pj)|)
where Atot (OP (Pi ), OP (Pj )) denotes the total value of affinity of the pairs of operations that are similar in Pi and Pj . Two operations are similar if their actions, their constitutive information entities and, if defined, their circumstantial information entities have affinity in the dictionary. The similarity value of two
operations is obtained by summing up the affinity values of their corresponding elements. FSim assumes values in the range [0, 3]. It is 0 when no similar operations are found in Pi and Pj, while it has value 3 when each operation of Pi is similar to an operation of Pj and vice versa, and the elements in each operation pair have the greatest affinity value. Intermediate values are proportional to the number of pairs of similar operations in Pi and Pj and to the affinity of their elements.
Once the processes for the analysis have been selected, the user can interactively set similarity thresholds to filter out process pairs based on the computed ESim and FSim values. Similarity reports are produced by ARTEMIS in the form of tables, listing process pairs according to different criteria (e.g., decreasing order of similarity, alphabetical order). Two kinds of semantic correspondences are identified between two processes Pi, Pj: i) Semantic equivalence, (Pi ≡ Pj), denoting processes whose ESim and FSim coefficients have the maximum value. This means that they perform the same real-world activity (activity replication), and for them the unification of the involved activity should be evaluated. ii) Semantic relationship, (Pi ∼ Pj), denoting processes with T1 ≤ ESim(Pi, Pj) < max and T2 ≤ FSim(Pi, Pj) < max, where T1 and T2 are similarity thresholds specified by the user. This means that the processes execute partially overlapping real-world activities, and this situation should be analyzed to evaluate the unification or the standardization of the involved activities.
Let us now discuss the results of performing the operational structure analysis on the adoption processes in PROGRESS. Four processes related to the national and international adoption procedures have been selected for the analysis, namely, National Evaluation, Art. 144 Law 189, International Adoption, and National Adoption. The analysis has been performed separately for each process, since the four selected processes represent different procedures with different objectives. Two organization units play an important role in all examined processes, namely the EDP and the Chancellery units, which have been identified as crucial units for reengineering activities. We report the results of the analysis of the tasks performed by the EDP and the Chancellery organization units within the International Adoption process. Analyzing the International Adoption workflow means analyzing all the workflows corresponding to its supertasks, that is, the Application, International Qualifying, Foreign Action Procedure, Revoke Fostering, and Adoption workflows (see Fig. 1). We first performed the operational structure analysis of all International Adoption tasks of each organization unit separately, to classify the activities performed by the unit and reason about their relevance and repetitiveness. Then, we performed the analysis of all International Adoption tasks of both the EDP and Chancellery together, to recognize activity replication/overlapping and evaluate restructuring interventions. This way, ARTEMIS allowed us to reason about different aspects and point out critical situations of different nature, within the same organization unit or involving the two units together.
Analysis of all tasks of a given organization unit. Tasks performed by the Chancellery and the EDP in the International Adoption process have been
analyzed separately for each unit. On the basis of the obtained similarity values, the tasks of each unit were classified into families. Families obtained with ARTEMIS were used to classify the activities performed by the two considered organizational units. For example, for the Chancellery, we recognized four main categories of activities performed within the International Adoption, namely Update Registers, Getting Application, Registration, and Request of Documents. Each category groups tasks executed in different points of the examined workflows, which are characterized by similar operations on similar documents. The identified activity categories are the basis for evaluating the following parameters: i) replication of a given task in a workflow; ii) repetitiveness of a given category within a workflow; iii) relevance of a given category in terms of other categories which depend on its accomplishment in order to start their execution. Based on these parameters, it is possible to establish if a category has to be considered as "crucial" and if its task occurrences refer to similar operational conditions. In this case, the category of tasks is a candidate for automation, if the activities are (partially) manually performed.
By analyzing task categories for the considered units, we found that the Chancellery performs mainly administration activities while the EDP unit is characterized by several database update activities. By evaluating the repetitiveness of each category within the International Adoption process, we discovered that the crucial tasks are those concerning register update for the Chancellery and database update for the EDP, respectively. Moreover, by examining the execution flow in correspondence of tasks of a given category, we observed that, in the present way of working, tasks concerning register update are mandatory for the continuation of the activities in the workflow, while the corresponding database updates are not. Consequently, tasks performed by the Chancellery are to be considered as relevant for the International Adoption process. On the contrary, tasks performed by the EDP unit are not relevant in the present configuration, being executed some time after the necessary information is produced. As a consequence, the database does not reflect in real time the situation contained in the paper-based registers. It would be desirable to have real-time updates of the database, as soon as the necessary information is produced. This would be possible if the Chancellery were enabled to perform on-line updates on the database. To formulate possible solutions of workflow reorganization to meet such a requirement, we also need to exploit the results of the analysis of the tasks performed by both the EDP and Chancellery units in the International Adoption.
Analysis of all tasks of different organization units. EDP and Chancellery tasks related to the International Adoption process were analyzed for evaluating their similarity; in Table 1, we report the obtained ESim and FSim coefficients. As we can see from these values, only tasks with a semantic relationship characterize these two units in the considered process. By analyzing the tasks with the highest similarity values (i.e., the four tasks characterized by ESim = 1 and FSim = 2.8 and the pair with ESim = 2 and FSim = 1.2 in Table 1), together
Table 1. ESim and FSim coefficients for the Chancellery and the EDP tasks in the International Adoption process
with the involved workflows, we discovered the repetition of a recurring "pattern" in task execution: whenever the Chancellery performs a register-update task, a "dual" task performing the corresponding database update is performed by the EDP. For example, let us consider the tasks Getting Results and Insert Results (fourth row in Tab. 1) performed by the Chancellery and the EDP units, respectively. In the current Foreign Action Procedure workflow, these two tasks are performed in sequence (see Fig. 2, boxed area), with disadvantages related to the periodic update overhead for the EDP and the inconsistent state of the database with respect to the information in the registers. According to the obtained similarity values and by taking into account the current execution modalities, we envisaged two possible solutions:
Solution a) concurrent tasks (see Fig. 4): the EDP and the Chancellery can execute the updates concurrently. To make this possible, the Chamber of Council has to send a copy of the documents to be updated to both the EDP and the Chancellery. This solution puts a (limited) overhead on the activity of the Chamber of Council, to make an additional copy of the produced documents to be sent to the EDP. This solution can be useful in a transition phase, from the current workflow to the target workflow, defined according to Solution b).
Fig. 4. Solution a): Concurrent tasks (the Chancellery's Getting Results / Update Registers and the EDP's Insert Results / Update Database run in parallel).
Fig. 5. Solution b): Unified tasks (a single Chancellery task, Getting Results, performs both the register update and the database update).
Solution b) unified tasks (see Fig. 5): with this solution, the activities currently assigned to the EDP and the Chancellery are unified into a single task under the responsibility of the Chancellery. This way, the Chancellery is enabled to perform the necessary database updates directly, as soon as modifications are introduced in the registers. This solution, whose benefits are evident, requires an underlying distributed architecture to be implemented.
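Before turning to the organizational perspective, a rough sketch of how the ESim and FSim coefficients defined at the beginning of this section could be computed from two descriptors may help. The affinity function is stubbed out (in ARTEMIS it comes from the semantic dictionary), Atot is approximated by a greedy one-to-one pairing of high-affinity elements, and the entity sets of an operation are collapsed to single representative names; these are simplifications not prescribed by the paper, and all example data are invented.

```python
def pair_total(xs, ys, aff, threshold):
    """Sum of affinities over a greedy one-to-one pairing of elements whose
    affinity reaches the threshold (the paper's Atot; the pairing rule is assumed)."""
    pairs = sorted(((aff(x, y), x, y) for x in xs for y in ys),
                   reverse=True, key=lambda t: t[0])
    seen_x, seen_y, total = set(), set(), 0.0
    for a, x, y in pairs:
        if a < threshold:
            break
        if x not in seen_x and y not in seen_y:
            seen_x.add(x); seen_y.add(y); total += a
    return total

def esim(d1, d2, name_aff, threshold=0.7):
    """ESim(P1, P2) in [0, 2]: affinity-weighted overlap of the IN and OUT sets."""
    return (2 * pair_total(d1["IN"], d2["IN"], name_aff, threshold)
            / (len(d1["IN"]) + len(d2["IN"]))
            + 2 * pair_total(d1["OUT"], d2["OUT"], name_aff, threshold)
            / (len(d1["OUT"]) + len(d2["OUT"])))

def fsim(d1, d2, name_aff, threshold=0.7):
    """FSim(P1, P2) in [0, 3].  Operations are <action, CST, CSM> triples of names;
    a matching pair contributes the sum of the affinities of its three parts,
    provided every part reaches the threshold."""
    def op_aff(o1, o2):
        parts = [name_aff(a, b) for a, b in zip(o1, o2)]
        return sum(parts) if all(p >= threshold for p in parts) else 0.0
    return (2 * pair_total(d1["OP"], d2["OP"], op_aff, threshold)
            / (len(d1["OP"]) + len(d2["OP"])))

# Stubbed dictionary affinity and two invented descriptors.
dict_aff = lambda x, y: 1.0 if x == y else (0.8 if {x, y} == {"registers", "database"} else 0.0)
chancellery = {"IN": {"international adoption dossier", "result of trial"},
               "OUT": {"registers"},
               "OP": [("update", "registers", "result of trial")]}
edp = {"IN": {"international adoption dossier"},
       "OUT": {"database"},
       "OP": [("update", "database", "result of trial")]}
print(round(esim(chancellery, edp, dict_aff), 2), round(fsim(chancellery, edp, dict_aff), 2))
```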
4.2 Analysis of the Organizational Structure with Experimentation in PROGRESS
The effectiveness of an organization unit depends on its level of coupling with the outside (i.e., other organization units). Following this analysis perspective, we analyze the interactions between different units, to understand the information flow network and its implications. The analysis is based on input and output information entities to determine, at an aggregate level, the information flow among the processes. When analyzing processes of different organization units, two kinds of flows are relevant: "actual flows", denoted by ↦_{ek,eh}, originated by the exchange of the same information entity; "potential flows", denoted by
Table 2. Actual Coupling Coefficients for Chancellery and EDP processes
⇝_{ek,eh}, originated by the exchange of information entities with affinity. To evaluate process coupling in a precise way that allows one to point out the nature of the involved entities, the following coupling coefficients are introduced.
Actual Coupling coefficient. The Actual Coupling coefficient of two processes Pi and Pj, denoted by AC(Pi, Pj), measures the amount of relationships between the processes due to actual information flows, that is,

AC(Pi, Pj) = |{⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj}|,

where {⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj} is the set composed of information entities that originate actual flows between Pi and Pj. Information flows are determined by comparing the input (respectively, output) information entities of one process with the output (respectively, input) information entities of the other one.
Potential Coupling coefficient. The Potential Coupling coefficient of two processes Pi and Pj, denoted by PC(Pi, Pj), measures the amount of relationships between the processes due to the actual and potential flows between them, that is,

PC(Pi, Pj) = |{⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj ∨ Pi ⇝_{ek,eh} Pj}|.

The Potential coupling coefficient is evaluated by exploiting the semantic dictionary, to recognize information entities with affinity. Analogously to the operational structure, summary reports for both the actual and potential coupling coefficients can be produced with ARTEMIS, with different presentation options. The user can decide which kind of coupling coefficient to compute, and on which processes. In particular, it is possible to compute coupling coefficients on families of similar processes obtained as a result of the operational structure analysis, if processes of different organization
units are involved, to provide complementary insights into process families for a comprehensive analysis.
In our experimentation, coupling coefficients have been computed for the tasks of the Chancellery and EDP units. In Table 2, we report the actual coupling values for the International Adoption tasks that have a semantic relationship in the two organization units. Since the examined tasks involve the same information entities, the potential coupling is not relevant. The higher the coupling values, the higher their level of dependency, which suggests unification interventions. On the basis of the analysis of the volumes and type of the exchanges, it is possible to envisage possible regroupings of processes to simplify or expedite this flow. By analyzing the information flows between the Chancellery and EDP units, we found that they exchange the same documents several times, but not all of the exchanged documentation is necessary for the accomplishment of the tasks. Moreover, such additional documentation exchanges contribute to slowing down the overall process. A solution based on task unification would also contribute to reducing the number of exchanged documents. Another solution could be the reorganization of the documentation flow, to reduce and optimize the number of document exchanges required to accomplish the goal. In general, modifications to existing workflow schemas can be incorporated in ARTEMIS, by producing new workflows; new analysis coefficients can then be evaluated and compared with the previous ones, to facilitate the analysis. For this purpose, the reporting functionalities offered by the ARTEMIS tool environment can be exploited.
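A similarly rough sketch of the coupling coefficients: an actual flow is counted when an information entity output by one process is input to the other, and a potential flow when the two entity names merely have affinity in the dictionary. The affinity stub and the example descriptors are invented; ARTEMIS's flow derivation from workflow specifications is richer than this.

```python
def actual_coupling(d1, d2):
    """AC(P1, P2): number of entity pairs that originate an actual flow,
    i.e. the same entity output by one process and input by the other."""
    flows = ({(e, e) for e in d1["OUT"] & d2["IN"]}
             | {(e, e) for e in d2["OUT"] & d1["IN"]})
    return len(flows)

def potential_coupling(d1, d2, affinity, threshold=0.7):
    """PC(P1, P2): actual flows plus flows between entities with affinity."""
    flows = set()
    for out_side, in_side in ((d1["OUT"], d2["IN"]), (d2["OUT"], d1["IN"])):
        for ek in out_side:
            for eh in in_side:
                if ek == eh or affinity(ek, eh) >= threshold:
                    flows.add((ek, eh))
    return len(flows)

# Invented data: "registers" and "database" are related only through the dictionary.
aff = lambda x, y: 0.8 if {x, y} == {"registers", "database"} else 0.0
chancellery = {"IN": {"international adoption dossier"},
               "OUT": {"international adoption dossier", "registers"}}
edp = {"IN": {"international adoption dossier", "database"},
       "OUT": {"international adoption dossier", "database"}}
print(actual_coupling(chancellery, edp), potential_coupling(chancellery, edp, aff))
```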
5 Concluding Remarks
In this paper, we have presented the ARTEMIS tool environment for process modeling and analysis. Based on workflow specifications of processes, the analysis is performed according to an organizational structure perspective and an operational structure perspective, to capture the degree of autonomy/dependency of organization units in terms of coupling and the inter-process semantic correspondences in terms of data and operation similarity, respectively. The analysis tool environment described in this paper is intended to support and facilitate process reengineering activities, by providing similarity-based techniques for the systematic analysis of processes for aspects related to information and operation similarity, and to exchanged information flows.
Results of performing the analysis on the adoption processes of the Juvenile Court of the Ministry of Justice have been discussed. At present, such results and the proposed solutions are under examination by the Juvenile Court. The goal of future work is the analysis of the execution modalities of single processes and of the communication protocols between them, to discover possible inefficiencies and failures and to enrich the tool capabilities for reengineering.
References 1. Aiken, P.: Data Reverse Engineering, McGraw-Hill 1996. 2. Barros, A.P., ter Hofstede, A.H.M., Proper, H.A.: Towards Real-Scale Business Transaction Workflow Modelling. In Proc. of CAiSE’97 - Int. Conf. on Advanced Information Systems Engineering, Barcelona, Spain (1997). 3. Casati, F., Ceri, S., Pernici, B., Pozzi, G.: Conceptual Modeling of Workflows. In Proc. of OO-ER’95, Int. Conf. on the Object-Oriented and Entity-Relationship Modelling, Gold Coast, Australia (1995). 4. Castano, S., De Antonellis, V., Fugini, M.G., Pernici, B.: Conceptual Schema Analysis: Techniques and Applications. ACM Transactions on Database Systems (to appear). 5. Castano, S., De Antonellis, V.: Semantic Dictionary Design for Database Interoperability. Proc. of ICDE’97, IEEE Int. Conf. on Data Engineering, Birmingham (1997). 6. Castano, S., De Antonellis, V.: A Framework for Expressing Semantic Relationships Between Multiple Information Systems for Cooperation. Information Systems, Special Issue on CAiSE’97, 27(3/4) (1998). 7. Castano, S., De Antonellis, V.: Reference Conceptual Architectures for Reengineering Information Systems. International Journal of Cooperative Information Systems 4(2 & 3) (1995). 8. Castano, S., De Antonellis, V.: Reengineering Processes in Public Administrations. Proc. of OO-ER’95, Int. Conf. on the Object-Oriented and Entity-Relationship Modeling, Gold Coast, Australia, (1995). 9. Fong, J.S.P., Huang, S.M.: Information Systems Reengineering, Springer-Verlag (1997). 10. Galbraith, J.R.: Designing Complex Organizations, Addison-Wesley Publishing Company (1973). 11. Hammer, M.J.: Reengineering Work: Don’t Automate. Obliterate, Harvard Business Review, July/August (1990). 12. Georgakopoulos, D., Hornik, M., Sheth, A.: An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure. Distributed and Parallel Databases, Kluwer Academic Publishers, 3 (1995). 13. Karagiannis, D. (Ed.): Special Issue on Business Process Reengineering, SIGOIS Bulletin, 16(1) (1995). 14. Nurcan, S., Grosz, G., Souveyet, C.: Describing Business Processes with a Guided Use Case Approach. Proc. of CAiSE*98, - Int. Conf. on Advanced Information Systems Engineering, Pisa, Italy, (1998). 15. Workflow Management Coalition: The wfmc specification - terminology & glossary. Doc. WFMC-TC00-1011 (1996).
From Object Oriented Conceptual Modeling to Automated Programming in Java*

Oscar Pastor¹, Vicente Pelechano¹, Emilio Insfrán¹, Jaime Gómez²

¹ Department of Information Systems and Computation, Valencia University of Technology, Camí de Vera s/n, 46071 Valencia (Spain)
{ opastor | pele | einsfran }@dsic.upv.es
² Department of Languages and Information Systems, Alicante University, C/ San Vicente s/n, 03690 San Vicente del Raspeig, Alicante (Spain)
{[email protected]}

* This work has been supported by the CICYT under MENHIR/ESPILL TIC 97-0593-C05-01 Project and by DGEUI under IRGAS GV97-TI-05-34 Project.
Abstract. The development of Internet commercial applications and corporate Intranets around the world, which often use Java as the programming language, is a significant topic in modern Software Engineering. In this context, more than ever, well-defined methodologies and high-level tools are essential for developing quality software in a way that is as independent as possible of changes in technology. In this article, we present an OO method based on a formal object-oriented model. The main feature of this method is that developers' efforts are focused on the conceptual modeling step, where analysts capture system requirements, and the full implementation can be obtained automatically following an execution model (including structure and behaviour). The final result is a web application with a three-tiered architecture, implemented in Java with a relational DBMS as object repository.
1 Introduction

The boom of "web computing" environments has led to the development of Internet commercial applications and the creation of corporate Intranets around the world. The use of the Java language [1] in these environments has opened significant research related to the proper implementation of final, correct software products. In this context, where new technologies are continuously emerging, software development companies must employ suitable methods, languages, techniques and tools
for dealing with the new market requirements. More than ever, well-defined methodologies and the high-level tools which support them are essential for developing quality software in a way that is as independent as possible of changes in technology. The idea of clearly separating the conceptual model level, centered on what the system is, from the execution model, intended to give an implementation in terms of how the system is to be implemented, provides a solid basis for operational solutions to the problem. If we have conceptual modeling environments rich enough to capture the relevant system properties in the problem space, a correct software representation in the solution space can easily be generated. Related work in this area has been developed based on the idea that we can significantly reduce the complexity of advanced-application specification and implementation by using a model-equivalent language (a language with a one-to-one correspondence to an underlying, executable model [3]). Nowadays, OO methodologies like OMT [2], OOSE [4] or Booch [5] are widely used in industrial software production environments. Industry trends attempt to provide unified notations such as the UML proposal [6], which was developed to standardize the set of notations used by the most well-known existing methods. Although the attempt is commendable, this approach has the implicit danger of providing users with an excessive set of models that have overlapping semantics, without a methodological approach. Following this approach, we have CASE tools such as Rational ROSE/Java [7], FrameWork [8] or Paradigm Plus [9] which include Java code generation from the analysis models. However, if we examine this proposed code generation feature in depth, we find that it is not at all clear how to produce a final software product in Java that is functionally equivalent to the system description collected in the conceptual model. This is a common weak point of these approaches. Far from what is required, what we have after completing the conceptual model is nothing more than a template for the declaration of classes, where no method is implemented and where no related architectural issues are taken into account. In order to provide an operational solution to the above problem, in this paper we present a method that is based on a formal object-oriented model. The main feature of this method is that developers' efforts are focused on the conceptual modeling step, where analysts capture system requirements, and the full implementation is automatically obtained following an execution model (including structure and behaviour). The final result is a web application with a three-tiered architecture, which is implemented in Java with a relational DBMS as object repository. The main contribution of this approach with respect to the topic of model-equivalent programming languages is that we start from a graphical representation of a formal object-oriented specification language for conceptual modeling purposes. After having defined a finite set of behavioural patterns, a mapping from these patterns to software components in a given software development environment (Java in this paper) is defined. Consequently, our methodological approach is not tied to any particular programming language. A CASE tool gives support to this method.
It constitutes an operational approach to the ideas of the automated programming paradigm [10]: the collection of system information properties in a graphical environment (the conceptual modeling step),
followed by the automated generation of a formal OO system specification and of a complete software prototype (including statics and dynamics) that is obtained from the conceptual model and is functionally equivalent to that system specification (the execution model step).
2 The OO-Method: An Object-Oriented Method

Following the OO-Method strategy, the software production process starts with the conceptual modeling step, where we have to collect the relevant system properties. Once we have an appropriate system description, a formal OO specification is automatically obtained. This specification is the source of a well-defined execution model which determines all the implementation-dependent features in terms of user interface, access control, service activation, etc. This execution model provides a well-structured framework that enables the building of an automatic code generation tool. It is important to note that the formal specification is hidden from the OO-Method user: the relevant system information is introduced in a graphical way, which is syntactically compliant with the conventional OO models, but which is semantically designed to fill the class definition templates according to the formal OO basis.
2.1 OASIS: an Object-Oriented Formal Model

The OO-Method was created on the formal basis of OASIS, an OO formal specification language for Information Systems [11]. In fact, we can see the OO-Method as a graphical OASIS editor that provides the conventional approach of using an object, dynamic and functional model, to make designers think that they are using a conventional OO method. The formalism is in this way hidden from them, avoiding the controversy attached to the use of formalisms in software development environments. Previous work on this idea can be found in the tunable formalism in object-oriented system analysis presented in [12]. Our approach provides a different kind of tunable formalism: the OASIS expressiveness is fully preserved, but it is presented to analysts according to a well-known conventional graphical notation (UML compliant). Below, we give a quick overview of the characteristics of OASIS. From an intuitive point of view, an object can be viewed as a cell or capsule with a state and a set of services. The state is hidden from other objects and can be handled only by means of services. The set of services is the object's interface, which allows other objects to access the state. Object evolution is characterized in terms of changes of state. Events represent atomic changes of state and can be grouped into transactions¹. When we build a system specification, we specify classes. Classes represent a collection of objects sharing the same template. The template must allow for the
¹ Molecular units of processing composed of object services that have the properties of non-observability of intermediate states and the all-or-nothing policy during execution.
declaration of an identification mechanism, the signature of the class including attributes and methods, and finally a set of formulae of different kinds to cover the rest of the class properties:
• integrity constraints (static and dynamic) which state conditions that must be satisfied.
• valuations which state how attributes are changed by event occurrences.
• derivations which relate some attribute's values to others.
• preconditions which determine when an event can be activated.
• triggers which introduce internal system activity.
Finally, as an object can be defined as an observable process, a class definition should be enriched with the specification of the process attached to the class. This process will allow us to declare possible object lives as terms whose elements are events and transactions. OASIS deals with complexity by introducing aggregation and inheritance operators. A complete description of the OASIS language can be found in [13].
2.2 Conceptual Modeling
Conceptual modeling in OO-Method collects the relevant Information System properties using three complementary models:
Object Model: a graphical model where system classes, including attributes, services and relationships (aggregation and inheritance), are defined. Additionally, agent relationships are introduced to specify who can activate each class service (client/server relationship).
Dynamic Model: another graphical model to specify valid object life cycles and interobject interaction. We use two kinds of diagrams:
• State Transition Diagrams to describe correct behaviour by establishing valid object life cycles for every class. By valid life, we mean a correct sequence of states that characterizes the correct behaviour of the objects.
• Object Interaction Diagram: represents interobject interactions. In this diagram we define two basic interactions: triggers, which are object services that are activated in an automated way when a condition is satisfied, and global interactions, which are transactions involving services of different objects.
Functional Model: used to capture the semantics attached to any change of an object state as a consequence of an event occurrence. We specify declaratively how every event changes the object state depending on the involved event arguments (if any) and the object's current state. We provide a clear and simple strategy for introducing the necessary information; this is one contribution of the method, and it allows us to generate a complete OASIS specification in an automated way. More detailed information can be found in [14].
From these three models, a corresponding formal OO OASIS specification is obtained using a well-defined translation strategy. The resultant OASIS specification
acts as a complete high-level system repository, where the relevant system information, coming from the conceptual modeling step, is captured.
2.3 Execution Model
Once all the relevant system information has been specified, we use an execution model to accurately state the implementation-dependent features associated with the selected object society machine representation. More precisely, we have to explain the pattern to be used to implement all the system properties in a logical three-tiered architecture for any target software development environment:
• interface tier: classes that implement the interaction with end users, presenting a visual representation of the application and giving users a way to access and control the object's data and services.
• application tier: classes that fully implement the behaviour of the business classes specified in the conceptual modeling step, enforcing the semantics of our underlying object model.
• persistence tier: classes that provide services allowing the business objects to interact with their specified permanent object repository.
In order to easily implement and animate the specified system, we predefine the way in which users interact with system objects. We introduce a new kind of interaction, close to what we could label an OO virtual reality, in the sense that an active object immerses itself in the object society as a member and interacts with the other society objects. To achieve this behaviour the system has to:
1. identify the user (access control): logging the user into the system and providing an object system view that determines the set of object attributes and services the user can see or activate.
2. allow service activation: after the user is connected and has a clear object system view, the user can activate any available service in his or her worldview. Among these services, we will have system observations (object queries) or events or transactions served by other objects.
The process of access control and the building of the system view (the classes, services and attributes visible to the user) are implemented in the interface tier. The information needed to properly configure the system view is included in the system specification obtained in the conceptual modeling step.
Any service activation has two steps: build the message and execute it (if possible). In order to build the message the user has to provide information to:
1. identify the object server: the existence of the server object is an implicit condition for executing any service, unless we are dealing with a new event². At this point, the persistence tier retrieves the object server from the database.
² Formally, a new event is a service of a metaobject that represents the class. The metaobject acts as an object factory for creating individual class instances. This metaobject (one for each class) has the class population attribute as its main property, the next oid, and the aforementioned new event.
2. introduce event arguments: the interface tier asks for the arguments of the event being activated (if necessary).
Once the message is sent, the service execution is characterized by the occurrence of the following sequence of actions in the server object (the application tier):
1. check state transition: verification in the object State Transition Diagram (STD) that a valid transition exists for the selected service in the current object state.
2. precondition satisfaction: the precondition associated with the service must hold. If 1 and 2 do not hold, an exception is raised and the message is ignored.
3. valuation fulfillment: the induced event modifications (specified in the Functional Model) take place in the involved object state.
4. integrity constraint checking in the new state: to assure that the service execution leads the object to a valid state, the integrity constraints (static and dynamic) are verified in the final state. If a constraint does not hold, an exception is raised and the previous change of state is ignored.
5. trigger relationships test: after a valid change of state, the set of condition-action rules that represents the internal system activity is verified. If any of them hold, the specified service is triggered.
The previous steps guide the implementation of any program to assure the functional equivalence between the object system specification collected in the conceptual model and its reification in a programming environment. Some interesting related work can be found in [15], where a tool (IPOST) that automatically generates a prototype from an object-oriented analysis model is introduced. This permits users to refine the model to generate a requirements specification. The contribution of OO-Method comes from the fact that the resultant prototype is not only a requirements specification: it is much closer to the final software product, because what is generated uses the solution space notation (Java, as we are going to show).
3 An Architecture for Implementing the Execution Model in Java

The abstract execution model shown above is based on a generic three-tiered architecture. Below, we will introduce a concrete implementation using web technology and Java as the programming language. This will provide a methodological framework to deal with Java implementations which are functionally equivalent to the source conceptual model.
3.1 Translating a Conceptual Model into Java Classes using the Execution Model

Starting from the proposed Execution Model, we want to design the architecture of classes needed to implement the three logic levels: interface, application and persistence. In the following, we specify the most relevant features of the Java classes needed to support the intended architecture:
At the interface level we have complementary classes, which are not explicitly used in the conceptual model but which help to implement the interaction between the user and the application, following the underlying object model semantics.
• Access_control class. This class extends a panel with the typical widgets that allow users to be identified as members of the object society (by providing the object identifier, the password and the class to which the user belongs). One access control object is created every time a user wants to connect to the system as an active object sending and receiving messages. This class implements the first step of our execution model.

  import java.awt.*;
  import java.lang.*;
  import excepciones.*;  // exceptions defined for the application

  public class Access_control extends Panel { ... }

• System_view class: once an active object (user) is connected to the system, a system_view object (an instance of the System_view class) will show a page with as many items as classes the user is allowed to interact with. These items are clickable regions that the user can activate in order to see the services (also displayed as clickable items) that he/she can use. The declaration of the class is the following:

  import java.applet.*;
  import java.awt.*;
  import excepciones.*;  // exceptions defined for the application

  public class System_view extends Applet implements Runnable { ... }

• Service_activation class: this class defines a typical web interface for data entry, where the relevant service arguments are requested. The service_activation object is a generic one for all the services of all the classes. Depending on the service activated, it shows the corresponding edit boxes for the identification of the object and the parameters of the service. Once they are filled in, the user can send the message to the destination (Ok) or ignore the data request action (Cancel).

  import java.awt.*;

  public class Service_activation extends Panel { ... }

At the application level we have the classes that implement the behaviour of the business classes specified in the conceptual model. In order to ensure that the implementation of the application classes follows our underlying object model semantics and has persistence facilities, we define our business classes as an implementation of an Oasis interface and an extension of an Object_mediator class [16]. The Oasis interface specifies the necessary services to support the execution model structure, as shown in the following paragraph:
  import java.awt.*;

  interface Oasis {
    void check_preconditions(String event);
    void check_state_transition(String event);
    void check_integrity_constraints();
    void check_triggers();
    ...
  }

At the persistence level a Java class called Object_mediator must be created. It implements the methods for saving, deleting and retrieving system domain objects that are stored in a persistent secondary memory (the object repository). The Object_mediator class has the following general structure:

  class Object_mediator {
    void delete();
    void save();
    void retrieve();
  }

JDBC classes are used for proper interaction with the involved RDBMS servers in the implementation of these methods. Even though we focus on Java in this paper, this design could be translated to any other OO programming language by properly distributing the application components depending on the target environment characteristics and necessities.
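Since the generated persistence code itself is not shown at this level of detail, the following is only a minimal sketch of how retrieve() and save() could be implemented on top of JDBC; the table OBJECTS, its columns OID and STATE, and the JDBC URL are assumptions made purely for illustration, not the tool's actual output.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  // Hypothetical mediator sketch: persists a single "state" attribute per object.
  class Object_mediator_sketch {
    // Hypothetical JDBC data source; a generated application would configure this.
    private static final String URL = "jdbc:odbc:rentacar";

    protected String oid;    // object identifier
    protected String state;  // persistent attribute

    // Retrieves the object state from the relational repository.
    void retrieve() throws SQLException {
      Connection con = DriverManager.getConnection(URL);
      try {
        PreparedStatement ps =
            con.prepareStatement("SELECT STATE FROM OBJECTS WHERE OID = ?");
        ps.setString(1, oid);
        ResultSet rs = ps.executeQuery();
        if (rs.next()) {
          state = rs.getString(1);
        }
      } finally {
        con.close();
      }
    }

    // Saves the (possibly modified) object state back to the repository.
    void save() throws SQLException {
      Connection con = DriverManager.getConnection(URL);
      try {
        PreparedStatement ps =
            con.prepareStatement("UPDATE OBJECTS SET STATE = ? WHERE OID = ?");
        ps.setString(1, state);
        ps.setString(2, oid);
        ps.executeUpdate();
      } finally {
        con.close();
      }
    }
  }

A business class that extends such a mediator inherits retrieve() and save() and can therefore bracket each service execution with a database read and write, as the execution model requires.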
3.2 Distributing Java Classes in a Web Architecture

Once the previous components have been created, they must be properly distributed in a web architecture. Many proposals to distribute Inter/Intranet application components exist. We present a three-tiered web architecture (see Fig. 1) which fits very well with the OO-Method Execution Model features presented above. A client (a web browser) downloads the relevant HTML pages from a web server, together with the applets that make up the application interface. The user interacts with the Java system objects through the web client. These Java system objects are stored in a web application server and query or update the object state stored in a Data Server through the services provided by the JDBC objects. It is important to remark again that the components distributed in this architecture are generated in an automated way using the OO-Method CASE tool. This is done by defining a precise mapping between the finite set of behavioural patterns that constitutes the Conceptual Model in OO-Method and their corresponding software representation in the target software development environment, Java in this work. Preconditions, state transitions, integrity constraints and triggers are declared in the graphical OO-Method models and implemented in each of the Java business classes as presented above. As they are ultimately well-formed formulas in OASIS, these formulas are translated into Java syntax. This is how the problem space
concepts are converted into their corresponding software representation. Let us illustrate these concepts more clearly through an example.
Fig. 1. Three-tiered web architecture
4 Automated Programming in Java. A Case Study

A CASE Tool supporting the OO-Method allows us to model and automatically generate fully functional prototypes in Java. In order to better understand the architecture and behaviour of the previous component classes generated using the Java Execution Model, we introduce a Rent-a-Car case study as a brief example: "A company rents vehicles without drivers. These vehicles are bought at the beginning of the season and usually sold when the season is over. When a customer rents a vehicle, a contract is generated and it remains open until the customer returns the vehicle. At that time, the total amount to be paid is calculated. After this step, the vehicle is ready to be rented again". First, we construct the Conceptual Model (object, dynamic and functional models) by identifying the classes and specifying their relationships and all the static and dynamic properties. Due to space limitations, we cannot present all three OO-Method models (object, dynamic and functional) of this simple example, but let us assume that the classes identified in this problem domain are the following: Contract,
Customer, Vehicle and Company. Every class will have a set of attributes, services, preconditions, integrity constraints, valid transitions, valuations and triggers. Based on the Execution Model proposed above, we obtain the architecture of the web application in an automated way. This architecture includes the Java classes and relational tables attached to the conceptual model, as shown in Table 1.

Table 1. Rent-a-Car System implementation architecture.

  Persistence Tier    Application Tier    Interface Tier
  Customer table      Customer class      Access_control class
  Vehicle table       Vehicle class       System_view class
  Contract table      Contract class      Service_activation class
  ...                 ...                 ...
The Java code that implements a business class in the application tier following the execution model strategy can be seen in the code example for the Vehicle class in the following paragraph (comments have been introduced with the aim of making it self-explanatory):

  package application_tier;

  import excepciones.*;
  import object_mediator.*;
  import oasis.*;

  public class Vehicle extends Object_mediator implements Oasis {

    // Attributes specified in the conceptual model class Vehicle
    private String state;
    ...

    // Events specified in the conceptual model
    // The following method implements the change of the object's state
    protected boolean rent() throws EX_Check_Error {
      state = "rented";
      ...
    }
    ...

    // The following method implements the precondition checking
    // The following method implements the execution of the rent event
    public void eval_rent() {
      try {
        retrieve();                      // retrieves the object from the Database
        check_preconditions("rent");
        check_state_transition("rent");
        rent();
        check_integrity_constraints();
        check_triggers();
        save();                          // saves the object in the Database
      } catch (EX_Check_Error e) { ... }
    }
    ...
  }
Next, we are going to describe an illustrative scenario for the generated Rent-a-Car prototype. This scenario will show the interaction between Java objects and their behaviour when a user enters the object system. When a client loads the main HTML page that calls the Java applet, an instance of the Access_control class is created and user identification is required (see Fig. 2). After a user connects to the Rent-a-Car system, a menu page with an option for every class will appear (see Fig. 2). If the user clicks on a class option, a new menu page associated with the selected class will be generated (including one option for every class event or transaction).
Fig. 2. A User Access Control Page and the Rent-a-Car System View page.
In our example, when the user selects the Vehicle class, the service list offered by this class is shown, as can be seen in Fig. 3.
Every service option activation will generate a new parameter request page, as can be seen in Fig. 3. This page will ask the user for the arguments needed to execute the service. The Ok control button has code associated with it that calls a class method implementing the effect of the service on the object state (a sketch of this dispatching step is given after Fig. 3). This method will check the state transition correctness and the method preconditions. If this checking process succeeds, the object change of state is carried out according to the functional model specification. We finish the method execution by verifying the integrity constraints and the trigger condition satisfaction in the new state. Object state updates in the selected persistent object system become valid through the services inherited from the Object_mediator class.
Fig. 3. Vehicle Class Services Menu Page and Parameter Request Page.
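To make the dispatching step described above more concrete, the following is a minimal sketch, not the code generated by the OO-Method tool, of how the Ok button of a parameter request page could invoke the business class. The field names, the stand-in VehicleStub class, and the use of the Java 1.1 AWT event model are assumptions made for illustration.

  import java.awt.Button;
  import java.awt.Panel;
  import java.awt.TextField;
  import java.awt.event.ActionEvent;
  import java.awt.event.ActionListener;

  // Minimal stand-in for the generated business class shown above.
  class VehicleStub {
    private final String oid;
    VehicleStub(String oid) { this.oid = oid; }
    // retrieve, check state transition/preconditions, change state,
    // check constraints/triggers, save (see eval_rent above)
    public void eval_rent() { }
  }

  // Hypothetical parameter request panel for the "rent" service of class Vehicle.
  class Parameter_request_sketch extends Panel implements ActionListener {
    private final TextField oidField = new TextField(10); // object identifier argument
    private final Button ok = new Button("Ok");

    Parameter_request_sketch() {
      add(oidField);
      add(ok);
      ok.addActionListener(this);
    }

    // When Ok is pressed, the message is built and sent to the destination object.
    public void actionPerformed(ActionEvent e) {
      if (e.getSource() == ok) {
        new VehicleStub(oidField.getText()).eval_rent();
      }
    }
  }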
5 Conclusions and Further Work

Advanced web applications will depend on object technology to make them feasible, reliable and secure. To achieve this goal, well-defined methodological frameworks, which properly connect OO conceptual modeling and OO software development environments, must be introduced. The OO-Method provides such an environment. Its most relevant features are the following:
• an operational implementation of an automated programming paradigm, where a concrete execution model is obtained from a process of conceptual model translation
• a precise object-oriented model, where the use of a formal specification language as a high-level data dictionary is a basic characteristic
All of this is done within next-generation web development environments, making use of Inter/Intranet architectures and using Java as the software development language. Research work is still being undertaken to improve the quality of the generated final software product, including advanced features such as user-defined interfaces, schema evolution, optimized database access mechanisms, and the impact of the structure of the generated code on the performance of the system.
Acknowledgments. We wish to thank the anonymous referees for their valuable comments and suggestions.
References

1. Arnold, K., Gosling, J.: The Java Programming Language. Sun Microsystems, Addison-Wesley (1996).
2. Rumbaugh, J. et al.: Object Oriented Modeling and Design. Prentice-Hall, Englewood Cliffs, NJ (1991).
3. Liddle, S.W., Embley, D.W., Woodfield, S.N.: Unifying Modeling and Programming Through an Active, Object-Oriented, Model-Equivalent Programming Language. In Proceedings of the 14th International Conference on Object-Oriented and Entity-Relationship Modeling (OO-ER'95), 13-15 Dec 1995, Gold Coast, Australia. Lecture Notes in Computer Science, Vol. 1021, pp. 55-64 (1995).
4. Jacobson, I. et al.: OO Software Engineering, a Use Case Driven Approach. Addison-Wesley, Reading, Massachusetts (1992).
5. Booch, G.: OO Analysis and Design with Applications. Addison-Wesley (1994).
6. Booch, G., Rumbaugh, J., Jacobson, I.: UML v1. Rational Software Co. (1997).
7. Rational Software Corporation: Rational Rose User's Manual (1997).
8. Ptech FrameWork. Ptech Inc., Boston, MA, USA. Web site: http://www.ptechinc.com/ (1998).
9. Platinum Technology, Inc.: Paradigm Plus: Round-Trip Engineering for JAVA, White Paper. Platinum web site: http://www.platinum.com/ (1997).
10. Balzer, R. et al.: Software Technology in the 1990s: Using a New Paradigm. IEEE Computer, Nov. 1983.
11. Pastor, O., Hayes, F., Bear, S.: OASIS: An Object-Oriented Specification Language. In P. Loucopoulos (ed.), Proceedings of the CAiSE'92 Conference, pp. 348-363, Springer, Berlin, LNCS 593 (1992).
12. Clyde, S.W., Embley, D.W., Woodfield, S.N.: Tunable Formalism in Object-Oriented System Analysis: Meeting the Needs of Both Theoreticians and Practitioners. In Proceedings of the OOPSLA'92 Conference, Vancouver, Canada, pp. 452-465 (1992).
13. Pastor, O., Ramos, I.: OASIS 2.1.1: A Class-Definition Language to Model Information Systems Using an Object-Oriented Approach, October 1995 (3rd edition).
14. Pastor, O. et al.: OO-Method: An OO Software Production Environment Combining Conventional and Formal Methods. In A. Olivé and J.A. Pastor (eds.), Proceedings of the CAiSE'97 Conference, pp. 145-158, Springer-Verlag, Berlin, LNCS 1250, June 1997.
15. Jackson, R.B., Embley, D.W., Woodfield, S.N.: Automated Support for the Development of Formal Object-Oriented Requirements Specification. In Proceedings of the CAiSE'94 Conference, Utrecht, The Netherlands. Lecture Notes in Computer Science, Vol. 811, pp. 135-148 (1994).
16. Argawal, S., Jensen, R., Keller, A.M.: Architecting Object Applications for High Performance with Relational Databases. In OOPSLA Workshop on Object Database Behaviour, Benchmarks, and Performance, Austin (1995).
An Evaluation of Two Approaches to Exploiting Real-World Knowledge by Intelligent Database Design Tools

Shahrul Azman Noah¹ and Michael Lloyd-Williams²

¹ Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK
[email protected]
² School of Information Systems and Computing, University of Wales Institute Cardiff, Colchester Avenue, Cardiff CF3 7XR, UK
[email protected]
Abstract. Recent years have seen the development of a number of expert system type tools whose primary objective is to provide support to a human during the process of database analysis and design. However, whereas human designers are able to draw upon their experience and knowledge of the real world when performing such a task, knowledge-based database design tools are generally unable to do so. This has resulted in numerous calls for the development of tools that are capable of exploiting real-world knowledge during a design session. It has been claimed that the use of such knowledge has the potential to increase the appearance of intelligence of the tools, to improve the quality of the designs produced, and to increase processing efficiency. However, to date, little if any formal evaluation of these claims has taken place. This paper presents such an evaluation of two of the approaches proposed to facilitate system storage and exploitation of real-world knowledge: the thesaurus approach and the knowledge reconciliation approach. The results obtained demonstrate that certain aspects of the claimed benefits associated with the use of real-world knowledge have been achieved. However, the extent to which these benefits have been attained and subsequently statistically validated varies.
1 Introduction

Recent years have seen the development of a number of intelligent database design tools that employ expert system technology in order to provide support to a human designer during the process of database analysis and design [3, 12, 24]. Such tools are generally intended to act as assistants to human designers [26], being capable of providing guidance and advice, proposing alternate solutions, and helping to investigate the consequences of design decisions [11]. The effectiveness of existing tools has demonstrated the viability of representing database design expertise in a computer program; however, observing such systems in
use makes it clear that human designers contribute far more than database design expertise to the design process [25]. Human designers, even when working in an unfamiliar domain, are able to make use of their knowledge of the real world in order to interact with users, make helpful suggestions and inferences, and identify potential errors and inconsistencies [20, 24]. Conversely, the majority of existing intelligent database design tools do not possess such real-world knowledge, and are therefore required to ask many questions during a design session that may be viewed as being trivial [10, 20]. A human designer, for instance, would recognize terms such as "client", "customer" and "patron" as being potentially synonymous, regardless of the application domain. Existing intelligent database design tools are unable to identify such situations. This situation has resulted in numerous calls for the representation of real-world knowledge within such tools, coupled with the ability to reason with and make use of this knowledge. A number of approaches to representing and exploiting such real-world knowledge have been proposed, including the thesaurus approach [10, 11] and the knowledge reconciliation approach [21, 25]. These approaches have been accompanied by various claims [10, 20] that the use of such knowledge has the potential to increase the appearance of intelligence of the tools, to improve the quality of the designs produced, and to increase processing efficiency. However, to date, little if any formal evaluation of these claims has taken place. This paper presents an evaluation of the thesaurus and knowledge reconciliation approaches, as originally employed by the Object Design Assistant [9, 10] and the View Creation System [22, 23] respectively, the intention being to initiate the gathering of evidence to support or refute the claims previously stated.
2 Method of Investigation

In order to conduct evaluative experiments on the use of the thesaurus and knowledge reconciliation approaches, a prototype intelligent database design tool, the Intelligent Object Analyzer (IOA), was developed. IOA provides support for the design of the structural (data) aspects of object-oriented databases. The intended user is a database designer or systems analyst who is familiar with systems modeling concepts and the domain to be modeled. Knowledge of object-oriented databases or of object-oriented analysis and design techniques is not a requirement. It is not the purpose of this paper to discuss IOA in depth; however, a brief outline of its structure and method of operation is required in order to illustrate how the real-world knowledge may be represented and exploited during design processing. The current version of the IOA system runs in a PC environment, and was developed using Common LISP (Allegro CL\PC). The IOA knowledge-base contains a mixture of rules and facts. Rules correspond to knowledge of how to perform the design task (the order in which design activities take place), detecting and resolving ambiguities, redundancies and inconsistencies within an evolving design, and handling the gradual augmentation of an evolving design as a design session
progresses. Facts are used to represent two views of the application domain: an initial representation (the problem domain model) as provided by the user, and the object-oriented design generated from this initial representation. During a design session, IOA follows a two-step procedure.
• The first step involves creating an initial representation of the application domain (known as the problem domain model) and the subsequent refinement of this model.
• The second step involves the refinement of the problem domain model by detecting and resolving any inconsistencies that may exist, and the transformation of the model into object-oriented form.
The first stage of processing requires a set of declarative statements that describe the application domain to be submitted to IOA. These statements are a variation of the method of interactive schema specification described by Baldiserra et al [1], being based upon the binary model described by Bracchi et al [5]. Each statement links together two concepts (taking the form A verb-phrase B), and falls into one of three classes of construct, corresponding directly to the structural abstractions of association, generalization, and aggregation. The statements are used to construct a problem domain model representing the application domain. Once constructed, IOA attempts to confirm its understanding of the semantic aspects of the problem domain model; that is, whether each structure within the model represents generalization, aggregation or association. The problem domain model is then submitted to a series of refinement procedures in order to detect and resolve any inconsistencies (such as redundancies that may be present within generalization hierarchies) that may exist. These procedures are performed both with and without the requirement of user input (sometimes referred to as external and internal validation respectively). Once such inconsistencies have been resolved, IOA makes use of the problem domain model in order to generate a conceptual model (in object-oriented form). As previously discussed, the IOA system has been developed in order to assist with a series of experiments designed to evaluate the contribution of real-world knowledge to the activities of intelligent database design tools. In order to facilitate this aim, IOA is capable of conducting design sessions both with and without making use of real-world knowledge.
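The binary statements themselves are simple. Purely as an illustration (IOA itself is implemented in Common LISP, and the class and field names below are assumptions made for the example, not IOA's internal representation), a statement of the form A verb-phrase B together with its construct category could be represented as follows:

  // Illustrative sketch only; not IOA's actual LISP-based representation.
  final class BinaryStatement {
    static final int ASSOCIATION = 0;
    static final int GENERALIZATION = 1;
    static final int AGGREGATION = 2;

    final String conceptA;    // e.g. "Department"
    final String verbPhrase;  // e.g. "offers"
    final String conceptB;    // e.g. "Course"
    final int construct;      // one of the three structural abstractions above

    BinaryStatement(String a, String verbPhrase, String b, int construct) {
      this.conceptA = a;
      this.verbPhrase = verbPhrase;
      this.conceptB = b;
      this.construct = construct;
    }
  }

  class ProblemDomainModelSketch {
    public static void main(String[] args) {
      // Example statements of the form "A verb-phrase B" for a university domain.
      BinaryStatement[] statements = {
        new BinaryStatement("Department", "offers", "Course",
                            BinaryStatement.ASSOCIATION),
        new BinaryStatement("Student", "includes", "Undergraduate-student",
                            BinaryStatement.GENERALIZATION),
        new BinaryStatement("Faculty", "consists-of", "Department",
                            BinaryStatement.AGGREGATION)
      };
      System.out.println(statements.length + " statements in the problem domain model");
    }
  }

A collection of such statements forms the problem domain model, which the tool then refines and transforms into an object-oriented conceptual model.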
2.1 Representation of Real-World Knowledge
The following text provides a brief overview of the methods of knowledge representation employed by the thesaurus and knowledge reconciliation approaches. Those interested in further details of each approach, along with the claimed benefits associated with their use, are referred to the relevant source literature (see for instance [10, 11, 21, 25]). Figs. 1 and 2 present illustrative fragments of real-world knowledge (a university domain) using the thesaurus and knowledge reconciliation approaches respectively. The knowledge presented in Figs. 1 and 2 is not claimed to be statistically
representative, but is seen as a reasonable representation of certain aspects of a university domain. The main purpose is to provide an illustration of the thesaurus and knowledge reconciliation approaches, not to produce the definitive structures for the domain concerned.
Fig. 1. Fragment of knowledge (university) represented using the thesaurus approach. [Figure: a graph of concepts (Faculty, Department, Lecturer/Academic-staff, Course/Programme, Student, Postgraduate-student/Graduate-student, Undergraduate-student) connected by aggregation (aggr), association (assoc) and generalization (gen) links annotated with cardinalities (1, N) and mandatory/optional membership constraints.]
Fig. 2. Fragment of knowledge (university) represented using the knowledge reconciliation approach. [Figure: concepts Faculty, Department, Lecturer, Course, Student, Postgraduate-student and Undergraduate-student connected by an aggregation link, a generalization (gen) link, and explicitly named association links such as Attached-to, Allocated-to and Enrolled.]
It can be seen that although the thesaurus and knowledge reconciliation approaches share similar semantic constructs (in representing domains using a series of concepts linked together via abstraction mechanisms), a number of differences are apparent. The domain concepts represented by the thesaurus approach may be referred to by any number of associated synonyms where appropriate [11]. This is not the case with the knowledge reconciliation approach. Both approaches employ abstraction mechanisms (generalization, aggregation and association) to link concepts together; however, the knowledge reconciliation approach requires association links to be explicitly named. All abstraction mechanisms represented by the thesaurus
approach are categorized rather than being named, thus allowing the links between any pair of concepts to take any name provided by the user. Integrity constraints are not represented by the knowledge reconciliation approach, nor are membership requirements (mandatory or optional) for links between pairs of concepts. Both forms of constraint, however, are represented by the thesaurus approach. The IOA system is capable of processing in three modes: without the use of real-world knowledge (basic mode), using real-world knowledge provided by the thesaurus approach, and using real-world knowledge provided by the knowledge reconciliation approach. The basic mode of processing has been described earlier in this paper. At various stages during basic mode processing, IOA conducts a dialogue with the user in order to confirm its understanding of the application domain or to obtain additional information. When making use of real-world knowledge provided by the thesaurus approach, the tool refers to this knowledge wherever possible, only resorting to questioning the user if the real-world knowledge cannot provide the required information. This procedure is also followed when making use of real-world knowledge provided by the knowledge reconciliation approach. However, in addition to this procedure, IOA also attempts to conduct a process of reconciliation of knowledge, where the description of the application domain submitted by the user is compared and matched with the system-held real-world knowledge in order to assist IOA in identifying potential missing elements (concepts and relationships) within the user's domain description.
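As a rough illustration of the structural differences just described, and only as a sketch (the class and field names here are assumptions, not the tools' actual internal structures), the two kinds of knowledge fragment could be contrasted as follows:

  // Thesaurus approach: concepts may carry synonyms; links are categorized
  // (gen/aggr/assoc) and may carry cardinality and membership constraints.
  class ThesaurusLink {
    String fromConcept;        // e.g. "Lecturer"
    String[] fromSynonyms;     // e.g. { "Academic-staff" }
    String toConcept;          // e.g. "Department"
    String category;           // "gen", "aggr" or "assoc" (categorized, not user-named)
    String cardinality;        // e.g. "N:1" (assumed value for illustration)
    boolean mandatory;         // membership requirement
  }

  // Knowledge reconciliation approach: no synonyms, constraints or membership
  // requirements, but association links are explicitly named.
  class ReconciliationLink {
    String fromConcept;        // e.g. "Lecturer"
    String toConcept;          // e.g. "Department"
    String linkName;           // e.g. "Attached-to"
  }

  class KnowledgeFragmentSketch {
    public static void main(String[] args) {
      ThesaurusLink t = new ThesaurusLink();
      t.fromConcept = "Lecturer";
      t.fromSynonyms = new String[] { "Academic-staff" };
      t.toConcept = "Department";
      t.category = "assoc";
      t.cardinality = "N:1";
      t.mandatory = true;

      ReconciliationLink r = new ReconciliationLink();
      r.fromConcept = "Lecturer";
      r.toConcept = "Department";
      r.linkName = "Attached-to";

      System.out.println(t.fromConcept + " --" + t.category + "--> " + t.toConcept);
      System.out.println(r.fromConcept + " --" + r.linkName + "--> " + r.toConcept);
    }
  }

The extra fields on the thesaurus side (synonyms, cardinality, membership) are precisely the additional information discussed later when interpreting the performance results.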
2.2 Testing and Evaluation Strategy
In order to evaluate the use of real-world knowledge by IOA, a number of domains were modeled using both the thesaurus and knowledge reconciliation approaches. In each case, the real-world knowledge structures were developed independently of the example scenarios used during testing. This was a deliberate attempt to minimize any bias that might be introduced by taking the content of the test material into account. In normal circumstances, it would be a logical procedure to use each example application domain encountered to augment the real-world knowledge held, thus increasing the knowledge of the system in the way a human designer would automatically update his/her knowledge when working within a new domain. However, for the purposes of testing the effectiveness of the use of real-world knowledge, it was decided to develop the knowledge structures completely independently of the examples encountered. For each example domain, the evaluation process involved the execution of a set of benchmarks (test-cases) in basic mode, and then exploiting the knowledge provided by the thesaurus and knowledge reconciliation approaches. Thus, for each example domain, three sets of results were obtained and compared, following an approach recommended by O'Keefe & Preece [16]. The test-cases used were generated from a set of design problems which were primarily extracted from the available literature, the advantage being that the accompanying solution could be used as a benchmark and compared with the IOA-suggested solution in order to confirm the appropriateness or
otherwise of the designs produced. Each of the example design problems was systematically altered by dividing it into multiple test-cases with varying degrees of complexity. Within the scope of the testing, the complexity of a design test-case is defined as the number of concepts and relationships between the concepts [8, 17]. For instance, the university design problem found in Rob & Rob [19] was systematically fragmented to generate a total of five test-cases, having complexity degrees of 3, 7, 10, 13 and 17 respectively, as illustrated by Fig. 3.
Fig. 3. Generation of test-cases with varying degrees of complexity. [Figure: an example design problem with concepts School, Department, Dean, Professor, Course, Section and Student linked by relationships such as Operates, Employs, Chairs, Runs, Teaches, Offers, Contains, Has and Advises, fragmented into Test-cases 1 to 5 of increasing scope.]
The number and the quality of the test-cases employed have a direct and significant impact on the reliability of the results produced. Exhaustive testing, although
generally desirable, is impractical, since a large number of test-cases must be executed and evaluated even for the simplest of design problems [7]. The results presented in this paper emanate from a series of tests performed on university domain problems found in the general literature [2, 4, 6, 19]. A total of 24 test-cases were generated from these initial problems, thereby providing the observed results with statistical validity, the required number of 15 observations [18] for this form of experiment being exceeded. The main criteria of interest used during the evaluation are as follows:
• Processing time. Processing time refers to the CPU time required to perform a single design action (such as resolving an inconsistency). Processing time is not influenced by human factors, as it is measured from the point at which the tool commences an action until that action is complete. Processing time is, however, influenced by the complexity of the design input, the complexity of the system-held domain knowledge and the reasoning associated with it, and the specification of the processor of the PC in use.
• User/tool interaction. User/tool interaction refers to the number of interactions required between the tool and the user in order for the tool to confirm its understanding of some aspect of the application domain or to acquire additional information should it be required.
• Suggestion of missing design elements. This criterion measures whether the elements (within the generated design) are based entirely upon user-provided information, or are included as a direct result of the system consulting its real-world knowledge.
• Completeness of the resulting design. Completeness is defined as the ability of a data model to meet all the user information requirements [13]. Within the scope of the testing performed, completeness is measured in terms of the number of missing classes and relationships associated with the design example used.
In order to prevent bias during testing, the processing time and the number of user/tool interactions measured did not include processing arising as a direct result of suggestions made by the tool (for instance, relating to potential missing elements within the evolving design) as a result of consulting its encapsulated real-world knowledge. The assumption underpinning this decision is that the increased processing time and number of user/tool interactions involved in such processing are beneficial to the design process, and should not be viewed as being detrimental to performance efficiency. The research hypothesis of this investigation is that the use of real-world knowledge by an intelligent database design tool has the capability of increasing the efficiency of the tool (by reducing the processing time and the number of user/tool interactions required); increasing the completeness of the resulting design output (by minimizing the number of missing elements); and increasing the appearance of tool intelligence (by providing suggestions for missing information and minimizing the number of interactions required).
3 Analysis of Results

As previously discussed, a total of 24 test-cases were generated and executed within each of the three available processing modes. Results obtained from processing the test-cases using the real-world knowledge (the thesaurus and knowledge reconciliation approaches) were compared with the results obtained when no such knowledge was in use (the basic approach). Table 1 provides a preliminary overview of the results.

Table 1. Preliminary overview of results

  Criteria                                     Basic   Thesaurus   Knowledge Reconciliation
  Mean CPU time per complexity (sec)           3.86    3.22        6.94
  Mean user/tool interactions per complexity   3       2           7
  Mean suggested elements per test             0       0           4
  Mean missing elements per test               9       9           5
Table 1 illustrates that the thesaurus approach required a lower mean CPU time and number of user/tool interactions per complexity compared with the basic approach. In contrast, the knowledge reconciliation approach required a higher mean CPU time and number of user/tool interactions per complexity. These results are supported by the linear regression results for the CPU time (in seconds) and the number of interactions required by the three approaches, as illustrated in Fig. 4 and Fig. 5. These figures illustrate that, compared to the basic approach, the thesaurus approach resulted in a reduction of approximately 6.7% in the CPU time required, and of approximately 14.3% in the number of user/tool interactions required, for each increase in complexity. The knowledge reconciliation approach, however, resulted in an increase in the CPU time required and in the user/tool interactions required of approximately 54.1% and 21.4% respectively.
Fig. 4. CPU time required for each increase in complexity. [Figure: linear regression lines of CPU time (sec.) against complexity for the basic, thesaurus and knowledge reconciliation approaches.]
Fig. 5. User/tool interaction required for each increase in complexity. [Figure: linear regression lines of the number of user/tool interactions against complexity for the basic, thesaurus and knowledge reconciliation approaches.]
Table 1 also illustrates the potential of the knowledge reconciliation approach to provide suggestions for the inclusion of (required) elements within the resulting designs. Such elements would not be identified or included when using the other approaches. Thus the knowledge reconciliation approach has the capacity to facilitate a greater level of completeness of the designs produced. These preliminary findings provide a general overview of the results obtained. In order to validate the effectiveness of each of the approaches, a statistical hypothesis test was conducted in order to assess the significance of the differences between the results observed from the execution of test-cases both with and without the use of real-world knowledge, at the 5% significance level. Although there are several recommended statistical methods available to test such hypotheses, the paired t-test method is highly appropriate in circumstances such as those prevailing in this study [14, 15]. A discussion of the statistical analysis of the observed results follows.
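For reference, the statistic underlying the paired t-test is the standard one (this formulation is not taken from the paper itself): with d_i denoting the difference between the paired observations for test-case i and n the number of test-cases,

  t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
  \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
  s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2},

with n - 1 degrees of freedom; here n = 24 test-cases, which matches the df = 23 reported in Tables 2 and 3.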
3.1 The Thesaurus Approach
Based upon the paired t-test results presented in Table 2, it is apparent that there are significant differences between the thesaurus approach and the basic approach in terms of the number of user/tool interactions and the CPU time required per complexity. The observed significance level and the negative t-Value¹ suggest that the null hypothesis should be rejected for both criteria. It may, therefore, be stated that
¹ As the objective of the statistical analysis was to validate whether the approaches taken to representing real-world knowledge significantly reduced or increased any of the evaluation criteria, referring to the P value alone will not provide a sufficient result, as it only informs as to whether there is any significant difference between the observed results. In this case, however, the t-Value can be used [18], where a negative t-Value implies that the observed criterion is significantly reduced by the use of real-world knowledge and a positive t-Value implies otherwise.
the thesaurus approach increased the overall processing efficiency by reducing the number of user/tool interactions and the CPU time required per complexity.

Table 2. Paired t-test results - the thesaurus and basic approaches

  Criteria                      t-Value   df   Sig. T (P)
  Interaction per complexity    -3.30     23   0.003
  CPU time per complexity       -3.39     23   0.002
  Suggested elements per test   N/A       23   N/A
  Completeness per test         N/A       23   N/A
However, Table 2 also indicates that the completeness and suggested-elements-per-test criteria were not significantly different between the two approaches (the statistical test was invalid, as neither the thesaurus nor the basic approach provides suggestions for missing information; therefore, both approaches result in similar numbers of missing elements within the resulting designs). Accordingly, the use of the thesaurus approach has not resulted in an improvement in the quality of the resulting design output (measured in terms of increasing the completeness of the designs produced). The significant reduction in the number of user/tool interactions required suggests an increase in the appearance of intelligence of the tool. However, the (statistical) non-significance of the suggested-elements-per-test criterion may be viewed as jeopardizing this claim.
3.2 The Knowledge Reconciliation Approach
The capability of the knowledge reconciliation approach to increase the appearance of tool intelligence (by facilitating the suggestion of potential missing design elements) is evidenced by the results illustrated in Table 3. In this case, P < 0.05 with a positive t-Value indicates that the criterion of suggesting related elements was significantly increased with the use of the knowledge reconciliation approach. However, the number of user/tool interactions was not reduced (although P < 0.05, the t-Value is positive, indicating that the knowledge reconciliation approach significantly increased the number of interactions per complexity), resulting in the conclusion that the claim of an overall increase in the appearance of tool intelligence is unjustifiable.

Table 3. Paired t-test results - the knowledge reconciliation and basic approaches

  Criteria                      df   t-Value   Sig. t (P)
  Interaction per complexity    23   8.11      0.000
  CPU time per complexity       23   9.26      0.000
  Suggested elements per test   23   6.09      0.000
  Completeness per test         23   -6.09     0.000
The claim of an increase in the overall processing efficiency of the tool as a result of the use of the knowledge reconciliation approach is not supported by the results obtained. This is evidenced by the absence of any reduction in the number of user/tool interactions and CPU time required: although P < 0.05 in both cases, the t-Values are positive, indicating that the required CPU time and user/tool interactions per complexity increased overall when using the knowledge reconciliation approach. As the capability of suggesting missing design elements was shown to be significantly increased, it may be argued that the completeness of the designs produced should be correspondingly increased (by minimizing the number of missing elements). This aspect is illustrated in Table 3, where P < 0.05 and is accompanied by a negative t-Value. Therefore, the improvement in the completeness of the designs suggests that the claim of increasing the quality of the designs produced has been met, in terms of this criterion, for the knowledge reconciliation approach.
4 Discussion and Conclusion
Tables 4 and 5 present a summary of the conclusions reached for both the thesaurus and knowledge reconciliation approaches.
Table 4. Summary of conclusions

Criteria                                           Thesaurus       Knowledge Reconciliation
Reduces the number of user/tool interactions       Yes             No
Reduces the CPU time required                      Yes             No
Increases the no. of missing elements suggested    No              Yes
Increases the completeness of designs produced     No              Yes

Table 5. Summary of conclusions for the overall claims

Criteria                                           Thesaurus       Knowledge Reconciliation
Increases overall tool processing efficiency       Yes             No
Improves quality of designs produced               No              Yes
Increases overall appearance of tool intelligence  Unjustifiable   Unjustifiable
Both tables provide conclusive evidence that certain aspects of the claimed benefits have generally been achieved by the use and exploitation of real-world knowledge represented by the thesaurus and knowledge reconciliation approaches.
The claim of increased overall processing efficiency in intelligent database design tools through the use of real-world knowledge has been met by the thesaurus approach. This conclusion follows from the significant reduction in the number of user/tool interactions and in the CPU time required for each increase in complexity. These reductions occurred to a certain extent because of the additional information held by the thesaurus approach as opposed to the knowledge reconciliation approach; for instance, constraint-related and membership-requirement-related information is not represented by the knowledge reconciliation approach, yet both have the potential to impact upon the performance-related criteria. The claim of an improvement in the quality of the designs produced, on the other hand, has only been met by the knowledge reconciliation approach, as a result of the significant increase in the completeness of the overall designs produced. The increase in completeness was due to the fact that the approach is capable of identifying which elements are thought to be missing, and of suggesting the inclusion of these elements to the user. These results suggest that system-held real-world knowledge has the potential to guide the tool in playing an active part during the design process, and at the same time to increase aspects of the appearance of intelligence of the tool [20]. The question of whether the use of real-world knowledge could increase the overall appearance of intelligence of an intelligent database design tool, therefore, remains largely unresolved, as neither approach produced conclusive results. The thesaurus approach, although it significantly reduced the number of user/tool interactions required, was accompanied by a non-significant increase in the number of suggestions made regarding possible missing design elements within the evolving design. The knowledge reconciliation approach, on the other hand, although it significantly increased the number of suggestions made for missing design elements, was unable to significantly reduce the number of user/tool interactions required. Although encouraging results have been obtained from the testing and evaluation work, it is recognized that consideration must be given to a number of practical issues. The effectiveness of the tool depends greatly on the accuracy and completeness of the system-held real-world knowledge, and the results obtained from the tests may be influenced to a certain extent by the variety and coverage of the generated test-cases.
5 Summary and Future Work
This paper has presented the findings of an assessment of the thesaurus and knowledge reconciliation approaches to representing and exploiting real-world knowledge in an intelligent database design tool (IOA). The intention of this experiment has been to evaluate the claims made regarding the use of domain knowledge (represented by the thesaurus and knowledge reconciliation approaches) by intelligent database design tools, and not to compare the effectiveness or efficiency of these representative approaches. The results obtained have demonstrated that certain aspects of the claimed benefits associated with the use of such real-world knowledge
(increased processing efficiency, improved quality of the designs produced, and an increased appearance of intelligence) have been achieved. However, the extent to which these benefits have been attained and subsequently statistically validated varies. Similar experiments were conducted in the clinical/hospital and library domains, and findings consistent with those presented in this paper were obtained. Although there are a number of methodologies proposed for the testing and evaluation of expert systems [14, 15], only a handful of papers describe actual experiences related to the testing and performance evaluation of operational systems. This paper has presented an approach and methodology for performing such a task. The methodology employed involved the generation of a series of test-cases, the processing of the test-cases, and the use of statistical analysis to evaluate the observed results. Ongoing work includes extending the variety of the evaluative experiments by subjecting the tool to a range of test-cases containing a series of intentionally generated errors. Such test-cases contain a combination of different types and numbers of synthesized errors, including synonymous class(es), synonymous relationship(s), and combinations of both.
Acknowledgments. The authors wish to thank the anonymous referees for their helpful and constructive comments on a previous version of this paper.
References 1. Baldiserra, C., Ceri, S., Pelagatti, G. & Bracchi, G. (1979) “Interactive specification and formal verification of user's views in database design”, In: Proceedings of the 5th International Conference on Very Large Databases, Rio de Janeiro, Brazil. 262-272. 2. Batini, C., Ceri, S. & Navathe, S. (1992) Conceptual Database Design: An Entity Relationship Approach. Redwood City, CA: Benjamin-Cummings. 3. Bouzeghoub, M. (1992). “Using expert systems in schema design”. In Loucopoulos, P. & Zicari, R. (eds.) Conceptual Modeling, Databases, and CASE: an Integrated View of Information Systems Development. New York: Wiley, 465-487. 4. Bowers, D. S. (1993) From Data to Database. London: Chapman Hall. 5. Bracchi, G., Paolini, P. & Pelagatti, G. (1976) “Binary logical associations in data modeling”. In: Nijsen, G. M. (ed.) Modeling in Data Base Management Systems. Amsterdam: North-Holland, 125-148. 6. Elmasri, R. & Navathe, S. B. (1989) Fundamentals of Database Systems. Redwood City, CA: Benjamin Cummings. 7. Gonzalez, A. J., Gupta, U. G. & Chianese, R. B. (1996). “Performance evaluation of a large diagnostic expert system using a heuristic test case generator”. Engineering Application of Artificial Intelligence, 9(3), 275-284. 8. Kesh, S. (1995) “Evaluating the quality of entity relationship models”. Information and Software Technology, 37(12), 681-689. 9. Lloyd-Williams, M. (1993). “Expert system support for object-oriented database design”. International Journal of Applied Expert Systems, 1(3), 197-212. 10. Lloyd-Williams, M. (1994). “Knowledge-based CASE tools: improving performance using domain specific knowledge”. Software Engineering Journal, 9(4), 167-173.
11. Lloyd-Williams, M. (1997). “Exploiting domain knowledge during the automated design of object-oriented databases”, In: Embley, D. W. & Goldstein, R. C. (eds.), Proceedings of the 16th International Conference on Conceptual Modeling, Berlin: Spinger-Verlag, 16-29. 12. Lloyd-Williams, M. & Beynon-Davies, P. (1992). “Expert system for database design: a comparative review”. Artificial Intelligence Review, 6, 263-283. 13. Moody, D. L. & Shanks, G. G. (1994). “What makes a good data model? Evaluating the quality of entity-relationship models”, In: Loucopoulos, P. (eds.), Proceedings of the 13th International Conference on the Entity-Relationship Approach, Berlin: Springer-Verlag, 94-101. 14. O'Keefe, R. M., Balci, O. & Smith, E. P. (1987). “Validating expert system performance”. IEEE Expert, Winter, 81-90. 15. O'Keefe, R. M. & O'Leary, D. E. (1993). “Expert system verification and validation: a survey and tutorial”. Artificial Intelligence Review, 7(1), 3-42. 16. O'Keefe, R. M. & Preece, A. D. (1996) “The development, validation and implementation of knowledge-based systems”. European Journal of Operational Research, 92(3), 458-473. 17. Pippenger, N. (1978) “Complexity theory”. Scientific American, 238(6), 90-102. 18. Rees, D. G. (1995) Essential Statistics (3rd Edition). London: Chapman & Hall. 19. Rob, P. & Rob, C. C. (1993) Database Systems: Design, Implementation and Management. Belmont, CA: Wadsworth Publishing. 20. Storey, V. C. (1992). “Real world knowledge for databases”. Journal of Database Administration, 3(1), 1-19. 21. Storey, V. C., Chiang, R. H. L., Dey, D., Goldstein, R. C., Sundararajan, A. & Sundaresan, S. (1994) “Knowledge reconciliation for common sense reasoning”. In: De, P. & Woo, C. (eds.) Proceeding of the 4th Annual Workshop on Information Technologies and Systems. Vancouver: Univ. British Columbia, 87-96. 22. Storey, V. C. & Goldstein, R. C. (1990a). “Design and development of an expert database design system”. International Journal of Expert Systems Research and Applications, 3(1), 31-63. 23. Storey, V. C. & Goldstein, R. C. (1990b). “An expert view creation system for database design”. Expert Systems Review, 2(3), 19-45. 24. Storey, V. C. & Goldstein, R. C. (1993). “Knowledge-based approach to database design”. Management Information Systems Quarterly, 17(1), 25-46. 25. Storey, V. C., Goldstein, R. C., Chiang, R. H. L. & Dey, D. (1993). “A common-sense reasoning facility based on the entity-relationship model”, In: Elmasri, R. A., Kouramajian, V. & Thalheim, B. (eds.) Proceedings of the 12th International Conference on the Entity Relationship Approach, Berlin: Springer-Verlag, 218-229. 26. Vessey, I. & Sravanapudi, A. P. (1995) “CASE tools as collaborative support technologies”. Communications of the ACM, 37(1), 83-102.
Metrics for Evaluating the Quality of Entity Relationship Models

Daniel L. Moody
Simsion Bowles and Associates, 1 Collins St., Melbourne, Australia 3000.
email: [email protected]

Abstract. This paper defines a comprehensive set of metrics for evaluating the quality of Entity Relationship models. This is an extension of previous research which developed a conceptual framework and identified stakeholders and quality factors for evaluating data models. However quality factors are not enough to ensure quality in practice, because different people will have different interpretations of the same concept. The objective of this paper is to refine these quality factors into quantitative measures to reduce subjectivity and bias in the evaluation process. A total of twenty five candidate metrics are proposed in this paper, each of which measures one of the quality factors previously defined. The metrics may be used to evaluate the quality of data models, choose between alternatives and identify areas for improvement.
1 Introduction
The choice of an appropriate representation of data is one of the most crucial tasks in the entire systems development process. Although the data modelling phase represents only a small proportion of the total systems development effort, its impact on the final result is probably greater than that of any other phase (Simsion, 1994). The data model is a major determinant of system development costs (ASMA, 1996), system flexibility (Gartner, 1992), integration with other systems (Moody and Simsion, 1995) and the ability of the system to meet user requirements (Batini et al., 1992). For this reason, effort expended on improving the quality of data models is likely to pay off many times over in later phases.

Previous Research
Evaluating the quality of data models is a discipline which is only just beginning to emerge. Quantitative measurement of quality is almost non-existent. A number of frameworks for evaluating the quality of data models have now been proposed in the literature (Roman, 1985; Mayer, 1989; von Halle, 1991; Batini et al., 1992; Levitin and Redman, 1994; Simsion, 1994; Moody and Shanks, 1994; Krogstie, Lindland and Sindre, 1995; Lindland, Sindre and Solveberg, 1994; Kesh, 1995; Moody and Shanks, 1998). Most of these frameworks suggest criteria that may be used to evaluate the quality of data models. However, quality criteria are not enough on their own to ensure quality in practice, because different people will generally have different interpretations of what they mean. According to the Total Quality Management (TQM) literature, measurable criteria for assessing quality are necessary to avoid "arguments of style" (Zultner, 1992). The objective should be to replace intuitive notions of design "quality" with
formal, quantitative measures to reduce subjectivity and bias in the evaluation process. However developing reliable and objective measures of quality in software development is a difficult task. As Van Vliet (1993) says: “The various factors that relate to software quality are hard to define. It is even harder to measure them quantitatively. There are very few quality factors or criteria for which sufficiently sound numeric measures exist.”
Of the frameworks that have been proposed, only two address the issue of quality measurement. Moody and Shanks (1994) suggest a number of evaluation methods, which in some cases are measures (eg. data model complexity) and in other cases are processes for carrying out the evaluation (eg. user reviews). Kesh (1995) defines a number of metrics for evaluating data models but these are theoretically based, and of limited use in practice. Most of the other frameworks rely on experts giving overall subjective ratings of the quality of a data model with respect to the criteria proposed.
2 A Framework for Evaluating and Improving the Quality of Data Models
This paper uses the framework for data model evaluation and improvement proposed by Moody and Shanks (1998) as a basis for developing quality metrics. An earlier version of the framework was published in Moody and Shanks (1994). This framework was developed in practice, and has now been applied in a wide range of organisations around the world (Moody, Shanks and Darke, 1998). The framework is summarised by the Entity Relationship model shown in Fig. 1.

Fig. 1. Data Model Quality Evaluation Framework
• Quality factors are the properties of a data model that contribute to its quality. These answer the question: “What makes a good data model?”. A particular quality factor may have positive or negative interactions with other quality factors— these represent the trade-offs implicit in the modelling process.
• Stakeholders are people who are involved in building or using the data model, and therefore have an interest in its quality. Different stakeholders will generally be interested in different quality factors.
• Quality metrics define ways of evaluating particular quality factors. There may be multiple measures for each quality factor.
• Weightings define the relative importance of different quality factors in a problem situation. These are used to make trade-offs between different quality factors.
• Improvement strategies are techniques for improving the quality of data models with respect to one or more quality factors.
A previous paper (Moody and Shanks, 1994) defined the stakeholders and quality factors relevant to data modelling, as well as methods for evaluating the quality of data models. This paper defines metrics for each quality factor.

Stakeholders
The key stakeholders in the data modelling process are:
• The business user, whose requirements are defined by the data model
• The analyst, who is responsible for developing the data model
• The data administrator, who is responsible for ensuring that the data model is consistent with the rest of the organisation's data
• The application developer, who is responsible for implementing the data model (translating it into a physical database schema)

Quality Factors
The proposed quality factors and the primary stakeholders involved in evaluating them are shown in Fig. 2 below.

Fig. 2. Data Model Quality Factors (completeness, integrity, flexibility, understandability, correctness, simplicity, integration and implementability, each evaluated primarily by the business user, the data analyst, the data administrator or the application developer)
These quality factors may be used as criteria for evaluating the quality of individual data models or comparing alternative representations of requirements. Together they incorporate the needs of all stakeholders, and represent a complete picture of data model quality. The following sections define quality measures for each quality factor.
3 Completeness
Completeness relates to whether the data model contains all information required to meet user requirements. This corresponds to one half of the 100% principle—that the conceptual schema should define all static aspects of the Universe of Discourse (ISO, 1987). Completeness is the most important requirement of all because if it is not satisfied, none of the other quality factors matter. If the requirements as expressed in the data model are inaccurate or incomplete, the system which results will not satisfy users, no matter how well designed or implemented it is.

Evaluating Completeness
In principle, completeness can be checked by verifying that each user requirement is represented somewhere in the model, and that each element of the model corresponds to a user requirement (Batini et al., 1992). However, the practical difficulty with this is that there is no external source of user requirements—they exist only in people's minds. Completeness can therefore only be evaluated with close participation of business users. The result of completeness reviews will be a list of elements (entities, relationships, attributes, business rules) that do not match user requirements. Fig. 3 illustrates the different types of completeness mismatches:
Fig. 3. Types of Completeness Errors
• Area 1 represents elements included in the data model that do not correspond to any user requirement or are out of scope of the system—these represent unnecessary elements. We call these Type 1 errors.
• Area 2 represents user requirements which are not represented anywhere in the data model—these represent gaps or omissions in the model. We call these Type 2 errors.
• Area 3 represents items included in the data model that correspond to user requirements but have been inaccurately defined. We call these Type 3 errors.
• Area 4 represents elements in the data model that accurately correspond to user requirements.
The objective of completeness reviews is to eliminate all items of type 1, 2 and 3.
Proposed Completeness Metrics
The proposed quality measures for completeness all take the form of mismatches with respect to user requirements. The purpose of the review process will be to eliminate all such defects, so that the model exactly matches user requirements:
✎ Metric 1. Number of items in the data model that do not correspond to user requirements (Type 1 errors). Inclusion of such items will lead to unnecessary development effort and added cost.
✎ Metric 2. Number of user requirements which are not represented in the data model (Type 2 errors). These represent missing requirements, and will need to be added later in the development lifecycle, leading to increased costs, or, if they go undetected, will result in users not being satisfied with the system.
✎ Metric 3. Number of items in the data model that correspond to user requirements but are inaccurately defined (Type 3 errors). Such items will need to be changed later in the development lifecycle, leading to rework and added cost, or, if they go undetected, will result in users being unsatisfied with the system.
✎ Metric 4. Number of inconsistencies with the process model. A critical task in verifying the completeness of the data model is to map it against the business processes which the system needs to support. This ensures that all functional requirements can be met by the model. The result of this analysis can be presented in the form of a CRUD (Create, Read, Update, Delete) matrix. Analysis of the CRUD matrix can be used to identify gaps in the data model as well as to "prune away" unnecessary data from the model (Martin, 1989).
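As an illustration of the CRUD-based check behind Metric 4, the sketch below scans a hypothetical CRUD matrix for entities no process touches, entities processes reference but the model lacks, and entities that are read but never created; the process and entity names, and the idea of summing the three counts into one figure, are assumptions made purely for the example.

```python
# Hypothetical CRUD matrix: process -> {entity: set of operations}.
crud = {
    "Take Order":    {"Customer": {"R"}, "Order": {"C"}, "Product": {"R"}},
    "Ship Order":    {"Order": {"R", "U"}, "Shipment": {"C"}},
    "Bill Customer": {"Customer": {"R"}, "Invoice": {"C"}, "Order": {"R"}},
}
data_model_entities = {"Customer", "Order", "Product", "Invoice", "Warehouse"}

used = {e for ops in crud.values() for e in ops}     # entities touched by some process
unused = data_model_entities - used                  # candidate Type 1 errors: prune away
missing = used - data_model_entities                 # candidate Type 2 errors: gaps
never_created = {e for e in data_model_entities & used
                 if not any("C" in ops.get(e, set()) for ops in crud.values())}

print("Entities no process uses (candidate Type 1 errors):", unused)
print("Entities used by processes but absent from the model (Type 2):", missing)
print("Entities read but never created by any process:", never_created)
# One possible tally for Metric 4 (inconsistencies with the process model):
print("Metric 4:", len(unused) + len(missing) + len(never_created))
```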
4 Integrity
Integrity is defined as the extent to which the business rules (or integrity constraints) which apply to the data are enforced by the data model¹. Integrity corresponds to the other half of the 100% principle—that the conceptual schema should define all dynamic aspects of the Universe of Discourse (ISO, 1987). Business rules define what can and can't happen to the data. Business rules are necessary to maintain the consistency and integrity of data stored, as well as to enforce business policies (Date, 1989; Loffman and Rush, 1991). All rules which apply to the data should be documented in the data model to ensure they are enforced consistently across all application programs (ISO, 1987).
¹ In the original version of the evaluation framework (Moody and Shanks, 1994) integrity was included as part of completeness, but has since been separated out as a quality factor in its own right.

Evaluating Integrity
Like completeness, integrity can only really be evaluated with close participation of business users. The rules represented by the data model may be verified by translating them into natural language sentences. Users can then verify whether each rule is true
or false. This is useful as a check on the integrity of the data model because business users often have difficulty understanding the constraints defined in data models, particularly cardinality rules on relationships (Batini et al., 1992). Many CASE tools can automatically translate relationship cardinality rules into natural language sentences, provided relationships have been named correctly.

Proposed Integrity Metrics
The proposed quality measures for integrity take the form of mismatches between the data model and business policies. The purpose of the review process will be to eliminate all such defects:
✎ Metric 5. Number of business rules which are not enforced by the data model. Non-enforcement of these rules will result in data integrity problems and/or operational errors.
✎ Metric 6. Number of integrity constraints included in the data model that do not accurately correspond to business policies (i.e. which are false). Incorrect integrity constraints may be further classified as:
  • too weak: the rule allows invalid data to be stored
  • too strong: the rule does not allow valid data to be stored and will lead to constraints on business operations and the need for user "workarounds".
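The translation of cardinality rules into sentences mentioned above might look like the following sketch, which is an assumption rather than the output of any particular CASE tool; the relationship tuples and phrasing are invented, and the resulting true/false judgements from business users feed Metrics 5 and 6.

```python
# Each relationship is read in one direction:
# (entity_a, verb phrase, entity_b, minimum cardinality, maximum cardinality).
relationships = [
    ("Customer", "places", "Order", 0, "many"),
    ("Order", "is placed by", "Customer", 1, 1),
    ("Order", "contains", "Order Line", 1, "many"),
]

def to_sentence(a, verb, b, lo, hi):
    if (lo, hi) == (1, 1):
        qty, plural = "exactly one", ""
    elif (lo, hi) == (0, 1):
        qty, plural = "at most one", ""
    elif lo == 0:
        qty, plural = "zero or more", "s"
    else:
        qty, plural = "one or more", "s"
    return f"Each {a} {verb} {qty} {b}{plural}."

for rel in relationships:
    print(to_sentence(*rel))
# Business users confirm or reject each sentence: rejected sentences count
# towards Metric 6, while rules users state that appear nowhere in the model
# count towards Metric 5.
```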
5 Flexibility
Flexibility is defined as the ease with which the data model can cope with business change. The objective is for additions and/or changes in requirements to be handled with the minimum possible change to the data model. The data model is a key contributor to the flexibility of the system as a whole (Gartner, 1992; Simsion, 1994). Lack of flexibility in the data model can lead to:
• Maintenance costs: of all types of maintenance changes, changes to data structures and formats are the most expensive. This is because each such change has a "ripple effect" on all the programs that use it.
• Reduced organisational responsiveness: inflexible systems inhibit changes to business practices, organisational growth and the ability to respond quickly to business or regulatory change. Often the major constraint on introducing business change—for example, bringing a new product to market—is the need to modify the computer systems that support it (Simsion, 1988).

Evaluating Flexibility
Flexibility is a particularly difficult quality factor to assess because of the inherent difficulty of predicting what might happen in the future. Evaluation of flexibility requires identifying what requirements might change in the future, their probability of occurrence and their impact on the data model. However, no matter how much time is spent thinking about what might happen in the future, such changes remain hard to anticipate. In this respect, evaluating flexibility has much in common with weather forecasting—there is a limit to how far and how accurately the future can be predicted.
Proposed Flexibility Metrics
The proposed measures for evaluating flexibility focus on areas where the model is potentially unstable—where changes to the model might be required in the future as a result of changes in the business environment. The purpose of the review process will be to look at ways of minimising the impact of change on the model, taking into account the probability of change, strategic impact and likely cost of change. A particular focus of flexibility reviews is identifying business rules which might change.
✎ Metric 7. Number of elements in the model which are subject to change in the future. This includes changes in definitions or business rules as a result of business or regulatory change.
✎ Metric 8. Estimated cost of changes. For each possible change, the probability of the change occurring and the estimated cost of making the change post-implementation should be used to calculate the probability-adjusted cost of the change.
✎ Metric 9. Strategic importance of changes. For each possible change, the strategic impact of the change should be defined, expressed as a rating by business users of the need to respond quickly to the change.
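A small sketch of how Metrics 7 and 8 could be tallied, assuming each anticipated change has been assigned a probability of occurrence and an estimated post-implementation cost; all figures are invented for illustration.

```python
# Candidate future changes: (description, probability, estimated cost if it occurs).
candidate_changes = [
    ("New regulatory reporting category", 0.7, 40_000),
    ("Support for multiple currencies",   0.3, 25_000),
    ("Merge of two product hierarchies",  0.1, 60_000),
]

# Metric 7: number of elements (here, anticipated changes) subject to change.
print("Metric 7 (elements subject to change):", len(candidate_changes))

# Metric 8: probability-adjusted (expected) cost of change.
expected_cost = sum(p * cost for _, p, cost in candidate_changes)
print(f"Metric 8 (probability-adjusted cost): {expected_cost:,.0f}")
```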
6 Understandability
Understandability is defined as the ease with which the data model can be understood. Business users must be able to understand the model in order to verify that it meets their requirements. Similarly, application developers need to be able to understand the model to implement it correctly. Understandability is also important in terms of the useability of the system. If users have trouble understanding the concepts in the data model, they are also likely to have difficulty understanding the system which is produced as a result. The communication properties of the data model are critical to the success of the modelling effort. However, empirical studies show that in practice data models are poorly understood by users, and in most cases are not developed with direct user involvement (Hitchman, 1995). While data modelling has proven very effective as a technique for database design, it has been far less effective for communication with users (Moody, 1996a).

Evaluating Understandability
Understandability can only be evaluated with close participation of the users of the model—business users and application developers. In principle, understandability can be checked by verifying that each element of the model is understandable. However, the practical difficulty with this is that users may think they understand the model while not understanding its full implications and possible limitations from a business perspective.
Proposed Understandability Metrics
The proposed measures for understandability take the form of ratings by different stakeholders and tests of understanding. The purpose of the review process will be to maximise these ratings.
✎ Metric 10. User rating of understandability of the model: user ratings of understandability will be largely based on the concepts, names and definitions used, as well as how the model is presented. A danger with this metric is that it is common for users to grasp familiar business terms without appreciating the meaning represented in the model. As a result, they may think they understand the model while not really understanding its full implications for the business.
✎ Metric 11. Ability of users to interpret the model correctly. This can be measured by getting users to instantiate the model using actual business examples (scenarios). Their level of understanding can then be measured by the number of errors in populating the model. This is a better operational test of understanding than the previous metric—it measures whether the model is actually understood rather than whether it is understandable (Lindland et al., 1994). This is much more important from the point of view of verifying the accuracy of the model.
✎ Metric 12. Application developer rating of understandability. It is essential that the application developer understands the model fully so that they can implement it correctly. Getting the application developer to review the model for understandability is particularly useful for identifying where the model is unclear or ambiguous, because they will be less familiar with the model and the business domain than either the analyst or business user. Many things that seem obvious to those involved in developing the model may not be to someone seeing it for the first time.
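Metric 11 can be operationalised as an error rate over scenario walkthroughs, as in the sketch below; the per-scenario counts are invented, and treating the ratio of errors to instantiated facts as the measure is one possible reading of the metric rather than a prescribed formula.

```python
# For each scenario walked through with users:
# (number of facts to be instantiated, errors made while populating the model).
scenarios = [(12, 1), (8, 0), (15, 4), (10, 2)]

total_facts = sum(n for n, _ in scenarios)
total_errors = sum(e for _, e in scenarios)
error_rate = total_errors / total_facts
print(f"Metric 11: {total_errors} errors over {total_facts} instantiations "
      f"({error_rate:.1%} error rate)")
```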
7 Correctness
Correctness refers to whether the model conforms to the rules of the data modelling technique being used. Rules of correctness include diagramming conventions, naming rules, definition rules and rules of composition (for example, each entity must have a primary key). Correctness is concerned only with whether the data modelling technique has been used correctly (syntactic or grammatical correctness). It answers the question: "Is this a valid model?". Another important aspect of correctness, and a major focus of data modelling in practice, is to ensure that the model contains no redundancy—that each fact is represented in only one place (Simsion, 1994).

Evaluating Correctness
Correctness is the easiest of all the quality factors to evaluate, because there is very little subjectivity involved, and no degrees of quality—the model either obeys the rules or it does not. Also, the model can be evaluated in isolation, without reference to user requirements. The result of correctness reviews will be a list of defects, defining where the data model does not conform to the rules of the data modelling technique. Many of these checks can be carried out automatically using CASE tools.
Proposed Correctness Metrics
The proposed quality measures for correctness all take the form of defects with respect to data modelling standards (syntactic rules). We break down correctness errors into different types or defect classes to assist in identifying patterns of errors or problem areas which may be addressed by training or other process measures. The purpose of the review process will be to eliminate all such defects:
✎ Metric 13. Number of violations of data modelling conventions. These can be further broken down into the following defect classes:
  • Diagramming standards violations (eg. relationships not named)
  • Naming standards violations (eg. use of plural nouns as entity names)
  • Invalid primary keys (non-unique, incomplete or non-singular)
  • Invalid use of constructs (eg. entities without attributes, overlapping subtypes, many to many relationships)
  • Incomplete definition of constructs (e.g. data type and format not defined for an attribute; missing or inadequate entity definition)
✎ Metric 14. Number of normal form violations. Second and higher normal form violations identify redundancy among attributes within an entity (intra-entity redundancy). Normal form violations may be further classified into:
  • First normal form (1NF) violations
  • Second normal form (2NF) violations
  • Third normal form (3NF) violations
  • Higher normal form (4NF+) violations
✎ Metric 15. Number of instances of redundancy between entities—for example, where two entity definitions overlap or where redundant relationships are included. This is called inter-entity redundancy, to distinguish this from redundancy within an entity (intra-entity redundancy—Metric 14) and redundancy of data with other systems (external redundancy—Metric 21). Fig. 4 summarises the types of redundancy and the corresponding metrics.

Fig. 4. Classification of Redundancy (internal redundancy: intra-entity—Metric 14, inter-entity—Metric 15; external redundancy—Metric 21)
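Some of the Metric 13 checks lend themselves to the kind of automation noted above for CASE tools. The sketch below runs a few such checks (plural entity names, entities without attributes or primary keys, unnamed relationships) over a toy model representation; the data structures and the deliberately naive plural test are assumptions made purely for illustration.

```python
# Toy model representation for convention checking (Metric 13).
entities = {
    "Customer": {"attributes": ["customer_no", "name"], "primary_key": ["customer_no"]},
    "Orders":   {"attributes": [], "primary_key": []},          # plural name, no attributes
}
relationships = [
    {"name": "places", "from": "Customer", "to": "Orders"},
    {"name": "",       "from": "Orders",   "to": "Customer"},   # unnamed relationship
]

violations = []
for name, e in entities.items():
    if name.endswith("s"):                   # naive plural test, illustration only
        violations.append(f"Naming standard: entity '{name}' uses a plural noun")
    if not e["attributes"]:
        violations.append(f"Invalid construct: entity '{name}' has no attributes")
    if not e["primary_key"]:
        violations.append(f"Invalid primary key: entity '{name}' has none defined")
for r in relationships:
    if not r["name"]:
        violations.append(f"Diagramming standard: relationship {r['from']}-{r['to']} is unnamed")

print(f"Metric 13 = {len(violations)} violations")
for v in violations:
    print(" -", v)
```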
8 Simplicity
Simplicity means that the data model contains the minimum possible constructs. Simpler models are more flexible (Meyer, 1988), easier to implement (Simsion, 1991), and easier to understand (Moody, 1997). The choice of simplicity as a quality factor is based on the principle of Ockham's Razor, which has become one of the cornerstones of the scientific method. This says that if there are two theories which explain the same observations, the one with the fewer constructs should be preferred (Dubin, 1979). The extension of this to data modelling is that if there are two data models which meet the same requirements, the simpler one should be preferred.

Evaluating Simplicity
Simplicity is the easiest of all quality factors to evaluate, because it only requires a simple count of data model elements. This can be done automatically by CASE tools, or carried out manually. It takes no skill (apart from the ability to count!) and is totally objective. Simplicity metrics are particularly useful in comparing alternative data models—all other things being equal, the simpler one should be preferred.

Proposed Simplicity Metrics
Metrics for evaluating simplicity take the form of complexity measures. The purpose of the review process will be to minimise the complexity of the model while still satisfying user requirements. The following metrics represent alternative ways of measuring the complexity of a data model—Metric 17 is recommended as the most useful of the measures proposed:
✎ Metric 16. Number of entities (E). This is the simplest measure of the complexity of a data model. The justification for this is that the number of entities in the logical data model is a surrogate measure for system complexity and development effort. Symons (1988, 1991) found that in sizing of business ("data rich") applications, the major determinant of software size (and development effort) was the number of entities.
✎ Metric 17. Number of entities and relationships (E+R). This is a finer resolution complexity measure which is calculated as the number of entities (E) plus the number of relationships (R) in the data model. This derives from complexity theory, which asserts that the complexity of any system is defined by the number of components in the system and the number of relationships between them (Klir, 1985; Pippenger, 1978). Subtypes should not be included in the calculation of the number of entities because these represent subcategories within a single construct and generally do not translate into separate database tables. In addition, many to many relationships should be counted as three constructs, since when they are resolved, they will form one entity and two relationships (Shanks, 1997). This helps to standardise differences between different modelling styles.
✎ Metric 18. Number of constructs (E+R+A). This is the finest resolution complexity measure, and includes the number of attributes in the calculation of data
model complexity. Such a metric could be calculated as a weighted sum of the form a·N_E + b·N_R + c·N_A, where N_E is the number of entities, N_R is the number of relationships and N_A is the number of attributes. In practice, however, such a measure does not provide any better information than Metric 17.
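The three simplicity metrics reduce to counting, as the sketch below illustrates; the counting rules for subtypes and many-to-many relationships follow the text above, while the model figures and the equal weights used in Metric 18 are assumptions made for illustration.

```python
# Counts taken from a hypothetical logical data model.
num_entities      = 23   # excluding subtypes, as recommended above
num_relationships = 30   # one-to-many relationships
num_many_to_many  = 4    # each counts as 3 constructs (1 entity + 2 relationships)
num_attributes    = 180

metric_16 = num_entities                                              # E
metric_17 = num_entities + num_relationships + 3 * num_many_to_many   # E + R
metric_18 = metric_17 + num_attributes                                # E + R + A (a = b = c = 1)

print("Metric 16 (E):    ", metric_16)
print("Metric 17 (E+R):  ", metric_17)
print("Metric 18 (E+R+A):", metric_18)
```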
9 Integration
Integration is defined as the level of consistency of the data model with the rest of the organisation's data. In practice, application systems are often built in relative isolation from each other, leading to the same data being implemented over and over again in different ways. This leads to duplication of data, interface problems and difficulties consolidating data from different systems for management reporting (Moody and Simsion, 1995). The primary mechanism for achieving corporate-wide data integration is a corporate data model (Goodhue et al., 1992). The corporate data model provides a common set of data definitions which is used to co-ordinate the activities of application development teams so that separately developed systems work together. The corporate data model allows opportunities for sharing of data to be identified, and ensures that different systems use consistent data naming and formats (Martin, 1989).

Evaluating Integration
Integration is assessed by comparing the application data model with the corporate data model (Batini, Lenzerini and Navathe, 1986). The result of this will be a list of conflicts between the project data model and the corporate data model. This is usually the responsibility of the data administrator (also called information architect, data architect or data manager), who has responsibility for corporate-wide sharing and integration of data. It is their role to maintain the corporate data model and review application data models for conformance to the corporate model.

Proposed Integration Metrics
Most of the proposed measures for integration are in the form of conflicts with the corporate data model or with existing systems. The purpose of the review process will be to resolve these inconsistencies.
✎ Metric 19. Number of data conflicts with the Corporate Data Model. These can be further classified into:
  • Entity conflicts: number of entities whose definitions are inconsistent with the definitions of entities in the corporate data model.
  • Data element conflicts: number of attributes with different definitions or domains to corresponding attributes defined in the corporate data model.
  • Naming conflicts: number of entities or attributes with the same business meaning but different names to concepts in the corporate data model (synonyms); also entities or attributes with the same name but different meaning to concepts in the corporate data model (homonyms).
✎ Metric 20. Number of data conflicts with existing systems. These can be further classified into:
  • Number of data elements whose definitions conflict with those in existing systems, e.g. different data formats or definitions. Inconsistent data item definitions will lead to interface problems, the need for data translation and difficulties comparing and consolidating data across systems.
  • Number of key conflicts with existing systems or other projects. Key conflicts occur when different identifiers are assigned to the same object (eg. a particular customer) by different systems. This leads to fragmentation of data across systems and the inability to link or consolidate data about a particular entity across systems.
  • Number of naming conflicts with other systems (synonyms and/or homonyms). These are less of a problem in practice than other data conflicts, but are a frequent source of confusion in system maintenance and interpretation of data.
✎ Metric 21. Number of data elements which duplicate data elements stored in existing systems or other projects. This is called external redundancy to distinguish it from redundancy within the model itself (Metrics 14 and 15). This form of redundancy is a serious problem in most organisations—empirical studies show that there are an average of ten physical copies of each primary data item in medium to large organisations (O'Brien and O'Brien, 1994).
✎ Metric 22. Rating by representatives of other business areas as to whether the data has been defined in a way which meets corporate needs rather than the requirements of the application being developed. Because all data is potentially shareable, all views of the data should be considered when the data is first defined (Thompson, 1993). In practice, this can be done by a high level committee which reviews all application development projects for data sharing, consistency and integration (Moody, 1996b).
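Part of the conflict detection behind Metrics 19 and 20 can be automated, as in the hedged sketch below; the matching is deliberately naive (exact definition strings rather than any semantic comparison), and both dictionaries are invented for the example.

```python
# name -> definition, for the project model and the corporate data model.
project = {
    "Client":   "A person or organisation that purchases goods or services",
    "Order":    "A request by a client for goods",
    "Location": "A postal address used for delivery",
}
corporate = {
    "Customer": "A person or organisation that purchases goods or services",
    "Order":    "A contractual agreement to supply goods",
    "Site":     "A physical place owned by the company",
}

# Synonym candidates: same definition, different names.
synonyms = [(p, c) for p, pd in project.items() for c, cd in corporate.items()
            if p != c and pd == cd]
# Homonym candidates: same name, different definitions.
homonyms = [n for n in project if n in corporate and project[n] != corporate[n]]

print("Synonym candidates (project vs corporate):", synonyms)
print("Homonym candidates:", homonyms)
print("Naming conflicts contributing to Metric 19:", len(synonyms) + len(homonyms))
```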
10 Implementability
Implementability is defined as the ease with which the data model can be implemented within the time, budget and technology constraints of the project. While it is important that a data model does not contain any assumptions about the implementation (ISO, 1987), it is also important that it does not ignore all practical considerations. After all, there is little point developing a model which cannot be implemented or that the user cannot afford.

Evaluating Implementability
The implementability of the data model is assessed by the application developer, who is responsible for implementing the data model once it has been completed. The application developer provides an important "reality check" on what is technically possible and/or economically feasible. The process of reviewing the model also allows the application developer to gain familiarity with the model prior to the design stage to ensure a smooth transition.
Proposed Implementability Metrics
Proposed measures of implementability all take the form of ratings by the application developer. The purpose of the review process will be to minimise these ratings:
✎ Metric 23. Technical risk rating: estimate of the probability that the system can meet performance requirements based on the proposed data model and the technological platform (particularly the target DBMS) being used.
✎ Metric 24. Schedule risk rating: estimate of the probability that the system can be implemented on time, based on the proposed data model.
✎ Metric 25. Development cost estimate: this is an estimate of the development cost of the system, based on the data model. Such an estimate will necessarily be approximate but will be useful as a guide for making cost/quality trade-offs between different models proposed. If the quote is too high (exceeds the available budget), the model may need to be simplified, reduced in scope, or the budget increased.
11 Conclusion
This paper has proposed a comprehensive set of metrics for evaluating the quality of data models based on the set of quality factors proposed by Moody and Shanks (1998). A total of twenty five candidate metrics are identified, with eighteen secondary metrics which may be used to classify defects in more detail. It is not expected that all of these metrics would be used in evaluating the quality of a particular data model. Our aim in this paper has been to be as complete as possible—to suggest as many metrics as possible as a starting point for analysis. Selection of the most appropriate metrics should be made based on their perceived usefulness and ease of calculation.

Further Research
The next step in this research is to validate and refine these metrics in practice. This will help to identify which metrics are most useful. It is proposed to use action research as the research paradigm for doing this. Action research (Checkland and Scholes, 1990) is a research method in which practitioners and researchers work together to test and refine principles, tools, techniques and methodologies that have been developed to address real world problems. It provides the ability to test out new methods in practice for the mutual benefit of researchers and practitioners. Moody, Shanks and Darke (1998) (in this conference) describe how action research has already been used to validate the framework and the quality factors proposed. Further research is also required to develop strategies for improving the quality of data models once a quality problem has been identified. Definition of improvement strategies would complete the specification of the framework.
References
1. AUSTRALIAN SOFTWARE METRICS ASSOCIATION (ASMA) (1996): ASMA Project Database, Release 7, November, P.O. Box 1287, Box Hill, Victoria, Australia, 3128. 2. BATINI, C., CERI, S. AND NAVATHE, S.B. (1992): Conceptual Database Design: An Entity Relationship Approach, Benjamin Cummings, Redwood City, California. 3. BATINI, C., LENZERINI, M. AND NAVATHE, S. (1986): A Comparative Analysis of Methodologies for Database Schema Integration, ACM Computing Surveys, 18(4), December: 323-364. 4. CHECKLAND, P.B. and SCHOLES, J. (1990): Soft Systems Methodology in Action, Wiley, Chichester. 5. DATE, C.J. (1989): Introduction to Database Systems (4th Edition), Addison Wesley. 6. DUBIN, R. (1978): Theory Building, The Free Press, New York. 7. GARTNER RESEARCH GROUP (1992): "Sometimes You Gotta Break the Rules", Gartner Group Strategic Management Series Key Issues, November 23. 8. GOODHUE, D.L., KIRSCH, L.J., AND WYBO, M.D. (1992): The Impact of Data Integration on the Costs and Benefits of Information Systems, MIS Quarterly, 16(3), September: 293-311. 9. HITCHMAN, S. (1995): Practitioner Perceptions On The Use Of Some Semantic Concepts In The Entity Relationship Model, European Journal of Information Systems, 4, 31-40. 10. INTERNATIONAL STANDARDS ORGANISATION (ISO) (1987): Information Processing Systems - Concepts and Terminology for the Conceptual Schema and the Information Base, ISO Technical Report 9007. 11. KESH, S. (1995): Evaluating the Quality of Entity Relationship Models, Information and Software Technology, 37(12). 12. KLIR, G.J. (1985): Architecture of Systems Problem Solving, Plenum Press, New York. 13. KROGSTIE, J., LINDLAND, O.I. and SINDRE, G. (1995): Towards a Deeper Understanding of Quality in Requirements Engineering, Proceedings of the 7th International Conference on Advanced Information Systems Engineering (CAISE), Jyvaskyla, Finland, June. 14. LEVITIN, A. and REDMAN, T. (1994): Quality Dimensions of a Conceptual View, Information Processing and Management, Volume 31. 15. LINDLAND, O.I., SINDRE, G. and SOLVEBERG, A. (1994): Understanding Quality in Conceptual Modelling, IEEE Software, March. 16. LOFFMAN, R.S. AND RUSH, R.M. (1991): Improving Data Quality, Database Programming and Design, 4(4), April, 17-19. 17. MARTIN, J. (1989): Strategic Data Planning Methodologies, Prentice Hall, New Jersey. 18. MAYER, R.E. (1989): Models for Understanding, Review of Educational Research, Spring. 19. MEYER, B. (1988): Object Oriented Software Construction, Prentice Hall, New York. 20. MOODY, D.L. AND SHANKS, G.G. (1994): What Makes A Good Data Model? Evaluating the Quality of Entity Relationship Models, in P. LOUCOPOLIS (ed.) Proceedings of the Thirteenth International Conference on the Entity Relationship Approach, Manchester, December 14-17, 94-111.
21. MOODY, D.L. AND SIMSION, G.C. (1995): Justifying Investment in Information Resource Management, Australian Journal of Information Systems, 3(1), September: 25-37. 22. MOODY, D.L. (1996a) “Graphical Entity Relationship Models: Towards A More User Understandable Representation of Data”, in B. THALHEIM (ed.) Proceedings of the Fourteenth International Conference on the Entity Relationship Approach, Cottbus, Germany, October 7-9, 227-244. 23. MOODY, D.L. (1996b) Critical Success Factors for Information Resource Management, Proc. 7th Australasian Conference on Information Systems, Hobart, Australia, December. 24. MOODY, D.L. (1997): “A Multi-Level Architecture for Representing Enterprise Data Models”, Proceedings of the Sixteenth International Conference on the Entity Relationship Approach, Los Angeles, November 1-3. 25. MOODY, D.L. and SHANKS, G.G. (1998): What Makes A Good Data Model? A Framework for Evaluating and Improving the Quality of Entity Relationship Models, Australian Computer Journal (forthcoming). 26. MOODY, D.L., SHANKS, G.G. and DARKE, P. (1998): Improving the Quality of Entity Relationship Models—Experience in Research and Practice, in Proceedings of the Seventeenth International Conference on Conceptual Modelling (ER ’98), Singapore, November 16—19. 27. O’BRIEN, C. AND O’BRIEN, S. (1994), “Mining Your Legacy Systems: A Data-Based Approach”, Asia Pacific DB2 User Group Conference, Melbourne, Australia, November 21-23. 28. PIPPENGER, N. (1978): Complexity Theory, Scientific American, 238(6): 1-15. 29. ROMAN, G. (1985): A Taxonomy of Current Issues in Requirements Engineering, IEEE Computer, April. 30. SHANKS, G.G. (1997) Conceptual Data Modelling: An Empirical Study of Expert and Novice Data Modellers, Australian Journal of Information Systems, 4:2, 63-73 31. SIMSION, G.C. (1988): Data Planning in a Volatile Business Environment, Australian Computer Society Conference on Strategic Planning for Information Technology, Ballarat, March: 88-92. 32. SIMSION, G.C. (1991): Creative Data Modelling, Proceedings of the Tenth International Entity Relationship Conference, San Francisco, 112-123. 33. SIMSION, G.C. (1994): Data Modelling Essentials, Van Nostrand Reinhold, New York. 34. SYMONS, C.R. (1988): Function Point Analysis: Difficulties and Improvements, IEEE Transactions on Software Engineering, 14(1), January. 35. SYMONS, C.R. (1991): Software Sizing and Estimating: MkII Function Point Analysis, J. Wiley and Sons. 36. THOMPSON, C. (1993): "Living with an Enterprise Model", Database Programming and Design, 6(12), March: 32-38. 37. VAN VLIET, J.C. (1993): Software Engineering: Principles and Practice, John Wiley and Sons, Chichester, England. 38. VON HALLE, B. (1991): Data: Asset or Liability?, Database Programming and Design, 4(7), July: 13-15. 39. ZULTNER, R.E. (1992): “The Deming Way: Total Quality Management for Software”, Proceedings of Total Quality Management for Software Conference, April, Washington, DC, April, 134-145.
A Transformational Approach to Correct Schema Refinements

Donatella Castelli and Serena Pisani
Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria, 46 Pisa, Italy
e-mail: {castelli,serena}@iei.pi.cnr.it

Abstract. This paper extends a database schema transformation language, called Schema Refinement Language, with a composition operator and a rule for deriving the conditions under which a composed transformation is guaranteed to produce a correct schema refinement. The framework that results from this extension can be exploited for improving the reliability of the database schema design also when other design frameworks are used.
1 Introduction
The reliability of a schema design is usually obtained by reducing the set of operators that can be used to carry out the design from the conceptual to the logical schema to a fixed set [1], [3], [4], [5], [6], [7], [9], [10], [11]. Each operator is provided with the conditions under which it is guaranteed to produce a correct schema refinement. Usually, these conditions can be proved by simply checking the schema structure. A drawback of this approach is that the given set of operators is often insufficient to cover the specific needs that occur in everyday practice. For some applications, the set of chosen transformations may be too low-level, whereas for others it may be too specialised. This paper proposes a novel approach for supporting correct schema refinement which overcomes the above drawback: the transformational operators can be built dynamically according to the designers' needs. The approach proposed relies upon a design language called Schema Refinement Language (SRL) [13]. SRL consists of a set of schema transformation primitives with the associated set of their applicability conditions. The proof of these conditions ensures the correctness of the design step. A composition operator for this language, which permits the definition of a personalised set of schema refinement operators, is proposed. Moreover, a rule for automatically deriving the applicability conditions of a composed transformation from the applicability conditions of the component transformations is introduced. Because of its generality, the framework presented can also be exploited to automatically derive the correctness conditions of schema transformations that are specified in other languages. The SRL framework is described in the next section. Section 3 introduces the composition operator. Section 4 presents the rule for deriving the applicability
conditions of a composed schema transformation. In particular, it discusses how, by exploiting this rule, it is also possible to discover situations in which the definition of a transformation is incorrect. Section 5 shows how the results presented can be exploited to derive the applicability conditions in different transformational frameworks; to illustrate this point, a few examples are presented, taken from well-known transformational frameworks. Section 6 contains concluding remarks. The algorithm for the generation of the applicability conditions is given in the Appendix.
2 Schema Refinement Language
The Schema Refinement Language (SRL) assumes that the whole design relies on a single notation able to represent semantic models. This notation, illustrated briefly through the example in Fig. 1¹, allows the database structure and behavior to be modelled in a single module, called Database Schema (DBS) [8]. This module encloses classes, attributes, is-a relationships, integrity constraints and operations. A graphical representation of the structural part of the schema in the example is given in Fig. 4(a). The DBS notation is formalised in terms of the formal model introduced within the B-Method [2]. This formalisation makes it possible to exploit the B theory and tools for proving expected properties of DBS schemas. The SRL primitive operators implement DBS schema transformations. They are given in Table 1. The equality conditions that appear as a parameter in the add/rem transformations specify how the new/removed element can be derived from the already existing/remaining ones. These conditions are required since only redundant components can be added and removed in a refinement step. SRL does not permit schema operations to be added or removed; it only permits changing the way in which an operation is defined. Note that the operation definitions are also automatically modified as a side effect of the transformations that add and remove schema components. In particular, these automatic modifications add appropriate updates for each of the new schema components, cancel the occurrences of the removed components and apply the proper variable substitutions. A transformation can be applied when its applicability conditions are verified. These are sufficient conditions, to be checked before the execution of the transformation, that prevent the application of meaningless and correctness-breaking schema design steps. Each applicability condition is composed of the conjunction of simple conditions, which in the rest of the paper will be called applicability predicates.
¹ In the figure, ; is the relation composition operator.
database schema Materials
  class vein of VEIN with (type:string)
  class aspect of ASPECT with (colour:string, has vein:vein)
  class material of MATERIAL with (name:string, has aspect:aspect)
  class marble is-a material with ()
  class stone is-a material with ()
  constraints
    ran(has aspect)=dom(has vein)
  initialization
    material,name,has aspect,aspect,colour,has vein,vein,type,marble,stone:=Ø
  operations
    vt←marble veins types = vt:={t| ∃m∈marble·has aspect;has vein(m)=v∧type(v)=t}

Fig. 1. A Database Schema.

Table 1. SRL language.
  add.class(class.name, class.name=expr)
  rem.class(class.name, class.name=expr)
  add.attr(attr.name, class.name, attr.name=expr)
  rem.attr(attr.name, class.name, attr.name=expr)
  add.isa(class.name1, class.name2)
  rem.isa(class.name1, class.name2)
  mod.op(op.name, body)
The criterion for the correctness of schema design is based on the following definition (for a formal definition see [13]):
Definition (DBS schema refinement relation). A DBS schema S1 refines a DBS schema S2 if:
(a) S1 and S2 have the same signature²;
(b) there exists a 1:1 correspondence between the states modelled by S1 and S2;
(c) the databases B1 and B2, modelled by S1 and S2, when initialised and submitted to the same sequence of updates, are such that each possible query on B1 returns one of the results expected by evaluating the same query on B2. □
This notion of correctness is a restricted version of that used within the B refinement theory. Let us point out that the main concerns in defining the SRL framework have been simplicity and generality. These qualities are achieved by defining both the model and the schema refinement language in terms of very primitive mechanisms. SRL, as presented above, however, is not sufficiently general to be used to interpret other schema design frameworks: the schema transformations of such frameworks are usually more complex than those listed above. In order to overcome this limitation, a composition operator for SRL is introduced in the next section.
² With "same signature" we indicate that S1 and S2 have corresponding operations with the same names and the same input and result parameters.
3 Composition Operator
The composition operator permits complex transformations to be defined from simpler ones. Before introducing it, the following preliminary definition is needed.

Definition (Consistent operation modification). A set of SRL schema transformations t1, t2, . . ., tn specifies consistent operation modifications if, for each pair of transformations (tk, tj), with 1 ≤ k, j ≤ n and k ≠ j, that modify the same operation op, at least one of the conditions below holds:
1) bodyk ⊑ bodyj;   2) bodyj ⊑ bodyk
where ⊑ is the algorithmic refinement relation defined in [2] and bodyk and bodyj are the new behaviours of op specified by tk and tj, respectively. □

Intuitively, this definition means that all the bodies specified for the same operation by different transformations must describe the same general behaviour; they can differ only in being more or less refined. The SRL composition operator can now be defined as follows.

Definition (Composition operator "◦"). Let t1, t2, . . ., tn be a set of SRL schema transformations that specify consistent operation modifications. Let ⟨Cl, Attr, IsA, Constr, Op⟩ be a DBS schema, where Cl, Attr, IsA, Constr and Op are, respectively, the sets of classes, attributes, is-a relationships, integrity constraints and schema operations; Op always contains an operation Init that specifies the schema initialisation. The SRL schema transformation composition operator is defined as follows:

t1 ◦ t2 ◦ . . . ◦ tn(⟨Cl, Attr, IsA, Constr, Op⟩) = ⟨(Cl ∪ ACl) \ RCl, (Attr ∪ AAttr) \ RAttr, (IsA ∪ AIsA) \ RIsA, [RemSubst*](Constr ∧ AConstr), Op′⟩

where ACl/RCl, AAttr/RAttr and AIsA/RIsA are, respectively, the sets of classes, attributes and is-a relationships that are added/removed by t1, t2, . . ., tn, extracted from the component transformation parameters. RemSubst* is the transitive closure of the variable substitutions x := E dictated by the conditions specified when an element is removed. If we have, for example, rem.class(c, c=E) ◦ rem.class(d, d=f(c)) ◦ rem.class(e, e=F), then RemSubst* is the parallel composition of the substitutions c := E, d := f(E) and e := F. [RemSubst*]X is the expression obtained by applying the substitution RemSubst* to X; for example, [x := E]R(x) is R(E). This substitution makes it possible to rephrase integrity constraints and operation definitions so that they no longer mention the cancelled schema components. AConstr is the conjunction of the inherent constraints associated with the new schema components and of the conditions that specify how an added element relates to the remaining ones. Finally, Op′ is the new set of operation definitions. These result from the modifications that are required explicitly and from the automatic adjustments caused by the addition and removal of schema components. When more than one of the component transformations modifies an operation, the more specialised behaviour is selected. □

Note that the result of a composition depends on the component transformations and not on the order in which they appear.
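The effect of the composition operator can be pictured with a small Python sketch under stated assumptions: primitives are recorded only by the components they add or remove, the defining equalities are collected as plain-text constraints, and RemSubst* and the automatic rewriting of operation bodies are not modelled. It is meant only to illustrate why the result does not depend on the order of the components, not to reproduce the formal definition.

from dataclasses import dataclass, field

@dataclass
class Schema:
    classes: dict = field(default_factory=dict)      # class name -> set of attribute names
    isa: set = field(default_factory=set)            # (subclass, superclass) pairs
    constraints: list = field(default_factory=list)  # constraint expressions as strings
    operations: dict = field(default_factory=dict)   # operation name -> body (string)

@dataclass
class SRLStep:
    """One SRL primitive, recorded by its effect on the schema components."""
    add_classes: dict = field(default_factory=dict)  # name -> equality condition (string)
    rem_classes: dict = field(default_factory=dict)
    add_attrs: dict = field(default_factory=dict)    # (class, attribute) -> equality condition
    rem_attrs: dict = field(default_factory=dict)
    add_isa: set = field(default_factory=set)
    rem_isa: set = field(default_factory=set)

def compose(*steps):
    """Union the additions and removals of the component steps; the outcome does not
    depend on the order in which the steps are listed."""
    def apply(s: Schema) -> Schema:
        out = Schema({c: set(a) for c, a in s.classes.items()},
                     set(s.isa), list(s.constraints), dict(s.operations))
        for t in steps:                                    # additions: ACl, AAttr, AIsA, AConstr
            for c, cond in t.add_classes.items():
                out.classes.setdefault(c, set())
                out.constraints.append(cond)
            for (c, a), cond in t.add_attrs.items():
                out.classes.setdefault(c, set()).add(a)
                out.constraints.append(cond)
            out.isa |= t.add_isa
        for t in steps:                                    # removals: RCl, RAttr, RIsA
            for c in t.rem_classes:
                out.classes.pop(c, None)
            for (c, a) in t.rem_attrs:
                out.classes.get(c, set()).discard(a)
            out.isa -= t.rem_isa
        # RemSubst* and the induced rewriting of the operation definitions are left out.
        return out
    return apply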
Unlike other proposals [9], [10], [11], SRL so extended is a complete DBS schema refinement language. This property ensures that SRL is powerful enough to express every DBS schema transformation. As a consequence, the designer can progressively enrich the set of schema refinement transformations according to his needs. The following example illustrates how new transformations can be built.

Example. Suppose that the transformation illustrated in Fig. 2 has to be defined. This transformation adds a direct relationship a1 between two classes that were related by an indirect link; the new link is defined as the composition of a2 and a3. Moreover, it removes the relationship a3. This transformation can be built as a composition of simple SRL transformations in the following way:

path replacement(C1, a1, C2, a2, a3) = add.attr(a1, C1, a1 = a2;a3) ◦ rem.attr(a3, C2, a3 = a2⁻¹;a1)
Fig. 2. Path replacement: (a) the indirect link from C1 through C2 to C3 via a2 and a3; (b) the schema after the transformation, with the direct relationship a1 = a2;a3 added on C1 and a3 removed.
The transformation path replacement can be used like any other SRL transformation. For example, it can be applied to the database schema Materials of Fig. 1 as follows:

path replacement(marble, has vein marble, aspect, has aspect, has vein)

obtaining the DBS schema presented in Fig. 3. Fig. 4 illustrates graphically the effect of the transformation on the static part of the schema.

database schema Materials1
  class vein of VEIN with (type:string)
  class aspect of ASPECT with (colour:string)
  class material of MATERIAL with (name:string, has aspect:aspect)
  class marble is-a material with (has vein marble:vein)
  class stone is-a material with ()
  constraints
    ran(has aspect) = dom(has aspect⁻¹;has vein marble)
  initialization
    material, name, has aspect, aspect, colour, has vein marble, vein, type, marble, stone := Ø
  operations
    vt ← marble veins types = vt := {t | ∃ m ∈ marble · has vein marble(m) = v ∧ type(v) = t}

Fig. 3. DBS Materials1.
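A quick way to check the structural effect reported in Fig. 3 and Fig. 4 is to replay the transformation on a plain-dictionary version of the schema. The sketch below is illustrative only: identifiers are given underscores, only attribute placement is tracked, and the equality conditions are kept as strings standing for the real proof obligations (with ~ standing for relational inverse).

# Structural part of Materials (Fig. 1): classes mapped to their attribute names.
materials = {
    "vein": {"type"},
    "aspect": {"colour", "has_vein"},
    "material": {"name", "has_aspect"},
    "marble": set(),
    "stone": set(),
}

def path_replacement(schema, c1, a1, c2, a2, a3):
    """add.attr(a1, c1, a1 = a2;a3) composed with rem.attr(a3, c2, a3 = a2~;a1),
    tracking attribute placement only."""
    out = {c: set(attrs) for c, attrs in schema.items()}
    out[c1].add(a1)        # the new direct relationship
    out[c2].discard(a3)    # the redundant indirect step
    conditions = [f"{a1} = {a2};{a3}", f"{a3} = {a2}~;{a1}"]
    return out, conditions

materials1, conds = path_replacement(
    materials, "marble", "has_vein_marble", "aspect", "has_aspect", "has_vein")
assert "has_vein_marble" in materials1["marble"]     # as in Fig. 3 / Fig. 4(b)
assert "has_vein" not in materials1["aspect"]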
This section has illustrated how DBS schema refinement transformations can be built dynamically. The next section shows how, in this dynamic context, it is still possible to support the designer in carrying out a correct design process.
Fig. 4. Path replacement: (a) the static part of Materials, where material (with subclasses marble and stone) reaches vein through has aspect and has vein; (b) the static part of Materials1, where marble reaches vein directly through has vein marble.
4 Applicability Conditions
The generation of the applicability conditions of a composed transformation serves a double purpose. First, it highlights mistakes in the definition of the transformation and suggests to the designer how to remove them. Second, it provides a set of sufficient conditions ensuring that the application of the transformation results in a correct design. The applicability conditions are generated constructively [13] by the Applicability Condition Generating Algorithm (ACGA), given in the Appendix. ACGA generates the applicability conditions by considering the schema structure and the modifications brought about by the component transformations. For the applicability conditions of composed transformations, the following property holds [13]:

Property (SRL is a refinement language). Let t1, t2, . . ., tn be SRL schema transformations and S be a DBS schema. The application of the transformation t1 ◦ t2 ◦ . . . ◦ tn(S), when its applicability conditions are verified, produces a refinement of S. □

This property ensures the correctness of any SRL database design. The following two sections describe in detail the two uses of the applicability conditions mentioned above.
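Operationally, the property above amounts to a guarded application step: the instantiated applicability predicates are checked, and the transformation is applied only when all of them hold. The sketch below assumes the predicates have already been reduced to executable checks on a toy schema; in SRL they are proof obligations over the B formalisation, so this is a picture of the workflow rather than of the verification itself, and the predicate and function names are hypothetical.

from typing import Callable, Iterable

ToySchema = dict                                   # class name -> set of attribute names
Predicate = Callable[[ToySchema], bool]            # an instantiated applicability predicate

def apply_if_applicable(schema: ToySchema,
                        transformation: Callable[[ToySchema], ToySchema],
                        predicates: Iterable[Predicate]) -> ToySchema:
    """Refuse to transform unless every applicability predicate is verified."""
    failed = [p for p in predicates if not p(schema)]
    if failed:
        raise ValueError(f"{len(failed)} applicability predicate(s) not verified")
    return transformation(schema)

# Hypothetical predicate: the attribute to be added must not already exist on the class.
schema = {"material": {"name", "has_aspect"}, "marble": set()}
add_attr = lambda s: {**s, "marble": s["marble"] | {"has_vein_marble"}}
fresh_attr = lambda s: "has_vein_marble" not in s["marble"]
schema = apply_if_applicable(schema, add_attr, [fresh_attr])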
4.1 Applicability of a Transformation
As there are no constraints on how the transformations may be composed, it can happen that a newly defined transformation turns out to be never applicable, i.e. its applicability conditions are never verified. If t is a composed transformation with n parameters and applicability conditions appl_t, proving the following condition rules out such a wrong definition:

∃ p1, . . ., pn, S · appl_t(p1,...,pn)(S)

where p1, . . ., pn are the parameters of t and S is a DBS schema. This condition expresses that the transformation is applicable only if there is at least one instance of the parameters and a schema that verify the applicability conditions; otherwise, there is some mistake in the definition of t. To illustrate this point, let us examine the following example³:
³ In what follows, ▷ stands for the range restriction, C2 is-a-reach C1 is verified if there exists an is-a path between C1 and C2, and C3 is-a C2 stands for the inherent constraint "C3 is a subclass of C2".
Carrying out the verification of the above applicability conditions, it turns out that the last condition can never be verified; indeed, C can be instantiated with C3. The failure of the proof gives an indication of what is wrong in the definition of the transformation: the transformation removes the class C2, which is is-a related to C3. Independently of the values given to C2 and C3, the instantiated transformation will always produce a dangling is-a relationship, as illustrated in Fig. 5.
Fig. 5. Specialisation: (a) C1 related to C2 by the attribute a, with C3 is-a C2; (b) after the removal of C2, the is-a relationship of C3 is left dangling.
4.2 Correctness of a Design Step
The applicability conditions of a composed transformation are parametric with respect to the parameters of that transformation. By reasoning on these conditions, it turns out that some of them can be settled without instantiating the parameters, while others can be discharged by simply comparing the values of the parameters. This suggests pruning these predicates automatically and associating with an instance of a transformation only the simplified set of applicability predicates. The pruning is done at different stages. When a transformation with parameters p1, . . ., pn is defined, the set of applicability predicates is scanned and, for each predicate Pij of the set, the proof of ∀ p1, . . ., pn, S · Pij, where S is a DBS schema, is attempted. If the proof is successful, Pij is inserted in the set of applicability predicates that no longer have to be proved.
The second kind of pruning is executed when the transformation is instantiated: by reasoning on the structure of the component transformations and on the values of the parameters, several applicability predicates are discharged. The ACGA algorithm, reported in the Appendix, actually interleaves the generation of the applicability conditions with this second pruning. The result is the set of applicability predicates that the designer has to prove for a particular application of the transformation. Notice that this set is often very small. Moreover, since the SRL framework and its applicability conditions are formalised, an automatic, or at least guided, discharge of the generated applicability conditions is possible. As an example of dynamic generation of the applicability conditions of a composed transformation, let us look at the applicability conditions of the transformation path replacement, as invoked in the example of Sect. 3. In these conditions, the following abbreviations are used:
• Constr stands for the constraints of the initial schema;
• Inh stands for the inherent constraints that are implicitly added by the transformation;
• NewConstr1 = Constr ∧ Inh ∧ has vein = has aspect⁻¹;has vein marble
• NewConstr2 = Constr ∧ Inh ∧ has vein marble = has aspect;has vein
The applicability conditions of path replacement are:
(a) NewConstr1 ⇒ dom(has aspect;has vein) ⊆ marble
(b) NewConstr2 ⇒ has vein = has aspect⁻¹;has vein marble
The first condition requires that the added relationship, defined as the composition of has aspect and has vein, be defined on the class marble. The second condition requires that the removed relationship be derivable as the sequential composition of the remaining ones. Those listed above are the only applicability conditions returned to the designer; several others are checked and discharged automatically by the ACGA algorithm.
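The two pruning stages can be pictured as two filters over the set of applicability predicates. In the sketch below, provable_universally and decide_on_parameters are hypothetical hooks standing in, respectively, for a proof attempt of ∀ p1, . . ., pn, S · Pij and for the syntactic checks performed at instantiation time; neither is part of ACGA as defined in the paper, so this is only an illustration of the workflow.

def prune_at_definition(predicates, provable_universally):
    """Definition-time pruning: drop predicates already proved for every
    instantiation and every schema (they never need to be proved again)."""
    return [p for p in predicates if not provable_universally(p)]

def prune_at_instantiation(predicates, parameters, decide_on_parameters):
    """Instantiation-time pruning: discharge predicates that can be settled by
    simply comparing parameter values; return the residue for the designer."""
    residue = []
    for p in predicates:
        verdict = decide_on_parameters(p, parameters)   # True, False or None (unknown)
        if verdict is None:
            residue.append(p)                           # left to be proved by the designer
        elif verdict is False:
            raise ValueError(f"predicate {p!r} is violated by these parameters")
    return residue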
5 Exploiting SRL in Other Design Frameworks
The described approach can also be used to achieve a more reliable design in other frameworks. In all the design frameworks that can be interpreted as a special case of the one described here, the applicability conditions of any transformation can be generated by exploiting the given framework. To generate these conditions, it is sufficient, first, to define the refinement transformation as the difference between the initial and the final schemas and, then, to express this difference as a composition of SRL primitives. At this point, the applicability conditions follow automatically.
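A rough picture of this recipe is a schema diff that is read back as a list of SRL primitive invocations. The sketch below is a simplification under stated assumptions: schemas are reduced to class-to-attribute maps, is-a relationships, constraints and operations are ignored, and the equality conditions that justify each primitive are left as placeholders, since they must come from the designer or from the host framework.

def diff_as_srl(initial, final):
    """Express the difference between two schemas (class name -> set of attributes)
    as SRL primitive invocations; the conditions are placeholders, not derivations."""
    steps = []
    for c in final.keys() - initial.keys():
        steps.append(("add.class", c, "<condition deriving c from the initial schema>"))
    for c in initial.keys() - final.keys():
        steps.append(("rem.class", c, "<condition deriving c from the remaining schema>"))
    for c in final.keys() & initial.keys():
        for a in final[c] - initial[c]:
            steps.append(("add.attr", a, c, "<condition>"))
        for a in initial[c] - final[c]:
            steps.append(("rem.attr", a, c, "<condition>"))
    return steps

# A two-subclass elimination in the style of the example that follows, is-a links ignored:
before = {"C": set(), "C1": set(), "C2": set()}
after = {"C": {"a"}}
print(diff_as_srl(before, after))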
This approach to the derivation of the applicability conditions of a refinement transformation can be useful in all those design contexts in which the preconditions for a correct design are not given. These include contexts in which there are no established transformations and the design is done by writing down the logical schema directly. In this case, the transformation is implicit and, of course, there are no preconditions guaranteeing its correctness. The approach may also be useful when applicability conditions exist but are only given informally. In these cases, assistance tools for the verification of these preconditions cannot be built; using the suggested approach, we can take advantage of those provided for SRL. The approach illustrated above has an inherent limitation: it can be applied only if its premises agree with those of SRL. In particular, the employed model must be a submodel of DBS and the schema refinement relation must conform to the one given in Sect. 2. To illustrate how the proposed approach can be useful in other frameworks, we present two examples of derivation of applicability conditions. The examples consider two refinement transformations taken from two different well-known languages.

Example 1. The first example considers one of the schema transformations proposed in [3]. The transformations within this set do not change the information content of the schema, and the assumed data model conforms to DBS. The transformation chosen is the elimination of dangling subentities in generalisation hierarchies; for brevity, below it will be named elimination. The transformation elimination removes n non-overlapping subclasses and reduces them to a superclass. The elements of the superclass are partitioned into n groups by the value of an added attribute. Fig. 6 shows this transformation.
Fig. 6. elimination: (a) the class C with subclasses C1, . . ., Cn; (b) the class C with the added attribute a and the subclasses removed.
The schema in Fig. 6(b) differs from the schema in Fig. 6(a) in that it has a new attribute a, while the classes C1, . . ., Cn and their is-a relationships are missing. This difference can easily be expressed as a composition of SRL primitives:

elimination(a, C, (v1, . . ., vn), (C1, . . ., Cn)) =
  add.attr(a, C, a = {(x,y) | x ∈ (C1 ∪ · · · ∪ Cn) ∧ (x ∈ C1 → y = v1) ∧ · · · ∧ (x ∈ Cn → y = vn)}) ◦
  rem.class(C1, C1 = dom(a ▷ {v1})) ◦ · · · ◦ rem.class(Cn, Cn = dom(a ▷ {vn})) ◦
  rem.isa(C1, C) ◦ · · · ◦ rem.isa(Cn, C)

The transformation elimination can, for example, be applied to the schema S in Fig. 7(a) to obtain the schema of Fig. 7(b):

elimination(type, Material, (marble, stone), (Marble, Stone))(S)
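The same composition can be replayed on a toy structural representation to see which components appear and disappear. The sketch below is illustrative only: classes carry just attribute names, is-a links are an explicit set of pairs, the equality conditions are returned as strings (with ▷ for the range restriction), and none of the checking performed by ACGA is modelled.

def eliminate(schema, attr, superclass, values, subclasses):
    """elimination(a, C, (v1..vn), (C1..Cn)) on a toy schema given as
    (class name -> set of attributes, set of is-a pairs)."""
    classes, isa = schema
    classes = {c: set(a) for c, a in classes.items()}
    isa = set(isa)
    classes[superclass].add(attr)                              # add.attr(a, C, ...)
    conditions = [f"{attr} tags each element of " + " ∪ ".join(subclasses)]
    for sub, val in zip(subclasses, values):
        conditions.append(f"{sub} = dom({attr} ▷ {{{val}}})")  # rem.class(Ci, ...)
        classes.pop(sub, None)
        isa.discard((sub, superclass))                         # rem.isa(Ci, C)
    return (classes, isa), conditions

fig7a = ({"Material": set(), "Marble": set(), "Stone": set()},
         {("Marble", "Material"), ("Stone", "Material")})
fig7b, conds = eliminate(fig7a, "type", "Material", ("marble", "stone"), ("Marble", "Stone"))
assert fig7b == ({"Material": {"type"}}, set())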
⁴ NewConstr stands for the conjunction of the constraints of the initial schema and a subset of those added; this subset consists of all the constraints added by the component transformations other than the transformation that generated the condition.
Fig. 7. elimination: (a) Material with subclasses Marble and Stone; (b) Material with the added attribute type.
The applicability conditions associated with the above instantiation of elimination are the following⁴:
• NewConstr ⇒ (dom({(m,t) | m ∈ (Marble ∪ Stone) ∧ (m ∈ Marble → t = marble) ∧ (m ∈ Stone → t = stone)}) ⊆ Material)
• NewConstr ⇒ Marble = dom(type ▷ {marble})
• NewConstr ⇒ Stone = dom(type ▷ {stone})

Example 2. The second example is taken from [10], where a set of ER schema transformations to support the designer during schema development is proposed. One of these transformations is the disaggregation of a compound attribute; for brevity, it will be called disaggregation. This transformation is semantics-preserving, i.e. it does not change the information content of the schema. The transformation disaggregation replaces the compound attribute with its component fields, as shown in Fig. 8.

Fig. 8 (fragment). A class C with a compound attribute a⟨a1, . . ., an⟩.
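Although the figure and the SRL decomposition of disaggregation are not reproduced here, its intended effect can be sketched under an explicit assumption: that it corresponds to one add.attr per component field followed by the rem.attr of the compound attribute, with the usual derivability conditions. The class and attribute names below are purely hypothetical and only attribute placement is tracked.

def disaggregate(schema, cls, compound, components):
    """Assumed reading of 'disaggregation': replace the compound attribute
    a<a1, ..., an> of class `cls` by its component fields (structure only)."""
    out = {c: set(attrs) for c, attrs in schema.items()}
    out[cls] |= set(components)      # add.attr(ai, cls, ai derived from the i-th field of a)
    out[cls].discard(compound)       # rem.attr(a, cls, a rebuilt from a1, ..., an)
    return out

# Hypothetical example: a compound 'address' attribute split into its fields.
print(disaggregate({"Person": {"address"}}, "Person", "address", ("street", "city", "zip")))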