This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1507
3 Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Tok Wang Ling Sudha Ram Mong Li Lee (Eds.)
Conceptual Modeling - ER ’98 17th International Conference on Conceptual Modeling Singapore, November 16-19, 1998 Proceedings
13
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors Tok Wang Ling Mong Li Lee National University of Singapore School of Computing, Department of Computer Science 55 Science Drive 2, Singapore 117599 E-mail: {lingtw,leeml}@comp.nus.edu.sg Sudha Ram University of Arizona, Department of Management Information Systems 430J McClelland Hall, College of BPA Tuscon, AZ 85721, USA E-mail: [email protected]
Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Conceptual modeling : proceedings / ER ’98, 17th International Conference on Conceptual Modeling, Singapore, November 16 - 19, 1998. Tok Wang Ling ; Sudha Li Lee (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1998 (Lecture notes in computer science ; Vol. 1507) ISBN 3-540-65189-6
CR Subject Classification (1991): H.2, H.4, F.1.3, F.4.1, I.2.4, H.1, J.1 ISSN 0302-9743 ISBN 3-540-65189-6 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. c Springer-Verlag Berlin Heidelberg 1998 Printed in Germany Typesetting: Camera-ready by author SPIN 10639013 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Foreword
I would like to welcome you to Singapore and the 17th International Conference on Conceptual Modeling (ER’98). This conference provides an international forum for technical discussion on conceptual modeling of information systems among researchers, developers and users. This is the first time that this conference is held in Asia, and Singapore is a very exciting place to host ER’98. We hope that you will find the technical program and workshops useful and stimulating. The technical program of the conference was selected by the distinguished program committee consisting of two co-chairs and 83 members. Credit for the excellent final program is due to Tok Wang Ling and Sudha Ram. Special thanks to Frederick H. Lochovsky for selecting interesting panels, and Alain Pirotte for preparation of attractive tutorials. I would also like to thank Yong Meng Teo (Publicity Chair), and the region co-ordinators, Alberto Laender, Erich Neuhold, Shiwei Tang, and Masaaki Tsubaki, for taking care of publicity. The following three workshops are also organized to discuss specific topics of data modeling and databases: “International Workshop on Data Warehousing and Data Mining” organized by Sham Navathe (Workshop chair) and Mukesh Mohania (Program Committee Chair), “International Workshop on New Database Technologies for Collaborative Work Support and Spatio-Temporal Data Management” organized by Yoshifumi Masunaga, and “International Workshop on Mobile Data Access” organized by Dik L. Lee. Ee Peng Lim took care of all detailed work related to the workshops. I would like to thank all these people who organized the workshops as well as the members of program committees. The workshop proceedings will be published jointly after the workshop. I would also like to express my appreciation to other organizing committee members, Chuan Heng Ang (Publication), Hock Chuan Chan (Registration), Mong Li Lee and Danny Poo (Local Arrangements), Cheng Hian Goh (Treasurer), and Kian Lee Tan (Industrial Chair). Special thanks to Tok Wang Ling who worked as the central member of the organizing committee, and who made my job very easy. Last, but not least, I would like to express thanks to the members of the Steering Committee, especially to Stefano Spaccapietra (Chair), Bernhard Thalheim (Vice Chair), and Peter Chen (Chair, Emeritus) who invented the widely used ER model and started this influential conference. Finally I would like to thank all the sponsors and attendees of the conference, and hope that you will enjoy the conference, the workshops, and Singapore to the utmost extent.
November 1998
Yahiko Kambayashi Conference Chair
Program Chairs’ Message
The 17th International Conference on Conceptual Modeling (ER’98) is aimed at providing an international forum for technical discussion among researchers, developers, practitioners, and users whose major emphasis is on conceptual modeling. This conference was originally devoted to the Entity-Relationship (ER) model, but has long since expanded to include all types of semantic data modeling, behavior and process modeling, and object-oriented systems modeling. This year’s conference embraces all phases of software development including analysis, specification, design, implementation, evolution, and reengineering. Our emphasis this year has been to bring together industry and academia to provide a unique blend of original research and contributions related to practical system design using conceptual modeling. We have an exciting agenda focusing on emerging topics ranging from conceptual modeling for Web based information systems to data warehousing and industrial case studies on the use of conceptual models. The conference attracted 95 papers from authors in 31 different countries. Both industry and academic contributions were solicited. Similarly high standards were applied to evaluating both types of submissions. Of the submissions, 32 were accepted for presentation at the conference based on extensive reviews from the Program Committee and external reviewers. The program consists of 26 research papers and 6 industrial papers representing 17 different countries from around the globe. The entire submission and reviewing process was handled electronically, which proved to be a challenge and a blessing at the same time. A conference of this magnitude is the work of many people. The program committee with the help of external reviewers worked under a tight schedule to provide careful, written evaluations of each paper. Mong Li Lee, Chuan Heng Ang, Choon Leong Chua, and Sew Kiok Toh helped to coordinate the review of our electronic submission and review system, tabulated the scores and distributed reviews to authors. Since the program co-chairs are from two different continents, great coordination was required and achieved through the use of the Internet. Jinsoo Park from the University of Arizona and Mong Li Lee from the National University of Singapore did an outstanding job of assisting the Program Co-Chairs. On behalf of the entire ER’98 committee, we would like to express our appreciation to all the people who helped with the conference. Finally, our thanks to all of you for attending the conference here in Singapore. We wish you a week of fun in the enchanting garden city of Singapore! November 1998
Tok Wang Ling and Sudha Ram Program Co-Chairs
Conference Organization
Conference Chair: Yahiko Kambayashi (Kyoto University, Japan) Program Co-Chairs: Tok Wang Ling (National University of Singapore, Singapore) Sudha Ram (University of Arizona, USA) Panel Chair: Frederick H. Lochovsky (HK University of Science & Technology, Hong Kong) Tutorial Chair: Alain Pirotte (University of Louvain, Belgium) Publication Chair: Chuan Heng Ang (National University of Singapore, Singapore) Registration Chair: Hock Chuan Chan (National University of Singapore, Singapore) Finance Chair: Cheng Hian Goh (National University of Singapore, Singapore) Local Arrangements Co-Chairs: Mong Li Lee (National University of Singapore, Singapore) Danny Poo (National University of Singapore, Singapore) Workshop Chair: Ee Peng Lim (Nanyang Technological University, Singapore) Industrial Chair: Kian Lee Tan (National University of Singapore, Singapore) Publicity Chair: Yong Meng Teo (National University of Singapore, Singapore) Steering Committee Representatives: Stefano Spaccapietra (Swiss Federal Institute of Technology, Switzerland) Bernhard Thalheim (Cottbus Technical University, Germany) Peter Chen (Louisiana State University, USA)
VIII
Conference Organization
Region Co-ordinators: Alberto Laender (Federal University of Minas Gerais, Brazil) Erich Neuhold (German National Research Center for Information Technology, Germany) Masaaki Tsubaki (Data Research Institute, Japan) Shiwei Tang (Peking University, China)
Tutorials Multimedia Information Retrieval, Categorisation and Filtering by Carlo Meghini and Fabrizio Sebastini (CNR Pisa, Italy) Co-design of Structures, Processes and Interfaces for Large-Scale Reactive Information Systems by Bettina Schewe, Klaus-Dieter Schewe and Bernhard Thalheim (Germany) Advanced OO Modeling: Metamodels and Notations for the Next Millenium by Brian Henderson-Sellers, Rob Allen, Danni Fowler, Don Firesmith, Dilip Patel, and Richard Due Modeling Information Security - Scope, State-of-the-Art, and Evaluation of Techniques by Gunther Pernul and Essen (Germany) Spatio-Temporal Information Systems: a Conceptual Perspective by Christine Parent, Stefano Spaccapietra, and Esteban Zimanyi (EPFL Lausanne, Switzerland)
Workshops Data Warehousing and Data Mining Chair: Sham Navathe (Georgia Institute of Technology, USA) Program Chair: Mukesh Mohania (University of South Australia, Australia) Mobile Data Access Chair: Dik L. Lee (HK University of Science and Technology, Hong Kong) New Database Technologies for Collaborative Work Support and SpatioTemporal Data management Chair: Yoshifumi Masunaga (University of Library and Info. Science, Japan)
Conference Organization
Program Committee Peter Apers, The Netherlands Akhilesh Bajaj, USA Philip Bernstein, USA Elisa Bertino, Italy Glenn Browne, USA Stefano Ceri, Italy Hock Chuan Chan, Singapore Chin-Chen Chang, Taiwan Arbee L. P. Chen, Taiwan Roger Hsiang-Li Chiang, Singapore Joobin Choobineh, USA Phillip Ein-Dor, Israel Ramez Elmasri, USA David W. Embley, USA Tetsuya Furukawa, Japan Georges Gardarin, France Cheng Hian Goh, Singapore Wil Gorr, USA Terry Halpin, USA Igor Hawryszkiewycz, Australia Alan Hevner, USA Uwe Hohenstein, Germany Sushil Jajodia, USA Ning Jing, China Leonid Kalinichenko, Russia Hannu Kangassalo, Finland Jessie Kennedy, UK Hiroyuki Kitagawa, Japan Ramayya Krishnan, USA Gary Koehler, USA Prabhudev Konana, USA Uday Kulkarni, USA Akhil Kumar, USA Takeo Kunishima, Japan Alberto Laender, Brazil Laks V. S. Lakshmanan, Canada Per-Ake Larson, USA Dik-Lun Lee, China Mong Li Lee, Singapore Suh-Yin Lee, Taiwan Qing Li, China Stephen W. Liddle, USA
Ling Liu, USA Pericles Loucopoulos, UK Leszek A. Maciaszek, Australia Stuart E. Madnick, USA Kia Makki, USA Salvatore March, USA Heinrich C Mayr, Austria Vojislav Misic, China David E. Monarchi, USA Shamkant Navathe, USA Erich Neuhold, Germany Peter Ng, USA Dan O’Leary, USA Maria E Orlowska, Australia Aris Ouksel, USA Mike Papazoglou, The Netherlands Jeff Parsons, Canada Joan Peckham, USA Niki Pissinou, USA Calton Pu, USA Sandeep Purao, USA Sury Ravindran, USA Arnon Rosenthal, USA N L Sarda, India Sumit Sarkar, USA Arun Sen, USA Peretz Shoval, Israel Keng Leng Siau, USA Il-Yeol Song, USA Stefano Spaccapietra, Switzerland Veda Storey, USA Toby Teorey, USA Bernhard Thalheim, Germany A Min Tjoa, AUSTRIA Alex Tuzhilin, USA Ramesh Venkataraman, USA Yair Wand, Canada Kyu-Young Whang, Korea Carson Woo, Canada Jian Yang, Australia Masatoshi Yoshikawa, Japan
IX
X
Conference Organization
External Referees Iqbal Ahmed Hiroshi Asakura H. Balsters Linda Bird Jan W. Buzydlowski Sheng Chen Yam San Chee Sheng Chen Wan-Sup Cho Eng Huang Cecil Chua Peter Fankhauser Thomas Feyer George Giannopoulos Spot Hua Gerald Huck Hasan M. Jamil Panagiotis Kardasis Justus Klingemann Suk-Kyoon Lee Wegin Lee
Ki-Joune Li Jun Li Hui Li Weifa Liang Ee Peng Lim P. Louridas Sam Makki Elisabeth M´etais Wilfred Ng E. K. Park Ivan Radev Rodolfo Resende Klaus-Dieter Schewe Takeyuki Shimura Kian-Lee Tan Thomas Tesch Chiou-Yann Tsai Christelle Vangenot R Wilson
Conference Organization
Organized By School of Computing, National University of Singapore Sponsored By ACM The ER Institute
In Cooperation with School of Applied Science, Nanyang Technological University Singapore Computer Society Information Processing Society of Japan
Corporate Sponsors Beacon Information Technology Inc., Japan CSA Automated Pte Ltd Digital Equipment Asia Pacific Pte Ltd Fujitsu Computers (Singapore) Pte Ltd IBM Data Management Competency Center (Singapore) Lee Foundation NSTB(National Science and Technology Board) Oracle Systems S.E.A. (S) Pte Ltd Sybase Taknet Systems Pte Ltd
XI
Table of Contents
Keynote 1: The Rise, Fall and Return of Software Industry in Japan . . . . . . . . . . . . . . . . . . . . 1 Yoshioki Ishii (Beacon Information Technology Inc., Japan)
Session 1: Conceptual Modeling and Design Conceptual Design and Development of Information Services . . . . . . . . . . . . . . . . 7 Thomas Feyer, Klaus-Dieter Schewe, Bernhard Thalheim, Germany An EER-Based Conceptual Model and Query Language for Time-Series Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Jae Young Lee, Ramez A. Elmasri, USA Chrono: A Conceptual Design Framework for Temporal Entities . . . . . . . . . . . .35 Sonia Bergamaschi, Claudio Sartori, Italy
Session 2: User Interface Modeling Designing Well-Structured Websites: Lessons to Be Learned from Database Schema Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Olga De Troyer, The Netherlands Formalizing the Informational Content of Database User Interfaces . . . . . . . . . 65 Simon R. Rollinson, Stuart A. Roberts, UK
Session 3: Information Retrieval on the Web A Conceptual-Modeling Approach to Extracting Data from the Web . . . . . . . 78 D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, Y.-K. Ng, D.W. Quass, R.D. Smith, USA Information Coupling in Web Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Sourav S. Bhowmick, Wee-Keong Ng, Ee-Peng Lim, Singapore Structure-Based Queries over the World Wide Web . . . . . . . . . . . . . . . . . . . . . . . 107 Tao Guan, Miao Liu, Lawrence V. Saxton, Canada
Panel 1: Realizing Next Generation Internet Applications: Are There Genuine Research Problems, or Is It Advanced Product Development? . . . . . . . . . . . . . . . . . . . . . . .164 Chairpersons: Kamalakar Karlapalem and Qing Li, Hong Kong
Keynote 2: Web Sites Need Models and Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Paolo Atzeni, Universit` a di Roma Tre, Italy
Session 5: Conceptual Modeling Tools ARTEMIS: A Process Modeling and Analysis Tool Environment . . . . . . . . . . 168 S. Castano, V. De Antonellis, M. Melchiori, Italy From Object Oriented Conceptual Modeling to Automated Programming in Java* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Oscar Pastor, Vicente Pelechano, Emilio Insfr´ an, Jaime G´ omez, Spain An Evaluation of Two Approaches to Exploiting Real-World Knowledge by Intelligent Database Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Shahrul Azman Noah, Michael Lloyd-Williams, UK
Session 6: Quality and Reliability Metrics Metrics for Evaluating the Quality of Entity Relationship Models . . . . . . . . . 211 Daniel L. Moody, Australia
Panel 2: Do We Need Information Modeling for the Information Highway? . . . . . . . . . 348 Panel chair: Bernhard Thalheim, Germany
XVI
Table of Contents
Session 8: Data Warehousing Design and Analysis of Quality Information for Data Warehouses* . . . . . . . . 349 Manfred A. Jeusfeld, The Netherlands, Christoph Quix, Matthias Jarke, Germany Data Warehouse Schema and Instance Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Dimitri Theodoratos, Timos Sellis, Greece Reducing Algorithms for Materialized View Updates . . . . . . . . . . . . . . . . . . . . . . 377 Tetsuya Furukawa, Fei Sha, Japan
Industrial Session 2: Industrial Case Studies Reengineering Conventional Data and Process Models with Business Object Models: A Case Study Based on SAP R/3 and UML . . . . . . . . . . . . . . . . . . . . . . 393 Eckhart v. Hahn, Barbara Paech, Germany, Conrad Bock, USA An Active Conceptual Model for Fixed Income Securities Analysis for Multiple Financial Institutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407 Allen Moulton, St´ephane Bressan, Stuart E. Madnick, Michael D. Siegel, USA An Entomological Collections Database Model for INPA . . . . . . . . . . . . . . . . . . 421 J. Sonderegger, P. Petry, J.L. Campos dos Santos, N.F. Alves, Brazil
Session 9: Object-Oriented Approaches A Global Object Model for Accommodating Instance Heterogeneities . . . . . 435 Ee-Peng Lim, Roger H.L. Chiang, Singapore On Formalizing the UML Object Constraint Language OCL . . . . . . . . . . . . . . 449 Mark Richters, Martin Gogolla, Germany Derived Horizontal Class Partitioning in OODBs: Design Strategies, Analytical Model and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465 Ladjel Bellatreche, Kamalakar Karlapalem, Qing Li, Hong Kong
The Rise, Fall, and Return of Software Industry in Japan Yoshioki Ishii Beacon Information Technology Inc. Shinjuku L Tower, 7F. 1-6-1, Nishi-shinjuku Shinjuku-ku, Tokyo 163-1507, Japan
Abstract. The Software Industry in Japan grew extraordinarily only in the field of custom software, and fell after the collapse of the “bubble economy” in 1991. In Japan, the field of packaged software is still at an early stage of development. Why did this happen? On the other hand, Japan surpassed the U.S.A in the game software field, and became No. 1 in the world. Why is this? Can Japanese packaged software survive in the future? Or, will Western packaged software made by Microsoft, SAP etc. conquer the Japanese market? I will state my opinion based on my own experience in Software Industry during the past 30 years.
1 Introduction I have been working in the development of DBMS since the late 1960s. In the process, I have provided consulting expertise on Database products for a wide range of customers beginning in 1968. In 1973, I introduced ADABAS to the Japanese market and have been supplying this product to the IT market ever since. There are presently 800 corporate customers of ADABAS in Japan alone. I am fortunate to say that many of my peers recognize me as a pioneer in Database related developments in Japan. In parallel to my activities on the industry side, I have also been an active participant in the academic circle of Information Processing. I presented many research papers since 1973 at Database Research Group in the Information Processing Society of Japan. When ACM SIGMOD Japan was first established in 1993, I honorably accepted the post of chairperson and worked as the first chairperson of the organization from 1993 to 1995. About 10 years ago, I also started to focus on the Multi-Dimensional Model. I am presently also providing sales and development oriented consulting on Multidimensional DBMS (Essbase). Based on all of these experiences, I published my first book titled “Data Warehouse” in 1996 in Japan. Those ideas were also presented at the VLDB ’96.
T.W. Ling, S. Ram, and M.L. Lee (Eds.): ER’98, LNCS 1507, pp. 1−6, 1998. Springer-Verlag Berlin Heidelberg 1998
2
Y. Ishii
From my perspective as a technical person in the Database arena, and an executive who has managed a successful software company for the last 30 years, I will briefly speak on the Software Industry in Japan. I will relate my observations through a road that takes us from the Rise, through the Fall and to the Return of the Software Industry in Japan. I will elaborate a while on the cause of the Fall. The roots of the Software Industry in Japan trace back to 1964. As was the case in the US, computers were installed at computer centers and leased for usage by the hour. Software companies started appearing in 1968. After the first ten years, annual revenues exceeded 400 Billion Japanese Yen in 1978, and it was officially recognized as an industry. Please refer to the following chart (Fig. 1.) that shows “The Rise, Fall and Return of the Software Industry in Japan”. Billion Yen 7000
6000
5000
4000
3000
2000
1000
0 64
70
80
90
98
Fig. 1. Software Industry Growth in Japan
2 Period of Rise During this period, companies focused on computerization of back office activities relying mainly on mainframes. Custom software was developed using Cobol, Fortran and PL/I, for various private corporations, national, prefecture and local governments. Most of the development work was outsourced to software companies. Due to this reason, growth in the custom software field was abnormally high and that of packaged software was relatively low in Japan as compared to the rest of the world. Please refer to the following chart (Fig. 2), which shows a comparison of the share of custom
The Rise, Fall, and Return of Software Industry in Japan
3
software and packaged software for Japan, Europe and the US in 1988, which was also the end of this period of rise.
($Billion) User Expenditures
Source: Input
30
20
10
Custom Packaged
0
U.S.
Europe
Japan Market Overview
Fig. 2. Custom Software Development vs. Software Products, 1988
The extraordinary growth of the software industry in Japan, actually backfired and became a serious cause of its subsequent fall. The main reason was the collapse of Japanese “bubble economy” in 1991.
3 Period of Fall The computer hardware industry of Japan grew mainly on the strength in mainframe technology in the 1980s, and there even arose a possibility of surpassing the successes of the industry in the US. In order to wrestle the initiative in the 1990’s and beyond, the Japanese Government started an ambitious project called the Fifth Generation Computer Project in 1983. This project was to range over 10 years and was concentrated mainly on AI. Resources for this project were pooled not only from scientists in University Laboratories but were also recruited from the technical staff of Japanese six major companies, such as Fujitsu and Hitachi, but not IBM Japan. The scale of this project was truly massive and a lot of time, money and resources were allocated. The project classified the existing computers as 3rd generation computers and aimed to completely skip the next, 4th generation of computer technology by focusing on AI
4
Y. Ishii
technology to achieve the advanced functionality of 21st century computing. This was termed 5th generation technology and future computers were termed 5th generation computers. Around 1990, both the Japanese Government and mainframers had illusions of the coming of the 5th generation computing era and that it would arrive soon. Japan was at the peak of enjoying prosperity that accompanied the bubble economy. On the other hand, the US went through a period of recession in the latter half of 1980’s. Riding of the “Downsizing” wave, growth was seen in the sales of UNIX machines and personal computers. These technologies were an extension of 3rd generation technology. In other words, 4th generation technology made steady progress in the US. Meanwhile Japan was consistently aiming much efforts at 5th generation computing, which never materialized. The 4th generation finally did arrive in Japan. But by then, due to this strategic planning failure, computer related technologies in Japan were going in the wrong directions, and the strength of computing in Japan went down considerably compared to the US. As I had earlier mentioned the software industry in Japan concentrated disproportionately in the area of custom software development. The collapse of the Japanese “bubble economy” in 1991 had a drastic effect, and abruptly private corporations altogether stopped custom software development projects for mainframes. As a result, growth in the Japanese software industry was greatly reduced. (Fig. 1.) During the earlier years, corporations in Japan had a strong tendency for developing application software exclusively and for internal usage only. This, as one could imagine, was prohibitively expensive. With the collapse of the “bubble economy”, these private development efforts virtually stopped. To reduce costs and advance in usage of information technology, these corporations turned their attention towards UNIX on business applications for the first time. The adoption of UNIX technology in Japan therefore lagged the US by at least 3 years. Since the software industry in Japan was mainframe centered and there were few companies with the technology and experience in the UNIX arena, the customers’ demands for downsizing could not be met and the growth of the industry suffered heavily. Fujitsu, Hitachi and NEC, which had also grown mainly on the strength of the mainframe, were stagnant for several years in a similar manner to IBM. Except for a few exceptions, Japan did not have local access to 4th generation technology and UNIX in particular. Japan was misdirecting its efforts for several years. Between 1992 and 1996, what Japan could only do was to concentrate on learning US technology, which was itself quite arduous. During this period, a number of technical experts moved from software industry to other fields. Custom software is usually made as “the only one of its kind” in the world and runs on a specific site. Therefore, it is almost impossible to evaluate the quality or excellence of the application as being good or otherwise. Under this environment, even those technicians who did not produce quality software would be incorrectly perceived as technical experts. On the other hand, actual end users usually evaluate packaged software and only high quality software survives and that with inferior quality often disappears. As a result, the abilities of technicians in the package software field improved dramatically. 
But since an overwhelming majority of technicians in Japan grew up in the custom software field, I think that many of these technicians have not been able to excell..
The Rise, Fall, and Return of Software Industry in Japan
5
4 Return The Software industry in Japan and the six Japanese computer manufacturers entered into difficult times since 1991. After that, however, there occurred a big change and the Software industry in Japan has returned almost completely. Please refer to the following Fig. 3 “Worldwide 1997 Software Revenue” and Fig. 4 “Japan 1997 Software Revenue”. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Company IBM Microsoft Computer Associates Oracle Hitachi* SAP Fujitsu* DEC SUN Siemens Nixdorf Parametric Tech Intel Novell Adobe Sybase
Company Microsoft Japan Oracle Japan SAP Japan Lotus Japan Just System* Ashisuto* Beacon IT* Novell Japan Informix Japan Sybase Japan CA Japan BSP* Baan Japan
* Japanese company Fig. 4. Japan 1997 Software Revenue
6
Y. Ishii
Computer manufacturers are included in Fig. 3, but not in Fig. 4. NEC was not ranked in the Fig. 3. It seems that NEC did not report their revenue to the research group. Hitachi, Fujitsu and NEC came to be ranked highly in the worldwide ranking. Also, in the genuine Japanese software product field, various Japan-made software products were developed. These products were not only used in the Japanese market, but some of them are also being exported. Japan is positioned to become the third axis, next to the US and Europe, in the software product field as we move into the future.
Conceptual Design and Development of Information Services 1
Thomas Feyer1 , Klaus-Dieter Schewe2 , and Bernhard Thalheim1 Computer Science Institute, Brandenburg Technical University at Cottbus, P.O. Box 101344, 03013 Cottbus, FRG 2 Computer Science Institute, Clausthal Technical University, Erzstr. 1, 38678 Clausthal-Zellerfeld, FRG Abstract. Due to the development of the internet and cable nets information services are going to be widely used. On the basis of projects for the development of information services like regional information services or shopping services we develop a method for the creation of information systems and services to be used through different nets. The approach is based on the codesign of data, dialogues and presentations. The main concepts are information units and information containers. These basic concepts for information services are presented in this paper. Information units are defined by generalized views with enabled functions for retrieving, summarizing and restructuring of information. Information containers are used for transfering information to users according to their current needs and the corresponding dialogue step. Size and presentation of information containers depend on the restrictions of the users environment.
1
Background
The ‘Internet’ is currently one of the main buzzwords in journals and newspapers. However, to become a vital and fruitful source for information presentation, extraction, acquisition and maintenance fundamental design concepts are required but still missing. To bridge this gap we investigate information units and information containers and their integration with database-backed infrastructures. These are based on practical experience from projects for the development of a regional information and shopping service and meant to be the main concepts for the development of integrated information services. 1.1
Information Services
Currently we can observe a common and still increasing interest in information services to be used with the internet. Unfortunately, there is no systematic or even commonly accepted approach in building these services. On the other hand it is the goal of conceptual modelling to provide adequate methods for this task. Then at least the following problems have to be met: – – – –
Conceptual understanding of information services; Integration of different information systems; Maintenance of information quality; Evaluation of information quality;
T.W. Ling, S. Ram, and M.L. Lee (Eds.): ER’98, LNCS 1507, pp. 7–20, 1998. c Springer-Verlag Berlin Heidelberg 1998
8
T. Feyer, K.-D. Schewe, and B. Thalheim
– User-adequate presentation of information collections; – Ressource-adequate transmission through networks. Due to the large variety and diversity of information services it is advisable to concentrate either on specific aspects or characteristics. Our experience on information service development is based on two large industrial cooperations: In the project FuEline [7,18] an online database service has been developed for collecting, managing, trading and intelligently retrieving data on industrial and university research facilities. The system realizes a client-server architecture using the retrieval-oriented database machine ODARS. The system is currently used as the base system for technology transfer institutes in Brandenburg. The project Cottbus net (since 1995) aims at the development of intelligent information services that are simple to capture and simple to use. These should be available for around 90.000 households in the Lausitian region through the internet and the cable nets using either computers or TV set-top boxes. Up to now the project has developed a number of information services for travel information, shopping, regional information, industry, administration and booking services. Several architectures and suggestions from the literature have been tested including multidimensional database architectures [12,26]. Unfortunately, it has shown to be too weak and to provide unacceptable performance. The proposals very recently made in [2,8] are similar trials but different in scope, application area, and devoted to different users. The ideas presented below have been used in the development of different information services for Cottbus net. Currently, two other architectures (multi-tier architectures with fat or thin clients [18,20, 21]) are tested in parallel. Both projects have shown to us that the ad-hoc development of information services (such as web.sql) as the state of the art for most internet-based services is not acceptable due to maintenance and development costs. In both projects the information service is characterizable by the access to large amounts of almost structured data accessible through databases. The conceptual understanding of information services is based on conceptual modeling of the underlying databases, modelling of functionality and user intentions. The latter ones can be modelled by dialogues. In particular, the integration of different information systems is enabled. Careful modeling can increase information quality. Therefore, we concentrate on two main concepts for user-adequate presentation and delivery of information: information unit and information container . Information containers are transmitted through the network according to the necessary amount of information. They transfer only those data which are necessary for the current dialogue step. Technically, this optimization of data transmission is achieved by careful integration of data modeling with supplied functions and dialogues. The approach has been used to develop a platform which is now in use for Cottbus net. 1.2 Database-Backed Information Services Besides the various approaches to grasp the meaning of ‘information’ [24] and the large number of books on ‘information systems’ it is generally accepted that information needs a carrier in the form of data. For our purposes we may assume
Conceptual Design and Development of Information Services
9
that these data are structured, formatted, filtered and summarized, meet the needs and current interests of its receiver and is going to be selected, arranged and processed by him/her on the basis of his/her interests, experience, intuition, knowledge etc. Within this context we can assume that information services are systems that are based on database machines and use a certain communication infrastructure. Loosely spoken, information can be extracted from filtered and summarized data collections, where filtration is similar to view generation and summarization of selected data can be performed on the basis of the computational functionality of the given machine. Finally, information presentation respects environmental conditions and user needs. A large number of information service applications is out of the scope of our particular research. For example, travel guidance systems are based on information which is relatively stable. They are usually made commercially available on CD-ROMs. Database systems are used whenever data has a good update rate and the information service requires actuality. The technical embedding of database systems into information services can be based on middleware solutions. In general, database-backed information services can be integrated into DBMSs, although database vendors do not yet offer such a fully integrated solution. A large number of tools for retrieval and manipulation of databases through the internet has been developed. These tools use specific protocols and are mainly designed for specific DBMSs. For this reason each information service has to be based on several interfaces to databases, whilst the information service itself uses specific databases. Thus, these databases can be adapted to the needs of information services. In this case, the design and development of information services subsumes some of the ordinary design and development tasks. Additional requirements are implied by the variety of used displays. 1.3
Codesign of Information Service Applications
As outlined so far, many information service applications are based on information systems. This renders conceptual modelling, especially database design, a fundamental task in information service development. This task subsumes the design of database structure with corresponding static integrity constraints, database processes with corresponding dynamic integrity constraints and user interfaces. Conceptually, there are two dimensions: static/dynamic and global/local. The global static component is usually modelled by database schemata and the global dynamic component by processes implemented as transactional application programs. The local static component is often modelled by information units and the local dynamic component by the user interface which depends on the information units and the processes. Although views filter and summarize data from the database, the local static component for information services is more complex. Information units are computed by computational rules, condensed by abstraction and rebuilding rules and finally scaled by customizing and building a facetted representation. In Sect. 2 we shall discuss this process. Similarly, the local dynamic component is much more complex than the user
10
T. Feyer, K.-D. Schewe, and B. Thalheim
interface. It captures all aspects of user-driven processing on different application layers. Therefore, we prefer to talk of a dialogue component. Each dialogue consists of elementary dialogue steps corresponding to actions selected by the user [19]. Their order depends on the application story and its underlying business processes. Thus, dialogues generalize ‘use cases’. In general we can model dialogues for groups of actors or roles as stated in [14,24,27]. Since we do not intend to discuss codesign in detail, we refer the interested reader to [4]. 6 local
information containers -
information units
6 filtration summarization scaling
dialogues
6 enabled manipulation requests
supplied processes
global
database schema
enabled processes
static
- processes dynamic
-
Fig. 1. Information Services Codesign: Data and Process Flow Perspective
Information units can be the input for dialogues using either the formation at run-time according the actual environment and the user request or predefined data collections. The first approach is more general but seldom computationally tractable. The second approach is simpler and can be based on results of conceptual design. Information containers are obtained by the application of formation and wrapping rules to collections of information units. In Sect. 3 the complete definition of containers is given. Containers are constructed from information units according to the user needs and their environment. The chosen approach to create information services is illustrated in Fig. 1.
2
Modelling Information Units
Information units depend on the database schema. They represent data in a standard, intuitive framework that allow high-performance access. Information units modelling can be compared with the modelling of semistructured data. Then information units turn out to be generalized views on the database [3, 15]. The generalization should support data condensation and supplementary facilities to enable an adequate representation to the user. We restrict the rule system used for generating units from the database to the smallest possible system. The rule system can be extended by inclusion of different analysis systems to enable a detailed analysis of data sets. Other extensions can be included, since the rule system is considered to be an open system. In order to define the rule system, we discuss first the modelling process.
Conceptual Design and Development of Information Services
11
2.1 Modelling Process Since we are interested in the support of information services we use the most general definition. Thus, the computation of information units is separated into three consecutive steps: Filtration by computational rules results in a view in the usual sense. In general, a view has its own schema, the simplest case being a subschema of the given database schema. Summarization by abstraction and rebuilding rules is the abstraction and construction of preinformation from the filtered data. The result will be called a raw information unit. In this step the demanded data condensation applies. Scaling by scaling rules is a process of customizing and building a facetted representation of information based on user interests, profiles etc. It uses typestructured queries and satisfies the requirement for supplementary facilities. (a) HERM subdiagram HH promoted company HHon I @ @ 6 ? @ H HH H organizes trading location HH HH 6@ I @ ? ? ? HH @ H belongs held H - event HHto HHon 6 ? HH has person site HH role
(b) raw information unit obtained by filtering and summarizing promotion period
selling period
3 Q k Q Q Q sport HH - location organizing H event H site ∈ Cottbus ?
hosting club
Fig. 2. Subschema for cultural, sport etc. events
No matter, whether views are materialized or not, raw information units and information units depend on the application and the functionality attached to the information containers. Example 1. The database schema in Fig. 2a representing data on events is a simplification of the schema used in Cottbus net. We use the higher-order ER model which allows relationship types to be defined over relationship types as their components, e.g. consider the type has role. Suppose that filtration is based on selecting sport events, companies which are clubs and locations residing in Cottbus. The filtration rule is expressible by a nested Select-From-Where-statement in ERQL. Alternatively, we may use the generalized ER-QBE discussed in [9,25]. Then a simplified ER-QBE-table for this query is the following one: organizes trading promoted on belongs to event company date kind held on ... event ... site location kind name kind ... location ... sport n club hosting l n Cottbus l
t u
12
2.2
T. Feyer, K.-D. Schewe, and B. Thalheim
Abstraction and Rebuilding Rules
Since filtration is defined by views we concentrate on the rules for summarization and scaling. Views are used for representation of various aspects in the application, but it is often claimed that the data consumed by different processes cannot be consistently represented in the database at the same time. This problem can be solved on the basis of event-condition semantics [23]. Derived views considered so far do not introduce new values as needed for condensation. This is achieved by abstraction and rebuilding rules, e.g. for summarization of numeric values, and extends aggregation formulae in SQL. Many information service operations (comparisons with aggregation, multiple aggregation, reporting features) are hard or impossible to express in SQL. Further, other query techniques like scanning, horizontal and vertical partitioning, parallel query processing, optimization of nested subqueries, or commutation of group by (cube) and join cannot be applied. Abstraction and rebuilding rules result in raw information units which need further to be adapted to the user’s needs, requirements and capabilities. We remark that on the basis of the specification of units a certain database functionality is enabled. Example 2. The events database in Fig. 2a keeps data on ongoing cultural or sport events etc. Our aim is to define an information unit which is used to obtain information on sport events organized in Cottbus by hosting clubs with information for picking up tickets and advertisement. Thus, we summarize the filtered data from Example 1 according to the schema in Fig. 2b. t u 2.3
Scaling Rules
Information units are obtained from raw information units by supplement rules and completion with functions: – Measure rules are used for translation of different scales used for the domain of values. Measure rules are useful especially for numerical values, prices etc. – Ordering rules apply to the ordering of objects in the information unit which depends on the application scenario. They are useful for the determination of the correct order in the presentation during dialogues. – Adhesion rules specify the coherence of objects that are put together into one unit. Adhesion rules are used for detecting disapproved decompositions. Objects with a high adhesion should be displayed together. – Hierarchy metarules express hierarchies among data which can be either linear or fanned. The rules can be used for computation of more compact presentations of data summaries. Example 3. In our event example from Fig. 2 a preordering is given by hosting club ' sport event selling period promotion period location. The preorder can be defined on different levels of abstraction. For example the attributes within entity hosting club are preordered by club name kind founded size remark. The adhesion of clubs to events is higher than the one of locations and time
Conceptual Design and Development of Information Services
13
to event, although two functional dependencies hold, and one is not preferred above the other. To represent adhesion we state the matrix which contains proximity between entities, where 0 indicates no adhesion and 1 indivisible adhesion (similar to ordering, adhesion can be additionally defined on attribute level): Adhesion proximity hosting club ...
hosting club 1.0
sport event 0.7
selling period 0.5
promotion period 0.5
location 0.3
Finally, several hierarchies exist such as the time hierarchy (year, month, week, day, daytime) and the location hierarchy (region, town, village, street). t u Besides the pure static aspects of information units described so far, functions from the following (not yet complete) list can be attached to information units: – Generalization functions are used for generation of aggregated data. They are useful in the case of insufficient space or for the display of complementary, generalized information after terminating a task. Hierarchy rules are used for the specification of applicability of generalization functions. The roll-up function in [1], slicing, and grouping are special generalization functions. – Specialization functions are used for querying the database in order to obtain more details for aggregated data. The user can obtain more specific information after he has seen the aggregated data. Hierarchy rules are used for the specification of applicability of specialization functions. The drill-down function used in the data warehouse approach is a typical example. – Reordering functions are used for the rearrangement of units. The pivoting, dimension destroying, pull and push functions [1] and the rotate function are special reordering functions. – Browsing functions are useful in the case that information containers are too small for the presentation of the complete information. – Sequentialization functions are used for the decomposition of sets or sequences of information. – Linking functions are useful whenever the user is required to imagine the context or link structure of units. – Survey functions are used for the graphical visualization of unit contents. – Searching functions can be attached to units in order to enable the user for computation of add-hoc aggregates. – Join functions are used for the construction of more complex units from units on the basis of the given metaschema. Example 4. Depending on the time granularity opening hours of organizers are presented by time intervals, weekly opening hours, or single dates. Generalization and specialization functions may swap between these representations. By applying reordering functions content of event data will be tailored to users needs. Event data includes, for example, either the event, its location and visualized map coordinates or the event, its hosting club and contact information. If additional information as detailed description or visualized directions do not fit into one container, browsing functions distribute data into several containers. They are provided by appropriate context and linking information. t u
14
T. Feyer, K.-D. Schewe, and B. Thalheim
Finally, we derive an interchange format for the designed information units which is used for the packing of units into containers. Identifiers are used for the internal representation of units. The formal context interchange format represents the order-theoretic formal contexts of units. The context interchange format is specified for each unit by the unit identifier, the type of context, the subsequent units, and the incident units. Example 5. In the event example, the order is either specified by the scenario of the workflow or by the order of information presentation. For example, it is assumed that information on actual sport events is shown before information on previous sport events is given. An advantage of the approach is the consideration of rule applicability to raw units. For this reason almost similar looking, simple units are generated. t u 2.4
Differences between Views and Information Units
Our intention behind the introduction of information units is to provide a standard, intuitive framework for information representation that enables high-performance access. Summarization and compactification should be supported by appropriate software as well as methods for the analysis of information. ER schemata turn out to be unsuitable for this purpose, since end-users, especially casual users, cannot understand nor remember complex ER schemata nor navigate through them. Thus, the task of query specification is getting too hard for the user. Therefore, for the development of information services we need – a standard and predicatable framework that allows for high-performance access, navigation, understanding, creation of reports, and queries, – a high-performance ‘browsing’ facility across the components within an information unit and – an intuitive understanding of constraints, especially cardinality constraints as guidance of user behaviour. At first glance this list of requirements looks similar to the one for data warehouses or multidimensional databases[12,17]. However, the requirements for information services are harder to meet, since the conceptual schema splits into multiple external views. The ER design techniques seek to remove redundancy in data and in the schema. Multidimensional databases often can handle redundancy for the price of inefficiency, infeasibility and modification complexity. Incremental modification, however, is a possible approach to information units and hence for information containers. By developing simple and intuitively usable information units user behaviour becomes more predictable. We can attach an almost similar functionality to the information units. This advantage preserves the genericity property of relational databases where the operations like insert are defined directly with the specification of the relations. Since information containers are composed from units, containers also maintain their main properties. Since user behaviour is encorporated and additional functionality is added, containers have additional
Conceptual Design and Development of Information Services
15
properties that will be discussed in Sect. 3 and compared with other concepts in anticipation to the following table: ER schemata MultiER-based ER-based with external dimensional information information views databases units containers redundancy + + + schema modification + + + navigation through subschemata (+) + + relationship-based subschemata (+) + + + coexistence of subschemata (±) (+) + + additional functionality + + compositionality + genericity (±) (∓) + + TA-efficiency + + + materialization + ± -
3
Information Containers
In internet applications it is commonly assumed that pages can be arbitrarily large. Psychological studies, however, show that typical users only scan that part of a page that is currently displayed leaving vertical browsing facilities through mouse or pointers untouched [10]. This limited use of given functionality is even worse for cable net users, since the browsing devices are even harder to use. For this reason we have to take into consideration the limitations of display. The concept of information containers solves this problem. They can be considered as flexible generalizations of dialogue objects [19]. Containers can be large as in the case of mouse-based browsers or tiny as in the case of TV displays. Similar to approaches in information retrieval we distinguish between the logical structure described by container parameters, the semantical content given by container instantiations and layout defined by container presentations. 3.1
Defining Information Containers
Since the data transferred in information containers are semistructured we may adapt the concept of tuple space [6]. The tuple space of containers is defined as a multiset of tuples, i.e., sequences of actual fields, which can be expressions, values or multi-typed variables. Variables can be used for the presentation of information provided by information units. The loading procedure for a container includes the assigment of types to variables. The assigment itself considers the display type (especially the size) of the variables. Pattern-matching is used to select tuples in a tuple space. Two tuples match if they have the same values in those fields which are common in both. Variables match any value of the same display type, and two values match only if they are identical. The operations to be discussed below for information containers are based on this general framework. Information containers are defined by: – Capacity of containers restricts the size and the display types of variables in the tuple space of the container.
16
T. Feyer, K.-D. Schewe, and B. Thalheim
– Loadability of containers parametrizes the computational functionality for putting information into one container. Functions like preview, precomputation, prefetching, or caching are useful especially in the case when capacity of containers is low. – Unloadability of containers specify readability, scannability and surveyability attached to containers. Instantiation of information containers is guided by the rules and the supported functions of the information units from which the container is loaded. Whether supported functions are enabled or not depends on the application and the rules of the units. The operations defined on tuple spaces are used for instantiation of containers by values provided by the information units. The size parameters limit the information which can be loaded into the containers. Figure 3 shows three different variants of containers. The “middle” container allows us to ‘see’ the general information on a selected meeting and the information on organizers and sales agents. The dialogues for information services we are currently developing are still rather simple. Dialogue steps can be modelled by graphs, in some cases even by trees. The typing system can be changed, if dialogues are more complex or the modelling of complex workflow is intended. Thus, the dialog itself can be in a certain state which corresponds to a node in the graph or to a subgraph of the dialogue graph. Information containers are used in dialogues to deliver the information for dialogue states. Layout of containers is expressible through style rules depending on the container parameters. Additional style rules can be used for deriving container layout according to style guides developed for different applications. 3.2
Modelling the Information Content for Dialogs
Information containers support dialogues and dialogue states. Therefore, an escort information is attached to each container which depends on its instantiation (see Fig. 3). This information is used to guide the user in the current state and the next possible states and to provide additional background information. In internet pages this information is often displayed through frames, but frames are very limited in their expressibility and often misleading. For this reason we prefer the explicit display of escort information. Then we can use two different modes: complete information displays the graph with all nodes from which the current state can be reached; minimal information displays at least the path which is used through the application graph to reach the current node. One important aim in the Cottbus net project is the development of dialogs with a self-descriptive and self-explainable information content. In order to achieve this goal, dialogs are modeled on the basis of their suported processes, their enabled manipulation operations and especially on the basis of the enabled information units with attached functionality. Dialogs are constructed from dialog steps. Each dialog step has its information content and its context information. The composition of dialogs from dialog steps can be used to separate the information which needs to be displayed in
(Figure: the subgraph around “interest in sport” — Cottbus information, sports organizations, sport events, sports club, commercial provider, sports enthusiast, time schedule, kinds of sport, meetings — together with the “small”, “middle”, and “large” containers and their complete or partial escort information.)
Fig. 3. The subgraph of interest in sport
the single step from the information which belongs to the step but has already been displayed in previous steps. For example, in the middle container the sport club information can be separated into necessary information which has to be displayed for the container and into escort information which can be shown upon request. The separation we have used is based on the functions defined for tuple spaces [6] like selective insert, cascaded insert, conditional insert, etc.
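The two display modes for escort information introduced in Sect. 3.2 (complete versus minimal) amount to simple computations on the dialogue graph. The following Python sketch shows one possible reading of these modes; the graph encoding and the function names are our own and are not taken from the Cottbus net implementation.

# dialogue graph as adjacency lists: node -> list of successor nodes
GRAPH = {"Cottbus information": ["interest in sport"],
         "interest in sport": ["sport events", "sports organizations"],
         "sport events": ["meetings"],
         "sports organizations": ["sports club"],
         "meetings": [], "sports club": []}

def complete_escort(graph, current):
    # all nodes from which the current state can be reached (fixpoint over predecessors)
    nodes = set()
    changed = True
    while changed:
        changed = False
        for node, succs in graph.items():
            if node in nodes:
                continue
            if current in succs or any(s in nodes for s in succs):
                nodes.add(node)
                changed = True
    return nodes

def minimal_escort(graph, path_taken):
    # at least the path actually used to reach the current node
    return list(path_taken)

print(complete_escort(GRAPH, "meetings"))
print(minimal_escort(GRAPH, ["Cottbus information", "interest in sport",
                             "sport events", "meetings"]))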
3.3 Formation and Wrapping Rules
Formally, instantiation of information containers is the process of assigning values from information units to variables of the tuple space. Furthermore, information containers have functions for loading the container and functions for searching within the container. On the basis of enabled computational functions (generalization, specialization, reordering, browsing, linking, surveying, search-
ing, joining) for analysis and interpretation of data in the used units, the general functionality which can be used for information containers is derivable. Additionally, the user profile is taken into consideration. In order to handle these requirements we use two different rule sets: Formation rules are used to instantiate the container depending on the necessary information and functionality. Information containers are similar to containers used in transportation. They can be unloaded only in a certain order and with certain instruments. Thus, depending on the necessary size, information containers can be loaded with different information units. The loading process is based on the structure of the dialog and on the properties of units, such as the association of information in different units. Based on the design of units, the set of available information containers and the design of dialogs, we can infer the presentation scenario. It contains the description of units, their association (adhesion, cohesion) and their enabled functionality for each dialog step in a certain dialog. The presentation scenario is used to describe the different approaches enabled for the user to extract information from the container. Browsing, linking and join functions of the exploited units can be used for achieving flexibility in dialog steps. Since the variety of possible sets of enabled functions can be very high, we use different models for the computation of data. These models are based on application scenarios and include operations like aggregation and prediction as well as analysis operations for generating status reports and comparing different variants of data. A typical status data type is the shopping basket. In the sports example, users are enabled to store several variants of shopping data and schedules. The sport example has only one presentation scenario. However, there is a large variety of generated links and a browsing functionality. Wrapping rules are used to pack the containers depending on the user’s needs and the dialog steps in the current dialog. The application of wrapping rules depends also on the properties of containers. The wrapping rules can be changed whenever different style rules or display rules are going to be applied [21,20]. This flexibility is also necessary if the communication channel is currently overloaded and does not allow the transportation of large containers. The transportation of container contents can be dynamically adapted to the order of dialog steps. Special wrapping rules are used for labeling the container’s content. If such rules are applicable then the user can ask for a summary of the container’s content before requesting to send the complete container. The label of each container is generated on the basis of survey functions defined for the units of the container. Thus, this approach enables an intuitive data manipulation in the style users know from spreadsheets. Further, wrapping rules can be developed for restructuring the information presentation in accordance with the repeatedly visited steps of the dialogs. Also, reordering and sequentialization functions defined for units can be used for better flexibility in information containers. Style rules are used for wrapping the instantiated information container. Information containers are based on user profiles, their preferences, their expectations, and their environment. Thus, handling of containers, loading and
reloading should be as flexible as possible. Customizing containers to the dialogue environment can be done on the basis of customization rules. In our sports example, wrapping rules can be used for the display style of (escort) information, for the placement of information on the screens, for enabling different functions and for displaying the content of the container.
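As an illustration of how such rules can be evaluated, the following Python sketch applies a capacity-bound formation rule and a simple wrapping (labelling) rule to a container. It is only a schematic reading of the rule sets described above; the unit structure, the priority field and the summary text are assumptions made for the example, not part of the Cottbus net system.

from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    size: int          # display size needed by the unit
    priority: int      # importance of the unit for the current dialog step
    summary: str       # produced by a survey function of the unit

def formation_rule(units, capacity):
    # load the most important units first until the capacity of the container is reached
    loaded, used = [], 0
    for unit in sorted(units, key=lambda u: -u.priority):
        if used + unit.size <= capacity:
            loaded.append(unit)
            used += unit.size
    return loaded

def wrapping_rule(loaded_units):
    # label the container with a summary generated from the survey functions of its units
    return {"label": "; ".join(u.summary for u in loaded_units),
            "content": [u.name for u in loaded_units]}

units = [Unit("meeting", 40, 3, "general meeting data"),
         Unit("organizer", 30, 2, "organizer and selling agents"),
         Unit("details", 50, 1, "additional details and location")]
print(wrapping_rule(formation_rule(units, capacity=80)))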
4 Conclusion
There is a need for the systematic construction of information services. The approach outlined in this paper is based on information units which are used in a larger number of dialogue steps. Information units are composed into information containers which form the basis for the human interface to the services. Information containers allow a certain functionality tailored to the user needs. Based on this functionality, a user can send manipulation requests to an underlying database. The whole approach is rule-based with execution semantics given by lazy evaluation. Currently, conflicts are just monitored, but at a later stage this will also be done systematically. The approach presented so far has proven sufficient for several large information service projects. In order to meet further requirements we have developed an open architecture. Thus, additional rules can be added to each of the presented steps. Furthermore, the addition of new components to derived views, raw units, information units and, to a certain extent, also to containers can be handled in a simple fashion. This makes the approach easily extensible in the case of unexpected changes. We are not interested in developing a general framework for information handling. Our aim so far is the development of a platform which enables the conceptual design of information services such as business information services, administration information services, online inhabitants’ services, educational information services, shopping services, etc. The topics discussed in this paper are currently used in our information services projects and are still under investigation. Therefore, there is a large number of open questions, such as incremental updates of units and containers. Nevertheless, the approach has been flexible enough for the inclusion of solutions to new requirements. Thus, the presented method can be considered a viable approach to the development of database-backed information services. Acknowledgement. We would like to thank the members of the FuEline and Cottbus net project teams for their stimulating discussions and their effort to implement our ideas.
References
1. R. Agrawal, A. Gupta, S. Sarawagi, Modeling multidimensional databases. Proc. Data Engineering Conference, 232–243, Birmingham, 1997.
2. P. Atzeni, G. Mecca, P. Merialdo, Design and maintenance of data-intensive web sites. EDBT 98, Valencia, 1998, LNCS 1377, 436–450.
3. P. Bretherton, P. Singley, Metadata: A user’s view. IEEE Bulletin, February, 1994.
4. W. Clauss, B. Thalheim, Abstraction layered structure-process codesign. D. Janaki Ram, editor, Management of Data, Narosa Publishing House, New Delhi, 1997. 5. L.M.L. Delcambre, D. Maier, R. Reddy, L. Anderson, Structured maps: Modeling explicit semantics over a universe of information. Int. Journal of digital Libraries, 1997, 1(1), 20–35. 6. R. De Nicola, G.L. Ferrari, R. Pugliese, KLAIM: a kernel language for agents interaction and mobility. Report, Dipartimento di Sistemi e Informatica, Universit` a di Firenze, Florence, 1997. 7. F. Fehler, Planing and development of online-systems for enterprise-wide information exchange. PhD Thesis, BTU Cottbus, 1996 (In German). 8. P. Fraternali, P. Paolini, A conceptual model and a tool environment for developing more scalable, dynamic, and custumizable web applications. EDBT 98, Valencia, 1998, LNCS 1377, 422–435. 9. J. Grant, T.W. Ling, and M. L. Lee, ERL: Logic for entity-relationship databases. Journal of Intelligent Information Systems, 1993, 2, 115–147. 10. J. Hasebrock, Multimedia psychology. Spektrum, Berlin, 1995. 11. R.E. Kent, C. Neuss, Conceptual analysis of hypertext. Intelligent Hypertext (Eds. C. Nicholas, J. Mayfield), LNCS 1326, Springer, 1997, 70–91. 12. R. Kimball, A dimensional modeling manifesto. DBMS, July 1996, 51–56. 13. M.W. Lansdale, T.C. Ormerod, Understandig interfaces. Academic Press, 1995. 14. J. Lewerenz, Dialogs as a mechanism for specifying adaptive interaction in database application design. Submitted for publication, Cottbus, 1998. 15. A. Motro, Superviews: Virtual integration of multiple databases. IEEE ToSE, 13, 7, July, 1987. 16. K. Parsaye, M. Chignell, Intelligent database tools and applications. John Wiley & Sons, Inc., New York, 1995. 17. N. Pendse, The olapreport. Available through www.olapreport.com, 1997. 18. M.Roll, B Thalheim, The surplus value service system FOKUS. INFO’95, Information technologies for trade, industry and administration, Potsdam, 355–366, 1995. (in German). 19. K.-D. Schewe, B. Schewe, View-centered conceptual modelling - an object-oriented approach. ER’96, LNCS 1157, Cottbus, 1996, 357–371. 20. T. Schmidt, Requirements, concepts, and solutions for the development of a basic technology of information services - The client. Master Thesis, BTU Cottbus, 1998 (In German). 21. R. Schwietzke, Requirements, concepts, and solutions for the development of a basic technology of information services - The server. Master Thesis, BTU Cottbus, 1998 (In German). 22. C.T. Talcott, Composable semantic models for actor theories. TAPSOFT, 1997. 23. B. Thalheim, Event-conditioned semantics in databases. OO-ER-94, (Ed. P. Loucopoulos), LNCS 881, 171–189, Manchester, 1994. 24. B. Thalheim, Development of database-backed information services for Cottbus net. Preprint CS-20-97, Computer Science Institute, BTU Cottbus, 1997. 25. B. Thalheim, The strength of ER modeling. Workshop ‘Historical Perspectives and New Directions of Conceptual Modeling’, Los Angeles, 1997, LNCS, 1998. 26. E. Thomson, OLAP solutions: Building multidimensional information systems. John Wiley & Sons, Inc., New York, 1997. 27. E.S.K. Yu, J. Mylopoulos, From E-R to ”A-R” - Modelling strategic actor relationships for business process reengineering. ER’94, LNCS 881, 548-565, Manchester, 1994.
An EER-Based Conceptual Model and Query Language for Time-Series Data

Jae Young Lee and Ramez A. Elmasri
Computer Science and Engineering Department, University of Texas at Arlington
Arlington, TX 76019-0015, U.S.A.
{jlee, elmasri}@cse.uta.edu
Abstract. Temporal databases provide a complete history of all changes to a database and include the times when changes occurred. This permits users to query the current status of the database as well as the past states, and even future states that are planned to occur. Traditional temporal data models concentrated on describing temporal data based on versioning of objects, tuples or attributes. However, this approach does not effectively manage time-series data that is frequently found in real-world applications, such as sales, economic, and scientific data. In this paper, we first review and formalize a conceptual model that supports time-series objects as well as the traditional version-based objects. The proposed model, called the integrated temporal data model (ITDM), is based on EER. It incorporates the concept of time and provides the necessary constructs for modeling all different types of objects. We then propose a temporal query language for ITDM that treats both version-based and time-series data in a uniform manner.
1. Introduction

Objects in the real world can be classified into the following three different types according to their temporal characteristics:
1. Time-invariant objects: These objects are constrained not to change their values in the application being modeled. An example is the SSN of an employee.
2. Time-varying objects (or version-based objects): The value of an object may change with an arbitrary frequency. An example is the salary of an employee.
3. Time-series objects: Objects can change their values, and the change of values is tightly associated with a particular pattern of time. Examples are daily stock price and scientific data sampled periodically.
Most traditional temporal databases [3,10,15,18,19] concentrated on the management of version-based objects. There have been specialized time-series management systems [1,4,5,6,16,17] reported in the literature. However, the main
focus of these systems was time-series objects only. In [14], the integrated temporal data model (ITDM) was first proposed, which integrates all different types of objects. Based on ITDM, various techniques to implement time-series data were also studied in [7]. In this paper, we first formalize the ITDM, then propose a query language for ITDM, with which we can query time-series data as well as version-based data. We show how time-series querying constructs and version-based querying constructs can be integrated within the same query language constructs. The paper is organized as follows. Section 2 briefly reviews ITDM. Section 3 discusses the syntax and semantics of path expressions along with the concepts of temporal projection and temporal selection. The proposed query language is discussed in detail in Section 4. Related work is discussed in Section 5, and Section 6 concludes the paper. Due to the lack of space, we do not include a formal description of ITDM and the query language in this paper. Interested readers are referred to [13].
2. Overview of ITDM

2.1 Basic Concept of Time Series and Time Representation

A time series is a sequence of observations made over time. The pattern of time according to which the observations are made is specified by a calendar [2,12]. So, each time series has associated with it a particular calendar. Typically a time series is represented as an ordered set of pairs: TS = {(t1, v1), (t2, v2), …, (tn, vn)}, where ti is the time when the data value vi is observed. Sometimes a time series has two or more data values observed at each ti, and is represented as TS = {(t1, (v(1,1), v(1,2), …, v(1,k))), (t2, (v(2,1), v(2,2), …, v(2,k))), …, (tn, (v(n,1), v(n,2), …, v(n,k)))}. To represent a specific subset of the time dimension, in general, we use a temporal element [11]. A temporal element is a finite union of time intervals: T = I1 ∪ I2 ∪ … ∪ In. Each time interval is an ordered set of consecutive time units, Ii = {t1, t2, …, tk}, and is represented as [t1, tk]. For example, assuming the granularity Day, a temporal element T = {[1, 3], [8, 9]} is equivalent to T = {Day1, Day2, Day3, Day8, Day9}. For convenience, however, we will use a conventional notation in this paper, such as {[1/1/98, 1/20/98], [2/3/98, 2/15/98]}.
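To make the preceding notation concrete, the following Python sketch shows one straightforward in-memory representation of a time series and of a temporal element. It is only an illustration of the definitions above; the class and attribute names are ours and are not part of ITDM.

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TimeSeries:
    calendar: str                 # name of the associated calendar, e.g. "Quarters"
    observations: list            # ordered list of (time, value) pairs

@dataclass
class Interval:
    start: date
    end: date                     # an interval [t1, tk] of consecutive time units

    def chronons(self, granularity=timedelta(days=1)):
        t, units = self.start, []
        while t <= self.end:
            units.append(t)
            t += granularity
        return units

# a temporal element is a finite union of intervals
temporal_element = [Interval(date(1998, 1, 1), date(1998, 1, 20)),
                    Interval(date(1998, 2, 3), date(1998, 2, 15))]

dividend = TimeSeries("Quarters",
                      [(date(1997, 3, 31), 0.35), (date(1997, 6, 30), 0.35),
                       (date(1997, 9, 30), 0.40), (date(1997, 12, 31), 0.40)])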
2.2 Overview of ITDM

ITDM is based on the Enhanced ER (EER) model [8]. An entity represents an independent object or concept in the real world. An entity type represents a collection of entities that have the same properties. The properties are represented by a set of attributes that are associated with the corresponding entity type. Two relationships are supported: a named relationship type and an IS_A relationship. A named relationship type models an association among different entity types, and is defined in terms of participating entity types. An IS_A relationship represents generalization and
specialization processes. Attributes represent the properties of either an entity type or a named relationship type. In ITDM, three different types of objects are recognized and modeled as attributes: time-invariant attributes, time-varying attributes, and time-series attributes. An attribute A is a tuple (TD, VD, AV), where TD(A) is the temporal domain, VD(A) is the value domain, and AV(A) is the attribute value of A. An entity type E is a tuple (EA, EP). Here, EA(E) is a set of attributes and EP(E) is the population of E; EA(E) = {A1, A2, …, Ak}, EP(E) = {e1, e2, …, en}. An entity ei is a tuple (surrogate, lifespan, EV). The surrogate is a system-generated unique identifier of each entity. The lifespan represents the time interval(s) during which the corresponding entity existed or the entity was of interest to the database. EV, denoting the value of an entity, is a set of attribute values: EV(ei) = {AV(Aj(ei)) | 1 ≤ j ≤ k}, where AV(Aj(ei)): TD(Aj(ei)) → VD(Aj(ei)) and TD(Aj(ei)) ⊆ lifespan(ei). Note that an attribute value of an entity is defined to be a function from the temporal domain of the attribute to the value domain of the attribute. Such a function is referred to as a temporal assignment. A named relationship type is modeled as R = (RE, RA, RP), where
− RE(R) = {(E1, ro1, c1), (E2, ro2, c2), …, (Em, rom, cm)}, where Ei is a participating entity type, roi is the role name of Ei, and ci is the structural constraint on Ei, represented by ci = (mini, maxi).
− RA(R) = {A1, A2, …, Ak}, a (possibly empty) set of attributes.
− RP(R) = {r1, r2, …, rn}, a set of relationship instances, where ri = (PE, lifespan, RV) such that
  − PE(ri) is represented as (surrogate(e1), surrogate(e2), …, surrogate(em)), where each ej ∈ EP(Ej), 1 ≤ j ≤ m, participates in ri.
  − lifespan(ri) ⊆ ∩_{j=1..m} lifespan(ej), with each ej participating in ri.
  − RV(ri) = {AV(Ai(ri)) | 1 ≤ i ≤ k}, where AV(Ai(ri)): TD(Ai(ri)) → VD(Ai(ri)) and TD(Ai(ri)) ⊆ lifespan(ri).

2.3 An Example

An example ITDM schema is shown in Fig. 1, which will be used in the following sections to illustrate queries. A time-invariant attribute is represented by an oval. A time-varying attribute is distinguished by a rectangle inside an oval. A time-series attribute has, in addition to a rectangle inside an oval, an associated calendar connected to it by an arrow. In the schema diagram, for example, dividend, price, ticks and population are time-series attributes and Quarters, BusinessWeek, WorkHours, and Years are, respectively, their associated calendars. The calendar Quarters specifies quarterly time units when dividends are paid. The attribute price represents daily high and low prices of a stock. It also has a nested time-series attribute ticks, which records hourly prices. The calendar BusinessWeek specifies 5 days a week (Monday through Friday) except all holidays when stock markets are closed. The calendar WorkHours specifies 9:00 AM to 5:00 PM market hours. A part of the database instances of the example schema is also shown in Fig. 2.
Fig. 2. (continued) Part of database instances of the example database
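A temporal assignment, i.e. the value of a time-varying or time-series attribute as a function from its temporal domain to its value domain, can be sketched as follows in Python. The interval representation as (start, end) pairs is our own choice for illustration; the salary values are taken from the example used later in Sect. 3.4.

from datetime import date

# a temporal assignment maps intervals of the temporal domain to values
salary_of_john = {
    (date(1995, 1, 1), date(1996, 6, 30)): 45000,
    (date(1996, 7, 1), date(1997, 12, 31)): 52000,
    (date(1998, 1, 1), date.max): 55000,      # date.max stands for "now"
}

def value_at(temporal_assignment, t):
    # look up the value that holds at time unit t, if any
    for (start, end), value in temporal_assignment.items():
        if start <= t <= end:
            return value
    return None

# an ITDM entity: surrogate, lifespan (set of intervals), and attribute values
employee = {
    "surrogate": 1001,
    "lifespan": [(date(1995, 1, 1), date.max)],
    "EV": {"name": "John", "salary": salary_of_john},
}

print(value_at(employee["EV"]["salary"], date(1997, 5, 1)))   # -> 52000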
3. Path Expressions, Temporal Projection, and Temporal Selection

This section describes three basic components of the temporal query language for ITDM, namely path expressions, temporal projection, and temporal selection.
3.1 Path Expression

Path expressions [9] are used to navigate entity types through relationship types, and to specify join conditions on entity types. A path expression is a rooted tree, such that
− the root is an entity type;
− if a node p has a child node c, then
  − if p is an entity type: c is an attribute or a role name connected to p;
  − if p is an attribute: p is a composite attribute and c is a component attribute of p;
  − if p is a role name: let p be the role name of an entity type E1 that participates in a relationship type R and E2 be another participating entity type, i.e., RE(R) = {(E1, p, c1), (E2, ro2, c2)}; then c is an attribute of R or E2, or a role name connected to E2;
− a role name may have a restricting predicate attached to it.
Figure 3 shows some valid path expressions on the database schema of Fig. 1.
(Figure: six example path expression trees (a)–(f), rooted at CUSTOMER, CITY, and MARKET, with children such as name, market, stocks, issuer, and shares; in (f) the role name stocks carries the restricting predicate [issuer = 'IBM'].)
Fig. 3. Example path expressions
A path expression can alternatively be represented as a text string. Starting from the root node, which is an entity type, we append its child separated by a dot. If the root has two or more children, the children nodes are enclosed in a pair of angled brackets and commas separate them. Then, we recursively apply the same rules to all of its children. The textual representations and the interpretations of the path expressions are given below:
− (a) CUSTOMER: All customers (including all attributes).
− (b) CUSTOMER.name: Names of all customers.
− (c) CITY.market: For each city, list of stock markets (i.e., their surrogate values) located in the city.
− (d) MARKET.<name, stocks>: For each market, the name of the market and all stocks traded in the market.
− (e) CUSTOMER.<name, stocks.<issuer, shares>>: For each customer, the name of the customer, and issuer and share of all stocks the customer owns.
− (f) CUSTOMER.<name, stocks[issuer = 'IBM'].<issuer, shares>>: The same as the path expression (e), but only for IBM stocks. Here, the restricting predicate is attached to the role name stocks to select only IBM stocks.
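The textual representation rules above (dots for single children, angled brackets and commas for multiple children, square brackets for restricting predicates) can be rendered by a small recursive function. The following Python sketch is illustrative only; the tuple encoding of the tree is an assumption made here, not part of the ITDM definition.

def to_text(node):
    # node = (label, restricting_predicate_or_None, list_of_children)
    label, predicate, children = node
    text = label + (f"[{predicate}]" if predicate else "")
    if not children:
        return text
    if len(children) == 1:
        return text + "." + to_text(children[0])
    return text + ".<" + ", ".join(to_text(c) for c in children) + ">"

# path expression (f): CUSTOMER.<name, stocks[issuer = 'IBM'].<issuer, shares>>
pe_f = ("CUSTOMER", None, [
    ("name", None, []),
    ("stocks", "issuer = 'IBM'", [("issuer", None, []), ("shares", None, [])]),
])
print(to_text(pe_f))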
3.2 Nontemporal Queries

A nontemporal query is one that accesses the current database state. The syntax of nontemporal queries is:

GET     p1, p2, …
FROM    E1 e1, E2 e2, …
WHERE   pr
Here, pi is a path expression, Ei is an entity type and ei is a variable that ranges over the entities in EP(Ei), and pr is a predicate. The semantics of a query is as follows. First, form the Cartesian product of all entity types specified in the FROM clause. For each element in the Cartesian product, the predicate specified in the WHERE clause is evaluated. If the predicate evaluates to true, then the element, which is a tuple of entities, is selected. Then, from these entities, only the information specified in the GET clause is displayed. An example nontemporal query is given below.

Query 1: List the names of customers who own all the stocks that are traded in the market located in the same city in which the customer lives.

GET     c.name
FROM    CUSTOMER c, MARKET m
WHERE   (c.city.name = m.city.name) AND ((c.stocks) INCLUDE (m.stocks))
3.3 Temporal Projection

Temporal projection restricts the information to be displayed to a particular time interval(s). The syntax of a temporal projection is p: T, where p is a path expression and T is a temporal element. Assume that a path expression p returns the following: {[1/1/97, 5/31/97] → Robert, [6/1/97, 10/31/97] → Richard, [11/1/97, now] → Robert}. Then, p: [3/1/97, 12/31/97] will return the following: {[3/1/97, 5/31/97] → Robert, [6/1/97, 10/31/97] → Richard, [11/1/97, 12/31/97] → Robert}. The following example illustrates the use of temporal projection in the GET clause.
Query 2: Show, for all the stocks William has owned, the list of the issuers and the corresponding shares during the time period [1/1/97, 12/31/97].

GET     c.stocks.<issuer, shares>: [1/1/97, 12/31/97]
FROM    CUSTOMER c
WHERE   c.name = ‘William’
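The effect of a temporal projection p: T can be sketched as an intersection of a temporal assignment with a temporal element. The following Python fragment is a minimal illustration of that semantics, reusing the date-pair representation of intervals from the earlier sketch; it is not part of the ITDM query processor.

from datetime import date

def project(temporal_assignment, temporal_element):
    # restrict each (interval -> value) pair to its overlap with the temporal element
    result = {}
    for (start, end), value in temporal_assignment.items():
        for (t_start, t_end) in temporal_element:
            lo, hi = max(start, t_start), min(end, t_end)
            if lo <= hi:
                result[(lo, hi)] = value
    return result

mayor = {(date(1997, 1, 1), date(1997, 5, 31)): "Robert",
         (date(1997, 6, 1), date(1997, 10, 31)): "Richard",
         (date(1997, 11, 1), date.max): "Robert"}

print(project(mayor, [(date(1997, 3, 1), date(1997, 12, 31))]))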
3.4 Temporal Selection

A predicate on an entity evaluates to either true or false when the entity assumes a nontemporal value. If, however, an entity assumes a temporal value, the result of applying a predicate to the entity returns a temporal assignment whose codomain is {true, false}. Assume that an attribute salary of an entity John assumes the following value in a temporal database: {[1/1/95, 6/30/96] → 45000, [7/1/96, 12/31/97] → 52000, [1/1/98, now] → 55000}. Then, if we apply the predicate (salary > 50000) to John, the result will be {[1/1/95, 6/30/96] → false, [7/1/96, now] → true}. The application of a predicate pr to a temporal entity e is denoted by pr(e), which is referred to as a temporal predicate. The true time of a temporal predicate pr on e, denoted by [[ pr(e) ]], is a temporal element during which the predicate evaluates to true. So, the true time of the predicate (salary > 50000) applied to John is {[7/1/96, now]}. A temporal selection predicate is a Boolean expression that compares two temporal elements using the set comparison operators {=, ≠, ⊆, ⊇}, where at least one of the operands is the true time of a temporal predicate. An example query, which uses the temporal selection predicate in the WHERE clause, is shown below.

Query 3: Names of customers who owned IBM stock during [1/1/97, 6/30/97].

GET     c.name
FROM    CUSTOMER c
WHERE   [[ c.stocks.issuer = ‘IBM’ ]] ⊇ [1/1/97, 6/30/97]
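A possible reading of temporal predicates and their true times, continuing the toy interval representation used in the earlier sketches, is shown below in Python. The helper names are ours, and the merging of adjacent intervals is simplified to day granularity.

from datetime import date, timedelta

def true_time(temporal_assignment, predicate):
    # apply the predicate to every (interval -> value) pair, keep the intervals
    # on which it evaluates to true, and merge adjacent true intervals
    intervals = sorted(iv for iv, v in temporal_assignment.items() if predicate(v))
    merged = []
    for start, end in intervals:
        if merged and (start - merged[-1][1]) <= timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

salary = {(date(1995, 1, 1), date(1996, 6, 30)): 45000,
          (date(1996, 7, 1), date(1997, 12, 31)): 52000,
          (date(1998, 1, 1), date.max): 55000}

# true time of (salary > 50000): {[7/1/96, now]}, with date.max standing for "now"
print(true_time(salary, lambda v: v > 50000))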
4. Temporal Queries

4.1 Basic Temporal Queries

The syntax of basic temporal queries is shown below. We will extend this syntax later to include aggregate functions and time series operations.

GET     p1:T1, p2:T2, …
FROM    E1 e1, E2 e2, …
WHERE   predicate
Here, Ti is a temporal element for temporal projection and predicate is a Boolean expression on the attributes of entities or relationship instances that may include a temporal selection predicate. Some example temporal queries are given below.

Query 4: Names of customers who owned more than 1000 shares of IBM stock during the whole period of 1997.

GET     c.name
FROM    CUSTOMER c
WHERE   [[ c.stocks[issuer = ‘IBM’].shares > 1000 ]] ⊇ [1/1/97, 12/31/97]
In this query, the restricting predicate [issuer = ‘IBM’] restricts the relationship instances between CUSTOMER and STOCK to only IBM stock. If we issue this query to the example database, the result will be: {Tracy}.

Query 5: Names of customers who owned more than 1000 shares of IBM stock any time during 1997.

GET     c.name
FROM    CUSTOMER c
WHERE   NOT EMPTY([[ c.stocks[issuer = ‘IBM’].shares > 1000 ]] ∩ [1/1/97, 12/31/97])
The result of this query when applied to the example database is: {William, Tracy}.
4.2 Query Language Constructs for Time-Series Attributes

4.2.1 Aggregate Functions and Granularity Conversion

Nontemporal aggregate functions compute the aggregation over a set of data values. Typical aggregate functions are: COUNT, EXISTS, SUM, AVERAGE, MAX, MIN, etc. On the other hand, temporal aggregate functions compute the aggregation over
the time dimension. We use the following temporal aggregate functions: TCOUNT, TEXISTS, TSUM, TMAX, TMIN, etc. The following queries show the use of temporal aggregate functions.

Query 6: Compute the average population of New York city between 1990 and 1997.

GET     TAVERAGE(t.population): [1990, 1997]
FROM    CITY t
WHERE   t.name = ‘New York’
We also define a special type of true time that is applied to aggregate functions. The true time [[ f(A): [tl, tu] ]] returns the time when the attribute A assumes the value specified by the aggregate function f during the time interval [tl, tu]. Here, f is either TMIN or TMAX. The following query illustrates the usage of this type of true time (the GET TIME will be discussed in more detail in Section 4.2.2).

Query 7: When did the daily high price of IBM stock reach its highest price in November 1997?

GET TIME [[ TMAX(s.price.high): [11/1/97, 11/30/97] ]]
FROM    STOCK s
WHERE   s.issuer = ‘IBM’

In applications that include time-series data, sometimes it is necessary to convert the granularity of a time series. A granularity conversion may be into a coarser granularity or into a finer granularity. The conversion to a coarser granularity is specified in a query by attaching to the aggregate function ‘BY target granularity’ as shown in the following example.

Query 8: List the weekly high price of GE stock during 1997.

GET     TMAX(s.price.high) BY WEEK: [1/1/97, 12/31/97]
FROM    STOCK s
WHERE   s.name = ‘GE’
This query converts the granularity of the time series high from Day to Week. In a query that requires the conversion of a granularity to a finer one, we need to specify a particular interpolation function to be used as well as the target granularity as shown in the following example:

Query 9: Show the month-by-month population of the city Boston.

GET     t.population BY MONTH(function): [1/1/97, now]
FROM    CITY t
WHERE   t.name = ‘Boston’

Here, function is an interpolation function provided by a DBMS. It may be a linear function, spline, etc.
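Conversion to a coarser granularity, as in Query 8, essentially groups the observations of the time series by the target granularity and applies the temporal aggregate to each group. The following Python sketch illustrates this idea; grouping by ISO year/week number is our own simplification of the calendar mechanism described in the paper.

from collections import defaultdict
from datetime import date

def to_coarser(time_series, group_key, aggregate):
    # group the (time, value) observations by the target granularity
    groups = defaultdict(list)
    for t, v in time_series:
        groups[group_key(t)].append(v)
    return {g: aggregate(vs) for g, vs in sorted(groups.items())}

daily_high = [(date(1997, 1, 6), 101.0), (date(1997, 1, 7), 103.5),
              (date(1997, 1, 8), 102.0), (date(1997, 1, 13), 99.5),
              (date(1997, 1, 14), 100.25)]

# TMAX(...) BY WEEK, using ISO (year, week) pairs as the coarser time units
weekly_high = to_coarser(daily_high, lambda t: t.isocalendar()[:2], max)
print(weekly_high)   # {(1997, 2): 103.5, (1997, 3): 100.25}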
4.2.2 Time Selection Functions

Sometimes, it is necessary to extract a particular time interval or a time unit from a given temporal element. For this purpose, we define two time selection functions. An interval selection function I_SELECT(i, T) returns the i-th interval from the temporal element T. Here, i is an integer, FIRST, or LAST. When used as the value of i, FIRST and LAST return the first and last interval, respectively, from T. If T is the true time of a temporal predicate on a time-series attribute, it returns the i-th time unit. A time unit selection function T_SELECT(i, I) returns the i-th time unit from the interval I. Again, i may be an integer, FIRST, or LAST. An example query is shown below.

Query 10: List the names of customers who lived in New York during the first tenure of Mayor Robert.

GET     c.name
FROM    CUSTOMER c, CITY t
WHERE   (t.name = ‘New York’) AND ([[ c.city.name = ‘New York’ ]] ⊇ I_SELECT(FIRST, [[ t.mayor = ‘Robert’ ]]))
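On the list-of-intervals representation used in the earlier sketches, the two time selection functions have a very direct reading. The following Python fragment is an illustration only; treating FIRST and LAST as the indices 0 and -1, and assuming day granularity, are our own simplifications.

from datetime import date, timedelta

FIRST, LAST = 0, -1

def i_select(i, temporal_element):
    # I_SELECT(i, T): the i-th interval of the temporal element T
    return temporal_element[i]

def t_select(i, interval):
    # T_SELECT(i, I): the i-th time unit of the interval I (day granularity assumed)
    start, end = interval
    if i == LAST:
        return end
    return start + timedelta(days=i)

tenures_of_robert = [(date(1994, 1, 1), date(1997, 12, 31)),
                     (date(2002, 1, 1), date(2005, 12, 31))]
print(t_select(FIRST, i_select(FIRST, tenures_of_robert)))   # -> 1994-01-01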
We can also use the time selection function in the GET clause to extract a particular time. In this case we use GET TIME instead of GET. An example is shown below.

Query 11: When did Robert become a mayor of New York for the first time?

GET TIME T_SELECT(FIRST, I_SELECT(FIRST, [[ t.mayor = ‘Robert’ ]]))
FROM    CITY t
WHERE   t.name = ‘New York’

4.2.3 Representation of Temporal Windows and Moving Window

A temporal window is specified using one of the following representations: [t1, t2], [t, % i %], [t, % iG %]. Here, t is a time unit, i is an integer, and G is a granularity. The first component of a temporal window is called the window reference, and the second component is called the window end. The interpretation and examples of the different types of temporal windows are given below.
− Type 1 ([t1, t2]): Specifies all data values between t1 and t2, including the values at both ends if they exist. Example: [1/1/97, 12/31/97].
− Type 2 ([t, % i %]): Specifies i consecutive data values starting from t, including the value at time t if it exists. If no data value exists at t when applied to a time-series attribute, then it starts with the next data value in the time series. Example: [5/1/98, %10%].
− Type 3 ([t, % iG %]): Specifies all data values between t and t + iG, including the values at both ends if they exist. Example: [3/1/98, %14Day%], which is equivalent to [3/1/98, 3/15/98]. If no data value exists at t when applied to a time-series attribute, then it starts with the next data value in the time series. If no data value exists for the window end, the data value that exists in the time series immediately before the window end is used.
Type 1 temporal windows were used in the previous query examples. The following example shows the use of a Type 2 temporal window.

Query 12: Show the 10 consecutive daily high prices of SEARS stock starting from 3/1/1998.

GET     s.price.high: [3/1/98, %10%]
FROM    STOCK s
WHERE   s.issuer = ‘SEARS’
A moving window is used to specify a series of temporal windows, each of which provides a time interval for an aggregate function. A moving window is specified by attaching two time durations to a temporal window of Type 1. A time duration is represented by: % i %, % iG %, or % name of a calendar %. The first two durations have the same meaning as they do with the temporal windows above. The third duration specifies the period of a periodic calendar. An example moving window is: [1/1/97, 12/31/97] FOR %10% INCREMENT %3Day%. Here, the keyword FOR specifies the size of the window and the keyword INCREMENT specifies the increment by which the window moves. The example specifies a moving window, where a window has 10 consecutive data values and moves with an increment of 3 days in the time interval [1/1/97, 12/31/97]. The following is an example:

Query 13: Show the 10-day moving average of the daily high price of SEARS stock with an increment of 5 days during 1997.

GET     TAVERAGE(s.price.high): [1/1/97, 12/31/97] FOR %10Day% INCREMENT %5Day%
FROM    STOCK s
WHERE   s.issuer = ‘SEARS’
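The moving-window construct of Query 13 corresponds to sliding a fixed-size window over the observations of a time series and applying the temporal aggregate inside each window position. A minimal Python sketch follows; it assumes day granularity and synthetic price data, and is not part of the ITDM implementation.

from datetime import date, timedelta

def moving_aggregate(time_series, start, end, size, increment, aggregate):
    # slide a window of length `size` over [start, end] with step `increment`
    # and apply the aggregate to the observations falling into each window
    results = []
    w_start = start
    while w_start <= end:
        w_end = min(w_start + size, end)
        values = [v for t, v in time_series if w_start <= t <= w_end]
        if values:
            results.append(((w_start, w_end), aggregate(values)))
        w_start += increment
    return results

daily_high = [(date(1997, 1, 1) + timedelta(days=i), 100.0 + i) for i in range(30)]
average = lambda vs: sum(vs) / len(vs)

# 10-day moving average with a 5-day increment, cf. Query 13
for window, value in moving_aggregate(daily_high, date(1997, 1, 1), date(1997, 1, 31),
                                      timedelta(days=10), timedelta(days=5), average):
    print(window, round(value, 2))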
4.2.4 Using Relative Time for Data Selection

Consider the following query. “What was the daily high price of IBM stock 5 days after it reached the highest price in November 1997?” Here, we want to know the value of a time-series attribute at a time that is specified as a relative time. We define two more temporal window types to specify relative times as follows:
− Type 4 ([t; % i %]): Specifies the i-th data value from t. All other semantics is the same as that of Type 2.
− Type 5 ([t; % iG %]): Specifies the data value at time t + iG. All other semantics is the same as that of Type 3.
Then, the above query can be written using a Type 5 temporal window as follows:

Query 14:
GET     s.price.high: [ [[ TMAX(s.price.high): [11/1/97, 11/30/97] ]] ; %5Day% ]
FROM    STOCK s
WHERE   s.issuer = ‘IBM’
5. Related Work

Very little study has been reported in the literature regarding query languages on time-series data. CALANDA [4,5,6] was implemented based on an object-oriented model. So, data retrieval is performed through method invocation. Common operations on time series are predefined as methods of the time series root class. Timestamp selection and granularity conversion are examples of common operations. Operations for a particular class of time series are defined in that class definition. In [17], a time series is modeled as a regular time sequence. This model defines high-level operations to manipulate time sequences. Basic retrieval operations include select, aggregate, accumulate, etc. Informix provides the TimeSeries DataBlade module, which provides support for time series and calendars, and offers over 40 predefined functions to manage them. The module is a user-installable extension of the Illustra server, which is an object-relational DBMS. The query language is an extension of SQL, but the extensions are object-based.
6. Conclusion

A time series is a special type of time-varying object. The change of the value of a time series is tightly associated with a predefined pattern of time called a calendar. Time-series objects also require different types of operations on them. Various data models have been proposed for temporal databases and time series management. However, the integration of time-varying (or version-based) objects and time-series objects has rarely been studied. We formalized a conceptual model based on EER, called ITDM (Integrated Temporal Data Model), that incorporates all different types of objects [14]. We then presented a query language for ITDM. We showed that time-series querying constructs and version-based querying constructs could be integrated within the same query language constructs.
References 1. R. Chandra and A. Segev, “Managing Temporal Financial Data in and Extensible Database,” Proc. 19th Int’l Conf. on VLDB, 1993, pp. 302-313. 2. R. Chandra, A. Segev, and M. Stonebraker, “Implementing Calendars and Temporal Rules in Next Generation Databases,” Proc. 3rd Int’l Conf. on Data Engineering, 1994, pp. 264-273. 3. U. Dayal and G. Wuu, “A uniform Approach to Processing Temporal Queries,” Proc. 18th VLDB Conf., 1992, pp. 407-418. 4. W. Dreyer, A.K. Dittrich, and D. Schmidt, “An Object-Oriented Data Model for a Time Series Management System,” Proc. 7th Int’l Working Conf. on Scientific and Statistical Database Management, 1994, pp. 186-195. 5. W. Dreyer, A.K. Dittrich, and D. Schmidt, “Research Perspectives for Time Series Management Systems,” ACM SIGMOD Record, Vol. 23, No. 1, 1994, pp. 10-15. 6. W. Dreyer, A.K. Dittrich, and D. Schmidt, “Using the CALANDA Time Series Management Systems,” Proc. ACM SIGMOD Int’l Conf., 1995, pp. 489-499. 7. R. Elmasri and J.Y. Lee, “Implementation Options for Time-Series Data,” Temporal Databases: Research and Practice, O. Etzion et. al. (Eds), LNCS No. 1399, 1998, pp. 115127. 8. R. Elmasri and S. Navathe, "Fundamentals of Database Systems," 2nd Edition, Benjamin/Cummings, 1994. 9. R. Elmasri and J. Wiederhold, “GORDAS: A Formal High-Level Query Language for the ER Model,” Proc. 2nd Entity-Relationship Conference, 1981, pp. 49-72. 10. R. Elmasri and G. Wuu, “A Temporal Model and Query Language for ER Database," Proc. 6th Int’l Conf. on Data Engineering, 1990, pp. 76-83. 11. S. Gadia and C. Yeung, “A Generalized Model for a Relational Temporal Database," Proc. ACM SIGMOD Conf., 1988, pp. 251-259. 12. A. Kurt and M. Ozsoyoglu, “Modelling Periodic Time and Calendars,” Proc. Int'l Conf. on Application of Databases, 1995, pp. 221-234. 13. J.Y. Lee, “Database Modeling and Implementation Techniques for Time-Series Data,” Ph.D. Dissertation, Computer Science and Engineering Department, University of Texas at Arlington, May 1998. 14. J.Y. Lee, R. Elmasri, and J. Won, “An Integrated Temporal Data Model Incorporating Time Series Concept,” Data and Knowledge Engineering, Vol. 24, No. 3, 1998, pp. 257276. 15. E. Rose and A. Segev, “TOODM – A Temporal Object-Oriented Data Model with Temporal Constraints,” Proc. 10th Int’l Conf. on the Entity-Relationship approach, 1991. 16. D. Schmidt, A.K. Dittrich, W. Dreyer, and R. Marti, “Time Series, a Neglected Issue in Temporal Database Research?” Proc. Int’l Workshop on Temporal Databases, 1995, pp. 214-232. 17. A. Segev and A. Shoshani, “Logical Modeling of Temporal Data,” Proc. ACM SIGMOD Int’l Conf., 1987, pp. 454-466. 18. A.U. Tansel, “Temporal Relational Data Model,” IEEE Tans. on Knowledge and Data Engineering, Vol. 9, No. 3, 1997, pp. 464-479. 19. G. Wuu and U. Dayal, “A Uniform Model for Temporal Object-Oriented Databases,” Proc. 8th Int’l Conf. on Data Engineering, 1992, pp. 584-593.
Chrono: A Conceptual Design Framework for Temporal Entities*

Sonia Bergamaschi 1,3 and Claudio Sartori 2,3
1 DSI - University of Modena, Italy, [email protected]
2 DEIS - University of Bologna, Italy, [email protected]
3 CSITE - CNR, Bologna, Italy, Viale Risorgimento, 2 - 40136 Bologna, Italy
* With the contribution of Gruppo Formula S.p.A., Bologna, Italy, http://www.formula.it
Abstract. Database applications are frequently faced with the necessity of representing time-varying information and, particularly in the management of information systems, a few kinds of behavior in time can characterize a wide class of applications. A great amount of work in the area of temporal databases, aiming at the definition of a standard representation and manipulation of time, mainly in the relational database environment, has been presented in recent years. Nevertheless, conceptual design of databases with temporal aspects has not yet received sufficient attention. The purpose of this paper is twofold: to propose a simple temporal treatment of information at the initial conceptual phase of database design; and to show how the chosen temporal treatment can be exploited in time integrity enforcement by using standard DBMS tools, such as referential integrity and triggers. Furthermore, we present a design tool implementing our data model and constraint generation technique, obtained by extending a commercial design tool.
1 Introduction and Motivations
Database applications are frequently faced with the necessity of representing time-varying information and, particularly in management information systems, a few kinds of behavior in time can characterize a wide class of applications. In recent years we have observed a great amount of work in the area of temporal databases, aiming at the definition of a standard representation and manipulation of time, mainly with reference to the relational model. Nevertheless, in our opinion, the design of databases with temporal aspects has not yet received sufficient attention: the decisions about the temporal treatment of information are a matter of the conceptual database design phase and, starting from there, a set of related design choices is strictly consequent. For instance, if we consider two related pieces of information and we decide that temporal treatment is needed only for the first one, can we assume that the decision of temporal treatment for
the second one is completely independent or are there necessary clear constraints to preserve information integrity? In order to obtain an answer to the above question, we have to start facing the problem of supporting referential integrity constraints in this more complex scenario. In fact, when we add temporal treatment to a piece of information we limit the validity of this information in time and a reference to it can be made only during this validity interval; therefore, the kind of allowed/required temporal treatment for related information is strictly inter-dependent. As a consequence, in order to support temporal treatment we must extend the referential integrity concept, which plays a central role in databases: a reference is valid if the referenced object is valid through all the valid time of the referencing object. Another assumption of our work is related to the general framework of database design: when we are faced with the design of non-trivial applications we cannot avoid the usage of methodologies and tools to support some design phases [1]. The Entity-Relationship (ER) model and its extensions have proved to be useful in the conceptual design phase, and many design tools allow the user to draw conceptual schemata and to automatically generate relational database schemata. In addition, some design tools are also able to generate code for constraint enforcement, such as referential integrity and triggers for different relational DBMSs. For this reason, if we design temporal treatment directly at the conceptual level and extend a design tool in this direction we obtain two major advantages:
– temporal treatment is documented at a high level as a first-class feature and it is dealt with in a standard fashion,
– the integrity constraints deriving from temporal treatment can be automatically translated into constraint enforcement code at the DBMS level.
The first choice we have to make is the selection of the conceptual model to be extended. In order to obtain a general approach and to easily come to the implementation of temporal treatment on top of an existing design tool, we refer to an industry-standard conceptual model, the IDEF1X model [23]. IDEF1X is an accepted standard for the USA government and is adopted by some conceptual database design tools [18,17]. The second choice is which kind of time is to be supported. The past decade of research on temporal databases led to the presentation of a high number of temporal data models, and the book [25] gives a comprehensive account and systematization of this activity. In particular, many extensions of the relational model have been proposed to represent the time dimension with different meanings and complexity. At present, there exists a general consensus on the bi-temporal models, such as BCDM [14], where two orthogonal temporal dimensions are considered: valid time (i.e. the time when the fact is true in the modelled reality) and transaction time (i.e. the time when the fact is current in the database and may be retrieved). According to the assumption that supporting referential integrity is a major issue, it is mandatory for us to support at least the valid time dimension.
In this work we restrict ourselves to considering only the valid time, since the transaction time dimension does not affect the enforcement of referential integrity. In other words, if an application requires both the temporal dimensions, it is possible to perform the conceptual design with respect to the valid time dimension, and then add independently the transaction time representation. The third choice refers to the granularity of detail to which conceptual elements have temporal treatment, ranging from a single attribute value to an entire entity instance. Our choice is in favor of the entity level granularity: all the attributes of an entity have the same temporal treatment, and this applies to all its instances. This apparently coarse granularity is well suited for most practical applications, and does not constitute a severe limitation, since a different temporal treatment for two subsets of attributes of the same entity can be easily modeled with vertical partitioning. The fourth choice is the type of time modeling suitable for a database application. The most intuitive notions are those of event, which happens at a time point, and state, which holds during a time interval, described by its starting and ending time points. The research community has reached a quite wide consensus on the notion of temporal element, which is a finite union of n-dimensional time intervals [9] (note that the simple notion of time interval would not guarantee the closure property with respect to the usual operations on time intervals, such as union, intersection, etc.). Special cases of temporal elements include valid-time elements, transaction-time elements and bi-temporal elements. As explained above, we consider only one-dimension temporal elements, i.e. valid-time elements. The straightforward modeling of such a temporal element as a non-normalized entity attribute would lead to an inefficient implementation in a relational environment. Being aimed at producing a practical design tool, our choice is to constrain a given entity to have normalized time attributes, i.e. only one of the following kinds of temporal elements: single chronon, finite set of chronons, single interval, finite set of intervals. This means that the designer has to decide in advance which type of temporal treatment is best suited for the modeling of a given entity. On the basis of the above choices on the temporal treatment of information, we define an extension of an industry-standard conceptual model, IDEF1X, and develop the necessary set of integrity constraints at the conceptual schema level to preserve information consistency. The result of these extensions is a uniform way to deal with time at the conceptual schema level. Since some design tools provide the automatic mapping of a conceptual schema into a relational one and generate code for constraint enforcement, an effective architectural choice is the extension of a tool like this at the conceptual level. The logical schema and the integrity constraints and triggers to ensure a correct evolution of the database contents with respect to time attributes will thus be automatically generated at the database level. In order to prove the feasibility and the usefulness of our approach, we developed a software layer, called Chrono, on top of the database design CASE tool ErWin [18]. With Chrono, the conceptual design activity is modified as follows:
1. design the conceptual schema abstracting from the temporal aspects
2. select the appropriate temporal treatment for the entities
3. Chrono automatically converts the schema obtained into a standard IDEF1X schema, adding the temporal attributes and the necessary set of integrity constraints.
The paper is organized as follows: Section 2 introduces the Chrono conceptual data model as an extension of IDEF1X with temporal modeling features. Section 3 discusses the design constraints generated by the dependencies between related temporal entities. Section 4 examines the integrity constraints that rule temporal entities. Section 5 shows the architecture of the design tool based on Chrono. Finally, Sect. 6 discusses some related work.
2 The Chrono Conceptual Data Model
Let us briefly recall the modeling principles of the IDEF1X model, partly drawn from the official F.I.P.S. documents [23]. The IDEF1X model is derived from the Entity-Relationship (E/R) model [3] and its well-known extensions [1]. The main difference with respect to E/R is the proposal of a conceptual model “closer” to the logical relational data view. The main extension is the distinction between independent and dependent entities, the latter being identified with the contribution of another entity via an identifying relationship. The relationships are either connection or categorization relationships. Connection relationships (also referred to as parent-child relationships) are the standard way to express semantic relationships between entities and can be identifying, non-identifying without nulls and non-identifying with nulls. The cardinality is one to one or one to many. The model also allows the non-specific relationship, corresponding to the many to many relationship, but its usage is intended only for the initial development of the schema, to be refined in later development phases and substituted by entities and connection relationships, as explained at the end of this section. A categorization relationship represents a generalization hierarchy and is a relationship between one entity, referred to as the generic entity, and another entity, referred to as a category entity (i.e. the specialization). A category cluster is a set of one or more categorization relationships and represents an exclusive hierarchy. The Chrono conceptual data model is a temporal extension of IDEF1X. In analogy with the Entity–Relationship model, it assumes that each entity instance must be uniquely identified: by way of some internal attributes if the entity is independent, and by other connected entities if the entity is dependent. We will consider the identifier of an entity as time-invariant, while the other attributes can be time-variant. Entities are either absolute, if they do not require temporal treatment, or temporal, if they are subject to time support, with explicit representation of their time attributes. Absolute entities will eventually be subject
to insertions, deletions and updates, as is usual in a database environment. In contrast, insertions, deletions and updates of temporal entities will be subject to particular rules and restrictions, as will be shown in the following. According to [12], we assume that all the non-key attributes of an entity have the same kind of behavior in time: in this way, temporal treatment can be done at the entity level and not at the attribute level. This is not a real limitation, since if the assumption does not hold for an entity E, attributes can be clustered according to a uniform behavior in time and the entity can be vertically partitioned into entities, say E1 and E2, linked by a one-to-one identifying relationship. In a way similar to the extension of the relational model with time proposed in [22], we translate the temporal treatment proposed for relations to entities: a temporal entity instance e of the entity type E is associated with a temporal element T(e), giving the lifespan of e. In addition, to simplify the integrity constraint enforcement, we accept the notion of temporal normal form proposed in [13] and consider only conceptual schemata in third normal form, by an easy extension of the well-known notion of relational normal forms to the conceptual level, as suggested in [19]. Instead of allowing temporal elements composed of any possible mix of chronons and intervals, we consider four kinds of temporal elements: single chronon, finite set of chronons, single interval, finite set of intervals. For a given temporal entity type only one kind of temporal element is allowed. This led us to extend the IDEF1X model with five kinds of temporal entities representing either events, when the allowed temporal element is a chronon or a set of chronons, or states, when the allowed temporal element is an interval or a set of intervals. An entity instance can have a single lifespan or a history when its temporal element is a set. The semantics of states is that the state is true during its interval; therefore, when state history is represented, the database must satisfy the constraint that the intervals of states of the same entity with different attribute values cannot overlap. To conclude, an additional constraint is available to the designer: if the instance of an entity must always exist inside a given interval, but its state is allowed to change and the history is relevant, then the intervals of the various states of the same entity must always be contiguous. On the basis of the above discussion, we can say that the possible types of entities with temporal treatment are the following five: SP, CMP, MP, E, EV. In the following, the five Chrono types are explained together with their mapping into IDEF1X entities, as shown in Table 1. The mapping consists in the addition of attributes for the representation of time, possibly extending the identifier. Section 4 will discuss the integrity constraints added by this mapping.
SP single period: the entity represents a single state and has an associated time interval; the time interval is represented as a couple of chronons, say Ts (Tstart) and Te (Tend).
CMP consecutive multi-period: the entity is continuously valid during a period, but some of its aspects change over time, in connection with specific
time points; therefore, the evolution of the entity can be seen as a succession of states valid during consecutive periods; its temporal element is a set of contiguous time intervals, and a single, absolute, entity instance generates many entity state instances; in order to represent the entity in terms of attributes and keys as required by the underlying conceptual model, we change the entity identifier, say I, to include an extreme of the time interval, say Ts; in this way, the instances of the temporal entity are different versions of the instances of the original absolute entity.
MP non-consecutive multi-period: the entity is valid during a set of periods, without any constraint of being consecutive; the representation is the same as for the consecutive multi-period type.
E event: the entity represents an event which took place at a specific time point; its time element is a single time point and can be represented by a single attribute Ts.
EV event with versions: the entity represents an event which resulted in a tuple of values of its attributes; the history of the changes to the attribute values is maintained, each change being identified by its specific time point; the time element is a set of chronons and the representation is obtained, as for the consecutive multi-period case, by including the time attribute Ts in the identifier.
Let us consider a set of short examples about the classification above. A human being, from the registry office point of view, is “valid” from his birth, and can be classified as type SP, while a living person, which is a specialization of human being, ends his validity with his death and is of type SP too. A company’s employee changes his description over time, including salary, duties, address, and so on. Each change starts at a specific time and holds up to the following change; therefore the employee description requires a time representation of type CMP. A patient can be hospitalized many times, each time with a specific starting and ending time point, and can be represented with a type MP entity. The documents for the management of an organization are usually marked with a time (for instance, when an officer wrote the document). Provided that no historical record of the document changes is needed, a document can be represented as a type E entity. On the other hand, if the application needs to record the different versions of such documents, the type EV temporal treatment can be used. Up to now we have considered an entity in isolation, but in practical cases we always have many entities linked with various semantic relationships, such as aggregation and generalization hierarchies. In this case the following questions arise: is the temporal treatment of an entity independent from that of the entities related to it? And more, which integrity constraints, if any, govern the time values of a single entity and of related entities? In order to give an answer to the above questions, let us examine the relationships expressed in IDEF1X.
Table 1. Mapping from the Chrono concepts to the IDEF1X concepts

Temporal treatment of entity E | Chrono representation | IDEF1X representation
Single Period (SP) | entity E tagged SP, with identifier Id | identifier Id; time attributes Ts, Te
Consecutive Multi-Period (CMP) | entity E tagged CMP, with identifier Id | identifier (Id, Ts); time attribute Te
Multi-Period (MP) | entity E tagged MP, with identifier Id | identifier (Id, Ts); time attribute Te
Event (E) | entity E tagged E, with identifier Id | identifier Id; time attribute Ts
Event with Versions (EV) | entity E tagged EV, with identifier Id | identifier (Id, Ts)
Same Validity Period identifying relationship (SVP) | identifying relationship marked "=" | plain identifying relationship; the same-validity constraint is enforced at the extensional level
In categorization and identifying relationships a child makes a mandatory reference to the parent, and therefore when it is valid the parent must be valid as well. In summary, the validity of both categorization and identifying relationships is the same as the validity of the child, and therefore they cannot have any temporal treatment of their own. On the other hand, in principle it is acceptable that a parent instance has a validity period wider than that of its child instances. Therefore, the choice of the constraints to be enforced depends also on whether the validity period of the child instance can be contained in that of its parent instance or it must be forced to be the same as that of its corresponding parent instance. We consider as the default case the less constrained one, that is when the validity period can be contained, and introduce as a new design element the SVP identifying relationship (same validity period), as shown in the last row of Table 1. In the IDEF1X mapping the constraint is simply removed at the graphical level, being translated into a constraint at the extensional level. The only constraint on a general non-identifying relationship is that the child instance must be valid inside the validity of its parent instance.
[Fig. 1. Example of relationship with temporal treatment: (a) IDEF1X representation, Employee (Emp#) related to Room (Room#) by the relationship occupancy; (b) Chrono representation, with the reified entity Occupancy (Emp#, Room#) of type MP, related to Employee by an identifying relationship and to Room by a non-identifying relationship.]
Consider, for instance, the cyclic relationship "parent-offspring". It can be represented as a non-identifying relationship between human beings, and the constraint that the parent's birth date is lower than the offspring's birth date must be enforced, while the end of the validity period is infinite and does not give rise to any problem. Non-identifying relationships, either with or without nulls, can hold during an arbitrary interval, contained in the intersection of the validity intervals of the related entities. Thus the relationship can have a kind of temporal treatment of its own. On the other hand, when complex aspects of a relationship need to be described, such as generalization between relationships, the standard conceptual design procedure is reification. A reified relationship is promoted to an entity which is the child in a pair of relationships with the two original entities2. Let us consider, for example, how employees are associated with their office rooms. In a snapshot view, the employee is assigned to exactly one room, as shown in the schema of Fig. 1.a. If we want to model the fact that an employee can change his room over time, and we want to keep track of this history, we have to modify the schema as follows:
1. reify the relationship occupancy, producing the dependent entity Occupancy;
2. select for Occupancy one of the Chrono temporal entities, say MP, to specify that an employee in a given time period is assigned to at most one room;
3. specify the proper cardinality for the relationships of Occupancy: the relationship with Employee derives from the child side and is identifying, while the relationship with Room is non-identifying.
The modified schema is shown in Fig. 1.b. For the intensional level we will define the allowed temporal treatment combinations for related entities, and for the extensional level we will state the constraints that rule the insert, update and delete operations on entities. These constraints can be translated into constraints on relational database operations by many design tools.
2
This choice could be considered as pertaining to the logical level, rather than the conceptual level, but it is coherent with the philosophy of IDEF1X, which is a compromise between a conceptual and a logical data model.
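As an illustration of the reification step just described, the following sketch (not taken from the paper; Python is used only for exposition, and everything beyond the Employee/Occupancy/Room names and the key (Emp#, Ts) is an assumption) shows the shape of the reified Occupancy entity once the Chrono MP treatment has extended its identifier with the starting time:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Employee:
    emp_no: str                      # identifier Emp#

@dataclass(frozen=True)
class Room:
    room_no: str                     # identifier Room#

@dataclass(frozen=True)
class Occupancy:
    """Reified 'occupancy' relationship, given the Chrono type MP.

    The IDEF1X mapping extends the identifier with Ts, so the key is
    (emp_no, ts); te closes the validity interval."""
    emp_no: str                      # from the identifying relationship with Employee
    ts: date                         # Tstart, added to the key by the MP mapping
    te: date                         # Tend
    room_no: str                     # from the non-identifying relationship with Room

# One employee occupying different rooms over non-contiguous periods:
history = [
    Occupancy("E1", date(1996, 1, 1), date(1996, 6, 30), "R12"),
    Occupancy("E1", date(1997, 3, 1), date(1997, 12, 31), "R15"),
]
```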
3 Design Constraints on Temporal Entities
When multiple versions of an entity instance are allowed (i.e. for Chrono entities of types CMP, MP and EV) the mapping from a Chrono entity into an IDEF1X entity gives rise to an extension of the entity identifier Id with one of the time attributes (say the starting time), thus obtaining the key K=(Id,Ts). In this way the entity instance uniqueness is preserved by the uniqueness of the identifier, which is supposed to be time invariant, and of the times at which the versions start: it is not allowed to have different versions starting from the same time point. In the following we will refer both to the Chrono representation of entities and to their corresponding IDEF1X representation. In particular, we define a version of an entity instance (or briefly, a version) as an instance of an IDEF1X entity: an instance of an entity of type CMP, MP or EV may correspond to several versions, sharing the same Id but with different time elements. In a standard snapshot database, a foreign key constraint has a straightforward implication: an instance of a child entity must have a valid reference to an instance of a parent entity (or a null reference if it is allowed). In a temporal database the validity of a reference must also take time into account. Therefore it is necessary to extend the notion of referential integrity, by ensuring that the validity times of two instances of related entities overlap. Otherwise, it could be the case that, at some time point, a child instance refers to a parent instance which is not valid at that time point. A relationship between two entities implies two major consequences:
– at the intensional level, the allowed temporal treatments of the entities are constrained: the constraints take into account the relationship type and the compatibility between the different temporal treatments;
– at the extensional level, the temporal interaction between two instances of connected entities is subject to additional integrity constraints, which have to be enforced.
The following subsections examine the constraints to be enforced for each kind of IDEF1X relationship; the constraints to be enforced when insert, update and delete operations are performed are a consequence of the ones of this section and will be examined in Sect. 4.
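A minimal sketch of these temporal referential-integrity tests, assuming closed validity intervals represented as (start, end) pairs (the code and names are illustrative, not part of Chrono):

```python
def overlaps(a, b):
    """True if two closed intervals share at least one time point,
    the weakest condition for a temporally valid reference."""
    (s1, e1), (s2, e2) = a, b
    return s1 <= e2 and s2 <= e1

def contained(child, parent):
    """Stricter test used for identifying relationships (Sect. 3.2):
    the child's validity must lie entirely inside the parent's."""
    (cs, ce), (ps, pe) = child, parent
    return ps <= cs and ce <= pe

assert overlaps((5, 10), (8, 20))
assert contained((6, 9), (5, 10)) and not contained((6, 12), (5, 10))
```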
3.1 Categorization Relationship
A complete category cluster specifies that each instance of the parent is related to exactly one instance of one of the children in the cluster, and vice-versa. When time is considered, this constraint must hold for any snapshot. Let τ ∈ {SP, CMP, MP, E, EV} be a type of temporal treatment, Ep the parent entity and Eci the i-th child entity. At the intensional level, the following constraints hold:
1. if Ep is of type τ, then all of its children Eci must be of the same type τ;
Table 2. Identifying parent-child relationship: allowed combinations (rows: parent type P; columns: child type C)

Case a
P \ C   SP    CMP   MP    E     EV
SP      yes   yes   yes   no    no
CMP     yes   yes   yes   no    no
MP      yes   yes   yes   no    no
E       no    no    no    yes   no
EV      no    no    no    yes   yes

Case b
P \ C   SP    CMP   MP    E     EV
SP      yes   yes   no    no    no
CMP     yes   yes   no    no    no
MP      yes   yes   yes   no    no
E       no    no    no    yes   no
EV      no    no    no    no    yes
2. if Ep is absolute, then at least one of its children must be absolute, since otherwise there could exist a snapshot and an instance ep ∈ Ep such that there does not exist a valid instance eci ∈ Eci for any i3.
When the category cluster is incomplete, constraint 2 above does not hold.
3.2 Identifying Relationship
Each child instance eci ∈ Eci is completely dependent on a parent instance ep ∈ Ep. Thus, the validity period of an instance eci must be contained in the validity period of its related parent ep. Otherwise, there would exist at least one point in time at which the child instance violates referential integrity. On the other hand, in principle it is acceptable that a parent instance has a validity period wider than that of its child instances. Therefore, the choice of the constraints to be enforced depends also on an application-dependent requirement:
a) the validity period of the child instance can be contained in that of its parent instance;
b) the validity period of the child instance must be forced to be the same as that of its corresponding parent instance.
Case a - validity period of child instance included in validity period of parent instance. In this case, the child can have a temporal treatment even if the parent is an absolute entity. On the contrary, if the parent has a temporal treatment then the child too must have one (for a less restrictive choice see footnote 3). Table 2, case a, shows the allowed combinations; let us briefly comment on some of them. When the parent entity is of type SP, a single parent instance can be connected to many different versions of child instances, and since the child validity can be included in the parent validity, every kind of temporal treatment is allowed for children. Vice-versa, when the parent is of type E, there is no room for different versions of a child or for a child validity spanning over an interval.
3
A less restrictive choice could be to move this constraint to the extensional level, ensuring that the lifespan of a parent instance is covered by the union of the lifespans of all its children.
Case b - same validity period. In this case it is not possible that one entity is absolute and the other has a temporal treatment. Nevertheless, the temporal treatments of parent and child entities are not constrained to be of the same type, even though the combination of types is not arbitrary. Table 2, case b, shows the allowed combinations; let us briefly comment on some of them. When the parent is of type SP, its child can be either of type SP or CMP. In fact, a single parent instance ep with validity [t1, t3] can correspond to two versions of child instances, say ec[t1,t2] and ec[t2,t3], which together cover the parent validity interval. When the parent is of type CMP, its child can be either of type SP or CMP. The second case is straightforward, while the first one is analogous to that of the previous paragraph, by exchanging parent and child. When the parent is an event-with-versions entity, a parent is represented by many versions with different validity intervals and can be related to one or more child instances. Therefore, the child must be of type E or EV, since the "period" types could not ensure the coincidence of validity intervals.
3.3 Non-identifying Relationship without Nulls
The only difference between this case and the identifying relationship is that the child key is now independent of the parent key. Apart from that, child instances must be related to parent instances; therefore their validity must lie inside the parent validity, and the same constraints as in Sect. 3.2 apply.
3.4 Non-identifying Relationship with Nulls
In this case, child instances can have a null reference to the parent. Vice-versa, when the reference is not null, we require it to be valid, i.e. the child validity is included in the parent validity4. The allowed combinations of temporal treatment are the same as those discussed for case a.
4 Constraints on Temporal Entity Instances
This section examines the intra-entity and the inter-entity constraints which must be enforced to guarantee the consistent evolution of a database deriving from a Chrono conceptual schema. In the following we refer interchangeably to a Chrono conceptual schema and to its mapped IDEF1X representation. The constraints considered include both rules on the values of the time attributes and prerequisite and integrity-maintenance actions for the insert, delete and update operations. As a necessary premise, we must consider the influence of time-related operations. We will consider separately the constraints related to single entities and those deriving from referential integrity. In particular, let us consider the following two basic time-related operations [24]:
4
Because of the weaker parent-child relationship it is not worth considering the equality case as in Sect. 3.2 case b.
coalesce: given n value-equivalent instances of the same entity with adjacent time intervals, generate a single entity instance with the maximal time interval;
split: given an entity instance and a time point inside its time interval, generate a pair of value-equivalent entity instances with adjacent time intervals.
We assume that the above operations are allowed and that the database is always kept in a coalesced state (i.e. no further coalescing is possible). Therefore insertions and updates on the database will trigger correction actions when necessary.
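The two operations can be sketched as follows (an illustrative fragment, not part of Chrono; it assumes integer chronons and closed intervals, with adjacency meaning that one interval ends exactly one chronon before the next starts, and it leaves value-equivalence checking to the caller):

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # (Ts, Te), closed, over integer chronons

def coalesce(versions: List[Interval]) -> List[Interval]:
    """Merge value-equivalent versions whose intervals are adjacent
    (or overlapping) into maximal intervals."""
    out: List[Interval] = []
    for ts, te in sorted(versions):
        if out and ts <= out[-1][1] + 1:          # adjacent or overlapping
            out[-1] = (out[-1][0], max(out[-1][1], te))
        else:
            out.append((ts, te))
    return out

def split(version: Interval, t: int) -> Tuple[Interval, Interval]:
    """Split one version at an internal time point t into two
    value-equivalent versions with adjacent intervals."""
    ts, te = version
    if not (ts < t <= te):
        raise ValueError("split point must fall inside the interval")
    return (ts, t - 1), (t, te)

print(coalesce([(1, 3), (4, 9)]))   # -> [(1, 9)]
print(split((1, 9), 5))             # -> ((1, 4), (5, 9))
```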
4.1 Single Period
In this case, we have only the obvious intra-instance constraint that Ts ≤ Te must always hold, i.e. the time interval is non-empty. The constraint is to be enforced when attempting an insert or update operation.
4.2 Consecutive Multi-Period
The constraints which rule instances of this kind of temporal entity depend on the kind of operations that are accepted at the application level. In particular, we consider the impact of the availability of the operations of coalescing and splitting (or split). If these operations are allowed, then the following constraints are enforced:
Insert: if the time interval of the new entity instance is non-empty and there does not exist an instance with the same value of Id, the insert is accepted; otherwise only the following cases are acceptable:
1. if the new instance starting time Ts meets the ending time of the most recent version of the same instance, the insert represents a new version of an existing instance and is accepted5;
2. if the new instance ending time meets the starting time of the oldest version of the same instance, the insert constitutes an extension of the validity into the past and is accepted;
3. if the new instance ending time Te meets the ending time of an existing version, but the new starting time is greater than the old one, the insert corresponds to a splitting operation and is accepted; as a consequence, the starting and ending times of the new and old versions are updated, to preserve adjacency.
Update: if the time interval of the new entity instance is non-empty and there does not exist an instance with the same value of Id, the update is accepted; otherwise only the following cases are acceptable:
1. if the update does not affect time attributes, the update is valid;
2. if the update modifies the starting time Ts of the oldest version or the ending time of the most recent version, it is accepted;
5
As an alternative, the most recent version could be open-ended.
3. if the update modifies the time interval to cover exactly the time intervals of two or more consecutive versions of the same instance, these versions have to be eliminated (coalescing).
Delete: for each instance selected for deletion with identifier component Idi, there are two possible cases:
1. there exists only one instance with Idi, or the deleted instance is the oldest or the most recent version, and the deletion is accepted;
2. the instance is deleted and a coalescing operation is required to preserve the adjacency of the remaining versions.
If the application logic does not allow automatic coalescing and splitting, the cases requiring such operations, namely the last one presented for each of the above operations, are not accepted.
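For concreteness, the insert rules for a CMP entity could be checked along the following lines (an illustrative sketch, not the Chrono implementation; it reuses the closed-interval, integer-chronon convention of the previous fragment and omits the correction actions that a real trigger would also perform):

```python
def cmp_insert_allowed(existing, new_ts, new_te):
    """existing: list of (ts, te) versions already stored for the same Id."""
    if new_ts > new_te:
        return False                                  # empty time interval
    if not existing:
        return True                                   # first version of this Id
    oldest_ts = min(ts for ts, _ in existing)
    newest_te = max(te for _, te in existing)
    if new_ts == newest_te + 1:
        return True                                   # case 1: new most recent version
    if new_te == oldest_ts - 1:
        return True                                   # case 2: extension into the past
    # case 3: the new version splits an existing one (the old version's
    # times must then be adjusted to preserve adjacency, not shown here)
    return any(new_te == te and new_ts > ts for ts, te in existing)

print(cmp_insert_allowed([(1, 10)], 11, 20))          # case 1 -> True
print(cmp_insert_allowed([(1, 10)], 5, 10))           # case 3 -> True
```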
4.3 Non-consecutive Multi-Period
In this case, we lose the constraint that the ending time of a version meets the starting time of the consecutive version, and the only requirement is that the versions must be non-overlapping. Given an instance to be inserted, if the time interval of the new instance is non-empty and there does not exist an instance with the same value of the entity identifier component Id, the insert is accepted; otherwise the insert is acceptable only if the time interval of the new instance does not overlap the time intervals of the already existing instances with the same identifier Id. For the update and delete operations, the constraints are the same as for the insert.
4.4 Event and Event with Versions
In this case, a single time attribute is sufficient for the temporal representation. Single-event entities do not need any special constraint checking, while event-with-versions entities extend the key with the time attribute; therefore the key uniqueness must be verified on insert and update operations.
4.5 Categorization Relationship
When Ep is of type τ, each instance eci ∈ Eci is related to an instance ep and the two instances must be valid at the same time. Therefore, at the extensional level the following constraints hold:
1. the time attribute Ts (and Te if τ ∈ {SP, CMP, MP}), together with the parent identifier component Id, have the semantics of a foreign key from Eci to Ep;
2. for each parent instance ep there must exist a single instance eci in exactly one Eci with the same value of Id and Ts (and Te if τ ∈ {SP, CMP, MP}).
The above constraints hold for any τ. Insert, update and delete operations, both on the parent and on the child, must be performed in accordance with them. When the category cluster is incomplete, constraint 2 above does not hold and only parent update and delete and child insert and update are subject to constraint enforcement.
4.6 Identifying Relationship
In this case, parent insert, child insert and child update must check the inclusion of the validity period of the child in that of the parent. The attributes (Id, Ts, Te) have the semantics of a foreign key from the child to the parent, and the operations parent update, parent delete, child insert and child update must enforce it.
4.7 Non-identifying Relationship
The constraints take into account the possibility of a null reference from the child to the parent, and check the validity interval when the reference is not null.
5 Implementation of Chrono
The most direct method for the implementation of Chrono would be to extend a CASE tool for conceptual design in order to deal with time attributes and to generate the appropriate integrity constraints. At present it was not possible either to build a CASE tool from scratch or to access the source code of an existing tool. For this reason we had to build an additional software layer on top of a CASE tool, but this was sufficient to prove the feasibility and the usefulness of the project. We chose the CASE tool ERwin (from Logic Works), which works in the MS Windows environment and is based on the IDEF1X model. ERwin has a graphic interface to support conceptual schema design and is able to automatically generate relational schemata and triggers for many popular RDBMSs. The triggers are generated starting from trigger templates, with some peculiarities for the various supported DBMSs. Some trigger templates are added by Chrono to perform time-related checks and actions. Chrono operates by reading and modifying the files generated by ERwin describing a conceptual schema. The flow of the design activity consists of the following steps:
1. the designer prepares a usual IDEF1X conceptual schema with ERwin;
2. the schema is stored in a file with extension .ERX;
3. Chrono reads the .ERX file and interprets the schema description;
4. the designer adds the temporal treatment to the entities of the conceptual schema;
5. Chrono writes a modified .ERX file including the schema modifications necessary for the representation of time and the triggers for the preservation of data integrity for time-related attributes;
6. ERwin imports the modified .ERX file for possible schema refinement.
Report [2] shows more details on the architecture of Chrono.
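Steps 4 and 5 of this flow, deriving the schema changes for a chosen temporal treatment, can be outlined as below. This is only a hypothetical sketch, not the actual Chrono code: the in-memory schema structure is invented for illustration, and the reading and writing of the .ERX files is deliberately omitted.

```python
# Hypothetical sketch: how a temporal treatment could be turned into the
# attribute and key changes of Table 1 (not the real Chrono/ERwin API).
TIME_RULES = {
    "SP":  {"add": ["Ts", "Te"], "extend_key": False},
    "CMP": {"add": ["Ts", "Te"], "extend_key": True},
    "MP":  {"add": ["Ts", "Te"], "extend_key": True},
    "E":   {"add": ["Ts"],       "extend_key": False},
    "EV":  {"add": ["Ts"],       "extend_key": True},
}

def apply_temporal_treatment(schema, treatments):
    """schema: {entity: {"attributes": [...], "key": [...]}} (assumed shape);
    treatments: {entity: "SP" | "CMP" | "MP" | "E" | "EV"}."""
    for entity, kind in treatments.items():
        rule = TIME_RULES[kind]
        schema[entity]["attributes"] += rule["add"]
        if rule["extend_key"]:
            schema[entity]["key"] = schema[entity]["key"] + ["Ts"]   # K = (Id, Ts)
        # the trigger templates enforcing the constraints of Sect. 4 would
        # be attached to the entity here as well
    return schema

schema = {"Occupancy": {"attributes": ["Room#"], "key": ["Emp#"]}}
print(apply_temporal_treatment(schema, {"Occupancy": "MP"}))
```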
6 Discussion and Conclusions
We introduced a conceptual data model extending IDEF1X for the representation of time and discussed both the conceptual design constraints and the integrity constraints introduced by the representation of time. The conceptual design constraints introduce limitations on the possible kinds of temporal treatment for related entities, while the integrity constraints are used to guarantee consistent database states with reference to the temporal treatment. The idea of extending conceptual models in order to support time representation has received some attention in the literature. The report [10] provides a comprehensive survey on the topic and identifies nineteen design criteria to compare the effectiveness of the surveyed models [8,15,16,21,6,7,4,5,26,20,11]. All the models above are extensions of the Entity-Relationship model, but can easily be compared to Chrono, since IDEF1X too is strictly related to the Entity-Relationship model. Report [2] compares Chrono with the models above. To summarize, we can say that Chrono couples a significant expressive power with the reuse of the existing available technology, obtaining a feasible and low-cost approach to the effective representation of time in a database. The most significant evolution of our work would be a deep analysis of the design and constraint issues deriving from a more sophisticated time model, for instance the bi-temporal conceptual model.
Acknowledgements. Thanks to Gruppo Formula S.p.A. for the support, to Paolo Pellizzardi for the suggestions and discussions and to Giorgio Ferrara for the programming effort.
References
1. C. Batini, S. Ceri, and S. B. Navathe. Conceptual Database Design: an Entity-Relationship Approach. The Benjamin/Cummings Publishing Company, 1992.
2. S. Bergamaschi and C. Sartori. Chrono: a conceptual design framework for temporal entities. Technical Report CSITE-011-98, CSITE - CNR, 1998. ftp://wwwdb.deis.unibo.it/pub/reports/CSITE-011-98.pdf.
3. P. Chen. The Entity-Relationship model - towards a unified view of data. ACM Trans. on Database Systems, 1(1):9–36, 1976.
4. R. Elmasri, I. El-Assal, and V. Kouramajian. Semantics of temporal data in an extended ER model. In 9th Int. Conf. on the Entity-Relationship Approach, pages 239–254, Lausanne, Switzerland, 1990.
5. R. Elmasri and V. Kouramajian. A temporal query language for a conceptual model. Lecture Notes in Computer Science, 759:175–??, 1993.
6. R. Elmasri and G. Wuu. A temporal model and query language for ER databases. In Proc. IEEE CS Intl. Conf. No. 6 on Data Engineering, Feb. 1990.
7. R. Elmasri, G. T. J. Wuu, and V. Kouramajian. A temporal model and query language for EER databases. In Tansel et al. [25], chapter 9, pages 212–229.
8. S. Ferg. Modeling the time dimension in an entity-relationship diagram. In 4th International Conference on the Entity-Relationship Approach, pages 280–286, Silver Spring, MD, 1985. IEEE Computer Society Press.
9. S. K. Gadia and C. S. Yeung. A generalized model for a relational temporal database. In ACM SIGMOD, pages 251–259, 1988.
10. H. Gregersen and C. S. Jensen. Temporal entity-relationship models - a survey. Technical Report TR-3, Time Center, January 1997. http://www.cs.auc.dk/research/DBS/tdb/TimeCenter/publications.html.
11. J. L. Guynes, V. S. Lai, and J. P. Kuilboer. Temporal Databases: Model Design and Commercialization Prospects. Database, 25(3), Aug. 1994.
12. C. S. Jensen and R. T. Snodgrass. Semantics of time-varying information. Information Systems, 21(4):311–352, 1996.
13. C. S. Jensen, R. T. Snodgrass, and M. D. Soo. Extending existing dependency theory to temporal databases. IEEE Transactions on Knowledge and Data Engineering, 8(4):563–582, 1996.
14. C. S. Jensen, M. D. Soo, and R. T. Snodgrass. Unifying temporal models via a conceptual model. Information Systems, 19(7):513–547, 1994.
15. M. R. Klopprogge. TERM: An approach to include the time dimension in the entity-relationship model. In Proceedings of the Second International Conference on the Entity Relationship Approach, pages 477–512, Washington, DC, Oct. 1981.
16. M. R. Klopprogge and P. C. Lockemann. Modelling information preserving databases: Consequences of the concept of time. In M. Schkolnick and C. Thanos, editors, Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 399–416, Florence, Italy, 1983.
17. Knowledge Based Systems, Inc. SmartER - information and data modeling and database design. Technical report, Knowledge Based Systems, Inc., Austin, USA, 1997. http://www.kbsi.com/products/smarter.html.
18. Logic Works, Inc. ERwin/ERX. Technical report, Logic Works, Inc., 1997. http://www.logicworks.com/products/erwinerx/index.asp.
19. H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison-Wesley, 1993.
20. P. McBrien, A. H. Selveit, and B. Wangler. An entity-relationship model extended to describe historical information. In International Conference on Information Systems and Management of Data (CISMOD'92), pages 244–260, Bangalore, India, July 1992.
21. A. Narasimhalu. A data model for object-oriented databases with temporal attributes and relationships. Technical report, National University of Singapore, 1988.
22. S. Navathe and R. Ahmed. Temporal extensions to the relational model and SQL. In Tansel et al. [25], chapter 4, pages 92–109.
23. Federal Information Processing Standards Publication. Integration definition for information modeling (IDEF1X). Technical Report 184, National Institute of Standards and Technology, Gaithersburg, MD 20899, 1993.
24. R. T. Snodgrass. The temporal query language TQuel. ACM Trans. Database Syst., 12(2):247–298, 1987.
25. A. U. Tansel et al., editors. Temporal Databases: Theory, Design and Implementation. Benjamin/Cummings, 1993.
26. C. Theodoulidis, P. Loucopoulos, and B. Wangler. A conceptual modelling formalism for temporal database applications. Information Systems, 16(4):401–416, 1991.
Designing Well-Structured Websites: Lessons to Be Learned from Database Schema Methodology
Olga De Troyer
Tilburg University, INFOLAB, Tilburg, The Netherlands
[email protected]
Abstract. In this paper we argue that many of the problems one may experience while visiting websites today may be avoided if their builders adopt a proper methodology for designing and implementing the site. More specifically, introducing a systematic conceptual design phase for websites, similar in purpose and technique to the conceptual design phase in database systems, proves to be effective and efficient. However, certain differences such as adopting a user-centered view are essential for this. Existing database design techniques such as ER, ORM, OMT are found to be an adequate basis for this approach. We show how they can be extended to make them appropriate for website design. We also indicate how conceptual schemes may be usefully deployed in future automation of site creation and upkeep. Furthermore, by including parts of such a conceptual schema inside the site, a new generation of search engines may emerge.
1 Introduction
The World Wide Web (WWW) offers a totally revolutionary medium for asynchronous computer-based communication among humans, and among their institutions. As its primary use evolves towards commercial purposes, competition for the browser's attention, often split-second, is now a dominating issue. This has forced the focus of website design towards visual sophistication. Websites must be 'cool, hip, killer'. Most of the literature on website design therefore appears to deal with graphics, sound, animation, or implementation aspects. The content almost seems to be of less importance. Most 'web designers' have certainly never been schooled in traditional design principles nor in fundamental communication techniques. They have 'learned' to design websites by looking at other websites and by following a 'trial-and-error' principle. In addition, the Web is constantly in evolution, outdating itself nearly daily. The combination of all these factors for an individual website easily leads to problems of maintenance but also of elementary usability. Indeed, as any database designer knows, if the represented information is not structured properly, maintenance problems occur which are very similar to those in databases: redundancy, inconsistency, incompleteness and obsolescence. This is not surprising as websites as well as databases may provide (large) amounts of information which need to be maintained. The same aspects also lead to usability problems.
These are particularly obnoxious as they are problems experienced by the target audience of the website:
• Redundancy. Information which is needlessly repeated during navigation is annoying to most users.
• Inconsistency. If information on the site is found to be inconsistent, the user will probably distrust the whole site.
• Incompleteness. Stale and broken links fall into this category, but incompleteness is also experienced by users who cannot find the information which they expect to be available on a site.
• Actuality. Organizations and information are often changing so quickly that the information provided on websites soon becomes out of date. If a website has visibly not been updated for a while, confidence of users in the information provided is likely not to be very high.
Other usability problems are caused by:
• Lack of a mission statement. If the website has no declared goal, that goal, quite simply, cannot be reached. The key question, therefore, that must be answered by its owner first is "What do I want to get out of my site?". This mission statement is the basis for any evaluation of the effectiveness of the site.
• Lack of a clearly identified target audience. The target audience is the audience which will be interested in the site. If one does not have a clear understanding of one's target audience, it is quite difficult to create a compelling and effective site.
• Information overload. Users typically are not interested in wading through pages and pages of spurious "information". Also, attention spans tend to be short.
• The lost-in-hyperspace syndrome [11]. Hypertext requires users to navigate through the provided information. If this navigation process is not well structured or guided, users may easily get lost. This makes it more difficult and time-consuming to locate the desired information.
The use of a proper design method could help solve some of these problems. A number of researchers have already recognized the lack of a design method for websites, or, more generally, for web-based information systems, and have proposed methods: HDM [7] and its successors HDM2 [6] and OOHDM [13], RMM [9], W3DT [17], the method for analysis and design of websites in [15], and SOHDM [10]. Older methods (HDM, OOHDM, RMM) were originally designed for hypertext or hypermedia applications and do not deal comfortably with web-specific issues. In addition, these methods are very much data-driven or implementation-oriented. Some have their origin in database design methods like the E-R method [1] or object-oriented (OO) methods such as OMT [12]. These methods may be able to solve maintenance problems to some extent, but they do not address the other usability problems mentioned above. In [4], we have proposed a website design method, called WSDM, which is 'user-centered' rather than 'data-driven'. In a data-driven method the data available in the organization is the starting point of the modeling approach. In our approach, however, the starting point is the target audience of the website. The issues related to this target audience run through the method like a continuous thread. We will explain the differences between data-driven and user-centered in more detail in section 3.1.
We argue that our approach results in websites which are more tailored to their users and therefore have a higher usability and greater satisfaction coefficient. WSDM also makes a clear distinction between the conceptual design and the design of the actual presentation. The conceptual design, as in database design, is free from implementation details and concentrates on the content and the structuring of the website. The design of the presentation takes into consideration the implementation language used, the grouping in pages, and the actual ‘look and feel’ of the website. This distinction is comparable to the distinction made in database design between the conceptual design (e.g. an E-R schema [1]) and the logical design (e.g. a relational schema). The purpose of this paper is to explain the concept of a conceptual schema within the context of a website design method (section 3) and to identify the different roles it plays in the life cycle of the website (section 4). In section 2 we give a short overview of the different phases of our WebSite Design Method. Section 5 concludes the paper.
2 The WebSite Design Method (WSDM)
We only present a brief overview of WSDM; a more detailed description can be found in [4] and [5]. The method currently concentrates on kiosk websites. A kiosk website [9] mainly provides information and allows users to navigate through that information. An application website is a kind of interactive information system where the user interface is formed by a set of web pages. The core of the method consists of the following phases: User Modeling, Conceptual Design, Implementation Design and the actual Implementation (see Fig. 1 for an overview). We suppose that the mission statement for the website has been formulated before the start of the User Modeling phase. The mission statement should describe the subject and the purpose of the website as well as the target audience. Without giving due consideration to these issues, there is no proper basis for decision making, or for the evaluation of the effectiveness of the website. As an example we consider the mission statement of a typical university department website. It can be formulated as follows: "Provide information about the available educational programmes and the ongoing research to attract more students, researchers and companies, and enhance the internal communication between students and staff members". The User Modeling phase consists of two sub-phases: User Classification and User Class Description. In the User Classification we identify the future users or visitors of the website and classify them into user classes. The mission statement will give an indication of the target audience, but this has to be refined. One way of doing this is by looking at the organization or the business process which the website should support. Each organization or business process can be divided into a number of activities. Each activity involves people. These people are potential users/visitors of the site. In our method, a user class is a subset of all the potential users who are similar in terms of their information requirements. Users from the same user class have the same information requirements. As an example, the user classes of our university example are: Candidate Students, Enrolled Students,
Researchers, Staff Members and Companies. User classes need not be disjoint. The same person may be in different user classes depending on the different roles he plays in the organizational environment. For example, a person can be an enrolled student as well as a staff member.
[Fig. 1. Overview of the WSDM phases: User Modeling (User Classification, User Class Description), Conceptual Design (Object Modeling, Navigational Design), Implementation Design, Implementation.]
In the User Class Description, the identified user classes are analyzed in more detail. We not only describe (informally) the information requirements of the different user classes, but also their usability requirements and characteristics. Some examples of user characteristics are: level of experience with websites in general, language issues, education/intellectual abilities, age. Some of the characteristics may be translated into usability requirements, while others may be used later on in the implementation phase to guide the design of the 'look and feel' of the website, e.g. younger people tend to be more visually oriented than older people. Although all users from a single user class potentially have the same information requirements, they may diverge with respect to their characteristics and usability requirements. For example, within the user class Enrolled Students we may distinguish between local students and exchange students. They have the same information requirements (detailed information on courses) but have different characteristics and usability requirements. Local students are young (between 18 and 28), and are familiar with the university jargon, the university rules and customs. They have a good level of experience with the WWW. They prefer the local language for communication, but have in general a good understanding of English. On the other hand, all communication with exchange students is done in English. We may not presume that they are familiar with the university jargon and customs, or with the WWW. To support different characteristics and usability requirements within a single user class, we use perspectives. A perspective is a kind of user subclass. We define a perspective as all users in a user class with the same characteristics and usability
requirements. For the user class Enrolled Students we may distinguish two perspectives: Local Students and Exchange Students. The Conceptual Design phase also consists of two sub-phases: the Object Modeling and the Navigational Design. During Object Modeling the information requirements of the different user classes and their perspectives are formally described in a number of conceptual schemes. How this is done is described in section 3. During the Navigational Design we describe how the different users will be able to navigate through the website. For each perspective a separate navigation track will be designed. It is precisely the approach taken in the Object Modeling and Navigational Design, based on user classes and perspectives, that constitutes the user-centered approach of WSDM and its departure from purely classic information system modeling. In the Implementation Design we essentially design the 'look and feel' of the website. The aim is to create a consistent, pleasing and efficient look and feel for the conceptual design made in the previous phase. If the information provided by the website will be maintained by a database, then the implementation design phase will also include the logical design of this database. The last phase, Implementation, is the actual realization of the website using the chosen implementation environment, e.g. HTML.
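As a concrete reading of these phases, the user classes and perspectives of the running example could be captured by data structures along the following lines (an illustrative sketch only; the class and field names are not part of WSDM):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Perspective:
    name: str
    characteristics: List[str] = field(default_factory=list)
    usability_requirements: List[str] = field(default_factory=list)

@dataclass
class UserClass:
    name: str
    information_requirements: List[str] = field(default_factory=list)
    perspectives: List[Perspective] = field(default_factory=list)

enrolled_students = UserClass(
    name="Enrolled Students",
    information_requirements=["detailed information on courses"],
    perspectives=[
        Perspective("Local Students",
                    characteristics=["age 18-28", "knows university jargon",
                                     "experienced WWW user"],
                    usability_requirements=["local language preferred"]),
        Perspective("Exchange Students",
                    characteristics=["no assumed jargon or WWW experience"],
                    usability_requirements=["communication in English"]),
    ],
)
```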
3 The Conceptual Design of a Website
During the User Modeling phase, the requirements and the characteristics of the users are identified and different user classes and perspectives are recognized. The aim of the Conceptual Design phase is to turn these requirements into a high-level, formal description which can be used later on to generate (automatically or semi-automatically) effective websites. During Conceptual Design, we concentrate on the conceptual 'what and how' rather than on the visual 'what and how'. This means that, as in database design, we describe what kind of information will be presented (object types and relationships; the conceptual 'what'), but unlike in database design we also describe how users will be able to navigate through the information (the conceptual 'how'). This is needed because navigating through the information space is an essential characteristic of websites. If the navigation is not (well) designed or not adapted to the target audience, serious usability problems occur. The conceptual 'what' is covered by the Object Modeling step, the conceptual 'how' by the Navigational Design.
3.1 Object Modeling in a User-Centered Approach
In WSDM, the Conceptual Object Modeling results in several different conceptual schemes, rather than in a single one as in classical database design. This is because we have opted for a user-centered approach. In a data-driven approach, as used in database design, the starting point of a conceptual design is the information available in the organization: designers first model the application domain, and subsequently they associate information with each class of users (e.g. by means of views or
external schemes). However, the data and the way it is organized in the application domain may not reflect the user's requirements. A good example of such a mismatch can be found in the current website of our university1. The structure of this website completely reflects the internal organizational structure of our university. This structure is completely irrelevant to, and unknown to, most users of this site. As an example, if you want to look on the web for the products offered by the Computer-shop of our university (called PC-shop), you must know that the PC-shop is part of the Computer Center (actually it is one of the 'External Services' of the Computer Center), which itself is a 'Service Department' of the University. You will not find it under 'Facilities' like the Restaurant, the Copy-shop or the Branch Bank. In our user-centered approach we start by modeling the information requirements of the different types of users (user classes). Note that we make a distinction between a user-centered approach and a user-driven approach. In a user-driven approach the users are actively involved in the design process, e.g. through interviews, during scenario analysis, prototyping and evaluation. This is not possible for kiosk websites on the internet because most of the users are unknown and cannot be interviewed in advance or be involved in the design process. However, we can fairly well identify the different types of users and investigate their requirements. After all, the main goal of a kiosk site is to provide information. Therefore, for each user class a conceptual schema is developed expressing the information needs of that type of user. We call these conceptual schemes user object models. Like an "ordinary" conceptual schema, a user object model (UOM) is expressed in terms of the business objects of the organization.
[Fig. 2. User object model for Enrolled Students (OMT notation): the OTs Course (Id, Name, Description, Newsgroup, Exam Type, Required Reading, Programme Year), Exam (Date, Room, Time, Duration), Course Material (Id, Name, Price, Date of Issue) and Lecturer (Name, Title, Room, Tel, E-Mail), related among others by giving/given by, requiring, prerequisite, using/used for and written by/author of.]
In [5] we explain how a user object model is constructed from the information requirements expressed in a user class description.
1
http://www.kub.nl/ (in Dutch).
For each requirement a so-called object chunk is constructed. Next, the object chunks of one user class are merged into a single model. In conceptual modeling in general, object models describe the different object types (OTs), the relationships between these OTs, and rules or constraints. OO models also describe behavior. For our purpose (modeling kiosk websites), modeling behavior is not (yet) needed. The traditional conceptual modeling methods like E-R [1], the Object-Role Model [8], [16], [2], or "true" OO methods like OMT [12] are therefore all suitable. Figure 2 shows the UOM (in OMT notation) developed for the user class Enrolled Students of our university department example.
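To make the merging step concrete, a deliberately simplified sketch is given below; it represents an object chunk as a mapping from OT names to attribute sets and ignores relationships and constraints (all of which WSDM of course also covers):

```python
def merge_chunks(chunks):
    """Merge per-requirement object chunks into one user object model.
    Each chunk maps an OT name to a set of attribute names; the merge is
    simply the union of OTs and of their attributes."""
    uom = {}
    for chunk in chunks:
        for ot, attributes in chunk.items():
            uom.setdefault(ot, set()).update(attributes)
    return uom

chunk_exams = {"Course": {"Id", "Name"}, "Exam": {"Date", "Room", "Time"}}
chunk_reading = {"Course": {"Id", "Required Reading"},
                 "Course Material": {"Id", "Name", "Price"}}
print(merge_chunks([chunk_exams, chunk_reading]))
```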
3.2 Object Type Variants
As explained, the same user class may include different perspectives expressing different usability requirements and characteristics. It is possible that this also results in slightly different information requirements. In WSDM, we model this by means of variants for OTs. A variant of some OT corresponds largely with the original OT but has some small differences (variations). Consider as an example the OT Course for the user class Enrolled Students. See Fig. 2 for a graphical representation. About a course, enrolled students in general need the following information: the identification number of the course, the name of the course, a description of the content of the course, the prerequisites for the course, specification of the required reading, the type of exam of the course, the name of the newsgroup of the course and the programme year in which the course may be followed. However, for the subgroup (perspective) Local Students we want to offer this information in the local language, while for the subgroup Exchange Students the information must be provided in English. Also the programme year is not relevant for exchange students and (the actual value of) the prerequisites, the required reading and the exam type may differ between exchange students and local students. Indeed, local students may have required reading written in the local language while for the exchange students the required reading must be written in English. In implementation terms, this means that for most (but not all) attributes of the OT Course we will need to maintain two variants; an English one and a local language one. The recognition of these differences is essential for a user-centered approach and therefore they should be modeled in an early phase. Some people may argue that the language is a representation issue and therefore it should not be considered in the conceptual phase but left to the implementation design. However, in this example, the language issue is an important user requirement which also influences the actual information that will be provided. If we do not recognize this during conceptual design, the information provided for a course, except for the language, would be the same for local students and exchange students.
To model the differences we introduce two variants for the OT Course: Course/Local Students and Course/Exchange Students.
Course/Local Students:
• the identification number of the course;
• the local-language name of the course;
• a description of the content of the course in the local language;
• the prerequisites for the course for the local students, in the local language;
• the specification of the required reading for the local students, in the local language;
• the type of exam of the course for the local students, in the local language;
• the name of the newsgroup of the course;
• the programme year in which the course may be followed.
Course/Exchange Students:
• the identification number of the course;
• the English name of the course;
• a description of the content of the course in English;
• the prerequisites for the course for the exchange students, in English;
• the specification of the required reading for the exchange students, in English;
• the type of exam of the course for the exchange students, in English;
• the name of the newsgroup of the course.
Graphically, we use a parent-child notation to represent variants (see Fig. 3). The parent OT is variant independent; each child OT is a variant of the parent OT. The name of a variant OT is composed of the name of the parent OT followed by the variant identification, e.g. Course/Exchange Students. A variant OT can have fewer attributes than its parent OT. Semantically, this means that the omitted attributes are not meaningful for the variant. E.g. the Programme Year attribute is omitted in Course/Exchange Students because it is not meaningful for exchange students. Note that in this respect variants are clearly different from the notion of subtype. Subtypes can in general not be used to model variants. Attributes may also have variants. Name/English Name and Name/Dutch Name are two variants of the attribute Name. To relate the attribute variant to the original attribute in the parent OT, the name of the original attribute precedes the name of the attribute variant. In some cases, it is possible that the original attribute will never have a value of its own, but only serves as a means to indicate that the underlying variant attributes have the same semantics. This is comparable to the concept of abstract OT in object-oriented modeling. By analogy, we call this an abstract attribute. In the OT Course, the attributes Name, Description, Exam Type and Required Reading are abstract attributes. A variant OT cannot include or refer to attributes which are not defined in the parent OT. This is to prohibit the addition of completely new information (attributes) to a variant, in which case it would not be a variant anymore.
[Fig. 3. Variants for the OT Course: the parent OT Course (Id, Name, Description, Newsgroup, Exam Type, Required Reading, Programme Year) with the variant OTs Course/Local Students (Id, Name/Dutch Name, Description/Dutch Description, Newsgroup, Exam Type/Exam Type Dutch, Required Reading/Dutch Req. Reading, Programme Year) and Course/Exchange Students (Id, Name/English Name, Description/English Description, Newsgroup, Exam Type/Exam Type English, Required Reading/English Req. Reading).]
In WSDM, information differences between the perspectives of a single user class are modeled by means of OT variants. For each OT in the UOM of a user class, and for each perspective of this user class, a variant may be defined to reflect the possible information differences. To derive the conceptual schema for a perspective, called a perspective object model (POM), it suffices to replace the OTs in the corresponding UOM by the corresponding perspective variants. If an OT has no variant for the perspective, the OT is kept as it is.
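The derivation just described can be sketched as follows (illustrative code only; OTs are reduced to attribute lists and attribute-level variants such as Name/English Name are left out for brevity):

```python
def derive_pom(uom, variants):
    """Derive a perspective object model from a user object model by
    replacing each OT with its perspective variant, when one exists.
    A variant may omit attributes of its parent OT but may not add new ones."""
    pom = {}
    for ot, attributes in uom.items():
        variant = variants.get(ot)
        if variant is None:
            pom[ot] = list(attributes)             # no variant: keep the OT
        else:
            assert set(variant) <= set(attributes), \
                "a variant cannot introduce new attributes"
            pom[ot] = list(variant)
    return pom

uom_enrolled = {"Course": ["Id", "Name", "Description", "Newsgroup",
                           "Exam Type", "Required Reading", "Programme Year"],
                "Lecturer": ["Name", "Title", "Room", "Tel", "E-Mail"]}
exchange_variants = {"Course": ["Id", "Name", "Description", "Newsgroup",
                                "Exam Type", "Required Reading"]}
print(derive_pom(uom_enrolled, exchange_variants))
```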
3.3 Linking the Conceptual Models
As explained, the Object Modeling starts by building the user object models, one for each user class. Subsequently, these models are refined using perspective variants to derive the perspective object models (if a user class has no perspectives then the user object model acts as perspective object model). In what follows we call OTs from a perspective object model perspective OTs (POTs). Perspective object models of a single user class are related by means of their user object model. However, the different user object models are (still) independent. This is not desirable, especially not when several user classes share the same information. It would result in an uncontrollable redundancy. Therefore, the different user object models must be related. To do this we use an overall object model, the business object model (BOM). This model is a conceptual description of the information (business objects) available in the organization. It is independent of any type of user. Such a business object model may already have been developed for the organization or the application domain. If not, or if it is not available in a shape usable for our purpose, it
must be (re-)developed. The classical information analysis methods mentioned earlier may be used for this. For this model, a data-driven approach is not a problem; on the contrary, it is preferred. Next, the different user object models are expressed as (possibly complex) views on the BOM. Note that it is possible that during this step it turns out that the (existing) BOM is incomplete. This is the case if information modeled in a user object model cannot be expressed as information modeled in the BOM. In such a case it is necessary to re-engineer the BOM. Figure 4 illustrates how the different types of conceptual schemes developed during Object Modeling relate to each other.
[Fig. 4. Relationship between the different types of object models: the business object model describes the application domain; each user object model is a view on the business object model, based on the corresponding user class description; each perspective object model is a variant of a user object model.]
3.4 Navigational Design
Once the Object Modeling is done, a conceptual navigation model is constructed. The navigation model expresses how the different user types will be able to navigate through the available information. Navigational models are usually described in terms of components and links. We distinguish between information components, navigation components and external components (see Fig. 5). Information components represent information. An information component may be linked to other components to allow navigation. A navigation component can be seen as a grouping of links, and so contains no real information but allows the user to navigate. An external component is actually a reference to a component in another site. Following our user-centered approach, we design an independent navigation track for each perspective. To derive the navigation model, it is sufficient to connect the different navigation tracks by a navigation component. In a nutshell, a navigation track for a perspective may be constructed as follows: information components are
derived from the POTs and links are used to represent the relationships between POTs. This forms the information layer of the navigation track. Next, a navigation layer, built up of navigation components, is designed to provide different access paths to the information components in the information layer. The top of a navigation track is a single navigation component which provides access to the different navigation components in the navigational layer. When the different navigation tracks are composed, these top level components form the context layer of the navigation model. Figure 6 shows the navigation track for the POM Exchange Students. Figure 7 shows how the different navigation tracks are composed to make up the navigation model.
[Fig. 5. Graphical representation of the navigation model concepts: navigation component, information component, external component, link.]
[Fig. 6. Navigation track for the perspective Exchange Students: the context layer (Exchange Students Perspective), a navigation layer (Courses by Name, Exams by Course, Lecturers by Name, Course Materials, Course Materials by Id, Course Materials by Course) and an information layer (Course/Exchange Students, Exam, Lecturer, Course Material).]
[Fig. 7. Composition of navigation tracks into a navigation model: a top-level University Department component in the context layer connects the Researchers, Local Students and Exchange Students perspective tracks, each with its own navigation layer and information layer.]
In the rest of this paper we will use the term conceptual schema (CS) to denote the result of the Conceptual Design: the UOMs, POMs, BOM and the navigation model.
4 Roles of the CS in the Website Life Cycle
The life cycle of a website contains many of the phases of a traditional Information System (IS) life cycle, such as planning, analysis, design and implementation, but also phases which are specific to web systems. The development process of a website is more open-ended because a website is often not as permanently fixed as a traditional IS. Designing a website is an ongoing process. Maintenance includes activities such as monitoring new technologies, monitoring users, and adapting the website accordingly. It is a continuous process of improvement. To emphasize this distinction, the maintenance phase is sometimes called Innovation [3]. The typical Installation phase is replaced by a Promotion phase in which the existence of the website is made public (by publicity, references from other websites, etc.). In this section we explain what role the CS may play in the Implementation phase, the Promotion phase and the Innovation phase, and we explain how the CS may be exploited even more inside the website. During Implementation Design, the 'look and feel' of the website is developed. The starting point for this is the navigation model. Through the use of graphical design principles and visual communication techniques, taking into account the characteristics of the different perspectives, the navigation model will be translated into a presentation model (content of pages and their layout). Again, this is in some respects similar to the mapping of a conceptual data schema into a logical data schema (e.g. a relational one). Indeed, during Implementation Design one may decide to group information components and links (from the navigational model) together and to present them to the user as single packages of information. (In fact, we are developing algorithms and tools to support this.) Separating the conceptual and the implementation design for websites has the same advantage as in database design: it offers the flexibility needed for designing large websites. As explained, designing a website is a continuous process. By separating the conceptual design from the implementation design, we obtain the flexibility required to support this incremental and evolving design process. Different implementation designs may be built (e.g. as prototypes) and evaluated. Changes and additions to the content are localized to the conceptual level, and the impact on the implementation design can easily be traced. Adding a new user class only involves adding a new UOM with its associated perspectives and navigation tracks. Changes to the presentation only influence the implementation design. The actual implementation can be automated using available tools and environments for assisting in e.g. HTML implementations.
Because different perspectives may offer the same information (possibly presented differently), we need to provide means to maintain this information and keep it consistent. The obvious way of doing this is by maintaining the underlying information (or parts of it) in a database. This need not be a full-fledged database, but it is in any case a single storage place for information shared between different perspectives. As all information presented in the website is ultimately related to the business object model (BOM) (by means of the POMs and UOMs), this BOM provides the conceptual schema for the underlying database. From this BOM a logical database schema is then generated (using appropriate database development tools) or manually built. The queries needed to extract the information for building the pages can then be derived from the POMs because they are already expressed as views on the BOM.

To reduce the lost-in-hyperspace syndrome, many sites contain an index page or site map, which gives a (hierarchical) overview of the website and provides a central point from which the user can locate a page in the website. We may instead consider replacing it with a representation of (parts of) the conceptual schema, which is much richer in information than an index page. Each navigation track could contain a suitable representation of its corresponding POM. This will not only allow the user to locate information directly but will also help him/her to build a mental model of the site, and ultimately provide an on-line repository of meta-information which may be queried. The availability of the CS literally ‘in-site’ may also be exploited by the many different types of search engines to enhance their search effectiveness. In this way promotion benefits as well.2
5 Conclusions
In this paper we have explained the need for a conceptual design phase in website design similar to the conceptual design phase in database systems. Based on early experience with our method WSDM, we argued that a user-centered approach is more appropriate for websites than the traditional data-centered approach used for database design. As a consequence, the conceptual schema of a website cannot be seen as a single schema but as a collection of schemes; each user perspective has its own conceptual schema. To relate the different schemes and to control the redundancy possibly introduced in this way, a business object model is used. To capture variations between perspective schemes, so-called OT variants are introduced. Because navigation is an essential characteristic of websites, the conceptual schema also includes a navigation model, which describes how users will be able to navigate through the website; it is a collection of navigation tracks, one for each user perspective. We have also shown that separation of the conceptual and the implementation design for websites has the same advantages as in database design.
2 Note that this does not lead to the redundancy mentioned as a usability problem in the introduction, because a user only follows one perspective and, within one perspective, redundancy is avoided.
As for database design, it is possible to deploy the conceptual schema technology in the future automation of site creation and upkeep. CASE-type tools generating well-structured websites from user requirements and business domain models are the next logical step. In addition, (parts of) the conceptual schema may be represented and queried inside the website to reduce the lost-in-hyperspace syndrome. New generations of search engines may exploit such additional structural knowledge, e.g. by interpreting the meta-information present in a website and acting on its semantics.

Acknowledgments. Many thanks go to Wim Goedefroy and Robert Meersman for the interesting discussions on and the contributions to this research work.
References

1. P.P. Chen, The Entity-Relationship Model: Towards a Unified View of Data, ACM Transactions on Database Systems, Vol 1, No 1, 1976, 471-522.
2. O.M.F. De Troyer, A formalization of the Binary Object-Role Model based on Logic, Data & Knowledge Engineering 19, pp. 1-37, 1996.
3. J. December, M. Ginsberg, HTML & CGI Unleashed, Sams.net Publishing, 1995.
4. O.M.F. De Troyer, C.J. Leune, WSDM: a User-Centered Design Method for Web Sites, in Proceedings of the WWW7 Conference, Brisbane, April 1997.
5. W. Goedefroy, R. Meersman, O. De Troyer, UR-WSDM: Adding User Requirement Granularity to Model Web Based Information Systems, Proceedings of the 1st Workshop on Hypermedia Development, Pittsburgh, USA, June 20-24, 1998.
6. F. Garzotto, P. Paolini, L. Mainetti, Navigation patterns in hypermedia databases, Proceedings of the 26th Hawaii International Conference on System Science, IEEE Computer Society Press, pp. 370-379, 1993.
7. F. Garzotto, P. Paolini, D. Schwabe, HDM - A Model-Based Approach to Hypertext Application Design, ACM Transactions on Information Systems, Vol 11, No 1, pp. 1-26, 1993.
8. T. Halpin, Conceptual Schema and Relational Database Design, second edition, Prentice Hall Australia, 1995.
9. T. Isakowitz, E.A. Stohr, P. Balasubramanian, RMM: A Methodology for Structured Hypermedia Design, Communications of the ACM, Vol 38, No 8, pp. 34-43, 1995.
10. H. Lee, C. Lee, C. Yoo, A Scenario-Based Object-Oriented Methodology for Developing Hypermedia Information Systems, Proc. of HICSS '98.
11. H. Maurer, Hyper-G - The Next Generation Web Solution, Addison-Wesley, 1996.
12. J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, W. Lorensen, Object Oriented Modeling and Design, Prentice Hall Inc., 1991.
13. D. Schwabe, G. Rossi, The Object-Oriented Hypermedia Design Model, Communications of the ACM, Vol 38, No 8, 1995.
14. D. Schwabe, G. Rossi, S.D.J. Barbosa, Systematic Hypermedia Application Design with OOHDM, http://www.cs.unc.edu/barman/HT96/P52/section1.html.
15. K. Takahashi, E. Liang, Analysis and Design of Web-based Information Systems, Sixth International World Wide Web Conference, 1997, http://www6.nttlabs.com/papers/PAPER245/Paper245.html.
16. J.J. Wintraecken, The NIAM Information Analysis Method - Theory and Practice, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.
17. M. Bichler, S. Nusser, W3DT - The Structural Way of Developing WWW-sites, Proceedings of ECIS'96, 1996.
Formalizing the Informational Content of Database User Interfaces

Simon R. Rollinson and Stuart A. Roberts
School of Computer Studies, University of Leeds, Leeds, LS2 9JT, UK
{sime, sar}@scs.leeds.ac.uk
Abstract. The work described in this paper addresses the problem of modelling the informational content of a graphical user interface (GUI) to a database. The motivation is to provide a basis for tools that allow customisation of database interfaces generated using model-based techniques. We focus on a particular class of user interface, forms-based GUIs, and explore the similarities between these types of interfaces and a semantic data model. A formalism for translating between forms-based interfaces and a semantic data model is presented. The translation takes account of the context in which each control on the GUI is employed, and accommodates the need to map distinct GUI elements to the same semantic concepts.
1 Introduction
Forms-based user interfaces have remained popular as a means of data entry and update for a number of years, especially in business-oriented applications. Over the years these interfaces have evolved from character-based systems to present-day graphical user interfaces (GUIs). GUIs offer a much wider scope for interaction, which has led to the move away from hierarchic menu structures to network-like structures utilising multi-modal navigation between forms. The structure of this style of interface corresponds quite closely with that of a semantic data model: forms can be identified with entities; controls on forms with attributes; and links between forms with relationships between entities. This correspondence was used in [19] for the purpose of interface generation. Although a user interface appears similar to a semantic model, there is an important difference: the semantics of user interface controls. In particular, a control's semantics can differ depending upon its usage. For example, in one interface a control may be used to represent an attribute of some entity, whilst in another the same control may be used to represent an entity. To be able to transform user interfaces to a semantic data model it is necessary to understand the semantics of each control in its different uses. The aim of this paper, therefore, is to describe an investigation into identifying the semantics of user interface components with respect to their different uses, classifying each component and its use(s) in terms of familiar semantic modelling concepts.
The motivation for this work comes from research into model-based interface generation. This has been popular for a number of years in the database community, where data models are often utilised along with task models to automatically generate user interfaces for database applications. For example, the MACIDA [17], GENIUS [13] and TRIDENT [3] systems have all used some form of Entity-Relationship model [6] as part of their input. A feature lacking in most of these systems is the ability to customise the generated interface to suit a user's personal needs. To ensure that no information is lost during customisation (by removing a control, for example), it is necessary to validate the customised interface against the generated interface, thus highlighting any missing information in the customised interface. To enable such a facility two things are needed: a means by which the informational content of the interface can be modelled, and an equivalence metric between interface models. Work already exists to do this on an individual-form basis (see [2]). We seek to undertake a similar study but work with networks of forms, identifying and addressing the issues posed by such interfaces.

The paper is structured as follows. Section 2 examines related work and Sect. 3 introduces the types of user interfaces focused on in this work. The formalism for representing user interfaces is described in Sect. 4, along with the transformations that map user interface elements to semantic modelling concepts. Section 5 briefly describes a prototype implementation of the mappings and Sect. 6 looks at applications of the work. Finally, Sect. 7 concludes the paper.
2 Related Work
In [9] the ERMIA (Entity-Relationship Modelling for Information Artifacts) formalism is presented. ERMIA is based on an extended entity-relationship modelling approach and is employed in the evaluation of user interfaces to complement methods such as GOMS [5] and UAN [11]. The significance of ERMIA to this work is its recognition of the close relationship between the structure of information in a user interface and the structure of information in a database. The work described in this paper, however, focuses on establishing a link between user interface components and semantic modelling concepts, whereas ERMIA is concerned with stripping away the ‘renderings’ of information to reveal the underlying structure for evaluation purposes.

Abiteboul and Hull in [2] describe a formalism for representing and restructuring hierarchical database objects, including office forms, based on the IFO model [1]. They focus on data-preserving transformations from one hierarchic structure to another, to allow, for example, equivalence tests between different forms. Abiteboul and Hull treat each form in isolation, whereas we aim to take into account the network-like structure of modern GUIs. Our work can be seen, in part, to be an extension of the work of Abiteboul and Hull, enabling tests for equivalence of complete interfaces comprising many linked forms. Furthermore,
Formalizing the Informational Content of Database User Interfaces
67
Table 1. User interface constructs

Form: A form is a window that allows other constructs to be placed on it.
Groupbox: The groupbox groups together related constructs and is named to reflect the grouped information.
Listbox: The listbox shows either a single- or multi-column scrollable list of alphanumeric information.
Grid: The grid allows alphanumeric information to be entered or shown in a tabular format.
Checkbox: The checkbox is a rectangle which holds either an "X" or a blank space (i.e. a boolean value).
Radio Button: Radio buttons cannot be used singly; two or more must be used together. Each holds either a "•" or a blank space (i.e. a boolean value), and in a group of n buttons exactly one must contain a "•" at any given time, representing the fact that one of the options is true and the rest false.
Textbox: The textbox is a rectangle that can be used to enter or show alphanumeric information.
Combobox: The combobox has two states, normal and extended. In its normal state the combobox is a rectangle (similar in appearance to the textbox) and allows alphanumeric information to be entered/shown. In the extended state the combobox presents a list of alphanumeric information, from which an item can be selected to be shown in the normal state of the combobox. In the extended state the combobox appears similar to a listbox.
Button: The button allows actions to be performed. A typical action might be to show another form, effectively linking together two forms.
Row: The row is part of (i.e. contained in) the grid, listbox and combobox constructs.
Column: One or more columns form the contents of a row construct.
[2] assumes a mapping between user interface controls on forms and the IFO constructs that is too simple for our purposes. In [10] Güting et al also consider the hierarchic nature of office documents and have developed an algebra, based on relational algebra, for manipulating these structures; again, each form is treated separately. Our work can also be classed as a form of reverse engineering, since we extract a semantic data model from a database application; in this respect it resembles [20], which considers the reverse engineering of database applications for the purpose of creating an object-oriented view of a relational database. However, we take a higher-level view, concentrating only on the user interface, whereas Vermeer and Apers [20] are concerned with examining the program and query language statements.
Fig. 1. An example user interface
Other work of interest is that of Mitchell et al [15] who have shown how an object-oriented data language can be used not only to describe a database but also its user interface.
3 User Interfaces
In this section we introduce, in more detail, the type of user interfaces that have been studied. The interfaces comprise linked forms (i.e. windows). Forms are linked and information displayed/entered using interface controls (hereafter referred to as constructs). Interface constructs can be generalised into several broad types; each is described in Table 1. To constrain the type of user interface examined in this study to those of a forms-based nature, we opted to use a model-based user interface development environment (MB-UIDE). For this work the TRIDENT system [3] was chosen as a means of limiting the interfaces. Several modifications are required to TRIDENT: the removal of constructs such as thermometers and dials, not supported in our formalism; the introduction of the row and column constructs; and a rationalisation of constructs where the differences between two constructs (e.g. with or without scrollbars) are immaterial to our method. The precise method of defining interfaces used in our study is given in Appendix A. To end this section we present a user interface that will be used as an example throughout the remainder of the paper. Figure 1 shows two forms, person and car, which both contain textbox and listbox controls. The arc from the car row to the car form indicates a ‘clickable’ link between the row of the listbox and the car form. In this case the link ‘opens’ the car form.
4 The User Interface Model
By examining the user interfaces of database applications it was possible to extract the semantics of each interface construct used within different contexts. Each interface construct was classified as one or more of three modelling concepts depending on the context in which it was used (see Table 2). The concepts are:
Table 2. Classification of interface constructs

Abstract Types
1 Forms.
2 Groupboxes containing one or more textboxes, comboboxes, listboxes, grids, groupboxes and checkboxes, but not just checkboxes.
3 Rows containing more than one column.
4 Rows containing a single column that have the same name as a form, or a groupbox (as defined in 2).
5 Comboboxes with more than one column.
6 Comboboxes with one column that have the same name as a form, or a groupbox (as defined in 2).
7 Textboxes that have the same name as a form, or a groupbox (as defined in 2).
8 A groupbox containing only radiobuttons where the groupbox has the same name as a form or a groupbox (as defined in 2).
9 A groupbox containing only checkboxes where the groupbox has the same name as a form or a groupbox (as defined in 2).

Lexical Types
10 Textboxes and columns.
11 A single checkbox.
12 Groupboxes containing two or more radiobuttons.
13 Rows with one column that do not have the same name as a form.
14 Comboboxes with one column that do not have the same name as a form.
15 Groupboxes that contain only checkboxes.

Groupings
16 Groupboxes that contain only checkboxes.
17 Listboxes and Grids.
abstract type, lexical type and grouping, which are taken from the generalised semantic model (GSM) [12]. A key element in our formalism is the use of labels for naming and identifying constructs. We adopt a scheme similar to that of Buneman et al [4] in which different constructs that have the same name are interpreted as representing the same real world concept. In Fig. 1 for example, the car form and car row are interpreted as representing the same entity. Likewise with the columns (reg., make) of the row which are interpreted as representing the same attributes as the reg and make textboxes. Notice, however, that name does not appear in the car row. Again, we adopt the approach of Buneman et al and form the union of the textbox and column labels to obtain the attributes for car. We also exploit the use of plural labels to indicate a repeating group. The entity being repeated is identified by the singular of the plural label. This idea has been described by Rock-Evans [18] as part of a data analysis activity.
4.1 User Interface Model - A Formal Description
To build the necessary base for the transformations we first define a user interface formally.

Definition 1. A user interface (UI) is a five-tuple I = ⟨SL, PL, T, N, C⟩ where:
– SL is a finite set of singular labels
– PL is a finite set of plural labels
– T is the set of interface construct types, i.e. T = {FORM, GROUPBOX, COMBO, CHECKBOX, COLUMN, ROW, GRID, LISTBOX, TEXTBOX, RADIOBUTTON}
– N = (SL × T) ∪ (PL × T) is the set of labelled constructs (nodes)
– C = N × N is the set of pairs of nodes
Definition 2. An instance of a user interface is a set of directed trees U = {T1, . . . , Tn}. Each tree Ti = (Vi, Ei), where 0 < i ≤ n, Vi ⊂ N is a set of vertices and Ei ⊂ C is a set of edges.

Several functions are defined to operate on instances of user interfaces and their vertices:
– ρ(Ti) : Ti → N returns the root node of a tree
– ψ(v) : N → N returns the parent of vertex v
– τ(v) : N → T returns the type of vertex v
– σ(v) : N → N′ ⊂ N returns the (possibly empty) set of children of vertex v
– λ(v) : N → SL ∪ PL returns the label of vertex v
– Σ(v) : N → SL returns the singular form of vertex v's plural label
– singlecolumn(v) returns true if a combobox or row v has exactly one column
– multicolumn(v) returns true if a combobox or row v has two or more columns
– checkboxgroup(v) returns true if a groupbox v contains only checkboxes
– radiobuttongroup(v) returns true if a groupbox v contains only radiobuttons
Fig. 2. An example instance of a user interface consisting of two forms: (a) T1, (b) T2
– constructgroup(v) returns true if a groupbox v contains any construct in T

Figure 2 shows an example user interface instance containing two forms, as represented by the sets V1, E1 and V2, E2 below.

V1 = {v1 = (person, Form), v2 = (name, Textbox), v3 = (cars, Listbox), v4 = (car, Row), v5 = (reg, Column), v6 = (make, Column)}
E1 = {e1 = (v1, v2), e2 = (v1, v3), e3 = (v3, v4), e4 = (v4, v5), e5 = (v4, v6)}
V2 = {v8 = (car, Form), v9 = (reg, Textbox), v10 = (make, Textbox), v11 = (name, Textbox), v12 = (ownedBy, Listbox), v13 = (person, Row), v14 = (name, Column)}
E2 = {e7 = (v8, v9), e8 = (v8, v10), e9 = (v8, v11), e10 = (v8, v12), e11 = (v12, v13), e12 = (v13, v14)}
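For readers who prefer running code to set notation, the following sketch (ours, not part of the paper's formalism; the tuple layout and the helper names are assumptions) encodes the example instance above and the access functions of Definition 2 in Python.

    FORM, GROUPBOX, COMBO, CHECKBOX, COLUMN, ROW, GRID, LISTBOX, TEXTBOX, RADIOBUTTON = range(10)

    # Tree T1 (the person form) and tree T2 (the car form) of Fig. 2,
    # encoded as (label, type) nodes plus parent-child edges.
    V1 = {"v1": ("person", FORM), "v2": ("name", TEXTBOX), "v3": ("cars", LISTBOX),
          "v4": ("car", ROW), "v5": ("reg", COLUMN), "v6": ("make", COLUMN)}
    E1 = [("v1", "v2"), ("v1", "v3"), ("v3", "v4"), ("v4", "v5"), ("v4", "v6")]

    V2 = {"v8": ("car", FORM), "v9": ("reg", TEXTBOX), "v10": ("make", TEXTBOX),
          "v11": ("name", TEXTBOX), "v12": ("ownedBy", LISTBOX),
          "v13": ("person", ROW), "v14": ("name", COLUMN)}
    E2 = [("v8", "v9"), ("v8", "v10"), ("v8", "v11"),
          ("v8", "v12"), ("v12", "v13"), ("v13", "v14")]

    V = {**V1, **V2}   # all vertices
    E = E1 + E2        # all edges

    def lam(v): return V[v][0]                        # λ(v): label of vertex v
    def tau(v): return V[v][1]                        # τ(v): construct type of vertex v
    def sigma(v): return [c for p, c in E if p == v]  # σ(v): children of vertex v
    def psi(v):                                       # ψ(v): parent (None for a root)
        parents = [p for p, c in E if c == v]
        return parents[0] if parents else None

    def singlecolumn(v): return len(sigma(v)) == 1
    def multicolumn(v): return len(sigma(v)) >= 2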
4.2 Transformations
A GSM schema is a directed graph G = (N, E), where N is a set of nodes and E is a set of edges. Each node is of a particular type (abstract type, lexical type, or grouping), resulting in three subsets of N representing abstract types, lexical types and groupings. We label these sets A, Π and Γ respectively. If U = {T1, . . . , Tn} is a UI model instance, and Ti = (Vi, Ei) is a tree in U, we define two sets V and E as follows:

V = V1 ∪ . . . ∪ Vn,   E = E1 ∪ . . . ∪ En
Transforming an instance of a UI into a GSM schema is a matter of mapping the set of nodes V to the sets of nodes A, Π and Γ that comprise N, and the set of UI edges E to the set of edges E of the GSM schema. We start by mapping V to N. Mapping V to N means classifying each v ∈ V as a member of A, Π or Γ. This can be achieved by defining membership conditions for the three sets. Each set can have several membership conditions, so we divide them into subsets, with one subset for each membership condition (e.g. A has nine membership conditions, so we define the sets α1 to α9, the union of which forms A). The same effect is achievable by defining the membership condition for A, Π and Γ as the disjunction of the membership conditions of their subsets; the former method is used for clarity. Definitions three to five below show the membership conditions for each of the sets.

Notice that in these definitions we place the labels of nodes, rather than the nodes themselves, into A, Π and Γ. The reason for this is that it is possible for two nodes to represent the same concept. Placing the nodes into the sets directly would result in two concepts with the same name, when in fact we have a single concept used twice (recall from Sect. 4 that if two constructs represent the same thing then they have the same name). Thus, if we have two nodes v1 = (person, form) and v2 = (person, row) and they are classified as belonging to A, then only the label person is placed in A.

Definition 3. The set of abstract type labels A is defined as A = α1 ∪ . . . ∪ α9, where:
α1 = {l : l = λ(v) ∧ τ(v) = FORM}
α3 = {l : l = λ(v) ∧ τ(v) = ROW ∧ multicolumn(v)}
α4 = {l : l = λ(v) ∧ τ(v) = ROW ∧ singlecolumn(v) ∧ λ(v) ∈ α1 ∪ α2}
α5 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ multicolumn(v)}
α6 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ singlecolumn(v) ∧ λ(v) ∈ α1 ∪ α2}
α7 = {l : l = λ(v) ∧ τ(v) = TEXTBOX ∧ λ(v) ∈ α1 ∪ α2}
α8 = {l : l = λ(v) ∧ radiobuttongroup(v) ∧ λ(v) ∈ α1 ∪ α2}
α9 = {l : l = Σ(λ(v)) ∧ checkboxgroup(v) ∧ λ(v) ∈ α1 ∪ α2}

Definition 4. The set of lexical type labels is defined as Π = π1 ∪ . . . ∪ π6, where:
π1 = {l : l = λ(v) ∧ (τ(v) = TEXTBOX ∨ τ(v) = COLUMN)}
π2 = {l : l = λ(v) ∧ τ(v) = CHECKBOX ∧ ¬checkboxgroup(ψ(v))}
π3 = {l : l = λ(v) ∧ radiobuttongroup(v)}
π4 = {l : l = λ(v) ∧ τ(v) = ROW ∧ singlecolumn(v) ∧ (τ(ψ(v)) = LISTBOX ∨ τ(ψ(v)) = GRID) ∧ λ(v) ∉ A}
π5 = {l : l = λ(v) ∧ τ(v) = COMBO ∧ singlecolumn(v) ∧ λ(v) ∉ A}
π6 = {l : l = Σ(λ(v)) ∧ checkboxgroup(v)}

Definition 5. The set of grouping labels is defined as Γ = γ1 ∪ γ2, where:
γ1 = {l : l = λ(v) ∧ checkboxgroup(v)}
γ2 = {l : l = λ(v) ∧ (τ(v) = LISTBOX ∨ τ(v) = GRID) ∧ |σ(v)| = 1 ∧ τ(σ(v)) = ROW}

The previous step resulted in sets of labels representing the nodes of the GSM schema. To connect the nodes, we must map the set of UI edges E, which connects nodes, to the set of GSM edges E, which connects node labels. More precisely, for every edge e = (m, n) ∈ E connecting two nodes m and n, we need an edge e′ = (λ(m), λ(n)) in E connecting the labels of the two nodes. To do this we form equivalence classes of V.

Definition 6. An equivalence class N(x) is defined as: N(x) = {y : y ∈ V ∧ λ(x) = λ(y)}

This results in an equivalence class for each distinct name appearing in V; thus each equivalence class represents a node in the GSM schema. To connect the GSM nodes we need to establish relationships between equivalence classes. We do this by taking a pair of equivalence classes and looking for elements from each of them that, when paired, occur as an edge in E. An edge (N(x), N(y)) in the GSM schema is given by:

(N(x), N(y)) ⟺ ∃x′ ∈ N(x) ∧ ∃y′ ∈ N(y) ∧ (x′, y′) ∈ E

It is possible, in a GSM schema derived using the above method, for a lexical type to exist which has several incoming arcs. It is uncommon, although not invalid, for such a situation to exist within a GSM schema. When this occurs we replace the lexical type with n copies of itself, where n is the number of arcs terminating at the lexical type. Figure 3a shows an example of this, with car and person both having arcs terminating at the lexical type name. Figure 3b shows the GSM schema after name has been split.
Fig. 3. (a) The GSM schema of the UI model instance in Fig. 2, and (b) the schema of (a) after the splitting of name
4.3 Example Transformation
If we return to the UI model instance in Fig. 2 we can define the sets V = {v1, . . . , v14} and E = {e1, . . . , e12}. Testing the elements of V against the membership conditions of A, Π and Γ gives:

A = {person, car}, Π = {name, reg, make}, Γ = {cars, ownedBy}

and we arrive at the following set of edges:

E = {e1 = (person, name), e2 = (person, cars), e3 = (car, reg), e4 = (car, make), e5 = (car, name), e6 = (car, ownedBy), e7 = (ownedBy, person), e8 = (cars, car)}

Figure 3a shows a graphical representation of the GSM schema derived from the UI model.
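The example transformation can also be traced mechanically. The sketch below (our illustration, simplified to the membership conditions this example actually exercises; it is not the authors' implementation) classifies the labels of the example instance and derives the GSM edges from the equivalence classes of Definition 6.

    FORM, TEXTBOX, LISTBOX, ROW, COLUMN = "FORM", "TEXTBOX", "LISTBOX", "ROW", "COLUMN"

    # vertex id -> (label, type); edges as (parent id, child id) pairs
    V = {1: ("person", FORM), 2: ("name", TEXTBOX), 3: ("cars", LISTBOX),
         4: ("car", ROW), 5: ("reg", COLUMN), 6: ("make", COLUMN),
         8: ("car", FORM), 9: ("reg", TEXTBOX), 10: ("make", TEXTBOX),
         11: ("name", TEXTBOX), 12: ("ownedBy", LISTBOX),
         13: ("person", ROW), 14: ("name", COLUMN)}
    E = [(1, 2), (1, 3), (3, 4), (4, 5), (4, 6),
         (8, 9), (8, 10), (8, 11), (8, 12), (12, 13), (13, 14)]

    def label(v): return V[v][0]
    def typ(v): return V[v][1]
    def children(v): return [c for p, c in E if p == v]

    # Abstract type labels: forms (alpha1), plus rows that have several
    # columns or that share a name with a form (alpha3/alpha4).
    A = {label(v) for v in V if typ(v) == FORM}
    A |= {label(v) for v in V if typ(v) == ROW
          and (len(children(v)) > 1 or label(v) in A)}

    # Lexical type labels: textboxes and columns (pi1) that are not abstract.
    Pi = {label(v) for v in V if typ(v) in (TEXTBOX, COLUMN)} - A

    # Grouping labels: listboxes whose single child is a row (gamma2).
    Gamma = {label(v) for v in V
             if typ(v) == LISTBOX and len(children(v)) == 1
             and typ(children(v)[0]) == ROW}

    # GSM edges connect equivalence classes (labels) whenever some UI edge does.
    gsm_edges = {(label(p), label(c)) for p, c in E}

    print(sorted(A))        # ['car', 'person']
    print(sorted(Pi))       # ['make', 'name', 'reg']
    print(sorted(Gamma))    # ['cars', 'ownedBy']
    print(sorted(gsm_edges))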
5 Prototype Tool
A prototype tool has been developed to test the transformations. It has been implemented mainly in Prolog and uses the work described in [7] to allow user interface model instances to be specified graphically. Output from the prototype is of a form that can be visualised using the XVCG tool [14]. Thus, the system operates completely in the graphical domain.
The system allows the user to specify a user interface model instance as an Xfig diagram, comprising a tree for each form in the user interface. This is converted into Prolog predicates using the visual parser [7]. The transformations are then applied to the predicates resulting in a GRL (graph representation language) specification of the GSM representation of the interface. Finally the XVCG tool is invoked on the GRL specification allowing the GSM schema to be visualised.
6 Applications
The work presented here has several applications in the design and development of information systems. The first is to provide an ‘informational’ consistency check to validate modifications, made by end-users, to interfaces generated using MB-UIDEs. To this end, we are currently exploring equivalence metrics to compare GSM schemata. A second application is as part of a reverse engineering toolkit. The transformations presented in this paper provide a suitable mechanism for translating the user interface of a forms-based application into a GSM schema. This initial model of the application's data can then be augmented and refined using other reverse engineering techniques. Such a tool would be of use in situations where there is no underlying database. A further application of our work is as the basis for a data modelling tool. In [16] Moody reports how novices found conventional data models difficult to work with and identified the need for more ‘user-friendly’ methods of data modelling. Embley [8] describes forms as being well understood [by their users], structuring information according to well-established and longstanding conventions. As part of our ongoing work we shall investigate the use of our transformations to provide a more friendly approach to data modelling.
7 Conclusions & Further Work
This paper has presented a formalism for the representation of forms in a graphical user interface environment and their transformation to semantic modelling concepts. In this work the formalism has been applied to database user interfaces, although it could be applied more generally to other forms-based interfaces. We make two extensions to the work of Abiteboul and Hull [2]. Firstly, our formalism takes into account the context in which the user interface controls have been used. Secondly, we are able to transform complete forms-based interfaces because we identify the network-like structuring of forms, and the multi-modal navigation between them, allowed by graphical user interface environments. This work represents the first step towards transforming network-structured forms interfaces to GSM schemata. In future work we intend to extend the mapping to include the transformation of data manipulation operations supported by the interface, thus providing a more complete mapping.
References

1. S. Abiteboul and R. Hull. IFO: A formal semantic database model. ACM Transactions on Database Systems, 12(4):525–565, 1987.
2. S. Abiteboul and R. Hull. Restructuring hierarchical database objects. Theoretical Computer Science, 62:3–38, 1988.
3. F. Bodart et al. Architecture elements for highly-interactive business-oriented applications. In L. Bass et al., editors, Lecture Notes in Computer Science, volume 753 of LNCS, pages 83–104. Springer-Verlag, 1993.
4. P. Buneman, S. Davidson, and A. Kosky. Theoretical aspects of schema merging. In A. Pirotte, C. Delobel, and G. Gottlob, editors, Advances in Database Technology, Proceedings of the Third International Conference on Extending Database Technology, volume 580 of LNCS, pages 152–167. Springer-Verlag, 1992.
5. S. K. Card, T. P. Moran, and A. Newell. The Psychology of Human-Computer Interaction. Lawrence Erlbaum, Hillsdale, NJ, 1983.
6. P. P. Chen. The entity-relationship model - towards a unified view of data. ACM Transactions on Database Systems, 1(1):9–36, 1976.
7. A. de Graaf. Levis: Lexical scanning for visual languages. Master's thesis, University of Leiden, The Netherlands, July 1996.
8. D. W. Embley. NFQL: The natural forms query language. ACM Transactions on Database Systems, 14(2):168–211, 1989.
9. T. R. G. Green and D. R. Benyon. The skull beneath the skin: entity-relationship models of information artifacts. International Journal of Human-Computer Studies, 44(6):801–828, 1996.
10. R. H. Güting, R. Zicari, and D. M. Choy. An algebra for structured office documents. ACM Transactions on Office Information Systems, 7(4):123–157, 1989.
11. H. R. Hartson and A. Dix. Toward empirically derived methodologies and tools for human-computer interface development. International Journal of Human-Computer Studies, 31:477–494, 1989.
12. R. Hull and R. King. Semantic database modelling: Survey, applications, and research issues. ACM Computing Surveys, 19(3):201–260, 1987.
13. C. Janssen, A. Weisbecker, and J. Ziegler. Generating user interfaces from data models and dialogue net specifications. In S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel, and T. White, editors, Proceedings of INTERCHI'93, pages 418–423, 1993.
14. I. Lemke and G. Sander. Visualization of compiler graphs. Technical Report D3.12.1-1, Universität des Saarlandes, FB 14 Informatik, 1993.
15. K. J. Mitchell, J. B. Kennedy, and P. J. Barclay. Using a conceptual data language to describe a database and its interface. In C. Goble and J. Keane, editors, Advances in Databases, Proceedings of the 13th British National Conference on Databases, volume 940 of LNCS, pages 79–100. Springer-Verlag, 1995.
16. D. L. Moody. Graphical entity-relationship models: Towards a more user understandable representation of data. In B. Thalheim, editor, Proceedings of the 15th International Conference on Conceptual Modelling, Cottbus, Germany, volume 1157 of LNCS, pages 227–244, 1996.
17. I. Petoud and Y. Pigneur. An automatic and visual approach for user interface design. In Engineering for Human-Computer Interaction, North-Holland, pages 403–420, 1990.
18. R. Rock-Evans. A Simple Introduction to Data and Activity Analysis. Computer Weekly Publications, 1989.
19. S. R. Rollinson and S. A. Roberts. A mechanism for automating database interface design, based on extended E-R modelling. In C. Small et al., editors, Advances in Databases, Proceedings of the 15th British National Conference on Databases, volume 1271 of LNCS, pages 133–134. Springer-Verlag, 1997.
20. M. W. W. Vermeer and P. M. G. Apers. Reverse engineering of relational database applications. In Proceedings of OO-ER'95, Fourteenth International Conference on Object-Oriented and Entity-Relationship Modelling, volume 1021 of LNCS, pages 89–100. Springer-Verlag, 1995.
A Constraints for Composing Interface Controls
1 All constructs must have a label; this includes the rows of grids, listboxes and comboboxes as well as the columns of each row.
2 All constructs must be placed on a form, with the exception of rows and columns.
3 Grid, listbox and combobox constructs can contain only row constructs.
4 Row constructs, when used as the row of a grid, can contain only column and combobox constructs, and must contain at least one column or combobox.
5 Row constructs, when used as the row of a list- or combobox, can contain only column constructs and must contain at least one column.
6 A single checkbox does not have to be grouped with a groupbox.
7 A radiobutton must be associated with at least one other radiobutton.
8 Two or more checkboxes/radiobuttons must be grouped using a groupbox.
9 A listbox must have an associated form that has a construct for at least every column in the listbox.
10 A combobox with more than one column must have an associated form that has a construct for at least every column of the combobox.
A Conceptual-Modeling Approach to Extracting Data from the Web

D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, Y.-K. Ng, D.W. Quass, R.D. Smith

Department of Computer Science and School of Accountancy and Information Systems
Brigham Young University, Provo, Utah 84602, U.S.A.
{embley,campbell,jiang,ng,smithr}@cs.byu.edu; {liddle,quass}@byu.edu
Abstract. Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Keywords: data extraction, data structuring, unstructured data, data-rich document, World-Wide Web, ontology, ontological conceptual modeling.
1 Introduction
The amount of data available on the Web has been growing explosively during the past few years. Users commonly retrieve this data by browsing and keyword searching, which are intuitive, but present severe limitations [2]. To retrieve Web data more efficiently, some researchers have resorted to ideas taken from database techniques. Databases, however, require structured data, and most Web data is unstructured and cannot be queried using traditional query languages. To attack this problem, various approaches for querying the Web have been suggested.
Research funded in part by Novell, Inc.
Research funded in part by Faneuil Research Group.
These techniques basically fall into one of two categories: querying the Web with Web query languages (e.g., [3]) and generating wrappers for Web pages (e.g., [4]). In this paper, we discuss an approach to extracting and structuring data from documents posted on the Web that differs markedly from those previously suggested. Our proposed data extraction method is based on conceptual modeling, and, as such, we also believe that this approach represents a new direction for research in conceptual modeling. Our approach specifically focuses on unstructured documents that are data rich and narrow in ontological breadth. A document is data rich if it has a number of identifiable constants such as dates, names, account numbers, ID numbers, part numbers, times, currency values, and so forth. A document is narrow in ontological breadth if we can describe its application domain with a relatively small ontology. Neither of these definitions is exact, but they express the idea that the kinds of Web documents we are considering have many constant values and have small, well-defined domains of interest.
Brian Fielding Frost

Our beloved Brian Fielding Frost, age 41, passed away Saturday morning, March 7, 1998, due to injuries sustained in an automobile accident. He was born August 4, 1956 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Jordan (9), Travis (8), Bryce (6); parents, three brothers, Donald Glade (Lynne), Kenneth Wesley (Ellen), Alex Reed, and two sisters, Anne (Dale) Elkins and Sally (Kent) Britton. A son, Michael Brian Frost, preceded him in death. Funeral services will be held at 12 noon Friday, March 13, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7 p.m. Thursday at Wasatch Lawn Mortuary, 3401 S. Highland Drive, and at the Stake Center from 10:45-11:45 a.m. Friday. Interment at Wasatch Lawn Memorial Park.

Fig. 1. A sample obituary.

As an example, the unstructured documents we have chosen for illustration in this paper are obituaries. Figure 1 shows an example.1 An obituary is data rich, typically including several constants such as name, age, death date, and birth date of the deceased person; a funeral date, time, and address; viewing and interment dates, times, and addresses; names of related people and family relationships. The information in an obituary is also narrow in ontological breadth, having data about a particular aspect of genealogical knowledge that can be described by a small ontological model instance.

Specifically, our approach consists of the following steps. (1) We develop the ontological model instance over the area of interest. (2) We parse this ontology to generate a database scheme and to generate rules for matching constants and keywords. (3) To obtain data from the Web, we invoke a record extractor that separates an unstructured Web document into individual
1 To protect individual privacy, this obituary is not real. It is based on an actual obituary, but it has been significantly changed so as not to reveal identities. Obituaries used in our experiment reported later in this paper are real, but only summary data and isolated occurrences of actual items of data are reported.
record-size chunks, cleans them by removing markup-language tags, and presents them as individual unstructured documents for further processing. (4) We invoke recognizers that use the matching rules generated by the parser to extract from the cleaned individual unstructured documents the objects and relationships expected to populate the model instance. (5) Finally, we populate the generated database scheme by using heuristics to determine which constants populate which records in the database scheme. These heuristics correlate extracted keywords with extracted constants and use cardinality constraints in the ontology to determine how to construct records and insert them into the database scheme. Once the data is extracted, we can query the structure using a standard database query language. To make our approach general, we fix the ontology parser, Web record extractor, keyword and constant recognizer, and database record generator; we change only the ontology as we move from one application domain to another. Thus, the effort required to apply our suggested technique to a new domain depends only on the effort required to construct a conceptual model for the new domain.

In an earlier paper [10], we presented some of these ideas for extracting and structuring data from unstructured documents. We also presented results of experiments we conducted on two different types of unstructured documents taken from the Web, namely, car ads and job ads. In those experiments, our approach attained recall ratios in the range of 90% and precision ratios near 98%. These results were very encouraging; however, the ontology we used was very narrow, essentially only allowing single constants or single sets of constants to be associated with a given item of interest (i.e., a car or a job).

In this paper we enrich the ontology (the conceptual model) and we choose an application that demands more attention to this richer ontology. For example, our earlier model supported only binary relationship sets, but our current approach supports n-ary relationship sets. Furthermore, we enhance the ontology in two significant ways. (1) We adopt "data frames" as a way to encapsulate the concept of a data item with all of its essential properties [8]. (2) We include lexicons to enrich our ability to recognize constants that are difficult to describe as simple patterns, such as names of people. Together, data frames and lexicons enrich the expressiveness of an ontological model instance. This paper also extends our earlier work by adding an automated tool for detecting and extracting unstructured records from HTML Web documents. We are thus able to fully automate the extraction process once we have identified a Web document from which we wish to extract data. Further enhancements are still needed to locate documents of interest with respect to the ontology and to handle sets of related documents that together provide the data for a given ontology. Nevertheless, the extensions we do add in this paper significantly enhance the approach presented earlier [10].
2 Related Work
Of the two approaches to extracting Web data (Web query languages and wrappers), the approach we take falls into the category of extracting data using
wrappers. A wrapper for extracting data from a text-based information source generally consists of two parts: (1) extracting attribute values from the text, and (2) composing the extracted values for attributes into complex data structures. Wrappers have been written either fully manually [5,11,12], or with some degree of automation [1,4,7,13,16]. The work on automating wrapper writing focuses primarily on using syntactic clues, such as HTML tags, to identify attribute values and to direct their extraction and composition. Our work differs fundamentally from this approach to wrapper writing because it focuses on conceptual modeling to identify and direct extraction and composition (although we do use syntactic clues to detect record boundaries in unstructured documents). In our approach, once the conceptual-model instance representing the application ontology has been written, wrapper generation is fully automatic.

A large body of research exists in the area of information extraction using natural-language understanding techniques [6]. The goal of these natural-language techniques is to extract conceptual information from the text through the use of lexicons identifying important keywords combined with sentence analysis. In comparison, our work does not attempt to extract such a deep level of understanding of the text, but neither does it depend upon complete sentences, as their work does. We believe our approach to be more appropriate for Web pages and classified ads, which often do not contain complete sentences.

The work closest to ours is [15]. In this work, the authors explain how they extract information from text-based data sources using a notion of "concept definition frames," which are similar to the "data frames" in our conceptual model. An advantage of our approach is that our conceptual model is richer, including, for example, cardinality constraints, which we use in the heuristics for composing extracted attribute values into object structures.
3 Web Data Extraction and Structuring
Figure 2 shows the overall process we use for extracting and structuring Web data. As depicted in the figure, the input (upper left) is a Web page, and the output (lower right) is a populated database. The figure also shows that the application ontology is an independent input. This ontology describes the application of interest. When we change applications, for example from car ads, to job ads, to obituaries, we change the ontology, and we apply the process to different Web pages. Significantly, everything else remains the same: the routines that extract records, parse the ontology, recognize constants and keywords, and generate the populated database instance do not change. In this way, we make the process generally applicable to any domain.
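As a schematic of the process in Fig. 2, the following toy sketch shows what flows where; every function body is a trivial stub of our own (a made-up matching rule, a made-up scheme, and <hr> assumed as record separator), not the actual components described in the sections that follow.

    import re

    def parse_ontology(ontology_text):
        # Stub: a single toy matching rule and a toy database description.
        matching_rules = {"Name": r"[A-Z][a-z]+ [A-Z][a-z]+"}
        db_description = {"Deceased": ["Name"]}
        return matching_rules, db_description

    def extract_records(web_page):
        # Stub record extractor: pretend <hr> was chosen as the separator.
        return [chunk for chunk in web_page.split("<hr>") if chunk.strip()]

    def recognize(record_text, matching_rules):
        # Build a descriptor/string/position table, sorted on begin position.
        table = [(descriptor, m.group(), m.start(), m.end() - 1)
                 for descriptor, pattern in matching_rules.items()
                 for m in re.finditer(pattern, record_text)]
        return sorted(table, key=lambda entry: entry[2])

    def generate_tuples(table, db_description):
        # Stub database-instance generator: no conflict-resolution heuristics.
        return [{"Name": value} for descriptor, value, begin, end in table
                if descriptor == "Name"]

    page = "<h4>Brian Frost ...</h4><hr><h4>Leonard Gunther ...</h4>"
    rules, scheme = parse_ontology("... application ontology ...")
    for record in extract_records(page):
        print(generate_tuples(recognize(record, rules), scheme))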
3.1 Ontological Specification
As Fig. 2 shows, the application ontology consists of an object-relationship model instance, data frames, and lexicons. An ontology parser takes all this information as input and produces constant/keyword matching rules and a database description as output.
Fig. 2. Data extraction and structuring process (application ontology with object-relationship model instance, data frames, and lexicons; ontology parser producing constant/keyword matching rules and a database description; record extractor producing unstructured record documents from a Web page; constant/keyword recognizer producing a data-record table; database-instance generator producing the populated database).
Figure 3 gives the object-relationship model instance for our obituary application in a graphical form. We use the Object-oriented Systems Model (OSM) [9] to describe our ontology. In OSM rectangles represent sets of objects. Dotted rectangles represent lexical object sets (those such as Age and Birth Date whose objects are strings that represent themselves), and solid rectangles represent nonlexical object sets (those such as Deceased Person and Viewing whose objects are object identifiers that represent nonlexical real-world entities). Lines connecting rectangles represent sets of relationships. Binary relationship sets have a verb phrase and reading-direction arrow (e.g., Funeral is on Funeral Date names the relationship set between Funeral and Funeral Date), and n-ary relationships have a diamond and a full descriptive name that includes the names of its connected object sets. Participation constraints near connection points between object and relationship sets designate the minimum and maximum number of times an object in the set participates in the relationship. In OSM a colon (:) after an object-set name (e.g., Birth Date: Date) denotes that the object set is a specialization (e.g., the set of objects in Birth Date is a subset of the set of objects in the implied Date object set).
Fig. 3. Sample object-relationship model instance.
For our ontologies, an object-relationship model instance gives both a global view (e.g., across all obituaries) and a local view (e.g., for a single obituary). We express the global view as previously explained and specialize it for a particular obituary by imposing additional constraints. We denote these specializing constraints in our notation by a "becomes" arrow (->). In Fig. 3, for example, the Deceased Person object set becomes a single object, as denoted by "-> •", and the 1..* participation constraint on both Deceased Name and Relative Name becomes 1. We thus declare in our ontology that an obituary is for one deceased person and that a name either identifies the deceased person or the family relationship of a relative of the deceased person. From these specializing constraints, we can also derive other facts about individual obituaries, such as that there is only one funeral and one interment, although there may be several viewings and several relatives. A model-equivalent language has been defined for OSM [14]. Thus, we can faithfully write any OSM model instance in an equivalent textual form. We use the textual representation for parsing. In the textual representation, we can determine whether an object set is lexical or nonlexical by whether it has an associated data frame that describes a set of possible strings as objects for the object set. In general a data frame describes everything we wish to know about an object set. If the data frame is for a lexical object set, it describes the string
patterns for its constants (member objects). Whether lexical or nonlexical, an associated data frame can describe context keywords that indicate the presence of an object in an object set. For example, we may have "died" or "passed away" as context keywords for Death Date and "buried" as a context keyword for Interment. A data frame for lexical object sets also defines conversion routines to and from a common representation and other applicable operations, but our main emphasis here is on recognizing constants and context keywords.

In Fig. 4 we show as examples part of the data frames for Name and Relative Name. A number in brackets designates the longest expected constant for the data frame; we use this number to generate upper bounds for "varchar" declarations in our database scheme. Inside a data frame we declare constant patterns, keyword patterns, and lexicons of constants. We can declare patterns to be case sensitive or case insensitive and switch back and forth as needed. We write all our patterns using Perl 5 regular expression syntax. The lexicons referenced in Name in Fig. 4 are external files consisting of a simple list of names: first.dict contains 16,167 first names from "aaren" to "zygmunt" and last.dict contains 16,522 last names from "aalders" to "zywiel". We use these lexicons in patterns by referring to them respectively as First and Last. Thus, for example, the first constant pattern in Name matches any one of the names in the first-name lexicon, followed by one or more white-space characters, followed by any one of the names in the last-name lexicon. The other pattern matches a string of letters starting with a capital letter (i.e., a first name, not necessarily in the lexicon), followed by white space, optionally followed by a capital-letter/period pair (a middle initial) and more white space, and finally a name in the last-name lexicon.

    ...
    Name matches [80] case sensitive
      constant
        { extract First, "\s+", Last; },
        ...
        { extract "[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?", Last; },
        ...
      lexicon
        { First case insensitive; filename "first.dict"; },
        { Last case insensitive; filename "last.dict"; };
    end;

    Relative Name matches [80] case sensitive
      constant
        { extract First, "\s*\(", First, "\)\s*", Last;
          substitute "\s*\([^)]*\)" -> ""; },
        ...
    end;
    ...

Fig. 4. Sample data frames.
The Relative Name data frame in Fig. 4 is a specialization of the Name data frame. In many obituaries, spouse names of blood relatives appear parenthetically inside names. In Fig. 1, for example, we find “Anne (Dale) Elkins”. Here,
Anne Elkins is the sister of the deceased, and Dale is the husband of Anne. To extract the name of the blood relative, the Relative Name data frame applies a substitution that discards the parenthesized name, if any, when it extracts a possible name of a relative. Besides extract and substitute, a data frame may also have context and filter clauses, which respectively tell us what context we must have for an extraction and what we filter out when we do the extraction.
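To make the role of the lexicons and patterns concrete, here is a small Python rendering of the Fig. 4 data frames (the paper's patterns are Perl 5; the tiny in-memory name lists, the alternation-based lexicon expansion, and the uniform case-insensitive matching are our own simplifications; real lexicons would be read from first.dict and last.dict).

    import re

    # Toy stand-ins for the first.dict and last.dict lexicons of Fig. 4.
    FIRST = ["brian", "susan", "anne", "dale"]
    LAST = ["frost", "fox", "elkins"]

    def lexicon(names):
        # Expand a lexicon into a regular-expression alternation.
        return "(?:" + "|".join(re.escape(n) for n in names) + ")"

    First, Last = lexicon(FIRST), lexicon(LAST)

    # The two constant patterns of the Name data frame, assembled from their pieces.
    name_patterns = [
        re.compile(First + r"\s+" + Last, re.IGNORECASE),
        re.compile(r"[A-Z][a-zA-Z]*\s+(?:[A-Z]\.\s+)?" + Last, re.IGNORECASE),
    ]
    print(bool(name_patterns[0].search("Susan Fox married ...")))   # True

    # The Relative Name pattern, with the substitute clause that drops a
    # parenthesized spouse name such as "(Dale)".
    relative_name = re.compile(First + r"\s*\(" + First + r"\)\s*" + Last,
                               re.IGNORECASE)

    def extract_relative_name(text):
        m = relative_name.search(text)
        return None if m is None else re.sub(r"\s*\([^)]*\)", "", m.group())

    print(extract_relative_name("two sisters, Anne (Dale) Elkins and Sally ..."))
    # prints: Anne Elkins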
3.2 Unstructured Record Extraction
As mentioned earlier, we leave for future work the problem of locating Web pages of interest and classifying them as a page containing exactly one record, a page containing many records, or a part of a group of pages containing one record. Assuming we have a page containing many records, we report here on our implementation of one possible approach to the problem of separating these records and feeding them one at a time to our data-extraction routines. The approach we take builds a tree of the page’s structure based on HTML, heuristically searches the tree for the subtree most likely to contain the records, and then heuristically finds the most likely separator among the siblings in this subtree of records. We explain the details in succeeding paragraphs. There are other approaches that may work as well (e.g., we can preclassify particular HTML tags as likely separators or match the given ontology against probable records), but we leave these for future work. HTML tags define regions within an HTML document. Based on the nested structure of start- and end-tags, we build a tree called a tag-tree. Figure 5(a) gives part of a sample obituary HTML document, and Fig. 5(b) gives its corresponding tag-tree. As Fig. 5(a) shows, the tag-pair - surrounds the entire document and thus html becomes the root of the tag-tree. Similarly, we have title nested within head, which is nested within html, and as a sibling of head we have body with its nested structure. The leaves nested within the
-
pair are the ordered sequence of sibling nodes h1, h4, hr, h4, ... . A node in a tag-tree has two fields: (1) the first tag of each start-tag/end-tag pair or a lone tag (when there is no closing tag), and (2) the associated text. We do not show the text in Fig. 5(b), but, for example, the text field for the title node is “Classifieds” and the text field for the first h4 field following the first ellipsis in the leaves is the obituary for Brian Fielding Frost. Using the tag-tree, we find the subtree with the largest fan-out—td in Fig. 5(b). For documents with many records of interest, the subtree with the largest fan-out should contain these records; other subtrees represent global headers or trailers. To find the record separators within the highest fan-out subtree, we begin by counting the number of appearances of each sibling tag below the root node of the subtree (the number of appearances of h1, h4, and hr for our example). We ignore tags with relatively few appearances (h1 in our example) and concentrate on dominant tags, tags with many appearances (h4 and hr in our example). For the dominant tags, we apply two heuristics: a Most-Appearance (MA) heuristic and a Standard-Deviation (SD) heuristic. If there is only one dominant tag, the MA heuristic selects it as the separator. If there are several dominant tags, the
[Fig. 5. An HTML document and its tag-tree: (a) a sample obituary HTML document ("Classifieds", "Funeral Notices", obituaries for Lemar K. Adamson, Brian Fielding Frost, Leonard Kenneth Gunther, ..., "All material is copyrighted."); (b) the corresponding tag-tree, rooted at html, with a head/title branch and a body/table/tr/td branch whose leaf siblings are h1, h4, hr, h4, hr, ..., h4, hr, h4, hr, ..., h4.]
MA heuristic checks to see whether the dominant tags all have the same number of appearances or are within one of having the same number of appearances. If so, the MA heuristic selects any one of the dominant tags as the separator. If not, we apply the SD heuristic. For the SD heuristic, we first find the length of each text segment between identical dominant tags (e.g., the lengths of the text segments between each successive pair of hr tags and between each successive pair of h4 tags). We then calculate the standard deviation of these lengths for each tag. Since the records of interest often all have approximately the same length, we choose the tag with the least standard deviation to be the separator. Once we know the separator, it is easy to separate the unstructured records and feed them individually to downstream processes.
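A minimal sketch of how the MA and SD heuristics might be combined to choose a separator tag is given below. The (tag, text) representation of the subtree's siblings and the 50% dominance threshold are our own illustrative assumptions, not the authors' implementation.

```python
import statistics
from collections import Counter

def choose_separator(siblings):
    """Choose a record-separator tag among the sibling (tag, text) pairs of the
    highest fan-out subtree, following the MA and SD heuristics described above."""
    counts = Counter(tag for tag, _ in siblings)
    # Dominant tags: relatively many appearances (threshold is an assumption).
    dominant = [t for t, c in counts.items() if c >= 0.5 * max(counts.values())]

    # MA heuristic: a single dominant tag, or all dominant tags within one
    # appearance of each other -- pick any of them.
    if len(dominant) == 1:
        return dominant[0]
    if max(counts[t] for t in dominant) - min(counts[t] for t in dominant) <= 1:
        return dominant[0]

    # SD heuristic: prefer the tag whose between-occurrence text segments have
    # the most uniform length (records tend to be roughly the same size).
    def segment_length_sd(tag):
        lengths, current = [], None
        for t, text in siblings:
            if t == tag:
                if current is not None:
                    lengths.append(len(current))
                current = text          # start a new segment at this occurrence
            elif current is not None:
                current += text
        return statistics.pstdev(lengths) if len(lengths) > 1 else float("inf")

    return min(dominant, key=segment_length_sd)
```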
3.3 Database Record Generation
With the output of the ontology parser and the record extractor in hand, we proceed with the problem of populating the database. To populate the database, we iterate over two basic steps for each unstructured record document. (1) We produce a descriptor/string/position table consisting of constants and keywords recognized in the unstructured record. (2) Based on this table, we match attributes with values and construct database tuples. As Fig. 2 shows, the constant/keyword recognizer applies the generated matching rules to an unstructured record document to produce a data-record table. Figure 6 gives the first several lines of the data-record table produced from our sample obituary in Fig. 1. Each entry (a line in the table) describes either a constant or a keyword. We separate the fields of an entry by a bar (|). The first field is a descriptor: for constants the descriptor is an object-set name to which the constant may belong, and for keywords the descriptor is KEYWORD(x) where x is an object-set name to which the keyword may apply. The second field is the constant or keyword found in the document, possibly transformed as it is extracted according to substitution rules provided in a data frame. The last two fields give the position as the beginning and ending character count for the first and last characters of the recognized constant or keyword.
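Read as data, each bar-separated line of this table is straightforward to parse. A tiny sketch follows; the sample entry and its character positions are invented for illustration, since Fig. 6 is not reproduced here.

```python
def parse_entry(line):
    """Parse one bar-separated line of the data-record table into
    (descriptor, string, start, end), with the field meanings described above."""
    descriptor, string, start, end = line.split("|")
    return descriptor, string, int(start), int(end)

print(parse_entry("KEYWORD(DeathDate)|passed away|370|380"))
# -> ('KEYWORD(DeathDate)', 'passed away', 370, 380)
```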
To facilitate later processing, we sort this table on the third field, the beginning character position of the recognized constant or keyword. A careful consideration of Fig. 6 reveals some interesting insights into the recognition of constants and keywords and also into the processing required by the database-instance generator. Notice in the first four lines, for example, that the string “Brian Fielding Frost” is the same and that it could either be the name of the deceased or the name of a relative of the deceased. To determine which one, we must heuristically resolve this conflict. Since there is no keyword here for Deceased Person, no keyword directly resolves this conflict for us. However, we know that the important item in a record is almost always introduced at the beginning, a strong indication that the name is the name of the deceased, not the name of one of the deceased’s relatives. More formally, since the constraints on DeceasedName within a record require a one-to-one correspondence between DeceasedName and DeceasedPerson and since DeceasedName is not optional, the first name that appears is almost assuredly the name of the deceased person. Keyword resolution of conflicts is common. In Fig. 6, for example, consider the resolution of the death date and the birth date. Since the various dates are all specializations of Date, a particular date, without context, could be any one of the different dates (e.g., “March 7, 1998” might be any one of five possible
kinds of date). Notice, however, that "passed away", a keyword for DeathDate, is only 20 characters away from the beginning of "March 7, 1998", giving a strong indication that it is the death date. Similarly, "born", a keyword for BirthDate, is within two characters of "August 4, 1956". Keyword proximity easily resolves these conflicts for us. Continuing with one more example, consider the phrase "born August 4, 1956 in Salt Lake City, to", which is particularly interesting. Observe in Fig. 6 that the recognizer tags this phrase as a keyword for Relationship and also in the next line as a constant for Relationship, with "parent" substituted for the longer phrase. The regular expression that the recognizer uses for this phrase matches "born to" with any number of intervening characters. Since we have specified in our Relationship data frame that "born to" is a keyword for a family relationship and is also a possible constant value for the Relationship object set, with the substitution "parent", we emit both lines as shown in Fig. 6. Observe further that we have "parent" close by (two characters away from) the beginning of the name Donald Fielding and close by (twenty-two characters away from) the beginning of the name Helen Glade Frost, which are indeed the parents of the deceased.
The database-instance generator takes the data-record table as input along with a description of the database and constructs tuples for the extracted raw data. The heuristics applied in the database-instance generator are motivated by observations about the constraints in the record-level description. We classify these constraint-based heuristics as singleton heuristics, functional-group heuristics, and nested-group heuristics.
– Singleton Heuristics. For values that should appear at most once, we use keyword proximity to find the best match, if any, for the value (e.g., we match DeathDate with "March 7, 1998" and BirthDate with "August 4, 1956" as explained earlier; a small sketch of this matching appears at the end of this subsection). For values that must appear at least once, if keyword proximity fails to find a match, we choose the first appearance of a constant belonging to the object set whose value must appear. If no such value appears, we reject the record. For our ontology, only the name of the deceased must be found.
– Functional-Group Heuristics. An object set whose objects can appear several times, along with its functionally dependent object sets, constitutes a functional group. In our sample ontology, Viewing and its functionally dependent attributes constitute such a group. Keywords that do not pertain to the item of interest provide boundaries for context switches. For our example (see Fig. 1), we have a Funeral context before the viewing information and an Interment context after the viewing information. Within this context we search for ViewingDate / ViewingAddress / BeginningTime / EndingTime groups.
– Nested-Group Heuristics. We use nested-group heuristics to process n-ary relationship sets (for n > 2). Writers often produce these groups by a nesting structure in which one value is given followed by its associated values, which may be nested, and so forth. Indeed, the obituaries we considered consistently
follow this pattern. In Fig. 1 we see “sons” followed by “Jordan”, “Travis”, and “Bryce”; “brothers” followed by “Donald”, “Kenneth”, and “Alex”; and “sisters” followed by “Anne” and “Sally”. The result of applying these heuristics to an unstructured obituary record is a set of generated SQL insert statements. When we applied our extraction process to the obituary in Fig. 1, the values extracted were quite accurate, but not perfect. For example, we missed the second viewing address, which happens to have been correctly inserted as the funeral address, but not also as the viewing address for the second viewing. Our implementation currently does not allow constants to be inserted in two different places, but we plan to have future implementations allow for this possibility. Also, we obtained neither of the viewing dates, both of which can be inferred from “Thursday” and “Friday” in the obituary. We also did not obtain the full name for some of the relatives, such as sons of the deceased, which can be inferred by common rules for family names. At this point our implementation only finds constants that actually appear in the document. In future implementations, we would like to add procedures to our data frames to do the calculations and inferences needed to obtain better results.
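As promised above, here is a small sketch of the keyword-proximity matching used by the singleton heuristics. The data-record entries and character positions are invented for illustration, and the proximity measure is our own simplification of what the system does.

```python
def nearest_value(entries, keyword_for, value_set):
    """Among recognized constants of `value_set` (e.g. "Date"), pick the one
    closest to a keyword for the target object set (e.g. KEYWORD(DeathDate)).
    `entries` are (descriptor, string, start, end) rows of the data-record table."""
    keywords = [e for e in entries if e[0] == f"KEYWORD({keyword_for})"]
    values = [e for e in entries if e[0] == value_set]
    best, best_dist = None, None
    for _, v, vs, ve in values:
        for _, _, ks, ke in keywords:
            dist = min(abs(vs - ke), abs(ks - ve))   # character proximity
            if best_dist is None or dist < best_dist:
                best, best_dist = v, dist
    return best

table = [
    ("KEYWORD(DeathDate)", "passed away", 370, 380),
    ("Date", "March 7, 1998", 401, 413),
    ("KEYWORD(BirthDate)", "born", 430, 433),
    ("Date", "August 4, 1956", 435, 448),
]
print(nearest_value(table, "DeathDate", "Date"))   # -> 'March 7, 1998'
```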
4 Results
For our test data, we took 38 obituaries from a Web page provided by the Salt Lake Tribune (www.sltrib.com) and 90 obituaries from a Web page provided by the Arizona Daily Star (www.azstarnet.com). When we ran our extraction processor on these obituaries, we obtained the results in Table 1 for the Salt Lake Tribune and in Table 2 for the Arizona Daily Star. As Tables 1 and 2 show, we counted the number of facts (attribute-value pairs) in the test-set documents. Consistent with our implementation, which only extracts explicit constants, we counted a string as being correct if we extracted the constant as it appeared in the text. With this understanding, counting was basically straightforward. For names, however, we often only obtained partial names. Because our name lexicon was incomplete and our name-extraction expressions were not as rich as possible, we sometimes missed part of a name or split a single name into two. We list the count for these cases after the + in the Declared Correctly column. We noted that this also accounted for most of the incorrectly identified relatives. With a more accurate and complete lexicon and with richer name-extraction expressions, we believe that we could achieve much higher precision.
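Assuming the usual definitions (the tables themselves do not spell them out), the reported ratios can be read as:

```latex
\text{recall} \;=\; \frac{\#\,\text{facts declared correctly (incl.\ partially correct)}}{\#\,\text{facts in source}},
\qquad
\text{precision} \;=\; \frac{\#\,\text{facts declared correctly}}{\#\,\text{facts declared correctly} \;+\; \#\,\text{facts declared incorrectly}}
```

For example, under this reading the RelativeName row of Table 1 gives a recall of (322 + 75)/453 ≈ 88%.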
5 Conclusions
We described a conceptual-modeling approach to extracting and structuring data from the Web. A conceptual model instance, which we called an ontology, provides the relationships among the objects of interest, the cardinality constraints
Table 1. Salt Lake Tribune Obituaries

Fact              Facts in Source   Declared Correctly (+ Partially Correct)   Declared Incorrectly   Recall Ratio
DeceasedPerson    38                38                                          0                      100%
DeceasedName      38                23+15                                       0                      100%
Age               22                20                                          1                      91%
BirthDate         30                30                                          1                      100%
DeathDate         33                31                                          0                      94%
FuneralDate       24                22                                          0                      92%
FuneralAddress    25                24                                          1                      96%
FuneralTime       29                28                                          0                      97%
IntermentDate     0                 0                                           0                      NA
IntermentAddress  4                 4                                           0                      100%
Viewing           29                27                                          1                      93%
ViewingDate       10                7                                           0                      70%
ViewingAddress    17                13                                          0                      76%
BeginningTime     32                28                                          0                      88%
EndingTime        29                26                                          0                      90%
Relationship      453               359+9                                       29                     81%
RelativeName      453               322+75                                      159                    88%

[The Precision Ratio column of Table 1 is not legible in this copy.]
for these relationships, a description of the possible strings that can populate various sets of objects, and possible context keywords expected to help match values with object sets. To prepare unstructured documents for comparison with the ontology, we also proposed a means to identify the records of interest on a Web page. With the ontology and record extractor in place, we were able to extract records automatically and feed them one at a time to a processor that heuristically matched them with the ontology and populated a database with the extracted data.
The results we obtained for our obituary example are encouraging. Because of the richness of the ontology, we had initially expected much lower recall and precision ratios. Achieving about 90% recall and 75% precision for names and 95% precision elsewhere was a pleasant surprise.
Information Coupling in Web Databases*
Sourav S. Bhowmick, Wee-Keong Ng, and Ee-Peng Lim
Center for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798, SINGAPORE
{sourav,wkn,aseplim}@cais.ntu.edu.sg
Abstract. Web information coupling refers to an association of topically related web documents. This coupling is initiated explicitly by a user in a web warehouse specially designed for web information. Web information coupling provides the means to derive additional, useful information from the WWW. In this paper, we discuss and show how two web operators, i.e., global web coupling and local web coupling, are used to associate related web information from the WWW and also from multiple web tables in a web warehouse. This paper discusses various issues in web coupling such as coupling semantics, coupling-compatibility, and coupling evaluation.
1 Introduction
Given the high rate of growth of the volume of data available on the WWW, locating information of interest in such an anarchic setting becomes a more difficult task every day. Thus, there is a pressing need for effective and efficient tools for information consumers, who must be able to easily locate and manipulate information on the Web. Currently, web information may be discovered primarily by two mechanisms: browsers and search engines. This form of information access on the Web has a few shortcomings:
• While web browsers fully exploit hyperlinks among web pages, search engines have so far made little progress in exploiting link information. Not only do most search engines fail to support queries on the Web utilizing link information, they also fail to return link information as part of a query's result.
• From the query results returned by search engines, a user may wish to couple a set of related Web documents together for reference. Presently, he may only do so manually by visiting and downloading these documents as files on the user's hard disk. However, this method is tedious, and it does not allow the user to retain the coupling framework.
* This work was supported in part by the Nanyang Technological University, Ministry of Education (Singapore) under Academic Research Fund #4-12034-5060, #4-12034-3012, #4-12034-6022. Any opinions, findings, and recommendations in this paper are those of the authors and do not reflect the views of the funding agencies.
[Fig. 1. Coupling framework (query graph) of 'Symptoms': starting from http://www.virtualdisease.com/ (node x), an 'Issues' page y with links labelled 'Symptoms' (e) to node z and 'Treatment' (f) to node w.]
• The set of downloaded documents can be refreshed (or updated) only by repeating the above procedure manually.
• If a user successfully coupled a set of Web documents together, he may wish to know if there are other Web documents satisfying the same coupling framework. Presently, the only way is to request the same or other search engines for further Web documents and probe these documents manually.
• Over a period of time, there will be a number of coupled collections of Web documents created by the user. As each of these collections exists simply as a set of files on the user's system, there is no convenient way to organize, manage and infer further useful information from them.
In this paper, we introduce the concept of Web Information Coupling (WIC) to help overcome the limitations of present search engines. WIC enables us to efficiently manage and manipulate coupled information extracted from the Web. We use coupling because it is a convenient way to relate information located separately on the WWW. In this paper, we discuss two types of coupling: global and local web coupling. Global coupling enables a user to retrieve a set of collections of inter-related documents satisfying a coupling framework regardless of the locations of the documents in the Web. To initiate global coupling, a user specifies the coupling framework in the form of a query graph. The actual coupling is performed by the WIC system and is transparent to the user. The result of such user-driven coupling is a set of related documents materialized in the form of a web table. Thus, global web coupling eliminates the problem of manually visiting and downloading Web documents as files on the user's hard disk. Coupling is not limited to the extraction of related information directly from the WWW. Local coupling can be performed on web tables [15] materialized by global coupling. This form of web coupling is achieved locally without resorting to the WWW. Given two web tables, local coupling is initiated explicitly by specifying a pair(s) of web documents and a set of keyword(s) to relate them. The result of local web coupling is a web table consisting of a set of collections of inter-related Web documents from the two input tables. The following example briefly illustrates global and local web coupling.
Example 1. Suppose Bill wishes to find a list of diseases with their symptoms and treatments, and a list of drugs and their side effects on diseases, on the WWW. Assume that there are web sites at http://www.virtualdisease.com/
[Fig. 2. Coupling framework (query graph) of 'Drug list': starting from http://www.virtualdrug.com/ (node a), a 'Drug List' page b leads to an 'Issues' page c with a link labelled 'Side effects' (t) to node d.]
[Fig. 3. Partial view of the 'Symptoms' web table: web tuples for Cancer, Breast Cancer, Diabetes, and AIDS, each linking an issues page to its symptoms and treatment pages.]
and http://www.virtualdrug.com/ which integrate disease-related and drug-related information from various web sites respectively. Bill figured that there could be hyperlinks with anchor labels 'symptoms' and 'treatments' in the web site at http://www.virtualdisease.com/ and labels 'side effects' in the web site at http://www.virtualdrug.com/ that might be useful. In order to initiate global web coupling (i.e., to couple this related information from the WWW), Bill constructs coupling frameworks (query graphs) as shown in Figs. 1 and 2. The global web coupling operator is applied to retrieve those sets of related documents that match the coupling frameworks. Each set of inter-linked documents retrieved for a coupling framework is a connected, directed graph (also called a web tuple) and is materialized in the web tables Symptoms and Drug list respectively. A small portion of these web tables is shown in Figs. 3 and 4. Each web tuple in Symptoms and Drug list contains information about the symptoms and treatments of a particular disease, and the side effects of a drug on the disease, respectively. Suppose a user wants to extract information related to the symptoms and treatments of cancer and AIDS, and a list of drugs with their side effects on them. Clearly, this information is already stored in the tables Symptoms and Drug list.
[Fig. 4. Partial view of the 'Drug list' web table: web tuples for drugs such as Beta Carotomel, Docetaxel, Anastrozole, and Indavir, each linking a drug-list page to the drug's issues page and its side-effects page.]
The local web coupling operator enables us to extract this related information from the two web tables. A user may indicate the documents (say y and b) in the coupling frameworks of Symptoms and Drug list and the keywords (in this case "cancer" and "AIDS") based on which local web coupling will be performed. A portion of the coupled web table is shown in Fig. 5. A Web Information Coupling (WIC) system is a database system for managing and manipulating coupled information extracted from the Web. To realize this system, we first propose a data model called the Web Information Coupling Model (WICM) to describe and abstract web objects. We then introduce the operators to perform global and local coupling.
2 Web Information Coupling Model
We proposed a data model for a web warehouse in [5,15]. The data model consists of a hierarchy of web objects. The fundamental objects are Nodes and Links. Nodes correspond to HTML or plain text documents and links correspond to hyper-links interconnecting the documents in the World Wide Web. We define a Node type and a Link type to refer to these two sets of distinct objects. These objects consist of a set of attributes as shown below: Node = [url, title, format, size, date, text] Link = [source-url, target-url, label, link-type]
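A minimal sketch of these two object types follows; the attribute names mirror the model, but the Python rendering and the field types are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Attributes of the Node type in WICM.
    url: str
    title: str
    format: str
    size: int
    date: str
    text: str

@dataclass
class Link:
    # Attributes of the Link type in WICM.
    source_url: str
    target_url: str
    label: str
    link_type: str
```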
WICM supports structured or topological querying; different sets of keywords may be specified on the nodes and additional criteria may be defined for the hyperlinks among the nodes. Thus, the query is a graph-like structure and is used to match portions of the WWW satisfying the conditions. In this way, the query result is a set of directed graphs (called web tuples) instantiating the query graph. Formally, a web tuple w = ⟨Nw, Lw, Vw⟩ is a triplet where Nw is a set of nodes in web tuple w, Lw is a set of links in web tuple w, and Vw is the set of connectivities (next section). A collection of these web tuples is called a
web table. If the web table is materialized, we associate a name with the table. The web schema of the web table is the query graph that is used to derive the table. It is defined as a 4-tuple M = ⟨Xn, Xℓ, C, P⟩ where Xn is a set of node variables, Xℓ is a set of link variables, C is a set of connectivities in DNF, and P is a set of predicates in DNF. A set of web tables constitutes a web database. We illustrate the concept of web schema with the following examples. Consider the query graphs (Figs. 1 and 2) in Example 1. The schemas of these query graphs are given below:

Example 2. Produce a list of diseases with their symptoms and treatments, starting from the web site at http://www.virtualdisease.com/. We may express the schema of the above query by Mi = ⟨Xi,n, Xi,ℓ, Ci, Pi⟩ where Xi,n = {x, y, z, w}, Xi,ℓ = {e, f, -}, Ci ≡ ki1 ∧ ki2 ∧ ki3 such that ki1 = x⟨-⟩y, ki2 = y⟨e⟩z, ki3 = y⟨f⟩w, and Pi ≡ pi1 ∧ pi2 ∧ pi3 ∧ pi4 ∧ pi5 ∧ pi6 such that pi1(x) ≡ [x.url EQUALS "http://www.virtualdisease.com/"], pi2(y) ≡ [y.title CONTAINS "issues"], pi3(e) ≡ [e.label CONTAINS "symptoms"], pi4(z) ≡ [z.title CONTAINS "symptoms"], pi5(f) ≡ [f.label CONTAINS "treatments"], pi6(w) ≡ [w.title CONTAINS "treatments"].

Example 3. Produce a list of drugs and their side effects, starting from the web site at http://www.virtualdrug.com/. The schema of the above query is Mj = ⟨Xj,n, Xj,ℓ, Cj, Pj⟩ where Xj,n = {a, b, c, d}, Xj,ℓ = {t, -}, Cj ≡ kj1 ∧ kj2 ∧ kj3 such that kj1 = a⟨-⟩b, kj2 = b⟨-⟩c, kj3 = c⟨t⟩d, and Pj ≡ pj1 ∧ pj2 ∧ pj3 ∧ pj4 ∧ pj5 such that pj1(a) ≡ [a.url EQUALS "http://www.virtualdrug.com/"], pj2(b) ≡ [b.title CONTAINS "Drug List"], pj3(c) ≡ [c.title CONTAINS "Issues"], pj4(d) ≡ [d.title CONTAINS "side effects"], pj5(t) ≡ [t.label CONTAINS "side effects"].

The query graphs (web schemas) described in Examples 2 and 3 express Bill's need to extract a set of inter-linked documents related to the symptoms and treatments of diseases, and the side effects of drugs on these diseases, from the WWW. Since conventional search engines cannot accept a query graph as input and return the inter-linked documents as the query result, a global web coupling operator is required. The global web coupling operator matches those portions of the WWW that satisfy the query graphs. The result of global web coupling is a collection of sets of related Web documents materialized in the form of a web table. Although global web coupling retrieves data directly from the WWW, the full potential of web coupling lies in the fact that it can couple related information residing in two different web tables in a web database. Suppose Bill wishes to know the symptoms and treatments associated with cancer and AIDS, and a list of drugs with their side effects on them. There are two methods in a web database to gather the composite information:
[Fig. 5. Web coupling: partial view of the coupled web table, in which web tuples from 'Symptoms' and 'Drug list' concerning the same disease (e.g., cancer or AIDS) are combined.]
1. Bill may construct a new web query for this purpose. The disadvantage of this method is that the information (stored in web tables) created by the queries in Examples 2 and 3 is not being used for this query.
2. Browse the web tables of the queries in Examples 2 and 3, select those tuples containing information related to cancer and AIDS, and then compare the results manually. However, there may be many matching web tuples, making the user's task of going over them tedious.
This motivates us to design a local web coupling operator that allows us to gather related information from the two web tables in a web database.
3 Global Web Coupling
In this section, we discuss global web coupling, a mechanism to couple related information from the WWW. We begin by formally defining the global web coupling operator. Next we explain how a coupled web table is created.
3.1 Definition
The global web coupling operator Γ takes in a query (expressed as a schema M) and extracts a set of web tuples from the WWW satisfying the schema. Let Wg be the resultant table; then Wg = Γ(M). Each web tuple matches a portion of the WWW satisfying the conditions described in the schema. These related web tuples are coupled together and stored in a web table. Each web tuple in the web table is structurally identical to the schema of the table. Some computability issues arise when applying the global web coupling operator to the WWW. The global web coupling operator is bound if and only if all
variables that begin a connectivity in the schema specified for the operator are bound. A query which embeds a bound Γ operator is always computable. Let us see why. Suppose a web query with schema M is posed against the WWW, i.e., we wish to compute Γ(M). Intuitively, the Γ operator is evaluable when there are starting points in the WWW from which we can begin our search. With current web technology, there are two methods to locate a web resource; we either know its URL and access the resource directly, or we go through a search engine by supplying keywords to obtain the URLs. Let x be a node variable; then a predicate such as [x.url EQUALS "a-url-here"] in a query allows us to use the URL specified to locate the document corresponding to x. The second method is embedded by predicates such as [x.text CONTAINS "some-keywords"], [x.title EQUALS "a-title-here"], and [e.label CONTAINS "some-keywords"]. Here, x and e are the bound variables. When a node or link variable is bound, we can access the resource it corresponds to either directly or through a web search engine. Variables that begin connectivities and are bound provide the starting point in the WWW for retrieving web tuples. Hence, queries with such variables are computable.
3.2 Web Table Creation
We now discuss briefly how to create the coupled web table. Given a web schema (query graph), Γ extracts a set of web tuples satisfying the query graph. Our approach to determine the set of web tuples from the WWW is as follows:
1. Check if the given web schema is computable. If it is, then obtain a set of URL(s) as the starting point of traversal by analyzing the predicates in the schema.
2. Get the node variables representing these start URL(s) and identify connectivities which contain the start node variables. Note that the start node variable will always be on the left-hand side of a connectivity.
3. Download the documents from the WWW that satisfy the predicates for the nodes and that contain links that satisfy the link predicates for the outgoing edges of this node.
4. Get the web documents (nodes) pointed to by the links and check whether these documents satisfy the predicates of the node in the schema. Repeat this until we reach the right-hand side of the connectivity.
5. Repeat the above two steps for all the connectivities in the schema.
6. Once all the web documents are collected by the above procedure, create individual web tuples by matching the set of nodes and links with the schema.
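The steps above can be read as a guided traversal. The sketch below is only an approximation: it handles a chain-shaped schema, leaves out full tuple assembly for branching connectivities (step 6), and the fetch/links_of/holds helpers are assumed rather than part of the WIC system.

```python
def global_web_coupling(start_url, path, fetch, links_of, holds):
    """Sketch of the traversal in steps 1-6 for a chain-shaped schema.

    `path` is a list of (link_predicate, node_predicate) pairs derived from the
    schema's connectivities; `fetch` downloads a document, `links_of` lists a
    document's outgoing links, and `holds` evaluates a predicate."""
    start = fetch(start_url)                 # steps 1-2: start from a bound URL
    tuples = [[start]]
    for link_pred, node_pred in path:        # steps 3-5: walk each connectivity
        extended = []
        for t in tuples:
            for link in links_of(t[-1]):
                if not holds(link_pred, link):
                    continue
                doc = fetch(link.target_url)
                if holds(node_pred, doc):
                    extended.append(t + [link, doc])
        tuples = extended
    return tuples                            # each list holds one tuple's nodes and links
```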
4 Local Web Coupling
Once we have the ability to couple useful information directly from the WWW using the global web coupling operator, we need to introduce an additional operator to facilitate local web coupling, i.e., extracting useful information locally from two web tables.
[Fig. 6. Web cartesian product: partial view of the web table obtained by concatenating each web tuple of 'Symptoms' with each web tuple of 'Drug list', including combinations (e.g., a cancer drug tuple paired with the diabetes or AIDS tuple) whose coupling nodes are unrelated.]
4.1 Definition
The local web coupling operator combines two web tables by integrating web tuples of one web table with web tuples of another table whenever there exist coupling nodes. Let Wi and Wj be two web tables with schemas Mi = ⟨Xi,n, Xi,ℓ, Ci, Pi⟩ and Mj = ⟨Xj,n, Xj,ℓ, Cj, Pj⟩ respectively. Suppose we want to couple Wi and Wj on node variables nci and ncj as they both contain information about diseases, and we want to correlate web tuples of Wi and Wj related to cancer. Let wi and wj be two web tuples from Wi and Wj respectively, and nc(wi) and nc(wj) be instances of nci and ncj respectively. Suppose documents at http://www.virtualdisease.com/cancer/index.html (represented by node nc(wi)) and http://www.virtualdrug.com/cancerdrugs/index.html (represented by node nc(wj)) respectively contain information related to cancer and appear in wi and wj respectively. Tuples wi and wj are coupling-compatible locally on nc(wi) and nc(wj) since they both contain similar information (information related to cancer). Thus, the coupling nodes are nc(wi) and nc(wj). We store the coupled web tuple in a separate web table. Note that the coupling-compatibility of two web tuples depends on the pair(s) of node variables and keyword(s) specified explicitly by the user in the local coupling query. We now formally define coupling-compatibility.
Definition 1. Let K(n, w, W) denote the set of keywords appearing in a web document (represented by node n) in web tuple w of web table W. Two web tuples wi and wj of web tables Wi and Wj are coupling-compatible locally on the node pair (nc(wi), nc(wj)) based on some keyword set Kc if and only if the following conditions are true: nc(wi) ∈ Nwi, nc(wj) ∈ Nwj, Kc ⊆ K(nc(wi), wi, Wi) and Kc ⊆ K(nc(wj), wj, Wj).
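Definition 1 amounts to a keyword-set containment test. A minimal sketch follows; the keywords_of helper, which plays the role of K(n, w, W), is an assumed function rather than part of the model.

```python
def coupling_compatible(node_i, node_j, coupling_keywords, keywords_of):
    """Definition 1 as a predicate: the tuples are coupling-compatible locally on
    (node_i, node_j) with respect to keyword set Kc iff Kc is contained in the
    keyword sets of both documents."""
    kc = set(coupling_keywords)
    return kc <= keywords_of(node_i) and kc <= keywords_of(node_j)
```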
The new web tuple w derived from the coupling of wi and wj is defined as: Nw = Nwi ∪ Nwj, Lw = Lwi ∪ Lwj, and Vw = Vwi ∪ Vwj. We express local web coupling between two web tables as follows:

W = Wi ⊗({⟨node pair⟩, ⟨keyword(s)⟩}) Wj

where Wi and Wj are the two web tables participating in the coupling operation and W is the coupled web table created by the coupling operation, satisfying schema M = ⟨Xn, Xℓ, C, P⟩. In this case, ⟨node pair⟩ specifies a pair of coupling node variables from Wi and Wj, and ⟨keyword(s)⟩ specifies a list of keyword(s) on which the similarity between the coupling node variable pair is evaluated. Note that in order to couple the two web tables, the keyword(s) should be present in at least one instance of the coupling node variable pair. Furthermore, there may be more than one pair of coupling node variables on which local web coupling can be performed. Local web coupling is a combination of two web operations: a web cartesian product followed by a web select based on some selection condition on the coupling nodes. Like its relational counterpart, a web cartesian product (denoted by ×) is a binary operation that combines two web tables by concatenating a web tuple of one web table with a web tuple of the other. If Wi and Wj have n and m web tuples respectively, then the resulting web cartesian product has n × m web tuples. The schema of the resultant web table W′ is given as M′ = ⟨Xn′, Xℓ′, C′, P′⟩ where Xn′ = Xi,n ⊎ Xj,n, Xℓ′ = Xi,ℓ ⊎ Xj,ℓ, C′ = Ci ⊎ Cj and P′ = Pi ⊎ Pj. The symbol ⊎ refers to the disambiguation [5,15] of node and link variables. Let us now illustrate web cartesian product with an example.

Example 4. Consider the web tables Symptoms and Drug list in Figs. 3 and 4. The web cartesian product of these two web tables is shown in Fig. 6. Due to space limitations, we only show a small portion of the resultant web table. The schema of the resultant web table is M′ = ⟨Xn′, Xℓ′, C′, P′⟩ where Xn′ = Xi,n ⊎ Xj,n = {x, y, z, w, a, b, c, d}, Xℓ′ = Xi,ℓ ⊎ Xj,ℓ = {t, e, f, -}, C′ = Ci ⊎ Cj ≡ k1′ ∧ k2′ ∧ k3′ ∧ k4′ ∧ k5′ ∧ k6′ such that k1′ = x⟨-⟩y, k2′ = y⟨e⟩z, k3′ = y⟨f⟩w, k4′ = a⟨-⟩b, k5′ = b⟨-⟩c, k6′ = c⟨t⟩d, and P′ = Pi ⊎ Pj ≡ p1′ ∧ p2′ ∧ p3′ ∧ p4′ ∧ p5′ ∧ p6′ ∧ p7′ ∧ p8′ ∧ p9′ ∧ p10′ ∧ p11′ such that p1′(x) ≡ [x.url EQUALS "http://www.virtualdisease.com/"], p2′(y) ≡ [y.title CONTAINS "issues"], p3′(e) ≡ [e.label CONTAINS "symptoms"], p4′(z) ≡ [z.title CONTAINS "symptoms"], p5′(f) ≡ [f.label CONTAINS "treatments"], p6′(w) ≡ [w.title CONTAINS "treatments"], p7′(a) ≡ [a.url EQUALS "http://www.virtualdrug.com/"], p8′(b) ≡ [b.title CONTAINS "Drug List"], p9′(c) ≡ [c.title CONTAINS "Issues"], p10′(d) ≡ [d.title CONTAINS "side effects"], p11′(t) ≡ [t.label CONTAINS "side effects"]. A web select operation is performed after the web cartesian product to filter out web tuples where the specified nodes cannot be related based on the keyword(s) conditions. These conditions impose additional constraints on the node variables participating in local web coupling. We denote this sequence of operations as local web coupling, and we can replace the two operations
W′ = Wi × Wj
W = σ(⟨node pair⟩, ⟨keyword condition(s)⟩)(W′)

with W = Wi ⊗({⟨node pair⟩, ⟨keyword(s)⟩}) Wj. The symbol σ denotes web selection. The result of a local web coupling operation is a web table having one web tuple for each combination of web tuples (one from Wi and one from Wj) whenever there exist coupling nodes. Let us illustrate web coupling with an example.

Example 5. Consider the web tables Symptoms and Drug list as depicted in Examples 2 and 3. Suppose Bill wishes to find the symptoms and treatment details of "Cancer" and "AIDS" and the list of drugs with their side effects on these diseases. The coupled web table is shown in Fig. 5. Note that the third and fourth web tuples in Fig. 6 are excluded from the coupled web table since they do not satisfy the keyword conditions. The schema of the coupled web table is M = ⟨Xn, Xℓ, C, P⟩ where Xn = Xn′, Xℓ = Xℓ′, C = C′ and P = P′. The construction details of the coupled schema and the coupled web table will be explained in Sect. 4.3.
4.2 Terminology
We introduce some terms we shall be using to explain local web coupling in this paper.
• Coupling nodes: Two web tuples wi and wj of web tables Wi and Wj respectively can be coupled if there exists at least one pair of nodes nc(wi) and nc(wj) in wi and wj which can be coupled with each other based on similar information content. We refer to these nodes as coupling nodes. We express the coupling nodes of wi and wj as coupling pairs since they cannot exist as a single node. Formally, (nc(wi), nc(wj)) is a coupling pair where node nc(wi) is coupled with nc(wj) of wj. The attributes of nc(wi) and nc(wj) are called coupling attributes. For example, the coupling nodes of the first web tuples in Figs. 3 and 4 are y0 and b0 respectively. The coupling pair for these nodes is (y0, b0). The coupling attributes of y0 and b0 are text, title, etc.
• Coupling-activating links: All the incoming links to the coupling nodes nc(wi) and nc(wj) are called coupling-activating links. Formally, ℓnc(wi) is the coupling-activating link of the coupling node nc(wi). For example, the link g0 in Fig. 3 is the coupling-activating link of node y0.
• Coupling keywords: The keyword condition(s) specified by the user, based on which coupling between node variables is performed, are called coupling keywords.
4.3 Web Table Creation
We now discuss the process of deriving the coupled web table from two input web tables. Given two web tables, a set of coupling keyword(s), and pair(s) of
node variables, we first construct the schema of the coupled web table and then proceed to create the table itself. Let web tables Wi and Wj with schemas Mi and Mj participate in the local web coupling process. Let the coupled web table be W with schema M = ⟨Xn, Xℓ, C, P⟩.

Construction of the coupled schema. We now determine the four components of M in the following steps:
Step 1: Determine the node set. Node variables in Xi,n and Xj,n can either be nominally distinct from one another or there may exist at least one pair of node variables from Xi,n and Xj,n which are identical to one another. If the node variables are not nominally distinct, we disambiguate one of the identical node variable(s). The node set of the coupled schema is given as: Xn = Xi,n ⊎ Xj,n.
Step 2: Determine the link set. Similarly, we disambiguate the identical link variables in Xi,ℓ and Xj,ℓ if necessary, and the link set of the coupled schema is given as: Xℓ = Xi,ℓ ⊎ Xj,ℓ.
Step 3: Determine the connectivity set. If the node and link variables are not nominally distinct, we replace any one of the identical variables in Ci or Cj with the disambiguated value. The connectivity set of the coupled schema is given as: C = Ci ⊎ Cj.
Step 4: Determine the predicate set. Our approach to determine the predicate set of the coupled schema is similar to the above. If the node and link variables are not nominally distinct, we replace any one of the identical node variables in Pi or Pj with the disambiguated value. The predicate set of the coupled schema is given as: P = Pi ⊎ Pj.

Construction of the coupled web table. The coupled web table is created by integrating the two input web tables. We describe the steps below:
Step 1: Given two web tables, we first perform a web cartesian product on the two web tables.
Step 2: For each web tuple in the web table created by the web cartesian product, the specified nodes are inspected to determine whether the web tuple is coupling-compatible locally (based on the coupling keyword(s) provided by the user). In order to be coupling-compatible, the specified pair of nodes in the web tuple must satisfy some coupling-compatibility conditions. We determine these conditions in the next section. We inspect each web tuple in the web table created by the web cartesian product to determine if the specified pair(s) of nodes satisfy any one of the coupling-compatibility conditions.
Step 3: If a pair of nodes satisfies none of the conditions, the corresponding web tuple is rejected. If the nodes satisfy at least one of the above conditions, the web tuple is stored in a separate web table (the coupled web table).
Table 1. Node attributes of y and b.
Node  URL                                             Title             Text
y0    http://www.virtualdisease.com/cancerindex.html  Cancer Issues     Cancer
b0    http://www.virtualdrug.com/cancer.html          Cancer Drug List  Cancer
Table 2. Link attributes of g and s.
Link  From Node  To Node  Label   Link Type
g0    x0         y0       Cancer  local
s0    a0         b0       Cancer  local
Step 4: Repeat steps 2 and 3 for other web tuples in the resultant web table created by web cartesian product.
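Putting the four steps together, local web coupling can be sketched as a filtered web cartesian product. The dictionary representation of web tuples and the three-argument compatibility test are simplifying assumptions (in particular, variable disambiguation is ignored).

```python
from itertools import product

def local_web_coupling(table_i, table_j, node_pair, keywords, compatible):
    """Steps 1-4 above: a web cartesian product followed by a web select.

    `table_i`/`table_j` are lists of web tuples (dicts from node variable to
    document), `node_pair` names the coupling node variables, and `compatible`
    is a coupling-compatibility test such as the one sketched after Definition 1."""
    var_i, var_j = node_pair
    coupled = []
    for wi, wj in product(table_i, table_j):            # step 1: cartesian product
        if compatible(wi[var_i], wj[var_j], keywords):  # steps 2-3: keep or reject
            coupled.append({**wi, **wj})                # concatenated web tuple
    return coupled                                      # step 4: all tuples processed
```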
4.4 Coupling-Compatibility Conditions
Local web coupling-compatibility conditions may be based on node attributes of the instances of the specified node variables and/or attributes of the instances of incoming link variables of the specified node variables (coupling-activating links). Let us define some terms to facilitate our exposition. Given a web tuple w of web table W with schema M = ⟨Xn, Xℓ, C, P⟩, let n(w) be a node of w and ℓn(w) be the incoming link to node n(w) such that:
• attr(n(w)) ∈ {url, text, title, format, date, size} is a node attribute;
• attr(ℓn(w)) ∈ {source url, target url, label, link type} is a link attribute; and
• val(n(w)) and val(ℓn(w)) are the values of attr(n(w)) and attr(ℓn(w)) respectively.
For example, consider Tables 1 and 2, which depict some of the attributes of node variables b, y and link variables g, s. For node b0, attr(b0) = title and val(b0) = Cancer Drug List. For link s0 (the incoming link to node b0), attr(s0) = label and val(s0) = Cancer. Let nci and ncj be node variables in schemas Mi and Mj of web tables Wi and Wj respectively participating in the local web coupling, and let Kc be the coupling keywords. Let wi and wj be two web tuples of Wi and Wj such that nc(wi) and nc(wj) are instances of nci and ncj respectively. Moreover, let the web cartesian product of Wi and Wj be W′ and let w′ be a web tuple in W′ which is the cartesian product of wi and wj. Web documents represented by nodes nc(wi) and nc(wj) can be coupling nodes (that is, web tuples wi and wj are coupling-compatible) if they satisfy at least one of the coupling-compatibility conditions given below:
1. The title of the web documents is equal to Kc or contains the coupling keyword Kc, i.e., attr(nc(wi)) = attr(nc(wj)) = title, and val(nc(wi)) and val(nc(wj)) are equal to Kc or contain Kc.
2. The text of the web documents contains Kc, i.e., attr(nc(wi)) = attr(nc(wj)) = text, and val(nc(wi)) and val(nc(wj)) contain Kc.
3. The coupling keyword Kc is contained in the text of one web document and in the title of the other document, i.e., attr(nc(wi)) = text, attr(nc(wj)) = title, val(nc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
4. The coupling keyword is contained in the file name of the URL of the web documents, i.e., attr(nc(wi)) = attr(nc(wj)) = url.filename, and val(nc(wi)) and val(nc(wj)) contain Kc.
5. The coupling keyword is contained in the text of one web document and in the file name of the URL of the other document, i.e., attr(nc(wi)) = text, attr(nc(wj)) = url.filename, and val(nc(wi)) and val(nc(wj)) contain Kc.
6. The coupling keyword is contained in the file name of the URL of one document and in the title of the other document, i.e., attr(nc(wi)) = url.filename, attr(nc(wj)) = title, val(nc(wi)) contains Kc and val(nc(wj)) contains or is equal to Kc.
7. The labels of the incoming links ℓnc(wi) and ℓnc(wj) to the web documents contain the coupling keyword Kc, i.e., attr(ℓnc(wi)) = attr(ℓnc(wj)) = label, and val(ℓnc(wi)) and val(ℓnc(wj)) are equal to or contain Kc.
8. The label of the incoming link ℓnc(wi) and the title of node nc(wj) contain or are equal to Kc, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = title, and val(ℓnc(wi)) and val(nc(wj)) are equal to or contain Kc.
9. The label of the incoming link to one document contains or is equal to Kc and the text of the other web document contains the coupling keyword, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = text, val(ℓnc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
10. The label of the incoming link contains or is equal to Kc and the file name of the URL of the other web document contains Kc, i.e., attr(ℓnc(wi)) = label, attr(nc(wj)) = url.filename, val(ℓnc(wi)) is equal to or contains Kc and val(nc(wj)) contains Kc.
5 Related Work
We would like to briefly survey the web data retrieval and manipulation systems proposed so far, and compare them with web information coupling. There has been considerable work on data models and query languages for the World Wide Web [9], [11], [12], [13]. To the best of our knowledge, we are not aware of any work which deals with web information coupling in web databases. Mendelzon, Mihaila and Milo [13] proposed the WebSQL query language based on a formal calculus for querying the WWW. The result of a WebSQL query is a set of web tuples flattened immediately into linear tuples. This limits the expressiveness of queries to some extent, as complex queries involving operators such as local web coupling are not possible. Konopnicki and Shmueli [11] proposed a high-level querying system called W3QS for the WWW whereby users may specify content and structure queries on the WWW and maintain the results of queries as database views of the WWW. In W3QL, queries are always made to the WWW.
Past query results are not used for the evaluation of future queries. This limits the use of web operators like local web coupling to derive additional information from past queries. Fiebig, Weiss and Moerkotte [9] extended relational algebra to the World Wide Web by augmenting the algebra with new domains (data types) and functions that apply to the domains. The extended model is known as RAW (Relational Algebra for the Web). Only two low-level operators on relations, scan and index-scan, have been proposed, to expand a URL address attribute in a relation and to rank results returned by web search engine(s) respectively. RAW makes minor improvements on the existing relational model to accommodate and manipulate web data, and there is no notion of a coupling operation similar to the one in WICM. Inspired by concepts in declarative logic, Lakshmanan, Sadri and Subramanian [12] designed WebLog to be a language for querying and restructuring web information. But there is no formal definition of web operations such as web coupling. Other proposals, namely Lorel [1] and UnQL [8], aim at querying heterogeneous and semistructured information. These languages adopt a lightweight data model to represent data, based on labeled graphs, and concentrate on the development of powerful query languages for these structures. Moreover, in both proposals there is no notion of a web coupling operation similar to the one in WICM. Website restructuring systems like Araneus [4] and Strudel [10] exploit the knowledge of a website's structure to define alternative views over its content. Neither of these models focuses on web information coupling similar to the one in WICM. The WebOQL system [3] supports a general class of data restructuring operations in the context of the Web. It synthesizes ideas from query languages for the Web, semistructured data, and web site restructuring. The data model proposed in WebOQL is based on ordered trees, where a web is a graph of trees. This model enables us to navigate, query and restructure graphs of trees. In this system, the concatenate operator allows us to juxtapose two trees, which can be viewed as a manipulation of trees. But there is no notion of a web coupling operation similar to ours.
6 Summary and Future Work
In this paper, we have motivated the need for coupling useful information residing in the WWW and in multiple web tables from a web database. We have introduced the notion of global web coupling and local web coupling that enable us to couple useful related information from the WWW and associate related information residing in different web tables by combining web tuples whenever they are coupling-compatible. We have shown how to construct the coupled web table globally and locally from the WWW and two input web tables respectively. Presently, we have implemented the global web coupling operator and have interfaced it with other web operators. The current global web coupling operator can be used efficiently for simple web queries. We are in the process of implementing the local web coupling operator and finding ways to optimize web coupling.
References
1. S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener. The Lorel Query Language for Semistructured Data. Journal of Digital Libraries, 1(1):68-88, April 1997.
2. S. Abiteboul, V. Vianu. Queries and Computation on the Web. Proceedings of the 6th International Conference on Database Theory, Greece, 1997.
3. G. Arocena, A. Mendelzon. WebOQL: Restructuring Documents, Databases and Webs. Proceedings of ICDE 98, Orlando, Florida, February 1998.
4. P. Atzeni, G. Mecca, P. Merialdo. Semistructured and Structured Data in the Web: Going Back and Forth. Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, May 1997.
5. S. S. Bhowmick, W.-K. Ng, E.-P. Lim. Join Processing in Web Databases. Proceedings of the 9th International Conference on Database and Expert Systems Applications (DEXA'98), Vienna, Austria, August 24-28, 1998.
6. S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim. Web Bags: Are They Useful in A Web Warehouse? Submitted for publication.
7. S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim. Semi Web Join in WICS. Submitted for publication.
8. P. Buneman, S. Davidson, G. Hillebrand, D. Suciu. A Query Language and Optimization Techniques for Unstructured Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Canada, June 1996.
9. T. Fiebig, J. Weiss, G. Moerkotte. RAW: A Relational Algebra for the Web. Workshop on Management of Semistructured Data (PODS/SIGMOD'97), Tucson, Arizona, May 16, 1997.
10. M. Fernandez, D. Florescu, A. Levy, D. Suciu. A Query Language and Processor for a Web-Site Management System. Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, May 1997.
11. D. Konopnicki, O. Shmueli. W3QS: A Query System for the World Wide Web. Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.
12. L. V. S. Lakshmanan, F. Sadri, I. N. Subramanian. A Declarative Language for Querying and Restructuring the Web. Proceedings of the Sixth International Workshop on Research Issues in Data Engineering, February 1996.
13. A. O. Mendelzon, G. A. Mihaila, T. Milo. Querying the World Wide Web. Proceedings of the International Conference on Parallel and Distributed Information Systems (PDIS'96), Miami, Florida, 1996.
14. W.-K. Ng, E.-P. Lim, S. S. Bhowmick, S. K. Madria. An Overview of A Web Warehouse. Submitted for publication.
15. W.-K. Ng, E.-P. Lim, C.-T. Huang, S. Bhowmick, F.-Q. Qin. Web Warehousing: An Algebra for Web Information. Proceedings of the IEEE International Conference on Advances in Digital Libraries (ADL'98), Santa Barbara, California, April 22-24, 1998.
Structure-Based Queries over the World Wide Web
Tao Guan, Miao Liu, and Lawrence V. Saxton
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
{guan,lium,saxton}@cs.uregina.ca
Abstract. With the increasing importance of the World Wide Web as an information repository, how to locate documents of interest becomes more and more significant. The current practice is to send keywords to search engines. However, these search engines lack the capability to take the structure of the Web into consideration. We thus present a novel query language, NetQL, and its implementation, for accessing the World Wide Web. Rather than working on global full-text search, NetQL is designed for local structure-based queries. It not only exploits the topology of web pages given by hyperlinks, but also supports queries involving information inside pages. A novel approach to extract information from web pages is presented. In addition, the methods to control the complexity of query processing are also addressed in this paper.
1 Introduction
The World Wide Web provides a huge information repository based on the Internet. It is a big problem to find documents of interest in this system. The current practice mostly depends on sending a keyword or a combination of keywords to search engines such as AltaVista and Yahoo. Although this is successful in some cases, there are still limitations in this approach. For example, (1) the search is limited to page content, which is viewed as unstructured text, so that the inner structural information is ignored; (2) the accuracy of results is low and garbage exists in the output. On the other hand, structure-based query languages [3,11,13,12,16] have been proposed to exploit the link structures between the pages. Most of these works are based on the metaphor of the Web as a database. However, the nature of the Web is fundamentally different from traditional databases. The main characteristics are its global nature and the loosely textual, semi-structured information it holds. Although these languages can solve the problems of search engines to some extent, they usually suffer from the following drawbacks: (1) They focus on the hyperlinks, so that page contents are simplified as atomic objects (i.e., strings) or relations with specific attributes (i.e., URL, title, text, type, etc.). The inner structure, which is valuable for many queries, is ignored.
This limits the expressive power of the languages. For example, the following queries cannot be expressed:
– List the name and e-mail address of all professors at the University of Regina;
– Find hotels in Hong Kong whose price is less than US$100.
The problem with these two queries is that the information on price or e-mail address is, in most cases, kept inside a web page. If languages view a page as an atomic object and only support operations like keyword matching, it is hard to exploit the valuable data inside a page. The main difficulty may be that it is too tough to obtain this information from a Web page, since pages usually are irregular. Therefore, how to extract the structural information is a key point. Although the new XML technology provides self-describing structures, valuable information hidden inside semi-structured textual lines is still useful for users. The technique of mining this kind of data is thus important. It has actually been studied in [1,2,8,17]. Here we present a novel approach to deal with it.
(2) How to control the complexity of query processing is not addressed. Since structure-based queries are evaluated on original, distributed data, the communication costs of accessing remote data may be huge. Therefore, a blind search is inefficient. Methods to control the run time should be investigated.
Our contributions. This paper presents an intelligent query language over the WWW, called NetQL. Our purpose is not to give birth to yet another powerful language. Instead, we focus on the problems mentioned above. NetQL follows the approach of structure-based queries; however, it attempts to overcome the problems unsolved by other languages. First, a novel approach to mine information from web pages is presented so that queries which involve information or structures inside pages can be issued. Secondly, various methods are provided to control the complexity of query processing. Rather than representing a web page as a labeled graph or as relations as in current practice [3,11,13,12,16], our mining algorithm extracts the desired information from irregular pages directly by keywords or patterns. We assume: (1) the important information is always highlighted by keywords or is meaningfully semi-structured, since most web data is summarized and condensed (except online news or newspapers); (2) some common patterns exist in English, e.g. the word after “Dr.” or “Mr.” should be a name; (3) similar structures or patterns occur in the web pages of an institute, since most public web pages are written by the same professional webmasters and thus a similar style (or even simple copies) is employed. Therefore, a set of heuristic rules or patterns can be used to identify this information. Our experiments show that this novel approach of extracting information from the unstructured web is more effective than conventional ones, which depend on syntax (i.e. HTML tags) or declarative languages [2,8,17].
In addition, the complexity of query processing is controlled in NetQL at two levels. Firstly, users are given various choices to control run time. For example, they can specify a more exact path if they have partial knowledge of the structure of the searched site, or simply limit the evaluation of queries to local data or a
fixed number of returned results. Secondly, an effective optimizing technique, based on semantic similarity, is developed to guide the search in the most promising direction. The remainder of this paper is organized as follows. In Sect. 2, we briefly introduce our query language, NetQL. We then discuss how to mine information from web pages in Sect. 3. Section 4 presents methods to control the complexity of queries. Experimental results are shown in Sect. 5. Finally, conclusions and references are presented.
2 The Language NetQL
We briefly introduce our query language NetQL in this section. A web site is modeled as a rooted, edge-labeled graph as in semistructured databases. Each node represents a page; each page has a unique URL and can be viewed as either a semistructured textual string or a set of textual lines, each of which consists of a few fields (one or more words; the definition is given later). Figure 1 is an example modeling a portion of the web site of the CS department at the University of Regina.
Fig. 1. A Sample web site and page (the site is rooted at http://www.cs.uregina.ca, with hyperlink labels such as History, Information, People, Staff, Faculty, Graduate Student, Research, Class Files and Publication, and a sample page with sections on the university's history, its programs and its research)
The syntax of NetQL is similar to the standard SQL SELECT statement, but the data are web documents instead of relational databases. The general grammar is as follows:

select variables
from startingpage→path
contain keywords
match patterns
where conditions
restricted specification
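To make the clause structure concrete, the following Python sketch (our own illustration, not part of NetQL's published implementation) shows one way a parsed NetQL query could be represented; the field names simply mirror the clause names of the grammar above.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NetQLQuery:
    # A parsed NetQL query; each field mirrors one clause of the grammar.
    select: List[str]                                   # keyword and pattern variables
    start_page: Optional[str] = None                    # starting page of the from clause
    path: Optional[str] = None                          # path expression, e.g. "people.faculty" or "*"
    contain: List[str] = field(default_factory=list)    # keywords the page must contain
    match: List[str] = field(default_factory=list)      # string or structure patterns
    where: Optional[str] = None                         # condition over the extracted variables
    restricted: Optional[str] = None                    # complexity-control specification

# The query of Example 2.1 below, expressed in this representation:
example_2_1 = NetQLQuery(
    select=["Name", "E-mail"],
    start_page="http://www.uregina.ca/",
    path="*",
    contain=["professor"],
    match=["[Dr. Name]"],
)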
The select clause contains a list of variables which indicate what information is finally extracted from the chosen pages. There are two kinds of variables in NetQL. One is called the keyword variable, whose value is mined from pages directly by a set of heuristic rules. The other is the pattern variable, which must appear in the string or structure patterns of the match clause; its value is obtained when the pattern is matched against a portion of the content of a page (see details in the next section). The from clause specifies where web pages are reached. If absent, the default case is all pages located at the web server where the user sends the query. Otherwise, the clause specifies a starting page and a further path expression. (The latter is not mandatory; it is mainly used to improve the performance of queries.) A path expression is usually a set of predicates starting from the specified page, following certain hyperlinks satisfying the predicates and arriving at other pages. While “→” is used to separate the starting page from the following hyperlinks, “.” represents a hyperlink from one page to another. For example, http://www.cs.uregina.ca/→people.faculty is a path from the CS page to the faculty page through the people link. In addition, the wildcard “∗” represents an arbitrary length of path and “-” means the path only goes one level deeper. When the pages specified in the from clause are reached, NetQL first checks if they contain the keywords given in the contain clause. If not, the pages are discarded. Otherwise, the desired information is mined by the keywords in the select clause or the pages are matched against the patterns in the match clause. The obtained information is then used to evaluate the conditions in the where clause. If they are true, the values assigned to the variables in the select clause are returned to users. Finally, the restricted clause is used to control the complexity of query processing. We discuss this further in Sect. 4.
Example 2.1. Find the name and e-mail address of all professors at the University of Regina.

select Name, E-mail
from http://www.uregina.ca/→∗
contain professor
match [Dr. Name]

In this case, Name and E-mail are variables used to indicate what we are looking for. E-mail is a keyword variable whose value is mined directly from the web pages specified in the from clause (all pages containing the word professor at the site http://www.uregina.ca). In contrast, Name is a variable occurring in the pattern [Dr. Name]. The query first finds the pages containing the keyword professor at the site http://www.uregina.ca, then locates the constant string “Dr.” in the returned pages and assigns the first noun phrase after it to the variable Name. For example, if the string “Dr. Jack Boan is ....” is found, then “Jack Boan” is assigned to Name. When more than one possible value is found for a variable, the conflict is resolved by the rules given in Sect. 3. The final result is shown in Fig. 2.
Fig. 2. The Final Result for Example 2.1
Of course, the result does not cover all the desired information and errors also appear (e.g. the e-mail address for Dr. Bryan Austin cannot be [email protected]). However, the results are much more precise than those of search engines (84460 links were returned when the keywords professor, university and regina were sent to Yahoo).
Example 2.2. Find hotels in Hong Kong whose price is less than US$100.
This example seems difficult since we do not know where we can find information on hotels in Hong Kong. However, we can solve the problem by sending the words hotel and Hong Kong to a search engine, i.e. Yahoo, and getting a page as in Fig. 3. There are 44 hotels returned and we may browse them manually to find what we need. However, if the number of hotels is large, it will be difficult to search manually. Fortunately, the following NetQL query can deal with it when we know of a homepage, as in Fig. 3, at http://www.asia-hotels.com/HongKong.asp.

select X, Z
from http://www.asia-hotels.com/HongKong.asp
match {X, Y, Z<100}

This query matches the structure pattern {X, Y, Z<100} against the content of the page whose URL is http://www.asia-hotels.com/HongKong.asp. It treats a web page as a set of textual lines, each of which consists of a number of fields. (The fields are separated by delimiters, which are defined as two or more spaces or any HTML tag.)
Fig. 3. The Sample Page for Example 2.2
For example, the following line has three fields:

Anne Black Guest House (YWCA)   Special offer   US$ 43 to 101

The structure pattern {X, Y, Z<100} is matched against a textual line (or a row in a table) which has exactly three fields (or columns) and whose third field contains a number that is less than 100. Therefore, for the above line, we have X = “Anne Black Guest House (YWCA)”, Y = “Special offer” and Z = “US$ 43 to 101”. Since the value of the variable Z should be a number, the heuristic rule is to choose the first number in the string, that is, 43 is assigned to Z. Therefore, the line matches the pattern and the values of the variables X and Z are output.
The above examples highlight the main novelty of NetQL; that is, keywords and patterns are used to extract information from web pages. We describe this further in the next section. Another feature, complexity control and the restricted clause, is discussed in Sect. 4.
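As a rough illustration of how a structure pattern such as {X, Y, Z<100} could be evaluated over a textual line, consider the following Python sketch; the helper names are ours, and the splitting rule (runs of two or more spaces) is a simplification of the delimiter definition given above.

import re

def split_fields(line):
    # Split a textual line into fields on runs of two or more spaces.
    return [f for f in re.split(r"\s{2,}", line.strip()) if f]

def first_number(text):
    # The value of a number field is the first number occurring in the string.
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def match_xyz_lt(line, limit):
    # Match {X, Y, Z<limit}: exactly three fields, the third containing a number below limit.
    fields = split_fields(line)
    if len(fields) != 3:
        return None
    z = first_number(fields[2])
    if z is None or z >= limit:
        return None
    return {"X": fields[0], "Y": fields[1], "Z": fields[2]}

line = "Anne Black Guest House (YWCA)   Special offer   US$ 43 to 101"
print(match_xyz_lt(line, 100))   # binds X, Y and Z; the number extracted from Z is 43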
3 Page Mining
We discuss how to mine information from web pages in this section. Our idea is similar to the human approach in locating desired information. When a person is looking for something, he/she always follows two approaches: (1) use a keyword to recognize desired information. For example, a keyword Email: or Tel: indicates
that the string after it is an e-mail address or telephone number; (2) use semantic knowledge or a pattern to recognize objects. For example, most people know that the string “3400 Rae St. Regina, SK, Canada” is an address although there is no keyword address or contact before it. Since semantic knowledge is hard for a computer to acquire, NetQL currently only supports keyword-based mining and string or structure pattern-based mining. We discuss them in the following sections.
3.1 Keyword-Based Mining
Keyword-based mining is used to extract the values associated with keywords, e.g. E-mail, Publication, or Research interests. When a keyword is given, the system first looks for the keyword in pages. If it is located, the following heuristic rules are applied to mine the corresponding value automatically:
– If the word is in the label of a hyperlink, then the value is the content of the page pointed to by the link. For example, the information on publications must be in the pointed-to page if there is a hyperlink label containing publications.
– If the word is a title (an HTML heading), then the value is the string between this title and the next title. If this is the last title in the HTML file, then the value ends when (1) a blank line appears or (2) a separating HTML tag appears. For example, in Fig. 1, the value for the keyword History is the string after it until the next title Programs.
– If the word is an item of a list, then the value is the string after it until the next list item or the end of the list.
– If the word is a field in a table (a table cell), then the value of a field in the first column (except the table head) of a two-column table is the field on its right. Otherwise, it is the field under it. For example, in the following table on the left, the value for Single room is 80. But for the same field in the table on the right, it is 50.
Price          $
Single room    80
Double room    140
Extra bed      30

Single room    Double room    Extra bed
50             90             20

Fig. 4. Two Tables In the Web
– If the word is at the beginning of a textual line which itself forms an independent paragraph (e.g. HTML tags separate it from the preceding and
following text), and there are HTML tags or more than two spaces separating it from the following words, then the value is the string after it until the end of the line. For example, in the following line,

Office: CW308.2

the value for Office is CW308.2, since the keyword is separated from the value by HTML tags in the page. If the keyword denotes a number (in a conditional expression, as in Example 2.2), then the value is the first number occurring in the string obtained by the above rules.
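The rule for keywords at the beginning of a line can be sketched in a few lines of Python; this is only an approximation of the heuristic described above (HTML tags are treated as separators, and the helper name is ours).

import re

def mine_keyword_value(lines, keyword):
    # If the keyword starts a line and is separated from the rest by HTML tags
    # or two or more spaces, the value is the remainder of the line.
    for line in lines:
        text = re.sub(r"<[^>]+>", "  ", line)     # treat HTML tags as separators
        m = re.match(rf"\s*{re.escape(keyword)}\s{{2,}}(.+)", text)
        if m:
            return m.group(1).strip()
    return None

page = ["<b>Office:</b> <i>CW308.2</i>", "Phone: 585-0000"]
print(mine_keyword_value(page, "Office:"))   # -> CW308.2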
3.2 String Pattern Mining
A string pattern contains a number of constant words and variables (variables are those indicated in the select or where clauses) and is delimited by a pair of brackets. When it is given, the system first locates strings in pages that match the constant words of the pattern. If successful, the noun phrase or number corresponding to a variable is assigned to that variable. For example, the pattern [Dr. Name] can be matched against a string starting with “Dr.”, and the first noun phrase after “Dr.” is then assigned to the variable Name. For the string “Dr. Jack Boan is ...”, “Jack Boan” is assigned to Name. The reason why we focus on noun phrases and numbers is that most information involved in a query is represented by noun phrases or numbers (verbs usually indicate an action or state). The definition of noun phrases is as follows:

NP  → NP2 | Det NP2 | NP Conj NP
NP2 → Noun | Noun NP2 | Adj NP2 | NP2 PP
PP  → Prep NP

where NP denotes a noun phrase, Det denotes a determiner, Conj denotes a conjunction, Adj denotes an adjective and Prep denotes a preposition.
Two or more patterns may be linked using boolean operators, e.g. [Mr. Name] or [Ms Name]. The word matched against any one of the patterns can be assigned to the variable Name. Of course, it is possible that more than one word matches a variable. If that happens, we count the number of matches in the page and choose the one with the highest frequency, or the first if two or more words have the same maximum frequency count. In addition, wildcards can be used in string patterns, i.e. “∗” and “-”, where “-” denotes one word and “∗” represents any number of words. For example, the pattern [Dr. Name received Degree from ∗ in Year] extracts information for the variables Name, Degree and Year, while the university is ignored.
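A string pattern such as [Dr. Name] can be approximated with a regular expression; the sketch below (ours, with a deliberately crude noun-phrase approximation of a run of capitalised words) also shows the frequency-based conflict resolution mentioned above.

import re
from collections import Counter

# Crude stand-in for the NP grammar: a noun phrase is a run of capitalised words.
NOUN_PHRASE = r"((?:[A-Z][a-z]+\s?)+)"

def mine_string_pattern(text, prefix):
    # Collect every noun phrase following the constant prefix (e.g. "Dr.") and
    # resolve conflicts by choosing the most frequent match.
    matches = [m.strip() for m in re.findall(re.escape(prefix) + r"\s+" + NOUN_PHRASE, text)]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]

page_text = "Dr. Jack Boan is a professor ... please contact Dr. Jack Boan by e-mail."
print(mine_string_pattern(page_text, "Dr."))   # -> Jack Boan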
3.3 Structure Pattern Mining
Structure pattern mining means that a structure pattern is given to match a textual line or a row of a table (defined by the tags <tr> and </tr>) in web pages. NetQL treats each textual line or table row as a set of fields, and thus the syntax of a structure pattern is
{f1, f2, ..., fn}

where fi denotes a field, which may be:
– a variable, e.g. X, Y or Z;
– a constant, which may contain wildcards as in string patterns, e.g. “Hong Kong”, 10, “University of ∗”;
– a simple expression of the form “variable operator constant”, where the operator may be <, =, >, ≤, or ≥, for example Z < 100;
– “-”, a field whose value can be ignored; or
– “∗”, any number of fields whose values can be ignored.
For example, the structure pattern {X, “Canada”, ∗, Z < 20} is matched against a textual line or a row of a table in which the second field is the string “Canada” and the last field is a number less than 20. The processing of a structure pattern finds the values of all fields of a textual line or row and then matches them against the pattern. For textual lines, fields are separated by delimiters, which are defined as two or more spaces or an HTML tag. The value of a string field is the textual string (without HTML tags) between two delimiters and the value of a number field is the first number between two delimiters. For example, consider the following two lines:

Order  Country  Gold  Silver  Bronze  Total
4      Canada   6     5       7       18
There are 6 fields in each line. The second line matches the pattern {X, “Canada”, ∗, Z < 20}; thus the value of X is 4 and the value of Z is 18. As for tables, the fields of each row are delimited by the tags <td> and </td> (each row lies between <tr> and </tr>). The value of a string field is the string with all tags (enclosed in <>) eliminated. Similarly, if it is a number field, we take the first number in the string as the value. For example, consider the row of the HTML table in Fig. 3 that describes the hotel of Example 2.2.
There are three fields. The value of the first field is “Anne Black Guest House” and the second is “Special offer” and the third is “US$43 to 101”. This example matches the pattern in Example 2.2.
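For table rows, the same idea applies after stripping the markup; a possible Python sketch is shown below (the sample markup is our own illustration in the spirit of the hotel page, not the original HTML).

import re

def row_fields(row_html):
    # Take the text of each <td> cell with all tags removed, as described above.
    cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, flags=re.I | re.S)
    return [re.sub(r"<[^>]+>", "", c).strip() for c in cells]

row = ("<tr><td><a href='#'>Anne Black Guest House</a></td>"
       "<td>Special offer</td><td><b>US$ 43 to 101</b></td></tr>")
print(row_fields(row))   # ['Anne Black Guest House', 'Special offer', 'US$ 43 to 101']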
We should note again that structure-based patterns cannot handle all structural information, e.g. complex tables. However, 2-dimensional relational tables, in which each row has the same number of fields, are the most popular in practice. Therefore, our method is feasible in many cases.
4 Complexity Control
Complexity is a big challenge for structure-based web queries, since they are evaluated on original, distributed data. NetQL deals with complexity at two levels. Firstly, users are given various methods to control complexity. Secondly, an effective optimization technique is provided to guide the search to the closest or most promising path so that the expected results can be obtained as soon as possible. In this section, we discuss these two levels in turn.
4.1 Users’ Role
If users have partial knowledge about the structure of the searched site, they can give more specific path information to reduce blind search. The more information users specify, the more efficient the query is. In Example 2.1, if they know that all professors are listed under the hyperlink faculty, then the path expression in the from clause can be updated to http://www.cs.uregina.ca/→faculty.∗. The query then only checks the pages which are under the hyperlink faculty and contain the word professor. The run time is thus reduced significantly, especially when the query is issued at a remote site. Partial knowledge is possible for users since: (1) they may have visited the site before; (2) web sites on a similar topic have similar structures, e.g. the structure of one university's site can be derived if other universities' sites were visited before. Of course, if users know nothing about the site, they can limit queries in the following aspects:
– Restrict the search to local data. The search only follows the links inside the web server where the starting page is located.
– Restrict the search to a certain number of returned results. If the specified number of results is exceeded, the search stops.
– Restrict the search to a certain amount of time. When the time reaches its limit, the search stops and returns the results found.
For example,

select Publications
from http://www.cs.uregina.ca/→faculty.∗
contain professor
where Publications CONTAINS database
restricted LOCAL and RESULTS < 10

This approach trades a portion of the results for fast response. It is useful in cases where inaccurate and incomplete results are acceptable to users. However, if some users hope for an exhaustive search, internal optimization will be applied in this situation.
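The three restrictions can be honoured by a straightforward bounded traversal; the following Python sketch (our own, with page_matches and out_links standing in for page fetching and parsing) illustrates the idea.

import time
from collections import deque
from urllib.parse import urlparse

def restricted_search(start_url, page_matches, out_links,
                      local_only=True, max_results=10, max_seconds=60.0):
    # Breadth-first traversal that honours LOCAL, a bound on returned RESULTS
    # and a TIME limit; page_matches(url) and out_links(url) are assumed helpers.
    start_host = urlparse(start_url).netloc
    deadline = time.monotonic() + max_seconds
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_results and time.monotonic() < deadline:
        url = queue.popleft()
        if page_matches(url):
            results.append(url)
        for link in out_links(url):
            if link in seen:
                continue
            if local_only and urlparse(link).netloc != start_host:
                continue
            seen.add(link)
            queue.append(link)
    return results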
4.2 Optimization
Our idea is inspired by heuristic search in problem-solving programs in AI. Rather than pruning the search at certain sub-graphs, our method attempts to guide the search to the closest or most promising path so that the expected results can be obtained as soon as possible. Queries restricted by time or by the number of results benefit directly. Our algorithm uses the semantic information of hyperlink labels for heuristic search. The semantic similarity between the current set of links and the goal can help the optimizer to decide which link is preferred for the next step. Actually, humans navigate this way when they browse web pages manually. For example, suppose the starting point has the following structure:
Fig. 5. A Portion of web site (the page http://www.cs.uregina.ca with hyperlinks labeled Information, People, Research and Class Files)
When we need to locate the pages containing professor, which hyperlink has the highest priority? Obviously, it should be People, since this label is more similar to professor than any other. The key point of this method is how to compute the semantic distances between words or noun phrases. This problem has been widely studied in the fields of Natural Language Processing (NLP) and Information Retrieval (IR). Various methods have been presented in [9,10,19]. In this paper, we follow the approach of word-word similarities based on WordNet presented in [9,19]. In WordNet, conceptual similarity is considered in terms of synset (a set of synonyms) similarity. The similarity between two synsets is approximated by the maximum information content of the super synsets in the hierarchy that subsume both synsets. The information content of a synset is quantified as the negative log likelihood, -log P(s), and in our case P(s) is computed simply as a relative frequency:

P(s) = Σ count(w) / N,  w ∈ synset(s),

where count(w) is the number of occurrences of the word w in the corpus (we use noun frequencies from the Brown Corpus of American English [5]) and N is the total number of words observed. Thus the similarity of two synsets can be expressed as

sim(s1, s2) = max over s of [IC(s)] = max over s of [-log P(s)],  s ∈ Sup(s1, s2),

where Sup(s1, s2) is the set of super synsets that subsume both s1 and s2. If there are no super synsets for s1 and s2, then the similarity is 0.
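A small worked example may help; the Python sketch below uses a toy taxonomy and toy word counts in place of WordNet synsets and the Brown Corpus frequencies (each toy synset contains a single word), and computes the similarity exactly as defined above.

import math

# Toy taxonomy (word -> parent) and corpus counts standing in for WordNet and Brown.
PARENT = {"professor": "educator", "educator": "person", "people": "person",
          "person": "entity", "research": "activity", "activity": "entity"}
COUNT = {"professor": 10, "educator": 40, "person": 4000, "people": 900,
         "research": 300, "activity": 2000, "entity": 10000}
N = sum(COUNT.values())

def supersets(s):
    # The synset itself plus all of its super synsets in the hierarchy.
    chain = [s]
    while s in PARENT:
        s = PARENT[s]
        chain.append(s)
    return chain

def information_content(s):
    # P(s) = sum of count(w) for w in synset(s), divided by N; here each toy
    # synset contains a single word, so P(s) = COUNT[s] / N.
    return -math.log(COUNT[s] / N)

def sim(s1, s2):
    # Maximum information content over the common super synsets.
    common = set(supersets(s1)) & set(supersets(s2))
    return max((information_content(s) for s in common), default=0.0)

print(sim("people", "professor") > sim("research", "professor"))   # True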
Assume that a set of labels of candidate hyperlinks L = {l1, l2, ..., ln} is matched with a predicate p; the following algorithm is applied for optimization:
– A stoplist is used to remove the common words (and, the, in, etc.) from each label li. The result is denoted as keywordset(li).
– The similarity of each li and p is computed as sim'(li, p) = max(sim(synset(w), synset(p))), where w ∈ keywordset(li).
The following heuristic rule is then used.
Rule: The link whose label has the maximum similarity with the predicate p is selected first for the next search.
For example, in Fig. 5, we have
sim'("Information", "Professor") = 0;
sim'("People", "Professor") = 8.11;
sim'("Research", "Professor") = 0;
sim'("Class Files", "Professor") = 0.
Therefore, the link with the label People is selected first for the next search.
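Using any such word-word similarity, the heuristic rule itself is simple to state in code; the sketch below (ours) removes stopwords from each label and ranks the candidate links, with a toy similarity standing in for the WordNet-based measure.

STOPLIST = {"and", "the", "in", "of", "a", "on"}

def keywordset(label):
    # Remove common words from a hyperlink label.
    return [w for w in label.lower().split() if w not in STOPLIST]

def best_link(labels, predicate, word_sim):
    # Heuristic rule: select the link whose label is most similar to the predicate.
    def score(label):
        return max((word_sim(w, predicate.lower()) for w in keywordset(label)), default=0.0)
    return max(labels, key=score)

toy_sim = lambda w, p: 8.11 if {w, p} == {"people", "professor"} else 0.0
labels = ["Information", "People", "Research", "Class Files"]
print(best_link(labels, "professor", toy_sim))   # -> People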
5 Experiments
The most important facility in NetQL is information mining from multiple web pages. Our first attempt was to extract structural information from web pages based on their syntactic or semantic structure. The method was to transform a page into a labeled graph as in semistructured databases [2,6,7] and then obtain the desired information from the graph. However, this approach failed in our experiments, since HTML provides many flexible constructors and thus most web pages are quite irregular. We therefore tried another kind of information mining, based on keywords, patterns and structure. Although this approach cannot guarantee 100% success in mining the desired information, our initial experience shows that it is effective in practice. In Example 2.1, the average recall (the percentage of desired instances obtained by the system) for keyword-based mining (i.e. the variable E-mail) was 93% and the average precision (the percentage of correct instances) was 85%. However, the recall for pattern-based mining was a little lower: it was only 21% for the pattern [Dr. Name]; nevertheless, the precision was 95%. From the experiments, the following observations are made: (1) There is often no keyword before the desired information, such as names, titles or addresses in personal homepages (humans can recognize them from the context or by semantic knowledge). This problem may be solved by pattern matching to some extent. For example, some names always have a title before them, such as Dr., Mr., Ms etc. Of course, if there is no obvious pattern, it is hard to handle the problem. Also, simple concepts, e.g. name, address, place or prices, could be easily identified by NLP techniques. But complex concepts, e.g. bibliography, would be hard. (2) The information denoted by a keyword may be a complex concept, so that our program cannot mine it completely. For example, the rate of a hotel usually
includes rates for single rooms, double rooms, adults or children. The mining of these kinds of data also requires semantic knowledge and NLP techniques. (3) Some information is represented as images and complex tables. These cannot be easily handled automatically (at least not by the present NetQL). In reality, only a highly intelligent human could recognize the relationships among such data. In short, we find that mining pages from a local or remote site is easier than global search, since the web pages of a single site are usually organized by the same institution and thus exhibit very similar stylistic properties. The irregular cases are web pages designed by different individuals; such web pages can be presented in very different styles. Despite this, many common structures and styles do appear in web sites which share the same theme in practice. For example, most professors' homepages have the following structure: name, title, address, biography, teaching, research interests, projects and publications. Therefore, information mining over the Internet is not impossible. Our second experiment focuses on the performance of local and remote search. The results show that the optimizing technique is useful in improving performance [14]. Although it is not effective in all cases, it out-performs exhaustive blind searching in almost all cases. Usually, a local search over 10,000 pages or more can be done in a couple of minutes. Under this time frame, the desired information can be found after navigating 5 pages from the starting point on average. If optimization is applied, the search time is only a few minutes with an average navigation length of 8. For a remote search, performance depends heavily on the speed and load of the Internet. Our experience shows that about 1000 pages are accessed from the US and Canada within a couple of minutes without optimization. On average, only 3 pages are visited from the starting point for a medium-size site which contains around 10,000 pages. However, if the optimizing technique is used, it is possible to obtain data at level 5 or 6 in a few minutes.
6 Conclusion and Further Work
An intelligent query language over the WWW, NetQL, and its implementation are presented in this paper. Rather than developing yet another powerful language, we focus on the problems ignored by other languages. The main contributions of NetQL are: (1) it provides a novel approach to extract information from irregular textual web pages; (2) it supports various methods to control the complexity of queries. Future work will focus on the following questions:
– Is it possible to extract information by semantic knowledge, e.g. name, address or biography?
– Are there other heuristic rules for web queries?
– Which method is the best for semantic similarity in the context of the WWW?
In short, structure-based web querying is a new area and current solutions are incomplete. There is a lot of room for further research.
References
1. B. Adelberg: NoDOSE - A tool for semi-automatically extracting structured and semistructured data from text documents. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1998
2. N. Ashish and C. Knoblock: Wrapper generation for semi-structured Internet sources. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
3. P. Atzeni, G. Mecca and P. Merialdo: Semistructured and structured data in the Web: going back and forth. In 1st Workshop on Management of Semistructured Data, 1997
4. M. Costantino, R.G. Morgan, R.J. Collingham and R. Garigliano: Natural language processing and information extraction: Qualitative analysis of financial news articles. In Proc. of the Conf. on Computational Intelligence for Financial Engineering, 1997
5. W.N. Francis and H. Kucera: Frequency analysis of English usage: lexicon and grammar. Houghton Mifflin, 1982
6. M. Fernandez and D. Suciu: Query optimizations for semi-structured data using graph schema. In ICDE’98
7. R. Goldman and J. Widom: Interactive query and search in semistructured databases. Technical Report, Stanford University, 1998
8. J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha and A. Crespo: Extracting semistructured information from the Web. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
9. J. Jiang and D. Conrath: Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of Int’l Conf. on Research on Computational Linguistics, Taiwan, 1997
10. H. Kozima and T. Furugori: Similarity between words computed by spreading activation on an English dictionary. In Proc. of EACL-93 (Utrecht), pp. 232-239, 1993
11. D. Konopnicki and O. Shmueli: W3QS: A query system for the world wide web. In VLDB’95, Zurich, 1995, pages 54-65
12. Z. Lacroix, A. Sahuguet, R. Chandrasekar and B. Srinivas: A novel approach to querying the Web: Integrating Retrieval and Browsing. ER’97 Workshop on Conceptual Modeling for Multimedia Information Seeking, 1997
13. L.V.S. Lakshmanan, F. Sadri and I.N. Subramanian: A declarative language for querying and restructuring the Web. In Proc. of 6th International Workshop on Research Issues in Data Engineering, RIDE’96, New Orleans, February 1996
14. M. Liu: NetQL: an intelligent web query language. Master’s Thesis, University of Regina
15. G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller: Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 1993
16. A. Mendelzon, G. Mihaila and T. Milo: Querying the World Wide Web. In 1st Int. Conf. on Parallel and Distributed Information Systems, 1996
17. D. Smith and M. Lopez: Information extraction for semi-structured documents. In 1st Workshop on Management of Semistructured Data, Arizona, 1997
18. S. Soderland: Learning to extract text-based information from the world wide web. In Proc. of 3rd International Conf. on Knowledge Discovery and Data Mining (KDD-97), 1997
19. A.F. Smeaton and I. Quigley: Experiments on using semantic distances between words in image caption retrieval. In SIGIR’96
Integrated Approach for Modelling of Semantic and Pragmatic Dependencies of Information Systems
Remigijus Gustas
Department of Information Technology, University of Karlstad
S-651 88 Karlstad, Sweden
[email protected]
Abstract. Traditional semantic models are based on entity notations provided with several kinds of links. Links are established to capture semantic detail about relationships among concepts. The ability to describe a process in a clear and sufficiently rich way is acknowledged as crucial to conceptual modelling. Current workflow models used in business process re-engineering offer limited analytical capabilities. Entity-Relationship models and Data Flow Diagrams are closer to the technical system development stage and, therefore, they do not capture organisational aspects. Although object-oriented models are quite comprehensible for users, they are not provided with rules of reasoning or a complete integration between static and dynamic diagrams. The ultimate objective of this paper is to introduce principles of integration for different classes of semantic and pragmatic representations.
1 Introduction
Any information system activity needs to be defined in the context of organisational processes. Thus, two levels of information system models are necessary [5]. The organisational level model defines an ideal system structure; further on, it will be referred to as an enterprise model. The implementation level determines data processing needs for a specific application. Most conventional semantic data models are heavily centred around the implementation level. The way in which organisational activity is conceptualised will define what information system is appropriate. Activities in an organisational system can be expressed in terms of actions of communication and collaboration between actors of an information system. This kind of knowledge is crucial to reason about purposeful implications of organisational processes. Most semantic models that have been used in traditional information system modelling approaches neglect such essential aspects of communication.
Enterprise engineering is a branch of requirements engineering which deals with an early phase of integrated information system development. At the same time, it can be viewed as an extension and generalisation of the system analysis activity. Enterprise
modelling takes place in the early and middle phases of the information system development life cycle. The most difficult part of it is arriving at a complete, integrated and consistent description of a new system, which is sometimes known as a conceptual, semantic or requirements model. Despite the apparent clarity of the semantic models used by various Object-Oriented methods and CASE tools, research has shown that a large part of maintenance costs can be attributed to improper enterprise modelling or to misconception of the real requirements [15].
Various graphical diagrams [19] are used to define the semantics of information systems. It is obvious that all notations have been designed to describe one or a few, but not all, aspects of information systems. This means that information system models should comprise a combination of several notations, each for some particular aspect. This may lead to a difficult question: ‘how is it possible to use several notations in a complementary way to develop clear models?’ [14]. More importantly, it is often possible to employ a notation to describe some other aspects than those it has been designed for. The solution could be found in the identification of a set of basic semantic and pragmatic modelling primitives that are adequate to analyse static and dynamic aspects of processes.
In this study we present and analyse a set of abstractions that can be considered a necessary basis to build an integrated enterprise model. The focus of this approach is on modelling primitives which not only take into account the semantic models of traditional approaches, but also put the communication aspect into the foreground of information system modelling. Integration of semantic, pragmatic and non-traditional communication dependencies is considered the most important feature of the suggested framework. Such enterprise models can be useful for the purpose of understanding and reasoning, which is critical to the success of conceptual engineering activity in many areas.
2 Pragmatic Dependencies
The starting point in business process re-engineering research is a set of initial requirement statements that express the wishes of stakeholders about a new organisation of the system. These initial requirements are usually presented as a natural language text that is often ambiguous, incomplete and inconsistent. Although processes of information system adaptation can be driven by these pragmatic statements, traditional semantic models usually do not take into account dependencies between activities and goals. Moreover, interdependencies between goal models and process models are usually defined in a very fuzzy way. Some requirements engineering methodologies have already identified the problem of making system requirements precise, unambiguous, complete and consistent. The process of bridging goals to information system specifications was entitled ‘from fuzzy to formal’ [7]. The predominance of fuzzy thinking in goal modelling has led to a serious lack of interaction between semantic and pragmatic descriptions of processes. More often, a process goal is merely postulated rather than expressed in terms of semantic diagrams.
Such ignorance of the power of goals has been recognised in the area of cognitive modelling. Goals are usually understood as states of affairs or situations that should be reached or at least striven for. Situations result from actions [6]. Goals can be defined as desirable situations that are interpreted by an actor as final. Such pragmatic notions as objective, vision, goal, etc. express the wishes and desires of actors concerning the system they design or manage. A goal hierarchy can be formed of interconnected goals on different levels of abstraction, ranging from high-level business objectives to low-level operational goals. Usually, objectives at the bottom level are situations that can be defined in terms of various semantic dependencies. On neighbouring levels of decomposition, goals are related by the composition dependency.
The opposite of a goal is a problem. A problem describes a situation which is not desirable. The notion of a problem is used to refer to a problematic situation of an actor. Semantic specification of the problem is regarded as a part of the actual specification. This means that the problem cannot be identified without stating the goal. If the designer has no predefined goal, then the problem does not make sense [9]. Usually, problematic situations denote restrictions that actors try to avoid.
A pragmatic link between an actor and a desired situation is referred to as a goal dependency (g). The problem link (p) is used to refer to a problematic situation of an actor. The goal and problem dependencies can be used to refer to desirable or undesirable states or situations. In the following chapters, two pragmatic dependencies of influence between goals will be formally defined. They are referred to as the negative influence dependency (-) and the positive influence dependency (+). The negative influence dependency from A to B (A - B) indicates that the goal A hinders the achievement of the goal B. The positive influence dependency (+) between two goals means that the achievement of the first goal would contribute to the achievement of the second. The negative and positive influences between goals are imposed by the conflicting interests of actors.
3 Semantic Dependencies
Most semantic modelling techniques are based on entity notations provided with several kinds of links. Links are established to capture semantic detail about static and dynamic relationships. Typically, semantic constraints have to be general enough to specify dependencies of a system in different perspectives such as the "why", "what", "who", "where", "when" and "how" [16]. Semantic constraints can be described by using intensional and extensional [13] dependencies of various kinds. The semantics of static intensional dependencies can be defined as cardinalities, represented by the minimum and maximum numbers of individuals of concepts. Extensional dependencies usually specify constraints between classes and instances. Static dependencies of concepts stem from various semantic data models. Graphical notations of several associations in Martin/Odell’s style [13] are represented in Fig. 2.1.
Fig. 2.1. Graphical notation of cardinality constraints: a) an association between A and B with cardinality (0,1;?,?); b) (1,1;?,?); c) (0,*;?,?); d) (1,*;?,?). Note: the meaning of * is ‘many’ (i.e. more than one) and the meaning of ? is ‘not defined’.
Notations that are commonly used in the initial phase of concept modelling have to provide a clear understanding of the cardinality constraints in both directions. The most common static dependencies that may be specified between any two concepts A and B are as follows:
– (1,1;0,1) - Injection dependency, denoted by A ⇒ B;
– (1,1;1,1) - Bijection dependency;
– (1,1;0,*) - Total functional dependency;
– (1,*;1,1) - Surjection dependency;
– (1,*;0,1) - Surjective partial functional dependency (A ⇒> B);
– (1,*;1,*) - Mutual multivalued dependency;
– (1,*;0,*) - Total multivalued dependency;
– (0,1;0,1) - Partial injectional dependency (A |⇒ B);
– (0,1;0,*) - Functional (partial) dependency;
– (0,*;0,*) - Multivalued (partial) dependency.
Many concepts have common constraints. The similarities can be shared between concepts by extracting them and attaching them to a more general concept. In such a way, similar constraints can be inherited by several concepts. One way to represent generalisation hierarchies is by using the inheritance dependency, which is denoted by a solid line arrow. By means of inheritance, similarities of concepts are shown. Aggregation is a conceptual operation which is useful for the formation of a concept, interpreted as a whole, from other concepts that may be viewed as its component parts. Aggregation can be specified by a composition dependency. In the area of artificial intelligence, composition is sometimes referred to as a ‘part of’ relation [17]. The semantics of the composition link can be completed by cardinality constraints.
Most behavioural diagrams put into the foreground a dynamic link, which is very similar to the state transition of a finite state machine. Such a transition link constitutes a modelling basis of various object-oriented diagrams that are used for the specification of object behaviour. In our approach, a state transition is defined in terms of two states. If two states are connected by the transition dependency, then by the action an object can be transferred from the actual state to the next state. Actual static constraints define a set of conditions for an object in the current state. The
expected state defines a set of conditions for an object in the desired state [11]. The graphical illustration of the transition dependency is presented in Fig. 2.2.
Fig. 2.2. Transition dependency between actual and desired situation (ACTUAL STATE → ACTION → NEW STATE). Actions are represented by ellipses.
States result from actions [6]. The specification of actual and desired states is crucial to the understanding of action semantics. Any state transition dependency indicates a possibility to change a state and, vice versa, a possibility to accomplish an action can be specified by a state transition dependency, i.e. (ACTUAL STATE) → (NEW STATE). Communication dependencies between two actors involved in a particular action describe the "who" perspective. Such a dependency link between two actors (agent and recipient) indicates that one actor depends on the other for some flow. The agent can be any actor who is able to send a flow, for example an individual, group, role, position, organisation, machine, information system, etc. The graphical notation of the flow dependency between an agent and a recipient (AGENT FLOW RECIPIENT) is represented in Fig. 2.3.
Fig. 2.3. Flow dependency
The flow dependency represents a transfer of the ownership right for a particular object of FLOW. Before the flow is sent, it is owned by the AGENT; later, depending on whether the flow is accepted or not, the ownership is transferred to the RECIPIENT. Flows can be decisions, information or material. Recipients, by depending on agents, are able to achieve their goals.
4 Interaction between Static and Behavioural Dependencies
Any flow dependency between two actors may imply a communicative action as well. It is then considered to be both an action and a communication flow. It should be noted that many approaches in the area of business process re-engineering do not view actions in these two different perspectives [8]. The cohesion of action and flow results in a more complex abstraction. Therefore, the flow dependency link between two actors specifies that a recipient depends on an agent not only for a specific flow, but also for an action. Actors are specific sub-systems of the overall system. The semantic link from actor to action (ACTOR → ACTION) indicates that the action can be initiated
by any individual which belongs to the class ACTOR. The presented dependency may define co-ordination, decision, control, etc. The dependency link from action to actor (ACTION → ACTOR) means that an actor will be affected by the executed action. Often this dependency link is combined with the flow that is desired by the dependent actor. The graphical notation of the communicative action dependency is represented in Fig. 4.1.
Fig. 4.1. Communicative action dependency between two actors (an AGENT performs an ACTION that sends a FLOW to a RECIPIENT)
The underlying concepts and dependencies play an important role in various business modelling approaches which are based on communication [21]. A typical action workflow loop can be defined in terms of two communicative action dependencies. Sequences of communicative actions in workflow models [1] may serve as a basis to define obligations, authorisations and contracts [20]. An example of a typical workflow loop, defined in terms of two communicative actions in opposite directions, is represented in Fig. 4.2.
Fig. 4.2. Action workflow loop between Customer and Supplier (the Customer sends an ORDER to the Supplier by the Order action, and the Supplier sends an ITEM to the Customer by the Supply action)
This graphical example shows that a customer is authorised to send an order to a supplier by using the predefined ordering action. If this order is accepted, then the supplier is obliged to supply an item. The supplier is also responsible for following a contract, which can be defined in terms of relationships between the incoming (ORDER) and outgoing (ITEM) flows. The contract in the presented example can be defined as follows: if the CUSTOMER sends an ORDER to the SUPPLIER, then the SUPPLIER sends an ITEM to the CUSTOMER.
An agent carries out a specific action in order to achieve a predefined state. The existence of an object x in some state also imposes the fulfilment of a set of static dependencies between states. Two kinds of changes, disconnection and connection, occur concerning the associations of an object during a transition of the object from one state to another. A disconnection removes an existing association from existence and a connection adds a new association. The disconnected and connected associations are represented by entirely different relationships of the two states. The definition of such a noteworthy difference between the current and new state of an action is very important to understand the nature of the action. Only those processes that conclude with a state change, expressed by a disconnection or connection event, can
be interpreted as actions. For instance, a graphical example of the noteworthy difference among three different states, which are important to understand the actions of customer and supplier, is illustrated by Fig. 4.3.
Fig. 4.3. Constraints to Order and to Supply (states such as Product, Product Needs to be Ordered, Ordered Product, Product Needs not to be Ordered, Not Supplied Item and Supplied Item, connected by the Order and Supply actions and the Order and Item flows between Customer and Supplier). Note: inheritance in this paper is used in a non-traditional way. A strict definition of the inheritance dependency is presented in chapter 7.
A transition from the state ‘Product Needs to be Ordered’ to the state ‘Ordered Product’ can be performed by the action of Ordering. The existence of an object x in the state ‘Ordered Product’ implies that it may have one or many ‘Not Supplied Item’ objects associated with the Product x. The noteworthy changes performed by actions are important for actors, because they create a need to react. The reaction mechanism is represented in terms of communication flows. For instance, if a ‘Product Needs to be Ordered’, then a Customer has to react appropriately. In this particular situation, the Customer is supposed to send an Order to a Supplier. If an item is a ‘Not Supplied Item’, then the Supplier is supposed to Supply the Item to the Customer. The semantic difference between the two states ‘Not Supplied Item’ and ‘Supplied Item’ is defined in terms of two semantic links (see ‘Ordered Product’ and ‘Product Needs not to be Ordered’).
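The reaction mechanism can be made concrete with a small sketch; the Python below is a toy model of the Order/Supply loop (the names and state sets are ours), in which each action disconnects one association and connects another, in the spirit of the description above.

from dataclasses import dataclass, field
from typing import Set

@dataclass
class LoopState:
    # Toy model of the Customer/Supplier loop.
    needs_order: Set[str] = field(default_factory=set)    # 'Product Needs to be Ordered'
    ordered: Set[str] = field(default_factory=set)         # 'Ordered Product'
    not_supplied: Set[str] = field(default_factory=set)    # 'Not Supplied Item'

def order(state, product):
    # The Order action: disconnect 'needs to be ordered', connect 'ordered'
    # and create a 'not supplied item' the Supplier has to react to.
    state.needs_order.discard(product)
    state.ordered.add(product)
    state.not_supplied.add(product)

def supply(state, product):
    # The Supply action: the item is no longer a 'not supplied item'.
    state.not_supplied.discard(product)

s = LoopState(needs_order={"printer"})
order(s, "printer")    # the Customer reacts to 'Product Needs to be Ordered'
supply(s, "printer")   # the Supplier reacts to 'Not Supplied Item'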
5 Dependencies in an Extended Action Workflow Loop
A typical action workflow loop includes two communication flows sent in opposite directions. The customer is the actor who initiates the workflow loop to achieve his goal. The receiver of a flow is a performer. Flow dependencies in two opposite directions imply that certain relationships are established between the two actors. In
reality, it represents either a commitment or a contract [8] between the customer and the performer. Any business process defines a set of responsibilities as well as a set of requests that a customer can make of a performer. Usually, the customer is the actor who initiates the action in order to achieve his goal, which is referred to by a desired situation (DS2). The goal corresponds to the final situation in a communicative action workflow loop. By using a flow in the forward direction, a customer asks a performer for some action. If the request corresponds to a contract, it will always create a situation that is an opportunity for the performer. Pragmatic dependencies are represented graphically in Fig. 5.1.
Cust omer
o Problematic Situation (PS1)
Action of Customer
Desired Situation (DS1)
g Flow 2
Flow 1
Problematic Situation (PS2)
p
Perfor mer
o Action of Supplier
Desired Situation (DS2)
Fig. 5.1. Graphical representation of pragmatic dependencies
A pragmatic link between an actor and his desired situation is referred to as a goal dependency (g). The problem link (p) is used to refer to a problematic situation of an actor. An opportunity link (o) refers to an intermediate situation between a problematic and a desired one. If an actor has the social power to activate an action that changes a situation from the problematic to the intermediate one, then this intermediate situation may help some other actor to create new desirable situations. According to the presented schema, a customer has the possibility to initiate the action by sending Flow 1 to a performer in order to avoid a problem denoted by a problematic situation (Customer p PS1). If the performer is satisfied by the flow (it is accepted), then the problematic situation is replaced, by the action of the customer, with a desired situation (DS1). In the next step, the performer, by sending Flow 2, has the possibility to change his problematic situation (PS2) to the desired situation (DS2), which is regarded as a goal of the customer (Customer g DS2). A graphical example of the semantic and pragmatic dependencies in a typical action workflow loop is depicted in Fig. 5.2.
Fig. 5.2. Example of dependencies in an action workflow loop (Customer and Supplier with states such as Not Ordered Product, Ordered Product, Available Item and Supplied Item, the Order and Supply actions, and the problem (p), opportunity (o) and goal (g) links)
The satisfaction of actors is closely related to their goals and problems. In order to activate an action, an agent has to know about the opportunities available to a recipient. If a recipient views the intermediate situation as an opportunity to achieve his goal or to avoid a problem, then the flow that has been sent by the agent will be acceptable to the recipient.
6 Dependencies of Positive and Negative Influence
Any two goals can be contradictory if one of them is interpreted as a problem for reaching the other goal. A contradictory goal can influence negatively, or hinder, the achievement of a desirable situation. This means that the interpretation of goals and problems is relative and depends on actor objectives. The same situation can be interpreted as a goal by one actor and as a problem by another. The negative influence dependency has been introduced in F3 [7] to specify contradictions between goals. The negative influence dependency from A to B (A - B) indicates that goal A hinders the achievement of goal B. Conditions for the existence of a negative influence dependency between two situations are as follows: if ACTOR p S1 and ACTOR g S4, then S1 - S4. This axiom specifies that a problematic situation (S1) hinders the achievement of a desired situation (S4). Moreover, in the context of an opportunity (S2), the following axiom is true: if ACTOR p S1, ACTOR o S2 and ACTOR g S4, then S2 - S1 and S2 + S4. According to this definition, an opportunity must influence negatively a problematic situation and influence positively a desired situation. The positive influence dependency between two goals means that the achievement of one goal would contribute to the achievement of the second. It should be noted that
130
R. Gustas
anything that influences negatively a problematic situation may be considered as an opportunity for some actor, i.e. if ACTOR p S3 and S2 - S3, then ACTOR o S2. Anything that influences positively a desired situation may be considered as an opportunity as well, i.e. if ACTOR g S5 and S4 + S5, then ACTOR o S4. The negative influence dependency is useful to express contradictions between the goals of various actors. If one of the goals hinders the achievement of the other, then these goals are in conflict. The positive influence dependency between two goals indicates that the achievement of one goal helps to achieve the second. It should be noted that additional pragmatic dependencies are derived according to the following inference rules:
if A + B and B - C then A - C,
if A - B and B - C then A + C,
if A + B and B + C then A + C,
if A is a component of B (the goal composition dependency [12]) then A + B.
Semantic and pragmatic dependencies in an action workflow loop are very important to analyse the viability of business processes. It should be noted that the viability of a single communicative action dependency between two actors guarantees that the desired situations create new possibilities for the recipients of flows. The customer tries to initiate an action because he wants to avoid a problematic situation or to achieve a new desired situation. By depending on a performer, a customer is able to achieve a situation that cannot be reached without the involvement of that specific performer. At the same time, if the performer fails to deliver the flow to the customer, then the customer becomes vulnerable to the failure. The negative and positive dependencies between various situations are imposed by the different intentions of actors. If influences in the action workflow loop are incompatible, i.e. A + B and A - B, then this situation is referred to as a contradiction. By using a set of influences between situations from the point of view of different actors, the contradictory goals can be identified. It is not difficult to see that in the example of the previous chapter the pragmatic dependencies are not contradictory.
An overall set of semantic and pragmatic dependencies of a particular process constitutes a formal basis for inconsistency analysis. Inconsistencies may be eliminated through negotiation among actors, by disregarding some of the goals, or by disregarding some of the actors. If a cost-effective development action exists, inconsistency among goals may serve as a driving force for business process re-engineering. Inconsistencies between actor goals in the context of the same process mean that one of the actors is vulnerable to failure in the achievement of his goal. Consistency of the pragmatic dependencies in the action workflow loop guarantees that the interests of the two actors are not conflicting. If the interests of actors are in conflict, then the action workflow loop may not be viable. The viability of action workflow loops can be studied in terms of semantically complete diagrams.
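The first three inference rules can be applied mechanically; the following Python sketch (ours; the composition rule is omitted) closes a set of influence triples under them and reports contradictions of the form A + B together with A - B.

from itertools import product

def derive_influences(influences):
    # Close a set of (goal, sign, goal) triples under the rules:
    # '+ then -' gives '-', '- then -' gives '+', '+ then +' gives '+'.
    compose = {("+", "-"): "-", ("-", "-"): "+", ("+", "+"): "+"}
    closed = set(influences)
    changed = True
    while changed:
        changed = False
        for (a, s1, b), (b2, s2, c) in product(list(closed), repeat=2):
            sign = compose.get((s1, s2))
            if b == b2 and sign and (a, sign, c) not in closed:
                closed.add((a, sign, c))
                changed = True
    return closed

def contradictions(influences):
    # A contradiction is a pair A + B and A - B in the closure.
    closed = derive_influences(influences)
    return {(a, b) for (a, s, b) in closed if s == "+" and (a, "-", b) in closed}

# The influences of the workflow loop example: the opportunity S2 hinders the
# problem S1 and helps the goal S4, while S1 hinders S4.
facts = {("S2", "-", "S1"), ("S2", "+", "S4"), ("S1", "-", "S4")}
print(contradictions(facts))   # set(): the loop is not contradictory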
7 Semantic Incompleteness of Diagrams
The way people normally analyse systems is by reasoning on the basis of a model of a particular part of the system. Many analysts in the area of information system development define their systems in terms of initial conceptual models, and later extend them by making a whole lot of assumptions. Very often these conceptual models are quite vague, for several reasons. Sometimes, the dependencies of conceptual diagrams are not defined strictly and can be interpreted ambiguously. Even if the diagrams developed by experts are presented in a formal way, the system model may still not be clear enough. This can happen because the description of the system is incomplete. Elimination of semantic incompleteness in a diagram, by refinement of the relations among concepts, is important if system analysts wish to reason automatically about the expected scenarios, contingent actions and opportunities available in a particular business process. An applicable set of dependencies allows us to avoid semantic holes [10].
In this study of conceptual dependencies, we concentrate on a particular subset of semantic links, which is referred to as totally applicable dependencies. Applicability of dependencies can be achieved through appropriate transformations of concept diagrams. Some methods call this process of change strengthening or restricting [2], [3]. In the object-oriented approach, the transformation process is referred to as sharpening the meaning of concepts [13]. It improves the ability to understand and communicate conceptual models. Transformations of diagrams have mostly been studied in the context of static dependencies. In this approach, we have introduced a common basis to deal with both the static and the behavioural parts of representations. It means that semantic transformations can take both aspects of the information system description into account.
Semantic relationships of an information system can be specified by using two kinds of abstractions: aggregation and generalisation. The abstraction of aggregation is based on the presented set of totally applicable binary dependencies. These links are as follows: the injection dependency (⇒), the surjective partial functional dependency (⇒>), the total functional dependency, the composition dependency, the communication dependency and the transition dependency. These dependencies will be referred to as basic. The generalisation abstraction is based on the inheritance dependency. Inheritance links can be of various kinds. Although the inheritance constructs are well understood, there is no complete agreement on the interplay between the inheritance dependency and the other types of basic constraints of semantic models. For example, some researchers understand inheritance in the same way as the inclusion dependency. To eliminate this ambiguity, we define the inheritance dependency in terms of the presented basic constraints. Let Ad be the set of static and dynamic semantic links that are specified for the concepts dependent on A. The inheritance dependency from concept A to B holds if and only if A ⊆ B and Bd ⊆ Ad. The inheritance is characterised by seven axioms, each stating that a dependency of B is propagated to A; for instance,
2) if A inherits B and B ⇒ C, then A ⇒ C;
5) if A inherits B and B ⇒> C, then A ⇒> C;
and, analogously, axioms 3), 4), 6) and 7) state that the total functional, composition, communication and transition dependencies are propagated along inheritance in the same way: if A inherits B and B is linked to C by one of these basic dependencies, then A is linked to C by the same dependency.

Inheritance is defined in terms of two abstractions: intensional and extensional. If A inherits B, then the structure (intension) of concept B must be included as part of the intensional structure of concept A. In the extensional sense, inheritance is defined as follows: if x ∈ B and B inherits C, then x ∈ C. It should be noted that the presented definition is more general than the one assumed in the object-oriented approach. For instance, concepts A and B can be interpreted as categories of states and actors.

Any communicative action is defined unambiguously if and only if it is expressed in terms of applicable constraints. The two states of a semantically unambiguous action must be connected to other concepts by total static dependencies. As far as actor communication links are concerned, the flow dependency from an agent to a recipient is considered an applicable constraint as well. The same condition holds for a current state that is linked by a transition dependency to a new state. It means that any object which belongs to the current state is applicable for the specified transition link. The presented set of totally applicable dependencies is useful for assessing the semantic ambiguity [10] of specifications and for reasoning about a particular part of the system. Very often information system specifications are quite vague because some of the semantic dependencies are optional. If the diagrams can be defined in terms of basic dependencies, then certain formal inference rules may be applied to derive additional semantic links that can be used to check the semantic consistency of information system requirements at the enterprise level.
8 Conclusions

The interplay between semantic and pragmatic dependencies lies at the foreground of the suggested framework. In this approach, we have shown how to bring various semantic diagrams together and how to combine insights from the point of view of different goals. The generic framework attempts to bridge the goals of various actors and the way various business processes are described. Goals and desires of actors constitute an important part of knowledge about business processes. The intentional relationships among actors are viewed with potentially common and conflicting interests. The presented pragmatic dependencies can explain the freedom of a specific actor and the extent to which actors are exposed to a danger. The usefulness of a great number of semantic dependencies in the area of information system design is an open problem. For instance, many dependencies introduced in database theory encounter problems with missing attribute values. These problems result from the fact that either the instance of the attribute is
temporarily unknown but applicable, or that its value can never be known because the attribute is not applicable for a specific instance [4]. For the unambiguous definition of concepts, only applicable dependencies can be used. Despite the strictness of this requirement, it allows us to discover contingent actions in the semantic diagrams and to introduce rules of reasoning [10] at the enterprise level. The presented actor dependencies constitute a unified basis for modelling dynamic relationships and can be regarded as an extension and integration of state transition and interaction diagrams. Actor goal dependencies constitute a unified basis for modelling pragmatic relations that are able to define actor intentions. Goals justify and explain the presence of the semantic dependencies, which are used to specify components of information system requirements. Thus, such an integrated approach offers a novel perspective on the semantic analysis of information systems. The main difference between this framework and the approach of Yu and Mylopoulos [18] is that any actor dependency may be considered at the same time to be both an action and a goal dependency. The action dependency is defined in terms of state transition and flow dependencies. It has also been shown how the negative and positive influence dependencies, introduced in F3 [7], can be formally defined in terms of the semantic and pragmatic dependencies.

There is a growing interest in integrating information system development methodologies from different areas such as requirements engineering, method engineering, workflow management, business process modelling, the object-oriented approach, etc. Many system analysts recognise that it is not enough to describe the semantics of an information system by concentrating on just one of the methods. When re-engineering information systems, most models tend to neglect the communication aspects among the several actors of an organisation. Dependencies of communication and co-operation between actors and their actions describe a very important part of knowledge about business processes. Unfortunately, communication approaches often neglect some behavioural aspects of system modelling that are basic in information system engineering. The presented set of totally applicable dependencies can be viewed as an integrated semantic modelling technique to specify the deeper structures of relationships among concepts. The suggested generic framework focuses on the modelling of static and dynamic constraints, where several actors co-operate to achieve new desired states. The basic dependencies provide a uniform formal basis in the area of concurrent business process modelling, analysis and integration. The purpose of such a basis is that eventually information system diagrams can be used as a tool to assist reasoning and to validate enterprise models before they are implemented.
References 1. Action Technologies. Action Workflow Analysis Users Guide. Action Technologies, 1993. 2. A T Borgida. Generalisation/Specialisation as a Basis for Software Specification. On Conceptual Modelling, M Brodie, J Mylopoulos, J W Schmidt (eds.), Springer-Verlag, New York, 1984, pp.87-112.
3. R Brachman, J G Schmolze. An Overview of the KLONE Knowledge Representation System. Cognitive science, 9(2), pp. 171-212, 1985. 4. E F Codd. The Relational Model for Database Management. Addison-Wesley Publ. Co., 1990. 5. G B Davis, M Olson. Management Information Systems. McGraw Hill, New York, 1985. 6. E D Falkenberg et al. A Framework of Information System Concepts. The Report of the IFIP WG8.1 Task Group FRISCO, 1996. 7. F3 Consortium. 'F3 Reference Manual (Esprit III Project 6612)', SISU, Kista, Sweden, 1994. 8. G Goldkuhl. Information as Action and Communication. The Infological Equation, Goteborg University, Sweden, pp. 63-79, 1995. 9. R Gustas, J Bubenko jr., B Wangler. Goal Driven Enterprise Modelling: Bridging Pragmatic and Semantic Descriptions of Information Systems. Information Modelling and Knowledge Bases VII, Y Tanaka, H Kangassalo, H Jaakola, A Yamamoto (eds.), IOS Press, 1996 , pp. 73 -91. 10. R Gustas. Towards Understanding and Formal Definition of Conceptual Constraints. Proc. of the European-Japanese seminar on Information Modelling and Knowledge Bases VI, 1994, IOS Press, pp. 381-399. 11. R Gustas. A Basis for Integration within Enterprise Modelling. Second Int. Conference on Concurrent Engineering: Research and Applications, Washington, DC Area, August 23-25, 1995, pp. 107-120. 12. R Gustas. A Framework for Description of Pragmatic Dependencies and Inconsistency of Goals. Proc. of the second Int. conference on the Design of Cooperative Systems, June 1214, 1996, Juan-Les-Pins, France, pp. 625-643. 13. J Martin, J J Odell. Object-Oriented Methods: Foundation. Prentice-Hall, New Jersey, 1995. 14. W E Riddle. Fundamental Process Modelling Concepts. NSF Workshop on Workflow and Process Automation in Information Systems, May 8-10, 1996. 15. K Siau, Y Wand, I Benbasat. The Relative Importance of Structural Constraints and Surface Semantics in Information Modelling. Information Systems, Vol. 22, No 2/3, pp 155-170, 1997. 16. J F Sowa, J A Zachman. Extending and Formalizing the Framework for Information Systems Architecture. IBM Systems Journal, 31(3), pp. 590 - 616, 1992. 17. V C Storey. Understanding Semantic Relationships. VLDB Journal, F Marianski (ed.), Vol.2, pp.455-487, 1993. 18. E Yu, J Mylopoulos. from E-R to 'A-R' - Modelling Strategic Actor Relationships for Business Process Reengineering. 13th Int. Conf. on the Entity - Relationship Approach, P Loucopoulos (ed.), Manchester, U.K., 1994. 19. E Yourdon. Modern Structured Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1989. 20. H Weigand, E Verharen, F Dignum. Dynamic Business Models as a basis for Interoperable Transaction Design. Information Systems, Vol. 22, No 2/3, pp 139-154, 1997. 21. T Winograd, F Flores. Understanding Computers and Cognition: A New Foundation for Design. Ablex Norwood, NJ, 1986.
Inference of Aggregate Relationships through Database Reverse Engineering Christian SOUTOU Université de Toulouse II, IUT ‘B’ Groupe de Recherche ICARE 1 Place Georges Brassens, 31703 Blagnac, FRANCE
Abstract. This paper presents a process to improve the reverse engineering of relational databases. Our process extracts the current aggregate relationships from a relational database through a combination of data dictionary, data schema and data instance analysis. The process we propose can refine the conceptual diagrams produced by commercial tools with reverse engineering options, such as Power AMC (Sybase) or Designer (Oracle).
1 Introduction

Reverse engineering is of increasing interest today in the context of migrating legacy database systems, in the development of multidatabase systems that integrate a variety of existing systems under a common data model interface, and in database evolution. The goal of a reverse engineering process is to produce a conceptual description of a given database, where the input may consist of any combination of source code description, a data dictionary, a database instance and application programs. Algorithms for converting a relational schema into the original Entity-Relationship (ER) model [4] can be found in [7]. [15] proposes a more powerful approach that considers inheritance and provides a very detailed classification of relations and attributes. Other approaches [1,10,11,12] make it possible to define an Extended ER diagram from a relational schema. Further approaches for reverse engineering of relational databases consider binary-relationship [18] or object-oriented models as target semantic data models [3,9,14,17,27]. Some tools exist [6,8,17,18]. Recent relational database reverse engineering methodologies use instances [5,16,19,22].

The majority of these existing approaches do not take into account the extraction of n-ary relationships of degree higher than two. Aggregate relationships are not well studied either in commercial tools or in reverse engineering methodologies. A reason may be that aggregate relationships would complicate already complex and large conceptual schemas. We do not think this is the case, because aggregate relationships can refine the n-ary relationships extracted in a reverse engineering process, so that the semantics of the
conceptual diagram resulting from our reverse engineering process is enhanced. [5] takes into account one kind of aggregate relationship in its process; we will see in Section 3 that several kinds of aggregate relationships exist. If we assume that the starting point of a reverse engineering process can be any combination of data description and application programs, the results may be inaccurate, as the data themselves would not be taken into account. We believe that, as the relational model is a data-based model, the very thing to start with should be considering instances together with data definition statements. This work continues the analysis in [23], which deals with the principles for reverse engineering n-ary relations (n ≥ 2).
2. Aggregate Relationships

The first paper dealing with aggregate relationships seems to be [20]. Aggregation is an abstraction which turns a relationship between objects into an aggregate object: a relationship between objects is regarded as a higher-level object. We consider here the aggregate relationships extracted from n-ary relationships.
2.1 Example

Let us consider the relational database in Fig. 1. Foreign key attributes are in brackets; we also use an arrow from the foreign key attribute(s) to the primary key attribute(s).

soft[softno#, softname]
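For concreteness, one possible DDL for this example is sketched below. Fig. 1 only lists the soft relation explicitly; the remaining attribute names are taken from Figs. 2-4, the key and foreign-key choices for owner and installation are assumptions consistent with the cardinalities in Fig. 2 and the aggregate inferred in Fig. 4, and the datatypes are placeholders.

CREATE TABLE soft   (softno  VARCHAR2(10) PRIMARY KEY, softname VARCHAR2(30));
CREATE TABLE server (servno  VARCHAR2(10) PRIMARY KEY, servtype VARCHAR2(40));
CREATE TABLE dept   (deptno  VARCHAR2(10) PRIMARY KEY, deptname VARCHAR2(30));

CREATE TABLE owner (
  softno VARCHAR2(10) REFERENCES soft,
  deptno VARCHAR2(10) REFERENCES dept,
  PRIMARY KEY (softno, deptno)
);

CREATE TABLE installation (
  softno   VARCHAR2(10),
  deptno   VARCHAR2(10),
  servno   VARCHAR2(10) NOT NULL REFERENCES server,
  dateinst DATE,
  PRIMARY KEY (softno, deptno),                  -- assumed key; reflects the 1,1 towards Server in Fig. 2
  FOREIGN KEY (softno, deptno) REFERENCES owner  -- assumed composite reference behind the aggregate of Fig. 4
);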
Although Chen was the first to publish his contribution in the ACM TODS journal [4], another paper also proposed a conceptual model [13]. The fundamental difference between these two approaches is the interpretation of cardinality constraints, particularly for n-ary relationships. The approach of Chen and its extensions [5,17,22,26] consider cardinality constraints based on the identifiers of relationships. The second approach [10,15,24] considers cardinality constraints based on participation constraints between entities and relationships. This approach is used by the majority of commercial tools and methodologies [2,25]. These two different approaches are neither complementary nor opposite [23]. We can note that the formalism of Chen [4] is more precise on the semantics of n-ary relationships. Indeed, each couple of cardinality constraints depends on the interaction with the other entities, whereas the other approach does not take this into account. Though the cardinality constraints
of binary relationships need only be inverted between the two different conceptual models, for n-ary relationships the semantics are not the same. See Figs. 2 and 3 for the conceptual diagrams produced with our previous methodology [23] from the instances in Appendix 1. Chen's formalism of the relationship 'Installation' indicates that for a given pair (department, soft) there is only one instance of server, for a given pair (soft, server) there are one or many departments, and for a given pair (server, department) there are one or many softs. The participation constraints of the relationship 'Installation' indicate that a soft and a server can be implicated in one or many installations, while a department can be implicated in many installations or in no installation at all.
Fig. 2. Chen's formalism
Fig. 3. Participation constraints
However, these conceptual diagrams do not represent any semantics between the relationship 'Installation' and the relationship 'Owner'. The binary relationship is semantically a subset of the ternary relationship. Song calls this type of binary relationship a Semantically Constraining Binary Relationship [21]. By looking at the data in Appendix 1, we can infer an aggregate relationship between these two relationships, see Fig. 4. We use the formalism based on participation constraints to describe aggregate relationships because this approach is used by the majority of commercial tools and methodologies. The aggregate relationship 'Installation' links the entity 'Server' to the relationship 'Owner', which we call the aggregate. It expresses the fact that an installation of a soft on a server is valid only if the department is the owner of the installed soft. This result improves the semantics of the conceptual schemes of Fig. 2 and Fig. 3.
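On the instance level, this containment can be checked with a query of the following kind (a sketch only; the column names are the ones assumed in the DDL sketch above). An empty result supports inferring the aggregate relationship of Fig. 4.

-- every (softno, deptno) pair occurring in installation must also occur in owner
SELECT i.softno, i.deptno
FROM   installation i
WHERE  NOT EXISTS (SELECT 1
                   FROM   owner o
                   WHERE  o.softno = i.softno
                   AND    o.deptno = i.deptno);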
2.2 Taxonomy
We use the following notations to describe aggregate relationships (Fig. 5): the aggregate relationship (A2) links an entity (E3) to an aggregate (A1); the identifiers of entities (e.g. a1#, b1#, c1#); the properties of entities (e.g. a2, ..., b2, ...); the properties of relationships (e.g. d1, ..., p1, ...). Minimum cardinality constraints (x, y, z, v) can be 0, 1 or N. Maximum cardinality constraints between the aggregate relationship and the aggregate (Z, V) can be 1 or N.
We can divide the aggregate relationships into four families according to the maximum cardinality constraint between the aggregate (A1) and the entity (E3). These families are N-N, 1-N, N-1 and 1-1. Indeed, the structure of the relations which represent the aggregate and the aggregate relationship has no influence upon the minimum cardinality constraints. Our process takes all these cases into account.
Fig. 4. Example of Aggregate Relationship and Aggregate
Fig. 5. Aggregate Relationship and Aggregate
2.3 Inverse Reference
An important factor to take into account is inverse references. An inverse reference exists between two relations when the first includes a foreign key referencing the second and vice versa. We can note that relational schemas which include inverse references represent, in the majority of cases, two aggregate relationships instead of one. Let us consider the following relational schema including an inverse reference, where the foreign key (driver) of mission corresponds to arrow 2 and the foreign key (caremp, datemp) of employee to arrow 1.

mission[carnumber#, datem#, km, (driver)]
employee[empno#, empname, (caremp, datemp)]

Fig. 6. Relational schema with one inverse reference
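As a hedged illustration (datatypes are placeholders; the paper gives only the relation schemas), declaring such an inverse reference requires one of the two foreign keys to be added after both tables exist, since each references the other:

CREATE TABLE employee (
  empno   VARCHAR2(10) PRIMARY KEY,
  empname VARCHAR2(30),
  caremp  VARCHAR2(10),
  datemp  DATE
);

CREATE TABLE mission (
  carnumber VARCHAR2(10),
  datem     DATE,
  km        NUMBER,
  driver    VARCHAR2(10) REFERENCES employee,
  PRIMARY KEY (carnumber, datem)
);

ALTER TABLE employee
  ADD FOREIGN KEY (caremp, datemp) REFERENCES mission;  -- the inverse reference

Because the schema itself does not fix how the two references interact, the data dictionary alone cannot tell whether it hides one or two aggregate relationships; this is exactly why the instances of Appendices 2 and 3 are needed.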
According to the instances in Appendix 2, this schema can represent either one aggregate relationship or two distinct aggregate relationships. The instances in Appendix 2 enable us to deduce two distinct aggregate relationships, 'AR1' and 'AR2', shown in Fig. 7. We could call 'AR1' 'occasionalpassenger' (a passenger who participates in only one mission) and 'AR2' 'driver'.
Fig. 7. Aggregate Relationships deduced from instances appendix 2
The instances in Appendix 3 enable us to deduce only one aggregate relationship, 'AR1'. We can see that data are required to infer a valid current conceptual schema.
Fig. 8. Aggregate Relationship deduced from instances appendix 3
3. Inferring Aggregate Relationships

For a given relational schema, we will see that there can exist many potential aggregate relationships. We call them potential because only the instances make it possible to infer the correct cardinality constraints. Conversely, for a given aggregate relationship, different relational schemas are possible.
3.1 Maximum Cardinality Constraints
Table 1 describes the maximum cardinality constraints of aggregate relationships derived from two relations. Only case (2) is taken into account in [5], as we said in the introduction. Case (3) is illustrated by Fig. 6; see the resulting aggregate relationships in Fig. 7 and Fig. 8. Case (3) can lead to five sub-cases: four with two aggregate relationships and one with a single aggregate relationship. We can note that the names of the aggregate relationships must be provided by human intervention.
Table 1. Maximum cardinality constraints of aggregate relationships from two relations
(maximum cardinality given as A1/E3)

(1) A1[a1#,b1#,d1...], E3[c1#,c2...,(a1,b1)]
    (a1,b1) non unique in E3   ->  N-1
    (a1,b1) unique in E3       ->  1-1

(2) A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...]
    c1 non unique in A1        ->  1-N
    c1 unique in A1            ->  1-1

(3) A1[a1#,b1#,d1..,(c1)], E3[c1#,c2...,(a1,b1)]  (inverse reference)
    Two aggregate relationships, one per arrow:
      relationship analogous to case (2):  c1 non unique in A1 -> 1-N;  c1 unique in A1 -> 1-1
      relationship analogous to case (1):  (a1,b1) non unique in E3 -> N-1;  (a1,b1) unique in E3 -> 1-1
      (the four combinations of these conditions give the four double sub-cases)
    Single aggregate relationship:
      (a1,b1,c1) are simultaneously in E3 and A1 -> 1-1
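The conditions of Table 1 are evaluated directly on the instances. A minimal sketch of such checks, with the generic relation and attribute names of the table (placeholders, not names from the running example):

-- is (a1, b1) unique in E3?  no rows returned means unique (1-1), otherwise N-1
SELECT a1, b1
FROM   E3
GROUP  BY a1, b1
HAVING COUNT(*) > 1;

-- is c1 unique in A1?  no rows returned means unique (1-1), otherwise 1-N
SELECT c1
FROM   A1
GROUP  BY c1
HAVING COUNT(*) > 1;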
Table 2 describes the maximum cardinality constraints of aggregate relationships derived from three relations. Due to space limitations we do not consider here the possible inverse references, but our process includes these cases. We can note that the name of the aggregate relationship is the name of A2. As an example of a combination of cases (5) and (3), let us consider the relation 'usualpassenger' added to the previous schema:

employee[empno#, empname, (caremp, datemp)]
mission[carnumber#, datem#, km, (driver)]
usualpassenger[(caruspg#, dateuspg#), (empuspg)#]
Fig. 9. Relational schema with inverse references
The instances of the relations 'employee' and 'mission' are given in Appendix 3, and we consider an example of instances for the relation 'usualpassenger'. We can then deduce the aggregate relationships of Fig. 10.
Table 2. Maximum cardinality constraints of aggregate relationships from three relations
(maximum cardinality given as A1/E3)

Schema with (a1,b1) in E3 and (c1) in A2:
    (a1,b1) non unique in E3   ->  N-1
    (a1,b1) unique in E3       ->  1-1
    c1 non unique in A2        ->  1-N
    c1 unique in A2            ->  1-1

(5) A2[(a1#,b1#),(c1#),p1..]:
    c1 non unique in A2 for a given (a1,b1) AND (a1,b1) non unique in A2 for a given c1  ->  N-N
    c1 unique in A2 for a given (a1,b1) AND (a1,b1) non unique in A2 for a given c1      ->  N-1
    c1 non unique in A2 for a given (a1,b1) AND (a1,b1) unique in A2 for a given c1      ->  1-N
    c1 unique in A2 for a given (a1,b1) AND (a1,b1) unique in A2 for a given c1          ->  1-1
Fig. 10. Aggregate Relationships deduced from instances appendix 3
3.2 Minimum Cardinality Constraints
Table 3 describes the minimum cardinality constraints of aggregate relationships derived from two relations. The sub-case of case (3) where the aggregate relationship is deduced from arrow 2 can be illustrated with the instances of Appendix 2; Fig. 7 shows 'AR2' as 'driver'. We can see from the instances that the minimum cardinality constraint on the side of 'E3' must be 0, because there exists a c1 (here 'driver') in 'E3' (here 'employee') that is not in 'A1' (here 'mission'): for example the employee 'A01'. According to the instances, the minimum cardinality constraint on the side of 'A1' must be 1, because there does not exist a NULL c1 (here 'driver') in 'A1' (here 'mission').
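Such conditions are the kind of checks that the final step of our process generates as SQL queries (Sect. 5). A sketch for the 'driver' example, with the attribute names of Fig. 6:

-- E3-side minimum is 0 if some employee never occurs as a driver
-- (for example employee 'A01' in Appendix 2); no rows here would give minimum 1 or N
SELECT e.empno
FROM   employee e
WHERE  NOT EXISTS (SELECT 1 FROM mission m WHERE m.driver = e.empno);

-- A1-side minimum is 0 only if a NULL driver occurs in mission; an empty result gives minimum 1
SELECT carnumber, datem
FROM   mission
WHERE  driver IS NULL;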
Table 3. Minimum cardinality constraints of aggregate relationships from two relations
(rows give the minimum cardinality and the condition on instances that yields it)

(1) A1[a1#,b1#,d1...], E3[c1#,c2...,(a1,b1)]
    E3 side:  0  It exists (a1,b1) NULL in E3
              1  It doesn't exist (a1,b1) NULL in E3
    A1 side:  0  It exists (a1,b1) in A1 which is not in E3
              1  Every (a1,b1) in A1 is also in E3
              N  (a1,b1) is neither NULL nor unique in E3

(2) A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...]
    E3 side:  0  It exists c1 in E3 and not in A1
              1  Every c1 in E3 is also in A1
              N  c1 is neither NULL nor unique in A1
    A1 side:  0  It exists c1 NULL in A1
              1  It doesn't exist c1 NULL in A1

(3) Two aggregate relationships, A1[a1#,b1#,d1..,(c1)], E3[c1#,c2...,(a1,b1)]
    relationship deduced from arrow 1:  cf. case (1)
    relationship deduced from arrow 2:  cf. case (2)

(3) Single aggregate relationship, A1[a1#,b1#,d1...,(c1)], E3[c1#,c2...,(a1,b1)]
    E3 side:  0  It exists (a1,b1) NULL in E3
              1  It doesn't exist (a1,b1) NULL in E3
    A1 side:  0  It exists c1 NULL in A1
              1  It doesn't exist c1 NULL in A1
              N  Impossible: the maximum constraint is 1
Table 4 describes the minimum cardinality constraints of aggregate relationships derived from three relations. Case (6) can be illustrated with the instances of Appendix 2 and Appendix 3. Fig. 10 shows the aggregate relationship 'usualpassenger'. We can see from the instances that the minimum cardinality constraint on the side of 'E3' must be 0, because there exists a c1 (here 'empuspg') in 'E3' (here 'employee') that is not in 'A2' (here 'usualpassenger'): for example the employee 'A04'. According to the instances, the minimum cardinality constraint on the side of 'A2' must be 1, because c1 (here 'empuspg') is unique in A2 for a given (a1,b1) (here ('caruspg', 'dateuspg')): see the last two rows, for example.
4. Process

We suppose that there are no constraints on the uniqueness of attribute names, because we consider the constraint identifier in the data dictionary instead of the name of the attribute. Though we use Oracle, the method can be adopted for other relational database management systems having a data dictionary.
Table 4. Minimum cardinality constraints of aggregate relationships from three relations
(rows give the minimum cardinality and the condition on instances that yields it)

Schema with E3[c1#,c2...,(a1,b1)]:
    0  It exists (a1,b1) NULL in E3
    1  It doesn't exist (a1,b1) NULL in E3
    0  It exists (a1,b1) in A2 which is not in E3
    1  Every (a1,b1) in A2 is also in E3
    N  (a1,b1) is neither NULL nor unique in E3
    0  It exists c1 in E3 and not in A2
    1  Every c1 in E3 is also in A2
    N  c1 is neither NULL nor unique in A2
    0  It exists c1 NULL in A2
    1  It doesn't exist c1 NULL in A2

Schema where c1 belongs to the primary key of A2:
    0  It exists c1 in E3 and not in A2
    1  Every c1 in E3 is also in A2
    N  Every c1 in E3 is also in A2 AND (a1,b1) never unique in A2 for a given c1
    0  Impossible: c1 is primary key
    1  c1 unique in A2 for a given (a1,b1)
    N  c1 never unique in A2 for a given (a1,b1)
4.1 Data Dictionary in Input

The first step of our process consists in automatically selecting the aggregate relations from the data dictionary. We use a view of the data dictionary, which we call 'cross_references', as input for extracting the relations which compose the aggregate relationships. This view enables us to join attributes and tables with their foreign and primary keys. For our running example (Appendix 4), the content of this view is described in Appendix 5.
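One plausible definition of this view, assuming it is built over Oracle's user_constraints and user_cons_columns dictionary views, is sketched below; the column names approximate those used by the queries of Appendix 6 (the column called 'constraint' there appears as constraint_id here only to avoid the reserved word).

CREATE OR REPLACE VIEW cross_references AS
SELECT c.table_name        AS relation,
       c.constraint_name   AS constraint_id,
       c.constraint_type   AS type,            -- 'P' = primary key, 'R' = foreign key
       c.r_constraint_name AS ref_constraint,  -- for a foreign key, the referenced key
       cc.column_name,
       cc.position
FROM   user_constraints  c,
       user_cons_columns cc
WHERE  c.constraint_name = cc.constraint_name
AND    c.constraint_type IN ('P', 'R');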
4.2 Extraction of Relations Composing the Aggregate Relationships
The extraction of the relations which compose the aggregate relationships is made with one query for each of the six cases of aggregate relationships we take into account in this paper (see Section 3). These queries inspect the view 'cross_references'. We can note that these queries are written once and work for every relational schema. Each query populates (or not) the table 'aggregate_results', which records the relations that compose the aggregate relationships. Due to space limitations we cannot show each of these queries. As an example, consider the query in Appendix 6, which extracts the relations composing case (6) of aggregate relationship; the fact extracted by each predicate is indicated alongside it. For our running example this query inserts only one row into the table 'aggregate_results'. This row is written in bold in Fig. 11.
4.3 Extraction of the Aggregate Relationships
The following query describes the final step of the extraction of aggregate relationships. For our running example, three of the queries have been successful. The following figure shows the final result of the extraction.

select A2, A1, E3, aggregate from aggregate_results;

A2              A1              E3              aggregate
--------------- --------------- --------------- ---------
                MISSION         EMPLOYEE        3
INSTALLATION    OWNER           SERVER          5
USUALPASSENGER  MISSION         EMPLOYEE        6
Fig. 11. Contents of the table ‘aggregate_results’
5. Conclusion and Further Research

An automatic process which improves relational database reverse engineering has been presented. This process is based on a combination of data dictionary, data schema and data instance analysis. For SQL92 relational databases (where foreign key and primary key clauses exist in the schema definition) this process is fully automated. For legacy systems (databases which do not support foreign key definitions), human intervention is required to propose potential foreign key attributes. We investigate a snapshot of the database extension taken at the beginning of the reverse engineering process to extract six cases of aggregate relationships. As our process examines data, it is true that if additional data were given, the conclusions about the cardinality constraints of n-ary relationships could be different. Though the results of our process do not tell us anything definite about the equivalent conceptual schema, they make explicit the semantics expressed in n-ary relationships. Our process can be included in a complete database reverse engineering methodology, or it can refine the results of commercial tools with reverse engineering options. Commercial tools produce a diagram taking as input either a file describing the relational schema or the database itself; the data dictionary is taken into account but not the data. The first step of our process consists in automatically extracting the n-ary relations from the data dictionary. The second step provides the inference of the current minimum and maximum cardinality constraints of n-ary relationships. Our process generates a set of SQL queries for each n-ary relation of the database studied. The final step of our process takes as input the table 'aggregate_results' and produces as output the current cardinality constraints of each aggregate relationship extracted. The cardinality constraints are inferred according to the conditions on instances described in detail in Tables 1, 2, 3 and 4. A Pro*C program will generate an adequate SQL query for each case of aggregate relationship.
References

[1] Andersson, M. Extracting an Entity Relationship Schema from a Relational Database through Reverse Engineering, in Proceedings of the 13th Int. Conference on Entity-Relationship Approach, (ed P. Loucopoulos), Springer Verlag, 881, (1994) 403-419.
[2] Batini, C., Ceri, S., Navathe, S.B. Conceptual Database Design: an Entity Relationship Approach, Benjamin Cummings, Redwood City (1992).
[3] Castellanos, M. A Methodology for Semantically Enriching Interoperable Databases, in Proceedings of the 11th British National Conference on Databases, (1993) 58-75.
[4] Chen, P.P. The Entity-Relationship Model: Towards a Unified View of Data, ACM Transactions on Database Systems, 1, 1, (Mar. 1976) 2-36.
[5] Chiang, R., Barron, T., Storey, V.C. Reverse engineering of relational databases: Extraction of an EER model from a relational database, Journal of Data and Knowledge Engineering, 12, 2, (1994) 107-142.
[6] Comyn-Wattiau, I., Akoka, J. Reverse Engineering of Relational Database Physical Schemas, in Proceedings of the 15th Int. Conference on Entity-Relationship Approach, (ed B. Thalheim), Springer Verlag, 1157, (Oct. 1996) 372-391.
[7] Davis, K.H., Arora, A.K. Converting a Relational Database Model into an Entity Relationship Model, in Proceedings of the 6th International Conference on Entity-Relationship Approach, (Nov. 1987) 243-257.
[8] Englebert, V., Henrard, J., Hick, J.M., Roland, D., Hainaut, J.L. DB-MAIN: a database oriented CASE Tool, Engineering of Information Systems, 4, 1, (1996) 87-116.
[9] Gardarin, G. Translating relational to object databases, Engineering of Information Systems, 2, 3, (1994) 317-346.
[10] Hainaut, J.L., Tonneau, C., Joris, M., Chandelon, M. Transformation-based Database Reverse Engineering, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 364-375.
[11] Johanneson, P. A method for Translating Relational Schemas into Conceptual Schemas, in Proceedings of the 10th Int. Conference on Data Engineering, (1994) 190-201.
[12] Markowitz, K.M., Makowsky, J.A. Identifying Extended Entity-Relationship Object Structures in Relational Schemas, IEEE Transactions on Software Engineering, 16, 8, (Aug. 1990) 777-790.
[13] Moulin, P., Randon, J., Savoysky, S., Spaccapietra, S., Tardieu, H., Teboul, M. Conceptual model as database design tool, Proceedings of the IFIP Working Conference on Modelling in Database Management Systems, G.M. Nijssen (ed.), North-Holland, 1976.
[14] Narasimham, B., Navathe, S.B., Jayaraman, S. On Mapping ER and Relational Models into OO Schemas, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 402-413.
[15] Navathe, S.B., Awong, H. Abstracting Relational and Hierarchical Data with a Semantic Data Model, in Proceedings of the 6th International Conference on Entity-Relationship Approach, (Nov. 1987) 277-305.
[16] Petit, J.M., Kouloumdjian, J., Boulicaut, J.F., Toumani, F. Using Queries to Improve Database Reverse Engineering, Proceedings of the 13th Int. Conference on Entity-Relationship Approach, Springer Verlag, 881, (1994) 369-386.
[17] Premerlani, W.J., Blaha, M.R. An Approach for Reverse Engineering of Relational Databases, in Proceedings of the IEEE Working Conference on Reverse Engineering, Baltimore, (Nov. 1993) 151-160.
[18] Shoval, P., Shreiber, N. Database Reverse engineering : From the Relational to the Binary Relationship Model, Journal of Data and Knowledge Engineering, 10, (1993) 293-315. [19] Signore, O., Loffredo, M., Gregori, M. Cima, M. Reconstruction of ER Schema from Database Application : a Cognitive Approach, in Proceedings of the 13th Int. Conference on Entity-Relationship Approach, Springer Verlag, 881, (1994), 387-402. [20] H.A. Smith, D.C.P. Smith, Database Abstractions : Aggregation and Generalization, ACM Transactions on Database Systems, Vol 2, N°2, pp 105-133, 1977. [21] Song, Y.I., Jones, T.H. Analysis of Binary Relationships within Ternary Relationships in ER Modeling, in Proceedings of the 12th Int. Conference on Entity-Relationship Approach, Springer Verlag, 823, (1993) 271-282. [22] Soutou, C. Extracting N-ary Relationships through Database Reverse Engineering, in Proceedings of the 15th Int. Conference on Entity-Relationship Approach, (ed B. Thahleim), Springer Verlag, 1157, (Oct. 1996) 392-405. [23] Soutou, C. Relational Database Reverse Engineering : Extraction of Cardinality Constraints, to appear in Journal of Data and Knowledge Engineering. [24] Spaccapietra, S., Parent, C. ERC+ :an Object Based Entity Relationship Approach", in Conceptual Modeling, Databases and CASE : An Integrated View of Information Systems Development, (Ed P. Loucopoulos and R. Zicari), John Wiley (1993). [25] Tardieu., H., Rochfeld, A., Colleti, R. La méthode MERISE, Les Editions d’Organisation, Paris, (1986). [26] Teorey, T.J., Yang, D., Fry, J.P. A logical design methodology for relational databases using the extended entity-relationship model, ACM Computing Surveys, 18, 12, (June 1986), 197-222. [27] Vermeer, M.W.W., Apers, P.M.G. Reverse Engineering of Relational Database Applications, in Proceedings of the 14th Int. Conference on Entity-Relationship Approach, (ed M.P. Papazoglou), Springer Verlag, 1021, (1995) 89-100.
Appendix

Appendix 1. Instances

server:
servno#    servtype
icare0     Sun Sparc 5
serviut    Pentium 100, NT 3.51
Scoserv    Pentium 200, Unix SCO
Appendix 6. Query extracting the relations for aggregate case (6)

insert into aggregate_results
select distinct NULL, c0.relation, c2.relation, c4.relation, NULL, NULL, '6'
from cross_references c0, cross_references c2,
     cross_references c3, cross_references c4
where c0.constraint in                       -- A2[a1#,b1#,c1#... : A2 has a three-column primary key
      (select c1.constraint from cross_references c1
       where c1.type = 'P'
       group by c1.constraint having count(*) = 3)
  and c0.relation = c3.relation
  and exists                                 -- A2[(a1,b1),.. A1[... : a two-column foreign key of A2
      (select c1.constraint from cross_references c1
       where c1.type = 'R' and c1.relation = c0.relation
         and c1.constraint = c3.constraint
       group by c1.ref_constraint having count(*) = 2)
  and c3.ref_constraint = c2.constraint      -- c2 refers to A1
  and exists                                 -- A1[a1#,b1#,... : A1 has a two-column primary key
      (select c1.constraint from cross_references c1
       where c1.constraint = c2.constraint and c1.type = 'P'
       group by c1.constraint having count(*) = 2)
  and c4.constraint =                        -- c4 refers to E3
      (select c1.ref_constraint from cross_references c1
       where c1.relation = c0.relation and c1.type = 'R'
       group by c1.ref_constraint having count(*) = 1)
  and not exists                             -- E3[c1#,... : E3 has a single-column primary key
      (select c1.constraint from cross_references c1
       where c1.relation = c4.relation and c1.type = 'P'
       group by c1.constraint having count(*) > 1)
  and exists                                 -- A2[..., (c1)... E3[c1#,... : a foreign key of A2 references E3
      (select c3.constraint from cross_references c3
       where c3.relation = c0.relation and c3.type = 'R'
         and c3.ref_constraint = c4.constraint);
On the Consistency of Int-cardinality Constraints

Sven Hartmann

FB Mathematik, Universität Rostock, 18051 Rostock, Germany
Abstract. In the entity-relationship model, cardinality constraints are frequently used to specify dependencies between entities and relationships. They impose lower and upper bounds on the cardinality of relationships an instance of a fixed type may participate in. However, for certain applications it is not enough to prescribe only bounds; it is necessary to specify the exact set of permitted cardinalities. This leads to the concept of int-cardinality constraints as proposed by Thalheim [14]. Different from ordinary cardinality constraints, this concept allows gaps in the sets of permitted cardinalities. Our objective is to investigate the consistency of a set of int-cardinality constraints for a database scheme, i.e. the question whether there exists a fully-populated database instance satisfying all the given int-cardinality constraints.
1 Introduction
In database design, great attention is devoted to the modeling of semantics. If we consider a database as a set of tuples over certain domain values, then semantics are usually given by integrity constraints. They specify the way in which data are associated with each other. Hence, integrity constraints help us to decide whether a database is meaningful for an application or not. During the last few decades, plenty of different classes of integrity constraints have been discussed and actually used in database design. There are several books and monographs which give an overview of semantics in databases (cf. [10,11,13]). A general approach towards integrity constraints is developed in [14], and uses extensions from [1]. Within this paper, we use the entity-relationship model (ERM) to express database schemes. In this approach, cardinality constraints are among the most popular classes of integrity constraints. They impose lower and upper bounds on the number of relationships an entity of a given type may be involved in. Thus, cardinality constraints limit the possible structure of a database. This makes cardinality constraints a very powerful class of constraints. However, for certain applications cardinality constraints are not even powerful enough to allow a straightforward specification of the desired semantics. To illustrate this, we present the following small example.
Example. A new travel agency is going to organize sight-seeing tours through Europe. Each of the offered tours visits a number of Europe’s most popular cities. Using the entity-relationship approach, the itineraries of the tours are planned according to the database scheme in Fig. 1 containing two entity types (tour and city) as well as two relationship types (visits and starts). Relationships of type starts determine in which city a certain tour starts, relationships of type visits specify that a certain city is visited during a given tour.
Fig. 1. Entity-relationship diagram for the travel agency in the example.
Obviously, each tour has exactly one starting point. Further, the management decided that every tour visits 3 or 4 cities. In addition, from time to time the organizers want to offer special tours visiting 7 cities to attract new customers. However, demand and financial limitations have to be taken into consideration. Hence, every week 2 or 3 tours shall start in each of the cities in the catalogue; only during the long vacations does the management intend to offer 5 or 6 tours per week. Finally, every city shall be visited by 1 to 3 tours per week. However, the organizers would also accept 6 tours visiting certain cities, since they are then able to negotiate better rates with the hotels.

Modeling the specified restrictions with the help of ordinary cardinality constraints would obviously cause some trouble. For example, each city is allowed to be involved in 1, 2, 3 or 6 relationships of the type visits, but not in 4 or 5. Hence, it is not enough to give lower and upper bounds for the number of participations; the complete list of permitted values has to be specified. This leads to a generalization of the ordinary concept of cardinality constraints, namely to int-cardinality constraints as proposed by Thalheim [12]. A formal definition of these constraints will be given in the sequel. Of course, our travel agency wants to know first of all whether it is possible to construct a database instance, i.e. a catalogue of tours, satisfying all the specified rules. Reasoning about such integrity constraints belongs to the fundamental tasks in database design. When reasoning about a set of constraints, one is frequently interested in whether this set is consistent and whether it implies further constraints. Given a database scheme, a set of integrity constraints defined on
the scheme is said to be consistent iff it admits at least one fully-populated database satisfying all these constraints. Obviously, consistency is a basic requirement for the correctness of the chosen scheme, i.e. representation of the modeled real world. The question whether a set of ordinary cardinality constraints is consistent has been considered e.g. in [9,14]. Our objective is to investigate this problem for the larger class of int-cardinality constraints. This paper is organized as follows. In Sect. 2, we briefly describe the data model to be used. All our considerations are carried out in the entity-relationship model (ERM) of Chen. In Sect. 3, we give a formal definition of int-cardinality constraints. How to check the consistency of a set of such constraints is studied in Sect. 4. Finally in Sects. 5 and 6, we will discuss two variations of the usual idea of consistency for int-cardinality constraints. Our results can be exploited to detect dummy values in int-cardinality constraints and for scheme correction as suggested by Thalheim [14].
2 The Data Model
In the sequel, we will use a particular data model, namely the entity-relationship model (ERM) which goes back to Chen [2]. This approach is based on a simple graphical representation. Using the diagram technique, even complex schemes can be understood and handled. The entity-relationship model has been so successful that it is used at present as a standard tool in conceptual database design and, in addition, in several other branches of computer science (cf. [3]). Let us briefly introduce the basic concepts of the ERM.

Let E be a non-empty, finite set. In the context of our approach, the elements of E are called entity types. With each entity type e we associate a finite set e^t called the domain or population of the type e (at moment t). The members of e^t are entity instances or entities, for short. Intuitively, entities can be seen as real-world objects which are of interest for some application or purpose. By classifying them and specifying their significant properties (attributes), we obtain entity types which are frequently used to model the objects in their domains.

A relationship type r is a sequence (e_1, ..., e_k) of elements from E. Relationship types are used to model associations between real-world objects, i.e. entities. A relationship or instance of type r is an element of the cartesian product e_1^t × ... × e_k^t, where e_1^t, ..., e_k^t are the domains of e_1, ..., e_k, respectively, at moment t. A finite set r^t of such relationships forms the population of r at moment t. The relationship types considered so far are often called relationship types of order 1 (cf. [14]). Analogously, relationship types of higher order may be defined hierarchically.
Suppose now we are given entity and/or relationship types of order less than i > 0. A sequence (q_1, ..., q_k) of them forms a relationship type r of order i. As above, we define relationships of type r as elements of the cartesian product q_1^t × ... × q_k^t for a given moment t. In a relationship type r = (q_1, ..., q_k), each of the entity or relationship types q_1, ..., q_k is said to be a component type of r. Additionally, each of the pairs (r, q_j) is called a link.

Let S = {r_1, ..., r_n} be a set of entity and relationship types such that, with each relationship type r, all its component types belong to S, too. Then S is called a database scheme. Replacing each type q in S by its population q^t at moment t, we obtain a database or instance S^t of S. A database S^t is fully-populated iff none of the sets q^t is empty. In the sequel, we are only interested in fully-populated databases.

From the graph-theoretical point of view, a database scheme S can be considered as a finite digraph ERD = (S, L) with vertex set S and a multiset L of arcs. In ERD there shall be an arc from r to e whenever e is a component type of the relationship type r. Hence, the arcs in ERD are just the links in the database scheme. The digraph ERD is also called the entity-relationship diagram of S. Usually, the entity types are represented graphically by rectangles, the relationship types by diamonds. Figure 1 shows the diagram of the database scheme used by our travel agency. It contains two entity types (tour, city) and two binary relationship types (starts, visits). However, in more complex applications one will probably find relationship types with more than two component types, too.
3 Int-cardinality Constraints
Now we are ready to give a formal definition of int-cardinality constraints. For a relationship type r in S and a component type e of r, let D(r, e) be a given set of nonnegative integers. The int-cardinality constraint comp(r, e) = D(r, e) specifies that in each database state the number of instances of type r that an instance e of type e is involved in belongs to D(r, e). Therefore, whenever an instance e of type e participates in exactly d relationships of type r, then d should be a member of the set D(r, e) of permitted cardinalities. Hence, comp(r, e) = D(r, e) holds iff for all database states S^t and all e ∈ e^t we have |{r ∈ r^t : r(e) = e}| ∈ D(r, e), where r(e) denotes the restriction of the relationship r to the component type e. Note that an int-cardinality constraint is an ordinary cardinality constraint, too, iff the set D(r, e) is an interval, i.e. a set of consecutive integers. The difference
of both concepts is that we now allow gaps in the set D(r, e) of permitted cardinalities. If no int-cardinality constraint is defined for a link (r, e), we may assume comp(r, e) = {0} ∪ N, where N denotes the set {1, 2, ...} of positive integers. It is easy to see that this does not represent a real constraint, but is more or less a technical agreement. In our example from Sect. 1, the specified requirements can be expressed with the help of the following int-cardinality constraints:

comp(starts, tour) = {1},
comp(starts, city) = {2, 3, 5, 6},
comp(visits, tour) = {3, 4, 7},
comp(visits, city) = {1, 2, 3, 6}.

Only the first constraint may be interpreted as an ordinary cardinality constraint. All the other sets on the right-hand side have gaps.

Let C be a set of int-cardinality constraints defined for each link in the given database scheme S. Every instance of S satisfying all the int-cardinality constraints in C is said to be legal. It is easy to check that the empty database is always legal. However, the travel agency is of course not interested in catalogues without cities or tours. For the same reason, we are looking only for fully-populated instances of S. By SAT(S, C) we denote the set of fully-populated legal database instances of S. The given set C of int-cardinality constraints is consistent iff it admits at least one fully-populated legal instance of S.
Fig. 2. Entity-relationship diagram with labels for the int-cardinality constraints.
Ordinary cardinality constraints are often reflected graphically in entity-relationship diagrams. For int-cardinality constraints this seems to be somewhat more difficult, in particular if the sets D(r, e) are large and have lots of gaps. Nevertheless, we propose to label the link (r, e) by the set D(r, e) when comp(r, e) = D(r, e) is given. For our example, the entity-relationship diagram together with these labels is shown in Fig. 2.
4 Consistent Sets of Int-cardinality Constraints
To characterize consistent sets C of int-cardinality constraints, we propose to use suitable systems of linear diophantine equations. These systems are chosen in such a way that the consistency of C is equivalent to the existence of an integral solution to the associated systems. Assume C admits a legal database instance S^t of S. The number of instances of a type q in the database scheme shall be denoted by g(q). Obviously, these numbers have to meet strict requirements:

Fact 1. Let S be a database scheme and C be a set of int-cardinality constraints defined on S. Then C is consistent iff there exists a function g : S → N such that for every link (r, e) ∈ L there are nonnegative integers x_d, d ∈ D(r, e), with
\[
\sum_{d \in D(r,e)} x_d = g(e), \qquad \sum_{d \in D(r,e)} d\,x_d = g(r). \qquad (1)
\]

Remark. For proofs of the results presented in this paper we refer to [8].

The question arises whether it is possible to find a function g with the properties claimed in Fact 1. The following observation gives a first answer to this question.

Fact 2. Let (r, e) ∈ L be a link, and let g(e) as well as g(r) be positive integers. There exist nonnegative rational values x_d, d ∈ D(r, e), satisfying (1) iff
\[
\min D(r,e) \;\le\; \frac{g(r)}{g(e)} \;\le\; \max D(r,e) \qquad (2)
\]
holds.

Although Fact 2 does not explicitly guarantee the existence of an integral solution to system (1), it can easily be exploited to ensure such a solution. Combining Facts 1 and 2 we finally obtain a new characterization of admissible functions g which is somewhat easier to handle.

Fact 3. Let S be a database scheme and C a set of int-cardinality constraints defined on S. Then C is consistent iff there exists a function g : S → N such that (2) holds for every link (r, e) ∈ L.

Of course, the main problem is to find a function g that satisfies the inequalities (2) simultaneously for all the links (r, e). In the sequel we shall use shortest-path methods in suitable digraphs for this purpose. Let G = (S, L ∪ L^{-1}) be the symmetric digraph which we obtain from the entity-relationship diagram ERD = (S, L) by adding to each link L = (r, e) its reverse
L^{-1} = (e, r). In the sequel, we use the term link only for the elements of L and arc for an element of L ∪ L^{-1}. On the arcs of G we define a weight function w : L ∪ L^{-1} → Q ∪ {∞} by
\[
w(L) =
\begin{cases}
\infty & \text{if } 0 \in D(r,e),\\
\dfrac{1}{\min D(r,e)} & \text{otherwise},
\end{cases}
\qquad\text{and}\qquad
w(L^{-1}) = \max D(r,e), \qquad (3)
\]
where L is the original link (r, e) ∈ L and L^{-1} is its reverse. Special interest is devoted to directed cycles. A directed cycle Z is a sequence of consecutive arcs A_1, ..., A_k in the digraph. It is said to be critical whenever its weight w(Z) = w(A_1) · · · w(A_k) is less than 1.
Fig. 3. The digraph G associated to the ERD from our example.
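For reference, the arc labels shown in Fig. 3 follow directly from (3) and the constraints given in Sect. 3 (the values below are computed here from that definition):
\[
\begin{aligned}
&w(\mathit{visits},\mathit{tour}) = \tfrac{1}{3}, &\quad &w\bigl((\mathit{visits},\mathit{tour})^{-1}\bigr) = 7,\\
&w(\mathit{visits},\mathit{city}) = 1, &\quad &w\bigl((\mathit{visits},\mathit{city})^{-1}\bigr) = 6,\\
&w(\mathit{starts},\mathit{tour}) = 1, &\quad &w\bigl((\mathit{starts},\mathit{tour})^{-1}\bigr) = 1,\\
&w(\mathit{starts},\mathit{city}) = \tfrac{1}{2}, &\quad &w\bigl((\mathit{starts},\mathit{city})^{-1}\bigr) = 6.
\end{aligned}
\]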
Figure 3 shows the digraph G obtained from the entity-relationship diagram ERD for our travel agency. Here, we have labeled the arcs by their weights according to (3). For an arbitrary link L = (r, e) ∈ L of the database scheme, let g(e) and g(r) be given positive integers. By the definition of the weight function w, inequality (2) holds iff we have both
\[
\frac{g(e)}{g(r)} \le w(L) \qquad\text{and}\qquad \frac{g(r)}{g(e)} \le w(L^{-1}).
\]
Thus, to decide the consistency of a set of int-cardinality constraints, we are looking for a function g : S → N defined on the vertex set of the digraph G such that
\[
\frac{g(v)}{g(u)} \le w(A) \qquad (4)
\]
holds for every arc A = (u, v) in G. Note that the arc A might be either a link or its reverse. Functions satisfying (4) have already been used when reasoning about sets of ordinary cardinality constraints (cf. [6]). We shall call them feasible or admissible. Admissible functions and their relation to database design have been considered
in [4,6]. In particular, admissible functions exist iff there is no critical cycle in G. For further properties of admissible functions, we refer to [6].

Theorem 4. Let S be a database scheme and C a set of int-cardinality constraints defined on S. Then C is consistent iff the digraph G has no critical cycle.

Sketch of the proof. As pointed out above, a function g satisfies (2) for every link L = (r, e) of the database scheme iff it is admissible with respect to the weight function (3). On the other hand, as proved in [6], there exists such a function g iff the digraph G admits no critical cycle. Hence, the claim follows by Fact 3. ⊓⊔

In [6] a polynomial-time algorithm is proposed to construct admissible functions using shortest-path methods (namely a variation of the well-known Bellman-Ford algorithm). Therefore, the question whether a set C of int-cardinality constraints is consistent or not can be decided in polynomial time. The statement of Theorem 4 is especially remarkable, since exactly the same claim holds for ordinary cardinality constraints, too (see e.g. [9,6,14]).

Obviously, a database instance satisfying an int-cardinality constraint comp(r, e) = D(r, e) also meets the relaxed cardinality constraint comp(r, e) = (a, b) = {a, a + 1, ..., b}, where a and b denote min D(r, e) and max D(r, e), respectively. Of course, the converse is usually not true. Nevertheless, replacing the weaker cardinality constraint by the stronger int-cardinality constraint does not affect the consistency of the constraints. The set C of int-cardinality constraints in our example for the travel agency is consistent iff the relaxed set C′ of ordinary cardinality constraints

comp(starts, tour) = (1, 1) = {1},
comp(starts, city) = (2, 6) = {2, ..., 6},
comp(visits, tour) = (3, 7) = {3, ..., 7},
comp(visits, city) = (1, 6) = {1, ..., 6}

is consistent, too. This observation is rather astonishing, since SAT(S, C) is usually a proper subset of SAT(S, C′). However, if SAT(S, C′) is empty, then SAT(S, C) is as well. We record this result as

Corollary 5. Let S be a database scheme. A set C of int-cardinality constraints defined on S is consistent iff the relaxed set C′ of ordinary cardinality constraints is consistent, too.
5 Restricted Consistency
It is easy to check that the digraph G in Fig. 3 contains no critical cycle. Hence, the set of int-cardinality constraints in our example is consistent. This enables us to construct a fully-populated legal database instance for the scheme in Fig. 2.

However, for practical purposes it is often not enough to prove the mere existence of legal databases. The travel agency in our example would surely not be interested in databases with hundreds of thousands of cities or tours. Due to economic limitations the number of entity and relationship instances must be bounded from above. Thus, the question arises whether there exists a fully-populated legal database of reasonable size. Suppose we are given upper bounds N(q) for the numbers of instances of the types q in the database scheme. Again, an upper bound ∞ does not express a real constraint. We call a set C of int-cardinality constraints restricted consistent if there exists a fully-populated legal database instance S^t of S such that for every type q the number g(q) of its instances is bounded by N(q). Unfortunately, it happens to be difficult to decide this question, as we shall see in the sequel.

Theorem 6. Let S be a database scheme and C a set of int-cardinality constraints defined on S. It is NP-complete to decide whether C is restricted consistent (with respect to given bounds N(q) ∈ N ∪ {∞}).

In order to see this, one has to verify that the following decision problem is NP-complete:

Restricted consistency. Does there exist a function g : S → N such that there is a nonnegative integral solution to the system (1) for every link (r, e) ∈ L and such that g(q) ≤ N(q) holds for every type q ∈ S, where N(q) is some integer (from the input)?

Clearly, the problem belongs to NP, since for a guessed function g and integers x_d, d ∈ D(r, e) and (r, e) ∈ L, the equalities in the systems (1) can be tested in polynomial time. The proof of the NP-completeness uses a reduction from Integer Knapsack, which is well known to be NP-complete. Theorem 6 shows that asking for restricted consistency may result in a considerable increase of the complexity of the appropriate decision problem. In general, solvers of integer linear programming problems seem to be the only way to tackle the Restricted consistency problem.
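Written out purely as a restatement of the problem above, restricted consistency asks for an integral solution of the following feasibility system (one block of equations per link):
\[
\begin{aligned}
&\sum_{d \in D(r,e)} x^{(r,e)}_d = g(e), \qquad \sum_{d \in D(r,e)} d\,x^{(r,e)}_d = g(r) \qquad \text{for every link } (r,e) \in L,\\
&1 \le g(q) \le N(q) \ \text{ for every type } q \in S, \qquad x^{(r,e)}_d \in \mathbb{N}_0,\quad g(q) \in \mathbb{N},
\end{aligned}
\]
which is the form an integer linear programming solver would be given.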
6 Strong Consistency
Figure 4 shows a catalogue of tours offered by our travel agency this week. It provides a fully-populated legal database instance to the scheme in Fig. 2.
tour  starts in  visits
1     Paris      Prague, Vienna, Rome
2     Paris      Rome, London, Geneve
3     Vienna     Prague, Rome, Paris
4     Vienna     London, Geneve, Prague
5     Prague     Vienna, Rome, London
6     Prague     Geneve, Vienna, Paris
7     Rome       Prague, Vienna, Geneve
8     Rome       London, Geneve, Paris
9     London     Paris, Vienna, Rome
10    London     Geneve, Prague, Paris
11    Geneve     Prague, Rome, London
12    Geneve     Paris, London, Vienna

Fig. 4. A catalogue of tours offered by the travel agency in our example.
Clearly, this week every tour visits exactly 3 cities. But, a customer might wish to book a tour visiting 7 cities as promised by the travel agency. Should he wait to book until next week? Will there ever be a tour visiting 7 cities? As we shall see shortly, the answer is no. The chosen int-cardinality constraints do not allow the construction of databases with tours visiting more than 3 cities. This is of course an unpleasant fact, in particular for the customer. However, there is no way out unless the management of the travel agency changes its policy, i.e. the chosen set of int-cardinality constraints.

In our example, the permitted cardinalities 4 and 7 in the set D(visits, tour) are dummy values. They might be deleted without affecting the set SAT(S, C) of fully-populated legal databases. The problem is of course how to find those cardinalities which will never occur in a legal database instance.

Let S be a database scheme and (r, e) a link of S. We call a consistent set C of int-cardinality constraints strongly consistent iff for every value d in the given set D(r, e) there exists a fully-populated legal database S^t containing an instance e of type e which participates in exactly d relationships of type r, i.e. satisfies |{r ∈ r^t : r(e) = e}| = d. We call this database S^t a certificate for the value d in D(r, e). Hence, in a strongly consistent set C of int-cardinality constraints none of the sets D(r, e) contains dummy values. In this section we shall show how to detect such dummy cardinalities and, consequently, how to check the strong consistency of a given set C.

It is easy to see that the union of two legal database instances is again legal. Hence, if all the sets D(r, e) in C are finite, then C is strongly consistent iff there exists a fully-populated database S^t which is a certificate for all values d ∈ D(r, e) and all links (r, e): just choose a certificate for each of these values d and join them all. This provides again a legal database, which is the claimed common certificate. If there is an infinite set D(r, e), too, the argument has to be slightly changed. Obviously, a database which forms a certificate for all values d in D(r, e) would
be infinite. However, for any choice of a finite subset D⁺(r, e) ⊂ D(r, e) the last observation remains true. From Fact 1, we obtain

Fact 7. Let S be a database scheme and C be a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff there exists a function g : S → N such that for every link (r, e) ∈ L there are nonnegative integers x_d, d ∈ D(r, e), satisfying (1) such that x_d is positive for every

    d ∈ D(r, e)     if D(r, e) is finite,
    d ∈ D⁺(r, e)    if D(r, e) is infinite,

where D⁺(r, e) is an arbitrary finite subset of D(r, e).

As in Sect. 4, this result can be used to prove the following observation.

Fact 8. Let (r, e) ∈ L be a link with a finite set D(r, e) of permitted cardinalities. Further, let g(e) and g(r) be positive integers. There exists a positive rational solution to system (1) iff

    min D(r, e) < g(r)/g(e) < max D(r, e)    (5)

holds, or D(r, e) is of size one and we have g(r)/g(e) ∈ D(r, e).
Combining Facts 7 and 8, we obtain a characterization of strong consistency.

Fact 9. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff there exists a function g : S → N such that (5) holds for every link (r, e) ∈ L whose set D(r, e) of permitted cardinalities is of size at least two.

Recall the digraph G = (S, L ∪ L⁻¹) introduced in Sect. 4. A directed cycle Z in G is said to be subcritical if its weight w(Z) equals 1. In the sequel, we will show how to use subcritical cycles to check strong consistency. An arc A = (u, v) lies on such a subcritical cycle Z iff

    w(A) = g(v)/g(u)    (6)

holds for every admissible function g. This helps us to derive the following consequences.

Fact 10. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Further, let L = (r, e) ∈ L be a link of S. If the link L itself lies on a subcritical cycle in G, then all permitted cardinalities different from min D(r, e) are dummy values. If the reverse arc L⁻¹ of the link lies on a subcritical cycle in G, then all permitted cardinalities different from max D(r, e) are dummy values.
The digraph G obtained from the entity-relationship diagram of the database scheme for our travel agency contains two directed cycles. One of them is subcritical as Fig. 5 shows. On this cycle, we find for example the link (visits, tour). According to Fact 10, we may in particular conclude that every tour visits exactly 3 cities. The values 4 and 7 in comp(visits, tour) are dummy values.
Fig. 5. A subcritical cycle in the digraph G from our example.
Theorem 11. Let S be a database scheme and C a consistent set of int-cardinality constraints defined on S. Then C is strongly consistent iff the set D(r, e) of permitted cardinalities for a link (r, e) ∈ L is of size one whenever the link itself or its reverse arc lies on a subcritical cycle in G.

Sketch of the proof. The necessity of the claim immediately follows from Fact 10. It remains to verify the sufficiency. As mentioned above, there exists an admissible function g such that (6) holds for an arc A = (u, v) iff A lies on a subcritical cycle in G. This function g satisfies the preconditions of Fact 9. ⊓⊔

Whenever a consistent set C of int-cardinality constraints is not strongly consistent, then there must be dummy values in at least one of the sets D(r, e) of permitted cardinalities. In [12], Thalheim suggests deleting these values from the sets D(r, e). This process is also called scheme correction. It reduces the amount of information necessary to describe the legal databases: There will never be a database state using any of the dummy values. Hence, it is unnecessary to store these cardinalities.
7 Algorithmic Aspects
Our investigations in Sects. 4 and 6 provide characterizations of consistent and strongly consistent sets of int-cardinality constraints, respectively. As claimed, both properties can be tested in polynomial time by applying methods from combinatorial optimization. According to Theorem 4, C is consistent iff the digraph G contains no critical cycles with respect to the weight function (3). The existence of critical cycles,
i.e. directed cycles of weight smaller than 1, can be tested using shortest-path methods. In [8], we present a variation of the well-known Floyd-Warshall algorithm (cf. [5]) to decide the existence of critical cycles. Its complexity is cubic in the size of the database scheme S. If a consistent set C of int-cardinality constraints is not strongly consistent, then a slight modification of this algorithm can be used to delete dummy values from the sets D(r, e) of permitted cardinalities.

When deleting all dummy values in our example according to Fact 10, we obtain the int-cardinality constraints

comp(starts, tour) = {1}, comp(starts, city) = {2},
comp(visits, tour) = {3}, comp(visits, city) = {6}.

Hence, it is no longer surprising that in the database in Fig. 4 every tour visits exactly 3 cities. Due to Theorem 11 this is a consequence of the chosen set of constraints. Thus, scheme corrections result in a considerable decrease of the complexity of information.
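The algorithm of [8] is not reproduced here, but the underlying idea can be sketched: on logarithmic weights a critical cycle becomes a negative cycle and a subcritical cycle a zero-weight cycle, so an all-pairs shortest-path computation (Floyd-Warshall) can both test consistency and, via Fact 10, prune dummy values. The weight assignment below is the same assumption as before (max D on the arc from the entity type to the relationship type, 1/min D on the reverse arc), and the D sets are illustrative; this is not the variation presented in [8] itself.

```python
import math

def all_pairs_log_shortest(nodes, arcs):
    """Floyd-Warshall on log-weights.  arcs: {(u, v): w} with w > 0."""
    INF = float("inf")
    dist = {(u, v): (0.0 if u == v else INF) for u in nodes for v in nodes}
    for (u, v), w in arcs.items():
        dist[u, v] = min(dist[u, v], math.log(w))
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return dist

def prune_dummy_values(nodes, arcs, D):
    """Assumes consistency (no critical cycle).  An arc (u, v) lies on a
    subcritical (weight-1) cycle iff log w(u, v) + shortest(v -> u) == 0;
    Fact 10 then keeps only min (link on such a cycle) or max (reverse arc)."""
    dist = all_pairs_log_shortest(nodes, arcs)
    eps = 1e-9
    corrected = {}
    for (r, e), values in D.items():
        vals = set(values)
        if abs(math.log(arcs[r, e]) + dist[e, r]) < eps:   # link itself on a subcritical cycle
            vals = {min(values)}
        if abs(math.log(arcs[e, r]) + dist[r, e]) < eps:   # reverse arc on a subcritical cycle
            vals = {max(values)}
        corrected[r, e] = vals
    return corrected

# Travel-agency example with the assumed weight assignment and hypothetical D sets.
nodes = ["tour", "city", "starts", "visits"]
arcs = {("starts", "tour"): 1.0, ("tour", "starts"): 1.0,
        ("starts", "city"): 0.5, ("city", "starts"): 6.0,
        ("visits", "tour"): 1 / 3, ("tour", "visits"): 7.0,
        ("visits", "city"): 1.0, ("city", "visits"): 6.0}
D = {("starts", "tour"): {1}, ("starts", "city"): {2, 3, 4, 5, 6},
     ("visits", "tour"): {3, 4, 7}, ("visits", "city"): {1, 2, 3, 4, 5, 6}}
print(prune_dummy_values(nodes, arcs, D))
```

On these data the permitted sets collapse to {1}, {2}, {3} and {6}, in line with the corrected constraints quoted above.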
8 Conclusions
In this paper, we have developed a theory for int-cardinality constraints which generalize the well-known concept of cardinality constraints. We have shown that significant properties of sets of int-cardinality constraints can be recognized with methods from integer and combinatorial optimization. In particular, we proved that the consistency of such sets can be checked by looking for critical cycles in an associated digraph. It is worth mentioning that a set of int-cardinality constraints is consistent whenever the corresponding relaxed set of ordinary cardinality constraints is consistent, too. Hence, gaps in the sets of permitted cardinalities do not affect the consistency of those constraints. This is of special interest for database designers as int-cardinality constraints often allow a straightforward modeling of semantics, even in more involved database applications.

In addition, we introduced two variations of the consistency problem, namely restricted consistency and strong consistency. For practical reasons it is often necessary to ensure the existence of legal databases with a bounded number of entities and relationships. However, this small modification of the original question leads to a dramatic increase of the complexity of such a problem. Checking the strong consistency of int-cardinality constraints allows us to detect dummy values in the sets of permitted cardinalities. To give a realistic impression of the structure of legal databases, these values should be deleted via scheme corrections. Again, this can be managed within polynomial time.
References

1. A.P. Buchmann, R.S. Carrera and M.A. Vazquez-Galindo, A generalized constraint and exception handler for an object-oriented CAD-DBMS, IEEE conf. (1986) 38-49.
2. P. Chen, The Entity-Relationship Model: Towards a unified view of data, ACM TODS 1,1 (1976) 9-36.
3. P. Chen and H.-D. Knoell, Der Entity-Relationship-Ansatz zum logischen Systementwurf (BI-Wissenschaftsverlag, Mannheim, 1991).
4. K. Engel and S. Hartmann, Constructing realizers of semantic entity relationship schemes, Preprint 95/3, Universität Rostock (1995).
5. M. Gondran and M. Minoux, Graphs and algorithms (Wiley, New York, 1984).
6. S. Hartmann, Graph-theoretic methods to construct entity-relationship databases, in: M. Nagl (ed.), Graph-theoretic concepts in computer science, LNCS 1017 (Springer, Berlin, 1995) 131-145.
7. S. Hartmann, Über die Charakterisierung und Konstruktion von Entity-Relationship-Datenbanken mit Kardinalitätsbedingungen, Ph.D. thesis, Universität Rostock (1996).
8. S. Hartmann, Int-cardinality constraints in data modeling, Preprint, Universität Rostock (1998).
9. M. Lenzerini and P. Nobili, On the satisfiability of dependency constraints in Entity-Relationship schemata, Information Systems 15 (1990) 453-461.
10. D. Maier, The theory of relational databases (Computer Science Press, Rockville/MD, 1983).
11. J. Paredaens, P. de Bra, M. Gyssens and D. van Gucht, The structure of the relational database model (Springer, Berlin, 1989).
12. B. Thalheim, Fundamentals of cardinality constraints, in: G. Pernul and A.M. Tjoa (eds.), Entity-relationship approach, LNCS 645 (Springer, Berlin, 1992) 7-23.
13. B. Thalheim, A survey on Database Constraints, Reihe Informatik I-8, Universität Cottbus (1994).
14. B. Thalheim, Fundamentals of Entity-Relationship Models (Springer, Berlin, 1997).
Realizing Next Generation Internet Applications: Are There Genuine Research Problems, or Is It Advanced Product Development? Chairpersons: Kamalakar Karlapalem (HKUST) and Qing Li (CUHK) Panelists: Dik Lee (HKUST), Mukesh Mohania (University of South Australia), and John Mylopoulos (University of Toronto/CUHK)
The aim of this panel is to be both educational, in providing pointers to characteristics of the Next Generation Internet Applications (NGIA), and a forum for debating the relevance of these characteristics in generating new research directions. Over the last three years the Internet-driven applications industry has seen one of the highest growth rates. Almost every week new "start-up" companies are being set up, and new applications are released into the market. This growth rate is going to continue well into the next millennium to cater to NGIA. One of the critical aspects of these new applications is the development time from conceptualization to the release of the product. This could be as short as a weekend. The panel will discuss the role of academic researchers in this high-paced application development environment.
The NGIA range from content providers (multimedia, push/pull scenarios) and electronic commerce (transactions, workflow) to Internet-based virtual database systems. An overview of these applications will be the starting point of this debate. The debate will concentrate on distinguishing between research issues and development issues in each of the following aspects (though not limited to only these):
– Role of data semantics in realizing NGIA. Do we need more modeling power than what we have now? Even if we do, will we use it?
– Role of design methodologies in NGIA. We need efficient and fast design methodologies. And we need design methodologies that generate efficient and lean application programs. Are there any research issues here?
– Role of component oriented application deployment. The end-users may just buy different highly efficient functional application components, and assemble them to deploy new applications. How does this change the way applications are designed and developed?
– Role of integrating appliances into the realm of user applications. There is a need for external software that manages persistent data in the appliances and smart cards. What kind of system development issues arise?
– Role of standard middle-ware in NGIA. Will this help? Are there any research issues (meta-data,?) that come up in developing this middle-ware?
Web Sites Need Models and Schemes

Paolo Atzeni
Dipartimento di Informatica e Automazione, Università di Roma Tre
Via della Vasca Navale, 79, 00146 Roma, Italy
http://www.dia.uniroma3.it/~atzeni/
[email protected]
The World Wide Web is likely to become the standard platform for future generation applications. Specifically, it will be the uniform interface for sharing data in networks, both Internet and intranet. Referring mainly to "data-intensive" Web sites, which have the publication of information and data as their main goal, we can say that, in most cases, they do not satisfy the users' needs: the information kept is poorly organized and difficult to access; also it is often out-of-date, because of obsolete content and broken links. In general, this is a consequence of difficulties in the management of the site, both in terms of maintaining the structure and of updating the information. Many Web sites exist that have essentially been abandoned.

We believe that this situation is caused by the absence of a sound methodological foundation, as opposed to what is now standard for traditional information systems. In fact, Web sites are complex systems, and, in the same way as every other complex system, they need to be developed in a disciplined way, organized in phases, each devoted to a specific aspect of the system. In a Web site, there are at least three components: the information to be published (possibly kept in a database); the hypertextual structure (describing pages and access paths); the presentation (the graphical layout of pages).

It is widely accepted that data are described by means of models and schemes, at various levels, for example conceptual and logical. In a data-intensive Web site, it is common to have large sets of pages that contain data with the same structure (coming from tuples of the same relation, if there is a database): therefore, we argue for the relevance of the notions of model and scheme also for the description of hypertexts. Given the various facets that can arise, we also believe that hypertexts have, in the same way as data, both a conceptual and a logical level.

These issues have led us to develop a methodology (Atzeni et al. [1]) that is based on a clear separation among three well distinguished and yet tightly interconnected design tasks: the database design, the hypertext design, and the presentation design. Both database and hypertext design have a conceptual phase followed by a logical one. Figure 1 shows the phases, the precedences among them, and their major products (schemes according to appropriate models).
Fig. 1. The Araneus Design Methodology (the final phase, 6. Hypertext to DB Mapping and Page Generation, produces the Web site in HTML).
The originality of the methodology is in the conceptual and logical design of hypertexts, which make use of specific models, developed in this framework: – ncm, the Navigation Conceptual Model , a conceptual model for hypertexts, which is essentially a variation of the er Model suitable to describe hypertextual features in an implementation independent way; – adm, the Araneus Data Model , a logical model for hypertexts (Atzeni et al. [2]), whose main construct is that of page scheme, used to describe the common features of similar pages. It is worth noting that other proposals have been recently published that present some similarities, though with a less detailed articulation of models (P. Fraternali and P. Paolini [4], Fernandez et al. [3]). The origins of the methodological aspects can be traced back to previous work on hypermedia design (Garzotto et al. [5], Isakowitz et al. [6], Schwabe and Rossi [7]).
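As a very rough illustration of what a logical "page scheme" captures, namely one description shared by many structurally similar pages, consider the sketch below. It is not adm syntax, which is defined in Atzeni et al. [2]; all names, attribute kinds and the example site are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# One scheme describes the common structure of many similar pages
# (e.g. one page per tuple of a relation).  All names are hypothetical.

@dataclass
class Attribute:
    name: str
    kind: str                              # e.g. "text", "image", "link-list"
    target_scheme: Optional[str] = None    # for links: the scheme they point to

@dataclass
class PageScheme:
    name: str
    url_pattern: str
    attributes: List[Attribute] = field(default_factory=list)

# A single scheme stands for the whole set of author pages of a hypothetical site.
author_page = PageScheme(
    name="AuthorPage",
    url_pattern="/authors/{author_id}.html",
    attributes=[
        Attribute("name", "text"),
        Attribute("photo", "image"),
        Attribute("publications", "link-list", target_scheme="PaperPage"),
    ],
)
print(author_page.name, [a.name for a in author_page.attributes])
```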
Acknowledgments. I would like to thank Gianni Mecca and Paolo Merialdo, together with whom most of the concepts mentioned here are being developed.
References

1. P. Atzeni, G. Mecca, and P. Merialdo. Design and Maintenance of Data-Intensive Web Sites. Advances in Database Technology—EDBT'98, Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.
2. P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In International Conf. on Very Large Data Bases (VLDB'97), Athens, Greece, August 26-29, 1997.
3. M. F. Fernandez, D. Florescu, J. Kang, A. Y. Levy, D. Suciu. Catching the Boat with Strudel: Experiences with a Web-Site Management System. In Proc. of ACM SIGMOD Int'l Conference on Management of Data, 1998.
4. P. Fraternali and P. Paolini. A conceptual model and a tool environment for developing more scalable, dynamic, and customizable Web applications. Advances in Database Technology—EDBT'98, Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.
5. F. Garzotto, P. Paolini, and D. Schwabe. HDM – a model-based approach to hypertext application design. ACM Transactions on Information Systems, 11(1):1–26, January 1993.
6. T. Isakowitz, E. A. Stohr, and P. Balasubramanian. RMM: A methodology for structured hypermedia design. Communications of the ACM, 38(8):34–44, August 1995.
7. D. Schwabe and G. Rossi. The Object-Oriented Hypermedia Design Model. Communications of the ACM, 38(8):45–46, August 1995.
ARTEMIS: A Process Modeling and Analysis Tool Environment

S. Castano¹, V. De Antonellis², and M. Melchiori²

¹ University of Milano, DSI - via Comelico, 39 - 20135 Milano, Italy
[email protected]
² University of Brescia, DEA - via Branze, 38 - 25123 Brescia, Italy
{deantone,melchior}@ing.unibs.it
Abstract. To support business process understanding and reengineering, techniques and tools for process modeling and analysis are required. The paper presents the ARTEMIS tool environment for business process modeling and analysis. Process analysis is performed according to an organizational structure perspective and an operational structure perspective, to capture the degree of autonomy/dependency of organization units in terms of coupling, and the inter-process semantic correspondences, in terms of data and operation similarity, respectively. Processes are modeled as workflows and techniques developed for workflow analysis are presented in the context of a pilot application involving the Italian Ministry of Justice.
1 Introduction
Most private and public organizations have recently turned their attention to the process by which they operate, to improve service and product quality and customer satisfaction [13]. To support business process understanding and reengineering, techniques and tools for process modeling and analysis are studied [12,9,14]. Moreover, reverse engineering techniques to reconstruct conceptual models of existing applications and databases are proposed for analysis purposes [1]. Process analysis is generally performed following an information processing viewpoint [10], focusing on input/output data and on the process structure and execution modalities [2,15]. In [6], we have presented a process analysis approach according to an inherently data-oriented perspective, that mainly focuses on characteristics of data manipulated and exchanged by processes and on related operations. In this paper, we extend the approach to the analysis of process structure and execution modalities described by workflow specifications, and we present the ARTEMIS (Analysis of Requirements: Tool Environment for Multiple Information Sources) tool environment for process modeling and analysis. The analysis techniques of ARTEMIS rely on workflow descriptions of processes and allow the analyst to discover and classify critical situations requiring reengineering interventions, according to operational structure and organizational structure perspectives.
Functionalities provided by ARTEMIS are illustrated together with results of their application to the international adoption processes of the Italian Juvenile Court of the Ministry of Justice, in the context of the PROGRESS (PROcess Guided REengineering Support System) project. PROGRESS is a research project, funded by the Italian National Research Council (CNR) and by the Italian National Consortium for Informatics (CINI), which aims at reengineering data and processes of Italian Public Administration information systems. The paper is organized as follows. In Sect. 2, we present an overview of the ARTEMIS functionalities. In Sect. 3, we describe process workflow modeling, while, in Sect. 4, we describe the ARTEMIS analysis functionalities, with application to selected processes of the Juvenile Court. Finally, Sect. 5 draws some conclusions and describes future work.
2 Functionalities of the ARTEMIS Tool Environment
ARTEMIS provides the following analysis functionalities to support reengineering activities:
– Process form cataloging. Process forms in PROGRESS provide textual description of organization units and related processes, giving preliminary information on their input/output and composing tasks. Once such forms are filled-in with information on processes to be analyzed and modeled, they are stored and properly classified by means of keywords to facilitate their subsequent retrieval through ARTEMIS.
– Process workflow modeling. In ARTEMIS, processes to be analyzed are modeled as workflows, according to the WIDE workflow model, presented in Sect. 3. The WIDE model has been developed in the framework of the Esprit Project WIDE (Workflow on Intelligent Distributed database Environment) [3], and allows the representation of operational and organizational aspects relevant for workflow analysis. Process workflow modeling is accomplished by interfacing the WIDE workflow specification tool, called FORO designer.
– Process workflow analysis. These are the core functionalities of ARTEMIS, which operate according to the following analysis perspectives.
Analysis of the operational structure. Processes are analyzed with respect to their input/output information entities and their functionality, in order to identify situations of replication/redundancy/overlapping of activities and to evaluate the relevance and repetitiveness of processes within a given unit. The analysis is based on the following similarity coefficients: i) entity-based similarity coefficient, to evaluate the degree of similarity of two processes with respect to their input/output information entities. It can be a point of reference to evaluate the adequacy of data usage in the operational structure; ii) functionality-based similarity coefficient, to analyze process relationships due to operation commonality. It can be a point of reference to check the adequacy of production/manipulation of information in the operational structure.
Analysis of the organizational structure. Processes are analyzed with respect to their exchanged information flows in order to evaluate the degree of interdependency between them. On the basis of the number of exchanged information flows, we can measure the degree of coupling between different processes. The analysis is based on the following coupling coefficients: i) actual coupling coefficient, to evaluate the degree of coupling of processes in the same or different organization units, based on the analysis of information flows involving identical entities; ii) potential coupling coefficient, to evaluate the degree of coupling, based on the analysis of information flows involving entities that are similar according to the semantic dictionary contents. Coupling coefficients can be a point of reference to understand the information flow network and its implications to determine, at an aggregated level, the information flows among the separate involved processes and envisage possible regrouping of processes to simplify or expedite the flow and improve global effectiveness.
To support the process analysis, two further functionalities are provided in ARTEMIS:
– Interactive construction of a semantic dictionary. The evaluation of similarity and coupling coefficients requires the comparison of information entities and of operations of different processes. A semantic dictionary is exploited to handle possible synonyms and other terminological relationships between entity and operation names. The semantic dictionary is semi-automatically built before starting the analysis, where information entity names and operation names are stored as terms and organized by generalization and aggregation mechanisms. Issues related to the construction of the semantic dictionary are presented in [6].
– Interactive extraction of process descriptors. Process descriptors provide summary information on processes to be analyzed, for the evaluation of similarity and coupling coefficients. They are interactively extracted from workflow specifications produced with the modeling functionality.
In the following, we will focus on functionalities for process workflow analysis, with application to the International Adoption processes of the Italian Juvenile Court. In particular, after a general description of the process workflow model, we discuss the analysis perspectives and the associated coefficients and present results of their application to selected processes.
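The actual dictionary construction and affinity functions are described in [6] and are not reproduced here. Purely as an illustration of the mechanism just described, terms linked by terminological relationships, with affinity decreasing with the length and strength of the connecting path, one could imagine something like the following sketch, where the relationship strengths and the example terms are invented.

```python
from collections import defaultdict
import heapq

# Hypothetical relationship strengths in (0, 1]; the real values come from [6].
STRENGTH = {"SYN": 1.0, "BT/NT": 0.8}     # synonymy, broader/narrower term

class SemanticDictionary:
    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, t1, t2, rel):
        s = STRENGTH[rel]
        self.edges[t1].append((t2, s))
        self.edges[t2].append((t1, s))

    def affinity(self, t1, t2):
        """A(t1, t2) in [0, 1]: product of strengths along the best path,
        so it decreases with path length and with weaker relationships."""
        if t1 == t2:
            return 1.0
        best = {t1: 1.0}
        heap = [(-1.0, t1)]
        while heap:
            a, t = heapq.heappop(heap)
            a = -a
            if t == t2:
                return a
            if a < best.get(t, 0.0):
                continue                      # stale heap entry
            for nxt, s in self.edges[t]:
                if a * s > best.get(nxt, 0.0):
                    best[nxt] = a * s
                    heapq.heappush(heap, (-a * s, nxt))
        return 0.0

d = SemanticDictionary()
d.add("dossier", "file", "SYN")
d.add("file", "document", "BT/NT")
print(d.affinity("dossier", "document"))   # 0.8 via dossier -> file -> document
```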
3 Process Workflow Modeling
In ARTEMIS, processes are modeled as workflows, using the WIDE workflow model [3]. In WIDE, a workflow schema is the model of a (business) process. It is a collection of tasks, which are the elementary work units that collectively achieve the workflow goal. Tasks are organized into a flow structure defining the
execution dependencies among tasks, i.e., the order in which tasks are executed. Sequential, parallel and conditional executions can be specified. A workflow case is an execution of a workflow schema, i.e., an instance of the process. Multiple executions of the same process may be active at the same time, and are denoted by a different activation number. Each task within a workflow schema has associated a set of characteristics, among which: the name; a textual description; the actions, either manual or a sequence of statements of the description language defining how both temporary and persistent workflow data are manipulated by the task; roles which may perform the task, and a set of constraints concerning task assignment to agents; the task information (documents, forms, dossier, accessed databases), classified as input and output to be used/produced when achieving the task; exceptions, that can be specified to handle abnormal situations that can occur during the execution of the task and need to be managed properly. The flow structure is specified by means of a set of constructs allowing sequence, alternative, and parallelism. Two tasks may be directly connected by an edge to denote sequence: as soon as the first one ends, the second one is scheduled for execution. More complex execution dependencies are specified by means of the fork connectors (for initiating concurrent execution) and join connectors (for synchronizing after concurrent execution) [3]. A case is executed by scheduling tasks (as defined by the flow structure) and by assigning them for execution to a human or an automated agent. As a case is started, the first task (the successor of the start symbol) is activated. As a task connected to the stop symbol is completed, the case is also completed. The WIDE workflow model includes also an advanced construct of supertask to enable modularization in a workflow specification. It is a composite task, composed of elementary (atomic) tasks or of other supertasks. As the predecessor of the supertask is completed, the first task in the supertask is activated. As the last task in the supertask is completed, the successor of the supertask is scheduled for execution. For example, the International Adoption process is modeled in WIDE as a set of supertasks (shadowed boxes in Fig. 1), describing the composite activities involved in the process. The process starts when a couple submits an application to be declared eligible for an international adoption. After receiving the submitted application, international qualifying is attested for the couple, waiting for a foreign judge action (this condition is represented by means of the oval box preceding Foreign Action Procedure supertask in the figure). Depending on the validity of the action, the Revoke Fostering or the Adoption task is executed. If the action is valid, after one year and if no other events occur in the meantime, the Adoption supertask can start. If the foreign action is invalid, the Revoke Fostering activity is started, which is also executed when a notification arrives stating that the fostering is not valid. This situation is modeled using a conditional fork (diamond symbol after Foreign Action Procedure in the figure). The workflow can terminate in two different cases. In the first case, termination occurs if the fostering is not valid, or no adoption is allowed (this
Fig. 1. The International Adoption process in WIDE (supertasks: Application, Intern. Qualifying, Foreign Action Procedure, Revoke Fostering, Adoption).
Fig. 2. The Foreign Action Procedure activity in WIDE (tasks and responsible units: Attorney Consultation (Public Prosecutor), Update EDP (EDP), Action Requirement (Official Receiver), Trial and Decisions (Chamber of Council), Getting Results (Chancellery), Insert Results (EDP)).
is modeled by using a conditional join -circle symbol- before the end WF symbol on the left side), and the National Adoption workflow can be started. In the second case, the workflow terminates with successful adoption. Each supertask of the International Adoption is in turn expanded into a workflow, to model its corresponding component tasks and executing agents. In Fig. 2, part of the expansion of the supertask Foreign Action Procedure into elementary tasks is shown. For each task, the organization unit responsible for task execution is reported, together with the input and output documents.
Moreover, if database operations are performed from within the task, these are listed, by specifying the type (Select, Update, Insert, Delete) and the involved attributes of database entities and/or relationships.
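The WIDE specification language and the FORO designer's internal format are not shown in this paper. Purely as a hypothetical rendering of the kind of task record just described (executing unit, input and output documents, database operations with their type and attributes), one might use a structure like the following, populated with data taken from the Insert Results task of Fig. 2; field names are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DBOperation:
    op_type: str                 # "S"elect, "U"pdate, "I"nsert, "D"elete
    target: str                  # entity or relationship name
    attributes: List[str] = field(default_factory=list)

@dataclass
class Task:
    name: str
    description: str
    organization_unit: str       # role/agent responsible for execution
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    db_operations: List[DBOperation] = field(default_factory=list)
    exceptions: List[str] = field(default_factory=list)

insert_results = Task(
    name="Insert Results",
    description="Update Database and Print Results",
    organization_unit="EDP",
    inputs=["international adoption dossier"],
    outputs=["international adoption dossier",
             "validation action of international fostering or adoption"],
    db_operations=[DBOperation("I", "ACTION", ["all attributes"]),
                   DBOperation("S", "ACTION", ["all attributes"])],
)
print(insert_results.organization_unit, insert_results.name)
```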
4 Process Workflow Analysis
In ARTEMIS, process workflow analysis is performed using descriptors. A process descriptor gives a summary, structured representation of the features of a process that are relevant for the application of the analysis coefficients. A descriptor provides the following information for a process: i) the name of the organizational unit responsible for process execution; ii) the set IN of input information entities; iii) the set OUT of output information entities; iv) the set OP of operations performed by the process, formally described as triplets ⟨action, constitutive entities (CST), circumstantial entities (CSM)⟩.
In ARTEMIS, descriptors are interactively extracted from WIDE workflow specifications, by analyzing the internal representation generated by the FORO designer tool. Since workflows describe complex processes (i.e., business processes), descriptors are extracted from workflow tasks, to allow for a fine-grained analysis of performed activities. In particular, the organization unit field of the descriptor corresponds to the agent that performs the task (graphically shown in the upper right hand corner of the task diagram, see Fig. 1); the IN and OUT sets of information entities correspond to input and output documents associated with the task (graphically they are labeled as INPUT and OUTPUT in the figure); the set OP of operations that the task performs is derived from the task description, by recognizing the action (i.e., a verb), the set CST of constitutive entities required by the operation, and the set CSM of circumstantial entities involved in the operation. The operations are interactively extracted, starting from the task description, with tool assistance. As an example, in Fig. 3, the descriptor for the task Getting Results within the Foreign Action Procedure workflow is shown.
Process descriptor
TASK: Getting Results
ORGANIZATION UNIT: Chancellery
INPUT: {international adoption application, code of the minor, code of the couple, role number, dossier of the couple, dossier of the minor, action of delegation assignment, authorization of the Ministry of the Internal and Foreign Affairs, action request, result of trial}
OUTPUT: {communication to the Ministry of the Internal and Foreign Affairs, registers}
OPERATIONS: { }
Fig. 3. An example of descriptor for the task Getting Results
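Descriptor extraction in ARTEMIS is interactive and tool-assisted; the sketch below only fixes the shape of a descriptor and the mechanical part of the extraction (copying the unit and the IN/OUT sets from a task specification), leaving the operation triples to the analyst. The field names, the dictionary-style task input and the example operation triple are illustrative, not ARTEMIS's internal format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class ProcessDescriptor:
    task: str
    organization_unit: str
    IN: Set[str] = field(default_factory=set)
    OUT: Set[str] = field(default_factory=set)
    # each operation is <action, constitutive entities CST, circumstantial CSM>
    OP: List[Tuple[str, Set[str], Set[str]]] = field(default_factory=list)

def extract_descriptor(task: dict) -> ProcessDescriptor:
    """Mechanical part of the extraction: unit, IN and OUT are copied from
    the task specification; the OP triples are added interactively."""
    return ProcessDescriptor(
        task=task["name"],
        organization_unit=task["unit"],
        IN=set(task.get("inputs", [])),
        OUT=set(task.get("outputs", [])),
    )

getting_results = extract_descriptor({
    "name": "Getting Results",
    "unit": "Chancellery",
    "inputs": ["international adoption dossier", "result of trial"],
    "outputs": ["registers", "communication of the result of trial"],
})
# Hypothetical operation triple supplied by the analyst:
getting_results.OP.append(("update", {"registers"}, {"result of trial"}))
print(getting_results)
```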
4.1 Analysis of the Operational Structure with Experimentation in PROGRESS
Following this perspective, processes are analyzed according to similarity criteria and classified into families to evaluate possible situations of redundancy, replication, or inconsistency with respect to manipulated information entities and performed operations. Process analysis and classification is based on the following similarity coefficients. Entity-based similarity coefficient. The Entity-based similarity coefficient of two processes Pi and Pj , denoted by ESim(Pi , Pj ), is evaluated by comparing the input/output information entities in their corresponding descriptors, that is,
ESim(Pi, Pj) = 2 · Atot(IN(Pi), IN(Pj)) / (|IN(Pi)| + |IN(Pj)|) + 2 · Atot(OUT(Pi), OUT(Pj)) / (|OUT(Pi)| + |OUT(Pj)|)
where Atot(IN(Pi), IN(Pj)) (respectively, Atot(OUT(Pi), OUT(Pj))) denotes the total value of "affinity" between the pairs of input (respectively, output) information entities in Pi and Pj, and | | denotes the cardinality of a given set. Atot is obtained by summing up the affinity values of all the pairs of input/output information entities that have affinity in the semantic dictionary. Proper "affinity functions" are available on the dictionary which, given two names, determine their affinity value A() ∈ [0, 1] based on the existence of a path of terminological relationships (e.g., synonymy, hypernymy) between them, on its length and on the strength of the involved relationships. High affinity values (i.e., A() ∈ [0.7, 1]) denote that the corresponding information entities can be considered for the evaluation of the ESim coefficient.
ESim can assume values in the range [0, 2]. It is 0 when no pairs of entities with affinity are found in IN and OUT, while it has value 2 when each input and output entity of Pi has affinity with an input and output entity of Pj and vice versa. Intermediate values are proportional to the number of pairs of information entities with affinity in Pi and Pj and to their affinity value.
Functionality-based similarity coefficient. The Functionality-based similarity coefficient of two processes Pi and Pj, denoted by FSim(Pi, Pj), is evaluated by comparing the operations in their corresponding descriptors. Also in this case, the comparison is based on the semantic dictionary, that is,

FSim(Pi, Pj) = 2 · Atot(OP(Pi), OP(Pj)) / (|OP(Pi)| + |OP(Pj)|)
where Atot (OP (Pi ), OP (Pj )) denotes the total value of affinity of the pairs of operations that are similar in Pi and Pj . Two operations are similar if their actions, their constitutive information entities and, if defined, their circumstantial information entities have affinity in the dictionary. The similarity value of two
operations is obtained by summing up the affinity values of their corresponding elements. FSim assumes values in the range [0, 3]. It is 0 when no similar operations are found in Pi and Pj, while it has value 3 when each operation of Pi is similar to an operation of Pj and vice versa, and the elements in each operation pair have the greatest affinity value. Intermediate values are proportional to the number of pairs of similar operations in Pi and Pj and to the affinity of their elements.
Once the processes for the analysis have been selected, the user can interactively set similarity thresholds to filter out process pairs based on the computed ESim and FSim values. Similarity reports are produced by ARTEMIS in the form of tables, listing process pairs according to different criteria (e.g., decreasing order of similarity, alphabetical order). Two kinds of semantic correspondences are identified between two processes Pi, Pj: i) Semantic equivalence, (Pi ≡ Pj), denoting processes whose ESim and FSim coefficients have the maximum value. This means that they perform the same real-world activity (activity replication), and for them the unification of the involved activity should be evaluated. ii) Semantic relationship, (Pi ∼ Pj), denoting processes with T1 ≤ ESim(Pi, Pj) < max and T2 ≤ FSim(Pi, Pj) < max, where T1 and T2 are similarity thresholds specified by the user. This means that the processes execute partially overlapping real-world activities, and this situation should be analyzed to evaluate the unification or the standardization of the involved activities.
Let us now discuss the results of performing the operational structure analysis on the adoption processes in PROGRESS. Four processes related to the national and international adoption procedures have been selected for the analysis, namely, National Evaluation, Art. 144 Law 189, International Adoption, and National Adoption. The analysis has been performed separately for each process, since the four selected processes represent different procedures with different objectives. Two organization units play an important role in all examined processes, namely the EDP and the Chancellery units, which have been identified as crucial units for reengineering activities. We report the results of the analysis of the tasks performed by the EDP and the Chancellery organization units within the International Adoption process. Analyzing the International Adoption workflow means analyzing all the workflows corresponding to its supertasks, that is, the Application, International Qualifying, Foreign Action Procedure, Revoke Fostering, and Adoption workflows (see Fig. 1). We first performed the operational structure analysis of all International Adoption tasks of each organization unit separately, to classify the activities performed by the unit and reason about their relevance and repetitiveness. Then, we performed the analysis of all International Adoption tasks of both the EDP and Chancellery together, to recognize activity replication/overlapping and evaluate restructuring interventions. This way, ARTEMIS allowed us to reason about different aspects and point out critical situations of different nature, within the same organization unit or involving the two units together.
Analysis of all tasks of a given organization unit. Tasks performed by the Chancellery and the EDP in the International Adoption process have been
analyzed separately for each unit. On the basis of the obtained similarity values, the tasks of each unit were classified into families. Families obtained with ARTEMIS were used to classify the activities performed by the two considered organizational units. For example, for the Chancellery, we recognized four main categories of activities performed within the International Adoption, namely Update Registers, Getting Application, Registration, and Request of Documents. Each category groups tasks executed in different points of the examined workflows, which are characterized by similar operations on similar documents. The identified activity categories are the basis for evaluating the following parameters: i) replication of a given task in a workflow; ii) repetitiveness of a given category within a workflow; iii) relevance of a given category in terms of other categories which depend on its accomplishment in order to start their execution. Based on these parameters, it is possible to establish if a category has to be considered as "crucial" and if its task occurrences refer to similar operational conditions. In this case, the category of tasks is a candidate for automation, if the activities are (partially) manually performed.
By analyzing task categories for the considered units, we found that the Chancellery performs mainly administration activities while the EDP unit is characterized by several database update activities. By evaluating the repetitiveness of each category within the International Adoption process, we discovered that the crucial tasks are those concerning register update for the Chancellery and database update for the EDP, respectively. Moreover, by examining the execution flow in correspondence of tasks of a given category, we observed that, in the present way of working, tasks concerning register update are mandatory for the continuation of the activities in the workflow, while the corresponding database updates are not. Consequently, tasks performed by the Chancellery are to be considered as relevant for the International Adoption process. On the contrary, tasks performed by the EDP unit are not relevant in the present configuration, being executed some time after the necessary information is produced. As a consequence, the database does not reflect in real time the situation contained in the paper-based registers. It would be desirable to have real-time updates of the database, as soon as the necessary information is produced. This would be possible if the Chancellery were enabled to perform on-line updates on the database. To formulate possible solutions of workflow reorganization to meet such a requirement, we also need to exploit the results of the analysis of the tasks performed by both the EDP and Chancellery units in the International Adoption.
Analysis of all tasks of different organization units. EDP and Chancellery tasks related to the International Adoption process were analyzed for evaluating their similarity; in Table 1, we report the obtained ESim and FSim coefficients. As we can see from these values, only tasks with a semantic relationship characterize these two units in the considered process. By analyzing the tasks with the highest similarity values (i.e., the four tasks characterized by ESim = 1 and FSim = 2.8 and the pair with ESim = 2 and FSim = 1.2 in Table 1), together
Table 1. ESim and FSim coefficients for the Chancellery and the EDP tasks in the International Adoption process
with the involved workflows, we discovered the repetition of a recurring "pattern" in task execution: whenever the Chancellery performs a register-update task, a "dual" task performing the corresponding database update is performed by the EDP. For example, let us consider the tasks Getting Results and Insert Results (fourth row in Tab. 1) performed by the Chancellery and the EDP units, respectively. In the current Foreign Action Procedure workflow, these two tasks are performed in sequence (see Fig. 2, boxed area), with disadvantages related to the periodic update overhead for the EDP and the inconsistent state of the database with respect to the information in the registers. According to the obtained similarity values and by taking into account the current execution modalities, we envisaged two possible solutions:
Solution a) concurrent tasks (see Fig. 4): the EDP and the Chancellery can execute the updates concurrently. To make this possible, the Chamber of Council has to send a copy of the documents to be updated to both the EDP and the Chancellery. This solution puts a (limited) overhead on the activity of the Chamber of Council, to make an additional copy of the produced documents to be sent to the EDP. This solution can be useful in a transition phase, from the current workflow to the target workflow, defined according to Solution b).
Fig. 4. Solution a): Concurrent tasks (the Chancellery's Getting Results / Update Registers and the EDP's Insert Results / Update Database run in parallel).
Fig. 5. Solution b): Unified tasks (a single Chancellery task, Getting Results, performs both the register update and the database update).
Solution b) unified tasks (see Fig. 5): with this solution, the activities currently assigned to the EDP and the Chancellery are unified into a single task under the responsibility of the Chancellery. This way, the Chancellery is enabled to perform the necessary database updates directly, as soon as modifications are introduced in the registers. This solution, whose benefits are evident, requires an underlying distributed architecture to be implemented.
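Before turning to the organizational perspective, a rough sketch of how the ESim and FSim coefficients defined at the beginning of this section could be computed from two descriptors may help. The affinity function is stubbed out (in ARTEMIS it comes from the semantic dictionary), Atot is approximated by a greedy one-to-one pairing of high-affinity elements, and the entity sets of an operation are collapsed to single representative names; these are simplifications not prescribed by the paper, and all example data are invented.

```python
def pair_total(xs, ys, aff, threshold):
    """Sum of affinities over a greedy one-to-one pairing of elements whose
    affinity reaches the threshold (the paper's Atot; the pairing rule is assumed)."""
    pairs = sorted(((aff(x, y), x, y) for x in xs for y in ys),
                   reverse=True, key=lambda t: t[0])
    seen_x, seen_y, total = set(), set(), 0.0
    for a, x, y in pairs:
        if a < threshold:
            break
        if x not in seen_x and y not in seen_y:
            seen_x.add(x); seen_y.add(y); total += a
    return total

def esim(d1, d2, name_aff, threshold=0.7):
    """ESim(P1, P2) in [0, 2]: affinity-weighted overlap of the IN and OUT sets."""
    return (2 * pair_total(d1["IN"], d2["IN"], name_aff, threshold)
            / (len(d1["IN"]) + len(d2["IN"]))
            + 2 * pair_total(d1["OUT"], d2["OUT"], name_aff, threshold)
            / (len(d1["OUT"]) + len(d2["OUT"])))

def fsim(d1, d2, name_aff, threshold=0.7):
    """FSim(P1, P2) in [0, 3].  Operations are <action, CST, CSM> triples of names;
    a matching pair contributes the sum of the affinities of its three parts,
    provided every part reaches the threshold."""
    def op_aff(o1, o2):
        parts = [name_aff(a, b) for a, b in zip(o1, o2)]
        return sum(parts) if all(p >= threshold for p in parts) else 0.0
    return (2 * pair_total(d1["OP"], d2["OP"], op_aff, threshold)
            / (len(d1["OP"]) + len(d2["OP"])))

# Stubbed dictionary affinity and two invented descriptors.
dict_aff = lambda x, y: 1.0 if x == y else (0.8 if {x, y} == {"registers", "database"} else 0.0)
chancellery = {"IN": {"international adoption dossier", "result of trial"},
               "OUT": {"registers"},
               "OP": [("update", "registers", "result of trial")]}
edp = {"IN": {"international adoption dossier"},
       "OUT": {"database"},
       "OP": [("update", "database", "result of trial")]}
print(round(esim(chancellery, edp, dict_aff), 2), round(fsim(chancellery, edp, dict_aff), 2))
```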
4.2 Analysis of the Organizational Structure with Experimentation in PROGRESS
The effectiveness of an organization unit depends on its level of coupling with the outside (i.e., other organization units). Following this analysis perspective, we analyze the interactions between different units, to understand the information flow network and its implications. The analysis is based on input and output information entities to determine, at an aggregate level, the information flow among the processes. When analyzing processes of different organization units, two kinds of flows are relevant: "actual flows", denoted by ↦_{ek,eh}, originated by the exchange of the same information entity; "potential flows", denoted by
Table 2. Actual Coupling Coefficients for Chancellery and EDP processes
⇝_{ek,eh}, originated by the exchange of information entities with affinity. To evaluate process coupling in a precise way that allows one to point out the nature of the involved entities, the following coupling coefficients are introduced.
Actual Coupling coefficient. The Actual Coupling coefficient of two processes Pi and Pj, denoted by AC(Pi, Pj), measures the amount of relationships between the processes due to actual information flows, that is,

AC(Pi, Pj) = |{⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj}|,

where {⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj} is the set composed of information entities that originate actual flows between Pi and Pj. Information flows are determined by comparing the input (respectively, output) information entities of one process with the output (respectively, input) information entities of the other one.
Potential Coupling coefficient. The Potential Coupling coefficient of two processes Pi and Pj, denoted by PC(Pi, Pj), measures the amount of relationships between the processes due to the actual and potential flows between them, that is,

PC(Pi, Pj) = |{⟨ek, eh⟩ | Pi ↦_{ek,eh} Pj ∨ Pi ⇝_{ek,eh} Pj}|.

The Potential coupling coefficient is evaluated by exploiting the semantic dictionary, to recognize information entities with affinity. Analogously to the operational structure, summary reports for both the actual and potential coupling coefficients can be produced with ARTEMIS, with different presentation options. The user can decide which kind of coupling coefficient to compute, and on which processes. In particular, it is possible to compute coupling coefficients on families of similar processes obtained as a result of the operational structure analysis, if processes of different organization
units are involved, to provide complementary insights into process families for a comprehensive analysis.
In our experimentation, coupling coefficients have been computed for the tasks of the Chancellery and EDP units. In Table 2, we report the actual coupling values for the International Adoption tasks that have a semantic relationship in the two organization units. Since the examined tasks involve the same information entities, the potential coupling is not relevant. The higher the coupling values, the higher their level of dependency, which suggests unification interventions. On the basis of the analysis of the volumes and type of the exchanges, it is possible to envisage possible regroupings of processes to simplify or expedite this flow. By analyzing the information flows between the Chancellery and EDP units, we found that they exchange the same documents several times, but not all of the exchanged documentation is necessary for the accomplishment of the tasks. Moreover, such additional documentation exchanges contribute to slowing down the overall process. A solution based on task unification would also contribute to reducing the number of exchanged documents. Another solution could be the reorganization of the documentation flow, to reduce and optimize the number of document exchanges required to accomplish the goal. In general, modifications to existing workflow schemas can be incorporated in ARTEMIS, by producing new workflows; new analysis coefficients can then be evaluated and compared with the previous ones, to facilitate the analysis. For this purpose, the reporting functionalities offered by the ARTEMIS tool environment can be exploited.
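A similarly rough sketch of the coupling coefficients: an actual flow is counted when an information entity output by one process is input to the other, and a potential flow when the two entity names merely have affinity in the dictionary. The affinity stub and the example descriptors are invented; ARTEMIS's flow derivation from workflow specifications is richer than this.

```python
def actual_coupling(d1, d2):
    """AC(P1, P2): number of entity pairs that originate an actual flow,
    i.e. the same entity output by one process and input by the other."""
    flows = ({(e, e) for e in d1["OUT"] & d2["IN"]}
             | {(e, e) for e in d2["OUT"] & d1["IN"]})
    return len(flows)

def potential_coupling(d1, d2, affinity, threshold=0.7):
    """PC(P1, P2): actual flows plus flows between entities with affinity."""
    flows = set()
    for out_side, in_side in ((d1["OUT"], d2["IN"]), (d2["OUT"], d1["IN"])):
        for ek in out_side:
            for eh in in_side:
                if ek == eh or affinity(ek, eh) >= threshold:
                    flows.add((ek, eh))
    return len(flows)

# Invented data: "registers" and "database" are related only through the dictionary.
aff = lambda x, y: 0.8 if {x, y} == {"registers", "database"} else 0.0
chancellery = {"IN": {"international adoption dossier"},
               "OUT": {"international adoption dossier", "registers"}}
edp = {"IN": {"international adoption dossier", "database"},
       "OUT": {"international adoption dossier", "database"}}
print(actual_coupling(chancellery, edp), potential_coupling(chancellery, edp, aff))
```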
5 Concluding Remarks
In this paper, we have presented the ARTEMIS tool environment for process modeling and analysis. Based on workflow specifications of processes, the analysis is performed according to an organizational structure perspective and an operational structure perspective, to capture the degree of autonomy/dependency of organization units in terms of coupling and the inter-process semantic correspondences in terms of data and operation similarity, respectively. The analysis tool environment described in this paper is intended to support and facilitate process reengineering activities, by providing similarity-based techniques for the systematic analysis of processes for aspects related to information and operation similarity, and to exchanged information flows.
Results of performing the analysis on the adoption processes of the Juvenile Court of the Ministry of Justice have been discussed. At present, such results and the proposed solutions are under examination by the Juvenile Court. The goal of future work is the analysis of the execution modalities of single processes and of the communication protocols between them, to discover possible inefficiencies and failures and to enrich the tool capabilities for reengineering.
References 1. Aiken, P.: Data Reverse Engineering, McGraw-Hill 1996. 2. Barros, A.P., ter Hofstede, A.H.M., Proper, H.A.: Towards Real-Scale Business Transaction Workflow Modelling. In Proc. of CAiSE’97 - Int. Conf. on Advanced Information Systems Engineering, Barcelona, Spain (1997). 3. Casati, F., Ceri, S., Pernici, B., Pozzi, G.: Conceptual Modeling of Workflows. In Proc. of OO-ER’95, Int. Conf. on the Object-Oriented and Entity-Relationship Modelling, Gold Coast, Australia (1995). 4. Castano, S., De Antonellis, V., Fugini, M.G., Pernici, B.: Conceptual Schema Analysis: Techniques and Applications. ACM Transactions on Database Systems (to appear). 5. Castano, S., De Antonellis, V.: Semantic Dictionary Design for Database Interoperability. Proc. of ICDE’97, IEEE Int. Conf. on Data Engineering, Birmingham (1997). 6. Castano, S., De Antonellis, V.: A Framework for Expressing Semantic Relationships Between Multiple Information Systems for Cooperation. Information Systems, Special Issue on CAiSE’97, 27(3/4) (1998). 7. Castano, S., De Antonellis, V.: Reference Conceptual Architectures for Reengineering Information Systems. International Journal of Cooperative Information Systems 4(2 & 3) (1995). 8. Castano, S., De Antonellis, V.: Reengineering Processes in Public Administrations. Proc. of OO-ER’95, Int. Conf. on the Object-Oriented and Entity-Relationship Modeling, Gold Coast, Australia, (1995). 9. Fong, J.S.P., Huang, S.M.: Information Systems Reengineering, Springer-Verlag (1997). 10. Galbraith, J.R.: Designing Complex Organizations, Addison-Wesley Publishing Company (1973). 11. Hammer, M.J.: Reengineering Work: Don’t Automate. Obliterate, Harvard Business Review, July/August (1990). 12. Georgakopoulos, D., Hornik, M., Sheth, A.: An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure. Distributed and Parallel Databases, Kluwer Academic Publishers, 3 (1995). 13. Karagiannis, D. (Ed.): Special Issue on Business Process Reengineering, SIGOIS Bulletin, 16(1) (1995). 14. Nurcan, S., Grosz, G., Souveyet, C.: Describing Business Processes with a Guided Use Case Approach. Proc. of CAiSE*98, - Int. Conf. on Advanced Information Systems Engineering, Pisa, Italy, (1998). 15. Workflow Management Coalition: The wfmc specification - terminology & glossary. Doc. WFMC-TC00-1011 (1996).
From Object Oriented Conceptual Modeling to Automated Programming in Java*

Oscar Pastor¹, Vicente Pelechano¹, Emilio Insfrán¹, Jaime Gómez²

¹ Department of Information Systems and Computation, Valencia University of Technology, Camí de Vera s/n, 46071 Valencia (Spain)
{ opastor | pele | einsfran }@dsic.upv.es
² Department of Languages and Information Systems, Alicante University, C/ San Vicente s/n, 03690 San Vicente del Raspeig, Alicante (Spain)
{[email protected]}

* This work has been supported by the CICYT under MENHIR/ESPILL TIC 97-0593-C05-01 Project and by DGEUI under IRGAS GV97-TI-05-34 Project.
Abstract. The development of Internet commercial applications and corporate Intranets around the world, which often use Java as the programming language, is a significant topic in modern Software Engineering. In this context, more than ever, well-defined methodologies and high-level tools are essential for developing quality software in a way that is as independent as possible of changes in technology. In this article, we present an OO method based on a formal object-oriented model. The main feature of this method is that developers' efforts are focused on the conceptual modeling step, where analysts capture system requirements, and the full implementation can be obtained automatically following an execution model (including structure and behaviour). The final result is a web application with a three-tiered architecture, implemented in Java with a relational DBMS as object repository.
1 Introduction

The boom of "web computing" environments has led to the development of Internet commercial applications and the creation of corporate Intranets around the world. The use of the Java language [1] in these environments has opened significant research related to the proper implementation of final, correct software products. In this context, where new technologies are continuously emerging, software development companies must employ suitable methods, languages, techniques and tools
for dealing with the new market requirements. More than ever, well-defined methodologies and the high-level tools which support them are essential for developing quality software in a way that is as independent as possible of changes in technology. The idea of clearly separating the conceptual model level, centered on what the system is, from the execution model, intended to give an implementation in terms of how the system is to be implemented, provides a solid basis for operational solutions to the problem. If we have conceptual modeling environments rich enough to capture the relevant system properties in the problem space, a correct software representation in the solution space can easily be generated. Related work in this area has been developed based on the idea that we can significantly reduce the complexity of advanced-application specification and implementation by using a model-equivalent language (a language with a one-to-one correspondence to an underlying, executable model [3]). Nowadays, OO methodologies like OMT [2], OOSE [4] or Booch [5] are widely used in industrial software production environments. Industry trends attempt to provide unified notations such as the UML proposal [6], which was developed to standardize the set of notations used by the most well-known existing methods. Although the attempt is commendable, this approach has the implicit danger of providing users with an excessive set of models that have overlapping semantics, without a methodological approach. Following this approach, we have CASE tools such as Rational ROSE/Java [7], FrameWork [8] or Paradigm Plus [9] which include Java code generation from the analysis models. However, if we examine this proposed code generation feature in depth, we find that it is not at all clear how to produce a final software product in Java that is functionally equivalent to the system description collected in the conceptual model. This is a common weak point of these approaches. Far from what is required, what we have after completing the conceptual model is nothing more than a template for the declaration of classes, where no method is implemented and where no related architectural issues are taken into account. In order to provide an operational solution to the above problem, in this paper we present a method that is based on a formal object-oriented model. The main feature of this method is that developers' efforts are focused on the conceptual modeling step, where analysts capture system requirements, and the full implementation is automatically obtained following an execution model (including structure and behaviour). The final result is a web application with a three-tiered architecture, which is implemented in Java with a relational DBMS as object repository. The main contribution of this approach with respect to the topic of model-equivalent programming languages is that we start from a graphical representation of a formal object-oriented specification language for conceptual modeling purposes. After having defined a finite set of behavioural patterns, a mapping from these patterns to software components in a given software development environment (Java in this paper) is defined. Consequently, our methodological approach is not tied to any particular programming language. A CASE tool gives support to this method.
It constitutes an operational approach to the ideas of the automated programming paradigm [10]: the collection of system information properties in a graphical environment (the conceptual modeling step),
followed by the automated generation of a formal OO system specification and of a complete software prototype (including statics and dynamics) that is obtained from the conceptual model and is functionally equivalent to that system specification (the execution model step).
2 The OO-Method: An Object-Oriented Method

Following the OO-Method strategy, the software production process starts with the conceptual modeling step, where we have to collect the relevant system properties. Once we have an appropriate system description, a formal OO specification is automatically obtained. This specification is the source of a well-defined execution model which determines all the implementation-dependent features in terms of user interface, access control, service activation, etc. This execution model provides a well-structured framework that enables the building of an automatic code generation tool. It is important to note that the formal specification is hidden from the OO-Method user: the relevant system information is introduced in a graphical way, which is syntactically compliant with the conventional OO models, but which is semantically designed to fill the class definition templates according to the formal OO basis.
2.1 OASIS: an Object-Oriented Formal Model

The OO-Method was created on the formal basis of OASIS, an OO formal specification language for Information Systems [11]. In fact, we can see the OO-Method as a graphical OASIS editor that provides the conventional approach of using an object, dynamic and functional model, to make designers think that they are using a conventional OO method. The formalism is in this way hidden from them, avoiding the controversy attached to the use of formalisms in software development environments. Previous work on this idea can be found in the tunable formalism in object-oriented system analysis presented in [12]. Our approach provides a different kind of tunable formalism: the OASIS expressiveness is fully preserved, but it is presented to analysts according to a well-known conventional graphical notation (UML compliant). Below, we give a quick overview of the characteristics of OASIS. From an intuitive point of view, an object can be viewed as a cell or capsule with a state and a set of services. The state is hidden from other objects and can be handled only by means of services. The set of services is the object's interface, which allows other objects to access the state. Object evolution is characterized in terms of changes of state. Events represent atomic changes of state and can be grouped into transactions¹. When we build a system specification, we specify classes. Classes represent a collection of objects sharing the same template. The template must allow for the
¹ Molecular units of processing composed of object services that have the properties of non-observability of intermediate states and the all-or-nothing policy during execution.
declaration of an identification mechanism, the signature of the class including attributes and methods, and finally a set of formulae of different kinds to cover the rest of the class properties:
• integrity constraints (static and dynamic) which state conditions that must be satisfied.
• valuations which state how attributes are changed by event occurrences.
• derivations which relate some attribute's values to others.
• preconditions which determine when an event can be activated.
• triggers which introduce internal system activity.
Finally, as an object can be defined as an observable process, a class definition should be enriched with the specification of the process attached to the class. This process will allow us to declare possible object lives as terms whose elements are events and transactions. OASIS deals with complexity by introducing aggregation and inheritance operators. A complete description of the OASIS language can be found in [13].
2.2 Conceptual Modeling
Conceptual modeling in OO-Method collects the relevant Information System properties using three complementary models:
Object Model: a graphical model where system classes, including attributes, services and relationships (aggregation and inheritance), are defined. Additionally, agent relationships are introduced to specify who can activate each class service (client/server relationship).
Dynamic Model: another graphical model to specify valid object life cycles and interobject interaction. We use two kinds of diagrams:
• State Transition Diagrams to describe correct behaviour by establishing valid object life cycles for every class. By valid life, we mean a correct sequence of states that characterizes the correct behaviour of the objects.
• Object Interaction Diagram: represents interobject interactions. In this diagram we define two basic interactions: triggers, which are object services that are activated in an automated way when a condition is satisfied, and global interactions, which are transactions involving services of different objects.
Functional Model: used to capture the semantics attached to any change of an object state as a consequence of an event occurrence. We specify declaratively how every event changes the object state depending on the involved event arguments (if any) and the object's current state. We provide a clear and simple strategy for introducing the necessary information; this is one contribution of the method, and it allows us to generate a complete OASIS specification in an automated way. More detailed information can be found in [14].
From these three models, a corresponding formal OO OASIS specification is obtained using a well-defined translation strategy. The resultant OASIS specification
acts as a complete high-level system repository, where the relevant system information, coming from the conceptual modeling step, is captured.
2.3 Execution Model
Once all the relevant system information has been specified, we use an execution model to accurately state the implementation-dependent features associated with the selected object society machine representation. More precisely, we have to explain the pattern to be used to implement all the system properties in a logical three-tiered architecture for any target software development environment:
• interface tier: classes that implement the interaction with end users, presenting a visual representation of the application and giving users a way to access and control the object's data and services.
• application tier: classes that fully implement the behaviour of the business classes specified in the conceptual modeling step, enforcing the semantics of our underlying object model.
• persistence tier: classes that provide services allowing the business objects to interact with their specified permanent object repository.
In order to easily implement and animate the specified system, we predefine the way in which users interact with system objects. We introduce a new kind of interaction, close to what we could label an OO virtual reality, in the sense that an active object immerses itself in the object society as a member and interacts with the other society objects. To achieve this behaviour the system has to:
1. identify the user (access control): logging the user into the system and providing an object system view that determines the set of object attributes and services the user can see or activate.
2. allow service activation: after the user is connected and has a clear object system view, the user can activate any available service in his or her worldview. Among these services, we will have system observations (object queries) or events or transactions served by other objects.
The process of access control and the building of the system view (the classes, services and attributes visible to the user) are implemented in the interface tier. The information needed to properly configure the system view is included in the system specification obtained in the conceptual modeling step.
Any service activation has two steps: build the message and execute it (if possible). In order to build the message the user has to provide information to:
1. identify the object server: the existence of the server object is an implicit condition for executing any service, unless we are dealing with a new event². At this point, the persistence tier retrieves the object server from the database.
² Formally, a new event is a service of a metaobject that represents the class. The metaobject acts as an object factory for creating individual class instances. This metaobject (one for each class) has the class population attribute as its main property, the next oid, and the aforementioned new event.
2. introduce event arguments: the interface tier asks for the arguments of the event being activated (if necessary).
Once the message is sent, the service execution is characterized by the occurrence of the following sequence of actions in the server object (the application tier):
1. check state transition: verification in the object State Transition Diagram (STD) that a valid transition exists for the selected service in the current object state.
2. precondition satisfaction: the precondition associated with the service must hold. If 1 and 2 do not hold, an exception is raised and the message is ignored.
3. valuation fulfillment: the induced event modifications (specified in the Functional Model) take place in the involved object state.
4. integrity constraint checking in the new state: to assure that the service execution leads the object to a valid state, the integrity constraints (static and dynamic) are verified in the final state. If a constraint does not hold, an exception is raised and the previous change of state is ignored.
5. trigger relationships test: after a valid change of state, the set of condition-action rules that represents the internal system activity is verified. If any of them hold, the specified service is triggered.
The previous steps guide the implementation of any program to assure the functional equivalence between the object system specification collected in the conceptual model and its reification in a programming environment. Some interesting related work can be found in [15], where a tool (IPOST) that automatically generates a prototype from an object-oriented analysis model is introduced. This permits users to refine the model to generate a requirements specification. The contribution of OO-Method comes from the fact that the resultant prototype is not only a requirements specification: it is much closer to the final software product, because what is generated uses the solution space notation (Java, as we are going to show).
3 An Architecture for Implementing the Execution Model in Java

The abstract execution model shown above is based on a generic three-tiered architecture. Below, we will introduce a concrete implementation using web technology and Java as the programming language. This will provide a methodological framework to deal with Java implementations which are functionally equivalent to the source conceptual model.
3.1 Translating a Conceptual Model into Java Classes using the Execution Model

Starting from the proposed Execution Model, we want to design the architecture of classes needed to implement the three logic levels: interface, application and persistence. In the following, we specify the most relevant features of the Java classes needed to support the intended architecture:
At the interface level we have complementary classes, which are not explicitly used in the conceptual model but which help to implement the interaction between the user and the application, following the underlying object model semantics.
• Access_control class. This class extends a panel with the typical widgets that allow users to be identified as members of the object society (by providing the object identifier, the password and the class to which the user belongs). One access control object is created every time a user wants to connect to the system as an active object sending and receiving messages. This class implements the first step of our execution model.

  import java.awt.*;
  import java.lang.*;
  import excepciones.*;  // exceptions defined for the application

  public class Access_control extends Panel { ... }

• System_view class: once an active object (user) is connected to the system, a system_view object (an instance of the System_view class) will show a page with as many items as classes the user is allowed to interact with. These items are clickable regions that the user can activate in order to see the services (also displayed as clickable items) that he/she can use. The declaration of the class is the following:

  import java.applet.*;
  import java.awt.*;
  import excepciones.*;  // exceptions defined for the application

  public class System_view extends Applet implements Runnable { ... }

• Service_activation class: this class defines a typical web interface for data entry, where the relevant service arguments are requested. The service_activation object is a generic one for all the services of all the classes. Depending on the service activated, it shows the corresponding edit boxes for the identification of the object and the parameters of the service. Once they are filled in, the user can send the message to the destination (Ok) or ignore the data request action (Cancel).

  import java.awt.*;

  public class Service_activation extends Panel { ... }

At the application level we have the classes that implement the behaviour of the business classes specified in the conceptual model. In order to ensure that the implementation of the application classes follows our underlying object model semantics and has persistence facilities, we define our business classes as an implementation of an Oasis interface and an extension of an Object_mediator class [16]. The Oasis interface specifies the necessary services to support the execution model structure, as shown in the following paragraph:
  import java.awt.*;

  interface Oasis {
    void check_preconditions(String event);
    void check_state_transition(String event);
    void check_integrity_constraints();
    void check_triggers();
    ...
  }

At the persistence level a Java class called Object_mediator must be created. It implements the methods for saving, deleting and retrieving system domain objects that are stored in a persistent secondary memory (the object repository). The Object_mediator class has the following general structure:

  class Object_mediator {
    void delete();
    void save();
    void retrieve();
  }

JDBC classes are used for proper interaction with the involved RDBMS servers in the implementation of these methods. Even though we focus on Java in this paper, this design could be translated to any other OO programming language by properly distributing the application components depending on the target environment characteristics and necessities.
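Since the generated persistence code itself is not shown at this level of detail, the following is only a minimal sketch of how retrieve() and save() could be implemented on top of JDBC; the table OBJECTS, its columns OID and STATE, and the JDBC URL are assumptions made purely for illustration, not the tool's actual output.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  // Hypothetical mediator sketch: persists a single "state" attribute per object.
  class Object_mediator_sketch {
    // Hypothetical JDBC data source; a generated application would configure this.
    private static final String URL = "jdbc:odbc:rentacar";

    protected String oid;    // object identifier
    protected String state;  // persistent attribute

    // Retrieves the object state from the relational repository.
    void retrieve() throws SQLException {
      Connection con = DriverManager.getConnection(URL);
      try {
        PreparedStatement ps =
            con.prepareStatement("SELECT STATE FROM OBJECTS WHERE OID = ?");
        ps.setString(1, oid);
        ResultSet rs = ps.executeQuery();
        if (rs.next()) {
          state = rs.getString(1);
        }
      } finally {
        con.close();
      }
    }

    // Saves the (possibly modified) object state back to the repository.
    void save() throws SQLException {
      Connection con = DriverManager.getConnection(URL);
      try {
        PreparedStatement ps =
            con.prepareStatement("UPDATE OBJECTS SET STATE = ? WHERE OID = ?");
        ps.setString(1, state);
        ps.setString(2, oid);
        ps.executeUpdate();
      } finally {
        con.close();
      }
    }
  }

A business class that extends such a mediator inherits retrieve() and save() and can therefore bracket each service execution with a database read and write, as the execution model requires.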
3.2 Distributing Java Classes in a Web Architecture

Once the previous components have been created, they must be properly distributed in a web architecture. Many proposals to distribute Inter/Intranet application components exist. We present a three-tiered web architecture (see Fig. 1) which fits very well with the OO-Method Execution Model features presented above. A client (a web browser) downloads the relevant HTML pages from a web server, together with the applets that make up the application interface. The user interacts with the Java system objects through the web client. These Java system objects are stored in a web application server and query or update the object state stored in a Data Server through the services provided by the JDBC objects. It is important to remark again that the components distributed in this architecture are generated in an automated way using the OO-Method CASE tool. This is done by defining a precise mapping between the finite set of behavioural patterns that constitutes the Conceptual Model in OO-Method and their corresponding software representation in the target software development environment, Java in this work. Preconditions, state transitions, integrity constraints and triggers are declared in the graphical OO-Method models and implemented in each of the Java business classes as presented above. As they are ultimately well-formed formulas in OASIS, these formulas are translated into Java syntax. This is how the problem space
concepts are converted into their corresponding software representation. Let us illustrate these concepts more clearly through an example.
Fig. 1. Three-tiered web architecture
4 Automated Programming in Java. A Case Study

A CASE Tool supporting the OO-Method allows us to model and automatically generate fully functional prototypes in Java. In order to better understand the architecture and behaviour of the previous component classes generated using the Java Execution Model, we introduce a Rent-a-Car case study as a brief example: "A company rents vehicles without drivers. These vehicles are bought at the beginning of the season and usually sold when the season is over. When a customer rents a vehicle, a contract is generated and it remains open until the customer returns the vehicle. At that time, the total amount to be paid is calculated. After this step, the vehicle is ready to be rented again". First, we construct the Conceptual Model (object, dynamic and functional models) by identifying the classes and specifying their relationships and all the static and dynamic properties. Due to space limitations, we cannot present all three OO-Method models (object, dynamic and functional) of this simple example, but let us assume that the classes identified in this problem domain are the following: Contract,
Customer, Vehicle and Company. Every class will have a set of attributes, services, preconditions, integrity constraints, valid transitions, valuations and triggers. Based on the Execution Model proposed above, we obtain the architecture of the web application in an automated way. This architecture includes the Java classes and relational tables attached to the conceptual model, as shown in Table 1.

Table 1. Rent-a-Car System implementation architecture.

  Persistence Tier    Application Tier    Interface Tier
  Customer table      Customer class      Access_control class
  Vehicle table       Vehicle class       System_view class
  Contract table      Contract class      Service_activation class
  ...                 ...                 ...
The Java code that implements a business class in the application tier following the execution model strategy can be seen in the code example for the Vehicle class in the following paragraph (comments have been introduced with the aim of making it self-explanatory):

  package application_tier;

  import excepciones.*;
  import object_mediator.*;
  import oasis.*;

  public class Vehicle extends Object_mediator implements Oasis {

    // Attributes specified in the conceptual model class Vehicle
    private String state;
    ...

    // Events specified in the conceptual model
    // The following method implements the change of the object's state
    protected boolean rent() throws EX_Check_Error {
      state = "rented";
      ...
    }
    ...

    // The following method implements the precondition checking
    // The following method implements the execution of the rent event
    public void eval_rent() {
      try {
        retrieve();                      // retrieves the object from the Database
        check_preconditions("rent");
        check_state_transition("rent");
        rent();
        check_integrity_constraints();
        check_triggers();
        save();                          // saves the object in the Database
      } catch (EX_Check_Error e) { ... }
    }
    ...
  }
Next, we are going to describe an illustrative scenario for the generated Rent-a-Car prototype. This scenario will show the interaction between Java objects and their behaviour when a user enters the object system. When a client loads the main HTML page that calls the Java applet, an instance of the Access_control class is created and user identification is required (see Fig. 2). After a user connects to the Rent-a-Car system, a menu page with an option for every class will appear (see Fig. 2). If the user clicks on a class option, a new menu page associated with the selected class will be generated (including one option for every class event or transaction).
Fig. 2. A User Access Control Page and the Rent-a-Car System View page.
In our example, when the user selects the Vehicle class, the service list offered by this class is shown, as can be seen in Fig. 3.
Every service option activation will generate a new parameter request page, as can be seen in Fig. 3. This page will ask the user for the arguments needed to execute the service. The Ok control button has code associated with it that calls a class method implementing the effect of the service on the object state (a sketch of this dispatching step is given after Fig. 3). This method will check the state transition correctness and the method preconditions. If this checking process succeeds, the object change of state is carried out according to the functional model specification. We finish the method execution by verifying the integrity constraints and the trigger condition satisfaction in the new state. Object state updates in the selected persistent object system become valid through the services inherited from the Object_mediator class.
Fig. 3. Vehicle Class Services Menu Page and Parameter Request Page.
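To make the dispatching step described above more concrete, the following is a minimal sketch, not the code generated by the OO-Method tool, of how the Ok button of a parameter request page could invoke the business class. The field names, the stand-in VehicleStub class, and the use of the Java 1.1 AWT event model are assumptions made for illustration.

  import java.awt.Button;
  import java.awt.Panel;
  import java.awt.TextField;
  import java.awt.event.ActionEvent;
  import java.awt.event.ActionListener;

  // Minimal stand-in for the generated business class shown above.
  class VehicleStub {
    private final String oid;
    VehicleStub(String oid) { this.oid = oid; }
    // retrieve, check state transition/preconditions, change state,
    // check constraints/triggers, save (see eval_rent above)
    public void eval_rent() { }
  }

  // Hypothetical parameter request panel for the "rent" service of class Vehicle.
  class Parameter_request_sketch extends Panel implements ActionListener {
    private final TextField oidField = new TextField(10); // object identifier argument
    private final Button ok = new Button("Ok");

    Parameter_request_sketch() {
      add(oidField);
      add(ok);
      ok.addActionListener(this);
    }

    // When Ok is pressed, the message is built and sent to the destination object.
    public void actionPerformed(ActionEvent e) {
      if (e.getSource() == ok) {
        new VehicleStub(oidField.getText()).eval_rent();
      }
    }
  }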
5 Conclusions and Further Work

Advanced web applications will depend on object technology to make them feasible, reliable and secure. To achieve this goal, well-defined methodological frameworks, which properly connect OO conceptual modeling and OO software development environments, must be introduced. The OO-Method provides such an environment. Its most relevant features are the following:
• an operational implementation of an automated programming paradigm, where a concrete execution model is obtained from a process of conceptual model translation
• a precise object-oriented model, where the use of a formal specification language as a high-level data dictionary is a basic characteristic
All of this is done within next-generation web development environments, making use of Inter/Intranet architectures and using Java as the software development language. Research work is still being undertaken to improve the quality of the generated final software product, including advanced features such as user-defined interfaces, schema evolution, optimized database access mechanisms, and the impact of the structure of the generated code on the performance of the system.
Acknowledgments. We wish to thank the anonymous referees for their valuable comments and suggestions.
References

1. Arnold, K., Gosling, J.: The Java Programming Language. Sun Microsystems, Addison-Wesley (1996).
2. Rumbaugh, J. et al.: Object Oriented Modeling and Design. Prentice-Hall, Englewood Cliffs, NJ (1991).
3. Liddle, S.W., Embley, D.W., Woodfield, S.N.: Unifying Modeling and Programming Through an Active, Object-Oriented, Model-Equivalent Programming Language. In Proceedings of the 14th International Conference on Object-Oriented and Entity-Relationship Modeling (OO-ER'95), 13-15 Dec 1995, Gold Coast, Australia. Lecture Notes in Computer Science, Vol. 1021, pp. 55-64 (1995).
4. Jacobson, I. et al.: OO Software Engineering, a Use Case Driven Approach. Addison-Wesley, Reading, Massachusetts (1992).
5. Booch, G.: OO Analysis and Design with Applications. Addison-Wesley (1994).
6. Booch, G., Rumbaugh, J., Jacobson, I.: UML v1. Rational Software Co. (1997).
7. Rational Software Corporation: Rational Rose User's Manual (1997).
8. Ptech FrameWork. Ptech Inc., Boston, MA, USA. Web site: http://www.ptechinc.com/ (1998).
9. Platinum Technology, Inc.: Paradigm Plus: Round-Trip Engineering for JAVA, White Paper. Platinum web site: http://www.platinum.com/ (1997).
10. Balzer, R. et al.: Software Technology in the 1990s: Using a New Paradigm. IEEE Computer, Nov. 1983.
11. Pastor, O., Hayes, F., Bear, S.: OASIS: An Object-Oriented Specification Language. In P. Loucopoulos (ed.), Proceedings of the CAiSE'92 Conference, pp. 348-363, Springer, Berlin, LNCS 593 (1992).
12. Clyde, S.W., Embley, D.W., Woodfield, S.N.: Tunable Formalism in Object-Oriented System Analysis: Meeting the Needs of Both Theoreticians and Practitioners. In Proceedings of the OOPSLA'92 Conference, Vancouver, Canada, pp. 452-465 (1992).
13. Pastor, O., Ramos, I.: OASIS 2.1.1: A Class-Definition Language to Model Information Systems Using an Object-Oriented Approach, October 1995 (3rd edition).
14. Pastor, O. et al.: OO-Method: An OO Software Production Environment Combining Conventional and Formal Methods. In A. Olivé and J.A. Pastor (eds.), Proceedings of the CAiSE'97 Conference, pp. 145-158, Springer-Verlag, Berlin, LNCS 1250, June 1997.
15. Jackson, R.B., Embley, D.W., Woodfield, S.N.: Automated Support for the Development of Formal Object-Oriented Requirements Specification. In Proceedings of the CAiSE'94 Conference, Utrecht, The Netherlands. Lecture Notes in Computer Science, Vol. 811, pp. 135-148 (1994).
16. Argawal, S., Jensen, R., Keller, A.M.: Architecting Object Applications for High Performance with Relational Databases. In OOPSLA Workshop on Object Database Behaviour, Benchmarks, and Performance, Austin (1995).
An Evaluation of Two Approaches to Exploiting Real-World Knowledge by Intelligent Database Design Tools

Shahrul Azman Noah¹ and Michael Lloyd-Williams²

¹ Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK
[email protected]
² School of Information Systems and Computing, University of Wales Institute Cardiff, Colchester Avenue, Cardiff CF3 7XR, UK
[email protected]
Abstract. Recent years have seen the development of a number of expert system type tools whose primary objective is to provide support to a human during the process of database analysis and design. However, whereas human designers are able to draw upon their experience and knowledge of the real world when performing such a task, knowledge-based database design tools are generally unable to do so. This has resulted in numerous calls for the development of tools that are capable of exploiting real-world knowledge during a design session. It has been claimed that the use of such knowledge has the potential to increase the appearance of intelligence of the tools, to improve the quality of the designs produced, and to increase processing efficiency. However, to date, little if any formal evaluation of these claims has taken place. This paper presents such an evaluation of two of the approaches proposed to facilitate system storage and exploitation of real-world knowledge: the thesaurus approach and the knowledge reconciliation approach. The results obtained demonstrate that certain aspects of the claimed benefits associated with the use of real-world knowledge have been achieved. However, the extent to which these benefits have been attained and subsequently statistically validated varies.
1 Introduction

Recent years have seen the development of a number of intelligent database design tools that employ expert system technology in order to provide support to a human designer during the process of database analysis and design [3, 12, 24]. Such tools are generally intended to act as assistants to human designers [26], being capable of providing guidance and advice, proposing alternate solutions, and helping to investigate the consequences of design decisions [11]. The effectiveness of existing tools has demonstrated the viability of representing database design expertise in a computer program; however, observing such systems in
use makes it clear that human designers contribute far more than database design expertise to the design process [25]. Human designers, even when working in an unfamiliar domain, are able to make use of their knowledge of the real world in order to interact with users, make helpful suggestions and inferences, and identify potential errors and inconsistencies [20, 24]. Conversely, the majority of existing intelligent database design tools do not possess such real-world knowledge, and are therefore required to ask many questions during a design session that may be viewed as being trivial [10, 20]. A human designer, for instance, would recognize terms such as "client", "customer" and "patron" as being potentially synonymous, regardless of the application domain. Existing intelligent database design tools are unable to identify such situations. This situation has resulted in numerous calls for the representation of real-world knowledge within such tools, coupled with the ability to reason with and make use of this knowledge. A number of approaches to representing and exploiting such real-world knowledge have been proposed, including the thesaurus approach [10, 11] and the knowledge reconciliation approach [21, 25]. These approaches have been accompanied by various claims [10, 20] that the use of such knowledge has the potential to increase the appearance of intelligence of the tools, to improve the quality of the designs produced, and to increase processing efficiency. However, to date, little if any formal evaluation of these claims has taken place. This paper presents an evaluation of the thesaurus and knowledge reconciliation approaches, as originally employed by the Object Design Assistant [9, 10] and the View Creation System [22, 23] respectively, the intention being to initiate the gathering of evidence to support or refute the claims previously stated.
2 Method of Investigation

In order to conduct evaluative experiments on the use of the thesaurus and knowledge reconciliation approaches, a prototype intelligent database design tool, the Intelligent Object Analyzer (IOA), was developed. IOA provides support for the design of the structural (data) aspects of object-oriented databases. The intended user is a database designer or systems analyst who is familiar with systems modeling concepts and the domain to be modeled. Knowledge of object-oriented databases or of object-oriented analysis and design techniques is not a requirement. It is not the purpose of this paper to discuss IOA in depth; however, a brief outline of its structure and method of operation is required in order to illustrate how the real-world knowledge may be represented and exploited during design processing. The current version of the IOA system runs in a PC environment, and was developed using Common LISP (Allegro CL\PC). The IOA knowledge-base contains a mixture of rules and facts. Rules correspond to knowledge of how to perform the design task (the order in which design activities take place), detecting and resolving ambiguities, redundancies and inconsistencies within an evolving design, and handling the gradual augmentation of an evolving design as a design session
progresses. Facts are used to represent two views of the application domain: an initial representation (the problem domain model) as provided by the user, and the object-oriented design generated from this initial representation. During a design session, IOA follows a two-step procedure.
• The first step involves creating an initial representation of the application domain (known as the problem domain model) and the subsequent refinement of this model.
• The second step involves the refinement of the problem domain model by detecting and resolving any inconsistencies that may exist, and the transformation of the model into object-oriented form.
The first stage of processing requires a set of declarative statements that describe the application domain to be submitted to IOA. These statements are a variation of the method of interactive schema specification described by Baldiserra et al [1], being based upon the binary model described by Bracchi et al [5]. Each statement links together two concepts (taking the form A verb-phrase B), and falls into one of three classes of construct, corresponding directly to the structural abstractions of association, generalization, and aggregation. The statements are used to construct a problem domain model representing the application domain. Once constructed, IOA attempts to confirm its understanding of the semantic aspects of the problem domain model; that is, whether each structure within the model represents generalization, aggregation or association. The problem domain model is then submitted to a series of refinement procedures in order to detect and resolve any inconsistencies (such as redundancies that may be present within generalization hierarchies) that may exist. These procedures are performed both with and without the requirement of user input (sometimes referred to as external and internal validation respectively). Once such inconsistencies have been resolved, IOA makes use of the problem domain model in order to generate a conceptual model (in object-oriented form). As previously discussed, the IOA system has been developed in order to assist with a series of experiments designed to evaluate the contribution of real-world knowledge to the activities of intelligent database design tools. In order to facilitate this aim, IOA is capable of conducting design sessions both with and without making use of real-world knowledge.
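The binary statements themselves are simple. Purely as an illustration (IOA itself is implemented in Common LISP, and the class and field names below are assumptions made for the example, not IOA's internal representation), a statement of the form A verb-phrase B together with its construct category could be represented as follows:

  // Illustrative sketch only; not IOA's actual LISP-based representation.
  final class BinaryStatement {
    static final int ASSOCIATION = 0;
    static final int GENERALIZATION = 1;
    static final int AGGREGATION = 2;

    final String conceptA;    // e.g. "Department"
    final String verbPhrase;  // e.g. "offers"
    final String conceptB;    // e.g. "Course"
    final int construct;      // one of the three structural abstractions above

    BinaryStatement(String a, String verbPhrase, String b, int construct) {
      this.conceptA = a;
      this.verbPhrase = verbPhrase;
      this.conceptB = b;
      this.construct = construct;
    }
  }

  class ProblemDomainModelSketch {
    public static void main(String[] args) {
      // Example statements of the form "A verb-phrase B" for a university domain.
      BinaryStatement[] statements = {
        new BinaryStatement("Department", "offers", "Course",
                            BinaryStatement.ASSOCIATION),
        new BinaryStatement("Student", "includes", "Undergraduate-student",
                            BinaryStatement.GENERALIZATION),
        new BinaryStatement("Faculty", "consists-of", "Department",
                            BinaryStatement.AGGREGATION)
      };
      System.out.println(statements.length + " statements in the problem domain model");
    }
  }

A collection of such statements forms the problem domain model, which the tool then refines and transforms into an object-oriented conceptual model.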
2.1 Representation of Real-World Knowledge
The following text provides a brief overview of the methods of knowledge representation employed by the thesaurus and knowledge reconciliation approaches. Those interested in further details of each approach, along with the claimed benefits associated with their use, are referred to the relevant source literature (see for instance [10, 11, 21, 25]). Figs. 1 and 2 present illustrative fragments of real-world knowledge (a university domain) using the thesaurus and knowledge reconciliation approaches respectively. The knowledge presented in Figs. 1 and 2 is not claimed to be statistically
representative, but is seen as a reasonable representation of certain aspects of a university domain. The main purpose is to provide an illustration of the thesaurus and knowledge reconciliation approaches, not to produce the definitive structures for the domain concerned.
Fig. 1. Fragment of knowledge (university) represented using the thesaurus approach. [Figure: a graph of concepts (Faculty, Department, Lecturer/Academic-staff, Course/Programme, Student, Postgraduate-student/Graduate-student, Undergraduate-student) connected by aggregation (aggr), association (assoc) and generalization (gen) links annotated with cardinalities (1, N) and mandatory/optional membership constraints.]
Fig. 2. Fragment of knowledge (university) represented using the knowledge reconciliation approach. [Figure: concepts Faculty, Department, Lecturer, Course, Student, Postgraduate-student and Undergraduate-student connected by an aggregation link, a generalization (gen) link, and explicitly named association links such as Attached-to, Allocated-to and Enrolled.]
It can be seen that although the thesaurus and knowledge reconciliation approaches share similar semantic constructs (in representing domains using a series of concepts linked together via abstraction mechanisms), a number of differences are apparent. The domain concepts represented by the thesaurus approach may be referred to by any number of associated synonyms where appropriate [11]. This is not the case with the knowledge reconciliation approach. Both approaches employ abstraction mechanisms (generalization, aggregation and association) to link concepts together; however, the knowledge reconciliation approach requires association links to be explicitly named. All abstraction mechanisms represented by the thesaurus
approach are categorized rather than being named, thus allowing the links between any pair of concepts to take any name provided by the user. Integrity constraints are not represented by the knowledge reconciliation approach, nor are membership requirements (mandatory or optional) for links between pairs of concepts. Both forms of constraint, however, are represented by the thesaurus approach. The IOA system is capable of processing in three modes: without the use of real-world knowledge (basic mode), using real-world knowledge provided by the thesaurus approach, and using real-world knowledge provided by the knowledge reconciliation approach. The basic mode of processing has been described earlier in this paper. At various stages during basic mode processing, IOA conducts a dialogue with the user in order to confirm its understanding of the application domain or to obtain additional information. When making use of real-world knowledge provided by the thesaurus approach, the tool refers to this knowledge wherever possible, only resorting to questioning the user if the real-world knowledge cannot provide the required information. This procedure is also followed when making use of real-world knowledge provided by the knowledge reconciliation approach. However, in addition to this procedure, IOA also attempts to conduct a process of reconciliation of knowledge, where the description of the application domain submitted by the user is compared and matched with the system-held real-world knowledge in order to assist IOA in identifying potential missing elements (concepts and relationships) within the user's domain description.
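As a rough illustration of the structural differences just described, and only as a sketch (the class and field names here are assumptions, not the tools' actual internal structures), the two kinds of knowledge fragment could be contrasted as follows:

  // Thesaurus approach: concepts may carry synonyms; links are categorized
  // (gen/aggr/assoc) and may carry cardinality and membership constraints.
  class ThesaurusLink {
    String fromConcept;        // e.g. "Lecturer"
    String[] fromSynonyms;     // e.g. { "Academic-staff" }
    String toConcept;          // e.g. "Department"
    String category;           // "gen", "aggr" or "assoc" (categorized, not user-named)
    String cardinality;        // e.g. "N:1" (assumed value for illustration)
    boolean mandatory;         // membership requirement
  }

  // Knowledge reconciliation approach: no synonyms, constraints or membership
  // requirements, but association links are explicitly named.
  class ReconciliationLink {
    String fromConcept;        // e.g. "Lecturer"
    String toConcept;          // e.g. "Department"
    String linkName;           // e.g. "Attached-to"
  }

  class KnowledgeFragmentSketch {
    public static void main(String[] args) {
      ThesaurusLink t = new ThesaurusLink();
      t.fromConcept = "Lecturer";
      t.fromSynonyms = new String[] { "Academic-staff" };
      t.toConcept = "Department";
      t.category = "assoc";
      t.cardinality = "N:1";
      t.mandatory = true;

      ReconciliationLink r = new ReconciliationLink();
      r.fromConcept = "Lecturer";
      r.toConcept = "Department";
      r.linkName = "Attached-to";

      System.out.println(t.fromConcept + " --" + t.category + "--> " + t.toConcept);
      System.out.println(r.fromConcept + " --" + r.linkName + "--> " + r.toConcept);
    }
  }

The extra fields on the thesaurus side (synonyms, cardinality, membership) are precisely the additional information discussed later when interpreting the performance results.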
2.2 Testing and Evaluation Strategy
In order to evaluate the use of real-world knowledge by IOA, a number of domains were modeled using both the thesaurus and knowledge reconciliation approaches. In each case, the real-world knowledge structures were developed independently of the example scenarios used during testing. This was a deliberate attempt to minimize any bias that might be introduced by taking the content of the test material into account. In normal circumstances, it would be a logical procedure to use each example application domain encountered to augment the real-world knowledge held, thus increasing the knowledge of the system in the way a human designer would automatically update his/her knowledge when working within a new domain. However, for the purposes of testing the effectiveness of the use of real-world knowledge, it was decided to develop the knowledge structures completely independently of the examples encountered. For each example domain, the evaluation process involved the execution of a set of benchmarks (test-cases) in basic mode, and then exploiting the knowledge provided by the thesaurus and knowledge reconciliation approaches. Thus, for each example domain, three sets of results were obtained and compared, following an approach recommended by O'Keefe & Preece [16]. The test-cases used were generated from a set of design problems which were primarily extracted from the available literature, the advantage being that the accompanying solution could be used as a benchmark and compared with the IOA-suggested solution in order to confirm the appropriateness or
otherwise of the designs produced. Each of the example design problems was systematically altered by dividing it into multiple test-cases with varying degrees of complexity. Within the scope of the testing, the complexity of a design test-case is defined as the number of concepts and relationships between the concepts [8, 17]. For instance, the university design problem found in Rob & Rob [19] was systematically fragmented to generate a total of five test-cases, having complexity degrees of 3, 7, 10, 13 and 17 respectively, as illustrated by Fig. 3.
Fig. 3. Generation of test-cases with varying degrees of complexity. [Figure: an example design problem with concepts School, Department, Dean, Professor, Course, Section and Student linked by relationships such as Operates, Employs, Chairs, Runs, Teaches, Offers, Contains, Has and Advises, fragmented into Test-cases 1 to 5 of increasing scope.]
The number and the quality of the test-cases employed have a direct and significant impact on the reliability of the results produced. Exhaustive testing, although
generally desirable, is impractical, since a large number of test-cases must be executed and evaluated even for the simplest of design problems [7]. The results presented in this paper emanate from a series of tests performed on university domain problems found in the general literature [2, 4, 6, 19]. A total of 24 test-cases were generated from these initial problems, thereby providing the observed results with statistical validity, the required number of 15 observations [18] for this form of experiment being exceeded. The main criteria of interest used during the evaluation are as follows:
• Processing time. Processing time refers to the CPU time required to perform a single design action (such as resolving an inconsistency). Processing time is not influenced by human factors, as it is measured from the point at which the tool commences an action until that action is complete. Processing time is, however, influenced by the complexity of the design input, the complexity of the system-held domain knowledge and the reasoning associated with it, and the specification of the processor of the PC in use.
• User/tool interaction. User/tool interaction refers to the number of interactions required between the tool and the user in order for the tool to confirm its understanding of some aspect of the application domain or to acquire additional information should it be required.
• Suggestion of missing design elements. This criterion measures whether the elements (within the generated design) are based entirely upon user-provided information, or are included as a direct result of the system consulting its real-world knowledge.
• Completeness of the resulting design. Completeness is defined as the ability of a data model to meet all the user information requirements [13]. Within the scope of the testing performed, completeness is measured in terms of the number of missing classes and relationships associated with the design example used.
In order to prevent bias during testing, the processing time and the number of user/tool interactions measured did not include processing arising as a direct result of suggestions made by the tool (for instance, relating to potential missing elements within the evolving design) as a result of consulting its encapsulated real-world knowledge. The assumption underpinning this decision is that the increased processing time and number of user/tool interactions involved in such processing are beneficial to the design process, and should not be viewed as being detrimental to performance efficiency. The research hypothesis of this investigation is that the use of real-world knowledge by an intelligent database design tool has the capability of increasing the efficiency of the tool (by reducing the processing time and the number of user/tool interactions required); increasing the completeness of the resulting design output (by minimizing the number of missing elements); and increasing the appearance of tool intelligence (by providing suggestions for missing information and minimizing the number of interactions required).
3 Analysis of Results

As previously discussed, a total of 24 test-cases were generated and executed within each of the three available processing modes. Results obtained from processing the test-cases using the real-world knowledge (the thesaurus and knowledge reconciliation approaches) were compared with the results obtained when no such knowledge was in use (the basic approach). Table 1 provides a preliminary overview of the results.

Table 1. Preliminary overview of results

  Criteria                                     Basic   Thesaurus   Knowledge Reconciliation
  Mean CPU time per complexity (sec)           3.86    3.22        6.94
  Mean user/tool interactions per complexity   3       2           7
  Mean suggested elements per test             0       0           4
  Mean missing elements per test               9       9           5
Table 1 illustrates that the thesaurus approach required a lower mean CPU time and number of user/tool interactions per complexity compared with the basic approach. In contrast, the knowledge reconciliation approach required a higher mean CPU time and number of user/tool interactions per complexity. These results are supported by the linear regression results for the CPU time (in seconds) and the number of interactions required by the three approaches, as illustrated in Fig. 4 and Fig. 5. These figures illustrate that, compared to the basic approach, the thesaurus approach resulted in a reduction of approximately 6.7% in the CPU time required, and of approximately 14.3% in the number of user/tool interactions required, for each increase in complexity. The knowledge reconciliation approach, however, resulted in an increase in the CPU time required and in the user/tool interactions required of approximately 54.1% and 21.4% respectively.
Fig. 4. CPU time required for each increase in complexity. [Figure: linear regression lines of CPU time (sec.) against complexity for the basic, thesaurus and knowledge reconciliation approaches.]
Fig. 5. User/tool interaction required for each increase in complexity. [Figure: linear regression lines of the number of user/tool interactions against complexity for the basic, thesaurus and knowledge reconciliation approaches.]
Table 1 also illustrates the potential of the knowledge reconciliation approach to provide suggestions for the inclusion of (required) elements within the resulting designs. Such elements would not be identified or included when using the other approaches. Thus the knowledge reconciliation approach has the capacity to facilitate a greater level of completeness of the designs produced. These preliminary findings provide a general overview of the results obtained. In order to validate the effectiveness of each of the approaches, a statistical hypothesis test was conducted in order to assess the significance of the differences between the results observed from the execution of test-cases both with and without the use of real-world knowledge, at the 5% significance level. Although there are several recommended statistical methods available to test such hypotheses, the paired t-test method is highly appropriate in circumstances such as those prevailing in this study [14, 15]. A discussion of the statistical analysis of the observed results follows.
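For reference, the statistic underlying the paired t-test is the standard one (this formulation is not taken from the paper itself): with d_i denoting the difference between the paired observations for test-case i and n the number of test-cases,

  t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
  \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
  s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2},

with n - 1 degrees of freedom; here n = 24 test-cases, which matches the df = 23 reported in Tables 2 and 3.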
3.1 The Thesaurus Approach
Based upon the paired t-test results presented in Table 2, it is apparent that there are significant differences between the thesaurus approach and the basic approach in terms of the number of user/tool interactions and the CPU time required per complexity. The observed significance level and the negative t-Value¹ suggest that the null hypothesis should be rejected for both criteria. It may, therefore, be stated that
¹ As the objective of the statistical analysis was to validate whether the approaches taken to representing real-world knowledge significantly reduced or increased any of the evaluation criteria, referring to the P value alone will not provide a sufficient result, as it only informs as to whether there is any significant difference between the observed results. In this case, however, the t-Value can be used [18], where a negative t-Value implies that the observed criterion is significantly reduced by the use of real-world knowledge and a positive t-Value implies otherwise.
the thesaurus approach increased the overall processing efficiency by reducing the number of user/tool interactions and the CPU time required per complexity.

Table 2. Paired t-test results - the thesaurus and basic approaches

  Criteria                      t-Value   df   Sig. T (P)
  Interaction per complexity    -3.30     23   0.003
  CPU time per complexity       -3.39     23   0.002
  Suggested elements per test   N/A       23   N/A
  Completeness per test         N/A       23   N/A
However, Table 2 also indicates that the completeness and suggested-elements-per-test criteria were not significantly different between the two approaches (the statistical test was invalid, as neither the thesaurus nor the basic approach provides suggestions for missing information; therefore, both approaches result in similar numbers of missing elements within the resulting designs). Accordingly, the use of the thesaurus approach has not resulted in an improvement in the quality of the resulting design output (measured in terms of increasing the completeness of the designs produced). The significant reduction in the number of user/tool interactions required suggests an increase in the appearance of intelligence of the tool. However, the (statistical) non-significance of the suggested-elements-per-test criterion may be viewed as jeopardizing this claim.
3.2 The Knowledge Reconciliation Approach
The capability of the knowledge reconciliation approach to increase the appearance of tool intelligence (by facilitating the suggestion of potential missing design elements) is evidenced by the results illustrated in Table 3. In this case, P < 0.05 with a positive t-Value indicates that the criterion of suggesting related elements was significantly increased with the use of the knowledge reconciliation approach. However, the number of user/tool interactions was not reduced (although P < 0.05, the t-Value is positive, indicating that the knowledge reconciliation approach significantly increased the number of interactions per complexity), resulting in the conclusion that the claim of an overall increase in the appearance of tool intelligence is unjustifiable.

Table 3. Paired t-test results - the knowledge reconciliation and basic approaches

  Criteria                      df   t-Value   Sig. t (P)
  Interaction per complexity    23   8.11      0.000
  CPU time per complexity       23   9.26      0.000
  Suggested elements per test   23   6.09      0.000
  Completeness per test         23   -6.09     0.000
The claim of an increase in the overall processing efficiency of the tool as a result of the use of the knowledge reconciliation approach is not supported by the results obtained. This is evidenced by the absence of any reduction in the number of user/tool interactions and CPU time required: although P < 0.05 in both cases, the t-Values are positive, indicating that the required CPU time and user/tool interactions per complexity increased overall when using the knowledge reconciliation approach. As the capability of suggesting missing design elements was shown to be significantly increased, it may be argued that the completeness of the designs produced should be correspondingly increased (by minimizing the number of missing elements). This aspect is illustrated in Table 3, where P < 0.05 and is accompanied by a negative t-Value. Therefore, the improvement in the completeness of the designs suggests that the claim of increasing the quality of the designs produced has been met, in terms of this criterion, for the knowledge reconciliation approach.
4 Discussion and Conclusion
Tables 4 and 5 present a summary of the conclusions reached for both the thesaurus and knowledge reconciliation approaches.
Table 4. Summary of conclusions

Criteria                                           Thesaurus       Knowledge Reconciliation
Reduces the number of user/tool interactions       Yes             No
Reduces the CPU time required                      Yes             No
Increases the no. of missing elements suggested    No              Yes
Increases the completeness of designs produced     No              Yes

Table 5. Summary of conclusions for the overall claims

Criteria                                           Thesaurus       Knowledge Reconciliation
Increases overall tool processing efficiency       Yes             No
Improves quality of designs produced               No              Yes
Increases overall appearance of tool intelligence  Unjustifiable   Unjustifiable
Both tables provide conclusive evidence that certain aspects of the claimed benefits have generally been achieved by the use and exploitation of real-world knowledge represented by the thesaurus and knowledge reconciliation approaches.
The claim of increased overall processing efficiency in intelligent database design tools through the use of real-world knowledge has been met by the thesaurus approach. This conclusion follows from the significant reduction in the number of user/tool interactions and in the CPU time required for each increase in complexity. These reductions occurred to a certain extent because of the additional information held by the thesaurus approach as opposed to the knowledge reconciliation approach; for instance, constraint-related and membership-requirement-related information is not represented by the knowledge reconciliation approach, yet both have the potential to impact upon the performance-related criteria. The claim of an improvement in the quality of the designs produced, on the other hand, has only been met by the knowledge reconciliation approach, as a result of the significant increase in the completeness of the overall designs produced. The increase in completeness was due to the fact that the approach is capable of identifying which elements are thought to be missing, and of suggesting the inclusion of these elements to the user. These results suggest that system-held real-world knowledge has the potential to guide the tool in playing an active part during the design process, and at the same time to increase aspects of the appearance of intelligence of the tool [20]. The question of whether the use of real-world knowledge could increase the overall appearance of intelligence of an intelligent database design tool, therefore, remains largely unresolved, as neither approach produced conclusive results. The thesaurus approach, although it significantly reduced the number of user/tool interactions required, was accompanied by a non-significant increase in the number of suggestions made regarding possible missing design elements within the evolving design. The knowledge reconciliation approach, on the other hand, although it significantly increased the number of suggestions made for missing design elements, was unable to significantly reduce the number of user/tool interactions required. Although encouraging results have been obtained from the testing and evaluation work, it is recognized that consideration must be given to a number of practical issues. The effectiveness of the tool depends greatly on the accuracy and completeness of the system-held real-world knowledge, and the results obtained from the tests may be influenced to a certain extent by the variety and coverage of the generated test-cases.
5 Summary and Future Work
This paper has presented the findings of an assessment of the thesaurus and knowledge reconciliation approaches to representing and exploiting real-world knowledge in an intelligent database design tool (IOA). The intention of this experiment has been to evaluate the claims made regarding the use of domain knowledge (represented by the thesaurus and knowledge reconciliation approaches) by intelligent database design tools, and not to compare the effectiveness or efficiency of these representative approaches. The results obtained have demonstrated that certain aspects of the claimed benefits associated with the use of such real-world knowledge
(increased processing efficiency, improved quality of the designs produced, and an increased appearance of intelligence) have been achieved. However, the extent to which these benefits have been attained and subsequently statistically validated varies. Similar experiments were conducted in the clinical/hospital and library domains, and findings consistent with those presented in this paper were obtained. Although there are a number of methodologies proposed for the testing and evaluation of expert systems [14, 15], only a handful of papers describe actual experiences related to the testing and performance evaluation of operational systems. This paper has presented an approach and methodology for performing such a task. The methodology employed involved the generation of a series of test-cases, the processing of the test-cases, and the use of statistical analysis to evaluate the observed results. Ongoing work includes extending the variety of the evaluative experiments by subjecting the tool to a range of test-cases containing a series of intentionally generated errors. Such test-cases contain a combination of different types and numbers of synthesized errors, including synonymous class(es), synonymous relationship(s), and combinations of both.
Acknowledgments. The authors wish to thank the anonymous referees for their helpful and constructive comments on a previous version of this paper.
References 1. Baldiserra, C., Ceri, S., Pelagatti, G. & Bracchi, G. (1979) “Interactive specification and formal verification of user's views in database design”, In: Proceedings of the 5th International Conference on Very Large Databases, Rio de Janeiro, Brazil. 262-272. 2. Batini, C., Ceri, S. & Navathe, S. (1992) Conceptual Database Design: An Entity Relationship Approach. Redwood City, CA: Benjamin-Cummings. 3. Bouzeghoub, M. (1992). “Using expert systems in schema design”. In Loucopoulos, P. & Zicari, R. (eds.) Conceptual Modeling, Databases, and CASE: an Integrated View of Information Systems Development. New York: Wiley, 465-487. 4. Bowers, D. S. (1993) From Data to Database. London: Chapman Hall. 5. Bracchi, G., Paolini, P. & Pelagatti, G. (1976) “Binary logical associations in data modeling”. In: Nijsen, G. M. (ed.) Modeling in Data Base Management Systems. Amsterdam: North-Holland, 125-148. 6. Elmasri, R. & Navathe, S. B. (1989) Fundamentals of Database Systems. Redwood City, CA: Benjamin Cummings. 7. Gonzalez, A. J., Gupta, U. G. & Chianese, R. B. (1996). “Performance evaluation of a large diagnostic expert system using a heuristic test case generator”. Engineering Application of Artificial Intelligence, 9(3), 275-284. 8. Kesh, S. (1995) “Evaluating the quality of entity relationship models”. Information and Software Technology, 37(12), 681-689. 9. Lloyd-Williams, M. (1993). “Expert system support for object-oriented database design”. International Journal of Applied Expert Systems, 1(3), 197-212. 10. Lloyd-Williams, M. (1994). “Knowledge-based CASE tools: improving performance using domain specific knowledge”. Software Engineering Journal, 9(4), 167-173.
11. Lloyd-Williams, M. (1997). “Exploiting domain knowledge during the automated design of object-oriented databases”, In: Embley, D. W. & Goldstein, R. C. (eds.), Proceedings of the 16th International Conference on Conceptual Modeling, Berlin: Spinger-Verlag, 16-29. 12. Lloyd-Williams, M. & Beynon-Davies, P. (1992). “Expert system for database design: a comparative review”. Artificial Intelligence Review, 6, 263-283. 13. Moody, D. L. & Shanks, G. G. (1994). “What makes a good data model? Evaluating the quality of entity-relationship models”, In: Loucopoulos, P. (eds.), Proceedings of the 13th International Conference on the Entity-Relationship Approach, Berlin: Springer-Verlag, 94-101. 14. O'Keefe, R. M., Balci, O. & Smith, E. P. (1987). “Validating expert system performance”. IEEE Expert, Winter, 81-90. 15. O'Keefe, R. M. & O'Leary, D. E. (1993). “Expert system verification and validation: a survey and tutorial”. Artificial Intelligence Review, 7(1), 3-42. 16. O'Keefe, R. M. & Preece, A. D. (1996) “The development, validation and implementation of knowledge-based systems”. European Journal of Operational Research, 92(3), 458-473. 17. Pippenger, N. (1978) “Complexity theory”. Scientific American, 238(6), 90-102. 18. Rees, D. G. (1995) Essential Statistics (3rd Edition). London: Chapman & Hall. 19. Rob, P. & Rob, C. C. (1993) Database Systems: Design, Implementation and Management. Belmont, CA: Wadsworth Publishing. 20. Storey, V. C. (1992). “Real world knowledge for databases”. Journal of Database Administration, 3(1), 1-19. 21. Storey, V. C., Chiang, R. H. L., Dey, D., Goldstein, R. C., Sundararajan, A. & Sundaresan, S. (1994) “Knowledge reconciliation for common sense reasoning”. In: De, P. & Woo, C. (eds.) Proceeding of the 4th Annual Workshop on Information Technologies and Systems. Vancouver: Univ. British Columbia, 87-96. 22. Storey, V. C. & Goldstein, R. C. (1990a). “Design and development of an expert database design system”. International Journal of Expert Systems Research and Applications, 3(1), 31-63. 23. Storey, V. C. & Goldstein, R. C. (1990b). “An expert view creation system for database design”. Expert Systems Review, 2(3), 19-45. 24. Storey, V. C. & Goldstein, R. C. (1993). “Knowledge-based approach to database design”. Management Information Systems Quarterly, 17(1), 25-46. 25. Storey, V. C., Goldstein, R. C., Chiang, R. H. L. & Dey, D. (1993). “A common-sense reasoning facility based on the entity-relationship model”, In: Elmasri, R. A., Kouramajian, V. & Thalheim, B. (eds.) Proceedings of the 12th International Conference on the Entity Relationship Approach, Berlin: Springer-Verlag, 218-229. 26. Vessey, I. & Sravanapudi, A. P. (1995) “CASE tools as collaborative support technologies”. Communications of the ACM, 37(1), 83-102.
Metrics for Evaluating the Quality of Entity Relationship Models

Daniel L. Moody
Simsion Bowles and Associates, 1 Collins St., Melbourne, Australia 3000.
email: [email protected]

Abstract. This paper defines a comprehensive set of metrics for evaluating the quality of Entity Relationship models. This is an extension of previous research which developed a conceptual framework and identified stakeholders and quality factors for evaluating data models. However quality factors are not enough to ensure quality in practice, because different people will have different interpretations of the same concept. The objective of this paper is to refine these quality factors into quantitative measures to reduce subjectivity and bias in the evaluation process. A total of twenty five candidate metrics are proposed in this paper, each of which measures one of the quality factors previously defined. The metrics may be used to evaluate the quality of data models, choose between alternatives and identify areas for improvement.
1 Introduction
The choice of an appropriate representation of data is one of the most crucial tasks in the entire systems development process. Although the data modelling phase represents only a small proportion of the total systems development effort, its impact on the final result is probably greater than that of any other phase (Simsion, 1994). The data model is a major determinant of system development costs (ASMA, 1996), system flexibility (Gartner, 1992), integration with other systems (Moody and Simsion, 1995) and the ability of the system to meet user requirements (Batini et al., 1992). For this reason, effort expended on improving the quality of data models is likely to pay off many times over in later phases.

Previous Research
Evaluating the quality of data models is a discipline which is only just beginning to emerge. Quantitative measurement of quality is almost non-existent. A number of frameworks for evaluating the quality of data models have now been proposed in the literature (Roman, 1985; Mayer, 1989; von Halle, 1991; Batini et al., 1992; Levitin and Redman, 1994; Simsion, 1994; Moody and Shanks, 1994; Krogstie, Lindland and Sindre, 1995; Lindland, Sindre and Solveberg, 1994; Kesh, 1995; Moody and Shanks, 1998). Most of these frameworks suggest criteria that may be used to evaluate the quality of data models. However, quality criteria are not enough on their own to ensure quality in practice, because different people will generally have different interpretations of what they mean. According to the Total Quality Management (TQM) literature, measurable criteria for assessing quality are necessary to avoid "arguments of style" (Zultner, 1992). The objective should be to replace intuitive notions of design "quality" with
formal, quantitative measures to reduce subjectivity and bias in the evaluation process. However developing reliable and objective measures of quality in software development is a difficult task. As Van Vliet (1993) says: “The various factors that relate to software quality are hard to define. It is even harder to measure them quantitatively. There are very few quality factors or criteria for which sufficiently sound numeric measures exist.”
Of the frameworks that have been proposed, only two address the issue of quality measurement. Moody and Shanks (1994) suggest a number of evaluation methods, which in some cases are measures (eg. data model complexity) and in other cases are processes for carrying out the evaluation (eg. user reviews). Kesh (1995) defines a number of metrics for evaluating data models but these are theoretically based, and of limited use in practice. Most of the other frameworks rely on experts giving overall subjective ratings of the quality of a data model with respect to the criteria proposed.
2 A Framework for Evaluating and Improving the Quality of Data Models
This paper uses the framework for data model evaluation and improvement proposed by Moody and Shanks (1998) as a basis for developing quality metrics. An earlier version of the framework was published in Moody and Shanks (1994). This framework was developed in practice, and has now been applied in a wide range of organisations around the world (Moody, Shanks and Darke, 1998). The framework is summarised by the Entity Relationship model shown in Fig. 1.

Fig. 1. Data Model Quality Evaluation Framework
• Quality factors are the properties of a data model that contribute to its quality. These answer the question: “What makes a good data model?”. A particular quality factor may have positive or negative interactions with other quality factors— these represent the trade-offs implicit in the modelling process.
• Stakeholders are people who are involved in building or using the data model, and therefore have an interest in its quality. Different stakeholders will generally be interested in different quality factors.
• Quality metrics define ways of evaluating particular quality factors. There may be multiple measures for each quality factor.
• Weightings define the relative importance of different quality factors in a problem situation. These are used to make trade-offs between different quality factors.
• Improvement strategies are techniques for improving the quality of data models with respect to one or more quality factors.
A previous paper (Moody and Shanks, 1994) defined the stakeholders and quality factors relevant to data modelling, as well as methods for evaluating the quality of data models. This paper defines metrics for each quality factor.

Stakeholders
The key stakeholders in the data modelling process are:
• The business user, whose requirements are defined by the data model
• The analyst, who is responsible for developing the data model
• The data administrator, who is responsible for ensuring that the data model is consistent with the rest of the organisation's data
• The application developer, who is responsible for implementing the data model (translating it into a physical database schema)

Quality Factors
The proposed quality factors and the primary stakeholders involved in evaluating them are shown in Fig. 2 below.

Fig. 2. Data Model Quality Factors (completeness, integrity, flexibility, understandability, correctness, simplicity, integration and implementability, each evaluated primarily by the business user, the data analyst, the data administrator or the application developer)
These quality factors may be used as criteria for evaluating the quality of individual data models or comparing alternative representations of requirements. Together they incorporate the needs of all stakeholders, and represent a complete picture of data model quality. The following sections define quality measures for each quality factor.
3 Completeness
Completeness relates to whether the data model contains all information required to meet user requirements. This corresponds to one half of the 100% principle—that the conceptual schema should define all static aspects of the Universe of Discourse (ISO, 1987). Completeness is the most important requirement of all because if it is not satisfied, none of the other quality factors matter. If the requirements as expressed in the data model are inaccurate or incomplete, the system which results will not satisfy users, no matter how well designed or implemented it is.

Evaluating Completeness
In principle, completeness can be checked by verifying that each user requirement is represented somewhere in the model, and that each element of the model corresponds to a user requirement (Batini et al., 1992). However, the practical difficulty with this is that there is no external source of user requirements—they exist only in people's minds. Completeness can therefore only be evaluated with close participation of business users. The result of completeness reviews will be a list of elements (entities, relationships, attributes, business rules) that do not match user requirements. Fig. 3 illustrates the different types of completeness mismatches:
Fig. 3. Types of Completeness Errors
• Area 1 represents elements included in the data model that do not correspond to any user requirement or are out of scope of the system—these represent unnecessary elements. We call these Type 1 errors.
• Area 2 represents user requirements which are not represented anywhere in the data model—these represent gaps or omissions in the model. We call these Type 2 errors.
• Area 3 represents items included in the data model that correspond to user requirements but have been inaccurately defined. We call these Type 3 errors.
• Area 4 represents elements in the data model that accurately correspond to user requirements.
The objective of completeness reviews is to eliminate all items of type 1, 2 and 3.
Proposed Completeness Metrics
The proposed quality measures for completeness all take the form of mismatches with respect to user requirements. The purpose of the review process will be to eliminate all such defects, so that the model exactly matches user requirements:
✎ Metric 1. Number of items in the data model that do not correspond to user requirements (Type 1 errors). Inclusion of such items will lead to unnecessary development effort and added cost.
✎ Metric 2. Number of user requirements which are not represented in the data model (Type 2 errors). These represent missing requirements, and will need to be added later in the development lifecycle, leading to increased costs, or, if they go undetected, will result in users not being satisfied with the system.
✎ Metric 3. Number of items in the data model that correspond to user requirements but are inaccurately defined (Type 3 errors). Such items will need to be changed later in the development lifecycle, leading to rework and added cost, or, if they go undetected, will result in users being unsatisfied with the system.
✎ Metric 4. Number of inconsistencies with the process model. A critical task in verifying the completeness of the data model is to map it against the business processes which the system needs to support. This ensures that all functional requirements can be met by the model. The result of this analysis can be presented in the form of a CRUD (Create, Read, Update, Delete) matrix. Analysis of the CRUD matrix can be used to identify gaps in the data model as well as to "prune away" unnecessary data from the model (Martin, 1989).
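As an illustration of the CRUD-based check behind Metric 4, the sketch below scans a hypothetical CRUD matrix for entities no process touches, entities processes reference but the model lacks, and entities that are read but never created; the process and entity names, and the idea of summing the three counts into one figure, are assumptions made purely for the example.

```python
# Hypothetical CRUD matrix: process -> {entity: set of operations}.
crud = {
    "Take Order":    {"Customer": {"R"}, "Order": {"C"}, "Product": {"R"}},
    "Ship Order":    {"Order": {"R", "U"}, "Shipment": {"C"}},
    "Bill Customer": {"Customer": {"R"}, "Invoice": {"C"}, "Order": {"R"}},
}
data_model_entities = {"Customer", "Order", "Product", "Invoice", "Warehouse"}

used = {e for ops in crud.values() for e in ops}     # entities touched by some process
unused = data_model_entities - used                  # candidate Type 1 errors: prune away
missing = used - data_model_entities                 # candidate Type 2 errors: gaps
never_created = {e for e in data_model_entities & used
                 if not any("C" in ops.get(e, set()) for ops in crud.values())}

print("Entities no process uses (candidate Type 1 errors):", unused)
print("Entities used by processes but absent from the model (Type 2):", missing)
print("Entities read but never created by any process:", never_created)
# One possible tally for Metric 4 (inconsistencies with the process model):
print("Metric 4:", len(unused) + len(missing) + len(never_created))
```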
4 Integrity
Integrity is defined as the extent to which the business rules (or integrity constraints) which apply to the data are enforced by the data model¹. Integrity corresponds to the other half of the 100% principle—that the conceptual schema should define all dynamic aspects of the Universe of Discourse (ISO, 1987). Business rules define what can and can't happen to the data. Business rules are necessary to maintain the consistency and integrity of data stored, as well as to enforce business policies (Date, 1989; Loffman and Rush, 1991). All rules which apply to the data should be documented in the data model to ensure they are enforced consistently across all application programs (ISO, 1987).
¹ In the original version of the evaluation framework (Moody and Shanks, 1994) integrity was included as part of completeness, but has since been separated out as a quality factor in its own right.

Evaluating Integrity
Like completeness, integrity can only really be evaluated with close participation of business users. The rules represented by the data model may be verified by translating them into natural language sentences. Users can then verify whether each rule is true
or false. This is useful as a check on the integrity of the data model because business users often have difficulty understanding the constraints defined in data models, particularly cardinality rules on relationships (Batini et al., 1992). Many CASE tools can automatically translate relationship cardinality rules into natural language sentences, provided relationships have been named correctly.

Proposed Integrity Metrics
The proposed quality measures for integrity take the form of mismatches between the data model and business policies. The purpose of the review process will be to eliminate all such defects:
✎ Metric 5. Number of business rules which are not enforced by the data model. Non-enforcement of these rules will result in data integrity problems and/or operational errors.
✎ Metric 6. Number of integrity constraints included in the data model that do not accurately correspond to business policies (i.e. which are false). Incorrect integrity constraints may be further classified as:
  • too weak: the rule allows invalid data to be stored
  • too strong: the rule does not allow valid data to be stored and will lead to constraints on business operations and the need for user "workarounds".
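The translation of cardinality rules into sentences mentioned above might look like the following sketch, which is an assumption rather than the output of any particular CASE tool; the relationship tuples and phrasing are invented, and the resulting true/false judgements from business users feed Metrics 5 and 6.

```python
# Each relationship is read in one direction:
# (entity_a, verb phrase, entity_b, minimum cardinality, maximum cardinality).
relationships = [
    ("Customer", "places", "Order", 0, "many"),
    ("Order", "is placed by", "Customer", 1, 1),
    ("Order", "contains", "Order Line", 1, "many"),
]

def to_sentence(a, verb, b, lo, hi):
    if (lo, hi) == (1, 1):
        qty, plural = "exactly one", ""
    elif (lo, hi) == (0, 1):
        qty, plural = "at most one", ""
    elif lo == 0:
        qty, plural = "zero or more", "s"
    else:
        qty, plural = "one or more", "s"
    return f"Each {a} {verb} {qty} {b}{plural}."

for rel in relationships:
    print(to_sentence(*rel))
# Business users confirm or reject each sentence: rejected sentences count
# towards Metric 6, while rules users state that appear nowhere in the model
# count towards Metric 5.
```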
5 Flexibility
Flexibility is defined as the ease with which the data model can cope with business change. The objective is for additions and/or changes in requirements to be handled with the minimum possible change to the data model. The data model is a key contributor to the flexibility of the system as a whole (Gartner, 1992; Simsion, 1994). Lack of flexibility in the data model can lead to:
• Maintenance costs: of all types of maintenance changes, changes to data structures and formats are the most expensive. This is because each such change has a "ripple effect" on all the programs that use it.
• Reduced organisational responsiveness: inflexible systems inhibit changes to business practices, organisational growth and the ability to respond quickly to business or regulatory change. Often the major constraint on introducing business change—for example, bringing a new product to market—is the need to modify the computer systems that support it (Simsion, 1988).

Evaluating Flexibility
Flexibility is a particularly difficult quality factor to assess because of the inherent difficulty of predicting what might happen in the future. Evaluation of flexibility requires identifying what requirements might change in the future, their probability of occurrence and their impact on the data model. However, no matter how much time is spent thinking about what might happen in the future, such changes remain hard to anticipate. In this respect, evaluating flexibility has much in common with weather forecasting—there is a limit to how far and how accurately the future can be predicted.
Proposed Flexibility Metrics
The proposed measures for evaluating flexibility focus on areas where the model is potentially unstable—where changes to the model might be required in the future as a result of changes in the business environment. The purpose of the review process will be to look at ways of minimising the impact of change on the model, taking into account the probability of change, strategic impact and likely cost of change. A particular focus of flexibility reviews is identifying business rules which might change.
✎ Metric 7. Number of elements in the model which are subject to change in the future. This includes changes in definitions or business rules as a result of business or regulatory change.
✎ Metric 8. Estimated cost of changes. For each possible change, the probability of the change occurring and the estimated cost of making the change post-implementation should be used to calculate the probability-adjusted cost of the change.
✎ Metric 9. Strategic importance of changes. For each possible change, the strategic impact of the change should be defined, expressed as a rating by business users of the need to respond quickly to the change.
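A small sketch of how Metrics 7 and 8 could be tallied, assuming each anticipated change has been assigned a probability of occurrence and an estimated post-implementation cost; all figures are invented for illustration.

```python
# Candidate future changes: (description, probability, estimated cost if it occurs).
candidate_changes = [
    ("New regulatory reporting category", 0.7, 40_000),
    ("Support for multiple currencies",   0.3, 25_000),
    ("Merge of two product hierarchies",  0.1, 60_000),
]

# Metric 7: number of elements (here, anticipated changes) subject to change.
print("Metric 7 (elements subject to change):", len(candidate_changes))

# Metric 8: probability-adjusted (expected) cost of change.
expected_cost = sum(p * cost for _, p, cost in candidate_changes)
print(f"Metric 8 (probability-adjusted cost): {expected_cost:,.0f}")
```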
6 Understandability
Understandability is defined as the ease with which the data model can be understood. Business users must be able to understand the model in order to verify that it meets their requirements. Similarly, application developers need to be able to understand the model to implement it correctly. Understandability is also important in terms of the useability of the system. If users have trouble understanding the concepts in the data model, they are also likely to have difficulty understanding the system which is produced as a result. The communication properties of the data model are critical to the success of the modelling effort. However, empirical studies show that in practice data models are poorly understood by users, and in most cases are not developed with direct user involvement (Hitchman, 1995). While data modelling has proven very effective as a technique for database design, it has been far less effective for communication with users (Moody, 1996a).

Evaluating Understandability
Understandability can only be evaluated with close participation of the users of the model—business users and application developers. In principle, understandability can be checked by verifying that each element of the model is understandable. However, the practical difficulty with this is that users may think they understand the model while not understanding its full implications and possible limitations from a business perspective.
Proposed Understandability Metrics
The proposed measures for understandability take the form of ratings by different stakeholders and tests of understanding. The purpose of the review process will be to maximise these ratings.
✎ Metric 10. User rating of understandability of the model: user ratings of understandability will be largely based on the concepts, names and definitions used, as well as how the model is presented. A danger with this metric is that it is common for users to grasp familiar business terms without appreciating the meaning represented in the model. As a result, they may think they understand the model while not really understanding its full implications for the business.
✎ Metric 11. Ability of users to interpret the model correctly. This can be measured by getting users to instantiate the model using actual business examples (scenarios). Their level of understanding can then be measured by the number of errors in populating the model. This is a better operational test of understanding than the previous metric—it measures whether the model is actually understood rather than whether it is understandable (Lindland et al., 1994). This is much more important from the point of view of verifying the accuracy of the model.
✎ Metric 12. Application developer rating of understandability. It is essential that the application developer understands the model fully so that they can implement it correctly. Getting the application developer to review the model for understandability is particularly useful for identifying where the model is unclear or ambiguous, because they will be less familiar with the model and the business domain than either the analyst or business user. Many things that seem obvious to those involved in developing the model may not be to someone seeing it for the first time.
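Metric 11 can be operationalised as an error rate over scenario walkthroughs, as in the sketch below; the per-scenario counts are invented, and treating the ratio of errors to instantiated facts as the measure is one possible reading of the metric rather than a prescribed formula.

```python
# For each scenario walked through with users:
# (number of facts to be instantiated, errors made while populating the model).
scenarios = [(12, 1), (8, 0), (15, 4), (10, 2)]

total_facts = sum(n for n, _ in scenarios)
total_errors = sum(e for _, e in scenarios)
error_rate = total_errors / total_facts
print(f"Metric 11: {total_errors} errors over {total_facts} instantiations "
      f"({error_rate:.1%} error rate)")
```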
7 Correctness
Correctness refers to whether the model conforms to the rules of the data modelling technique being used. Rules of correctness include diagramming conventions, naming rules, definition rules and rules of composition (for example, each entity must have a primary key). Correctness is concerned only with whether the data modelling technique has been used correctly (syntactic or grammatical correctness). It answers the question: "Is this a valid model?". Another important aspect of correctness, and a major focus of data modelling in practice, is to ensure that the model contains no redundancy—that each fact is represented in only one place (Simsion, 1994).

Evaluating Correctness
Correctness is the easiest of all the quality factors to evaluate, because there is very little subjectivity involved, and no degrees of quality—the model either obeys the rules or it does not. Also, the model can be evaluated in isolation, without reference to user requirements. The result of correctness reviews will be a list of defects, defining where the data model does not conform to the rules of the data modelling technique. Many of these checks can be carried out automatically using CASE tools.
Proposed Correctness Metrics
The proposed quality measures for correctness all take the form of defects with respect to data modelling standards (syntactic rules). We break down correctness errors into different types or defect classes to assist in identifying patterns of errors or problem areas which may be addressed by training or other process measures. The purpose of the review process will be to eliminate all such defects:
✎ Metric 13. Number of violations of data modelling conventions. These can be further broken down into the following defect classes:
  • Diagramming standards violations (eg. relationships not named)
  • Naming standards violations (eg. use of plural nouns as entity names)
  • Invalid primary keys (non-unique, incomplete or non-singular)
  • Invalid use of constructs (eg. entities without attributes, overlapping subtypes, many to many relationships)
  • Incomplete definition of constructs (e.g. data type and format not defined for an attribute; missing or inadequate entity definition)
✎ Metric 14. Number of normal form violations. Second and higher normal form violations identify redundancy among attributes within an entity (intra-entity redundancy). Normal form violations may be further classified into:
  • First normal form (1NF) violations
  • Second normal form (2NF) violations
  • Third normal form (3NF) violations
  • Higher normal form (4NF+) violations
✎ Metric 15. Number of instances of redundancy between entities—for example, where two entity definitions overlap or where redundant relationships are included. This is called inter-entity redundancy, to distinguish this from redundancy within an entity (intra-entity redundancy—Metric 14) and redundancy of data with other systems (external redundancy—Metric 21). Fig. 4 summarises the types of redundancy and the corresponding metrics.

Fig. 4. Classification of Redundancy (internal redundancy: intra-entity—Metric 14, inter-entity—Metric 15; external redundancy—Metric 21)
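Some of the Metric 13 checks lend themselves to the kind of automation noted above for CASE tools. The sketch below runs a few such checks (plural entity names, entities without attributes or primary keys, unnamed relationships) over a toy model representation; the data structures and the deliberately naive plural test are assumptions made purely for illustration.

```python
# Toy model representation for convention checking (Metric 13).
entities = {
    "Customer": {"attributes": ["customer_no", "name"], "primary_key": ["customer_no"]},
    "Orders":   {"attributes": [], "primary_key": []},          # plural name, no attributes
}
relationships = [
    {"name": "places", "from": "Customer", "to": "Orders"},
    {"name": "",       "from": "Orders",   "to": "Customer"},   # unnamed relationship
]

violations = []
for name, e in entities.items():
    if name.endswith("s"):                   # naive plural test, illustration only
        violations.append(f"Naming standard: entity '{name}' uses a plural noun")
    if not e["attributes"]:
        violations.append(f"Invalid construct: entity '{name}' has no attributes")
    if not e["primary_key"]:
        violations.append(f"Invalid primary key: entity '{name}' has none defined")
for r in relationships:
    if not r["name"]:
        violations.append(f"Diagramming standard: relationship {r['from']}-{r['to']} is unnamed")

print(f"Metric 13 = {len(violations)} violations")
for v in violations:
    print(" -", v)
```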
8 Simplicity
Simplicity means that the data model contains the minimum possible constructs. Simpler models are more flexible (Meyer, 1988), easier to implement (Simsion, 1991), and easier to understand (Moody, 1997). The choice of simplicity as a quality factor is based on the principle of Ockham's Razor, which has become one of the cornerstones of the scientific method. This says that if there are two theories which explain the same observations, the one with the fewer constructs should be preferred (Dubin, 1979). The extension of this to data modelling is that if there are two data models which meet the same requirements, the simpler one should be preferred.

Evaluating Simplicity
Simplicity is the easiest of all quality factors to evaluate, because it only requires a simple count of data model elements. This can be done automatically by CASE tools, or carried out manually. It takes no skill (apart from the ability to count!) and is totally objective. Simplicity metrics are particularly useful in comparing alternative data models—all other things being equal, the simpler one should be preferred.

Proposed Simplicity Metrics
Metrics for evaluating simplicity take the form of complexity measures. The purpose of the review process will be to minimise the complexity of the model while still satisfying user requirements. The following metrics represent alternative ways of measuring the complexity of a data model—Metric 17 is recommended as the most useful of the measures proposed:
✎ Metric 16. Number of entities (E). This is the simplest measure of the complexity of a data model. The justification for this is that the number of entities in the logical data model is a surrogate measure for system complexity and development effort. Symons (1988, 1991) found that in sizing of business ("data rich") applications, the major determinant of software size (and development effort) was the number of entities.
✎ Metric 17. Number of entities and relationships (E+R). This is a finer resolution complexity measure which is calculated as the number of entities (E) plus the number of relationships (R) in the data model. This derives from complexity theory, which asserts that the complexity of any system is defined by the number of components in the system and the number of relationships between them (Klir, 1985; Pippenger, 1978). Subtypes should not be included in the calculation of the number of entities because these represent subcategories within a single construct and generally do not translate into separate database tables. In addition, many to many relationships should be counted as three constructs, since when they are resolved, they will form one entity and two relationships (Shanks, 1997). This helps to standardise differences between different modelling styles.
✎ Metric 18. Number of constructs (E+R+A). This is the finest resolution complexity measure, and includes the number of attributes in the calculation of data
model complexity. Such a metric could be calculated as a weighted sum of the form a·N_E + b·N_R + c·N_A, where N_E is the number of entities, N_R is the number of relationships and N_A is the number of attributes. In practice, however, such a measure does not provide any better information than Metric 17.
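The three simplicity metrics reduce to counting, as the sketch below illustrates; the counting rules for subtypes and many-to-many relationships follow the text above, while the model figures and the equal weights used in Metric 18 are assumptions made for illustration.

```python
# Counts taken from a hypothetical logical data model.
num_entities      = 23   # excluding subtypes, as recommended above
num_relationships = 30   # one-to-many relationships
num_many_to_many  = 4    # each counts as 3 constructs (1 entity + 2 relationships)
num_attributes    = 180

metric_16 = num_entities                                              # E
metric_17 = num_entities + num_relationships + 3 * num_many_to_many   # E + R
metric_18 = metric_17 + num_attributes                                # E + R + A (a = b = c = 1)

print("Metric 16 (E):    ", metric_16)
print("Metric 17 (E+R):  ", metric_17)
print("Metric 18 (E+R+A):", metric_18)
```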
9 Integration
Integration is defined as the level of consistency of the data model with the rest of the organisation's data. In practice, application systems are often built in relative isolation from each other, leading to the same data being implemented over and over again in different ways. This leads to duplication of data, interface problems and difficulties consolidating data from different systems for management reporting (Moody and Simsion, 1995). The primary mechanism for achieving corporate-wide data integration is a corporate data model (Goodhue et al., 1992). The corporate data model provides a common set of data definitions which is used to co-ordinate the activities of application development teams so that separately developed systems work together. The corporate data model allows opportunities for sharing of data to be identified, and ensures that different systems use consistent data naming and formats (Martin, 1989).

Evaluating Integration
Integration is assessed by comparing the application data model with the corporate data model (Batini, Lenzerini and Navathe, 1986). The result of this will be a list of conflicts between the project data model and the corporate data model. This is usually the responsibility of the data administrator (also called information architect, data architect or data manager), who has responsibility for corporate-wide sharing and integration of data. It is their role to maintain the corporate data model and review application data models for conformance to the corporate model.

Proposed Integration Metrics
Most of the proposed measures for integration are in the form of conflicts with the corporate data model or with existing systems. The purpose of the review process will be to resolve these inconsistencies.
✎ Metric 19. Number of data conflicts with the Corporate Data Model. These can be further classified into:
  • Entity conflicts: number of entities whose definitions are inconsistent with the definitions of entities in the corporate data model.
  • Data element conflicts: number of attributes with different definitions or domains to corresponding attributes defined in the corporate data model.
  • Naming conflicts: number of entities or attributes with the same business meaning but different names to concepts in the corporate data model (synonyms); also entities or attributes with the same name but different meaning to concepts in the corporate data model (homonyms).
✎ Metric 20. Number of data conflicts with existing systems. These can be further classified into:
  • Number of data elements whose definitions conflict with those in existing systems, e.g. different data formats or definitions. Inconsistent data item definitions will lead to interface problems, the need for data translation and difficulties comparing and consolidating data across systems.
  • Number of key conflicts with existing systems or other projects. Key conflicts occur when different identifiers are assigned to the same object (eg. a particular customer) by different systems. This leads to fragmentation of data across systems and the inability to link or consolidate data about a particular entity across systems.
  • Number of naming conflicts with other systems (synonyms and/or homonyms). These are less of a problem in practice than other data conflicts, but are a frequent source of confusion in system maintenance and interpretation of data.
✎ Metric 21. Number of data elements which duplicate data elements stored in existing systems or other projects. This is called external redundancy to distinguish it from redundancy within the model itself (Metrics 14 and 15). This form of redundancy is a serious problem in most organisations—empirical studies show that there are an average of ten physical copies of each primary data item in medium to large organisations (O'Brien and O'Brien, 1994).
✎ Metric 22. Rating by representatives of other business areas as to whether the data has been defined in a way which meets corporate needs rather than the requirements of the application being developed. Because all data is potentially shareable, all views of the data should be considered when the data is first defined (Thompson, 1993). In practice, this can be done by a high level committee which reviews all application development projects for data sharing, consistency and integration (Moody, 1996b).
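Part of the conflict detection behind Metrics 19 and 20 can be automated, as in the hedged sketch below; the matching is deliberately naive (exact definition strings rather than any semantic comparison), and both dictionaries are invented for the example.

```python
# name -> definition, for the project model and the corporate data model.
project = {
    "Client":   "A person or organisation that purchases goods or services",
    "Order":    "A request by a client for goods",
    "Location": "A postal address used for delivery",
}
corporate = {
    "Customer": "A person or organisation that purchases goods or services",
    "Order":    "A contractual agreement to supply goods",
    "Site":     "A physical place owned by the company",
}

# Synonym candidates: same definition, different names.
synonyms = [(p, c) for p, pd in project.items() for c, cd in corporate.items()
            if p != c and pd == cd]
# Homonym candidates: same name, different definitions.
homonyms = [n for n in project if n in corporate and project[n] != corporate[n]]

print("Synonym candidates (project vs corporate):", synonyms)
print("Homonym candidates:", homonyms)
print("Naming conflicts contributing to Metric 19:", len(synonyms) + len(homonyms))
```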
10 Implementability
Implementability is defined as the ease with which the data model can be implemented within the time, budget and technology constraints of the project. While it is important that a data model does not contain any assumptions about the implementation (ISO, 1987), it is also important that it does not ignore all practical considerations. After all, there is little point developing a model which cannot be implemented or that the user cannot afford.

Evaluating Implementability
The implementability of the data model is assessed by the application developer, who is responsible for implementing the data model once it has been completed. The application developer provides an important "reality check" on what is technically possible and/or economically feasible. The process of reviewing the model also allows the application developer to gain familiarity with the model prior to the design stage to ensure a smooth transition.
Proposed Implementability Metrics
Proposed measures of implementability all take the form of ratings by the application developer. The purpose of the review process will be to minimise these ratings:
✎ Metric 23. Technical risk rating: estimate of the probability that the system can meet performance requirements based on the proposed data model and the technological platform (particularly the target DBMS) being used.
✎ Metric 24. Schedule risk rating: estimate of the probability that the system can be implemented on time, based on the proposed data model.
✎ Metric 25. Development cost estimate: this is an estimate of the development cost of the system, based on the data model. Such an estimate will necessarily be approximate but will be useful as a guide for making cost/quality trade-offs between different models proposed. If the quote is too high (exceeds the available budget), the model may need to be simplified, reduced in scope, or the budget increased.
11 Conclusion
This paper has proposed a comprehensive set of metrics for evaluating the quality of data models based on the set of quality factors proposed by Moody and Shanks (1998). A total of twenty five candidate metrics are identified, with eighteen secondary metrics which may be used to classify defects in more detail. It is not expected that all of these metrics would be used in evaluating the quality of a particular data model. Our aim in this paper has been to be as complete as possible—to suggest as many metrics as possible as a starting point for analysis. Selection of the most appropriate metrics should be made based on their perceived usefulness and ease of calculation.

Further Research
The next step in this research is to validate and refine these metrics in practice. This will help to identify which metrics are most useful. It is proposed to use action research as the research paradigm for doing this. Action research (Checkland and Scholes, 1990) is a research method in which practitioners and researchers work together to test and refine principles, tools, techniques and methodologies that have been developed to address real world problems. It provides the ability to test out new methods in practice for the mutual benefit of researchers and practitioners. Moody, Shanks and Darke (1998) (in this conference) describe how action research has already been used to validate the framework and the quality factors proposed. Further research is also required to develop strategies for improving the quality of data models once a quality problem has been identified. Definition of improvement strategies would complete the specification of the framework.
References
1. AUSTRALIAN SOFTWARE METRICS ASSOCIATION (ASMA) (1996): ASMA Project Database, Release 7, November, P.O. Box 1287, Box Hill, Victoria, Australia, 3128. 2. BATINI, C., CERI, S. AND NAVATHE, S.B. (1992): Conceptual Database Design: An Entity Relationship Approach, Benjamin Cummings, Redwood City, California. 3. BATINI, C., LENZERINI, M. AND NAVATHE, S. (1986): A Comparative Analysis of Methodologies for Database Schema Integration, ACM Computing Surveys, 18(4), December: 323-364. 4. CHECKLAND, P.B. and SCHOLES, J. (1990): Soft Systems Methodology in Action, Wiley, Chichester. 5. DATE, C.J. (1989): Introduction to Database Systems (4th Edition), Addison Wesley. 6. DUBIN, R. (1978): Theory Building, The Free Press, New York. 7. GARTNER RESEARCH GROUP (1992): "Sometimes You Gotta Break the Rules", Gartner Group Strategic Management Series Key Issues, November 23. 8. GOODHUE, D.L., KIRSCH, L.J., AND WYBO, M.D. (1992): The Impact of Data Integration on the Costs and Benefits of Information Systems, MIS Quarterly, 16(3), September: 293-311. 9. HITCHMAN, S. (1995): Practitioner Perceptions On The Use Of Some Semantic Concepts In The Entity Relationship Model, European Journal of Information Systems, 4, 31-40. 10. INTERNATIONAL STANDARDS ORGANISATION (ISO) (1987): Information Processing Systems - Concepts and Terminology for the Conceptual Schema and the Information Base, ISO Technical Report 9007. 11. KESH, S. (1995): Evaluating the Quality of Entity Relationship Models, Information and Software Technology, 37(12). 12. KLIR, G.J. (1985): Architecture of Systems Problem Solving, Plenum Press, New York. 13. KROGSTIE, J., LINDLAND, O.I. and SINDRE, G. (1995): Towards a Deeper Understanding of Quality in Requirements Engineering, Proceedings of the 7th International Conference on Advanced Information Systems Engineering (CAISE), Jyvaskyla, Finland, June. 14. LEVITIN, A. and REDMAN, T. (1994): Quality Dimensions of a Conceptual View, Information Processing and Management, Volume 31. 15. LINDLAND, O.I., SINDRE, G. and SOLVEBERG, A. (1994): Understanding Quality in Conceptual Modelling, IEEE Software, March. 16. LOFFMAN, R.S. AND RUSH, R.M. (1991): Improving Data Quality, Database Programming and Design, 4(4), April, 17-19. 17. MARTIN, J. (1989): Strategic Data Planning Methodologies, Prentice Hall, New Jersey. 18. MAYER, R.E. (1989): Models for Understanding, Review of Educational Research, Spring. 19. MEYER, B. (1988): Object Oriented Software Construction, Prentice Hall, New York. 20. MOODY, D.L. AND SHANKS, G.G. (1994): What Makes A Good Data Model? Evaluating the Quality of Entity Relationship Models, in P. LOUCOPOLIS (ed.) Proceedings of the Thirteenth International Conference on the Entity Relationship Approach, Manchester, December 14-17, 94-111.
21. MOODY, D.L. AND SIMSION, G.C. (1995): Justifying Investment in Information Resource Management, Australian Journal of Information Systems, 3(1), September: 25-37. 22. MOODY, D.L. (1996a) “Graphical Entity Relationship Models: Towards A More User Understandable Representation of Data”, in B. THALHEIM (ed.) Proceedings of the Fourteenth International Conference on the Entity Relationship Approach, Cottbus, Germany, October 7-9, 227-244. 23. MOODY, D.L. (1996b) Critical Success Factors for Information Resource Management, Proc. 7th Australasian Conference on Information Systems, Hobart, Australia, December. 24. MOODY, D.L. (1997): “A Multi-Level Architecture for Representing Enterprise Data Models”, Proceedings of the Sixteenth International Conference on the Entity Relationship Approach, Los Angeles, November 1-3. 25. MOODY, D.L. and SHANKS, G.G. (1998): What Makes A Good Data Model? A Framework for Evaluating and Improving the Quality of Entity Relationship Models, Australian Computer Journal (forthcoming). 26. MOODY, D.L., SHANKS, G.G. and DARKE, P. (1998): Improving the Quality of Entity Relationship Models—Experience in Research and Practice, in Proceedings of the Seventeenth International Conference on Conceptual Modelling (ER ’98), Singapore, November 16—19. 27. O’BRIEN, C. AND O’BRIEN, S. (1994), “Mining Your Legacy Systems: A Data-Based Approach”, Asia Pacific DB2 User Group Conference, Melbourne, Australia, November 21-23. 28. PIPPENGER, N. (1978): Complexity Theory, Scientific American, 238(6): 1-15. 29. ROMAN, G. (1985): A Taxonomy of Current Issues in Requirements Engineering, IEEE Computer, April. 30. SHANKS, G.G. (1997) Conceptual Data Modelling: An Empirical Study of Expert and Novice Data Modellers, Australian Journal of Information Systems, 4:2, 63-73 31. SIMSION, G.C. (1988): Data Planning in a Volatile Business Environment, Australian Computer Society Conference on Strategic Planning for Information Technology, Ballarat, March: 88-92. 32. SIMSION, G.C. (1991): Creative Data Modelling, Proceedings of the Tenth International Entity Relationship Conference, San Francisco, 112-123. 33. SIMSION, G.C. (1994): Data Modelling Essentials, Van Nostrand Reinhold, New York. 34. SYMONS, C.R. (1988): Function Point Analysis: Difficulties and Improvements, IEEE Transactions on Software Engineering, 14(1), January. 35. SYMONS, C.R. (1991): Software Sizing and Estimating: MkII Function Point Analysis, J. Wiley and Sons. 36. THOMPSON, C. (1993): "Living with an Enterprise Model", Database Programming and Design, 6(12), March: 32-38. 37. VAN VLIET, J.C. (1993): Software Engineering: Principles and Practice, John Wiley and Sons, Chichester, England. 38. VON HALLE, B. (1991): Data: Asset or Liability?, Database Programming and Design, 4(7), July: 13-15. 39. ZULTNER, R.E. (1992): “The Deming Way: Total Quality Management for Software”, Proceedings of Total Quality Management for Software Conference, April, Washington, DC, April, 134-145.
A Transformational Approach to Correct Schema Refinements

Donatella Castelli and Serena Pisani
Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria, 46 Pisa, Italy
e-mail: {castelli,serena}@iei.pi.cnr.it

Abstract. This paper extends a database schema transformation language, called Schema Refinement Language, with a composition operator and a rule for deriving the conditions under which a composed transformation is guaranteed to produce a correct schema refinement. The framework that results from this extension can be exploited for improving the reliability of the database schema design also when other design frameworks are used.
1 Introduction
The reliability of a schema design is usually obtained by reducing the set of operators that can be used to carry out the design from the conceptual to the logical schema to a fixed set [1], [3], [4], [5], [6], [7], [9], [10], [11]. Each operator is provided with the conditions under which it is guaranteed to produce a correct schema refinement. Usually, these conditions can be proved by simply checking the schema structure. A drawback of this approach is that the given set of operators is often insufficient to cover the specific needs that occur in everyday practice. For some applications, the set of chosen transformations may be too low-level, whereas for others it may be too specialised. This paper proposes a novel approach for supporting correct schema refinement which overcomes the above drawback: the transformational operators can be built dynamically according to the designers' needs. The approach proposed relies upon a design language called Schema Refinement Language (SRL) [13]. SRL consists of a set of schema transformation primitives with the associated set of their applicability conditions. The proof of these conditions ensures the correctness of the design step. A composition operator for this language, which permits the definition of a personalised set of schema refinement operators, is proposed. Moreover, a rule for automatically deriving the applicability conditions of a composed transformation from the applicability conditions of the component transformations is introduced. Because of its generality, the framework presented can also be exploited to automatically derive the correctness conditions of schema transformations that are specified in other languages. The SRL framework is described in the next section. Section 3 introduces the composition operator. Section 4 presents the rule for deriving the applicability
conditions of a composed schema transformation. In particular, it discusses how, by exploiting this rule, it is also possible to discover situations in which the definition of a transformation is incorrect. Section 5 shows how the results presented can be exploited to derive the applicability conditions in different transformational frameworks; to illustrate this point, a few examples are presented, taken from well-known transformational frameworks. Section 6 contains concluding remarks. The algorithm for the generation of the applicability conditions is given in the Appendix.
2 Schema Refinement Language
The Schema Refinement Language (SRL) assumes that the whole design relies on a single notation able to represent semantic models. This notation, illustrated briefly through the example in Fig. 1¹, allows the database structure and behavior to be modelled in a single module, called Database Schema (DBS) [8]. This module encloses classes, attributes, is-a relationships, integrity constraints and operations. A graphical representation of the structural part of the schema in the example is given in Fig. 4(a). The DBS notation is formalised in terms of the formal model introduced within the B-Method [2]. This formalisation makes it possible to exploit the B theory and tools for proving expected properties of DBS schemas. The SRL primitive operators implement DBS schema transformations. They are given in Table 1. The equality conditions that appear as a parameter in the add/rem transformations specify how the new/removed element can be derived from the already existing/remaining ones. These conditions are required since only redundant components can be added and removed in a refinement step. SRL does not permit schema operations to be added or removed; it only permits changing the way in which an operation is defined. Note that the operation definitions are also automatically modified as a side effect of the transformations that add and remove schema components. In particular, these automatic modifications add appropriate updates for each of the new schema components, cancel the occurrences of the removed components and apply the proper variable substitutions. A transformation can be applied when its applicability conditions are verified. These are sufficient conditions, to be checked before the execution of the transformation, that prevent the application of meaningless and correctness-breaking schema design steps. Each applicability condition is composed of the conjunction of simple conditions, which in the rest of the paper will be called applicability predicates.
¹ In the figure, ; is the relation composition operator.
database schema Materials
  class vein of VEIN with (type:string)
  class aspect of ASPECT with (colour:string, has vein:vein)
  class material of MATERIAL with (name:string, has aspect:aspect)
  class marble is-a material with ()
  class stone is-a material with ()
  constraints
    ran(has aspect)=dom(has vein)
  initialization
    material,name,has aspect,aspect,colour,has vein,vein,type,marble,stone:=Ø
  operations
    vt←marble veins types = vt:={t| ∃m∈marble·has aspect;has vein(m)=v∧type(v)=t}

Fig. 1. A Database Schema.

Table 1. SRL language.
  add.class(class.name, class.name=expr)
  rem.class(class.name, class.name=expr)
  add.attr(attr.name, class.name, attr.name=expr)
  rem.attr(attr.name, class.name, attr.name=expr)
  add.isa(class.name1, class.name2)
  rem.isa(class.name1, class.name2)
  mod.op(op.name, body)
The criterion for the correctness of schema design is based on the following definition (for a formal definition see [13]):
Definition (DBS schema refinement relation). A DBS schema S1 refines a DBS schema S2 if:
(a) S1 and S2 have the same signature²;
(b) there exists a 1:1 correspondence between the states modelled by S1 and S2;
(c) the databases B1 and B2, modelled by S1 and S2, when initialised and submitted to the same sequence of updates, are such that each possible query on B1 returns one of the results expected by evaluating the same query on B2. □
This notion of correctness is a restricted version of that used within the B refinement theory. Let us point out that the main concerns in defining the SRL framework have been simplicity and generality. These qualities are achieved by defining both the model and the schema refinement language in terms of very primitive mechanisms. SRL, as presented above, however, is not sufficiently general to be used to interpret other schema design frameworks: the schema transformations of such frameworks are usually more complex than those listed above. In order to overcome this limitation, a composition operator for SRL is introduced in the next section.
² With "same signature" we indicate that S1 and S2 have corresponding operations with the same names and the same input and result parameters.
3 Composition Operator
The composition operator permits complex transformations to be defined from simpler ones. Before introducing it, the following preliminary definition is needed.

Definition (Consistent operation modification). A set of SRL schema transformations t1, t2, . . ., tn specifies consistent operation modifications if, for each pair of transformations (tk, tj), with 1 ≤ k, j ≤ n and k ≠ j, that modify the same operation op, at least one of the conditions below holds:
1) bodyk ⊑ bodyj;   2) bodyj ⊑ bodyk
where ⊑ is the algorithmic refinement relation defined in [2] and bodyk and bodyj are the new behaviours of op specified by tk and tj, respectively. □

Intuitively, this definition means that all the bodies specified for the same operation by different transformations must describe the same general behaviour; they can differ only in being more or less refined. The SRL composition operator can now be defined as follows.

Definition (Composition operator "◦"). Let t1, t2, . . ., tn be a set of SRL schema transformations that specify consistent operation modifications. Let ⟨Cl, Attr, IsA, Constr, Op⟩ be a DBS schema, where Cl, Attr, IsA, Constr and Op are, respectively, the sets of classes, attributes, is-a relationships, integrity constraints and schema operations; Op always contains an operation Init that specifies the schema initialisation. The SRL schema transformation composition operator is defined as follows:

t1 ◦ t2 ◦ . . . ◦ tn(⟨Cl, Attr, IsA, Constr, Op⟩) = ⟨(Cl ∪ ACl) \ RCl, (Attr ∪ AAttr) \ RAttr, (IsA ∪ AIsA) \ RIsA, [RemSubst*](Constr ∧ AConstr), Op′⟩

where ACl/RCl, AAttr/RAttr and AIsA/RIsA are, respectively, the sets of classes, attributes and is-a relationships that are added/removed by t1, t2, . . ., tn, extracted from the component transformation parameters. RemSubst* is the transitive closure of the variable substitutions x := E dictated by the conditions specified when an element is removed. If we have, for example, rem.class(c, c=E) ◦ rem.class(d, d=f(c)) ◦ rem.class(e, e=F), then RemSubst* is the parallel composition of the substitutions c := E, d := f(E) and e := F. [RemSubst*]X is the expression obtained by applying the substitution RemSubst* to X; for example, [x := E]R(x) is R(E). This substitution makes it possible to rephrase integrity constraints and operation definitions so that they no longer mention the cancelled schema components. AConstr is the conjunction of the inherent constraints associated with the new schema components and of the conditions that specify how an added element relates to the remaining ones. Finally, Op′ is the new set of operation definitions. These result from the modifications that are required explicitly and from the automatic adjustments caused by the addition and removal of schema components. When more than one of the component transformations modifies an operation, the more specialised behaviour is selected. □

Note that the result of a composition depends on the component transformations and not on the order in which they appear.
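The effect of the composition operator can be pictured with a small Python sketch under stated assumptions: primitives are recorded only by the components they add or remove, the defining equalities are collected as plain-text constraints, and RemSubst* and the automatic rewriting of operation bodies are not modelled. It is meant only to illustrate why the result does not depend on the order of the components, not to reproduce the formal definition.

from dataclasses import dataclass, field

@dataclass
class Schema:
    classes: dict = field(default_factory=dict)      # class name -> set of attribute names
    isa: set = field(default_factory=set)            # (subclass, superclass) pairs
    constraints: list = field(default_factory=list)  # constraint expressions as strings
    operations: dict = field(default_factory=dict)   # operation name -> body (string)

@dataclass
class SRLStep:
    """One SRL primitive, recorded by its effect on the schema components."""
    add_classes: dict = field(default_factory=dict)  # name -> equality condition (string)
    rem_classes: dict = field(default_factory=dict)
    add_attrs: dict = field(default_factory=dict)    # (class, attribute) -> equality condition
    rem_attrs: dict = field(default_factory=dict)
    add_isa: set = field(default_factory=set)
    rem_isa: set = field(default_factory=set)

def compose(*steps):
    """Union the additions and removals of the component steps; the outcome does not
    depend on the order in which the steps are listed."""
    def apply(s: Schema) -> Schema:
        out = Schema({c: set(a) for c, a in s.classes.items()},
                     set(s.isa), list(s.constraints), dict(s.operations))
        for t in steps:                                    # additions: ACl, AAttr, AIsA, AConstr
            for c, cond in t.add_classes.items():
                out.classes.setdefault(c, set())
                out.constraints.append(cond)
            for (c, a), cond in t.add_attrs.items():
                out.classes.setdefault(c, set()).add(a)
                out.constraints.append(cond)
            out.isa |= t.add_isa
        for t in steps:                                    # removals: RCl, RAttr, RIsA
            for c in t.rem_classes:
                out.classes.pop(c, None)
            for (c, a) in t.rem_attrs:
                out.classes.get(c, set()).discard(a)
            out.isa -= t.rem_isa
        # RemSubst* and the induced rewriting of the operation definitions are left out.
        return out
    return apply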
Unlike other proposals [9], [10], [11], SRL so extended is a complete DBS schema refinement language. This property ensures that SRL is powerful enough to express every DBS schema transformation. As a consequence, the designer can progressively enrich the set of schema refinement transformations according to his needs. The following example illustrates how new transformations can be built.

Example. Suppose that the transformation illustrated in Fig. 2 has to be defined. This transformation adds a direct relationship a1 between two classes that were related by an indirect link; the new link is defined as the composition of a2 and a3. Moreover, it removes the relationship a3. This transformation can be built as a composition of simple SRL transformations in the following way:

path replacement(C1, a1, C2, a2, a3) = add.attr(a1, C1, a1 = a2;a3) ◦ rem.attr(a3, C2, a3 = a2⁻¹;a1)
Fig. 2. Path replacement: (a) the indirect link from C1 through C2 to C3 via a2 and a3; (b) the schema after the transformation, with the direct relationship a1 = a2;a3 added on C1 and a3 removed.
The transformation path replacement can be used like any other SRL transformation. For example, it can be applied to the database schema Materials of Fig. 1 as follows:

path replacement(marble, has vein marble, aspect, has aspect, has vein)

obtaining the DBS schema presented in Fig. 3. Fig. 4 illustrates graphically the effect of the transformation on the static part of the schema.

database schema Materials1
  class vein of VEIN with (type:string)
  class aspect of ASPECT with (colour:string)
  class material of MATERIAL with (name:string, has aspect:aspect)
  class marble is-a material with (has vein marble:vein)
  class stone is-a material with ()
  constraints
    ran(has aspect) = dom(has aspect⁻¹;has vein marble)
  initialization
    material, name, has aspect, aspect, colour, has vein marble, vein, type, marble, stone := Ø
  operations
    vt ← marble veins types = vt := {t | ∃ m ∈ marble · has vein marble(m) = v ∧ type(v) = t}

Fig. 3. DBS Materials1.
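A quick way to check the structural effect reported in Fig. 3 and Fig. 4 is to replay the transformation on a plain-dictionary version of the schema. The sketch below is illustrative only: identifiers are given underscores, only attribute placement is tracked, and the equality conditions are kept as strings standing for the real proof obligations (with ~ standing for relational inverse).

# Structural part of Materials (Fig. 1): classes mapped to their attribute names.
materials = {
    "vein": {"type"},
    "aspect": {"colour", "has_vein"},
    "material": {"name", "has_aspect"},
    "marble": set(),
    "stone": set(),
}

def path_replacement(schema, c1, a1, c2, a2, a3):
    """add.attr(a1, c1, a1 = a2;a3) composed with rem.attr(a3, c2, a3 = a2~;a1),
    tracking attribute placement only."""
    out = {c: set(attrs) for c, attrs in schema.items()}
    out[c1].add(a1)        # the new direct relationship
    out[c2].discard(a3)    # the redundant indirect step
    conditions = [f"{a1} = {a2};{a3}", f"{a3} = {a2}~;{a1}"]
    return out, conditions

materials1, conds = path_replacement(
    materials, "marble", "has_vein_marble", "aspect", "has_aspect", "has_vein")
assert "has_vein_marble" in materials1["marble"]     # as in Fig. 3 / Fig. 4(b)
assert "has_vein" not in materials1["aspect"]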
This section has illustrated how DBS schema refinement transformations can be built dynamically. The next section shows how, in this dynamic context, it is still possible to support the designer in carrying out a correct design process.
Fig. 4. Path replacement: (a) the static part of Materials, where material (with subclasses marble and stone) reaches vein through has aspect and has vein; (b) the static part of Materials1, where marble reaches vein directly through has vein marble.
4 Applicability Conditions
The generation of the applicability conditions of a composed transformation serves a double purpose. First, it highlights mistakes in the definition of the transformation and suggests to the designer how to remove them. Second, it provides a set of sufficient conditions ensuring that the application of the transformation results in a correct design. The applicability conditions are generated constructively [13] by the Applicability Condition Generating Algorithm (ACGA), given in the Appendix. ACGA generates the applicability conditions by considering the schema structure and the modifications brought about by the component transformations. For the applicability conditions of composed transformations, the following property holds [13]:

Property (SRL is a refinement language). Let t1, t2, . . ., tn be SRL schema transformations and S be a DBS schema. The application of the transformation t1 ◦ t2 ◦ . . . ◦ tn(S), when its applicability conditions are verified, produces a refinement of S. □

This property ensures the correctness of any SRL database design. The following two sections describe in detail the two uses of the applicability conditions mentioned above.
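Operationally, the property above amounts to a guarded application step: the instantiated applicability predicates are checked, and the transformation is applied only when all of them hold. The sketch below assumes the predicates have already been reduced to executable checks on a toy schema; in SRL they are proof obligations over the B formalisation, so this is a picture of the workflow rather than of the verification itself, and the predicate and function names are hypothetical.

from typing import Callable, Iterable

ToySchema = dict                                   # class name -> set of attribute names
Predicate = Callable[[ToySchema], bool]            # an instantiated applicability predicate

def apply_if_applicable(schema: ToySchema,
                        transformation: Callable[[ToySchema], ToySchema],
                        predicates: Iterable[Predicate]) -> ToySchema:
    """Refuse to transform unless every applicability predicate is verified."""
    failed = [p for p in predicates if not p(schema)]
    if failed:
        raise ValueError(f"{len(failed)} applicability predicate(s) not verified")
    return transformation(schema)

# Hypothetical predicate: the attribute to be added must not already exist on the class.
schema = {"material": {"name", "has_aspect"}, "marble": set()}
add_attr = lambda s: {**s, "marble": s["marble"] | {"has_vein_marble"}}
fresh_attr = lambda s: "has_vein_marble" not in s["marble"]
schema = apply_if_applicable(schema, add_attr, [fresh_attr])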
4.1 Applicability of a Transformation
As there are no constraints on how the transformations may be composed, it can happen that a newly defined transformation turns out to be never applicable, i.e. its applicability conditions are never verified. If t is a composed transformation with n parameters and applicability conditions appl_t, proving the following condition rules out such a wrong definition:

∃ p1, . . ., pn, S · appl_t(p1,...,pn)(S)

where p1, . . ., pn are the parameters of t and S is a DBS schema. This condition expresses that the transformation is applicable only if there is at least one instance of the parameters and a schema that verify the applicability conditions; otherwise, there is some mistake in the definition of t. To illustrate this point, let us examine the following example³:
³ In what follows, ▷ stands for the range restriction, C2 is-a-reach C1 is verified if there exists an is-a path between C1 and C2, and C3 is-a C2 stands for the inherent constraint "C3 is a subclass of C2".
Carrying out the verification of the above applicability conditions, it turns out that the last condition can never be verified; indeed, C can be instantiated with C3. The failure of the proof gives an indication of what is wrong in the definition of the transformation: the transformation removes the class C2, which is is-a related to C3. Independently of the values given to C2 and C3, the instantiated transformation will always produce a dangling is-a relationship, as illustrated in Fig. 5.
Fig. 5. Specialisation: (a) C1 related to C2 by the attribute a, with C3 is-a C2; (b) after the removal of C2, the is-a relationship of C3 is left dangling.
4.2 Correctness of a Design Step
The applicability conditions of a composed transformation are parametric with respect to the parameters of that transformation. By reasoning on these conditions, it turns out that some of them can be settled without instantiating the parameters, while others can be discharged by simply comparing the values of the parameters. This suggests pruning these predicates automatically and associating with an instance of a transformation only the simplified set of applicability predicates. The pruning is done at different stages. When a transformation with parameters p1, . . ., pn is defined, the set of applicability predicates is scanned and, for each predicate Pij of the set, the proof of ∀ p1, . . ., pn, S · Pij, where S is a DBS schema, is attempted. If the proof is successful, Pij is inserted in the set of applicability predicates that no longer have to be proved.
The second kind of pruning is executed when the transformation is instantiated: by reasoning on the structure of the component transformations and on the values of the parameters, several applicability predicates are discharged. The ACGA algorithm, reported in the Appendix, actually interleaves the generation of the applicability conditions with this second pruning. The result is the set of applicability predicates that the designer has to prove for a particular application of the transformation. Notice that this set is often very small. Moreover, since the SRL framework and its applicability conditions are formalised, an automatic, or at least guided, discharge of the generated applicability conditions is possible. As an example of dynamic generation of the applicability conditions of a composed transformation, let us look at the applicability conditions of the transformation path replacement, as invoked in the example of Sect. 3. In these conditions, the following abbreviations are used:
• Constr stands for the constraints of the initial schema;
• Inh stands for the inherent constraints that are implicitly added by the transformation;
• NewConstr1 = Constr ∧ Inh ∧ has vein = has aspect⁻¹;has vein marble
• NewConstr2 = Constr ∧ Inh ∧ has vein marble = has aspect;has vein
The applicability conditions of path replacement are:
(a) NewConstr1 ⇒ dom(has aspect;has vein) ⊆ marble
(b) NewConstr2 ⇒ has vein = has aspect⁻¹;has vein marble
The first condition requires that the added relationship, defined as the composition of has aspect and has vein, be defined on the class marble. The second condition requires that the removed relationship be derivable as the sequential composition of the remaining ones. Those listed above are the only applicability conditions returned to the designer; several others are checked and discharged automatically by the ACGA algorithm.
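The two pruning stages can be pictured as two filters over the set of applicability predicates. In the sketch below, provable_universally and decide_on_parameters are hypothetical hooks standing in, respectively, for a proof attempt of ∀ p1, . . ., pn, S · Pij and for the syntactic checks performed at instantiation time; neither is part of ACGA as defined in the paper, so this is only an illustration of the workflow.

def prune_at_definition(predicates, provable_universally):
    """Definition-time pruning: drop predicates already proved for every
    instantiation and every schema (they never need to be proved again)."""
    return [p for p in predicates if not provable_universally(p)]

def prune_at_instantiation(predicates, parameters, decide_on_parameters):
    """Instantiation-time pruning: discharge predicates that can be settled by
    simply comparing parameter values; return the residue for the designer."""
    residue = []
    for p in predicates:
        verdict = decide_on_parameters(p, parameters)   # True, False or None (unknown)
        if verdict is None:
            residue.append(p)                           # left to be proved by the designer
        elif verdict is False:
            raise ValueError(f"predicate {p!r} is violated by these parameters")
    return residue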
5 Exploiting SRL in Other Design Frameworks
The described approach can also be used to achieve a more reliable design in other frameworks. In all the design frameworks that can be interpreted as a special case of the one described here, the applicability conditions of any transformation can be generated by exploiting the given framework. To generate these conditions, it is sufficient, first, to define the refinement transformation as the difference between the initial and the final schemas and, then, to express this difference as a composition of SRL primitives. At this point, the applicability conditions follow automatically.
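A rough picture of this recipe is a schema diff that is read back as a list of SRL primitive invocations. The sketch below is a simplification under stated assumptions: schemas are reduced to class-to-attribute maps, is-a relationships, constraints and operations are ignored, and the equality conditions that justify each primitive are left as placeholders, since they must come from the designer or from the host framework.

def diff_as_srl(initial, final):
    """Express the difference between two schemas (class name -> set of attributes)
    as SRL primitive invocations; the conditions are placeholders, not derivations."""
    steps = []
    for c in final.keys() - initial.keys():
        steps.append(("add.class", c, "<condition deriving c from the initial schema>"))
    for c in initial.keys() - final.keys():
        steps.append(("rem.class", c, "<condition deriving c from the remaining schema>"))
    for c in final.keys() & initial.keys():
        for a in final[c] - initial[c]:
            steps.append(("add.attr", a, c, "<condition>"))
        for a in initial[c] - final[c]:
            steps.append(("rem.attr", a, c, "<condition>"))
    return steps

# A two-subclass elimination in the style of the example that follows, is-a links ignored:
before = {"C": set(), "C1": set(), "C2": set()}
after = {"C": {"a"}}
print(diff_as_srl(before, after))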
This approach to the derivation of the applicability conditions of a refinement transformation can be useful in all those design contexts in which the preconditions for a correct design are not given. These include contexts in which there are no established transformations and the design is done by writing down the logical schema directly. In this case, the transformation is implicit and, of course, there are no preconditions guaranteeing its correctness. The approach may also be useful when applicability conditions exist but are only given informally. In these cases, assistance tools for the verification of these preconditions cannot be built; using the suggested approach, we can take advantage of those provided for SRL. The approach illustrated above has an inherent limitation: it can be applied only if its premises agree with those of SRL. In particular, the employed model must be a submodel of DBS and the schema refinement relation must conform to the one given in Sect. 2. To illustrate how the proposed approach can be useful in other frameworks, we present two examples of derivation of applicability conditions. The examples consider two refinement transformations taken from two different well-known languages.

Example 1. The first example considers one of the schema transformations proposed in [3]. The transformations within this set do not change the information content of the schema, and the assumed data model conforms to DBS. The transformation chosen is the elimination of dangling subentities in generalisation hierarchies; for brevity, below it will be named elimination. The transformation elimination removes n non-overlapping subclasses and reduces them to a superclass. The elements of the superclass are partitioned into n groups by the value of an added attribute. Fig. 6 shows this transformation.
Fig. 6. elimination: (a) the class C with subclasses C1, . . ., Cn; (b) the class C with the added attribute a and the subclasses removed.
The schema in Fig. 6(b) differs from the schema in Fig. 6(a) in that it has a new attribute a, while the classes C1, . . ., Cn and their is-a relationships are missing. This difference can easily be expressed as a composition of SRL primitives:

elimination(a, C, (v1, . . ., vn), (C1, . . ., Cn)) =
  add.attr(a, C, a = {(x,y) | x ∈ (C1 ∪ · · · ∪ Cn) ∧ (x ∈ C1 → y = v1) ∧ · · · ∧ (x ∈ Cn → y = vn)}) ◦
  rem.class(C1, C1 = dom(a ▷ {v1})) ◦ · · · ◦ rem.class(Cn, Cn = dom(a ▷ {vn})) ◦
  rem.isa(C1, C) ◦ · · · ◦ rem.isa(Cn, C)

The transformation elimination can, for example, be applied to the schema S in Fig. 7(a) to obtain the schema of Fig. 7(b):

elimination(type, Material, (marble, stone), (Marble, Stone))(S)
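The same composition can be replayed on a toy structural representation to see which components appear and disappear. The sketch below is illustrative only: classes carry just attribute names, is-a links are an explicit set of pairs, the equality conditions are returned as strings (with ▷ for the range restriction), and none of the checking performed by ACGA is modelled.

def eliminate(schema, attr, superclass, values, subclasses):
    """elimination(a, C, (v1..vn), (C1..Cn)) on a toy schema given as
    (class name -> set of attributes, set of is-a pairs)."""
    classes, isa = schema
    classes = {c: set(a) for c, a in classes.items()}
    isa = set(isa)
    classes[superclass].add(attr)                              # add.attr(a, C, ...)
    conditions = [f"{attr} tags each element of " + " ∪ ".join(subclasses)]
    for sub, val in zip(subclasses, values):
        conditions.append(f"{sub} = dom({attr} ▷ {{{val}}})")  # rem.class(Ci, ...)
        classes.pop(sub, None)
        isa.discard((sub, superclass))                         # rem.isa(Ci, C)
    return (classes, isa), conditions

fig7a = ({"Material": set(), "Marble": set(), "Stone": set()},
         {("Marble", "Material"), ("Stone", "Material")})
fig7b, conds = eliminate(fig7a, "type", "Material", ("marble", "stone"), ("Marble", "Stone"))
assert fig7b == ({"Material": {"type"}}, set())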
⁴ NewConstr stands for the conjunction of the constraints of the initial schema and a subset of those added; this subset consists of all the constraints added by the component transformations other than the transformation that generated the condition.
Fig. 7. elimination: (a) Material with subclasses Marble and Stone; (b) Material with the added attribute type.
The applicability conditions associated with the above instantiation of elimination are the following⁴:
• NewConstr ⇒ (dom({(m,t) | m ∈ (Marble ∪ Stone) ∧ (m ∈ Marble → t = marble) ∧ (m ∈ Stone → t = stone)}) ⊆ Material)
• NewConstr ⇒ Marble = dom(type ▷ {marble})
• NewConstr ⇒ Stone = dom(type ▷ {stone})

Example 2. The second example is taken from [10], where a set of ER schema transformations to support the designer during schema development is proposed. One of these transformations is the disaggregation of a compound attribute; for brevity, it will be called disaggregation. This transformation is semantics-preserving, i.e. it does not change the information content of the schema. The transformation disaggregation replaces the compound attribute with its component fields, as shown in Fig. 8.

Fig. 8 (fragment). A class C with a compound attribute a⟨a1, . . ., an⟩.
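Although the figure and the SRL decomposition of disaggregation are not reproduced here, its intended effect can be sketched under an explicit assumption: that it corresponds to one add.attr per component field followed by the rem.attr of the compound attribute, with the usual derivability conditions. The class and attribute names below are purely hypothetical and only attribute placement is tracked.

def disaggregate(schema, cls, compound, components):
    """Assumed reading of 'disaggregation': replace the compound attribute
    a<a1, ..., an> of class `cls` by its component fields (structure only)."""
    out = {c: set(attrs) for c, attrs in schema.items()}
    out[cls] |= set(components)      # add.attr(ai, cls, ai derived from the i-th field of a)
    out[cls].discard(compound)       # rem.attr(a, cls, a rebuilt from a1, ..., an)
    return out

# Hypothetical example: a compound 'address' attribute split into its fields.
print(disaggregate({"Person": {"address"}}, "Person", "address", ("street", "city", "zip")))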