Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6413
Juan Trujillo Gillian Dobbie Hannu Kangassalo Sven Hartmann Markus Kirchberg Matti Rossi Iris Reinhartz-Berger Esteban Zimányi Flavius Frasincar (Eds.)
Advances in Conceptual Modeling – Applications and Challenges ER 2010 Workshops ACM-L, CMLSA, CMS, DE@ER, FP-UML, SeCoGIS, WISM Vancouver, BC, Canada, November 1-4, 2010 Proceedings
Volume Editors

Juan Trujillo, University of Alicante, Spain, [email protected]
Gillian Dobbie, University of Auckland, New Zealand, [email protected]
Hannu Kangassalo, University of Tampere, Finland, [email protected]
Sven Hartmann, Clausthal University of Technology, Germany, [email protected]
Markus Kirchberg, A*STAR, Singapore, [email protected]
Matti Rossi, Aalto University, Finland, [email protected]
Iris Reinhartz-Berger, University of Haifa, Israel, [email protected]
Esteban Zimányi, Free University of Brussels, Belgium, [email protected]
Flavius Frasincar, Erasmus University Rotterdam, The Netherlands, [email protected]

Library of Congress Control Number: 2010936076
CR Subject Classification (1998): D.2, D.3, H.4, I.2, H.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-16384-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16384-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface to ER 2010 Workshops
Welcome to the workshops associated with the 29th International Conference on Conceptual Modeling (ER 2010). As always, the aim of the workshops was to give researchers and participants a forum to discuss cutting edge research in conceptual modeling, and to pose some of the challenges that arise when applying conceptual modeling in less traditional areas. Workshops provided an intensive collaborative forum for exchanging late breaking ideas and theories in an evolutionary stage.

Topics of interest span the entire spectrum of conceptual modeling including research and practice in areas such as theories of concepts and ontologies underlying conceptual modeling, methods and tools for developing and communicating conceptual models, and techniques for transforming conceptual models into effective implementations. In order to provoke more discussion and interaction, some workshops organized panels and/or keynote speakers inviting renowned researchers from different areas of conceptual modeling. In all, 31 papers were accepted from a total of 82 submitted, making an overall acceptance rate of 37%.

The focus of this year's seven workshops, which were selected competitively from a call for workshop proposals, ranged from the application of conceptual modeling in less traditional domains including learning, life science applications, services, geographical systems, and Web information systems, to using conceptual modeling for different purposes including domain engineering, and UML modeling.

SeCoGIS: Semantic and Conceptual Issues in GIS
CMLSA: Conceptual Modeling of Life Sciences Applications
CMS: Conceptual Modeling of Services
ACM-L: Active Conceptual Modeling of Learning
WISM: Web Information Systems Modeling
DE@ER: Domain Engineering
FP-UML: Foundations and Practices of UML
Setting up workshops such as these requires a lot of effort. We would like to thank the Workshop Chairs and their Program Committees for their diligence in selecting the papers in this volume. We would also like to thank the main ER 2010 conference committees, particularly the Conference Co-chairs, Yair Wand and Carson Woo, the Conference Program Co-chairs, Jeff Parsons, Motoshi Saeki and Peretz Shoval, the Webmaster, William Tan, and the Proceedings Chair, Sase Singh, for their support in putting the program and proceedings together.
November 2010
Juan Trujillo Gillian Dobbie
ER 2010 Workshop Organization
Workshop Co-chairs
Juan Trujillo, Universidad de Alicante, Spain
Gillian Dobbie, University of Auckland, New Zealand
SeCoGIS 2010 Program Chairs
Jean Brodeur, Natural Resources Canada, Canada
Esteban Zimányi, Université Libre de Bruxelles, Belgium
SeCoGIS 2010 Program Committee
Alia I. Abdelmoty, Cardiff University, UK
Gennady Andrienko, Fraunhofer Institute IAIS, Germany
Natalia Andrienko, Fraunhofer Institute IAIS, Germany
Claudio Baptista, Universidade Federal de Campina Grande, Brazil
Spiridon Bakiras, City University of New York, USA
Yvan Bédard, Université Laval, Canada
Michela Bertolotto, University College Dublin, Ireland
Bénédicte Bucher, Institut Géographique National, France
James D. Carswell, Dublin Institute of Technology, Ireland
Nicholas Chrisman, Université Laval, Canada
Christophe Claramunt, Naval Academy Research Institute, France
Eliseo Clementini, University of L'Aquila, Italy
Maria Luisa Damiani, University of Milano, Italy
Clodoveu Davis, Federal University of Minas Gerais, Brazil
Max Egenhofer, NCGIA, USA
Fernando Ferri, IRPPS-CNR, Italy
Frederico Fonseca, Penn State University, USA
Antony Galton, University of Exeter, UK
Ki-Joune Li, Pusan National University, South Korea
Thérèse Libourel, Université de Montpellier II, France
Jugurta Lisboa Filho, Universidade Federal de Viçosa, Brazil
Miguel R. Luaces, Universidade da Coruña, Spain
Jose Macedo, Federal University of Ceará, Brazil
Pedro Rafael Muro Medrano, Universidad de Zaragoza, Spain
Mir Abolfazl Mostafavi, Université Laval, Canada
Dimitris Papadias, University of Science and Technology, China
Dieter Pfoser, Institute for the Management of Information Systems, Greece
Andrea Rodriguez, Universidad de Concepción, Chile
Diego Seco, Universidade da Coruña, Spain
Sylvie Servigne-Martin, INSA de Lyon, France
Emmanuel Stefanakis, Harokopio University of Athens, Greece
Kathleen Stewart Hornsby, University of Iowa, USA
Christelle Vangenot, EPFL, Switzerland
Luis Manuel Vilches Blazquez, Universidad Politécnica de Madrid, Spain
Lubia Vinhas, Instituto Nacional de Pesquisas Espaciais, Brazil
Jose Ramon Ríos Viqueira, University of Santiago de Compostela, Spain
Nancy Wiegand, University of Wisconsin-Madison, USA
SeCoGIS 2010 External Reviewers Francisco J. Lopez-Pellicer
CMLSA 2010 Program Chairs
Yi-Ping Phoebe Chen, La Trobe University, Australia
Sven Hartmann, Clausthal University of Technology, Germany
CMLSA 2010 Program Committee
Ramez Elmasri, University of Texas, USA
Amarnath Gupta, University of California San Diego, USA
Dirk Labudde, Mittweida University of Applied Sciences, Germany
Dirk Langemann, Braunschweig University of Technology, Germany
Huiqing Liu, Janssen Pharmaceutical Companies of Johnson & Johnson, USA
Maria Mirto, University of Salento, Italy
Oscar Pastor, Valencia University of Technology, Spain
Fabio Porto, EPF Lausanne, Switzerland
Sudha Ram, University of Arizona, USA
Keun Ho Ryu, Chungbuk National University, South Korea
Thodoros Topaloglou, University of Toronto, Canada
Xiaofang Zhou, The University of Queensland, Australia
CMLSA 2010 Publicity Chair
Jing Wang, Massey University, New Zealand
CMS 2010 Program Chairs
Markus Kirchberg, Institute for Infocomm Research, A*STAR, Singapore
Bernhard Thalheim, Christian-Albrechts University of Kiel, Germany
CMS 2010 Program Committee
Michael Altenhofen, SAP Research CEC Karlsruhe, Germany
Don Batory, University of Texas at Austin, USA
Athman Bouguettaya, CSIRO, Australia
Schahram Dustdar, Vienna University of Technology, Austria
Andreas Friesen, SAP Research Karlsruhe, Germany
Aditya K. Ghose, University of Wollongong, Australia
Uwe Glässer, Simon Fraser University, Canada
Georg Grossmann, University of South Australia, Australia
Hannu Jaakkola, Tampere University of Technology, Finland
Andreas Prinz, University of Agder, Norway
Sudha Ram, University of Arizona, USA
Klaus-Dieter Schewe, Software Competence Center Hagenberg, Austria
Michael Schrefl, University of Linz, Austria
Thu Trinh, Technical University of Clausthal, Germany
Qing Wang, University of Otago, New Zealand
Yan Zhu, Southwest Jiaotong University, China
CMS 2010 External Referees Michael Huemer Florian Rosenberg Wanita Sherchan Xu Yang
ACM-L 2010 Program Chairs
Hannu Kangassalo, University of Tampere, Finland
Salvatore T. March, Vanderbilt University, USA
Leah Wong, SPAWARSYSCEN Pacific, USA
ACM-L 2010 Program Committee
Stefano Borgo, ISTC-CNR, Italy
Alfredo Cuzzocrea, University of Calabria, Italy
Giancarlo Guizzardi, Universidade Federal do Espírito Santo, Brazil
Raymond A. Liuzzi, Raymond Technologies, USA
Jari Palomäki, Tampere University of Technology/Pori, Finland
Oscar Pastor, Valencia University of Technology, Spain
Sudha Ram, University of Arizona, USA
Laura Spinsanti, LBD lab – EPFL, Switzerland
Il-Yeol Song, Drexel University, USA
Bernhard Thalheim, Christian Albrechts University Kiel, Germany
WISM 2010 Program Chairs
Flavius Frasincar, Erasmus University Rotterdam, The Netherlands
Geert-Jan Houben, Delft University of Technology, The Netherlands
Philippe Thiran, Namur University, Belgium
WISM 2010 Program Committee
Syed Sibte Raza Abidi, Dalhousie University, Canada
Sven Casteleyn, Vrije Universiteit Brussel, Belgium
Philipp Cimiano, University of Bielefeld, Germany
Roberto De Virgilio, Università di Roma Tre, Italy
Tommaso Di Noia, Technical University of Bari, Italy
Flavius Frasincar, Erasmus University of Rotterdam, The Netherlands
Irene Garrigos, Universidad de Alicante, Spain
Michael Grossniklaus, ETH Zurich, Switzerland
Hyoil Han, LeMoyne-Owen College, USA
Geert-Jan Houben, Delft University of Technology, The Netherlands
Zakaria Maamar, Zayed University, UAE
Maarten Marx, University of Amsterdam, The Netherlands
Michael Mrissa, Namur University, Belgium
Oscar Pastor, Valencia University of Technology, Spain
Dimitris Plexousakis, University of Crete, Greece
Jose Palazzo Moreira de Oliveira, UFRGS, Brazil
Davide Rossi, University of Bologna, Italy
Hajo Reijers, Eindhoven University of Technology, The Netherlands
Philippe Thiran, Namur University, Belgium
Christopher Thomas, Wright State University, USA
Erik Wilde, UC Berkeley, USA
WISM 2010 External Referees C. Berberidis K. Buza
DE@ER 2010 Program Chairs
Iris Reinhartz-Berger, University of Haifa, Israel
Arnon Sturm, Ben-Gurion University of the Negev, Israel
Jorn Bettin, Sofismo, Switzerland
Tony Clark, Middlesex University, UK
Sholom Cohen, Carnegie Mellon University, USA
DE@ER 2010 Program Committee
Colin Atkinson, University of Mannheim, Germany
Mira Balaban, Ben-Gurion University of the Negev, Israel
Balbir Barn, Middlesex University, UK
Kim Dae-Kyoo, Oakland University, USA
Joerg Evermann, Memorial University of Newfoundland, Canada
Marcelo Fantinato, University of São Paulo, Brazil
Jeff Gray, University of Alabama, USA
Atzmon Hen-Tov, Pontis, Israel
John Hosking, University of Auckland, New Zealand
Jaejoon Lee, Lancaster University, UK
David Lorenz, Open University, Israel
John McGregor, Clemson University, USA
Klaus Pohl, University of Duisburg-Essen, Germany
Iris Reinhartz-Berger, University of Haifa, Israel
Michael Rosemann, The University of Queensland, Australia
Julia Rubin, IBM Haifa Research Labs, Israel
Lior Schachter, Pontis, Israel
Klaus Schmid, University of Hildesheim, Germany
Keng Siau, University of Nebraska-Lincoln, USA
Pnina Soffer, University of Haifa, Israel
Il-Yeol Song, Drexel University, USA
Arnon Sturm, Ben-Gurion University of the Negev, Israel
Juha-Pekka Tolvanen, MetaCase, Finland
Gabi Zodik, IBM Haifa Research Labs, Israel
DE@ER 2010 External Referees Andreas Metzger Ornsiri Thonggoom
FP-UML 2010 Program Chairs
Gunther Pernul, University of Regensburg, Germany
Matti Rossi, Aalto University, Finland
FP-UML 2010 Program Committee
Doo-Hwan Bae, KAIST, South Korea
Michael Blaha, OMT Associates Inc., USA
Cristina Cachero, University of Alicante, Spain
Gill Dobbie, University of Auckland, New Zealand
Irene Garrigos, University of Alicante, Spain
Peter Green, University of Queensland, Australia
Manfred Jeusfeld, Tilburg University, The Netherlands
Ludwik Kuzniarz, Blekinge Institute of Technology, Sweden
Jens Lechtenborger, University of Münster, Germany
Susanne Leist, University of Regensburg, Germany
Pericles Loucopoulos, Loughborough University, UK
Hui Ma, Massey University, New Zealand
Jose Norberto Mazon, University of Alicante, Spain
Antoni Olive, Technical University of Catalonia, Spain
Andreas L. Opdahl, University of Bergen, Norway
Jeffrey Parsons, Memorial University of Newfoundland, Canada
Keng Siau, University of Nebraska-Lincoln, USA
Il-Yeol Song, Drexel University, USA
Bernhard Thalheim, Christian Albrechts University Kiel, Germany
Ambrosio Toval, University of Murcia, Spain
Juan Trujillo, University of Alicante, Spain
Panos Vassiliadis, University of Ioannina, Greece
Table of Contents
SeCoGIS 2010 – Fourth International Workshop on Semantic and Conceptual Issues in Geographic Information Systems

Preface to SeCoGIS 2010 (Jean Brodeur and Esteban Zimányi)

Semantical Aspects
W-Ray: A Strategy to Publish Deep Web Geographic Data (Helena Piccinini, Melissa Lemos, Marco A. Casanova, and Antonio L. Furtado)
G-Map Semantic Mapping Approach to Improve Semantic Interoperability of Distributed Geospatial Web Services (Mohamed Bakillah and Mir Abolfazl Mostafavi)
MGsP: Extending the GsP to Support Semantic Interoperability of Geospatial Datacubes (Tarek Sboui and Yvan Bédard)

Implementation Aspects
Range Queries over a Compact Representation of Minimum Bounding Rectangles (Nieves R. Brisaboa, Miguel R. Luaces, Gonzalo Navarro, and Diego Seco)
A Sensor Observation Service Based on OGC Specifications for a Meteorological SDI in Galicia (José R.R. Viqueira, José Varela, Joaquín Triñanes, and José M. Cotos)

CMLSA 2010 – Third International Workshop on Conceptual Modeling for Life Sciences Applications

Preface to CMLSA 2010 (Yi-Ping Phoebe Chen, Sven Hartmann, and Jing Wang)

Conceptual Modelling for Bio-, Eco- and Agroinformatics
Provenance Management in BioSciences (Sudha Ram and Jun Liu)
Ontology-Based Agri-Environmental Planning for Whole Farm Plans (Hui Ma)

CMS 2010 – First International Workshop on Conceptual Modeling of Services

Preface to CMS 2010 (Markus Kirchberg and Bernhard Thalheim)

Modeling Support for Service Integration
A Formal Model for Service Mediators (Klaus-Dieter Schewe and Qing Wang)
Reusing Legacy Systems in a Service-Oriented Architecture: A Model-Based Analysis (Yeimi Peña, Dario Correal, and Tatiana Hernandez)
Intelligent Author Identification (Qing Wang and René Noack)

Modeling Techniques for Services
Abstraction, Restriction, and Co-creation: Three Perspectives on Services (Maria Bergholtz, Birger Andersson, and Paul Johannesson)
The Resource-Service-System Model for Service Science (Geert Poels)

ACM-L 2010 – The 3rd International Workshop on Active Conceptual Modeling of Learning (ACM-L)

Preface to ACM-L 2010 (Hannu Kangassalo, Sal March, and Leah Wong)

Advances in Active Conceptual Modeling of Learning
ACM-L 2010
Towards a Framework for Emergent Modeling (Ajantha Dahanayake and Bernhard Thalheim)
When Entities Are Types: Effectively Modeling Type-Instantiation Relationships (Faiz Currim and Sudha Ram)
ACM-L 2009
KBB: A Knowledge-Bundle Builder for Research Studies (David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Aaron Stewart, and Cui Tao)

WISM 2010 – The 7th International Workshop on Web Information Systems Modeling

Preface to WISM 2010 (Flavius Frasincar, Geert-Jan Houben, and Philippe Thiran)

Web Information Systems Development and Analysis Models
Integration of Dialogue Patterns into the Conceptual Model of Storyboard Design (Markus Berg, Bernhard Thalheim, and Antje Düsterhöft)
Model-Driven Development of Multidimensional Models from Web Log Files (Paul Hernández, Irene Garrigós, and Jose-Norberto Mazón)

Web Technologies and Applications
Integrity Assurance for RESTful XML (Sebastian Graf, Lukas Lewandowski, and Marcel Waldvogel)
Collaboration Recommendation on Academic Social Networks (Giseli Rabello Lopes, Mirella M. Moro, Leandro Krug Wives, and José Palazzo Moreira de Oliveira)
Mining Economic Sentiment Using Argumentation Structures (Alexander Hogenboom, Frederik Hogenboom, Uzay Kaymak, Paul Wouters, and Franciska de Jong)

DE@ER 2010 – Domain Engineering

Preface to DE@ER 2010 (Iris Reinhartz-Berger, Arnon Sturm, Jorn Bettin, Tony Clark, and Sholom Cohen)

Methods and Tools in Domain Engineering
Evaluating Domain-Specific Modelling Solutions (Parastoo Mohagheghi and Øystein Haugen)
Towards a Reusable Unified Basis for Representing Business Domain Knowledge and Development Artifacts in Systems Engineering (Thomas Kofler and Daniel Ratiu)
DaProS: A Data Property Specification Tool to Capture Scientific Sensor Data Properties (Irbis Gallegos, Ann Q. Gates, and Craig Tweedie)

FP-UML 2010 – Sixth International Workshop on Foundations and Practices of UML

Preface to FP-UML 2010 (Gunther Pernul and Matti Rossi)

Semantics and Ontologies in UML
Incorporating UML Class and Activity Constructs into UEML (Andreas L. Opdahl)
Data Modeling Is Important for SOA (Michael Blaha)
Representing Collectives and Their Members in UML Conceptual Models: An Ontological Analysis (Giancarlo Guizzardi)

Automation and Transformation in UML
UML Activities at Runtime: Experiences of Using Interpreters and Running Generated Code (Dominik Gessenharter)
Model-Driven Data Migration (Mohammed Aboulsamh, Edward Crichton, Jim Davies, and James Welch)

Author Index
4th International Workshop on Semantic and Conceptual Issues in GIS (SeCoGIS 2010)

Preface

Recent advances in information technologies have increased the production, collection, and diffusion of geographical data, thus favoring the design and development of geographic information systems (GIS). Nowadays, GISs are emerging as a common information infrastructure which penetrates more and more aspects of our society. This has given rise to new methodological and data engineering challenges in order to accommodate new users' requirements for new applications. Conceptual and semantic modeling are ideal candidates to contribute to the development of the next generation of GIS solutions. They make it possible to elicit and capture user requirements as well as the semantics of a wide domain of applications.

The SeCoGIS workshop brings together researchers, developers, users, and practitioners carrying out research and development in geographic information systems. The aim is to stimulate discussions on the integration of conceptual modeling and semantics into current geographic information systems, and how this will benefit the end users. The workshop provides a forum for original research contributions and practical experiences of conceptual modeling and semantic web technologies for GIS, fostering interdisciplinary discussions in all aspects of these two fields, and will highlight future trends in this area. The workshop is organized in a way that highly stimulates interaction amongst the participants.

This edition of the workshop attracted papers from 11 different countries distributed all over the world: Brazil, Canada, Chile, France, Italy, Lebanon, Mexico, Spain, Switzerland, United Kingdom, and USA. We received 17 papers, from which the Program Committee selected 5, making an acceptance rate of 29%. The accepted papers were organized in two sessions. The first one is devoted to semantical aspects: the first paper focuses on publishing Deep Web data, and the latter two focus on semantic interoperability. In the second session, two papers focusing on implementation aspects will be presented.

We would like to express our gratitude to the Program Committee members and the external referees for their hard work in reviewing papers, the authors for submitting their papers, and the ER 2010 organizing committee for all their support.

July 2010

Jean Brodeur
Esteban Zimányi
W-Ray: A Strategy to Publish Deep Web Geographic Data

Helena Piccinini1,2, Melissa Lemos1, Marco A. Casanova1, and Antonio L. Furtado1

1 Department of Informatics – PUC-Rio – Rio de Janeiro, RJ – Brazil
{hpiccinini,melissa,casanova,furtado}@inf.puc-rio.br
2 Diretoria de Informática – IBGE – Rio de Janeiro, RJ – Brazil
[email protected]
Abstract. This paper introduces an approach to address the problem of accessing conventional and geographic data from the Deep Web. The approach relies on describing the relevant data through well-structured sentences, and on publishing the sentences as Web pages, following the W3C and the Google recommendations. For conventional data, the sentences are generated with the help of database views. For vector data, the topological relationships between the objects represented are first generated, and then sentences are synthesized to describe the objects and their topological relationships. Lastly, for raster data, the geographic objects overlapping the bounding box of the data are first identified with the help of a gazetteer, and then sentences describing such objects are synthesized. The Web pages thus generated are easily indexed by traditional search engines, but they also facilitate the task of more sophisticated engines that support semantic search based on natural language features.

Keywords: Deep Web, Geographic Data, Natural Language Processing.
1 Introduction

Unlike the Surface Web of static pages, the Deep Web [1] comprises data stored in databases, dynamic pages, scripted pages and multimedia data, among other types of objects. Estimates suggest that the size of the Deep Web greatly exceeds that of the Surface Web – with nearly 92,000 terabytes of data on the Deep Web versus only 167 terabytes on the Surface Web, as of 2003. In particular, Deep Web databases are typically under-represented in search engines due to the technical challenges of locating, accessing, and indexing the databases. Indeed, since Deep Web data is not available as static Web pages, traditional search engines cannot discover data stored in the databases through the traversal of hyperlinks, but rather they have to interact with (potentially) complex query interfaces.

Two basic approaches to access Deep Web data have been proposed. The first approach, called surfacing, or Deep Web Crawl [16], tries to automatically fill HTML forms to query the databases. Queries are executed offline and the results are translated to static Web pages, which are then indexed [15]. The second approach, called federated search, or virtual integration [4, 18], suggests using domain-specific mediators to facilitate access to the databases. Hybrid strategies, which extend the previous approaches, have also been proposed [21].
Despite recent progress, accessing Deep Web data is still a challenge, for two basic reasons [20]. First, there is the question of scalability. Since the Deep Web is orders of magnitude larger than the Surface Web [1], it may not be feasible to completely index the Deep Web. Second, databases typically offer interfaces designed for human users, which complicates the development of software agents to interact with them.

This paper proposes a different approach, which we call W-Ray by analogy with medical X-Ray technology, to publish conventional and geographic data, in vector or raster format, stored in the Deep Web. The basic idea consists of creating a set of natural language sentences, with a simple structure, to describe Deep Web data, and publishing the sentences as static Web pages, which are then indexed as usual. The use of natural language sentences is interesting for three reasons. First, they lead to Web pages that are acceptable to Web crawlers that consider words randomly distributed in a page as an attempt to manipulate page rank. Second, they facilitate the task of more sophisticated engines that support semantic search based on natural language features [5, 24]. Lastly, the descriptions thus generated are minimally acceptable to human users. The Web pages are generated following the W3C guidelines [3] and the recommendations published by Google to optimize Web site indexing [9].

This paper is organized as follows. Section 2 describes how to publish conventional data. Section 3 discusses how to describe geographic data in vector format. Section 4 extends the discussion to geographic data in raster format. Finally, Section 5 contains the conclusions. The details of the W-Ray approach can be found in [22].
2 The W-Ray Approach for Conventional Databases 2.1 Motivation and Overview of the Approach The W-Ray approach to publishing conventional data as Web pages proceeds in two stages. In the first stage, the designer manually defines a set of database views that capture which data should be published, and specifies templates that indicate how sentences should be generated. The second stage is automatic and consists of materializing the views, translating the materialized data to natural language sentences, with the help of the templates, and publishing the sentences as static Web pages. Note that metadata, typically associated with geographic data, can be likewise processed. As an alternative to synthesizing natural language sentences, one might simply format the materialized view data as HTML tables. However, this is not a reasonable strategy for at least two reasons. First, some search mechanisms consider tables as visual objects. Second, tables may be difficult to read, even for the typical user, or at all impossible, for the visually impaired users. Indeed, the third principle of the W3C recommendation [3] indicates that “Information and the operation of user interface must be understandable.”, and item 4 of the Google Web page optimization guidelines [9] recommends that “(Web page) content should be: easy-to-read; organized around the topic; use relevant language; be fresh and unique; be primarily created for users, not search engines”. This recommendation reflects the fact that Web crawlers may interpret words randomly or repeatedly distributed in a Web page as an attempt to manipulate page rank, and thereby reject indexing the page.
Finally, we observe that some of the W3C specific recommendations for the visually impaired user in fact coincide with Google’s orientations. Comparing the two, it is clear that the difficulties faced by the visually impaired user are akin to those a search engine suffers during the data collection step. As an example, both Google and W3C recommend using the attribute "alt" to describe the content of an image. Naturally, the content of an image is opaque to both a visually impaired user and a search engine, but an alternate text describing the image can be indexed by a search engine and read (by a screen reader) to the visually impaired user. In general, many W-Ray strategies defined to address the limitations of search engines also apply to the design of a database interface for the visually impaired user. 2.2 Guidelines for View Design The designer should first select which data should be published with the help of database views. We offer the following simple guidelines that the designer should follow: • Attributes whose values have no semantics outside the database should not be directly published. • Artificially generated primary keys, foreign keys that refer to such primary keys, attributes with domains that encode classifications or similar artifacts, if selected for publication, should have their internal values replaced by their respective external definitions. For example, a classification code should be replaced by the corresponding classification term. • Attributes that contain private data should not be published. • Views should not contain too many attributes; only those attributes that are relevant to help locate the objects and their relationships should be selected. 2.3 Translating the Materialized Data to Natural Language Sentences The heart of the W-Ray approach lies in the translation of materialized view data to natural language sentences. Fuchs et al. [8] propose a single language for machine and human users, basically by translating English sentences to first-order logic. Others propose to translate RDF triples to natural language sentences [7, 13], simply by concatenating the triples. Tools to translate conventional data to RDF triples have also been developed [2, 6], which typically map database entities to classes, attributes to datatype properties, and relationships to object properties. The proposals introduced in [7, 13] do not consider sequences of RDF triples, though, which we require to compose simple sentences into more complex syntactical constructions. Therefore, we combine the strategies to synthesize sentences described in [13] with the mapping of conventional data to RDF triples introduced in [2]. The translation of materialized view data to natural language sentences involves two tasks: choice of an appropriate external vocabulary; and definition of templates to guide the synthesis of the sentences. First observe that the database schema names, including view names, are typically inappropriate to be externalized to the database users. This implies that the designer must first define an external vocabulary, that is, a set of terms that will be used to communicate materialized view data to the users. The designer should obey the following generic guideline:
• The external vocabulary should preferably be a subset of a controlled vocabulary covering the application domain in question, or of a generic vocabulary, such as that of an upper-level ontology or Wordnet. If followed, this guideline permits defining hyperlinks from the terms of the external vocabulary to the terms of the controlled vocabulary. A similar strategy to synthesize sentences is discussed in [11]. An extension to Wordnet is also proposed in [23] to treat concepts corresponding to compound nouns. After selecting the external vocabulary, the designer must define templates that will guide the synthesis of the sentences. We offer three alternatives: free template definition; default template definition; and modifiable default template definition. The first alternative leaves template definition in the hands of the designer and, thus, may lead to sentences with arbitrary structure. In the default template alternative, the designer first creates an entity-relationship model that is a high-level description of the views, and then uses a tool that generates default templates based on the ER model and synthesizes sentences with a regular syntactical structure. The last alternative is a variation of the second and allows the designer to alter the default templates. For the free template definition alternative, we offer the following guidelines: • A template must use the external vocabulary and other common syntactical elements (articles, conjunctions, etc.) [19], as well as punctuation marks. • A template should generate a sentence that characterizes an entity through its properties and relationships. • The subject of the sentence should have a variable associated with an identifying attribute of the view. • The predicate of the sentence should have variables associated with other view attributes that further describe the entity, or that relate the entity to other entities. The use of free templates is illustrated in what follows, using a relational view of the SIDRA database, which the Brazilian Institute of Geography and Statistics (IBGE) publishes on the Web with the help of HTML forms. The full details can be found in [22]. We start by defining views over the SIDRA database. To save space, Table 1 shows just the “political_division” view: the first column indicates the view name, the second column indicates the attribute names of the view, the third column describes the attributes, and the fourth column associates a variable with each attribute. We then define a template to publish the “political division” view data: U is a “L” that has a total of V M for the year Y and aggregate variable A. Table 1. Schematic definition of a view over the SIDRA database
View Name: political_division

Attribute Name | Attribute Description | Variable
name | name of the political division | U
level | level of the political division, such as state, county, ... | L
aggreg_var | name of an aggregation data, such as resident population | A
aggreg_var_value | value of the aggregation data | V
unit_measure | unit measure of the aggregation data | M
year | year the aggregation data was measured | Y
...
Next, the view is materialized. Each line of the resulting table is transformed into a sentence, using the template. The following sentence illustrates the result:

Roraima is a unit of the federation that has a total of 395.725 people for the year 2007 and aggregate variable “resident population”.

Note that the underlined words are the subject of the sentence; the predicate “is a unit of the federation” qualifies the subject; the words in boldface are view data that play the role of predicatives of the subject, together with the fragments in italics.

We now repeat the example using the default templates alternative. Recall that, in this alternative, the designer starts by creating an ER model of the views. In our running example, the ER model would be:

entity(political_division,name).
attribute(political_division,level).
attribute(political_division,aggreg_var).
attribute(political_division,aggreg_var_value).
attribute(political_division,unit_measure).
attribute(political_division,year).

Using the variables defined in Table 1, the tool generates default templates such as:

'There is a political division with name P'
'The level of P is L'

Using default templates, the tool then synthesizes sentences such as (data in boldface):

'There is a political division with name Roraima'.
'The level of Roraima is unit of the federation'.

Finally, the modifiable default template alternative allows the designer to alter the default templates. Examples of template redefinitions are (where the variables in boldface italics in the new template have to occur in the default template):

Default template: 'There is a political division with name P'
New template: 'P'
Default template: 'The level of P is L'
New template: 'is a L'

The designer is also allowed to compose the modified templates as in the example:

facts((political_division(P),level(P,L))).

Using modified templates, the tool synthesizes sentences such as (data in boldface):

'Roraima is a unit of the federation'
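To make the template mechanism concrete, the short sketch below shows one way the free-template substitution and the generation of default templates from the ER declarations could be implemented. It is only an illustration following the conventions of Table 1, not the authors' actual tool; the Python function names (fill_template, default_templates) and data structures are invented for the example.

# Illustrative sketch (not the W-Ray tool): free-template substitution and
# default templates derived from the ER declarations of Section 2.3.

# A materialized row of the "political_division" view, keyed by the variables
# of Table 1 (U = name, L = level, A = aggregation variable, V = value,
# M = unit of measure, Y = year).
row = {"U": "Roraima", "L": "unit of the federation",
       "A": "resident population", "V": "395.725", "M": "people", "Y": "2007"}

# Free template defined by the designer.
free_template = ('{U} is a {L} that has a total of {V} {M} '
                 'for the year {Y} and aggregate variable "{A}".')

def fill_template(template, values):
    # Replace each variable of the template by the corresponding row value.
    return template.format(**values)

print(fill_template(free_template, row))

# Default templates: one sentence that introduces the entity,
# plus one sentence per attribute declared in the ER model.
er_model = {"entity": ("political_division", "name"),
            "attributes": ["level", "aggreg_var", "aggreg_var_value",
                           "unit_measure", "year"]}

def default_templates(er):
    entity, _ = er["entity"]
    yield "There is a {} with name {{P}}".format(entity.replace("_", " "))
    for attr in er["attributes"]:
        yield "The {} of {{P}} is {{{}}}".format(attr.replace("_", " "), attr)

view_row = {"P": "Roraima", "level": "unit of the federation",
            "aggreg_var": "resident population", "aggreg_var_value": "395.725",
            "unit_measure": "people", "year": "2007"}

for template in default_templates(er_model):
    print(fill_template(template, view_row))

Running the sketch prints the free-template sentence about Roraima followed by the default-template sentences ('There is a political division with name Roraima', 'The level of Roraima is unit of the federation', and so on), mirroring the examples above.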
2.4 Guidelines for Publishing the Sentences as Static Web Pages

As mentioned before, W-Ray follows the W3C recommendation [3], as well as the Google Web page optimization guidelines [9]. Briefly, the most relevant criteria that W-Ray adopts to publish Web pages are:

• Create hyperlinks between the published data and metadata (W3C Recomm. 3).
• Create hyperlinks between the published data to improve data exploration via navigation (W3C Recomm. 1.3.2 and 2.4 and Google Recomm. 3 and 5).
• Create content with well-structured sentences, as addressed in Section 2.2 (W3C Recomm. 3 and Google Recomm. 4).
• Use text to describe images when the attribute “alt” does not suffice (W3C Recomm. 1.1.1 and Google Recomm. 7).

In the example of Section 2.3, the subject of the sentence – Roraima – would be hyperlinked to a Web page with further information about the State of Roraima. Briefly, the URLs would be generated upfront by concatenating a base URI with the primary key of the data (see [22] for the details).
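As a small illustration of these publishing guidelines, the sketch below shows how a synthesized sentence could be wrapped into a static HTML page, with the subject hyperlinked to the dynamic page obtained by concatenating a base URI with the primary key of the underlying record. The base URI, the file name and the primary key value are invented for the example; this is only a sketch of the idea, not the W-Ray implementation.

import html

# Hypothetical base URI of the dynamic pages (invented for this sketch).
BASE_URI = "http://example.org/sidra/political_division/"

def publish_sentence(subject, predicate, primary_key):
    # Wrap one sentence into HTML, hyperlinking the subject to the dynamic
    # page identified by the primary key of the underlying record.
    link = '<a href="{}{}">{}</a>'.format(BASE_URI, primary_key,
                                          html.escape(subject))
    return "<p>{} {}</p>".format(link, html.escape(predicate))

body = publish_sentence(
    "Roraima",
    'is a unit of the federation that has a total of 395.725 people '
    'for the year 2007 and aggregate variable "resident population".',
    "14")  # hypothetical primary key of the record describing Roraima

with open("roraima.html", "w", encoding="utf-8") as f:
    f.write("<html><body>{}</body></html>".format(body))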
3 W-Ray for Geographical Data in Vector Format We first observe that a number of tools [17] offer facilities to convert geographic data in vector format to dynamic Web pages. However, such Web pages are typically not indexed by search engines. We also observe that geographic data in vector format is not opaque, as raster images are, since the data is often associated with conventional data and, in fact, with the (geographic) objects stored in the database. A solution to make vector data visible to the search engines would therefore be to publish the conventional data associated with them, as discussed in Section 2. This strategy would however totally ignore the geographic information that the vector data capture. In the W-ray strategy, we explore how to translate the relevant geographic information again as natural language sentences. On a first approximation, the strategy is the same as for conventional data: define a set of database views that capture which data should be published; materialize the views; translate the materialized data to natural language sentences; and publish the sentences as static Web pages. More specifically, suppose that the vector data is organized by layers. Then, when defining a view, the designer essentially has to decide: • Which layers will be combined in the view. For example, the view might combine the political division, populated places and waterways layers; • For each layer included in the view, which objects will be retained in the view. For example, one might discard all populated places below a certain population; • For each layer included in the view, which attributes will be retained in the view; • When the view combines several layers, o Which is the priority between the layers. For examples, the populated places layer may have priority over the political division and the waterways layers; o Which topological relationships between the objects of different layers should be materialized. For example, for each populated place (of the highest priority layer), one might decide to materialize which navigable waterways (of the lowest priority layer) are within a buffer of 100km centered in the populated place. o In which topological order the objects will be described. For example, populated places might be listed from north to south and from west to east. As for conventional data, the designer should select the external names preferably from a controlled vocabulary such as the ISO19115 Topic Categories [12]. For example, consider a view consisting of three layers - the political division, the populated places and the waterways of Brazil - filtered as follows: • political division: keep only the states, with their name, abbreviated name, area and population, located in the north region
• populated places: retain only the county and state capitals, with their name, political status, area and population, located in the states in the north region • waterways: keep only the name, navigability and flow Furthermore, assume that the topological relationship between populated places and political division is ‘is located in’ and that between waterways and political division is ‘cross’. Assume that populated places have priority and that they are listed from north to south and from west to east. Examples of sentences would be (using the same conventions as in Section 2.3): Roraima is a unit of the federation that has a total of 395.725 people for the year 2007 and aggregate variable “resident population”. Roraima is located in the North Region, with an area of 22,377,870 square kilometers. Boa Vista is a city that has a total of 249.853 people for the year 2007 and aggregate variable “resident population”. Boa Vista is located in the unit of federation Roraima and is the capital city of the unit of federation Roraima, with an area of 5,687 square kilometers. Amazonas is a waterway that crosses the unit of federation Amazonas and the unit of federation Pará, with flow permanent and navigability navigable. The subject of each sentence (underlined words) would also have a hyperlink to a dynamic Web page with the full information about the state or the city, generated by executing a query over the underlying database. Using default templates, the running example would be restated as follows: • Declaration of the entity-relationship model: entity(political_division,name). entity(populated_places,name). entity(waterways,name). attribute(political_division,population). attribute(political_division,abbreviated_name). attribute(political_division,area). attribute(populated_places,level). attribute(populated_places,local_area). attribute(populated_places,local_population). attribute(waterways,flow). attribute(waterways, navigability). relationship(located_in,[populated_places, political_division]). relationship(crosses, [waterways, political_division]).
• Examples of synthesized sentences, using default templates (with data in boldface): 'There is a populated places with name City of Boavista'. 'There is a political division with name State of Amazonas'. 'There is a political division with name State of Pará'. 'There is a waterways with name Amazon River'. 'The flow of Amazon River is permanent'. 'The navigability of Amazon River is navigable'. 'City of Boavista is related to State of Roraima by located in'. 'Amazon River is related to State of Amazonas by crosses'. 'Amazon River is related to State of Pará by crosses'.
Turning to the modified default templates alternative, examples are:

• Template redefinition:
Default template: 'There is a political division with name P'
New template: 'The P'
Default template: 'R is related to P by crosses'
New template: 'is crossed by R'
Default template: 'The flow of R is F'
New template: 'which is F'
Default template: 'The navigability of R is V'
New template: 'and V'

• Template composition:
facts((political_division(P),crosses(R,P), flow(R,S),navigability(R,V))).

• Sentences generated using the new templates (with data in boldface):
'The State of Amazonas is crossed by Amazon River which is permanent and navigable'
'The State of Pará is crossed by Amazon River which is permanent and navigable'
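The topological facts used throughout this section ('is located in', 'crosses', navigable waterways within a buffer of a populated place) can be computed with standard spatial predicates before the sentences are synthesized. The sketch below illustrates the idea with the Shapely library and invented toy geometries; it is not part of the W-Ray tool, and in a real setting the geometries would come from the materialized view layers.

from shapely.geometry import Point, LineString, Polygon

# Toy layers with invented coordinates; real geometries would come from the
# materialized views of the political division, populated places and waterways.
states = {"Roraima": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])}
places = {"Boa Vista": Point(5, 5)}
waterways = {"Amazonas": LineString([(-2, 3), (12, 4)])}

facts = []

# Populated place 'is located in' a political division.
for place, point in places.items():
    for state, polygon in states.items():
        if point.within(polygon):
            facts.append("{} is located in the unit of federation {}."
                         .format(place, state))

# Waterway 'crosses' a political division.
for river, line in waterways.items():
    for state, polygon in states.items():
        if line.crosses(polygon):
            facts.append("{} crosses the unit of federation {}."
                         .format(river, state))

# Waterways within a buffer centered on a populated place.
for place, point in places.items():
    for river, line in waterways.items():
        if line.intersects(point.buffer(7)):  # buffer radius in map units
            facts.append("{} is within the buffer of {}.".format(river, place))

for fact in facts:
    print(fact)

Each materialized fact can then be fed to the templates of Section 2.3 to produce the kind of sentences shown above.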
4 W-Ray for Raster Data

Following the idea introduced in Leme et al. [14], the W-Ray strategy describes raster data by publishing sentences that capture the metadata describing how the raster data was acquired, and the geographic objects contained within its bounding box. The geographic objects might be obtained, for example, from a gazetteer, such as the ADL gazetteer [10], which includes a useful Feature Type Thesaurus (FTT) for classifying geographic features. As for vector data, the designer should define views, this time based on the classification of the geographic objects.

As a concrete example, consider the image fragment of the City of Rio de Janeiro, taken out of the Web site “Brazil seen from Space”, and assume that:

• the metadata of the image indeed indicates the coordinates of its bounding box
• the geographic objects and their classifications are taken from the ADL Gazetteer
• the designer decides to associate images with geographic objects classified as ‘hydrographic feature’, a topic category of FTT, whose centroid is contained in the bounding box of the image

The raster image would then be processed as follows:

1. The georeferencing parameters are extracted from the image. In this case, the image fragment is consistent with a scale of 1:25.000 and has a bounding box defined by ((43°15’W, 22°52’30”S), (43°07’30”W, 23°S)).
2. By querying the ADL Gazetteer using the georeferencing parameters extracted in Step 1 and the ADL FTT term selected, ‘hydrographic feature’, one locates 9 objects, of which the first few are:
a. Feature(“Rodrigo de Freitas, Lagoa - Brazil”, lakes, contains)
b. Feature(“Comprido, Rio – Brazil”, streams, contains)
c. Feature(“Maracana, Rio – Brazil”, streams, contains)

The query results would be translated to the following sentence, describing the image (using the same conventions as in Section 2.3):

The image of Rio de Janeiro, Brazil, contains the lake “Rodrigo de Freitas” and the streams “Comprido” and “Maracanã”.

where the underlined words form the subject of the sentence, the words in boldface italics were extracted from the ADL FTT, and those in boldface denote geographic objects in the ADL Gazetteer whose centroids are contained in the bounding box of the image.
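A simplified version of this raster-description step is sketched below: given the bounding box taken from the image metadata and a small in-memory stand-in for the ADL Gazetteer (feature name, FTT type, centroid), it keeps the features whose centroid falls inside the box and assembles the describing sentence. The centroid coordinates and the miniature gazetteer are invented for the example; a real implementation would query the ADL service instead.

# Bounding box of the image in decimal degrees (west, south, east, north),
# converted from the coordinates quoted in the text.
bbox = (-43.25, -23.0, -43.125, -22.875)

# Miniature stand-in for the ADL Gazetteer: (name, FTT type, centroid lon, lat).
# The centroids are rough, invented values used only for the illustration.
gazetteer = [
    ("Rodrigo de Freitas, Lagoa - Brazil", "lakes",   -43.212, -22.975),
    ("Comprido, Rio - Brazil",             "streams", -43.205, -22.920),
    ("Maracana, Rio - Brazil",             "streams", -43.245, -22.905),
    ("Guanabara, Baia de - Brazil",        "bays",    -43.120, -22.750),
]

def inside(box, lon, lat):
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

# Keep the features whose centroid falls inside the bounding box,
# grouped by their FTT type.
by_type = {}
for name, ftt, lon, lat in gazetteer:
    if inside(bbox, lon, lat):
        by_type.setdefault(ftt, []).append(name.split(",")[0])

parts = []
for ftt, names in sorted(by_type.items()):
    label = ftt.rstrip("s") if len(names) == 1 else ftt
    parts.append("the {} {}".format(label,
                 " and ".join('"{}"'.format(n) for n in names)))

print("The image of Rio de Janeiro, Brazil, contains " +
      " and ".join(parts) + ".")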
5 Conclusions

This paper outlined an approach to overcome the problem of accessing conventional and geographic data from the Deep Web. The approach relies on describing the data through natural language sentences, published as Web pages. The Web pages thus generated are easily indexed by traditional search engines, but they also facilitate the task of engines that support semantic search based on natural language features. The details of the approach can be found in [22].

Further work is planned to assess which of the three alternatives for generating templates, if any, leads to better recall. The experiments will use massive amounts of data from geographic databases organized by IBGE, as well as a large multimedia database. Lastly, we remark that the approach can be easily modified to generate RDF triples, instead of natural language sentences, and to cope with multimedia data. In a broader perspective, it can also be used to describe conventional, geographic and multimedia data to the visually impaired users. The challenges here lie in structuring the sentences in such a way as to avoid cognitive overload.

Acknowledgements. This work was partly supported by IBGE, CNPq under grants 301497/2006-0, 473110/2008-3, 557128/2009-9, FAPERJ E-26/170028/2008, and CAPES/PROCAD NF 21/2009.
References

[1] Bergman, M.K.: The Deep Web: Surfacing Hidden Value. J. Electr. Pub. 7(1) (2001)
[2] Bizer, C., Cyganiak, R.: D2R Server – Publishing Relational Databases on the Web as SPARQL Endpoints. In: Proc. 15th Int'l. WWW Conf., Edinburgh, Scotland (2006)
[3] Caldwell, B., Cooper, M., Reid, L.G., Vanderheiden, G.: Web Content Accessibility Guidelines (WCAG) 2.0. In: W3C Recommendation (2008)
[4] Callan, J.: Distributed information retrieval. In: Advances in Information Retrieval, pp. 127–150. Springer, US (2000)
[5] Costa, L.: Esfinge - Resposta a perguntas usando a Rede. In: Proc. Conf. IberoAmericana IADIS WWW/Internet, Lisboa, Portugal (2005)
[6] Erling, O., Mikhailov, I.: RDF support in the virtuoso DBMS. In: Proc. 1st Conference on Social Semantic Web, Leipzig, Germany. LNI, vol. 113, pp. 59–68 (2007)
[7] Fliedl, G., Kop, C., Vöhringer, J.: Guideline based evaluation and verbalization of OWL class and property labels. Data & Knowledge Eng. 69(4), 331–342 (2010)
[8] Fuchs, N.E., Kaljurand, K., Kuhn, T.: Attempto Controlled English for Knowledge Representation. In: Baroglio, C., Bonatti, P.A., Małuszyński, J., Marchiori, M., Polleres, A., Schaffert, S. (eds.) Reasoning Web. LNCS, vol. 5224, pp. 104–124. Springer, Heidelberg (2008)
[9] Google: Google's Search Engine Optimization Starter Guide, Version 1.1 (2008)
[10] Alexandria Digital Library: Guide to the ADL Gazetteer Content Standard, v. 3.2 (2004)
[11] Hollink, L., Schreiber, G., Wielemaker, J., Wielinga, B.: Semantic Annotation of Image Collections. In: Proc. Knowledge Markup and Semantic Annotation Workshop, Sanibel, Florida, USA (2003)
[12] ISO 19115:2003, Geographic Information – Metadata
[13] Kalyanpur, A., Halaschek-Wiener, C., Kolovski, V., Hendler, J.: Effective NL Paraphrasing of Ontologies on the Semantic Web. In: Workshop on End-User Semantic Web Interaction, 4th Int. Semantic Web Conference, Galway, Ireland (2005)
[14] Leme, L.A.P.P., Brauner, D.F., Casanova, M.A., Breitman, K.: A Software Architecture for Automated Geographic Metadata Annotation Generation. In: Proc. XXII Simpósio Brasileiro de Banco de Dados, SBBD, João Pessoa, Brazil (2007)
[15] Madhavan, J., Afanasiev, L., Antova, L., Halevy, A.: Harnessing the Deep Web: Present and Future. In: Proc. 4th Biennial Conf. on Innovative Data Systems Research (CIDR), Asilomar, California, USA (2009)
[16] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., Halevy, A.: Google's Deep-Web Crawl. In: Proc. VLDB, vol. 1(2), pp. 1241–1252 (2008)
[17] MapServer, http://mapserver.org/about.html#about
[18] Meng, W., Yu, C.T., Liu, K.L.: Building efficient and effective metasearch engines. ACM Computing Surveys 34(1), 48–89 (2002)
[19] Praninskas, J.: Rapid review of English grammar. Prentice-Hall, NJ (1975)
[20] Raghavan, S., Garcia-Molina, H.: Crawling the Hidden Web. In: Proc. VLDB, pp. 129–138 (2001)
[21] Rajaraman, A.: Kosmix: High-Performance Topic Exploration using the Deep Web. In: Proc. VLDB, Lyon, France (2009)
[22] Piccinini, H., Lemos, M., Casanova, M.A., Furtado, A.L.: W-Ray: A Strategy to Publish Deep Web Geographic Data. Tech. Rep. 10/10, Dept. Informatics, PUC-Rio (2010)
[23] Sorrentino, S., Bergamaschi, S., Gawinecki, M., Po, L.: Schema Normalization for Improving Schema Matching. In: Laender, A.H.F. (ed.) ER 2009. LNCS, vol. 5829, pp. 280–293. Springer, Heidelberg (2009)
[24] Zheng, Z.: AnswerBus question answering system. In: Proc. 2nd International Conference on Human Language, San Diego, California, pp. 399–404 (2002)
G-Map Semantic Mapping Approach to Improve Semantic Interoperability of Distributed Geospatial Web Services *
Mohamed Bakillah and Mir Abolfazl Mostafavi 1
Centre de recherche en géomatique (CRG), Université Laval, Québec, Canada, G1K 7P4
[email protected]
Abstract. The geospatial domain is influenced by the Web developments; consequently, an increasing number of geospatial web services become available through Internet. A rich description of geospatial web services is required to resolve semantic heterogeneity and achieve semantic interoperability of geospatial web services. However, existing geospatial web services descriptions and semantic mapping approaches employed to reconcile them are not always rich enough, especially with respect to semantics of spatiotemporal features. This article proposes a new semantic mapping model, the G-MAP, which is based on a semantically augmented description of geospatial web services. G-MAP introduces the idea of semantic mappings between services that depends on context, and an augmented mapping technique based on dependencies between features of concepts describing geo-services. An implementation scenario demonstrates the validity of our approach. Keywords: Geospatial Web Service, Semantic Interoperability, Semantic Mapping, Knowledge Representation.
1 Introduction Geospatial Web Services (GWSs) are modular components of geospatial computing applications; they can be published, discovered and invoked to access and process distributed geospatial data coming from different sources. Previously, geospatial services were available only through GIS desktop application; today, more services are accessible on the Web and through distributed applications and networks [21]. The emergence of geospatial web services (GWSs) and service-oriented architecture (SOA) brought a new paradigm for businesses and organizations where it is now possible to combine different geospatial web services to create more complex services that are adapted to the user’s need. Interoperability is a key issue for the discovering and composition of GWSs, and for the development of the Geospatial Semantic Web [8]. According to ISO TC204, document N271, interoperability is “the ability of systems to provide services to and accept services from other systems and to use the services *
Corresponding author.
so exchanged to enable them to operate effectively together.” The Open Geospatial Consortium (OGC) and ISO/TC 211 have created several standards to support interoperability of geospatial web services, such as the Web Service Modeling Language (WSDL) that supports the description of web services and standard operations that allow retrieving the description of the capabilities provided by a service. SOAP is a standard protocol for service binding. Those standards support interoperability at the syntax level. However, semantic heterogeneity affecting GWS is still an obstacle to semantic interoperability. Semantic heterogeneity is the difference in the intended meaning of concepts describing data and services [6]. Semantic interoperability allows organisations to share and re-use knowledge they have, internally and with other stakeholders [20]. Semantic heterogeneity occurs because services are developed by different organizations, for different purposes and using different terminologies [16]. To overcome the problem of service discovery and interoperability, OGC has proposed catalog services, where services are published and users can manually browse the catalog to find the service they look for, but this is a very tedious task. Recent approaches to service interoperability and discovering such as [17] represent the functional capabilities of GWS with ontologies, which are “explicit specifications of a conceptualisation”, according to Gruber’s definition [12]. Ontology is widely used for semantic interoperability of geographic information systems [10]. It is composed by concepts (or classes), relations, and axioms describing entities that are assumed to exist in a domain of interest [1]. Then, semantic mappings or semantic similarities between concepts of ontologies are used to reconcile different services or find services that match a given query. Examples of such approaches are [9][13][21][14][15][7]. To support semantic interoperability of GWS, the description of their capabilities should be as deep as possible. In addition, the semantics of spatial and temporal aspects of this description should be explicit. The semantic matching approach should be developed to reason with a deep description of GWS and produce different semantic relations between them. In this paper, we present a new approach for the semantic interoperability of GWS, which uses a new semantically augmented representation of GWS that integrates context, semantics of spatial and temporal aspects of the service’s description, and dependencies between elements of service’s description. Then, we propose the G-MAP semantic mapping system, which was specifically designed to compare the proposed service descriptions with inference engines, in an automatic manner. G-MAP includes a new augmented structural matching criterion that uses dependencies to find missing, implicit semantic mappings between GWS descriptions. The implementation scenario demonstrates that the approach supports semantic interoperability of GWS and helps the user to discover and select the more relevant GWS with respect to its requirements.
2 Related Work on Geospatial Web Services Semantic Interoperability

The Semantic Web has been conceived as a huge data repository where people can search and access the information they need [4]. With the emergence of web service technologies, it has also become a repository of web functionalities. Examples of geospatial web services (GWSs) include catalog and geospatial repository services, location-based services, data access and transformation services [2], as well as web map services [5]. Several approaches for the discovery, interoperability and composition of
GWSs have been proposed. Typically, in order to make a GWS available on the Web, service providers publish relevant metadata about the capabilities of their service on a web server, where requestors can discover registered services and bind to them to obtain their service [13]. With the development of Geospatial Semantic Web technologies, some approaches use formal languages that support reasoning, such as Description Logics [13][14][19]. In the work of Lutz and Klien [14] on the retrieval of geographic information, subsumption-based reasoning is used. When the user submits a search concept, the system returns a taxonomy of concepts that are subsumed by (more specific than) the search concept. However, it does not return the concepts that are more general than or overlapping with the search concept. For example, if the search concept is "lake", the retrieval system may not return the concept "waterbody", which is also relevant. Similarly, Wiegand and Garcia proposed a task-based Semantic Web approach to retrieve geospatial data [22]. They formalize the relationships between tasks (e.g., land use management) and types of data sources. A user can submit a query to the knowledge base where the sources' descriptions are stored in order to find the sources that correspond to a selected task. A Jena reasoning engine retrieves the sources that are associated with the requested tasks. The reasoning engine returns only the sources that completely satisfy the query; therefore, the problem is the same as with subsumption reasoning. Janowicz [13] suggests that a semantic similarity measure is preferable (or complementary) to subsumption reasoning, since it can retrieve concepts that are close in meaning to the search concept without rejecting those that may not meet the exact condition of subsumption. He proposes a semi-automatic similarity-based retrieval approach for GWS that uses the Web Service Modeling Language (WSML-Core). The semantic similarity indicates to what degree the retrieved GWS satisfy the user requirements. Zhang et al. [23] present an ontology-driven discovery model for geographical information services, where a multilevel semantic similarity approach addresses the problem of how to select a similarity threshold above which a service is similar enough to the service request. While the recall of a semantic similarity measure is better than that of subsumption reasoning, it is not expressive enough to help the user select the most relevant service. What is needed is a semantic mapping system that uses GWS descriptions with deep semantics and produces different kinds of semantic relations between them. The solution proposed in this paper is based on the G-MAP semantic mapping system, which overcomes the mentioned limitations of existing approaches. This system uses a new representation of GWS based on a multi-view augmented concept model.
3 Representation of Geospatial Web Services Descriptions

Semantic interoperability of geospatial web services (GWS) depends on the richness of the semantic description of the GWS. A GWS is described by a function, an input and an output, and pre-conditions and post-conditions [14]. The function is the role of the GWS, for example, computing the Euclidean distance between two locations. The input is the data taken by the service (e.g., two GML points) and the output is the result of the process performed by the service (e.g., a distance). The pre-conditions and post-conditions are conditions on the input and the output, respectively; for example, the minimal spatial accuracy of the input GML points. The proposed representation of
GWS descriptions is based on the Multi-View Augmented Concept (MVAC) model that we presented in [3]. This model was developed to improve existing concept definitions, which can lack valuable features. The idea is to add two layers of semantics to the definition of a concept: a set of views valid in different contexts, and dependencies between features of the concept. The MVAC also includes spatial and temporal descriptors, which are new features that define the semantics of the spatial and temporal properties of the concept. The MVAC is defined with the following features: cMVA = <n(c), {p(c)}, {r(c)}, {spatial_d(c)}, {temporal_d(c)}, {v(c)}, {dep(c)}>. n(c) is the name of the concept. {p(c)} is its set of properties. {r(c)} is the set of relations that c has with other concepts. {spatial_d(c)} is a set of spatial descriptors about the spatiality of the concept; the spatiality of a concept can be described as a part of a thing, for instance "center of, axis of, contour of, top of…", and spatial descriptors also include characteristics related to geometry: shape, area, length, etc. {temporal_d(c)} is a set of temporal descriptors about the temporality of the concept; the semantics of temporality is an occurrent, i.e., a process, event or change that occurs in time, and temporal descriptors also include temporal characteristics such as duration and frequency. {v(c)} is a set of views, and {dep(c)} is a set of dependencies.

A view is a selection of features that are valid in a given context. Views are indicated with the following expression: context (context value) → feature (concept, [set of feature values]), which reads as: if the context is "context value", then the value of "feature" is one of the [set of feature values]. For example, two possible views of the concept watercourse may be: context (flooding) → function (watercourse, evacuation area), and context (tourism) → function (watercourse, [navigable, skating]). The second view indicates that when the context is "tourism", the possible values of the property "function" of the concept "watercourse" are "navigable" and "skating". Dependencies express that the values of a first feature are related to the values of a second feature. We formalize dependencies with rules of the form head → body, for example: Is-a (land, lowland) → FloodRisk (land, high), where Is-a (land, lowland) reads as "land is-a lowland".

We propose that some or all parameters of a GWS description (function, input, output, pre-conditions, and post-conditions) can be semantically described with an MVAC concept. For example, consider a GWS that finds flood risk zones inside a given geographical region, given in OWL abstract syntax:

Class(input complete restriction(is-A someValuesFrom(GML: surface)))
Class(pre-condition complete restriction(part-of someValuesFrom(NorthAmerica)))
Class(function complete restriction(is-A someValuesFrom(LocalisationOfFloodRiskZone)))
Class(output complete restriction(is-A someValuesFrom(GML: surface)) restriction(hasContext someValuesFrom(floodDisasterResponse, floodPrevention)))
Class(output_FloodPrevention_Context complete restriction(is-A someValuesFrom(GML: surface)) restriction(CloseTo someValuesFrom(waterbody)))
Class(output_floodDisasterResponse_Context complete restriction(is-A someValuesFrom(GML: surface)) restriction(AdjacentTo someValuesFrom(waterbody)))
Class(post-condition complete restriction(hasSpatialAccuracy (5meters)))
Class(floodedLand complete restriction(is-A someValuesFrom(GML: surface)) restriction(depth someValuesFrom(high)) restriction(status someValuesFrom(navigable)))
The GWS description indicates that two contexts are possible: flood disaster response and flood prevention. Under those views, a flood risk zone is defined as a surface adjacent to a waterbody or a surface close to a waterbody, respectively, because the conception of the degree of risk in disaster response is different from its conception in disaster prevention. The last class indicates a dependency between the depth of water of a flooded land and its navigable status. The MVAC-based GWS descriptions are the input knowledge representation of the proposed semantic mapping system.
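For illustration only, the following minimal Java sketch shows one possible in-memory form of such an MVAC-based description; all class and field names are ours and are not part of the formal MVAC model or of its implementation.

```java
import java.util.*;

// Minimal in-memory sketch of an MVAC-based GWS description (illustrative names only).
public class MvacSketch {

    // A dependency rule head -> body, e.g. Is-a(land, lowland) -> FloodRisk(land, high).
    record Dependency(String head, String body) {}

    // A view: a selection of feature values valid in a given context,
    // e.g. context(tourism) -> function(watercourse, [navigable, skating]).
    record View(String context, String feature, List<String> allowedValues) {}

    // An MVAC concept, mirroring c_MVA = <n, {p}, {r}, {spatial_d}, {temporal_d}, {v}, {dep}>.
    record MvacConcept(String name,
                       Set<String> properties,
                       Set<String> relations,
                       Set<String> spatialDescriptors,
                       Set<String> temporalDescriptors,
                       List<View> views,
                       List<Dependency> dependencies) {}

    // A GWS description: each parameter is itself described by an MVAC concept.
    record GwsDescription(MvacConcept function, MvacConcept input, MvacConcept output,
                          MvacConcept precondition, MvacConcept postcondition) {}

    public static void main(String[] args) {
        MvacConcept watercourse = new MvacConcept(
            "watercourse",
            Set.of("function", "depth"),
            Set.of("flowsInto(waterbody)"),
            Set.of("contour", "length"),
            Set.of("duration(seasonal)"),
            List.of(new View("flooding", "function", List.of("evacuation area")),
                    new View("tourism", "function", List.of("navigable", "skating"))),
            List.of(new Dependency("Is-a(land, lowland)", "FloodRisk(land, high)")));

        System.out.println("Views of " + watercourse.name() + ": " + watercourse.views());
    }
}
```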
4 G-MAP Augmented Semantic Mapping System

In this section, we present the G-MAP augmented semantic mapping system and its core components. Figure 1 illustrates the architecture of the G-MAP system.
Fig. 1. Architecture of the G-MAP Semantic Mapping System. [Figure: GWS descriptions and their extracted dependencies are fed to the MVAC Service Description Generation Tool, which produces the MVAC GWS descriptions; a user query interface and translator provide the query; the Basic Element Lexical Matcher uses external resources to perform the lexical-to-semantic transformation and outputs basic element mappings; the Complex Mapping Inference Engine, with its spatial, temporal and thematic semantic mapping components, populates a Fact Base and uses a Mapping Rules Base; the Augmented Mapping Inference Engine produces the final output, the multi-view augmented mappings.]
G-MAP executes a gradual process that takes as input the MVAC geospatial web service descriptions (MVAC GWS descriptions) and a query, matches elements of the MVAC GWS descriptions with elements of the query in three main steps, and outputs the semantic relations between the query and the MVAC GWS descriptions. G-MAP is an automatic process, since it uses reasoning rules that automatically infer the semantic relations. Before G-MAP is run, the MVAC Service Description Generation Tool is responsible for building the MVAC GWS descriptions; this process is described in [3]. A query interface allows the service requestor to formulate a query, which is a template of a requested GWS description. The three steps of G-MAP, identified with grey boxes in Fig. 1, are described in the following paragraphs.
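Before detailing the steps, the following minimal Java sketch shows how the three steps could be chained; the interfaces and method names are placeholders of our own and do not reflect the actual G-MAP API.

```java
import java.util.*;

// Illustrative sketch of the three-step G-MAP pipeline (names are placeholders, not the real API).
public class GmapPipelineSketch {

    // A semantic relation between two named elements (equivalence, includes, included in, disjoint, ...).
    record Mapping(String left, String right, String relation) {}

    interface BasicMatcher       { List<Mapping> matchBasicElements(String query, String description); }
    interface ComplexInference   { List<Mapping> inferComplexMappings(List<Mapping> basic); }
    interface AugmentedInference { List<Mapping> augmentWithDependencies(List<Mapping> complex,
                                                                         String query, String description); }

    // Step 1: basic lexical-to-semantic matching; step 2: rule-based complex mapping;
    // step 3: dependency-based augmentation of the mappings.
    static List<Mapping> run(String query, String mvacDescription,
                             BasicMatcher basic, ComplexInference complex, AugmentedInference augmented) {
        List<Mapping> basicMappings   = basic.matchBasicElements(query, mvacDescription);
        List<Mapping> complexMappings = complex.inferComplexMappings(basicMappings);
        return augmented.augmentWithDependencies(complexMappings, query, mvacDescription);
    }

    public static void main(String[] args) {
        // Dummy components, just to show how the pieces fit together.
        List<Mapping> result = run("query", "MVAC GWS description",
            (q, d) -> List.of(new Mapping("lake", "waterbody", "included in")),
            b -> b,                  // pass basic mappings through unchanged
            (c, q, d) -> c);         // no dependency-based additions in this dummy run
        System.out.println(result);
    }
}
```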
4.1 Basic Matching

This first component of G-MAP computes the semantic mappings between the simplest elements of the MVACs that describe the services' parameters (input, output, function, pre-conditions, and post-conditions). Those simplest elements, referred to as basic MVAC elements, are the terms used to designate any MVAC feature, including the names of properties, relations, spatial and temporal descriptors, or their values. The process includes two main steps. First, the Basic Element Lexical Matcher computes a lexical relation (synonymy, hyponymy, hypernymy, partonomy) for a pair of elements. This lexical relation is determined with the help of an appropriate external resource, for example, a global ontology holding standardized vocabulary about geometrical shapes, spatial relations of topology, etc., a global ontology of time (temporal relations and attributes), or a domain-independent global ontology. In the second step, this lexical relation is transformed into a semantic relation between basic MVAC elements, one of {equivalence, includes, included in, disjoint}. An example of such a transformation is provided in [18]. The Complex Mapping Inference Engine reuses the semantic relations between basic MVAC elements.

4.2 Complex Mapping Inference Engine

The role of the Complex Mapping Inference Engine is to infer semantic relations between complex MVAC elements (properties, relations, descriptors, views, MVACs, and finally, GWS descriptions), based on the semantic relations between the basic MVAC elements that compose them. This inference problem is formulated as the problem of verifying a set of logical rules, which express the conditions for a semantic relation between two complex MVAC elements to be true. A semantic mapping rule consists of a mapping rule antecedent and a mapping rule consequent. The consequent is a semantic relation between two complex MVAC elements, and the antecedent is a conjunction and/or disjunction of conditions on semantic relations between basic MVAC elements that must hold for the consequent to be verified. For example, the condition for equivalence between two spatial properties x and y is:

p(x) ∧ p(y) ∧ name(x, np1) ∧ name(y, np2) ∧ range(x, rp1) ∧ range(y, rp2) ∧ spatial_descriptors(x, sd1) ∧ spatial_descriptors(y, sd2) ∧ equivalent(np1, np2) ∧ equivalent(rp1, rp2) ∧ equivalent(sd1, sd2) ⇒ equivalent(x, y)
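As an illustration, the following Java sketch checks this equivalence rule for two spatial properties; the class names, and the reduction of the basic-element equivalences to a simple set of known pairs, are our own simplifications and not part of the G-MAP implementation.

```java
import java.util.*;

// Illustrative check of the mapping rule for equivalence between two spatial properties:
// their names, ranges and spatial descriptors must all be pairwise equivalent.
public class SpatialPropertyEquivalenceSketch {

    record SpatialProperty(String name, Set<String> range, Set<String> spatialDescriptors) {}

    // Basic-element semantic relations produced by the Basic Matching step,
    // stored here as unordered pairs known to be equivalent.
    static boolean equivalentBasic(Set<Set<String>> equivalences, String a, String b) {
        return a.equals(b) || equivalences.contains(Set.of(a, b));
    }

    static boolean equivalentSets(Set<Set<String>> eq, Set<String> s1, Set<String> s2) {
        // Every element of s1 must have an equivalent in s2, and vice versa.
        return s1.stream().allMatch(x -> s2.stream().anyMatch(y -> equivalentBasic(eq, x, y)))
            && s2.stream().allMatch(y -> s1.stream().anyMatch(x -> equivalentBasic(eq, x, y)));
    }

    static boolean equivalent(Set<Set<String>> eq, SpatialProperty x, SpatialProperty y) {
        return equivalentBasic(eq, x.name(), y.name())
            && equivalentSets(eq, x.range(), y.range())
            && equivalentSets(eq, x.spatialDescriptors(), y.spatialDescriptors());
    }

    public static void main(String[] args) {
        // Toy basic equivalence discovered lexically (e.g., via a global ontology).
        Set<Set<String>> eq = Set.of(Set.of("elevation", "altitude"));
        SpatialProperty p1 = new SpatialProperty("elevation", Set.of("low", "high"), Set.of("surface"));
        SpatialProperty p2 = new SpatialProperty("altitude", Set.of("low", "high"), Set.of("surface"));
        System.out.println(equivalent(eq, p1, p2));   // true under the assumed basic equivalences
    }
}
```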
We have created the rules that compose the Mapping Rule Base using set theory. The general principle is that two MVACs overlap if, according to their definitions, they can share a common set of instances. A concept's feature (e.g., a property) is seen as a very simple concept with only one property; therefore, for example, two properties overlap if their names and their ranges (sets of values) are not semantically disjoint. First, the semantic relations between basic MVAC elements are translated into statements that can be compared against the antecedents of the mapping rules. These statements are stored in the Fact Base. The Mapping Inference Engine, which has spatial, thematic and temporal components responsible for matching the corresponding features, matches the facts of the Fact Base against the rules in the Mapping Rule Base. If a rule is verified, the relation stated in its consequent is added to the Fact Base as a new statement. The engine then proceeds to the next rule until no rule remains in the Mapping Rule Base. Note that the mapping of spatial and temporal properties depends on the mapping of spatial and temporal descriptors; therefore, spatial and temporal descriptors are mapped prior to properties. The contribution of the Complex Mapping Inference Engine is its ability to compare concepts whose structure is more complex than those handled by existing semantic mapping approaches that produce semantic relations. For example, [18] and [11] consider only hierarchical relations between concepts (e.g., is-a), whereas we have developed mapping rules that take as input any kind of relation, by comparing their names and ranges while preserving their structure. Also, G-MAP takes as input properties that are enriched with descriptors, a capability that does not exist in previous systems.

4.3 Augmented Mapping Inference Engine

The contribution of the Augmented Mapping Inference Engine is to exploit the dependencies in order to discover missing mappings between MVAC elements. For example, consider the two properties depth of watercourse and water level. It is probable that no external resource, such as a lexicon, can help to discover that they represent the same property. However, if we discover that they participate in similar dependencies, we can infer that they may represent the same property. The Augmented Mapping Inference Engine extracts the dependencies from the MVAC GWS descriptions. In parallel, the system extracts from the Fact Base the non-equivalent pairs of MVAC elements. We assume that the semantic relation between those elements could be wrong, because implicit information (contained in the dependencies) that was not considered could modify the result. The dependencies of different MVACs are then matched, considering the mismatching elements as non-disjoint. If, under this assumption, the dependencies match, the previously mismatching elements are presented to the user as a new match. For example, consider the dependencies d1: depth(floodedLand, high) → status(floodedLand, navigable) and d2: water level(floodplain, high) → status(floodplain, navigable), with the semantic relation equivalent(floodedLand, floodplain). If we make the assumption equivalent(depth, water level), we find that d1 and d2 are equivalent, and conclude that equivalent(depth, water level) was an implicit mapping. The final augmented mappings are displayed to the user, who can then select the geospatial web service that best matches the query, based on the computed semantic relations.
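The following Java sketch illustrates this dependency-based heuristic on the example above; the data structures and names are our own simplification of the mechanism, not the actual G-MAP code.

```java
import java.util.*;

// Illustrative sketch of the dependency-based augmentation step: tentatively assume two
// mismatched elements are equivalent and test whether the dependencies they participate in
// then match.
public class AugmentedMappingSketch {

    // A dependency head -> body, written as feature(concept, value) atoms.
    record Atom(String feature, String concept, String value) {}
    record Dependency(Atom head, Atom body) {}

    // Known equivalences between basic elements (from earlier steps), as unordered pairs.
    static boolean equiv(Set<Set<String>> eq, String a, String b) {
        return a.equals(b) || eq.contains(Set.of(a, b));
    }

    static boolean atomsMatch(Set<Set<String>> eq, Atom a, Atom b) {
        return equiv(eq, a.feature(), b.feature())
            && equiv(eq, a.concept(), b.concept())
            && equiv(eq, a.value(), b.value());
    }

    // Does assuming equivalent(x, y) make dependencies d1 and d2 match?
    static boolean matchUnderAssumption(Set<Set<String>> eq, String x, String y,
                                        Dependency d1, Dependency d2) {
        Set<Set<String>> assumed = new HashSet<>(eq);
        assumed.add(Set.of(x, y));                      // the tentative mapping
        return atomsMatch(assumed, d1.head(), d2.head())
            && atomsMatch(assumed, d1.body(), d2.body());
    }

    public static void main(String[] args) {
        // d1: depth(floodedLand, high) -> status(floodedLand, navigable)
        // d2: water level(floodplain, high) -> status(floodplain, navigable)
        Dependency d1 = new Dependency(new Atom("depth", "floodedLand", "high"),
                                       new Atom("status", "floodedLand", "navigable"));
        Dependency d2 = new Dependency(new Atom("water level", "floodplain", "high"),
                                       new Atom("status", "floodplain", "navigable"));
        Set<Set<String>> known = Set.of(Set.of("floodedLand", "floodplain"));

        // The assumption equivalent(depth, water level) makes d1 and d2 match,
        // so (depth, water level) is proposed as an implicit mapping.
        System.out.println(matchUnderAssumption(known, "depth", "water level", d1, d2));  // true
    }
}
```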
5 Implementation of Our Approach

To demonstrate the feasibility of our approach, we implemented it in Java and used OWL descriptions of GWS. We show a scenario where an expert user responsible for flood management searches for flood risk zones in Canada. The expert specifies that the zones returned by the required service should have an elevation of 4 meters or less to be considered flood risk zones. The expert's request is formulated as a GWS description, based on the vocabulary of the expert's ontology, and shown in OWL abstract syntax:

Class(input complete restriction(is-A someValuesFrom(GML: surface)))
Class(pre-condition complete restriction(part-of someValuesFrom(Canada)))
Class(function complete restriction(is-A someValuesFrom(FindFloodRiskZone)) restriction(Before someValuesFrom(Storm)))
Class(output complete restriction(is-A someValuesFrom(GML: surface)) restriction(Elevation someValuesFrom(<=4m)))
Class(post-condition complete restriction(hasSpatialAccuracy (10meters)))
The following dependency augments the GWS query description: Elevation (GML: surface, low) → risk (GML: surface, high). Figure 2 shows the interface of the G-MAP system, where the user can formulate the query, the system proposes new matches based on dependencies, and the matching results are displayed as a tree.
Fig. 2. G-MAP System showing query, augmentation impact and mapping results
Table 1 shows an example of an augmented multi-view mapping result where the semantic relation between the requested service description and a given GWS1 description depends on the context. In the first context (view 1), the flood prevention context, the risk zones computed by the service are the surfaces where the ground level is 3 meters or less. Table 1 shows the dependency that allows the system to identify that Ground Level is the same property as Elevation in the query. This illustrates how dependencies can improve semantic interoperability: without them, the mapping system would have returned that, in view 1, the GWS1 description overlaps the query, when in fact it is included in it. In the second context (view 2), flood disaster response, the query overlaps the GWS1 description, since there a risk zone is defined by adjacency with a flooded land. The other example is a semantic mapping with a second GWS description, GWS2. The inferred semantic relation is that the query is included in GWS2. This relation means that GWS2 provides a more general service (since the function LocalisationOfRiskArea is more general than FindFloodRiskZone). G-MAP was able to infer this relation because the Basic Matching infers that FloodRisk is more specific than Risk. In addition, G-MAP processes temporal relations with the global time ontology to infer that before is equivalent to previous to. However, since disaster includes storm, the temporal relation before(storm) in the query is included in the temporal relation previous to(disaster).
Table 1. Sample of G-MAP augmented semantic mapping results

View1 of GWS1
Semantic relation: Query Includes View1 of GWS1
Context: FloodPrevention_Context
Function: LocalisationOfFloodRiskZone, with temporal relation: before someValuesFrom(rainstorm)
Input: GML: surface
Output: GML: surface with spatial descriptor: GroundLevel someValuesFrom(<=3m)
Preconditions: part-of someValuesFrom(Ontario)
Postconditions: hasSpatialAccuracy (4meters)
Dependencies: GroundLevel(GML: surface, low) → riskLevel(GML: surface, high)

View2 of GWS1
Semantic relation: Query Overlaps View2 of GWS1
Context: floodDisasterResponse_Context
Function: LocalisationOfFloodRiskZone, with temporal relation: after someValuesFrom(rainstorm)
Input: GML: surface
Output: GML: surface with spatial descriptor: AdjacentTo someValuesFrom(FloodedLand)
Preconditions: part-of someValuesFrom(Ontario)
Postconditions: hasSpatialAccuracy (4meters)
Dependencies: WaterLevel(GML: surface, >2m) → status(GML: surface, navigable)

GWS2
Semantic relation: Query Included In GWS2
Context: DisasterMonitoring_Context
Function: LocalisationOfRiskArea, with temporal relation: previous to (disaster)
Input: GML: surface ⊔ GML: point
Output: GML: surface
Preconditions: part-of someValuesFrom(Canada)
Postconditions: hasSpatialAccuracy (100meters)
Dependencies: DisasterFrequency(riskArea, high) → risk(riskArea, high)
Compared with the subsumption-based approaches described in the related work, one of the advantages of G-MAP is that it produces more expressive results. Since the semantic relations are not only subsumption relations, our approach is able to retrieve more relevant services, and the user can choose among overlapping services, more general services, or more specific services. G-MAP also supports the user in identifying the differences in meaning, in particular with respect to the semantics of the spatiotemporal aspects of the services' descriptions, and thus helps avoid misinterpretation of the data produced by a service.
6 Conclusion and Perspectives

In this paper, we have focused on some issues related to the semantic interoperability of geospatial web services (GWS), namely the semantic description of GWS and the semantic mappings between GWS descriptions. We have proposed a new GWS description based on a multi-view augmented concept model, where the parameters of the GWS depend on the context and can be augmented with dependencies. The role of the latter is to improve, with the G-MAP augmented semantic mapping system we have proposed, the discovery of implicit semantic mappings between GWS descriptions. For this, the G-MAP tool we have developed implements three main reasoning engines, including one that exploits the structural correspondence between the dependencies of different GWS descriptions. G-MAP also considers separately the spatial, temporal and thematic aspects of GWS descriptions. The vision we have put forward in this paper is that a sound knowledge representation is fundamental for improving semantic interoperability between GWS. While the prototype is useful to support users in discovering and semantically interoperating GWS that respond to their requirements, an open issue is how a service query issued by a user in a network of GWS can be propagated to the relevant services. In the near future, we intend to investigate how G-MAP can support this task.

Acknowledgements. This research was made possible by an operating grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).
References 1. Agarwal, P.: Ontological Considerations in GIScience. International. Journal of Geographical Information Science 19(5), 501–536 (2005) 2. Bai, Y., Di, L., Wei, Y.: A Taxonomy of Geospatial Services for Global Service Discovery and Interoperability. Computers & Geosciences 35(4), 783–790 (2009) 3. Bakillah, M., Mostafavi, M.A., Brodeur, J.: Semantic Augmentation of Geospatial Concepts: the Multi-View Augmented Concept to Improve Semantic Interoperability Between Multiple Geospatial Databases. In: Joint International Conference on Theory, Data Handling and Modelling in GeoSpatial Information Science, Hong Kong, May 26-28 (2010) 4. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American, Mayo (2001) 5. Brisaboa, N.R., Fariña, A., Luaces, M.R., Trillo, D., Viqueira, J.R.: Definition and Implementation of an Active Web Map Service. In: Lecture Notes in Geoinformation and Cartography, pp. 231–245. Springer, Heidelberg (2007) 6. Brodeur, J., Bédard, Y., Edwards, G., Moulin, B.: Revisiting the Concept of Geospatial Data Interoperability Within the Scope of Human Communication Process. Transactions in GIS 7(2), 243–265 (2003) 7. Cruz, I.F., Sunna, W.: Structural Alignment Methods with Applications to Geospatial Ontologies. Transactions in GIS 12(6), 683–711 (2008) 8. Egenhofer, M.J.: Toward the Semantic Geospatial Web. In: GIS 2002, Virginia, USA (2002) 9. Fenza, G., Loai, V., Senatore, S.: A Hybrid Approach to Semantic Web Services Matchmaking. International. Journal of Approximate reasoning 48, 808–828 (2008)
10. Fonseca, F., Egenhofer, M., Davis, C., Câmara, G.: Semantic Granularity in OntologyDriven Geographic Information Systems. AMAI Annals of Mathematics and Artificial Intelligence - Special Issue on Spatial and Temporal Granularity 36(1-2), 121–151 (2002) 11. Giuchiglia, F., Shvaiko, P., Yatskevich, M.: S-Match: An Algorithm and an Implementation of Semantic Matching. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 61–75. Springer, Heidelberg (2004) 12. Gruber, T.R.: A Translation Approach to Portable Ontology Specification. Stanford, California, Knowledge Systems Laboratory Technical Report KSL 92-71 (1993) 13. Janowicz, K.: Similarity-Based Retrieval for Geospatial Semantic Web Services Specified Using the Web Service Modeling Language (WSML-Core). In: Int. Semantic Web Conference ISWC 2006 Workshop, Athens, Georgia, USA (2006) 14. Lutz, M., Klien, E.: Ontology-based Retrieval of Geographic Information. International Journal of Geographical Information Science 20, 233–260 (2006) 15. Lutz, M., Riedemann, C., Probst, F.: A Classification Framework for Approaches to Achieving Semantic Interoperability Between GI Web Services. In: Kuhn, W., Worboys, M.F., Timpf, S. (eds.) COSIT 2003. LNCS, vol. 2825, pp. 186–203. Springer, Heidelberg (2003) 16. Nagarajan, M., Verma, K., Sheth, A.P., Miller, J., Lathem, J.: Semantic Interoperability of Web Services – Challenges and Experiences. In: Proc. of the IEEE Int. Conference on Web Services, pp. 373–382 (2006) 17. Schiel, U., de Souza Baptista, C., de Jesus Maia, A.C., Gomes de Andrade, F.: SEI-Tur: A System Based on Composed Web-Service Discovery to Support the Creation of Trip Plans. In: First International Conference on the Digital Society (ICDS 2007), vol. 28 (2007) 18. Serafini, L., Bouquet, P., Magnini, B., Zanobini, S.: An Algorithm for Matching Contextualized Schemas Via SAT. Technical Report # 0301−06, Istituto Trentino di Cultura, Trento, Italy (2003) 19. Sotnykova, A., Vangenot, C., Cullot, N., Bennacer, N., Aufaure, M.-A.: Semantic Mappings in Description Logics for Spatio-Temporal Database Schema Integration. In: Spaccapietra, S., Zimányi, E. (eds.) Journal on Data Semantics III. LNCS, vol. 3534, pp. 143– 167. Springer, Heidelberg (2005) 20. Tolosana-Calasanz, R., Nogueras-Iso, J., Béjar, R., Muro-Medrano, P.R., Zarazaga-Soria, F.J.: Semantic Interoperability Based on Dublin Core Hierarchical One-to-One Mappings. Int. J. Metadata, Semantics and Ontologies 1(3), 183–188 (2006) 21. Vaccari, L., Shvaiko, P., Marchese, M.: A Geo-Service Semantic Integration in Spatial Data Infrastructures. Int. Journal of Spatial Data Infrastructures Research 4, 24–51 (2009) 22. Wiegand, N., Garcia, C.: A Task-based Ontology Approach to Automate Geospatial Data Retrieval. Transactions in GIS 11(3), 355–376 (2007) 23. Zhang, L., Wang, K., Pan, Z., Wang, Q., Zheng, H.: Ontology-Driven Discovering Model for Geographical Information Services. Geospatial Information Science 13(1), 24–31 (2010)
MGsP: Extending the GsP to Support Semantic Interoperability of Geospatial Datacubes Tarek Sboui1,2 and Yvan Bédard1,2 1
Department of Geomatic Sciences and Centre for Research in Geomatics, Université Laval, Quebec, Qc, G1K 7P4, Canada 2 NSERC Industrial Research Chair in Geospatial Databases for Decision-Support
[email protected],
[email protected]
Abstract. Data warehouses are considered substantial elements of decision support systems. They are usually structured according to the multidimensional paradigm, i.e., as datacubes. Geospatial datacubes contain geospatial components that allow geospatial visualization and aggregation. However, the simultaneous use of multiple geospatial datacubes, which may be heterogeneous in design or content, requires considering interoperability between them. Overcoming heterogeneity problems has been the principal aim of several research works over the last fifteen years. Among these works, the geosemantic proximity notion (GsP) represents a qualitative approach to measure the semantic similarity between geospatial concepts. The GsP was defined in the transactional context; although it can be used to a certain extent in the multidimensional paradigm, it needs to be revisited to be more suitable for this paradigm. This paper proposes an extension of the GsP notion in order to support the semantic interoperability of multidimensional geospatial datacubes. The extension, called MGsP, aims to make it possible to dig into and resolve semantic heterogeneities related to key notions of the multidimensional paradigm. Keywords: Geospatial datacubes, interoperability, semantic heterogeneity, ontology.
1 Introduction

Over the last decades, there has been an exponential increase in the amount of data being stored electronically and available from multiple sources. Furthermore, there have been significant innovations in information technology, especially in database technologies, decision support systems (DSS), knowledge discovery, and automatic communication between information systems. Data warehouses are considered efficient components of decision support systems [6] and [2]. Data warehouses are databases designed to supply DSS with data at different levels of aggregation. They are often structured according to the multidimensional paradigm, which facilitates rapid navigation within the different levels of data granularity (from a coarser level to a finer level and vice versa). As such, users can rapidly get a global picture of a phenomenon, get more insight into its detailed information,
compare with other phenomena, or analyse its evolution over time [2]. Thus, using the multidimensional paradigm, data warehouses provide the basis for decision making and problem solving in organizations. Multidimensional databases, hereafter called datacubes, allow users to navigate aggregated data according to a set of dimensions with different levels of hierarchy [6], [2], and [12]. Geospatial datacubes contain geospatial components that allow geospatial visualization and aggregation, and they are becoming more widely used in the geographic field [2] and [12].

One may need to use several scattered geospatial datacubes at the same time. For example, users may need to simultaneously navigate through different geospatial datacubes, to create a new geospatial datacube from existing scattered ones, or to insert information in one geospatial datacube from the content of another one. For instance, in order to analyze the risk of the West Nile virus to the population of Canada and the USA, we may need to use two datacubes containing the locations of dead birds in south-eastern Canada and in north-eastern USA, respectively. Interoperability has been widely recognized as an efficient paradigm for simultaneously (re)using heterogeneous systems by facilitating an efficient exchange of information [3], [4], and [8]. It deals with heterogeneity of different kinds (e.g., technical, organizational, and semantic heterogeneities). An example of semantic heterogeneity is the fact that the concept forest may be represented, with different geometries, as vegetation, as trees, or as wooded areas. The resolution of semantic heterogeneity is considered a significant challenge for interoperability [8] and [4]. Such a resolution basically consists of comparing different concepts and measuring the semantic similarity between them. Many researchers have been interested in measuring the semantic similarity between geospatial concepts, and different solutions have been proposed [3], [13], [4], [11], and [10]. Among these solutions, the geosemantic proximity notion (GsP), proposed by [4], allows a qualitative evaluation of the semantic similarity of geospatial concepts. While the GsP notion can be used to a certain extent to support the interoperability between geospatial datacubes, the efficiency of such interoperability can be improved by extending this notion. This paper revisits the GsP notion and proposes an extension to it in order to offer more suitable support for the interoperability between geospatial datacubes. In the next section, we review the interoperability between geospatial datacubes. In Section 3, we review the GsP notion. Then, in Section 4, we propose an extension of the GsP notion. We conclude and present further work in Section 5.
2 Semantic Interoperability between Geospatial Datacubes

2.1 Geospatial Datacubes

Data warehouses are being considered an integral part of modern decision support systems [6], [2], and [12]. They are designed to supply these systems with data at different levels of aggregation. Data warehouses may be structured as datacubes, i.e. according to the multidimensional paradigm. A datacube is composed of a set of measures aggregated according to a set of dimensions with different levels of granularity.
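As a toy illustration of this structure, the following Java sketch describes a minimal datacube schema; the class names are ours, and the example cube is only loosely inspired by the dead-bird scenario of the introduction.

```java
import java.util.*;

// Minimal, illustrative sketch of a multidimensional schema: a measure aggregated
// along dimensions that each carry a hierarchy of granularity levels.
public class DatacubeSchemaSketch {

    record Dimension(String name, List<String> hierarchyLevels) {}   // finest level first
    record Measure(String name, String aggregationFunction) {}
    record DatacubeSchema(String name, List<Dimension> dimensions, List<Measure> measures) {}

    public static void main(String[] args) {
        DatacubeSchema cube = new DatacubeSchema(
            "DeadBirdObservations",
            List.of(new Dimension("Location", List.of("municipality", "province", "country")),
                    new Dimension("Time", List.of("day", "month", "year"))),
            List.of(new Measure("numberOfDeadBirds", "sum")));
        System.out.println(cube);
    }
}
```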
Geospatial datacubes integrate geospatial data with the datacube structure. Both the dimensions and the measures of a geospatial datacube may contain geospatial data [2]. Geospatial datacubes support the user's mental model of the data and help him/her to make strategic decisions [1] and [2]. In fact, they allow decision makers to interactively navigate through different levels of granularity, so that they can get a global picture of a phenomenon and more insight into its detailed information. Moreover, geospatial datacubes contain geospatial data (e.g., geographic coordinates, map coordinates), which allows the visualization of phenomena and hence helps to extract insights useful for understanding these phenomena [1].

2.2 Interoperating Geospatial Datacubes

Interoperability has been generally defined as the ability of heterogeneous systems to communicate and exchange information and applications in an accurate and effective manner [3], [4], and [9]. Geospatial interoperability is considered here as the ability of information systems to a) communicate all kinds of spatial information about the Earth and about the objects and phenomena on, above, and below its surface, and b) cooperatively run applications capable of manipulating such information [14]. Semantic interoperability aims to provide a mutual understanding of different data representations. For geospatial information systems, we include considerations about an object's geometry at the semantic level, since geometry is not inherent to objects but defined according to the needs of a given application. In previous work, we discussed the need for interoperating geospatial datacubes, proposed a definition of the semantic interoperability between geospatial datacubes, and proposed a categorization of the semantic heterogeneity that may occur during such interoperability [16]. The categorization includes Cube-to-Cube heterogeneity, Fact-to-Fact heterogeneity, Measure-to-Measure heterogeneity, and Dimension-to-Dimension heterogeneity, which involves hierarchy heterogeneity and level heterogeneity. In each of these categories, semantic heterogeneity may be due to differences in the description of concepts (e.g., the concept forest may be represented, with different geometries, as vegetation, as trees, or as wooded areas) and of datacube schemas. Normally, resolving the semantic heterogeneity of different concepts is done through a comparison of their semantics. This is usually done by reconciling two or more heterogeneous ontologies, which can be carried out by mapping, aligning or merging these ontologies [7]. In the context of geographic databases, many researchers have been interested in measuring the semantic similarity between geospatial concepts to support the interoperability process. Examples of such research works are: the Semantic Formal Data Structure model [3], the Matching Distance model [13], the semantic matchmaking for geographic information retrieval [11], the geosemantic proximity notion (GsP) [4], and the similarity-based information retrieval approach [10]. In order to support the semantic interoperability between geospatial datacubes, and after reviewing these works, we chose the GsP notion, which allows a qualitative evaluation of the semantic similarity between geospatial concepts (i.e., the similarities between their intrinsic and extrinsic properties).
This choice is explained by the facts that 1) the GsP was successfully tested for supporting the interoperability process between software agents in a geospatial context, 2) the GsP is based on human-like communication, which we believe is the ideal paradigm for the interoperability process, and 3) the source code from previous work conducted in our research team is available [5] and can be adapted to the interoperability of geospatial datacubes.
3 Revisiting the GsP Notion

GsP qualitatively evaluates the semantic similarity between geospatial concepts. It compares the inherent properties of one concept with those of another. These properties are classified in two types: intrinsic and extrinsic. Intrinsic properties provide the literal meaning of a concept; they consist of the identification, the attributes, the attribute values, the geometries, the temporalities, and the domain of the concept. Extrinsic properties are properties that are subject to external factors (e.g., behaviours and relationships). The semantics of a geospatial concept is defined by the union of its intrinsic and extrinsic properties. Then, the GsP of two concepts can be defined by the intersection of their respective properties. It results in a four-intersection matrix when consolidated with intrinsic and extrinsic properties [4]. Each component of the matrix can be evaluated as empty (denoted by f, or false) or not empty (denoted by t, or true). Accordingly, 16 predicates were derived: GsP_ffff (or disjoint), GsP_ffft, GsP_fftt (or contains), GsP_tfft (or equal), GsP_ftft (or inside), GsP_tftt (or covers), GsP_ttft (or coveredBy), GsP_fttt (or overlap), GsP_tttt, GsP_tfff (or meet), GsP_tftf, GsP_tttf, GsP_ttff, GsP_fttf, GsP_fftf, GsP_ftff [4]. In order to experiment with the GsP notion, Brodeur et al. developed the GsP tool, which imitates human communication to support geospatial interoperability [5]. It depicts a communication process that takes place between two software agents interacting through a communication channel. In GsP, the software agents (a source and a destination) exchange concept representations. In order to resolve the semantic heterogeneity between a source concept and a destination concept, the intrinsic and extrinsic properties of the respective representations are compared. The comparison proceeds until two equal concepts ("GsP_tfft") are found or all concepts have been visited. When the comparison is completed, the concepts having a GsP different from "GsP_ffff" (the disjoint relationship) are sorted from the highest to the lowest GsP [4].
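The following Java sketch illustrates how such a 4-intersection predicate could be derived from the intrinsic and extrinsic property sets of two concepts; the string encoding of properties and the order in which the four cells are concatenated into the predicate name are assumptions of this sketch, not taken from the GsP implementation.

```java
import java.util.*;

// Illustrative computation of a GsP-style 4-intersection predicate from the intrinsic (InP)
// and extrinsic (ExP) property sets of two concepts.
public class GspPredicateSketch {

    static boolean intersects(Set<String> a, Set<String> b) {
        return a.stream().anyMatch(b::contains);          // non-empty intersection?
    }

    // Cells assumed in the order: InP1∩InP2, InP1∩ExP2, ExP1∩InP2, ExP1∩ExP2.
    static String gsp(Set<String> inP1, Set<String> exP1, Set<String> inP2, Set<String> exP2) {
        boolean[] cells = {
            intersects(inP1, inP2),
            intersects(inP1, exP2),
            intersects(exP1, inP2),
            intersects(exP1, exP2)
        };
        StringBuilder sb = new StringBuilder("GsP_");
        for (boolean c : cells) sb.append(c ? 't' : 'f');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two concepts sharing intrinsic properties only with intrinsic ones, and extrinsic with extrinsic.
        Set<String> inP1 = Set.of("name:lake", "geometry:polygon");
        Set<String> exP1 = Set.of("relation:flowsInto(river)");
        Set<String> inP2 = Set.of("name:lake", "geometry:polygon");
        Set<String> exP2 = Set.of("relation:flowsInto(river)");
        System.out.println(gsp(inP1, exP1, inP2, exP2));   // GsP_tfft, i.e. "equal"
    }
}
```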
4 Extending the GsP to Support Semantic Interoperability between Geospatial Datacubes

The hierarchical structure of dimensions and the dependencies between dimensions and measures induce several semantic conflicts specific to the multidimensional datacube. Notably, the semantic heterogeneity of the aggregation of dimension levels, the semantic heterogeneity of the measure function, and the semantic heterogeneity of hyper-cells (a hyper-cell is a combination of a set of levels and measures of a datacube [15]) present a particular obstacle when interoperating different geospatial datacubes. Thus, we intend to enable agents (software agents or human stakeholders) to focus on resolving the semantic heterogeneities related to those particular concepts. For that, we propose an extension of the GsP notion to include comparisons of basic multidimensional concepts such as the semantics of aggregation and the semantics of the hyper-cell. The objective of this extension (called Multidimensional Geosemantic Proximity: MGsP) is to give agents the possibility to focus on the heterogeneity of multidimensional data by digging into more detail about the semantic aspects of important notions of the multidimensional paradigm (e.g., aggregation, measure function, and hyper-cell). As such, agents can concentrate on the multidimensional characteristics and make appropriate decisions with regard to their semantic similarity.

Accordingly, we define three attributes to specialize the GsP: dimension aggregation, measure function, and hyper-cell. We should note that we chose these attributes as examples to illustrate the usefulness of the GsP extension for the interoperability between geospatial datacubes; this choice is motivated by the wide use of these attributes in the multidimensional paradigm, and other attributes can be added if needed. As in GsP, our methodology for qualitatively evaluating the semantic similarity consists of identifying the relations between the semantics of the multidimensional elements of geospatial datacubes (e.g., dimensions or measures). The semantics of each multidimensional element is evaluated as the union of the properties related to the measure function (or dimension aggregation) and the properties related to the hyper-cell. Let:

M: a measure
D: a dimension
MInP: a set of intrinsic multidimensional properties (for a measure, MInP = MInPM; for a dimension, MInP = MInPD), where MInPM is the set of properties related to the measure function (the function is considered an intrinsic property since it refers to the meaning of the measure) and MInPD is the set of properties related to the aggregation (the aggregation is considered an intrinsic property since it refers to the meaning of the dimension)
MExP: the set of properties related to the hyper-cell; the hyper-cell refers to the dependencies of measures with dimensions, and is thus considered an extrinsic property for both dimensions and measures
MSM: the multidimensional semantics of a measure
MSD: the multidimensional semantics of a dimension

Then:

MSM = MInPM ∪ MExP
MSD = MInPD ∪ MExP
Then, the multidimensional geosemantic proximity (MGsP) is determined according to the intersection between the semantics of two elements (E1 and E2) of heterogeneous datacubes. Let:

MSE1: the multidimensional semantics of E1
MSE2: the multidimensional semantics of E2
MGsP(E1, E2): the multidimensional geosemantic proximity between E1 and E2

Then:

MGsP(E1, E2) = MSE1 ∩ MSE2

Accordingly, we define a 4-intersection matrix containing the following four sub-relations. In this matrix, MInPE ⊆ InPE (the properties related to the measure function, or to the aggregation, belong to the intrinsic properties defined in GsP) and MExPE ⊆ ExPE (the properties related to the hyper-cell belong to the extrinsic properties defined in GsP). Thus, the MGsP matrix is a specialization of the one defined in the GsP, allowing agents to dig into more details of the multidimensional characteristics of geospatial datacubes (see Figure 1).
MGsP matrix (rows MExPE1, MInPE1; columns MExPE2, MInPE2):

  MExPE1 ∩ MExPE2    MExPE1 ∩ MInPE2
  MInPE1 ∩ MExPE2    MInPE1 ∩ MInPE2

which specializes the GsP matrix (rows ExPE1, InPE1; columns ExPE2, InPE2):

  ExPE1 ∩ ExPE2    ExPE1 ∩ InPE2
  InPE1 ∩ ExPE2    InPE1 ∩ InPE2

Fig. 1. 4-intersection multidimensional matrix as a specialization of the GsP
Since we consider the measure function as an attribute of the measure's intrinsic properties, and the hyper-cell as an attribute of its extrinsic properties, we define the following 4-intersection matrix for measures:
4-intersection matrix for measures (rows hyper_cellM1, measure_functionM1; columns hyper_cellM2, measure_functionM2):

  hyper_cellM1 ∩ hyper_cellM2          hyper_cellM1 ∩ measure_functionM2
  measure_functionM1 ∩ hyper_cellM2    measure_functionM1 ∩ measure_functionM2
Since we consider the aggregation as an attribute of the dimension's intrinsic properties, and the hyper-cell as an attribute of its extrinsic properties, we define the following 4-intersection matrix for dimensions:
4-intersection matrix for dimensions (rows hyper_cellD1, aggregationD1; columns hyper_cellD2, aggregationD2):

  hyper_cellD1 ∩ hyper_cellD2      hyper_cellD1 ∩ aggregationD2
  aggregationD1 ∩ hyper_cellD2     aggregationD1 ∩ aggregationD2
As in GsP, each of the four intersections between the properties of two elements (measures or dimensions) of heterogeneous datacubes can be evaluated as empty (Ø, or f) or non-empty (¬Ø, or t), expressing respectively that no property or that some properties are common. This leads to 16 (i.e., 2^4) possible MGsP predicates for each matrix (see Figure 2). If, for example, hyper_cellM1 ∩ measure_functionM2 is ¬Ø (respectively Ø), it indicates that the measure function of M2 fits (respectively does not fit) the hyper-cell of M1; in other words, the measure function of M2 can be applied (respectively cannot be applied) to the set of measures to which M1 belongs. For example, the function geometric union of the measure fire buffer can be applied to the hyper-cell {fire zone}, {region, time, forest stand} of the measure fire zone. If, for example, hyper_cellD1 ∩ hyper_cellD2 is ¬Ø (respectively Ø), it indicates that the aggregation of D2 fits (respectively does not fit) the hyper-cell of D1; in other words, both sets of levels and measures (i.e., hyper-cells) have (respectively do not have) common elements. For example, the hyper-cells {fire zone}, {region, time, fire class} and {fire zone}, {region, time, forest stand} have three common elements: {fire zone, region, time}.
Fig. 2. 16 possible MGsP predicates of the GsP
In Figure 2, the MGsP predicates are organized into four distinct categories according to four characteristics: common MExP and common MInP, common MExP and no common MInP, no common MExP and no common MInP, and no common MExP and common MInP.

1) The predicates of category 1 refer to the case where both the functions and the hyper-cells of the heterogeneous measures have common elements. This category includes four possible matrices:
a. MGsP_tttt (E1, E2): the function of M1 (respectively M2) fits the hyper-cell of M2 (respectively M1).
b. MGsP_tftt (E1, E2): the function of M1 fits the hyper-cell of M2, but the function of M2 does not fit the hyper-cell of M1.
c. MGsP_ttft (E1, E2): the function of M1 does not fit the hyper-cell of M2, but the function of M2 fits the hyper-cell of M1.
d. MGsP_tfft (E1, E2): the function of M1 (respectively M2) does not fit the hyper-cell of M2 (respectively M1).

2) The predicates of category 2 refer to the case where the hyper-cells of the heterogeneous measures have common elements, whereas the functions are dissimilar. This category also includes four possible matrices:
a. MGsP_tttf (E1, E2): the function of M1 (respectively M2) fits the hyper-cell of M2 (respectively M1).
b. MGsP_tftf (E1, E2): the function of M1 fits the hyper-cell of M2, but the function of M2 does not fit the hyper-cell of M1.
c. MGsP_ttff (E1, E2): the function of M1 does not fit the hyper-cell of M2, but the function of M2 fits the hyper-cell of M1.
d. MGsP_tfff (E1, E2): the function of M1 (respectively M2) does not fit the hyper-cell of M2 (respectively M1).
3) The predicates of category 3 refer to the case where both the functions and the hyper-cells of the heterogeneous measures are dissimilar. We should note that in this case the function of M1 (respectively M2) should not fit the hyper-cell of M2 (respectively M1); accordingly, we do not consider the matrices m10, m11 and m12 of Figure 2, and the only predicate retained is MGsP_ffff (E1, E2).

4) The predicates of category 4 refer to the case where the hyper-cells of the heterogeneous measures are dissimilar, whereas the functions are similar. We should note that, in this case, since there is no intersection between the hyper-cells, the function of M1 (respectively M2) should not fit the hyper-cell of M2 (respectively M1); accordingly, we do not consider the matrices m14, m15 and m16 of Figure 2, and the only predicate retained is MGsP_ffft (E1, E2).

Ten resulting predicates are then defined for the MGsP of measures: MGsP_tttt, MGsP_tftt, MGsP_ttft, MGsP_tfft, MGsP_tttf, MGsP_tftf, MGsP_ttff, MGsP_tfff, MGsP_ffff, and MGsP_ffft. Similarly, we define the predicates for the MGsP of dimensions; the resulting predicates are the same as those for the measure element. Using such attributes (e.g., hyper-cell, dimension aggregation, and measure function), agents can get a better idea of the semantic similarity of multidimensional concepts and can make appropriate decisions about resolving the semantic heterogeneity that may occur between the elements of different geospatial datacubes. For example, if the functions of two semantically heterogeneous measures (e.g., density in a datacube C1 and concentration in a datacube C2) are completely different, agents may consider these measures dissimilar even if they have other common characteristics (e.g., used for the same subject of analysis, represented with the same precision, and having the same scale).
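The following Java sketch illustrates, in a highly simplified way, how an MGsP predicate for two measures could be derived; the reduction of the "fits" tests to plain set intersections of string labels, and the cell order used to build the predicate name, are assumptions of this sketch, not the actual MGsP evaluation procedure.

```java
import java.util.*;

// Illustrative computation of an MGsP predicate for two measures, built from the intersections
// of their hyper-cell properties (extrinsic) and measure-function properties (intrinsic).
public class MgspMeasureSketch {

    record MeasureDescription(String name, Set<String> measureFunction, Set<String> hyperCell) {}

    static boolean intersects(Set<String> a, Set<String> b) {
        return a.stream().anyMatch(b::contains);
    }

    // Cells assumed in the order:
    // hyperCell1∩hyperCell2, hyperCell1∩function2, function1∩hyperCell2, function1∩function2.
    static String mgsp(MeasureDescription m1, MeasureDescription m2) {
        boolean[] cells = {
            intersects(m1.hyperCell(), m2.hyperCell()),
            intersects(m1.hyperCell(), m2.measureFunction()),
            intersects(m1.measureFunction(), m2.hyperCell()),
            intersects(m1.measureFunction(), m2.measureFunction())
        };
        StringBuilder sb = new StringBuilder("MGsP_");
        for (boolean c : cells) sb.append(c ? 't' : 'f');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two fire-related measures sharing part of their hyper-cells and the same function.
        MeasureDescription fireZone = new MeasureDescription("fire zone",
            Set.of("geometric union"), Set.of("fire zone", "region", "time", "forest stand"));
        MeasureDescription fireBuffer = new MeasureDescription("fire buffer",
            Set.of("geometric union"), Set.of("fire buffer", "region", "time", "fire class"));
        System.out.println(mgsp(fireZone, fireBuffer));   // MGsP_tfft with these toy property sets
    }
}
```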
5 Conclusion

The resolution of semantic heterogeneity is considered a significant challenge for interoperability. In order to resolve the semantic heterogeneity that may occur when interoperating geospatial datacubes, we have extended the geosemantic proximity approach (GsP), which evaluates the semantic similarity of geospatial concepts. The GsP extension, called Multidimensional Geosemantic Proximity (MGsP), gives the possibility to focus on the heterogeneity of multidimensional data by digging into the details of the semantic characteristics of important notions of the multidimensional paradigm. The MGsP includes the semantics of basic multidimensional concepts such as the semantics of aggregation, the semantics of the measure function and the semantics of the hyper-cell. The MGsP extension was defined within a general research project, which was implemented to manage the risks of data misinterpretation during the semantic interoperability between geospatial datacubes.
Further work is required to enhance the MGsP by defining more refined attributes. For example, for the aggregation attribute we can define the aggregation domain and aggregation constraint. Acknowledgments. We would like to acknowledge the support and comments of Dr. Jean Brodeur for the realization of this work.
References 1. Bédard, Y., Rivest, S., Proulx, M.J.: Spatial on-line analytical processing (SOLAP): concepts, architectures and solutions from a geomatics engineering perspective. In: Koncillia, W.R. (ed.) Data warehouses and OLAP: concepts, architectures and solutions (2005) 2. Bédard, Y., Han, J.: Fundamentals of Spatial Data Warehousing for Geographic Knowledge Discovery. In: Miller, H.J., Han, J. (eds.) Geographic Data Mining and Knowledge Discovery, 2nd edn., Taylor & Francis, Taylor (2008) 3. Bishr, Y.: Overcoming the semantic and other barriers to GIS interoperability. Int. J. Geographical Information science 12, 299–314 (1998) 4. Brodeur, J.: Interopérabilité des données géospatiales: élaboration du concept de proximité géosémantique. Ph.D. Dissertation. Université Laval (2004) 5. Brodeur, J., Bédard, Y., Moulin, B.: A Geosemantic Proximity -Based Prototype for Interoperability of Geospatial Data. Computer Environment and Urban Systems, 669–698 (2005) 6. Codd, E.F., Codd, S.B., Salley, C.T.: Providing OLAP (On-Line Analytical Processing) to User- Analysts: An IT Mandate. Hyperion white papers, p. 20 (1993) 7. Giunchiglia, F., Yatskevich, M.: Element Level Semantic Matching. In: Proceedings of Meaning Coordination and Negotiation Workshop at International Semantic Web Conference (2004) 8. Harvey, F., Kuhn, W., Pundt, H., Bishr, Y., Riedemann, C.: Semantic Interoperability: A Central Issue for Sharing Geographic Information Annals of Regional Science. Special Issue on Geo-spatial Data Sharing and Standardization, 213–232 (1999) 9. ISO/IEC 2382. Information technology – Vocabulary – Part 1: Fundamental terms (1993) 10. Janowicz, K., Wilkes, M., Lutz, M.: Similarity-Based Information Retrieval and Its Role within Spatial Data Infrastructures. Geographic Information Science, 151–167 (2008) 11. Lutz, M., Klien, E.: Ontology-based retrieval of geographic information. International Journal of Geographical Information Science, 233–260 (2006) 12. Malinowski, E., Zimányi, E.: Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications, p. 444. Springer, Heidelberg (2008) 13. Rodriguez, M.A.: Assessing Semantic Similarity Among Entity Classes, Ph.D. Dissertation, University of Maine (2000) 14. Roehrig, J.: Information Interoperability for River Basin Management. J. Technology Resource Management & Development 2 (2002) 15. Salehi, M.: Developing a model and a language to identify and specify the integrity constraints in spatial datacubes, Ph.D. Dissertation, Université Laval (2009) 16. Sboui, T., Bédard, Y., Brodeur, J., Badard, T.: A Conceptual Framework to Support Semantic Interoperability of Geospatial Datacubes. In: Hainaut, J.-L., Rundensteiner, E.A., Kirchberg, M., Bertolotto, M., Brochhausen, M., Chen, Y.-P.P., Cherfi, S.S.-S., Doerr, M., Han, H., Hartmann, S., Parsons, J., Poels, G., Rolland, C., Trujillo, J., Yu, E., Zimányie, E. (eds.) ER Workshops 2007. LNCS, vol. 4802, pp. 378–387. Springer, Heidelberg (2007)
Range Queries over a Compact Representation of Minimum Bounding Rectangles⋆

Nieves R. Brisaboa1, Miguel R. Luaces1, Gonzalo Navarro2, and Diego Seco1

⋆ This work has been partially supported by "Ministerio de Educación y Ciencia" (PGE y FEDER) ref. TIN2009-14560-C03-02, by "Xunta de Galicia" ref. 08SIN009CT, and by Fondecyt Grant 1-080019, Chile.

1 Database Laboratory, University of A Coruña, Campus de Elviña, 15071, A Coruña, Spain {brisaboa,luaces,dseco}@udc.es
2 Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile
[email protected]
Abstract. In this paper we present a compact structure to index semistatic collections of MBRs that solves range queries while keeping a good trade-off between the space needed to store the index and its search efficiency. This is very relevant considering the current sizes and gaps in the memory hierarchy. Our index is based on the wavelet tree, a structure used to represent sequences, permutations, and other discrete functions in stringology. The comparison with the R*-tree and the STR R-tree (the most relevant dynamic and static versions of the R-tree) shows that our proposal needs less space to store the index while keeping competitive search performance, especially when the queries are not too selective. Keywords: MBR, range query, wavelet tree, compact structures.
1 Introduction
The age of online digital availability has forced changes in the goals of many classical research fields. For example, the ever-increasing demand for geographic information services, which allow users to find the geographic location of resources previously placed on a map (e.g., public administration services, places of interest, etc.), has emphasized the interest in the classical range queries (or window queries). These define a rectangular query window and retrieve all the geographic objects having at least one point in common with it. Other queries, like region queries and point queries, can be well approached by them. A survey of spatial queries and of the Spatial Access Methods (SAMs) designed to solve them is that of Gaede and Gunther [5], which also details the special characteristics of spatial data that determine a set of requirements that spatial indexes should meet (e.g., secondary storage management, dynamism, etc.). All these have been design principles of spatial indexes over the years, but recent
improvements in hardware mean that some of these requirements need to be carefully reevaluated. On the one hand, although geographic databases tend to be very large, it is feasible nowadays to place complete spatial indexes in main memory, because spatial indexes do not contain the real geographic objects but a simplification of them. The most common simplification is the MBR or bounding box, which, in the 2-dimensional Euclidean space E^2, needs four floating-point values for each geographic object. Note that the most common spatial data are two-dimensional, hence the interest in E^2 (though our structure represents d-dimensional MBRs). Indeed, considering the current sizes of main memories, the space efficiency requirement can replace that of secondary storage management. Furthermore, the design of compact indexes is a topic of interest because of the way the memory hierarchy has evolved in recent decades. New levels have been added (e.g., flash storage) and the sizes at all levels have been considerably increased. In addition, access times in the upper levels of the hierarchy have decreased much faster than in the lower levels. Therefore, reducing the size of indexes is a timeless topic of interest, because placing these indexes in the upper levels of the memory hierarchy considerably reduces access times, by several orders of magnitude in some cases.

On the other hand, the data is semi-static in many applications. This is usual in Geographic Information Retrieval [9] systems, which have arisen from the combination of GIS and Information Retrieval systems. In the field of GIS, many public organizations are sharing their geographic information using Spatial Data Infrastructures [6]. The spatial information in these systems does not require frequent updates. When a spatial index is integrated in this kind of system, dynamism is not a very important requirement compared to time efficiency in solving queries. Therefore, the design of static spatial indexes that take advantage of the knowledge of the data distribution is also a topic of interest.

In the previous edition of this workshop [3] we presented a spatial index for two-dimensional points based on wavelet trees. The generalization to support d-dimensional range queries over MBRs, which we present here, turns out to be a rather challenging problem not arising in other domains where wavelet trees have been used. As a reward, the index we obtain achieves a good trade-off between space and time efficiency. In particular, the index turns out to be quite insensitive to the size of the query window, and as a result it becomes most competitive on not too selective queries. Our experiments, featuring GIS-like scenarios, show that our index is a relevant alternative to classical spatial indexes such as the R-tree [11]. In addition, the strategy used to decompose the problem allows the use of other solutions in the different steps, achieving different trade-offs.
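As a concrete illustration of how little information an MBR carries, the following Java sketch (with names of our own choosing, not part of the proposed index) shows a 2-dimensional MBR stored as four floating-point values and the per-dimension test behind a range query.

```java
// Illustrative sketch of a 2-d MBR (four floating-point values) and the per-dimension
// intersection test used to answer a range query.
public class MbrSketch {

    record Mbr(double xLow, double xHigh, double yLow, double yHigh) {

        // Two intervals [l1, u1] and [l2, u2] intersect iff l1 <= u2 and l2 <= u1;
        // an MBR intersects the query window iff this holds on every dimension.
        boolean intersects(Mbr query) {
            return xLow <= query.xHigh() && query.xLow() <= xHigh
                && yLow <= query.yHigh() && query.yLow() <= yHigh;
        }
    }

    public static void main(String[] args) {
        Mbr object = new Mbr(2.0, 5.0, 1.0, 3.0);
        Mbr window = new Mbr(4.0, 8.0, 0.0, 2.0);
        System.out.println(object.intersects(window));   // true: they overlap in both dimensions
    }
}
```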
2 Related Work
A great variety of SAMs have been proposed to support the different kinds of queries that can be applied to spatial databases, such as exact match or adjacency. In this paper we focus on a very common kind of query, the range query, over collections of d-dimensional geographic objects. For clarity we often assume $d = 2$, but all the results generalize easily.
The problem is formalized as follows. We define the MBR of a geographic object $o$ as $MBR(o) = I_1(o) \times I_2(o)$, where $I_i(o) = [l_i, u_i]$ ($l_i, u_i \in E^1$) is the minimum interval describing the extent of $o$ along dimension $i$. In the same way, we define a query rectangle $q = [l_1^q, u_1^q] \times [l_2^q, u_2^q]$. Finally, the range query that finds all the objects $o$ having at least one point in common with $q$ is defined as $RQ(q) = \{o \mid q \cap MBR(o) \neq \emptyset\}$.

The R-tree and its dynamic (e.g. R*-tree) and static (e.g. STR R-tree) variants [11] are the most popular SAMs used to solve range queries in GIS. Gaede and Gunther [5] classify these structures in the group of SAMs based on the technique of overlapping regions. An alternative technique consists in transforming the MBRs into a different representation, such as higher-dimensional points or one-dimensional intervals. Our representation is based on this technique, as it decomposes the problem into its $d$ dimensions, solves the $d$ subproblems, and intersects the results. Each subproblem consists in finding the one-dimensional intervals intersecting the query, which is obtained by applying the same decomposition to the original query. Although many structures can be used to solve these subproblems, we propose the use of wavelet trees because they provide a good trade-off between space and time efficiency. Alternatively, classical interval data structures [13] such as interval trees, segment trees, and priority trees can solve each subproblem in time $\Omega(\log N + k)$ (in terms of output-sensitive complexity, where $k$ is the output size). Obviously, these classical pointer-based structures need much more space than the compact wavelet tree. One-dimensional intervals can also be interpreted as two-dimensional points (further details about this transformation are given in the next section), and thus all the data structures supporting two-dimensional range queries over points provide alternatives to solve intersection queries over sets of intervals. Some of them [1,2] theoretically improve the performance of the wavelet tree, but they come with a significant implementation overhead. Recently, Schmidt [14] presented a simple structure solving the interval intersection problem in asymptotically optimal space ($\Omega(n)$) and time ($\Omega(1 + k)$). However, this structure also needs much more space than the wavelet tree; a compact version of it is a promising line of future work because it could also provide an interesting trade-off.

A basic tool in compact data structures is the rank operation: given a sequence $S$ of length $N$, drawn from an alphabet $\Sigma$ of size $\sigma$, $rank_a(S, i)$ counts the occurrences of symbol $a \in \Sigma$ in $S[1, i]$. For the special case $\Sigma = \{0, 1\}$ ($S$ is a bit-vector $B$), the rank operation can be implemented in constant time using little additional space on top of $B$ ($o(n)$ in theory [7]). For example, given a bitmap $B = 1000110$, $rank_0(B, 5) = 3$ and $rank_1(B, 7) = 3$.
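For concreteness, the following minimal Python sketch shows one simple way to support rank over a bit-vector with sampled counters and a local scan; it is only an illustration and does not reproduce the $o(n)$-space constant-time scheme of [7].

```python
# A minimal sketch (not the o(n)-space scheme of [7]): rank over a bit-vector
# using absolute counts sampled every `block` positions plus a local scan.
class BitVector:
    def __init__(self, bits, block=64):
        self.bits = bits                    # list of 0/1 values; positions are 1-based below
        self.block = block
        self.samples = [0]                  # samples[j] = number of 1s in bits[0 : j*block]
        ones = 0
        for i, b in enumerate(bits, start=1):
            ones += b
            if i % block == 0:
                self.samples.append(ones)

    def rank1(self, i):
        """Number of 1s in positions 1..i."""
        j = i // self.block
        ones = self.samples[j]
        for pos in range(j * self.block, i):  # scan the remainder of the block
            ones += self.bits[pos]
        return ones

    def rank0(self, i):
        return i - self.rank1(i)

B = BitVector([1, 0, 0, 0, 1, 1, 0])
assert B.rank0(5) == 3 and B.rank1(7) == 3    # matches the example in the text
```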
3 Our Compact Representation
Our structure is based on the decomposition of the problem into its $d$ dimensions. This decomposition produces $d$ interval intersection subproblems that can be tackled with different structures. The one we propose uses a wavelet tree to
solve each subproblem. The idea is to interpret every interval $a$ as a point $(l_a, r_a)$ in the rank-space grid $N \times N$. Gabow et al. [4] proved that the orthogonal nature of the problem makes it possible to work in the rank space.
3.1 Translation to the Rank Space
In each dimension there are $N$ intervals, each described by two floating-point numbers (its endpoints). A common technique to translate these intervals to the rank space stores two ordered arrays, one with the left endpoints and the other with the right endpoints. The endpoints of an interval in the rank space then correspond to the positions of its real endpoints in these arrays. Although the real coordinates of the MBRs are floating-point numbers, in GIS these numbers can be represented with four bytes each. Note that geographic coordinates can be expressed in degrees or meters, and in most cases it is possible to round them to integer values, after appropriate scaling, without losing any precision. We make use of this assumption, as it holds in most practical applications.

To store these integer coordinates without losing precision we use a compressed storage scheme. An ordered array $X = x_1 x_2 \ldots x_N$ is represented as a sequence of nonnegative differences between consecutive values, $y_{i+1} = x_{i+1} - x_i$ and $y_1 = x_1$. Let $Y = y_1 y_2 \ldots y_N$ be this sequence, so that $x_i = \sum_{1 \le j \le i} y_j$. Array $Y$ is a representation of $X$ that can be compressed by exploiting the fact that consecutive differences are small numbers. These small numbers can be encoded with different coding algorithms. We performed experiments (omitted due to lack of space) with four well-known coding algorithms [12] (Elias-Gamma, Elias-Delta, Rice, and Vbytes) and concluded that Rice encoding outperforms the others in most synthetic and real scenarios. In addition, a vector stores the accumulated sum at regularly sampled positions (say every $h$-th position, so the vector stores all values $x_{i \cdot h}$) to efficiently solve the operations $rightSearch$ (given a coordinate $v$, find the largest $x_i \le v$) and $leftSearch$ (the lowest $x_i \ge v$). The algorithm mapping a real coordinate to the rank space first performs a binary search over the vector of sampled sums and then carries out a sequential scan in the resulting interval of $Y$.
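The sketch below illustrates this scheme in Python under simplifying assumptions: the gaps are kept as plain integers rather than Rice-coded, equal endpoints would need a tie-breaking convention that the sketch does not fix, and the class and method names are ours, not part of the paper's implementation.

```python
import bisect

# Illustrative sketch of Section 3.1 (gaps stored as plain integers here;
# the paper encodes them with Rice codes, which this sketch omits).
class SortedCoords:
    def __init__(self, xs, h=4):
        self.h = h
        self.gaps = [xs[0]] + [xs[i] - xs[i - 1] for i in range(1, len(xs))]
        # accumulated sums at sampled positions: the values of x_h, x_2h, ...
        self.samples = [sum(self.gaps[:i]) for i in range(h, len(xs) + 1, h)]

    def _scan(self, v):
        # binary search over the samples, then sequential scan of the gaps
        j = bisect.bisect_right(self.samples, v)    # number of sampled values <= v
        pos = j * self.h                            # 1-based rank of the last sampled x <= v
        x = self.samples[j - 1] if j > 0 else 0
        while pos < len(self.gaps) and x + self.gaps[pos] <= v:
            x += self.gaps[pos]
            pos += 1
        return pos, x                               # rank and value of the largest x_i <= v

    def right_search(self, v):
        """Rank (1-based) of the largest x_i <= v, or 0 if there is none."""
        return self._scan(v)[0]

    def left_search(self, v):
        """Rank (1-based) of the smallest x_i >= v, or N+1 if there is none."""
        pos, x = self._scan(v)
        return pos if pos > 0 and x == v else pos + 1

coords = SortedCoords([2, 5, 7, 9, 12, 20, 21, 30])
assert coords.right_search(10) == 4    # x_4 = 9 is the largest value <= 10
assert coords.left_search(10) == 5     # x_5 = 12 is the smallest value >= 10
```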
3.2 Solving Queries in the Wavelet Trees
In the two-dimensional rank space, N MBRs can be represented in a 2N × 2N grid (Figure 1 left) and an alternative representation for the N intervals in each dimension is an N × N grid where rows represent the intervals ordered according to their left endpoints and columns represent them ordered according to their right endpoints. Figure 1 (right) represents the grids resulting from the transformation of the vertical (up) and horizontal (down) intervals. The following, easy to verify, observation provides a basis for our next developments. It says, essentially, that an intersection between a query q and an object o occurs when, across each dimension, the query finishes not before the object starts, and starts not after the object finishes.
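As an illustration of this transformation, the following sketch (our own, with arbitrary tie-breaking for equal endpoints) builds, for one dimension, the permutation that maps each interval's row (its position by left endpoint) to its column (its position by right endpoint); this is the permutation the wavelet trees described below are built on.

```python
# Sketch of the interval-to-point transformation for one dimension: row r is
# the r-th interval when sorted by left endpoint, and pi[r-1] is its column,
# i.e. its position when sorted by right endpoint (ties broken arbitrarily).
def intervals_to_permutation(intervals):
    by_left = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    by_right = sorted(range(len(intervals)), key=lambda i: intervals[i][1])
    column_of = {idx: c + 1 for c, idx in enumerate(by_right)}
    return [column_of[idx] for idx in by_left], by_left

# intervals (l_i, u_i) of the MBRs along one dimension (toy data)
intervals = [(1, 4), (2, 3), (5, 9), (6, 7)]
pi, row_order = intervals_to_permutation(intervals)
print(pi)   # [2, 1, 4, 3]: e.g. interval (1, 4) is 1st by left endpoint, 2nd by right endpoint
```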
[Fig. 1. Two-dimensional example in the rank space: the original $2N \times 2N$ grid containing the MBRs $a$–$h$ and the query $q$ (left), and the two $N \times N$ transformed grids for the vertical (top) and horizontal (bottom) intervals (right), with rows ordered by left endpoints and columns by right endpoints.]
Observation 1. $o \in RQ(q)$ iff $\forall i,\ u_i^q \ge l_i \wedge l_i^q \le u_i$.

In two dimensions, the four conditions of Observation 1 for the query $q = [l_1^q, u_1^q] \times [l_2^q, u_2^q]$ split into $u_1^q \ge l_1 \wedge l_1^q \le u_1$ and $u_2^q \ge l_2 \wedge l_2^q \le u_2$. These are illustrated in Figure 1 (right). In the original space, this partition results in two bands of $(l_2 - l_1) \times N$ and $(u_2 - u_1) \times N$, illustrated in Figure 1 (left). Objects intersecting each band are the candidates to be part of the query result; the objects in the intersection of the two bands are the actual results of the query. Note that this intersection can be performed on-line: the candidates of the first dimension turn on their bits in a bitmap, and the candidates of the second dimension report the MBR identifier only if their associated bits are already set in the bitmap (an array of counters allows the generalization to $d$ dimensions).

In both transformed grids there is only one point in each row and in each column, so we can build two wavelet trees as described in [3]. Figure 2 shows the wavelet trees corresponding to the first (left) and second (right) dimensions in the transformed space. The root node of each wavelet tree represents the permutation of the points in the order of the rows, whereas the leaves represent the permutation of the points in the order of the columns. The wavelet tree is a perfect binary tree where each node handles an interval of the columns, and thus knows only the points whose column falls in that interval. The root handles the interval of columns $[1, N]$, and the children of a node handling interval $[i, i']$ are associated with $[i, \lfloor (i + i')/2 \rfloor]$ and $[\lfloor (i + i')/2 \rfloor + 1, i']$. The leaves handle intervals of the form $[i, i]$. Each node $v$ contains a bitmap $B_v$ such that $B_v[r] = 0$ iff the $r$-th point handled by node $v$ (in the order of the rows) belongs to the left child, and $B_v[r] = 1$ iff it belongs to the right child.

In each wavelet tree we perform a query derived from the translation of the original query $q$ to the new space. The query $q = [l_1^q, u_1^q] \times [l_2^q, u_2^q]$ is decomposed according to its two dimensions, resulting in a two-sided query for each wavelet tree: $q_{wt_1} = [1, u_1^q] \times [l_1^q, N]$ and $q_{wt_2} = [1, u_2^q] \times [l_2^q, N]$. The first interval of each wavelet tree query represents rows in the transformed grid, and thus indicates valid positions in the bitmaps of the wavelet tree, whereas the second interval represents columns, and thus indicates valid nodes in the tree traversal (and is used to prune the traversal).
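The on-line intersection of the per-dimension candidate sets can be sketched as follows (illustrative Python, using the candidate sets of the worked example given after Figure 2; the identifier encoding is an assumption of the sketch).

```python
# Sketch of the on-line intersection of per-dimension candidate sets.
# cand_by_dim[i] is an iterable of MBR identifiers (0-based) reported for
# dimension i; an MBR belongs to the answer iff all d dimensions report it.
# For d = 2 a plain bitmap suffices; counters handle the general case.
def intersect_candidates(cand_by_dim, n_mbrs):
    d = len(cand_by_dim)
    count = [0] * n_mbrs                 # bitmap when d == 2, counters otherwise
    result = []
    for dim, candidates in enumerate(cand_by_dim):
        for mbr_id in candidates:
            count[mbr_id] += 1
            if dim == d - 1 and count[mbr_id] == d:
                result.append(mbr_id)    # reported while scanning the last dimension
    return result

# Candidates {c, e, d} and {b, d, e, a} from the worked example below
ids = {name: i for i, name in enumerate("abcdefgh")}
cands = [[ids[x] for x in "ced"], [ids[x] for x in "bdea"]]
print(sorted(intersect_candidates(cands, 8)))   # -> [3, 4], i.e. {d, e}
```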
[Fig. 2. Representing MBRs using wavelet trees: the two wavelet trees built for the first (left) and second (right) dimensions of the transformed space, with the bitmap $B_v$ stored at each internal node and the column permutation at the leaves; the highlighted nodes of the left tree are those visited by the example query below.]
The algorithm solving the query in each dimension is a simplification of the point case [3], because these are two-sided queries, whereas in the general case queries are four-sided. The left wavelet tree in Figure 2 highlights the nodes visited to solve the query decomposed into the ranges $[1, 6]$ (valid positions in the root bitmap of the first wavelet tree) and $[4, 8]$ (interval used to prune the traversal of the first wavelet tree). The first range is projected onto the children of the root node as $[1, rank_0(B, 6)] = [1, 4]$ and $[1, rank_1(B, 6)] = [1, 2]$. In the same way, the range $[1, 4]$ of the left child is projected onto its children as $[1, rank_0(B, 4)] = [1, 2]$ and $[1, rank_1(B, 4)] = [1, 2]$, but the first one is not accessed because it covers the column range $[1, 2]$, which does not intersect the query range $[4, 8]$. Repeating this process down to the leaves, we obtain the set of candidates of the first wavelet tree, $\{c, e, d\}$. Analogously, we obtain the result $\{b, d, e, a\}$ in the second wavelet tree. Therefore, the solution of the query is $\{c, e, d\} \cap \{b, d, e, a\} = \{d, e\}$.
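The following Python sketch makes the per-dimension traversal concrete. It is a naive illustration under our own assumptions: rank is computed by scanning the node bitmaps instead of using the constant-time structures discussed earlier, the query reports leaf columns rather than MBR identifiers, and the optimization of reporting a whole node without descending to its leaves is omitted.

```python
# Wavelet tree over a permutation pi, where pi[r-1] is the column of the single
# point in row r (rows and columns are 1..n). query(r_max, c_min) reports the
# columns of all points with row <= r_max and column >= c_min, i.e. the
# two-sided query [1, r_max] x [c_min, n] of Section 3.2.
class WaveletTree:
    def __init__(self, pi, lo=1, hi=None):
        if hi is None:
            hi = len(pi)
        self.lo, self.hi = lo, hi
        if lo == hi:                          # leaf: a single column
            self.bitmap = None
            return
        mid = (lo + hi) // 2
        self.bitmap = [0 if c <= mid else 1 for c in pi]
        self.left = WaveletTree([c for c in pi if c <= mid], lo, mid)
        self.right = WaveletTree([c for c in pi if c > mid], mid + 1, hi)

    def _rank(self, bit, i):                  # occurrences of `bit` in bitmap[0:i]
        return sum(1 for b in self.bitmap[:i] if b == bit)

    def query(self, r_max, c_min, out=None):
        if out is None:
            out = []
        if r_max <= 0 or self.hi < c_min:     # empty row range, or node pruned
            return out
        if self.bitmap is None:               # leaf holding a valid point
            out.append(self.lo)
            return out
        self.left.query(self._rank(0, r_max), c_min, out)
        self.right.query(self._rank(1, r_max), c_min, out)
        return out

# Toy permutation: the point in row r lies at column pi[r-1].
pi = [3, 1, 4, 6, 2, 8, 5, 7]
wt = WaveletTree(pi)
# rows 1..6, columns 4..8 -> points at columns 4, 6 and 8
print(sorted(wt.query(6, 4)))   # [4, 6, 8]
```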
4 Experiments
The computer used features an Intel Pentium 4 processor at 3.00 GHz with 4 GB of RAM. It runs GNU/Linux (kernel 2.6.27). We compiled with gcc version 4.3.2 and option -O9. In these experiments we used three synthetic collections with one million MBRs each, following uniform, Zipf (world size 1,000 × 1,000, $\rho = 1$) and Gauss (world size 1,000 × 1,000, $\mu = 500$, $\sigma = 200$) distributions. We created four query sets for each dataset with different selectivities, representing 0.001%, 0.01%, 0.1%, and 1% of the area of the space where the MBRs are located. They contain 1,000 queries with the same distribution as the original datasets, and the ratio between the horizontal and vertical extensions varies uniformly between 0.25 and 2.25. Experiments using two real datasets are also presented. The first dataset, named Tiger, contains 2,249,727 MBRs from California roads and is available from the U.S. Census Bureau (http://www.census.gov/geo/www/tiger). Six smaller real collections available at the same place were used as query sets: Block (groups of buildings), BG
(block groups), AIANNH (American Indian/Alaska Native/Native Hawaiian Areas), SD (elementary, secondary, and unified school districts), COUSUB (county subdivisions), and SLDL (state legislative districts). The second real collection, named EIEL, contains 569,534 MBRs from buildings in the province of A Coruña, Spain (http://www.dicoruna.es/webeiel). Five smaller collections available at the same place were used as query sets: URBRU (urbanized rural places), URBRE (urbanized residential places), CENT (population centers), PAR (parishes), and MUN (municipalities).
4.1 Space Comparison
We compare our structure with two variants of the R-tree in terms of the space needed to store the index. The space needed by an R-tree over a collection of $N$ MBRs can be estimated for a given arity $M$. Dynamic versions of this structure, such as the R*-tree, estimate that nodes are 70% full, whereas static versions, such as the STR R-tree, assume that nodes are full (in main memory). Therefore, an R*-tree needs $N / (0.7 \times (M-1))$ nodes whereas an STR R-tree needs $N / (M-1)$ nodes. Each node needs $M \times sizeof(entry)$ bytes, where the size of an entry is the size of an MBR plus a pointer to the child (or to the data if the node is a leaf). In order to compare these variants with our structure we assume that an MBR is stored in 16 bytes (4 coordinates as 4-byte numbers) and the pointer in 4 bytes. Hence, the total size of an R*-tree is $\frac{N}{0.7 \times (M-1)} \times 20 \times M$ bytes, whereas the size of an STR R-tree is $\frac{N}{M-1} \times 20 \times M$ bytes. In our experiments the best time performance of the R*-tree and STR R-tree is achieved with an effective value of $M = 30$. Note that the coordinates stored by the R-tree variants are not sorted, so it is not possible to apply our differential encoding to them.

On the other hand, our structure stores the coordinates of the $N$ MBRs (four arrays of $N$ 4-byte numbers before encoding), the MBR identifiers (two arrays, one per wavelet tree, of $N$ 4-byte numbers used to perform the intersection) and the two wavelet tree bitmaps (see the grayed data in Figure 2). The wavelet tree needs only $N \times \log_2 N$ bits (1 bit per MBR per level, that is, $N$ bits per level, and there are $\log_2 N$ levels). Moreover, in order to perform rank operations in constant time, some auxiliary structures are needed that use an additional space of around 37.5% of the wavelet tree size [7]. Therefore, the complete structure requires $4 \times 4 \times N + 2 \times 4 \times N + 2 \times (N \times \log_2 N \times 1.375)/8$ bytes, where the coordinate arrays are in addition subject to compression. The effectiveness of this compression varies across datasets, so we show the results for each dataset in Figure 3 (we use Rice codes with sampling $h = 500$ and maintain this configuration in the time comparison). These results show that our structure, named SW-tree (from spatial wavelet tree) in the graphs, can index collections of MBRs in less space than the two variants of the R-tree, due to the compressed encoding of the coordinates and the little space required by the wavelet trees. Some datasets are more compressible than others. The best results were obtained with the real Tiger dataset, where we save more than 46% and 22% of the space of the R*-tree and STR R-tree, respectively.
[Fig. 3. Space comparison: index size in bytes/MBR of the R*-tree, STR R-tree, and SW-tree on the Uniform, Zipf, Gauss, Tiger, and EIEL datasets.]
Compression rates are not so good with the EIEL dataset, because the geographic coordinates in this collection are expressed in centimeters, and thus the distances between consecutive coordinates are quite large. However, even in this case the space needed to represent our structure is considerably less than the space needed to represent an STR R-tree.
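As a rough, back-of-the-envelope check of the size estimates above (our own arithmetic, and before the Rice compression of the coordinate arrays, which is what ultimately makes the SW-tree smaller in Fig. 3), consider $N = 10^6$ MBRs and $M = 30$:

```python
import math

# Evaluating the size estimates of Section 4.1 for N = 10^6 MBRs and M = 30.
N, M = 1_000_000, 30
entry = 16 + 4                                   # MBR (4 floats) + child pointer

r_star   = N / (0.7 * (M - 1)) * entry * M       # nodes assumed 70% full
str_tree = N / (M - 1) * entry * M               # nodes assumed full
sw_tree  = 4 * 4 * N + 2 * 4 * N + 2 * (N * math.ceil(math.log2(N)) * 1.375) / 8
# sw_tree is an upper bound: the 16N bytes of coordinates are further reduced
# by the differential Rice encoding, whose effect depends on the dataset.

for name, size in [("R*-tree", r_star), ("STR R-tree", str_tree), ("SW-tree", sw_tree)]:
    print(f"{name:11s} {size / N:.1f} bytes/MBR")
# -> roughly 29.6, 20.7 and 30.9 bytes/MBR (the SW-tree figure before compression)
```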
4.2 Time Comparison
To perform the time comparison we implemented our structure as described in Section 3 and used the R-tree implementation provided by the Spatial Index Library (http://www2.research.att.com/~marioh/spatialindex/). This library provides several implementations of R-tree variants, such as the R*-tree and the STR packing algorithm to perform bulk loading. In addition, all these variants can run in main memory. In our experiments we ran both the R*-tree and the STR R-tree in main memory with a load factor M = 30. We first performed experiments with the three synthetic collections (Figure 4). The main conclusion that can be extracted from these results is that our structure is competitive in query time when the queries are very selective (0.001% and 0.01%), and for less selective queries (from 0.1%) our structure is significantly better than the others.
[Fig. 4. Time comparison in three synthetic datasets with different distributions: query time (s) as a function of selectivity (0.001% to 1%) for the R*-tree, STR R-tree, and SW-tree on the (a) Uniform, (b) Gauss, and (c) Zipf collections.]
Another important conclusion that can be extracted from these results is how little our structure depends on changes in the selectivity. This is due to the space transformation. We divide the problem into two subproblems, each one concerning one dimension. This decomposition makes the queries in the two wavelet trees only marginally dependent on the query size (i.e. selectivity). Note that
indexed MBRs correspond to points near the main diagonal of the transformed grids, so that larger MBRs translate into points farther from the main diagonal (above it). The query translates into a two-sided range query. The point where this transformed query starts is below the main diagonal, farther from it as the query size grows, and thus returning more points (more precisely, the size of the returned set increases linearly with the width of the query along each coordinate). The impact of the query size is more clearly reflected in the intersection of the results of both wavelet trees: an increase by a factor of $s^2$ in the query area (that is, $s$ per coordinate) translates into a factor of just $s$ (i.e. $s \times N + s \times N$) in the amount of data to intersect. This “square root” impact of the query size on the performance of the algorithm explains its resilience to the query selectivity. Of course, it also explains why our technique does not perform so well when queries are very selective, as we spend $O(s)$ time to retrieve a result of size $O((s/N)^2 N)$ (taking $s$ as the asymptotic variable). The surprising time decrease as the query size increases in Figure 4(c) is explained by the fact that all the MBRs represented in a node are directly reported, without reaching the leaves, when the node range is completely contained in the query range and all the positions of the node are valid. Therefore, while smaller queries prune the tree more than bigger ones, bigger queries report more elements without reaching the leaves. The Zipf dataset markedly increases the number of directly reported objects due to the high concentration of MBRs near the origin of coordinates.

Finally, we present the results with the two real datasets. Figures 5(a) and 5(b) show the results for the Tiger and EIEL datasets, respectively. In these graphs the real query sets have been sorted according to their selectivity (selectivity decreases from left to right). Even though both R-tree variants outperform our structure when queries are very selective (Block and BG in the Tiger dataset, and URBRU in the EIEL dataset), these results are very promising because our structure outperforms both R-tree variants over a broad range of real query sets. Note that all of them are meaningful queries; for example, in the EIEL dataset the query set CENT contains queries of the form “which buildings are contained in population center X”. The results for the less selective queries (SD, COUSUB and SLDL in the Tiger dataset, and PAR and MUN in the EIEL dataset) are particularly good, as the differences with the classical solutions are striking.
[Fig. 5. Time comparison (s, log scale) in two real datasets: (a) Tiger with the query sets Block, BG, AIANNH, SD, COUSUB, and SLDL; (b) EIEL with the query sets URBRU, URBRE, CENT, PAR, and MUN; each panel compares the R*-tree, STR R-tree, and SW-tree.]
5 Future Work
There are other versions of the R-tree (e.g. the CR-tree [10]) that use compression techniques to achieve lower storage requirements than the STR R-tree. However, these structures can lose precision and thus produce false positives. We could likewise reduce the precision of the coordinates in order to achieve higher compression rates, at the cost of producing false positives too. Note that each false positive involves a considerable penalty, because the real geographic object has to be retrieved from disk and a complex comparison between this object and the query window has to be performed to check whether it is a true hit. We are also working on allowing the insertion and removal of points once the structure has been built. Some recent works [8] describe dynamic versions of rank structures that can be used in the design of wavelet trees with insertion and deletion capabilities. Finally, algorithms to solve other queries, such as k-nearest neighbors or spatial joins, are also in our plans.
References
1. Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: 41st Symp. on Foundations of Computer Science, pp. 198–207 (2000)
2. Bose, P., He, M., Maheshwari, A., Morin, P.: Succinct orthogonal range search structures on a grid with applications to text indexing. In: Dehne, F.K.H.A., Gavrilova, M.L., Sack, J.-R., Tóth, C.D. (eds.) WADS 2009. LNCS, vol. 5664, pp. 98–109 (2009)
3. Brisaboa, N.R., Luaces, M.R., Navarro, G., Seco, D.: A new point access method based on wavelet trees. In: Heuser, C.A., Pernul, G. (eds.) ER 2009 Workshops. LNCS, vol. 5833, pp. 297–306. Springer, Heidelberg (2009)
4. Gabow, H.N., Bentley, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proceedings of the 16th Annual ACM Symposium on Theory of Computing, pp. 135–143. ACM Press, New York (1984)
5. Gaede, V., Gunther, O.: Multidimensional access methods. ACM Computing Surveys 30(2), 170–231 (1998)
6. Global Spatial Data Infrastructure Association, http://www.gsdi.org/
7. González, R., Grabowski, S., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms, pp. 27–38. CTI Press and Ellinika Grammata (2005)
8. González, R., Navarro, G.: Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science 410, 4414–4422 (2008)
9. Jones, C.B., Purves, R.S.: Geographical information retrieval. International Journal of Geographical Information Science 22(3), 219–228 (2008)
10. Kim, K., Cha, S.K., Kwon, K.: Optimizing multidimensional index trees for main memory access. SIGMOD Record 30(2), 139–150 (2001)
11. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-Trees: Theory and Applications. Springer, Heidelberg (2005)
12. Salomon, D.: Data Compression: The Complete Reference. Springer, Heidelberg (2004)
13. Samet, H.: Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco (2006)
14. Schmidt, J.M.: Interval stabbing problems in small integer ranges. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878, pp. 163–172. Springer, Heidelberg (2009)
A Sensor Observation Service Based on OGC Specifications for a Meteorological SDI in Galicia

José R.R. Viqueira, José Varela, Joaquín Triñanes, and José M. Cotos

Systems Laboratory, Technological Research Institute, University of Santiago de Compostela, Constantino Candeira s/n, A Coruña, Spain
[email protected], [email protected], [email protected], [email protected]
Abstract. The MeteoSIX project, funded by the Galician regional government, aims at the development of a Spatial Data Infrastructure (SDI) and a new SDI-based geo web site to enable integrated access to meteorological data for a wide variety of users with different skills. Such data has to be available through the Internet using OGC and OPeNDAP standards. The present paper focuses on the design and implementation of a first prototype of a sensor observation web server whose interface is based on the OGC Sensor Observation Service (SOS).
1 Introduction
The INSPIRE directive of the European Union establishes rules for the development of Spatial Data Infrastructures (SDI) in Europe. The application of this directive in Spain is currently an ongoing task that is forcing public administrations to implement a network of public web services of geographic information. Examples of such implementations are the Spanish Spatial Data Infrastructure (www.idee.es) and the Galician Spatial Data Infrastructure (http://sitga.xunta.es/sitganet). Enabling public access to meteorological and oceanographic data is therefore also a requirement for the organizations involved. Regarding the management through the web of data obtained by sensors, the Sensor Web Enablement (SWE) [1] of the Open Geospatial Consortium (OGC) currently provides a general architecture and general-purpose interfaces for web services. Thus, the Sensor Observation Service (SOS) [2] specification defines the interface of a service that enables the querying of databases of sensor data. The functionality of this service is provided by three mandatory query operations, two optional transactional operations, and some other operations for enhanced functionality. The mandatory query operations are: GetCapabilities, which enables the retrieval of the capabilities of the service; DescribeSensor, which obtains a detailed description of the characteristics of a given sensor; and GetObservation, which enables the querying of the observations obtained by the sensors.
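As a purely hypothetical illustration of how a client might invoke the first of these operations, the snippet below issues a GetCapabilities request through the common OGC key-value-pair binding; the endpoint URL is invented, the exact parameters accepted depend on the SOS version and bindings offered by a deployment, and the prototype described in Sect. 5 actually exchanges SOAP messages instead.

```python
import urllib.request

# Hypothetical endpoint URL; an actual deployment would publish its own.
SOS_ENDPOINT = "http://example.org/meteosix/sos"

# GetCapabilities via the OGC key-value-pair (KVP) binding (an assumption here;
# the supported bindings vary between SOS deployments and versions).
url = SOS_ENDPOINT + "?service=SOS&request=GetCapabilities"
with urllib.request.urlopen(url) as response:
    capabilities_xml = response.read().decode("utf-8")
print(capabilities_xml[:200])   # start of the XML capabilities document
```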
This work is supported by Xunta de Galicia (ref. 09MDS034522PR).
Transactional operations enable the registering of new sensors in the service and the insertion of new observations from those sensors. In this paper, the development of an SOS is presented in the context of the MeteoSIX project, which aims at the development of an SDI-based Geographic Information System (GIS) for the management and dissemination of meteorological and geographic data in Galicia. The service implements the three mandatory operations of the specification and provides access to observations made by meteorological and oceanographic sensors of various types, including static and mobile sensors that obtain in-situ and remote observations. The implementation of the service is based on standards and open source initiatives. The main contributions of the paper with respect to currently available research efforts (see [3] for a representative example) can be summarized as follows:

– A generic data model is designed for the persistent storage of observations obtained from both local and remote sensors located on both static and mobile platforms.
– Important issues related to the management of spatio-temporal data generated by sensors are raised that must be considered during the efficient design and implementation of a sensor observation server.

The remainder of this paper is organized as follows. The MeteoSIX project is briefly outlined in Sect. 2. Sect. 3 describes the various types of sensors and observations available in the context of the project. The design of the service data model is described in Sect. 4. A brief description of a first limited prototype implementation is given in Sect. 5. Finally, conclusions and issues of further work are drawn in Sect. 6.
2 The MeteoSIX Project
The MeteoSIX project, “Geographic Information System for the Management and Dissemination of the Meteorological and Oceanographic Information of Galicia”, is funded by the Galician regional government and aims at the development of a Spatial Data Infrastructure (SDI) and the applications required to ease access to the available meteorological and oceanographic data in this region of the north-west of Spain. Four organizations collaborate in the project: i) MeteoGalicia, dependent on the Consellería de Medio Ambiente Territorio e Infraestructuras of the Galician regional government, which is responsible for the management of meteorological data in Galicia; ii) the Supercomputing Center of Galicia (CESGA), which provides high-performance computing services to the scientific community of Galicia; iii) the Computer Architecture Group of the University of A Coruña; and iv) the Systems Laboratory of the Technological Research Institute (TRI) of the University of Santiago de Compostela (USC). The deployment of the components of the system resulting from the project is depicted in Fig. 1. As may be observed in the figure, MeteoGalicia will provide web services for the access to meteorological and oceanographic data as well as the relevant end-user web applications.
[Fig. 1. Component deployment for the MeteoSIX project: CESGA provides a processing infrastructure (numeric prediction, meteorological maps, statistics) exposed through a WPS geoprocessing interface, while MeteoGalicia hosts the storage infrastructure (DBMS and file system holding metadata, features, coverages and observations), web applications (geoportal, mobile and internal management applications) and OGC/UNIDATA services (CS-W catalogue, WMS, WCS, WFS, SOS, SAS, OPeNDAP/NetCDF Subset), accessed by final users, mobile users and external clients over the Internet.]
The required high-performance computing infrastructure will be provided by CESGA through the OGC Web Processing Service (WPS) interface. For the purposes of the present paper, note that a sensor observation server offering the OGC SOS interface will be installed at MeteoGalicia; the Systems Laboratory of TRI-USC is responsible for the development of this server.
3 Available Sensor Data
Typically, sensors may be classified, according to the location of the features they measure, into in-situ and remote sensors [4]. An in-situ sensor measures some physical properties in the surroundings of its location. On the other hand, remote sensors measure physical properties of features located at some distance from them. Obviously, for a sensor to work properly, it must be attached to some platform. Thus, sensors may work either on static platforms, whose location is constant with respect to time, or on mobile platforms, whose location varies with respect to time along some path. The importance of the present MeteoSIX project for testing an SOS implementation lies in the fact that both in-situ and remote sensors on board of both static and mobile platforms are available, as briefly described below.
The sensors managed by MeteoGalicia and included in the present MeteoSIX project to test the SOS implementation are installed on board of the following types of platforms:

– Meteorological Stations: MeteoGalicia has a network of about 80 automatic meteorological stations with quite homogeneous equipment. Each station is a static platform with a variety of in-situ sensors installed (at various elevations). Properties measured by those sensors include average air temperature, average relative humidity, temperature at 10 cm below the ground, temperature at 10 cm above the ground, solar radiation, rainfall, barometric pressure, etc. Values of the properties are provided by the sensors at intervals of 10 minutes.
– Oceanographic Stations: Data from four oceanographic stations are available for the project. These stations are also static platforms that include both in-situ and remote sensors. In particular, in-situ sensors enable the measurement of meteorological properties such as air temperature and humidity, and oceanographic properties such as water temperature, water salinity, pressure of the water column, conductivity, etc. Meteorological properties are provided at some elevation, whereas oceanographic properties are available at least at two distinct depth levels (sea surface and sea bottom). On the other hand, remote sensors are available to measure the sea currents at various depths. In particular, either Vertical or Horizontal Acoustic Doppler Current Profilers (VADCP or HADCP) are installed that remotely measure the vector components of the currents at various points along a path.
– Radio sounding: A platform with various meteorological sensors is attached to a weather balloon. Besides the sensors, the platform also includes a GPS and a radio communication system. Thus, it is able to send the observations together with their locations at the various elevations, from surface level up to the stratosphere. The process stops when the platform is destroyed and the communication ends. Clearly, this is a case of in-situ sensors on board of a mobile platform.
– Satellite Data: This well-known type of sensor data is the main example of remote sensors on board of a mobile platform (the satellite). Time series of distinct types of satellite images are available for various zones.
4 Design of the SOS Data Model
The model for the representation of observations adopted by the OGC is defined in [5]. A specialization of this model for the types of observations available in MeteoSIX (see Sect. 3) is depicted in Fig. 2. An observation is modeled as an event that produces a result whose value is an estimate of some property of a feature of interest [5]. As shown in Fig. 2, an observation is obtained at a sampling time with a result that initially is of some unspecified type. The observation is obtained by some procedure (meteorological and oceanographic sensors in our particular case) and it estimates a given observed property (meteorological and
oceanographic variable in our case) at the location defined by a given feature of interest. Such a location will be defined in our case either by the global position of an in-situ sensor or by the remote feature measured by a remote sensor. Observations with a result of a measure type (Measurement in the model) are used to model the observations of most of the available sensors. Observations whose result is a discrete coverage are used to represent the results of sensors on board of satellites. Finally, a complex observation has a result composed of various fields, and in our case it enables the representation of current vectors with the two components obtained by the VADCP and HADCP sensors.
[Fig. 2. In-memory observations data model: an Observation (with a samplingTime and a result of type anyType) references a Process (procedure), an AnyFeature (featureOfInterest) and a PropertyType (observedProperty); observations may be grouped in an ObservationCollection; the specializations Measurement, DiscreteCoverageObservation and ComplexObservation restrict the result to a Measure, a DiscreteCoverage and a Record, respectively.]
In an SOS database the observations have to be organized in groups called offerings. The observations of an offering are interrelated in such a way that the probability of getting an empty response to a GetObservation operation is low. An offering is defined by specifying various conditions on its observations, including: i) a period of time during which the observations must take place, ii) a set of procedures by which observations are obtained, iii) a set of observed properties targeted by the observations, and iv) a set of observed features of interest.

As already stated in the introduction, only the three mandatory operations of the OGC SOS specification are supported by the current service [2]. Based on the available sensors and on the data they produce (see Sect. 3), a data model for the persistent storage of the related observations was defined (see Fig. 3). In addition to the recording of observations, the database must support the recording of the data required to reply to each of the mandatory SOS requests. As shown in Fig. 3, both the constant location of a static platform and the evolution with respect to time of the location of a mobile platform may be recorded, in the classes StaticPlatform and Track, respectively. These locations provide the geographic information for the feature of interest of the in-situ sensors installed on those platforms. In general, a sensor may measure various observable properties, as reflected in the model; a typical example in meteorology are the instruments that measure both air temperature and relative humidity. For each combination of sensor and observable property, various observations at distinct time instants may be recorded.
[Fig. 3. Persistent observations data model: a Platform (specialized into StaticPlatform, with a constant Geometry, and MobilePlatform, with a Track of time-stamped geometries) carries one or more Sensors (with an elevation and a SensorML definition) that measure ObservableProperties; Observations (with an Id and a SamplingTime) are specialized into InSituObservation and RemoteObservation (the latter linked to a RemoteFOI with its own Geometry), and their results are recorded as Measurement (Measure value), CurrentVector (u and v components) or CoverageObservation (GridCoverage value).]
The feature of interest of in-situ observations is recorded as a combination of the elevation of the sensor within its platform and the geographic location of the platform (static or mobile). Regarding observations of remote sensors, their feature of interest has to be explicitly recorded in the RemoteFOI class. Note that features of interest (either platforms or remote FOIs), sensors, observable properties and observations have unique ids of type anyURI, because they have to be uniquely referenced either in operations of the SOS or in the definition of offerings. Finally, note that three specializations of Observation are defined to support the recording of the different types of sensor results. Although not represented in the model of Fig. 3, classes are also available to record the data related to the capabilities and offerings of the service.
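To make the structure of Fig. 3 more concrete, the following sketch renders a few of its classes as Python dataclasses. It is purely illustrative: class and attribute names follow the figure, but the concrete types (e.g. geometries as WKT strings) and the omission of several classes are assumptions of this sketch, not part of the actual design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class ObservableProperty:
    id: str                                   # anyURI

@dataclass
class Sensor:
    id: str                                   # anyURI
    elevation: float                          # elevation within its platform
    definition: str                           # SensorML description
    measures: List[ObservableProperty] = field(default_factory=list)

@dataclass
class Platform:
    id: str                                   # anyURI
    sensors: List[Sensor] = field(default_factory=list)

@dataclass
class StaticPlatform(Platform):
    geo: str = ""                             # constant location, e.g. a WKT point

@dataclass
class MobilePlatform(Platform):
    track: List[Tuple[datetime, str]] = field(default_factory=list)  # (time, WKT point)

@dataclass
class RemoteFOI:
    id: str
    geo: str                                  # geometry of the remotely observed feature

@dataclass
class Observation:
    id: str
    sampling_time: datetime
    sensor: Sensor
    observed_property: ObservableProperty
    remote_foi: Optional[RemoteFOI] = None    # set only for remote observations

@dataclass
class Measurement(Observation):               # scalar result (most in-situ sensors)
    value: float = 0.0

@dataclass
class CurrentVector(Observation):             # VADCP/HADCP result
    u: float = 0.0
    v: float = 0.0
# A CoverageObservation subclass (GridCoverage result) would complete the picture
# for satellite data but is omitted in this sketch.
```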
5 A Limited Prototype Implementation
Based on the design decisions explained in Sect. 4, a first prototype implementation of the service has already been undertaken. Its characteristics are briefly summarized next. A first consideration regards the functionality of the prototype: only part of the required functionality of the SOS was considered. In particular, the service implements the three mandatory operations only for the sensors of the meteorological stations, i.e., in-situ sensors on board of static platforms. Consequently, the model of Fig. 3 was significantly simplified.
[Fig. 4. Compact representation for time series: a single observation whose sampling time aggregates the instants 2010-01-15T12:00:00, 2010-01-15T12:10:00 and 2010-01-15T12:20:00, and whose result lists the corresponding measures 15.17, 15.20 and 16.04.]
Regarding the tools used for the development of the system, the well-known PostgreSQL DBMS was used together with its PostGIS spatial extension for the recording of the locations of the stations (point data). At this initial stage, the HTTP-based distributed computing platform proposed by the OGC was replaced by a W3C-based platform, implemented with the support of the Apache Axis Java platform. Thus, both the requests and the responses of the SOS were encoded in the Simple Object Access Protocol (SOAP). In particular, the in-memory observation model, implemented in Java, was encoded by Axis into the SOAP XML encoding. Besides, the standard operations metadata defined by the OGC were replaced by a relevant Web Service Definition Language (WSDL) encoding. Finally, the overhead in the representation of observation collections with constant procedure (sensor), observed property and feature of interest was reduced with a compact representation for time series. In this compact representation, a time aggregate is used to assign various time instants to a single observation, and a measure list is used to assign the various results (see Fig. 4). This compact representation is actually encoded in SOAP in the prototype. Notice, finally, that the current O&M specification of the OGC [5] recommends using the time coverages of ISO 19123 to represent such time series, so this has to be upgraded for the final implementation.
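As an illustration of this grouping (prior to its SOAP serialization), the sketch below collapses observations that share sensor, observed property and feature of interest into a single record with a list of time instants and a parallel list of measures; the identifiers used are hypothetical.

```python
from collections import defaultdict

# Sketch of the compact time-series representation: observations sharing the
# same sensor, observed property and feature of interest are collapsed into a
# single record carrying a list of time instants and a parallel list of values.
def compact(observations):
    groups = defaultdict(lambda: {"times": [], "values": []})
    for obs in observations:            # obs: dict with sensor, property, foi, time, value
        key = (obs["sensor"], obs["property"], obs["foi"])
        groups[key]["times"].append(obs["time"])
        groups[key]["values"].append(obs["value"])
    return [{"sensor": s, "property": p, "foi": f, **series}
            for (s, p, f), series in groups.items()]

obs = [
    {"sensor": "st01-thermo", "property": "air_temperature", "foi": "station-01",
     "time": "2010-01-15T12:00:00", "value": 15.17},
    {"sensor": "st01-thermo", "property": "air_temperature", "foi": "station-01",
     "time": "2010-01-15T12:10:00", "value": 15.20},
    {"sensor": "st01-thermo", "property": "air_temperature", "foi": "station-01",
     "time": "2010-01-15T12:20:00", "value": 16.04},
]
print(compact(obs))   # one record with three time instants and three measures
```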
6 Conclusions
An OGC SOS was designed in the context of the MeteoSIX project, funded by the regional government of Galicia (Xunta de Galicia). A specific data model was designed for the recording of the different types of observations managed
by MeteoGalicia. Such observations include in-situ and remote observations obtained by sensors installed on both static and mobile platforms. Therefore, the project includes sensors of all the types considered by the OGC Sensor Web Enablement (SWE) [1], and thus constitutes a very good context to test the new OGC specifications for the sensor web. Only a first prototype with limited functionality has been implemented at this stage, so important challenges still have to be faced. Remarkable issues of further work include the following.

– A first important issue to solve is the efficient recording of the results of sensors on board of satellites, i.e., the recording in persistent storage of spatial coverages. Notice that the current recording of spatial coverages relies on the use of specific file formats initially devoted to the recording of either images or geoscientific data. Such an approach forbids the integrated recording of all types of sensor data and it is clearly not as user friendly as the spatial-database approach generally adopted for the management of spatial features. Related to this issue, few models have been proposed in the literature for the management of spatial coverages in databases [6,7,8,9,10], and thus neither standards nor widely used implementations are available.
– An issue related to the one above is the efficient encoding and transmission of spatial coverages over the Internet. Currently, web services that work with coverages, such as the Web Coverage Service (WCS), do not use XML representations; in fact, they usually provide their results in image file formats. Recently, the geoscientific research community, in its effort to adopt OGC standards, has proposed the use of well-known file formats such as netCDF to represent coverages. As a consequence, initial investigations recommend the use of such formats to retrieve spatial coverage observations in an out-of-band mode [2].
– Various data modeling approaches have faced the problem of representing the evolution with respect to time of spatial objects [11,12,13,14]. However, efficient and widely used tools are still not available. Moving objects appear in the current problem in the context of mobile platforms such as the radio sounding experiments. Notice that such data is restricted to moving points and that advanced spatio-temporal data analysis is not required by the GetObservation operation. Thus, the combination of the spatial data types of PostGIS with the temporal data types provided by PostgreSQL seems the best current solution.
– The evolution of spatial coverages with respect to time is also a problem to solve. Relevant models have already been proposed in the literature [15,16], and they show that the issue is more complex than just the combination of spatial coverages with temporal data types.

Regarding the final implementation of the service, general-purpose SOS implementations should be taken into account. An important tool to analyze is the SOS implementation developed in the context of the 52 North initiative [17]. Actually, this issue is already under investigation in the context of the MeteoSIX project. Some initial observations are the following.
– The data model for the persistent recording of observations in the 52 North SOS is based on the use of the PostGIS spatial extension of PostgreSQL. The types for the results of observations are restricted to numeric, text and spatial objects. Therefore, the recording of complex observations such as current vectors is not elegantly supported, and even more complex values such as spatial coverages are not supported at all.
– Despite the limitations in the data types of the results, the data model of the 52 North SOS is completely general purpose. The use of such a generic data model to record the sensor data available in MeteoSIX could lead to the appearance of inconsistencies. As an example, all the features of interest of the observations are recorded in a single class and, besides, a many-to-many relationship exists between that class and the class where sensors are recorded (Procedures in 52 North). Therefore, nothing forbids the recording of an association between an in-situ sensor of a meteorological station and either an oceanographic station or a remote position observed by a VADCP.
– A compact representation for time series is not currently considered as a possible response in the 52 North SOS.
References
1. OGC: OGC sensor web enablement architecture. Open Geospatial Consortium (OGC) (2008), http://www.opengeospatial.org (retrieved April 2010)
2. OGC: Sensor observation service. Open Geospatial Consortium (OGC) (2007), http://www.opengeospatial.org (retrieved April 2010)
3. Chu, X., Buyya, R.: Service oriented sensor web. In: Mahalik, N.P. (ed.) Sensor Networks and Configuration: Fundamentals, Standards, Platforms, and Applications, pp. 51–74. Springer, Heidelberg (2007)
4. OGC: OpenGIS sensor model language (SensorML) implementation specification. Open Geospatial Consortium (OGC) (2007), http://www.opengeospatial.org (retrieved April 2010)
5. OGC: Observations and measurements – part 1 - observation schema. Open Geospatial Consortium (OGC) (2007), http://www.opengeospatial.org (retrieved April 2010)
6. Vijlbrief, T., van Oosterom, P.: The geo++ system: An extensible GIS. In: Cohn, A., Mark, D. (eds.) Proceedings of the 5th International Symposium on Spatial Data Handling (SDH 1992), August 3-7, vol. 1, pp. 40–50 (1992)
7. Oracle: Oracle spatial: Georaster. 10g release 2 (10.2) (2005)
8. Widmann, N., Baumann, P.: Towards comprehensive database support for geoscientific raster data. In: Frank, A.U. (ed.) Proc. of the 5th ACM International Symposium on Advances in Geographic Information Systems (GIS 1997), November 13-14, pp. 54–57. ACM Press, New York (1997)
9. Svensson, P., Huang, Z.: Geo-sal – a query language for spatial data analysis. In: Günther, O., Schek, H.-J. (eds.) SSD 1991. LNCS, vol. 525, pp. 119–140. Springer, Heidelberg (1991)
10. Grumbach, S., Rigaux, P., Segoufin, L.: Manipulating interpolated data is easier than you thought. In: Abbadi, A., Brodie, M., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.Y. (eds.) Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), September 10-14, pp. 156–165. Morgan Kaufmann, San Francisco (2000)
11. Güting, R.H., Böhlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A., Schneider, M., Vazirgiannis, M.: A foundation for representing and querying moving objects. ACM Transactions on Database Systems 25(1), 1–42 (2000)
12. Moreira, J., Ribeiro, C., Abdessalem, T.: Query operations for moving objects database systems. In: Proceedings of the 8th ACM Symposium on Advances in Geographic Information Systems (GIS 2000), Washington, DC, November 10-11, pp. 108–114 (2000)
13. Sistla, P., Wolfson, O., Chamberlain, S., Dao, S.: Modeling and querying moving objects. In: Proceedings of the 13th International Conference on Data Engineering (ICDE 1997), Birmingham, UK, April 7-11, pp. 422–432 (1997)
14. Viqueira, J., Lorentzos, N.: SQL extension for spatio-temporal data. The VLDB Journal 16(2), 179–200 (2007)
15. d'Onofrio, A., Pourabbas, E.: Formalization of temporal thematic map contents. In: Proceedings of the 9th ACM International Symposium on Advances in Geographic Information Systems (GIS 2001), Atlanta, GA, November 9-10, pp. 15–20 (2001)
16. Erwig, M., Schneider, M.: The honeycomb model of spatio-temporal partitions. In: Böhlen, M.H., Jensen, C.S., Scholl, M.O. (eds.) STDBM 1999. LNCS, vol. 1678, pp. 39–59. Springer, Heidelberg (1999)
17. 52North: Home page: 52 North – exploring horizons, http://52north.org/ (retrieved April 2010)
Third International Workshop on Conceptual Modelling for Life Sciences Applications (CMLSA 2010)
Preface

Life Sciences applications typically involve large volumes of data of various kinds and a multiplicity of software tools for managing, analysing and interpreting them. There are many challenging problems in the processing of life sciences data that require effective support by novel theories, methods and technologies. Conceptual modelling is the key for developing high-performance information systems that put these theories, methods and technologies into practice. The fast growing interest in life sciences applications calls for special attention to resource integration and collaborative efforts in information systems development.

This volume contains the papers presented at the Third International Workshop on Conceptual Modelling for Life Sciences Applications (CMLSA 2010), which was held in Vancouver, Canada, from November 1 to 4, 2010, in conjunction with the Twenty-Ninth International Conference on Conceptual Modeling (ER 2010). On behalf of the programme committee we commend these papers to you and hope you find them useful.

The primary objective of the workshop is to share research experiences in conceptual modelling for applications in life sciences and to identify new issues and directions for future research in relevant areas, including bioinformatics, ecoinformatics, and agroinformatics. The workshop invited original papers exploring the usage of conceptual modelling ideas and techniques for developing and improving life sciences databases and information systems.

Following the call for papers, which yielded 12 submissions, there was a rigorous refereeing process that saw each paper refereed by three international experts. The three papers judged best by the programme committee were accepted and are included in this volume. We wish to thank all authors who submitted papers and all workshop participants for the fruitful discussions. We also wish to thank the members of the programme committee for their timely expertise in carefully reviewing the submissions. Finally, we wish to express our appreciation to the local organisers for the wonderful days in Vancouver.
July 2010
Yi-Ping Phoebe Chen
Sven Hartmann
Jing Wang
Provenance Management in BioSciences*

Sudha Ram and Jun Liu

430J McClelland Hall, Department of MIS, Eller School of Management, University of Arizona, Tucson, AZ, 85721
[email protected], [email protected]
Abstract. Data provenance is becoming increasingly important for biosciences with the advent of large-scale collaborative environments such as the iPlant collaborative, where scientists collaborate by using data that they themselves did not generate. To facilitate the widespread use and sharing of provenance, ontologies of provenance need to be developed to enable the capture and standardized representation of provenance for biosciences. Working with researchers from the iPlant Tree of Life (iPToL) Grand Challenge Project, we developed a domain ontology of provenance for phylogenetic analysis. Relying on the conceptual graph formalism, we describe the process of developing the provenance ontology based on the W7 model, a generic ontology of data provenance. This domain ontology provides a structured model for harvesting, storing and querying provenance. We also illustrate how the harvested data provenance based on our ontology can be used for different purposes. Keywords: Provenance, tree of life, W7 model, conceptual graphs.
1 Introduction

In recent years, the tendency toward “big science” (i.e., large-scale collaborative science) is increasingly evident in the biological sciences, facilitated by a breakdown of the traditional barriers between academic disciplines and the application of technologies across these disciplines. The growing number and size of computational and data resources is enabling scientists to perform advanced scientific tasks in large collaborative scientific projects such as the iPlant Collaborative (iPlant, http://www.iplantcollaborative.org). Provenance is becoming increasingly important for biosciences as more scientists collaborate by using data that they themselves did not generate. Tracking data provenance helps ensure that data provided by many different providers and sources can be trusted and used appropriately. Data provenance also has several other critical uses, including data quality assessment, generating data replication recipes, data security management, and others as outlined in [1]. Recently, a consensus has emerged on the need to develop a generic ontology for standardized, application- and organization-independent representation of data provenance [2].
This research is supported in part by research grants from the National Science Foundation Plant Cyberinfrastructure Program (#EF-0735191) and from the Science Foundation of Arizona.
Such a generic ontology will allow provenance to be exchanged between systems. More importantly, a generic ontology is meant to be extensible and shared across applications and modified according to the requirements of a particular domain, thus eliminating the need to develop domain ontologies from scratch. Based on analyzing over 100 use cases, we developed a generic ontology of provenance called the W7 model [3, 4] that defines provenance as consisting of seven interconnected components: what, how, who, when, where, which and why. The W7 model was designed to be general and comprehensive enough to cover a broad range of provenance-related vocabularies (i.e., concepts and their relations). However, the W7 model alone, no matter how comprehensive, is insufficient for capturing provenance for all types of data in biosciences without being adapted and extended. The types and level of detail for tracking provenance vary by data type, purpose, discipline, and project. For instance, the provenance of data on a plant gene may include not only the experimental process by which it was derived, but also information about what plant part and sample was used and how the sample was manipulated. The objective of this paper is to illustrate the process of developing a domain ontology for the plant science domain by adapting and extending the W7 model. Our work is set within the context of the iPlant Collaborative project (www.iplantcollaborative.org). The purpose of iPlant is to develop a cyberinfrastructure that enables the plant sciences community to collaboratively define, investigate and solve the grand challenges of plant biology.
2 Background

2.1 The iPlant Collaborative

The iPlant Collaborative (iPlant) project's mission is to foster the development of a diverse, multidisciplinary community of scientists, teachers, and students, and a cyberinfrastructure that facilitates significant advances in the understanding of plant science through the application of computational thinking and approaches to Grand Challenge problems in plant biology. The plant sciences community has identified two important grand challenges it needs to address. The first grand challenge is called iPlant Tree of Life (iPTOL), while the second one is called iPlant Genotype to Phenotype (iPG2P). Our focus in this paper is on the development of a provenance tracking and management mechanism for the iPTOL grand challenge.

Knowledge of evolutionary relationships is fundamental to biology, yielding new insights across the plant sciences, from comparative genomics and molecular evolution, to plant development, to the study of adaptation, speciation, community assembly, and ecosystem functioning. Although our understanding of the phylogeny of the half million known species of green plants has expanded dramatically over the past two decades, the task of assembling a comprehensive "tree of life" for them presents a Grand Challenge. Its solution will require a significant intellectual investment at the developing intersection between phylogenetic biology and the computer sciences. iPTOL brings together plant biologists and computer scientists to build the cyberinfrastructure needed to scale up phylogenetic methods by 100-fold or more, to enable the dissemination of data associated with such large trees, and to implement scalable "post-tree" analysis tools to foster integration of the plant tree of life with the rest of the botanical sciences. The undertaking to unravel the evolutionary relationships among all living things, and to
express this in the form of a phylogenetic tree of life, is one of the most profound scientific challenges ever undertaken, and represents a true "moonshot" for plant sciences. We anticipate that early success in addressing the plant phylogeny problem will be especially useful in connection with other Grand Challenge Projects supported through the iPlant Collaborative that involve comparisons between genes, genomes, or species, ensuring a broad impact of the project as a whole. Finally, the plant tree of life provides exciting opportunities for training and outreach at all levels. Since Darwin, the tree of life has proven to be a very accessible visual metaphor for nonscientists, providing an elegant opening for communicating results in the plant sciences and evolutionary biology to people with diverse backgrounds. Data provenance is critical for iPToL. It serves three major purposes: 1) to evaluate the quality and trustworthiness of data, 2) to determine how data has been processed and modified within the discovery environment in iPlant, and 3) to enable proper attribution of the creator/owners of the datasets and the researchers' discoveries. In this paper, we describe the development of a domain ontology of provenance for iPToL by extending the W7 model [3, 4]. Extending a generic ontology such as the W7 model to accommodate domain-specific requirements can be challenging for domain experts unless a structured approach is followed. We describe the procedure for extending the W7 model, which can be applied by other domain experts who intend to adopt and extend the W7 model for their own fields.

2.2 The Generic W7 Model for Data Provenance

Based on analysis of a large number of use cases collected from various domains, we conceptualized provenance as a 7-tuple (what, when, where, how, who, which, why), and developed a generic ontology of provenance called the W7 model. The anchor of our provenance model is what, i.e., events that affect a data object during its lifecycle. An event can be content-related (e.g., creation and modification) or non-content-related (e.g., location change, ownership change, format change, rights change, access and annotation events). Provenance of a data object includes events ranging from its creation, to its modification, to its final destruction and archiving. The relationships between what and the other six Ws are graphically represented in Fig. 1. The other six Ws (when, where, how, who, which, and why) are linked to the what associated with a data object. The further classification of what and the other w's is shown in Fig. 2.
Fig. 1. Relationship between what and the other w’s
How represents an action leading to the event. It can be classified into single action and complex action. For instance, purchase and donation are actions that lead to an ownership change. When represents the time of the event. Where, by default, represents the location of the event. An event such as a location change is associated
with two locations: origin and destination. Origin, i.e. where the data came from, is critical provenance information and is thus captured as a subtype of where. It is common for a digital record to travel from system a to system b while retaining its original copy in a. Such an event is considered a data creation, and origin is important where-provenance for the event. Who represents people or organizations involved in the event. It includes agents who initiated the event as well as participants of the event. Why refers to reasons that explain why an event occurred. In our research, why includes belief and goal. A belief refers to the rationale or assumptions made in generating or modifying the data. Our use cases indicate that a common goal in creating or manipulating data is to use it in a project or an experiment. Finally, which refers to instruments or software programs used in the event.
Fig. 2. Hierarchy of the 7 w's (T represents the universal type, a supertype of all other types)
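Informally, the seven components can be pictured as a simple record attached to each event in a data object's lifecycle. The following minimal Python sketch is our own illustration rather than part of the W7 model's formal definition; all class and field names are chosen here for readability only.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class W7Event:
        """One provenance event (the what) together with its six companion W's."""
        what: str                                      # e.g. "creation", "modification"
        how: Optional[str] = None                      # action leading to the event
        who: List[str] = field(default_factory=list)   # agents and participants
        when: Optional[datetime] = None                # time of the event
        where: Optional[str] = None                    # location, including origin
        which: Optional[str] = None                    # instrument or software used
        why: Optional[str] = None                      # belief or goal behind the event

    @dataclass
    class Provenance:
        """Provenance of a data object: the events over its lifecycle."""
        data_id: str
        events: List[W7Event] = field(default_factory=list)

    # Example: a location change recorded for data object #115
    p = Provenance("#115")
    p.events.append(W7Event(what="location change", who=["Nicole"],
                            when=datetime(2010, 3, 1), where="system b"))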
We represent the W7 model using the conceptual graph (CG) formalism developed by Sowa [5]. We briefly introduce the basic conceptual graph formalism first.

1) Conceptual graphs (CGs)

A conceptual graph is a finite, connected, bipartite graph with nodes of one type called concepts and nodes of another type called conceptual relations. The conceptual graph shown in Fig. 3 conveys the proposition that "the creation of the data #115 was made by Nicole". The boxes are concepts. A concept is made up of either a concept type alone or a concept type and its referent information. In the example, the concept [creation] is generic with only a type label inside the box. The other concepts are individual. They have a colon after the type label, followed by a name (e.g., Nicole) or a unique identifier called an individual marker (#115), representing a specific instance of the type. The ovals are conceptual relations. The conceptual relations labeled OBJ and AGNT shown in Fig. 3 represent the linguistic cases object and agent of case grammar. To distinguish the graphs that are meaningful in a domain of interest from those that are not, certain graphs are declared to be canonical.
Fig. 3. A conceptual graph example
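To make the example of Fig. 3 concrete, the same graph can be written down as a small bipartite structure of concept and relation nodes. The encoding below is a sketch of our own in Python, not Sowa's notation; the concept type labels DATA and PERSON for the two individual concepts are our guesses, since only their referents are given in the example.

    # A concept is a (type, referent) pair; referent None marks a generic concept.
    concepts = {
        "c1": ("CREATION", None),     # generic concept [CREATION]
        "c2": ("DATA", "#115"),       # individual concept with marker #115
        "c3": ("PERSON", "Nicole"),   # individual concept named Nicole
    }

    # Conceptual relations link concept nodes; the arcs are ordered.
    relations = [
        ("OBJ",  "c1", "c2"),   # the creation has data #115 as its object
        ("AGNT", "c1", "c3"),   # the creation has Nicole as its agent
    ]

    def box(ctype, ref):
        return f"[{ctype}: {ref}]" if ref else f"[{ctype}]"

    def verbalise(concepts, relations):
        """Rough natural-language reading of the graph, for inspection only."""
        for rel, src, dst in relations:
            print(f"({rel}) links {box(*concepts[src])} to {box(*concepts[dst])}")

    verbalise(concepts, relations)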
The CG model represents the knowledge in a domain of interest using two components: a canon and a set of conceptual graphs that are canonical. The canon contains the information necessary for deriving the conceptual graphs. It has four components: a type hierarchy T, a set of individual markers I, a conformity relation :: that relates type labels in T to markers in I, and a finite set of canonical graphs B, called the canonical basis. In essence, the canon provides a repository of concepts and relations to build conceptual graphs. Not all assemblies of concepts and relations into a conceptual graph are meaningful or "canonical". The canon provides a finite set of canonical graphs that indicate permissible combinations of concepts and relations as the canonical basis. A large number of conceptual graphs that are canonical can then be derived from those in the canonical basis by application of the canonical formation operations. Each of them is a representation of a part of knowledge under the canon. It could thus be considered that conceptual graphs represent knowledge itself, while the canon acts as a framework for the organization of knowledge and encourages a disciplined approach to representing knowledge in CGs.

2) The W7 model represented in the CG formalism

Our generic ontology is called the W7 model since we conceptualize provenance as a sequence of seven w's including what, when, where, how, who, which, and why. Based on the CG formalism, we define the W7 model as a triple W7 = (Tc, S, W7Graph) whose components are defined below. Tc is a concept type hierarchy. It includes provenance-related concept types organized in a hierarchical structure, as shown in Fig. 2. S represents a set of schemas defined for concepts located in Tc. A schema is a structure of knowledge that corresponds to a particular concept type t in Tc. Formally, a schema in CGs is defined as a monadic abstraction λa u where the formal parameter a is of type t for which the schema is defined, and the body u is a conceptual graph that provides the background of what is plausibly true about the concept type t. The CG formalism allows us to attach any number of "related" schemas to a concept. Schema definition is thus a critical mechanism for ontology extension. For the purpose of the current research, we partition S into two sets: optional schemas and mandatory schemas. A schema of a concept is by default optional; optional schemas state the commonly associated properties of the concept. A schema may not be true or necessary for every use of the type. A mandatory schema, on the other hand, defines necessary conditions that include mandatory properties of the concept. Fig. 4 shows several schemas that belong to S. The conceptual graph shown in Fig. 4(a) represents a mandatory schema for the concept type DERIVATION. It asserts that a derivation must have some input data. The CG shown in Fig. 4(b) is an optional schema for the concept type SINGLE ACTION, representing one way the concept can be used: a single action has another action as its successor.
Fig. 4. Example schemas
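Operationally, a mandatory schema can be read as a constraint that every event graph mentioning the concept type has to satisfy. The fragment below illustrates that reading under a deliberately naive encoding of graphs as sets of labelled edges; the lower-case relation names (input, how) are invented for the sketch and do not come from the W7 model itself.

    # Graphs are encoded as sets of (relation, source concept, target concept) triples.
    derivation_mandatory = {
        ("input", "DERIVATION", "DATA"),   # a derivation must have some input data
    }

    def satisfies(event_graph, mandatory_schema):
        """A graph satisfies a mandatory schema if it contains all of its edges."""
        return mandatory_schema <= event_graph

    # "Some data is created through a derivation performed upon some input data."
    creation_by_derivation = {
        ("how",   "CREATION",   "DERIVATION"),
        ("input", "DERIVATION", "DATA"),
    }

    print(satisfies(creation_by_derivation, derivation_mandatory))   # True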
W7Graph, or the W7 graph, is the graph represented in Fig. 1 that includes the seven W's and indicates the relationships between them. The W7 graph serves as a base graph, representing the overall structure of provenance. When representing provenance for a given type of data, this graph must be specialized. Several types of important graphical operations called the canonical formation rules (including copy, simplify, restrict and join) allow a number of more specialized conceptual graphs to be derived from the base graph. The copy rule builds an exact copy of a given graph or its subgraph. The simplify rule removes duplicate relations in a graph. One can restrict a concept by replacing the label of that concept type with a subtype (e.g., WHAT can be restricted to CREATION). Restrict can also replace a generic concept with an individual instance. The join rule merges identical concepts. Concepts are identical if both the concept type and referent are the same. The merge is achieved by overlaying one graph on the other at the point where they are identical. Fig. 5 illustrates some of the formation rules. Suppose we want to derive a graph that asserts "some data is created through a derivation performed upon some input data" (i.e., the graph e in Fig. 5). This graph can be derived from the W7 graph shown in Fig. 1. The graph a is a copy of a subgraph of the W7 graph shown in Fig. 1. The graph b in Fig. 5 results from restricting the type WHAT in the graph a to CREATION, and the graph c is the result of restricting HOW in the graph b to DERIVATION. The graph d represents a schema defined for the concept DERIVATION (see Fig. 4). Then join can merge the two concepts of type DERIVATION in the graphs c and d to form the graph e.
Fig. 5. Formation rules
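For readers who prefer code to diagrams, restrict and join can be mimicked on the same naive edge-set encoding used above. The snippet below is only meant to echo the derivation of graph e in Fig. 5; the hard-coded type hierarchy and the lower-case relation labels are our own simplifications, and copy, simplify and referents are ignored.

    # Fragment of the type hierarchy (subtype -> supertype) used in the derivation.
    parent = {"CREATION": "WHAT", "DERIVATION": "HOW", "DATA": "T"}

    def restrict(graph, old_type, new_type):
        """Replace a concept type label by one of its direct subtypes in every edge."""
        assert parent.get(new_type) == old_type, "restrict only to a direct subtype"
        swap = lambda c: new_type if c == old_type else c
        return {(rel, swap(a), swap(b)) for rel, a, b in graph}

    def join(g1, g2):
        """Merge two graphs on their identical concepts (here: identical type labels)."""
        return g1 | g2

    a = {("how", "WHAT", "HOW")}               # copy of a subgraph of the W7 graph
    b = restrict(a, "WHAT", "CREATION")        # WHAT restricted to CREATION
    c = restrict(b, "HOW", "DERIVATION")       # HOW restricted to DERIVATION
    d = {("input", "DERIVATION", "DATA")}      # mandatory schema for DERIVATION
    e = join(c, d)                             # creation through a derivation on input data
    print(e)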
3 Developing a Domain Ontology for the iPToL Project

Developing a domain ontology for the iPToL project by extending the W7 model involves several steps. First, we identify different types of data available in the domain. The primary type of data available in iPToL is on phylogenetic trees. There are other types of data such as trait data of the species in the phylogenetic trees and outputs from different analysis activities such as phylogenetically independent contrast analysis (PIC). Second, for a given type of data such as trees, we identify the different types of events that may affect the data over its lifetime. For instance, a tree file can be created and modified. There are also annotation events in which images can be added to specific nodes of a tree. In iPToL, researchers can also share a tree file with other people by assigning access rights to other people. Third, we determine the how, who, when, where, which, and why associated with each type of event that can affect a type of data or data object. Let's consider how first. How refers to actions leading up to an event. In iPToL, researchers can perform several different types of actions to create a new tree. They can import/upload a tree file from an existing source such as TreeBASE or MorphoBank, edit an existing tree and then save it as a new one, or create a tree by merging existing trees. They can also perform
different editing actions to modify an existing tree, such as changing the name or branch length of a species node in the tree, changing the layout of the tree, or adding or deleting one or more species nodes. They can also first reconcile a tree file and its trait data and then, as a result of the reconciliation, remove unmatched species or swap some species nodes. Who then represents the agent performing the event. When records the time of the event. Where, i.e., where the data came from, is critical for data that were imported from external sources. Which represents software (such as Phylowidget or Phylomatic) used to modify or merge existing trees. In iPToL, if a tree file was modified, researchers need to record why the modification occurred. After defining the events (what) relevant to each type of data or data object, and the how, who, when, where, which, and why for each event, we construct a domain ontology for the iPToL project. A domain ontology for iPToL consists of a 4-tuple (To, So, W7Graph, E). We define such a 4-tuple for each type of data in iPToL. Here, we present a specific one defined for phylogenetic trees, and describe each of its components. To is a concept type hierarchy. We developed it based on Tc in the W7 model by adding some domain-specific concepts and then pruning the ones that are not applicable or relevant for iPToL. Fig. 6 represents the type hierarchy defined for phylogenetic trees. Compared with the type hierarchy of the generic ontology shown in Fig. 2, it includes a number of domain-specific concepts. For instance, different types of tree editing actions such as add species, delete species, edit species, and change layout were included in it. Edit species was further classified into change name and change branch length.
Fig. 6. Concept type hierarchy in iPToL ontology
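Constructing To can be pictured as plain dictionary manipulation: start from the generic hierarchy, graft domain-specific subtypes onto it, and prune concepts that are not needed. The snippet below is schematic only; the intermediate grouping label TREE EDITING and the decision to prune FORMAT CHANGE are our own illustrative choices, while the editing actions themselves are taken from the text.

    # Fragment of the generic hierarchy Tc (concept -> parent concept).
    Tc = {
        "CREATION": "WHAT", "MODIFICATION": "WHAT", "FORMAT CHANGE": "WHAT",
        "SINGLE ACTION": "HOW", "COMPLEX ACTION": "HOW",
    }

    # Domain-specific concepts added for phylogenetic trees in iPToL.
    domain_additions = {
        "TREE EDITING": "SINGLE ACTION",
        "ADD SPECIES": "TREE EDITING", "DELETE SPECIES": "TREE EDITING",
        "EDIT SPECIES": "TREE EDITING", "CHANGE LAYOUT": "TREE EDITING",
        "CHANGE NAME": "EDIT SPECIES", "CHANGE BRANCH LENGTH": "EDIT SPECIES",
    }

    # Concepts judged irrelevant for this data type are pruned (illustrative choice).
    pruned = {"FORMAT CHANGE"}

    To = {c: p for c, p in {**Tc, **domain_additions}.items() if c not in pruned}
    print(sorted(To))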
So represents a set of schemas defined for concept types in To. If a concept type in Tc was retained in To, then the schemas defined for that concept in the W7 model are imported into the domain ontology. So may also include schemas defined for the newly added concept types, and we may define additional schemas for the retained ones, specifying a new way a concept type can be used. These schemas are used to provide background about a concept. We define a mandatory schema, shown in Fig. 7(a), for the concept type reconciliation, indicating that a reconciliation must be performed on a phylogenetic tree and some trait data. The schema shown in Fig. 7(b) specifies that a replication must have some source data and that the source data has its own provenance. Fig. 7(c) represents a schema defined for software. This schema is optional, which means we can choose to record the author and version of the software used in performing the action.
Fig. 7. Schemas defined in the iPToL ontology
W7Graph remains the same as the one shown in Fig. 1. In the CG vocabulary, these three components – To, So, and W7Graph – form a canon, i.e., the information necessary for deriving other canonical conceptual graphs. We derive a set of canonical graphs based on the canon. Each of them represents a combination of what, how, who, when, where, which, and why that is meaningful and relevant for representing provenance for phylogenetic trees. These canonical graphs are termed event graphs, since each of them describes the information associated with one type of event that can affect a type of data or data object, in our example phylogenetic trees. E represents a set of event graphs. A number of event graphs have been created. Due to space limitations, we show two example event graphs in Fig. 8. The event graph shown in Fig. 8(a) specifies that the creation (what) of a phylogenetic tree can be a result of importing (how) some source data with its own provenance from a database (where) by an agent (who) at a certain time (when) for some reason (why). Which was not included in this graph since it was deemed of little use for this type of event. This graph is derived from the W7 graph shown in Fig. 1 by performing a sequence of graphical operations described earlier. These operations include a copy of a subgraph (without which) of the W7 graph, several restrict operations (e.g., what is restricted to creation, how to importing, and where to database) according to the type hierarchy shown in Fig. 6, and then a join with the mandatory schema shown in Fig. 7(b), which was defined for replication and is thus applicable to importing, a subtype of replication. The event graph shown in Fig. 8(b) describes how a phylogenetic tree can be modified. A tree can be modified through a complex action comprising reconciliation and then editing. The reconciliation is performed using the tree and some trait data. We also attempt to capture the who, when and why associated with the modification event as well as which software was used and the author and version of the software. Similarly, this graph was also derived from the W7 graph based on a sequence of graphical operations. Relying on the different mechanisms provided by the CG formalism, we construct event graphs that represent the domain-specific provenance in a disciplined yet flexible way. Our approach is disciplined since the event graphs can only be canonically defined from the W7 model that defines the structure of provenance. Our approach is also flexible since the schema definition and the join operation enable us to attach schemas to a concept to provide background about the concept at any level of detail. The event graphs are used as conceptual schemas for capturing the provenance. Provenance captured for an event that affects a tree is an instantiation of one of the event graphs.
Fig. 8. Examples of event graphs
4 Using Provenance

As discussed previously, provenance serves three major purposes for iPToL: 1) Data Quality Assessment: It helps evaluate the quality and trustworthiness of data, 2) Replication Recipes: It allows plant scientists to understand how data was processed and modified within a discovery environment in iPlant, and 3) Attribution: It enables proper attribution of the creator/owners of the datasets and the researchers' discoveries. Different provenance information may contribute to different uses of data provenance.
Fig. 9. Graphical representation of provenance
A variety of provenance information can be used for estimating the quality of the data. For instance, where the data came from is critical for understanding the quality of data. After a tree file is imported, who modified it and for what purposes (why) is of utmost importance in determining data quality. We have developed a mechanism in the iPlant discovery environment that enables users to visually browse the whole provenance of any dataset or data object to understand and evaluate its data quality. For instance, the provenance of a tree file "PDAP.tree.nex" is shown in Fig. 9. The file was imported by a person named Nicole from TreeBASE. It was then modified by Doug. He changed the name of a species to be consistent with a naming convention used elsewhere. This tree file was then modified by Nicole. She reconciled the tree file with its trait data, and subsequently removed a species in the tree file. Another use of provenance is to provide a replication recipe for data. Fig. 10 shows a scientist who browses the provenance to understand how the data was processed and which software tool was used to manipulate it. We also provide mechanisms to query and browse the who provenance, since attribution of the creator/owners of the datasets and of the researchers' discoveries relies primarily on provenance such as who created and modified the data.
Fig. 10. The how and which associated with the events
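Once provenance is kept in structured form, attribution queries of this kind reduce to a simple filter over the stored events. The fragment below is a sketch over a toy event encoding of our own; the file name and agent names repeat the example above, everything else is invented for illustration.

    # Each stored event: (data_id, what, who, how)
    events = [
        ("PDAP.tree.nex", "creation",     "Nicole", "importing from TreeBASE"),
        ("PDAP.tree.nex", "modification", "Doug",   "change name"),
        ("PDAP.tree.nex", "modification", "Nicole", "reconciliation, remove species"),
    ]

    def who_provenance(events, data_id):
        """Agents to credit for creating or modifying a given data object."""
        return sorted({who for d, what, who, _ in events
                       if d == data_id and what in ("creation", "modification")})

    print(who_provenance(events, "PDAP.tree.nex"))   # ['Doug', 'Nicole']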
5 Conclusion and Future Research

In this paper, we described how a domain ontology of provenance was developed for the iPlant Tree of Life (iPToL) Grand Challenge Project by extending a generic ontology of provenance in the form of the W7 model. Our approach for developing a domain ontology of provenance can be applied by other domain experts who intend to adopt and extend the W7 model for their own fields. In our future work, we propose to investigate the uses of provenance. Representing provenance in a structured way enables more sophisticated uses of provenance. As an example, we are developing metrics mapped to different components of provenance (e.g., the how or who provenance) that can be used to assess the quality of data
semi-automatically. We are also extending this work to other grand challenges in iPlant as well as to other domains. This is crucial in most bioscience applications, since an ontology that enables the capture and standardized representation of provenance is critical for scientific data sharing.
References

1. Simmhan, Y., Plale, B., Gannon, D.: A Survey of Data Provenance Techniques. Indiana University, Technical Report IUB-CS-TR618 (2005)
2. Moreau, L., Freire, J., Futrelle, J., McGrath, R.E., Myers, J., Paulson, P.: The Open Provenance Model: An Overview. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 323–326. Springer, Heidelberg (2008)
3. Ram, S., Liu, J.: Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling. In: Chen, P.P., Wong, L.Y. (eds.) ACM-L 2006. LNCS, vol. 4512, pp. 17–29. Springer, Heidelberg (2007)
4. Ram, S., Liu, J.: A New Perspective on Semantics of Data Provenance. Presented at the First International Workshop on the Role of Semantic Web in Provenance Management, Washington, D.C. (2009)
5. Sowa, J.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading (1984)
Ontology-Based Agri-Environmental Planning for Whole Farm Plans Hui Ma School of Engineering and Computer Science Victoria University of Wellington, New Zealand [email protected]
Abstract. Motivated by our experiences with agri-environmental information management and planning for the Sustainable Land Use Initiative, we present an ontology-based approach for aligning whole farm plan data with other data collections of interest. Semantic interoperability is of utmost importance to enable seamless integration across multiple data collections. Web-based systems combining GIS-technology with ontologies can support environmental information analysts in assessing the interplay between natural resources, farming practices and environmental issues on farms and in generating farm-specific recommendations for sustainable land management. At the same time such systems unlock data collected or produced during the generation and monitoring of whole farm plans for new applications and broader user groups, including agronomists and scientists working in related disciplines.
1 Introduction
Agri-environmental information management and planning frequently require the retrieval and alignment of data from multiple data collections. Our research has been undertaken in the context of the Sustainable Land Use Initiative (SLUI) which addresses environmental problems in the farming areas of the New Zealand hill country. Whole farm plans are a common tool to integrate environmental goals with current farming operations. Based on an assessment of available natural resources, environmental issues are identified and evaluated, and countermeasures are developed. This task involves a multitude of data from distributed data collections such as image data, classification data, geographical data, observational data, climate data, soil data, air pollution data, ecology data, vegetation distribution data, biodiversity data, and business data. Environmental information analysts and other user groups have data management requirements to cope with this situation. Most importantly, they require convenient, reliable and efficient access to a range of specialised data collections that host the relevant data. Often, these data collections are heterogeneous, distributed and owned by third parties. One of the many challenges is the question of how to ensure semantic interoperability across the various data collections. Here we will study how ontologies can be used to address this challenge in the case of agri-environmental information. The paper is organised as follows. In
Section 2 we briefly describe the objectives of SLUI and the core ideas of whole farm plans. Section 3 discusses limitations of existing GIS-empowered systems for managing whole farm plan data, while Section 4 motivates the need for data sharing and integration with respect to whole farm plans. Section 5 discusses the use of ontologies to achieve semantic interoperability. Section 6 addresses tasks in agri-environmental planning that can benefit from ontologies. Section 7 presents a web portal prototype to demonstrate the combination of ontologies and GIS-technology for sharing data collected or produced during the generation and monitoring of whole farm plans.
2 Background and Motivation
The major objectives of the Sustainable Land Use Initiative (SLUI) are to assess the current state of natural resources that are essential for the country's rural economy and to preserve them for future generations. It started after a severe storm in early 2004 that impacted the lower half of New Zealand's North Island, with intense rainfalls triggering extensive floods and landsliding throughout the hill country. The efforts undertaken in the context of the initiative should also help to increase the environmental resilience to similar climatic events in the future [12]. The regional councils are responsible for implementing the policies and work programmes developed under SLUI. A core instrument is the whole farm plan, which takes an integrated view of farm business and environment. Whole farm plans start with an environmental assessment of land, water and farm production resources and a review of the existing farm business. Then an integrated long-term farm business plan and a 5-year environmental programme are developed in cooperation with the farmers to address the environmental issues that were identified. Finally, follow-up procedures need to be agreed upon with farmers, e.g., to organise support by the councils or third-party experts and to monitor progress on the actions set in the plans. It should be noted that participation in the programme is entirely voluntary and the developed plans are recommendations only, but subject to moral commitment. The analysis of the business operations on a farm assembles relevant business information, e.g., on finances, production processes, infrastructure, livestock, and land use. During the environmental assessment of farms a series of artifacts is generated, including land resource inventories, land use maps, pasture production maps, land use capability maps, soil fertility and nutrient maps, and summaries of environmental issues. They form the basis for a thorough analysis resulting in recommendations for the sustainable management of natural and farm production resources. Land resource inventories collect information on the core natural resources found on the farm, including biophysical characteristics such as vegetation cover, soil type, rock type, slope and erosion severity. They are geo-referenced and displayed on specialised maps using various scales. Detailed farm-level maps use scales between 1:5,000 and 1:30,000. They show farm features of interest such
as paddocks, fences, buildings, laneways and tracks, waterways, bridges, fords, dams, ponds, lakes and wetlands, wells, boreholes, culverts, dumps, offal holes, scrub, scattered bush, shelterbelts, forestry blocks, or woodlots. Parcel and paddock maps show the subdivision of a farm. Land use maps show the actual use of the farm's land resources such as arable, forage, breeding, pasture, forest pasture, exotic forestry, or indigenous forestry. Other maps are less detailed, say of scale 1:50,000 or less. Legal title and property maps show property boundaries. Topographic maps show land reliefs by means of contour lines, and also streams and lakes. Geological maps show geological features like rock formations. Often different maps are superimposed and used together to get a better understanding of the whereabouts of a farm.
3 Agri-Environmental Information Management
The generation and monitoring of whole farm plans involves a mass of data that needs to be efficiently managed. The kinds of data to be managed include image data from high-resolution satellite or orthocorrected aerial photography, sensor data from remote sensing, biophysical data obtained from analysing soil samples during soil tests, statistical data obtained by automated or manual aggregation, and location data from geo-referencing. A considerable portion of the data used during data analysis stems from existing data collections maintained by regional councils or others. Thus means for efficient access and alignment are essential. Most farm features incorporate spatial properties, so that a combination of database and GIS technology is called for. Currently there are several high-end solutions available like Oracle Spatial that can be easily integrated with ArcGIS/ArcSDE. One can also use emerging spatial extensions to low-end solutions like PostGIS or MySQL GIS. Regional councils have set up GIS-empowered information systems for managing and querying data collected or produced for whole farm plans. When working with the first generation of such systems we noticed a number of limitations that should be addressed: 1) Current systems were designed with environmental information analysts, land managers, and agribusiness consultants in mind as the primary users. However, there are other groups of potential users such as farmers as the core stakeholders, but also agronomists and scientists from various other related disciplines. At the same time the insights and data collected with whole farm plans become interesting for a growing number of emerging applications in domains like conservation, soil science, climate research, or food safety. 2) To unlock systems for broader user groups or new applications, some architectural changes are required. Not only should systems be web-based to enable platform-independent remote access, they should also adapt to the needs of user groups and applications. This includes role profiles, personalisation, customisation, adaptive web interfaces, and support for technical environments, e.g., mobile devices for on-farm use. Moreover, data exchange and data integration with other information systems or databases need to be supported by adequate interfaces. To enable fast and convenient data access for scientific
studies by agronomists or for data sharing with other community agencies, it is desirable to offer web services for the automated exchange of relevant data. 3) Before systems can be unlocked for broader user groups or new applications, existing system features need to be reconsidered and enhanced. As an example consider privacy. While most of the resource information collected with a whole farm plan may become publicly available, most business information is confidential and needs to be protected against disclosure. Some information may only be intended for particular user groups, while other information may only be used in aggregated form for statistical analysis or for reporting at the regional level. On the other hand, there might be information related to environmental risks that must be published to comply with transparency requirements. In either case systems need to be equipped with adequate access control mechanisms. If we go for web services it is further desirable to support QoS criteria (including privacy) as used in service-level agreements. 4) Current systems provide only limited support for the conceptual modelling of data requirements for whole farm plans. This becomes even more of a bottleneck if one wants to accommodate the needs of broader user groups or new applications. In [10] we have introduced a geometrically-enhanced entity-relationship model that allows the adequate high-level representation of farm features, including those observing complex geometric properties like Bezier curves and polyBeziers. The Geography Markup Language (GML) [18] can be used to capture the features contained in the land resource inventory and the various maps used for the whole farm plans. GML is an XML-based standard of the Open Geospatial Consortium [19] for representing geographic information. Different from most GIS tools, GML distinguishes between features and geometric objects like points or regions. Features as collected in land resource inventories usually observe geometric properties, e.g., the feature's location or extent. GML allows multiple features to share a geometric property by using property references which are actually inspired by RDF. GML-encoded features can be displayed at any required resolution level. GML is widely used as an open standard for geospatial data exchange that is independent of the internal storage format used by the various GIS tools. This provides the basis for data sharing between different applications or data collections used by particular user groups.
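As a rough illustration of how GML-encoded farm features can be consumed outside a GIS tool, the fragment below parses a hand-written feature with Python's standard XML library. The farm namespace, its element names and the coordinates are invented for the sketch; only the GML namespace and the nesting of the polygon geometry follow the standard.

    import xml.etree.ElementTree as ET

    # A hand-crafted, GML-flavoured paddock feature (element names in the farm
    # namespace are hypothetical).
    doc = """<farm:Paddock xmlns:farm="http://example.org/farm"
                           xmlns:gml="http://www.opengis.net/gml">
      <farm:name>Back paddock</farm:name>
      <farm:extent>
        <gml:Polygon>
          <gml:exterior><gml:LinearRing>
            <gml:posList>0 0 0 100 120 100 120 0 0 0</gml:posList>
          </gml:LinearRing></gml:exterior>
        </gml:Polygon>
      </farm:extent>
    </farm:Paddock>"""

    ns = {"farm": "http://example.org/farm", "gml": "http://www.opengis.net/gml"}
    root = ET.fromstring(doc)
    name = root.find("farm:name", ns).text
    coords = root.find(".//gml:posList", ns).text.split()
    print(name, "has", len(coords) // 2, "boundary points")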
4 Integration with Other Data Collections
We continue by outlining some examples from practice where the integration of data collected or produced for whole farm plans with other data collections leads to new insights, thus enabling improved procedures for tackling agri-environmental and related issues. When developing recommendations for tackling environmental issues, it is important to investigate the interaction between farms and surrounding forests. Possums, for example, act as wild vectors of diseases, thus contributing to the spread of bovine tuberculosis and of waterborne diseases like giardia or cryptosporidium. The spread of the diseases is impacted by livestock management practices on farms, e.g., moving stock between
paddocks, grazing in the vicinity of forests, watering from potentially infected sources, inadequate testing and record-keeping. There is evidence that a better coordination of livestock management and possum control helps to reduce the threat of new infections among cattle and deer herds [21]. Animal health experts are interested in aligning their own data on possum distributions and control measures with whole farm plan data to improve the effectiveness of possum control programmes. Existing possum eradication programmes on small islands like Kapiti show the potential of pest control, but cannot be directly transferred to farming areas with open boundaries in the hill country. Possums cause damage to native ecosystems, but also to pasture, crops, commercial forests, and tree plantings for erosion control. The main concern is selective browsing that may eliminate indigenous trees like rata or kamahi from forests and reduce the food supply for native species. Moreover, possums also compete for nest sites and feed on the eggs of native birds. Hence, whole farm plan data is also of interest for conservation experts tackling biodiversity threats. When assessing the land use capability of a farm's land resources, it is beneficial to consult regional or national soil databases for comparison. This helps environmental information analysts to evaluate type, quality and properties of soils found on farms. This is done to see whether land resources are suitable for particular purposes like cropping or pasture, but also to assess the risk of soil erosion and other environmental threats. Soil surveys are undertaken to identify soil types across a farm. Soil types reflect that soil is formed by the interaction of five soil-forming factors: climate, rock type, topography, time and biological activity. The distribution of soil types on farms is estimated based on observed landscape features, thus giving rise to land units of homogeneous soil types. Soil samples are taken, the soil forming environment is recorded, and the biophysical characteristics of each soil layer are tested. Soil testing is a valuable tool for determining soil quality and soil properties. Soil scientists analyse soil samples, e.g., for granulometry, pH levels, organic matter, exchangeable acidity, plant available nutrients, and nutrient concentrations. Soil databases often keep track of characteristic soil profiles against which soil sample data can be aligned. Soil quality and soil properties need to be checked on a regular basis as they may change over time, e.g., due to cropping practices, climate events, or recommended actions like drainage. Soil databases can help to predict the impact of farming practices, to recommend the right actions, and to monitor their effectiveness. Soil fertility data, for example, is used to provide reliable nutrient recommendations. There are several soil classification systems in use. At an international level, the soil legend of the Food and Agriculture Organisation (FAO) and its successor, the World Reference Base (WRB) for Soil Resources, are most commonly used [17]. The WRB [6] was adopted by the International Union of Soil Sciences as the standard for soil correlation and nomenclature. However, there is also a huge number of national soil classification systems. Farms in the New Zealand hill country are usually specialised in dairy, sheep and deer farming rather than crop farming. Land resources are frequently classified by the LRI (land resource
index) system that records five factors: rock type, soil unit, slope class, erosion type & severity, and vegetation. On the other hand, regional councils frequently apply the LUC (land use capability) system for classifying land resources, cf. [1]. The LUC code (e.g., 3w2) has three basic components: class, subclass and capability unit. There are eight classes ranking land from highest to lowest capability (1-8). The four subclasses indicate the major limiting factor to production (wetness, erosion, soil and climate). The capability unit (e.g., 2) allows a more fine-grained ranking that can link with regional classifications and known best management practices. Neighbouring land resources with the same LUC code are usually grouped together to form larger landcare units.
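Because the LUC code packs three pieces of information into a short string, decoding it mechanically is straightforward. The sketch below assumes exactly the pattern described above (a class digit, one of the four subclass letters, a unit number) and would need extending for any further qualifiers used in practice.

    import re

    LUC_PATTERN = re.compile(r"^([1-8])([wesc])(\d+)$")
    SUBCLASS = {"w": "wetness", "e": "erosion", "s": "soil", "c": "climate"}

    def parse_luc(code):
        """Split a LUC code such as '3w2' into class, limiting factor and capability unit."""
        m = LUC_PATTERN.match(code.strip().lower())
        if not m:
            raise ValueError(f"not a recognised LUC code: {code!r}")
        luc_class, subclass, unit = m.groups()
        return {"class": int(luc_class),
                "limitation": SUBCLASS[subclass],
                "unit": int(unit)}

    print(parse_luc("3w2"))   # {'class': 3, 'limitation': 'wetness', 'unit': 2}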
5 Ontology Support for Agri-Environmental Information
To enable seamless integration between data collections we need to ensure semantic interoperability. Semantic interoperability is a major concern in the life science fields where many terms are used, often having multiple interpretations and variable meanings in different contexts. Ontologies have emerged as a unifying mechanism for representing knowledge and generating a common understanding of concepts. As noted in [13] the lack of formalisation as provided by ontologies often hampers the ability of scientists to find and incorporate relevant data into broader-scale environmental studies. Ontologies provide a formal representation of the terminology or concepts in a specific domain, which helps to clarify the meaning of these terms/concepts and the relationships among them. Popular ontology languages like RDF or OWL have XML-based serialisation formats that are machine-readable, thus facilitating automated processing and reasoning. This may help to release human experts from tedious routine tasks in data management and analysis. For example, ontologies have been successfully exploited to guide the translation between the various spatial data formats used in GIS tools, including GML [4]. Domain ontologies are used for knowledge representation in a specific domain. They provide the particular meanings of concepts/terms as they apply to that domain. We have used OWL to assemble a collection of domain ontologies for agri-environmental information management in the context of SLUI. For creating the ontologies we used Protégé, probably the leading ontology editor. It also provides plug-ins for ontology visualisation. For the land use capability assessment, for example, we have translated the LUC system into OWL. Our domain ontology contains concepts such as Farm, Parcel, LandUnit, Capability, LucClass, LucSubclass, LucUnit, Region, Polygon, or PolyBezier. Concepts are arranged in a hierarchy, based on the subsumption (is-a) relationship. Polygon, for example, is a subconcept of Region. Furthermore, we have included object properties such as hasCapability, partOfFarm and hasLandUnits for mereological relationships, bordersLandUnit and hasExtent for spatial relationships. In our ontology one also finds datatype properties like farmName. During application processing the concepts are populated with individuals representing the farm's land resources.
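To give a flavour of what such an ontology fragment looks like when assembled programmatically rather than in Protégé, the sketch below builds a few of the listed concepts and properties with the rdflib library. The namespace URI and the individual are placeholders of our own.

    from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

    SLUI = Namespace("http://example.org/slui#")   # hypothetical namespace
    g = Graph()
    g.bind("slui", SLUI)

    # A few concepts from the domain ontology, arranged by subsumption.
    for cls in ("Farm", "Parcel", "LandUnit", "Capability", "Region", "Polygon"):
        g.add((SLUI[cls], RDF.type, OWL.Class))
    g.add((SLUI.Polygon, RDFS.subClassOf, SLUI.Region))

    # Object and datatype properties mentioned in the text.
    for prop in ("hasCapability", "partOfFarm", "hasLandUnits",
                 "bordersLandUnit", "hasExtent"):
        g.add((SLUI[prop], RDF.type, OWL.ObjectProperty))
    g.add((SLUI.farmName, RDF.type, OWL.DatatypeProperty))

    # During application processing, concepts are populated with individuals.
    g.add((SLUI.farm001, RDF.type, SLUI.Farm))
    g.add((SLUI.farm001, SLUI.farmName, Literal("Example hill country farm")))

    print(g.serialize(format="turtle"))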
6 Agri-Environmental Planning
It is well-accepted that the up-to-date management of farm operations needs decision support tools [16]. However, the successful application of GIS technology by planning practitioners is still far from common-place [7]. Recommended actions need to be checked for consistency. Before land use changes are recommended, potential impacts on natural resources and business operations need to be assessed, too. Changes in pasture production, for example, can lead to yield gaps, thus causing changes in stocking rates. Our goal is to develop a web portal based on web services to analyse environmental issues and to study the effect of recommended actions in agri-environmental planning. The potential users of this system require access to multiple data collections and support with spatio-temporal querying and reasoning across them. [3] discusses the construction of spatio-temporal ontologies for geographic information integration. Following their approach we have included concepts for actions and events and object properties for temporal relationships into the domain ontologies. Note that our ontologies stay within the OWL DL fragment of OWL, which guarantees decidability of reasoning.
Fig. 1. Example of a map with new features reflecting recommended actions [1]
Whole farm plans aim to generate a list of farm-specific environmental issues, including their severity, priority and a brief description of the suggested control measures. Soil-related issues can be physical damage, declining organic matter, nutrient depletion and soil erosion. Pasture needs to be protected against pests and weeds. Water quality can be affected by contamination and sediment intake. Biodiversity issues can be grazing in forest pasture where livestock competes for food with native species, possum sightings, or epidemics in indigenous forestry. Soil-related issues can often be tackled by fertiliser use restrictions and erosion control measures. Examples of actions to counter environmental issues are vegetation management (e.g., weed control, tracking, culverts), livestock management (e.g., fences, changes of stock numbers and breeding areas), and water resource
management (e.g., dams, stream-bank protection). Recommended actions often serve more than a single purpose (e.g., tree plantings for erosion control may also serve as a shade, shelter or fodder resource for livestock). Sustainable land use often means aligning the actual land use to the observed land use capability. In the literature a multitude of quality indicators has been proposed for assessing the environmental impact of farming practices, and for monitoring the effects of agri-environmental policies and work programmes. We refer to [14] for a recent discussion on the accuracy of such indicators based on empirical studies. [23] suggested community-based selection of sustainability indicators. This idea can be extended to the identification of environmental issues and countermeasures. As farmers know their farms best, they can often help spot known pollutants like dumps, offal holes, and runoff points to water, but also high-risk areas like unprotected wetlands, bush remnants, pest areas, fragile soils and bad erosion areas.
7 A Web Portal Prototype
For our project we have developed a prototype system for managing whole farm plan data that is based on web services. We use the Web Feature Service (WFS) and the Web Map Service (WMS) of the Open Geospatial Consortium (OGC). Both are standard protocols for serving geographic features and map images from a geospatial database over the Internet using simple HTTP GET requests. We use them for retrieving GML-encoded farm features as collected in land resource inventories and related base maps from a GeoServer [8] instance, which supports both WFS and WMS. For populating our domain ontologies, we use the RDF/XML syntax of OWL. We store the ontologies and their instantiations in a Sesame [20] RDF store running on top of a MySQL server. The RDF data can be accessed through Sesame's SPARQL endpoint service. The GeoServer can handle geometric properties like polygons and Bezier curves, and allows one to query spatial relationships, e.g., to retrieve the extent of land resources or topological relationships among them. For example, for assessing the land use capability of a farm's land resources we can retrieve the relevant information from the GeoServer and add it to the RDF store. An additional advantage of using GML and RDF/XML for data exchange is that they are both XML-based, so we can easily visualise them through a web browser using XSLT style sheets.
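Both access paths of the prototype, features over WFS and ontology instances over SPARQL, can be exercised with plain HTTP. The sketch below uses the requests library; the host names, repository name, feature type and query are placeholders standing in for the actual deployment, whose identifiers are not given here.

    import requests

    # 1) Fetch GML-encoded features from a GeoServer WFS endpoint (placeholder URL).
    gml = requests.get(
        "http://localhost:8080/geoserver/wfs",
        params={"service": "WFS", "version": "1.1.0", "request": "GetFeature",
                "typeName": "slui:LandUnit",    # hypothetical feature type
                "maxFeatures": "10"},
    ).text

    # 2) Query the populated domain ontology via the Sesame SPARQL endpoint.
    sparql = """
    PREFIX slui: <http://example.org/slui#>
    SELECT ?unit ?cap WHERE { ?unit a slui:LandUnit ; slui:hasCapability ?cap . }
    """
    resp = requests.get(
        "http://localhost:8080/openrdf-sesame/repositories/slui",   # placeholder repository
        params={"query": sparql},
        headers={"Accept": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["unit"]["value"], row["cap"]["value"])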
8 Related Work
There is a rich literature on the construction and use of ontologies for geographic information systems; for a recent survey see [5]. Some works that study the use of ontologies for related, though different, applications are the following: [22] suggests the use of domain ontologies and RDF for the GIS-supported environmental monitoring of waterways. [2] presents a simple description logic for spatial structures on farms upon which case-based reasoning can be performed to analyse the spatial organisations of farms with respect to their functioning. [9] discusses the use
of ontologies and RDF for spatial reasoning about legal land use regulations. [25] reports on the use of the Gene Ontology for a functional enrichment analysis for crop and farm animal species. None of these works focuses on ontology support for agri-environmental planning with whole farm plans or land use capability assessment. [24] reports on a European initiative for developing a framework for assessing and comparing, ex-ante, alternative agricultural and environmental policy options. This is a multinational effort that targets whole ecosystems rather than individual farms, and goes far beyond the requirements of regional councils in New Zealand. The architecture of our prototype adopts the approach in [9], but tailors it to the particularities of agri-environmental planning with whole farm plans that we encountered in the context of SLUI.
9 Conclusion and Future Work
The work presented in this paper demonstrates the usefulness of ontologies for sharing terminology and data for cross-disciplinary applications such as agri-environmental planning. We have developed a web portal prototype that combines GIS-technology with ontologies developed for specialised application domains, as discussed for the example of land use capability assessment. The prototype makes use of OWL/RDF, GML and the OGC web services for accessing farm features, base maps and other geospatial data stored in distributed data collections, and for serving them over the Internet. The prototype can be easily extended to new data collections and ontologies. It can be queried using SPARQL through a web client. As an alternative, one can use semantic reasoners like Pellet with the populated domain ontologies. However, Pellet cannot be used directly with the prototype as Sesame does not yet provide support for it. In the future we would like to extend the functionality of the prototype and to undertake thorough performance testing to study the scalability of the proposed approach. An important task is to equip it with means for spatio-temporal reasoning that go beyond the limited capabilities of Sesame. We also want to investigate the addition of further web services for tasks like semantic annotation and automated data transformations. Here we expect that semantic web services will be beneficial as they combine ontologies with web service technology. We plan to use abstract state services [11] and OWL-S [15] for describing them.
References

1. AgResearch: Farm plan prototype for SLUI (2005), www.nzarm.org.nz/KinrossWholeFarmPlan_A4_200dpi_secure.pdf (retrieved online from the New Zealand Association of Resource Management)
2. Ber, F.L., Napoli, A., Metzger, J.-L., Lardon, S.: Modeling and comparing farm maps using graphs and case-based reasoning. Journal of Universal Computer Science 9(9), 1073–1095 (2003)
3. Bittner, T., Donnelly, M., Smith, B.: A spatio-temporal ontology for geographic information integration. International Journal of Geographic Information Science 23(6), 765–798 (2009)
4. Boucher, S., Zimányi, E.: Leveraging OWL for GIS interoperability: rewards and pitfalls. In: ACM Symposium on Applied Computing - SAC, pp. 1267–1272 (2009)
5. Buccella, A., Cechich, A., Fillottrani, P.: Ontology-driven geographic information integration: A survey of current approaches. Computers and Geosciences 35(4), 710–723 (2009)
6. FAO: World reference base for soil resources, www.fao.org/docrep/w8594e/w8594e00.htm
7. Geertman, S., Stillwell, J.: Planning support systems: content, issues, and trends. In: Planning Support Systems: New Methods and Best Practice, Advances in Spatial Science, pp. 1–18. Springer, Heidelberg (2009)
8. GeoServer: GeoServer documentation, docs.geoserver.org/
9. Hoekstra, R., Winkels, R., Hupkes, E.: Reasoning with spatial plans on the semantic web. In: ACM International Conference on Artificial Intelligence and Law - ICAIL, pp. 185–193 (2009)
10. Ma, H., Schewe, K.-D., Thalheim, B.: Geometrically enhanced conceptual modelling. In: Laender, A.H.F. (ed.) ER 2009. LNCS, vol. 5829, pp. 219–233. Springer, Heidelberg (2009)
11. Ma, H., Schewe, K.-D., Wang, Q.: An abstract model for service provision, search and composition. In: IEEE Asia-Pacific Services Computing Conference - APSCC, pp. 95–102 (2009)
12. Mackay, A.: Specifications of whole farm plans as a tool for affecting land use change to reduce risk to extreme climatic events. AgResearch (2007)
13. Madin, J.S., Bowers, S., Schildhauer, M.P., Jones, M.B.: Advancing ecological research with ontologies. Trends in Ecology and Evolution 23(3), 159–168 (2008)
14. Makowski, D., Tichit, M., Guichard, L., van Keulen, H., Beaudoin, N.: Measuring the accuracy of agri-environmental indicators. Journal of Environmental Management 90(S2), S139–S146 (2009)
15. Martin, D.L., et al.: Bringing semantics to web services with OWL-S. In: World Wide Web, pp. 243–277 (2007)
16. McCown, R.: Changing systems for supporting farmers' decisions: problems, paradigms, and prospects. Agricultural Systems 74, 179–220 (2002)
17. Nachtergaele, F.O.F.: The future of the FAO legend and the FAO/UNESCO soil map of the world. In: Soil Classification, ch. 12. CRC Press, Boca Raton (2003)
18. OGC: Geography Markup Language, www.opengeospatial.org/standards/gml
19. OGC: Open Geospatial Consortium Website, www.opengeospatial.org/
20. OpenRDF: User Guide for Sesame, www.openrdf.org/doc/sesame/users/
21. Pech, R., Byrom, A., Anderson, D., Thomson, C., Coleman, M.: The effect of poisoned and notional vaccinated buffers on possum (Trichosurus vulpecula) movements: minimising the risk of bovine tuberculosis spread from forest to farmland. Wildlife Research 37(4), 283–292 (2010)
22. Pundt, H., Bishr, Y.: Domain ontologies for data sharing – an example from environmental monitoring using field GIS. Computers and Geosciences 28(1), 95–102 (2002)
23. Reed, M.S., Fraser, E.D., Dougill, A.J.: An adaptive learning process for developing and applying sustainability indicators with local communities. Ecological Economics 59, 406–418 (2006)
24. Van Ittersum, M., et al.: Integrated assessment of agricultural systems - a component-based framework for the European Union (SEAMLESS). Agricultural Systems 96, 150–165 (2008)
25. Zhou, X., Su, Z.: EasyGO: Gene Ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 8, 246 (2007)
First International Workshop on Conceptual Modelling of Service (CMS 2010)
Preface

The aim of the First International Workshop on Conceptual Modelling of Service (CMS) was to bring together researchers in the areas of services computing, services science, business process modelling, and conceptual modelling. The emphasis of this workshop was on the intersection of the rather new, fast growing services computing and services science paradigms with the well established conceptual modelling area. The call for papers solicited submissions addressing modelling support for service integration; quality of service modelling; modelling languages / techniques for services; conceptual models for integrated design and delivery of value bundles; modelling of semantic services; and formal methods for services computing / services science. A total of 17 submissions were received by the workshop. Each paper was reviewed by three or four members of the international programme committee, and we finally selected the five best rated submissions for presentation at the workshop. The accepted papers were organised in two sessions. The first session focused on Modelling Support for Service Integration, while the second session covered Modelling Techniques for Services. We wish to thank all authors of submitted papers and all workshop participants, without whom the workshop would not have been possible. We are grateful to the members of the programme committee for their timely expertise in carefully reviewing the submissions. Finally, our thanks go to the organizers of ER 2010 and the workshop chairs Gillian Dobbie and Juan Trujillo for giving us the opportunity to organise this workshop.
July 2010
Markus Kirchberg Bernhard Thalheim
A Formal Model for Service Mediators Klaus-Dieter Schewe1 and Qing Wang2 1
Software Competence Center Hagenberg, Hagenberg, Austria [email protected] 2 University of Otago, Dunedin, New Zealand [email protected]
Abstract. In this paper we present a model of service mediators, which are high-level specifications of service-based applications. These mediators provide slots that are to be filled by actual services. Suitable services have to match the specification of the slots according to functional and categorical characteristics. Services and mediators are based on the ASM-based model of Abstract State Services.
1 Introduction
A great deal of research is currently devoted to service-oriented architectures (SOA) (see e.g. [9,12,16]), service-oriented computing (SOC) [14], web services (see e.g. [1,2,3,6,11]), and cloud computing [4,10,24,25], which are all centred around related problems. In an effort to consolidate and integrate current research activities, the Service-Oriented Computing Research Roadmap [20] has been proposed. Service foundations, service composition, service management and monitoring, and service-oriented engineering have been identified as core SOC research themes. Despite the big interest in the area and the many ideas and systems that have been created, many fundamental questions have still not been answered. Nonetheless there is an agreement that content, functionality and sometimes even presentation should be made available for use by human users or other services, which resembles the view of a pool of resources in the meme media architecture [23]. The general idea is that media resources are extracted from any accessible source, wrapped and thereby brought into the generic form of a meme media object, and stored in a meme pool, from which they can be retrieved, re-edited, recombined, and redistributed. Our research aims at laying the foundations of a theory of service-oriented systems. In particular, we try to answer the following fundamental questions:
– What must a general model for services look like to capture the basic idea and all facets of possible instantiations, and how can we specify such services?
– How can we search for services that are available on the web?
– How do we extract from such services the components that are useful for the intended application, and how do we recombine them?
– How can we optimise service selection using functional and non-functional (aka "quality of service") criteria?
In [18] we addressed the first of these problems by developing Abstract State Services (AS2s) as a general, formal model for services. It is based on Abstract State Machines (ASMs) [8], which have already proven their usefulness in many areas. AS2s abstract from and generalise our research on Web Information Systems [21], and integrate the customised ASM thesis for database transformations [22]. The model of AS2s captures in particular web services (see e.g. [1]). In [19] we extended our work towards the second problem. We proposed a formalisation of clouds as service federations, which in addition provide an ontological description of the offered services. The ontology must contain at least a functional description of services by means of types, and pre- and postconditions, plus a categorical description of the application area by keywords. It can be formalised by description logics [5]. In a sense, this is the idea of the "semantic web", which enables at least the semi-automatic selection of services. Thus, AS2s that are extended in this way by an ontological description of services capture the idea of "semantic web services" (see e.g. [17]). In this paper we further extend our work, addressing the third of the problems above. More precisely, we formalise the notion of the plot of a service, which specifies algebraically how a service can be used. We then turn the idea around, using plots with open slots for services to specify intended service-based applications on a high level of abstraction. We call such specifications service mediators, as they mediate the collaboration of participating services. We then have to formally define matching criteria for services that are to fill the slots. In accordance with the proposed ontological description of services we investigate matching conditions based on functional and categorical characteristics. For plots we adopt Kleene algebras with tests instead of looking into much more sophisticated process algebras [7], which would all be far too complicated to have a chance of obtaining a decidable matching condition. In the remainder of the paper we first elaborate on AS2s and associated plots in Section 2. Note that in our original work on AS2s in [18] plots were at best included implicitly. In Section 3 we introduce service mediators with service slots, and formally discuss matching between such slots and actual services. We conclude with a brief summary and outlook.
2 Service Plots
Abstract State Services are composed of two layers: a database layer and a view layer on top of it. Both layers combine static and dynamic aspects. The assumption of an underlying database is no restriction, as it is hidden anyway, and data services will be formalised by views, which in the extreme case could be empty to capture pure functional services. The sequencing of several service operations in order to execute a particular task is left only implicit in the AS² model. In this section we make it explicit by algebraic expressions called plots.
2.1 Abstract State Services
Starting with the database layer and following the general approach of Abstract State Machines [13] we may consider each database computation as a sequence of abstract states, each of which represents the database (instance) at a certain point in time plus maybe additional data that is necessary for the computation, e.g. transaction tables, log files, etc. In order to capture the semantics of transactions we distinguish between a wide-step transition relation and small-step transition relations. A transition in the former marks the execution of a transaction, so the wide-step transition relation defines infinite sequences of transactions. Without loss of generality we can assume a serial execution, while of course interleaving is used for the implementation. Then each transaction itself corresponds to a finite sequence of states resulting from a small-step transition relation, which should then be subject to the postulates for database transformations [22].

Definition 1. A database system DBS consists of a set S of states, together with a subset I ⊆ S of initial states, a wide-step transition relation τ ⊆ S × S, and a set T of transactions, each of which is associated with a small-step transition relation τ_t ⊆ S × S (t ∈ T) satisfying the postulates of a database transformation over S.

A run of a database system DBS is an infinite sequence S_0, S_1, ... of states S_i ∈ S starting with an initial state S_0 ∈ I such that for all i ∈ N (S_i, S_{i+1}) ∈ τ holds, and there is a transaction t_i ∈ T with a finite run S_i = S_i^0, ..., S_i^k = S_{i+1} such that (S_i^j, S_i^{j+1}) ∈ τ_{t_i} holds for all j = 0, ..., k − 1.

Views in general are expressed by queries, i.e. read-only database transformations. Therefore, we can assume that a view on a database state S_i ∈ S is given by a finite run S_i = S_0^v, ..., S_n^v of some database transformation v with S_i ⊆ S_n^v – traditionally, we would consider S_n^v − S_i as the view. We can use this to extend a database system by views. In doing so we let each state S ∈ S be composed as a union S_d ∪ V_1 ∪ ... ∪ V_k such that each S_d ∪ V_j is a view on S_d. As a consequence, each wide-step state transition becomes a parallel composition of a transaction and an operation that "switches views on and off". This leads to the definition of an Abstract State Service (AS²).

Definition 2. An Abstract State Service (AS²) consists of a database system DBS, in which each state S ∈ S is a finite composition S_d ∪ V_1 ∪ ... ∪ V_k, and a finite set V of (extended) views. Each view v ∈ V is associated with a database transformation q_v such that for each state S ∈ S there are views v_1, ..., v_k ∈ V with finite runs S_d = S_0^j, ..., S_{n_j}^j = S_d ∪ V_j of v_j (j = 1, ..., k). Each view v ∈ V is further associated with a finite set O_v of (service) operations o_1, ..., o_n such that for each i ∈ {1, ..., n} and each S ∈ S there is a unique state S′ ∈ S with (S, S′) ∈ τ. Furthermore, if S = S_d ∪ V_1 ∪ ... ∪ V_k with V_i defined by v_i and o is an operation associated with v_k, then S′ = S′_d ∪ V′_1 ∪ ... ∪ V′_m with m ≥ k − 1, and V′_i for 1 ≤ i ≤ k − 1 is still defined by v_i.
In a nutshell, in an AS² we have view-extended database states, and each service operation associated with a view induces a transaction on the database, may change or delete the view it is associated with, and may even activate other views. These service operations are actually what is exported from the database system to be used by other systems or directly by users. Note that for each view v the defining query, i.e. the database transformation q_v, can itself be considered a service operation. This simply reflects the fact that data that is made available on the web can be extracted and stored or processed elsewhere. In particular, we have the extreme cases of a pure data service, in which no service operations would be associated with a view v, i.e. O_v = ∅, and a pure functional service, in which the view v is empty. A formalisation of database transformations is beyond the scope of this paper. In a nutshell, the postulates require a one-step transition relation between states (sequential time postulate), states as (meta-finite) first-order structures (abstract state postulate), necessary background for database computations such as complex value constructors (background postulate), limitations to the number of accessed terms in each step (bounded exploration postulate), and the preservation of equivalent substructures in one successor state (genericity postulate) [22].
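To make the layered view of Definition 2 more tangible, the following small Python sketch (our illustration, with all names such as State and wide_step being assumptions rather than notation from [18]) models a view-extended state as a database part plus the currently active views, and a wide-step transition that runs a transaction and re-evaluates the defining queries:

    # Minimal illustration of a view-extended AS^2 state (Definition 2):
    # a state is the database part S_d together with the currently active views.
    from dataclasses import dataclass, field
    from typing import Callable, Dict, FrozenSet, Set

    Facts = FrozenSet[tuple]      # a database or view instance as a set of facts

    @dataclass
    class State:
        db: Facts                                              # the database part S_d
        views: Dict[str, Facts] = field(default_factory=dict)  # active views V_1, ..., V_k

    def wide_step(state: State,
                  transaction: Callable[[Facts], Facts],
                  view_defs: Dict[str, Callable[[Facts], Facts]],
                  activate: Set[str]) -> State:
        """One wide-step transition: run a transaction on the database part and
        'switch views on and off' by re-evaluating the defining queries q_v."""
        new_db = transaction(state.db)
        new_views = {name: view_defs[name](new_db) for name in activate}
        return State(db=new_db, views=new_views)

    # A pure data service would attach no operations to its view; a pure
    # functional service would keep the view part empty.
    if __name__ == "__main__":
        db0: Facts = frozenset({("flight", "NZ1", "AKL-LAX")})
        view_defs = {"flights": lambda db: frozenset(f for f in db if f[0] == "flight")}
        s0 = State(db=db0, views={"flights": view_defs["flights"](db0)})
        add_booking = lambda db: db | {("booking", "NZ1", "customer42")}
        s1 = wide_step(s0, add_booking, view_defs, activate={"flights"})
        print(s1.views["flights"])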
2.2 Algebraic Plots
According to [21] a plot is a high-level specification of an action scheme, i.e. it specifies possible sequences of service operations in order to perform a certain task. For an algebraic formalisation of plots in Web Information Systems (WISs) it was possible to exploit Kleene algebras with tests (KATs [15]). A plot is then an algebraic expression that is composed out of elementary operations including 0, 1 and propositional atoms, binary operators · and +, and unary operators ∗ and ¯, the latter being applicable only to propositions. With the axioms for KATs we obtain an equational theory that can be used to reason about plots. Propositions and operations testing them are considered the same. Therefore, propositions can be considered as operations, and overloading of operators for operations and propositions is consistent. In particular, 0 represents fail or false, 1 represents skip or true, p · q represents a sequence of operations (or a conjunction, if both p and q are propositions), p + q represents the choice between p and q (or a disjunction, if both p and q are propositions), p∗ represents iteration, and p̄ represents negation. For our purposes here, the definition of plots for AS²s requires that we leave the purely propositional ground. The service operations give rise to elementary processes of the form ϕ(x) op[z](y) ψ(x, y, z), in which op is the name of a service operation, z denotes input for op selected from the view v with op ∈ O_v, y denotes additional input from the user, and ϕ and ψ are first-order formulae denoting pre- and postconditions, respectively. The pre- and postconditions can be void, i.e. true, in which case they can simply be omitted. Furthermore, also simple formulae χ(x) – again interpreted as
tests checking their validity – constitute elementary processes. With this we obtain the following definition.

Definition 3. The set of process expressions of an AS² is the smallest set P containing all elementary processes that is closed under sequential composition ·, parallel composition ∥, choice +, and iteration ∗. That is, whenever p, q ∈ P hold, then also p · q, p ∥ q, p + q and p∗ are process expressions in P. The plot of an AS² is a process expression in P.

Example 1. Let us look at some very simplistic examples. For a flight booking service we may have the following (purely sequential) plot:

get itineraries[](d) · select itinerary[i]() · personal data[](t) · confirm flight[](y) · pay flight[](c)

Here the parameters d, i, t, c and y represent dates, the selected itinerary, traveller data, card details, and a Boolean flag for confirmation. Similarly, the following expression represents another plot for accommodation booking:

get hotels[](d) · select hotel[h]() · select room[r]() · personal data[](t) · confirm hotel[](y) · pay accommodation[](c)

Here the parameters h and r represent the selected hotel and room. Finally, the expression

personal data[](t) · (papers[]() ∥ discount[](d′))

represents the plot of a conference registration service.
Note that the set of all instantiations of process expressions in P still defines a Kleene algebra with tests, but in contrast to the work on Web Information Systems in [21] this algebra is not finitely generated. The sequences of service operations with instantiated parameters that are permitted by the plot define the semantics of the AS².
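As a concrete illustration of Definition 3 (not part of the formal development), plots can be represented as a small abstract syntax of process expressions; the Python class names below are hypothetical, and the flight booking plot of Example 1 is encoded at the end:

    # Sketch of plots as algebraic process expressions (Definition 3). The
    # constructors mirror sequential composition, parallel composition, choice
    # and iteration of the Kleene algebra with tests; the names are ours.
    from dataclasses import dataclass
    from typing import Tuple

    class Plot: ...

    @dataclass(frozen=True)
    class Op(Plot):            # elementary process  phi(x) op[z](y) psi(x,y,z)
        name: str
        pre: str = "true"
        post: str = "true"

    @dataclass(frozen=True)
    class Test(Plot):          # a condition chi(x), interpreted as a test
        formula: str

    @dataclass(frozen=True)
    class Seq(Plot):           # p . q
        parts: Tuple[Plot, ...]

    @dataclass(frozen=True)
    class Par(Plot):           # p || q
        parts: Tuple[Plot, ...]

    @dataclass(frozen=True)
    class Choice(Plot):        # p + q
        parts: Tuple[Plot, ...]

    @dataclass(frozen=True)
    class Star(Plot):          # p*
        body: Plot

    # The (purely sequential) flight booking plot of Example 1:
    flight_plot = Seq((Op("get_itineraries"), Op("select_itinerary"),
                       Op("personal_data"), Op("confirm_flight"), Op("pay_flight")))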
3 Mediators
With the concept of service mediators we want to capture the plot of a composed AS². In other words, we want to define a plot of an application that is yet to be constructed. The key issue is that such mediators specify service operations to be searched for, which can then be used to realise the problem at hand in a service-oriented way.
3.1 Service Slots
In order to capture the idea of specifying service requests we relax the definition of a plot in such a way that service operations do not have to come from the same AS². Thus, in elementary processes we use prefixes to indicate the corresponding AS², so we obtain ϕ(x) X : op[z](y) ψ(x, y, z), in which X denotes a service slot. Apart from this we leave the construction of the set of process expressions as in Definition 3.
Definition 4. A service mediator is a process expression with service slots. Furthermore, each service operation is associated with input- and output-types, pre- and postconditions, and a concept in a service terminology.

Example 2. Let us specify a service mediator for a conference trip application, which should combine conference registration, flight booking, and accommodation booking. Furthermore, repeated entry of customer data should be avoided, and confirmation of selection as well as payment should be unified in single local operations. This leads to the following specification:

L : personal data[](t) · (X : papers[]() ∥ X : discount[](d′) ∥ Y : get itineraries[](d) · Y : select itinerary[i]() ∥ Z : get hotels[](d) · Z : select hotel[h]() · Z : select room[r]()) · L : confirm[](y) · (Y : confirm flight[](y) ∥ Z : confirm hotel[](y)) · L : pay[](c) · (Y : pay flight[](c) ∥ Z : pay hotel[](c))

Here the three slots X, Y and Z refer to the three services for conference registration, flight booking, and accommodation booking, respectively, while the slot L refers to local operations. For confirmation and payment the input parameters y and c are simply pushed through to the two booking services.
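Reusing the process-expression classes from the sketch in Section 2.2, the mediator of Example 2 can be encoded as follows; note that the grouping of the parallel branches and the added SlotOp class are our own interpretation, not a definition from the paper:

    # Builds on the Plot/Seq/Par classes sketched earlier.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SlotOp(Plot):   # X : op, an operation requested from the service filling slot X
        slot: str
        name: str

    # One possible reading of the conference-trip mediator of Example 2.
    mediator = Seq((
        SlotOp("L", "personal_data"),
        Par((Par((SlotOp("X", "papers"), SlotOp("X", "discount"))),
             Seq((SlotOp("Y", "get_itineraries"), SlotOp("Y", "select_itinerary"))),
             Seq((SlotOp("Z", "get_hotels"), SlotOp("Z", "select_hotel"),
                  SlotOp("Z", "select_room"))))),
        SlotOp("L", "confirm"),
        Par((SlotOp("Y", "confirm_flight"), SlotOp("Z", "confirm_hotel"))),
        SlotOp("L", "pay"),
        Par((SlotOp("Y", "pay_flight"), SlotOp("Z", "pay_hotel"))),
    ))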
The work in [19] contains a precise definition of service terminologies on the grounds of description logics [5]. For this we assume that C0 and R0 represent not further specified sets of basic concepts and roles, respectively. Then concepts C and roles R are defined by the following grammar:

R = R0 | R0⁻
A = C0 | ⊤ | ≥ m.R   (with m > 0)
C = A | ¬C | C1 ⊓ C2 | C1 ⊔ C2 | ∃R.C | ∀R.C
Definition 5. A service terminology is a finite set T of assertions of the form C1 ⊑ C2 with concepts C1 and C2 as defined by the grammar above. Each assertion C1 ⊑ C2 in a terminology T is called a subsumption axiom.

The semantics of a terminology is defined by its models in the usual way [19]. Such a service terminology should comprise at least two parts: a functional description of input- and output-types as well as pre- and postconditions, telling in technical terms what the service operation will do, and a categorical description by inter-related keywords, telling what the service operation does using the common terminology of the application area (see [19] for details).

Example 3. With respect to the service operations in the plots in Example 1 the terminology has to specify that select itinerary is a flight booking service operation. For this purpose the terminology may contain among others the following subsumption axioms:
Booking ⊑ Service Operation ⊓ ∃initiator.Customer ⊓ ∃initiated by.Request ⊓ ∃receives.Acknowledgement ⊓ ∃requires.Customer data ⊓ ∃requires.Payment

Flight booking ⊑ Booking ⊓ ∀initiated by.Flight request

Further details can be found in [19].
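For the matching conditions below, only the subsumption relation induced by such axioms is needed. As a rough illustration (deliberately ignoring the concept constructors, which would require a description-logic reasoner [5]), the following sketch records named axioms C1 ⊑ C2 and answers subsumption queries by transitive closure; all identifiers are assumptions:

    # Naive categorical subsumption over explicitly named concepts: the declared
    # axioms C1 ⊑ C2 are closed reflexively and transitively. A real
    # implementation would use a description-logic reasoner over the full grammar.
    from collections import defaultdict

    class Terminology:
        def __init__(self):
            self.supers = defaultdict(set)

        def add_axiom(self, sub: str, sup: str) -> None:
            self.supers[sub].add(sup)

        def subsumes(self, general: str, specific: str) -> bool:
            """Does 'general' subsume 'specific' under the declared axioms?"""
            seen, stack = set(), [specific]
            while stack:
                c = stack.pop()
                if c == general:
                    return True
                if c not in seen:
                    seen.add(c)
                    stack.extend(self.supers[c])
            return False

    t = Terminology()
    t.add_axiom("Flight_booking", "Booking")
    t.add_axiom("Booking", "Service_operation")
    assert t.subsumes("Service_operation", "Flight_booking")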
3.2 Service Matching
A service mediator specifies which services are needed and how they are composed into a new plot of a composed AS². So we now need exact criteria to decide when a service matches a service slot in a service mediator. It seems rather obvious that for such a matching criterion, for all service operations in a mediator associated with a slot X we must find matching service operations in the same AS², and the matching of service operations has to be based on their functional and categorical descriptions. The guideline is that the placeholder in the mediator must be replaceable by matching service operations. Functionally, this means that the input for the service operation as defined by the mediator must be accepted by the matching service operation, while the output of the matching service operation must be suitable to continue with other operations as defined by the mediator. This implies that we need supertypes and subtypes of the specified input- and output-types, respectively, in the mediator, as well as a weakening of the precondition and a strengthening of the postcondition. Categorically, the matching service operation must satisfy all the properties of the concept in the terminology that is associated with the placeholder operation, i.e. the concept associated with the matching service operation must be subsumed by that concept. However, the matching of service operations is not yet sufficient. We also have to ensure that the projection of the mediator to a particular slot X results in a subplot of the plot of the matching AS².

Definition 6. A subplot of a plot p is a process expression q such that there exists another process expression r such that p = q + r holds in the equational theory of process expressions. The projection of a mediator m is a process expression pX such that pX = πX(m) holds in the equational theory of process expressions, where πX(m) results from m by replacing all placeholders Y : o with Y ≠ X and all conditions that are irrelevant for X by 1.

Based on this definition it is tempting to require that the projection of a mediator should result in a subplot of a matching service. This would, however, be too simple, as the order may differ and certain service operations may be redundant. We call such redundant service operations phantoms. Formally, if for a condition ϕ(x) appearing in a process expression p the equation ϕ(x) = ϕ(x) · op[y](z) holds, then op[y](z) is called a phantom of p. That is, if the condition ϕ(x)
holds, we may execute the operation op[y](z) (or not) without changing the effect. Whenever p = q holds in the equational theory of process expressions, and op[y](z) is a phantom of p with respect to condition ϕ(x), we may replace ϕ(x) by ϕ(x) · op[y](z) in q. Each process expression resulting from such replacements is called an enrichment of p by phantoms. Thus, we must consider projections of enrichments by phantoms, which leads us to the following definition.

Definition 7. An AS² A matches a service slot X in a service mediator m iff the following two conditions hold:
1. For each service operation X : o in m there exists a service operation op provided by A such that
– the input-type Iop of op is a supertype of the input-type Io of o,
– the output-type Oop of op is a subtype of the output-type Oo of o,
– preo ⇒ preop holds for the preconditions preo and preop of o and op, respectively,
– postop ⇒ posto holds for the postconditions posto and postop of o and op, respectively, and
– the concept Co associated with o in the service terminology subsumes the concept Cop associated with op.
2. There exists an enrichment mX of m by phantoms such that building the projection of mX and replacing all service operations X : o by matching service operations op from A results in a subplot of the plot of A.

Example 4. Let us look again at the simple service mediator in Example 2. We can assume that the local operation personal data[](t) has the postcondition person(t), and that this is invariant under the service operations for itinerary and hotel selection. We can further assume that in both booking services the service operation personal data[](t) is a phantom for person(t). Thus, the mediator can be enriched by phantoms, which results in:

L : personal data[](t) · (X : papers[]() ∥ X : discount[](d′) ∥ Y : get itineraries[](d) · Y : select itinerary[i]() · Y : personal data[](t) ∥ Z : get hotels[](d) · Z : select hotel[h]() · Z : select room[r]()) · Z : personal data[](t) · L : confirm[](y) · (Y : confirm flight[](y) ∥ Z : confirm hotel[](y)) · L : pay[](c) · (Y : pay flight[](c) ∥ Z : pay hotel[](c))

The projection of this process expression to the services X, Y and Z, respectively, results exactly in the three plots in Example 1.
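Condition 1 of Definition 7 can be pictured as a purely structural check. The sketch below is an assumption-laden illustration: record types are modelled as dictionaries of field names, the implication tests on pre- and postconditions and the concept subsumption test are supplied by the caller, and names such as OpSpec are ours:

    # Hedged sketch of condition 1 of Definition 7 for one slot operation o and
    # one candidate operation op. A "supertype" of a record type requires at most
    # the fields of its subtypes, so op accepts every input the mediator provides.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class OpSpec:
        inputs: Dict[str, str]     # input record type: field name -> type name
        outputs: Dict[str, str]    # output record type
        pre: str
        post: str
        concept: str

    def is_supertype(sup: Dict[str, str], sub: Dict[str, str]) -> bool:
        return all(field in sub and sub[field] == ftype for field, ftype in sup.items())

    def matches(o: OpSpec, op: OpSpec,
                implies: Callable[[str, str], bool],
                subsumes: Callable[[str, str], bool]) -> bool:
        return (is_supertype(op.inputs, o.inputs)       # I_op supertype of I_o
            and is_supertype(o.outputs, op.outputs)     # O_op subtype of O_o
            and implies(o.pre, op.pre)                  # pre_o  implies pre_op
            and implies(op.post, o.post)                # post_op implies post_o
            and subsumes(o.concept, op.concept))        # C_o subsumes C_op

Condition 2 would additionally require checking, in the equational theory of process expressions, that the projection of an enrichment of the mediator is a subplot of the candidate service's plot, which is precisely why the simpler Kleene algebra setting is preferred over richer process algebras.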
4 Conclusion
In this paper we continued our research on foundations of a theory of web-based service-oriented systems. We addressed the problem of service mediation starting
from a high-level specification of an intended service-oriented application, in which "holes" are to be filled by suitable services. This led us to the formal model of service mediators with service slots. For the slots we provide matching conditions for services, which combine functional criteria by means of types and pre- and postconditions with categorical criteria capturing the application area. Thus, the matching conditions link to an ontological description of services. With this work we add another tile to our theory, which permits the identification, search and composition of services in order to build a web-oriented application. While this is highly relevant to realising the vision of cloud and service-oriented computing on the web, there are still many open problems regarding allocation and optimised performance, selection among choices, and security and privacy. These open problems constitute the challenges for our continuing research.
References
1. Alonso, G., Casati, F., Kuno, H., Machiraju, V.: Web Services: Concepts, Architecture and Applications. Springer, Heidelberg (2004)
2. Altenhofen, M., Börger, E., Lemcke, J.: An abstract model for process mediation. In: Lau, K.-K., Banach, R. (eds.) ICFEM 2005. LNCS, vol. 3785, pp. 81–95. Springer, Heidelberg (2005)
3. Alves, A., et al.: Web services business process execution language, version 2.0. OASIS Standard Committee (2007), http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html
4. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M.: Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, US (2009)
5. Baader, F., et al. (eds.): The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, Cambridge (2003)
6. Benatallah, B., Casati, F., Toumani, F.: Representing, analysing and managing web service protocols. Data and Knowledge Engineering 58(3), 327–357 (2006)
7. Bergstra, J.A., Ponse, A., Smolka, S.A.: Handbook of Process Algebra. Elsevier Science B.V., Amsterdam (2001)
8. Börger, E., Stärk, R.: Abstract State Machines. Springer, Heidelberg (2003)
9. Brenner, M.R., Unmehopa, M.R.: Service-oriented architecture and web services penetration in next-generation networks. Bell Labs Technical Journal 12(2), 147–159 (2007)
10. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems 25(6), 599–616 (2009)
11. Christensen, E., et al.: Web services description language (WSDL) 1.1 (2001), http://www.w3c.org/TR/wsdl
12. Erl, T.: Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall PTR, Upper Saddle River (2005)
13. Gurevich, Y.: Sequential abstract state machines capture sequential algorithms. ACM Transactions on Computational Logic 1(1), 77–111 (2000)
14. Huhns, M.N., Singh, M.P.: Service-oriented computing: Key concepts and principles. IEEE Internet Computing 9, 75–81 (2005)
15. Kozen, D.: Kleene algebra with tests. ACM Transactions on Programming Languages and Systems 19(3), 427–443 (1997)
16. Kumaran, S., et al.: Using a model-driven transformational approach and service-oriented architecture for service delivery management. IBM Systems Journal 46(3), 513–530 (2007)
17. Kuropka, D., Tröger, P., Staab, S., Weske, M. (eds.): Semantic Service Provisioning. Springer, Heidelberg (2008)
18. Ma, H., Schewe, K.-D., Thalheim, B., Wang, Q.: A theory of data-intensive software services. Service Oriented Computing and Its Applications 3(4), 263–283 (2009)
19. Ma, H., Schewe, K.-D., Wang, Q.: An abstract model for service provision, search and composition. In: Kirchberg, M., et al. (eds.) Services Computing Conference APSCC 2009, pp. 95–102. IEEE Asia Pacific (2009)
20. Papazoglou, M.P., van den Heuvel, W.-J.: Service oriented architectures: Approaches, technologies and research issues. VLDB Journal 16(3), 389–415 (2007)
21. Schewe, K.-D., Thalheim, B.: Conceptual modelling of web information systems. Data and Knowledge Engineering 54(2), 147–188 (2005)
22. Schewe, K.-D., Wang, Q.: A customised ASM thesis for database transformations (2009) (submitted for publication)
23. Tanaka, Y.: Meme Media and Meme Market Architectures. IEEE Press, Wiley-Interscience, USA (2003)
24. Yara, P., Ramachandran, R., Balasubramanian, G., Muthuswamy, K., Chandrasekar, D.: Global software development with cloud platforms. In: Software Engineering Approaches for Offshore and Outsourced Development, pp. 81–95. Springer, Heidelberg (2009)
25. Zeng, W., Zhao, Y., Ou, K., Song, W.: Research on cloud storage architecture and key technologies. In: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, pp. 1044–1048. ACM, New York (2009)
Reusing Legacy Systems in a Service-Oriented Architecture: A Model-Based Analysis

Yeimi Peña, Dario Correal, and Tatiana Hernandez

University of Los Andes, Department of Systems and Computing Engineering, Cra 1E No 19A-40, Bogota, Colombia

Abstract. Nowadays, organizations face great technological and business challenges. A clear example of this situation is the transformation of businesses from a structure based on isolated areas towards one oriented on business processes. This change in business structure requires the definition and execution of business processes in which prior investments in technology are protected by the reuse of the organization's legacy systems. In this proposal, we present a model-based strategy (MDE) to analyze the viability of the reuse of existing legacy systems during the design and implementation phase of a service-oriented architecture.
1 Introduction
Modern organizations establish business strategies that require agile and flexible solutions that promote the integration and reuse of their legacy systems. Service-oriented architectures (SOA) [1] have emerged as an alternative to face this challenge; however, the adoption of a SOA brings some risks and difficulties associated with the lack of clarity on how to determine the viability of, and the effort required for, the reuse of the organization's legacy systems. In this article, we present a tool that assists software architects in the analysis of the reuse of legacy systems in SOA architectures. This analysis allows technicians to identify the correspondence between the functionalities offered by legacy systems and the functionalities required by the candidate SOA services. Our proposal is based on a model-driven engineering (MDE) approach and the use of Domain-Specific Languages (DSL) to represent and analyze candidate SOA services and the organization's legacy systems. Throughout this paper we present our strategy in the following way: in section two we present a case study to illustrate our proposal; in section three we present the metamodel and the DSL proposed in our tool; in section four we use the case study to explain how our analysis strategy and tool work; in section five we present the most relevant related works; finally, in section six we present the conclusions and future work.
2 Case Study
In this section, we present an example based on a simplified version of an accident notification process of an insurance company. With the example, we will
introduce some of the concepts and notation used throughout the paper. We will come back to the example in section 4 to illustrate the details of our proposal. In this scenario, an insurance company has, among others, the following legacy systems: a system that handles all the information on sold insurance policies, a CRM application used to register customer information, and a system to register each accident the company must pay. Currently, the accident registration business process presents redundancy problems and information inconsistencies, as well as delays in customer attention, because of the lack of integration between the aforementioned systems. Due to this, the company has decided to adopt a SOA strategy to enhance and automate the accident registration business process. An important premise during the adoption of this strategy is to maximize the reuse of the existing legacy systems. As part of the new SOA strategy adopted, the company has defined a new version of the accident registration process. In Figure 1, we use the Business Process Modeling Notation (BPMN) to present a high-level version of it.
Fig. 1. The Accident Registration Business Process
The moment the company's contact center receives a call, the customer's information is consulted from the CRM, and the validity and term of the policy are consulted from the policy system. If the information is valid, the accident's preliminary information is registered in the accident registration system. After this registration, the accident department assigns a representative who travels to the accident's location for verification. After verification, the representative registers an initial damage evaluation in the accident registration system. The Claim Department analyzes the recorded information and authorizes or rejects the vehicle's repairs. The candidate architecture for the accident registration process is defined with a service portfolio and a system ecosystem [1,2], which are presented in Figure 2. Both the portfolio and the ecosystem are modeled following the notation suggested by the Service Oriented Modeling Framework (SOMF) [3]. In SOMF an atomic service is represented by a rhombus, a composite service by a rectangle, a consumer by a triangle, and the connections among these elements are represented by directed lines. The service portfolio allows the description and qualification of the organization's services in a service catalog to ease its governance. For each service defined
in the portfolio, we explicitly define the operations, messages, contracts, interfaces, and properties that describe the service [4]. In Figure 2(a), we present a subset of the service portfolio of the insurance company. In this portfolio, the CarAccidentService, CarInsuranceService, InformationClientService, AuthorizationService and AssignMobileRepresentativeService services are defined.
Fig. 2. Partial view of a Portfolio and an Ecosystem for the Insurance Company scenario: (a) Portfolio; (b) Ecosystem
The service ecosystem defines the dynamic orchestration and choreography [1] relations between services during their execution, as well as the control flow from the service consumers to the legacy systems. The ecosystem is formed by zones, i.e. logical and independent groupings that participate in the solution. In the example presented in Figure 2(b), the following zones are defined: Consumers Zone, Process Zone, Services Zone, and Providers Zone. The consumers zone groups the different consumers that consume architecture services through the business process activities. For our case study we use an Accident Department consumer. The process zone defines the business process involved in the ecosystem scenario. For our case study we use the accident registration process. The services zone relates the services required by the business process activities, which are previously defined in the service portfolio. In our case study we defined the following services: CarAccidentService, CarInsuranceService, InformationClientService, AuthorizationService and AssignMobileRepresentativeService. The providers zone describes the legacy systems [5] that support the functionalities required by the services. In our case study the following legacy systems are defined: CRM, Accidents, and InsuranceSystem.
3 A Model-Driven Approach for Service-Oriented Analysis and Design
We based our approach on two main technologies: Model Driven Engineering (MDE) and Domain-Specific Languages (DSL). MDE is a concept that proposes
the use of models that facilitate the separation of concerns using several levels of abstraction. In MDE, the goal is to state a problem, search for a solution, and test it, not on software artifacts but on models related to the problem domain. Through the use of DSLs we express, in terms known to the domain experts, whole programs or fragments thereof that would otherwise be expressed in traditional programming languages, making them incomprehensible to the stakeholders. Thus, domain-specific languages (DSL) make the comprehension of programming code easier for persons with expertise in a particular domain who do not necessarily have informatics knowledge. In our proposal, we use DSLs and models in the following way. First, we use a DSL with two main objectives: 1) facilitate the description of a SOA in terms of the service portfolio and ecosystems, and 2) describe the organization's legacy systems. Second, these two descriptions are interpreted and transformed into a model that represents the candidate SOA. The latter is then used to do the reuse analysis. Next, we describe each of these steps in detail.
3.1 DSL-SOA: A Domain Specific Language for SOA
In our proposal, we created a DSL called DSL-SOA, which allows software architects to express in a detailed way the services to be used in a SOA style. Additionally, the language facilitates the description of the organization's legacy systems in terms of their capabilities [6,1]. This DSL adopts some of the terminology of WSDL and incorporates concepts specific to the SOA domain, thus allowing the definition of services other than web services. DSL-SOA was created to provide a familiar language to software architects.
Portfolio InsurancePortfolio type=Static
  services [
    atomic InformationClientService context = Business
      requestMessages [ RqMCreateClient input (name type=string, required=true, many=false) ]
      responseMessages [ RsMConsultClient output (Client type=object, many=false) ]
      interface { IClient
        interfaceOperations { createClient input=RqMCreateClient, output=RsMConsultClient,
                              scope=Entity, functionCRUD=CREATE } }
      contract [ CClient ]
      endPoints [ EPCreateClient, binding { BDCreateC interface=IClient, operation=createClient } ];
    atomic CarAccidentService context = Business
      requestMessages [
        RqMCreate input (idClient type=integer, required=true, many=false;
                         idCar type=integer, required=true, many=false;
                         place type=string, required=true, many=false;
                         ADate type=date, required=true, many=false);
        RqMConsultAccident input (idAccident type=integer, required=true, many=false) ]
      responseMessages [
        RsMCreateAccident output (validate type=boolean, many=false);
        RsMConsultAccident output (accident type=object, many=false) ]
      interface { ICarAccidentService
        interfaceOperations { create input=RqMCreate, output=RsMCreate,
                              scope=Entity, functionCRUD=CREATE;
                              consultAccident input=RqMConsultAccident, output=RsMConsultAccident,
                              scope=Entity, functionCRUD=RETRIEVE } }
      contract [ CCarAccidentService ]
      endPoints [ EPCreateAcc, binding { BDCreateAcc interface=ICarAccidentService, operation=createAccident } ] ]
Ecosystem EcosystemSOA type = Dinamic
  ConsumerZone consumerZone { consumers [ AccidentDepartment ] }
  ProcessZone processZone { Activities [ RegisterInitialAccident, RegisterInitialEvaluation ] }
  ServicesZone servicesZone { services [ InformationClientService, CarAccidentService ] }
  ProvidersZone providersZone { providers [ CRM, Accidents, InsuranceSystem ] }
  connections ( RegisterInitialAccident-TO-CreateAccident typeConnection=unidirectional,
                sourceActivity=RegisterInitialAccident, target=EPCreateAcc )
Listing 1.1. Definition of a service portfolio and an ecosystem using DSL-SOA
Modeling Portfolios and Ecosystems using DSL-SOA. A service portfolio represents an organized catalog of all the organization's services. With the use of DSL-SOA, it is possible to define the organization's different services, indicating properties such as: context (business, data, utility, etc.), the request and response messages to which the service responds, the business interfaces offered by the service, and the service levels to which the service is committed. Similarly, DSL-SOA allows the definition of service ecosystems, in terms of the zones that form them and the orchestration needed to support a business process. To illustrate the use of DSL-SOA, in Listing 1.1 we present the definition of the service portfolio and ecosystem presented in Figure 2 in Section 2. The service portfolio is defined in the first part of the listing and the ecosystem in the second part.

Defining Legacy Systems using DSL-SOA. In our approach, we consider several aspects while modeling a legacy system. First, it is necessary to explicitly define the capabilities provided by the system. A capability is the functionality offered by a legacy system, and it is determined by its inputs and outputs [1]. Second, we need to identify the technological platform used by the application to execute the capability (e.g. Java, .NET). Finally, we need to identify the communication style (synchronous or asynchronous), the processing style (RPC, event-oriented, batch-oriented, etc.), and the number of concurrent users it supports. In Listing 1.2 we present an example of the definition of two legacy systems for the accident registration process: CRM and Accidents.
LegacySystems
  CRM artefact (
    Legacy CRM type=Component
    description="Create and consult information client",
    capabilities [
      capability createClient,
        inputs (name type=string, required=true),
        outputs (idClient type=integer, many=false),
        functionCRUD=CREATE ],
    restrictions [
      Restriction: term="BatchOriented" value=true;
      Restriction: term="ModeConnection" value="asynchronous" ] )
  Accidents artefact (
    Legacy Accidents type=Component
    description="Create and consult accidents",
    capabilities [
      capability createAccident,
        inputs (idClient type=integer, IdCar type=integer, accidentPlace type=string, ADate type=date),
        outputs (cliente type=boolean, many=false),
        functionCRUD=CREATE;
      capability consult,
        inputs (idAccident type=integer, idClient type=integer),
        outputs (insurance type=object, many=false),
        functionCRUD=RETRIEVE ],
    restrictions [
      Restriction: term="State" value="Stateless";
      Restriction: term="ModeConnection" value="asynchronous" ] )
Listing 1.2. Definition of the CRM legacy system using DSL-SOA
3.2 Archivol: A Software Architecture Meta-model
As the second part of our proposal, we built an architectural metamodel called Archivol [7]. By using Archivol, a software architect can create a model that
represents an architectural solution with a SOA style. Archivol permits the modeling of aspects like a service portfolio and solution ecosystems. Additionally, this metamodel allows the modeling of the organization's legacy systems and their relations with the proposed services of the candidate architecture. Accordingly, the definitions made using DSL-SOA are interpreted and transformed into a model that conforms to the Archivol metamodel. To illustrate this idea, in Figure 3 we present the concept Architectural Element of the Archivol metamodel. This concept permits the modeling of a portfolio composed of services from the architectural model. This model is automatically created from the portfolio and ecosystem definitions given in Listing 1.1.
Fig. 3. Example of the Archivol metamodel instantiation
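As an illustration of this DSL-to-model step (the actual tool works on the Archivol metamodel and ATL transformations; the classes and the parsed input format below are hypothetical), a portfolio definition produced by a DSL-SOA parser could be turned into model objects as follows:

    # Illustrative only: mapping parsed DSL-SOA declarations to in-memory model
    # objects, in the spirit of the DSL-to-Archivol transformation.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Operation:
        name: str
        crud: str

    @dataclass
    class Service:
        name: str
        operations: List[Operation] = field(default_factory=list)

    @dataclass
    class Portfolio:
        name: str
        services: List[Service] = field(default_factory=list)

    def build_portfolio(parsed: dict) -> Portfolio:
        """parsed: a dictionary produced by a DSL-SOA parser (format assumed here)."""
        portfolio = Portfolio(parsed["name"])
        for svc in parsed.get("services", []):
            portfolio.services.append(
                Service(svc["name"],
                        [Operation(op["name"], op["functionCRUD"])
                         for op in svc.get("operations", [])]))
        return portfolio

    model = build_portfolio({
        "name": "InsurancePortfolio",
        "services": [{"name": "CarAccidentService",
                      "operations": [{"name": "create", "functionCRUD": "CREATE"},
                                     {"name": "consultAccident", "functionCRUD": "RETRIEVE"}]}],
    })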
4 Analyzing Legacy Systems Reuse
Once the SOA model is created from the DSL-SOA, the next phase in our proposal is the reuse analysis of the architectural model. Currently, our analysis is carried out in two steps. In the first one, the legacy systems are analyzed to detect possible risks in their utilization. In the second, the operations offered by the services are analyzed against the capabilities offered by the legacy systems to determine whether the legacy systems can support the implementation of the services of the candidate SOA. The objective of this analysis is to provide the software architect with the information required to propose mitigation strategies for detected risks and to determine which services of the candidate SOA can be exposed with the existing legacy systems.
4.1 Step 1 - Legacy Systems Risk Analysis
The objective of this first analysis is to identify the risks associated with the use of legacy systems in the SOA architecture. In SMART [6], SEI defined a set of risks that must be identified when implementing SOA architectures that
use legacy systems. In our analysis, two of these risks must always be identified: risks associated with the use of batch-oriented systems and risks associated with the system's communication type. We will refer to the first type of risk as R1 and to the second type as R2. Additionally, the software architect can define other risks that this analysis must identify. For example, he/she can define that the analysis should identify risks associated with the use of stateful Java components in the legacy systems. Risks are described as restrictions of the SOA candidate architecture using DSL-SOA. In Listing 1.3 we present an example of this description.

RestrictionsSOA
  styleConnection = asynchronous
  styleProcessing <> batchOriented
  customerRestriction [ "ComponentJava"="stateless" ]
EndRestrictionsSOA
Listing 1.3. Definition of restrictions of the SOA candidate architecture using DSL-SOA
This analysis is performed by comparing the candidate architecture restriction model with the legacy system model. The legacy system model is searched for restriction elements whose name matches the name of the restriction model elements. When a match is found, the values of the restrictions are compared. The result of this analysis is saved in a legacy system model that conforms to the Archivol metamodel and is presented to the user in a SOA Restrictions report. We perform the analysis on the models using ATL [8] rules. For the CRM system example, the result of the risk analysis is presented in Figure 4, which shows an image of a SOA Restrictions report. In this report the list of business and/or technical restrictions of the legacy system is presented, and each restriction is evaluated as not a risk (Apply) or as a risk (Not apply). The software architect must take into account the restrictions identified as risks and propose either not to use the legacy system or to create intermediary services.
Fig. 4. Example Report SOA Restrictions
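The comparison itself can be pictured with a few lines of Python (the tool performs it with ATL rules on the models; this sketch and its data layout are only illustrative):

    # Sketch of the Step-1 comparison: each restriction of the legacy system is
    # checked against the corresponding restriction of the candidate architecture.
    def risk_report(architecture: dict, legacy: dict) -> dict:
        """For each restriction of the legacy system, 'Apply' means compatible with
        the candidate architecture, 'Not apply' means a risk to be mitigated."""
        report = {}
        for term, value in legacy.items():
            required = architecture.get(term)
            report[term] = "Apply" if required is None or required == value else "Not apply"
        return report

    # CRM example from Listing 1.2: batch-oriented processing violates the
    # candidate architecture's restriction, so it is flagged as a risk.
    arch = {"ModeConnection": "asynchronous", "BatchOriented": False, "ComponentJava": "stateless"}
    crm  = {"ModeConnection": "asynchronous", "BatchOriented": True}
    print(risk_report(arch, crm))   # {'ModeConnection': 'Apply', 'BatchOriented': 'Not apply'}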
4.2 Step 2 - Correspondence Analysis
The next step consists in identifying the legacy systems capabilities that can be reused. In our current implementation, we are capable of identifying the reuse
capabilities of CRUD services (Create, Read, Update, Delete). This identification is done adhering to the following principles:

1. For each service operation defined in the service interface, a legacy system capability search is done. We have a match if the CRUD operation defined by the service interface and the capability in the legacy system both affect the same business entity (e.g. Client) in the same way (e.g. Update).
2. After identifying all the capabilities, we validate the number of in/out parameters of the operation against the number of in/out parameters of the capability exposed by the legacy system.
3. With the filtered capabilities, we validate the data types of the in/out parameters of the operation against the data types of the in/out parameters of the capability exposed by the legacy system.

To illustrate this, consider again our case study. In the portfolio, the CarAccidentService service was defined with two operations: one for creating an accident and the other for searching for an accident. We also have the legacy system Accidents, which offers the capabilities of registering and querying accidents. In Table 1, we present an example of the results obtained with the correspondence analysis. After the validation we can conclude that the create operation of the CarAccidentService service can be supported directly by the createAccident capability of the Accidents legacy system. The opposite holds for the consultAccident operation which, although passing the first validation, has a different number of input parameters and different data types. If the correspondence analysis does not match any service operation with any legacy system capability, the software architect needs to enrich the description of the candidate SOA and the legacy systems.
Table 1. Example of the results obtained with the correspondence analysis
Operation - Capability: create(int idClient, int idCar, string place, date ADate) (CREATE) vs. createAccident(int idClient, int idCar, string accidentPlace, date ADate) (CREATE)
Correspondence results:
  CRUD operation: both are CREATE
  Number of inputs: both have 4
  Parameter types: equal

Operation - Capability: consultAccident(int idAccident) (RETRIEVE) vs. consult(int idAccident, int idClient) (RETRIEVE)
Correspondence results:
  CRUD operation: both are RETRIEVE
  Number of parameters: service has 1 and legacy has 2
  Parameter types: not compared
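The three correspondence levels can be sketched as follows; the Signature class and the string results are our own simplification of the analysis, shown here only to make the principles concrete:

    # Sketch of the three-level correspondence check of Section 4.2 (names are ours):
    # same CRUD function on the same entity, same number of parameters, same types.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Signature:
        crud: str                      # CREATE / RETRIEVE / UPDATE / DELETE
        entity: str                    # business entity affected, e.g. "Accident"
        params: List[Tuple[str, str]]  # (name, type) of the inputs

    def correspondence(op: Signature, cap: Signature) -> List[str]:
        results = []
        if op.crud == cap.crud and op.entity == cap.entity:
            results.append(f"CRUD operation: both are {op.crud}")
        else:
            return ["CRUD operation: no match"]      # filtered out at level 1
        if len(op.params) == len(cap.params):
            results.append(f"Number of inputs: both have {len(op.params)}")
            same_types = all(a[1] == b[1] for a, b in zip(op.params, cap.params))
            results.append("Parameter types: equal" if same_types else "Parameter types: differ")
        else:
            results.append(f"Number of inputs: service has {len(op.params)}, legacy has {len(cap.params)}")
        return results

    create_op  = Signature("CREATE", "Accident",
                           [("idClient", "int"), ("idCar", "int"), ("place", "string"), ("ADate", "date")])
    create_cap = Signature("CREATE", "Accident",
                           [("idClient", "int"), ("idCar", "int"), ("accidentPlace", "string"), ("ADate", "date")])
    print(correspondence(create_op, create_cap))   # reproduces the first row of Table 1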
5 Related Work
In our research we have considered two main works: SMART [6], proposed by the Software Engineering Institute (SEI), and SOAMIG, from the Koblenz-Landau University [9]. The SMART approach supports the analysis of legacy system reuse. Its objective is to help organizations in the decision-making process about the viability of reusing components to expose them as SOA services. From SMART we have taken the documented risks identified for legacy system reuse, and we followed a similar strategy when analyzing the relations between the services' contracts and the legacy systems' descriptions. Unlike SMART, our analysis is based on Archivol, which allows us to individually model the legacy systems' technology and the SOA architecture. SOAMIG is a modeling technique for SOA adoption and allows a bottom-up analysis using tools like TGraphs and GReQL to identify the possible services offered by the candidate SOA. One limitation of this approach is that an analysis of the candidate architecture is not done, limiting its service discovery to the information that exists in the legacy systems. This prevents new requirements or business process needs from being included in the candidate architecture. Furthermore, the tools that support SOAMIG are commercial, generating an additional cost for the analysis.
6 Conclusions and Future Work
In this work we focused on supporting the reuse of legacy systems in service architectures. Our solution strategy was based on the use of models and domain-specific languages to model both the services designed to support the business processes and the organization's legacy systems. Our strategy permitted an analysis of the operations of those services and the capabilities offered by the legacy systems in order to detect risks present in the SOA solution. Additionally, with our approach, correspondence analyses were carried out to determine which capabilities were most appropriate to be exposed as services according to the defined portfolio. In the current state of our work, the analysis is limited to services related to CRUD operations. This limit is due to the project's scope and time constraints. In conclusion, we can say that our objective was partially accomplished. From the technological point of view, the use of models and the analysis carried out on them proved to be an effective strategy. However, the correspondence analysis requires that a complete description of the service portfolio and the legacy systems be made. This implies the adoption of a strict documentation discipline. To study this requirement our case study was deployed in the Software Department's architecture laboratory (at Universidad de los Andes) and made available to students taking software architecture courses. From this exercise we found that software architects usually do not make complete descriptions. Therefore, despite the fact that existing legacy systems could provide the required capabilities, they were not correctly identified by the correspondence analysis. In
the cases where the services were correctly documented, positive results were obtained in the correspondence analysis. As part of our future work, our strategy needs to be expanded to extend the correspondence analysis to more complex functionalities than CRUD. Additionally, making use of the risk and correspondence analyses, we would like to suggest to the software architect solutions based on service-oriented patterns that permit risk mitigation and increase the use of legacy systems.
References
1. OASIS: Reference architecture for service oriented architecture version 1.0. Technical report (April 2008)
2. Cruz, D., Correal, D., Peña, Y.: Estrategia para gobernabilidad en arquitecturas orientas a servicios dirigida por MDE. In: XXXV Latin American Informatics Conference (September 2009)
3. Bell, M.: Service Oriented Modeling: Service Analysis, Design and Architecture. John Wiley and Sons, Inc., Chichester (2008)
4. Erl, T.: SOA Principles of Service Design. Prentice Hall/Pearson PTR (2008)
5. Rojas, N.: Conviviendo con sistemas legados. Revista Electronica Paradigma 1(1) (2007)
6. Lewis, G.A., Morris, E.J., Simanta, S.: SMART: Analyzing the reuse potential of legacy components in a service-oriented architecture environment. Technical report (June 2008)
7. Archivol group, MOdel Oriented Software ArchitectureS (MOOSAS), University of Los Andes, Colombia: Archivol metamodel page (2009), http://moosas.uniandes.edu.co/
8. ATLAS group (INRIA), University of Nantes, France: ATL project page (2010), http://www.eclipse.org/m2m/atl/
9. Horn, T., Fuhr, A., Winter, A.: Towards applying model-transformations and -queries for SOA-migration. In: Congress on Services - I 2009 (July 2009)
Intelligent Author Identification

Qing Wang¹ and René Noack²
¹ University of Otago, Dunedin, New Zealand, [email protected]
² Christian-Albrechts-University Kiel, Germany, [email protected]
Abstract. This paper addresses a fundamental problem in the development of digitalising the scientific contributions of individuals – the author identification problem. Instead of proposing an accurate and complete approach to identifying authors in an open-world domain, which seems hardly attainable, we aim to develop knowledge-based identification for authors by establishing an identity layer between a conceptual layer and a view layer. With the evolving knowledge acquired from different communities, a visual model built upon the conceptual and identity layers is adaptive, such that the degree of accuracy and completeness of author identification can be improved over time.
1 Introduction
Author identification is a long-standing problem in the area of institutional repositories, scientific communities, etc. In the past decades, a great number of working groups [9] have been set up at various levels – international, national or community-based – to explore possible solutions. Despite the vast amount of effort, no satisfactory solution has yet been found. This severely hinders the capabilities of providing bibliometric analysis for the scientific contributions of individuals, personalising content and services in social networks, integrating applications with assured quality, etc. A common approach towards the author identification problem is to assign each author a unique identifier [1,6]. This approach usually works well when application domains are small. In an open-world domain, however, this approach would lead to the egg-and-chicken controversy – assigning a unique identifier to an author first or identifying an author uniquely first. An alternative, driven by library communities such as the International Federation of Library Associations and Institutions (IFLA), is to use authority files [3,5]. As this approach leaves users to select a desired authority record from potential matching results, the accuracy of author identification is a concern, particularly when disambiguating popular names such as Stephen Smith. Some repositories also prefer to directly contact authors and ask for confirmation of their identity. The difficulty they meet is how to deal with people who do not reply or who are deceased. In light of the various challenging issues surrounding the author identification problem, we intend to address the following questions in this paper.
– Since it is difficult to identify authors in an open-world domain, can we find a way of establishing knowledge on author identification which will become more accurate and complete over time?
– As identifying authors often relies on community-based knowledge, can we find a way of efficiently and effectively sharing the knowledge on author identification across different communities?

In this paper we propose a multi-layer architecture to incorporate knowledge-based identification into a database system. The goal is to establish a more realistic and natural perspective on the modelling of authors – an author may be viewed differently in different communities at different times. A traditional perspective on data modelling differentiates levels of abstraction by considering a model at a conceptual, logical or physical layer. A conceptual layer emphasises representing the real world; a logical layer describes the database schema; and a physical layer implements the database schema. In addition to these, a view layer may be built upon a data model to provide customised information. To enable knowledge-based identification, we generalise this architecture by abstracting the knowledge on author identification into an independent identity layer and adding an adaptive visual layer lying between a conceptual layer and a view layer. More precisely, we have
– a conceptual layer that models primitive objects in the print world and the like;
– an identity layer that manages the knowledge in relation to the identity of authors;
– a visual layer that presents dynamic objects by utilising a flexible binding of knowledge at an identity layer and objects at a conceptual layer.

In the remainder of this paper we give the motivation in Section 2. Then we revise the definition of the Higher-Order Entity-Relationship model in Section 3. In Section 4 an identity layer is introduced, which consists of a ground model and a set of refining relation tuples. Section 5 presents a visual layer built upon the conceptual and identity layers. We discuss possible author identification services in Section 6 and conclude the paper in Section 7.
2 Author Tagging in Open-World Domains
In an application domain every object is assumed to possess some properties that can uniquely distinguish it from the others. These properties are often conceptualised by a set of key attributes in a data model. However, when an application domain is open (i.e., an open-world domain), the conceptualisation of objects is subject to the availability of information. It may lead to indiscernible objects whose properties do not suffice to uniquely identify them within an application domain. Therefore, the author identification problem we encounter has its roots in modelling authors in open-world domains with very limited author information.

Example 1. Let us consider the following publications and the question of whether the authors named Qing Wang are the same person. According to the similarity of
publication titles, they might be the same person. By comparing their affiliations, they might be different persons. Since it is possible that an author changes his/her subjects and affiliations over time, it is hardly possible to disambiguate the identity of authors with merely the information provided with the publications.

– FIXT: A Flexible Index for XML Transformation by Jianchang Xiao, Qing Wang¹, Min Li and Aoying Zhou (1: Fudan University, China)
– XML Database Transformations with Tree Updates by Qing Wang², Klaus-Dieter Schewe and Bernhard Thalheim (2: Massey University, New Zealand)

Due to the ambiguity of authorship, a dilemma with author tagging inevitably arises. When authors of different publications refer to the same person in the real world but are tagged into different objects in a data model, the problem of incompleteness occurs. Conversely, when the authors of different publications refer to different persons in the real world but are tagged into the same object in a data model, the problem of inaccurateness occurs.

Example 2. Let us look back at the publications in Example 1 and the query "list all the papers by Qing Wang who wrote the paper named XML Database Transformations with Tree Updates". Assume that the authors named Qing Wang in these publications are different persons but are tagged to the same author object. Then the query result would not be accurate, as it includes the first paper, which belongs to a different person. Similarly, assume that the authors named Qing Wang in these publications are the same person but are tagged to two different author objects. Then the query result would not be complete because the first paper is not included.

Consequently, the incompleteness and inaccurateness of author tagging would affect the manageability, traceability, interoperability, quality of analysis, etc. of an application. More specifically, if the knowledge behind decisions on author tagging is not kept, it would be very difficult to detect and correct mistakes hidden in the system. Moreover, it would be impossible to trace back the reasons for making mistakes and thus to avoid them in the future. When exchanging information across applications, mistakes can easily spread around, which would raise concerns about the quality of service. Furthermore, without accurate information stored in the system, the results of applying analytical tools would be diluted. Author-level metric tools (e.g., citations, h-index, etc.) would be either overestimated or underestimated.
3 Conceptual Layer
In the common practice of conceptual modelling, objects are modelled with the unique-key-value property such that two objects sharing the same values on all key attributes are meant to be the same object. As exemplified in the previous section, it is inevitable to have objects existing in an open-world domain whose known properties are not sufficient to uniquely identify themselves, so we have to remove the unique-key-value property in conceptual modelling. Then the question of how to uniquely distinguish objects in an open-world domain arises. For
this, a straightforward approach is to use object identifiers. Their representation is not important but the interrelationship among them does matter. We revise the definition of Higher-order Entity-Relationship Model (HERM) [7,10]. There are four kinds of objects: entities, relationships, clusters and collections. Every object is associated with a unique identifier.
Definition 1. Let D = {Di}i∈I be a fixed family of basic domains, O be the universal set of identifiers, D = ⋃i∈I Di, and D ∩ O = ∅.
– An entity type τE (or relationship type on level 0) consists of a finite nonempty set of attributes: attr(τE ) = {A1 , . . . , Am } and a domain assignment: dom : attr(τE ) → D. An entity of type τE is a pair (i, e) with an identifier i ∈ O, and a mapping e : attr(τE ) → D with e(A) ∈ dom(A) for all A ∈ attr(τE ). – A relationship type τR on level k + 1 consists of a finite non-empty set comp(τR ) of object types in which each has level at most k and at least one must have level exactly k, a finite set of attributes: attr(τR ) = {A1 , . . . , Am } and a domain assignment dom : attr(τR ) → D. A relationship of type τR is a pair (i, r) with an identifier i ∈ O and a mapping r : comp(τR ) ∪ attr(τR ) → O ∪ D with r(τ ) ∈ O for all τ ∈ comp(τR ) and r(A) ∈ dom(A) for all A ∈ attr(τR ). – A cluster type τC = τ1 ⊕ · · · ⊕ τn consists of a finite, non- empty set of object types τ1 , ..., τn . A cluster of type τC is an object of type τi where i ∈ [1, n]. – A collection type τL has a single object type τ . We denote a list-type by τL = [τ ], a set-type by τL = {τ } and a bag-type by τL = τ . A collection of type τL is a finite list (finite set, finite bag, respectively) of objects of type τ . A conceptual model M is a pair U, O such that U = D ∪ O is the base set of M and O = E ∪ R ∪ C ∪ L is the set of objects in M , where E, R, C and L represent a finite set of all entities, relationships, clusters and collections in M , respectively. To serve our purpose effectively, the conceptual modelling process should obey the following principles. Firstly, data which may have variant expressions should be modelled as an object rather than a representation of a finite set of values. An object-based view on these data can empower us to establish a flexible abstraction level handling the identity of authors, whereas a value-based view cannot. For example, when modelling authors as objects, we may define an identity relation among author identifiers. In doing so, different objects modelled for an author are interconnected via the identity relation, in which each object is allowed to have a variant of representation for the author. In contrast with this, modelling an author in terms of a set of values leaves us an oversimplified choice which ignores the diversity of objects and thus cannot always be correct – either treating authors represented by the same set of values as the same person or treating them as being different.
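As an illustration only, the object kinds of Definition 1 can be rendered as plain data structures. The Python sketch below is our own encoding (class and field names are not taken from the paper) and ignores domain assignments, levels and clusters for brevity.

from dataclasses import dataclass
from typing import Any, Dict, List

OID = str  # identifiers: their representation is irrelevant, only identity matters

@dataclass
class Entity:                     # relationship type on level 0
    oid: OID
    attributes: Dict[str, Any]    # attribute name -> value in dom(A)

@dataclass
class Relationship:               # relationship type on level k+1
    oid: OID
    components: Dict[str, OID]    # component type -> identifier of the component object
    attributes: Dict[str, Any]

@dataclass
class Collection:                 # list-, set- or bag-typed collection of objects
    oid: OID
    kind: str                     # "list", "set" or "bag"
    members: List[OID]

# An author entity and an authorship relationship connecting it to a publication.
wang = Entity("author#2", {"FName": "Qing", "SName": "Wang"})
authorship = Relationship("authorship#7",
                          {"Author": wang.oid, "Publication": "pub#12"},
                          {"Order": 1})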
[Figure: an ER diagram whose recoverable labels are Content (Title, Volume, IssueNo), Authorship (Order), Author (Fname, Sname, Othername, Email), { }, Book Name, Journal Name, Conference Name, Others, Affiliation (Name).]
Fig. 1. A simple conceptual model
Example 3. Fig. 1 provides a conceptual model for publications, which is simplified for clarity. Publication meta-data such as affiliations, books, journals and conferences are modelled as objects because capturing their variants is of interest. Other publication meta-data that can be interpreted without ambiguity, such as page numbers, volumes, issues, etc., are modelled as values. The second principle is the separation of concerns between modelling objects and identifying objects. Taking author tagging as an example, authors of publications who appear to be the same person should still be modelled as different objects. Meanwhile, the knowledge of why they might be the same person should be captured at a different layer, the identity layer, which will be discussed in Section 4. In doing so, a conceptual model may serve as a faithful reflection of the print world and the like, while all the knowledge on interrelationships of objects is managed at the identity layer to enable a flexible and continuous adaptation of knowledge that can cope with an evolving application domain.
4 Knowledge-Based Identification
The knowledge-based identification resides at an identity layer consisting of a ground model and a set of refining identity tuples. By the combined effects of a ground model and a set of refining identity tuples, the knowledge of author identification can be efficiently handled at two different levels: the structural and the instance-based level. Let Oτ be a set of identifiers of type τ. Then all identity relations discussed in this section are defined as: Oτ × Oτ ⇒ {true, false}. Furthermore, we will use Abstract State Machines (ASMs) [4] to establish models in this section.
4.1 Ground Model
The intuition behind a ground model is to establish an approximate identity relation E A based on the general knowledge of finding the variants of an author. Each piece of general knowledge is represented by a generic query returning pairs of identifiers. A generic query must respect the genericity principle [2,8] and is only
concerned with structural properties of author variants; otherwise, a query is said to be non-generic.
Definition 2. A ground model consists of a set {Q1, ..., Qk} of generic queries such that
par
  forall x with x ∈ O do E A (x, x) := true enddo
  forall x1, x2 with E A (x1, x2) do E A (x2, x1) := true enddo
  forall x1, x2 with Q1 (x1, x2) ∨ ... ∨ Qk (x1, x2) do E A (x1, x2) := true enddo
par
Since generic queries may use the intermediate results in E A, a ground model is defined after iterating the above rule until a fixpoint of E A is reached.
Example 4. Suppose that we have the following relational schemata for the objects Affiliation, AffiliationSet, Author and Authorship shown in Fig. 1:
– Affiliation = {ID, Name};
– AffiliationSet = {ID, {AffiliationID}};
– Author = {ID, FName, SName, OtherName, Email, AffiliationSetID};
– Authorship = {ID, Order, PublicationID, AuthorID},
and the generic queries Qn, Qc, Qa and Qe representing the following rules:
1. Affiliation-rule: two affiliations are identical if they have the same name.
   Qn (x, y) := ∃z. Affiliation(x, z) ∧ Affiliation(y, z)
2. AuthorCollaboration-rule: two authors are identical if they have the same name and both have co-authored with at least one other author.
   Qc (x, y) := ∃x1, x2, y1, y2, z1, z2, z3, z, z1′, z2′, z3′, z′. E A (z, z′) ∧ Authorship(x1, x2, z3, x) ∧ Authorship(z1, z2, z3, z) ∧ Authorship(y1, y2, z3′, y) ∧ Authorship(z1′, z2′, z3′, z′)
3. AuthorAffiliation-rule: two authors are identical if they have the same name and both are associated with the same affiliation.
   Qa (x, y) := ∃x1, x2, x3, x4, x5, x6, x4′, x5′, x6′, z, z′. Author(x, x1, x2, x3, x4, x5) ∧ AffiliationSet(x5, x6) ∧ Author(y, x1, x2, x3, x4′, x5′) ∧ AffiliationSet(x5′, x6′) ∧ z ∈ x6 ∧ z′ ∈ x6′ ∧ E A (z, z′)
4. AuthorEmailaddress-rule: two authors are identical if they have the same email address.
   Qe (x, y) := ∃x1, x2, x3, x4, x5, x1′, x2′, x3′, x5′. Author(x, x1, x2, x3, x4, x5) ∧ Author(y, x1′, x2′, x3′, x4, x5′)
The approximate identity relation E A defined by this ground model would be
par
  ...
  forall x, y with Qn (x, y) ∨ Qc (x, y) ∨ Qa (x, y) ∨ Qe (x, y) do E A (x, y) := true enddo
par
A ground model abstracts the general knowledge on author identification; however, some accuracy problems still exist: (i) partial applicability, e.g., people change their surnames after marriage or inconsistently use name abbreviations in their publications; (ii) partial correctness, e.g., two persons who have the same name work in the same affiliation or both co-author with another person.
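Assuming a set-of-pairs representation of E A and generic queries given as functions, the iteration of Definition 2 up to a fixpoint could be sketched in Python as follows; the email-based query stands in for Qe and works over an invented data layout.

def ground_model(identifiers, queries):
    E_A = {(x, x) for x in identifiers}            # forall x: E_A(x, x)
    while True:
        new = set(E_A)
        new |= {(y, x) for (x, y) in new}          # symmetry
        for Q in queries:                          # Q1(x1, x2) ∨ ... ∨ Qk(x1, x2)
            new |= set(Q(new))                     # queries may use the intermediate E_A
        if new == E_A:                             # fixpoint reached
            return E_A
        E_A = new

# An email-based rule in the spirit of Qe over hypothetical author data.
emails = {"a1": "[email protected]", "a2": "[email protected]", "a3": "[email protected]"}

def Qe(E_A):
    return [(x, y) for x in emails for y in emails
            if x != y and emails[x] == emails[y]]

print(ground_model(emails.keys(), [Qe]))   # includes ('a1', 'a2') and ('a2', 'a1')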
4.2 Stepwise Refinement
As a ground model cannot handle exceptional cases in author identification, we need an approach to refine it. The idea is to use two refining relations that work in opposite directions and capture specific knowledge on author identification, called positive and negative identity relations and denoted as E + and E −, respectively. A tuple E + (x1, x2) states that identifiers x1 and x2 are identical, while a tuple E − (x1, x2) states that identifiers x1 and x2 are not identical.
Definition 3. The refining relations E + and E − are associated with the sets {Q+1, ..., Q+n} and {Q−1, ..., Q−m} of non-generic queries, respectively, such that
par
  forall x1, x2 with Q+1 (x1, x2) ∨ ... ∨ Q+n (x1, x2) do E + (x1, x2) := true enddo
  forall x1, x2 with E + (x1, x2) do E + (x2, x1) := true enddo
  forall x1, x2 with Q−1 (x1, x2) ∨ ... ∨ Q−m (x1, x2) do E − (x1, x2) := true enddo
  forall x1, x2 with E − (x1, x2) do E − (x2, x1) := true enddo
par
Example 5. Suppose that an author named "Susan Lee" with identifier i1 and an author named "Susan Maneth" with identifier i2 refer to the same person in the real world because Susan Lee changed her surname to Maneth after marriage. For this, we can add a tuple E + (i1, i2) to E +. Similarly, suppose we know that an author named "Susan Lee" with identifier i1 is different from an author named "Susan Lee" with identifier i3. For this, we can add a tuple E − (i1, i3) to E −.
It is possible that pieces of knowledge on identifying a specific author conflict with each other. For instance, we might have both E + (i1, i2) and E − (i1, i2) in the refining relations. To resolve such conflicting knowledge, it is important to automatically discover all the inconsistencies induced by the two refining relations. To
handle this, we may consider tuples in E + and E − as propositions in a propositional logic. Meanwhile, two propositions E + (x1 , x2 ) and E − (x1 , x2 ) should always satisfy the axiom E + (x1 , x2 ) ⇔ ¬E − (x1 , x2 ). In doing so, we can infer the consistency of two refining relations by using propositional tableaux for a proposition φ that is a conjunction of all tuples in E + and E − . Let n and m be the numbers of tuples in E + and E − , respectively. Then we have
φ = E + (x1, x1′) ∧ ... ∧ E + (xn, xn′) ∧ E − (y1, y1′) ∧ ... ∧ E − (ym, ym′).
If the proposition φ is true, then the refining relations are consistent. When the refining relations are consistent, they can be utilised to fine-tune a ground model. We thus obtain an exact identity relation E E which provides the decisive knowledge of author identification at the identity layer. That is,
par
  forall x1, x2 with (E A (x1, x2) ∨ E + (x1, x2)) ∧ ¬E − (x1, x2) do E E (x1, x2) := true enddo
  forall x1, x3 with E E (x1, x2) ∧ E E (x2, x3) do E E (x1, x3) := true enddo
par
Let S denote an identity layer containing E E. Then any changes on the ground model and refining relations which consequently affect E E can be captured by a finite set Δ of updates of the form (E E (i1, i2), true) or (E E (i1, i2), false). In doing so, the knowledge at an identity layer can be continually refined in a stepwise manner via various learning processes such that S1 →Δ1 S2 →Δ2 ... →Δn−1 Sn.
Remark 1. The ASM methods [4] have been intensively used in system design and analysis, in which ground models are established for capturing requirements and then turned into executable code by stepwise refinements. At its core, the principle of substitutivity does not have to be obeyed. As our purpose of using the ground model and refinement methods is to specify the evolution of knowledge in an application domain, the refinements of knowledge comply with the principle of consistency instead of the principle of substitutivity.
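Under the same set-of-pairs representation, the conflict check and the derivation of E E from E A, E + and E − could be sketched as follows; this is our own simplification (function names and the sample tuples are invented), not the paper's specification.

def symmetric(rel):
    return rel | {(y, x) for (x, y) in rel}

def conflicts(E_plus, E_minus):
    # pairs asserted to be both identical and not identical
    return symmetric(E_plus) & symmetric(E_minus)

def exact_identity(E_A, E_plus, E_minus):
    assert not conflicts(E_plus, E_minus), "refining relations are inconsistent"
    E_E = {p for p in symmetric(E_A) | symmetric(E_plus)
           if p not in symmetric(E_minus)}
    changed = True
    while changed:                                  # transitive closure
        changed = False
        for (x1, x2) in list(E_E):
            for (y, x3) in list(E_E):
                if x2 == y and (x1, x3) not in E_E:
                    E_E.add((x1, x3))
                    changed = True
    return E_E

E_A     = {("i1", "i1"), ("i2", "i2"), ("i3", "i3"), ("i1", "i3")}  # from the ground model
E_plus  = {("i1", "i2")}    # Susan Lee is Susan Maneth
E_minus = {("i1", "i3")}    # the two authors named Susan Lee differ
print(exact_identity(E_A, E_plus, E_minus))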
5 Visual Layer
A visual layer is a federation of a conceptual model and an identity layer. By applying the knowledge of author identification stored at an identity layer over a conceptual model, authors referring to the same person are identified and represented as a unified object at a visual layer. Formally speaking, a set of equivalence classes of identifiers is defined in terms of E E.
Definition 4. An equivalence class of an identifier i ∈ O w.r.t. E E is the subset of all identifiers in O which are equivalent to i, i.e., [i] = {i′ ∈ O | E E (i′, i)}. Following the convention, the set of all equivalence classes in O with respect to E E is called the quotient set of O by E E and denoted as O/E E.
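A direct reading of Definition 4 in code looks as follows; this is a sketch of ours (in practice a union-find structure would be the more efficient choice).

def equivalence_class(i, E_E):
    return frozenset({j for (j, k) in E_E if k == i} | {i})

def quotient(O, E_E):
    return {equivalence_class(i, E_E) for i in O}

O   = {"i1", "i2", "i3"}
E_E = {("i1", "i1"), ("i2", "i2"), ("i3", "i3"), ("i1", "i2"), ("i2", "i1")}
print(quotient(O, E_E))    # {frozenset({'i1', 'i2'}), frozenset({'i3'})}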
Definition 5. Let M = (U, O) for U = D ∪ O be a conceptual model and E E be an exact identity relation defined by an identity layer. Then a visual model K = (U′, O′) for U′ = D ∪ O/E E is defined by a homomorphism h : M → K such that
(1) for each value a ∈ D, h(a) = a;
(2) for each identifier i ∈ O, h(i) = [i];
(3) for each entity (i, e) ∈ O, (h(i), e) ∈ O′ is an entity;
(4) for each relationship (i, r) ∈ O, (h(i), h(r)) ∈ O′ is a relationship, where r : comp(τR) ∪ attr(τR) → O ∪ D with r(τ) ∈ O for all τ ∈ comp(τR) and r(A) ∈ dom(A) for all A ∈ attr(τR), and h(r) : comp(τR) ∪ attr(τR) → O/E E ∪ D with h(r)(τ) ∈ O/E E for all τ ∈ comp(τR) and r(A) ∈ dom(A) for all A ∈ attr(τR);
(5) for each cluster u ∈ O, h(u) ∈ O′ is a cluster;
(6) for each collection [l1, ..., ln], {l1, ..., ln} or ⟨l1, ..., ln⟩ ∈ O, [h(l1), ..., h(ln)], {h(l1), ..., h(ln)} or ⟨h(l1), ..., h(ln)⟩ ∈ O′ is a collection.
A visual layer is always in flux so as to reflect the evolving knowledge on author identification. By knowledge acquisition and reasoning across different application domains, the degree of precision and completeness on the identity of authors can be improved over time. The following figure presents an overall picture for this multi-layer architecture with knowledge-based identification.
Conceptual Layer   M            M            ...             M
Identity Layer     + S1   →Δ1   + S2   →Δ2   ...   →Δn−1     + Sn
Visual Layer       = K1         = K2         ...             = Kn

6 Discussion
We discuss several author identification services built upon the knowledge-based identification in the proposed multi-layer architecture and an intelligent way of managing the knowledge acquired from different systems.
Knowledge Sharing. The knowledge of author identification can be shared via various forms of services. We may provide author profile services which contain all the information associated with individual authors such as affiliations, names, email addresses, publications, etc. These services can be published as data feeds providing regular updates (e.g., HTML, Atom or RSS feeds), as widgets providing dynamic content embedded in other applications (e.g., the arXiv myarticles widget), as Web APIs providing the capability for mashups (e.g., Scopus APIs), as interactive tools in social networks or other systems (e.g., Thomson Reuters ResearcherID Upload, the arXiv Facebook application) and so on. We may also generate author authority files to help repositories control author names. With the additional abstraction for knowledge-based identification, it would be easy to trace the knowledge of author identification, such as who added a piece of specific or general knowledge into the identity layer, when, and why. Thus, the reliability of services can be enhanced and well controlled. Moreover, we can treat the whole identity layer as a plug-and-play service applied on a conceptual
model so as to reuse the successful knowledge and in the meantime rapidly deploy the knowledge into new systems.
Knowledge Acquisition. The way of acquiring knowledge within this multi-layer architecture involves two steps: (1) analyse and extract all explicit or implicit knowledge of author identification from external services provided by third-party providers; (2) store the knowledge of author identification and other primitive data relating to publications (including their citation metrics) into the identity and conceptual layers, respectively. One of the biggest advantages offered by this multi-layer architecture is the ability to effectively reason about the knowledge of author identification integrated from different systems. Any conflicts between newly added knowledge and the existing knowledge can be automatically discovered in the integration process. It can thus help detect accidental or historical errors contained in the knowledge of author identification. In doing so, collective knowledge from different communities can be acquired after checking the quality of external services. In addition, this architecture provides great flexibility and efficiency for managing author tagging. For example, when new knowledge has been acquired after integrating a service provided by other scholarly communities, such as Thomson Reuters ResearcherID Download, Scopus RSS feeds, etc., this knowledge can be instantly applied to all the primitive data at the conceptual layer. Since there is no need to tag authors individually, a vast amount of the heavy author-tagging work that currently happens in practice can be saved.
7 Conclusion
We proposed a multi-layer architecture to tackle the author identification problem. An additional layer for managing the knowledge of author identification has been established, lying between a conceptual layer and a view layer. In doing so, we can build an adaptive visual model by binding a conceptual model with knowledge-based identification to capture the identity of authors in an evolving application domain. In the future we will further investigate identity services to integrate collective knowledge from different communities.
References
1. Credit where credit is due. Nature 462(7275), 825 (December 2009)
2. Aho, A.V., Ullman, J.D.: Universality of data retrieval languages. In: Proceedings of Principles of Programming Languages, pp. 110–119. ACM Press, New York (1979)
3. Bennett, R., Hengel, C., Hickey, T., O'Neill, E., Tillett, B.: Virtual international authority file. In: ALA Annual Conference, New Orleans (2006)
4. Börger, E., Stärk, R.F.: Abstract State Machines: A Method for High-Level System Design and Analysis. Springer, Heidelberg (2003)
5. Bourdon, F., Webb, R.: International cooperation in the field of authority data: an analytical study with recommendations. KG Saur (1993)
6. Habibzadeh, F., Yadollahie, M.: The problem of who. International Information and Library Review 41(2), 61–62 (2009)
7. Hartmann, S., Link, S.: Collection type constructors in entity-relationship modeling. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 307–322. Springer, Heidelberg (2007)
8. Hull, R., Yap, C.K.: The format model: a theory of database organization. In: Proceedings of Principles of Database Systems, pp. 205–211. ACM Press, New York (1982)
9. Swan, A.: Author identification web page, http://repinf.pbworks.com/Author-identification
10. Thalheim, B.: Entity-Relationship Modeling: Foundations of Database Technology. Springer, Heidelberg (2000)
Abstraction, Restriction, and Co-creation: Three Perspectives on Services Maria Bergholtz, Birger Andersson, and Paul Johannesson Department of Computer and Systems Sciences Stockholm University {maria,ba,pajo}@dsv.su.se
Abstract. The recent surge of interest in services has brought a plethora of applications of the service concept. There are business services and software services, software-as-a-service, platform-as-a-service, and infrastructure-as-a-service. There is also a multitude of definitions of the service concept. In this paper, we propose not a new definition of service but a conceptual model of the service concept that views services as perspectives on the use and offering of resources. The perspectives addressed by the model are: service as a means for abstraction; service as a means for providing restricted access to resources; and service as a means for co-creation of value.
1 Introduction
As a consequence of the increasing interest in services, there is now a plethora of applications of the service concept. There are business services and software (web) services, software-as-a-service (SaaS), platform-as-a-service, and infrastructure-as-a-service. New methods are proposed to structure systems by means of service architectures [Er07, Ag08, Zi04]. For example, in the view of Papazoglou and Van den Heuvel [PH06], (web) service design and development is about identifying the right services, organizing them in a manageable hierarchy of composite services and choreographing them together for supporting a business process. A business service can be composed of finer-grained services, which in turn are supported by infrastructure services. The diversity of definitions and views can be seen in e.g. [WS04, OA06, Pr04, Lu08, UN08], where a common view is that a service is an abstraction of activities that once started will achieve some user goal. However, the exact way of defining a service depends on the perspective of a particular source. For example, [OA06, Pr04, UN08] focus on a business service perspective, while [WS04] takes a web (or software) service perspective. Attempts to define and characterise services by identifying properties (such as intangibility, inseparability, heterogeneity, and perishability [Zei85]) that distinguish them from other kinds of resources have been problematic [Gol00]. Therefore, it has been suggested [Edv05] to stop searching for internal properties of services that uniquely define them, and instead view and investigate services as perspectives on the use and offering of resources. Thus, the focus is shifted from the internal characteristics of resources to their context of use and exchange. We will follow this line of reasoning and suggest and model a number of service perspectives rather than
propose a single service definition. We claim there are three main service perspectives: service as a means for abstraction, service as a means for providing resource access without ownership transfer, and service as a means for co-creation of value. The purpose of the paper is to propose a conceptual model of services based on these three perspectives. The model has its theoretical foundation in the REA ontology [Mc82] and Hohfeld’s classification of rights [Hoh78]. REA is used because it is a well established ontology of business collaboration with the basic view that resources are exchanged between actors according to agreements. Hohfeld is used as, following the argument of Zeithaml [Zei85], services are “first sold, then produced and consumed”. Being sold implies that some rights are transferred from one actor to another with different consequences depending on the type of right. The remainder of this paper is structured as follows. In section 2 we briefly outline the main points of the REA ontology and Hohfeld’s classification of rights. In section 3 we propose and present a conceptual model of services from three perspectives. Section 4 discusses related work and concludes the paper.
2 The REA Ontology and Hohfeld’s Classification of Rights The REA ontology. The REA (Resource-Event-Agent) ontology was originally formulated in [Mc82] and developed further in a series of papers, e.g. [Gee99, Hr06]. Its conceptual origins can be seen as a reaction to traditional business accounting where the needs are to manage businesses through a technique called double-entry bookkeeping. This technique records every business transaction as a double entry (a credit and a debit) in a balanced ledger.
Fig. 1. Basic concepts of the REA ontology (adopted from http://reatechnology.com/what-isrea.html)
The core concepts in the REA ontology are resources, economic events, and agents. The intuition behind the ontology is that there are two ways agents can increase or decrease the value of their resources: through exchange and conversion processes, [Hr06]. An exchange process occurs when an agent receives economic
resources from another agent and gives resources back to that agent. A conversion process occurs when an agent consumes resources in order to produce other resources. For example, in a purchase (an exchange process) a buying agent has to provide money to receive some goods. Two economic events take place in this process: one where the amount of money is decreased (a decrement event) and another where the amount of goods is increased (an increment event). This correspondence of economic events is called a duality. A corresponding change of availability of resources takes place at the seller’s side. Here the amount of money is increased while the amount of goods is decreased. An example conversion process is hair-dressing where an agent is using or consuming resources (labour, scissors, shampoo) in decrement economic events to produce improved hair in an increment economic event. In order to relate resources and economic events, the notion of stockflow is used. There are two basic kinds of stockflows: inflows that relate resources with increment economic events and outflows that relate resources with decrement economic events. Furthermore, inflows and outflows are specialized depending on if they are part of an exchange process or a conversion process. In an exchange process, agents ‘give’ (a specialization of outflow) up resources in order to ‘take’ (a specialization of inflow) other resources. During conversion processes, agents ‘use’ or ‘consume’ (specializations of outflow) resources to ‘produce (a specialization of inflow) other resources. A commitment is an obligation to fulfil an economic event in the future. For an exchange process, a commitment is an obligation for an agent to provide rights to resources. In a conversion process, commitments represent scheduled usage, consumption, or production of economic resources. Two commitments are related via an association called ‘reciprocity’ that identifies which resources are promised to be exchanged for, or converted into, other resources. The ‘reservation’ relationship is a special kind of stockflow that describes the planned inflow or outflow of resources. A contract, finally, is a collection of commitments. Hohfeld’s classification of rights. A central component in any resource exchange between actors is the transfer and creation of rights. In order to clarify the role of rights in resource exchanges we will make use of the work of W. N. Hohfeld, [Hoh78], who proposed a classification identifying four broad categories of rights: claims, privileges, powers, and immunities. • One actor has a claim on another actor if the second actor is required to act in a certain way for the benefit of the first actor, typically by carrying out some action. Conversely, the second actor is said to have a duty to the first actor. An example is a person who has a claim on another person to pay an amount of money, implying that the other person has a duty to pay the amount. Claims correspond to the REA Commitments. • An actor has a privilege on an action if she is free to carry out that action without any interference from the environment in which the action is to be carried out. By environments is here meant social structures such as states, organizations or even families. Some examples of privileges are free speech and the fact that a person owning some goods has privileges to use these in various ways. • A power is the ability of an actor to create or modify a relationship. 
An example is that a person owning a piece of land has the power to sell it to someone else, thereby creating a new ownership relationship for that piece of land.
• An immunity refers to the restriction of power of one actor in terms of creating formal relationships on behalf of another actor. For example, a native people may hold immunity towards state legislation concerning their property rights, meaning that the state does not have the power to enforce laws that modify existing property rights. We will not make use of immunities in this paper. Most relationships consist of a combination of several of these rights. For example, if you own a car it means that you have privileges on using it as well as the power to lend the car or sell it, i.e. creating new ownerships involving other actors.
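Although neither REA nor Hohfeld prescribes any particular encoding, the following small Python sketch (class and attribute names are our own, invented for illustration) shows how an exchange process could pair dual economic events with the Hohfeld rights each transfer conveys:

from dataclasses import dataclass
from typing import List, Set

@dataclass
class EconomicEvent:
    resource: str
    provider: str
    receiver: str
    rights: Set[str]              # e.g. a subset of {"privilege", "power", "claim"}

@dataclass
class ExchangeProcess:
    events: List[EconomicEvent]   # duality: the events requite each other

# Selling a car transfers both privileges (use it) and powers (resell or lend it);
# the payment is the requiting event of the sale.
sale = ExchangeProcess([
    EconomicEvent("car",   provider="dealer",   receiver="customer", rights={"privilege", "power"}),
    EconomicEvent("money", provider="customer", receiver="dealer",   rights={"privilege", "power"}),
])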
3 Service Perspectives In this section, we will introduce a conceptual model for services. The model does not propose a single service definition but instead suggests a number of service perspectives based on the ways resources can be used and exchanged. This approach is reflected in the model, which does not include the term “service” but instead a family of related terms, including “service resource”, “service offering”, and “capability”. The approach is inspired by a language problem identified by Wittgenstein [Wit33]. He contends that a word is defined by its use and that it can be used in different ways, but that there is no usage characteristic that is common for all ways. He likens the different uses with a family of meanings of the word. In our case a meaning is a perspective, and we have identified three main perspectives on “service”: • Service as a means for abstraction. Services can provide an abstraction mechanism, where resources are specified through their function and not their construction. In other words, a resource is defined in terms of the effects it has in a process, not in terms of its properties or constituents. For example, a hair dressing service can be defined in terms of the effects it has on someone’s hair, not in terms of the resources being used in the execution of the service, such as scissors or electric machines. • Service as a means for providing restricted resource access. An agent can provide access to some of her resources to another agent by transferring the ownership of them. However, such an ownership transfer may in some situations be undesirable or even legally impossible. Thus, there is a need for a way of offering access to resources without transferring ownership, and services provide a mechanism for this purpose. For example, instead of selling people, labour services are sold, and instead of selling cars, car rental services are provided. • Service as a means for co-creation of value. For most kinds of goods, customers are not involved in their production. Instead, goods are produced internally at a supplier who later on sells the goods to a customer who uses them without the involvement of the supplier. In contrast, services are often created and used in an interaction between supplier and customer. We will base the proposed conceptual model of services on these three perspectives. The model will be presented in a series of diagrams, where the first (Fig. 2) shows services as an abstraction mechanism, the second and third one (Figs. 3 and 4) show how services may provide restricted resource access, while the last (Fig. 5) shows services as co-creation of value.
3.1 Service as a Means for Abstraction A main purpose of using services is that they enable Agents to offer Resources without specifying their characteristics in detail, i.e. Resources can be offered in a more abstract way. We will here distinguish between two kinds of abstract Resources, Capabilities and Service Resources. A Capability is an abstract Resource that is defined only through its consequences, i.e. what changes it can bring to other Resources when used in a conversion process (see Figure 21). For example, an Agent having a hair dressing Capability means that it is able to improve the hair styles of people. In other words, when a hair dressing Capability is used, a hair style is improved. Other examples of Capabilities could be to proofread German documents, to arrange dinner parties, or to host online gaming sessions. A Capability is abstract in the sense that it has to be based on some other Resources but is not defined in terms of them; in Figure 2 such Resources underlying a Capability are modelled by the class Resource Set. For example, a proof reading capability can be based on concrete Resources like human skills and labour, paper, and pencils. However, it could also be based on computer hardware and some advanced software. When the proof reading Capability is to be used, the Resources it is based on will be used.
Fig. 2. Service as a means for abstraction
In addition to Capabilities, we also introduce Service Resources as abstract Resources. A Service Resource is similar to a Capability but differs in that it can be used in only one single Process (an REA-conversion process, see section 2). In other words, a Service Resource is an abstract Resource that is defined through the single Process in which it can be used. Service Resources are used to expose some limited aspect of a Capability, e.g. a Service Resource “hair dyeing” is based on a hair dressing Capability, which can be used in many conversion processes. Service Resources are helpful when defining Offerings, modelled by the class Offering in Figure 2, that specify what Resources an Agent is prepared to offer to other Agents. When defining an offering of a Service Resource, it is sometimes not sufficient to specify only the Service Resource; it may also be required to put constraints on which Resources the Service Resource can be based on. For example, a hair dresser may 1
The reading direction of associations in the diagrams is shown via an arrow in the end of the association (not to be confused with the UML navigable arrow).
offer a “hair dyeing” service and declare that it is based only on colouring products with vegetarian ingredients. In order to model these constraints, we introduce the class Restriction in the model. It can be noted that Restrictions are applicable not only to Offerings of Service Resources, but for any Offerings. Capabilities and Service Resources, as almost all concepts in the conceptual models presented here, may exist on a knowledge level as well as on an operational level. According to [Fo97], the operational level models concrete, tangible individuals in a domain, while the knowledge level models information structures that characterize categories of individuals on the operational level. Figures 2 through 5 hence distinguish between concepts such as Resources (categories of Resources such as Car model, Agent type, Real Estate) and Resource Instances (specific and often tangible concepts like a specific car or a concrete piece of land), Economic Events and Economic Event instances, and so forth for every concept in the model. Due to space limitations we include both knowledge and operational level concepts in the diagrams only when both concepts are required to illustrate a focal point in the model. 3.2 Service as a Means for Providing Restricted Resource Access Figure 3 depicts three different ways for an Agent to make its resources available to other Agents through Offerings. It also shows how several Offerings can be combined into one single Offering. First, an Agent may offer to sell a Resource to another Agent, i.e. to transfer the ownership of the Resource to the other Agent, as modelled by class Ownership Offering. A transfer of ownership means that a number of Rights are transferred from seller to buyer, in Figure 3 modelled by the class Right. Rights are divided into Powers (an Agent is entitled to create, modify or cancel the ownership) or Privileges (an Agent is allowed to use the Resource being transferred, see Hohfelds’s classification of rights in section 2). As an example, an Agent offering to sell a book to a customer means that the Agent is offering the customer Privileges to use the book as well as the Power to transfer the ownership of the book to yet another Agent if she so wishes. Secondly, an Agent may offer to lend a Resource or provide access to it in a Lending Offering. This means to offer an Agent to get certain Privileges on the Resource but without getting any ownership, i.e. the borrower is not granted the Power to change the ownership of the Resource. Thirdly, an Agent may make a Service Offering, which is the most abstract way of providing access to an Agent’s resources. A Service Offering means that the offering Agent offers to use some of her Service Resources in a conversion process that will benefit another Agent. In this case the offering Agent “stands between” the requesting Agent and the concrete Resources to be used in the conversion process. Effectively, the offering Agent restricts access to these Resources. In particular, the buying Agent is not offered any Powers or Privileges on any concrete Resources. Instead, she is offered a claim on the offering Agent that she is to contribute to a certain process. Finally, there are combinations of Offerings, in Figure 3 modelled as a Bundled Offering. The purpose of bundling is to provide a mechanism for combining different types of Offerings. 
Bundled Offerings are also Offerings, in Figure 3 modelled via a generalization relationship between Offering and Bundled Offering, in order to be able to assemble more complex bundles not only from the aforementioned Ownership, Loan and Service Offerings but recursively from other Bundled Offerings as well.
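As an illustration of the distinctions just made (not a prescribed implementation), the offering kinds of Figure 3 can be told apart by what the requesting Agent obtains; the Python sketch below uses class and attribute names of our own:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Offering:
    provider: str

@dataclass
class OwnershipOffering(Offering):
    resource: str
    rights: tuple = ("privilege", "power")    # ownership transfers both kinds of rights

@dataclass
class LendingOffering(Offering):
    resource: str
    rights: tuple = ("privilege",)            # no power to change the ownership

@dataclass
class ServiceOffering(Offering):
    service_resource: str                     # the buyer gets a claim, not rights on concrete resources

@dataclass
class BundledOffering(Offering):
    parts: List[Offering] = field(default_factory=list)   # may recursively contain bundles

car_hire = BundledOffering("rental firm", parts=[
    LendingOffering("rental firm", resource="car"),
    ServiceOffering("rental firm", service_resource="roadside assistance"),
])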
Fig. 3. Service as a means for restricted access provisioning
Fig. 4. Access provisioning fulfilled (provide/receive associations hidden to reduce complexity)
The diagram in Figure 3 shows different kinds of Offerings. When such offerings are accepted they will result in Commitments and Contracts, see Figure 4. A Service Offering will result in a Service Commitment, while an Ownership or a Loan Offering will result in an Ownership/Loan Commitment. Bundled Offerings will result in Contracts which are collections of Commitments. When Commitments have been established they are to be fulfilled through Economic Events. Commitments can be fulfilled in different ways depending on the kind of Offering they are based on. A Service Commitment is fulfilled via a special kind of Economic Event Instance, called a Service Delivery. A Service Delivery consumes a Service Resource Instance and thereby it uses the Agent Capability on which that Service Resource Instance is based. Agent Capability is here an operational level concept corresponding to the knowledge level concept of Capability, representing that a certain Agent has a certain Capability. Thus, a Service Commitment becomes fulfilled through an Agent using her Capability in order to benefit another Agent. This situation is different from that of an Ownership/Loan Commitment, which is fulfilled through an Economic Event Instance where an Agent gives rights (Privileges and/or Powers) on some Resource Instance to another Agent. Summarising, a Service Commitment is fulfilled by an Agent consuming and using her Resources (stockflow: consume and use), while an
Ownership/Loan Commitment is fulfilled by an Agent giving away Rights and/or Privileges (stockflow: give). The Commitments of Figure 4 are active under certain pre-specified conditions, which are modelled using the class Trigger. Triggers are used to specify when Commitments are to be fulfilled. This is in many cases trivial, e.g., stating that a Resource will be delivered within one week from ordering. Triggers, therefore, typically include general attributes such as ‘From’, ‘To’, and ‘Location’. For example, one Trigger for a Commitment may state that a certain Resource be delivered to a certain location within a predefined time interval. Triggers can, however, also be more complex, in particular for Service Commitments. An example could be a Trigger stating that a snow ploughing service is to be delivered whenever more than 1 dm of snow has fallen. In this case, one single Service Delivery may not be enough to fulfil the Service Commitment, as the service shall be delivered as soon as the amount of snow fallen exceeds 1 dm. Conversely, it is not always required that every Service Commitment be fulfilled by a Service Delivery. If no snow ever falls, the Trigger will never make the Commitment active, and no Service Delivery will take place. Still, the Contract containing the Commitment is respected even in this situation. Thus, Triggers are used to specify any type of condition under which a Service Commitment is active. If and when a Service Delivery is required to be carried out is thereby effectively determined by a Trigger. 3.3 Service as a Means for Co-creation of Value For most kinds of resource exchanges, customers are not involved in the production of the exchanged resource. Instead, resources are produced internally at a supplier who later on sells the goods to a customer who uses them without the involvement of the supplier. In contrast, Service Deliveries are usually parts of processes where value is co-created in an interaction between provider and recipient. In Figure 5, this is modelled through the class Co_Creation Process that relates an Economic Event Instance (Service Delivery) carried out by a provider with complementary Economic Event Instances carried out by the recipient. For example, a service provider may offer driving lessons to a customer. The fulfilment of the Service Commitment requires that both parties take part in the creation of value, i.e., that the service provider uses his teaching capabilities (based on Resources such as knowledge and time) and the customer uses his Resources (his previous skills plus his time) to produce an improvement of the customer’s driving skills.
Fig. 5. Service as a means for co-creation of value
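The trigger-driven fulfilment of Service Commitments described in Section 3.2 (the snow-ploughing example) can be made concrete with a small sketch; the class names and the observation format below are invented for illustration only.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ServiceCommitment:
    description: str
    trigger: Callable[[dict], bool]      # condition over an observed state

    def deliveries_required(self, observations: List[dict]) -> int:
        return sum(1 for obs in observations if self.trigger(obs))

ploughing = ServiceCommitment(
    description="plough the driveway",
    trigger=lambda obs: obs.get("snowfall_dm", 0) > 1,
)

winter = [{"snowfall_dm": 0}, {"snowfall_dm": 1.5}, {"snowfall_dm": 0.4}]
print(ploughing.deliveries_required(winter))   # 1
print(ploughing.deliveries_required([]))       # 0 - the contract is still respected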
4 Concluding Remarks In this paper we have proposed a conceptual model of the notion of service. A main characteristic of the model is that it describes services from three perspectives – service as a means for abstraction; for access restriction; and for co-creation of value. These perspectives and the combination thereof, are similar yet different from other perspectives in the literature. The work was in part motivated by a problem posed in [Fe09]. The problem there was how to view a service where a) the delivery of the service is done during a time period, and b) the terms of the service are honoured even if no service is actually delivered. An example of the problem is a snow ploughing service, which is to remove snow as soon as more than 1 dm of it has fallen. The apparent paradox in the problem is solved by distinguishing between commitments of services and service deliveries that fulfil these commitments. The work is moreover motivated by the assumption that co-creation of value is fundamental for services, or, in other words, building a model that describes only one agent’s perspective at a time may not be an optimal approach when modelling services. Our three perspectives can be compared to those introduced in [Akk04]. There the chosen perspectives are called ‘service value’, ‘service offering’, and ‘service process’. The service value perspective is analogous to our abstraction perspective where a service is described by the effects it has, but it also contains elements from our co-production perspective. The service offering perspective is related to our view of services as a means for restricted access to resources. The service process perspective describes how a service offering is put into operation, but the authors do not investigate realization issues in detail. In the context of SOA, OASIS acknowledges that services are not only a technical but also a social concept [OA06]. It is stated that many, if not most, effects that are desired in the use of SOA-based systems are actually social effects rather than physical ones. “When a customer ‘tells’ an airline service that it ‘confirms’ the purchase of the ticket it is simultaneously a communication and a service action – two ways of understanding the same event, both actions, one layered on top of the other, but with independent semantics” [OA06, p. 32]. Compared to our three perspective view, OASIS focuses on abstraction and access restriction (of mainly software services). Lusch [Lus08] on the other hand emphasizes the co-creation of value perspective and argues that it is paramount for a so called service-dominant logic, which can be contrasted with a goods-dominant logic. In a service-dominant logic the focus is on the interaction between the firm and the customer. Together the firm and the customer contribute with their capabilities in order to solve a particular customer problem. By comparison, in the goods-dominant logic service companies are more manufacturing like. The customer is viewed merely as a consumer of whatever resource (including a service) that is made available for him, and his own skills and capabilities do not add value to it. The contributions of this paper are mainly theoretical but we believe that they will also find applications in structuring service descriptions and developing service classifications. Further research will investigate these issues as well as consolidate the proposed model.
References [AG08] Arsanjani, A., et al.: SOMA: A method for developing service-oriented solutions. IBM Systems. Journal 47(3), 377–396 (2008) [Akk04] Akkermans, et al.: Value Webs: Ontology-Based Bundling of Real-World Services. IEEE Intelligent Systems 19(4) (July/August 2004) [BP07] OASIS Web Services Business Process Execution Language Version 2.0 (2007), http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html [Edv05] Edvardsson, B., Gustafsson, A., Roos, I.: Service portraits in service research: a critical review. Int. Jour. of Service Industry Management 16(1), 107–121 (2005) [Er07] Erl, T.: SOA: principles of service design. Prentice-Hall, Englewood Cliffs (2007) [Fe09] Ferrario, R., Guarino, N., Fernandez Barrera, M.: Towards an Ontological Foundations for Services Science: the Legal Perspective. In: Sartor, G., Casanovas, P., Biasiotti, M., Fernandez Barrera, M. (eds.) Approaches to Legal Ontologies, Springer, Heidelberg (2009) [Fo97] Fowler, M.: Analysis Patterns. In: Reusable Object Models, Addison-Wesley, Reading (1997) [Ge99] Geerts, G., McCarthy, W.E.: An Accounting Object Infrastructure For KnowledgeBased Enterprise Models. IEEE Int. Systems & their Applications, 89–94 (1999) [Gol00] Goldkuhl, G., Röstlinger, A.: Beyond goods and services - an elaborate product classification on pragmatic grounds. In: Proc of Quality in Services, QUIS 7 (2000) [Hoh78] Hohfeld, W.N., Corbin (eds.): Fundamental Legal Conceptions. Greenwood Press, Westport (1978) [Hr06] Hruby, P.: Model-Driven Design of Software Applications with Business Patterns. Springer, Heidelberg (2006) ISBN: 3540301542 [Lus08] Towards a conceptual foundation for service science: Contributions from servicedominant logic. IBM Systems Journal 47(1) (2008) [Mc82] McCarthy, W.E.: The REA Accounting Model: A Generalized Framework for Accounting Systems in a Shared Data Environment. The Accounting Review (1982) [OA06] OASIS. Reference Model for Service Oriented Architecture 1.0. (2006), Available at http://www.oasis-open.org/committees/download.php/19679/ soa-rm-cs.pdf [PH06] Papazoglou Heuvel, M., van den, W.J.: Service-oriented design and development methodology. Int. Journal of Web Engineering and Technology 2(4), 412–442 (2006) [Pr04] Preist, C.: A Conceptual Architecture for Semantic Web Services. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 395–409. Springer, Heidelberg (2004) [UN08] United Nations, Dept. of Economic and Social Affairs. Common DataBase (CDB) Data Dictionary. Available at http://unstats.un.org/unsd/cdbmeta/gesform.asp?getitem=398 2008-02-19 [Wit33] Wittgenstein, L.: The Blue and Brown Book, pp. 1–74. Harper&Row, New York (1980), Available online at http://www.geocities.jp/mickindex/ wittgenstein/witt_blue_en.html [WS04] W3C. Web Services Architecture W3C Working Group (2004), http://www.w3.org/TR/2004/NOTE-ws-arch-20040211/ [Zei85] Zeithaml, V.A., Parasuraman, A., Berry, L.L.: Problems and Strategies in Services Marketing. Journal of Marketing 49, 33–46 (1985) [Zi04] Zimmerman, O., Krogdahl, P., Gee, C.: Elements of Service-Oriented Analysis and Design (2004), http://www-128.ibm.com/developerworks/library/ ws-soad1/
The Resource-Service-System Model for Service Science Geert Poels Faculty of Economics and Business Administration, Ghent University, Tweekerkenstraat 2, 9000 Gent, Belgium [email protected]
Abstract. Service Science is the interdisciplinary academic field that studies service systems. A challenge for Service Science is the development of abstractions, models, vocabularies, and measures that support service systems research. This paper proposes the Resource-Service-System model as a conceptual model for Service Science that emphasizes that, in an economic context, service systems interact through the exchange of service for service in a mutually beneficial manner. This new model is adapted from the REA model of economic exchange by analyzing REA from the perspective of the Service-Dominant Logic economic worldview, which has been proposed as the philosophical foundation of Service Science.
1 Introduction
Service Science is the emerging academic field that studies service systems in order to discover underlying principles that can inform service innovation and guide the design, improvement, and scaling of service systems [1]. As a distinct interdisciplinary field, Service Science needs an idiosyncratic and unifying paradigm to provide identity and discriminate it from its many contributing but separate service research disciplines [2]. Service-Dominant Logic (SDL) [3] has been proposed as a philosophical foundation of Service Science that “provides just the right perspective, vocabulary, and assumptions on which to build a theory of service systems, their configurations, and their modes of interaction” [4, p. 18]. SDL is a worldview that sees all economic activity as service exchanges between service systems (which can be individuals or groups of individuals like families, firms, and nations [5]). In SDL, service is defined as the application of competences by one service system for the benefit of another service system. In the traditional economic worldview, referred to as Goods-Dominant Logic (GDL) [6], a service is seen as a second-class product that suffers from shortcomings like intangibility, heterogeneity, inseparability, and perishability. In GDL, services are to the best possible extent, as far as allowed by their shortcomings, treated as any other kind of product (e.g., goods, rights). GDL considers services as transferable resources that have a nominal value that is determined through their exchange for other resources (usually money). In contrast, SDL sees a service as a collaborative process in which each party brings in or makes accessible its unique resources. In SDL, it is the provision of resources by one party and their acting upon the resources of another party that creates
real value with and for that other party (i.e., the service beneficiary). Economic rationale dictates that service systems do not interact to create value for just one of them. They interact such that value is created for both of them. In other words, service exchange is the economic motive for service systems to engage in interactions with other service systems because it is through service exchange that service systems can improve their state. Service Science needs modelling and simulation tools that help studying service systems. Current challenges for Service Science include the formal representation and measurement of work in service systems [7] and the development of a shared vocabulary to describe service systems [8]. This paper aims to contribute to Service Science by proposing a conceptual model of economic exchange in SDL. The model is derived from the Resource-Event-Agent (REA) model of economic exchange [9], which is firmly grounded in well-established accounting and business process theories. REA focuses on the effect of economic exchanges on the resources controlled by the legal and natural persons that participate in the exchanges. The new model, which we call the Resource-Service-System (RSS) model, is constructed based on an SDL interpretation of REA, using terms and definitions taken from SDL literature and two genuine Service Science research contributions: the system theoretic definition of service system provided in [8] and the Interact-Serve-Propose-Agree-Realize (ISPAR) model of service system interaction presented in [10]. Section 2 presents and illustrates the RSS model. Section 3 evaluates the model by comparing it to existing models. Section 4 discusses the potential use of the model in Service Science and suggests future research.
2 The RSS Model Fig. 1(a) shows the REA model of economic exchange. An economic resource is a valuable good, right, or service that is presently under the identifiable control of an economic agent. An economic resource is under the control of an economic agent if that person owns the resource or is otherwise able to derive economic benefit from it. If two economic agents desire to obtain control of one or more economic resources controlled by the other agent, then both agents may wish to engage as trading partners in an economic exchange, which is a business transaction that transfers the control of the resources between the agents. A transfer of control of (a) resource(s) from one agent to another agent is modelled as an economic event in which the concerned resource(s) (is)/(are) identified by a stockflow relation and the agents participate in provider and receiver roles. Economic reciprocity in exchanges is modelled through the duality relation between economic events and requiting events (often payments), in which the provider and receiver roles of the involved agents are switched. Although REA is a conceptual model of economic exchange, it is not a model of service exchange in SDL as services are seen as economic resources. Consequently, control of services can be transferred from a provider to a receiver. Instead of cocreating value, the provider is assumed to ‘produce’ the services (i.e., creates value) and the receiver is assumed to ‘consume’ them (i.e., destroys value). Further, to model the transfer of a service, an economic event is needed, e.g., a service transfer, provision, or delivery event. For instance, from a GDL perspective, a garage and a car
owner can exchange a car oil change service for money. A REA model of this exchange (Fig. 1(b)) would identify, apart from the oil change service economic resource, an economic event (e.g., oil change service transfer) that transfers the control of the service from the garage to the car owner, meaning that the car owner consumes the oil change service produced by the garage. In return, the car owner pays the garage, causing a flow of money from car owner to garage. Payment is the requiting event of the service transfer. Fig. 1(b) further shows that a REA model allows representing the resource component structure of a service (via the composition relation). The car oil change service is composed of oil change competences (i.e., knowledge and skills) embodied in a car mechanic and various other resources (e.g., garage tools, a garage pit, and a quantity of motor oil) used as appliances to convey these competences to the service target (i.e., the car). (a)
Fig. 1. (a) REA model of economic exchange; (b) exchange of car oil change service for money
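Read in GDL terms, Fig. 1(b) treats the oil change service itself as a transferable economic resource; a minimal sketch of that reading (names are illustrative only, not part of the REA model) might be:

from dataclasses import dataclass
from typing import List, Union

@dataclass
class EconomicResource:
    name: str
    components: List[str]          # resource component structure (competences and appliances)

@dataclass
class EconomicEvent:
    resource: Union[EconomicResource, str]
    provider: str
    receiver: str

oil_change = EconomicResource("car oil change service",
                              ["oil change competences (car mechanic)",
                               "garage tools", "garage pit", "motor oil"])

service_transfer = EconomicEvent(oil_change, "garage", "car owner")   # produced by one, consumed by the other
payment          = EconomicEvent("money", "car owner", "garage")      # requiting event
duality = (service_transfer, payment)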
In SDL, service is a process and not a resource. REA can provide a conceptual basis for a model of economic exchange in SDL by classifying service as economic event instead of economic resource. Support for this position is found in the ontological analysis of service using the DOLCE upper-level ontology presented in [11], where it is concluded that “it seems legitimate to assume that goods are objects (endurants, in DOLCE’s terms), while services are events (perdurants)” [11, s.p.]. Fig. 2(a) shows our SDL interpretation of the REA model in Fig. 1(a). The model is obtained by replacing in Fig. 1(a) economic event by service. Like REA economic resources, a resource in SDL is something of value under the control of a legal or natural person. If economic exchanges are service exchanges then the persons controlling resources are service systems. The notion of service system is given a system theoretic definition in [8], the main element of which is that a service system is a configuration of resources that is an open system (1) capable of improving the state of another system through sharing or applying its resources (i.e., the other system
determines and agrees that the interaction has value), and (2) capable of improving its own state by acquiring external resources (i.e., the system itself sees value in its interaction with other systems). Therefore, economic agent is replaced by service system. As shown in Fig. 2(a) by the controls aggregation relation, a service system is an aggregate (or configuration) of resources that are controlled by the system.

Fig. 2. (a) RSS model of service exchange; (b) exchange of car oil change service for payment

A service is the acting of one or more operant resources on one or more other resources (operand, but possibly also operant) [8]. The distinction between operant and operand resources is a key feature of SDL [3]. Operand resources are passive resources that require action to make them valuable, whereas operant resources are active resources that embody competences (i.e., knowledge and skills) and that can act on other resources to make them valuable. According to [12], this distinction can enrich the conceptual foundation of Service Science as service systems are driven by operant resources rather than operand resources. Therefore, resource is specialized into operant resource and operand resource (instead of goods, services and rights as in Fig. 1(a)).

Fig. 2(a) shows that at least one operant resource must act in a service and at least one resource must be acted upon, meaning that service implies the application of competences which must be integrated with other resources to create value. These acts in and is acted upon in relations replace the stockflow relation of Fig. 1(a).
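As a reading aid only (not part of the original model specification), the following Python sketch encodes the main constructs of Fig. 2(a); the class and attribute names are illustrative assumptions, and the cardinality and participation constraints of the graphical model are only hinted at in comments.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Resource:
    """Something of value under the control of a service system."""
    name: str

@dataclass
class OperantResource(Resource):
    """Active resource embodying competences; can act on other resources."""

@dataclass
class OperandResource(Resource):
    """Passive resource that requires action to become valuable."""

@dataclass
class ServiceSystem:
    """An open configuration of resources; per [8], itself an operant resource."""
    name: str
    controls: List[Resource] = field(default_factory=list)  # controls aggregation

@dataclass
class Service:
    """An event in which operant resources act on other resources."""
    acts_in: List[OperantResource]          # at least one operant resource must act
    is_acted_upon: List[Resource]           # at least one resource must be acted upon
    resource_provider: ServiceSystem        # co-creates value for the other system
    resource_integrator: ServiceSystem      # co-creates value for its own benefit
    reciprocal: Optional["Service"] = None  # is reciprocal of (mandatory in the model)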
The system-theoretic definition of service system in [8] further specifies that service systems are themselves resources, more particularly operant resources. As service systems are configurations of resources, service systems can be composed of other service systems. A composition of resources needs to include an operant resource, otherwise it cannot be considered a service system [8]. The model in Fig. 2(a) emphasizes the component structure of service systems rather than that of economic resources by replacing the composition aggregation relation and control relation in Fig. 1(a) by a single controls aggregation relation.

The service systems involved in a service are explicitly identified via value co-creation roles. A resource provider co-creates value with another service system (i.e., a resource integrator) for the benefit of that other system by providing/applying resources. A resource integrator co-creates value with another service system (i.e., a resource provider) for its own benefit by integrating the resources provided/applied by the other system. These roles replace the provider and receiver roles in Fig. 1(a). Finally, the model includes a bidirectional is reciprocal of relation between services that replaces the duality relation in Fig. 1(a). Mandatory participation constraints indicate that each service needs a reciprocal service. This means that when a service system provides resources for a service that benefits another service system, then this other service system must provide resources for a requiting service that benefits the first service system. So, in the requiting service the resource provider and integrator roles of the service systems that co-create value are switched.

Applying the RSS model to the example of a car oil change results in the model shown in Fig. 2(b). Car oil change is identified as a service in which a garage and a car owner participate in the respective roles of resource provider and resource integrator. The car mechanic is an operant resource controlled by the garage that acts upon the car, which is an operand resource controlled by the car owner. In the service, other operand resources controlled by the garage (e.g., motor oil, a garage pit, and garage tools) are acted upon as they are appliances for the oil change competences embodied in the car mechanic. Payment is the reciprocal service of car oil change. The resources that are acted upon or act in the payment service are also modelled: the car owner’s payment competences (i.e., knowledge of how to effectuate payments), his money, and a bank account as an operand resource that the garage brings in.

Note the two essential differences with the REA model in Fig. 1(b). First, there is no need to model both a service and a service transfer. Second, the resources brought in by the resource integrator (i.e., the car in the case of the car owner and the bank account in the case of the garage) are explicitly modelled, emphasizing that in each of the reciprocal services value is co-created by both parties rather than created by one party and destroyed by the other party.

2.1 A Service Process Model View

The model in Fig. 2(a) does not elaborate on the process structure of a service exchange, other than recognizing that service is an event, meaning an occurrence in time. The mandatory is reciprocal of relation suggests that the pair of services that make up a service exchange concur, as each service needs a reciprocal service.
In [13] it is argued that any Service Science model of service systems should capture time. To add a time
dimension to the RSS model and to formalize the co-existence of mutually reciprocal services, we extend the model in Fig. 2(a) with a service process model view. The ‘process’ of service exchange can be seen as a series of interactions (possibly only one) between the involved service systems. In [10] the notion of interaction between service systems was formalized and a normative model (called ISPAR) was proposed that identifies and typifies all possible service system interactions. Service system interactions are described by interaction episodes, which are series of activities jointly undertaken by two interacting service systems.1

The ISPAR model is represented in [10] as a decision tree with ten leaf nodes that represent the different types of outcome for interaction episodes. Six of these outcomes characterize interaction episodes that describe service interactions between service systems, i.e., interactions that aim at mutual value co-creation [8]. A further distinction is made between successful (outcome R), aborted (outcomes -P and -A), and unsuccessful service interactions (outcomes -D, K, and -K) (Table 1).

Table 1. ISPAR outcomes of interaction episodes describing service interactions

Interaction episode outcome                                                    Interaction type
(R)  mutual value co-creation realized                                         Successful
(-P) proposal for service interaction(s) not successfully communicated         Aborted
(-A) no agreement reached on proposal for service interaction(s)               Aborted
(-D) mutual value co-creation not realized, but not disputed                   Unsuccessful
(K)  mutual value co-creation not realized, successful dispute resolution      Unsuccessful
(-K) mutual value co-creation not realized, unsuccessful dispute resolution    Unsuccessful
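Purely as an illustrative aid (an addition, not part of the ISPAR specification or of this paper), the six service-interaction outcomes of Table 1 can be written as a small Python enumeration, with the grouping of Table 1 expressed as sets; the identifiers NOT_x stand for the outcomes written -x in the text:

from enum import Enum

class ISPAROutcome(Enum):
    """ISPAR outcomes of interaction episodes that describe service interactions."""
    R = "mutual value co-creation realized"
    NOT_P = "proposal for service interaction(s) not successfully communicated"
    NOT_A = "no agreement reached on proposal for service interaction(s)"
    NOT_D = "mutual value co-creation not realized, but not disputed"
    K = "mutual value co-creation not realized, successful dispute resolution"
    NOT_K = "mutual value co-creation not realized, unsuccessful dispute resolution"

# Grouping as in Table 1.
SUCCESSFUL = {ISPAROutcome.R}
ABORTED = {ISPAROutcome.NOT_P, ISPAROutcome.NOT_A}
UNSUCCESSFUL = {ISPAROutcome.NOT_D, ISPAROutcome.K, ISPAROutcome.NOT_K}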
Fig. 3 shows a model view that can be integrated with Fig. 2(a) to incorporate the concepts and structure of ISPAR in the RSS model. In the service process model view, the is reciprocal of relation in Fig. 2(a) is reified such that the process structure of a service exchange can be explicitly represented. As a service exchange object links a pair of objects representing the reciprocal services that constitute a service exchange, and the participation of service objects in these links is mandatory (i.e., each service needs a reciprocal service), the life of a service exchange concurs with the lives of its constituting services.

Fig. 3 further shows that each service exchange is composed of one or more service interactions. A constraint not shown is that all service interactions that compose some service exchange are described by the same interaction episode. We consider interaction episodes that describe non-service interactions as outside the scope of RSS, as this is a model of service exchange. So, only interactions that aim at mutual value co-creation are considered, even if eventually unsuccessful or aborted.

The ISPAR model can also be used to prescribe a minimal normative lifecycle for the service exchange objects in Fig. 3. Each service exchange, and hence each of the two reciprocal services it relates, starts with an activity that proposes service interaction(s).2
1 Note that ISPAR considers, just as REA and consequently RSS, economic exchange as a bilateral process, so involving exactly two service systems.
The way in which a service exchange (and its constituting services) ends depends on the outcome of the interaction episode. Successful service interactions are described by an interaction episode that is a sequence of three main activities: (1) proposing one or more service interactions; (2) agreeing to this proposal; and (3) realizing the agreed-to service interaction(s). This happy path leads to the result that value co-creation for both service systems is realized, in which case the service exchange and both involved services come to a successful end. For interaction episodes with outcomes -P and -A, the service exchange ends because the service interactions are aborted. In interaction episodes with outcomes -D, K, and -K, the agreed-to service interactions are not (fully) realized and the service exchange might even end with a dispute, which may or may not be successfully resolved.
Fig. 3. RSS model: Service process model view
In the car oil change example, an exchange of oil change service for payment requires one or two service interactions: either a single garage visit that is immediately followed by a cash or credit card payment (at the garage’s premises); or a single garage visit that is followed by a money transfer from the car owner’s bank account to the garage’s account, sometime after the oil change took place (e.g., within 30 days). The proposing activity is when the car owner calls the garage to ask for a car oil change and the garage proposes a garage visit. Both parties then need to agree on the conditions/price and on a suitable date/time for the visit. At the scheduled date/time the car owner brings in the car, the oil is changed, and payment is made (immediately or later).

Although only one interaction episode really takes place and describes the actual exchange as it happened, many different interaction episodes with different outcomes are possible and can be modeled in a lifecycle model for the service exchange. Examples of unhappy outcomes that may occur include an interrupted or unanswered call (-P), a car owner who refuses the conditions/price offered (-A), and a no-show at the agreed date/time (-D).
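The normative lifecycle sketched above can be illustrated with a minimal Python function (a hypothetical helper added here for illustration, not taken from the paper) that maps the progress of an interaction episode to the outcome labels of Table 1; the last lines replay two of the car oil change scenarios:

def classify_episode(proposed: bool, agreed: bool, realized: bool,
                     disputed: bool = False, dispute_resolved: bool = False) -> str:
    """Map the progress of a single interaction episode to an ISPAR outcome label."""
    if not proposed:
        return "-P"  # proposal not successfully communicated (aborted)
    if not agreed:
        return "-A"  # no agreement reached (aborted)
    if realized:
        return "R"   # mutual value co-creation realized (successful)
    if not disputed:
        return "-D"  # not realized, not disputed (unsuccessful)
    return "K" if dispute_resolved else "-K"  # dispute resolved / not resolved

# Happy path: call answered, conditions agreed, oil changed and paid for.
assert classify_episode(proposed=True, agreed=True, realized=True) == "R"
# The car owner refuses the offered conditions/price.
assert classify_episode(proposed=True, agreed=False, realized=False) == "-A"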
2 This activity can be initiated by one service system (in which case the other service system might not be identified yet or only typified) or jointly by both service systems. A proposal might refer to a single well-defined service interaction or an ongoing series of interactions not completely defined [14].
3 Related Work

A conceptualization of service systems that is often referred to by Service Science researchers (e.g., [1], [13]) is the model defined in [7] (Fig. 4). This model shows that service systems interacting in a service take on provider or client roles, in which the provider takes responsibility for transforming or operating on a service target that is owned by the client. The model is in line with SDL as it emphasizes that the interaction of both systems is required to create value for the client. However, Fig. 4 does not suggest that service systems interact through service exchange. In contrast, the RSS model puts strong emphasis on the exchange nature of service. The shift in the logic of economic exchange from resource-for-resource to service-for-service is what SDL is all about [15], so it is hard to imagine how a conceptual model for Service Science can abstract from service exchange. Other differences with RSS are that Fig. 4 does not provide constructs to model the use/consumption of resources in a service and that it does not emphasize the crucial role of operant resources.
Fig. 4. Conceptual model of service systems interacting in a service [7]
Related to our research, we found two other studies that have applied ideas from value modeling to service modeling in a Service Science context. The REA-based model presented in [16] recognizes services as a special kind of resource whose goal is to modify and add value to other resources. Applying this model to the car oil change example gives Fig. 5.3 The main difference with the REA model of Fig. 1(b) is that Fig. 5 identifies the service target (i.e., the car) that is owned by the service beneficiary. The oil change service has as its goal to transform this service target. Like REA (but unlike RSS), a service is seen as a resource that is consumed in an event (i.e., refill oil), which does not conform to SDL. Furthermore, apart from the service target, the resources brought in by the service systems are not identified, which is different from the RSS model for this example (Fig. 2(b)).

The model presented in [17] can be used to model value encounters, which are interaction spaces between multiple actors in which each actor brings in resources which are then combined to create value for all of them. The model is different from the REA-based model in [16] in that all resources brought in by the service systems can be identified.
3 Fig. 5 represents only the car oil change service part of the service exchange, so it is an incomplete model used for illustration purposes only. Like the RSS model, the model presented in [16] employs REA’s duality relation to model the exchange of service-for-service.
Fig. 5. REA-based model of service [16] as applied to the car oil change example
However, the model does not conform to SDL as services “are resources as well” [17, p. 39] and can be “consumed by some other actor” [17, p. 39]. Furthermore, the model does not distinguish operant and operand resources.

In summary, the RSS model is different from these pre-existing models in the sense that it fully conforms to SDL. It (i) considers service as a process (different from [16], [17]); (ii) emphasizes that in an economic context, service is exchanged for service (different from [7]); (iii) can be used to identify all resources that act in or are acted upon in a service (different from [7], [16]); and (iv) distinguishes operant and operand resources (different from [7], [16], [17]).
4 Discussion and Future Research

The RSS model is intended as an instrument to be used in the study of service systems and their interaction in the context of service exchanges. Unlike previous conceptualizations used in Service Science, value models stress that service exchange is the economic motive for service systems to engage in service interactions, which makes RSS potentially useful for Service Science research aimed at service management and service engineering applications. According to [18], such a focus can broaden the perspective of service management and service engineering, which traditionally emphasize the difficulties created by the inherent inefficiencies of producing services instead of goods, to the efficient and effective mutual value co-creation in service ecosystems.

The model can, for instance, be used to identify all resources that the resource providing and integrating service systems contribute in an exchange of mutually reciprocal services, which may be useful for service innovation and design (e.g., designing new service offerings), service operations (e.g., resource acquisition, subcontracting and outsourcing decisions), and service management in general (e.g., cost accounting, pricing, profitability analysis). Further, lifecycle models for service exchanges help identify the different states in which a certain type of service exchange can be, which may be useful for simulating or monitoring service executions. The history of a service system can be expressed as a sequence of interaction episodes with other service systems [14]. The distribution of outcomes over time can be a significant performance measure for service systems with respect to their service
offerings [10]. Such measurements may prove useful for optimization and improvement initiatives, and for service engineering in general.

Further research is required to evaluate the external validity of the model by applying it to a wide range of service contexts. We also plan to extend the model with the Service Science notion of value proposition, which is another key concept in the study of service systems. As ultimately the utility of the RSS model as a research instrument for Service Science depends on the SDL mindset that it reflects, future research may be directed towards an evaluation of SDL as a foundation of Service Science.
References

1. Spohrer, J., Maglio, P.P., Bailey, J., Gruhl, D.: Steps Toward a Science of Service Systems. IEEE Computer 40, 71–77 (2007)
2. Chesbrough, H., Spohrer, J.: A Research Manifesto for Services Science. CACM 49, 35–40 (2006)
3. Vargo, S.L., Lusch, R.F.: Evolving to a New Dominant Logic for Marketing. Journal of Marketing 68, 1–17 (2004)
4. Maglio, P.P., Spohrer, J.: Fundamentals of Service Science. J. of the Acad. Mark. Sci. 36, 18–20 (2008)
5. Vargo, S.L., Maglio, P.P., Akaka, M.A.: On value and value co-creation: A service systems and service logic perspective. European Management Journal 26, 145–152 (2008)
6. Vargo, S.L., Lusch, R.F.: The Four Services Marketing Myths: Remnants from a Manufacturing Model. Journal of Service Research 6, 324–335 (2004)
7. Maglio, P.P., Srinivasan, S., Kreulen, J.T., Spohrer, J.: Service Systems, Service Scientists, SSME, and Innovation. CACM 49, 81–85 (2006)
8. Maglio, P.P., Vargo, S.L., Caswell, N., Spohrer, J.: The service system is the basic abstraction of service science. Inf. Syst. E-Bus. Manage. 7, 395–406 (2009)
9. McCarthy, W.E.: The REA Accounting Model: A Generalized Framework for Accounting Systems in a Shared Data Environment. The Accounting Review 57, 554–578 (1982)
10. Spohrer, J., Anderson, L.C., Pass, N.J., Ager, T., Gruhl, D.: Service Science. J. Grid Computing 6, 313–324 (2008)
11. Ferrario, R., Guarino, N., Barrera, M.F.: Towards an Ontological Foundation for Services Science: The Legal Perspective (2009) (in press), http://www.loa-cnr.it/Publications.html
12. Lusch, R.F., Vargo, S.L., Wessels, G.: Toward a conceptual foundation for service science: Contributions from service-dominant logic. IBM Systems Journal 47, 5–14 (2008)
13. Hocova, P., Stanicek, Z.: On Service Systems – by Definition of Elementary Concepts: Towards the Sound Theory of Service Science. In: IESS 1.0. LNBIP 53 (2010) (forthcoming)
14. Spohrer, J., Vargo, S.L., Caswell, N., Maglio, P.P.: The Service System is the Basic Abstraction of Service Science. In: HICSS (2008)
15. Vargo, S.L., Lusch, R.F.: From goods to service(s): Divergences and convergences of logics. Industrial Marketing Management 37, 254–259 (2008)
16. Weigand, H., Johannesson, P., Andersson, B., Bergholtz, M.: Value-Based Service Modeling and Design: Toward a Unified View of Services. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 410–424. Springer, Heidelberg (2009)
17. Weigand, H., Arachchige, J.J.: Value Encounters: Modelling and Analyzing Co-creation of Value. AIS Transactions on Enterprise Systems 1, 32–41 (2009)
18. Vargo, S.L., Akaka, M.A.: Service-Dominant Logic as a Foundation for Service Science: Clarifications. Service Science 1, 32–41 (2009)
Third International Workshop on Active Conceptual Modeling of Learning (ACM-L 2010)

Preface

The ACM-L 2010 workshop aims to bring together researchers, developers, and users to exchange ideas about different approaches for modeling concepts and more complicated conceptual constructs to describe the past, current and future world (or parts of them), modeling continuous learning from past experiences, capturing knowledge from transitions between system states, and recognizing new types of knowledge.

A need for active conceptual modeling for information systems arises from several sources: active modeling, emergency management, learning from surprises, data provenance, modification of the events/conditions/actions as the system evolves, actively evolving conceptual models, schema changes in conceptual models, historical information in conceptual models, ontological modeling in domain-aware systems, spatio-temporal and multi-representation modeling, etc. All these approaches may appear together. The most important needs are perhaps emergency management and learning from surprises, because they often appear in big disasters and catastrophes, such as a tsunami, an earthquake, or the eruption of a volcano. In these kinds of situations, information systems must collect large amounts of raw data, analyze it, conceptualize it, map it to the domain, distribute it, make conclusions, make plans for new activities, and manage cooperation of active officials.

The long-term goal is to provide a theoretical framework for active conceptual modeling of learning – based on multi-level, multi-dimension, multi-perspective knowledge and human cognition – for developing a learning-base system to support cognitive services/capability development and a large class of applications. The workshop will also discuss short- and long-term research goals for ACM-L development, identify use cases, and explore research areas beyond ACM-L.

We would like to express our gratitude to the authors for writing and submitting their papers, the program committee members for their work in reviewing papers, and the ER 2010 organizing committee for all their support.

July 2010
Hannu Kangassalo Sal March Leah Wong
Towards a Framework for Emergent Modeling

Ajantha Dahanayake1 and Bernhard Thalheim2

1 Georgia College & State University, CBX 12, Milledgeville, GA 31061, USA
[email protected]
2 Christian-Albrechts-University Kiel, Dept. of Computer Science, 24098 Kiel, Germany
[email protected]
Abstract. The unpredictability of crisis situations and the short time available to respond during emergencies require toolkit support for modeling approaches that produce models of information processing views within a very short time interval. The present approach to designing such systems is based on the traditional modeling practice of starting from scratch to arrive at a solution. Based on observations of shortcomings in crisis response and recovery coordination systems developed at a main European harbor, a framework for emergent modeling is presented in this paper. Emergent modeling is an approach to model coordination structures for unpredictable emergency response coordination work. Emergent modeling relies on the concept of model evolution through reusing available models and model patterns. An assessment is made of the extent to which MetaCASE and model suites are available and appropriate to fulfill the requirements of emergent modeling toolkit support.

Keywords: Emergent Modeling, model generation, model suites, model evolution, MetaCASE.
1 Introduction

Emergency situations are of an unpredictable nature, and crises require public agencies to cooperate with each other using all kinds of information and communication technology (ICT). A large-scale crisis or emergency can severely affect the normal well-being of a society, while at the same time exceeding society’s and government’s regular management abilities, which requires exceptional measures over a short period of time [1]. As such, crises are inherently wicked problems [2]. Moreover, in the event of a crisis (such as a large-scale emergency or a natural disaster), “no single organization has all the resources to alleviate the effects” [3]. A network of diverse public response agencies is usually required, including police, fire and medical services, and perhaps additional domain experts (e.g., in hazardous materials). These public agencies are also required to interact with other organizations of non-professional emergency responders, volunteer groups, NGOs, government agencies, and the press.

Essentially, the challenge of crisis response as a key public sector process is centered on the coordination and processing of information. Accordingly, the design and use of crisis response coordination systems
during inter-agency crisis response cannot follow the traditional guidelines of routine modeling practices. Therefore, an improved understanding is needed of systems modeling for coordination in crisis response and of the role that systems modeling toolkits play in supporting it.

1.1 Information Systems Modeling Issues

Traditional information systems modeling techniques suffer from a number of weaknesses. These weaknesses are mainly caused by a concentration on database modeling and by the non-consideration of application domain problems that must be solved by information systems [4], [5], and [7].

Limited scope: The vast majority of techniques are limited to the specification of data structuring, that is, properties about what the schema of the database system is expected to do. Classical functional and non-functional properties are in general left outside or delayed until coding.

Poor separation of concerns: Most modeling approaches provide no support for making a clear separation between (a) intended properties of the system considered, (b) assumptions about the environment of this system, and (c) properties of the application domain.

Low-level schematology: The concepts in terms of which problems have to be structured and formalized are concepts of small-scale modeling, most often data types and some operations. It is time to raise the level of abstraction and conceptual richness found in application domains.

Isolation: Database modeling approaches are isolated from other software products and processes, both vertically and horizontally. These approaches neither pay attention to what upstream products in the software process might provide or require, nor pay attention to what companion products should support, nor provide a link to application domain descriptions.

Poor guidance: The main emphasis in the database modeling literature has been on suitable sets of notations and on a posteriori analysis of database schemata written using such notations. Constructive methods for building correct models for complex database or information systems in a safe, systematic, incremental way are, by and large, non-existent.

Cost: Many information systems modeling approaches require high expertise in database systems and in the white-box use of tools.

Poor tool feedback: Many database system development tools are effective at pointing out problems, but in general they do a poor job of (a) suggesting causes at the root of such problems, and (b) proposing better modeling solutions.

During crisis situations, the traditional approach to modeling is inappropriate when models of coordination structures need to be fully developed for unpredictable situations or emergencies. For example, the recent volcanic ash cloud from Iceland created an unpredictable crisis for European air space and air travel that was beyond all expected emergency recovery plans of the affected airports. In such situations, the modeling of coordination structures and information processing needs to take into account the nature of unpredictability in crisis situations and the short time available to respond during such emergencies. Therefore, the development of toolkit
support systems is necessary to explore modeling approaches that produce models of information processing views within a very short time interval.

1.2 The Structure of the Paper

This paper is organized as follows. Section 2 presents a case conducted at a European harbor during a crisis, and lessons learned during the emergency response coordination systems development that led to the understanding of emergence in emergency response. Section 3 introduces the emergent modeling concept and describes the requirements for emergent modeling in crisis response coordination systems. Section 4 gives an assessment of the extent to which MetaCASE and model suites fulfill the requirements for emergent modeling toolkits, including a summary of this assessment. Finally, Section 5 concludes with future research directions.
2 Lessons Learned: Emergency Response Systems Modeling

In this section we describe the modeling of crisis response coordination at a European harbor during a crisis, and highlight the level of available modeling support and its shortcomings.

2.1 Emergency Response Systems Engineering at a European Harbor

Harbors have built networked crisis response platforms to connect all crisis relief-response organizations, which allow them to access, share, and exchange information. One example of such a platform is called the dynamic map, which has been utilized and tested at some harbors and which allows relief-response organizations to oversee the disaster area and its surroundings and anticipate future developments regarding the crisis situation.

The (re)design of the crisis response management system [12], [14] incorporated the “emergent team” concept for many reasons. Depending on the scale of the disaster, crisis responses in a harbor infrastructure will range from dealing with a small-scale problem, in which a few organizations might be involved, to a full-scale crisis, in which multiple organizations are required to resolve the crisis and to prevent its escalation. Information relevant to a crisis response may be dispersed across heterogeneous, high-volume, and distributed information resources. Such unpredictable crisis situations require the dynamic establishment of an “emergent team” consisting of the various relief-response organizations. In response to an ongoing crisis situation, membership of the “emergent team” can change depending on the type of crisis, its magnitude, and how it develops. New relief-response organizations join the “emergent team” when their services are needed, and leave when their response goals have been achieved.

The (re)design aims to ease the difficulties of distributed, dynamic, and heterogeneous environments and to support relief organizations in finding and retrieving their specific organizational role and the crisis-situation-relevant information they require to perform their crisis relief activities. In summary, as the system evolves into a crisis response coordination information system, the following requirements were identified:
(1) Capable of providing relief-response organizations with a role-related picture of the crisis development in a time-critical manner.
(2) Capable of satisfying changing information needs flexibly.
(3) Capable of structuring advanced technologies and available technical infrastructures in a meaningful way to realize dynamically changing user information needs during a crisis response.
(4) Extendable when a relief-response organization is required to join relief-response activities.
(5) Capable of handling a relief-response organization leaving the functioning system once its task is completed.

Based on these requirements a new generation of information coordination and processing tools for crisis response coordination modeling was developed. To assess the effectiveness and efficiency of the modeling tools, experiments were conducted simulating six scenarios of crisis response coordination situations and its virtual team concept. Seven members of the systems analysis and design team of the crisis response coordination and management center’s IT department took part in those six test scenarios.

The emergency response information coordination and processing systems modeling activity required access to several types of models to successfully arrive at the architecture and design specifications of those applications. The design activity involves the designing of a myriad of models, such as knowledge acquisition, knowledge selection, stakeholder analysis, network models for collaboration, coordination models for relief effects, models for knowledge management, models for process descriptions, models for network and node analysis, models for defining knowledge bases and critical knowledge ownerships, and the typical information system models for data, integration, networks, time, space and positioning, security and technology, and a few more. Based on these requirements, a new generation of information coordination and processing toolkits for crisis response systems became a necessity.
Fig. 1. Multi-model development at crisis response coordination systems modeling
2.2 Lessons Learned

The different models interpret different views of the solution space of the emergency response coordination needs in the occurring crisis, such as:

(a) Different models require different modeling tools with respective modeling languages. For example, Petri nets and critical path analysis complement each other's contributions to network modeling and analysis, but they differ in their structure, semantics, and scope. The same applies to UML object models and use-case modeling techniques.
(b) The diagrams are incrementally developed and models evolve during development.
(c) Models are not independent; they are interrelated and in most applications intertwined.
(d) Models' interrelationships are often not made explicit.
(e) Models impose changes in other models.
(f) Changes within one model may result in inconsistencies in other models due to the variety of models used.
(g) Models need to be developed within a very short time.
(h) The traditional modeling practice of starting from scratch becomes frustrating and inappropriate at the height of emergence in emergency response coordination modeling.

2.3 Emergence of Emergence in Crisis Response

The ubiquity of emergence during crises is most obvious in what have been called “emergent groups,” which are entities with no existence before the crisis. They have a transitory existence but are crucial to the response [16]. The pervasiveness of emergent groups, such as welfare agencies, search-and-rescue teams, and temporary overall community-coordinating groups, forced researchers to acknowledge their presence and study their constitution rather than consider them aberrations. A recent account of emergent groups is that of [17]; he states that they depart from the same kinds of issues: during large-scale crises, plans break down, authority structures and communities react in unforeseen ways, communication links break down, and information quality falters. As a result, emergent response groups are characterized by a great sense of urgency and high levels of interdependence under changing conditions. As such, these groups have flexible task assignments, fleeting membership, and possibly multiple conflicting goals, so that they resemble swarms rather than traditional groups.

Emergent group behavior is often understood as aggregate behavior that differs from combined behavior in that it is not equivalent to the sum of individual behaviors. Through emergence, it is possible for complex behavior to arise from simple local behaviors over time. In traditional systems design, such complex behaviors were sometimes seen “as ‘parasitic’ or to be avoided” given their unpredictability and potentially counterproductive outcomes, but they are now increasingly being harnessed for useful purposes [18]. In crisis response, a similar trend can be seen, reflecting a shift from neglect, rejection, or caution towards understanding, embracing, and supporting emergence. A notable example is the recognition of emergent groups and coordination via feedback from the Disaster Research Center (DRC). The DRC understands that no one set of standards can regulate the activity of professional crisis responders [19].

In line with the information-processing view, modeling coordination by creating emergent models through a feedback mechanism becomes more likely as diversity and uncertainty increase than modeling coordination created from scratch. The consequence of this particular way of modeling is that crisis response organizations which emphasize modeling coordination from scratch are following a questionable strategy, ignoring that crises create the conditions under which such an approach is inappropriate. However, post-disaster evaluations often use criteria dominated by structured modeling of coordination. This creates a paradox that challenges the preparation for crisis response, which is torn between the need for flexibility and the demand for control and responsibility. Accordingly, guiding modeling and systems design and development in crisis response coordination and planning favors, instead, an emergent modeling approach. Structured modeling is not only a poor modeling approach for crisis response coordination, but it is actually not even applied in the reality of crisis response operations. Nonetheless, there will
always be a simultaneous presence of emergent and structured aspects and the two should be blended together. Rather than assuming emergent modeling to be dysfunctional or inappropriate, it should be taken advantage of and incorporated into the way of thinking and modeling.
3 Requirements for Emergent Modeling

Emergent modeling is a modeling approach for developing crisis response coordination systems that captures models for emerging crisis response requirements, for “emerging teams”. Emergent models are entities with no existence before the crisis; they have a transitory existence but are crucial to the response. The emergent modeling approach must not start from scratch but reuse available models and combine them in a realistic manner, using models that were developed in similar or previous situations. Models of the emergency response coordination information systems must be built incrementally from higher-level models, in a way that guarantees high-quality construction by investing in constructiveness. The emergent modeling method is typically made up of a collection of model-building strategies, paradigm and high-level solution selection rules, model refinement rules, guidelines, and heuristics. Some of them might be domain independent; others might be domain-specific.

Emergent modeling must care for the vertical and horizontal integration of models within the entire analysis, design, development, deployment, and maintenance lifecycle: from high-level goals to be supported by appropriate architecturism, from informal formulations of information system models to conceptual models, and from conceptual models to implementation models and their integration into the deployment of emergency response coordination systems. Crisis response coordination systems modeling should move from infological design to holistic co-design of structuring, functionality, interactivity, and distribution [11] and [20]. These techniques must additionally be tolerant of errors, given the complexity of crisis response coordination systems. Richer structuring mechanisms based on problem-oriented constructs have to be developed, as well as model suites [6] and [10] that provide a means for handling a variety of models and viewpoints. The use of novel modeling paradigms should not require deep theoretical backgrounds or a deep insight into information systems technology. The results or models should be compiled into appropriate implementations using lightweight techniques.

Complex crisis response coordination information systems have multiple facets. Since no single modeling paradigm or universal language will ever serve all purposes of a system, the various facets need to be linked to each other in a coherent way through multi-paradigm modeling. To enhance communicability and collaboration within a development and support team, the same model fragment must be provided in a number of formats in a coherent and consistent way for multi-format modeling. A constructive feedback mechanism should be developed for selecting useful models and for rearranging model constructs. Instead of just pointing out problems, future tools should assist in resolving them. In general, applications keep evolving due to changes in the application domain, changes of technology, changes in crisis response coordination systems purposes, etc. A more constructive approach should also help manage the evolution of models.
4 Assessing Toolkit Support for Emergent Modeling

Emergent modeling as an approach for modeling crisis response coordination is realistic only when embedded within a MetaCASE environment. MetaCASE alone does not support all of the requirements that emergent modeling encompasses. MetaCASE can, however, be used in combination with model suites, as discussed in [13], to address a greater number of the requirements identified in Section 3.

4.1 The MetaCASE Toolkit

The importance of Computer Aided Method Engineering (CAME) [8] in CASE (Computer Aided Systems Engineering) led to the development of CASE shells and MetaCASE toolkits (fully customizable CASE environments intended to overcome the inflexibility of modeling language support) [8]. MetaCASE research has addressed issues such as model integration, multi-user support, multi-representation paradigms, model modifiability and evolution [9], and information retrieval and computation facilities. Today, MetaCASE is the theory behind the solution for full agility in supporting arbitrary modeling languages and has become the de facto standard for information systems analysis and design toolkits.

The global philosophy behind MetaCASE is to provide the toolkit required for a modeling language, according to the chosen modeling activities, within a systems development methodology. The fundamental theory is the separation of concepts from their visual representations, so that a chosen set of concepts that belongs to a language is combined with graphic representations to function as a modeling tool and to visualize a graphical presentation of a real-world situation, the model. Further, MetaCASE offers mechanisms for the population of models to create data.

Generally, the central repository of a MetaCASE environment is a layered database architecture consisting of four layers: signature, language, model, and data. The collection of modeling concepts that are required to define arbitrary modeling languages belongs to the signature layer. The language layer represents modeling language descriptions consisting of concepts, rules, and behavior, which are normally called the meta models of the modeling languages, and contains the relevant graphical representations of those concepts. The combination of concepts and their graphical representations leads to the generation of modeling tools. The language layer is thus responsible for the construction of meta models of modeling languages and the generation of modeling tools by combining modeling concepts with their graphical or visual representations. This activity is called modeling method engineering. The model layer accommodates the models designed with the toolkits. Data, the populations of models, resides in the data layer.

MetaCASE has made a substantial contribution to toolkit support for the design and development of information systems. The prominent advantages of MetaCASE solutions are: the ability to configure toolkits for structure-oriented modeling languages, such as Entity Relationship diagrams, Domain Class diagrams, etc.; the ability to configure and support behavior-oriented modeling languages such as event charts, use cases, etc.; the ability to configure and support process-oriented modeling languages such as Activity diagrams, Systems Sequence diagrams, etc.; and the layered architecture and the modular and orthogonal functional representation leading to reusability, agility, and the evolution to more complex modeling requirements.
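To make the layered repository architecture described above more concrete, here is a minimal Python sketch of the four layers (signature, language, model, data); the classes and fields are assumptions introduced solely for illustration and do not correspond to any particular MetaCASE product.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Signature:
    """Signature layer: the modeling concepts available for defining languages."""
    concepts: List[str]

@dataclass
class Language:
    """Language layer: a meta model of a modeling language plus its notation,
    i.e. the definition from which a modeling tool is generated."""
    name: str
    meta_model: Dict[str, List[str]]  # concept -> rules / related concepts
    notation: Dict[str, str]          # concept -> graphical representation

@dataclass
class Model:
    """Model layer: a model designed with a generated modeling tool."""
    language: Language
    elements: List[str]

@dataclass
class Data:
    """Data layer: the population of a model."""
    model: Model
    instances: List[dict] = field(default_factory=list)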
4.2 Model Suite Support in MetaCASE

The extension of MetaCASE with model suites discussed in [13] is a viable concept for model management, which enables reuse when implemented with MetaCASE. A model suite [6] consists of a set of models with explicit associations among the models, with explicit controllers for maintenance of the coherence of the models, with application schemata for their explicit maintenance and evolution, and with tracers for establishment of their coherence. This theory has been tested against typical applications such as OLTP-OLAP architectures, multi-model suites at the same abstraction layer, and challenging applications for scientific databases. Model suites are based on a general theory, on a specification technology, and on an implementation technology for sets of models that share common sub-models, that collaborate, and that evolve over time, with new or corrected data and with new analysis tasks. These models must be tightly coupled. Model suites are used to specify this model coupling and model collaboration. Renewable and evolving model clusters are going to be developed for the development of a theory and supporting technology.

Crisis response coordination systems are, per definition, distributed, location independent, and accessible via the internet, and the users of the systems are becoming more and more mobile and ubiquitous, requiring adaptation to their varying usage, contexts, and goals. Those systems challenge current modeling techniques by requiring a large variety of tools that are capable of providing a large variety of changes in scope, impact, granularity, abstraction level, etc., embedded in a modeling support toolkit which guarantees the co-existence of those models and the co-evolution of these models for providing different perspectives of the same solution or application. Therefore, model suite support in MetaCASE can generate a variety of modeling toolkits for large-scale modeling [13]. Such MetaCASE model suite toolkits can exhibit the following characteristics (a minimal code sketch of these characteristics follows the list):

(1) Explicit specification of model suite collaboration: Interdependencies among models can be given in an explicit form. The consistencies of models become recursive.
(2) Integrated development of different models: Models are used to specify different views of the same problem or application. They become consistent in an integrated manner. Their integration is made explicit. Simultaneous updates of models are allowed.
(3) Co-evolution of models: The model suites allow data exchange between models. A change within one model is propagated to all dependent models.
(4) Combining different representations with mathematical rigor of models: Each model consists of well-defined semantics as well as a number of representations for the display of model content. The representation and the model are tightly coupled.
(5) Evolution of different representations: Changes within any model can either be refinements of previous models or explicit revisions of such models. These changes are enforced for other representations as well whenever those are concerned.
(6) Management of model suites: The propagation of changes is supported by scheduling mechanisms, e.g., ordering of the propagation of model changes. The management must support rollback to earlier versions of the model suite. The management should also allow model change during propagation.
(7) Version handling for model suites: Model suites may have different versions.
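A minimal sketch of characteristics (1)-(7), written as an assumption-laden illustration rather than as an implementation of the model suite theory of [6] or [13]; the class and method names are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ModelSuite:
    """Set of models with explicit associations and controllers for coherence."""
    models: Dict[str, dict] = field(default_factory=dict)             # model name -> model content
    associations: Dict[str, List[str]] = field(default_factory=dict)  # model name -> dependent models (1)
    controllers: List[Callable[[str, dict], None]] = field(default_factory=list)
    versions: List[Dict[str, dict]] = field(default_factory=list)     # version handling (7)

    def update(self, model_name: str, change: dict) -> None:
        """Apply a change to one model and propagate it to all dependent models (3), (6)."""
        # Keep a copy of the previous state so that rollback to earlier versions is possible.
        self.versions.append({name: dict(content) for name, content in self.models.items()})
        self.models[model_name].update(change)
        for dependent in self.associations.get(model_name, []):
            for controller in self.controllers:
                controller(dependent, change)  # explicit controllers maintain coherence (1), (2)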
4.3 Assessing the Available Body of Knowledge for Emergent Modeling Toolkit

In this subsection we evaluate the already available body of knowledge that is useful for the development of a supporting toolkit for crisis response coordination systems modeling and development. We first evaluate the MetaCASE toolkit and the model suites theory as the most suitable knowledge at hand for emergent modeling. The summary of this evaluation is given in the table below.

Emergent Modeling requirements                     MetaCASE             Model Suites
Reuse of models and model components               Yes, but different   Yes, but needs improvement
Constructiveness                                   Not available        Not available
Integration                                        Yes, available       Yes, available
Richer structuring mechanism                       Yes, but limited     Yes, available
Lightweight techniques                             Not available        Not available
Multi-paradigm modeling                            Yes, available       Yes, available
Multi-format modeling                              Not available        Yes, available
Constructive feedback mechanisms                   Limited              Not available
Tools assisting to resolve modeling problems       Not available        Not available
Evolution of models                                Not available        Yes, available
From this evaluation we can conclude that a MetaCASE model suites toolkit can provide the basis for the incorporation of the remaining identified requirements to arrive at a fully fledged emergent modeling toolkit.
5 Conclusions

Emergent modeling is a modeling approach for crisis response coordination. To incorporate emergent modeling functionality within a toolkit, the first step is to build its foundation on a MetaCASE model suites toolkit. As a second step, the remaining concepts and theories required for an emergent modeling approach (those that are not represented within MetaCASE or model suites) should be incorporated. The main requirement is to develop solution models in a very short time. The advantage then comes from quick solution modeling by reusing available models and model patterns.

The reuse of available models and model components is a theory that needs to be developed in order to make emergent modeling become practice. Also, reuse may need to include simulations and simulation models at some point to identify coordination patterns. Reuse then may be either reusing for examples or reusing for real model building and further evolution into solution models. The challenge of emergent modeling is the evolution of the application domain itself. This challenge becomes crucial for crisis response coordination systems due to the low ‘half-life’ period observed for these applications and the high potential of evolution, migration and integration. Therefore, pattern-backed information systems development has great potential to contribute to crisis response coordination systems building.
References

1. Da-li, H., Hua-lin, W., Chang-nan, W.: Research on Framework of Public Crisis Management System under the Circumstance of E-Governance. In: 4th International Conference on Wireless Communications, Networking and Mobile Computing (2008)
2. Lodge, M.: The Public Management of Risk: The Case for Deliberating among Worldviews. Review of Policy Research 26(4), 395–408 (2009)
3. Bui, T., Cho, S., Sankaran, S., Sovereign, M.: A Framework for Designing a Global Information Network for Multinational Humanitarian Assistance/Disaster Relief. Information Systems Frontiers 1(4), 427–442 (2000)
4. Thalheim, B.: Towards a Theory of Conceptual Modelling. In: Heuser, C.A., Pernul, G. (eds.) ER 2009 Workshops. LNCS, vol. 5833, pp. 45–54. Springer, Heidelberg (2009)
5. Ma, H., Schewe, K.D., Thalheim, B.: Modelling and Maintenance of Very Large Database Schemata Using Meta-structures. In: UNISCON 2009. Lecture Notes in Business Information Processing, vol. 20, pp. 17–28. Springer, Heidelberg (2009)
6. Thalheim, B.: The Conceptual Framework to Multi-Layered Database Modelling. In: EJC, Maribor, Slovenia, pp. 118–138 (2009)
7. Thalheim, B., Klettke, M.: Evolution and Migration of Legacy Information Systems. In: Handbook of Conceptual Modelling, ch. 11. Springer, Heidelberg (2010)
8. Dahanayake, A.N.W.: An Environment to Support Flexible Information Modeling. PhD thesis, Delft University of Technology, The Netherlands (1997)
9. Kelly, S., Smolander, K.: Evolution and issues in MetaCASE. Information & Software Technology 38(4), 261–266 (1996)
10. Thalheim, B.: Extended Entity-Relationship Model. In: Encyclopedia of Database Systems, pp. 1083–1091. Springer, Heidelberg (2009)
11. Schewe, K.-D., Thalheim, B.: The co-design approach to web information systems development. International Journal of Web Information Systems 1(1), 5–14 (2005)
12. Chen, N., Dahanayake, A.N.W.: Role-based situation-aware information seeking and retrieval for crisis response. Journal of Intelligent Control Systems 12(2), 186–197 (2007)
13. Dahanayake, A., Thalheim, B.: Co-Evolution of (Information) System Models. In: EMMSAD 2010, Tunis (2010)
14. Gonzales, R., Verbreack, A., Dahanayake, A.: Extending the information processing view of crisis response. International Journal of Electronic Government Research (IJEGR) 6(4) (2010)
15. Quarantelli, E.L., Dynes, R.R.: Response to Social Crisis and Disaster. Annual Review of Sociology 3(1), 23–49 (1977)
16. Majchrzak, A., Jarvenpaa, S.L., Hollingshead, A.B.: Coordinating Expertise Among Emergent Groups Responding to Disasters. Organization Science 18(1), 147–161 (2007)
17. Lynden, S.J., Rana, O.F., Margetts, S., Jones, A.J.: Emergent coordination for distributed information management. In: IEEE Conference on Evolutionary Computation (2000)
18. Dynes, R.R., Aguirre, B.E.: Organizational adaptation to crises: Mechanisms of coordination and structural change. Disasters 3(1), 71–74 (1979)
19. Thalheim, B.: Codesign of structuring, functionality, distribution and interactivity. In: Australian Computer Science Communications (2004)
When Entities Are Types: Effectively Modeling Type-Instantiation Relationships

Faiz Currim1 and Sudha Ram2

1 Department of Management Sciences, University of Iowa, Iowa City, IA, USA
2 Department of Management Information Systems, University of Arizona, Tucson, AZ, USA
[email protected], [email protected]
Abstract. Type-instantiation relationships (TIRs) appear in many application domains including RFID-based inventory tracking, securities markets, health care, incident-response management, travel, advertising, and academia. For example, an emergency response (type) is instantiated in an actual incident, or an advertisement (type) serves impressions on a website. This kind of relationship has received little attention in the literature notwithstanding its ubiquity. Conventional modeling does not properly capture its underlying semantics. This can lead to data redundancy, denormalized relations, and loss of knowledge about constraints during implementation. Our work formally defines and discusses the semantics of the type-instantiation relationship. We also present an analysis of how TIRs affect other relationships in a conceptual database schema, and the relational implications of our approach.

Keywords: data modeling, relationships, typing, instantiation, materialization.
1 Introduction

In this paper, we discuss and refine the modeling of type-instantiation relationships (TIRs). A popular application is RFID-enhanced inventory management [1]. EPC tags are placed on individual items and data about them is read in real time by the reader network. Organizations store information about product types (e.g., a 19" monitor model 1957B) as well as the monitor units (instantiations) produced or in stock (each with its own serial number, date of manufacture, and other properties).

The concept of instantiation is well known in object-oriented design, where classes are instantiated into objects. In semantic data modeling, the term materialization has been proposed [2]. Initial studies of TIRs focused on inventory-related scenarios where a conceptual entity is materialized into concrete instances [2]. As mentioned earlier, this is an important case, particularly with the advent of RFID technology. However, TIRs appear in numerous other contexts, particularly in service industries. This is why we do not adopt the term materialization, since a TIR may associate two completely abstract or intangible entities. For example, consider the realm of online advertising (the same principles apply for traditional advertising as well). A client creates an advertisement with Google AdWords™ by choosing specific keywords related to their business. Depending on the user profile (e.g., search patterns,
location), this ad may be instantiated in multiple forms (word or phrase combinations). The final step in the TIR hierarchy is the actual ad impression displayed on a webpage that an end-user can click on. The securities market is another area where TIR hierarchies are common. Futures contracts for a specific metal or commodity (e.g., gold 100 troy ounces, or “GC 100”) participate in a TIR relationship with a version of the contract (e.g., a “GC 100 ending December 2010”). In emergency response, it is important to distinguish between an emergency type (e.g., a 3-alarm fire or a shooting) and actual incidents on a specific day and location.

Conceptual design is an important part of the database design process. A good conceptual grammar is said to possess properties of simplicity, minimality, formality and expressiveness [3]. Expressiveness and simplicity typically involve a trade-off. Giving modelers a comprehensive toolbox of generic relationships to ease their design tasks is a valid argument [4], but this is weighed down by the trade-off of additional complexity [5]. Authors, therefore, typically rely on the usefulness of a grammar extension as a “moderating factor” when proposing conceptual augmentations. We feel that type-instantiation relationships occur frequently enough in real-world applications to warrant further study. Formality requires a clear specification of relationship semantics, which we provide using set-theoretic notation. Minimality refers to ensuring that the semantics of TIRs are distinct from those of other constructs (i.e., the definition of TIRs is distinct from those of other constructs). We address these concerns in Section 3. In Section 2 we discuss the data modeling issues that arise in the absence of TIRs.
2 Data Modeling Problems
Earlier literature has brought into focus interesting issues with the ability to identify and record data about items down to the instance level [2, 4]. A good conceptual model narrows the gap between real-world concepts and their representation, particularly when such concepts occur frequently in the real world [6]. Separating the product type from its units, and avoiding redundant storage of the model-level data for each unit, is a practical requirement. The limitations of past work include the lack of a formal specification of construct semantics for TIRs (particularly in terms of minimality, and the distinction between TIRs and other relationships), as well as the limited discussion of how TIRs impact other relationships. We feel that accurate specification and modeling of constructs is critical to effectively processing and using the data at the enterprise level, and subsequently sharing it across the supply chain. We develop a simplified internet advertising example to illustrate some semantic issues with TIRs. AdWordsCo (AWC) handles advertisements from customers. Each ad possesses a unique ad number, budget, description, and target URL. Depending on end-user search patterns, an impression of an ad may be displayed on web pages. AWC records the date each impression was displayed and its position on the page. The Entity Relationship (ER) diagram for this mini-case is in Fig. 1. For brevity, we show only entity classes in subsequent figures and omit attributes.
Fig. 1. Ads and Impressions (shaded properties inherited)
The easiest option is to use a regular interaction relationship (Fig. 1). While simple, this option has deficiencies because inheritance is not modeled. Every impression of a given ad has a specific descriptive text to display and a destination URL (inherited properties, as with subclass entities). Depending on the application, it may also inherit information from relevant relationships (e.g., with clients) and subclass hierarchies (e.g., subclasses of ads related to an instantiation). These semantics are not represented in an interaction relationship.
Fig. 2. Modeling Impressions as a weak entity class
Some of these problems may be alleviated by modeling Impressions as a weak entity class (Fig. 2). Weak entity classes have been used in some textbooks to tackle the "model-versus-instance" kind of relationship [7]. While this implies that an AdNo attribute is automatically associated with an impression, we notice some problems. Weak entity classes do not inherit non-identifying attributes. However, an impression instance does possess properties like destination URL and descriptive text (we term these common attributes). Similarly, common relationships have specific implications for the impressions (see Sec. 3.3). Further, IMPRESSIONS does not depend on ADS for its identifier, and the impression ID is not a "partial identifier". This is particularly noticeable when product or software serial numbers are involved. Instantiation entities have a natural identifier of their own, which goes against the original definition of a weak entity class proposed by Chen [8].
Fig. 3. Modeling Ads and Impressions in a realization relationship
Another option is to use the UML realization relationship (Fig. 3) to model the association between ads and impressions. The realization relationship in UML allows for the definition of interfaces. While it should be intuitively clear why interfaces from OO programming are different from TIRs, we briefly provide some points of distinction. By definition, an interface cannot be instantiated (unlike types, which are
instantiated). Instead, interfaces are designed to provide method declarations that other classes can inherit. A class implementing an interface must provide code to implement all the methods declared in the interface. In our case this would include values of the properties from ADS: in a realization relationship, ADS itself would not have values for description or URL, which contradicts the definition of attributes in an entity class.
Fig. 4. Modeling Ads and Impressions in a type-instantiation relationship
As we have seen, existing approaches have limitations in capturing the required semantics of the type-instantiation relationship. We therefore feel that the type-instantiation relationship itself, as shown in Fig. 4, is the most suitable approach to modeling ads and impressions.
3 Construct Semantics and Discussion
We present set-theoretic definitions for the constructs to elaborate their precise semantics and demonstrate the differences between them (necessary for formality and minimality, as mentioned in Section 1). As will be shown, instantiation classes are mathematically distinct from other constructs.
3.1 Entity Classes and Attributes
Entities: A regular (strong) entity e can be defined as a 4-tuple (A, K, D, M) where: A is the set of attributes for the entity, K is the set of identifying attributes (K ⊆ A), D is the set of domains for the attributes, and M represents the set of mappings between an entity attribute and the corresponding domain for the entity (i.e., it says "this employee-entity has a value of Joe Smith for the attribute name from the domain legal names"). This definition of an entity extends the definition of (A, K) provided by Thalheim [9]. In Thalheim's definition, the attribute-domain relationship is defined at the schema level. This is unfavorable in our opinion, since it requires any two attributes with the same name to have the same type definition. Instead, we allow D to be non-exclusive to a single entity or class.
Entity Types, Sets and Entity Classes: Classification relates entities sharing a common (A, K, D) as belonging to the same entity type, i.e., ∀ ei, ej ∈ E, the sets of (A, K, D) for ei are the same as for ej. For simplicity, we ignore relationships the entity class participates in. The difference between an entity type and an entity set is that the former is the intension comprising a set of properties, and the latter the extension comprising a set of instances that possess the properties [10]. We use entity class to
denote a reference to the combination of the entity type and entity set. In common usage the terms are often used synonymously.
Attributes: Every entity e ∈ E (the class) has a set of attributes A = {A1…Am}. The value that an attribute Ai takes on for a given entity ej is given by the function mij: ej × Ai → Di, where Di represents the domain of the particular attribute Ai. We denote the set of attribute-domain mappings as M. If the attributes are multi-valued, we can replace Di with P(Di), where P(Di) is the power set of Di.
Identifying Attributes: Entities have an identifier K (a set of attributes with cardinality | K | ≥ 1). Let A be the set of all attributes {A1, A2, …, Am}; then K ⊆ A is defined as an identifier s.t. ∀ei, ej ∈ E, ei(K) = ej(K) ⇒ i = j, and {e(K1) ≠ Ø, …, e(Km) ≠ Ø}, where ei(K) is the mapped value of K for ei.
Weak Entity Classes: A weak entity class W may be defined as the 4-tuple (A, WK, D, IR), where IR is the identifying relationship that leads to the creation of the weak entity class. WK is the partial identifier (possibly empty) that, along with the identifier K of its identifying owner(s), can uniquely identify the weak instance.
3.2 Composite Classes
A composite relationship leads to the modeling of a new class called the composite class [11]. A composite relationship is defined on a base class B, and each composite entity c ∈ C is a collection of entities from the base class (c ⊆ B). A composite relationship based on the class E and additional composite attributes Aj (1 ≤ j ≤ m) is defined as CR ⊆ P(E) × A1 × A2 × … × Am. The domain of an attribute Aj for a composite entity may include the base entity class (e.g., the costliest "ad" in c). These attributes are known as self-referenced attributes. A self-referenced property relationship can be defined as a mapping from an attribute RAj to the power set of the base entity class, P(E). Formally: Mj ⊆ RAj × P(E). As an example of a composite class, ads may be classified into ad categories based on the kind of product they market (e.g., consumer good, educational service, online product; each of which forms a composite entity).
3.3 Typing and Instantiation Classes
We introduce the terms typing class and instantiation class. We avoid the terms "supertype" and "subtype" since some authors use them interchangeably with "superclass" and "subclass". For a typing class, its attributes A can be partitioned into two kinds, common and class-specific. The common attributes, CA (note: K ⊂ CA), are inherited by the instantiation class (e.g., AdNo, clickURL). The class-specific attributes, SA, are unique to each typing entity (e.g., an ad's budget across all its impressions) and do not apply to its instantiations. Common relationships have the same "type" (degree, associated entity classes) for both the typing and instantiation classes. However, unlike attributes, their "value" (actual associated entity members) can be different between a typing entity and its instantiations.
An instantiation class only inherits the 2-tuple (CA, D) directly from its typing class and is not a subset of the typing class. The identifier of the typing class T is simply an inherited attribute for the instantiation class (and not the identifier of the instantiation). The instantiation class also implicitly inherits attribute data from the relevant subclasses of its typing class. Given a typing class T = (CA, SA, D, K), an instantiation class I can be defined as (T, A', D', K'), where (A', D', K') are unique to the instantiation class. Also, if i is an instantiation of t, then i(CAk) = t(CAk) ∀k ∈ {1, …, m}, where the CAk are common attributes from T. Likewise, if t ∈ G (where G is a subclass of T with attributes Ag), then i(CAg) = t(CAg). Interaction relationships applicable to both the typing and instantiation classes serve to define the domain of values for the instantiation class. To model the "common relationships" we annotate the relationship diamond of the typing class with a circled C, and draw a line connecting the corresponding relationships (see Fig. 5). The circled C is optional (included for clarity). Formally, with a typing class T, an instantiation class I, and an entity class E in a common relationship, we denote the typing class's relationship as RelTE and use RelIE for the relationship with the instantiation class. The projection function π(RelTE) returns the entities from E associated with the typing entity t. Then, for every instantiation i of typing entity t, π(RelIE) ∈ π(RelTE).
Some authors have used typing and composite classes interchangeably [12, 13]. While there is some similarity between the two constructs, e.g., a single typing (or composite) entity relates to multiple instantiation (or base class) entities, there are a number of differences between typing and composite classes (including their formal semantics, as seen in Sections 3.2 and 3.3). We summarize these along four dimensions.
Modeling direction: Typing classes lead to the instantiation classes (top-down), while base classes aggregate to composite classes (bottom-up).
Common attributes: For composite classes, at most a single type attribute is shared between the base class and the composite class, while multiple common attributes are inherited by the instantiations.
Common relationships: Composite and base classes do not share common constraining relationships.
Cardinality of membership: An instantiation entity can be associated with exactly one typing entity. A base class member can belong to none, one, or possibly multiple composite entities.
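To make these semantics concrete, the following minimal Python sketch (not code from the paper; the entity and attribute names are assumed for illustration) shows common-attribute inheritance from a typing entity to its instantiations and the check that an instantiation's common relationship stays within that of its typing entity.

class TypingEntity:
    """A typing-level entity (e.g., an ad): common attributes (CA) are
    inherited by instantiations; specific attributes (SA) are not."""
    def __init__(self, key, common, specific, common_rel):
        self.key = key                      # identifier K (e.g., AdNo)
        self.common = common                # CA values (e.g., clickURL)
        self.specific = specific            # SA values (e.g., budget)
        self.common_rel = set(common_rel)   # RelTE (e.g., sites the ad can appear on)

class InstantiationEntity:
    """An instantiation-level entity (e.g., an impression of an ad)."""
    def __init__(self, typing, key, own_attrs, rel):
        self.typing = typing                # exactly one typing entity
        self.key = key                      # natural identifier (e.g., impID)
        self.own_attrs = own_attrs          # A' values (e.g., pgPos, dateDisp)
        self.rel = set(rel)                 # RelIE (e.g., site the impression appeared on)

    def attribute(self, name):
        # Common attributes are inherited from the typing entity.
        return self.own_attrs.get(name, self.typing.common.get(name))

    def common_relationship_ok(self):
        # pi(RelIE) must lie within pi(RelTE) of the associated typing entity.
        return self.rel <= self.typing.common_rel

ad = TypingEntity("A1", {"clickURL": "http://example.org"}, {"budget": 500},
                  common_rel={"siteX", "siteY"})
imp = InstantiationEntity(ad, "I1", {"pgPos": "top"}, rel={"siteX"})
assert imp.attribute("clickURL") == "http://example.org"  # inherited common attribute
assert imp.attribute("pgPos") == "top"                    # instantiation-specific value
assert imp.common_relationship_ok()                       # siteX is among the ad's sites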
4 Relational Implications
In this section we discuss the relational translation of type-instantiation relationships, and the impact of TIRs on relationships in which both the typing-level and instantiation-level classes are involved. Fig. 5 shows a modified schema for AWC. The translation into relations is shown in Table 1. As mentioned in Section 3.3, π(RelIE) ∈ π(RelTE). This requires a referential integrity constraint from the table capturing RelIE (IMPRESSIONS) to the table for RelTE (CAN_APPEAR) rather than to E (WEBSITES). Doing so provides a stronger and semantically more accurate referential integrity constraint for the relationship with the common entity class. For example, let us assume that an ad for Phoenix College can appear on 40 different websites related to education and online training. There are many other partner websites that host ads provided by AWC, but an impression for
Phoenix College will only appear on the 40 relevant hosts. It thus makes sense to have the combination of (AdNo, siteID) reference the CAN_APPEAR relation.
Fig. 5. ER schema for AWC
Table 1. Translation of AWC Schema
ADS (AdNo, description, budget, clickURL)
WEBSITES (siteID, sname, saddress)
CAN_APPEAR (AdNo, siteID)
  Foreign Key (AdNo) references ADS
  Foreign Key (siteID) references WEBSITES
IMPRESSIONS (impID, pgPos, dateDisp, AdNo, siteID)
  Foreign Key (AdNo) references ADS
  Foreign Key (AdNo, siteID) references CAN_APPEAR
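As a runnable sketch of the Table 1 translation (using SQLite via Python; the column types and sample values are invented for illustration), the composite foreign key from IMPRESSIONS to CAN_APPEAR rejects an impression on a website where the corresponding ad cannot appear:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE ADS (AdNo TEXT PRIMARY KEY, description TEXT, budget REAL, clickURL TEXT);
CREATE TABLE WEBSITES (siteID TEXT PRIMARY KEY, sname TEXT, saddress TEXT);
CREATE TABLE CAN_APPEAR (
    AdNo TEXT REFERENCES ADS, siteID TEXT REFERENCES WEBSITES,
    PRIMARY KEY (AdNo, siteID));
CREATE TABLE IMPRESSIONS (
    impID TEXT PRIMARY KEY, pgPos TEXT, dateDisp TEXT, AdNo TEXT, siteID TEXT,
    FOREIGN KEY (AdNo) REFERENCES ADS,
    FOREIGN KEY (AdNo, siteID) REFERENCES CAN_APPEAR (AdNo, siteID));
INSERT INTO ADS VALUES ('A1', 'Phoenix College', 1000, 'http://example.org');
INSERT INTO WEBSITES VALUES ('S1', 'EduSite', 'http://edu.example');
INSERT INTO WEBSITES VALUES ('S2', 'OtherSite', 'http://other.example');
INSERT INTO CAN_APPEAR VALUES ('A1', 'S1');
INSERT INTO IMPRESSIONS VALUES ('I1', 'top', '2010-07-01', 'A1', 'S1');  -- accepted
""")
try:
    # Rejected: ad A1 cannot appear on S2, so this row would violate the inclusion dependency.
    con.execute("INSERT INTO IMPRESSIONS VALUES ('I2','top','2010-07-01','A1','S2')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)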
Note: The default translation, with the foreign key referencing WEBSITES, would lead to a table in 5NF but not in DKNF, since the inclusion dependency (RelIE ⊆ RelTE) would not be enforced and could be violated, allowing deletion anomalies [14]. Foreign keys can capture sufficient referential integrity semantics when a binary common relationship is involved and the cardinality on the common class (E) side is at most 1 (since this permits translation by importing the identifier of E into the instantiation class). When the degree of the relationship is not 2 (i.e., we are dealing with unary, ternary or higher-arity relationships), or the cardinality of the common relationship with the instantiation class is M-M, the proper referential integrity must be enforced using triggers. Constraining relationships occur in a variety of applications (including TIRs). As an example we consider the classical bill-of-materials structure (Fig. 6). A product (model; typing level) is used by [0:M] other products (also at the model level). A model may use multiple sub-component product models. For example, say a subAssembly-α is part of engine models engineA and engineC (each engine model includes other parts). At the instantiation level, a single manufactured engineA (serial number E12345) uses many items, including subAssembly-α serial number S98765. Once item S98765 is used by E12345, it cannot be included in another item (engine or otherwise). When a specific unit (e.g., item S98765) is used by another item, it is important to know whether it was used by an item of the correct model.
Fig. 6. Constraining Relationship of degree 1
The corresponding relations are:
PART_OF (ComponentModelNo, AggregateModelNo)
ITEMS (SerNo, ModelNo, ..., UsedInItemSerNo)
  Foreign Key (UsedInItemSerNo) references ITEMS
  (best we can do; no connection with the corresponding typing entity or model)
As can be seen, the foreign key (UsedInItemSerNo) no longer suffices to check that the item was used in a manner constrained by the PART_OF relationship. We need an additional trigger to check that the model of any specific UsedInItemSerNo is within the domain permitted by the PART_OF relationship. To allow a reference to the corresponding model, one could modify the ITEMS table to:
ITEMS (SerNo, ModelNo, ..., UsedInItemSerNo, UsedInModelNo)
  Foreign Key (UsedInItemSerNo) references ITEMS
  Foreign Key (ModelNo, UsedInModelNo) references PART_OF
The problem is that the ITEMS table then drops to 2NF. The recommended approach involves a trigger instead. The same problem arises when the cardinality at the instantiation level is M-M or the degree of the common relationship is ternary (which results in a separate relation for the instance-level common relationship).
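The check such a trigger would enforce can be sketched as follows (plain Python over illustrative in-memory relations, using the serial and model numbers from the example; this is not the recommended DDL itself):

# PART_OF holds (ComponentModelNo, AggregateModelNo); ITEMS maps SerNo to
# (ModelNo, UsedInItemSerNo). Sample data follows the subassembly/engine example.
PART_OF = {("subAssemblyAlpha", "engineA"), ("subAssemblyAlpha", "engineC")}
ITEMS = {
    "E12345": ("engineA", None),
    "S98765": ("subAssemblyAlpha", "E12345"),
}

def used_in_correct_model(ser_no):
    """True iff the item is unused, or its model may be a component of the
    model of the item it is used in, according to PART_OF."""
    model_no, used_in = ITEMS[ser_no]
    if used_in is None:
        return True
    aggregate_model = ITEMS[used_in][0]
    return (model_no, aggregate_model) in PART_OF

assert used_in_correct_model("S98765")   # subAssembly-alpha may be part of engineA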
Fig. 7. ER schema for ArtCo
The final case we consider is where the mini-world has entities participating in a relationship, and some subset of the entities needs to be tracked at the instance level. Consider an artisan who owns a small business (ArtCo) selling both self-crafted products and vendor-supplied ones (Fig. 7). Assume that the artisan is interested in tracking a serial number only for those orders in which a hand-crafted item is sold.
Normally, there would be three ways to translate Relates to (since it is a 1-to-1 relationship): (a) placing a foreign key (serial number) in the ORDERLINES table, (b) placing a foreign key in the CRAFTEDITEMS table, or (c) creating a new table for the relationship. It turns out that all options require designer intervention to avoid ending up below BCNF. One may rule out option (a) owing to the minimum cardinality of [0:1] on the CRAFTED side (which leads to null values), so let us consider options (b) and (c). The usual translation for option (b) is:
CRAFTEDITEMS (SerialNumber, CProductID, OrdNo, OProductID)
  Foreign Key (CProductID) references CRAFTEDPRODUCTS
  Foreign Key (OrdNo, OProductID) references ORDERLINES
Option (c):
RELATESTO (OrdNo, ProductID, SerialNumber)
  Foreign Key (OrdNo, ProductID) references ORDERLINES
  Foreign Key (SerialNumber) references CRAFTEDITEMS
For option (b), as can be seen, the default translation ends up with a dual reference to the product ID; removing one of them normalizes the table. For option (c), the problem is the functional dependency SerialNumber → ProductID, which leads to a table in 3NF but not in BCNF. Clearly, therefore, option (c) is not a good design solution, and one should use a version of option (b) with the duplicate attribute removed.
5 Discussion and Conclusion
In this paper, we showed that modeling TIRs leads to a better representation of real-world semantics and a more expressive conceptual grammar. We formally specified the semantics of the relationship construct and distinguished it from related concepts such as interaction, inclusion, composition, and realization relationships. We also discussed the relational translation options. The correct handling of constraining relationships leads to a schema with stronger referential integrity and a higher normal form. Proper management of instantiation-level data provides strategic information for decision-making at the organizational level and across partners and the supply chain [15]. From a design science [16] perspective, our work falls into the build and evaluation dimensions of the design process. We have presented formal rigor in our construct definitions and examples to serve as an initial evaluation of the design artifacts. A preliminary study by one of the authors in partnership with a large automotive corporation found that cases of type-instantiation relationships occur frequently in manufacturing. This can be generalized to almost all scenarios where large-scale inventory management is involved. Further, as pointed out in the introduction, service industries such as advertising, emergency management, financial markets, and healthcare see frequent occurrences of TIRs. Future work includes empirical testing and evaluation of TIRs with case studies. Instantiations naturally occur over time and space, and so we are particularly interested in the implications for active data management with implementations across partners or the supply chain.
References
1. Currim, F., Ram, S.: RFID Data: A case of and for types and instances. In: Conference RFID Data: A case of and for types and instances (2009)
2. Goldstein, R., Storey, V.: Materialization. IEEE Transactions on Knowledge and Data Engineering 6, 835–842 (1994)
3. Batini, C., Ceri, S., Navathe, S.B.: Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings Publishing Company (1992)
4. Dahchour, M., Pirotte, A., Zimányi, E.: Materialization and Its Metaclass Implementation. IEEE Transactions on Knowledge and Data Engineering 14, 1078–1094 (2002)
5. Dobing, B., Parsons, J.: How UML is used. Communications of the ACM 49, 109–113 (2006)
6. Pirotte, A., Zimányi, E., Massart, D., Yakusheva, T.: Materialization: A Powerful and Ubiquitous Abstraction Pattern. In: Conference Materialization: A Powerful and Ubiquitous Abstraction Pattern, pp. 630–641 (1994)
7. Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems. Addison Wesley, Reading (2003)
8. Chen, P.P.: The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems 1, 9–36 (1976)
9. Thalheim, B.: Entity-Relationship Modeling: Foundations of Database Technology. Springer, Heidelberg (2000)
10. Parsons, J., Wand, Y.: Guidelines for Evaluating Classes in Data Modeling. In: Conference Guidelines for Evaluating Classes in Data Modeling, pp. 1–8 (1992)
11. Ram, S.: Intelligent Database Design using the Unifying Semantic Model. Information and Management 29, 191–206 (1995)
12. Batra, D.: Conceptual Data Modeling Patterns: Representation and Validation. Journal of Database Management 16, 84–106 (2005)
13. Hay, D.C.: Data Model Patterns: Conventions of Thought. Dorset House Publishing Company, New York (1995)
14. Fagin, R.: A normal form for relational databases that is based on domains and keys. ACM Transactions on Database Systems 6, 387–415 (1981)
15. Niederman, F., Mathieu, R., Morley, R., Kwon, I.-W.: Examining RFID Applications in Supply Chain Management. Communications of the ACM 50, 93–101 (2007)
16. Hevner, A.R., March, S.T., Park, J., Ram, S.: Design Science in Information Systems Research. MIS Quarterly 28, 75–105 (2004)
KBB: A Knowledge-Bundle Builder for Research Studies
David W. Embley1, Stephen W. Liddle2, Deryle W. Lonsdale3, Aaron Stewart1, and Cui Tao4
1 Department of Computer Science, 2 Information Systems Department, 3 Department of Linguistics, Brigham Young University, Provo, Utah 84602, U.S.A.
4 Mayo Clinic, Rochester, Minnesota 55905, U.S.A.
Abstract. Researchers struggle to manage vast amounts of data coming from hundreds of sources in online repositories. To successfully conduct research studies, researchers need to find, retrieve, filter, extract, integrate, organize, and share information in a timely and high-precision manner. Active conceptual modeling for learning can give researchers the tools they need to perform their tasks in a more efficient, user-friendly, and computer-supported way. The idea is to create "knowledge bundles" (KBs), which are conceptual-model representations of organized information superimposed over a collection of source documents. A "knowledge-bundle builder" (KBB) helps researchers develop KBs in a synergistic and incremental manner and is a manifestation of learning in terms of its semi-automatic construction of KBs. An implemented KBB prototype shows both the feasibility of the idea and the opportunities for further research and development.
1 Introduction
In many domains, the volume of data is enormous and increasing rapidly. Unfortunately, the information a researcher requires is often scattered in various repositories and in the published literature. Researchers need a system that can help efficiently locate, extract, and organize available information so they can analyze it and make informed decisions. We address this challenge with the idea of a Knowledge Bundle (KB) and a Knowledge-Bundle Builder (KBB). Active conceptual modeling for learning (ACM-L) is at the core of our approach. As we explain below, a KB includes an extraction ontology, which allows it to both identify and extract information with respect to a custom-designed schema. (This constitutes the conceptual-modeling part of ACM-L.) Construction of a KB itself can be a huge task—but one that is mitigated by the KBB. Construction of the KB under the direction of the KBB proceeds as a natural progression of the work a researcher does
Supported in part by the National Science Foundation under grant #0414644.
to manually identify and gather information of interest. As a researcher begins to work, the KBB immediately begins to synergistically assist the researcher and quickly “learns” and is able to take over most of the tedious work. (This constitutes the active-learning part of ACM-L.) In describing our KBB approach to building KBs, we first give a motivating example of a bio-research study (Section 2). We then explain how the KBB plays its claimed role in the bio-research scenario (Section 3) by defining what a KB is and giving specific examples of KBB tools for building and using KBs. Finally, we give the status of our implementation and mention current and future work needed to enhance KBs and the KBB (Section 4) and then draw conclusions (Section 5).
2 Motivation Scenario
Suppose a bio-researcher B wishes to study the association of TP53 polymorphism and lung cancer. To do this study, B wants information from the NCBI dbSNP repository1 about SNPs (chromosome location, SNP ID and build, gene location, codon, and protein), about alleles (amino acids and nucleotides), and about the nomenclature for amino-acid levels and nucleotide levels. B also needs data about human subjects with lung cancer and needs to relate the SNP information to human-subject information.
1 The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms hosted by the National Center for Biotechnology Information (NCBI) at www.ncbi.nlm.nih.gov/projects/SNP/.
To gather information from dbSNP, B constructs the form in the left panel in Figure 1. Form construction consists of selecting form-field types—e.g., single-value fields, multiple-value fields, multiple-column/multiple-value fields, radio buttons, and check boxes—and organizing and nesting them so that they are a conceptualization of the information B wishes to harvest for the research study. B next finds a first SNP page in dbSNP from which to begin harvesting information. (The created form and the located page need not have any special correspondence—no schema correspondence, no name correspondence, and no special structure requirements—but, of course, the page should have data of interest for the research study and thus for the created form.) B then fills in the form by cut-and-paste actions, copying data from the page in the center panel in Figure 1 to the form in the left panel.
Fig. 1. Form Filled in with Information from an SNP Page
To harvest similar information from the numerous other dbSNP pages, B gives the KBB a list of URLs, as the right panel in Figure 1 illustrates (although there would likely be hundreds rather than just the six in Figure 1). The KBB automatically harvests the desired information from the dbSNP pages referenced in the URL list. Since one of the challenges bio-researchers face is searching through the pages to determine which ones contain the desired information, the KBB provides a filtering mechanism. By adding constraints to form fields, bio-researchers can cause the KBB harvester to gather information only from pages that satisfy the constraints. B, for example, might only want coding SNP data
with a significant heterogeneity (i.e., minor allele frequency > 1%). Because of this filtering mechanism, B can direct the KBB to search through a list of all pages without having to first limit them to just those with relevant information. For the research scenario, B may also wish to harvest information from other sites such as GeneCard. B can use the KBB with the same form to harvest from as many sites as desired. Interestingly, however, and as an example of the learning that takes place, once the KBB harvests from one site, it can use the knowledge it has already gathered to do some of the initial cut-and-paste for B. In addition to just being a structured knowledge repository, the KB being produced also becomes an extraction ontology capable of recognizing data items it has already seen. It can also recognize data items it has not seen but that are like the data it has seen—e.g., numeric values or DNA snippets. Using KBs as extraction ontologies also lets bio-researchers search the literature. Suppose B wishes to find papers related to the information harvested from the dbSNP pages. B can point the KBB to a repository of papers to search and cull out those that are relevant to the study. Using the KB as an extraction ontology provides a sophisticated query of the type used in information retrieval, resulting in high-precision document filtering. For example, the extraction ontology recognizes the highlighted words and phrases in the portion of the paper in Figure 2. With the high density of not only keywords but also data values and relationships all aligned with the ontological KB, the KBB can designate this paper as being relevant for B's study. For the human-subject information and to illustrate additional capabilities of the KBB, we suppose that a database exists that contains the needed
Fig. 2. Paper Retrieved from PMID Using an Extraction Ontology
Fig. 3. Some Human Subject Information Reverse-Engineered from INDIVO
human-subject information. The KBB can automatically reverse-engineer the database to a KB, and present B with a form representing the schema of the database. B can then modify the form, deleting fields not of interest and rearranging fields to suit the needs of the study. Further, B can add constraints to the fields so that the KBB only gathers data of interest from the database to place in its KB. Figure 3 shows an example of a form reverse-engineered from the INDIVO database, tailored to fit our research scenario. The icons in the form let a user tailor the form: modify a form-field title (pencil icon), delete a form field (× icon), or insert or nest a new form field (a choice list of icons to insert, respectively, a single-value form field, a multiple-value form field, a multiple-column/multiple-value form field, and radio-button and check-box selection fields). With all information harvested and organized into an ontology-based knowledge bundle (the KB), B can now issue queries and reason about the data to do some interesting analysis. Figure 4 shows a sample SPARQL query over the data harvested from the pages referenced by the six URLs listed in Figure 1. The query finds three SNPs that satisfy the query's criteria and, for each, returns the dbSNP ID, the gene location, and the protein residue it found. In our prototype, users may click on any of the displayed values to display the page from which the
Fig. 4. Screenshot of our Web of Knowledge Prototype
value was extracted and to highlight the value in the page. As Figure 4 shows, users may alternatively click on one or more check boxes to access and highlight all the values in checked rows. The values rs55819519, TP53, and His Arg are all highlighted in the page in the right panel of Figure 4.
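The paper does not reproduce the query text or the prototype's RDF vocabulary, so the following is only an illustrative sketch (Python with rdflib) of the kind of SPARQL query Figure 4 describes; the namespace, file name, and property names (kb:dbSNPId, kb:geneLocation, kb:proteinResidue, kb:codon) are invented.

from rdflib import Graph

g = Graph()
g.parse("harvested_snps.rdf")   # hypothetical RDF export of the KB's I component

query = """
PREFIX kb: <http://example.org/kb#>
SELECT ?dbSNPId ?geneLocation ?residue
WHERE {
  ?snp kb:dbSNPId        ?dbSNPId ;
       kb:geneLocation   ?geneLocation ;
       kb:proteinResidue ?residue ;
       kb:codon          ?codon .
  FILTER (?geneLocation = "TP53" && ?codon = 72)
}
"""
for row in g.query(query):
    # Via the A (annotation) component, each value could be traced back to the
    # source page from which it was extracted.
    print(row.dbSNPId, row.geneLocation, row.residue)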
3 KBs and KBBs
Having provided a scenario in which a researcher can use KBs built synergistically through a KBB, we now explain exactly what a KB is and how a KBB synergistically builds them. In doing so, we emphasize that although our research-study scenario specifically targets bio-research, our definitions and explanation here do not. It should be clear that a KBB can assist intelligence-gathering researchers in all areas—scientific, business, military, and government. We define a knowledge bundle (KB) as a 7-tuple (O, R, C, I, S, A, F):
– O is a set of intensional object sets, which are one-place predicates—sometimes called concepts or classes; they may also play the role of properties or attributes. (Examples: Person(x), Amino Acid(x), Country(x), Color(x).)
– R is a set of intensional relationship sets among the object sets, which are n-place predicates (n ≥ 2). (Examples: Person(x) is citizen of Country(y), Sample(x) taken on Date(y).)
– C is a set of constraints over O and R, limited so that (O, R, C) constitutes a decidable fragment of first-order logic. (Examples: ∀x(Student(x) ⇒ Person(x)), ∀x(Sample(x) ⇒ ∃1y(Sample(x) taken on Date(y))).)
– I is an instantiation of the object and relationship sets in O and R; when I satisfies C, I is a model for (O, R, C). (Examples: Sample("SMP9671") taken on Date(2009-03-25), Color("green").)
– S is a set of inference rules (horn-clause statements). (Example: BrotherOf(x, y) :- Person(x), Person(y), SiblingOf(x, y), Male(x).)
– A is a set of annotations for data-value instances in object sets in O; each data value v may link to an appearance of v in a source document. (Example: Codon(72) may link to the appearance of 72 in the SNP page in Figure 1.)
– F is a set of data frames [Emb80]. Data frames are abstract data types, linguistically augmented to include recognizers for object and relationship instances and operation instantiations as they appear in documents and free-form user queries. (Examples: the instance recognizer [ACGT]([ACGT])+ for a DNA snippet, (Country | Nation | Republic | ...) as keywords indicating the presence of a country concept.)
The triple (O, R, C) is an ontology.2 In our implementation, we use OWL to represent ontologies. Adding the I component allows us to populate the ontology. The quadruple (O, R, C, I) characterizes information and is an information system or database. In our implementation, we use RDF for storing instances with respect to OWL ontologies. The quintuple (O, R, C, I, S) characterizes a computational view of knowledge. Adding the S component allows us to reason over the base facts in the information system. In our implementation, we use SWRL rules and the Pellet reasoner. The sextuple (O, R, C, I, S, A) characterizes a Platonic view of knowledge. Adding the A component provides a form of authentication since users can trace knowledge back to its source; it thus provides a form of "justified true belief," which Plato insists is part of the definition of knowledge [PlaBC]. Completing the septuple by adding the F component linguistically grounds the knowledge [BCHS09, HLF+08],3 making the KB also an extraction ontology. Further, having an extraction ontology enables a KB to be an active learner, where we consider active learning to be the ability to automatically find facts in source documents that pertain to the KB's ontology, annotate them, and add them to the KB.
2 Researchers disagree about the definition of an ontology, but we adopt the view that an ontology is a formal theory captured in a model-theoretic view of data within a formalized conceptual model. Since the elaboration of our triple (O, R, C) is a predicate-calculus-based, formalized conceptual model, we call it an ontology.
3 Both LexInfo [BCHS09] and OpenDMAP [HLF+08] are independently developed, complementary efforts aimed at linguistically grounding ontologies. As both their work and ours explore this wide-open research area, the projects have much to contribute to and learn from each other.
Finding facts in source documents and adding them to the bundle of collected knowledge is the essence of building KBs for research studies. Letting KBs themselves assist in the task goes a long way toward automating the KB-building process. This automation is non-trivial, and full automation is likely impossible. Hence, we aim to construct KB-building tools that synergistically
work well with users and incrementally take on more and more of the burden of KB construction. A KB-Builder (KBB) is a tool suite to aid in the construction of KBs. More specifically, it is a tool suite to largely automate the building of KBs. In our approach to providing a KBB tool suite, we focus on tools (1) to build KBs via form specification and automated information harvesting and (2) to reverse-engineer structured and semi-structured information sources into KBs.
Form-based Ontology Creation and Information Harvesting. While we do not assume that bio-researchers and other decision-making researchers are expert users of ontology languages, we do assume that they can create ordinary forms of the kind people routinely use for information gathering. A KBB interface lets users create forms by adding various form elements, as the clickable icons in the data and label fields of the form in Figure 3 indicate. Users can specify any and all concepts needed for a study, can specify relationships and constraints among the concepts, and can nest, customize, and organize their data as they wish. From a form specification, the KBB generates a formal ontological structure, (O, R, C). Each label in a form becomes a concept of O. The form layout determines the relationship sets in R among the concepts and determines the constraints in C over the concepts and relationship sets. Given a form, a user can cut-and-paste data from source documents into the form fields to create the I and A components of a KB. When harvesting from sites like the NCBI dbSNP repository, which has hundreds of pages all formatted in a similar way, the KBB can infer from the user's cut-and-paste actions the patterns it needs to harvest the desired information from all pages on the site. These patterns consist of paths in DOM trees of HTML pages along with left and right context and list delimiters to locate data within DOM-tree nodes. To build the F component of a KB, the KBB creates instance recognizers in two ways as it harvests information: (1) by creating lexicons and (2) by identifying and specializing data frames in a data-frame library. For lexicons, the KBB simply makes a list of names of identifiable entities, which it can then later recognize and classify. For data-frame recognizers, we initialize a data-frame library with data frames for common items we expect to encounter—e.g., all types of numbers, currencies, postal codes, and telephone numbers, among many others. When recognizers in these data frames recognize harvested items, they can classify the items with respect to these data frames and associate the data frames with concepts in the ontology. Some automatic specializations are possible, such as numbers with as-yet-unknown units. For more complex pattern recognition, experts can add recognizers.
Reverse-Engineering Structured and Semi-structured Data to KBs. Structured repositories (e.g., relational databases, OWL/RDF triple stores, XML document repositories) and semi-structured repositories (e.g., human-readable tables and forms, hidden-web display pages) may contain much of the information needed for a research study. For structured repositories, reverse-engineering processes (e.g., for relational databases [MH08]) can turn these repositories into knowledge bundles. Further, the results of reverse engineering can be nested form schemas like the one in Figure 3. In this case, researchers can use the techniques
mentioned in the previous paragraph to custom-tailor reverse-engineered KBs by restructuring the generated forms to become the (O, R, C)-ontologies they want. They can also limit the data extracted from the database to the I-values they want, and they can use the techniques mentioned in the previous paragraph to produce F-component lexicons and data frames. For structured repositories such as relational databases that allow view definitions, S-component construction is possible, yielding rules for reasoning. Although the reverse-engineering process for semi-structured repositories is even more challenging than for structured repositories, it is nevertheless feasible for many documents (e.g., for human-readable web tables [GBH+07, PSC+07]).
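As an illustration of the two kinds of F-component recognizers described above (lexicons and data-frame recognizers), the following Python sketch reuses the DNA-snippet pattern from the KB definition; the remaining concept names, lexicon entries, and patterns are invented for the example.

import re

LEXICONS = {"Gene": {"TP53", "BRCA1", "EGFR"}}            # names learned while harvesting
DATA_FRAMES = {
    "DNASnippet": re.compile(r"\b[ACGT][ACGT]+\b"),        # recognizer from the KB definition
    "Codon": re.compile(r"\b\d{1,3}\b"),                   # assumed numeric pattern
}

def recognize(text):
    """Return (concept, value) pairs recognized in a text fragment."""
    hits = []
    for concept, lexicon in LEXICONS.items():
        hits += [(concept, w) for w in text.split() if w.strip(".,;") in lexicon]
    for concept, pattern in DATA_FRAMES.items():
        hits += [(concept, m) for m in pattern.findall(text)]
    return hits

print(recognize("The TP53 gene shows the variant ACGGT at codon 72."))
# [('Gene', 'TP53'), ('DNASnippet', 'ACGGT'), ('Codon', '72')]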
4 Implementation Status and Future Work
We have implemented an initial prototype of our KBB as part of our Web-of-Knowledge (WoK) project [ELL+08]. Currently, as Figure 1 shows, our prototype lets users create ontologies via forms, fill in the form from a machine-generated web page in a hidden-web site, and harvest information from the remaining sibling pages of the hidden-web site [Tao08, TEL09]. We have not yet, however, added constraint filtering to forms. Our WoK prototype can also automatically reverse-engineer machine-generated sibling tables from hidden-web sites into forms and automatically establish the beginnings of a KB extraction ontology [Tao08]. Although not yet integrated into our WoK prototype, we have implemented a way to reverse-engineer an XML-Schema document into a conceptual model, which is compatible with our KB ontologies [AKEL08]. Using extraction ontologies coded by hand, we can successfully do high-precision filtering of semi-structured web documents [XE08], but we have not yet brought this up to the level we need for high-precision document retrieval for free-running text as indicated in Figure 2. In another WoK subproject we have developed a way to generate an ontology from a collection of human-readable tables. We can interactively interpret tables [Pad09], semantically enhance them [Lyn08], and merge them into a growing ontology [Lia08] using automated schema integration techniques [XE06]. We have yet to make all these components work together to achieve the overall goal of automatically growing ontologies by reverse-engineering coordinated collections of human-readable tables. The current implementation of our WoK prototype also allows users to access and query the data in a KB as the screenshot in Figure 4 shows. Although some of our work is complete, we still have much to do to solidify and enhance what we have already implemented and to extend it to be a viable research-study tool. We plan further research as follows. (1) We have defined and implemented data frames for concepts corresponding to nouns and adjectives, but we should also define data frames for relationships in connection with verbs and prepositions. (2) Our current system expects source documents divided into distinct records, but to extract selected information from free-running text, we need to relax the record-boundary constraints and be able to recognize a record of interest, and its extent, without any boundary information. (3) Our
reverse-engineering efforts have proven to be successful, but we should take these approaches even further, for instance, by inferring schemas from general semi-structured data like the dbSNP page in Figure 1. (4) Although not a scientific workflow system by itself, a KBB can become an integral part of a workflow system; embedding a KBB inside a workflow system being used to gather information for research studies (e.g., scientific workflow systems [LAB+06]) could greatly enhance and help automate the information-harvesting facilities of these systems.
5 Concluding Remarks
Several related fields of research are at the heart of our work: information extraction [Sar08], information integration [ES07], ontology learning [Cim06], and data reverse engineering [Aik98]. The KB/KBB approach discussed here is a unique, synergistic blend of techniques resulting in a tool to efficiently locate, extract, and organize information for research studies. (1) It supports directed, custom harvesting of high-precision technical information. (2) Its semi-automatic mode of operation largely shifts the burden for information harvesting to the machine. (3) Its synergistic mode of operation allows research users to do their work without intrusive overhead. The KB/KBB tool is a helpful assistant that “learns as it goes” and “improves with experience.”
References
[Aik98] Aiken, P.H.: Reverse engineering of data. IBM Systems Journal 37(2), 246–269 (1998)
[AKEL08] Al-Kamha, R., Embley, D.W., Liddle, S.W.: Foundational data modeling and schema transformations for XML data engineering. In: Proceedings of the 2nd International United Information Systems Conferences (UNISCON 2008), Klagenfurt, Austria, pp. 25–36 (April 2008)
[BCHS09] Buitelaar, P., Cimiano, P., Haase, P., Sintek, M.: Towards linguistically grounded ontologies. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 111–125. Springer, Heidelberg (2009)
[Cim06] Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, New York (2006)
[ELL+08] Embley, D.W., Liddle, S.W., Lonsdale, D., Nagy, G., Tijerino, Y., Clawson, R., Crabtree, J., Ding, Y., Jha, P., Lian, Z., Lynn, S., Padmanabhan, R.K., Peters, J., Tao, C., Watts, R., Woodbury, C., Zitzelberger, A.: A conceptual-model-based computational alembic for a web of knowledge. In: Li, Q., Spaccapietra, S., Yu, E., Olivé, A. (eds.) ER 2008. LNCS, vol. 5231, pp. 532–533. Springer, Heidelberg (2008)
[Emb80] Embley, D.W.: Programming with data frames for everyday data items. In: Proceedings of the 1980 National Computer Conference, pp. 301–305, Anaheim, California (May 1980)
[ES07] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
[GBH+07] Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards domain-independent information extraction from web tables. In: Proceedings of the Sixteenth International World Wide Web Conference (WWW 2007), Banff, Alberta, Canada, pp. 71–80 (May 2007)
[HLF+08] Hunter, L., Lu, Z., Firby, J., Baumgartner Jr., W.A., Johnson, H.L., Ogren, P.V., Cohen, K.B.: OpenDMAP: An open source, ontology-driven, concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 9(8) (2008)
[LAB+06] Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., Zhao, Y.: Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience 18(10), 1039–1065 (2006)
[Lia08] Lian, Z.: A tool to support ontology creation based on incremental mini-ontology merging. Master's thesis, Department of Computer Science, Brigham Young University, Provo, Utah (March 2008)
[Lyn08] Lynn, S.: Automating mini-ontology generation from canonical tables. Master's thesis, Department of Computer Science, Brigham Young University, Provo, Utah (2008)
[MH08] Mian, N.A., Hussain, T.: Database reverse engineering tools. In: Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems, Cambridge, United Kingdom, pp. 206–211 (February 2008)
[Pad09] Padmanabhan, R.K.: Table abstraction tool. Master's thesis, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, New York (May 2009)
[PlaBC] Plato: Theaetetus. BiblioBazaar, LLC, Charleston, South Carolina, about 360 BC (translated by Benjamin Jowett)
[PSC+07] Pivk, A., Sure, Y., Cimiano, P., Gams, M., Rajkovič, V., Studer, R.: Transforming arbitrary tables into logical form with TARTAR. Data & Knowledge Engineering 60, 567–595 (2007)
[Sar08] Sarawagi, S.: Information extraction. Foundations and Trends in Databases 1(3), 261–377 (2008)
[Tao08] Tao, C.: Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages. PhD dissertation, Brigham Young University, Department of Computer Science (December 2008)
[TEL09] Tao, C., Embley, D.W., Liddle, S.W.: FOCIH: Form-based ontology creation and information harvesting. In: Proceedings of the 28th International Conference on Conceptual Modeling (ER 2009), Gramado, Brazil, pp. 346–359 (November 2009)
[XE06] Xu, L., Embley, D.W.: A composite approach to automating direct and indirect schema mappings. Information Systems 31(8), 697–732 (2006)
[XE08] Xu, L., Embley, D.W.: Categorization of web documents using extraction ontologies. International Journal of Metadata, Semantics and Ontologies 3(1), 3–20 (2008)
7th International Workshop on Web Information Systems Modeling (WISM 2010)
Preface
The international workshop on Web Information Systems Modeling (WISM) aims to study the latest developments in modeling of Web Information Systems (WIS). This is the seventh edition of the workshop, which follows successful editions organized in Amsterdam (2009), Barcelona (2008), Trondheim (2007), Luxembourg (2006), Sydney (2005), and Riga (2004).
In the past, storyboards have been successfully used for modeling WIS. As there is an increasing need to support complex interactions between users and WIS, the first paper, by Berg et al., proposes extensions to storyboards based on speech dialogues. The common dialogue forms are supported by dialogue patterns.
Currently, there are various formats for Web logs, which makes log analysis a difficult process. The second paper, by Hernandez et al., proposes a unified model for Web log data that is subsequently used for deriving a multidimensional model. The data transformations are specified using Query/View/Transformation rules and are used in a fully automatic process.
There are many Web applications that have a REpresentational State Transfer (REST) architecture and XML-based data storage. As REST is stateless, it is difficult to support auditing and accountability for these applications. The third paper, by Graf et al., proposes an opportunistic locking mechanism that ensures easy modifications of the allocation of XML nodes and scalable integrity verification based on the XML tree structure.
A lot of current research is done by means of collaborations between individuals who often belong to different research groups. Finding the right partner and analysing collaborations are complex tasks. The fourth paper, by Lopes et al., proposes a framework for recommending collaborations in the context of Web social networks. The authors present the architecture of the framework, the metrics involved in recommending collaborations, and preliminary evaluation results.
The sentiment emerging from Web documents is an influential factor for business decisions. Unlike other approaches that use the bag-of-words model to compute sentiment scores for documents, the last paper, by Hogenboom et al., proposes to consider the document narrative structure for determining the document sentiment. Depending on their position in the document's rhetorical structure, text fragments can have different impacts during sentiment computation.
Based on this selection of papers and topics relevant to the workshop goals, we invite the interested reader to have a closer look at the articles gathered in the proceedings. We would like to thank all the authors, reviewers, and participants for their contributions and support, making the organization of this new edition of the workshop possible.
July 2010
Flavius Frasincar Geert-Jan Houben Philippe Thiran
Integration of Dialogue Patterns into the Conceptual Model of Storyboard Design
Markus Berg1, Bernhard Thalheim2, and Antje Düsterhöft1
1 Hochschule Wismar, Germany, Department of Electrical Engineering and Computer Science, {markus.berg,antje.duesterhoeft}@hs-wismar.de
2 Christian-Albrechts-University Kiel, Germany, Department of Computer Science and Applied Mathematics, [email protected]
Abstract. Web information systems, e.g., modern e-commerce platforms, are becoming more sophisticated, cope with more complex applications, and support the integration of speech dialogues. Their workflow and supporting infrastructure can be specified by storyboards. The integration of speech dialogues is, however, an unsolved issue due to the required flexibility, the wide variety of responses, and the expected naturalness. Classical keyword-based search cannot cope with such interaction media. This paper extends storyboarding by speech dialogues. Speech dialogues must be very flexible both in the recognition of answers and in the generation of appropriate answers. We thus introduce a pattern-based approach to the specification and utilisation of speech dialogues. The paper shows that it is possible to create patterns for common dialogue forms. Consequently, these patterns are integrated into the storyboard model and form the basis for the modeling of natural dialogues in web information systems.
1 Introduction
A web information system (WIS) [16] is a database-backed information system that is realised and distributed over the web, with user access via web browsers. Information is provided from web pages that are linked within a navigation structure. Interaction with the system is typically browser-based and nowadays also uses other interaction media such as voice-interaction systems. Additionally, database functionality is provided through the interaction media. Information is data that has been verified to be accurate and timely, is specific and organized for a purpose, is presented within a context that gives it meaning and relevance, and leads to an increase in understanding and a decrease in uncertainty. This understanding of the notion of information relates data to the users, their intentions, and their current demands and needs. Therefore, information provision requires knowing the user. We may use this general user description for the adaptation of a WIS to the user. WIS also support complex flows of work. Therefore, we need a detailed specification of this behaviour of the system. Storyboards are a conceptual specification method for the description of user-system interaction. Since we cannot
only orient the system to one very specific kind of user, storyboards must be adaptable to the user, the actual behaviour, and the history of interaction of this user. Besides the abstract description of interaction, we may inherit approaches from natural-language interaction. If we are able to use this kind of interaction, then we may also use modern voice-based communication media. Moreover, systems will be easier to use for certain tasks, simpler to learn, and useful for a larger or different user community compared with classical programmes. Therefore, we need a form of natural-language communication that is very flexible and adaptable. This flexibility and adaptability can be achieved by the approach we propose in this paper. Dialogue patterns consist of classes of dialogue forms from which the concrete dialogue can be chosen. Dialogue patterns can be combined with each other and may thus be the basis for complex dialogues. Basic dialogues may also be grouped into different kinds, such as confirmatory dialogues. In order to apply this approach we use a formal specification of dialogue patterns and an adaptation of these patterns to the actual dialogue situation. This paper briefly overviews related work and storyboard design [16]. We introduce the notion of a dialogue pattern in Section 4. In Section 5 we integrate story patterns and storyboards. This approach has already been tested for application systems.
2 Related and Previous Work
The number of publications in the field of WIS is enormous. The ARANEUS framework [1] determines that conceptual modeling of WIS consists of content, navigation, and design aspects. This results in the modeling of databases, hypertext structures, and page layout. Another, similar approach is OOHDM [11] [10], which is completely object-oriented. It also comprises three layers: an object layer, hypermedia components, and an interface layer. The work in [4] describes an approach to interactively generating stories with a specification of a formal logic model. Our own work on storyboarding has been reported in [6] and [14]. Moreover, we have investigated the role and use of metaphors in storyboarding in [18]. In [2] we have designed a method for generating VoiceXML from the storyboard specification and have thus shown that it is possible to use storyboarding also for modeling speech applications. Besides conceptual modeling, aspects of natural-language dialogues also have to be considered. The process of developing speech interfaces is analysed in [5]. The work in [9] describes the conceptual basis for the dialogue design process. In [8] and [3] the Wizard-of-Oz methodology is used for simulating interactive systems. Besides the dialogue description, the user also has to be modeled; this is done in [7]. The usage of different modalities when accessing the web is examined in [20]. The modeling of dialogues with the help of dialogue acts is considered in [13] and [17].
3 Storyboard Design
Storyboarding is a methodology which was created for the design of large-scale, data-intensive web information systems. It is based on the abstraction layer model (ALM) shown in figure 1 [15]. The strategic layer describes the system in a general way concerning its intention and is comparable to a mission statement. The business layer concretises this information by describing stories which symbolise paths through the system; the purpose of this layer is to anticipate the behaviour of the users. In the conceptual layer the scenes of the storyboard are analysed and integrated. The design of abstract media types supports the scenes by providing a unit which combines content and functionality. The presentation layer associates presentation options with the media types. In the implementation layer physical implementation aspects like setting up database schemata, page layout and the realisation of functionality by script languages are addressed. Each layer is associated with specific modeling tasks which allow the transition between the layers. To progress from the strategic to the business layer, storyboarding and user profiling are required. To get from the business layer to the conceptual layer, conceptual models have to be created, i.e. database modeling, operations modeling, view modeling and media type modeling. The transition to the presentation layer is characterised by the definition of presentation styles. In the implementation layer all implementation tasks have to be realised.
Fig. 1. Abstraction Layer Model
Storyboarding focusses on the business and the conceptual layer [15]. The business layer deals with user profiling and the design of the application story. The core of the story space can be expressed by a directed multi-graph, in which the vertices represent scenes and the edges actions by the users including navigation. If more details are added, application stories can be expressed by some form of process algebra. That is, we need atomic activities and constructors for sequencing, parallelism, choice, iteration, etc. to write stories. In the conceptual
layer, the media types which support the scenes and the operations which support the activities in the storyboard are modeled. Moreover, hierarchical presentations and the adaptivity to users, end-devices and channels are addressed in this layer. A WIS can be used by any web user, which is why the design of such systems requires anticipation of the users' behaviour. This problem is addressed by storyboarding, which describes the ways users may choose to interact with the system. A storyboard consists of three parts [15]: the stories, which are navigation paths through the system; the actors, which comprise users with the same profile; and the tasks, which link activities (resp. goals) of the actors with the story space, a container for the description of the stories. Subgraphs of the story space are called scenarios; this enables a hierarchy and encapsulation of scenes. Every action can be equipped with pre- and postconditions or triggering events, which allows us to specify under which conditions an action can be executed. SiteLang is a language which defines a story algebra and allows the formal representation of the theoretical storyboard model. The explanation of the SiteLang syntax is beyond the scope of this paper and can be found in [19].
4 Definition of Dialogue Patterns
A dialogue is defined as a conversational exchange of information between people. It consists of many related utterances with a specific meaning and aim. These utterances are called speech acts. Searle identified five illocutionary acts [12]: assertives, directives, commissives, expressives and declaratives. When classifying dialogue utterances, certain classes like greeting, apologizing or asking can be observed which can be assigned to the illocutionary acts (e.g. an apology is an expressive speech act). Aggregation leads to a group of speech acts. While speech acts classify single utterances, dialogue acts model dialogue classes consisting of several utterances. They can be seen as a superclass of logically related speech acts. This model is well suited when analysing dialogues a posteriori. But when defining dialogue patterns a priori we need to consider dialogue branches as we do not know the answer beforehand. A confirmation dialogue is a six-tuple D = {Q1, A, V, C, D, Q2} which comprises the following steps:

– Question
– Answer
– Verification
– Answer: Confirmation | Denial
– (Question for correction)
As mentioned above the form of the answer is not known beforehand and introduces a branch. The question for correction is optional, depending on the path the user has chosen.

Q1 ∈ μ, A ∈ α, V ∈ φ, C ∈ γ, D ∈ δ, Q2 ∈ ν    (1)

Q1, Q2, V ⊆ Q ⊆ Σ+    (2)
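To make the tuple and its branch more concrete, the following is a minimal Java sketch of a confirmation dialogue. The class and method names are our own illustration and not part of the formal model; the question sets and answer domains are passed in as plain string lists.

import java.util.List;
import java.util.Random;

// Illustrative representation of the confirmation dialogue D = {Q1, A, V, C, D, Q2}.
// The sets mu (questions), phi (verifications), gamma (confirmations),
// delta (denials) and nu (questions for correction) are given as lists.
public class ConfirmationDialogue {

    private final List<String> mu, phi, gamma, delta, nu;
    private final Random random = new Random();

    public ConfirmationDialogue(List<String> mu, List<String> phi,
                                List<String> gamma, List<String> delta, List<String> nu) {
        this.mu = mu; this.phi = phi; this.gamma = gamma; this.delta = delta; this.nu = nu;
    }

    // Q1: random prompting from the question set prevents monotony.
    public String question() {
        return mu.get(random.nextInt(mu.size()));
    }

    // V: verification of the recognised answer.
    public String verification() {
        return phi.get(random.nextInt(phi.size()));
    }

    // The answer to the verification introduces the branch:
    // a confirmation ends the dialogue, a denial triggers the optional Q2.
    public String next(String verificationAnswer) {
        if (gamma.contains(verificationAnswer)) {
            return null;                 // confirmed, dialogue finished
        }
        if (delta.contains(verificationAnswer)) {
            return nu.get(0);            // question for correction
        }
        return verification();           // unrecognised answer: verify again
    }
}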
Let μ = {"How many persons take part in your trip?", "How many persons?", "With how many persons do you want to travel?"}, α = {1,2,3,4,5}, φ = {"Are you sure?", "You want to travel with λα persons, correct?"}, γ = {yes, yo}, δ = {no, nope} and ν = {"Please say now the number of persons!"}. The instantiation of this pattern results in specific dialogues.

Example 1. D = {μ[2], α, φ[1], γ, δ, ν[0]} results in the following example dialogue:

S: With how many persons do you want to travel?
U: Three
S: You want to travel with three persons, correct?
U: Yes

The instantiation with a set instead of a single value leads to random prompting, which prevents monotony when the user often works with the system. Sets referring to user input like α, γ and δ specify the domain of possible answers. The values can be used to generate grammars. Let a context-free grammar be defined as a quadruple of nonterminal symbols, terminal symbols, production rules and a start symbol: G = {N, Σ, P, S} with P = A → ω, A ∈ N, ω ∈ {N ∪ Σ}+. Now we can infer Σ from α. Defining the possible user utterances by just enumerating terminal combinations is not very effective. That is why we change the domain of the user utterance sets to dom(α), dom(γ), dom(δ) = (N, Σ, P, S). The former example for α can now be expressed as α = ({S}, {1, 2, 3, 4, 5}, {S → 1|2|3|4|5}, S). Now we are able to express even more complex utterances like "I want to travel with two persons" or "Three persons, please" as the following example shows.

Example 2. G = {N, Σ, P, S}
N = {S, NO, PRS, PRE, POST}
Σ = {I, want, to, travel, with, persons, please, one, two, three, four, five}
P = {S → PRE NO PRS POST | NO | NO PRS | NO PRS POST
     PRE → I want to travel with
     NO → one | two | three | four | five
     PRS → persons
     POST → please}

This can be transformed into the following SRGS ABNF grammar:

$S=$PRE<0-1> $NO $PRS<0-1> $POST<0->;
$PRE=I want to travel with;
$NO=one|two|three|four|five;
$PRS=persons;
$POST=please;

To facilitate post-processing, the introduction of semantic return values is helpful. With the help of the λ-operator we can access values from different rules. An object-oriented approach allows us to define subtypes. While λα returns e.g. "three persons please", λα.no returns 3. Now that we have defined grammar generation, some other dialogue types can be specified:

– Question/Answer: D = {Q1, A}
– Selection: D = {Q1, A}
– Yes/No-Question: D = {Q1, A}

As can be seen, the definition is the same in all three cases, but by adapting the domain different dialogue semantics can be supported. Now we are able to create patterns for different dialogue types. But as these dialogues are predefined, they are relatively rigid. One of the most important characteristics of natural language is fuzzy answers: some are under-, others overspecified. Underspecified answers lead to re-requests, and overspecified answers have to fill several patterns.

Example 3.
S: When do you want to start?
U: Tomorrow in Cologne
S*: Where do you want to start?

The third utterance of the above dialogue should not occur, as the user already gave that information. This leads to the necessity of processing overspecified answers. One approach is the extension of the recognition domain through an object-oriented pattern concept (i.e. a frame). This enables us to group semantically related dialogue patterns. In a tourism scenario there exist different questions, some of which are mandatory for generating a search query. Possible attributes are:

– begin of journey
– end of journey
– destination
– number of children
– number of adults
Some of these attributes are likely to be summarized in single utterances. Instead of asking "When do you want to start your journey?" and "When do you want to end your journey?" one could ask "Please say your travel dates". Moreover, it has to be recognised that the user may give the answer "Two adults and one child" to the question "How many adults take part?". An approach to realize this is the enclosure of domain classes, as shown below.
[ id: journey
  D = {Q1+, A+}
  [ id: begin
    D = {Q1a, A1} ]
  [ id: end
    D = {Q1b, A2} ] ]

This enclosure is equivalent to the storyboard term scene. When asking the question with the id begin we can activate A+ = A1 ∪ A2, which allows the user to overspecify the answer because A+ represents an extended recognition domain. Depending on what the user said, further dialogue steps can be omitted. This can be checked by the filled variables: A+ ∈ α1 ∪ α2, α1 = {$dates}, α2 = {$dates}. If, after asking the question begin, α2 is already filled (α2 ≠ ∅), the question end does not need to be posed.
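A minimal Java sketch of this enclosure follows, under the assumption of a very simple slot representation: the frame groups the begin and end patterns, an overspecified answer may fill both slots at once, and a question is skipped as soon as its slot is filled. The extraction of the $dates values is only stubbed.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "journey" enclosure: slots alpha1 (begin) and alpha2 (end).
public class JourneyFrame {

    // slot id -> recognised value; null means the slot is still empty
    private final Map<String, String> slots = new LinkedHashMap<>();

    public JourneyFrame() {
        slots.put("begin", null);
        slots.put("end", null);
    }

    // The extended recognition domain A+ = A1 ∪ A2 accepts overspecified answers.
    public void processAnswer(String utterance) {
        String lower = utterance.toLowerCase();
        if (lower.contains("tomorrow")) {
            slots.put("begin", "tomorrow");                   // stub for $dates extraction
        }
        int until = lower.indexOf("until ");
        if (until >= 0) {
            slots.put("end", utterance.substring(until + 6)); // stub for $dates extraction
        }
    }

    // Returns the id of the next question to ask; filled slots are skipped,
    // so the "end" question is omitted if alpha2 is already filled.
    public String nextQuestion() {
        for (Map.Entry<String, String> slot : slots.entrySet()) {
            if (slot.getValue() == null) {
                return slot.getKey();
            }
        }
        return null;   // all mandatory information gathered
    }
}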
5 Enhancing Storyboarding by Dialogue Patterns
After having defined dialogue patterns we now integrate them into storyboarding, as can be seen in figure 2. Because the combination of dialogue steps is defined as a scene, we extend the scene definition by dialogue patterns. A scene is defined as follows:

Scene id
  MediaObject: modality
  Actors: user
  Context: channel
  Task: id
  Specification:
    on event
    if precondition
    doScene implementation
    accept on postcondition

Now the tuple which defines the dialogue has to be integrated. A selection dialogue is defined as D1 = {Q1, A1}, Q1 = P ∪ S and can be instantiated (see figure 3) with: P = {"Please choose your option", "What kind of accommodation do you prefer?"}, S = {suite, apartment, doubleroom}, A1 ∈ S.
Fig. 2. Storyboard with dialogue acts
Fig. 3. Instantiation of dialogue patterns
For simplification reasons the grammar for set A is omitted in this step. This dialogue is only one part of a scene (i.e. of a complex dialogue). Another dialogue would be a question for the number of persons: D2 = {Q2, A2}, Q2 = {"Please say the number of persons!", "With how many persons do you want to travel?"}, A2 = {1..5}. By enclosing these dialogues, a complex dialogue pattern Δ = {Di}* can be created. In a scene definition this pattern has to be referenced. The answer domain of Δ is Λ ∈ {α1..αn}.

Scene accommodation
  MediaObject: speechForm
  Actors: customer
  Context: channel=speech
  Task: getMandatoryInformation
  Specification:
    on lastSceneCompleted
    if Λ = ∅
    doScene
      D1 = selection(Q1, A1)
      D2 = getNumber(Q2, A2)
      Δ = {D1, D2}
    accept on Λ ≠ ∅

DialoguePattern selection
  if α = ∅
  Specification: D = {Q, A}, Q = P ∪ S
  accept on α ≠ ∅

DialoguePattern getNumber
  if α = ∅
  Specification: D = {Q, A}
  accept on α ≠ ∅
The instantiation of the dialogue patterns with specific values has to be done in the presentation layer.
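As a sketch only, the composition of the two dialogue patterns into the accommodation scene could look as follows in Java; the DialoguePattern interface and its methods are assumptions made for this illustration, not part of the storyboard formalism.

import java.util.Arrays;
import java.util.List;

// Sketch of a scene as a complex dialogue pattern Δ = {D1, D2}.
public class AccommodationScene {

    interface DialoguePattern {
        boolean isFilled();   // α ≠ ∅ for this pattern
        void run();           // prompt the user and store the recognised answer
    }

    private final List<DialoguePattern> delta;

    public AccommodationScene(DialoguePattern selection, DialoguePattern getNumber) {
        this.delta = Arrays.asList(selection, getNumber);   // Δ = {D1, D2}
    }

    // doScene: execute every enclosed pattern whose answer is still missing.
    public void doScene() {
        for (DialoguePattern d : delta) {
            if (!d.isFilled()) {
                d.run();
            }
        }
    }

    // accept on Λ ≠ ∅: the scene is accepted once all patterns delivered an answer.
    public boolean accepted() {
        return delta.stream().allMatch(DialoguePattern::isFilled);
    }
}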
6 Conclusion and Future Work
Storyboards are conceptual specifications of the interaction between users and web information systems, more specifically of the flow of interaction and its support by the WIS. Nowadays, web sites are complex and also attract users that want to interact with the system through other modalities and communication media, e.g. through natural language dialogues. Natural language dialogues support users in a more flexible way and are thus of higher utility to the user. This paper develops an approach to natural language dialogue support. We use a pattern approach that combines elements of similar behaviour and similar dialogue flow into a class. These speech- and dialogue-act-based dialogue patterns can be instantiated by the actual dialogue situation, can be combined into more complex dialogues, and can also be hierarchically ordered. Scenes in a storyboard can thus be supported by complex dialogues. This integration of dialogue acts and scenes allows us to describe the dialogue within a scene on the one hand and provides the flexibility and adaptability that is necessary for web systems on the other hand. Our approach nicely supports dialogues within interactive voice response systems. We do not aim at a general solution that allows the use of any kind of dialogue but instead use prepared classes of dialogues. These classes are formally described by dialogue patterns. We additionally assume that users complete a dialogue sequence. Therefore, our approach naturally supports user-system interaction as long as dialogue patterns have been developed. The development of additional natural language features is the target of our future work.
Acknowledgements This work is supported by the European Funds for Regional Development (EFRE).
References
1. Atzeni, P., Gupta, A., Sarawagi, S.: Design and maintenance of data-intensive websites. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 436–450. Springer, Heidelberg (1998)
2. Berg, M., Düsterhöft, A., Thalheim, B.: Integration of Natural Language Dialogues into the Conceptual Model of Storyboard Design. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds.) NLDB 2010. LNCS, vol. 6177, pp. 196–203. Springer, Heidelberg (2010)
3. Berg, M., Gröber, P., Weicht, M.: User Study: Talking to Computers. In: 3rd Workshop on Inclusive E-Learning, London (2010) (to appear)
4. Ciarlini, A.E.M., Pozzer, C.T., Furtado, A.L., Feijó, B.: A Logic-Based Tool for Interactive Generation and Dramatization of Stories. In: Proceedings of the ACM SIGCHI ACE, Valencia, vol. 265, pp. 133–140 (2005)
5. Cohen, M., et al.: Voice User Interface Design. Addison-Wesley, Redwood City (2004)
6. Feyer, T., Thalheim, B.: E/R based scenario modeling for rapid prototyping of web information services. In: Kouloumdjian, J., Roddick, J., Chen, P.P., Embley, D.W., Liddle, S.W. (eds.) ER Workshops 1999. LNCS, vol. 1727, pp. 253–263. Springer, Heidelberg (1999)
7. Fischer, G.: User Modeling in Human-Computer Interaction. User Modeling and User-Adapted Interaction 11(1-2), 65–86 (2001)
8. Fraser, N., Gilbert, G.: Simulating Speech Systems. Computer, Speech and Language 5(1), 81–89 (1991)
9. Harris, R.A.: Voice Interaction Design. Crafting the New Conversational Speech Systems. Morgan Kaufman Publ. Inc., Massachusetts (2004)
10. Rossi, G., Schwabe, D., Lyardet, F.: Web application models are more than conceptual models. In: Kouloumdjian, J., Roddick, J., Chen, P.P., Embley, D.W., Liddle, S.W. (eds.) ER Workshops 1999. LNCS, vol. 1727, pp. 239–252. Springer, Heidelberg (1999)
11. Rossi, G., Garrido, A., Schwabe, D.: Navigating between objects: Lessons from an object-oriented framework perspective. ACM Computing Surveys 32(1) (2000)
12. Searle, J.R.: Speech Acts. An Essay in the Philosophy of Language. Cambridge University Press, Cambridge (1969)
13. Sitter, S., Stein, A.: Modeling the illocutionary aspects of information-seeking dialogues. Information Processing & Management 28(2), 165–180 (1992)
14. Schewe, K.-D., Thalheim, B.: Integrating database and dialogue design. Knowledge and Information Systems 2(1), 1–32 (2000)
15. Schewe, K.-D., Thalheim, B.: Web Information Systems: Usage, Content, and Functionality Modelling. Technical Report (2005)
16. Schewe, K.-D., Thalheim, B.: Conceptual modelling of web information systems. Data & Knowledge Engineering 54, 147–188 (2005)
17. Stolcke, A., et al.: Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics 26(3), 339–373 (2000)
18. Thalheim, B., Düsterhöft, A.: The use of metaphorical structures for internet sites. Data & Knowledge Engineering 35, 161–180 (2000)
19. Thalheim, B., Düsterhöft, A.: SiteLang: conceptual modeling of internet sites. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds.) ER 2001. LNCS, vol. 2224, pp. 179–192. Springer, Heidelberg (2001)
20. Wahlster, W.: SmartKom: Multimodal dialogues with Mobile Web Users. In: Proceedings of the Cyber Assist International Symposium, pp. 33–34 (2001)
Model-Driven Development of Multidimensional Models from Web Log Files

Paul Hernández, Irene Garrigós, and Jose-Norberto Mazón

Lucentia Research Group, Dept. of Software and Computing Systems, University of Alicante, Spain
{phernandez,igarrigos,jnmazon}@dlsi.ua.es
Abstract. Analyzing Web log data is important in order to study the usage of a website. Even though some approaches propose data warehousing techniques for structuring the Web log data into a multidimensional model, they present two main drawbacks: (i) they are based on informal guidelines and must be manually applied; and (ii) they consider data tailored to a specific Web log format, thus being restricted to specific analysis tools. To overcome these limitations, we present a model-driven approach for obtaining a conceptual multidimensional model from Web log data in a comprehensive, integrated and automatic manner. This approach consists of the following steps: (i) obtaining a conceptual model of the Web log data based on a unified metamodel, (ii) deriving a multidimensional model from this Web log model by formally defining a set of QVT (Query/View/Transformation) transformation rules.
1 Introduction
Web log files can have millions of entries that contain a lot of information about the user interaction with the site. These files are useful for a detailed analysis of the usage of a website (also known as clickstream analysis [2]) in order to support decision making regarding several tasks [1,2,10,12], e.g., reducing the editor's effort, improving the browsing experience, managing Web traffic, marketing, e-commerce, advertising, evaluating the system against initial specifications and goals, or personalization support. Web log data are usually stored in files by using different text-based formats, such as the NCSA Common Log Format [18] or the W3C Extended Common Log File format [19]. Moreover, every format can be customized for specific purposes depending on the data that should be monitored. Unfortunately, the features of these formats restrict the analysis of Web log files to specific analysis tools and their analysis functionality [5]. Therefore, advanced analysis techniques cannot be easily applied to Web log data. To solve these problems, some approaches [10,12] consider structuring the Web log data into a data warehouse by means of multidimensional modeling [11]. Multidimensional modeling is at the core of data warehousing, since it allows
advanced analysis tools (such as OLAP – On-Line Analytical Processing –, data mining or "what-if" analysis) to access data in a way that comes more naturally to human analysts. The data is located in an n-dimensional space (the facts, e.g., how many products are sold, how many patients are treated, how long something takes, etc.), with the dimensions representing the different ways the data can be viewed and sorted (e.g., according to time, store, customer, product, etc.). Therefore, designers of multidimensional schemas have to structure the available information into facts and dimensions. Current approaches manually define these structures from the Web log information, which is a tedious, error-prone and time-consuming task. Also, multidimensional modeling requires specialized design techniques that resemble traditional database design methods, in which the development process is guided by a first conceptual design phase whose output is an implementation-independent and expressive conceptual multidimensional schema for the data warehouse [17]. In order to overcome these drawbacks, a model-driven approach is proposed in this paper to automatically derive a conceptual multidimensional schema from Web log files. To be able to tackle the aforementioned different available formats, a unified metamodel for Web log data has been developed. Fig. 1 shows an overview of our overall approach. The first step is to represent the raw data from Web log files in a model that conforms to our generic Web log metamodel. Once this model is obtained, a conceptual multidimensional model can be automatically derived by means of several model transformations defined using the QVT (Query/View/Transformation) language.
Fig. 1. Overview of our model-driven approach for obtaining multidimensional models from Web log data
The remainder of this paper is structured as follows. A brief overview of the related work is presented in section 2. Section 3 describes our model-driven approach for multidimensional modeling of Web log data. An example is provided throughout this section to show the applicability of our approach. Finally, section 4 points out our conclusions and future work.
2 Related Work
Commercial tools for Web log data analysis have significant limitations when performing advanced analytical tasks [12]. Furthermore, they have some drawbacks: (i) they are useless when trying to understand navigational patterns of
users [2], and (ii) they lack the ability to integrate and correlate information from different sources. One of the best-known analysis tools is Google Analytics¹, which has emerged as a major solution for Web traffic analysis. However, it has several drawbacks: e.g., the drill-down capability is limited and there is no way of storing the data efficiently. Also, the user does not own the data, Google does. There are several approaches [3,4,10,11,12] that define a multidimensional schema in order to represent the Web log data. With these approaches, once the data is structured, it is possible to use OLAP or data mining techniques to analyze the content of the Web logs, tackling the aforementioned problems. However, there is a lack of agreement about a methodological approach in order to detect which would be the most appropriate facts and dimensions: some of them let the analysts decide the required multidimensional elements, while others decide these elements by taking into consideration a specific Web log format. Furthermore, Web applications can be distributed over several servers depending on the performance needed, e.g., video and audio content could be hosted on a specialized multimedia server while sales transactions run on a high-security server. Therefore, the main problem is that the multidimensional elements are informally decided according to a specific format, so the resulting multidimensional model may be incomplete. To overcome these problems, our approach is aligned with [6], where Web log files are considered at the conceptual level. Specifically, our approach defines (i) a Web log metamodel in order to unify different Web log formats in a conceptual Web log model, and (ii) a set of model transformations to automatically obtain multidimensional data structures from a Web log model.
3 Model-Driven Approach for Multidimensional Modeling of Web Logs
In this section, we describe our approach for obtaining a conceptual multidimensional model from Web log files. Fig. 1 shows an overview of our approach: from the Web log files, a Web log model is obtained; from this model, a conceptual multidimensional model is derived through a set of QVT transformations. To summarize, the benefits of our approach are:

– A Web log metamodel is defined which is not tailored to a specific Web log format, which allows more flexibility
– The Web log model is automatically generated from the Web log file
– The multidimensional model is automatically derived from the conceptual model by means of QVT rules

It is worth pointing out that this conceptual model will drive the development of a data warehouse by using our approach presented in [15]. This data warehouse will be used to enhance the analysis of Web usage data.
¹ http://www.google.com/analytics
3.1 Web Log Metamodel
The main goal of this metamodel is to define the elements and the semantics that allow building a conceptual model which represents, in a static way, the interaction between raw data elements (e.g. the client remote address) and usage concepts (e.g. session, user). We have divided our Web log metamodel into two packages, as shown in Fig. 2: the Entries package and the Usage package.
Fig. 2. Our Web log metamodel: (a) the Entries package, (b) the Usage package
The Entries package (see Fig. 2a) is intended to represent the entries of any kind of Web log format. The EntryField metaclass contains subclasses representing any field present in an entry. These fields are optional because some Web log formats, like the W3C Extended Log File Format, are customizable, thus allowing only the desired fields to be stored in the log. Most of the fields are gathered directly from the HTTP request of the client, like the RemoteIp, BytesSent, RemoteName, and HttpStatus. The AuthUser metaclass has a value if the current user is authenticated on the server. The WebObject metaclass represents any element that, when clicked, produces a request for a resource identified by a URI. The Request metaclass represents the request line from the client. The TimeTaken metaclass represents the length of time that the action took. The Cookie field includes the content of one or more cookies sent or received. The Referrer metaclass represents the site the user comes from, and the Agent metaclass contains information regarding the browser type that the client used. Finally, the Entry metaclass consists of a set of entry fields in order to structure the configuration of the current Web log. The Usage package (see Fig. 2b) contains classes to represent how the user interacts during a session and produces entries in the Web log. The User is the person or program that makes use of the website. User identification is one of the challenging tasks in Web log analysis; this information can be taken directly from the authenticated user field if it is available, otherwise there are methods to accomplish this task². A single Session has a unique User, but a single User may have many Sessions. The Session also has a Context, which could be the Device used by the client or the UserOrigin, which in turn could be determined by the RemoteIp. A Session contains a set of entries caused by a user interaction over a period of time³. A Page is a specialized WebObject and it can contain many WebObjects at the same time. A Session can have one or more associated Page elements. Cookies are very helpful in order to identify the user, determine the user location and delimit the session. A drawback of using cookies is that they are not always available because they depend on the user's acceptance. Our metamodel has been developed in the Eclipse Modeling Framework⁴. Eclipse is an open source project conceived as a modular platform that can be extended by plugins in order to add features to the development environment. Within Eclipse, EMF is a Java framework and code generation facility for building tools and other applications based on a structured model. In order to support our modeling tasks, we have developed a plugin of the metamodel that allows the definition and editing of Web log models in a programmatic manner by using a reflective API for manipulating EMF objects.
² This issue is out of the scope of this paper and we refer the reader to some methods explained in [11] for further information.
³ Determining a specific session is a challenging task; again this is out of the scope of this paper and we refer the reader to [11] for further explanations.
⁴ http://www.eclipse.org/emf
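As a rough illustration of the two packages – and explicitly not the EMF-generated API – the following Java sketch shows how an entry and a session could be represented in memory.

import java.util.ArrayList;
import java.util.List;

// Strongly simplified view of the Web log metamodel, used only for illustration.
public class WebLogModelSketch {

    // Entries package: one log line with its (optional) entry fields.
    static class Entry {
        String remoteIp;
        String authUser;
        String request;
        int httpStatus;
        long bytesSent;
        String referrer;
        String agent;
        String timeStamp;
    }

    // Usage package: a session relates a user and a context to its entries and pages.
    static class Session {
        String user;                               // identified user or program
        String userOrigin;                         // context, e.g. derived from the RemoteIp
        List<Entry> entries = new ArrayList<>();   // entries caused within the session
        List<String> pages = new ArrayList<>();    // visited pages (specialized WebObjects)
    }
}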
With the defined metamodel it is possible to express the structure of the data contained in the log files in a model independent of the Web server technology. It is also possible to model the users' interaction with the website during a session. To show how to create Web log models from our metamodel, we use a running example based on a log file from the server which hosts our research group website⁵ (an Apache server that uses the Combined Log Format). A typical entry is shown in Fig. 3.

172.16.242.69 - - [16/Mar/2010:09:28:00 +0100] "GET /labcss/Projects.php HTTP/1.1" 200 2916 "http://lucentia.dlsi.ua.es/labcss/Activities.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)"

Fig. 3. Code for a typical entry from http://www.lucentia.es
The process of obtaining a Web log model from the set of Web log files has been implemented by using the java.util.regex.Pattern class for representing the regular expressions that parse the data in the Web log files together with java.util.Scanner. This data is then converted into a model that conforms to our Web log metamodel by using the EMF.Edit interface EditingDomain. The corresponding model for our example is sketched in Fig. 4.
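The following self-contained Java sketch illustrates this parsing step for the Combined Log Format entry of Fig. 3; the regular expression and the field mapping are our own simplification, not the actual implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedLogParser {

    // %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
    private static final Pattern ENTRY = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String line = "172.16.242.69 - - [16/Mar/2010:09:28:00 +0100] "
            + "\"GET /labcss/Projects.php HTTP/1.1\" 200 2916 "
            + "\"http://lucentia.dlsi.ua.es/labcss/Activities.php\" \"Mozilla/5.0 (Windows; ...)\"";
        Matcher m = ENTRY.matcher(line);
        if (m.matches()) {
            System.out.println("RemoteIp:   " + m.group(1));
            System.out.println("AuthUser:   " + m.group(3));
            System.out.println("TimeStamp:  " + m.group(4));
            System.out.println("Request:    " + m.group(5));
            System.out.println("HttpStatus: " + m.group(6));
            System.out.println("BytesSent:  " + m.group(7));
            System.out.println("Referrer:   " + m.group(8));
            System.out.println("Agent:      " + m.group(9));
        }
    }
}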
Fig. 4. Sample Web log model
The Entry element contains the fields that represent the raw data taken directly from the Web log entry line. This information is associated with a Session started by an anonymous User within a localization Context (User Origin Spain). This model represents the interaction of the User with the website: how (click sequence), when (Time Stamp), where (User Origin) and what (Pages visited). It is worth recalling that the novelty of our approach is that this model is independent of the Web log technology.
⁵ http://www.lucentia.es
In our sample, the Session started on the Activities Page and the User then went to the Projects Page. In this way, it is possible to represent complex User Sessions and larger sets of entries if needed.

3.2 Multidimensional Conceptual Modeling
The major aim of a conceptual multidimensional model is to represent the main multidimensional elements without taking into account any specific technology detail. The UML profile proposed in [13] is used for specifying conceptual multidimensional models as UML class diagrams, where facts and dimensions are represented by Fact and Dimension classes, respectively. More precisely, Fact classes are defined as composite classes in shared aggregation relationships with several Dimension classes. If multiplicities are not specified for those relationships, a default of many-to-one is assumed, i.e., each fact is associated with one coordinate in every dimension, and each of the coordinates can be used for many facts. Measures for Fact classes are represented as attributes with the FactAttribute stereotype. With respect to dimensions, each level of a dimension hierarchy is specified by a Base class. Every Base class can contain several dimension attributes (DimensionAttribute stereotype) and must also contain a descriptor attribute (Descriptor stereotype).

3.3 Model Transformations from Web Log Model to Conceptual Multidimensional Model
Traditionally, conceptual multidimensional schemas have been derived from a detailed analysis of relational data sources in order to determine facts and dimensions from relational tables [7,8,9,16]. In this way, we have previously developed a set of QVT transformations to support designers in discovering every kind of multidimensional element from relational data sources [14]: e.g., a table which contains a high number of numeric columns is transformed into a fact. However, these guidelines are focused on relational sources and they are not valid when the multidimensional model must be derived from the Web log model. Therefore, in this paper we have defined a new set of QVT transformations for detecting facts and dimensions in Web log models, thus deriving a conceptual multidimensional model. Due to space constraints, we focus on explaining a subset of these QVT transformations. Once a Fact has been derived from a Session class in the Web log model, the ObtainFactAttributes transformation (see Fig. 5) enforces the derivation of the FactAttribute properties BytesTaken, TimeTaken and SessionDate in the conceptual multidimensional model. The value of each of these attributes is derived from the Web log model and is calculated by means of OCL constraints in the where clause of the QVT transformation. For example, for each Session fact, the value of TimeTaken is the total processing time that the entries of the session cause on the server. Regarding dimensions, the User2Dimension transformation checks the User class in the Web log model to create a User dimension (and its related UserData Base class) related to the previously created Session fact. Once this dimension is created, the corresponding DimensionAttributes must be enforced by means of the QVT transformations defined in the where clause. Also, from the UserData Base class a hierarchy should be enforced by means of the corresponding QVT transformation. In order to exemplify the defined QVT transformations, the resulting multidimensional model for our running example is shown in Fig. 7.

Fig. 5. Obtaining Sessions fact attributes

Fig. 6. Deriving User dimension
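For illustration only, the following Java fragment mirrors what the OCL constraints behind ObtainFactAttributes compute for the Session fact attributes; the accessor names are assumptions for this sketch and not the generated metamodel API.

import java.util.List;

public class SessionFactAttributes {

    // Simplified entry type used only for this sketch.
    interface LogEntry {
        long getTimeTaken();
        long getBytesSent();
        String getTimeStamp();
    }

    // TimeTaken: total processing time caused by all entries of the session.
    static long timeTaken(List<LogEntry> entries) {
        return entries.stream().mapToLong(LogEntry::getTimeTaken).sum();
    }

    // BytesTaken: total number of bytes sent within the session.
    static long bytesTaken(List<LogEntry> entries) {
        return entries.stream().mapToLong(LogEntry::getBytesSent).sum();
    }

    // SessionDate: taken here from the time stamp of the first entry.
    static String sessionDate(List<LogEntry> entries) {
        return entries.isEmpty() ? null : entries.get(0).getTimeStamp();
    }
}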
Fig. 7. Sample conceptual multidimensional model
4 Conclusions and Future Work
In this paper we have presented a model-driven approach for obtaining a conceptual multidimensional model from Web log data. This model will drive the development of a data warehouse in order to enhance the analysis of Web usage data. To be able to tackle the different available Web log formats, a unified metamodel for Web log data has been developed. Our approach consists of the following steps: (i) obtaining a conceptual model of the data of the Web log files (based on the unified metamodel defined), (ii) automatically deriving a multidimensional model from this Web log model by formally defining a set of QVT transformation rules. Our future work consists of aligning our Web log metamodel with Web engineering approaches in order to create the multidimensional model for the Web application together with the rest of the Web conceptual models (navigational, domain, etc). Acknowledgments. This work has been partially supported by the ESPIA project (TIN2007-67078) from the Spanish Ministry of Education and Science, and by the QUASIMODO project (PAC08-0157-0668) from the Castilla-La Mancha Ministry of Education and Science (Spain).
References
1. Alves, R., Belo, O.: Mining clickstream-based data cubes. In: 6th International Conference on Enterprise Information Systems, pp. 583–586 (2004)
2. Alves, R., Belo, O., Cavalcanti, F., Ferreira, P.: Clickstreams, the basis to establish user navigation patterns on web sites. In: Fifth International Conference on Data Mining, Text Mining and their Business Applications, pp. 87–96. WIT Press, Southampton (2004)
3. Aurélio, D.M., Jorge, A.M., Soares, C., Leal, J.P., Machado, P.: A data warehouse for web intelligence. In: Neves, J., Santos, M.F., Machado, J.M. (eds.) EPIA 2007. LNCS (LNAI), vol. 4874, pp. 487–499. Springer, Heidelberg (2007)
4. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowl. Inf. Syst. 1, 5–32 (1999)
5. Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization. ACM Trans. Internet Techn. 3, 1–27 (2003)
6. Fraternali, P., Lanzi, P.L., Matera, M., Maurino, A.: Model-driven web usage analysis for the evaluation of web application quality. J. Web Eng. 3, 124–152 (2004)
7. Golfarelli, M., Maio, D., Rizzi, S.: The Dimensional Fact Model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7, 215–247 (1998)
8. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual data warehouse modeling. In: 2nd Intl. Workshop on Design and Management of Data Warehouses, pp. 6-1–6-11 (2000)
9. Jensen, M.R., Holmgren, T., Pedersen, T.B.: Discovering multidimensional structure in relational data. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2004. LNCS, vol. 3181, pp. 138–148. Springer, Heidelberg (2004)
10. Joshi, K.P., Joshi, A., Yesha, Y.: On using a warehouse to analyze web logs. Distributed and Parallel Databases 13, 161–180 (2003)
11. Kimball, R., Merz, R.: The data webhouse toolkit: building the web-enabled data warehouse. John Wiley & Sons, Inc., New York (2000)
12. Lopes, C.T., David, G.: Higher education web information system usage analysis with a data webhouse. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3983, pp. 78–87. Springer, Heidelberg (2006)
13. Luján-Mora, S., Trujillo, J., Song, I.Y.: A UML profile for multidimensional modeling in data warehouses. Data Knowl. Eng. 59, 725–769 (2006)
14. Mazón, J.N., Trujillo, J.: A model driven modernization approach for automatically deriving multidimensional models in data warehouses. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 56–71. Springer, Heidelberg (2007)
15. Mazón, J.N., Trujillo, J.: A hybrid model driven development framework for the multidimensional modeling of data warehouses. SIGMOD Record 38, 12–17 (2009)
16. Phipps, C., Davis, K.C.: Automating data warehouse conceptual schema design and evaluation. In: 4th Intl. Workshop on Design and Management of Data Warehouses, pp. 23–32 (2002)
17. Rizzi, S., Abelló, A., Lechtenbörger, J., Trujillo, J.: Research in data warehouse modeling and design: dead or alive? In: 9th International Workshop on Data Warehousing and OLAP, pp. 3–10 (2006)
18. The Apache Software Foundation: Log files, http://eregie.premier-ministre.gouv.fr/manual/logs.html
19. W3C Consortium: Extended common log file format, http://www.w3.org/TR/WD-logfile.html
Integrity Assurance for RESTful XML

Sebastian Graf, Lukas Lewandowski, and Marcel Waldvogel

Department of Computer and Information Science, University of Konstanz, 78457 Konstanz, Germany
{Sebastian.Graf,Lukas.Lewandowski,Marcel.Waldvogel}@uni-konstanz.de
Abstract. The REpresentational State Transfer (REST) represents an extensible, easy and elegant architecture for accessing web-based resources. REST alone and in combination with XML is fast gaining momentum in a diverse set of web applications. REST is stateless, as is HTTP on which it is built. For many applications, this is not enough, especially in the context of concurrent access and the increasing need for auditing and accountability. We present a lightweight mechanism which allows the application to control the integrity of the underlying resources in a simple, yet flexible manner. Based on an opportunistic locking approach, we show in this paper that XML does not only act as an extensible and directly accessible backend that ensures easy modifications due to the allocation of nodes, but also offers scalable possibilities to perform on-the-fly integrity verification based on the tree structure.
1 Introduction

1.1 The Multiple Facets of XML
The eXtensible Markup Language (XML) [2] represents one major paradigm in today's WWW environments. Not only is it used as a quasi standard when it comes to configuration issues and the handling of meta information, XML is also used as a direct data source for the preparation and visualization of information. Famous representatives of these use cases are XHTML as well as SVG or KML. These different XML dialects show the necessity of human-readable file formats and, accompanied by an enriched tool-set like XPath [8], XQuery/Update [5], and XSLT [9], highlight the applicability in many different areas. Another perspective on the evolution of XML as a directly accessible data storage format can be observed in modern storage systems. Not only do modern (object-)relational database systems have the ability to store and retrieve native XML; the ease of use, flexibility, and adaptability of XML also gave birth to several non-relational databases [15] that indeed have an essential reason to exist nowadays. Besides the utilization of XML as a data-format backend in visualizations and storage applications, XML is also used for providing integrated, unified access to entire workflows of querying and modifying data, especially in the WWW. Apache Cocoon [23] and XForms [3] are main representatives, among multiple others, when it comes to an all-in-one XML-based solution for retrieving, transforming and presenting data.
1.2 Stateless Access to Resources with REST
The REpresentational State Transfer [11] constitutes a new and elegant approach to accessing distributed resources. Instead of encapsulating requests in containers like SOAP, REST accesses resources directly in a stateless manner: no session handling and no transaction check is performed; each request is encapsulated, atomic and bound to a direct resource. This easy way of handling and accessing distributed resources as well as the clean definition of methods to interact with these resources are the ingredients for the success of REST. The usage of REST is defined on three independent axes: (a) the REST verbs, (b) the REST resources, and (c) the REST parameters (cf. (1)). Every request is bound to a verb. The verb determines the kind of action related to the requested resource and the corresponding parameters. REST verbs are POST requests to create/append a resource, PUT requests to place new content on a given resource (e.g. update), DELETE requests for removal operations and GET requests for common read-only access. These simple operations defined in HTTP are the key ingredients of a RESTful application. Besides the verbs, REST is based on directly accessible resources which are direct parts of the URI. A resource is always a concrete manifestation of data which can be accessed with all REST verbs in a similar way. The third important part of REST are the parameters. Parameters can contain any meta information for accessing data, for instance queries or additional commands. They are always optional and bound to a resource. Thus, a parameter can be used to filter or adapt the operation performed by a REST verb.

GET http://host/data.xml?query=descendant-or-self::x    (1)
(verb: GET, resource: http://host/data.xml, parameter: ?query=descendant-or-self::x)
The URL above shows a simple REST access on XML. Obviously, REST matches perfectly with XML as a resource handler. Due to the architecture of the requests, any tree-like structure can be accessed directly without any limitations via a URI. XML equipped with unique identifiers, either based on node labels like ORDPATH [18] or on any tree-based encoding, is able to answer requests directly on the node level as well as on the XML document itself. This enables REST to access XML substructures directly via resources without requiring additional parameters, since otherwise direct node access would have to be based on queries, for example.

1.3 Contribution and Problem Statement
RESTful access to resources is easy to provide and offers high flexibility in its utilization. However, the simplicity of REST comes at a price: considering multiple, concurrent modifications of a resource, a RESTful backend can render the semantic state of the resource invalid if an operation cannot be performed in an atomic manner and therefore must be split into multiple consecutive requests.
Besides, as REST is a stateless technique, any session- or transaction-based approach will not satisfy the paradigms of a RESTful application. To the best of our knowledge, there is currently no approach that is able to check the entire integrity of a RESTful resource against consecutive requests from disjoint clients. Nevertheless, we believe that such an integrity check increases the usability of REST without degenerating it into a non-stateless approach. In this paper we propose a technique based on opportunistic locking to provide data integrity on tree-structured data regarding the handling of consecutive as well as concurrent REST requests. First, we generate checksums for the tree based on Merkle trees [16]. These checksums are the key to integrity checks for any RESTful access to an XML resource. Thus, we enable RESTful applications to verify the integrity within each request while adhering to the stateless paradigm of REST.

1.4 Related Work
The combination of REST and XML has been explored in various ways. Wilde [21] encourages the usage of REST even for resources that are not web-related: objects that cannot be represented as web content are encapsulated in XML to provide a common way of accessing them. [19] describes an approach based on the XML dialect BPEL that is capable of acting RESTful; this approach fits perfectly in our use case of transaction integrity checking. Kramis et al. [12] describe an approach which allows any XML document to be accessed in a common way; however, even though temporal aspects are considered, no integrity check is performed. The common access of XML data is also described in [1], where XML-RPC is used for a session-based approach to communication between fixed and mobile clients. [20] enables REST to work transactionally based on the allocation of transactions as separate resources. This approach works directly on single resources only, but could result in race conditions. As our approach utilizes the structure of an XML resource, which corresponds to disjoint subtrees as well as to the direct allocation of nodes, we evaluate possibilities to ensure the structural integrity of tree structures. Based on Merkle trees [16], there are multiple different approaches to ensure integrity [4]. All of these approaches make use of recursive structural computations. Checksum methods employing the same idea can be used to provide integrity in our XML structures. Validation approaches that are directly related to XML are not solely structure based: some of them utilize a schema-based validation [10]. However, since we focus on concurrent operations on the nodes, a check against a DTD does not satisfy our needs; even concurrent updates can result in valid XML which is not valid in the semantic context of the sequential requests. [7] uses XML as a base for defining a language that provides data integrity. However, stateless communication is not considered an alternative in this approach. [6] improved this approach to check the integrity of distributed web communication systems. This system neither relies on stateless communication, nor does it utilize any additional information from the underlying resource that in turn could contain beneficial information.
Fig. 1. RESTful access to the tree: nodes are addressed directly as resources, e.g. GET http://host/data.xml/11, GET http://host/data.xml/2 and GET http://host/data.xml/3 each return the requested node including its subtree
Finally, [22] proposes a protocol-based approach to provide security and integrity of the retrieved data.
2 Integrity Check for REST-Enabled XML Resources
The verification of integrity regarding consecutive REST accesses on tree-based data rests on two aspects: first, the definition of RESTful access to the resource in a way that explores the tree structure in a native manner, and second, an integrity check of the tree structure including a checksum-based resource allocation scheme.

2.1 Unified RESTful Access to Tree-Based Structures
The motivation for our verification approach is to ensure concurrent data accesses while adhering to the strict stateless architecture defined by REST. In addition, we integrate full RESTful paradigms in our approach. These paradigms are represented within our URI specification as follows:

– A URI can request one XML document by its resource name. In that case the root node including the entire tree is returned.
– Each node in the XML can be accessed with a unique identifier (similar to Temporal REST [12] or ORDPATH [18]). If a resource offers such a feature, the unique identifier can be accessed over REST as a direct resource as well. The choice of the encoding of the unique identifier is independent of our approach. In case of node-level access, the desired node plus the underlying subtree is returned.

Figure 1 shows document accesses based on our definition. Obviously, coupling REST requests with unique identifiers per node is straightforward. However, it is important to understand that requests are only valid for the requested node as well as the related substructure. As our approach utilizes the structure of the tree to perform integrity verification, requests coupled to one node are not allowed to access ancestor nodes or the corresponding subtrees. Therefore, related to REST
parameters, which can for instance contain XQuery/Update [5], we only allow the usage of the forward axes within the related subtree. However, this is not a real constraint, as most of the modification and query languages rely on XPath, and each XPath expression can be evaluated by utilizing only the forward axes [17].

2.2 Integrity Check of the Tree
As we are working with XML, we utilize the tree structure not only in terms of the direct allocation of nodes as resources. When it comes to on-the-fly verification of the integrity of the XML resource, we rely on recursive algorithms [16] to generate checksums for each node. Equation (2) below denotes the structure of the checksummed tree:

n.hash = H(H(n.content) ∥ n.child(0).hash ∥ n.child(1).hash ∥ …)    (2)
Thereby, n represents the node and H(x) is a hash function with input x¹. The checksum of a node relies on its hash value and is therefore defined as the hash value of the content of the node combined with the hash values of all of its child nodes. The selection of the specific hash algorithm can be adapted to the specific use case of the resource in the application: if the resource has to be responsive, a fast hash function should be chosen; if the structure is in need of high integrity, a more stable hash function should be considered. The approach itself is as stable as the hash function used. Figure 2a shows such a checksummed tree structure. Checksums are generated based on a recursive relation where each node inherits the integrity of its corresponding subtree. Therefore, the complete subtree rooted at a node can be verified in a single step by checking the node's hash value. Projected onto the already described RESTful access, where one resource can be an XML tree as well as a qualified node, the checksum of a node guards the entire integrity of a resource node plus the underlying subtree. Any modification of the structure of a (sub-)tree or the content of a single node results in the regeneration of the corresponding checksums. Figure 2b shows an example of the checksum regeneration, while Algorithm 1 describes the procedure. Each time a request is performed, the checksum delivered with the request is compared to the one of the requested resource. If both checksums differ, an error is returned. In case of a modification in the tree, all checksums on the path to the root are recomputed with the help of the corresponding siblings. In the example of Fig. 2b, the white node labeled 67 is the node which is inserted into the tree. The nodes labeled 5, 6, 7 and 10 (depicted in grey with a white border) are only touched for read operations. These read operations are necessary to perform the update of the parent nodes 4, 3, 2 and 1, since the checksums are always based on the checksums of the related children as well. This example highlights that, due to the recursive structure, only the checksums of the nodes on the path to the root need to be modified during an update operation.

¹ To reduce the processing overhead for nodes with a high degree, the sequence of children can also be internally structured and hashed into a hierarchy.
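A minimal Java sketch of the recursive checksum of equation (2) and of the path-to-root update follows; SHA-256 and the concatenation order are only example choices, since the paper deliberately leaves the hash function open.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class TreeChecksum {

    static class Node {
        String content;
        Node parent;
        List<Node> children = new ArrayList<>();
        byte[] hash;
    }

    // n.hash = H(H(n.content) ∥ n.child(0).hash ∥ n.child(1).hash ∥ ...)
    static byte[] computeHash(Node n) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(MessageDigest.getInstance("SHA-256")
            .digest(n.content.getBytes(StandardCharsets.UTF_8)));   // H(n.content)
        for (Node child : n.children) {
            md.update(computeHash(child));                          // child hashes
        }
        n.hash = md.digest();
        return n.hash;
    }

    // After a modification only the checksums on the path to the root change;
    // the stored hashes of untouched children are reused.
    static void updatePathToRoot(Node modified) throws Exception {
        for (Node m = modified; m != null; m = m.parent) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(MessageDigest.getInstance("SHA-256")
                .digest(m.content.getBytes(StandardCharsets.UTF_8)));
            for (Node child : m.children) {
                md.update(child.hash);
            }
            m.hash = md.digest();
        }
    }
}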
Fig. 2. Recursive regeneration of checksums while inserting a new node as a leaf: (a) checksummed tree, (b) modified checksummed tree
In the worst case, all nodes on the path to the root need to be updated, depending on the modified leaf. Obviously, the cost of a modifying access to the tree corresponds to the height of the tree. Therefore, as we only need to update the checksums on the path up to the root, we are able to do an on-the-fly update of the checksums while traversing the tree from the node that was modified up to the root. It is important to understand that we make use of the structure of the data: since all requests occur on the tree, we do not have to regenerate every checksum of the entire data space for single modifications. All nodes that are not directly affected by a modification request are excluded from any updating mechanism as long as they are located in disjoint subtrees.

2.3 Request-Based Integrity Validation
The structure of the tree itself carries the proof of integrity at every point in time. Therefore, and due to the atomicity of REST requests, we can provide validation of consecutive REST requests that access the same resource. After each request the checksum of the requested resource is returned. Thus, if the request is based on an entire XML tree, the checksum of the root node, and thereby the current status of the entire tree, is returned to the client. If a request affects a node resource, the checksum of the subtree rooted at the corresponding node is returned to the client. As the communication base for the checksum within the REST requests, the ETag field of the HTTP specification is used. Validating the expected state of a resource is often related to previous, consecutive requests/modifications on the same resource: a client requests resources, checks the delivered data and tries to perform subsequent operations on it. With the first request, a checksum of the requested resource – which can be the entire XML document as well as a substructure based on a unique node – is delivered together with the requested data in the ETag field of the HTTP header. This checksum is returned to the server in the consecutive request. If the request modifies the data, a new checksum is computed and again returned with the following response. If the server observes a different checksum for the same requested resource, the HTTP error 412 (Precondition Failed) is returned to the client, thus informing the client about the concurrent modification of the data.
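From the client's perspective, this checksum-guarded cycle can be sketched as follows in plain Java; the URLs and the payload are illustrative, and we follow the paper's convention of shipping the checksum back in the ETag header rather than in a standard If-Match header.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChecksumGuardedClient {

    public static void main(String[] args) throws Exception {
        // 1. Read the resource and remember the checksum delivered in the ETag.
        HttpURLConnection get = (HttpURLConnection)
            new URL("http://host/data.xml/3").openConnection();
        String checksum = get.getHeaderField("ETag");
        get.disconnect();

        // 2. Ship the checksum with the consecutive modifying request.
        HttpURLConnection post = (HttpURLConnection)
            new URL("http://host/data.xml/3/firstChild").openConnection();
        post.setRequestMethod("POST");
        post.setRequestProperty("ETag", checksum);
        post.setDoOutput(true);
        try (OutputStream out = post.getOutputStream()) {
            out.write("<node/>".getBytes(StandardCharsets.UTF_8));
        }

        // 3. A 412 signals that the subtree was modified concurrently.
        if (post.getResponseCode() == 412) {
            System.out.println("Concurrent modification: refresh the checksum first.");
        } else {
            checksum = post.getHeaderField("ETag");   // new checksum of the resource
        }
    }
}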
Algorithm 1. Handle Request
Input: HTTPRequest request, hash function H
Output: HTTPResponse response
begin
  Node n ← request.resource
  if request.checksum = n.checksum then
    opReturn ← opOnData(n, request.verb, request.parameter)
    if opReturn ≠ wasValidOp then
      response ← new ErrorResponse(opReturn.errorCode)
    else if opReturn = wasModifyingOp then
      Node m ← n
      repeat
        m.checksum ← H(content(m) ∥ h)
        h ← m.checksum
        for r ∈ m.siblings do h ← h ∥ H(r)
        m ← m.parent
      until m = root
      m.checksum ← H(content(m) ∥ h)
    response ← new SuccessResponse(200, success, n.checksum)
  else
    response ← new ErrorResponse(412, Precondition Failed)
  return response
end
Figure 3 depicts a consecutive, concurrent check-then-act situation that highlights our approach. First, client 1 gets the resource with id 3 in the tree. The hash value 997d is returned to the client together with the requested resource. A second request is performed by client 2, but on the node with id 4. This node is a child of the node requested by client 1. The returned checksum for this node is 8h5y. Subsequently, client 1 performs a POST operation to insert a new leaf in the subtree maintained by client 1. The new node has the id 67, and all the checksums on the path to the root are updated before the request is completed and a suitable HTTP code indicating the success of the operation is returned to the client. The related new checksum (lr9c) of the requested resource with id 3 is returned to the client in the ETag of the response. The corresponding HTTP communication protocol is listed in Table 1a. In the meantime, client 2 tries to access the resource with id 4. Since the request is shipped with the checksum of its last request, 8h5y, the server compares the two checksums of the concurrent requests on the same resource (8h5y and sl24). Due to the modification of the same subtree within the request of client 1, the checksum of node 4 has changed. Therefore, the request from client 2 is denied due to the disparity between the checksum shipped with the request and the checksum currently associated with node 4.
Fig. 3. Concurrent REST requests
due to the disparity between the checksum shipped with the request and the checksum currently associated with node 4. Client 2 first has to become aware of the changes in the data and retrieve the new checksums for the requested resources before new checksum-guarded requests become valid. The corresponding HTTP communication protocol is listed in Table 1b. This example workflow shows that our approach ensures data verification in a RESTful manner. As long as clients only request the subtree of a specific node, our integrity approach can even handle multiple concurrent accesses. Furthermore, our checksum approach can easily be modified such that a checksum is shipped with every request/response without direct interaction from the server side. This would enable clients to perform integrity checks by
Table 1. Example of concurrent HTTP communication

(a) Communication for client 1

  HTTP Request                                        HTTP Response
  GET http://. . . /3                                 ETag(997d) <node> ...
  POST ETag(997d) http://. . . /3/firstChild          ETag(lr9c) 201 CREATED
    <node> ...

(b) Communication for client 2

  HTTP Request                                        HTTP Response
  GET http://. . . /4                                 ETag(8h5y) <node> ...
  POST ETag(8h5y) http://. . . /4 <node> ...          ETag(sl24) 412 PRECONDITION FAILED
themselves. If derived checksums differ, the client has to find out which concurrent operation modified the related resource or the underlying subtree.
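From the client's perspective, the workflow can be sketched as follows; the endpoint URL is hypothetical, and the checksum is echoed in the ETag request header as in Table 1 (a plain HTTP client would more commonly use If-Match for this purpose).

```python
# Illustrative client-side view of the checksum-guarded workflow (sketch only).
import requests

BASE = "http://example.org/resource"      # hypothetical JAX-RX endpoint

# 1) Fetch node 3; the server returns the subtree checksum in the ETag header.
resp = requests.get(f"{BASE}/3")
etag = resp.headers.get("ETag")

# 2) Modify the subtree, echoing the checksum we last saw.
post = requests.post(f"{BASE}/3/firstChild",
                     headers={"ETag": etag},
                     data="<node>...</node>")

if post.status_code == 412:
    # Precondition Failed: someone else changed the subtree in the meantime.
    resp = requests.get(f"{BASE}/3")      # re-read to obtain the new checksum
    etag = resp.headers.get("ETag")
else:
    etag = post.headers.get("ETag")       # new checksum after our modification
```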
3 Conclusion and Future Work
The proposed setup of REST and XML was implemented using JAX-RX [14] as the interface and Treetank [13] as well as BaseX [15] as the XML resource. Our implementation substantiates our assumption that our approach increases trust in stateless data handling through a simple though powerful validation mechanism, resolving the lack of confidence in data accessed over REST with an in-data integrity verification. Regarding further extensions, we believe that our approach can make use of more sophisticated hashing adaptations as well. With an intelligent hashing strategy, it is possible to reduce the overhead of adapting the hash values in a tree every time a modification occurs, even though these modifications are rather small. Furthermore, since we currently restrict access to fixed resources and therefore to fixed substructures in the tree, we want to increase the flexibility of our approach regarding the computation of checksums and concurrent accesses in the tree, in order to track consecutive requests on different nodes. To provide such auditing features, we plan to equip our approach with a versioned backend to track consecutive modifications on distributed data. This gives even more power to our integrity approach, since with every modification the related integrity structure can be secured as well. Checking the integrity of accessed data is one of the most important tasks in distributed applications. Although REST offers great flexibility, we believe that we can ensure confident access to the data in a way that does not restrict REST but makes it possible to overcome the uncertainty of stateless data access.
References 1. Alvarez-Cavazos, F., Garcia-Sanchez, R., Garza-Salazar, D., Lavariega, J.C., Gomez, L.G., Sordia, M.: Universal access architecture for digital libraries. In: Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON 2005, pp. 12–28. IBM Press (2005) 2. Bray, T., Paoli, J., Sperberg-McQueen, C.M., Textuality, T.B.: Extensible markup language (xml) - version 1.0 (1997) 3. Cardone, R., Soroker, D., Tiwari, A.: Using xforms to simplify web programming. In: Proceedings of the 14th International Conference on World Wide Web, WWW 2005, pp. 215–224. ACM, New York (2005) 4. Carminati, B., Ferrari, E., Bertino, E.: Securing xml data in third-party distribution systems. In: Proceedings of the 14th ACM International Conference on Information and knowledge Management, CIKM 2005, pp. 99–106. ACM, New York (2005) 5. Chamberlin, D., Florescu, D., Robie, J., et al.: XQuery update facility (2006)
6. Chi, C., Liu, L., Yu, X.: Data Integrity Related Markup Language and HTTP Protocol Support for Web Intermediaries. In: Sha, E., Han, S.-K., Xu, C.-Z., Kim, M.H., Yang, L.T., Xiao, B. (eds.) EUC 2006. LNCS, vol. 4096, pp. 328–335. Springer, Heidelberg (2006) 7. Hung Chi, C., Wu, Y.: An xml-based data integrity service model for web intermediaries. In: In Proc. 7th IWCW. pp. 14–16 (2002) 8. Clark, J., DeRose, S., et al.: XML path language (XPath) version 1.0 (1999) 9. Clark, J., et al.: XSL transformations (XSLT) version 1.0 (1999) 10. Fan, W., Libkin, L.: On xml integrity constraints in the presence of dtds. J. ACM 49(3), 368–406 (2002) 11. Fielding, R.T.: Architectural styles and the design of network-based software architectures. Ph.D. thesis, University of California, Irvine (2000), chair-Taylor, Richard N 12. Giannakaras, G., Kramis, M.: Temporal REST—How to really exploit XML. In: IADIS International Conference WWW/Internet (2008) 13. Graf, S.: Treetank, a native xml storage. Tech. rep., Bibliothek der Universität Konstanz, Universitätsstr. 10, 78457 Konstanz (2009), http://kops.ub.uni-konstanz.de/volltexte/2010/10066 14. Graf, S., Lewandowski, L., Gruen, C.: Jax-rx, unified rest access to xml resources. Tech. rep., Bibliothek der Universität Konstanz, Universitätsstr. 10, 78457 Konstanz (2010), http://kops.ub.uni-konstanz.de/volltexte/2010/12051 15. Holupirek, A., Grün, C., Scholl, M.H.: Basex and deepfs joint storage for filesystem and database. In: Proceedings of the 12th International Conference on Extending Database Technology, EDBT 2009, pp. 1108–1111. ACM, New York (2009) 16. Merkle, R.C.: A digital signature based on a conventional encryption function. In: Pomerance, C. (ed.) CRYPTO 1987. LNCS, vol. 293, pp. 369–378. Springer, Heidelberg (1988) 17. Olteanu, D., Meuss, H., Furche, T., Bry, F.: Symmetry in xpath. In: Proc. EDBT Workshop on XML Data Management (2002) 18. O’Neil, P., O’Neil, E., Pal, S., Cseri, I., Schaller, G., Westbury, N.: Ordpaths: insertfriendly xml node labels. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD 2004, pp. 903–908. ACM, New York (2004) 19. Pautasso, C.: Bpel for rest. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 278–293. Springer, Heidelberg (2008) 20. da Silva Maciel, L.A.H., Hirata, C.M.: An optimistic technique for transactions control using rest architectural style. In: Proceedings of the 2009 ACM Symposium on Applied Computing, SAC 2009, pp. 664–669. ACM, New York (2009) 21. Wilde, E.: Putting things to REST. School of Information, UC Berkeley, Tech. Rep. UCB iSchool Report 15 (2007) 22. Yao, D., Koglin, Y., Bertino, E., Tamassia, R.: Decentralized authorization and data security in web content delivery. In: Proceedings of the 2007 ACM Symposium on Applied Computing, SAC 2007, pp. 1654–1661. ACM, New York (2007) 23. Ziegeler, C.: Cocoon: Building XML Applications. Pearson Education, London (2002)
Collaboration Recommendation on Academic Social Networks

Giseli Rabello Lopes¹, Mirella M. Moro², Leandro Krug Wives¹, and José Palazzo Moreira de Oliveira¹

¹ Universidade Federal do Rio Grande do Sul - UFRGS, Porto Alegre, Brazil
{grlopes,wives,palazzo}@inf.ufrgs.br
² Universidade Federal de Minas Gerais - UFMG, Belo Horizonte, Brazil
[email protected]
Abstract. In the academic context, scientific research is often performed through collaboration and cooperation between researchers and research groups. Researchers work on various subjects and in several research areas. Identifying new partners for joint research and analyzing the level of cooperation with current partners can be very complex tasks. Recommendation of new collaborations may be a valuable tool for reinforcing and discovering such partnerships. This paper presents an innovative approach to recommend collaborations in the context of academic Social Networks. Specifically, we introduce the architecture of our approach and the metrics involved in recommending collaborations. We also present an initial case study to validate our approach.
1 Introduction
Nowadays, information can be accessed electronically as soon as it is published on the Web. However, problems associated with the information overload phenomenon have emerged. Recovering relevant digital information on the Web is a complex task, and research in the information filtering area, specifically on recommender systems, is therefore very important. Recommender systems reduce the problems associated with information overload by minimizing the time spent accessing relevant information. Recommender systems involve information personalization. Personalization is related to the ways in which information and services can be tailored to match the specific needs of a single user or community [1]. Moreover, recommender systems are embedded in a social context, since the recommendations are delivered to a user or a community of users. Perugini et al. [2] emphasize that recommendation has an inherently social element and is ultimately intended to connect people. In this perspective, social interactions and their relational aspects must be considered.
This work was partially supported by CNPq, Brazil.
In this context, the research area of Social Networks has emerged. Specifically, Social Network Analysis (SNA) is based on the assumption that the importance of the relationships between interaction units is central to the evaluation and analysis of social interactions. The increasing interest in research on Social Networks was encouraged by the popularization of online Social Networks, which are widely used Web applications. Nowadays, this type of network is commonly used, and each network connects millions of users. Examples of online Social Networks include MySpace, Facebook, Hi5, and Orkut, among others. Another application of the Social Network concept is a co-authorship Social Network representing a scientific collaboration network. In this "social growth of the Web", the number of studies involving Social Network Analysis and the number of applications aiming to improve recommender systems have increased. This should lead to advances in both areas and encourages the improvement or development of new approaches to analyze connections and recommend new ones. In this context, our paper presents an innovative approach to recommend collaborations in the context of academic Social Networks. The contributions of this paper can be summarized as follows: we introduce the architecture of our approach and the metrics involved in recommending collaborations (Sect. 2), and we present the results of an initial case study to validate our approach (Sect. 3).
2 Collaborations Recommendation
In this section, we give an overview of our approach to recommending collaborations in the context of academic Social Networks, as illustrated in Fig. 1. The first step consists of selecting the target user of the recommendation and the collaborative network to be used (i.e., a group of individuals from a digital community such as DBLP¹) (1). Then, the Social Network formed by the selected individuals (or a subgroup of them) is constructed and the weights of the relationships among them are assigned (2). These weights are calculated based on information about the authors' publications obtained from an existing digital library (3). They indicate the level of "global cooperation" between pairs of the researchers involved. The researchers' profiles (profiles of the authors of the collaborative network) to be used by the recommender system are constructed based on the information available about these authors in some digital library (4) and the classification of the authors' publications made with the aid of a research area ontology (taxonomy) (5). Further details about this kind of ontology can be obtained in [3]. Our approach considers the research areas in which the researchers work to determine the "global correlation" index. The "global cooperation" and "global correlation" indexes between pairs of authors are then used (6 and 7) as the basis to generate recommendations of authors with whom the user could establish new collaborations. Finally, these recommendations are presented to the user (8). The approach is detailed in the following subsections.

¹ DBLP Computer Science Bibliography, http://www.informatik.uni-trier.de/~ley/db
Fig. 1. Overview of the proposed Collaborative Recommender System
2.1 Global Cooperation
Social Networks are based on the importance of the relationships between interaction units. The interaction units of Social Networks are known as actors. The determination of the weights of the relationships between actors in a Social Network is a great challenge. These weights measure the importance of the existing relationships between actors. In this paper, the Social Network that we analyze is a scientific collaboration network. We present a metric to determine one type of association, namely Collaboration in Co-authorship, which defines the global cooperation between authors in a collaboration network. The information available in this type of Social Network concerns co-authored papers between authors. Formally, a Social Network SN for a co-author relationship a is a pair SN_a = (N, E), where N and E are the sets of Nodes and Edges, respectively. Each edge e ∈ E is a tuple of the form ⟨a_i, t, w, a_j⟩, where the edge is directed from a_i to a_j, t denotes the type of association between a_i and a_j, and w denotes the weight assigned to the association. The weight is a numerical value between 0 and 1, and the Collaboration in Co-authorship weight (wtCa) is given by equation 1.

$wtCa_{(a_i \rightarrow a_j)} = \frac{|a_j\_co\_authorship|}{|a_i\_author|} \quad (1)$
where:
– wtCa_(a_i → a_j) corresponds to the weight of the collaboration based on the co-author relationship (the weight differs according to the direction of the relation, i.e., the weight for a_i → a_j is different from that for a_j → a_i);
– |a_j_co_authorship| corresponds to the number of times that author a_j has co-authored a paper with a_i;
– |a_i_author| corresponds to the total number of papers of author a_i.
In other words, the higher the wtCa weight, the more relevant the relationship of author a_j to author a_i. The use of the wtCa metric implies that there
is a graph with either 0 or 2 links between two authors. The weights represent the degree of collaboration in co-authorship between these authors. This metric is an asymmetric variant of the Jaccard coefficient and has already been applied in the context of Social Networks [4,5].
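As an illustration, the weight of equation 1 reduces to a simple ratio of publication counts; the counts in the following sketch are invented and would in practice be extracted from DBLP.

```python
# Sketch of the co-authorship weight wtCa (equation 1); sample counts only.
def wt_ca(num_coauthored_with_j: int, num_papers_of_i: int) -> float:
    """Weight of the edge a_i -> a_j: share of a_i's papers co-authored with a_j."""
    if num_papers_of_i == 0:
        return 0.0
    return num_coauthored_with_j / num_papers_of_i

# Example: a_i wrote 40 papers in total, 8 of them together with a_j.
print(wt_ca(8, 40))   # 0.2  (note the asymmetry: the weight for a_j -> a_i may differ)
```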
2.2 Global Correlation
For performing co-authorship recommendation, it is important to determine a metric that states the global correlation between researchers. This correlation may consider the different research areas in which the involved researchers work, which was chosen because it is an important facet to be considered in this context. For this purpose, we take two actions: we construct a working profile for each researcher by considering the research areas in which the researcher works, and we assign weights to represent the degree of contribution of the researcher to the corresponding area. The global correlation is calculated between the working profiles of the researchers. Equation 2 calculates the weight of each research area that composes the working profile of a researcher.

$wRa(a_i, x) = \frac{|a_i\_author_{research\ area\ x}|}{|a_i\_author|} \quad (2)$
where |a_i_author_research area x| corresponds to the number of papers that author a_i published in research area x, and |a_i_author| to the total number of papers by a_i. Then, each area x has a corresponding weight that indicates the contribution of the researcher to that area. The weights indicate the degree of contribution of the researcher to each research area, and they are used in the calculation of the global correlation between pairs of authors. The Vector Space Model (VSM) [6] is used in our work to perform this computation. The VSM uses an n-dimensional space to represent the terms, i.e., n corresponds to the number of distinct research areas. Each author profile is represented by a vector of research areas, and the weights represent the vector's coordinates in the corresponding dimensions. Based on the VSM, similarity is calculated between pairs of authors, where the index terms correspond to the research areas of the authors. The weight assigned to each research area makes it possible to distinguish between the areas, and it is calculated according to their importance to the considered author by using equation (2). These weights vary continuously between 0 and 1: values near 1 correspond to more important research areas, while values near 0 correspond to less important research areas. The VSM principle is based on the inverse correlation between the distance (angle) among term vectors in the space and the similarity between the information (in this case, the author profiles) that they represent. To calculate the similarity score, which represents the global correlation between two authors, the cosine is used as in equation 3. The resulting value indicates the global correlation degree between the two author profiles (a_i and a_j), where wRa(a_i, x) represents the weights of
Table 1. Actions to be taken according to global cooperation and global correlation degrees

  Cooperation \ Correlation   low     medium      high
  low                         ok      recommend   recommend
  medium                      alert   ok          alert
  high                        alert   alert       ok
the research areas that compose the user's profile, and n represents the total number of research areas.

$global\_correlation(a_i, a_j) = \frac{\sum_{k=1}^{n} wRa(a_i, x_k) \cdot wRa(a_j, x_k)}{\sqrt{\sum_{k=1}^{n} (wRa(a_i, x_k))^2} \cdot \sqrt{\sum_{k=1}^{n} (wRa(a_j, x_k))^2}} \quad (3)$
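A small sketch of equations 2 and 3, assuming the per-area publication counts have already been obtained by classifying titles against the research area ontology; the example areas and counts are invented.

```python
# Sketch of profile construction (eq. 2) and cosine-based global correlation (eq. 3).
from math import sqrt

def research_area_weights(papers_per_area: dict) -> dict:
    """wRa(a_i, x): fraction of a_i's papers that fall into research area x."""
    total = sum(papers_per_area.values())
    return {area: count / total for area, count in papers_per_area.items()}

def global_correlation(profile_i: dict, profile_j: dict) -> float:
    """Cosine similarity between two research-area profiles."""
    areas = set(profile_i) | set(profile_j)
    dot = sum(profile_i.get(a, 0.0) * profile_j.get(a, 0.0) for a in areas)
    norm_i = sqrt(sum(w * w for w in profile_i.values()))
    norm_j = sqrt(sum(w * w for w in profile_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

p_i = research_area_weights({"databases": 12, "information retrieval": 6, "web": 2})
p_j = research_area_weights({"databases": 3, "web": 7})
print(round(global_correlation(p_i, p_j), 3))
```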
2.3 Recommendation
Our approach to generating cooperation recommendations considers the analysis of all the indexes previously described. The relationship between the "global cooperation" and the "global correlation" for each pair of authors establishes the need (or not) for more research interaction between them. For this kind of analysis, we establish degrees to represent the different ranges of values that are possible for these weights. The degrees are "high", "medium" and "low". Based on this analysis, recommendation actions can be established to indicate possible cooperations between pairs of authors. The combination of degrees and their interpretation can be observed in Table 1, where the action of recommending a collaboration is indicated by the term "recommend". For instance, pairs of authors with high global correlation but low global cooperation must be recommended to intensify their cooperation. Moreover, pairs of authors with medium global correlation and low global cooperation could also be recommended. The term "ok" indicates the ideal relationship. The term "alert" indicates situations in which the global cooperation and global correlation degrees differ but do not characterize a recommendation action. For example, pairs of authors with high or medium global cooperation that have lower global correlation degrees do not need to be recommended to cooperate, since the cooperation between the authors is already happening more intensely than their profile correlation suggests. In the case study presented in Sect. 3, we present an example of the translation of the ranges of numeric values for global cooperation and global correlation into the degrees defined as "low", "medium" and "high".
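The decision logic of Table 1, combined with the linear degree scale later used in the case study (Sect. 3), can be sketched as follows; the distinction between initiating and intensifying a collaboration follows the description given there.

```python
# Sketch of the Table 1 decision logic with the 1/3 and 2/3 degree cut-offs.
def degree(value: float) -> str:
    """Map a numeric index in [0, 1] to 'low', 'medium' or 'high'."""
    if value <= 1 / 3:
        return "low"
    if value <= 2 / 3:
        return "medium"
    return "high"

ACTIONS = {  # (cooperation degree, correlation degree) -> action, mirroring Table 1
    ("low", "low"): "ok",       ("low", "medium"): "recommend", ("low", "high"): "recommend",
    ("medium", "low"): "alert", ("medium", "medium"): "ok",     ("medium", "high"): "alert",
    ("high", "low"): "alert",   ("high", "medium"): "alert",    ("high", "high"): "ok",
}

def decide(cooperation: float, correlation: float) -> str:
    action = ACTIONS[(degree(cooperation), degree(correlation))]
    if action == "recommend":
        # Distinguish the two recommendation flavours used in the case study.
        return "initiate collaboration" if cooperation == 0.0 else "intensify cooperation"
    return action

print(decide(0.0, 0.8))   # initiate collaboration
print(decide(0.1, 0.5))   # intensify cooperation
```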
3 Case Study
In order to analyze how the proposed approach performs, we consider a case study from the InWeb (Brazilian National Institute of Science and Technology
for the Web)2 . The Institute is formed by 27 researchers and their students. All researchers are professors in a Brazilian major education institution (UFMG, UFRGS, UFAM and CEFET-MG) with graduate program in Computer Science. For validating our metrics, we have implemented a tool to automatically generate a Social Network. This SN was built using information about authors provided by the DBLP on January 21, 2009. It is important to notice that this library is exported as an XML document. Instead of using the whole dataset, we extracted from the library just the papers written by the considered researchers and published in conference proceedings or in journals (as elements inproceedings or article). This data gathering process summed up 677,345 authors; 692,431 conference proceedings papers and 432,663 journal articles. Such a subset was chosen because this information is significantly important for representing the co-author relationship between authors and, consequently, to determine the research collaborations among them. The actors (vertices or nodes) of the SN can be chosen at will. In here, they are a subset of authors with scientific papers indexed by the DBLP. The relational ties (linkages or edges) between actors are the relationships between pairs of authors. These social ties represent the co-author relationships. The weights of the linkages are determined by the equation 1 (presented in the Sect. 2). In that equation, |ai author| corresponds to the total number of papers of the author ai , and it considers all papers to this author ai indexed at DBLP (papers published in conferences proceedings or in journals), including those papers that are not co-authored by authors in the selected SN (that can be a sub-network of the global SN). These weights represent the “global cooperation” index between researchers. The SN will be represented by a directional graph where the edges have a direction associated, according to the equation previously defined. The InWeb SN was automatically constructed using our developed system. The result is presented in Fig. 2. The recommendation method was implemented as presented in Sect. 2. The profiles of the researchers (author profiles of the collaborative network) used by the recommendation method were built based on information available about these authors in the DBLP and on the classification of authors publications made using an ontology of research areas. The ontology (taxonomy) for classifying authors papers was proposed by Loh et al. [3]. This ontology uses a classification of computer science areas similar to the ACM (Association for Computing Machinery) classification3 and provides associated weights to the keywords according to their relevance to each research area. In this case study, only the keywords present on the publication’s title were used. The proposed approach considers the research areas in which researchers work to determine a “global correlation” index. The “global cooperation” and “global correlation” indexes between pairs of authors are used as a base to generate recommendations for authors with whom they should establish new collaborations, or that they must intensify collaborations. For the definition of recommendations, we have used the method 2 3
2 InWeb, http://www.inweb.org.br/
3 ACM Computing Classification, http://www.acm.org/about/class/ccs98-html
(The legend of Fig. 2 lists the InWeb researchers, grouped by institution: UFAM, UFRGS, CEFET-MG, and UFMG.)
Fig. 2. Automatic InWeb Social Network
presented in Sect. 2.3. We have used a linear scale to translate the ranges of numeric values for global cooperation and global correlation into the degrees defined as “low”, “medium” and “high”. The “low”, “medium” and “high” degrees correspond, respectively, to values v in the ranges {v ∈ |0.0 < v ≤ 1/3}, {v ∈ |1/3 < v ≤ 2/3} and {v ∈ |2/3 < v ≤ 1.0}. After the identification of the resulting degree, it is possible to establish recommendation actions according to Table 1. As shown in that table, the recommendation action will only happen when there is a “low” value of global cooperation between pairs of authors and a “medium” or “high” value of global correlation. Moreover, when the value of global cooperation is effectively null (zero), a recommendation to initiate collaboration will be made to the authors. Indeed, when the value of global cooperation is a “low” but not null, an intensification of existing collaboration will be recommended. An example of interface for presenting recommendations for the researcher “Jos´e Palazzo Moreira de Oliveira” (InWeb member), can be seen in Fig. 3. In the recommendation to “initiate collaboration”, the value of the global correlation index obtained between the recommended research and the target researcher is shown. When it is recommended to “intensify cooperation”, the ratio of global cooperation and global correlation is presented. Furthermore, recommendations are presented in ranked lists: in the “initiate collaboration” list, the researchers recommended are in descending order of global correlation index, since the higher the value of global correlation found, the greater the possibility of collaboration be interesting to the target researcher; and in the “intensify cooperation” list, the recommended researchers are presented in increasing order of ratio between
Fig. 3. Example of interface for recommendations presentation
global cooperation and global correlation, so that the recommendations are presented in order of intensification necessity.
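A sketch of how the two ranked lists can be produced once the candidate pairs have passed the decision logic of Table 1; the candidate tuples below are invented sample data, not the values shown in Fig. 3.

```python
# Sketch of the ordering of the two recommendation lists (sample data only).
candidates = [
    ("Researcher A", 0.00, 0.80),   # never co-authored, very similar profile
    ("Researcher B", 0.00, 0.55),
    ("Researcher C", 0.05, 0.70),   # some co-authorship, similar profile
    ("Researcher D", 0.10, 0.40),
]

initiate = sorted((c for c in candidates if c[1] == 0.0),
                  key=lambda c: c[2], reverse=True)        # descending correlation
intensify = sorted((c for c in candidates if 0.0 < c[1] <= 1 / 3),
                   key=lambda c: c[1] / c[2])              # ascending cooperation/correlation ratio

print([name for name, _, _ in initiate])    # ['Researcher A', 'Researcher B']
print([name for name, _, _ in intensify])   # ['Researcher C', 'Researcher D']
```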
4 Related Works
Traditionally, recommender systems are studied in three different perspectives according to the methodologies used to perform recommendation: (i) contentbased, which recommends items classified accordingly to the user profile and early choices considering semantic issues; (ii) collaborative filtering, which deals with similarities among users interests including the consideration of structural aspects; and (iii) hybrid approach, which combines the two to take advantage of their benefits. The related works presented in this section aim to use Social Networks on the context of Recommendation Systems. Many current solutions focus only on structural issues of the Social Network to generate recommendations. Some examples of these collaborative filtering approaches are presented bellow. Ogata et al. [7] propose PeCo-Mediator-II system to search cooperators through a chain of personal connections in a SN. The system aims to help the user to find cooperators through the chain of connections between them. Golbeck et al. [8] present a website that integrates Social Networks on the Semantic Web context and the concept of trust to generate films recommendations. The SN indicates the trust rates between users by the path’s length between them. Aleman-Meza et al. [4] define a solution to the Conflict of Interest problem (COI) using Social Networks. The objective is to detect relationships of COI amongst authors of scientific papers and potential reviewers of these papers based on public sources such as DBLP and FOAF. Quercia et al. [9] propose a framework, called FriendSensing, to automatically suggest friends to users of Social Networks constructed based on the physical proximity between users identified through mobile devices. Karagiannis et al. [10] analyze emails exchanged by a certain group of people. For this purpose, they construct a Social Network among these people and suggest the recommendation of “friends of friends”. Chen et al. [11] propose four algorithms to recommend people on the
context of Beehive, an IBM’s Social Network. Two of these algorithms, named FoF and SONAR, use only structural aspects of the SN. Some approaches consider both the semantic and the structural issues in the recommendation method. Some examples of these hybrid approaches are presented bellow. Kautz et al. [12] present the ReferralWeb system to identify experts in searches by keywords and to generate a path of social relationships between a user and the recommended expert. The system models and extracts existing relationships among people of the Computer Science community by mining public data available on Web documents. McDonald [13] details an evaluation of two different Social Networks that can be used in a system to recommend individuals for possible collaborations. The system matches individuals looking for expertise within people that could have this expertise. Zaiane et al. [14] explore a Social Network coded within the DBLP database by drawing on a new random walk approach in order to reveal interesting knowledge about the research community and even recommend collaborations. The approach aims at helping the user on the process of searching for relevant conferences, similar authors and interesting research topics. Weng et al. [15] propose a recommendation method that uses ontologies and the spreading activation model. This model is used to search for other influential users on a Social Network. The ontologies are used to define the users profiles and as base to infer users’ interests. One of the algorithms proposed by Chen et al. [11] to recommend people on the Beehive, named CplusL, combine two approaches of recommendations: the approach based on content and the approach that uses the structural aspects of the SN. In this paper, we propose an approach to consider the working area of the recommendation target user by trying to match this information to recommended a researcher with similar profile. Most of the approaches that perform collaborations recommendation focus on recommending experts in a certain area or information [7,12,13]. They do not consider the working area of the target user. Our approach also aims at obtaining information about the relationships among users, implicitly, through publications data obtained from digital libraries. Moreover, it includes not only the recommendation of new collaborations, as well as the recommendation to intensify existing collaborations, which is not considered by other related work. Many works focus only on structural issues of the SN to generate recommendations [4,7,8,9,10,11]. Our work considers both the semantic issues involving the relationship between the researchers in research areas and the structural issues by the analysis of the existent relationships among researchers.
5 Final Remarks
In this work, we have presented the details of an innovative approach to recommend scientific collaborations on the context of Social Networks. Moreover, we have shown the state of the art and discussed related work. The complete work is under development as a research project of the InWeb. The case study for the experiments is a collaborative Social Network based on the publications
of the researchers associated to the InWeb project. In our case study, the Social Network is constructed based on data available by the DBLP digital library. In future works, we will provide more experiments to evaluate the approach aiming to show the viability and applicability of our proposal in a real world application.
References 1. Smeaton, A.F., Callan, J.: Personalisation and recommender systems in digital libraries. Int. J. on Digital Libraries 5, 299–308 (2005) 2. Perugini, S., Gon¸calves, M.A., Fox, E.A.: Recommender systems research: A connection-centric survey. J. Intell. Inf. Syst. 23, 107–143 (2004) 3. Loh, S., et al.: Constructing domain ontologies for indexing texts and creating users’ profiles. In: Work. on Ontologies and Metamodeling in Software and Data Engineering, Brazilian Symp. on Databases, UFSC, Florian´ opolis, pp. 72–82 (2006) 4. Aleman-Meza, B., et al.: Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection. In: Intl. Conf. on World Wide Web, pp. 407–416. ACM Press, New York (2006) 5. Mika, P.: Social networks and the semantic web. In: IEEE/WIC/ACM Intl. Conf. on Web Intelligence, pp. 285–291. IEEE Press, New York (2004) 6. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 513–523 (1988) 7. Ogata, H., Yano, Y., Furugori, N., Jin, Q.: Computer supported social networking for augmenting cooperation. Comput. Supp. Coop. Work 10, 189–209 (2001) 8. Golbeck, J., Hendler, J.: Filmtrust: movie recommendations using trust in webbased social networks. In: Consumer Communications and Networking Conf., pp. 282–286. IEEE Press, New York (2006) 9. Quercia, D., Capra, L.: Friendsensing: recommending friends using mobile phones. In: ACM Conference on Recommender Systems, pp. 273–276. ACM Press, New York (2009) 10. Karagiannis, T., Vojnovic, M.: Behavioral profiles for advanced email features. In: Intl. Conf. on World Wide Web, pp. 711–720. ACM Press, New York (2009) 11. Chen, J., et al.: Make new friends, but keep the old: recommending people on social networking sites. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 201–210. ACM Press, New York (2009) 12. Kautz, H., Selman, B., Shah, M.: Referral web: combining social networks and collaborative filtering. Commun. ACM 40, 63–65 (1997) 13. McDonald, D.W.: Recommending collaboration with social networks: a comparative evaluation. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 593–600. ACM Press, New York (2003) 14. Zaiane, O.R., Chen, J., Goebel, R.: Dbconnect: mining research community on dblp data. In: WebKDD and SNA-KDD Work. on Web Mining and Social Network Analysis, pp. 74–81. ACM Press, New York (2007) 15. Weng, S.S., Chang, H.L.: Using ontology network analysis for research document recommendation. Expert Syst. Appl. 34, 1857–1869 (2008)
Mining Economic Sentiment Using Argumentation Structures Alexander Hogenboom, Frederik Hogenboom, Uzay Kaymak, Paul Wouters, and Franciska de Jong Erasmus University Rotterdam, PO Box 1738, NL-3000 DR, Rotterdam, The Netherlands {hogenboom,fhogenboom,kaymak,wouters,fdejong}@ese.eur.nl
Abstract. The recent turmoil in the financial markets has demonstrated the growing need for automated information monitoring tools that can help to identify the issues and patterns that matter and that can track and predict emerging events in business and economic processes. One of the techniques that can address this need is sentiment mining. Existing approaches enable the analysis of a large number of text documents, mainly based on their statistical properties and possibly combined with numeric data. Most approaches are limited to simple word counts and largely ignore semantic and structural aspects of content. Yet, argumentation plays an important role in expressing and promoting an opinion. Therefore, we propose a framework that allows the incorporation of information on argumentation structure in the models for economic sentiment discovery in text.
1 Introduction
Today’s economic systems are complex with interactions amongst ever more actors and with increasing dynamics. Tracking and monitoring is important in any dynamic system in order to be able to exercise control over it, and is essential in complex systems like economic systems. As our ability to collect and process information increases, actors in economic systems (e.g., businesses) feel a growing need for automated information monitoring tools that can help to identify issues and patterns that matter and that track and predict emerging events. A key element for decision makers to track is stakeholders’ sentiment. The relevance of insight in sentiment has been studied in various contexts. For instance, recent research demonstrates that the detection of occupational fraud – a 652 billion dollar problem – can be supported by the automated detection of employee disgruntlement in a vast amount of archived e-mails [14]. In the context of organizational change processes, Hartelius and Browning [12] argue that managers’ most important actions are persuasive actions. Furthermore, recent research demonstrates the influence of investor sentiment on financial markets through the impact of news messages [2]. J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 200–209, 2010. c Springer-Verlag Berlin Heidelberg 2010
The recent turmoil in the financial markets has illustrated the need for advanced monitoring and tracking tools that enable timely intervention. The key conceptualization of economic sentiment considered here is consumer confidence, which is the degree of optimism that consumers have about the future of the economy and their own financial situation. Consumer spending tends to vary with the consumer confidence [18]. Since consumer spending is an important element of economic growth, consumer confidence can be considered to be an important indicator for economic expansion. As such, the formation of expectations regarding future developments in the economy significantly influences future states of the economy, such as a recession [15] or economic recovery [32]. Hence, economic analysts and policy makers must keep track of economic sentiment in order to anticipate the future state of the economy. Back in 1975, Katona [17] argued that economic sentiment may represent a subjective state of mind of actors within an economic system. Economic sentiment has commonly been characterized as a latent variable, correlated with traditional macro-economic indicators, e.g., employment conditions [1]. More recent studies however consider additional macro-economic indicators to capture economic sentiment, e.g., the University of Michigan Consumer Sentiment Index (CSI) or the Consumer Confidence Index (CCI) [18]. Traditional indicators have been operationalized using publicly available macro-economic data, whereas the CSI and CCI have been based on regular, allegedly representative surveys. Conversely, Bovi [3] points out that people’s expectation formation is thwarted by structural psychologically driven distortions. The structural difference between surveyed ex ante expectations and subsequent realizations may be caused by respondents considering questions to be vague or hard to assess, which may trigger them to provide heuristic, biased answers [30]. Moreover, Oest and Franses [23] stress that over time, the small survey panels encompass different respondent samples. This complicates generalizability of survey findings, as observed sentiment shifts may be largely driven by differences in respondent samples. In a recent analysis, Vuchelen [32] argues that the broader view on economic sentiment pioneered by Katona may complement the more restrictive view based on macro-economic indicators. In this light, we envisage a more deliberate conceptualization of economic sentiment when common macro-economic indicators are complemented with a general mood, which is typically represented using an indicator of polarity (possibly assessed on multiple features). In their communication, people reveal their mood to a certain extent. With the advent of the Internet, traces of human activity and communication have become ubiquitous, partly in the form of written text. An overwhelming amount of textual publications (e.g., scientific publications, blogs, and news messages) is available at any given moment. Analyzing free-text information can enable us to extract the information tailored to the needs of decision makers. The amount of data available to decision makers is overwhelming, whereas decision makers need a complete overview of their environment in order to enable sufficient tracking and monitoring of business and economic processes, which in turn can facilitate effective, well-informed decision making.
The abundance of digitally stored text opens possibilities for large-scale (semi-)automatic text analysis, focused on uncovering interesting patterns: text mining. Text mining may lead to valuable insights, but raw textual data does not necessarily explicitly reveal the writer’s sentiment. Existing sentiment mining approaches enable quantitative analysis of texts, mainly based on their statistical properties, possibly combined with numeric data. Most approaches are limited to word counts and largely ignore semantic and structural aspects of content. We hypothesize that argumentation structure analysis can support economic sentiment mining, as argumentation structures play an important role in expressing and promoting opinions. Moreover, not all parts of a text may contribute equally to expressing or revealing the underlying sentiment. The relative contribution of a certain linguistic element to the overall sentiment may depend on its position within the overall structure of the text and argumentation. For instance, a conclusion may contribute more than a refuted argument. In this paper, we propose a framework combining knowledge from the areas of text mining – and more specifically sentiment mining – and argumentation discovery. This framework is inspired by a review of the state-of-the-art in these areas. Not only will this research contribute to the existing body of knowledge on sentiment mining by bridging the theoretical gap between qualitative text analyses and quantitative statistical approaches for sentiment mining, but the envisaged link between argumentation structures and associated sentiment may also enable decision makers and researchers to obtain insight in why things are happening in their markets, rather than just what is happening. The remainder of this paper is organized as follows. First, the interrelated concepts of text mining and sentiment mining are presented in Sect. 2. Then, Sect. 3 shifts focus to discovery of argumentation structures. Subsequently, we propose a framework in which the knowledge from the disparate fields of sentiment mining and argumentation discovery is combined. We conclude in Sect. 5.
2 Text Mining
Much linguistic information is available in textual format. Text is a direct carrier of linguistic information, which renders it a convenient mode for representing or processing linguistic data. Text is typically considered to be unstructured data. Yet, text has a kind of structure that arbitrary collections of words or sentences generally lack. From a linguistic perspective, text documents typically have some implicit notion of structure, constituted by semantic or syntactical structure, as well as typographical elements, lay-out, and word sequence [11].

2.1 Extracting Knowledge from Textual Data
In the last couple of decades, a substantial amount of research has been focusing on automated ways of gaining understanding from text by means of text mining. Text mining is a broad term that encapsulates many definitions and operationalizations, which appear to be distributed in a continuum between two extremes.
On one hand, text mining refers to retrieving information that already is in the text (typically using predefined patterns). On the other hand, text mining could refer to a more inductive approach, where patterns are to be discovered in textual data. Theory (i.e., the model) follows the data. Many definitions of text mining exist, yet the common denominator is that text mining seeks to extract high-quality information from unstructured data which is textual in nature, where quality is often conceptualized as a measure of interestingness or relevance. The dispersion of conceptualizations of text mining is reflected in the terminology used to refer to text mining, e.g., text analytics, intelligent text analysis, knowledge discovery in texts, and text data mining. The latter term indicates a connection between data mining and text mining. Data mining is used to find patterns and subtle relationships in structured data, and rules that allow prediction of future results, whereas text mining focuses on finding patterns and relations in unstructured, textual data. Feldman and Sanger [10] however argue that from a linguistic perspective, text is typically not completely unstructured. A text document can already be referred to as weakly structured when it has some indicators to denote linguistic structure (e.g., key terms related to argumentation, headers, or templates adhered to in scientific research papers and news stories). Furthermore, Feldman and Sanger distinguish semi-structured documents which contain extensive and consistent format elements, such as HTML documents. With respect to text mining in its broadest sense, literature exhibits a rough distinction between three stages: preprocessing, processing, and presentation. Feldman and Sanger [10] provide an extensive overview of preprocessing routines, pattern-discovery algorithms, and presentation-layer elements. Most text mining tools utilize their own framework for processing texts with the purpose of extracting information. However, GATE [6], a freely available text processing framework, has become increasingly popular due to its flexibility and extensibility. Amongst supported linguistic analyses are tokenization, Part-Of-Speech (POS) tagging, and semantic analysis. Tools like GATE could prove useful in a setting in which economic discourse is to be analyzed for interesting patterns. Yet nowadays, patterns in raw text are not enough anymore; insight in (patterns of) associated sentiment is crucial for decision makers.
2.2 Sentiment Mining
The field of sentiment mining is relatively young. The discovery of sentiment is usually focused on reviews of products, movies, etcetera. The focus of work on analyzing online discussions and blogs [16] is more on distinguishing opinions from facts than on extracting and summarizing opinions. Existing toolkits are limited to simple word counts and relevant linguistic resources are absent or do not always fit into the applied framework. Today’s text analytical tools are ill-equipped to deal with highly dynamic domains, because they have been developed without adaptation in mind [29] and until recently largely ignore structural aspects of content [7,25].
Early attempts to incorporate structural aspects of texts have been made by Pang et al. [24], who stress that, e.g., a review with a predominant number of negative sentences may actually have a positive conclusion and thus have an overall positive sentiment. Therefore, Pang et al. include location information of tokens for sentiment in their analysis. Devitt and Ahmad [8] use theories of lexical cohesion for sentiment polarity classification of financial news. Mao and Lebanon [19] model sentiment as a flow of local sentiments, which are simply related to position in the text. Yet so far, no attempts have been made for utilizing information encompassed in argumentation structures, whereas argumentation structures are closely related to the sentiment of the message they convey.
3 Discovering Argumentation Structures
By using argumentation structure and elements such as specific metaphors, analogies, vocabularies, or supportive non-textual data, a specific mood or opinion can be expressed and promoted. For example, the use of analogies or vocabularies invoking negative associations in means of communication concerning change processes may lead people to have negative expectations. Our framework starts from the hypothesis that sentiment mining in economic texts can thus be improved if the information in the structural elements of a text can be harvested.

3.1 Argumentation
Argumentation is central in any discourse. Humans discuss and argue by exchanging information in natural language. In all societies, there is a tendency for idle, free-flowing exchange of ideas and thoughts, which is called conversation [26]. In economics literature, conversation is often seen as cheap talk in which the act of conducting a conversation does not influence the payoffs in a game-theoretic setting [9]. Here, conversation is considered only to convey direct information, either in the form of imperatives (e.g., issuing orders) or in the form of information that is actionable (e.g., by revealing private information). Although classical economic theory posits that all information is incorporated in a market-based pricing system, the importance of private information and asymmetric distribution of information has been subject to many economic studies. Conversation provides a mechanism to diffuse asymmetric information. In addition to the direct information content, argumentation and persuasion are important aspects of linguistic communication. People exchange ideas with a goal. Argumentation is incorporated to convince the listener of the validity of the reasoning. Anyone engaged in argumentation selects and presents information in a particular way that enhances the acceptance of the argument. Hence, rhetoric, argumentation structures, and presentation styles are very important since they facilitate persuasion, as acknowledged by various economists. McCloskey and Klamer [21] estimate that a significant part of national income can be attributed to persuasion. Cosgel [5] models consumption from a rhetorical perspective and shows how subjective information such as tastes can be understood from a different perspective than the more common choice framework.
3.2 Argumentation Mining
The above studies demonstrate that an analysis of discourse in which structural and semantic elements are incorporated can provide information that is otherwise not available. Qualitative text analyses, possibly guided by the Textual Entailment (TE) framework [13] or the Rhetorical Structure Theory (RST) presented by Taboada and Mann [27], can enable the discovery of such information. In recent years, computational models of linguistic processing, text mining and argumentation discovery have been developed, especially in the fields of computer science and computational linguistics. A pioneer in this area has been Teufel, relying on statistical classifiers to identify and classify sections on scientific documents as so-called argumentative zones [28]. Early research, e.g., the work of Marcu [20], typically exploited keywords taken to be signaling a discursive relation, yet more recently, researchers like Webber et al. [33] argue that the true structure of discourse in a text is not necessarily formed by the actual textual units and their connecting keywords; they appear to employ a more high-level conceptualization of argumentation structures, which can however be linked to the relational meaning invoked by the keywords. Another perspective on argumentation discovery is advocated by Vargas-Vera, focusing on discovering argumentation structures in texts by representing these texts as networks of cross-referring claims [31], similarly to Buckingham Shum et al. [4]. More recently, Mochales Palau and Moens have focused on the automatic detection of argumentation structures in legal texts [22]. Such efforts as described here are promising first steps towards principal ways of automatically detecting argumentation structures.
4 Argumentation-Based Economic Sentiment Mining
In order to be able to extract economic sentiment from text sources, we need an information system capable of inferring specific information on economic sentiment from natural language texts. The purpose of such a system is to analyze a given text collection and to determine the sentiment in the texts. However, in economics, sentiment typically associated with arbitrary words does not necessarily reflect the intended sentiment. Statements that appear to have a positive sentiment can in fact be used to express a negative opinion and vice versa. Also, someone could express a positive attitude towards certain negative developments, or dissatisfaction with respect to seemingly positive events. For example, rising prices may be good news for sellers, yet bad news for buyers. However, the reasoning scheme behind a specific piece of text may contain important information that would remain undetected if simply evaluating sentiment word by word. It is the argumentation structure that provides us with essential clues as to which parts of the text contribute in what way to the overall sentiment conveyed by the text as a whole. Hence, only by taking into account argumentation structures, one could determine the sentiment of a message more accurately. Our envisaged system for economic sentiment mining is hence to take into account argumentation structures, which can be detected automatically (see Sect. 3.2).
In our envisaged approach, we aim to identify distinct elements of argumentation structures in order to be able to, e.g., differentiate between conclusions and their supporting arguments. In this respect, we hypothesize that, e.g., conclusions are good summarizations of the main message as well as key indicators of the sentiment throughout the text. Furthermore, sentiment stored within nonfactual (hence inherently subjective) arguments that support conclusions is also valuable, in contrast to sentiment imputed to factual support, which should rather be discarded. Hence, our application aims to take such considerations into account, by classifying textual elements and using elemental sentiment and argumentation structures for determining the overall sentiment. An example of a typical problem within the economic domain is the explanation of positive events by means of negative terms, causing texts to be erroneously classified as having a negative sentiment, e.g., in a text on plunging mortgage rates and house prices that yield improved home loan affordability (see http://www.getfrank.co.nz/homes-more-affordable/). Due to its specific structure and choice of words, it is difficult to interpret this text correctly with existing, mostly statistics-based sentiment mining techniques. Even though the conclusion that housing is becoming more affordable has a rather positive sentiment associated with it, the support for this conclusion is mostly constructed of words that are associated with negative sentiment. Processing such texts without taking into consideration argumentation structures would most likely lead to false classifications. We therefore propose an Information Extraction pipeline which extracts economic sentiment while taking into account argumentation structures. This pipeline divides specific roles and tasks amongst different components that are interconnected by their inputs and outputs. Such a pipeline facilitates stepwise abstraction from raw text to useable, formalized chunks of linguistic data and enables effective text processing, as each component can be optimized for a specific task. In our framework, depicted in Fig. 1, we propose to employ the general purpose GATE framework, which allows for easy usage, extension, and creation of individual components. For initial lexico-syntactic analysis of input text (i.e., operations not specific to our envisaged sentiment mining approach), we propose to use several existing components from GATE’s default pipeline, A Nearly New Information Extraction System (ANNIE). First of all, we clear documents from unwanted artifacts such as tags, by means of a Document Reset component. Subsequently, we employ an English Tokenizer, which splits text into separate tokens (e.g., words). Then, a Sentence Splitter is used, which splits the input text into sentences, after which a POS Tagger component is utilized in order to determine the part-of-speech of words within a text collection. After these basic syntactic operations, semantic analysis is to be performed by several novel components. Firstly, we employ an Argumentation Gazetteer for identifying argumentation markers, i.e., key terms related to argumentation. For this, we propose to employ a populated argumentation ontology that contains definitions of these argumentation markers and their relations to argumentative text elements (e.g., arguments, supports, conclusions), which are also
defined in this ontology.
Fig. 1. Conceptual outline of the envisaged information processing pipeline
The centrepiece of our approach here is modeling the textual means by which argumentation in economic discourse is structured. Our proposed models of argumentative structure will take RST and TE as a starting point. RST focuses on the role of relation markers in cohesive texts and offers an explanation of this coherence by describing texts using various notions of structure. RST can thus provide important guidelines for the annotation of a domain-specific training corpus. TE focuses on determining semantic inference between text segments, which is useful for detecting text segments that are essential parts of the argumentation structure, in that they contribute to the overall argumentative path followed in a document. A combination of insights from RST and TE could hence yield more elaborate insight into argumentation structure. Guided by the annotated argumentation key terms found by the Argumentation Gazetteer, the Argumentation Parser subsequently identifies text segments and determines their role in a document's argumentation structure, thereby utilizing the argumentation ontology. Finally, the Sentiment Analyzer identifies the sentiment in the identified individual text segments and connects the sentiment of these segments to the associated argumentation structure. Based on their role in the argumentation structure, text segments are assigned different weights in their contribution to the overall sentiment. For this process, we will develop our models from textual data by using machine learning techniques. The learning techniques used will incorporate computational intelligence methods such as neural networks, self-organizing maps, evolutionary computation, and cluster analysis in addition to advanced statistical approaches such as Bayesian networks [7]. The output of this process is an ontology that is populated on the fly and represents knowledge on the current economic sentiment in the text collection. This sentiment ontology in turn utilizes the argumentation ontology in order to enable a connection between argumentation and sentiment, thereby facilitating insight into opinion genesis. New knowledge on economic sentiment is stored in the ontology, thus enabling reasoning and inference of knowledge in order to support decision-making processes.
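To make the role of the argumentation structure in sentiment aggregation concrete, the following minimal sketch illustrates the kind of weighting step the Sentiment Analyzer could perform once segments have been classified. The role labels, weights, sentiment scores, and the Segment structure are assumptions made for this illustration only; they are not taken from the proposed system.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    role: str         # e.g., "conclusion" or "support"
    factual: bool     # factual support carries no subjective sentiment
    sentiment: float  # elemental sentiment in [-1, 1], e.g., from a lexicon

# Hypothetical weights reflecting the hypotheses above: conclusions dominate,
# non-factual support still contributes, factual support is discarded.
WEIGHTS = {"conclusion": 1.0, "support": 0.4}

def overall_sentiment(segments):
    num, den = 0.0, 0.0
    for s in segments:
        if s.role == "support" and s.factual:
            continue  # discard sentiment imputed to factual support
        w = WEIGHTS.get(s.role, 0.0)
        num += w * s.sentiment
        den += w
    return num / den if den else 0.0

# The home-affordability example: negatively worded support, positive conclusion.
doc = [
    Segment("Mortgage rates and house prices are plunging", "support", True, -0.6),
    Segment("Buyers had feared further falls in value", "support", False, -0.3),
    Segment("Homes are therefore becoming more affordable", "conclusion", False, 0.8),
]
print(overall_sentiment(doc))  # > 0: positive overall despite mostly negative words
```

With such a scheme, the home-affordability example discussed above comes out positive overall, even though most of its words carry negative sentiment.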
5 Conclusions and Future Work
The disparate fields of text mining and sentiment mining on the one hand, and argumentation discovery on the other hand, offer a wide range of possibilities in order to advance economic discourse analysis. Firstly, text mining techniques, and more specifically sentiment mining techniques, can help researchers and decision makers to track important trends in their markets. Secondly, argumentation discovery techniques can facilitate insight into the reasoning utilized in economic discourse. Hence, we have proposed an information extraction framework that combines insights from these disparate fields by linking argumentation structures in economic discourse to the associated sentiment, which could offer researchers and decision makers a new perspective on the origins of economic sentiment. As future work, we plan to further elaborate on this framework and to investigate principal ways of combining argumentation structures with sentiment analysis and subsequently representing economic sentiment in insightful ways. Special attention will be paid to the level of analysis; different types of text may require different levels of granularity due to their distinct characteristics with respect to, e.g., structure or content. Furthermore, we plan to implement the proposed pipeline and to perform analyses to assess the quality of its outputs on corpora of, e.g., news articles, scientific papers, or blogs, the sentiment of which is to be annotated by human experts in order to obtain a gold standard.
References 1. Adams, F.G., Green, E.W.: Explaining and Predicting Aggregate Consumer Attitudes. International Economic Review 6, 275–293 (1965) 2. Arnold, I.J.M., Vrugt, E.B.: Fundamental Uncertainty and Stock Market Volatility. Applied Financial Economics 18, 1425–1440 (2008) 3. Bovi, M.: Economic versus Psychological Forecasting. Evidence from Consumer Confidence Surveys. Journal of Economic Psychology 30, 563–574 (2009) 4. Buckingham Shum, S.J., Uren, V., Li, G., Domingue, J., Motta, E.: Visualizing Argumentation: Software Tools for Collaborative and Educational Sense-Making. In: Visualizing Internetworked Argumentation, pp. 185–204. Springer, Heidelberg (2002) 5. Cosgel, M.M.: Rhetoric in the Economy: Consumption and Audience. Journal of Socio-Economics 21, 363–377 (1992) 6. Cunningham, H.: GATE, a General Architecture for Text Engineering. Computers and the Humanities 36, 223–254 (2002) 7. Daelemans, W., van den Bosch, A.: Memory-Based Language Processing. Cambridge University Press, Cambridge (2005) 8. Devitt, A., Ahmad, K.: Sentiment Analysis in Financial News: A Cohesion-Based Approach. In: 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pp. 984–991 (2007) 9. Farrell, J.: Talk is Cheap. The American Economic Review 85, 186–190 (1995) 10. Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, Cambridge (2006) 11. Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39, 169–202 (2000)
12. Hartelius, J.E., Browning, L.D.: The Application of Rhetorical Theory in Managerial Research: A Literature Review. Management Communication Quarterly 22, 13–39 (2008) 13. Herrera, J., Penas, A., Verdejo, F.: Techniques for Recognizing Textual Entailment and Semantic Equivalence. In: Marín, R., Onaindía, E., Bugarín, A., Santos, J. (eds.) CAEPIA 2005. LNCS (LNAI), vol. 4177, pp. 419–428. Springer, Heidelberg (2006) 14. Holton, C.: Identifying Disgruntled Employee Systems Fraud Risk Through Text Mining: A Simple Solution for a Multi-Billion Dollar Problem. Decision Support Systems 46, 853–858 (2009) 15. Howrey, E.P.: The Predictive Power of the Index of Consumer Sentiment. Brookings Papers on Economic Activity 32, 176–216 (2001) 16. Hu, M., Sun, A., Lim, E.P.: Comments-Oriented Blog Summarization by Sentence Extraction. In: 16th ACM SIGIR Conference on Information and Knowledge Management (CIKM 2007), pp. 901–904 (2007) 17. Katona, G.: Psychological Economics. Elsevier, Amsterdam (1975) 18. Ludvigson, S.C.: Consumer Confidence and Consumer Spending. The Journal of Economic Perspectives 18, 29–50 (2004) 19. Mao, Y., Lebanon, G.: Sequential Models for Sentiment Prediction. In: ICML Workshop on Learning in Structured Output Spaces (2006) 20. Marcu, D.: The Rhetorical Parsing of Unrestricted Texts: A Surface-Based Approach. Computational Linguistics 26, 395–448 (2000) 21. McCloskey, D., Klamer, A.: One Quarter of GDP is Persuasion. American Economic Review 85, 191–195 (1995) 22. Mochales Palau, R., Moens, M.F.: Argumentation Mining: The Detection, Classification and Structure of Arguments in Text. In: 12th International Conference on Artificial Intelligence and Law (ICAIL 2009), pp. 98–107 (2009) 23. van Oest, R., Franses, P.H.: Measuring Changes in Consumer Confidence. Journal of Economic Psychology 29, 255–275 (2008) 24. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs Up? Sentiment Classification using Machine Learning Techniques. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 79–86 (2002) 25. Shanahan, J.G., Qu, Y., Wiebe, J.M.: Computing Attitude and Affect in Text: Theory and Applications. Springer, Heidelberg (2006) 26. Shiller, R.J.: Conversation, Information, and Herd Behaviour. American Economic Review 85, 181–185 (1995) 27. Taboada, M., Mann, W.C.: Rhetorical Structure Theory: Looking Back and Moving Ahead. Discourse Studies 8, 423–459 (2006) 28. Teufel, S.: Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. thesis, University of Edinburgh (1999) 29. Turmo, J., Ageno, A., Catala, N.: Adaptive Information Extraction. ACM Computing Surveys 38(2) (2006) 30. Tversky, A., Kahneman, D.: Judgment under Uncertainty: Heuristics and Biases. Science 185, 1124–1131 (1974) 31. Vargas-Vera, M., Moreale, E.: Automated Extraction of Knowledge from Student Essays. International Journal of Knowledge and Learning 1, 318–331 (2005) 32. Vuchelen, J.: Consumer Sentiment and Macroeconomic Forecasts. Journal of Economic Psychology 25, 493–506 (2004) 33. Webber, B., Stone, M., Joshi, A., Knott, A.: Anaphora and Discourse Structure. Computational Linguistics 29, 545–587 (2003)
Third Workshop on Domain Engineering (DE@ER 2010)
Preface
Domain Engineering is relevant to various fields in software and systems development, such as conceptual modeling, software product line engineering, domain-specific languages engineering, and so on. It deals with identifying, modeling, constructing, cataloging, and disseminating artifacts that represent the commonalities and differences within a domain, as well as with providing mechanisms, techniques, and tools to reuse and validate these artifacts in the development of particular systems. The aims of most up-and-coming methods and techniques in the area of domain engineering are to help reduce time-to-market, development cost, and project risks on the one hand, and to help improve systems quality and performance on a consistent basis on the other. As an interdisciplinary area, domain engineering deals with various topics such as conceptual foundations, semantics of domains, development and management of domain assets, lifecycle support, variability management, and consistency validation. The purpose of this series of workshops is to bring together researchers and practitioners in the area of domain engineering in order to identify possible points of synergy, common problems and solutions, and visions for the future of the area. In the workshop, three papers were presented, dealing with evaluation of Domain-Specific Modelling solutions, representation of business domain knowledge and development artifacts, and specification of data properties. In addition, a panel discussing guidelines for designing Domain-Specific Modeling Languages (DSML) was led by Prof. Ulrich Frank, from the University of Duisburg-Essen, Germany. In that panel, the challenges related to the boundaries between a DSML and its models, as well as to the question of how specific a DSML should be, were discussed. Furthermore, analyzing the ontological, linguistic and epistemological aspects, we listed a set of criteria to guide the design and evaluation of DSMLs from various perspectives. We wish to thank the program committee, the papers' authors, the workshop participants, the panelists, and the local organizers for contributing to the success of this workshop.
July 2010
Iris Reinhartz-Berger Arnon Sturm Jorn Bettin Tony Clark Sholom Cohen
Evaluating Domain-Specific Modelling Solutions Parastoo Mohagheghi and Øystein Haugen SINTEF, Forskningsveien 1, Oslo, Norway {parastoo.mohagheghi,oystein.haugen}@sintef.no
Abstract. This paper presents criteria and evaluation methods for evaluating domain-specific modelling (DSM) solutions, based on an analysis of the state of the art and on experiences of developing and evaluating DSM solutions in research projects. The state-of-the-art analysis returned several requirements regarding the quality of domain-specific modelling languages and the tools developed based on them; these requirements are classified according to the identified stakeholders. The stakeholders are those who develop and those who use a DSM solution, the intended domain and the purposes of developing a DSM solution as defined by domain experts, software engineering concerns, integration with other languages or tools, and the quality of artefacts to be modelled or generated. Both quantitative and qualitative approaches may be applied for evaluating DSM solutions based on the development stage and requirements. There is a clear need for a process that supports evaluating the quality of DSM solutions and this research contributes to the definition of such a process. Keywords: domain-specific language, modelling, assessment, quality, case study.
1 Introduction
General-purpose modelling languages like UML are already widely used in industry, but the experience of many cases shows that learning and adapting them to specific contexts is difficult, such that they are not always the best fit for solving special problems. This is the reason why domain-specific modelling is receiving attention from industry, also because the domain-specific modelling environments are getting more powerful and mature. A Domain-Specific Language (DSL) is typically a small, highly focused language used to model and solve some clearly identifiable problems in a domain, in contrast to a General-Purpose Language (GPL), which is supposed to be useful for multiple domains. DSLs may operate stand-alone, be called at run-time from other programs or be embedded into other applications to do specific tasks. DSLs may be designed from scratch or by extending a base language (e.g., defining profiles in UML). Mernik et al. discuss different approaches to the development of DSLs and their advantages and disadvantages and also write that DSL development is hard, requiring both domain knowledge and language development expertise [9]. Besides, it is often far from evident that a DSL might be useful or that developing one might be worthwhile. Several domain-specific modelling languages (DSML) and editors and transformations for modelling, generation and other purposes such as simulation
(generally referred to as DSM solutions) have been developed in the context of four industrial partners involved in the European IST project MODELPLEX (MODelling solutions for comPLEX software systems, 2006-2010; http://www.modelplex.org/). MODELPLEX aimed at applying Model-Driven Engineering (MDE) techniques on scenarios of complex software systems. The industrial domains here were enterprise business applications, telecommunication, aerospace crisis management systems and data-intensive geological systems. Examples of DSM solutions developed in MODELPLEX are a network modelling tool and DSMLs for security and performance engineering. Also a DSM solution for specifying signalling at railway stations and generating source code has been developed in the ITEA-MoSiS project (Model-driven development of highly configurable embedded Software-intensive Systems, 2007-2010; http://itea-mosis.org/modules/wikimod/index.php?page=WikiHome). All of these DSM solutions are meant to be used by domain experts and thus should be understandable by these experts. Some questions that arose regarding the quality of these DSM solutions were: • Is the DSM solution easily usable by the intended domain experts? • Does the DSM solution provide appropriate built-in abstractions and notations for building applications in the specific domain? • Does the DSM solution serve the purpose of the development, such as generating relevant artefacts? • Is the DSM solution maintainable and evolvable when the domain evolves? • Is the DSML small enough, leaving out language features that do not contribute to the purpose of the language? In order to apply a systematic approach for evaluating DSM solutions, we performed a state-of-the-art analysis on evaluating languages used in software development in general and DSLs in particular. The analysis identified several characteristics of DSLs and DSMLs that proved relevant for our work. We also detected a few examples of evaluation. This paper summarizes the results of this analysis, discusses experiences of evaluating DSM solutions in the research projects MODELPLEX and MoSiS, and proposes directions for future work. The remainder of this paper is organized as follows. Section 2 presents the identified evaluation criteria and a classification of them, while Section 3 focuses on evaluation methods. Section 4 is the discussion of two case studies. Finally, the paper is concluded in Section 5 and future work is discussed.
2 Criteria for Evaluating Domain-Specific Modelling Languages
Related work can be discussed in several dimensions: evaluating languages in general, evaluating modelling languages, and evaluating domain-specific languages. We focus on the last two while some general characteristics of languages relevant for our discussion are also included. Howatt proposes four classes of criteria for evaluating languages [4]: • Language Design and Implementation Criteria: Is the language formally defined? Can a fast, compact compiler be written to generate efficient, compact code?
• Human Factors Criteria: These criteria are used to assess the human interface or the user-friendliness of a language. • Software Engineering Criteria: These assess those aspects of a language that enhance the engineering of good software; for example supporting portability, reliability and maintainability of the software. • Application Domain Criteria: These criteria assess how well a language supports programming for specific applications. Kennedy et al. add two other criteria to this list [7]: the time and effort required to write, debug, and tune the code, and the performance of the code that results. Lindland et al. describe their framework for evaluating conceptual models in [8]. Conceptual models are models developed in early phases of development. The framework defines three quality goals for models: • Syntactic quality is how well the model corresponds to the language, • Semantic quality is how well the model corresponds to the domain, • Pragmatic quality is how well the model corresponds to its audience interpretation. Lindland et al.'s framework distinguishes between quality goals and means to achieve the goals. For example, having a formal syntax helps to achieve syntactic quality. Grossman et al. [2] use the following criteria, together with those identified in [1], for evaluating UML. The criteria are, however, mostly relevant for DSLs as well: • Having the right data, i.e., the necessary constructs and their semantics. Completeness, i.e., capturing all concepts, is added in [15]. • Accuracy of concepts to present the developed system and to help in designing it. • Flexibility to model different systems and ease of change. • Understandability, i.e., the ease of reading and conveying the meaning of the underlying system. • Level of detail and needed training. Paige et al. have also identified some principles in the design of modelling languages that may be used as criteria for evaluating DSMLs [11]. Examples are: • Simplicity: no unnecessary complexity, including being small and memorable. • Uniqueness or orthogonality: no redundant or overlapping features. • Consistency: language features cooperate to meet language design goals. • Seamlessness: mapping concepts in the problem space to implementations in the solution space, and the same abstractions can be used throughout development. • Space economy: concise models are produced. An analysis of the identified characteristics shows that these are defined from multiple viewpoints by different stakeholders. Based on the covered literature, we have identified the stakeholders interested in a DSM solution and classified the identified criteria according to their interests as depicted in Fig. 1.
Fig. 1. Evaluating a DSM solution by different stakeholders
The stakeholders are defined below and examples of criteria of interest for them are discussed: • Tool Developers (TD) are those developing the DSML and related tools. Examples of relevant criteria for them are those identified by Howatt as language design and implementation criteria and application domain criteria [4]. • End-Users (EU) are those using the DSM solution for modelling or generating artefacts. Usability and ease of learning are examples of criteria relevant for them. The link between TD and EU suggests that some support provided by tool developers, such as a useful library, a debugger and an intuitive User Interface (UI), helps improve end-users' experience with the DSM solution. • Domain Experts (DE) represent the domain of interest and the purpose of a DSM solution. In general, a DSML should include appropriate domain concepts and abstractions [9] and be complete and accurate. A DSM solution may be developed for multiple purposes, such as programming directly in the terms used by domain experts and thus reducing the gap between domain experts and software developers, automating software development, or improving the quality of the code. The evaluation should therefore focus on the purpose of a DSM solution. • Software Engineers (SE) are interested in the characteristics of the DSM solution that lead to developing good software. Examples of their concerns are reuse of models and evolvability of the DSM solution. Applying some software engineering practices also improves the quality of models and generated artefacts. • Quality experts (QE) are interested in the quality of models or artifacts generated from models. These may have requirements regarding completeness and
performance of the generated code, and even understandability of models and generated artefacts for maintenance. • Other languages / tools (O) cover requirements for interoperability with other tools or languages, mappings between languages or tools, building extensions, and compliance to standards if required. There are several approaches for developing a DSML (such as developing from scratch or extending an existing language) and the O-characteristics should be considered when selecting the approach. The model depicted in Fig. 1 allows classifying the identified criteria in a meaningful way and is applied when selecting evaluation criteria in the case studies discussed in Section 4. The identified evaluation methods are discussed in the next section.
3 Evaluation Methods
To perform the evaluation of a DSM solution, one may take advantage of quantitative or qualitative approaches. For quantitative evaluation, some identified metrics are: • Time and effort required to model, debug, and generate artefacts (from [7]). We may also add time and effort to understand models. One may compare time and effort when using a DSM solution with time and effort without using a DSM solution in a controlled experiment, as done in [5]. • Performance of the code that results from models (from [7]). • Collecting metrics from models, such as the number of model elements. Model metrics are discussed in [10]. A large number of metrics can be defined on models, while identifying useful model metrics is a challenge. • Usability metrics are discussed in [14]. Seffah et al. define usability as “whether a software product enables a particular set of users to achieve specific goals in a specific context of use”, which covers efficiency, productivity, satisfaction, learnability, safety and usefulness for solving problems. Some proposed metrics are time to learn or perform tasks, user steps to perform a task, and layout appropriateness. • Number of concepts and the relations between these concepts in the DSML [13]. This metric is defined on the metamodel of languages and assumes that languages with more concepts and relations are more complex, such as UML. Since DSMLs are usually small languages, this count will probably not return interesting information. • Evaluating the metamodel's understandability by performing controlled experiments, as discussed in [12]. Both syntactic understanding, which refers to the constructs of the metamodels and relationships (for example, how many attributes describe an employee), and semantic understanding, which assesses the understanding of contents (for example, whether every employee has a unique employee number), are of interest to assess. • Performing a survey among users can generate quantitative data. Qualitative approaches cover case studies (including comparative ones that compare using a DSM solution with other approaches), analysis of a language and the DSM solution by experts for various characteristics, and monitoring or interviewing users.
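As a small illustration of the model-metrics option above, the following sketch counts concepts and the relations between them in a toy metamodel, the kind of size measure referred to in [10] and [13]. The metamodel representation and the concept names are assumptions made for this example only; real DSML tooling would expose this information through its own metamodel API.

```python
# Illustrative only: a toy metamodel as a mapping from each concept to the
# set of concepts it references.
metamodel = {
    "Station":    {"Track", "Signal"},
    "Track":      {"Signal", "TrainRoute"},
    "Signal":     set(),
    "TrainRoute": {"Track"},
}

num_concepts = len(metamodel)
num_relations = sum(len(targets) for targets in metamodel.values())

print(f"concepts: {num_concepts}")
print(f"relations: {num_relations}")
print(f"relations per concept: {num_relations / num_concepts:.2f}")
```

For a small DSML such counts stay low, which is consistent with the remark above that this metric alone will probably not return much interesting information.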
A DSM solution may be evaluated both quantitatively and qualitatively. The important issue is to decide which approach is best in which phase of the development lifecycle. The ISO 9126 standard divides metrics into internal (design-time), external, and quality-in-use metrics, which indicates that properties should be measured in different stages and that some design-time measures can be used as predictors of run-time characteristics. Seffah et al. discuss predictive and testing metrics, where predictive metrics may provide an estimate of system usability [14]. Testing metrics are collected when a software product is in use. Kelly and Tolvanen recommend an incremental and test-driven approach for developing DSLs [6]. For a DSM solution, there is often a prototyping phase and a usage phase. In the prototyping phase, evaluation is often done by language experts and pilot users who try the language on small cases. We developed a set of questions for this phase based on the requirements of case studies that is presented in the next section. The evaluation in the prototyping stage is often qualitative. In the usage stage, more users are involved, which allows running experiments or collecting the opinions of users in a survey.
4 Case Studies
4.1 Evaluating the Network Modelling Tool
The first case discussed here covers developing a network modelling tool in Telefónica using Eclipse GMF. The experiences are discussed in detail in [3]. The key driver for this DSM solution is the recognition that it is becoming increasingly difficult to manage the complexity and size of modern telecom networks. By Telefónica's requirement, the Network DSML had to include specific elements required for modelling and also allow modelling at different levels of abstraction, at least showing the internals of devices, how devices connect to each other, and higher-level interactions and roles of whole sub-networks in the deployment of a service. From these models, a wide range of artefacts could be generated, such as device configuration specifications. Rather than developing a metamodel from scratch, a metamodel based on the Common Information Model (CIM, http://www.dmtf.org/standards/cim) was used in this development. CIM was relevant as it is the underlying model in many products dealing with management and instrumentation of network equipment. Finally, there were a number of generic features which were required in order to meet the needs of end-users of the tool. These included: a) a visual, user-friendly interface; b) scalability, enabling thousands of model elements to be managed; c) interoperability with other tools and standards; d) flexibility, enabling the rapid adaptation of the tool to support new abstractions (preferably done by the engineers themselves); and e) support for model validation and checking. The evaluation of the DSM solution was performed by answering a set of questions defined by a team of researchers and domain experts based on the requirements. The feedback by a team of pilot users is based on using the tool for modelling and the generator to produce the required artefacts in some example scenarios. The set of questions from various viewpoints and the results of the evaluation are summarized in Table 1.
Table 1. Evaluating the network modelling solution

EU: Is the DSML tool easy to use? Is the UI acceptable?
    Not enough, largely due to the sheer size of the metamodel, which resulted in having to add a large number of connection and node tools.
EU: Do you intend to use the DSM solution in future projects and invest in making a more usable version?
    We would like to use it, but there are several barriers: the DSML should be smaller and more focused, other tools than GMF for developing it should be evaluated, and the DSM solution should be used in a series of projects to investigate Return-On-Investment.
EU: Does the DSM solution affect the performance of users?
    Yes, the DSM solution has the potential to improve productivity and quality, but additional work and training are needed to achieve those objectives.
EU: Do we think that using the DSM solution improves our reputation and image as innovative?
    Yes, the image and reputation of innovation can be greatly improved by the use of tools and approaches such as the one presented herein.
SE: Is the DSM solution scalable?
    GMF does not scale well because of some shortcomings in the implementation.
SE: Is the DSM solution flexible?
    The same applies to flexibility. A more dynamic, metamodel-driven tool generation approach is needed.
SE: Does the DSM solution provide reuse possibilities?
    Modelling at different abstraction levels is applied to increase reusability of elements.
DE: Is the CIM metamodel suitable for modelling network management in Telefónica?
    Yes, it is suitable for this purpose but needs constant revision and extension to keep up with the evolution of the domain and the standard of reference (CIM).
O: Is the DSML compatible with standards?
    Yes, using CIM provides such compatibility but brings problems due to its size.
O: Is the DSML compatible with other tools?
    Many tools used in the network management domain are based on CIM, but as the DSML transforms the CIM metamodel into EMF, this leads to compatibility issues with CIM-based off-the-shelf products that need to be resolved.
One of the most challenging aspects of this DSM solution was the large number of modelling abstractions and relationships in the CIM model. Another challenge was that of making the tool as usable as possible, which involved changing the tooling definition. We experienced that developing a DSML in an environment such as Eclipse required high language and tool expertise, which puts developing DSM solutions out of reach of domain experts with some IT expertise, and the resulting DSM solution is not changeable or flexible enough. Changes to the metamodel, which happen frequently in the domain, required considerable effort to update the tool, and the developed models became corrupted due to these changes.
4.2 Evaluating the Train Control Language
The Train Control Language (TCL) is a DSML for specifying the signalling at railway stations and generating interlocking source code that is used in allocating routes to trains. Using TCL has several benefits compared with the current development process. In the current workflow, errors in the various steps are possible due to the manual procedure. Thus validation of each step is required to ensure the safety of the system. Using TCL, most of these steps are automated. By assuring that TCL and the generators are correctly implemented, consistency between the representations can be guaranteed. Therefore some of the validation steps can be eliminated. Several constraints are defined to assure that stations are correctly created, and the editor makes sure that every necessary condition is taken into consideration. If the constraints are properly defined, the TCL tool may guarantee completeness by requiring all necessary elements. Other benefits are implementing a target environment that includes generators such as code generators and analysis tools that prevent or detect inconsistencies or errors in models. Together these benefits lead to significant productivity improvements. The first step in evaluating TCL has been identifying stakeholders and their reasons for developing a DSM solution. We identified the stakeholders to be: a) tool developers who have developed the metamodel and supporting tools; b) signalling engineers who are the end-users that will model the stations; c) station deployers that will generate required source code; d) testers who will generate test cases from the models; and e) railway authorities who are the standardization organs and national authorities that define safety requirements. The second step in the evaluation has been identifying the quality requirements of these stakeholders. Finally, we have also identified an evaluation method for each requirement, and how a requirement can be achieved by "means" that should be applied. Examples of quality requirements, means and evaluation methods are depicted in Table 2. The actual evaluation of the language as identified by the evaluation methods remains to be performed. At this stage, the evaluation work has helped the involved stakeholders to clarify and communicate their intentions with the DSM solution, the implemented features of the solution (defined as means), and the relation of requirements to features.

Table 2. Examples of requirements for evaluating TCL, means and evaluation methods

EU: Requirement: TCL models should be similar to existing diagrams.
    Means: Walking through existing examples together with Station deployers.
    Evaluation Method: Performing visual comparison of models developed with the first version of TCL with existing diagrams.
EU: Requirement: Small stations should be covered completely.
    Means: Small stations are identified and their models are reviewed.
    Evaluation Method: Models developed with the first version of TCL are compared with existing diagrams.
EU: Requirement: TCL and tools should prevent specifying unsafe models.
    Means: Adding well-formedness rules to the language. Also adding constraints to the TCL specifications.
    Evaluation Method: Validate the constraints by inspections and running test cases.
O: Requirement: Models are compliant with safety standards.
    Means: Add constraints to the TCL specifications. Also, integrate the safety standards in the necessary steps in the development process.
    Evaluation Method: Validate the constraints and inspect the development process.
5 Conclusions and Future Work
The quality of domain-specific languages (DSLs) and modelling solutions has been the subject of some research by now. Based on a state-of-the-art analysis and experiences with developing domain-specific modelling (DSM) solutions in research projects, we have identified several evaluation criteria. These are currently classified according to the stakeholders interested in them. We have also identified evaluation methods and examples of evaluation. All these are included in a framework for evaluating DSM solutions which is under development and should include examples of best practices or means as well. Based on the experiences so far, we can summarize that some characteristics are especially important for DSLs. An important criterion is domain-appropriateness. A DSL must be powerful enough to capture the major domain concepts and should match the mental representation of the domain. DSM solutions are typically used for prediction or simulation, as well as code generation, test generation and execution. Thus the language should be formal and accurate. Any DSL with a diagrammatical syntax should have proper layout, and there is often a need for integrating DSLs with other ones. Performing a systematic review of published literature for identifying all related research will contribute to this work. We have also performed several case studies on evaluating DSM solutions in the early phase of development using a questionnaire. We presented two cases of evaluation in this paper. Relating evaluation criteria to evaluation methods is also a subject for future work. When discussing DSM solutions, it is of key importance to focus on the needs of an often narrow application domain and the actual purposes of the DSM solution. The development of a DSM solution is iterative and so is the assessment. Having the requirements in mind, there is a clear need for a process that supports defining and evaluating the quality of domain-specific solutions. We have identified some steps of this process as identifying stakeholders and requirements of the DSM solutions, identifying means to achieve the requirements, and identifying evaluation methods. We will continue work on this process in future work. Acknowledgments. This work has been supported by the MODELPLEX project (IST-FP6-2006 Contract No. 34081) and the MoSiS project ITEA 2 – ip06035.
References 1. Goodhue, D.L.: Development and Measurement Validity of a Task Technology Fit Instrument for User Evaluations of Information Systems. Decision Sciences 29(1), 105– 138 (1998) 2. Grossman, M., Aronson, J.E., McCarthy, R.V.: Does UML Make the Grade? Insights from the Software Development Community. Information and Software Technology 47, 383– 397 (2005) 3. Evans, A., Fernández, M.A., Mohagheghi, P.: Experiences of Developing a Network Modelling Tool Using the Eclipse Environment. In: Paige, R.F., Hartman, A., Rensink, A. (eds.) ECMDA-FA 2009. LNCS, vol. 5562, pp. 301–312. Springer, Heidelberg (2009) 4. Howatt, J.: A Project-Based Approach to Programming Language Evolution (2001), http://academic.luther.edu/~howaja01/v/lang.pdf (visited in August 2007) 5. Kärnä, J., Tolvanen, J.P., Kelly, S.: Evaluating the Use of Domain-Specific Modeling in Practice. In: 9th OOPSLA Workshop on Domain-Specific Modeling (2009) 6. Kelly, S., Tolvanen, J.-P.: Domain-Specific Modeling- Enabling Full Code Generation. IEEE Computer Society Publications, Los Alamitos (2008) 7. Kennedy, K., Koelbel, C., Schreiber, R.: Defining and Measuring the Productivity of Programming Languages. International Journal of High Performance Computing Applications 18(4), 441–448 (2004) 8. Lindland, O.I., Sindre, G., Sølvberg, A.: Understanding Quality in Conceptual Modelling. IEEE Software 11(2), 42–49 (1994) 9. Mernik, M., Heering, J., Sloane, A.M.: When and How to Develop Domain-Specific Languages. ACM Computing Surveys 37(4), 316–344 (2005) 10. Mohagheghi, P., Dehlen, V.: Existing Model Metrics and Relations to Model Quality. In: 2009 ICSE Workshop on Software Quality (WoSQ 2009), pp. 39–45. IEEE CS, Los Alamitos (2009) 11. Paige, R.F., Ostroff, J.S., Brooke, P.J.: Principles for Modeling Language Design. Information and Software Technology 42, 665–675 (2000) 12. Patig, S.: Preparing Meta-Analysis of Metamodel Understandability. In: Workshop on Empirical Studies of Model-Driven Engineering (ESMDE 2008), pp. 11–20 (2008) 13. Rossi, M., Brinkkemper, S.: Complexity Metrics for System Development Methods and Techniques. Information Systems 21(2), 209–227 (1996) 14. Seffah, A., Donyaee, M., Kline, R.B., Padda, H.K.: Usability Measurement and Metrics: a Consolidated Model. Software Quality Journal 14, 159–178 (2006) 15. Teeuw, W.B., van den Berg, H.: On the Quality of Conceptual Models (1997), http://osm7.cs.byu.edu/ER97/workshop4/tvdb.html
Towards a Reusable Unified Basis for Representing Business Domain Knowledge and Development Artifacts in Systems Engineering Thomas Kofler and Daniel Ratiu Institut für Informatik, Technische Universität München Boltzmannstr. 3, 85748 Garching b. München, Germany {koflert,ratiu}@in.tum.de
Abstract. During the systems engineering process many heterogeneous artifacts which belong to different engineering disciplines and describe different views upon the product to be developed are produced. In order to integrate these artifacts, to increase the level of reuse, or to evaluate and implement changes, we need semantically rich relations between development artifacts and the parts of the product they describe. In this paper, we present our research to extend the IEEE Standard Upper Ontology with a mid-level layer which can be used as a reusable semantic basis for making the knowledge about the developed product, the development artifacts, and the relations between them explicit.
1 Semantic Gap in Engineering of Complex Systems
During the development process of complex industrial systems many development artifacts are produced. They belong to different engineering disciplines and address different views over the product to be developed. Due to their heterogeneity, the result of the integration of these views is visible only in the developed product. Each discipline is using its special tools, notations and techniques and this leads to a semantic gap in the development process [1]: the same or highly related business domain concepts are reflected in different artifacts in a completely different manner and are interleaved with details specific to a particular development discipline (e. g. mechanical, electrical, or software engineering). Due to the high number of artifacts and their heterogeneity, virtually nobody from an organization knows in detail all produced artifacts and their relation to the product under development. This situation is amplified in large projects that are developed over many years, by delocalized teams, or that experience personnel turnover. Due to this lack of knowledge, typical engineering tasks like implementing changes or reusing parts of already existing systems are extremely difficult and expensive. Furthermore, the know-how gathered inside the company and the knowledge about already built assets cannot be transferred between different projects in a systematic and disciplined manner.
Fig. 1. Domain knowledge is a common denominator between development artifacts
We advocate that the knowledge about the product (business domain) is the greatest common denominator of the development artifacts and the most important means to close the semantic gap that occurs in the system engineering process. In Figure 2, we present an intuitive view of our approach that is based on a unified model that contains knowledge about the development artifacts, the product to be developed, and the dependency relations between them. In order to build such a model, we need a conceptual framework for dependency management which contains the following ingredients: 1) a vocabulary to describe different kinds of development artifacts, 2) a generic vocabulary to describe the knowledge about the product to be developed, and 3) a vocabulary to capture the dependencies between product concepts and development artifacts. In this paper, we present an approach to build a dependency management framework by extending the IEEE Standard Upper Ontology [2] with a mid-level layer.
Fig. 2. Manage dependencies by linking a domain model to development artifacts
Outline: In Section 2, we present our framework for managing the logical dependencies between development artifacts by using the domain knowledge. In Section 3, we present an example for using our framework. In Section 4, we discuss some variation points and limitations of our approach. In Section 5 we present the work related to our approach, and we conclude the paper and present our plans for future work in Section 6.
2 Towards a Framework for Dependency Management
In this section, we present our first steps to define the dependency management framework. The term TBox denotes SUO and our mid-level ontology; the term ABox denotes the concrete domain and artifact model and the relations between them. 1) Use IEEE Standard Upper Ontology (SUO) as semantic basis: As semantic basis for our framework, we use the IEEE Standard Upper Ontology (SUO) [2] (http://suo.ieee.org/). For our purpose, SUO represents a high-level categorization of the general concepts that can be used as a starting point for organizing the knowledge about the domain of the product and the development artifacts. SUO is too general for our purpose as it is targeted to describe all possible concepts (including feelings), many of them being irrelevant in our use-case (i. e. management of logical dependencies in systems engineering). Therefore we use only a part of SUO. Figure 3 illustrates on the left side the SUO concepts that we use and on the right side examples of important relations between these concepts. SUO exhaustively partitions the conceptual world into two disjoint categories: “physicals” and “abstracts”. “Physical” is further divided into “objects” and “processes”. The concept “object” is of special interest for us, since both the end product developed during engineering and the development artifacts are some kinds of objects. The world of “abstracts” is divided by SUO into different categories like “attributes” that represent qualities of concepts, “quantities”, and “relations”. SUO defines different kinds of meronymy (part of) relations as shown in the table on the right – e. g. the “component” relation is a subrelation of “part” that is restricted between two “CorpuscularObjects”. SUO defines other relations like different kinds of attributes that concepts can have. In addition to concepts and relations, SUO also defines a set of axioms (e. g. the fact that “part” is inter alia a reflexive relation). In order to provide an adequate vocabulary for dependency management, we need to extend SUO with concepts needed for the description of development artifacts, the product under development, and dependencies between the product concepts and these artifacts. These extensions are presented in the following paragraphs in more detail.
Fig. 3. A fragment of the IEEE Standard Upper Ontology
Fig. 4. An extension of SUO with knowledge about system engineering
2) Product knowledge ontology. In Figure 4, we present an overview of the extension of SUO to represent the knowledge about the developed product. One of the basic decisions in systems engineering is to establish system boundaries – i. e. what belongs to the system and what is outside. That's why we define the concepts “system” and “environment”. A special kind of object upon which a system acts and which belongs to the environment is called a “patient”. Systems can be decomposed into several parts, each part being itself a system at a lower granularity level. Systems perform different “industrial processes” that describe the behavior of the system. Both “systems” and “environment” have attributes. An attribute can depend on both system and environment attributes. Furthermore, an industrial process can change the attributes of the environment. An “industrial process” has as input a “patient” and as output another “patient”. The time dependency between “industrial processes” is captured through the “follows” relation. The above concepts were distilled based on our experience with modeling an example system in the field of industrial automation (more details are given in the next section). This small set of concepts and relations is part of a core vocabulary that allows us to describe a product that is the result of automation systems engineering. We are aware that we need to extend our vocabulary with additional concepts, in order to describe other system engineering products (e. g. cars). 3) Development artifact ontology. Development artifacts (presented in Figure 5) are physical objects. In practice, the artifacts are digital documents (or parts thereof) produced during the development process. Examples of artifacts are requirements, source code, or CAD documents. Some documents have a rich structure, while others are basically natural language texts. The granularity level at which the development artifacts are represented can be flexibly chosen. For example, we can increase the detail level at which “UMLArtifacts” are captured by considering different diagrams or constructs thereof (e. g. use-cases, actors, classes, methods, components). The “needs” relation is defined between two “development artifacts” and captures the fact that some development artifacts are used as inputs for producing subsequent artifacts in the engineering process.
226
T. Kofler and D. Ratiu
'$ $
! %& !
"# $ !
Fig. 5. An extension of SUO with knowledge about development artifacts
4) Linking product knowledge and development artifacts. Until now, we defined two disjoint conceptual spaces: the space of the development artifacts and the one of the developed product. In order to capture logical dependencies, we need to provide a way to link these two worlds. On the right hand side of Figure 5, we present the two relations between development artifacts and product: – refers means only that a certain concept is explicitly referenced in a development artifact (similar to traceability); – defines means that an important design decision about a system is contained in a development artifact. Whenever a development artifact defines something, it also refers that thing.
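As a concrete illustration of the vocabulary sketched in this section, the following minimal rdflib-based sketch encodes the mid-level concepts and relations as an RDFS vocabulary. The namespaces, URIs, and identifier spellings are assumptions made for this example; the authors' actual formalization on top of SUO is not reproduced here.

```python
from rdflib import Graph, Namespace, RDF, RDFS

SUO = Namespace("http://example.org/suo#")  # stand-in for the SUO concepts used
MID = Namespace("http://example.org/mid#")  # the mid-level extension sketched here

g = Graph()

# Concepts: product knowledge and development artifacts as extensions of SUO.
for cls, parent in [
    (MID.System, SUO.Object),
    (MID.Environment, SUO.Object),
    (MID.Patient, SUO.Object),
    (MID.IndustrialProcess, SUO.Process),
    (MID.Attribute, SUO.Attribute),
    (MID.DevelopmentArtifact, SUO.Object),
    (MID.CADArtifact, MID.DevelopmentArtifact),
    (MID.SoftwareArtifact, MID.DevelopmentArtifact),
    (MID.UMLArtifact, MID.DevelopmentArtifact),
]:
    g.add((cls, RDF.type, RDFS.Class))
    g.add((cls, RDFS.subClassOf, parent))

def relation(p, domain, range_):
    """Declare a property together with its domain and range."""
    g.add((p, RDF.type, RDF.Property))
    g.add((p, RDFS.domain, domain))
    g.add((p, RDFS.range, range_))

relation(MID.systemPart, MID.System, MID.System)              # system decomposition
relation(MID.attribute, SUO.Object, MID.Attribute)            # systems/environment have attributes
relation(MID.dependsOn, MID.Attribute, MID.Attribute)
relation(MID.follows, MID.IndustrialProcess, MID.IndustrialProcess)
relation(MID.needs, MID.DevelopmentArtifact, MID.DevelopmentArtifact)
relation(MID.refers, MID.DevelopmentArtifact, SUO.Object)     # traceability-style link
relation(MID.defines, MID.DevelopmentArtifact, SUO.Object)    # design decision is contained
g.add((MID.defines, RDFS.subPropertyOf, MID.refers))          # defines implies refers

print(len(g), "triples in the mid-level TBox sketch")
```

Declaring defines as a subproperty of refers mirrors the statement above that whenever a development artifact defines something, it also refers to it.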
3 Example
Our first experiments that led to the definition of the dependency management layer were done in the context of systems engineering of rolling mills. More specifically, we modeled the development of edgers. An edger is a part of a rolling mill that is responsible for compressing the edges of rolled materials. A rolled material has physical properties such as width, height, and temperature. The edger has to take some of these properties into account in order to process the rolled material – e. g. the optimal throughput speed of the engine of the edger depends on the temperature of the rolled material. The engine is part of the edger; so is the hydraulic press, which compresses the rolled material. The width of the rolled material depends on the width of the closed hydraulic press. The edger is specified by a CAD document. Another CAD document specifies the hydraulic press and only refers the edger. A C program that controls the edger refers the temperature of the rolled material and the throughput speed of the engine. This setting is illustrated in Figure 6. Each individual of the domain model is an instance of a concept from our mid-level layer – e. g. the individual CAD-Document-Edger is an instance of the concept CAD Artifact. Query examples: In order to formalize queries, we use the following notation: A as the set of Attribute, D as the set of Development Artifact and S as the set of System.
Fig. 6. An example domain model and artifact model, including relationships. The dashed lines denote instance relations.
A binary predicate, e.g. refers with the arguments α and β (written as refers(α, β)), is true if α is in a refers relation with β or in a relation that inherits directly or indirectly from refers. In the following, we present two examples of queries that can be formulated based on our model. Q1) “When we change the width of the hydraulic press (denoted as δ), which development artifacts do we have to consider? What other parts of the system are dependent on the width of the hydraulic press?” This query is an example of the analysis of the impact of a change both on the development artifacts and on the system. It can be formalized as: {x : x ∈ D ∧ (refers(x, δ) ∨ ∃y(y ∈ S ∧ attribute(y, δ) ∧ refers(x, y)))} ∪ {x : x ∈ A ∧ dependsOn(x, δ)} The answer to this query, given our domain model example, is a set containing CAD-Document-HydraulicPress and Width (attribute of RolledMaterial). Q2) “When a specific development artifact (denoted as δ) defines a system and that system has parts, what development artifacts define those parts?” This query is an example relevant for change management: the problem is how to find related development artifacts that define parts of the system a specific development artifact defines. Formally, this query can be expressed as: {x : x ∈ D ∧ ∃y(y ∈ S ∧ defines(δ, y) ∧ ∃z(z ∈ S ∧ systemPart(z, y) ∧ defines(x, z)))} This query is formulated in a generic way, since it does not use the vocabulary specific to edgers. In the case of our model, if δ is the individual CAD-Document-Edger, then the answer is a set containing only the individual CAD-Document-HydraulicPress.
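To make the two queries concrete, the following self-contained sketch re-creates the example ABox of Fig. 6 with plain Python sets and evaluates Q1 and Q2 directly from their formalizations. The string identifiers, including the disambiguated attribute names such as "Width(HydraulicPress)", are assumptions made for this illustration and are not part of the authors' tooling.

```python
# Illustrative ABox: individuals grouped by concept, relations as sets of pairs.
systems = {"Edger", "HydraulicPress", "Engine", "RolledMaterial"}
artifacts = {"CAD-Document-Edger", "CAD-Document-HydraulicPress", "C-Program"}
attributes = {"Width(HydraulicPress)", "Width(RolledMaterial)", "Height",
              "Temperature", "ThroughputSpeed"}

attribute = {  # attribute(system, attr): the system has this attribute
    ("HydraulicPress", "Width(HydraulicPress)"),
    ("Engine", "ThroughputSpeed"),
    ("RolledMaterial", "Width(RolledMaterial)"),
    ("RolledMaterial", "Height"),
    ("RolledMaterial", "Temperature"),
}
system_part = {("HydraulicPress", "Edger"), ("Engine", "Edger")}  # (part, whole)
depends_on = {
    ("Width(RolledMaterial)", "Width(HydraulicPress)"),
    ("ThroughputSpeed", "Temperature"),
}
defines = {
    ("CAD-Document-Edger", "Edger"),
    ("CAD-Document-HydraulicPress", "HydraulicPress"),
}
# defines is a sub-relation of refers, so every defines pair also counts as refers.
refers = defines | {
    ("CAD-Document-HydraulicPress", "Edger"),
    ("C-Program", "Temperature"),
    ("C-Program", "ThroughputSpeed"),
}

def q1(delta):
    """Q1: artifacts and attributes affected by a change of attribute delta."""
    affected_artifacts = {
        x for x in artifacts
        if (x, delta) in refers
        or any((y, delta) in attribute and (x, y) in refers for y in systems)
    }
    affected_attributes = {a for a in attributes if (a, delta) in depends_on}
    return affected_artifacts | affected_attributes

def q2(delta):
    """Q2: artifacts that define parts of the system that artifact delta defines."""
    return {
        x
        for x in artifacts
        for y in systems if (delta, y) in defines
        for z in systems if (z, y) in system_part and (x, z) in defines
    }

print(q1("Width(HydraulicPress)"))  # CAD-Document-HydraulicPress, Width(RolledMaterial)
print(q2("CAD-Document-Edger"))     # CAD-Document-HydraulicPress
```

Both results match the answers stated for Q1 and Q2 above.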
4 Discussion
On the usefulness of product knowledge for dependency management. The starting point and motivation for our work is the assumption that the product knowledge can be used to make the (hidden) dependencies between development artifacts explicit. In reality, however, there are other dependencies between artifacts that are independent of the product knowledge (e. g. the dependencies generated by the integration between different tools that are used to describe
the engineering views). The extent to which the product knowledge can effectively be used to manage the dependencies between the development artifacts remains to be empirically investigated. On the choice of SUO as semantic basis. Our dependency management framework is a mid-level ontology that extends the SUO upper-level ontology. There are other upper-level ontologies (e. g. Bunge [3], GOL [4]) which are frequently used in conceptual modeling. We chose SUO because it is an IEEE standard. Furthermore, it offers us enough and appropriate concepts upon which to build our mid-level ontology. Whether the use of SUO is the best choice remains to be further investigated. On the instantiation of the framework for other systems engineering domains. We aim to define a mid-level ontology (which is – including the SUO ontology – a TBox) that can serve as a basis for every ABox in the domain of automation systems engineering. Our TBox should have just enough concepts to formulate queries on a high level of abstraction. This offers us the possibility to create a set of generic queries for dependency management, without even knowing what exactly the system under development is. On the possibilities for reusing our framework. There are three ways for reusing our framework: (i) reusing the TBox for domain-specific ABoxes, which is the most important one, (ii) reusing the TBox for creating domain-specific TBoxes with more details, and (iii) once the model for a certain product is created, it can be reused in future engineering projects.
5 Related Work
There are several approaches to build a unified model for representing the knowledge about the built product and the development artifacts. Linking domain knowledge with development artifacts. LaSSIE [5] was a project done in the late '80s to improve the programming process by linking a model of the software artifacts (C programs) with a model of their business domain (telephony switches). Both these models were captured as ontologies. This linking enabled the developers to formulate combined queries about the domain knowledge and the source code. [6,7] present an approach to manage the development artifacts (with a focus on requirements) with the help of an enterprise ontology that captures both the product structure and the structure of the organization that builds the product. In comparison to these works, we aim to provide a generic framework for dependency management that can be reused by instantiating it for other development artifacts and other engineering domains. Product knowledge and product data management. [8,9] present ontology-based approaches for management of knowledge about engineering products. In comparison to these approaches, we develop a mid-level ontology layer as an extension of SUO, and thereby, we have the possibility to reuse the formalization provided by SUO for many concepts and relations.
[10] presents the requirements for a domain repository that contains reusable development artifacts. One requirement for the domain repository is that development artifacts can be retrieved easily, as a prerequisite for enabling reuse. Our work can be used to extend the domain repository with explicit information about the product to be developed and about the structure of artifacts. [11] discusses the representation of product knowledge in PDM systems. The authors advocate the need to use advanced knowledge representation techniques (semantic networks) in order to capture the knowledge about the product. Our unified model serves the same purpose: it offers a domain meaning and enables users to formulate logical queries about the relation between artifacts and product concepts. Information retrieval. In order to automate our approach, we need support for the automatic population of our ABox. We can take advantage of the research done in the area of mining ontologies from software artifacts. [12] uses a formal ontological representation of source code and software documents. Witte et al. are able to automatically extract concept instances and their relations from source code and software documents. We can use this approach to link the mined concepts to our mid-level ontology (e.g., the concept “Method” in [12] can inherit from our concept “Software-Artifact”). The works of [13] and [14] focus on reverse engineering and on linking source code and software documents together. Querying and reasoning are as important in [13] as in our approach, but we go one step further and aim to connect more heterogeneous artifacts (not only requirements and source code, but also artifacts produced by other disciplines, e.g., electrical engineering, mechanical engineering, etc.) via a common domain model. In [15], Braga et al. describe a retrieval technique based on domain knowledge in the form of ontologies. The focus of that work is on information retrieval and filtering of domain information across multiple domains, whereas the focus of our work is the creation of a TBox that serves the aims mentioned above. Nevertheless, [15] uses domain ontologies as well, and since its focus is the combination of different information retrieval techniques, it could be used to extract information out of an already built ABox based on our TBox. [16,17] present an approach to enrich information retrieval from engineering documents by using domain ontologies that share the established knowledge in design and manufacturing. The methodology for ontology acquisition [17] can be used to extend our mid-level ontology. Furthermore, by using information retrieval techniques, we can obtain automatic support for building the ABox.
6
Future Work
As future work, we plan the following directions: Framework evaluation. Our unified framework is the starting point for managing the dependencies among development artifacts. Our main hypothesis is
that a significant part of these dependencies can be made explicit by using the knowledge about the built product and its relation to the engineering artifacts. Dependency management is not an end goal per se, but rather a means of supporting typical systems engineering tasks such as impact analysis or reuse. One of our aims is the definition of generic queries relevant for systems engineering (e.g., for impact analysis, or for finding cross-cutting dependencies between different kinds of development artifacts). As shown in the presented example (Section 3), these queries should be helpful for our industrial partners even without knowing what the specific domain model (ABox) will be. The aim is to formulate this set of queries by using only the vocabulary of our mid-level ontology. Instantiate the framework for other domains. Our focus is to define a framework for supporting dependency management in the engineering of different complex industrial systems. To enable reuse between different disciplines, our framework should be general enough and easy to instantiate for different industrial domains (e.g., the development of components for cars). The generic concepts that are valid for all domains are represented by our TBox, which should be used as a basis for domain-specific ABoxes. As future work we aim to consolidate the mid-level layer by instantiating the framework for describing other products and other development artifacts. Semantically enriched PDM tools. In engineering projects, product data is often stored in PDM tools (e.g., Comos¹). Current PDMs capture the knowledge about the built product only in an implicit manner and link it with a (usually very shallow) artifact model. Our mid-level ontology can be used to semantically enrich PDMs by offering a conceptual basis for a richer representation of product knowledge and of development artifacts. The content from a semantically enriched PDM can be treated like an ABox. Automation. Another important direction for our future work is to provide (semi-)automatic support for creating a domain-specific ABox. Automation is very important because the effort of creating the ABox is an essential factor that determines whether our approach can be used in practice. For this reason we plan to define a methodology to guide the creation of an ABox with minimal effort, and to use ontology mining techniques. Acknowledgments. This work was partially funded by the German Federal Ministry of Education and Research (BMBF), grant “SPES2020, 01IS08045A”.
References
1. O’Brien, W.: Avoiding semantic and temporal gaps in developing software intensive systems. Journal of Systems and Software 81(11), 1997–2013 (2008)
2. Niles, I., Pease, A.: Origins of the IEEE Standard Upper Ontology. The Knowledge Engineering Review (2001)
¹ http://www.comos.com/14.html?&L=1
3. Wand, Y., Weber, R.: On the deep structure of information systems. Information Systems Journal 5, 203–223 (1995)
4. Degen, W., Heller, B., Herre, H., Smith, B.: GOL: toward an axiomatized upper-level ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems, FOIS 2001, pp. 34–46. ACM Press, New York (2001)
5. Devanbu, P., Brachman, R., Selfridge, P., Ballard, B.: LaSSIE: A knowledge-based software information system. Comm. of the ACM 34(5), 34–49 (1991)
6. Billig, A., Sandkuhl, K.: Enterprise ontology based artefact management. Lecture Notes in Informatics (2008)
7. Sandkuhl, K., Billig, A.: Ontology-based artefact management in automotive electronics. Int. J. Comput. Integr. Manuf. 20(7), 627–638 (2007)
8. Lee, J., Suh, H.: Ontology-based multi-level knowledge framework for a knowledge management system for discrete-product development. International Journal of CAD/CAM 5(1) (2005)
9. Chang, X., Rai, R., Terpenny, J.: Development and utilization of ontologies in design for manufacturing. Journal of Mechanical Design 132(2) (2010)
10. Maga, C., Jazdi, N.: Concept of a domain repository for industrial automation. In: Proceedings of the First International Workshop on Domain Engineering (2009)
11. Conrad, J., Deubel, T., Koehler, C., Wanke, S., Weber, C.: Comparison of knowledge representation in PDM and by semantic networks. In: Design for Society / International Conference on Engineering Design (2007)
12. Witte, R., Li, Q., Zhang, Y., Rilling, J.: Ontological text mining of software documents. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds.) NLDB 2007. LNCS, vol. 4592, pp. 168–180. Springer, Heidelberg (2007)
13. Witte, R., Zhang, Y., Rilling, J.: Empowering software maintainers with semantic web technologies. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 37–52. Springer, Heidelberg (2007)
14. Zhang, Y., Witte, R., Rilling, J., Haarslev, V.: An ontology-based approach for traceability recovery. In: Proceedings of the 3rd International Workshop on Metamodels, Schemas, Grammars, and Ontologies for Reverse Engineering, ATEM 2006 (2006)
15. Braga, R.M.M., Werner, C.M.L., Mattoso, M.: Using ontologies for domain information retrieval. In: Proceedings of the 11th International Workshop on Database and Expert Systems Applications, DEXA 2000, p. 836. IEEE CS, Los Alamitos (2000)
16. Li, Z., Raskin, V., Ramani, K.: Developing engineering ontology for information retrieval. ASME Journal of Computing and Information Science in Engineering 8(1), 21–33 (2008)
17. Li, Z., Yang, M.C., Ramani, K.: A methodology for engineering ontology acquisition and validation. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 23(1), 37–51 (2009)
DaProS: A Data Property Specification Tool to Capture Scientific Sensor Data Properties Irbis Gallegos, Ann Q. Gates, and Craig Tweedie The University of Texas at El Paso, 500 W. University Ave., El Paso TX, 79912, USA [email protected], [email protected],[email protected]
Abstract. Environmental scientists have begun to use advanced technologies such as wireless sensor networks and robotic trams equipped with sensors to collect data, such as spectral readings and carbon dioxide, which is leading to a rapid increase in the amount of data being stored. This has resulted in a need to evaluate promptly the accuracy of the data, the meaning of the data, and the correct operation of the instrumentation in order to not lose valuable time and information. Performing such evaluations requires scientists to rely on their knowledge and experience in the field. Field knowledge is rarely shared or reused by other scientists mostly because of the lack of a well-defined methodology for sharing information and appropriate tool support. This work presents the Data Property Specification (DaProS) tool that assists practitioners in specification and refinement of properties that can be used to check data quality. The tool can be used to capture scientific knowledge about data processes in remote sensing systems through the use of decision trees and questionnaires that guide practitioners in the specification process. In addition, the tool uses Disciplined Natural Language (DNL) property representations for scientists to validate that the specifications capture the intended meaning. Keywords: Data Quality, Sensor Networks, Domain Engineering, Property Specification, Property Validation.
1 Introduction The amount of sensor data acquired through scientific field instruments is rapidly increasing, and field instrumentation usually does not provide the means to detect anomalies in sensor data. In this context, an anomaly is a deviation from an expected datum value or data behavior. Scientists analyze the data to identify anomalies derived from instrument malfunctioning or environmental events with scientific implications. Scientists rely on their knowledge and experience in the field to distinguish one type of anomaly from another. The limitation with this practice is that scientists rarely share or reuse knowledge about their data processes with other scientists mostly because of the lack of a well-defined methodology for doing so and tool support. For novice scientists, a significant amount of time may pass before anomalies are detected, if indeed they are detected. If numerous anomalies are detected, the data gathering process may need to be repeated at a different point in time, which can be expensive especially when data is collected at remote sites. J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 232–241, 2010. © Springer-Verlag Berlin Heidelberg 2010
Sensors need to be redeployed and possibly recalibrated; this obviously increases the amount of time required to gather the data. Also, environmental science sensor data are non-reproducible entities; as a result, an observation at a given point in time and for a set of conditions is lost and unrecoverable. The importance of data to the study of the environment and to society in general emphasizes the need to develop mechanisms and procedures to identify and understand anomalies in sensor data. There is a need for scientists to capture expert knowledge about data practices to be shared with colleagues and to train novice scientists about how to identify and understand anomalies in collected datasets. The expert knowledge required to detect and understand anomalies in sensor data can be captured as data properties. The focus of this paper is on providing a scientist-centered solution that does not require scientists to learn new computer science formalisms to specify and validate data properties that capture expert knowledge. The paper introduces the Data Property Specification (DaProS) tool, the processes followed to specify and validate scientific data properties using DaProS, and the lessons learned from using the tool to specify a broad range of properties documented through a literature survey.
2 Background The DaProS tool uses scopes, patterns and Boolean statements to specify data properties. Boolean statements express data properties, which are defined using mathematical relational operators that are applied to a datum, datum relationships, and Boolean methods that are available to the scientist. A property scope delimits a dataset subset over which a property holds. The scope is delimited using datum occurrences in dataset Δ. Given L and R ϵ Δ, a practitioner delimits the scope of a property by designating one of the following types: all the data in Δ (Global); a subset beginning with the first datum in Δ and ending with the datum immediately preceding the first datum in Δ that matches R (Before R); a subset starting with the first datum in Δ that matches L and ending with the last datum in Δ inclusive (After L); a subset starting with the first datum in Δ that matches L and ending with the datum immediately preceding the first datum that matches R (Between L and R); and a subset starting at the first datum that matches L and ending with the datum immediately preceding the first datum that matches R, or with the last element in Δ if datum R does not occur (After L until R). A property pattern is a high-level abstraction of a description of a commonly occurring property on a scientific dataset. Users select patterns through a variety of decisions. The patterns are grouped as experimental reading, which describes the expected behavior of the data themselves, and experimental condition, which describes external conditions such as those associated with the functioning of the instrument or weather conditions. Properties may be time constrained or not. A time-constrained property specifies that a property holds in one of the following ways: for at most a given amount of time (Maximum Duration) or for at least a given amount of time (Minimum Duration); recurrently every c units of time (Bounded Recurrence); within at most c units of time after Boolean statement T holds (Bounded Response); or for at least c units of time before Boolean statement P holds (Bounded Invariance). A non-time constrained pattern specifies that a property never holds over a dataset (Absence), always holds over a dataset (Globally), or holds at least once over the dataset (Existence).
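As an illustration of the scope semantics just described, the following sketch delimits a dataset with the Between L and R and After L until R scopes. It is only one reasonable reading of the definitions, and all function and parameter names are invented for the example; this is not DaProS code.

```python
def _first_index(data, pred, start=0):
    """Index of the first datum satisfying pred at or after position start, else None."""
    for i in range(start, len(data)):
        if pred(data[i]):
            return i
    return None

def scope_between(data, matches_l, matches_r):
    """Between L and R: from the first datum matching L up to, but not including,
    the first later datum matching R (empty here if either is missing)."""
    l = _first_index(data, matches_l)
    if l is None:
        return []
    r = _first_index(data, matches_r, l + 1)
    return [] if r is None else data[l:r]

def scope_after_until(data, matches_l, matches_r):
    """After L until R: like Between L and R, but extends to the last element
    of the dataset when no datum matches R."""
    l = _first_index(data, matches_l)
    if l is None:
        return []
    r = _first_index(data, matches_r, l + 1)
    return data[l:] if r is None else data[l:r]
```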
3 Data Property Specification via the DaProS Tool The Data Property Specification prototype tool (DaProS) was developed to assist practitioners in specifying and refining data properties that capture scientific expert knowledge about their data processes. The tool uses property categorization based on data-property scopes and patterns to guide the specification process. The tool guides practitioners using a decision tree and a series of questions to assist in the specification and refinement of data properties. To validate the intended meaning of specified properties, DaProS generates natural language descriptions of specified properties using a disciplined natural language. 3.1 DaProS Specification Process Using DaProS for property specification is a four-step process: 1) property category selection, 2) property scope selection, 3) property pattern selection and specification, and 4) final property view. To illustrate the process, we present a scenario in which Vianey, an environmental scientist, uses DaProS to specify a property P: “On May 18th during the daytime, the temperature difference between the temperature sensor at 10 meters (Ts) and the temperature sensor at 3 meters (t_hmp) shall be less than 3 degrees Fahrenheit.” Step 1: Property Category Selection. The practitioner selects a data property category. The data property categories help scientists determine if the purpose of the property is to define data behavior or external behavior. The data property categories also determine if the property is time-dependent or not. The selected type of property will limit the choices of available patterns. If the practitioner is undecided about which data property category to select, he or she can use a decision tree to select the property. Because Vianey is unsure of what data category to use, she uses the decision tree to select the category. The decision tree begins by querying Vianey whether the intended property specification is an experimental reading or an experimental condition. She determines that she is concerned about the data behavior itself, so she selects experimental reading. Next Vianey needs to determine if the property is time constrained or not. She is interested in looking at the whole dataset with no time recurrence, and because she does not know the time frequency at which the data are collected, she cannot determine a specific unit of time. As a result, she selects a non-time constrained property. Vianey now needs to determine if the property has behavior dependencies, data depending on instrument(s) behavior for experimental readings, or instruments depending on data behavior for experimental conditions. She realizes that there are no behavior dependencies because the data readings for this property only capture correlation between readings and not a relationship of the sensor’s instrumentation to the data. Finally, Vianey determines that she is interested in the relationship between two or more data entities. The decision tree guides Vianey to the category Datum Relationship to specify the desired data property. Figure 1 depicts the DaProS graphical user interface showing how the decision trees are presented to the practitioner.
Step 2: Property Scope Selection. In Step 2, the practitioner selects a property scope and provides the attributes associated with the scope. In our scenario, Vianey decides to look at the data for May 18th collected between 6:15:00 A.M and 8:00:00 P.M. She selects the Between L and R scope, where L is 05:18:2010:6:15:00.A.M and R is 05:18:2010:8:00:00.P.M. Step 3: Property Pattern Selection and Specification. In Step 3, the practitioner builds the property by first selecting the pattern that describes the verification behavior. The property specification is restricted to the following basic set of relational operators: <, <=, =, !=, >=, and >. If the selected property category from Step 1 is time related, the practitioner must select a time-constrained property pattern; otherwise, the practitioner selects a qualitative property pattern. DaProS guides the user in the property-pattern selection. If the property is not time-constrained, the practitioner goes to Step 3a. If the property to be specified is time-constrained, the practitioner needs to complete Step 3b. Step 3a: Non-time constrained. In this step, the practitioner selects a property pattern and builds the Boolean statement to define the property. In the scenario, Vianey wants to make sure that the property holds throughout the entire dataset, so she selects Global as the property pattern, and defines the property as |Ts-t_hmp|<3.0.
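Continuing the scenario, the sketch below shows how the Global (universality) pattern could be checked over the scoped data. The record field names Ts and t_hmp come from the property itself; everything else is illustrative and not part of DaProS.

```python
def holds_globally(scoped_data, statement):
    """Globally / universality pattern: the Boolean statement must hold
    for every datum in the scoped subset of the dataset."""
    return all(statement(datum) for datum in scoped_data)

# Vianey's Boolean statement |Ts - t_hmp| < 3.0 over two illustrative records.
property_p = lambda d: abs(d["Ts"] - d["t_hmp"]) < 3.0

scoped = [{"Ts": 71.2, "t_hmp": 70.1}, {"Ts": 74.8, "t_hmp": 72.3}]
print(holds_globally(scoped, property_p))   # True for this sample
```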
Fig. 1. DaProS graphical user interface used to specify data properties
Step 3b: Time constrained. If the practitioner selects a time-constrained property pattern, he or she provides the required time constraint. The practitioner then builds the property as described in Step 3a. Step 4: Final Property View. During this step, the practitioner selects a format to display the property. The practitioner can generate a property summary, present a disciplined natural language description, or display the property in Extensible Markup Language (XML) representation. The property summary shows the user the selected property scope and pattern along with the corresponding attributes and the Boolean statement. The disciplined natural language (DNL) description is a natural language description that can be used by practitioners to review and validate the property by determining if the natural language captures the intended meaning of the property. The use of disciplined natural language is intended to mitigate the ambiguity inherent in natural language. The XML representation is used to export the specified properties. In our scenario, Vianey validates the property by selecting the DNL description and reviewing the description; she then exports the data property out of the system using DaProS’ export feature. The DNL representation of the property as generated by DaProS is as follows: For the dataset data enclosed by the data interval starting with 05:18:2010:6:15:00.A.M and ending with the datum immediately prior to 05:18:2010:8:00:00.P.M., it is always the case that |Ts-t_hmp|<3.0 holds. 3.2 DaProS Validation Process A challenge associated with property specifications is the limited tool support for practitioners to validate the specified properties. To address this challenge, the DaProS tool uses disciplined natural language (DNL) templates [1] to generate natural language descriptions of specifications. Konrad and Cheng [2] defined a DNL grammar to specify critical software properties. That DNL grammar was not suited to specifying data properties because it does not support the required relations. To address this shortcoming, we constructed a new DNL grammar based on the work initiated by Konrad and Cheng. Table 1 presents the DaProS grammar used to derive natural language property representations, and Tables 2 and 3 present examples. In the grammar, literal terminals are delimited by quotation marks (“”), non-literal terminals are given in a Calibri font, and non-terminals are given in italics. The start symbol of the grammar is property and the language L(G) of the grammar is finite. Each sentence s with s ϵ L(G) is a property composed of a scoped formula and a qualitative or timed pattern. Non-literal terminals need to be instantiated in order to complete the natural language representation. A datum value represents a numeric value n ϵ ℝ. A time value represents a discrete time value t ϵ ℕ. Given a dataset Σ, a Boolean method is a mapping B: Σ → {true, false}. A computational method is a mapping D: Σ → {n | n ϵ ℝ}. For timed properties, c needs to be instantiated as an integer value c ϵ ℕ.
Table 1. DaProS DNL grammar for validating property representations
1. property ::= scope "," specification "."
2. scope ::= "For all dataset values" | "For all dataset values" ("before" del | "after" del | "between" betdel | "after" aftdel)
3. del ::= datum | time
4. betdel ::= datum "and" datum | time "and" time
5. aftdel ::= datum "until" datum | time "until" time
6. datum ::= datum value
7. time ::= time value
8. specification ::= qualType | timeType
9. qualType ::= occurCat | orderCat
10. occurCat ::= absencePat | universalityPat | existencePat
11. absencePat ::= "it is never the case that" datarel "holds"
12. universalityPat ::= "it is always the case that" datarel "holds"
13. existencePat ::= datarel "eventually holds"
14. orderCat ::= "it is always the case that if" datarel "holds" (precedencePat | respPat)
15. precedencePat ::= "then" datarel "previously held"
16. respPat ::= "then" datarel "eventually holds"
17. timeType ::= "it is always the case that" (durationCat | periodicCat | timeOrderCat)
18. durationCat ::= "once" datarel "becomes satisfied, it holds for" (minDurPat | maxDurPat)
19. minDurPat ::= "at least" c "time unit(s)"
20. maxDurPat ::= "less than" c "time unit(s)"
21. periodicCat ::= datarel "holds" boundRecPat
22. boundRecPat ::= "at least every" c "time unit(s)"
23. timeOrderCat ::= "if" datarel "holds, then" datarel "holds" (boundResPat | boundInvaPat)
24. boundResPat ::= "after at most" c "time unit(s)"
25. boundInvaPat ::= "for at least" c "time unit(s)"
26. datarel ::= singlerel | methrel | boolrel
27. singlerel ::= datum comp datum
28. comp ::= < | ≤ | > | ≥ | = | ≠
29. compbool ::= = | ≠
30. boolrel ::= bool_method compbool bool_method | bool_method compbool bool_value
31. bool_method ::= Boolean function
32. methrel ::= datum comp compu_method | compu_method comp compu_method
33. compu_method ::= Computational function
34. bool_value ::= True | False
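To illustrate how the grammar's templates yield sentences like those in Tables 2 and 3, here is a small string-templating sketch for a few of the non-timed patterns. It is not the DaProS implementation; the scope phrase and the method names are copied from the examples only for illustration.

```python
TEMPLATES = {
    "absence":      "it is never the case that {d} holds",
    "universality": "it is always the case that {d} holds",
    "existence":    "{d} eventually holds",
    "response":     "it is always the case that if {d1} holds, then {d2} eventually holds",
}

def render_dnl(scope_phrase, pattern, **datarels):
    """Assemble a DNL sentence in the style of Table 1 (untimed patterns only)."""
    return scope_phrase + ", " + TEMPLATES[pattern].format(**datarels) + "."

print(render_dnl("For all dataset values", "response",
                 d1="HCSP.isOff()", d2="Global.setFlag(3)"))
# For all dataset values, it is always the case that if HCSP.isOff() holds,
# then Global.setFlag(3) eventually holds.
```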
Table 2. Data Property (1) DaProS disciplined natural language representation for an experimental reading property extracted from the conducted literature review
Document Specification: On May 12th during the daytime, the dry bulb temperature should be less than or equal to 79.0 degrees Fahrenheit [3].
DaProS DNL Representation: For all dataset values between 05:12:2010:6:15:00 A.M and ending with the datum immediately prior to 05:12:2010:8:00:00 P.M, it is always the case that DryBulbTemp<=79.0 holds.
Grammar Rules: 1,2,4,7,8,9,10,12,26,27,6,28,6
Verification Entity: Daytime: defined to be between 05:12:2010:6:15:00 A.M and 8:00:00 P.M from scientific expert knowledge. DryBulbTemp: dry bulb temperature sensor reading.

Table 3. Data Property (2) DaProS disciplined natural language representation for an experimental condition property extracted from the conducted literature review

Document Specification: If none of the sensor couples associated with the current meter is present, the series global flag is set to 3 [4].
DaProS DNL Representation: For all dataset values, it is always the case that if HCSP.isOff() holds, then Global.setFlag(3) eventually holds.
Grammar Rules: 1,2,8,9,14,26,30,31,29,34,16,26,30,31,29,34
Verification Entity: HCSP.isOff(): Boolean method, returns true if the sensor couples are off, false otherwise. Global.setFlag(int n): Boolean method, returns true if the series global flag is set to integer value n, false otherwise.
4 Lessons Learned Properties obtained from a literature survey of 15 projects were specified using DaProS, and the DNL representation generated from the grammar in Table 1 was captured. The identified projects are representative of how different institutions have incorporated data quality into their sensor data collection systems. The projects use mechanisms that perform data verification both at field sites and in data centers. The main goal of the literature review was to identify scientific projects that collect sensor data and that could benefit from a data property categorization, and to identify the type of data being analyzed and the data properties defined by each project. A total of 526 data properties were manually extracted from the published literature of the projects. DaProS was then used to specify the data properties extracted from the projects. The practitioner using DaProS was able to specify all but six of the data properties. Those six properties were ambiguous and, as a result, the property specification expert was unable to identify the appropriate scope and pattern using the tool based only on the statement as provided in the documentation. However, further property
refinement and expert-scientist knowledge would allow the practitioner to specify the properties using DaProS. Table 4 presents the properties that were not specified and the issues associated with each property.
Table 4. Data properties that could not be specified
Data Property: Flag as K if the data looks to have obvious errors, but no specific reason for the error can be determined.
Issue: “Obvious” is ambiguous.

Data Property: Flag as M if a known instrument malfunction occurs.
Issue: This can be specified if the instruments are provided. In this case, they were not given.

Data Property: Flag as Z if data passed evaluation.
Issue: “Evaluation” was not specified.

Data Property: A Fine Structure consists of step-like features or small interleaving observed in a profile over a range of depths (usually 10-100 m) or the entire profile.
Issue: There is no specification for the features. Also, it is not clear if the data is to be verified over the range of depths, over the entire profile, or both. Is there a difference at 10 m vs. 100 m?

Data Property: The date and time must be sensible.
Issue: “Sensible” is ambiguous.

Data Property: All data must be corrected with a calibration accuracy of +/-1.0% at up to 20 mm/hr.
Issue: This can be verified only if the original value is saved along with the corrected value. Should the verification include those corrected at 20 mm/hr?

Data Property: All data must be corrected with a calibration accuracy of MAX (+/-1.1 m/sec (2.4 mph), +/-4% of reading).
Issue: This can be verified only if the original value is saved along with the corrected value.
Analysis of the data property categorization and specification process garnered from the literature showed that scientists paid less attention to instrument malfunctioning than to the data values. It is important to note that instrument malfunctions are a major source of anomalies. Also, scientists are aware of sensor and data relationships, but these relationships are rarely used for anomaly detection. Scientists typically perform data inspections to determine instrument malfunction rather than monitoring the instruments’ performance. During the specification process, several factors were identified that affect data property specification using our approach. Some data properties are described at such an abstract level that it was difficult to specify them formally without more detail. The advantage of using a tool such as DaProS is that a practitioner, while trying to specify the property, would realize the need to refine the property to capture the intended meaning. Other data properties are complex and need to be decomposed into several simpler properties. A number of specifications are a combination of data verification and data steering properties, and this required separating the two concerns. Combined property specifications require both verifying that the properties adhere to predefined behaviors (the verification aspect) and guaranteeing that a reaction occurs in response to a data or instrument stimulus (the steering aspect). Combined property specifications can
be decomposed into separate data verification properties and data steering properties. Due to the inherently ambiguous nature of natural language, data property descriptions are sometimes too ambiguous, requiring the involvement of the expert.
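As a rough illustration of this decomposition, a combined specification can be split into a pure verification predicate and a separate steering reaction; the threshold, the flag value and the field names below are invented for the example and do not come from any of the surveyed projects.

```python
# Verification aspect: does the datum adhere to the predefined behaviour?
def verify(datum, max_valid=79.0):
    return datum["DryBulbTemp"] <= max_valid

# Steering aspect: react to a data/instrument stimulus (here: flag the datum).
def steer(datum, flags, flag="M"):
    flags[datum["id"]] = flag

flags = {}
for datum in [{"id": 1, "DryBulbTemp": 75.3}, {"id": 2, "DryBulbTemp": 81.0}]:
    if not verify(datum):
        steer(datum, flags)
print(flags)   # {2: 'M'}
```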
5 Related Work Some of the concepts used for the design and development of the DaProS tool were adapted from requirements specification techniques used in software engineering to capture critical system, software, and data properties. The DaProS tool focuses on specification of data and instrumentation properties that can be applied during the data acquisition process and reused by others. The software engineering techniques and tools that form the basis for DaProS and those that are related to the work are described next. The Specification and Pattern System (SPS) was introduced by Dwyer et al. [5] to assist practitioners to formally specify software and hardware properties. SPS uses scopes and patterns obtained after analyzing a wide range of properties from multiple domains, i.e., hardware systems, network protocols, security protocols, user interfaces. The SPS supports mapping to several formalisms, e.g., Linear Temporal Logic (LTL), Computational Tree Logic (CTL), and Future Interval Logic (FIL); thus, allowing formally specified properties to be verified by verifications tools such as theorem provers, model checkers, and runtime monitors. The Property Specification (Prospec) tool [6] includes patterns and scopes, and it uses decision trees to assist users in the selection of appropriate patterns and scopes for a given property. Prospec extends the capability of SPS by supporting the specification of Composite Propositions (CP) classes for each parameter of a pattern or scope that is comprised of multiple conditions or events. Prospec uses guided questions to distinguish the types of scope or relations among multiple conditions or events. By answering a series of questions, the practitioner is lead to consider different aspects of the property. A type of scope or CP class is identified at the end of guidance. Prospec generates formal specifications in FIL and LTL. Propel [1] helps practitioners write and understand properties by providing templates that explicitly capture details as options for commonly occurring property patterns based on SPS. The provided templates are represented using both disciplined natural language (DNL) and finite-state automata (FSA). The practitioner can view both representations simultaneously and select from which representation to elucidate the desired property. Spider [2] generates specification properties using natural language representations based on a natural language grammar and a specification pattern system that derives natural language sentences. The specifications are mapped to temporal logics such as LTL or Metric Temporal Logic (MTL) that can be analyzed formally by a tool such as SPIN. The structured language grammar supports translations of untimed and timed properties to multiple temporal logics. While standard UML modeling tools with Object Constraint Logic (OCL) can be used to specify constraints on data and instrumentation properties, standard OCL does not support temporal properties. Extensions to OCL that include temporal logic exist [7], [8]. The goal of DaProS, however, is to make the specification process more
amenable to non-technical scientists, and this is accomplished through a categorization of properties tailored to those typically specified by environmental scientists, as well as through the introduction of scopes and patterns.
6 Summary Environmental scientists studying global changes use sensor networks to collect larger amounts of data, and there is a need for scientists to identify anomalies in their data. Scientists rely on their knowledge and field experience to identify such anomalies. Expert knowledge is rarely shared with and reused by other scientists, thus making it difficult for novice users to build expertise. The DaProS tool allows scientists to specify and validate data properties that can be used to detect and understand data anomalies. The DaProS tool was successfully used to specify approximately 500 data properties from a literature survey and to identify factors that can limit the effectiveness of the data properties specification process. Currently, the DaProS tool is being used to specify data properties for scientists working with Eddy covariance, biomesonet towers that collect carbon dioxide (CO2), energy, and water balance measurements obtained at the Jornada Basin Experimental Range. In the future, the DaProS tool will be converted into a Web service to allow scientists to access the tool from field research sites through the Internet. In addition, a monitor that supports steering capabilities is another area of future work.
References
1. Smith, R., Avrunin, G., Clarke, L., Osterweil, L.: PROPEL: An Approach Supporting Property Elucidation. In: Proc. of the 24th Intl. Conf. on Software Engineering, ICSE (2002)
2. Konrad, S., Cheng, B.H.C.: Facilitating the Construction of Specification Pattern-Based Properties. In: Proc. 13th IEEE Intl. Conf. Requirements Engineering, pp. 329–338 (2005)
3. Canada Federal Department of Fisheries and Oceans: Data Quality Assurance (QC) at the Marine Environmental Data Service (MEDS). DFO ISDM Quality Control, http://www.meds-sdmm.dfo-mpo.gc.ca/meds/Prog_Int/WOCE/WOCE_UOT/qcproces_e.htm
4. Lambert, W., Merceret, F.J., Taylor, G.E., Ward, J.G.: Performance of Five 915-MHz Wind Profilers and an Associated Automated Quality Control Algorithm in an Operational Environment. J. of Atmospheric and Oceanic Technology 20, 1488–1495 (2003)
5. Dwyer, M.B., Avrunin, G.S., Corbett, J.C.: A System of Specification Patterns. In: Proc. of the 2nd Workshop on Formal Methods in Software Practice (1998)
6. Mondragon, O., Gates, A.Q.: Supporting Elicitation and Specification of Software Properties through Patterns and Composite Propositions. Intl. J. Software Engineering and Knowledge Engineering 14(1) (2004)
7. Ziemann, P., Gogolla, M.: An Extension of OCL with Temporal Logic. In: Jürjens, J., Cengarle, M.V., Fernandez, E.B., Rumpe, B., Sandner, R. (eds.) Critical Systems Development with UML, Technische Universität München, Institut für Informatik, pp. 53–62 (2002)
8. Flake, S.: Temporal OCL Extensions for Specification of Real-Time Constraints. In: Proc. SVERTS Workshop at UML 2003 (2003)
6th International Workshop on Foundations and Practices of UML (FP-UML 2010)
Preface The Unified Modeling Language (UML) has been widely accepted as the standard object-oriented (OO) modeling language for modeling various aspects of software and information systems. The UML is an extensible language, in the sense that it provides mechanisms to introduce new elements for specific domains if necessary, such as web applications, database applications, business modeling, software development processes and data warehouses. Furthermore, the latest version, UML 2.0, has become even bigger and more complicated, with more diagrams added for good reasons. Although UML provides different diagrams for modeling different aspects of a software system, not all of them need to be applied in most cases. Therefore, heuristics, design guidelines, and lessons learned from experience are extremely important for the effective use of UML 2.0 and for avoiding unnecessary complication. Also, approaches are needed to better manage UML 2.0 and its extensions so that they do not become too complex to manage in the end. The Sixth International Workshop on Foundations and Practices of UML (FP-UML'10) is a sequel to the successful BP-UML'05 - FP-UML'09 workshops held in conjunction with ER'05 - ER'09, respectively. FP-UML'10 intends to be a premier forum for exchanging ideas on best and new practices in the use of UML in modeling and system development. Since, as Booch et al. have stated, "the full value of model driven architecture is only achieved when the modeling concepts map directly to domain concepts rather than computer technology concepts", domain-specific approaches to UML were particularly encouraged. We received 10 full papers. The Program Committee selected only 5 papers, making an acceptance rate of 50%. The accepted papers were organized in two sessions. The first one will be focused on Semantics and Ontologies in UML, where the first two papers deal with using UML for enterprise and service-oriented architecture modeling, and the third tackles the meaning of membership in collections. In the second session, two papers focusing on Automation and Transformation of activities in UML will be presented. We hope that you will enjoy this record of the workshop and find the information within these proceedings valuable towards your understanding of the current state-of-the-art in UML modeling issues. We would like to express our gratitude to the program committee members for their hard work in reviewing papers, the authors for submitting their papers, and the ER 2010 organizing committee for all their support.
July 2010
Gunther Pernul Matti Rossi
Incorporating UML Class and Activity Constructs into UEML Andreas L. Opdahl Department of Information Science and Media Studies, University of Bergen, NO-5020 Bergen, Norway [email protected]
Abstract. The Unified Enterprise Modelling Language (UEML) aims to become a hub for integrated use of enterprise and information systems (IS) models expressed using different languages. The paper explains how central constructs from UML's class and activity diagrams have been incorporated into UEML. As a result, the semantics of UML's central constructs for representing classes and activities have become more precisely defined in terms of the common UEML ontology. Through their ontology images, the two diagram types are also on the way to become interoperable with other enterprise and IS modelling languages in UEML. Keywords: Enterprise modelling, IS modelling, ontology, Unified Enterprise Modelling Language (UEML), Unified Modeling Language (UML).
1 Introduction Model-driven technologies are creating increasingly larger demands for model and modelling-language management technologies. Unfortunately, many existing modelling languages for enterprises, information systems (ISs), software and services do not define their semantics well. The current trend towards domain-specific modelling exacerbates the problem by introducing even more languages with incompletely and/or imprecisely defined semantics. The spread of semantic technologies, including the linked open data initiative, creates even stronger demands for handling semantics. Well-defined and precise semantics for models and modelling languages are crucial for leveraging the full power of both model-driven and semantic technologies, in particular to promote interoperability. New theories and technologies are called for to describe the semantics of modelling languages in a common, structured, precise and interoperable manner. This paper will explain how modelling-language semantics are handled by the Unified Enterprise Modelling Language (UEML), using examples from UML. Its primary purpose is to present for the first time how central constructs from UML's class and activity diagrams are described as parts of UEML, going into particular detail about UML Actions and, sometimes also, ActionExecutions. The paper will also discuss the various types of constraints on modelling constructs in more detail than earlier papers and it will introduce the idea of ontology images of modelling languages. Although UEML primarily targets languages for enterprises and their ISs, J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 244–254, 2010. © Springer-Verlag Berlin Heidelberg 2010
incorporating a software-oriented language like UML remains interesting because some of UML's diagram types can be used to represent enterprises/ISs too. Hence, the paper will mainly be concerned with UML as an enterprise/IS modelling language. The rest of the paper is organised as follows. Section 2 presents existing approaches to language definition, with emphasis on the Unified Enterprise Modelling Language. Section 3 demonstrates UEML's approach to semantic construct description, using examples from UML. Section 4 explains how the construct descriptions are interrelated through mappings into a common ontology. Finally, Section 5 concludes the paper and suggests paths for further work.
2 Theory Language definition: The most common approach to defining modelling languages is meta modelling (e.g., [1, 2]). The result is syntax-oriented language definitions that tend to treat semantics as a secondary concern. Other approaches to modellinglanguage definition use formal notations or ontological (or referential) semantics. In the latter case, semantics is defined in terms of a reference ontology (or other conceptual model). Examples include [3, 4], but neither follows or offers a general approach to semantics definition. Wand and Weber's [5] ontological analysis and evaluation has been used to describe the (referential) semantics of several modelling languages, including UML [6, 7]. Other available ontologies include the Enterprise Ontology [8], FRISCO [9] and TOVE [10], but they offer no structured approach to describing modelling languages. Unified Enterprise Modelling Language: UEML [11] aims at supporting integrated use of enterprise and IS models expressed in different languages. To achieve this aim, UEML offers a hub through which modelling languages can be connected. It thereby paves the way for also connecting the models expressed in those languages. A central idea is to describe the semantics of individual modelling constructs by mapping them into a fine-grained and well-structured ontology. So far, 130 constructs from a selection of 10 languages have been incorporated, although with varying degrees of precision. 10 central constructs from UML's class diagrams (Aggregation, Association, Attribute, Class, Composition, Generalization, Link, Object, Operation, Property) and 21 central constructs from its activity diagrams (Action, Activity, ActivityEdge, ActivityFinalNode, ActivityParameterNode, CentralBufferNode, ControlFlow, DataStoreNode, FinalNode, FlowFinalNode, InitialNode, InputPin, JoinFork, MergeDecision, ObjectFlow, ObjectNode, OutputPin, Pin, Token, ValueNode, ValuePin) have been incorporated in several stages. In the first stage, the constructs were described in a structured form using a textual template [12]. In the second stage, the descriptions were refined and represented as an OWL ontology in the UEMLBase tool implemented on top of Protege-OWL [13]. In the third stage, the construct descriptions and resulting ontology were validated and improved using the UEMLVerifier tool [11]. The incorporation was based on an earlier ontological analysis of UML [7] and on a first outline of the UEML approach [14, 15], which used descriptions of UML's Object, Property and Multiplicity constructs as examples.
3 Describing Modelling Constructs Construct description: UEML describes language semantics in terms of individual modelling constructs (see, e.g., [12]), by breaking them up into six parts, which we will illustrate using UML-Actions and -ActionExecutions as examples. In UML 2.2 [2], Action is “a named element that is the fundamental unit of executable functionality.” Its ActionExecution “represents some transformation or processing in the modeled system” and it “represents a single step within an activity, that is, one that is not further decomposed within the activity.” UML-Actions define these attributes (among others): sets of input and output pins, a context, a pre- and a postcondition and an inherited name. Example: To incorporate the UML-Action construct into UEML, its semantics was broken up into the following six parts: 1.
Instantiation level: A UEML construct may be used to represent either individual things (the instance level), classes of things (the type level) or both levels. UML-Action is type level because it is the behaviour of a UML-Classifier (which belongs on the type level).
2. Modality: A UEML construct (or part thereof) may represent either a fact about or someone's belief about, knowledge of, obligation within, intention with respect to a domain, and so on. UML-Action is factual because it represents an actual action within a model (although the model itself can have another modality, e.g., it may represent a possible or wanted future situation).
3. Classes of things: Regardless of instantiation level and modality, a modelling construct will say something about one or more things (if it is instance level) or classes of things (if it is type level). A UML-Action represents a single class (the UML-Classifier) of executor things that perform the action.
4. Properties of things: Most modelling constructs will also represent one or more properties that its things/classes possess. A UML-Action has a name and it represents a number of 'incoming' and 'outgoing' flows of the executor thing. Each flow may have a subproperty that represents its content at a particular time. The UML-Action also represents zero or more internalProperties of the action and an actionFlag that indicates whether it is executing or not at a particular time. The UML-Action may also represent a partWholeRelation to its context, i.e., to the activity system of which the action may be a part. In UEML, some properties are state laws and transformation laws that restrict other properties. UML-Action represents such an actionLaw that constrains the contents of its 'incoming' and 'outgoing' flows and its internalProperties and actionFlag. The actionLaw has two sublaws (subproperties that are laws). The precondition is satisfied when the action starts and the postcondition when it ends.
5. States of things: Some behavioural modelling constructs represent particular states in their things or classes. States are defined in terms of a thing's properties by a state law that restricts (or constrains) these properties' values. UML-Action in itself is not a behavioural construct, so its description does not contain states. A UML-ActionExecution, however, extends UML-Action with a before- and an afterState. The beforeState is restricted by the precondition of the action, whereas the afterState is restricted by its postcondition.
Fig. 1. UML class diagram for describing modelling constructs and for organising the UEML ontology (based on [12])
6. Transformations of things: Behavioural modelling constructs may even represent transformations of things/classes from a pre- to a post-state. Transformations are defined in terms of the properties that define the pre- and post-states by a transformation law that effects changes of these properties' values. UML-Action does not represent transformations either, but UML-ActionExecution represents an actionExecution transformation from the before- to the afterState, which is effected by the actionLaw. □
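The six-part template can also be pictured as a simple data structure. The sketch below is a schematic Python rendering of the UML-Action description only for illustration; it assumes nothing about how UEMLBase actually represents construct descriptions in OWL.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Property:
    name: str
    is_law: bool = False                       # state/transformation laws are properties too
    subproperties: List["Property"] = field(default_factory=list)

@dataclass
class ConstructDescription:
    """One modelling construct described as a 'scene' (illustrative only)."""
    construct: str
    instantiation_level: str                   # "instance", "type" or "both"
    modality: str                              # e.g. "factual"
    classes: List[str]
    properties: List[Property]
    states: List[str] = field(default_factory=list)
    transformations: List[str] = field(default_factory=list)

uml_action = ConstructDescription(
    construct="UML-Action",
    instantiation_level="type",
    modality="factual",
    classes=["executor"],
    properties=[
        Property("actionLaw", is_law=True,
                 subproperties=[Property("precondition", is_law=True),
                                Property("postcondition", is_law=True)]),
        Property("flow"), Property("internalProperty"),
        Property("actionFlag"), Property("partWholeRelation"),
    ])
```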
Relations: Instead of mapping modelling constructs one-to-one with concepts in an ontology, UEML thereby describes each modelling construct as a scene where the roles are played by things/classes and their properties, perhaps along with states and transformations. These types of roles are related according to Figure 1. Example: The roles in the UML-Action scene are played by executor, actionLaw, precondition, 'incoming' flow etc. as already described. The different roles are interrelated as shown in Figure 2a (omitting a few properties, including names). The executor class possesses an actionLaw and, possibly, a partWholeRelation to its context (the surrounding activity system, which is, however, not represented directly by the UML-Action). The two sublaws of actionLaw constrain the 'incoming' and 'outgoing' flows of the action. Subproperty relations are transitive, so the precondition may constrain not only the 'incoming' flow, but also its contents, and correspondingly for postconditions and the overall actionLaws. Also, property possession distributes over subproperty relations, so only the most compound property, the actionLaw, is explicitly possessed by the executor class. Figure 2b indicates how to extend 2a with before- and afterStates and an actionExecution transformation. (A few unaffected roles in Figure 2a have not been re-drawn in this figure.) □ Constraints: Construct descriptions can be constrained further. Figure 2a also shows cardinality constraints for the construct level, i.e., for the elements that instantiate UML-Action, the Action nodes in activity diagrams. In addition to the usual cardinalities for relations between the roles in a scene, it presents cardinality constraints for the roles themselves. Figure 2b shows the corresponding constraints for UML-ActionExecutions (where the cardinalities of internalProperty have changed because actionFlag is no longer drawn separately). There can also be constraints on the element level, which constrain instances of Action elements, i.e., actions that
Fig. 2. Semantic description of UML Actions (a) and ActionExecutions (b)
belong to individual executor things (instances of the action's Classifier). An elementlevel constraint can be language-wide, which means that it constraints all the instances of one type of element in a language, or it can be element-specific, which means that it constrains the instances of a single element only. A language-wide element-level constraint is defined once for all instances of a modelling construct. A specific element-level constraint is an attribute of the individual model element. Example: A construct-level cardinality constraint in UML is that a model element that instantiates an Association in a class diagram must always connect two or more Class elements. An element-level cardinality constraint is that an Association element may have minimum and maximum cardinality constraints on its own for each of its LinkEnds, which restrict the UML Links that instantiate the Association. This constraint is in part language-wide and in part element-specific. It is element-specific because the actual minimum and maximum cardinalities are specific to each Link and its ends. It is language-wide because it is otherwise similar for all Links/-Ends, i.e., it always states that the number of Objects connected to a Link end must be between the minimum and the maximum cardinality. It is a language-wide constraint with element-specific parameters. □ In addition to cardinality constraints, there are identity constraints, which describe whether construct roles must, may or cannot be described by the same instances. One type of identity constraint is already built into UEML: classes, properties, states and transformations are mutually exclusive on all levels. Also, state and transformation laws are mutually exclusive both with one another and with non-law properties. Identity constraints can apply to the same three levels as cardinalities, i.e., they can be construct or element level and, at the element level, they can be either language-wide or element-specific (or a combination, as we have seen). Example: The flow properties in Figures 2a-b may each be either 'incoming', 'outgoing' or both, but we assume as a construct-level constraint that every Action must have at least one flow that is exclusively 'incoming' or 'outgoing'. Also, in UML, Associations may connect a class with itself at the construct level, as in a parent/child relation from the class Human to itself. The same holds for UML-Associations with an aggregation end, so that, e.g., an Assembly may both contain and be part of other
Table 1. Types of constraints on modelling constructs and their model elements

Level / Type   Construct level   Element level, language-wide   Element level, element-specific
Relational     ...               ...                            ...
Cardinality    ...               ...                            ...
Identity       ...               ...                            ...
Behavioural    ...               ...                            ...
Assemblies. At the element level, however, the two constructs should behave differently. An instance of a regular Association may Link an Object to itself, but an aggregation Link should always involve distinct Objects (a language-wide elementlevel constraint). □ There are other constraints than cardinality and identity constraints too, of course. We will illustrate behavioural constraints with a brief example. Example: The actionLaw and the pre- and postconditions of Figure 2 are constrained at the construct level so that the precondition must always hold when the action described by the actionLaw starts and the postcondition must always hold when it ends. This constraint is construct-level because it is true of all actionLaws in all UML activity diagrams. At the element level, element-specific actionLaws and pre- and postconditions may be specified within the limits of this constraint, using a suitable language, such as OCL or Z. □ This concludes our walk-through of constraints on UEML's construct descriptions. Table 1 depicts the resulting typology of twelve types of constraints on three levels and of four sorts. Further levels and sorts can of course be added in the future, to account, e.g., for language- and model-wide and for ontology constraints.
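A language-wide, element-level cardinality constraint of the kind just discussed can be checked mechanically. The following sketch validates that the number of Objects connected at a Link end lies between that end's element-specific minimum and maximum; all names and data are invented for the illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinkEnd:
    min_card: int            # element-specific parameters of the constraint
    max_card: int
    objects: List[str]       # Objects actually connected at this Link end

def link_end_ok(end: LinkEnd) -> bool:
    """Language-wide check with element-specific parameters: the number of
    connected Objects must lie between the end's minimum and maximum."""
    return end.min_card <= len(end.objects) <= end.max_card

ends = [LinkEnd(1, 2, ["order-17"]), LinkEnd(0, 1, ["cust-3", "cust-9"])]
print([link_end_ok(e) for e in ends])   # [True, False]
```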
4 The Common Ontology Ontology mapping: The previous section showed how individual constructs are described as scenes. These scenes are related because their roles are played by classes, properties, states and transformations described in the same ontology. Example: The roles used to describe UML-Actions in Figure 2 are mapped into corresponding concepts in a common UEML ontology. For example, the executor role is mapped to an ExecutingThing ontology class, and the actionLaw maps to an ExecutionLaw property. Also, internalProperty maps to IntrinsicProperty, partWholeRelation to SystemPartWholeRelation and flow to Flow, which is a subtype (a successor) of MutualProperty. The before- and afterStates map to Triggering- and AnyState in the ontology. actionExecution maps to Exection (see Figure 4 later). □ Ontology structure: The common UEML ontology has the same structure as the individual construct scenes. In fact, Figure 1 depicts the structure of the UEML ontology itself, which corresponds to the structure of the earlier scenes of related roles. The ontology concepts are organised in four taxonomies (or subclass-/type hierarchies) of classes, properties (including state laws and transformation laws), states and transformations, respectively. Classes are organised in a conventional
Fig. 3. The ontology image for UML class diagrams. The individual class-diagram constructs map into parts of this diagram.
generalisation hierarchy where subclasses specialise superclasses. Properties form a precedence hierarchy, so that, for example, “being-human” succeeds “being-alive”, because everything that is human is also alive. States form a state hierarchy so that a more specific state refines less specific states (or-decomposition of states). Transformations are organised in a transformation hierarchy, where a more specific transformation elaborates one or more less specific ones. The taxonomy structure means that any class is somehow related to any other class, and accordingly for properties, states and transformations. Semantically similar constructs from the same or different languages will map to the same ontology concepts or to their super- and subtypes. Hence, it is always possible to derive scene-wise semantic correspondences between any pair of modelling constructs in fine detail, determined by the granularity of the four taxonomies. The division into distinct, but interrelated taxonomies also makes it possible to evolve the ontology over time without increasing complexity more than necessary as new concepts are added. New ontology concepts will always belong to a clearly identifiable taxonomy inside which they will always have a clearly identifiable location. Ontology relations: The ontology concepts have unique names as well as descriptions, and they are interrelated across taxonomies. For example, classes possess properties (that characterise the classes); properties define states; transformations have pre- and post-states; state laws restrict states; and transformation laws effect transformations. There are also additional relations within the same taxonomy: properties may be subproperties of complex ones; states may be regions of composite states (and-decomposition of states); transformations may be components of parallel transformations and steps in sequential ones. The ontology concepts and their relations have cardinality restrictions, which may be less restrictive than the constraints on construct roles. The ontology relations may also have role names (such as 'incoming' and 'outgoing' in Figure 2). Role names are sometimes necessary when properties belong to more than one thing or class, such as part-whole relations, mutual properties and class-subclass relationships.
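The following Python sketch is one possible, deliberately simplified way to represent the four taxonomies and a few cross-taxonomy relations described above; the concept names come from the running example, but the data structures and functions are invented for this illustration and are not the UEMLBase representation.

from collections import defaultdict

# Minimal sketch of the four taxonomies: each concept records its kind
# (class, property, state or transformation) and its direct supertypes.
taxonomy = {
    "Anything":        ("class", []),
    "ChangingThing":   ("class", ["Anything"]),
    "ExecutingThing":  ("class", ["ChangingThing"]),
    "AnyProperty":     ("property", []),
    "MutualProperty":  ("property", ["AnyProperty"]),
    "Flow":            ("property", ["MutualProperty"]),
    "AnyState":        ("state", []),
    "TriggeringState": ("state", ["AnyState"]),
    "Execution":       ("transformation", []),
}

# Cross-taxonomy relations (possesses, pre-state, ...), kept separate
# from the subtype links inside each taxonomy.
relations = defaultdict(list)
relations["possesses"].append(("ExecutingThing", "Flow"))
relations["pre-state"].append(("Execution", "TriggeringState"))

def ancestors(concept: str) -> set:
    """All supertypes of a concept within its own taxonomy."""
    result = set()
    for parent in taxonomy[concept][1]:
        result |= {parent} | ancestors(parent)
    return result

def semantically_related(c1: str, c2: str) -> bool:
    """Concepts of the same kind can be compared through their taxonomy:
    here, by sharing an ancestor or lying on the same subtype path."""
    same_kind = taxonomy[c1][0] == taxonomy[c2][0]
    return same_kind and bool(({c1} | ancestors(c1)) & ({c2} | ancestors(c2)))

print(semantically_related("Flow", "MutualProperty"))  # True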
Table 2. Mappings of individual class-diagram constructs into ontology concepts
[Table body not recoverable from the source text. Rows (UML modelling constructs): Class, Operation, Attribute, Association*, Comp. assoc., Generalization, Object, Property, Link*, Comp. link. Columns (UEML ontology concepts): Anything(c), Changing- && ExecutingThing(c), AssociatedThings(c), Subclass(c), Superclass(c), Component(c), Composite(c), IntrinsicProperty(p), MutualProperty(p), MutableProperty(p), PartWholeRelation(p), Class-subclass rel.(p), ExecutionLaw(tl), MutableState(s), AnyState(s), Execution(t). Cell entries are mappings of the form Lm,n, e.g. B1,1, T1,1, I1,1, T2,n, B0,n.]
* The composition/aggregation varieties of Association and Link are mapped separately as Composite association and Composite link.
Ontology images: In addition to comparing individual constructs within or across modelling languages, construct descriptions can be used to synthesise language descriptions, i.e., merged scenes for all the constructs in a language, to provide an overall picture of its referential semantics. The resulting ontology images may be useful for selecting complementary languages to use in a project, for learning and understanding languages and for validating and improving them. Figure 3 shows the ontology image for UML class diagrams. The image is an excerpt from the underlying UEML ontology. It is slightly simplified because it does not show that the relation between Anything and MutualProperty refines the one between Anything and AnyProperty. Another simplification is that the ExecutionLaw is in fact a ComplexProperty, with MutableProperty, IntrinsicProperty, Mutual Property and PartWholeRelation as Subproperties. Also, ExecutingThing is a subclass of ChangingThing, not a direct subclass of Anything. On the other hand, the figure does show cardinality constraints. These constraints could in principle be more restrictive than the cardinalities of the ontology itself. For example, although a MutualProperty can relate more than two AssociatedThings in the UEML ontology, it is possible to map a binary relation construct from some modelling language into it, which would have stricter cardinality constraints than the ontology constraints. There can be cardinality constraints on both the type (as in Figure 3) and instance levels. Table 2 shows how the central modelling constructs of UML class diagrams map into the ontology image of Figure 3, with UML modelling constructs as rows and UEML ontology concepts as columns. The meta class of each ontology concept is
Fig. 4. Excerpt of the ontology image for UML activity diagrams. The UML Action and ActionExecution constructs map into parts of this diagram.
indicated in parentheses (c=class, p=property, sl=state law, tl=transformation law, s=state and t=transformation). Each mapping has the form Lm,n, where L is instantiation level (either T=type, I=instance or B=both) and m and n are minimum and maximum cardinalities, respectively. For example, a UML-Operation represents exactly one Execution from an AnyState to a MutableState effected by an ExecutionLaw of a Changing- && ExecutingThing, meaning that UML-Operations map to the intersection of these two classes. The && notation is a shorthand to avoid naming subclasses that have no characteristic properties of their own. The table does not include AnyProperty because no construct maps to it (only to its subtypes). In UML 2.2, composition and aggregation is considered a type of Association (on the type level) or Link (on the instance level). The table treats composition/ aggregation separately from Associations/Links because they map to another ontology property (to PartWholeRelation instead of MutualProperty). The table illustrates the underlying ontology mappings in a compact way. The actual mappings are stored in UEMLBase as OWL individuals and properties. Figure 4 shows an excerpt of the ontology image for UML activity diagrams. The UML-Action roles of Figure 2a map into this diagram, which is again simplified in some respects (both the ResourceLocation and Flow properties are MutualProperties, Pre- and Postconditions are StateLaws and TriggeringState is not shown).
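As an illustration of the Lm,n mapping format (and not of UEMLBase's actual OWL encoding), the sketch below represents one mapping entry in Python; because the individual cell values of Table 2 could not be recovered here, the level letter and cardinalities in the example are assumed.

from dataclasses import dataclass

@dataclass
class ConceptMapping:
    """One cell of Table 2: a UML construct mapped to a UEML ontology
    concept at an instantiation level with min/max cardinalities."""
    construct: str  # UML modelling construct (table row)
    concept: str    # UEML ontology concept (table column)
    level: str      # 'T' = type, 'I' = instance, 'B' = both
    min_card: int
    max_card: str   # 'n' for unbounded, otherwise a number as a string

    def label(self) -> str:
        return f"{self.level}{self.min_card},{self.max_card}"

# Following the discussion above, a UML-Operation represents exactly one
# Execution; the level letter 'B' is assumed for this illustration.
m = ConceptMapping("Operation", "Execution", "B", 1, "1")
print(m.label())  # 'B1,1'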
5 Conclusion and Further Work The paper has explained how modelling-language semantics are handled in the Unified Enterprise Modelling Language (UEML), using UML's class and activity
diagrams as examples and going into particular detail about UML-Actions and ActionExecutions. It has discussed in more detail than previous papers the various types of constraints on modelling constructs and how they are expressed. It has also introduced the idea of ontology images in relation to UEML. Our discussion and typology of constraints paves the way for refining and formalising the existing construct descriptions in UEMLBase, which have so far been kept informal as the common ontology has stabilised. Our discussion of ontology images demonstrates that complex notations can have surprisingly simple images, making these images useful both for understanding and learning and for evaluating languages and validating language descriptions. Ontology images can also be used to select suitable notations or combinations thereof for particular modelling problems. One reason behind this simplicity is the separation of semantics from syntactical concerns. Another reason is the factoring out of type/instance considerations from the ontology level to the mappings. Limited space only allows a glimpse of the current and possible future treatment of UML as part of UEML. We have left out consideration of language-level constraints, which reflect, e.g., that all the Actions and other types of nodes in a UML activity diagram must be connected, possibly pointing to a future fourth level in Table 1. We have also left out the question of which formal notation and analysis/reasoning tool we want to use for UEML and its description of UML. Whereas an early paper used OCL [15] and the UEMLVerifier is implemented in SWI-Prolog, stronger formalisation may require more advanced analysis tools such as Alloy [16] or NuSMV [17]. Finally, we have not considered syntax, leaving for further work the possibility of extracting syntax rules from the semantic descriptions described in this paper. Early investigations indicate that this is indeed possible meaning that, in the future, standard and domain-specific languages can be defined primarily in semantic terms through ontology mappings and constraints, from which basic syntax rules can be derived automatically. We refer to earlier papers [11, 14] that have described the many other advantages of a common, structured, precise and interoperable approach to defining modelling languages. One advantage is evident already from the examples in this paper. UML's class and activity diagrams have become closely semantically interrelated at the ontology level as a side effect of describing the semantics of their constructs, as witnessed by the many ontology concepts that are common to Figures 3 and 4. Further work is needed to explore the use of UEML for integrated enterprise and IS modelling across modelling language borders, using the mappings into the common ontology to facilitate, e.g., cross-language consistency checking between models, automatic update reflections and even model-to-model translations. In order to reach this longerterm goal, the current UEML ontology and its construct mappings must be refined and formalised and the tool support for UEML extended. Acknowledgments. The author is indebted to all the researchers and research students who contributed to the Domain Enterprise Modelling in Interop-NoE, in particular Giuseppe Berio, Mounira Harzallah and Raimundas Matulevičius.
References 1. Kelly, S., Lyytinen, K., Rossi, M.: MetaEdit+: A fully configurable multi-user and multitool CASE and CAME environment. In: Constantopoulos, P., Vassiliou, Y., Mylopoulos, J. (eds.) CAiSE 1996. LNCS, vol. 1080, pp. 1–21. Springer, Heidelberg (1996) 2. OMG: UML 2.2 infra- and superstructure. Object Management Group (2009), http://www.omg.org (Accessed 2010-06-29) 3. GRL.: GRL Ontology (2010), http://www.cs.toronto.edu/km/GRL/ (Accessed 2010-05-10) 4. Dietz, J.L.G.: Enterprise Ontology: Theory and Methodology. Springer, Berlin (2006) 5. Wand, Y., Weber, R.: On the Ontological Expressiveness of Information Systems Analysis and Design Grammars. Journal of Information Systems 3, 217–237 (1993) 6. Evermann, J., Wand, Y.: Towards ontologically based semantics for UML constructs. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds.) ER 2001. LNCS, vol. 2224, p. 354. Springer, Heidelberg (2001) 7. Opdahl, A.L., Henderson-Sellers, B.: Ontological evaluation of the UML using the BungeWand-Weber model. Software and Systems Modelling 1(1), 43–67 (2002) 8. Uschold, M., King, M., Moralee, S., Zorgios, Y.: The Enterprise Ontology. Knowledge Engineering Review 13, 31–89 (1998) 9. Falkenberg, E.D., Hesse, W., Lindgreen, P., Nilsson, B.E., Oei, J.L.H., Rolland, C., Stamper, R.K., Van Assche, F.J.M., Verrijn-Stuart, A.A., Voss, K.: FRISCO: A Framework of Information System Concepts. In: The IFIP WG 8.1 Task Group FRISCO (1996) 10. Fox, M.S., Gruninger, M.: Enterprise Modeling. AI Magazine 19(3), 121–190 (1998) 11. Anaya, V., Berio, G., Harzallah, M., Heymans, P., Matulevičius, R., Opdahl, A.L., Panetto, H., Verdecho, M.J.: The Unified Enterprise Modelling Language – Overview and Further Work. Computers in Industry 61(2) (2010) 12. Opdahl, A.L.: The UEML Approach to Modelling Construct Description. In: Doumeingts, G., Müller, J., Morel, G., Vallespir, B. (eds.) Enterprise Interoperability - New Challenges and Approaches, Springer, Berlin (2007) 13. Berio, G., Opdahl, A., Anaya, V. Dassisti, M.: DEM1: UEML 2.1. Interop-NoE DEM deliverable (2005), http://www.interop-vlab.eu/ei_public_deliverables/ interop-noe-deliverables (accessed June 30 2010) 14. Opdahl, A.L., Henderson-Sellers, B.: A Template for Defining Enterprise Modelling Constructs. Journal of Database Management 15(2) (2004) 15. Opdahl, A.L., Henderson-Sellers, B.: Template-Based Definition of Information Systems and Enterprise Modelling Constructs. In: Green, P., Rosemann, M. (eds.) Ontologies and Business System Analysis, ch.6, Idea Group Publishing, USA (2005) 16. Jackson, D.: Alloy: A Lightweight Object Modelling Notation. ACM Transactions on Software Engineering and Methodology 11(2), 256–290 (2002) 17. Miller, S.P., Whalen, M.W., Cofer, D.D.: Software Model Checking Takes Off. Communications of the ACM 53(2), 58–64 (2010)
Data Modeling Is Important for SOA Michael Blaha OMT Associates Inc. Placida, FL USA [email protected]
Abstract. The promise of SOA is being held back by a lack of rigor with XSD interchange files. Many developers focus on the design of individual services and pay little attention to how the services fit together and collectively evolve. Enterprise data modeling is the solution to this problem. A data model is essential for grasping the entirety of services and abstracting services properly. A data model also provides a guide for combining services in flexible ways. Several examples illustrate the benefits. Keywords: data model, enterprise data model, SOA, XML, XSD, UML, integration, development in the large.
1 SOA and Data Modeling: Current Practice SOA is an acronym for Service-Oriented Architecture, an approach for organizing business functionality into meaningful units of work. Instead of placing logic in application silos, SOA organizes functionality into services that transcend the various departments and fiefdoms of a business. A service is a well-defined unit of work that is packaged for easy access. Services communicate by passing data back and forth. Such data is typically expressed in terms of XML, the eXtensible Markup Language that has been standardized by the W3C. (XML has additional purposes, but our focus here is on SOA.) XML combines data with metadata that defines the data’s structure. The W3C has another language — XSD (XML Schema Definition) — for defining XML data structure. XSD defines the metadata that governs the invocations of a service. XSD can specify data details such as the fields that can be included, their hierarchical nesting, and whether they are required or optional. Fig. 1 shows an excerpt of an XML file that requests the rate for a rental car [16]. Fig. 2 shows an XSD file that could be used to define the XML file’s structure. XSD files, such as Fig. 2, are typically constructed with a tool such as Liquid XML or Turbo XML. An editing tool helps developers construct a data hierarchy for a service and then produces the corresponding XSD code. The XSD code is used for generating and checking XML data as the service executes. The current SOA practice and XSD editors focus on data for individual services. There is little attempt to look across services. There is a lack of data modeling to align services throughout an enterprise. This leads to a nightmare where companies create interconnected XSDs that are nearly impossible to understand, let alone manage [7].
Fig. 1. Sample XML file. XML is the usual language for transmitting service data.
The lack of attention to data modeling is pervasive throughout the SOA community. In our consulting work we’ve encountered few companies that pay attention to SOA data models. The SOA literature also overlooks data modeling. For example [6] has a thorough treatment of SOA but says little about data modeling. We have yet to find a SOA application standard that includes a data model. Instead standards focus on individual XSD files — this is like looking at a forest via individual trees. There are two problems with modeling SOA data via XSDs. First, an XSD file shows only a fragment of the underlying enterprise model. By its very nature, an enterprise model is pervasive and spans multiple XSD files — for a subject area or an entire enterprise. Second, XSD files force data into a hierarchy that is seldom a natural representation. A better way to represent data is with a network approach, such as the UML class model.
2 SOA and Data Modeling: Proposed Practice Thus the current practice has inconsistent and balkanized data that leads to impaired services. The value of services comes not from their individuality, but from their contribution to a larger whole. Individual services must be coordinated and strategized so that they can combine in flexible ways. “SOAs’ real potential lies in the ability to compose services and enable new functionality compositions that can fulfill users’ current — and often changing — requests on the fly. To accomplish this, information must be exchangeable among all composed services...” [17] Developers must counter the natural tendency towards chaos, as different organizations define services over time. As Carey notes, “...an under examined piece of the SOA puzzle is how data access and integration fit into the overall SOA architecture for an enterprise...” [9] Atkinson and Bostan observe that the SOA paradigm emphasizes functionality and down plays data, seeming “to go out of its way to break the principles of data abstraction” [3]. Accordingly we propose a change to the existing practice — the building of an enterprise data model in tandem with the building of services. The enterprise data model summarizes existing services as well as guides future services. Each service uses a subset of the enterprise data model.
<xsd:element name = "VehAvailRQCore">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name = "VehRentalCore">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name = "PickUpLocation">
              <xsd:complexType>
                <xsd:attribute name = "LocationCode" type = "xsd:string"/>
            <xsd:element name = "ReturnLocation">
              <xsd:complexType>
                <xsd:attribute name = "LocationCode" type = "xsd:string"/>
          <xsd:attribute name = "PickUpDateTime" type = "xsd:string"/>
          <xsd:attribute name = "ReturnDateTime" type = "xsd:string"/>
      <xsd:element name = "VendorPrefs">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name = "VendorPref" minOccurs="0" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:attribute name = "CompanyShortName" type = "xsd:string"/>
                <xsd:attribute name = "Code" type = "xsd:string"/>
                <xsd:attribute name = "PreferLevel" type = "xsd:string"/>
                <xsd:attribute name = "Status" type = "xsd:string"/>
Fig. 2. Sample XSD file. An XSD file defines the structure of the XML files for a service.
We further propose the use of a UML class model for representing enterprise data. The XSD notation is too verbose for such a representation. Also an XSD hierarchy is skewed towards a single service. In contrast, a UML class model can transcend individual services. The issue then becomes how to derive an XSD file for a service from an enterprise model. The most difficult aspect of mapping UML class models to XSD files is the handling of associations that reach across hierarchies. The key is the treatment of identity [4]. Associations can reach across hierarchies by referencing external identifiers. We distinguish external identifiers (unique combinations of real-world attributes) from internal identifiers (meaningless fields that are unique and used for internal links). Developers can recover ideas from existing XSDs and incorporate them into an enterprise data model via reverse engineering. In this case, the input is the XSDs (or XML data implying XSDs) and the output is an enterprise data model. Each XSD file gives a piecemeal glimpse of the underlying enterprise model. Since XSD files often lack a uniform abstraction basis, it can be difficult to merge XSD files. It is best to identify subject areas, integrate within the subject areas, and then integrate for the enterprise. Note that the enterprise data model defines concepts as the intellectual basis for services. Thus you cannot merely construct a literal data model of the requirements. It does not suffice to have a rote representation of the source use cases. Instead you must abstract requirements to reconcile inconsistencies and get at the deeper meaning. Such an abstract model is more profound, more stable, more extensible, and more valuable to a business. By necessity, the coupling between an enterprise model and XSDs will be loose, as SOA services will be of different vintages and will have different snapshots of the evolving enterprise model. Fig. 3 shows a simple enterprise data model. The appropriate XSD hierarchy depends on the service. Each service has a root and fleshes out lower levels by traversing the enterprise model. Thus the findOrders service has Order as level 1. Customer and ProductType are 1 traversal away from Orders and at level 2. Supplier is at level 3 under ProductType. The use of an enterprise data model is a necessary but not a sufficient technology. An enterprise data model makes it possible to integrate services, but does not cause XSDs to be integrated. Developers still must have personal discipline and use a robust development process. In contrast, with an XSD editor alone it is all but impossible to obtain the overall perspective that makes integration possible, regardless of the development process.
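A minimal sketch of the traversal idea just described, using the enterprise model of Fig. 3: starting from a service's root class, each class's level is its association distance from the root. The code and its encoding of the model are illustrative only, not a proposed tool.

from collections import deque

# Undirected association graph of the small enterprise model in Fig. 3.
enterprise_model = {
    "Customer":    ["Order"],
    "Order":       ["Customer", "ProductType"],
    "ProductType": ["Order", "Supplier"],
    "Supplier":    ["ProductType"],
}

def service_levels(root: str) -> dict:
    """Breadth-first traversal from the service's root class: the level
    of each class is its association distance from the root."""
    levels, queue = {root: 1}, deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in enterprise_model[current]:
            if neighbour not in levels:
                levels[neighbour] = levels[current] + 1
                queue.append(neighbour)
    return levels

# findOrders: Order at level 1, Customer and ProductType at level 2,
# Supplier at level 3 -- matching the hierarchy described in the text.
print(service_levels("Order"))
# {'Order': 1, 'Customer': 2, 'ProductType': 2, 'Supplier': 3}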
3 Specification vs. Implementation Another way to think about SOA is to regard an XSD file as part of the specification for a service. The XSD file is just a partial specification as functionality must also be documented. The actual service code is then the implementation. This distinction between specification and implementation is analogous to that with the Eiffel programming language [15]. Eiffel rigorously separates specification from implementation. An Eiffel contract is the specification that programming code implements. Eiffel presumes that the contract is more difficult to change than the code. Eiffel classes communicate only via the specification that invokes the code.
[Figure 3 (diagram not reproduced): a UML enterprise model in which Customer (name, address, phoneNumber) has many Orders (orderNumber, dateTime), Orders relate many-to-many to ProductTypes (name, code), and ProductTypes relate many-to-many to Suppliers (name, address, phoneNumber); below it, the element hierarchies derived for the FindCustomers, FindOrders, FindProducts and FindSuppliers services.]
Fig. 3. Simple enterprise data model. An enterprise data model can coordinate services across an organization.
Since a service can reference other services, a change to an XSD specification is disruptive and to be avoided. In particular, “the stability of service interfaces is the key to SOA success” [6]. In contrast, service code is internal and not directly accessed. The code for a service can be changed as long as it is correct, executes efficiently, and is built professionally. Thus developers could substitute a faster algorithm or broaden the capabilities of a service without disrupting clients that invoke it. When we build software we often start with a data model of the critical concepts [5] [8]. In a similar manner, SOA development should give prominence to an enterprise data model and use it to help drive the vision for the SOA roadmap.
4 Example: Services for a Large Company A large company’s experience with services illustrates the drawbacks of disjointed XSD files. The company currently has about 100 XSD files. The services have been designed by different teams over a period of several years and so, not surprisingly, the XSD files are inconsistent and redundant. The chaos is getting worse as the number of services grows. Forecasts call for a 10 to 100 fold increase in services over the upcoming years. There are multiple flaws. • Redundancy. The XSD files have much redundancy. For example, there is an address XSD file. In addition the bank and person XSD files each have their own address data that differs from the address XSD file. • Element vs. attribute. There is no obvious reason why some fields are defined as XSD elements and others as XSD attributes. For example, address has a stateProvince element and identification has a stateProvince attribute. • Data types. Data types are inconsistent. For example, some dates are defined as strings; others are defined as dates.
• Element/attribute multiplicity. The XSD files are haphazard with their use of required fields. • Element inclusion vs. element reference. The XSD files have no clear policy for embedding a local element vs. referencing a global element. There are also problems with the files collectively. It is difficult to find concepts and this is only going to worsen as the number of XSD files increases. Similarly, with so many XSD files it is unclear about where to place new concepts. One reason for this chaos is the lack of XSD enterprise modeling tools. Most XSD tools present a hierarchy for an individual service. Few tools can take a data model and generate XSDs. The tools are handicapped by a lack of agreement in the literature for how to map data models to XSDs. Also a tool must devise a user interface so that a developer can indicate how to traverse an enterprise model to generate a hierarchy. A further problem is the distributed ownership of data. Many organizations have weak central control because most of the IT budget is allocated to departments and individual projects. The benefits of centralized control are diffuse and the lack of control only gradually becomes apparent as information systems age. IT management often lacks the expertise and incentives to deal with such gradual, long-term problems. The use of an enterprise data model would not only reconcile the XSD files, but would also improve understanding and provide the underpinning for a more rigorous development practice.
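A small sketch of the kind of cross-XSD consistency check that an enterprise data model would make systematic, here limited to one of the flaws listed above (dates declared as strings); the folder name and the name-based heuristic are assumptions made for the example.

import glob
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

def suspicious_date_fields(xsd_path: str) -> list:
    """Flag elements/attributes whose name suggests a date but whose
    declared type is xsd:string -- one of the inconsistencies noted above."""
    findings = []
    root = ET.parse(xsd_path).getroot()
    for node in root.iter():
        if node.tag in (XSD_NS + "element", XSD_NS + "attribute"):
            name = node.get("name", "")
            if "date" in name.lower() and node.get("type") == "xsd:string":
                findings.append((xsd_path, name))
    return findings

# Hypothetical usage over a folder of service XSDs.
for path in glob.glob("services/*.xsd"):
    for where, field in suspicious_date_fields(path):
        print(f"{where}: field '{field}' looks like a date but is typed xsd:string")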
5 Example: Open Travel Alliance Standard We prepared a data model for the car portion of the message users guide [16]. There are twenty-six use cases with XML data that cover scenarios for car services. The use cases structure the data into different hierarchies. The model is incomplete because the sample data lacks details such as whether data is required or optional. Nevertheless the data model did aid our understanding and is more concise than the XML data. The style of the car XML data was more uniform than the XSD files in Section 4 — this is not surprising given that the standard was created by the same team all at once. Fig. 4 and Fig. 5 show data models for customer and vehicle from the car data.
6 Example: Digital Weather Standard References [10] and [11] present an XSD specification for digital weather. The data is essentially just a collection of many numbers for various kinds of measurements — such as temperature, precipitation, wind speed, and cloud cover — as well as the date, time, and location. Intrinsically, there are few cross references across the hierarchies, much less than with the typical business information system. The XSD design protocol is mostly uniform, as would be expected with a standard. However, there is some variation in the choice of XSD element and XSD attribute with no obvious reason for the variation.
[Figure 4 (diagram not reproduced): a UML class model for Customer with associated classes Document, Telephone, Email, Address, PersonName, Country, AddedDriver, CustLoyaltyProgram and CustLoyaltyAccount, together with their attributes and multiplicities.]
Fig. 4. UML customer data model — from Open Travel Alliance
7 Example: ACORD Life, Annuity and Health Standard We skimmed this standard [1] [2]. The amount of explanation is overwhelming. The documentation is thorough and extensive — 3542 pages define 462 objects. The documentation would be better yet if accompanied by a data model highlighting the major concepts and relationships.
8 Example: GraphML Standard GraphML is a file interchange format for applications that use graphs [12] [13]. A core language describes graph structure and an extension mechanism handles application-specific data. Fig. 6 shows the data model for graph structure. The model reflects the structure defined in the XSD files as well as our understanding of graphs. A graph is a set of nodes and edges. A node is something that is of interest. An edge is a coupling between nodes. Nodes and edges can connect directly or they can connect via intermediate ports. A port is a defined position on a node for making a connection.
[Figure 5 (diagram not reproduced): a UML class model for Vehicle with associated classes VehicleIdentity, VehicleClass, VehicleMakeModel, VehicleType, VehicleRentalDetails, ConditionReport, FuelLevelDetails and OdometerReading, together with their attributes and multiplicities.]
Fig. 5. UML vehicle data model — from Open Travel Alliance
GraphML supports both directed and undirected graphs; the edges in a graph are directed or undirected by default as indicated by edgeDefault. An edge can override the graph default (via directed); thus a graph can have both directed and undirected edges. GraphML also supports hyperedges — a generalized edge that can connect more than two nodes. An endpoint is an end of a hyperedge. Ordinary edges have two endpoints — source and target. Hyperedges, by definition, have multiple endpoints. Hyperedges cannot directly connect to nodes and only connect via endpoints and ports. GraphML has XSD definitions for occurrences (Graph, Edge, Node, Hyperedge, Port, and Endpoint) as well as for types (GraphType, EdgeType, NodeType, HyperedgeType, PortType, and EndpointType). The type definitions give rise to the data structure in [Fig. 6]. The type definitions use XSD elements sparingly and favor XSD attributes. The occurrence definitions refer to the corresponding type in what is often called the “Garden of Eden” style [14]. The GraphML example demonstrates another benefit of data modeling. Multiple XSD design practices are used in practice, such as use of element vs. attribute, reference to a global element vs. embedding a local element, and definition of structure via
[Figure 6 (diagram not reproduced): a UML class model of GraphML graph structure, with GraphType (id, edgeDefault) containing NodeType (id), EdgeType (id, directed), HyperEdgeType (id) and PortType (name); EdgeType has source/target and sourcePort/targetPort ends, and HyperEdgeType connects to nodes and ports through EndpointType (id, type).]
Fig. 6. UML data model for graph structure — from GraphML
occurrences or types. A model helps with problem understanding by setting aside these arbitrary, confounding design differences and instead focusing on the intrinsic essence of a problem. It is much easier to understand the content and scope of GraphML with the UML model in Fig. 6 than with multiple XSD files.
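As a small illustration of the edge-direction rule described in this section (a graph-wide default that individual edges may override), the following Python sketch reads an invented GraphML fragment with the standard library; it is not part of the GraphML tooling, and the fragment is made up for the example.

import xml.etree.ElementTree as ET

GRAPHML = """
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="n0"/> <node id="n1"/> <node id="n2"/>
    <edge source="n0" target="n1"/>
    <edge source="n1" target="n2" directed="true"/>
  </graph>
</graphml>
"""

NS = "{http://graphml.graphdrawing.org/xmlns}"
graph = ET.fromstring(GRAPHML).find(NS + "graph")
default_directed = graph.get("edgedefault") == "directed"

for edge in graph.findall(NS + "edge"):
    # A per-edge 'directed' attribute overrides the graph-wide default,
    # so a graph may mix directed and undirected edges.
    directed = edge.get("directed")
    is_directed = default_directed if directed is None else directed == "true"
    kind = "directed" if is_directed else "undirected"
    print(f"{edge.get('source')} -> {edge.get('target')}: {kind}")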
9 Conclusion The current XSD practice is disappointing. SOA technology is being held back by the lack of rigor with XSD interchange files. Developers work diligently on the logic of individual XSDs but pay little attention to how the XSD files fit together and collectively evolve. The focus is on designing in the small (individual services) rather than designing in the large (collections of services). The current practice is in many ways the antithesis of software engineering. The premise of software engineering is to think deeply about an entire problem, and only then start writing code. Instead SOA developers are looking only at individual services and overlooking integration issues. SOA practice can be improved by basing XSD files on a data model of the enterprise. There are several benefits of such an approach. • Global understanding. It is difficult to understand a collection of services by studying each XSD file, one at a time. An enterprise data model gives a comprehensive overview. Each XSD file expresses a subset of the enterprise model. • Consistency. An enterprise model can align services and their data • Communication. A data model provides a concise explanation that is a helpful prelude to a more detailed study of XSD code.
• Extensibility. A broad understanding of an enterprise helps developers determine where to add data and functionality for new services. • Expressiveness. The UML class model is a more natural representation for data than a hierarchy. Data modeling is only one of the technologies that is needed for SOA, but it is one that has been sorely lacking. Data modeling can yield profound insights that reduce the complexity and risks of SOA development. Data modeling can help ensure that services align with the needs of a business and that they scale as deployment ramps up. This paper explains the benefits of an enterprise data model. We have taken some early steps to apply an enterprise data model to services, but do not yet have experimental data to demonstrate an improvement. Such a demonstration would be an important topic of further research.
Acknowledgements We thank Paul Brown, Rod Sprattling, and Patti Lee for their helpful suggestions.
References 1. ACORD Home page, http://www.acord.org/Pages/default.aspx 2. ACORD XSD schema, http://schemas.liquid-technologies.com/ LibraryDocs/Accord/Life%20Standards/2.20.01/ 3. Atkinson, C., Bostan, P.: The Role of Congregation in Service-Oriented Development. In: PESOS 2009, Vancouver, Canada, May 18-19, pp. 87–90 (2009) 4. Blaha, M.: Patterns of Data Modeling. CRC Press, New York (2010) 5. Blaha, M., Rumbaugh, J.: Object-Oriented Modeling and Design with UML, 2nd edn. Prentice Hall, Upper Saddle River (2005) 6. Brown, P.C.: Implementing SOA. Addison-Wesley, New York (2008) 7. Brown, P.: Personal communication 8. Carey, M.J.: SOA What? IEEE Computer 41(3), 92–94 (2008) 9. Carey, M., Reveliotis, P., Thatte, S., Westmann, T.: Data Service Modeling in the AquaLogic Data Services Platform. IEEE Congress On Services (2008) 10. Digital Weather Home Page, http://www.nws.noaa.gov/ndfd/ 11. Digital Weather XSD schema, http://schemas.liquid-technologies.com/ LibraryDocs/DWML/0/ 12. GraphML Home Page, http://graphml.graphdrawing.org/index.html 13. GraphML XSD schema, http://schemas.liquid-technologies.com/ LibraryDocs/GraphML/1.0/ 14. Lammel, R., Kitsis, S., Remy, D.: Analysis of XML Schema Usage. In: XML 2005 Conference (2005) 15. Meyer, B.: Applying Design by Contract. IEEE Computer 25(10), 40–51 (1992) 16. OpenTravel TM Alliance Message Users Guide (June 2009), http://www.opentravel.org/Specifications/Default.aspx 17. Tolk, A., Diallo, S.Y.: Model-Based Data Engineering for Web Services. IEEE Internet Computing, 65–70 ( July/August 2005)
Representing Collectives and Their Members in UML Conceptual Models: An Ontological Analysis Giancarlo Guizzardi Ontology and Conceptual Modeling Research Group (NEMO), Federal University of Espírito Santo (UFES), Vitória (ES), Brazil [email protected]
Abstract. In a series of publications, we have employed ontological theories and principles to evaluate and improve the quality of conceptual modeling grammars and models. In this article, we continue this work by conducting an ontological analysis to investigate the proper representation of types whose instances are collectives, as well as the representation of a specific part-whole relation involving them, namely, the member-collective relation. As a result, we provide an ontological interpretation for these notions, as well as modeling guidelines for their sound representation in conceptual modeling. Keywords: representation of collectives and their members, ontological foundations for conceptual modeling, part-whole relations.
1 Introduction In recent years, there has been a growing interest in the application of Foundational Ontologies, i.e., formal ontological theories in the philosophical sense, for providing real-world semantics for conceptual modeling languages, and theoretically sound foundations and methodological guidelines for evaluating and improving the individual models produced using these languages. In a series of publications, we have successfully applied ontological theories and principles to analyze a number of fundamental conceptual modeling constructs ranging from Roles, Types and Taxonomic Structures, Relations, Attributes, Weak Entities and Datatypes, among others (e.g., [1-3]). In this article we continue this work by investigating a specific aspect of the representation of part-whole relations. In particular, we focus on the ontological analysis of collectives and of a specific part-whole relation involving them, namely, the member-collective relation. Parthood is a relation of fundamental importance in a number of disciplines including cognitive science [4-6], linguistics [7-8], philosophical ontology [9-11] and conceptual modeling [1-3]. In ontology, a number of different theoretical systems have been proposed over time aiming to capture the formal semantics of parthood (the so-called mereological relations) [9,10]. In conceptual modeling, a number of so-called secondary properties have been used to further qualify these relations. These include distinctions which reflect different relations of ontological dependence, such as the distinction between essential and mandatory parthood [1,2]. Finally, in
linguistic and cognitive science, there is a remarkable trend towards the definition of a typology of part-whole relations (the so-called meronymic relations) depending on the different types of entities they relate [7]. In general, these classifications include the following three types of relations: (i) subquantity-quantity (e.g., alcohol-wine, milk-milk shake): modeling parts of an amount of matter; (ii) component-functional complex (e.g., mitral valve-heart, engine-car): modeling aggregates of components, each of which contributes to the functionality of the whole; (iii) member-collectives (e.g., tree-forest, lion-pack, card-deck of cards, brick-pile of bricks). This paper should then be seen as a companion to the publications in [2] and [3]. In the latter, we managed to precisely map the part-whole relation for quantities (the subquantity-quantity relation) to a particular mereological system. Moreover, in that paper, we managed to demonstrate which secondary properties are implied by this relation. In a complementary manner, in [2], we exposed the limitations of classical mereology to model the part-whole relations between functional complexes (the component – functional complex relation). Additionally, we also managed to further qualify this relation in terms of the aforementioned secondary properties. The objective of this paper is to follow the same program for the case of the member-collective relation. The remainder of this article is organized as follows. Section 2 reviews the theories put forth by classical mereology and discusses their limitations as theories of conceptual parthood. These limitations include the need for a theory of (integral) wholes to be considered in addition to a theory of parts. In section 3, we discuss collectives as integral wholes and present some modeling consequences of the view defended there. Moreover, we elaborate on some ontological properties of collectives that differentiate them not only from their sibling categories (quantities and functional complexes), but also from sets (in a set-theoretical sense). The latter aspect is of relevance since collectives as well as the member-collective relation are frequently taken to be identical to sets and the set membership relation, respectively. In section 4, we promote an ontological analysis of the member-collective relation, clarifying how this relation stands w.r.t. basic mereological properties (e.g., transitivity, weak supplementation, extensionality) as well as regarding the modal secondary property of essential parthood. As an additional result connected to this analysis, we outline a number of metamodeling constraints that can be used for the implementation of a UML modeling profile for representing collectives and their members in conceptual modeling. Section 5 presents some final considerations.
2 A Review of Formal Part-Whole Theories 2.1 Mereological Theories In practically all philosophical theories of parts, the relation of (proper) parthood (symbolized as <) stands for a strict partial ordering, i.e., an asymmetric (2) and transitive relation (3), from which irreflexivity follows (1):
∀x ¬(x < x)    (1)
∀x,y (x < y) → ¬(y < x)    (2)
∀x,y,z (x < y) ∧ (y < z) → (x < z)    (3)
These axioms amount to what is referred to in the literature by the name of Ground Mereology (M), which is the core of any theory of parts, i.e., the axioms (1-3) define the minimal (partial ordering) constraints that every relation must fulfill to be considered a parthood relation. Although necessary, these constraints are not sufficient, i.e., it is not the case that any partial ordering qualifies as a parthood relation. Some authors [10] require an extra axiom termed the weak supplementation principle (4) as constitutive of the meaning of part and, hence, consider (1-3) plus (4) (the so-called Minimal Mereology (MM)) as the minimal constraints that a mereological theory should incorporate.
∀x,y (y < x) → ∃z (z < x) ∧ ¬overlap(z,y)    (4)
An extension to MM has then been created by strengthening the supplementation principle represented by (4). In this system, (4) is thus replaced by the so-called stronger supplementation axiom1:
∀x,y ¬(y ≤ x) → ∃z (z ≤ y) ∧ ¬overlap(z,x)    (5)
Formula (5) is named the strong supplementation principle, and the theory that incorporates (1-5) is named Extensional Mereology (EM). A known consequence of the introduction of axiom (5) is that in EM, we have that two objects are identical iff they have the same (proper) parts, a mereological counterpart of the extensionality principle (of identity) in set theory. A second way that MM has been extended is with the aim of providing a number of closure operations to the mereological domain. As discussed, for example, in [9], theories named CMM (Closure Minimal Mereology) and CEM (Closure Extensional Mereology) can be obtained by extending MM and EM with the operations of Sum, Product, Difference and Complement, which are the mereological counterparts of the operations of union, intersection, difference and complement in set theory. In particular, with an operation of sum (also termed mereological fusion), one can create an entity which is the so-called mereological sum of a number of individuals. 2.2 Problems with Mereology as a Theory of Conceptual Parts Mereology has shown itself useful for many purposes in mathematics and philosophy [9,10]. Moreover, it provides a sound formal basis for the analysis and representation of the relations between parts and wholes regardless of their specific nature. However, as pointed out by [4,5] (among other authors), it contains many problems that make it hard to directly apply it as a theory of conceptual parts. As it shall become clear in the discussion that follows, on one hand the theory is too strong, postulating constraints that cannot be accepted to hold generally for part-whole relations on the conceptual level. On the other hand, it is too weak to characterize the distinctions that mark the different types of conceptual part-whole relations. A problem with ground mereology is the postulation of unrestricted transitivity of parthood. As discussed in depth in the literature [2,8], there many cases in which 1
The improper parthood relation (≤) in this formula can be defined as (x ≤ y) =def (x < y) ∨ (x = y).
transitivity fails. In general, in conceptual modeling, part-whole relations have been established as non-transitive, i.e., transitive in certain cases and intransitive in others. The problem with extensional mereologies from a conceptual point of view arises from the introduction of the strong supplementation principle (5), which states that objects are completely defined by their parts. If an entity is identical to the mereological sum of its parts, then changing any of its parts changes the identity of that entity. Ergo, an entity cannot exist without each of its parts, which is the same as saying that all its parts are essential parts. Essential parthood can be defined as a case of existential dependence between individuals, i.e., x is an essential part of y iff y cannot possibly exist without having that specific individual x as part [1]. A stereotypical example of an essential part of a car is its chassis, since that specific car cannot exist without that specific chassis (changing the chassis legally changes the identity of the car). As discussed in depth in [1], essential parthood plays a fundamental role in conceptual modeling. However, while some parts of objects represented in conceptual models are essential, not all of them are so. The failure to acknowledge that can be generalized as the failure of classical mereological theories to take into account the different roles that parts play within the whole. As discussed in [1,3], a conceptual theory of parthood should also countenance a theory of wholes, in which the relations that tie the parts of a whole together are also considered. From a conceptual point of view, the problem with the theory of General (Classical) Extensional Mereology is related to the existence of a mereological sum (or fusion) for any arbitrary non-empty (but not necessarily finite) set of entities. Just as in set theory where one can create a set containing arbitrary members, in GEM one can create a new object by summing up individuals that can even belong to different ontological categories. For example, in GEM, the individual created by the sum of Noam Chomsky's left foot, the first act of Puccini's Turandot and the number 3, is an entity considered as legitimate as any other. As argued by [4], humans only accept the summation of entities if the resulting mereological sum plays some role in their conceptual schemes. To use an example: the sum of a frame, a piece of electrical equipment and a bulb constitutes an integral whole that is considered meaningful to our conceptual classification system. For this reason, this sum deserves a specific concept in cognition and name in human language. The same does not hold for the sum of the bulb and the lamp's base. Once more, we advocate that a theory of conceptual parthood must also comprise a theory of wholes. According to Simons [10], the difference between purely formal mereological sums and, what he terms, integral wholes is an ontological one, which can be understood by comparing their existence conditions. For sums, these conditions are minimal: the sum exists just as the constituent parts exist. By contrast, for an integral whole (composed of the same parts as the corresponding sum) to exist, a further unifying condition among the constituent parts must be fulfilled. A unifying condition or relation can be used to define a closure system in the following manner. A set B is a closure system under the relation R, or simply an R-closure system, iff:
cs 〈R〉 B =def (cl 〈R〉 B) ∧ (con 〈R〉 B)    (6)
where (cl 〈R〉 B) means that the set B is closed under R (R-Closed) and (con 〈R〉 B) means that the set B is connected under R (R-Connected). R-Closed and R-Connected are then defined as:
cl 〈R〉 B =def ∀x ((x∈B) → ∀y ((R(x,y) ∨ R(y,x)) → (y∈B)))    (7)
con 〈R〉 B =def ∀x ((x∈B) → ∀y ((y∈B) → (R(x,y) ∨ R(y,x))))    (8)
An integral whole is then defined as an object whose parts form a closure system induced by what Simons terms a unifying (or characterizing) relation R.
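Definitions (6)-(8) can be transcribed almost directly into executable checks over a finite set and relation. The following Python sketch does so for a toy "same tour group" relation (the person names follow the guide/group example used later in the paper); it is an illustration of the definitions, not part of the formalism.

def related(x, y, R):
    """R(x,y) or R(y,x)."""
    return (x, y) in R or (y, x) in R

def r_closed(B, R, universe):
    """cl<R>B (def. 7): everything related to a member of B is itself in B."""
    return all(y in B for x in B for y in universe if related(x, y, R))

def r_connected(B, R):
    """con<R>B (def. 8): any two distinct members of B are related under R."""
    return all(related(x, y, R) for x in B for y in B if x != y)

def closure_system(B, R, universe):
    """cs<R>B (def. 6): B is both R-closed and R-connected."""
    return r_closed(B, R, universe) and r_connected(B, R)

# A toy 'same tour group' relation among five people.
universe = {"Paul", "Marc", "Lisa", "Richard", "Tom"}
same_group = {(a, b) for a in ("Paul", "Marc", "Lisa")
                     for b in ("Paul", "Marc", "Lisa") if a != b}
same_group |= {("Richard", "Tom")}

print(closure_system({"Paul", "Marc", "Lisa"}, same_group, universe))  # True
print(closure_system({"Paul", "Marc"}, same_group, universe))          # False: not closed
print(closure_system(universe, same_group, universe))                  # False: not connected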
3 What Are Collectives? In an orthogonal direction to the mereological theories just discussed, there are foundational theories in linguistic and cognitive science developed to offer a characterization of the relation of parthood. A classical work in this direction is the one of Winston, Chaffin and Herrmann [7] (henceforth WCH). WCH propose an account of the notion of parthood by elaborating on different types of part-whole relations depending on different ways that a part can be related to a whole. These distinctions have proven themselves fundamental for the development of a general parthood theory for conceptual modeling. Moreover, as it has been shown in a number of publications, issues such as transitivity, essentiality of parts, as well as the definition of characterizing relations, are not orthogonal to these fundamental distinctions. For instance, [3] demonstrates that: (i) the subquantity-quantity relation obeys the axiomatization of the so-called Extensional Mereology (EM), i.e., it is an irreflexive, anti-symmetric and transitive relation; (ii) all subquantities of a quantity are essential parts of it; (iii) quantities are unified by a relation of topological maximal self-connectedness. According to WCH, the main distinction between collections and quantities is that the latter but not the former are said to be homeomeros wholes. In simple terms, homeomerosity means that the entity at hand is composed solely of parts of the same type (homo=same, mereos=part). The fact that quantities are homeomerous (e.g., all subportions of wine are still wine) causes a problem for their representation (and the representation of relationships involving them) in conceptual modeling. In order to illustrate this, we use the example depicted in figure 1.a below. In this specification, the idea is to represent that a certain portion of wine is composed of all subportions of wine belonging to a certain vintage, and that a wine tank can store several portions of wine (perhaps an assemblage of different vintages). However, since Wine is homeomeros and infinitely divisible in subportions of the same type, we have that if a Wine portion x has as part a subportion y then it also has as parts all the subparts of y [3]. Likewise, a wine tank storing two different "portions of wine" actually stores all the subparts of these two portions, i.e., it stores infinite portions of wine. In other words, maximum cardinality relations involving quantities cannot be specified in a finite manner. As discussed, for instance, in [3], finite satisfiability is a fundamental requirement for conceptual models which are intended to be used in areas such as Databases and Software Engineering. This feature of quantities, thus, requires a special treatment so that they can be properly modeled in structural conceptual models, and one that does not take quantities to be simply mereological sums of subportions of the same kind [3].
[Figure 1 (diagram not reproduced): (a) a portion of Wine composed of subportions of Wine; (b) a Guide (1) responsible for (1..*) Groups of People, where a Group of People may have other Groups of People as parts; an {essential=true} tag appears on one of the part-whole associations.]
Fig. 1. UML Representations of a Quantity (a-left) and a Collective (b-right) with their respective parts
As correctly defined by WCH, collectives are not homeomeros. They are composed of parts that are not of the same kind (e.g., a tree is not a forest). Moreover, they are also not infinitely divisible. As a consequence, a representation of a collection as a mereological sum of entities (analogous to a set of entities) does not lead to the same complications as for the case of quantities. Take, for instance, the example depicted in figure 1.b, which represents a situation analogous to that of figure 1.a. In contrast with the former case, there is no longer the danger of an infinite regress or the impossibility of specifying finite cardinality constraints. In figure 1.b, the usual maximum cardinality of "many" can be used to express that a group of people has as parts possibly many other groups of people and that a guide is responsible for possibly many groups of people. Nonetheless, in many examples (such as this one), the model of figure 1.b implies a somewhat counterintuitive reading. In general, the intended idea is to express that, for instance, John, as a guide, is responsible for the group formed by {Paul, Marc, Lisa} and for the other group formed by {Richard, Tom}. The intention is not to express that John is responsible for the groups {Paul, Marc, Lisa}, {Paul, Marc}, {Marc, Lisa}, {Paul, Lisa}, and {Richard, Tom}, i.e., that being responsible for the group {Paul, Marc, Lisa}, John should be responsible for all its subgroups. A simple solution to this problem is to consider groups of people as maximal sums, i.e., groups that are not parts of any other groups. In this case, depicted in figure 2, the cardinality constraints acquire a different meaning and it is no longer possible to say that a group of people is composed of other groups of people.
[Figure 2 (diagram not reproduced): a Guide (1) responsible for (1..*) Groups of People, with Group of People now modelled as a maximal sum and no part-whole association to itself.]
Fig. 2. Representation of Collections as Maximal Sums
This solution is similar to taking the meaning of a quantity K to be that of a maximally-self-connected-portion of K [3]. However, in the case of collections, topological connection cannot be used as a unifying or characterizing relation to form an integral whole, since collections can easily be spatially scattered. Nonetheless, another type of connection (e.g., social) should always be found. A question begging issue at this point is: why does it seem to be conceptually relevant to find connection relations leading to (maximal) collections? As discussed in the previous section, collections taken as arbitrary sums of entities make little cognitive sense: we are not interested in the sum of a light bulb, the North Sea, the number 3 and Aida’s second act. Instead, we are interested in aggregations of individuals that have a purpose for some cognitive task. So, we require all collectives in our system to form closure
systems unified under a proper characterizing relation. For example, a group of people of interest can be composed of all those people that are attending a certain museum exhibition at a certain time, or all the people under 18 who have been exposed to some disease. Now, by definition, a closure system is maximal (see formula (6)), thus, there can be no group of people in this same sense that is part of another group of people (i.e., another integral whole unified by the same relation). Some authors (e.g., [5]) propose that the difference between a collection and a functional complex is that whilst the former has a uniform structure, the latter has a heterogeneous and complex one. We propose to rephrase this statement in other terms. In a collection, all member parts play the same role type. For example, all trees in a forest can be said to play the role of a forest member. In complexes, conversely, a variety of roles can be played by different components. For example, if all ships of a fleet are conceptualized as playing solely the role of "member of a fleet" then it can be said to be a collection. Contrariwise, if this role is further specialized into "leading ship", "defense ship", "storage ship" and so forth, the fleet must be conceived as a functional complex. In summary, collections as integral wholes (i.e., in a sense that appeals to cognition and common sense conceptual tasks) can be seen as limit cases of Gerstl and Pribbenow's functional complex [5], in which parts play one single role forming a uniform structure. Finally, we would like to call attention to the fact that collectives are not sets and, thus, the member-collective relation is not the same as set membership (∈). Firstly, collectives and sets belong to different ontological categories: the former are concrete entities that have spatiotemporal qualities; the latter, in contrast, are abstract entities that are outside space and time and that bear no causal relation to concrete entities [1]. Secondly, unlike sets, collectives do not necessarily obey an extensional principle of identity, i.e., it is not the case that a collective is completely defined by the sum of its members. We take that some collectives can be considered extensional by certain conceptualizations; however, we also acknowledge the existence of intentional collectives obeying non-extensional principles of identity [6]. Thirdly, collectives are integral wholes unified by proper characterizing relations; sets (as mereological sums), in contrast, can be simply postulated by enumerating their members (or parts). This feature of the latter is named ontological extravagance and it is a feature to be ruled out from an ontological system [9]. Finally, we do not admit the existence of empty or unitary collectives, contrary to set theory which admits both the empty set ∅ and sets with a unique element. As a consequence, we eliminate a feature of set theory named ontological exuberance [9]. Ontological exuberance refers to the feature of some formal systems that allows for the creation of a multitude of entities without differentiation in content. For instance, in set theory, the elements a, {a}, {{a}}, {{{a}}}, {…{{{a}}}…} are all considered to be distinct entities. We shall return to some of these points in the next section.
4 The Member-Collection Relation

According to [8], classical semantic analyses of plurals and groups distinguish between atomic entities, which can be singular or collectives, and plural entities. From a linguistic point of view, the member-collection relation is considered to be
one that holds between an atomic entity (e.g., John, the deck of cards) and either a plural (e.g., {John, Marcus}) or a collective term (e.g., the children of Joseph, the collection of antique decks). Before we can continue, a formal qualification of this notion of atomicity is required. Suppose an integral whole W unified under a relation R. By using this characterizing relation R, we can then define a composition relation
motivations of mereology in the first place [9]. Of course, one can state that we should require characterizing relations to be informative, i.e., it is not the case that any formal predicate should count as a characterizing relation. But if we take singleton properties to count as characterizing relations, we need to be much more careful in differentiating which properties should count as informative and which should not. Given these two reasons, we adopt in this paper the view that weak supplementation should be part of the axiomatization of the member-collective relation. This, obviously, does not imply that we cannot have single-track CDs or single-article journal issues. Following [1], in these cases, we consider the relation between, for instance, the tracks and the CDs to be a relation of constitution as opposed to one of parthood. Relations of constitution abound in ontology. An example is the relation between a marble statue and the single portion of marble that constitutes it [1].

The discussion in this section is summarized as follows: (i) member-collective is an irreflexive, anti-symmetric and intransitive relation; moreover, it obeys the weak supplementation axiom; (ii) a member x of a collective W is atomic w.r.t. the collective, which means that if an entity y is part of x, then y is not a member of W; (iii) collectives are not necessarily extensional entities, but if there is a member of a collective W which is essential to W, then all other members of W are essential to it.
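For reference, a standard mereological formulation of weak supplementation, adapted here to a member-collective relation M and an overlap predicate O (this is a common textbook rendering under our own notation, not a formula quoted from this paper), is

\forall x, W \; \big( M(x,W) \rightarrow \exists y \, ( M(y,W) \wedge \neg\, O(x,y) ) \big)

that is, whenever x is a member of a collective W, there is another, non-overlapping member of W; this is consistent with the rejection of empty and unitary collectives above.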
Fig. 3. Examples of member/collection part-whole relations
5 Final Considerations

The development of suitable foundational theories is an important step towards the definition of precise real-world semantics and sound methodological principles for conceptual modeling languages. This article complements a sequence of papers that aim at addressing the three fundamental types of wholes prescribed by theories in linguistics and cognitive sciences, namely, functional complexes, quantities, and collectives. The first of these roughly corresponds to our common sense notion of object and, hence, the standard interpretation of objects (or entities) in the conceptual modeling literature is that of a functional complex. The latter two categories, in contrast, have traditionally been neglected both in conceptual modeling and in the ontological analyses of conceptual modeling grammars. In this paper, we conduct an ontological analysis to investigate the proper representation of types whose instances are collectives, as well as the representation of an important parthood relation involving them. As a result, we are able to provide a sound ontological interpretation for this notion, as well as modeling guidelines for the proper representation of collectives in conceptual modeling. In addition, we have managed to provide a precise qualification of the member-collective relation w.r.t.
both classical mereological properties (e.g., transitivity, weak supplementation, extensionality) and the modal secondary property of essentiality of parts. Finally, the results advanced here contribute to the definition of concrete engineering tools for the practice of conceptual modeling. In particular, the metamodel extensions and associated constraints outlined here have been implemented in a Model-Driven Editor using available UML metamodeling tools [12].

Acknowledgement. This research is funded by the Brazilian Research Funding Agencies FAPES (Grant# 45444080/09) and CNPq (Grant# 481906/2009-6).
References
[1] Guizzardi, G.: Ontological Foundations for Structural Conceptual Models. Telematica Institute Fundamental Research Series, The Netherlands (2005)
[2] Guizzardi, G.: The Problem of Transitivity of Part-Whole Relations in Conceptual Modeling Revisited. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 94–109. Springer, Heidelberg (2009)
[3] Guizzardi, G.: On the Representation of Quantities and their Parts in Conceptual Modeling. In: Frontiers in Artificial Intelligence, vol. 209, pp. 103–116. IOS Press, Amsterdam (2010)
[4] Pribbenow, S.: Meronymic Relationships: From Classical Mereology to Complex Part-Whole Relations. In: Green, R., Bean, C.A. (eds.) The Semantics of Relationships, pp. 35–50. Kluwer, Dordrecht (2002)
[5] Gerstl, P., Pribbenow, S.: Midwinters, End Games, and Bodyparts. A Classification of Part-Whole Relations. International Journal of Human-Computer Studies 43, 865–889 (1995)
[6] Bottazzi, E., et al.: From Collective Intentionality to Intentional Collectives: An Ontological Perspective. Cognitive Systems Research 7(2-3), 192–208 (2006)
[7] Winston, M.E., Chaffin, R., Herrmann, D.: A taxonomy of part-whole relations. Cognitive Science 11(4), 417–444 (1987)
[8] Vieu, L., Aurnague, M.: Part-of Relations, Functionality and Dependence. In: Categorization of Spatial Entities in Language and Cognition. John Benjamins, Amsterdam (2007)
[9] Varzi, A.C.: Parts, wholes, and part-whole relations: The prospects of mereotopology. Data and Knowledge Engineering 20, 259–286 (1996)
[10] Simons, P.M.: Parts. An Essay in Ontology. Clarendon Press, Oxford (1987)
[11] Bittner, T., Donnelly, M., Smith, B.: Individuals, Universals, Collections: On the Foundational Relations of Ontology. In: Frontiers in Artificial Intelligence, vol. 114. IOS Press, Amsterdam (2010)
[12] Benevides, A.B., Guizzardi, G.: A Model-Based Tool for Conceptual Modeling and Domain Ontology Engineering in OntoUML. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24, pp. 528–538. Springer, Berlin (2009)
UML Activities at Runtime
Experiences of Using Interpreters and Running Generated Code

Dominik Gessenharter
Institute of Software Engineering and Compiler Construction, Ulm University, Ulm, Germany
[email protected]
Abstract. The execution semantics of activities in UML is based on a token flow concept. As flows from a source to a target may contain control nodes and thus tokens may flow to different targets depending on other concurrent flows or on guards annotated to edges, the computation of possible flows is complex. Rules defining when tokens may traverse an edge can be (and most often are) implemented in interpreters. Generating code is possible, too, but it is rarely seen in academic as well as in commercial tools. However, the compilation of activities to code may speed up the execution of activities. In this paper, we present an interpreter for activities, an enhanced interpreter using static analysis of activities before executing them as well as a code generation approach. We compare these different techniques with regard to runtime behavior and consumption of resources.
1 Introduction
There are two main reasons for executing activities: one is debugging or early simulation of a system's behavior, the other one is the execution of activities when the system is operating. Depending on the purpose of an activity execution, limitations to its performance may be acceptable or not. The semantics of activities being defined on the basis of a token flow concept encourages using interpreters computing possible token flows step by step during execution. However, computing steps at runtime tends to slow down an execution, which might be acceptable while debugging. Compiling activities beforehand may speed up executions but requires statements to be arranged in a way that all possible execution traces of an activity are covered. This is difficult because of the complex synchronization semantics of activities that cannot be covered by the structure of the source code but needs some evaluation performed at runtime. In this paper, we show benefits and drawbacks of the different approaches for executing activities. Due to space limitations, we omit signal handling as well as concepts of the packages CompleteActivities and StructuredActivities. A short explanation of the UML semantics of activities is given in Sect. 2. Afterwards, an interpreter for activities is briefly described in Sect. 3. A second
interpreter, a further development of the first one, is introduced in Sect. 4. An approach to code generation for activities is presented in Sect. 5. Benefits and drawbacks of the presented approaches are discussed in Sect. 6; Sect. 7 surveys related work. Concluding remarks and prospects for future work follow in Sect. 8.
2 The Runtime Semantics of UML Activities
The basic elements of behaviors in UML are actions, of which an activity is composed [8, p. 311]. The order of action executions is defined by control or object flows which connect actions. The execution semantics of activities in UML is based on a token flow concept. The following steps perform the execution of an action [8, p. 312]:

1. An action execution is created when a token is available at all incoming control flow edges, all guards annotated to the edges evaluate to true, and data tokens are available at all input pins.
2. The action execution is enabled when all tokens of input pins and incoming edges are consumed, i.e., the tokens cannot be consumed by another action.
3. An action may start executing after it has been enabled and continues executing until it has completed. If an action is located in an InterruptibleActivityRegion, it might be aborted.
4. When the action has completed, tokens are offered on all its output pins and outgoing edges, and the action execution terminates. The offered tokens may now flow to other actions, causing the creation of new action executions.

Initially, tokens are offered to all outgoing edges of InitialNodes. For actions with no incoming flows and no input pins, executions are created and enabled. If a token passes an edge that is defined to interrupt an InterruptibleActivityRegion, all executions within the region are aborted and all tokens inside are discarded (see Fig. 1, activity E, flow from Y to Z). Tokens reaching a FlowFinalNode are discarded, too. If a token reaches an ActivityFinalNode, the activity execution and all of its action executions are terminated.

Fig. 1 shows six activities. A simple sequence of actions is contained in activity A. The result of action Y is provided as an input for action Z. Activity B shows an error that is often contained in activity diagrams, for instance in the tutorial to UML-based Web Engineering [5]. Action X can never execute, as tokens must be offered to all incoming edges, one of which requires action X to have previously been executed. The execution of activity C has two possible traces: the execution of X if condition a is satisfied, or the execution of Y followed by Z if condition b is satisfied. If both conditions are satisfied, one trace is chosen non-deterministically. The activities D and E contain concurrent flows. The token offered at the outgoing edge of the initial node is duplicated at the fork node, causing the sequence of Y and Z to execute concurrently with the execution of X. The activity execution itself does not end before both X and Z have terminated.
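To make step 1 above concrete, the following minimal Java sketch checks whether an action execution may be created. It is an illustration only; the class and parameter names are ours and are not part of the UML specification or of any tool discussed later.

import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative check of the enablement rule: a token on every incoming control
// edge, all guards true, and a data token on every input pin.
final class EnablementCheck {
    static boolean canCreateExecution(List<Integer> tokensPerIncomingEdge,
                                      List<BooleanSupplier> guards,
                                      List<Integer> tokensPerInputPin) {
        for (int tokens : tokensPerIncomingEdge) {
            if (tokens == 0) return false;           // an incoming edge offers no token
        }
        for (BooleanSupplier guard : guards) {
            if (!guard.getAsBoolean()) return false; // a guard evaluates to false
        }
        for (int tokens : tokensPerInputPin) {
            if (tokens == 0) return false;           // an input pin holds no data token
        }
        return true;
    }
}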
Fig. 1. Activities showing simple sequences, alternative and concurrent sequences of actions (activities A–F with actions V, W, X, Y, Z and guards [a], [b]; diagrams not reproduced)
Upon starting activity E, X and Y start executing concurrently. If X terminates before Y does, a token is offered at the outgoing edge of action X. When Y terminates, a token passes the edge designated to interrupt the interruptible activity region and thus causes all tokens in the region to be discarded. This includes the token offered at the outgoing edge of X. Z executes, and after its termination no more flows are possible and the execution remains in a deadlock. If Y terminates before the termination of X, X is aborted and no token is offered to the outgoing edge, causing another deadlock. Activity F gives an example of complex synchronisation: depending on the satisfaction of the conditions a and b, action V may execute after the termination of X and Y, or of X and Z, or not at all if b is not satisfied. Action W may be executed never, once or twice, depending on conditions a and b.
3 The Interpreter Approach
An interpreter evaluates possible token flows and creates, enables, and terminates action executions at runtime. This requires computing token offers at every step of the execution and selecting actions which consume offered tokens. In the following, we present an interpreter that was developed as part of the ActiveCharts project [6,7].
3.1 Creation, Propagation and Selection of Token Offers
The intention of computing token offers is to find all tokens that may be consumed by actions, causing these actions to execute. Each token offered to an edge whose guard evaluates to true is included in the created token offer. For this purpose, all edges of a behavior are tested for the availability of tokens. If an activity edge is connected to a control node (ForkNode, JoinNode, DecisionNode, MergeNode), tokens must be offered to the outgoing edges in compliance with the UML semantics of the control node involved. In Fig. 1, most flows contain at most one control node. But flows may be composed into more complex
structures like the one contained in activity F. In this case, the propagation must be performed iteratively until all tokens are offered to actions, to object nodes, or to a final node. If a token is offered to a FlowFinalNode, this token is discarded; if a token is offered to an ActivityFinalNode, the execution of the activity terminates. Once the token offers have been computed, the interpreter selects actions for which executions may be created. When selecting an action to consume offered tokens, these tokens must be removed from all other offers containing tokens that are propagated from the same origin and are not duplicated. This situation is shown in activity C of Fig. 1, where actions X and Y compete for the same token offered at the initial node if both conditions a and b are satisfied. The selection of one offer causes all offers containing the same token to be invalidated.
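As an illustration of this selection and invalidation step, the following Java sketch shows one possible bookkeeping scheme; it is not code from the ActiveCharts interpreter, and all class and method names are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// A token offer groups tokens that are offered to one target (an action or node).
final class Token { }

final class TokenOffer {
    final Object target;
    final Set<Token> tokens;

    TokenOffer(Object target, Set<Token> tokens) {
        this.target = target;
        this.tokens = tokens;
    }
}

final class OfferSelection {
    // Selecting one offer removes every competing offer that contains one of the
    // same (non-duplicated) tokens, since a token can be consumed only once.
    static List<TokenOffer> select(TokenOffer chosen, List<TokenOffer> allOffers) {
        List<TokenOffer> remaining = new ArrayList<>();
        for (TokenOffer offer : allOffers) {
            if (offer == chosen) {
                continue;                            // the chosen offer is consumed
            }
            boolean sharesToken = false;
            for (Token t : chosen.tokens) {
                if (offer.tokens.contains(t)) {
                    sharesToken = true;              // competes for the same token
                    break;
                }
            }
            if (!sharesToken) {
                remaining.add(offer);                // this offer remains valid
            }
        }
        return remaining;
    }
}

In the situation of activity C in Fig. 1, selecting the offer to X would, in this scheme, invalidate the offer to Y, and vice versa.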
3.2 Interpreting an Activity
The first step of executing an activity is to generate the initial token offer. Afterwards, token offers are propagated to possible flow targets and targets are selected. For the selected targets, action executions are created, tokens are consumed, i.e., removed from the offers, and action executions are enabled. The UML defines various actions for different purposes, for instance reading or writing values of attributes, linking objects or destroying links, calling another behavior or an operation, sending a signal, or waiting for a signal event. Our interpreter supports CallBehaviorActions, CallOperationActions, AcceptEventActions, SendSignalActions and BroadcastSignalActions. If a CallBehaviorAction is executed, an execution for the associated behavior is started. The execution of a CallOperationAction results in the invocation of a method within the context of the executing behavior. The method that is to be invoked is identified by equality of the method's and the action's names. AcceptEventActions wait for an event to occur. The interpreter must deliver an event to an action waiting for it. An event occurs, e.g., when a SendSignalAction or a BroadcastSignalAction is executed. For each action execution that completes, tokens are offered at outgoing edges or output pins. The interpreter continues the activity execution by repeatedly executing token offer propagation and the selection and execution of flow targets. If a path from the origin of a token offer to the selected target contains an edge that is associated with an InterruptibleActivityRegion, tokens in the region have to be discarded and executing actions inside the region have to be aborted.
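The overall cycle can be pictured as the following schematic loop. This is a deliberately simplified Java sketch, not the ActiveCharts implementation: the interfaces ActivityModel and Action and all method names are hypothetical, and actions are run sequentially rather than concurrently.

import java.util.List;

interface Action { void run(); }
interface Offer { }   // placeholder for a token offer

interface ActivityModel {
    List<Offer> initialOffers();                      // offers at outgoing edges of initial nodes
    List<Offer> propagate(List<Offer> offers);        // push offers through control nodes
    List<Action> selectTargets(List<Offer> offers);   // choose actions and consume their tokens
    List<Offer> offersAfter(List<Action> executed);   // offers produced by completed actions
}

final class ActivityInterpreter {
    void execute(ActivityModel activity) {
        List<Offer> offers = activity.initialOffers();
        while (!offers.isEmpty()) {
            List<Offer> propagated = activity.propagate(offers);
            List<Action> selected = activity.selectTargets(propagated);
            if (selected.isEmpty()) {
                break;                                // nothing can fire: finished or deadlocked
            }
            for (Action action : selected) {
                action.run();
            }
            offers = activity.offersAfter(selected);
        }
    }
}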
4 Enhancements of the Interpreter Approach
The interpreter described above may compute propagations of token offers that have already been computed before during the same activity execution. An analysis of the flows within an activity may lead to a significant reduction of the computation effort during an activity execution. The following paragraphs describe the most significant enhancements made to our interpreter [2].
(a) Activity (diagram not reproduced): actions ActionA (with an output pin), Action1, Action2 and Action3, connected by edges e0–e4 with guards [a==0] and [b≥0].

(b) Lookup table:

  edge   target    guard
  e0     Action1   true
  e1     Action2   (a == 0)
  e2     Action3   (b ≥ 0)
  e3     Action2   true
  e4     Action3   true

Fig. 2. An activity and a token propagation lookup table
4.1 Focusing on Active Parts of Activities
The computation of token offers does not necessarily have to visit all edges of an activity in order to find tokens that might be part of a new token offer. Using a list containing the edges to which tokens are offered reduces the number of elements that have to be considered when searching for actions to enable. Initially, all outgoing edges of initial nodes are contained in this list. For object flows, a map is used that maps each model element to the data tokens that are offered to it. This map is initialized by adding pairs of outgoing edges of ActivityParameterNodes and the data tokens contained in the ActivityParameterNode.
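A possible shape for this bookkeeping is sketched below in Java; the structure is illustrative (plain string identifiers instead of a real metamodel) and does not reflect the actual data structures of the prototype.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ActiveParts {
    // Control flow: only the edges to which tokens are currently offered.
    final Deque<String> activeEdges = new ArrayDeque<>();

    // Object flow: model element id -> data tokens currently offered to it.
    final Map<String, List<Object>> offeredData = new HashMap<>();

    // The outgoing edges of initial nodes carry the first token offers.
    void addInitialEdge(String edgeId) {
        activeEdges.add(edgeId);
    }

    // Data tokens of an ActivityParameterNode are offered via its outgoing edge.
    void addParameterTokens(String edgeId, List<Object> dataTokens) {
        offeredData.put(edgeId, dataTokens);
    }
}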
4.2 Using Lookup Tables for Token Offer Propagation
As the propagation of token offers may consume much time if control nodes are involved, this computation is replaced by a lookup table. For each edge to which a token is offered, all actions to which the token may flow are listed. Fig. 2(b) shows the lookup table for the activity shown in Fig. 2(a). The first column of the lookup table lists all elements to which tokens may be offered. The second column contains a target for a token offered to an element. If a token can reach more than one target, the lookup table contains one entry for each target. The third column contains the guard that is to be evaluated. This guard is a conjunction of all guards annotated to the edges which the token passes on its way from the source to the target.
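The table of Fig. 2(b) can be represented, for instance, by a map from edges to (target, guard) entries, as in the following Java sketch; the names are hypothetical and guards are modelled as boolean suppliers evaluated at lookup time.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class PropagationTable {
    static final class Entry {
        final String target;              // action the token may reach
        final BooleanSupplier guard;      // conjunction of the guards along the path

        Entry(String target, BooleanSupplier guard) {
            this.target = target;
            this.guard = guard;
        }
    }

    private final Map<String, List<Entry>> table = new HashMap<>();

    void add(String edge, String target, BooleanSupplier guard) {
        table.computeIfAbsent(edge, e -> new ArrayList<>()).add(new Entry(target, guard));
    }

    // Actions that a token offered to the given edge may currently reach.
    List<String> reachableTargets(String edge) {
        List<String> result = new ArrayList<>();
        for (Entry entry : table.getOrDefault(edge, List.of())) {
            if (entry.guard.getAsBoolean()) {
                result.add(entry.target);
            }
        }
        return result;
    }
}

For the activity of Fig. 2(a), the entry for edge e1 would be added as table.add("e1", "Action2", () -> a == 0), assuming a variable a is accessible in the enclosing context.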
4.3 Computing Flows
After an action has terminated, tokens are offered to outgoing edges. Possible targets of a token flow are identified by reading entries of the lookup table. But enabling an action may require other token offers that are propagated from other actions. In activity A of Fig. 2(a), Action2 and Action3 both need two token offers: one from Action1 and another one from the output pin at ActionA, for which both actions compete. In order to identify possible flows and exclusions between flows, a lookup table containing all flow prerequisites for each action is helpful. Fig. 3 shows a list of flow prerequisites for the actions of activity A of Fig. 2(a).
  target action   requires tokens at
  Action1         e0
  Action2         e1
  Action2         e3
  Action3         e2
  Action3         e4

Fig. 3. Lookup table for flow prerequisites
Each possible flow target for which all flow prerequisites are satisfied may be selected for execution. For selected actions, one token must be removed from each element listed in the prerequisites table for this action. This may disqualify other actions from being a possible flow target. An additional column in the prerequisites table may be used to list actions that have to be aborted due to a token passing an interrupting edge. After the lookup table for all offered tokens has been processed, the actions that are contained in the list of actions to execute are executed. Upon the start of an activity, the list of actions to execute is initialized by inserting all actions that do not have incoming edges.
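The prerequisite check and the token consumption on selection can be sketched as follows (again hypothetical Java, not the prototype's code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrerequisiteTable {
    // target action -> elements (edges or pins) that must hold a token (cf. Fig. 3)
    private final Map<String, List<String>> prerequisites = new HashMap<>();
    // element -> number of tokens currently offered to it
    private final Map<String, Integer> availableTokens = new HashMap<>();

    void require(String action, List<String> elements) {
        prerequisites.put(action, elements);
    }

    void offerToken(String element) {
        availableTokens.merge(element, 1, Integer::sum);
    }

    boolean canFire(String action) {
        for (String element : prerequisites.getOrDefault(action, List.of())) {
            if (availableTokens.getOrDefault(element, 0) == 0) {
                return false;                 // a prerequisite has no token
            }
        }
        return true;
    }

    // Selecting an action consumes one token per prerequisite, which may
    // disqualify other actions competing for the same tokens.
    void select(String action) {
        for (String element : prerequisites.getOrDefault(action, List.of())) {
            availableTokens.merge(element, -1, Integer::sum);
        }
    }
}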
5 A Code Generating Approach
A typical feature of interpreters is that between every two execution steps, some computations have to be made in order to determine the next step. This additional effort can be omitted if all the information about the sequence of an execution is determined before the execution starts. Although this strategy is rarely discussed for activities, it is possible to compile activities before execution. The following paragraphs describe how code for activities can be generated. Fig. 4 shows a graph that is intended to depict an abstract activity. We refer to this activity in the following paragraphs. Except where noted otherwise, all nodes are considered to be actions.
5.1 Finding Sequences of Actions
Sequences of actions can be translated to a sequence of statements in source code where the effect of the statements equals the effect of executing the sequenced actions. We write a sequence [A; B] if action A is executed first, followed by the execution of action B. In Fig. 4, the following sequences are marked by a shaded background: [1; 2], [3; 5], [4; 6] and [7; 8]. Note that the outgoing edges of action 2 represent an implicit fork. Generating code requires building one thread for each sequence of actions. The thread is designed to call methods implementing the actions in the order given by the sequence. We write [A; B](T) for a thread T implementing the sequence [A; B]. Code for creating and running thread instances for the activity shown in Fig. 4 requires thread [1; 2](T1) to be created and started. T1 must create and start
Fig. 4. An abstract flow graph (nodes 1–8 with guards [X] and [Y]; diagram not reproduced)
the thread [3; 5](T2) if X evaluates to true and [4; 6](T3) if Y evaluates to true. The thread [7; 8](T4) must not be started before T2 and T3 terminate.
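One straightforward (hand-written, illustrative) way to realize these threads in Java is shown below; exec() is a stub standing in for the generated action code, and Thread.join() is used to delay T4 until T2 and T3 have terminated. This is a sketch of the idea, not output of the code generator.

public final class SequenceThreads {
    static void exec(int node) {
        System.out.println("executing node " + node);   // placeholder for the real action
    }

    public static void main(String[] args) throws InterruptedException {
        final boolean x = true, y = true;                // guards of the two outgoing edges

        Thread t2 = new Thread(() -> { exec(3); exec(5); });
        Thread t3 = new Thread(() -> { exec(4); exec(6); });

        Thread t1 = new Thread(() -> {
            exec(1); exec(2);
            if (x) t2.start();                           // [3; 5] runs only if X holds
            if (y) t3.start();                           // [4; 6] runs only if Y holds
        });
        t1.start();
        t1.join();

        t2.join();                                       // joining a never-started thread returns immediately
        t3.join();

        Thread t4 = new Thread(() -> { exec(7); exec(8); });
        t4.start();
    }
}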
5.2 Implementing Control Nodes

If node 7 of Fig. 4 is considered to be a join node, the thread [8](T4) must not be started before tokens are offered to all incoming edges of the join node. The join node itself is implemented as an object containing one list of tokens for each incoming edge. Each of the threads [3; 5](T2) and [4; 6](T3) calls a method of the join node object to add a token to the concerning list. The node object checks all lists representing incoming edges and starts thread T4 if none of them is empty (a sketch of such a join-node object is given after Listing 1.1). For translating a merge node, two concepts are applicable. The first is to add the sequence following the merge node to all sequences preceding it, e.g. by creating the two threads [3; 5; 7; 8](T2) and [4; 6; 7; 8](T3). Another implementation may use three threads [3; 5; T4'](T2), [4; 6; T4'](T3) and [7; 8](T4), where T4' represents the creation and starting of thread T4. The implementation of a fork node is similar to that of merge nodes. If node 2 in Fig. 4 is a fork node, the threads [1; T2'; T3'](T1), [3; 5](T2) and [4; 6](T3) are a suitable implementation, where T2' and T3' represent the creation and starting of the threads T2 and T3. If a decision node has guards of which at least one always evaluates to true, the decision node can be implemented as an if-then-else statement. If node 2 of Fig. 4 is a decision node, the thread [1; if(x) (3; 5) else (4; 6)](T1) is a suitable implementation. If both guards may evaluate to false, a naïve implementation is to add a loop before the if statement as shown in Listing 1.1.
If node 7 of Fig. 4 is considered to be a join node, the thread [8](T4 ) must not be started before tokens are offered to all incoming edges of the join node. The join node itself is implemented as an object containing one list of tokens for each incoming edge. Each thread [3; 5](T2 ) and [4; 6](T3 ) calls a method of the join node object to add a token to the concerning list. The node object checks all lists representing incoming edges and starts thread T4 , if none of it is empty. For translating a merge node, there are two concepts applicable. The first is to add the sequence following the merge node to all sequences preceding it, e.g. by creating the two threads [3; 5; 7; 8](T2 ) and [4; 6; 7; 8](T3 ). Another implementation may use three threads [3; 5; T4 ’](T2 ), [4; 6; T4 ’](T3 ) and [7; 8](T4 ) where T4 ’ represents the creation and starting of thread T4 . The implementation of a fork node is similar to that of merge nodes. If node 2 in Fig. 4 is a fork node, the threads [1; T2 ’; T3 ’](T1 ), [3; 5](T2 ) and [4; 6](T3 ) are a suitable implementation where T2 ’ and T3 ’ represent the creation and starting of the threads T2 and T3 . If a decision node has guards of which at least one always evaluates to true, a decision node can be implemented as an if then else statement. If node 2 of Fig. 4 is an decision node, the thread [1; if(x) (3; 5) else (4;6)](T1 ) is a suitable implementation. If both guards may evaluate to false, a na¨ıve implementation is to add a loop before the if statement as shown in Listing 1.1. while ( ! ( x | | y ) { ; } i f ( x ){ 3 ; 5; } else { 4 ; 6; }
Listing 1.1. Naïve implementation of a DecisionNode
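The join-node object described at the beginning of this subsection can be sketched in Java as follows; this is an illustration with names of our own choosing, not output of the code generator.

import java.util.ArrayList;
import java.util.List;

final class JoinNode {
    private final List<List<Object>> tokensPerEdge = new ArrayList<>();
    private final Runnable successor;        // e.g. creation and start of thread T4

    JoinNode(int incomingEdges, Runnable successor) {
        for (int i = 0; i < incomingEdges; i++) {
            tokensPerEdge.add(new ArrayList<>());
        }
        this.successor = successor;
    }

    // Called by a thread that offers a token on the given incoming edge.
    synchronized void addToken(int edgeIndex, Object token) {
        tokensPerEdge.get(edgeIndex).add(token);
        if (allEdgesHaveTokens()) {
            for (List<Object> edgeTokens : tokensPerEdge) {
                edgeTokens.remove(0);        // consume one token from every incoming edge
            }
            new Thread(successor).start();   // fire: start the successor sequence
        }
    }

    private boolean allEdgesHaveTokens() {
        for (List<Object> edgeTokens : tokensPerEdge) {
            if (edgeTokens.isEmpty()) {
                return false;
            }
        }
        return true;
    }
}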
5.3 Reducing Overhead of Thread Management
In order to reduce the number of threads that are to be created, a single thread class is generated. All sequences are contained in one switch statement. The sequence that is to be executed is determined by a variable (ID) of the thread instance. The implementation of a merge node no longer requires a new thread to be created. It is possible to change the ID of the thread in order to execute
the sequence. For a fork node, the current thread may be used for one outgoing edge of the node; for the other edges, new threads are to be created. For this technique of reusing threads, the switch statement is executed in a loop that is exited if the ID is set to a value defined to signal the thread to terminate. If the ID is changed, the sequence corresponding to the updated ID is executed in the next iteration of the loop. Listing 1.2 shows an implementation in pseudo-code where exec(i) represents the execution of node i of Fig. 4, node 2 is considered to be a fork node, and node 7 a merge node. The statement new Thread(3) represents the creation of a thread with ID=3; start() causes the thread to execute the loop.
switch (ID) {
    case 1: exec(1); new Thread(3).start(); ID = 2; break;
    case 2: exec(3); exec(5); ID = 4; break;
    case 3: exec(4); exec(6); ID = 4; break;
    case 4: exec(8); ID = 0; break;
}
Listing 1.2. Implementation of an activity with thread reuse.
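A directly runnable Java rendering of this pattern is given below; exec() is again a hypothetical stub, and the class is an illustration of the technique rather than actual generator output.

public final class ReusableThread extends Thread {
    private int id;

    ReusableThread(int id) {
        this.id = id;
    }

    private static void exec(int node) {
        System.out.println(Thread.currentThread().getName() + " executes node " + node);
    }

    @Override
    public void run() {
        while (id != 0) {                    // ID 0 signals termination
            switch (id) {
                case 1: exec(1); new ReusableThread(3).start(); id = 2; break;
                case 2: exec(3); exec(5); id = 4; break;
                case 3: exec(4); exec(6); id = 4; break;
                case 4: exec(8); id = 0; break;
                default: id = 0; break;
            }
        }
    }

    public static void main(String[] args) {
        new ReusableThread(1).start();       // the initial thread begins with sequence 1
    }
}

Because node 7 is a merge node, both the reused thread and the newly created one eventually reach case 4, so node 8 is executed once per incoming token, as in the pseudo-code.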
6 Comparing the Different Approaches
The table shown in Fig. 5 gives quantitative data of a performance test with a basic activity consisting of a sequence of 10 actions, a medium activity consisting of a sequence of 50 actions, and a complex activity consisting of 50 actions and 40 join nodes. Each activity is executed 10, 100 and 1000 times concurrently. The test results show that negative effects on the execution time affect the interpreter considerably if models grow more complex or if the number of simultaneously executed activities increases. Only if many activities of low complexity are executed concurrently may the creation of lookup tables and flow prerequisite tables consume more time than their use saves while executing the activities. The code size of interpreters is fixed, whereas generated code grows in size if models become more complex. But interpreters need a representation of the model, whereas generated code inherently contains the model. The implementation of join nodes or flows containing multiple control nodes requires more code than the representation of these elements in a model (e.g. in an XML representation). But for realistic applications like the journey planner presented in [3], the size of the generated code exceeds the size of an interpreter and a model file. Object flows, which are not discussed in this paper, produce many lines of code, especially if wrapper classes are needed to implement multiple output pins at single actions. Generated code is the fastest technique in terms of execution speed, but interpreters allocate less memory. The table of Fig. 5 does not contain the memory allocation data of the interpreter approach, as this prototype is implemented in another language (C#) and the numbers are therefore not significant. However, as C# is not known for lower performance than Java (which is used by our other presented approaches), execution times of the interpreter are included.
Fig. 5. Test data measured by executing three test activities (basic, medium, complex), each run 10, 100 and 1000 times concurrently, for the Code Generator, Lookup Tables and Interpreter approaches. Column key: 1 = number of concurrently executed activities; 2 = execution time per activity in ms; 3 = execution time for all activities in ms; 4 = code size in bytes; 5 = allocated memory in bytes (approx.). (Numeric table cells not reproduced.)
7 Related Work
The discussion of different approaches is limited to our own prototypes because our interpreter is based on the formalization of the token flow concept of UML [6] and is very well tested. We therefore assume that it correctly implements the runtime semantics of UML activities, particularly because its implementation is close to the textual description of the UML specification. The specification allows for defining different algorithms for the computation of flows as long as the effect is the same. We presented one alternative by using lookup tables. Other interpreters may use other techniques and achieve even better results. A problem is that the semantics of UML activities is not carefully considered by many approaches that use them, e.g. UML-based Web Engineering (UWE) [4]. As a tutorial for UWE [5] contains activities causing deadlocks like activity B in Fig. 1, we assume a modified runtime semantics that is not explicitly defined in the context of the given examples. Another code generation approach is the tool UJECTOR [9]. But only simple sequences like activity A of Fig. 1 are considered, whereas alternative or concurrent flows are not discussed. Our approach for code generation as well as our interpreters are designed to correctly implement alternatives and concurrency in UML activities. Another approach dealing with code generation for activities uses the parallel language Esterel for generating code for activities [1]. As shown in activity F of Fig. 1, actions used for synchronisation may depend on conditions that are to be evaluated at runtime. A mapping of concepts to fixed code structures without an evaluation of conditions at runtime is not sufficient. We show how these aspects are considered in different approaches without modifying the semantics.
8 Conclusion and Future Work
Our interpreters support debugging activities by executing single steps and showing tokens in the diagrams that are executed. Although possible, this feature is not yet included in our code generator. Interpreters may be used for debugging activities or for executing systems for which the time consumed by computing steps is not critical. For executing activities in a real system, i.e. not for debugging purposes, executing generated code might be the better choice. The technique of lookup tables for interpreters is a good way to speed up execution time. But there might be other algorithms for the computation of token flows which are more efficient than the algorithms verbalized in the UML specification. This paper summarizes experiences with our prototypes, but other implementations may use different techniques without having lower performance. Comparing the presented approaches with other tools remains future work, as does designing a set of models serving as an input for benchmarks.
References
1. Bhattacharjee, A., Shyamasundar, R.: Validated code generation for activity diagrams. In: Chakraborty, G. (ed.) ICDCIT 2005. LNCS, vol. 3816, pp. 508–521. Springer, Heidelberg (2005)
2. Bulach, A.: Untersuchungen zur Laufzeitverbesserung des ActiveCharts-Interpreters. Master's thesis, Universität Ulm (2008)
3. Gessenharter, D.: Extending the UML semantics for a better support of model driven software development. In: The 2010 International Conference on Software Engineering Research and Practice (SERP 2010), Workshop on Applications of UML/MDA to Software Systems (2010), accepted for publication (to appear)
4. Koch, N., Zhang, G., Baumeister, H.: UML-Based Web Engineering: An Approach Based on Standards. In: Web Engineering: Modelling and Implementing Web Applications, pp. 157–191 (2008), http://www.pst.ifi.lmu.de/veroeffentlichungen/uwe.pdf
5. LMU Ludwig-Maximilians-Universität München, Institute for Informatics, Programming and Software Engineering: UWE Examples 12 (2009), http://uwe.pst.ifi.lmu.de/exampleAddressBookWithContentUpdates.html
6. Sarstedt, S.: Semantic Foundation and Tool Support for Model-Driven Development with UML 2 Activity Diagrams. PhD thesis, Ulm University (2006)
7. Sarstedt, S., Kohlmeyer, J., Raschke, A., Schneiderhan, M.: A new approach to combine models and code in model driven development. In: International Conference on Software Engineering Research and Practice, International Workshop on Applications of UML/MDA to Software Systems (2005)
8. Object Management Group: UML 2.1.1 Superstructure Specification, Document formal/2007-02-05 (2007)
9. Usman, M., Nadeem, A.: Automatic Generation of Java Code from UML Diagrams using UJECTOR. International Journal of Software Engineering and Its Applications 3(2), 21–37 (2009)
Model-Driven Data Migration

Mohammed Aboulsamh, Edward Crichton, Jim Davies, and James Welch
Oxford University Computing Laboratory, Oxford, UK
{firstname.lastname}@comlab.ox.ac.uk

Abstract. The automatic generation of components from abstract models greatly facilitates information systems evolution, as changes to the model are easier to comprehend than changes to program code or service definitions. At each evolutionary step, however, any data already held in the system must be migrated to the new version, and to do this manually can be time-consuming and error-prone. This paper shows that it is possible to generate, automatically, an appropriate sequence of data transformations. It shows also how the applicability of a sequence of transformations may be calculated in advance, and used to check that a proposed evolution will preserve semantic integrity.
1 Introduction
Object models are widely used to describe data structures, architectural features, services and behaviours. These descriptions can be formalised by giving a precise interpretation to the classifications, associations, and constraints that an object model presents. Such precise interpretations lead naturally to the Model-Driven Architecture (MDA) [10], in which abstract models are used to generate or configure components—increasing the level of programming abstraction and reducing the cost of development.

The benefits of a model-driven approach apply not only to the initial implementation, but also to any subsequent updating or maintenance of the system. The cost of developing a new version of the system can be greatly reduced if aspects of the new implementation can be produced automatically from existing or revised models. However, a system in use may contain important data with complex semantics—constraints upon the values that may be assigned, and the relationships to be maintained. If the data is not also properly transformed to match the new implementation, then inconsistencies and errors may result.

For example, an information system might record a room booking in terms of entries in two calendars: one for the room, and one for the person who made the booking. The design of the system might then be updated with a more sophisticated arrangement, in which requests can be made for a type of room, and the status of a request may be either pending or confirmed. The room calendar data then needs to be properly transformed into booking requests. If it were not, then we might have the situation in which a person's calendar shows a booking, but there is no confirmed request for a room of that type. The data transformations required can be complex, and manual development of these transformations can be costly and error-prone. In this paper, we show
how the cost of development can be reduced, and certain classes of error eliminated, through a model-driven approach to data migration. We show how a modelling language definition can be extended to produce a language for representing updates to models, with operations representing the addition, modification, or deletion of model elements, and expressions representing the intended values of new or modified attributes. Using SQL databases as a representative domain, we show how changes to a UML model can be represented as update operations and expressions, and how these can be translated automatically into SQL procedures. We show also how we may determine an appropriate precondition for the specified migration, which may itself be translated into an SQL query. If the existing data satisfies this precondition, then the SQL procedures are guaranteed to produce data that is consistent with the constraints of the new model.
2 Metamodelling Evolution
In the context of model-driven engineering, an evolutionary step in the design of an information system corresponds to a change in the system model: removing or adding features, changing properties and associations. We may document such a change as a transformation in a language such as Query/View/Transformation (QVT) [15]: if this transformation were applied to the original model, then the new, evolved model would be the result. However, if we record information about the intent of the transformation—the ways in which the values of attributes in the new model are related to the values of attributes in the old—then we can do more than simply document the change: we can generate an appropriate data transformation, applicable at the instance level, which will transform data collected against the old model into a form suitable for storage against the new.

For example, if we were to update the design of an information system so that address information, formerly stored as a list of attributes in the Person class, were to have a class Address of its own, then the transformation could be documented in terms of the deletion of attributes from Person, and the creation of an associated class Address. If, however, we were able also to document the relationship between the attributes of Address and the former attributes of Person, then we may be able to derive a transformation at level M0 that would migrate existing data in an appropriate way.

To document such relationships, we must extend the language in which the model itself is written. In the case of UML, there are two different strategies for extension: heavyweight, in which the language itself is extended, as exemplified by the Common Warehouse Metamodel [13]; lightweight, in which the extension is expressed in terms of existing language features, as exemplified by the UML Testing Profile [14]. The lightweight approach will be sufficient for the purposes of this paper. Figure 1 shows a suitable profile for model evolution. The shaded classes represent existing concepts in UML, the remaining stereotypes and enumerations
introduce the features we need to record model edits and relationships between attributes of the two different models. The solid-headed arrow represents the extension of a UML concept or metaclass through a stereotype. In this profile, a Model is simply a collection of model elements, and is itself a named element in an EvolutionModel. Each evolution model is associated with two models, src and tgt, representing the system model before and after the evolutionary step. It is associated also with a single evolution operation, most likely a composite operation of class CompEvolOperation. Each component of this operation relates two models, as source and target; these are two points in a chain of models, leading from the src to the tgt of the overall evolution.

Fig. 1. A Profile for Evolution (stereotypes Model, EvolutionModel, EvolOperation, CompEvolOperation and PrimEvolOperation, with enumerations EvolType and CompType; diagram not reproduced)

A range of primitive operations may be defined, each with a relationship specification, presented as an OCL constraint. It is a simple matter to provide a complete set of primitives, simply by providing operations that add and delete each kind of model element, as well as operations that modify their features. These operations will have a range of parameters, depending upon the kind of element involved. They include, for example,

  addClass(name: String)
  modifyProperty(class: Class; name: Name; newName: Name;
                 newUpper: UnlimitedNatural; newLower: Integer)
We can add significant value through the identification of specific evolution patterns and their formalisation as additional, primitive operations. The literature on schema evolution is a rich source of candidates: see for example [2,7,9]. Those that are frequently applied include: the introduction of an association class; the in-lining of a class; and the repositioning of a feature within an inheritance hierarchy. Other useful patterns correspond to compound operations in a language editor, and may be automatically derived from the editor model. For example, we might have a renameProperty term, with just three parameters, for changing the name of a property but leaving all other properties with their current values. We might also have an inlineClass term, corresponding to a combination of evolutionary steps in which a class is deleted and its properties are added to an associated class.

Operations that add or modify elements may include expressions that specify new values for properties. For example, a property of a class may be modified by the following primitive operation:

  modifyPropertyWithValue(class: Class; name: Name; newName: Name;
                          newUpper: UnlimitedNatural; newLower: Integer;
                          newValue: OCLExpression)

Here, the OCL expression explains the value of that property in terms of the values of properties in the original model; this does not require the @pre construct of OCL, as the expression will be evaluated in the original context. Evolution operations may be composite or primitive, with the methods of composition described by the enumeration CompType. Two methods are mentioned here: sequential and parallel; the latter being useful if, for example, we wished to exchange the roles of two properties, or if we wished to delete an association together with its properties. Using the concrete textual syntax of ; and || for these operators, we can define operations such as inlineClass in terms of their component actions on model elements:

  inlineClass(Source,Target,property) =
    ( forall p : Source.properties .
        addPropertyWithValue(Target,p,Source.p.type,
          Source.p.upper, Source.p.lower, Source.p) ) ;
    deleteClass(Source)

where forall is an additional method in our language, implemented as an iterator over the declared set.

If the expressions supplied are computable, in the context of our platform-specific implementations, then the resulting language of transformations contains all of the information that we need to migrate the data against the new model. However, models formulated and maintained for the purpose of model-driven development will inevitably be subject to a range of implicit and explicit constraints. Although many proposed evolutions will involve restructuring and extension of models, many more will involve the introduction of additional constraints, or the modification of constraints already specified.
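As a rough illustration of how such operations could be represented in an object-oriented host language, the following Java sketch mirrors the primitive/composite structure of the profile. It is not part of the approach described in the paper, which defines the operations via a UML profile and the textual syntax above; all class names, the attribute names street and city, and the flattened rendering of inlineClass are hypothetical.

import java.util.Arrays;
import java.util.List;

interface EvolOperation { }

final class Primitive implements EvolOperation {
    final String kind;            // e.g. "addPropertyWithValue", "deleteClass"
    final List<String> arguments; // operation parameters; value expressions kept as OCL text

    Primitive(String kind, String... arguments) {
        this.kind = kind;
        this.arguments = Arrays.asList(arguments);
    }
}

final class Composite implements EvolOperation {
    enum CompType { SEQUENTIAL, PARALLEL }

    final CompType compType;
    final List<EvolOperation> parts;

    Composite(CompType compType, EvolOperation... parts) {
        this.compType = compType;
        this.parts = Arrays.asList(parts);
    }
}

final class InlineClassExample {
    // Roughly the shape of inlineClass(Address, Person, address): copy each
    // property of the source class into the target, then delete the source.
    // The attributes street and city are invented for this illustration.
    static EvolOperation inlineAddressIntoPerson() {
        return new Composite(Composite.CompType.SEQUENTIAL,
            new Primitive("addPropertyWithValue", "Person", "street", "String", "1", "1", "address.street"),
            new Primitive("addPropertyWithValue", "Person", "city", "String", "1", "1", "address.city"),
            new Primitive("deleteClass", "Address"));
    }
}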
The simplest, and most common, constraint evolution involves a change to the multiplicity of some property or association: for example, we might decide that a one-to-many association needs instead to be many-to-many, or vice versa. In UML, this will correspond to a change to the upper value associated with one of the properties. More complex constraints may be specified as class or model invariants, describing arbitrary constraints upon the relationships between values of properties across the model. If the conjunction of constraints after the evolutionary step is logically weaker than before, and the model is otherwise unchanged, then there is no doubt as to the feasibility of the corresponding data migration. Whatever data the system currently holds, if it is consistent with the original model, then it should also be consistent with the new model. However, where the conjunction of constraints afterwards is stronger than before, or the evolutionary step involves other changes to structures and values, then data may not fit: that is, the data migration corresponding to the proposed evolution might produce a collection of values that does not properly conform to the new model. It is thus not enough to produce a specification in our language of changes, suitable for automatic translation into a platform-specific implementation. We would wish also to determine, in advance, whether or not this program will succeed in migrating data collected against the old model into a system conforming to the new model: this may be difficult to determine at either the specification or the implementation level, and simply performing the migration and then testing to see whether it has succeeded may not be an acceptable strategy.
3 Example
As an example of how we may apply this approach, we will consider a model of a simple student management system. The class diagram of Fig. 2 describes the information held by the system, including the subject of each course, the name of each student, and the address and phone number of the contact record associated with a student. Each student object is associated with a single contact record and a set of courses for which they are currently registered. The association between students and courses is bi-directional, and the diagram includes the constraint that the information content in each direction must be consistent: for any course, every student s in the set students must have a reference to that course (self) in the s.registeredFor association; for any student, every course c in the set registeredFor must include a reference to that student (self) in the c.students association.

Other constraints of the model, not shown in the diagram, might describe properties that, although not essential to the consistency of the representation, may be important in terms of the external meaning of the data. For example, the following constraint would require that no student should be registered for a course that is due to run before the official start date of their studies:

  context Student
    registeredFor -> forall (c | c.date > startDate)
Classes: Student (name, dateOfBirth, startDate), Course (subject, date; operation register(s:Student)), Contact (address, phone). Associations: Student [*] registeredFor -- students [*] Course; Student -- contact [1] Contact. (Class diagram not reproduced.)

  context Student inv
    registeredFor -> forall(c | c.students -> includes(self))
  context Course inv
    students -> forall(s | s.registeredFor -> includes(self))

Fig. 2. A simple student management system
while the next would require that no student should be registered for more than one course in the same subject:

  context Student
    registeredFor -> forall (c, d | c <> d implies c.subject <> d.subject)

If constraints such as these are broken, the data may describe a situation which is undesirable, or even impossible, given our intended use and interpretation of the data. For example, if the constraint startDate > dateOfBirth did not hold, then startDate clearly does not correspond to the date of some formal registration or induction ceremony; either that, or the system contains some incorrect data. It is thus important that these constraints are taken into account in any data migration.

A simple evolution of this model might involve the addition of a property closingDate to the Course class in our example model, with the intention that this should represent the date by which all registrations should be completed. We may specify an initial, default value for this property for all existing courses using the following evolution operation:

  addPropertyWithValue(Course,closingDate,Date,1,1,date - 1 week)

This specifies that the closingDate should be one week in advance of the course. We may combine this operation with others to perform a more complex evolution. For example,

  inlineClass(Contact,Student,contact) ;
  ( addPropertyWithValue(Course,closingDate,Date,1,1,date - 1 week)
    || renameProperty(Student,date,startDate) ) ;
  addAssociationClass(Registration,Student,registeredFor,Course,students) ;
  addPropertyWithValue(Registration,registrationDate,Date,1,1,students.startDate) ;
  addPropertyWithValue(Registration,status,RegistrationStatus,1,1,confirmed) ;
  addOperation(Course,confirm,s:Student) ;
  addOperation(Course,cancel,s:Student)

describes an evolution of Fig. 2 into the model shown in Fig. 3, where the operations inlineClass and addAssociationClass have the obvious interpretations, and the Date and RegistrationStatus parameters to addProperty represent the intended types of the properties being added.

Fig. 3. A simple student management system, evolved (Student: name, dateOfBirth, startDate, contactAddress, contactPhone; Course: subject, closingDate, startDate, with operations register(s:Student), confirm(s:Student), cancel(s:Student); association class Registration: registrationDate, status; diagram not reproduced)

We might decide that the link between Course.date and Student.startDate is inappropriate, and that we should instead insist that no course registrations are made before a student has been admitted to the programme. At the same time, we do wish to insist that all course registrations are made before the closingDate of the course in question. We may achieve this effect by adding the following constraints to our model

  context Course
    students -> forall (r | r.registrationDate <= closingDate)
  context Student
    registeredFor -> forall (r | r.registrationDate >= startDate)

in the same evolutionary step that introduces registrationDate. This represents a perfectly reasonable evolution of the model, but it may be that there are Student–Course pairs in the existing data that cannot be successfully migrated. Any student record including a course registration within a week of admission will be mapped to a combination of Student, Registration, and Course that will not satisfy the new constraints: the specified registrationDate will be less than a week before the closingDate. Using SQL as the language of our platform-specific implementation, with a standard object-to-relational mapping, these evolution operations can be translated automatically to produce the following procedures:
  ALTER TABLE student ADD address VARCHAR (150) DEFAULT '' NOT NULL
  ALTER TABLE student ADD phone VARCHAR (25) DEFAULT '' NOT NULL
  UPDATE student AS TT SET address=(
    SELECT ST.address FROM contact AS ST, ...
    WHERE TT.pk=(SELECT AT.student_contactfk1
                 FROM student_contact_contact_student AS AT
                 WHERE TT.pk=AT.student_contactfk1)
  ...
  DROP TABLE contact
  ALTER TABLE course ADD closingdate DATE NULL
  UPDATE course SET closingdate=DATE_ADD(startdate,INTERVAL '-1' WEEK)
  ALTER TABLE course CHANGE date startdate DATE NULL
  ALTER TABLE student_registeredfor_course_students RENAME TO registration
  ...
  UPDATE registration SET status='confirmed'

The first block, ending in DROP TABLE contact, corresponds to the evolution step inlineClass(Contact,Student,contact). It creates the two new attributes in the student table, copies their values from the contact table, and then deletes the contact table. The remaining SQL corresponds to the addition and removal of properties, and the creation of an association class. The necessary and sufficient condition for this migration to succeed is given by the constraint identified above—that no student record includes a course registration within a week of admission. This may be implemented automatically as an SQL query:

  SELECT COUNT(*)
  FROM student AS ST, course AS TT,
       student_registeredfor_course_students AS RT
  WHERE ST.pk=RT.student_contactfk1
    AND TT.pk=RT.student_contactfk2
    AND ST.startdate <= DATE_ADD(TT.closingdate,INTERVAL '-1' WEEK)
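To show how the generated artefacts might be used together, the following Java/JDBC sketch runs the guard query first and applies the migration only if no violating rows are found. The connection URL, the elided SQL strings, and the assumption that the guard counts violating rows are all placeholders for illustration; the paper itself prescribes only the generated SQL, not this driver code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class MigrationRunner {
    public static void main(String[] args) throws SQLException {
        String guardQuery = "SELECT COUNT(*) FROM student ...";   // generated guard, elided here
        String[] migrationSteps = {
            "ALTER TABLE student ADD address VARCHAR(150) DEFAULT '' NOT NULL",
            // ... further generated statements ...
        };

        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/university");
             Statement stmt = con.createStatement()) {

            // Run the guard first; we assume here that the generated query counts
            // rows that would violate the constraints of the evolved model.
            try (ResultSet rs = stmt.executeQuery(guardQuery)) {
                rs.next();
                if (rs.getInt(1) > 0) {
                    System.out.println("Migration rejected: existing data violates the new constraints.");
                    return;
                }
            }

            // Guard passed: apply the generated migration statements in order.
            for (String sql : migrationSteps) {
                stmt.executeUpdate(sql);
            }
            System.out.println("Migration applied.");
        }
    }
}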
4 Discussion
The approach proposed in this paper can be summarised as follows: we describe an evolution in terms of a sequence of changes to an object model, annotated with expressions that record the relationship between the data semantics in the old and new models. From this description, we generate a program in the implementation language of the information system that migrates the data automatically, together with a check or guard to ensure that the migration will succeed. The mechanism by which the program and the guard are generated is easily explained. A transformation is defined between the evolution language and the
language of the platform-specific implementation: in the case of the example, this was SQL. This transformation can then be applied to any composite evolution operation, provided that the expressions used to specify values for properties are computable in the target language. The generation of the guard is more involved: the constraint information of the target model is mapped into a language of mathematical relations; constraints upon path expressions become constraints upon relations, making explicit any possibilities of aliasing; the proposed evolution is mapped to a sequence of substitutions; and the weakest precondition for the substitutions to achieve the constraints of the target model is calculated. A detailed explanation of how this approach may be applied to the automatic generation of methods from logical specifications can be found in [5,4].

The automatic calculation of a weakest precondition is possible only if the evolutionary step does not require some complex computation upon the data: for example, one described by a collection of mutually-recursive functions. Fortunately, it is likely that where the data in a new system bears a precise relationship to data in the old, that relationship is one that can be captured as a single function or substitution, by writing an expression for the initial value of each new data item in which any previous data item may appear. If this were not the case, correctness could be determined automatically only by running the proposed program on a copy of the data, installing that data in a system generated to match the new model, and checking to see whether the integrity constraints of that system are satisfied. This is less desirable: it requires that the new system be generated before we know whether the design is satisfactory; and it requires that the data be copied and transformed before we know that the transformation will succeed.

The evolution of object models is often associated with the notion of refactoring [8]. Ambler [1] has extended this to address the evolution of database designs, and parallels have been drawn [18,2] with schema evolution. However, as [12] observes, the approach—though pragmatic and useful—remains relatively informal, with no precise, computable characterisation of correctness.

More relevant to the proposed approach is the work of [11], in which OCL constraints are used, with a set of high-level constructs, to verify the correctness of model transformation; these constructs are also formalised in a UML profile. Having explored the applicability and limitations of existing languages such as QVT for the purpose of expressing constraint information, the authors propose a dedicated language for assessing transformation correctness. Furthermore, correctness is determined through existential quantification of the constraint information, asserting merely that target values can exist: this is entirely similar to the weakest precondition calculation used to determine the guard for a proposed evolution. The focus of their work is upon constraints, and the verification of manually-written transformations, rather than the automatic generation of M0 transformations from M1 and M2 descriptions—the goal of our data migration approach—but there is clear potential for its extension in this regard.
In respect of the calculation of guards, another related approach to model semantics—although not targeted specifically at data migration—is reported in [17]. To support reasoning about the satisfiability of specifications—in this case, of a conceptual schema—the UML class diagram and the OCL constraints are automatically translated into logical predicates. There is a considerable body of work on database schema evolution, for both object and relational designs [2,7,9,3,16]. This provides a sound theoretical foundation for automating the process of data migration. Similarly, in adopting a UML-based, model-driven approach, we are able to take advantage of existing work on model transformation and program generation: in particular, the work on deducing aspects of change from model comparison [6].
References

1. Ambler, S.W., Sadalage, P.J.: Refactoring Databases: Evolutionary Database Design (Addison Wesley Signature Series). Addison-Wesley Professional, Reading (2006)
2. Banerjee, J., Kim, W., Kim, H.J., Korth, H.F.: Semantics and implementation of schema evolution in object-oriented databases. SIGMOD 16(3) (1987)
3. Curino, C., Moon, H.J., Zaniolo, C.: Graceful database schema evolution: the prism workbench. PVLDB 1(1), 761–772 (2008)
4. Davies, J., Crichton, C., Crichton, E., Neilson, D., Sørensen, I.H.: Formality, evolution, and model-driven software engineering. ENTCS 130 (2005)
5. Davies, J., Welch, J., Cavarra, A., Crichton, E.: On the generation of object databases using Booster. In: Proceedings of ICECCS 2006 (2006)
6. Eclipse Compare Project (2009), http://www.eclipse.org/modeling/emft/?project=compare
7. Ferrandina, F., Meyer, T., Zicari, R., Ferran, G., Madec, J.: Schema and database evolution in the O2 object database system. In: VLDB 1995 (1995)
8. Fowler, M., Beck, K., Brant, J., Opdyke, W., Roberts, D.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional, Reading (1999)
9. Jing, J., Claypool, K.T., Rundensteiner, E.A.: SERF: Schema evolution through an extensible, reusable and flexible framework. In: Int. Conf. on Information and Knowledge Management (1998)
10. Kleppe, A., Warmer, J., Bast, W.: MDA Explained. The Model Driven Architecture: Practice and Promise. Addison-Wesley, Reading (2003)
11. Lagarde, F., Terrier, F., André, C., Gérard, S.: Extending OCL to ensure model transformations. In: ER Workshops 2007. Springer, Heidelberg (2007)
12. Mens, T., Tourwé, T.: A survey of software refactoring. IEEE Trans. Softw. Eng. 30(2), 126–139 (2004)
13. Object Management Group: Common Warehouse Metamodel (2003)
14. Object Management Group: UML 2.0 Testing Profile (2007)
15. Object Management Group (ed.): Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification. OMG, 1.0 edn. (2008)
16. Papastefanatos, G., Vassiliadis, P., Simitsis, A., Vassiliou, Y.: Hecataeus: Regulating schema evolution. In: ICDE, pp. 1181–1184 (2010)
17. Queralt, A., Teniente, E.: Reasoning on UML class diagrams with OCL constraints. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 497–512. Springer, Heidelberg (2006)
18. Tokuda, L., Batory, D.: Evolving object-oriented designs with refactorings. Automated Software Engineering 8(1), 89–120 (2001)
Author Index
Aboulsamh, Mohammed 285
Andersson, Birger 107
Bakillah, Mohamed 12
Bédard, Yvan 23
Bergholtz, Maria 107
Berg, Markus 160
Bettin, Jorn 211
Blaha, Michael 255
Brisaboa, Nieves R. 33
Brodeur, Jean 1
Casanova, Marco A. 2
Chen, Yi-Ping Phoebe 53
Clark, Tony 211
Cohen, Sholom 211
Correal, Dario 86
Cotos, José M. 43
Crichton, Edward 285
Currim, Faiz 138
Dahanayake, Ajantha 128
Davies, Jim 285
de Jong, Franciska 200
de Oliveira, José Palazzo Moreira 190
Düsterhöft, Antje 160
Embley, David W. 148
Frasincar, Flavius 159
Furtado, Antonio L. 2
Gallegos, Irbis 232
Garrigós, Irene 170
Gates, Ann Q. 232
Gessenharter, Dominik 275
Graf, Sebastian 180
Guizzardi, Giancarlo 265
Hartmann, Sven 53
Haugen, Øystein 212
Hernández, Paul 170
Hernandez, Tatiana 86
Hogenboom, Alexander 200
Hogenboom, Frederik 200
Houben, Geert-Jan 159
Johannesson, Paul 107
Kangassalo, Hannu 127
Kaymak, Uzay 200
Kirchberg, Markus 75
Kofler, Thomas 222
Lemos, Melissa 2
Lewandowski, Lukas 180
Liddle, Stephen W. 148
Liu, Jun 54
Lonsdale, Deryle W. 148
Lopes, Giseli Rabello 190
Luaces, Miguel R. 33
Ma, Hui 65
March, Sal 127
Mazón, Jose-Norberto 170
Mohagheghi, Parastoo 212
Moro, Mirella M. 190
Mostafavi, Mir Abolfazl 12
Navarro, Gonzalo 33
Noack, René 96
Opdahl, Andreas L. 244
Peña, Yeimi 86
Pernul, Gunther 243
Piccinini, Helena 2
Poels, Geert 117
Ram, Sudha 54, 138
Ratiu, Daniel 222
Reinhartz-Berger, Iris 211
Rossi, Matti 243
Sboui, Tarek 23
Schewe, Klaus-Dieter 76
Seco, Diego 33
Stewart, Aaron 148
Sturm, Arnon 211
Tao, Cui 148
Thalheim, Bernhard 75, 128, 160
Thiran, Philippe 159
Triñanes, Joaquín 43
Tweedie, Craig 232
Varela, José 43
Viqueira, José R.R. 43
Waldvogel, Marcel 180
Wang, Jing 53
Wang, Qing 76, 96
Welch, James 285
Wives, Leandro Krug 190
Wong, Leah 127
Wouters, Paul 200
Zimányi, Esteban 1