Communications in Computer and Information Science
108
Salvador Sánchez-Alonso Ioannis N. Athanasiadis (Eds.)
Metadata and Semantic Research 4th International Conference, MTSR 2010 Alcalá de Henares, Spain, October 20-22, 2010 Proceedings
Volume Editors

Salvador Sánchez-Alonso, Universidad de Alcalá, Edificio Politécnico, despacho O-246, Ctra. Meco s/n, 28871 Alcalá de Henares, Spain. E-mail: [email protected]

Ioannis N. Athanasiadis, IDSIA/USI-SUPSI, Galleria 2, 6928 Manno, Lugano, Switzerland. E-mail: [email protected]
Library of Congress Control Number: 2010936639
CR Subject Classification (1998): H.4, H.3, I.2, H.2.8, H.5, D.2.1, C.2
ISSN 1865-0929
ISBN-10 3-642-16551-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16551-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 06/3180 543210
Preface
Metadata and semantic research is a growing, complex ecosystem of conceptual, theoretical, methodological, and technological frameworks, offering innovative computational solutions for the design and development of computer-based systems. Within this perspective, researchers working in the area need to further develop and integrate a broad range of methods, results, and solutions coming from different areas. MTSR has been designed as a forum allowing researchers to present and discuss specialized results as general contributions to the field.

This volume collects the papers selected for presentation at the 4th International Conference on Metadata and Semantic Research (MTSR 2010), held at the University of Alcalá in Alcalá de Henares––a world heritage city and birthplace of Miguel de Cervantes––on October 20–22, 2010. The first MTSR conference was held online in 2005, followed by two more editions: in Corfu (2007) and in Milan (2009). The experience acquired during the past five years, and the warm welcome of MTSR by the research community, encouraged us to organize this new edition of the series and turn it into a yearly event. Judging by the number and quality of the contributions submitted for review, our 2010 effort was again a considerable success.

This edition featured three invited speakers: Julià Minguillón from the Open University of Catalonia, Spain; Christian Stracke from the University of Duisburg-Essen, Germany; and Ferdinando Villa from the Gund Institute for Ecological Economics, USA, and the Basque Centre for Climate Change, Spain. Julià Minguillón demonstrated how social information tags can be used to discover hidden semantics that may improve descriptions of learning objects and thus help us to better understand the real needs of people searching for open educational resources. Christian Stracke discussed the benefits and future of standards and presented the generic multi-dimensional reference model, while Villa presented a new theoretical synthesis and a knowledge modeling formalism enabling multiple-scale, multiple-paradigm, and modular modeling, demonstrating its application to real environments by modeling a Web-accessible decision support system. Thanks to the three of them for their talent and good disposition.

We would like to thank the authors for submitting their work, as well as the Program Committee members and reviewers for their enthusiasm, time and expertise. The help of the Organizing Committee members of the previous edition––Fabio Sartori in particular––was highly appreciated. Special thanks to Miguel A. Sicilia and Raquel Rebollo for managing the online submission system and preparing the camera-ready version of the proceedings. Last, but not least, our gratitude to the local organizers for making MTSR 2010 a success.

August 2010
Salvador Sánchez-Alonso Ioannis N. Athanasiadis
Organization
Program Chairs

Salvador Sánchez-Alonso, University of Alcalá, Spain
Ioannis N. Athanasiadis, Dalle Molle Institute for Artificial Intelligence, Switzerland
MTSR Series Steering Committee

Miguel-Angel Sicilia, University of Alcalá, Spain
Fabio Sartori, Università degli Studi di Milano-Bicocca, Italy
Nikos Manouselis, Greek Research & Technology Network, Greece
Organization Chairs

Elena García-Barriocanal, University of Alcalá, Spain
Nikos Palavitsinis, Greek Research & Technology Network, Greece
Organization Committee

Elena Mena-Garcés, University of Alcalá, Spain
Daniel Rodríguez-García, University of Alcalá, Spain
Raquel Rebollo-Fernández, University of Alcalá, Spain
International Program Committee

Rajendra Akerkar, Technomathematics Research Foundation, India
Grigoris Antoniou, University of Crete & FORTH, Greece
Tomaz Bartol, University of Ljubljana, Slovenia
Howard Beck, University of Florida, USA
Paolo Bouquet, University of Trento, Italy
Shawn Bowers, Gonzaga University, USA
Gerhard Budin, University of Vienna, Austria
Caterina Caracciolo, Food and Agriculture Organization of the United Nations, Italy
Jack Carlson, US Department of Agriculture, USA
Artem Chebotko, University of Texas - Pan American, USA
Stavros Christodoulakis, Technical University of Crete, Greece
Constantina Costopoulou, Agricultural University of Athens, Greece
Sally Jo Cunningham, Waikato University, New Zealand
Emilia Currás, Universidad Autónoma de Madrid, Spain
Darina Dicheva, Winston-Salem State University, USA
Asuman Dogac, Technical University of Ankara, Turkey
Marcello Donatelli, European Joint Research Centre JRC, Italy
Kathryn Donnelly, Nofima, Norway
Koraljka Golub, University of Bath, United Kingdom
Asunción Gómez-Pérez, Universidad Politécnica de Madrid, Spain
Stefan Gradmann, University of Berlin, Germany
Jane Greenberg, University of North Carolina at Chapel Hill, USA
Claudio Gutierrez, University of Chile, Chile
Francisca Hernández, Fundación Marcelino Botín, Spain
Diane Hillmann, Cornell University, USA
Eero Hyvonen, Helsinki University of Technology, Finland
Carlos A. Iglesias, Universidad Politécnica de Madrid, Spain
Pankaj Jaiswal, Oregon State University, USA
Sander Janssen, Alterra, The Netherlands
Pete Johnston, Eduserv Foundation, United Kingdom
Dimitris Kanellopoulos, University of Patras, Greece
Maria Keet, Free University of Bozen-Bolzano, Italy
Johannes Keizer, Food and Agriculture Organization of the United Nations, Italy
Christian Kop, University of Klagenfurt, Austria
Brahim Medjahed, University of Michigan, USA
Eva Méndez, Carlos III University, Spain
Daniela Micucci, University of Milano-Bicocca, Italy
Julià Minguillón, Universitat Oberta de Catalunya, Spain
Akira Miyazawa, National Institute of Informatics, Japan
Ambjörn Naeve, Royal Institute of Technology, Sweden
William Moen, University of North Texas, USA
Erla Morales, University of Salamanca, Spain
Xavier Ochoa, Centro de Tecnologías de Información Guayaquil, Ecuador
Petter Olsen, Nofima, Norway
Matteo Palmonari, University of Milan-Bicocca, Italy
Laura Papaleo, University of Genova, Italy
Panayiota Polydoratou, London City University, United Kingdom
Marios Poulos, Ionian University, Greece
T.V. Prabhakar, Indian Institute of Technology Kanpur, India
Edie Rasmussen, University of British Columbia, Canada
Andrea Emilio Rizzoli, Dalle Molle Institute for Artificial Intelligence, Switzerland
Daniel Rodriguez, University of Alcala, Spain
Stefanie Rühle, University of Göttingen, Germany
Gauri Salokhe, Food and Agriculture Organization of the United Nations, Italy
Inigo San Gil, Long Term Ecological Research Network, USA
Giovanni Semeraro, University of Bari, Italy
Javier Solorio Lagunas, University of Colima, Mexico
Praditta Sirapan, National Science and Technology Development Agency, Thailand
Cleo Sgouropoulou, Technological Educational Institute of Athens, Greece
Gerald Schimak, Austrian Institute of Technology, Austria
Aida Slavic, UDC Consortium, The Netherlands
Imma Subirats, Food and Agriculture Organization of the United Nations, Italy
Shigeo Sugimoto, University of Tsukuba, Japan
Hussein Suleman, University of Cape Town, South Africa
David Taniar, Monash University, Australia
Emma Tonkin, University of Bath, United Kingdom
Joseph Tennis, University of Washington, USA
Giovanni Tummarello, National University of Ireland, Ireland
Ferdinando Villa, BC3, Spain & University of Vermont, USA
Bettina Waldvogel, WSL, Switzerland
Andrew Wilson, National Archives of Australia, Australia
Telmo Zarraonandia, Universidad Carlos III de Madrid, Spain
Thomas Zschocke, United Nations University, Germany
Table of Contents
Bridging Scales and Paradigms in Natural Systems Modeling ........ 1
Ferdinando Villa

Analyzing Hidden Semantics in Social Bookmarking of Open Educational Resources ........ 8
Julià Minguillón

Case Studies of Ecological Integrative Information Systems: The Luquillo and Sevilleta Information Management Systems ........ 18
Inigo San Gil, Marshall White, Eda Melendez, and Kristin Vanderbilt

Agrotags – A Tagging Scheme for Agricultural Digital Objects ........ 36
Venkataraman Balaji, Meeta Bagga Bhatia, Rishi Kumar, Lavanya Kiran Neelam, Sabitha Panja, Tadinada Vankata Prabhakar, Rahul Samaddar, Bharati Soogareddy, Asil Gerard Sylvester, and Vimlesh Yadav

Application Profiling for Rural Communities: eGov Services and Training Resources in Rural Inclusion ........ 46
Pantelis Karamolegkos, Axel Maroudas, and Nikos Manouselis

Developing a Diagnosis Aiding Ontology Based on Hysteroscopy Image Processing ........ 57
Marios Poulos and Nikolaos Korfiatis

Utilizing Embedded Semantics for User-Driven Design of Pervasive Environments ........ 63
Ahmet Soylu, Felix Mödritscher, and Patrick De Causmaecker

Utilizing Linked Open Data Sources for Automatic Generation of Semantic Metadata ........ 78
Antti Nummiaho, Sari Vainikainen, and Magnus Melin

Application of Semantic Tagging to Generate Superimposed Information on a Digital Encyclopedia ........ 84
Piedad Garrido, Jesus Tramullas, and Francisco J. Martinez

Mapping of Core Components Based e-Business Standards into Ontology ........ 95
Ivan Magdalenić, Boris Vrdoljak, and Markus Schatten

Model-Driven Knowledge-Based Development of Expected Answer Type Taxonomies for Restricted Domain Question Answering ........ 107
Katia Vila, Jose-Norberto Mazón, Antonio Ferrández, and José M. Gómez

Using a Semantic Wiki for Documentation Management in Very Small Projects ........ 119
Vincent Ribaud and Philippe Saliou

A Short Communication – Meta Data and Semantics the Industry Interface: What Does the Food Industry Think Are Necessary Elements for Exchange? ........ 131
Kathryn A.-M. Donnelly

Social Ontology Documentation for Knowledge Externalization ........ 137
Gonzalo A. Aranda-Corral, Joaquín Borrego-Díaz, and Antonio Jiménez-Mavillard

Information Enrichment Using TaToo's Semantic Framework ........ 149
Gerald Schimak, Andrea E. Rizzoli, Giuseppe Avellino, Tomas Pariente Lobo, José Maria Fuentes, and Ioannis N. Athanasiadis

Exploiting CReP for Knowledge Retrieval and Use in Complex Domains ........ 160
Lorenza Manenti and Fabio Sartori

Quality Requirements of Migration Metadata in Long-Term Digital Preservation Systems ........ 172
Feng Luan, Thomas Mestl, and Mads Nygård

A Model for Integration and Interlinking of Idea Management Systems ........ 183
Adam Westerski, Carlos A. Iglesias, and Fernando Tapia Rico

An Enterprise Ontology Building the Bases for Automatic Metadata Generation ........ 195
Barbara Thönssen

Matching SKOS Thesauri for Spatial Data Infrastructures ........ 211
Cristiano Fugazza, Soeren Dupke, and Lorenzino Vaccari

Sharing Epigraphic Information as Linked Data ........ 222
Fernando-Luis Álvarez, Elena García-Barriocanal, and Joaquín-L. Gómez-Pantoja

Development Issues on Linked Data Weblog Enrichment ........ 235
Iván Ruiz-Rube, Carlos M. Cornejo, Juan Manuel Dodero, and Vicente M. García

On Modeling Research Work for Describing and Filtering Scientific Information ........ 247
Miguel-Ángel Sicilia

Localisation Standards and Metadata ........ 255
Dimitra Anastasiou and Lucia Morado Vázquez

The Design of an Automated Workflow for Metadata Generation ........ 275
Miguel Manso-Callejo, Mónica Wachowicz, and Miguel Bernabé-Poveda

Assessing Quality of Data Standards: Framework and Illustration Using XBRL GAAP Taxonomy ........ 288
Hongwei Zhu and Harris Wu

Brazilian Proposal for Agent-Based Learning Objects Metadata Standard - OBAA ........ 300
Rosa Maria Vicari, Alexandre Ribeiro, Júlia Marques Carvalho da Silva, Elder Rizzon Santos, Tiago Primo, and Marta Bez

Towards Quality Measures for Evaluating Thesauri ........ 312
Daniel Kless and Simon Milton

Enriching the Description of Learning Resources on Disaster Risk Reduction in the Agricultural Domain: An Ontological Approach ........ 320
Thomas Zschocke, Juan Carlos Villagrán de León, and Jan Beniest

Descriptive Analysis of Learning Object Material Types in MERLOT ........ 331
Cristian Cechinel, Salvador Sánchez-Alonso, Miguel-Ángel Sicilia, and Merisandra Côrtes de Mattos

Quality in Learning Objects: Evaluating Compliance with Metadata Standards ........ 342
C. Christian Vidal, N. Alejandra Segura, S. Pedro Campos, and Salvador Sánchez-Alonso

The Benefits and Future of Standards: Metadata and Beyond ........ 354
Christian M. Stracke

Author Index ........ 363
Bridging Scales and Paradigms in Natural Systems Modeling

Ferdinando Villa

Basque Centre for Climate Change, Avenida Urquijo 4, Bilbao, Spain
[email protected]
Abstract. I present a new modeling formalism that enables multiple-scale, multiple-paradigm, and modular modeling. The formalism starts with a generalization of the semantics of scientific observations, where specialized observation classes compute their states by running models, using the states of the dependent observations as input, inheriting, intersecting and harmonizing their topologies of time and space. This formalism, called semantic meta-modeling, offers a uniform and cohesive approach that encompasses data management, storage, querying and many aspects of traditional modeling. I will show how simple, elegant model specifications can be rewritten into queries that can be run on a semantic database to produce semantically annotated model results. The algorithm automatically operates context translation, matching probabilistic with deterministic data and models, performing data-driven structural transformations of model structure as required by the context, and seamlessly mixing traditionally isolated paradigms such as agent-based with process-based or temporally- with spatially-explicit. Keywords: Semantic meta-modeling, modular modeling, observation semantics.
1 Introduction

Data are being produced at unprecedented rates in sensor networks, remote sensing and sophisticated analytical studies. Worldwide efforts such as The Global Earth Observation System of Systems (GEOSS, 2005) promise increasing long-term accessibility and reliability of both real-time and static information. Such efforts reflect the consensus that the information produced, integrated into decision support systems, will assist in making informed environmental and public health decisions and benefit societies and economies worldwide. Yet, turning today's data streams into usable understanding and decision-making power requires connecting the available information into synthetic models that can make trends and scenarios understandable, indicators of impending change discoverable, and timely policy changes possible, handling and communicating uncertainty appropriately.

In recent years, activities such as the Semantic Web have spurred a renewed interest in evolving relatively unstructured "metadata" into formal semantic descriptions of the datasets used in natural sciences and management. We have come to realize
that our ability to describe the semantics of what such data artifacts observe and of the way they are observed is crucial to modern-day scientific investigation. Core conceptualizations that can help formalize scientific observations and the ways the observed parts of a system relate to each other have multiplied as their role becomes more important. Such ontologies allow basic integration of independently collected information and provide a branching point for further specification with domain-specific ontologies, enabling semantic validation of the information collected, transferred and created along data paths. The integrated approach outlined here is a general strategy to uniformly represent, store and use semantically characterized, heterogeneous knowledge streams in terms of data, models and scenarios.
2 Observation Semantics

Separate logical models of observable entities (observables) and their observations are required to enable a novel, sophisticated approach to integrated modeling. Each data source (either real-time, such as sensors, or static) is semantically tagged to represent an observation (Fig. 1) of a specified observable in a given time/space (Villa et al. 2007).
Fig. 1. Partial representation of a high-level taxonomy of observation types and their key properties. Specializations of this ontology can describe most datasets.
Given this semantic characterization of each data source (Madin et al. 2007), data mediation in models can be seen as a three-fold problem. First, observables need to be of the same class. Description Logics (DL) reasoning on an OWL representation is a viable strategy to negotiate compatibility of observables (represented as OWL individuals and stored along with the data) and guarantee their match based on classification of
the corresponding instances, using algorithms available in off-the-shelf reasoner software. Second, the observations of each observable must be compatible, i.e. a mediating strategy to match differently conceptualized states of compatible observables must exist. This can be trivial and exact (e.g. unit conversion), more complex and lossy (e.g. raster vs. vector representations or random vs. deterministic variables) or very difficult and uncertain (e.g. mediation of different classification schemata for compatible observables, for example land cover class [38] in different interpretations). This second dimension of the mediation problem is beyond the capabilities of DL reasoning, but can be tackled with ad-hoc algorithms. The third dimension of mediation concerns the topologies of space and time over which the states of each observation are distributed, which must match when different data are used in the same model. Extent and granularity of each dataset must be merged along all common topologies and the corresponding states transformed to match them, or separate agents must be built for each simulated process and coupled with adapter agents capable of bridging scales. In order to enable data integration, ontologies (Villa et al. 2009) are used to annotate data sources and models; such ontologies refer to established upper ontologies (Gruber, 1995; Madin et al. 2007), extended as necessary to represent specific observables in each problem area. A semantically annotated database (knowledge box or k-box) consists of a persistent storage of observation instances, indexed spatially, temporally and semantically, and searchable using generalized semantic queries that select classes of observables through limited reasoning, localizable to specified temporal and spatial contexts.
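To make the three dimensions of mediation concrete, the following Python sketch shows where each check would sit in a minimal pipeline. It is purely illustrative and is not part of the Thinklab implementation: the class, the subsumption table, the unit factors and the extent merging are all simplified stand-ins for what DL reasoning, state mediation algorithms and topology harmonization actually do.

from dataclasses import dataclass, field

# Illustrative stand-in for a semantically annotated observation; the real
# system stores observables as OWL individuals in a k-box.
@dataclass
class Observation:
    observable: str                      # e.g. "eco:Biomass"
    unit: str                            # e.g. "kg/m^2"
    extent: tuple                        # (xmin, ymin, xmax, ymax)
    states: list = field(default_factory=list)

# First dimension: observable compatibility. A DL reasoner would classify the
# instances; a plain subsumption table stands in for it here.
SUBSUMES = {("eco:Biomass", "ecology:PlantBiomass")}

def compatible(required, offered):
    return required == offered or (required, offered) in SUBSUMES

# Second dimension: state mediation, shown only for the trivial, exact case
# of unit conversion (raster/vector or probabilistic mediation is far harder).
UNIT_FACTORS = {("g/m^2", "kg/m^2"): 0.001}

def convert_states(states, src_unit, dst_unit):
    if src_unit == dst_unit:
        return states
    factor = UNIT_FACTORS[(src_unit, dst_unit)]
    return [s * factor for s in states]

# Third dimension: topology harmonization, here just the intersection of two
# spatial extents; resampling the states to a common grain is left out.
def merge_extents(a, b):
    return (max(a.extent[0], b.extent[0]), max(a.extent[1], b.extent[1]),
            min(a.extent[2], b.extent[2]), min(a.extent[3], b.extent[3]))

Real mediation is of course much richer (probabilistic states, classifications, vector/raster conversion), but the three hooks stay the same.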
3 Semantic Meta-Modelling: Principles and Definitions

The challenges of modeling a multiple-scaled, constantly changing environment can only be attacked effectively using an array of modeling formalisms. For example, probabilistic models are particularly suited to tackle decision problems as they are lighter on the assumptions, suitable for data-driven machine learning and most useful in decision making where explicit uncertainty is valued. At the same time, deterministic process models (e.g. for hydrology) are used around the world with good results when the physical processes driving phenomena are well known, while agent-based models remain the paradigm of choice to simulate societal response and complex multi-scale pattern emergence.

Semantic meta-modeling allows integration of these approaches through a generalization of the semantics of observations in which specialized observation types compute their states by running models, using the states of the dependent observations as input, inheriting, intersecting and harmonizing their topologies of time and space. This uniform approach to data management, modeling and scenario exploration is summarized in the block diagram of Figure 2.

Fig. 2. The workflow of semantic meta-modeling

In semantic meta-modeling, a model specification acts as a query that, once supplied with a given context of observation and applied to an observation database, produces zero or more observation structures using data from the k-box representing the data sources in a case study. This observation is "run" over the desired contexts of space and time by compiling the mediation strategy for the observation into bytecode for a specialized virtual machine. The compiled model can be serialized and used many times, and will expose selected parameters so that they can be modified for scenario analysis. The result of running it is a new "contextualized" observation structure with a common spatial and temporal scaling and fully known states (either deterministic states or statistical distributions according to the class of observation) for each dependent observation. Result observations are well-behaved, semantically annotated data structures that can easily be persisted to the same k-box or to scientific storage formats (such as NetCDF) for visualization and long-term storage.

Modeling proceeds in two phases. In the first (model building) a context observation is built by looking up values of the context observables in a semantically annotated database. The database (see section 4) is supplied to the modeling algorithm and observations are looked up to match the variables of interest; extent composition is used to ensure a common spatial and temporal context, so that a number of states, each with a different value of the context observations, can be calculated. In the second phase (model evaluation) an observation is built for each structure of observables computed for each context state. For each different context resulting, a RETE inference engine (Forgy, 1982) is employed to select the subtypes of the observables of interest that best represent the observable according to the values of the context variables in that context state. Because subtypes can specify additional observable types they depend on, each different context can result in a completely different observable structure. According to the modeling paradigm specified for each observable, calculating the states of the observable may require computation or data retrieval from a database. The infrastructure will negotiate the compatibility of the conceptual models and produce a model that can be run to compute states for all observables, or raise an error if the conceptual models cannot be combined. The final observation is, in the general case, a multi-paradigm model that operates for a specifically scaled context in space, time, and value of other driving variables.
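The two phases can be illustrated with a small interpreted sketch, anticipating the PlantBiomass example of Section 4. Everything below is hypothetical (the k-box lookup, the cell dictionaries and the numeric values are invented for illustration); the actual system delegates subtype selection to a RETE engine and compiles the contextualized model to bytecode instead of interpreting it cell by cell.

import math

# Hypothetical k-box lookup: returns the state of an observable in one
# context cell (e.g. one pixel at one time step).
def kbox_lookup(observable, cell):
    return {"landuse:CoverClass": cell["cover"],
            "eco:SeaLevelTemperature": cell["sst"]}[observable]

# Phase 1 (model building): resolve the context observable in every cell so
# that each cell carries the value that will drive model selection.
def build_context(cells):
    for cell in cells:
        cover = kbox_lookup("landuse:CoverClass", cell)
        cell["biome"] = "Terrestrial" if cover <= 32 else "Aquatic"
    return cells

# Phase 2 (model evaluation): pick a paradigm per context state and compute.
def evaluate(cells, growth_rate=1.05, t=10):
    results = []
    for cell in cells:
        if cell["biome"] == "Terrestrial":
            # deterministic variant: biomass grows in time
            results.append(("deterministic",
                            cell["biomass0"] * math.pow(growth_rate, t)))
        else:
            # probabilistic variant: a two-state distribution driven by SST
            p_low = 0.8 if kbox_lookup("eco:SeaLevelTemperature", cell) < 20 else 0.2
            results.append(("probabilistic",
                            {"LowBiomass": p_low, "HighBiomass": 1 - p_low}))
    return results

cells = [{"cover": 12, "sst": 25, "biomass0": 1.0},
         {"cover": 40, "sst": 14, "biomass0": 0.0}]
print(evaluate(build_context(cells)))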
4 Meta-Model Specification Syntax

Semantic meta-model specifications include three main elements:

1. definition of types of observables with their model paradigm (defmodel instructions);
2. context specifications that define the observables that will determine different model structures when the state of the context variables changes; and
3. conditional statements that associate specific values of the context variables to specific paradigms and dependencies for the observables that are part of the final model.
The example below demonstrates a model specification, as implemented in the Thinklab semantic modeling toolkit developed by the author. The example uses a modeling language that takes its syntax from the LISP-derived Clojure language (Halloway, 2009).

(defmodel demo-biomass-model 'ecology:PlantBiomass
  [(classification 'landuse:CoverClass
     [0 32]  'Terrestrial,
     [34 :>] 'Aquatic) :as biome]

  (measurement 'eco:Biomass "kg/m^2")
    :context time
    :derivative (time:Time '(* self (^ growth-rate time)))
    :when (is? biome 'Terrestrial)

  (classification (measurement 'eco:Biomass "kg/m^2")
     [0 2.5]  'eco:LowBiomass
     [2.5 :>] 'eco:HighBiomass)
    :context (classification (measurement eco:SeaLevelTemperature "C")
                [:< 20] 'temp:Low
                [20 :>] 'temp:High)
    :probability ((eco:LowBiomass|'temp:Low   [0.7 0.9])
                  (eco:LowBiomass|'temp:High  [0.1 0.3])
                  (eco:HighBiomass|'temp:Low  0.017)
                  (eco:HighBiomass|'temp:High 0.983))
    :when (is? biome 'Aquatic))
In the example, an observable concept (PlantBiomass from the eco ontology) is modeled in two alternative ways depending on the context of computation. The expression in square brackets is the context model, whose output will select the model to use for PlantBiomass. In the context model, an observation of land cover class is required and will be looked up in a supplied database at computation time; the classification model that wraps it transforms the numeric class into two states (terrestrial and aquatic). According to the biome (:when clause), either a deterministic model (the :derivative clause defines the change of state in time) or a probabilistic one (the :probability clause defines a contingent probability table) will be used to describe PlantBiomass; each depends on models of different observables (:context clauses), time for the dynamic model and temperature for the probabilistic one, which will be resolved using the same k-box at computation time. Unbound models like time and growth-rate are assumed specified in previous defmodel instructions; observations of the remaining observable concepts without an associated model (such as landuse:CoverClass in the example) will be looked up in the supplied k-box. The model formalism is independent of the time and space context, which is defined only at model computation time. The context of observation is supplied in the same way that a WHERE clause supplies the
query context in an SQL statement, using a convenient spatial and temporal scaling (Wu, 2006) for the problem.
5 Discussion

Semantic modeling introduces novel potentials for environmental modeling and intelligent decision support systems. Even the simple injection of semantics into conventional data and models adds powerful integration capabilities. Using semantics to specify the modeling paradigm and leaving the task of ensuring compatibility of models to the infrastructure further empowers the modeler. Success of this approach could deeply affect the practice of modeling, allowing scientists to concentrate attention on designing and understanding logical structures, and facilitating communication and integration of models and data in unprecedented ways. Such potentialities become even larger if the phenomenon is seen in the context of a larger-scale, semantically enabled web. I list some novel opportunities below.

Multi-paradigm modeling. Modeling at the conceptual level allows modelers to employ a language that is tailored to the knowledge domain of reference, adding the necessary dynamic information to the definition so obtained, and letting appropriate infrastructures define the corresponding computing workflows. Because each modeling paradigm can be described by a set of ontologies handled by matching software, a system can be extended so that, for example, a DDE model can coexist with individual-based models. The details of the scheduling and the interactions between different calculation workflows can be sorted out automatically. For example, a knowledge-explicit approach can greatly ease the specification of hybrid models such as those that are most necessary in decision making, e.g. landscape models (best modeled as a spatially explicit process-based model) coupled with human component models that react to changes in the landscape and influence it in turn (best modeled as individuals moving on the landscape and reacting to its change). Ontologies can identify the common ancestor concepts that allow both the modeler and the infrastructure to represent the coupled model consistently or to seamlessly merge the independently developed components.

Automated contextualization in space and time. Models are necessary because the phenomena they describe vary in time and space. The property of being distributed in time and space causes a multiplicity of states for the variables of a system. When we model the temporal or spatial heterogeneity in a system, what we're modeling is not the system itself, but more accurately the context of its observation. Temporal and spatial scales of observation can be changed so that a dynamic model appears constant, or so that what appears to be a static variable reveals fine-grained internal dynamics. By virtue of their conceptual independence, time and space can be modeled independently from the abstract conceptualization of ecological systems; the definition of the contexts of time and space can be connected to that of the entities by coupling the definition of the temporal and spatial contexts of interest with that of the modeled entities and their behavior.

An important consequence of adopting a knowledge-explicit approach to modeling is that when space and time become part of the allowed semantics, there is no need for
specially tailored knowledge or tools for basic spatially-explicit modeling, because such functionalities can be invoked as necessary by the knowledge-based system, and the paradigms necessary to enact it are automatically integrated into the specification. It is for example conceivable to make a non-spatial model spatially explicit by simply describing one or more of its components as distributed in space (Villa, 2007). A properly configured system can propagate the notion of space in one concept to the whole conceptual network, or mediate competing representations by operating transformations, e.g. to propagate coarse polygon data over a fine-resolution grid.

Modular modeling. This approach is also a good strategy to enable a really modular modeling practice. Modular model specifications, which can be reused across contexts of application, have remained a holy grail in modeling practice because of the difficulty of matching independently developed model structures while maintaining the semantic and mathematical integrity of the resulting models. By making the specific conceptual model of each variable depend on an observed context, and leaving the task of ensuring computability to the underlying software infrastructure, we achieve a workable approach to modular modeling. An important aspect of this approach is the separation of observables from observations: matches can be defined at the abstract level, but also carry enough information in the semantic types of the observables to allow building models according to the most useful paradigm for the problem and the context. Using observation of the context to validate the composition of observables ensures that each is modeled in a way that is consistent with what the model is supposed to represent.

Acknowledgements. This work is funded by the US National Science Foundation through grant 0640837 (Project ARIES).
References

1. Halloway, S.: Programming Clojure. The Pragmatic Bookshelf (2009)
2. Forgy, C.: RETE: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19 (1982)
3. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43, 907–928 (1995)
4. Madin, J., Bowers, S., Schildhauer, M., Krivov, S., Ludaescher, B., Pennington, D., Villa, F.: An Ontology for Describing and Synthesizing Ecological Observation Data. Ecological Informatics 2, 279–296 (2007)
5. Villa, F., Athanasiadis, I.N., Johnson, G.W.: An Ontology for the Semantic Modelling of Natural Systems. In: 6th European Conference on Computational Biology, ECCB (2007)
6. Villa, F.: A semantic framework and software design to enable the transparent integration, reorganization and discovery of natural systems knowledge. Journal of Intelligent Information Systems 29(1), 79–96 (2007)
7. Villa, F., Athanasiadis, I.N., Rizzoli, A.E.: Modelling with knowledge: a review of emerging semantic approaches to environmental modelling. Environmental Modelling and Software 24, 577–587 (2009)
8. Wu, J.: Scale and scaling: A cross-disciplinary perspective. In: Wu, J., Hobbs, R. (eds.) Key Topics in Landscape Ecology, Cambridge University Press, Cambridge (2006)
Analyzing Hidden Semantics in Social Bookmarking of Open Educational Resources

Julià Minguillón

Universitat Oberta de Catalunya, Rambla Poblenou 156, 08018 Barcelona, Spain
[email protected]
Abstract. Web 2.0 services such as social bookmarking allow users to manage and share the links they find interesting, adding their own tags for describing them. This is especially interesting in the field of open educational resources, as delicious is a simple way to bridge the institutional point of view (i.e. learning object repositories) with the individual one (i.e. personal collections), thus promoting the discovery and sharing of such resources by other users. In this paper we propose a methodology for analyzing such tags in order to discover hidden semantics (i.e. taxonomies and vocabularies) that can be used to improve descriptions of learning objects and make learning object repositories more visible and discoverable. We propose the use of a simple statistical analysis tool such as principal component analysis to discover which tags create clusters that can be semantically interpreted. We compare the obtained results with a collection of resources related to open educational resources, in order to better understand the real needs of people searching for open educational resources.

Keywords: tags, social bookmarking, delicious, open educational resources, principal component analysis, repositories.
1 Introduction

Open educational resources (OER) have been a hot topic in recent years, as more and more educational institutions and individuals are making their assets and collections available to the whole community through the Internet. These resources are organized and published through open repositories such as MERLOT or OpenCourseWare, for instance, but other kinds of resources that can also be considered educational under some circumstances are also available, such as Wikipedia or LearningSpace, among others. In fact, the term OER can be used for a very wide range of resources, formats, granularity, etc. Therefore, it would be interesting to develop a methodology for establishing the meaning of words related to OER at different levels, improving the descriptions of both learning objects and collections in open repositories.

Since the explosion of Web 2.0 services and applications, it has never been so easy to create and share resources using the web as the platform. Blogs, wikis, LMSs,
CMSs, etc.: there are plenty of open source tools and online services for such purposes. But discovering and organizing all these contents is also an interesting subject from the perspective of each user. Social bookmarking, and more specifically delicious1, is a way of organizing such resources by tagging them with words closer to the final user rather than to the resource creators. Hundreds of thousands of users tag and share the same links, creating a huge folksonomy [1] of terms related to a given concept. This information can be further analyzed and exploited in order to better understand the way people use certain words to describe common terms which are widely used, such as the term "open" in "Open Educational Resources".

In this paper we discuss the possibilities of using the crowd-sourcing phenomenon of social bookmarking for extracting semantics from the tags added by delicious users, which describe links related to open educational resources. In Section 2 we outline the Open Educational Resources movement and the way of organizing content through open repositories and other Web 2.0 based platforms. Section 3 describes the methodology used for extracting useful information from delicious tags related to a given search term, in our case "Open Educational Resources". Section 4 presents the results obtained for such a search term and the semantics hidden in the relationships between similar users tagging similar resources, following the abovementioned methodology. Finally, Section 5 summarizes the main conclusions that can be drawn from this work and identifies the topics that should be addressed in order to improve the proposed methodology.
2 Open Educational Resources and Open Repositories

The concept of "Open Educational Resource" was first adopted at UNESCO's 2002 Forum on the Impact of Open Courseware for Higher Education in Developing Countries, according to Wikipedia. The main idea in the educational field was to reproduce the amazing success of the Open Source movement, trying to create a true community involved in open education. Since then, the OER movement has become a strong reality with thousands of open educational resources available, ranging from small chunks of knowledge (i.e. a definition or a formula) to whole web sites devoted to a specific topic (i.e. DLESE for earth sciences). Setting the granularity at course level, the OpenCourseWare consortium offers 3771 courses from 46 different sources in 7 languages (as of July 2010).

In this sense, and taking advantage of the evolution of digital libraries, more and more learning object repositories are currently available. Nevertheless, learners do not yet have an easy way to organize such learning objects according to their own particularities and preferences. The concept of repository itself is very different depending on the point of view (institutional vs. personal, for instance [2]), as institutional needs (preservation, dissemination) are not exactly the same as learners' needs (organization, learning). Nevertheless, and with the advent of the web 2.0, many other sites such as flickr or youtube can also be considered open repositories managed by users themselves.
http://delicious.com
2.1 METAOER: Resources for Understanding the OER Movement

As part of the activities of the UOC UNESCO Chair in e-Learning, a collection of resources related to the OER movement has been created and shared using delicious2. It is not another collection of open educational resources; it is a collection of open resources that describe key issues related to the OER movement. This collection is always "under construction", as more and more links are added every week; currently there are around 80 resources but we expect to reach a few hundred. All these resources are tagged as "#metaoer" with the idea that this collection will continue growing with collaborations from individuals (experts or not) interested in supporting this project. Each resource tagged as "#metaoer" will be analyzed by a group of experts and, if possible, it will become part of an open repository about open educational resources, named METAOER, which will be part of the UOC institutional repository3. In order to do so, each link is tagged according to several tag bundles created for such purpose, including information about its authors, file format (PDF, Word, OpenOffice and so on), license, organization / affiliation, type (paper, report, blog post, etc.) and, especially, its category. This last bundle defines what we think it is necessary to understand in order to become an active member of the OER movement, as shown in Table 1.

Table 1. Categories used for cataloguing the documents tagged as "#metaoer"

Category    Explanation
Awareness   Documents related to the philosophy behind the OER movement, institutional declarations, policies and so on.
Format      Documents related to the best file formats and standards used to support OER documents.
License     Documents for understanding which license is more appropriate and the restrictions imposed by each one.
Metadata    Documents about standards and specifications for describing learning objects using metadata, vocabularies and taxonomies.
Repository  Documents about how to build and maintain an open repository, as well as documents about best practices.
Software    Documents about software tools related to the other categories (format conversion, repositories, etc.)
In order to improve this classification and the collection of selected links, we will compare it with the results obtained when applying the methodology described in the following section. This methodology helps us to build semantic clusters starting from the tags used by people tagging open educational resources in delicious. We expect these clusters to reveal information about what people think (and probably need) about open educational resources, from a user centered perspective. In fact, user tags can be seen as “open descriptions” of open educational resources, assuming that users searching for resources only tag and share those that they find really valuable. 2 3
2 http://delicious.com/uocunescochair/#metaoer
3 http://openaccess.uoc.edu
3 Extracting and Analyzing Delicious Tags

As abovementioned, delicious is a Web 2.0 social bookmarking service that allows users to manage and share links in a standardized way. It improves the idea of "favorite links" implemented by web browsers, allowing users to manage them from any computer, using their own tags for describing links and sharing them with other users, if desired. It is very popular (ranked as the 326th most visited web site by Alexa, as of July 2010), although preliminary studies showed that it is not much used by higher education online students for managing educational resources (only 11% of students used delicious to manage their links). Delicious is a very good example of "the wisdom of crowds" [3], as it reflects the independent and diverse opinions of a group of individuals, although common tags are provided by default by the system, thus introducing an accumulation effect for the most common tags.

Table 2 shows the main results obtained when searching for "Open Educational Resources". A total of 3621 results are obtained (as of July 2010); the first 10 are shown. For each resource, only the 5 most important tags are shown (according to the delicious search engine). Notice that "education" is a common tag for most resources, for example, as well as "learning" or "opensource". This is not remarkable, though, as it is the expected behavior when thousands of users are tagging the same resources and the recommendation system probably offers such tags as an option. On the other hand, we are more interested in discovering the "long tails" [4] related to each resource, as well as the coincidences (or non-coincidences) among tags between users, that is, how tags are grouped to create clusters which can be considered natural sub-domains of the field, extending the analysis performed in [5].

Table 2. 10 first page results obtained from searching for "Open Educational Resources" in delicious (search performed in July 2010)

Resource        Number of times saved   Five most important tags
OER Commons     3417    education, opensource, learning, curriculum, resources
MIT OCW         15058   education, mit, learning, free, online
MERLOT          3775    education, teaching, elearning, multimedia, resources
Academic Earth  22808   education, video, lectures, learning, university
LearningSpace   3192    education, learning, free, elearning, university
Connexions      5078    education, opensource, learning, courseware, collaboration
OCW Consortium  3977    education, opencourseware, learning, opensource, free
Wikipedia       37546   reference, encyclopedia, Wikipedia, wiki, research
DOAJ            6245    journals, research, reference, openaccess, science
Moodle          12941   education, opensource, moodle, elearning, cms
Notice that the first seven sites are perfect examples of learning object repositories, while the other three sites (Wikipedia, DOAJ and Moodle) are of a different nature, especially Moodle, which is an open source learning management system not directly related to content. On the other hand, Wikipedia has become a very popular resource because of its internal structure of links, which makes it appear at the top of search engine results, while the Directory of Open Access Journals is a particular case of repository with a target audience, namely the scientific community.
3.1 Methodology

Our goal is to analyze all the tags used to describe resources in a given domain, trying to discover not only the most common tags but also the relationships between tags, users and the links described using such tags. In order to retrieve all the information stored in delicious for a given concept, we propose to use the following methodology:

1. Retrieve the first N links related to a given concept using the delicious search engine.
2. For each link, retrieve the first M users that have bookmarked such link.
3. For each pair {link, user}, generate a list of the tags used by such user to describe such link. In case of an empty list, remove such pair.
4. Generate a list of all the tags for all valid pairs.
Applying the first three steps we obtain a variable-length data set with, at most, N * M entries (i.e. all the valid pairs), each one with up to 45 tags (the maximum shown by delicious). In step 4 we obtain a set that can be used to determine the most important tags. Nevertheless, some data cleansing is needed in order to improve the quality of tags before they are analyzed. For example, some users may have used "elearning" as a tag while others may have used "e-learning". Misspelling is also a common problem, as well as using the same word in plural or singular (i.e. "tools" vs "tool"), or using words that can be considered to be equivalent (i.e. "education" and "educational"). This step could be partially automated using a system such as WordNet, for example [6]. As expected, the set of tags generated in step 4 is also a long tail that must be cut at a certain point. Two main options are available: a) to specify a number T of desired tags; b) to specify an accumulated probability p (i.e. a threshold) and select the first T tags that achieve such threshold. Anyway, this step can be simplified as follows:

5. Generate a reduced tag set containing the most important T tags and a conversion table which specifies which valid tag is used for every tag in the original data set, ranked according to tag importance.
Then we can proceed to clean the results obtained in step 3, as follows:

6. For each pair {link, user} obtained in step 3, replace every tag by its valid version according to the tag set generated in step 5, removing those pairs with no valid tags.
Finally, when a set of tags has been determined, we binarize the data set in order to obtain a matrix saying whether a given pair {link, user} has been tagged using {tagi}, as follows:

7. For each pair {link, user}, replace column i by "1" if the i-th tag in the tag set was used, "0" otherwise.
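A compact sketch of steps 1–7 is given below, assuming the raw {link, user, tags} triples have already been retrieved from delicious (no particular API endpoint is implied); the synonym table and the helper names are illustrative only.

from collections import Counter

# Steps 4-5: normalize tags and keep only the T most frequent (valid) ones.
SYNONYMS = {"e-learning": "elearning", "educational": "education", "tools": "tool"}

def normalize(tag):
    tag = tag.strip().lower()
    return SYNONYMS.get(tag, tag)

def reduce_tag_set(pairs, T=100):
    counts = Counter(normalize(t) for _, _, tags in pairs for t in tags)
    return [tag for tag, _ in counts.most_common(T)]

# Steps 6-7: rewrite each pair with valid tags only and binarize it into a
# row of the {link, user} x tag matrix.
def binarize(pairs, tag_set):
    index = {tag: i for i, tag in enumerate(tag_set)}
    kept, matrix = [], []
    for link, user, tags in pairs:
        row = [0] * len(tag_set)
        for t in (normalize(t) for t in tags):
            if t in index:
                row[index[t]] = 1
        if any(row):                       # drop pairs with no valid tags
            kept.append((link, user))
            matrix.append(row)
    return kept, matrix

# Toy input in the shape produced by steps 1-3: (link, user, tags) triples.
pairs = [("http://www.oercommons.org", "u1", ["education", "OER", "e-learning"]),
         ("http://ocw.mit.edu", "u2", ["mit", "education", "free"])]
tag_set = reduce_tag_set(pairs, T=5)
kept, matrix = binarize(pairs, tag_set)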
This generates a data set of dimension T which can be further analyzed using statistical or data mining techniques. For exploratory purposes, we propose to use Principal Component Analysis, as follows.

3.2 Principal Component Analysis

Once the binary data set has been generated, we want to determine two different things: first, which tags best describe the data set in terms of variance and, second, how these tags are related to each other. In order to do so, we propose to use Principal Component Analysis, a well-known tool for exploratory data analysis. We use maximum likelihood as the criterion for extracting components from the original data set, followed by a Varimax rotation, in order to minimize the number of fields (i.e. tags) needed to interpret the obtained components. Finally, we only consider those factors larger than or equal to 0.3 for explaining each component, thus reducing the number of factors in each component. We expect each component to gather a reduced set of factors (i.e. tags) which, hopefully, will be related to each other, discovering natural sub-domains of the top-level concept used for building the analyzed data set. Notice that in the case of binary variables, maximum variance is obtained when the mean is exactly 0.5, which is not the expected behavior except maybe for the most common tags, which do not provide useful information as they are, usually, too general (i.e. "education").

3.3 Main Methodology Drawbacks

This methodology has some well-known drawbacks, although they are not really important for experimentation purposes. The main drawback is that the links retrieved in steps 1 and 2 depend on the delicious internal search engine, which can be biased. This can be partially avoided by including as many links and users as possible, but because delicious must attend to thousands of concurrent requests, it is not possible to retrieve all the links and all the users for a given search term, as delicious uses bandwidth throttling protection schemes. Nevertheless, it is possible to perform an incremental search (every few hours), thus improving the quality of the results. Another important drawback is that the same web site can be accessed through different links, depending on whether the URL specified by the user includes the name of the web page or not (that is, "index.php" may be part of the URL or not). In the following section we will test the methodology as a preliminary experiment for understanding what users mean when they tag links related to the OER movement. This methodology can be combined with the method described in [7] in order to create ontologies for representing domain concepts.
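The analysis of Section 3.2 can be sketched as follows on the binary matrix built in Section 3.1. This is only an approximation of the procedure used in the paper: scikit-learn's FactorAnalysis stands in for the maximum-likelihood extraction, the Varimax rotation is a textbook implementation, and the 0.3 loading threshold follows the text; the exact numbers reported later were obtained with the authors' own tooling.

import numpy as np
from sklearn.decomposition import FactorAnalysis

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Plain Kaiser Varimax rotation of a (tags x components) loading matrix.
    p, k = loadings.shape
    rotation = np.eye(k)
    last = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
        rotation = u @ vt
        if np.sum(s) - last < tol:
            break
        last = np.sum(s)
    return loadings @ rotation

def tag_clusters(X, tags, n_components=33, threshold=0.3):
    # Maximum-likelihood factor extraction followed by Varimax rotation.
    fa = FactorAnalysis(n_components=n_components).fit(np.asarray(X))
    loadings = varimax(fa.components_.T)        # one column per component
    clusters = []
    for j in range(loadings.shape[1]):
        clusters.append([tags[i] for i in range(len(tags))
                         if abs(loadings[i, j]) >= threshold])
    return clusters

Each returned list then plays the role of one row of Table 3: the tags whose rotated loadings reach the 0.3 threshold on that component.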
4 Results and Discussion

Following the example shown in Table 2, we applied the methodology described in Section 3.1 in order to analyze the way users are tagging links related to "Open Educational Resources". Steps 1 and 2 generate a set with 65545 pairs {link, user} from 100 links tagged by 38298 different users, containing a total of 13929 tags (6632 of them are used only once). Once the data cleansing step described in step 6 is performed, we
set the number of desired tags to T=100, so the final data set contains 47674 different {link, user} pairs from 28292 different users, containing 100 different tags. The first ten tags according to their importance are: education (used 17871 times), free (9704), resource (7432), learning (6682), reference (5768), video (5530), opensource (5483), web20 (4782), elearning (4333) and tool (3903). The last tag in this data set (cms) has been used 289 times. As expected, tags follow a power-law distribution with a very long tail.

It is interesting to also analyze the joint distribution of the pairs {link, user}. For each link, we have between 249 and 844 users that have tagged it with at least one valid tag. A visual inspection of the histogram of this distribution shows that it is not normal, but a combination of two normal distributions, the largest centered on the lowest values and another, smaller one centered on the highest values. This is consistent with the fact that some links are extremely popular (i.e. Wikipedia). On the other hand, each user tags between 1 and 25 sites, with a distribution that can be very well approximated by a discrete geometric distribution with p=0.593447. We also tried a Zipf's law distribution but it does not fit as well as the geometric one. This information, combined with the quality of tags, could be used to rank users as novice or expert for a given search term, although this is out of the scope of this paper.

4.1 Principal Component Analysis

We then proceed to perform a principal component analysis, as described in Section 3.2. We obtain 33 components whose eigenvalue (i.e. explained variance) is larger than one, explaining 49.8% of the original data set variance (the largest explains only 3.0%, showing a lack of strong structure among tags). Table 3 shows the first ten components obtained and the tags that define the factors used in each component.

Table 3. 10 first components obtained after PCA is applied to the data set

Component   Tags involved in the component
C1          images, photos, photography, photo, stock
C2          free, learning, online, course, university, college
C3          book, ebook, library, reading, literature
C4          programming, kids, animation, games
C5          opensource, tool, software, freeware
C6          research, journals, openaccess
C7          tutorial, howto, diy4
C8          graphics, clipart
C9          curriculum, lessonplans
C10         web20, collaboration, community
Notice that due to the nature of the Varimax rotation applied after the principal component analysis, the same tag is not used in two different components (at least for the first 10 shown, but it is also true for the first 15 computed components). This
4 Common acronym for "do it yourself".
This allows us to create separate clusters with respect to tagging strategies but, as we will see, the clusters will probably show some semantic relationships among them.

4.2 Interpretation of Results

As described in [8], the components generated by PCA can be seen as an initial solution for any clustering algorithm based on computing centroids (or, in the case of binary vectors, medoids). Therefore, we can infer that the tags appearing in the same component are the basis for a given cluster and that they are strongly related to each other (a minimal sketch of this reading of the rotated loadings is given after the list below). In fact, the results in Table 3 show a natural aggregation of tags around different concepts which reveals some hidden semantics:
• C1 and C8 deal with images, one of the most common assets needed by people creating resources, especially blog posts and wiki entries. It is interesting to observe that there is a clear separation between real-world images and computer-generated clip art. Therefore, from a semantic point of view, it is important to tag graphic resources according to their nature (real / synthetic). Sites such as flickr allow users to specify this category for a given image.
• C2 and C9 are related to courses available through the Internet. These two components are directly related to the concept of OER from a consumer's perspective. C2 is more oriented towards open content, while C9 is more related to organizational issues. This shows one of the well-known problems in the OER movement, the necessity of providing additional support for open content, as "learning is more than just content", quoting D. Wiley, who also stated that "we live in the age of content abundance" and "content is infrastructure". These two levels (content and organization) need to be well described and related to each other.
• C3 is related to books, but it also includes a reference to e-books, a recent technology whose adoption by users, institutions and companies is growing very fast.
• C4 is probably the most inconclusive cluster, as it mixes tags with very different meanings. An inspection of the data set reveals that most of the high values for this component correspond to two web sites, namely zoho (a web suite of online applications) and curriki (an environment for creating educational materials).
• C5 is related to software, including references to the open source movement (such as Moodle), but the appearance of "tool" and "freeware" seems to indicate that users' needs have a very wide range of granularity, ranging from small solutions for PDF creation to full content management systems.
• C6 describes the domain of open access journals, which publish papers that can be considered open resources for research purposes. This component shows the relevance of DOAJ as one of the 10 most important web sites when searching for "open educational resources".
• Finally, C7 and C10 are also two very interesting clusters, as they are directly related to the philosophy of the OER movement, but from the producer's perspective (i.e. creating and sharing), both individually and collectively.
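Following [8], the rotated loading matrix can also be read directly as a seed clustering: each tag is assigned to the component on which its absolute loading is largest, and tags whose largest loading falls below a threshold are left unassigned, which is the situation of popular but unspecific tags discussed below. The minimal sketch below expresses this reading; it reuses the rotated loading matrix from the earlier sketch, and the threshold value is an assumption, not a figure from the paper.

```python
import numpy as np

def tags_to_clusters(rotated, tag_names, threshold=0.3):
    """Assign each tag to the component with its largest absolute loading.

    `rotated` is a (tags x components) Varimax-rotated loading matrix.
    Tags whose largest absolute loading is below `threshold` (an assumed
    cut-off) are left unassigned, mirroring popular tags that never
    define a component. Returns (clusters, unassigned).
    """
    clusters, unassigned = {}, []
    for i, tag in enumerate(tag_names):
        j = int(np.argmax(np.abs(rotated[i, :])))
        if abs(rotated[i, j]) < threshold:
            unassigned.append(tag)
        else:
            clusters.setdefault(j, []).append(tag)
    return clusters, unassigned
```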
On the other hand, among the 33 computed components, there are several tags that are never used as a factor in any component; more specifically: resource, reference, elearning, teaching, opencourseware, blog, school, academic, lecture, podcast, creativecommons and oer, among others. Notice that the first three were in the list of the 10 most used tags, which means that not all the most popular tags are relevant for analysis purposes, mainly because they appear everywhere in combination with other tags which are more representative of a specific concept.

4.3 Improving the METAOER Repository

In the light of the results obtained in the previous section, the most interesting clusters for our purposes are, on the one hand, C2 and C9, that is, open courses already available, and on the other hand, C7 and C10, that is, creating and sharing content. C5 is also an interesting cluster, as it is directly related to the category "Software" in Table 1, as mentioned above. Therefore, in order to improve METAOER, we will:
• Define a taxonomy for OER granularity, including documents that explain how to use it by means of current standards and specifications for metadata.
• Include a new category for tutorials, that is, documents that explain how to put into practice the techniques explained in the other categories.
• Include a new category for learning communities, related to the "awareness" category but describing communities of learning with best practices on OERs.
• Define a taxonomy for software tools related to the OER movement.
5 Conclusions

Hundreds of millions or even billions of Internet users interact every day with hundreds of millions of potential educational resources. Since the advent of Web 2.0, users have been able to create and share resources, so the amount of information available has grown exponentially. On the other hand, these resources need to be organized and described in order to make them visible in a collection as huge as the Internet. This organization is partially created and shared through social bookmarking services such as delicious. Delicious can be used as a huge database for understanding the way users tag resources they find valuable, at a very high level, with very simple words. In this paper we have proposed a methodology for extracting information from delicious, retrieving the most important tags used to describe the most relevant links related to a given search term, using PCA for exploratory purposes. We have applied this methodology to the concept of "Open Educational Resource", in order to better understand what people mean when they tag resources with such a term and which kinds of resources (or repositories, in a wide sense of the term) they are tagging, so as to discover their needs and see how the OER movement can be promoted. Our results show that the clusters obtained from the PCA applied to the collection of links describe different sub-domains related to OER from different perspectives: images, courses, books, software, journals, etc. Therefore, any collection of open educational resources should take these categories into account in order to provide
fast access to users trying to find such resources. Furthermore, it is also necessary to provide descriptors that allow users to narrow the range of results obtained when searching for a specific term, such as software related to OER, for instance. We are in the process of setting up an open repository about the OER movement, and the results obtained will allow us to improve the descriptions of the documents stored in that repository by means of taxonomies and vocabularies. Current and future research on this topic should include the analysis of other terms commonly used in the OER movement, such as "open education", and then the comparison and/or combination of the obtained clusters. On the other hand, for each of the obtained sub-domains, it would be interesting to apply the same methodology in order to create a hierarchical taxonomy of terms. This could be combined with hierarchical clustering, trying to establish the relationships between sub-domains. In order to do so, it is also necessary to enlarge the data set used for building the clusters, incorporating not just one hundred links but thousands of them, although this will be time consuming in both information retrieval and analysis. Finally, connecting the obtained clusters with some heuristics could be useful for building taxonomies and vocabularies for a given concept.

Acknowledgments. This paper has been partially supported by the UOC UNESCO Chair in e-Learning, 2010.
References
1. Angeletou, S., Sabou, M., Specia, L., Motta, E.: Bridging the Gap Between Folksonomies and the Semantic Web: An Experience Report. In: Workshop Bridging the Gap between Semantic Web & Web 2.0 at the 4th European Semantic Web Conference, Innsbruck (2007)
2. Peters, T.A.: Digital repositories: individual, discipline-based, institutional, consortial or national? The Journal of Academic Librarianship 28(6), 414–417 (2002)
3. Sunstein, C.: Infotopia: How Many Minds Produce Knowledge. Oxford University Press, Oxford (2006)
4. Anderson, C.: The long tail. Wired (October 2004)
5. Robu, V., Halpin, H., Shepherd, H.: Emergence of Consensus and Shared Vocabularies in Collaborative Tagging Systems. ACM Transactions on the Web 3(4), article 14 (2009)
6. Fellbaum, C. (ed.): WordNet: An electronic lexical database. MIT Press, Cambridge (1998)
7. Sugumaran, V., Purao, S., Storey, V.C., Conesa, J.: On-Demand Extraction of Domain Concepts and Relationships from Social Tagging Websites. In: Proceedings of the 15th International Conference on Applications of Natural Language to Information Systems, Cardiff, UK, June 23-25, pp. 224–232 (2010)
8. Ding, C., He, X.: K-means Clustering via Principal Component Analysis. In: Proceedings of the International Conference on Machine Learning, Banff, Canada, July 4-8, pp. 225–232 (2004)
Case Studies of Ecological Integrative Information Systems: The Luquillo and Sevilleta Information Management Systems

Inigo San Gil1,*, Marshall White1, Eda Melendez2, and Kristin Vanderbilt3

1 LTER Network Office, University of New Mexico, Albuquerque, NM, USA
2 Luquillo Experimental Forest LTER, University of Puerto Rico, Rio Piedras, Puerto Rico
3 Sevilleta LTER, University of New Mexico, Albuquerque, NM, USA
{isangil,mwhite,emelendez,vanderbi}@lternet.edu
Abstract. The thirty-year-old United States Long Term Ecological Research Network has developed extensive metadata to document its scientific data. Standard and interoperable metadata is a core component of the data-driven analytical solutions developed by this research network. Content management systems offer an affordable solution for rapid deployment of metadata-centered information management systems. We developed a customized integrative metadata management system based on the Drupal content management system technology. Building on knowledge and experience with the Sevilleta and Luquillo Long Term Ecological Research sites, we successfully deployed the first two medium-scale customized prototypes. In this paper, we describe the vision behind our Drupal-based information management instances and list the features offered through these systems. We also outline the plans to expand the information services offered through these metadata-centered management systems, and conclude with the growing list of participants deploying similar instances.

Keywords: Metadata Management System, Metadata Editor, Drupal Content Management System.
1 Introduction

Ecologists aware of scientific data preservation practices have been seeking optimal tools to generate quality metadata. Good data documentation is critical for data preservation and data discovery [1]. Structured, standardized metadata facilitates the advancement of science by enabling rapid data integration and synthesis. In this paper we present a set of tools to create, edit and manage metadata.
* Corresponding author.
The National Biological Information Infrastructure (NBII) [2],[3] is a US Geological Survey program whose mission is to facilitate the access and use of biological information. The NBII supports metadata initiatives by providing metadata training services [4] as well as sponsoring initiatives to create new metadata editing tools [5]. There are a number of standalone tools to generate metadata for individual users and organizations, but these tools often fall short of community expectations. Users attending NBII-sponsored usability sessions and other feedback mechanisms reported problems including confusing interfaces, unstable applications and a perceived lack of benefit. Overall, a representative perception is that the process of creating metadata consumes too much time. Representatives of the Long Term Ecological Research Network program, the Arizona Hydrological Information System [6], the Oak Ridge National Laboratory Mercury Consortium [7] and the NBII program teamed up in 2006 to produce a set of design guidelines for a better metadata entry tool and editor. The resulting editor tool [5] addressed most of the design requirements, but early adopters found it difficult to use. These considerations led the authors to develop a simpler, more integrative editing tool embedded in the Drupal content management system [8]. Generating standards-compliant metadata should be an integral part of the metadata creation workflow: the Drupal-based system presented in this paper saves the information manager or investigator the extra step of using a standalone editor to make the metadata comply with the rules of a common metadata specification. In addition, this editor is also oriented to those users looking for a feature-rich web-based editor. This paper presents the metadata tools and systems developed using the Drupal CMS technology. We use two case studies to describe the details and early reports of the functionality of the system. Highlights of this free, open-source Drupal content management system include fast and easy implementation, customization, maintenance and use. Other important system design requirements include a built-in relational model and semantic mediation for the information content. We implemented this information management system at two Long Term Ecological Research (LTER) sites and made it freely available to the ecological research community.

About LTER. The LTER Network comprises 26 research sites located in the United States, Puerto Rico, Antarctica, and Oceania. The LTER developed a metadata program to document and preserve its long-term scientific data [9],[10]. The LTER network has an international counterpart organization (the ILTER) with forty member networks representing as many countries. To facilitate the exchange of information within and outside the network, the ILTER has adopted metadata policies and standards. However, financial resources devoted to metadata management at ILTER research sites are frequently insufficient. Each US-LTER site should have a full-time information manager. The information manager ensures that all scientific information related to the station is preserved, well organized and accessible to all personnel. Usually, this person or team also manages the information that is made available to the general public. Often these human resources are stretched thin, resulting in delays and gaps in the metadata program [11].
LTER heeds its own metadata mandate [12], offering to the public over 7,000 studies in a metadata specification commonly used in ecology. Outside the LTER, however, structured metadata is not always found in the biology and ecology disciplines:
metadata is often either absent or minimal and non-conforming to any known common specification.

More on NBII. The NBII program started in 1993 [2] as a collaborative program to facilitate access to information on biological resources, including scientific data. The NBII links databases, information and tools originating from NBII partners. NBII collaborators work on data and metadata standards, tools, and technologies that make it easier to use and integrate biological resources.

Paper outline. The remainder of the paper is organized in four sections. In Section 2, we explain our vision for a successful ecological information management system, including our motivation and experiences with several information managers in the field. In Section 3, we explain the system architecture, starting from broad considerations about the Drupal content management system and moving to the specific customizations for the information management needs of the ecological research stations at LTER. We discuss the first two implementations and case studies of this system in Section 4. We conclude the paper with a discussion of current work, our next clients, our partners and the future outlook.
2 A Vision for an Integrated Metadata Management System

The envisioned information management system provides a service for ecological researchers as well as the general public. This dual packaging of information is attained by creating different levels of detail for the information served. For example, high-level descriptions of research projects and research locations may be appropriate for the public, while extensive technical descriptions of data methodologies may be more appropriate for those who use the ecological data collected at field stations and LTER sites. We surveyed the categories used to classify and serve information through the 26 LTER websites. We found that the most recurring categories of information include: Scientific Data, Publications, Research Project, Personnel Directory, Sites, Facilities, News, Maps, History, Educational Activities, Outreach Activities and Informatics. There are other information categories, but the ones listed above are those most observed at the LTER sites. These information categories are interrelated: a photo gallery contains visual information about research sites and personnel; a research project is related to its scientific data, publications, people and research maps. The news category contains articles related to people, outreach efforts and educational activities. Metadata includes information about data, people involved in the data lifecycle, research projects, maps, publications and research sites. Figure 1 illustrates some of the information categories used to classify information at most US LTER sites, as well as the subset that comprises metadata. Our goal is to integrate the metadata management process into the broader information lifecycle, avoiding duplication of metadata work. We wanted to simplify the data documentation process. Instead of using a standard-compliant standalone editor, we save effort and avoid further errors by having the local information management system output standard-compliant metadata.
Fig. 1. The most common categories of information found while surveying twenty-six LTER websites. The most common categories are represented by a larger font, while less popular groupings have smaller fonts. The shaded ellipse encircles the categories that constitute metadata.
However, the task of creating a universal adapter for each custom information management system would be very expensive. Instead, we focus on offering an easy-to-use information management system that also generates standard-compliant metadata. This Drupal-based system allows the manager to create custom views of the information without specific programming or web skills. Drupal relates all the information categories outlined above, and it also excels at serving information on the web. We are able to easily display contextual information alongside the main content, enhancing the experience of the visitor. For example, if a web visitor is viewing a research project, the system volunteers the top five data sets produced under the research project, the relevant personnel and publications. The system allows simple content tagging, thus enabling additional content relationships. We offer the web visitor content that is directly related to the main display (a research project has data, publications and people) and also other content related through the tagging process.

Related Work. There are many current developments that are either directly related to this effort or follow similar conceptual models. We already pointed out Aguilar's editor [5] as a precursor of this work. Aguilar's choice of technology, a mashup of Orbeon [13] and Java running in a servlet container such as Tomcat [14], proved too demanding for wide adoption and further refinement. Standalone editors such as Metavist [15] and Morpho [16] were used as guiding principles. Perhaps the OBFS dataset registry tool [17], a web form that enables basic documentation of a data set, is the closest precursor in the sense that it is a web-based integrative metadata tool.
There are a number of synergistic efforts. The U.S. Geoscience Information Network group has developed a Drupal-based editor for basic metadata (ref), offering the content in ISO-compliant [18],[19] reports. Also, the US National Phenology Network [20] has developed a set of forms to capture the phenophases of plants and animals. The front end of the application uses Drupal, while the forms use a Java framework. The rest of the metadata and information is shared by the primary US-NPN website. The Spain-based Grupo de Ecología Terrestre developed an integrated information management system called Linaria [21] that serves structured metadata. The biological station at the University of Michigan also uses a modified, Drupal-based version of our model [22], highlighting the adaptability of the system to customizations and user needs. The wind energy group at the Oak Ridge National Laboratory (ORNL) has developed a set of metadata capture forms based on our work [23]. The outreach efforts sponsored by the NBII gave way to collaborations with the National Phenology Network, the United States Virtual Herbarium, the Spain Long Term Socio-Ecological Research Network, the Finnish Socio-Ecological Research Network and the Long Term Ecological Research Network in Taiwan. These groups have benefited from our vision or technology, while we have been able to improve the Drupal implementation. Another remarkable Drupal-based effort in ecology, and specifically in biodiversity, is the Encyclopedia of Life's set of modules for biodiversity information management [24], which are a reference for information managers in ecology. Particularly interesting are EOL's tools for metadata management in ISO. Another Encyclopedia of Life product that uses Drupal as its platform is LifeDesks [25]. We plan to use some of its species pages and taxonomies, as well as some optimization modules, to provide low-cost, value-added features for the work presented in this paper.
3 System Architecture

In this section we cover the customizations made to manage ecological metadata and other common ecological information. Since our system is based on the Drupal framework, we first describe the basics of the Drupal model, and then the customizations made to build an information management system for use at biological stations.

Why Drupal? When we started implementing the vision outlined in Section 2, Drupal was the only content management system that had all of the following characteristics: it is open source, it has substantial community adoption and support, and it has a growing community of developers and adopters [26]. Drupal was also the only content management system that easily allows the user to work with its architecture. We found Drupal flexible enough to adapt the architecture to our data model and customer needs: we did not lose any Drupal functionality while adapting the relational database to our needs, and we were able to customize everything from the database to web forms and views with adequate granularity for the information categories. Other CMSs have many virtues, and in some aspects may surpass Drupal, but we valued the system's flexibility.
3.1 The Drupal Framework

The Drupal content management system is a product that can run on top of Apache [27], Microsoft's IIS web server [28] or other web servers. The code base is PHP [29], with some JavaScript [30] additions. Most Drupal instances use MySQL [31] as the backend relational database, although PostgreSQL [32] can be used as well. Drupal is a widely adopted open-source content management system, with well over half a million websites powered by Drupal [33], or 1% of the Internet share. The initial install comprises four simple steps that are clearly documented. With all prerequisites running, an initial complete install can be completed in about five minutes.

To the Drupal community, a 'node' refers to a record in a content-bearing table of the database. Tables that contain information may represent a webpage, a story, an article, a blog post or a photo. Drupal provides all information types with the same underlying structure, so any application developed for a specific information type may be leveraged by all the information types in Drupal. This fosters information integration: Drupal's node-centric structure makes the developer's work relevant system-wide. In Drupal, the different types of tables are referred to as Content Types, i.e. categories of information groups or containers for a specific category of information. For example, the Content Type "Research Site" may encompass a site description, latitude and longitude, and site elevation. Nodes stored in content types can be linked to each other; for example, a node in the content type "research project" may contain several nodes in the content type "data set". This is relational database design, which can be directly inspected in the underlying MySQL database.

The look and feel of the front-end application (the website face) is easily customizable through custom Themes. Drupal themes are the layouts that structure the content on the web application, and a user can change the look and feel with a couple of clicks. There are hundreds of themes available for use [34]. Keeping this information system up to date is easier: the only tool required is an Internet browser. Also, the separation of content, organization and layout allows planning for independent modifications of the look and feel and of the content quantity and quality. Almost all Drupal operations are conducted using a web browser (e.g. Firefox, Internet Explorer, Chrome, etc.). Drupal's core is maintained by the Drupal team [33]. There are many extra features that extend the basic functionality through the addition of modules. A distributed community of developers releases these modules and themes following programming guidelines, principles and the open-source philosophy. The Drupal web site organizes all these custom modules and themes at http://drupal.org/project.

3.2 Customizing Drupal

Currently, our LTER Drupal group uses many extensions to the core and has developed custom content types for personnel, research site, research project, and metadata for datasets with data table and attribute types. We also benefit from optimized modules, developed by the Drupal community, that manage bibliographies, image galleries and videos. Using extended functionality it is possible to serve content in the extensible markup language (XML) [35]. Using Perl [36] and XSLT [37] transformations, we provide Ecological Metadata Language [38] and Biological Data Profile [39] compliant metadata.
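As a rough illustration of that last step, the sketch below assembles a minimal EML document from the fields captured for a data set. It is only a sketch: the authors use Perl and XSLT for this task, the keys of the input dictionary are hypothetical stand-ins for Drupal fields, and only a handful of EML 2.1 elements (dataset, title, creator, abstract) are shown.

```python
from xml.etree import ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.0"  # assumed EML 2.1 namespace

def dataset_to_eml(record):
    """Build a minimal EML document from a dict of data set fields.

    `record` mimics a row of the data set content type; its keys
    (title, abstract, creator_given, ...) are hypothetical.
    """
    eml = ET.Element("eml:eml", {
        "xmlns:eml": EML_NS,
        "packageId": record["package_id"],
        "system": "drupal-ims",            # assumed system identifier
    })
    dataset = ET.SubElement(eml, "dataset")
    ET.SubElement(dataset, "title").text = record["title"]

    creator = ET.SubElement(dataset, "creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "givenName").text = record["creator_given"]
    ET.SubElement(name, "surName").text = record["creator_surname"]

    abstract = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract, "para").text = record["abstract"]
    return ET.tostring(eml, encoding="unicode")

if __name__ == "__main__":
    print(dataset_to_eml({
        "package_id": "example.1.1",        # illustrative identifier only
        "title": "Example long-term precipitation data set",
        "creator_given": "Jane", "creator_surname": "Doe",
        "abstract": "Illustrative abstract text.",
    }))
```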
We can serve the content as PDF, Excel spreadsheets, Word documents, and other commonly used formats. Our group uses the Views module extensively. The Views module allows us to offer the content in many user-friendly and intuitive layouts; it is essentially a GUI for the creation of structured query language (SQL) queries, coupled with the actual final web layout. In the same GUI we configure the queries (fields, filters, arguments, relations), the layout (style, pagination) and the display (access, sorting). Other interesting features are security (SSL encryption, captchas and enhanced logs), user management, LDAP connectivity and RSS feeds. The latest stable version of Drupal (Drupal 6.*) provides its adopters with taxonomies for semantic mediation. Drupal taxonomies are families of keywords. These keywords may be free form or may come from a controlled vocabulary, and they may be structured hierarchically, conferring some further structure on a vocabulary list. These keywords can be used to tag content, forging relationships between otherwise disconnected information. In summary, the content types are repositories of information customized to the categories managed by the group. Creating a custom content type in Drupal is fairly simple, and each content type includes a configurable input form to capture and edit the information that it holds. Figure 2 shows a subset of the Drupal ER diagram for the metadata tables.
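To make the Views-to-SQL correspondence concrete, the following sketch shows roughly the query a simple View listing published data sets might issue against the tables of Figure 2. It is a hedged sketch only: the node columns and the join on vid follow common Drupal 6 conventions, while the content_type_data_set field column and the connection settings are assumptions.

```python
# Rough equivalent of a simple Drupal 6 View: list published data sets,
# newest first. node.nid, node.vid, node.type, node.status, node.title and
# node.created are standard Drupal 6 columns; field_abstract_value on
# content_type_data_set is an assumed CCK field column.
LIST_DATASETS_SQL = """
SELECT n.nid, n.title, d.field_abstract_value
FROM node AS n
JOIN content_type_data_set AS d ON d.vid = n.vid
WHERE n.type = 'data_set' AND n.status = 1
ORDER BY n.created DESC
LIMIT 10
"""

import MySQLdb  # assumes the MySQLdb driver is installed

def list_datasets(host="localhost", user="drupal", passwd="change-me", db="drupal"):
    # Connection parameters are placeholders for a local Drupal database.
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    try:
        cur = conn.cursor()
        cur.execute(LIST_DATASETS_SQL)
        return cur.fetchall()
    finally:
        conn.close()
```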
Fig. 2. A simplified data diagram that includes most of the metadata-related tables. Most of the Drupal core tables have been excluded from this diagram, except for the master table "node", in the upper right area.
The tables shown in Figure 2 are related to ecological metadata management. The areas in the figure represent the information categories discussed:
• A personnel directory, whose main table is titled content_type_person.
• The basic metadata area, whose main table is titled content_type_data_set.
• The structure of the tabular data entities area (content_type_data_file).
• The georeference area, whose main table is content_type_research_site.
• The variable area, whose main table is titled content_type_variable.
The tables that have the prefix content_field are dependent on the content types. These tables hold information for the one-to-many relationships to their parent tables, and they contain keys to index the multiplicity of the relations. In Figure 3, we show a diagram that expresses the metadata content types (or information categories) as implemented in Drupal custom content types, together with their relationships. Note the many-to-many as well as the one-to-many relations.
Fig. 3. The seven Drupal custom content types used to document ecological data and their relationships. Each content type may represent one or more tables in the Drupal relational database. All content is related at the record or node level.
In order to represent the most common categories of information, we have created seven content types:
1. The basic unit refers to an observable. We classify observables into three types: a physical, quantifiable measurement, a date, or a set of codes. This classification avoids complexities noted by many [40]. Variables can be described using the associated web form. The web form (see Figure 4) has three tabs corresponding to the types mentioned (physical measurements, dates and codes).
2. The data file structure describes the physical details of the data container, usually a spreadsheet or a database view output. Details such as the column separator, the number of header or footer lines, the file name and a link to the actual data are among the information that the user can input in the associated, multi-tab web form.
3. The research site describes the location where data is gathered. Latitude, longitude, a description of the site characteristics, and elevation are some of the descriptors that can be entered in the web form.
4. There is also a form to enter the contact information and other details about a person. These forms have fields that relate to other content types, known as node references. These inputs use autofill to make the correct selection easier.
5. A bibliographic section captures all information related to publications.
6. The basic details for a data set, such as title, abstract, publication date, keywords, etc. are captured in another form that has connectors to the personnel, research sites and publications. See an instance of this form in Figure 5.
7. A research project form captures information about the umbrella project that may encompass several data sets.
Fig. 4. A web form to capture information about a variable. Ideally, the information about a variable is populated automatically from the original data; however, a user-friendly interface is provided to correct and add information about a variable. Three types of variables typically used in biological studies can be documented: codes, dates and physical measurements can be added to the description of the variable using the tabs in the form.
Fig. 5. This web form captures the basic details about a dataset. The information captured (abstract, title, identifiers, locations, etc.) allows the data to be found in specialized searches. The four tabs break the form down into simple one-pagers and divide the metadata ingestion process by information category.
4 Case Studies

We describe here the steps taken to migrate content from the Luquillo Experimental Forest LTER [41] and Sevilleta LTER [42] information management systems into the new Drupal model. These two LTER sites have a long history of research, with continuous data records that span about twenty years.
4.1 The Luquillo Experimental Forest LTER Case

The Luquillo LTER content migration is nearly complete. All the content is tied together, minimizing the risk of running into a page without further suggestions or links to related content. The information is also mediated by a custom controlled vocabulary whose adequacy is being tested against the goal of unearthing related information (discovery functionality). The migration process started with the effort of three persons: one dedicated information manager for the Luquillo site and two operators from the LTER Network Office. There were two main tracks for the migration, coordinated by the Luquillo information manager. One track dealt with the overall content, structure, layout and vision, while the other track focused on metadata.

The metadata migration process: most of the Luquillo metadata was standardized [11] using the Ecological Metadata Language specification as the common vehicle to share information. Following this first migration step, there was a semi-manual quality control procedure for the resulting EML documents. Once the metadata was rich in content and controlled for quality, and taking advantage of the standardized format, we wrote a parser in Perl to migrate the metadata into the Drupal content types created to host metadata. The re-usable parser (EML is used by several networks in the USA and abroad) accommodated most of the content and its relationships within the Drupal MySQL database. Over one hundred and twenty datasets were automatically migrated to the Drupal system. The 120-plus metadata sets describe over 900 tabular data files. Each data file, usually a spreadsheet, may contain data organized in hundreds or thousands of rows and several columns; data files may have from three columns to several dozen columns, and each column represents a variable. The migration script re-used identical variables for different data files. In total, we processed over three thousand variables, some of which were found to be redundant upon closer inspection. We planned a revision procedure to check for content integrity, consisting of completing the relations with other relevant information not captured in the EML documents and correcting migration errors. We created a specialized variable query and results page that enabled us to detect and merge similar variables. By merging variables we facilitate data integration across different datasets.

The Luquillo crew conducts research at several research areas. Most of the research areas are located within the El Yunque National Forest boundaries, except for a few areas located in a more urban setting, west of the forest. Each research area contains several research sites, and each research site contains a number of plots that are used to observe the effects of treatments or simply serve as experimental replicates. All this site information is managed in Drupal by the custom content type Research Site. In all, there are over fifty entries in this category, which is closely related to research projects, data sets, data files and variables. Figure 6 shows Luquillo's hierarchy of content types under the umbrella of the research projects and gives the quantity of nodes for each content type. All seven content types are presented with the quantity of nodes that have been entered or are in the process of being entered into the system. The figure also presents the relationships among the several content types.
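To illustrate the metadata track of this migration, the sketch below extracts the attribute (variable) descriptions from EML documents and de-duplicates variables that share the same name, unit and definition, much as the migration script re-used identical variables across data files. It is a simplified Python rendering of logic the authors implemented in Perl: the element names (attribute, attributeName, attributeDefinition, standardUnit) follow the EML 2.x schema, while the de-duplication key and the unnamespaced parsing are assumptions.

```python
from xml.etree import ElementTree as ET

def extract_variables(eml_path):
    """Yield (name, unit, definition) tuples for every attribute in an EML file.

    Assumes the attribute elements are unqualified, as in typical EML
    documents; unit handling is simplified (standardUnit only).
    """
    root = ET.parse(eml_path).getroot()
    for attribute in root.iter("attribute"):
        name = attribute.findtext("attributeName", default="").strip()
        definition = attribute.findtext("attributeDefinition", default="").strip()
        unit = ""
        std_unit = attribute.find(".//standardUnit")
        if std_unit is not None and std_unit.text:
            unit = std_unit.text.strip()
        yield name, unit, definition

def deduplicate(eml_paths):
    """Map each distinct (name, unit, definition) to a single variable id."""
    variables = {}          # key -> assigned id
    assignments = []        # (eml file, key, id) triples, e.g. for later inserts
    for path in eml_paths:
        for key in extract_variables(path):
            var_id = variables.setdefault(key, len(variables) + 1)
            assignments.append((path, key, var_id))
    return variables, assignments

if __name__ == "__main__":
    # Paths are placeholders for the per-dataset EML files being migrated.
    merged, rows = deduplicate(["dataset1.xml", "dataset2.xml"])
    print(len(merged), "distinct variables from", len(rows), "attribute occurrences")
```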
Fig. 6. Luquillo's information management and content system: hierarchy of content types, their relations and the number of records these content types contain. Legend: CTE = Canopy Trimming Experiment; LEF = Luquillo Experimental Forest; PPT = Precipitation; NADP = National Atmospheric Deposition Program; LFDP = Luquillo Forest Dynamics Plot; SOM = Soil Organic Matter; LIDET = Long-Term Intersite Decomposition Experiment Team; RP # = Abbreviated Research Project; CT1:CT2 = mapping a node from CT1 to CT2; {A} = a finite set of nodes.
The overall information content migration to Drupal at the Luquillo LTER site was based on providing functionality while addressing shortcomings of the previous system. The legacy functionality included a searchable personnel directory, a data catalog with comprehensive structured metadata, a high-level description of the Luquillo LTER projects, a publications catalog with over one hundred entries, several maps, a list of outreach activities including an elementary school program, and a calendar of events. Part of the information managed was automatically updated in the dynamic site portal, using MySQL and Paradox databases and PHP for the web interface. Other information offered through the web was static content subject to manual revisions both in the back end and on the front site pages, relying mostly on the server file system. Some Luquillo LTER goals regarding the design of an information system included: easier maintenance than the current information system, shortened time lags between data capture and data being offered online, semantic mediation for data classification, better information integration, and a framework that facilitates broader participation, where investigators could take ownership of how the data and results from the data are portrayed to the public. Finally, Luquillo wanted to implement a vocabulary coordinated by a Luquillo committee. In addition, Luquillo wanted to connect all the content offered through the legacy sites: personnel profiles with not just contact information but also related publications and data sets; research projects connected to people and publications; and so on. All this interconnected content makes it easy to offer a more contextualized experience to anyone visiting the information portal or website.

A committee formed by principal investigators from Luquillo LTER and the information management team refined a hierarchical controlled vocabulary that was used to tag the content. This task took considerable time, as there were five families of legacy vocabularies used to tag the Luquillo information and data. The vocabularies were grouped into these five categories:
• Terms for describing the Core Area LTER Data Projects
• Terms for describing research locations
• Terms assigned by principal investigators
• Terms procured by a Luquillo LTER committee
• Terms used for a special research project
The most populous family was the free-form "assigned by principal investigators" vocabulary, which included over eight hundred terms, among them some misspellings and very similar terms. In contrast, the "Core Area" vocabulary had only a dozen terms, while the other vocabularies had between one and two hundred terms each, including some redundancies. The new vocabulary, organized hierarchically, will be used for several purposes. In our content management context, we will test the subset that optimizes information discovery by analyzing both web logs and the number of nodes related and exposed in queries per term. The migration process started about a year ago, and it is nearly complete [43]. The cost of the migration process can be broken down into developer time (about 30 hours), planning time (about a week), and development meetings (about three weeks). The Luquillo information manager traveled for some of the development meetings, adding the cost of the trips to the total time of this project. However, we still have to factor in
further training (one week) and learning, as well as usability labs oriented to improving the service.

4.2 The Sevilleta LTER Case

The Sevilleta LTER site underwent a migration process similar to Luquillo LTER's. However, a number of differences are noteworthy. The whole migration team was co-located, which helped with the planning and development meetings. The legacy information was structured in a content management system (PostNuke [44]), which turned the migration process into an exercise of mapping between the content management systems' databases. Much of the quality control process is still pending, since years of existence have resulted in several inconsistencies in the information structure. A new research plot content application has been deployed, with a higher degree of granularity: more sites, plots and subplots are documented in the database for use by researchers and for public exploration. The personnel, bibliography, data sets and research projects are now interconnected at the record (or node) level, rather than kept in separate compartments. The presentation to the user through the web portal is richer, with a higher degree of contextualized information per page visited. We minimize dead ends, that is, information pages that have no further contextual links.

The metadata migration followed the same path as in Luquillo: EML documents were used to move the content into the custom Drupal content types. In the Sevilleta LTER case, the EML content was non-uniform [45]. While some datasets were rich and quality controlled, others had limited content or inconsistencies, making the post-migration corrections more time-consuming. In all, the migration phase is completed; however, the content is not yet as envisioned in the planning process due to the quality control work that still needs to happen. The correction and verification of over two hundred data sets is easy but time consuming; we estimate about 100 extra hours to finish the Sevilleta Drupal project. The current Drupal Sevilleta portal mimics the previous website portal front page. Some of the menus and categories of information are the same, but the interconnectivity and functionality have been greatly expanded. For example, a personnel directory entry has been enhanced significantly: in addition to the customary person titles and contact information, we added the related list of research projects, data sets and publications. We tag the user with relevant keywords and show blocks of tag-related information on the margin of the main content. We have created similarly rich pages for the data sets, research projects and bibliography. A publication list is no longer a list of more or less standard references: all the elements that make up the list (title, authors, tags) are linked to the corresponding content.

4.3 Other Upcoming and Synergistic Cases

The Plum Island LTER [46] and Arctic LTER [47] sites, hosted at the Marine Biology Laboratory [48], are next to start the migration process. The LTER Information Management Committee allocated some modest funds to train the information managers through the development of their new information management systems. This process is expected to start sometime during the summer of 2010.
The University of Michigan Biological Station has also tweaked the original Drupal-based model, simplifying it for the needs of its principal investigators and users [49]. Specifically, the information manager has merged the data set and data file structure content types. This merging narrows the definition of a dataset, but simplifies the metadata entry process, an advantage from a usability point of view. The North Temperate Lakes LTER is in the early stages of its content migration into the Drupal model described above [50]. One of the first steps consisted of integrating content in a single database while showing it through two different web portals: a website for the common North Temperate Lakes projects and one for the associated Microbial Observatory. The content types were modified slightly to adapt them to a different data model. The North Temperate Lakes LTER is currently developing queries to migrate the data into the new structure according to its migration plan.
5 Concluding Remarks and Future Work

We have developed a low-cost integrative information management system for biological and ecological stations [51]. The customized information categories have been represented in a relational database using the Drupal content management system. Our decision was driven by several factors, including an analysis of the general resources of our potential consumers. This Drupal-based information management system consumes and produces metadata compliant with some of the most common specifications. The system can be adopted by any organization. A number of instances have been deployed to serve as cloud-based metadata editors and metadata catalogs [52]. A factor that influenced us to offer these web-based editors to produce and edit existing EML and BDP compliant metadata is the value of the NBII and LTER metadata programs [4],[53]. The solutions proposed and described in this paper have been adopted by four biological stations that have been producing hundreds of scientific data sets since the 1980s or earlier. Currently, a joint NSF proposal of six LTER stations has been funded to advance several aspects of this system, including the automatic generation of metadata from data stored in relational databases and the consumption of web services, such as the National Biological Information Infrastructure Thesaurus web services [54] and the US Department of Agriculture Integrated Taxonomic Information System web services [55]. Also, with this grant, we have started collaborating with Drupal developers who have worked with the Encyclopedia of Life (EoL) consortium, with the intent of better integrating metadata compliance into the Drupal system as well as consuming EoL's species pages. A usability test of the systems is in order; an information system is successful when its users find it helpful. We plan to use the usability testing laboratory at the University of Tennessee [56] to find and correct the weakest points of the system. In summary, we have sought a low-cost solution to enhance the management of information and scientific metadata as a means to promote data integration and synthesis. The success of integrative network architectures such as PASTA [57] relies on the quality of the data ingested from the data providers. The Drupal-based system that we provide here helps with the data and metadata curation and manipulation processes that are meant to harmonize otherwise irreconcilable data.
Acknowledgments. Inigo San Gil would like to acknowledge the support of the USGS NBII program through the cooperative agreement with the NBII.
References
1. Michener, W.K.: Meta-information concepts for ecological data management. Ecological Informatics 1(1), 3–7 (2005)
2. Sepic, R., Kase, K.: The national biological information infrastructure as an E-government tool. Government Information Quarterly 19(4), 407–424 (2002)
3. The National Biological Information Infrastructure site, http://nbii.gov
4. San Gil, I., Hutchison, V., Frame, M., Palanisamy, G.: Metadata Activities in Biology. J. of Library Metadata (2010) (accepted)
5. Aguilar, R., Pan, J., Gries, C., San Gil, I., Palanisamy, G.: A flexible online metadata editing and management system. Ecological Informatics 5(1), 26–31 (2009)
6. The Arizona Hydrological Information System, http://chubasco.hwr.arizona.edu/ahis-drupal/
7. The Oak Ridge National Laboratory Mercury Consortium, http://mercury.ornl.gov/
8. The Drupal Content Management System, http://drupal.org
9. The Long Term Ecological Research Network, http://lternet.edu
10. Hobbie, J.E.: Scientific Accomplishments of the Long Term Ecological Research Program: An Introduction. BioScience 53(1), 17–20 (2003)
11. San Gil, I., Baker, K., Campbell, J., Denny, E.G., Vanderbilt, K., Riordan, B., Koskela, R., Downing, J., Grabner, S., Melendez, E., Walsh, J.M., Kortz, M., Conner, J., Yarmey, L., Kaplan, N., Boose, E.R., Powell, L., Gries, C., Schroeder, R., Ackerman, T., Ramsey, K., Benson, B., Chipman, J., Laundre, J., Garritt, H., Henshaw, D., Collins, B., Gardner, C., Bohm, S., O'Brien, M., Gao, J., Sheldon, W., Lyon, S., Bahauddin, D., Servilla, M., Costa, D., Brunt, J.: The Long Term Ecological Network metadata standardisation implementation process: A progress report. International Journal of Metadata, Semantics and Ontologies 4(3), 141–153 (2009)
12. Harmon, M.: Motion to adopt EML at the Coordinating Committee (2003), http://intranet.lternet.edu/archives/documents/reports/Minutes/lter_cc/Spring2003CCmtng/Spring_03_CC.htm
13. Orbeon, a framework for XForms, http://orbeon.org
14. Tomcat, an Apache servlet container, http://tomcat.apache.org
15. Rugge, D.J.: Creating FGDC and NBII metadata using Metavist 2005. Technical Report (2005), http://ncrs.fs.fed.us/pubs/gtr/gtr_nc255.pdf
16. Morpho, a standalone editor that outputs Ecological Metadata Language compliant XML, http://knb.ecoinformatics.org/software/morpho/
17. The Organization for Biological Field Stations portal, http://obfs.org/
18. ISO 19115, an international geographic information metadata standard, http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=26020&commid=54904
19. ISO 19137, another ISO standard covering geographic metadata (the XML implementation), http://www.iso.org/iso/catalogue_detail.htm?csnumber=32555&commid=54904
20. The US National Phenology Network, http://usanpn.org
21. The ecology group at Sierra Nevada, creators of Linaria, http://iecolab.es/
22. The University of Michigan Biological Field Station using Drupal (2010), resource at http://umbs.lsa.umich.edu/research/
23. The wind energy data and information gateway (WENDY), http://windenergy.ornl.gov/
24. Encyclopedia of Life, http://eol.org
25. LifeDesks Drupal Modules, by the Encyclopedia of Life, http://www.lifedesks.org/modules/
26. Wiersma, G.S.: Building Online Content and Community with Drupal. Collaborative Librarianship 1(4), 169 (2009)
27. Apache web server, http://apache.org
28. Microsoft's Internet Information Server, http://iis.net
29. The PHP: Hypertext Preprocessor language, http://php.net
30. JavaScript, https://developer.mozilla.org/en/firefox_3.6_for_developers#JavaScript
31. MySQL, http://mysql.com
32. PostgreSQL, http://postgresql.org
33. Buytaert, D.: State of Drupal. Keynote presentation at DrupalCon 2010, San Francisco, CA (2010), http://www.archive.org/details/Css3TheFutureIsNow
34. The Drupal themes, http://drupal.org/project/Themes
35. The extensible markup language, http://w3.org/XML
36. The Perl language, http://perl.org
37. The XML stylesheet transformation language, http://w3.org/Style/XSL/
38. The guidelines for EML use, http://knb.ecoinformatics.org/software/eml/eml-2.0.1/index.html
39. The Biological Data Profile, http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf
40. Velleman, P.F., Wilkinson, L.: Nominal, ordinal, interval, and ratio typologies are misleading. American Statistician, 65–72 (1993)
41. The Luquillo Experimental Forest LTER, http://luq.lternet.edu
42. The Sevilleta LTER portal, http://sev.lternet.edu
43. Melendez, E.: Developing a Drupal website-IMS for Luquillo LTER while learning Drupal. Databits, Spring Issue (2010), http://databits.lternet.edu/spring2010/developing-drupal-website-ims-luquillo-lter-while-learning-drupal
44. PostNuke Content management system
45. San Gil, I., Baker, K.: The Ecological Metadata Language Milestones, Community Work Force, and Change. Databits, 4–7 (Fall 2007)
46. The Plum Island Ecosystem, http://ecosystems.mbl.edu/pie/
47. The Arctic LTER, http://ecosystems.mbl.edu/arc/
48. The Marine Biology Laboratory at Woods Hole, Massachusetts, http://mbl.edu
49. The University of Michigan Biological Station, http://umbs.lsa.umich.edu/research/
50. The North Temperate Lakes Drupal developments, http://lter.dnsalias.net/site/
51. The DIMS, http://intranet.lternet.edu/im/project/dims
52. One of the instances of a cloud-based editor, http://nbii.lternet.edu
53. Lytras, M.D., Sicilia, M.A.: Where is the value in metadata? International Journal of Metadata, Semantics and Ontologies 2(4), 235–241 (2007)
54. NBII Thesaurus tool and web client, accessible at http://thesaurus.nbii.gov
55. The ITIS taxonomic web services, http://www.itis.gov/web_service.html
56. The University of Tennessee College of Communication and Information user experience laboratory, http://www.cci.utk.edu/
57. Servilla, M.S., Brunt, J.W., San Gil, I., Costa, D.: PASTA: A Network-level Architecture Design for Generating Synthetic Data Products in the LTER Network. Databits (Fall 2006)
Agrotags – A Tagging Scheme for Agricultural Digital Objects

Venkataraman Balaji2, Meeta Bagga Bhatia1, Rishi Kumar1, Lavanya Kiran Neelam2, Sabitha Panja2, Tadinada Vankata Prabhakar1, Rahul Samaddar1, Bharati Soogareddy2, Asil Gerard Sylvester2, and Vimlesh Yadav1

1 Indian Institute of Technology Kanpur, India
2 International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
{tvp,meeta}@iitk.ac.in
Abstract. Keyword assignment is an important step towards semantic enablement of the web. In this paper we describe a taxonomy called Agrotags, designed for tagging agricultural documents. Agrotags is a subset of Agrovoc and is much smaller: about 2100 terms as against 40,000. Agrotags was created manually by carefully examining each of the Agrovoc terms for its utility in tagging. This selected subset was further refined and validated by examining the manually assigned keywords in the Agris database. Extending the usage of Agrotags, the concept of Agrotagger emerges: a system for automatically generating keywords for agricultural documents. Agrotagger has been built by moving the learning (which keyword to assign) from the example (document) level to the model level. Being a pluggable module, Agrotagger can act as an add-on to any repository.

Keywords: Agrovoc, Agrotags, Agrotagger, Keyphrase Assignment, Metadata.
1 Introduction

The near absence of agriculture and farming as distinct practices in the world of Web 2.0 has already been pointed out in many instances [1]. International and national level efforts, like agropedia [2], have initiated strategies and created pathways to address this problem by bringing quality extension materials to the web. These materials are stored in a reusable fashion, thus facilitating reuse in various contexts across diverse delivery mediums. At the same time, there is no paucity of research reports, papers and documentation related to agricultural research on the web. Many reputed publishing houses host many of these articles in their repositories. A few of these repositories use various tagging methods to label documents and facilitate ease of retrieval, while others prefer to let search engines index their repository. The inherent drawback of both these approaches lies in the lack of ability to infer knowledge from the tags. This greatly limits the participation and availability of the documents across a semantic network. The need for a tagging methodology grounded in a knowledge model was strongly felt [22]. The combination of advanced tagging, metadata and cross-linking facilitated by controlled ontologies would give rise to a wealth of semantically-linked and relevant
documents. Many international agricultural thesauri exist, such as Agrovoc [3], CABI [4] and NAL [5]. Agrovoc, in existence since 1976 as a thesaurus and morphing into a full-fledged agricultural ontology over the last decade, was seen as a natural choice as a base set for the creation of Agrotags. The advantages offered by a semantically tagged knowledge repository for agriculture were already ascertained by efforts such as agropedia, where Agrovoc has provided the glue for semantic inference [9]. ICRISAT (the International Crops Research Institute for the Semi-Arid Tropics) has long been involved in the enrichment of Agrovoc together with the FAO (Food and Agriculture Organization of the United Nations), with IITK (Indian Institute of Technology Kanpur) maintaining the Hindi version of Agrovoc. ICRISAT has led the revision and refinement of the Agrovoc thesaurus, which forms the basis of the Agrovoc Agricultural Ontology Service (AOS). Agrotags was envisaged as a collection of terms that would be used to tag digital information objects (DIOs) in the agriculture realm. The main aim is to normalize the tagging process in order to make searching simpler and more efficient and to provide the most relevant resources to the user. Agrotags's pedigree is Agrovoc, the agricultural thesaurus from FAO. The ongoing effort to enrich Agrovoc into an ontology is widely known (AIMS website) [6]. Agrovoc is also working on mappings onto leading thesauri such as NAL and CABI; this provides documents tagged with Agrotags rich interconnections with documents tagged with other thesauri. The inherent power of Agrovoc to translate a term into 19 languages provides an added advantage. Applications built using Agrotags as an assisting knowledge layer would have greater reach.

1.1 Ontogenesis of Agrotags

The development of Agrotags started by analyzing the various tagging options available for research documents, especially in the agriculture realm. The inherent drawback was that documents tagged in other languages were not 'retrievable' using the tags supplied. An immediate solution lay in the use of terms from Agrovoc. Agrovoc contains (as of May 2010) almost 40,000 terms in the English language alone, a huge candidate set for the generation of tags. The subject matter experts from ICRISAT and IITK decided that a collection of hand-picked terms would go into the creation of a collection of terms for tagging agriculture-related documents. Initially, the creation of top terms was based on popular thesauri like NAL and CABI, but later it was decided to create a hierarchy rooted in the concepts from the subject categories of the Agris database [7], since these seemed better suited for indexing. After the top terms were finalized, the team set about creating the hierarchy, taking care to retain the intended purpose of Agrotags. Terms were also sourced outside Agrovoc to arrive at a comprehensive collection of tags. Navigating through the 25 top terms of Agrovoc, the team selected terms that were useful for tagging. For example, outbreeding, cultivar selection, mass selection, control methods, etc. are narrower terms of the Agrovoc top term methods, at different depth levels. However, outbreeding and mass selection are associated with the crop improvement top term of Agrotags, cultivar selection with plant production, and control methods with plant protection.
It was felt that some terms which are not in the existing version of Agrovoc needed to be included in Agrotags. Agrovoc is a dynamic and evolving ontology which invites new additions and corrections, so we proposed that these new terms be added to Agrovoc (proposal pending), thus conserving the property that Agrotags is a proper subset of Agrovoc. These terms were arrived at by examining the manually assigned tags of more than 2000 English-language documents in the Agris database for the period 2002-2009. In the first version of Agrotags, 15 top-level terms were created; subsequent revisions may refine these classifications. Plant production, plant protection and crop improvement formed some of the top-level terms of this kind. Currently Agrotags is available in English, Hindi and French; Telugu and Kannada versions are in progress. Agrotags can be seen at http://agropedia.iitk.ac.in/agrotags_version2/agro_tree.html.
1.2 Criteria of Selection
Only descriptors and the more popular terms were selected from Agrovoc to create Agrotags. Non-descriptors, scientific/taxonomic names, fishery-related terms and geographical terms were not included in the selection process. This can be elaborated with some simple examples. 'Rice' is a term in Agrovoc (term code 6599) and has the non-descriptor 'paddy' [8]. 'Rice' is present in Agrotags but 'paddy' is not, so if a document contains the keyword 'paddy' it will be mapped to the Agrotags term 'Rice'. Similarly, 'Organic Wastes' (term code 35237) is a term in Agrovoc as well as in Agrotags, while 'Garden Wastes' (term code 35242) is a narrower term (NT) of 'Organic Wastes' in Agrovoc but is not in Agrotags; if a document yields 'Garden Wastes' as a candidate term, it will be mapped to its broader term 'Organic Wastes'. Scientific names and geopolitical names were also excluded, and it was decided to address only the agriculture domain in this edition of Agrotags, resulting in the removal of fisheries-related terms as well. To summarize, the following relationship holds between Agrotags and Agrovoc:

Agrotags = Agrovoc - (Non-descriptor terms + Scientific terms + Geopolitical terms + Fisheries terms)

1.3 Top Level Terms of Agrotags
Agrovoc has 25 top-level terms whereas Agrotags has 15. The Agrotags top-level terms are not a subset of the Agrovoc top-level terms but a subset of the overall Agrovoc (Fig. 1).
1.4 Agrovoc to Agrotags Term Mapping
Figure 2 shows the hierarchical structure of the 'Methods' fragment of the Agrovoc ontology; the terms in red are the ones included in Agrotags. Relationship information: NT = Narrower Term, usedFor = Non-Descriptor. Fig. 3 shows a table of the mapping between Agrovoc and Agrotags terms.
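The selection criteria of Section 1.2 boil down to a small set of normalisation rules. The following minimal Python sketch illustrates them; the dictionaries are illustrative stand-ins and not the actual Agrovoc or Agrotags data structures.

from typing import Optional

# Illustrative stand-ins for the real thesaurus data (term codes omitted).
NON_DESCRIPTOR = {"paddy": "Rice"}                      # non-descriptor -> preferred descriptor
BROADER_TERM = {"Garden Wastes": "Organic Wastes"}      # Agrovoc narrower term -> broader term
AGROTAGS = {"Rice", "Organic Wastes"}                   # hand-picked proper subset of Agrovoc

def to_agrotag(term: str) -> Optional[str]:
    """Normalise a term found in a document to an Agrotags term, if one exists."""
    term = NON_DESCRIPTOR.get(term, term)               # 'paddy' -> 'Rice'
    while term is not None and term not in AGROTAGS:
        term = BROADER_TERM.get(term)                   # 'Garden Wastes' -> 'Organic Wastes'
    return term

print(to_agrotag("paddy"))          # Rice
print(to_agrotag("Garden Wastes"))  # Organic Wastes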
Fig. 1. Agrotags top-level terms
Fig. 2. Agrovoc to Agrotags term mapping
Fig. 3. Term mapping between Agrovoc and Agrotags
Fig. 4. Agrotags, Agrotagger and openagri in joint action
1.5 Use of Agrotags
Agrotags is currently stored in an internal database format used by OpenAgri [10], an open source repository for agricultural documents developed by IIT-Kanpur and ICRISAT. This repository provides rich semantic interlinking
between documents using Agrotags. Documents are also automatically tagged using the Agrotagger algorithm (Fig. 4). See also [18], [19], [20], [21] for the open access context.
2 The Agrotagger
Machines generally give more efficient results than humans in most domains, but when it comes to natural language understanding, machine-driven results cannot yet compete with human analysis. There is a positive side, however: extracting a handful of keywords from content is a feasible task, and with this in mind a pluggable module called Agrotagger is being developed in collaboration with FAO. This module can be used as an add-on to leading repositories such as DSpace and to content management systems such as Drupal and Joomla, to automatically tag documents with terms from a controlled vocabulary such as Agrotags. User-generated tags, together with those generated by Agrotagger, would help link documents related to agriculture more effectively, for faster retrieval and for an enhanced presence on today's web.
2.1 Need for Agrotagger
With the huge number of digital documents on the Internet, growing with each passing day, keyphrases prove to be an important piece of metadata. Although keyphrases can be assigned by a document's author at creation time, manually tagging documents with keyphrases is not only labor-intensive and time-consuming but also yields poor indexing consistency over the entire document collection. Indexing a document is not a new concept: the first systematic approach, true alphabetical indexing, emerged as far back as the fourteenth century. As technology developed, fresh ideas kept coming: the alphabetical index became the catalogue, the catalogue became the taxonomy, the taxonomy was converted into the thesaurus, and it is from such a vocabulary that Agrotagger now generates keywords automatically. A document's metadata consists of fields such as author, title and keywords, and of these the keywords are the most reliable. For example, "Options for adaption, though limited do exist" is the title of an article about marine fisheries from the magazine "The Hindu Survey of Indian Agriculture 2009"; the title gives no clue about the actual topic of the article. This is where keywords are crucial. Automatic keyword assignment has several approaches, primarily keyword assignment from a vocabulary, where the candidate keyword comes from a standard vocabulary, and keyword generation from text, where the candidate keyword is not restricted to a specific vocabulary. These can be rule-based or based on machine learning. Examples of rule-based systems are those of Han and Karypis [13], Larkey and Croft [14], and Sebastiani [15]; the systems of Frank et al. [16] and Turney [17] are based on machine learning.
2.2 Role of Agrotags in Agrotagger
Agrotagger uses Agrotags terms as candidate keyphrases for documents. As explained earlier, Agrotags is a proper subset of Agrovoc: Agrovoc has about 40,000 agricultural
concepts and Agrotags has around 2,100. The concepts selected for Agrotags were hand-picked based on their utility in a tagging scheme as well as their popularity. Agrotagger identifies the occurrence of Agrovoc terms in a document, replaces them with the equivalent Agrotags terms and then chooses the candidate keywords from among them.
2.3 Workflow in Agrotagger
At the top level, Agrotagger works in three main stages:
Stage 1: Identify all Agrovoc terms in the document; the document is now a bag of Agrovoc terms.
Stage 2: For each of these Agrovoc terms, identify an Agrotags term; this reduces the document to a bag of Agrotags terms.
Stage 3: Use statistical techniques to calculate the suitability of these terms as keyphrases.
Agrotagger is inspired by an automatic keyphrase extraction algorithm called KEA [11]. The KEA system works by training a classifier on large datasets and then assigning keywords using the trained model. Learning from a large corpus is difficult, as such corpora are not readily available. We have therefore modified the KEA approach by shifting the training from the corpus to the knowledge-model level: the manually constructed Agrovoc-to-Agrotags mapping is the learning model. After obtaining the content-bearing terms (by eliminating fluff words and through stemming) we intersect them with the Agrovoc terms. The resulting terms are then mapped to their respective Agrotags terms using a precomputed hash table. This set of filtered candidate terms is then given as input to the KEA algorithm. To extract keyphrases KEA makes use of the following attributes:
• Length of a phrase in words
• Frequency of the words
• Node degree of the candidate terms
• Occurrence based on location of the terms
• Appearance: binary variable to check the presence of the terms
For more details refer to KEA: Keyword Extraction Algorithm [11] and Rishi Kumar's thesis [12]. Figure 5 gives the top-level workflow of Agrotagger. Stopwords are words which are filtered out prior to, or after, processing of a selected document; in our case we have identified 262 distinct stop words, which are generally articles, pronouns, adverbs, prepositions, conjunctions, single consonants and vowels, and some unit entities.
2.4 Usage of Agrotagger
Agrotagger is currently being used by an open access agricultural research repository called openagri. This repository is an open platform to submit any kind of published agricultural material under a single hood; all a user needs is a username and password, which is easily obtained by registering on the site. Once a user registers and submits a document, the Agrotagger running in the background automatically generates keywords. See Fig. 6 for a sample screen.
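The three-stage workflow of Section 2.3 can be pictured with the sketch below. The stop-word list, the stemmer, the term mapping and the scoring are deliberately simplified placeholders (only term frequency stands in for the KEA-style attributes), not the actual Agrotagger implementation, and multi-word term spotting is omitted.

import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "to"}      # 262 words in the real system
AGROVOC_TO_AGROTAGS = {"paddy": "Rice", "rice": "Rice", "sorghum": "Sorghum"}  # placeholder mapping

def stem(word):
    # crude placeholder for a real stemmer
    return word[:-1] if word.endswith("s") else word

def extract_keyphrases(text, top_n=5):
    # Stage 1: content-bearing terms (drop stop words, stem the rest), kept only if Agrovoc terms
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    # Stage 2: map each spotted Agrovoc term to its Agrotags term
    agrotags = [AGROVOC_TO_AGROTAGS[t] for t in tokens if t in AGROVOC_TO_AGROTAGS]
    # Stage 3: rank candidates; plain frequency stands in for the KEA attributes listed above
    return Counter(agrotags).most_common(top_n)

print(extract_keyphrases("The paddy fields produce rice and sorghum."))
# [('Rice', 2), ('Sorghum', 1)]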
Fig. 5. Workflow of Agrotagger
Fig. 6. Document from openagri research repository
Agrotagger is also available as a web service. To automatically obtain keywords for an agricultural document (currently only PDFs), go to: http://agropedia.iitk.ac.in/auto_tagger/callable_auto_tagger.php
3 Conclusions
In this paper we have described a system for automatically generating keywords for agricultural documents. We propose a new tag set called Agrotags, which is a proper subset of an enhanced Agrovoc and is specially designed with tagging in mind. Agrotagger is software for assigning keyphrases automatically from Agrotags. It works by recognizing Agrovoc terms in a document, mapping them to Agrotags terms and using statistical techniques to assign probabilities to the candidate keywords. The whole system has been implemented and deployed as a web service.
Acknowledgement
We gratefully acknowledge the Indian Council of Agricultural Research, New Delhi, India, and the Food and Agriculture Organization, Rome, for their unending financial, technical and overall support and cooperation.
References 1. Balaji, V.: The fate of agriculture, http://www.india-seminar.com/2009/597/597_v_balaji.htm 2. Agropedia: An agricultural encyclopedia, http://agropedia.net/ 3. Agrovoc: A multilingual agricultural thesaurus, http://aims.fao.org/website/Agrovoc-Thesaurus/sub 4. CABI, http://www.cabi.org/ 5. National Agricultural Library, http://www.nal.usda.gov 6. Agricultural Information Management Standards, http://aims.fao.org/ 7. AGRIS: International Information System for the Agricultural Sciences and Technology, http://agris.fao.org/ 8. Agrovoc: A multilingual agricultural thesaurus - Terminology, http://www.fao.org/docrep/008/af234e/af234e02.htm 9. Use of Semantic Wiki Tools to Build a Repository of Reusable Information Objects in Agricultural Education and Extension: results from a preliminary study. Web2ForDev International Conference, Rome (September 25-27, 2007), http://www.web2fordev.net/ 10. Openagri: An Open Access Agricultural Research Repository, http://agropedia.iitk.ac.in/openaccess/ 11. KEA: Keyword Extraction Algorithm, http://www.nzdl.org/Kea/index_old.html 12. Kumar, R.: Automatic Keyword Extraction using Enhanced Knowledge Models. Master's Thesis
13. Han, E., Karypis, G.: Centroid-based document classification: analysis and experimental results. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 424–431. Springer, Heidelberg (2000) 14. Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: SIGIR, pp. 289–297 (1996) 15. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002) 16. Frank, E., Gutwin, C., Nevill-Manning, C.G., Witten, I.H., Paynter, G.W.: KEA: Practical automatic keyphrase extraction. In: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pp. 129–152 (2005) 17. Turney, P.D.: Learning to extract keyphrases from text. Technical report, National Research Council, Institute for Information Technology (1999) 18. Kousha, K., Abdoli, M.: The citation impact of Open Access agricultural research: a comparison between OA and non-OA publications. In: IFLA World Library and Information Congress: 75th IFLA General Conference and Assembly (2009), http://www.ifla.org/files/hq/papers/ifla75/101-kousha-en.pdf 19. Chisenga, J., Simumba, D.: Open Access publishing: views of researchers in public agricultural research institutions in Zambia. Agricultural Information Worldwide (2009) 20. Gawrylewski, A.: http://www.soros.org/openaccess/read.shtml (2008) 21. Houghton, J., Sheehan, P.: The Economic Impact of Enhanced Access to Research Findings. CSES Working Paper No. 23 (2006). Agricultural Information Worldwide 2 (2009), http://www.cfses.com/documents/wp23.pdf 22. Soergel, D., Lauser, B., Liang, A., Fisseha, F., Keizer, J., Katz, S.: Reengineering thesauri for new applications: the AGROVOC example. Journal of Digital Information 4(4) (2004), http://journals.tdl.org/jodi/article/viewarticle/112/111
Application Profiling for Rural Communities: eGov Services and Training Resources in Rural Inclusion Pantelis Karamolegkos, Axel Maroudas, and Nikos Manouselis Greek Research & Technology Network, Mesogion Ave. 56, 11527, Athens, Greece {pkaramol,axel}@grnet.gr,
[email protected]
Abstract. Metadata plays a critical role in the design and development of online repositories. The efficiency and ease of use of the repositories are directly associated with the metadata structure, since end-user functionalities such as search, retrieval and access are highly dependent on how the metadata schema and application profile have been conceptualized and implemented. The need for efficient and interoperable application profiles is even more substantial when it comes to services related to the e-government (eGov) paradigm, given a) the close association between services related to eGov and the metadata usage and b) the fact that the eGov concept is associated with time and cost critical processes, i.e. interaction of citizens and services with public authorities. In this paper, we outline an effort related to application profiling for eGov services and training resources, used in the platform of RuralObservatory2.0, which will underpin a major objective of the ICT PSP Rural Inclusion project, i.e. the eGov paradigm uptake by rural communities. Keywords: application profile, e-government, learning object, metadata.
1 Introduction
It is widely acknowledged that Small and Medium Enterprises (SMEs) constitute a critical aspect of the overall production process in liberal economies1. Hence, it becomes evident that the optimization of their productive processes and the minimization of their operating costs are in the interest of the greater business ecosystem. However, although significant provision has been made for motivating the foundation and sustainability of SMEs, there are still criticalities pertaining to each enterprise's distinct idiosyncrasies that need to be addressed. One of these issues is the low penetration of innovative tools and technologies among SMEs residing in rural areas2 [1]. Inevitably, the eGov concept finds extended applicability when it comes to rural settings, where specific challenges, such as the physical distance between citizens' residences and public authorities' premises, call for efficient eGov frameworks that
1 http://ec.europa.eu/information_society/tl/ecowor/smes/index_en.htm
2 http://www.publictechnology.net/content/20494
will facilitate the transactions between people and the public administration. However, devising and implementing a successful strategy for promoting a sustainable ICT and eGov uptake in rural areas needs to take into consideration several aspects, apart from provisioning Internet access to rural communities. On the one hand, the online tools and repositories need to be easily accessible by the majority of the population engaged in business transactions in rural areas. Most people, due to the time-critical nature of their everyday tasks, will avoid engaging in long searches on the Internet in order to find electronic versions of the services of their interest, let alone training resources about these services. On the other hand, there is the greater challenge of making the usage of such online tools sustainable, i.e. cultivating in local populations the mentality of making the most of the sophistication provided by e-services. This calls for a long-term approach and will certainly require increased effort to get users familiarized with innovative technical solutions, using carefully designed and deployed training material. Rural Inclusion3, a major European project supported by the Information and Communication Technologies Policy Support Programme, adapts and deploys a Web infrastructure combining semantic services with a collaborative training and networking approach in rural settings. The project, through the RuralObservatory2.04 component, offers an innovative and viable solution for familiarizing rural SMEs with the usage of eGov services: it is a sophisticated Web-based environment through which rural SMEs are able to find information on eGov services offered in their region, as well as to access e-learning content on how to use such services. Overall, the Rural Inclusion project, through the RuralObservatory2.0 component, addresses the constraints related to the sustainable uptake of ICT and eGov services, particularly in rural areas, by: a) deploying a repository with eGov services and training resources for rural entrepreneurs, which will facilitate information retrieval, access, usage and exploitation of eGov services and relevant digital educational content; b) incorporating the necessary metadata standards (in our case IEEE LOM and e-GMS) for describing the relevant resources (eGov services and training material); c) extending/specializing those standards in order to take into consideration the variety of special requirements that have to be reflected in the metadata (e.g. linguistic preferences, geographical location, particularity of covered topics, etc.). This calls for the implementation of both standards-based and context-specialized metadata in the RuralObservatory2.0 repository, through the incorporation of appropriate Application Profiles, i.e. assemblages of metadata elements selected from one or more metadata schemas, with the purpose of adapting or combining existing schemas into a package that is tailored to the functional requirements of a particular application, while retaining interoperability with the original base schemas [2]. In this direction, the present paper describes the application profiles that have been developed to support the RuralObservatory2.0 online repository.
2 Background
Electronic government (eGov) is one of the novel and most appealing applications of Information and Communication Technologies (ICTs). eGov is defined as "the use of
3 www.rural-inclusion.eu
4 http://rural-inclusion.vm.grnet.gr:8081/inclusion/index.htm
ICT in public administrations combined with organizational change and new skills in order to improve public services and democratic processes, and strengthen support to public policies" [3]. The Rural Inclusion platform will provide innovative services for rural SMEs by alleviating the administrative burdens associated with several transactions with public authorities. This will mainly be achieved through a sophisticated interactive information elicitation process that will help SMEs understand the prerequisites of several services and save valuable time and resources, the overall purpose being the minimization of physical interaction between SMEs and public authorities. Within the Rural Inclusion platform, the task of provisioning training resources and a specific number of eGov services will be undertaken by the RuralObservatory2.0, in which two types of information resources (or objects) are mainly stored, shared and accessed online: Digital Training Objects (DTOs) and e-Government Resource Objects (eGROs). The design and specification of the RuralObservatory2.0 also called for the design of the system's repositories: that is, of the databases that store information about DTOs and eGROs and the relevant metadata. In the case of RuralObservatory2.0 the stored metadata about the resources has been represented according to two selected metadata standards. More specifically, for the description and classification of the eGROs for rural SMEs, the RuralObservatory2.0 uses a specialization of the e-Government Metadata Standard (e-GMS) [4]. In addition, for the description and classification of the DTOs, a specialization of the IEEE Learning Object Metadata (LOM) [5], [6] standard has been developed. The metadata (and therefore access to the training resources) are made available to other repositories and federations of repositories through their exposure using the OAI-PMH5 protocol. This allows for the potential federation of the RuralObservatory2.0 with federations of repositories such as ARIADNE6 and GLOBE7. An approach similar to the one adopted by Rural Inclusion, with a view to providing added value to the everyday business activities of SMEs through the incorporation and sustainable uptake of ICT, has also been undertaken by other European projects. Among the most notable are Symphony [7], a project that aimed at the development of an integrated set of tools for the management of enterprises, in order to support human resource managers of SMEs in their decisions about searching for newcomers and/or assigning people to jobs, and Flexibly Beyond [8], in which Complex Knowledge Structures (CKS), a methodological and computational framework for the representation and management of experiential and social knowledge of SMEs based on storytelling and Case Based Reasoning (CBR) paradigms, was adopted with a view to transferring the theoretical findings to the practical level in the SME manufacturing context. The need for metadata standard customization through the incorporation of appropriate application profiles is very important when it comes to the design and development of a digital repository, such as Rural Inclusion's RuralObservatory2.0, to be populated with content that aims to facilitate the uptake of eGov services by both specific target user groups (SMEs, public authorities) and the general audience. In the next sections we discuss our experience from developing such a
5 www.openarchives.org
6 http://www.ariadne-eu.org/
7 http://www.globe-info.org/en/aboutglobe
metadata schema specifically tailored for a portal related to eGov services that aims to address the lifelong learning needs of several stakeholders, such as rural SMEs, public authorities and citizens in general.
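Since the RuralObservatory2.0 metadata are exposed through OAI-PMH, a federation partner could harvest them roughly as in the sketch below; the base URL is a placeholder, and the actual endpoint, sets and metadata prefixes of the repository may differ.

from urllib.parse import urlencode
from urllib.request import urlopen
from xml.etree import ElementTree as ET

BASE_URL = "http://example.org/observatory/oai"   # placeholder OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles():
    """List record titles via the standard OAI-PMH ListRecords verb (oai_dc)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{BASE_URL}?{query}") as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI}record"):
        for title in record.iter(f"{DC}title"):
            yield title.text

for t in harvest_titles():
    print(t)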
3 Rural Inclusion Application Profile
3.1 Rural Inclusion Digital Training Objects (DTOs)
With regard to its training/educational role within Rural Inclusion, the RuralObservatory2.0 portal will undertake the role of a Digital Learning Repository. Digital Learning Repositories (DLRs) constitute an area of particular interest for metadata development: in such tools, digital learning resources are systematically organized, classified and published, and many institutions are currently engaged in developing DLRs that can be searched and accessed by a wide audience [9]. The DTOs support a variety of training scenarios for the rural SMEs, and include different types of educational material (such as lectures, best practice guides, self-assessment forms, etc.). These are stored as electronic files in the form of Powerpoint presentations, Word documents, PDF documents, short demo videos, and others. The Application Profile for the DTOs (dubbed Rural Inclusion AP) consists of the following elements and specializations.
3.1.1 DTO Elements
The first category of LOM elements is the category General. It includes elements that describe a learning object (in our case, a DTO) and store general information about it. In the Rural Inclusion AP, the following elements have been selected for use as recommended by LOM: Identifier, Title, Language, Description, Keyword, Structure and Aggregation Level. In addition, the element Coverage has been specialized in the way presented in Table 1.

Table 1. Elements of the General category that have been further specialized in the DTO AP
Element: General
Sub-element: Coverage
Description: Geography or region to which this DTO applies.
Use in Rural Inclusion: Include 3-layer coverage of specific European regions
Value space: http://ec.europa.eu/eurostat/ramon/nuts/codelist_en.cfm?list=nuts
The next category Life Cycle describes the history and current state of a DTO, as well as the entities that have affected the DTO during its evolution. In Rural Inclusion AP, the following elements have been selected and used as recommended by IEEE LOM: Version, Status, and Contribute. The Meta-Metadata category contains information about the metadata record that describes the DTO. It identifies the metadata record in a classification system (i.e. the repository’s database with the metadata descriptions). It contains information about who provided the DTO description and
when, which metadata schema was followed to produce the metadata description, and in which language the metadata are expressed (which can be different from the language of the learning object itself). In the Rural Inclusion AP it is used as recommended by LOM, and includes the elements Identifier, Metadata Schema, Language, Contribute and their designated sub-elements. In a similar manner, a set of selected items from the Technical category is used to describe the technical requirements and characteristics of a DTO. The elements selected for the Rural Inclusion AP are: Format, Size, Location, Platform Requirements, and Duration. The Educational category describes the key educational or pedagogic characteristics of a DTO; its elements have been specialized as presented in Table 2.

Table 2. Elements of the Educational category that have been further specialized in the Rural Inclusion AP (Element: Educational)
Sub-element: Interactivity Type | Description: Predominant mode of learning supported by the learning object | Use in Rural Inclusion: vocabulary containing the values that LOM recommends | Vocabulary: LOM recommendation vocabulary
Sub-element: Learning Resource Type | Description: Specific kind of learning object; the most dominant kind shall be first | Use in Rural Inclusion: vocabulary containing the values that LOM recommends | Vocabulary: LOM recommendation vocabulary
Sub-element: Intended End User Role | Description: Principal user(s) for which this LO was designed, most dominant first | Use in Rural Inclusion: vocabulary containing the values that LOM recommends | Vocabulary: LOM recommendation vocabulary
Sub-element: Context | Description: The principal environment within which the learning and use of this LO is intended to take place | Use in Rural Inclusion: vocabulary containing the values that LOM recommends | Vocabulary: LOM recommendation vocabulary
The Rights category describes the intellectual property rights and conditions of use for this DTO. In the Rural Inclusion AP it is used as recommended by LOM, and includes the elements Cost, Copyright & Other Restrictions, and Description. In the Relation category, which defines the relationship between the described DTO and other DTOs, the following LOM-specified elements are used: Kind and Resource, as well as their designated sub-elements. Finally, the Classification category describes where this DTO falls within a particular classification system. All the elements described in LOM are used in the Rural Inclusion AP. In the context of the Rural Inclusion initiative, the classification system used is based on the NACE codes of economic activity8. The first four digits of the code, which constitute the first four levels of the classification system, are the same in all European countries. The fifth digit might vary from country to country, and further digits are sometimes added by suppliers of databases.
8 http://ec.europa.eu/competition/mergers/cases/index/nace_all.html
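To make the profile more concrete, the following sketch shows how a DTO description restricted to a few of the elements above could be serialised. The element names follow IEEE LOM, but the flat, namespace-free XML binding used here is an assumption for illustration only, not the actual RuralObservatory2.0 serialisation.

from xml.etree import ElementTree as ET

def build_dto_record(title, language, keywords, context, taxon_id, taxon_entry):
    lom = ET.Element("lom")                               # simplified, namespace-free binding
    general = ET.SubElement(lom, "general")
    ET.SubElement(general, "title").text = title
    ET.SubElement(general, "language").text = language
    for kw in keywords:
        ET.SubElement(general, "keyword").text = kw
    educational = ET.SubElement(lom, "educational")
    ET.SubElement(educational, "context").text = context
    classification = ET.SubElement(lom, "classification")
    taxon_path = ET.SubElement(classification, "taxonPath")
    ET.SubElement(taxon_path, "source").text = "NACE codes of economic activity"
    taxon = ET.SubElement(taxon_path, "taxon")
    ET.SubElement(taxon, "id").text = taxon_id
    ET.SubElement(taxon, "entry").text = taxon_entry
    return ET.tostring(lom, encoding="unicode")

print(build_dto_record("Managing Farms with IT", "en",
                       ["Agriculture", "Animal production"], "professional development",
                       "A", "AGRICULTURE, HUNTING AND FORESTRY"))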
Table 3. Example of a RuralObservatory DTO
Element / Value:
1. General
  1.2 Title: Managing Farms with IT
  1.3 Language: en
  1.4 Description: A good summary of how you can use IT, the Web and eGovernment services to help manage your farm.
  1.5 Keyword: Agriculture, Animal production, Aquatic sciences and fisheries
2. Life Cycle
  2.3 Contribute
    2.3.1 Role: publisher
    2.3.2 Entity: Ceri Evans, UK (15-03-2010)
    2.3.3 Date: 2000
4. Technical
  4.1 Format: ppt
  4.2 Size: 40001 bytes
  4.3 Location: http://rural-inclusion.vm.grnet.gr:8080/observatory/viewDTO.do?dto_id=51
5. Educational
  5.2 Learning Resource Type: slide
  5.5 Intended End User Role: learner
  5.6 Context: higher education, professional development
  5.7 Typical Age Range: 18-U
9. Classification
  9.2 Taxon Path
    9.2.1 Source: NACE codes of economic activity
    9.2.2 Taxon
      9.2.2.1 Id: A
      9.2.2.2 Entry: AGRICULTURE, HUNTING AND FORESTRY
3.1.2 DTO Example of Use In Table 3, we provide an extract of a DTO description, regarding how to manage a farm with IT. Due to space restrictions, not all elements are used. Instead of that, we are restricted only to a combination between the mandatory and the most commonly recommended elements of the AP. 3.2 Rural Inclusion eGov Resource Objects (eGROs) Apart from the training resources, the Rural Inclusion Observatory 2.0 aims to list a number of eGov services for each participating country that may be useful for the SMEs in the corresponding rural areas. The way to represent and store such characteristics in an appropriate format is by using well-specified metadata schemas. In the case of eGROs, the medatada schemas to be developed are based on existing metadata standards. More specifically, the standard used in this case is the e-Government Metadata Standard (e-GMS) 4 3.2.1 eGROs Elements The first element, is the Title. Its purpose is to enable the user to find an eGRO with a particular title or carry out more accurate searches. The second element is the Subject, which stores information about the topics of the content of the eGRO. Its purpose is to enable the user to search by the topic of the eGRO. Table 4. Elements of the Subject category, further specialized in Rural Inclusion AP Element
Use in Rural Inclusion
Value space
Predefined vocabulary values
NACE Codes of Economic Activity Controlled Vocabulary
Predefined vocabulary values
Integrated Public Sector Vocabulary9
Process Identifier
Indicates a specific service or transaction, using an identifier taken from a recognised list.
Predefined vocabulary values
European Commission's Business Life Events Vocabulary
Programme
The broader policy programme to which this eGRO relates directly
LangString
ISO/IEC 1064610
Sub-element
Category
Keyword Subject
9 10
Description Broad subject categories from the Government Category List, and, optionally, any other widely available category list.. Words or terms used to describe, as specifically as possible, the subject matter of the eGRO. These should be taken from a controlled vocabulary or list.
http://doc.esd.org.uk/IPSV/2.00.html http://www.iso.org/iso/catalogue_detail.htm?csnumber=29819
Application Profiling for Rural Communities: eGov Services and Training Resources
53
The third element, Description, stores an account of the content of the eGRO. Its purpose is to help the user decide if the eGRO fits their needs. The next element, Publisher, stores the entity responsible for making the eGRO available. Its purpose is to enable users to find a eGRO published by a particular organization or individual. It can also be referred to by those wanting to re-use or re-publish the eGRO elsewhere. Only the Name, Address, Country, Telephone and e-mail attributes of the vCard specification [10] will be used. Following Publisher, comes the Date element, which describes a date associated with an event in the life cycle of the eGRO. Its purpose is to enable the user to find the eGRO by limiting the number of search hits according to a date, e.g. the date the eGRO was made available. It consists of the sub-elements Issued (the date when the governmental eGRO has been issued) and Valid (the period of validity of the governmental eGRO). The next couple of elements is Type and Format. Type stores the nature or genre of the content of the eGRO. Its purpose is to enable the user to find a particular type of eGRO. It will contain a text string that will take values from the eGMS Type Encoding Scheme11. Format on the other hand, describes the physical or digital manifestation of the eGRO. Its purpose is to allow the user to search for items of a particular format. It takes values from MIME types as defined in RFC2048:1996 [11], for example: application/msword for Microsoft’s Word documents, text/html for html pages etc. The Identifier element which comes next, holds an unambiguous reference to the eGRO within a given context. Its purpose is to allow a user to search for a specific eGRO or version. It actually refers to the System ID and stores a machinegenerated running number allocated when the file is first created. This will typically be used by the internal processes and will rarely be visible to the end user, although it can be a useful tool for administrators accessing other information about the file-path object (e.g. interrogating the audit trail). The Language element, holds the language of the intellectual content of the eGRO. Its purpose is to enable users to limit their searches to eGROs in a particular language. The following element, Coverage, indicates the extent or scope of the content of the eGRO. Its purpose is to enable the user to limit the search to items about a particular place or time. The ‘Coverage’ element is expected to take values from a pre-defined vocabulary of countries and/or regions, such as the ISO country set [12] as well as additional vocabulary terms to cover special cases such us the European regions (from Nomenclature of Territorial Units for Statistics (NUTS)12. The next element, Audience, describes a category of user for whom the eGRO is intended. Its purpose is to enable the user to indicate the level or focus of the eGRO, as well as enabling filtering of a search to items suited to the intended audience. In Rural Inclusion Observatory 2.0 it has been decided to take values from the Local Government Audience List13 (LGAL). The next table provides a more intuitive depiction of the specialization in regard to the specific element, so that it addresses specific Rural Inclusion needs. The Classification element, describes the Classification of the eGov service along the ICDT model introduced in [13], that is from a controlled vocabulary of values, e.g. ‘Information Services’. The Metadata element contains information about the 11
www.govtalk.gov.uk/documents/Encoding_scheme_type_v1_2002.pdf http://ec.europa.eu/eurostat/ramon/nuts/home_regions_en.html 13 http://www.esd.org.uk/standards/lgal/ 12
54
P. Karamolegkos, A. Maroudas, and N. Manouselis
metadata record that describes the eGRO. It identifies the metadata record in a classification system (e.g. a database with e-government eGROs’ descriptions). It is an IEEE LOM element that was “borrowed” for the Rural Inclusion e-GMS Application Profile This element consists of the following sub-elements: Contribute: describes the entities that have contributed to the metadata record, such as the metadata author (creator) or a validator, as well as additional in-formation about when this metadata record has been created, modified or published. Role: the Table 5. Specialization of the Audience Element in eGROs Rural Inclusion AP Element
Audience
Description
Use in Rural Inclusion
A category of user for whom the eGRO is intended. Enables the user to indicate the level or focus of the eGRO, as well as enabling filtering of a search to items suited to the intended audience. Don’t use Audience unless the eGRO is prepared with a particular group in mind. If it’s for general release, leave it blank.
The Vocabulary of “Local Government Audience List” will be used
Value space
Local Government Audience List
role of the entity (person or organization) that creates, modifies or validates a metadata record. It is expected to take values from a pre-defined vocabulary of business roles (e.g. Author, Modifier, Validator, etc.). Date: the date of the creation, modification or validation of the metadata record. It is expected to contain a text string where a date is denoted in the YYYY-MM-DD W3C-DTF14 format (or any other widely acceptable format for date representation). Entity: information about the entity. It is expected to contain a information about the person or organization, incorporated in the elements Name, Address, Country, Telephone and e-mail Metadata Schema: contains a text string with the version of the model used (e.g. ‘Rural Inclusion e-GMS v1.0’). Language: the language of the metadata record. It is expected to take values from a pre-defined vocabulary of languages, such as in [14] e.g. ‘en’ for English, ‘el’ for Greek. The last element is Location, and it incorporates the URL identifier of the resource. The following example shows how the elements of the Rural Inclusion e-GMS Application Profile can be exploited for the description of a sample eGRO. Due to space restrictions, not all elements are used.
14
http://www.w3.org/TR/NOTE-datetime
Table 6. Example of a RuralObservatory eGRO
Element / Value:
1. Title: TED - Tenders Electronic Daily
2. Subject
  2.1 Category: REAL ESTATE, RENTING AND BUSINESS ACTIVITIES
  2.2 Keyword: Land and premises
  2.3 Process Identifier: Start a business; Buy a business; Sell your business; Close your business; Insolvency and Bankruptcy
3. Description: Leaflet for parents explaining the purpose of the introduction of Home-School agreements, which are compulsory for all maintained schools
5. Date
6. Type: Form
7. Format: Electronic Document
8. Identifier: 20008769
9. Language: Slovenian, Greek, English, Polish, German
10. Coverage: EU
11. Audience: Businesses
12. Classification: Information services
13. Meta-metadata
  13.1 Contribute
    Role: Creator
    Entity: Name: John Doe; Address: Somewhere Str 23; Country: Greece; Telephone: 210-2820276; e-mail: [email protected]
    Date: 23/08/01
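A complementary sketch for an eGRO description along the lines of Table 6, with light validation against two of the controlled vocabularies named in the profile; the vocabulary sets and the description text are abbreviated placeholders, not the actual controlled lists.

# Abbreviated stand-ins for the controlled vocabularies named in the profile.
ICDT_CLASSIFICATION = {"Information services", "Communication services",
                       "Distribution services", "Transaction services"}   # ICDT model [13]
AUDIENCE_LIST = {"Businesses", "Citizens"}                                 # placeholder for LGAL

def make_egro(title, description, coverage, audience, classification, languages):
    if classification not in ICDT_CLASSIFICATION:
        raise ValueError("classification must come from the ICDT vocabulary")
    if audience not in AUDIENCE_LIST:
        raise ValueError("audience must come from the audience list")
    return {"Title": title, "Description": description, "Coverage": coverage,
            "Audience": audience, "Classification": classification, "Language": languages}

record = make_egro("TED - Tenders Electronic Daily",
                   "(free-text account of the eGRO content)",
                   "EU", "Businesses", "Information services", ["en", "el", "pl"])
print(record)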
4 Conclusions
The development of an appropriate metadata schema can greatly facilitate the search and retrieval tasks of the users accessing an online digital repository. In addition, the adoption of well-accepted metadata standards (such as IEEE LOM and e-GMS) can promote interoperability between the Rural Inclusion repository (i.e.
RuralObservatory2.0) and others, as well as reusability of the metadata records. On the other hand, in repositories for eGov services the adopted metadata schema has to be appropriately contextualized in order to better meet user needs and requirements. In this paper we presented such specializations, which constitute the application profiles for the Rural Inclusion project. Using the IEEE LOM and e-GMS standards is in line with the majority of other efforts deploying similar repositories of learning and eGov content objects.
Acknowledgments
The work presented in this paper has been funded with support from the European Commission, and more specifically the project "Rural Inclusion: e-Government Lowering Administrative Burdens for Rural Businesses" of the ICT PSP Programme.
References 1. Blakemore, M., Lloyd, P.: Think Paper 10. Trust and Transparency: pre-requisites for efficient eGovernment, Organizational Change for citizen-centric eGovernment, Version No. 23 (2007) 2. Duval, E., Hodgins, W., Sutton, S., Weibel, S.L.: Metadata Principles and Practicalities. D-Lib Magazine 8(4) (2002) 3. European Commission: The role of eGovernment for Europe’s future, Communication from the Commission to the Council, the European Parliament, the European Economic and Social Committee and the Committee of the Regions, Brussels, COM No. 567 (2003) 4. e-GMS E-Government Metadata Standard version 3.0 (2004), http://www.govtalk. gov.uk/schemasstandards/metadata_document.asp?docnum=872 5. IEEE LOM: Draft Standard for Learning Object Metadata, IEEE Learning Technology Standards Committee, IEEE 1484.12.1-2002 (2002) 6. ISO/IEC: Working Draft for ISO/IEC 19788-2 – Metadata for Learning Resources – Part 2: Data Elements, ISO/IEC JTC1 SC36 (2005) 7. Bandini, S., Mereghetti, P., Merino, E., Sartori, F.: Case–Based Support to Small–Medium Enterprises: The Symphony Project. In: Basili, R., Pazienza, M.T. (eds.) AI*IA 2007. LNCS (LNAI), vol. 4733, pp. 483–494. Springer, Heidelberg (2007) 8. Bandini, S., Manzoni, S., Sartori, F.: Case-Based Reasoning to Support Work and Learning in Small and Medium Enterprises. In: 21st IEEE International Conference on Tools with Artificial Intelligence, Newark, New Jersey, pp. 253–260 (2004) 9. Tzikopoulos, A., Manouselis, N., Vuorikari, R.: An Overview of Learning Object Repositories. In: Northrup, P. (ed.) Learning Objects for Instruction: Design and Evaluation, pp. 29–55. Idea Group Publishing, Hershey (2007) 10. Dawson, F., Howes, T.: vCard MIME Directory Profile, Internet proposed standard RFC 2426 (1998) 11. Freed, N., Klensin, J., Postel, J.: Multipurpose Internet Mail Extensions (MIME) Part Four: Format of Internet Message Bodies, RFC 2048 (1996) 12. ISO 3166-1:2006. Codes for the representation of names of countries and their subdivisions – Part 1: Country codes. International Standardization Organization 13. Anghern, A.: Designing mature Internet business strategies: the ICDT model. European Management Journal 15(4), 361–368 (1997) 14. ISO 639-1:2002. Codes for the representation of names of languages – Part 1: Alpha2code. International Standardization Organization
Developing a Diagnosis Aiding Ontology Based on Hysteroscopy Image Processing Marios Poulos1 and Nikolaos Korfiatis2 1
Laboratory of Information Technology, Department of Archives and Library Science, Ionian University, Ioanni Theotoki 72, 49100, Corfu, Greece 2 Department of Economics, University of Copenhagen, Øster Farimagsgade 5. 26, 1353, Copenhagen, Denmark
[email protected],
[email protected]
Abstract. In this paper we describe an ontology design process which will introduce the steps and mechanisms required in order to create and develop an ontology which will be able to represent and describe the contents and attributes of hysteroscopy images, as well as their relationships, thus providing a useful ground for the development of tools related with medical diagnosis from physicians. Keywords: ontologies, metadata, hysteroscopy, medical imaging, medical diagnosis.
1 Introduction
Hysteroscopy [1] is a medical endoscopic method used by physicians for the inspection of the uterus using an endoscopic device. The method is increasingly used owing to the ability to record camera images of the different phases of the endoscopic inspection during a therapy. The goal of this paper is to describe the stages of an ontology development process targeted at the description, annotation and retrieval of medical images generated during the endoscopic examination with a hysteroscope. In this particular case, a semantic approach using an ontology can be considered particularly useful, since the problem domain requires the specification of a representational vocabulary for a shared domain of discourse [2]. Via an ontology description, one can accurately represent the specific domain of interest, dividing the main concepts into classes and describing their relationships and restrictions. Ontologies are considered crucial in medicine. Pinciroli [3] describes them as "the backbone of solid and effective applications in health care". Ontologies contribute to the medical science domain in many ways: first of all they provide the establishment [4] of a certain vocabulary so that all involved persons can communicate in a unified way, also providing ways for interchange between several different stakeholders1. Moreover, by using ontologies, data exchange [4] can be accomplished between heterogeneous systems. Ontologies are also used to support decision making systems in
1 The National Center for Biomedical Ontology, http://www.biontology.org
medicine [5]. When dealing with medical images, problems in data retrieval may be a deterrent to the work of physicians, who need to find images similar to the problem at hand as quickly as possible. Even in cases where similar images can be retrieved quite fast, most of the images are not annotated. However, the acquisition of similar images [6] accompanied by the proper annotation is crucial for the physician in order to study the disease category and solve the clinical problem. This paper aims to provide a framework introducing the steps and mechanisms required to create and develop an ontology capable of representing and describing hysteroscopy images and their relationships, in order to provide the physician with a useful aiding tool in the medical diagnosis of endometrial cancer. The creation of a database that will contain the hysteroscopy images is necessary so that the creation of the ontology can be accomplished. The rest of the paper is organized as follows: Section 2 analyzes the proposed methodology: the hysteroscopy image processing, the creation of the database where the image data will be stored, and the ontology development. Section 3 discusses implementation scenarios, and Section 4 concludes the paper with arguments for possible extensions and future research.
2 Methodology
This section describes a set of three steps that need to be carried out so that the diagnosis aiding tool can be created using a specially engineered ontology for this domain.
2.1 Hysteroscopy Image Processing
This step lies in acquiring and organizing the medical images received from the hysteroscopy instrument, particularly focusing on the following tasks:
• Subsequent digital image processing in order to detect in their content morphological characteristics of medical interest (e.g. deformities, cancers etc.). This shall be achieved through comparison of the image's interior findings against predefined patterns, using non-conventional algorithms.
• Enriching the images with respective semantic concepts and private metadata, in order to facilitate the classification and identification of an image within a large medical image database which shall be explicitly built for this purpose.
• Continuous recalculation and reorganization of image metadata depending on new concepts which may have some research interest value.
This approach introduces a novel way of image management and resides in organizing the image database in a way that enables content-based image retrieval. This can be achieved by using alternative image processing algorithms based on analytical geometry rules [7]. Therefore, this study can be used as a retrieval tool, by means of the semantic metadata already implanted in the images.
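Purely as an illustration of this organisation idea (and not of the authors' image processing algorithms), each stored image can carry its detected findings as semantic tags so that retrieval is driven by concepts rather than file names; the field names and tag values below are assumptions.

from dataclasses import dataclass, field

@dataclass
class HysteroscopyImage:
    image_id: str
    file_path: str
    findings: set = field(default_factory=set)   # detected morphological concepts, e.g. {"polyp"}

class ImageRepository:
    """Toy store illustrating concept-driven retrieval and re-annotation."""
    def __init__(self):
        self._images = {}

    def add(self, image):
        self._images[image.image_id] = image

    def retag(self, image_id, new_findings):
        # re-annotation when new concepts of research interest are introduced
        self._images[image_id].findings |= set(new_findings)

    def find_by_concept(self, concept):
        return [im for im in self._images.values() if concept in im.findings]

repo = ImageRepository()
repo.add(HysteroscopyImage("img-001", "/data/img-001.png", {"polyp"}))
repo.retag("img-001", {"irregular_vascularisation"})
print([im.image_id for im in repo.find_by_concept("polyp")])   # ['img-001']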
2.2 Clinical Evaluation and Questionnaire Process
In this step the file of each patient must be completed. Each image will be accompanied by certain information on the patient's health condition (such as laboratory testing). To complete this step, not only the information from laboratory testing is included along with the image, but also a specially formed questionnaire that provides details that can affect the physician's diagnosis, together with the medical history of the patient.
2.3 Ontology Development
This step lies in the development of the medical ontology. Our main goal is to provide all similar images to the physician, in order to help throughout the medical diagnosis procedure. What is most important to our goal is to define the main concepts of the hysteroscopy images and the patients' history, so that the physician can make requests and effectively compare the information under study. In this step the following elements must be described:
• the medical images from the hysteroscopy procedure to which patients will be subject;
• the information obtained from the clinical evaluation and the questionnaire;
• their relationships and restrictions.
By choosing this ontological approach, hysteroscopy-acquired images can be properly retrieved together with the patients' history, thus enabling the semantic description of such critical knowledge and giving the physician the opportunity to compare against the clinical problem that has been raised.
3 Implementation Plan
In this section an initial RDF representation of the proposed ontology is presented, as well as an implementation strategy that will lead to a running ontology capable of collaborating with the database, providing feedback to the physicians, and becoming the basis of a decision support system. The root class is named Hysteroscopic_Procedure and consists of three main classes: Image_obtain, Diagnosis and Decision_Stages. The Image_obtain class has to do with the manipulation of the image and the image processing tasks embodied in the categorization procedure. The Diagnosis class consists of the four main classes, i.e. the categories to which a patient can belong in the diagnosis procedure; the main patient categories can be seen in Fig. 2. The third class, Decision_Stages (depicted in Fig. 3), is connected with the class Diagnosis. In the Decision_Stages class all the necessary information for the diagnosis can be found, such as the questionnaire filled in by the patient, containing crucial information so that the physician can be assisted in taking a decision related to the clinical problem.
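An initial RDF/OWL rendering of this class hierarchy could look like the sketch below, written with rdflib. The namespace URI and the property linking Decision_Stages to Diagnosis are placeholders, and modelling "consists of" as subclassing is only one possible reading of the figures, not the authors' definitive ontology.

from rdflib import Graph, Namespace, RDF, RDFS, OWL

HYST = Namespace("http://example.org/hysteroscopy#")   # placeholder namespace
g = Graph()
g.bind("hyst", HYST)

# Root class and its three main classes, as described in the text (Figs. 1-3)
for cls in ("Hysteroscopic_Procedure", "Image_obtain", "Diagnosis", "Decision_Stages"):
    g.add((HYST[cls], RDF.type, OWL.Class))
for cls in ("Image_obtain", "Diagnosis", "Decision_Stages"):
    g.add((HYST[cls], RDFS.subClassOf, HYST.Hysteroscopic_Procedure))

# Decision_Stages is connected with Diagnosis; the property name is an assumption
g.add((HYST.informsDiagnosis, RDF.type, OWL.ObjectProperty))
g.add((HYST.informsDiagnosis, RDFS.domain, HYST.Decision_Stages))
g.add((HYST.informsDiagnosis, RDFS.range, HYST.Diagnosis))

print(g.serialize(format="turtle"))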
Fig. 1. The Hysteroscopic_Image class
Fig. 2. The Diagnosis class
Fig. 3. The decision stages class
The RDF representation can be extended to OWL 2 [8], a semantic web standard based on the original OWL definition and designed for use by applications that need to process the content of image-based information, which is applicable in the case of medical images. The tool that is going to be used for the development of the
medical ontology in OWL is Protégé [9], which enables concurrent access by all parties involved in the ontology design process. OWL 2 ontologies can easily be used with the image data, which will be stored in a relational database package. The primary goal after the development of the medical ontology will be to use OWL to integrate it with the relational database containing the hysteroscopy images, and then perform queries against the aggregate collection to answer realistic questions that could not be answered without the addition of an OWL 2 ontology describing the medical images. The integrated ontology/database system will then be further integrated with an inference mechanism, which will eventually produce the desired decision-support system. The architectural choice in this case is to use the Jess rule engine in order to provide inference abilities to the developed application [9]. The integrated decision support system will be capable of assisting the physician in the diagnostic procedure and will constitute an aiding diagnostic tool for the scientific community.
4 Conclusion – Future Plans
In this short paper we have presented an approach to address the problem of representing and describing hysteroscopy images and their relationships, in order to provide the physician with a useful aiding tool in the medical diagnosis of endometrial cancer. Towards this direction a three-step methodology was proposed, comprising the hysteroscopy image processing, a clinical evaluation, and the development of a medical ontology to represent and describe hysteroscopy images and their relationships. The ontology will be seamlessly integrated with a relational database containing hysteroscopy image data and an inference engine, to constitute a decision support system that will assist the cancer diagnostic procedure. Ongoing work at this stage aims at completing the implementation of the medical ontology in OWL, binding it with the database and integrating them with the inference engine into the proposed decision support system. Moreover, it should be stated that the project introduced in this paper constitutes only a part of a larger project which is scheduled to run in the near future. In this part, as in the larger project, the hysteroscopy images acquired from patients and the medical assistance necessary in order to create the diagnosis aiding tool will be obtained in collaboration with the Medical School of the University of Ioannina.
References 1. Garuti, G., Sambruni, I., Cellani, F., Garzia, D., Alleva, P., Luerti, M.: Hysteroscopy and transvaginal ultrasonography in postmenopausal women with uterine bleeding. International Journal of Gynecology & Obstetrics 65, 25–33 (1999) 2. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5, 199 (1993) 3. Pinciroli, F., Pisanelli, D.M.: The unexpected high practical value of medical ontologies. Computers in Biology and Medicine 36, 669–673
4. Abidi, S.R., Abidi, S.S.R., Hussain, S., Shepherd, M.: Ontology-based modeling of clinical practice guidelines: a clinical decision support system for breast cancer follow-up interventions at primary care settings. Studies in Health Technology and Informatics 129, 845 (2007) 5. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J.: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251–1255 (2007) 6. Bodenreider, O., Stevens, R.: Bio-ontologies: current trends and future directions. Brief Bioinform. 7, 256–274 (2006) 7. Poulos, M., Rangoussi, M., Alexandris, N., Evangelou, A.: Person identification from the EEG using nonlinear signal classification. Methods of information in Medicine 41, 64–75 (2002) 8. OWL 2 Web Ontology Language Document Overview, http://www.w3.org/TR/owl2-overview/ 9. O’connor, M., Knublauch, H., Tu, S., Grosof, B., Dean, M., Grosso, W., Musen, M.: Supporting rule system interoperability on the semantic web with SWRL. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 974–986. Springer, Heidelberg (2005)
Utilizing Embedded Semantics for User-Driven Design of Pervasive Environments Ahmet Soylu1, Felix Mödritscher2, and Patrick De Causmaecker1 1 K. U. Leuven, Department of Computer Science, CODeS, iTec, Kortrijk, Belgium {Ahmet.Soylu,Patrick.DeCausmaecker}@kuleuven-kortrijk.be 2 Vienna University of Economics and Business, Department of Information Systems, Vienna, Austria
[email protected]
Abstract. The Web does not only offer an almost infinite number of services and resources but can also be seen as a technology to combine different technological devices, like mobile phones, digital media solutions, intelligent household appliances, tablet PCs, and any other kind of computers, in order to create environments satisfying the needs of users. However, due to the large amount of web resources and services as well as the variety and range of user needs, it is impossible to realize software solutions for all possible scenarios. In this paper, we present a user-driven approach towards designing and assembling pervasive environments, taking into consideration resources and services available on the Web and provided through computing devices. Based on semantics embedded in the web content, we explain the concept as well as important components of this user-driven environment design methodology and show a first prototype. Finally, the overall approach is critically discussed from the perspectives of programmers and web users on the basis of related work. Keywords: Semantic Web, End-User Development, Web Programming, Embedded Semantics.
1 Introduction

Presently, the Web comprises a network of linked documents, primarily designed for humans [1]. Humans either directly access web resources (i.e. pages) through URLs to find information of their interest or use web forms and other gadgets to interact with the web applications. Additionally, machines use ad-hoc techniques to extract information from the web content and a separate access channel/facade, i.e. web services, to interact with other web applications. However, two particular movements in today's web technology and computing are on the way to change this picture. On the one hand, the Semantic Web alleviates the data access problems of machines, in a manner similar to web services, through a distinct facade by means of XML, RDF etc., thereby creating a Web which has facades of access and interaction for both humans and machines. On the other hand, with the emergence of Pervasive Computing, the Web is not a closed virtual box of networked applications and information anymore. It is supposed to be a communication and application space, i.e. Web of
Things (WoT), and an information space, i.e. Web of Data (WoD), for computing systems [2, 3]. In other words, it is the medium of immersion through which computing can be situated in real life. Overall, the Web in its current form contains a lot of valuable data and functionality provided through web applications built by programmers from companies and applied science. End-users, i.e. the humans, are involved in this process, e.g. through HCI methodologies like usability engineering or participatory design. Considering the large amount of data and functionality in the Web as well as the different end-users and devices, it is not possible to realize (working and learning) environments for all. Trying to overcome the "one size fits all" flaw, streams like adaptive technology (e.g. Adaptive Hypermedia by [4]) or end-user development [5] have emerged. Due to a general criticism of adaptive technologies (e.g. in [6]), this paper focuses on end-user development (EUD) and introduces an approach towards "User-driven Design of Pervasive Environments" (UDPE) which is based on REST and embedded semantics. Section 2 elaborates the theoretical foundations and argues for EUD. Then, section 3 describes UDPE from the perspective of programming experts. Furthermore, section 4 sketches EUD facilities on the basis of a technology and literature review. Finally, section 5 discusses UDPE with respect to related work, before section 6 concludes the paper along with future work.
2 Foundations, Limitations, and Possible Solutions

The Web is becoming ubiquitous, semantic and more functional with the emergence of the pervasive computing era [7, 8], web semantics, novel architectural styles, and new development approaches. Within the context of this paper, embedded semantics and REST are of interest. Embedded semantics aims at enhancing HTML documents with semantic annotations in order to create a machine-readable Web. It uses the attribute system of HTML for structuring the valuable information. The important technologies are microformats, RDFa, and eRDF. REST [9] is an architectural style aiming at collecting the fundamental design principles that enable the great scalability, growth and success of the Web [10]. It is built around resources and representations. Web services are considered as resources, which are meaningful concepts addressed with unique URIs, while a representation is a document representing the state of a resource. Every interaction can be considered as a call to a particular resource, and the result of each interaction as a new state of that resource. RESTful web services treat HTTP as a semantically capable application protocol rather than just a transport protocol, which is the case e.g. for SOAP [11]. We suggest that the usage of the Web (i.e. read/access and interact) should be considered as twofold: (1) the machine facade and (2) the human facade of the Web. Although a conceptual distinction is admissible, from a representational perspective a uniform representation of both facades is desirable in order to prevent inconsistencies, synchronization problems and development overheads. Considering the matter from the information access point of view, uploading an external file dedicated to machine reading (e.g. RDF or XML) still remains forbiddingly complex [12]; hence we advocate that the use of embedded semantic technologies will provide a simpler solution for unifying both the machine- and human-readable facades of the web content.
From the interaction point of view, annotating interactional elements such as forms and links can partially provide a similar unified facade for the interactions. Such unification is complete if the web site is fully RESTful, that is, all possible interactions provided through the site are built upon a REST API. To achieve this, HTML and REST have to be separated, whereby HTML is used for the presentational structure only and relevant calls are made through the REST API. A web site based on this principle does not necessarily need to expose all the elemental functionalities of the web application through the human interface, but through the API. However, most APIs are only described with text in HTML documents for the use of developers [13]. Therefore, it is necessary to provide machine-readable annotations for such descriptions in order to facilitate tool support for the developers [13]. Briefly, we speculate on the following model of the human-machine usable Web: web applications shall have one facade of access and interaction, that is, every web site dedicated to human use also becomes a web service. Machines also access the human-usable facade of the web applications. However, since the required interactional elements and the valuable information are annotated, machines simply extract the annotated information and use it. Once a machine initiates an interaction through a resource, results are returned in human-readable form (i.e. HTML) where each element of the result list (i.e. response) is also annotated. In short, semantic annotations shall define the functional and meaningful aspects of the application while HTML defines its presentational structure. Hence, within the same physical representation two different facades of use are realized. Considering the integration of devices, the use of RESTful services is an appropriate and simple solution. Although such devices might deliver their functionalities through human-usable embedded web sites, they are primarily expected to provide RESTful APIs. From the perspective of developing pervasive environments, the theoretical issues described above indicate that traditional software development methods are inappropriate or hindering for end-users. On the one hand, the Web offers a large quantity of resources and semantic relations which cannot be fully taken into consideration by the developers. On the other hand, it is impossible to design environments for the needs of every potential user. Therefore, new research streams try to overcome the flaws of "one size fits all" solutions. Amongst others, adaptation technologies aim at changing the behavior of a computer system according to the characteristics of the end-user or the environment [14], implying that adaptation builds upon user/environmental states, adaptable objects, and adaptation rules. However, [6] criticize that it is impossible to create adaptation rules for all possible situations. Consequently, we focus on novel development methods from software engineering and human-computer interaction which aim at including end-users, e.g. through participatory design, or even at shifting development tasks to them. The latter approach, depicted e.g. by Lieberman and colleagues [5], is called end-user development (EUD) and tries to change systems from being "easy to use" to being "easy to develop". The spectrum of EUD reaches from parameterization and customization of programs up to active programming and source code modifications.
Depending on the expertise of the end-users, EUD has to provide programming tools on different levels, reaching from a source code editor up to high-level facilities which completely hide away the programming tasks. Furthermore, an EUD framework has to consider and foster the necessary hands-on skills and competences. Overall,
EUD is a valuable approach to counter the dynamics and complexity of socio-technical systems and to satisfy end-users' needs by empowering them to develop their environments themselves from the given artifacts.
3 User-Driven Design of Pervasive Environments (UDPE): Basic Concept and a Possible Scenario

Indeed, annotations of web-based information and interactions lead to a machine-human usable, more functional Web, as available data and interactions can be projected to an object-based distributed database and to a distributed programming framework, respectively. It is reasonable to consider the information annotated within a page in terms of custom or predefined data structures or data types. An example is depicted in Fig. 1, where an hCard microformat is projected to a predefined data structure (predefined since the hCard vocabulary is given). Furthermore, it is possible to project each annotated HTML form or link available through a resource (i.e. page) to a method/function whose parameters are the input elements of the HTML form.

Fig. 1. Projection of embedded information to data structures (hCard microformat example): an hCard instance (Ahmet Soylu, Kortrijk, Belgium, tel. 0032484742034) is mapped onto the structures vCard { String name; Address adr; String tel; } and Address { String locality; String country-name; }
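To make the projection in Fig. 1 concrete, the following minimal Java sketch parses hCard markup of the kind shown above into the vCard/Address structures. It is only an illustration of the idea: the jsoup HTML parser and the class layout are our own assumptions, not part of the authors' prototype.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Plain data structures corresponding to the projection in Fig. 1
class Address { String locality; String countryName; }
class VCard   { String name; Address adr = new Address(); String tel; }

public class HCardProjection {
    // Projects an hCard microformat embedded in HTML onto a VCard instance
    static VCard project(String html) {
        Document doc = Jsoup.parse(html);
        Element card = doc.select(".vcard").first();
        VCard v = new VCard();
        v.name = card.select(".fn").first().text();
        v.tel  = card.select(".tel").first().text();
        v.adr.locality    = card.select(".adr .locality").first().text();
        v.adr.countryName = card.select(".adr .country-name").first().text();
        return v;
    }

    public static void main(String[] args) {
        String html = "<div class=\"vcard\">"
                + "<span class=\"fn\">Ahmet Soylu</span>"
                + "<span class=\"adr\"><span class=\"locality\">Kortrijk</span> "
                + "<span class=\"country-name\">Belgium</span></span>"
                + "<span class=\"tel\">0032484742034</span></div>";
        VCard v = project(html);
        System.out.println(v.name + ", " + v.adr.locality + ", " + v.adr.countryName);
    }
}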
We thus face custom or predefined data structures on the one hand and functions of the web applications, available through their different pages, on the other hand. On top of that, it is possible to design an infrastructure which allows programmers to select and orchestrate different functionalities of web applications and also to use data available through each application as input parameters to these functions. Indeed, a web application is just another site (also a web service) built upon other web sites and services. We differentiate between elemental web sites/services representing the core functionalities of a single application and compound web sites/services (i.e. mashups) which are composed of other sites and services. Mashups [15] represent a development paradigm by which the Web can be used as a distributed database and a programming framework, requiring that the resources, i.e. data instances and functions, are available on the Web at development and run time. It is apparent that different devices will be connected to the Web and will serve their data and functionalities to users and to each other over the Web by means of REST-based services. Therefore, by following the mashup approach, in accordance with [16], it is possible to enable developers to program pervasive spaces [17]. Combining the mashup and EUD approaches, User-driven Design of Pervasive Environments (UDPE) refers to the HCI paradigm shift from "what the Web can do" to
"what the Web can do for humans" [18]. UDPE is not restricted to web programmers only. In the sense of end-user development, it applies to extended scenarios by empowering all kinds of end-users to 'develop' ubiquitous spaces through the Web. In accordance with end-user development, such a framework for user-driven environment design must provide facilities for programmers (e.g. a code editor) as well as for inexperienced users (e.g. web widgets). The overall frame of UDPE is depicted in Fig. 2, assuming the existence of embedded web servers or gateways coupled with the internal functions of the available devices.
Fig. 2. User-driven design of pervasive environments through the mashup and EUD approach (UDPE programs communicate over the Internet/intranet with embedded servers, gateways, web servers, and web applications)
A possible realization of UDPE based on REST and embedded semantic technologies is described in the following. In a first sketch, we focus on the development of pervasive environments by experienced users (i.e. web programmers); then key features and UI elements for non-programmers will be addressed as well. In our scenario, producers of TVs and TV recorders include embedded servers in their new products. These embedded servers provide human-usable web interfaces which publish the functionalities of the TVs and the recorders through RESTful APIs. For simplicity we assume the existence of some basic functionalities; for TVs these are 'On', 'Off', 'Switch to channel A' etc., and for recorders these are 'On', 'Record', 'Off' etc. The underlying implementation of the human user interface is based on RESTful requests to the API; hence for each action there is a corresponding REST request. It is assumed that these two devices are connected to a user's local network and to the Internet (e.g. through a local master server). The TV stations publish their daily schedules through a web site. We also assume that programs in the daily schedule are annotated with the hCalendar microformat. A portal site aggregates the schedules of the TV stations and provides a query form to the users so that they can query the TV schedules by providing day and TV station parameters. A developer is given the task
to program a web agent which allows a user to set a recorder and a TV to record a particular TV show. Users only provide the name of the show and/or the TV station, and the agent is then supposed to find out the required time intervals through the station's web site in order to schedule the TV and the recorder. First, we assume the existence of the embedded web applications for the TV and recorder and of the semantic annotations of the TV programs, in terms of RESTful requests and embedded semantics respectively. The RESTful requests together with their projections are given in Table 1 for a TV and a recorder.

Table 1. Available RESTful requests for the TV and the recorder, and their projections

TV        REST Request                           Projected Method
On/Off    PUT  http://localhost/mytv/status/     put_status(String)
Switch    PUT  http://localhost/mytv/channel/    put_channel(String)

Recorder  REST Request                           Projected Method
On/Off    PUT  http://localhost/myrec/status/    put_status(String)
Record    POST http://localhost/myrec/newRec/    post_newRecord(String)
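As an illustration of how the projected methods of Table 1 could be realized on the client side, the sketch below wraps the TV's REST requests with plain HTTP calls. The class is hypothetical; only the URIs and method names are taken from Table 1.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper that maps the projected methods of Table 1 onto REST requests
public class TvResource {
    private final String base; // e.g. "http://localhost/mytv/"

    public TvResource(String base) { this.base = base; }

    // put_status("On") -> PUT http://localhost/mytv/status/
    public int put_status(String value) throws Exception { return send("PUT", "status/", value); }

    // put_channel("TV1") -> PUT http://localhost/mytv/channel/
    public int put_channel(String value) throws Exception { return send("PUT", "channel/", value); }

    private int send(String method, String path, String body) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(base + path).openConnection();
        con.setRequestMethod(method);
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return con.getResponseCode(); // HTTP status of the state-changing call
    }
}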
The latter step requires a detailed elaboration of a possible programming editor supporting UDPE. There are several requirements to be discussed around such an editor. The most fundamental requirement is the existence of content assistance. Content assistance, in our context, enables the developer to navigate inside the function and data schema of a resource. For instance, when a developer wants to use the data and functional sources of a particular site, after typing the resource name (probably the variable name pointing to the resource), the editor should fetch the schemas of the available functions and data structures, and should display and recommend them as a selectable list in order to support the development process. Furthermore, each element in the vocabulary of the schema should also have an accompanying human-readable description. Although such a vocabulary is for machine use, such descriptors will support the human developer during the development process. This is crucial since in most cases the developer might not be aware of the functional and data sources available through an application unless they are explicitly documented. However, this feature is strictly based on the characteristics of the technology used to annotate the information and the interactions. Annotated information should exhibit 'locality' [19] and should have an accessible description. The descriptive characteristic might be inherent in the technology itself, that is, it might be 'self-contained' (e.g. RDFa) [19], or there might be a separate descriptor (e.g. eRDF). Locality requires a particular data structure within an HTML document to be accessible, while self-containment requires specific structures to be re-usable without dependency on any descriptor or pre-knowledge. If the embedded information exhibits locality and self-containment, or refers to a separate descriptor, then the programming editor can access the page and retrieve the schemas and descriptions of the embedded information and interactions. The demonstration given in this paper is based on microformats [20] (but is compliant with RDFa and eRDF). Although microformats exhibit locality, they do not fully
support self-containment, as either the syntax and vocabulary must be given by the consumer agents or the resources must be accompanied by extractors. Neither of these prerequisites satisfies our needs, since our agent (i.e. the editor) is expected to have no prior knowledge about the available microformat, and an extractor tightly couples the agent and the underlying technology by extracting information only into a predefined format in a predefined way. Accordingly, we prefer to use a separate microformat descriptor for specifying custom microformats. The data model of the descriptor includes the following required (r) and optional (o) fields: (1) Type: (r) determines whether it is an elemental or compound microformat. (2) Identifier: (r) allows a differentiation between the elements. (3) Design pattern: (r) specifies which HTML elements and attributes are used to define a certain microformat. (4) Label: (o) provides a human-readable description of the element. (5) Match string: (o) restricts the HTML attribute of the design pattern based on string equivalents or regular expressions. (6) Scope: (o) specifies the scope of the semantics within the web content. (7) Selector: (o) determines from which source the element text or semantics have to be extracted. (8) Reference: (o) refers to another existing microformat, particularly useful for compound microformats. (9) Optional: (o) indicates that an elemental microformat is optional within a given compound microformat. This description language satisfies our needs for describing the schema of the information embedded in the web content. Regarding interactions, we consider two forms of annotation: RESTful APIs, and forms/links of web sites. We employ the microformat proposal hREST [13], which aims at providing machine-readable descriptions of web-based APIs. The authors introduced six elements for the hREST microformat: (1) service as a main block markup to indicate that the corresponding microformat describes a service, (2) operation to annotate the service operations, (3) address, which annotates the URI of the operation, (4) method to annotate the HTTP method (GET etc.), (5) input and output to annotate the input and output of an operation, and finally (6) label for the human-readable label of a service. Our microformat descriptor is suitable for describing the hREST microformat, so that both together fully satisfy our requirements. Considering interactions, links represent one unified user action, for instance deleting a record or getting information about a particular resource. They do not require any input parameters since everything is predefined, they are self-descriptive, and they inherit self-containment. Therefore the only requirement, in our context, is the existence of meaningful operation names and corresponding descriptions. We prefer to use the title attribute of the HTML link for declaring the name of the operation and an inner span element with the class value 'nav_desc' to annotate the descriptions of the operations. Forms are composed of meaningful elements, and their syntax and vocabulary are fixed, thereby satisfying locality and self-containment; moreover, the microformat community has already drafted a methodology for annotating forms (see http://microformats.org/wiki/rest/forms-brainstorming). This approach partially satisfies our requirements. The basic requirement deals with providing a label attribute for each input element, which can be used for naming the input elements.
However, no means is introduced to briefly describe the form and the input elements, which is a crucial requirement for a possible editor. Therefore, we use inner span elements with the class attribute values 'input_desc' and 'form_desc'.
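The sketch below illustrates how an editor might derive a content-assist entry from such an annotated form. The markup uses the hREST class names listed above together with the 'form_desc'/'input_desc' spans; the extraction code (again using jsoup) is an assumption for illustration, not the editor described in this paper.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FormSchemaExtractor {
    public static void main(String[] args) {
        // An annotated query form of the TV schedule portal (illustrative markup)
        String html = "<form class=\"operation\" action=\"http://www.portal.com/schedule/\" method=\"get\">"
                + "<span class=\"form_desc\">Search the aggregated TV schedules</span>"
                + "<input class=\"input\" name=\"day\" title=\"day\"/>"
                + "<span class=\"input_desc\">Date of the broadcast</span>"
                + "<input class=\"input\" name=\"station\" title=\"station\"/>"
                + "<span class=\"input_desc\">Name of the TV station</span>"
                + "</form>";

        Document doc = Jsoup.parse(html);
        Element form = doc.select("form.operation").first();

        // Build a human-readable signature for the editor's content-assist list
        StringBuilder signature = new StringBuilder("search(");
        for (Element input : form.select("input.input")) {
            if (signature.charAt(signature.length() - 1) != '(') signature.append(", ");
            signature.append("String ").append(input.attr("name"));
        }
        signature.append(")  // ").append(form.select(".form_desc").first().text());
        System.out.println(signature);  // search(String day, String station)  // Search the aggregated TV schedules
    }
}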
A possible concern while using annotated forms or links is the response format. The response format (i.e. return type) is normally an HTML document, which should be, but is not, understandable for machine agents. In this context, the result format is still an HTML document where each result element is an instance of a data chunk annotated within the HTML page, not necessarily having the same structure. A crucial requirement for such an infrastructure is the availability of a generic event notifier-listener infrastructure in order to enable developers to set automatic actions associated with the events occurring in the environment (e.g. TV is on, a new recording started, etc.). In this regard, a context ontology with a rule layer supporting various types of rules, i.e. deduction, normative, reactive [21], can also be an appropriate choice for domain-specific applications, although this is beyond the scope of this paper. An example pseudo code demonstrating UDPE, based on the scenario described, is depicted in Fig. 3.

/* definitions */
Resource tv = "http://www.portal.com/schedule/";
Resource myTv = "http://localhost/mytv/";
Resource myRecorder = "http://localhost/myrec/";
Listener ls;
Time ts;
Time te;

/* find a particular movie and extract start and end times */
List tvProgram[] = tv.search("21.04.2010", "TV1");
for each item in tvProgram do
  if item.Title = "The Cosby Show" then
    ts = item.StartTime;
    te = item.EndTime;
    break;
  end
end

/* turn devices on, record the programme, turn devices off */
if ts != null and te != null then
  ls.HookAction(myTv.put_status("On"), time = ts);
  ls.HookAction(myTv.put_channel("TV1"), time = ts);
  ls.HookAction(myRecorder.post_newRecord("Cosby"), time = ts);
  ls.HookAction(myRecorder.put_status("Off"), time = te);
  ls.HookAction(myTv.put_status("Off"), time = te);
end

Fig. 3. A UDPE program for a smart home environment including a TV and a recorder
By defining the embedded semantics relevant for a scenario and utilizing the REST-based methods (indicated with the pseudo code in Fig. 3), web programmers can assemble their pervasive environment on top of existing data and functionality in the Web. Moreover, they can even combine the different devices for their everyday activities. In this variant, however, UDPE is only applicable for expert users, i.e. programmers in the field of web technologies. As suggested by EUD guidelines [22], UDPE should provide facilities for both expert users and novices. Thus, the upcoming section proposes features and a user interface for end-users, i.e. users with appropriate knowledge about the Web.
4 End-User Facilities from Literature and Technology Review

It is impossible to anticipate all needs of the user within the broad context space of ubiquitous applications. This fact justifies end-user involvement. We consider user involvement in two ways: (1) at development time, (a) by the user being part of the development cycle through actively providing feedback on the design [23], or (b) by enabling end-users to design and develop their own applications with high-level tools (i.e. EUD) [24]; and (2) at run-time, by enabling the user to intervene in the application's behavior, (a) directly (i.e. user control) by deciding on appropriate behaviors [25], or (b) indirectly (i.e. user mediation) by providing helpful feedback [26]. We consider the former, in terms of EUD, and the latter, in terms of user control, as key paradigms for the success of Ubiquitous Computing. Although the user-control and mediation approach contradicts the pervasive computing vision, which places an absolute focus on machine control, we argue that user involvement at run-time, supported by adequate machine guidance and feedback mechanisms, is a must. This becomes apparent if one considers the years of research ahead, probably with only limited achievement towards real machine intelligence [27]. In the frame of EUD, the literature proposes two different lines of approaches: (1) web design tools such as Microsoft FrontPage, Macromedia Dreamweaver etc., which enable users to visually design and develop web pages and sites [28]; and (2) mashup design and development tools such as Yahoo Pipes, IBM Mashup etc., which support end-users in combining services and data from various sources to create new functionalities and content [29]. With the emergence of Web 2.0 applications, visual web design and development tools have lost their prominence for end-users, since various Web 2.0 applications allow creating readily operational applications without the need for exhaustive design and configuration efforts. Moreover, mashups enable users to combine functionalities and data available on the Web, thus increasing reuse. We believe that the essence of mashups lies in their ability to bridge the gap between end-users and the design and development of applications. Therefore, we have investigated several mashup design and development tools (listed in [29]), in terms of their end-user facilities, to develop a prototype mashup mockup supporting UDPE: (1) IBM Mashup Center, (2) Intel Mashmaker, (3) JackBe Presto, (4) Liquid Apps, (5) Open Mashup Studio, (6) Yahoo Pipes, and (7) Deri Pipes. We have derived the following criteria for an EUD-enabling GUI: (1) design facilities should follow real-world mental models [30], (2) users should not be placed under a high cognitive load by means of overloaded forms and pages (e.g. configuration facilities) [31], (3) users should be confident that they have full control and awareness of an application [25], and (4) the design should be engaging [32]. According to these criteria, the most appropriate design elements seem to be widgets, wires, drag-and-drop facilities, and intelligent guidance. We explain these elements through the prototype mockup in Fig. 4, which represents a slightly different scenario. The overall development environment follows a grocery-kitchen metaphor, i.e. find ingredients in the grocery and cook them in the kitchen. The upper part of Fig. 4 (the grocery) gives a persistent presentation of the available resources (i.e.
devices and web services) and aims at creating permanent awareness and control of the resources. It enables users to select data, functionalities and event notifiers (by following
[33]) available through the devices and web applications (i.e. stores).

Fig. 4. A prototype mockup for the UDPE Mashup Editor following grocery-kitchen and animation/movie metaphors

Double clicking a resource loads the available elements of the selected resource into the respective containers (i.e. selection boxes for 'Operations', 'Data' and 'Events'). The bottom part of Fig. 4 (the kitchen) visualizes the mashup development area. This development area follows a model similar to animation/movie development environments (e.g. Microsoft Movie Maker; we have observed that many naive users can create simple animations and movies through similar tools). It provides an execution timeline divided into scenes and aims at the creation of an engaging user experience. Each scene corresponds to a widget; there are four types of widgets available, namely 'Operation', 'Filter', 'User', and 'Listener'. If the user selects an operation from the 'operation store' of the 'grocery', it is added as an 'operation'-type widget (i.e. scene) to the timeline. Inputs of the operations are represented through visual input elements. If the user selects a 'data' item (events, people, movies etc.) from the 'data store', a 'filter'-type scene is added where each metadata element of the data type is displayed as an input element, similar to an 'operation' widget. If the user selects an 'Event' from the 'store', a 'listener'-type scene is added. It handles the event notifications sent by the resources (e.g. TV is on), and can be used to initiate execution of the mashup (being the first scene). End-users can mark input elements as 'auto', 'user input', 'default', and 'fixed'. In the case of 'auto' elements, input is provided by the widget wired to them, while 'user input' elements are filled by the user during execution of the mashup. 'Default' values are provided by the user at development time, whereby the user can change this value
at runtime. In the case of 'fixed' elements, the user enters a fixed value at development time which cannot be changed in the execution phase. A mixture is also possible in a scene, that is, some input elements are filled by the user and some are auto-completed from the previous scene. If there are no 'user input' elements in a scene, then the scene is not visible to the user during execution of the 'movie/animation'. A user can create parallel timelines, so that she can wire different parallel data, operation, and listener widgets from other timelines to the widget at the next scene. Users can also change the order of widgets and wire widgets to each other through drag-and-drop facilities. The number of configuration elements is kept as minimal as possible; however, the functionality can be increased by distributing different functionalities into layers (i.e. modes) according to their difficulty and anticipated usage frequency, ranging from 'beginner' to 'expert' levels. Intelligent guidance shall help through the design by means of disabling or selecting appropriate design elements in order to facilitate the design and development process.
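A compact Java sketch of the widget model implied by this editor could look as follows; all type and field names are hypothetical, since the paper does not specify the prototype's internal data model.

import java.util.ArrayList;
import java.util.List;

// Widget types corresponding to the four scene kinds of the mashup timeline
enum WidgetType { OPERATION, FILTER, USER, LISTENER }

// How an input element of a scene obtains its value
enum InputMode { AUTO, USER_INPUT, DEFAULT, FIXED }

class InputElement {
    String name;
    InputMode mode;
    String value; // default or fixed value set at development time
}

class Widget {
    WidgetType type;
    List<InputElement> inputs = new ArrayList<>();
    List<Widget> wiredFrom = new ArrayList<>(); // widgets in parallel timelines feeding this scene

    // A scene is only shown during execution if the user still has to fill something in
    boolean visibleAtRuntime() {
        return inputs.stream().anyMatch(i -> i.mode == InputMode.USER_INPUT);
    }
}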
5 Discussion and Related Work

So far we have not paid sufficient attention to the embedded semantics themselves, i.e. users should also have the possibility to specify information encoded in web-based content, e.g. for describing RESTful services or for exchanging data between applications. Overall, the main focus of UDPE is not set on Semantic Web technologies such as RDF and OWL. However, we advocate that this simple idea of annotating information within HTML content introduces new possibilities, as indicated in this paper. In [3], the authors use embedded semantics, microformats, to annotate valuable information pieces and contextual information for the e-learning domain. Furthermore, they built a web service which harvests embedded information and allows clients to query this information. A similar approach has been employed in [34], where microformats are used to find and annotate governmental web services. These services are harvested by special agents, and the descriptions are stored in a semantic repository. Later these services are put at the disposal of citizens by means of a semantic search engine. Within the context of this paper, we also tried to build on existing mashup technologies [15, 35]. On the one hand, the mashup tools mentioned in section 4 have a strong focus on content aggregation and manipulation, i.e. feeds, while providing limited support for service composition. Microformats and RDFa are not supported and attention is given to feeds (e.g. RSS). Visual development environments are provided based on widgets, called modules or pipes; however, support for more experienced users through source code manipulation is not well addressed. On the other hand, the underlying approach, technology and framework used in these tools are left hidden (most of them are commercial, and not open source), therefore it is not possible to compare these approaches with ours from a technical point of view. A notable approach which is based on a concrete methodology and technology is SMashups [36]. It focuses on service composition rather than data. It follows the SAWSDL approach (Semantic Annotation for WSDL), which aims at adding semantic annotations to web services described with WSDL. A service annotation mechanism, called SA-REST, is based on microformats [13] and RDFa [36] and is used for REST-based services usually described in HTML pages. SA-REST and SAWSDL specify
associations between the service description components and concepts in a semantic model (i.e. an ontology) in order to enable semantic interoperability. The main drawback of this approach is that it assumes the existence of a (pre-defined) ontology so that different services can be semantically integrated. However, such an approach can only be useful for integrating a group of domain-specific applications, since ontologies, by their nature, require commitment, which is hard to achieve even for flat vocabularies. The Web is highly heterogeneous. Therefore, we prefer to avoid utilizing ontologies for such global integration purposes. The approach presented in this paper further extends the notion of mashups to Pervasive Computing through the integration of everyday devices into the Web and by considering the Web as an active programming framework rather than a passive information source (e.g. event notifiers and listeners). The literature on Pervasive Computing outlines that RDF and OWL are basically used to formalize ontological models; although the current focus is mostly on developing generic but domain-specific context ontologies (see [37] for a review) to be used for reasoning purposes in order to provide an adaptive user experience, there is a lack of a generic framework ensuring a loose integration of the Web and pervasive spaces. The use of ontologies as domain-specific artifacts in pervasive spaces can enhance the application of UDPE. Developers and end-users can really program the environment by making use of the context ontologies of the pervasive spaces, since such context ontologies will allow the programmer to reach information about the entities available in the environment, their statuses, characteristics and relationships with each other. We believe that at the interface level (i.e. the Web) the use of metadata-level semantics will be an appropriate and simple solution, assuming that every system has its own local ontology and metadata schema. This is because it is always easier to define mappings between different metadata schemas than to define ontological mappings. From the hardware point of view, the implementation of RESTful services on an embedded system is realized in [38], where the authors also refer to the related literature; for instance, [39] attached a mini web server to equipment such as air-conditioners, while [40] implemented an embedded temperature web controller. The current implementations of embedded web servers need to be standardized and negotiated with the appliance producers. Marching towards the vision of the ubiquitous Internet, enabling web services on embedded systems is certainly on the to-do list of the consumer electronics industry for the near future [38].
6 Conclusions and Further Work

In this paper, we have introduced a development methodology named User-driven Design of Pervasive Environments (UDPE). It aims at empowering users to program and design their ubiquitous spaces from the data and functionality available on the Web, thereby considering the Web as a distributed database and programming framework integrated with the digital equipment available through the ubiquitous infrastructure. We have shown how UDPE can be put into practice on the basis of embedded semantics and RESTful web services, and how pervasive environments can be developed by two important target groups, web programmers and web users. Considerable work still remains to be carried out in this area, both in terms of further validation and in more detailed exploration of the proposed approach. The
research directions can be itemized as follows: (1) realization of a prototypic implementation of the approach, (2) realization of a UDPE programming editor supporting a GUI-based interface for end-users as well, (3) extension of UDPE's technical frame by integrating ontological models of the pervasive environments, (4) investigation of standardized context dissemination mechanisms [8, 41], e.g. notifiers and listeners through push/pull, to realize automatic actions taken by the pervasive environments based on changes in the environment context (e.g. events), and (5) investigation of standardized and generic security mechanisms for UDPE. Indeed, the realization of the aforementioned research path is expected to result in a generic and standardized web-based ubiquitous computing framework. Acknowledgments. This paper is based on research funded by the Industrial Research Fund (IOF) and conducted within the IOF Knowledge platform "Harnessing collective intelligence in order to make e-learning environments adaptive" (IOF KP/07/006). Partially, it is also funded by the European Community's 7th Framework Programme (IST-FP7) under grant agreement no 231396 (ROLE project).
References
1. Ayers, D.: The Shortest Path to the Future Web. Internet Comput. 10, 76–79 (2006)
2. Soylu, A., De Causmaecker, P.: Merging Model Driven and Ontology Driven Development Approaches: Pervasive Computing Perspective. In: 24th International Symposium on Computer and Information Sciences (ISCIS 2009), pp. 730–735. IEEE Press, Los Alamitos (2009)
3. Soylu, A., De Causmaecker, P., Wild, F.: Ubiquitous Web for Ubiquitous Computing Environments: The Role of Embedded Semantics. J. Mob. Multimed. 6, 26–48 (2010)
4. Brusilovsky, P.: Adaptive Hypermedia. User Modeling and User-Adapted Interaction 11, 87–110 (2001)
5. Lieberman, H., Paterno, F., Klann, M., Wulf, V.: End-User Development: An Emerging Paradigm. In: Lieberman, H., Paterno, F., Wulf, V. (eds.) End-User Development. LNCS, vol. 4321, pp. 1–8. Springer, Heidelberg (2006)
6. Wild, F., Mödritscher, F., Sigurdarson, S.E.: Designing for Change: Mash-Up Personal Learning Environments. In: eLearning Papers (2008), ISSN: 1887-1542
7. Weiser, M.: The computer for the 21st century. Sci. Am., 94–98 (1991)
8. Soylu, A., De Causmaecker, P., Desmet, P.: Context and Adaptivity in Pervasive Computing Environments: Links with Software Engineering and Ontological Engineering. J. Soft. 4, 992–1013 (2009)
9. Vinoski, S.: REST eye for the SOA guy. IEEE Internet Comput. 11, 82–84 (2007)
10. Riva, C., Laitkorpi, M.: Designing Web-Based Mobile Services with REST. In: Di Nitto, E., Ripeanu, M. (eds.) ICSOC 2007. LNCS, vol. 4907, pp. 439–450. Springer, Heidelberg (2009)
11. Dillon, T., Talevski, A., Potdar, V., Chang, E.: Web of Things as a Framework for Ubiquitous Intelligence and Computing. In: Zhang, D., Portmann, M., Tan, A.-H., Indulska, J. (eds.) UIC 2009. LNCS, vol. 5585, pp. 1–10. Springer, Heidelberg (2009)
12. Khare, R.: Microformats: the next (small) thing on the semantic Web? Internet Comput. 10, 68–75 (2006)
13. Kopecky, J., Gomadam, K., Vitvar, T.: hRESTS: An HTML Microformat for Describing RESTful Web Services. In: International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2008), pp. 619–625 (2008)
14. Mödritscher, F.: Adaptive E-Learning Environments: Theory, Practice, and Experience. Verlag Dr. Müller, Saarbrücken (2008), ISBN: 978-3-639-02635-1
15. Benslimane, D., Dustdar, S., Sheth, A.: Service Mashups: The New Generation of Web Applications. IEEE Internet Comput. 12, 13–15 (2008)
16. Mödritscher, F., Wild, F.: Personalized E-Learning through Environment Design and Collaborative Activities. In: Holzinger, A. (ed.) USAB 2007. LNCS, vol. 4799, pp. 377–390. Springer, Heidelberg (2007)
17. Helal, S.: Programming pervasive spaces. IEEE Pervasive Comput. 4, 84–87 (2005)
18. Mödritscher, F.: Semantic Lifecycles: Modelling, Application, Authoring, Mining, and Evaluation of Meaningful Data. Int. J. Knowl. Web Intell. 1, 110–124 (2009)
19. Adida, B.: hGRDDL: Bridging microformats and RDFa. J. Web Semant. 6, 61–69 (2008)
20. Khare, R., Çelik, T.: Microformats: A pragmatic path to the Semantic Web. In: 15th International World Wide Web Conference, pp. 865–866 (2006)
21. Boley, H., Kifer, M., Patranjan, P.L., Polleres, A.: Rule interchange on the web. In: Antoniou, G., Aßmann, U., Baroglio, C., Decker, S., Henze, N., Patranjan, P.-L., Tolksdorf, R. (eds.) Reasoning Web. LNCS, vol. 4636, pp. 269–309. Springer, Heidelberg (2007)
22. Repenning, A., Ioannidou, A.: What Makes End-User Development Tick? 13 Design Guidelines. In: Lieberman, H., Paterno, F., Wulf, V. (eds.) End-User Development. LNCS, vol. 4321, pp. 51–86. Springer, Dordrecht (2006)
23. Begier, B.: Users' involvement may help respect social and ethical values and improve software quality. Inf. Syst. Frontiers, doi: 10.1007/s10796-009-9202-z
24. Fischer, G., Giaccardi, E.: Meta-Design: A Framework for the Future of End-User Development. In: Lieberman, H., Paterno, F., Wulf, V. (eds.) End-User Development. LNCS, vol. 4321, pp. 421–452. Springer, Dordrecht (2006)
25. Spiekermann, S.: User Control in Ubiquitous Computing: Design Alternatives and User Acceptance. Shaker Verlag, Aachen (2008)
26. Dey, A.K., Mankoff, J.: Designing mediation for context-aware applications. ACM Trans. Comput.-Hum. Interact. 12, 53–80 (2005)
27. Kasabov, N.: Evolving Intelligence in humans & machines: Integrative evolving connectionist systems approach. IEEE Comput. Intell. Mag. 3, 23–37 (2008)
28. Valderas, P., Pelechano, V., Pastor, O.: Towards an End-User Development Approach for Web Engineering Methods. In: Dubois, E., Pohl, K. (eds.) CAiSE 2006. LNCS, vol. 4001, pp. 528–543. Springer, Heidelberg (2006)
29. Taivalsaari, A.: Mashware: The future of web applications. Technical report, Sun Microsystems (2009)
30. Meadows, D.H.: Thinking in Systems. Chelsea Green Publishing (2008)
31. Hagras, H.: Embedding Computational Intelligence in Pervasive Spaces. IEEE Pervasive Comput. 6, 85–89 (2007)
32. O'Brien, H.L., Toms, E.G.: What is user engagement? A conceptual framework for defining user engagement with technology. J. Am. Soc. Inf. Sci. Technol. 59, 938–955 (2008)
33. Wild, F., Sigurðarson, E.S., Sobernig, S., Stahl, C., Soylu, A., Rebas, V., Górka, D., Danielewska-Tuecka, A., Tapiador, A.: An Interoperability Infrastructure for Distributed Feed Networks. In: 1st International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER 2007), Greece (2007)
34. Sabucedo, L.A., Rifón, L.A.: A Microformat Based Approach For Crawling And Locating Services In The E-government Domain. In: 24th International Symposium on Computer and Information Sciences, pp. 111–116. IEEE Press, Los Alamitos (2009)
35. Birman, K., Cantwell, J., Freedman, D., Huang, Q., Nikolov, P., Ostrowski, K.: Edge Mashups for Service-Oriented Collaboration. IEEE Comput. 42, 90–94 (2009)
36. Sheth, A.P., Gomadam, K., Lathem, J.: SA-REST: Semantically Interoperable and Easier-to-Use Services and Mashups. IEEE Internet Comput. 11, 91–94 (2007)
37. Perttunen, M., Riekki, J., Lassila, O.: Context Representation and Reasoning in Pervasive Computing: a Review. Int. J. Multimed. Ubiquitous Eng. 4, 1–28 (2009)
38. Chang, C.E., Mohd-Yasin, F., Mustapha, A.K.: An implementation of embedded RESTful services. In: Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA 2009), pp. 45–50 (2009)
39. Lin, T., Zhao, H., Wang, J., Han, G., Wang, J.: An Embedded Web Server for Equipments. In: 7th International Symposium on Parallel Architectures, Algorithms and Networks, pp. 345–350. IEEE, Los Alamitos (2004)
40. Huang, J., Ou, J.Y., Wang, Y.: Embedded Temperature Web Controller Based on IPv4 and IPv6. In: 31st Annual Conf. of the IEEE Industrial Electronics Society, IECON 2005 (2005)
41. Gu, T., Pung, H.K., Zhang, D.Q.: A Service-Oriented Middleware for Building Context-Aware Services. J. Netw. Comput. Appl. 28, 1–18 (2005)
Utilizing Linked Open Data Sources for Automatic Generation of Semantic Metadata
Antti Nummiaho, Sari Vainikainen, and Magnus Melin
VTT Technical Research Centre of Finland, Vuorimiehentie 3, P.O. Box 1000, FI-02044 VTT, Finland
{antti.nummiaho,sari.vainikainen,magnus.melin}@vtt.fi
Abstract. In this paper we present an application that can be used to automatically generate semantic metadata for tags given as simple keywords. The application, which we have implemented in the Java programming language, creates the semantic metadata by linking the tags to concepts in different semantic knowledge bases (CrunchBase, DBpedia, Freebase, KOKO, OpenCyc, Umbel and/or WordNet). The steps that our application takes in doing so include detecting possible languages, finding spelling suggestions and finding meanings from amongst the proper nouns and common nouns separately. Currently, our application supports English, Finnish and Swedish words, but other languages could be included easily if the required lexical tools (spellcheckers, etc.) are available. The created semantic metadata can be of great use in, e.g., finding and combining similar contents, creating recommendations and targeting advertisements. Keywords: metadata generation, semantic metadata, Linked Data.
1 Introduction
Many social media services like Delicious, Last.fm, Flickr and YouTube let users tag content with keywords of their choice. For users, this provides an easy and straightforward way to add metadata. This metadata then plays a key role in linking resources and people to each other. However, there are several challenges relating to the use of tags: the flat structure, the use of different words to describe the same thing, words with several different meanings (polysemy), different language versions, misspellings and different lexical forms. Connecting the tags to existing semantic knowledge bases would provide much more information about them and how they relate to each other. However, this should be done automatically to preserve the ease with which users add the tags as simple keywords. To tackle this issue, we have created a semantic tag analyser that aims at linking given keywords to URIs in existing semantic knowledge bases. Our analyser uses publicly available knowledge bases such as WordNet (http://wordnet.princeton.edu/), KOKO (http://www.yso.fi/onki/koko/), DBpedia (http://dbpedia.org/)
and Freebase (http://www.freebase.com/) that offer structural data through open APIs. The reason for using several databases is that different data sources contain different kinds of knowledge and not all information is available in one dataset. For example, if we search for the Finnish word "koivu" (birch in English) in KOKO we find general concepts relating to the tree, but if we make the same search in Freebase, we find persons whose name is Koivu. Tags can be very heterogeneous depending on the context where the application is used. Therefore, using multiple databases makes it more likely to find the right meaning.
2 Related Work
In this chapter we present some recent studies that describe different methods which have been developed for linking tags to ontologies. What our application adds to the already published works is that it aims at providing a general tool for finding meanings for all kinds of words in different languages. The Relco framework applies syntactic, semantic and collaborative techniques to connect tags to ontological concepts. The techniques have been demonstrated in the educational domain (ViTa, a video tagging system for educational videos) and the cultural heritage domain [1]. The Morpho framework, inspired by Relco, elicits, enhances, and transforms a user profile from one application to another application in a mashup environment. It deals with the semantic and syntactic heterogeneity of the data and schema of the user profile [2]. Cantador et al. have studied how ontological user profiles can be created by incorporating individual tagging history and matching tags with ontology concepts using WordNet and Wikipedia [3]. FLOR, a FoLksonomy Ontology enRichment tool, connects free keywords to online ontologies by utilizing lexical processing, disambiguation and semantic expansion techniques [4]. There are also several existing applications that connect tags to ontologies. For example, Faviki (http://www.faviki.com/) is a social bookmarking tool that connects tags to DBpedia concepts and Zemanta (http://www.zemanta.com/) is a tool that enriches blog posts by linking their contents to existing ontologies.
3 Semantic Tag Analyser
The steps that our tag analyser (implemented in Java) takes when finding meanings for a tag are the following:
1. Finding meanings from amongst the proper nouns
2. Detecting possible languages
3. Finding spelling suggestions (if the language was not detected)
4. Finding meanings from amongst the common nouns
5. Possibly repeating steps 1 and 4 for the spelling suggestions

Currently, our tag analyser only supports tags in English and Finnish. Swedish tags are also handled in the analysis, but they are not treated as smartly as in the other two languages, as misspelled Swedish words are not recognized. However, the application is implemented so that it should be relatively easy to add other languages too, if the required lexical tools for getting word roots and finding spelling suggestions are available. Our tag analyser utilises the following databases through the open APIs that they provide:
– CrunchBase (a database of companies, people, and investors; http://www.crunchbase.com/)
– DBpedia (structured Wikipedia information)
– Freebase (data from several sources, as well as user contributed data)
– KOKO (a collection of Finnish core ontologies)
– OpenCyc (an open source version of the Cyc general knowledge base; http://www.opencyc.org/)
– Umbel (a lightweight, subject concept reference structure for the Web; http://www.umbel.org/)
– WordNet (a lexical database for the English language)
We have chosen these databases because they cover a wide area of different tags in the languages that our application currently supports, and also because many of them provide Linked Data to link concepts in different databases. However, other databases could be added, e.g., if we find that there are unrecognized words for which meanings could be found in some other database. Fig. 1 shows the tag analysis as a flow chart. The tag analysis's steps are examined in more detail in the following sections.

3.1 Finding Meanings from Amongst the Proper Nouns
As any tag can be a proper noun, the first step is to try to find meanings from amongst them. Freebase, CrunchBase and DBpedia are used for this as they contain concepts about people, companies, places, etc. As the tag may also be a misspelled proper noun, this step is also executed for the spelling suggestions if no meanings were found for the given tag from either the proper nouns or the common nouns.

3.2 Detecting Possible Languages
Google's Translate service (http://translate.google.com/) is used for trying to detect the possible languages of the tag. The language detection process is executed by trying to find translations from different languages and whenever a translation is found, the language is noted as a possible language for the tag. A tag can therefore have multiple possible languages as the same word can have a meaning in different languages.
Fig. 1. Tag analysis goes through several steps to find meanings for all kinds of tags (flow chart: find meanings from amongst the proper nouns using DBpedia, Freebase and CrunchBase; detect possible languages; if no language is found, find spelling suggestions; otherwise find meanings from amongst the common nouns, for Finnish/Swedish tags via KOKO -> DBpedia/Freebase -> WordNet -> OpenCyc and for English tags via WordNet -> OpenCyc/DBpedia -> KOKO)
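A minimal Java skeleton of the control flow shown in Fig. 1 might be organized as follows; the method names and return types are our own simplification and do not reflect the analyser's actual API.

import java.util.ArrayList;
import java.util.List;

// Simplified control flow of the tag analysis in Fig. 1
public class TagAnalyserSkeleton {

    List<String> analyse(String tag) {
        List<String> meanings = new ArrayList<>(findProperNounMeanings(tag)); // DBpedia, Freebase, CrunchBase
        List<String> languages = detectLanguages(tag);                        // e.g. via a translation service
        if (languages.isEmpty()) {
            // Possibly misspelled: handle each spelling suggestion like a regular tag
            // (suggestions are assumed to be well-formed words, so this terminates).
            for (String suggestion : findSpellingSuggestions(tag)) {
                meanings.addAll(analyse(suggestion));
            }
        } else if (languages.contains("fi") || languages.contains("sv")) {
            meanings.addAll(findCommonNounMeaningsFi(tag)); // KOKO -> DBpedia/Freebase -> WordNet -> OpenCyc
        } else if (languages.contains("en")) {
            meanings.addAll(findCommonNounMeaningsEn(tag)); // WordNet -> OpenCyc/DBpedia -> KOKO
        }
        return meanings;
    }

    // The lookups below would wrap the individual knowledge-base APIs; stubs only.
    List<String> findProperNounMeanings(String tag)   { return new ArrayList<>(); }
    List<String> detectLanguages(String tag)          { return new ArrayList<>(); }
    List<String> findSpellingSuggestions(String tag)  { return new ArrayList<>(); }
    List<String> findCommonNounMeaningsFi(String tag) { return new ArrayList<>(); }
    List<String> findCommonNounMeaningsEn(String tag) { return new ArrayList<>(); }
}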
3.3 Finding Spelling Suggestions
If no language can be detected, the tag may be misspelled and spelling suggestions are searched for. For a Finnish tag, the Finnish-Malaga tool (http://joyds1.joensuu.fi/suomi-malaga/suomi.html) is first used for trying to find word roots for the tag, and if none are found, Webvoikko (http://joukahainen.puimula.org/webvoikko) is used for finding spelling suggestions. For English words, Yahoo's Spelling Suggestion service (http://developer.yahoo.com/search/web/V1/spellingSuggestion.html) is used. All found English and Finnish spelling suggestions are then handled as regular tags and meanings for them are searched from amongst the proper nouns and the common nouns.

3.4 Finding Meanings from Amongst the Common Nouns
The databases that are used for finding meanings from amongst the common nouns are defined separately for each language. This is because not all databases support all languages equally. For example, KOKO is designed for Finnish concepts but also has some labels in Swedish and English; DBpedia and Freebase are designed for English concepts but also have some labels in Finnish and Swedish; OpenCyc and WordNet do not have any Finnish or Swedish labels. For English tags the analysis starts from WordNet, where the word roots and WordNet meaning URIs are searched. Each of the found word roots is then used in querying meanings from OpenCyc. Then the meaning URIs found from OpenCyc are accessed to get possible owl:sameAs URIs, as OpenCyc has linked WordNet's, DBpedia's and Umbel's URIs to their concepts with owl:sameAs relations. Only those OpenCyc URIs whose owl:sameAs relations do not contain a WordNet meaning URI different from the one that was found earlier are
accepted. After that, the Finnish KOKO ontology is accessed through its Web Service interface. If a meaning from DBpedia or Freebase was found, the meaning's Finnish alternative label from one of these databases (if available) is used in searching for concepts from KOKO. Otherwise, a Finnish translation (using Google Translate) of the original English tag is used. For Finnish and Swedish tags, the analysis starts from KOKO. Finnish tags are used as they are, but Swedish tags are translated to Finnish. The analysis then continues to WordNet, where the English label(s) of the KOKO concept(s) are used to get the word roots and WordNet meanings. If the English label(s) cannot be found from KOKO, they are obtained by translating the tag to English. Finally, the WordNet roots are used to get meanings from OpenCyc and DBpedia as described earlier.
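The acceptance rule for OpenCyc concepts described above can be sketched as a simple filter. The code below is illustrative only; in particular, how the owl:sameAs links are fetched is abstracted away behind a hypothetical lookup interface.

import java.util.ArrayList;
import java.util.List;

public class SameAsFilter {
    // Keep only OpenCyc URIs whose owl:sameAs links do not point to a
    // WordNet sense different from the one already found for the tag.
    static List<String> acceptConsistent(List<String> opencycUris,
                                         String wordnetMeaningUri,
                                         SameAsLookup lookup) {
        List<String> accepted = new ArrayList<>();
        for (String cyc : opencycUris) {
            boolean conflicting = false;
            for (String same : lookup.sameAs(cyc)) {           // owl:sameAs targets of the OpenCyc concept
                if (same.contains("wordnet") && !same.equals(wordnetMeaningUri)) {
                    conflicting = true;                        // points to a different WordNet sense
                    break;
                }
            }
            if (!conflicting) accepted.add(cyc);
        }
        return accepted;
    }

    // Abstraction over the actual owl:sameAs retrieval (e.g. dereferencing the OpenCyc URI)
    interface SameAsLookup { List<String> sameAs(String uri); }
}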
4 Utilizing the Found Meanings
Once the meaning URIs for a tag have been found, more knowledge can be obtained by accessing the data referred to by the URIs. This knowledge includes, e.g., labels, synonyms and descriptions in different languages, relations of concepts (broader, narrower, related), classifications (e.g., person, location, music), etc. The usefulness of this information depends on how the analysed tags are used. For example, different language versions can be used to localize the tags, classifications to categorize them and location coordinates to display them on a map. We have also implemented a way to export the gathered additional information as an RDF document that conforms to the ontology that we have defined earlier based on existing social media ontologies such as MOAT (Meaning of a Tag) [5].
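As an example of such an export, the sketch below builds and serializes a small RDF model with Apache Jena. Jena is our choice for illustration (the paper does not name an RDF library), and the namespace and property shown are placeholders rather than the authors' actual ontology.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;

public class TagRdfExport {
    public static void main(String[] args) {
        String ns = "http://example.org/tags#";            // placeholder namespace
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("ex", ns);

        Resource tag = model.createResource(ns + "koivu");
        tag.addProperty(RDFS.label, "koivu", "fi");
        tag.addProperty(RDFS.label, "birch", "en");
        // Link the tag to a meaning URI found in one of the knowledge bases
        tag.addProperty(model.createProperty(ns, "hasMeaning"),
                        model.createResource("http://dbpedia.org/resource/Birch"));

        model.write(System.out, "TURTLE");                 // serialize the gathered metadata
    }
}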
5 Findings
We find that Linked Data sources offer valuable resources for the automatic generation of semantic metadata as well as additional concept-related information that can be utilised in many different applications. Since different knowledge bases offer different kinds of data, it depends on the use case which database or databases to choose. Our approach in our analysis was to use the databases that best support the specific language and then use Linked Data to get relations between the databases. Although not all of these databases support multiple languages, with the help of Linked Data we were able to utilise knowledge relating to concepts in multiple languages. WordNet offers many tools for handling the English language and the senses of words, but it does not contain any links to other datasets or additional information relating to concepts. Luckily, other databases such as OpenCyc link their concepts with WordNet URIs, so additional information can be gathered. Some but not all concepts in KOKO have English and Swedish labels in addition to Finnish labels. Because of this we need to use Google Translate to get labels in all three languages. Although this works fine in most cases, there are also incorrect translations which lead to incorrect semantic meanings of a tag.
We have been informed that a more accurate translation process of KOKO is ongoing. Once it is finished, it will also improve our analysis. Overall, there is no unified way to access semantic knowledge bases, as each of them has its own APIs and protocols. For example, DBpedia is accessed through a SPARQL endpoint, KOKO using a SOAP web service, and OpenCyc with REST/XML. When using Linked Data online, we are also dependent on the services and their response times. During the development, both OpenCyc and DBpedia updated their APIs, which caused changes to our software as well. Since many of the datasets can also be downloaded, one opportunity is to handle this data locally. We have not done that, but we store analysed data relating to tags and their meanings in our own RDF storage. The more the application is used, the more information we have relating to different tags. Then, even if the services are down, we can still utilise our own database.
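To give a flavour of this heterogeneity, the snippet below retrieves the multilingual labels of one concept from the public DBpedia SPARQL endpoint using the SPARQLWrapper library; the KOKO SOAP service and the OpenCyc REST/XML interface each need different, service-specific client code and are not shown. The concept URI is only an example.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Helsinki> rdfs:label ?label .
        FILTER (lang(?label) IN ("en", "fi", "sv"))
    }
""")
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"])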
6 Conclusions
In this paper we have presented an application that aims at creating meanings for any given word(s). The created semantic metadata can be of great use in, e.g., finding and combining similar contents, targeting advertisements, etc. For tag-based services, such as image sharing sites, the application opens new ways to add more intelligent features. Services where contents have traditionally been classified with keywords, such as book publishers, could also find the application useful. Currently, we are using our semantic tag analyser in implementing a service that can recommend events to users based on their interests. Both the events' descriptions and the users' tags are analysed with the semantic tag analyser, and the created semantic metadata is used in finding the events that are in some way semantically connected to the users' tags.
References
1. Sluijs, K., Houben, G.-J.: Automatic Generation of Semantic Metadata as Basis for User Modeling and Adaptation. In: Kuflik, T. (ed.) Advances in Ubiquitous User Modelling. LNCS, vol. 5830, pp. 73–93. Springer, Heidelberg (2009)
2. Leonardi, E., Houben, G.-J., Sluijs, K., Hidders, J., Herder, E., Abel, F., Krause, D., Heckmann, D.: User Profile Elicitation and Conversion in a Mashup Environment. In: CEUR Workshop Proceedings, 1st International Workshop on Lightweight Integration on the Web, pp. 18–29 (2009)
3. Cantador, I., Szomszor, M., Alani, H., Fernández, M., Castells, P.: Enriching Ontological User Profiles with Tagging History for Multi-Domain Recommendations. In: 1st International Workshop on Collective Semantics: Collective Intelligence & the Semantic Web (2008), http://eprints.ecs.soton.ac.uk/15451/
4. Angeletou, S.: Semantic Enrichment of Folksonomy Tagspaces. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 889–894. Springer, Heidelberg (2008)
5. Vainikainen, S., Nummiaho, A., Bäck, A., Laakko, T.: Collecting and Sharing Observations with Semantic Support. In: 3rd International AAAI Conference on Weblogs and Social Media, pp. 338–341. AAAI Press, Menlo Park (2009)
Application of Semantic Tagging to Generate Superimposed Information on a Digital Encyclopedia
Piedad Garrido, Jesus Tramullas, and Francisco J. Martinez
University of Zaragoza, Pedro Cerbuna, 12, 50009 Zaragoza, Spain
{piedad,tramullas,f.martinez}@unizar.es
http://www.unizar.es
Abstract. Several works can be found in the literature regarding the automatic or semi-automatic processing of textual documents with historic information using free software technologies. However, more research is needed to integrate the analysis of the context and to cover the peculiarities of the Spanish language from a semantic point of view. This research work proposes a novel knowledge-based strategy that combines subject-centric computing, a topic-oriented approach, and superimposed information. Its subsequent combination with artificial intelligence techniques led to an automatic analysis, after implementing a made-to-measure interpreted algorithm which, in turn, produced a good number of associations and events with 90% reliability.
Keywords: Topic Maps, DITA, XTM, electronic encyclopedia, automatic processing, artificial intelligence, superimposed information.
1 Introduction
An encyclopedia is a work that covers all aspects of human knowledge. Basically, it is a comprehensive and complete reference work of which there are two main types: (i) arranged in alphabetical order, used like dictionaries but with much more information, where each volume contains the terms included between the two words that appear on its spine, and (ii) arranged by themes, known as systematic classification, where each volume deals with a different theme and, to locate certain information, it is necessary to resort to the contents and the alphabetical index in each volume. In recent years, many electronic encyclopedias have been published. At the beginning of this electronic era, they were distributed on optical media such as CD-ROMs and DVDs. Presently, their online format has spread and led to very complete multimedia documental products with which users may interact thanks to their hypertext characteristics. Indeed, the diffusion of online electronic encyclopedia contents has led to such an increase in the production of contents to be processed by documentation units that their manual processing proves practically impossible, at least by traditional techniques.
The online Gran Enciclopedia Aragonesa (GEA, the Great Aragonese Encyclopedia) [1] combines the two approaches: it is an online encyclopedia arranged both alphabetically and by categories. This format affects the way users act: once they have tried it, they prefer the real-time self-service format to traditional encyclopedias when searching for specific information. GEA was published in print in 1981, but its scanning only began at the end of 2001, leading to the first version of GEA online in September 2003. The potential users of the system are professional scholars, researchers and the general public.

GEA is structured in XML voices (entries), sorted into categories and subcategories, and can be accessed through an index or a common integrated form on the website. This structure favours the interchange of information and communications on the one hand, but it is still 'static' and 'rigid', which hampers the work of updating the voices, and users must descend through too many levels of depth until they find the information they need. Furthermore, searches by topic are limited to either the categories offered (art, biographies, sciences, geography, heraldry, history, humanities and entertainment) or a search done in alphabetical order.

The analysis of GEA is important because: (i) the process of updating the encyclopedia never ceases, (ii) the analysis of the Spanish language is very interesting and there are fewer initiatives regarding it, and (iii) it promotes technology transfer between universities and enterprises. The aims of the project were: (i) to obtain a 'dynamic' version of the product able to automate the processing of textual documents with historic information in Spanish using free software technologies, and (ii) to start a research, development and innovation project between Dicom Medios and the University of Zaragoza.

This article analyses how superimposed information with XTM [2] and DITA [3], the knowledge-based strategy adopted, and artificial intelligence (AI) technologies may be applied to improve textual documents with previously digitalised historic information, for the purpose of assisting the preservation of these materials and of providing solutions to overcome the existing technical difficulties in making historic heritage perpetual. This proposal, which is based on a knowledge-based strategy combining superimposed information, subject-centric computing and a topic-oriented approach, will: (i) facilitate the task of automating the documental analysis of content (semantic description) of this kind of textual documents, (ii) permit a more thorough representation of the contents, (iii) increase the possibilities of retrieving requested information, and (iv) adapt their use to each user's needs.

This document is arranged as follows: Section 2 introduces some key concepts we are working with, such as superimposed information, ISO standard 13250 (Topic Maps) [2] and its XTM specification, as well as the Darwin Information Typing Architecture (DITA). Section 3 includes some works related to our proposal. Section 4 presents the algorithms designed to automatically analyse textual documents containing historical information. The results obtained are analysed in Section 5. Finally, Section 6 offers the most important conclusions.
2 Main Concepts: Superimposed Information, Subject-Centric Computing, and Topic-Oriented Approach
The philosophy of work using a knowledge-based strategy which blends superimposed information [4], subject-centric computing [5], and a topic-oriented approach [6] rests on the fact that documents need a logical representation structure through which access to them may be facilitated. This logical structure is not normally deduced from the textual information itself, but rather from the contexts in which the documents were created and implemented, and in which they are used. Normally, the semantic description of a document is not limited to a single model adapted to a discipline, because many metadata models exist, with various forms of implementation, that make them all useful for solving a specific information representation problem. For this reason, combining several models helps accomplish richer semantics, which subsequently enables simpler indexing and, in turn, more efficient search processes.

ISO standard 13250:2003 [7], known as Topic Maps, is a semantic model with a subject-centric vision used to represent the conceptual structure of the information contained in an online information resource. Topic Maps richly describe relationships between 'things' rather than between documents and pages, they improve the findability of information, and they operate at a higher level because they are human-oriented. The biggest contribution of such structures to knowledge representation lies in the fact that, when combined, they are much more effective than when considered separately [8]. In 2000, Topic Maps were defined using an XML syntax [2] known as XTM (XML for Topic Maps), which was updated in 2006.

DITA (Darwin Information Typing Architecture) is an XML specification which IBM created in 1999 and transferred to the OASIS association in 2004. This content model is a topic-centered approach created to design, produce and distribute technical information. The philosophy of this content model is based on the fragmentation of the content into short topics to facilitate their reuse in different contexts. Since this is an extendable specification, different organisations can define their own specific informative structures without relinquishing the use of generic tools [9].

We believe that the combination of both models could prove profitable for developing automatic systems that process large amounts of textual information: the former adds contextual semantics to the text, greatly reducing the problem of having only the information contained in it, while, in parallel, DITA covers the traditional, content-based part, deducing the relevant information from the content itself. The most interesting characteristic for undertaking the automated analysis of textual documents containing historic information is based on the fulfilment of two conditions: (i) the presentation of a series of occurrences, taking place at a given time and in a specific place, associated with a group of entities, and (ii) most of an entity's intrinsic information is conveyed by virtue of relationships with other
entities. The semantics of these relationships lies in the roles they play. In this sense, all the smaller fragments into which we are interested in dividing each entity (the internal structure) have to be identified beforehand to subsequently interrelate the entities (the external structure) either in the encyclopedia database computer system or with other external online information sources. Later a content analysis has to be done which, in the terminology used in this research work, is the equivalent of the semantic description of these documental contents. The information retrieved after the analysis could have a network structure in which each node is an entity related with other nodes. Effective semantic information for the human user would be stored in the tags of the arcs containing the role of the internode association, along with the date and place if dealing with a dated event. The response to all these challenges is based on a hybrid scheme employed between the XTM specification and the DITA architecture. Both are models that define the way all the associations among the entities are stored and interpreted, and independently describe each entity from the rest.
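A minimal data-structure sketch of the entity-and-association network described above is given below. The classes and the sample values are purely illustrative and do not correspond to the XTM/DITA serialisation actually used in the system.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Entity:
    name: str                                        # a topic: person, place, institution...
    fragments: dict = field(default_factory=dict)    # internal structure (DITA-like parts)

@dataclass
class Association:
    source: Entity
    target: Entity
    role: str                      # semantics of the arc, e.g. a family or successoral role
    date: Optional[str] = None     # filled in when the association is a dated event
    place: Optional[str] = None

# fictitious example: two interrelated entities forming one arc of the network
king = Entity("Fictitious King", fragments={"descripcion": "..."})
spouse = Entity("Magdalena de Folcaquier")
network = [Association(king, spouse, role="married to")]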
3 Related Work
We now discuss the different application types and developments related to our proposal: (i) applications that carry out the analysis of events (see Table 1), and (ii) developments which use the XTM markup language (see Table 2) and DITA. To the best of our knowledge, only two contributions have been made [10,11] which use these textual document analysis tools with historic information. Unlike our proposal, neither works with superimposed information nor uses a language analyser that includes event analysis, and they do not include an index base capable of searching among various document types, because the document to be imported in nearly all the applications has to be supplied as flat text.

Table 1. List of tools that analyse events
ESA (Event Structure Analysis): programmed in Java, works with XML and is based on a qualitative methodology to understand the sequence of events and how they link people and things through narrative prose. The only examples of its use with textual documents and historical information are available in the article of Richardson [10] on the Everett massacre of 1916, and in Griffin's work on historical sociology [11].
TABARI (Textual Analysis By Augmented Replacement Instructions): an open-source system based on pattern recognition. It is designed to work with short summaries based on three forms of information: actors (proper names), verbs that determine the actions among actors, and phrases to distinguish among the various verbal meanings and to supply syntactic information related to the position the verb takes in the phrase.
Table 2. Specific developments which use Topic Maps and their XTM specification

Merlino: a prototype of semi-automatic event generation. It takes a Topic Map as input, creates search queries, and uses many search engines to automatically identify relevant information resources and to identify them as events of the topic supplied for the search. It is implemented in Perl and its strong point is its ability to express semantic relationships in Topic Maps using a powerful search engine.
Siren (Semantic Information Retrieval Environment for Digital Libraries): a Topic Maps-based semantic information retrieval environment for digital libraries. It is part of the DMG-Lib (Digital Mechanism and Gear Library), a set of components that make up a collaborative work environment which uses the TMwiki and Merlino tools in some of its development phases [12].
Houston: the work of Ann Houston and Grammarsmith, presented at TMRA 07 and entitled 'Automatic Topic Map Generation from Free Text using Linguistic Templates', shows how, using textual information available online and Ontopia's Omnigator software, a Topic Map is automatically constructed by comparing passages from a free text with linguistic templates [13].
Table 2 presents the most outstanding Topic Maps-related developments. We now highlight the works done using DITA. First, the work of Hennum [14] uses information superimposed with DITA and SKOS to manage the formal subjects of the document content. Then there are Gelb's contributions to international congresses, which consider the theoretical use of Topic Maps and DITA [15] and integrated a methodology known as SOTA (Solution Oriented Topic Architecture) into the development of projects involving content management. Next there is Garrido's practical approach [16], which demonstrates that work with superimposed information done with both markup text languages improves human-machine interaction when working with textual documents. The nearest development to the proposal presented herein, that is, the automatic processing of textual documents with historic information, is Wittenbrik's work [17], which considers the incorporation online of encyclopaedic information, previously only available in printed form, by using the international Topic Maps norm. The genuine contribution to this predefined structure is made only at the coding level: Topic Maps enabled legible online data to be mapped to an available semantic structure, so its tagging, among other things, was not carried out automatically. Briefly, if we wish to correctly digitalise, store, process and diffuse textual historic documents in Spanish, we conclude that there is no tool other than our proposal available in the market to do this.
4 Designed Algorithms
The aim of analysing the available textual documents with historic information is to extract a series of relationships among the entities and a series of events that, collectively, describe the relevant information. So the algorithm used for automatic processing purposes must be capable of: (i) reading and interpreting the text, (ii) detecting relevant information, (iii) extracting it, (iv) shaping the association among entities or for an entity-related event, and (v) storing it in one or several ways. The model designed assumes the same information resource to be the object of several representations, all of which are interdependent within the process but have different results. However, the crux of the problem lies in: (i) the detection method used to determine which specific part of the processed textual document is converted into an association or event, and (ii) the method to infer the tagging and to indicate the role, date, place and entities participating. For this purpose, our system is able to use two different approaches (see Table 3).
Table 3. Developed algorithms

First-level algorithm. Detection method: based on word search in the text. Inference method: checking against a predefined knowledge-based system.
Second-level algorithm. Detection method: processing natural language at the morphological, syntactic and semantic levels. Inference method: a typical rule engine which incorporates a series of morphological combinations.
4.1 The First Approach (First-Level Algorithm)
The first of the two approaches we implemented was a first-level algorithm in which the analysis was done directly on the text. The text was first fragmented into phrases to identify associations and entities, and then into words. Each word was analysed to determine whether it was the role of an association (by inspecting a small database of relevant roles: successoral lines, hierarchies, titles, etc.), or whether it was a word which mentioned an entity (detecting if it was a proper name). When an entity was detected, the presence of a role in the rest of the phrase was investigated and, if successful, an association among the topics was created. To detect events, a basic exploration based on finding parentheses was followed, since a parenthesis indicates the date of an event in most cases. This exploration algorithm has been classified as a first-level algorithm because detection is based directly on the words in the text, and its associations and events are produced from a search of keywords which provide clues as to the presence of relevant information. Therefore it deals with the subject of indexing by establishing relationships between natural language and documental languages.
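The following fragment sketches that first-level detection in Python under the behaviour just described (role lookup, proper-name spotting, and parenthesised dates). The tiny role lexicon and the regular expressions are illustrative stand-ins, not the resources used in the actual system.

import re

ROLE_LEXICON = {"rey", "reina", "obispo", "conde", "heredero"}   # illustrative roles

def first_level(text):
    associations, events = [], []
    for sentence in re.split(r"[.!?]", text):
        words = sentence.split()
        # crude proper-name detection: capitalised words not found in the role lexicon
        entities = [w for w in words if w[:1].isupper() and w.lower() not in ROLE_LEXICON]
        roles = [w.lower() for w in words if w.lower() in ROLE_LEXICON]
        # an entity plus a role in the same phrase yields a candidate association
        for entity in entities:
            for role in roles:
                associations.append((entity, role))
        # parenthesised text containing a number is taken as the date of an event
        for date in re.findall(r"\(([^)]*\d{3,4}[^)]*)\)", sentence):
            events.append((tuple(entities), date))
    return associations, events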
However, the basic service has not been developed further, since the analysis essentially depended on a predefined knowledge-based system and it failed in proportion to the complexity of natural language expression, with its variation and linguistic ambiguity. So we had to go one step further by distinguishing between the information analysis and the projection phases.
4.2 The Second Approach (Second-Level Algorithm)
The next step is a second-level approach in which the information analysis phase would be independent of grammars, codes, predetermined deductive languages and predefined knowledge-based systems, because it would be impossible to cover the peculiarities of processing natural language in Spanish with such complex structures. The intention was first to eliminate complex expressions and use others to be subsequently analysed by a simple processor. In our particular case, the projection phase involved having to regenerate the documental base with a different structure, so it was necessary to develop a second-level processing algorithm. This algorithm uses natural language processing techniques at the morphological, syntactic and semantic levels together with, as a novelty, a typical rule engine that incorporates a set of morphological combinations and which, unlike what is normally done, conducts an in-depth analysis so as neither to identify only certain significant structures, nor to lead to the loss of relevant information, a mistaken extraction or redundant information.

At the morphological and syntactic levels, the Freeling software package [18] was integrated and adapted to correct certain errors it made (cataloguing common names as proper names when they are at the beginning of sentences, assigning verbs to common names, etc.) and to resolve some widely used acronyms (e.g., Z. or id.). Moreover, a twin-type system was included to process linguistic ambiguity in such a way that a word could be simultaneously catalogued in several ways to avoid errors in the following analysis phases.

Table 4. Steps of the second-level algorithm

Step 1: Filtering out the stop words
Step 2: Word-by-word analysis
Step 3: Phrase-by-phrase analysis
Step 4: Structure check against a series of ad-hoc patterns
Step 5: Eliminating redundant information with summary patterns
Step 6: Detecting events based on three pattern types
Step 7: Obtaining a series of lists of the associations and events
Step 8: Verifying the lists and searching by references
Step 9: Having generated the dependences, the lists of associations will be calculated by a cross-search
Step 10: The relevance of the roles of both the associations and events will be calculated
Step 11: A rescue method will be called to detect orphan associations
Fig. 1. The original GEA voice in XML (input), and the XTM-DITA (output) obtained after applying our algorithm
An example with a fictitious entity: 'Al comienzo de su reinado, se casó con Magdalena de Folcaquier' (At the beginning of his reign, he married Magdalena de Folcaquier). After the morphological and syntactic analysis we obtain: 'A (SP, preposition) comienzos (NC, common noun) de (SP, preposition) su (DP, possessive determiner) reinado (NC, common noun) casó (VM, Spanish conjugation of the verb 'casar') con (SP, preposition) Magdalena de Folcaquier (NP, proper noun)'. At the semantic level, this morphological structure is verified against the series of patterns predefined in the algorithm for the purpose of finding relevant information. If one of the predefined patterns is 'VM SP NP', then this sentence fulfils the pattern from the word 'casó' onwards. If we look closely, we can observe how a wide spectrum of sentences describing an action on another entity is covered by this pattern. Table 4 presents the different steps of our second-level algorithm. The algorithm receives the original information split into XML voices with a simple semantic description (i.e. voz, vozid, nombre, descripción). After processing the original voices by applying our knowledge-based strategy, which combines subject-centric computing, a topic-oriented approach and superimposed information, the obtained output is semantically richer (see Figure 1).
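The semantic-level step can be pictured as pattern matching over the sequence of morphological categories. The sketch below checks the 'VM SP NP' pattern from the example against a hand-written tag sequence; in the real system the tags come from Freeling, and the pattern set is far larger.

# (word, category) pairs in Freeling-style notation; written by hand for the example
tagged = [("A", "SP"), ("comienzos", "NC"), ("de", "SP"), ("su", "DP"),
          ("reinado", "NC"), ("casó", "VM"), ("con", "SP"),
          ("Magdalena de Folcaquier", "NP")]

PATTERNS = [("VM", "SP", "NP")]   # verb + preposition + proper noun

def match_patterns(tokens, patterns):
    hits = []
    categories = [cat for _, cat in tokens]
    for pattern in patterns:
        for i in range(len(categories) - len(pattern) + 1):
            if tuple(categories[i:i + len(pattern)]) == pattern:
                hits.append(tokens[i:i + len(pattern)])
    return hits

# each hit, e.g. [('casó', 'VM'), ('con', 'SP'), ('Magdalena de Folcaquier', 'NP')],
# becomes a candidate association between the current entity and the proper noun
for hit in match_patterns(tagged, PATTERNS):
    print(hit)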
5 Performance Evaluation
This section presents the analysis of the results obtained from using the two algorithms proposed for the automatic text analysis. The relevance of the roles of both the associations and the events that the two algorithms detect is calculated by two different approaches. The first approach consists in assigning preset relevances to the predefined roles and giving the remaining roles null relevance. In this way, the algorithm assigns relevances according to this predefined table. However, this method is not flexible. The second approach, which is used in the final algorithm, consists in counting the roles: the more frequent a role, the greater its relevance, and vice versa. This is a very flexible method and depends on neither predefined tables nor information domains. Having assigned relevance, a filter is run to eliminate the most irrelevant data (that is, the data with the least frequent roles). The tests done have shown that the second-level algorithm produced around three thousand associations and approximately one thousand events from two hundred words, unlike the first-level algorithm, which generated hundreds of associations and dozens of events. This reflects an increase of an order of magnitude in the results. Moreover, the reliability (in other words, the degree of success of both the associations and the events related to the text) is 90% for the second-level algorithm, while it is 70% for the first-level algorithm (an example trace is available at http://e-archivo.uc3m.es/bitstream/10016/4945/1/Tesis.pdf, pages 324-328).
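The frequency-based relevance used in the final algorithm can be expressed in a few lines of Python; the threshold below is arbitrary and only illustrates the filtering of the least frequent roles.

from collections import Counter

def filter_by_role_relevance(associations, keep_ratio=0.8):
    """associations: list of (entity, role, entity) triples.
    Role relevance grows with frequency; the least frequent roles are filtered out."""
    role_counts = Counter(role for _, role, _ in associations)
    ranked = [role for role, _ in role_counts.most_common()]
    kept = set(ranked[:max(1, int(len(ranked) * keep_ratio))])
    return [a for a in associations if a[1] in kept]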
6 Conclusions
In digital environments, integrating the 'context' into developments performed with thesauri and ontologies, while keeping the solution versatile, is an important aspect to investigate. In this research work, a contextualised, reusable and interoperable solution was first planned and then constructed thanks to the use of standards and free software. Since no traditional classification tools were included in the prototype, with the proposed information architecture we managed to: (i) enhance user-friendliness, especially for non-specialised users, (ii) capture the searches that this work contemplated in natural language without having to simulate or imitate a performance, and (iii) merge it with other types of external information sources thanks to Topic Maps. Our proposal goes beyond the traditional solutions in the sense that it provides a framework within which 'things' can be represented as they are, and it significantly extends and improves the information retrieval process.

Secondly, both the work done with superimposed information for the online development in this field of application and the combination of the subject-centric vision (semantic model) with the topic-centered vision (content model) are novel proposals that make the capture, development and reuse of semantics and content strongly relevant, enabling the difficult process of restructuring the volume of information to work properly. Thirdly, incorporating AI techniques into the algorithm provides coverage of the peculiarities of the Spanish language, such as semantic ambiguity and the wide spectrum of available linguistic formulae to express the same thing.
References
1. GEA, Gran Enciclopedia Aragonesa (2009), http://www.enciclopedia-aragonesa.com/
2. ISO/IEC JTC1/SC34, ISO/IEC 13250-3:2007: Information technology – Topic Maps – Part 3: XML syntax, ISO Intl. Organization for Standardization (2007)
3. DITA, OASIS Darwin Information Typing Architecture (2009), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita
4. Tramullas, J., Garrido, P.: Constructing web subjects gateways using dublin core (DC), resource description framework (RDF) and topic maps (TM). Information Research: An International Electronic Journal 11(2) (January 2006), http://informationr.net/ir/11-2/paper248.html
5. Maicher, L., Garshol, L.M.: Subject-centric Computing (2008)
6. Nakamura, S., Chiba, S., Kaminaga, H., Yokoyama, S., Miyadera, Y.: Development of a topic-centered adaptive document management system. In: Fourth International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009, pp. 109–115 (2009)
7. ISO/IEC JTC1/SC34, ISO/IEC 13250:2003 Information technology – SGML Applications – Topic Maps, ISO Intl. Organization for Standardization (2003)
8. Garrido, P.: El procesamiento automático de documentación textual con información histórica: una aplicación XTM y DITA. Ph.D. dissertation, University Carlos III de Madrid (2008), http://e-archivo.uc3m.es/dspace/handle/10016/4945
9. Linton, J., Bruski, K.: Introduction to DITA: a User Guide to the Darwin Information Typing Architecture. Comtech Services Inc. (2006)
10. Griffin, L.: Millowners and wobblies: an event structure analysis of the Everett massacre of 1916. Annual meeting of the American Sociological Association (August 2004), http://www.allacademic.com/meta/p109966_index.html
11. Griffin, L., Korstad, R.: Historical inference and event-structure analysis. International Review of Social History 43 (1998)
12. Thomas, H., Brecht, R., Markscheffel, B., Bode, S., Spekowius, K.: TMchartis – A Tool Set for Designing Multiple Problem-Oriented Visualizations for Topic Maps. In: Maicher, L., Garshol, L.M. (eds.) TMRA 2007. LNCS (LNAI), vol. 4999, pp. 36–40. Springer, Heidelberg (2008)
13. Houston, A., Grammar, S.: Automatic Topic Map Generation from Free Text using Linguistic Templates. In: Maicher, L., Garshol, L.M. (eds.) TMRA 2007. LNCS (LNAI), vol. 4999, pp. 237–253. Springer, Heidelberg (2008)
14. Hennum, E., Day, D., Hunt, J., Schell, D.: Design patterns for information architecture with DITA map domains: defining a type for collections of topics. IBM developerWorks, Technical Library (September 2005), http://www.ibm.com/developerworks/xml/library/x-dita7/
15. Gelb, J.: DITA and Topic Maps: bringing pieces together. In: Proceedings of the Topic Maps International Conference (April 2008), http://www.suite-sol.com/downloads/DITA-and-TopicMaps-Bringing-the-Pieces-Together.pdf
16. Garrido, P., Tramullas, J., Martínez, F., Coll, M., Plaza, I.: XTM-DITA structure at human-computer interaction service. In: Actas del XI Congreso Internacional Interacción Persona-Ordenador, pp. 407–411 (June 2008)
17. Wittenbrik, H.: The GPS of the information universe: Topic Maps in an encyclopaedic online information platform. In: Proceedings of XML Europe Conference (June 2000)
18. FreeLing: an open source suite of language analyzers (2009), http://www.lsi.upc.edu/~nlp/freeling/
Mapping of Core Components Based e-Business Standards into Ontology
Ivan Magdalenić¹, Boris Vrdoljak², and Markus Schatten¹
¹ University of Zagreb, Faculty of Organization and Informatics, Pavlinska 2, HR-42000 Varaždin, Croatia
{ivan.magdalenic,markus.schatten}@foi.hr
² University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia
[email protected]
Abstract. A mapping of Core Components specification based e-business standards to an ontology is presented. The Web Ontology Language (OWL) is used for ontology development. In order to preserve the existing hierarchy of the standards, an emphasis is put on the mapping of Core Components elements to specific constructs in OWL. The main purpose of developing an e-business standards’ ontology is to create a foundation for an automated mapping system that would be able to convert concepts from various standards in an independent fashion. The practical applicability and verification of the presented mappings is tested on the mapping of Universal Business Language version 2.0 and Cross Industry Invoice version 2.0 to OWL. Keywords: e-business standards, ontology mapping, Core Components, UBL, CII.
1 Introduction
There are dozens of standards that describe business documents today. While each standard has emerged due to a need in a particular area of business, these standards often overlap. It is often the case that the interests of various stakeholders result in standards that largely cover identical areas of operations. The exchange of business documents in such a case requires the conversion of the elements from one standard to another. Though there are tools that help in such a document conversion, there is always a need for an expert in the field who is able to decide upon the meaning of each element. The process of deciding which element will be mapped into which is done manually by the expert. Our intention is to speed up that process by building a system that will perform this task automatically. If unable to fully automate the process, due to various factors, the system should at least suggest which element should be mapped into another, and wait for confirmation from the experts.
Several ways can be used to compare and synchronize business documents, and they are presented later in this paper. In our research we decided to map electronic business standards by developing individual ontologies and then creating mappings between them. The representation of standards as ontologies allows the mutual mapping process to be automated by using different computational algorithms. The first step in this process is to create a mapping of e-business standards into ontologies, and this paper shows the achieved results. E-business standards used in procurement which are based on the Core Components Technical Specification 2.01 (CCTS) [15] are the focus of our research. CCTS is accepted as ISO/TS 15000-5:2005 and several e-business standards are based on it, such as Universal Business Language (UBL) [13], Cross Industry Invoice (CII) Version 2.0 [16], GS1 XML [5] and OAGIS 9.x [11]. All of these standards cover the area of electronic procurement and a mapping between them is a real and practical need. Differently from previous approaches in the literature, particular relevance is given to namespaces and business context, because we expect that those features will play a significant role in ontology matching. This paper presents a mapping from e-business standards based on CCTS to the Web Ontology Language (OWL). OWL is chosen because it is a well accepted standard for writing ontologies to be used on the Internet. The verification of the proposed mapping is done by building ontologies of UBL and CII. The rest of this article is organized as follows: an overview of business document standards is given in Section 2; the mapping of e-business standards to ontologies is presented in Section 3; in Section 4 the related work is presented; the conclusion is given in Section 5.
2 An Overview of Business Documents Standards
The area of electronic business document standards is an extremely active one, and abounds with standards that define the content and meaning of their elements. Fig 1 gives a non-exhaustive overview of the most important standard definitions on a timescale [6]. There are two major groups of business document standards: delimiter-based standards and markup-based standards. Delimiter-based approaches use standard ASCII characters to separate different data elements, segments, and messages. All definitions of the elements used in delimiter-based standards are outside of the documents and messages. Markup-based standards use one of the markup languages to mark document elements and to give them proper semantics. The most accepted markup languages are the Standard Generalized Markup Language (SGML), the Hypertext Markup Language (HTML) and the eXtensible Markup Language (XML). We have recognized that the tendency of acceptance of standards is in favor of standards based on XML and Core Components (short for the Core Components Technical Specification). The Core Components concept defines a new paradigm in the design and implementation of reusable, syntactically neutral and semantically correct and meaningful building blocks. Therefore, we give a detailed description of the metamodel of Core Components and Business Information Entities (BIE), a simplified version of which is shown in Fig 2.
Fig. 1. Overview of different business standards [6]
The Core Components elements (CCs) are: Aggregation Core Component (ACC), Basic Core Component (BCC), Association Core Component (ASCC), and Data Type (DT). The ACC is a set of connected components of business information that carry a unique business meaning, independent of any business context. The BCC specifies a distinct business characteristic of the ACC. The ASCC specifies a complex business characteristic of the ACC. The BCCs and ASCCs are properties of the ACCs. The DT provides a restriction on content. When the Core Components elements are put in a Business Context, they represent the foundation on which Business Information Entities (BIE) are built. BIEs are pieces of business data, or groups of pieces of business data, with a unique business semantic definition. CCs serve as a controlled vocabulary to build a BIE. There are three different categories of BIEs: Basic Business Information Entity (BBIE), Association Business Information Entity (ASBIE), and Aggregate Business Information Entity (ABIE). The BBIE represents a single business characteristic of a specific object class in a specific business context. The ASBIE is a complex business characteristic of a specific object class in a specific business context. The ABIE is a set of related pieces of business information that together carry a unique business meaning in a specific business context.

Syntax-neutral Core Components are intended to form the basis of business information standardization efforts and to be realized in syntactically specific instances. This could be done in a delimiter-based standard such as UN/EDIFACT, but most implementations are done using XML syntax. The first implementation of Core Components is UBL, followed by implementations in other electronic business standards: GS1 XML, OAGIS 9.X and, recently, CII 2.0. Special documents define mappings from Core Components definitions to XML Schema definitions. In UBL the corresponding document is UBL-Naming and Design Rules-2.0 [12], while UN/CEFACT brings its own rules in XML-Naming-and-Design-Rules-V2.0 for Version 2.0 [18] and UNCEFACT+XML+NDR+V3p0 [17] for Version 3.0.
Fig. 2. Simplified metamodel of Core Components and Business Information Entities [6]
The basic idea of the mapping of Core Components BIE definitions into XML Schema is shown in Fig 3. The ABIE structure is mapped into an xsd:complexType and its name into an xsd:element. The ASBIE is mapped into an xsd:element and is used in an ABIE to refer to another ABIE. The BBIE is mapped into an xsd:complexType where it extends or restricts a qualified or unqualified Data Type. The DT is mapped into an xsd:complexType or xsd:simpleType. The following section shows a similar process of mapping e-business standards into an ontology.
3 Mapping of e-Business Standards to Ontology
As already mentioned, our goal is the automation of mapping between different e-business standards, and an important step towards its achievement is the mapping of e-business standards to ontologies. Fig 4 presents the possible ways of mapping between the various e-business standards. Case A shows a direct mapping between e-business standards. The drawback of this mapping method is the need for experts who know both standards.
Fig. 3. Mapping of Core Components BIE definitions into XML Schema [13]
Such a mapping is also difficult to make because of the various syntaxes used in different standards. Case B shows a mapping of two standards using a common ontology. The drawback of this kind of mapping is the lack of a shared common ontology of the e-business domain. Such an ontology should have all the elements of all e-business standards. Although such a top-down approach is cost-effective in the long term, it is difficult to expect that such an ontology of the e-business domain will be developed soon. Case C shows a variant of B, with the difference that the e-business standards are first mapped to local ontologies and then mapped to each other using a common ontology. The advantage of this case with respect to the previous one lies in the use of ontologies in the mutual mapping. The main disadvantage is the same as in B, i.e. the lack of a common ontology. Case D shows the mapping of e-business standards to local ontologies. The representation of the standards in a common language allows the mutual mapping process to be automated by using different computational algorithms. The advantage of this method is the independent mapping of each standard into a common language. This requires that an expert is familiar only with the domain of one particular area, as compared to Case B and Case C, where the expert has to have knowledge of all e-business standards.
Fig. 4. Mappings between the various e-business standards
This approach was chosen in our study. Ontology-based integration has already been the subject of research. Cases B, C, and D are already identified in [19] as the three possible ways of using ontologies for content explication. Furthermore, there are many different approaches to ontology mapping, and some of them are described in [8], [10], and [4]. We expect that the mapping of e-business standards through ontologies will be more efficient than without them. The first necessary step is to map the e-business standards to local ontologies. Because we believe that standards based on CCTS take a leadership role, we decided to do their mapping first. The implemented mapping rules from BIE to OWL are presented in Fig. 5. All instances of the Core Components ABIE, BBIE, DT, and Business Context are mapped to an OWL Class. The name of the class is an Internationalized Resource Identifier (IRI).
Each IRI is defined by its name and namespace. We use the same namespaces as defined by the mapping of Core Components to XML Schema. If it exists, the definition of a Core Component is stored as an rdfs:comment. ASBIE instances are not mapped to OWL Classes, but to OWL Object Properties. If a hierarchy between Core Components elements exists, it is mapped by using the rdfs:subClassOf property. The ABIE consists of a set of components which are ASBIEs or BBIEs. They are called properties of the ABIE and are mapped into OWL as OWL Object Properties. The OWL Class to which a particular property belongs is mapped by using rdfs:domain. The ASBIE and the BBIE are mapped as rdfs:range. The name of the object property is formed as an IRI, where the name has the prefix "has". This kind of naming convention is the standard for OWL Object Properties.
Fig. 5. BIE to OWL mapping
A Data Type in Core Components consists of a Content Component and a Supplementary Component. The Content Component contains the actual value and the Supplementary Component defines the description of the value, such as units of measure. The Content Component and the Supplementary Component are mapped as OWL Datatype Properties. The OWL Class to which a particular data type belongs is mapped by using rdfs:domain.
rdfs:range is used for defining primitive data types, such as the built-in data types of XML Schema or code list identifiers. If some components are related to a certain business context, this property is mapped as an OWL Object Property. Each business context is mapped as an OWL class. The proposed mapping process is shown on the example of mapping from UBL to OWL. Fig 6 presents the definition of the total tax in UBL. For simplicity of presentation, the definition of the total tax presented in Fig 6 does not include all the elements of the original definition. The total tax component is of type ABIE and contains two properties: the amount of tax and the subtotals. The amount of tax is a component of type BBIE, and a subtotal is a component of type ASBIE. Further, the BBIE Tax Amount is of type AmountType, and the ASBIE Subtotal is of type ABIE TaxSubtotal.
Fig. 6. The definition of total tax in UBL – short example
A mapping of the described definition of the total tax is presented in Fig 7 using OWL syntax. Lines 1-7 define the TaxTotal class, which corresponds to TaxTotal in UBL. The IRI presented in line 2, http://ubl2_0/cac#TaxTotal, is shortened from the original for presentation purposes. The short name of the class is defined in line 3, whilst lines 4-6 give the full definition of the class. For the class TaxTotal two Object Properties are defined: hasTaxAmount (lines 9-15) and hasTaxSubtotal (lines 24-30). Lines 11-12 define to which OWL class this property belongs, which in this case is the class TaxTotal. Lines 13-14 define the range of the allowed contents of this property, which in this case are instances of the class TaxAmount. The OWL class TaxAmount is defined in lines 17-22 and is a subclass of the OWL class Amount (lines 20-21). The OWL class Amount itself is defined in lines 40-48. The restriction on the content of the class Amount is made by the OWL Datatype Property hasAmountValue, where it is defined that the content is of type xsd:decimal.
Fig. 7. OWL definition of tax – short example
We used the presented mapping to produce ontologies of the invoice from UBL 2.0 and from Cross Industry Invoice version 2.0. These ontologies are available at http://edocument.foi.hr/ontology/. They were created using the OWL API open source software [14]. The produced ontologies were loaded into Protégé for the purpose of syntax and structure verification.
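For readers who want to experiment with the mapping, the sketch below rebuilds the structure just described (TaxTotal with its two object properties, TaxAmount as a subclass of Amount, and hasAmountValue restricted to xsd:decimal) using Python's rdflib. It is an illustration only, not the authors' OWL API code; the namespace IRIs are shortened in the same way as in Fig. 7, and the udt namespace for the data types is an assumption.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

CAC = Namespace("http://ubl2_0/cac#")   # shortened namespace, as in Fig. 7
UDT = Namespace("http://ubl2_0/udt#")   # assumed namespace for unqualified data types

g = Graph()
classes = [
    (CAC.TaxTotal, "TaxTotal", "Information about a total amount of a particular type of tax."),
    (CAC.TaxAmount, "TaxAmount", None),
    (CAC.TaxSubtotal, "TaxSubtotal", "Information about the subtotal for a particular tax category"),
    (UDT.Amount, "Amount", "A number of monetary units specified in a currency"),
]
for cls, label, comment in classes:
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.label, Literal(label)))
    if comment:
        g.add((cls, RDFS.comment, Literal(comment)))

g.add((CAC.TaxAmount, RDFS.subClassOf, UDT.Amount))             # BBIE refines its Data Type

for prop, rng in [(CAC.hasTaxAmount, CAC.TaxAmount),
                  (CAC.hasTaxSubtotal, CAC.TaxSubtotal)]:
    g.add((prop, RDF.type, OWL.ObjectProperty))                  # ABIE properties
    g.add((prop, RDFS.domain, CAC.TaxTotal))
    g.add((prop, RDFS.range, rng))

g.add((UDT.hasAmountValue, RDF.type, OWL.DatatypeProperty))      # content restriction
g.add((UDT.hasAmountValue, RDFS.domain, UDT.Amount))
g.add((UDT.hasAmountValue, RDFS.range, XSD.decimal))

print(g.serialize(format="xml"))                                 # RDF/XML serialisation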
4 Related Work
Ontology building and the use of ontologies for the semantic description of data are becoming a more and more important subject of research. In most scientific and industrial areas ontologies are being built whose purpose is to describe some or all areas of interest. Particular activity can be seen in recent years, with ontologies built in different domains such as fisheries [3], agriculture [2], and even music [1]. The main difference between such approaches and our approach lies in the planned usage of the ontologies. In our case, we strive to create an ontology of each individual e-business standard for later mapping. Therefore, our ontologies include only those segments of the e-business domain which a certain standard describes. Because e-business standards are well defined, we have created software to extract the necessary data and to build an ontology in OWL. The problem of manual ontology development has already been observed and there is a tendency to develop software for the automated creation of ontologies where possible. An example of such an approach is the creation of ontologies from UML diagrams, as shown in [21]. In this paper we present the creation of ontologies from e-business standards based on Core Components.

The main emphasis in this paper is the process of mapping standards that are based on Core Components. These standards have a specific structure and it is important to map it to the ontology. Data for our ontologies were extracted mainly from the existing XML Schemas, and partly from Excel documents. A similar process of creating an ontology from an XML Schema is shown in [7]. An approach similar to ours is the construction of an ontology of UBL Version 1.0, which is presented in [9]. They introduced the following rules: CCT, DT, ACC and ABIE define concepts; BCC, ASCC, BBIE and ASBIE define properties. Qualification of a CC to a BIE is semantically similar to the subconcept relation between concepts, and all instances of a BIE are also instances of the corresponding CC. Actually, as the business context is also explicitly specified at the qualification step in UBL/CCTS, transforming it to a simple subconcept relation loses some information [9].

The differences to the mappings presented in this paper are the following. We map only BIEs, because BIEs are based on CCs and have a business context. The rules presented in Section 2 for mapping Core Components to XML Schema also work with BIEs. We define that concepts are ABIE, BBIE, DT, and Business Context. We also use the BBIE as a property of the ABIE, but we declare it as a concept too. We use the BBIE to restrict the content of the property by using rdfs:range. In this way we preserve the hierarchy which goes from DT to BBIE. The mapping of a BBIE as a property of an ASBIE is the same as in [9]. Further, in [9] the manually refined structure of the UBL ontology was aligned with the BULO ontology, i.e. the proper places for the topmost UBL concepts in the BULO taxonomy were identified. We have not aligned our ontologies to any upper ontology because the
purpose of our work is not to create a common ontology of the e-business domain, but to map different e-business standards to each other automatically. We consider that the alignment of only the topmost concepts with an upper ontology will not give an advantage in the mutual mapping of ontologies. Instead of the alignment with some upper ontology, we introduced the mapping of the business context. The knowledge of the business context of a certain element can be helpful in identifying its semantics. Further, UBL Version 1.0 is significantly smaller than UBL Version 2.0. In [9] one namespace is used for all ontologies; we used more namespaces in order to avoid name collisions. In [20] the construction of a UBL ontology for Version 2.0 is presented. In this approach, like in ours, all elements of Core Components are concepts except the ASBIE. Properties of concepts are defined by owl:intersectionOf, owl:Restriction, and owl:someValuesFrom. We and [9] use rdfs:range for the same functionality, which is better for use in mutual ontology mapping. However, in [9] and [20] mappings of UBL to OWL are presented, while our approach is more general and deals with the mapping of all standards based on Core Components to OWL. Moreover, in all these works attention has been paid to correctly mapping UBL to OWL without considering the future usage of the ontology. Differently from those approaches, in our case particular relevance is given to the problem of business context and namespaces, because we expect that those features will play a significant role in ontology matching, especially in the e-business domain.
5 Conclusion
This paper presented a mapping of e-business standards based on the Core Components specification into an ontology using OWL. We gave an overview of e-business standards to point to the complexity of the e-business domain and to show the need for mutual mapping of standards by mapping them into a common ontology-founded language. The emphasis is put on the mapping of Core Components elements to specific constructs in OWL in such a way that the existing hierarchy of the e-business standard is preserved. The purpose of building ontologies out of e-business standards is the building of a system that will be able to do mappings between e-business standards automatically using their ontology representation. This goal differentiates our mapping from the other approaches presented in the related work. The advantage of our mapping is its more general approach, which deals with the mapping of all standards based on Core Components to OWL. We introduced the mapping of the business context, which can be helpful in identifying the semantics of some elements. The practical applicability and verification of the presented mappings is tested on the mapping of Universal Business Language version 2.0 and Cross Industry Invoice version 2.0 to OWL. Our future work is focused on developing ontologies from the delimiter-based e-business standards. In parallel, we are building a system for the automatic mapping of ontologies to each other.
References
1. Albuquerque, M.O., Siqueira, S.V.M., Lanzelotte, R.S.G., Braz, M.H.L.B.: An Ontology for Musical Phonographic Records: Contributing with a Representation Model. In: Best Practices for the Knowledge Society. Knowledge, Learning, Development and Technology for All, vol. 49, pp. 495–502. Springer, Heidelberg (2009)
2. Athanasiadis, I.N., Rizzoli, A.E., Janssen, S., Andersen, E., Villa, F.: Ontology for Seamless Integration of Agricultural Data and Models. In: Metadata and Semantic Research, vol. 46, pp. 282–293. Springer, Heidelberg (2009)
3. Caracciolo, C., Heguiabehere, J., Sini, M., Keizer, J.: Networked Ontologies from the Fisheries Domain. In: Metadata and Semantic Research, vol. 46, pp. 306–311. Springer, Heidelberg (2009)
4. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, New York (2007)
5. GS1: GS1 XML, http://www.gs1.org/ecom/xml
6. Liegl, P.: Business Documents for Inter-Organizational Business Processes. Ph.D. Thesis, Vienna University of Technology (2009)
7. Lowe, B.: DataStaR: Bridging XML and OWL in Science Metadata Management. In: Metadata and Semantic Research, vol. 46, pp. 141–150. Springer, Heidelberg (2009)
8. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. The Knowledge Engineering Review 18(1), 1–31 (2003)
9. Nagypal, G., Lemcke, J.: A business data ontology. Deliverable of the project: Service Ontologies and Service Description (2004), http://dip.semanticweb.org/documents/D3.3-Business-data-ontology.pdf
10. Noy, N.F.: Semantic integration: a survey of ontology-based approaches. ACM SIGMOD Record 33 (2004)
11. OAGi: OAGIS 9.X, http://www.oagi.org/
12. OASIS: UBL-Naming and Design Rules-2.0, UBL, http://www.oasis-open.org/committees/download.php/10323/cd-UBL-NDR-1.0Rev1c.pdf
13. OASIS: Universal Business Language (UBL) Naming and Design Rules 2.0, http://docs.oasis-open.org/ubl/os-UBL-2.0/UBL-2.0.html
14. OWL API, http://owlapi.sourceforge.net
15. UN/CEFACT: Core Components Technical Specification ver. 2.01 – Part 8 of the ebXML Framework, http://www.unece.org/cefact/ebxml/CCTS_V2-01_Final.pdf
16. UN/CEFACT: Cross Industry Invoice Version 2.0, http://www1.unece.org/cefact/platform/display/TBG/Cross+Industry+Invoice+Process
17. UN/CEFACT: UNCEFACT+XML+NDR+V3p0, http://www.unece.org/cefact/xml/UNCEFACT+XML+NDR+V3p0.pdf
18. UN/CEFACT: XML-Naming-and-Design-Rules-V2.0, http://www.unece.org/cefact/xml/XML-Naming-and-Design-Rules-V2.0.pdf
19. Wache, H., Voegele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., Huebner, S.: Ontology-Based Integration of Information – A Survey of Existing Approaches. In: Proc. of IJCAI 2001 Workshop: Ontologies and Information Sharing, pp. 108–117 (2001)
20. Yarimagan, Y., Dogac, A.: A Semantics-Based Solution for UBL Schema Interoperability. In: IEEE Internet Computing, pp. 64–71 (May 2009)
21. Xu, Z., Ni, Y., Lin, L., Gu, H.: A Semantics-Preserving Approach for Extracting OWL Ontologies from UML Class Diagrams. In: Database Theory and Application, vol. 64, pp. 122–136. Springer, Heidelberg (2009)
Model-Driven Knowledge-Based Development of Expected Answer Type Taxonomies for Restricted Domain Question Answering

Katia Vila1, Jose-Norberto Mazón2, Antonio Ferrández2, and José M. Gómez2

1 University of Matanzas, Department of Informatics, Varadero Road, 40100 Matanzas, Cuba
2 University of Alicante, Department of Software and Computing Systems, San Vicente del Raspeig Road, 03690 Alicante, Spain
{kvila,jnmazon,antonio,jmgomez}@dlsi.ua.es
Abstract. A Question Answering (QA) system must provide concise answers, extracted from large collections of documents, to questions stated by the user in natural language. Importantly, a question should be correctly classified by means of a predefined taxonomy in order to determine its Expected Answer Type (EAT), thus reducing the search space over documents while a right answer is obtained. Designing a proper EAT taxonomy is even more crucial in restricted-domain QA, since domain experts use specific terminology, asking more precise questions and expecting more precise answers. This paper presents a novel model-driven approach to ameliorate the task of designing restricted-domain EAT taxonomies by using heterogeneous knowledge resources and a collection of documents. To show the applicability of our approach, a set of experiments has been carried out by defining a new EAT taxonomy to answer questions about the agricultural domain.
1 Introduction
Question Answering (QA) is defined as the task of searching for and extracting the text that contains the answer to a specific question in natural language from a collection of text documents, or corpus [11]. A study of the major evaluation forums for QA systems, such as the TREC and CLEF conferences, shows that a common architecture of a QA system consists of three sequential phases (see the Question Answering part of Fig. 1): (i) question analysis, for analyzing and understanding the question by classifying it and extracting the significant keywords; (ii) these keywords are used by an Information Retrieval (IR) system in order to select and retrieve the relevant passages or documents; and (iii) finding and extracting the expected answer by using natural language processing tools (such as a POS tagger, syntactical parser, entity annotator, semantic role parser, etc.) to analyze this set of passages.
TREC:Text REtrieval Conference, http://trec.nist.gov/ CLEF: Cross-Language Evaluation Forum, http://clef-campaign.org/
Interestingly, the question analysis phase must be performed as accurately as possible since it is the first phase of a QA system and, therefore, the rest of the phases depend on its results. Within this phase it is highly important to determine the semantic type of the answer, or Expected Answer Type (EAT), by means of a predefined taxonomy (also known as a question hierarchy [8] or question ontology [9]). Importantly, a correct specification of the EAT taxonomy implies accurate EAT detection, thus reducing the search space of possible answers and giving a more precise answer [8,7]. Indeed, more than 36.4% of QA errors are related to an incorrect EAT detection for the question [10]. There are several approaches for manually defining EAT taxonomies [15,7,9,8] from a corpus of questions by means of the semantic knowledge provided by a single generic KOS (Knowledge Organization System) [6] such as WordNet. Although the EAT taxonomies defined by using these approaches perform well in open-domain QA, they are not suitable for restricted domains because: (i) manually tuning EAT taxonomies for restricted domains requires a huge effort in time and cost due to the inherent complexity of the concepts provided by these domains; (ii) defining restricted-domain EAT taxonomies by analyzing potential questions to be answered is not realistic, since restricted-domain questions are highly complex and difficult to acquire; and (iii) the development of restricted-domain EAT taxonomies not only requires more than one generic KOS to be successfully carried out, but also a more precise domain KOS, such as the Agrovoc thesaurus for the agricultural domain. To overcome these drawbacks, this paper presents a novel approach based on model-driven software development techniques [3] in order to automatically design EAT taxonomies for restricted domains. Basically, our approach consists of three tasks (see Fig. 1): (i) creating a restricted-domain model from the most relevant terms of the corpus, (ii) enriching such a model with knowledge from domain and generic KOS, and (iii) obtaining a new EAT taxonomy for the restricted domain, which will be useful for dealing with questions from that domain. The remainder of this paper is structured as follows. After presenting a motivating example in the next section, current approaches for defining EAT taxonomies are briefly compared in Sect. 3. Section 4 describes our model-driven approach to use KOS in the development of EAT taxonomies for restricted-domain QA. Section 5 shows the applicability of our approach by means of a set of experiments. Section 6 sketches out our conclusions and future work.
2 Motivating Example
To illustrate the benefits of our approach throughout this paper, consider the following motivating example based on the agricultural domain from the Cuban Journal of Agricultural Science, or RCCA. This journal comprises topics related
WordNet: http://wordnet.princeton.edu/; Agrovoc: http://www.fao.org/agrovoc/; RCCA: Revista Cubana de Ciencia Agrícola, http://www.ica.inf.cu/productos/rcca/
Fig. 1. Our model-driven approach for creating restricted-domain EAT taxonomies
to agricultural science, such as Animal Science, Pastures and Forages, etc. In this paper, we use the Spanish part of this journal as corpus. Some sample questions about the agricultural domain are as follows:
Q1. ¿Qué enzima aumenta la digestibilidad del fósforo orgánico por parte de los animales? (What enzyme increases the animals' digestibility of organic phosphorus?).
Q2. ¿Qué glicosidos tienen un efecto defaunante en el rumen? (What glycosides have a defaunating effect in the rumen?).
These questions are answered by our open-domain QA system for Spanish, named AliQAn [13]. This system has an EAT taxonomy with two levels, based on WordNet Based-Types, that consists of the categories shown in the part of Fig. 2 labeled as OD-EATT. By using the EAT taxonomy of AliQAn, the classification of the previous sample questions is as follows: object for question Q1, and profession for Q2. It is worth pointing out that AliQAn incorrectly classifies question Q2, since its EAT taxonomy does not include the concept glycoside or any of its hypernyms (shown in Fig. 2). The classification of question Q1 could be considered correct, since enzyme has the top-concept hypernym object (the whole hypernym path is shown in Fig. 2). However, object is too wide a concept and accepts as semantically right some incorrect candidate answers such as artifact, ground, yeast, acids, salts, etc. Therefore, developing new EAT taxonomies is required for improving QA systems in restricted domains.
3 Related Work
The main features that define an EAT taxonomy are its size (number of classes in the taxonomy), its structure (flat or hierarchical) and its recall (ideal if the EAT taxonomy covers most of the questions without considering whether the QA system will be able to answer them, which makes precision decrease; otherwise the recall is realistic) [16]. Table 1 summarizes some existing EAT taxonomies by considering these features.
Fig. 2. Excerpt of an EAT taxonomy for open (OD-EATT) and restricted domains (RD-EATT)
Table 1. Summary of existing approaches for EAT taxonomies

Work in:                Domain        Size (# classes: Small/Medium/Large)  Structure (Flat/Hierarchy)  Recall (Ideal/Realistic)
Metzler and Croft [9]   Open          31   x x
Li and Roth [8]         Open          50   x x
Sekine et al. [15]      Open          150  x x
Hovy et al. [7]         Open          180  x x
Ely et al. [4]          Medical       10   x x
Kim et al. [14]         Medical       7    x x
Ferrés et al. [5]       Geographical  25   x x
Nowadays, there exist many EAT taxonomy proposals for open-domain QA systems, which are concerned with a wide spectrum of questions [9,8,15,7]. In these approaches EAT taxonomies are manually developed from large collections of questions (obtained from the Web or from the TREC or CLEF conferences) by obtaining knowledge from WordNet. These approaches take into account the question stem or interrogative clause (e.g., What, Which, Who questions, etc.), later adding semantic knowledge to obtain more accurate answers. Among all answer types, those having an ambiguous question stem (What, Which) are the most difficult to analyze, since they can be related to any answer type (What object, What substance, What enzyme, etc.), unlike the question stems Who, When and Where, which may correspond to person, date, and location concepts, respectively.
AskJeeves: http://www.ask.com; Yahoo Answers: http://answer.yahoo.com
Actually, the more ambiguous the question, the more semantic knowledge is required for specifying an adequate EAT taxonomy for the application domain of the QA system. Therefore, in spite of being hierarchical, these approaches are not refined enough to be useful for restricted domains, and specific approaches for developing restricted-domain EAT taxonomies are required. However, current approaches also suffer from problems. For example, the EAT taxonomies defined in [4] and [14] were obtained from a collection of 1001 questions asked by physicians and from 435 questions upon the RSI (Repetitive Strain Injury) corpus, respectively. Both works manually develop the EAT taxonomy by using the UMLS metathesaurus as domain KOS. In [5] the EAT taxonomy of a baseline QA system is manually tuned by using domain ontology concepts and relationships. A common drawback of these approaches is that the EAT taxonomy is based on analyzing potential questions from users, which may not be feasible in real applications, since acquiring a large number of restricted-domain questions is difficult. Bearing these issues in mind, it is important to define mechanisms for ameliorating the design of EAT taxonomies by using available knowledge resources, thus increasing the precision of restricted-domain QA systems. To this aim, in this paper a novel model-driven approach is defined to automatically use several KOS, together with the collection of documents instead of a corpus of questions, for creating EAT taxonomies for restricted-domain QA.
4 Model-Driven Development of EAT Taxonomies
Our approach consists of automatically defining an EAT taxonomy for restricted-domain QA by using KOS within a three-step model-driven process (see Fig. 1): our process is based on a unified metamodel we have created for specifying in a model, in a meaningful, precise and consistent manner, the most frequent restricted-domain terms from the corpus and the most useful concepts from different kinds of restricted-domain KOS (step T1). Then, this model is enriched with concepts from open-domain KOS (step T2). Finally, once this knowledge is represented, we have defined a model transformation to derive a restricted-domain EAT taxonomy (step T3).
4.1 Restricted Domain Metamodel
The first and second steps of our approach consist of creating models that represent terms from the restricted-domain corpus and joining them with their corresponding concepts from the KOS, respectively. To this aim, we have defined the restricted-domain metamodel, which contains the elements needed to create a variety of these models (see Fig. 3). The core element in this metamodel is the RestrictedDomainModel metaclass, which is useful for creating a model for a particular restricted domain. The CorpusTerm metaclass is useful for representing any of the terms appearing in
UMLS metathesaurus: http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/
Fig. 3. Overview of our restricted-domain metamodel
a corpus. A value metaattribute is used to store the lemmatized value of each term. There are several lexical kinds of corpus terms, such as adjectives, nouns or verbs, which are represented as subclasses of CorpusTerm, i.e. the AdjectiveTerm, NounTerm and VerbTerm metaclasses. It is worth noting that the syntactical relations between these terms (which can easily be provided by a POS tagger and a syntactical parser when the corpus is processed) are valuable for further steps of our approach. Specifically, the VerbTerm metaclass has relations to indicate which NounTerm can be seen as its subject or as its object. Also, a NounTerm can be related to an adjective or to other nouns. These relations are important to detect the multi-words which often appear in restricted domains (e.g. “calcium hydroxide” or “adrenal cortex hormones” in the chemical domain). Also, every kind of CorpusTerm has its own type (coming from several Enumerations, as shown in Fig. 3). Finally, every CorpusTerm may also have some semantic information (SemanticLabel metaclass). This semantic information can be provided by open-domain tools when the corpus is processed in the QA task, such as a semantic role parser, a Named Entity Recognizer (NER), or a temporal or numerical expressions recognizer. The SemanticLabel metaclass indicates the name of the technique used to acquire the semantic information, the value obtained by applying this technique and also the probability (certainty) of this value. For example, for the terms “Congo river” and “lake Kariba” there is a semantic label whose value is “Inland waters”, whose name is “NER”, and whose probability is “1”, since this value has been obtained by using a NER. Furthermore, the Concept and Equivalence metaclasses allow the elements of this restricted-domain metamodel to be semantically enriched with concepts and relationships from several KOS. The Concept metaclass refers to an element from a particular KOS. Each of these elements has a value to represent it. Besides, each concept can be related to one or more concepts through relations of synonymy, hypernymy and hyponymy. Each concept may be related to more than one KOS, for which the name of the KOS and an ID for the concept within this KOS are indicated. This metaclass has an isTop metaattribute that states whether it is a top concept in that KOS. Equivalences between a term and a concept can also be defined: the Equivalence metaclass represents an association between Concept and NounTerm.
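For illustration only, the following sketch expresses the main metaclasses of Fig. 3 as Python dataclasses; attribute names follow the description above, while the types and default values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SemanticLabel:
    name: str           # technique used, e.g. "NER"
    value: str          # e.g. "Inland waters"
    probability: float  # certainty of the value

@dataclass
class CorpusTerm:
    value: str          # lemmatized value of the term
    labels: List[SemanticLabel] = field(default_factory=list)

@dataclass
class AdjectiveTerm(CorpusTerm):
    pass

@dataclass
class NounTerm(CorpusTerm):
    related_adjectives: List[AdjectiveTerm] = field(default_factory=list)
    related_nouns: List["NounTerm"] = field(default_factory=list)

@dataclass
class VerbTerm(CorpusTerm):
    subject: Optional[NounTerm] = None  # noun acting as subject of the verb
    obj: Optional[NounTerm] = None      # noun acting as object of the verb

@dataclass
class Concept:
    value: str
    kos: List[str] = field(default_factory=list)  # KOS name(s); in practice also IDs
    is_top: bool = False
    synonyms: List["Concept"] = field(default_factory=list)
    hypernyms: List["Concept"] = field(default_factory=list)
    hyponyms: List["Concept"] = field(default_factory=list)

@dataclass
class Equivalence:              # association between a NounTerm and a Concept
    term: NounTerm
    concept: Concept

@dataclass
class RestrictedDomainModel:
    terms: List[CorpusTerm] = field(default_factory=list)
    concepts: List[Concept] = field(default_factory=list)
    equivalences: List[Equivalence] = field(default_factory=list)
```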
4.2 Obtaining a Restricted-Domain Model from the Corpus
As our approach for creating EAT taxonomies is based on the most relevant terms appearing in the corpus, the first step consists of obtaining a restricted-domain model (according to our aforementioned metamodel) that contains all the available information about these terms, previously extracted during the collection processing (see transformation T1 in Fig. 1). Transformation T1 has been implemented to obtain the most relevant terms from the corpus and to define their corresponding elements in the restricted-domain model. This transformation selects these terms based on two constraints: a lexical one (each term must be a noun, an adjective or a verb) and a statistical one (terms must have certain frequencies, e.g. the relative frequency fr_i = f_i / N, where f_i is the absolute frequency of term t_i in the corpus and N is the total number of terms in the corpus, or the tf-idf frequency [2]). It is worth noting that the threshold values for these frequencies may be modified depending on the specific domain. From each selected term, a CorpusTerm class is created (AdjectiveTerm, VerbTerm or NounTerm) with its corresponding lexical, syntactic and semantic information obtained from the corpus processing in the QA task, including the different kinds of relationships between them. Fig. 4 shows how transformation T1 works, by taking question Q1 of our running example and a passage of a document related to that question.
Fig. 4. Example that shows how to obtain a restricted-domain model
Within this example, the text is lexically and syntactically labeled by using the MACO PoS tagger [1] and a partial syntactic parser called SUPAR [12], respectively. MACO labels starting with “NC: common noun” or “NP: proper noun” are added to the restricted-domain model as NounTerm, those starting with “VM: main verb” as VerbTerm, and those starting with “AQ: qualifying adjective” or “AO: ordinal adjective” as AdjectiveTerm. Also, the following syntactical information was extracted: the noun “fósforo” (phosphorus) is related to the adjective “orgánico” (organic), while the noun “digestibilidad” (digestibility) is the object of the verb “aumentar” (increase) and the noun “enzima” (enzyme) is its subject. These syntactical relationships are obtained by using the SUPAR labels “SNS: nominal simple syntagma” and “CCC: clauses of every sentence” in the following way: (i) a noun and an adjective within the same “SNS” are related by means of a Related Adjectives attribute; (ii) nouns within the same “SNS” are related by means of a Related Nouns attribute; and (iii) a noun and a verb within the same “CCC” are related by means of a Subject or Object attribute, according to the function of the noun with respect to the verb.
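A minimal sketch of the statistical filtering performed in T1, assuming the corpus has already been PoS-tagged and is given as a list of documents, each a list of (lemma, PoS) pairs; the corpus-level tf-idf variant and the default thresholds are simplifications for the example, not the exact implementation.

```python
import math
from collections import Counter

def select_relevant_terms(documents, allowed_pos=("NOUN", "ADJ", "VERB"),
                          min_rel_freq=1e-4, min_tf_idf=0.01):
    """Keep lemmas that satisfy the lexical and statistical constraints of T1."""
    all_terms = [t for doc in documents for t in doc]
    n_terms = len(all_terms)                          # N, total number of terms
    freq = Counter(lemma for lemma, pos in all_terms if pos in allowed_pos)
    doc_freq = Counter()                              # number of documents per lemma
    for doc in documents:
        for lemma in {l for l, p in doc if p in allowed_pos}:
            doc_freq[lemma] += 1
    selected = []
    for lemma, f in freq.items():
        rel_freq = f / n_terms                        # fr_i = f_i / N
        tf_idf = rel_freq * math.log(len(documents) / doc_freq[lemma])
        if rel_freq > min_rel_freq and tf_idf > min_tf_idf:
            selected.append(lemma)
    return selected

# Example with a toy two-document corpus:
docs = [[("enzima", "NOUN"), ("aumentar", "VERB"), ("digestibilidad", "NOUN")],
        [("fitasa", "NOUN"), ("digestibilidad", "NOUN"), ("orgánico", "ADJ")]]
print(select_relevant_terms(docs, min_rel_freq=0.0, min_tf_idf=0.0))
```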
4.3 Enriching the Restricted-Domain Model
The second step of our approach consists of adding semantic knowledge to the already defined elements of the restricted-domain model by means of concepts and relationships from different kinds of KOS, in order to create an enriched restricted-domain model. This enrichment step is done in the T2 transformation (see Fig. 1), which is able to manage heterogeneous KOS (from a simple taxonomy to a complex ontology). The reason is that our metamodel is sound enough for specifying in a model, in an integrated manner, those parts of the KOS that will be useful for further defining an EAT taxonomy for the restricted domain, thus abstracting away unnecessary details. Importantly, the aim of the T2 transformation is to associate each previously detected corpus term with some concept from a domain KOS. First, simple words of NounTerm elements are searched for, and then multi-words, by using their Related Adjectives and Related Nouns attributes. An Equivalence class is created for associating each new Concept class (including its corresponding KOS classes) with some existing NounTerm classes. The following step is to search for synonyms, hyponyms and hypernyms of the new Concept class in the domain KOS until a top concept is reached. Then, every top concept from the domain KOS is checked to see whether it can be associated with some concept from a generic KOS (some disambiguation algorithm can be used at this stage), and if this association does not exist then the hyponyms (and their synonyms) of this top concept are checked. For each concept belonging to a generic KOS, its hypernyms (and their synonyms) are added to the restricted-domain model until a top concept is found. The rationale behind the T2 transformation is to associate corpus terms with concepts from the domain KOS and not with concepts from the generic KOS, since (i) restricted-domain terms are more likely to appear in the domain KOS, and (ii) polysemy is avoided because restricted-domain terms appearing in the generic KOS have previously been disambiguated with the domain KOS. Furthermore, if the generic KOS were directly used for creating an EAT taxonomy, it would be overloaded with concepts barely used in the restricted domain. Moreover, we advocate creating the EAT taxonomy from the terms in the restricted-domain model (and not directly from the domain KOS), thus ensuring that it contains those semantic classes most closely related to the domain. For example, if the domain of the corpus is fisheries but only an agricultural KOS is available (which also includes concepts and relationships from fisheries), then it is assured
that the resulting enriched restricted-domain model only contains those concepts from the domain KOS related to fisheries, ignoring the rest of the agricultural terms. Therefore, an EAT taxonomy derived from this restricted-domain model will have an adequate size, structure and recall for the actual domain. An important detail of transformation T2 is that it provides the KOS-independence of our approach, since it includes the required information in the restricted-domain metamodel while managing the heterogeneity of the KOS. For example, hierarchical relationships are found in the broader term and narrower term relations if the KOS is a thesaurus, in hypernyms and hyponyms if it is a lexical database such as WordNet, in subclass-of and instance-of relations if it is an ontology, in functional dependencies if it is a relational database, etc. Following our running example, we choose to use the Agrovoc thesaurus as the agricultural domain KOS, and WordNet as the generic KOS. The following NounTerms in the restricted-domain model (see Fig. 4) are found in Agrovoc: “fitasa” (phytase), “digestibilidad” (digestibility) and “enzimas” (enzymes). They are specified as Concepts in the enriched restricted-domain model. From these concepts, transformation T2 uses Agrovoc and WordNet to obtain 249 new Concepts (185 from Agrovoc, 15 from WordNet, and 49 from both) and 9 levels according to their hypernym-hyponym hierarchical structure. For example, from the NounTerm phytase the Concept phytase is obtained, together with its hypernyms in Agrovoc until reaching the top concept agents (see Fig. 2). This Agrovoc top concept agents is intended to be mapped to the same concept in WordNet. However, it has five senses, which introduces a high degree of polysemy in our enriched-domain model. Therefore, a simple disambiguation strategy is used in which an Agrovoc concept is mapped to its WordNet counterpart only if it has one sense. Otherwise, some of the hyponyms of the Agrovoc concept are intended to be mapped to a WordNet concept that has only one sense. In our example, the Agrovoc concept that is successfully mapped is enzymes, thus obtaining the concepts from WordNet shown in Fig. 2 starting at proteins. The enriched restricted-domain model will be used to create a new EAT taxonomy for the restricted domain, as shown in the next step.
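A minimal sketch of the T2 enrichment step described above: each noun term is looked up in the domain KOS, its hypernym chain is followed up to a top concept, and that top concept (or, failing that, one of its hyponyms) is mapped to the generic KOS only when the mapping is unambiguous. The domain_kos, generic_kos and model interfaces are assumptions for the example, not an actual API.

```python
def enrich_model(noun_terms, domain_kos, generic_kos, model):
    for term in noun_terms:
        concept = domain_kos.lookup(term)            # simple word or multi-word
        if concept is None:
            continue
        model.add_equivalence(term, concept)
        # Climb the hypernym hierarchy of the domain KOS up to a top concept.
        chain = [concept]
        while not chain[-1].is_top:
            chain.append(domain_kos.hypernym(chain[-1]))
        model.add_concepts(chain)
        # Map to the generic KOS only when a concept has exactly one sense there;
        # otherwise try the hyponyms of the top concept (Agrovoc/WordNet example above).
        for candidate in [chain[-1]] + domain_kos.hyponyms(chain[-1]):
            senses = generic_kos.senses(candidate.value)
            if len(senses) == 1:
                model.add_concepts(generic_kos.hypernym_chain(senses[0]))
                break
    return model
```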
4.4 Obtaining EAT Taxonomies from the Restricted-Domain Model
An EAT taxonomy for the restricted domain can be obtained by applying certain criteria within the T3 transformation: if a loose criterion is chosen, like “include in the EAT taxonomy those concepts without hypernyms”, then a generic taxonomy is obtained; if a tighter criterion is defined, like “include in the EAT taxonomy those concepts that have a number of hyponyms greater than N”, then a more refined taxonomy is obtained. A tight criterion is more appropriate in restricted-domain QA because it is highly advisable that these kinds of taxonomies be refined in order to improve precision. By taking the first criterion, a generic EAT taxonomy is obtained for our running example with a unique level with 2 concepts, entity and agents. By using this taxonomy, question Q1 can be classified as entity (thus retrieving erroneous candidate answers such as any kind of profession, event, group, etc.). Also, Q1
can be left unclassified, since the enzyme class is not in the taxonomy. Conversely, if the second criterion is used, a refined EAT taxonomy is obtained with 9 levels and 13 concepts (shown by means of rectangles in Fig. 2). Therefore, question Q1 would be classified as enzymes and the search space is restricted to kinds of enzymes (such as hydrolases or its hyponyms), which would be accepted as a right answer. Finally, by using this EAT taxonomy, the answer to question Q1 will be “phytase”. Furthermore, this EAT taxonomy would be useful for more specific questions; e.g. question Q1 could be redefined as “What esterase increases the animals' digestibility of the organic phosphorus?”.
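A minimal sketch of the two T3 criteria: the loose criterion keeps the concepts without hypernyms, while the tight criterion keeps the concepts having more than N hyponyms (the experiments in Sect. 5 use N = 2). The Concept structure is the one assumed in the earlier metamodel sketch.

```python
def derive_eat_taxonomy(concepts, criterion="tight", n=2):
    if criterion == "loose":
        # Generic taxonomy: only concepts that have no hypernym.
        return [c for c in concepts if not c.hypernyms]
    # Refined taxonomy: only concepts with more than n hyponyms.
    return [c for c in concepts if len(c.hyponyms) > n]
```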
5 Experiments and Results
Our experiments compared the results obtained by a QA system when using an open-domain EAT taxonomy and a new EAT taxonomy developed for the agricultural domain. The corpus used in our experiments is composed of 2024 articles (about 30 MB as flat text files) from the RCCA journal (from 1966 to 2009). The first step in our experiments consists of processing this corpus with a POS tagger (MACO) and a syntactical parser (SUPAR), indexing it and computing frequencies for each term. As we consider the most relevant terms to be those having fr>25 and tf-idf>0.01, 8696 relevant terms were obtained and specified in a restricted-domain model by means of transformation T1. Then, the noun terms are used in transformation T2 for enriching the restricted-domain model by using Agrovoc and WordNet. Table 2 shows a summary of our results: the restricted-domain model has 9022 concepts, of which 3029 are multi-words. Most of the concepts (8530) come from the domain KOS (Agrovoc) and 3473 concepts come from the generic KOS (WordNet), with 2981 common concepts, which represent the enrichment of the restricted-domain model. Afterwards, the EAT taxonomy was obtained from the restricted-domain model by applying transformation T3 with the criterion of choosing those concepts with more than 2 hyponyms. Table 2 shows that the EAT taxonomy contains roughly 10% of the concepts of the restricted-domain model.
Table 2. Summary statistics of Restricted Domain (RD) Model and created EAT taxonomy (# semantic classes)

Levels   RD Model                                    EAT Taxonomy
         Agrovoc  WordNet  Multi-words  Total        Agrovoc  WordNet  Multi-words  Total
0        438      174      212          462          149      77       69           161
1        1382     479      812          1429         133      83       67           149
2        1565     551      568          1627         98       70       40           121
3        1144     486      334          1233         79       77       24           111
4        839      362      250          935          70       68       19           105
5        896      375      261          979          76       53       24           94
6        1002     433      255          1053         66       52       17           82
7        562      291      140          587          37       18       12           43
8        284      156      61           289          32       19       4            32
9        291      95       84           295          11       9        3            13
10       72       41       28           77           5        2        1            5
11       30       17       12           30           4        2        1            4
12       20       11       8            20           1        1        0            1
13       3        1        2            3            0        0        0            0
Total    8530     3473     3029         9022         761      531      281          921
Our first step was using AliQAn with 180 training questions (made by agricultural domain experts) over the RCCA corpus. AliQAn classified these questions with its own EAT taxonomy (explained in Sect. 2): 148 questions were incorrectly classified (82%). Errors in the question classification appeared due to the fact that AliQAn has an open-domain EAT taxonomy, which is poor for restricted domains. 64% of these errors occur because the EAT is unknown (95 questions) and 36% of the errors are due to an incorrect classification (53 questions are classified as object, a classification too generic to be useful for a restricted domain such as the agricultural one, thus leading to incorrect answers). Some sample questions incorrectly classified were previously explained in Sect. 2. Furthermore, 32 questions (18%) were correctly classified by using the AliQAn EAT taxonomy as person, place, numerical percentage, numerical quantity, temporal date, and abbreviation. Precision would thus increase for complex questions (such as What or Which ones) if more semantic information were used. Finally, the EAT taxonomy obtained by using our approach was added to AliQAn, and it was checked that 165 questions from the same collection were correctly classified (91.6%). The 15 misclassified questions were cause-effect questions, which AliQAn is not prepared to deal with. Therefore, the accuracy of the question classification process of AliQAn in our agricultural domain is increased by 73.6%. Questions that were previously generically classified as object were more precisely classified by using the new EAT taxonomy for the agricultural domain: instead of object, any of its hyponyms included in the refined EAT taxonomy can now be used (see Fig. 2).
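The reported percentages can be reproduced from the question counts (a simple check, assuming 180 questions in total):

```python
total = 180
incorrect_open = 148        # open-domain taxonomy: misclassified questions
correct_open = 32           # open-domain taxonomy: correctly classified questions
correct_restricted = 165    # restricted-domain taxonomy: correctly classified questions

print(round(100 * incorrect_open / total, 1))      # 82.2 -> reported as 82%
print(round(100 * correct_open / total, 1))        # 17.8 -> reported as 18%
print(round(100 * correct_restricted / total, 1))  # 91.7 -> reported as 91.6%
# The 73.6% improvement follows from the rounded figures: 91.6% - 18% = 73.6 points.
```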
6 Conclusions and Future Work
In this paper we have presented our model-driven approach for tackling the complex task of creating EAT taxonomies for restricted-domain QA systems by using the collection of documents in the corpus and heterogeneous KOS in a systematic, well-structured, and comprehensive manner. The basis of our work is twofold: first, a unified metamodel is developed in order to define, in an integrated restricted-domain model, the most relevant terms in the collection of documents and useful concepts from different kinds of KOS. Secondly, several transformations have been defined from a model-driven perspective, in such a way that the final EAT taxonomy is derived with a high degree of automation. Our future work consists of evaluating the effectiveness of our approach with a more complete set of experiments in other domains and with other QA systems. Acknowledgments. This research has been partially funded by the Valencia Government under the project PROMETEO (Development of Intelligent and Interactive Techniques of Text Mining), number [PROMETEO/2009/119].
References
1. Acebo, S., Ageno, A., Climent, S., Farreres, J., Padró, L., Ribas, F., Rodríguez, H., Soler, O.: Maco: Morphological analyzer corpus-oriented. Technical report, Dept. LSI - Universitat Politècnica de Catalunya (1994)
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. ACM Press, New York (1999)
3. Bézivin, J.: On the unification power of models. Software and System Modeling 4(2), 171–188 (2005)
4. Ely, J.W., Osheroff, J.A., Gorman, P.N., Ebell, M.H., Chambliss, M.L., Pifer, E.A., Stavri, P.Z.: A taxonomy of generic clinical questions: classification study. BMJ 321(7258), 429–432 (2000)
5. Ferrés, D., Rodríguez, H.: Experiments adapting an open-domain question answering system to the geographical domain using scope-based resources. In: Proceedings of the Multilingual Question Answering Workshop of the EACL 2006, pp. 69–76 (April 2006)
6. Hodge, G.: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. The Digital Library Federation Council on Library and Information Resources (2000)
7. Hovy, E., Hermjakob, U., Ravichandran, D.: A question/answer typology with surface text patterns. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 247–251. Morgan Kaufmann Publishers Inc., San Francisco (2002)
8. Li, X., Roth, D.: Learning question classifiers: the role of semantic information. Nat. Lang. Eng. 12(3), 229–249 (2006)
9. Metzler, D., Croft, W.B.: Analysis of statistical question classification for fact-based questions. Inf. Retr. 8(3), 481–504 (2005)
10. Moldovan, D.I., Pasca, M., Harabagiu, S.M., Surdeanu, M.: Performance issues and error analysis in an open-domain question answering system. ACM Trans. Inf. Syst. 21(2), 133–154 (2003)
11. Mollá, D., Vicedo, J.L.: Question answering in restricted domains: An overview. Computational Linguistics 33(1), 41–61 (2007)
12. Ferrández, A., Palomar, M., Moreno, L.: An empirical approach to Spanish anaphora resolution. Machine Translation 14(3-4), 191–216 (1999)
13. Roger, S., Vila, K., Ferrández, A., Pardiño, M., Gómez, J.M., Puchol-Blasco, M., Peral, J.: Using aliQAn in monolingual QA@CLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 333–336. Springer, Heidelberg (2009)
14. Sang, E.T.K., Bouma, G., de Rijke, M.: Developing offline strategies for answering medical questions. In: Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, pp. 41–45 (2005)
15. Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy. In: Gonzáles Rodríguez, M., Paz Suárez Araujo, C. (eds.) Proceedings of 3rd International Conference on Language Resources and Evaluation (LREC 2002), Canary Islands, Spain, May 2002, pp. 1818–1824 (2002)
16. Tomás, D., Vicedo, J.L.: Multiple-taxonomy question classification for category search on faceted information. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 653–660. Springer, Heidelberg (2007)
Using a Semantic Wiki for Documentation Management in Very Small Projects

Vincent Ribaud and Philippe Saliou

LISyC, Université de Brest, UEB, CS 93837, 29238 Brest Cedex, France
{Vincent.Ribaud,Philippe.Saliou}@univ-brest.fr
Abstract. The emerging ISO/IEC 29110 standard Lifecycle profiles for Very Small Entities is targeted at very small entities (VSEs) having up to 25 people, to assist them in unlocking the potential benefits of using software engineering standards. VSEs may use semantic web technologies to improve their documentation management infrastructure and processes. We propose using a semantic wiki for documentation management, based on an identification scheme inspired by an IFLA proposition called Functional Requirements for Bibliographic Records. The document identification scheme allows documents to be managed by the internal resource management of the semantic wiki, hence benefiting from straightforward but powerful version control. With little input of semantic annotations by VSE employees (through usable semantic forms and templates), the semantic wiki acts as a library catalog, and users can find, identify, select, obtain, and navigate resources. Keywords: very small entities, Functional Requirements for Bibliographic Records, ISO/IEC 29110, semantic wikis.
1 Introduction

The term 'Very Small Entity' (VSE) was defined by the emerging ISO/IEC 29110 standard “Lifecycle profiles for Very Small Entities” [1] as being “an entity (enterprise, organization, department or project) having up to 25 people”. VSEs can find it difficult to relate software engineering standards to their business needs and to justify the application of the standards to their business practices. Disciplined documentation management may be seen as the first step towards standardization, and will at least provide significant help for the achievement of projects. Documentation management is a piece of the puzzle of Knowledge Management (KM). In [2], Chan and Chao present a research survey conducted among 68 small and medium-sized enterprises (SMEs) which have implemented KM initiatives. They conclude that effective KM is influenced by two types of KM capability, infrastructure and process, which have to be deployed. Publishing and content management systems (CMS) generally provide the documentation management infrastructure, associated with a documentation workflow process. Instead of these feature-rich systems, we move to an “as simple as possible” system using a semantic wiki as a base technology and a straightforward identification scheme to support the acquisition, organization, maintenance, retrieval, and sharing of documentation.
Section 2 presents the positioning of our work. Section 3 outlines some challenges of documentation management; its core (§3.4) is an identification scheme inspired by bibliographic records standards. We conclude with perspectives.
2 Work Positioning

2.1 Infrastructure

Publishing and content management systems (CMS) are generally used as the basis for a documentation management infrastructure. But several authors have criticized the rigidity of the editorial control required by a CMS [3] and the need to balance structure/constraint and flexibility [4]. Some are promoting the use of wikis and RDF (Resource Description Framework) to resolve these issues [5]. As [6] pointed out, a first step towards building the Semantic Web is to have the infrastructure needed to handle and associate metadata with content. However, most authoring environments have a major drawback: in order to provide metadata about the content of a document or a Web page, the author must first create the content and then annotate it in an additional annotation step. As a way out of this problem, [7] propose that an author needs the possibility to easily combine the authoring of a Web page and the creation of relational metadata describing its content. Our proposition uses a [semantic] wiki to handle Web pages and documents. In a wiki, it is exceptionally easy for anybody to edit Web pages following a small number of conventions. Semantic wikis let users add semantic information to the pages. We use Semantic MediaWiki (SMW, http://semantic-mediawiki.org), a free semantic extension of the free software MediaWiki (http://www.mediawiki.org). SMW lets users edit metadata in a straightforward manner similar to the editing of page content; hence it fulfills the requirement stated above. The problem differs with documents, usually produced in a word processor. Once the document is stored in a wiki, access to the document is performed through a wiki page with the same name. This page handles metadata managed by the wiki (e.g. versioning information) and lets users upload the resource. In a semantic wiki, this page also handles metadata provided by users and acts as the resource description. Editing the resource description (document metadata) is unfortunately separated from editing the resource.

2.2 Processes

Rech et al. [8] identified several challenges related to knowledge transfer and management for small and medium-sized enterprises in the software sector: recording, reusing, locating and sharing information. The same observations apply to documentation management. However, before addressing these challenges, we have to consider a supplementary feature: documents may be referred to by a name, vague (e.g. user's manual) or precise (e.g. French translation of the software requirements for version 2.6 of a given software product). Referring to a document (or a set of documents) by name requires identifying documents and relationships among documents.
Identifying documents and their relationships. We propose to use a documentation identification scheme inspired by the FRBR, Functional Requirements for Bibliographic Records [9]. Let us describe the FRBR proposition in a nutshell. FRBR gathers information about a Work, a distinct intellectual or artistic creation. We recognize the work through individual realizations of the work, but the work itself (e.g. the Hebrew Bible) is a set of concepts regarded as commonly shared by a number of individual sets of signs (e.g. the Hebrew Bible in Biblical Hebrew or its Latin form, Biblia Hebraica) called Expressions [9]. An expression is the specific intellectual or artistic form that a work takes each time it is “realized.” An expression excludes aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such [9]. Work and Expression are abstract entities; when a work is realized, the resulting expression of the work may be physically embodied on or in a medium such as paper, tape, canvas, etc. That physical embodiment constitutes a Manifestation. In some cases there may be only a single physical exemplar produced of that manifestation of the work (e.g. an 11th century manuscript of the Hebrew Bible with Aramaic Targum). In other cases there are multiple copies produced in order to facilitate public dissemination or distribution (e.g. the Biblia Hebraica Quinta published by the Deutsche Bibelgesellschaft). Whether the scope of production is broad or limited, the set of copies produced in each case constitutes a manifestation [9]. A specific copy - a single exemplar of a manifestation - constitutes an Item. In terms of intellectual content and physical form, an item exemplifying a manifestation is normally the same as the manifestation itself. The FRBR proposition allows us to establish distinctions and precise relationships between the various intellectual creations - artifacts - handled during a software project. Various terms are used by creators and publishers of intellectual and artistic entities to signal relationships between those entities. Terms such as "edition" and "version" are frequently encountered on publications and other materials, as are statements such as “based on ...” or “translated from ...” [9]. The FRBR specifically analyzed relationships that operate between one work and another, between a work and an expression, between one expression and another, between a manifestation and an item, etc. We will detail in Section 3.4 a documentation management scheme that may be suitable for some very small projects. Other projects may have different requirements and interpretations. For instance, we consider that different translations of the same document are different manifestations of the same expression. Hence, different translations are supposed to share exactly the same set of concepts. That may not be true for VSEs localizing a software product and its associated documentation for a worldwide market. In that case, each translation is itself an expression, and defining different expressions gives us a means of reflecting the distinctions in intellectual or artistic content that may exist between one translation and another of the same work (the software product and its documentation).

Reusing. Several authors have pointed out that ontologies help to support reuse [8], [10].
By construction, using the ISO/IEC 29110 standard in a VSE introduces its underlying ontology: process, activity, task, role, products, etc. In the case of documentation, the
ISO/IEC standard - Basic Profile ([11], Clause 4.5) defines an alphabetical list of the input, output and internal process products, their descriptions, possible states and the source of each product. For instance, the main process related to software development proceeds with:
• Input products: Project Plan;
• Output products: Requirements Specification, Software Design, Traceability Record, Software Components, Software, Test Cases and Test Procedures, Test Report, Product Operation Guide, Software User Documentation, Maintenance Documentation, Change Request;
• Internal products: Validation Results, Verification Results.
We use the Basic Profile list of 22 work products as a basis to categorize documentation (with the possibility for the VSE to add document types related to its business and organization).

Recording. Rech et al. [8] observed that information and knowledge about software projects and products exist during the runtime of a project but get lost soon after its end. They propose that adequate documentation should be supported either automatically or semi-automatically by using a single point of consistent knowledge to simplify storage and retrieval. A VSE needs a simple model to locate, store, and retrieve work products. Our proposal is to replace the hierarchical physical organization (as it may be found in a file system or a CMS) with a logical organization based on the identification scheme presented in Section 3.4. Documents are naturally categorized into the class associated with their type. A generic template for documents, as well as a template for each document type, provides users with a way of specifying resource descriptions without learning any new syntax and ensures that properties and classes are used consistently.

Locating. The difficulty of locating - retrieving - resources is related to the crucial problem of interaction between resource providers and users. Ramadour and Cauvet [10] believe that this interaction can be supported and even automated by increasing the expressiveness of the language used for encoding component properties and formulating queries, therefore enhancing the quality of the retrieval. SMW includes an easy-to-use query language which enables users to write simple or complex queries. The syntax of this query language is similar to the syntax of resource descriptions (typed by the document creator or used in templates). SMW provides users with category browsing and with a kind of hierarchical faceted navigation through semantic properties.

Sharing. Uren et al. [12] state that a document-centric process must handle three classes of data: ontologies, documents and annotations. Documentation sharing is accomplished in an easy manner through the use of the wiki resource management system - which includes version control. Sharing ontologies in a semantic wiki is also simple, because any update to the ontology is immediately available to users; the remaining question is how to update, automatically or semi-automatically, other wikis using the same ontology. The difficult point is related to annotation sharing: after several unsuccessful attempts, we abandoned providing scope control on annotations, and each published annotation is public.
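For illustration only, the sketch below builds the wikitext of a file description page carrying a few semantic annotations, using SMW's standard [[Property::value]] syntax; the property names (Has project, Has document type, Uploaded by) and the file name are illustrative assumptions, not prescribed by this paper or by the templates it mentions.

```python
def description_page(file_name, project, doc_type, uploader):
    """Return the wikitext of a simple SMW description page for an uploaded file."""
    return "\n".join([
        f"This page describes [[File:{file_name}]].",
        f"[[Category:{doc_type}]]",
        f"[[Has project::{project}]]",
        f"[[Has document type::{doc_type}]]",
        f"[[Uploaded by::{uploader}]]",
    ])

print(description_page("requirements-specification-2.6-fr.pdf", "XYZ",
                       "Requirements Specification", "XYZ project manager"))
```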
3 Engineering Activities and Documentation Management

3.1 Software Engineering Standards

A concise definition of the objects of software engineering may be found in [13]: “a project uses resources in performing processes to produce products for a customer.” It gives the model of Figure 1, centered on the software engineering project as the focal point for applying software engineering standards. This suggests a categorization of standards in four major areas: customer, process, product, and resource.
Fig. 1. The objects of software engineering, suggesting a categorization of standards in the subject areas of customer, process, product, and resource [13]
For VSEs, each category contains a number of standards that put them out of reach. There is a need for an umbrella standard within each category. The ISO/IEC 12207 standard, Software Life Cycle Processes [14], provides this umbrella for all of the customer and process standards. The on-going ISO/IEC 29110 standard [1] should provide such an umbrella for the customer, process, and product areas.

3.2 ISO/IEC 29110

The emerging ISO/IEC standard “Software Engineering - Lifecycle Profiles for Very Small Enterprises (VSE)” - Basic Profile [11] contains two processes: Project Management (PM) and Software Implementation (SI). PM is subdivided into 4 activities (Project Planning, Project Plan Execution, Project Assessment and Control, Project Closure) and SI is subdivided into 6 activities (Software Implementation Initiation, Software Requirements Analysis, Software Architecture and Detailed Design, Software Construction, Software Integration and Tests, Product Delivery). The Basic Profile ([11], Clauses 4.2.8 and 4.3.8) proposes a task decomposition of the PM and SI processes for each activity, together with the inputs and outputs of each task. We can thus establish the workflow for each of the 22 work products (cf. §2.2). For instance, Figure 2 presents the workflow of Work Product 11, Requirements Specification.
Fig. 2. WP11 Requirements Specification workflow
3.3 Functional Requirements for Bibliographic Records

The Functional Requirements for Bibliographic Records (FRBR) is a conceptual model of the bibliographic universe, describing the entities in that universe, their attributes, and relationships among the entities [15]. The entities have been divided into three groups. The first group comprises the products of intellectual or artistic endeavor that are named or described in bibliographic records: work, expression, manifestation, and item [9]. The second group comprises those entities responsible for the intellectual or artistic content, the physical production and dissemination, or the custodianship of such products: person and corporate body [9]. The third group comprises an additional set of entities that serve as the subjects of intellectual or artistic endeavor: concept, object, event, and place [9].
Fig. 3. FRBR entities and relationships
The relationships depicted in Figure 3 (a synthesis of Figures 3.1, 3.2 and 3.3 of [9]) indicate that a work may be realized through one or more than one expression. An expression, on the other hand, is the realization of one and only one work (there is a one-to-many relation linking work to expression). An expression may be embodied in one or more than one manifestation; likewise a manifestation may embody one or more than one expression (a many-to-many relation linking expression to manifestation). A manifestation, in turn, may be exemplified by one or more than one item; but an item may exemplify one and only one manifestation (a one-to-many relation linking manifestation to item). Let us consider an example. Starting with the work of the Software Requirements Specification of a project called XYZ: it was created by the XYZ project manager and
expressed in several ways, including a current version, previous releases, a summary, etc. Once the expressions are recorded in some physical form - perhaps using a tool - we have different manifestations, such as the original text - as a part of the tool data - and two electronic editions, one in PDF format and the other in HTML. We may also have several translations of any expression, e.g. English and Spanish translations of the French original text, and a Spanish translation of a previous release. Those manifestations are related to the expression they are based on. At the item level, we would see the specific copies held in various places. An item would have attributes like its call number, the location where it is stored and any item-specific notes; for example, a customer-signed copy of the printed text of the French edition would be linked to the paper manifestation. As represented in Figure 3, there are relationships between the Group 2 entities and the Group 1 entities: a work is created by a person or corporate body; an expression is realized by a person or corporate body; a manifestation is produced by a person or corporate body; an item is owned by a person or corporate body.

3.4 Documentation Identification Scheme

An annotated system is a system which “knows about” its own content so that automated tools can process annotations to improve the use of the system. For example, semantic annotations can describe documents' authors and their relationships, as well as including traditional metadata, such as the document subject and date of publication. Document (and more generally resource) identification is one of the main issues of an electronic management system; and especially if it is Web-based, Uniform Resource Identifiers (URIs) may be used as identifiers. Identifying documents is not enough, because we have to keep the relationships between the different kinds of documents, for instance that three different documents are different translations of the current Software Requirements Specification. A Semantic Web-based management system provides a way to define semantic links between resources, e.g. that the XYZ project manager created the Software Requirements Specification or that this resource is a translation of another one. Such assumptions are called semantic annotations and may be expressed using the Resource Description Framework (RDF) language. A semantic annotation in RDF is a triple (subject, property, object). The subject is a URI; the object value can be a literal or a URI; the property is the meaning that is given to the link. Time is vital for a VSE, and employees will not take the time to record semantic annotations linking resources unless it can be done straightforwardly. An identification scheme may provide the VSE with document identification together with the implementation of the Figure 3 relationships. Our main concern is to relate the different resources (abstract or physical) together.

Work identification. As noted in §2.2, the VSE has an enhanced list of document types (based initially on the 22 types provided with ISO/IEC 29110). A trigram may be assigned to each type. Each project is identified with an acronym. In order to generate unique identifiers, we can use the scheme year-number within the year; e.g. 2009-3 identifies the third document created by the VSE during the year 2009. Combining all these features together provides us with the work identification scheme. Each work is identified with a unique string based on the pattern:
Project - Document Type - Year - Number within a Year

For instance, the XYZ Software Requirements Specification is identified with XYZ-SRS-2009-3.

Expression identification. The work identifier serves as a root for all expressions that realize the work (the main case being the different versions of this work). Versions - expressions - are abstract, just as works are. You may reference a version that does not (and will never) have a physical existence. It is only the embodiment (the recording) of an expression (a version in our case) in or on some carrier that moves from the abstract “work/expression” to a physical entity [15]. In the documentation managed by a VSE, there are two main types of expression, version and translation, which combine together (you may have Spanish and French translations of a version n, as well as several versions of an on-going English translation of an important document). Deciding whether versions or translations are expressions of a work or manifestations of an expression depends on the VSE's business processes. Our proposition is to consider versions as expressions, and translations as manifestations. Hence, a version may be embodied in several languages, including the original language of the first manifestation. According to Figure 3, a work is realized through expression(s). Implicitly, the expressions of the same work have a sibling relationship to each other [15]. Expressions are linked to the work they “realize” or express. The expression identification scheme relies on the work identification scheme. Each expression is identified with a unique string based on the pattern:

Project - Document Type - Year - Number within a Year - Version Number

For instance, version A of the XYZ Software Requirements Specification is identified with XYZ-SRS-2009-3-A. It may be immediately deduced that this expression - version A - realizes the work XYZ-SRS-2009-3.

Manifestation identification. A manifestation is the physical embodiment of an expression of a work. In order to record something, you have to put it on or in some container or carrier. As represented in Figure 3, a manifestation may be the physical embodiment of several expressions, typically when several documents are recorded on the same media. As we are concerned with documentation management and not with configuration management, we would mostly not use this many-to-many relationship between expressions and manifestations. Usually, an expression (such as a version of a work) is embodied in several manifestations (such as different file formats or translations) and a manifestation, on the other hand, is the embodiment of one and only one expression. Implicitly, the manifestations of the same expression have a sibling relationship to each other - they may have equivalent content [15]. The expression identifier serves as a root for all manifestations that embody the expression (the main cases being the different translations of this expression, as well as the different file formats that can be used for this embodiment). The manifestation identification scheme relies on the expression identification scheme. Each manifestation is identified with a unique string based on the pattern:

Project - Document Type - Year - Number within a Year - Version Number - Language . Extension

For instance, the pdf file containing the English version A of the XYZ Software Requirements Specification is identified with XYZ-SRS-2009-3-A-eng.pdf. It may be immediately deduced that this manifestation - the English translation in a pdf file - embodies the expression XYZ-SRS-2009-3-A. Short codes for language names may use a standard like ISO 639-3 (e.g. eng for English, deu for German). File name extensions (e.g. pdf for the Adobe file format) denote a particular way that information is encoded for storage in a computer file.
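For illustration only, the identifiers described above can be generated and parsed mechanically; the small sketch below (an assumption, not part of the paper) follows the stated patterns, so the work and the expression that a file embodies can be deduced from its name.

```python
def work_id(project, doc_type, year, number):
    return f"{project}-{doc_type}-{year}-{number}"         # e.g. XYZ-SRS-2009-3

def expression_id(work, version):
    return f"{work}-{version}"                             # e.g. XYZ-SRS-2009-3-A

def manifestation_id(expression, language, extension):
    return f"{expression}-{language}.{extension}"          # e.g. XYZ-SRS-2009-3-A-eng.pdf

def parse_manifestation(identifier):
    """Recover the expression and the work that an uploaded file embodies/realizes."""
    name, extension = identifier.rsplit(".", 1)
    project, doc_type, year, number, version, language = name.split("-")
    return {
        "work": "-".join([project, doc_type, year, number]),
        "expression": "-".join([project, doc_type, year, number, version]),
        "language": language,
        "extension": extension,
    }

print(parse_manifestation("XYZ-SRS-2009-3-A-eng.pdf"))
```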
Item identification. An item is a single exemplar of a manifestation - an individual copy. All copies that are linked to the same manifestation have a sibling relationship to each other [15]. Item management is related to configuration management, and is out of the scope of documentation management (and of this paper).

3.5 File and Meta-Data Management

Manifestations (files) can be uploaded into Semantic MediaWiki using the “Upload file” feature. Once a file is uploaded, other pages can include or link to the file. Uploaded files are given the “File:” prefix by the system, and this allows the files to be used in articles instantly. Every editable page in the wiki has an associated page history (sometimes called revision history or edit history). The page history contains a list of the page's previous revisions, including the date and time (in UTC) of each edit, the username or IP address of the user who made it, and their edit summary. Each uploaded file has a file description page. The purpose of these pages is to provide information about the file, such as who uploaded the file, any modifications that may have been made, an extended description of the file's subject or context, where the file is used, and license or copyright information. All this information is metadata, and the description page is a metadata record. Technically, there are two main solutions to manage metadata records: either build an independent system or add an extension to the resource management system itself. Our proposition belongs to the second type, since we are using a semantic wiki to manage metadata and the internal resource management system of the wiki to manage resources (documents). The file management system of a wiki is very straightforward but powerful, as long as we are able to identify documents. The identification scheme provides identification together with relationship management.

3.6 Information Resource and Non-information Resource

One of the first steps towards the Semantic Web has been to use URIs to identify “anything”, and not only resources that can be located and accessed on the Web (some authors call the resources that cannot be so accessed “non-information resources” [16]). One problem is to know what a URI is identifying: the resource itself or the description (metadata record) of the resource. In a (semantic) wiki, access to an information resource is provided through its description page (mentioned above), identified with the same name as the resource itself. We extend these principles to “non-information resources”. A page named with the name of a “non-information resource” is a description of the “non-information resource”, not the “non-information resource” itself. As long as we use this URI internally to the wiki, it works fine. But if we wish to generate and export correct triples providing assumptions about the real “non-information resource”, we have to dispose of the right URI of this “non-information resource”.
FRBR sorts entities into three groups. Group 1 entities represent the different aspects of user interests in the products of intellectual or artistic endeavor: work, expression, manifestation, and item [9]; all are information resources. Group 2 entities include person (an individual) and corporate body (an organization or group of individuals and/or organizations) [9]; all are non-information resources. Group 3 entities include concept (an abstract notion or idea), object (a material thing), event (an action or occurrence), and place (a location) [9]; all are examples of non-information resources.

3.7 Information Resource Catalogue

In section 4 of [17], IFLA states that an Online Public Access Catalogue should be an effective and efficient instrument that enables a user:
4.1. to find bibliographic resources in a collection as the result of a search using attributes or relationships of the resources
4.2. to identify a bibliographic resource or agent
4.3. to select a bibliographic resource that is appropriate to the user's needs
4.4. to acquire or obtain access to an item described; or to access, acquire, or obtain authority data or bibliographic data
4.5. to navigate within a catalogue and beyond
The straightforward documentation system proposed in this paper has to provide, from its users' point of view, a resource catalogue supporting the tasks in the list above. So what are these user tasks? Briefly, they are find, identify, select, obtain, and navigate. 'Find' involves meeting a user's search criteria through an attribute or a relationship of an entity. Semantic MediaWiki includes an easy-to-use query language which enables users to access the wiki's knowledge; the syntax of this query language is similar to the syntax of annotations in Semantic MediaWiki. 'Identify' enables a user to confirm that they have found what they were looking for, distinguishing among similar resources. The identification scheme, based on the FRBR Group 1 entities [9] and presented in Section 3.4, is used to confirm that the described entity corresponds to the entity sought, or to distinguish between two or more entities with similar characteristics. 'Select' involves meeting a user's requirements with respect to content, physical format, etc., or rejecting an entity that does not meet the user's needs. 'Select' is strongly related to search capabilities. The first kind of search exploits properties used as annotations; search criteria can be combined through Boolean operators. Searches can also use taxonomies, based on the categories of the wiki. 'Obtain' enables a user to acquire an entity through electronic remote access. The internal file management system gives immediate access to the current version of the resource, as well as to its history. The history page lets users see all past changes to the page in question, view a specific version, compare two specific versions, etc. FRBR recognizes the importance of being able to 'navigate'. Semantic MediaWiki provides a simple browsing interface that displays all semantic properties of a page, as well as all semantic links that point to that page. By clicking on these links, the user can browse to another article. Faceted classification provides a way to design hierarchies which are simpler and more lightweight. The extension Semantic Drilldown (http://www.mediawiki.org/wiki/Extension:Semantic_Drilldown) provides users with hierarchical faceted navigation of categories through semantic properties.
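As an illustration of the 'find' task (the query below is our own sketch; the property and category names are assumptions, not taken from the paper), a Semantic MediaWiki inline query for all expressions of a given work could look roughly as follows:

```
{{#ask: [[Category:Expression]] [[Realizes work::XYZ-SRS-2009-3]]
 |?Has version number
 |?Has status
 |sort=Has version number
 |format=table
}}
```

Such a query would return a table of the versions (expressions) registered for the work XYZ-SRS-2009-3, together with the annotated properties listed after the |? marks.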
The tasks listed above require that everybody records resources and metadata records. The latter information should be accurate in order to support documentation management. In a small project, this verification task will be devoted to the Project Manager. As mentioned in Section 2.2, the workflow of each product is defined in the ISO/IEC 29110 standard. Our proposal is to review and update the metadata associated with a work product when performing the last activity that outputs the final version of this work product. Following the example presented in Figure 2, WP11 Requirements Specification metadata will be reviewed and updated during 'SI.2 SW Requirements Analysis', while, for instance, WP17 Software User Documentation metadata will be updated during 'SI.5 SW Integration and Tests'. Semantic templates are a straightforward tool that Semantic MediaWiki offers to record annotations (metadata). Users specify annotations without learning any new syntax; annotations are used consistently, i.e. users do not have to look for the right properties or categories when editing a page; and templates provide data structure, by defining which values belong in which pages. Semantic Forms allows for the creation of forms, built on semantic templates, that provide convenient display and input.
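For illustration only (the template and property names below are invented for this sketch, not prescribed by the paper), a semantic template recording document metadata could be written along these lines and then filled in through a semantic form:

```
<!-- Template:Document metadata (illustrative sketch) -->
* Project: [[Belongs to project::{{{project|}}}]]
* Document type: [[Has document type::{{{doctype|}}}]]
* Version: [[Has version number::{{{version|}}}]]
* Language: [[Has language::{{{language|}}}]]
[[Category:Manifestation]]
```

A file description page would then invoke it as, for example, {{Document metadata|project=XYZ|doctype=SRS|version=A|language=eng}}, so that the annotations are recorded consistently without the user typing any property syntax.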
4 Conclusion

We proposed to use a semantic wiki for documentation management in Very Small Enterprises (VSEs). The FRBR proposition [9] includes a description of the conceptual model (the entities, relationships, and attributes), a four-level classification for all types of resources, and user tasks associated with the bibliographic resources described in catalogs. We used the four-level classification to define a document identification scheme that allows documents to be managed by the internal resource management of the semantic wiki, hence benefiting from a straightforward but powerful version control. With a few semantic annotations input by VSE employees - through usable semantic forms and templates - the semantic wiki acts as a library catalog, and users can find, identify, select, obtain, and navigate resources. The next step is to implement this proposition in a pilot project. Two VSEs may be interested and would provide us with feedback on our proposition.
References 1. International Organization for Standardization (ISO): ISO/IEC DTR 29110-1 Software Engineering - Lifecycle Profiles for Very Small Entities (VSEs) – Part 1: Overview. ISO, Geneva (2010) 2. Chan, I., Chao, C.: Knowledge management in small and medium-sized enterprises. Communications of the ACM 51(4), 83–88 (2008) 3. García Alonso, J.M., Berrocal Olmeda, J.J., Murillo Rodríguez, J.M.: Documentation Center - Simplifying the Documentation of Software Projects. In: Wiki4SE Workshop - 4th International Symposium on Wikis (2008), http://www.wikis4se.org/doku.php/ 4. Maxwell, J.W.: Using Wiki as a Multi-Mode Publishing Platform. In: 25th Annual ACM International Conference on Design of Communication, pp. 196–200. ACM, New York (2001)
5. Rauschmayer, A.: Next-Generation Wikis: What Users Expect; How RDF Helps. In: Third Semantic Wiki Workshop, ESWC, Redaktion Sun SITE, Aachen, poster (2009) 6. Kahan, J., et al.: Annotea: an open RDF infrastructure for shared Web annotations. In: 10th Int. Conference on World Wide Web (WWW 2001), pp. 623–632. ACM, New York (2001) 7. Handschuh, S., Staab, S.: Authoring and annotation of web pages in CREAM. In: 11th Int. Conference on World Wide Web (WWW 2002), pp. 462–473. ACM, New York (2002) 8. Rech, J., Bogner, C., Haas, V.: Using Wikis to Tackle Reuse in Software Projects. IEEE Software 24(6), 99–104 (2007) 9. IFLA: Functional Requirements for Bibliographic Records (2009), http://www.ifla.org/VII/s13/frbr/ 10. Ramadour, P., Cauvet, C.: An Ontology-based Support for Asset Design and Reuse. In: ENC 2008 Mexican International Conference on Computer Science, Mexico, pp. 20–32. IEEE, Los Alamitos (2008) 11. International Organization for Standardization (ISO): ISO/IEC DTR 29110-5-1-2 Software Engineering - Lifecycle Profiles for Very Small Entities (VSEs) – Part 5: Management and Engineering Guide-Basic VSE Profile. ISO, Geneva (2010) 12. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Web Semantics: Science. Services and Agents on the World Wide Web 4(1), 14–28 (2006) 13. Moore, J.W.: An integrated collection of software engineering standards. IEEE Software 16(6), 51–57 (1999) 14. International Organization for Standardization (ISO): ISO/IEC 12207:2008 Information technology – Software life cycle processes. ISO, Geneva (2008) 15. Tillett, B.: The FRBR Model. FRBR Seminar Australian Committee on Cataloguing, Sydney (2004) 16. Booth, D.: URIs and the Myth of Identity. In: Workshop on Identity, Reference, and the Web, WWW 2006, Edinburgh, Scotland (2006) 17. IFLA: Statement of International Cataloguing Principles (2009), http://www.ifla.org/files/cataloguing/icp/icp_2009-en.pdf
A Short Communication - Meta Data and Semantics the Industry Interface: What Does the Food Industry Think Are Necessary Elements for Exchange?
Kathryn A.-M. Donnelly
Norwegian Institute of Food Fisheries and Aquaculture Research - Nofima, Tromsø, Norway
[email protected]
Abstract. Information exchange in agriculture, fisheries and food production is now a very important area for development. Information is now often exchanged between machines making technology developments of utmost importance for the European Food Industry. It is essential that all areas of the food industry are fully involved in order to gain maximum benefit from the research. This short communication examines the research which has currently been carried out with regards to meta-data and industry standards and compares the various efforts and the industry opinions about these. Keywords: metadata, industry, agriculture.
1 Introduction

Across agriculture, fisheries and food production there is a need to exchange and identify large amounts of product and process information precisely and efficiently. The developments related to the World Wide Web and semantic technology have led to changes in the amount of information exchanged and the way it is exchanged. Information is no longer exchanged person to person or person to machine but more often machine to machine [1]. Therefore it is important that the technologies which facilitate this, and which are becoming widely available, are appropriate to their purpose. The ability to reuse information and ontology-like artifacts, together with bottom-up ontology development, will save time and increase usability [2]. Making advances in traceability systems, which may be at least in part if not wholly served by ontologies, requires knowledge of the industry for which they are intended. Another requirement is that the traceability systems and the data-list standards or ontologies applied should be of use to the industry and need to be presented in a format which the industry can integrate into already existing standards, without creating additional layers on top of existing systems. Within sectors such as food production, special challenges are encountered due to a number of factors, including the global and dynamic nature of the food supply chain and its inherent instability. To overcome these challenges, electronic exchange of information is necessary. To facilitate electronic interchange of such product information, international, non-proprietary standards are required, such as the ones highlighted
by Jansen-Vullers et al [3]. Standards must describe how information can be constructed, sent and received and also how the data elements in the information should be identified, measured, interpreted and stored [4]. The current situation in food supply chains is that they contain a large variety of systems, software and formats and there is currently no standardized way of electronically coding and transmitting information [5, 6]. There are currently no European, sector specific standards, regarding how food processing information should be stored and transmitted nor is electronic information exchange between links in the supply chain common (when the links are not part of the same company) [5, 7-9]. Individual companies and integrated supply chains have made great progress in proprietary technologies for automated data capture and electronic data coding, but the benefit of these is lost when the data element transmission is required for use outside the originating company as it is only effective when there is an identical software system at the receiving end. In order to enable effective, electronic information exchange, work needs to be carried out on a sector-specific level. Analysis of what product information the particular food sector already records should be carried out and a method for identifying this product information should be developed in a standard format. There have been considerable but scattered attempts to address this at both supply chain and sector level. These attempts include the recently completed European Union TRACE project, [10]. Martini et al [1] (2009) identifies other national and international efforts. Those examined here are related to the need for information exchange for food information purposes. Emphasis is placed not on the technical solutions but rather the interface between what the technical solutions can offer and what is demanded in real world situation. This includes, but is not limited to, the industry’s needs and attitude towards information exchange and which data elements they are motivated to exchange. This short communication aims to carry out a brief comparison of real world studies, examine what research has been carried out and establish what this has to say about the industries response and need for metadata/semantic research including work carried out at sector and inter-sector levels.
2 Method

Five studies of metadata covering the honey, chicken, fish, potato and soya sectors have been analysed and compared in the tables below. The importance of various product-related data elements in supply chain communication has been examined; these are listed and briefly outlined below.
3 Results

The results of the comparison are presented in the following tables.
Table 1. The five studies compared (Study no., Sector, Method, Findings, Comments, Author and title)

Study I. Sector: Potato arable farming. Method: registering data from required forms; interviewing industry actors. Findings: OWL-based ontology for potato processing. Author and title: Haverkort et al [11], Organizing Data in Arable Farming: Towards an Ontology of Processing Potato.

Study II. Sector: Honey processing. Method: detailed case study; industry-wide survey in Europe. Findings: metadata list of information exchange in the honey processing sector. Comments: not represented in a formal OWL ontology; focus on supply chain traceability. Author and title: Donnelly et al [12], Creating Standardised Data Lists for Traceability: A Study of Honey Processing.

Study III. Sector: Soya bean processing for both human and animal consumption. Method: case studies; industry consultation in Europe and the USA (limited); expert consultation. Findings: metadata list of information exchange along soya bean supply chains. Comments: investigates multiple links in the chain; investigates relationships and attitudes in an intercontinental setting. Author and title: Thakur and Donnelly [13], Modelling Traceability Information in Soya Bean Value Chains.

Study IV. Sector: Chicken processing. Method: case studies; industry consultation in Europe and the USA (limited); expert consultation. Findings: metadata list of information exchange with focus on the chicken processing sector. Comments: not represented in a formal OWL ontology; focus on supply chain traceability. Author and title: Donnelly et al [14], Improving Information Exchange in the Chicken Processing Sector Using Standardised Data Lists.

Study V. Sector: Fisheries sector information exchange (wild caught and farmed). Method: case studies; worldwide industry consultation; whole sector examined; individual interviews conducted. Findings: metadata lists for traceability throughout the wild caught and farmed fish supply chains. Comments: translated into a number of languages, including Vietnamese. Author and title: Various authors [15, 16]; see ISO/T234, CEN Workshop Agreement: Traceability of Fishery Products - Specification of the information to be recorded in farmed fish distribution chains, and CEN Workshop Agreement: Traceability of Fishery Products - Specification of the information to be recorded in caught fish distribution chains.
Using the same studies as above, the respondents’ attitudes and need to exchange information are compared and analyzed in table 2. Each of the studies listed above involved some degree of industry consultation and contained either a metadata list or ontology. This study is clearly limited by the apparently low number of such studies. It is clearly possible that other such studies exist and hope that this short communication will encourage authors of such works to make them more visible. From the results in the papers the data lists have not been extensively implemented or tested in an industry setting (with, as far as the author is aware, the exception of study V). Some of the papers present results related to the industry’s need to exchange the data elements that are being considered in this communication.
Table 2. Analysis of the industry results from each of the 5 studies (Study no., Importance of parameters that are (electronically) exchanged, Communication of the data elements, Other)

Study I. Importance: no data available. Communication: no data available. Other: the study notes that ontologies in agriculture will be important; no indication is given of release or use dates.

Study II. Importance: greater importance was assigned to data elements related to contamination, while those related to nutrition had the least importance. Communication: no data available. Other: no data available.

Study III. Importance: some of the data elements that the processors reported as being important were not recorded. Communication: the paper reports that information recorded internally is nearly always communicated to the next link. Other: frequent reliance on data elements supplied by other actors in the supply chain.

Study IV. Importance: while slightly higher values of importance were recorded for some parameters, no clear pattern was demonstrated. Communication: the study showed that more than half the product data elements recorded by the companies involved were communicated externally. Other: no data available.

Study V. Importance: no data available. Communication: no data available. Other: this work has been used as the 'basis' for ISO standards.
4 Discussion

The aim of this short communication was to highlight the use of semantics and metadata lists in 'real world' settings. Despite the limited published material, which clearly hinders extensive conclusions from being drawn, some useful indications are still apparent. One comparison that could be made between studies I-IV was the degree to which the different industries saw the need for traceability. For each study, companies across Europe were contacted and, while they all acknowledged the need for traceability, the honey producers were shown to be the most willing to participate in the research, the grain producers the least willing, and the chicken producers somewhere between the two. This is an important observation with regard to awareness of information flow issues. It would seem that the industries which have recently been challenged by information flow issues (e.g. soya GMO; chicken fed on feed containing illegally high levels of dioxins, or carrying avian flu) are least likely to see the value of publicly funded sector initiatives. This demonstrates the sensitive nature of traceability and of access to internal information by external bodies. It also indicates that, if public bodies wish to improve traceability in cooperation with industry, either incentives or legislation would be needed. One of the unexpected findings is that, although many of the companies reported exchanging information with outside bodies, they do not report whether there are already existing standards for communicating this information. It is possible this work happens outside the open access arena and also outside the usual channels for research. In projects such as TRACE an attempt was made to address some of the issues of data exchange, particularly related to traceability. Several parts of this research are presented in the tables. One could suggest from the results presented in the tables that this work should be expanded upon and carried further, because it is of importance to the European consumer and the food industry with regard to improving food information exchange.
The published research regarding information flow and metadata lists is limited. The research carried out, however, confirms that a number of food producing businesses do exchange information. In studies III and IV the companies reported that they exchanged more than 50% of the data they recorded. The benefit is expected to be both in terms of industry communication and communication with the relevant national and international authorities. The information gathered in surveys such as those briefly reviewed in this communication may form the basis for standardized vocabularies. These would then facilitate the development of electronic information interchange in food supply chains (Folinas et al., 2003, Manikas and Manos, 2008), for instance as an extension of the Universal Business Language (UBL). UBL is a library of standard electronic XML business documents, such as purchase orders and invoices, developed and supported by the Organization for the Advancement of Structured Information Standards (OASIS), and is already supported by many national governments. It would be appropriate to carry out a survey on all links, for instance in the honey supply chain, to test the use of such standards. Each of the studies seems to have some of the elements that the others are missing. In many situations, having a complete set of data elements found in an ontology or other accessible list of data elements would be useful. Information about the industry's attitude would also be advantageous. Research regarding possible industry implementation and usage will also be important with regard to carrying out appropriate research. Study V is of great interest because it demonstrates that the type of information developed in studies I-IV can be of international importance and use. The work presented in Study V began as a project but was developed into European standards for information exchange with regard to traceability and, following this, it was suggested as the basis (with some changes) for International Organization for Standardization standards (ISO standards). This indicates that the need for such standards is recognized both by industry and regulatory bodies. The fact that the data presented in Study V forms the basis for standards (which could be used in an ontology and which are currently a metadata list) that are now being considered as international standards demonstrates a clear demand for such work from both the industry and the regulatory authorities on an international basis. In order to further exploit such work across sectors and uses, investigations need to be carried out to see where the different metadata lists can be reused, where they diverge and where they may benefit from meta-level cooperation. In some of the studies the number of elements was deliberately limited, e.g. study II (honey), whilst in others, such as the potato study, the number of elements appears to be unlimited. It would seem clear that an unlimited number of data elements is the appropriate choice, as this gives a sector the greatest freedom to integrate into internal systems and adapt to their needs. This paper is not concerned with the technical standards required for data transmission but rather with the elements required to be exchanged and the industries' attitudes towards them. So far there has been little published research that presents information exchange from the industries' point of view.
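Purely as an illustration of what a standardized vocabulary could enable (the element names below are invented for this sketch and are not taken from UBL, nor from any of the studies reviewed), a machine-readable traceability record exchanged between two links of a supply chain could look along these lines:

```xml
<!-- Illustrative only: element names are hypothetical, not a published standard -->
<TraceabilityRecord xmlns="http://example.org/food-traceability">
  <TradeUnitId>5701234567890</TradeUnitId>
  <Product>Honey, blossom, 500 g jar</Product>
  <BatchNumber>2009-114</BatchNumber>
  <CountryOfOrigin>NO</CountryOfOrigin>
  <ProcessingDate>2009-09-14</ProcessingDate>
  <Sender>Example Honey Packer Ltd.</Sender>
  <Receiver>Example Retailer AS</Receiver>
</TraceabilityRecord>
```

The value of such a format lies not in the particular element names but in the fact that both sender and receiver agree, in advance, on what each data element means and how it is encoded.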
This communication examines only published data; there are probably many more data series, for example the mineral water data and other initiatives, which need to be connected together and integrated both into ontologies and existing data exchange technologies. This is a major area for future research which requires attention.
References 1. Martini, D., et al.: A Service Architecture for Facilitated Metadata Annotation and Ressource Linkage Using agroXML and ReSTful Web Services. Communications in Computer and Information Science 46, 257–262 (2009) 2. Keet, M.: Ontology Desgin Parameters for Aligning Agri-Informatics with the Semantic Web. Communications in Computer and Information Science 46, 239–244 (2009) 3. Jansen-Vullers, M.H., van Dorp, C.A., Beulens, A.J.M.: Managing traceability information in manufacture. International Journal of Information Management 23(5), 395–413 (2003) 4. Folinas, D., Manikas, I., Manos, B.: Traceability data management for food chains. British Food Journal 108(8), 622–633 (2006) 5. Dreyer, H.C., Wahl, R., Storøy, J., Forås, E., Olsen, P.: Traceability standards and supply chain relationships. In: Proceedings of NOFOMA (the Nordic Logistics Research Network Conference), Linkoping, Sweden (2004) 6. TRACE4, A.I.: - TRACE -Tracing Food Commodities in Europe “Description of Work”, FP6-2003-FOOD-2-A Proposal no 006942, Sixth Framework Programme (2008) 7. Karlsen, K.M., Senneset, G.: Traceability: Simulated recall of fish products. In: Luten, J., et al. (eds.) Seafood Research from Fish to Dish, Quality, Safety and Processing of Wild and Farmed Fish, pp. 251–262. Wageningen Academic Publishers, The Netherlands (2006) 8. Moe, T.: Perspectives on traceability in food manufacture. Trends in Food Science & Technology 9(5), 211–214 (1998) 9. Senneset, G., Foras, E., Fremme, K.M.: Challenges regarding implementation of electronic chain traceability. British Food Journal 109, 805–818 (2007) 10. TRACE. The EU project Tracing the origin of food (2007), http://www.trace.eu.org/ (downloaded 16.10.2008 8.59CET), http://www.trace.eu.org/ (cited 10.20CET 26.10.2007) 11. Haverkort, A., Top, J., Verdenius, F.: Organizing Data in Arable Farming: Towards an Ontology of Processing Potato. Potato Research 49(3), 177–201 (2006) 12. Donnelly, K.A., Karlsen, K.M., Olsen, P., Van der Roest, J.: Creating Standardized Data Lists for Traceability – A Study of Honey Processing. International Journal of Metadata, Semantics and Ontologies 3, 283–291 (2008) 13. Thakur, M., Donnelly, K.A.M.: Modeling traceability information in soybean value chains. Journal of Food Engineering 99(1), 98–105 (2010) 14. Donnelly, K.A., Roest, J.V.d., Höskuldsson, S.T., Olsen, P., Karlsen, K.M.: Improving Information Exchange in the Chicken Processing Sector using Standardised Data Lists. Communications in Computer and Information Science 46, 312–321 (2009) 15. CEN14659, CEN Workshop Agreement. Traceability of Fishery products. Specification of the information to be recorded in caught fish distribution chians, European Committee for Standardization (2003) 16. CEN14660, CEN Workshop Agreement. Traceability of Fishery products. Specification of the information to be recorded in farmed fish distribution chains. European Committee for standardization (2003)
Social Ontology Documentation for Knowledge Externalization
Gonzalo A. Aranda-Corral1, Joaquín Borrego-Díaz2, and Antonio Jiménez-Mavillard2
1 Universidad de Huelva, Department of Information Technology, Crta. Palos de La Frontera s/n, 21819 Palos de La Frontera, Spain
2 Universidad de Sevilla, Department of Computer Science and Artificial Intelligence, Avda. Reina Mercedes s/n, 41012 Sevilla, Spain
Abstract. Knowledge externalization and organization is a major challenge that companies must face. They also have to ask whether it is possible to enhance its management. Mechanical processing of information represents a chance to carry out these tasks, as well as to turn intangible knowledge assets into real assets. Machine-readable knowledge provides a basis to enhance knowledge management. A promising approach is the empowering of Knowledge Externalization by the community (users, employees). In this paper, a social semantic tool (called OntoxicWiki) for enhancing the quality of knowledge is presented.
1 Introduction
The competitiveness of companies active in areas with a high market change rate depends heavily on how they maintain and access their knowledge (i.e. their corporate memory) [5]. One of the main problems that current information systems have to face is that most of the information stored is not understood by machines, and so cannot be effectively processed automatically. The Semantic Web (SW) aims to extend the current Web by adding structured information. That information will be enriched with semantics, which will allow the description of the content on the Web, so that it can be interpreted by humans and machines. Ontologies are emerging as the most widely accepted technology for that purpose. Roughly speaking, the SW's goal can be achieved by means of transforming information into Knowledge, using ontologies as a formal reference. An ontology is a hierarchical collection of classes (concepts) and properties (relationships between these concepts) that models a particular conceptualization. For instance, the Business Enterprise Ontology1 (EO) defines that "A Sale is an agreement between two Legal Entities for exchanging of a Product by a Sale Price. Commonly, the Product is a good or a service and the Sale Price is monetary, although other possibilities are included". Once formalized, this kind of definition solves problems of interoperability at the semantic level.
Partially supported by the TIN2009-09492 project of the Spanish Ministry of Science and Innovation, co-financed with FEDER funds.
1 http://www.aiai.ed.ac.uk/project/enterprise/enterprise/ontology.html
Fig. 1. Semantic Web Cake and Nonaka & Takeuchi’s cycle
Whenever two information systems try to communicate and share information, semantic problems could arise (beside the obvious physical difficulties) and could be avoided. This is because these two systems “do not speak the same language”. Business management systems cannot avoid being affected by the consequences of this misunderstanding. Some enterprises use the term “Resource” while others talk about “Machine” to refer to the same concept. At other enterprises, however, “resources” as “raw materials” is used. If each enterprise defines or classifies their products in their own way, the result would be that automatic electronic transactions would be extremely difficult to achieve, and therefore e-business would not be possible. Ontologies provide common and shared knowledge from the described vocabulary through their classes and properties, in order to solve this problem. Its common language allows communication between individuals or information systems at different companies or organizations. Nevertheless, the problem of semantic interoperability can persist on a different level, within the organization itself. It appears in the semantic heterogeneity among user’s interpretations of ontology elements. Integration of information by means of ontologies is a current option at organizations which are aware of the knowledge value. There are powerful methods for integration [1] that produce robust ontologies, however, some problems related to ontological literacy of users can exist. Even though ontologies represent the knowledge of enterprises and users accept and consider them as their own, they may not understand that formal representation in a clear way (so they could underuse or misuse it). A proper solution is that in which own users can collaboratively describe the ontology usage and wiki technology is very appropiate for achieving it. A wiki is a software that enables the collaborative content creation. It allows users to create, modify or delete shared content quick and easily. Its purpose is to let multiple users editing web pages (called articles) related to a subject, so each one provides his/her knowledge and they can work together in order to complete the article. These users can create a community that shares content about common topics.
Since collective knowledge is extremely important for professional communities, the wiki has become an essential tool that makes collaboration among scientific communities, researchers, enterprises and any other users possible.

1.1 Motivation and Aims of the Paper
The use of novel methodologies for Knowledge Management, such as ontology-based ones, requires sound training of users (employees). A main semantic divide arises between ontology specification and the daily use of ontologies at enterprises. The challenge of providing methods to bridge this divide is the main motivation of this paper. In particular, the aim is to present OntoxicWiki, a tool designed for bridging this gap. It provides a social platform (based on wiki technologies) where users can document their usage of the ontology, always preserving its logical features and specification. It is important to remark that OntoxicWiki is not a semantic wiki; it is a wiki for documenting ontologies instead. OntoxicWiki provides a plugin for Protégé, an ontology editor, that allows users to document ontologies as well as to report ontology use cases. The user-generated content is considered as the content of a documentation ontology that the plugin merges with the ontology source. In this way non-specialist users can generate information on the ontologies that their companies or social networks can exploit. These features distinguish OntoxicWiki from semantic wikis, which are based on an underlying model of the knowledge described in their pages (while OntoxicWiki is a wiki for documenting external knowledge represented by ontologies). Structure of the paper. The next section describes how to combine Web 2.0 and SW to enhance Knowledge Asset Management, addressing some of the requirements for SW tools. In Section 3 OntoxicWiki is presented. The main wiki features produced by OntoxicWiki are described in Section 4. The paper ends with some conclusions and future work.
2 Knowledge Organizations and Semantic Web Tools
To create a machine-readable ontology, it is necessary to define it in a formal language (such as OWL2). As we move into such a technical and formal field, certain difficulties could arise for users. Some of them could be:
a) Although there are several ontology editors such as Protégé3, end users should not need to be skilled in these technologies. Every web user - businessmen, scientists, researchers and users without full knowledge of these technologies, and especially company employees - should be able to manage these ontologies. The syntactic and semantic complexity of ontologies can prevent their generalized usage.
b) Any user could design their own ontology, specifically suited for a given purpose, or reuse any available ones and take advantage of their previous usage. Which ontology should they choose? Where can they find testimonials of its usage?
2 http://www.w3.org/TR/owl2-overview/
3 http://protege.stanford.edu/
c) There exist a number of undocumented ontologies, without any information about their application field or potential usage. Others, however, have been documented by their own creators (domain experts), which often make them difficult to understand and reuse, because of excessively technical jargon. This method of documentation is not usually appropiate for most users. It would be desirable to have mechanisms for documenting user experiences (for instance, cases of use, detected inconsistencies, etc.). This could be queried by future users and help them to select or understand the suitable ontology. The above difficulties have an influence on any ontology-based process of externalization, organization and management of Knowledge. The primary source for building an ontology is a company’s own information: databases, business process, etc. [1]. Therefore, it could be appropriate (as in [9]) to start analyzing from the perspective of Emergent Knowledge, roles and processes for Knowledge Asset Management (KAM) in creating knowledge organizations represented by Nonaka & Takeuchi’s cycle [8] (see fig. 1). This cycle is based on four activities which transform the visibility, importance and value of KAM into organizations (socialization, externalization, combination and internalization). Knowledge Externalization is a key activity that create explicit intangible assets for the enterprises. In SW, Knowledge is a tangible asset and the substance of processing. In Web 2.0 (W2.0), user generated knowledge is often based on the combination of different contributions from different users (or communities). Therefore, in Semantic Web 2.0 (SW2.0), similar KAM cycles could be studied in order to enhance them by means of new technologies or methods. In W2.0 networks, knowledgecreating communities are networks which are based on prosumers, it means, the creation and consumption of knowledge is entrusted to users. The Nonaka & Takeuchi cycle can be adapted to knowledge management in these networks, and some of these processes can be supported by Ontological Engineering theories and tools. Nonaka and Takeuchi’s cycle projection in these networks shows four needs for creating truly SW2.0 communities: emergent semantics, semantic user interfaces, knowledge networks and ontology alignment (see fig. 1): – The process of collaborative externalization by means of SW technologies is, in fact, a process of emergent semantics when tools for organizing knowledge are provided. – Once Knowledge is externalized, in terms of ontologies, the combination of different knowledge sets is a problem of ontology alignment. – Internalization of the Knowledge implies the fair use of SW tools which facilitate the creation -by employees- of common knowledge into their company, so semantic users interfaces are needed. – Finally, the socialization of the knowledge produces knowledge networks when users adopt Web 2.0 behaviours. Though these activities are similar to those used for constructing, mapping and managing collaborative knowledge spaces [6], innovative W2.0 tools and services will emerge.
Fig. 2. Projection of the cycle
OntoxicWiki is designed to provide a semantic bridge between the knowledge activities of the cycle’s projection, enhancing both Web2.0 and SW solutions in this context (fig. 3). Specifically, the tool is designed to satisfy several needs which arise from the problems described below. Firstly, Knowledge has to be represented in a comprehensive and friendly way, in order to be used by all type of users,no matter what their knowledge or skill are. Secondly, shared ontologies have to allow easy reuse, and finally, it must work on socially documented ontologies, which lets users read the contributions made by other users in order to facilitate their choice (and proper usage of) of a set of different ontology concepts, relations and individuals. These requirements lead us to propose a collaborative solution. It seems essential to have a web site where users could find shared ontologies with additional information. This allows us to learn interesting features about them, such as their scope, purpose or recommended applications. Thus, scientists, businessmen and other users will be able to access them, filtering the content by their respective areas of interest and, thanks to the added information, choose the ontology which best suits their work needs. OntoxicWiki supports these requirements: designing and documenting shared ontologies in a collaborative environment.
3 OntoxicWiki
OntoxicWiki is a tool born to bridge the gap between user and ontology. The main objective of this application is to represent ontologies in an intuitive and easily understandable way for any user, by providing an environment - concretely, a wiki - from which to repair and document ontologies socially. Supported by a MediaWiki platform, OntoxicWiki is a plug-in developed for the ontology editor Protégé. It has been designed this way because Protégé can easily be extended by way of a plug-in architecture and a Java-based API for building knowledge-based applications with ontologies. Besides, it is a free, open-source
Fig. 3. Externalization of Knowledge by means of Ontoxicwiki
platform that provides a growing user community. In this sense, the wiki has become the indispensable tool that best suits collaboration among scientific communities, researchers, enterprises and other users.

3.1 Functionality
OntoxicWiki mainly consists of a toolbar and a web browser (see the technical architecture in Fig. 7). Its functionality is closely related to the lifecycle of ontologies in the context of this tool. In this setting, the lifecycle consists of five phases:
1. Creating/loading the ontology in Protégé. The process ends with a file with the .owl extension. This will be the target ontology. OntoxicWiki is not needed in this phase (Fig. 6).
2. Adding documentation to the ontology. OSMV (Ontology Social Metadata Vocabulary) is inserted into the target ontology. Further details are given in Section 4.
3. Writing the ontology on the wiki automatically. Information concerning ontology resources, i.e. classes, properties and individuals, is extracted. Then, a wiki article is created for every class and property. The information written in this article depends on the type of resource being considered, but each article consists of a number of subsections, following the wikitext syntax [http://www.mediawiki.org/wiki/Help:Formatting]; a sketch of such a generated article is shown after this list. In the case of a class, the relevant data are its rdfs:comment, properties, subclasses, superclasses and instances. Each of these pieces of information is written in its respective subsection (the rdfs:comment as natural language and the others as lists of elements). When describing a property, the data considered are its rdfs:comment (in natural language), domain, range, superproperty and subproperties (lists of elements). Besides, special articles and subsections coming from OSMV are added to the original ones. Finally, articles are linked through hyperlinks to keep the original ontology structure (Fig. 5).
Fig. 4. Ontoxicwiki screenshot
4. Ontology revision/documentation. Once ontologies are available on the wiki, the user only needs a web browser to access them and navigate from one page to another. Users can visit an article (a class or property) just for reading, or they can edit it, delete it or create new ones (following some simple syntax rules for further processing). Revising and documenting consist of filling in the corresponding subsections. Like other wikis based on MediaWiki (e.g., Wikipedia), it is fully functional and allows, among others, actions such as registering new users or administration tasks. OntoxicWiki is not needed in this phase (Fig. 4).
Fig. 5. OntoxicWiki Protégé plugin and toolbar
5. Retrieving the ontology from the wiki automatically. Users can construct an OWL-format ontology from the wiki for later use. To achieve this, the article structure and the relations (hyperlinks) between articles are analyzed and the class and property hierarchy is generated. Then the content of each article is read and parsed to build the corresponding class or property in the ontology. All information related to documentation (OSMV) is ignored, so only the target ontology is retrieved.
The user interface is mainly composed of three elements: the toolbar, the template area and the web browser. The toolbar (see Fig. 5, bottom) provides basic features for managing ontologies. Each button shows the related template. The template area displays the syntax corresponding to the wiki article.
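As an illustration (the subsection titles below are our own guess at a plausible layout, not the exact ones produced by the plugin), the wikitext article that OntoxicWiki generates for a class such as Sale from the Enterprise Ontology could look roughly like this:

```
== Comment ==
A Sale is an agreement between two Legal Entities for exchanging a Product by a Sale Price.

== Superclasses ==
* [[Agreement]]

== Subclasses ==
* [[PotentialSale]]

== Properties ==
* [[hasBuyer]]
* [[hasSalePrice]]

== Instances ==
* [[Sale_2010_0042]]

== User documentation (OSMV) ==
''Comments, detected deficiencies and documentation resources are filled in here by users.''
```

Because each linked name is itself a wiki article, following the hyperlinks reproduces the hierarchy of the original ontology, which is what later allows the ontology to be reconstructed from the wiki (phase 5).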
Fig. 6. Loading the ontology in Protégé
Fig. 7. Technical features of OntoxicWiki (OW): (1) OW architecture. (2) OW uses the Protégé-OWL API to manage the ontology. (3) It accesses the database through the MySQL Connector/J driver, which allows connecting from Java source to the database. (4) The OW browser makes requests to the HTTP server. (5) The HTTP server parses them and sends them to the PHP engine. (6) It sends appropriate SQL statements to the MySQL database manager. (7) Lastly, MySQL accesses the database to read/write data and sends back the results.
4 Features of the Wiki and the Integration of OSMV Ontology
An aim of OntoxicWiki is to create a wiki which contains ontologies. It is hosted on a server with the MediaWiki software installed. OntoxicWiki connects to the database created by this software and stores the information related to ontologies in it. Once ontologies are available on the wiki, the user only needs a web browser to access them and navigate from one page to another. Users can visit an article just for reading, or they can edit it, delete it or create new ones (following the syntax rules for further processing). Like other wikis based on MediaWiki (for instance, Wikipedia), it is fully functional and allows, amongst others, actions such as registering new users or administration tasks.
The Ontology Metadata Vocabulary (OMV)4 is a consensus ontology for Information Technology professionals, Computer Science and Knowledge Engineering. It consists of a set of instantiated classes and properties that provide useful data describing ontologies. OntoxicWiki integrates an extension of OMV, called OSMV (Ontology Social Metadata Vocabulary), that contains classes and roles which specify social features. This extension allows users to provide two kinds of information: on the one hand, information about the ontology as a whole (domain, the number of classes, the language used to describe it, other similar ontologies on the Internet, etc.); on the other hand, specific information for each class/property (users' comments or information sources for this class/property, etc.). In both cases, there exist data of both technical and social nature. When OSMV is added to the target ontology, both ontologies are merged into one. The consequences are: OSMV's classes and properties are added to the target ontology's ones; an instance which represents the target ontology itself and an instance for each of the target ontology's classes and properties are created. As far as the wiki is concerned, new subsections are added to each article. On the front page, one subsection for each global documentation property appears: number of classes, language, similar ontologies, keywords, domain, author, etc. On each class and property article, one subsection for each local documentation property is added: comments, user descriptions, deficiencies found and documentation resources for this topic, etc. From now on, OSMV provides slots to be filled in with global information relative to the ontology as a whole or to its classes and properties specifically.
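To make this concrete, the following Turtle fragment sketches what such metadata records might look like once exported; the osmv: property names and the namespace are invented here for illustration and are not the actual identifiers defined by OMV or OSMV:

```turtle
@prefix osmv: <http://example.org/osmv#> .      # hypothetical namespace
@prefix ex:   <http://example.org/enterprise#> .

# Global documentation about the target ontology as a whole
ex:EnterpriseOntology a osmv:Ontology ;
    osmv:domain "Business management" ;
    osmv:numberOfClasses 112 ;
    osmv:naturalLanguage "en" ;
    osmv:similarOntology <http://www.aiai.ed.ac.uk/project/enterprise/> .

# Local documentation attached to one class of the target ontology
ex:Sale osmv:userComment "Used for both goods and services in our projects." ;
    osmv:deficiencyFound "No property for partial payments." ;
    osmv:documentationResource <http://example.org/wiki/Sale> .
```

The first block corresponds to the global slots shown on the wiki front page, the second to the local slots shown on each class or property article.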
5 Conclusions and Related Work
OntoxicWiki has been designed for bridging the gap between the formal specification of ontologies and their users, by means of social documentation. There are similar approaches that combine Semantic Web and wiki technologies, although these have different scopes or applications, or do not use the powerful features that ontology editors, such as Protégé, provide (for example, analysis of logical
4 http://sourceforge.net/projects/omv2/
consistency). As OntoxicWiki was built as a Protégé plugin, it has a valuable advantage over semantic wikis: the ontology consistency can be checked at any moment, just by clicking on Protégé's button. Obviously, an automated reasoner, e.g. Pellet5, must be installed and running. If Protégé detects any inconsistency, OntoxicWiki can undo the changes based on the regular features of any wiki. It is important to make clear that OntoxicWiki is not a semantic wiki. Semantic wikis provide the ability to capture or identify information about the data within pages, and the relationships between pages. Meanwhile, OntoxicWiki uses a wiki just to represent and document ontologies easily and intuitively. It is worth remarking that OntoxicWiki is useful for enhancing activities belonging to the Knowledge process cycle [10], because it provides a social platform where knowledge workers can capture their experience as well as access documents through a nice GUI, namely the wiki. Therefore, extending OntoxicWiki to a complete suite for Semantic Knowledge Management that transforms ontologies in a clear way is a promising research line. Semantic MediaWiki6 (SMW) helps to search, organise, tag, browse, evaluate, and share the wiki's content. SMW adds semantic annotations that make it easy to publish Semantic Web content. However, it is not designed for documenting ontologies, because it does not directly allow ontology debugging. Our tool is designed for a different purpose: knowledge externalization of user experiences and social documentation of pre-existing ontologies (which were not built in a semantic wiki environment). For existing ontologies that were built using ontology editors, OntoxicWiki helps to browse and to document them. From this point of view, its main feature would be to document and to report their use in companies. OntoWiki7 allows users to navigate through ontology classes and properties like traditional wikis, but actually uses an underlying formally described ontology. Users can create instances of classes and relations and add their own documentation. It fosters social collaboration aspects by keeping track of changes, allowing users to comment and discuss every single part of a knowledge base, enabling content to be rated and its popularity measured, and honoring the activity of users. However, it does not have a specific ontology for user documentation as OntoxicWiki does. AceWiki8 also uses a wiki to manage ontologies in an intuitive way and hides OWL and logic. In this case, the language is ACE, a rich subset of standard English that looks like natural language but is in fact a formal language. But, unlike OntoxicWiki, AceWiki does not support user documentation. A more specific approach is SWiM9, designed for editing/browsing mathematical ontologies [7]. One can easily identify the correspondences between the SW ontologies and mathematical ontologies.
5 http://clarkparsia.com/pellet/
6 http://semantic-mediawiki.org
7 http://ontowiki.net
8 http://attempto.ifi.uzh.ch/acewiki/
9 http://kwarc.info/projects/swim
It might be argued that OntoxicWiki does not have user evaluation results at this first stage, since in this respect there is no difference from classic wikis; at design time, this was the critical feature. User evaluation will be enhanced when OntoxicWiki is implemented as a Protégé Server plugin, so that authors from any scientific community could use OntoxicWiki for collaborative edition and documentation of ontologies. Lastly, it is likely that users will inconsistently document some classes/relations (even if the ontology has been extracted from the company's own information). A solution is to use tagging, in order to apply semi-automatic reconciliation methods, for example agent-based methods [3]. Also, visual representations [4] could be considered, in order to provide an interface for ontology debugging, extending the features of OntoxicWiki for the document debugging process.
References 1. Alexiev, V., Breu, M., Bruijn, J.d., Fensel, D., Lara, R., Lausen, H.: Information Integration with Ontologies: Experiences from an Industrial Showcase. John Wiley & Sons, Chichester (2005) 2. Aranda-Corral, G., Borrego-D´ıaz, J.: Ontological dimensions of Semantic Mobile Web 2.0. First principles. To appear in Handbook of Research on Mobility and Computing. IGI Press (2010) 3. Aranda-Corral, G., Borrego-D´ıaz, J.: Reconciling Knowledge in Social Tagging Web Services. In: Corchado, E., Gra˜ na Romay, M., Manhaes Savio, A. (eds.) Hybrid Artificial Intelligence Systems. LNCS, vol. 6077, pp. 383–390. Springer, Heidelberg (2010) 4. Borrego-D´ıaz, J., Ch´ avez-Gonz´ alez, A.M.: Visual Ontology Cleaning. Cognitive Principles and Applicability. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 317–331. Springer, Heidelberg (2006) 5. Fensel, D.: Ontologies: Silver Bullet for Knowledge Management and Electronic Commerce, 2nd edn. Springer, Heidelberg (2004) 6. John, M., Melster, R.: Knowledge Networks – Managing Collaborative Knowledge Spaces. In: Melnik, G., Holz, H. (eds.) LSO 2004. LNCS, vol. 3096, pp. 165–171. Springer, Heidelberg (2004) 7. Lange, C., Kohlhase, M.: A Mathematical Approach to Ontology Authoring and Documentation. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds.) Calculemus 2009. LNCS (LNAI), vol. 5625, pp. 389–404. Springer, Heidelberg (2009) 8. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford Univ. Press, Oxford (1995) 9. Nyk¨ anen, O.: Semantic Web for Evolutionary Peer-to-Peer Knowledge Space. UPGRADE X(1), 33–40 (2009) 10. Staab, S., Studer, R., Schnurr, H., Sure, Y.: Knowledge Processes and Ontologies. IEEE Intelligent Systems 16(1), 26–34 (2001)
Information Enrichment Using TaToo's Semantic Framework
Gerald Schimak1, Andrea E. Rizzoli2, Giuseppe Avellino3, Tomas Pariente Lobo4, José Maria Fuentes4, and Ioannis N. Athanasiadis2
1 AIT Austrian Institute of Technology, Seibersdorf, Austria, [email protected]
2 IDSIA, Lugano, Switzerland, {Andrea,ioannis}@idsia.ch
3 ElsagDatamat, Rome, Italy, [email protected]
4 ATOS Origin, Madrid, Spain, {tomas.parientelobo,jose.fuentesl}@atosresearch.eu
Abstract. The Internet is growing in a non-coordinated manner, where different groups continuously publish and update information, adopting a variety of standards, according to the specific domain of interest: from agriculture to ecology, from groundwater to climate change. This unconstrained and unregulated growth has proven to be very successful, as more information is made available, even more is being added, in a virtuous cycle of information accrual. At the same time, modern search engines make looking for information rather easy, with their overall performance being more than satisfactory for most users. Yet, searching and discovering information requires a good deal of expertise and pre-existing knowledge. That may not be a problem when a user searches for common assets using a generic-purpose search engine. But what happens when the user is trying to gather scientific information across boundaries (e.g. cross different disciplines, cross environmental domains, etc)? This asks for new approaches, methods and tools to close the discovery gap of information resources satisfying your specific request. This is exactly the challenge the TaToo project is heading to. Keywords: semantic annotation; semantic tagging; model search and discovery; web services; environmental information enrichment.
1 Introduction

The TaToo project aims at exploiting a common practice among web users: search, discovery and tagging of interesting resources, focusing on environmental ones. Tagging practices allow communities and user groups to label and classify resources, enriching information and enabling aggregators to display the most relevant ones according to the context, as reflected in Fig. 1. TaToo aims to capitalize on the principles of tagging by investigating the ability to add valuable information in the
form of semantic annotations, facilitating future usage and discovery, and kicking off a beneficial cycle of information enrichment. Thus, the production of semantic meta-information will improve the discovery process, but also its interpretation in a larger sense (verification that it is the information you were looking for, assessment of usefulness for a given situation, understanding of how to use the information correctly, etc.). Standards and metadata are part of the cure, but they are still too little to confront the exponential increase in data availability provided by earth observation initiatives such as INSPIRE (http://inspire.jrc.ec.europa.eu), Google Earth (http://earth.google.com), OpenTopography (http://www.opentopography.org), GEOSS (http://www.earthobservations.org/geoss.shtml) and many others. We claim that the expressivity of glossaries, dictionaries, thesauri and schemata is too limited for the demands that we expect to be posed to environmental information systems in the near future [2]. To overcome this type of problem we need to add rich semantics, possibly axiom-based, to our environmental resources, thus increasing the expressivity of information but, at the same time, also increasing the complexity required to convey it, as addressed in [3].
Fig. 1. TaToo’s cycles of information enrichment
The TaToo project, which started in 2010, tries to close this information and discovery gap by providing a way to semantically annotate environmental resources on the Web. The project idea is strongly inspired by existing social bookmarking initiatives such as Delicious, reddit, StumbleUpon, Digg, etc. Yet TaToo aims to let users employ semantics in their annotations, by accessing shared ontologies, thus enabling inference engines to process the information and discover new facts and new relationships that are not explicitly stated in the body of knowledge. TaToo pursues a community-driven approach to information enrichment (see Fig. 1), addressing experts, from researchers up to decision makers in authorities, within specific (environmental) communities, as well as, as far as possible, the general public,
in order to set up, extend, use and promote their knowledge by using the TaToo framework as a knowledge sharing platform. In the remainder of this paper we first review the state of the art in semantic annotations and tagging, then we present our vision of how TaToo should work and operate, and finally we describe a preliminary draft of the enabling software architecture.
2 The Vision

Despite the great amount of work and resources currently deployed in the field of semantic annotation of web resources, some major hurdles have to be overcome to make the TaToo vision become a reality. TaToo is expected to work along the lines of one of those social tagging and bookmarking websites. Here we focus on two major use-cases: finding and annotating a resource, and searching for and discovering an annotated resource. In the first case, the user stumbles on an interesting resource during his/her work. Let us assume (as one of many possibilities) that the resource is a web service described in a web page. The user simply has to feed the URL of the web page to the TaToo server application; this can be done by dragging and dropping the URL into a sidebar of a modern browser. The TaToo server application recognizes the URL and starts processing the page, automatically extracting the information regarding the web service and processing the text in the web page. The TaToo server application then generates a web page where the various elements of the original web page are presented to the user and offered for semantic tagging. The process is therefore a mix of automated and manual semantic annotation, and the usability of the interface will be a critical element for the success of the application. In the second case, the user accesses the TaToo server to search for and discover environmental resources which have been semantically annotated. The advantage is the ability to overcome the limitations imposed by specific domain jargons and by semantic ambiguity. In Section 5 we describe some use cases coming from TaToo’s validation scenarios where this feature will be exploited. Also, in a possible future scenario, third-party web services will be able to use TaToo discovery services to automatically chain web services in order to answer complex and structured queries requiring the integrated runs of multiple environmental resources.
3 The Semantic Web and Semantic Annotations

TaToo aims at providing a framework for semantic annotation and discovery of environmental resources. But why use semantic rather than informal annotations? The key aspect is precisely that informal annotations are relatively good for human interpretation, but fail to help machines understand the meaning and perform more advanced tasks based on those annotations. Annotations should therefore be formal and shared within the domain. In this sense, the Semantic Web [4] was first introduced by Berners-Lee with the aim of providing meaningful web content for machines by bringing formal structure to web resources. Gruber [5] introduced the concept of ontology as “a formal
explicit specification of a shared conceptualization”. Ontologies thus play the role of the formal (machine-understandable) and shared (within a domain) backbone of the Semantic Web, and they are becoming a clear way to deal with integration and semantic discovery of web resources in the environmental domain. TaToo has the objective of allowing cross-domain discovery of resources in the environmental field, meaning that resources annotated for different purposes, and possibly with different ontologies, should be retrievable using a common framework. There are, however, several challenges in solving this semantic heterogeneity issue:

1. Allowing a multi-domain annotation schema: TaToo should allow the annotation of resources using a common and controlled vocabulary based on ontologies. At the same time, TaToo should offer the possibility of particularizing the annotations for a given domain.
2. Implementing an extensible discovery mechanism: TaToo should provide a generic discovery framework allowing a common way of searching for annotated resources. At the same time, the system must provide the possibility of extending and specializing the search for a given domain or a given application.

There are several approaches to achieving semantic interoperability when dealing with different ontologies. Wache [6] defined three ways of integrating ontologies: using a single ontology, multiple ontologies, or a hybrid ontology approach. Intuitively, using just a single ontology is good for integration purposes, but it can overcomplicate the ontology and is not very flexible from a cross-domain perspective. When using several ontologies, mappings between similar concepts in all the ontologies need to be created in order to achieve interoperability; this can become too complicated when dealing with several ontologies and hinders the introduction of new domains. The hybrid approach is based on using a common shared ontology and a set of local or application ontologies that are mapped uniquely to the shared vocabulary. In this sense the local ontologies extend the vocabulary to the needs of a given domain, and interoperability is achieved through the shared ontology. TaToo follows the hybrid approach and is currently evaluating the most widely used ontologies in the environmental field as candidates for the basis of the shared ontology. In TaToo, there is also the need to describe environmental resources in a uniform way. We understand an environmental resource as a web resource (be it a web page, a document, a model, a service, etc.) which is identified by a URI. In this sense, we are defining within the shared ontology a minimal environmental resource model, i.e. a part of the shared ontology that will contain the minimal set of cross-domain concepts and properties. TaToo ontologies will be based on existing W3C standards, particularly RDF (http://www.w3.org/TR/REC-rdf-syntax/), RDFS (http://www.w3.org/TR/rdf-schema/) and OWL (http://www.w3.org/TR/owl-features/). In the ontology engineering process we will also try to reuse as many shared vocabularies as possible, such as DC (http://dublincore.org/), FOAF (http://www.foaf-project.org/), SIOC (http://sioc-project.org/), etc. TaToo will not provide an ontology engineering tool, but will rely on existing tools for ontology engineering such as the NeOn Toolkit
Information Enrichment Using TaToo’s Semantic Framework
153
(http://neon-toolkit.org/) or Protégé (http://protege.stanford.edu/). Within TaToo we use the NeOn methodology to build ontology networks and semantic applications (http://www.neon-project.org/nw/NeOn_Book/).
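As a purely illustrative sketch of what such an RDF-based annotation might look like (the namespace http://example.org/tatoo#, the class EnvironmentalResource and the sample values are assumptions made for this example, not elements of the actual TaToo shared ontology), the following Java fragment uses Apache Jena and the Dublin Core vocabulary to attach a few metadata statements to a resource identified by its URI:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DC;
import org.apache.jena.vocabulary.RDF;

public class AnnotationSketch {
    public static void main(String[] args) {
        // Hypothetical namespace standing in for the TaToo shared ontology.
        String tatoo = "http://example.org/tatoo#";

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("dc", DC.getURI());
        model.setNsPrefix("tatoo", tatoo);

        // Annotate an environmental resource (here: a fictitious web service) with
        // a cross-domain type and a few Dublin Core properties.
        Resource envResource = model.createResource(tatoo + "EnvironmentalResource");
        model.createResource("http://example.org/services/climate-indicators")
             .addProperty(RDF.type, envResource)
             .addProperty(DC.title, "Climate indicator web service")
             .addProperty(DC.subject, "climate change")
             .addProperty(DC.creator, "Example research group");

        model.write(System.out, "TURTLE");   // serialize the annotations as RDF/Turtle
    }
}
```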
4 The Proposed Architecture

TaToo’s aim is to provide a set of functionalities, mainly to search, discover and tag resources with metadata, in order to improve discovery through the information enrichment process described above. To offer this kind of functionality, the TaToo framework has to provide a set of system components (server side) implementing the functionality, and a set of user components (client side) letting the end users interact with the core components through graphical user interfaces (or, eventually, APIs). From a high-level view, the TaToo architecture is composed of a basic set of functional blocks, or building blocks, each of which groups the components that contribute to a specific functionality. This high-level group of functional blocks is presented in Fig. 2. The figure shows five different tiers1 and the high-level building blocks taking part in them. The Presentation tier contains the User Components building block, since this block provides all the interfaces required by the end user to access the system and take advantage of the provided functionality. These components interact with the System Components through a set of Web clients (stubs), which contact the Web server counterpart through operation invocations. The Service tier contains the Web services implementing the entry point of the system part of the framework. These services provide functionality to the Web clients and interact with the TaToo Core Components contained in the Business tier (the Business tier implements the business logic of the entire framework). The Data tier is responsible for storing resource metadata (in the form of RDF triples or structured data) in a relational database. Finally, the Cross tier contains all the building blocks dealing with functionality generally known as cross-cutting. These building blocks provide functionality required by the framework to deal with requirements that cut across all the previous tiers, such as security, in terms of authentication, authorization, access control, administration, and so forth. Fig. 2 also shows the different resources identified so far that TaToo foresees to deal with. In general, Web services (SOAP or RESTful) and catalogues are of primary interest; other possible resources are models or Web pages containing structured or unstructured content. These resources have to be discovered, tagged and evaluated. Evaluation and tagging of resources by the end user make the information enrichment process possible; searches become more and more effective, as each is based on a larger amount of available metadata. The high-level architecture is detailed in Fig. 3.
1 It is worth noticing that in the context of architectures based on the service-oriented paradigm, the term ‘tier’ is preferred to the more common term ‘layer’.
Fig. 2. TaToo High Level Architecture
Fig. 3. TaToo Initial Architecture
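The paper does not spell out the service signatures of the Service tier; purely as an illustration of how such an entry point could expose the tagging and discovery functionality to the Presentation-tier Web clients, the following Java interface is a hypothetical sketch (all type and method names are invented for this example and are not part of the TaToo framework):

```java
import java.util.List;

/**
 * Hypothetical sketch of a Service-tier facade for the TaToo core components.
 * Names and signatures are illustrative only.
 */
public interface TaTooAnnotationService {

    /** A semantic tag: an ontology concept URI applied to a resource by a given user. */
    record SemanticTag(String conceptUri, String userId) {}

    /** Attach a set of semantic tags to the resource identified by resourceUri. */
    void annotateResource(String resourceUri, List<SemanticTag> tags);

    /** Return the URIs of annotated resources matching a concept of the shared ontology. */
    List<String> discoverResources(String conceptUri);

    /** Record an end-user evaluation (e.g. a usefulness rating) for an annotated resource. */
    void evaluateResource(String resourceUri, String userId, int rating);
}
```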
The User Components building block contains four different sub-blocks. Tagging, Search & Discovery, and Evaluate/Validate offer the respective client-side components for exploiting the corresponding system functionality. In general, these components are meant to be used directly by the end user (this is particularly true when the component provides a GUI), who interacts with them in order to take advantage of the functionality TaToo offers. These components can be implemented as:
• Portlets making up a portal;
• External applications to be installed on the end user's machine;
• Browser plug-ins to be installed in the end user's browser;
• APIs, in case the end user wants to write his/her own custom application.
The general idea of TaToo is to realise a portal providing functionality through a set of configurable portlets (see Fig. 4).
Fig. 4. TaToo Web Portal
The Visualize sub-block (still in the User Components block) contains tools for visualising resources. Once discovered, resources have to be accessed and visualised before tagging. Resources can be accessed through several means and protocols, such as HTTP(S), FTP(S), GridFTP, OpenLDAP, OGC WCS/WMS/WFS, or even Torrent (security and access control issues obviously have to be considered). Once the resource is available, it has to be visualised in order to be tagged. Graphics (photos, pictures, etc.) in common formats such as .jpg or .bmp, or a .doc file, can easily be visualised through applications that are normally available and installed client-side; in the case of custom or proprietary formats, new visualisation tools are required. Visualising tools, even if not the main scope of the project, are fundamental, as the tagging process can be performed by the end user only if he/she is able to ‘visualise’, i.e. to obtain an understandable representation of, the resources being tagged.

The server-side TaToo building blocks correspond to the blocks at client side (apart from the Visualise block, which has no counterpart at server side). As already stated, these are generally Web services offering an interface to the underlying TaToo Core components. The TaToo Core building block is the main part of the architecture and contains the ‘core’ TaToo components implementing the business logic. In particular, four components are of major importance:

• The Clearinghouse organises the semantic information on environmental resources. It is the central component for accessing the metadata storage and also serves as an information exchange support between the core system components;
• The Semantic Processor is the fundamental component dealing with semantics. It uses a set of (pluggable) ontologies (in the environmental domain) to provide functionality based on semantics. In general, it relies on an application framework (such as Jena) and a reasoner (such as Pellet, or the Jena embedded reasoner) to provide its functionality; such an application framework provides useful APIs to manipulate RDF, support SPARQL queries, and more;
• The User CTX Manager is in charge of managing the user context. In particular, it offers the user the possibility to store information about performed searches, tags provided for resources, and so on. For instance, it allows the user to retrieve previously performed searches, so that the tagging process can be postponed in time (e.g. when the evaluation of the resources requires time to be performed2);
• The Harvester is the component capable of retrieving external resources and their associated metadata, be they data or metadata stored in catalogues, Web services, or information contained in Web pages. The Harvester retrieves already available resource metadata (mainly from the resource owner); this means that, in addition to the information enrichment process, metadata can also be collected by harvesting available catalogues (or Web resources in general). The harvesting process can take place at system deployment time, to create an initial set of metadata, and/or periodically, to collect updated content.
Finally, the (Meta)Data Management building block deals with the storage where metadata on resources are kept (both resource-owner metadata and metadata provided by end users through tagging). To prove TaToo's objectives and functionality, three validation scenarios have been set up.
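As a concrete illustration of the kind of task the Semantic Processor delegates to its application framework, the sketch below runs a SPARQL query over a small in-memory RDF model with Apache Jena. The query, the sample data and the Dublin Core-based vocabulary choices are assumptions made for this example only; they are not taken from the TaToo ontologies or code base.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.vocabulary.DC;

public class DiscoverySketch {
    public static void main(String[] args) {
        // A toy metadata store: one annotated resource with two Dublin Core statements.
        Model model = ModelFactory.createDefaultModel();
        model.createResource("http://example.org/resources/air-quality-dataset")
             .addProperty(DC.subject, "air pollution")
             .addProperty(DC.title, "Urban air quality measurements");

        // Discovery expressed as a SPARQL SELECT over the annotations.
        String sparql = "PREFIX dc: <http://purl.org/dc/elements/1.1/> "
                      + "SELECT ?resource ?title "
                      + "WHERE { ?resource dc:subject \"air pollution\" ; dc:title ?title }";

        try (QueryExecution exec = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("resource") + " -> " + row.getLiteral("title"));
            }
        }
    }
}
```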
5 Validation Scenarios

TaToo plans to validate the usability of its approach through the implementation of three different scenarios. All three scenarios are embedded in highly complex environmental domains and are therefore mainly addressed to domain expert groups and communities, as well as to technically skilled users. The scenarios address the following environmental domains: climate change, agriculture, and the anthropogenic impacts of pollution. Even if they have independent, distinct goals, they will be presented together in order to demonstrate the complementarity of TaToo features and the improvements in the discovery process that TaToo facilitates. The three validation scenarios are explained in the following.

5.1 Climate Twins Validation Scenario

We call region pairs with similar climate conditions (at different times) “Climate Twins”. A web-based Climate Twins exploration tool will identify those Climate Twins where a source grid cell's values representing future climate show high similarity with the current climate grid. Finding climatic coincidence seems to be a simple exercise, but the accuracy and applicability of the similarity identification depend very much on the selection of climate indicators and uncertainty ranges. The TaToo platform can provide tools to facilitate an improved, user-focused climate change
2 Tagging can be performed as soon as the user has evaluated the found resources. While this is trivial and immediate for a resource like a picture, it can be demanding and time-consuming for a collection of raw data, for which a long elaboration could be required.
resource search, through which end users will be able to add tags to and comment on existing resources, reuse tags of other users, and eventually discover and retrieve climate twin-region data through semantically rich, spatially explicit, user-tailored querying.

5.2 Agro-Environmental Validation Scenario

In the agro-environmental scenario we will work in collaboration with the AGRI4CAST action of the Joint Research Centre, which focuses on the European Commission Crop Yield Forecasting System, aiming at providing accurate and timely forecasts of crop yield and crop biomass production. AGRI4CAST receives an increasing number of requests for analyses to be run against the weather and soil database, which require either new or modified modelling capabilities with respect to the set of models available in the operational system. To achieve this, software implementations of Crop Forecasting System model components target the objective of easy composition, extension and re-use. Though detailed model and software documentation is available, along with scientific papers and reports describing the application of the models, the discovery of appropriate models to be employed for on-demand studies is still a monotonous task that requires significant human expert effort. TaToo will be put to the test as a tool to support the proper annotation of resources by defining attributes such as description, maximum, minimum and default values, units, and URL. Its search and discovery capabilities will then be tested by finding alternative modelling solutions, given that each component can make available alternative options for estimating or generating variables.

5.3 Anthropogenic Impact of Pollution Validation Scenario

The anthropogenic impact of pollution case study will enable the synthesis of existing (air) pollution monitoring databases with the epidemiological data required for identifying the effects of pollution on human health (the anthropogenic impact). This task requires new, rich data discovery capabilities within the available bodies of knowledge. Proper use of these data requires contextual enhancements, which TaToo will deliver through tagging and enhanced information description (meta-information) embedded into the appropriate semantic environment.

5.4 Overall Project Evaluation

Each validation scenario identifies several use-cases that demonstrate clear improvements in the discovery process of environmental resources. In the TaToo project, we aim to track, organize and study all these improvements in order to evaluate the overall impact achieved by the project. We intend to classify improvements along two major axes: type of improvement and type of beneficiary. We identify the following two types of improvements that will come as a result of the usage of TaToo tools:
1. Improvements in discovery performance, i.e. more efficient discovery of resources. In this case, use-cases demonstrate that the adoption of advanced semantics and TaToo tools has improved the situation in measurable ways (i.e. by identifying certain performance indicators, such as result completeness, robustness, response time, or other appropriate measures to be defined in the use cases).
2. Improvements in the discovery process, i.e. a new search and discovery experience not possible before. In this case, TaToo tools enable completely new discovery processes that have not been available with traditional tools, or whose usability is significantly improved with TaToo. For these cases a quantitative measurement of improvement is not possible, so a qualitative one is adopted. In these cases it is very important to justify that existing tools cannot support such processes.
We also identify three major types of beneficiaries (user roles) for TaToo:
a. the resource provider, i.e. the one who supplies the resource; in the domains we are treating this could be a modeller or a scientist;
b. the resource publisher, i.e. the one who is in charge of making the resource available; this user may coincide with the provider, but need not;
c. the resource consumer, i.e. the one who wants to discover the resource through some kind of querying, subscription to feeds, or a suggestion mechanism.
All use cases of the three validation scenarios will be classified along the two axes, in order to visualize them on the plane of beneficiaries and improvement types (see Fig. 5). Certainly, not all types of improvements are relevant for all beneficiaries, and not all combinations are in the scope of the project. The core interests of TaToo remain in (a) process and performance benefits for the publisher (shown in dark gray), and (b) process improvements for the resource publisher and performance improvements for the resource consumer (shown in light gray).
Fig. 5. The TaToo project Validation Matrix (The ids of the use-cases are examples)
6 Conclusions

We are currently at the beginning of the project, and there will still be adjustments to the initial architecture and to the related requirements stemming from the validation scenarios; but what we clearly envisage, and what can be read from the sections above, is the aim to mitigate the burden that providing meta-information places on system developers. Our aim is to do that in a community-driven way, i.e. by allowing different communities as well as individuals to add information to resources, related to their view on them and embedded in their respective semantic environment. At the end of the project, TaToo users will benefit from having enriched their resources with semantics and embedded them in a semantic framework, to be shared and reused within their scientific community. From the technical point of view, TaToo will realise a framework able to cope with the specific needs of the TaToo validation scenarios, based on specific environmental domain ontologies. At the beginning the main technical focus will be on semantically supported tagging, search and discovery functionality, but the long-term vision is to provide a general framework able to support the entire environmental domain.

Acknowledgment. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 247893T.
References
1. Gordon, M., Pathak, P.: Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management 35(2), 141–180 (1999)
2. Rizzoli, A.E., Schimak, G., et al.: TaToo: Tagging environmental resources on the web by semantic annotations. In: Swayne, D.A., Yang, W., Voinov, A.A., Rizzoli, A., Filatova, T. (eds.) Proceedings of the International Environmental Modelling and Software Society (iEMSs) 2010 International Congress on Environmental Modelling and Software: Modelling for Environment's Sake, Fifth Biennial Meeting, Ottawa, Canada (2010), http://www.iemss.org/iemss2010/index.php?n=Main.Proceedings
3. Villa, F., Athanasiadis, I.N., Rizzoli, A.E.: Modelling with knowledge: A review of emerging semantic approaches to environmental modelling. Environmental Modelling and Software 24(5), 577–587 (2009)
4. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
5. Gruber, T.: A translation approach to portable ontology specifications. Knowledge Acquisition 5 (1993)
6. Wache, H., Scholz, T., Stieghahn, H., König-Ries, B.: An integration method for the specification of rule-oriented mediators. In: Proceedings of DANTE 1999, Kyoto, Japan (1999)
Exploiting CReP for Knowledge Retrieval and Use in Complex Domains

Lorenza Manenti and Fabio Sartori

Department of Computer Science, Systems and Communication (DISCo)
University of Milan - Bicocca
viale Sarca, 336, 20126 Milan (Italy)
Tel.: +39 02 64487913; Fax: +39 02 64487839
{manenti,sartori}@disco.unimib.it
Abstract. Case-Based Reasoning (CBR) is a Knowledge Management approach that consists in the development of decision support systems where problems are solved by analogy with similar problems solved in the past. In this way, the system supports users in finding solutions without starting from scratch. CBR has become a very important research topic in Artificial Intelligence, with the definition of methodologies and architectural patterns for supporting developers in the design and implementation of case-based systems. The paper presents one of these frameworks, namely CReP, an on-going research project of the Artificial Intelligence Laboratory (L.Int.Ar.) of the University of Milan-Bicocca, focusing on the integration between the CBR paradigm and a metadata approach to obtain domain-independent definitions of case structures and retrieval algorithms.
1 Introduction

Over the last decade the Case-Based Reasoning (CBR) methodology [1] has been successfully applied to a wide range of problem domains and has become one of the most important research areas in the field of Artificial Intelligence. The main reason for its success is that the CBR model allows finding solutions to difficult problems even when a well defined knowledge model is absent. The Case-Based Reasoning approach is founded on reasoning by analogy (i.e. similar problems have similar solutions), summarized in the well known 4R's cycle [2]: a description of the new problem is given, then it is compared, according to a similarity algorithm, to the descriptions of similar problems already solved and stored in the case base. The most similar problem description is then retrieved and its solution is reused as a first attempt to solve the new problem without starting from scratch. If the reused solution does not fit the problem, a revise step can be applied to adapt it. Finally, the new problem description and its (possibly revised) solution are retained in the case base. This paper describes CReP (Case Retrieval Platform), a Java-based framework that allows developers to build CBR applications in an easy way. It provides a model and some tools to describe cases, similarity functions on case
description parts, and how they must be put together by an aggregation function to obtain the global similarity value. CReP is the result of more than ten years of experience in the design and development of Knowledge Management applications according to the CBR approach (e.g. P-Race [3], P-Truck [4] and Symphony [5]). CReP works on experience representations given in the form of cases. According to the Case-Based Reasoning model, a case is the basic element for memorizing lessons learned, and the most similar cases are retrieved to reuse their solutions whenever a new case arises. The framework thus allows the description and development of systems able to retrieve cases from the case memory. Describing a problem is one of the main activities in the definition of a CBR system: the choice of significant attributes is essential for the definition of a suitable similarity function among cases, which is fundamental to retrieve past situations similar to the current one. A problem C is usually viewed as an n-tuple of features C = {f1, f2, ..., fn}. This structure may be fixed or variable [6], depending on the complexity of the involved knowledge. The CReP platform allows the user to describe problems exploiting a hierarchical structure, with great benefits from the complex knowledge representation point of view. According to the CBR literature (see e.g. [1]), each case in CReP is divided into three special subtrees, called Case Parts: the Case Description (the description of the problem), the Case Solution (the solution adopted to solve that particular problem) and the Case Outcome (the results coming from the application of that particular solution to that particular problem). Each of these three elements can be further described by means of the Inner Node and Outer Node concepts: an Inner Node is an internal node of the structure, while an Outer Node is a leaf of the tree structure (see Figure 1). In order to increase the generality level of CReP and to allow its use to cover a wider range of domains, a metadata description of the case structure elements has been given. The CReP metadata schema defines every significant component of the platform, from the atomic elements of the case description (i.e. attributes and values) to the most complex concepts, like the case base and the similarity and aggregation functions. The main assumption made in CReP is that each experience can be represented as a case composed of case elements. Every case element is devoted to a specific part of a case structure, and is associated with a type that indicates its range of values. As a consequence, each case is a finite list of (case element, value) couples, representing either Inner Nodes or Outer Nodes. A finite and non-empty collection of cases constitutes the case base, i.e. the set of experiences used by CReP during the retrieval phase. The retrieval strategy implemented in the CReP framework is based on the classical K-Nearest Neighbor algorithm [7]: the computation of the similarity rate on a single node denotes a local similarity, limited to the definition of the node considered. The global similarity related to the whole case has to involve all the nodes belonging to the subtree of the Case Description. In order to give a global similarity rate, inner nodes are also associated with an aggregation function (e.g. average, weighted average). The similarity rate related to an inner node is the result of the application of the aggregation
function both to the similarity values of all its subtrees and to the result of the local similarity evaluation. In this way, given two cases, it is always possible to calculate their similarity rate. Moreover, since a collection of case structures (i.e. the structure base) can be defined in a CReP-based system, it is possible to define different views for case interpretation. This means that, by varying the context in which the similarity computation is performed, different case structures can be considered; the calculation of the similarity between two cases then gives different results according to the context in which the computation is made. The consequence of this feature of CReP-based systems is that the results of the retrieval phase change as the side conditions of the problem change. The paper is organized as follows: Section 2 describes the conceptual model of metadata adopted in CReP. Section 3 briefly describes how a computational model has been derived from the conceptual one. Finally, some conclusions and future work are briefly pointed out in Section 4.
Fig. 1. Representation of a case hierarchical structure
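To make the tree-structured case model of Figure 1 more concrete, the following Java fragment sketches, in a deliberately simplified form, how a case could be organized into description, solution and outcome subtrees of inner and outer nodes. The class names and the sample attributes are chosen for illustration only; they are not the actual CReP API.

```java
import java.util.List;
import java.util.Map;

/** Simplified, illustrative sketch of the hierarchical case structure (not the CReP API). */
public class CaseStructureSketch {

    /** A node of the case structure: either an inner node (category) or an outer node (attribute). */
    sealed interface StructNode permits InnerNode, OuterNode {}

    record OuterNode(String name) implements StructNode {}                                // leaf / attribute
    record InnerNode(String name, List<StructNode> children) implements StructNode {}     // category with subtrees

    /** A case: three case parts plus the attribute values, indexed by attribute name. */
    record Case(InnerNode description, InnerNode solution, InnerNode outcome, Map<String, Object> values) {}

    public static void main(String[] args) {
        InnerNode description = new InnerNode("CaseDescription", List.of(
                new InnerNode("Weather", List.of(new OuterNode("temperature"), new OuterNode("humidity"))),
                new OuterNode("trackType")));
        InnerNode solution = new InnerNode("CaseSolution", List.of(new OuterNode("compoundChoice")));
        InnerNode outcome  = new InnerNode("CaseOutcome",  List.of(new OuterNode("raceResult")));

        Case c = new Case(description, solution, outcome,
                Map.of("temperature", 27.5, "humidity", 0.4, "trackType", "asphalt",
                       "compoundChoice", "soft", "raceResult", "podium"));
        System.out.println(c);
    }
}
```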
2 CReP: The Conceptual Model of Metadata

2.1 Definition of Past Experience

The first problem in CBR system development is how to computationally describe past experiences using cases. Knowledge contained in a case must be structured and organized in order to achieve better expressiveness and performance: as briefly introduced above, in CReP a case is a collection of case elements organized in a tree-structured hierarchy.

Definition 1. A case element ce is a member of the CaseElement set:

∀ce ∈ CaseElement, ce = (id, t, n) : id ∈ Z+ − {0}, t ∈ T, n ∈ String

where:
– id is the identifier of the case element;
– T is the set of possible types;
– t identifies a specific type (e.g. String, Integer);
– n is the name used to describe the case element.

Given ce ∈ CaseElement with ce = (id, t, n), we write id(ce) = id ∧ t(ce) = t ∧ n(ce) = n.
Cases belonging to the same context must be collected in a particular structure, named the case base: a CaseBase = {c1, .., cn}, 1 ≤ n < ∞, is a finite and non-empty collection of cases.

Definition 2. A case c ∈ CaseBase is defined as a set of couples:

∀c ∈ CaseBase, c = {(ce1, v1), .., (cen, vn)} : ∀(cei, vi), cei ∈ CaseElement ∧ t(cei) = ti ∧ vi ∈ ti ∪ {⊥}

where ⊥ denotes the null value. Moreover, ∀c ∈ CaseBase, vce(c) = v, where v ∈ t(ce) ∧ (ce, v) ∈ c ∧ ce ∈ CaseElement.

In a case base, each case must be organized following the tree structure presented above. In this structure, inner nodes and outer nodes can be identified (Fig. 1): outer nodes, also named attributes, correspond to the leaves, whereas inner nodes represent the categories to which the attributes belong. In order to define precisely the structure of a case, we must first introduce the struct base concept.

Definition 3. A StructBase = {s1, .., sn}, 1 ≤ n < ∞, is a finite and non-empty collection of case structures used to interpret the value assigned to each case element ce in a case c:

∀c ∈ CaseBase, ∃s ∈ StructBase : struct(c) = s

In CReP, only one structure can be defined for each case base, so each case c ∈ CaseBase has the same structure s. Starting from the struct base definition, it is possible to identify the three parts of a generic case defined in [1]: ∀x ∈ StructBase, x = (d, sol, o), where:
– d(x) = d denotes the problem description part of the case;
– sol(x) = sol denotes the solution part of the case;
– o(x) = o denotes the outcome part of the case.

In CReP, a case structure is a tree-structured graph of struct nodes, and both inner and outer nodes are struct node elements.

Definition 4. A generic StructNode sn is defined by a couple sn = (ce, sf), where ce ∈ CaseElement and sf is a particular function named Similarity Function.

A Similarity Function sf is a mathematical function that compares the values associated to case elements in different cases and returns their similarity degree. Formally:
Definition 5. Given a struct node sn, a similarity function sf exists and is defined on the type t in the following way:

∀sf ∈ SimilarityFunction, sf : t × t → [0, 1], where t ∈ T.

The Outer Node (ou) concept can be defined precisely following the struct node definition (i.e. ou = (ce, sf)), whereas it is necessary to introduce a formal definition for Inner Nodes.

Definition 6. An InnerNode in is defined as in = (ce, sf, af, subStructNodes), where ce ∈ CaseElement, sf ∈ SimilarityFunction, af is a particular function named Aggregation Function, and subStructNodes = (sn1, .., snn) is a collection of struct nodes with 1 ≤ n < ∞.

The aggregation function af works on the similarity degrees of the children nodes and on the sf value of the current node.

Definition 7. Given an inner node in, an aggregation function af exists and ∀af ∈ AggregationFunction, af : [0, 1]* → [0, 1].

Starting from the formal case definition, it is possible to describe new cases: each time a new instance of a problem has to be solved, a new case describing the instance has to be created.

Definition 8. A new case c is defined in the following way:

∀ new case c ∉ CaseBase, c = {(ce1, v1), .., (cen, vn)} : ∀(cei, vi), cei ∈ CaseElement ∧ vi ∈ t(cei) ∪ {⊥} ∧ (∀cei ∈ CE(d) ⇒ vi ≠ ⊥ ∧ ∀cei ∉ CE(d) ⇒ vi = ⊥)

where struct(c) = s, d(s) = d, and CE(d) denotes the set of all case elements involved in nodes or leaves of the structure that describes d.
2.2 Structures and Similarity Functions
As written before, in CReP each case is organized following a tree structure that describes the problem description part of the case; the solution and outcome parts are independent of the tree structure. Each structure is composed of two kinds of elements:
– nodes: to add a node to the structure it is necessary to define:
  • the node description (a name for the node);
  • the similarity function to use on the node values;
  • the aggregation function to use on the similarity values;
– relations between two nodes: they can be either “part-of” or “is-a”.

In CReP, there is a set of similarity and aggregation functions defined a priori that users can exploit. The first type of similarity functions works on strings: they compare strings in a case-sensitive or case-insensitive way and produce as result 1 (if the strings are equal) or 0 (otherwise). The second type works on number intervals (both integer and double values); it is necessary to define a “min” value and a “max” value, and the similarity value on a struct node sn between two cases x and y is established as follows:

\[ sf(sn)(v_{ce}(x), v_{ce}(y)) = 1 - \frac{|v_{ce}(x) - v_{ce}(y)|}{max - min} \]

The results of the similarity functions are used by the aggregation functions: it is necessary to aggregate the similarity degrees of the children nodes and the similarity degree of the current inner node in order to obtain a single value for each subtree. To this scope, CReP defines two kinds of aggregation functions: the average function and the weighted average function. Both of them either take the similarity degree of the inner node itself into account (if the root is included in the calculation, by means of the “Function Root” modality) or not (if the root is excluded, by means of the “Function No Root” modality). Finally, it is possible to formally describe the similarity degree in a recursive way:
Definition 9. Given x ∈ CaseBase and a new case y, the similarity evaluation between the cases x and y is:

\[ SIM(x, y, root(struct(x))) = SIM(x, y, d(struct(x))) \]

\[ SIM(x, y, sn) = \begin{cases} sf(sn)(v_{ce}(x), v_{ce}(y)) & \text{outer node} \\ af(sn)\big(SIM(x, y, sn_1), \ldots, SIM(x, y, sn_n), sf(sn)(v_{ce}(x), v_{ce}(y))\big) & \text{inner node} \end{cases} \]

where (sn_1, .., sn_n) = subStructNodes(sn) and ce = ce(sn).

It is then possible to define the aggregation functions. Given x ∈ CaseBase, a new case y, and ce = ce(sn):

– Average Root function: for each inner node in with struct node sn_in

\[ sim_{in} = \frac{\sum_{i=1}^{n} sf(sn_{ch_i})(v_{ce}(x), v_{ce}(y)) + sf(sn_{in})(v_{ce}(x), v_{ce}(y))}{n + 1} \]

where n is the number of children nodes of in, with struct nodes sn_{ch_1}, .., sn_{ch_n};

– Average No Root function: for each inner node in

\[ sim_{in} = \frac{\sum_{i=1}^{n} sf(sn_{ch_i})(v_{ce}(x), v_{ce}(y))}{n} \]

where n is the number of children nodes of in, with struct nodes sn_{ch_1}, .., sn_{ch_n};

– Weighted Root function: for each inner node in with struct node sn_in and weight w_in

\[ sim_{in} = \frac{\sum_{i=1}^{n} sf(sn_{ch_i})(v_{ce}(x), v_{ce}(y)) \, w_i + sf(sn_{in})(v_{ce}(x), v_{ce}(y)) \, w_{in}}{\sum_{i=1}^{n} w_i + w_{in}} \]

where n is the number of children nodes of in, with struct nodes sn_{ch_1}, .., sn_{ch_n} and weights w_1, .., w_n;

– Weighted No Root function: for each inner node in

\[ sim_{in} = \frac{\sum_{i=1}^{n} sf(sn_{ch_i})(v_{ce}(x), v_{ce}(y)) \, w_i}{\sum_{i=1}^{n} w_i} \]

where n is the number of children nodes of in, with struct nodes sn_{ch_1}, .., sn_{ch_n} and weights w_1, .., w_n.
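A minimal Java sketch of Definition 9 is shown below, restricted to the “Average No Root” aggregation and to the interval-based similarity function on numeric attributes; the class names, the sample attributes and the concrete values are illustrative assumptions, not the CReP implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch of the recursive similarity of Definition 9 (not the CReP API). */
public class SimilaritySketch {

    sealed interface StructNode permits InnerNode, OuterNode {}
    /** Leaf node: carries a local similarity function over the attribute values. */
    record OuterNode(String name, ToDoubleBiFunction<Object, Object> sf) implements StructNode {}
    /** Inner node: here aggregated with the plain average of the children similarities. */
    record InnerNode(String name, List<StructNode> children) implements StructNode {}

    /** Interval similarity on [min, max]: 1 - |x - y| / (max - min). */
    static ToDoubleBiFunction<Object, Object> interval(double min, double max) {
        return (x, y) -> 1.0 - Math.abs(((Number) x).doubleValue() - ((Number) y).doubleValue()) / (max - min);
    }

    /** SIM(x, y, sn): recursive global similarity over a case-description subtree. */
    static double sim(Map<String, Object> x, Map<String, Object> y, StructNode sn) {
        if (sn instanceof OuterNode leaf) {
            return leaf.sf().applyAsDouble(x.get(leaf.name()), y.get(leaf.name()));
        }
        InnerNode in = (InnerNode) sn;
        return in.children().stream().mapToDouble(child -> sim(x, y, child)).average().orElse(0.0);
    }

    public static void main(String[] args) {
        StructNode description = new InnerNode("CaseDescription", List.of(
                new OuterNode("temperature", interval(-10, 40)),
                new OuterNode("humidity", interval(0, 1))));
        Map<String, Object> pastCase = Map.of("temperature", 25.0, "humidity", 0.4);
        Map<String, Object> newCase  = Map.of("temperature", 30.0, "humidity", 0.5);
        System.out.printf("SIM = %.3f%n", sim(pastCase, newCase, description)); // prints SIM = 0.900
    }
}
```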
3 Metadata Implementation
The conceptual model of the CReP platform described above has finally been translated into an XML schema that defines the metadata for case representation and retrieval in the design and implementation of CBR systems. In particular, the schema defines all the components useful in the case base retrieval process: the elements are defined (in a top-down way) in order to guarantee the semantic specification of each component. Each element is associated with a complexType that lists all the sub-elements (with their name and type) it is composed of. The main element is CRePConf, which specifies all the components that are involved in the conceptual model of the CReP platform. In this section we describe part of the XML schema, starting from CRePConf and highlighting the principal elements useful for the retrieval step.
Aggregation and similarity functions are defined with the following semantics1:

1 Note that it is explicitly indicated that at least one aggregation function and one similarity function must be included in each case base.
The StructBase element defines the three parts of the case (caseDescription, caseSolution, caseOutcome), each described as an XMLCasePart element in the following way:
Each XMLCasePart, besides offering the possibility of specifying parameters for the aggregation and similarity functions, includes the representation of XMLInnerNode and XMLOuterNode elements to create the hierarchical structure, described as: