Foundations for the Web of Information and Services
Dieter Fensel Editor
Foundations for the Web of Information and Services: A Review of 20 Years of Semantic Web Research
Editor: Dieter Fensel, STI Innsbruck, ICT-Technologiepark, University of Innsbruck, Technikerstr. 21a, Innsbruck 6020, Austria
ISBN 978-3-642-19796-3
e-ISBN 978-3-642-19797-0
DOI 10.1007/978-3-642-19797-0
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011929496
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: deblik, Berlin
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword: A History of the Semantic Web
In putting together this volume in honor of Professor Rudi Studer, it was realized that the origins of much of this work were somewhat obscure, and that some history would help readers understand where the papers included in this volume come from. It was also realized that the number of key early projects in which Rudi participated would make it clear that he was an important player in the founding of this technology. By dint of being one of the other folks who was around “from the beginning,” I was asked if I would write up this brief introduction. In honor of Rudi, I’m happy to accept.
In the Beginning
It is hard to know who first had the idea of creating a language on the World Wide Web that could be used to express the domain knowledge needed to improve Web applications. By the mid 1990s, before most people even knew the Web existed, several research groups were playing with the idea that if web markup (it was all HTML back then) contained some machine readable “hints” to the computer, then we could do a better job of Web tasks like search, query, and faceted browsing. It’s important to note that at that time, the potential power of the Web was still being debated, and there were many who were sure it would fail. However, by 1997 or so, it was clear the Web was going to be around for a while, and there was a burst of energy going on. Various people were publishing algorithms suggesting that different approaches could be used for searching the Web than the traditional AI approaches, and it was around this time that Sergey Brin and Larry Page published their famous “PageRank” paper, which led to the creation of Google and the growth of the modern search engine.1
1 I mention this here as I often hear people saying that the Semantic Web was created to improve search. That is partly true, but it is important to note that search as we knew it back then, pre-Google, was not the same as the current keyword search that powers so much of the modern Web.
At this time, we also see the first “real” refereed publications coming out about machine-readable knowledge on the Web. One of these approaches was the SHOE (Simple HTML Ontology Extensions) project which I led at the University of Maryland.2 Another was the ONTOBROKER project led by Dieter Fensel and Rudi Studer.3 The slogan for the SHOE project, which continues to be a popular quote in Semantic Web circles, was on shirts we had printed ca. 1998: “A little semantics goes a long way.” This idea drove the early work in projects like SHOE and Ontobroker, and it is important to note that while these early projects looked at what we now call “web ontology languages,” they were driven less by the AI-inspired push for expressive languages, and more by the needs of the emerging Web, what we would now call semantic annotation or tagging.
About that same time, a research effort was growing in Europe that merged two main trends, an effort called XOL (XML Ontology Language) and the work growing out of Ontobroker. The new effort was named OIL (which at various times had slightly different acronyms, but mostly it became known as the Ontology Interchange Language). OIL had significant support from Description Logic (DL) researchers, which is to a large degree why many of the later ontology languages used DL for their logical modeling.
In parallel with this Web representation work, the World Wide Web Consortium (W3C) had begun to explore whether some sort of Web markup language could be defined to help bring data to the Web. The Meta Content Framework4 working group was drafting a language that was later to be named the Resource Description Framework (RDF). I will not recap the history of the split between XML and RDF; it was more political than technical, but suffice it to say it added some confusion to the industrial story.5
Increasing Research Interest
In 1999, I began a three-year position as a funding agent for the US Defense Advanced Research Projects Agency (DARPA) and convinced them to invest in this emerging technology area. My primary argument was that this could be used to help solve a lot of the DoD’s (and, of course, everyone else’s) data integration problems. To help sell the US government on funding this research area, the techniques pioneered in ONTOBROKER and SHOE were used to build some demos showing the potential for these new languages. Based on these demos, a project called the DARPA Agent Markup Language (DAML) was launched. MIT’s Semantic Web Advanced Development, led by Tim Berners-Lee, was funded under this program, with a proposal to base the emerging language on the emerging Resource Description Framework language. RDF, like SHOE, used Uniform Resource Identifiers (URIs) to name concepts, an important aspect of “webizing” the representation languages for the Web. Along the way, the community (both research and industrial) came to accept Tim’s name for this work: The Semantic Web. In actuality, it is worth noting that the Semantic Web was a realization of part of Tim’s original conception of the Web. In fact, in a 1994 talk he said:
Documents on the web describe real objects and imaginary concepts, and give particular relationships between them. . . . For example, a document might describe a person. The title document to a house describes a house and also the ownership relation with a person. . . . This means that machines, as well as people operating on the web of information, can do real things. For example, a program could search for a house and negotiate transfer of ownership of the house to a new owner. The land registry guarantees that the title actually represents reality.
2 http://www.cs.umd.edu/projects/plus/SHOE/.
3 http://ontobroker.semanticweb.org/.
4 http://www.w3.org/TR/NOTE-MCF-XML-970624/.
5 The 2002 paper “XML and the Semantic Web” addresses some of this background http://www.ulitzer.com/?q=node/40496.
As this work grew, it was decided that an effort was needed to bring together the key players in this emerging area. The outcome of this was a Dagstuhl Workshop entitled “Semantics for the Web” which was held in March of 2000.6 It was chaired by Dieter Fensel, Wolfgang Wahlster, Henry Lieberman and me. One of the first people we invited was Rudi, as we knew his group was beginning to do exciting work in this area. The workshop was quite successful, leading to an increasing realization that this new technology had significant potential.7
Also in 2000, I held a meeting with Hans-Georg Stork, then working for the European Commission funding AI research. We met to discuss the possibility of an international effort to bring forth a standard language, rather than to have competing US (DAML) and European (OIL) efforts.
Based on these discussions, and the approval of our respective organizations, a group called the “Ad hoc US/EU Working Group on Agent Markup Languages” was formed. (Rudi was an active member of this group.) Despite its unwieldy name, the group met on a regular basis and created a language that integrated the best features of the DAML language emerging from the DARPA program and the OIL language coming from the EU researchers. The resulting language, which was called DAML+OIL, became a de facto standard used in research efforts on both sides of the Atlantic.
Another important event in showing the academic respectability of the emerging field was the publication of the first Semantic Web thesis. Jeff Heflin, a student in the SHOE group at Maryland, extended the annotation language to include a rule-based reasoner, a Web scraper for extracting SHOE from nonannotated Web sites, a visual query-by-example system and a bunch of other things. His thesis8 included the first formal description of the Semantic Web, defined in terms of multiple ontologies linked together.
6 http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=00121.
7 Many of the papers from this workshop appeared in the MIT Press Book Spinning the Semantic Web, edited by the four workshop organizers.
8 Available online at http://www.cse.lehigh.edu/~heflin/pubs/heflin-thesis.pdf.
Not long after, several European theses on the use of semantics on the Web were completed (and of course one of the leading universities in this was the University of Karlsruhe under Rudi). These theses expanded the work well beyond Jeff’s start, adding technologies such as text-mining, ontology development and learning, automated reasoning, and many others. The growing community, funded by DARPA in the US and the Information Society Technologies (IST) Program9 in the EU, realized it needed to come together more formally, and in 2001 organized a symposium, held in San Francisco, that evolved the following year into the International Semantic Web Conference (ISWC), which has been held every year since. ISWC is run by an international organization called the “Semantic Web Science Association”10 (SWSA), and at the first meeting of this organization, Rudi was elected the President of SWSA, a role he filled until 2008 when, at the end of his second three-year term, he moved on to the Past President role.
Starting in 2001, a great deal of research went into developing ontologies on the Web, funded to a large degree by IST funding to researchers coming out of the AI community. While RDF development continued, and the redesigned RDF and RDF Schema became recommendations in 2004, there was a more visible effort going on in the world of ontology languages. Based on the DAML+OIL work, the World Wide Web Consortium created the “Web Ontology Working Group” (often referred to as WOW-G), which created the Web ontology language OWL. (OWL also became a W3C recommendation in 2004.) OWL remains the primary language being used in current AI-based Semantic Web work; its reference manuals have been translated into a number of languages, and there are a number of books available on OWL use. Rudi’s research groups, of course, were major players in the IST framework projects, in the design of OWL, and in a number of other important aspects of the emerging Web ontology world.
In short, no matter where one looked during the formative days of the Semantic Web, Rudi and/or members of a research group he ran were always present.
Forward to the Present
Since the goal of this foreword was primarily to discuss the earlier history of the Semantic Web, I won’t spend a lot of time on the intervening years from then to now. I do however want to talk about one aspect of the Semantic Web community that has emerged in the years since. As mentioned previously, much of the work in the early Semantic Web research community was funded in the AI space for work in ontologies. However, around 2006, as the weaknesses in pure social tagging became more and more evident, and as the Semantic Web query language, SPARQL,11 joined the W3C’s growing stack of Semantic Web recommendations and reports, Web application developers became more and more interested in the Semantic Web languages, but primarily the use of RDF as part of the Web architecture (perhaps enhanced with a few terms from RDFS and OWL, but primarily using very little OWL). This community, which is sometimes called the “linked data” community, because it takes much of its inspiration from the earlier “Web of Data” aspects of the Semantic Web, focuses much more on the scaling and Web application use of semantic technologies, and much less on the expressive ontologies of the research community.
9 This was later renamed the Information and Communication Technologies area, which funded the large Framework 6 and Framework 7 EU projects.
10 http://www.iswsa.org/.
At this point in 2010, both communities are healthy, although the linked-data community is showing up far more in applied work, as it makes use of the more mature parts of the Semantic Web technology stack and as it fits nicely with some of the business needs of the modern Web community. In recent days, there have been announcements of the use of RDF-based technologies by Facebook and Twitter; Semantic Search is being actively pursued at Google, Microsoft Live Labs and other large Web companies. A number of small “Web 3.0” companies have emerged, and many of these use some simple ontologies, and a lot of data, to try to gain an advantage over their competitors. Countries around the world are starting to explore the exporting of data to the linked data world, led by efforts in the UK (http://data.gov.uk) and the US (http://data.gov). This primarily non-academic use of Semantics has not been largely powered by expressive ontologies, but rather by simpler data structures and very simple vocabularies.
As I write this, Facebook’s “Open Graph Protocol” (OGP)12 is emerging as a widely used technology on the Web. The “ontology” (if one could call it that) of the OGP is very simple, consisting of a few main properties such as title, type, image and URL.
So, for example, the open government data project at RPI (http://logd.tw.rpi.edu) has a like button on it. The embedded RDFa reads: ...
This enables any user to click on the Web page and say (on their Facebook page) that they like this Web site. This may not seem very much like AI (it’s not), nor very much in the way of linked data (the type field is just a string), but these little bits of RDFa are starting to show up on thousands of Web sites, and it is not too unlikely that OGP will soon be the most used ontology in history, given the power of Facebook in the Web world. There is also growing interest in mechanisms for having the type fields contain references to online vocabularies (such as SKOS files), increasing the interest in these other, simple, Semantic Web products.
11 http://www.w3.org/TR/rdf-sparql-query/.
12 See http://developers.facebook.com/docs/opengraph for more details.
It probably will come as no surprise to the readers of this book that within the Semantic Web research world, there is some tension between those who believe that expressive KR is the future of the technology, and those who feel that studying scaling, data integration and the appropriate use of language and semantic data technologies is the way to go. Most research groups are heavily involved in one or the other of these, and it is clear that in many places, bringing together the linked-data and Semantic Web worlds will be a considerable effort. I was thus extremely pleased on a recent visit to “Semantic Karlsruhe,” where there are several research groups and many projects led by Rudi, to discover that not only were there a number of researchers working in each of these areas, but that they knew and respected each other’s work. Combining these technologies, and finding “middle ground” to explore in new Semantic Web projects, is clearly something we see happening there, a promising sign for the future of both sides of the Semantic Web community.
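To make concrete how small the Open Graph Protocol vocabulary described earlier in this section is, the following Python sketch renders the four properties named there (title, type, image and URL) as meta elements and reads them back, as a consuming application might. It is only an illustration under stated assumptions: the page values and the helper names render_ogp_meta and parse_ogp_meta are hypothetical, and the output is not the markup of the RPI page or of any other real site.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical page description: illustrative values only, not taken from
# any real Web site.
EXAMPLE_PAGE = {
    "title": "Example Open Data Project",
    "type": "website",
    "image": "http://example.org/logo.png",
    "url": "http://example.org/",
}


def render_ogp_meta(properties):
    """Render each property as a <meta> element whose name carries the og: prefix."""
    return "\n".join(
        '<meta property="og:{}" content="{}" />'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in properties.items()
    )


class OgpCollector(HTMLParser):
    """Collect og:* properties from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> elements are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:"):
            self.properties[name[len("og:"):]] = attributes.get("content") or ""


def parse_ogp_meta(html_text):
    """Return a dict of the og:* properties found in html_text."""
    collector = OgpCollector()
    collector.feed(html_text)
    return collector.properties


if __name__ == "__main__":
    markup = render_ogp_meta(EXAMPLE_PAGE)
    print(markup)
    print(parse_ogp_meta(markup))
```

Note that the type value is carried as a plain string, which is exactly the limitation pointed out above: nothing in this markup links it to a shared vocabulary.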
What Comes Next?
As the Semantic Web is starting to flourish in the applied research community and to transition from our laboratories to the “real world” of the Web, there is clearly a lot of discussion as to the future of the research enterprise in semantics. As this was not what I was asked to write about, I’ll hold my opinions on this for some later, more contentious, article. However, there is one opinion I hold that I have not heard anyone dispute: I believe that, as he has been since the beginning of the Semantic Web work, Rudi will be a leading researcher in whatever the future brings.
Troy, USA
August, 2010
Jim Hendler
Preface
I met Rudi more than 20 years ago, while I was searching for a place to do a PhD on Artificial Intelligence. Originally, I was more interested in Machine Learning but with his great spirit of non-directional leadership, Rudi slowly moved me in the direction of knowledge acquisition and knowledge engineering. What surprised me most about these areas was that they looked more like applied sociology than computer science, and only the recent web science adventure surprised me even more. Anyway, I followed his advice, for many years trying to discover what the “formal semantics” of these areas really were. After earning my PhD with Rudi and my interesting research period in Amsterdam, I really gained some insight and got excited about it. I focused on heuristic problem solvers and tried to answer the questions of why and when they are better than global problem solvers. In the end, all our life is about compromising results by restricting effort. This culminated in an exciting Habilitation that, aside from myself, very few people in the world have ever understood.1 Then Rudi shocked me once again. Basically, he told me that I should either focus on something simpler, or I should forget the idea of ever getting a Professorship. I did not like his message but I felt he was right. Reality was simply not ready for my genius. He proposed that I work on Ontologies. Frankly, I hated this suggestion as I regarded Ontologies as quite a boring area of Science and Engineering. First, they are only about data structure, where very few dynamic events happen. Second, most Ontologists have missed the last five hundred years of philosophical development that introduced the notion of an observer and his perspective on any world view. Even conservative physicists had to adopt this point of view nearly one hundred years ago. It was naive of me to assume to know THE model of reality, and to be surprised that others do not share this point of view.
That is really not appropriate to the state of philosophy after Descartes and others. Anyway, I had been infected with a virus in Amsterdam that generated an interesting potential of using Ontologies as flexible data schemas. It was the Web. Academics from the hypertext area found the Web primitive, academics from the database area found it even illegal, but I got caught by it. So in Karlsruhe in 1996, we began to work on extending HTML by means of adding semantics to textual and graphical information on the Web. Happily, we found similar work published by Jim Hendler’s group and through Tim Berners-Lee’s work, and then later the RDF work of W3C. This period of time significantly changed my life and I would like to thank Rudi for the great support he gave me for nearly a decade and for the great cooperation we have had since then. Therefore, it is my duty and pleasure to edit this “Festschrift” that has collected contributions from his colleagues and academic offspring.
1 My colleague Enrico Motta from Open University managed to implement some pieces of this grand vision.
The book starts with a foreword by Jim Hendler that you have already read if you are reading this book in linear order. He provides a historical view on the development of semantic web research and we would like to mention again his early work on SHOE that was a great encouragement for our work. We then collected six contributions from Rudi’s peers (and actually, one ‘super peer’ is included). The core usage of semantic technology is to provide scalable means to achieve interoperability in large, distributed, heterogeneous, and dynamic environments. The article by Haslhofer and Neuhold2 puts Rudi’s work in context by providing a retrospective on semantics and interoperability research as applied in computer science. His colleagues Oberweis, Schmeck, Seese, Stucky, and Tai from the Institute AIFB relate Rudi’s work to other areas of applied computer science such as Logic, Complexity Management, Efficient Algorithms, Organic Computing, and Business Process Management. There has always been a close link between semantic technology and database technology on the one hand, and knowledge technology on the other. One could even argue that there is a complexity chain of data, information, and knowledge where semantics is mostly busy with the intermediate item, i.e., information.
The article contributed by Lockemann, a colleague from the University of Karlsruhe, provides an excellent analysis of the commonalities and differences between database and Ontology technology in terms of efficiency and effectiveness. Personally, I think this article already makes this book a good buy! Van Harmelen and Ten Teije from the VU Amsterdam and Wache from the University of Applied Science in Switzerland take a look from the opposite angle, considering semantics from the knowledge technology perspective. Their work on knowledge-based web service selection discusses a pathway for reunifying heuristic problem solving with semantic web technology. The application of semantic technologies for knowledge management issues is discussed in the article by Davies (British Telecom), Warren (Eurescom), and Sure3 (GESIS and the University of Koblenz-Landau). Using semantics for knowledge management also indicates that the borderline between semantic and knowledge technology is at least as fuzzy as the borderline between database and semantic technology. Finally, Horrocks, from the University of Oxford, takes a technical view on the core of semantic technology, focussing on tools to work and reason with Ontologies.
2 Obviously, it is the role of Erich Neuhold to put the work of Rudi in context!
3 Each classification system has to deal with exceptions. York Sure is actually an academic offspring of Rudi.
These articles are followed by eleven contributions provided by Rudi’s academic offspring. Four articles are about the Web of data, information and knowledge, focusing on knowledge mining, knowledge networks, and knowledge diversity. The following four articles go beyond the static web of data and discuss the role semantic technology can play in the Web of software and services by modeling software, services, cloud computing, and event-driven architectures. Finally, applications of semantic technology are discussed for knowledge management scenarios.
I wish Rudi many more years of productivity and I am looking forward to cooperating with him in as many projects in the future as we have in the past.
Acknowledgment
I would like to thank Amy Anna Mary Strub for her excellent work as English proofreader, and Birgit Leiter for making it possible to publish this book on time. Furthermore, I would like to thank FZI, GESIS, KIT/AIFB, ontoprise GmbH, STI Innsbruck, STI International and WeST for their sponsorship and great support.
Innsbruck, Austria
November, 2010
Dieter Fensel
Sponsor Information
FZI
Application research in the fields of information technology, engineering and economics with reliable knowledge and technology transfer is the core business and expertise of FZI Forschungszentrum Informatik. As a non-profit institution, established 25 years ago by the State of Baden-Württemberg and the University of Karlsruhe (now Karlsruhe Institute of Technology) we deliver the results of scientific research directly to you. As an independent research institution, FZI works for companies and public institutions regardless of company size: from small business to large corporations, from local public administrations to the European Union.
GESIS
As the largest infrastructure facility in Germany, GESIS offers a variety of services related to the social sciences. Based on original research and experience, the scientific community finds a wide range of services, consultation, data, and information at all stations of the research data lifecycle from information research and survey design, data collection, archiving, registration, and provision to data analysis.
KIT/AIFB
The Karlsruhe Institute of Technology (KIT) is the merger of the former Universität Karlsruhe (TH) and the former Forschungszentrum Karlsruhe. With about 8000 employees and an annual budget of 700 million Euros, KIT is the largest technical research institution within Germany. The Institute AIFB (Applied Informatics and Formal Description Methods) at KIT is one of the world-leading institutions in Semantic Web technology. Approximately 20 researchers of the knowledge management research group are establishing theoretical results and scalable implementations for the field, closely collaborating with the sister institute KSRI (Karlsruhe Service Research Institute), the start-up companies ontoprise and fluid Operations, and the Knowledge Management group at the FZI Research Center for Information Technologies.
ontoprise GmbH
ontoprise GmbH is a leading provider of products, solutions and services in the area of semantic technologies. These make it possible to describe the meaning of information in machine-readable form, structure knowledge, record complex interrelationships and integrate distributed information. As a result, employees involved in knowledge-intensive processes are optimally supplied with the right information. Additionally, companies are empowered to draw the right conclusions from existing information.
STI Innsbruck
The Semantic Technology Institute (STI) Innsbruck (http://www.sti-innsbruck.at), formerly known as DERI Innsbruck, was founded in 2002 and has developed into a challenging and dynamic research institute. Through STI International, we collaborate with a network of international institutes and global industrial partners in Asia, Europe and the USA. Our major objective is to establish Semantic technologies as a core pillar of modern Computer Science, thereby providing interoperability and scalability for the web of data and services.
STI International
STI International is a global network carrying out research, education, innovation and commercialization activities on semantic technologies facilitating their deployment within industry and society at large. STI International is organized as a collaborative association of interested scientific, industrial and governmental parties that share a common vision. It sets up its own research infrastructure and implements public and internal services that support the individual partner organizations in their research collaboration, standardization, dissemination and exploitation activities.
WeST
The institute “WeST—Web Science and Technologies” works on issues related to the usage and the technologies of the World Wide Web. Researchers consider the technical aspects of the Web as a globally networked information system and set of information services, as well as the personal and social aspects of Web usage. They aim to understand the structure and evolution of the Web, to make the Web even more useful and to ensure its future prosperity. Germane to these aspects are the novel technologies of the Semantic Web, Web Retrieval, Multimedia Web, Interactive Web and the Software Web.
Contents

Part I Colleagues and Historical Roots

A Retrospective on Semantics and Interoperability Research
  Bernhard Haslhofer and Erich J. Neuhold
Semantic Web and Applied Informatics: Selected Research Activities in the Institute AIFB
  Andreas Oberweis, Hartmut Schmeck, Detlef Seese, Wolffried Stucky, and Stefan Tai
Effectiveness and Efficiency of Semantics
  Peter C. Lockemann
Knowledge Engineering Rediscovered: Towards Reasoning Patterns for the Semantic Web
  Frank van Harmelen, Annette ten Teije, and Holger Wache
Semantic Technology and Knowledge Management
  John Davies, Paul Warren, and York Sure
Tool Support for Ontology Engineering
  Ian Horrocks

Part II Academic Legacy

Combining Data-Driven and Semantic Approaches for Text Mining
  Stephan Bloehdorn, Sebastian Blohm, Philipp Cimiano, Eugenie Giesbrecht, Andreas Hotho, Uta Lösch, Alexander Mädche, Eddie Mönch, Philipp Sorg, Steffen Staab, and Johanna Völker
From Semantic Web Mining to Social and Ubiquitous Mining
  Andreas Hotho and Gerd Stumme
Towards Networked Knowledge
  Stefan Decker, Siegfried Handschuh, and Manfred Hauswirth
Reflecting Knowledge Diversity on the Web
  Elena Simperl, Denny Vrandečić, and Barry Norton
Software Modeling Using Ontology Technologies
  Gerd Gröner, Fernando Silva Parreiras, Steffen Staab, and Tobias Walter
Intelligent Service Management—Technologies and Perspectives
  Sudhir Agarwal, Stephan Bloehdorn, and Steffen Lamparter
Semantic Technologies and Cloud Computing
  Andreas Eberhart, Peter Haase, Daniel Oberle, and Valentin Zacharias
Semantic Complex Event Reasoning—Beyond Complex Event Processing
  Nenad Stojanovic, Ljiljana Stojanovic, Darko Anicic, Jun Ma, Sinan Sen, and Roland Stühmer
Semantics in Knowledge Management
  Andreas Abecker, Ernst Biesalski, Simone Braun, Mark Hefke, and Valentin Zacharias
Semantic MediaWiki
  Markus Krötzsch and Denny Vrandečić
Real World Application of Semantic Technology
  Juergen Angele, Hans-Peter Schnurr, Saartje Brockmans, and Michael Erdmann
Contributors
Andreas Abecker FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected]; disy Informationssysteme GmbH, Karlsruhe, Germany,
[email protected] Sudhir Agarwal Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany,
[email protected] Juergen Angele ontoprise GmbH, An der RaumFabrik 29, 76227 Karlsruhe, Germany,
[email protected] Darko Anicic FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Ernst Biesalski EnBW Energie Baden-Württemberg AG, Karlsruhe, Germany,
[email protected] Stephan Bloehdorn Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany,
[email protected] Sebastian Blohm Microsoft Search Technology Center Europe, Munich, Germany,
[email protected] Simone Braun FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Saartje Brockmans ontoprise GmbH, An der RaumFabrik 29, 76227 Karlsruhe, Germany,
[email protected] Philipp Cimiano Semantic Computing Group, Cognitive Interaction Technology Excellence Center (CITEC), University of Bielefeld, Bielefeld, Germany,
[email protected] John Davies Future Business Applications and Services, BT Innovate and Design, British Telecommunications Plc., Ipswich, UK,
[email protected]
Stefan Decker Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland,
[email protected] Andreas Eberhart Fluid Operations, 69190 Walldorf, Germany,
[email protected] Michael Erdmann ontoprise GmbH, An der RaumFabrik 29, 76227 Karlsruhe, Germany,
[email protected] Eugenie Giesbrecht FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Gerd Gröner Institute for Web Science and Technologies, University of Koblenz-Landau, Universitätsstrasse 1, Koblenz 56070, Germany,
[email protected] Peter Haase Fluid Operations, 69190 Walldorf, Germany,
[email protected] Siegfried Handschuh Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland,
[email protected] Frank van Harmelen Dept. of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands,
[email protected] Bernhard Haslhofer Department of Distributed and Multimedia Systems, University of Vienna, Liebiggasse 4/3-4, 1010 Vienna, Austria,
[email protected] Manfred Hauswirth Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland,
[email protected] Mark Hefke Manager Innovation & Business Development, CAS Software AG, Karlsruhe, Germany,
[email protected] Ian Horrocks Department of Computer Science, University of Oxford, Oxford, UK,
[email protected] Andreas Hotho Data Mining and Information Retrieval Group, University of Würzburg, 97074 Würzburg, Germany,
[email protected] Markus Krötzsch Oxford University Computing Laboratory, University of Oxford, Oxford, UK,
[email protected] Uta Lösch Karlsruher Institut für Technologie, Karlsruhe, Germany,
[email protected] Steffen Lamparter Siemens AG, Corporate Technology, Munich, Germany,
[email protected]
Peter C. Lockemann Karlsruhe Institute of Technology (KIT), Department of Informatics, 76128 Karlsruhe, Germany,
[email protected]; Forschungszentrum Informatik, Karlsruhe, Germany Alexander Mädche Institute for Enterprise Systems (InES), University of Mannheim, Mannheim, Germany,
[email protected] Eddie Mönch ontoprise GmbH, Karlsruhe, Germany,
[email protected] Jun Ma FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Erich J. Neuhold Department of Distributed and Multimedia Systems, University of Vienna, Liebiggasse 4/3-4, 1010 Vienna, Austria,
[email protected] Barry Norton Institut AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany,
[email protected] Daniel Oberle SAP Research, 76131 Karlsruhe, Germany,
[email protected] Andreas Oberweis Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany,
[email protected] Fernando Silva Parreiras Institute for Web Science and Technologies, University of Koblenz-Landau, Universitätsstrasse 1, Koblenz 56070, Germany,
[email protected] Hartmut Schmeck Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany,
[email protected] Hans-Peter Schnurr ontoprise GmbH, An der RaumFabrik 29, 76227 Karlsruhe, Germany,
[email protected] Detlef Seese Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany,
[email protected] Sinan Sen FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Elena Simperl Institut AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany,
[email protected] Philipp Sorg Karlsruher Institut für Technologie, Karlsruhe, Germany,
[email protected] Roland Stühmer FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Steffen Staab Institute for Web Science and Technologies, University of Koblenz-Landau, Universitätsstrasse 1, Koblenz 56070, Germany,
[email protected] Ljiljana Stojanovic FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected] Nenad Stojanovic FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected]
Wolffried Stucky Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany,
[email protected] Gerd Stumme Knowledge & Data Engineering Group, University of Kassel, 34121 Kassel, Germany,
[email protected]; L3S Research Center, Hannover, Germany York Sure GESIS—Leibniz Institute for the Social Sciences and Institute WeST, University of Koblenz-Landau, Mannheim, Germany,
[email protected]; Professor for Applied Informatics in the Social Sciences, University of Koblenz-Landau, Koblenz, Germany,
[email protected] Stefan Tai Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany,
[email protected] Annette ten Teije Dept. of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands,
[email protected] Johanna Völker KR & KM Research Group, University of Mannheim, Mannheim, Germany,
[email protected] Denny Vrandečić Institut AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany,
[email protected] Holger Wache School of Business, University of Applied Sciences Northwestern Switzerland (FHNW), Brugg, Switzerland,
[email protected] Tobias Walter Institute for Web Science and Technologies, University of Koblenz-Landau, Universitätsstrasse 1, Koblenz 56070, Germany,
[email protected] Paul Warren Eurescom GmbH, Heidelberg, Germany,
[email protected] Valentin Zacharias FZI Forschungszentrum Informatik, Karlsruhe, Germany,
[email protected]
Part I
Colleagues and Historical Roots
A Retrospective on Semantics and Interoperability Research Bernhard Haslhofer and Erich J. Neuhold
Abstract Interoperability is a qualitative property of computing infrastructures that denotes the ability of sending and receiving systems to exchange and properly interpret information objects across system boundaries. Since this property is not given by default, the interoperability problem and the representation of semantics have been an active research topic for approximately four decades. Early database models such as the Relational Model used schemas to express semantics and implicitly aimed at achieving interoperability by providing programming independence of data storage and access. Thereafter the Entity Relationship Model was introduced providing the basic building blocks of modeling real-world semantics. With the advent of distributed and object-oriented databases, interoperability became an obvious need and an explicit research topic. After a number of intermediate steps such as hypertext and (multimedia) document models, the notions of semantics and interoperability became what they have been over the last ten years in the context of the World Wide Web. With this article we contribute a retrospective on semantics and interoperability research as applied in major areas of computer science. It gives domain experts and newcomers an overview of existing interoperability techniques and points out future research directions.
B. Haslhofer, Department of Distributed and Multimedia Systems, University of Vienna, Liebiggasse 4/3-4, 1010 Vienna, Austria, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_1, © Springer-Verlag Berlin Heidelberg 2011

1 Introduction
Whenever an application processes data it must reflect the meaning, that is, the semantics, of these data. Since this awareness is not given by default, the application designer needs to define a model, identify and structure atomic data units, and describe their meaning. Only if an application is aware of the structure and semantics of data can it process them correctly. In this context, we often find the distinction between data, information, and knowledge, which has been the subject of intensive discussions in the Information Science literature for years. For a more comprehensive and up-to-date discussion of these terms we refer to Rowley [53]. Here, we simply define data as being symbols without any meaning and information objects as being a collection of data that carry semantics, which is a pre-condition for correct interpretation.
Interoperability problems arise when distinct applications communicate and exchange information objects with each other: often the structure and semantics of these objects are defined by autonomous designers, each having an individual interpretation of the real world in mind. When an object leaves the boundary of a sending system or application, its interpretation in a receiving application is often not possible due to the heterogeneities between the involved applications.
The problem of how to represent semantics and how to establish interoperability between information objects in distinct autonomous, distributed, and heterogeneous information systems has been a central and very active topic in database and information systems research throughout the past four decades. While the motivation in early database systems was to achieve data independence and interoperation for data-oriented applications, the topic has become increasingly important with the advent of distributed (multimedia) databases and information systems. Today it is still a major research issue in the largest currently existing (multimedia) information system, the World Wide Web.
The heterogeneities that impede systems and applications from being interoperable were investigated several times in different domains (e.g., [50, 54, 60, 62]). Although the notions vary, we can broadly categorize them as follows:
– Technical Heterogeneities: denote all system platform and exchange protocol differences that prevent applications from sending and receiving information objects.
– Structural and Syntactic Heterogeneities: occur when data units in information objects are represented using different structures and syntactic conventions.
– Semantic Heterogeneities: are conflicts that occur because of the differences in the semantics of data units.
Analogous to these heterogeneity definitions we can define the various types of interoperability that can be achieved: technical, structural and syntactic, and semantic interoperability. In the following, when we use the term interoperability, we mainly refer to the latter two notions.
Before proceeding with our analysis of the various approaches that were developed for achieving interoperability, we introduce an illustrative example, which we will use throughout this work to explain the technical characteristics of these approaches. We assume a scenario in which two film studios, denoted as Studio A and Studio B, independently set up internal movie databases. Over the years both studios collected a large amount of data about movies; they have now decided to share and exchange these data. Figure 1 depicts the differences in how these two studios represent information about the same real-world movie. We assume that an actor can play in several movies and that a movie can have several actors. The notation we are using here is abstract and represents only the available information. It is not bound to any semantic modeling technique because this is what we want to achieve in the subsequent sections.
Fig. 1 Illustrative Example. Studio A records for each Movie its title, the year when it was first presented, the genre, its length, and the stars playing in the movie. Studio B records for each Film the title, the releaseYear, the genre, the length, and for each starring role the name and birthDate of the Actor
Fig. 2 Semantics and Interoperability Research in Computer Science
The aim of this chapter is to provide a retrospective on the developments in semantics and interoperability research throughout the past four decades from the perspective of database and information system research. Solutions developed by other disciplines (e.g., Information Retrieval, Data Mining, or Artificial Intelligence), which of course encounter similar problems, are outside the scope of this chapter. We will present a selected but, as we believe, representative set of approaches that enable the expression of data semantics and/or allow us to deal with the heterogeneities between applications. Our illustrative example will help us to explain the technical characteristics of some of these approaches. As illustrated in Fig. 2, we start our retrospective in the early 1970s and present early database models in Sect. 2. Then, in Sect. 3, we move on to distributed databases and object-oriented database models, which allow application-oriented and context-dependent design of databases. In Sect. 4, we describe major models and languages for the representation of semantics in distributed and heterogeneous information systems. Then, in Sect. 5, we describe the Semantic Web and the ideas behind the ongoing Linked Data movement as a way to represent data semantics on the Web. Finally, we summarize our retrospective in Sect. 6 and give an outlook on future research topics in the area of semantics and interoperability research.
2 Early Database Models Very early in the development of file systems and databases it was realized that a model-driven approach to data storage would allow a better separation between the data stores and the application programs using those data. In a way, this so-called data independence was a first step towards data-oriented interoperation of programs. At the same time this data independence brought explicit semantics into play in the sense that a data model reflected the real world and allowed programmers and end users to better share and understand the meaning of those data and therefore to utilize them more effectively. In this section, we first focus on the Relational Model (Sect. 2.1), which forms the current predominant formal basis for modern database systems. Then we describe the Entity Relationship Model (Sect. 2.2) and other related logical and conceptual data models (Sect. 2.3) from that period.
2.1 The Relational Model In the 1970s, a large number of modeling approaches were proposed, and quite a number of them are still in use today. The Relational Model [19] had a seminal influence on this field because the simplicity of the table-oriented visualization allowed easy understanding and use of the data in a data storage independent way. Each row (tuple) in a relational table describes an entity with named attributes. With its keys, as identifiers, and normal forms (2nd, 3rd, Boyce-Codd etc.), representing functional dependencies, early examples of semantics, i.e., reflections on the properties of the real world, became expressible. Figure 3 shows our illustrative example represented in the Relational Model. After the invention of the Relational Model, several efforts to develop languages for manipulating and retrieving data stored in relational database management systems (RDBMS) started. Initial proposals such as SEQUEL [15], which was developed by IBM, and QUEL, which was part of the INGRES effort [55], merged into the standardized Structured Query Language (SQL), which, until today, has remained the predominant data definition and manipulation language for RDBMS. The goal of the SQL standardization was to provide interoperability for applications so that they could access and manipulate data independently from the underlying RDBMS implementation. Today there are three different SQL standards (SQL, SQL2, SQL-99) in existence and several vendor-specific dialects, which is a major drawback from an interoperability perspective. Soon it became clear that the Relational Model was too restrictive to allow for an easy expression of more sophisticated semantic situations that would be needed when designing databases for multiple applications and usage environments. As a consequence, semantically richer models were developed. 
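The difference between a single wide relation and a decomposed schema can be made concrete with a small sketch. The following Python snippet uses the standard sqlite3 module to build one relation in the style of Studio A and three relations in the style of Studio B. All table names, column names, key choices, and sample rows are assumptions chosen for illustration; they follow the running example only loosely and are not taken from the original figures.

```python
import sqlite3


def make_studio_a() -> sqlite3.Connection:
    """Studio A style: movie data and stars in one relation (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Movie ("
        " Title TEXT, Year INTEGER, Genre TEXT, Length INTEGER, Star TEXT,"
        " PRIMARY KEY (Title, Year, Star))"
    )
    # The movie's genre and length are repeated once per star (redundancy).
    conn.executemany(
        "INSERT INTO Movie VALUES (?, ?, ?, ?, ?)",
        [
            ("Example Movie", 1999, "Drama", 120, "Star One"),
            ("Example Movie", 1999, "Drama", 120, "Star Two"),
        ],
    )
    return conn


def make_studio_b() -> sqlite3.Connection:
    """Studio B style: decomposed relations linked by keys (assumed key choices)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Film (
            Title TEXT, ReleaseYear INTEGER, Genre TEXT, Length INTEGER,
            PRIMARY KEY (Title, ReleaseYear));
        CREATE TABLE Actor (
            Name TEXT, BirthDate TEXT,
            PRIMARY KEY (Name, BirthDate));
        CREATE TABLE Starring (
            Title TEXT, ReleaseYear INTEGER, Name TEXT, BirthDate TEXT,
            PRIMARY KEY (Title, ReleaseYear, Name, BirthDate),
            FOREIGN KEY (Title, ReleaseYear) REFERENCES Film (Title, ReleaseYear),
            FOREIGN KEY (Name, BirthDate) REFERENCES Actor (Name, BirthDate));
        INSERT INTO Film VALUES ('Example Movie', 1999, 'Drama', 120);
        INSERT INTO Actor VALUES ('Star One', '1960-01-01');
        INSERT INTO Actor VALUES ('Star Two', '1970-02-02');
        INSERT INTO Starring VALUES ('Example Movie', 1999, 'Star One', '1960-01-01');
        INSERT INTO Starring VALUES ('Example Movie', 1999, 'Star Two', '1970-02-02');
        """
    )
    return conn


def distinct_genres(conn: sqlite3.Connection, table: str, title: str) -> set:
    """Return all genres stored for a title; more than one signals an anomaly."""
    if table not in {"Movie", "Film"}:  # fixed internal names only, never user input
        raise ValueError(table)
    rows = conn.execute(f"SELECT DISTINCT Genre FROM {table} WHERE Title = ?", (title,))
    return {genre for (genre,) in rows}
```

Updating the genre in only one of Studio A's rows leaves two contradictory genres for the same movie, which is the update anomaly mentioned for the single-relation design. In the decomposed design the genre is stored once, so a single update keeps the data consistent, at the price of the wider composite keys carried in the Starring relation.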
One of the first conferences oriented strongly towards semantics was the IFIP TC 2 Working Conference on Database Management Systems held in 1974 in Corsica. There, Abrial introduced the Binary Relational Model [4] by defining objects as models of concrete or abstract objects of the real world and binary relations between them. In doing so he
Fig. 3 Relational Model Sample. This example shows how Studio A and B could structure their data in relations. Studio A stores the information about movie and stars in a single relation, which can lead to data redundancies as well as update and deletion anomalies. Studio B decomposed its data into separate relations and thereby eliminates these shortcomings. The choice of keys in B causes a large data load in the Starring relation because the relationship between films and actors is established via their keys and foreign keys
introduced unique internal identifiers and showed that binary relations were sufficient to model the data-related properties of the real world. Semantics was expressed by object properties like synonyms, equivalence or relational symmetry, reflexivity and transitivity, but also by handling three-valued logic (true, false, unknown) to allow for an open world assumption. Today we can still find some of these concepts in the RDF model (see Sect. 5.1). At the same conference, Sundgren introduced his thesis [57], where he applied, for the first time, the Meta-Information concept to database models. This allows for the representation of even richer semantics about the real world modeled in the database, including formal and informal information about objects, properties and relations. Another important aspect of metadata is information like the quality of the data, the changeability of the model, the reliability of the information, the source of the data, etc. Metadata help the designer of the database to decide on the proper schema and help the user when locating relevant information in the database.
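The idea of describing the world with binary relations and a three-valued truth notion can be illustrated with a minimal sketch. The Python snippet below stores facts as (subject, relation, object) triples and answers a query with True, False, or None for unknown. The identifiers, relation names, and the explicit set of negated facts are assumptions made for this illustration; the snippet is not a rendering of the formalism in [4].

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Asserted binary relations between internally identified objects (hypothetical data).
FACTS: Set[Triple] = {
    ("movie1", "title", "Example Movie"),
    ("movie1", "starring", "actor1"),
    ("actor1", "name", "Star One"),
}

# Facts explicitly known to be false.
NEGATED: Set[Triple] = {
    ("movie1", "starring", "actor2"),
}


def holds(subject: str, relation: str, obj: str) -> Optional[bool]:
    """Three-valued lookup: True, False, or None when nothing is known."""
    triple = (subject, relation, obj)
    if triple in FACTS:
        return True
    if triple in NEGATED:
        return False
    return None  # open world: absence of a fact does not imply falsehood


def objects_of(subject: str, relation: str) -> Set[str]:
    """All objects related to the subject by the given binary relation."""
    return {o for (s, r, o) in FACTS if s == subject and r == relation}
```

Under a closed world reading the third case would be treated as false. Returning None instead keeps the distinction between "known to be false" and "not recorded", which is the point of the open world assumption.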
2.2 The Entity Relationship Model In 1975, Chen published the Entity Relationship Model (see [17] and [18]). This model streamlined a number of the earlier approaches into the somewhat simpler to understand concepts of Entities, Attributes and Relations as the basic building blocks for modeling the real world. Again, the constraints placed on entities (e.g., cardinality, atomicity), relations (e.g., n : m) and attributes (single- or multi-valued, types) allow for the expression of semantics. The ER model gave rise to a series of conferences starting in 1979 and continuing up until today. The semantic modeling aspect for designing databases as well as the interoperability of programs using those databases was considered in the development and the extensions of the ER model. Up until about 1985 the ER model did not discuss, for instance, is-a and inheritance [23]. Figure 4 shows the Entity Relationship model for our illustrative example.
2.3 Other Models
Through the 1970s and to some degree since then, quite a number of additional models were proposed. The Object Role Model (ORM), originally proposed by Falkenberg [25] and Nijssen as NIAM [46], was later adopted for the ORM modeling technique, which in turn influenced (e.g., Halpin [33]) the data modeling part of the now predominant Unified Modeling Language (UML). Most of these models were developed to allow semantic-oriented design of databases and data independence. Interoperability aspects were only mentioned as borderline criteria. That, however, changed with the Architecture Model of the ANSI/X3/SPARC proposal [5]. This model differentiates between three levels of database schemas:
Fig. 4 Entity Relationship Model Sample. The example shows how Studio A and B could model their data structures using the Entity Relationship Model (in Chen’s original notation). Studio A models the names of the movie stars as multi-valued attributes (marked with double circles). Studio B models the associations between instances of movies and actors as a relationship. The underlined attributes indicate primary keys
an internal model (e.g., a relational model), a conceptual model (e.g., a global ER Model), and multiple external models representing the usage views of the database and reflecting the individual semantic needs of the usage in a heterogeneous interoperability environment. This immediately led to a number of research issues on how to map the different levels into each other without loss of essential information.
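A rough impression of the three schema levels can be given with SQL views. In the sketch below, written in Python with the standard sqlite3 module, a base table plays the role of the conceptual schema, two views play the role of external models for different usage contexts, and the storage engine itself stands in for the internal level. The table, the view names, and the sample row are assumptions for illustration and are not part of the ANSI/X3/SPARC proposal.

```python
import sqlite3
from typing import List


def make_database() -> sqlite3.Connection:
    """Conceptual schema as a table, external models as views (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Film (
            Title TEXT, ReleaseYear INTEGER, Genre TEXT, Length INTEGER);
        INSERT INTO Film VALUES ('Example Movie', 1999, 'Drama', 120);

        -- External model for an application that only lists titles by genre.
        CREATE VIEW ProgrammeGuide AS
            SELECT Title, Genre FROM Film;

        -- External model for an application that plans screening slots.
        CREATE VIEW Scheduling AS
            SELECT Title, Length AS Minutes FROM Film;
        """
    )
    return conn


def column_names(conn: sqlite3.Connection, relation: str) -> List[str]:
    """Column names that an application sees through a table or view."""
    if relation not in {"Film", "ProgrammeGuide", "Scheduling"}:  # fixed names only
        raise ValueError(relation)
    cursor = conn.execute(f"SELECT * FROM {relation}")
    return [description[0] for description in cursor.description]
```

Each application works against its own view and is unaffected by columns it does not use. The mapping between levels is the view definition itself, which hints at why mapping without loss of essential information became a research question once the levels diverge more strongly than in this small case.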
3 Distributed and Object-Oriented Database Systems The powerful (relational) database systems developed in the 1970s ensured data independence and interoperability of application programs. At the same time, it was realized that more powerful data models were needed that expose more of the semantics of these data and allow application-oriented and context-dependent design of databases. In the late 1970s and early 1980s the rise of powerful computer networks began. It was henceforth possible to place data on various computer nodes, either locally or distributed throughout larger networks. In this section, we first describe the research area of distributed databases and how they deal with semantic heterogeneities (Sect. 3.1). Then we introduce the central characteristics of object-oriented database models and systems (Sect. 3.2).
3.1 Distributed Databases
In the late 1970s, the research field of distributed databases grew rapidly in importance (see Ceri et al. [14] for an overview of distributed databases). Early papers on distributed databases were Distributed INGRES [56] and
Fig. 5 Homogeneous and Heterogeneous Distributed Databases Sample. In (a) we assume that Studio B distributes the relations of its schema to two distinct database systems. In (b) the schema of Studio B serves as global schema and also as local export model of B’s database. A mapping M between the global schema and the local schema of database of Studio A needs to be established in order to bridge the heterogeneities between the involved databases
POREL [45], both approaches based on the Relational Data Model. They introduced the concept of global versus local schemas and the three-level architecture for centralized database systems, which was later extended to five layers: the (multiple) local internal models, the local conceptual models, the local (conceptual) export models, the global (conceptual) model, and the (multiple) external models. In order to design such a system, additional semantic meta-information was needed, as, for example, on the data distribution, the size and break-up of entity sets, the relations between them, the cardinality and selectivity of attributes, etc. The data models had to be extended accordingly, but in many cases those extensions were attached to an underlying relational model and not to the conceptual models of the various layers. The interoperability of applications and databases was then assured via the single global schema that would be used both by the local databases as well as by all of the global applications. It was recognized that in principle two situations for distributed databases can exist: (i) homogeneous and (ii) heterogeneous distributed databases. Figure 5 shows how our illustrative example can be deployed in a distributed setting. In the first case, a top-down design is realized by integrating external schemas into a single global schema. Guided by application-oriented metadata the design of the local schemas for the different computers in the network then follows. Here, considerable research effort was spent on strategies for splitting relations horizontally or vertically but, in retrospect, difficulties often arose from the low level of available semantic information. Some other research prototypes next to Distributed Ingres and POREL are SDD-1 of the Computer Corporation of America [37] and R* of IBM [31]. 
In the second case, heterogeneous systems follow a bottom-up design to cover situations where a number of pre-existing or autonomous databases must be integrated into a single data management system in order to be shared by global applications. Using the information contained in the local conceptual schemas and the global knowledge about the applications, the export schemas can be developed and then be integrated into a single global schema by means of a mapping specification. The research prototype MULTIBASE [41] uses Daplex, a logical data specification language, for modeling the various schemas. Heterogeneous SIRIUS-DELTA [42] uses the Relational Model only and demonstrates the integration of PHLOX, which is a database system of the CODASYL Model family. However, it does not provide the equivalent functionality to databases as no real global schema is assumed, no local users are allowed, and mapping functions are to be provided by the local database management systems. As it turns out, homogeneous distributed database systems have become a feature of the major database products, whereas heterogeneous systems are still difficult to handle, even today. The main problems arise from the scarcity of explicit semantics that can be provided for the external schemas and the global applications that use those schemas as well as the semantics for the local schemas used for designing the local databases. With the advent of heterogeneous distributed database systems, the need for model compatibility, data consistency and object identity became apparent when interoperability was to be achieved. In our illustrative example, Studio B uses the attribute BirthDate as part of the primary key for the relation Actors. Studio A represents actors as a multi-valued attribute with the consequence that actors can only be identified by their names; information about an actor’s birthdate is not available in Studio A’s database.
Therefore, Studio A cannot distinguish between actors having the same names and runs into problems when integrating its data with those of Studio B: if the schema of Studio B is used as global schema, it is not possible to define identity for the actors from Studio A’s database, because birth dates are not given. The models discussed so far neither allow for the specification of behavior nor are they flexible enough to allow for the expression of properties like equivalence, inheritance, and composition. As a consequence, the attention of the database research community shifted to object-oriented databases that allow for the specification of object identity, structures, semantics, behaviors, and constraints for the objects to be stored in the database as described in the following section.
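The identity problem described above can be sketched in a few lines. The Python snippet below maps a Studio A movie record to actor entries of a global schema modeled on Studio B, where name and birth date together identify an actor. The record layout, the field names, and the function names are assumptions for illustration and do not reproduce any mapping language from the cited systems.

```python
from typing import Dict, List, NamedTuple, Optional


class GlobalActor(NamedTuple):
    """Actor in the global schema; name and birth date form the key (as in Studio B)."""
    name: str
    birth_date: Optional[str]  # None when the source database cannot supply it


def map_studio_a_stars(movie: Dict) -> List[GlobalActor]:
    """Map Studio A's multi-valued 'stars' attribute to global actor entries."""
    # Studio A stores only names, so the birth date part of the key stays unknown.
    return [GlobalActor(name=star, birth_date=None) for star in movie["stars"]]


def same_actor(left: GlobalActor, right: GlobalActor) -> Optional[bool]:
    """Decide identity under the global key; None means it cannot be decided."""
    if left.name != right.name:
        return False
    if left.birth_date is None or right.birth_date is None:
        return None  # equal names, but half of the key is missing
    return left.birth_date == right.birth_date


# Hypothetical Studio A record and Studio B actor used for a quick check.
studio_a_movie = {"title": "Example Movie", "year": 1999, "stars": ["Star One"]}
studio_b_actor = GlobalActor(name="Star One", birth_date="1960-01-01")
```

Comparing the mapped Studio A entry with the Studio B actor yields None rather than True: the names agree, yet the mapping cannot establish that both entries denote the same person. Any integration has to either accept this uncertainty or obtain the missing birth dates from another source.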
3.2 Object-Oriented Database Models and Systems
Even when mainstream databases (the relational-model-based systems, whether central or distributed) were enhanced with Entity Relationship type semantic descriptions, they did not show enough flexibility to support, for instance, the interoperation of heterogeneous systems or the extensibility for newly appearing data types like semistructured and unstructured information. BLOBs (Binary Large Objects), used as a first solution, actually led to the loss of data independence, a paradigm that originally gave rise to the database concept.
In the early 1980s, object-oriented programming (Smalltalk, C++) became popular and the need for the persistent storage of those new types of data arose. This triggered research in Object-Oriented Database Management Systems (OODBMS) and Object-Oriented Data Models (OODM), which started simultaneously in many locations. Many prototypes and even some commercial systems became available in the late 80s. An extensive description of those systems can be found in Dogac et al. [21] and also in Bukhres et al. [12]. Basically an object-oriented database model introduces application behavior (semantics) into databases by supporting a number of concepts, some of them well known in the object-oriented programming world, others specific to the persistence mechanism used for storing.
– Object Identity: every object has a unique identifier attached permanently at object creation time for object recognition. Unfortunately, this does not solve the object identity problem in heterogeneous systems where for the same real world object two different database objects could have been created.
– Type Extensibility: the basic data types in the database can be extended with new basic types and their handling functions. Type constructors would allow for new complex (abstract) data types. The typing systems could allow static binding or dynamic binding of data to the operations.
– Object Classes: real-world entities of the same kind, that is, those modeled as objects having the same data types, object attributes, behavior and relationships to other objects, can be collected into a single class.
– Inheritance: objects of a subclass (a more specific description) inherit properties of a superclass via the semantic concept of an is-a relationship including inheritance from multiple superclasses, e.g., as in case of the two superclasses SUV and Truck and the subclass SportTruck that has properties of both the SUV and Truck classes.
– Object Instance: some OODM allow that an object instance can populate all the superclasses it inherits properties from, others only allow the instance in the ultimate subclass where its most specific description is located. Missing information, later added, would change the class of an object whereas in the first case the object instance only would be added to the newly relevant subclass. The first case would also simplify the problem of interoperability in heterogeneous multi-OODBMSs. Of course it would still not solve the problem of object identity.
We believe that no single prototype or product fully supported all the possible features and also that no clear winner has ever been established in the OODBMS world. As it happens, object-oriented features were added by the relational database vendors as object-relational database management systems and today the “pure” OODBMSs can only be found in niche application fields. A simple example of an OODBM schema is given in Fig. 6.
To tackle the problem of heterogeneous distributed OODBMSs with their sometimes distinct formal semantics, more (formal) semantic flexibility was desirable. The VODAK Modeling Language (VML) [39] was an attempt to solve the problem by extending the two level models Application Class and Instance and the relationship is-instance-of with two additional levels, the Meta Class (MC) level and the
Fig. 6 Object-Oriented (UML) Model. The example shows the schema of Studio B in an object-oriented representation using the UML notation. To illustrate the inheritance feature of OO model, we introduced a superclass Person that defines all the attributes that would describe persons (not only actors) in the real world. The class Actor inherits all the properties from Person and introduces the additional attribute ActivePeriod
Meta-Meta Class (MMC) level (or Root Metaclass). The MC classes would specify the behavior of the specific Class Model, e.g., inheritance of all properties for all subclasses or only for specific properties or no inheritance at all could be specified. In case of heterogeneous OODBMSs, a global schema could then be used to integrate the individual different (formal) models and achieve interoperability between the databases. Today the idea of multi-level model architectures is reflected in the Object Management Group’s (OMG) MOF model [48] and serves as formal basis for UML [49], which is now the de-facto standard for object-oriented application design. However, as it soon turned out, even with the powerful object-oriented models, which allowed for the expression of many real-world semantic properties and behaviors, the expressive power needed in the growing world of multimedia and the World Wide Web was still missing. As a consequence, the OODBMSs never became the database concept envisioned in the late 1980s and early 1990s, despite the fact that some of their features can be found even today in multimedia, document, streaming, etc. data models.
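Two of the concepts discussed in this section, inheritance and object identity, can be illustrated with a short sketch in Python. It follows the Person and Actor classes described for Fig. 6, with Actor adding an ActivePeriod attribute. The attributes of Person, the counter-based identifier, and the sample values are assumptions for illustration; a real OODBMS would assign and persist identifiers itself.

```python
import itertools

# Hypothetical stand-in for system-generated object identifiers.
_next_oid = itertools.count(1)


class Person:
    """Superclass holding attributes that describe any person (assumed attributes)."""

    def __init__(self, name: str, birth_date: str) -> None:
        self.oid = next(_next_oid)  # attached permanently at object creation time
        self.name = name
        self.birth_date = birth_date

    def describe(self) -> str:
        return f"{self.name} (born {self.birth_date})"


class Actor(Person):
    """Subclass related to Person by is-a; adds the ActivePeriod attribute."""

    def __init__(self, name: str, birth_date: str, active_period: str) -> None:
        super().__init__(name, birth_date)
        self.active_period = active_period

    def describe(self) -> str:
        # Extends the inherited behavior instead of replacing it.
        return f"{super().describe()}, active {self.active_period}"


# Two database objects created independently for the same real-world actor.
first = Actor("Star One", "1960-01-01", "1985-2005")
second = Actor("Star One", "1960-01-01", "1985-2005")
```

Both objects carry identical attribute values but different identifiers, so identifier comparison alone reports two distinct actors. This mirrors the remark in the list above that system-assigned identity does not settle the question of which database objects denote the same real-world object across heterogeneous systems.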
4 Semantics in Distributed and Heterogeneous Information Systems
Distributed databases split data across several nodes and increased the performance and scalability of data management. The distinction between different types of schemas and the development of more application-oriented data models such as the Object-Oriented Data Model introduced novel ways of expressing data semantics. However, with the rapidly increasing size of local and wide area computer networks, those established database-oriented interoperability mechanisms turned out to be insufficient due to the technical heterogeneities of the involved network nodes. In the late 1980s and early 1990s information integration started to become an active research field with the goal of providing uniform access to data stored in distributed, heterogeneous, and autonomous systems. The Semistructured Data Model
B. Haslhofer and E.J. Neuhold
Fig. 7 Semistructured Model Example. A directed labeled graph represents the data of Studio B. The graph is self-describing because the data also carry schema information
(Sect. 4.1) plays a central role in this context. In parallel, research on Markup Languages (Sect. 4.2) evolved to a first agreed-upon standard (SGML), a derivative of which (XML) was later integrated with the Semistructured Data Model. Hypertext and Hypermedia research (Sect. 4.3) focused not only on data and document representation, but also on navigation and access to documents in distributed environments. All these efforts had a direct impact on Multimedia Data and Document Models, which aimed at representing the semantics and behaviors of non-textual multimedia objects. As a representative for these developments we discuss MPEG-7 (Sect. 4.4) and briefly outline other metadata interoperability approaches (Sect. 4.5).
4.1 The Semistructured Data Model
In all models discussed so far (Relational Model, ER Model, OO Model), there has been a fixed schema describing the semantics of data. This leads to problems when data are exchanged across systems, because the underlying databases usually do not share the same schema even if they store similar data. This was the primary motivation for developing a more flexible data model, called the Semistructured Data Model. The original model evolved from the LORE [3] and TSIMMIS [16] projects at Stanford University and was first described by Papakonstantinou et al. [51]. Unlike the other existing data models at that time, the semistructured model does not separate the schema from the data. It is self-describing, meaning that the data themselves carry their schema information. Data represented by the semistructured model take the form of a directed labeled graph. The nodes in such a graph stand for objects or attribute values. An edge indicates the semantics of the relationship two nodes have with each other. Unlike previous models, an edge merges the notions of attributes and relationships into a single primitive. Figure 7 shows our illustrative example in a semistructured representation.
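As a rough illustration of the directed labeled graph described above, the following Python sketch stores the data as a list of labeled edges. The node identifiers (studio_b, m1, a1, a2) and the edge labels are assumptions made for this example and are not taken from Fig. 7; the values come from the running movie example. Because the labels travel with the data, the "schema" can be read off the graph itself.

```python
# Each edge is (source node, label, target). A target is either another
# node identifier or an atomic value. All identifiers are hypothetical.
edges = [
    ("studio_b", "movie", "m1"),
    ("m1", "title", "Casablanca"),
    ("m1", "year", 1946),
    ("m1", "genre", "Drama"),
    ("m1", "actor", "a1"),
    ("m1", "actor", "a2"),
    ("a1", "name", "Humphrey Bogart"),
    ("a2", "name", "Ingrid Bergman"),
]


def outgoing(node):
    """Return the (label, target) pairs of all edges leaving `node`."""
    return [(label, target) for source, label, target in edges if source == node]


def follow(start, *path):
    """Follow a sequence of edge labels from `start` and return the reached targets."""
    frontier = [start]
    for label in path:
        frontier = [t for node in frontier for (l, t) in outgoing(node) if l == label]
    return frontier


# The schema information is carried by the data: it is the set of labels in use.
labels_in_use = sorted({label for _, label, _ in edges})

# Attributes and relationships are both just edges, so one path mechanism
# covers "relationship" steps (movie, actor) and "attribute" steps (name).
actor_names = follow("studio_b", "movie", "actor", "name")
```

Adding a new kind of information, for instance a director edge, requires no schema change; it is just one more labeled edge.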
The semistructured data model provides the necessary flexibility for exchanging data across system boundaries. However, the price for this flexibility is the loss of efficiency in query processing. This is one of the reasons why most of today’s data are still represented in the very efficient relational model and the technologies based on the semistructured model are primarily used for exchanging data. An architectural pattern combining the benefits of the static-schema and schema-less approaches is the mediator-wrapper architecture proposed by Wiederhold [63]. An extensive explanation of the Semistructured Data Model and its succeeding technologies is provided by Abiteboul et al. [2].
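The mediator-wrapper idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their layouts, and all function names are hypothetical. Each wrapper translates one source's native representation into a common record format, and the mediator answers a query by consulting all wrappers.

```python
# Hypothetical source A: relational-style rows of (title, year).
studio_a_rows = [("Casablanca", 1946)]

# Hypothetical source B: nested, self-describing documents with string values.
studio_b_docs = [{"movie": {"title": "Casablanca", "year": "1946"}}]


def wrap_studio_a():
    """Wrapper: translate source A rows into the common record format."""
    return [{"title": t, "year": y, "source": "A"} for t, y in studio_a_rows]


def wrap_studio_b():
    """Wrapper: translate source B documents into the common record format."""
    return [
        {"title": d["movie"]["title"], "year": int(d["movie"]["year"]), "source": "B"}
        for d in studio_b_docs
    ]


def mediator(title, wrappers):
    """Mediator: pose one query against all wrapped sources and merge the answers."""
    return [record for wrap in wrappers for record in wrap() if record["title"] == title]


results = mediator("Casablanca", [wrap_studio_a, wrap_studio_b])
```

The sources keep their efficient native storage, while clients of the mediator see a single uniform view.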
4.2 Markup Languages
The motivation for the development of markup languages comes from the publishing industry and early work on electronic document management systems. Without any markup, documents are simply files containing a sequence of symbols. Applications processing these documents cannot determine, for instance, which character subsequences are section headings to be presented to the user or where in the character sequence the information about the authors is located. Therefore document exchange, and consequently interoperability between applications and between vendors, becomes very difficult. As a consequence, the goal of markup languages is to add explicit semantics to plain character sequences. Markers (tags) allow for the annotation of electronic documents in order to add data, presentation, and processing semantics to character subsequences. The IBM Generalized Markup Language (GML) [27] invented by Mosher, Lorie, and Goldfarb was the first technical realization of a markup language. Scribe [52] was the first language that introduced the separation of content and format and applied a grammar controlling the usage of descriptive markup elements. These works led to the standardization of the Standard Generalized Markup Language (SGML) [35] in 1986. SGML is a metalanguage for describing markup languages; it defines a common syntax for the markup or identification of structural textual units as well as the grammar, the document type definition (DTD), for defining the structure and the tags allowed in a document. Prominent derivatives of SGML are HTML, which was developed in 1991, and XML, which was standardized in 1998. HTML [7] is a presentation-oriented markup language that allows users to easily create Web sites without adhering to the strict formal requirements imposed by the SGML DTDs. In fact, it eliminates DTDs and only suggests structural and very few (formal) semantic features such as HTML META tags.
Nevertheless, the extensibility and flexibility of HTML were among the key factors for the success of the World Wide Web, with the result that today HTML is still the most widely used markup language. While HTML mainly provides markup elements that define presentation semantics of document parts, XML [61] goes back to SGML, eliminates complex properties, and streamlines DTDs. As a consequence, XML provides a simplified meta
Fig. 8 XML Document Example. It shows the movie data of Studio B represented in XML. The first line contains an XML processing instruction; the following lines contain the XML elements and values that describe the movie and its actors (values: Casablanca, 1946, Drama, 102; Humphrey Bogart, 1899-12-25; Ingrid Bergman, 1915-08-29)
markup language for defining documents that contain data to be communicated between applications. Since XML is backwards-compatible with SGML, DTDs can be applied for imposing element definitions and document structures on XML documents. The elements in XML documents indicate the semantics of contained data values. Nowadays, however, DTDs are superseded by XML Schema, which offers the great advantage that not only data but also the schema information is represented in XML. Figure 8 shows our illustrative example represented in XML.
It was soon discovered that the freedom of the original HTML specification, as a presentation-oriented markup language, led to semantic interoperability problems among Web browsers and applications. Around the year 2000, XHTML was developed in order to bind the features of HTML to an XML format. The goal was to represent Web documents as well-formed XML documents, which promised greater interoperability but less freedom in the creation of Web sites. With the development of XHTML 2 ² and HTML 5 ³ a competition for the next-generation markup language started. At the time of writing, HTML 5 seems to be the winner in the field of Web markup languages because of its less strict, more evolutionary design approach. However, despite this development the expressibility of real-world semantics remained weak and led to the development of additional meta-languages such as RDF/S and OWL, which will be discussed in Sect. 5. HTML 5 now provides the possibility to include metadata content expressed in RDF in Web documents.
² http://www.w3.org/TR/xhtml2/.
³ http://www.whatwg.org/specs/web-apps/current-work/multipage/.
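To make the XML representation of the running example concrete, the following Python sketch builds and re-reads a small document with the standard library module xml.etree.ElementTree. The element names (movie, title, year, genre, length, actor, name, dateOfBirth) are assumptions, since the tags of the original listing are not available here; only the values are taken from the example, and reading 102 as a length is likewise a guess.

```python
import xml.etree.ElementTree as ET

# Build the document. All element names are hypothetical.
movie = ET.Element("movie")
ET.SubElement(movie, "title").text = "Casablanca"
ET.SubElement(movie, "year").text = "1946"
ET.SubElement(movie, "genre").text = "Drama"
ET.SubElement(movie, "length").text = "102"  # assumed meaning of the value 102

for name, born in [("Humphrey Bogart", "1899-12-25"), ("Ingrid Bergman", "1915-08-29")]:
    actor = ET.SubElement(movie, "actor")
    ET.SubElement(actor, "name").text = name
    ET.SubElement(actor, "dateOfBirth").text = born

# Serialize to text, as it would be exchanged between applications.
xml_text = ET.tostring(movie, encoding="unicode")

# A receiving application parses the text and relies on the element names
# to locate the values it is interested in.
parsed = ET.fromstring(xml_text)
actor_names = [a.findtext("name") for a in parsed.findall("actor")]
```

The element names tell the receiver where a value sits in the document, but not what the name means; that agreement still has to be established outside the document, for instance by a shared DTD or XML Schema.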
4.3 Hypertext and Hypermedia
Inspired by Vannevar Bush’s vision of Memex [13], Ted Nelson and Douglas Engelbart started their research on hypertext and hypermedia systems in the late 1960s (cf., [24, 44]). The goal of hypertext was to extend the traditional notion of linear flat text files by allowing a more complex organization of the material. Hypertext systems should allow direct machine-supported references from one textual chunk to another. Via dedicated interfaces, the user should have the ability to interact with these chunks and to establish new relationships between them. Hypertext was considered a non-linear extension of traditional text organization. In its simplest form, hypertext consists of nodes and plain links, which are just connections between two nodes. They carry no explicit semantics but simply serve for the navigation between documents or document chunks. But links can also be used to connect a comment or annotation to a text. In such a case, the links that connect data with other data express semantics. When links have explicit types assigned, as described in Trigg et al. [59], they explicitly define the semantic relationship between nodes. There is a clear analogy between explicitly typed links in hypertext systems and the semistructured model described in Sect. 4.1: the underlying models are directed labeled graphs. Hypermedia is an extension of hypertext that also includes non-textual multimedia objects such as audio, video, and images. A detailed survey on early hypertext research and existing hypertext systems is available in [20]. However, hypermedia inherits the properties of hypertext and also has only limited means to express the semantics of the involved media objects and the relationships between them. It soon became clear that exchanging hypertext and hypermedia documents between applications requires a standardized exchange format.
HyTime, as an extension of SGML, is an example of such a standard (see Goldfarb [28]). The work of the Dexter Group also focused on hypertext exchange formats and architectural models that should facilitate the exchange of hypertext [29, 32]. The World Wide Web is the most popular hypertext application in use today. One of the success factors of the Web was that several technologies were integrated into an easy-to-use technology stack: Uniform Resource Identifiers (URIs) and Uniform Resource Locators (URLs) for addressing documents in the Web, HTML (and its extensions) as a flexible markup language for creating hypertext documents, and HTTP as a protocol for the communication between clients and servers (see [36]).
4.4 The MPEG-7 Metadata Interoperability Framework
With the release of the MPEG-7 standard in February 2002, a powerful metadata system for describing multimedia content was introduced. The goal was to provide higher flexibility in data management and interoperability of data resources. The difference between MPEG-7 and other already existing MPEG standards is that
MPEG-7 does not specify any coded representation of audio-visual information but focuses on the standardization of a common interface for describing multimedia materials [43]. MPEG-7 is not intended as a single monolithic system for multimedia description but rather as an extensible metadata framework for describing audiovisual information. MPEG-7 standardizes an extensive set of content Descriptors (D) and Description Schemas (DS) and offers a mechanism to specify new Description Schemas, namely the Description Definition Language (DDL). It is a description standard for all kinds of media (audio, image, video, graphics, etc.) and creates a common basis for describing different media types by a single standard. It thereby eases interoperability problems between media types as well as applications. MPEG-7 uses XML for encoding content descriptions into a machine-readable format. XML Schema serves as the basis for the DDL, which is used for the syntactic definition of the MPEG-7 description tools and which allows for their extension. Further details on MPEG-7 are available in Kosch [40]. MPEG-7 was not developed with a restricted application domain in mind. With the ability to define media description schemas by means of the DDL, MPEG-7 is intended to be applicable to a wide range of multimedia applications ranging from home entertainment (e.g., personal multimedia collections) to cultural services (e.g., art galleries) and surveillance (e.g., traffic control). However, this wide application spectrum has resulted in an enormous complexity of the standard, which, in our opinion, is one of the reasons why the ambitious goals of MPEG-7 remain unreached.
4.5 Other Metadata Interoperability Approaches
The relevant characteristics of the 1990s are the emergence of the World Wide Web and an increasing need for interoperability among distributed applications. The availability of markup languages such as XML promoted the development of metadata interoperability standards that should allow the exchange of information objects across system boundaries. These standards ranged from rather generally applicable schemas such as Dublin Core [22] to very domain-specific schemas such as ONIX [58], which provides standardization for the publishing industry. This was also the period when global models covering the semantics of whole application domains emerged. Those models are supposed to define the common notions used in a domain and serve as a global schema for the integration of data in a heterogeneous distributed environment. The CIDOC CRM ⁴ is such a model. It defines a conceptual model that aims at providing interoperability among information systems in cultural heritage institutions. This is architecturally similar to the idea of heterogeneous databases (cf., Sect. 3.1) where a global schema defines the model primitives for querying the underlying databases.
⁴ http://cidoc.ics.forth.gr/.
The difference
from the 1990s onwards is that global model interoperability approaches are being applied in the Web, which is an open-world environment. However, they inherit problems that distributed databases can cope with only in their much smaller closed-world environment. As in databases, one must always deal with semantic ambiguities in the interpretations of the involved schemas and provide adequate mappings to bridge the heterogeneities. For a more detailed discussion on techniques for achieving metadata interoperability, we refer to a recent survey provided by Haslhofer and Klas [34].
5 The Semantic Web and Linked Data
The late 1990s were characterized by the success of the World Wide Web. A set of simple-to-use technologies (URI, HTTP, HTML) suddenly allowed even non-technical users to easily create and publish documents in a globally accessible information space. This was one of the reasons for the rapid spread of the Web. However, the information published on the Web was intended for human consumption and not for machine interpretation. This motivated the development of the Semantic Web, which is an extension of the existing Web and aims to use the Web as a universal medium for the exchange of data. The Web should become a place where data can be shared and processed by automated tools as well as by people.⁵
Section 5.1 focuses on early Semantic Web activities and briefly describes the major specifications in place. Section 5.2 summarizes current activities in the area of Linked Data.
5.1 The Semantic Web
The term Semantic Web was coined by Tim Berners-Lee and popularized in an article published in Scientific American in 2001 [8]. There the Semantic Web is described as a new form of Web content that is meaningful to computers and will unleash a revolution of new possibilities. In the early Semantic Web vision, intelligent agents should act on behalf of their users and automatically fulfill tasks in the Web (e.g., making a doctor’s appointment). This of course requires that these agents understand the semantics of the information exposed on the Web. Based on this vision, the Semantic Web Activity was started at the W3C and has led to the specification of several standards that technically enable this vision: RDF/S, OWL, OWL-S, SKOS, and SPARQL. Since one of the major design principles was to build the Semantic Web upon the existing Web architecture, URIs form the basis for all these standards.
⁵ http://www.w3.org/2001/sw/Activity.
Hence,
all resources in the Semantic Web (including, but not limited to, those describing real-world objects) should have URIs assigned.
The Resource Description Framework (RDF) serves as the data model for representing metadata about a certain resource. It allows us to formulate statements about resources, each statement consisting of a subject, a predicate, and an object. The subject and predicate in a statement must always be resources; the object can be either a resource or a literal node. A statement is represented as a triple, and several statements form a graph. RDF data can be exchanged between applications by serializing graphs using one of the RDF serialization syntaxes (e.g., RDF/XML, N-Triples, Turtle). We will give an example of an RDF graph in Fig. 9 in Sect. 5.2.
The RDF Vocabulary Description Language RDF Schema (RDFS) and the Web Ontology Language (OWL) are means of describing the vocabulary terms used in an RDF model. RDFS provides the basic constructs for describing classes and properties and allows for their arrangement into simple subsumption hierarchies. Since the expressiveness of RDFS is limited and lacks some fundamental modeling features often required to construct vocabularies, the Web Ontology Language (OWL) was created. It is based on RDFS, allows the distinction between attribute-like (owl:DatatypeProperty) and relationship-like (owl:ObjectProperty) properties, and provides several other expressive modeling primitives (e.g., class unions and intersections, cardinality restrictions on properties) that allow us to express more complex models, which are then called ontologies. With RDFS and OWL one can define models that explicitly express the semantics of data and specify the inferences that can be drawn from them. The formal grounding of OWL (Description Logics) allows applications to reason on RDF statements and infer logical consequences.
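The triple model and the simple subsumption hierarchies of RDFS can be sketched without any RDF library. The following Python example represents a graph as a set of (subject, predicate, object) tuples and applies two RDFS entailment patterns until nothing new can be derived: subclass transitivity and type propagation along rdfs:subClassOf. The rdf:type and rdfs:subClassOf URIs are the standard ones; the http://example.org/ namespace, the class names, and the function rdfs_closure are hypothetical, and the sketch deliberately covers only these two patterns, not full RDFS or OWL reasoning.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace for the example

# A graph is a set of (subject, predicate, object) statements.
graph = {
    (EX + "Actor", RDFS_SUBCLASS_OF, EX + "Person"),
    (EX + "Person", RDFS_SUBCLASS_OF, EX + "Agent"),
    (EX + "bogart", RDF_TYPE, EX + "Actor"),
    (EX + "bogart", EX + "name", "Humphrey Bogart"),  # object is a literal
}


def rdfs_closure(triples):
    """Add statements entailed by subclass transitivity and type propagation."""
    closure = set(triples)
    while True:
        subclass_of = {(s, o) for s, p, o in closure if p == RDFS_SUBCLASS_OF}
        derived = set()
        for s, p, o in closure:
            if p not in (RDF_TYPE, RDFS_SUBCLASS_OF):
                continue
            for sub, sup in subclass_of:
                if o == sub:
                    # (s type sub) or (s subClassOf sub), plus (sub subClassOf sup)
                    derived.add((s, p, sup))
        if derived <= closure:
            return closure  # fixpoint reached
        closure |= derived


inferred = rdfs_closure(graph)
```

The statement that bogart is a Person is never asserted; it only appears in the closure, which is the kind of logical consequence the text refers to.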
The binding of semantics to a logical system reduces interpretation ambiguities and leads to greater semantic interoperability between applications.
OWL-S (Semantic Markup for Web Services) is a specific upper-level ontology for the description of services on the Web (Semantic Web Services). It should enable automatic Web service discovery and invocation by Web agents as well as automatic Web service composition and interoperation. Therefore, OWL-S can be considered an attempt to establish interoperability between services in the Semantic Web.
The Simple Knowledge Organization System (SKOS) is a model for expressing the structure and constituents of concept schemes (thesauri, controlled vocabularies, taxonomies, etc.) in RDF so that they become machine-readable and exchangeable between applications. With SKOS one can attach multi-lingual labels to concepts and arrange them in two major kinds of semantic relationships: broader and narrower relationships for constructing concept hierarchies and associative relationships for linking semantically related concepts.
The SPARQL Query Language for RDF is an expressive language for formulating structured queries over RDF data sources. An accompanying protocol defines how clients send queries to a SPARQL endpoint and retrieve the results via the Web. Currently, the abstract protocol specification has bindings for HTTP and SOAP. This allows clients to execute a query against a given endpoint (e.g., http://dbpedia.org/sparql) and to retrieve the result set through common Web transport protocols. The underlying motivation for defining the SPARQL query language
is analogous to the motivation for defining SQL (see Sect. 2.1): to be able to access RDF stores via a uniform interface in order to achieve greater interoperability.
A core belief of the early Semantic Web was that intelligent agents should be able to reason and draw conclusions based on the available data. This is why, in the Semantic Web, the meaning of terminology used in Web documents, that is, the semantics of data, is expressed in terms of ontologies. The term ontology has its technical origin in the Artificial Intelligence domain and is defined as a specification of a conceptualization (see e.g., [30]). At its core, an ontology is similar to a database schema: a model defining the structure and semantics of data. Noy and Klein [47] describe several features that distinguish ontologies from database schemas, most importantly that ontologies are logical systems that define a set of axioms that enable automated reasoning over a set of given facts. Although intensive research has been conducted in the Semantic Web domain over the last ten years, this early vision of the Semantic Web has not been implemented yet. The limitations of the Semantic Web lie in formal issues like decidability and computational complexity but also in its conceptual complexity. It is difficult to make clear to users (and system designers) which inferences are implied by a given fact.
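The uniform access that SPARQL provides over HTTP can be sketched as follows. In the HTTP binding of the protocol, a query can be sent with a GET request that carries the query text in a parameter named query. The Python example below only constructs such a request URL for the endpoint named in the text and does not contact the server; the query itself, which looks up resources labeled "Casablanca", and the helper build_request_url are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint mentioned in the text

# An illustrative query: find resources whose English label is "Casablanca".
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource WHERE {
  ?resource rdfs:label "Casablanca"@en .
}
LIMIT 10
"""


def build_request_url(endpoint, sparql_query):
    """Encode the query text as the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": sparql_query})


url = build_request_url(ENDPOINT, QUERY)

# A server decodes the parameter and recovers exactly the query text sent.
decoded_query = parse_qs(urlparse(url).query)["query"][0]
```

Because every conforming endpoint accepts the same request shape, a client written against one RDF store can, in principle, be pointed at another by changing only the endpoint address.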
5.2 Linked Data
In 2006 Tim Berners-Lee proposed the so-called Linked Data principles [6] as a guideline or recommended best practice for sharing structured data on the Web and for connecting related data that were not linked before. These are:
1. Use URIs to identify things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that one can discover more things.
These principles eclipse the reasoning part of the Semantic Web, accentuate the data-centric aspects of existing Semantic Web technologies, and thereby demystify their application in real-world environments. A central point in the Linked Data principles is the application of HTTP URIs as an object (resource) identification mechanism. When an application dereferences such a URI, it receives data expressed in RDF. Structured access to RDF data within data sources is provided by SPARQL. This, in fact, resembles the central features provided by traditional (relational) database systems. The goal of the fourth principle is to interlink semantically related resources on the Web. If, for instance, two studios maintain a data record about the same movie, they should be interlinked. The semantics of the link depends on the application scenario; existing Semantic Web languages provide a set of predefined properties (rdfs:seeAlso, owl:sameAs, skos:closeMatch, etc.) for defining the meaning of links. Figure 9 shows how our illustrative example is represented in the Web of Data.
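Dereferencing an HTTP URI, as required by the second and third principle, can be sketched with the Python standard library. The example prepares a GET request that asks for an RDF serialization through the Accept header. That servers select the representation by such content negotiation is a common practice and an assumption of this sketch, not something the principles prescribe; the resource URI is only illustrative, the helper rdf_request is hypothetical, and the request is constructed but not sent.

```python
from urllib.request import Request

# Media types of two RDF serializations: Turtle preferred, RDF/XML as fallback.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def rdf_request(uri):
    """Prepare (but do not send) a GET request asking for RDF about `uri`."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


# Illustrative resource URI in the DBpedia namespace.
request = rdf_request("http://dbpedia.org/resource/Casablanca_(film)")

method = request.get_method()
accept = request.get_header("Accept")
```

Passing the prepared request to urllib.request.urlopen would perform the lookup; the RDF in the response may in turn contain further HTTP URIs, which is how the fourth principle lets a client discover related data.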
Fig. 9 Linked Data Example. The example shows how the data of Studio B could be exposed on the Web following the Linked Data guidelines. The prefixes expand as follows: dbpedia to http://dbpedia.org/resource/, dbpprop to http://dbpedia.org/property/, dbpedia-owl to http://dbpedia.org/ontology/, and category to http://dbpedia.org/resource/Category:
The Linked Data idea rapidly attracted interest in various communities. Shortly after the formation of the W3C Linking Open Data Community project,⁶ DBpedia [11] was launched as the first large linked data set on the Web. It exposes information available in Wikipedia in a structured form and provides links to related information in other data sources such as the Linked Movie Database.⁷ As of November 2009, the DBpedia knowledge base describes more than 2.9 million things such as persons, music albums, or films in 91 different languages. It provides a user-generated knowledge organization system comprising approximately 415 000 categories and millions of links to semantically related resources on the Web. After DBpedia, many other data sources followed. Today this so-called Web of Data comprises an estimated 4.7 billion RDF triples and 142 million RDF links [10]. For data consumers this has the advantage that data as well as schema information are now available on the Web (see the Best Practice Recipes for Publishing Vocabularies ⁸) and can easily be accessed via widely accepted Web technologies, such as URI and HTTP. RDF simply serves as a model for representing data on the Web. These pragmatic Web of Data principles also resemble the notion of dataspaces [26] that was coined in the database community. However, as with all of the previously described interoperability attempts and technologies, Linked Data does not solve the complete stack of interoperability problems either. From the Semantic Web it inherits a set of technologies (RDF/S, OWL, etc.) that provide the necessary technical and structural interoperability, which in turn makes data easily accessible on the Web. But this does not solve the semantic interoperability problem. The data in the Linked Data Web are still heterogeneous because they use different vocabularies to describe the same real-world entities or the same vocabulary to describe different real-world entities.
This leads to interpretation conflicts and usually requires manual intervention in the form of mappings. Although there exists a wealth of work in the area of semantic mediation and mapping (see e.g., [38]), the complexity of finding potential mappings between concepts grows with the size of the involved vocabularies. Fully automatic matching is considered to be an AI-complete problem, that is, as hard as reproducing human intelligence [9].
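The kind of manual mapping referred to above can be illustrated with a small Python sketch. Two hypothetical sources describe the same movie with different vocabularies; a hand-written mapping table translates both into shared terms so that the records become comparable. All property URIs, the shared terms, and the function translate are assumptions made for this example.

```python
# Two records about the same movie, each using its own (hypothetical) vocabulary.
record_a = {
    "http://example.org/studioA/title": "Casablanca",
    "http://example.org/studioA/released": "1946",
}
record_b = {
    "http://example.org/studioB/name": "Casablanca",
    "http://example.org/studioB/year": "1946",
}

# Manually curated mapping from source properties to shared terms.
MAPPING = {
    "http://example.org/studioA/title": "title",
    "http://example.org/studioA/released": "year",
    "http://example.org/studioB/name": "title",
    "http://example.org/studioB/year": "year",
}


def translate(record, mapping):
    """Rewrite a record into the shared vocabulary; unmapped properties are dropped."""
    return {mapping[prop]: value for prop, value in record.items() if prop in mapping}


unified_a = translate(record_a, MAPPING)
unified_b = translate(record_b, MAPPING)
same_description = unified_a == unified_b
```

The table has one entry per source property, which hints at why the effort of finding and maintaining mappings grows with the size of the vocabularies involved.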
6 Summary and Future Research Directions
Interoperability is a qualitative property of computing infrastructures. It enables a receiving system to properly interpret the information objects received from a sender and vice versa. Since this is not given by default, the representation of semantics has been an active research topic for four decades.
⁶ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData.
⁷ http://www.imdb.com/.
⁸ http://www.w3.org/TR/swbp-vocab-pub/.
In this chapter, we gave a retrospective on semantics and interoperability research as applied in major areas of computer science. We started with the Relational Model developed in the 1970s and ended with the currently ongoing activities in the Semantic Web and Linked Data community. The technical outcomes of all these activities were models that allow for the expression of data semantics and system architectures for the integration of data from several (heterogeneous) sources. From the late 1990s on, when research was driven by the evolving World Wide Web, the semistructured data model gained importance. Different from previous models, it is self-describing, meaning that the data themselves carry schema information. In essence, all presented models and system architectures enable the representation of data and the description of the semantics of these data. If one and the same model were used for exchanging information objects, interoperability would be established at least on a technical level and to some extent also on a syntactic and structural level. The Web is a good example of that; it provides a uniform way of identifying resources, a common exchange protocol, and a simple standardized markup language. If the involved parties also agree on the semantics of terms, as is the goal of the various metadata standardization attempts, interoperability can also be established on a semantic level. In practice, however, such an agreement is hard to achieve, especially when multiple parties from a broad range of application domains are involved. We can observe numerous attempts at defining general (ontology) models for a complete domain (e.g., MPEG-7 for multimedia metadata, CIDOC CRM for the cultural heritage domain); although they provide a very detailed domain description, they have hardly been implemented in practice.
As long as people are the designers of models, different conceptions and interpretations will always exist, even for superficially homogeneous domains and application contexts. We therefore believe that research in the area of semantic interoperability should take this situation into account and find solutions that deal with a multitude of models and allow for their semi-automatic or manual reconciliation.
We believe that the World Wide Web will continue to be the predominant area for semantics and interoperability research. Applications that were available on the desktop before (e.g., email, calendar, office suites) are now on the Web. A more Web-centric solution for data management is, in our opinion, a logical consequence. The Linked Data movement is definitely an important starting point in this direction. However, it will require further research on the integration of existing data sources and the development of scalable graph-based data stores. Additionally, since data are exposed on the Web and should, in the end, also be consumable by machines, further research must be conducted in the areas of data quality, changeability of models, reliability of information, and data provenance. In fact, these research topics were already identified in the early years of database research. Now, however, the open, distributed, and uncontrolled nature of the Web calls for a review of these approaches and possibly their adaptation to a Web-based environment.
The evolution of schemas and ontologies in decentralized semantic structures such as the World Wide Web also calls for further research. Aberer et al. [1] coined the term Emergent Semantics, which denotes a research field focusing on the understanding of semantics by investigating the relationships between syntactic structures using social networking concepts for the necessary human interpretations.
References 1. Aberer, K., Catarci, T., Cudré-Mauroux, P., Dillon, T., Grimm, S., Hacid, M.-S., Illarramendi, A., Jarrar, M., Kashyap, V., Mecella, M., Mena, E., Scannapieco, M., Saltor, F., Santis, L.D., Spaccapietra, S., Staab, S., Studer, R., Troyer, O.D.: Emergent semantics systems. In: In International Conference on Semantics of a Networked World (ICSNW), pp. 14–43 (2004) 2. Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, San Mateo (1999). 3. Abiteboul, S., Quass, D., McHugh, J., Widom, J., Wiener, J.L.: The lorel query language for semistructured data. Int. J. Digit. Libr. 1(1), 68–88 (1997) 4. Abrial, J.-R.: Data semantics. In: Klimbie, J.W., Koffeman, K.L. (eds.) Data Base Management, pp. 1–60. North-Holland, Amsterdam (1974) 5. ANSI/X3/SPARC Study Group on Data Base Management Systems: Interim report. FDT— Bulletin of ACM SIGMOD 7(2), 1–140 (1975) 6. Berners-Lee, T.: Linked Data. World Wide Web Consortium, (2006). World Wide Web Consortium. Available at http://www.w3.org/DesignIssues/LinkedData.html 7. Berners-Lee, T., Conolly, D.: RFC 1866—Hypertext Markup Language—2.0. Network Working Group (1995) 8. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American (May 2001) 9. Bernstein, P.A., Melnik, S., Petropoulos, M., Quix, C.: Industrial-strength schema matching. SIGMOD Rec. 33(4), 38–43 (2004) 10. Bizer, C., Heath, T., Berners-Lee, T.: Linked data—the story so far. Int. J. Semant. Web Inf. Systems (IJSWIS) 5(3) (2009) 11. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia—a crystallization point for the web of data. J. Web Semant. 7(3), 154–165 (2009) 12. Bukhres, O.A., Elmagarmid, A.K. (eds.): Object-Oriented Multidatabase Systems: A Solution for Advanced Applications. Prentice Hall, New York (1996) 13. Bush, V.: As we may think. Atlantic Monthly 176(1), 101–108 (1945) 14. 
Ceri, S., Pelagatti, G.: Distributed Databases: Principles and Systems. McGraw-Hill, New York (1984) 15. Chamberlin, D.D., Boyce, R.F.: Sequel: a structured English query language. In: SIGFIDET ’74: Proceedings of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control, pp. 249–264. ACM, New York (1974). doi:10.1145/800296.811515 16. Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J.D., Widom, J.: The TSIMMIS project: integration of heterogeneous information sources. In: 16th Meeting of the Information Processing Society of Japan, Tokyo, Japan, pp. 7–18 (1994) 17. Chen, P.P.: The entity-relationship model: toward a unified view of data. In: Kerr, D.S. (ed.) VLDB, p. 173. ACM, New York (1975) 18. Chen, P.P.: The entity-relationship model—toward a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976) 19. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970) 20. Conklin, J.: Hypertext: an introduction and survey. Computer 20(9), 17–41 (1987). doi:10.1109/MC.1987.1663693 21. Dogac, A., Özsu, M.T., Biliris, A., Sellis, T.K. (eds.): Advances in Object-Oriented Database Systems, Proceedings of the NATO Advanced Study Institute on Object-Oriented Database Systems, Held in Izmir, Kusadasi, Turkey, August 6–16, 1993. NATO ASI Series F: Computing and Systems Sciences, vol. 130 (1994)
B. Haslhofer and E.J. Neuhold
22. Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, version 1.1. Available at: http://dublincore.org/documents/dces/ (December 2006) 23. Elmasri, R., Weeldreyer, J., Hevner, A.: The category concept: an extension to the entity-relationship model. Data Knowl. Eng. 1(1), 75–116 (1985). doi:10.1016/0169-023X(85)90027-8 24. Engelbart, D.C.: Augmenting Human Intellect: A Conceptual Framework. Stanford Research Institute, Menlo Park (1962) 25. Falkenberg, E.D.: Concepts for modelling information. In: Nijssen, G.M. (ed.) IFIP Working Conference on Modelling in Data Base Management Systems, Freudenstadt, Germany, pp. 95–109. North-Holland, Amsterdam (1976) 26. Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005). doi:10.1145/1107499.1107502 27. Goldfarb, C.F.: A generalized approach to document markup. In: Proceedings of the ACM SIGPLAN SIGOA Symposium on Text Manipulation, pp. 68–73. ACM, New York (1981). doi:10.1145/800209.806456 28. Goldfarb, C.F.: Standards-HyTime: a standard for structured hypermedia interchange. Computer 24(8), 81–84 (1991). doi:10.1109/2.84880 29. Grønbæk, K., Trigg, R.H.: Hypermedia system design applying the dexter model. Commun. ACM 37(2), 26–29 (1994). doi:10.1145/175235.175236 30. Gruber, T.: A translation approach to portable ontology specifications. Knowl. Acquis. 5, 199– 220 (1993) 31. Haas, L.M., Selinger, P.G., Bertino, E., Daniels, D., Lindsay, B.G., Lohman, G.M., Masunaga, Y., Mohan, C., Ng, P., Wilms, P.F., Yost, R.A.: R* : A research project on distributed relational DBMS. IEEE Database Eng. Bull. 5(4), 28–32 (1982) 32. Halasz, F., Schwartz, M.: The dexter hypertext reference model. Commun. ACM 37(2), 30–39 (1994). doi:10.1145/175235.175237 33. Halpin, T.: Object-role modeling (ORM/NIAM). In: Handbook on Architectures of Information Systems, pp. 81–102. Springer, Berlin (1998) 34. 
Haslhofer, B., Klas, W.: A survey of techniques for achieving metadata interoperability. ACM Comput. Surv. 42(2) (2010) 35. ISO JTC1 SC34. ISO 8879:1986 Information Processing—Text and Office Systems— Standard Generalized Markup Language (SGML) (1986) 36. Jacobs, I., Walsh, N.: Architecture of the World Wide Web, Volume One. Available at: http://www.w3.org/TR/webarch/ (December 2004) 37. Rothnie, J.B. Jr., Bernstein, P.A., Fox, S., Goodman, N., Hammer, M., Landers, T.A., Reeve, C.L., Shipman, D.W., Wong, E.: Introduction to a system for distributed databases (sdd-1). ACM Trans. Database Syst. 5(1), 1–17 (1980) 38. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. Knowl. Eng. Rev. 18(1), 1–31 (2003). doi:10.1017/S0269888903000651 39. Klas, W., Aberer, K., Neuhold, E.J.: Object-oriented modeling for hypermedia systems using the VODAK model language. In: NATO ASI OODBS, pp. 389–433 (1993) 40. Kosch, H.: Distributed Multimedia Database Technologies Supported MPEG-7 and by MPEG-21. CRC Press LLC, Boca Raton (2003) 41. Landers, T.A., Rosenberg, R.: An overview of multibase. In: DDB, pp. 153–184 (1982) 42. Litwin, W., Boudenant, J., Esculier, C., Ferrier, A., Glorieux, A.M., Chimia, J.L., Kabbaj, K., Moulinoux, C., Rolin, P., Stangret, C.: Sirius system for distributed data management. In: DDB, pp. 311–366 (1982) 43. Nack, F., Lindsay, A.T.: Everything you wanted to know about MPEG-7: Part 1. IEEE MultiMedia 6(3), 65–77 (1999) 44. Nelson, T.H.: Complex information processing: a file structure for the complex, the changing and the indeterminate. In: Proceedings of the 1965 20th National Conference, pp. 84–100. ACM, New York (1965). doi:10.1145/800197.806036 45. Neuhold, E.J., Biller, H.: Porel: A distributed data base on an inhomogeneous computer network. In: VLDB, pp. 380–395. IEEE Computer Society, Los Alamitos (1977)
A Retrospective on Semantics and Interoperability Research
46. Nijssen, G.M.: Current issues in conceptual schema concepts. In: Nijssen, G.M. (ed.) Proc. 1977 IFIP Working Conf. on Modelling in Data Base Management Systems, Nice, France, pp. 31–66. North-Holland, Amsterdam (1977) 47. Noy, N.F., Klein, M.: Ontology evolution: Not the same as schema evolution. Knowl. Inf. Syst. 6(4), 428–440 (2004). doi:10.1007/s10115-003-0137-2 48. Object Management Group (OMG). Meta Object Facility (MOF) core specification— version 2.0. Available at: http://www.omg.org/spec/MOF/2.0/PDF/ (January 2006) 49. Object Management Group (OMG). Unified Modelling Language (UML). Available at: http:// www.uml.org/ (2007) 50. Ouksel, A.M., Sheth, A.: Semantic interoperability in global information systems. SIGMOD Rec. 28(1), 5–12 (1999). doi:10.1145/309844.309849 51. Papakonstantinou, Y., Garcia-Molina, H., Widom, J.: Object exchange across heterogeneous information sources. In: Eleventh International Conference on Data Engineering (ICDE 1995), pp. 251–260 (1995) 52. Reid, B.K.: A high-level approach to computer document formatting. In: POPL ’80: Proceedings of the 7th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 24–31. ACM, New York (1980). doi:10.1145/567446.567449 53. Rowley, J.: The wisdom hierarchy: representations of the DIKW hierarchy. J. Inf. Sci. 33(2), 163–180 (2007). doi:10.1177/0165551506070706 54. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22(3), 183–236 (1990). doi:10.1145/96602.96604 55. Stonebraker, M., Held, G., Wong, E., Kreps, P.: The design and implementation of INGRES. ACM Trans. Database Syst. 1(3), 189–222 (1976). doi:10.1145/320473.320476 56. Stonebraker, M., Neuhold, E.J.: A distributed database version of INGRES. In: Berkeley Workshop, pp. 19–36 (1977) 57. Sundgren, B.: An infological approach to data bases. PhD thesis, University of Stockholm (1973) 58. 
The EDItEUR Group: Online Information Exchange (ONIX). Available at: http://www. editeur.org/onix.html (2007) 59. Trigg, R.H., Weiser, M.: Textnet: a network-based approach to text handling. ACM Trans. Inf. Syst. 4(1), 1–23 (1986). doi:10.1145/5401.5402 60. Visser, P.R.S., Jones, D.M., Bench-Capon, T.J.M., Shave, M.J.R.: An analysis of ontological mismatches: Heterogeneity versus interoperability. In: AAAI 1997 Spring Symposium on Ontological Engineering, Stanford University, Stanford (1997) 61. W3C XML Activity. Extensible Markup Language (XML) 1.0. W3C. Available at: http:// www.w3.org/TR/1998/REC-xml-19980210 (1998) 62. Wache, H.: Semantische Mediation für heterogene Informationsquellen. PhD thesis, University of Bremen (2003) 63. Wiederhold, G.: Mediators in the architecture of future information systems. Computer 25(3), 38–49 (1992). doi:10.1109/2.121508
Semantic Web and Applied Informatics: Selected Research Activities in the Institute AIFB Andreas Oberweis, Hartmut Schmeck, Detlef Seese, Wolffried Stucky, and Stefan Tai
Abstract Research on the Semantic Web has a long tradition at the Institute AIFB and has strongly influenced many research projects in the broader field of Applied Informatics. This article reports on a selection of research activities that illustrate the fruitful exchange between Semantic Web research and Applied Informatics, covering the topics of Logic and Complexity Management, Efficient Algorithms, Organic Computing, and Business Process Management. This article does not have the character of a research paper in the classical sense. As a contribution to this Festschrift, it comprises a very special and personal selection of topics and research activities and should be considered a message of greetings in honor of our colleague Rudi Studer.
1 Introduction
The advent of the Internet has changed the world. New business applications, free flow of information, new services and software architectures, and new global business processes are emerging. The Internet is developing into a comprehensive database storing all the knowledge of mankind, and is becoming the active brain of mankind, used for science (storing old discoveries and enabling new ones), for qualitatively new forms of organization of business (new companies such as Amazon, Google, and eBay, as well as new products and new services), and for politics. With the evolution of “a new generation of open and interacting social machines” enabling “interactive ‘read/write’ technologies (e.g. Wikis, blogs, and photo/video sharing)” [26], the modern Web is changing human life every day. But at the same time, these new advantages introduce scientific and technological challenges. On the one hand, information needed to solve a specific problem can be at our fingertips when we need it; on the other hand, we are drowning in information and are not able to master the complexity and the dynamics of the processes created (consider, for example, the growing product complexity, i.e. the great array of variants caused by the diversity requirements of well-informed customers).
A. Oberweis, Institute AIFB, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany e-mail: [email protected] url: http://www.aifb.kit.edu
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_2, © Springer-Verlag Berlin Heidelberg 2011
Semantic web technology is an enabling factor for many of the new applications, products, and services, and at the same time helps to address some of their shortcomings. This technology needs a scientific foundation rooted in Mathematics, Logic, and Theoretical Computer Science. It is an integral part of Applied Informatics and is closely related to Efficient Algorithms and Organic Computing, Business Information and Communication Systems, Service Computing and Service Engineering, Software and Process Engineering, and Complexity Management, all of which are areas of research represented at the Institute AIFB (www.aifb.kit.edu). The objective of this article is to demonstrate, using specific examples, that there is a fruitful interconnection between semantic web technology and the above-mentioned areas of Applied Informatics. This interconnection is beneficial for all sides and stimulates further research, either in joint projects, where we benefit directly from each other’s competence, or in individual projects integrating topics from other groups. Cooperation starts with joint discussions and regular meetings, which create a good working atmosphere at the Institute AIFB and allow us to exchange ideas for solving all kinds of daily problems. This paper presents a personal selection of topics and problems born out of the fruitful cooperation and collaboration with Rudi Studer at the Institute AIFB over several years. It is motivated by our friendship with him and, quite simply, by having him as such a productive colleague. This article is definitely not a research paper in the classical sense. As a contribution to this Festschrift, it should be considered a message of greetings in honor of our colleague Rudi Studer.
The article is organized as follows: Section 2 discusses, as a general introduction, the challenging interrelationship between the expressive power of languages suitable for describing and accessing knowledge and the complexity of the problems to be solved. Section 3 surveys recent work in the field of semantic business process management at the Institute AIFB. This is followed by a wrap-up of joint research on distributed market structures and system architectures related to innovative energy systems and on the role of semantic web technologies for self-organization and organic computing.
2 A Logical Way to Master Complexity: the Challenging Tradeoff Between Expressiveness and Tractability
A significant amount of the world’s knowledge is stored on the World Wide Web, and this amount is constantly increasing. The rapid growth rate and dynamics of this development, together with the sheer number of different documents, their varieties, and their varying quality, emphasize the challenging task of managing and finding information on the WWW. In order for users to handle this variety of information more comfortably, it is best to use highly expressive languages to formulate queries, as they are closest to the natural languages used on the Internet. Unfortunately, even small doses of expressive power come at the price of significant computational complexity (see [12]). Hence, it is natural to use formal methods to support all the
different tasks of knowledge processing in an automatic or at least semi-automatic way. Historically, the most suitable way to support such tasks has been the toolbox of mathematical and philosophical logic (see e.g. [20] and [13]). This toolbox has provided suitable ways to represent and process knowledge and to define a correct semantics, the notion of truth, the notion of model, and the notions of proof and deduction used to infer new knowledge from known facts. It has also provided the method of translating knowledge expressed in one language into knowledge expressed in another language via interpretations. It is here that the emerging semantic web technology finds solid ground to support one of its most urgent needs: finding the correct meaning of the knowledge represented on the web and giving users’ questions their intended meaning, thereby guiding the search for knowledge and enabling correct answers to queries. Ontologies are handy tools to support this task. According to e.g. [28], they “provide formal specifications and computationally tractable standardized definitions of terms used to represent knowledge of specific domains in ways designed to maximize intercommunicability with other domains. The importance of ontologies has been recognized in fields as diverse as e-commerce, enterprise and information integration, qualitative modeling of physical systems, natural language processing, knowledge engineering, database design, medical information science, geographic information science, and intelligent information access. In each of these fields a common ontology is needed in order to provide a unifying framework for communication.” But this huge variety of applications requires a sufficiently expressive language for knowledge representation and widely applicable methods to process knowledge, which should be understandable and should scale efficiently to large knowledge bases.
The price one must pay is that complexity prevents easy access to really efficient solutions for most of the interesting questions in knowledge representation and reasoning. Consider as examples the XML-based Semantic Web languages OWL (the Web Ontology Language, which was standardized in 2004 by the World Wide Web Consortium, W3C) and RDF (the Resource Description Framework, which allows statements to be expressed in the form of subject-predicate-object triples), together with the related description logics (DL). Reasoning in the DL-based variants OWL Lite and OWL DL is already EXPTIME-complete and NEXPTIME-complete, respectively, and thus algorithmically intractable (see [36]). Considering RDF, it was shown in [45] that for SPARQL, a query language for RDF, the evaluation of SPARQL patterns is PSPACE-complete. Moreover, for the satisfiability problem of special variants of description logics one observes: S is PSPACE-complete (without TBox), SI is PSPACE-complete, SH, SHIF and SHIQ are EXPTIME-complete, and SHOIQ is NEXPTIME-hard (see e.g. [3, 6, 27, 52]). This personal selection of complexity results illustrates the tradeoff of expressiveness vs. tractability. It is a general observation in mathematical logic that the complexity of algorithmic problems such as satisfiability (for propositional logic this is SAT, the well-known “mother of NP-complete problems”, whose goal is to find a satisfying assignment for the input formula), consistency, and model
checking, grows with the expressive power of the underlying logic and with the structural complexity of the considered class of models. These results date back to early investigations into decidability of theories (see e.g. [8, 46, 51]). They can also be observed in recent investigations of Descriptive Complexity Theory (see [32]), Finite Model Theory (see [24, 38]) and Parameterized Complexity Theory (see [16, 23, 29]). There is, of course, a lot of research and stimulating progress in the area of Semantic Web research itself, trying to make special algorithmic solutions more efficient and to develop algorithms which are scalable in practice (see the articles in this volume, e.g. [5]). Following the history of the current development, we can observe that, together with contemporary developments in Applied Informatics, the results and the tools developed in mathematical logic have stimulated and enabled the current success of the semantic web. But research in these areas is no one-way street; real progress will only be observed with cooperation between many areas. We assume that several questions could be of real interest here. Most of the complexity results use the quite general approach of worst-case and asymptotic complexity. Here, the approach of Parameterized Complexity (see [16] for a general introduction), which recommends the study and use of more specific fine structures of the problems (to find parameters $k$, specific to the problem, such that problems of input size $n$ can be solved in $f(k) \cdot n^{O(1)}$ time), could help to lead to further improvements (see [23] for first promising steps), as it did in Computational Biology [37]. As an example, consider the maximum satisfiability problem of propositional calculus (MAX-SAT), where the goal is to find an assignment which maximizes the number of satisfied clauses. Even for the maximum clause size $k = 2$, the maximum 2-satisfiability problem (MAX 2-SAT) is NP-complete.
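To make the MAX-SAT objective concrete, the following is a minimal sketch of an exhaustive solver. It is not one of the algorithms cited in the text; it simply tries all $2^n$ assignments, and the clause encoding (signed integers for literals) and the toy instance are assumptions made for illustration.

```python
from itertools import product

# A clause is a tuple of literals; literal +i means "variable i is true",
# -i means "variable i is false" (variables are numbered from 1).
Clause = tuple[int, ...]


def satisfied_count(clauses: list[Clause], assignment: dict[int, bool]) -> int:
    """Count the clauses that contain at least one true literal."""
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def max_sat_brute_force(clauses: list[Clause]) -> tuple[int, dict[int, bool]]:
    """Try all 2^n assignments; return the best count and one witness."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    best_count, best_assignment = -1, {}
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        count = satisfied_count(clauses, assignment)
        if count > best_count:
            best_count, best_assignment = count, assignment
    return best_count, best_assignment


# Hypothetical toy instance with clauses of size 2 (a MAX 2-SAT instance).
# The four clauses over x1 and x2 cannot all hold at once.
toy_clauses = [(1, 2), (1, -2), (-1, 2), (-1, -2)]
best, witness = max_sat_brute_force(toy_clauses)
```

In the toy instance every assignment falsifies exactly one clause, so the optimum is three satisfied clauses out of four. The exhaustive loop is the baseline that improved exponential bounds are measured against.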
A recent breakthrough, however, shows that MAX 2-SAT can be solved in $1.74^n$ steps [53], while even for clause size 3 (MAX 3-SAT) the best known bound for $n$ variables is $2^n$ [43]. When the number of clauses is bounded by $m$, MAX 2-SAT can be solved in $1.15^m$ steps, and when the formula length is bounded by $l$, MAX 2-SAT can be solved in $1.08^l$ steps (see [43]). The advantage of using algorithmic ideas to maximize the weights of variable assignments which “indicate if they can be added to the knowledge base while keeping it as consistent as possible” is that “three steps that frequently occur separately in Information Extraction can be treated by solving one maximum satisfiability problem: pattern selection, entity disambiguation and consistency checking” [11]. It could be promising to investigate the possibility of improving the efficiency of these methods by using the machinery of parameterized algorithms (see also [47, 50]). Furthermore, it would be interesting to combine parameterized approaches with research on the real structure of queries and with a careful analysis of the structure of the Web as a complex dynamic system (see e.g. [4, 7, 22]), using techniques from Algorithm Engineering (see e.g. [42]). A further approach which could be of interest is the use of Fuzzy Logic (see e.g. [1, 21, 33]), which could relax the overpowering demand for sound and complete deductions. Of special interest in this respect could be the use of Modal Logic and possible world semantics or Kripke models [18], or a relaxation given by approximate reasoning (see e.g. [52]).
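The practical effect of a smaller base in an exponential bound can be checked with simple arithmetic. The sketch below takes the bases quoted above and asks how large an instance can become before a fixed step budget is exceeded. The budget of $10^{12}$ steps is a hypothetical figure, and treating each bound as an exact step count (ignoring polynomial factors and constants) is a simplifying assumption.

```python
def bound(base: float, size: int) -> float:
    """Value of base**size, read as a step count without lower-order factors."""
    return base ** size


def largest_feasible_size(base: float, budget: float) -> int:
    """Largest size whose bound still fits into the given step budget."""
    size = 0
    while bound(base, size + 1) <= budget:
        size += 1
    return size


BUDGET = 1e12  # hypothetical number of steps we are willing to spend

# Bases as quoted in the text; the labels name the measured parameter.
BASES = {
    "2^n (exhaustive search over n variables)": 2.0,
    "1.74^n (MAX 2-SAT, n variables)": 1.74,
    "1.15^m (MAX 2-SAT, m clauses)": 1.15,
    "1.08^l (MAX 2-SAT, formula length l)": 1.08,
}

feasible = {label: largest_feasible_size(base, BUDGET) for label, base in BASES.items()}

for label, size in feasible.items():
    print(f"{label}: up to size {size}")
```

Under these assumptions, lowering the base from 2 to 1.74 raises the feasible number of variables from 39 to 49 for the same budget. The sizes for the $m$ and $l$ bounds are not directly comparable with those for $n$, since they measure different parameters of the formula.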
3 Semantic Business Process Management
This section surveys recent work in the field of semantic business process management at the Institute AIFB in the research group on business information systems. Here, we focus on three aspects:
– Search for similar business process models in a process model repository.
– Coupling of process models from different organizations.
– Automatic user support for business process model editors.
Business process management provides languages, methods, and software tools to support the whole life cycle of business process types and of business process instances. Semantic business process management integrates modeling languages (e.g. Petri Nets, Event Driven Process Chains, Business Process Modeling Notation) from the field of business process management and semantic technologies.
3.1 Search for Similar Business Process Models in Process Model Repositories
A business process may be modeled in different ways by different modelers, even when they use the same modeling language. Concepts for finding similar process models in a process repository must include appropriate methods for solving ambiguity issues caused by the use of synonyms, homonyms, or different abstraction levels for process element names. So-called semantic business process models, which are ontology-based descriptions of process models, allow the search for similar process models or process fragments in a repository. In [17] a concept to compute the similarity between business process models is presented based on an OWL DL description of Petri nets. The use of semantic technologies in this scenario has the advantage that business process models (including process activity names) can be described in an unambiguous format, which supports computer reasoning and the automation of process composition. A more detailed description of these concepts can be found in [17, 34]. Search for similar process models based on recommender systems is proposed in [30, 31, 35].
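The role of synonyms in similarity search can be illustrated with a small sketch. It is not the similarity measure of [17], which works on an OWL DL description of Petri nets; it only assumes a hand-written synonym table (hypothetical) that maps label variants to a canonical term, and compares two models by the Jaccard overlap of their normalized activity labels.

```python
# Hypothetical synonym table: maps a label variant to one canonical term.
SYNONYMS = {
    "bill": "invoice",
    "customer": "client",
    "dispatch": "ship",
    "verify": "check",
}


def normalize(label: str) -> frozenset[str]:
    """Lower-case a label, split it into words, and replace known synonyms."""
    return frozenset(SYNONYMS.get(word, word) for word in label.lower().split())


def model_similarity(model_a: list[str], model_b: list[str]) -> float:
    """Jaccard similarity of the normalized activity labels of two models."""
    labels_a = {normalize(label) for label in model_a}
    labels_b = {normalize(label) for label in model_b}
    if not labels_a and not labels_b:
        return 1.0
    return len(labels_a & labels_b) / len(labels_a | labels_b)


# Two toy models of the same process, written by different modelers.
model_1 = ["Check order", "Send bill", "Ship goods"]
model_2 = ["Verify order", "Send invoice", "Dispatch goods", "Archive order"]

similarity = model_similarity(model_1, model_2)
```

Without the synonym table the two toy models share no label at all. With it, three of the four distinct activities coincide, which gives a similarity of 0.75. An ontology plays the role of the table in the approach described above, and it can additionally capture homonyms and abstraction levels, which this sketch ignores.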
3.2 Coupling of Process Models from Different Organizations
The rapid growth of activity on electronic markets demands flexibility and automation of the IT systems involved in order to facilitate the interconnectivity of business processes and to reduce the required communication effort. Inter-organizational business collaborations create synergy effects and can reduce enterprise risks. However, the insufficient integration of enterprises hampers collaboration due to
different interpretations of the meaning of the vocabulary used. Furthermore, the integration of collaborating business partners into one single value creation chain requires flexible business processes on both sides to reduce integration cost and time. The interconnectivity of business processes can fail due to company-specific vocabularies even if business partners share similar demands. In [14], a method is presented for semantically aligning business process models. The method supports (semi)automatic interconnectivity of business processes. A representation of Petri nets in the ontology language OWL is proposed to semantically enrich the business process models. This semantic alignment is improved by a background ontology modeled with a specific UML profile, which allows it to be modeled visually. Given two structures (e.g., ontologies or Petri nets), aligning one structure with another means that for each entity (e.g., concepts and relations, or places and transitions) in the first structure, one tries to find a corresponding entity with the same intended meaning in the second structure. Semantic technologies allow for the quicker discovery of mutual relationships between process models. For further information on this concept see [14, 34].
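The definition of alignment given above (for each entity in the first structure, find a corresponding entity in the second) can be sketched as a simple matching loop. This is not the method of [14]: in place of ontology-based reasoning it uses character-based string similarity from Python's standard `difflib` module, and the threshold of 0.6 and the transition labels are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def label_similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(first: list[str], second: list[str],
          threshold: float = 0.6) -> dict[str, Optional[str]]:
    """Map each entity of `first` to its best match in `second`, or to None."""
    alignment: dict[str, Optional[str]] = {}
    for entity in first:
        best = max(second, key=lambda c: label_similarity(entity, c), default=None)
        if best is not None and label_similarity(entity, best) >= threshold:
            alignment[entity] = best
        else:
            alignment[entity] = None  # no sufficiently similar counterpart
    return alignment


# Hypothetical transition labels of two partners' process models.
partner_a = ["receive order", "check credit", "ship goods", "notify customer"]
partner_b = ["receive customer order", "check credit limit", "ship goods", "archive file"]

alignment = align(partner_a, partner_b)
```

A purely syntactic measure like this one cannot relate labels that differ in vocabulary but agree in meaning, which is exactly the gap that the background ontology in the approach described above is meant to close.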
3.3 Automatic User Support for Business Process Model Editors
Manual modeling of business processes is a time-consuming task, and typos and structural modeling errors make it particularly error-prone. Modelers can be assisted in editing business process models by an autocompletion mechanism that is available during the modeling activity. In [10], autocompletion support for a business process model editor is introduced. This approach is based upon an OWL DL description of Petri nets. The autocompletion mechanism requires validation methods to check required properties of the automatically completed business process. To solve ambiguity issues caused by the use of different names for the same tasks, business process models require a machine-readable format that can be used for automatic reasoning. Business processes modeled, e.g., with Petri nets can be translated into the Web Ontology Language OWL, an unambiguous format which allows ontological reasoning. These semantic business process models combine process modeling methods with semantic technologies. A semantic description of Petri nets makes it easier to find appropriate process templates (reference processes), which can be proposed for autocompletion. During the modeling process, a recommendation mechanism determines possible subsequent fragments of all templates by computing similarities. If the system detects a high similarity between an element of a template and a modeling element, then the elements that follow this template element are proposed for autocompletion. To ensure correct process flow behavior, the system must also check properties such as deadlock freeness. The benefit of semantic technologies in this scenario is the exploitation of reasoning techniques that are provided for OWL and allow the inference of appropriate process fragments to complete a given business process model. For a more detailed description see [10, 34, 35].
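A property such as deadlock freeness can be checked on small Petri nets by exploring all reachable markings. The following is a minimal sketch under stated assumptions, not the validation method of [10]: arcs have weight 1, each place occurs at most once among the inputs or outputs of a transition, the net is bounded, and a marking counts as a deadlock if no transition is enabled and it is not the intended final marking. All place and transition names are hypothetical.

```python
from collections import deque

Marking = tuple[int, ...]


class PetriNet:
    """Place/transition net with arc weight 1 (a simplifying assumption)."""

    def __init__(self, places: list[str],
                 transitions: dict[str, tuple[list[str], list[str]]]):
        self.places = places
        self.index = {place: i for i, place in enumerate(places)}
        self.transitions = transitions  # name -> (input places, output places)

    def marking(self, tokens: dict[str, int]) -> Marking:
        return tuple(tokens.get(place, 0) for place in self.places)

    def enabled(self, m: Marking) -> list[str]:
        """Transitions whose input places all carry at least one token."""
        return [name for name, (inputs, _) in self.transitions.items()
                if all(m[self.index[p]] >= 1 for p in inputs)]

    def fire(self, m: Marking, name: str) -> Marking:
        inputs, outputs = self.transitions[name]
        tokens = list(m)
        for p in inputs:
            tokens[self.index[p]] -= 1
        for p in outputs:
            tokens[self.index[p]] += 1
        return tuple(tokens)


def find_deadlocks(net: PetriNet, initial: Marking, final: Marking,
                   limit: int = 10_000) -> list[Marking]:
    """Breadth-first search over reachable markings; collect dead non-final ones."""
    seen, queue, deadlocks = {initial}, deque([initial]), []
    while queue:
        m = queue.popleft()
        enabled = net.enabled(m)
        if not enabled and m != final:
            deadlocks.append(m)
        for name in enabled:
            successor = net.fire(m, name)
            if successor not in seen:
                if len(seen) >= limit:
                    raise RuntimeError("state space too large or net unbounded")
                seen.add(successor)
                queue.append(successor)
    return deadlocks


# A proposed completion that joins two alternative branches with an AND-join:
# only one branch ever receives a token, so the join can never fire.
broken = PetriNet(
    ["start", "a", "b", "end"],
    {"t_left": (["start"], ["a"]),
     "t_right": (["start"], ["b"]),
     "t_join": (["a", "b"], ["end"])},
)
broken_deadlocks = find_deadlocks(
    broken, broken.marking({"start": 1}), broken.marking({"end": 1}))

# A plain sequence reaches its final marking without getting stuck.
sequence = PetriNet(
    ["start", "p1", "end"],
    {"t1": (["start"], ["p1"]), "t2": (["p1"], ["end"])},
)
sequence_deadlocks = find_deadlocks(
    sequence, sequence.marking({"start": 1}), sequence.marking({"end": 1}))
```

An editor could run such a check on each candidate fragment before proposing it and discard candidates for which the list of deadlocks is not empty. Exhaustive exploration is only feasible for small, bounded nets, which is why the `limit` parameter aborts the search.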
Fig. 1 SESAM marketplace for energy products
4 Semantic Web Technologies in Self-organization and Organic Computing
The semantic web inherently refers to research problems arising from a large network of distributed resources built on Internet technologies. The growth of the Internet and, in particular, the introduction of the World Wide Web have influenced social and economic structures in many respects. The economic potential of Internet and web technologies was the topic of the research program “Internet Economics” of the Federal Ministry of Education and Research (BMBF) from 2003 to 2007. At Karlsruhe, the project SESAM: Self-organization and Spontaneity in Liberalized and Harmonized Markets was funded by this program. Based on the unique combination of competences at the Karlsruhe Institute of Technology (KIT), ten chairs from management and economics, informatics, and law cooperated in the analysis and design of market structures and mechanisms which are heavily influenced by Internet technology. Starting from multi-utility markets, the major area of application has been the design of innovative markets for an energy system which is increasingly influenced by highly distributed components acting as consumers and producers of electric power. The objective of SESAM has been to design a distributed peer-to-peer based market for trading energy related products and services (see Fig. 1 and [15]), allowing for spontaneous entry and exit of participants and providing a range of services for the optimization of tariffs, for bundling consumers or producers into virtual consumers and virtual power plants by forming cooperations of participants with matching (or complementary) interests (see [2, 19]), and for agent-based negotiations and legally binding contracts (see [9]). An inherent prerequisite for the construction and operation of such a market is a common basis of understanding of the essential terms describing technical, economic, or legal aspects.
Consequently, the design of ontologies has been a backbone of this project. The appropriate combination of specific ontologies into the various combined aspects of such a marketplace posed a particular challenge (this is indicated in Fig. 2). In particular, if contracts have to be negotiated by agents representing stakeholders from different regions, an essential prerequisite for obtaining legally binding contracts is a common understanding of technical and legal terms and requirements. This could only be achieved by solid modeling with combined ontologies.
Fig. 2 Combination and derivation of market related ontologies
The SESAM project has been a perfect example of the potential benefits of cooperation between the research groups at the Institute AIFB, in this case related to optimization (of tariffs and cooperation of market participants), efficient implementation of distributed market mechanisms in peer-to-peer technologies, and a common understanding of terms provided by ontologies. The research carried out in this project has strongly influenced the shaping of a major follow-up research program by the Federal Ministry of Economics on E-Energy: Large so-called model regions are formed for the prototypical evaluation of various approaches for the use of information and communication technologies and market mechanisms in the electrical power grid, based on smart metering of the grid components (see the editorial [49] of the topical issue of it-information technology on E-Energy). One of the six selected model regions has been formed by the utility company EnBW together with ABB, IBM, SAP, the consulting company Systemplan and the KIT under the name MeRegio: Moving towards Minimum Emission Regions. Five chairs of the KIT are involved in this project, representing expertise in management and economics, informatics, and law (see [39] at http://www.meregio.de). As an extension, another large project started in 2009: In the project MeRegioMobil, eleven chairs of the KIT are cooperating with Bosch, Daimler, EnBW, Opel, SAP, Stadtwerke Karlsruhe and Fraunhofer ISI on the use of ICT for the integration of electric vehicles into the power network. A broad range of topics has to be addressed: The charging of batteries of electric vehicles has to be controlled in order to prevent excessive peak load on the power grid.
This control can be further utilized to exploit the potential of dynamically available storage for electricity for balancing supply and demand of electric power, in particular in the presence of fluctuations in the supply of power from renewable sources. Furthermore, electric vehicles will rely on the availability of public and semi-public charging stations. The recharging of batteries at these different types of stations has to be supported by appropriate services addressing topics such as the identification of cars, and the possibility of roaming, which means charging for the use of electric power by one provider only, based on individual contracts
with the utility companies, and respecting privacy issues. An essential part of this project is the design of a reference model for electric mobility. Since several stakeholders have to cooperate on these issues, they need a common understanding of all the interrelated topics and official regulations referring to car manufacturers, utility companies, or providers of charging stations. Therefore, the design of appropriate ontologies forms an important part of this project, which is getting substantial input from the group of Rudi Studer at the Institute AIFB.

A quite visible part of MeRegioMobil is the construction of a demo and research lab for the integration of (the batteries of) electric vehicles into a smart home environment. The smart home is equipped with intelligent household appliances such as a refrigerator and a washing machine, with solar panels, and with a micro-scale combined heat and power plant. One of the challenges consists of balancing the demand and supply of electricity and heat so that personal preferences and network requirements are matched in the most efficient way possible. Scaled to a large number of households, this requires ontologies for effective communication, and it relies on a potentially high degree of autonomy and self-organization, since it is infeasible to explicitly control the behavior of all the participating components.

Self-organization in technical application scenarios and its adaptive control are the major topics of the research area of Organic Computing, which has been influenced significantly by researchers from the Institute AIFB (see [41, 44, 48]). The research community in Organic Computing is concentrating on the design of system architectures and methods supporting controlled self-organization in highly distributed and dynamically changing environments. The overall goal of this research is the design of dependable systems which can operate in a highly autonomous way in spite of unanticipated situations in their operating environment, complying with the objectives and expectations of their human users (or their “controlling entities”). Therefore, communication and interaction in complex, highly networked technical application systems as addressed in Organic Computing could benefit significantly from technologies resulting from research on the Semantic Web, as they would provide a basis for deriving meaningful interaction.
5 Conclusion
In this contribution we have sketched a selection of research topics at the Institute AIFB which are examples of the broad interconnections between Semantic Web research and other areas of Applied Informatics. Due to restrictions of time and space, we have barely scratched the surface of the broad range of topics and joint research activities at the Institute AIFB. Some topics which we could not address but which are nevertheless highly relevant are the design of efficient algorithms for large-scale applications on the Internet, the design of intelligent systems to support Web Services and service-oriented architectures and applications, aspects of Information Management and Market Engineering (http://www.ime.uni-karlsruhe.de), Software and Systems Engineering, and last but not least, Service Science. All these areas are represented at the Institute AIFB, and they provide the basis for cooperative research in challenging projects in the past, at present, and in the future.
Acknowledgements
We wish to thank Agnes Koschmider for her contribution to Sect. 2 “Semantic Business Process Management”.
References
1. Agarwal, S., Hitzler, P.: Modeling fuzzy rules with description logics. In: Cuenca Grau, B., Horrocks, I., Parsia, B., Patel-Schneider, P. (eds.) Proceedings of Workshop on OWL Experiences and Directions, Galway, Ireland. CEUR Workshop Proceedings, vol. 188 (2005). Online: http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-188/sub5.pdf
2. Agarwal, S., Lamparter, S.: SMART: a semantic matchmaking portal for electronic markets. In: Proceedings of Seventh IEEE International Conference on E-Commerce Technology (CEC’05), pp. 405–408 (2005)
3. Ait-Kaci, H., Podelski, A., Smolka, G.: A feature constraint system for logic programming with entailment. Theor. Comput. Sci. 122(1&2), 263–283 (1994)
4. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002)
5. Angele, J., Schnurr, H.P., Brockmans, S., Erdmann, M.: Real world application of semantic technology. In: Fensel, D. (ed.) Foundations for the Web of Information and Services, pp. 327–341. Springer, Berlin (2011). This volume
6. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description-Logic Handbook. Cambridge University Press, Cambridge (2003)
7. Barabási, A.L., Dezso, Z., Ravasz, E., Yook, S.H., Oltvai, Z.: Scale-free and hierarchical structures in complex networks. In: Pastor-Satorras, R., Rubi, J., Diaz-Guilera, A. (eds.) Statistical Mechanics of Complex Networks. Lecture Notes in Physics. Springer, Berlin (2003)
8. Baudisch, A., Seese, D., Tuschik, P., Weese, M.: Decidability and quantifier-elimination. In: Barwise, J., Feferman, S. (eds.) Model-Theoretic Logics, pp. 235–270. Springer, New York (1985). Chap. VII
9. Bergfelder, M., Nitschke, T., Sorge, C.: Signaturen durch elektronische Agenten – Vertragsschluss, Form und Beweis. Informatik Spektrum 28(3), 210–219 (2005)
10. Betz, S., Klink, S., Koschmider, A., Oberweis, A.: Automatic user support for business process modeling. In: Hinkelmann, K., Karagiannis, D., Stojanovic, N., Wagner, G. (eds.) Proc. Workshop on Semantics for Business Process Management at the 3rd European Semantic Web Conference, Budva/Montenegro, pp. 1–12 (2006)
11. Blohm, S.: Large-scale pattern-based information extraction from the world wide web. PhD Thesis at the Karlsruhe Institute of Technology (KIT) (2009)
12. Brachman, R.J., Levesque, H.J.: The tractability of subsumption in frame-based description languages. In: Brachman, R.J. (ed.) Proceedings of the National Conference on Artificial Intelligence (AAAI’84), Austin, USA, August 6–10, 1984, pp. 34–37. AAAI Press, Menlo Park (1984)
13. Brachman, R.J., Levesque, H.J.: Knowledge Representation and Reasoning. Elsevier, Amsterdam (2004)
14. Brockmans, S., Ehrig, M., Koschmider, A., Oberweis, A., Studer, R.: Semantic alignment of business processes. In: Manolopoulos, Y., Filipe, J., Constantopoulos, P., Cordeiro, J. (eds.) Proc. Eighth International Conference on Enterprise Information Systems (ICEIS 2006), pp. 191–196. INSTICC Press, Paphos/Cyprus (2006)
15. Conrad, M., Dinger, J., Hartenstein, H., Rolli, D., Schöller, M., Zitterbart, M.: A peer-to-peer framework for electronic markets. In: Peer-to-Peer Systems and Applications. Lecture Notes in Computer Science, vol. 3485. Springer, Berlin (2005)
16. Downey, R.G., Fellows, M.R.: Parameterized Complexity. Springer, New York (1999)
17. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring similarity between semantic business process models. In: Roddick, J.F., Hinze, A. (eds.) Conceptual Modelling 2007, Proc. Fourth Asia-Pacific Conference on Conceptual Modelling (APCCM 2007), Ballarat, Victoria/Australia. Australian Computer Science Communications, vol. 67, pp. 71–80 (2007)
18. Fitting, M.C.: Intuitionistic Logic Model Theory and Forcing. North-Holland, Amsterdam (1969)
19. Franke, M., Rolli, D., Kamper, A., Dietrich, A., Geyer-Schulz, A., Lockemann, P., Schmeck, H., Weinhardt, C.: Impacts of distributed generation from virtual power plants. In: Proceedings of the 11th Annual International Sustainable Development Research Conference, pp. 1–12 (2005)
20. Genesereth, M.L., Nilsson, N.J.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann, Los Altos (1987)
21. Geyer-Schulz, A.: Fuzzy Rule-Based Expert Systems and Genetic Machine Learning, second revised and enlarged edition. Physica-Verlag, Heidelberg (1997)
22. Gil, R., García, R., Delgado, J.: Measuring the Semantic Web, ongoing research column: real world SW cases. AIS SIGSEMIS Bull. 1(2), 69–72 (2004)
23. Gottlob, G., Szeider, S.: Fixed-parameter algorithms for artificial intelligence, constraint satisfaction and database problems. Comput. J. 51(3), 303–325 (2008)
24. Grädel, E., Kolaitis, P.G., Libkin, L., Marx, M., Spencer, J., Vardi, M.Y., Venema, Y., Weinstein, S.: Finite Model Theory and Its Applications. Springer, Berlin (2007)
25. Grimm, S., Lamparter, S., Abecker, A., Agarwal, S., Eberhart, A.: Ontology based specification of web service policies. In: Proceedings of Semantic Web Services and Dynamic Networks, Informatik04 (2004)
26. Hendler, J., Berners-Lee, T.: From the semantic Web to social machines: a research challenge for AI on the World Wide Web. Artif. Intell. 174, 156–161 (2010)
27. Hemaspaandra, L.A., Ogihara, M.: The Complexity Theory Companion. Springer, Berlin (2002)
28. Herre, H., Heller, B., Burek, P., Hoehndorf, R., Loebe, F., Michalek, H.: General Formal Ontology (GFO), Part I: Basic Principles, Version 1.0, OntoMed Report No. 8, July 2006. University Leipzig, Institute of Medical Informatics, Statistics and Epidemiology (IMISE) and Institute of Informatics (IfI) Department of Formal Concepts (2006). http://www.ontomed.de/
29. Hliněný, P., Oum, S., Seese, D., Gottlob, G.: Width parameters beyond tree-width and their applications. Comput. J. 51(3), 326–362 (2008)
30. Hornung, T., Koschmider, A., Oberweis, A.: Rule-based autocompletion of business process models. In: Proc. CAiSE Forum, 19th Conference on Advanced Information Systems Engineering, Trondheim/Norway (2007)
31. Hornung, T., Koschmider, A., Lausen, G.: Recommendation based process modeling support: method and user experience. In: Li, Q., Spaccapietra, S., Yu, E., Olivé, A. (eds.) Proc. 27th International Conference on Conceptual Modeling (ER’08), Barcelona/Spain. Lecture Notes in Computer Science, pp. 265–278. Springer, Berlin (2008)
32. Immerman, N.: Descriptive Complexity. Springer, New York (1998)
33. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic. Prentice Hall, New York (1995)
34. Koschmider, A.: Ähnlichkeitsbasierte Modellierungsunterstützung für Geschäftsprozesse. Dissertation. Karlsruhe University Press, Universität Karlsruhe (TH) (2007) (in German)
35. Koschmider, A., Oberweis, A.: Designing business processes with a recommendation-based editor. In: Rosemann, M., vom Brocke, J. (eds.) Handbook on Business Process Management, vol. 1. Springer, Berlin (2010)
36. Krötzsch, M., Rudolph, S., Hitzler, P.: Complexity boundaries for Horn description logics. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-07), pp. 452–457. AAAI Press, Menlo Park (2007)
37. Langston, M.A., Perkins, A.D., Saxton, A.M., Scharff, J.A., Voy, B.H.: Innovative computational methods for transcriptomic data analysis: a case study in the use of FPT for practical algorithm design and implementation. Comput. J. 51(1), 26–38 (2008)
38. Libkin, L.: Elements of Finite Model Theory. Springer, Berlin (2004)
39. MeRegio: Homepage of project MeRegio. http://meregio.forschung.kit.edu
40. MeRegioMobil: Homepage of project MeRegioMobil. http://meregiomobil.forschung.kit.edu
41. Müller-Schloer, C., Schmeck, H.: Organic computing: a grand challenge for mastering complex systems. it-Inf. Technol. 51(3), 135–141 (2010)
42. Näher, S., Wagner, D. (eds.): Algorithm Engineering: 4th International Workshop, Proceedings/WAE 2000, Saarbrücken, Germany, September 5–8, 2000. Lecture Notes in Computer Science, vol. 1982. Springer, Berlin (2001)
43. Niedermeier, R.: Invitation to Fixed-Parameter Algorithms. Ball, J., Welsh, D. (eds.) Oxford Lecture Series in Mathematics and Its Applications, vol. 31. Oxford University Press, London (2006)
44. SPP Organic Computing: Homepage. http://www.organic-computing.de/SPP
45. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. Lecture Notes in Computer Science, vol. 4273, pp. 30–43. Springer, Heidelberg (2006)
46. Rabin, M.O.: Decidable theories. In: Barwise, J. (ed.) Handbook of Mathematical Logic, pp. 559–629. North-Holland, Amsterdam (1977)
47. Samer, M., Szeider, S.: Fixed-parameter tractability. In: Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.) Handbook of Satisfiability. IOS Press, Amsterdam (2009). Chap. 13
48. Schmeck, H.: Organic computing – a new vision for distributed embedded systems. In: Proceedings Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2005), Seattle, WA, USA, May 18–20, 2005, pp. 201–203. IEEE Comput. Soc., Los Alamitos (2005)
49. Schmeck, H., Karg, L.: Editorial: E-energy – paving the way for an Internet of energy. it-Inf. Technol. 51(2), 55–57 (2010)
50. Szeider, S.: Parameterized SAT. In: Kao, M.Y. (ed.) Encyclopedia of Algorithms. Springer, Berlin (2008)
51. Tarski, A., Mostowski, A., Robinson, R.M.: Undecidable Theories. North-Holland, Amsterdam (1953)
52. Tserendorj, T.: Approximate assertional reasoning over expressive ontologies. PhD, Karlsruhe Institute of Technology (KIT) (2009)
53. Williams, R.: A new algorithm for optimal constraint satisfaction and its implications. In: Proc. 31st ICALP. Lecture Notes in Computer Science, vol. 3142, pp. 1227–1237. Springer, Berlin (2004)
Effectiveness and Efficiency of Semantics
Peter C. Lockemann
Abstract Processing of semantics by, e.g., analyzing ontologies, with the concomitant effort in building ontologies, will always have to compete with continuous advances in algorithms and encodings for graph theory, speech recognition and synthesis. Ultimately, the technique that prevails in a given situation will be the one that is superior in effectiveness (meeting a needed functionality not easily achievable by other means), and/or efficiency (providing the functionality with a minimum of resources). Coming from a database technology background, this paper gives a few examples of recent projects where the efficiency of database and web technologies was combined with the effectiveness of techniques around ontologies. The paper goes on to argue that by extending the concept of database views to semantics one obtains the means for systematically dealing with pragmatics.
P.C. Lockemann
Karlsruhe Institute of Technology (KIT), Department of Informatics, 76128 Karlsruhe, Germany
e-mail: [email protected]

P.C. Lockemann
Forschungszentrum Informatik, Karlsruhe, Germany

D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_3, © Springer-Verlag Berlin Heidelberg 2011

1 To Start with: A Piece of History
I have known Rudi Studer for about 25 years now. We first met when I tried to hire him as the senior scientist for my new research group at FZI. Unfortunately for me, but perhaps fortunately for him, he succumbed instead to an offer from IBM where he joined the LILOG project. LILOG (“Linguistic and logic methods and tools”) was a large project where young scientists from IBM collaborated with the best linguists and logicians from academia to develop technologies for the evolving need of natural language as an end-user interface (German in this case). The ultimate goal was to have a system that understands German text, to apply it to large bodies of text to build what one would today term a “knowledge base”, and then to use it to answer arbitrary questions. The project had an advisory board with professors Brauer, Schnelle and myself who regularly met with the project group over the course of 6 years. This is how I got to know Rudi Studer and his initial
work on semantics quite well. Rudi’s role was in developing systematic methods and tools for knowledge based systems, something that was then and still now is called “knowledge engineering”. LILOG published a final report in 1991, a volume of 750 pages that attests to the huge scientific output [1]. The advisory board was invited to write an introductory chapter [2]. I recently reread the chapter and was quite amused to find the following text. Everyone has faced this situation one time or another. Suppose you arrive by train in a city that you are not familiar with. Before leaving the station you look for a street map, and a plan for the public transportation network to find the best way to get to your destination. You wonder whether it is best just to walk, to take a bus or streetcar and if so which one, or to jump into a taxicab. Indeed, you find a city map right by the exit. You consult the alphabetic street guide which identifies the grid in which to search further. Examining the grid on the map you finally locate the desired street, but alas it is a very long one, and it takes considerable searching for the fine print to identify a house number close to the one of your destination. Now you scan the neighborhood to find the red line signifying a streetcar line close by, and the red dot which indicates a stop. Unfortunately, it is several streets and corners away so that you now memorize the directions and turns to take after alighting from the streetcar. Finally you make an educated guess as to the name of the streetcar stop. Walking out of the station you discover a streetcar stop in front of you. You hope to find a network plan there and a timetable, and indeed you do. On closer inspection you detect the streetcar stop—you guessed its name right. Unfortunately, there is no direct connection so you must change cars. Worse, there seem to be several possibilities to do so. Which one is the best? And how long do you have to wait for the next streetcar? 
How long would you have to wait for the connecting car? Turning to the timetable you determine that— since it is long after rush-hour—the streetcar runs every twenty minutes, and the previous one left just five minutes ago. By that time you probably become conscious of the taxi stand right across, you walk over and take a cab which brings you to your destination within fifteen minutes and without further ado. How much easier would it be if the city had an information desk at the station where you could just ask for the best way. But staffing such desks is an expensive affair these days, so you never seem to find one. But why have a person there? All you really need is a microphone close to the street map where you explain in your own language where you wish to go. A loudspeaker may convey to you the various alternatives you have, you may ask questions for clarification, an illustration may appear on the map to show you the route you should take, and finally a printer may provide you with a written log of the final recommendation. You may even go and ask for further advice on where to obtain the most suitable ticket, whether you need exact change or may use bills, and so on.
2 The Future Has Arrived – Or Has It?
What is so amusing about the story from today’s perspective? Well, almost everyone keeps much of that hypothetical automatic help in his or her pocket. Smartphones instantaneously provide you with all the needed information; being location-aware, they reduce the dialog to a few keystrokes or sweeps across the touch pad, and with their high-resolution screens they provide comfortable visualizations. Tickets may be purchased by and loaded on the phone. True, acoustic input and output is not the rule, but that is more an issue of suitability than availability, as everyone using navigation help in an automobile can attest. And all that has come about without any of the sophisticated semantic techniques and tools developed by LILOG and other research groups. Rather, most of the progress is due to spectacular advances in algorithms and encodings for graphs, speech recognition and synthesis, and the like. Of course, one may argue that the functionality is limited to a fairly narrow discourse domain. But then, for all practical purposes an ambitious semantic analysis must be limited in its domain as well in order to be effective and efficient.

Is semantics a purely academic endeavor without practical needs, then? Not really, because even this simple scenario demonstrates some shortcomings. True, information, in smartphones as well as in navigation systems, often becomes focused by taking location into account. Location, however, is just one contextual characteristic. Why not make use of the calendar (if the meeting is in 15 minutes the system should only recommend a taxi), or purpose (if the visit is private, public transportation is a more likely choice)? Suppose also that after business you want to meet an old friend, and you are looking for a cozy café that serves your favorite pancakes and is within walking distance of your friend’s. Or what do you have to do if your streetcar suddenly gets held up by an accident? Try that with today’s smartphones, and you will fail miserably. Apparently, whenever you can state your problem only in general and qualitative terms but still need a definite solution (provided there is one at all), there seems to be good reason to invoke semantics as part of developing the solution.
3 A Database Techniques Perspective
Evidently, semantics has an important role to play. As a database person I have been particularly intrigued by the impact semantic techniques have had on the growing interconnection of hitherto quite separate techniques: structured databases and semistructured or unstructured textual information [3]. But this interconnection notwithstanding, in the face of an explosive growth of database volumes, performance, or efficiency, remains the single most pressing issue in practical applications. Even after more than 30 years of existence, relational databases remain the mainstay of data archiving because they are the most efficient way to store, select and access data, particularly if the data are to be interrelated. Consequently, some domain semantics can be translated into structural aspects of relations and referential key constraints
as described in a database schema, while the interpretation and exploitation of the remaining semantics is left to the user. Even today there is a tendency to process Web content by extracting information from the material and pressing it into some sort of structure [3, 4].

Another classical technique to achieve efficiency is indexing. It is a central part of relational databases but also exists independently, for example in information retrieval systems. Index techniques rely solely on textual correspondences, so again the link to semantics is left to the external user.

Semantics is often tied to large data collections. Hence, to be efficient, semantics should make good use of database and index systems. Effectiveness of semantic techniques (meeting a needed functionality not easily achievable by other means) will then have to be achieved by explicating the semantics on a layer on top of these systems. I shall use three examples from research in my FZI team to illustrate the approach:
• How can one efficiently integrate semantics into the classical retrieval process?
• How can one incorporate the rich semantics of images into database systems?
• How can one effectively develop the semantic structures for a new domain?

Semantics by itself is just a large pool of knowledge. Pragmatics is about turning the knowledge into actions. Actions need only a narrow part of semantic knowledge relevant to a given situation. The context of an action directs what should be visible in the knowledge base. In database parlance, context imposes a view on semantics. By narrowing visibility, views have the potential to further raise the efficiency of semantics. I shall discuss two further examples to demonstrate the issue:
• How can one integrate personalized context into the retrieval process?
• How well can semantics help in unexpected but nonetheless restricted situations?

In the following we address each of these examples in turn.
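The idea that context imposes a view on semantics can be illustrated with a small Python sketch. It treats the knowledge base as a list of subject-predicate-object triples and lets a context select the part that is visible for one action. The triples, the predicate names and the time-budget rule are all invented for illustration; a real system would derive the view from a much richer context model.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical knowledge base (all facts are made up for this sketch).
KNOWLEDGE = [
    Triple("taxi", "travel_minutes", "15"),
    Triple("streetcar", "travel_minutes", "40"),
    Triple("walking", "travel_minutes", "55"),
    Triple("cafe_a", "serves", "pancakes"),
]


def view(knowledge: List[Triple], predicates: Set[str]) -> List[Triple]:
    """A view narrows visibility to the triples relevant to the context."""
    return [t for t in knowledge if t.predicate in predicates]


def feasible_modes(knowledge: List[Triple], minutes_available: int) -> List[str]:
    """Use the calendar context (time until the meeting) to pick transport modes."""
    travel = view(knowledge, {"travel_minutes"})  # only travel facts are visible
    return [t.subject for t in travel if int(t.obj) <= minutes_available]


urgent = feasible_modes(KNOWLEDGE, 15)   # meeting in 15 minutes
relaxed = feasible_modes(KNOWLEDGE, 60)  # plenty of time
```

With these toy facts, the urgent context leaves only the taxi, while the relaxed context admits all three modes. The action never inspects the café triple, which is the efficiency argument for views in miniature.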
4 Effectiveness and Efficiency
4.1 Ontologies in Information Retrieval
The prevailing technique in information retrieval (IR) is the vector space model. Originally, authors or other specialist personnel employed background knowledge to annotate document texts with index terms that somehow relate to the content. Specialists search for interesting documents by querying the document base and couching the problem in expressions over search terms. Document retrieval is by and large based on the match between index and search terms. Since authors and specialists view documents or solution space, respectively, from different angles, there are always relevant documents that are being missed (success is measured in terms of recall), and among those retrieved many are deemed irrelevant (success is described in terms of precision). To improve on both measures, synonyms have been included almost since the inception of the technique.
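To make the two quality measures concrete, the following minimal Python sketch retrieves documents by cosine similarity between term-count vectors and then computes precision and recall. The toy index terms, the query, the relevance judgments and the function names are assumptions chosen only for illustration; production systems use weighted terms and ranked result lists.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_terms, index, threshold=0.0):
    """Return the ids of all documents more similar to the query than threshold."""
    q = Counter(query_terms)
    return {doc_id for doc_id, terms in index.items()
            if cosine(q, Counter(terms)) > threshold}


def precision_recall(retrieved: set, relevant: set):
    """Precision: share of retrieved that is relevant. Recall: share of relevant that is retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical index terms per document and assumed relevance judgments.
index = {
    "d1": ["car", "engine", "repair"],
    "d2": ["automobile", "insurance"],
    "d3": ["engine", "aircraft"],
}
relevant = {"d1", "d2"}

plain = retrieve(["car"], index)                   # query as typed
expanded = retrieve(["car", "automobile"], index)  # query plus a synonym
```

With these toy data, the plain query finds only d1 (precision 1.0, recall 0.5), while the synonym-expanded query also finds d2 (precision 1.0, recall 1.0). A carelessly chosen synonym would instead pull in irrelevant documents and lower precision.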
Due to its high cost, manual annotation has long been supplanted by automatic annotation. Automatic annotation draws the index terms from the physical text (hence it is referred to as full-text search). Clearly, it results in a huge number of index terms. To discriminate somewhat among the large number of terms, relevance weights are computed by various statistical frequency measures and attached to the terms. Since the method is biased towards the choice of words by an author, the likelihood of mismatch between index and search terms will increase and the problems of precision and recall will be amplified.

To overcome the relevance problems of full-text search, one may mimic manual annotation by automatically adding related index terms that are not explicitly contained in the text (so-called metadata). Metadata are drawn from thesauri, an early form of ontologies. However, there is no unambiguous evidence that thesauri have indeed a positive impact: while they definitely improve recall they often have a detrimental effect on precision. This raises the question of whether ontologies, because of their richer semantics, could have a more positive effect, i.e., raise both recall and precision over simple full-text search.

In his dissertation, G. Nagypal has carefully examined the question [5]. He selected a test collection from the English Wikipedia web encyclopedia, defined a suite of test queries that employed abstract concepts that rarely showed up in the test documents, and then developed an ontology for those abstract concepts. A novel and intricate algorithm for deriving weighted metadata was used, which matches text fragments to ontology terms and then extends the metadata base. A non-trivial problem is how to combine the results of full-text and metadata search. He developed several variations. Equally difficult was the establishment of a full-text search baseline, so that several baselines (full-text search queries) had to be used. Add to this that the results quite definitely depend on the quality and sophistication of the algorithm for deriving the metadata, and one would hardly expect that unambiguous answers can be found.

But the results are fairly unambiguous. Nagypal could show that metadata search already provides by itself good results in those areas that are covered by an ontology. Since ontology construction is a cumbersome and expensive affair, one would like to know whether, in terms of search quality, an imperfect ontology could be compensated for by full-text search. This seems to be true only as long as the ontology is comparatively poor; otherwise the high quality of metadata search seems to dominate. Finally, how much effort should be expended in perfecting an ontology? Nagypal explored the question in an indirect way, by leaving out certain steps in the metadata derivation. There, the results are somewhat mixed depending on which steps were left out. To be on the safe side, then, the complete algorithm should be applied.

Now, given that ontologies are effective in improving search results, one wonders whether that improvement comes at the price of efficiency, since after all, ontologies must be structurally analyzed if not processed by deductive means. Moreover, one has a choice between utilizing the ontology at indexing time or at retrieval time. This is a typical problem of cost distribution: if a document is rarely or never used, the added cost of indexing does not amortize; if it is accessed quite often, the cost added to query processing multiplies.
Fig. 1 IRCON system architecture
Nagypal decides in favor of utilizing ontologies during indexing, not least because by augmenting the list of index terms per document, the high efficiency of the vector space model can be fully utilized. He runs a number of experiments and measures the execution times for the two independent searches, full-text search and metadata search, and for the combination of the two results, as well as the additional overhead for, e.g., query parsing, loading of the results, and output generation. It turns out that metadata search takes only between 10 and 30 percent of the time of full-text search, and total response time remains in the realm of a few seconds. Figure 1 shows the system architecture of the IRCON prototype and demonstrates how different systems have been chosen where each has its special strengths: Lucene for high-performance index management, POSTGRES for efficient management of the structured data, GATE for efficient and effective natural language processing. The figure clearly demonstrates the use of ontologies at indexing time and the efficient use of the index search engine for both full-text and metadata search. In summary then, if done in a technically sophisticated way, semantics can add effectiveness to information retrieval without loss of efficiency.
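The design decision described here, paying the ontology cost once at indexing time so that query processing stays a plain index lookup, can be sketched in a few lines of Python. This is not Nagypal's algorithm: the tiny concept hierarchy, the unweighted metadata, the function names and the linear combination of the two scores with a weight alpha are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical ontology fragment: concept -> directly broader concepts.
BROADER = {
    "streetcar": ["public transport"],
    "bus": ["public transport"],
    "public transport": ["transport"],
}


def ancestors(term):
    """All broader concepts reachable from a term (transitive closure)."""
    seen, stack = [], list(BROADER.get(term, []))
    while stack:
        concept = stack.pop()
        if concept not in seen:
            seen.append(concept)
            stack.extend(BROADER.get(concept, []))
    return seen


def index_document(text_terms):
    """Indexing time: the ontology is consulted once per document."""
    full_text = Counter(text_terms)
    metadata = Counter(c for t in text_terms for c in ancestors(t))
    return full_text, metadata


def score(query_terms, entry, alpha=0.5):
    """Retrieval time: two plain lookups, combined linearly (alpha is assumed)."""
    full_text, metadata = entry
    ft = sum(full_text[t] for t in query_terms)
    md = sum(metadata[t] for t in query_terms)
    return alpha * ft + (1 - alpha) * md


docs = {
    "d1": ["streetcar", "timetable"],
    "d2": ["taxi", "station"],
}
index = {doc_id: index_document(terms) for doc_id, terms in docs.items()}
query = ["public transport"]
ranking = sorted(index, key=lambda d: score(query, index[d]), reverse=True)
```

Document d1 never mentions "public transport", yet it is ranked first because the concept was added as metadata when the document was indexed. At query time no ontology access is needed, which is consistent with the observation above that metadata search adds little to the response time.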
4.2 Semantics in Image Retrieval
As the saying goes, “a picture is worth more than a thousand words”. There is much more information in a (two-dimensional) image than in a (linear) text. One would expect, then, that retrieving pictures from an image repository offers far more opportunities than text retrieval. Conversely, one should also expect that annotating pictures in a repository offers far more challenges than annotating pure text.

Fig. 2 Retrieval of and navigation across images

Beginning with the opportunities, consider the following example from the ImageNotion project [6–8]. Take Fig. 2. Suppose you wish to retrieve a picture of Hermann Göring visiting the Vichy government. Given some suitable annotation (“Hermann Göring”, among others) and some background knowledge that the Vichy regime governed parts of France during World War II, you may ultimately retrieve the center image in Fig. 2. This image may now become the starting point for further navigation through the repository. For example, the annotation includes a description of the location where the picture was taken (St. Florentin-Vergigny). One may now navigate to other pictures taken at the same place, and may retrieve one that shows the railway car where the armistice was signed at the end of World War I. From there one may continue to navigate to the French marshal Foch provided the annotation of “Marshal Foch” was derived from analyzing smaller sections of the picture that seem to contain persons. The same principle may apply to the starting point where Philippe Pétain (the head of the Vichy government) had been recognized, and from there one may search for further pictures of Pétain.

The ImageNotion project drew heavily from the experiences of G. Nagypal’s dissertation. For example, for efficiency reasons retrieval of images should make use of the vector space model. Consequently, annotations will have to be by textual index terms that can then be extended via an ontology. But this clearly is not sufficient. Since image parts are images in their own right, one obtains a (linked) part-of hierarchy of images. Moreover, to navigate across images requires further semantic links (relations) between images. Often there is also background text accompanying the image, e.g., on a web page.

To deal with these rich structures, the ImageNotion project introduces two core ideas. The first is the concept of an “imagenotion”: a description of a picture (which may itself be part of a larger picture) that combines the picture, the textual annotation, and several links: to its parts or its parent, to the imagenotions of related pictures, and to accompanying textual material. What is being retrieved, then, are imagenotions, and the user may then decide which of their constituents to display. Since the imagenotions represent a good deal of semantics, the second, almost natural, idea is to introduce the imagenotions as nodes into a traditional ontology. Thus retrieval may start by localizing one or more imagenotions and continue by exploiting any of the relations in the ontology. G. Nagypal and A. Walter have run several empirical studies to verify that image archive users inexperienced in semantics and particularly ontologies could easily familiarize themselves with the retrieval technique.

Fig. 3 Annotation of images

Not surprisingly, the complex semantic structures have a counterpart in a complex annotation process. Figure 3 illustrates the elements of this process for the picture of St. Florentin-Vergigny in 1918. There are two ways to start the process: Either there is an expert who knows enough about the picture to manually add index terms, or there is background text (as shown in the figure) to which classical text analysis techniques are applied. Either way, it makes sense to augment the resulting terms by terms gained from an ontology describing the general background, leaving only the more specialized ontology to the retrieval phase. The subsequent step concentrates on the discovery of the image parts (that can then again be annotated).
Two sets of discovery algorithms are used, one by NTUA for persons and object detection, and a second by Fraunhofer-Gesellschaft applied after person detection
Effectiveness and Efficiency of Semantics
Fig. 4 ImageNotion system architecture (data management level)
to find the faces (including gender determination) and to identify a face provided an annotated picture can be found in the archive where the face has already been associated with a person. In this case one can also establish navigational links among the pictures. Figure 3 also shows that the automated part of the annotation process may be applied iteratively in order to correct or refine the previous result. The quality of the annotations (the metadata) is critical for retrieval effectiveness, since in contrast to Sect. 4.1 no full-text search is available to compensate for incomplete metadata. Early experiments showed that users had great difficulty coping with preconstructed ontologies. Therefore it was decided to start out with a small domain-independent ontology and then have the users, during annotation or retrieval, add to the ontology (and, incidentally, also the imagenotions) on their own terms. As a lesson, it seems that from a purely practical standpoint effectiveness of semantics can often be raised if one leaves much of its acquisition in the hands of the users. Efficiency of retrieval depends on the indexing technique, but also on efficient storage of the imagenotions because of their use in navigation. Figure 4 shows the main extensions to the data management level as compared to the IRCON architecture. MySQL is used in place of POSTGRES and manages the annotations, links, textual material and standard IPTC photo metadata, and a file system serves as the image repository.
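The combination of the vector space model with index terms that are extended via an ontology, as mentioned earlier in this section, can be sketched in a few lines. The sketch below is a minimal illustration and not the ImageNotion code: the ontology content, the image identifiers, the one-step expansion and the weight of 0.5 for derived terms are all assumptions made for the example.

```python
import math

# Hypothetical background ontology: index term -> terms it implies.
ONTOLOGY = {"Philippe Pétain": ["Vichy government"]}

# Hypothetical annotations (index terms) per image.
ANNOTATIONS = {
    "img_vichy": ["Hermann Göring", "Philippe Pétain"],
    "img_railway_car": ["Marshal Foch", "St. Florentin-Vergigny"],
}


def expand(terms, weight=0.5):
    """Term vector: annotated terms weigh 1.0, ontology-derived terms less."""
    vector = {t: 1.0 for t in terms}
    for t in terms:
        for implied in ONTOLOGY.get(t, []):
            vector[implied] = max(vector.get(implied, 0.0), weight)
    return vector


def cosine(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank(query_terms):
    """Images ordered by similarity of the query to the expanded annotations."""
    query = {t: 1.0 for t in query_terms}
    scores = {img: cosine(query, expand(terms))
              for img, terms in ANNOTATIONS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this toy setting a query for "Vichy government" ranks `img_vichy` first although no annotator used that term, because the term was derived from the annotation "Philippe Pétain".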
4.3 Collaborative Ontology Building

The previous Sect. 4.2 claims that under certain conditions effective semantics requires collective ontology building. One may argue that this has to do with the wide range of domains covered by the pictures. However, while developing a service-oriented environmental information system [9] we discovered that difficulties arise even for narrower domains such as environmental data, be they measurement data, analysis results, diagrams, or regulations. The difficulties are due to the wide differences in viewpoints that users apply to environmental data. An essential component of a service-oriented infrastructure is the central service registry. Take an analyst who has been given the task to build a new public information portal for flood emergency management and has to find the already published
Fig. 5 Collaborative ontology building
services that might be useful. Suppose the analyst searches for a suitable service under the term “flood level”. Then he or she will in all likelihood miss a service for retrieving the current water level of rivers, even though this would be a good candidate for building the portal. If we had a relation from “flood level” to “water level” and used it in the discovery process, chances would be much higher that more of the appropriate services would be found. More to the point, the same service implementation may be applicable in different business situations and, hence, may require quite different semantic descriptions. Take again the water level service. It may be viewed, and employed, differently by a flood manager, the manager of a river shipping company and the manager of a hydropower plant. In general, business users need to know which services are available for which business purpose, how these services can be connected, which services have to be replaced when a business process has to be changed, or whether new services are needed in order to adapt to new requirements. We decided to take a more methodical (and, hence, more general) approach to collaborative ontology building that serves all stakeholders. Accordingly, we look for a solution that lets the ontology evolve in communication and collaboration of the business experts whenever one sees the need. In today’s networked world such a solution should exploit what is called a “social network”. Figure 5 shows the system architecture. It consists of four main components: a UDDI-based technical registry, a Semantic MediaWiki-based business registry, an ontology server and an ontology engineering component. The figure also indicates the basic workflow within the architecture. A software developer as a service publisher can use any UDDI-compatible client to publish a new service into the registry, which may also include
a technical description like a WSDL file in the case of a Web service. In addition to the technical description, the software developer may add some keywords based on the ontology in order to roughly categorize the business use of the Web service. The content of the UDDI Registry is dynamically embedded into the content of the Semantic MediaWiki that forms the business-oriented registry. The keywords chosen by the software developer are used as an initial categorization for the service. From now on business users can search or navigate along the contents of the Semantic MediaWiki, add additional information to the dynamically generated pages, or create new pages. Since we cannot expect the analysts to be experts in building ontologies, the engineering of the ontology should be made as simple as possible. A Semantic MediaWiki is ideal as a frontend for business analysts because it is easy to use, allows adaptation to dynamic changes, stimulates collaboration among analysts, and is a suitable framework for the semantic needs. The ontology is visualized as a graph, and all modifications can be easily done by dragging and dropping the nodes of the visual presentation. The range of possible modifications is restricted (hence the name “lightweight engineering” [10]). For example, concepts can only be connected via broader-narrower and related relations. Business analysts can create new Wiki pages or modify existing ones (including generated Wiki pages) for the purpose of adding further annotations. The annotation of Wiki pages can be carried out by means of such Semantic MediaWiki features as semantic links, semantic attributes, and inline queries (to embed dynamic content). Many annotations can be obtained from the ontology by navigating through it and extracting further facts, or by using a reasoner to derive implicit facts or some of the semantic links. Without overly expensive controlled experiments it is difficult to evaluate the cost savings if any. 
First experiences have been gained with the environmental information system of the state of Baden-Württemberg. The initial ontology we have used is based on an already existing and widely used taxonomy. The technical infrastructure as described above was developed in close communication with more than 10 representatives of business analysts and 5 representatives of developers, and was rolled out for a first testing period in April of 2007. First experiences seem to support our expectations for the narrow scope of environmental information systems.
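The “flood level” example of this section, together with the restriction to broader-narrower and related relations, lends itself to a small sketch of ontology-supported service discovery. The code below is an assumption-laden illustration, not the registry described in the text: the service names, the keyword sets and the choice of a broader relation between the two concepts are hypothetical, and no UDDI or Semantic MediaWiki interface is used.

```python
# Hypothetical lightweight ontology: concepts linked only by
# broader/narrower or related relations, stored as (concept, relation, concept).
RELATIONS = [("flood level", "broader", "water level")]

# Hypothetical registry content: service name -> categorizing keywords.
REGISTRY = {
    "river_water_level_service": {"water level"},
    "air_quality_service": {"air quality"},
}


def expand(term):
    """The term plus every concept directly linked to it, in either direction."""
    concepts = {term}
    for source, _relation, target in RELATIONS:
        if source == term:
            concepts.add(target)
        elif target == term:
            concepts.add(source)
    return concepts


def discover(term, use_ontology=True):
    """Names of services whose keywords overlap the (expanded) search term."""
    wanted = expand(term) if use_ontology else {term}
    return sorted(name for name, keywords in REGISTRY.items()
                  if keywords & wanted)
```

Searching for "flood level" without the ontology finds nothing in this toy registry, whereas the expanded search returns the water level service; only direct links are followed here, which keeps the result set small.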
5 Raising Efficiency Through Views

5.1 Personal Profiles as Views

In Sect. 3 I argued that to turn semantics into actions one would need to impose a view on semantics to identify the contextual knowledge for the actions. In some of the work at FZI, and also in preparing for a large regional project, we developed a number of scenarios for which I give a few examples below. Common to all of them is that the situational information is highly personal (hence, the term “(personal) profiles”).
• For workplace learning, the context should cover the personal skills of the employee, his or her past learning activities, current problem situation and knowledge defects, the available time and date for learning, the technical environment, colleagues for potential help, etc. [11–13].
• For smart care (also: ambient assisted living), the context encompasses the health profile, medication regime, nourishment constraints, calendar with visits to physicians or by home services, nearby friends (for example, if the home care service has to locate the person when he or she is not at home) [14].
• For other social services, one may set up a platform where the profiles of young children with working parents, or of mobility-impaired seniors that ought to be entertained, could be matched to profiles of active seniors who look for opportunities to help others [15].
• For commuting, profiles of persons wishing to travel from home to workplace and vice versa at a given time range without using their own car could be matched against profiles of commuters who offer seats in their cars [15].
• For areas such as northeastern Germany where the declining population has led to the thinning of government services, medical services, or shopping opportunities, the “consumer-to-provider” principle should be replaced by a “provider-to-consumer” principle. This requires a new kind of “smart” logistics that would be based on the personal profiles of the consumers [15].

Now suppose that each of these environments is supported by a set of IT services. To serve an individual person, a service must be adapted to an individual’s needs (“personalized services”). To do so the service platform must link the service specification, the personal profile, further background information and the current situation such as time or location. Depending on the kind of service, only part of the profile will have to be taken into account.
Just consider home care, where writing the shopping list and directing the home care service to the neighbor’s each look at different parts of the profile, i.e., take different views on the profile. If there is need for reasoning, views may contribute to efficiency because only a predefined part of the profile must be processed. Taking cues from database technology, views may be virtual (the view is constructed from the profile at use time) or materialized (constructed at definition time and stored separately). Either form must be examined in the light of meeting the highest standards for protecting the privacy of individuals.
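The distinction between virtual and materialized views on a personal profile can be made concrete with a short sketch. Everything named below (the profile keys, the view names, the class `MaterializedView`) is hypothetical and chosen only to mirror the home care example; privacy protection, which the text singles out as essential, is not modeled.

```python
import copy

# Hypothetical personal profile for a home care scenario.
profile = {
    "nourishment_constraints": ["low salt"],
    "medication": ["drug A, mornings"],
    "nearby_friends": ["neighbour next door"],
    "calendar": ["physician, Tuesday 10:00"],
}

# Hypothetical view definitions: service -> parts of the profile it may see.
VIEWS = {
    "shopping_list": ("nourishment_constraints",),
    "locate_person": ("nearby_friends", "calendar"),
}


def virtual_view(profile, name):
    """Virtual view: constructed from the profile at use time."""
    return {key: profile[key] for key in VIEWS[name]}


class MaterializedView:
    """Materialized view: constructed at definition time and stored separately."""

    def __init__(self, profile, name):
        self.name = name
        self.data = copy.deepcopy(virtual_view(profile, name))

    def refresh(self, profile):
        """Re-derive the stored copy after the profile has changed."""
        self.data = copy.deepcopy(virtual_view(profile, self.name))


stored = MaterializedView(profile, "shopping_list")
profile["nourishment_constraints"].append("no sugar")  # profile changes later
```

After the change, `virtual_view(profile, "shopping_list")` already reflects the new constraint, while `stored.data` does so only after `stored.refresh(profile)`. This is the usual trade-off: the virtual form is always current, the materialized form avoids recomputation but needs maintenance.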
5.2 Coping with the Rare: Navigating Across Views

The applications of Sect. 5.1 are comparatively static: The profiles and the semantics implicit in the services and explicit in the background knowledge are fairly stable, only time or location remain dynamic. In the scenario below the context is more fluid and, incidentally, also richer. In the face of global competition, companies must streamline their production processes to keep production in Germany. An important aspect is to invest as little
Fig. 6 Resilience management
capital as possible. An approach currently adopted by the automobile industry attempts to supply production plants with smaller batches of parts and to compensate for the smaller size by a higher frequency of delivery. To do so without raising transportation costs requires new concepts in transport logistics. These include combining small-size batches, possibly from different sources, into larger transport units, cooperation among transport companies, and a higher collection frequency at parts producers. The LogoTakt project studies a solution based on stable cyclic schedules along all stages of the supply chain [16]. An added value of the approach is that such schedules allow a shift away from highway to rail transportation. A weakness of the approach is that deviations from the plans and disruptions of resources along the way have a fairly immediate effect on regular schedules. LogoTakt pursues the objective of resilient schedules that remain reliable in the presence of such disturbances. It achieves resilience by assuming that the types of disturbances can be foreseen whereas what cannot be predicted is the type actually occurring and the time or place where it occurs in the chain. Resilience is based on the notion of buffers as spare resources in time, space or volume. Abstractly speaking, one could establish some sort of resource semantics that describe the resources together with all the factors affecting their use. For reasons of efficiency one would like to examine as little of the semantics as possible. Consequently, the views are organized in a hierarchy of increasing scope to be examined. Figure 6 illustrates the principle. Planning of the transport processes and—based on failure statistics—of the buffers is the responsibility of the planning platform. During normal operation the platform also collects all events signaled from the processes. Unless an event already refers directly to a disruption, e.g., by a driver’s message, it is compared to
the plan to check for a deviation. In either case, a suspected disturbance (1) is sent to the deviation detection where it is analyzed in more detail. If the deviation can be tolerated no further action is necessary. Otherwise the details (2) are passed on to the resilience manager. An important component is the workflow manager that maintains a history of earlier disturbances and the reactions to them. The resilience manager queries this system for similar situations (3), e.g., using case-based reasoning, and an adequate response (4). If none is forthcoming, the resilience manager will call on the disturbance resolver (5) to develop a solution. Basically the resolver attempts to find a buffer of least cost and/or least impact on the resolution of possible later disturbances (6). In doing so, it starts from an inspection of the most local solutions and if none can be found escalates along the hierarchy to solutions that take a wider context into account. For example, tour local solutions exist if an incomplete load is supplied or the load is provided late but the delay can be compensated for by changing the order of tour points. Transport local solutions inspect the entire chain. It may very well be that several tours serve the same freight train in the schedule so that a delayed supply of parts may be picked up by the next tour and still reach the train on time. This may also be true if a truck breaks down and a replacement truck runs the tour at a later time. A purchase-order local solution includes the consumer in the chain. For example, a supermarket-based production allows a slack of up to one day so that a delivery delayed by several hours may be acceptable. Fleet-internals are somewhat different in that they focus on the resources of a transport company rather than the supply chain. For example, they examine whether idling trucks or drivers are available, or the company has the means for intermediate storage. 
Finally, if all other buffers fail a network-wide solution considers for each stage in the chain other transport companies that may have spare resources. Finding a solution on this level entails more or less complicated negotiations, keeping in mind that the companies involved usually are fierce competitors. We also note that solutions on higher levels must usually be fed back to the lower levels and ultimately to the tour local level where the final solution for the tours must be determined. The solution found is returned to the disturbance resolver and the resilience manager who communicates it to the planning platform so that it can modify the current plan, and to the workflow manager to place it in its history base. Whereas in Sect. 5.1 a single view is employed during an action, in the present scenario the action may have to navigate across several views. The lower the view in the hierarchy the more often it is invoked. Consequently, view processing must perform better on the lower levels, and algorithmic solutions are essential (and possible) on these levels, whereas on the higher levels solutions that interpret semantics (e.g., reason on them, employ agent technology [17]) are necessary and become feasible. Hence, views seem predominantly a means of raising efficiency. As one of the reviewers seems to suggest, view hierarchies may also be a vehicle for improving on effectiveness. Depending on the characteristics of the view one may have a good chance to identify the technique most suitable for its level, and the division of work among the levels may also improve the chances for off-the-shelf solutions. Our present experience, however, is still limited to the lower levels. We
hope to proceed to the higher levels, perhaps in a different scenario such as emergency management, to test our hypothesis that view hierarchies do indeed contribute to efficiency while preserving effectiveness, and to examine whether effectiveness can be raised in an efficient manner.
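The escalation along the view hierarchy described in this section can be summarized as a simple control loop: look up the history of earlier disturbances first and, failing that, inspect the views from the most local to the widest until one of them offers a buffer. The sketch below is a strong simplification under our own assumptions and not the LogoTakt implementation: the function and parameter names are hypothetical, case-based reasoning is reduced to an exact match on the disturbance type, and each level is represented by a function returning (cost, buffer) candidates.

```python
# View hierarchy in order of increasing scope, as named in the text.
LEVELS = ["tour-local", "transport-local", "purchase-order-local",
          "fleet-internal", "network-wide"]


def resolve(disturbance, history, finders):
    """Return (level, buffer) for a disturbance, or None if no buffer exists.

    history: dict from disturbance type to an earlier solution (stand-in for
             the workflow manager's history base).
    finders: dict from level name to a function that returns a list of
             (cost, buffer) candidates for the disturbance.
    """
    known = history.get(disturbance["type"])
    if known is not None:
        return known                      # reuse the earlier reaction
    for level in LEVELS:                  # escalate only when nothing is found
        candidates = finders[level](disturbance)
        if candidates:
            _cost, buffer = min(candidates, key=lambda c: c[0])
            solution = (level, buffer)
            history[disturbance["type"]] = solution   # remember for next time
            return solution
    return None


# Hypothetical example: a truck breakdown has no tour-local buffer.
finders = {level: (lambda d: []) for level in LEVELS}
finders["transport-local"] = lambda d: [(5.0, "replacement truck, later tour"),
                                        (3.0, "next tour reaches same train")]
history = {}
first = resolve({"type": "truck breakdown"}, history, finders)
```

In this example the loop stops at the transport-local level and picks the cheaper of the two buffers; a second disturbance of the same type is answered from the history without consulting any view. The feedback of higher-level solutions to the tour local level is omitted here.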
6 Conclusions

“Semantics” has always been many things to many people. To linguists semantics deals with the meaning of linguistic phenomena (words, sentences, paragraphs, documents), to logicians it is a well-defined interpretation in a model, to programming language people it is a set of mathematical objects to guide verification and implementation, to database experts it is purely the database structure from which to derive query execution efficiency without concern for content, and for knowledge management scientists like Rudi Studer it is domain knowledge structured into ontologies and subject to deductive techniques in order to assist persons in interpreting documents, or to pass the interpretations on to more sophisticated machine processing.

As a database person my own work has always emphasized structural aspects and the efficiency issues that go with them, and my view of the world has always been one of clearly identifiable entities and the relations among them. Database resilience has also been a central issue, but again by leaving the responsibility for semantics to transactions and focusing on efficient preservation of these semantics. But as my research group at FZI has become acquainted with parallel work going on in Rudi Studer’s FZI group, we have come around to see the need for emphasizing effectiveness as well and, hence, to integrate semantics in his sense into our work on information systems. The examples in this paper are thus a tribute to his influence (not least because he reviewed some of our dissertations). Of course, one never sheds one’s own background, so we tend to look on semantics through the prism of database technology: witness the view concept or the tendency to use layered architectures where semantic solutions are always put “on top of” database solutions. It has always been a pleasure to collaborate with Rudi Studer.
Acknowledgements The author wishes to express his gratitude to his former and current young colleagues and scientists Carsten Holtmann, Gabor Nagypal, Jens Nimis, Natalja Pulter, Andreas Schmidt, and Andreas Walter whose work forms the substance of this paper.
References

1. Herzog, O., Rollinger, C.-R. (eds.): Text Understanding in LILOG. Lect. Notes Comput. Sci., vol. 546. Springer, Berlin (1991)
2. Brauer, W., Lockemann, P.C., Schnelle, H.: Text understanding: the challenges to come. In: Herzog, O., Rollinger, C.-R. (eds.): Text Understanding in LILOG. Lect. Notes Comput. Sci., vol. 546, pp. 14–28. Springer, Berlin (1991)
3. Weikum, G.: DB and IR: both sides now. In: Proc. ACM SIGMOD Intntl. Conf., pp. 25–29 (2007)
4. Agrawal, R., et al.: The Claremont report on database research. Commun. ACM 52(6), 56–65 (2009)
5. Nagypal, G.: Possibly Imperfect Ontologies for Effective Information Retrieval. Universitätsverlag, Karlsruhe (2007)
6. Walter, A., Nagypal, G.: ImageNotion: methodology, tool support and evaluation. In: On the Move to Meaningful Systems 2007. Proc. Confed. Intntl. Conf. CoopIS/DOA/ODBASE/GADA/IS. Lect. Notes Comput. Sci., vol. 4803, pp. 1007–1024. Springer, Berlin (2010)
7. Walter, A., Nagypal, G.: The combination of techniques for automatic semantic image annotation generation in the IMAGENOTION application. In: Proc. 5th European Semantic Web Conference (ESWC) (2008)
8. Walter, A., Nagypal, G., Nagi, K.: Evaluating semantic techniques for the exploration of image archives on the example of the ImageNotion system. Alex. Eng. J. 47(4), 327–338 (2008)
9. Paoli, H., Schmidt, A., Lockemann, P.C.: User-driven semantic Wiki-based business service description. In: Schaffert, S., Tochtermann, K., Pelegrini, T. (eds.) Networked Knowledge - Networked Media: Integrating Knowledge Management, New Media Technologies and Semantic Systems, pp. 269–284. Springer, Berlin (2009)
10. Braun, S., Schmidt, A., Zacharias, V.: SOBOLEO: Vom kollaborativen Tagging zur leichtgewichtigen Ontologie. In: Gross, T. (ed.) Mensch & Computer – 7. Fachübergreifende Konferenz – M&C 2007. Weimar, pp. 209–218. Oldenbourg Verlag, München (2007)
11. Schmidt, A.: Microlearning and the Knowledge Maturing Process: Towards Conceptual Foundations for Work-Integrated Microlearning Support. Microlearning, Innsbruck (2007)
12. Braun, S., Schmidt, A., Hentschel, C.: Semantic desktop systems for context awareness: requirements and architectural implications. In: 4th European Semantic Web Conference (ESWC) (2007)
13. Maier, I., Schmidt, A.: Characterizing knowledge maturing: a conceptual process model for integrating e-learning and knowledge management. In: Conf. Professional Knowledge Management: Experiences and Visions, Potsdam (2007)
14. Kunze, C., Holtmann, C., Schmidt, A., Stork, W.: Kontextsensitive Technologien und Intelligente Sensorik für Ambient-Assisted-Living-Anwendungen. 1. Deutscher Kongress Ambient-Assisted-Living (AAL ’08). VDE-Verlag, Berlin (2008)
15. Antrag im Spitzencluster-Wettbewerb des Bundesministeriums für Bildung und Forschung “Vertrauenswürdige Dienste für intelligente Infrastrukturen”. Cyberforum e.V. Karlsruhe (2009)
16. Pulter, N., Nimis, J., Lockemann, P.C.: Störungsmanagement in offenen, getakteten Logistiknetzen. Künstl. Intell. 24(2), 131–136 (2010)
17. Lockemann, P.C., Nimis, J.: Dependable multi-agent systems: layered reference architecture and representative mechanisms. In: Barley, M., et al. (eds.) Safety and Security in Multiagent Systems: Research Results from 2004–2006. Lect. Notes Artif. Intell., vol. 4324, pp. 27–48. Springer, Berlin (2009)
Knowledge Engineering Rediscovered: Towards Reasoning Patterns for the Semantic Web

Frank van Harmelen, Annette ten Teije, and Holger Wache
Abstract The extensive work on Knowledge Engineering in the 1990s has resulted in a systematic analysis of task-types, and the corresponding problem solving methods that can be deployed for different types of tasks. That analysis was the basis for a sound and widely accepted methodology for building knowledge-based systems, and has made it possible to build libraries of reusable models, methods and code. In this paper, we make a first attempt at a similar analysis for Semantic Web applications. We will show that it is possible to identify a relatively small number of task-types, and that, somewhat surprisingly, a large set of Semantic Web applications can be described in this typology. Secondly, we show that it is possible to decompose these task-types into a small number of primitive (“atomic”) inference steps. We give semi-formal definitions for both the task-types and the primitive inference steps that we identify. We substantiate our claim that our task-types are sufficient to cover the vast majority of Semantic Web applications by showing that all entries of the Semantic Web Challenges of the last 3 years can be classified in these task-types.
1 Preface: Rudi Studer’s Journey from Knowledge Engineering to the Semantic Web (and Back Again)

The last 20 years of Rudi Studer’s intellectual journey illustrate the journey that an entire group of researchers has taken: starting with work on structured methods for building Knowledge Based Systems (which were isolated systems in a highly structured environment), they have migrated to working on the Semantic Web: developing knowledge intensive methods for the open and unstructured environment of the World-Wide Web.

Section 3 and further are a slightly revised reprint under permission from K-CAP’09: Proc. of the 5th International Conference on Knowledge Capture. ©ACM (2009). doi http://doi.acm.org/10.1145/1597735.1597750. The revisions are based on the very detailed and insightful comments that we received from an anonymous reviewer during the preparation of this volume.

F. van Harmelen, Dept. of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands, e-mail: [email protected]

D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_4, © Springer-Verlag Berlin Heidelberg 2011
The early work on Knowledge Based Systems revolved around the discovery and definition of Problem Solving Methods, and the identification of Ontologies as a structuring principle. The first appearance of the Web was as a medium for discovering and exchanging Problem Solving Methods and Ontologies. Later, the roles reversed: no longer was the Web the medium for exchanging Ontologies, but instead Ontologies became a structuring principle for the Web itself, a cornerstone of what we now know as the Semantic Web. However, somewhere along this journey the Problem Solving Methods got lost. At the 2009 KCAP conference, we wrote a paper about bringing the Problem Solving Methods back to the Semantic Web, thereby completing the full circle that Rudi’s work has been making over a period of two decades. We therefore dedicate this work to the celebration of Rudi’s career on his 60th birthday.

Before launching into our technical argument (namely that the Semantic Web is as in need of re-usable reasoning patterns as Knowledge Based Systems were 20 years ago), we will first trace the historical developments that led to this paper by highlighting key steps in Rudi’s work over the past two decades.

Early Work on Problem Solving Methods: Rudi’s work in the early ’90s on the methodology MIKE and its associated formal specification language KARL focused on the description of “problem solving methods”: these were structured patterns of inference that could be re-used across multiple domains, and were concerned with reasoning tasks such as diagnosis, monitoring, classification, design, etc. The first publication of this work appeared in the legendary Banff workshop series:

Angele, J., Fensel, D., Landes, D., Studer, R.: KARL: an executable language for the conceptual model. In: Proceedings of the 6th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW-91), Banff, Canada, October 6–11 (1991)
First Mention of Ontologies in Rudi’s Work: Although it was always apparent that the domain knowledge used by the reasoning patterns could be structured and re-used in similar ways, this only started to get serious attention at a later period, around the mid ’90s. The word “ontology” started to become a very popular term. Rudi’s first publication that we were able to find which uses the O-word was in the EKAW series: Pirlein, T., Studer, R.: KARO: an integrated environment for reusing ontologies. In: Proceedings of the European Knowledge Acquisition Workshop (EKAW-94), Hoegaarden, Belgium, September 26–29. Lecture Notes in Artificial Intelligence (LNAI), vol. 867, pp. 200–225. Springer, Berlin (1994)
First Appearance of the Web as an Exchange Medium: In this same period, it was becoming clear that the Web would change the face of almost every area of Computer Science, including Knowledge Engineering. The first move that the Knowledge Engineering community made was with regards to the Web as the exchange medium for their Knowledge Engineering objects, the ontologies and the problem solving methods. One of Rudi’s early publications that runs in this same vein is:
Benjamins, R., Plaza, E., Motta, E., Fensel, D., Studer, R., Wielinga, B., Schreiber, G., Zdrahal, Z., Decker, S.: IBROW3: an intelligent brokering service for knowledge-component reuse on the World-Wide Web. In: Proceedings of the 11th Workshop on Knowledge Acquisition, Modeling, and Management (KAW ’98), Banff, Canada, April (1998)
A key sentence from this paper is: “accessing libraries of reusable problem-solving methods on the Web”. Notice how nearly all of the authors of this paper published in 1998 (which is about “knowledge engineering on the Web”) have since turned into highly active (and well-respected) Semantic Web researchers.

First Appearance of the Semantic Web: But soon, the same researchers began to realise that the tables could be turned: rather than using the Web for the benefit of knowledge engineering methods, these knowledge engineering results could be used for the benefit of the Web. Even well before the term Semantic Web became commonplace, Rudi and his co-workers were already publishing work that can now be easily recognised as Semantic Web work avant la lettre:

Decker, S., Erdmann, M., Fensel, D., Studer, R.: Ontobroker in a Nutshell. Lecture Notes in Computer Sciences. Springer (1998)
with as a key sentence: “we develop tools necessary to enable the use of ontologies for enhancing the web”. The term “Semantic Web” appeared first in a 2001 paper: Stojanovic, L., Staab, S., Studer, R.: Knowledge technologies for the semantic web. WebNet, pp. 1174–1183 (2001)
The Missing Link: Problem Solving Methods for the Semantic Web: But although ontologies are now widely used to enhance the Web, the other key insight from Knowledge Engineering (the reusable Problem Solving Methods) has not yet found its place in the collective consciousness of the Semantic Web community. Our KCAP 2009 paper, reprinted here, is meant to be a first step towards rectifying this omission. We argue that just as with ontologies (reusable domain patterns), the problem solving methods (as reusable reasoning patterns) can be of great benefit to both the technical engineering and the scientific understanding of the Semantic Web. But before we make the technical case for this argument, in the next section we will briefly recall the key model from Knowledge Engineering on which our case is built, a model to which Rudi contributed substantially with his early work.
2 Knowledge Engineering Background (For Those Who Don’t Know Their History)

The most prominent structural model of reusable components for Knowledge Based Systems was the CommonKADS model, developed by a community of researchers in the late ’80s and early ’90s and led by the University of Amsterdam. The best documentation of the CommonKADS models is [12], but we describe the core
Fig. 1 CommonKADS expertise model
model (the so-called “expertise model”) here, to make this paper self-contained, and to make this model accessible to the current-day Semantic Web community. The structure of the CommonKADS expertise model is depicted in Fig. 1.

Problem Types: CommonKADS identifies a small number of problem types that account for many specific problem instances typically tackled by knowledge-based systems. These problem types are generic types of reasoning problems. Examples are diagnosis (reasoning from observations to causes), design (reasoning from functional requirements to a configuration of components), monitoring (interpreting a sequence of signals or events), classification (reasoning from a set of observed properties to an inferred class), etc., amounting to a dozen (at most a few dozen) of such problem types. Each of these problem types can be characterised functionally, without yet committing to any particular form of reasoning that must be deployed, much less to any particular system components or architecture. (In the paper that follows, we will instead use the term “task types” for the same notion.)

Task Structures: These are hierarchical decompositions of problem types into procedures that will solve the problems of the relevant type. These procedural decompositions into subtasks come with a control structure over these procedural steps. The atomic procedural steps (the leaves of the decomposition) are called primitive tasks, which correspond 1–1 to primitive inferences.

Inference Structures: These primitive inferences are linked together in data-dependency graphs that are called inference structures. These graphs show which primitive inference steps pass on output to which other primitive inference steps. The data passed between inference steps are called knowledge roles. The inference structures also show which parts of the domain knowledge are used by which
inference steps. The use of domain knowledge by inference steps is through static knowledge roles.

Domain Layer: contains the static domain knowledge to be used by the inference steps. Its form should correspond to the schemata that are required by the static knowledge roles of the primitive inference steps in the inference layer.

A Problem Solving Method can now be characterised as consisting of an appropriately chosen task-structure plus corresponding decomposition into an inference structure of primitive inferences, plus the domain schemata that are required to fill the static knowledge roles used by these inferences.

This layered architecture (distinguishing a procedural task layer, a declarative inference layer and a static domain layer) allows for the reusability of components at each of the levels: the same inference layer can be executed under different control strategies at the task layer; an inference layer can be deployed on different domains (as long as the domain layer offers the minimally required schemata); the same domain layer can be reused for different inference tasks, etc.
3 Introduction

Starting with the seminal work by Clancey [4], research in Knowledge Engineering has developed a theory of generic types of tasks, which can be implemented by a generic set of problem solving methods, decomposable into primitive inference steps. Examples of the task types that were identified are diagnosis, design, scheduling, etc. Via the problem solving methods (e.g. a pruning method for the classification task) these tasks could be decomposed into elementary inference steps such as generate-candidate, specify-attribute, obtain-feature, etc. This work has led to well-founded methodologies for building knowledge-based systems out of reusable components. The CommonKADS methodology [12] is perhaps the best known of such methodologies, although certainly not the only one. Another example is the generic tasks approach by Chandrasekaran [3]. We will use the terminology of the CommonKADS approach, and will identify tasks and inferences in the context of the Semantic Web.

This work in Knowledge Engineering originated from a shared frustration concerning the lack of reusable components for building knowledge-based systems. If “Knowledge Engineering” was really “engineering”, where were the reusable components, and why did every implementor have to start from scratch? The important insight was to describe the tasks that Knowledge Based Systems perform at a sufficiently abstract level, the “Knowledge Level”, introduced by Newell in his 1980 AAAI presidential address [9] and revisited later in [10]. Once the discussion moved from the implementation details on the “symbol level” to the more abstract “knowledge level”, it became possible to identify generic task-types, reusable problem solving methods and reusable elementary inference steps. Since then, libraries of reusable
components have been published both in books (e.g. [2]) and on websites (e.g. http://www.commonkads.uva.nl), and are now in routine use. As is well known from other branches of engineering, reusable components help to substantially increase the quality of design and construction whilst simultaneously lowering the costs by reusing tried-and-tested design patterns and component implementations. Menzies [8] illustrates the reuse benefits for reasoning patterns.
3.1 Applicable to Semantic Web Engineering?

The above raises the question of whether or not similar lessons can be applied to Semantic Web engineering. Can we identify reusable patterns and components that could help designers and implementers of Semantic Web applications? It hardly needs arguing that work on the Semantic Web has put great emphasis on the reusability of knowledge, in the form of ontologies. It is fair to say that insights about reusable knowledge elements were present at the birth of the Semantic Web enterprise. The idea of reusable ontologies has also been generalised into work on reusable ontology patterns (e.g. [1, 7]). However, all of this work deals with reusable knowledge elements. There is little if any work on reusable reasoning patterns. This paper is a first attempt at finding reusable reasoning patterns for Semantic Web applications.
3.2 Structure of the Paper

After discussing related work (Sect. 4) and some formal preliminaries (Sect. 5), the paper is structured in the following steps:

1. Identify typical task types and give semi-formal definitions (Sect. 6.1);
2. Validate the task types by showing that a large number of representative and realistic Semantic Web applications can be classified into a limited number of such task types (Sect. 6.2);
3. Define primitive inference steps by giving semi-formal definitions (Sect. 7.1);
4. Validate the primitive inference steps by showing that the task types can be decomposed into the given inference steps (Sect. 7.2).

If the above steps were to succeed, this would be of great value to Semantic Web application builders, leading to the possibility of libraries of reusable design patterns and component implementations. It would also constitute an advance in our understanding of the landscape of Semantic Web applications, which has until now mostly grown bottom up, driven by available technical and commercial opportunities, with little or no theory-formation on different types of applications and their relationships.
4 Related Work

As described above, most if not all work on reusability for the Semantic Web has focused on reusable knowledge, to the exclusion of reusable reasoning patterns. The only exception that we are aware of is the work by Oren [5, 11]. Although similar in approach (that work also surveys the past years of Semantic Web Challenge entries to detect recurring patterns in these applications), its focus is rather different. In Knowledge Engineering terms, it focuses more on “symbol level” issues such as architectural components, programming language used and (de)centralisation of the architecture, whereas we are interested in a “knowledge level” analysis that is independent of implementation details. The one element in Oren’s analysis that comes closest to our goals is his “application type”, which has a large overlap with our notion of “task types”. However, Oren then links these application types to required architectural components (storage, user-interface, etc.), but does not link them to primitive reasoning steps, which is the goal that we are pursuing.

Hildebrand et al. [6] analyse 33 semantic search applications with an aim similar to ours (discovering re-usable components), but their scope is more limited (considering only search applications) and they do not attempt any (semi-)formal definitions.

From this brief analysis, we conclude that ours is the first attempt at a systematic analysis of Semantic Web reasoning patterns.
5 Formal Preliminaries

We will use the terms terminology, ontology, class, concept and instance as follows: a terminology is a set of class-definitions (a.k.a. concepts) organised in a subsumption hierarchy; instances are members of such concepts. Ontologies consist of terminologies and sets of instances.

More formally, we will consider an ontology O as a set of triples ⟨s, p, o⟩, where ∈ and ⊆ are special cases of p. In other words, we consider two specific predicates: ⊆ for the subsumption relation, and ∈ for the membership relation. We use ⟨c1, ⊆, c2⟩ to denote that a class c1 is subsumed by a class c2. We use ⟨i, ∈, c⟩ to denote that an individual i is a member of a class c. A terminology T is a set of triples whose predicate is the subsumption relation ⊆ only, and an instance set I is a set of triples whose predicate is the membership relation ∈ only. T resp. I can be extracted from an ontology O with the function T(O) resp. I(O).1 An ontology is the union of its terminology, its instance set, and possibly triples ⟨s, p, o⟩ using other relations p: O ⊇ T ∪ I. We will overload the ∈ notation and also use it to denote that a triple is a member of a set (as in: ⟨s, p, o⟩ ∈ O).

We do not assume that triple-sets are deductively closed. We will use ⊢ to denote that a triple can be derived from a set (as in: O ⊢ ⟨s, p, o⟩), using some appropriate semantics (e.g. RDF Schema or OWL DL derivations). O* contains all triples ⟨s, p, o⟩ which can be derived from O. Please note that O* contains O and may be infinite. We will use lower case letters c, i for a single concept or instance, and uppercase letters C, I for concept sets containing ⟨ci, ⊆, cj⟩ or instance sets containing ⟨i, ∈, ck⟩. We will often use the terms “ontologies” and “knowledge” interchangeably.

1 One might want to introduce additional projection functions such as C(O) to extract just the set of all concepts {c1, c2, ...} and similarly for the set of all instances {i1, i2, ...}. Such additional projection functions would make some of the formalisations that follow more elegant (and in some places even more correct).
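The notions above can be made concrete with a minimal sketch, given here only as an illustration and not as part of the original formalisation. The predicate names `subClassOf` and `memberOf`, the helper names, and the two derivation rules used in `closure` are all assumptions; real RDF Schema or OWL DL semantics have many more rules, and O* may be infinite where this toy closure is always finite.

```python
from typing import NamedTuple, Set

SUB = "subClassOf"    # stand-in for the subsumption predicate (name is an assumption)
MEMBER = "memberOf"   # stand-in for the membership predicate (name is an assumption)


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def terminology(O: Set[Triple]) -> Set[Triple]:
    """T(O): the triples of O whose predicate is subsumption."""
    return {t for t in O if t.p == SUB}


def instance_set(O: Set[Triple]) -> Set[Triple]:
    """I(O): the triples of O whose predicate is membership."""
    return {t for t in O if t.p == MEMBER}


def closure(O: Set[Triple]) -> Set[Triple]:
    """A toy O* under two illustrative rules only: subsumption is
    transitive, and membership is inherited along subsumption."""
    closed = set(O)
    while True:
        new = {Triple(a.s, a.p, b.o)
               for a in closed if a.p in (SUB, MEMBER)
               for b in closed if b.p == SUB and a.o == b.s} - closed
        if not new:
            return closed
        closed |= new


O = {
    Triple("Cat", SUB, "Mammal"),
    Triple("Mammal", SUB, "Animal"),
    Triple("tom", MEMBER, "Cat"),
    Triple("tom", "likes", "fish"),   # a triple using some other relation p
}
```

Here `terminology(O)` and `instance_set(O)` together form a proper subset of `O` because of the `likes` triple, and `closure(O)` additionally contains the derived membership of `tom` in `Animal`.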
6 Task Types

In this section we will first (Sect. 6.1) identify a limited number of general task types and give semi-formal definitions for each of them. Notice that these tasks are identified for Semantic Web applications in the same way that the tasks of the CommonKADS framework (like diagnosis, etc.) are meant for knowledge based systems. The selected tasks are the most prominent ones found in current Semantic Web applications; the selection is not intended to be complete. For each of these task types, we give the most common definition of that task, although we show in places that variations in these definitions are possible. Subsequently (Sect. 6.2) we will show how a representative set of Semantic Web applications can all be understood as instances of this small set of task types.
6.1 Defining Semantic Web Task Types

We will characterise seven different task types. For each of them, we will give an informal description, the signature (types of their input and output parameters), and a semi-formal definition of the functionality (relation between input and output).
6.1.1 Search

Perhaps the most prototypical Semantic Web application is search, motivated by the low precision of current search engines. Traditional search engines take as their inputs a query (usually in the form of a set of keywords) plus a data-set of instances (usually a very large set of web-pages), and return a subset of those instances. A Semantic Web search engine would take a query in the form of a concept description; this concept description would be matched against an ontology (i.e. a terminology used to organise an instance set), and members of the instance set matching the query-concept would be returned. Hence, search is a mapping which maps a given concept c and ontology O to a set of instances I ⊆ I(O):
Search: c × O → I
input request: a concept c.
input knowledge: an ontology O, hence consisting of a terminology T(O) (a set of ⟨ci, ⊆, cj⟩), and an instance set I(O) (a set of ⟨i, ∈, ck⟩).
answer: search(c, O) returns an instance set such that:
search(c, O) = {⟨i, ∈, c⟩ | ∃cj : ⟨cj, ⊆, c⟩ ∈ T(O) ∧ ⟨i, ∈, cj⟩ ∈ I(O)}.
In other words: search(c, O) returns all instances i that are known to be members of subconcepts of c (and hence are members of c as well). Notice that this definition only returns instances of concepts cj that are known to be a subconcept of c (since we demanded ⟨cj, ⊆, c⟩ ∈ T(O)), and might hence result in an empty answer. An alternative definition would be to allow the use of deductive machinery to derive the subconcepts of c:

search(c, O) = {⟨i, ∈, c⟩ | ∃cj : O ⊢ ⟨cj, ⊆, c⟩ ∧ O ⊢ ⟨i, ∈, cj⟩}.

Instead of O ⊢ ⟨cj, ⊆, c⟩ (resp. O ⊢ ⟨i, ∈, cj⟩) we can write ⟨cj, ⊆, c⟩ ∈ T(O*) (resp. ⟨i, ∈, cj⟩ ∈ I(O*)). This same move (exchanging O with O*) could be applied to many of the other task types in this section.
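Both variants of the search definition can be sketched in a few lines. This is only an illustration under stated assumptions: the predicate names, the `derive` helper and its two derivation rules (transitivity of subsumption, inheritance of membership) are hypothetical stand-ins for whatever semantics is actually used to compute O*.

```python
from typing import NamedTuple, Set

SUB, MEMBER = "subClassOf", "memberOf"   # assumed predicate names


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def search(c: str, O: Set[Triple]) -> Set[Triple]:
    """Literal variant: only explicitly stated subconcepts and memberships count."""
    return {Triple(m.s, MEMBER, c)
            for sub in O if sub.p == SUB and sub.o == c
            for m in O if m.p == MEMBER and m.o == sub.s}


def derive(O: Set[Triple]) -> Set[Triple]:
    """Toy stand-in for O* (two illustrative rules only)."""
    closed = set(O)
    while True:
        new = {Triple(a.s, a.p, b.o)
               for a in closed if a.p in (SUB, MEMBER)
               for b in closed if b.p == SUB and a.o == b.s} - closed
        if not new:
            return closed
        closed |= new


def search_deductive(c: str, O: Set[Triple]) -> Set[Triple]:
    """Deductive variant: the same definition with O exchanged for O*."""
    return search(c, derive(O))


O = {
    Triple("Cat", SUB, "Mammal"),
    Triple("Mammal", SUB, "Animal"),
    Triple("tom", MEMBER, "Cat"),
    Triple("rex", MEMBER, "Mammal"),
}
```

On this toy ontology, `search("Animal", O)` finds only `rex`, because `Mammal` is the only stated subconcept of `Animal`; `search_deductive("Animal", O)` also finds `tom`. `search("Cat", O)` is empty, which illustrates the remark that the literal definition may yield an empty answer.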
6.1.2 Browse

Browsing is very similar to searching (and often mentioned in the same breath), but the crucial difference is that its output can either be a set of instances (as in search), or a set of concepts that can be used for repeating the same action (i.e. further browsing). Thus, browse is a mapping which maps a given concept c and ontology O to a set of instances I plus a set of concepts C:

Browse: c × O → I × C
input request: a concept c.
input knowledge: an ontology O consisting of a terminology T(O) and an instance set I(O).
answer: browse(c, O) returns a set of instances and a set of concepts such that:
browse(c, O) = search(c, O) ∪ {⟨cj, ⊆, c⟩ | ⟨cj, ⊆, c⟩ ∈ T(O) ∧ ¬indirect(cj, c, O)} ∪ {⟨c, ⊆, cj⟩ | ⟨c, ⊆, cj⟩ ∈ T(O) ∧ ¬indirect(c, cj, O)}
with indirect(cj, c, O) ↔ ∃ck : ⟨ck, ⊆, c⟩ ∈ T(O) ∧ ⟨cj, ⊆, ck⟩ ∈ T(O).
Besides instances that a user might be interested in (based on the given input concept c), this returns the immediate neighbourhood of c (immediate sub- and super-concepts of c known in T ), to be used for repeated browsing by the user. As with search, alternative definitions are possible by returning a wider neighbourhood
for c, consisting also of indirect sub- and super-concepts, or by deducing a neighbourhood of c instead of being limited to the explicitly known neighbourhood (using ⊢ instead of ∈). Also, the above definition of search only exploits the subclass hierarchy, and ignores the possibility (often used in practical applications) to browse along other properties. Again, it would be straightforward to give alternative definitions that would cover this. This same move (considering other predicates besides ⊆) could be applied to many of the other task types in this section.
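As an illustration of the literal browse definition, the sketch below returns the search result together with the immediate neighbourhood of c. It is a hedged example only: the predicate names are assumptions, `indirect` is a name chosen here for the auxiliary predicate of the definition, the result is returned as a pair to mirror the signature I × C, and the terminology is assumed to contain no reflexive subsumption triples.

```python
from typing import NamedTuple, Set, Tuple

SUB, MEMBER = "subClassOf", "memberOf"   # assumed predicate names


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def search(c: str, O: Set[Triple]) -> Set[Triple]:
    """Literal search: members of explicitly stated subconcepts of c."""
    return {Triple(m.s, MEMBER, c)
            for sub in O if sub.p == SUB and sub.o == c
            for m in O if m.p == MEMBER and m.o == sub.s}


def indirect(cj: str, c: str, O: Set[Triple]) -> bool:
    """True if some ck is stated to lie between cj and c."""
    return any(t.p == SUB and t.o == c and Triple(cj, SUB, t.s) in O for t in O)


def browse(c: str, O: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Instances for c plus its immediate stated sub- and super-concepts."""
    subs = {t for t in O
            if t.p == SUB and t.o == c and not indirect(t.s, c, O)}
    supers = {t for t in O
              if t.p == SUB and t.s == c and not indirect(c, t.o, O)}
    return search(c, O), subs | supers


O = {
    Triple("Cat", SUB, "Mammal"),
    Triple("Dog", SUB, "Mammal"),
    Triple("Mammal", SUB, "Animal"),
    Triple("Cat", SUB, "Animal"),      # stated, but not an immediate neighbour
    Triple("tom", MEMBER, "Cat"),
}
```

Browsing `Animal` returns `tom` as an instance and only `Mammal` as a neighbouring concept: the stated triple relating `Cat` to `Animal` is filtered out because `Mammal` lies in between. The concept triples that come back can be fed into the next call to `browse`.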
6.1.3 Data Integration

The goal of data-integration is to take multiple instance sets, each organised in their own terminology, and to construct a single, merged instance set, organised in a single, merged terminology. Hence, data integration is a mapping which maps a set of ontologies to a (new) ontology.

Integrate: {O1, ..., On} → O
input request: multiple ontologies Oi with their terminologies Ti = T(Oi) and their instance sets Ii = I(Oi).
answer: a single ontology O with terminology T and instance set I:
integrate({O1, ..., On}) = O such that I = ⋃ Ii and T ⊇ ⋃ Ti.

It is difficult to give a more specific I/O-condition to characterise data-integration. Typically (but not always), all input instances are part of the output (I = ⋃ Ii), and typically (but not always), the output terminology consists of all the input terminologies (T ⊇ ⋃ Ti), enriched with relationships between elements of the different Ti, such as ⟨ci, ⊆, cj⟩ or ⟨ci, sameAs, cj⟩.
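The typical case described above can be sketched as a plain union plus a set of bridging triples. This is a minimal illustration, not a data-integration method: the `bridges` argument is a hypothetical extra input standing for relationships between the input terminologies (the text does not say how such relationships are obtained), and the predicate names are assumptions.

```python
from typing import Iterable, NamedTuple, Set

SUB, MEMBER = "subClassOf", "memberOf"   # assumed predicate names


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def terminology(O: Set[Triple]) -> Set[Triple]:
    return {t for t in O if t.p == SUB}


def instance_set(O: Set[Triple]) -> Set[Triple]:
    return {t for t in O if t.p == MEMBER}


def integrate(ontologies: Iterable[Set[Triple]],
              bridges: Iterable[Triple] = ()) -> Set[Triple]:
    """Union of all inputs plus bridging triples between their terminologies.

    Bridges may relate concepts (e.g. subsumption or sameAs) but must not add
    membership triples, so that the merged instance set stays the plain union.
    """
    bridges = set(bridges)
    if any(b.p == MEMBER for b in bridges):
        raise ValueError("bridges must not introduce new instances")
    return set().union(*ontologies) | bridges


O1 = {Triple("Car", SUB, "Vehicle"), Triple("a1", MEMBER, "Car")}
O2 = {Triple("Automobile", SUB, "Machine"), Triple("b7", MEMBER, "Automobile")}
merged = integrate([O1, O2], {Triple("Car", "sameAs", "Automobile")})
```

In `merged`, the instance set is exactly the union of the two input instance sets, and the terminology contains both input terminologies; the `sameAs` bridge is an additional triple using another relation.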
6.1.4 Personalisation and Recommending

Personalisation consists of taking a (typically very large) data set plus a personal profile, and returning a (typically much smaller) data set based on this user profile. The profile which characterises the interests of the user can be in the form of a set of concepts, or a set of instances. For instance, typical recommender services at on-line shops use previously bought items, which are instances, while news-casting sites typically use general categories of interest, which are concepts:

personalise: Idata × Iprofile × O → Iselection
or
personalise: Idata × Cprofile × O → Iselection.
Personalise: Idata × Cprofile × O → Iselection
input request: an instance set Idata of triples ⟨i, ∈, cj⟩ and a profile characterised as either a set of instances Iprofile or a set of concepts Cprofile.
input knowledge: an ontology O.
answer: a reduced instance set Iselection with
personalise(Idata, Cprofile, O) = {⟨i, ∈, c′⟩ | ∃c : i ∈ Idata ∧ c ∈ Cprofile ∧ O ⊢ ⟨c′, ∼, c⟩ ∧ ⟨i, ∈, c′⟩ ∈ I(O)}.
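The concept-profile variant can be sketched as follows. The sketch reads the definition as: keep a membership triple of O when its instance occurs in Idata and its concept is related to some profile concept. That reading, the predicate names, and the `related` callback (a stand-in for derivability of the relation ∼, supplied by the caller) are assumptions made for illustration.

```python
from typing import Callable, NamedTuple, Set

SUB, MEMBER = "subClassOf", "memberOf"   # assumed predicate names


class Triple(NamedTuple):
    s: str
    p: str
    o: str


Related = Callable[[str, str, Set[Triple]], bool]


def personalise(I_data: Set[Triple], C_profile: Set[str],
                O: Set[Triple], related: Related) -> Set[Triple]:
    """Select membership triples whose concept is related to a profile concept."""
    candidates = {t.s for t in I_data}          # instances occurring in the data set
    return {t for t in O
            if t.p == MEMBER and t.s in candidates
            and any(related(t.o, c, O) for c in C_profile)}


def stated_subconcept(c1: str, c2: str, O: Set[Triple]) -> bool:
    """One possible choice of the relation: explicitly stated subsumption."""
    return Triple(c1, SUB, c2) in O


O = {
    Triple("Thriller", SUB, "Fiction"),
    Triple("b1", MEMBER, "Thriller"),
    Triple("b2", MEMBER, "Cookbook"),
}
I_data = {t for t in O if t.p == MEMBER}
selection = personalise(I_data, {"Fiction"}, O, stated_subconcept)
```

With a profile containing only `Fiction`, the selection keeps the thriller `b1` and drops the cookbook `b2`. Passing a different `related` function (for example one that returns a weight rather than a boolean) would be one way to obtain a ranking.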
That is, personalisation returns instances that are members of concepts which are in some way related to the target concept(s) through some relevant relation ∼. Interestingly, if we take ∼ to be ⊆, this becomes essentially equivalent to our above definition of search.

In practice, personalisation often yields a ranked list of results. The above definition does not cover this, but extensions of the above definition would be possible. This would involve changing the membership test c ∈ Cprofile to be a weighted function, and then exploiting this weight to yield a ranking of answers. This same move (obtaining ranking) could be applied to many of the other task types in this section.

6.1.5 Web-Service Selection

Rather than only searching for static material such as text and images, the aim of Semantic Web services is to allow searching for active components, using semantic descriptions of web-services. We can then regard a concept c as the description of some functionality, and an instance i as a particular web-service. Membership i ∈ c is then interpreted as “service i implements specification c”, and ci ⊆ cj as “specification ci is a specialisation of specification cj” (and consequently, every service i that implements specification ci also implements specification cj). Just for mnemonic reasons, we will use f for functionality instead of c, and s for service instead of i, and similarly S for a set of services instead of I. At the level of the signature, the characterisation of this task is the same as that of general search:

Service selection: f × O → S
input request: required functionality f.
input knowledge: an ontology O containing a set of candidate services S(O) and a hierarchy of service specifications T(O).
answer: members of the candidate set whose specification satisfies the required functionality:
service_selection(f, O) = {⟨s, ∈, f⟩ | ∃fj : ⟨fj, ⊆, f⟩ ∈ T(O) ∧ ⟨s, ∈, fj⟩ ∈ S(O)}.

The difference from search is of course that the query describes functionality (rather than content), and the candidate set consists of services. In general, this will make the relations ∈ and ⊆ much harder to compute than in the case of search (where we deal instead with static data). Of course, the different variations that we gave for the definition of search (e.g. with or without deduction) can be applied here as well.
6.1.6 Web-Service Composition

A goal even more ambitious than web-service selection is the task of composing a given number of candidate services into a single composite service with a specific functionality. The input of web-service composition is the same as for the selection of a single web-service above, but the output can now be an arbitrary control flow over a set of web-services. We will informally denote such a flow with FLOW without further specification. This results in a specification similar to that of search:

Service composition: f × O → FLOW
input request: required functionality f.
input knowledge: an ontology O with a set of candidate services S(O) and a hierarchy of service specifications T(O).
answer: members of the candidate set whose compound specification FLOW satisfies the required functionality:
compose(f, O) = FLOW such that T(O) ⊢ ⟨FLOW, ∈, f⟩, and ⟨s, ∈, fi⟩ ∈ S(O) for each service s occurring in FLOW.

I.e. the hierarchy of specifications in T(O) allows us to infer that the computed FLOW satisfies the required functionality, and FLOW must be composed of services taken from S(O).
6.1.7 Semantic Enrichment

With this task type, objects such as images or documents are annotated with meta-data. Such added meta-data can be used by task types like search or browse to increase the quality of their answers. It maps a single instance i to a set of triples about that instance:

Semantic enrichment: i → I
input request: an instance i to be enriched.
answer: a set of triples I = {⟨s, p, o⟩ | s = i} that all have i as their subject.

Notice that we have allowed triples with other relation-symbols besides ∈ or ⊆, allowing for other, domain specific, properties. The above definition requires that the answer consists only of triples that have the instance i as the subject. This might well be too restrictive. Consider enriching a picture taken at a conference dinner; one may not only enrich such a picture with ⟨i, shows, Rudi⟩ but also with ⟨Rudi, affiliation, AIFB⟩. This is just one example of the many variations that are possible for any of the exemplar task types that we have defined in this section.
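The difference between the strict definition and the relaxed variation from the conference-dinner example can be sketched as below. The sketch assumes that candidate annotations are already available in some pool K of triples, which sidesteps the actual work of producing them; K, the function names and the instance identifiers are hypothetical.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def enrich(i: str, K: Set[Triple]) -> Set[Triple]:
    """Strict variant: only triples that have i as their subject."""
    return {t for t in K if t.s == i}


def enrich_relaxed(i: str, K: Set[Triple]) -> Set[Triple]:
    """Relaxed variant: also triples about the things that i is linked to."""
    direct = enrich(i, K)
    mentioned = {t.o for t in direct}
    return direct | {t for t in K if t.s in mentioned}


K = {
    Triple("img42", "shows", "Rudi"),
    Triple("Rudi", "affiliation", "AIFB"),
    Triple("img7", "shows", "Dieter"),
}
```

For the picture `img42`, the strict variant yields only the `shows` triple, while the relaxed variant also includes the affiliation of the person shown.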
6.2 Validating the Task Types

The key questions at this point are: how reusable is the above set of task types? Can most Semantic Web applications be described in terms of these task types? Is this small set of seven task-types sufficient, or will we end up inventing new task types for every new application (hence defeating the goal of reusability)?2

In order to measure the completeness and reusability of our list of task types, we have analysed all entries to the Semantic Web Challenge events of the years 2005, 2006 and 2007 to see if they could be properly described with our task types. As is well known, the “Semantic Web Challenge”3 is an annual event that stimulates R&D by showing the state-of-the-art in Semantic Web applications every year. It gives researchers an opportunity to showcase their work and to compare it to others. Since the competition leaves the functionality of the entries very unconstrained and allows the submissions of a large variety of Semantic Web applications, we claim that the collected entries over a number of years together provide a representative sample of state of the art Semantic Web applications, and are hence a suitable data-set for verifying the completeness and reusability of our list of task types. It is noteworthy that in his independent analysis, Oren [5, 11] also turned to the entries in the Semantic Web Challenge as a valid dataset.

Figure 4 shows the results of our analysis. It covers all entries to the 2005, 2006 and 2007 competitions with the exception of a small number of applications about which we could not obtain any information, and a single application for which we were not able to understand the functionality. The analysis in Fig. 4 leads us to the following main observations:

– All but one of the applications could be classified in terms of our task-types. The single missing application (SMART, from 2007) can best be described as performing “question answering”. This would indeed be a valid (and reusable) expansion of our list of task-types, but we were unable to come up with a reasonably formal definition of this task-type.
– Often, a single application belongs to multiple task types. See for instance the prize winning e-culture application “MultimediaN” that performs a combination of searching, browsing, and semantic enrichment. This phenomenon is well known from Knowledge Engineering, where a single system also often performs multiple tasks (e.g. first diagnosis, then planning a treatment). Notice that Search and Browse often occur together in an application.
– It is noticeable that the combined 2005–2007 Challenges do not contain a single submission that can be described as web-service selection. This raises some doubts as to the necessity of this task type. At the same time, there were some (although few) entries that could be properly described as web-service composition.

In summary, we interpret these findings as support for the reasonable completeness and reusability of the task-types that we defined in Sect. 6.1.

2 Please note we do not intend to present a complete list of tasks but the most prominent ones.
3 http://challenge.semanticweb.org/.
7 Primitive Inferences

In this section we will define a number of primitive inference steps, and we will show that each of the task types identified earlier can be decomposed into a limited number of primitive inference steps.

The qualification “primitive” is perhaps in need of some explanation. Just as in the CommonKADS methodology, we interpret the term “primitive” to mean that from the application builder’s point of view, it is not interesting to further decompose this step, i.e. application builders would typically regard such a step as atomic. Of course this is not a hard criterion: sometimes it might be useful to further decompose such a step, e.g. for optimisation reasons. Also, what is considered a primitive, elementary, atomic component for an application builder might well be a highly non-elementary, non-atomic and very complex operation to implement. And indeed, many of the primitive inference steps that we define below have been subject to several years of research and development. Thus, “primitive” should not be read as “simple”. It merely refers to the fact that this step will typically be regarded as atomic by application builders.
7.1 Defining Primitive Inference Steps

In this section we define a small number of primitive inference steps for Semantic Web applications. We give a semi-formal definition of these primitive inferences, including their signature.

Realisation determines of which concepts a given instance is a member:
• Signature: i × O → {ck, ...}
• Definition: find all ck such that O ⊢ i ∈ ck

Subsumption determines whether one concept is a subset of another:
• Signature: c1 × c2 × O → bool
• Definition: determine whether O ⊢ c1 ⊆ c2

Mapping finds a correspondence relation between two concepts defined in the ontology O. We follow the common approach, where the correspondence relation can be either equivalence, subsumption or disjointness:
• Signature: c1 × c2 × O → {=, ⊆, ⊇, ⊥}
• Definition: find an r ∈ {=, ⊆, ⊇, ⊥} such that c1 r c2

Retrieval is the inverse of realisation: determining which instances belong to the given concept:
• Signature: c × O → {ik, ...}
• Definition: find all ik such that ik ∈ c
Fig. 2 Task types in terms of primitive inference steps
[Table. Rows (task types): Search, Browse, Data integration, Personalisation, Service selection, Service composition, Semantic enrichment. Columns (primitive inference steps): Realisation, Subsumption & classification, Mapping, Retrieval.]
Classification determines where a given class should be placed in a subsumption hierarchy:
• Signature: c × O → ({cl, ...}, {ch, ...})
• Definition: find all highest subclasses cl and all lowest super-classes ch such that O ⊢ cl ⊆ c ⊆ ch

Each of these inference steps can be defined on the literal content of O, or in terms of the deductive closure O*. This would of course have consequences for how the task types are decomposed in terms of these primitive inferences.

Our choice of these five primitive inference steps is not the only choice possible. For instance, both classification and mapping can be reduced to repeated subsumption checks, and are hence not required strictly speaking as separate inferences. Similarly, it is well known that subsumption in turn can be reduced to satisfiability. However, we have chosen the above five as primitive inference steps because they seem to constitute a conceptually coherent (although not formally minimal) set.

Our formalisation of the different inference steps is somewhat asymmetric, with for example subsumption defined as a boolean function c1 × c2 × O → bool while realisation is defined as a non-boolean function i × O → {ck, ...}, which could of course also have been defined (more symmetrically) as a boolean function c × O × i → bool. Our choice is simply a matter of convenience; for example, Fig. 3 is easier to draw with realisation as a non-boolean function.
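The five primitive inference steps can be sketched over a triple set as follows. This is an illustrative sketch rather than a reference implementation: the predicate names, the toy `derive` closure (transitivity of subsumption and inheritance of membership only), the string labels returned by `mapping`, and the assumption of an acyclic hierarchy are all choices made here. Disjointness is not modelled, so `mapping` returns `None` when neither direction of subsumption can be derived.

```python
from typing import NamedTuple, Optional, Set, Tuple

SUB, MEMBER = "subClassOf", "memberOf"   # assumed predicate names


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def derive(O: Set[Triple]) -> Set[Triple]:
    """Toy deductive closure (two illustrative rules only)."""
    closed = set(O)
    while True:
        new = {Triple(a.s, a.p, b.o)
               for a in closed if a.p in (SUB, MEMBER)
               for b in closed if b.p == SUB and a.o == b.s} - closed
        if not new:
            return closed
        closed |= new


def realisation(i: str, O: Set[Triple]) -> Set[str]:
    """All concepts of which i is derivably a member."""
    return {t.o for t in derive(O) if t.p == MEMBER and t.s == i}


def subsumption(c1: str, c2: str, O: Set[Triple]) -> bool:
    """Whether c1 is derivably subsumed by c2 (reflexive by convention here)."""
    return c1 == c2 or Triple(c1, SUB, c2) in derive(O)


def mapping(c1: str, c2: str, O: Set[Triple]) -> Optional[str]:
    """Correspondence found by two subsumption checks; None if undetermined."""
    below, above = subsumption(c1, c2, O), subsumption(c2, c1, O)
    if below and above:
        return "equivalent"
    if below:
        return "subsumed_by"
    if above:
        return "subsumes"
    return None


def retrieval(c: str, O: Set[Triple]) -> Set[str]:
    """All instances that are derivably members of c."""
    return {t.s for t in derive(O) if t.p == MEMBER and t.o == c}


def classification(c: str, O: Set[Triple]) -> Tuple[Set[str], Set[str]]:
    """Highest subclasses and lowest super-classes of c."""
    closed = derive(O)
    subs = {t.s for t in closed if t.p == SUB and t.o == c and t.s != c}
    sups = {t.o for t in closed if t.p == SUB and t.s == c and t.o != c}
    highest = {x for x in subs
               if not any(Triple(x, SUB, y) in closed for y in subs - {x})}
    lowest = {x for x in sups
              if not any(Triple(y, SUB, x) in closed for y in sups - {x})}
    return highest, lowest


O = {
    Triple("Cat", SUB, "Mammal"),
    Triple("Dog", SUB, "Mammal"),
    Triple("Mammal", SUB, "Animal"),
    Triple("tom", MEMBER, "Cat"),
}
```

Note how `mapping` and `classification` are built from repeated subsumption checks, in line with the remark that they are not strictly required as separate inferences. Dropping the calls to `derive` would give the variant defined on the literal content of O.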
7.2 Decomposing the Task Types to Primitive Inferences

The table in Fig. 2 shows how each of the task-types from Sect. 6.1 can be decomposed into the primitive inferences described in Sect. 7.1. (In the table, classification and subsumption have been merged into a single column since the former is the iterated version of the latter.) We lack the space to discuss all of these decompositions in detail, and will discuss only two examples:
Fig. 3 Inference structures for the task types Search and Personalisation
Search: The description of the search task-type in Sect. 6.1 shows that it is a combination of classification (to locate the query-concept in the ontology in order to find its direct sub- or super-concepts) followed by retrieval (to determine the instances of those concepts, which form the answers to the query).

Personalisation: If the personal profile in the personalisation task-type consists of a set of instances (e.g. previously bought items), then personalisation is a composition of realisation (to obtain the concepts that describe these instances), classification (to find closely related concepts), and retrieval (to obtain instances of such related concepts, since these instances might be of interest to the user).

In a similar way, all of the prototypical task-types we described in Sect. 6.1 can be implemented in terms of the small set of primitive inference steps described in this section, resulting in the decomposition shown in Fig. 2. Notice that the table in Fig. 2 only displays the minimally required reasoning tasks for each task-type. For example, it is possible to equip the search task with a mapping component in order to map the vocabulary of a user-query to the vocabulary of the ontology used during search. Similar additions could have been made for many other task-types.

Notice too that semantic enrichment cannot be defined based on these reasoning tasks. Usually in reasoning the input facts are assumed to be given, while semantic enrichment deals with constructing these input facts. Hence, semantic enrichment cannot be seen in terms of inference steps, but it is the only task type in our list that suffers from not being decomposable into a combination of primitive inference steps.

The CommonKADS method [12] uses the notion of an inference structure to graphically depict the decomposition of a task into primitive inference steps, by showing the data-dependencies between the primitive inference steps that together make up a task. In Fig. 3, we show the inference structures for the Search and Personalisation task-types using the decomposition into primitive inference steps given above. This is what we consider the reasoning patterns. Also, Fig. 3 shows the structural similarity between Search and Personalisation: it makes clear that Personalisation is essentially Search, but preceded by a realisation-step to map instances to the concepts to which they belong.
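The two decompositions discussed above can be sketched as compositions of primitive inference steps that are passed in as functions, so that the data-dependencies between the steps are explicit and the task code stays independent of how each step is implemented. The function names, the dictionary-backed toy steps, the decision to retrieve from the query concept as well as its subconcepts, and the removal of profile items from the recommendations are all assumptions made for this illustration.

```python
from typing import Callable, Dict, Set, Tuple

Realise = Callable[[str], Set[str]]                       # instance -> concepts
Classify = Callable[[str], Tuple[Set[str], Set[str]]]     # concept -> (subs, supers)
Retrieve = Callable[[str], Set[str]]                      # concept -> instances


def search_task(query: str, classify: Classify, retrieve: Retrieve) -> Set[str]:
    """Search = classification followed by retrieval."""
    subs, _supers = classify(query)
    answers: Set[str] = set()
    for concept in {query} | subs:
        answers |= retrieve(concept)
    return answers


def personalise_task(profile: Set[str], realise: Realise,
                     classify: Classify, retrieve: Retrieve) -> Set[str]:
    """Personalisation = realisation, then classification, then retrieval."""
    concepts: Set[str] = set()
    for instance in profile:
        concepts |= realise(instance)          # concepts describing the profile
    related = set(concepts)
    for concept in concepts:
        subs, supers = classify(concept)       # closely related concepts
        related |= subs | supers
    suggestions: Set[str] = set()
    for concept in related:
        suggestions |= retrieve(concept)
    return suggestions - profile               # do not recommend known items


# Toy inference steps backed by dictionaries (hypothetical data).
members: Dict[str, Set[str]] = {
    "Thriller": {"b1", "b2"}, "Crime": {"b3"},
    "Fiction": {"b4"}, "Cookbook": {"b5"},
}
direct_subs = {"Fiction": {"Thriller", "Crime"}}
direct_supers = {"Thriller": {"Fiction"}, "Crime": {"Fiction"}}


def realise(i: str) -> Set[str]:
    return {c for c, ms in members.items() if i in ms}


def classify(c: str) -> Tuple[Set[str], Set[str]]:
    return direct_subs.get(c, set()), direct_supers.get(c, set())


def retrieve(c: str) -> Set[str]:
    return set(members.get(c, set()))
```

In this toy setting, `search_task("Fiction", classify, retrieve)` returns the four fiction books, and `personalise_task({"b1"}, realise, classify, retrieve)` suggests `b2` and `b4`. The second function differs from the first only by the leading realisation step and the wider neighbourhood, which mirrors the structural similarity noted for Fig. 3.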
Fig. 4 Classification of Semantic Web Challenge entries in our task-types
[Table. Columns (task types): Search, Browse, Data integration, Personalisation, Service selection, Service composition, Semantic enrichment. Rows (applications): 2005: CONFOTO, DynamicView, FungalWeb, Oyster, Personal Reader, Service Execution; 2006: COHSE, Collimator, Dartgrid, Dbin, EKOSS, eMerges, Falcon-S, Foafing the Music, Geo Services, MultimediaN, Paperpuppy, Semantic Wiki; 2007: ArnetMiner, Cantabria, CHIP, DORIS, EachWiki, GroupMe, iFanzy, Int.ere.st, JeromeDL, MediaWatch, mle, Notitio.us, Potluck, Revyu, RKB Explorer, SemClip, SMART, swse, wwwatch.]
8 Concluding Remarks

The main contribution of this paper has been to make a first attempt at providing a typology of Semantic Web applications. We have defined a small number of prototypical task-types, and somewhat surprisingly, almost all entries from three years of Semantic Web Challenge competitions can be classified into these task-types. We have also shown how each of these prototypical task-types can be decomposed into a small number of primitive inference steps. This results in the following reusable components: the identified tasks, the inferences and the decomposition of tasks into inferences (the so-called reasoning patterns).

Analogously to established practice in Knowledge Engineering, these results provide a first step towards a methodology for building Semantic Web applications out of reusable components. Indeed, we regard this work as taking the first steps towards this goal. We would expect the typology of task-types to grow beyond the current set of seven to cover a larger corpus of Semantic Web applications. Also, the details of our semi-formal definitions and our decompositions may well have to be adjusted over time.

And indeed, in a publication [13] that has appeared since the original version of this paper was published, Wang and colleagues have both used and extended our concepts and formalisations. They have provided extended versions of our task types Personalisation and Recommending, and they have extended the set of primitive inferences in order to define these extended versions in terms of primitive inferences. Altogether, this has yielded an insightful high-level definition of a fully implemented semantic recommender system. This work is in full agreement with the general aim of our proposal, namely that a more structured, abstract and implementation independent analysis of Semantic Web applications “at the knowledge level” will be necessary if we are to rise above and beyond the current ad hoc practices.
References

1. Blomqvist, E., Sandkuhl, K.: Patterns in ontology engineering: classification of ontology patterns. In: Chen, C.S. (ed.) ICEIS (3), pp. 413–416 (2005)
2. Breuker, J., van de Velde, W.: CommonKADS Library for Expertise Modelling. IOS Press, Amsterdam (1994). ISBN 9051991649
3. Bylander, T., Chandrasekaran, B.: Generic tasks for knowledge-based reasoning: the “right” level of abstraction for knowledge acquisition. Int. J. Man-Mach. Stud. 26(2), 231–243 (1987)
4. Clancey, W.J.: Heuristic classification. Artif. Intell. 27(3), 289–350 (1985). doi:10.1016/0004-3702(85)90016-5
5. Heitmann, B., Oren, E.: A survey of semantic web applications. Technical report, DERI, Galway (2007)
6. Hildebrand, M., van Ossenbruggen, J.R., Hardman, L.: An analysis of search-based user interaction on the semantic web. Technical report INS-E0706, CWI, Amsterdam (2007)
7. Lefort, L., Taylor, K., Ratcliffe, D.: Towards scalable ontology engineering patterns. In: AOW ’06: 2nd Australasian Workshop on Advances in Ontologies, pp. 31–40 (2006)
8. Menzies, T.: Object-oriented patterns: lessons from expert systems. Softw. Pract. Exp. 27(12), 1457–1478 (1997). doi:10.1002/(SICI)1097-024X(199712)27:123.3.CO;2-0
9. Newell, A.: The knowledge level (presidential address). AI Mag. 2(2), 1–20, 33 (1980)
10. Newell, A.: Reflections on the knowledge level. Artif. Intell. 59(1–2), 31–38 (1993)
11. Oren, E.: Algorithms and components for application development on the semantic web. PhD thesis, Nat. Univ. of Ireland, Galway (2007)
12. Schreiber, G., Akkermans, H., Anjewierden, A., de Hoog, R., Shadbolt, N., van de Velde, W., Wielinga, B.: Knowledge Engineering and Management: The CommonKADS Methodology. MIT Press, Cambridge (2000). ISBN 0262193000
13. Wang, Y., Wang, S., Stash, N., Aroyo, L., Schreiber, G.: Enhancing content-based recommendation with the task model of classification. In: Proceedings of EKAW. LNCS. Springer, Berlin (2010)
Semantic Technology and Knowledge Management

John Davies, Paul Warren, and York Sure
Abstract Prof. Rudi Studer has been technical director of a number of significant EU collaborative projects researching the application of semantic technology to Knowledge Management. In this chapter, drawing largely on work done in these projects, we provide an overview of the knowledge management problems and opportunities faced by large organisations; and indeed also shared by some smaller organisations. We show how semantic technologies can make a significant contribution. We look at the key application areas: searching and browsing for information; sharing knowledge; supporting processes, in particular informal processes; and extracting knowledge from unstructured information. In each application area we describe some solutions, either currently available or being researched. We do this to provide examples of what is possible rather than to provide a comprehensive list. The use of ontologies as a form of knowledge representation underlies everything we talk about in the chapter. Ontologies offer expressive power; they provide flexibility, with the ability to evolve dynamically unlike typical database schemata; and they make machine reasoning possible.
J. Davies, Future Business Applications and Services, BT Innovate and Design, British Telecommunications Plc., Ipswich, UK
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_5, © Springer-Verlag Berlin Heidelberg 2011

1 Scientific and Technical Overview

1.1 Introduction

Prof. Rudi Studer has been technical director of a number of significant EU collaborations researching the application of semantic technology to Knowledge Management (KM), including the integrated projects SEKT (Semantic Knowledge Technologies, http://www.sekt-project.com) and ACTIVE (Knowledge Powered Enterprise, http://www.active-project.eu). He also participated in the seminal On-To-Knowledge (OTK) project [8], which paved the way for numerous further activities on Semantic Web and Knowledge Management. This chapter is essentially a survey of the key contributions of these projects, set in the wider context of semantic-based KM in general. It can also be seen as a survey of how semantic technologies can make a difference to managing knowledge in large organisations.

There is no doubt that the management of knowledge in organisations is a problem as well as an opportunity. The management scientist Peter Drucker has commented that “the most important contribution management needs to make in the 21st century is to . . . increase the productivity of knowledge work” [10]. He identified increased productivity of manual work as a major distinguishing feature of successful organisations in the 20th century and saw increased productivity of knowledge work as a similarly distinguishing feature of successful organisations in the 21st century. As a management scientist, Drucker’s concern was with management’s contribution to increasing knowledge worker productivity. Our related concern here is with technology’s contribution.
1.2 The Challenges for Organisational Knowledge Management

For those concerned with the management of information and knowledge in an organisation, there are a number of challenges:

1. Enabling the user to find, or be proactively presented with, the right information to perform a particular task. The information might be taken from a wide range of sources, including databases, an intranet or the Internet; or it might be an amalgam of information from various sources. Related to this is the need to organise information in a way in which it can be efficiently retrieved.
2. Sharing knowledge across the organisation. Here also, the knowledge may be in a database, intranet or the Internet (explicit knowledge), or simply in an individual’s head (tacit knowledge). The person who needs the knowledge, and the owner or creator of the knowledge, although colleagues, may even be located on different continents.
3. Helping users to navigate the processes, often collaborative processes, of which their work is composed. Central to this is sharing metadata between applications, to support a particular goal. Also important is having an understanding of the user’s current context, and what he or she is trying to achieve.
4. The integration of structured information held in corporate databases with unstructured information, e.g. held on the corporate intranet. By merging information from all corporate sources we can provide a complete picture of what the organisation knows about a particular topic.

The importance of these challenges has been highlighted by an Economist Intelligence Unit [13] report which surveyed 565 executives from various industries. The survey found that 74% of respondents said that “data gathering is a significant or very significant challenge” and 68% said the same about data-searching. In fact, 42% of the respondents could not find relevant information when needed. 58% rated the challenge of knowledge sharing and collaboration as 4 or 5 (on a scale of 1 to 5, where 5 is extremely challenging). 52% similarly rated the challenge of data integration as 4 or 5. Further, bearing out the need for information integration, 54% said that “necessary information resides in silos”. Interestingly, users were more satisfied with the quality and quantity of information available than with the ease of access and ease of use of that information.

In Sects. 1.3 to 1.6 we discuss these challenges in greater detail, and explain why systems which analyse information on the semantic level are important in solving these challenges. In Sect. 1.7 we take another look at ontologies, the knowledge representation framework which underlies our semantic approach. Then in Sect. 2 we describe some applications of semantic technologies to these challenges. In Sect. 3 we make some concluding remarks.
1.3 Finding Information, and Organising Information so that It Can be Found

1.3.1 Defects of the Conventional Search Engine

The search engine has been one of the great success stories of the World Wide Web, as it allows ordinary users to search the enormous number of available websites using a simple principle: keyword-based search. However, its use within organisations has been less successful and has created a degree of frustration. An important reason for this is well known. The PageRank algorithm, pioneered by Google, depends on the rich pattern of hyperlinks that connect documents to each other on the Web but that is rarely found on the organisational intranet.

However, even at its most successful, the conventional search engine suffers from an approach based on text-string matching and a consequent failure to interpret the semantics of a query or the semantics inherent in the documents being queried. In particular:
• the failure to identify polysemy (when a word or phrase has more than one meaning);
• a similar failure to take account of synonymy (when two or more words or phrases have the same meaning) and other forms of semantic connection between terms;
• an inability to make use of context;
• less than optimal interpretation of results.

Consider, for example, the following query:

“telecom company” Europe “John Smith” director

The user might require, for example, documents concerning a telecom company in Europe, a person called John Smith, and a board appointment. Note, however, that a document containing the following sentence would not be returned using conventional search techniques:

“At its meeting on the 10th of May, the board of London-based O2 appointed John Smith as CFO”
In order to be able to return this document, the search engine would need to be aware of the following semantic relations:
• O2 is a mobile operator, which is a kind of telecom company;
• London is located in the UK, which is a part of Europe;
• a CFO is a kind of director.

Lack of Context Many search engines fail to take into consideration aspects of the user’s context to help disambiguate their queries. User context would include information such as a person’s role, department, experience, interests, project work, and so on.

Presentation of Results The results returned from a conventional search engine are usually presented to the user as a simple ranked list. The sheer number of results returned from a basic keyword search means that results navigation can be difficult and time consuming. Generally, the user has to make a decision on whether to view the target page based upon information contained in a brief result fragment. A survey of user behaviour on BT’s intranet suggests that most users will not view beyond the 10th result in a list of retrieved documents; only 17% of searches resulted in a user viewing more than the first page of results. Essentially, we would like to move from a document-centric view to a more knowledge-centric one [34], for example, by presenting the user with a digest of information gleaned from the most relevant results found, as has been done in the Squirrel semantic search engine described later in this chapter.
1.3.2 Semantic Indexing and Retrieval

The previous section discussed the limitations of conventional textual search technology and indicated that these limitations were caused by a failure to interpret the semantics both in the query and in the textual corpus being interrogated. Techniques have been developed for the automatic creation of semantic annotations. As explained in [9], semantic indexing and retrieval can then be performed on top of the semantic annotations. Indexing can be done with respect to two semantic features: lexical concepts and named entities. In this way a number of the problems discussed above can be overcome.

Lexical concepts are introduced to overcome polysemy. Thus a word with two different meanings will be associated with two different lexical concepts. Word-sense disambiguation techniques can be used to distinguish between these meanings [32]. Similarly, knowing that two words or phrases are associated with the same lexical concept enables the system to cope with synonymy. Moreover, the use of lexical concepts also enables hyponym-matching (a hyponym is a word or phrase whose meaning is included within that of another word or phrase). Thus, referring to the example in Sect. 1.3.1, CFO is a hyponym of director. Hyponym-matching overcomes the problem that a search for director will not identify references to CFO which may be relevant.
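The idea of indexing by lexical concept rather than by text string can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon, the concept identifiers and the function names are all hypothetical, the term lookup is a naive substring test, and no word-sense disambiguation is attempted. It is not the indexing method of [9].

```python
# Hypothetical toy lexicon: surface term -> lexical concept identifier.
# Synonyms ("cfo", "chief financial officer") share one concept.
TERM_TO_CONCEPT = {
    "director": "concept:director",
    "cfo": "concept:cfo",
    "chief financial officer": "concept:cfo",
    "telecom company": "concept:telecom_company",
    "mobile operator": "concept:mobile_operator",
}

# Hypothetical hyponym links: narrower concept -> broader concept.
BROADER = {
    "concept:cfo": "concept:director",
    "concept:mobile_operator": "concept:telecom_company",
}


def ancestors(concept):
    """Return the concept followed by every broader concept above it."""
    chain = [concept]
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def index_document(text):
    """Index a document by the concepts it mentions and their broader concepts."""
    lowered = text.lower()
    concepts = set()
    for term, concept in TERM_TO_CONCEPT.items():
        if term in lowered:  # naive substring match, for illustration only
            concepts.update(ancestors(concept))
    return concepts


def matches(query_term, text):
    """True if the query term's concept is among the document's indexed concepts."""
    concept = TERM_TO_CONCEPT.get(query_term.lower())
    return concept is not None and concept in index_document(text)


doc = "The board of London-based O2 appointed John Smith as CFO"
print(matches("director", doc))                 # hyponym match via CFO
print(matches("chief financial officer", doc))  # synonym match via CFO
```

Storing the broader concepts at indexing time keeps the query step to a single set lookup; expanding the query instead would be an equally valid design.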
Fig. 1 Relating the named entities in a sentence to an ontology
Named entities are items such as proper nouns (denoting, for example, persons, organisations and locations), numbers and dates. One study found that named entities were a common query type, in particular people’s names, whilst ‘general informational queries are less prevalent’ [12]. Such named entities can be identified as instances of a pre-defined ontology. A typical ontology for such purposes would need to have information about people, geography, company structure etc. One such ontology is PROTON [19], which was developed by Ontotext Lab (http://www.ontotext.com) and used within the SEKT project as the basis for several semantic search and browse tools. In fact, PROTON also includes a world knowledgebase, i.e. a set of instances and relations which are used to pre-populate the ontology. This can then be extended through analysis of the textual corpus. Of course this approach, while highly accurate, can still lead to errors. Therefore, information in the knowledgebase is flagged to indicate whether it is pre-defined or whether it is learned from the document database. The PROTON ontology is itself extensible: any particular application can develop its domain ontology as an extension to PROTON.

Figure 1 illustrates how sentences can be analysed and the named entities related to the classes of an ontology. Packard Bell and BT have been identified as instances of companies, whilst London and UK have been identified as instances of city and country respectively. Once identified, these instances then form part of a knowledge base. Note that ‘its’ has been identified as being equivalent to BT in this particular sentence. The identification of words such as pronouns with the words or phrases which they stand for is known as anaphora resolution. Software to achieve this textual analysis is described in Sect. 2.4.
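A minimal sketch of relating named entities to ontology classes is given below. It assumes a small hand-made lookup table standing in for the knowledgebase, and the example sentence is invented rather than taken from Fig. 1. Real annotation software does far more, including anaphora resolution, which this sketch does not attempt.

```python
import re

# Hypothetical mini knowledgebase: instance label -> ontology class.
KNOWLEDGEBASE = {
    "BT": "Company",
    "Packard Bell": "Company",
    "London": "City",
    "UK": "Country",
}


def annotate(sentence, kb=KNOWLEDGEBASE):
    """Return (label, class, start offset) for each known entity, in text order."""
    found = []
    for label, cls in kb.items():
        # Whole-word match, so that "UK" does not fire inside a longer token.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, sentence):
            found.append((label, cls, match.start()))
    return sorted(found, key=lambda item: item[2])


# Invented example sentence, for illustration only.
sentence = "BT has opened an office in London, and its UK partner is Packard Bell."
for label, cls, start in annotate(sentence):
    print(f"{start:3d}  {label}  ->  {cls}")
```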
1.4 Sharing Knowledge Across the Organisation

A knowledge worker often has no idea that a document of use to him or her has been created by a colleague. He or she may even be unaware of the existence of the colleague, and they may be located geographically far apart. The document might have been created some time ago, by a colleague who has moved on to other work or left the organisation.

Consultancies such as Ernst & Young [7, 14] frequently take this subject very seriously. Typically they have a combination of part-time knowledge management enthusiasts in their operating units and full-time knowledge management specialists in a central unit. They use a platform, such as Lotus Notes, for document storage; a typical such document might be a customer proposal, which could be partially reused for other customers. In some cases users may simply enter a document directly into the repository. In others, the document is vetted for quality by one of the knowledge management team. In both cases the user will be required to describe the document using metadata compliant with a predefined taxonomy. Depending on the experience of the user and the particular document, this can take a significant amount of time and inhibit information being entered into the repository.

A similar problem applies in reverse. To retrieve information a user needs to understand the taxonomy, and of course the original metadata needs to be accurate. Information may be missed, or the complexity of the system may again deter its use. What is needed is to analyse the documents as they are entered into the system, so as to automatically create semantic metadata which can be used for document retrieval. Automatic metadata creation also provides a consistency which may not occur when metadata is manually created.
1.5 Helping with Processes

Current productivity tools offer basic support for processes, but little proactive help. Within Microsoft Outlook, for example, calendar and contact facilities provide tools for the user. However, all the intelligence needs to be supplied by the user. When the user types ‘phone John Smith’ at a given time in his diary, there is no automatic link to the contact book entry for John Smith.

In addition, what information the system does have is routinely lost. Imagine the user receives an email with attachments from John Smith as part of the customer X bid proposal process. He saves the attachments in a folder. The link between the attachments and John Smith, or customer X, is then lost. If our user wants to find all information sent by John Smith or about customer X, then there is nothing associated with the saved files to help him. When he or she is working on the customer X proposal process, there are no metadata associated with those files to indicate their relevance to customer X.

Moreover, current systems have no idea of the context in which the user is working or the process currently being followed. For example, if the user is a patent lawyer with six different patent filings under consideration, the system has no idea which one is currently the focus of his attention. Nor does it know whether the user is creating a patent, reviewing a colleague’s proposed filing, or searching for prior art. Yet such information would enable the system to proactively help the user. What is missing is metadata, shared between applications and linked to the context of the user’s work and the processes he or she performs.
1.6 Integrating Structured and Unstructured Information

1.6.1 The Need to Analyse Text

Conventional corporate information systems are built on relational database technology and typically contain only structured information. This is true whether the systems are for customer relationship management, product information, employee information, competitor information etc. In knowledge management, we would also like to capture and exploit the unstructured information which exists as text on the intranet, in memos on personal computers, in emails, slide presentations etc.; and also the semi-structured information which exists in applications such as spreadsheets where schemas (i.e. row and column headings) exist but are not properly defined. The claim has been made that over 80% of the data in an organisation is unstructured [25]. Certainly, it is commonly known that a great deal of valuable information in an organisation exists in this form.

What is needed is to extract this information and transform it into structured form to enable merging with the structured data. The problem is that structured data has defined semantics in the form of schemas. These semantics are local to the particular application, rather than being expressed using shareable ontologies, but they are semantics nevertheless. The application knows, for example, that the price field in a relational database contains the price in an agreed currency. In unstructured data it could be argued that the semantics are still there. A human can detect when a brochure describes a product price. However, the semantics are no longer defined in a machine-interpretable way. The price can be anywhere in the document and can be introduced by many different kinds of language. Interpreting these semantics is a task which until recently has been regarded as requiring human intelligence.

If we could extract structured information from unstructured data, then there are many applications which would benefit.
A complete picture could be built up, based on all the information available to the enterprise, of, for example, any particular customer, supplier, or competitor. Instead of searching separately through emails, memos, corporate intranet and databases, a sales advisor would have a complete picture of a customer, based on all those sources.

Added to the opportunity cost of not being able to use all the information potentially available to the organisation is the risk associated with the regulatory environment. Organisations which do not disclose all relevant information to regulatory authorities may be seriously penalised. Yet the organisation can only disclose information it knows it has. Information lost on corporate computers cannot be disclosed at the appropriate time, but will certainly be revealed if the organisation is subject to a detailed forensic analysis of hard drives prior to a legal hearing. As an example, Forrester [27] describe a $1.4 billion judgement against Morgan Stanley, arising from the latter’s inability to produce requested information.

All this points to a growing business need to understand the semantics of textual information, to extract such information from free text, convert it into a structured form and merge it with pre-existing structured information. The overall goal is to combine structured and unstructured information and make the combined result available to a range of applications. This is illustrated in Fig. 2, where information from a variety of unstructured sources is combined with information from databases to create information described in terms of an ontology. This can then be combined with domain-specific knowledge and business rules, and then operated on by semantic queries to provide input to client applications.

Fig. 2 Combining structured and unstructured information

The essential challenge is to extract structured information out of unstructured information. One way to do this is to create semantic metadata. HTML, the language which underlies the WWW and our corporate intranets, is based on the use of metadata. However, the metadata in HTML are used to describe the format of data, e.g. to indicate a heading or a bulleted list. The need here is to create semantic metadata, i.e. metadata which tell us something about the data.
Such metadata can exist at two levels. They can provide information about a document or a page, e.g. its author, creation or last amendment date, or topic; or they can provide information about entities in the document, e.g. the fact that a string represents a company, or a person or a product code. The metadata themselves should describe the document or entities within the document in terms of an ontology. At the document level we might have a property in the ontology, e.g. hasAuthor to describe authorship. Within the document we would use classes such as Person, Company or Country to identify specific entities.
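The two levels of metadata can be illustrated with simple subject-property-value statements. The sketch below is hypothetical throughout: only hasAuthor and the classes Person, Company and Country come from the text above, while the document identifier and the properties hasTopic, mentions and type are assumptions made for the example.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")

DOC = "doc:proposal-42"  # hypothetical document identifier

metadata = [
    # Document-level metadata: statements about the document as a whole.
    Triple(DOC, "hasAuthor", "John Smith"),
    Triple(DOC, "hasTopic", "customer proposal"),  # hasTopic is an assumed property
    # Entity-level metadata: strings in the text typed with ontology classes.
    Triple("ABC Holdings", "type", "Company"),
    Triple("John Smith", "type", "Person"),
    Triple("UK", "type", "Country"),
    Triple(DOC, "mentions", "ABC Holdings"),  # mentions is an assumed property
    Triple(DOC, "mentions", "UK"),
]


def authors_of(doc, triples):
    """Document-level lookup: who wrote this document?"""
    return [t.obj for t in triples if t.subject == doc and t.predicate == "hasAuthor"]


def documents_mentioning(cls, triples):
    """Entity-level lookup: which documents mention an instance of this class?"""
    instances = {t.subject for t in triples if t.predicate == "type" and t.obj == cls}
    return sorted(
        {t.subject for t in triples if t.predicate == "mentions" and t.obj in instances}
    )


print(authors_of(DOC, metadata))
print(documents_mentioning("Company", metadata))
```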
1.6.2 Combining the Statistical and Linguistic Approaches

The metadata could be created by the authors of the document. In general this will not happen. The authors of Word documents or emails will not pause to create metadata. We need to generate metadata automatically, or at least semi-automatically. There are two broad categories of technology which we can use for this: statistical or machine learning techniques; and information extraction techniques based on natural language processing. The former generally operate at the level of documents, by treating each document as a ‘bag of words’. They are, therefore, generally used to create metadata to describe documents. The latter are used to analyse the syntax of a text to create metadata for entities within the text, e.g. to identify entities as Persons, Companies, Countries etc. Nevertheless, this division should not be seen too starkly. For example, one of the goals of the SEKT project (http://www.sekt-project.com), a European collaborative research project in this area which ran from 2004 to 2006, was to identify the synergies which arise when these two different technologies are used closely together.

The metadata can create a link between the textual information in the documents and concepts in the ontology. Metadata can also be used to create a link between the information in the document and instances of the concepts. These instances are stored in a knowledgebase. Thus the ontology bears the same relationship to the knowledgebase as a relational database schema bears to the information in the database. In some cases the ontology and the knowledgebase will be stored together, in other cases separately. This is essentially an implementation decision.

Ontologies are particularly useful for representing knowledge from unstructured text because of their flexibility and ability to evolve. Once created, ontologies can be much more easily extended than is the case for a relational database schema.
A simple illustration of this is shown in Fig. 3. This is not to say that the ontology-based approach will replace the use of relational databases. With increased flexibility comes increased computational expense. The ideal is to combine the two approaches.

Fig. 3 Ontologies offer greater flexibility than database schema

Where the system identifies a text string as an instance of a concept in the ontology, but that instance is not represented in the knowledgebase, the instance can be added to the knowledgebase. For example, the text string ‘ABC Holdings’ may be identified as a company, but one not represented in the knowledgebase. The system can then add ‘ABC Holdings’ to the knowledgebase. Section 1.3 has already discussed how entities in text can be associated with entities in the knowledgebase; this was illustrated in Fig. 1. Research is also in progress to use natural language processing techniques to learn concepts from text, and thereby extend the ontology. However, this is a significantly harder problem.
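The step of adding a newly recognised instance to the knowledgebase can be sketched as follows. The class and method names are hypothetical, and the "pre-defined" versus "learned" flag follows the flagging described in Sect. 1.3.2; nothing here reflects how any particular system stores its instances.

```python
class Knowledgebase:
    """Toy store of instances, each flagged as pre-defined or learned."""

    def __init__(self, predefined):
        # label -> {"class": ontology class, "source": provenance flag}
        self.instances = {
            label: {"class": cls, "source": "pre-defined"}
            for label, cls in predefined.items()
        }

    def observe(self, label, cls):
        """Record an entity recognised in text; add it only if it is new.

        Returns True when the knowledgebase was extended.
        """
        if label in self.instances:
            return False
        self.instances[label] = {"class": cls, "source": "learned"}
        return True


kb = Knowledgebase({"BT": "Company", "UK": "Country"})
print(kb.observe("ABC Holdings", "Company"))  # new instance, so it is added
print(kb.observe("BT", "Company"))            # already known, nothing changes
print(kb.instances["ABC Holdings"])
```

Keeping the provenance flag on every instance means that learned entries can later be reviewed or discarded without touching the pre-populated ones.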
1.7 Another Look at Ontologies

The constant theme running through this chapter is the use of ontologies. An early, but still relevant, overview and categorisation of the ways ontologies can be used for knowledge sharing is given in [1]. Here the use of ontologies is categorised in a number of ways. Ontologies can be used in conjunction with conventional (i.e. non-intelligent) software or alternatively in conjunction with software employing AI techniques. The reference lists a number of principles which remain true: knowledge engineering needs to be minimised, as it represents an overhead; KM support needs to be integrated into everyday work procedures; and KM applications need to process information in an integrated manner. It describes a range of applications which remain important: knowledge portals for communities of practice; lessons learned archives; expert finders and skill management systems; knowledge visualisation; search, retrieval and personalisation; and information gathering and integration.

Another high-level view of ontologies, and specifically their use in achieving data connectivity, is given by Uschold and Gruninger [39]. They note that connectivity is required at three layers: physical, syntactic and semantic. Great strides have been made in achieving connectivity at the first two layers. The challenge is now the third, and ontologies have a key role here. Semantic heterogeneity is a fact of life to be overcome: “there will always be sufficiently large groups for which global agreements are infeasible”. They present a spectrum of kinds of ontologies, defined by degree of formality. At the informal end there are sets of terms, with little specification of the meaning, and also ad hoc hierarchies, such as in Yahoo. At the formal end there are, e.g. description logics. At the informal end some of these might not properly be called ontologies, e.g. by members of the knowledge representation community. The point is that they are used in similar ways as some formal ontologies.

Uschold and Gruninger compare ontologies with database schema, making the point that the mixing of types (concepts) with instances is a feature of ontologies which does not occur in database schema. In their view this is largely because of the much greater scale and performance requirements for database systems. Note that this is a computational feature, i.e. computationally database schema and database instances are treated quite separately. This is less the case in the ontological approach; indeed it can in some cases be a matter of design style whether an entity is represented as a concept or an instance. However, when we turn to implementation, the converse can be true. A database schema is embedded in the database; an ontology can exist in a separate physical implementation.

The authors of this chapter have prepared their own summary of the chief differences between the relational database and ontological knowledgebase approach. This is summarised in Fig. 4.

Information model
  Relational database: Schema. Hard to evolve. Implemented with instances in database. Computationally separate from instances.
  Ontological knowledgebase: Ontology. Flexible. Can be implemented separately from instances. Computationally concepts and instances treated similarly.
Information which can be retrieved
  Relational database: What you put in is what you get out.
  Ontological knowledgebase: Information entered into knowledgebase plus inferences from that information.

Fig. 4 Comparison of relational databases and ontological knowledge bases

Uschold and Gruninger identify four ways in which ontologies help achieve a common understanding. Three are relevant to the theme of our chapter:
• Neutral authoring.
Here an ontology exists for authoring purposes, and the results are then translated into a variety of target ontologies. Enterprise modelling is an example of this.
• Common access to information. Instead of employing translators between each pair of data sources, a neutral interchange format is employed, thereby reducing the required number of translators from O(N²) to O(N).
• Query-based search, i.e. a sophisticated indexing mechanism with the added benefit of permitting answers to be retrieved from multiple repositories.

Note that the first two of these both use neutral ontologies. However, in the case of neutral authoring the ontology can contain only those features present in all of the target systems. In the case of providing common access to information, the neutral ontology must cover all of the concepts in each of the target systems. Uschold and Gruninger also identify the use of ontologies for specification in software engineering, which is beyond the scope of our interests in this chapter.
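The saving from a neutral interchange format, mentioned under common access to information, can be made concrete with a small calculation. The sketch assumes directed translators, i.e. one translator per ordered pair of data sources in the pairwise case, and one translator into and one out of the neutral format per source otherwise. That counting convention is an assumption for illustration; the source only states the orders of growth.

```python
def pairwise_translators(n):
    """Directed translators needed when every source maps to every other source."""
    return n * (n - 1)


def neutral_format_translators(n):
    """Translators needed when each source maps to and from one neutral format."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_translators(n), neutral_format_translators(n))
```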
2 Example Applications

Building on the discussions in Sect. 1, this section describes example applications of semantic technologies, in particular addressing each of the challenges described in Sect. 1.2: searching and finding information; sharing information within organisations; helping users to navigate processes, including by taking account of the user’s context; and extraction of structured information from unstructured data. The majority of these applications were developed in projects such as On-To-Knowledge, SEKT, NEON (http://www.neon-project.org/) and ACTIVE (http://www.active-project.eu/) where Rudi Studer led the involvement of the University of Karlsruhe (now part of Karlsruhe Institute of Technology).
2.1 Semantic Search, Browse and Information Storage

In this section, we discuss some approaches to dealing with the first challenge identified in Sect. 1.2, that of users being able to find, or be automatically presented with, the information relevant to their current task. We discuss the Squirrel semantic search tool, the SEKTAgent semantic information agent, and TagFS, a semantic filing system.
2.1.1 Squirrel—An Example of Semantic Search and Browse Squirrel [11] provides combined keyword based and semantic searching. The intention is to provide a balance between the speed and ease of use of simple free text search and the power of semantic search. In addition, the ontological approach provides the user with a rich browsing experience. For its full-text indexing, Squirrel uses Lucene. PROTON is used as the ontology and knowledgebase, whilst KIM [4] is used for massive semantic annotation. The KAON2 [26] ontology management and inference engine provides an API for the management of OWL-DL and an inference engine for answering conjunctive queries expressed using the SPARQL syntax and provides the semantic backbone for the application. KAON2 also supports the Description Logic-safe subset of
Fig. 5 Meta-results page
the Semantic Web Rule Language (SWRL). This allows knowledge to be presented against concepts that goes beyond that provided by the structure of the ontology. For example, one of the attributes displayed in the document presentation is ‘Organisation’. This is not an attribute of a document in the PROTON ontology; however, affiliation is an attribute of the Author concept and has the range ‘Organisation’. As a result, a rule was introduced into the ontology to infer that the organisation responsible for a document is the affiliation of its lead author. Users are permitted to enter terms into a text box to commence their search. This initially simple approach was chosen since users are likely to be comfortable with it due to experience with traditional search engines. Squirrel then calls the Lucene index and KAON2 to identify relevant textual resources or ontological entities, respectively. Figure 5 shows an extract from the meta-result page returned. This is intended to allow users to quickly focus their search as required and to disambiguate their query if appropriate. The page presents the different types of result that have been found and how many of each type for the query ‘home health care’. Figure 6 shows a document view. The user has selected a document from the result set, and is shown a view of the document itself. This shows the meta-data and text associated with the document and also a link to the source page if appropriate— as is the case with web-pages. Semantically annotated text (e.g. recognised entities) is highlighted. ‘Mousing-over’ recognised entities provides the user with further information about the entity extracted from the ontology. Clicking on the entity itself takes the user to the entity view. Figure 7 shows an entity view for ‘Sun Microsystems’. It includes a summary generated by OntoSum [3, 4]. 
OntoSum is a Natural Language Generation (NLG) tool which takes structured data in a knowledge base (ontology and associated instances) as input and produces natural language text, tailored to the presentational context and the target reader. NLG can be used to provide automated documentation of ontologies and knowledge bases and to present structured information in a user-friendly way. Users can choose to view results as a consolidated summary (digest) of the most relevant parts of documents rather than a discrete list of results. The view allows users to read or scan the material without having to navigate to multiple results. Figure 8 shows a screenshot of a summary for a query for ‘Hurricane Katrina’. For
Fig. 6 Document view
Fig. 7 Entity view
each subdocument in the summary, the user is able to view the title and source of the parent document and the topics into which the subdocument text has been classified, or to navigate to the full text of the document. To gain an idea of how users perceive the advantages of semantic search over purely text-based search, Squirrel has been subjected to a three-stage user-centred evaluation process with users of a large Digital Library. Twenty subjects took part, and the perceived information quality (PIQ) of the search results was recorded. On a 7-point scale, the average PIQ using the existing library system was 3.99, compared with an average of 4.47 using Squirrel, a 12% increase. The evaluation also showed
Fig. 8 Consolidated results
Fig. 9 A semantic query in SEKTagent
that users rate the application positively and believe that it has attractive properties. Further details can be found in [38].
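The rule described earlier in this section, which infers the organisation responsible for a document from the affiliation of its lead author, can be illustrated with a minimal Python sketch. This is neither SWRL nor the KAON2 API; the property names (hasLeadAuthor, hasAffiliation, hasOrganisation) and the sample facts are assumptions made purely for illustration.

```python
# Toy knowledge base of (subject, property, object) triples.
# All identifiers below are hypothetical, chosen only for this sketch.
facts = {
    ("doc1", "hasLeadAuthor", "alice"),
    ("alice", "hasAffiliation", "ExampleCorp"),
    ("doc2", "hasLeadAuthor", "bob"),  # bob has no recorded affiliation
}


def infer_organisation(triples):
    """Apply the rule
        hasLeadAuthor(d, a) AND hasAffiliation(a, o) => hasOrganisation(d, o)
    and return only the newly inferred triples."""
    inferred = set()
    for doc, p1, author in triples:
        if p1 != "hasLeadAuthor":
            continue
        for subj, p2, org in triples:
            if p2 == "hasAffiliation" and subj == author:
                inferred.add((doc, "hasOrganisation", org))
    return inferred


new_facts = infer_organisation(facts)
print(sorted(new_facts))
```

The point of the sketch is that 'Organisation' never has to be stored against the document itself; it is derived on demand from facts that are already in the knowledgebase.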
2.1.2 SEKTagent—A Different View on Semantic Search Another approach to enabling semantic queries is exemplified by SEKTagent [4]. Figure 9 illustrates the basic approach by showing the following semantic query: ‘ANY (Person) hasPosition analyst withinOrganization ANY (Organization) locatedIn US’ The query is looking for someone who is an analyst working in any U.S. organisation. This is quite different from a text query. Everything is stated at a conceptual
Fig. 10 Extract from one of the results of a semantic query—showing entities in the knowledgebase highlighted
level. The most concrete entity in the query is ‘US’. However, even this is not treated as a text string. The query may find a document referring to an analyst working in some city or state of the U.S., but not containing any reference itself to the U.S. The system makes use of the geographical knowledge in the knowledgebase to determine that this is a relevant document. Figure 10 shows an extract from one of the retrieved documents. Entities in the knowledgebase are highlighted. In this case, we have three such entities: Gartner; analyst; Kimberley Harris-Ferrante. The first of these is a company, the second a position in an organisation, and the third a person. In fact, Kimberley Harris-Ferrante is the analyst, working in a U.S. organisation, who satisfies this query. Moving the mouse over any of these entities displays more information about them. In the case of Gartner, for example, it provides the key facts about the company. Rather than just displaying raw information, natural language generation technology is applied to the relevant information in the knowledgebase to create text which can be easily read. The example illustrates another important feature which differentiates the ontology-based approach from that of relational databases. In a database, the only information which can be retrieved is that which is explicitly input into the database. An ontology-based system can make use of a reasoner to perform inferencing over the ontology and knowledgebase. In our example, the request was for someone performing a specific role in an organisation in the U.S. The information in the knowledgebase could well be that the organisation is located in some part of the U.S., e.g. a city or state. However, the knowledgebase associated with PROTON also has geographical information including states and major cities in the U.S. Armed with this information, the system is able to make the necessary inferences.
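The geographical inference described above can be sketched in a few lines of Python. This is only an illustration of the idea of following 'locatedIn' links transitively; it is not how SEKTagent, KIM or PROTON are implemented, and every place, organisation and person name in the toy data is a placeholder.

```python
# Toy knowledge base. The facts are illustrative assumptions,
# not data taken from PROTON or KIM.
located_in = {
    "ExampleCity": "ExampleState",
    "ExampleState": "US",
    "OtherCity": "OtherCountry",
}
organisations = {"OrgA": "ExampleCity", "OrgB": "OtherCity"}
positions = [  # (person, position, organisation)
    ("PersonA", "analyst", "OrgA"),
    ("PersonB", "analyst", "OrgB"),
    ("PersonC", "director", "OrgA"),
]


def is_within(place, region):
    """Follow locatedIn links upwards from place until region is reached."""
    seen = set()
    while place is not None and place not in seen:
        if place == region:
            return True
        seen.add(place)
        place = located_in.get(place)
    return False


def find_people(position, region):
    """Counterpart of: ANY (Person) hasPosition <position>
    withinOrganization ANY (Organization) locatedIn <region>."""
    return [
        person
        for person, pos, org in positions
        if pos == position and is_within(organisations[org], region)
    ]


result = find_people("analyst", "US")
print(result)
```

Note that no fact states directly that OrgA is in the U.S.; the match is found only because the city and state facts are chained together, which is the behaviour a plain text search or a simple database lookup would miss.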
2.1.3 Semantic Filing—TagFS and SemFS Section 1.3 discussed the difficulty which many people have in finding information which they themselves have stored, often on their own computers. One reason for this is that there is often more than one location where a file can logically be stored; yet users are in general restricted to storing information in a single location. A partial solution to this is the use of tags. However, this loses the advantage of being able to travel through the tree structure of a hierarchical set of folders. TagFS [2] merges the two approaches to obtain the advantages of both by using the tags to create a folder structure which is dynamic rather than fixed. In TagFS, the organisation of the resource is divorced from its location. The file is simply tagged.
To take the example from the reference, in a conventional filing system a user saving music files would first establish a directory structure, e.g. year/artist/album. This would be quite distinct from a structure artist/album/year. In TagFS these three attributes, and any others which are appropriate, are merely used to define tags. To find a file, it does not matter in which order you traverse the ‘directory’; the “directory path correspondingly denotes a conjunctive tag query which results in a set of files that fulfil all tag predicates”. TagFS is implemented using the SemFS architecture. SemFS provides a mapping from traditional file system interfaces to the annotation of information objects using RDF. Rather than being interpreted as static storage hierarchies, as in a conventional file system, directory structures represent dynamic views on information objects. In fact TagFS makes relatively simple use of SemFS, in that the latter offers an arbitrary number of different views, whilst TagFS simply employs one called ‘hasTag’. The use of RDF enables integration with other semantic desktop applications, as described in Sect. 2.3.
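The idea that a directory path denotes a conjunctive tag query can be made concrete with a small Python sketch. It does not use SemFS or RDF; the file names, tags and the resolve function are hypothetical and exist only to show why the order of path segments does not matter.

```python
# Hypothetical tagged files; names and tags are invented for this sketch.
files = {
    "track01.mp3": {"2005", "ArtistA", "AlbumX"},
    "track02.mp3": {"2005", "ArtistB", "AlbumY"},
    "track03.mp3": {"2006", "ArtistA", "AlbumZ"},
}


def resolve(path):
    """Interpret a directory path as a conjunction of tags and return the
    sorted names of the files that carry every tag in the path."""
    wanted = {part for part in path.split("/") if part}
    return sorted(name for name, tags in files.items() if wanted <= tags)


print(resolve("/2005/ArtistA"))
print(resolve("/ArtistA/2005"))
print(resolve("/ArtistA"))
```

Because the path is reduced to a set of tags before matching, year/artist and artist/year select exactly the same files, and a shorter path simply yields a larger result set.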
2.2 Semantic Information Sharing Section 1.2 identified the second challenge for organisational knowledge management as the importance, particularly acute in large organisations, of being able to share information amongst colleagues. Here we look at some approaches to doing this.
2.2.1 Using Ontologies An obvious basis for describing, and hence sharing, information is to use an ontology. A simpler approach—but one which is less expressive—is to instead use a taxonomy. From the user’s viewpoint the ontological approach is more time-consuming than the taxonomic one. In general, to describe information in terms of an ontology is richer but more complex than describing information in terms of a taxonomy. The kind of semantic annotation techniques described in Sect. 1.6.2 can be used to automate, or at least partially automate, this process. The user wishing to retrieve information is then able to use the semantic search and browse techniques described in Sect. 2.1. Warren et al. [42] describe an implementation of this approach in a digital library. Here annotation is at two levels. Firstly, sets of topics are used to describe documents. Topics can have sub and super-topics, to create a lattice structure. As a design decision, for reasons of computational tractability, topics are implemented as instances, not concepts. As a starting point, schemas used by proprietary information providers (e.g. Inspec: http://www.theiet.org/publishing/inspec/) provided the topics. Machine learning was used to refine these topics and to automatically associate documents with topics. Secondly, using natural language techniques, named entities
within documents are identified and associated with concepts. These concepts are drawn from, e.g., geography and business and include country, city, company, CEO, etc. The association of instances to concepts is illustrated by colour-coding, using the KIM system described in [3]. The creation and management of ontologies is required for many applications of semantic technology and is a significant research topic in itself. An overview of available methodologies is given in [37] which also describes a methodology, DILIGENT, developed in the SEKT project, for creating and maintaining distributed ontologies. In common with other such methodologies, the approach employs ordinary users, domain experts, and experts in ontology design. The approach is distributed in that different users may have slightly different versions of the ontology. Users refine their own version of a shared ontology on the basis of their experience, and these refinements are then fed back, as appropriate, to the shared ontology.
2.2.2 Tagging and Folksonomies In parallel to the use of ontologies in enterprises, the hobbyist and consumer world has adopted the use of informal tagging to describe all kinds of information and media objects. Such tags are said to constitute ‘folksonomies’. Like wikis, folksonomies are part of the phenomenon of Web2.0, in which consumers of information are also producers. Such folksonomies are commonly represented by ‘tag clouds’, in which character size, font or colour are used to represent how much the tag has been used. Flickr (http://www.flickr.com) is an example of a web-site for sharing photos which uses this approach. Delicious (http://delicious.com) is another example where tags are associated with bookmarked pages. The website displays not just the most popular bookmarks, but also the most popular tags. However, folksonomies lack descriptive power. In general they possess no structure, usually not even the hierarchical structure present in a taxonomy. Moreover, the problems of synonymy and polysemy occur here; the same tag may be used with different meanings, or different tags may be used with the same meaning. Compared with ontologies, folksonomies are even more limited. They do not permit automated reasoning, nor the kind of search and browsing techniques described earlier. In general, the user is free either to use a pre-existing tag or to use a new tag. The former has the practical value of encouraging convergence on a reasonable number of tags. However, it may lead to the emergence of dominant tags, representing particular views, and discourage the creation of new tags which may better represent a concept.
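As a small illustration of the tag cloud idea mentioned above, the following Python sketch maps tag usage counts to font sizes by linear scaling. The size range, the function name tag_cloud and the sample tags are assumptions for illustration; sites such as Flickr or Delicious may well scale differently.

```python
from collections import Counter


def tag_cloud(tags, min_size=10, max_size=30):
    """Map each tag to a font size that grows linearly with its usage count."""
    counts = Counter(tags)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    sizes = {}
    for tag, n in counts.items():
        if hi == lo:
            # All tags equally popular: nothing to distinguish.
            sizes[tag] = min_size
        else:
            sizes[tag] = min_size + (n - lo) * (max_size - min_size) // (hi - lo)
    return sizes


usage = ["holiday", "holiday", "holiday", "beach", "beach", "sunset"]
cloud = tag_cloud(usage)
print(cloud)
```

The sketch also makes the limitation discussed above visible: the only information captured is how often a string was used, so synonymous tags are counted separately and a polysemous tag is counted once.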
2.2.3 The Semantic MediaWiki The Semantic MediaWiki (http://semantic-mediawiki.org/wiki/Semantic_MediaWiki), developed by Prof. Studer’s group in a number of collaborative projects (e.g. SEKT and ACTIVE), represents a different approach to combining the power
of formal semantics with the ease-of-use associated with Web2.0 [20, 40]. It builds on the success of wikis in enabling collaboration. Specifically, Semantic MediaWiki is a free extension of MediaWiki, the software used by Wikipedia. Whereas conventional wikis enable users to collaborate to create web-pages, the Semantic MediaWiki enables collaboration to create a knowledgebase to complement the web-pages. Conventional wikis have links between pages; a page describing London might contain a sentence ‘London is the capital of the U.K.’ and a link to a page describing the U.K. Syntactically this is done by writing [[U.K.]]. In the Semantic MediaWiki the user can explicitly associate a relation with a link, so that the link between the London page and the U.K. page can have the associated relation ‘is capital of’. This is done by extending the normal wiki syntax and writing [[is capital of::U.K.]]. This is entirely informal, in the sense that the user is free to choose any relation he or she likes, represented by any phrase the user likes. Of course, there is value in people using the same terms, and they can be encouraged to re-use existing relations; it is also possible to define equivalences between different terminologies (e.g. ‘knows about’ can be equated to ‘is expert in’). It is possible to use attributes to associate information with a page, other than that which can be represented by relations. For example, the U.K. page could have metadata associated with it describing its population. Syntactically this can be achieved by writing [[population:=61,000,000]]. Once a knowledgebase has been created using a Semantic MediaWiki, it can then be queried. This can be done using a syntax very similar to the annotation syntax. This is intended for use by the more computer-literate. However, the syntax can be used to create results pages (e.g. a table of the populations of various countries) which can be viewed by everyone.
Alternatively, page authors can insert a query enclosed in the tag, so that the displayed page shows not the query but the result of the query. This still leaves a requirement for non-technical users of the wiki to create general queries in a relatively easy-to-use way, i.e. without using a formal syntax. In response to this requirement, recent work has investigated how textual queries can be translated into query graphs composed of concepts, relations and instances in the ontology [16]. In the simplified example quoted in the reference, a user needs to know the deadline for submission to all (presumably forthcoming) conferences in Greece. He or she types the query string “conference Greece deadline”. The resultant query graph is shown in Fig. 11. This is, in effect, a representation of a SPARQL query. The user is then provided with an interface for amending the query graph. He or she might, for example, wish to change ‘abstract deadline’ to ‘submission deadline’.
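To make the relation annotations described in this section more tangible, here is a minimal Python sketch that extracts [[relation::target]] annotations from page text and answers a simple lookup over the resulting triples. It does not use any Semantic MediaWiki code and does not reproduce its query syntax; the regular expression, the function names and the sample pages are assumptions for illustration only.

```python
import re

# Matches [[relation::target]]. A plain link such as [[U.K.]] contains
# no "::" and is therefore not treated as a relation annotation.
RELATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_relations(page_title, wikitext):
    """Return (page, relation, target) triples found in the page text."""
    return [
        (page_title, rel.strip(), target.strip())
        for rel, target in RELATION.findall(wikitext)
    ]


def pages_with(triples, relation, target):
    """Return the pages that stand in the given relation to the target."""
    return [s for s, r, t in triples if r == relation and t == target]


# Hypothetical wiki pages.
pages = {
    "London": "London is the capital of the [[is capital of::U.K.]].",
    "Paris": "Paris is the capital of [[is capital of::France]].",
    "Thames": "The Thames flows through London in the [[U.K.]].",
}

triples = []
for title, text in pages.items():
    triples.extend(extract_relations(title, text))

print(pages_with(triples, "is capital of", "U.K."))
```

The sketch treats the relation name as free text, which mirrors the informality described above: two authors who write 'is capital of' and 'capital of' would produce different relations unless an equivalence between them is declared.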
2.2.4 Ontology Editors Some approaches, as described in [17, 18, 23], draw on the tagging behaviour of a user or group of users in order to create or enhance a taxonomy or ontology. The objective is to create a synergy between the formal and informal approaches to
Fig. 11 Query graph derived from ‘conference Greece deadline’
knowledge representation. Another way to achieve the same goal is to provide users with an easy-to-use ontology editor, restricted to creating and editing ontologies. OntoEdit [35] was one of the first ontology editors to provide a wide range of modelling features, including support for different modelling languages and features for ontology evaluation. Its architecture was driven by a flexible plug-in concept to allow for extensibility. The maturity of semantic web tools and applications has reached the plateau of productivity [36], and OntoEdit’s successor OntoStudio is nowadays available from the company ontoprise (http://www.ontoprise.de/). A lightweight ontology editor has been created for the Semantic MediaWiki [21] and is available from SourceForge (http://sourceforge.net/projects/smwontoeditor/). By lightweight we mean here ontologies with relatively limited features, but nevertheless powerful enough for generic knowledge management applications. The system supports both the import and export of OWL ontologies, and also the import of folksonomies. The latter feature allows a folksonomy dataset to be mapped to an ontology representation. Imported tags are compared with WordNet and Wikipedia. Tags are clustered, mapped to the SKOS knowledge-organisation ontology [44] and then mapped and inserted according to the SMW ontology. Additionally, knowledge repair functionalities are provided that assist users with the discovery and mitigation of redundancies and inconsistencies within the knowledge base.
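The folksonomy import step (cluster the tags, then map each cluster to SKOS) can be sketched as follows. This is not the SMW ontology editor's algorithm: the comparison with WordNet and Wikipedia is replaced here by a crude string normalisation, and the base IRI and function names are hypothetical. Only the SKOS terms themselves (skos:Concept, skos:prefLabel, skos:altLabel) are taken from the SKOS vocabulary.

```python
from collections import defaultdict


def normalise(tag):
    """Crude stand-in for the lexical comparison step: lower-case the tag and
    drop separators so that spelling variants fall into one cluster."""
    return "".join(ch for ch in tag.lower() if ch.isalnum())


def cluster_tags(tags):
    """Group tags by their normalised form, keeping first-seen order."""
    clusters = defaultdict(list)
    for tag in tags:
        key = normalise(tag)
        if tag not in clusters[key]:
            clusters[key].append(tag)
    return dict(clusters)


def to_skos_turtle(clusters, base="http://example.org/tags/"):
    """Emit one skos:Concept per cluster: the first label becomes the
    prefLabel, the remaining variants become altLabels.
    Assumes labels contain no double quotes."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for key in sorted(clusters):
        labels = clusters[key]
        lines.append(f"<{base}{key}> a skos:Concept ;")
        end = " ;" if labels[1:] else " ."
        lines.append(f'    skos:prefLabel "{labels[0]}"@en{end}')
        for i, alt in enumerate(labels[1:]):
            end = " ." if i == len(labels) - 2 else " ;"
            lines.append(f'    skos:altLabel "{alt}"@en{end}')
    return "\n".join(lines)


clusters = cluster_tags(["Semantic Web", "semantic-web", "ontology", "Semantic Web"])
turtle = to_skos_turtle(clusters)
print(turtle)
```

A real import would need a much better notion of similarity than string normalisation, which is presumably why external lexical resources are consulted; the sketch only shows where such a step would sit in the flow.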
2.3 The Semantic Desktop: Supporting the User Throughout His Work The third KM challenge, offering context- and process-aware support for knowledge workers, is the subject of this section.
2.3.1 Sharing Information and Metadata Across Applications and Desktops Section 1.5 noted the need for metadata, shared between applications and linked to the context of the user’s work and the processes he or she performs. In Europe, during 2006 to 2008, the Nepomuk project (http://nepomuk.semanticdesktop.org) was a major focus for work on the semantic desktop [15]. The goal of Nepomuk was to link data, and metadata, across applications and
across desktops, using shared conceptualisations expressed in RDF. Specifically, the project set out to provide “a standardised description of a Semantic Desktop architecture, independent of any particular operating system or programming language”. A reference implementation of this architecture has been developed, known as Gnowsis (http://www.gnowsis.org/). One of the products of Nepomuk was the SPONGE (Semantic Personal Ontology-based Gadget) software tool [30]. The tool “supports users finding, retrieving and annotating desktop resources . . . plus seamless access to internet information”. Some information and interaction is available via a small gadget, taking up limited space on the user’s screen. More information is available via the user’s browser. The reference claims that future work will extend the functionality with collaborative features. These include the ability to access remote desktops in a P2P topology and workspaces which will facilitate the sharing of resources.
2.3.2 Understanding User Context One of the early goals of the semantic desktop was to understand how the user’s information resources divide into a number of contexts, and to detect when a user switches between contexts [33]. This would enable information to be presented to the user, taking account of his or her current context. A number of current projects are investigating this theme. The APOSDLE project (http://www.aposdle.tugraz.at/) is aimed specifically at informal eLearning, i.e. at providing the user with small chunks of learning material just when required [22, 28]. This requires understanding the context of the user’s current work. For example, in one envisaged scenario the user’s actions are analysed to determine that, e.g., he or she is in the starting phase of a project. The user is then provided with information and guidance relevant to project start activity. The project is developing a number of widgets to enable user interaction.
These include a context selector; a widget which displays resources relevant to the current context; a global search widget; and a ‘main’ widget which presents the current selected or detected context and possible learning goals. There is also a ‘cooperation wizard’ to guide users through cooperation processes. APOSDLE is ontology-based. The user creates three types of model: a domain model; a task model describing the tasks which need to be executed; and a learning goal model. Modelling tools are provided, including a semantic wiki and plug-ins for the ontology editor Protégé. The user can also annotate parts of documents using the domain model. A parallel but separate activity, involving some of the same researchers as in APOSDLE, is also developing a system for task detection [31]. The system is known as UICO, loosely an acronym from ‘an ontology-based User Interaction COntext model for automatic task detection on the computer desktop’. Another related project is ACTIVE [41], of which Prof. Studer is Technical Director. ACTIVE has three main research themes:
• Information delivery guided by user context; this entails the system being able to detect a user’s current context.
• The creation of informal processes by users, and the learning of these processes through observation of the user’s interaction with his or her computer. By ‘informal’ processes are meant processes designed by individuals to achieve their work-related goals, rather than the formal processes designed on behalf of the organisation.
• Knowledge sharing through the synergy of an informal (Web2.0) and an ontology-based approach.
ACTIVE sees context and process as often orthogonal. For example, two of the case studies in the project (e.g. see [43]) are concerned in part with customer-facing people who spend a significant amount of time writing customer proposals. For these users context will often, but not necessarily, equate to customer. The process, on the other hand, is that of writing a customer proposal, which can be enacted in a number of contexts (i.e. for different customers). As noted above, ACTIVE is seeking to identify both the user’s context and his or her current process. Events as recognised at the machine level need to be combined through various stages to create an understanding of the processes and contexts at user level. ACTIVE aims to impose a minimum of overhead on the user. The user is able to specify his or her set of contexts and to associate information objects with contexts. However, the project is also researching both how to automatically associate information objects with particular contexts and how to learn the contexts themselves. Contexts can be shared, i.e. a group of users can share the same context; this encourages the sharing of information. Processes can also be shared. This encourages process re-use and also process improvement, as colleagues are able to review and improve each other’s processes. The third theme of ACTIVE is knowledge-sharing. This includes continued development of the Semantic MediaWiki and the lightweight ontology editor discussed in Sect. 2.2.
The goal here is to make use of ontologies in knowledge management, for example in order to exploit reasoning, but in a way which is sufficiently user-friendly for casual, non-specialist users.
2.4 Extracting and Exploiting Semantics from Unstructured Information In challenge 4 for organisational KM and, in more detail, in Sect. 1.6 we identified the need to analyse text so as to create structured knowledge and merge it with existing structured knowledge in, e.g., relational databases. We discussed the two approaches to creating metadata: one based on statistics and machine learning, and one based on an analysis of language syntax and grammar, known as Natural Language Processing (NLP). The term text analytics is used to describe both approaches. In this section we briefly discuss some tools to help achieve this. The statistical and machine-learning approach is well represented by the TextGarden suite of software tools (http://kt.ijs.si/software/TextGarden/; Mladenic [24]), developed within the Jozef Stefan Institute in Ljubljana, Slovenia, and used within
the SEKT project mentioned earlier. Text mining techniques are also provided, for example, as part of the open source data mining software RapidMiner, which is available on SourceForge and supported by Rapid-I GmbH (http://rapid-i.com). The NLP approach is represented by GATE, also used in the SEKT project (http://gate.ac.uk/). An early introduction to GATE is given in [6]; a slightly later, more comprehensive overview is given in [5]. GATE provides an environment for creating NLP applications. It combines three aspects: it is an architecture, a framework and a development environment for language engineering. GATE, developed within the University of Sheffield in the U.K., is open and includes a set of resources which others can use and extend. The architecture separates low-level tasks (e.g. data storage, data visualisation and location and loading of components) from data structures and algorithms. The framework provides a reusable design plus software building blocks. The development environment provides tools and a GUI for language engineering. It also provides an interface for text annotation, in order to create training corpora for machine learning algorithms. By an analysis of grammatical structures, such software can, for example, perform named entity recognition and deduce, with reasonable accuracy, to which nouns particular pronouns refer. Such applications are the basis for the semantic search techniques discussed in Sect. 2.1 and for the information extraction from text discussed in this section. A comprehensive specification for the analysis of unstructured text is provided by the UIMA (Unstructured Information Management Architecture) specification [29] being developed by the standards body OASIS (http://www.oasis-open.org). UIMA was originally developed by IBM. It is now an Open Source project at the Apache Software Foundation; see http://incubator.apache.org/uima/.
The principle of UIMA is that applications are decomposed into components. The UIMA framework defines the interfaces between these components and manages the components and the data flows between them. UIMA is different in scope from GATE. The latter is an extensible development environment for language engineering. UIMA defines “platform-independent data representations and interfaces for text and multi-modal analytics” (multi-modal here refers to a combination of text, audio, video etc). As noted in the reference above: “The principal objective of the UIMA specification is to support interoperability among analytics”. This is divided into four design goals:
• data representation: supporting the common representation of artefacts and metadata;
• data modelling and interchange: supporting the platform-independent interchange of artefacts and metadata;
• discovery, reuse and composition of independently developed analytics tools;
• service level interoperability: supporting the interoperability of independently developed analytics based on a common service description and associated SOAP bindings.
In summary, the goal of UIMA is to offer an extensible open source framework for the analysis of unstructured information.
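The component principle described above, in which independently written analytics communicate only through a common representation of the artefact and its metadata, can be sketched in Python. This is not the UIMA API or its data model; the classes Document and Annotation, the two toy components and the run_pipeline function are hypothetical names used purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Candidate"
    begin: int  # start offset in the text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    """Common representation shared by all components: the artefact (text)
    plus the metadata (annotations) added to it so far."""
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, ann):
        return self.text[ann.begin:ann.end]


def tokenizer(doc):
    """Add one 'Token' annotation per whitespace-separated word."""
    pos = 0
    for word in doc.text.split():
        begin = doc.text.index(word, pos)
        end = begin + len(word)
        doc.annotations.append(Annotation("Token", begin, end))
        pos = end


def capitalised_word_tagger(doc):
    """Relies only on 'Token' annotations, not on the tokenizer itself:
    marks capitalised tokens as naive entity candidates."""
    for ann in list(doc.annotations):
        if ann.kind == "Token" and doc.covered(ann)[:1].isupper():
            doc.annotations.append(Annotation("Candidate", ann.begin, ann.end))


def run_pipeline(doc, components):
    """The 'framework': runs the components in order over the shared document."""
    for component in components:
        component(doc)
    return doc


doc = run_pipeline(
    Document("GATE was developed in Sheffield"),
    [tokenizer, capitalised_word_tagger],
)
candidates = [doc.covered(a) for a in doc.annotations if a.kind == "Candidate"]
print(candidates)
```

Because the second component depends only on the annotation types it reads, the tokenizer could be swapped for a different implementation without changing it, which is the kind of interoperability the design goals above aim at.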
3 Concluding Remarks In this chapter we have provided an overview of the recent key developments in the area of semantic knowledge management, illustrated in part by projects in which Prof. Rudi Studer has played a leading role. Underlying all these developments is the use of ontologies to represent knowledge. Ontologies provide the flexibility and richness of expression required for a wide range of knowledge management applications. Professor Studer and his team, at what is now the Karlsruhe Institute of Technology (KIT), have for many years been among the world-leading researchers in developing a theoretical and practical understanding of the application of ontologies to knowledge management, contributing significantly to the development of semantic knowledge management.
Acknowledgements In closing, the authors would like to thank Rudi Studer for many years of fruitful cooperation.
References
1. Abecker, A., van Elst, L.: Ontologies for knowledge management. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies, pp. 435–454. Springer, Berlin (2004). Chap. 22
2. Bloehdorn, S., Görlitz, O., Schenk, S., Völkel, M.: TagFS: tag semantics for hierarchical file systems. In: Proceedings of I-KNOW 06, Graz, Austria, September 6–8th (2006)
3. Bontcheva, K., Cunningham, H., Kiryakov, A., Tablan, V.: Semantic annotation and human language technology. In: Davies, J., Studer, R., Warren, P. (eds.) Semantic Web Technologies: Trends and Research in Ontology-Based Systems, pp. 29–50 (2006)
4. Bontcheva, K., Davies, J., Duke, A., Glover, T., Kings, N., Thurlow, I.: Semantic information access. In: Davies, J., Studer, R., Warren, P. (eds.) Semantic Web Technologies: Trends and Research in Ontology-Based Systems, pp. 139–169 (2006)
5. Bontcheva, K., Tablan, V., Maynard, D., Cunningham, H.: Evolving GATE to meet new challenges in language engineering. Natural Language Engineering 10(3–4), 349–373 (2004)
6. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: an architecture for development of robust HLT. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, pp. 168–175 (2002)
7. Davenport, T.H.: Knowledge Management Case Study; Knowledge Management at Ernst & Young. Information Technology Management White Paper. http://www.itmweb.com/essay537.htm (1997)
8. Davies, J., Fensel, D., van Harmelen, F.: Towards the Semantic Web: Ontology-Driven Knowledge Management. Wiley, Chichester (2003)
9. Davies, J., Kiryakov, A., Duke, A.: Semantic search. In: Goker, A., Davies, J. (eds.) Information Retrieval: Searching in the 21st Century. Wiley, London (2009)
10. Drucker, P.: Knowledge-worker productivity: the biggest challenge. California Management Review 41, 79–94 (1999)
11. Duke, A., Heizmann, J.: Semantically enhanced search and browse. In: Davies, J., Grobelnik, M., Mladenic, D. (eds.) Semantic Knowledge Management, pp. 85–102. Springer, Berlin (2009)
12. Dumais, S., Cutrell, E., Cadiz, J., Jancke, G., Sarin, R., Robbins, D.: Stuff I’ve seen: a system for personal information retrieval and re-use. In: Proceedings of SIGIR’03, Toronto. ACM Press, New York (2003)
13. Economist Intelligence Unit: Enterprise knowledge workers: understanding risks and opportunities (2007)
14. Ezingeard, J., Leigh, S., Chandler-Wilde, R.: Knowledge management at Ernst & Young UK: getting value through knowledge flows. In: Proceedings of the Twenty First International Conference on Information Systems, Brisbane, Queensland, Australia, pp. 807–822 (2000)
15. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mesnage, C., Jazayeri, M., Reif, G., Gudjonsdottir, R.: The NEPOMUK project: on the way to the social semantic desktop. In: Proceedings of I-Semantics, pp. 201–211 (2007)
16. Haase, P., Herzig, D., Musen, M., Tran, T.: Semantic Wiki search. In: Proceedings of the 6th European Semantic Web Conference, Heraklion, Greece, pp. 445–460. Springer, Berlin (2009)
17. Hayman, S.: Folksonomies and tagging: new developments in social bookmarking. In: Proceedings of the Ark Group Conference: Developing and Improving Classification Schemes (2007)
18. Heymann, P., Garcia-Molina, H.: Collaborative creation of communal hierarchical taxonomies in social tagging systems. Technical Report 2006-10, Stanford University. http://ilpubs.stanford.edu:8090/775/ (2006)
19. Kiryakov, A.: Ontologies for knowledge management. In: Davies, J., Studer, R., Warren, P. (eds.) Semantic Web Technologies: Trends and Research in Ontology-Based Systems. Wiley, New York (2006)
20. Krötzsch, M., Vrandecic, D., Völkel, M.: Semantic MediaWiki. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L. (eds.) Proceedings of the 5th International Semantic Web Conference (ISWC-06). Springer, Berlin (2006)
21. Krötzsch, M., Bürger, T., Luger, L., Vrandecic, D., Wölger, S.: ACTIVE deliverable D1.3.2, Collaborative Articulation of Enterprise Knowledge. http://www.active-project.eu/fileadmin/public_documents/D1.3.2_collaboration_articulation_of_enterprise_knowledge.pdf (2010)
22. Lindstaedt, S., Mayer, H.: A storyboard of the APOSDLE vision. Poster submitted to the First European Conference on Technology Enhanced Learning (EC-TEL 2006), October 01–04, 2006, Crete, Greece (2006)
23. Millen, D., Feinberg, J., Kerr, B.: Social Bookmarking in the Enterprise, pp. 28–35. ACM Queue. http://researchweb.watson.ibm.com/jam/601/p28-millen.pdf (2005)
24. Mladenic, D.: Text mining in action! In: From Data and Information Analysis to Knowledge Engineering. Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V. University of Magdeburg (2005)
25. Moore, C. (vice president and research director, Forrester Research): Information Indepth. Oracle. http://www.oracle.com/newsletters/information-insight/content-management/feb-07/index.html (February 2007)
26. Motik, B., Studer, R.: KAON2: a scalable reasoning tool for the semantic web. In: Proceedings of the 2nd European Semantic Web Conference (ESWC’05), Heraklion, Greece (2005)
27. Murphy, B., Markham, R.: eDiscovery Bursts Onto The Scene. Forrester (2006)
28. Musielak, M., Hambach, S., Christi, C.: APOSDLE contextualised cooperation. In: ACM SIGCHI: ACM Conference on Computer Supported Cooperative Work 2008. Electronic Proceedings: CSCW [CD-ROM]. ACM Press, New York (2008)
29. OASIS: Unstructured Information Management Architecture (UIMA) Version 1.0. Working Draft 05 (29 May 2008)
30. Papailiou, N., Christidis, C., Apostolou, D., Mentzas, G., Gudjonsdottir, R.: Personal and group knowledge management with the social semantic desktop. In: Cunningham, P., Cunningham, M. (eds.) Collaboration and the Knowledge Economy: Issues, Applications and Case Studies. IOS Press, Amsterdam (2008). ISBN 978-1-58603-924-0
31. Rath, A., Devaurs, D., Lindstaedt, S.: UICO: an ontology-based user interaction context model for automatic task detection on the computer desktop. In: Proceedings of the 1st Workshop on Context, Information and Ontologies. ACM International Conference Proceedings Series (2009)
32. Russell-Rose, T., Stevenson, M.: The role of natural language processing in information retrieval. In: Goker, A., Davies, J. (eds.) Information Retrieval: Searching in the 21st Century. Wiley, London (2009)
33. Sauermann, L., Bernardi, A., Dengel, A.: Overview and outlook on the semantic desktop. In: Proceedings of the 1st Workshop on the Semantic Desktop, ISWC 2005 (2005). http://CEUR-WS.org/Vol-175/ 34. Staab, S., Schnurr, H.-P., Studer, R., Sure, Y.: Knowledge processes and ontologies. IEEE Intelligent Systems 16(1), 26–34 (2001) 35. Sure, Y., Angele, J., Staab, S.: Multifaceted inferencing for ontology engineering. In: Journal of Data Semantics, vol. 1. Lecture Notes in Computer Science, vol. 2800, pp. 128–152 (2003) 36. Sure, Y., Gomez-Perez, A., Daelemans, W., Reinberger, M.-L., Guarino, N., Noy, N.: Why evaluate ontology technologies? Because it works! IEEE Intelligent Systems 19(4), 74–81 (2004) 37. Sure, Y., Tempich, C., Vrandecic, D.: Ontology engineering methodologies. Semantic web technologies—trends and research. In: Davies, J., Studer, R., Warren, P. (eds.) Ontology-Based Systems (2006) 38. Thurlow, I., Warren, P.: Deploying and evaluating semantic technologies in a digital library. In: Davies, J., Grobelnik, M., Mladenic, D. (eds.) Semantic Knowledge Management, pp. 181– 198. Springer, Berlin (2009) 39. Uschold, M., Gruninger, M.: Ontologies and semantics for seamless connectivity. SIGMOD Record 33(4), 58–64 (2004) 40. Vrandecic, D., Krötzsch, M.: Semantic MediaWiki. In: Davies, J., Grobelnik, M., Mladenic, D. (eds.) Semantic Knowledge Management, pp. 171–179. Springer, Berlin (2009) 41. Warren, P., Kings, N., Thurlow, I., Davies, J., Bürger, T., Simperl, E., Ruiz, C., Gómez-Pérez, J., Ermolayev, V, Ghani, R., Tilly, M., Bösser, T., Imtiaz, A.: Improving knowledge worker productivity—the ACTIVE integrated approach. BT Technology Journal 26(2) (2009) 42. Warren, P., Thurlow, I., Alsmeyer, A.: Applying semantic technology to a digital library. In: Davies J., Studer R., Warren P. (eds.) Semantic Web Technologies: Trends and Research in Ontology-Based Systems, pp. 237–257. Wiley, New York (2006) 43. 
Warren, P., Thurlow, I., Kings, N., Davies, J.: Knowledge management at the customer frontline—an integrated approach. Journal of the Institute of Telecommunications Professionals 4(1), (2010) 44. W3C: SKOS Simple Knowledge Organisation System. http://www.w3.org/2004/02/skos/ (2004)
Tool Support for Ontology Engineering
Ian Horrocks
Abstract
The Web Ontology Language (OWL) has been developed and standardised by the World Wide Web Consortium (W3C). It is one of the key technologies underpinning the Semantic Web, but its success has now spread far beyond the Web: it has become the ontology language of choice for a wide range of application domains. One of the key benefits flowing from OWL standardisation has been the development of a huge range of tools and infrastructure that can be used to support the development and deployment of OWL ontologies. These tools are now being used in large-scale and commercial ontology development, and are widely recognised as being not simply useful, but essential for the development of the high-quality ontologies needed in realistic applications.
1 Introduction
The Web Ontology Language (OWL) [15, 33] has been developed and standardised by the World Wide Web Consortium (W3C). It is one of the key technologies underpinning the Semantic Web, but its success has now spread far beyond the Web: it has become the ontology language of choice for applications in fields as diverse as biology [38], medicine [9], geography [10], astronomy [8], agriculture [40], and defence [26]. Moreover, ontologies are increasingly being used for “semantic data management”, and database technology vendors have already started to augment their existing software with ontological reasoning. For example, Oracle Inc. has recently enhanced its well-known database management system with modules that use ontologies for this purpose. Their product brochure¹ lists numerous application areas that can benefit from this technology, including Enterprise Information Integration, Knowledge Mining, Finance, Compliance Management and Life Science Research.
¹ http://www.oracle.com/technology/tech/semantic_technologies/pdf/oracle%20db%20semantics%20overview%2020080722.pdf.
I. Horrocks, Department of Computer Science, University of Oxford, Oxford, UK. e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_6, © Springer-Verlag Berlin Heidelberg 2011
The standardisation of OWL has brought with it many benefits. In the first place, OWL’s basis in description logic has made it possible to exploit the results of more than twenty-five years of research and to directly transfer theoretical results and technologies to OWL. As a consequence, the formal properties of OWL entailment are well understood: it is known to be decidable, but to have high complexity (NExpTime-complete for OWL and 2NExpTime-complete for OWL 2 [34]). Moreover, algorithms for reasoning in OWL have been published, and implemented reasoning systems are widely available [12, 30, 39, 44]. These systems are highly optimised and have proven to be effective in practice in spite of the high worst-case complexity of standard reasoning tasks.
One of the key benefits flowing from OWL standardisation has been the subsequent development of a huge range of tools and infrastructure that can be used to support the development and deployment of OWL ontologies. These include editors and ontology development environments such as Protégé-OWL,² TopBraid Composer³ and Neon⁴; reasoning systems such as HermiT [30], FaCT++ [44], Pellet [39], and Racer [12]; explanation and justification tools such as the Protégé-OWL Debugger⁵ and the OWL explanation workbench [14]; ontology mapping and integration tools such as Prompt [32] and ContentMap [23, 32]; extraction and modularisation tools such as ProSÉ⁶; comparison tools such as OWLDiff⁷; and version control tools such as ContentCVS [22]. Such tools are now being used in large-scale and commercial ontology development, and are widely recognised as being not simply useful, but essential for the development of the high-quality ontologies needed in realistic applications.
² http://protege.stanford.edu/overview/protege-owl.html.
³ http://www.topbraidcomposer.com/.
⁴ http://neon-toolkit.org/.
⁵ http://www.co-ode.org/downloads/owldebugger/.
⁶ http://krono.act.uji.es/people/Ernesto/safety-ontology-reuse.
⁷ http://krizik.felk.cvut.cz/km/owldiff/.
2 Background
Ontologies are formal vocabularies of terms, often shared by a community of users. One of the most commonly used ontology modelling languages is the Web Ontology Language (OWL), which has been standardised by the World Wide Web Consortium (W3C) [15]; the latest version, OWL 2, was released in October 2009 [33]. OWL’s formal underpinning is provided by description logics (DLs) [2], knowledge representation formalisms with well-understood formal properties. A DL ontology typically consists of a TBox, which describes general relationships in a domain, and an ABox, which describes information about particular objects in the domain. In a comparison with relational databases, a TBox is analogous to a database schema, and an ABox is analogous to a database instance; however, DL ontology languages are typically much more expressive than database schema languages.
Many ontology-based applications depend on various reasoning tasks, such as ontology classification and query answering, which can be solved using reasoning algorithms. Two types of reasoning algorithms for DLs are commonly used. Tableau algorithms [1, 16–18] can be seen as model-building algorithms: to show that a DL ontology K does not entail a conclusion, these algorithms construct a model of K that invalidates the conclusion. In contrast, resolution-based algorithms [19–21, 31] show that K entails a conclusion by demonstrating that K and the negation of the conclusion are contradictory.
Reasoners are software components that provide reasoning services to other applications. Reasoners such as Pellet [35], FaCT++ [45], RACER [11], CEL [4], and KAON2 [29] provide reasoning services for a range of DLs, and have been used in many applications.
Medicine and the life sciences have been prominent early adopters of ontologies and ontology-based technologies, and there are many high-profile applications in this area. For example, the Systematised Nomenclature of Medicine – Clinical Terms (SNOMED CT) [42] is a clinical ontology being developed by the International Health Terminology Standards Development Organisation (IHTSDO),⁸ and used in healthcare systems of more than 15 countries, including Australia, Canada, Denmark, Spain, Sweden and the UK. GALEN [41] is a similar open-source ontology that has been developed in the EU-funded FP III project GALEN and the FP IV framework GALEN-In-Use.⁹ The Foundational Model of Anatomy (FMA) [37] is an open-source ontology about human anatomy developed at the University of Washington. The National Cancer Institute (NCI) Thesaurus [13] is an ontology that models cancer diseases and treatments. The OBO Foundry¹⁰ is a repository containing about 80 biomedical ontologies developed by a large community of domain experts.
Ontologies such as SNOMED CT, GALEN, and FMA are gradually superseding the existing medical classifications and will provide the future platforms for gathering and sharing medical knowledge; in the UK, for example, SNOMED CT is being used in the National Programme for Information Technology (NPfIT) being delivered by “NHS Connecting for Health”.¹¹ Capturing medical records using ontologies will reduce the possibility of data misinterpretation, and will enable information exchange between different applications and institutions, such as hospitals, laboratories, and government statistical agencies. Apart from providing a taxonomy of concepts/codes for different medical conditions, medical ontologies such as SNOMED CT, GALEN, and FMA describe the precise relationships between different concepts. These ontologies are extensible at point of use, thus allowing for “post-coordination”: users can add new terms (e.g., “almond allergy”), which are then seamlessly integrated with the existing terms (e.g., as a subtype of “nut allergy”). Clearly, the correctness of such (extended) ontologies is of great importance, as errors could adversely impact patient care.
Medical ontologies are strongly related to DLs and ontology languages. In fact, SNOMED CT can be expressed in the description logic EL++ [3], a well-known sub-boolean DL that is the basis for the EL profile of OWL 2 [34]. GALEN, although originally developed using the GRAIL description logic language [36], has now been translated into OWL.¹² FMA was not originally modelled using description logics, but has also been translated into OWL [9]. The developers of medical ontologies have recognised the numerous benefits of using a DL-based ontology language, such as the unambiguous semantics for different modelling constructs, the well-understood tradeoffs between expressivity and computational complexity [2, Chap. 3], and the availability of provably correct reasoners and tools.
The development and application of medical ontologies such as SNOMED CT, GALEN, and FMA crucially depend on various reasoning tasks. Ontology classification (i.e., organising classes into a specialisation/generalisation hierarchy) plays a major role during ontology development, as it provides for the detection of potential modelling errors [46]. For example, about 180 missing sub-class relationships were detected when the version of SNOMED CT used by the NHS was classified using FaCT++ [47]. Furthermore, ontology classification can aid users in merging different ontologies [28], and it allows for ontology validation [5, 7]. In contrast, query answering is mainly used during ontology-based information retrieval [43]; e.g., in clinical applications query answering might be used to retrieve “all patients that suffer from nut allergies”.
The benefits of reasoning-enabled tools for supporting ontology engineering are now recognised well beyond the academic setting. For example, OWL reasoning tools are currently being used by British Telecom in the above-mentioned NPfIT project, and other companies involved in this project, such as Siemens, are actively applying DLs as a conceptual modelling language.
⁸ http://www.ihtsdo.org/.
⁹ http://www.opengalen.org/.
¹⁰ http://www.obofoundry.org/.
¹¹ http://www.connectingforhealth.nhs.uk/.
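The post-coordination and query-answering examples from this section can be made concrete with a deliberately small sketch. It assumes a told hierarchy of atomic classes only (no DL constructors), and the class names, patient identifiers and helper functions are hypothetical; a real system would delegate both steps to an OWL reasoner.

```python
from collections import defaultdict

# Hypothetical told hierarchy: child class -> set of direct parent classes.
parents = defaultdict(set)
parents["NutAllergy"].add("Allergy")
parents["PeanutAllergy"].add("NutAllergy")

# Hypothetical patient records: patient id -> recorded diagnosis class.
diagnoses = {"p1": "PeanutAllergy", "p2": "Allergy"}


def ancestors(cls):
    """Return cls together with every class reachable via parent links."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def patients_with(cls):
    """Answer the query 'all patients whose diagnosis is a kind of cls'."""
    return sorted(p for p, d in diagnoses.items() if cls in ancestors(d))


# Post-coordination: a user adds a new term at the point of use ...
parents["AlmondAllergy"].add("NutAllergy")
diagnoses["p3"] = "AlmondAllergy"

# ... and an existing query picks it up without any further changes.
nut_allergy_patients = patients_with("NutAllergy")
print(nut_allergy_patients)
```

The patient with the newly added term is returned because the query is evaluated against the class hierarchy rather than against literal codes; patient p2, recorded only under the more general class, is correctly left out.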
¹² http://www.co-ode.org/galen/.
3 Reasoning Support for Ontology Engineering
SNOMED CT is extremely large: it currently defines approximately 400,000 classes. Developing ontologies, in particular such large ontologies, is extremely challenging. Large and often distributed teams of domain experts may develop and maintain the ontology over the course of many years. It is useful if not essential to support such development and maintenance processes with sophisticated tools that help users to identify possible errors in their formalisation of domain knowledge. For example, a reasoner can be used to identify inconsistent classes; that is, classes whose extension is necessarily empty. This typically indicates an error in the ontology as it is unlikely that the knowledge engineer intended to introduce a class that can have no instances and that is, in effect, simply a synonym for the built-in OWL Nothing class (the inconsistent class). Similarly, the reasoner can be used to recognise when two different classes are semantically equivalent; that is, classes whose extensions must always be the same. This may indicate an error or redundancy in the ontology, although it is also possible that multiple names for the same concept have deliberately been introduced, e.g., Heart-Attack and MyocardialInfarction.
We can think of inconsistent classes as being over-constrained. A much more typical error in practice is that classes are under-constrained; this can arise because important facts about the class may be so obvious to human experts that they forget to explicate them or simply assume that they must hold. One very common example is missing disjointness assertions: a human expert may, for example, simply assume that concepts such as Arm and Leg are disjoint. A reasoner can also be used to help identify this kind of error: the reasoner is used to compute a hierarchy of classes based on the sub-class relationship, and this computed hierarchy can then be examined by human experts and compared to their intuition about the correct hierarchical structure of domain concepts.
This is not just a theory; reasoning-enabled ontology tools have by now proved themselves in realistic applications. For example, an OWL tool was used at the Columbia Presbyterian medical centre in order to correct important errors in the ontology used for classifying pathology lab test results; if these errors had gone uncorrected, then they could have had a serious and adverse impact on patient care [25]. Similarly, Kaiser Permanente,¹³ a large health care provider in the USA, is using the Protégé-OWL ontology engineering environment and the HermiT OWL 2 reasoner to develop an extended and enriched version of SNOMED-CT; in the following section we will examine this project in more detail and see how reasoning is being used to support the development of the extended ontology.
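The checks described in this section (inconsistent classes, equivalent classes, and the effect of disjointness assertions) can be sketched for a very small fragment. The following is a toy, assuming only atomic subclass and disjointness axioms; it is not how a DL reasoner such as HermiT works, and it would miss any inconsistency that arises from richer constructors. The classes Hand, Foot and Limb and the erroneous axiom are hypothetical.

```python
from itertools import combinations

# Hypothetical told axioms over atomic classes: (subclass, superclass).
subclass_of = {
    ("Hand", "Arm"),
    ("Foot", "Leg"),
    ("Arm", "Limb"),
    ("Leg", "Limb"),
    ("HeartAttack", "MyocardialInfarction"),
    ("MyocardialInfarction", "HeartAttack"),
    ("Hand", "Foot"),  # deliberate modelling error for the demonstration
}
# Disjointness assertions; without this one the error goes unnoticed.
disjoint = {frozenset(("Arm", "Leg"))}

classes = {c for pair in subclass_of for c in pair}


def superclasses(cls):
    """Reflexive-transitive closure of the told subclass relation."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen


# Over-constrained: a class that sits below two disjoint classes.
unsatisfiable = sorted(
    c for c in classes if any(pair <= superclasses(c) for pair in disjoint)
)

# Equivalent: two distinct classes that subsume each other.
equivalent = [
    (a, b)
    for a, b in combinations(sorted(classes), 2)
    if b in superclasses(a) and a in superclasses(b)
]

print(unsatisfiable, equivalent)
```

If the `disjoint` set is emptied, the erroneous axiom is no longer flagged, which mirrors the point that under-constrained ontologies hide modelling errors.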
¹³ http://www.kaiserpermanente.org/.
3.1 Extending SNOMED CT
In order to support a wider range of intelligent applications, Kaiser Permanente need to extend SNOMED CT in a number of different directions. In the first place, they need to express concepts whose definition involves negative information. For example, they need to express concepts such as Non-Viral-Pneumonia; that is, a Pneumonia that is not caused by a Virus. In the second place, they need to express concepts whose definition involves disjunctive information. For example, they need to express concepts such as Infectious-Pneumonia; that is, a Pneumonia that is caused by a Virus or a Bacterium. Finally, they need to express concepts whose definition includes cardinality constraints. For example, they need to express concepts such as Double-Pneumonia; that is, a Pneumonia that occurs in two Lungs.
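One possible way of writing these three concepts in description logic notation is sketched below. The role names causedBy and occursIn are assumptions made for illustration; they are not taken from SNOMED CT or from the Kaiser Permanente extension, and the actual definitions may differ (for instance, Double-Pneumonia might use an exact rather than an at-least restriction).

\[
\begin{aligned}
\mathsf{Non\text{-}Viral\text{-}Pneumonia} &\equiv \mathsf{Pneumonia} \sqcap \neg\exists \mathsf{causedBy}.\mathsf{Virus}\\
\mathsf{Infectious\text{-}Pneumonia} &\equiv \mathsf{Pneumonia} \sqcap \exists \mathsf{causedBy}.(\mathsf{Virus} \sqcup \mathsf{Bacterium})\\
\mathsf{Double\text{-}Pneumonia} &\equiv \mathsf{Pneumonia} \sqcap {\geq}2\, \mathsf{occursIn}.\mathsf{Lung}
\end{aligned}
\]

Written this way, the three definitions rely on negation, disjunction and a qualified number restriction respectively. None of these is among the constructors of EL++, the logic in which SNOMED CT itself can be expressed, which is why the extension calls for a more expressive language and reasoner.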
Such concepts can be relatively easily added to the OWL version of SNOMED CT using a tool such as Protégé-OWL, and the reasoning support built in to Protégé-OWL can be used to check if the extended ontology contains inconsistent classes, or entailments that do not correspond to those expected by domain experts. After performing various extensions, including those mentioned above, all ontology classes were found to be consistent. However, the reasoner failed to find expected subsumption entailments; for example, the ontology does not entail that Bacterial-Pneumonia is a kind of Non-Viral-Pneumonia. This entailment was expected by domain experts because a pneumonia that is caused by a bacterium is not caused by a virus. The reason for this missing entailment is that the SNOMED CT ontology is highly under-constrained. For example, it does not explicitly assert “intuitively obvious” class disjointness; in particular, it does not assert that Virus and Bacterium are disjoint. Having identified this problem, the needed disjointness axioms were added to the extended version of SNOMED CT. After adding these axioms, many additional desired subsumptions were entailed, including the one between Bacterial-Pneumonia and Non-Viral-Pneumonia. Unfortunately, the OWL reasoner also revealed that previously consistent classes had become inconsistent in the extended ontology; one example of such a class was Percutanious-Embolization-of-Hepatic-Artery-Using-Fluoroscopy-Guidance. By using explanation tools, it was discovered that the cause of these inconsistencies was SNOMED CT classes such as Groin that describe “junction” regions of anatomy; in the case of Groin, the junction between the Abdomen and the Leg. In SNOMED CT, these junction regions are defined using simple subsumption axioms; for example, Groin is defined as a subclass of Abdomen and a subclass of Leg. 
When Abdomen and Leg are asserted to be disjoint, as is obviously intended, any instance of Groin would thus need to be an instance of two disjoint classes, and Groin is thus inconsistent. This reveals a serious modelling error in SNOMED CT: modelling such junction regions in this way is simply not correct. Correct modelling of (concepts such as) Groin turns out to be quite complex. After considerable effort, it was determined that an appropriate axiomatisation would be something like the following:
\[
\begin{aligned}
\mathsf{Groin} &\sqsubseteq \exists \mathsf{hasPart}.(\exists \mathsf{isPartOf}.\mathsf{Abdomen})\\
\mathsf{Groin} &\sqsubseteq \exists \mathsf{hasPart}.(\exists \mathsf{isPartOf}.\mathsf{Leg})\\
\mathsf{hasPart} &\equiv \mathsf{isPartOf}^{-}\\
\mathsf{Groin} &\sqsubseteq \forall \mathsf{hasPart}.(\exists \mathsf{isPartOf}.(\mathsf{Abdomen} \sqcup \mathsf{Leg}))
\end{aligned}
\]
In this axiomatisation it is stated that the groin consists of two parts, one of which is part of the abdomen and one of which is part of the leg. The axiomatisation introduces the use of inverse roles as well as universal quantification, suggesting that quite an expressive ontology language is needed for precise modelling of anatomical terms.
As well as illustrating the importance of reasoning-enhanced ontology engineering tools, extending SNOMED CT in this way also revealed the importance of explanation. In particular, the inconsistencies that arose after the addition of the disjointness axioms were very difficult for domain experts to understand, to the extent that the correctness of these entailments was initially doubted. If explanation tools had not been available, the experts would very likely have lost faith in the reasoning tools, and probably would have stopped using them. By using explanation systems they were able to understand the cause of the problem, to see that the initial design of the ontology was faulty, and to devise a more appropriate axiomatisation.
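The missing entailment between Bacterial-Pneumonia and Non-Viral-Pneumonia discussed in this section can be illustrated with a tiny finite interpretation, in the spirit of the model-building view of tableau reasoning. The definitions used here, including the role causedBy and the closure condition that every cause of a bacterial pneumonia is a bacterium, are assumptions for illustration and not SNOMED CT's actual axioms.

```python
def make_interpretation(organism_classes):
    """Build a two-element interpretation: pneumonia 'x' caused by organism 'o'.

    organism_classes lists the classes that 'o' belongs to.
    """
    ext = {"Pneumonia": {"x"}, "Virus": set(), "Bacterium": set()}
    for cls in organism_classes:
        ext[cls].add("o")
    caused_by = {("x", "o")}
    return ext, caused_by


def causes(ind, caused_by):
    """All causedBy-successors of an individual."""
    return {obj for (subj, obj) in caused_by if subj == ind}


def is_bacterial_pneumonia(ind, ext, caused_by):
    # Assumed definition: Pneumonia ⊓ ∃causedBy.Bacterium ⊓ ∀causedBy.Bacterium
    cs = causes(ind, caused_by)
    return ind in ext["Pneumonia"] and bool(cs) and cs <= ext["Bacterium"]


def is_non_viral_pneumonia(ind, ext, caused_by):
    # Assumed definition: Pneumonia ⊓ ¬∃causedBy.Virus
    return ind in ext["Pneumonia"] and not (causes(ind, caused_by) & ext["Virus"])


def satisfies_disjointness(ext):
    """True if the interpretation respects 'Virus and Bacterium are disjoint'."""
    return not (ext["Virus"] & ext["Bacterium"])


# Without a disjointness axiom, 'o' may be both a virus and a bacterium.
ext, caused_by = make_interpretation(["Virus", "Bacterium"])
is_countermodel = (
    is_bacterial_pneumonia("x", ext, caused_by)
    and not is_non_viral_pneumonia("x", ext, caused_by)
)
allowed_after_fix = satisfies_disjointness(ext)
print(is_countermodel, allowed_after_fix)
```

The interpretation is a countermodel to the expected subsumption, and it is exactly the kind of model that the added disjointness axiom excludes. Note that a single countermodel only demonstrates non-entailment; confirming that the subsumption holds once disjointness is asserted is a statement about all models and is the job of a reasoner.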
4 Other Tools
In addition to ontology engineering environments, a large range of other tools is now becoming available. This includes, for example, tools supporting ontology integration and modularisation, ontology comparison, and ontology version control.
When developing a large ontology, it is useful if not essential to divide the ontology into modules in order to make it easier to understand and to facilitate parallel work by a team of ontology engineers. Similarly, it may be desirable to extract from a large ontology a module containing all the information relevant to some subset of the domain; the resulting small(er) ontology will be easier for humans to understand and easier for applications to use. New reasoning services can be used both to alert developers to unanticipated and/or undesirable interactions when modules are integrated, and to identify a subset of the original ontology that is indistinguishable from it when used to reason about the relevant subset of the domain [6].
These techniques have been implemented in tools such as ProSÉ.¹⁴ Given a subset of the vocabulary used in an ontology, ProSÉ can be used to extract a module that includes all the axioms relevant to that vocabulary. The extraction technique uses the semantics of the ontology rather than its syntax, and is based on the logical notion of conservative extensions [27]. It is this formal basis that allows the very strong semantic guarantee to be provided, i.e., the guarantee that, for any entailment question that uses concepts formed only from the given vocabulary, the answer computed using the module will be the same as that computed using the original ontology.
ContentMap¹⁵ is an example of a tool that uses the same underlying semantic framework to support ontology integration. It uses semantic techniques to compute new entailments that would arise as a result of merging a pair of ontologies using a given set of integration axioms (axioms using terms from both ontologies). Users can say whether or not these new entailments are desired, and the tool suggests “repair plans”; these are minimal sets of changes that invalidate undesired entailments while retaining desired ones [24].
ContentCVS¹⁶ is an example of a tool that supports ontology versioning. It uses the well-known CVS paradigm, adapting it to the case of ontologies by using a combination of syntactic and semantic techniques to compare ontology versions [22].
¹⁴ http://krono.act.uji.es/people/Ernesto/safety-ontology-reuse/proSE-current-version.
¹⁵ http://krono.act.uji.es/people/Ernesto/contentmap.
¹⁶ http://krono.act.uji.es/people/Ernesto/contentcvs.
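The notion of a repair plan described in this section can be sketched by brute force for a toy setting. This is not ContentMap's algorithm: it assumes axioms are atomic subclass pairs, entailment is plain reachability, and a "change" is the removal of an axiom. The two ontologies (prefixes o1 and o2), their classes and the faulty mapping are all hypothetical.

```python
from itertools import combinations


def entails(axioms, sub, sup):
    """True if sub is a reflexive-transitive told subclass of sup."""
    seen, stack = {sub}, [sub]
    while stack:
        current = stack.pop()
        if current == sup:
            return True
        for a, b in axioms:
            if a == current and b not in seen:
                seen.add(b)
                stack.append(b)
    return False


def repair_plans(axioms, undesired, desired):
    """All subset-minimal sets of axioms whose removal blocks every undesired
    entailment while keeping every desired one (exponential brute force)."""
    axioms = sorted(axioms)
    plans = []
    for size in range(len(axioms) + 1):
        for candidate in combinations(axioms, size):
            removed = set(candidate)
            if any(plan <= removed for plan in plans):
                continue  # contains a smaller plan, so it is not minimal
            kept = [ax for ax in axioms if ax not in removed]
            if not any(entails(kept, s, t) for s, t in undesired) and all(
                entails(kept, s, t) for s, t in desired
            ):
                plans.append(removed)
    return plans


merged = {
    ("o1:Whale", "o1:Mammal"),        # ontology 1
    ("o2:Whale", "o2:MarineAnimal"),  # ontology 2
    ("o1:Whale", "o2:Whale"),         # integration axiom
    ("o2:MarineAnimal", "o1:Fish"),   # integration axiom (faulty)
}
undesired = [("o1:Whale", "o1:Fish")]
desired = [("o1:Whale", "o2:MarineAnimal")]

plans = repair_plans(merged, undesired, desired)
print(plans)
```

Removing any axiom on the path from o1:Whale to o1:Fish would block the undesired entailment, but only removing the faulty integration axiom also preserves the desired one, so a single plan is reported.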
5 Discussion
As we have seen in Sect. 3.1, reasoning-enabled tools provide vital support for ontology engineering. Ontology development environments such as Protégé-OWL are now considered a minimum requirement for serious ontology engineering tasks, and a wide range of additional tools and infrastructure is now becoming available.
Experience with these tools has illustrated some of the complexities of ontology development, and suggests a high likelihood that non-trivial ontologies developed without tool support will contain errors. These may be errors of omission, typically where the ontology engineers forget to add “obvious” information to the ontology; they may also be errors of commission, where concepts have been over-constrained or incorrectly modelled. Re-use and modular ontology design also introduce the possibility that ontologies which are individually correct turn out to be incompatible when merged.
It has also become evident that, as well as identifying the existence of errors, it is essential for tools to be able to pinpoint errors, explain the reasoning involved in the unexpected (non-)entailment, and if possible offer repair suggestions. Without this facility, domain experts may be unable to identify the source of errors; this may even cause them to lose faith in the correctness of the reasoning system, and ultimately to stop using it.
References 1. Baader, F., Sattler, U.: An overview of tableau algorithms for description logics. Stud. Log. 69, 5–40 (2001) 2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook. Cambridge University Press, Cambridge (2003) 3. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proc. IJCAI-05, Edinburgh, UK, pp. 364–369 (2005) 4. Baader, F., Lutz, C., Suntisrivaraporn, B.: CEL—a polynomial-time reasoner for life science ontologies. In: Proc. IJCAR’06, Seattle, WA, USA, pp. 287–291 (2006) 5. Bodenreider, O., Smith, B., Kumar, A., Burgun, A.: Investigating subsumption in SNOMED CT: an exploration into large description logic-based biomedical terminologies. Artif. Intell. Med. 39(3), 183–195 (2007) 6. Cuenca Grau, B., Horrocks, I., Kazakov, Y., Sattler, U.: Modular reuse of ontologies: theory and practice. J. Artif. Intell. Res. 31, 273–318 (2008) 7. Cure, O., Giroud, J.: Ontology-based data quality enhancement for drug databases. In: Proc. of Int. Workshop on Health Care and Life Sciences Data Integration for the Semantic Web (2007) 8. Derriere, S., Richard, A., Preite-Martinez, A.: An ontology of astronomical object types for the virtual observatory. In: Proc. of Special Session 3 of the 26th Meeting of the IAU: Virtual Observatory in Action: New Science, New Technology, and Next Generation Facilities (2006) 9. Golbreich, C., Zhang, S., Bodenreider, O.: The foundational model of anatomy in OWL: experience and perspectives. J. Web Semant. 4(3), 181–195 (2006) 10. Goodwin, J.: Experiences of using OWL at the ordnance survey. In: Proc. of the First Int. Workshop on OWL Experiences and Directions (OWLED 2005). CEUR Workshop Proceedings, vol. 188 (2005). CEUR. http://ceur-ws.org/
Part II
Academic Legacy
Combining Data-Driven and Semantic Approaches for Text Mining
Stephan Bloehdorn, Sebastian Blohm, Philipp Cimiano, Eugenie Giesbrecht, Andreas Hotho, Uta Lösch, Alexander Mädche, Eddie Mönch, Philipp Sorg, Steffen Staab, and Johanna Völker
Abstract
While the amount of structured data published on the Web keeps growing (fostered in particular by the Linked Open Data initiative), the Web still consists mainly of unstructured, in particular textual, content and is therefore a Web for human consumption. Thus, an important question is which techniques are most suitable to enable people to effectively access the large body of unstructured information available on the Web, whether it is semantic or not. While the hope is that semantic technologies can be combined with standard Information Retrieval approaches to enable more accurate retrieval, some researchers have argued against this view. They claim that only data-driven or inductive approaches are applicable to tasks requiring the organization of unstructured (mainly textual) data for retrieval purposes. We argue that the dichotomy between data-driven/inductive and semantic approaches is a false one. We further argue that bottom-up or inductive approaches can be successfully combined with top-down or semantic approaches, and illustrate this for a number of tasks such as Ontology Learning, Information Retrieval, Information Extraction and Text Mining.
S. Bloehdorn, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_7, © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

The Semantic Web was originally intended as an extension of the traditional (syntactic) Web in which information is given a well-defined meaning and is thus machine-readable [3]. However, the Web as we experience it today mainly comprises unstructured content made for human consumption. This Web of human-readable data is the one that we access quite successfully through search engines such as Google or Yahoo. In order to provide effective access to this human-readable body of unstructured data, we need, from a technological point of view, to develop efficient and scalable approaches for searching, Text Classification, Text Clustering, Machine Translation, Speech Recognition, etc. Most of these tasks have been addressed so far purely by data-driven techniques, e.g. term weighting functions in Information Retrieval (IR) [77], supervised models
in Text Classification [82], unsupervised approaches in Clustering [55], and probabilistic data-derived models in Machine Translation [13] and Speech Recognition [59]. An interesting question is whether semantic approaches can indeed support such tasks. Semantic approaches (we will refer to these as “top-down” approaches) in our sense are approaches which make use of explicitly and declaratively encoded knowledge for certain tasks and are able to reason on the basis of this knowledge. In contrast to data-driven approaches, in semantic approaches we define what must necessarily be true in a certain domain by encoding this into appropriate ontologies.¹ In this sense it is indeed a legitimate question how much top-down processing is needed and beneficial in text-centered tasks such as those stated above.

Before discussing this question in more depth, let us characterize both approaches by means of an analogy. When trying to perform the task of “basket analysis”, i.e. the task of analyzing shopping behavior via data mining techniques with the goal of optimization, we could on the one hand directly interview people, asking them to identify combinations of products they typically buy together. Products are represented in a domain-specific ontology that might also be extended depending on the answers obtained in the interviews. This is a top-down approach in our sense: we do not “observe” people’s shopping behavior, but ask them to state it explicitly and declaratively, leading to answers such as “I typically buy broccoli and tofu” that are used to extend the domain model, for example by establishing a relation between broccoli and tofu. On the other hand, we might apply inductive approaches such as association rule mining [1] to find out what they really buy together, e.g. the beer and the chips (data-driven approach). It is conceivable that the conclusions we get are quite different.
In fact, it might be that people perceive their buying behavior differently from what it actually is, i.e. they are convinced that it follows a healthy pattern (broccoli and tofu) while this is actually not the case (i.e. they buy beer and chips). Now what is the important knowledge for an application in market basket analysis: the top-down information corresponding to what people think they buy, or the things that they actually buy? Maybe both are relevant, as they simply reflect different perspectives of the world.

In this sense we think that some authors have created a false dichotomy, essentially claiming that the data-driven and the semantic approaches are orthogonal if not opposed to each other [46]. In this article we argue that this does not need to be the case. In fact, we regard both approaches as complementing each other. The intuition is as follows: if we have previous knowledge about how the world is structured, why should we not encode it in such a way that applications can reason about this knowledge? At the same time, if our analysis of the data reveals interesting patterns, why should we not use these to extend our knowledge about the world and encode them explicitly? In our view, the dichotomy between inductive/data-driven vs. semantic approaches is thus a false one.

¹ While it is true that fuzzy and non-monotonic extensions to description logics and OWL have been proposed, we puristically view OWL as a non-fuzzy and monotonic logic here.

Certainly, there are tasks solvable by means of purely inductive approaches. One example is machine translation, where translation models in the form of n-grams derived from large (aligned!) corpora have been applied successfully [13]. Speech recognition is also a task where we can get far with purely data-driven approaches. So-called language models derived from corpora [59] are in fact an important component of speech recognizers and, in the absence of more top-down approaches, encode expectations that can guide the speech recognizer. In contrast, there are other kinds of applications where a top-down approach is most suitable, e.g. the task of integrating data from different sites, where the different parties need to agree on how their schemas match.

In fact, schema matching (called Ontology Alignment or Ontology Matching in the Semantic Web field [33]) is a very good example of a task where semantic and bottom-up approaches can be combined successfully. There are different schemas or ontologies defining how a certain (micro-)world looks. The task is to map elements of one schema to elements in another schema. In fact, many data-driven techniques have been proposed to support this task [33], a very necessary step as schemas can indeed get huge [31]. A human engineer can then inspect the suggested mappings and decide whether they are appropriate or not (top-down approach). Schema matching is thus a good example of how top-down and data-driven techniques can be combined effectively.

Halevy et al. argue that top-down approaches work best for tasks such as data integration whereas bottom-up approaches work best for text-centered or Natural Language Processing (NLP) tasks. While we cannot prove or disprove this statement, we have a number of arguments that it corresponds to a very limited view of NLP.
In fact, we can observe at least two main fallacies in the field of NLP:
• Open domain fallacy: While in the early days of NLP research a focus on specific domains was widespread (for example NLP access to databases), many researchers in NLP have not focused on any particular domain in recent years, developing their techniques on large domain-independent newspaper corpora such as the Brown corpus² or the Penn Treebank.³ Examples of this are parsing [47], Word Sense Disambiguation (WSD) and Named-Entity Recognition. However, such models might be less useful in subdomains or very technical domains where the sublanguage differs significantly from the language used in newspapers. As it may be very expensive to gather corpora and train algorithms on specific domains, it can be a more feasible solution to rely on top-down approaches for encoding what we know or expect to be true in these domains. Whether it is more cost-effective to develop an ontology or to produce labeled training data is certainly an open question, and one which has not received much attention so far.
² http://icame.uib.no/brown/bcm.html
³ http://www.cis.upenn.edu/~treebank/
• Data-is-everything fallacy: It is certainly not the case that all relevant knowledge can be derived from data. One apparent reason for this is that the most obvious (and fundamental) facts are not explicitly mentioned in text because the author assumes them to be known to all the potential readers [14]. Further, most approaches still require that the categories we want to distinguish are manually
defined, which represents a top-down process in our sense. Take the example of Word Sense Disambiguation. Most approaches to WSD are indeed supervised, and they require the senses to be distinguished to be encoded in a top-down fashion. WordNet [36] has been used to deliver sense distinctions in many cases. The same holds for Named-Entity Recognition approaches, where the classes of entities are typically fixed (unsupervised approaches such as [34] are comparatively rare). The classes might also vary between domains, something that is often ignored in Natural Language Processing research. Further, application developers may simply not have access to enough data, or not have the (computing and human) resources to train systems for their domain and applications. In this case, encoding knowledge in a top-down fashion instead of deriving it from data can be an interesting solution. This is also attractive if this knowledge can be transferred and used for other applications outside of the scope of text mining, leading to high synergy effects.

Semantic or ontology-based approaches contribute to overcoming the above-mentioned fallacies. In fact, according to the view put forth in this chapter, inductive and semantic approaches can be naturally and successfully combined in text mining tasks. In what follows we briefly mention some of these tasks that are discussed in more detail in the remainder of this chapter:
1. Ontology Learning (OL): Halevy et al. mention that the task of Ontology Writing (as they call it) represents a significant bottleneck and a costly task. For this reason, inductive techniques have been applied to the problem of learning an ontology. The main idea here is to exploit inductive and statistical learning techniques to derive interesting relationships from data and then ask a human user to confirm the universal validity of the conclusion, reject it by providing a counterexample, etc.
In this article we will review some of these techniques, many of which have been developed in Karlsruhe, such as the algorithms underlying the Ontology Learning frameworks TextToOnto, Text2Onto or, more recently, RoLExO and RELExO. We will discuss these developments in more detail in Sect. 2.
2. Information Extraction (IE): Information Extraction has always been concerned with extracting structural knowledge (templates) from textual data. We will discuss different approaches where inductive techniques have been applied to the task of extracting structured knowledge representations from textual data.
3. Information Retrieval (IR): Most approaches in Information Retrieval index and retrieve textual data on the basis of Bag-of-Words models. Such approaches typically perform without any manually encoded ontological knowledge. However, it is also possible to rely on human-created categories to index textual documents. For example, Explicit Semantic Analysis (ESA) defines such an approach that has applications both in Information Retrieval and Text Classification [38]. Recently, it has also been shown that the strength of this approach is that it can completely abstract from surface appearance, thus being even applicable to retrieval across languages. We will discuss these topics in Sect. 3. Additionally we present extended vector space models that allow the capture of the semantics of
documents in a bottom-up approach. Finally, we also present the application of semantic search to industrial scenarios.
4. Text Mining (TM): There are about as many approaches to TM as there are applications. This is due to the fact that TM is by definition a very application-driven field of research and the applications differ along many dimensions. Section 4 describes interesting approaches and applications that we have worked on in the recent past. They shed light on the combination of semantic and statistical approaches from various angles.

As this chapter is part of a Festschrift for Rudi Studer we will mainly emphasize work performed in Karlsruhe as a tribute to him.
2 Ontology Learning

The Web as we know it is a Web of human-interpretable data: billions of text documents, videos and images, whose information contents are intangible for computers or automated agents. Ever since Tim Berners-Lee published his seminal article in Scientific American, researchers have hence been dreaming of a new type of Web: The Semantic Web is an extension of the Web which adds metadata (i.e. a formal and explicit representation of meaning of data) that can be processed by machines in a meaningful way. In order to make this vision come true and to enable more ‘intelligent’ automatic access to information, significant amounts of high-quality ontologies and semantic annotations will be indispensable. So far, it seems unlikely that manual efforts alone will ever suffice to generate and maintain those large amounts of metadata, especially not in theoretically complex or highly dynamic domains such as bioinformatics or medicine. Hence, these domains are nowadays among the key drivers when it comes to the development of new data mining and knowledge acquisition techniques.

A relatively new field of research in data mining is Ontology Learning (OL), the automatic or semi-automatic generation of ontologies by Machine Learning (ML) or NLP techniques. Ontology Learning approaches can be coarsely distinguished by the kind of input data they require. While some methods generate new metadata from existing informal or semi-formal resources (for example textual documents [19], databases [62], multimedia documents [54, 71] or folksonomies [57, 80]), others seek to bootstrap the Semantic Web by leveraging existing metadata. However, even more promising from our point of view seem to be hybrid approaches which benefit equally from the redundancy that comes with large amounts of informal data and the existence of high-quality, often manually engineered ontologies.
2.1 A Very Short History of Ontology Learning

It is impossible to determine which paper on Ontology Learning was the very first, since, especially in the early days of the Semantic Web, the boundaries between lexical
acquisition and Ontology Learning were even fuzzier than they are today. The term “Ontology Learning”, at least, was coined by Alexander Mädche and Steffen Staab. The two of them, together with Claire Nédellec, organized the first official workshop on Ontology Learning, which was co-located with the European Conference on Artificial Intelligence (ECAI) 2000 in Berlin, Germany. The workshop attracted researchers from many different countries and laid the foundations for a highly interdisciplinary field of research. One year later, Alex published his PhD thesis [64], which turned out to be the first in a long series of dissertations on Ontology Learning from text (e.g. Philipp Cimiano [20], Marta Sabou [75], David Sanchez Ruenes [79] and Johanna Völker [88]), not to mention the various books about Ontology Learning [17, 19]. By that time, a respectable number of automated approaches to generating RDFS-style ontologies had been developed, yielding fairly good results for simple taxonomies. But when the W3C published the first recommendation of the OWL standard in 2004, new challenges for the Ontology Learning community arose, including the automatic generation of disjointness axioms and the need to deal with logical inconsistencies. Despite first advances and promising results, facing these challenges is still one of the most difficult problems in Ontology Learning and far from being solved.

In the following we will highlight the research in Karlsruhe that has contributed to the field of Ontology Learning. This includes approaches to term extraction, definition of lexico-syntactic patterns, clustering for taxonomy learning, association rules for discovering relations and relational exploration. We will also discuss different aspects of user interaction and introduce tools and applications developed in the scope of Ontology Learning.
2.2 Data-Driven and Knowledge-Based Approaches

2.2.1 Term Extraction

Terms, i.e. nominal or verbal phrases referring to linguistic concepts, are widely accepted as a means of labeling classes, individuals and properties. One of the most fundamental tasks in lexical Ontology Learning therefore aims to identify terms or phrases which are relevant for a particular domain of interest, such as “broccoli” or “tofu”. Usually, scores indicating their respective degrees of relevance are computed by counting the occurrences of the terms in a representative text corpus, for example using TFIDF or entropy as in Text2Onto [21], or by comparing their frequencies with statistics obtained from a reference corpus [2, 30]. In a Web context, one can also measure the relevance of terms by considering structural information provided by HTML or XML documents [16].

2.2.2 Lexico-Syntactic Patterns

One of the most well-known and simplest approaches to learning taxonomic hierarchies is based on early work in lexical acquisition: So-called “Hearst patterns”,
lexical and syntactic clues for hyponymy relationships, have been shown to indicate class membership or subsumption relationships between atomic classes with reasonably high precision. For example, an Ontology Learning tool might suggest making Broccoli a subclass (or an instance) of Vegetable given a sentence like “Broccoli, spinach and other vegetables are known to be very healthy.” and the following pattern:

NP {, NP}* {and | or} other NP

Additional patterns have been proposed by Ogata and Collier [72], Cimiano [23] and others: exception (e.g. “German beers except for Kölsch”), apposition (e.g. “Kellogg, the leading producer of cereals”), definites (e.g. “the Gouda cheese”) and copula (e.g. “Pepsi is a popular soft drink”). A frequent problem in pattern-based Ontology Learning is that of so-called empty heads [18, 42], i.e. nominals which do not contribute to the actual meaning of a genus phrase (e.g. “glucose is a type of sugar”). In particular, the rules relying on Hearst-style patterns for the identification of hyponymy relationships may be misled by expressions such as “one”, “any”, “kind” or “type”. Solutions to this problem have been proposed by Völker [91] and Cimiano [22].

Another drawback of nearly all pattern-based approaches to hyponym extraction is the problem of data sparseness. Since occurrences of lexico-syntactic patterns are comparatively rare in natural language texts, most research nowadays concentrates on the web as a corpus [9, 23, 26, 94], even though the enormous syntactic and semantic heterogeneity of web documents poses new sorts of challenges. Moreover, Cimiano and others [27, 93] have investigated the fusion of multiple sources of evidence for complementing purely pattern-based approaches by heuristics or background knowledge, and recently, approaches to the automatic generation of patterns [10] aim to increase the flexibility and effectiveness of Ontology Learning or Information Extraction systems.
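As a rough illustration of how the “and/or other” pattern can be matched, the following Python sketch approximates every noun phrase by a single word. This is a strong simplifying assumption; a real system would rely on a chunker or parser and on lemmatization. The function name and the regular expression are hypothetical and are not taken from any of the tools mentioned in this chapter.

```python
import re

# Simplifying assumption: each noun phrase (NP) is a single word.
# Pattern being approximated: NP {, NP}* {and | or} other NP
HEARST_OTHER = re.compile(
    r"(?P<hyponyms>\w+(?:\s*,\s*\w+)*)\s*,?\s+(?:and|or)\s+other\s+(?P<hypernym>\w+)",
    re.IGNORECASE,
)


def extract_hyponym_candidates(sentence):
    """Return (hyponym, hypernym) candidate pairs found in the sentence."""
    pairs = []
    for match in HEARST_OTHER.finditer(sentence):
        hypernym = match.group("hypernym").lower()
        # The enumeration before "and/or other" lists the hyponym candidates.
        for term in re.split(r"\s*,\s*", match.group("hyponyms")):
            pairs.append((term.lower(), hypernym))
    return pairs


pairs = extract_hyponym_candidates(
    "Broccoli, spinach and other vegetables are known to be very healthy."
)
print(pairs)
```

The extracted pairs are only candidates: the hypernym is returned in its surface form (here the plural), and empty heads such as “kind” or “type” would be reported as hypernyms unless they are filtered out separately.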
2.2.3 Clustering for Taxonomy Learning

A different approach, specifically designed to support the acquisition of subsumption relationships, has been proposed by Cimiano and others [24]. It relies on hierarchical clustering techniques in order to group terms with similar linguistic behavior into hierarchically arranged clusters. The clusters obtained by this kind of approach are assumed to represent classes, the meaning of which is constrained by a characterizing set of terms. Clustering techniques in this line clearly have the advantage of yielding a higher recall than pattern-based approaches because they are less dependent on more or less explicit manifestations of hyponymy. On the other hand, the generated clusters (or classes) most often lack meaningful one-word labels, a fact that makes them most suitable for use within semi-supervised Ontology Learning frameworks, although several approaches to automatically labeling conceptual clusters have been proposed in recent years [58]. This also applies to methods for Ontology Learning which rely on conceptual clustering and in particular formal concept analysis. For example, Cimiano et al. [25] suggested an approach
to taxonomy induction which groups concepts according to the argument slots they fill in verb phrases. Following this approach, “beer” and “lemonade” would most probably be grouped under the concept of “drinkable” which could be made a subconcept of “consumable” (as drinking is a type of consumption).
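The idea of grouping terms by the argument slots they fill can be sketched in a few lines of Python. The sketch below is a drastic simplification and not the method of [25]: it assumes that verb-object pairs have already been extracted from a parsed corpus, treats the set of objects of each verb as one candidate concept, and proposes a subsumption whenever one such set is a proper subset of another. A formal concept analysis would instead build the full concept lattice. All names and the toy data are hypothetical.

```python
def slot_extents(verb_object_pairs):
    """Map each verb to the set of terms observed in its object slot."""
    extents = {}
    for verb, obj in verb_object_pairs:
        extents.setdefault(verb, set()).add(obj)
    return extents


def subsumption_candidates(extents):
    """Propose (sub, super) pairs where one extent is a proper subset of another."""
    return sorted(
        (sub, sup)
        for sub in extents
        for sup in extents
        if sub != sup and extents[sub] < extents[sup]
    )


# Toy input, assumed to come from a dependency parser.
observed = [
    ("drink", "beer"), ("drink", "lemonade"),
    ("consume", "beer"), ("consume", "lemonade"), ("consume", "pizza"),
    ("eat", "pizza"),
]
extents = slot_extents(observed)
candidates = subsumption_candidates(extents)
print(candidates)
```

Here the objects of “drink” form a subset of the objects of “consume”, which corresponds to making “drinkable” a subconcept of “consumable”. With realistic, sparse corpus data strict set inclusion rarely holds, which is one reason why practical systems use smoothing or similarity-based clustering.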
2.2.4 Association Rules for Discovering Relations

Association rules have become popular as a means for detecting regularities between items in large transaction data sets. Such transaction data sets are collected, e.g., by large supermarkets which monitor the shopping behavior of their customers. Every transaction corresponds to a customer’s purchase, i.e. a set of products which were bought together (e.g. broccoli, tofu and mineral water). Mädche and Staab [65] applied this approach to an Ontology Learning setting by mining for associations between lexicalized concepts in a domain-specific corpus. Using support and confidence as measures for the association strength, their method suggests likely relationships between concepts in an ontology (e.g. Human and Food), which can then be labeled by the ontology engineer or by an automatic approach such as the one proposed by Kavalec and Svátek [60]. This is a good example of a combination of a bottom-up approach (where association rule mining proposes relation candidates) and a top-down approach (where a human in the loop decides whether a candidate corresponds to a relevant relation in the domain or not).
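To make the two measures concrete: for a rule A → B, support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The following Python sketch computes both for pairs of items only. It is a minimal illustration under that restriction, not the generalized algorithm used in [65]; the thresholds, the function name and the toy transactions are assumptions.

```python
from itertools import combinations


def association_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules (antecedent, consequent, support, confidence) over item pairs."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue  # the pair co-occurs too rarely
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_count[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Toy "transactions": concepts that co-occur in the same sentence (assumed data).
transactions = [
    {"human", "food"},
    {"human", "food", "drink"},
    {"human", "drink"},
    {"food", "vegetable"},
    {"human", "food"},
]
rules = association_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

The rules only state that two concepts are associated. Deciding whether the association corresponds to a meaningful relation, and naming it, remains the top-down step.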
2.2.5 Relational Exploration

Relational exploration is a method for systematic expert interrogation based on formal concept analysis, which can be used for refining subsumption hierarchies. A first implementation of this method was developed by Rudolph and Völker, whose framework for reasoner-aided relational exploration supports human domain experts in the challenging task of acquiring missing axioms, including class disjointness [89] and logically complex domain-range restrictions [90]. Given an ontology about food and beverages, for example, the following axioms could be suggested:

“Everything edible and green must be a healthy vegetable”. Food ⊓ Green ⊑ Vegetable ⊓ Healthy
“Alcohol is unhealthy”. Alcoholic ⊓ Healthy ⊑ ⊥
“A person who buys something healthy must be a woman”. Person ⊓ ∃buy.Healthy ⊑ Woman
“If a man buys food, he must be a bachelor and the food must be a pizza”. Man ⊓ ∃buy.Food ⊑ Bachelor ⊓ ∀buy.(¬Food ⊔ Pizza)

Since the selection of hypothetical axioms proposed by the exploration algorithm is to a large extent driven by the underlying reasoner, one might consider this approach a first step from purely data-driven to knowledge-based Ontology Learning.
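One step of such an interrogation, checking a hypothetical axiom against known individuals before asking the expert, can be sketched as follows. The Python code below covers only implications between conjunctions of atomic classes and uses a plain dictionary instead of a reasoner, so it is an assumption-laden illustration of the confirm-or-counterexample idea rather than the framework described in [89, 90]. The individuals and their class memberships are invented.

```python
def find_counterexample(context, premise, conclusion):
    """Return an individual that belongs to all premise classes but lacks
    some conclusion class, or None if the implication holds in the context."""
    for individual, classes in context.items():
        if premise <= classes and not conclusion <= classes:
            return individual
    return None


# Toy formal context: individuals and the atomic classes they belong to.
context = {
    "broccoli": {"Food", "Green", "Vegetable", "Healthy"},
    "spinach": {"Food", "Green", "Vegetable", "Healthy"},
    "pistachio_ice_cream": {"Food", "Green"},
    "beer": {"Alcoholic"},
}

# Hypothesis: everything edible and green is a healthy vegetable.
refuting = find_counterexample(
    context, {"Food", "Green"}, {"Vegetable", "Healthy"}
)
# Hypothesis: every vegetable is healthy.
unrefuted = find_counterexample(context, {"Vegetable"}, {"Healthy"})
print(refuting, unrefuted)
```

A hypothesis for which no counterexample is known is not thereby proven; in the exploration setting it would be presented to the domain expert, who either accepts it as an axiom or supplies a new counterexample that is added to the context.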
2.3 User Interaction

User interaction is among the greatest challenges for today’s Ontology Learning approaches, because these approaches are expected to accomplish what the most sophisticated ontology editors so far have not made possible, namely to facilitate ontology creation by the masses. Especially domain experts without any prior knowledge about formal semantics and ontology representation languages must be enabled to contribute to the realization of the Semantic Web. A fair amount of user guidance seems unavoidable when it comes to the selection of methods and the efficient acquisition of reasonably sized ontologies, even if today’s Ontology Learning systems can relieve domain experts and ontology engineers of some of the most tedious work. Finally, users of an Ontology Learning tool or framework will need some support in assessing the quality of the automatically generated suggestions, estimating their logical consequences and possibly correcting the ontology accordingly.

Alexander Mädche was among the first to recognize these requirements and incorporate them into the TextToOnto Ontology Learning framework [66]. Later, Text2Onto [21] paved the way for a new generation of Ontology Learning frameworks by combining the idea of incremental, data-driven learning with an explicit representation of multiple modeling alternatives. Associated provenance information automatically generated during the Ontology Learning process enabled the generation of explanations, and experimental results indicate that this type of metadata (i.e. confidence and relevance values) can be a valuable help in the automatic diagnosis and repair of logically inconsistent ontologies [43].
An important step towards more comprehensive user guidance in Ontology Learning has been made by Elena Simperl, Christoph Tempich and Denny Vrandečić, whose Ontology Learning methodology [83] extends an umbrella over the complex process of semi-automatic ontology creation and its integration into common ontology engineering frameworks.
2.4 Tools and Applications

Over the years various Ontology Learning tools and frameworks have been developed in Karlsruhe, including TextToOnto and Text2Onto [21, 66], Pankow and CPankow [23, 26], LeDA [93], LExO [92], RELExO and RoLExO [89, 90], as well as AEON [94] and finally Pronto [10], which we describe in more detail in Sect. 4.1. These software prototypes have shown their usefulness in projects and case studies, across application scenarios such as Question Answering [8] or Ontology Alignment [67]. However, as we cannot describe all of these applications in detail, we restrict ourselves to highlighting one of them: an integrated prototype [45] which emerged from a group-internal competition at Schloss Dagstuhl: “Can you build the Semantic Web in one day?” [87]. Faced with this challenge, a semantic application leveraging the functionalities of various tools and datasets was developed. It is a combination of Bibster [44], TextToOnto and the Librarian Agent [86]. The system emerged as the winner of this competition. All of the other submissions were
however equally remarkable. Just see the competition’s website⁴ for an overview of what was possible within twenty-four hours of hacking in the early days of the Semantic Web.
3 Semantics in Information Retrieval

In Information Retrieval (IR) and Computational Linguistics, Vector Space Models (VSMs) [78] and their variations, such as Word Space Models [81], Hyperspace Analogue to Language [63], or Latent Semantic Analysis (LSA) [29], have become the mainstream paradigm for text representation. VSMs have been empirically justified by results from cognitive science [39]. They embody the distributional hypothesis of meaning [37], according to which the meaning of words or bigger text units is defined by the contexts in which they (co-)occur. The contexts can be either local, i.e. just the immediate neighbors of the words, or global, e.g. a sentence, a paragraph or the whole document. Typically, global context is used for modeling text meaning within the IR paradigm. To do this, a term-document matrix is constructed and the meaning of the documents is defined by the words they share, and vice versa.

In the following we define the Bag-of-Words (BoW) model as the standard VSM. Documents are mapped to the vector space spanned by all terms in the collection. The values of each dimension, which corresponds to a term, are defined using functions of the term frequency in the document and in the collection. This model is based on an independence assumption for terms, i.e. all term vectors are orthogonal. While this simplifies the model, it also has some substantial drawbacks. In many cases the appearance of one term increases the probability of another, e.g. “broccoli” and “vegetable”. This cannot be captured in the Bag-of-Words model. Applied to IR, relevant documents containing “broccoli” will not be found for queries containing “vegetable”. In this section we will introduce alternative VSMs that try to overcome these restrictions.
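The consequence of the orthogonality assumption can be reproduced with a few lines of Python. The sketch below uses raw term frequencies as vector weights and cosine similarity for ranking; both are common choices but are assumptions here, as are the whitespace tokenization and the two toy documents.

```python
import math
from collections import Counter


def bow(text):
    """Map a text to its Bag-of-Words vector (term -> raw frequency)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


documents = [
    "broccoli is rich in vitamins",
    "a vegetable garden needs water",
]
query = bow("vegetable")
scores = [cosine(query, bow(document)) for document in documents]
print(scores)
```

The first document is about a vegetable but shares no term with the query, so its score is exactly zero, while the second document is retrieved because it contains the query term literally. Term weighting schemes change the magnitudes of the scores but not this zero, since the two term dimensions remain orthogonal.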
In line with the introduction to this chapter, most approaches to IR can be classified as bottom-up approaches in the sense that ranking and retrieval functions are based on VSMs which are induced from term distributions in documents and queries. These models are therefore based on the available data without using any conceptual knowledge. Existing VSMs also ignore at least two further requirements for an adequate representation of natural language: word order information and a way to model semantic composition, i.e. the meaning of phrases and sentences.

We have worked on two approaches to overcome the limitations of current VSMs. The first approach moves away from the BoW model by representing documents in concept spaces. The ideas of mapping terms to concepts or accessing documents by extracting single units of information have been around for a while. Harris [48] already proposed in 1959 to extract certain relations from scientific articles by means of NLP and to use them for information finding. In order to achieve a kind of “conceptual” search, indexing strategies have been used in which the documents are indexed by concepts of WordNet [41], of Wikipedia [38] or of an ontology [12]. Exploiting these knowledge sources to represent documents is one way of combining data-driven with top-down approaches. On the one hand, term distributions are used to define concept mappings. On the other hand, the set of concepts is typically defined manually in a “top-down” manner.

In the second approach, we abstract away from the document level and zoom in on the representation of the meaning of smaller text units. Until recently, little attention has been paid to the task of modeling more complex language structures, such as phrases or sentences, with VSMs. This gap constitutes a crucial barrier for semantic vector models on the way to modeling language [95]. An emerging area of research, which is receiving more and more attention among the advocates of distributional models, addresses strategies for representing compositional aspects of language within a VSM framework. This requires novel modeling paradigms that allow the integration of word order information into VSMs.

In addition to these theoretical approaches to IR, we also present industrial applications of semantic IR that emerged from the group of Rudi Studer in Sect. 3.3.

Footnote 4: http://km.aifb.kit.edu/projects/swsc.
3.1 Concept Spaces for Document Representations

Two major problems of the Bag-of-Words (BoW) model can be identified: different terms having the same meaning (a case referred to as “synonymy”) and one term having different interpretations depending on the context (known as “homonymy”). Concept spaces are an extension of the BoW model which is used to obtain document representations that abstract from the term level. The motivation is to overcome the restrictions of the BoW model regarding synonyms and homonyms. These representations also enable further application scenarios. A prominent example is cross-lingual IR. While the BoW model is defined on terms, which are mostly disjoint across languages, interlingual concept models enable the representation of documents in multiple languages in the same concept space.
3.1.1 Definition of Concept Spaces

We distinguish between two approaches to defining concept spaces: intrinsic and explicit. Intrinsic approaches are data-driven. By analyzing the data with suitable methods, we can identify implicit or latent concepts. The semantics of these concepts is defined purely by the function that assigns association strength values between given documents and concepts. They have no formal description or label. The most prominent techniques to derive intrinsic concepts are Latent Semantic Indexing (LSI) [32] and Latent Dirichlet Allocation (LDA) [29]. In both cases, concepts are identified by performing a dimensionality reduction on the term-document matrix which is based on co-occurrence statistics of terms in documents. Document representations based on these derived dimensions (corresponding to intrinsic concepts) can be used, for example, to overcome the synonymy problem. These techniques have also been applied to multilingual scenarios. In order to apply LSI or LDA to multilingual data, a corpus consisting of parallel documents in each language is needed.

Explicit concept models require an externally defined set of concepts. In addition, textual descriptions of these concepts are needed, which we refer to as concept signatures. These signatures are exactly what combines the data-driven and semantic paradigms: the semantics is represented by concepts as single units of meaning, while their textual descriptions connect the concepts to data. It is important to note that explicit concept models exploit this kind of knowledge and do not provide it. In the following section we present our research on Explicit Semantic Analysis (ESA), which is a prominent instance of an explicit concept model. In particular, we concentrate on its multilingual extension and its application to IR.
3.1.2 Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) [38] attempts to index or classify a given document with respect to a set of explicitly given external concepts. The document is mapped to a point in the vector space spanned by these concepts. The value of each dimension is computed by measuring the similarity between the text of the document and the textual description of the corresponding concept. For example, IR measures like TFIDF (see footnote 5) have been used to compute this similarity. While different data sources have been used for ESA applied to IR, exploiting Wikipedia seems to be the most successful approach [70]. Wikipedia is an adequate knowledge source for ESA as it combines the different aspects relevant for ESA, i.e. most articles define single concepts using textual descriptions given by the article body. Further advantages are the wide coverage of Wikipedia and the multilingual connections between Wikipedia databases in different languages.

The left part of Fig. 1 visualizes a small part of the Wikipedia graph, containing articles, categories, category links that connect articles and categories to categories, and language links that connect articles and categories across languages. The original ESA paper suggests the use of articles as concepts. This corresponds to the lower level in Fig. 1. We developed a multilingual extension to this approach which exploits the language links to map between languages [84]. This approach was also applied to Wikipedia categories instead of articles, which corresponds to the upper level in Fig. 1. By exploring the parameter space of ESA functions and retrieval methods, an optimal parametric choice for cross-lingual retrieval was identified [85], and it was shown that retrieval performance is indeed superior to LSI and LDA in this scenario [28].

Fig. 1 Left: Multilingual structure of articles and categories in Wikipedia. Right: Example ESA vectors of the query “healthy food” using article and category concept space

As an example, ESA vectors for the query “healthy food” based on articles (ESA) and categories (Cat-ESA) are presented in Fig. 1. The ESA representation activates the concepts Broccoli and Chips, which correspond to Wikipedia articles. Using Cat-ESA, the activated concepts Foods, Snacks and Vegetables correspond to Wikipedia categories. The association strength of each concept is defined by the text similarity of the query to the article text (ESA) or to the text of the articles contained in the category (Cat-ESA).

ESA can be applied to IR by comparing ESA vectors of documents and queries. Coming back to our running example, the shopping profiles of users can be used as a document collection. As presented in Fig. 1, we assume that the concept space consists of two concepts, namely broccoli and chips. Looking at the shopping profile of a user with a healthy lifestyle, the corresponding ESA vector would probably have a high value for broccoli and a low value for chips: [broccoli=.6 chips=.2]. This is based on the high frequency of the term “vegetable” and the low frequency of the term “junk food” in the user profile, which are related to broccoli and chips respectively. Considering again the query “healthy food”, the ESA representation will have a high value for broccoli: [broccoli=.8 chips=.1]. This mapping is based on the textual description of broccoli, which defines broccoli as healthy food, and of chips, which does not match the term “healthy”. Matching the query to the user profile will lead to high similarity. In contrast to the BoW model, this user profile matches the query without containing the query terms, as both query and user profile activate similar concepts in the ESA vector space. This shows the clear benefit of this semantic representation of text, which makes it possible to retrieve not only direct matches but also related users, which might be relevant as well.

Footnote 5: TFIDF is a widely used statistical weight of terms in documents given a corpus. For a specific term and document, the TFIDF value is the product of the term frequency (TF), i.e. the number of occurrences of the term in the given document, and the inverse document frequency (IDF), which decreases with the number of documents in the corpus that contain the term (commonly the logarithm of the corpus size divided by that number).
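The running example can be turned into a minimal ESA-style sketch. The two concept signatures below are invented stand-ins for Wikipedia article texts, and the weighting (smoothed TF-IDF with cosine similarity) is one plausible choice rather than the parameter setting identified in the cited experiments.

```python
import math
from collections import Counter

# Hypothetical concept signatures; ESA would use Wikipedia article texts here.
SIGNATURES = {
    "broccoli": "broccoli is a green vegetable and a healthy food rich in vitamins",
    "chips": "chips are a fried salty snack often regarded as junk food",
}

def tokens(text):
    return text.lower().split()

def build_idf(signatures):
    """Smoothed inverse document frequency over the concept signatures."""
    n = len(signatures)
    df = Counter(t for text in signatures.values() for t in set(tokens(text)))
    return {t: math.log((1 + n) / count) for t, count in df.items()}

IDF = build_idf(SIGNATURES)

def tfidf(text):
    """TF-IDF vector of a text; terms unknown to the signatures get weight 0."""
    return {t: c * IDF.get(t, 0.0) for t, c in Counter(tokens(text)).items()}

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def esa_vector(text):
    """Association strength of a text with every concept in the concept space."""
    text_vec = tfidf(text)
    return {concept: cosine(text_vec, tfidf(sig)) for concept, sig in SIGNATURES.items()}

QUERY = "healthy food"
PROFILE = "green vegetable vitamins salad"  # toy shopping profile, shares no term with QUERY

query_esa = esa_vector(QUERY)
profile_esa = esa_vector(PROFILE)
print(query_esa, profile_esa)
print(cosine(query_esa, profile_esa))  # high, although query and profile share no term
```

The query and the profile are compared in the concept space, so they match through the shared broccoli concept even though their term sets are disjoint.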
Fig. 2 WSM Representation of two Sentences: “Peter likes chips and hates broccoli. Mary likes broccoli and hates chips”
3.2 Extended Vector Space Models

Vector-based models have proven useful and adequate in a variety of NLP tasks. However, it has long been recognized that these models are too weak to represent natural language to a satisfactory extent, since they assume that word co-occurrence is essentially independent of word order, and all the co-occurrence information is fed into one vector per word. In our work, we make use of the Word Space Model (WSM) [81], which is an instantiation of the term-by-term VSM. In WSMs, the meaning of a word is modeled as an n-dimensional vector, where the dimensions are defined by the co-occurring words within a predefined context window. Suppose our background knowledge corpus consists of the following two sentences: Peter likes chips and hates broccoli. Mary likes broccoli and hates chips. The distributional meanings of Peter, Mary, chips and broccoli would all be defined in a similar way by the co-occurring likes. This is insufficient, as chips and broccoli can only be liked by somebody but cannot like anything themselves; in the case of Peter and Mary, both interpretations should be possible. Figure 2 shows a WSM representation of the above sentences using a context window of one word to the left and one word to the right, ignoring stop words.

In order to get the “compositional” meaning of those sentences within the current VSM paradigm, the Bag-of-Words approach has again been used as a default until recently [29, 61]. It is called BoW in this case because it consists of simply adding up the individual vectors of the words to get the meaning of a phrase or a sentence. Figure 3 demonstrates the resulting vectors for such a composition. It shows that the sentences Peter likes chips and hates broccoli and Peter hates chips and likes broccoli would mean the same with this kind of representation. Consequently, the vector sum operation cannot serve as an adequate means of semantic composition, as word order information is ignored.
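The following sketch builds a word space from the two example sentences with a context window of one content word on each side and then composes sentence vectors by addition. The stop word list and the function names are assumptions of this sketch, and the counts it produces are not claimed to reproduce the figures.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"and"}  # assumed stop word list for the toy corpus

def content_words(sentence):
    """Lowercase, strip the final period, and drop stop words."""
    return [w for w in sentence.lower().replace(".", "").split() if w not in STOP_WORDS]

def build_wsm(sentences, window=1):
    """Term-by-term co-occurrence counts within a symmetric context window."""
    space = defaultdict(Counter)
    for sentence in sentences:
        words = content_words(sentence)
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[word][words[j]] += 1
    return space

def compose_by_addition(sentence, space):
    """Sentence meaning as the sum of its word vectors (word order is lost)."""
    total = Counter()
    for word in content_words(sentence):
        total.update(space[word])
    return total

CORPUS = [
    "Peter likes chips and hates broccoli.",
    "Mary likes broccoli and hates chips.",
]
space = build_wsm(CORPUS)

a = compose_by_addition("Peter likes chips and hates broccoli", space)
b = compose_by_addition("Peter hates chips and likes broccoli", space)
print(space["chips"])
print(a == b)  # True: both sentences receive the same vector
```

Since addition is commutative, any two sentences built from the same words collapse to the same vector, which is exactly the weakness discussed above.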
Giesbrecht [40] evaluates a number of advanced mathematical compositionality operations suggested by Widdows [95] on the task of multiword unit identification, making use of Word Space Models [81] and Random Indexing [76]. Our preliminary findings indicate that the more advanced compositional operators, like tensor products, lead to better results than vector addition, which is still the common operator for computing the meaning of phrases in IR. Though our results are encouraging, they suggest that just using a different mathematical operator with the same word meaning representations based on a single vector is not sufficient. This leads us to question whether VSM paradigms in their current form are suitable for modeling natural language.

To overcome the aforementioned difficulties with VSMs, we are currently experimenting with matrix-based distributional models of meaning, which employ matrices instead of vectors to represent word distributions. We thereby extend a standard VSM to a three-way tensor (see Fig. 4). The tensor offers the potential both to integrate word order information and to assign characteristic matrices to words, such that semantic composition can later be realized in a natural way via matrix multiplication.

Fig. 3 Using vector addition for sentence meaning representation

Fig. 4 From vectors to tensors
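Why matrix multiplication can preserve word order where addition cannot is easy to see in a toy setting. The 2×2 word matrices below are arbitrary illustrative values, not matrices learned from a corpus, and the sketch makes no claim about how the characteristic matrices are obtained in the model described above.

```python
from functools import reduce

def mat_add(a, b):
    """Element-wise sum of two matrices given as nested lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def mat_mul(a, b):
    """Standard matrix product of two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical word matrices, chosen only so that they do not commute.
WORD_MATRICES = {
    "likes": [[1, 1], [0, 1]],
    "hates": [[1, 0], [1, 1]],
    "chips": [[2, 0], [0, 1]],
    "broccoli": [[1, 0], [0, 2]],
}

def compose(words, operation):
    """Fold the word matrices from left to right with the given operation."""
    return reduce(operation, (WORD_MATRICES[w] for w in words))

first = "likes chips hates broccoli".split()
second = "hates chips likes broccoli".split()

print(compose(first, mat_add) == compose(second, mat_add))  # True: order is lost
print(compose(first, mat_mul), compose(second, mat_mul))    # different products
```

Matrix multiplication is not commutative in general, so swapping two words usually changes the composed representation.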
3.3 Industrial Semantic Search Applications

Originating in the diploma thesis of Eddie Moench [69], SemanticMiner® has developed from an academic approach, which combines a semantics-centered encoding of domain knowledge with data-driven analytics engines, into a successful industrial product. Its Guided Search navigates the user through the information wilderness by means of a domain knowledge model. It allows her to easily pose semantic queries to all kinds of information sources, especially unstructured documents. In the latest customer release, these queries are built automatically for the user by taking into account the working context in which she finds herself. For example, this context could be that of a field service technician standing in front of a specific machine and searching for manuals, parts lists and an ERP extract, or that of a customer in a supermarket. Data-driven semantic information integration then externalizes implicit information, so that logic-based analysis can be applied to knowledge that was previously hidden.
Fig. 5 SemanticMiner search process
SemanticMiner® (see Fig. 5) has evolved in many directions since its first release. The embeddable core technology has been integrated into solutions such as SemanticGuide (a service resolution management system), the Semantics for SharePoint suite, and Asian search engine products. It supports enterprise units in Information Retrieval tasks of all kinds, including public government sites, pharmaceutical applications and call centers, and it guides plant agents, service technicians and researchers to do their jobs better and in a shorter time. On the data-centered level, various search engines have been integrated, and SemanticMiner® runs at customers’ sites on Microsoft FAST ESP, Autonomy K2, IBM OmniFind EE, Oracle SES, Google Search Appliance and many more. Besides this lightweight integration, text analytics engines have been added to extend the domain knowledge used by the system with facts (instances, attributes and relations) on the fly. The most notable systems are IBM Cognos Content Analytics, TextToOnto and T-Rex.
4 Semantics in Text Mining

The term Text Mining (TM) was introduced by Feldman and Dagan [35] to describe a new field of data analysis. Text Mining comprises various facets, but in general it refers to the application of methods from Machine Learning to textual data. An overview of the topic is given by Hotho, Nürnberger, and Paaß [53]. Similar to data mining, TM can be defined as the application of algorithms and methods from the fields of Machine Learning (ML) and statistics to texts with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many authors use information extraction methods, NLP or some simple preprocessing steps in order to extract data from texts. TM can be used to structure document collections by clustering them according to their content, or to identify and aggregate key facts these collections contain. By definition, TM is thus concerned with getting hold of textual semantics, even if only implicitly.
Some subfields of Text Mining go beyond the detection of patterns in texts as wholes and instead focus on the extraction of factual knowledge from them, a field which is known as Information Extraction (IE). Text Mining is distinguished primarily by specific preprocessing methods which prepare the textual data for analysis by ML techniques. The results of our research show that many such applications can benefit from ontologies and Semantic Web standards as a means to formalize semantics in a well-defined way, thereby enabling richer or more to-the-point models. There are about as many approaches to TM as there are applications. This is because TM is by definition a very application-driven field of research and the applications differ in many dimensions. This section describes approaches and applications that we have worked on in the recent past. They shed light on the combination of formal and statistical approaches from various angles.
4.1 Extracting Information from the Web

With the goal of making information available on the Web accessible for machine processing, we have developed approaches for the extraction of binary relations (e.g. goesWellWith(beer,chips)) from documents on the Web. If we know the type of the target relation (e.g. things that can be consumed well together), the task becomes a matter of learning a model that decides, for a given text segment, whether an instance of the target relation is present, or that identifies the type of relation holding between given instances. We chose sets of textual patterns as models for extraction [49, 74]. Such patterns describe text sections in an underspecified way. An example pattern expressing the previously mentioned target relation would be: I enjoyed a * with some yesterday. In general, patterns are example-based underspecified descriptions of text fragments. Sections that match a pattern can be assumed to express the target relation. The patterns allow for underspecification (e.g. the ∗ wildcard) and for identifying the positions of the target information (e.g. ). The advantage of patterns, especially when compared to statistical discriminative models, lies in their explicit nature, which enables the use of efficient algorithms for pattern mining and matching. We used a Web search engine to identify pattern matches in the text. By exploiting the search engine’s index data structure, this saved us the effort of scanning the entire Web for potential instances. For mining, we made use of highly optimized Frequent Itemset Mining techniques.

Our approach to automatically learning textual patterns is based on the idea that, given a set of relation instances, we can induce good patterns by considering the contexts in which they occur on the Web and that, conversely, with the help of good patterns, new instances can be found by matching the patterns. Target information and model can thus be co-evolved in a bootstrapping manner by repeating these two steps [15, 74].
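One round of the bootstrapping loop can be sketched as follows. The slot markers `<ARG1>` and `<ARG2>`, the in-memory list of texts (standing in for Web search results) and all function names are assumptions of this sketch; the actual pattern syntax and the frequency-based pattern selection of the cited systems are not reproduced here.

```python
import re

def pattern_to_regex(pattern):
    """Compile a textual pattern: '*' matches any single word,
    '<ARG1>' and '<ARG2>' (hypothetical markers) capture the relation arguments."""
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"\w+")
        elif token == "<ARG1>":
            parts.append(r"(?P<arg1>\w+)")
        elif token == "<ARG2>":
            parts.append(r"(?P<arg2>\w+)")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)

def match_instances(patterns, texts):
    """Apply patterns (each containing both slots) and collect argument pairs."""
    found = set()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for text in texts:
            for m in regex.finditer(text):
                found.add((m.group("arg1").lower(), m.group("arg2").lower()))
    return found

def induce_patterns(instances, texts):
    """Turn every text mentioning a known instance into a candidate pattern."""
    patterns = set()
    for arg1, arg2 in instances:
        for text in texts:
            words = text.lower().rstrip(".").split()
            if arg1 in words and arg2 in words:
                patterns.add(" ".join(
                    "<ARG1>" if w == arg1 else "<ARG2>" if w == arg2 else w
                    for w in words
                ))
    return patterns

TEXTS = [  # toy stand-in for text fragments returned by a search engine
    "I enjoyed a cold beer with some chips yesterday.",
    "Nothing beats beer served with chips.",
    "Nothing beats wine served with cheese.",
]
SEEDS = {("beer", "chips")}

learned_patterns = induce_patterns(SEEDS, TEXTS)
new_instances = match_instances(learned_patterns, TEXTS)
print(sorted(new_instances))
```

Starting from one seed instance, the induced pattern from the second text also matches the third text, so the pair (wine, cheese) is found as a new instance; repeating both steps would continue the bootstrapping.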
Thereby, one challenge is to focus the search for patterns in the enormous space of all possible patterns. We need to focus on those which are more likely to be
useful. We identified the number of times a text fragment occurs in the target texts (the so-called support) as a good indicator for pattern quality [10] and were able to translate pattern induction into the problem of identifying frequent subsets of a collection of sets (Frequent Itemset Mining). This resulted in a considerable speedup compared to the exhaustive exploration of the space of possible patterns.

Clearly, looking for occurrences of frequent text fragments leaves most linguistic aspects of text interpretation untouched. To allow for the integration of formalized terminological knowledge in pattern induction and matching, we introduced typed wildcards [11]. Typed wildcards, unlike the ∗ wildcard, only match words that belong to a particular class. We organized the wildcard types in a taxonomy incorporating syntactic and semantic classes. Furthermore, we extended the induction algorithm to work with typed wildcards. These taxonomic patterns are able to improve both the precision and the recall of the extracted information. Apart from excluding undesired matches caused by overly general wildcards, it may be possible to find more taxonomic patterns than classical patterns without typed wildcards for a given relation. This is because grouping words by types may reveal additional relevant commonalities among text fragments.

Overall, this research represents a successful combination of data-driven and semantic text analysis. The approach is inherently data-driven but has the goal of deriving semantic knowledge. Besides producing knowledge, this method is also able to integrate taxonomies into the pattern learning process. Our research has demonstrated that considering taxonomic knowledge is indeed beneficial [11].
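The effect of typed wildcards on pattern support can be illustrated with a small sketch. The taxonomy, the `*:Type` notation for typed wildcards and the word-by-word matcher are hypothetical simplifications; the Frequent Itemset Mining step used for induction is not shown.

```python
# Hypothetical taxonomy: each word or type points to its parent type.
TAXONOMY = {
    "beer": "Beverage", "wine": "Beverage",
    "chips": "Snack", "cheese": "Snack",
    "Beverage": "Consumable", "Snack": "Consumable",
}

def has_type(word, type_name):
    """True if the word belongs to type_name directly or via a super-type."""
    node = word.lower()
    while node in TAXONOMY:
        node = TAXONOMY[node]
        if node == type_name:
            return True
    return False

def matches(pattern, sentence):
    """Word-by-word match; '*' is untyped, '*:Type' is a typed wildcard."""
    p_tokens = pattern.split()
    s_tokens = sentence.rstrip(".").split()
    if len(p_tokens) != len(s_tokens):
        return False
    for p, s in zip(p_tokens, s_tokens):
        if p == "*":
            continue
        if p.startswith("*:"):
            if not has_type(s, p[2:]):
                return False
        elif p.lower() != s.lower():
            return False
    return True

def support(pattern, sentences):
    """Number of sentences in which the pattern occurs."""
    return sum(matches(pattern, s) for s in sentences)

SENTENCES = [
    "I had beer with chips.",
    "I had wine with cheese.",
    "I had lunch with Peter.",
]

print(support("I had beer with chips", SENTENCES))            # literal fragment
print(support("I had * with *", SENTENCES))                   # overly general
print(support("I had *:Beverage with *:Snack", SENTENCES))    # typed wildcards
```

In this toy setting the untyped pattern also matches the unrelated third sentence, whereas the typed pattern generalizes over both relevant sentences and excludes the undesired match.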
4.2 Text Clustering and Classification with Semantic Background Knowledge

The classification of data items, i.e. their automatic assignment to pre-defined and suitable classes, as well as the clustering of data items, i.e. their automatic grouping according to similarity, are classical ML tasks. The relevance of these types of learning problems for TM stems from the plethora of useful applications that can be built upon a successful automatic classification or clustering of textual items such as news texts.
4.2.1 Incorporating Background Knowledge in Text Representations

Similar to the IR setting discussed in Sect. 3, documents are typically represented as so-called Bag-of-Words (BoW) vectors as originally proposed by Salton and McGill [78]. Subsequent learning algorithms operate in the resulting vector space, whose number of dimensions equals the number of distinct words in the corpus. Ontological background knowledge can be incorporated into the vector space model by applying additional preprocessing steps. After deriving the typical BoW representation, the vector dimensions are mapped to concepts of a given ontology or knowledge base. This approach differs from Explicit Semantic Analysis (ESA, presented in Sect. 3) in that each dimension, i.e. each term in the BoW space, is mapped independently to the concept space. The mapping is defined only by background knowledge and does not depend on the context of a specific document. In contrast, all dimensions of the ESA concept vectors depend on all the terms in a document.

Enriching the term vectors with explicit concepts from the ontology has two benefits. First, it can successfully deal with synonyms by mapping words to concepts. Second, it introduces more general concepts, which help to identify related topics and establish a connection between documents that deal with the same topic but feature different words. Adding hypernyms (super-concepts) allows for relating very similar topics which are the content of different documents but which a user would expect in the same cluster. When different words of the vector are mapped to the same (super-)concept, the same or a very similar topic receives a common representation, and the learning algorithm should be better able to group such documents together. For instance, a document about broccoli may not be related to a document about cauliflower by the clustering algorithm if there are only “broccoli” and “cauliflower” in the term vectors. But if the more general concept vegetable is added to both documents, their semantic relationship is revealed. Adding too many super-concepts, however, introduces noise, which results in a drop in performance because topics become related that do not have much in common.

We have investigated the influence of three different strategies for adding concepts to terms or replacing terms by concepts on the clustering and classification performance [52]. By mapping words to concepts, a new concept vector is computed. The first strategy uses all available information by performing the mining on both vectors together. The second strategy removes all words from the word vector that could be mapped to a concept. The last strategy bases the analysis only on the concept vector. In the following, we focus on some of our own results that use ontologies to improve clustering and classification tasks [4, 7, 52].
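The three strategies can be sketched on the broccoli and cauliflower example. The word-to-concept mapping, the `C:` prefix for concept features and the strategy labels "replace" and "only" are naming assumptions of this sketch; word sense disambiguation and super-concept expansion are omitted.

```python
from collections import Counter

# Hypothetical mapping from words to ontology concepts.
WORD_TO_CONCEPT = {"broccoli": "vegetable", "cauliflower": "vegetable"}

def concept_vector(term_vector):
    """Aggregate term weights on the concepts the terms are mapped to."""
    concepts = Counter()
    for term, weight in term_vector.items():
        if term in WORD_TO_CONCEPT:
            concepts["C:" + WORD_TO_CONCEPT[term]] += weight
    return concepts

def represent(term_vector, strategy):
    """Build the document representation for one of the three strategies."""
    concepts = concept_vector(term_vector)
    if strategy == "add":       # words and concepts together
        return Counter(term_vector) + concepts
    if strategy == "replace":   # drop every word that could be mapped to a concept
        kept = Counter({t: w for t, w in term_vector.items() if t not in WORD_TO_CONCEPT})
        return kept + concepts
    if strategy == "only":      # concept vector alone
        return concepts
    raise ValueError(f"unknown strategy: {strategy}")

doc_a = Counter("fresh broccoli recipe".split())
doc_b = Counter("steamed cauliflower recipe".split())

for strategy in ("add", "replace", "only"):
    shared = set(represent(doc_a, strategy)) & set(represent(doc_b, strategy))
    print(strategy, sorted(shared))
```

Under every strategy the two documents now share the concept feature for vegetable, which plain BoW vectors would not reveal.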
4.2.2 Semantics in Text Clustering

Text document clustering methods can be used to find groups of documents with similar content. The result of a clustering is typically a partition of the set of documents. The quality of a clustering is usually considered better the more similar the contents of the documents within one cluster are and the more dissimilar the contents of different clusters are. A good survey is given by Jain, Murty, and Flynn [56], including a discussion of the performance of different clustering approaches.

We illustrate the integration of background knowledge into the text clustering process by results of Hotho et al. [52] and Hotho [50] using a variant of the popular k-means clustering algorithm. In these experiments, we applied the usual preprocessing steps to the Reuters-21578 corpus, the FAODOC corpus and a small Java corpus. We used WordNet [68] as a lexical ontology of the English language, as it not only provides a morphological component which significantly improves the preprocessing but also contains synonymy, hypernym/super-concept and frequency information about polysemous words. The main outcomes of our experiments were the following: TFIDF weighting improves the text clustering performance significantly and is also helpful for integrating the background knowledge, as it gives a good weight to the concepts. Word sense disambiguation is necessary during the mapping of words to concepts. There are indications that the “add strategy”, which uses both words and concepts equally, outperforms all other integration strategies. The integration of super-concepts into the concept vector additionally improves the performance of the text clustering approach.

Background knowledge can improve more than the performance of Text Clustering. The integration of super-concepts also provides a very good basis for visualizing clusterings. Hotho, Staab, and Stumme [51] use Formal Concept Analysis (FCA) to compute the visualization. The resulting concept lattice makes the exploration of a new corpus easier than inspecting unrelated clusters, as it provides a good overview of the different topics of the corpus by relating clusters to each other. High-level concepts from the ontology are used to describe the commonalities of different clusters. The structure of the lattice also helps to drill down to very specific clusters while maintaining a clear relation to a major topic.
4.2.3 Semantics in Text Classification

Text Classification refers to the automatic process of learning a model based on a given set of training examples with the goal of predicting the class or topic a new text document belongs to. Meanwhile, more advanced ML approaches like Support Vector Machines (SVMs) or Boosting show very impressive Text Classification performance. A good survey is presented by Sebastiani [82].

In this section, we report on our work which follows the main idea of integrating formally represented knowledge into the learning step with the goal of improving the prediction performance. We follow the presentation of our work given by Bloehdorn and Hotho [4], where we showed how background knowledge in the form of simple ontologies can improve Text Classification results by directly addressing the problems of multiword expressions, synonymous words, polysemous words, and the lack of generalization. We used a hybrid approach for document representation based on the common term stem representation, which is enhanced with concepts extracted from the ontologies used, as in the Text Clustering setup introduced above. For the actual classification, we proposed the use of the AdaBoost algorithm with decision stumps as base classifiers, which has been shown to produce accurate classification results in many experimental evaluations and seems to be well suited to integrating different types of features. Evaluation experiments on three text corpora, namely the Reuters-21578, OHSUMED and FAODOC collections, showed that our approach leads to improvements in all cases. We also showed that in most cases the improvement can be traced back to two distinct effects, one situated mainly on the lexical level (e.g. detection of multiword expressions) and the other being generalization on the conceptual level (resolving synonyms and adding super-concepts). Along a similar line of thought, Bloehdorn, Basili, Cammisa, and Moschitti [6] incorporate the transformed vector space implicitly through the use of kernel functions in a classification setting based on SVMs (see footnote 6).
4.2.4 Text Mining with Automatically Learned Ontologies

So far, the ontological structures employed for the classification and clustering tasks have been created manually by knowledge engineers, which requires a high initial modeling effort. Research on Ontology Learning as discussed in Sect. 2 has started to address this problem by developing methods for the automatic construction of conceptual structures out of large text corpora, mostly in an unsupervised process. To reduce the modeling effort, the next step is to first learn an ontology from text which closely matches the topics of the corpus and then add this newly extracted knowledge to the mining process as described in the previous sections. This approach was taken by Bloehdorn et al. [7], where we compared results both (i) to the baseline given by the BoW representation alone and (ii) to results based on the MeSH (Medical Subject Headings) Tree Structures as a manually engineered medical ontology. We could show that conceptual feature representations based on a combination of learned and manually constructed ontologies outperformed the BoW model, and that results based on the automatically constructed ontologies are highly competitive with those of the manually engineered MeSH Tree Structures.
Footnote 6: Further extensions such as those by Bloehdorn and Moschitti [5] combine this idea with more complex so-called tree kernel functions for text structure.

4.3 New Event Detection

New Event Detection presents an interesting application scenario for the combination of data-driven and semantic approaches in text mining. The problem is motivated by applications where news analysis is needed, as in financial market analysis. In the envisioned scenarios, manual processing of the documents is not an option, as huge amounts of information have to be processed in a timely manner. In this context, the problem of New Event Detection has been studied extensively. The task consists in finding the first story reporting on an event in a stream of news. The problem has to be solved in an online manner, meaning that each text has to be labeled as new or old before future texts are available. In terms of ML tasks, the problem can either be treated as a classification problem (label texts either as new or as old) or as a clustering problem (each event is represented as a cluster, the first story of a cluster should be labeled as new, the others as old), where the latter view is prevailing. The standard approach for solving this problem is to search for the most similar text that has already been clustered.
If the similarity between the two texts exceeds a certain threshold, the new text is assumed to report on the same event as its most similar document; otherwise it is placed in a new cluster [73].

The classical BoW model poses several problems in this task. First, texts from the same source tend to be more similar to each other than texts from different sources. This is due to the specific language used in each source. Second, news stories reporting on the same kind of event tend to be very similar, although they report on different events. Texts on different events of the same type differ mainly in the named entities that are involved. In order to overcome these problems, it seems beneficial to use a representation of textual content which abstracts from the actual wording in the clustering task.

We take an approach similar to the one presented in Sect. 4.2.1: we include semantic annotations for entities and the relations among them as an alternative document representation. This additional information is obtained using the OpenCalais service (see footnote 7), which takes plain text as input and returns annotations of entities and relations among them in the form of an RDF (Resource Description Framework) graph. One challenge of this approach consists in determining similarities between texts using information from the annotation graph. We extract features from the graph which can then be used as an extension of the classical BoW model. The motivation for this is that the annotation graph cannot represent the whole content of the text. Instead, entities, entity types and triples from the annotation graph are used as additional features in the BoW model. In this way, a new document representation is generated which combines term features with features that try to represent the content of the text.
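The threshold-based online procedure and the effect of entity features can be sketched as follows. The entity annotations are supplied by hand here as a stand-in for the output of an annotation service, and the `ENT:` feature prefix, the unweighted counts and the threshold value are assumptions of this sketch.

```python
import math
from collections import Counter

def features(text, entities=()):
    """BoW term counts, extended by one feature per annotated entity."""
    feats = Counter(text.lower().split())
    for entity in entities:
        feats["ENT:" + entity] += 1
    return feats

def cosine(u, v):
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def detect_new_events(stream, threshold):
    """Label each story as 'new' or 'old' before seeing any later story."""
    seen, labels = [], []
    for text, entities in stream:
        vector = features(text, entities)
        best = max((cosine(vector, old) for old in seen), default=0.0)
        labels.append("old" if best > threshold else "new")
        seen.append(vector)
    return labels

STREAM = [  # (story text, hand-made entity annotations)
    ("earthquake hits city", ["Tokyo"]),
    ("strong earthquake hits city", ["Tokyo"]),
    ("earthquake hits city", ["Lima"]),
]

with_entities = detect_new_events(STREAM, threshold=0.8)
terms_only = detect_new_events([(text, []) for text, _ in STREAM], threshold=0.8)
print(with_entities)
print(terms_only)
```

With term features alone, the third story looks identical to the first and is labeled as old, while the added entity feature separates the two events of the same type.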
7 http://www.opencalais.com.

5 Conclusion

In this chapter we have argued that the dichotomy between inductive/data-driven and semantic approaches is a false one. We motivated this claim by introducing two fallacies: the open domain fallacy and the data-is-everything fallacy. Limiting tasks to specific domains reduces the costs for top-down modeling and enables the use of semantic approaches on tasks that have been mostly solved using data-driven approaches. Further, labeled data required for data-driven approaches might not be available for specific domains, which makes top-down approaches preferable in those cases. We presented a number of tasks where the two paradigms, data-driven and semantic, are naturally combined, clearly resulting in an added benefit. This includes tasks in the field of Ontology Learning that cover data-driven techniques, for example lexico-syntactic patterns, as well as semantic techniques, for example relational exploration. Further, we presented approaches to Information Retrieval that combine statistical term measures (data-driven) with conceptual knowledge, resulting in representations of documents in concept spaces. Finally, we described our approaches to Text Mining that, for example, improve data-driven approaches using taxonomies to refine extraction patterns.

Given that this is a Festschrift dedicated to Rudi Studer, we have regarded this chapter as a good opportunity to reflect on the work in this area carried out under his auspices. Rudi Studer has always regarded the combination of semantics and data-driven techniques as a crucial topic in his group. The contributions and work summarized in this chapter clearly corroborate this. Overall, we would all like to thank Rudi for his constant support and dedicated supervision of our work as well as for his seminal contributions to the field, which have always inspired us.
References

1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD Conference, pp. 207–216 (1993)
2. Basili, R., Moschitti, A., Pazienza, M.T., Zanzotto, F.M.: A contrastive approach to term extraction. In: Proceedings of the 4th Terminology and Artificial Intelligence Conference (TIA), May, pp. 119–128 (2001)
3. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American (May Issue) (2001)
4. Bloehdorn, S., Hotho, A.: Text classification by boosting weak learners based on terms and concepts. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM)
5. Bloehdorn, S., Moschitti, A.: Combined syntactic and semantic kernels for text classification. In: Amati, G., Carpineto, C., Romano, G. (eds.) Proceedings of the 29th European Conference on Information Retrieval (ECIR), Rome, Italy, pp. 307–318. Springer, Berlin (2007)
6. Bloehdorn, S., Basili, R., Cammisa, M., Moschitti, A.: Semantic kernels for text classification based on topological measures of feature similarity. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China. IEEE Comput. Soc., Los Alamitos (2006)
7. Bloehdorn, S., Cimiano, P., Hotho, A.: Learning ontologies to improve text clustering and classification. In: Spiliopoulou, M., Kruse, R., Nürnberger, A., Borgelt, C., Gaul, W. (eds.) Proceedings of the 29th Annual Conference of the German Classification Society (GfKl), Magdeburg, Germany, 2005, pp. 334–341. Springer, Berlin (2006)
8. Bloehdorn, S., Cimiano, P., Duke, A., Haase, P., Heizmann, J., Thurlow, I., Völker, J.: Ontology-based question answering for digital libraries. In: Proceedings of the 11th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL), September 2007. Lecture Notes in Computer Science, vol. 4675. Springer, Berlin (2007). ISBN 978-3540-74850-2
9. Blohm, S., Cimiano, P.: Using the web to reduce data sparseness in pattern-based information extraction. In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Warsaw, Poland, pp. 18–29. Springer, Berlin (2007)
10. Blohm, S., Cimiano, P., Stemle, E.: Harvesting relations from the web - quantifying the impact of filtering functions. In: Proceedings of the 22nd Conference on Artificial Intelligence (AAAI), pp. 1316–1323. AAAI Press, Menlo Park (2007)
11. Blohm, S., Buza, K., Cimiano, P., Schmidt-Thieme, L.: Relation extraction for the semantic web with taxonomic sequential patterns. In: Sugumaran, V., Gulla, J.A. (eds.) Applied Semantic Web Technologies. Taylor & Francis, London (2011, to appear)
12. Bonino, D., Corno, F.: Self-similarity metric for index pruning in conceptual vector space models. In: DEXA Workshops, pp. 225–229. IEEE Comput. Soc., Los Alamitos (2008)
13. Brants, T., Popat, A., Xu, P.J.D., Och, F.J.: Large language models in machine translation. In: Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2007)
14. Brewster, C., Ciravegna, F., Wilks, Y.: Background and foreground knowledge in dynamic ontology construction. In: Proceedings of the SIGIR Semantic Web Workshop (2003)
15. Brin, S.: Extracting patterns and relations from the world wide web. In: Selected Papers from the International Workshop on the World Wide Web and Databases (WebDB), London, UK, pp. 172–183. Springer, Berlin (1999). ISBN 3-540-65890-4
16. Brunzel, M.: The XTREEM methods for ontology learning from web documents. In: Buitelaar, P., Cimiano, P. (eds.) Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, January. Frontiers in Artificial Intelligence and Applications, vol. 167, pp. 3–26. IOS Press, Amsterdam (2008)
17. Buitelaar, P., Cimiano, P., Magnini, B.: Ontology learning from Text: Methods, Evaluation and Applications, July. Frontiers in Artificial Intelligence, vol. 123. IOS Press, Amsterdam (2005)
18. Chodorow, M., Byrd, R.J., Heidorn, G.E.: Extracting semantic hierarchies from a large on-line dictionary. In: Proceedings of the 23rd Annual Meeting on Association for Computational Linguistics (ACL), pp. 299–304. Association for Computational Linguistics, Stroudsburg (1985)
19. Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, Berlin (2006). ISBN 978-0-387-30632-2
20. Cimiano, P.: Ontology learning and population from text. PhD thesis, Universität Karlsruhe (TH), Germany (2006)
21. Cimiano, P., Völker, J.: Text2Onto - a framework for ontology learning and data-driven change discovery. In: Montoyo, A., Munoz, R., Metais, E. (eds.) Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), Alicante, Spain, June. Lecture Notes in Computer Science, vol. 3513, pp. 227–238. Springer, Berlin (2005)
22. Cimiano, P., Wenderoth, J.: Automatic acquisition of ranked qualia structures from the web. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), June, pp. 888–895 (2007)
23. Cimiano, P., Handschuh, S., Staab, S.: Towards the self-annotating web. In: Proceedings of the 13th International World Wide Web Conference (WWW), May, pp. 462–471. ACM, New York (2004). ISBN 1-58113-844-X
24. Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text. In: de Mántaras, R.L., Saitta, L. (eds.) Proceedings of the 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, pp. 435–439. IOS Press, Amsterdam (2004). ISBN 1-58603-452-9
25. Cimiano, P., Hotho, A., Staab, S.: Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research 24, 305–339 (2005)
26. Cimiano, P., Ladwig, G., Staab, S.: Gimme the context: context-driven automatic semantic annotation with C-PANKOW. In: Ellis, A., Hagino, T. (eds.) Proceedings of the 14th International World Wide Web Conference (WWW), Chiba, Japan, May, pp. 332–341. ACM, New York (2005)
27. Cimiano, P., Pivk, A., Schmidt-Thieme, L., Staab, S.: Learning taxonomic relations from heterogeneous sources of evidence. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications, July. Frontiers in Artificial Intelligence, vol. 123, pp. 59–73. IOS Press, Amsterdam (2005)
28. Cimiano, P., Schultz, A., Sizov, S., Sorg, P., Staab, S.: Explicit versus latent concept models for cross-language information retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1513–1518 (2009)
29. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
30. Drouin, P.: Detection of domain specific terminology using corpora comparison. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pp. 79–82. European Language Resources Association, Paris (2004)
31. Drumm, C., Schmitt, M., Do, H.H., Rahm, E.: Quickmig: automatic schema matching for data migration projects. In: CIKM, pp. 107–116 (2007)
32. Dumais, S., Letsche, T., Littman, M., Landauer, T.: Automatic cross-language retrieval using latent semantic indexing. In: Proceedings of the AAAI Symposium on Cross-Language Text and Speech Retrieval (1997)
33. Ehrig, M.: Ontology Alignment: Bridging the Semantic Gap. Semantic Web and Beyond: Computing for Human Experience, vol. 4. Springer, Berlin (2007). ISBN 978-0-387-36501-5
34. Evans, R.: A framework for named entity recognition in the open domain. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pp. 137–144 (2003)
35. Feldman, R., Dagan, I.: Knowledge discovery in texts (KDT). In: Fayyad, U.M., Uthurusamy, R. (eds.) Proceedings of the First International Conference on Knowledge Discovery (KDD 1996), Montreal, Quebec, Canada, August 20–21, pp. 112–117. AAAI Press, Menlo Park (1995)
36. Fellbaum, C.: WordNet. An Electronic Lexical Database. MIT Press, Cambridge (1998)
37. Firth, J.R.: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis, pp. 1–32 (1957)
38. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1606–1611 (2007)
39. Gärdenfors, P.: Conceptual Spaces: The Geometry of Thought. MIT Press, London (2000)
40. Giesbrecht, E.: In search of semantic compositionality in vector spaces. In: ICCS, pp. 173–184 (2009)
41. Gonzalo, J., Verdejo, F., Chugur, I., Cigarran, J.: Indexing with WordNet synsets can improve text retrieval. In: Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, Canada, pp. 38–44 (1998)
42. Guthrie, L., Slator, B.M., Wilks, Y., Bruce, R.: Is there content in empty heads? In: Proceedings of the 13th Conference on Computational Linguistics (COLING), Morristown, NJ, USA, pp. 138–143. Association for Computational Linguistics, Stroudsburg (1990). ISBN 952-902028-7
43. Haase, P., Völker, J.: Ontology learning and reasoning - dealing with uncertainty and inconsistency. In: da Costa, P.C.G., d’Amato, C., Fanizzi, N., Laskey, K.B., Laskey, K.J., Lukasiewicz, T., Nickles, M., Pool, M. (eds.) Uncertainty Reasoning for the Semantic Web I. Lecture Notes in Artificial Intelligence, vol. 5327. Springer, Berlin (2008). ISBN 978-3-540-89764-4. ISWC International Workshop, URSW 2005–2007. Revised Selected and Invited Papers
44. Haase, P., Schnizler, B., Broekstra, J., Ehrig, M., Harmelen, F., Mika, M., Plechawski, M., Pyszlak, P., Siebes, R., Staab, S., Tempich, C.: Bibster - a semantics-based bibliographic peer-to-peer system. Journal of Web Semantics 2(1), 99–103 (2005)
45. Haase, P., Stojanovic, N., Sure, Y., Völker, J.: Personalized information retrieval in bibster, a semantics-based bibliographic peer-to-peer system. In: Tochtermann, K., Maurer, H. (eds.) Proceedings of the 5th International Conference on Knowledge Management (I-KNOW), July, pp. 104–111 (2005). JUCS, July
46. Halevy, A.Y., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8–12 (2009)
47. Hall, J., Nilsson, J., Nivre, J., Megyesi, B., Nilsson, M., Saers, M.: Single malt or blended? A study in multilingual parser optimization. In: Proc. of the Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
48. Harris, Z.: Linguistic transformations for information retrieval. In: Proceedings of the International Conference on Scientific Information, vol. 2, Washington, DC (1959)
49. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, vol. 2. Association for Computational Linguistics, Stroudsburg (1992)
50. Hotho, A.: Clustern Mit Hintergrundwissen. Dissertationen zur Künstlichen Intelligenz, vol. 286. Akademische Verlagsgesellschaft, Berlin (2004). In German. Originally published as PhD thesis, Universität Karlsruhe (TH), Karlsruhe, Germany (2004)
51. Hotho, A., Staab, S., Stumme, G.: Explaining text clustering results using semantic structures. In: Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22–26, 2003. Lecture Notes in Computer Science, pp. 217–228. Springer, Berlin (2003)
52. Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Proc. of the ICDM 03, The 2003 IEEE International Conference on Data Mining, pp. 541–544 (2003)
53. Hotho, A., Nürnberger, A., Paaß, G.: A brief survey of text mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology 20(1), 19–62 (2005). ISSN 0175-1336
54. Jaimes, A., Smith, J.R.: Semi-automatic, data-driven construction of multimedia ontologies. In: Proceedings of the International Conference on Multimedia and Expo (ICME), Washington, DC, USA, pp. 781–784. IEEE Comput. Soc., Los Alamitos (2003). ISBN 0-7803-7965-9
55. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River (1988)
56. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999)
57. Jäschke, R., Hotho, A., Schmitz, C., Ganter, B., Stumme, G.: Discovering shared conceptualizations in folksonomies. Journal of Web Semantics 6(1), 38–53 (2008). ISSN 1570-8268
58. Kashyap, V., Ramakrishnan, C., Thomas, C., Sheth, A.: TaxaMiner: an experimentation framework for automated taxonomy bootstrapping. International Journal of Web and Grid Services 1(2), 240–266 (2005). ISSN 1741-1106
59. Katz, S.M., Gauvain, J.L., Lamel, L.F., Adda, G., Mariani, J.: Estimation of probabilities from sparse data for the language model component of a speech recognizer. International Journal of Pattern Recognition and Artificial Intelligence 8 (1987)
60. Kavalec, M., Svátek, V.: A study on automated relation labelling in ontology learning. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications, vol. 123, pp. 44–58. IOS Press, Amsterdam (2005)
61. Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 211–240 (1997)
62. Li, M., Du, X.-y., Wang, S.: Learning ontology from relational database. In: Proceedings of the 4th International Conference on Machine Learning and Cybernetics, pp. 3410–3415 (2005)
63. Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 203–220 (1996)
64. Mädche, A.: Ontology learning for the semantic web. PhD thesis, Universität Karlsruhe (TH), Germany (2001)
65. Mädche, A., Staab, S.: Discovering conceptual relations from text. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), August, pp. 321–325. IOS Press, Amsterdam (2000)
66. Mädche, A., Volz, R.: The text-to-onto ontology extraction and maintenance system. In: Workshop on Integrating Data Mining and Knowledge Management at the 1st International Conference on Data Mining (ICDM) (2001)
67. Meilicke, C., Völker, J., Stuckenschmidt, H.: Debugging mappings between lightweight ontologies. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW), September. Lecture Notes in Artificial Intelligence, pp. 93–108. Springer, Berlin (2008). Best Paper Award!
68. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
69. Moench, E., Ullrich, M., Schnurr, H.-P., Angele, J.: Semanticminer - ontology-based knowledge retrieval. Journal of Universal Computer Science 9(7), 682–696 (2003)
70. Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Working Notes of the Annual CLEF Meeting (2008)
71. Newbold, N., Vrusias, B., Gillam, L.: Lexical ontology extraction using terminology analysis: automating video annotation. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the 6th International Language Resources and Evaluation (LREC), Marrakech, Morocco, May. ELRA, Paris (2008)
72. Ogata, N., Collier, N.: Ontology express: statistical and non-monotonic learning of domain ontologies from text. In: Proceedings of the Workshop on Ontology Learning and Population (OLP) at the 16th European Conference on Artificial Intelligence (ECAI), August (2004)
73. Papka, R., Allan, J.: On-line new event detection using single pass clustering. Technical report, University of Massachusetts, Amherst, MA, USA (1998)
74. Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level bootstrapping. In: AAAI ’99/IAAI ’99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference Innovative Applications of Artificial Intelligence, pp. 474–479. American Association for Artificial Intelligence, Menlo Park (1999). ISBN 0-262-51106-1
75. Sabou, M.: Building web service ontologies. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands (2006)
76. Sahlgren, M.: An introduction to random indexing. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (2005)
77. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5), 513–523 (1988)
78. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
79. Sanchez, D.: Domain ontology learning from the web. PhD thesis, Universitat Politècnica de Catalunya, Spain (2007)
80. Schmitz, C., Hotho, A., Jäschke, R., Stumme, G.: Mining association rules in folksonomies. In: Batagelj, V., Bock, H.-H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification (Proc. IFCS 2006 Conference), Ljubljana, July. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 261–270. Springer, Berlin (2006). ISBN 978-3-540-34415-5. doi:10.1007/3-540-34416-0_28
81. Schütze, H.: Word space. In: Hanson, S., Cowan, J., Giles, C. (eds.) Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo (1993)
82. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
83. Simperl, E., Tempich, C., Vrandečić, D.: A methodology for ontology learning. In: Buitelaar, P., Cimiano, P. (eds.) Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, January. Frontiers in Artificial Intelligence and Applications, vol. 167, pp. 225–249. IOS Press, Amsterdam (2008)
84. Sorg, P., Cimiano, P.: Cross-lingual information retrieval with explicit semantic analysis. In: Working Notes of the Annual CLEF Meeting (2008)
85. Sorg, P., Cimiano, P.: An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In: Proceedings of 14th International Conference on Applications of Natural Language to Information Systems (NLDB), Saarbrücken (2009)
86. Stojanovic, N.: On the role of the librarian agent in ontology-based knowledge management systems. Journal of Universal Computer Science 9(7), 697–718 (2003)
87. Sure, Y., Hitzler, P., Eberhart, A., Studer, R.: The semantic web in one day. IEEE Intelligent Systems 20(3), 85–87 (2005). ISBN 1541-1672. doi:10.1109/MIS.2005.54
88. Völker, J.: Learning expressive ontologies. PhD thesis, Universität Karlsruhe (TH), Germany (2008)
89. Völker, J., Rudolph, S.: Lexico-logical acquisition of OWL DL axioms - an integrated approach to ontology refinement. In: Medina, R., Obiedkov, S. (eds.) Proceedings of the 6th International Conference on Formal Concept Analysis (ICFCA), February. Lecture Notes in Artificial Intelligence, vol. 4933, pp. 62–77. Springer, Berlin (2008)
90. Völker, J., Rudolph, S.: Fostering web intelligence by semi-automatic OWL ontology refinement. In: Proceedings of the 7th International Conference on Web Intelligence (WI), December. IEEE Press, New York (2008). Regular paper
91. Völker, J., Vrandečić, D., Sure, Y.: Automatic evaluation of ontologies (AEON). In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) Proceedings of the 4th International Semantic Web Conference (ISWC), November. Lecture Notes in Computer Science, vol. 3729, pp. 716–731. Springer, Berlin (2005)
92. Völker, J., Hitzler, P., Cimiano, P.: Acquisition of OWL DL axioms from lexical resources. In: Franconi, E., Kifer, M., May, W. (eds.) Proceedings of the 4th European Semantic Web Conference (ESWC), June. Lecture Notes in Computer Science, vol. 4519, pp. 670–685. Springer, Berlin (2007)
93. Völker, J., Vrandečić, D., Sure, Y., Hotho, A.: Learning disjointness. In: Franconi, E., Kifer, M., May, W. (eds.) Proceedings of the 4th European Semantic Web Conference (ESWC), June. Lecture Notes in Computer Science, vol. 4519, pp. 175–189. Springer, Berlin (2007)
94. Völker, J., Vrandečić, D., Sure, Y., Hotho, A.: AEON - an approach to the automatic evaluation of ontologies. Journal of Applied Ontology 3(1–2), 41–62 (2008). Special Issue on Ontological Foundations of Conceptual Modeling
95. Widdows, D.: Semantic vector products: some initial investigations. In: Proceedings of the Second AAAI Symposium on Quantum Interaction (QI) (2008)
From Semantic Web Mining to Social and Ubiquitous Mining
A Subjective View on Past, Current, and Future Research

Andreas Hotho and Gerd Stumme

A. Hotho, Data Mining and Information Retrieval Group, University of Würzburg, 97074 Würzburg, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_8, © Springer-Verlag Berlin Heidelberg 2011

Abstract Web mining is the application of data mining techniques to the Web. Over the past eight years, we have followed this line of research within two growing subareas of the Web: the Semantic Web and the Social Web. In this paper, we recall our key observations and discuss the next upcoming trend: the application of data mining to the Ubiquitous Web.

1 Introduction

Some years after the rise of the Semantic Web as a research topic, Tim O’Reilly [14] initiated a discussion about the next generation of the World Wide Web, which he called “Web 2.0”. Different concepts fell under this notion, and it was not clear at the beginning whether this was the seed for extensive growth or just a flash in the pan. As one can see now, the Web 2.0 (also called the Social Web), together with mobile devices, is about to tremendously influence the way humans interact socially.

Our research focus is on the adaptation of information retrieval, data, text, and web mining methods to new domains. We were thus among the first to study potential interactions of mining approaches with the Semantic Web and the Social Web. Currently, we are extending this scope to mobile applications, leading to the Ubiquitous Web.

According to Fayyad et al., Data Mining is “the nontrivial process of identifying valid, previously unknown, and potentially useful patterns” [7] in a potentially very large amount of data. Web Mining is the application of data mining techniques to the content, structure, and usage of resources on the Web [9]. To this end, a wide range of general data mining techniques, in particular association rule discovery, clustering, classification, and sequence mining, has been employed and developed further to reflect the specific structures of Web resources and the specific questions posed in Web mining.
Web content mining analyzes the content of Web resources. Today, it is mostly a form of text mining. Recent advances in multimedia data mining promise to extend access to the image, sound, and video content of Web resources as well. The primary Web resources that are mined in Web content mining are individual pages.

Web structure mining usually operates on the hyperlink structure of Web pages. Mining focuses on sets of pages, ranging from a single Web site to the Web as a whole. Web structure mining exploits the additional information that is (often implicitly) contained in the structure of hypertext. Therefore, an important application area is the identification of the relative relevance of different pages that appear equally pertinent when analyzed with respect to their content in isolation.

Web usage mining focuses on records of the requests made by visitors to a Web site, most often collected in a Web server log or by small pieces of JavaScript. The content and structure of Web pages, and in particular those of one Web site, reflect the intentions of the authors and designers of the pages and the underlying information architecture. The actual behavior of the users of these resources may reveal additional structure.

These three approaches to Web Mining can be adapted to the Semantic Web, the Social Web, and the Ubiquitous Web, as we will see below.

Semantic Web Mining is the combination of the two areas Semantic Web and Web Mining (cf. [12]). The results of Web Mining can be improved by exploiting (the new) semantic structures in the Web, and Web Mining techniques can be used for building the Semantic Web. These techniques can also be used for mining the Semantic Web itself.

Social Web Mining is the application of mining techniques to any kind of Web 2.0 system. Although Web 2.0 systems are, as the name suggests, still web applications and the analysis of such systems could be subsumed under the term web mining, new challenges for data mining emerge, as new structures and new data can be found in such systems. Social Web Mining is in line with the general idea of Semantic Web Mining. For instance, ontology learning based on data from social applications is an instantiation of Semantic [Web Mining]. On the other hand, as parts of the social web, in particular folksonomies, can be considered a weak form of knowledge representation, analyzing their data is an instantiation of [Semantic Web] Mining.

Ubiquitous Web Mining can be seen as the application of data mining and machine learning to the Ubiquitous Web. With mobile devices becoming more and more powerful, the Web is always at your fingertips and starts amalgamating with the real world. Simultaneously, these devices carry an increasing variety of sensors, so that information about the real world is fed back to the Web in real time. Developing data mining algorithms for the Ubiquitous Web means dealing with heterogeneous data sources (ranging from humans to sensors) which may contradict each other. Another challenge is the development and application of algorithms that can be run on mobile devices with limited resources. Networks formed by sensors and/or humans raise research questions similar to those for the Social Web.

The structure of the paper is as follows: In Sects. 2 and 3, we describe our previous work on Semantic Web Mining and on Social Web Mining. In Sect. 4, we sketch possible steps towards Ubiquitous Web Mining. Since these three topics span a rather broad range of research, we abstain from discussing related work in detail (which would require three separate state-of-the-art surveys) and instead refer to the publications [2, 4, 6, 8, 12], where related work is discussed in great detail. In Sect. 5, we conclude with a discussion of future research trends.
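The idea behind Web structure mining mentioned in this introduction, ranking pages that look equally pertinent by content through their hyperlink structure, can be illustrated with a small sketch. The chapter does not name a particular algorithm; the PageRank-style power iteration below is one well-known choice and is used here purely as an assumption. The function name, the damping value and the toy link graph are hypothetical.

```python
def link_relevance(links, damping=0.85, iterations=50):
    """PageRank-style scores for a graph given as {page: [linked pages]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page passes its score on to the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page without outgoing links spreads its score evenly
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented toy site: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_relevance(links)
```

In this toy graph, page "c" is linked from three pages and obtains the highest score, and page "a" inherits relevance from "c", whereas "b" and "d" receive no links and stay at the baseline.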
2 Semantic Web Mining

The two fast-developing research areas Semantic Web and Web Mining both build upon the success of the World Wide Web (WWW). They complement each other well because each addresses one part of a new challenge posed by the great success of the current WWW: most data on the Web are so unstructured that they can only be understood by humans, but the amount of data is so huge that they can only be processed efficiently by machines. The Semantic Web addresses the first part of this challenge by trying to make the data machine-understandable, while Web Mining addresses the second part by (semi-)automatically extracting the useful knowledge hidden in these data, and making it available as an aggregation of manageable proportions.

Semantic Web Mining aims at combining the two areas Semantic Web and Web Mining. This vision follows our observation that trends converge in both areas: increasing numbers of researchers work on improving the results of Web Mining by exploiting (the new) semantic structures in the Web, and make use of Web Mining techniques for building the Semantic Web. Last but not least, these techniques can be used for mining the Semantic Web itself. The wording Semantic Web Mining emphasizes this spectrum of possible interaction between both research areas: it can be read both as Semantic (Web Mining) and as (Semantic Web) Mining.

In 2006, we provided an overview of where the two areas of Semantic Web and Web Mining meet [12]. In the survey, we described the current state of the two areas and then discussed, using an example, their combination, thereby outlining future research topics. When analyzing how these two areas cooperate today, one observes two main directions. First, Web mining techniques can be applied to help create the Semantic Web. Ontologies, a backbone of the Semantic Web, are at present often hand-crafted. This is not a scalable solution for a wide-ranging application of Semantic Web technologies. The challenge is to learn ontologies, and/or instances of their concepts, in a (semi-)automatic way. Conversely, background knowledge, in the form of ontologies or in other forms, can be used to improve the process and results of Web Mining. Recent developments include the mining of sites that become more and more Semantic Web sites and the development of mining techniques that can tap the expressive power of Semantic Web knowledge representation.

A tighter interaction between these two directions may lead to a closed loop: from Web Mining to the Semantic Web and back. A tight integration of these aspects will greatly increase the understandability of the Web for machines, and will thus become the basis for further generations of intelligent Web tools. Further investigation of this interplay will give rise to new research questions and stimulate further research both in the Semantic Web and in Web Mining, towards the ultimate goal of a truly comprehensive “Semantic Web Mining”: “a better Web” for all of its users, a “better usable Web”. One important focus is to enable search engines and other programs to better understand the content of Web pages and sites. This is reflected in the wealth of research efforts that model pages in terms of an ontology of the content, the objects described in these pages.

We expect that, in the future, Web mining methods will increasingly treat content, structure, and usage in an integrated fashion in iterated cycles of extracting and utilizing semantics, to be able to understand and (re)shape the Web. Among those iterated cycles, we expect to see a productive complementarity between those relying on semantics in the sense of the Semantic Web, and those that rely on a looser notion of semantics.
3 Social Web Mining

Complementing the Semantic Web effort, a new breed of so-called “Web 2.0” applications recently emerged on the Web. These include user-centric publishing and knowledge management platforms like wikis, blogs, and social resource sharing tools. For each of these types of systems, specific data mining approaches have been developed; see for instance the contribution of Bloehdorn et al. to this volume. Analyzing, extracting and transforming these weakly structured knowledge sources into a richer form will not only make the knowledge accessible to machines but also allow the combination of huge sources of information. Encyclopedic knowledge from Wikipedia, personal contributions from blogs and user annotations from folksonomies provide different views on the same facts, and combining them by means of mining approaches will lead to a new level of knowledge, and thus to new applications. In the last few years, we have focused on one particular type of Web 2.0 system, namely resource sharing systems. These systems all make use of the same kind of lightweight knowledge representation, called folksonomy.1 Social resource sharing systems are web-based systems that allow users to upload all kinds of resources, and to label them with arbitrary words, so-called tags. The systems can be distinguished according to the kind of resources they support. Flickr,2 for instance, allows the sharing of photos, del.icio.us3 the sharing of bookmarks, CiteULike4 and Connotea5 the sharing of bibliographic references, and 43Things6 even the sharing

1 http://www.vanderwal.net/folksonomy.html
2 http://www.flickr.com/
3 http://delicious.com
4 http://www.citeulike.org
5 http://www.connotea.org
6 http://www.43things.com
From Semantic Web Mining to Social and Ubiquitous Mining
of goals in private life. Because of the regular and system-independent structure of folksonomies, they are an ideal target for data mining research. One example is that structure mining is applied to a single folksonomy and not, as is known from web mining, to the web graph as a whole. Given the high number of publications in the short lifetime of folksonomy systems, researchers seem to be very interested in folksonomies and the information and knowledge which can be extracted from them. This can be explained by the tremendous amount of information collected from a very large user base in a distributed fashion in such systems. The application of mining techniques to folksonomies has great potential. Further, it extends the general idea of Semantic Web Mining (see Sect. 2). Two aspects are of central interest: On the one hand, folksonomies form a rich source of data which can be used as a source for full-blown ontologies. This process is known as ontology learning and often utilizes data mining techniques. On the other hand, folksonomies are considered a weak knowledge representation, and analyzing their data can thus be seen as an implementation of Semantic Web Mining. The goal of this work is therefore to bridge the gap between folksonomies and the Semantic Web and to start to solve this problem with research contributions from various sides. More precisely, to reach this goal, a better understanding of the hidden and emergent semantics in folksonomies is necessary, as well as methods to extract the hidden information. Data Mining techniques provide methods for solving these issues. In this section, we will discuss in some more detail two approaches to Social Web Mining of tagging data. The next subsection deals with our own Web 2.0 platform, BibSonomy, while the following subsection addresses in a more general way the interplay of folksonomies and ontologies.
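The regular structure of a folksonomy described above can be illustrated with a small sketch. It assumes, for illustration only, that a folksonomy is stored as a list of (user, tag, resource) assignments; the sample data and the function names `posts` and `tag_cooccurrence` are hypothetical and not taken from any of the systems mentioned in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag assignments: one (user, tag, resource) triple per label.
assignments = [
    ("alice", "semanticweb", "http://example.org/paper1"),
    ("alice", "ontology", "http://example.org/paper1"),
    ("bob", "ontology", "http://example.org/paper1"),
    ("bob", "mining", "http://example.org/paper2"),
    ("carol", "semanticweb", "http://example.org/paper2"),
    ("carol", "mining", "http://example.org/paper2"),
]


def posts(assignments):
    """Group tags by (user, resource) pair, i.e. by individual post."""
    grouped = {}
    for user, tag, resource in assignments:
        grouped.setdefault((user, resource), set()).add(tag)
    return grouped


def tag_cooccurrence(assignments):
    """Count how often two tags are used together in the same post."""
    counts = Counter()
    for tags in posts(assignments).values():
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    for pair, count in tag_cooccurrence(assignments).items():
        print(pair, count)
```

Because every system of this kind yields the same three-part records, a co-occurrence count like this one could in principle be run unchanged on data from any of them, which is one way to read the claim that folksonomies are a convenient mining target.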
3.1 Analysis of Folksonomy Data

Since access to data is crucial for the evaluation of new algorithms, we have set up our own system, BibSonomy,7 which allows sharing bookmarks and BibTeX entries simultaneously. BibSonomy is a platform where researchers manage their publications on a daily basis. BibSonomy started as a student project in our group in spring 2005. It quickly grew out of prototype status and has attracted roughly 5,000 users to date, making it, to the best of our knowledge, one of the three most popular social publication sharing systems at present. As we own the system, we have full access to all data, the user interface and so on. This puts us in a situation in which researchers do not normally find themselves: We can perform research experiments to test our new methods and push our research results into BibSonomy to show, evaluate, and demonstrate the advantages of our methods. BibSonomy provides us with data for experiments, but also allows online experiments and the implementation of the most successful algorithms as showcases.

7 http://www.bibsonomy.org
In [2], we summarized our work on different aspects of mining folksonomies, and demonstrated and evaluated them within our system. We addressed a broad range of folksonomy research, such as the capturing of emergent semantics, spam detection, ranking algorithms, analogies to search engine log data, personalized tag recommendations and information extraction techniques. The tight interplay between our scientific work and the running system has made BibSonomy a valuable platform for demonstrating and evaluating Web 2.0 research. One example of research around BibSonomy is the (online) recommender experiments we conducted for the ECML PKDD Discovery Challenges 2008 and 2009.8 Both years’ challenges are based on BibSonomy data. While the first challenge comprised two different tasks, spam detection and tag recommendation, in 2009 we specifically focused on three variants of the tag recommendation task. Two offline tasks dealt with different information quantities, while for the online task all recommender algorithms had to deliver their answers within a predefined time. Without BibSonomy we would not have been able to set up such challenges, and it was an exciting experience to organize such an event. Another example is the set of research results which found their way into the system. One of the first results was a lightweight recommender, followed by the FolkRank ranking, and the display of related tags and users to improve the browsing experience.
3.2 Ontologies and Folksonomies

Ontologies are a well-known formal knowledge representation [11] and are the building block of the “Semantic Web” effort. With their well-defined semantics, ontologies offer benefits for a wide spectrum of applications supported by advanced tools from industry and academia. Nevertheless, problems arise when Semantic Web technology is used in very large application contexts, especially on the web. The web contains huge masses of data, but the data is not always available in the structured form needed by the Semantic Web, i.e., as ontologies. The fact that the transformation process from unstructured to structured information is possible, but does not scale to the size of the web, is part of the well-known knowledge acquisition bottleneck. The reason is that a certain expertise is needed for creating and maintaining ontologies. This raises the cost of knowledge acquisition, and only a few people contribute. Learning ontologies from text [6] is a first way to simplify the acquisition process by utilizing machine learning approaches and linguistic knowledge. Folksonomies can be seen as a lightweight knowledge representation. Many inexperienced users contribute small pieces of information, unfortunately only in a weakly structured fashion. There is a large amount of information, but it is unstructured and therefore incompatible with semantically rich representations. Both approaches could benefit from each other: While folksonomies need more structure,

8 http://www.kde.cs.uni-kassel.de/ws/dc09/
ontologies need more contributors. Research in this direction has been stimulated by the “Bridging the Gap between Semantic Web and Web 2.0” workshop,9 where the contributions ranged from the use of human-contributed information to simplified Web 2.0-like Semantic Web tools. The emergent semantics in folksonomies can be extracted by using machine learning algorithms or advanced analysis methods. Our first approach to this end was the application of association rules [10]. To be able to develop advanced knowledge extraction methods, however, a better understanding of the kind of underlying semantics is needed. We presented a summary of network properties of folksonomies in [3], which supports the existence of semantics in folksonomies. The next steps were a deeper understanding of the type of relations hidden in folksonomies [4] and the development of methods to extract them [1]. In principle, the presented folksonomy mining approaches follow the Semantic Web Mining program. They make real our vision of utilizing mining to help build and analyze the Semantic Web. A central phenomenon of the Web 2.0 is the contribution of many users who are distributed over the world. At the beginning of the Web 2.0, these users were assumed to be sitting in front of a desktop computer or notebook. Consequently, the next step was to bring the web to mobile devices and to set up new services which not only allow users to provide information at any time and any place, but also to monitor their activities. This physical information will provide new kinds of data which allow for new services. A combination of the physical world with its small devices, sensors, etc., the Web 2.0 look and feel, and the Semantic Web to connect everything will lead to the next generation of the Web.
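To make the association-rule idea mentioned in Sect. 3.2 more concrete, the following is a minimal sketch of how support and confidence of a rule between tags could be computed. How [10] actually projects a folksonomy onto transactions is not described in this chapter; the sketch simply assumes that the tag set of each post forms one transaction, and all data and function names are hypothetical.

```python
# Hypothetical transactions: the set of tags of each post.
transactions = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"python", "snake"},
    {"java", "programming"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every tag in itemset."""
    itemset = set(itemset)
    hits = sum(1 for tags in transactions if itemset <= tags)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate how often the consequent tags occur given the antecedent tags."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


if __name__ == "__main__":
    # Rule "python -> programming": 2 of the 3 posts tagged "python".
    print(support(transactions, {"python", "programming"}))
    print(confidence(transactions, {"python"}, {"programming"}))
```

A rule with high confidence in one direction but not the other, such as a specific tag implying a general one, is the kind of pattern that might hint at a hierarchical relation between tags, although deciding that reliably needs the further analysis the text refers to.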
4 Ubiquitous Web Mining

Concurrent with the rise of the Semantic and the Social Web, mobile phones became more and more powerful, giving rise to Mobile Web applications. Today, we observe the amalgamation of these two trends, leading to a Ubiquitous Web, whose applications will support us in many aspects of daily life. The emergence of ubiquitous computing has begun to create new environments consisting of small, heterogeneous, and distributed devices that foster the social interaction of users in several dimensions. Similarly, the upcoming Social Semantic Web also integrates the user interactions in social networking environments. For instance, modern smartphones allow everyone to access the WWW at any place and at any time. At the same time, these systems are equipped with more and more sensors. Typical sensors in today’s smartphones measure geolocation, geographic north, acceleration, proximity, ambient light, loudness, and moisture. Furthermore, access to the most prominent Web 2.0 platforms (in particular Facebook, Flickr, and YouTube) is frequently pre-installed by the vendor. This example

9 http://www.kde.cs.uni-kassel.de/ws/eswc2007/
shows that the worlds of WWW, Web 2.0, the Mobile Web, and sensor technology are rapidly amalgamating. Going one step further, we assume the rapid convergence of the Ubiquitous Web with the Internet of Things—more and more, the real world that is surrounding us will have its digital counterpart. Applications in the Ubiquitous Web will thus rely on a mix of data from sensors, social networks and mobile devices. This data needs to be integrated, aggregated, and analyzed by means of Data, Text, and Web Mining techniques and turned into a semantic and/or statistical representation of knowledge, which will then fuel the ubiquitous applications. This has become an important challenge for different research communities, since it requires the confluence of previously separated lines of research. Consequently, the last years have seen increasing collaboration of researchers from the Semantic Web, Web 2.0, social network analysis and machine learning communities. Applications that use these research results are achieving economic success. Data now becomes available that allows researchers to analyze the use, acceptance and evolution of their ideas. Mining in ubiquitous and social environments is thus an emerging area of research focusing on advanced systems for data mining in such distributed and network-organized environments. It also integrates some related technologies such as activity recognition, Web 2.0 mining, privacy issues and privacy-preserving mining, predicting user behavior, etc. However, the characteristics of ubiquitous and social mining are in general quite different from current mainstream data mining and machine learning. Unlike in traditional data mining scenarios, data do not emerge from a small number of (heterogeneous) data sources, but potentially from hundreds to millions of different sources. As there is only minimal coordination, these sources can overlap or diverge in any possible way. 
Another challenge in this regard is the development of data mining algorithms that are tailored to run on mobile devices with limited resources of energy and memory. We are currently taking the first steps towards Ubiquitous Web Mining. To illustrate this, we briefly describe one ongoing and one upcoming research project.
4.1 Conferator—A Ubiquitous Conference Service Within the project “VENUS—Design of socio-technical networking applications in situative ubiquitous computing systems”, we have developed Conferator, a ubiquitous conference service. VENUS is a research cluster at the interdisciplinary Research Center for Information System Design (ITeG) at Kassel University, funded by the State of Hesse as part of the program for excellence in research and development (LOEWE). The goal of VENUS is to explore the design process of future networked, ubiquitous systems, which are characterized by situation awareness and self-adaptive behavior. Thus, VENUS focuses on the interactions between the new technology, the individual user and the society. The long-term goal of VENUS is the creation of a comprehensive interdisciplinary development methodology for the design of ubiquitous computing systems.
The aim of Conferator is to support conference participants in their social interaction. Conferator features two key functionalities: PeerRadar and TalkRadar. PeerRadar shows you a history of your social contacts at the conference and provides additional information about them, such as their homepage, Facebook and Twitter accounts, contact details (Skype, ICQ, etc.), and their latest (public) entries in BibSonomy. After the conference, PeerRadar thus enables you to recall the social contacts you had during the conference. TalkRadar gives you the opportunity to personalize the conference schedule. You can select the talks that you intend to attend and store them in BibSonomy, so that it will be easier to cite them in your next publication. TalkRadar also gives you more information about the current talk, such as additional information about the presenter, his or her publications, etc. Last but not least, TalkRadar stores the talks that you have actually attended. All conference participants who want to join the service are provided with an RFID tag that is worn like a name tag. The tags communicate with readers installed on the walls, and also with each other. In this way, the system can determine the room you are in and the people you are currently talking to (more precisely: the people who are face to face with you). The data is encrypted and transmitted to the Conferator server, where it is aggregated. The hardware technology we use was developed within the Sociopatterns project [5],10 whose generous support we gratefully acknowledge. Similar services were offered at previous conferences by the Live Social Semantics project [13]. Conferator puts a stronger focus on academic (rather than general social) interaction.
We ran Conferator for the first time at the Workshop “Lernen—Wissen— Adaptivität”11 in Kassel in October 2010, and are currently starting the analysis of the usage of the system and of the emerging social patterns.
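The chapter does not say how the Conferator server aggregates the tag data. As a purely illustrative sketch, the following assumes that each face-to-face sighting arrives as a (timestamp, tag, tag) record, that a sighting stands for a fixed interval of contact, and that sightings of the same pair separated by a short gap belong to one conversation. The record format, the parameters `interval` and `max_gap`, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sighting records: (timestamp in seconds, tag id, tag id).
sightings = [
    (0, "t1", "t2"),
    (20, "t2", "t1"),
    (40, "t1", "t2"),
    (40, "t1", "t3"),
    (300, "t1", "t2"),
]


def contact_durations(sightings, interval=20, max_gap=60):
    """Sum the contact time per pair of tags.

    Each sighting is assumed to cover `interval` seconds. Sightings of the
    same pair that are at most `max_gap` seconds apart are merged into one
    continuous contact.
    """
    by_pair = defaultdict(list)
    for timestamp, tag_a, tag_b in sightings:
        # A frozenset makes the pair independent of who saw whom.
        by_pair[frozenset((tag_a, tag_b))].append(timestamp)

    durations = {}
    for pair, times in by_pair.items():
        times.sort()
        total = 0
        start = previous = times[0]
        for timestamp in times[1:]:
            if timestamp - previous > max_gap:
                # Close the running contact and start a new one.
                total += previous - start + interval
                start = timestamp
            previous = timestamp
        total += previous - start + interval
        durations[pair] = total
    return durations


if __name__ == "__main__":
    for pair, seconds in contact_durations(sightings).items():
        print(sorted(pair), seconds)
```

A per-pair summary of this kind is one plausible input for a contact history such as the one PeerRadar displays, under the stated assumptions.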
4.2 EveryAware: Enhance Environmental Awareness Through Social Information Technologies

EveryAware is an upcoming European project that is scheduled to start in early 2011. Our aim is to enhance social awareness about environmental issues emerging in urban habitats through collaborative monitoring of air pollution and related events. To this end, volunteers in major European cities will be equipped with air pollution sensors and means of annotating their measurements. These locally relevant, user-mediated and user-generated data will be analyzed, processed, and visualized. The outcome will be real-time, user-centered results that are disseminated through standard and widely available communication networks.

10 http://www.sociopatterns.org/
11 http://www.kde.cs.uni-kassel.de/conf/lwa10
EveryAware has the ultimate goal of triggering a bottom-up improvement of social strategies. The integration of participatory sensing with the monitoring of subjective opinions is novel and requires a tight interplay of research areas such as physics, social networks, semantic web, and data mining, and will definitely provide new challenges for Ubiquitous Web Mining.
5 Outlook Today we are seeing the first steps being taken towards the integration of the Social, Mobile and Semantic Webs. Along this path will be exciting challenges for researchers of different communities. New insights provided by machine learning and social network analysis techniques will lead to new types of knowledge. We envision that research in this area will be of growing interest, as the automatic extraction of knowledge from weakly structured sources contributed by a huge mass of users and the combination with structured knowledge will be an important basis for the Semantic Web. It will lead to a broad range of new applications, which allow for combining knowledge of different types, levels and from different sources to reach their goals. The upcoming Ubiquitous Web is one target application area which will benefit from this newly integrated knowledge. Semantic Web technology can bridge the gap between all kinds of information— independent of its source and its origin—and can be used as a starting point to put everything together. The real world information gathered by sensors will be used by applications running on mobile devices and will be connected with the information of their users from the social web. The combination of Semantic Web with Data Mining approaches may become the right means for connecting these worlds.
References

1. Benz, D., Hotho, A.: Position paper: ontology learning from folksonomies. In: Hinneburg, A. (ed.) LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings (LWA), Martin-Luther-University Halle-Wittenberg, pp. 109–112 (2007)
2. Benz, D., Hotho, A., Jäschke, R., Krause, B., Mitzlaff, F., Schmitz, C., Stumme, G.: The social bookmark and publication management system bibsonomy. VLDB J. 19, 849–875 (2010). doi:10.1007/s00778-010-0208-4
3. Cattuto, C., Schmitz, C., Baldassarri, A., Servedio, V.D.P., Loreto, V., Hotho, A., Grahl, M., Stumme, G.: Network properties of folksonomies. AI Commun. 20(4), 245–262 (2007). Special Issue on “Network Analysis in Natural Sciences and Engineering”
4. Cattuto, C., Benz, D., Hotho, A., Stumme, G.: Semantic grounding of tag relatedness in social bookmarking systems. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T.W., Thirunarayan, K. (eds.) The Semantic Web - ISWC 2008, Proc. Intl. Semantic Web Conference 2008. Lecture Notes in Artificial Intelligence, vol. 5318, pp. 615–631. Springer, Heidelberg (2008)
5. Cattuto, C., Van den Broeck, W.V., Barrat, A., Colizza, V., Pinton, J.-F., Vespignani, A.: Dynamics of person-to-person interactions from distributed RFID sensor networks. PLoS ONE 5(7), e11596 (2010). doi:10.1371/journal.pone.0011596
6. Cimiano, P., Hotho, A., Staab, S.: Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Intell. Res. 24, 305–339 (2005)
7. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: an overview. In: Advances in Knowledge Discovery and Data Mining, pp. 1–34. MIT Press, Cambridge (1996)
8. Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., Stumme, G.: Tag recommendations in social bookmarking systems. AI Commun. 21(4), 231–247 (2008). doi:10.3233/AIC-2008-0438
9. Kosala, R., Blockeel, H.: Web mining research: a survey. SIGKDD Explor. 2(1), 1–15 (2000)
10. Schmitz, C., Hotho, A., Jäschke, R., Stumme, G.: Mining association rules in folksonomies. In: Batagelj, V., Bock, H.-H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification (Proc. IFCS 2006 Conference), July 2006, Ljubljana, Studies in Classification, Data Analysis, and Knowledge Organization, pp. 261–270. Springer, Berlin/Heidelberg (2006). doi:10.1007/3-540-34416-0_28
11. Staab, S., Studer, R. (eds.): Handbook on Ontologies. Springer, Berlin (2004)
12. Stumme, G., Hotho, A., Berendt, B.: Semantic web mining - state of the art and future directions. J. Web Semant. 4(2), 124–143 (2006)
13. Szomszor, M., Cattuto, C., Van den Broeck, W.V., Barrat, A., Alani, H.: Semantics, sensors, and the social web: the live social semantics experiments. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC (2). Lecture Notes in Computer Science, vol. 6089, pp. 196–210. Springer, Berlin (2010)
14. O’Reilly, T.: What Is Web 2.0? Design patterns and business models for the next generation of software. http://oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html (2005)
Towards Networked Knowledge
Stefan Decker, Siegfried Handschuh, and Manfred Hauswirth
Abstract Despite the enormous amounts of information the Web has made accessible, we still lack the means to interconnect and link this information in a meaningful way in order to lift it from the level of information to the level of knowledge. Additionally, new sources of information about the physical world become available through the emerging sensor technologies. This information needs to be integrated with the existing information on the Web and in information systems which require (light-weight) semantics as a core building block. In this position paper we discuss the potential of a global knowledge space, and which research and technologies are required to enable our vision of networked knowledge.
1 What is Networked Knowledge?

The wealth of information and services on today’s information infrastructures like the Web has significantly changed everyday life and has substantially transformed the way in which business, public and private interactions are performed. The economic and social influence of the Web is enormous, enabling new business models and social change, and creating wealth. However, we have barely scratched the surface of what information technology can do for society. The Web has enabled information creation and dissemination, but has also opened the information floodgates. The enormous amount of information available has made it increasingly difficult to find, access, present and maintain information. As a consequence, we are drowning in information and starving for knowledge. However, systematic access to knowledge is critical for solving today’s problems, on individual and organizational as well as global levels. Although knowledge is inherently strongly interconnected and related to people, this interconnectedness is not reflected or supported by current information infrastructures. Current information infrastructures support the interlinking of documents, but not the encoding and representation of knowledge. For example, current hypertext links are untyped and do not convey the meaning of the link. The lack of interconnectedness hampers basic information management, problem-solving and collaboration capabilities, like finding, creating and deploying the right knowledge at the right time. Unfortunately, this is happening at a time when humanity has to face difficult problems (e.g., climate change, energy and resource shortages, or negative effects of globalization such as the global financial crisis and recession caused by the collapse of the subprime mortgage market in the U.S.). New methods are required to manage and provide access to the world’s knowledge, for individual as well as collective problem solving. The right methods and tools for interconnecting people and accessing knowledge will contribute to solving these problems by making businesses more effective, scientists more productive and bringing governments closer to their citizens. Thus, the focus on Enabling Networked Knowledge is essential.

The work presented in this paper was supported (in part) by the Líon-2 project supported by Science Foundation Ireland under Grant No. SFI/02/CE1/I131.

S. Decker, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland, e-mail: [email protected]

D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_9, © Springer-Verlag Berlin Heidelberg 2011
What is Networked Knowledge and why is it important? Besides the creation of knowledge through observation, networking of knowledge is the basic process of generating new knowledge. Networking knowledge can produce a piece of knowledge whose information value is far beyond the mere sum of the individual pieces, i.e., it creates new knowledge. With the Web we now have a foundational infrastructure in place, enabling the linking of information on a global scale. Adding meaning moves the interlinked information to the knowledge level: Web + Semantics = Networked Knowledge. Knowledge is the fuel of our increasingly digital service economy (in contrast to, e.g., the resources required for a manufacturing economy); linking information is a basis of economic productivity. Fortunately, current developments are helping to achieve these goals. Originating from the Semantic Web effort, more and more interlinked information sources are becoming available online, leading to islands of networked knowledge resources and follow-up industrial interest. As an example of a rapidly growing information space, Gartner predicts that “By 2015, wirelessly networked sensors in everything we own will form a new Web. But it will only be of value if the ‘terabyte torrent’ of data it generates can be collected, analyzed and interpreted” [1]. Making sensor-generated information usable as a new and key source of knowledge will require its integration into the existing information space of the Web. Now is the time to tackle the next step: exploiting semantics to create an overall knowledge network that bridges the islands, enabling people, organizations and systems to collaborate and interoperate on a global scale, and that bridges the gap between the physical world and the virtual world, so that the information on the Web (the virtual world) can directly influence activities in the real world and
Fig. 1 Networked Knowledge House
vice versa. This integrated information space of networked knowledge will impact all parts of people’s lives.
Hypothesis It is our central hypothesis that collaborative access to networked knowledge assists humans, organizations and systems with their individual as well as collective problem solving, creating solutions to problems that were previously thought insolvable, and enabling innovation and increased productivity on individual, organizational and global levels.
Research
In our opinion research needs to aim to:
1. develop the tools and techniques for creating, managing and exploiting networks of knowledge;
2. produce real-world networks of knowledge that provide maximum gains over the coming years for human, organizational and systems problem solving;
3. validate the hypothesis; and
4. create standards supporting industrial adaptation.
This overall research vision is broken down into complementary research strands, which form the Networked Knowledge House (see Fig. 1). Social Semantic Information Spaces deal with organization, linking, and management of knowledge on the Web. Semantic Reality addresses the integration of
information from the physical world with knowledge in the virtual world (e.g., from Social Semantic Information Spaces), the creation of knowledge out of information about the physical world, and efficient mechanisms to access this information at large scale via sensors. The technologies created by these basic research strands are then applied in and customized to a set of application domains. This in turn requires research due to the specific requirements of these domains. Of course, the given list of application-oriented research domains is not comprehensive. A number of important domains are not listed, for example, environmental monitoring, traffic management and intelligent driving, logistics and tracking, or building management, to name a few. In the following sections we explain the Networked Knowledge House in more detail.
2 Why is Enabling Networked Knowledge Important? The World Wide Web has dramatically altered the global communications and information exchange landscape, removing access-, space- and time-constraints from business and social transactions. The Web has created new forms of interaction and collaboration between systems, individuals and organizations. The dramatic development of the Web and the changes it has made to society are only a glimpse of the potential of a next-generation information infrastructure connecting knowledge and people. By interlinking the World’s knowledge and providing an infrastructure that enables collaboration and focused exploitation of worldwide knowledge, Social Semantic Information Spaces and Semantic Reality, as will be explained in detail in the following sections, enable individuals, organizations and humanity as a whole to socialize, access services and solve problems much more effectively than we are able to today. The Web is already able to provide us with information, but for the most part lacks sufficient support for collaboration, knowledge sharing and social interaction. Within some examples we can already see the first glimpses of such a support infrastructure in current online social networking sites (currently serving hundreds of millions of users), even though these sites are just data silos and do not interconnect knowledge efficiently. Social Semantic Information Spaces and Semantic Reality as a networked knowledge infrastructure will also make businesses more effective and scientists more productive by connecting them to the right people and to the right information at the right time and enabling them to recognize, collect, and exploit the relationships that exist between the knowledge entities in the world. Vannevar Bush [2] and Doug Engelbart [3] proposed similar infrastructures in 1945 and 1962. However, the technology available at that time was not advanced enough to realize their visions. 
Figuratively speaking, their ideas were proposing jet planes when the rest of the world had just invented the parts to build a bicycle. With the Semantic Web effort delivering standards to interconnect information globally and the Social Web showing how to collaborate on a global scale, now a window of opportunity has opened to make these visions a reality and build a truly globally networked knowledge infrastructure.
Fig. 2 Social Semantic Information Spaces
3 Social Semantic Information Spaces

One of the most visible trends on the Web is the emergence of “Social Web” (or Web 2.0) sites which facilitate the creation and gathering of knowledge through the simplification of user contributions via blogs, tagging and folksonomies, Wikis, podcasts and the deployment of online social networks. The Social Web has enabled community-based knowledge acquisition, with efforts like Wikipedia demonstrating ‘Wikinomics’ in creating the largest encyclopedia in the world. Although it is difficult to define the exact boundaries of what structures or abstractions belong to the Social Web, a common property of such sites is that they facilitate collaboration and sharing between millions of users. However, as more and more Social Web sites, communities and services come online, the lack of interoperation among them becomes obvious: the Social Web platforms create a set of isolated data silos, i.e., sites, communities and services that cannot interoperate with each other, where synergies are expensive to exploit, and where reuse and interlinking of data between silos is difficult and cumbersome. The entities in the Social Web are not only data artefacts. Rather, the Social Web is a network of interrelated users and their concerns, as well as content that the users are related to as producers, consumers or commenters. To enable machines to assist us with the detection and filtering of knowledge, many of these often implicit links have to be made explicit. Social Semantic Information Spaces are a combination of the Semantic Web, the Social Web, collaborative working environments and other collaboration technologies. The goal behind Social Semantic Information Spaces is to create a universal collaboration and networked knowledge infrastructure, which interlinks all available knowledge and its creators. The resulting infrastructure would finally enable knowledge management capabilities as expressed by visionaries like Vannevar Bush and Doug Engelbart.
Figure 2 shows how Social Semantic Information Spaces fit into the current landscape: communication and collaboration tools are augmented and made interoperable with Semantic Web technologies. The result is a network of accessible interlinked knowledge, enabling productive collaboration and knowledge management. In the following we list a few specific examples of enabling technologies and describe where and how they fit into the idea of a Social Semantic Information Space. All these technologies are just starting up, and further research is necessary to ensure their development into broadly adopted technologies. However, convergence between some of the different efforts is already recognizable today.
S. Decker et al.
3.1 Semantic Social Networks
From its beginning, the Internet was a medium for connecting not only machines but also people. Email, mailing lists, the Usenet, and bulletin boards allowed people to connect and form online social networks, typically around specific topics. Although these groups did not explicitly define social networks, the ways people acted and reacted did so implicitly. The early Web continued this trend. More recently, sites such as Friendster and LinkedIn have brought a different notion of online communities by explicitly facilitating connections based on information gathered and stored in user profiles. However, all these sites are stovepipes and lock the information in: using the social network information for other purposes, e.g., for prioritizing email as discussed in [4], requires standardized data exchange mechanisms. Initial crystallization points to remedy this situation are efforts like the Friend-of-a-Friend vocabulary (FOAF¹) or the Semantically-Interlinked Online Communities initiative (SIOC² [5]).
The SIOC initiative may serve as an example of how social networking information can be interlinked with content such as online discussions taking place on blogs, message boards, mailing lists, etc. In combination with the FOAF vocabulary for describing people and their friends, and the Simple Knowledge Organization System (SKOS) model for organizing knowledge, SIOC enables the linkage of discussion postings to other related discussions, people (via their associated user accounts), and topics (using specific ‘tags’ or hierarchical categories). As discussions begin to move beyond simple text-based conversations to include audio and video content, SIOC is evolving to describe not only conventional discussion platforms but also new Web-based communication and content-sharing mechanisms.
Some social networking sites, such as Facebook, are also starting to provide query interfaces to their data, which others can reuse and link to via the Semantic Web. Thus this information becomes part of the Web of information, which may be used or reused for a variety of purposes, providing crystallization points for a network of knowledge.
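The linkage between postings, people and topics described in this section can be illustrated with a small sketch. It stores FOAF, SIOC and SKOS statements as plain (subject, predicate, object) tuples and answers one cross-site question. The vocabulary terms used (foaf:knows, sioc:account_of, sioc:has_creator, sioc:topic, sioc:reply_of, skos:Concept) come from the published vocabularies; all example.org resources and the helper functions are hypothetical and exist only for illustration.

```python
# Namespaces of the three vocabularies named in the text.
FOAF = "http://xmlns.com/foaf/0.1/"
SIOC = "http://rdfs.org/sioc/ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical resources: two people, their accounts on two different sites,
# a blog post, a forum reply, and a shared topic.
EX = "http://example.org/"
alice, bob = EX + "people/alice#me", EX + "people/bob#me"
alice_acct, bob_acct = EX + "blog/user/alice", EX + "forum/user/bob"
post, reply = EX + "blog/post/1", EX + "forum/post/7"
topic = EX + "topics/semantic-web"

triples = {
    (alice, RDF_TYPE, FOAF + "Person"),
    (bob, RDF_TYPE, FOAF + "Person"),
    (alice, FOAF + "knows", bob),
    (alice_acct, SIOC + "account_of", alice),
    (bob_acct, SIOC + "account_of", bob),
    (post, RDF_TYPE, SIOC + "Post"),
    (post, SIOC + "has_creator", alice_acct),
    (post, SIOC + "topic", topic),
    (reply, RDF_TYPE, SIOC + "Post"),
    (reply, SIOC + "has_creator", bob_acct),
    (reply, SIOC + "reply_of", post),
    (topic, RDF_TYPE, SKOS + "Concept"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is stated."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def people_discussing(topic_uri):
    """People who wrote a post on the topic, or replied to such a post."""
    posts = subjects(SIOC + "topic", topic_uri)
    replies = {r for p in posts for r in subjects(SIOC + "reply_of", p)}
    people = set()
    for item in posts | replies:
        for account in objects(item, SIOC + "has_creator"):
            people |= objects(account, SIOC + "account_of")
    return people


print(sorted(people_discussing(topic)))
```

Because the blog and the forum use the same terms, the query crosses the site boundary without any site-specific code. A real deployment would use an RDF store and a query language rather than Python sets.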
3.2 Semantic Collaborative Technologies
Apart from the specific data representation mechanisms outlined above, other mechanisms and technologies contribute to the emergence of Social Semantic Information Spaces on the Web. The Social Semantic Desktop (SSD) [6] effort (materialized in the EU IP project NEPOMUK [7]) aims to facilitate the organization of information on the desktop by using Semantic Web metadata standards. Ontologies capture both a shared conceptualization of desktop data and personal mental models. RDF serves as a common data representation format. Semantic Web Services standards can be used to describe the capabilities and interfaces of desktop applications. Together, these technologies provide a means to build the semantic bridges necessary for data exchange and application integration. The Social Semantic Desktop has the potential to transform the conventional desktop into a seamless, networked working environment by removing the borders between individual applications and between the physical workspaces of different users.
In contrast to desktop applications, Wikis have become popular Web-based collaboration tools and are widely deployed to enable the organizing and sharing of knowledge. Wikis gained popularity because they enable the management of online content in a quick and easy way by “group-editing” using a simple syntax. However, knowledge collected in Wikis typically cannot easily be reused automatically and is only suitable for human consumption. Semantic Web techniques applied to Wikis [8] address this challenge. They provide a means for people who are not knowledge engineers to rapidly acquire formal knowledge, and to create a network of this knowledge linked with other information sources. A typical example is the Semantic MediaWiki,³ which enables the evolution of Wikipedia into a reusable knowledge source enabling automatic processing and human support.
1 http://www.foaf-project.org.
2 http://www.sioc-project.org.
Towards Networked Knowledge
4 Linked Open Data
A prerequisite of Networked Knowledge is Linked Data. Linked Data as outlined by Tim Berners-Lee⁴ follows these principles:
– Use URIs to identify things.
– Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
– Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
– Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
The Linking Open Data project⁵ and the Linked Open Data (LOD) effort [9] are deploying the Linked Data principles. Linked Open Data is strongly community driven and aims to alleviate the lack of sufficiently interlinked datasets on the Internet. Through this effort, a significant number of large-scale datasets⁶ have now been published. Hence, Linked Open Data has recently seen enormous growth, and all indicators point to further accelerated growth. This has led to massive amounts of Linked Open Data, often visualized as the so-called data cloud (see Fig. 3⁷).
The current approach to Linked Open Data publishing often leads to a number of crucial problems that hinder its usage and uptake, e.g.: (i) data providers are widely diverse in terms of quality, point of view, focus, detail, and modeling approach, but as yet this diversity is hard to measure and understand, and thus even harder to work with; (ii) the data sources are only marginally aligned with each other, if at all, and thus promise the end of data silos only technically, without really delivering it; (iii) a large proportion of LOD is noisy and must be cleaned and filtered before usage, but our methods for doing so are mostly ad hoc and poorly understood; (iv) the interfaces for everyday programmers to access LOD are complex and the results often discouraging; etc. Given these problems, the data cloud of today resembles a data desert created out of huge amounts of linked data. Ultimately this data desert must be cultivated into networked knowledge in order to become fertile ground for the development of applications that are truly useful for human beings and their tasks. Examples of efforts that take initial steps towards converting Linked Data into Networked Knowledge are Sindice,⁸ sig.ma,⁹ SIREn¹⁰ and SWSE.¹¹ These services enable humans and machines to locate and query Linked Data that has been published across the Web. In conclusion, the Linked Data community effort is providing light-weight infrastructures and large-scale datasets as initial building blocks for Networked Knowledge.
3 http://meta.wikimedia.org/wiki/Semantic_MediaWiki.
4 See http://www.w3.org/DesignIssues/LinkedData.
5 http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData.
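The four Linked Data principles listed in this section can be sketched in a few lines. The example below is a simplified illustration and not a real client: an in-memory dictionary stands in for the Web so that the sketch runs without network access, and all example.org URIs, labels and function names are hypothetical. A real user agent would instead issue an HTTP GET for each URI with an Accept header such as application/rdf+xml and parse the response.

```python
from collections import deque

# Hypothetical stand-in for the Web: each HTTP URI (principles 1 and 2) maps
# to the description a server would return when the URI is dereferenced.
WEB_OF_DATA = {
    "http://example.org/id/galway": {
        "label": "Galway",
        "links": ["http://example.org/id/ireland"],
    },
    "http://example.org/id/ireland": {
        "label": "Ireland",
        "links": [
            "http://example.org/id/europe",
            "http://example.org/id/galway",   # back-link, forms a cycle
            "http://example.org/id/missing",  # dangling link
        ],
    },
    "http://example.org/id/europe": {"label": "Europe", "links": []},
}


def dereference(uri):
    """Principle 3: looking up a URI returns useful information, or None."""
    return WEB_OF_DATA.get(uri)


def discover(start_uri, max_hops=2):
    """Principle 4: follow links breadth-first to find related resources."""
    seen = {start_uri}
    queue = deque([(start_uri, 0)])
    labels = []
    while queue:
        uri, hops = queue.popleft()
        description = dereference(uri)
        if description is None:
            continue  # dangling links are tolerated, as on the real Web
        labels.append(description["label"])
        if hops < max_hops:
            for linked in description["links"]:
                if linked not in seen:
                    seen.add(linked)
                    queue.append((linked, hops + 1))
    return labels


print(discover("http://example.org/id/galway"))
```

The hop limit and the set of visited URIs matter in practice: linked datasets contain cycles and broken links, which is one facet of the noise problem noted in item (iii) above.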
5 Linked Data Layer for the Future Internet
As identified previously, a prerequisite of Networked Knowledge is Linked Data. At the same time, Linked Data is becoming an accepted best practice for exchanging information in an interoperable and reusable fashion. Many different communities on the Internet use Linked Data standards to provide and exchange interoperable information.
6 For example, DBpedia (http://dbpedia.org/), BBC music (http://www.bbc.co.uk/music/), LinkedGeoData (http://linkedgeodata.org/), and, only recently, the New York Times (http://data.nytimes.com/).
7 http://lod-cloud.net/.
8 http://sindice.com.
9 http://sig.ma.
10 http://siren.sindice.com.
11 http://swse.org.
Fig. 3 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch
Fig. 4 ISO/OSI 7 layer model and proposed Linked Data Layer
The current Internet architecture establishes communication interoperation using a layered approach. Consider a client application that wants to establish a reliable connection with a server application separated by multiple heterogeneous networks. Attacked directly, this problem is virtually intractable. Even with a direct physical connection between the client and the server, both sides have to be prepared to deal with different protocols, addressing schemes, packet sizes, error handling, flow control, congestion control, quality of service, etc. The amount of control information and mutual agreement needed to deliver and correctly interpret the bits across multiple networks is enormous. To reduce its design complexity, most network software is organized as a series of layers. The purpose of each layer is to offer certain services to the higher layers, shielding those layers from the details of how the offered services are implemented. Internetworking is achieved by a common understanding of protocols. The ISO/OSI 7-layer architecture is a conceptual view of networking architectures. One possible view treats Linked Data as an independent layer in the Internet architecture, on top of the networking layers but below the application layers, since it provides a common data model for all applications, as shown in Fig. 4. Future work will clarify the properties of this layer and its role in the Future Internet as the Internet moves towards interdataworking, that is, understanding each other's data.
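As a rough illustration of the layering idea, the following sketch places a hypothetical Linked Data layer between two applications and a stand-in for the lower layers. It is not a proposal for the actual design of such a layer, which the text leaves to future work. The function names, the example.org URIs and the restriction to URI-only triples are assumptions made to keep the sketch short; the wire format imitates the N-Triples line syntax.

```python
def linked_data_send(triples):
    """Linked Data layer, sending side: URI triples to N-Triples-style bytes."""
    lines = [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]
    return "\n".join(lines).encode("utf-8")


def linked_data_receive(payload):
    """Linked Data layer, receiving side: bytes back to URI triples."""
    triples = []
    for line in payload.decode("utf-8").splitlines():
        s, p, o, _dot = line.split(" ")  # URIs contain no spaces
        triples.append((s.strip("<>"), p.strip("<>"), o.strip("<>")))
    return triples


def lower_layers(payload):
    """Stand-in for the transport and network layers: they carry opaque bytes."""
    assert isinstance(payload, bytes)
    return payload


# Hypothetical vocabulary shared by both applications.
STATUS = "http://example.org/vocab#status"
DELAYED = "http://example.org/vocab#Delayed"
TRAIN = "http://example.org/train/42"


def railway_app_publish():
    """Application A only produces triples; it knows nothing about bytes."""
    return linked_data_send([(TRAIN, STATUS, DELAYED)])


def travel_planner_is_delayed(payload, train_uri):
    """Application B only consumes triples; it knows nothing about A's internals."""
    return (train_uri, STATUS, DELAYED) in linked_data_receive(payload)


wire = lower_layers(railway_app_publish())
print(travel_planner_is_delayed(wire, TRAIN))
```

The point of the design is the shielding described above: the two applications agree only on the triple data model and a vocabulary, while serialization and transport stay hidden in the layers beneath them.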
6 Semantic Reality
Until now, the virtual world of information sources on the World Wide Web and activities in the real world have usually been separate. Knowledge accessible on the Web (the virtual world) may influence activities in the real world and vice versa, but these influences are usually indirect and not immediate. In contrast to this, imagine a world where:
– Cars know where the traffic jams are and traffic can be managed based on real-time input about the traffic situation.
– Medical data monitored through body sensor networks is automatically included in a patient’s electronic healthcare record. Should a critical condition be detected by these sensors, the patient can be physically located and the closest doctor on duty can be guided to the patient, whilst the necessary resources are prepared in the hospital the patient is to be transferred to.
– Your calendar knows how long the queue is at your physician.
– Your travel planner knows that the train is delayed before you go to the train station.
– Or generally, scarce resources can be managed efficiently and in time.
The advent of sensor technologies in conjunction with the Semantic Web now provides a unique opportunity to unify the real and the virtual worlds because, for the first time, the necessary infrastructure is either in place or will be deployed at large scale in the short term. Their combination will enable us to build very large information spaces and infrastructures which, for the first time, facilitate the information-driven online integration of the physical world and computers. Just as the Internet has changed the way people communicate in the virtual world, Semantic Reality extends this vision to the physical world, enabling novel ways for humans to interact with their environment and facilitating interactions among entities of the physical world (the so-called ‘Internet of Things’). The physical world will be represented in cyberspace and information on our environment will become ubiquitously available on the Internet. This integrated information space has a wide range of applications in monitoring, manufacturing, health, tracking and planning.
We dubbed this Semantic Reality, which combines semantics (whether semantics is based on statistics, logical descriptions, or hybrid approaches does not matter) with a large-scale information integration approach involving sensor data and data from the Web. In fact, we believe that a wide spectrum of approaches and their combinations will be necessary to cover the diverse requirements. Semantic Reality aims at an integrated information space in which integration is achieved in a manner compatible with the early Internet philosophy of community-driven agreement processes, emergent behavior and self-organization, but adding semantics as a key enabling ingredient. Without machine-processable semantics, such a large-scale system cannot work to its fullest extent. Yet semantics must be light-weight and fault-tolerant, must support dynamic change, and must be able to deal with incomplete, incorrect, and noisy information in order to be applicable and useful in a global-scale heterogeneous environment. The rationale for success could be along the lines of “a little bit of semantics gets you a long way”.¹²
12 See http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html.
Fig. 5 Semantic Reality
Figure 5 shows how Semantic Reality fits into the overall picture: sensors connect the physical world to the computer, Social Semantic Information Spaces create and connect virtual worlds, and Semantic Reality integrates these two into one uniform information space which will provide novel ways of monitoring, controlling and influencing the environment, and how people and enterprises collaborate. The ultimate goal of Semantic Reality is “to deliver the right knowledge to the right people at the right time”. This requires the adequate description of information, people and their requirements, and a temporal view on data sources, be they “real” or “virtual”, i.e., a unified model of evolution of (integrated) information sources, thus moving from a static to a dynamic model of the Web and the physical world. This is currently taken into account only to a limited extent on the Web and only for “closed” applications, e.g., RSS feeds or blogs. Semantic Reality shares several goals and properties with ubiquitous and pervasive computing and ambient intelligence. It draws on a large body of work in sensor networks, embedded systems, ubiquitous and pervasive computing, ambient intelligence, networking, distributed systems, distributed information systems, artificial intelligence, software engineering, social networking and collaboration, and the Semantic Web. However, Semantic Reality is different from these research domains as it pushes the boundaries further by aiming at large-scale integration of (possibly isolated) information islands and the integration of systems, which requires the central use of semantics for information-driven integration and a uniform/universal, but light-weight semantic model of information sources and information. The sheer size of the possible systems poses quite novel and unique challenges. 
Semantic Reality systems can only be built, deployed, and maintained if a large degree of self-organization and automation is built into the infrastructures and their constituents, enabling automated deployment (plug-and-play), automated (re-)configuration, automated component and information integration, and tailored information delivery based on user context and needs in a service-oriented way. These characteristics require semantic descriptions as a central ingredient: user requirements and contexts, the constituents of the system, the dynamic data (streams) they produce, their functionalities, and requirements all need to be described using light-weight semantic mechanisms to enable a machine-understandable information space of real-world entities and their dynamic communication processes on a scale which is beyond the current size of the Internet. In the following, we briefly outline some of the core challenges and hint at possible strategies to address them.
Large-scale and open semantic infrastructures and flexible abstractions are required to enable the large-scale design, deployment and integration of sensor/actuator networks and their data. The integration has to happen on both the technical level (data and network access) and the semantic level (“What does the (stream) data provided actually mean?”). The infrastructure has to be open and easily extensible to address the heterogeneity issues, which go far beyond those seen to date on the Internet. The infrastructure will draw on key enabling technologies such as (semantic) overlay networks using P2P technology to achieve scalability, and light-weight semantic formats based on RDF and microformats. Middleware systems such as the Global Sensor Network (GSN) platform¹³ [10] are examples aiming at the development of a general-purpose middleware supporting these requirements. GSN is work in progress and provides a flexible middleware layer which abstracts from the underlying, heterogeneous sensor network technologies, supports fast and simple deployment and addition of new platforms, facilitates efficient distributed query processing and combination of sensor data, provides support for sensor mobility, and enables the dynamic adaptation of the system configuration during runtime with minimal (zero-programming) [11] effort.
Query processing, reasoning, and planning based on real-world sensor information will be core functionalities to exploit the full potential of Semantic Reality.
The key research problems to overcome are the very large scale, the number of distributed information sources, the time-dependency of the produced data (streams), and the fact that the data is unreliable and noisy. For query processing this means supporting distributed query processing and load-balancing at large scales with only incomplete views on the state of the overall system. In this context, distributed event-based infrastructures are of specific interest (“reactive” queries). Users should be able to register expressive, semantic “patterns” of interest and be notified by the system as soon as information satisfying their interests becomes available. Also, new approaches for distributed reasoning and reasoning on time-dependent information, taking into account modalities and based on an open-world assumption, will be necessary. The size and the physical distribution of data will require new approaches combining logical and statistical methods, which will have to trade logical correctness against statistical guarantees and expressivity against scalability. Essentially, the goal is to enable “The World is the Database” scenarios with support for structured querying, integrated views (real-world information with virtual information), aggregation and analyses, and open, distributed reasoning over large, incomplete, and approximate data sets.
Cross-layer integration and optimization will play a central role due to the extremely heterogeneous environment (a wide range of sensing devices with very heterogeneous hardware and processing characteristics; information systems and architectures along with virtual information streams which considerably increase complexity) and the various and often contradictory requirements on the different system levels. For example, sensor networks are optimized for lifetime and offer only primitive programming and query models. If this is combined with the “wrong” distribution approach, e.g., a distributed hash table for discovery, and the “wrong” distributed query processing approach, which does not limit the expressivity of queries, the result will be an inefficient system design that limits the lifetime of sensors by draining their power sources because of incompatible processing strategies at the different levels.
Semantic description and annotation of sensors, sensor data and other data streams will enable the flexible integration of information and (distributed) discovery of information. For scalability, integrity, and privacy reasons this has to be supported in a distributed fashion, for example, through semantic peer-to-peer systems.
13 The GSN implementation is available from http://gsn.sourceforge.net/.
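The “reactive” queries mentioned above, where users register patterns of interest and are notified once matching information becomes available, can be sketched as a small publish/subscribe registry over semantically described sensor readings. This is a single-process simplification under stated assumptions: patterns are plain triple patterns with None as a wildcard, which is far less expressive than the semantic patterns the text calls for, and the class name, the ex: identifiers and the callbacks are hypothetical. It does not reflect the API of GSN or any other middleware.

```python
class PatternRegistry:
    """Minimal event-based matching of triple patterns against a data stream."""

    def __init__(self):
        self._subscriptions = []  # list of (pattern, callback) pairs

    def register(self, pattern, callback):
        """Register interest; pattern is (s, p, o) with None as a wildcard."""
        self._subscriptions.append((pattern, callback))

    def publish(self, triple):
        """Notify every subscriber whose pattern matches; return the match count."""
        notified = 0
        for pattern, callback in self._subscriptions:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                callback(triple)
                notified += 1
        return notified


registry = PatternRegistry()
glucose_events, sensor1_events = [], []

# One user is interested in any blood sugar observation,
# another in everything a specific sensor reports.
registry.register((None, "ex:observes", "ex:BloodSugar"), glucose_events.append)
registry.register(("ex:sensor1", None, None), sensor1_events.append)

# The stream of annotated sensor data arrives one statement at a time.
stream = [
    ("ex:sensor1", "ex:observes", "ex:BloodSugar"),
    ("ex:sensor2", "ex:observes", "ex:Position"),
    ("ex:sensor1", "ex:locatedIn", "ex:Galway"),
]
counts = [registry.publish(t) for t in stream]
print(counts, len(glucose_events), len(sensor1_events))
```

The subscriber is called when data arrives instead of polling for it, which is the essential difference from a conventional query. The hard problems named in the text, such as distribution, load-balancing and noisy data, are outside the scope of this sketch.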
A prerequisite for discovery is the meaningful semantic description of sensors and sensor data by the manufacturer and by the user. The manufacturer can provide it, for example, through IEEE 1451 standard compliant Transducer Electronic Data Sheets (TEDS) [12], which essentially give a (non-semantic) description of the sensor that can very easily be ontologized, or via an ontologized subset of SensorML [13], which provides standard models and an XML encoding for describing sensors and measurement processes. The user can extend these basic descriptions with annotations adding more information, meaning, and links to other information. The annotation of sensor data itself will be highly relevant for understanding the meaning of the produced data and for sharing this knowledge. Visualization environments supporting the annotation process will be of high importance. Such environments may range from simple graphical annotation to annotation with claims and findings in the case of scientific data. This derived knowledge can then be used again in the discovery process and will help to prevent “data graveyards” where interesting (measurement) information is available but cannot be used because the knowledge about its existence and meaning has been lost (e.g., the typical “PhD student finishes” syndrome). Due to the possibly large sizes of the produced data, this poses additional scalability problems. As with discovery, semantic annotation has to be supported in a distributed fashion, for example, by distributed semantic Wikis.
Emergent semantics, self-organization, and plug-and-play are required to build working systems at the envisioned large scales, where top-down system control, configuration, and enforcement of standards will be a very hard problem or even impossible. As we can see from the current community processes on the Web, a lot of successful de-facto standards develop bottom-up. Conversely, these processes support the incremental development of standards and knowledge. The system must be able to self-organize and adapt its behavior in a plug-and-play fashion within organizational boundaries based on semantic understanding and agreement. Semantic understanding and agreements in turn will depend on dynamic processes which support (semi-)automatic assessment of the levels of agreement and their correctness. Such emergent semantic agreements can then be used as the basis for standardization (ontologies). Conversely, semantic formats can be advanced through such processes.
Semantically enriched social network and collaboration infrastructures enable the targeted delivery of knowledge and information based on context description and actual user needs. The ubiquity of information requires a means to filter and direct data streams on a need-to-know basis. The definition of user profiles, needs and contexts is a key feature enabling targeted information delivery and avoiding overload. Social networking information enables both information sharing and information filtering based on interests and information needs.
Development support and tools along with experimental platforms and simulation tools will be necessary for efficient application development and testing. This means the availability of visual programming tools which support the developer in designing the acquisition, combination and integration of data. These designs can then be compiled down to the required target platforms (both sensor and back-end platforms, e.g., for business processes). To test applications, experimental testbeds along the lines of PlanetLab¹⁴ are essential, as many of the characteristics of Semantic Reality systems require experimental evaluation under real-world conditions, especially in terms of scale and distribution.
To further evaluate applications, the integration of experiments and simulations should be supported in a seamless way, i.e., a test of an application in an experimental testbed should support the inclusion of simulation without changes to the application code. This means that parts of an application (or the complete application) should be able to run on an experimental testbed or on a simulator or any combination of those. On the application level, modern paradigms such as service-oriented architectures, service mash-ups and Web 2.0-like functionalities should be available and supported.
Integrity, confidentiality, reputation, and privacy are the key security requirements for business users and consumers. The provided information has to be resistant against technical errors and attacks, be stored and transported in a secure way, come from authentic and trustworthy sources, and ensure the privacy of its providers and users. Physical distribution can be beneficial, as it helps to avoid the creation of “Big Brother” scenarios which consumers and legislators would not tolerate.
14 See http://www.planet-lab.org/.
Vertical integration of business processes with middleware and sensor/actuator networks, relying on the above technologies and functionalities, unlocks the potential of Semantic Reality in a business environment. Sensor information, coming from both virtual and physical sources, enables agile business processes requiring minimal human intervention.
7 Application-Oriented Research Domains
The Web has already influenced many different areas of society. The introduction of Social Semantic Information Spaces and Semantic Reality may have a similar influence but, as with the Web, the transition of these new technologies into application areas is usually slow. To ensure rapid uptake and to provide maximum benefit to society, dissemination of research should focus on a number of selected application research domains. These domains are selected based on their potential impact on society, but the list below is by no means complete, as we expect the technologies to influence many domains, similar to how the Web influenced society. These research domains investigate the adoption and uses of Social Semantic Information Spaces and Semantic Reality, combining a critical mass of technology-oriented research with research on the needs in specific application environments to initiate groundbreaking innovation. Example research domains are:
eHealth and Life Sciences. The objective of the eHealth and Life Sciences domain is to reduce the cost associated with the drug research and delivery process, to make clinical research more efficient through data integration, and to enable patients’ self-management of disease through telehealth, e.g., remote patient monitoring. Due to the heterogeneity of the eHealth domain, semantics is a crucial ingredient in achieving this objective.
eScience. The objective of the eScience domain is to improve collaboration among scientists working on computationally intensive problems, carried out in highly distributed network environments. Semantic support for distributed collaboration and annotation of scientific data and publications is of particular interest in our opinion.
Telecommunications. The objective of the telecommunications domain is to exploit semantic technologies to enable telecoms to develop value-added communication services that will interface humans and machines, exploit machine-readable data, and improve intra-enterprise human communication and knowledge management. Context information generated by sensors in conjunction with virtual information and unified communication profiles is of particular interest to enable new technology-supported communication paradigms.
eBusiness and Financial Services. The objective of the eBusiness and Financial Services domain is to apply new technology in the key areas of extracting business meaning from unstructured information, uncovering meaning within a business context, and building smarter Business Information Systems that can add meaning as they operate and communicate business information.
8 An Example Application Scenario
To illustrate the possibilities of “Networked Knowledge” we present a simple application scenario: Siobhan Hegarty, who lives in Galway, is pregnant with her second child. During her first pregnancy, Siobhan suffered from elevated blood sugar levels, which can endanger the unborn child and the mother. The problem with elevated blood sugar levels in pregnant women is that high blood sugar is an important warning sign that requires a quick response. Thus, measuring it a few times a day is not sufficient; constant monitoring is required. Fortunately, mobile sensors are available which enable Siobhan to leave the hospital while her general practitioner (GP) still gets the relevant information. Siobhan is equipped with a mobile blood sugar sensor which can transmit readings via Bluetooth. The device is paired with Siobhan’s mobile telephone, which transmits the sensor readings via GSM. Additionally she gets a GPS device which records her position and sends it via her mobile. Siobhan’s GP, Dr. James Mooney, enters the necessary monitoring requirements into his Care2X healthcare information system¹⁵ along with rules on when to raise an alarm and whom to alert. For example, the system will call Siobhan and warn her via a synthesized message, while James is informed via a text message on his beeper, which he wears all the time. The sensor readings from Siobhan’s blood sugar and GPS sensors are fed directly back into James’s Care2X system. Let us assume that after some time, Siobhan’s blood sugar levels change dramatically and the alarm rules are set off. Now it is important to get Siobhan to a doctor as fast as possible, or vice versa, a doctor to Siobhan. Besides notifying Siobhan and James, the Care2X system accesses the information system of the hospital and requests a proposal on whether it is better to bring Siobhan to the hospital by ambulance or to bring a doctor to Siobhan.
The hospital information system, which knows the GPS position of all ambulances and of all doctors with matching skills to help Siobhan, produces an optimal plan based on real-time sensor input from the traffic control system of the city. Given the current positions of available ambulances and doctors with the necessary skills, the optimal strategy is to pick up the endocrinologist Dr. Sarah O’Connor from her home with a nearby ambulance and bring her to Siobhan. Unfortunately, while this plan is being calculated, two important changes to the scenario occur: (1) no more readings from Siobhan’s GPS are received, probably because she has entered a building or because the device has run out of battery, and Siobhan does not respond to calls on her mobile; and (2) the last blood sugar readings show some strange and unknown pattern which neither James nor Sarah can interpret. As a reaction, the system now tries to locate Siobhan via other means: the system tries to determine her position via triangulation of her mobile and additionally asks all Bluetooth access points in the vicinity of her last position to send a message if they recognize any of her Bluetooth devices.
15 See http://www.care2x.org/.
The strange patterns in the blood sugar readings worry James and Sarah, and they decide to use their country-wide social network of clinical specialists to look for doctors who may already have seen similar patterns. Additionally, they search medical databases on the Web for annotations describing such patterns. Their search turns up information which looks similar to the pattern they have seen, but the result is inconclusive. In parallel, a colleague of theirs from Dublin, who also participates in the social network to which they sent the symptoms, informs them that the pattern may indicate a malfunction of the blood sugar sensor and describes his experiences. In the meantime, Siobhan has been located by a Bluetooth access point. To be on the safe side, the ambulance with Sarah on board is sent to her location and finds her in good condition. However, an examination reveals that her blood sugar levels had indeed changed dangerously, and Siobhan is treated on the spot. After this successful intervention James and Sarah annotate the sensor readings to permanently store their findings. Their findings are stored in James’s Care2X system and the hospital’s information system, and are also made accessible to other doctors in the national infrastructure along with the actual sensor readings in a secure and anonymized way.
Within this scenario Networked Knowledge is required at several points: every time a new system is involved (e.g., the hospital system), a joint understanding about the entities (e.g., Person) involved has to be established. This is a core requirement for Networked Knowledge and enables higher functionalities, like finding a person by other means (e.g., the triangulation method used in the scenario).
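The requirement of a joint understanding about entities such as Person can be made concrete with a small sketch. Two systems keep records with different local field names, and each maps its fields onto shared FOAF terms so that descriptions become comparable. Everything system-specific here is invented for illustration: the record layouts, field names and values are hypothetical and are not taken from Care2X or any hospital system. Only foaf:name and foaf:phone are real vocabulary terms, and matching on name and phone is a deliberately naive identity criterion.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical local records held by two independent systems.
GP_RECORD = {
    "patient_id": "p-17",
    "full_name": "Siobhan Hegarty",
    "phone": "tel:+353-00-0000000",
}
HOSPITAL_RECORD = {
    "mrn": "H-0042",
    "name": "Siobhan Hegarty",
    "mobile": "tel:+353-00-0000000",
}

# Each system publishes how its local fields map to the shared vocabulary.
MAPPINGS = {
    "gp": {"full_name": FOAF + "name", "phone": FOAF + "phone"},
    "hospital": {"name": FOAF + "name", "mobile": FOAF + "phone"},
}


def to_shared_description(system, record):
    """Translate a local record into property/value pairs in shared terms.

    Fields without a mapping (local identifiers) are dropped on purpose.
    """
    mapping = MAPPINGS[system]
    return {mapping[field]: value
            for field, value in record.items() if field in mapping}


def same_person(a, b, keys=(FOAF + "name", FOAF + "phone")):
    """Naive check: both descriptions agree on every identifying property."""
    return all(k in a and a.get(k) == b.get(k) for k in keys)


gp_view = to_shared_description("gp", GP_RECORD)
hospital_view = to_shared_description("hospital", HOSPITAL_RECORD)
print(same_person(gp_view, hospital_view))
```

Once both systems describe the person in shared terms, a third system such as a locating service can work with either description without knowing the local schemas.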
9 Conclusion
We have described our comprehensive vision of Networked Knowledge. Although knowledge is inherently strongly interconnected and related to people, this rich interconnectedness, encoding and representation of knowledge is not reflected or supported by current information infrastructures. For instance, the Web does not provide a good enough reflection. The Web, however, provides a foundational infrastructure which enables the linking of information on a global scale. The Web emerged as a phenomenon that is unstructured, social and open. Adding meaning moves this interlinked information to the knowledge level: Web + Semantics = Networked Knowledge.
The overall research vision is broken down into complementary research strands, namely: (i) Social Semantic Information Spaces, (ii) Semantic Reality and (iii) application-oriented research domains such as eBusiness, Health Care or eGovernment, just to name a few. Core research objectives for the next years include the foundation for the creation of knowledge networks and collaboration infrastructures, which will support human capabilities and enable the human-centric access to services and knowledge on a global scale, opening up new opportunities for individuals and organizations. Example topics include:
Foundations for semantic collaboration: the development of technologies supporting distributed collaboration with a focus on the Semantic Desktop and the Web. Examples include APIs and ontologies that reuse existing social networking information from sites to assess the identity and relevance of information.
Scalable reasoning and querying facilities for knowledge: Current knowledge bases are not able to exploit and analyze knowledge, which would be necessary in order to learn from it. To exploit the available knowledge, scalable querying and data mining mechanisms need to be developed. Additionally, dynamic data sources (streams), modalities (time, space) and noise in the data need to be taken into account and supported.
A Linked Data Layer in the Future Internet: examining the requirements and exact design of a Linked Data Layer in the Future Internet Architecture.
Frameworks for semantic sensor networks: Currently, sensor networks and the data they produce lack semantic description, making it difficult to integrate data coming from large-scale, dynamic sensor networks with existing information. It is necessary to develop practical semantic description methods for sensors and mobile device middleware, enabling the integration of sensor data with knowledge from knowledge networks. This will be part of a more general practical and deployable semantic service-oriented architecture.
Knowledge networks are not created in a vacuum, but inside a highly dynamic information infrastructure, the Web, which provides us with a living laboratory enabling us to validate our approaches and hypotheses, and to improve our ideas. The first way to validate the hypotheses is to study the usage of emerging networks of knowledge on the Web. Many application areas are dealing with the challenges of large, open, heterogeneous, dynamic and distributed environments. Semantics is an important cornerstone for achieving scalability of knowledge interchange and interoperability.
Projects should validate this hypothesis by investigating the required research and approaches in application domains, ranging from eHealth to eGovernment to eLearning.
References
1. Raskino, M., Fenn, J., Linden, A.: Extracting value from the massively connected world of 2015. Gartner Research. http://www.gartner.com/resources/125900/125949/extracting_valu.pdf (1 April 2005)
2. Bush, V.: As we may think. Atl. Mon. 176, 101–108 (1945)
3. Engelbart, D.C.: Augmenting human intellect: a conceptual framework. Stanford Research Institute, Menlo Park, CA, USA (1962). Summary Report AFOSR-3233
4. Golbeck, J., Hendler, J.: Reputation network analysis for email filtering. In: Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA (2004)
5. Breslin, J.G., Harth, A., Bojars, U., Decker, S.: Towards semantically-interlinked online communities. In: European Semantic Web Conference. Springer, Berlin (2005)
6. Decker, S., Frank, M.: The Social Semantic Desktop. Technical Report, Digital Enterprise Research Institute (2004)
7. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mesnage, C., Jazayeri, M., Reif, G., Gudjonsdottir, R.: The NEPOMUK project - on the way to the social semantic desktop. In: I-Semantics’ 07. J. Univers. Comput. Sci. (2007)
8. Krötzsch, M., Vrandecic, D., Völkel, M.: Semantic MediaWiki. In: International Semantic Web Conference. Springer, Berlin (2006)
9. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5, 1–22 (2009)
10. Aberer, K., Hauswirth, M., Salehi, A.: Infrastructure for data processing in large-scale interconnected sensor networks. In: 8th International Conference on Mobile Data Management (2007)
11. Aberer, K., Hauswirth, M., Salehi, A.: Invited talk: zero-programming sensor network deployment. In: SAINT-W ’07: Proceedings of the 2007 International Symposium on Applications and the Internet Workshops, Washington, DC, USA, p. 1. IEEE Computer Society, Los Alamitos (2007)
12. NIST: IEEE1451. http://ieee1451.nist.gov/ (2006)
13. Open Geospatial Consortium: Sensor Model Language (SensorML). http://vast.uah.edu/SensorML/ (2008)
Reflecting Knowledge Diversity on the Web
Elena Simperl, Denny Vrandečić, and Barry Norton
Abstract The Web has proved to be an unprecedented success for facilitating the publication, use and exchange of information on a planetary scale, on virtually every topic, and representing an amazing diversity of opinions, viewpoints, mindsets and backgrounds. Its design principles and core technological components have led to an unprecedented growth and mass collaboration. This trend is also finding increasing adoption in business environments. Nevertheless, the Web is also confronted with fundamental challenges with respect to the purposeful access, processing and management of the sheer amount of information, whilst remaining true to its principles, and leveraging the diversity inherently unfolding through worldwide collaboration. In this chapter we will motivate engagement with these challenges and the development of methods, techniques, software and data sets that leverage diversity as a crucial source of innovation and creativity. We consider how to provide enhanced support for feasibly managing data at a very large scale, and design novel algorithms that reflect diversity in the ways information is selected, ranked, aggregated, presented and used. A successful diversity-aware information management solution will scale to very large amounts of data and hundreds of thousands of users, but also to a plurality of points of view and opinions. Research towards this end is carried out on realistic data sources with billions of items, through open source extensions to popular communication and collaboration platforms such as MediaWiki and WordPress.
E. Simperl, Institut AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_10, © Springer-Verlag Berlin Heidelberg 2011
1 Preface: The Ultimate Diversity Manager
In our community, debates and discussions have raged over the years on the future development and ultimate destiny of semantic technologies: Linked Data versus ontologies, lightweight versus heavyweight, shared and reusable versus application- and task-oriented, Description Logics versus F-logic, Web 2.0 versus the Semantic Web, bottom-up versus top-down, services versus the rest, to name just a few. More often than not, both sides of the debate have had vocal proponents who can eventually trace their roots back to Karlsruhe and Rudi Studer, disguised as researchers from the Semantic Technologies Institute in Innsbruck, the Digital Enterprise Research Institute in Galway, the Institute Web Science and Technologies in Koblenz, as well as many other places too numerous to mention. Rudi Studer was often an island in these storms, providing a resort where discussions could be calmed when they became too heated, bringing everyone back refocussed to the discussion table. He has been, in instance after instance, an important factor in the growth and development of the community, managing to balance the different sides without sacrificing integrity in any place. This has allowed the diversity in the Semantic Web research community to grow and bloom, and different parties to agree that they disagree, while continuing to pursue their research vision for the general benefit of the overall community.
The EU-funded project RENDER1 runs from Fall 2010 to 2014 and deals with diversity not just in a single community, but rather on the Web in its entirety. With a strong consortium, ambitious goals and tremendous impact potential, RENDER aims to establish and foster the values that Rudi Studer demonstrated in his leadership role for the Semantic Web research community and for the Web as a whole.
1 http://www.render-project.eu.
2 The Web and Its Effect on Diversity in Information Management
Twenty years after its introduction, the Web provides a platform for the publication, use and exchange of information, on a planetary scale, on virtually every topic, and representing an amazing diversity of opinions, viewpoints, mindsets and backgrounds. The success of the Web can be attributed to several factors, most notably to its principled scalable design, but also to a number of subsequent developments such as smart user-generated content, mobile devices, and most recently cloud computing. The first two of these have dramatically lowered the last barriers of entry when it comes to producing and consuming information online, leading to an unprecedented growth and mass collaboration. They are responsible for hundreds of millions of users all over the globe creating high-quality encyclopedias, publishing Terabytes of multimedia content, contributing to world-class software, and actively taking part in defining the agenda of many aspects of our society by raising their voices, and publicly expressing and sharing their ideas, viewpoints and resources [12]. This trend towards ‘prosumerism’ is finding more and more adopters in business environments as well, as enterprises not only become active in open initiatives, but also encourage the participation of their employees and customers in taking decisions related to organizational management, product development or service offerings [11].
The other side of the coin in this unique success story is, nevertheless, the great challenge of managing the sheer amount of information continuously being published online, whilst allowing for purposeful use and leveraging the diversity inherently unfolding through global-scale collaboration. These challenges are still to be solved at many levels, from the infrastructure to store and access the information, through the methods and techniques to make sense out of it, to the paradigms underlying the processes of Web-based information provision and consumption. The information management methods and techniques that are at the core of essentially every channel one can use when attempting to interact with the vast ocean of information available on the Internet or elsewhere (be that Web search engines, news sites, eCommerce portals, online marketplaces, media platforms, the blogosphere or corporate intranets) are fundamentally based on principles that do not reflect, and cannot scale to, the plurality of opinions and viewpoints captured in this information. This holds for various aspects of the information management life cycle, from filtering, through ranking and selection, to the aggregation and presentation of information, and has twofold effects on information prosumers, as described in the following subsections.
2.1 Subpar Information Management
Diversely expressed information, richly represented in almost every large-scale post-Web 2.0 environment, should form a target for significant optimizations of information management at various levels. Given the omnipresent information overload that individuals and organizations are facing, information management services should intelligently leverage diversity to adjust and enhance the ways they process, select, rank, aggregate and present this information to their users. As an example, when searching for blog posts, state-of-the-art technology (be that PageRank-based algorithms [9], recommendation engines [1] or collaborative filters [4]) tends to return either the most popular posts, or those which correspond with a personal profile and therefore with the known opinions and tastes of the reader. Alternative points of view, and new unexpected content, are not taken into account as they are not highly ranked, and posts expressing different opinions are sometimes even discarded [6]. This behavior has particularly negative consequences when dealing with information that is expected and intended to be subject to diverse opinion, as is the case with news reports, ratings of products or media content, customer reviews or any other type of subjective assessment [3, 10, 13]. In this case, potentially large amounts of useful information are de facto not accessible to the user through automated tools, and need to be discovered through time-consuming manual processing.
The same negative effects apply in a community-driven environment that is designed for collaboration, the most obvious example here probably being Wikipedia, which we specifically consider later. The information diversity exposed in such an environment, impressive both with respect to scale and the richness of opinions and viewpoints expressed, cannot be handled in an economically feasible manner without adequate computer support.
In the long run, maintaining the current state-of-affairs will change the ways and the extent to which people are informed (or
not) on a particular topic, tremendously influencing how they look into that topic, what they find about it and what they think about it. In addition to providing suboptimal services to end-users, the minimal support provided by current information management services when it comes to diversity significantly hampers the productivity of any organization that needs to deal with the increasingly large amounts of diversely expressed information within the many information prosumption tasks that are part of its daily business, and any decision process made thereupon. This includes vertical sectors such as mass-media and publishers, eCommerce platforms, manufacturers, and, beyond this, any type of community-driven endeavor and indeed every organization acknowledging the great value of the opinions and viewpoints of its employees and customer base.
2.2 Broken Forms of Dialog, Opinion Forming and Collaboration
Despite the pivotal effects of Web-based global collaboration, a thorough analysis and understanding of this phenomenon and of its economic, cultural, social and societal implications are so far lacking. Today, the creation of successful systems through massive collaboration and knowledge diversity is based more on luck and intuition than on design and engineering. Supported by the rapid advent of technological trends such as Web 2.0 and mobile computing, the omnipresent Wikipedia, Facebook, YouTube, Amazon, MySpace, Google’s PageRank, and the blogosphere as a whole have become tremendously influential with respect to forming public opinion, but also to the ways information is processed, filtered, aggregated, presented and organized. The core technologies and mechanisms underlying these platforms, including ranking based on popularity and links, collaborative filtering, and global truth-driven mechanisms for decision making, are the de facto standard instruments to leverage collective intelligence from Web-based collaboration patterns, thus playing a crucial role in the means by which, and the extent to which, information is delivered to Internet users.
The Web as it is built today allows the discovery and exchange of ideas among people with shared interests, and facilitates the creation of globally-reaching communities. This paradigm works well for many application scenarios. Nevertheless, as it is based on consensus-finding mechanisms such as the one in Wikipedia, it is less open to transparent representation and to supporting a dialogue between a plurality of views and opinions, which are inherent and useful in many domains and in societies, such as the European socio-political and cultural arena [8]. On the Web as we know it, dialogue and opinion forming are broken. Members of a community of interest tend to reinforce each other in their points of view.
Only new members who already think alike, or are willing to accept the decisions the group has previously taken, integrate easily. Consequently it can be observed that such communities often foster an in-group agreed opinion that may significantly diverge from the opinion of society at large. As such, the society as a whole becomes increasingly polarized, making it almost impossible to discuss topics on a
larger, all-society encompassing scale since inter-group communication and collaboration is hindered. There are numerous examples of this effect. Conservapedia2 is a project started by former Wikipedia contributors who argue that Wikipedia has a bias towards liberalism and atheism. Relatedly, whereas the English Wikipedia remains a common battlefield for supporters of the Serbian and Croatian points of view on many topics of their common history, the Serbian and Croatian language versions of Wikipedia reveal clear and distinct biases, which are growing due to their relatively separated and self-moderating communities.
At another level, as will be elaborated later in this chapter, Wikipedia as a whole has to deal with huge amounts of information, which are for obvious reasons rich in diversity. To do so, Wikipedia editors have to rely on information management tools that are not designed to reflect this diversity, and on a gradually decreasing number of voluntary contributors. The collaboration paradigms behind the information management technology used within Wikipedia, the procedures installed to govern information provisioning, and the information consumption services that can be offered to end-users based on the available technology, hamper the sustainable growth of Wikipedia.
Another example is collaborative filters such as those used by Amazon, which recommend books that are similar in bias and opinion to the one already being browsed or bought by the customer, thus effectively hiding alternative points of view or ideas, as opposed to leveraging this diversity to provide novel means to select, rank and present information to the users. Other examples using collaborative filtering techniques are StumbleUpon3 and Delicious,4 both offering a view on the Web that is highly personalized, and thus biased. Challenges remain when it comes to developing mechanisms and interfaces that retain the advantages of personalization without sacrificing diversity.
2 http://www.conservapedia.com/.
3 http://www.stumbleupon.com.
4 http://delicious.com.
3 An Information Management Framework Reflecting Diversity
In order to overcome the issues discussed above, a comprehensive conceptual framework and technological infrastructure are needed to enable, support, manage and exploit information diversity in Web-based environments. Diversity is a crucial source of innovation and adaptability. It ensures the availability of alternative approaches towards solving hard problems, and provides new perspectives and insights on known situations [7]. A successful diversity-enabled information management solution will scale to very large amounts of data and hundreds of thousands of users, but also to a plurality of points of view and opinions. It will be applicable to news streams covering tens of thousands of sources worldwide, (micro) blog streams adding up to more than a million posts per day, a full data stream from Wikipedia, and the Linked Open Data Cloud.
The most successful Web enterprises, including the ones we mentioned so far, have proven to be those that manage to become better by growth, in other words, those among them that react positively to scaling effects. The principles, methods and techniques used by these platforms, and according to which information is processed, selected, ranked, combined and made accessible to users, are crucial for the utility of the information-based services offered to information prosumers. In their underlying technologies and paradigms, the most popular platforms in the field are nowadays based on very simple information filtering, ranking and aggregation principles. By design, these platforms do not support many forms of inter-group collaboration and communication in an optimal way, and provide only limited support to individuals and organizations that need to make sense out of the richly diverse information available, so as to make informed decisions in their daily lives, or to ensure a competitive business advantage. We turn again to Wikipedia and the blogosphere as examples: Wikipedia is a tremendous success, but it is also a rigid meritocratic system with a decreasing number of active contributors, whereas the blogosphere has to deal with the limited attention of the blog authors.
What is needed are novel concepts, methods and tools that allow humans and machines to leverage the huge amounts of information created by an active community, based on interaction models that support expressing, communicating and reasoning about divergent models and world views simultaneously. This would not only enhance true collaboration, but would also significantly improve various aspects of the information management life cycle, thus addressing information overload in sectors which rely on opinion-driven information sources and mass participation: news, ratings, reviews, and social and information sharing portals of any kind.
This should be achieved by novel approaches towards information management, not only scaling to hundreds of thousands of users and billions of information items, but also to a plurality of points of view and opinions. A diversity-enabled solution will help to realize a world where information is shared in a fundamentally different manner than the consensual approach promoted by movements such as Web 2.0, and where communication and collaboration across the borders of social, cultural or professional communities are truly enabled via advanced Web technology. New technologies and applications will be developed towards enabling and leveraging global-scale collaboration without the need to constantly compromise to reap the complementary and unique abilities of millions of European citizens, and towards providing useful and efficient information management services reflecting diversity aspects.
The architecture shown in Fig. 1 aims to leverage and integrate concepts, methods and techniques from the areas of Web 2.0, data and Web mining, semantic technologies, information retrieval, and information aggregation. Were computers able to effectively access and intelligently process the content of collaboratively-created information-based artifacts, then novel algorithms providing a more complex aggregation could dramatically increase the efficient dissemination, discussion, comparison, and understanding of the knowledge expressing the uniqueness of its authors and contributors. Ranking and filtering algorithms could balance and enhance both individuality and convergence. Applications would then be able to use these new
Fig. 1 A representative conceptual architecture
algorithms to provide suggestions and to help users navigate the ever-growing Web, exploiting its rich content to its full potential. Diversity-empowered information management, including methodologies for diversity articulation, acquisition and usage, languages for capturing this knowledge, and algorithms to process it, provides the theoretical principles and the core technology to achieve the four following objectives.
• Collect and manage information sources which are rich in diversity of viewpoints so that this information is available in an effective form and can be processed efficiently. This is achieved by crawling, harvesting, structuring and enriching various information sources that are rich in diversity. Very large amounts of content and metadata are leveraged: news, blog and microblog streams, content and logs from Wikipedia, news archives and multimedia content available to Google, discussion forums and customer feedback databases (such as Telefónica’s), taken together adding up to hundreds of millions of items, updated on a daily basis. This data will be managed by a highly scalable data management infrastructure, and enriched with machine-understandable descriptions and links referring to the Linked Open Data Cloud. The results can be published online as high-quality, self-descriptive data sets that will be available to the large-scale information management community worldwide for widespread use.
• Identify and extract the diversity embodied within the various information sources collected, and make the connections and references between different items and sources explicit. In particular: spot and assess biases, factual coverage, and the intensity of opinions expressed; identify complex events; and track topics along multiple sources, across data modalities and languages. The results will be stored and managed through the data management infrastructure mentioned above, and serve as input for higher-level processing and usage.
• Represent and process diversely expressed information so as to explicate and conceptualize the results of the mining task, to enable the development of diversified information management algorithms and services. Novel, scalable techniques enable reasoning over opinions and viewpoints, and diversity-aware information selection and ranking. Means to make diversity information accessible to the end-user must be investigated, providing sophisticated metaphors, interfaces and software tools to organize, display and visualize it. Ranking algorithms will take into account the viewpoints underlying different information items within the top hits, provided a concept such as “top hit” is still deemed appropriate. Information will be summarized in fundamentally different ways, not only making explicit the biases that have been introduced through such processes, but also trying to minimize them. The results will be displayed to the end-user, who will then be able to navigate and discover content and topics within the diversity space.
• Use diversity as an integral concept of popular communication, collaboration and information sharing platforms, in the form of extensions to MediaWiki, WordPress and Twitter. Diversity-aware technology will allow the explicit linkage of items with a dissenting view, and thus increase the diversity exposure of the wider Web audience.
These objectives are motivated by the needs of a number of potential real-world adopters of diversity-aware technology, surveyed in the following sections.
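To make the idea of viewpoint-aware ranking more concrete, the following is a minimal sketch and not a description of any algorithm developed in RENDER. It assumes that each item already carries a relevance score and a viewpoint label (both hypothetical inputs that an earlier mining step would have to supply), and it greedily discounts items whose viewpoint is already represented among the selected hits. The function name `diversity_rerank` and the `penalty` parameter are illustrative choices.

```python
from collections import Counter


def diversity_rerank(items, k, penalty=0.5):
    """Greedily pick k item ids, discounting viewpoints already selected.

    items: list of (item_id, relevance, viewpoint) tuples (hypothetical input).
    penalty: factor applied once per previously selected item that shares
             the same viewpoint; 1.0 reproduces plain relevance ranking.
    """
    remaining = list(items)
    seen = Counter()   # how often each viewpoint has been selected so far
    ranked = []
    while remaining and len(ranked) < k:
        # Score = relevance, discounted for every earlier hit with this viewpoint.
        best = max(remaining, key=lambda it: it[1] * (penalty ** seen[it[2]]))
        ranked.append(best[0])
        seen[best[2]] += 1
        remaining.remove(best)
    return ranked


posts = [
    ("a", 0.9, "pro"),
    ("b", 0.8, "pro"),
    ("c", 0.5, "contra"),
    ("d", 0.3, "neutral"),
]
print(diversity_rerank(posts, k=3))               # viewpoint-aware order
print(diversity_rerank(posts, k=3, penalty=1.0))  # plain relevance order
```

With the sample data, the dissenting post "c" moves ahead of the second "pro" post, whereas a penalty of 1.0 falls back to ordering by relevance alone. The discount factor is only one of many possible trade-offs between relevance and viewpoint coverage.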
3.1 Enabling True Collaboration at Wikipedia
Wikipedia is a top-10 Web site providing a community-built encyclopedia for free. Its success hinges on the support of its volunteer contributors. Although the guidelines set up at Wikipedia aim for a balanced coverage, systemic bias is introduced by the individual views of the actual contributors, the numbers behind each opinion among the Wikipedians, and the procedures defined for creating and editing articles. The growing complexity of these procedures, in addition to the sheer effort required to keep a balance on the basis of the procedures in place, hampers the work of the Wikipedia editing team. Whereas in early years it was easy and straightforward to start a new article, today this basic task has become significantly more difficult.
Table 1 Growth rate of the English Wikipedia. Article count on January 1st of each year
Year   Article count   Annual increase   % annual increase   Average daily increase
2002          19 700            19 700                   –                       54
2003          96 500            76 800                390%                      210
2004         188 800            92 300                 96%                      253
2005         438 500           249 700                132%                      682
2006         895 000           456 500                104%                     1251
2007       1 560 000           665 000                 74%                     1822
2008       2 153 000           593 000                 38%                     1625
2009       2 679 000           526 000                 24%                     1437
2010       3 143 000           464 000                 17%                     1271
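The derived columns of Table 1 can be recomputed from the article counts alone. The sketch below assumes that the annual increase is the difference to the count one year earlier, and that the average daily increase divides this difference by the number of days in the preceding calendar year (366 for leap years); this convention is an inference that reproduces the printed figures, not something the table states. The function name `growth_row` is illustrative.

```python
import calendar

# Article counts of the English Wikipedia on January 1st, taken from Table 1.
ARTICLE_COUNTS = {
    2002: 19_700, 2003: 96_500, 2004: 188_800,
    2005: 438_500, 2006: 895_000, 2007: 1_560_000,
    2008: 2_153_000, 2009: 2_679_000, 2010: 3_143_000,
}


def growth_row(year, counts=ARTICLE_COUNTS):
    """Return (annual increase, % annual increase, average daily increase).

    The percentage is None for the first year, which has no predecessor.
    """
    previous = counts.get(year - 1)
    increase = counts[year] - (previous or 0)
    percent = None if previous is None else round(100 * increase / previous)
    # Assumption: growth is spread over the days of the preceding year.
    days = 366 if calendar.isleap(year - 1) else 365
    return increase, percent, round(increase / days)


for year in sorted(ARTICLE_COUNTS):
    print(year, growth_row(year))
```

For example, the 2005 row covers the leap year 2004, so 249 700 new articles over 366 days give the 682 articles per day shown in the table.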
Table 2 Number of articles and their monthly number of edits
Year   Articles (all)   Edits (all)   Articles (English)   Edits (English)   Articles (German)   Edits (German)
2002           20 000        16 000               17 000            14 000                 900              150
2003          150 000       114 000              104 000            75 000              12 000           16 000
2004          440 000       414 000              188 000           176 000              47 000           61 000
2005        1 400 000     1 600 000              447 000           620 000             189 000          315 000
2006        3 600 000     6 200 000              901 000         2 800 000             350 000          719 000
2007        7 200 000    10 800 000            1 500 000         4 600 000             541 000          949 000
2008       11 700 000    11 300 000            2 200 000         4 300 000             710 000          817 000
2009       16 200 000    12 300 000            2 700 000         4 400 000             873 000          868 000
2010       21 500 000    13 000 000            3 200 000         4 000 000           1 000 000          838 000
These factors have an impact on Wikipedia’s growth rate, both with respect to the number of new editors and new content.5 As can be seen in Table 1, the absolute number of new articles has decreased dramatically during the last two years (by more than 10% per year). However, the number of edits is steadily growing, as can be seen in Table 2.
5 http://stats.wikimedia.org/.
One way to overcome the current unsatisfactory state of affairs is to revise the existing Wikipedia processes and, we argue, those of every other ongoing massively collaborative endeavor. Activities such as the handling of edit conflicts, the organization of the content, the checking of inconsistencies (both within Wikipedia and with respect to external sources) and the integration over different languages are very demanding in terms of the amounts of human labor they require. Elaborate procedures that cover some of these aspects, including edit conflict resolution, arbitration committees, and banning policies, and an increasingly complex hierarchy of readers, contributors, editors, administrators, bureaucrats, ombudspersons, trustees, and so on, are in place, but their operation, given a declining number of active Wikipedians, is not sustainable. Also, the outcomes of these costly processes are not always positive; the meritocratic approach of Wikipedia often finds champions for specific opinions, but seldom for a generally balanced, diversity-minded depiction of a topic.
The German chapter of the Wikimedia Foundation is investigating how to build a truly ‘diversified’ Wikipedia. Wikipedia editors need support in discovering useful content and the diversity of viewpoints within a topic to encourage large-scale participation and sustainable growth. The improved handling of edit conflicts will form a central component. Using the massive amount of metadata available within Wikipedia (described below, and directly scaling with the number of edits shown in Table 2), as well as a series of structured and semi-structured external information sources, representations, techniques and tools will be developed to discover, understand, and use the following types of information: the multitude of opinions and viewpoints, the points of dissent, content that would otherwise disappear from view, the quality of articles, and controversies surrounding specific topics. Information sources that are useful in this context include, but are not limited to, the complete edit history of each article, change comments, user contribution logs, social networks in the user contribution logs (as in, who works with or against whom on which articles), the content of articles, including comments and previous versions, access logs and various external data sets such as Eurostat, data.gov(.uk, etc.), Twitter, Linked Open Data, Freebase, Wolfram Alpha, and archives of scientific publications.
In order to provide these representations, techniques and tools, a number of Web 2.0-motivated challenges will be addressed:
• Identify and extract diversely expressed information: Mechanisms will be developed to identify and extract opinions, viewpoints and sentiments based on the available Wikipedia metadata, going significantly beyond shallow extraction. These mechanisms will use temporally coincident comments on the article talk pages, and answers to these, as well as discussions on the responsible contributors’ talk pages. Wikimedia has access to a critical mass of this metadata, enough to reliably discover relations between changes and to understand their meaning in depth, thus giving an overview of the events that led to a conflict situation.
• Represent and process diversely expressed information: Methods will be designed that utilize opinions and viewpoints to summarize, understand, and visualize the flow of discussions on a specific topic. A highly expressive formalization of discussions cannot be achieved in a feasible way, owing to the limitations of formal knowledge representation languages and the computational complexity of inference over such rich formalizations. These methods will therefore leverage semi-structured data such as fragments of articles and associated change information, as well as lightweight representations and reasoning that make explicit key aspects of diversity.
• Diversify information management: We distinguish among four different diversity-empowered services that will considerably improve the ways information is currently managed within Wikipedia.
– Quality assessment of articles: Methods will be defined, and user-friendly tools developed, to assess the quality and reliability of a Wikipedia article, and of specific statements within an article. The result of the confidence and diversity analysis should be made accessible and explicit to Wikipedia readers. Currently, manual tagging of articles exists, but the tags are often out of date and incompletely applied to the article set. Automatic tagging mechanisms will give the reader more confidence about the trustworthiness of the article. Besides tagging complete articles, users should also be able to mark a single statement and query the system about that statement. Wikipedia articles often contain generally good knowledge, but are sprinkled with small inconsistencies or simple acts of vandalism. Whereas blatant lies are often discovered and corrected quickly, subtle errors may escape the attention of most readers and linger in the article for a long time. The aim is to learn to categorize and understand edits to Wikipedia, and to record this information as metadata for each article.
– Conflict resolution support: Methods will be investigated that facilitate the understanding of the history of given conflicts, make the involved parties and their interests explicit, and create possible resolutions, thus supporting editors in handling edit conflicts. User-friendly tools will be developed to guide editors in their analysis of the Wikipedia metadata that is relevant in the context of such conflicts.
– Anomaly detection: Methods will be developed to automatically uncover and highlight anomalies within processes and content. For example, edit hot-spots can be highlighted in an article and its history, the most important revisions can be discovered and marked, articles that have been collaterally edited during an edit war can be aggregated and displayed, and inconsistent data may be discovered and rendered for the user.
– Consistency checking: Wikipedia is available in all major European languages. This allows the content of different language editions to be checked against each other, in order to supply information that is missing in one language from another, and to discover biased articles in individual language editions. Cross-language fact detection and suggestion methods will be provided that pool the facts learned with the highest confidence from each language version.
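The cross-language pooling described under consistency checking can be illustrated with a minimal sketch. It assumes that facts have already been extracted from each language edition as records carrying a confidence score; the record layout, the function name pool_facts and the sample values are hypothetical and are not part of Wikipedia or of the work described here.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted statement; this layout is an assumption for illustration."""
    edition: str       # language edition, e.g. "de"
    subject: str
    prop: str
    value: str
    confidence: float  # extractor confidence in [0, 1]


def pool_facts(facts):
    """Keep the highest-confidence value per (subject, prop) and report
    which editions disagree with it or lack the fact entirely."""
    by_key = defaultdict(list)
    editions = set()
    for fact in facts:
        by_key[(fact.subject, fact.prop)].append(fact)
        editions.add(fact.edition)

    pooled = {}
    for key, candidates in by_key.items():
        best = max(candidates, key=lambda f: f.confidence)
        conflicting = sorted({f.edition for f in candidates if f.value != best.value})
        missing = sorted(editions - {f.edition for f in candidates})
        pooled[key] = {"value": best.value, "source": best.edition,
                       "conflicting": conflicting, "missing": missing}
    return pooled


# Hypothetical sample data, invented for this example.
sample = [
    Fact("en", "ExampleTown", "population", "119000", 0.7),
    Fact("de", "ExampleTown", "population", "121000", 0.9),
    Fact("de", "ExampleTown", "elevation", "574 m", 0.8),
]
result = pool_facts(sample)
```

In this sketch the "conflicting" list points at editions whose articles may be biased or outdated, and the "missing" list points at editions to which the fact could be suggested.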
3.2 Diversity-Minded News Coverage and Delivery at Google News
In the Google News diversity scenario, content is created either by professional journalists or by Web users. The resulting information is potentially rich in diversity, which, if leveraged properly, may facilitate new user experiences, forms of content prosumption, and business models for news providers and aggregators. Google will make diversity-aware technology accessible to general Web users and collect usage data in order to refine and validate our services and research results. This will be achieved by diversifying the results of Google Alerts, and by providing a browser
extension that makes the diversified view of a news item visible. Google also has rich information about general Web usage that can be used to further enrich news items with additional information and links, be it to the Web of comments on news articles, to their metadata, or to the connections between portal content and the rest of the Web. This includes, but is not limited to, the blogosphere and its own linking structure, RSS feeds, comments, forums, news groups, “diggs”,6 Facebook comments, and ‘tweets’. Insightful comments on an article often never reach the surface and simply disappear in the vast ocean of news-related information, since search algorithms, as we know them today, will very likely find those comments that are most linked to articles, or blog posts by bloggers who are prominent anyway. Personalization cannot solve this problem, since it just leads to a Web that solely reinforces the user’s own biases, and thus furthers the fragmentation of Web users. Such fragmentation, in turn, results in ever-stronger radicalization, since a fragmented group will tend towards a more radical shared point of view, whereas an integrated group will average towards a more balanced one. This is especially evident in political blogs, which often have a very clear-cut alignment to a party and generously cite blogs with the same opinion, while steadfastly ignoring bloggers of a differing opinion, with one exception: to ridicule them. The cost of manually discovering interesting, diverse opinions and viewpoints on the Web and using this information within news-based services is currently prohibitive due to the sheer size of the information space to be dealt with; thus, it can be tackled only by primarily automatic approaches [5].
6 http://digg.com.
A diversity-aware information management framework will address this limitation and provide efficient representations and techniques to explore the opinions, viewpoints and discussions around a topic on the public Web, leveraging this diversity into innovative news prosumption services that go beyond popularity ranking and traditional filtering and recommendations [2]. An example will help to illustrate this. Typically, when an event happens, it is covered in the news in very different ways. Imagine the following headlines about the same event, from a German newspaper (A) and a British one (B):
• [A] “Robbie: Comeback even worse than expected”.
• [B] “Solid performance in his first show. Robbie is back”.
Even though both stories refer to the same event, they represent very different viewpoints, exposing different biases and levels of intensity in wording. The most passionate stakeholders will provide very timely reports, matching one of the two points of view and, due to the nature of their passion, will most likely be heavily biased towards one of them. So, a short while after the event took place, different opinions will be presented close to the extremes. The sources will reference each other, usually staying within their own opinion. Taking this scenario further, it is likely that even more people will report on the event and, as part of their reports, link to the most popular sources, as indicated through common and dedicated search engines,
thus steadily quoting only one side of the story. The result of this reinforcing effect will be that the biased source becomes even more popular, while the other source de facto disappears from public opinion, ‘hidden’ in a long list of hits returned by a search engine. Moreover, the consequence will be an ever-increasing bias towards the more popular reporting. In this specific example, a language-related bias might also occur, due to a large number of English-language newspapers and users linking to source A and similar news headlines. Most readers of the English-speaking Web will thus hardly ever see headline B. In this case, the Web will serve to emphasize the divide even further. With little care for the objective truth, A has killed off the opinion diversity. The opposite situation might occur in German-speaking countries.
The models underlying this example are very simple: we assume that early reports are usually more biased, and that even a small advantage of an early overrepresented opinion will grow disproportionately in a short time due to the ranking of diversity-unaware search mechanisms. This leads to one-sided media coverage, and thus to only one side being credited by each member of the general public. Diversity-aware technology will contribute to the alleviation of this situation by providing representations, techniques and software tools that are not only aware of the inherent diversity and biases within news reports and related information sources, but are also able to use this diversity as an asset to offer improved and balanced information management services. More specifically, in order to develop diversity-aware news aggregation, a number of news-motivated challenges need to be tackled, as follows.
• Identify and extract diversely expressed information: News stories already provide rich metadata.
Unlike in a Wiki-based environment, the content is basically static, and changes mostly occur at the level of user-produced comments and discussions associated with a story. Still, online news stories exist within an increasingly dense Web of interlinked information sources. The link structure is implicitly temporally directed, which means that an article can refer only to articles that were published before it. To identify opinions and viewpoints within journalistic content, external information sources need to be carefully selected and processed, in order to provide a high-quality and comprehensive experience for the reader.
• Represent and process diversely expressed information: Discovering news about the same event, and selecting, ranking and summarizing the information related to it while paying proper attention to the full spectrum of opinions expressed, will be the key aspects of diversity-aware news access. Since news has a naturally high information density, high precision must be sought in automatically discovering the biases, selecting the most relevant sources for each bias, and then carefully choosing what to display in order to ensure that the differences between the items can be easily perceived by the reader. This also applies to external sources that are seen to be relevant to the topic at hand; the challenge here is the balanced choice of the different sides of the same story, based on own and external content.
• Diversifying news prosumption: Diversity can improve news aggregation management in several novel and interesting ways, creating unique selling points for the usage of diversity-aware news aggregation.
– Diversity-preserving presentation of news: Today, popular viewpoints easily dominate existing news ranking mechanisms, and thus alternatives often simply disappear, draining the Web of its natural diversity. Our technology will help in preserving the diversity and in exposing it to the casual reader without being disruptive to their normal (reading/viewing) flow.
– RSS feeds and email alerts on opinions: Whereas traditional news aggregators allow for the setting up of news alerts on specified topics, our solution will enable the selection of not only a topic, but also a specific opinion and bias.
– Debate discovery: The Web is the biggest information space ever created, and thus the biggest space for debates and discussions. Successful diversity-aware technology will be able to discover the most active debates on the Web today, identify their proponents and allow users to dive into the debate, understanding the social interactions between the participants and their common history.
– Difference analysis: Often two different reports about an event will both contain correct information, but the selection of the information uncovers underlying biases. Diversity-enabled technology in Google News will discover not only the differences in the coverage of an event, but also the common set of facts about it, creating a navigable information space that can be explored by the user in order to gain a better understanding of the inherent diversity.
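The simple model behind the Robbie example earlier in this section (a small early advantage that grows disproportionately under popularity-based ranking) can be made concrete with a short numerical sketch. The update rule is an assumption chosen for illustration: in every round a fixed number of new links is distributed across the sources in proportion to their current link count raised to a power alpha. The function name, the parameter values and the starting counts are hypothetical.

```python
def simulate_link_shares(initial_links, rounds, new_links_per_round=100.0, alpha=1.5):
    """Return the share of links held by each source after every round.

    Each round distributes `new_links_per_round` links across the sources in
    proportion to (current links) ** alpha. With alpha == 1 the shares stay
    constant; with alpha > 1 the leading source gains share in every round.
    """
    links = [float(n) for n in initial_links]
    history = []
    for _ in range(rounds):
        weights = [n ** alpha for n in links]
        total_weight = sum(weights)
        links = [n + new_links_per_round * w / total_weight
                 for n, w in zip(links, weights)]
        total_links = sum(links)
        history.append([n / total_links for n in links])
    return history


# Source A starts with a small lead over source B (55 vs. 45 links).
biased = simulate_link_shares([55, 45], rounds=50, alpha=1.5)
neutral = simulate_link_shares([55, 45], rounds=50, alpha=1.0)
```

Comparing the two runs separates the effect of the ranking from the effect of the initial lead: under proportional linking (alpha equal to 1) the initial imbalance merely persists, whereas any superlinear preference for the already popular source makes the imbalance grow.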
3.3 Diversity-Enhanced Customer Relationship Management at Telefónica
An illustrative potential corporate adopter of diversity-aware technology is provided by the telecoms sector. In particular, the telecommunication operator Telefónica aims to address the growing need of its multi-national enterprise to exploit the ‘wisdom’ of its large customer base, expressed as a vast array of opinions, viewpoints, suggestions and ideas, as a means to respond optimally to market demands and developments. Telefónica is developing a novel approach to customer relationship management as a first step towards the implementation of a more comprehensive global enterprise crowdsourcing strategy. Working out and allocating resources to meet the varying issues, topics and problems is often difficult given the large amounts of heterogeneous and rapidly changing information to be taken into account, including call center contacts, the Telefónica on-line Web site, and public forums in which customers share their opinions, knowledge, advice and dissatisfaction. Telefónica will leverage diversely expressed information, created through these channels, as a means to improve and forecast future product and service-related decisions by paying proper attention to the feedback of millions of customers. Currently, it is not possible to fully exploit this rich customer feedback and to take management decisions based upon it in a timely fashion, as the representations, techniques and tools required to create the awareness of this diversity at the level of decision makers, and to use it for business purposes, are
simply missing. Manually tracing and capturing the millions of information items produced by (interaction with) Telefónica’s large customer base on a monthly basis, and processing this raw input so that it can be used for economic and organizational decision-making processes, is not feasible at the moment. To achieve this, a number of enterprise-motivated challenges will have to be addressed:
• Identify and extract diversely expressed information: Telefónica has gone through different phases in how it attends to its large customer base: from a large network of offices, to call centers, and finally to becoming an ‘online’ company. Nowadays, most contacts with customers are focused on the latter channels. From these contacts a large amount of data is gathered and has to be processed. Although the contacts held between operators and clients are recorded in structured form, operators use free text to reflect the customers’ point of view reliably. Such unstructured records are also found in open forums, customer portals and customer surveys of service quality. By adopting diversity-aware technology, Telefónica will be able to assess incoming requests, complaints and concerns, identify opinions, viewpoints, trends and tendencies, and take feasible actions based on them. Awareness of the various opinions and viewpoints concerning Telefónica’s products and services is one of the most important factors in enabling evidence-based actions aimed at maintaining and improving the quality perceived by its customers. This knowledge will not only allow Telefónica to react early to a negative perception of a new product or service, but will also enable proactive actions, anticipating solutions and mechanisms oriented to improve the perception of quality. The tools developed will uncover the real opinions about Telefónica’s products and services that are spread across the communication channels outlined above.
This diversity of opinion, stemming from different and divergent groups, will enable reflections on market segmentation. There is a need to know not only the aggregate opinion of all customers, but also the opinions of groups of customers according to market segments. Knowing how these different segments perceive the quality of products is necessary to improve the relationship between Telefónica and its customers.
• Represent and process diversely expressed information: As in the other case studies, methods will be designed that allow the tracing of topics and their evolution in time. Discussions will be analyzed, and the biases within them, as well as the intensity of the opinions expressed, will be identified. Knowledge within Telefónica will be represented using lightweight knowledge structures suitable for application in diversity-aware technology. Ranking and visualizing techniques for diversity information are essential for this case study in order to handle the huge amounts of information expected.
• Diversifying customer management:
– Awareness of diversity within the company: As described above, Telefónica maintains a number of different communication channels to interact with customers on questions regarding products and services. The communication flow over these channels is no longer manually tractable. In general, the impact of
these technologies on customer satisfaction can be seen as positive; nonetheless, from a company-oriented perspective, the highly valuable incoming information is not appropriately exploited. From a business intelligence point of view, there is interest in identifying, accessing and taking into account the full amount of broadly diverse information in customer communications.
– Topic searching and evolution: Finding channels in which diversity is the dominant factor is a challenge for managers who would like to be fully aware of, and able to deal with, the issues raised. Currently there are two main kinds of processing: clustering and classification. Clustering produces groups automatically through statistical analysis of the data; classification, carried out manually by operators when filling in the record of a contact, is limited to a fixed number of categories. In this context, senior managers are still left without adequate tools to search for topics and track their evolution in the channels dominated by diversity. These topics are not necessarily known ahead of time, and thus cannot be included in the classification process. This need appears particularly when new products or services are launched, or when features of the network, the products or the services are updated. In these situations, new waves of requests or complaints motivated by these events are collected, but their impact cannot be properly analyzed because the fixed classification scheme may obscure them. Therefore, via diversity-aware techniques and tools, Telefónica expects first to improve its awareness of different, unknown topics, and second to improve its decisions on how to deal with those topics. Both should lead to a higher level of satisfaction within Telefónica’s large customer base.
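The gap between a fixed classification scheme and topics that are not known ahead of time can be illustrated with a deliberately simple sketch. It compares word frequencies in free-text contact records from two periods and reports terms whose relative frequency has jumped. This is a plain frequency comparison chosen for illustration, not a method used or planned by Telefónica; the function names, thresholds and sample messages are hypothetical.

```python
from collections import Counter
import re


def term_counts(messages):
    """Count lower-cased word tokens across free-text customer messages."""
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"[a-z]+", message.lower()))
    return counts


def emerging_terms(before, after, min_count=2, min_ratio=2.0):
    """Return terms whose relative frequency grew by at least `min_ratio`.

    A term that never occurred in the earlier period is treated as if it had
    been seen once, so new topics are reported without dividing by zero.
    """
    old, new = term_counts(before), term_counts(after)
    old_total = max(sum(old.values()), 1)
    new_total = max(sum(new.values()), 1)
    found = {}
    for term, count in new.items():
        if count < min_count:
            continue
        old_rate = max(old[term], 1) / old_total
        new_rate = count / new_total
        if new_rate / old_rate >= min_ratio:
            found[term] = new_rate / old_rate
    return sorted(found, key=found.get, reverse=True)


# Hypothetical contact records, invented for this example.
before = ["invoice is wrong", "invoice arrived late", "cannot reach call center",
          "invoice shows wrong amount", "call center was helpful",
          "contract renewal question"]
after = ["router firmware update broke wifi", "wifi drops after firmware update",
         "firmware update failed", "invoice is wrong"]
topics = emerging_terms(before, after)
```

None of the reported terms needs to exist as a category beforehand, which is the property the fixed classification lacks; a realistic system would of course need language-specific tokenization and far more robust statistics.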
4 Conclusions
Diversity-aware technology will help to realize a world where information is acquired and shared in a manner fundamentally different from the consensual approach promoted by movements such as Web 2.0, and where communication and collaboration across the borders of social, cultural or professional communities are truly enabled via advanced Web technology, supporting one of the credos of European society: “United in diversity”.
References
1. Baldi, P., Frasconi, P., Smyth, P.: Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley Series in Probability and Statistics. Wiley, New York (2003)
2. Fortuna, B., Galleguillos, C., Cristianini, N.: Detecting the bias in media with statistical learning methods (2008)
3. Ghose, A., Ipeirotis, P.G.: Designing novel review ranking systems: predicting the usefulness and impact of reviews. In: ICEC ’07: Proceedings of the Ninth International Conference on Electronic Commerce, pp. 303–310 (2007)
4. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Commun. ACM 35(12), 61–70 (1992)
5. Iacobelli, F., Birnbaum, L., Hammond, K.J.: Tell me more, not just “more of the same”. In: IUI ’10: Proceeding of the 14th International Conference on Intelligent User Interfaces, pp. 81–90. ACM, New York (2010)
6. Knowledge@Wharton: Rethinking the long tail theory: how to define ‘hits’ and ‘niches’ (2009)
7. Mannix, E., Neale, M.A.: What difference makes a difference? Psychol. Sci. Public Interest 6(2) (2005)
8. O’Hara, K., Stevens, D.: The Devil’s long tail: religious moderation and extremism on the web. IEEE Intell. Syst. 24, 37–43 (2009)
9. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
10. Sharoda, P., Meredith Ringel, M.: CoSense: enhancing sensemaking for collaborative web search. In: CHI ’09: Proceedings of the 27th International Conference on Human Factors in Computing Systems, pp. 1771–1780 (2009)
11. Shirky, C.: Here Comes Everybody: The Power of Organizing Without Organizations. Penguin Group, Baltimore (2008)
12. Tapscott, D., Williams, A.D.: Wikinomics: How Mass Collaboration Changes Everything. Portfolio, New York (2006)
13. Zhang, M., Ye, X.: A generation model to unify topic relevance and lexicon-based sentiment for opinion retrieval. In: SIGIR ’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 411–418 (2008)
Software Modeling Using Ontology Technologies
Gerd Gröner, Fernando Silva Parreiras, Steffen Staab, and Tobias Walter
Abstract
Ontologies constitute formal models of some aspect of the world that may be used for drawing interesting logical conclusions even for large models. Software models capture relevant characteristics of a software artefact to be developed. Yet most often these software models have no formal semantics, or the underlying (often graphical) software language varies between different use cases in a way that makes it hard, if not impossible, to fix its semantics. In this contribution, we survey the use of ontology technologies for software models in order to carry the advantages of ontologies over to the software modeling domain. We demonstrate that ontology-based metamodels constitute a core means for exploiting expressive ontology reasoning in the software modeling domain while remaining flexible enough to accommodate the varying needs of software modelers.
G. Gröner
Institute for Web Science and Technologies, University of Koblenz-Landau, Universitätsstrasse 1, Koblenz 56070, Germany
e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_11, © Springer-Verlag Berlin Heidelberg 2011

1 Introduction
Today Model Driven Development (MDD) plays a key role in describing and building software systems. A variety of different software modeling languages may be used to develop one large software system. Each language focuses on different views and problems of the system [17]. Model Driven Software Engineering (MDSE) is concerned with the design and specification of modeling languages and is based on the four-layer modeling architecture [2]. In such a modeling architecture the M0-layer represents the real world objects. Models are defined at the M1-layer, a simplification and abstraction of the M0-layer. Models at the M1-layer are defined using concepts which are described by metamodels at the M2-layer. Each metamodel at the M2-layer determines how expressive its models can be. Analogously, metamodels are defined by using concepts described as metametamodels at the M3-layer.
Although the four-layer modeling architecture provides the basis for formally defining software modeling languages, we have identified some unresolved challenges in expressing the formal semantics and the syntactic constraints of modeling languages. The semantics of modeling languages is often not defined explicitly but hidden in modeling tools. To fix a specific formal semantics for metamodels, it should be defined precisely in the metamodel specification. The syntactic correctness of models is often analyzed implicitly in procedural checks of the modeling tools. To make well-formedness constraints more explicit, they should be defined precisely in the metamodel specification.
We suggest defining modeling languages in a way that makes use of ontology technologies, in particular description logics [3]. Description logics are a family of logics for concept definitions that allow for joint as well as separate sound and complete reasoning at the model level and at the instance level, given the definition of domain concepts. OWL2, the web ontology language, is a W3C recommendation with a very comprehensive set of constructs for concept definitions [18] and allows formal models of software languages to be built. Since ontology languages are described by metamodels and can be developed in a model-driven manner, they can be combined with software modeling languages.
In this chapter, we tackle both MDD challenges: defining semantics and defining syntactic constraints. We show how ontologies can support the definition of software modeling language semantics and provide the definition of syntactic constraints. Since OWL2 has not been designed to act as a metamodel for defining modeling languages, we propose to build such languages in an integrated manner by bridging pure language metamodels and an OWL metamodel in order to benefit from both.
This chapter is structured as follows: we start with the state of the art in specifying software modeling languages by creating metamodels. In Sect. 3 we present the idea of how to bridge modeling languages and ontology languages. Here, we emphasize that both languages can be used in an equal and seamless manner. The result is a new enriched modeling language. In Sect. 4 we present how such languages can be used to define semantics and syntactic constraints of modeling languages. Based on these constraints, we can provide software designers with services for checking the validity of domain models. In Sect. 5, we present the implementation. At the end of this chapter, we compare our approaches with related work and present a conclusion.
2 State of the Art of Software Modeling
A relevant initiative from the software engineering community called Model Driven Engineering (MDE) is being developed in parallel with the Semantic Web [17]. The MDE approach recommends first developing models that describe the system in an abstract way, which are later transformed into a real, executable system (e.g. source code).
2.1 Software Modeling
To realize transformations of models and to advance their understanding and usability, models must have a meaning and must conform to a given structure. In MDE, models are described by software languages, and the software languages themselves are described by so-called metamodeling languages. A language consists of an abstract syntax, at least one concrete syntax, and semantics. The abstract syntax of a software language is described by a metamodel and is designed by a language designer. A metamodel is a model that defines the language for expressing a model. The semantics of the language may be defined by a natural language specification or may be captured (partially) by logics. A concrete syntax, which may be textual or visual, is used by a language user to create software models. Since metamodels are also models, metamodeling languages are needed to describe software languages. Their abstract syntax is described by a metametamodel. In the scope of graph-based modeling [5], a metamodeling language must allow for defining graph schemas, which provide types for vertices and edges and structure them in hierarchies. Here, each graph is an instance of its corresponding schema. The Meta-Object Facility (MOF) is OMG’s standard for defining metamodels. It provides a language for defining the abstract syntax of modeling languages. MOF is, in general, a minimal set of concepts which can be used for defining other modeling languages. Version 2.0 of MOF provides two metametamodels, namely Essential MOF (EMOF) and Complete MOF (CMOF). EMOF prefers simplicity of implementation to expressiveness. CMOF, in contrast, is more expressive, but its implementation is more time consuming [20]. EMOF mainly consists of the Basic package of the Unified Modeling Language (UML) which is part of the UML infrastructure [21].
It allows for designing classes together with properties, which are used to describe data attributes of classes and which allow for referencing other classes. Another metametamodel is provided by the Ecore metamodeling language, which is used in the Eclipse Modeling Framework (EMF) [4]. It is similar to EMOF and will be considered in the rest of this chapter, since it is fully implemented in the EMF. Ecore provides four basic constructs:
(1) EClass: used for representing a modeled class. It has a name, zero or more attributes, and zero or more references.
(2) EAttribute: used for representing a modeled attribute. Attributes have a name and a type.
(3) EReference: used for representing an association between classes.
(4) EDataType: used for representing attribute types.
As already mentioned in the introduction, models, metamodels and metametamodels are arranged in a hierarchy of four layers. Figure 1 depicts such a hierarchy. Here the Ecore metametamodel is chosen to define a metamodel for a process language, a modeling language used, for example, to design behavioral aspects of software systems. The process language is defined by the language designer, who uses the metametamodel by creating instances of the concepts it provides. The language user takes the metamodel into account and creates instances which together form a concrete process model. A process model itself can, for example, describe the behavior of a system running in the real world.
Fig. 1 A metamodel hierarchy
Figure 3 depicts a sample process model in concrete syntax, which is an instance of the metamodel depicted in Fig. 2. In Fig. 2 we use a textual syntax to define classes such as ActivityNode or ActivityEdge, and references such as incoming or outgoing, which define links between instances of the corresponding classes in the metamodel. Action nodes (e.g., Receive Order, Fill Order) are used to model concrete actions within an activity. Object nodes (e.g. Invoice) can be used in a variety of ways, depending on where values or objects are flowing from and to. Control nodes (e.g. the initial node before Receive Order, the decision node after Receive Order, the fork node and join node around Ship Order, the merge node before Close Order, and the activity final node after Close Order) are used to coordinate the flows between other nodes. Process models in our example can contain two types of edges, where each edge has exactly one source and one target node. One type of edge is used for object flows and the other for control flows. An object flow edge models the flow of values to or from object nodes. A control flow is an edge that starts an action or control node after the previous one is finished.
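To make the layering concrete, the following sketch mirrors the Ecore constructs named above in plain Python. The classes EClass, EAttribute and EReference are minimal stand-ins written for this example and are not the EMF API; the Instance class and its meta field are likewise assumptions. The sketch defines a fragment of the process metamodel (M2) and instantiates two elements of a process model (M1).

```python
from dataclasses import dataclass, field


@dataclass
class EAttribute:
    name: str
    type_name: str  # stands in for an EDataType such as "String"


@dataclass
class EReference:
    name: str
    target: str      # name of the referenced EClass
    lower: int = 0
    upper: int = -1  # -1 means unbounded, i.e. [0-*]


@dataclass
class EClass:
    name: str
    abstract: bool = False
    super_type: "EClass | None" = None
    attributes: list = field(default_factory=list)
    references: list = field(default_factory=list)

    def is_a(self, other):
        """True if this class is `other` or inherits from it."""
        current = self
        while current is not None:
            if current is other:
                return True
            current = current.super_type
        return False


# M2: a fragment of the process metamodel from Fig. 2.
activity_node = EClass("ActivityNode", abstract=True, references=[
    EReference("incoming", "ActivityEdge"),
    EReference("outgoing", "ActivityEdge"),
])
action = EClass("Action", super_type=activity_node,
                attributes=[EAttribute("name", "String")])
control_node = EClass("ControlNode", abstract=True, super_type=activity_node)
initial = EClass("Initial", super_type=control_node)


@dataclass
class Instance:
    """An M1 element; `meta` links it to the M2 class it instantiates."""
    meta: EClass
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        # Abstract metamodel classes must not be instantiated directly.
        if self.meta.abstract:
            raise TypeError(f"{self.meta.name} is abstract")


# M1: two nodes of a process model.
start = Instance(initial)
receive_order = Instance(action, {"name": "Receive Order"})
```

The language designer works on the objects in the M2 part, while the language user only creates Instance objects; the meta link is what ties each M1 element to the metamodel concept it instantiates.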
2.2 Challenges In the following, we present some challenges, which are partially taken from [32], and exemplify them.
2.2.1 Metamodel Well-formedness Constraints and Checks Model correctness is often analyzed implicitly in procedural checks of the modeling tools. To make well-formedness constraints more explicit, they should be defined precisely in the metamodel specification.
Software Modeling Using Ontology Technologies
abstract class ActivityNode {
  reference incoming [0-*] : ActivityEdge oppositeOf target;
  reference outgoing [0-*] : ActivityEdge oppositeOf source;
}
class ObjectNode extends ActivityNode { }
class Action extends ActivityNode {
  attribute name : String;
}
abstract class ControlNode extends ActivityNode { }
class Initial extends ControlNode { }
class Final extends ControlNode { }
class Fork extends ControlNode { }
class Join extends ControlNode { }
class Merge extends ControlNode { }
class Decision extends ControlNode { }
abstract class ActivityEdge {
  reference source [1-1] : ActivityNode;
  reference target [1-1] : ActivityNode;
}
class ObjectFlow extends ActivityEdge { }
class ControlFlow extends ActivityEdge { }
Fig. 2 Process metamodel at M2 layer
Fig. 3 Process model at M1 layer
Although the Ecore language allows for defining cardinality restrictions for references, it is impossible to define constraints covering all flows of a given process model. For example, a language designer wants to ensure that every flow in an M1 process model goes from the initial node to some final node. To express constraints like this one, language designers have to define formal syntactic constraints of the modeling language which models have to fulfill. Such constraints should be declared directly alongside the metamodel of the language. Based on metamodel constraints, language designers want to check the consistency of the developed language, or they might exploit information about concept satisfiability, checking whether a concept in the metamodel can have instances.
2.2.2 Language Semantics The Ecore metamodeling language, for example, does not provide definitions of formal semantics. The semantics of modeling languages is often not defined explicitly but hidden in modeling tools. To give a language a specific formal semantics, the semantics should be defined precisely, either in the metamodel specification or by transformations which translate software models into logic representations.
2.2.3 Model Services Language users require services for productive modeling and for verifying the progress of even incomplete models. There is agreement about the challenges faced by current modeling approaches [10]: tooling (debuggers, testing engines), interoperability with other languages, and formal semantics. When language users want to verify whether all restrictions and constraints imposed by the language metamodel hold, they want to use services to check the consistency of the M1 model. For example, the users of the process modeling language require services for checking whether their models are consistent with regard to the metamodel and its constraints. It is important that the elements of a model have the most specific type. Thus, language users should be able to select a model element and call a service for dynamic classification. Dynamic classification determines the classes to which model objects belong dynamically, based on object descriptions in the metamodel. For example, a language user who does not have complete knowledge of the language wants to know which types of nodes can be targets of an ObjectFlow edge. Further, it might be interesting for language users to query a model repository for existing model elements by describing the concept in different possible ways. This facilitates the reuse of process models or parts of them.
3 Bridging Software Languages and Ontology Technologies In the following we present two general approaches to bridging software models and ontologies. The two approaches mainly differ in the layer of the model hierarchy where they are defined and the layer where the bridge is used and applied to software models. Language bridges are used to provide reasoning on the metamodel and model layers. Here the metamodel has the role of a schema and models are its instances. Model bridges provide reasoning on models, where models define the schema. Both bridges are exemplified in Sect. 4.
3.1 Language Bridge Figure 4 depicts the general architecture of a language bridge, combining software languages and ontology technologies. The bridge itself is defined at the M3 layer,
Fig. 4 Language bridge
where a metametamodel like Ecore is considered and bridged with the OWL metamodel. Here we distinguish between two kinds of bridges: the M3 Integration Bridge and the M3 Transformation Bridge.
3.1.1 M3 Integration Bridge The design of an M3 integration bridge consists mainly of identifying concepts in the Ecore metametamodel and the OWL metamodel which are then combined. Here, existing metamodel integration approaches (e.g. those presented in [30] and [24]) are used to combine different metamodels. This results in a new metamodeling language which allows for designing metamodels of software languages at the M2 layer with seamlessly integrated constraints. This design and the benefits of integrated metamodels are demonstrated in Sect. 4.1. An integrated metamodeling language provides all classes of the Ecore metametamodel and the OWL metamodel. It merges, for example, OWL Class with Ecore EClass, OWL ObjectProperty with Ecore EReference, and OWL DataProperty with Ecore EAttribute. Therefore, a strong connection between the two languages is built. For example, if a language designer creates a class in his metamodel, he creates an instance of the merge of OWL Class and Ecore EClass. Hence, a language designer can use the designed class within OWL class axioms and simultaneously use features of the Ecore metamodeling language, such as the definition of simple references between two classes. The integration bridge itself is used at the M2 layer by a language designer. He is now able to define language metamodels with integrated OWL annotations. The annotations are used to restrict the use of the concepts a designer modeled and to extend the expressiveness of the language. To provide modeling services to the language user and language designer, the integrated metamodel is transformed into a description logics TBox. The models created by the language users are transformed into a corresponding description logics ABox. Based on the knowledge base consisting of TBox and ABox, we can provide standard reasoning services (e.g. consistency checking) and application-specific modeling services (e.g. guidance and debugging) to both language user and designer. We exemplify these services in Sect. 4.1.
Table 1 Ecore and OWL: comparable constructs
Ecore                    OWL
Package                  Ontology
Class                    Class
Instance and literals    Individual and literals
Reference, attribute     Object property, data property
Data types               Data types
Enumeration              Enumeration
Multiplicity             Cardinality
3.1.2 M3 Transformation Bridge The M3 Transformation Bridge allows language designers and language users to obtain representations of software languages (metamodel/model) in OWL. It provides the transformation of software language constructs like classes and properties into corresponding OWL constructs. As one might notice, Ecore and OWL have many similar constructs, such as classes, attributes and references. To extend the expressiveness of Ecore with OWL constructs, we need to establish mappings from the Ecore constructs onto OWL constructs. Table 1 presents a complete list of similar constructs. Based on these mappings, we develop a generic transformation script, called OWLizer, which transforms any Ecore metamodel/model into an OWL TBox/ABox. Figure 5 depicts the conceptual schema of transforming Ecore into OWL. The four lanes, Actor, Ecore, Model Transformation and OWL Metamodel, show three modeling levels according to the OMG’s four-layered metamodel architecture: the metametamodel level (M3), the metamodel level (M2) and the model level (M1). Vertical arrows denote instantiation, horizontal arrows denote transformations, and boxes represent packages. A model transformation takes the UML metamodel and the annotations as input and generates an OWL ontology where the concepts, enumerations, properties and data types (TBox) correspond to classes, enumerations, attributes/references and data types in the UML metamodel. Another transformation takes the UML model created by the UML user and generates individuals in the same OWL ontology. The whole process is completely transparent for UML users. As one may notice, this is a generic approach that can be used with any Ecore-based language. For example, one might want to transform the UML metamodel/models as well as the Java grammar/code into OWL (classes/individuals). This approach can be seen as a linked-data-driven software development environment [14], and it is illustrated in Sect. 4.2.
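The mappings of Table 1 can be illustrated by a small Python sketch that emits OWL 2 functional-style axioms from a dictionary-based stand-in for an Ecore metamodel and model. This is not the OWLizer script itself, which the chapter does not list; the input format, the function names `owlize_metamodel` and `owlize_model`, and the `:` prefix are assumptions made for illustration.

```python
def owlize_metamodel(classes):
    """Map metamodel classes, references and attributes to TBox axioms."""
    axioms = []
    for cls in classes:
        name = cls["name"]
        axioms.append(f"Declaration(Class(:{name}))")
        if cls.get("super"):
            axioms.append(f"SubClassOf(:{name} :{cls['super']})")
        # Reference -> object property with domain and range.
        for ref, target in cls.get("references", {}).items():
            axioms.append(f"Declaration(ObjectProperty(:{ref}))")
            axioms.append(f"ObjectPropertyDomain(:{ref} :{name})")
            axioms.append(f"ObjectPropertyRange(:{ref} :{target})")
        # Attribute -> data property with domain and datatype range.
        for attr, datatype in cls.get("attributes", {}).items():
            axioms.append(f"Declaration(DataProperty(:{attr}))")
            axioms.append(f"DataPropertyDomain(:{attr} :{name})")
            axioms.append(f"DataPropertyRange(:{attr} {datatype})")
    return axioms


def owlize_model(objects):
    """Map model objects and their links to ABox assertions."""
    axioms = []
    for obj in objects:
        axioms.append(f"ClassAssertion(:{obj['type']} :{obj['id']})")
        for ref, targets in obj.get("links", {}).items():
            for target in targets:
                axioms.append(
                    f"ObjectPropertyAssertion(:{ref} :{obj['id']} :{target})"
                )
    return axioms


metamodel = [
    {"name": "ActivityNode", "references": {"outgoing": "ActivityEdge"}},
    {"name": "Action", "super": "ActivityNode",
     "attributes": {"name": "xsd:string"}},
]
model = [{"id": "receiveOrder", "type": "Action",
          "links": {"outgoing": ["e1"]}}]

tbox = owlize_metamodel(metamodel)
abox = owlize_model(model)
```

Because the mapping only depends on the metametamodel constructs, the same two functions would apply unchanged to any other metamodel expressed in this input format, which is the sense in which the approach is generic.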
Fig. 5 OWLizer
Fig. 6 Model bridge
3.2 Model Bridge Model bridges connect software models and ontologies on the modeling layer M1. They are defined in the metamodeling layer M2 between different metamodels. Figure 6 visualizes a model bridge. The bridge is defined between a process metamodel on the software modeling side and an OWL metamodel in the OWL modeling hierarchy. The process metamodel is an instance of an Ecore (EMOF) metametamodel. A model bridge is defined as follows: (1) Constructs in the software modeling space and in the ontology space are identified. These language constructs are used to define the corresponding models in the modeling layer M1. (2) Based on the identification of the constructs, the relationship between the constructs is analyzed and specified, e.g. the relationship of an Activity in a process metamodel like the BPMN metamodel to an OWL class. We distinguish between a transformation bridge and an integration bridge.
3.2.1 M2 Integration Bridge Integration bridges merge information of the models from the software modeling space and from the ontology space. This allows the building of integrated models (on modeling layer M1) using constructs of both modeling languages in a combined way, e.g. to integrate UML class diagrams and OWL. As mentioned in Sect. 3.1.2, UML class-based modeling and OWL comprise some constituents that are similar in many respects, like classes, associations, properties, packages, types, generalization and instances [22]. Since both approaches provide complementary benefits, contemporary software development should make use of both. The benefits of integration are twofold. Firstly, it provides software developers with more modeling power. Secondly, it enables semantic software developers to use object-oriented concepts like inheritance, operations and polymorphism together with ontologies in a platform-independent way. Such an integration is not only intriguing because of the heterogeneity of the two modeling approaches, but it is now a strict requirement for the development of software with many thousands of ontology classes and multiple dozens of complex software modules in the realms of medical informatics [19], multimedia [28] or engineering applications [27]. TwoUse (Transforming and Weaving Ontologies and UML in Software Engineering) addresses these types of systems [26]. It is an approach combining UML class-based models with OWL ontologies to leverage the unique and potentially complementary strengths of the two. TwoUse consists of an integration of the MOF-based metamodels for UML and OWL, the specification of dynamic behavior referring to OWL reasoning, and the definition of a joint profile for denoting hybrid models as well as other concrete syntaxes. Figure 7 presents a model-driven view of the TwoUse approach. TwoUse uses UML profiled class diagrams as concrete syntax for designing combined models.
The UML class diagrams profiled for TwoUse are input for model transformations that generate TwoUse models conforming to the TwoUse metamodel. The TwoUse metamodel provides the abstract syntax for the TwoUse approach, since we have explored different concrete syntaxes. Further model transformations take TwoUse models and generate the OWL ontology and Java code. TwoUse allows developers to raise the level of abstraction of business rules previously embedded in code. It enables UML modeling with the semantic expressiveness of OWL DL. TwoUse achieves improvements in the maintainability, reusability and extensibility of ontology-based system development.
3.2.2 M2 Transformation Bridge A transformation bridge describes a (physical) transformation between models in layer M1. The models are kept separately in both modeling spaces. The information is moved from one model to the model in the other modeling space according to the transformation bridge. With respect to the example depicted in Fig. 6, a process model like a UML Activity Diagram is transformed into an OWL ontology. The transformation rules or patterns are defined by the bridge.
Fig. 7 Model bridge
4 Applications of Language and Model Bridges
4.1 Integrated Well-formedness Constraints in Metamodels Having an integrated metametamodel available, a language designer can now create language metamodels with integrated OWL constraints and axioms. For example, in Fig. 8, he enriches the process metamodel from Fig. 2 by extending the class ActivityNode with a further reference called ‘edge’. With a pure Ecore metametamodel it is not possible to define references as transitive or as a chain covering two further references. Using an integrated metamodeling language, a language designer can now define that the reference edge is transitive (using the keyword transitive) and that the reference edge is a chain of the references outgoing and target (using the keyword isChain). Besides OWL object property axioms and OWL object property expressions, which are adopted by Ecore references, a language designer is able to create OWL class axioms and OWL class expressions. In Fig. 8, he defines that the class ActivityNode is equivalent to a class expression which requires a node to be connected via the reference edge with some Final node. Here, the ObjectSomeValuesFrom class expression, which allows for existential quantification, is defined by the object restriction on ... with some construct. Further, we restrict the concept of initial nodes such that each node which directly appears after an initial node must have the type Action or ControlNode. Thus, no object nodes are allowed directly after the initial node. Overall, the language designer has an ontology-based metamodeling language which provides a seamless and integrated design of formal syntactic constraints within the metamodel itself. The constraints are defined using ontology languages, which are used in combination with familiar concrete Ecore syntaxes (e.g. the textual Ecore modeling syntax provided by the Eclipse Modeling Framework).
In the following, we present concrete reasoning services and show how they are applied to software models.
class ActivityNode equivalentTo edge some Final {
  reference incoming [0-*] : ActivityEdge oppositeOf target;
  reference outgoing [0-*] : ActivityEdge oppositeOf source;
  transitive reference edge [0-*] : ActivityNode isChain (outgoing, target);
}
...
class Initial extends ControlNode subClassOf outgoing some (to some (Action or ControlNode)) { }
...
class Final extends ControlNode subClassOf (edge some ActivityNode) and not (edge some ActivityNode) { }
Fig. 8 Process metamodel with integrated ontologies at M2 layer
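The effect of the two property axioms on the reference edge can be sketched with plain relation algebra in Python. The sketch computes the smallest relation that contains the chain of outgoing and target and is transitive; an OWL reasoner only requires edge to contain these pairs, so treating the result as the whole extension of edge is an assumption of this sketch, as are the node and edge identifiers.

```python
def compose(r, s):
    """Relational composition: (a, c) if a r b and b s c for some b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def transitive_closure(r):
    """Smallest transitive relation containing r."""
    closure = set(r)
    while True:
        new_pairs = compose(closure, closure) - closure
        if not new_pairs:
            return closure
        closure |= new_pairs


# Hypothetical M1 facts: node --outgoing--> edge --target--> node.
outgoing = {("initial", "e1"), ("receiveOrder", "e2"), ("fillOrder", "e3")}
target = {("e1", "receiveOrder"), ("e2", "fillOrder"), ("e3", "final")}

# isChain(outgoing, target), then the transitive keyword.
edge = transitive_closure(compose(outgoing, target))
```

With edge materialized in this way, a class expression such as edge some Final amounts to asking whether a node is related through edge to at least one node typed Final.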
4.1.1 Consistency Checking Figure 9 depicts a simple process model created by a language user. Language users want to validate the model, that is, to check its consistency with respect to the metamodel. For this purpose, they use the consistency checking service. It is enabled by transforming the metamodel into a description logic TBox and the process model itself into the ABox. Consistency checking is then performed by the corresponding standard reasoning service of a reasoner like Pellet [25]. Using this service, we can identify that the model in Fig. 9 is inconsistent. A language user then wants an explanation of why the model is not consistent, including the facts relevant for debugging and information on how to repair the model. Such explanation services are provided by standard reasoners, e.g. Pellet [25], and are implemented in the TwoUse Toolkit (cf. Sect. 5). For the model in Fig. 9, the TwoUse Toolkit provides the following explanation:
Explanation from TwoUse Toolkit
---------------------------------------
CHECK CONSISTENCY
Consistent: No
Explanation:
receiveOrder type Action
Action subClassOf ActivityNode
ActivityNode equivalentTo edge some Final
Here the language user sees that the node receiveOrder is of type Action, which is a specialization of ActivityNode. Furthermore, there must be an outgoing flow to some
Fig. 9 Inconsistent process model
final node from each activity node. This constraint is prescribed by the metamodel (cf. Fig. 8), in particular by the class ActivityNode and its equivalent class expression.
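The check behind this explanation can be approximated without a reasoner by a closed-world graph search, sketched below in Python. This is not how Pellet or the TwoUse Toolkit work internally: the model is a hypothetical one in the spirit of Fig. 9, the class hierarchy is an excerpt of Fig. 2, and skipping Final nodes themselves is an assumption of the sketch.

```python
from collections import deque

# Excerpt of the class hierarchy from Fig. 2 (child -> parent).
SUBCLASS_OF = {
    "Action": "ActivityNode",
    "ControlNode": "ActivityNode",
    "Initial": "ControlNode",
    "Final": "ControlNode",
}


def is_a(type_name, ancestor):
    """True if type_name equals ancestor or is one of its subclasses."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = SUBCLASS_OF.get(type_name)
    return False


def reaches_final(node, types, successors):
    """Breadth-first search for a Final node along successor links."""
    seen, queue = set(), deque(successors.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if is_a(types[current], "Final"):
            return True
        queue.extend(successors.get(current, []))
    return False


def explain(node, types):
    """List the facts that lead to the violated class expression."""
    lines = [f"{node} type {types[node]}"]
    current = types[node]
    while current in SUBCLASS_OF:
        lines.append(f"{current} subClassOf {SUBCLASS_OF[current]}")
        current = SUBCLASS_OF[current]
    lines.append("ActivityNode equivalentTo edge some Final")
    return lines


def check_consistency(types, successors):
    """Map each non-final node that reaches no Final node to its explanation."""
    return {
        node: explain(node, types)
        for node in types
        if not is_a(types[node], "Final")
        and not reaches_final(node, types, successors)
    }


# Hypothetical model: receiveOrder has no flow to a final node.
types = {"initial": "Initial", "receiveOrder": "Action"}
successors = {"initial": ["receiveOrder"]}
violations = check_consistency(types, successors)

# Repair: add a final node and a flow to it.
fixed_types = dict(types, final="Final")
fixed_successors = {"initial": ["receiveOrder"], "receiveOrder": ["final"]}
repaired = check_consistency(fixed_types, fixed_successors)
```

For receiveOrder, the returned lines coincide with the three explanation facts shown in Sect. 4.1.1, and the repaired model yields no violations.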
4.1.2 Satisfiability Checking To avoid inconsistencies in models, language designers want to ensure that their metamodels have no unsatisfiable classes. Here, satisfiability checking services may help. To enable this service, the metamodel is transformed into a description logics TBox. Satisfiability checking services, as provided by Pellet and the TwoUse Toolkit, can be applied to the metamodel in Fig. 8. The result of the service is that the class Final is not satisfiable. Based on explanation services, a reason can be computed. This might look as follows:
Explanation from TwoUse Toolkit
---------------------------------------
Unsatisfiability of Final:
Explanation:
Final equivalentTo not edge some ActivityNode and edge some ActivityNode
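Why Final cannot have instances can be seen from a propositional abstraction, sketched below in Python: the restriction edge some ActivityNode is treated as a single truth value for a candidate instance, and all assignments are enumerated. This is a deliberate simplification and not a description logic satisfiability procedure; the function and variable names are assumptions.

```python
from itertools import product


def satisfiable(constraint, atoms):
    """True if some truth assignment to the atoms fulfills the constraint."""
    return any(
        constraint(dict(zip(atoms, values)))
        for values in product([False, True], repeat=len(atoms))
    )


ATOM = "edge some ActivityNode"


def final_constraint(v):
    # Final: (edge some ActivityNode) and not (edge some ActivityNode)
    return v[ATOM] and not v[ATOM]


def relaxed_constraint(v):
    # Dropping the negated conjunct leaves a satisfiable description.
    return v[ATOM]


final_is_satisfiable = satisfiable(final_constraint, [ATOM])
relaxed_is_satisfiable = satisfiable(relaxed_constraint, [ATOM])
```

The first result is False because no individual can both have and lack an edge successor, which matches the explanation reported for Final.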
4.1.3 Classification Figure 10 depicts a process model with an invoice node between the two actions. Here the invoice node has no specific type. Language users often do not know which type they must assign to a model element. Additionally, it is important for a model that its elements have the most specific type, e.g. for model transformations or code generation. Having the metamodel and the process model in a description logic knowledge base, we can provide the classification service to language users. It determines the types to which model instances belong dynamically, based on the descriptions in the model and the metamodel. The result of classifying the invoice node is the type ObjectNode, because it is linked with two dataflow edges.
Fig. 10 Process Model with element of unknown type
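The classification result for the invoice node can be imitated by a simple rule, sketched below in Python: an untyped node whose incident edges are all object flows is given the type ObjectNode. The rule, the edge triples and the action names are assumptions made for illustration; in the approach described here, the type is inferred by a reasoner from the class descriptions in the metamodel.

```python
def classify(node, edges, declared_types):
    """Return the declared type, or infer one from the incident edges."""
    if node in declared_types:
        return declared_types[node]
    incident = [kind for (source, kind, target) in edges
                if node in (source, target)]
    # Assumed rule: only object flows touch the node -> ObjectNode.
    if incident and all(kind == "ObjectFlow" for kind in incident):
        return "ObjectNode"
    return "ActivityNode"  # most general type as fallback


# Hypothetical model in the spirit of Fig. 10.
declared_types = {"sendInvoice": "Action", "makePayment": "Action"}
edges = [
    ("sendInvoice", "ObjectFlow", "invoice"),
    ("invoice", "ObjectFlow", "makePayment"),
]

invoice_type = classify("invoice", edges, declared_types)
```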
Fig. 11 Aligning UML diagrams and Java code with OWL
4.2 Transforming Metamodels Software development consists of multiple phases, from inception to production. During each software development phase, developers and other actors generate many artifacts, e.g. documents, models, diagrams, code, tests and bug reports. Although some of these artifacts are integrated, they are usually handled as islands inside the software development process. Many of these artifacts (graphical or textual) are written using a structured language which has a defined grammar. In a model-driven environment, concepts of software languages are represented by metamodels, whereas the artifacts written in those software languages are represented by models, which are described by the language metamodel. Thus, by transforming software metamodels and models into OWL and by aligning the OWL ontologies corresponding to software languages, we are able to link multiple data sources of a software development process, creating a linked-data repository for software development. Let us consider an example of integrating two data sources: UML diagrams and Java code. Regardless of whether Java code is generated from UML diagrams, developers would like to have a consistent view of corresponding classes and methods in UML and Java, e.g. developers might want to consult UML diagrams looking for a corresponding Java class. In this scenario, OWL and ontology technologies play an important role. Figure 11 depicts the usage of M3 transformations together with ontology technologies. The UML metamodel and model as well as the Java grammar (metamodel) and Java code (model) are transformed into OWL ontologies. Ontology alignment techniques [6] might identify some concepts that the two ontologies (UML and Java) have in common, e.g. package, class, method. Moreover, individuals with the same name in these two ontologies are likely the same. Once the two ontologies are aligned, queries against the Java ontology also retrieve elements defined in UML diagrams.
Now it is possible to retrieve sequence diagrams including a given Java class, since the two artifacts (UML diagrams and
Java code) are now linked. This is only one example of the great potential provided by linking software engineering artifacts using OWL technologies.
4.3 Transformation of Process Models into Semantic Representations Process models capture the dynamic behavior of an application or system. In software modeling, they are represented by graphical models like BPMN Diagrams or UML Activity Diagrams. Both metamodels are instances of the Ecore metametamodel. The corresponding metamodels prove to be a flexible means for describing process models for various applications. However, process models are often ambiguous, with inappropriate modeling constraints and even missing semantics (cf. [11]). We identified the following shortcomings of process models in the software modeling space. (1) There is no semantic representation of control flow dependencies of activities in a process, i.e. of the execution ordering of activities in a flow. Such constraints allow the description of order dependencies, such as an activity requiring a certain activity as its successor. (2) It is quite common in model-driven engineering to specialize or refine a model into a more fine-grained representation that is closer to the concrete implementation. In process modeling, activities could be replaced by subactivities for a more precise description of a process. Hence, modeling possibilities for subactivities and for control flow constraints of these subactivities are a relevant issue. (3) Quite often, one may formulate process properties that cover modality, for example to express that the occurrence of an activity within a control flow is optional or unavoidable in all possible process instances (traces). We continue this section with an overview of the model bridge from a UML activity diagram to an OWL ontology with a short discussion of design decisions (Sect. 4.3.1). Afterwards, in Sect. 4.3.2, we demonstrate the usage of the ontological representation of process models in order to compensate for the aforementioned shortcomings in software process modeling.
4.3.1 Process Modeling Principles in OWL A process model describes the set of all process runs or traces it allows. Activities are represented by OWL classes, and a process is modeled as a complex expression that captures all activities of the process. A process run is an instance of this complex class expression in OWL. The process models are described in OWL DL; as syntax we use the DL notation. Transformation patterns from UML activity diagrams to OWL are given in Table 2. Control flow relations between activities are represented by object properties in OWL, i.e. by the property $\mathit{TO}_i$. All $\mathit{TO}_i$ object properties are subproperties of the transitive property $\mathit{TOT}$. A process is composed of activities. A process definition is described in OWL by axioms as shown in No. 5. The control flow (No. 6) is a class
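Table 2 Transformation to OWL
Construct          UML notation   DL notation
1. Start           [graphic]      $\mathit{Start}_i$
2. End             [graphic]      $\mathit{End}_i$
3. Activity        [graphic]      $\mathit{ReceiveOrder}$
4. Edge            [graphic]      $\mathit{TO}_i$
5. Process P       [graphic]      $P \equiv \mathit{Start}_i \sqcap \exists_{=1} \mathit{TO}_i.(\mathit{ReceiveOrder} \sqcap \exists_{=1} \mathit{TO}_i.\mathit{End}_i)$
6. Flow            [graphic]      $\mathit{ReceiveOrder} \sqcap \exists_{=1} \mathit{TO}_i.\mathit{FillOrder}$
7. Decision        [graphic]      $\mathit{ReceiveOrder} \sqcap \exists_{=1} \mathit{TO}_i.((\mathit{RejectOrder} \sqcup \mathit{FillOrder}) \sqcap \exists_{=1} \mathit{TO}_i.\mathit{CloseOrder})$
8. Condition       [graphic]      $\mathit{ReceiveOrder} \sqcap \exists_{=1} \mathit{TO}_i.((\mathit{FillOrder} \sqcap \kappa_{\mathit{Order\ accepted}}) \sqcup (\mathit{Stalled} \sqcap \neg\kappa_{\mathit{Order\ accepted}}))$
9. Fork and Join   [graphic]      $\mathit{ReceiveOrder} \sqcap \exists \mathit{TO}_i.(\mathit{ShipOrder} \sqcap \exists_{=1} \mathit{TO}_i.\mathit{CloseOrder}) \sqcap \exists \mathit{TO}_i.(\mathit{SendInvoice} \sqcap \exists_{=1} \mathit{TO}_i.\mathit{CloseOrder}) \sqcap {=}2\, \mathit{TO}_i$
10. Loop           [graphic]      $\mathit{Loop}_j \sqcap \exists_{=1} \mathit{TO}_i.\mathit{FillOrder}$, $\mathit{Loop}_j \equiv \mathit{ReceiveOrder} \sqcap \exists_{=1} \mathit{TO}_j.(\mathit{Loop}_j \sqcup \mathit{End}_j)$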
expression in OWL like $A \sqcap \exists \mathit{TO}_i.B$, which means that the activity A is directly followed by the activity B. We use concept union for decisions (No. 7). The non-deterministic choice between activities B and C is given by the class expression $\exists \mathit{TO}_i.(B \sqcup C)$. Flow conditions (No. 8) are assigned to the control flow. We represent conditions by OWL classes. A loop (No. 10) is a special kind of decision. An additional OWL class $\mathit{Loop}_j$ for the subprocess with the loop is introduced to describe multiple occurrences of the activities within the loop. Parallel executions are represented by intersections (No. 9). An intersection is an explicit statement that an activity has multiple successors simultaneously.
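The nesting of the flow and decision patterns can be sketched as a small Python generator that builds the class expressions as text. It writes the operators with Unicode DL symbols, covers only patterns No. 5 to 7, and is not the transformation implemented by the authors; the function names and the textual output format are assumptions.

```python
def wrap(expr):
    """Parenthesize compound expressions, leave atomic names as they are."""
    return f"({expr})" if " " in expr else expr


def choice(*alternatives):
    """Decision (No. 7): union of the alternative activities."""
    return "(" + " ⊔ ".join(alternatives) + ")"


def sequence_expression(activities, index="i"):
    """Flow (No. 6): nest 'A ⊓ ∃=1 TO_i.(...)' from the last activity backwards."""
    expr = activities[-1]
    for activity in reversed(activities[:-1]):
        expr = f"{activity} ⊓ ∃=1 TO_{index}.{wrap(expr)}"
    return expr


# Right-hand side of the process axiom (No. 5).
process = sequence_expression(["Start_i", "ReceiveOrder", "End_i"])

# Decision between RejectOrder and FillOrder, followed by CloseOrder.
decision = sequence_expression(
    ["ReceiveOrder", choice("RejectOrder", "FillOrder"), "CloseOrder"]
)
```

Building the expression from the last activity backwards reflects that each activity carries a restriction on its direct successor, so the innermost expression always describes the end of the flow.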
4.3.2 Process Modeling and Retrieval in OWL The semantic representation of process models in OWL tackles the problems and shortcomings that are mentioned at the beginning of this section in multiple ways. For instance, the validation of process properties, specializations or refinement relations
between process models, as well as the retrieval of processes benefit from this representation. This section gives an overview of query patterns for process retrieval using semantic queries. The process model in OWL gives an explicit description of the execution order dependencies of activities. Hence, this information is used for process retrieval. A query describes the relevant ordering conditions, such as which activity has to follow (directly or indirectly) another activity. For instance, a process that executes the activity FillOrder before MakePayment, with an arbitrary number of activities between them, is described by the query process description $\exists \mathit{TOT}.(\mathit{FillOrder} \sqcap \exists \mathit{TOT}.\mathit{MakePayment})$. The transitive object property $\mathit{TOT}$ is used to indicate the indirect connection of the activities. The result consists of all processes that are subsumed by this general process description. Besides ordering constraints, this semantic query processing allows the retrieval of processes that contain specialized or refined activities. For instance, the result of the demonstrated query also contains all processes with subactivities of FillOrder and MakePayment. The corresponding class expressions in the OWL model are specializations of the class expression given by the query expression. Finally, the usage of in the queries allows handling of modality for activity occurrences in a process, like a query that expresses that the activity ShipOrder has to occur or might occur.
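The ordering query can be imitated on linear processes with the following Python sketch, which treats each process as a list of activities and uses a small hierarchy for subactivities. It is only an approximation of subsumption-based retrieval by a reasoner and ignores decisions, forks and loops; the process names, the subactivity FillExpressOrder and the function names are hypothetical.

```python
# Hypothetical activity hierarchy (subactivity -> activity).
SUBACTIVITY_OF = {"FillExpressOrder": "FillOrder"}


def is_kind_of(activity, general):
    """True if activity equals general or refines it."""
    while activity is not None:
        if activity == general:
            return True
        activity = SUBACTIVITY_OF.get(activity)
    return False


def matches(process, first, then):
    """True if an activity of kind `first` is followed, directly or
    indirectly, by an activity of kind `then`."""
    for position, activity in enumerate(process):
        if is_kind_of(activity, first):
            return any(is_kind_of(later, then)
                       for later in process[position + 1:])
    return False


def retrieve(processes, first, then):
    """Names of all processes that satisfy the ordering query."""
    return sorted(name for name, process in processes.items()
                  if matches(process, first, then))


processes = {
    "P1": ["Start", "ReceiveOrder", "FillOrder", "ShipOrder",
           "MakePayment", "End"],
    "P2": ["Start", "MakePayment", "FillOrder", "End"],
    "P3": ["Start", "FillExpressOrder", "MakePayment", "End"],
}

result = retrieve(processes, "FillOrder", "MakePayment")
```

P3 is part of the result although it never mentions FillOrder itself, which corresponds to the retrieval of processes with refined activities.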
4.4 Integrating Models with Ontologies In general, the Strategy Pattern solves the problem of dealing with variations. However, as already documented by [7], the Strategy Pattern has a drawback: the clients must be aware of variations and of the criteria to select between them at runtime. Hence, the question arises of how the selection of specific classes could be determined using only their descriptions rather than by weaving the descriptions into client classes. The basic idea lies in decoupling class selection from the definition of client classes by exploiting OWL-DL modeling and reasoning. We explore a slight modification of the Strategy Pattern that includes OWL-DL modeling and that leads us to a minor but powerful variation of existing practices: the Selector Pattern. To integrate the UML class diagram with patterns and the OWL profiled class diagram, we rely on the TwoUse approach. The hybrid diagram is depicted in Fig. 12. The Selector Pattern is composed of a context, the specific variants of this context and their respective descriptions, and the concept, which provides a common interface for the variations (Fig. 12). Its participants are:
• Context maintains a reference to the Concept object.
• Concept declares an abstract method behavior common to all variants.
• Variants implement the method behavior of the class Concept.
The Context has the operation select, which uses OWL-like query operations to dynamically classify the object according to the logical descriptions of the variants. A Variant is returned as the result (Fig. 12). Then, the Context establishes an association with the Concept, which interfaces the variation.
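The roles of the three participants can be sketched in Python as follows. The predicate `describes` is only a stand-in for an OWL class description, and `select` iterates over the variants where TwoUse would call a reasoner; the shipping variants, the facts dictionary and all names other than Context, Concept, Variant, select and behavior are hypothetical.

```python
from abc import ABC, abstractmethod


class Concept(ABC):
    """Common interface of all variants."""

    @staticmethod
    @abstractmethod
    def describes(facts):
        """Stand-in for the logical (OWL) description of the variant."""

    @abstractmethod
    def behavior(self):
        """Behavior that differs between variants."""


class HeavyShipping(Concept):
    @staticmethod
    def describes(facts):
        return facts.get("weight", 0) > 30

    def behavior(self):
        return "ship by freight"


class StandardShipping(Concept):
    @staticmethod
    def describes(facts):
        return facts.get("weight", 0) <= 30

    def behavior(self):
        return "ship by parcel"


class Context:
    """Holds the facts and a reference to the selected Concept."""

    def __init__(self, facts, variants):
        self.facts = facts
        self.variants = variants
        self.concept = None

    def select(self):
        # Classification step: pick the variant whose description matches.
        for variant in self.variants:
            if variant.describes(self.facts):
                self.concept = variant()
                return self.concept
        raise LookupError("no variant description matches the context")


context = Context({"weight": 40}, [HeavyShipping, StandardShipping])
selected = context.select()
```

The client code only calls select and behavior; the selection criteria live in the variant descriptions, which is the decoupling the pattern aims at.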
Fig. 12 The selector pattern
The application of the Selector Pattern has some consequences, which we discuss as follows:
Reuse. The knowledge represented in OWL-DL can be reused independently of platform or programming language.
Flexibility. The knowledge encoded in OWL-DL can be modeled and evolved independently of the execution logic.
Testability. The OWL-DL part of the model can be automatically tested by logical unit tests, independently of the UML development.
The application of TwoUse can be extended to other design patterns concerning variant management and control of execution and method selection. Design patterns that factor out commonality of related objects, like Prototype, Factory Method and Template Method, are good candidates.
5 Implementation: The TwoUse Toolkit The TwoUse Toolkit is an implementation of current OMG and W3C standards for developing ontology-based software models and model-based OWL ontologies. It is a model-driven tool that bridges the gap between the Semantic Web and Model Driven Software Development. The TwoUse Toolkit has two user profiles: model-driven software developers and OWL ontology engineers. The TwoUse Toolkit provides the following functionality to model-driven software developers:
• Describe classes in UML class diagrams using OWL class descriptions.
• Semantically search for classes, properties and instances in UML class diagrams.
• Model variability in software systems using OWL classes.
• Design business rules using the UML Profile for SWRL.
• Extend software design patterns with OWL class descriptions.
• Make sense of UML class diagrams using inference explanations.
• Write OWL queries using SPARQL, SAIQL or the OWL query language based on the OWL Functional Syntax using the query editor with syntax highlighting.
• Validate refinements on business process models.
To OWL ontology engineers, the TwoUse Toolkit provides the following functionality:
• Graphically model OWL ontologies and OWL safe rules using the OMG UML Profile for OWL and the UML Profile for SWRL.
• Graphically model OWL ontologies and OWL safe rules using the OWL2 Graphical Editor.
• Graphically model and store ontology design patterns as templates.
• Write OWL queries using SPARQL-DL, SAIQL or the OWL query language based on the OWL Functional Syntax using the query editor with syntax highlighting.
• Specify and save OWL ontologies using the OWL2 functional syntax with syntax highlighting.
• Specify OWL ontology APIs using the agogo editor.
We have implemented the TwoUse Toolkit in the Eclipse Platform using the Eclipse Modeling Framework [4]. It is available for download on the project website.¹
6 Related Work In the following, we group related approaches into two categories: approaches where languages are bridged and approaches where models are bridged. Among approaches of language bridges, one can use languages like F-Logic or Alloy to formally describe models. In [1], a transformation of UML+OCL to Alloy is proposed to exploit analysis capabilities of the Alloy Analyzer [15]. In [31], a reasoning environment for OWL is presented, where the OWL ontology is transformed to Alloy. Both approaches show how Alloy can be adopted for consistency checking of UML models or OWL ontologies. F-Logic is a further prominent rule language that combines logical formulas with object oriented and frame-based description features. Different works (e.g. [8, 29]) have explored the usage of F-Logic to describe configurations of devices or the semantics of MOF models. The integration in the cases cited above is achieved by transforming MOF models into a knowledge representation language (Alloy or F-Logic). Thus, the expressiveness available for DSL designers is limited to MOF/OCL. Our approach extends these approaches by enabling language designers to specify class descriptions à la OWL together with MOF/OCL, increasing expressiveness. 1 http://code.google.com/p/twouse/.
There are various approaches that build model bridges between software models and ontologies. Here, we only mention those that are related to process modeling in OWL. The OWL-S process model [23] describes processes in OWL. The Process Specification Language [12, 13] allows formal process modeling in an ontology. Process models are represented in OWL in combination with Petri nets in [16] and in Description Logics for workflows in [9]. Compared to our demonstrated model, these approaches either lack an explicit representation of control flow dependencies combined with terminological information about activities, such as a hierarchical structuring of activities, or only weakly support the retrieval of processes with respect to control flow information.
7 Conclusion In this chapter, we presented the building blocks for bridging software languages and ontology technologies. Language bridges are generic and can be used in existing software languages as well as new software languages that explore the extended functionalities provided by OWL. Model bridges have an ad-hoc character and are language specific. While language bridges improve software development by realizing some of the major motivations of the OWL language, e.g., shared terminology, evolution, interoperability and inconsistency detection, model bridges allow for exploring new ways of modeling software as well as different ways of exploiting reasoning technologies.
References 1. Anastasakis, K., Bordbar, B., Georg, G., Ray, I.: UML2Alloy: a challenging model transformation. In: Lecture Notes in Computer Science, vol. 4735, p. 436 (2007) 2. Atkinson, C., Kuhne, T.: Model-driven development: a metamodeling foundation. IEEE Softw. 20(5), 36–41 (2003) 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York (2003) 4. Budinsky, F., Brodsky, S.A., Merks, E.: Eclipse Modeling Framework. Pearson Education, Boston (2003) 5. Ebert, J.: Metamodels taken seriously: the TGraph approach. In: Kontogiannis, K., Tjortjis, C., Winter, A. (eds.) 12th European Conference on Software Maintenance and Reengineering. IEEE Comput. Soc., Piscataway (2008). URL http://www.uni-koblenz. de/~ist/documents/Ebert2008MTS.pdf 6. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007). 3-540-49611-4 7. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading (1995) 8. Gerber, A., Lawley, M., Raymond, K., Steel, J., Wood, A.: Transformation: The Missing Link of MDA. Lecture Notes in Computer Science, pp. 90–105 (2002) 9. Goderis, A., Sattler, U., Goble, C.: Applying DLs to workflow reuse and repurposing. In: Description Logic Workshop (2005)
10. Gray, J., Fisher, K., Consel, C., Karsai, G., Mernik, M., Tolvanen, J.-P.: Panel—DSLs: the good, the bad, and the ugly. In: OOPSLA Companion ’08. ACM, New York (2008) 11. Gröner, G., Staab, S.: Modeling and query pattern for process retrieval in OWL. In: Proc. of the 8th International Semantic Web Conference (ISWC). Lecture Notes in Computer Science, vol. 5823, pp. 243–259. Springer, Berlin (2009) 12. Grüninger, M.: In: Staab, S., Studer, R. (eds.) Ontology of the Process Specification Language, pp. 575–592. Springer, Berlin (2009) (Chap. 29) 13. Grüninger, M., Menzel, C.: The process specification language (PSL) theory and application. AI Mag. 24, 63–74 (2003) 14. Iqbal, A., Ureche, O., Hausenblas, M., Tummarello, G.: LD2SD: linked data driven software development. In: Proceedings of the 21st International Conference on Software Engineering & Knowledge Engineering (SEKE’2009), Boston, Massachusetts, USA, July 1–3, 2009, pp. 240–245. Knowledge Systems Institute Graduate School, Skokie (2009) 15. Jackson, D.: Software Abstractions: Logic, Language, and Analysis. MIT Press, Cambridge (2006) 16. Koschmider, A., Oberweis, A.: Ontology based business process description. In: EMOIINTEROP (2005). URL http://www.ceur-ws.org/Vol-160/paper12.pdf 17. Mellor, S.J., Clark, A.N., Futagami, T.: Model-driven development. IEEE Softw. 20(5), 14–18 (2003) 18. Motik, B., Patel-Schneider, P.F., Horrocks, I.: OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax. URL http://www.w3.org/TR/owl2-syntax/ (2009) 19. O’Connor, M.J., Shankar, R., Tu, S.W., Nyulas, C., Parrish, D., Musen, M.A., Das, A.K.: Using semantic web technologies for knowledge-driven querying of biomedical data. In: AIME, pp. 267–276 (2007) 20. OMG: Meta Object Facility (MOF) Core Specification. URL http://www.omg.org/docs/ formal/06-01-01.pdf (2006) 21. OMG: UML Infrastructure Specification, v2.1.2. OMG Adopted Specification (2007) 22. OMG: Ontology Definition Metamodel. 
Object Modeling Group. URL http://fparreiras/specs/ ODMptc06-10-11.pdf (2008) 23. OWL-S: Semantic Markup for Web Services. URL http://www.w3.org/Submission/OWL-S (2004) 24. Parreiras, F.S., Walter, T.: Report on the combined metamodel. Deliverable ICT216691/UoKL/ WP1-D1.1/D/PU/a1, University of Koblenz-Landau, (2008). MOST Project 25. Parsia, B., Sirin, E.: Pellet: an OWL DL reasoner. In: Proc. of the 2004 International Workshop on Description Logics (DL2004). CEUR Workshop Proceedings, vol. 104, 2004 26. Silva Parreiras, F., Staab, S.: Using ontologies with UML class-based modeling: the TwoUse approach. Data Knowl. Eng. 69(11), 1194–1207 (2010) 27. Staab, S., Franz, T., Görlitz, O., Saathoff, C., Schenk, S., Sizov, S.: Lifecycle knowledge management: getting the semantics across in x-media. In: Foundations of Intelligent Systems, ISMIS 2006 Bari, Italy, September 2006. Lecture Notes in Computer Science, vol. 4203, pp. 1–10. Springer, Berlin (2006). URL http://www.uni-koblenz.de/~staab/ Research/Publications/2006/ismis.pdf 28. Staab, S., Scherp, A., Arndt, R., Troncy, R., Gregorzek, M., Saathoff, C., Schenk, S., Hardman, L.: Semantic multimedia. In: Reasoning Web, 4th International Summer School, Venice, Italy. Lecture Notes in Computer Science, vol. 5224, pp. 125–170. Springer, Berlin (2008) 29. Sure, Y., Angele, J., Staab, S.: OntoEdit: guiding ontology development by methodology and inferencing. In: Lecture Notes in Computer Science, pp. 1205–1222 30. Walter, T., Ebert, J.: Combining DSLs and ontologies using metamodel integration. In: Domain-Specific Languages. Lecture Notes in Computer Science, vol. 5658, pp. 148–169. Springer, Berlin (2009) 31. Wang, H.H., Dong, J.S., Sun, J., Sun, J.: Reasoning support for Semantic Web ontology family languages using Alloy. Multiagent Grid Syst. 2(4), 455–471 (2006) 32. Wende, C.: Ontology services for model-driven software development. MOST Project Deliverable. URL www.most-project.eu (2009)
Intelligent Service Management—Technologies and Perspectives Sudhir Agarwal, Stephan Bloehdorn, and Steffen Lamparter
Abstract Intelligent infrastructures, in particular Semantic Technologies, have accompanied the development towards service-centric architectures and computing paradigms since the early days. In this contribution, we assess the current state of these technologies with respect to the intelligent management of services and we describe the main developments that will most likely shape the field in the future. As a general introduction, we structure the landscape of services according to three complementary dimensions: (i) Level of Information and Communication Technology (ICT) Involvement, (ii) Level of Co-Creation, and (iii) Service Role. In the main part, we focus on the following technology areas: semantic description of services, including the acquisition of service descriptions, discovery and ranking algorithms, service composition, service markets, as well as service monitoring and analytics.
S. Agarwal, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_12, © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
The notion of service-orientation has received significant attention from different viewpoints, most notably from Economics and ICT. From an Economics standpoint, the service sector drives the economies of most developed countries and this importance is constantly increasing. The service sector includes, for example, all activities of service sector firms, specifically ICT-services, as well as services associated with physical goods production and maintenance and services from the public sector. Increasingly, independent specialized providers also work together in flexible networks at varying degrees of cooperation to jointly create customized service offerings. Many traditional product-oriented companies are now primarily service-oriented [31]. From an ICT viewpoint, current technologies enable the provision of digital services over computer networks. This has led to a fundamental change in the way distributed computing systems are being constructed: service-oriented systems consist of loosely coupled software components and data resources, usually hosted and controlled by independent parties, that are accessible using standardized technologies [29, 49, 50]. The combination of technical flexibility, standardization, and
openness provided by service-orientation has helped companies to move beyond proprietary systems and towards modular, reusable business services and processes. Service-orientation has thus become a cornerstone of agile ICT and has helped to solve many of the technical and organizational problems that typically obstruct the flexibility of software systems.
While the viewpoints of Economics and ICT appear very different at first, they have a powerful impact on each other [60]. Recently, the mutual exchange between these respective views has led to the emergence of a new orthogonal research field sometimes referred to as ‘Service Science’ with the aim of taking a holistic view on the understanding of service systems [11, 55]. The Internet as a disruptive platform, offering new means of communication as well as access to a wealth of content and functionalities, has especially contributed to the intertwining of technical and economic considerations in the design of new services. The so-called ‘Internet of Services’ is expected to provide any business and user on the Internet with the possibility of offering and managing innovative and successful service offerings through the intelligent use of Internet and service technology. At the same time, the engineering of services on the Internet requires careful consideration of the business context, including business requirements and opportunities, business transformation, and social, organizational and regulatory policies.
Intelligent technologies for service management have accompanied the evolution towards service-oriented architectures and computing paradigms since the early days. Generally, this includes all types of technologies from Artificial Intelligence (AI), most notably Semantic Technologies and Data Analytics, used for supporting or fully automating service management tasks.
On an Internet of Services, a key to success will be the ability to systematically apply these technologies to continuously and adequately adapt services and service delivery channels to a highly dynamic and heterogeneous environment. Intelligent infrastructures can, for example, facilitate the dynamic execution of complex services and the bundling of services to complex value chains, so-called Service Value Networks (SVNs). They have the power to optimize service usage and delivery as well as the involved resources and can help in transparently mapping these solutions into existing ICT infrastructures despite existing heterogeneities. Therefore, intelligent services infrastructures enable the efficient provisioning of competitive added-value solutions specifically tailored to the needs of individual customers and are thus a major step towards mass customization and “productization” of services. In this contribution, we assess the current state of all types of “intelligent” technologies for service management and describe the main developments that are most likely to shape the field in the future. In order to structure these developments, the essential components of service technology are identified in Sect. 2, both from an economic as well as a technological viewpoint, and placed within an overall service landscape. The major lines of work in intelligent service management are summarized in Sect. 3. In Sect. 4, we then take a deeper look at the application of intelligent management functionalities in selected areas of the service landscape. As a tribute to Rudi Studer and in line with the dedication of this collection, we will give special attention to work from the area of Semantic Technologies performed within his
research groups in Karlsruhe. In Sect. 5, we look at the perspectives and future challenges of intelligent service technology. In Sect. 6, we finally bring these lines of thinking together and conclude.
2 Structuring the Landscape of Service Systems
There is a plethora of viewpoints on the concepts of service-orientation and service systems. This is especially the case when comparing economic and technical viewpoints, which are different but at the same time dependent, as they have a strong impact on each other [60]. As a result, the term service is heavily overloaded and has been used to identify different concepts and different levels of granularity which are, however, not always easy to tell apart. In this section, we aim to clarify the use of the term throughout this paper and to identify the relevant perspectives from which it can be analyzed. On the one hand, this is meant to provide a summary of service definitions, though certainly not an exhaustive one. On the other hand, it serves the purpose of adequately organizing the intelligent technologies for service management and usage which we discuss in this paper.
Generally, there is a broad consensus in the literature about some distinctive economic properties of services when compared to products. In the following, we list some of these distinguishing characteristics based on Fitzsimmons and Fitzsimmons [19] and Chesbrough and Spohrer [11].
Customer participation in the service process. Services are not determined by the provider alone. Instead, the participation of the customer, e.g. in terms of input, is required. Often, this collaborative aspect of service provisioning is referred to as the co-creation of the service.
Simultaneity. Service provisioning and consumption happen in the same period of time.
Perishability. Services cannot be stored as they perish once they are delivered. This property thus requires an exact match of supply and demand in order to avoid inefficient spending of resources.
Intangibility. Services are intangible in the sense that they cannot be embodied in an artifact.
Heterogeneity. Service instances are, due to the co-creation of services between producer and consumer, unique and cannot be repeated or replicated in an identical manner.
In summary, services rely on an intangible and perishable service activity that transforms the condition or state of another economic entity [25]. Obviously, this definition is very broad and captures a wide variety of services, ranging from, e.g., the transportation of goods and the execution of bank transactions to the provisioning of digital content on the Web. As each of these services requires different concepts for service management and usage, we arrange the set of underlying service activities along three dimensions: (i) Level of ICT Involvement, (ii) Level of Co-Creation, and (iii) Service Role. This setup is illustrated in Fig. 1. The technologies for intelligent service management and usage and corresponding streams of research are discussed in the subsequent sections in the context of the different classes of service activities within this three-dimensional classification.
Fig. 1 Service Landscape: service activities arranged along three complementary dimensions: (i) Level of ICT Involvement, (ii) Level of Co-Creation, and (iii) Service Role
2.1 First Dimension: Level of ICT Involvement
As a first dimension we introduce the level of ICT involvement. The dimension refers to the technical means that are used by consumers and providers to realize the service activity including their interaction within the order and delivery process.
Service activity is carried out (mainly) by human interaction. This category captures services such as transport services in logistics (see e.g. [28]). The main characteristic of these services is that the service activity can be performed primarily by humans without involvement of ICT for the service activity. While the service activity itself is carried out manually, ICT support for managing the service or accessing backend functionalities is still possible, e.g. order management, or route planning.
Service activity is supported by ICT. This category comprises services which are partially carried out by humans and by means of ICT. Most prominently this includes services that are invoked over a digital interface but either trigger real-world service activities or provide a physical product as output (e.g. Amazon’s book selling service).
Service activity is carried out by ICT. The remaining category captures fully digitalized and automated services. Amongst others, this category contains ICT infrastructure services (e.g. cloud computing services) or information services (e.g. stock quote tickers offered via the Web).
2.2 Second Dimension: Level of Co-Creation
While the first dimension classifies services according to ICT involvement in the service activity, the second dimension distinguishes the level of co-creation between customers and providers. Depending on this level, the complexity and focus of the service model and the applicable technologies change.
Service activity as black box. At the most abstract level of co-creation, the service activity is considered a black box with input provided by the customer and output returned from the provider. This means that the service activity is inherently atomic, i.e. it is not decomposed further in a meaningful way. For several management technologies and a set of specific services, this view provides a suitable level of abstraction to effectively support service providers and consumers. For example, for discovering Web services a description of service outputs can be very valuable to preselect services that might be relevant for a certain task.
Service activity as process. Instead of considering the service activity as a black box, it is possible to view the service activity as a process that is executed during service delivery. While this approach is considerably more complex even for rather simple service activities, it typically allows more powerful management and usage concepts, such as behavioral matching of services or automated service execution using a business process engine, e.g. one that executes processes specified in the WS-Business Process Execution Language (BPEL). In addition, a process-oriented view of the service activity also enables a recursive definition where individual process steps are again considered as services.
Service activity as value network. Finally, another level of complexity is added if the service activity involves multiple autonomous providers and consumers simultaneously.
Other than in the case of processes, where the overall interaction can be orchestrated by some authority, the interaction now becomes only loosely controlled through contracts between the involved parties forming a complex SVN. This scenario poses specific challenges for service management and usage. For example, scheduling and configuration algorithms might have to rely on limited and highly dynamic information only when optimizing the supply chain.
2.3 Third Dimension: Service Role
The third dimension is denoted service role and distinguishes different roles a service activity can play in relation to other services.
Service activity is stand-alone value-creation. Services of this type mainly target consumers/end users. While they may be (and often are) used in conjunction with other services, they exhibit a stand-alone business objective.
Service activity is infrastructure provisioning. Services of this type typically support higher business objectives embodied in services of the first type through provisioning of IT infrastructure.
Service activity is coordination. Services of this type relate to the management and coordination of other services. Along with providing an infrastructure for individual services, they provide the framework for the management and coordination of a multitude of different services (“meta services”, see e.g. [54]). For example, a service market platform or a discovery component needs to be seen as a service itself.
Note that this classification is inherently relative: for example, whether a service should be considered an infrastructure service depends on how many stand-alone services use it in an infrastructure-like manner.
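The three dimensions can be read as a small data model. The following Python sketch is only an illustration: the class and member names are our own, and the placement of the example activities along the co-creation and role dimensions is an assumption made for the sake of the example (the text above only fixes their level of ICT involvement).

```python
from dataclasses import dataclass
from enum import Enum


class ICTInvolvement(Enum):
    HUMAN = "carried out (mainly) by human interaction"
    ICT_SUPPORTED = "supported by ICT"
    ICT_ONLY = "carried out by ICT"


class CoCreation(Enum):
    BLACK_BOX = "black box"
    PROCESS = "process"
    VALUE_NETWORK = "value network"


class ServiceRole(Enum):
    STAND_ALONE = "stand-alone value-creation"
    INFRASTRUCTURE = "infrastructure provisioning"
    COORDINATION = "coordination"


@dataclass(frozen=True)
class ServiceActivity:
    """One point in the three-dimensional service landscape."""
    name: str
    ict: ICTInvolvement
    co_creation: CoCreation
    role: ServiceRole


def fully_digital(activity: ServiceActivity) -> bool:
    """True if the service activity itself is carried out by ICT."""
    return activity.ict is ICTInvolvement.ICT_ONLY


# Hypothetical placements; co-creation level and role are assumptions.
landscape = [
    ServiceActivity("transport service", ICTInvolvement.HUMAN,
                    CoCreation.BLACK_BOX, ServiceRole.STAND_ALONE),
    ServiceActivity("book selling service", ICTInvolvement.ICT_SUPPORTED,
                    CoCreation.PROCESS, ServiceRole.STAND_ALONE),
    ServiceActivity("cloud computing service", ICTInvolvement.ICT_ONLY,
                    CoCreation.BLACK_BOX, ServiceRole.INFRASTRUCTURE),
    ServiceActivity("service market platform", ICTInvolvement.ICT_ONLY,
                    CoCreation.VALUE_NETWORK, ServiceRole.COORDINATION),
]

digital = [a.name for a in landscape if fully_digital(a)]
```

Because the role dimension is relative, a real model would probably derive `role` from usage relations between services rather than store it as a fixed attribute.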
3 Intelligent Management Functionalities for Services
In this section we introduce the overall area of intelligent service management. We first briefly sketch the history of service-orientation in ICT and then identify the major lines of work in intelligent service management.
3.1 Overview on the Evolution of Service Technology Historically, various developments have shaped the notion of service-orientation within ICT, including object-orientation, component-based software engineering, distributed processing and business process management [29, 49, 50]. It is commonly acknowledged that several service-orientation principles have their roots in the object-oriented and component-based design paradigm, such as the reusability of services within multiple different settings, composability, and strict encapsulation, i.e. details of the implementation can and should not be obvious from the service interfaces. Other core characteristics have been contributed by the distributed computing community, such as interoperability, heterogeneity, transparency and broking.
In this context, the emerging Internet provided a suitable platform for interorganizational service provisioning. In order to address the needs of complex enterprise applications the early Web technologies have been extended by a set of Web service standards which enable services beyond simple Hypertext Transfer Protocol (HTTP) requests. These Web service standards on the one hand reuse existing Web technologies, such as on the transport layer (HTTP, URI) or on the formatting layer (Extensible Markup Language (XML)), and on the other hand add new functionality, such as additional messaging protocols (e.g. SOAP, WS-Security), service interface descriptions (e.g. Web Service Description Language (WSDL)), or service orchestration specifications (e.g. BPEL).
3.2 Overview on the Use of AI Technologies in Service Management
We now briefly identify the major lines of work in intelligent management of services, which we then discuss in more detail in Sect. 4.
3.2.1 Service and Process Description
In the context of services, the use of intelligent technologies often aims at automating selected service management tasks. In order to enable such automated methods and tools, languages with formal semantics for describing services as well as algorithms and tools exploiting these descriptions are needed. Semantic Technologies [56] are well suited for this field. Generally, service and process description languages should allow a precise formal modeling of involved resources, service functionalities, non-functional properties (e.g. credentials) and process behavior (orchestration and choreography) in a unified way. Languages for specifying constraints on services and processes, e.g. for searching, ranking and composing the services and processes automatically, should allow specification of desired functionality (desired behavior and desired changes in the resources) and desired quality (e.g. preferences over the non-functional properties). This means that formalisms for describing resources (e.g. Web Ontology Language (OWL) [26]), temporal behavior (e.g. π-calculus [44] and Petri nets [52]), and non-functional properties (e.g. [20]) need to be taken into account at the same time. In most applications, however, the manual creation of these formal descriptions is hardly possible (e.g. too cumbersome or expensive). One way to overcome this bottleneck is the automated acquisition of service and process descriptions from given data repositories. By means of a visual process editor, automatically acquired processes can be rectified and extended manually.
3.2.2 Service Discovery and Ranking
The discovery of semantic Web service and process descriptions aims at identifying those descriptions that satisfy the needs of a query. Scalable discovery solutions feature high precision and recall of the discovered descriptions matching the query. Ideally, they directly integrate the ranking phase into the discovery phase such that the most valuable service descriptions are considered first for expensive matchmaking operations. Ranking components determine an ordering of the discovered service and process descriptions and consider user preferences on functional and non-functional properties. Ranking enables the automation of service-related tasks (e.g. composition) by finding the most appropriate service or process for a given query.
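To make the two steps concrete, the following Python sketch separates discovery from ranking. It is a deliberately simplified illustration and not one of the approaches cited in this chapter: matching is reduced to set inclusion over output labels (a semantic matchmaker would reason over ontologies instead), and the preference model is an assumed weighted sum over hypothetical non-functional properties.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDescription:
    name: str
    outputs: frozenset                               # functional side (assumed labels)
    properties: dict = field(default_factory=dict)   # non-functional properties


def discover(services, required_outputs):
    """Keep the services whose outputs cover everything the query asks for."""
    required = frozenset(required_outputs)
    return [s for s in services if required <= s.outputs]


def rank(services, weights):
    """Order services by a weighted sum of their non-functional properties.

    A positive weight rewards a property, a negative weight penalizes it.
    """
    def score(service):
        return sum(w * service.properties.get(p, 0.0) for p, w in weights.items())
    return sorted(services, key=score, reverse=True)


# Hypothetical catalog and user preferences.
catalog = [
    ServiceDescription("A", frozenset({"quote"}),
                       {"availability": 0.99, "response_ms": 40}),
    ServiceDescription("B", frozenset({"quote", "chart"}),
                       {"availability": 0.95, "response_ms": 10}),
    ServiceDescription("C", frozenset({"weather"}),
                       {"availability": 0.999, "response_ms": 5}),
]
preferences = {"availability": 100.0, "response_ms": -1.0}

candidates = discover(catalog, {"quote"})
ranked_names = [s.name for s in rank(candidates, preferences)]
```

The sketch runs discovery and ranking as separate passes for readability; an integrated solution as described above would use the ranking to decide which candidates to submit to expensive matchmaking first.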
3.2.3 Service Composition
Composition is an important means for creating new services based on already existing ones. To this end, services are aggregated to form a new workflow whose input and output parameters match those of the service requested by the user. Between the services that compose the new workflow, the parameters have to match, i.e. preceding services have to create the output parameters that are required by the following services. It is important to find not only matching input and output parameter types but also parameters that fulfill the purpose the user needs. Therefore, besides syntactic type checking, the formal semantics of the parameters has to be considered.
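The parameter-matching idea can be sketched as a naive forward-chaining search. The Python code below is an illustration under strong assumptions: parameters are plain strings, so string equality stands in for the semantic matching discussed above, and the greedy strategy may select services that are not strictly needed. The service names are hypothetical.

```python
def compose(services, available, goal):
    """Chain services until all goal parameters can be produced.

    services:  dict mapping a service name to (inputs, outputs), both sets
    available: parameters the user can provide
    goal:      parameters the requested service has to deliver
    Returns the list of service names in execution order, or None.
    """
    known = set(available)
    remaining = dict(services)
    plan = []
    while not set(goal) <= known:
        # A service is runnable if all its inputs are known and it adds something new.
        runnable = sorted(
            name for name, (inputs, outputs) in remaining.items()
            if inputs <= known and not outputs <= known
        )
        if not runnable:
            return None          # no further progress possible
        name = runnable[0]       # deterministic, but otherwise arbitrary, choice
        plan.append(name)
        known |= remaining.pop(name)[1]
    return plan


# Hypothetical services.
services = {
    "geocode": ({"address"}, {"coordinates"}),
    "forecast": ({"coordinates"}, {"weather"}),
}

plan = compose(services, {"address"}, {"weather"})
no_plan = compose(services, {"zip_code"}, {"weather"})
```

Here `geocode` has to precede `forecast` because it creates the `coordinates` parameter that `forecast` requires, which is exactly the matching condition between consecutive services.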
3.2.4 Service Market Platforms and Negotiation In contrast to the current situation where passive Web services are offered, our goal is to provide solutions for the autonomous and active engagement of services in economic activities. We intend to model the “intelligent” behavior of services, through analytics methods for seamless monitoring and analysis of all relevant service activities.
3.2.5 Intelligent Service Analytics Intelligent service analytics is a wide research field concerned with methods that deal with the automatic detection of patterns in data about services. Inductive techniques of intelligent data analysis [9] can also be combined with semantic technologies to form “hybrid” intelligent technologies [10]. In the context of the intelligent management of services, intelligent data analytics aims at techniques for analyzing information about services, processes or loose service networks with the goal of highlighting useful information, predicting future developments, and supporting coordinative actions.
Fig. 2 Overview of the Semantic Service Description Formalism described in Sect. 4.1
4 Assessing Intelligent Service Management Technologies In this section, we take a deeper look at the application of intelligent management functionalities in selected areas of the service landscape. As a tribute to Rudi Studer and in line with the dedication of this collection, we particularly focus on work from the area of Semantic Technologies performed within his research groups in Karlsruhe.
4.1 Semantic Description of Services
In order to enable the application of intelligent methods and tools, languages with formal semantics for describing services in terms of functionalities and non-functional characteristics are needed. Along the structure of the service landscape sketched in Sect. 2, this is relevant for all areas where the service activity is at least supported by ICT (level of ICT involvement) and for all types of service roles. Depending on the level of co-creation, functional properties can be described solely by means of resources (service activity as black box) or by means of a combination of resources and dynamic behavior (service activity as process or as value network). Adding the dynamic behavior of a service to the semantic description makes it possible to consider services as processes and SVNs. Figure 2 illustrates this idea and exemplifies how the different service aspects can be semantically described. Based on the approach presented in [5], we discuss each of the three aspects in more detail in the following.
4.1.1 Modeling Resources with Ontologies
Most approaches to semantic service descriptions, such as OWL-S [58], WSMO [14], or SAWSDL [18], allow the use of ontologies to describe the resources involved in a service activity. In addition, relationships between resources, in particular “equality” and “inequality”, can be specified. These relationship types are necessary to achieve interoperability in the descriptions of individuals and are directly provided by most ontology languages such as the Web Ontology Language OWL [26]. The resources can be further classified into sets that can be hierarchically ordered according to the subset relationship. In addition to the mentioned relation types, OWL also allows the modeling of arbitrary relation types among concepts and the use of relations among individuals.
4.1.2 Modeling Behavior with Pi-Calculus
When moving beyond the black box view within the service landscape, a Web service, whether stateless or stateful, is seen as a complex process or even network. Mostly, Web services following a Remote Procedure Call (RPC) paradigm are stateless, whereas Web services with a flow of Web pages are typically stateful. In order to model the semantics of both kinds of Web services, the semantic process description language suprime PDL can be used [5]. The syntax is defined as follows:
\[
P ::= 0 \mid y[v_1, \ldots, v_n].Q \mid y\langle x_1, \ldots, x_n\rangle.Q \mid l(x_1, \ldots, x_n)(y_1, \ldots, y_m).Q \mid [\omega]Q \mid P_1 \parallel P_2 \mid P_1 + P_2 \mid @A\{y_1, \ldots, y_n\}
\]
The named process expression is called Agent Identifier. For any agent identifier $A$ (with arity $n$), there must be a unique defining equation $A(x_1, \ldots, x_n) \stackrel{\mathrm{def}}{=} P$, where the names $x_1, \ldots, x_n$ are distinct and are the only names which may occur unbound in $P$. Now, the process Agent $@A\{y_1, \ldots, y_n\}$ behaves like $P\{y_1/x_1, \ldots, y_n/x_n\}$. Note that defining equations provide recursion, since $P$ may contain any agent identifier, even $A$ itself.
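The way an agent invocation unfolds into its defining equation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it covers the null process, input, output, parallel composition, choice and agent invocation, omits the local operation and condition constructs, and performs no capture-avoiding renaming. All class and function names, and the `Echo` agent, are hypothetical and not part of suprime PDL.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Nil:
    """The null process 0."""


@dataclass(frozen=True)
class Input:
    """y[v1, ..., vn].Q: receive values on channel y, binding v1..vn in Q."""
    channel: str
    params: Tuple[str, ...]
    cont: object


@dataclass(frozen=True)
class Output:
    """Send the names x1..xn on channel y, then continue as Q."""
    channel: str
    args: Tuple[str, ...]
    cont: object


@dataclass(frozen=True)
class Parallel:
    left: object
    right: object


@dataclass(frozen=True)
class Choice:
    left: object
    right: object


@dataclass(frozen=True)
class Invoke:
    """@A{y1, ..., yn}: invocation of the agent identifier A."""
    agent: str
    args: Tuple[str, ...]


def substitute(process, sigma: Dict[str, str]):
    """Replace free names of `process` according to sigma (no capture avoidance)."""
    def rename(name):
        return sigma.get(name, name)

    if isinstance(process, Nil):
        return process
    if isinstance(process, Input):
        # Names bound by the input shadow the substitution inside the continuation.
        inner = {k: v for k, v in sigma.items() if k not in process.params}
        return Input(rename(process.channel), process.params,
                     substitute(process.cont, inner))
    if isinstance(process, Output):
        return Output(rename(process.channel),
                      tuple(rename(a) for a in process.args),
                      substitute(process.cont, sigma))
    if isinstance(process, (Parallel, Choice)):
        return type(process)(substitute(process.left, sigma),
                             substitute(process.right, sigma))
    if isinstance(process, Invoke):
        return Invoke(process.agent, tuple(rename(a) for a in process.args))
    raise TypeError(f"unknown process term: {process!r}")


def unfold(invocation: Invoke, definitions):
    """@A{y1..yn} behaves like P{y1/x1, ..., yn/xn} for the definition A(x1..xn) = P."""
    formals, body = definitions[invocation.agent]
    if len(formals) != len(invocation.args):
        raise ValueError("arity mismatch")
    return substitute(body, dict(zip(formals, invocation.args)))


# Hypothetical recursive agent: Echo(inp, out) = inp[msg].out<msg>.@Echo{inp, out}
DEFINITIONS = {
    "Echo": (("inp", "out"),
             Input("inp", ("msg",),
                   Output("out", ("msg",), Invoke("Echo", ("inp", "out"))))),
}
```

Unfolding `Invoke("Echo", ("req", "resp"))` yields the body of `Echo` with `inp` and `out` replaced by `req` and `resp`. Because the body itself ends in an invocation of `Echo`, repeated unfolding gives the recursion mentioned above.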
4.1.3 Modeling Non-functional Properties with SPKI
Agarwal et al. [5] present a semantic extension of the Simple Public Key Infrastructure (SPKI) [12, 16, 17] for modeling non-functional properties in such a way that any user can certify any service properties in a decentralized fashion, while still allowing users to build their trust in the properties and automatically reason about them despite possible heterogeneity. SPKI, though simple and powerful, has the drawback that the names of the properties are simple strings, which do not allow certification of complex properties, e.g. AIFB-Employee and above 25, in a way that one can automatically reason about them. This problem can be addressed with Semantic-SPKI by viewing SPKI names as DL concept descriptions and public
keys as DL individuals. This means that in a name certificate, we use a DL concept expression in place of the identifier A. This makes it possible to issue complex properties, since complex properties can be constructed by using DL constructors for building complex concepts. The subject of a name certificate is either a key or a name. In a Semantic-SPKI name certificate (K, C, S, V), we view the name certificate as equivalent to the ABox assertion C(S) if S is a key, or as equivalent to the TBox assertion S ⊑ C if it is a name.
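The case distinction between key subjects and name subjects can be written down as a small sketch. It treats concept expressions as opaque strings and does no reasoning; the class names, the helper `to_axiom`, the key identifiers and the concept `Above25` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Key:
    """A public key, viewed as a DL individual."""
    fingerprint: str


@dataclass(frozen=True)
class Name:
    """An SPKI name, viewed as a DL concept description (kept as an opaque string)."""
    concept: str


@dataclass(frozen=True)
class NameCert:
    """Name certificate (K, C, S, V): issuer key, concept, subject, validity."""
    issuer: Key
    concept: str
    subject: Union[Key, Name]
    validity: str


def to_axiom(cert: NameCert) -> Tuple[str, str]:
    """Read a name certificate as an ABox assertion C(S) or a TBox axiom S ⊑ C."""
    if isinstance(cert.subject, Key):
        return ("ABox", f"({cert.concept})({cert.subject.fingerprint})")
    return ("TBox", f"{cert.subject.concept} ⊑ {cert.concept}")


# A complex property certified for a key, and a subsumption certified for a name.
CERT_FOR_KEY = NameCert(Key("K_issuer"), "AIFB-Employee ⊓ Above25", Key("K_alice"), "2011")
CERT_FOR_NAME = NameCert(Key("K_issuer"), "Employee", Name("AIFB-Employee"), "2011")
```

Collecting the resulting axioms from many certificates would give a knowledge base that a DL reasoner could use; that step is outside this sketch.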
4.1.4 Service Policies
While the former approach allows for specifying a single allowed configuration of a service, in many real-world scenarios a wide range of different configurations are offered or requested, e.g. the non-functional property response time of a Web service has to be at most 10 ms and does not have to be exactly 10 ms. Several policy languages have been proposed for describing such service configurations via constraints over non-functional properties. They typically support all kinds of services independent of the level of ICT involvement. However, up until now most policy languages have been proposed for a black box view of the service activity. Within the current Web service language stack, WS-Policy [65], EPAL [30] and the XACML-based Web Service Policy Language (WSPL) [46] are used to express constraints on the usage of services, such as access rights or quality of service guarantees, and on the privacy of the data exchanged within a Web service transaction. They use XML as syntax for representing the policies, and the meaning of XML tags is defined in a natural language specification, which is not amenable to machine interpretation and is subject to ambiguous interpretations. In order to improve automated consistency checking and enforcement of policies, several ontology-based policy languages have been proposed. While KAoS [62], Kolovski et al. [37] and Lamparter [38] are based mainly on OWL-DL, REI [34] uses OWL-Lite only as syntax for exchanging policies and performs reasoning based on a logic programming approach. Since pure OWL-DL is not fully sufficient to cope with the situation where one value depends on other parts of the ontology, KAoS extends the logic by so-called role-value maps. The most recent approach by Kolovski et al. [37] defines the semantics of the WS-Policy specification by means of an OWL-DL ontology. This allows them to use a standard OWL reasoner for policy management and enforcement.
However, due to the open world assumption of OWL, their results are sometimes counterintuitive. Therefore, they plan to extend their approach with default logic and ensure decidability by restricting default rules to named entities [36]. The latter roughly corresponds to the usage of DL-safe rules in the approach by Lamparter [38], where the traditional view of policies as simple Boolean constraints is extended to utility function policies, which can also be applied for service ranking and selection. While the former approaches adopt a black box view on the service activity, approaches for semantic policy descriptions for processes as well as service value networks as described above are supported by Agarwal et al. [4]. The work shows
how complex preference expressions formalized as utility functions can be attached to process descriptions and thereby a ranking of service activities that are viewed as processes and value networks can be derived.
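The difference between a policy as a Boolean constraint and a policy as a utility function can be made concrete with a minimal sketch. It uses the response-time example from above; the weights, the price attribute, the thresholds and the offers are assumptions made for illustration and are not taken from any of the cited policy languages.

```python
def satisfies_constraint(offer, max_response_ms=10.0):
    """Boolean policy: response time must be at most (not exactly) the bound."""
    return offer["response_ms"] <= max_response_ms


def utility(offer, max_response_ms=10.0, best_response_ms=2.0, max_price=5.0):
    """Utility policy: maps a configuration to a score in [0, 1] (hypothetical weights)."""
    rt = offer["response_ms"]
    if rt <= best_response_ms:
        time_score = 1.0
    elif rt >= max_response_ms:
        time_score = 0.0
    else:
        # Linear interpolation between the best and the worst acceptable response time.
        time_score = (max_response_ms - rt) / (max_response_ms - best_response_ms)
    price_score = max(0.0, 1.0 - offer["price"] / max_price)
    return 0.6 * time_score + 0.4 * price_score


def rank(offers):
    """Filter by the Boolean constraint, then order the remaining offers by utility."""
    admissible = [o for o in offers if satisfies_constraint(o)]
    return sorted(admissible, key=utility, reverse=True)


OFFERS = [
    {"name": "A", "response_ms": 4.0, "price": 1.0},
    {"name": "B", "response_ms": 9.0, "price": 0.5},
    {"name": "C", "response_ms": 12.0, "price": 0.1},  # violates the 10 ms bound
]
```

The constraint alone only separates admissible from inadmissible offers, whereas the utility function additionally orders the admissible ones, which is what makes such policies usable for ranking and selection.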
4.1.5 Acquisition of Semantic Service Descriptions
While there are many approaches for describing services, in most applications manual creation of these formal descriptions is hardly possible (e.g. too cumbersome or expensive). One way to overcome this bottleneck is the automated mining of processes from given data repositories. This approach allows organizations to obtain relevant knowledge about their running business processes and services in a more efficient and effective way and to detect possible problems or bottlenecks in their systems for further optimizations. Bai [7] developed a tool that crawls the Web for WSDL documents and converts each WSDL description into a suprime PDL description. Since such automatically generated semantic descriptions cannot be guaranteed to be semantically correct, the automatic tool is augmented with a graphical editor for manual refinement and improvement of the descriptions. To overcome the lack of freely available WSDL documents, a slightly modified version of the open source tool FORM2WSDL (see footnote 1) has been used to convert HTML forms to WSDL. With the two converters, a large number of semantic descriptions of Web services can be obtained by crawling. Since the types used in various WSDL documents are not interconnected and it is a time-consuming task to map the large number of ontologies manually, the mapping tool FOAM [15] is used for detecting simple mappings between the ontologies automatically. Agarwal [1] extended the semi-automatic acquisition technique to obtain semantic descriptions of processes implicitly contained in Web sites. The work presents a mapping of the dynamics and data flow of Web sites to a semantic process description language as discussed above. Web pages, Web sites and Web services are viewed in a unifying way as Web processes.
The automatic acquisition algorithms and the graphical editor for manual editing have been implemented as part of the suprime framework (http://www.aifb.kit.edu/web/Suprime_Intelligent_Management_and_Usage_of_Processes_and_Services). Hoxha and Agarwal [27] further extend the approach with semantic value selection for form submissions as well as improved form processing that exploits the form layout.
Footnote 1: http://www.yourhtmlsource.com/projects/Form2WSDL/
4.2 Discovery and Ranking of Service and Process Descriptions
Efficient discovery of services is a central task in the field of Service-Oriented Architecture (SOA). Service discovery techniques enable users, e.g. end users and developers of a service-oriented system, to find appropriate services for their needs
Fig. 3 Discovery approaches that use the same formalism for an offer D and a request R are based on intersections (left). Using two different formalisms allows the (exact) matches to be specified in the request (right)
by matching users' goals against available descriptions of services. The more formal the service descriptions are, the more automation of discovery can be achieved while still ensuring comprehensibility of a discovery technique. Following the structure of the service landscape sketched in Sect. 2, this is relevant for all areas where the service activity is at least supported by ICT (level of ICT involvement) and for all types of service roles. However, the objects of discovery will usually be either individual services, processes or entire value networks (level of co-creation).
4.2.1 Matchmaking of Atomic Services
The left side of Fig. 3 shows a service description D and a request R that are interpreted as sets of service executions. The executions are depicted by the dots. A match is given if there is an intersection between D and R. Degrees of match, for instance plugin or subsume match, represent different types of intersection of the two sets of runs. Although the notion of different types of matches was applied in many discovery approaches [13, 23, 35, 43, 57, 59], intersection-based approaches require further matchmaking to make sure that a service delivers the expected result for all possible input values. The right side of Fig. 3 shows a more intuitive interpretation of a request (in analogy to database queries), in which a request is viewed as the set of all desired services a user is looking for. A desired service is described by a combination of desired properties, depicted by the dots in the figure. Any service description from the pool of available service descriptions that describes a service contained in the set is considered a match for the request. Junghans et al. [33] present a solution for specifying desired service properties and for ruling out undesired properties. However, the approach neither covers NFPs for discovery nor considers changes caused by the service execution itself.
To address these issues, Junghans and Agarwal [32] present a discovery approach that treats functional and non-functional properties uniformly and enables service functionality descriptions to capture the dynamics of service executions. The approach is based on an extended formal model of Web services that treats functional and non-functional properties with the same importance. The same work presents the service request formalism, with its semantics for specifying even complex constraints on Web service properties, as well as the definition of a match between offers and requests.
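The two readings of a request sketched in Fig. 3 can be contrasted in code. The example below is a deliberately simplified sketch: executions are modeled as (input, output) pairs and service descriptions as plain dictionaries, which is an assumption of this illustration and not the formal model of [32] or [33]. The currency-conversion services are hypothetical.

```python
def intersection_match(offer_runs, request_runs):
    """Left side of Fig. 3: offer and request share at least one execution."""
    return not offer_runs.isdisjoint(request_runs)


def covers_all_inputs(offer_runs, request_runs, required_inputs):
    """Further check: for every required input the offer has runs, all of them acceptable."""
    for inp in required_inputs:
        runs_for_input = {run for run in offer_runs if run[0] == inp}
        if not runs_for_input or not runs_for_input <= request_runs:
            return False
    return True


def query_match(description, request):
    """Right side of Fig. 3: the request is a set of conditions on desired properties."""
    return all(prop in description and accepts(description[prop])
               for prop, accepts in request.items())


# Executions as (input currency, output currency) pairs.
OFFER_RUNS = {("USD", "EUR"), ("GBP", "EUR")}
REQUEST_RUNS = {("USD", "EUR"), ("JPY", "EUR")}

# Query-style request over service descriptions.
DESCRIPTIONS = {
    "fx-basic": {"output": "EUR", "inputs": {"USD", "GBP"}},
    "fx-full": {"output": "EUR", "inputs": {"USD", "GBP", "JPY"}},
}
REQUEST = {
    "output": lambda out: out == "EUR",
    "inputs": lambda supported: {"USD", "JPY"} <= supported,
}
MATCHES = [name for name, d in DESCRIPTIONS.items() if query_match(d, REQUEST)]
```

Here the intersection test succeeds although the offer cannot convert JPY, which is the situation that calls for further matchmaking. The query-style request rules out that service directly.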
4.2.2 Matchmaking of Processes and Networks
Agarwal and Studer [6] presented an approach for matchmaking Web services based on their process-like annotations. Unlike most of the existing approaches, which may provide expressive formalisms for describing Web services but do not provide matchmaking algorithms that make use of the expressivity, our matchmaking approach can use all the available semantic information. The request formalism allows users to specify constraints on inputs and outputs including the relationships between them, constraints on the temporal behavior of the process underlying Web services, and logical combinations of both types of constraints. The main feature of our matchmaking algorithm is that it provides not only a match/no-match answer, as is the case with most of the existing approaches, but also a set of conditions that a user has to fulfill in order to obtain the required functionality from a Web service. A combination of this discovery approach with a utility-based ranking approach is demonstrated by Agarwal et al. [4].
4.2.3 Ranking of Discovered Results
While the discovery component determines the set of descriptions that match a request, the ranking component determines an ordering of the discovered service and process descriptions and considers user preferences on functional and non-functional properties. It enables automation of service-related tasks (e.g., composition) by finding the most appropriate service or process for a given query. Therefore, the development of the ranking component comprises formalisms to specify preferences on service properties. Approaches for representing cardinal preferences are presented by Lamparter et al. [39] based on utility functions and by Agarwal and Lamparter [3] based on fuzzy logic. Intuitively, a fuzzy rule describes which combination of property values a user is willing to accept to which degree, where property values and degree of acceptance are fuzzy sets. Agarwal and Hitzler [2] show how fuzzy IF-THEN rules can be evaluated with the help of a DL reasoner. The main novelties of the fuzzy-logic-based approach can be summarized as follows:
Expressivity: This approach is capable of modeling complex preferences and thus of considering the relationship between different non-functional properties. For instance, the prior approaches did not allow users to formulate that a Web service with a high price and with a comparably large response time is not acceptable.
Efficiency: Using fuzzy logic brings the well-proven benefit of low computational cost for computing a ranking. Considering the vast number of targeted Web service descriptions and the potential size of user preferences, the complexity of a Web service ranking algorithm is crucial for usability.
Indecisiveness: Users are not forced to formulate crisp preferences; they do not even need to be aware of specific values of a property. The fuzzy-logic-based approach allows users to formulate imprecise requirements.
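How fuzzy IF-THEN rules can turn imprecise preferences into a ranking is sketched below. The sketch makes several assumptions that are not from the cited work: linear membership functions, `min` as conjunction, and a weighted average of rule outputs (a Sugeno-style evaluation) in place of the DL-reasoner-based evaluation of [2]. The services and all numbers are hypothetical.

```python
def clamp(x):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def low_price(price, worst=10.0):
    """Degree to which a price counts as 'low' (1.0 at price 0, 0.0 at `worst`)."""
    return clamp((worst - price) / worst)


def fast(response_ms, worst=200.0):
    """Degree to which a response time counts as 'fast'."""
    return clamp((worst - response_ms) / worst)


def acceptance(service):
    """Evaluate four fuzzy rules and combine them into one degree of acceptance."""
    cheap = low_price(service["price"])
    expensive = 1.0 - cheap
    quick = fast(service["response_ms"])
    slow = 1.0 - quick
    # (firing strength of the IF part, acceptance level of the THEN part)
    rules = [
        (min(cheap, quick), 1.0),      # low price AND fast       -> fully acceptable
        (min(cheap, slow), 0.5),       # low price AND slow       -> partly acceptable
        (min(expensive, quick), 0.5),  # high price AND fast      -> partly acceptable
        (min(expensive, slow), 0.0),   # high price AND slow      -> not acceptable
    ]
    # At least one rule fires with strength >= 0.5, so the total is never zero.
    total = sum(firing for firing, _ in rules)
    return sum(firing * level for firing, level in rules) / total


def rank(services):
    return sorted(services, key=acceptance, reverse=True)


SERVICES = [
    {"name": "s1", "price": 2.0, "response_ms": 50.0},
    {"name": "s2", "price": 9.0, "response_ms": 180.0},
    {"name": "s3", "price": 5.0, "response_ms": 100.0},
]
```

The fourth rule expresses exactly the kind of combined preference mentioned under Expressivity: a high price together with a large response time is not acceptable, even though neither property is rejected on its own.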
4.3 Service Composition
Composition is an important means for creating new services based on already existing ones. To this end, services are aggregated to form a new workflow whose input and output parameters match those of the service requested by the user. Following the structure of the service landscape sketched in Sect. 2, composition is relevant for all areas where the service activity is at least supported by ICT (level of ICT involvement), and a composition will typically contain services of different service roles. Here, the objects of investigation are atomic services (level of co-creation), while the eventual goal of the composition is the resulting processes or SVNs. Between the services that compose the new workflow, there have to be matches between their parameters, i.e. preceding services have to create those output parameters that are required by the following services. It is important to find not only matching input and output parameter types, but also parameters that fulfill the purpose the user needs. Therefore, besides a technical matching, the semantics of the parameters and their use in the services have to be taken into account. Correspondences between parameter notations that use different vocabularies but share a semantic meaning should be found, so that the specific services can be composed nonetheless. The aim is to provide methods that ensure that a composition is syntactically as well as semantically correct. While this has to be an automatic procedure because of the huge number of possible workflow compositions, there may be manual parts for the human user, which ensure transparency and control. Non-functional properties may not be covered satisfactorily, even if a sufficient service description language is provided. This may happen because users themselves have problems articulating or weighing their non-functional preferences.
A manual revision of the composition should ensure that all of the user's needs and specific wishes are taken into account. Therefore we strive for a semi-automatic composition. Furthermore, a dynamic reconfiguration at runtime should be possible if required. This can be the case if some services that are used in the composition fail or are overloaded, or because some external or internal parameters have changed. The system should recognize such cases and react by autonomously finding and providing solutions, e.g. by replacing a part of the composition. Uniform data mediation is an important basis for composition methods. Thus, the service description language has to ensure powerful data mediation. The composition's trustworthiness should be compatible with service-oriented computing, so that its unrestricted usage can be guaranteed.
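The requirement that preceding services produce the parameters needed by following services, including the case where two vocabularies name the same parameter differently, can be illustrated with a simple forward-chaining sketch. It is an assumption-laden toy: parameter types are plain strings, semantic correspondence is reduced to a synonym table, and the resulting workflow may contain services that are not strictly needed. The service names and types are hypothetical.

```python
def compose(services, available, goal, synonyms=None):
    """Chain services whose inputs are already available until the goal types are produced.

    services: name -> (set of input types, set of output types)
    synonyms: maps a vocabulary term to its canonical term (stand-in for semantic matching)
    Returns the list of service names in execution order, or None if no chain is found.
    """
    synonyms = synonyms or {}

    def canon(term):
        return synonyms.get(term, term)

    known = {canon(t) for t in available}
    wanted = {canon(t) for t in goal}
    remaining = dict(services)
    plan = []
    progress = True
    while not wanted <= known and progress:
        progress = False
        for name, (inputs, outputs) in list(remaining.items()):
            if {canon(t) for t in inputs} <= known:
                del remaining[name]
                produced = {canon(t) for t in outputs}
                if not produced <= known:  # only keep services that add something new
                    known |= produced
                    plan.append(name)
                    progress = True
                if wanted <= known:
                    break
    return plan if wanted <= known else None


SERVICES = {
    "geocode": ({"Address"}, {"Coordinates"}),
    "weather": ({"GeoPoint"}, {"Forecast"}),
    "translate": ({"Text"}, {"TranslatedText"}),
}
# Two vocabularies use different names for the same kind of parameter.
SYNONYMS = {"GeoPoint": "Coordinates"}
```

With the synonym table, `compose(SERVICES, {"Address"}, {"Forecast"}, SYNONYMS)` chains `geocode` and `weather`; without it, the purely syntactic comparison finds no composition. A semi-automatic setting would present such a plan to the user for revision before execution.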
4.4 Service Market Platforms
A service market platform augments the technical service management infrastructure with a business-oriented view. In this regard, the technical aspects of service descriptions and requests have to be extended by economic properties such as price.
Based on these extended descriptions, more sophisticated economic coordination functionality can be provided to service providers and customers leading to a higher efficiency in the market. In general, a service market platform consists of three major components: (i) a bidding language which is used to advertise service offers and requests on the market; (ii) a market mechanism for allocating offers to requests and for determining the market price; (iii) and finally a contract management component for specifying and executing the agreements reached on the market. Approaches for realizing these components are discussed in the following.
4.4.1 Bidding Language
Generally, the structure of bidding languages in electronic commerce depends on the product that is traded on the market. As a first step, we adopt the black box view on a single service. As outlined above, in this context a service can be described by a functional and a non-functional description. For supporting non-functional descriptions of a service, a multi-attributive bidding language is required. Such languages allow the specification of several parameters beyond price that are negotiated within the market. In the case of services, these parameters could represent quality of service guarantees given by the provider or guarantees requested by the customer. Specifically for Web services, several XML-based bidding languages have been proposed, such as WS-Agreement [21] and the Web Service Offering Language (WSOL) [61], that support the specification of service requests and offers. As they support only discrete attributes and no functional relations, an exponential number of price attachments are required to model multi-attributive service descriptions. To address this issue, Lamparter et al. [39] show how offers and requests for configurable Web services can be efficiently represented and semantically matched within the discovery step. If the service activity is seen as a process, additional complexity is added to the bidding language. In this situation, a combinatorial bidding language is required which supports the specification of interdependent offers. For example, one might have to specify that for all activities within the process a suitable service has to be acquired on the market. If that is not possible, no service should be allocated to the request. Combinatorial bidding languages are for instance presented in Nisan [47] and Schnizler et al. [53]. How ontologies can be used to efficiently express multi-attributive, combinatorial service requests and offers is presented in Lamparter et al. [40].
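The remark about an exponential number of price attachments can be checked with a short sketch. It compares an explicit price list, with one entry per attribute combination, against a single pricing function over the same attributes. The attributes, their levels and the pricing formula are hypothetical and chosen only to make the counting visible.

```python
from itertools import product

# Discrete levels per non-functional attribute (hypothetical values).
ATTRIBUTE_LEVELS = {
    "availability": [0.99, 0.999],
    "encrypted": [False, True],
    "response_ms": [10, 50, 100],
}


def price(config):
    """Functional price description: one rule covers every configuration."""
    p = 1.0
    p += 5.0 / config["response_ms"]                      # faster is more expensive
    p += 2.0 if config["availability"] >= 0.999 else 0.0  # premium availability
    p += 0.5 if config["encrypted"] else 0.0
    return p


def enumerate_price_list(levels):
    """Discrete-attribute style: attach a price to every single combination."""
    names = sorted(levels)
    return {
        combo: price(dict(zip(names, combo)))
        for combo in product(*(levels[name] for name in names))
    }


def table_size(levels):
    """Number of price attachments needed: the product of the level counts."""
    size = 1
    for values in levels.values():
        size *= len(values)
    return size
```

Three attributes with 2, 2 and 3 levels already need 12 entries, and ten attributes with five levels each would need 5 to the power of 10, whereas the functional description stays the same size.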
4.4.2 Market Mechanisms
Considering the economic properties of the service selection/ranking algorithms, some problems become evident: the service selection is not incentive compatible in the sense that providers have no incentive to reveal their true valuation of providing their service to the customer. That means it can be advantageous for a provider to
strategically over- or underprice the service. To address this problem, market mechanisms can be introduced where prices are dynamically determined based on supply and demand. Thereby, markets make sure that a provided service is awarded to the requester who has the highest valuation of the service. This is particularly relevant for resource-restricted services like computational (grid) services, since not all requesters might get a certain service due to resource limitations. For supporting such complex scenarios with resource-restricted service processes, a combinatorial multi-attribute double auction such as the MACE system [53] is required. In order to support semantic matching within the market, Lamparter and Schnizler [42] extend the MACE system with a semantic bidding language as outlined in Lamparter et al. [40] and a semantic matching algorithm as discussed in Sect. 4.2.2. In addition, Lamparter and Schnizler [42] show how semantic dependencies between service offers and requests can be utilized to enhance the market performance. An approach for extending the view to an entire service value network is given by van Dinther et al. [64].
4.4.3 Web Service Contract Management
Automated management of contracts requires formal, machine-interpretable descriptions [45]. Thereby, automation of management tasks like contract formation, monitoring and execution is enabled. There are several languages that strive for formalization of Web service contracts. Lamparter et al. [41] present an ontology for representing parts of Web service contracts using OWL-DL. Other approaches, such as Grosof and Poon [24], Governatori [22], Paschke et al. [51] and Oldham et al. [48], formalize legal clauses using various proprietary rule formalisms. While fully automated contracting is hardly achievable with these approaches, augmenting an automatically closed contract with an umbrella contract that provides the legal basis for the automation seems promising [41].
4.5 Intelligent Service Analytics
Intelligent service analytics is a wide research field concerned with methods that deal with the automatic detection of patterns in data about services. In the context of the intelligent management of services, the aim is to use intelligent data analytics [9] for analyzing information about services, processes or loose service networks with the goal of highlighting useful information, predicting future developments, and supporting coordinative actions. Current research begins to provide methods and tools that allow for the continuous monitoring, analysis, and prediction of the technical and business characteristics of individual services [8, 66]. Similarly, automated support in creating process descriptions, called 'process mining' [63], can reduce the manual effort and may help to identify discrepancies between the modeled process and the one practically running in the business domain. Further, initial methods for analyzing information about decentralized service networks are a growing field of research on SVNs [67, 68].
5 Future Perspectives for Intelligent Service Management
In this section, we look at some perspectives and future challenges of intelligent service technology. We identify areas where intelligent service management technologies are likely to play an important role.
5.1 Situational (Web) Processes
Most of today's business processes are complex and involve more than one party or more than a single-step procedure. In the Web, this is reflected by the existence of billions of Web sites, which may be regarded as complex processes, while on the other side there are only a few thousand publicly available WSDL files that present single services. The availability of semantic descriptions of services and processes in the Web facilitates their discovery, as well as their composition into more complex workflows. It also facilitates the automatic execution of such workflows despite their heterogeneity. However, the lack of semantic descriptions of Web processes prevents users from applying such sophisticated automatic techniques. The scope of future research is to fill this gap by providing semi-automatic techniques for the acquisition of a large number of semantic process descriptions on the (deep) Web. To this end, the data found in the online sources should be modeled formally using ontologies, and the processes of users navigating through Web forms should be used to generate semantic service descriptions.
5.2 Monitoring and Analytics of Service Ecosystems
Creating a service-based economy for the Future Internet necessarily requires coupling any technical solution with a proper understanding of the economic aspects involved. The ability to offer flexible and successful services will be strictly dependent on the capacity to analyze services, processes and SVNs comprehensively: from the technical details to high-level business perspectives in an integrated fashion. Specific challenges arise when services are offered together on loose Service Value Networks. Service Value Networks are and need to be subject to analysis and data mining for highlighting interesting features, for the prediction of future developments, for optimizing the setup of the network with respect to parameters and for supporting coordinative actions. However, given the decentralized nature of Service Value Networks, complete information on the entire network will usually not be available and might even be withheld strategically by individual actors. This raises multiple questions with respect to analytical techniques, including the question of how robust analysis techniques can be devised that can deal with limited amounts of available data and how analysis techniques can be devised to actively interact with actors in the SVN and make them disclose missing information most relevant to the
analysis task at hand, possibly by providing appropriate compensation. Orthogonal to these, the question arises how monitoring infrastructures can be devised in such a way that they can capture relevant information, even in a decentralized environment.
5.3 Optimization and Self-management of Services
The key to successful service offerings will be the ability of services to adapt to ever-changing market conditions. Services may disappear, change their price, or modify the Quality of Service provided. New service providers may affect the market with novel solutions that are cheaper, faster, more reliable, or of better quality overall. And as the market evolves, new opportunities may arise on the basis of unprecedented partnerships and value co-creations. Providing techniques for sustaining profitable businesses within such a dynamic environment will be paramount for the Future Internet. While existing composition technology starts to enable the initial ad-hoc formation of Service Value Networks, future technologies need to provide the capacity to automatically rearrange service execution to meet demand and honor the Service Level Agreements contracted, the possibility to swap service providers if a certain service is not fulfilling requirements or if new, more suitable services appear, or even the ability to automatically engage in new SVNs transparently. Future intelligent service management technologies will support the optimization of SVNs based on perceived metrics, SLAs and KPIs, but also on the time and cost for adopting the optimization measures and the propagation of changes both at the technical and at the business level.
6 Conclusion
Intelligent infrastructures, in particular Semantic Technologies, have accompanied the development towards service-centric architectures and computing paradigms from the early days. In this contribution, we have provided an overview of the current and envisioned future areas of intelligent service management. As a general introduction, we have structured the landscape of services according to three complementary dimensions: (i) Level of ICT Involvement, (ii) Level of Co-Creation, and (iii) Service Role. We have then assessed the current state of intelligent service management technologies in more detail, with a strong focus on expressive service and process descriptions based on Semantic Technologies. The expressivity of the description techniques can cover the first two dimensions completely. Regarding the third dimension, they have been employed only for stand-alone services so far, and it remains a future task to evaluate how far they can be applied to infrastructure and coordination services as well. We then discussed techniques for obtaining semantic descriptions of services semi-automatically. Having the semantic descriptions that contain information required for intelligent management and usage of
services in machine-interpretable form, we then discussed how some of the management and usage tasks, namely discovery, ranking, composition, service markets and service analytics, can be performed automatically to a large extent. Finally, we have sketched some recent developments, namely situational Web processes, monitoring and analytics of service ecosystems, and optimization and self-management of services, that will most likely play an important role in the field in the future.
References
1. Agarwal, S.: Semi-automatic acquisition of semantic descriptions of web sites. In: Proceedings of the Third International Conference on Advances in Semantic Processing, Sliema, Malta. IEEE Press, New York (2009)
2. Agarwal, S., Hitzler, P.: Modeling fuzzy rules with description logics. In: Proceedings of Workshop on OWL Experiences and Directions, Galway, Ireland (2005)
3. Agarwal, S., Lamparter, S.: sMart—a semantic matchmaking portal for electronic markets. In: Mueller, G., Lin, K.-J. (eds.) Proceedings of the 7th IEEE International Conference on E-Commerce Technology, Munich, Germany, pp. 405–408 (2005)
4. Agarwal, S., Lamparter, S., Studer, R.: Making web services tradable: a policy-based approach for specifying preferences on web service properties. J. Web Semant. 7(1), 11–20 (2009)
5. Agarwal, S., Rudolph, S., Abecker, A.: Semantic description of distributed business processes. In: Hinkelmann, K., Abecker, A., Boley, H., Hall, J., Hepp, M., Sheth, A., Thönssen, B. (eds.) AAAI Spring Symposium—AI Meets Business Rules and Process Management, Stanford, USA (2008)
6. Agarwal, S., Studer, R.: Automatic matchmaking of web services. In: International Conference on Web Services (ICWS'06). IEEE Comput. Soc., Los Alamitos (2006)
7. Bai, T.: Automatische Extraktion von Semantischen Beschreibungen von Web Services (Automatic extraction of semantic description of web services). Student Research Project supervised by Sudhir Agarwal and Rudi Studer (2006)
8. Basu, S., Casati, F., Daniel, F.: Toward web service dependency discovery for SOA management. In: Proceedings of the 2008 IEEE International Conference on Services Computing (SCC), Honolulu, Hawaii, USA, 8–11 July 2008, pp. 422–429. IEEE Comput. Soc., Los Alamitos (2008)
9. Berthold, M.R., Hand, D.J.: Intelligent Data Analysis. Springer, Berlin (2003)
10. Bloehdorn, S., Hotho, A.: Machine learning and ontologies. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies, 2nd edn. Springer, Berlin. Chap. 31, in Germany
11. Chesbrough, H., Spohrer, J.: A research manifesto for services science. Commun. ACM 49(7), 35–40 (2006)
12. Clarke, D.E., Elien, J.-E., Ellison, C.M., Fredette, M., Morcos, A., Rivest, R.L.: Certificate chain discovery in SPKI/SDSI. J. Comput. Secur. 9, 285–322 (2001)
13. Constantinescu, I., Binder, W., Faltings, B.: Flexible and efficient matchmaking and ranking in service directories. In: ICWS '05: Proceedings of the IEEE International Conference on Web Services, Washington, DC, USA, pp. 5–12. IEEE Comput. Soc., Los Alamitos (2005)
14. de Bruijn, J., Fensel, D., Kerrigan, M., Keller, U., Lausen, H., Scicluna, J.: The web service modeling language. In: Modeling Semantic Web Services. Springer, Berlin (2008)
15. Ehrig, M., Staab, S., Sure, Y.: Bootstrapping ontology alignment methods with APFEL. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) Proceedings of the 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6–10, 2005. Lecture Notes in Computer Science, vol. 3729, pp. 186–200. Springer, Berlin (2005)
16. Ellison, C.M., Frantz, B., Lampson, B., Rivest, R.L., Thomas, B.M., Ylonen, T.: Simple public key certificate. http://world.std.com/~cme/html/spki.html (1999)
17. Ellison, C.M., Frantz, B., Lampson, B., Rivest, R.L., Thomas, B.M., Ylonen, T.: SPKI certificate theory. Internet RFC 2693 (1999)
18. Farrell, J., Lausen, H.: Semantic annotations for WSDL and XML schema. W3C recommendation, W3C. Published online on August 28th, 2007 at http://www.w3.org/TR/sawsdl/ (2007)
19. Fitzsimmons, J.A., Fitzsimmons, M.J.: Service Management: Operations, Strategy, Information Technology. McGraw-Hill, New York (2007)
20. Franch, X., Botella, P.: Putting non-functional requirements into software architecture. In: IWSSD '98: Proceedings of the 9th International Workshop on Software Specification and Design, Washington, DC, USA, p. 60. IEEE Comput. Soc., Los Alamitos (1998)
21. Global Grid Forum: Grid resource allocation agreement protocol. Web Services Specification (2006)
22. Governatori, G.: Representing business contracts in RuleML. Int. J. Coop. Inf. Syst. 14, 181–216 (2005)
23. Grimm, S., Lamparter, S., Abecker, A., Agarwal, S., Eberhart, A.: Ontology based specification of web service policies. In: INFORMATIK 2004—Proceedings of Semantic Web Services and Dynamic Networks. Lecture Notes in Informatics, vol. 51, pp. 579–583 (2004)
24. Grosof, B., Poon, T.C.: SweetDeal: representing agent contracts with exceptions using XML rules, ontologies, and process descriptions. In: Proceedings of the 12th World Wide Web Conference, Budapest, Hungary, pp. 340–349 (2003)
25. Hill, T.: On goods and services. Rev. Income Wealth 23(4), 315–338 (1977)
26. Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: the making of a web ontology language. J. Web Semant. 1(1) (2003)
27. Hoxha, J., Agarwal, S.: Semi-automatic mining of semantic descriptions of processes in the web. In: IEEE/WIC/ACM International Conference on Web Intelligence, Toronto, Canada. IEEE Press, New York (2010)
28. Hoxha, J., Scheuermann, A., Bloehdorn, S.: An approach to formal and semantic representation of logistics services. In: Proceedings of the Workshop on Artificial Intelligence and Logistics (AILog) at the 19th European Conference on Artificial Intelligence (ECAI 2010) (2010)
29. Huhns, M., Singh, M.: Service-oriented computing: key concepts and principles. IEEE Internet Comput. 9(1), 75–81 (2005)
30. IBM Corporation: Enterprise privacy authorization language (EPAL 1.2). Available from http://www.w3.org/Submission/EPAL. W3C Member Submission (2003)
31. Jetter, M., Satzger, G., Neus, A.: Technological innovation and its impact on business model, organization and corporate culture—IBM's transformation into a globally integrated, service-oriented enterprise. Bus. Inf. Syst. Eng. 1 (2009)
32. Junghans, M., Agarwal, S.: Web service discovery based on unified view on functional and non-functional properties. In: Fourth IEEE International Conference on Semantic Computing, ICSC 2010, Pittsburgh, PA, USA. IEEE Press, New York
33. Junghans, M., Agarwal, S., Studer, R.: Towards practical semantic web service discovery. In: The Semantic Web: Research and Applications. Proceedings of the 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30–June 3, 2010. Lecture Notes in Computer Science. Springer, Berlin (2010)
34. Kagal, L.: A policy-based approach to governing autonomous behavior in distributed environments. PhD thesis, University of Maryland Baltimore County, Baltimore, MD 21250 (2004)
35. Klusch, M., Kapahnke, P., Zinnikus, I.: Hybrid adaptive web service selection with SAWSDL-MX and WSDL-analyzer. In: The Semantic Web: Research and Applications, pp. 550–564 (2009)
36. Kolovski, V., Parsia, B.: WS-policy and beyond: application of OWL defaults to web service policies. In: 2nd International Semantic Web Policy Workshop (SWPW'06). Workshop at the 5th International Semantic Web Conference (ISWC) (2006)
37. Kolovski, V., Parsia, B., Katz, Y., Hendler, J.A.: Representing web service policies in OWL-DL. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) The Semantic Web—ISWC 2005, 4th International Semantic Web Conference (ISWC 2005). Lecture Notes in Computer Science, vol. 3729, pp. 461–475. Springer, Berlin (2005)
236
S. Agarwal et al.
38. Lamparter, S.: Policy-based contracting in semantic web service markets. PhD thesis, University of Karlsruhe, Germany (2007) 39. Lamparter, S., Ankolekar, A., Grimm, S., Studer, R.: Preference-based selection of highly configurable web services. In: Proc. of the 16th Int. World Wide Web Conference (WWW’07), Banff, Canada, pp. 1013–1022 (2007) 40. Lamparter, S., Ankolekar, A., Oberle, D., Studer, R., Weinhardt, C.: Semantic specification and evaluation of bids in web-based markets. Electron. Commer. Res. Appl. 8(1) (2009) 41. Lamparter, S., Luckner, S., Mutschler, S.: Semi-automated management of web service contracts. Int. J. Serv. Sci. 1(3/4) (2008) 42. Lamparter, S., Schnizler, B.: Trading services in ontology-driven markets. In: SAC ’06: Proceedings of the 2006 ACM Symposium on Applied Computing, Dijon, France, pp. 1679–1683. ACM, New York (2006) 43. Li, L., Horrocks, I.: A software framework for matchmaking based on semantic web technology. Int. J. Electron. Commer. 8(4), 39–60 (2004) 44. Milner, R., Parrow, J., Walker, D.: A calculus of mobile processes, Parts I+II. J. Inf. Comput., 1–87 (1992) 45. Milosevic, Z., Governatori, G.: Special issue on contract architectures and languages—guest editors’ introduction. Int. J. Coop. Inf. Syst. 14(2–3), 73–76 (2005) 46. Moses, T., Anderson, A., Proctor, S., Godik, S.: XACML profile for web services. Oasis Working Draft (2003) 47. Nisan, N.: Bidding and allocation in combinatorial auctions. In: Proceedings of the 2nd ACM Conference on Electronic Commerce (EC’00), pp. 1–12. ACM, New York (2000) 48. Oldham, N., Verma, K., Sheth, A., Hakimpour, F.: Semantic WS-agreement partner selection. In: WWW ’06: Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland, pp. 697–706. ACM, New York (2006) 49. Papazoglou, M., van den Heuvel, W.-J.: Service oriented architectures: approaches, technologies and research issue. VLDB J. 16(3), 389–415 (2007) 50. 
Papazoglou, M.P., Traverso, P., Dustdar, S., Leymann, F.: Service-oriented computing: state of the art and research challenges. IEEE Comput. 40(11), 38–45 (2007) 51. Paschke, A., Bichler, M., Dietrich, J.: ContractLog: an approach to rule based monitoring and execution of service level agreements. In: International Conference on Rules and Rule Markup Languages for the Semantic Web (RuleML 2005), Galway, Ireland (2005) 52. Petri, C.A.: Communication with automata. CriBiss Air Force Base, RADC-TR-65-377 1(1) (1966) 53. Schnizler, B., Neumann, D., Veit, D., Weinhardt, C.: Trading grid services—a multi-attribute combinatorial approach. Eur. J. Oper. Res. (2006) 54. Scholten, U., Fischer, R., Zirpins, C.: Perspectives for web service intermediaries: how quality makes the difference. In: Proceedings of the 10th International Conference on Electronic Commerce and Web Technologies (EC-Web 09), Linz, Austria. Lecture Notes in Computer Science, vol. 5692, pp. 145–156. Springer, Berlin (2009) 55. Spohrer, J., Maglio, P.P., Bailey, J., Gruhl, D.: Steps toward a science of service systems. Computer 40(1), 71–77 (2007) 56. Staab, S., Studer, R. (eds.) Handbook on Ontologies, 2nd edn. International Handbooks on Information Systems. Springer, Berlin (2009) 57. Stollberg, M., Hepp, M., Hoffmann, J.: A caching mechanism for semantic web service discovery. In: Aberer, K., et al. (eds.) The Semantic Web. 6th Int. Semantic Web Conf., Busan, Korea. Lecture Notes in Computer Science, vol. 4825, pp. 480–493. Springer, Berlin (2007) 58. Sycara, K., Paolucci, M., Ankolekar, A., Srinivasan, N.: Automated discovery, interaction and composition of semantic web services. J. Web Semant. 1(1) (2003) 59. Sycara, K., Paolucci, M., Ankolekar, A., Srinivasan, N.: Automated discovery, interaction and composition of semantic web services. J. Web Semant. 1, 27–46 (2003) 60. Tai, S., Lamparter, S.: Modeling services—an inter-disciplinary perspective. In: Model-Based Software and Data Integration. 
Communications in Computer and Information Science, vol. 8, pp. 8–11. Springer, Berlin (2008)
Intelligent Service Management—Technologies and Perspectives
237
61. Tosic, V.: Service offerings for XML web services and their management applications. PhD thesis, Department of Systems and Computer Engineering, Carleton University, Canada (2005) 62. Uszok, A., Bradshaw, J.M., Johnson, M., Jeffers, R., Tate, A., Dalton, J., Aitken, S.: KAoS policy management for semantic web services. IEEE Intell. Syst. 19(4), 32–41 (2004) 63. van der Aalst, W.M.P., Günther, C.W.: Finding structure in unstructured processes: the case for process mining. In: Basten, T., Juhás, G., Shukla, S.K. (eds.) Seventh International Conference on Application of Concurrency to System Design (ACSD 2007), Bratislava, Slovak Republic, 10–13 July 2007, pp. 3–12. IEEE Comput. Soc., Washington (2007) 64. van Dinther, C., Blau, B., Conte, T., Weinhardt, C.: Designing auctions for coordination in service networks. In: Research and Innovations (SSRI) in the Service Economy. The Advancement of Service Systems. Springer, Berlin (2010). ISSN:1865-4924 65. W3C: Web services policy framework 1.5 (WS-policy). http://www.w3.org/2002/ws/policy/ (2006) 66. Wetzstein, B., Leitner, P., Rosenberg, F., Brandic, I., Dustdar, S., Leymann, F.: Monitoring and analyzing influential factors of business process performance. In: Proceedings of the 2009 IEEE International Enterprise Distributed Object Computing Conference (EDOC 2009), pp. 141–150. IEEE Comput. Soc., Washington (2009) 67. Wetzstein, B., Danylevych, O., Leymann, F., Bitsaki, M., Nikolaou, C., van den Heuvel, W.-J., Papazoglou, M.: Towards monitoring of key performance indicators across partners in service networks. In: International Workshop Series on Monitoring, Adaptation and Beyond (MONA+), collocated with ICSOC/ServiceWave 2009, Stockholm, Sweden. Springer, Berlin (2009) 68. Wetzstein, B., Karastoyanova, D., Kopp, O., Leymann, F., Zwink, D.: Cross-organizational process monitoring based on service choreographies. In: Proceedings of the 2010 ACM Symposium on Applied Computing (SAC 2010), Sierre, Switzerland, pp. 
2485–2490. ACM, New York (2010)
Semantic Technologies and Cloud Computing
Andreas Eberhart, Peter Haase, Daniel Oberle, and Valentin Zacharias
Abstract Cloud computing has become a generic umbrella term for the flexible delivery of IT resources—such as storage, computing power, software development platforms, and applications—as services over the Internet. The foremost innovation is that the IT infrastructure no longer lies with the user, breaking up the previously monolithic ownership and administrative control of its assets. The combination of cloud computing and semantic technologies holds great potential. In this chapter, we analyze three ways in which cloud computing and semantic technologies can be combined: (1) building on cloud computing technologies, e.g. from the area of distributed computing, to realize better semantic applications and enable semantic technologies to scale to ever larger data sets, (2) delivering semantic technologies as services in the cloud, and (3) using semantic technologies to improve cloud computing, in particular to further improve automatic data-center management. For each of these dimensions we identify challenges and opportunities, provide a survey, and present a research roadmap.
1 Introduction
Cloud computing has become a trend in IT that moves computing and data away from desktop and portable PCs into larger data centers. It refers to applications delivered over the Internet as well as to the actual cloud infrastructure, namely the hardware and systems software in data centers that provide these services [3]. More specifically, cloud computing provides information technology capabilities as a service, i.e., users can access such services remotely via the Internet without knowledge of, expertise in, or control over the physical hardware that supports them. The services are elastically scalable and charged according to a utility pricing model. Elastic scalability means that the service can quickly adapt to a user’s growing or shrinking requirements. Utility or pay-per-use pricing means that the user only pays for what he or she actually uses (and not for spare capacity that might be needed at some point) [21].
A. Eberhart, Fluid Operations, 69190 Walldorf, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_13, © Springer-Verlag Berlin Heidelberg 2011
A. Eberhart et al.
Fig. 1 The cloud architecture reference stack according to [9]
Originally the focus was on the software as a service layer provided by application service providers. A variety of other cloud services has appeared in the meantime, which can be classified into what is frequently called the “as a service” reference stack [9]. This stack classifies cloud services into several layers, moving from hardware-oriented offerings to full-fledged software solutions and even “human as a service.” Below we explain the layers, viz., Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and Human as a Service (HaaS). As shown in Fig. 1, the layers are accompanied by business support (such as billing or monitoring) and administration features (such as deployment or configuration).
• Infrastructure as a Service (IaaS): At the bottom of the stack lies the basic infrastructure, such as CPU power, storage, or network transport. It represents a commoditization of the data center and operations. At this layer we also find services such as security, identity management, authentication, and e-commerce. Consider the Amazon S3 services for data storage as an example.
• Platform as a Service (PaaS): A platform for creating software applications that will be built and run in the cloud using the software as a service architecture. This level can be seen as a general-purpose application development environment. Examples include the Google App Engine or the SalesForce AppExchange. This level provides a venue for building applications with a series of general-purpose APIs and running these applications on cloud infrastructure.
• Software as a Service (SaaS) or Application as a Service: Software applications hosted in the cloud and delivered as services. This is the domain of SaaS companies in areas such as customer relationship management (CRM) or supply chain management (SCM). For example, SalesForce.com and SAP’s Business by Design have created applications that unite various services under their control into one application environment.
• Human as a Service (HaaS): This highest layer relies on massive-scale aggregation and extraction of information from crowds of people. Each individual in the crowd may use whatever technology or tools are required to solve the task. Examples of this layer are Amazon’s Mechanical Turk or “newsworthy” video streams by YouTube.
Note that for the remainder of this chapter, the Human as a Service layer is of minor interest.
The cloud computing paradigm incorporates or overlaps with other technological trends, most notably from the areas of (1) distributed computing, to enable scalability, (2) multi-tenancy software architecture, to economically serve a large number of smaller customers, and (3) virtualization and data center automation, to effectively manage the large data centers that are needed for successful cloud services. The common theme is reliance upon the Internet to satisfy the computing needs of the users. In this chapter, we discuss the mutual benefits arising from the combination of cloud computing and semantic technologies. We can differentiate between three ways in which these technologies can be combined:
1. Cloud computing for semantic technologies (Sect. 2): The first manner of combining both technologies focuses on how semantic technologies are improved by building on cloud computing technologies. This includes not only the use of cloud services to build semantic applications, but also techniques from the area of distributed computing to enable semantic technologies to scale to ever larger data sets.
2. Semantic technologies offered as cloud services (Sect. 3): Here, the idea is to offer the functionality of semantic technology platforms or entire semantic applications as a service in the cloud.
3. Better clouds through semantic technologies (Sect. 4): Finally, we argue that semantic technologies can be used across the whole cloud stack to improve the management and maintenance of clouds. The major use case here is to improve automatic data center management through semantic information integration or through increased automation based on reasoning over policies and service level agreements.
In the following, we discuss each of these aspects in more detail in Sects. 2 to 4. It is the goal of this chapter to (a) identify challenges and opportunities, (b) give a survey, and (c) present a research roadmap for each of the three aspects. Finally, we give a conclusion in Sect. 5.
2 Cloud Computing for Semantic Technologies
This section is concerned with the question of how cloud computing techniques and cloud services can be used for the improvement of semantic technologies.
2.1 Challenges and Opportunities
We examine two different ways in which cloud computing can be harnessed for semantic technologies: (1) the application of cloud concepts and technologies for semantic technologies and (2) the use of existing cloud services to directly improve applications of semantic technologies.
The use of cloud computing technologies and services for semantic technologies promises improvements in areas such as scalability to large data sets, elasticity in response to changing usage profiles, and systems that are more reliable and easier to build and maintain [6]. To realize these potential benefits, many challenges need to be overcome, such as the creation of new algorithms that realize semantic applications in a distributed way and of techniques for building semantic applications on top of remote cloud services.
2.2 Survey
We will first examine the use of cloud technologies for the improvement of semantic applications. Looking at cloud computing from a technological point of view, the most salient properties are the use of virtualization to hide the complexity of heterogeneous hardware resources and the use of distributed, shared-nothing architectures to effectively harness large clusters of computers. From these properties, a large number of cloud technologies and applications has emerged, mostly in the area of databases and frameworks for the analysis and transformation of large amounts of data.
The most prominent developments in this direction build on the MapReduce [2] programming model for massively parallel processing of semantic data. The MapReduce programming model allows for the quick creation and deployment of parallel and distributed applications that process vast amounts of data on a very large cluster of commodity machines. Developed by Google, MapReduce has been applied to a wide range of tasks including sorting, construction of inverted indexes, and machine learning. The strength of MapReduce lies in its ability to scale to a very large number of machines, its ease of use, its ability to deal with unstructured data without preprocessing, and its resilience to failures. On the other hand, it is outperformed by distributed databases for many tasks where these properties are not needed [16]. Since its internal development at Google, MapReduce has been implemented many times, and robust, open source implementations such as Hadoop (http://hadoop.apache.org) are readily available.
Applied to semantic technologies, the most prominent use of MapReduce is the massively parallel computation and materialization of all reasoning consequences for large knowledge bases [19, 20]. Building on Hadoop, this system broke previous records with respect to the size of the knowledge base and the number of derivations per second. In earlier work, Newman et al. also used MapReduce for querying, processing, and reasoning over RDF data [12, 13]. An even more elaborate use of cloud technologies was described by Mika et al. in [11]. In addition to MapReduce, these systems also build on the distributed cloud database HBase (http://hadoop.apache.org/hbase/) and on Pig (http://hadoop.apache.org/pig/) [4, 14], a platform for the massively parallel execution of data analysis tasks described in a high-level language. Mika et al. used these building blocks both to realize query processing over large amounts of RDF data and to implement the indexing pipeline of the semantic search engine Sindice [15].
Another application of MapReduce to semantic technologies is its use for the parallel semantic analysis of documents [8]. Through this method, semantic annotation tools can be applied more quickly across bigger sets of documents. Examples of such analysis are the identification of concept instantiations (instance-of relationships) from ontologies in text and the automatic population of ontologies from text.
A second way to combine semantic technologies and cloud computing is the use of existing cloud services. Today a large number of varied cloud services is already available, realized using the cloud technologies described above (and others). Hence, instead of using these cloud technologies directly, semantic applications can build on services that are built using them. On the one hand, this usually reduces performance, because integration between the different components has to be looser. On the other hand, the developer of the semantic application benefits from reduced complexity (e.g., he or she does not need to set up a distributed database), elastic scalability, and utility pricing for these services.
One use case is the use of external compute clouds offered as a service (such as Amazon’s EC2) to quickly and cheaply tackle load peaks in semantic applications, for example to rent 100 (virtual) computers to do a semantic analysis of millions of documents in hours instead of weeks, or to use cloud resources for reasoning materialization, i.e., resources that only need to be paid for when a large amount of new data needs to be processed. Note that these can be the same applications described above; only the actual execution does not happen in a company’s data center, but on external computing resources offered as a service.
Another intriguing possibility is the use of a “database as a service,” a specific kind of IaaS offering. One of the oldest and best established of these is SimpleDB by Amazon. This service offers access to a distributed, scalable, and redundant database management system with a utility pricing model. Unlike the simpler S3 service by Amazon, SimpleDB offers a query language modeled after (a very simple form of) SQL. For the user, SimpleDB holds the promise of not having to worry about the problems of a large-scale database deployment: redundancy, backups, and scaling to large data sets and many concurrent database clients. Stein et al. [17] have shown that using such a database as a back-end for a triple store can be even faster in handling parallel queries than state-of-the-art triple stores. However, because of the simplicity of the query language currently supported by SimpleDB, this only worked for relatively simple SPARQL queries. For more complex queries, too much processing must be done by the clients, resulting in dismal query execution times.
Finally, distributed file systems and even the MapReduce framework are offered as cloud computing services (e.g., Amazon’s S3 and Elastic MapReduce, respectively). Again, these services promise to shield the user from the cost and complexity of running these systems, possibly at the expense of slightly lower performance and possibly higher cost in the long run.
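To make the idea of MapReduce-based materialization more concrete, the following is a minimal, single-process sketch in Python. It is not the algorithm of [19, 20] and does not use Hadoop; it merely imitates the map, shuffle, and reduce phases in memory for one RDFS-style rule (an instance of a class is also an instance of its superclasses) and repeats the job until nothing new is derived. All function names and the example prefixes are assumptions made for illustration.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_phase(triples):
    """Emit (class, tagged value) pairs so that the class acts as join key."""
    for s, p, o in triples:
        if p == RDF_TYPE:
            yield o, ("instance", s)      # s is an instance of class o
        elif p == SUBCLASS:
            yield s, ("superclass", o)    # class s has superclass o


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Join the instances and superclasses that share one class key."""
    instances = [v for tag, v in values if tag == "instance"]
    supers = [v for tag, v in values if tag == "superclass"]
    for inst in instances:
        for sup in supers:
            yield (inst, RDF_TYPE, sup)


def materialize(triples):
    """Repeat the job until a fixpoint is reached (toy, in-memory only)."""
    known = set(triples)
    while True:
        derived = set()
        for values in shuffle(map_phase(known)).values():
            derived.update(reduce_phase(values))
        if derived <= known:
            return known
        known |= derived
```

In a real deployment, each reduce call could run on a different machine because it only needs the values for its own key; the repeated jobs until a fixpoint hint at why efficient distributed reasoning is harder than a single pass over the data.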
2.3 Roadmap
Despite remarkable initial success, the use of cloud technologies for semantics is still in its infancy. Most applications, in particular those concerned with reasoning, are still research prototypes that do not support essential functionality such as updates, the full semantics of Semantic Web standards such as OWL DL (only OWL Horst [18] is implemented), or essential reasoning functions like query answering and satisfiability checking. There is also the fundamental issue that many existing cloud technologies achieve their remarkable scalability by supporting only very limited functionality (when compared to traditional databases, for instance), while semantic technologies strive to support functionality that in many aspects goes even beyond what relational databases offer. It remains to be seen to what extent the “scalability through simplicity” approaches can be transferred to semantic technologies and to what extent semantic technologies must develop separate distributed, shared-nothing approaches that achieve elastic scalability.
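The tension between simple, scalable storage and expressive semantic queries can be illustrated with a small Python sketch. The store below is hypothetical and deliberately minimal; it is not the SimpleDB API, but it mimics a service that only supports fetching an item by name and selecting items by a single attribute value. A single triple pattern maps to one request, whereas a join of two patterns has to be evaluated on the client with one additional request per intermediate binding.

```python
from collections import defaultdict


class SimpleStore:
    """Hypothetical store: named items with multi-valued attributes."""

    def __init__(self):
        self._items = defaultdict(lambda: defaultdict(set))
        self.requests = 0  # number of round trips made to the store

    def put(self, item, attribute, value):
        self._items[item][attribute].add(value)

    def get(self, item):
        """Fetch all attributes of one item (one request)."""
        self.requests += 1
        return {a: set(v) for a, v in self._items.get(item, {}).items()}

    def select(self, attribute, value):
        """Names of all items where attribute has the given value (one request)."""
        self.requests += 1
        return sorted(
            name for name, attrs in self._items.items()
            if value in attrs.get(attribute, ())
        )


def load(store, triples):
    """Store each triple as item=subject, attribute=predicate, value=object."""
    for s, p, o in triples:
        store.put(s, p, o)


def subjects(store, predicate, obj):
    """Pattern (?s predicate obj): answered by the store in one request."""
    return store.select(predicate, obj)


def join(store, p1, o1, p2):
    """Patterns (?s p1 o1) . (?s p2 ?x): the join is done by the client."""
    results = []
    for s in store.select(p1, o1):                    # one request
        for x in sorted(store.get(s).get(p2, ())):    # one request per binding
            results.append((s, x))
    return results
```

The `requests` counter makes the cost visible: the number of round trips for the join grows with the number of intermediate results, which is one plausible reading of why complex queries over such back-ends become slow.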
3 Semantic Technologies as Cloud Services
In this section, we explore the possibility of offering the functionality of semantic technology platforms or entire semantic applications as a service in the cloud.
3.1 Challenges and Opportunities
According to this idea, users and developers can exploit the power of semantic technologies to make sense of heterogeneous data at large scale without the need for their own infrastructure, benefitting from the elastic scalability, flexibility, and ease of deployment and maintenance in the cloud. In this context, cloud services present a viable alternative to in-house solutions. With a significant reduction of costs on the one hand and ease of use on the other, semantic technologies can thus potentially reach a much larger customer base. The opportunities for the proliferation of these technologies are immense. Generally, we can distinguish two kinds of cloud services for semantic technologies:
1. Horizontal cloud services provide platform functionality enabling tenants to build semantic applications or new services in the cloud. These services are located on the PaaS layer.
2. Vertical cloud services provide functionality that is useful to tenants as an application in itself. These semantic applications run in the cloud, provide a direct service to the customer, and are located on the SaaS layer.
Platform functionality (PaaS) for semantic technologies may include all base services needed for building semantic applications. In particular, it can include the following:
• Semantic data management as a service covers repository functionality, including the ability to store, retrieve, query, and explore semantic data.
• Semantic search as a service realizes advanced ways of searching data based on understanding the meaning of queries.
• Data integration as a service enables connecting heterogeneous data sources in automated ways.
• NLP as a service targets the use of natural language processing techniques to generate structure from unstructured data, e.g., to automatically generate metadata from text.
Potential uses of vertical cloud services are just as manifold and diverse as applications of semantic technologies themselves. Examples include semantically enabled portals, business intelligence applications, and the like. In the context of semantic applications in the enterprise, there is a huge potential in linking enterprise data with public data via the principles of linked data [1]. The most visible example of adoption and application of the linked data principles has been the linking open data project, as part of which a continuously growing amount of data has been published, covering diverse domains such as people, companies, publications, popular culture and online communities, the life sciences (genes), governmental and statistical data, and many more. Making this cloud of data available to business intelligence solutions for quick and effective decision making is now more than ever one of the core enablers of corporate growth, productivity, and a sustainable competitive advantage.
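The potential of linking enterprise data with public data can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not a description of any particular product: graphs are plain sets of (subject, predicate, object) tuples, and all identifiers (the corp: and pub: prefixes, the order, the company) are invented. Because both data sets use URIs as identifiers, merging them is a set union, and a question that neither data set can answer alone becomes a simple path traversal.

```python
# Hypothetical in-house data; it links its customer to a public identifier.
ENTERPRISE = {
    ("corp:order42", "corp:customer", "corp:acme"),
    ("corp:acme", "owl:sameAs", "pub:Acme_Corp"),
}

# Hypothetical public linked data about the same company.
PUBLIC = {
    ("pub:Acme_Corp", "pub:headquarters", "pub:Berlin"),
    ("pub:Berlin", "pub:country", "pub:Germany"),
}


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def follow(graph, start, path):
    """Follow a sequence of predicates from start; return the nodes reached."""
    frontier = {start}
    for predicate in path:
        frontier = {o for node in frontier for o in objects(graph, node, predicate)}
    return frontier


# "In which country is the customer of order 42 based?"
COUNTRY_OF_CUSTOMER = ["corp:customer", "owl:sameAs", "pub:headquarters", "pub:country"]
```

Evaluating `follow(ENTERPRISE | PUBLIC, "corp:order42", COUNTRY_OF_CUSTOMER)` answers the question on the merged graph, while the same call on `ENTERPRISE` alone returns an empty set. The explicit owl:sameAs step is a simplification; a reasoner would normally handle identity links.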
3.2 Survey
Recently, several offerings providing various forms of platform functionality have appeared. Some of them are offered as hosted services, others are also made available as a virtual appliance to be run in the cloud. As a simple and prominent example, RDF repositories are offered as a service. For example, the well-known Virtuoso RDF database is packaged and made available as a virtual machine image for Amazon EC2, distributed as Virtuoso Cloud Edition (http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtInstallationEC2). An example of a more comprehensive platform for semantic applications is the OpenLink Data Spaces (ODS) offering (http://ods.openlinksw.com/). Atop OpenLink Virtuoso, ODS is a distributed collaborative web application platform that primarily targets the creation of points of presence on the web for exposing, exchanging, and creating data. ODS has also been made available to run in Amazon’s EC2 cloud. An example of a hosted platform for data sharing is the Talis platform (http://www.talis.com/platform/). Supporting data publishers and developers, the platform provides hosted storage for
both structured and unstructured data, as well as interfaces for querying, exploring, and manipulating data. In this way, application developers can use the Talis platform as a cloud-based data repository. OpenCalais (http://www.opencalais.com/) offers a platform service that automatically creates rich semantic metadata for submitted content using NLP, machine learning, and other methods.
Commercial offerings in the area of vertical semantic cloud services are still rare. This is, however, quite different for their “non-semantic” counterparts: business intelligence in the cloud is almost a commodity. For instance, companies such as GoodData (http://www.gooddata.com/) successfully offer business intelligence as a service, with collaboration support, ad hoc analysis, dashboarding, and reporting functionality. Clear limitations of these offerings are the restriction to closed data sets and the inability to aggregate heterogeneous data sets and to analyze the data semantically. The ability to provide ad hoc analysis in a semantic way, making sense of heterogeneous data at a large scale, is clearly desirable, but unrealized thus far.
So far, the linked data principles [1] provide recommended best practices for exposing, sharing, and connecting pieces of information on the semantic web using URIs and RDF. However, these practices do not yet address the question of how the data is actually managed and provisioned in an application. The management of data in today’s linked open data cloud is far from the ideas of cloud computing, in which resources are provisioned in a dynamically scalable and virtualized manner as a service over the Internet. Instead, data sets are published in the form of RDF dumps, or are, at best, made accessible via a SPARQL endpoint, which provides a comparably limited interface for querying. In any case, significant manual deployment effort is required to make use of linked data sets in a semantic application.
We therefore argue for a dynamically scalable and virtualized data layer for the semantic web, complementing the current best practices for publishing data with practical means to provision and deploy the data transparently in semantic applications. What will need to be achieved is the ability to transparently provision data sources as virtual appliances: such virtualized data sources will support the identification and composition of relevant data sets in an ad hoc way, abstracting the applications from the specific setup of the physical data sets, e.g., whether they are local or remote, centralized or distributed, etc.
3.3 Roadmap
A clear research challenge in realizing such semantic applications in the cloud is the management of data as a service, including the ability to provision data sets in a virtualized way, as is already done with software today. Key to realizing the full potential of semantic applications in the cloud will be the ability to integrate private, enterprise data in on-premise applications and databases with public data in the cloud. What is thus needed is the concept of “data as a service” applied to the semantic web, in particular to the countless data sources published using linked data principles. Based on the notions of semantic software as a service and virtualized data sources, we envision the concept of a semantic web appliance, in which both software and semantic data are provisioned in a completely transparent way.
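One way to picture a virtualized data source is as a common interface behind which the physical setup of the data is hidden. The following Python sketch is purely illustrative: the class names and the `match` method are assumptions, and the "remote" source would in practice wrap, for example, a SPARQL endpoint instead of an in-memory list. The application code only depends on the interface and therefore does not change when sources are added, removed, or moved.

```python
class InMemorySource:
    """A data source holding triples locally (also used here as a stand-in
    for a remote source, which would need network access in practice)."""

    def __init__(self, triples):
        self._triples = list(triples)

    def match(self, pattern):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self._triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


class VirtualSource:
    """Composes several sources behind the same match() interface."""

    def __init__(self, *sources):
        self._sources = sources

    def match(self, pattern):
        seen = set()
        for source in self._sources:
            for triple in source.match(pattern):
                if triple not in seen:   # hide duplicates across sources
                    seen.add(triple)
                    yield triple


def classes_of(source, subject):
    """Application code: works with any object that offers match()."""
    return sorted(o for _, _, o in source.match((subject, "rdf:type", None)))
```

Calling `classes_of` with a single `InMemorySource` or with a `VirtualSource` composed of several sources requires no change to the application, which is the kind of transparency the text argues for. Provisioning, caching, and access control are left out of this sketch.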
4 Better Clouds Through Semantic Technologies
This section makes the case that semantic technologies can be used across the whole cloud stack to improve the management and maintenance of clouds.
4.1 Challenges and Opportunities
The emergence of cloud offerings such as Amazon AWS or SalesForce.com demonstrates that the vision of a fully automated data center is feasible. Recent advances in the area of virtualization make it possible to deploy servers, establish network connections, and allocate disk space virtually via an API rather than having to employ administrators who physically carry out these jobs. Ideally, the only manual jobs left involve the replacement of broken disks or servers and the connection of new containers of storage and compute appliances to expand the data center’s capacity. Note that virtualization is not limited to CPU virtualization via a hypervisor such as Xen or VMware ESX. Virtualization can be defined as an abstraction layer between a consumer and a resource that allows the resource to be used in a more flexible way. Examples can be drawn from the entire IT stack. Storage Area Networks (SANs) virtualize mass storage resources, VLAN technology allows for the use of a single physical cable for multiple logical networks, hypervisors can run virtual machines by presenting a virtual hardware interface to the guest operating system, and remote desktop software such as VNC virtualizes the screen display by redrawing it on a remote display.
Public cloud offerings put a lot of pressure on CIOs, who have to explain why provisioning a server takes weeks in house, while anybody with a credit card can achieve this goal in 15 minutes on a public IaaS portal. Obviously, there are good reasons for keeping certain systems and applications in house. Cloud vendor lock-in, security concerns, and insufficient service level guarantees are cited most frequently. Nevertheless, cutting IT costs remains a top priority. The pressure is often alleviated by a simple name change: the local data center becomes the private cloud. Obviously, the local data center only deserves this new name if:
1. processes such as systems lifecycle management or troubleshooting are automated on a large scale,
2. end users are empowered to book IT services via a self-service portal, and
3. IT cost becomes transparent to users and is no longer financed by a company-wide IT “tax.”
In the following we define some requirements for achieving this task and look at how semantic technologies can help to meet these requirements.
4.1.1 Data and Service Integration Clearly, being able to automate data center operations via low level APIs is a prerequisite for achieving the requirements listed above. The challenge lies in the proper integration of data received from infrastructure components and the orchestration of subsequent actions as a response to events such as user requests or alarms. As we mentioned previously in the introduction and definition of the term virtualization, many layers play a role in this picture and one is faced with a large set of provider APIs ranging from storage to application levels. The situation grows even more complex when products and solutions from different vendors are found in the data center. In most cases, CIOs do not have the luxury of being able to start from scratch with a unified set of hardware and technology selected for the given tasks at hand. They typically face a mix of technologies acquired over several years, sometimes through company mergers. Products from different vendors and sometimes even different product versions differ vastly in syntax and semantics of the data supplied and functionality offered via APIs. Industry standards such as SMI-S (Storage Management Initiative-Specification) or OVF (Open Virtualization Format) try to address these problems but are usually limited to a specialized sub-domain or hampered by poor vendor support. Semantic technologies have been designed for these real-world situations. RDF can serve as a data format and model for integrating these semantically heterogeneous information sources in order to get a complete picture across the entire data center, both horizontally—across product versions and different vendors—and vertically—across storage, compute units, network, operating systems, and applications. Research in semantic web services [10] can help to define a meta layer allowing the orchestration of data center tasks on a higher level of abstraction.
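The integration idea above can be illustrated with a minimal sketch. It assumes two hypothetical vendor APIs that describe storage volumes with different field names and units, and maps both onto subject, predicate, object triples in the spirit of RDF. The record layouts and the `dc:` vocabulary are invented for illustration; they are not taken from SMI-S, OVF, or any product API, and plain Python tuples stand in for a real RDF store.

```python
from typing import Dict, List, Tuple

# A triple in the spirit of RDF: (subject, predicate, object).
Triple = Tuple[str, str, object]


def from_vendor_a(record: Dict) -> List[Triple]:
    """Hypothetical vendor A reports capacity in megabytes under 'sizeMB'."""
    subject = f"dc:volume/{record['id']}"
    return [
        (subject, "rdf:type", "dc:StorageVolume"),
        (subject, "dc:capacityGB", record["sizeMB"] / 1024),
        (subject, "dc:hostedOn", f"dc:array/{record['array']}"),
    ]


def from_vendor_b(record: Dict) -> List[Triple]:
    """Hypothetical vendor B reports capacity in gigabytes as a string."""
    subject = f"dc:volume/{record['volume_name']}"
    return [
        (subject, "rdf:type", "dc:StorageVolume"),
        (subject, "dc:capacityGB", float(record["capacity_gb"])),
        (subject, "dc:hostedOn", f"dc:array/{record['storage_system']}"),
    ]


def total_capacity_gb(triples: List[Triple]) -> float:
    """Query the merged graph without caring which vendor supplied a triple."""
    return sum(float(o) for _, p, o in triples if p == "dc:capacityGB")


# Merge both sources into one graph (here simply a list of triples).
graph: List[Triple] = (
    from_vendor_a({"id": "vol-1", "sizeMB": 2048, "array": "a1"})
    + from_vendor_b({"volume_name": "vol-2", "capacity_gb": "5", "storage_system": "b7"})
)
print(total_capacity_gb(graph))
```

Once both sources use the same predicates, a query such as the capacity total no longer depends on vendor-specific syntax, which is the horizontal integration described above.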
4.1.2 Documentation and Annotation Data and service integration are key aspects of running data centers and clouds in an efficient and cost effective way. For this purpose, cloud management software is fed with data from provider APIs. This data contains technical information about the infrastructure and the software running on it. Examples range from available storage space, network throughput, server models, and IP addresses to application server configurations. In order to make a complete picture available, organizational and business aspects need to be added to the technical data. Consider the following examples: The decision of whether or not to place a workload on a redundant cluster with highly available storage is strongly affected by the service level the system needs to meet; data center planning tools must take expiring warranties of components into account, and a relatively mild punishment for SLA violations might lead
a cloud operator to sometimes take a chance and place workloads on less reliable infrastructure. In order to collaborate efficiently, data center operators need to document procedures and log activities. Proper knowledge management is essential in order to avoid the need for a problem to be resolved repeatedly by different staff members. Activities are usually managed via a ticketing system, where infrastructure alerts and customer complaints are distributed and resolved by operators. The examples above show that business and organizational information must be addressed in a unified way. When business information about systems or customers is stored or when documentation about a certain hardware type is written, it must be possible to cross reference information collected from infrastructure providers. Technology similar to work done in the semantic Wiki community can help to satisfy these requirements [7]. Operating on an RDF base which is fed by infrastructure providers, operators can extend this data by documenting and annotating the respective items.
4.1.3 Policies Cloud operators typically define business policies on a high level, which are subsequently monitored by software or taken into account when decisions are being made. In order to be successful in the marketplace, cloud providers need to be able to adapt policies to changing market requirements or new competition. Changing policies quickly is only possible when the gap between high level business policies and low level implementation is not too big. Here, semantic technologies have much to offer: a considerable body of work addresses this issue by mapping business policies into rule-based systems.
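As a rough illustration of mapping business policies into a rule-based system, the following sketch expresses two policies suggested by the examples in Sect. 4.1.2 as condition and action pairs. All names (`Rule`, `evaluate`, the fact keys, and the 30-day threshold) are hypothetical, and the code is plain Python rather than the syntax of any particular rule engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Facts about one system, e.g. collected from infrastructure providers
# and enriched with business annotations (keys are assumptions).
Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]
    action: str  # here just a description of what should be triggered


RULES: List[Rule] = [
    Rule(
        name="gold-needs-high-availability",
        condition=lambda f: f["service_level"] == "gold" and not f["on_ha_cluster"],
        action="migrate workload to a highly available cluster",
    ),
    Rule(
        name="warranty-expiring",
        condition=lambda f: f["warranty_days_left"] < 30,
        action="schedule hardware replacement",
    ),
]


def evaluate(facts: Facts, rules: List[Rule] = RULES) -> List[str]:
    """Return the actions of all rules whose condition holds for the facts."""
    return [rule.action for rule in rules if rule.condition(facts)]


print(evaluate({"service_level": "gold", "on_ha_cluster": False, "warranty_days_left": 90}))
```

Because each policy is a separate declarative entry, adapting to a new market requirement means adding or editing a rule rather than changing the monitoring code.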
4.2 Survey Many of the aspects mentioned above are partly available in today’s data center/cloud management systems. The details of how the public clouds of Google or Amazon are managed are well-kept secrets. Consequently, our analysis is based on commercial offerings such as IBM Tivoli, HP OpenView or VMware vCenter. Tivoli and OpenView cover issue tracking and have a relatively flexible policy engine. Most commercial systems are far from having a truly integrated data source and appear more like a bundle of individual standalone software products. The fluidOps eCloudManager product suite is one of the first commercial offerings to integrate semantic technologies [5]. Policies are fully customizable via JBoss rules, and the upcoming eCloudManager Intelligence Edition features a semantic Wiki and powerful analytics and visualization technology.
4.3 Roadmap Many items remain on the research agenda. One of the key issues concerns the ways in which domain specific software can benefit from open schemata. In other
words, how can software that was written for a given purpose best be extended and customized by collaboratively extending its schema, annotating new information, and adding new policies on the fly? A second research question focuses on another aspect of extensibility, namely integrating new data and services on the fly. Note that neither question is specific to data centers; both are relevant in nearly any domain. Last but not least, best practices, case studies, and quantitative evaluations are on the research agenda.
5 Summary In this chapter we have discussed the benefits of combining cloud computing and semantic technologies. In particular, we looked at three ways of combining the technologies. First, cloud computing for semantic technologies focuses on how cloud services help to build semantic applications or help them scale to ever larger data sets. Second, semantic technologies offered as cloud services provide the functionality of semantic technology platforms or entire semantic applications as a service in the cloud. Finally, better clouds through semantic technologies applies semantic technologies across the whole cloud stack to improve the administration of clouds. For each of the three combinations we concluded that there is great potential for one technology to benefit from the other. In addition, we surveyed the state of the art for each combination, showing that there is already ongoing research and even first promising examples. However, each of the combinations also poses new challenges and research questions that have yet to be answered.
References
1. Bizer, Christian, Heath, Tom, Berners-Lee, Tim: Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009)
2. Dean, Jeffrey, Ghemawat, Sanjay: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
3. Dikaiakos, Marios D., Katsaros, Dimitrios, Mehra, Pankaj, Pallis, George, Vakali, Athena: Cloud computing: distributed Internet computing for IT and scientific research. IEEE Internet Comput. 13(5), 10–13 (2009)
4. Gates, Alan, Natkovich, Olga, Chopra, Shubham, Kamath, Pradeep, Narayanam, Shravan, Olston, Christopher, Reed, Benjamin, Srinivasan, Santhosh, Srivastava, Utkarsh: Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc. VLDB Endow. (PVLDB) 2(2), 1414–1425 (2009)
5. Haase, Peter, Mathäß, Tobias, Schmidt, Michael, Eberhart, Andreas, Walther, Ulrich: Semantic technologies for enterprise cloud management. In: Patel-Schneider, Peter F., Pan, Yue, Hitzler, Pascal, Mika, Peter, Zhang, Lei, Pan, Jeff Z., Horrocks, Ian, Glimm, Birte (eds.) International Semantic Web Conference (2). Lecture Notes in Computer Science, vol. 6497, pp. 98–113. Springer, Berlin (2010)
6. Jeffery, Keith, Neidecker-Lutz, Burkhard: The future of cloud computing - opportunities for European cloud computing beyond 2010. Technical report, European Commission - Information Society and Media (2010)
7. Kleiner, Frank, Abecker, Andreas, Brinkmann, Sven F.: WiSyMon: managing systems monitoring information in semantic Wikis. In: Riehle, Dirk, Bruckman, Amy (eds.) Proceedings of the 2009 International Symposium on Wikis, Orlando, Florida, USA, October 25–27, 2009. ACM, New York (2009)
8. Laclavík, Michal, Šeleng, Martin, Hluchý, Ladislav: Towards large scale semantic annotation built on MapReduce architecture. In: ICCS ’08: Proceedings of the 8th International Conference on Computational Science. Part III, pp. 331–338. Springer, Berlin (2008)
9. Lenk, Alexander, Klems, Markus, Nimis, Jens, Tai, Stefan, Sandholm, Thomas: What’s inside the cloud? An architectural map of the cloud landscape. In: CLOUD ’09: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, pp. 23–31. IEEE Comput. Soc., Washington (2009)
10. McIlraith, Sheila A., Son, Tran Cao, Zeng, Honglei: Semantic web services. IEEE Intell. Syst. 16(2), 46–53 (2001)
11. Mika, Peter, Tummarello, Giovanni: Web semantics in the clouds. IEEE Intell. Syst. 23(5), 82–87 (2008)
12. Newman, Andrew, Li, Yuan-Fang, Hunter, Jane: A scale-out RDF molecule store for improved co-identification, querying and inferencing. In: Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), pp. 1–16 (October 2008)
13. Newman, Andrew, Li, Yuan-Fang, Hunter, Jane: Scalable semantics - the silver lining of cloud computing. In: Proceedings of the IEEE Fourth International Conference on eScience, pp. 111–118. IEEE Comput. Soc., Los Alamitos (December 2008)
14. Olston, Christopher, Reed, Benjamin, Srivastava, Utkarsh, Kumar, Ravi, Tomkins, Andrew: Pig Latin: a not-so-foreign language for data processing. In: Wang, Jason Tsong-Li (ed.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008, pp. 1099–1110. ACM, New York (2008)
15. Oren, Eyal, Delbru, Renaud, Catasta, Michele, Cyganiak, Richard, Stenzhorn, Holger, Tummarello, Giovanni: Sindice.com: a document-oriented lookup index for open linked data. Int. J. Metadata Semant. Ontol. 3(1), 37–52 (2008)
16. Pavlo, Andrew, Paulson, Erik, Rasin, Alexander, Abadi, Daniel J., DeWitt, David J., Madden, Samuel, Stonebraker, Michael: A comparison of approaches to large-scale data analysis. In: SIGMOD ’09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pp. 165–178. ACM, New York (2009)
17. Stein, Raffael, Zacharias, Valentin: RDF on cloud number nine. In: Proceedings of NeFoRS 2010: New Forms of Reasoning for the Semantic Web: Scalable & Dynamic, pp. 11–23 (2010)
18. ter Horst, Herman J.: Combining RDF and part of OWL with rules: semantics, decidability, complexity. In: Gil, Yolanda, Motta, Enrico, Benjamins, V. Richard, Musen, Mark A. (eds.) The Semantic Web - ISWC 2005. Proceedings of the 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6–10, 2005. Lecture Notes in Computer Science, vol. 3729, pp. 668–684. Springer, Berlin (2005)
19. Urbani, Jacopo, Kotoulas, Spyros, Maassen, Jason, van Harmelen, Frank, Bal, Henri E.: OWL reasoning with WebPIE: calculating the closure of 100 billion triples. In: Aroyo, Lora, Antoniou, Grigoris, Hyvönen, Eero, ten Teije, Annette, Stuckenschmidt, Heiner, Cabral, Liliana, Tudorache, Tania (eds.) The Semantic Web: Research and Applications. Proceedings of the 7th Extended Semantic Web Conference, ESWC 2010, Part I, Heraklion, Crete, Greece, May 30–June 3, 2010. Lecture Notes in Computer Science, vol. 6088, pp. 213–227. Springer, Berlin (2010)
20. Urbani, Jacopo, Kotoulas, Spyros, Oren, Eyal, van Harmelen, Frank: Scalable distributed reasoning using MapReduce. In: Bernstein, Abraham, Karger, David R., Heath, Tom, Feigenbaum, Lee, Maynard, Diana, Motta, Enrico, Thirunarayan, Krishnaprasad (eds.) The Semantic Web - ISWC 2009. Proceedings of the 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25–29, 2009. Lecture Notes in Computer Science, vol. 5823, pp. 634–649. Springer, Berlin (2009)
21. Vaquero, Luis M., Rodero-Merino, Luis, Caceres, Juan, Lindner, Maik: A break in the clouds: towards a cloud definition. SIGCOMM Comput. Commun. Rev. 39(1), 50–55 (2009)
Semantic Complex Event Reasoning—Beyond Complex Event Processing Nenad Stojanovic, Ljiljana Stojanovic, Darko Anicic, Jun Ma, Sinan Sen, and Roland Stühmer
Abstract Complex event processing is about processing huge amounts of information in real time, in a rather complex way. The degree of complexity is determined by the level of the interdependencies between information to be processed. There are several more or less traditional operators for defining these interdependencies, which are supported by existing approaches and the main competition is around the speed (throughput) of processing. However, novel application domains like Future Internet are challenging complex event processing for a more comprehensive approach: from how to create complex event patterns over the heterogeneous event sources (including textual data), to how to efficiently detect them in a distributed setting, including the usage of background knowledge. In this chapter we present an approach for intelligent CEP (iCEP) based on the usage of semantic technologies. It represents an end-to-end solution for iCEP starting from the definition of complex event patterns, through intelligent detection, to advanced 3-D visualization of complex events. At the center of the approach is the semantic model of complex events that alleviates the process of creating and maintaining complex event patterns. The approach utilizes logic-based processing for including domain knowledge in the complex event detection process, leading to complex event reasoning. This approach has been implemented in the web-based framework called iCEP Studio.
N. Stojanovic, FZI Forschungszentrum Informatik, Karlsruhe, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_14, © Springer-Verlag Berlin Heidelberg 2011
1 Introduction The need for real-time processing of information has increased tremendously in recent years, as not only the content but also its real-time context determines the value of information. Having such contextual information in real time represents a big competitive advantage: it will substantially increase the responsiveness of a system. This means that a system can react to some (exceptional) situations with more agility (and even proactivity). An extreme example is the stock exchange: algorithmic trading systems fully automate buy and sell decisions and act within 20 milliseconds
without human intervention. However, besides speed, the complexity of the context in which information is considered and processed defines the value of information. Indeed, since enterprises are nowadays influenced by many different, dynamically changing factors (e.g., stock-exchange information, customer requests, competitors, etc.), information in isolation has little value: only a combination of information (in real time) has high value. Since not every combination is a useful one, the combinatory process must be designed and performed in an intelligent way. It must lead to the detection of situations which are relevant and important for the reaction. Otherwise, the process will lead to a second-order information overload. In a nutshell, the following two challenges are the emerging needs for intelligent complex event processing (iCEP):
• How to support the discovery/definition and maintenance/evolution of patterns for complex models of situations (context).
• How to enable an efficient detection of such situations in real time and under extreme conditions (high throughput, complex and heterogeneous events, unreliable event sources).
However, although there are many successful event processing applications in areas such as financial services (e.g., dynamic tracking of stock fluctuations, surveillance for fraud and money laundering, etc.), sensor-based applications (e.g., RFID monitoring), network traffic monitoring, Web click analysis, etc., current approaches for event processing are mainly focused on the extreme processing of homogeneous event streams and neglect the need for more manageable and intelligent processing. In this chapter we present an approach for intelligent CEP (iCEP) based on the usage of semantic technologies. It represents an end-to-end solution for iCEP starting from the definition of complex event patterns, through intelligent detection, to advanced 3-D visualization of complex events.
Besides addressing the two challenges mentioned above, this approach benefits from the semantic background in several ways:
1. Since the semantics of events is explicitly represented, events can be combined semantically, which ensures a precise description of the situations of interest in the form of patterns (e.g., through semantic comparison with existing patterns).
2. Semantics enables the usage of domain knowledge in the detection process, so that relevant patterns can be discovered in more robust ways (e.g., if there is not an explicit match, the hierarchies can be explored).
3. Management of patterns of interest is also facilitated by semantics (e.g., semantic analysis of the problems in the pattern of some situations can lead to better suggestions of what should be changed in an out-dated pattern).
This approach has been implemented in the web-based framework called iCEP Studio.1
1 iCEP Website http://icep.fzi.de/ for more details.
This chapter is organized in the following way: We start with a motivating example explaining challenges which drive our approach. In Sect. 3 we discuss the role of semantics for CEP. Section 4 details the conceptual architecture of the approach, whereas Sects. 5 and 6 present our approaches for complex pattern management and detection, respectively. Section 8 gives an overview of the related work and Sect. 9 contains concluding remarks.
2 Motivation An emerging characteristic of event processing is its ubiquity: events are everywhere, and they just need to be collected and combined in real time in a proper way. An example is the increasing need for processing events from Web 2.0 sources (e.g., blogs, social networks, etc.). Tweets (from Twitter2) can be converted into events and used for the detection of some interesting situation, e.g., several tweets about X within one minute. These situations can be modeled in patterns that represent topics of interest which need to be detected in near real time (e.g., developing news stories or certain situations of interest that have just happened). Events from this source can be further combined with other sources (e.g., Google Finance,3 etc.) to detect further situations of interest (i.e., complex events). To motivate our work, let us consider a monitoring service based on named entities extracted from Twitter and on events received from Google Finance. Let us assume a user wants to be informed about the impact of business news on stocks (of the companies involved in that news story). For example, a company X launches a new product Y (announced via X’s channel on Twitter4), and a user wants to monitor possible changes of X’s stocks. In particular, the user would be interested in detecting changes of the stock price and of the volume of traded stocks after the announcement, and in being alerted if the price is greater/smaller than the max/min price and the volume has increased/plummeted with respect to its average value (calculated before the announcement). This example requires complex pattern matching over continuously arriving events. First, it demands the aggregation of a finite yet unbounded number of events (e.g., to calculate the average price).
Then the overall system needs to extract5 named entities of interest and reason about them (with respect to given background knowledge of a particular business domain), which extends iCEP towards logic inferencing.
2 Twitter is a micro-blogging service which enables users to send messages of up to 140 characters to a set of users, the so-called followers: http://twitter.com/.
3 Google Finance: http://google.com/finance.
4 Such information does not strictly need to be taken from Twitter. Other sources also provide real-time updates, e.g., the Wall Street Journal blog: http://blogs.wsj.com/.
5 Named entity extraction is out of the scope of this chapter. For this task, in our example, we used the OpenCalais service http://opencalais.com. OpenCalais uses natural language processing, machine learning and other methods to analyze text (in our case, tweets) and finds named entities within it.
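The stock monitoring pattern of this motivating example can be sketched as follows. The sketch assumes that price and volume ticks for company X are already available as simple records, and it reads "increased/plummeted with respect to its average value" as a deviation by a configurable factor; the names `Tick`, `announcement_alert`, and `volume_factor` are hypothetical, and the factor of 2 is an arbitrary choice for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tick:
    time: float    # seconds since an arbitrary epoch
    price: float
    volume: float


def announcement_alert(before: List[Tick], tick: Tick, volume_factor: float = 2.0) -> bool:
    """Check one tick observed after the announcement against statistics
    aggregated over the ticks observed before the announcement.

    Alerts when the price leaves the pre-announcement [min, max] range AND
    the volume deviates from its pre-announcement average by `volume_factor`
    (an assumed threshold) in either direction.
    """
    if not before:
        raise ValueError("need at least one tick before the announcement")
    prices = [t.price for t in before]
    avg_volume = sum(t.volume for t in before) / len(before)

    price_break = tick.price > max(prices) or tick.price < min(prices)
    volume_shift = (
        tick.volume > volume_factor * avg_volume
        or tick.volume < avg_volume / volume_factor
    )
    return price_break and volume_shift


history = [Tick(0, 10.0, 100), Tick(60, 11.0, 100), Tick(120, 12.0, 100)]
print(announcement_alert(history, Tick(180, 13.0, 250)))   # new high on heavy volume
print(announcement_alert(history, Tick(240, 11.5, 250)))   # price still inside the old range
```

The aggregation over `before` corresponds to the average that must be computed prior to the announcement; a streaming implementation would maintain these statistics incrementally instead of recomputing them per tick.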
Next, let us imagine that the user has been using the above-mentioned pattern for a while but is not satisfied with the responses and would like to change something in the pattern. Obviously, there are many changes that can be performed, and a brute-force method would not work. By analyzing the execution of individual parts of the pattern, one could, for example, conclude that a new constraint should be added: only if the above-mentioned increase or plummeting of the volume with respect to its average value happens within the first thirty minutes after the announcement is it worth buying or selling stocks. Patterns (e.g., when to inform someone) depend on several factors such as the user’s needs, changes in the event sources, etc. Therefore, patterns must be dynamically verified. This in turn entails that the description of a complex situation must be maintained over time, i.e., changed or adapted to new conditions. Finally, let us assume that a user has defined many patterns and would like to be informed (via e-mail) whenever they are fulfilled. Such a setup might produce so-called “notification overload,” since the user will not be able to react to all these notifications in real time, especially without analyzing the importance of particular events. A way around this is to visualize those notifications and enable the user to see all of them in a broader context. This supports a high-level view of emerging situations, enabling so-called situational awareness.
3 Ontologies in Event Processing
3.1 The Need for Semantics In this section, we discuss the use of ontologies as a high-level, expressive, conceptual modeling approach for describing the knowledge upon which event processing is based. CEP systems constantly monitor and gather the data they need to react to or act upon, according to their management tasks and targets. This data is elaborated upon and organized through the notion of events (e.g., bankruptcy of a company X, etc.). Events in turn are typically meaningful in a certain context when correlated with other events. The output of such analyses can be newly created events describing the occurrence of a potentially interesting (for example, dangerous, valuable, or important) situation (modeled by a pattern) that has been identified. For example, a stock collapse situation can be defined as “at least 5 events of sale of the same stock within a 2-hour period.” This new event can then be fed into the system as input, or it can trigger concluding actions based both on the input and on other domain-specific data. For example, a stock collapse can be related to closing the market. The biggest challenge for CEP systems is how to get the system to figure out what it should be monitoring, and how to adapt without humans putting in every possibility. This requires (i) modeling events and (ii) complex event patterns at design time as well as (iii) discovering patterns (i.e., interesting situations) based on
the events in real-time. Ontologies seem to be good candidates for resolving these issues: they are structured, formal, and allow inference.
• First, ontologies can facilitate interoperability between events published by different sources by providing a shared understanding of the domain in question. In this way, problems caused by structural and semantic heterogeneity of different event models can be avoided.
• Second, the explicit representation of the semantics of complex event patterns through ontologies will enable CEP systems to provide a qualitatively new level of services such as verification, justification, and gap analysis, as we discuss later in this chapter.
• Finally, ontologies do not only define information, but also add expressiveness and reasoning capabilities. Ontology rules provide a way to define behavior in relation to a system model.
In the remainder of this section we discuss how ontologies can advance CEP systems.
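The stock collapse situation used as an example in this section (at least 5 sale events of the same stock within two hours) can be detected with a sliding window per stock. The following is a minimal sketch under the assumption that sale events arrive as `(timestamp, symbol)` pairs ordered by time; the function and constant names are hypothetical and this is not the detection mechanism of iCEP Studio.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

WINDOW_SECONDS = 2 * 60 * 60   # "within a 2-hour period"
MIN_SALES = 5                  # "at least 5 events of sale"


def detect_stock_collapse(
    sales: Iterable[Tuple[float, str]],
) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, symbol) whenever a sale event completes the pattern.

    `sales` must be ordered by timestamp (in seconds).
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    for timestamp, symbol in sales:
        window = recent[symbol]
        window.append(timestamp)
        # Drop sale events that have fallen out of the two-hour window.
        while timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MIN_SALES:
            yield timestamp, symbol


stream = [(0, "X"), (600, "X"), (1200, "Y"), (1800, "X"),
          (2400, "X"), (3000, "X"), (20000, "X")]
print(list(detect_stock_collapse(stream)))
```

Each yielded pair corresponds to a newly created complex event that could be fed back into the system as input, as described above.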
3.2 Events An event is a special kind of message generated by a resource in the domain. Analyzing event data is difficult if the data is not normalized into a common, complete, and consistent model. This entails not only reformatting the data for better processing and readability, but also breaking it down into its most granular pieces. It includes filtering out unwanted information to reduce analytical errors or misrepresentations. It can even involve acquiring more information from outside the scope of the original event data, for example, from the operating system on which an event source is running. Moreover, different applications publish different types of events. Therein lie the challenges for the model designer: to allow virtually any type of event to be defined and to provide maximum infrastructure for supporting event handling. As ontologies serve for information integration,6 they can also be used for getting streaming data (i.e., events) from multiple sources (with different event formats). Indeed, the prerequisite for meaningful analysis of an event is that information about that event be properly organized and interpreted. Thus, it is important to express information about events in a common and uniform way, which implies in turn that a model of events is necessary. The role of such a model is to describe what happened, why it happened, when it happened, and what the cause was. Additional support information for decision making may also be incorporated, such as cost analysis, prioritization, and asset allocation information. This additional data enables a thorough analysis, but, in order to separate the monitoring logic and the decision logic, this model must not contain information about how the event should be resolved.
6 There are several reasons for using ontologies for information integration. As the most advanced knowledge representation model available today, ontologies can include essentially all currently used data structures. They also can accommodate complexity because the inclusion of deductive logic extends existing mapping and business logic capabilities. Further, ontologies provide shared conceptualization and agreed upon understanding of a domain, both prerequisites for (semi-)automatic integration.
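A common event model of the kind discussed in this section can be approximated by a small data structure plus one normalizer per source. In the sketch below, the `Event` fields, the two raw record layouts, and the event type names are assumptions made for illustration; they do not reproduce the chapter's ontology or the actual formats of Twitter or Google Finance.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Event:
    event_type: str                 # what happened
    occurred_at: datetime           # when it happened
    source: str                     # which resource published it
    cause: Optional[str] = None     # why it happened, if known
    attributes: Dict[str, object] = field(default_factory=dict)
    # Deliberately no field describing how the event should be resolved:
    # decision logic stays outside the event model.


def from_tweet(raw: Dict) -> Event:
    """Normalize a hypothetical tweet record with a Unix timestamp."""
    return Event(
        event_type="ProductAnnouncement",
        occurred_at=datetime.fromtimestamp(raw["created"], tz=timezone.utc),
        source="twitter",
        attributes={"company": raw["user"], "text": raw["text"]},
    )


def from_quote(raw: Dict) -> Event:
    """Normalize a hypothetical stock quote record with an ISO 8601 time."""
    return Event(
        event_type="StockQuote",
        occurred_at=datetime.fromisoformat(raw["time"]),
        source="finance",
        attributes={"company": raw["symbol"], "price": float(raw["last"])},
    )


announcement = from_tweet({"created": 0, "user": "X", "text": "X launches Y"})
quote = from_quote({"time": "2011-01-01T10:00:00+00:00", "symbol": "X", "last": "12.5"})
print(announcement.attributes["company"] == quote.attributes["company"])
```

After normalization, both events carry the company under the same attribute name and comparable timestamps, so they can be correlated in a pattern regardless of their original format.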
3.3 Pattern The automated response to events is typically based on patterns.7 Modeling these patterns and mimicking the decision-making behavior of domain experts is a central problem in the development of CEP systems. In order to determine which events to monitor in CEP systems and how to analyze them, domain experts must be intimately familiar with the operational parameters of each managed resource and the significance of related events. Certainly, no domain expert possesses all knowledge about a domain. This represents a bottleneck in the development of a model for a concrete domain. Formal methods can be considered as a methodology for extracting knowledge in a domain in a semi-automatic way. For example, an analysis of a broken chain during the inference process could help domain experts comprehend the effect of an event. If properly used, this process can reduce the number of incorrect activities and can even guide a refinement process. Moreover, this analysis can assist in evaluating whether rules produce the same output as would be generated by a human expert in that domain. Ontology-based representation makes it possible to better exploit the knowledge available in the domain itself (e.g., the event model) and, ultimately, to perform automated reasoning on this knowledge. In general, reasoning tasks can be used to verify a model. Consistency checking, detection of redundancies, and refinement of properties are some of these reasoning activities. Using these concepts, it is possible to guide a domain expert through the development of complex event patterns by providing additional information, such as what else must be done in order to detect a situation.
3.4 Detection of Complex Event Patterns Complex event detection is the heart of CEP systems. It uses a set of patterns to determine whether an action needs to be performed on an event. Having a formal logical model of situations would enable the use of some very interesting reasoning services to support the whole event processing. For example, we can define two situations as conflicting with each other and try to avoid running the whole system in such a state. The CEP system can formally check the consistency of the defined situations and backtrack if a conflict (meaning an inconsistency in the system) appears. This can help us to optimize reactions to situations. Additionally, abstract definitions of patterns enable more efficient detection of the (complex) events to which the system should react. For example, situation detection may not be purely deterministic, and this is where intelligent complex event processing comes in. Consider a pattern that is only an approximation: “an event E1 happens at least 4 times within an hour.” If the event E1 happens 3 times and the pattern is an approximation, there is some probability that the situation did occur.
7 A more descriptive definition of pattern is given in Sect. 5.
Fig. 1 Life cycle of complex event processing applications
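The approximate pattern from Sect. 3.4 (event E1 at least 4 times within an hour) can be given a graded rather than a yes/no answer. The chapter does not specify how the probability is computed, so the sketch below uses a deliberately naive stand-in: the fraction of required occurrences seen in the window. This ratio is an assumption for illustration only and is not a calibrated probability.

```python
from typing import Iterable


def match_confidence(
    timestamps: Iterable[float],
    now: float,
    required: int = 4,
    window: float = 3600.0,
) -> float:
    """Score in [0, 1] for the pattern "at least `required` occurrences
    within `window` seconds", evaluated at time `now`.

    Heuristic (assumed, not from the chapter): occurrences seen in the
    window divided by the required count, capped at 1.0.
    """
    seen = sum(1 for t in timestamps if 0 <= now - t <= window)
    return min(1.0, seen / required)


# Three occurrences of E1 in the last hour: a partial match.
print(match_confidence([100.0, 900.0, 2000.0], now=3000.0))
```

A detection component could compare such a score against a threshold chosen per pattern, reporting a likely situation even when the crisp pattern is not fully matched.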
4 Conceptual Architecture
Complex event processing is a recurring cycle every application goes through [29]. The cycle consists of four phases:⁸ Plan, Observe, Orient and Act. They describe the life cycle of a CEP application from gathering requirements, through detecting events and visualizing results, to finally taking action, cf. Fig. 1.⁹ The cycle is closed by the feedback of observations from previous phases into the Plan phase.
The first phase, the Plan phase, is concerned with the design of a CEP application. In this phase, the application is modeled with respect to the situations which should be reacted to. Section 5 describes this phase in detail.
The second phase, called Observe, is the centerpiece of CEP. In this phase, situations are detected according to the models (or patterns) which were created in the preceding phase. Section 6 describes this phase in detail.
8 Depending on the nature of an application some phases can be skipped.
9 The CEP gears logo is ©IBM Haifa.
N. Stojanovic et al.
The third phase, termed Orient, presents the findings of the preceding phase for orientation. This phase is important for the human in the loop. If a fully automatic application is desired then this phase can be skipped. Section 7 describes this phase in detail.
The fourth phase, named Act, covers the responses to the observed situations in accordance with any decisions made in the Orient phase.
Before we dive into the details of the phases, we would like to introduce two important general terms in CEP which help to understand the conceptual architecture of a CEP application. These terms are event processing network and event processing agent. They were introduced by David Luckham in [23].
Conceptually, the detection of events (e.g., in the Observe phase) is run on a network, called the event processing network (EPN). Such a network is a directed graph consisting of channels transporting events and nodes processing events. The nodes are called event processing agents (EPAs). An EPA has one or more input channels and one or more output channels. Additionally, an EPA contains application logic describing how to process events from its input channels and, sometimes, external data. There are several classes of EPAs which offer a coarse distinction of the wide range of possible processing operations. These classes are event enrichment, projection, aggregation, splitting, composition, filtering and pattern detection. An event processing network is a composition of one or more EPAs to create a complex application.
Internally, an EPA is composed of three stages. Input events must pass through all of these stages to participate in the output. The stages are filtering, detection and derivation [17].¹⁰ Not all stages are needed in all classes of EPAs, and stages may be skipped if necessary. The filtering stage is responsible for eliminating, at an early point, events which cannot become part of further stages and eventually of the output. The detection stage finds events which can be combined according to the prescribed event operations. Finally, the derivation stage creates the output event or events from the set of events from the previous stage.
Figure 2 shows a sample event processing network. It is composed of event producers, channels, event processing agents and event consumers. In general, event processing networks are a conceptual abstraction of a CEP application. Additionally, in many cases the EPN helps in the concrete implementation of an application as well. However, the user of a CEP application is usually not involved in the creation of a network. Most event processing platforms compile the EPN from a declarative statement given by the user. Section 5 explains more about these statements, generally called event patterns. Section 6 subsequently returns to the generation of EPNs from these patterns.
10 We call the middle stage “detection” instead of “matching” in order to include a wider field of event operations such as reasoning which goes beyond pattern matching.
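The three internal stages of an EPA described above (filtering, detection, derivation) can be sketched as a small pipeline. This is a minimal illustration, not the design of any particular CEP platform: the names `Event`, `EventProcessingAgent`, `keep`, `detect` and `derive`, as well as the averaging agent, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Event:
    kind: str
    value: float


@dataclass
class EventProcessingAgent:
    keep: Callable[[Event], bool]                 # filtering stage
    detect: Callable[[List[Event]], List[Event]]  # detection stage
    derive: Callable[[List[Event]], List[Event]]  # derivation stage

    def process(self, events: Iterable[Event]) -> List[Event]:
        # Input events pass through all three stages to reach the output.
        filtered = [e for e in events if self.keep(e)]
        matched = self.detect(filtered)
        return self.derive(matched) if matched else []


# Hypothetical aggregation EPA: average all positive "price" events.
averager = EventProcessingAgent(
    keep=lambda e: e.kind == "price",
    detect=lambda es: [e for e in es if e.value > 0],
    derive=lambda es: [Event("avg_price", sum(e.value for e in es) / len(es))],
)

output = averager.process(
    [Event("price", 10.0), Event("tweet", 1.0), Event("price", 20.0)]
)
```

In an EPN, the output list of one agent would be placed on a channel that feeds the input of the next agent.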
Fig. 2 Event processing network
5 Complex Event Processing Modeling
Detecting a pattern over events properly is the most important capability of CEP systems. This capability enables a just-in-time reaction to situations that have occurred. In order to recognize these situations, so-called complex event patterns (CEPATs) have to be defined. These patterns¹¹ are used in order to process the events and aggregate them into higher-level complex events. A pattern is an expression formed by using a set of events (either simple or complex) and a set of event operators [14]. Patterns represent knowledge about the reactivity of the system. For example, the pattern¹² (A AND B) happen WITHIN 2 Minutes contains two events A and B, a logical operator AND and a window operator WITHIN.
In order to cope with the evolving nature of business environments, we need effective and efficient support for advanced CEPAT management. However, today’s management tasks in a CEP system related to the generation, maintenance and evolution of CEPATs are performed manually without any systematic methodology or tool support. So far, no academic approach exists that deals with the management and evolution of CEPATs. The vendors providing CEP systems also neglect the issue of management and evolution. They are focused on runtime rather than design-time issues.
11 Pattern and complex event pattern are used synonymously in this article.
12 For the sake of convenience the pattern is represented in a pseudocode form.
Fig. 3 Phases of the CEPAT life cycle
In this section we present a methodology for the management of CEPATs which supports the whole life cycle of a CEPAT: starting from its generation, throughout its usage, including its evolution. The methodology treats CEPATs as knowledge artefacts that must be acquired from various sources, represented, shared and evaluated properly in order to enable their evolution. We argue that such a methodology is a necessary condition for making a CEP system efficient and keeping it alive. In a nutshell, our approach is a semantic-based representation which enables us to (a) make the relationships between CEPATs explicit in order to reuse existing pattern artefacts and (b) automatically suggest improvements in a CEPAT, based on necessity derived from usage statistics.
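The pseudocode pattern (A AND B) happen WITHIN 2 Minutes from the beginning of this section can be made concrete with a short sketch. The function name `and_within` and the representation of events as `(type, timestamp)` tuples are assumptions for illustration. As a simplification, each incoming event is only paired with the most recent occurrence of the other event type.

```python
def and_within(events, first="A", second="B", window=120.0):
    """Return (t_start, t_end) pairs where both event types occur, in either
    order, no more than `window` seconds apart.

    `events` is a list of (type, timestamp) tuples sorted by timestamp.
    """
    matches = []
    last_seen = {}  # most recent timestamp per relevant event type
    for etype, ts in events:
        if etype not in (first, second):
            continue  # events of other types are irrelevant to this pattern
        other = second if etype == first else first
        if other in last_seen and ts - last_seen[other] <= window:
            matches.append((last_seen[other], ts))
        last_seen[etype] = ts
    return matches


stream = [("A", 0.0), ("C", 30.0), ("B", 90.0), ("A", 400.0), ("B", 600.0)]
matches = and_within(stream)
```

Here the first A and B are 90 seconds apart and form a match, while the second A and B are 200 seconds apart and do not.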
5.1 GRUVe: The Methodology
We define CEPAT life cycle management as a process covering the generation, the refinement, the execution and the evolution of CEPATs. The focus is on supporting the user in creating new patterns, increasing the reusability of existing pattern artefacts and giving hints for pattern evolution. As illustrated in Fig. 3, the GRUVe¹³ methodology consists of four main phases. These phases form a feedback loop enabling a continual flow of information collected in the entire life cycle of a CEPAT. The main idea is to enable non-technicians to search incrementally for the most suitable form of the requested CEPATs and to continually improve their quality, taking into account changes that might happen in the internal or external world. The methodology is based on the requirements given in [29]. Additionally, it has been influenced by our past work in the area of knowledge management and ontology evolution [30, 31]. Below we describe these phases in more detail.
13 Acronym for Generation Refinement Usage eVolution of complex (e)vent patterns.
5.2 Generation Phase
The life cycle of CEPATs starts with their development, which falls into three categories (see Fig. 3, block Generation). The simplest way to develop a CEPAT is manual development, which can be done only by business experts. However, we cannot expect that an arbitrary user spends time finding, grouping and ordering the events in order to create the desired CEPAT. In order to do that, the user must be aware of the ways of combining events, must find the right type of events, must foresee and solve the intermediate conflicts that might appear, and must order the events in the correct way.
A more user-oriented approach can be obtained by reusing existing CEPATs, the so-called search-based approach. Here, the users should be able to specify their complex goals without considering how they can be realized. By selecting similar existing CEPATs at the start of the CEPAT development process, the users could realize their request much faster. For instance, considering the motivating example, if the user wants to generate a new pattern for a stock-exchange situation, he/she can search for events which belong to the type stock-exchange from a certain event source. In that way the generation of a CEPAT could be realized as a straightforward search method, where the specificities of the CEPAT domain are used to define a search space and to restrict the search in this space.
Besides CEPATs developed explicitly by experts or users, there are also implicit CEPATs in the domain, reflected in the behavior of the system. These can be discovered through the analysis of log files, etc. Various data mining methods can be used for the extraction of such knowledge.
Independent of how CEPATs are identified, they have to be represented in a suitable format which supports their management. Therefore, the generated complex event patterns must be represented in an internal way that is independent of any CEP platform.
5.2.1 RDFS-Based Complex Event Pattern Representation
A well-defined event pattern model has a major impact on the flexibility and usability of CEP tools [28]. Contemporary event and pattern models lack the capability to express the event semantics and relationships to other entities. Although the importance of such a semantic event representation is demonstrated in practice [5], there has been no systematic work on this topic so far. Most of the existing event models still consider an event as an atomic unstructured or semistructured entity without explicitly defined semantics. Also, the existing CEP languages are product-specific languages designed solely for the matching process rather than for their own management and evolution.
Fig. 4 Upper-level event and CEPAT ontology
Our semantic model (cf. Sect. 3) for event and pattern representation is based on RDFS.¹⁴ RDFS in general has the advantage of offering formal and explicit definitions of relations between concepts. We use our RDFS-based pattern representation at different stages within the methodology: increasing the reusability of existing pattern artefacts, validation of defined patterns and identification of relations between two patterns in order to evolve them. The upper-level ontology contains a set of concepts Event, EventOperator, EventSource, EventType and a set of event properties. Each property may have a domain (denoted by ∃) concept as well as a range (denoted by − ) concept (see Fig. 4). These concepts can be specialized in order to define new types, sources and operators.
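To illustrate how the upper-level concepts can be specialized, the following sketch stores RDFS-style statements as plain Python triples and computes the transitive superclasses of a concept. It deliberately uses no RDF library. The property names `ex:hasSource` and `ex:hasType`, the `ex:` prefix and the specialized classes are hypothetical, since the concrete properties of Fig. 4 are not reproduced in the text.

```python
RDFS_SUBCLASS = "rdfs:subClassOf"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Upper-level concepts from the text plus hypothetical properties and subclasses.
triples = {
    ("ex:hasSource", RDFS_DOMAIN, "ex:Event"),
    ("ex:hasSource", RDFS_RANGE, "ex:EventSource"),
    ("ex:hasType", RDFS_DOMAIN, "ex:Event"),
    ("ex:hasType", RDFS_RANGE, "ex:EventType"),
    ("ex:StockExchangeEvent", RDFS_SUBCLASS, "ex:Event"),
    ("ex:PriceDropEvent", RDFS_SUBCLASS, "ex:StockExchangeEvent"),
    ("ex:Sequence", RDFS_SUBCLASS, "ex:EventOperator"),
}


def superclasses(cls, graph):
    """All classes reachable from `cls` via rdfs:subClassOf (transitive)."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == RDFS_SUBCLASS and o not in found:
                found.add(o)
                frontier.append(o)
    return found


ancestors = superclasses("ex:PriceDropEvent", triples)
```

Making such relations explicit is what allows a search for patterns over `ex:Event` to also find patterns defined over its specializations.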
5.3 Refinement Phase
As already mentioned, the process of creating CEPATs is a kind of weakly defined search operation in an information space rather than a search with very precisely defined queries. Lessons learned from the information retrieval community show that the right mechanism for the former case is the incremental refinement of queries [31], based on various kinds of feedback that can be obtained in that process. This means that the main task in the CEPAT creation process is not the very precise formulation of a pattern in the first place but an iterative improvement of the CEPAT. As presented in Fig. 3, this phase takes as input the CEPAT created in the previous phase. It allows the user to fine-tune the CEPAT if he/she is not sure about its quality. Note that an example of refinement is given in the motivating example: when the user wants to generate a new stock-exchange pattern, she/he starts from an existing pattern that must be extended (refined) in order to accommodate new requirements.
14 RDF Schema (RDFS) is an extensible knowledge representation language providing basic elements for the description of ontologies.
5.4 Usage Phase
Once created, CEPATs must be deployed in a CEP engine, after being transformed into the corresponding syntax. However, in order to ensure the quality of the created patterns, it is necessary to monitor what happens to a CEPAT or a set of CEPATs in a CEP engine at runtime. Currently, the defined CEPATs are not monitored, e.g., with respect to whether they have been executed in a CEP engine. It is not easy to see how often certain patterns have been triggered, or how often individual parts of a pattern have been executed. Nor is it known which patterns are going to be executed next. However, we believe that information on how often a pattern was triggered, or how high its current degree of fulfillment is, might be essential for pattern evolution. The goal of this phase is to track as much of this information as possible and process it in the context of the defined CEPATs. These statistics can be used within the Evolution phase in order to evolve and optimize a pattern.
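The kind of statistics the Usage phase is meant to collect can be sketched with two counters: one for completed detections per pattern and one for matched parts of a pattern. The class `PatternUsageMonitor`, its method names and the pattern identifier `price-update` are assumptions for illustration; how a real CEP engine exposes such information is not specified here.

```python
from collections import Counter


class PatternUsageMonitor:
    def __init__(self):
        self.triggered = Counter()  # completed detections per pattern
        self.partial = Counter()    # matched parts per (pattern, part)

    def record_partial(self, pattern_id, part):
        self.partial[(pattern_id, part)] += 1

    def record_trigger(self, pattern_id):
        self.triggered[pattern_id] += 1

    def rarely_triggered(self, pattern_ids, min_count):
        """Patterns whose trigger count is below `min_count`."""
        return [p for p in pattern_ids if self.triggered[p] < min_count]


monitor = PatternUsageMonitor()
for _ in range(2):
    monitor.record_trigger("stock-exchange-alarm")
for _ in range(10):
    monitor.record_trigger("price-update")
monitor.record_partial("stock-exchange-alarm", "A")

# Candidates that the Evolution phase might suggest for revision.
candidates = monitor.rarely_triggered(
    ["stock-exchange-alarm", "price-update"], min_count=5
)
```

The threshold `min_count` plays the role of the expectation against which the observed usage is compared.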
5.5 Evolution Phase
A pattern that is not to become rapidly obsolete must change and adapt to the changes in its environment, users’ needs, etc. Therefore, if a pattern aims at remaining useful it is essential that it is able to accommodate the changes that will inevitably occur. Developing patterns and their applications is expensive, but evolving them is even more expensive. Facilitating those changes is complicated if large quantities of events and CEPATs are involved. The Evolution phase is responsible for coping with changes that may impact the quality of a CEPAT. In a more open and dynamic business environment the domain knowledge evolves continually. The basic sources that can cause changes in a business system are:
• The environment: The environment in which systems operate can change.
• The user: Users’ requirements often change after the system has been built, warranting system adaptation. For example, hiring new employees might lead to new competencies and greater diversity in the enterprise which the system must reflect.
• The process: The business applications are coupled around the business processes that should be re-engineered continually in order to achieve better performance.
The goal of this phase is to use the analytics provided by the Usage phase and suggest to the user some improvements to the evaluated CEPATs. While a good design may prevent many CEPAT errors, some problems will not surface before the CEPATs are in use. Therefore, the relationship with the usage of a CEP-based system is paramount when trying to develop useful CEPATs and cannot be neglected. One of the key ideas of the approach is to have a well-defined representation of the complex event patterns in order to recognize different relations between them, cf. Sect. 3 on ontologies. The information about the relations can be used to obtain more precise information on top of the pattern statistics. Regarding the motivating example, the Evolution phase would suggest changing the original complex event pattern for stock-exchange-alarm, since it has been detected only a couple of times (the expectation was higher). Moreover, the system can suggest what to change (e.g., to introduce a time-based constraint) in order to make the pattern more efficient.
6 Semantic Complex Event Processing: A Logic-Based Approach
As mentioned in Sect. 4 (cf. Fig. 1), after the complex event patterns have been defined (Sect. 5), a CEP system starts observing input data streams (e.g., from Twitter or Google Finance as introduced in our motivating example) in order to detect situations of interest, represented in the form of complex event patterns. This process is the main topic of this section.
6.1 Problem Statement
The general task of complex event processing can be described as follows. Within some dynamic setting, events take place. Those atomic events are instantaneous, i.e., they happen at one specific point in time and have a duration of zero. Notifications about these events, together with their timestamps and further associated data (such as involved entities, numerical parameters of the event, or provenance data, see Sect. 5.2.1), enter the CEP system in the order of their occurrence.¹⁵ The CEP system further features a set of complex event descriptions, by which complex events can be specified as temporal constellations of atomic events. The complex events thus defined can in turn be used to compose even more complex events and so forth. As opposed to atomic events, those complex events are not considered instantaneous but are endowed with a time interval denoting when the event started and when it ended.
The purpose of the CEP system is to detect complex events within this input stream of atomic events. That is, the system is supposed to give notification that the occurrence of a certain complex event has been detected as soon as it is notified of an atomic event that completes a sequence which, according to the complex event description, makes up the complex event. This notification may be accompanied by additional information composed from the atomic events’ data. As a consequence of this detection (and depending on the associated data), responding actions can be taken, yet this is outside the scope of this chapter.
In summary, the problem we address in our approach is to detect complex events (specified in an appropriate formal language) within a stream of atomic events. We assume that the timeliness of this detection is crucial and algorithmically optimize our method towards a fast response behavior.
15 The phenomenon of out-of-order events, meaning delayed notification about events that have happened earlier, is outside the focus of this chapter.
6.2 Syntax
In this section we present the formal syntax of the ETALIS Language for Events, while in the remaining sections of the chapter we gradually introduce other aspects of the language (i.e., its informal and operational semantics¹⁶). The syntax of our language allows for the description of time and events. We represent time instants as well as durations as non-negative rational numbers $q \in \mathbb{Q}^+$. Events can be atomic or complex. An atomic event refers to an instantaneous occurrence of interest. Atomic events are expressed as ground atoms (i.e., predicates followed by arguments which are terms not containing variables). Intuitively, the arguments of a ground atom describing an atomic event denote information items (i.e., event data) that provide additional information about that event. Atomic events can be composed to form complex events via event patterns. We use event patterns to describe how events can (or have to) be temporally situated relative to other events or absolute time points. The language $P$ of event patterns is formally defined by

$$P ::= pr(t_1, \ldots, t_n) \mid P \ \mathrm{WHERE}\ t \mid q \mid (P).q \mid P \ \mathrm{BIN}\ P \mid \mathrm{NOT}(P).[P, P].$$

Thereby, $pr$ is a predicate name with arity $n$, the $t_i$ denote terms, $t$ is a term of type boolean, $q$ is a non-negative rational number, and BIN is one of the binary operators SEQ, AND, PAR, OR, EQUALS, MEETS, DURING, STARTS, or FINISHES. As a side condition, in every expression p WHERE t, all variables occurring in t must also occur in the pattern p.
16 Our prototype, ETALIS, is an open source project, available at: http://code.google.com/p/etalis.
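One way to read the grammar of event patterns is as an abstract syntax tree with one node type per production. The sketch below is an illustration only and not part of ETALIS: all class names are hypothetical, conditions are kept as plain strings, and DURING is included among the binary operators because Sect. 6.3 uses it.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Tuple, Union

BINARY_OPERATORS = {"SEQ", "AND", "PAR", "OR", "EQUALS", "MEETS",
                    "DURING", "STARTS", "FINISHES"}


@dataclass(frozen=True)
class Atom:            # pr(t1, ..., tn)
    predicate: str
    args: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Where:           # P WHERE t
    pattern: "Pattern"
    condition: str


@dataclass(frozen=True)
class TimePoint:       # q
    q: Fraction


@dataclass(frozen=True)
class Within:          # (P).q
    pattern: "Pattern"
    q: Fraction


@dataclass(frozen=True)
class Binary:          # P BIN P
    op: str
    left: "Pattern"
    right: "Pattern"

    def __post_init__(self):
        if self.op not in BINARY_OPERATORS:
            raise ValueError(f"unknown binary operator: {self.op}")


@dataclass(frozen=True)
class Not:             # NOT(P).[P, P]
    negated: "Pattern"
    start: "Pattern"
    end: "Pattern"


Pattern = Union[Atom, Where, TimePoint, Within, Binary, Not]

# (prize(X, Y1) SEQ prize(X, Y2)).7 WHERE Y2 > Y1 * 1.5, cf. Sect. 6.3
remarkable_increase = Where(
    Within(
        Binary("SEQ", Atom("prize", ("X", "Y1")), Atom("prize", ("X", "Y2"))),
        Fraction(7),
    ),
    "Y2 > Y1 * 1.5",
)
```

Using `Fraction` for `q` mirrors the choice of rational numbers for time instants and durations.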
Fig. 5 Language for event processing—composition operators
Finally, an event rule is defined as a formula of the shape pr(t1, ..., tn) ← p, where p is an event pattern containing all variables occurring in pr(t1, ..., tn). Having introduced the formal syntax of our formalism, we give some examples to provide an intuitive understanding before proceeding with the semantics in the next section. Adhering to a stock market scenario, one instantaneous event (not requiring further specification) might be market_closes(). Other events with additional information associated via arguments would be bankrupt(lehman) or buys(citigroup, wachovia). Within patterns, variables instead of constants may occur as arguments, so we can write bankrupt(X) as a pattern matching all bankruptcy events irrespective of the victim. “Artificial” time-point events can be defined by just providing the corresponding timestamp.
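How a pattern atom with variables, such as bankrupt(X), matches ground atomic events can be sketched as follows. The representation of atoms as `(predicate, args)` tuples and the function names are assumptions for this example; following the Prolog convention, a term starting with an uppercase letter is treated as a variable.

```python
def is_variable(term):
    """Prolog convention (assumed here): variables start with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, event):
    """Match a pattern atom against a ground event atom.

    Both are (predicate, args) tuples. Returns a dict of variable bindings,
    or None if the atoms do not match.
    """
    (p_name, p_args), (e_name, e_args) = pattern, event
    if p_name != e_name or len(p_args) != len(e_args):
        return None
    bindings = {}
    for p, e in zip(p_args, e_args):
        if is_variable(p):
            # The same variable must always be bound to the same value.
            if bindings.setdefault(p, e) != e:
                return None
        elif p != e:
            return None
    return bindings


b1 = match(("bankrupt", ("X",)), ("bankrupt", ("lehman",)))
b2 = match(("buys", ("X", "X")), ("buys", ("citigroup", "wachovia")))
b3 = match(("buys", ("citigroup", "Y")), ("buys", ("citigroup", "wachovia")))
```

The bindings returned by a match are what allows event data to be passed on to the head of an event rule.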
6.3 Informal Semantics
Figure 5 demonstrates the various ways of constructing complex event descriptions from simpler ones in our language for event processing. Moreover, the figure informally introduces the semantics of the language. For further details about the formal declarative semantics of the ETALIS Language for Events the reader is referred to [32], Sect. 4. Let us assume that instances of three complex events, P1, P2, P3, are occurring in time intervals as shown in Fig. 5. Vertical dashed lines depict different time units, while the horizontal bars represent detected complex events for the given patterns.
In the following, we give the intuitive meaning for all patterns from the figure:
• (P1).3 detects an occurrence of P1 if it happens within an interval of length 3.
• P1 SEQ P3 represents a sequence of two events, i.e., an occurrence of P1 is followed by an occurrence of P3; thereby P1 must end before P3 starts.
• P2 AND P3 is a pattern that is detected when instances of both P2 and P3 occur, no matter in which order.
• P1 PAR P2 occurs when instances of both P1 and P2 happen, provided that their intervals have a non-zero overlap.
• P2 OR P3 is triggered for every instance of P2 or P3.
• P1 DURING (0 SEQ 6) happens when an instance of P1 occurs during an interval; in this case, the interval is built using a sequence of two atomic time-point events (one with q = 0 and another with q = 6, see the syntax above).
• P1 EQUALS P3 is triggered when the two events occur in exactly the same time interval.
• NOT (P3).[P1, P1] represents a negated pattern. It is defined by a sequence of events (delimiting events) in the square brackets where there is no occurrence of P3 in the interval. In order to invalidate an occurrence of the pattern, an instance of P3 must happen in the interval formed by the end time of the first delimiting event and the start time of the second delimiting event. In this example the delimiting events are just two instances of the same event, i.e., P1. Different treatments of negation are also possible; however, we adopt the one from [3].
• P3 STARTS P1 is detected when an instance of P3 starts at the same time as an instance of P1 but ends earlier.
• P3 FINISHES P2 is detected when an instance of P3 ends at the same time as an instance of P2 but starts later.
• P2 MEETS P3 happens when the interval of an occurrence of P2 ends exactly when the interval of an occurrence of P3 starts.
It is worth noting that the defined pattern language captures the set of all possible 13 relations on two temporal intervals as defined in [7]. The set can also be used for rich temporal reasoning.
It is worthwhile to briefly review the modeling capabilities of the presented pattern language. For example, one might be interested in defining an event matching stock market working days:
workingDay( ) ← NOT(marketCloses( ))[marketOpens( ), marketCloses( )].
Moreover, we might be interested in detecting the event of two bankruptcies happening on the same market working day:
dieTogether(X, Y) ← (bankrupt(X) SEQ bankrupt(Y)) DURING workingDay( ).
This event rule also shows how event information (about involved institutions, provenance, etc.) can be “passed” on to the defined complex events by using variables. Furthermore, variables may be employed to conditionally group events into complex ones if they refer to the same entity:
indirectlyAcquires(X, Y) ← buys(Z, Y) AND buys(X, Z).
Even more elaborate constraints can be put on the applicability of a pattern by endowing it with a boolean-typed term as a filter.¹⁷ Thereby, we can detect a stock price increase of at least 50% in a time frame of 7 days:
remarkableIncrease(X) ← (prize(X, Y1) SEQ prize(X, Y2)).7 WHERE Y2 > Y1 · 1.5.
This small selection arguably demonstrates the expressivity and versatility of the introduced language for event processing.
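The informal meaning of the interval-based operators in this section can be written down as predicates over pairs of time intervals. This is a sketch of the informal descriptions only, not of the formal semantics in [32]. The `Interval` type and function names are assumptions, and DURING is taken to be strict containment, which is also an assumption.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def seq(a, b):
    """a SEQ b: a must end before b starts."""
    return a.end < b.start


def par(a, b):
    """a PAR b: the intervals have a non-zero overlap."""
    return min(a.end, b.end) > max(a.start, b.start)


def during(a, b):
    """a DURING b: a lies strictly inside b (strictness is an assumption)."""
    return b.start < a.start and a.end < b.end


def equals(a, b):
    """a EQUALS b: both occur in exactly the same interval."""
    return a.start == b.start and a.end == b.end


def starts(a, b):
    """a STARTS b: same start time, a ends earlier."""
    return a.start == b.start and a.end < b.end


def finishes(a, b):
    """a FINISHES b: same end time, a starts later."""
    return a.end == b.end and a.start > b.start


def meets(a, b):
    """a MEETS b: a ends exactly when b starts."""
    return a.end == b.start


p1, p3 = Interval(1, 3), Interval(4, 6)
p1_then_p3 = seq(p1, p3)
```

AND and OR are omitted because they place no constraint on the relative position of the two intervals.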
6.4 Operational Semantics
In Sect. 6.3 we have defined complex event patterns. This section describes how complex events, described in our language for event processing, can be detected at run-time (following the semantics of the language). Our approach is based on goal-directed, event-driven rules and on the decomposition of complex event patterns into two-input intermediate events (i.e., goals). Goals are automatically asserted by rules as relevant events occur. They can persist over a period of time, “waiting” to support the detection of a more complex goal. This process of asserting more and more complex goals shows the progress towards the detection of a complex event. In the following subsection, we give more details about a goal-directed, event-driven mechanism w.r.t. event pattern operators (defined in Sect. 6.3).
6.4.1 Sequence of Events
Let us consider a sequence of events represented by the pattern in rule (1) (e is detected when an event a¹⁸ is followed by b, and followed by c). We can always represent the above pattern as e ← ((a SEQ b) SEQ c). In general, rules (2) represent two equivalent rules,¹⁹

$$e \leftarrow a \ \mathrm{SEQ}\ b \ \mathrm{SEQ}\ c, \tag{1}$$

$$\begin{aligned} e &\leftarrow p_1 \ \mathrm{BIN}\ p_2 \ \mathrm{BIN}\ p_3 \ \ldots\ \mathrm{BIN}\ p_n, \\ e &\leftarrow (((p_1 \ \mathrm{BIN}\ p_2)\ \mathrm{BIN}\ p_3) \ \ldots\ \mathrm{BIN}\ p_n). \end{aligned} \tag{2}$$

17 Note that also comparison operators like =, < and > can be seen as boolean-typed binary functions and, hence, fit well into the framework.
18 More precisely, by “an event a” is meant an instance of the event a.
19 That is, if no parentheses are given, we assume all operators to be left-associative. While in some cases, like SEQ sequences, this is irrelevant, other operators such as PAR are not associative, whence the precedence matters.
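The left-associative grouping expressed by rules (2) can be turned into a small procedure that splits an n-ary pattern into a chain of two-input rules. This is a sketch under simplifying assumptions: the function name `binarize` is hypothetical, all operators in the chain are taken to be the same, and intermediate events are named ie1, ie2, and so on.

```python
def binarize(head, operands, op="SEQ"):
    """Left-associatively split `head <- p1 op p2 op ... op pn` into binary rules.

    Returns a list of (head, left, op, right) tuples. Intermediate heads are
    named ie1, ie2, ...; only the last rule carries the original head.
    """
    if len(operands) < 2:
        raise ValueError("need at least two operands")
    rules = []
    left = operands[0]
    for i, right in enumerate(operands[1:], start=1):
        is_last = i == len(operands) - 1
        name = head if is_last else f"ie{i}"
        rules.append((name, left, op, right))
        left = name  # the new (intermediate) event feeds the next rule
    return rules


# e <- a SEQ b SEQ c  becomes  ie1 <- a SEQ b  and  e <- ie1 SEQ c
rules = binarize("e", ["a", "b", "c"])
```

Because each rule has exactly two inputs, an operator only has to be implemented for pairs of events.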
Algorithm 6.1 Sequence
Input: event binary goal ie1 ← a SEQ b.
Output: event-driven backward chaining rules for the SEQ operator.
Each event binary goal ie1 ← a SEQ b is converted into: {
    a(T1, T2) :- for_each(a, 1, [T1, T2]).
    a(1, T1, T2) :- assert(goal(b(_, _), a(T1, T2), ie1(_, _))).
    b(T3, T4) :- for_each(b, 1, [T3, T4]).
    b(1, T3, T4) :- goal(b(T3, T4), a(T1, T2), ie1), T2 < T3,
        retract(goal(b(T3, T4), a(T1, T2), ie1(_, _))), ie1(T1, T4).
}
We refer to this kind of “events coupling” as binarization of events. Effectively, in binarization we introduce two-input intermediate events (goals). For example, we can now rewrite rule (1) as ie1 ← a SEQ b, and then e ← ie1 SEQ c. Every monitored event (either atomic or complex), including intermediate events, will be assigned one or more logic rules, fired whenever that event occurs. Using the binarization, it is more convenient to construct event-driven rules for three reasons. First, it is easier to implement an event operator when events are considered on a “two by two” basis. Second, the binarization increases the possibility for sharing among events and intermediate events, when the granularity of intermediate patterns is reduced. Third, the binarization eases the management of rules. As we will see later in this section, each new use of an event (in a pattern) amounts to appending one or more rules to the existing rule set. However, what is important for the management of rules is the fact that we don’t need to modify existing rules when adding new ones.²⁰
In the following, we give more details about assigning rules to each monitored event. We also sketch an algorithm (using Prolog syntax) for detecting a sequence of events. Algorithm 6.1 accepts as input a rule referring to a binary sequence ie1 ← a SEQ b, and produces event-driven backward chaining rules (i.e., executable rules) for the sequence pattern. The binarization step must precede the rule transformation. Rules produced by Algorithm 6.1 belong to one of two different classes of rules.²¹ We refer to the first class as goal inserting rules. The second class corresponds to checking rules. For example, rule (4), belonging to the first class, inserts goal(b(_, _), a(T1, T2), ie1(_, _)). The rule will fire when a occurs, and the meaning of the goal it inserts is as follows: “an event a has occurred at [T1, T2],²² and we are waiting for b to happen in order to detect ie1”. Obviously, the goal does not carry information about times for b and ie1, as we don’t know when they will occur. In general, the second event in a goal always denotes the event that has just occurred. The role of the first event is to specify what we are waiting for to detect an event that is in the third position,
20 This holds even if patterns with negated events are added.
21 Later on, we will introduce the rules implementing the for each loop.
22 Apart from the timestamp, an event may carry other data parameters. They are omitted here for the sake of readability.
    for_each(Pred, N, L) :- ((FullPred =.. [Pred, N, L]), event_trigger(FullPred), (N1 is N + 1), for_each(Pred, N1, L)) ∨ true,    (3)
    a(1, T1, T2) :- assert(goal(b(_, _), a(T1, T2), ie1(_, _))),    (4)
    b(1, T3, T4) :- goal(b(T3, T4), a(T1, T2), ie1), T2 < T3, retract(goal(b(T3, T4), a(T1, T2), ie1(_, _))), ie1(T1, T4).    (5)
Rule (5) belongs to the second class, being a checking rule. It checks whether certain prerequisite goals already exist in the database, in which case it triggers the more complex event. For example, rule (5) will fire whenever b occurs. The rule checks whether goal(b(T3, T4), a(T1, T2), ie1) already exists (i.e., a has previously happened), in which case the rule triggers ie1 by calling ie1(T1, T4). The time of occurrence of ie1 (i.e., T1, T4) is defined based on the occurrence of the constituting events (i.e., a(T1, T2) and b(T3, T4), see Sect. 6.3). By calling ie1(T1, T4), this event is effectively either propagated upward (if it is an intermediate event) or triggered as a finished complex event.
We see that our backward chaining rules compute goals in a forward chaining manner. The goals are crucial for the computation of complex events. They show the current state of progress toward matching an event pattern. Moreover, they allow for determining the “completion state” of any complex event, at any time. For instance, we can query the current state and get information about how much of a certain pattern is currently fulfilled (e.g., what is the current status of a certain pattern, or notify me if the pattern is 90% completed). Further, goals can enable reasoning over events (e.g., answering which event occurred before some other event, although we do not know a priori what the explicit relationships between these two are; correlating complex events to each other; establishing more complex constraints between them, etc.). Goals can persist over a period of time. It is worth noting that checking rules can also delete goals. Once a goal is “consumed”, it is removed from the database.²³ In this way, goals are kept persistent as long as, but not longer than, needed.
Finally, in Algorithm 6.1 there exist more rules than the two mentioned types (i.e., rules inserting goals and checking rules). We see that for each different event type (i.e., a and b in our case) we have one rule with a for_each predicate. It is defined by rule (3). Effectively, it implements a loop which, for any occurrence of an event, goes through each rule specified for that event (predicate) and fires it. For example, when a occurs, the first rule in the set of rules from Algorithm 6.1 will fire. This first rule will then loop, invoking all other rules specified for a (those having a in the rule head). In our case, there is only one such rule, namely rule (4). However, in general, there may be as many such rules as there are usages of a particular event in an event program (i.e., the set of all event patterns). Let us observe a situation in which we want to extend our event pattern set with an additional pattern that contains the event a (i.e., an additional usage of a). In this case, the rule set representing a set of event patterns needs to be updated with new rules. This can be done even at runtime. Let us assume the additional pattern to be monitored is iej ← k SEQ a. Then the only change we need to make is to add one rule to insert a goal and one checking rule (to the existing rule set).
In this subsection we explained the basics of the computation of complex events in ETALIS. For further details about other operators of the ETALIS Language for Events, the reader is referred to [32], Sect. 5.
23 Removal of “consumed” goals is often needed for space reasons but might be omitted if events are required in a log for further processing or analyzing.
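The interplay of goal inserting rules, checking rules and the for_each loop can be mimicked in a few lines of Python. This is a sketch for illustration, not the ETALIS implementation: Python is swapped in for the Prolog-style rules, the class and method names are hypothetical, event data other than timestamps is omitted, and every matching goal is consumed when the awaited event occurs, which is an assumed consumption policy.

```python
class SequenceDetector:
    """Goal-directed detection of binary patterns `head <- first SEQ second`."""

    def __init__(self):
        self.rules = {}   # event name -> list of rules fired on that event
        self.goals = []   # (awaited event, (occurred event, t1, t2), head)
        self.log = []     # every triggered event as (name, t_start, t_end)

    def add_pattern(self, head, first, second):
        def insert_goal(t1, t2):
            # Goal inserting rule, cf. rule (4): `first` occurred, wait for `second`.
            self.goals.append((second, (first, t1, t2), head))

        def check_goal(t3, t4):
            # Checking rule, cf. rule (5): `second` occurred, look for a goal.
            for goal in list(self.goals):
                if goal not in self.goals:
                    continue  # already consumed by a nested trigger
                awaited, (_, t1, t2), goal_head = goal
                if awaited == second and goal_head == head and t2 < t3:
                    self.goals.remove(goal)     # consume the goal
                    self.trigger(head, t1, t4)  # propagate the detected event

        # Existing rules are never modified; new rules are only appended.
        self.rules.setdefault(first, []).append(insert_goal)
        self.rules.setdefault(second, []).append(check_goal)

    def trigger(self, name, t1, t2):
        self.log.append((name, t1, t2))
        # Counterpart of for_each, cf. rule (3): fire every rule for this event.
        for rule in list(self.rules.get(name, [])):
            rule(t1, t2)


detector = SequenceDetector()
detector.add_pattern("ie1", "a", "b")  # ie1 <- a SEQ b
detector.add_pattern("e", "ie1", "c")  # e <- ie1 SEQ c
for name, t in (("a", 1), ("b", 2), ("c", 3)):
    detector.trigger(name, t, t)

# A further pattern using the event a can be added at runtime.
detector.add_pattern("iej", "k", "a")  # iej <- k SEQ a
```

After the three atomic events, `detector.log` contains the complex event e with the interval from the start of a to the end of c, and all goals have been consumed.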
7 Complex Event Processing Presentation

One of the main advantages of complex event processing is the recognition of situations of interest in an enormously intensive stream of information. In some of these situations, reactions can be performed automatically; in that case, a CEP system triggers predefined actions. However, there are situations that must additionally be analyzed in order to find the best possible reaction. This is usually the case when many situations emerge in a short period of time, so that the user needs an overview of the whole context. We refer to this as the Orient phase in the CEP cycle described in Sect. 4 (cf. Fig. 1). In general, even if the reaction of the system can be automated, there is always a human in the loop who must react if something goes wrong, especially in very critical situations.

As mentioned in the motivating example, a very efficient method to cope with this is the visual presentation of complex events in real time. Our approach is based on 3-D visualization, since several characteristics of a complex event can be presented at the same time. Indeed, if we assume that the consumer of this visualization is a decision maker, then the visualization should give a very quick view of as much relevant information as possible, so that she/he can easily select the most relevant items for the decision making process. There are several dimensions of an event, which we discuss here briefly. In the motivating example, the color of the complex event representation (i.e., a ball) can represent the type of the complex event (e.g., the topic discussed on Twitter that will trigger the stock-exchange event), the radius can represent the value of the complex event (e.g., the price of the company we are following: a bigger ball means a higher price), and the y-axis can represent the importance of a complex event (e.g., whether it appears often or seldom: greater y means more often). The remaining x-axis is reserved for time. Figure 6 displays the event flow from our motivating example, related to Twitter.

Another important feature for decision making support is that the visualization is active, i.e., a user can click on a flying ball in order to get more information about that instance of a complex event pattern. This means that once the user has selected a situation she/he finds relevant, the user can drill down into the available data in order to get more insight into the situation at hand, especially how that situation has emerged. Note that a complex event pattern can be fulfilled in many ways (different combinations of atomic events).
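The mapping from event characteristics to visual dimensions described above can be written down compactly. The following Python sketch is only an illustration of that mapping; the field names, the color palette, and the scale factor are hypothetical and are not taken from the actual visualization component.

```python
from dataclasses import dataclass


@dataclass
class ComplexEvent:
    event_type: str   # e.g., the topic discussed on Twitter
    value: float      # e.g., the price of the company being followed
    frequency: float  # how often the event appears (its importance)
    timestamp: float  # time of occurrence


@dataclass
class Ball:
    x: float
    y: float
    radius: float
    color: str


# Hypothetical palette and scale factor, chosen only for this illustration.
PALETTE = {"politics": "red", "economy": "blue"}
RADIUS_PER_UNIT = 0.1


def to_ball(event: ComplexEvent) -> Ball:
    """Map one complex event to the visual attributes of its ball."""
    return Ball(
        x=event.timestamp,                            # x-axis: time
        y=event.frequency,                            # y-axis: importance
        radius=event.value * RADIUS_PER_UNIT,         # radius: value
        color=PALETTE.get(event.event_type, "gray"),  # color: type
    )


ball = to_ball(ComplexEvent("economy", value=50.0, frequency=3.0, timestamp=12.0))
print(ball)
```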
Fig. 6 3-D visualization of complex-event streams
Finally, our approach uses the semantics incorporated in complex event patterns and their relations to support semantic filtering of detected complex events. The approach, termed “semantic glasses”, enables a higher level of abstraction over detected complex events by grouping them based on the subsumption hierarchy. For example, based on Fig. 6, semantic glasses will color the balls related to Merkel and Obama in the same color, which corresponds to the concept Person. The same holds for the balls related to China and India in Fig. 6. Additionally, semantic glasses use the information about semantic relations between complex event patterns to connect related patterns (situations) graphically.
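The grouping performed by semantic glasses can be sketched as a walk up the subsumption hierarchy. The Python fragment below is a minimal illustration under stated assumptions: the hierarchy is given as a simple child-to-parent table, the concept name Country for China and India is assumed (the text only names Person), and the colors are arbitrary.

```python
# Hypothetical subsumption hierarchy: instance or concept -> parent concept.
# "Country" is an assumed name; the text above only names the concept Person.
PARENT = {
    "Merkel": "Person",
    "Obama": "Person",
    "China": "Country",
    "India": "Country",
}
CONCEPT_COLOR = {"Person": "orange", "Country": "green"}  # illustrative colors


def generalize(entity, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the top."""
    for _ in range(levels):
        if entity not in PARENT:
            break
        entity = PARENT[entity]
    return entity


def semantic_glasses(entities, levels=1):
    """Group event subjects by their generalized concept, one color per group."""
    groups = {}
    for entity in entities:
        groups.setdefault(generalize(entity, levels), []).append(entity)
    return {
        concept: (CONCEPT_COLOR.get(concept, "gray"), members)
        for concept, members in groups.items()
    }


view = semantic_glasses(["Merkel", "China", "Obama", "India"])
print(view)
```

With `levels=0` every ball keeps its own group, so the same function covers both the detailed and the abstracted view.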
8 Related Work

In this section we first discuss related work from different fields of research relevant for the Plan phase, namely current approaches in event representation, business rule evolution and rule management. We also take a look at existing CEP systems for the Observe and Act phases, such as related logic-based approaches and others.
8.1 Event Representation

Different approaches exist for event representation. Distributed publish/subscribe architectures such as those presented in [13, 27] and [6] represent an event as a set of typed attributes. Each individual attribute has a type, a name and a value. The event as a whole has purely syntactical and structural value derived from its attributes. Attribute names and values are simple character strings. The attribute types belong to a predefined set of primitive types commonly found in programming languages. In our opinion, this kind of event model is not sufficient for CEPAT management since it does not provide any explicit semantics. The only justification for choosing this typing scheme is scalability.

The event stream engine AMIT (see [4] and [5]) targets high-performance situation detection. Via the offered user interface, one can model business situations based on events, situations, lifespans and keys. Within AMIT an event is the base entity and is specified with a set of attributes. All events belong to a certain class (a class contains events with similar characteristics) and have relationships to other events. Attributes can be references to other objects as well. This approach is close to ours in the sense that it describes the importance of generalization, specialization and relationships of events.
8.2 Event Representation in Existing CEP Systems

There is a large number of CEP systems, both academic and commercial. The Aurora/Borealis [1] system has the goal of supporting various kinds of real-time monitoring applications. Data items are composed of attribute values and are called tuples. All tuples on a stream have the same set of attributes. The system provides a toolset including a graphical query editor that simplifies the composition of streaming queries using graphical boxes and arrows. The Cayuga [11] system can be used to detect event patterns in the event stream as well. An event here is a tuple of attribute value pairs according to a schema. The goal of the Stream [9] project is to be able to consider both structured data streams and stored data together. The Esper [16] CEP engine supports events in XML, Java objects and simple attribute value pairs. There are also many commercial CEP vendors like TIBCO, Coral8/Aleri, StreamBase or Progress Apama providing tool suites including development environments and a CEP engine.

Most of the described systems use different SQL-like or XML-based complex event pattern definitions, which makes the patterns difficult to read and understand. They do not provide a semantic model for events and event patterns either. The management of event patterns is restricted to create, modify and delete. Furthermore, they do not use any underlying management methodology and do not tackle the issues of reusability and evolution of patterns.
8.3 Complex Event Detection

For the Observe and Act phases of CEP, on the other hand, a number of formal reactive frameworks have been proposed in order to capture relevant changes in a system and respond to those changes adequately. Work on modeling behavioral aspects of an application (using various forms of reactive rules) started in the Active Database community a long time ago. Different aspects have been studied extensively, ranging from modeling and execution of rules to discussing architectural issues [26]. However, what is clearly missing in this work is a clean integration of active behavior with pure deductive and temporal capabilities.

A lot of work [12, 21, 24, 25] in the area of rule-based CEP has been carried out, proposing various kinds of logic rule-based approaches to process complex events. As pointed out in [12], rules can be effectively used for describing so-called “virtual” event patterns. There are a number of other reasons to use rules: Rules serve as an abstraction mechanism and offer a higher-level event description. Also, rules allow for an easy extraction of different views of the same reactive system. Rules are suitable to mediate between the same events differently represented in various interacting reactive systems. Finally, rules can be used for reasoning about causal relationships between events.

To achieve the aforementioned aims, these approaches all represent complex events as rules (or queries). Rules can then be processed either in a bottom-up manner [33], a top-down manner [2, 15], or in a manner that combines both [10]. However, none of these evaluation strategies has been designed specifically for event-driven computation. They are rather suited for a request-response paradigm. That is, given (and triggered by) a request, an inference engine will search for and respond with an answer. This means that, for a given event pattern, an event inference engine needs to check whether this pattern has been satisfied or not. The check is performed at the time when such a request is posed. If the pattern is satisfied by the time the request is processed, a complex event will be reported.
If not, the pattern is not detected until the next time the same request is processed (though it can become satisfied in between the two checks, remaining undetected for the time being). For instance, [25] follows the mentioned request-response (or so-called query-driven24) approach. It proposes to define queries that are processed repeatedly at given intervals, e.g., every 10 seconds, trying to discover new events. However, events are generally not periodic or, if they are, might have differing periods; nevertheless, complex events should be detected as soon as they occur (not in a predefined time window). This holds in particular for time-critical scenarios such as monitoring stock markets or nuclear power plants.

To overcome this issue, an incremental evaluation was proposed in [12]. The approach is aimed at avoiding redundant computations (particularly re-computation of joins) every time a new event arrives. The authors suggest the utilization of relational algebra evaluation techniques such as incremental maintenance of materialized views [19].

A big portion of related work in the area of rule-based CEP is grounded on the Rete algorithm [18]. Rete is an efficient pattern matching algorithm, and it has been the basis for many production rule systems (CLIPS,25 TIBCO BusinessEvents,26 Jess,27 Drools,28 BizTalk Rules Engine,29 etc.). The algorithm creates a decision tree that combines the patterns in all the rules of the knowledge base. Rete was intended to improve the speed of forward chained production rule systems at the cost of space for storing intermediate results. The left hand side of a production rule can be utilized to form a complex event pattern, in which case Rete is used for CEP. Thanks to forward chaining of rules, Rete is also event-driven (data-driven).

Close to our approach is [20]. It is an attempt to implement business rules also with a Rete-like algorithm. However, the work proposes the use of subgoals and data-driven backward chaining rules. It has deductive capabilities, and detects satisfied conditions in business rules (using backward chaining) as soon as relevant facts become available. In our work, we focus rather on complex event detection, and provide a framework for event processing in pure Logic Programming style [22]. Our framework can accommodate not only events but also conditions and actions (i.e., reactions to events). As this is not a topic of this chapter, the interested reader is referred to our previous work [8].

Concluding this section, many of the mentioned studies aim to use more formal semantics in event processing. Our approach, based on our language for event processing, may also be seen as an attempt towards that goal. It features data-driven computation of complex events as well as rich deductive capabilities.

24 If a request is represented as a query (which is the usual case).
25 CLIPS: http://clipsrules.sourceforge.net/.
26 TIBCO BusinessEvents: http://www.tibco.com/software/complex-event-processing/businessevents/businessevents.jsp.
27 Jess: http://jessrules.com/.
28 Drools: http://jboss.org/drools/.
29 BizTalk Rules Engine: http://msdn.microsoft.com/en-us/library/dd879260%28BTS.10%29.aspx.

9 Conclusion

This chapter illustrates the role of semantics in the real-time processing of large streams of events, extracted from data and textual information. Current non-semantic approaches lack robustness (dealing with very heterogeneous events) and intelligence (reasoning about complex events). We presented a novel approach for more intelligent complex event processing, based on the use of semantic technologies, and elaborated on the advantages of such an approach in detail:
• Since the semantics of events is represented more explicitly, events can be combined semantically, which ensures more precise descriptions of the situations of interest.
• Semantics enables the usage of domain knowledge in the detection process, so that relevant situations can be discovered in a more robust way.
• Management of situations of interest is facilitated by semantics (e.g., semantic analysis of the problems in the description of situations can lead to better suggestions for what should be changed in an outdated model).

This work opens new challenges for the Semantic Web community: dealing with real-time aspects and dynamics, which will enable the creation of a new generation of mission-critical applications.
References

1. Abadi, D., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Erwin, C., Galvez, E., Hatoun, M., Maskey, A., Rasin, A., Singer, A., Stonebraker, M., Tatbul, N., Xing, Y., Yan, R., Zdonik, S.: Aurora: a data stream management system. In: SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 666–666. ACM, New York (2003). doi:10.1145/872757.872855
2. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading (1995). 0-201-53771-0
3. Adaikkalavan, R., Chakravarthy, S.: Snoopib: interval-based event specification and detection for active databases. In: Data Knowledge Engineering. Elsevier, Amsterdam (2006)
4. Adi, A., Etzion, O.: Amit - the situation manager. VLDB J. 13(2), 177–203 (2004). doi:10.1007/s00778-003-0108-y
5. Adi, A., Botzer, D., Etzion, O.: Semantic event model and its implication on situation detection. In: ECIS (2000)
6. Aguilera, M.K., Strom, R.E., Sturman, D.C., Astley, M., Chandra, T.D.: Matching events in a content-based subscription system. In: PODC ’99: Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 53–61. ACM, New York (1999). doi:10.1145/301308.301326
7. Allen, J.F.: Maintaining knowledge about temporal intervals. Commun. ACM 26(11), 832–843 (1983)
8. Anicic, D., Stojanovic, N.: Expressive logical framework for reasoning about complex events and situations. In: Intelligent Event Processing - AAAI Spring Symposium 2009, Stanford University, California (2009)
9. Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J.: Stream: the Stanford data stream management system. Technical Report 200420, Stanford InfoLab (2004). http://ilpubs.stanford.edu:8090/641/
10. Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J.D.: Magic sets and other strange ways to implement logic programs. In: PODS ’86, Massachusetts, United States. ACM, New York (1986)
11. Brenna, L., Demers, A., Gehrke, J., Hong, M., Ossher, J., Panda, B., Riedewald, M., Thatte, M., White, W.: Cayuga: a high-performance event processing engine. In: SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1100–1102. ACM, New York (2007). doi:10.1145/1247480.1247620
12. Bry, F., Eckert, M.: Rule-based composite event queries: the language xchangeeq and its semantics. In: RR. Springer, Berlin (2007)
13. Carzaniga, A., Rosenblum, D.S., Wolf, A.L.: Design and evaluation of a wide-area event notification service. ACM Trans. Comput. Syst. 19(3), 332–383 (2001). doi:10.1145/380749.380767
14. Chakravarthy, S., Mishra, D.: Snoop: an expressive event specification language for active databases. Data Knowl. Eng. 14(1), 1–26 (1994)
15. Chen, W., Warren, D.S.: Tabled evaluation with delaying for general logic programs. J. ACM 43(1), 20–74 (1996)
16. Esper: Esper Version 3.2.0, Espertech Inc. Online Resource. http://esper.codehaus.org/ (2009). Last visited: January 2010
17. Etzion, O., Niblett, P.: Event Processing in Action, Manning (2010). 978-1935182214
18. Forgy, C.L.: Rete: a fast algorithm for the many pattern/many object pattern match problem. Artif. Intell. 19, 17–37 (1982)
19. Gupta, A., Mumick, I.S.: Magic sets and other strange ways to implement logic programs. IEEE Data Eng. Bull. (1985)
20. Haley, P.: Data-driven backward chaining. In: International Joint Conferences on Artificial Intelligence. Milan, Italy (1987)
21. Lausen, G., Ludäscher, B., May, W.: On active deductive databases: the statelog approach. In: ILPS ’97, (1998)
22. Lloyd, J.W.: Foundations of Logic Programming. Comput. Sci. Press, New York (1989). 07167-8162-X
23. Luckham, D.C.: The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley, Boston (2001). 0201727897
24. Motakis, I., Zaniolo, C.: Composite temporal events in active database rules: a logic-oriented approach. In: Deductive and Object-Oriented Databases. Springer, Berlin (1995)
25. Paschke, A., Kozlenkov, A., Boley, H.: A homogeneous reaction rules language for complex event processing. In: EDA-PS. ACM, New York (2007)
26. Paton, N.W., Díaz, O.: Active database systems. ACM Comput. Surv. 31(1), 63–103 (1999)
27. Pietzuch, P.R., Bacon, J.: Hermes: a distributed event-based middleware architecture. In: ICDCSW ’02: Proceedings of the 22nd International Conference on Distributed Computing Systems, pp. 611–618. IEEE Comput. Soc., Washington (2002)
28. Rozsnyai, S., Schiefer, J., Schatten, A.: Concepts and models for typing events for event-based systems. In: DEBS ’07: Proceedings of the 2007 Inaugural International Conference on Distributed Event-Based Systems, pp. 62–70. ACM, New York (2007). doi:10.1145/1266894.1266904
29. Sen, S., Stojanovic, N., Stojanovic, L.: GRUVe: a methodology for complex event pattern life cycle management. In: Proceedings of the 22nd International Conference on Advanced Information Systems Engineering (CAiSE), Hammamet, Tunisia, June 7–9, 2010. Lecture Notes in Computer Science, vol. 6051. Springer, Berlin (2010)
30. Stojanovic, L.: Methods and tools for ontology evolution. Ph.D. Thesis, University of Karlsruhe, Germany (2004)
31. Stojanovic, N.: Ontology-based information retrieval. Ph.D. Thesis, University of Karlsruhe, Germany (2005)
32. Technical-Report: A declarative and rule-based language for complex event processing. http://sites.google.com/site/anonymousresearchsite/paper.pdf
33. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vols. I and II, 2nd edn. Freeman, New York (1990)
Semantics in Knowledge Management

Andreas Abecker, Ernst Biesalski, Simone Braun, Mark Hefke, and Valentin Zacharias

Abstract This chapter illustrates by example the role that semantic technologies can play in knowledge management. Starting from a conceptual overview of knowledge management, the role of semantic technologies is explored along two dimensions: (1) the degree of formality of explicit knowledge (from tags in folksonomies to F-Logic rules); and (2) the degree of externalization of knowledge (from implicit knowledge in humans’ heads to actionable knowledge in expert systems). Several examples from industry and research are used to illustrate operating points along these dimensions.
1 Introduction

Following Abecker [2], Knowledge Management (KM) is a:
• systematically managed organizational activity;
• which views implicit and explicit knowledge as a key strategic resource of an organization, and thus
• aims at improving the handling of knowledge at the individual, team, organization, and inter-organizational level;
• in order to achieve organizational goals such as better innovation, higher quality, more cost-effectiveness, or shorter time-to-market;
• by holistically and synergetically employing methods, tools, techniques, and theories from manifold areas such as information and communication technologies, strategic planning, change management, business process management, innovation management, human resource management, and others;
• in order to achieve a planned impact on people, processes, technology, and culture in an organization.

A. Abecker, FZI Forschungszentrum Informatik, Karlsruhe, Germany, e-mail: [email protected]
A. Abecker, disy Informationssysteme GmbH, Karlsruhe, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_15, © Springer-Verlag Berlin Heidelberg 2011
Knowledge Management (KM) is a holistic approach that developed in the mid 1990s from different roots in management sciences, information technology, pedagogics and more. It takes a unifying perspective on organizational activities which were formerly (and still are, in most organizations) scattered amongst different organizational functions such as corporate strategy, research and documentation, innovation management, human-resource and competence management, quality management, IT, and more. It aims at a joint and synergetic planning, management, and monitoring of such functions, as well as at a thorough understanding of their interoperation. KM initiatives (cf. [2]) typically adopt:
• either the so-called Codification approach, which aims at explicit knowledge, e.g. in the form of lessons-learned databases, best practice documents, frequently-asked questions systems, experience case bases, or, more traditionally, textbook knowledge, technical documentation, project documentation, Internet or Intranet knowledge (“corporate Wikipedia”), etc.
• or the so-called Personalization approach, which focuses on the human being as the knowledge bearer, creator, and user in his communication and collaboration processes; activities here address topics such as incentives for knowledge sharing, expert finder systems and social software tools, personal and group competence development, communication support through technology, process design, architecture, corporate culture, etc.

Abecker and van Elst [3] discuss the role of technology in KM, in particular the use of ontologies. They conclude that IT is very often considered an important enabler of KM, but very seldom the really critical success factor. Often, KM can already be facilitated by “low tech” solutions, if appropriately used; but technology must always be embedded in a holistic approach and used with a reasonable system understanding and methodological guidance.
Nevertheless, knowledge technologies and Semantic Web technologies can offer many interesting ideas for innovative KM support, often addressing some of the more difficult aspects, such as entry barriers for KM systems, or dealing with implicit or tacit knowledge. This article also illustrates that there is a plethora of promising points of application for semantic technologies in Knowledge Management, not “the one”, ultimate KM tool suite.

This chapter will corroborate and further illustrate these findings by presenting examples that show the purposeful integration of technological and methodological measures in KM. It will also span at least two dimensions along which concrete IT applications in KM can be classified:
1. If explicit knowledge plays a role, concrete approaches can vary considerably with respect to the degree of formality of such explicit knowledge (and of the meta-knowledge used for describing it). For instance, if knowledge is only represented in unstructured text1 (which is the cheapest form) without any additional meta-information, all retrieval approaches must rely on full-text search with all its known drawbacks; if informal tags are added, human users can be supported in finding relevant information; if quality-assured relational metadata referring to a domain ontology are added, sophisticated semantic search is enabled, but at the price of much more costly metadata creation. In this dimension, many more operating points are possible.
2. A very similar (but not identical) trade-off can be identified with respect to the level of externalization of the knowledge which shall be managed. Here, one extreme is to externalize almost nothing, to keep the human beings as the main knowledge bearers, and to try to identify the right person for a given task through a yellow-page system, a skill management system, etc. A medium approach would be to write down project lessons learned in a semi-structured format, but leave it to the user to assess such a knowledge item’s specific applicability in a new situation and to transfer it to a new context. The other extreme are knowledge-based systems which can fully automate complex tasks like diagnosis, planning, configuration, etc., but at the cost of extremely expensive system creation and of a certain brittleness of the system’s capabilities at the borderlines of its proven application area.

1 For a more detailed discussion of structuredness, formality, and explicitness, please refer to [28]. In KM, one typically considers natural-language texts or other multimedia representations as “unstructured”, compared to structured or semi-structured representations in relational or in XML or RDF databases. Völkel [28] also elaborates on cost models and cost-benefit considerations in personal KM, whereas [30] address such issues in more collaborative environments.

Roughly since the year 2004 up till the time of writing, many industry and publicly funded research projects have been performed in Prof. Rudi Studer’s groups at AIFB and FZI and in their spin-off company ontoprise GmbH, altogether creating manifold different contributions to the KM state of research and practice, and leading to a number of dissertation theses covering different facets of the KM landscape. In this chapter, we collect major results of some of these theses which illustrate many of the points made above and which represent different working points in the design space spanned by the two trade-offs above.
The following research endeavors will be elaborated in more detail in the following sections:
• We start with the KMIR (Knowledge Management Implementation and Recommendation) method and tool for best practice-based KM introduction (Sect. 2). KMIR supports the introduction of KM in an organization, i.e. it treats KM from the meta level of a knowledge manager or KM consultant. Technically, KMIR is based on an ontology-based Case-Based Retrieval approach which uses ontologies as the basis for complex similarity assessment for best-practice case retrieval.
• HRMORE takes up the technology of ontology-based similarity assessment and applies it to the management of human competence profiles in Human Resource Management (Sect. 3) as an important element of KM. Appropriate Strategic HRM processes (aligned with overall knowledge strategies) are defined around the technological basis of an HR Data Warehouse.
• We keep the application area of competence management, but replace the centralistic, top-down methods of HRMORE by a more participatory, collaborative approach and come to people tagging with the social semantic bookmarking tool SOBOLEO (Sect. 4), thus addressing the dynamics and bottom-up aspects of knowledge-intensive businesses.
• Coming to the area of completely externalized and fully operationalizable knowledge, we stress some of the work done and insights gained in the Halo project striving for fully-automated question answering in natural sciences (Sect. 5).

We now present these projects in more detail, before we conclude with a summary in Sect. 6.
2 Semantically-Enabled Introduction of Knowledge Management

2.1 KMIR Project Context

The Knowledge Management Implementation and Recommendation Framework (KMIR) describes an ontology-based system for supporting consulting agencies in accompanying an organization’s Knowledge Management (KM) implementation. In KMIR, Best Practice Cases (BPCs) of successful KM introductions are captured in an ontology-based case base. The organization profile and problem description of a newly analyzed company can be matched against descriptions of stored BPCs. The most similar BPC is then retrieved as a recommendation, may be adapted to the new situation, and is finally reused by the company which wants to start a KM project. The KMIR framework combines Semantic Web and Case-based Reasoning (CBR) techniques. It adopts the holistic approach of KM by considering technological, organizational and human aspects of KM, as well as the organizational culture.
2.2 KMIR Objectives and Approach An organization’s KM introduction has to overcome manifold barriers in the organizational, technical, and cultural dimension. In order to handle such a complex endeavor and to flexibly react on new customers’ knowledge problems, a KM consulting agency should collect and capture as many experiences as possible from already accomplished KM implementation projects. This can be done, e.g., through running project debriefings at the end of KM introduction projects, thus trying to externalize and structure personal experiences of senior consultants in the form of BPCs. Based on this externalized experience knowledge, consultants are to a certain degree able to reuse positive experience and to avoid mistakes that have been made in previous projects. The practical problem is that descriptions of BPCs are usually in the form of unstructured reports. Therefore, they are normally not directly applicable to a new customer’s needs. On this account, KMIR supports consulting agencies in accompanying an organization’s implementation of KM. BPCs of successfully conducted introductions are captured by the system’s ontology-based case base in order to reuse them for further projects. The technical solution to the aforementioned problem matches a newly defined organization profile against existing BPCs in the case base. The most similar
Semantics in Knowledge Management
285
Fig. 1 CBR Cycle according to [1]
retrieved BPC is returned as a recommendation, then adapted and finally reused by the accompanied organization. KMIR is methodologically based on the CBR Cycle by Aamodt and Plaza [1]. The four processes of the CBR Cycle comprise Retrieval of the most similar case(s) for a new problem, Reuse of information and knowledge from the retrieved case(s) in order to solve the new problem, Revision of the proposed solution, and finally Retention of a newly originated case for solving new problems in the future (see Fig. 1).
We performed the following steps for developing KMIR:
1. Identification of indicators for the description/portability of KM BPCs.
2. Verification of identified indicators in the form of an open survey.
3. Development of a “reference model” and ontology-based case base implementing the evaluation results [16].
4. Collection of (unstructured) episodic cases from different information sources which describe “real” events.
5. Definition of “prototypical” cases to capture in the case base innovative technical solutions, new methods and practices that are not widely used in organizations (these hypothetical cases complement the “real” ones in order to sufficiently cover the space of possible organizational problem situations).
6. Development and implementation of the KMIR Framework Architecture.
7. Structuring and storing the cases from steps 4 and 5 in the case base.
For technically supporting all processes of the CBR Cycle, we have designed and implemented the KMIR architecture, which consists of the following components:
1. An ontology-based case base containing KM BPCs: BPCs are represented as interrelated bundles of instances of concepts described in an overall KM BPC ontology.
2. A Case Editing Component: supports a consulting agency (a) in the structured description of BPCs, or of single problem-solution pairs, based on accompanied KM introduction projects; and (b) in conducting an organizational audit at the customer organization in order to identify the organization’s general structure, technical infrastructure, knowledge problems and knowledge goals.
3. An ontology-based Matching Component: returns the most similar cases by matching a customer request with existing BPCs in the case base.
A. Abecker et al.
Fig. 2 KMIR framework architecture
4. A Solution Generator: associates a customer’s profile, knowledge problems and goals with existing solutions, methods and experiences of the most similar BPC in order to offer KM recommendations to the customer (i.e., about how to introduce KM, based on retrieved and adapted most similar cases).
5. A Learning Component: stores adapted, reused and revised best practice cases as new cases in the case base.
6. Administration Functions: support the configuration of similarity measures and filters, and provide further means for maintaining the CB.
An overview of the KMIR framework architecture components and their interrelations is given in Fig. 2. The components are described in more detail in the following subsections.
2.2.1 Ontology-Based Case Base
Each BPC is stored as a set of interlinked “profile instances” in the ontology-based case base (Fig. 3). The conceptual level of the ontology consists of the main concepts “Company”, “Profile”, “Problem”, “Goal”, “Solution” and “Method”. The concepts “Company” and “Profile” are linked by the property “Company_has_Profile”. Knowledge problems which the companies had to solve are subdivided into organizational, technical and cultural ones. A “Knowledge Goal” can be normative, strategic or operative. Each profile is linked
Fig. 3 Excerpt of the KMIR Ontology
to one or more problem(s) or goal(s) by the properties “Profile_has_Problem” and “Profile_has_Goal”. A problem is linked to one or more achieved solution(s) by the property “Problem_has_Solution” and its inverse property “Solution_solves_Problem”. Problems can address a specific core process of the Probst KM Model (e.g., knowledge acquisition, sharing, etc.) [21]. Problems are divided into sub-problems by the property “Problem_consists_of/is_part_of_problem”. The concept “Problem” has the sub-concepts “Organizational Problem”, “Technical Problem” and “Cultural Problem”, because the implementation of a KM system could depend, for instance, on a specific technology and, furthermore, require solving a specific organizational problem as well as a cultural change in the organization. The concept “Goal” has the more specific sub-concepts “Normative Goal”, “Strategic Goal” and “Operative Goal”. Every solution can be combined with a method (property: “uses_method”), a knowledge instrument (property: “uses_knowledge_instrument”), or a specific technology or software tool, which in turn may depend on a technology (properties: “uses_Software_tool/Technology” and “depends_on_Technology”). Moreover, a solution, software tool, or technology can consist of, or be a part of, other solutions (and likewise other software tools and technologies). Several other concepts of the ontology are structured by a taxonomy so that the top concepts can be specified more precisely.
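To make the linkage between the main concepts more tangible, the following minimal Python sketch mirrors a small part of the structure described above. It is only an illustration under stated assumptions: KMIR stores these links as ontology instances and properties, not as Python objects, and the class layout, the helper `add_solution` and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProblemKind(Enum):
    # Sub-concepts of "Problem" as named in the text.
    ORGANIZATIONAL = "Organizational Problem"
    TECHNICAL = "Technical Problem"
    CULTURAL = "Cultural Problem"


@dataclass(eq=False)  # identity comparison, because instances reference each other
class Solution:
    name: str
    uses_method: List[str] = field(default_factory=list)
    solves_problem: List["Problem"] = field(default_factory=list)  # inverse link


@dataclass(eq=False)
class Problem:
    description: str
    kind: ProblemKind
    has_solution: List[Solution] = field(default_factory=list)

    def add_solution(self, solution: Solution) -> None:
        # Hypothetical helper: keeps "Problem_has_Solution" and its inverse
        # "Solution_solves_Problem" in step.
        self.has_solution.append(solution)
        solution.solves_problem.append(self)


@dataclass(eq=False)
class Profile:
    has_problem: List[Problem] = field(default_factory=list)


@dataclass(eq=False)
class Company:
    name: str
    has_profile: Profile = field(default_factory=Profile)
```

A case would then be built by creating a `Company`, appending `Problem` instances to its profile, and attaching solutions with `add_solution`, so that both directions of the problem-solution link can be navigated.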
2.2.2 Description of KM Best Practice Cases
Selected and created episodic and prototypical BPCs are described using a Case Editing Component, a Web-based user interface which is part of the KMIR framework architecture and allows for a template-oriented filling of all known attributes of a BPC. Attribute values are entered as text or numbers, or chosen from pull-down menus. The interface is automatically generated from the ontology defining the case structure. Finally, a described best practice case is stored directly in the ontology as a set of instances, attributes and relations.
2.2.3 The Organizational Audit
The Case Editing Component is also used later to support, e.g., a consulting agency in capturing a new customer’s organization profile, i.e., its organizational structure, technical infrastructure and economic aspects, as well as its normative, strategic, and operational knowledge goals. Additionally, the organization may define target costs for the implementation of a KM solution, may describe or select organizational, technical or cultural knowledge problems and requirements, and finally assign them to typical knowledge processes. KMIR further supports the association of weights with all described aspects, in order to attach more or less importance to them. The profile obtained from the organizational audit is stored directly as a set of instances, attributes and relations in the ontology which structures the CB. In order to relieve consultants of filling in all characteristic values of the customer profile
that have to be used later for the case retrieval, several characteristic values are automatically created or transformed by derivation rules and transformation rules before a new case is stored in the case base. Derivation rules infer the organization type (e.g., “Small and Medium Enterprise”) from the characteristic values “turnover” and “company size”; transformation rules convert values between different scale units (e.g., of time and currency). Further, it is possible to define only one or more problems or problem-solution pairs, because in practice, customers often have already accomplished several KM activities and are now searching for a solution to one or more new particular problem(s).
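The two rule types can be illustrated with a short Python sketch. It is not KMIR's rule base: the thresholds follow the common EU-style SME definition (fewer than 250 employees, at most EUR 50 million turnover) and are an assumption here, as are the function names, the label "Large Enterprise" and the caller-supplied exchange rates.

```python
# Assumed thresholds (EU-style SME definition); KMIR's actual rules may differ.
SME_MAX_EMPLOYEES = 250
SME_MAX_TURNOVER_EUR = 50_000_000


def derive_organization_type(employees: int, turnover_eur: float) -> str:
    """Derivation rule: infer the organization type from size and turnover."""
    if employees < SME_MAX_EMPLOYEES and turnover_eur <= SME_MAX_TURNOVER_EUR:
        return "Small and Medium Enterprise"
    return "Large Enterprise"  # hypothetical label for all other organizations


DAYS_PER_UNIT = {"days": 1, "weeks": 7}


def to_days(value: float, unit: str) -> float:
    """Transformation rule: normalize a duration to days."""
    return value * DAYS_PER_UNIT[unit]


def to_eur(amount: float, currency: str, rates_to_eur: dict) -> float:
    """Transformation rule: normalize an amount to EUR with caller-supplied rates."""
    return amount * rates_to_eur[currency]
```

Normalizing all values to one unit before storage means that the later similarity computation only ever compares like with like.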
2.2.4 Case Retrieval Process
In order to retrieve BPCs that are as similar as possible to a newly created customer profile obtained from the organizational audit, or simply to retrieve solutions for one or more requested problems, a matching component matches the profile or a given problem (set) against already existing BPCs or problems from the CB. This is done by combining syntax-based with semantic similarity measures [10]. Syntax-based similarity measures in our system are distance-based similarity, syntactical similarity (edit distance combined with a stop-word filter and stemming) and equality for comparison of values of numeric data types from the organization profile with those of existing BPCs. Additionally, the profile from the self-description process is matched against profiles of the CB using semantic similarity measures in order to compute the similarity between (sets of) instances on the basis of their corresponding concepts and relations to other objects (relation similarity) as well as taxonomic similarity. Relation similarity is used, on the one hand, for comparing attribute values of instances that are not direct instantiations of the concept “profile”, but instantiations of further concepts (e.g., of the concept “problem” or “software”) that are linked to the concept “profile” (using the relations “profile_has_problem” and “profile_uses_software”). On the other hand, this similarity type is used for, e.g., comparing instantiations of the concept “problem” that are linked to further instantiations of the concept “Core process” using the relation “(problem) addresses core process”. Taxonomic similarity identifies similar software tools or technologies for the requesting organization, which are based on problem-solution pairs of BPCs similar to the defined problem(s) from the organization profile. For example, an organization is searching for an extension of its existing groupware system using an ontology-based tool solution.
The matching component identifies a similar groupware system in the case base, which also served as a basis for such an extension. This finding is made by checking all instances of the corresponding software sub-concept “groupware” and recommending the assigned solution to the requesting organization. Furthermore, taxonomic similarity is used to compare particular attribute instances based on the conceptual level in order to improve results of the syntactic similarity computation (e.g., matching the attribute “sector” of a profile based on the concept taxonomy “primary”, “secondary” and “tertiary sector”). Finally, a weighted average determines the global similarity of all local similarities.
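The combination of local similarities into a global one can be sketched as follows. This is a minimal illustration, not the similarity framework used in KMIR: the linear form of the distance-based measure, the normalization of the edit distance by the longer string, and all function names are assumptions, and the stop-word filter and stemming mentioned above are omitted.

```python
def distance_similarity(a: float, b: float, maxdiff: float) -> float:
    """1.0 for equal values, falling linearly to 0.0 at a difference of maxdiff."""
    return max(0.0, 1.0 - abs(a - b) / maxdiff)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def syntactic_similarity(s: str, t: str) -> float:
    """Case-insensitive edit distance, normalized by the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s.lower(), t.lower()) / longest


def global_similarity(local: dict, weights: dict) -> float:
    """Weighted average of local similarities, keyed by attribute name."""
    total_weight = sum(weights[name] for name in local)
    return sum(weights[name] * value for name, value in local.items()) / total_weight
```

For instance, a local similarity of 0.75 for company size (weight 1) and 1.0 for sector (weight 3) yields a global similarity of (0.75 + 3.0) / 4 = 0.9375.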
For the technical realization of the matching component, we have integrated a Java-based framework for instance-similarities in ontologies into the KMIR architecture. We added a user interface for parameterizing the user-defined selection and composition of (atomic) similarity measures, and their assignment with weights directly in KMIR. Settings are stored in an XML-File and processed by the underlying similarity framework. Depending on the selected similarity measure(s), attributes like maxdiff (distance-based similarity) or recursion depth (instance Relation similarity) can be defined. Due to the complexity of computing ontology-specific similarity measures, the similarity framework provides two different types of filters, pre-filters and post-filters in order to constrain the number of instances to be considered for similarity computation. They can be individually combined from (atomic) filters. All filters are configurable either by a KMIR user interface or directly in the XML-File.
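How pre-filters and post-filters limit the work of the similarity computation can be sketched in a few lines of Python. The sketch assumes that pre-filters discard instances before any similarity is computed and that post-filters prune the scored results; the text does not define the filter types further, so the function names, the dictionary-based cases and the threshold-based post-filter are hypothetical.

```python
def all_of(*filters):
    """Combine atomic filters into one filter that requires all of them."""
    return lambda case: all(f(case) for f in filters)


def retrieve(query, cases, similarity, pre_filter, min_similarity):
    """Return (score, case) pairs, best first.

    pre_filter     -- applied before the (expensive) similarity computation
    min_similarity -- post-filter: drops scored cases below this value
    """
    candidates = [case for case in cases if pre_filter(case)]
    scored = [(similarity(query, case), case) for case in candidates]
    kept = [(score, case) for score, case in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

With a pre-filter such as "same sector as the query", the similarity function is only called for the cases that pass it, which is the point of filtering when ontology-specific measures are costly.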
2.2.5 Recommendations and Solution Generation
The Recommendations Component provides recommendations based on the identified most similar case(s). This is done by presenting one or more profile(s) retrieved within the matching process that correspond to the profile from the organizational audit, including similar problems as well as interlinked solutions and methods to solve these problems. In addition, the system user can identify, for each profile’s problem-solution pair, further relations to other KM aspects by browsing the structure of the ontology. The identified most similar case(s) also comprise information about implementation costs and time, qualitative and quantitative benefits, savings, sustainability, application to other fields, external support/funding and others. An example of a so-called “holistic recommendation” would be the recommendation to use a specific tool, technology or knowledge instrument combined with a specific organizational method, as well as with a required organizational culture program. Moreover, the system provides a Solution Generation Component which supports the automatic generation of solutions by interlinking problems with solutions of similar problems from the CB. This can be done either for single problems or for all problems of a selected profile, based on a predetermined minimal similarity value. When generating solutions for profiles, the Solution Generator only creates solutions for one or more problem(s) if a profile can be identified where the global similarity of all profile attributes reaches at least a predetermined value. Moreover, we are currently developing modification rules in order to realize automatic case adaptation in “easy situations”.
For instance, it is planned to implement a verification component which allows KMIR to check if a specific “software application” makes sense for a recommendation or solution generation, based on background information defined by further specific attributes (e.g., compatibility, interoperability, scalability and extensibility of the software tool to be recommended), and on this basis also adapt technical solutions from a BPC to specific needs of a new customer.
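The interlinking of problems with solutions of similar problems, subject to a minimal similarity value, can be sketched as follows. This is only an illustration: the word-overlap measure `jaccard` is a simple stand-in for KMIR's combined similarity measures, and the data layout and function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: overlap of the word sets of two descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def generate_solutions(new_problems, case_problems, similarity, min_similarity):
    """Link each new problem to the solutions of its most similar stored problem.

    Problems whose best match stays below min_similarity receive no solution.
    """
    links = {}
    for problem in new_problems:
        scored = [(similarity(problem, stored["text"]), stored)
                  for stored in case_problems]
        if not scored:
            continue
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= min_similarity:
            links[problem] = list(best["solutions"])
    return links
```

The threshold keeps the generator from proposing solutions on the basis of weak matches; raising it trades coverage for precision.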
2.2.6 Feedback Loop and Learning
Successfully accomplished KM implementations are added as new BPCs to the CB. This is done by technically supporting the revision of the newly constructed KM introduction solution (e.g., editing or correcting existing information in the generated solution, or providing additional information such as new experiences or benefits). After that, the adapted, reused and revised BPC is stored as a newly learned case in the CB. The learning component collects lessons learned regarding successful or inappropriate recommendations in order to refine or extend the BPCs as well as the general structure of the CB. This is done by providing an evaluation function to the requesting organization. The consulting agency then has the opportunity to describe to the customer experiences made with the given recommendations regarding their correctness and capability to solve a specific customer problem. The evaluation results flow directly into the learning component and are considered in the next case retrieval, where they are used for an internal ranking of the best practice cases in the CB. With these results, the recommendations component is able to provide better recommendations to new requesting organizations in the future. Poorly evaluated recommendations with a low ranking can either be optimized or removed from the CB.
2.3 KMIR: Concluding Remarks
We have described the KMIR framework, which supports consulting agencies in accompanying a customer’s introduction of KM by providing recommendations. In order to develop KMIR, an extensive collection, analysis and structuring of BPCs from different information sources was carried out. Using that initially developed CB, the KMIR framework has been evaluated with regard to retrieval quality and processing time [17]. The evaluation results showed that the use of Semantic Technologies could significantly improve the retrieval quality of the system, while the calculation time could be kept within an acceptable range. The KMIR framework currently comprises 54 structured episodic Best Practice Cases (BPCs) of real KM introductions. 40% of the BPCs are provided by SMEs and 60% by LSEs. The BPCs comprise 300 knowledge goals, 250 knowledge-problem descriptions and 170 solutions. KMIR can be accessed at: http://www.kmir.de. The similarity framework developed for KMIR has been reused in other contexts, especially in the HRMORE project described below. Though already a couple of years old, KMIR still seems to be the only structured collection of KM best practices available online. In contrast to first-generation, top-down KM introduction methods (see, for instance, [20]), our experience with industrial use cases suggests that a BPC-oriented “middle-out method” may be much more realistic in practice. Furthermore, the technological basis of KMIR, an ontology-based CBR system for best-practice cases, should be applicable in many other consulting areas and project domains beyond KM introduction.
Fig. 4 Tasks of human resource management
3 HRMORE: HRM and KM in Large Organizations
3.1 HRMORE Project Context: Relationships Between HRM and KM
The HRMORE project started as contracted research for DaimlerChrysler; this pilot project led to a dissertation that was carried out fully embedded in the software-support team of the HR department at DaimlerChrysler AG, Wörth Plant [6]. The main subject of Human Resource Management (HRM) is the working human being. Personnel are the determining factor for the overall success of companies. Therefore, the best possible use of human resources should be a strategic priority. The HRM department is responsible for keeping the company in good shape, which means having the necessary number of adequately skilled employees at the right time in the right place. Therefore, we define HRM as the strategic and target-oriented composition, regulation and development of all areas that affect human resources in a company. Important tasks of HRM are (see Fig. 4): Personnel Recruitment, Personnel Placement, Personnel Development, Dismissals, Personnel Planning, Personnel Controlling, and Personnel Administration. Strategic HRM (SHRM) is the aggregation of all activities that reference the efforts of individuals to formulate and reach strategic goals of a company [24]. This definition leads to the basic concepts of SHRM [23]:
• Consideration of actual and future human resources in companies.
• Alignment of strategic and HRM goals of the company.
• Creation of a strategic competitive advantage by providing an adequate number of employees with the right skills at the right time at the right place.
SHRM is closely related to strategic personnel development, an organized learning process that occurs in the social environment of the company [23]. The goal of personnel development is to improve the achievement potential of employees or
organizational units. This encompasses all planning and controlling instruments, results and processes. Strategic personnel development shall close the gap between actual group-based skills and future group-based skill requirements on a highly aggregated level [22]. If one looks at the individual employee’s personal skills, knowledge, and capabilities and their possible evolution paths (“personnel asset analysis” and “personnel development” in Fig. 4, at the individual level), one enters the area of competence management (sometimes also called skill management), which is analyzed in more depth in Sect. 3.2 below. Both KM and HRM are management disciplines with strategic, tactical and operational aspects that need a structured and holistic approach, covering corporate culture and identity, organization, as well as IT. Whether HRM is considered a part of KM or vice versa does not really matter. But it is clear that many concrete measures, like personal trainings or team coachings, job rotation or mentoring concepts, fit well in both areas.
3.2 HRMORE Objectives and Approach
3.2.1 Competence Management and Competence Catalogs
We see competence as the knowledge-based and network-driven ability of an actor and his environment to act (alone or with partners) such that existing customer requirements are (directly or indirectly) satisfied optimally. By this, sustainable added value is created in a competitive manner. In recent years, the importance of skill management approaches (a term we use synonymously with competence management in this paper) has become more and more visible to companies. Companies have recognized that the knowledge of their employees is a decisive competitive factor. Measures like old-age retirement and staff reductions in general imply a constant loss of knowledge. This trend is dangerous, especially regarding senior employees who possess much implicit knowledge. The ongoing demographic change in Europe will get companies into trouble because in the near future, as is already the case in some instances, they will no longer be able to satisfy their recruitment requirements. This implies a foreseeable shortage of human resources and therefore a shortage of knowledge. An accompanying aspect is the speed of technical innovation. Employees have to be trained to grow with the requirements of their positions. A variety of technical systems and the growing number of variants, e.g., in the automotive sector, imply a constant change that affects the knowledge bases of the workers. Companies therefore feel the pressure to secure their new blood on the one hand and to train their existing employees on the other. Companies that want to cope with these challenges and remain versatile have to invest in the target-oriented development of the skills of their employees. Holistic approaches to competence management aim at closing the gap between competence offers and competence demands.
In an ideal solution, this should be supported semi-automatically with the help of an intelligent software system. This
Fig. 5 Various uses of a competence catalog
is the basis for subsequent measures like the development of competencies through trainings etc. Competence catalogs (CCs) play a central role in competence management (Fig. 5). A CC contains all skills that are relevant to the company. Normally they are structured in a taxonomy. CCs provide the vocabulary for documenting the actual skills of the employees in their skill profiles. The catalog is also used to define the reference skill profiles for positions. These two types of profiles allow a matching and an identification of a gap between a position’s requirements and an employee’s profile. An employee skill profile depicts the actual skills of an individual employee. Single skills in the profile can be weighted (e.g., beginner, advanced, expert, trainer). A reference-position skill profile is a list of weighted skills that are needed to fulfill the working requirements of this individual position. Both profiles should use the same vocabulary from the same competence catalog [7].
3.2.2 Scenario at DaimlerChrysler AG, Wörth Plant
The HRMORE project was run at DaimlerChrysler AG, Wörth Plant. It aimed at using ontologies for integrating the existing processes in HR. The first step towards a consistent CC was to assess all relevant competencies that the company needs in order to execute its processes properly. The catalog was modeled with the KAON ontology management tool. It already existed as a simple, flat database table and was, in the first step, imported into KAON as it was, without any further change. The taxonomy has about 700 single competencies. This number results from the diversification in the automotive sector with its variety of different job profiles throughout all areas of production, management, and administration. Without this sophisticated catalog, it would not be possible to represent all employees with their individual job profiles.
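The identification of a gap between a reference-position skill profile and an employee skill profile, as introduced in Sect. 3.2.1, can be sketched in Python as follows. The four weights come from the text; their mapping to an ordinal scale, the dictionary-based profiles and the function name are assumptions made for illustration.

```python
# Assumed ordinal scale for the skill weights named in the text.
LEVELS = {"beginner": 1, "advanced": 2, "expert": 3, "trainer": 4}


def competence_gap(reference: dict, employee: dict) -> dict:
    """Return the skills for which the employee is below the required level.

    Both profiles map skill names (from the same competence catalog) to a
    level. The result maps each missing or insufficient skill to a pair
    (actual level or None, required level).
    """
    gap = {}
    for skill, required in reference.items():
        actual = employee.get(skill)
        if actual is None or LEVELS[actual] < LEVELS[required]:
            gap[skill] = (actual, required)
    return gap
```

Because both profiles draw on the same catalog, the comparison reduces to a lookup per skill; the resulting gap is the natural input for training analysis and training planning.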
Fig. 6 Top-level concepts of HRMORE ontology
The taxonomic structure of the competence catalog was extended by further attributes and relationships to represent the links between individual competencies, training seminars, and competence weights, with training analysis and training planning as two core processes of personnel development (Fig. 6). Basically, in our concrete usage context, the ontology-based approach offered no fundamentally new functionalities compared with established ERP software using relational data models. However, the ontology-based solution has some advantages:
• Usage of the taxonomic structure of ontologies. Competence catalogs are normally taxonomies. Therefore, the hierarchy can easily be exploited, for example, to aggregate competencies to a more abstract level and build up skill groups. This enables the use of these abstractions wherever it is helpful, e.g., for describing group or team skills at a more abstract level, or for formulating strategic competence goals more compactly.
• Ontology-based similarity assessment can be used for matching reference profiles and employee profiles (if no exact match is possible in profile comparison) or for recommending possibly useful trainings (if there is no training available which perfectly closes an identified competence gap).
• In practice, training planning is subject to many more restrictions than the simple question of whether a training program covers a specific competence gap of an employee. Strategies for handling such restrictions and finding an optimized
Fig. 7 HR data warehouse and processes/applications
training plan can well be expressed with business rules or logic-programming rules. In particular, the following restrictions can occur:
– time restrictions of the employee
– time restrictions of the training measure
– prerequisites for being allowed to take part in a training measure
– limitation of the number of possible participants per training
– budget limitations which lead to prioritization rules
• The integration of legacy systems and additional information sources can be done more easily using ontology-based information integration approaches. In cases where legacy systems model parts of the application domain redundantly or inconsistently, automated analyses can help to enforce globally consistent models. Using ontology-mapping techniques, an existing position catalog (from which reference position profiles can be taken) can be integrated more easily.
In the HRMORE project, business processes and software support for project staffing, succession planning, and training planning were designed within the context of the existing DaimlerChrysler tool and process landscape [6]. The overall approach was based on the idea of an integrating HR data warehouse plus associated software modules, the interaction of which is shown in Fig. 7. As a central point of information, we designed the Human Resource Data Warehouse (HR-DW), which integrates most of the HR data from legacy systems in one place. On top of the HR-DW, the HR domain ontology acts as a meta-layer between
the database and the application modules. It consists mainly of the competence catalog and some further enriching information like the organizational structure or the reference position catalog. We have prototypically implemented the module “Project Staffing” which shows representatively that an ontology-based matching of competence profiles does work. The ontology-based similarity search was evaluated extensively.
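An ontology-based matching of competence profiles that tolerates inexact matches can be illustrated with a taxonomy-based similarity. The sketch below is not the HRMORE similarity measure: it swaps in a Wu-Palmer style measure over a tiny, hypothetical competence taxonomy with a single root, and the function names are illustrative.

```python
# Hypothetical excerpt of a competence taxonomy (child -> parent).
PARENT = {
    "Java": "Programming",
    "Python": "Programming",
    "Programming": "IT Skills",
    "Welding": "Production Skills",
    "IT Skills": "Competence",
    "Production Skills": "Competence",
}


def path_to_root(concept: str) -> list:
    """Return the concept followed by all of its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def taxonomic_similarity(a: str, b: str) -> float:
    """Wu-Palmer style: 2 * depth(lca) / (depth(a) + depth(b)), root depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First shared node on the way up from a is the lowest common ancestor.
    lca = next(concept for concept in path_a if concept in ancestors_b)
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def match(required: list, offered: list) -> float:
    """Average, over required skills, of the best similarity to an offered skill.

    Assumes both lists are non-empty.
    """
    best = [max(taxonomic_similarity(r, o) for o in offered) for r in required]
    return sum(best) / len(best)
```

With this taxonomy, a position requiring "Java" is matched with a score of 0.75 by an employee offering "Python", whereas "Welding" shares only the root concept and scores far lower; an exact-match comparison would treat both candidates as equally unsuitable.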
3.2.3 Concluding Remarks
This section outlined the interrelations between Knowledge Management and Human Resource Management and sketched how ontologies can be used for training planning. We also discussed the general benefits of ontologies in HR. The main focus of our work covered all process and technology aspects of SHRM and ontology-based competence management, fully embedded in a large organization’s real HR business processes. Ontologies play the role of representing the competence catalog and the personal and job skill profiles, and they provide the basis for similarity-based profile matching with background knowledge. Furthermore, ontologies help to integrate different legacy systems in the HR Data Warehouse. Modeling employees’ skills with ontologies was already proposed, together with a number of use cases, by Stader and Macintosh [25]. Their approach was primarily technology-driven, whereas our work is closely embedded in the organizational processes, tool landscape, and typical procedures of an HR department. Regarding the use of background knowledge for skill matching, Liao et al. [19] use declarative retrieval heuristics to traverse the ontology structure, and Sure et al. [26] use F-Logic inferencing to derive competencies and a soft matching. Colucci et al. [9] use description-logic inferences to handle background knowledge and incomplete knowledge for matching profiles. Similarly, Lau and Sure [18] focus on profile matching for applications like team staffing or Intranet search in employee yellow pages. Basic research into adequate modeling of ontology-based competence catalogs was done in the KOWIEN project. Gronau and Uslar [13] worked on the economic perspective of competence management and the linkage between this and knowledge and business process modeling. These two projects enhance the formal side and expedite the microeconomic understanding; but we have not yet found a comparable, integrating approach like ours that encompasses the whole SHRM field.
4 Using Soboleo for People Tagging
4.1 SOBOLEO-PT: Project Context
The previous section has shown the usefulness of ontologies for competence management. Building upon this, we are aiming for a more participatory competence management approach within the MATURE project. Knowing-who is an essential
element for efficient KM processes within organizations, e.g. for finding the right person to talk to, for team staffing or for identifying training needs. However, it is still hard to show sustainable success on a larger scale, especially on the level of individual employees. This is often the case because competence information, e.g. contained in employee directories, is not kept up-to-date or is not described in a manner relevant to potential users. To overcome these difficulties, we propose a collaborative competence management approach based on Web 2.0-style people tagging and complement it with community-driven ontology engineering methods.
4.2 SOBOLEO-PT: Objective and Approach
Traditionally, competence management approaches are seen as top-down instruments, with a small expert group modeling and maintaining a centralized competence catalog at irregular time intervals (often longer than a year) or even as a one-time activity without scheduled updates. This catalog is then provided to the lower management and the employees in order to provide, update, and apply requirements and employee profiles. However, when applying the catalog, employees often encounter the problem of not being able to understand the meaning of the competence notions (because they were not involved in the modeling process) or of not finding the topics relevant to them (especially very recent topics). The creation and maintenance of the individual employee profiles is similarly difficult. In practice, one can mostly observe self-assessment approaches or external assessment approaches done by superiors or through formal procedures. While the latter approach is expensive and cumbersome and thus can only be observed in limited areas, the former often fails for lack of motivation, which can be traced back to the absence of an immediate benefit for the employees. The systems are hardly embedded into everyday work activities and the profiles do not contain information that is of high relevance to colleagues. In particular, very recent and specialized topics cannot be applied because the competence catalog does not yet contain them. Our collaborative competence management approach involves every employee and lets them participate. It is based on “People Tagging”, i.e., individual employees tag each other according to the topics they associate with this person. We complement this with community-driven ontology engineering methods, which enable the employees to contribute to the continuous development of the competence catalog and to adapt it to their needs.
4.2.1 The Approach: Collaborative People Tagging
Our lightweight approach is based on collaborative tagging as a principle to gather the information about persons inside and outside the company (if and where relevant): individual employees can describe and tag the expertise and interests of their
Fig. 8 Ontology maturing process
colleagues with keywords in an easy way. In this manner, we gain a collective review of existing skills and competencies. Knowledge can be shared and awareness strengthened within the organizational context around who knows what. This tagging information can then be used to search for the appropriate person(s) to talk to in a particular situation. It can also be used for various other purposes: for instance, for personnel development one needs to have sufficient information about the needs and current capabilities of employees to make the right decisions about required training programs.
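The basic mechanics of people tagging and of the subsequent search for a person to talk to can be sketched in a few lines of Python. This is an illustration only and does not reflect the SOBOLEO-PT implementation: the class, its methods and the ranking by number of distinct taggers are assumptions.

```python
from collections import Counter


class PeopleTagging:
    """Stores who tagged whom with which topic and answers topic searches."""

    def __init__(self):
        # (tagger, person, topic) triples; a set, so that repeating the same
        # assignment does not inflate the counts.
        self._assignments = set()

    def tag(self, tagger: str, person: str, topic: str) -> None:
        self._assignments.add((tagger, person, topic.lower()))

    def find_people(self, topic: str) -> list:
        """People tagged with the topic, most frequently tagged first."""
        counts = Counter(
            person
            for _, person, assigned in self._assignments
            if assigned == topic.lower()
        )
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [person for person, _ in ranked]
```

Counting distinct taggers rather than raw tag events is one simple way to turn many individual judgments into the collective review described above.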
4.2.2 Foundation: Collaborative Construction of a Shared Understanding
Collaborative competence management needs the continuous development of a shared vocabulary (ontology). Competencies usually have an integrating function in the enterprise, bringing together strategic and operational levels, as well as human resources and performance management aspects. So, these notions have to be shared by the whole organization (in the ideal case). Consequently, we cannot do without a shared vocabulary which the employees evolve during its usage, i.e., during the tagging or search process. We have developed the ontology maturing process model (see Fig. 8) that operationalizes this collaborative view of the development of such a vocabulary and, along with it, of a shared understanding. The model structures the process of evolving competence ontologies into four phases:
(1) Emergence of ideas: as employees annotate each other with any topic tag, new topic ideas emerge;
(2) Consolidation in communities: a common topic terminology evolves through the collaborative (re-)usage of the topic tags within the community of employees;
(3) Formalization: through gardening activities, e.g., by adding hierarchical or ad-hoc relations between topic tags, the topic terminology is organized into competencies;
(4) Axiomatization: differentiation of abstract competencies into competencies with levels
Fig. 9 People tagging with SOBOLEO-PT
and addition of precise generalization and composition relations for reasoning purposes. With our Web-based SOBOLEO-PT system, employees can tag each other with concepts from the shared vocabulary (see Fig. 9). The primary idea is to annotate a person via his/her personal Web page. We therefore provide a bookmarklet-based tagging tool that can be used on top of existing Intranet employee directories or social networking sites. If users find a colleague’s Web page, they can add him/her to the directory and annotate him/her with concepts from the shared vocabulary with one click on the browser bookmarklet. Persons for whom no Web page is easily available can also be added directly. Each person who is tagged at least once is represented by one personal profile page within SOBOLEO-PT and can also be tagged directly on this page. During tagging, the system supports the users with tag suggestions based on the existing shared vocabulary and the content of the person’s Web page. If they want to tag with a topic that the existing ontology concepts do not cover (e.g., because the topic is too new or specific), the employees can adapt an existing concept or just use a new term without an agreed meaning. These new terms are automatically
Fig. 10 SOBOLEO-PT’s collaborative ontology editor
added to the shared vocabulary as “prototypical concepts”, reflecting the fact that it’s not clear yet how they relate to the existing concepts. During gardening activities, the users can then remove the new terms from the “prototypical concepts” container and integrate them into the vocabulary and add additional information. In this way, topic tags are incrementally formalized and aggregated and competencies are defined where necessary, e.g. for organizational reporting. For these tasks, a lightweight, browser-based, and real-time collaborative ontology editor based on the SKOS formalism is available. As a lightweight language, SKOS is relatively easy to understand for non-modeling experts and allows us to seamlessly work with half-formalized domains. SOBOLEO-PT’s ontology editor (see Fig. 10) enables users to structure the concepts with hierarchical relations (broader and narrower) and to indicate that concepts are “related”. Concepts can have a (multi-word) preferred label and a description in multiple languages, as well as any number of alternative and hidden labels. The collaborative editor can be used by several users at the same time. Changes are immediately visible and effective to all users. The vocabulary information also serves as background knowledge to support the search process or explorative navigation (Fig. 11): Users can improve the retrieval by adding and refining vocabulary information. For instance, if the users miss entries in the search results because of missing links between concepts (e.g., entries with ‘Glasgow’ or ‘Edinburgh’ when searching for ‘Scotland’), they can easily add them. In this way, we achieve a collaborative and incremental in-situ revision and
Fig. 11 Semantic expert search
improvement. Real-time collaborative gardening tools are provided to promote the convergence towards a shared living vocabulary.
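The retrieval effect of such a gardening step can be illustrated with a small sketch. It is not SOBOLEO-PT's implementation: the `Concept` class, the `expand_query` and `search` functions, and the sample people are hypothetical, and only the broader/narrower idea is borrowed from SKOS. A search term is expanded to all transitively narrower concepts before it is matched against people's tags.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """Hypothetical SKOS-style concept, reduced to what the example needs."""
    pref_label: str
    narrower: Set[str] = field(default_factory=set)  # labels of narrower concepts


def expand_query(term: str, vocabulary: Dict[str, Concept]) -> Set[str]:
    """Return the term together with all transitively narrower concepts."""
    expanded: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited; also guards against cycles
        expanded.add(current)
        concept = vocabulary.get(current)
        if concept is not None:
            stack.extend(concept.narrower)
    return expanded


def search(term: str, vocabulary: Dict[str, Concept],
           tagged_people: Dict[str, Set[str]]) -> List[str]:
    """Find people tagged with the term or with any narrower concept."""
    wanted = expand_query(term, vocabulary)
    return sorted(person for person, tags in tagged_people.items() if tags & wanted)


vocabulary = {name: Concept(name) for name in ("Scotland", "Glasgow", "Edinburgh")}
tagged_people = {"Alice": {"Glasgow"}, "Bob": {"Edinburgh"}, "Carol": {"Scotland"}}

before = search("Scotland", vocabulary, tagged_people)

# Gardening step: a user adds the missing links between the concepts.
vocabulary["Scotland"].narrower.update({"Glasgow", "Edinburgh"})

after = search("Scotland", vocabulary, tagged_people)
print(before, after)
```

Before the links are added, only the person tagged with 'Scotland' itself is found. Afterwards the people tagged with 'Glasgow' or 'Edinburgh' are returned as well, without any change to their tags.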
4.3 SOBOLEO-PT: Concluding Remarks
Evaluation studies of the people tagging approach within MATURE are promising and have shown that people tagging is accepted by employees in general, and that they view it as beneficial. These studies also revealed that we have to be careful when designing such a people tagging system. We have to consider not only technical aspects, but the socio-technical system as a whole, including affective barriers, the organizational context, and other motivational aspects. There is no one-size-fits-all people tagging system, as it depends on the organizational or team culture which aspects are seen as acceptable and which as alienating. This has led us to develop a conceptual design framework in which we identified basic design decisions that can be customized for each people tagging system instance. Such design decisions are, for instance: To whom are the tags assigned to a person visible? Do they need to be approved by the tagged person before being published? Is the tagged person allowed to delete unwanted tags? Which
kind of tags shall be allowed? Only professional tags, or off-topic tags as well? How do the frequency and timestamps of tag assignments influence the search heuristics? Within a pilot in the MATURE project, we are currently analyzing these questions in close connection to the organizational context.
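One way to picture such customizable design decisions is as a per-installation policy object. The following is a minimal sketch under our own assumptions: the names `Visibility`, `PeopleTaggingPolicy`, `is_published` and `may_assign` are hypothetical and do not come from the MATURE design framework. It only mirrors the questions listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    """Hypothetical audiences that may see the tags assigned to a person."""
    TAGGED_PERSON_ONLY = "tagged person only"
    TEAM = "team"
    ORGANIZATION = "organization"


@dataclass(frozen=True)
class PeopleTaggingPolicy:
    """One customizable set of design decisions for a single installation."""
    tag_visibility: Visibility
    approval_required: bool         # must the tagged person approve a tag first?
    tagged_person_may_delete: bool  # may the tagged person remove unwanted tags?
    off_topic_tags_allowed: bool    # are non-professional tags permitted?


def is_published(policy: PeopleTaggingPolicy, approved_by_tagged_person: bool) -> bool:
    """A tag becomes visible once approved, or at once if no approval is needed."""
    return approved_by_tagged_person or not policy.approval_required


def may_assign(policy: PeopleTaggingPolicy, is_professional_tag: bool) -> bool:
    """Professional tags are always allowed; off-topic tags depend on the policy."""
    return is_professional_tag or policy.off_topic_tags_allowed


cautious = PeopleTaggingPolicy(Visibility.TEAM, True, True, False)
open_policy = PeopleTaggingPolicy(Visibility.ORGANIZATION, False, True, True)

print(is_published(cautious, approved_by_tagged_person=False))
print(may_assign(open_policy, is_professional_tag=False))
```

Keeping the decisions in one immutable object makes it explicit that two organizations can run the same tagging software with quite different social rules.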
5 The Project Halo: Full Formalization of Scientific Knowledge 5.1 Halo: Project Context The long-term goal of Project Halo is to build a ‘Digital Aristotle’, a computer system that stores a significant part of humankind’s scientific knowledge and that is able to answer novel questions (i.e. questions not known or foreseen during the creation of the system) about these parts of science. This system should be able to act both as an interactive tutor for students and as a research tool for scientists. From a KM perspective, Project Halo represents the area of completely externalized and fully operational knowledge. Project Halo is structured as a multistage effort, the second stage of which will be discussed here in detail. In this second phase, the author of this section worked as a subcontractor for ontoprise GmbH which, in turn, managed a multi-partner project contracted by VULCAN Inc.
5.2 Halo: Objective and Approach
The main goal of the first phase of Project Halo [11] was to assess the current state of the art in applied knowledge representation and reasoning (KR&R) systems, to establish whether existing KR&R systems could form the foundation for a Digital Aristotle. The domain chosen for this experiment was a subset of the questions from the introductory college-level Advanced Placement (AP) test for chemistry. This phase showed that it is indeed possible with current KR&R technology to match the performance of students on the Advanced Placement test; however, it also showed that the creation of the knowledge needed for this task is very expensive. The goal of the second phase was to create and evaluate tools that allow domain experts to create the knowledge base with ever decreasing reliance on knowledge engineers [8, 12]. It was hoped that this could dramatically decrease the cost of building a scientific question-answering system and further reduce the number of errors due to incomplete domain knowledge of the knowledge engineers. The domain chosen for the second phase was that of introductory college-level Advanced Placement tests for chemistry, biology and physics. The second phase was conducted as a 22-month effort undertaken in two stages. The first 6-month stage was dedicated to a careful analysis of sample questions and the design of the application. This stage was followed by a 15-month implementation stage that ended with an
evaluation. Three teams participated in the first stage, only two in the second stage. The project structure was such that different teams had the same requirements, realized them separately and were then judged on their relative performance. The results reported here are solely from team ontoprise GmbH, which included team members from ontoprise GmbH, Carnegie Mellon University, Open University, Georgia Tech and DFKI. During the second phase of Project Halo, team ontoprise GmbH built the Dark Matter Studio tool (DMS) in order to support domain experts in the creation of a scientific knowledge base. Only a short overview of this system will be given here. Dark Matter Studio is built to support a document rooted methodology [11] of knowledge formulation. This methodology stipulates that domain experts use an existing document (such as a textbook) as basis for formulating knowledge. Having a root document as a foundation should help the experts to decide what to model and in which order. It is further assumed that textbooks are carefully structured in a way that is well suited for knowledge formulation. All created knowledge is tied to the document; the document can thus help to contextualize and explain the contents of the knowledge base. Dark Matter Studio is created in a way that isolates the user from the details of the knowledge representation language and the workings of the inference engine. The concepts the user manipulates are often on a higher level than the concepts of the knowledge base. For example, a rule created by the user is frequently translated into many rules for the inference engine. Dark Matter Studio is built on top of ontoprise’s ontology engineering environment OntoStudio [27], which in turn is built on top of the Eclipse framework. Eclipse provides a plug-in framework which allows it to seamlessly extend OntoStudio. 
The main reasoning component in Dark Matter Studio is ontoprise’s inference engine Ontobroker [5] which utilizes Mathematica for equation solving. The main hub for the knowledge formulation work is the annotation component, which is a further development of the annotation tool Ont-O-Mat [14] which includes support for semi-automatic annotation based on the KANTOO machine-translation environment. Graphical editors for the ontology were extended from the OntoStudio system; rules are created with a new graphical rule editor. Dark Matter Studio includes dedicated editors for the formulation of process knowledge, explanations and tests. The verification support for Dark Matter Studio included support for testing, debugging and anomaly detection heuristics (partly developed on the basis of [30], see Fig. 12). The evaluation consisted of two parts, a knowledge formulation and a question answering section. In the first part, six domain experts (DEs), two each for the domains of physics, chemistry and biology (senior students in these fields) were handed the system, trained on it for two weeks and were then asked to formalize some pages from a textbook in their field. In the second part of the evaluation, domain experts were given DMS with one of the knowledge bases created during knowledge formulation. They were then given a number of questions for this domain that they should formalize as queries to get answers using DMS. Altogether the domain experts spent on average ca. 100 hours interacting with the system [15].
Fig. 12 Dark-matter studio debug perspective
5.3 Halo: Evaluation Results and Concluding Remarks All domain experts were able to successfully create a taxonomy, relations and attributes. The size of the taxonomy varied between the domains, with the Physics DEs creating the fewest concepts and instances. All domain experts were able to successfully use the graphical rule editor of DMS to create a large number of rules. The integration of the rule editor with the rest of DMS worked well and some domain experts were excited to see their rules ‘come to life’ in the tests. The testing component proved to be very popular with the domain experts. The graphical interface of the testing component was usable and all domain experts succeeded in creating a considerable number of tests. In the question-formulation part of the evaluation, the quality of DMS and of the created knowledge bases were evaluated. To do this, the system—together with a knowledge base—was given to different domain experts that had to use it to formalize and answer a set of questions that hadn’t been known before. In order for a question to be answered correctly, the knowledge base needed to be correct, needed to encompass the required knowledge, and the question must have been correctly formalized by the domain expert (QF DE); i.e. formalized in a way that it yielded an answer and also accurately reflected the question. An incorrect formalization of the question could again depend on an error by the QF DE, knowledge missing from the knowledge base or a limited expressiveness of the question formulation tool. The question-answering performance for questions judged to be fully adequate is shown in Table 1. The numbers show the count of results that were fully correct,
Table 1 Rating of answers for fully adequately formulated queries
                 | # Fully adequate | # Nearly adequate | # Inadequate
DE 1 (Physics)   | 4                | 0                 | 3
DE 2 (Biology)   | 1                | 0                 | 0
DE 3 (Biology)   | 0                | 0                 | 0
DE 4 (Chemistry) | 0                | 0                 | 0
DE 5 (Chemistry) | 0                | 0                 | 4
DE 6 (Physics)   | 7                | 0                 | 2
nearly or not adequate; only the results for fully adequate queries are shown. It can be seen that only the Physics DEs succeeded in formulating and correctly answering a considerable number of questions. Overall, the very challenging evaluation setup (which required domain experts to model knowledge that could successfully answer novel questions posed by other domain experts) showed some success, particularly in the physics domain. At the same time, however, most questions could not be answered correctly, showing major limitations of DMS.
We do not go into more detail about Project Halo since it is also discussed in detail in a separate chapter of this book. Nevertheless, some remarks can be made in the broader context of this chapter on semantic technologies for KM: The Halo series of projects has proven more than once that, in principle, knowledge-based systems can, in a fully automated manner, deliver stunning results and solve really difficult and practice-relevant problems. However, it also confirmed the problems that have been encountered by the expert system community for many years: the high expense of knowledge acquisition on the one hand and the brittleness of system behavior (often leading to no answers, but also to very inefficient runtime behavior) on the other. These problems still exist and are only marginally ameliorated with today’s technologies. Consequently, this approach to providing methods and tools that enable Domain Experts to build non-trivial knowledge bases on their own did not lead to a breakthrough. Several conclusions can be drawn from that:
• First, it seems that the “old-fashioned” approach of letting Domain Experts and Knowledge Engineers interact for knowledge acquisition still has merit. Hence, ontoprise GmbH and Project Halo subsequently adopted an approach where knowledge engineers and domain experts jointly build the knowledge base, interacting through Web 2.0 style tools such as Wiki-style engineering frameworks that allow more interaction, more communication, and more agile development.
• Second, there is still a need for more powerful and user-friendly knowledge-acquisition frameworks; in Halo, for instance, DMS already provides expressive abstraction layers, and we invested in better rule-base debugging mechanisms [30]. Better explanation techniques might also help.
• Last, but not least, in the broader context of KM, the Halo experiments also confirmed the hypothesis already put forward, e.g., in [4] in the context of Organizational Memory research, which postulated the advent of Intelligent Assistants instead of Expert Systems, i.e. knowledge-based support systems that help to find information, partly automate problem-solving, or critique human activities, but do not aim at fully automated solutions. Such systems can be more cost-efficient, more easily built in an evolutionary manner, and better accepted by users. Many of ontoprise’s recent customer projects also realized Intelligent Assistant or Intelligent Advisory systems rather than traditional Expert Systems.
Altogether, it all comes back to a careful feasibility and cost-benefit analysis for defining the best system layout.
6 Conclusion
In this chapter, we presented a number of KM projects using semantic technologies that were run in recent years; some characteristics of these projects are compared in Table 2. We see that all KM endeavors need some methodological foundations if they are to be successful in practice. We also see the close relationships of Knowledge Management to other research and technology fields (like Case-Based Reasoning or Knowledge-Based Systems in IT, or Human Resource Management in a more general organization-management context), which underlines that KM in an organization must be seen as a cross-cutting (or boundary-spanning) activity that needs the appropriate personnel and management backing. Semantic technologies played different roles in the respective projects, but mainly they were used to describe and annotate “resources” (people, best-practice cases) in order to find them more precisely later.
Coming back to the two trade-off dimensions introduced in Sect. 1 of this chapter, we can note: With respect to the degree of required modeling formality, the very formal Halo approach turned out to be cumbersome and risky for end users in practice but, of course, may deliver by far the most added value from automation. So, appropriate usage scenarios must be found here. In contrast, KMIR and HRMore, which used ontological knowledge just to describe informal case descriptions, specific case facets, or just the knowledge in people’s minds, can live with far less correct and complete knowledge bases and with fewer related knowledge engineering and maintenance problems. Of course, this metadata knowledge can only be used for retrieval. Such methods, combined with the idea of intelligent advisory systems, may be valuable in many KM scenarios. In SOBOLEO-PT, this idea was evolved even further: formalization starts with just simple tags, thus having extremely low entry and acceptance barriers, and can be formalized further step by step if possible and/or needed. Such maturation approaches, together with participatory and collaborative methods, also seem to be a winning path for future KM solutions.
Similar considerations can be made for the dimension of the extent of knowledge externalization: more explicit knowledge representations, like rule bases or strongly structured lessons-learned entries, are more tangible, operational, reusable, sharable, and evolvable for an organization; but they must be engineered in a much more careful and costly manner than skill profiles whilst the knowledge resides in the
Table 2 Characteristics of presented projects
Project | Application | Knowledge bearer/knowledge item | Technological approach | Added value through semantic technologies | Methodological aspect
KMIR | KM introduction | Explicit knowledge: project best practices | Case-based retrieval | Ontology stores cases and provides background knowledge for similarity assessment | Best-practice based KM introduction method
HRMORE | Strategic competence management | Implicit knowledge: employees | HR data warehouse; similarity matching for skill profiles | Ontology stores skill profiles and enables similarity matching | Adapted processes for strategic HRM, especially succession planning, training planning, recruiting
SOBOLEO | Operational competence management | Implicit knowledge: employees | Lightweight, collaborative tagging approach | Semantic search for experts | Knowledge maturing theory
HALO | Natural sciences | Explicit knowledge: F-Logic knowledge base | Ontology-based KBS | Deep formal inferencing for question answering | Document-rooted method for knowledge acquisition for KBS
employees’ heads. A careful decision for the most appropriate approach is required in each concrete KM project. For research, stepwise (graded) transitions between the two extremes might be a thrilling topic, one which is already partly addressed by approaches such as Semantic Wikis (elaborated in another chapter of this book) or Personal KM tool suites [28, 29].
Acknowledgements The work presented in this chapter has been funded over several years by a number of private and public institutions. To mention just the most important ones: the German Federal State of Baden-Württemberg, the German Federal Ministry for Economics and Technology (BMWi) within the research programme THESEUS, the German Federal Ministry of Education and Research (BMBF) with the research project “Im Wissensnetz”, the European Commission with IST projects “MATURE” and “NEPOMUK”, as well as DaimlerChrysler AG and VULCAN Inc.
References
1. Aamodt, A., Plaza, E.: Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun. 7(1), 39–59 (1994)
2. Abecker, A.: Business-process oriented knowledge management: concepts, methods, and tools. PhD thesis, Universität Karlsruhe (TH) (2004). Supervisors: Prof. Dr. Rudi Studer, Prof. Dr. Peter Knauth, Prof. Dr. Grigoris Mentzas
3. Abecker, A., van Elst, L.: Ontologies for knowledge management. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies, 2nd edn. Springer, Berlin (2008)
4. Abecker, A., Bernardi, A., Hinkelmann, K., Kühn, O., Sintek, M.: Towards a technology for organizational memories. IEEE Intelligent Systems & Their Applications 13(3) (1998)
5. Angele, J., Kifer, M., Lausen, G.: Ontologies in F-logic. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies, 2nd edn. Springer, Berlin (2008)
6. Biesalski, E.: Unterstützung der Personalentwicklung mit ontologiebasiertem Kompetenzmanagement. PhD thesis, Universität Karlsruhe (TH) (2006). Supervisors: Prof. Dr. Rudi Studer, Prof. Dr. Peter Knauth (in German)
7. Biesalski, E., Abecker, A.: Similarity measures for skill-profile matching in enterprise knowledge management. In: 8th Int. Conf. on Enterprise Information Systems (ICEIS-06) (2006)
8. Chaudhri, V.K., John, B.E., Mishra, S., Pacheco, J., Porter, B., Spaulding, A.: Enabling experts to build knowledge bases from science textbooks. In: K-CAP ’07: Proc. of the 4th Int. Conf. on Knowledge Capture, pp. 159–166. ACM, New York (2007)
9. Colucci, S., Di Noia, T., Di Sciascio, E., Donini, F.M., Mongiello, M., Mottola, M.: A formal approach to ontology-based semantic match of skills descriptions. J. Univers. Comput. Sci. 9(12), 1437–1454 (2003)
10. Ehrig, M., Haase, P., Hefke, M., Stojanovic, N.: Similarity for ontologies: a comprehensive framework. In: 13th European Conference on Information Systems (ECIS-2005) (2005)
11. Friedland, N.S., Allen, P.G., Matthews, G., et al.: Project HALO: towards a digital Aristotle. AI Mag. 25(4), 29–48 (2004)
12. Gómez-Pérez, J.M., Erdmann, M., Greaves, M.: Applying problem solving methods for process knowledge acquisition, representation, and reasoning. In: K-CAP ’07: Proc. of the 4th Int. Conf. on Knowledge Capture, pp. 15–22. ACM, New York (2007)
13. Gronau, N., Uslar, M.: Requirements and recommenders for skill management. In: Dieng-Kuntz, R., Matta, N. (eds.) ECAI-04 Workshop on Knowledge Management and Organizational Memory (2004)
14. Handschuh, S., Maedche, A.: Cream: creating relational metadata with a component-based, ontology-driven annotation framework. In: Proc. 1st Int. Conf. on Knowledge Capture (K-CAP) (2001)
15. Hansch, D., Erdmann, M.: Deep authoring, answering and representation of knowledge by subject matter experts: final report implementation phase. Technical report, ontoprise GmbH (2006)
16. Hefke, M., Abecker, A., Jäger, K.: Portability of best practice cases for knowledge management introduction. J. Univers. Knowl. Manag. 1(3), 235–254 (2006)
17. Hefke, M.: Ontologiebasierte Werkzeuge zur Unterstützung von Organisationen bei der Einführung und Durchführung von Wissensmanagement. PhD thesis, Universität Karlsruhe (TH) (2008). Supervisors: Prof. Dr. Rudi Studer, Prof. Dr. Peter Knauth (in German)
18. Lau, T., Sure, Y.: Introducing ontology-based skills management at a large insurance company. In: Workshop Modellierung 2002, pp. 123–134 (2002)
19. Liao, M., Hinkelmann, K., Abecker, A., Sintek, M.: A competence knowledge base system for the organizational memory. In: Puppe, F. (ed.) XPS-99/5. Deutsche Tagung Wissensbasierte Systeme. Lecture Notes in Artificial Intelligence, vol. 1570. Springer, Berlin (1999)
20. Mentzas, G., Apostolou, D., Young, R., Abecker, A.: Knowledge Asset Management. Springer, London (2002)
21. Probst, G., Raub, S., Romhardt, K.: Managing Knowledge: Building Blocks for Success. Wiley, London (1999)
22. Scholz, C.: Personalmanagement. Informationsorientierte und verhaltenstheoretische Grundlagen, 3rd edn. Vahlen, München (1993) (in German)
23. Scholz, C., Djarrazadeh, M.: Strategisches Personalmanagement: Konzeptionen und Realisationen. USW-Schriften für Führungskräfte, vol. 28. Schäffer Poeschel, Stuttgart (1995) (in German)
24. Schuler, R.S.: Strategic human resource management: linking people with the strategic needs of the business. Organizational Dynamics (1992)
25. Stader, J., Macintosh, A.: Capability modelling and knowledge management. In: Applications and Innovations in Expert Systems VII. Proc. ES’99, 19th Int. Conf. of the BCS Specialist Group on KBS and Applied AI, pp. 33–50. Springer, Berlin (1999)
26. Sure, Y., Maedche, A., Staab, S.: Leveraging corporate skill knowledge: from ProPer to OntoProPer. In: Mahling, D., Reimer, U. (eds.) 3rd Int. Conf. on Practical Aspects of Knowledge Management (2000)
27. Sure, Y., Angele, J., Staab, S.: Ontoedit: multifaceted inferencing for ontology engineering. In: Journal on Data Semantics. Lecture Notes in Computer Science, vol. 2800, pp. 128–152. Springer, Berlin (2003)
28. Völkel, M.: Personal knowledge models with semantic technologies. PhD thesis, KIT, Karlsruhe Institute of Technology (2010). Supervisors: Prof. Dr. Rudi Studer, Prof. Dr. Klaus Tochtermann
29. Völkel, M., Haller, H.: Conceptual data structures for personal knowledge management. Online Inf. Rev. 33(2), 298–315 (2009)
30. Zacharias, V.: Tool support for finding and preventing faults in rule bases. PhD thesis, Universität Karlsruhe (TH) (2008). Supervisors: Prof. Dr. Rudi Studer, Prof. Dr. Karl-Heinz Waldmann
Semantic MediaWiki
Markus Krötzsch and Denny Vrandečić
Abstract Semantic MediaWiki (SMW) is an extension of MediaWiki—a widely used wiki-engine that also powers Wikipedia—which makes semantic technologies available to broad user communities by smoothly integrating with the established wiki usage. SMW is used productively on a large number of sites world-wide in application areas ranging from science over knowledge management to leisure activities. Meanwhile, a vibrant ecosystem of third-party extensions has grown around SMW, offering many options for extended features and customizations. Yet, the original vision of establishing “Semantic Wikipedia” has remained important for the development of the SMW project, leading to a strong focus on usability and scalability.
1 Preface: A Short Story of SMW
SMW began in Spring 2005. Wikipedia had just turned four and had reached 500,000 English articles, and the community was preparing for the First International Wikimedia Conference, Wikimania 2005, in Frankfurt/Main. The call for contributions attracted the interest of four first-year PhD students and wiki enthusiasts in Rudi Studer’s group. To them, it seemed clear that the benefits promised by semantic technologies (seamless data exchange, intelligent processing of large knowledge sources, unhindered re-use of information) were indispensable for bringing Wikipedia to its full potential. Of course, they were not yet sure how exactly this was to be realized. Nonetheless, Frankfurt was close to Karlsruhe, so a proposal was developed, a paper was written and accepted [16], and a visionary presentation was given. The presentation was particularly visionary indeed, as there was no implementation at that time. Fueled by the positive feedback (and programming support) of an energetic community, this practical limitation of the approach was soon overcome, and SMW 0.1 was published in September 2005. The first presentation of the system to the Semantic Web community followed at WWW 2006 [29]. Since that time, a user and developer community has grown around the system, developing countless creative applications of SMW in areas ranging from the sophisticated, such as genome research, to the (more or less) trivial, like scuba diving.
What this short story illustrates, other than the early history of SMW, are some crucial aspects of the work in Rudi Studer’s group. Most obviously, bottom-up efforts like SMW, just like any truly creative research work, require a lot of personal freedom to flourish. This is widely understood. Giving such freedom to inexperienced researchers, however, also bears many risks, and finding a proper balance here is a mark of excellent leadership. Another vital factor that enabled SMW was a cooperative atmosphere that encourages the exchange of ideas beyond the boundaries of research areas and projects. An important achievement of Rudi Studer has been to shape a group where intense cooperation is natural and expected. Accordingly, SMW is just one of the successful bottom-up efforts that developed in this fertile research climate. The KAON software suite [8], the Text2Onto ontology learning platform [10], the KAON 2 reasoner [18], and the Soboleo tagging management system [31] provide other examples, to which current and future generations of PhD students will surely be adding their own.
M. Krötzsch, Oxford University Computing Laboratory, University of Oxford, Oxford, UK, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_16, © Springer-Verlag Berlin Heidelberg 2011
2 Introduction
Wikis have become popular tools for collaboration on the Web, and many vibrant online communities employ wikis to exchange knowledge. For a majority of wikis, public or not, primary goals are to organize the collected knowledge and to share information. But in spite of their utility, most content in wikis is barely machine-accessible and only weakly structured. In this chapter we introduce Semantic MediaWiki (SMW) [17], an extension to the widely used wiki software MediaWiki [4]. SMW enhances MediaWiki by enabling users to annotate the wiki’s contents with explicit information. Using this semantic data, SMW addresses core problems of today’s wikis:
• Consistency of content: The same information often occurs on many pages. How can one ensure that information in different parts of the system is consistent, especially as it can be changed in a distributed way?
• Accessing knowledge: Large wikis have thousands of pages. Finding and comparing information from different pages is challenging and time-consuming.
• Reusing knowledge: Many wikis are driven by the wish to make information accessible to many people. But the rigid, text-based content of classical wikis can only be used by reading pages in a browser or similar application.
SMW is a free and open source extension of MediaWiki, released under the GNU General Public License. The integration between MediaWiki and SMW is based on MediaWiki’s extension mechanism: SMW registers for certain events or requests, and MediaWiki calls SMW functions when needed. SMW thus does not overwrite any part of MediaWiki, and can be added to existing wikis without much migration
Fig. 1 Architecture of SMW’s main components in relation to MediaWiki
cost. Usage information about SMW, installation instructions, and the complete documentation can be found on SMW’s homepage.1 Figure 1 provides an overview of SMW’s core components and architecture, which we will refer to when explaining the features of SMW within this chapter. Section 3 explains how structural information is collected in SMW, and how this data relates to the Web Ontology Language OWL [21]. Section 4 surveys SMW’s main features for wiki users: semantic browsing, semantic queries, and data exchange on the Semantic Web. Queries are the most powerful way of retrieving data from SMW, and their syntax and semantics are presented in detail. In Sect. 5 we survey related systems. This chapter generally refers to SMW 1.5.0 as the most recent version at the time of this writing, but future updates of SMW will largely preserve backward compatibility.
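The register-and-call-back integration style mentioned in this section can be sketched in a few lines. This is a toy illustration, not MediaWiki's or SMW's actual code: the classes `HookRegistry`, `Wiki` and `SemanticExtension` and the event name `page_saved` are hypothetical stand-ins. The point is only that the host application runs unchanged whether or not an extension has registered.

```python
from collections import defaultdict


class HookRegistry:
    """Toy stand-in for a host application's extension mechanism."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, event, handler):
        """Called by an extension to subscribe to an event."""
        self._handlers[event].append(handler)

    def fire(self, event, *args):
        """Called by the host; runs every handler registered for the event."""
        for handler in self._handlers[event]:
            handler(*args)


class Wiki:
    """Minimal host: stores pages and announces each save."""

    def __init__(self):
        self.hooks = HookRegistry()
        self.pages = {}

    def save_page(self, title, text):
        self.pages[title] = text
        self.hooks.fire("page_saved", title, text)  # hypothetical event name


class SemanticExtension:
    """Minimal extension: keeps its own store, fed only through the hook."""

    def __init__(self, wiki):
        self.indexed = {}
        wiki.hooks.register("page_saved", self.on_page_saved)

    def on_page_saved(self, title, text):
        # Placeholder for real annotation parsing: just count the words.
        self.indexed[title] = len(text.split())


plain_wiki = Wiki()                 # works without any extension installed
plain_wiki.save_page("Berlin", "Berlin is a city.")

extended_wiki = Wiki()
extension = SemanticExtension(extended_wiki)
extended_wiki.save_page("Berlin", "Berlin is a city.")
print(extension.indexed)
```

Because the extension only subscribes to events and keeps its data in its own store, removing it leaves the host's pages untouched, which is the property that keeps migration cost low.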
3 Annotation of Wiki Pages

The main prerequisite for exploiting semantic technologies is the availability of suitably structured data. For this purpose, SMW introduces ways of adding further structure to MediaWiki by means of annotating the textual content of the wiki. In this section, we recall some of MediaWiki’s current means of structuring data (Sect. 3.1), and introduce SMW’s annotations with properties (Sect. 3.2). Finally, a formal semantic interpretation of the wiki’s structure in terms of OWL is presented (Sect. 3.3).
1 http://semantic-mediawiki.org.
M. Krötzsch, D. Vrandečić
3.1 Content Structuring in MediaWiki

The primary method for entering information into MediaWiki is wikitext, a simplified markup language that is transformed into HTML pages for reading. Accordingly, wikitext already provides many facilities for describing formatting, and even some for structuring content. In this section, we review the most important basic structuring mechanisms in MediaWiki: links, namespaces, categories, redirects, and templates.

For defining the interrelation of pages within a wiki, hyperlinks are arguably the most important feature. They are vital for navigation, and are sometimes even used to classify articles informally. In Wikipedia, for example, articles may contain links to pages of the form [[as of 2010]] to state that the given information might need revalidation or updates after that year.

The primary structural mechanism of most wikis is the organization of content in wiki pages. In MediaWiki, these pages are further classified into namespaces, which distinguish different kinds of pages according to their function. Namespaces cannot be defined by wiki users, but are part of the configuration settings of a site. A page’s namespace is signified by a specific prefix, such as User: for user homepages, Help: for documentation pages, or Talk: for discussion pages on articles in the main namespace. Page titles without a registered namespace prefix simply belong to the main namespace. Most pages are subject to the same kind of technical processing for reading and editing, denoted Page display and manipulation in Fig. 1. The major exceptions are so-called special pages (built-in query forms without user-edited content), which use Special: as a namespace prefix.

Many wiki engines generally use links for classifying pages. For instance, searching for all pages with a link to the page [[France]] is a good way to find information about that country. In MediaWiki, however, this use has been replaced by a more elaborate category system [27].
Every page can be assigned to one or many categories, and each category is represented by a page in the Category: namespace. Category pages in turn can be used to browse the classified pages, and also to organize categories hierarchically. Page categories and their hierarchy can be edited by all users via special markup within the wiki. Overall, the category system is the one function of MediaWiki that is closest in spirit to the extensions introduced by SMW.

Synonymous and homonymous titles are another structuring problem of large wikis. In the case of synonyms, several different pages for the same subject may emerge in a decentralized editing process. MediaWiki therefore has a redirect mechanism by which a page can be made to forward all requests directly to another page. This is useful for resolving synonyms but also for some other tasks that suggest such forwarding (e.g. the [[as of 2010]] pages mentioned above are redirects to the page about the year 2010). Homonyms, in turn, occur whenever a page title is ambiguous and may refer to many different subjects depending on context. This problem is addressed by so-called disambiguation pages that briefly list the different possible meanings of a title. Actual pages about a single sense then either use a unique synonym or are augmented with parentheses to distinguish them, e.g. in the case of [[1984 (book)]].
A final formatting feature of significance to the structure of the wiki is MediaWiki’s template system. The wiki parser replaces templates with the text given on the template’s own page. The template text in turn may contain parameters. This can be used to achieve a higher consistency, since, e.g., a table is then defined only on a single template page, and all pages using this template will look similar. The idea of capturing semantic data in templates has been explored inside Wikipedia² and in external projects such as DBpedia [2].

In addition to the above, MediaWiki provides many ways of structuring the textual content of pages themselves, by introducing sections or tables, presentation markup (e.g. text size or font weights), and so on. SMW, however, aims at collecting information about the (abstract) concept represented by a page, not about the associated text. The layout and structure of article texts are not used for collecting semantic annotations, since they should follow didactic considerations.
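The namespace convention described in this section can be illustrated with a minimal Python sketch. It is not MediaWiki code: the set of registered prefixes and the function name split_title are assumptions made for illustration, and a real site takes its namespaces from its configuration (Property: only exists once SMW is installed).

```python
# Hypothetical set of registered namespace prefixes; a real site defines
# these in its configuration, not in code like this.
REGISTERED_NAMESPACES = {"User", "Help", "Talk", "Category", "Special", "Property"}


def split_title(full_title):
    """Return (namespace, local_title); namespace is '' for the main namespace."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix.strip() in REGISTERED_NAMESPACES:
        return prefix.strip(), rest.strip()
    # No colon, or the text before the colon is not a registered prefix:
    # the whole title belongs to the main namespace.
    return "", full_title.strip()


print(split_title("Category:City"))
print(split_title("London"))
print(split_title("1984 (book)"))
```

A title that merely contains a colon is left in the main namespace, which mirrors the rule that only registered prefixes denote namespaces.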
2 See, e.g., http://de.wikipedia.org/wiki/Hilfe:Personendaten.

3.2 Semantic Annotations in SMW

We will now introduce the simple data model that SMW uses for maintaining structured information in a wiki, along with the wikitext syntax that is often used for specifying such data. When considering syntax examples, it should be kept in mind that SMW by now supports multiple input methods, so that the input syntax is not essential. Indeed, the only parts of SMW that are aware of the surface syntax are the components for Parsing and (for query syntax) Inline Queries in Fig. 1. The underlying conceptual framework, based on properties and types, is therefore more relevant than syntactic details.

Adhering to MediaWiki’s basic principles, semantic data in SMW is also structured by pages, such that all semantic content explicitly belongs to a page. Every page corresponds to an ontology entity (including classes and properties). This locality is crucial for maintenance: if knowledge is reused in many places, users must still be able to understand where the information originated. Different namespaces are used to distinguish the different kinds of ontology entities: they can be individuals (the majority of the pages, describing elements of the domain of interest), classes (represented by categories in MediaWiki, used to classify individuals and also to create subcategories), properties (relationships between two individuals or an individual and a data value), and types (used to distinguish different kinds of properties). Categories have been available in MediaWiki since 2004, whereas properties and types were introduced by SMW.

Properties in SMW are used to express binary relationships between one individual (as represented by a wiki page) and some other individual or data value. Each wiki community is interested in different relationships depending on its topic area, and therefore SMW lets wiki users control the set of available properties. SMW’s
'''London''' is the capital city of [[England]] and of the [[United Kingdom]]. As of [[2005]], the population of London was estimated 7,421,328. Greater London covers an area of 609 square miles. [[Category:City]]

'''London''' is the capital city of [[capital of::England]] and of the [[capital of::United Kingdom]]. As of [[2005]], the population of London was estimated [[population::7,421,328]]. Greater London covers an area of [[area::609 square miles]]. [[Category:City]]

Fig. 2 Source of a page about London in MediaWiki (top) and in SMW (bottom)
property mechanism follows standard Semantic Web formalisms, where binary properties are also a central expressive mechanism. But unlike RDF-based languages, SMW does not view property statements (subject-predicate-object triples) as primary information units. SMW rather adopts a page-centric perspective where properties are a means of augmenting a page’s contents in a structured way.

MediaWiki offers no general mechanism for assigning property values to pages, and a surprising amount of additional data becomes available by making binary relationships in existing wikis explicit. The most obvious kind of binary relations in current wikis are hyperlinks. Each link establishes some relationship between two pages, without specifying what kind of relationship this is, or whether it is significant for a given purpose. SMW allows links to be characterized by properties, such that the link’s target becomes the value of a user-provided property. But not all properties take other wiki pages as values: numeric quantities, calendar dates, or geographic coordinates are examples of other available types of properties.

For example, consider the wikitext shown in Fig. 2 (top). The markup elements are easy to read: triple quotes ''' . . . ''' are used for text that should appear bold-faced, and text within square brackets [[. . . ]] is transformed into links to the wiki page of that name. The given links to [[England]], [[United Kingdom]], and [[2005]] do not carry any machine-understandable semantics yet. To state that London is the capital of England, one just extends the link to [[England]] by writing [[capital of::England]]. This asserts that London has a property called capital of with the value England. This is even possible if the property capital of has not been introduced to the wiki before.

Figure 2 (top) shows further interesting data values that do not correspond to hyperlinks, e.g. the given population number.
A syntax for annotating such values is not as straightforward as for hyperlinks, but using the same markup in both cases is still preferable to introducing completely new markup. So an annotation for the population number can be added by writing [[population::7,421,328]]. In this case, 7,421,328 does not refer to another page, and we do not want our statement to be rendered as a hyperlink. To accomplish this, users must first declare the property population and specify that it is of a numerical type. This mechanism is described below. If a property is not declared yet, then SMW assumes that its values denote wiki pages, so that annotations will become hyperlinks. An annotated version of the wikitext for London is shown in Fig. 2 (bottom), and the resulting page is displayed in Fig. 3.

Fig. 3 A semantic view of London

Properties are introduced to the wiki by just using them on some page, but it is often desirable to specify additional information about properties. SMW supports this by introducing wiki pages for properties. For example, a wiki might contain a page [[Property:Population]] where Property: is the namespace prefix. A property page can contain a textual description of the property that helps users to employ it consistently throughout the wiki, but it can also specify semantic features of a property. One such feature is the aforementioned (data)type of the property. In the case of [[Property:Population]] one would add the annotation [[has type::Number]] to state that the property expects numerical values. The property has type is a built-in property of SMW with the given special interpretation. It can also be described on its property page, but it cannot be modified or deleted.

SMW provides a number of datatypes that can be used with properties. Among those are String (character sequences), Date (points in time), and the default type Page, which creates links to other pages. Further types such as Geographic coordinate (locations on earth) are added by extensions to SMW. Each type provides its own methods to process user input and to display data values. SMW supplies a modular Datatype API, as shown in Fig. 1, that can also be extended by application-specific datatypes.

Just like properties, types also have dedicated pages within the wiki, and every type declaration creates a link to the corresponding page. To some extent, it is also possible to create new customized datatypes by creating new type pages. These pages, of course, cannot define the whole computational processing of a data value, but they can create parameterized versions of existing types.
The main application of this is to endow numerical types with conversion support for specific units of measurement. For example, the property Area in Fig. 2 (bottom) might use a custom type that supports the conversion between km² and square miles (as can be seen
in Fig. 3). Unit conversion is of great value for consolidating annotations that use different units, which can hardly be avoided in a larger wiki.
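To make the annotation syntax of this section concrete, the following Python sketch extracts property-value pairs from the annotated wikitext of Fig. 2 (bottom). It is an illustration only and not SMW's parser: the regular expression, the function names extract_annotations and parse_number, and the handling of thousands separators are assumptions, and the sketch ignores features such as alternative link labels.

```python
import re

# Matches [[property::value]]. Plain links such as [[England]] and category
# links such as [[Category:City]] contain no '::' and are therefore skipped.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_annotations(wikitext):
    """Return a list of (property, value) pairs found in the wikitext."""
    return [(p.strip(), v.strip()) for p, v in ANNOTATION.findall(wikitext)]


def parse_number(value):
    """Parse a value such as '7,421,328' for a property declared as numeric."""
    return int(value.replace(",", ""))


source = (
    "'''London''' is the capital city of [[capital of::England]] and of the "
    "[[capital of::United Kingdom]]. As of [[2005]], the population of London "
    "was estimated [[population::7,421,328]]. Greater London covers an area "
    "of [[area::609 square miles]]. [[Category:City]]"
)

annotations = extract_annotations(source)
for prop, value in annotations:
    print(prop, "=", value)
```

Whether a value such as 7,421,328 is then read as a number or as a page title depends on the type declared for the property, which is why the numeric parsing is kept in a separate function here.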
3.3 Mapping to OWL

The formal semantics of annotations in SMW, as well as their representation in the later export (see Sect. 4.3), is given via a mapping to the OWL ontology language [21] (see [13] for a textbook introduction). Most annotations can easily be exported in terms of OWL, using the obvious mapping from wiki pages to OWL entities: normal pages correspond to individuals, properties in SMW correspond to OWL properties, categories correspond to OWL classes, and property values can be abstract individuals or typed literals. Most annotations thus are directly mapped to simple OWL statements.

OWL further distinguishes object properties, datatype properties, and annotation properties. SMW properties may represent any of those depending on their type. Types themselves do not have OWL semantics, but may determine the XML Schema type used for literal values of a datatype property. Finally, containment of pages in MediaWiki’s categories is interpreted as class membership in OWL.

SMW offers a number of built-in properties that may also have a special semantic interpretation. The above property has type, for instance, has no equivalent in OWL and is interpreted as an annotation property. Many properties that provide SMW-specific meta-information (e.g. for unit conversion) are treated similarly. MediaWiki supports the hierarchical organization of categories, and SMW can be configured to interpret this as an OWL class hierarchy (this may not be desirable for all wikis). Moreover, SMW introduces a special property subproperty of that can be used for property hierarchies.

Overall, the schematic information representable in SMW is intentionally shallow, since the wiki is not intended as a general-purpose ontology editor that requires users to have specific knowledge about semantic technologies. Future work will investigate the possibility of mapping to external Semantic Web resources and reusing knowledge that is being shared on the Web external to the wiki.
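The mapping described above can be sketched in a few lines of Python that emit OWL 2 functional-style axioms as plain strings. This is a hedged illustration rather than SMW's export code: the table PROPERTY_TYPES stands in for has type declarations on property pages, the helper ident performs no escaping, and the choice of xsd:integer and xsd:string as literal types is an assumption, not a statement about the datatypes SMW actually emits.

```python
# Hypothetical type declarations, standing in for [[has type::...]]
# annotations on property pages. Undeclared properties default to Page.
PROPERTY_TYPES = {"population": "Number", "area": "String"}


def ident(title):
    """Turn a page title into a prefixed name (simplified, no escaping)."""
    return ":" + title.strip().replace(" ", "_")


def to_owl(page, categories, annotations):
    """Map one page's categories and annotations to OWL axioms (as strings)."""
    # Category membership becomes class membership.
    axioms = [f"ClassAssertion( {ident(c)} {ident(page)} )" for c in categories]
    for prop, value in annotations:
        ptype = PROPERTY_TYPES.get(prop, "Page")
        if ptype == "Page":
            # Page-valued properties behave like object properties.
            axioms.append(
                f"ObjectPropertyAssertion( {ident(prop)} {ident(page)} {ident(value)} )"
            )
        elif ptype == "Number":
            literal = value.replace(",", "")
            axioms.append(
                f'DataPropertyAssertion( {ident(prop)} {ident(page)} "{literal}"^^xsd:integer )'
            )
        else:
            axioms.append(
                f'DataPropertyAssertion( {ident(prop)} {ident(page)} "{value}"^^xsd:string )'
            )
    return axioms


axioms = to_owl(
    "London",
    ["City"],
    [("capital of", "England"), ("population", "7,421,328")],
)
for axiom in axioms:
    print(axiom)
```

The point of the sketch is the dispatch on the property's type: the same wikitext syntax yields an object property assertion or a datatype property assertion depending on the declaration.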
4 Exploiting Semantics

No matter how simple it is to create semantic annotations, the majority of users will neglect it as long as it brings no immediate benefit. In the following we introduce several features of SMW that show contributors the usefulness of semantic markup.
Fig. 4 Inverse search in SMW, here giving a list of everyone born in London
4.1 Browsing

As shown in Fig. 3, the rendered page may include a so-called factbox, which is placed at the bottom of the page to avoid disturbing normal reading. The factbox summarizes the given annotations, provides feedback on possible errors (e.g. if a given data value does not fit a property’s type), and offers links to related functions. These links can be used to browse the wiki based on its semantic content. Note that most SMW instances do not display the factbox but instead customize the user’s experience by using inline queries to display the semantic data.

The page title in the factbox heading leads to a semantic browsing interface that shows not only the annotations within the given page, but also all annotations where the given page is used as a value. The magnifier icon behind each value leads to an inverse search for all pages with similar annotations (Fig. 4). Both of those user interfaces are realized as special pages, architecturally similar to the special page OWL Export in Fig. 1. In addition, the factbox shows links to property pages, which in turn list all annotations for a given property. All those browsing features are interconnected by appropriate links, so that users can easily navigate within the semantic knowledge base.
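The two lookups behind these browsing features can be sketched with simple in-memory indexes. The Python code below is an illustration under stated assumptions, not a description of SMW's storage layer: the sample facts (including the placeholder pages Person A and Person B) and the function name build_indexes are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample annotations as (page, property, value) triples.
facts = [
    ("London", "capital of", "England"),
    ("Person A", "born in", "London"),
    ("Person B", "born in", "London"),
]


def build_indexes(facts):
    """Build the two lookups used for semantic browsing and inverse search."""
    by_value = defaultdict(list)       # value -> [(page, property)]
    by_annotation = defaultdict(list)  # (property, value) -> [page]
    for page, prop, value in facts:
        by_value[value].append((page, prop))
        by_annotation[(prop, value)].append(page)
    return by_value, by_annotation


by_value, by_annotation = build_indexes(facts)

# Browsing interface: all annotations in which London is used as a value.
print(by_value["London"])

# Inverse search: all pages annotated with born in::London (cf. Fig. 4).
print(by_annotation[("born in", "London")])
```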
4.2 Querying

SMW includes a query language that allows access to the wiki’s data. The query language can be used in three ways: to query the wiki directly via a special query page, to add the answer to a page by creating an inline query (cf. Fig. 1), or by using concepts. Inline queries enable editors to add dynamically created lists or tables to a page, thus making up-to-date query results available to readers who are not even aware of the semantic capabilities of the underlying system. Figure 5 shows a query result as it might appear within an article about Switzerland. Compared to manually edited listings, inline queries are more accurate, easier to create, and easier to maintain.

Fig. 5 A semantic query for all cantons of Switzerland, together with their capital, population, and languages; the data stems from an automatically annotated version of Wikipedia

Concepts allow classes to be described intensionally, and thus provide a counterpart to MediaWiki’s extensionally described categories. A namespace Concept: is introduced, each page of which can be used to define one concept by specifying a query. Individual pages cannot be tagged explicitly with a concept; instead, an individual instantiates a concept implicitly by satisfying the query description. This makes it possible to define concepts such as ISWC Conference by means of a query such as [[Category:Conference]] [[series::ISWC]]. All conferences that are properly annotated will then automatically be recognized as ISWC conferences. Concepts can be used in queries just like normal categories, and allow a higher abstraction than categories do.

The syntax of SMW’s query language is closely related to wikitext, whereas its semantics corresponds to specific class expressions in OWL.³ Each query is a disjunction of conjunctions of conditions. Fundamental conditions are encoded as query atoms whose syntax is similar to that of SMW’s annotations. For instance, [[located in::England]] is the atomic query for all pages with this annotation. Queries with other types of properties and category memberships are constructed following the same principle. Instead of single fixed values one can also specify ranges of values, and even specify nested query expressions. A simplified form of SMW’s query language is defined in Fig. 6 (top).
The main control symbols used to structure queries are: OR (and, in property values, ||) as the disjunction operator, <q> and </q> as (sub)query delimiters, + as the empty […]

3 SMW’s query language has never been officially named, but some refer to it as AskQL [11].

QUERY ::= CONJ ('OR' CONJ)*
CONJ  ::= ATOM (ATOM)*
ATOM  ::= SUB | PROP | CAT | PAGE
SUB   ::= '<q>' QUERY '</q>'
PROP  ::= '[[' TITLE '::' VALUE ('||' VALUE)* ']]'
VALUE ::= '+' | SUB | (('>'|' […]

[Fig. 6 (remainder): the rules for CAT and PAGE, the second rule set for QUERY, CONJ, ATOM, SUB, and PROP, and the example query of Fig. 7 (top), of which only the ending 500,000]] survives, are missing in this copy.]

ObjectUnionOf(
  ObjectIntersectionOf(
    ObjectUnionOf(City)
    ObjectUnionOf(
      ObjectSomeValuesFrom(located_in
        ObjectUnionOf(
          ObjectIntersectionOf(
            ObjectUnionOf(Country)
            ObjectSomeValuesFrom(member_of ObjectOneOf(EU))
          )
        )
      )
      ObjectSomeValuesFrom(population
        ObjectUnionOf(valueToOWL(>=500,000))
      )
    )
  )
)
Fig. 7 Example SMW query (top) with sketch of corresponding OWL class description (bottom)
Just like OWL, SMW’s query language does not support explicit variables, which essentially disallows cross-references between parts of the query. This ensures that all queries are tree-like. For instance, it is not possible to ask for the names of all the people who died in the city they were born in. This restriction makes query answering tractable [12, 30], which is essential for SMW’s usage in large wikis. In contrast, when variables are allowed, querying is at least NP-hard, and it becomes harder still for tractable fragments of OWL 2 [15]. SMW queries, as introduced above, merely define a result set of pages. In order to retrieve more information about those results, SMW allows so-called print requests as parts of queries. For instance, adding ?has capital as a query parameter will cause all values of the property has capital to be displayed for each result. Figure 5 shows a typical output for a query with multiple print requests. By using further parameters in query invocation, result formatting can be controlled to a large degree. In addition to tabular output, SMW also supports various types of lists and enumerations. Many further custom formats—such as Exhibit faceted browsing views and interactive timelines—are provided by the Semantic Result Formats extension package to SMW.
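The tree-like structure of queries and the role of print requests can be illustrated with a small in-memory evaluator. The Python sketch below is not SMW's query engine: the encoding of queries as nested lists and tuples, the function names, and the toy pages (Alphaville, Betatown, Gammaburg, Freedonia, Sylvania) with their values are all hypothetical. A query is a list of conjunctions (OR), each conjunction is a list of atoms (AND), and a property atom may carry a fixed value, a comparison, or a nested subquery.

```python
# Hypothetical toy wiki: page -> categories and property values.
pages = {
    "Alphaville": {"categories": {"City"},
                   "props": {"located in": ["Freedonia"], "population": [800000]}},
    "Betatown": {"categories": {"City"},
                 "props": {"located in": ["Freedonia"], "population": [120000]}},
    "Gammaburg": {"categories": {"City"},
                  "props": {"located in": ["Sylvania"], "population": [900000]}},
    "Freedonia": {"categories": {"Country"}, "props": {"member of": ["EU"]}},
    "Sylvania": {"categories": {"Country"}, "props": {}},
}

EMPTY = {"categories": set(), "props": {}}


def atom_holds(page, atom, pages):
    """Check a single query atom against one page."""
    data = pages.get(page, EMPTY)
    if atom[0] == "cat":
        return atom[1] in data["categories"]
    _, prop, cond = atom
    values = data["props"].get(prop, [])
    if isinstance(cond, list):   # nested subquery: descend into the value pages
        return any(matches(v, cond, pages) for v in values)
    if callable(cond):           # comparison such as >= 500,000
        return any(cond(v) for v in values)
    return cond in values        # single fixed value


def matches(page, query, pages):
    """A query is a disjunction (list) of conjunctions (lists) of atoms."""
    return any(all(atom_holds(page, a, pages) for a in conj) for conj in query)


def ask(query, pages, printouts=()):
    """Return matching pages together with the requested property values."""
    return [
        (page, {p: pages[page]["props"].get(p, []) for p in printouts})
        for page in sorted(pages)
        if matches(page, query, pages)
    ]


# Cities located in a country that is an EU member, with population >= 500,000.
query = [[
    ("cat", "City"),
    ("prop", "located in", [[("cat", "Country"), ("prop", "member of", "EU")]]),
    ("prop", "population", lambda n: n >= 500000),
]]

results = ask(query, pages, printouts=["population"])
print(results)
```

Because a subquery can only descend from a page to its property values, no part of the query can refer back to another part, which is the tree-likeness that the text identifies as the reason for tractability.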
4.3 Giving Back to the Web

The Semantic Web is all about exchanging and reusing knowledge, facilitated by standard formats that enable the interchange of structural information between producers and consumers. Section 3.3 explained how SMW’s content is grounded in OWL; this data can also be retrieved via SMW’s Web interface as an OWL export. As shown in Fig. 1, this service is implemented as a special page. It can be queried for information about certain elements. The link RDF feed within each factbox also leads to this service (see Fig. 3).

Exported data is provided in the RDF/XML serialization of OWL [22], using appropriate URIs as identifiers to prevent confusion with URLs of the wiki’s HTML documents. The semantic data is not meant to describe the HTML document but rather its (intended) subject. The generated OWL/RDF is “browseable” in the sense that URIs can be used to locate further resources, thus satisfying the linked data principles [7]. All URIs point to a Web service of the wiki that uses content negotiation to redirect callers either to the OWL export service or to the corresponding wiki page. Together with the compatibility with both OWL and RDF, this enables a maximal reuse of SMW’s data. Tools such as Tabulator [6] that incrementally retrieve RDF resources during browsing can easily retrieve additional semantic data on user request. SMW furthermore provides scripts for generating the complete export of all data within the wiki, which is useful for tools that are not tailored toward online operation, such as the faceted browser Longwell.⁴ Sample files of such an export are found at http://semanticweb.org/RDF/.

SMW generates valid URIs for all entities within the wiki. It does not burden the user with following the rules and guidelines for “cool URIs” [24], but generates them automatically from the article name. Users can, at any time, introduce new individuals, properties, or classes. Because of this, it does not make sense to use a hash namespace, as the returned file would be an ever-growing and changing list of entity names. Instead, a slash namespace is used, so that SMW can use the local name as a parameter in creating the required export of data.
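The slash-namespace scheme and the content negotiation step can be sketched as follows. This Python code is an illustration under loudly stated assumptions: the host wiki.example.org, the paths /id/, /export/, and /wiki/, and the function names are hypothetical and do not reproduce SMW's actual URI layout, and the Accept header is checked by a plain substring test without quality values.

```python
from urllib.parse import quote

# Hypothetical base URI. With a slash namespace every entity gets its own
# URI, so the local name can serve as the parameter of an export request.
BASE = "http://wiki.example.org/id/"


def local_name(page_title):
    """Derive a URI-safe local name from a page title."""
    return quote(page_title.replace(" ", "_"), safe=":")


def entity_uri(page_title):
    """URI that identifies the subject of a page, not its HTML document."""
    return BASE + local_name(page_title)


def negotiate(accept_header, page_title):
    """Pick a redirect target: RDF export for RDF clients, wiki page otherwise."""
    name = local_name(page_title)
    if "application/rdf+xml" in accept_header:
        return "http://wiki.example.org/export/" + name  # hypothetical path
    return "http://wiki.example.org/wiki/" + name        # hypothetical path


print(entity_uri("United Kingdom"))
print(negotiate("application/rdf+xml", "London"))
print(negotiate("text/html", "London"))
```

With a hash namespace, by contrast, all entities would share one document URI, and every request would have to return the complete and constantly changing list of entities.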
4 http://simile.mit.edu/wiki/Longwell.

5 Related Systems

Before SMW, other semantic wikis had been created, but most of them have by now been discontinued [9, 20, 28]. Many of the early semantic wikis emphasized the semantic side, and disregarded some of the strengths of wikis such as their usability and low learning curve. The approach of SMW has generally been to prefer a tight integration with established usage over semantic expressivity.

The most notable (and stable) related system currently is KiWi [26], previously known as IkeWiki [25]. KiWi is similar to SMW with respect to the supported kinds of easy-to-use inline wiki annotations, and various search and export functions. In contrast to SMW, KiWi introduces the concept of ontologies and (to some extent) URIs into the wiki, which emphasizes use cases of collaborative ontology editing that are not the main focus of SMW. KiWi uses URIs explicitly to identify concepts, but provides interfaces for simplifying annotation, e.g. by suggesting properties.

Another text-based approach is taken by the semantic wiki system KnowWE [5] (based on JSPWiki), which, instead of a generic annotation mechanism, introduces domain-specific annotation patterns that are close to common notations in the given field, thus reducing the cognitive overhead for domain experts to create knowledge systems.

Besides text-centered semantic wikis, various collaborative database systems have appeared recently. Examples of such systems include OntoWiki [1], OpenRecord,⁵ freebase,⁶ and OmegaWiki.⁷ Such systems typically use form-based editing, and are used to maintain data records, whereas SMW concentrates on text, the main type of content in wikis. OntoWiki draws on concepts of semantic technologies and provides a built-in faceted (RDF) browser. The other systems have their background in relational databases. Two extensions make SMW more similar to such a form-based editing system: Semantic Forms, developed by Yaron Koren,⁸ and Halo, developed by ontoprise.⁹

SMW has become the base for a number of further research works in the area of semantic wikis. Rahhal et al. [23] describe a peer-to-peer extension of SMW that allows the distributed editing of the semantic wiki. Bao et al. [3] describe the usage of SMW as a lightweight application model, implementing two applications on top of it. The MOCA extension [14] to SMW fosters the convergence of the emerging vocabulary within an SMW instance.
5 http://www.openrecord.org.
6 http://www.freebase.com.
7 http://www.omegawiki.org.
8 http://www.mediawiki.org/wiki/Extension:Semantic_Forms.
9 http://smwforum.ontoprise.com.

6 Conclusions

We have shown how MediaWiki can be modified to make part of its knowledge machine-processable using semantic technologies. On the user side, our primary change is the introduction of property-value annotations to wiki pages by means of a slight syntactic extension of the wiki source. By also incorporating MediaWiki’s existing category system, semantic data can be gathered with comparatively little effort on the user side. For further processing, this knowledge is conveniently represented in an RDF-based format. We presented the system architecture underlying our actual implementation of these ideas, and discussed how it is able to meet the high requirements for usability and scalability with which we are faced.
We have demonstrated that the system provides many immediate benefits to MediaWiki’s users, thus helping to overcome the notorious Semantic Web “chicken and egg” problem by providing early incentives for iterative metadata creation in running systems. The emerging pool of machine accessible data presents great opportunities for developers of semantic technologies who seek to evaluate and employ their tools in a practical setting. In this way, Semantic MediaWiki has become a platform for technology transfer that is beneficial both to researchers and a significant number of users worldwide, making semantic technologies part of the common usage of the World Wide Web.
References

1. Auer, S., Dietzold, S., Riechert, T.: OntoWiki – a tool for social, semantic collaboration. In: Gil, Y., Motta, E., Benjamins, R.V., Musen, M. (eds.) Proc. 5th Int. Semantic Web Conference (ISWC’05). LNCS, pp. 736–749. Springer, Berlin (2006)
2. Auer, S., Lehmann, J.: What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In: Franconi, E., Kifer, M., May, W. (eds.) Proc. 4th European Semantic Web Conference (ESWC) (2007)
3. Bao, J., Ding, L., Huang, R., Smart, P., Braines, D., Jones, G.: A semantic wiki based lightweight web application model. In: Proceedings of the 4th Asian Semantic Web Conference, pp. 168–183 (2009). URL http://www.cs.rpi.edu/~baojie/pub/2009-07-28-aswc-final.pdf
4. Barrett, D.J.: MediaWiki. O’Reilly, Sebastopol (2008)
5. Baumeister, J., Reutelshoefer, J., Puppe, F.: KnowWE: community-based knowledge capture with knowledge wikis. In: Proceedings of the 4th International Conference on Knowledge Capture (K-CAP’07). ACM, New York (2007)
6. Berners-Lee, T., Chen, Y., Chilton, L., Connolly, D., Dhanaraj, R., Hollenbach, J., Lerer, A., Sheets, D.: Tabulator: exploring and analyzing linked data on the semantic web. In: Rutledge, L., Schraefel, M.C., Bernstein, A., Degler, D. (eds.) Proceedings of the Third International Semantic Web User Interaction Workshop SWUI2006 at the International Semantic Web Conference ISWC2006 (2006)
7. Bizer, C., Heath, T., Berners-Lee, T.: Linked data – the story so far. International Journal on Semantic Web & Information Systems 5, 1–22 (2009)
8. Bozsak, E., Ehrig, M., Handschuh, S., Hotho, A., Maedche, A., Motik, B., Oberle, D., Schmitz, C., Staab, S., Stojanović, L., Stojanović, N., Studer, R., Stumme, G., Sure, Y., Tane, J., Volz, R., Zacharias, V.: KAON – towards a large scale semantic web. In: Bauknecht, K., Tjoa, A.M., Quirchmayr, G. (eds.) Proceedings of the Third International Conference on E-Commerce and Web Technologies (EC-Web 2002), Aix-en-Provence, France. LNCS, vol. 2455, pp. 304–313. Springer, Berlin (2002)
9. Campanini, S.E., Castagna, P., Tazzoli, R.: Towards a semantic wiki wiki web. In: Tummarello, G., Morbidoni, C., Puliti, P., Piazza, F., Lella, L. (eds.) Proceedings of the 1st Italian Semantic Web Workshop (SWAP2004), Ancona, Italy (2004)
10. Cimiano, P., Völker, J.: A framework for ontology learning and data-driven change discovery. In: Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB’2005) (2005)
11. Ell, B.: Integration of external data in semantic wikis. Master thesis, Hochschule Mannheim (December 2009)
12. Flum, J., Frick, M., Grohe, M.: Query evaluation via tree-decompositions. Journal of the ACM 49(6), 716–752 (2002)
13. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman & Hall/CRC, London (2009)
14. Kousetti, C., Millard, D., Howard, Y.: A study of ontology convergence in a semantic wiki. In: Aguiar, A., Bernstein, M. (eds.) WikiSym 2008 (2008). URL http://eprints.ecs.soton.ac.uk/16374/
15. Krötzsch, M., Rudolph, S., Hitzler, P.: Conjunctive queries for a tractable fragment of OWL 1.1. In: Aberer, K., Choi, K.-S., Noy, N. (eds.) Proc. 6th Int. Semantic Web Conf. (ISWC’07). Springer, Berlin (2007)
16. Krötzsch, M., Vrandečić, D., Völkel, M.: Wikipedia and the semantic web – the missing links. In: Proceedings of Wikimania 2005 – The First International Wikimedia Conference. Wikimedia Foundation, Frankfurt, Germany (2005)
17. Krötzsch, M., Vrandečić, D., Völkel, M., Haller, H., Studer, R.: Semantic Wikipedia. Journal of Web Semantics 5, 251–261 (2007)
18. Motik, B.: Reasoning in description logics using resolution and deductive databases. PhD thesis, Universität Fridericiana zu Karlsruhe (TH), Germany (2006)
19. Motik, B., Patel-Schneider, P.F., Parsia, B. (eds.): OWL 2 Web Ontology Language: Structural Specification and Functional-style Syntax. W3C Recommendation, 27 October (2009). Available at http://www.w3.org/TR/owl2-syntax/
20. Nixon, L.J.B., Simperl, E.P.B.: Makna and MultiMakna: towards semantic and multimedia capability in wikis for the emerging web. In: Schaffert, S., Sure, Y. (eds.) Proc. Semantics 2006. Österreichische Computer Gesellschaft, Vienna (2006)
21. W3C OWL Working Group: OWL 2 Web Ontology Language: Document Overview. W3C Recommendation, 27 October (2009). Available at http://www.w3.org/TR/owl2-overview/
22. Patel-Schneider, P.F., Motik, B. (eds.): OWL 2 Web Ontology Language: Mapping to RDF Graphs. W3C Recommendation, 27 October (2009). Available at http://www.w3.org/TR/owl2-mapping-to-rdf/
23. Rahhal, C., Skaf-Molli, H., Molli, P., Weiss, S.: Multi-synchronous collaborative semantic wikis. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds.) Proceedings of the International Conference on Web Information Systems Engineering (Wise 2009), Poznan, Poland. LNCS, vol. 5802 (2009)
24. Sauermann, L., Cyganiak, R.: Cool URIs for the semantic web. W3C Interest Group Note. Available at http://www.w3.org/TR/cooluris/ (2008)
25. Schaffert, S.: IkeWiki: a semantic wiki for collaborative knowledge management. In: Tolksdorf, R., Simperl, E., Schild, K. (eds.) 1st International Workshop on Semantic Technologies in Collaborative Applications (STICA 2006) at the 15th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE 2006), pp. 388–396 (2006). doi:10.1109/WETICE.2006.46. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4092241
26. Schaffert, S., Eder, J., Grünwald, S., Kurz, T., Radulescu, M., Sint, R., Stroka, S.: KiWi – a platform for semantic social software. In: Lange, C., Schaffert, S., Skaf-Molli, H., Völkel, M. (eds.) 4th Workshop on Semantic Wikis (SemWiki2009) at the European Semantic Web Conference (ESWC 2009), Herakleion, Greece. CEUR-WS, vol. 646 (2009)
27. Schindler, M., Vrandečić, D.: Introducing new features to Wikipedia: case studies for web science. IEEE Intell. Syst. 26(1), 56–61 (2011). doi:10.1109/MIS.2011.17
28. Souzis, A.: Building a semantic wiki. IEEE Intelligent Systems 20(5), 87–91 (2005)
29. Völkel, M., Krötzsch, M., Vrandečić, D., Haller, H., Studer, R.: Semantic Wikipedia. In: Carr, L., Roure, D.D., Iyengar, A., Goble, C.A., Dahlin, M. (eds.) Proceedings of the 15th International Conference on World Wide Web (WWW2006), pp. 491–495. ACM, Edinburgh (2006)
30. Yannakakis, M.: Algorithms for acyclic database schemes. In: Proceedings of the 7th International Conference on Very Large Data Bases, pp. 82–94. IEEE Comput. Soc., Los Alamitos (1981)
31. Zacharias, V., Braun, S.: SOBOLEO: social bookmarking and lightweight ontology engineering. In: Proceedings of the Workshop on Social and Collaborative Construction of Structured Knowledge (CKC 2007) at the 16th International World Wide Web Conference (WWW2007), Banff, Canada, May 8 (2007)
Real World Application of Semantic Technology
Juergen Angele, Hans-Peter Schnurr, Saartje Brockmans, and Michael Erdmann
Abstract Ontoprise GmbH is a leading provider of industry-proven Semantic Web infrastructure technologies and products supporting dynamic semantic information integration and information management processes at the enterprise level. Ontoprise has developed a comprehensive product suite to support the deployment of semantic technologies in a range of industries. In this article, we present some typical applications where we successfully demonstrated the feasibility, maturity, and power of semantic technologies in real enterprise settings.
J. Angele, ontoprise GmbH, An der RaumFabrik 29, 76227 Karlsruhe, Germany, e-mail: [email protected]
D. Fensel (ed.), Foundations for the Web of Information and Services, DOI 10.1007/978-3-642-19797-0_17, © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
With its mature, standards-based products, ontoprise delivers key components of the Semantic Web infrastructure. The company was founded in 1999 as a spin-off of Rudi Studer’s group at AIFB, Karlsruhe University, Germany, with the aim of commercializing the technology and research results produced by the group. In the years that followed, ontoprise and Rudi’s group continued to collaborate closely, and a number of former group members have continued their careers at ontoprise.
Ontoprise has developed a comprehensive product suite to support the deployment of semantic technologies in a range of industries. The reasoning engine OntoBroker [2] is the core of ontoprise’s technology stack. It can be considered a sophisticated extension to databases that reasons over the stored base data to support tasks ranging from simple semantic search to decision support and complex problem solving. OntoStudio [1] is a modeling environment for ontologies with a particular focus on the development of rules. A third main product is Semantic MediaWiki+ (SMW+), an extension of Semantic MediaWiki, which originated in Rudi’s group at AIFB. SMW+ is a semantic enterprise wiki for work groups, departments or companies.
In this article, we focus on industry solutions based on these products and thereby emphasize the potential of semantic technologies in industry:
• For service organizations, problem-solving expertise is vital. High quality services can only be delivered when hotline staff and service technicians work hand-in-hand and the competence needed to solve problems is available where and when it is required. Semantic Guide makes the necessary expertise available at the right place, at the right time, in the right language, and independent of individuals. Intelligent advisory systems like Semantic Guide can distribute experts’ problem-solving skills to all employees in order to directly decrease maintenance costs and improve service quality and customer satisfaction.
• Ontoprise also supports the leading application areas of Semantic Business Analytics. Organizations may run very different operations, yet they often face the same challenge: their operations become more and more complex and harder to keep under control. These organizations need a systematic approach for controlling complex operations that takes into account all of the technical, legal and business aspects and provides well-founded decision-making support. Semantic Business Analytics covers the application of semantic technologies to controlling operations and to decision support in complex settings.
2 Semantic Technologies in Advisory Systems
A recent study in the industrial goods sector estimated that 65% of clients change their suppliers because they are dissatisfied with the services provided. To be competitive, manufacturers need to continuously improve the quality of their service. At the same time, service organizations have to cope with rising costs and increased customer expectations. High quality services can only be delivered when the necessary problem-solving know-how is readily available where it is needed. Typically, the required problem-solving competence is distributed among several experts, each tied to a specific area of expertise or to a particular location. As a result, service technicians often spend unnecessary time solving problems that are already known. This lost time increases time-to-fix and overall costs. Semantic Guide makes this expertise available at the right time and place, in the appropriate language, and independent of specific individuals.
This section describes the core technologies used within Semantic Guide, ontoprise’s platform for building advisory systems, using a real case from one of our clients, KUKA Roboter GmbH.
2.1 Expert System for Customer Service
The example in the following sections describes the use of a semantic expert system to support the customer service department of the industrial robot manufacturer KUKA Roboter GmbH. For many manufacturers of capital goods, customer service optimization is an essential process. Here, semantic technologies can be used for the following purposes:
• to help improve the diagnosis of malfunctions,
• to train service engineers to prepare their assignments at the customer site, and
• to support service engineers in identifying the right solution for customer problems.
For industrial robots (which are, for example, frequently used in the automotive industry), extremely high availability is essential above all else. At this particular robot manufacturer, the means to meet this demand were insufficient: rather basic methods, such as email communication and a simple error database, were in use. This approach could no longer keep up with the rapid growth of the company, shorter innovation cycles, an ever-increasing customer base and a huge variety of applications. The applications range from spot welding in the automotive industry to gas-shielded arc welding and the maintenance of roller coasters in amusement parks. This broad range led to a high number of possible error variants and potential solutions. In parallel with customer growth, the number of service technicians also grew significantly. As a result, knowledge about problem solving became scattered while the demand for certain expertise grew, especially among the new service engineers.
The system we developed for KUKA is based on ontologies that represent the complex dependencies between the robot components and their applications. It enables access to and integration of existing information sources, and enhances document search to enable failure assessment. In addition, an ontology describes the method used for searching for problem solutions. The system implemented on top of the semantic layer is web-based. This results in a three-tier architecture: the existing sources of information in the back-end, the semantic layer with the domain ontology, and finally the user interface layer. In the following sections, we describe the starting position and the different parts of the solution in more detail.
2.2 Starting Position
KUKA Roboter GmbH has clients on five continents and more than 5000 employees worldwide in about 70 subsidiaries. KUKA is one of the world’s leading providers of industrial robots, welding systems and other automated production systems. Since its establishment more than 100 years ago, the company has stood for innovation in machine and systems engineering. It develops progressive solutions for the automation of industrial production processes, especially in the European automotive sector.
The number of variations of robot configurations is extremely high. On average, one new system configuration (for instance, in the form of a new software update) is brought to market every month. Particularly for younger and less experienced service technicians, this represents a major challenge. With the help of a knowledge-based advisory system, the wealth of expertise from senior technicians can be accessed by the entire service crew.
Fig. 1 Sample KUKA robot example: As the specific robot KR 500 is a heavy weight robot and all heavy weight robots must be equipped with a hydro-pneumatic counterweight balancing system, the search for “cwb” in the context of “KR 500” can exclude all cases affecting gas or spring counterweight balancing systems
2.3 General Approach
During product development, documentation was already being drafted in accordance with certain formulation guidelines and saved in the form of XML text elements in KUKA’s content management system (CMS). The text elements are linked to a knowledge model via the device components, software releases, and so on. The knowledge model can thus assign text elements to the relevant robots. These texts are also used as training content.
Since 2005, KUKA service technicians have been using an intelligent advisory system developed by ontoprise, which uses the text elements provided by the CMS. Service technicians can search the knowledge base by cases, solutions¹ and documents. An extended search limits the matches to certain robot types or software releases using the underlying ontology. The ontology defines which causes and solutions belong to the actual robot configuration and filters out those which do not apply. Fig. 1 gives an example of how the model helps to filter out irrelevant information. The screenshots in Figs. 2 and 3 demonstrate the effect of focusing only on solutions relevant to the robot type “KR 500”.
¹ The search for solutions can make sense if the cause of a problem is known but information on the procedures is required.
Fig. 2 List of matches with solutions and cases within the knowledge base
Fig. 3 List of matches based on the ontology
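The configuration-based filtering described for the “KR 500” example in Fig. 1 can be illustrated with a minimal Python sketch. It is not the ontoprise implementation: the dictionaries, the `Case` class, the `search` function and the case titles are hypothetical, and only the facts that the KR 500 is a heavy weight robot and that heavy weight robots use a hydro-pneumatic counterweight balancing system are taken from the text.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology. Only the KR 500 / heavy weight / hydro-pneumatic
# facts come from the Fig. 1 caption; everything else is illustrative.
ROBOT_CLASS = {"KR 500": "heavy weight"}
REQUIRED_CWB = {"heavy weight": "hydro-pneumatic"}


@dataclass(frozen=True)
class Case:
    title: str
    keyword: str      # component the case is about, e.g. "cwb"
    cwb_system: str   # "hydro-pneumatic", "gas" or "spring"


def search(cases, keyword, robot_type):
    """Return the cases matching the keyword that apply to the robot type."""
    required = REQUIRED_CWB.get(ROBOT_CLASS.get(robot_type))
    hits = [c for c in cases if c.keyword == keyword]
    if required is None:
        # Nothing is known about this robot type, so nothing can be excluded.
        return hits
    return [c for c in hits if c.cwb_system == required]


# Invented example cases.
CASES = [
    Case("Pressure loss in balancing cylinder", "cwb", "hydro-pneumatic"),
    Case("Gas spring worn", "cwb", "gas"),
    Case("Spring broken", "cwb", "spring"),
]

for case in search(CASES, "cwb", "KR 500"):
    print(case.title)
```

In this sketch the exclusion follows from two facts in the model rather than from annotations on the individual cases, which is the effect the ontology is meant to achieve.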
2.4 Domain Knowledge
At the beginning of the project, an ontology was developed in close co-operation with the robot manufacturer, representing the different concepts of robot technology and the inter-dependencies of error messages and symptoms. It was important to use terminology that the prospective users could understand, to ensure maintainability after the end of the project. The basic modeling followed the On-to-Knowledge Methodology [7], with competency questions elaborated in workshops involving experienced service engineers, who subsequently took charge of the operational system as knowledge base editors. In the initial step of the methodology, the employees formulated specific questions which the system would later have to be able to answer. From these questions, the most important concepts were identified and represented with their dependencies in the basic modeling step. In addition, some instances of these concepts were modeled as examples.
Figure 4 shows a web-based visualization of the taxonomic hierarchy of the ontology. C-labels denote concepts, and indentation indicates their sub-concepts. Thus, robots have the specializations “high load”, “middle load”, “low load”, “heavy load” and “special designs”. K-labels denote the construction parts that belong to a certain assembly. For example, “axle 1” is a sub-part of a “high load” robot. V-labels denote relations and mean “connected to”. In our example, “axle 1” is connected to the “basic frame”.
Fig. 4 Domain ontology
Beyond that, the editors modeled errors, subsequent errors and their possible solutions. Rules in the ontology describe, for example, when certain errors can be included or excluded:
If an error X arises at a certain construction unit and the current robot does not contain this construction unit, the error and all its subsequent errors are excluded, as long as the subsequent errors do not have any other causes.
If an error X concerns a certain software release and the current robot runs a newer software version, then the error X and all its subsequent errors can be excluded.
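The two exclusion rules can be sketched procedurally in Python. This is an illustration under stated assumptions, not the rule set used in the KUKA system: the classes `ModeledError` and `Robot`, the function `excluded_errors` and all example data are hypothetical, software releases are assumed to be comparable tuples, and for simplicity the condition "no other remaining cause" is applied to the subsequent errors of both rules.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModeledError:
    name: str
    unit: Optional[str] = None        # construction unit where the error arises
    release: Optional[tuple] = None   # software release the error concerns
    causes: list = field(default_factory=list)  # names of preceding errors


@dataclass
class Robot:
    units: set       # construction units contained in the current robot
    release: tuple   # software release the robot runs


def excluded_errors(errors, robot):
    """Return the names of all errors that can be excluded for this robot."""
    excluded = set()
    for e in errors:
        # Rule 1: the robot does not contain the affected construction unit.
        if e.unit is not None and e.unit not in robot.units:
            excluded.add(e.name)
        # Rule 2: the robot runs a newer release than the one concerned.
        if e.release is not None and robot.release > e.release:
            excluded.add(e.name)
    # Propagate to subsequent errors whose causes are all excluded.
    changed = True
    while changed:
        changed = False
        for e in errors:
            if e.name in excluded or not e.causes:
                continue
            if all(c in excluded for c in e.causes):
                excluded.add(e.name)
                changed = True
    return excluded


# Invented example data.
robot = Robot(units={"axle 1", "basic frame"}, release=(5, 2))
errors = [
    ModeledError("E1", unit="gas cwb"),
    ModeledError("E2", causes=["E1"]),
    ModeledError("E3", causes=["E1", "E4"]),
    ModeledError("E4", unit="axle 1"),
    ModeledError("E5", release=(4, 0)),
]
print(sorted(excluded_errors(errors, robot)))
```

Here E3 stays in view because E4, one of its causes, still applies to the robot. A declarative rule language lets the reasoner derive the same closure without the explicit fixpoint loop.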
2.5 Problem Solving Methods
In order to design a generic system that can be adapted to other domains, the method for searching for suitable solutions was modeled independently of the domain knowledge. In particular, an important requirement was that, after suggesting a list of possible solutions for the specified error symptoms, the system could ask differentiating questions to reduce the number of solutions step by step. The modeling drew on the generic problem solving methods discussed in research in the 1980s and 1990s [6]. In our case, a simplified variant of the problem solving method Cover and Differentiate [3] was modeled because it was the most suitable for the existing problem and requirements.
Cover and Differentiate is the problem solving method used in the expert system “MOLE” [4]. It is suitable for classification problems whose solution is a subset of a predefined set of solutions. The method performs a search in a directed acyclic graph. The nodes of the graph are the states. An edge from state s1 to state s2 expresses that s2 is a cause of s1. Additional knowledge, so-called differentiating knowledge, allows further states to be qualified or disqualified and thus makes it possible to differentiate between solutions. Two essential principles are realized by this problem solving method:
• The exhaustivity principle states that each observed symptom must be covered by at least one cause. Therefore, a condition may not be removed from further consideration if it is the one and only cause of an observed symptom.
• The exclusivity principle states that only one cause can be responsible for an observed symptom (single fault hypothesis).
Qualifying knowledge was used to reduce the number of possible solutions. Such qualifying knowledge can refer to the conditions as well as to the edges, that is, additional knowledge can be defined for a condition and/or an edge in the graph. In our system, this knowledge was used to pose targeted questions. Thus, in our condition graph, the auxiliary knowledge ‘green oil’, for example, causes certain conditions to be disqualified, and hence one of the three solutions can be eliminated.
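The interplay of differentiating knowledge and the exhaustivity principle can be sketched in a few lines of Python. The function `differentiate` and the cause names are hypothetical, and the sketch covers only the exhaustivity check described above, not the full Cover and Differentiate method.

```python
def differentiate(candidates, disqualified):
    """Remove disqualified causes while respecting the exhaustivity principle.

    candidates:   dict mapping each observed symptom to its set of causes
    disqualified: causes ruled out by differentiating knowledge, in the
                  order in which the answers were obtained
    """
    remaining = {symptom: set(causes) for symptom, causes in candidates.items()}
    for cause in disqualified:
        # Exhaustivity: a cause that is the only remaining explanation of
        # some symptom must stay in view.
        if any(causes == {cause} for causes in remaining.values()):
            continue
        for causes in remaining.values():
            causes.discard(cause)
    return remaining


# Invented causes for the symptom 'oil in the arm' mentioned in the text.
candidates = {"oil in the arm": {"seal worn", "gear leaking", "overfilled"}}
# Assume the answer 'green oil' disqualifies the (hypothetical) cause "overfilled".
print(differentiate(candidates, ["overfilled"]))
```

With three candidate causes and one disqualified by the answer to a differentiating question, two remain, which mirrors the ‘green oil’ example in which one of three solutions is eliminated.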
Fig. 5 Ontology of the problem solving method Cover-and-Differentiate in OntoStudio
The problem solving method was itself modeled as an ontology, part of which is shown in Fig. 5. All conditions are represented by the concept State. A condition is specified in more detail by a name. Possible causes of a condition are described by the relation hasCause. The cause arising in a certain problem solution is connected to the condition using the relation isEdge. The concept EventQualifier names qualifying knowledge for conditions. An element of this class is assigned to a concrete condition using the relation forState. A ConnectionQualifier qualifies the cause relationship between two conditions. The class ObservedState contains the actually observed symptoms, for example ‘oil in the arm’, and FinalState marks the final causes. These are also connected with the proposed solutions using the relation hasSolution.
Rules describe the principles mentioned above. Other rules define the paths from observed symptoms to their final solutions, and the influence of the qualifying knowledge on the conditions and the edges in the graph. As an example, we describe the rule which defines the different paths as well as the influence of the EventQualifier:
If for an ObservedState X or a State X the cause is Y, and neither the edge (X, Y) is eliminated by a ConnectionQualifier nor the condition Y or the condition X is eliminated by an EventQualifier, then the edge (X, Y) is an edge to a final cause. If a path leads from an ObservedState X to a State Y along the relation hasCause, then Y is a possible cause.
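Read procedurally, the path rule amounts to a graph traversal along hasCause that skips eliminated edges and conditions. The following Python sketch makes this reading explicit. It assumes that the qualifiers have already been evaluated into sets of eliminated states and edges, and the function name, parameter names and example graph are hypothetical rather than taken from the OntoStudio model.

```python
def possible_final_causes(observed, has_cause, final_states,
                          eliminated_states=frozenset(),
                          eliminated_edges=frozenset()):
    """Follow hasCause edges from the observed states and return the
    final states that are reached.

    eliminated_states stands for conditions removed via an EventQualifier,
    eliminated_edges for (X, Y) pairs removed via a ConnectionQualifier.
    """
    reached = set()
    stack = [s for s in observed if s not in eliminated_states]
    while stack:
        x = stack.pop()
        if x in reached:
            continue
        reached.add(x)
        for y in has_cause.get(x, ()):
            if (x, y) in eliminated_edges or y in eliminated_states:
                continue  # this edge does not lead to a final cause
            stack.append(y)
    return reached & set(final_states)


# Invented condition graph for the symptom 'oil in the arm'.
has_cause = {
    "oil in the arm": ["seal worn", "gear leaking"],
    "seal worn": ["seal aged", "wrong oil"],
    "gear leaking": ["housing cracked"],
}
final_states = {"seal aged", "wrong oil", "housing cracked"}

print(sorted(possible_final_causes(["oil in the arm"], has_cause, final_states)))
print(sorted(possible_final_causes(["oil in the arm"], has_cause, final_states,
                                   eliminated_states={"gear leaking"})))
```

Eliminating an intermediate condition removes every final cause that is reachable only through it, which is how an answered question narrows the list of suggested solutions.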
Fig. 6 Mapping between the domain ontology and the problem solving ontology with OntoMap
One of the strengths of problem solving methods, apart from their general validity, is the well-defined requirements they place on the necessary domain knowledge. After selecting Cover and Differentiate it is clear, for example, that there must be error descriptions, error causes, relations and qualifying knowledge. Thus, the choice of such a problem solving method naturally also affects the structuring of the knowledge of the application domain (see the previous section). The terminology and structure of the problem solving method are independent of the application domain and specific to the method, whereas the terminology and structure of the application domain are specific to the domain. To bring both models together and obtain an executable model, a suitable mapping between them is needed. Figure 6 shows such a mapping, which was modeled graphically using the tool OntoMap.
2.6 Integration of Different Data Sources
The ontology of the application domain (in our case robots, their applications, faults, and possible problem solutions) serves, on the one hand, to represent this extremely complex and interwoven knowledge. On the other hand, it serves to access and integrate existing sources of information. In our case, these included two existing databases of the development department as well as several text sources (work instructions, quality management references, technical service messages in email format, and so on). The structured information was held in the database of control error messages (the online help of the PC control) and in the development error database. Both databases have a structure similar to that of the errors modeled in the ontology. In addition, an error has one or more solutions and can have a subsequent error. The ontology serves to re-interpret these sources of information in a uniform and generally clear terminology and structure. The information sources were attached to the ontology, again using OntoMap, as shown in Fig. 7. When the ontology is queried at run-time, corresponding queries to the information sources are generated, so that the current information they contain is accessed. This allows the system owner to maintain the existing information sources as usual, while the expert system always has access to up-to-date information. The integration of unstructured, textual information was realized with ontoprise’s semantic search technology SemanticMiner [5].
Fig. 7 Integration of existing information sources in the ontology
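The idea of generating source queries from an ontology-level query at run-time can be illustrated with a small, self-contained Python sketch using the standard `sqlite3` module. It does not reproduce OntoMap or OntoBroker: the mapping table, the table and column names, the function `query_errors` and the sample rows are all hypothetical, and the two source databases are represented by two tables in one in-memory database.

```python
import sqlite3

# Hypothetical mapping from the ontology attributes "name" and "solution"
# of an error to the tables and columns of two sources.
MAPPINGS = [
    {"table": "control_errors", "name": "msg_text", "solution": "remedy"},
    {"table": "dev_errors", "name": "title", "solution": "fix"},
]


def query_errors(conn, name_contains):
    """Translate one ontology-level query into one SQL query per source."""
    results = []
    for m in MAPPINGS:
        # Identifiers come from the trusted mapping above; the search term
        # is passed as a bound parameter.
        sql = (f"SELECT {m['name']}, {m['solution']} FROM {m['table']} "
               f"WHERE {m['name']} LIKE ?")
        for name, solution in conn.execute(sql, (f"%{name_contains}%",)):
            results.append({"source": m["table"], "name": name,
                            "solution": solution})
    return results


# Invented sample data standing in for the existing sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control_errors (msg_text TEXT, remedy TEXT);
CREATE TABLE dev_errors (title TEXT, fix TEXT);
INSERT INTO control_errors VALUES ('Oil pressure low', 'Check pump');
INSERT INTO dev_errors VALUES ('Oil leak at axle 1', 'Replace seal');
INSERT INTO dev_errors VALUES ('Encoder fault', 'Replace encoder');
""")

for row in query_errors(conn, "oil"):
    print(row)
```

Because the sources are queried on demand and nothing is copied into the ontology, changes made by the owners of the databases are visible to the next query.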
2.7 Further Functions
Beyond a basic semantic search of the existing documents, SemanticGuide provides support in identifying possible root causes for observed problems. After a “case” has been selected, solution suggestions are presented. A summary can be displayed for the technician using the mouse-over function. If the service technician is not able to find a solution in the system and has to venture out onto a new solution path, he can describe this in a feedback form. In a back-office editorial process, copy editors can review this feedback and eventually add new cases and solutions to the knowledge base. Service technicians are evaluated quarterly by the quantity and quality of feedback. This additionally promotes exchange of knowledge within the service organization.
3 Semantic Business Analytics
Predictive analytics encompasses a variety of techniques from statistics and data mining that analyze current and historical data to make predictions about future events. Such predictions rarely take the form of absolute statements, and are more likely to be expressed as values that correspond to the odds of a particular event or behavior taking place in the future (Wikipedia²).
Semantics also plays an important role in predictive analysis. As our sample cases show, it imposes several requirements on the underlying model. For instance, the prediction models must be communicable between domain experts and business people, the terminology must be well defined and well understood, and the models must be easy to modify, which requires an abstract representation level. Moreover, the models must be immediately operational, without any break in representation in between.
Predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture the relationships among many factors to allow assessment of the risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions. We apply ontologies to define these predictive models. As the relationships are very complex, the expressivity of the representation language is crucial. Declarative rules are the right means to define these complex relationships at a high level of abstraction.
In the following sections, we describe our solution for our customer Alstom Power AG, one of the global leaders in power generation. In our sample case, Alstom is commissioning a hydro power plant in Malaysia, a project called the “Bakun Hydroelectric Project”. The customer of the Bakun Hydroelectric Project requested an expert system to be delivered with the plant that supports the plant operators in predicting and handling issues in difficult situations. The system has three functional parts: (i) the intelligent, experience-based monitoring part analyzes the data and events received from the power plant components, (ii) the guide section gives advice on dealing with critical issues detected by the monitors, and (iii) the system should be able to learn from previous incidents. The whole expert system shell for Alstom is called CEXS (Computer Guided Expert System Shell).
3.1 Intelligent, Experience-Based Monitoring
In the following, we describe only the monitoring part. The role of the monitors is to predict future undesired operating conditions and faults, and thus to suggest maintenance actions. Each monitor has exactly one detection target and an alarm message attached. A guideline may also be attached to support the operator in remedying the issue. Monitors do not require the operator’s interaction: they run autonomously and continuously observe a set of plant readings based on rule-based calculations and derivations. The monitors have fixed sample rates, i.e. they are triggered on a time basis by the monitor agent.
The CEXS monitors data and events of the control unit, the ALSPA P320 system. The ALSPA P320 system collects all events and measurements from the power plant. This data is then checked for plausibility and cleansed if necessary. Afterwards, the data and events are analyzed using experience gained from past incidents in order to recognize upcoming errors or unstable machine states as early as possible. To achieve this, the incidents are stored in a historical data repository, enabling experts to evaluate new predictive monitors against past incidents. From these examinations, the experts can derive a new set of analysis rules and add them to the monitoring agent. The CEXS provides mechanisms and tools for analyzing the incidents and deriving rules from the experience gained.
² http://en.wikipedia.org/wiki/Predictive_analytics.
Fig. 8 Excerpt from the ontology
3.2 Sensor-Based Ontology for Monitoring
Initially, the CEXS primarily monitors the turbine, the generator, and the unit transformers. For these parts of the plant, a set of monitoring rules and interactive guidelines is provided. The excerpt in Fig. 8 shows the central concepts of the ontology model. Monitors may be activated or deactivated, and they are assigned to one or more alarms. The alarms created by the monitors contain a warning message, the description of the alarm, the time period over which events have been considered for this alarm, and a flag indicating whether the alarm has fired. A guideline can be attached to an alarm to guide the operator in handling the problem scenario defined by the alarm.
The central concept is Event. Each event has a [time; value] tuple representing the time stamp of the occurrence and the corresponding value of the sensor. The concept has a relation originatesFrom to Signal. Running Value holds the current value of Signal. This separate concept is necessary for computational reasons, as signals and events are only transferred over the bus system of the ALSPA P320 when a value changes. A Signal itself originates from a certain unit, of which there are eight in the Bakun Hydroelectric Project.
Complex relationships in this domain are described using object-logic rules. For instance, consider the question of when an oil pump runs stably: the oil pump has a stabilization time of 5 minutes after the starting procedure. The start of the pump is indicated by the value 1 of signal CL105. This could be expressed in a rule like: “if the signal CL105 has the value 1 and the time stamp T, then the pump runs stable at time T + 5 min”. In object-logic, such a rule is represented as follows:
    ?P[runsStable] :-
        ?E:Event[hasValue->1, hasId->CL105, hasTimeStamp->?T, originatesFromSensor->?S]
        and currentTime(?C) and ?D is ?C - ?T and ?D > 5
        and ?S[belongsToUnit->?P].
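For readers unfamiliar with object-logic, the same condition can be paraphrased in Python. This is only an illustration of what the rule derives, not a translation performed by any ontoprise tool: the `Event` class, its field names and the function `units_running_stable` are hypothetical, and the sensor-to-unit relation is flattened into a `unit` field on the event.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    signal_id: str
    value: int
    timestamp: datetime
    unit: str   # unit to which the originating sensor belongs


STABILIZATION = timedelta(minutes=5)


def units_running_stable(events, now):
    """Units for which a CL105 = 1 event lies more than 5 minutes in the past."""
    return {
        e.unit
        for e in events
        if e.signal_id == "CL105"
        and e.value == 1
        and now - e.timestamp > STABILIZATION
    }


# Invented events for three units.
now = datetime(2011, 1, 1, 12, 0)
events = [
    Event("CL105", 1, now - timedelta(minutes=6), "unit 1"),
    Event("CL105", 1, now - timedelta(minutes=2), "unit 2"),
    Event("CL105", 0, now - timedelta(minutes=30), "unit 3"),
]
print(units_running_stable(events, now))
```

Like the rule as written, the sketch does not check whether the pump has been stopped again since the start event. A production monitor would presumably need that additional condition.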
3.3 Monitor Administration
A monitor continuously retrieves and analyzes current data and informs people or other systems as soon as it detects an issue or an upcoming critical problem. In our case, the active monitors analyze the data delivered by over 2,000 sensors per unit of the power plant and raise an alarm if upcoming problems are predicted. The analysis of the sensor data together with their history may become very complex. Therefore, powerful formalisms must be used to represent these methods. Object-logic rules are used to describe these monitors. These rules describe logical relationships between the components, the sensor data, and their history. Another simple example, for the monitor “Insufficient oil pressure of OPU”, could be: “if the signal with signal-id 4711 has occurred and the sensor value for this signal has increased by more than 10% in the last 10 minutes, then the oil pressure is too low”.
Monitors are developed, maintained and deployed in the ontology development environment OntoStudio. OntoStudio has been extended by a component for the development and administration of such monitors, as shown in Fig. 9. The alarm rules of the monitors often follow the same pattern: they consist of a set of conditions and raise an alarm, so their rule heads share the same structure. A specialized editor (rule wizard) has been developed to make the definition of these monitors easier. This editor follows the concept of the rule wizard in Microsoft Outlook, as can be seen in Fig. 10. Different templates show rules in natural language. Clicking on a red part shows a list of instances of a concept, from which one item can be selected to fill the template. The rule editor automatically creates executable object-logic rules.
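The example monitor for signal 4711 can be sketched as a windowed check over a signal’s history. The sketch rests on an interpretation that the text does not spell out: "increased by more than 10% in the last 10 minutes" is taken to mean that the latest reading within the window exceeds the earliest reading within the window by more than 10%. The function names and the data layout are hypothetical.

```python
from datetime import datetime, timedelta


def increased_by_more_than(readings, now,
                           window=timedelta(minutes=10), ratio=0.10):
    """readings: list of (timestamp, value) pairs for one signal, sorted by time.

    Assumed interpretation: compare the earliest and the latest reading
    inside the window.
    """
    in_window = [(t, v) for t, v in readings if now - window <= t <= now]
    if len(in_window) < 2:
        return False
    first, last = in_window[0][1], in_window[-1][1]
    return first > 0 and (last - first) / first > ratio


def check_oil_pressure_monitor(readings_by_signal, now):
    """Return the alarm message of the example monitor, or None."""
    readings = readings_by_signal.get("4711", [])
    if increased_by_more_than(readings, now):
        return "Insufficient oil pressure of OPU"
    return None


# Invented readings for signal 4711.
now = datetime(2011, 1, 1, 12, 0)
history = {"4711": [(now - timedelta(minutes=8), 100.0),
                    (now - timedelta(minutes=1), 115.0)]}
print(check_oil_pressure_monitor(history, now))
```

Since values are only transmitted when they change, the earliest reading inside the window may differ from the actual value at the start of the window. A real monitor would presumably use the running value for that comparison.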
Fig. 9 Monitor administration in OntoStudio
Fig. 10 Rule wizard for monitor rules
To meet the requirements for interactive guidelines, which guide the operator through complex faults so that operation can be resumed, we used our product SemanticGuide as described in Section 2.
4 Conclusion and Outlook
Ontoprise’s products and the use cases presented in this article show how semantic technologies support knowledge workers in the enterprise. Searching for information such as documents, or concrete decision support for the maintenance of robots, can be strongly improved by ontologies and reasoning on top of these ontologies. Currently, ontoprise’s products have thousands of installations; this demonstrates that semantic technologies are increasingly being used in industrial applications. Clearly, we have left the research corner and moved on towards customer satisfaction, focusing our attention on supporting customers’ needs.
In conclusion, Rudi’s spin-off from 1999 has transferred research results from the university into a successful and profitable business. A lively partnership between Rudi’s institute and ontoprise remains, allowing this valuable knowledge transfer and collaboration to continue.
References
1. Angele, J., Erdmann, M., Schnurr, H.-P., Wenke, D.: Ontology based knowledge management in automotive engineering scenarios. In: Proceedings of ESTC 2007, 1st European Semantic Technology Conference, Vienna, Austria, 31.05–01.06 (2007)
2. Decker, S., Erdmann, M., Fensel, D., Studer, R.: Ontobroker: ontology based access to distributed and semi-structured information. In: Meersman, R., et al. (eds.) Database Semantics: Semantic Issues in Multimedia Systems. Kluwer Academic, Norwell (1999)
3. Eshelman, L.: MOLE: a knowledge-acquisition tool for cover-and-differentiate systems. In: Marcus, S. (ed.) Automatic Knowledge Acquisition for Expert Systems, pp. 37–80. Kluwer, Boston (1988)
4. Eshelman, L., McDermott, J.: MOLE: a knowledge acquisition tool that uses its head. In: Boose, J.H., Gaines, B. (eds.) AAAI-Workshop: Knowledge-Acquisition for Knowledge Based Systems, Banff, Canada (1986)
5. Mönch, E., Ullrich, M., Schnurr, H.-P., Angele, J.: SemanticMiner: ontology-based knowledge retrieval. In: Special Issue of Selected Papers of the WM2003 in the Journal of Universal Computer Science (J.UCS)
6. Marcus, S. (ed.): Automatic Knowledge Acquisition for Expert Systems. Kluwer, Boston (1988)
7. Sure, Y., Staab, S., Studer, R.: On-to-knowledge methodology. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies, pp. 117–132. Springer, Berlin (2003)