Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison – Lancaster University, UK
Takeo Kanade – Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler – University of Surrey, Guildford, UK
Jon M. Kleinberg – Cornell University, Ithaca, NY, USA
Alfred Kobsa – University of California, Irvine, CA, USA
Friedemann Mattern – ETH Zurich, Switzerland
John C. Mitchell – Stanford University, CA, USA
Moni Naor – Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz – University of Bern, Switzerland
C. Pandu Rangan – Indian Institute of Technology, Madras, India
Bernhard Steffen – TU Dortmund University, Germany
Madhu Sudan – Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos – University of California, Los Angeles, CA, USA
Doug Tygar – University of California, Berkeley, CA, USA
Gerhard Weikum – Max-Planck Institute of Computer Science, Saarbruecken, Germany
5887
Tat-Seng Chua Yiannis Kompatsiaris Bernard Mérialdo Werner Haas Georg Thallinger Werner Bailer (Eds.)
Semantic Multimedia 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009 Graz, Austria, December 2-4, 2009 Proceedings
Volume Editors Tat-Seng Chua National University of Singapore 3 Science Drive, Singapore 117543, Singapore E-mail:
[email protected] Yiannis Kompatsiaris Informatics and Telematics Institute Centre for Research and Technology–Hellas 6th km Charilaou-Thermi Road, 57001 Thermi-Thessaloniki, Greece E-mail:
[email protected] Bernard Mérialdo Institut Eurécom Département Communications Multimédia 2229, route des Crêtes, 06904 Sophia-Antipolis CEDEX, France E-mail:
[email protected] Werner Haas Georg Thallinger Werner Bailer JOANNEUM RESEARCH Forschungsgesellschaft mbH Institute of Information Systems Steyrergasse 17, 8010 Graz, Austria E-mail: {werner.haas, georg.thallinger, werner.bailer}@joanneum.at Library of Congress Control Number: 2009939151 CR Subject Classification (1998): H.5.1, H.4, I.7, I.4, H.5, H.3.5 LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-10542-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-10542-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12798667 06/3180 543210
Preface
This volume contains the full and short papers of SAMT 2009, the 4th International Conference on Semantic and Digital Media Technologies, held in Graz, Austria. SAMT brings together researchers dealing with a broad range of research topics related to semantic multimedia and a great diversity of application areas. Current research shows that adding and using semantics of multimedia content is broadening its scope from search and retrieval to the complete media life cycle, from content creation to distribution and consumption, thus leveraging new possibilities in creating, sharing and reusing multimedia content. While some of the contributions present improvements in automatic analysis and annotation methods, there is increasingly more work dealing with visualization, user interaction and collaboration. We can also observe ongoing standardization activities related to semantic multimedia in both W3C and MPEG, forming a solid basis for wide adoption.
The conference received 41 submissions this year, of which the Program Committee selected 13 full papers for oral presentation and 8 short papers for poster presentation. In addition to the scientific papers, the conference program included two invited talks by Ricardo Baeza-Yates and Stefan Rüger and a demo session showing results from three European projects. The day before the main conference offered an industry day with presentations and demos that showed the growing importance of semantic technologies in real-world applications as well as the research challenges arising from them.
From the submitted proposals, the Workshop and Tutorial Chairs selected two full-day workshops, namely:
– Semantic Multimedia Database Technologies
– Learning the Semantics of Audio Signals
In addition, there were three half-day tutorials, namely:
– A Semantic Multimedia Web: Create, Annotate, Present and Share Your Media
– MPEG Metadata for Context-Aware Multimedia Applications
– Web of Data in the Context of Multimedia
The workshops complement the conference by providing a forum for discussion about emerging fields in the scope of SAMT, and the tutorials are an opportunity for the participants to get a condensed introduction to one of the many areas related to semantic multimedia.
This conference would not have been possible without the tremendous support of many people. We would like to thank the Workshop and Tutorial Chairs, Josep Blat, Noel O'Connor and Klaus Tochtermann, the Industry Day Chairs
Wessel Kraaij and Alberto Messina, as well as Wolfgang Halb, Helen Hasenauer and Karin Rehatschek, who did a great job in organizing this event. We would like to thank the Program Committee members for the thorough review of the submissions, the invited speakers, the workshop organizers and tutors, and all contributors and participants. We are grateful for the support provided by the consortia of the SALERO and VidiVideo projects, the Young European Associated Researchers (YEAR) network, the City of Graz, the Province of Styria and the Austrian Federal Ministry of Science and Research. December 2009
Tat-Seng Chua Yiannis Kompatsiaris Bernard Mérialdo Werner Haas Georg Thallinger Werner Bailer
Conference Organization
General and Local Chairs
Werner Haas – JOANNEUM RESEARCH, Austria
Georg Thallinger – JOANNEUM RESEARCH, Austria
Werner Bailer – JOANNEUM RESEARCH, Austria
Program Chairs
Tat-Seng Chua – National University of Singapore, Singapore
Yiannis Kompatsiaris – ITI, Greece
Bernard Mérialdo – Eurecom, France
Program Committee
Riccardo Albertoni – IMATI-GE/CNR, Italy
Yannis Avrithis – NTUA, Greece
Bruno Bachimont – INA, France
Wolf-Tilo Balke – University of Hannover, Germany
Mauro Barbieri – Philips Research, The Netherlands
Jenny Benois-Pineau – University of Bordeaux, France
Stefano Bocconi – University of Trento / VUA, Italy / The Netherlands
Susanne Boll – University of Oldenburg, Germany
Nozha Boujemaa – INRIA, France
Tobias Bürger – STI Innsbruck, Austria
Chiara Catalano – University of Genova, Italy
Oscar Celma – Universitat Pompeu Fabra, Spain
Lekha Chaisorn – I2R, Singapore
Stavros Christodoulakis – Technical University of Crete, Greece
Philipp Cimiano – Uni Karlsruhe, Germany
Matthew Cooper – FXPAL, USA
Charlie Cullen – Dublin Institute of Technology, Ireland
Thierry Declerck – DFKI, Germany
Mark van Doorn – Philips Research, The Netherlands
Touradj Ebrahimi – Swiss Federal Institute of Technology, Switzerland
Alun Evans – Barcelona Media, Spain
Bianca Falcidieno – IMATI-GE/CNR, Italy
Christophe Garcia – France Telecom R&D, France
Joost Geurts – CWI, The Netherlands
Michael Granitzer – Know Center, Austria
William Grosky – University of Michigan, USA
Siegfried Handschuh – DERI, Ireland
Michael Hausenblas – DERI, Ireland
Willemijn Heeren – University of Twente, The Netherlands
Winston Hsu – NTU, Taiwan
Ichiro Ide – Nagoya University / NII, Japan
Ignasi Iriondo – Universitat Ramon Llull, Spain
Antoine Isaac – VUA, The Netherlands
Ebroul Izquierdo – QMUL, UK
Alejandro Jaimes – Telefonica R&D, Spain
Joemon Jose – University of Glasgow, UK
Mohan Kankanhalli – NUS, Singapore
Brigitte Kerhervé – Université du Québec à Montréal, Canada
Stefanos Kollias – NTUA, Greece
Harald Kosch – University of Passau, Germany
Hyowon Lee – Dublin City University, Ireland
Jean Claude Leon – INPG, France
Paul Lewis – University of Southampton, UK
Craig Lindley – Blekinge Tekniska Högskola, Sweden
Suzanne Little – Open University, UK
Vincenzo Lombardo – Università di Torino, Italy
Mathias Lux – University of Klagenfurt, Austria
Erik Mannens – Ghent University, Belgium
Stephane Marchand-Maillet – University of Geneva, Switzerland
Simone Marini – IMATI-GE/CNR, Italy
Jose M. Martinez – GTI-UAM, Spain
Mark Maybury – MITRE, USA
Oscar Mayor – Universitat Pompeu Fabra, Spain
Vasileios Mezaris – ITI, Greece
Carlos Monzo – Universitat Ramon Llull, Spain
Michela Mortara – IMATI-GE/CNR, Italy
Frank Nack – CWI, The Netherlands
Chong-Wah Ngo – City University of Hong Kong, Hong Kong
Zeljko Obrenovic – TU Eindhoven, The Netherlands
Jacco van Ossenbruggen – VUA, The Netherlands
Jeff Z. Pan – University of Aberdeen, UK
Thrasyvoulos Pappas – Northwestern University, USA
Ewald Quak – Tallinn University of Technology, Estonia
Lloyd Rutledge – Open Universiteit Nederland, The Netherlands
Mark Sandler – Queen Mary, UK
Simone Santini – Universidad Autonoma de Madrid, Spain
Shin'ichi Satoh – NII, Japan
Ansgar Scherp – University of Koblenz-Landau, Germany
Nicu Sebe – University of Amsterdam, The Netherlands
Elena Simperl – STI Innsbruck, Austria
Alan Smeaton – Dublin City University, Ireland
Cees Snoek – University of Amsterdam, The Netherlands
Michela Spagnuolo – IMATI-GE/CNR, Italy
Steffen Staab – University of Koblenz-Landau, Germany
Vojtech Svatek – University of Economics Prague, Czech Republic
Nadja Thalmann – University of Geneva, Switzerland
Raphael Troncy – CWI, The Netherlands
Giovanni Tummarello – DERI, Ireland
Vassilis Tzouvaras – NTUA, Greece
Remco Veltkamp – Utrecht University, The Netherlands
Paulo Villegas – Telefonica R&D, Spain
Doug Williams – BT, UK
Marcel Worring – University of Amsterdam, The Netherlands
Li-Qun Xu – British Telecom, UK
Rong Yan – IBM, USA
Additional Reviewers
Rabeeh Ayaz Abbasi – University of Koblenz-Landau, Germany
Jinman Kim – University of Geneva, Switzerland
Francesco Robbiano – IMATI Institute at CNR, Italy
Jinhui Tang – National University of Singapore, Singapore
Xiao Wu – City University of Hong Kong, Hong Kong

Organizing Institution
JOANNEUM RESEARCH, Graz, Austria
Supporting Organizations and Projects
The SALERO and VidiVideo project consortia, the Young European Associated Researchers (YEAR) network, the City of Graz, the Province of Styria, and the Austrian Federal Ministry of Science and Research
Table of Contents
Keynote Talk: Mining the Web 2.0 for Improved Image Search – Ricardo Baeza-Yates

Keynote Talk: More than a Thousand Words – Stefan Rüger

Content Organization and Browsing
A Simulated User Study of Image Browsing Using High-Level Classification – Teerapong Leelanupab, Yue Feng, Vassilios Stathopoulos, and Joemon M. Jose
Exploring Relationships between Annotated Images with the ChainGraph Visualization – Steffen Lohmann, Philipp Heim, Lena Tetzlaff, Thomas Ertl, and Jürgen Ziegler
On the Coöperative Creation of Multimedia Meaning – Claudio Cusano, Simone Santini, and Raimondo Schettini

Annotation and Tagging I
On the Feasibility of a Tag-Based Approach for Deciding Which Objects a Picture Shows: An Empirical Study – Viktoria Pammer, Barbara Kump, and Stefanie Lindstaedt
Statement-Based Semantic Annotation of Media Resources – Wolfgang Weiss, Tobias Bürger, Robert Villa, Punitha P., and Wolfgang Halb
Large Scale Tag Recommendation Using Different Image Representations – Rabeeh Abbasi, Marcin Grzegorzek, and Steffen Staab
Interoperable Multimedia Metadata through Similarity-Based Semantic Web Service Discovery – Stefan Dietze, Neil Benn, John Domingue, Alex Conconi, and Fabio Cattaneo

Content Distribution and Delivery
Semantic Expression and Execution of B2B Contracts on Multimedia Content – Víctor Rodríguez-Doncel and Jaime Delgado
A Conceptual Model for Publishing Multimedia Content on the Semantic Web – Tobias Bürger and Elena Simperl
CAIN-21: An Extensible and Metadata-Driven Multimedia Adaptation Engine in the MPEG-21 Framework – Fernando López, José M. Martínez, and Narciso García

Annotation and Tagging II
Shot Boundary Detection Based on Eigen Coefficients and Small Eigen Value – Punitha P. and Joemon M. Jose
Shape-Based Autotagging of 3D Models for Retrieval – Ryutarou Ohbuchi and Shun Kawamura
PixGeo: Geographically Grounding Touristic Personal Photographs – Rodrigo F. Carvalho and Fabio Ciravegna

Short Papers
Method for Identifying Task Hardships by Analyzing Operational Logs of Instruction Videos – Junzo Kamahara, Takashi Nagamatsu, Yuki Fukuhara, Yohei Kaieda, and Yutaka Ishii
Multimodal Semantic Analysis of Public Transport Movements – Wolfgang Halb and Helmut Neuschmied
CorpVis: An Online Emotional Speech Corpora Visualisation Interface – Charlie Cullen, Brian Vaughan, John McAuley, and Evin McCarthy
Incremental Context Creation and Its Effects on Semantic Query Precision – Alexandra Dumitrescu and Simone Santini
OntoFilm: A Core Ontology for Film Production – Ajay Chakravarthy, Richard Beales, Nikos Matskanis, and Xiaoyu Yang
RelFinder: Revealing Relationships in RDF Knowledge Bases – Philipp Heim, Sebastian Hellmann, Jens Lehmann, Steffen Lohmann, and Timo Stegemann
Image Annotation Refinement Using Web-Based Keyword Correlation – Ainhoa Llorente, Enrico Motta, and Stefan Rüger
Automatic Rating and Selection of Digital Photographs – Daniel Kormann, Peter Dunker, and Ronny Paduschek

Author Index
Keynote Talk: Mining the Web 2.0 for Improved Image Search Ricardo Baeza-Yates Yahoo! Research Barcelona http://research.yahoo.com
There are several semantic sources that can be found in the Web that are either explicit, e.g. Wikipedia, or implicit, e.g. derived from Web usage data. Most of them are related to user generated content (UGC) or what is today called the Web 2.0. In this talk we show how to use these sources of evidence in Flickr, such as tags, visual annotations or clicks, which represent the wisdom of crowds behind UGC, to improve image search. These results are the work of the multimedia retrieval team at Yahoo! Research Barcelona and they are already being used in Yahoo! image search. This work is part of a larger effort to produce a virtuous data feedback circuit based on the right combination of many different technologies to leverage the Web itself.
Keynote Talk: More than a Thousand Words Stefan Rüger Knowledge Media Institute The Open University http://kmi.open.ac.uk/mmis
This talk will examine the challenges and opportunities of Multimedia Search, i.e., finding multimedia by fragments, examples and excerpts. What is the state-of-the-art in finding known items in a huge database of images? Can your mobile phone take a picture of a statue and tell you about its artist and significance? What is the importance of geography as local context of queries? To which extent can automated image annotation from pixels help the retrieval process? Does external knowledge in terms of ontologies or other resources help the process along?
A Simulated User Study of Image Browsing Using High-Level Classification Teerapong Leelanupab, Yue Feng, Vassilios Stathopoulos, and Joemon M. Jose University of Glasgow, Glasgow, G12 8RZ, United Kingdom {kimm,yuefeng,stathv,jj}@dcs.gla.ac.uk
Abstract. In this paper, we present a study of adaptive image browsing based on high-level classification. The underlying hypothesis is that the performance of a browsing model can be improved by integrating high-level semantic concepts. We introduce a multi-label classification model designed to alleviate the binary classification problem in image classification. The effectiveness of this approach is evaluated using a simulated user evaluation methodology. The results show that the classification assists users in narrowing down the search domain and in retrieving more relevant results with less browsing effort.
1 Introduction
The accumulation of large volumes of multimedia data, such as images and videos, has led researchers to investigate indexing and search methods for such media in order to render them accessible for future use. Early Content-Based Image Retrieval (CBIR) systems were based solely on low-level features extracted from images, inspired by developments in image processing and computer vision [3]. Nevertheless, due to the "semantic gap" problem [13], using just low-level descriptors will not lead to an effective image retrieval solution. Recent research in multimedia indexing has investigated automatic annotation methods to index multimedia data with keywords which convey the semantic content of media. Those keywords cannot, however, represent all aspects of image content due to the inherent complexity of multimedia data. In both of the above cases in multimedia retrieval, the search paradigm is similar and inspired by traditional information retrieval systems. The searcher poses a query to the system, which can be a rough sketch, a predicate query such as "images with at least 80% blue" or a textual query, and then the system returns a ranked list of potentially relevant images. An alternative search paradigm that is better suited to the nature of multimedia data, and especially images, is browsing. A well-studied browsing approach is to visualize retrieved images as a graph where nodes are images and paths are relationships between them based on some underlying similarity. Browsing is facilitated by allowing users to browse the collection by following paths in this graph (e.g. [5,6]). In this approach, relevance feedback and the Ostensive Model of developing information needs can be easily integrated [15].
Browsing models are inherently different from traditional image retrieval systems, since the focus is not on user queries but on the user's browsing path, where implicit feedback is provided. It is therefore difficult to see how automatically extracted keywords can be utilized to improve image browsing. Although query by keywords has been shown to improve performance over low-level similarity in image retrieval, especially when both methods are combined [10], to the best of our knowledge no study has been performed on integrating high-level classification, applied to annotate images, into browsing models. The underlying assumption of this integration is that browsing effectiveness will be enhanced. Motivated by this, we aim at answering the following research questions:
– How can high-level semantic concepts be integrated into a browsing model where user queries are limited and search is based on the user's implicit feedback?
– Can we improve the response of a browsing model by using high-level classification? That is, can we reduce the number of clicks a user follows, and consequently the time spent browsing, in order to find relevant images?
Browsing systems are interactive search systems that require user intervention, and therefore user experiments are required to evaluate such methods. However, there are several strategies for integrating high-level classification into the browsing model and each has to be evaluated separately; hence, evaluating them requires a large-scale user experiment, which is expensive and time consuming. In this paper we propose an evaluation methodology based on simulating user actions by exploiting log files of user interactions with the system from previous user experiments [9]. Simulated evaluation can be used as a first step before performing an actual user study while ensuring a fair comparison between different methods. Once an appropriate methodology is found to perform reasonably well using this simulated methodology, a user experiment can be carried out to validate the approach. The main contributions of our paper are:
– We integrate high-level semantic concepts using multi-label classification into our browsing model.
– We evaluate the effect of high-level semantics by using a simulated evaluation methodology exploiting user logs.
– We show that semantic similarity can improve browsing performance by reducing the time spent by a user browsing in order to find relevant images.
The rest of this paper is structured as follows. In Section 2, we give a short survey of related work in current image browsing systems and present their inadequacies. Section 3 introduces an approach for integrating high-level concepts into a browsing model. Section 4 presents the experimental design and measures of our study. The results of our experiments are detailed in Section 5. Finally, we conclude our work in Section 6.
2 Image Browsing
A graph-based representation of retrieved images has been well studied to assist users in accessing their image collection. Heesch [4] surveys related work
Fig. 1. Screenshot of Image Browsing Interface
in several browsing models in content-based image retrieval. For instance, the NNk networks introduced in [5] provide a graph-based structure by ranking images under a metric parametrised in terms of feature-specific weights. Torres et al. [14] propose the Spiral and Concentric Rings technique to visualise query results by placing the query image in the centre and filling a spiral line with the similar retrieved images. Similar work was carried out by Urban et al. [15], where the intentionality of the user's information needs is represented as nodes in a graph. In this approach, users browse through an image collection via retrieved images visualised in the graph. The user's interactions with the images, such as clicks, are considered as relevance feedback, which is then used to expand a search query. This approach employs the Ostensive Model of developing information needs [2] to adaptively tailor the search query to retrieve other similar images related to the leaf of the graph the user clicks on. The Ostensive Model reformulates a temporal dimension of interacted information objects into a notion of probabilistic relevance using different ostensive profiles. We adopt Urban's approach by implementing the multi-aspect based image browsing system introduced by Leelanupab et al. [9]. This system is employed as a baseline system to investigate our assumption on integrating high-level semantic concepts in our study. Figure 1 shows the screenshot of the interface, which can be divided into two main vertical panels. The left panel consists of the Full View tab (A) and the Relevant Result tab (B). Dragging and dropping images to the Full View tab (A) will display a full-size visualisation of the image, accompanied by its textual descriptions. In our user evaluation, users are expected to browse an image collection to find relevant images for a given search scenario and store them in the result panel (B). The right panel is the Browsing Panel (C) containing independent browsing sessions, visualised as tabs. It is hypothesised that each tab/session represents different aspects of search topics given by the user's interests to support complex
search tasks, as suggested by Villa et al. [18]. On the browsing panel, a user, for instance, selects an image (1), considered as a node of this graph. Similar images will be shown as leaves of this node. Selecting one of these leaves (i.e. image (2)) implicitly provides relevance feedback, with which the system, incorporating the Ostensive Model, expands the search query. The model considers the iteration in which feedback was provided by decaying the relevance of the features extracted from the objects the user interacted with according to the time of interaction. It is suggested that lower weighting should be given to earlier iterations, since the user has most likely narrowed down his search interest in the last few iterations [15]. As a result, the subsequent information in the path is assumed to be more relevant to the user. Here, browsing sessions can be initiated by selecting images from a keyword search or from other browsing sessions. At the top right of the frame, a Switching Mode button (D) is provided in order to offer the user the option to change search methods between traditional keyword search and adaptive browsing. Although this approach has been shown to retrieve more relevant information [15], it still relies on retrieval performance that is based on features extracted from objects.
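To make the weighting concrete, the following sketch shows one way an ostensively weighted query could be built from the feature vectors of the images clicked along a browsing path. The geometric decay profile, the normalisation and the simple weighted averaging are assumptions of this illustration, not the exact formulation of the Ostensive Model [2] or of the system of Urban et al. [15].

```python
import numpy as np

def ostensive_query(path_features, decay=0.5):
    """Combine the feature vectors of the images clicked along a browsing
    path into a single query vector. path_features is ordered from the
    oldest to the most recent click; earlier clicks receive exponentially
    lower weights, so the most recent interactions dominate the query."""
    n = len(path_features)
    # weight of the i-th click (i = 0 is the oldest): decay^(n-1-i),
    # i.e. the latest click gets weight 1.0 and older ones decay geometrically
    weights = np.array([decay ** (n - 1 - i) for i in range(n)], dtype=float)
    weights /= weights.sum()
    query = np.zeros_like(path_features[0], dtype=float)
    for w, features in zip(weights, path_features):
        query += w * features
    return query
```

Other ostensive profiles (e.g. flat or linearly decreasing weights) can be obtained by replacing the decay term.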
3 Incorporating High-Level Semantics
Current image browsing systems are mostly driven by low-level visual cues (e.g. [5,15,17]); however, they face the semantic gap problem, i.e., the disparity between low-level features and high-level semantics. The use of low-level features cannot give satisfactory retrieval results in many cases, especially when the high-level concepts in the user's mind are not easily expressible in terms of low-level features. Thus, the extraction of visual concepts can be regarded as a bridge between low-level features and high-level semantic concepts to improve retrieval performance. In order to fill this gap, a considerable amount of research into classification methods [16] has been carried out, in particular because classification can act as a translator that bridges low-level features and semantic concepts by classifying images into different categories based on their similarity to each category. Furthermore, most classification methods focus on binary classifiers, which classify the data into one of only two classes. One of the most popular binary classifiers is the Support Vector Machine (SVM). The key advantage of the SVM is that it seeks to fit an optimal hyperplane between classes and may require only a small training sample. However, using a binary classifier on image data brings problems when an image may belong to more than one class semantically. For example, an image of a natural scene taken during a trip to the Highlands might belong to both the outdoor and the nature class rather than to either of them alone. Motivated by these existing needs in CBIR, we combine the merits of the above retrieval models to build our retrieval framework. We apply a classification method to a browsing model to support explorative search tasks by semantics.
3.1 High-Level Classification
A multi-label classification technique is employed to alleviate the binary classification problem of the SVM. We defined a small set of six generic concepts, paired for three classifiers that create three class labels for each image, since our underlying idea is to define classes which are suitable for all images instead of specific ones. Using specific classes might cause difficulties in classification, since the system would need a large number of classes to describe an image collection. In addition, using a large number of specific classes would result in a degradation of accuracy and efficiency. The concept groups are defined based on the nature of the database. An SVM-based image classification method is employed to learn the visual concepts from the training set, and is then applied to the testing images to assign the concept labels.
Spatial Features for Concept Detection – A number of existing works [12] have stated that the most efficient way for human beings to identify an image is from coarse to fine. Thus, different images can be classified into different scene concept groups based on their coarse scene information [11]. For instance, images of man-made scenes are characterised by the geometry of vertical and horizontal structure: urban outdoor scenes will have more vertical edges, with fewer in indoor scenes. Considering the possibility of extracting concept information via scene characteristics, we develop a concept-based image classification using scene characteristic features. The scene characteristic features are computed in the frequency domain using the Gabor filter [7].
SVM-based Concept Detector Training – The original SVM is designed for binary classification. In our case, we have six pre-defined image classes, resulting in a multi-class problem. We use the following method to reduce it to a set of binary problems. First, a set of binary classifiers, each of which is trained to separate one class from the rest, is built. In other words, n hyperplanes are constructed, where n is the number of pre-defined classes. Each hyperplane separates one class from the others. In this experiment, three pairs of classifiers were defined in order to classify three pairs of generic concepts, namely indoor/outdoor, nature/man-made, and portrait/crowd, where different concepts can overlap in an image. As a result, the total combination of those classifiers can form eight (2^3) different categories. For instance, the classification result for one image can be represented as 011, where 0 means that the image is classified into the first category of a pair and 1 means into the second category. It is our intention to measure the effectiveness of integrating such high-level classes for image browsing models.
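As an illustration of this scheme, the sketch below trains one binary detector per concept pair and concatenates the three decisions into a three-bit category code. scikit-learn's SVC, the RBF kernel and the layout of the training data are stand-ins chosen for the example; the paper does not specify the SVM implementation or its parameters.

```python
from sklearn.svm import SVC

CONCEPT_PAIRS = [("indoor", "outdoor"),
                 ("nature", "man-made"),
                 ("portrait", "crowd")]

def train_concept_detectors(features, labels_per_pair):
    """Train one binary SVM per concept pair.
    features: (n_images, n_features) array, e.g. Gabor-based scene features.
    labels_per_pair: three arrays of 0/1 labels, one per concept pair
    (0 = first concept of the pair, 1 = second concept)."""
    detectors = []
    for labels in labels_per_pair:
        classifier = SVC(kernel="rbf")   # kernel choice is an assumption
        classifier.fit(features, labels)
        detectors.append(classifier)
    return detectors

def category_code(detectors, feature_vector):
    """Combine the three binary decisions into one of the 2^3 = 8 categories,
    e.g. '011' = indoor, man-made, crowd."""
    bits = [str(int(clf.predict([feature_vector])[0])) for clf in detectors]
    return "".join(bits)
```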
3.2 Using Semantic Concepts on Image Browsing System
In the browsing system, the classification will work as follows. First, the classification is applied to compute the category of images in the experimental collection
so that the raw image data in terms of low-level features can be translated into high-level concepts. Next, given the images in a browsing path selected by the user, the retrieval algorithm takes these browsed images as a query and searches for similar images only within the same category as labelled in the collection. Note that every image in the path belongs to exactly one of the eight categories based on the three pairs of pre-defined classes, with the underlying assumption that high-level concepts will increasingly improve the retrieved results in each browsing iteration. This approach exploits the user's feedback of selecting images as a query to browse the image collection specifically within the category he or she is interested in. The user thus browses the image collection based not only on low-level features, but also on semantic concepts.
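A minimal sketch of such category-constrained browsing is given below; the concrete similarity function (e.g. a combination of the MPEG-7 descriptor distances used in Section 4) and the collection layout are assumptions of this example.

```python
def browse_candidates(selected_id, collection, similarity, k=6):
    """Return the top-k images shown as leaves of the selected node.
    collection: dict mapping image id -> {"category": "011", "features": ...}.
    Only images carrying the same high-level category label as the selected
    image are considered, so each click narrows the search domain."""
    query = collection[selected_id]
    candidates = [(image_id, data) for image_id, data in collection.items()
                  if image_id != selected_id
                  and data["category"] == query["category"]]
    ranked = sorted(candidates,
                    key=lambda item: similarity(query["features"], item[1]["features"]),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:k]]
```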
4 Experiment
In this section, we detail our experimental setup. First, we outline the two browsing systems used in the evaluation. The data collection is described in the next section. We then present our method for obtaining user interactions to simulate users acting on the systems, followed by a discussion of task information which affects the experimental results. We finally describe the strategy which simulates users' browsing behaviour.
4.1 System Description
There are two image browsing systems used in this evaluation: a baseline system that can enhance a simple search query using adaptive browsing, and a proposed system that extends the standard browsing system with high-level classification. Both systems have the same interface as shown in Figure 1 and share the same retrieval back-end, which uses textual and visual features as well as the Ostensive Model [2] as the adaptive retrieval model. The Terrier IR system (http://ir.dcs.gla.ac.uk/terrier/) was used for stop-word removal, stemming and indexing in textual retrieval. Okapi BM25 was used to rank retrieval results. Importantly, to support visual queries, three MPEG-7 image features have been extracted for the image dataset: Colour Layout, Edge Histogram, and Homogeneous Texture. The weights of visual and textual features are equally balanced in retrieval. The proposed system additionally employs high-level classification, which classifies the image collection into sub-categories using multi-label classification.
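The equal balancing of the two modalities could look like the sketch below; treating the combination as a score-level fusion of normalised BM25 and visual scores is our reading of the description, not a documented detail of the back-end.

```python
def combined_score(text_scores, visual_scores, alpha=0.5):
    """Fuse textual (Okapi BM25) and visual (MPEG-7 based) scores with
    equal weights. Both score dictionaries (image id -> score) are assumed
    to be normalised to [0, 1] beforehand; how the three MPEG-7 descriptor
    distances are merged into one visual score is also an assumption."""
    fused = {}
    for image_id in set(text_scores) | set(visual_scores):
        text = text_scores.get(image_id, 0.0)
        visual = visual_scores.get(image_id, 0.0)
        fused[image_id] = alpha * text + (1.0 - alpha) * visual
    return fused
```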
4.2 Data Collection
The aim of this study is to assume the role of a real user browsing his/her collection. Therefore we employed a real user collection, called CoPhIR (http://cophir.isti.cnr.it/), for SVM training and experiment. The current collection contains 106 million images derived from the Flickr (http://www.flickr.com/) archive. For training the classification, we asked three
multimedia information retrieval experts to manually classify 200 sample images for each class according to the six pre-defined concepts. We selected another subset of an estimated 20,000 images taken by unique users between 1 October 2005 and 31 March 2006 as the experimental collection. This time period was selected since it covers the highest density of images from unique users. The text used for keyword search is derived from the titles, descriptions, and tags given by Flickr users.
4.3 Mining User Interactions
No single method has been established as the best way to evaluate an IR system. A system-oriented evaluation based on the Cranfield model is unsuitable for evaluating interactive search systems due to the adaptive, cognitive, and behavioural features of the environment in which interactive systems perform. Borlund [1] proposed an alternative approach, called user-centred evaluation, to evaluate interactive systems. This approach is very helpful for obtaining valuable data on the behaviour of users and interactive search systems. Nevertheless, such a methodology is inadequate for benchmarking various underlying adaptive retrieval algorithms because it is expensive in terms of time and repeatability. Another alternative means of evaluating such systems is the use of simulated user evaluation, which assumes the role of users to trigger search queries and browse retrieved results. There are two possible ways to run a simulated user study. One is to use a test collection, similar to the Cranfield method, to mimic the user's query formulation, browsing and relevance assessment [19]. However, this method requires ground truth data, which are hard to generate for a large collection, and has a limited diversity of queries per topic. The other is to create a pool of user interactions derived from an actual user study to generate a range of search strategies [8]. Since we have no ground truth data and browsing is a complex activity based on individual users, we adopted the simulated user study using the log files of a prior user experiment [9]. If such a user were available, he or she would similarly perform a set of actions that, in their opinion, would increase the chance of retrieving more relevant documents for given search topics. Our objective is to find out whether the effectiveness of the browsing system modelled using high-level classification would have improved. We mined four types of user interactions based on the nature of task exploration, reflecting the users' judgements on retrieved images, as shown in Table 1:

Table 1. 24 users' interaction statistics

| Topic | # Queries | # Browses | # Sessions | Results: # Total | # Two or More | % |
| T1 | 151 | 258 | 104 | 397 | 115 | 29.0 |
| T2 | 316 | 274 | 114 | 215 | 83 | 38.6 |
| T3 | 180 | 231 | 95 | 254 | 76 | 29.9 |
| T4 | 153 | 351 | 125 | 377 | 104 | 27.6 |
(1) "# Queries", a list of the textual queries executed to get a potential set of images to start browsing, such as "wild animals", "endangered birds", etc.; (2) "# Browses", a list of the clicked images used for further browsing, called "Browse Images" in this paper; (3) "# Sessions", a list of the images chosen to start new browsing sessions, referred to as "Session Images"; and (4) "Results", a list of the relevant images added to a relevance list, in total and when selected by two or more users together with its percentage, named "Total" and "Two or More" respectively in the sub-columns. Our underlying assumption for each user interaction is that users clicked on browse images from a set of retrieved images when they found them most closely relevant and capable of leading them to more relevant images. Users selected session images when they found them relevant and showing different aspects of the search topics. Users added images into the result list when they found them relevant to their information needs.
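For the simulation described below, these four lists can be kept in a simple per-task structure such as the following; the field names are ours, since the original log format is not specified.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskLog:
    """The four interaction lists mined from the logs of one search task."""
    queries: List[str] = field(default_factory=list)   # textual queries, e.g. "wild animals"
    browses: List[str] = field(default_factory=list)   # ids of clicked Browse Images
    sessions: List[str] = field(default_factory=list)  # ids of Session Images
    results: List[str] = field(default_factory=list)   # ids of images judged relevant
```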
4.4 Search Task Information
For this evaluation we aimed at simulating the browsing patterns and search strategies of the 24 users performing four tasks in the prior user study [9]. All four tasks were explorative search topics which gave users broad indicative requests and asked them to discover different aspects of images used in various simulated situations, as suggested by Borlund [1]. Tasks 1–4 were entitled "Find different aspects of wild living creatures", "Find different aspects of vehicles", "Find different aspects of natural water", and "Find different aspects of open scenery", respectively. After completing each task, the users were asked to describe their experiences related to the tasks and the system in questionnaires. The questionnaires disclosed that users perceived Tasks 2 and 3 as the most difficult tasks, followed by Tasks 4 and 1. Accordingly, the total number of relevant images retrieved for Tasks 2 and 3 is lower than for the other tasks, as shown in Table 1. Another reason supporting this is the task complexity of the given topics resulting from the nature of the collection. Table 1 illustrates the level of task complexity using the amount of user agreement on selecting relevant images. As Table 1 shows, the percentage of relevant images selected by two or more users is higher in Tasks 2 and 3, which indicates that Tasks 2 and 3 may be "narrower" than Tasks 1 and 4, assuming that there will be less agreement amongst users for broader tasks, which require a greater extent of interpretation. According to the questionnaires, the retrieval results found, and the level of specification, T2 and T3 might be more complicated and difficult than T1 and T4, requiring more interpretation. Consequently, this factor may influence the results of our simulation, which will be discussed later.
4.5 A Browsing Strategy
We devised a browsing strategy to repeat the user interactions, in order to answer our research questions, based on the four lists of interaction types mined from the previous study: the textual query, browsing, session, and result/relevance lists. The strategy follows the user interactions to decide which action will be performed next and then updates the relevance results. We use a hypothetical component called the Agent, or simulated user, who controls the flow of interactions
with the two browsing systems. We recorded all the results and actions performed by the agent. Our simulation procedure uses the following steps. First, the agent submits a textual search query by randomly selecting a query from the textual query list, and the systems return a list of the top nine images. The agent interacts with these images by matching the images appearing in the session list to start new browsing sessions. Note that each interaction datum is used only once for each task. If two or more session images are found, they are put into a session queue according to their ranking in that query. The session queue follows a First-In, First-Out (FIFO) pattern, which serves the image found first to start a new browsing session. In the next step, the agent selects one of the retrieved images for further browsing. There are two options for selecting the images: the agent chooses images existing either in the relevance list or in the browsing list. One difference is that if the agent finds any images in the relevance list, they are also added to the relevance results. At this step, the agent can only select one image, in order to simplify our browsing strategy. If two or more images are found in either of the two lists, the agent takes the image in the relevance list first, indicating higher relevance based on the user judgement. In case two or more images are found in the same list, the agent selects the one ranked higher. During browsing, the systems retrieve six candidates for the agent in each search iteration. Whenever the agent cannot match the retrieved images with any images in the three lists, it starts a new session from the first image in the session queue, or re-enters a new query from the textual query list when the queue is empty. Moreover, if the agent finds images existing in the session list during browsing, those images are added to the session queue. The browsing simulation ends whenever the agent has found all relevant images in the relevance list or has performed all search queries in the query list. Following the given browsing strategy, we separately performed simulation runs on all four tasks over the baseline and proposed systems.
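The sketch below summarises this procedure in code. It is a simplified reading of the strategy: keyword_search and browse_candidates stand for the system under test, the TaskLog structure from Section 4.3 holds the mined lists, and minor details (such as what happens when only session images are matched) are resolved by assumption.

```python
from collections import deque
import random

def simulate_task(log, keyword_search, browse_candidates):
    """Replay one search task with the simulated user (the agent).
    log holds the four mined lists (log.queries, log.browses, log.sessions,
    log.results); keyword_search(query) returns the top-nine image ids and
    browse_candidates(image_id) returns six image ids. Returns the relevant
    images found and the number of browsing iterations."""
    queries = list(log.queries)
    random.shuffle(queries)                     # queries are picked at random
    browses, sessions, results = list(log.browses), list(log.sessions), list(log.results)
    session_queue = deque()                     # FIFO queue of Session Images
    relevant_found, iterations = [], 0

    def consume(pool, image_id):                # each datum is used only once
        if image_id in pool:
            pool.remove(image_id)
            return True
        return False

    while (queries or session_queue) and results:
        if session_queue:                       # start a new browsing session
            retrieved = browse_candidates(session_queue.popleft())
        else:                                   # otherwise issue a new query
            retrieved = keyword_search(queries.pop())
        while retrieved and results:
            iterations += 1
            for img in retrieved:               # queue images that open new sessions
                if consume(sessions, img):
                    session_queue.append(img)
            # prefer the highest-ranked image from the relevance list,
            # then from the browse list; otherwise give up on this path
            next_img = next((i for i in retrieved if i in results), None)
            if next_img is not None:
                consume(results, next_img)
                relevant_found.append(next_img)
            else:
                next_img = next((i for i in retrieved if i in browses), None)
                if next_img is not None:
                    consume(browses, next_img)
                else:
                    break
            retrieved = browse_candidates(next_img)
    return relevant_found, iterations
```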
5 Results
This section presents the results of our experiments based on the research questions stated in Section 1. Table 2 shows the comparison of the experimental results. We used a total of 4 search tasks per system in our analysis. We denote the baseline system by "A", whereas "B" stands for the proposed system. "# Iterations" is the total number of iterations after which the agent stops browsing. "X̄" and "SD" show the mean and standard deviation of the number of relevant images retrieved in each iteration. To measure the statistical significance of the results, we applied a (parametric) t-test to the difference between the baseline and proposed systems. All tests were paired, one-tailed, and the critical value (p-value) was set to 0.05, unless otherwise stated. "A>B" represents the number of browsing iterations in which system "A" retrieved more relevant images than system "B", followed by its percentage; "A<B" denotes the converse.
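The reported test corresponds to a paired, one-tailed t-test, which could be computed as in the following sketch; SciPy is used here for illustration and is not necessarily what the authors used.

```python
from scipy import stats

def paired_one_tailed_ttest(baseline_counts, proposed_counts):
    """Paired t-test on the per-iteration numbers of relevant images,
    one-tailed in the direction 'proposed > baseline'. scipy's ttest_rel is
    two-tailed, so the p-value is halved and the sign of t is checked."""
    t_stat, p_two_tailed = stats.ttest_rel(proposed_counts, baseline_counts)
    p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2
    return t_stat, p_one_tailed
```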
Table 2. Results of simulated user runs: Task 1 – 4 (statistically significant p<0.05)
| Topic | # Iterations | A: X̄ | A: SD | B: X̄ | B: SD | T-Test p-value | A>B | A>B % | A<B | A<B % |
| T1  | 603  | 0.34 | 0.98 | 0.44 | 0.94 | 0.0366 | 87  | 14.43 | 146 | 24.21 |
| T2  | 435  | 0.24 | 0.78 | 0.31 | 0.67 | 0.0952 | 42  | 9.65  | 83  | 19.08 |
| T3  | 336  | 0.34 | 1.07 | 0.42 | 0.84 | 0.1491 | 50  | 14.89 | 85  | 25.30 |
| T4  | 657  | 0.32 | 0.99 | 0.39 | 0.92 | 0.0839 | 85  | 12.94 | 145 | 22.07 |
| All | 2031 | 0.31 | 0.96 | 0.39 | 0.87 | 0.0390 | 264 | 13.00 | 459 | 22.60 |
As Table 2 shows, the proposed system "B" outperforms the baseline in terms of the average number of retrieved relevant images as well as in the percentage of iterations in which the number of results of system "B" is greater than that of the baseline system "A". We also found a statistically significant difference between the two systems at p < 0.05 (t-test) in the number of retrieved relevant images. Furthermore, approximately 22.60% of the total iterations were denoted "A<B".
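The per-task figures of Table 2 can be derived from the two simulation runs as sketched below, assuming the runs are aligned iteration by iteration (which the single "# Iterations" column suggests); the choice of the sample standard deviation is ours.

```python
import statistics

def table2_row(a_counts, b_counts):
    """Per-task statistics: mean and SD of relevant images per iteration for
    each system, and how often one system beats the other in an iteration."""
    n = len(a_counts)
    a_gt_b = sum(1 for a, b in zip(a_counts, b_counts) if a > b)
    a_lt_b = sum(1 for a, b in zip(a_counts, b_counts) if a < b)
    return {
        "iterations": n,
        "A": (statistics.mean(a_counts), statistics.stdev(a_counts)),
        "B": (statistics.mean(b_counts), statistics.stdev(b_counts)),
        "A>B": (a_gt_b, 100.0 * a_gt_b / n),
        "A<B": (a_lt_b, 100.0 * a_lt_b / n),
    }
```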
Fig. 2. Number of all relevant images in the relevant list per iteration: Task 1 – 4 ((a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4)
100 images; in contrast, the classification system requires only roughly 200 browses. The time spent browsing is thus dramatically reduced with the proposed system.
6 Conclusion and Future Work
In this paper, we aimed at answering two main research questions. We first illustrated how to integrate high-level classification with a browsing model. Each image is classified according to six generic concepts using multi-label classification, represented by a set of binary classifiers, which forms eight semantic categories. A browsing model benefits from these semantic concepts by helping users to narrow the search domain rather than browse the whole collection. We used a simulated evaluation methodology exploiting a log file of user interactions to investigate the second research question. The analysis of the evaluation results shows that the classification approach has the potential to improve a browsing model in exploring an image collection. The number of browses in the classification system decreases considerably for reaching the same number of relevant images as in the plain browsing system. In addition, the results show a linear improvement in the number of relevant images over all iterations and the dominant performance of the classification system in virtually all browsing iterations (see Table 2). To sum up, integrating high-level classification with a browsing model can improve the efficiency and effectiveness of an image browsing model.
Currently, our assumptions are only supported by our simulated user study. Although the simulated evaluation allows us to benchmark our classification-based browsing approach, the results still need to be verified by other evaluation techniques. One piece of future work is to perform a real user study to confirm the findings of the simulated user study; only an actual study will provide acceptable data to support our assumption. Moreover, future work could validate our assumption by integrating high-level classification with other standard browsing models. This can help to improve browsing models in the future.
Acknowledgements This research is supported by the Royal Thai Government and the European Commission under contract FP6-027122-SALERO.
References
1. Borlund, P.: The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information Research 8(3) (2003)
2. Campbell, I., van Rijsbergen, C.J.: The ostensive model of developing information needs. In: CoLIS 1996 (1996)
3. Del Bimbo, A.: Visual information retrieval. Morgan Kaufmann Publishers Inc., San Francisco (1999)
4. Heesch, D.: A survey of browsing models for content based image retrieval. Multimedia Tools Appl. 40(2), 261–284 (2008)
5. Heesch, D., Howarth, P., Magalhães, J., May, A., Pickering, M., Yavlinski, A., Rüger, S.M.: Video retrieval using search and browsing. In: TREC 2004 – Text REtrieval Conference, Gaithersburg, Maryland, November 15–19 (2004)
6. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE TVCG 6(1), 24–43 (2000)
7. Howarth, P., Rüger, S.M.: Evaluation of texture features for content-based image retrieval. In: Enser, P.G.B., Kompatsiaris, Y., O'Connor, N.E., Smeaton, A., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 326–334. Springer, Heidelberg (2004)
8. Joho, H., Hannah, D., Jose, J.M.: Revisiting IR techniques for collaborative search strategies. In: Boughanem, M., et al. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 66–77. Springer, Heidelberg (2009)
9. Leelanupab, T., Hopfgartner, F., Jose, J.M.: User centred evaluation of a recommendation based image browsing system. In: IICAI 2009 (to appear, 2009)
10. Mezaris, V., Doulaverakis, H., Herrmann, S., Lehane, B., O'Connor, N., Kompatsiaris, I., Strintzis, M.G.: Combining textual and visual information processing for interactive video retrieval: Schema's participation in TRECVID 2004. In: TRECVID 2004 (2004)
11. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. Computer Vision 42(3), 145–175 (2001)
12. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research 155, 23–36 (2006)
13. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
14. Torres, R.S., Silva, C.G., Medeiros, C.B., Rocha, H.V.: Visual structures for image browsing. In: CIKM 2003 (2003)
15. Urban, J., Jose, J.M., van Rijsbergen, C.J.: An adaptive technique for content-based image retrieval. Multimedia Tools and Applications 31(1), 1–28 (2006)
16. Vailaya, A., Figueiredo, M.A.T., Jain, A.K., Zhang, H.-J.: Image classification for content-based indexing. IEEE Trans. on Image Processing 10(1), 117–130 (2001)
17. Viaud, M.-L., Thièvre, J., Goëau, H., Saulnier, A., Buisson, O.: Interactive components for visual exploration of multimedia archives. In: CIVR 2008 (2008)
18. Villa, R., Cantador, I., Joho, H., Jose, J.M.: An aspectual interface for supporting complex search tasks. In: SIGIR 2009 (2009)
19. White, R.W., Jose, J.M., van Rijsbergen, C.J., Ruthven, I.: A simulated study of implicit feedback models. In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 311–326. Springer, Heidelberg (2004)
Exploring Relationships between Annotated Images with the ChainGraph Visualization Steffen Lohmann1, Philipp Heim2, Lena Tetzlaff1, Thomas Ertl2, and Jürgen Ziegler1 1
University of Duisburg-Essen, Interactive Systems and Interaction Design, Lotharstr. 65, 47057 Duisburg, Germany {steffen.lohmann,lena.tetzlaff,juergen.ziegler}@uni-due.de 2 University of Stuttgart, Visualization and Interactive Systems, Universitätsstr. 38, 70569 Stuttgart, Germany {philipp.heim,thomas.ertl}@vis.uni-stuttgart.de
Abstract. Understanding relationships and commonalities between digital contents based on metadata is a difficult user task that requires sophisticated presentation forms. In this paper, we describe an advanced graph visualization that supports users with these activities. It reduces several problems of common graph visualizations and provides a specific chain arrangement of nodes that facilitates visual tracking of relationships. We present a concrete implementation for the exploration of relationships between images based on shared tags. An evaluation with a comparative user study shows good performance results on several dimensions. We therefore conclude that the ChainGraph approach can be considered a serious alternative to common graph visualizations in situations where relationships and commonalities between contents are of interest. After a discussion of the limitations, we finally point to some application scenarios and future enhancements. Keywords: graph visualization, interactive exploration, relationship discovery, exploratory search, user-annotated images, visual tracking, shared metadata, tagging, photo sharing, annotation.
1 Introduction
Metadata is important in the organization, management, and retrieval of all kinds of digital contents. It also links the contents allowing for structured exploration and the discovery of relationships and commonalities. However, finding and following these links is often difficult for users, mostly due to the constraints of the presentation forms that are used to display the contents. Visualizations are needed that explicitly show relationships between digital contents based on shared metadata. This need becomes particularly apparent against the background of recent developments in the Web. Many providers of media sharing services use tagging as a specific form of user-generated annotation on their websites. Tagging-based
systems enable users to annotate digital contents with multiple, arbitrary terms in order to organize these contents for themselves and/or others. That way, large collections of tagging data emerge that are commonly known as folksonomies. These folksonomies are an unstructured form of metadata that helps users to browse media collections and find specific contents. For instance, an analysis of the tagging data from the photo sharing website Flickr (http://www.flickr.com) showed that each photo is annotated with almost three tags on average, which usually consist of one, sometimes two, words and mostly refer to the contents of the images [4]. Since the same tags are usually assigned to multiple images, implicit relationships based on these shared tags result.
1.1 State-of-the-Art
One popular visualization that supports browsing in folksonomies is the tag cloud. Typically, a tag cloud presents a certain number of the most often used tags, where the tags' popularity is expressed by their font sizes [7]. Although this visualization type allows easy access to digital contents, tag clouds are usually visualized separately from the contents in a specific area of the user interface (see Fig. 1a for an example from the multimedia sharing website ipernity, http://www.ipernity.com/explore/keyword). Links between contents that are based on shared tags are not explicitly shown, making it hard to identify and follow them. Some more advanced tag cloud visualizations group similar tags [3] or even visualize links between tags [9] based on certain criteria such as tag co-occurrence. However, these visualizations merely improve the presentation of the tags themselves by showing their interrelations, but provide no clue as to how the tags link the digital contents and what relationships between the contents exist.
Fig. 1. a) tag clouds are simple browsing interfaces that show no relationships, b) common graph visualizations tend to produce crossing edges and overlapping nodes
Especially for the visualization of image collections, many other approaches have been proposed. Some even use metadata to arrange the images in a meaningful way. For instance, Rodden et al. [11] propose a presentation form that arranges image thumbnails according to their mutual similarity based on low-level visual features and textual captions. Dontcheva et al. [2] present an interactive visualization that clusters images from Flickr according to their tags and provides several interaction mechanisms for browsing the clusters. Another popular example is the statistical clustering of photos in Flickr that is based on an analysis of tag co-occurrences. Only few approaches try to present images and their tags in a combined way. For instance, yahoo taglines (http://research.yahoo.com/taglines/) visualizes random images from Flickr in an animated 'tag river' that simulates a timeline. However, such approaches allow no systematic exploration of the images but rather support free browsing and serendipitous discoveries. More generally, there is a lack of visualizations that display digital contents along with their metadata and explicitly show interrelations. In particular, relationships along multiple dimensions are hard to follow in existing presentation forms. Graph visualizations, on the other hand, seem to be highly appropriate to address these user needs. One example of a graph visualization of user-annotated images is TagGraph (http://taggraph.com). It displays relationships between images from Flickr by representing these images and their tags as nodes that are connected by edges (see Fig. 1b). However, common graph visualizations, such as TagGraph, have several drawbacks when displaying digital contents that are interrelated by shared metadata:
– Crossing edges and high densities hamper the visual tracking of relationships or even result in misinterpretations [10].
– Overlapping nodes can result in an imperfect presentation of the digital contents and their metadata.
– Positioning of the nodes is often not optimal for the visual tracking of relationships.
These drawbacks are clearly visible in the example given in Figure 1b, where we used the TagGraph tool to visualize images that are highly interconnected by several shared tags: In some cases, the relationships are not clear due to high densities and crossing edges (e.g., for the tags 'nikon', 'philipp', or 'dcdead'); in others, nodes overlap image parts (e.g., the tags 'green' or 'great'). Yet in other cases, the positioning of the tags is not optimal (e.g., the tags 'reflection' or 'spiegelung' are not placed in between the two images they connect). We developed an advanced graph visualization – the ChainGraph – that reduces these problems and provides better support for the exploration of relationships between digital contents based on shared metadata. Although the general
3 http://research.yahoo.com/taglines/
4 http://taggraph.com
ChainGraph approach is not limited to a specific application area, we particularly focus in the following on a concrete implementation of the ChainGraph that visualizes relationships between images based on shared tags. In Section 2, we introduce the general idea, present the implementation, and describe a scenario. In Section 3, we report on a user evaluation in which we compared the ChainGraph with a common graph visualization. Finally, we discuss limitations of this approach and give an outlook on future work and possible application areas of the ChainGraph in Section 4.
2 The ChainGraph Approach
The basic idea of the ChainGraph approach is best described by comparing it with a common way of visualizing linked contents in a graph (see Fig. 2). In common visualizations, each metadata instance is represented by exactly one node. If a metadata instance is shared by many content items, a force-directed layout [1] arranges the content nodes radially around this metadata instance (cp. tags ’winter’, ’trees’, and ’snow’ in Fig. 2a). In the ChainGraph, by contrast, each metadata instance connects two content items at most. This is realized by multiplying metadata nodes in the visualization: Every metadata instance that is shared by more than two content items is represented by several nodes arranged in a chain which connects the content nodes in a certain consecutive order (cp. the chains ’winter’, ’trees’, and ’snow’ in Fig. 2b).
Fig. 2. Visualization of images and shared tags with a) a common graph and b) the ChainGraph visualization
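The construction just described can be sketched as follows; this is our reading of the ChainGraph idea, not the authors’ implementation. A tag shared by k images is replaced by k−1 copies of the tag node, each linking two consecutive images, so that no metadata node connects more than two content items; the order of the images is assumed to be decided beforehand (cf. the node-ordering step discussed below).

```python
def build_chain_graph(tag_to_images):
    """Turn a mapping {tag: [img1, img2, ...]} into ChainGraph nodes and edges.

    A tag shared by k >= 2 images yields k-1 tag-node copies, each placed
    between two consecutive images; a tag on a single image yields one node.
    The order of the image lists is assumed to be fixed beforehand.
    """
    nodes, edges = set(), []
    for tag, images in tag_to_images.items():
        if not images:
            continue
        nodes.update(images)
        if len(images) == 1:
            node = (tag, 0)
            nodes.add(node)
            edges.append((images[0], node))
            continue
        for i in range(len(images) - 1):
            node = (tag, i)                      # i-th copy of the tag node
            nodes.add(node)
            edges.append((images[i], node))
            edges.append((node, images[i + 1]))
    return nodes, edges

nodes, edges = build_chain_graph({"winter": ["img1", "img2", "img3"],
                                  "snow":   ["img2", "img3"]})
print(len(nodes), len(edges))   # 6 nodes, 6 edges
```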
Although this multiplication increases the total number of nodes and edges in the graph, it significantly reduces the graph’s density and energy level, i.e., attractive and repulsive forces act in parallel rather than in opposite directions, resulting in fewer crossing or overlapping edges compared to common graph
visualizations. It also results in a placement of the metadata nodes between the content nodes they connect, which further facilitates the identification of relationships and commonalities (cp. tags ’sunset’ and ’sky’ in Fig. 2a and Fig. 2b). It is important to note that connections within a chain must be interpreted in a transitive manner, i.e., all content items along one chain are related to each other via the same metadata instance, independently of their order. To support the correct interpretation and visual tracking of the chains, all nodes and edges that represent the same metadata instance are visualized in the same color. For the quality of the graph layout, however, the order of the nodes is highly relevant. Changing the images’ order in the ’winter’ chain in Fig. 2b, for instance, would probably break the nice parallel arrangement of the chains in the graph and might even result in crossing edges or overlaps. Therefore, we developed a special algorithm that aims at an optimal arrangement of the nodes by adding them step by step according to certain selection criteria. In each step, the algorithm calculates a heuristic value, the constraintLevel, for all resources that have not yet been added to the graph, and chooses the resource with the highest constraintLevel to be added next. A detailed description of the used algorithm is given in [5]. We implemented an application prototype in Adobe Flex5 that demonstrates how the ChainGraph approach can be used to visualize relationships between images based on shared tags. All nodes of the graph can be moved via drag & drop, and images of interest can be enlarged with a click. How the prototype supports the exploration of tag-based relationships and commonalities between images is best described by a small scenario.
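Before turning to the scenario, the ordering step can be pictured as the following greedy skeleton. The concrete constraintLevel heuristic is defined in [5]; the stand-in used here, which simply counts the metadata instances shared with already placed resources, is our assumption for illustration only.

```python
def order_resources(resources, metadata_of):
    """Greedy ordering sketch: repeatedly place the most constrained resource.

    resources: iterable of content items.
    metadata_of: dict mapping each content item to its set of metadata instances.
    constraint_level is an assumed placeholder for the heuristic of [5]: the
    number of metadata instances shared with resources already placed.
    """
    placed, remaining = [], set(resources)

    def constraint_level(r):
        placed_meta = set()
        for p in placed:
            placed_meta |= metadata_of[p]
        return len(metadata_of[r] & placed_meta)

    while remaining:
        best = max(remaining, key=constraint_level)   # highest constraintLevel next
        placed.append(best)
        remaining.remove(best)
    return placed

order = order_resources(["img1", "img2", "img3"],
                        {"img1": {"winter", "snow"},
                         "img2": {"winter"},
                         "img3": {"snow", "trees"}})
```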
2.1 Scenario
Fig. 3 shows a screenshot of the application prototype as it can be used to browse a collection of annotated images of Paris6. Assume a user is searching for a representative image of Paris that she would like to use as an illustration for a text about the French capital. She first glances at the image that shows the Eiffel Tower in summer (Fig. 3, 1.). After enlarging this image, she is not satisfied with it as it looks a bit boring in her opinion. Therefore, she goes through further images by following the chain labeled with the tag ’eiffel tower’. Another image of the Eiffel Tower catches her attention and she enlarges it (Fig. 3, 2.). She likes the monochromatic style of the image and decides to look for further images of this kind. Consequently, she follows the tag chain labeled with ’black & white’. She recognizes that another chain, labeled with ’people’, meets the ’black & white’ chain and runs in parallel with it (Fig. 3, 3.): she is on the right path, since people in an image help to make it lively and interesting, which is in line with the goals of her search. Finally, she reaches a black-and-white image showing the subway of Paris with passengers inside (Fig. 3, 4.). Since she also reaches the ’eiffel tower’ chain again, this symbol of Paris is also on the image, visible in the background through the window of the subway.
5 http://www.adobe.com/products/flex
6 The prototype is accessible online at http://interactivesystems.info/chaingraph
Fig. 3. Using a ChainGraph implementation to search for a picture of Paris (the numbers 1.–4. mark the steps of the scenario)
After enlarging the image, she recognizes that it perfectly meets her needs and copies it to her weblog as an illustration for her text about Paris. As the scenario shows, browsing with the ChainGraph is usually a combination of goal-oriented and exploratory search [8]. Relatively vague user needs can be iteratively refined by discovering and following ’tag chains’ of interest. If several images share more than one tag, the chains run in parallel, making it easy for users to browse through related images and to select the one that best fits their needs. Note that we visualized only shared tags in this example as these help to discover relationships between the images. Of course, tags that are assigned to only a single image can also be shown in the ChainGraph implementation, simply by adding a labeled node and connecting it to the image.
3 Evaluation
We performed a user study where we compared the ChainGraph approach with a common graph visualization. We were mainly interested in the understandability,
user acceptance, and performance of the ChainGraph. The presentation of the graphs in the study was similar to the one shown in Fig. 2. However, we used a more abstract visualization in order to avoid biases resulting from personal preferences or distractions caused by certain images or tags. Therefore, in the user study the nodes and edges of the graph visualization consisted of numbered labels instead of actual images and tags.
3.1 Study Design
Overall, we generated three pairs of graph visualizations for the user study, where each pair consisted of one ChainGraph and one common graph that both showed exactly the same data. We kept the total number of content items and metadata instances constant (six each) but gradually increased the number of metadata connections between the content items for each evaluation pair. In this setting, the minimum possible number of connections based on shared metadata instances is twelve (if each of the six metadata instances is shared by exactly two content items, cp. Table 1a) and the maximum number is 36 (if each of the six metadata instances is shared by all six content items, cp. Table 1c). Applying these extreme values in the user study makes no sense: both graph types (ChainGraph and common graph) look the same for the minimum value, and no insights can be gained for the maximum value since all metadata instances are shared by all content items. Therefore, we chose three values in between (18, 24, 30) in order to test graphs with increasing densities. We then generated random distributions for these values that we used to draw the nodes and edges for both graph types. Table 1b shows the random sample for the graphs with 24 shared metadata connections. We arranged the nodes of all graphs in a force-directed layout [1] and applied the optimization algorithm mentioned in Section 2 for the ordering of the chains. For each graph type, the repulsion, a factor that controls how strongly the nodes are pushed away from each other, was set such that the graph looked aesthetically pleasing and the edges did not become too long or too short,
Table 1. Distributions with a) 12 assignments (minimum), b) 24 assignments (random matrix), and c) 36 assignments (maximum) (C = content item, M = metadata instance). In (a), each of the six metadata instances M1–M6 is assigned to exactly two of the content items C1–C6; in (b), the 24 assignments are spread unevenly over the 6 × 6 matrix; in (c), every metadata instance is assigned to all six content items (all 36 cells marked).
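The random distributions used in the study (e.g., Table 1b) can be generated along the following lines. This is a sketch under our own assumptions: the paper only states that random distributions were generated for 18, 24, and 30 assignments, so the scheme below, which first makes every metadata instance shared by exactly two content items and then distributes the rest uniformly, is illustrative.

```python
import random

def random_distribution(n_assignments, n_content=6, n_metadata=6, seed=None):
    """Random content-metadata incidence matrix with a fixed number of assignments.

    Every metadata instance starts out shared by exactly two content items
    (the minimum of Table 1a); the remaining assignments are drawn uniformly
    from the still-empty cells.
    """
    rng = random.Random(seed)
    matrix = [[0] * n_metadata for _ in range(n_content)]
    for m in range(n_metadata):
        for c in rng.sample(range(n_content), 2):
            matrix[c][m] = 1
    free = [(c, m) for c in range(n_content)
            for m in range(n_metadata) if not matrix[c][m]]
    for c, m in rng.sample(free, n_assignments - 2 * n_metadata):
        matrix[c][m] = 1
    return matrix

for row in random_distribution(24, seed=1):
    print(row)
```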
Fig. 4. a) Common graph and b) ChainGraph visualization with 24 metadata connections (cp. Table 1b)
resulting in a generally lower repulsion for the ChainGraph. Fig. 4 shows one pair of graph visualizations from the user study (common graph and ChainGraph) with 24 shared metadata connections each. Since we were particularly interested in how well the ChainGraph supports the visual tracking of metadata relationships and the identification of commonalities between digital contents based on shared metadata, we defined the following three user tasks for the comparative study:
1. Find the pair of resources that shares the most metadata instances.
2. Find all metadata instances that are shared by a given pair of resources.
3. Find all metadata instances that are shared by a given triple of resources.
In sum, we thus applied a 2 × 3 × 3 within-subject design with the variables graph type (common graph vs. ChainGraph), task type (task 1, 2, 3), and shared metadata connections (18, 24, 30).
3.2 Procedure
Twelve participants, mainly students, took part in the study, with an average age of 29 (ranging from 22 to 47). Their self-reported familiarity with graphs averaged 7.7 (median 8.5) on a scale of 1 to 10. All subjects reported normal or corrected-to-normal vision and no color blindness. We presented all three pairs of graphs along with the three tasks to all participants on a 17” TFT monitor with a screen resolution of 1280 × 1024 px. Each graph type and each task were introduced and explained by an example. To control for learning effects, we interchanged the presentation order of the two graph types and randomly assigned the participants to one of the settings (group A started with the common graph, group B with the ChainGraph visualizations). After completing all three tasks for all distributions of one graph type, the subjects were asked to fill out an evaluation sheet. The corresponding graph
type had to be rated according to 23 pre-defined items on a scale of one to five. The items were then mapped to the four dimensions effectiveness, understandability, control, and attractiveness (5–7 items per dimension). At the end of the study, the subjects had to directly compare both graph types and indicate their general familiarity with graphs in a questionnaire. Furthermore, we measured the time needed to fulfill the tasks and the accuracy of the answers by counting wrong answers.
3.3 Results
Overall, the ChainGraph visualization performed very well in the user study. Nine of the twelve participants preferred using the ChainGraph to solve the tasks of the study. It also reached slightly better results in the evaluation sheets: Fig. 5a shows the user ratings on the four dimensions attractiveness, control, understandability, and effectiveness that were generated from the items of the evaluation sheet (higher value = better rating). The participants quickly understood the ChainGraph layout and did not report serious difficulties when using it to accomplish the tasks. The colored edges proved to be helpful in following the chains. Although the compactness of the common graph was considered positive, the study participants complained about its high number of crossing edges. The good user ratings seem to stem especially from the first task. Here, the ChainGraph performed significantly better than the common graph with respect to the time needed to accomplish the task, independently of its density (i.e., the number of shared metadata instances, see Fig. 5b). This indicates that the ChainGraph layout assists particularly in the identification of similar contents. This benefit is further strengthened by the optimization algorithm mentioned in Section 2, as it arranges content items with many shared metadata instances close to each other.
Fig. 5. Results from the comparative study: (a) user ratings for both graph types on the four evaluation dimensions (attractiveness, control, understandability, effectiveness), (b) time needed (in seconds) to accomplish Tasks 1–3 for an increasing number of shared metadata instances
4 Discussion
With the ChainGraph, we introduced a new visualization approach for the exploration of relationships between digital contents based on shared metadata. Since the ChainGraph represents shared metadata instances by multiple nodes, it avoids an agglomeration of content nodes around metadata nodes and thus reduces the graph’s general density. This also decreases the probability of crossing edges and overlapping nodes and tends to place metadata nodes between the content nodes they connect. These modifications were developed to facilitate the exploration of relationships and commonalities between content items and might result in improved readability and usability for related user activities. To the best of our knowledge, the proposed ChainGraph is the first and only approach that multiplies nodes and arranges them in chains to better support the visual tracking of relationships and the identification of commonalities. We demonstrated the applicability of this approach with an implementation for the interactive exploration of tag-based relationships within a selected set of annotated images.
4.1 Limitations
As illustrated in the evaluation, the ChainGraph provides no benefits in some extreme cases. For instance, it would be identical to a common graph visualization for distributions where all metadata instances are assigned to exactly two content items (cp. Section 3.1 and Table 1a). However, such extreme distributions are very unlikely in real application scenarios. Furthermore, it is important to note that we developed the ChainGraph for the visualization of a limited set of annotated contents, not as a visualization for whole content collections. Usually, exploration with the ChainGraph is only one of many activities in a corresponding search process. For instance, in the scenario given in Section 2.1, the presented ChainGraph could be the result of a user query for the tag ’paris’. A general limitation of the ChainGraph is its relatively large size due to the multiplication of metadata nodes. Consequently, it needs more screen space than other presentation forms and is only well applicable on large displays with a high screen resolution. Although this is not a serious problem in times of Full HD, it might still restrict the application areas of the ChainGraph.
4.2 Application Scenarios and Outlook
Several use cases are imaginable for the ChainGraph approach. Since the visualization is best viewed on large displays, we tested our implementation both on a projection screen and on a multi-touch table with the dataset given in Section 2.1 (see Fig. 6). As expected, the interaction with the ChainGraph visualization displayed on the projection was experienced as more immersive than its presentation on a standard monitor. As a side effect, the thumbnail images were already large enough to give a fair impression of the image contents, and hence
Fig. 6. Exploring user-annotated images with the ChainGraph visualization a) on a large projection and b) on a multi-touch table
fewer enlargements of images were needed. Regarding the multi-touch interaction, we discovered several interaction possibilities that might be beneficial for an efficient use of the ChainGraph visualization. For instance, users could drag the background with one hand while enlarging or rearranging images with the other, or use both hands to easily place images next to each other for a better comparison. Since multi-touch input is not supported by our current ChainGraph implementation, we are working on this feature in order to get a better understanding of the opportunities multi-touch interfaces provide for the graph-based exploration of image collections (also cp. [6]). In this paper, we focused on the description and evaluation of the basic ChainGraph approach and the exploration of user-annotated images. Of course, many enhancements are imaginable: On the one hand, supporting further content formats (e.g., video or audio) might be an interesting extension; however, this also requires additional considerations about adequate representations within the visualization. On the other hand, the available and visualized metadata might go beyond simple tags. For instance, automatically extracted metadata might also be considered, such as low-level descriptors or context and media file information (e.g., the ’black & white’ tag of the scenario in Section 2.1 could have been derived automatically from the file information). Structured metadata in particular raises many opportunities for extensions that allow aggregating and filtering of relationships or faceted exploration. However, structured metadata is not available in many situations. As we have shown in this paper, the ChainGraph might
offer a valuable alternative to common graph visualizations even for unstructured metadata, such as user-assigned tags. The only requirement is that the contents share certain metadata instances.
References
1. Fruchterman, T., Reingold, E.: Graph drawing by force-directed placement. Softw. Pract. Exper. 21(11), 1129–1164 (1991)
2. Girgensohn, A., Shipman, F., Wilcox, L., Turner, T., Cooper, M.: Mediaglow: Organizing photos in a graph-based workspace. In: IUI 2009: Proc. International Conference on Intelligent User Interfaces, pp. 419–424. ACM, New York (2009)
3. Hassan-Montero, Y., Herrero-Solana, V.: Improving tag-clouds as visual information retrieval interfaces. In: Proc. Multidisciplinary Information Sciences and Technologies, InSciT 2006, Merida, Spain (October 2006)
4. Heckner, M., Neubauer, T., Wolff, C.: Tree, funny, to read, google: What are tags supposed to achieve? A comparative analysis of user keywords for different digital resource types. In: SSM 2008: Proc. ACM Workshop on Search in Social Media, pp. 3–10. ACM, New York (2008)
5. Heim, P., Lohmann, S.: A new approach to visualize shared properties in resource collections. In: Proc. International Conference on Knowledge Management and Knowledge Technologies, pp. 106–114 (2009)
6. Kristensson, P.O., Arnell, O., Björk, A., Dahlbäck, N., Pennerup, J., Prytz, E., Wikman, J., Åström, N.: Infotouch: An explorative multi-touch visualization interface for tagged photo collections. In: NordiCHI 2008. ACM International Conference Proceeding Series, vol. 358, pp. 491–494. ACM, New York (2008)
7. Lohmann, S., Ziegler, J., Tetzlaff, L.: Comparison of tag cloud layouts: Task-related performance and visual exploration. In: Gross, T., et al. (eds.) INTERACT 2009, Part I. LNCS, vol. 5726, pp. 392–404. Springer, Heidelberg (2009)
8. Marchionini, G.: Exploratory search: From finding to understanding. Commun. ACM 49(4), 41–46 (2006)
9. Michlmayr, E., Cayzer, S.: Learning user profiles from tagging data and leveraging them for personal(ized) information access. In: Proc. Workshop on Tagging and Metadata for Social Information Organization, 16th International World Wide Web Conference (2007)
10. Purchase, H.C.: Which aesthetic has the greatest effect on human understanding? In: DiBattista, G. (ed.) GD 1997. LNCS, vol. 1353, pp. 248–261. Springer, Heidelberg (1997)
11. Rodden, K., Basalaj, W., Sinclair, D., Wood, K.: Does organisation by similarity assist image browsing? In: CHI 2001: Proc. Human Factors in Computing Systems, pp. 190–197. ACM, New York (2001)
On the Coöperative Creation of Multimedia Meaning

Claudio Cusano, Simone Santini, and Raimondo Schettini

Università degli Studi di Milano-Bicocca
Escuela Politécnica Superior, Universidad Autónoma de Madrid
Università degli Studi di Milano-Bicocca
Abstract. In this paper, we propose a content-based method for the semi-automatic organization of photo albums based on the analysis of how different users organize their own pictures. The goal is to help the user in dividing his pictures into groups characterized by similar semantic content. The method is semi-automatic: the user starts to assign labels to the pictures, and unlabeled pictures are tagged with proposed labels. The user can accept the recommendation or make a correction. The method is conceptually articulated in two parts: first, we use a suitable feature representation of the images to model the different classes that the users have collected; second, we look for correspondences between the criteria used by the different users. A quantitative evaluation of the proposed approach is presented, based on pictures of a set of members of the flickr photo-sharing community.
1 Introduction

The process of signification, which semantic computing tries to unravel using formal means, already extremely complex in the case of text, acquires new dimensions and nuances in the case of multimedia data. An image, per se, doesn’t have any meaning, being just a recording of a certain situation that happened to unfold in front of a camera at a certain point in the past. Its only inherent meaning can be described as the Barthesian ça-a-été: the thing that is represented happened in the past. But, of course, many things happened in the past that were not recorded in images, and the meaning of an image is related to a decision: the decision to record certain things and not others. Photos are not taken higgledy-piggledy, but according to certain discursive practices that depend on the purpose of the picture and on the community in which they are taken. Taking a picture in order to convey a meaning is an activity that follows certain socially dictated rules. These rules are with us from the beginning of our picture-taking life, and we follow them more or less unconsciously. When we are on vacation, we take mostly pictures of stereotypically happy moments, often in front of the same sceneries and monuments, and we avoid certain themes (sexual situations, for example). Often, these practices tell us more about the meaning of a picture than the contents of the picture itself. These observations, schematic and superficial as they may be, point to the impossibility of creating a semantic image classification system based only on the contents of the images. Semantic classification entails the division of the image space along semantic lines, and these lines depend crucially on the discursive practices that preside over
image acquisition and on the interpretative practices of the community to which the images are directed. In this day and age, fortunately, a lot of information about community practices is available in a conformation that affords formalization. Thanks to the emergence of online communities, community practices can be understood by analyzing the way people organize their data on the internet. Our current work aims at using this structural information for understanding the semantics of images and, in a broader view, for understanding the process of signification in multimedia. The system that we present in this paper is a simple outcome of this activity, and it aims at helping people in a task that, with the advent of digital cameras, has become fairly common: to classify personal photographic pictures, dividing them into thematically organized folders. The criteria that preside over this organization are, of course, highly personal: in this case, what’s good for the goose is not necessarily good for the gander. The same vacation photos that one person will divide into “Rhodos” and “Santorini” will be divided by someone else into “family”, “other people” and “places”, or into “beach”, “hotel” and “excursion”, or in any other organization. However, the discursive practices that preside over this classification are, to a certain extent, common to all users. That is, all said and done, people are not that original. Nor could they be: in the internet community era, photos and their classification schemes are communication means, and communication can only work through a shared code. Classification is part of a semiotic system, and must have some degree of uniformity and predictability, at least within the community in which the communication is done. Faithful to the principles of community-based semantic creation, we try to use the collective wisdom of the community in order to suggest to one of its members possible ways of classification. Briefly, when a person (we call this person the apprentice) starts putting photos into folders, the system will look at other users of the community and at the classifications they made. Members who agree with the classification made by the apprentice (yclept the wizards) will be used as classifiers to propose a classification of the apprentice’s unclassified images. We can see a system like this under two possible lights. On the one hand, we can see it as a classification aid. In this view, the apprentice has a certain classification in mind, which she will not change, and the purpose of the system is to help her by bringing upfront, in a suitable interface, the pictures that will go into the folders that the apprentice has created. On the other hand, we can see it as an exploration and discovery tool. When the apprentice begins making the classification, her ideas are still uncertain, and she will be open to changes and adaptations of her scheme. In this sense, bringing up photos according to the classification scheme of the wizards will create a dialectic process in which criteria are invented, discarded, and modified. The classification that the apprentice ends up with might not resemble the original one at all, simply because looking at the organization induced by the wizards has given her new ideas. This second view is, in many ways, the most interesting one. Alas, it is virtually impossible to evaluate the effectiveness of a system in this capacity short of long-term user satisfaction studies.
As a matter of praxis, in this paper we will only consider our system in the first capacity: as an aid to create a fixed classification, and will evaluate it accordingly.
Commercial systems for the management and classification of personal photos rely essentially on manual annotation, and their only distinguishing trait is the interface that they use to make annotation as rapid and convenient as possible. Research prototypes take a more ambitious view, and try to provide tools for automatic or (more often) semi-automatic classification. A prototype system for home photo management and processing was implemented by Sun et al. [13]. Together with traditional tools, they included a function to automatically group photos by time, visual similarity, image class (indoor, outdoor, city, landscape), or number of faces (as identified by a suitable detector). The system developed by Wenyin et al. [17] allows the categorization of photos into some predefined classes. A semi-automatic annotation tool, based on retrieval by similarity, is also provided: when the user imports some new images, the system searches for visually similar archived images, and the keywords with higher frequencies in these images are used to annotate the new images. Mulhem and Lim proposed the use of temporal events for organizing and representing home photos using a structured document formalism [9]. Shevade and Sundaram presented an annotation paradigm that attempts to propagate semantics by using WordNet and low-level features extracted from the images [12]. As the user begins to annotate images, the system creates positive and negative example sets for the associated WordNet meanings. These are then propagated to the entire database, using low-level features and WordNet distances. The system then determines the image that is least likely to have been annotated correctly and presents the image to the user for relevance feedback. A common approach to the automatic organization of photo albums consists in the application of clustering techniques, grouping images into visually similar sets. Some manual post-processing is usually required to modify the clusters in order to match the user’s intended categories. Time information is often used to improve clustering by segmenting the album into events. Platt proposed a method for clustering personal images taking into account timing and visual information [11]. Li et al. exploited time stamps and image content to partition related images in photo albums [6]. Key photos are selected to represent a partition based on content analysis and then collated to generate a summary. A semi-automatic technique has been presented by Jaimes et al. [5]. They used the concept of Recurrent Visual Semantics (the repetitive appearance of visually similar elements) as the basic organizing principle. They proposed a sequence-weighted clustering technique which is used to provide the user with a hierarchical organization of the contents of individual rolls of film. As a last step, the user interactively modifies the clusters to create digital albums. Since the identity of people is often the most relevant information for the user, it is not surprising that several approaches have been proposed for the annotation of faces in family albums. Das and Loui used age/gender classification and face similarity to provide the user with the option of selecting image groups based on the people present in them [2]. The idea of exploiting user correlation in photo-sharing communities has been investigated by Li et al. [7].
They proposed a method for inferring the relevance of user-defined tags by exploiting the idea that if different persons label visually similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Each tag of an image accumulates its relevance score by receiving votes from neighbors (i.e., visually similar images) labeled with the same tag.
2 The System

In this paper, we propose a method for the semi-automatic organization of photo albums. The method is content-based, that is, only pictorial information is considered. It should be clear from the contents of the paper that the method is also applicable to non-visual information such as keywords and annotations. In spite of the importance that these annotations may have for the determination of the semantics of images, we have decided to limit our considerations to visual information on methodological grounds, since this will give us a more immediate way of assessing the merits of the method vis-à-vis simple similarity search. The goal is to help the user in classifying pictures by dividing them into groups characterized by similar semantics. The number and the definition of these groups are completely left to the user. This problem can be seen as an on-line classification task, where the classes are not specified a priori, but are defined by the user himself. At the beginning all pictures are unlabeled, and the user starts to assign labels to them. After each assignment, the unlabeled pictures are tagged with proposed labels. The user can accept the recommendation or make a correction. In either case the correct label is assigned to the image and the proposed labels are recomputed. Unlabeled pictures are displayed sorted by decreasing confidence in the correctness of the suggestion, but the order in which the user processes the images is not restricted. A suitable user interface will allow rapid label confirmation, and a quick and easy organization of the photo album.

2.1 Correlation within the Community

One of the difficulties of assisted album organization is that, at the beginning, we lack information on the criteria that the user is going to apply in partitioning his pictures. However, a huge library of possible criteria is available in photo-sharing communities. The users of these services are allowed to group their own images into sets, and we can assume that these sets contain pictures with some common characteristic, at least at the semantic level. For instance, sets may contain pictures taken in the same location, or portraying a similar subject. Our idea is to exploit the knowledge encoded in how a group of users (the wizards, in the following) have partitioned their images, in order to help organize the pictures of a different user (the apprentice). The method is conceptually articulated in two parts. First, we use a suitable feature representation of the images of the wizards to model the different classes that they have collected; second, we look for correspondences between the (visual) criteria used in the wizards’ classes and those that the apprentice is creating, in order to provide advice. Simply (maybe overly so) put: if we notice that one of the classes that the apprentice is creating appears to be organized using criteria similar to those used in one or more of the wizards’ classes, we use the wizards’ classes as representative, and the unlabeled apprentice images that are similar to those of the wizard class are given the label of that class. Consider a wizard who partitioned his pictures into the C categories {ω1, . . . , ωC} = Ω. These pictures are used as a training set in order to train a classifier that implements a classification function g : X → Ω from the feature space X into the set of wizard classes. If the partition of the wizard exhibits regularities (in terms of visual content)
that may be exploited by the classification framework, then g may be used to characterize the pictures of the apprentice as well. Of course, it is possible that the apprentice would like to organize his pictures into categories different from those of the wizard. However, people tend to be predictable, and it is not at all uncommon that the sets defined by two different users present some correlation that can be exploited. To do so, we define a mapping π : Ω → Y between the classes defined by the wizard and the apprentice (where Y = {y1, . . . , yk} denotes the set of the apprentice’s labels). We allow a non-uniform relevance of the apprentice’s images in defining the correlation with the wizard’s classes. Such a relevance can be specified by a function w that assigns a positive weight to the images. Weighting will play an important role in the integration of the predictions based on different wizards, as described in Section 2.3. Let Q(ωi, yj) be the set of images to which the apprentice has assigned the label yj and that, according to g, belong to ωi; then π is defined as follows:

$$\pi(\omega) = \arg\max_{y \in Y} \sum_{x \in Q(\omega, y)} w(x), \qquad \omega \in \Omega, \qquad (1)$$
where a label is arbitrarily chosen when the same maximum is obtained for more than one class. That is, π maps a class ω of the wizard into the class of the apprentice that maximizes the cumulative weight of the images that g maps back into ω. If no apprentice image belongs to ω, we define π(ω) to be the class of maximal total weight. If we interpret w as a misclassification cost, our definition of π is the mapping which, when combined with g, minimizes the total misclassification error on the images of the apprentice:

$$\min_{\pi : \Omega \to Y} \sum_{(x, y)} w(x)\,\bigl(1 - \chi_{\{y\}}(\pi(g(x)))\bigr), \qquad (2)$$

where the summation is taken over the pairs (x, y) of images of the apprentice with the corresponding labels, and where χ denotes the indicator function (χ_A(x) = 1 if x ∈ A, 0 otherwise). The composition h = π ◦ g directly classifies elements of X into Y. In addition to embedding the correlation between the wizard and the apprentice, h has a useful property: the part defined by g is independent of the apprentice, so it can be computed off-line, allowing the adoption of complex machine learning algorithms such as SVMs, neural networks, and the like; the part defined by π, instead, can be computed very quickly, since it is linear in the number of images labeled by the apprentice and does not depend on the whole album of the wizard, but only on its partial representation provided by g. In this work, g is a k-nearest neighbor (KNN) classifier. Other classification techniques may be used as well, some of which would probably lead to better results. We decided to use the KNN algorithm because it is simple enough to let us concentrate on the correlation between the users, which is the main focus of this paper.
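As an illustration of (1) and of the composition h = π ◦ g, the mapping can be computed with a single pass over the labeled images. The sketch below is ours; the wizard classifier g is assumed to be any callable that maps a feature vector to one of the wizard’s classes.

```python
from collections import defaultdict

def learn_mapping(labeled, g, weight=lambda x: 1.0):
    """Compute the mapping pi of Eq. (1).

    labeled: list of (x, y) pairs, i.e. apprentice images with their labels.
    g: wizard classifier, a callable x -> wizard class omega.
    weight: relevance function w(x); uniform by default.
    """
    votes = defaultdict(lambda: defaultdict(float))   # omega -> label -> cumulative weight
    totals = defaultdict(float)                       # label -> total weight (fallback)
    for x, y in labeled:
        votes[g(x)][y] += weight(x)
        totals[y] += weight(x)
    fallback = max(totals, key=totals.get)            # class of maximal total weight

    def pi(omega):
        if omega in votes:
            return max(votes[omega], key=votes[omega].get)
        return fallback                               # no apprentice image fell into omega
    return pi

def compose(pi, g):
    """h = pi o g: classify a picture directly into the apprentice's labels."""
    return lambda x: pi(g(x))
```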
2.2 Image Description

Since we do not know the classes that the users will define, we selected a set of four features that give a fairly general description of the images. We considered two features that describe color distribution, and two that are related to shape information. One color and one shape feature are based on the subdivision of the images into sub-blocks; the other two are global. We use spatial color moments, a color histogram, an edge direction histogram, and a bag-of-features histogram. Spatial color distribution is one of the most widely used features in image content analysis and categorization. In fact, some classes of images may be characterized in terms of the layout of color regions, such as blue sky on top or green grass at the bottom. Similarly to Vailaya et al. [14], we divided each image into 7 × 7 blocks and computed the mean and standard deviation of the value of the color channels of the pixels in each block. This feature is made of 294 components (six for each block). Color moments are less useful when the blocks contain heterogeneous color regions. Therefore, a global color histogram has been selected as a second color feature. The RGB color space has been subdivided into 64 bins by a uniform quantization of each component into four ranges. Statistics about the direction of edges may greatly help in discriminating between images depicting natural and man-made subjects [15]. To describe the most salient edges we used an 8-bin edge direction histogram: the gradient of the luminance image is computed using Gaussian derivative filters tuned to retain only the major edges. Only the points for which the magnitude of the gradient exceeds a set threshold contribute to the histogram. The image is subdivided into 5 × 5 blocks, and a histogram for each block is computed (for a total of 200 components). Bag-of-features representations have become widely used for image classification and retrieval [18,16,3]. The basic idea is to select a collection of representative patches of the image, compute a visual descriptor for each patch, and use the resulting distribution of descriptors to characterize the whole image. In our work, the patches are the areas surrounding distinctive key-points and are described using the Scale Invariant Feature Transform (SIFT), which is invariant to image scale and rotation, and robust vis-à-vis a substantial range of distortions [8]. The SIFT descriptors extracted from an image are then quantized into “visual words”, which are defined by clustering a large number of descriptors extracted from a set of training images [10]. The final feature vector is the normalized histogram of the occurrences of the visual words in the image.

2.3 Combining Users

Of course, there is no guarantee that the classes chosen by two different users have a sufficient correlation to make our approach useful. This is why we need several wizards and a method for the selection of those who may help the apprentice organize his pictures. The same argument may be applied to the features as well: only some of them will capture the correlation between the users. Consequently, we treated the features separately instead of merging them into a single feature vector: given a set of pictures labeled by the apprentice, each wizard defines four different classifiers h, one for each feature considered. These classifiers are then combined into a single classification function that is applied to the pictures that the apprentice has not yet labeled. To combine the classifiers defined by the wizards we apply the multiclass variation of the AdaBoost algorithm proposed by Zhu et al. [19]. In particular, we used the variation called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME). Briefly, given a set {(x1, y1), . . . , (xn, yn)} of image/label pairs, the
algorithm selects the best classifier and assigns a coefficient to it. Different weights are assigned to correctly and incorrectly classified training pairs, and another classifier is selected taking into account the new weights. Further iterations are run in the same way, each time increasing the weight of misclassified samples and decreasing that of correctly classified samples. The coefficient associated with each classifier depends on the sum of the weights of the misclassified samples. In each iteration the classifier is chosen by a weak learner that takes into account all the wizards and all the features. For each wizard u and each of the four features f, a KNN classifier g_{u,f} has been previously trained. Given the weighted training sample, the corresponding mapping functions π_{u,f} are computed according to (1); this defines the candidate classifiers h_{u,f} = π_{u,f} ◦ g_{u,f}. The performance of each candidate is evaluated on the weighted training set and the best one is selected. The boosting procedure terminates after a set number T of iterations. Given an image to be labeled, a score is computed for each class:

$$s_y(x) = \sum_{t=1}^{T} \alpha^{(t)}\, \chi_{\{y\}}\bigl(h^{(t)}(x)\bigr), \qquad y \in Y, \qquad (3)$$
where h^{(t)} is the classifier selected at iteration t, and α^{(t)} is the corresponding weight. The combined classifier H is finally defined as the function which selects the class corresponding to the highest score:

$$H(x) = \arg\max_{y \in Y} s_y(x). \qquad (4)$$
The combined classifier can then be applied to unlabeled pictures. According to [19], the a posteriori probabilities P(y|x) may be estimated as:

$$P(y \mid x) = \frac{\exp\bigl(s_y(x)/(k-1)\bigr)}{\sum_{y' \in Y} \exp\bigl(s_{y'}(x)/(k-1)\bigr)}. \qquad (5)$$
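Putting (3)–(5) together, the prediction and the confidence measure used below can be sketched as follows. This is our illustration, with `classifiers` and `alphas` standing for the weak classifiers h^{(t)} and the weights α^{(t)} produced by the boosting procedure.

```python
import math

def combined_scores(x, classifiers, alphas, labels):
    """Eq. (3): per-class score as the alpha-weighted votes of the selected classifiers."""
    scores = {y: 0.0 for y in labels}
    for h, alpha in zip(classifiers, alphas):
        scores[h(x)] += alpha
    return scores

def classify_with_confidence(x, classifiers, alphas, labels):
    """Eq. (4): predicted label; Eq. (5): posterior estimates; plus a confidence margin."""
    k = len(labels)
    scores = combined_scores(x, classifiers, alphas, labels)
    exps = {y: math.exp(s / (k - 1)) for y, s in scores.items()}
    z = sum(exps.values())
    posterior = {y: e / z for y, e in exps.items()}
    ranked = sorted(posterior.values(), reverse=True)
    prediction = max(scores, key=scores.get)
    confidence = ranked[0] - ranked[1]        # gap between the two highest probabilities
    return prediction, posterior, confidence
```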
We used the difference between the two highest estimated probabilities as a measure of the confidence of the combined classifier. Unlabeled pictures can then be presented to the user sorted by decreasing confidence. It should be noted that the output of the classifiers g_{u,f} can be precomputed for all the images of the apprentice. The complexity of the whole training procedure is O(nUFT), that is, it is linear in the number of labeled pictures n, features considered F, wizards U, and boosting iterations T. The application of the combined classifier to unlabeled pictures may be worked out in O((N − n)T), where N is the number of the apprentice’s images. Finally, sorting requires O((N − n) log(N − n)). Using the settings described in Section 3, the whole procedure is fast enough on a modern personal computer for real-time execution, and can be repeated whenever a new picture is labeled without degrading the user’s experience.

2.4 Baseline Classifiers

In addition to exploiting the information provided by the wizards, we also considered a set of classifiers based on the contents of the apprentice’s pictures. They are four KNN
classifiers, one for each feature. They are trained on the pictures already labeled and applied to the unlabeled ones. These additional classifiers are included in the boosting procedure: at each iteration they are considered for selection together with the classifiers derived from the wizards. In the same way, it would be possible to include additional classifiers to exploit complementary information, such as camera metadata, which has been proven to be effective in other image classification tasks [1]. The four KNN classifiers are also used as baseline classifiers to evaluate how much our method improves the accuracy in predicting classes with respect to a more traditional approach.
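A rough sketch of one such baseline, assuming scikit-learn and a precomputed feature matrix for one feature type (k = 5 in the experiments, cf. Section 3); this is not the authors’ code.

```python
from sklearn.neighbors import KNeighborsClassifier

def baseline_knn(features_labeled, labels, features_unlabeled, k=5):
    """Train a KNN classifier on the already labeled pictures of the apprentice
    (for a single feature type) and predict labels for the unlabeled ones."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(features_labeled, labels)
    return knn.predict(features_unlabeled)
```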
3 Experimental Results

To test our method we downloaded from flickr the images of 20 users. Each user was chosen as follows: i) a “random” keyword is chosen and passed to the flickr search engine; ii) among the authors of the pictures in the result of the search, the first one who organized his pictures into 3 to 10 sets is selected. In order to avoid excessive variability in the size of users’ albums, sets containing less than 10 pictures are ignored and sets containing more than 100 pictures are sub-sampled in such a way that only 100 random images are downloaded. Duplicates have been removed from the albums. The final size of the users’ albums ranges from 102 to 371, for a total of 3933 pictures. Unfortunately, some of the selected users did not organize their pictures by content: there were albums organized by time periods, by aesthetic judgments, and so on. Since our system is not designed to take this kind of categorization into account, we decided to reorganize the albums by content. To do so, we assigned each album to a different volunteer, and we asked him to label the pictures by content. The volunteers received simple directions: each class must contain at least 15 pictures and its definition must be based on visual information only. The volunteers were allowed to ignore pictures to which they were not able to assign a class (which usually happened when the obvious class would have contained less than 15 images). The ignored pictures were removed from the album for the rest of the experimentation. Table 1 reports the classes defined by the volunteers for the 20 albums considered. To quantitatively evaluate the performance of the proposed method we implemented a simulation of user interaction [4]. This approach allows an objective evaluation of the methodology without taking into account the design and usability of the user interface. As a measure of performance, we considered the fraction of cases in which the class proposed by the system for the picture selected in step 3c agrees with the annotation performed by the volunteer. The simulation has been executed for the 20 albums considered. Each time, one album corresponds to the apprentice and the other 19 correspond to the wizards. Since the final outcome may be heavily influenced by the random choice of the first picture, we repeated the simulation 100 times for each album. Three variants of the method have been evaluated: i) using only the KNN classifiers as candidates; ii) using only wizard-based classifiers; iii) using both KNN and wizards. The parameters of the method have been tuned on the basis of the outcome of preliminary tests conducted on ten additional albums annotated by the authors. The number of
Table 1. Summary of the annotation performed by the 20 volunteers. For each album, the number of pictures and the names given to the classes into which the images have been divided are reported.

Album  Size  Classes
1      328   animals, artefacts, outdoor, vegetables
2      261   boat, city, nature, people
3      182   close-ups&details, landscapes, railways, portraits&people, sunsets
4      251   buildings, flora&fauna, musicians, people, things
5      177   animals, aquatic-landscape, objects, people
6      188   animals, buildings, details, landscape, people
7      151   arts, city, hdr
8      182   buildings, hockey, macro
9      140   bodies, environments, faces
10     227   animals, beach, food, objects, people
11     371   animals, sea, sunset, vegetation
12     168   animals, flowers, horse racing, rugby
13     170   animals, concert, conference, race
14     209   aquatic, artistic, landscapes, close-ups
15     146   beach, calendar, night, underwater
16     134   animals, family, landscapes
17     158   animals, cold-landscapes, nature-closeups, people, warm-landscapes
18     156   buildings, landscape, nature
19     102   leaves&flowers, men-made, panorama, pets, trees
20     234   microcosm, panorama, tourism
neighbors considered by the wizards and by the KNN classifiers has been set to 21 and 5, respectively; the number of boosting iterations has been set to 50. Table 2 shows the average percentage of classification errors obtained on the 20 albums by the three variants of the method. Regardless of the variant considered, there is a high variability in performance on the 20 albums, ranging from about 4% to 60% of misclassifications. Albums 8, 13, and 15 have been organized into classes which are easy to discriminate and obtained the lowest classification errors. It is interesting to note that these three albums have been the easiest to annotate manually as well (according to informal volunteers’ feedback). In particular, albums 13 and 15 have been annotated by the volunteers into classes that are very similar to those defined by the original flickr users: in both cases the only difference is that two sets have been merged by the volunteers into a single class. The opposite happens for the albums with the highest classification errors: album 4 originally contained 12 classes, while albums 5 and 19 were organized in 8 classes. In no case was the best result obtained using only the wizard-based classifiers. For six albums (1, 3, 5, 14, 16, 18) the wizard-only variant of the method obtained lower errors than the KNN-only variant. It seems that, in the majority of the cases, direct information about image similarity cannot be ignored without a performance loss. The combination of wizards and KNN classifiers outperformed the two other strategies on 14/20 albums. In some cases the improvement is barely noticeable, but in other cases it is significant, with a peak decrease in misclassifications of more than 6% for album
Table 2. Percentage of errors obtained by simulating user interaction on the 20 albums considered. The results are averaged over 100 simulations. For each album, the best performance is marked with an asterisk. Standard deviations are reported in brackets.

Album  KNN only      Wizards       KNN + Wizards
1      30.4% (1.5)   28.8% (0.9)   27.9% (0.9)*
2      30.3% (1.3)   33.4% (1.2)   26.6% (1.8)*
3      51.3% (2.1)   47.0% (1.9)   45.1% (2.1)*
4      55.5% (2.0)   55.9% (1.4)   54.0% (1.8)*
5      54.6% (2.4)   54.5% (2.3)   54.2% (2.2)*
6      48.0% (1.9)   48.2% (2.1)   46.5% (1.9)*
7      24.7% (1.0)*  32.8% (1.6)   27.1% (1.9)
8      12.3% (1.4)*  13.2% (1.0)   13.5% (1.2)
9      43.5% (1.9)*  45.4% (2.1)   45.4% (2.1)
10     31.4% (1.4)*  35.9% (1.7)   32.1% (1.5)
11     27.1% (1.1)   27.9% (1.2)   24.4% (1.3)*
12     20.7% (1.3)*  35.7% (1.9)   23.9% (1.7)
13     17.6% (1.2)   18.9% (1.4)   16.2% (1.0)*
14     52.2% (1.9)   51.3% (1.6)   51.2% (1.7)*
15     4.6% (1.4)    10.5% (1.1)   4.5% (0.7)*
16     32.6% (2.1)   30.5% (2.1)   27.3% (2.1)*
17     35.2% (2.3)   39.4% (1.7)   34.2% (2.0)*
18     36.2% (2.1)   34.0% (1.6)   32.9% (2.1)*
19     57.0% (3.3)*  62.5% (3.4)   60.0% (3.6)
20     21.6% (1.4)   21.9% (0.9)   18.8% (1.2)*
Fig. 1. Percentage of misclassifications obtained on the 20 albums, varying the number of wizards considered (x-axis: number of wizards, 1–19; y-axis: percentage of misclassifications). To improve the readability of the plots, the albums have been grouped by similar performance into four panels: albums 8, 11, 13, 15, 20; albums 1, 2, 7, 16, 18; albums 3, 9, 10, 12, 17; and albums 4, 5, 6, 14, 19.
3. For the other six albums the KNN baseline classifier is the best approach, with a slight improvement over the KNN+wizards variant (a maximum of 3.2% for album 12). To verify the influence of the number of wizards on classification accuracy, we repeated the simulations of the wizards-only variant of the method, each time sampling a different pool of wizards. For each album, simulations are performed sampling 1, 4, 7, 10, 13, 16, and 19 wizards, and each simulation has been repeated 50 times (a different pool of wizards is randomly sampled each time). The plots in Figure 1 report the results obtained in terms of the average percentage of misclassification errors. As expected, for almost all albums, the error rate decreases as the number of wizards increases. The plots suggest that in most cases better performance may be obtained by considering more wizards, in particular for the albums where the lowest errors have been obtained (see the first plot of the figure).
4 Conclusions

In this paper, we described a content-based method for the semi-automatic organization of personal photo collections. The method exploits the correlations, in terms of visual content, between the pictures of different users, considering, in particular, how they organized their own pictures. Combining this approach with a KNN classifier, we obtained better results (measured on the pictures of 20 flickr users) with respect to a traditional classification-by-similarity approach. In this work, we considered the apprentice and the wizards as clearly different characters. We plan to extend our approach to actual photo-sharing communities, where each user would be apprentice and wizard at the same time. However, in order to scale up to millions of wizards (the size of the user base of major photo-sharing websites), a method should be designed for filtering only the wizards that are likely to provide good advice. Moreover, we are considering exploiting additional sources of information such as keywords, annotations, and camera metadata. Finally, we are investigating similar approaches, based on the correlation between users, for other image-related tasks such as browsing and retrieval.
References
1. Boutell, M., Luo, J.: Bayesian fusion of camera metadata cues in semantic scene classification. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 623–630 (2004)
2. Das, M., Loui, A.: Automatic face-based image grouping for albuming. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3726–3731 (2003)
3. Grauman, K., Darrell, T.: The pyramid match kernel: discriminative classification with sets of image features. In: Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1458–1465 (2005)
4. Ivory, M.Y., Hearst, M.A.: The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys 33(4), 470–516 (2001)
5. Jaimes, A., Benitez, A., Chang, S.-F., Loui, A.: Discovering recurrent visual semantics in consumer photographs. In: Proceedings of the International Conference on Image Processing, vol. 3, pp. 528–531 (2000)
6. Li, J., Lim, J., Tian, Q.: Automatic summarization for personal digital photos. In: Proceedings of the Fourth International Conference on Information, Communications and Signal Processing, vol. 3, pp. 1536–1540 (2003)
7. Li, X., Snoek, C., Worring, M.: Learning tag relevance by neighbor voting for social image retrieval. In: Proceedings of the First ACM International Conference on Multimedia Information Retrieval, pp. 180–187 (2008)
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
9. Mulhem, P., Lim, J.: Home photo retrieval: Time matters. In: Bakker, E.M., Lew, M., Huang, T.S., Sebe, N., Zhou, X.S. (eds.) CIVR 2003. LNCS, vol. 2728, pp. 308–317. Springer, Heidelberg (2003)
10. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161–2168 (2006)
11. Platt, J.: Autoalbum: clustering digital photographs using probabilistic model merging. In: Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, pp. 96–100 (2000)
12. Shevade, B., Sundaram, H.: Vidya: an experiential annotation system. In: Proceedings of the ACM SIGMM Workshop on Experiential Telepresence, pp. 91–98 (2003)
13. Sun, Y., Zhang, H., Zhang, L., Li, M.: Myphotos: a system for home photo management and processing. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 81–82 (2002)
14. Vailaya, A., Figueiredo, M., Jain, A., Zhang, H.-J.: Image classification for content-based indexing. IEEE Transactions on Image Processing 10(1), 117–130 (2001)
15. Vailaya, A., Jain, A., Zhang, H.J.: On image classification: city images vs. landscapes. Pattern Recognition 31(12), 1921–1935 (1998)
16. Wallraven, C., Caputo, B., Graf, A.: Recognition with local features: the kernel recipe. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 1, pp. 257–264 (2003)
17. Wenyin, L., Sun, Y., Zhang, H.: Mialbum – a system for home photo management using the semi-automatic image annotation approach. In: Proceedings of the Eighth ACM International Conference on Multimedia, pp. 479–480 (2000)
18. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73(2), 213–238 (2007)
19. Zhu, J., Rosset, S., Zou, H., Hastie, T.: Multiclass AdaBoost. Technical report, Stanford University (2005), http://www-stat.stanford.edu/~hastie/Papers/samme.pdf
On the Feasibility of a Tag-Based Approach for Deciding Which Objects a Picture Shows: An Empirical Study
Viktoria Pammer1, Barbara Kump2, and Stefanie Lindstaedt1
1 Know-Center, {vpammer,slind}@know-center.at
2 Knowledge Management Institute, TU Graz, [email protected]
The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Ministry of Transport, Innovation and Technology, the Austrian Ministry of Economics and Labor and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.
Abstract. Many online platforms allow users to describe resources with freely chosen keywords, so-called tags. The specific meaning of a tag, as well as its specific relation to the tagged resource, is left open to the interpretation of the user. Although human users mostly have a fair chance at interpreting such tags, machines do not. An algorithmic approach for identifying descriptive tags could, however, prove useful for intelligent picture search and for providing first-cut overviews of tagged picture repositories. In this paper we investigate the characteristics of the problem of deciding which tags describe visible entities on a given picture. Based on a systematic user study, we are able to discuss in detail the problems involved for both humans and machines when identifying descriptive tags. Furthermore, we investigate the general feasibility of developing a tag-based algorithm tackling this question. Finally, a concrete implementation and its evaluation are presented.
1 Introduction
Various social software and collaborative tagging platforms have sprung up everywhere on the web. They enable users to describe photos, news, blogs, research publications and web bookmarks with freely chosen keywords, so called tags, and to share both their content and their tags with other users. The appeal of tagging lies in its simplicity for the tag producer, who can attach to a resource any keyword he or she deems appropriate. No explanation of the exact meaning of the tag has to be provided and no rules restricting the vocabulary have to be adhered to. This ease of use on the tag producer’s side creates a disadvantage on the side of the tag consumer. The (human or machine) tag consumer who wants to use other people’s tags for some purpose has to interpret the given tags. Human tag consumers may often be able to do so, given both the tag and
the tagged resource. However, they may already have difficulty in finding the appropriate tags to formulate queries [1]. Machines that want to consume tags are at an even greater disadvantage. Consequently, multiple attempts at enriching the tags themselves with semantics have been made (Sec. 2). In this paper, we analyse the semantics of a specific relation between tags and pictures, namely the "shows"-relation, which expresses that a tag refers to an entity which is visible on the picture. We present a problem analysis (Sec. 3) and discussion (Sec. 5) of the challenge of identifying such a relation, derive a theoretical upper limit for tag-based algorithms attempting to identify the "shows"-relation, and describe a proof-of-concept implementation and its evaluation (Sec. 4).
2 The Semantics of Tags
The abundance of tagged data on the web today has prompted researchers to investigate the semantics of tags. In many cases, the approaches taken attempt to specify the exact meaning of each tag within one collaborative tagging system. Such approaches typically derive ontologies of some kind from the folksonomic structures relating users, tags and resources which underlie collaborative tagging systems. For instance, Mika [2] derives light-weight ontologies directly from folksonomies, such that they represent a community's view on a topic. A different approach is taken e.g. by Schmitz [3], who uses WordNet to lay an ontology over the tags by mapping tags to concepts in WordNet. Rattenbury et al. [4] infer whether a tag describes an event or a location based on characteristics of the temporal and spatial distribution of tags. Considered under the aspect of the semantics of the relation between tags and resources, tags that describe a location or an event do not only fall into different categories, in that they would be put in different places in an ontology, but also inherently differ in the relation they can have to a picture. More directly related to our research, Golder and Huberman [1] describe the kinds of tags encountered within collaborative tagging systems. According to them, tags can for instance identify who or what the resource is about, the kind of the resource (e.g., article, picture, example), who owns the resource, or characteristics of the resource (e.g., awesome, interesting). These kinds of tags correspond, in a different terminology, to the relation between tags and the resources they are attached to. Note that the starting point for discussing semantics is different when discussing tags in collaborative tagging systems or labels in coordinated image labelling efforts, where the goal is to create training sets for automatic classifiers (see e.g. [5] or the ESP Game, http://www.espgame.org/gwap/gamesPreview/espgame/). In the latter, the semantics of labels is predefined by the goal, i.e. a correct label describes an object visible on the labelled picture. Similar to the categorisation of different kinds of tags by Golder and Huberman [1], Bechhofer et al. [6] categorise different kinds of semantic annotations. Semantic annotations differ from tags insofar as they unambiguously define the meaning of each tag (the tag becomes a concept) and the meaning of each relation between a tag and a resource. One of the possible meanings of a relation cited by Bechhofer et al. is "instance reference", to which our contribution is strongly linked.
The core of our research is the observation that not all tags attached to a picture describe the visible content of the picture [1,6]. In collaborative tagging environments, people tag not only what is visible on a picture (e.g. the Eiffel Tower), but also add tags which for instance relate to the context in which the picture was taken (e.g. Paris trip, holiday), to adjectives that describe the picture (e.g., impressive, high), or are simply statements expressing the user's likes or dislikes (e.g., wow!). We are specifically interested in the "shows"-relation between a picture and a tag. It can be verbalised as "Picture <X> shows a(n) <Y>", where "<X>" stands for a specific picture and "<Y>" stands for a specific tag. For instance, one could say "The picture in Figure 1 shows a flower". For brevity, we often refer to tags in a "shows"-relation to the picture they are related to as "descriptive tags". The "shows"-relation asserts that there is some part of the picture on which a real-world instance corresponding to the tag is visible, but the exact part is not further specified and does not have a URI. Indeed, it is only the picture as a whole which has a URI. Furthermore, for the time being we do not differentiate between the case where the tag denotes a concept and an instance of this concept is visible on the picture (e.g. "castle"), and the case where the tag denotes an instance and this instance is visible on the picture (e.g. "Versailles").
As an example, consider Fig. 1. It shows flowers, a garden and a house in the background. There are many more tags assigned to the picture. Depending on the viewer's knowledge, she might recognise the flower as a hollyhock and the house as "Haus Liebermann", a villa near Berlin's Wannsee (a lake). Then, if the viewer spoke German, the tags "Garten" and "Malve" could be recognised as translations of "garden" and "hollyhock". It is the ultimate goal of our work to devise an algorithm that can identify precisely this "shows"-relation. Our concrete research question is how humans do, and how machines possibly can, decide whether a tag describes an object visible on a picture, given the picture and the tag.
3 What Does the Picture Show? An Empirical Study with Two Human Raters
In an empirical study we sampled data from Flickr and let two human raters answer the question which tags refer to visible entities on the corresponding pictures. Based on this sample data, we investigated (a) the proportion of descriptive tags and (b) the reliability of human ratings. The latter indicates whether the given problem, identifying descriptive tags, can reliably be decided given a picture and its tags. If the ratings differed substantially across time or across raters, for instance, this would indicate that the problem can only be decided by taking into account additional information such as the time of day or the mood of the rater.
Given tags: Flower, Garten, garden, Haus Liebermann, Malve, Weitwinkel, hollyhock, wide angle, Wannsee, Germany, NaturesFinest. Tags probably describing content: Malve, hollyhock, Flower, Garten, garden, Haus Liebermann.
Fig. 1. Only part of the tags describe the content of the picture. Other tags describe where it was taken and which camera calibration was used. Lower- and uppercase writing was taken over from the original Flickr tags. The Flickr page of this photo is online at http://flickr.com/photos/sevenbrane/2631266076/. The photo is licensed under the Creative Commons http://creativecommons.org/licenses/by-nc/2.0/deed.en
3.1 Setup of the Study
Data Sets. As data source for our study we chose Flickr, which provided us with pictures and tags given both by picture owners and visitors. Flickr is an online photo sharing and management platform. Pictures on Flickr mostly show everyday objects that are also visible to the human eye and are mostly taken with handheld cameras. Tags are mostly in English (see e.g. [7]). For the analyses, a data set (Set A) was selected. In order to arrive at Set A, a set of 500 publicly available pictures rated as "most interesting" was downloaded on July 3, 2008 from Flickr, using the Flickr API (http://www.flickr.com/services/api/). Pictures without tags were removed from the dataset, which left 405 photos. The photos were mostly tagged by their owners, but partly also by visitors to the pictures. The thus compiled data set for our study (Set A) consists of 3862 (picture, tag) pairs and was then also used for the evaluation of the algorithm described in Sec. 4.3. For investigating the reliability of human rating decisions, a further data set (Set B)
was needed. Set B consists of 20 randomly chosen pictures from Set A. Set B hence is a subset of Set A and it comprises 189 (picture, tag) pairs.
Rating Procedure. Two human raters participated in our investigation. Both Rater 1 and Rater 2 were native German speakers with a good knowledge of English. In order to decide on the relation "this picture shows a(n) <tag>", a rating has to be made for each (picture, tag) pair, namely whether the tag describes something that is visible on the picture. A judgement with regard to one picture and one tag therefore consists of deciding either "yes, this tag describes an object visible on the picture" (positive decision) or "no, this tag does not describe an object visible on the picture" (negative decision). The rating can be done by human raters or by a machine (algorithm). For the user study, Rater 1 was provided with a table of all (picture, tag) pairs from Set A, and with all 405 pictures that built the basis for Set A. Rater 1 was instructed to rate each (picture, tag) pair according to the rating procedure described above. The decision was noted in a table next to the (picture, tag) pair. This procedure was repeated for all 405 pictures, i.e. for all 3862 (picture, tag) pairs. The rating procedure for Rater 2 was equal to that of Rater 1, but was only performed on Set B.
Quantifying the Agreement between Different Rating Sources. If two (human or machine) raters agree, this means that both raters decide for a (picture, tag) pair either "yes, the picture shows a <tag>", or both decide "no, the picture does not show a <tag>". If the same rater gives a judgement at two different points in time, the same terminology applies. Disagreement means that one of the raters decides "yes, the picture shows a <tag>" while the other rater comes to the conclusion "no, the picture does not show a <tag>". Aggregated over a sample of tags, the extent of agreement and disagreement can be visualized by means of a contingency table or a bar chart. The percentage of agreement (or disagreement, respectively) of two raters can easily be computed as the ratio between the number of judgements on which both raters agree and the total number of judgements. Moreover, the agreement and disagreement can be measured using a correlation coefficient or a contingency coefficient. For the analyses that we describe in the remainder of this article we use the Φ coefficient (see e.g. [8]), a measure of the degree of association between two binary variables. The binary variable in our case is the yes/no decision taken with respect to a (picture, tag) pair. The values of Φ range from −1 to +1. For our purposes, Φ = +1 can be interpreted as perfect agreement, which means that the raters agree on all rating decisions. Perfect disagreement is indicated by Φ = −1 and means that two raters disagree on all rating decisions. A coefficient of Φ = 0 means that the rating decisions of two raters are not systematically connected at all. In other words, if Φ = 0, the decision of one rater (whether a tag refers to an object visible on a picture) cannot be predicted from the other rater's decision. Φ also allows testing for statistical significance.
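For concreteness, Φ for two paired yes/no rating vectors can be computed directly from the 2×2 contingency table. The following short Python sketch is our own illustration, not part of the original study; significance can then be checked via χ² = N·Φ² with one degree of freedom.

```python
from math import sqrt

def phi_coefficient(ratings_a, ratings_b):
    """Phi coefficient for two paired binary (yes/no) rating lists."""
    assert len(ratings_a) == len(ratings_b)
    # 2x2 contingency table: a = yes/yes, b = yes/no, c = no/yes, d = no/no
    a = sum(1 for x, y in zip(ratings_a, ratings_b) if x and y)
    b = sum(1 for x, y in zip(ratings_a, ratings_b) if x and not y)
    c = sum(1 for x, y in zip(ratings_a, ratings_b) if not x and y)
    d = sum(1 for x, y in zip(ratings_a, ratings_b) if not x and not y)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# Example: two raters judging five (picture, tag) pairs
# (True = "shows", False = "does not show").
rater1 = [True, True, False, False, True]
rater2 = [True, False, False, False, True]
print(round(phi_coefficient(rater1, rater2), 2))   # 0.67
# Significance: chi2 = n * phi**2 with 1 degree of freedom.
```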
Naturally, an algorithm must aim towards perfect agreement. However, any algorithm that shows perfect disagreement could easily be changed into one with perfect agreement by simply inverting its decisions.
3.2 Percentage of Tags Referring to an Object Visible on a Picture
One question to be answered in our study was how many tags on average describe visible entities on the tagged pictures. Rater 1 rated 782 (20.3%) of the 3862 tags in Set A as describing visible objects. The low number of tags which are considered to refer to something visible on a picture is remarkable. The goal of our research was to investigate the possibility of automatically distinguishing exactly those 20.3% of descriptive tags from the other 79.7% of non-descriptive tags.
3.3 Reliability of the Rating Decisions
The other question that we wanted to answer concerned the reliability of the rating decisions by the human raters. Two aspects of reliability were taken into account, namely retest reliability and inter-rater reliability. In this context, retest reliability refers to the stability of the judgements of Rater 1 over time and their independence of confounding variables (e.g. mood of the rater, day of the week). If a rater is not able to make reliable decisions, this is a source of error variance, and an indication against the feasibility of deriving an algorithm which is able to approximate human ratings. Inter-rater reliability means the extent to which the result is unaffected by individual rating tendencies of Rater 1. If inter-rater reliability is low, any algorithm trying to take that decision must be personalised.
Retest Reliability. For investigating retest reliability, Rater 1 was asked to repeat part of her ratings two weeks after she had performed the rating procedure described in Sec. 3.1 for Set A. The repeated ratings were given on Set B (189 (picture, tag) pairs, a subset of Set A). At the time the retest ratings of Rater 1 were given, one of the pictures was no longer online. Therefore Rater 1 rated only 182 (picture, tag) pairs in the retest round. Examination of the retest reliability gave Φ = 0.84 (p < .01). This indicates a high retest reliability. In concrete numbers, Rater 1 assessed 182 tags twice. At the second rating, she rated 9 (4.9%) out of these 182 tags differently than the first time.
Inter-rater Reliability. In order to assess inter-rater reliability, Rater 2 was asked to judge the tags assigned to the pictures in Set B. These ratings were compared with the original ratings of Rater 1 on Set A. Overall, 189 (picture, tag) pairs were judged by both raters. The agreement of the two raters was Φ = 0.77 (p < .01). This also indicates a high inter-rater reliability. In concrete numbers, Rater 1 and Rater 2 disagreed on 14 (7.4%) out of 189 tags.
Discussion. Since both retest reliability and inter-rater reliability are satisfactory, it can be assumed that the presentation of a picture and a tag suffices in principle to predict the judgement of a human rater on whether the tag is in a "shows"-relation with the picture.
4 A Tag-Based Algorithm
After having found that, given a picture and its tags, the decision which tags describe visible objects on the picture can be made reliably by human raters (see above), we were interested in whether even more contextual information could be taken away. Assuming an agent only sees the tags, without the picture, how well can it guess which tags will describe visible entities on the picture? Besides its academic interest, this question is of very practical relevance: analysing a number of tags is easier than analysing picture content, and an algorithm which can automatically identify the "shows"-relation could clearly be useful. For instance, starting from a picture showing specific objects (e.g. a house and flowers), it could immediately retrieve pictures which show similar kinds of objects. Such an algorithm could also be useful to get a rough overview of the content of image repositories. An automatic execution of this task has the advantage of scaling to huge amounts of data and of producing machine-readable metadata which can be processed further automatically.
4.1 Quantifying the Limitation of a Tag-Based Algorithm
A tag-based algorithm has knowledge about the (meaning of) tags and may additionally have prior knowledge about the pictures to be expected. In order to quantify the limitations of any tag-based algorithm on Set A with respect to the ratings of Rater 1, an optimal algorithm was constructed. The algorithm is optimal in the sense that no other tag-based algorithm can exist which agrees better with Rater 1's ratings on Set A. The optimal algorithm always decides exactly as Rater 1 did, except for tags where the human rater gave alternating decisions. In these cases, the optimal algorithm goes with the majority of decisions. If there is a draw, the optimal algorithm chooses "no". For instance, the tag "leaves" occurred five times, and Rater 1 decided four of those times that the tag described objects visible on the picture. The optimal algorithm stays with the majority of decisions and always decides "yes" for the tag "leaves". Going with the majority ratings was an arbitrary decision by the authors of the study. It does not influence the overall correlation, but would influence an in-depth study of false positive and false negative decisions. The overall agreement between the optimal algorithm and Rater 1's ratings on Set A was computed. The correlation coefficient was Φ = 0.94 (p < .01). In concrete numbers, the optimal algorithm agreed with the human rater on 3783 (98.0%) of 3862 tags. This very high correlation indicates that in a fixed environment, e.g. within one platform, the patterns of usage are consistent enough to make an informed guess about whether a tag represents a visible entity or not. To some extent this also corresponds to making a guess about the actual meaning of a tag.
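The per-tag majority construction described above is straightforward to express in code. The sketch below is our own illustration of the idea; the function and variable names are hypothetical and not taken from the study.

```python
from collections import defaultdict

def build_optimal_tag_algorithm(rated_pairs):
    """rated_pairs: iterable of (tag, decision), where decision is True if the
    human rater judged the tag to describe a visible object on that picture.
    Returns a function tag -> True/False implementing the per-tag majority
    vote described above (ties and unseen tags default to 'no')."""
    counts = defaultdict(lambda: [0, 0])   # tag -> [yes_count, no_count]
    for tag, decision in rated_pairs:
        counts[tag][0 if decision else 1] += 1
    majority = {tag: yes > no for tag, (yes, no) in counts.items()}
    return lambda tag: majority.get(tag, False)

# Example with the 'leaves' case mentioned above (4 yes, 1 no -> always 'yes').
ratings = [("leaves", True)] * 4 + [("leaves", False)] + [("vienna", False)] * 3
optimal = build_optimal_tag_algorithm(ratings)
print(optimal("leaves"), optimal("vienna"))   # True False
```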
4.2 Algorithm Design
Given the theoretical feasibility of a well-performing tag-based algorithm, we implemented and evaluated a WordNet-based algorithm (WN-Algorithm) as a proof of concept. Its interest to the reader lies in what we think is a generally applicable design, and most of all in its simplicity combined with its good performance. As was already seen, a tag-based algorithm deterministically guesses, given a tag, whether it will be visible on a given (indeed on any) picture. In order to perform well, such an algorithm must have general knowledge about the meaning of tags, i.e. which tags describe concepts that can in principle be visible on a picture, as well as some knowledge about the domain, i.e. which pictures can be expected. The latter is necessary to make reasonable decisions for tags like "Africa", which in Flickr mostly seems to mean "taken in Africa", while in a database of satellite pictures it would probably mean "Africa is visible". The design of WN-Algorithm follows two steps. First, every tag must be disambiguated, i.e. it must be decided which of a tag's possibly many meanings the algorithm shall assume. In a second step, given the meaning of a tag, the algorithm must decide whether it is likely to denote a visible entity on any given picture of the picture database.
Disambiguation. The knowledge about the meanings of words, and the general frequency of their occurrence, are both encoded in WordNet [9,10], a lexical database of English. For instance, WordNet encodes that the word "wood" may mean, amongst other things, either the material or a forest, but that in English mostly the former is meant. Additionally, WordNet maps English words (nouns, adjectives, adverbs, verbs) to sets of synonyms, and covers a reasonable proportion of the words used as tags in Flickr [7]. For more specialised domains, a domain ontology including references from concepts to words (e.g. via labels, comments, synonyms) would be necessary. We actually side-stepped the issue of disambiguation by mapping each tag simply to the concept it most frequently stands for. The relevant point is to understand that this procedure is a simple heuristic. Clearly this potentially affects the proposed algorithm whenever the most frequent meaning of a word according to WordNet does not correspond to the most frequent meaning of a tag in Flickr. In order to deal with the most obvious such cases, we had to introduce stop- and go-wordlists.
Rules. The knowledge about which concepts denote entities likely to be visible and which do not was encoded into rules based on the underlying knowledge structure, i.e. WordNet. Rules were formulated as decision criteria in terms of the WordNet hierarchy. The hierarchy was traversed from top to bottom, and at appropriate levels rules specified whether all concepts and instances in the subtree below would be decided to denote visible entities on any given picture or not. (A concept refers to an idea of something, often something abstract, e.g. "love", or a group of real-world entities, e.g. "flower"; an instance refers to a specific entity in the real world, e.g. "Big Ben" is an instance of the concept "clock tower" and refers to a specific entity.) The WordNet-based rules were preceded by a preprocessing stage in which stemming (stripping tags of possible suffixes) and the exclusion of adjectives, adverbs and verbs was performed (adjectives denote qualities of possibly visible things, but there is nothing like an instance of "beautiful", and similarly for adverbs and verbs). An exemplary rule in WN-Algorithm is: "If a tag corresponds to an instance of a concept in WordNet, in most cases it does not describe a visible object on a picture in Flickr, except if it is an instance of the concept 'artefact'". Consequently, for any tag which corresponds to an instance of a concept other than "artefact", e.g. "vienna" or "mozart", WN-Algorithm guesses that the tag does not describe a visible entity on any picture in Flickr. Indeed, the tag "vienna" mostly means that the picture was taken in Vienna, not that the picture shows (the whole of) Vienna. A search for pictures tagged with "mozart" returns a lot of pictures taken in Salzburg and some from opera performances, evidently of one of Mozart's works, but among the first 100 hits there is not a single (!) picture showing Mozart himself. Instances of the concept "artefact" are for example "eiffel tower" or "golden gate bridge".
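To make the two-step design more concrete, the following sketch shows how the two rules spelled out above (most-frequent-sense disambiguation and the "instance of artefact" rule) could be expressed with NLTK's WordNet interface. It is only an approximation of WN-Algorithm: the stop- and go-wordlists and the full rule set are not given in the paper, and the fallback test against the physical_entity subtree is our own assumption, not a rule taken from the paper.

```python
from nltk.corpus import wordnet as wn   # requires the WordNet corpus (nltk.download('wordnet'))

ARTEFACT = wn.synset('artifact.n.01')
PHYSICAL = wn.synset('physical_entity.n.01')

def ancestors(synset):
    """All hypernyms and instance hypernyms of a synset, transitively."""
    return set(synset.closure(lambda s: s.hypernyms() + s.instance_hypernyms()))

def guesses_visible(tag):
    """Guess whether a tag denotes something visible on a picture.
    Sketch only: stemming is approximated by wn.morphy, disambiguation by taking
    the most frequent noun sense, and 'visible' by the physical_entity subtree
    (the last point is an assumption of this sketch)."""
    base = tag.lower().replace(' ', '_')
    lemma = wn.morphy(base, wn.NOUN) or base
    synsets = wn.synsets(lemma, pos=wn.NOUN)
    if not synsets:                       # adjectives, adverbs, verbs, unknown words
        return False
    sense = synsets[0]                    # most frequent sense heuristic
    if sense.instance_hypernyms():        # named instance, e.g. 'vienna', 'mozart'
        return ARTEFACT in ancestors(sense)   # visible only if an artefact ('eiffel tower', ...)
    return PHYSICAL in ancestors(sense)   # concepts: visible if rooted in physical_entity

for tag in ['flower', 'vienna', 'eiffel tower', 'holiday']:
    print(tag, guesses_visible(tag))
```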
4.3 Algorithm Evaluation
WN-Algorithm's ratings were compared with Rater 1's ratings on Set A. An overview of the results is given in Table 1. The agreement between the implemented algorithm and Rater 1's ratings is Φ = 0.55 (p < .01). Out of 3862 (picture, tag) pairs, WN-Algorithm agreed with Rater 1 on 3290 pairs (84.9%) and disagreed on 572 pairs. Set into the context of the baselines given by the inter-rater reliability and the optimal algorithm, the good performance can easily be recognised: where WN-Algorithm and Rater 1 agree on 84.9% of the pairs, the two raters agree on 92.6%. The difference to the optimal algorithm, which achieves an agreement of 98.0%, is slightly higher, which in our opinion shows the potential for personalisation. A personalised algorithm would have to take mindset, perception and the personal basic level into account (see Section 5 for a detailed description).

Table 1. The implemented algorithm's agreement with Rater 1's ratings on Set A in comparison to the inter-rater reliability and the agreement of the optimal algorithm with Rater 1's ratings on Set A

                              WN-Algorithm      Rater 2 (inter-rater reliability)   Optimal Algorithm
Agreement with Rater 1 [%]    84.9%             92.6%                               98.0%
Agreement with Rater 1 [Φ]    0.55 (p < .01)    0.77 (p < .01)                      0.94 (p < .01)

In order to obtain indications on the generalisability of these results, the WordNet categories used by the WN-Algorithm were mapped to the WordNet
categories used by Sigurbjörnsson and van Zwol in [7] on a snapshot of the Flickr database of approximately 52 million photos. A comparison shows that the proportions of WordNet categories in the two studies are similar. This indicates that our data set is representative concerning the distribution of tags over the underlying knowledge structure WordNet, and that our results can be generalised to other representative data sets taken from Flickr.
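The paper does not state exactly how tags were bucketed into "WordNet categories"; one plausible operationalisation, used here purely for illustration, is WordNet's lexicographer files (lexnames):

```python
from collections import Counter
from nltk.corpus import wordnet as wn

def category_distribution(tags):
    """Bucket tags by the lexicographer file ('lexname') of their most frequent
    noun sense, e.g. 'noun.plant', 'noun.location'. This is only one possible
    reading of 'WordNet categories'; the mapping used in [7] is not spelled out here."""
    counts = Counter()
    for tag in tags:
        synsets = wn.synsets(tag.lower().replace(' ', '_'), pos=wn.NOUN)
        counts[synsets[0].lexname() if synsets else 'unknown'] += 1
    return counts

print(category_distribution(['flower', 'garden', 'wannsee', 'holiday', 'wow']))
```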
5 Why Is It Difficult to Decide What a Picture Shows?
The example in Fig. 1 illustrates that deciding which tags are descriptive of a picture's content is more difficult than it seems at first glance. The algorithm described above simply determines that "Figure 1 shows a flower, a garden and a hollyhock". In addition to this, Rater 1 identified the tags "Malve" and "Garten", the German words for hollyhock and garden respectively. The other ratings coincide. Additionally, it could be conjectured that "Haus Liebermann" is the name of the house in Fig. 1. Depending on the knowledge of the rater (human or machine), the judgement would then differ.
In a detailed manual analysis we studied the (picture, tag) pairs on which Rater 1 and Rater 2, or Rater 1 and WN-Algorithm, disagreed. For space reasons, detailed references to the underlying test data are not given. The following list of reasons for disagreement between (human and algorithmic) raters is the condensed output of the analysis that was carried out. These reasons for disagreement illustrate at the same time the inherent difficulty in distinguishing descriptive from non-descriptive tags given a (picture, tag) pair.
Definitional Disagreement. Two raters have a different definition of what "an object" on a picture is. This is a foundational source of disagreement, since the problem we wanted an algorithm to solve was exactly to decide which tags describe objects visible on a picture. Examples of this are the tags "agriculture" or "feeling stitchy" (which was a writing embroidered on linen on the picture in question), on which the two human raters disagreed.
Difference in Perception. An object might have been overlooked, such as a spider web (the picture shows a flower with a butterfly, and a barely visible spider web: http://flickr.com/photos/18718027@N00/2631263572).
Difference in Knowledge. This may happen with technical terms, uncommon English terms or terms in an unknown foreign language. Examples are the terms "mangrove", "hydrangea" or "jetty". This error can also be made by algorithmic raters, e.g. if the used background knowledge is not specific enough or does not contain terms of a specific language. We observed that in cases of lack of knowledge, or lack of confidence to agree to a specific term, human raters tended to rate a tag as not describing an object shown on the picture. This might be an argument in later stages for emphasising the minimisation of false positive rating decisions.
Difference in Mindset. Due to a different mindset (educational background, culture, personal opinion, etc.), the interpretation of a tag can differ between raters. A rater might disagree with some tag, such as calling a ferret a pet (the picture shows a ferret: http://flickr.com/photos/77651361@N00/2631585847). Especially with artistic pictures, which often show surreal or imaginary scenes, the interpretation of a picture can also differ between raters, up to the point where it becomes impossible to agree on specific objects that are shown on a picture.
Disambiguation. When a tag can have multiple meanings, a rater needs to disambiguate first before a decision can be made. Raters may disagree over the meaning they assign to a tag. In general, human raters use picture context as well as a vast amount of background knowledge to disambiguate, whereas an algorithmic rater disambiguates according to strict rules and only limited background knowledge.
Algorithmic Limitations. Finally, there are erroneous decisions because of inherent limitations of the used algorithm. An algorithm that is based on some structured background knowledge is also dependent on this structure. Where the structure cannot make a difference, the algorithm cannot either (except via precise case-by-case rules / stop- and go-wordlists). In our work, for instance, the central limitation is that the algorithms are tag-based, i.e. they decide for one tag regardless of the picture. The algorithm therefore disagrees with a rater who takes the picture content into account when a tag in principle describes a visible object, such as "hotel", but the corresponding picture does not show a hotel (this picture shows the ocean, a piece of beach and a bird, but not a hotel: http://flickr.com/photos/26079103@N00/2630745505).
6 Conclusion
What does this picture show? This question is relevant for automated processing of large repositories of tagged pictures. A simple application would be tag-based retrieval of pictures showing similar content for instance. We investigated the question of deciding which of a picture’s tags describe visible entities depicted on it from multiple angles. First, we studied the proportion of such descriptive tags over non-descriptive tags. On our test data it has turned out that only approximately 20% of all given tags relate to the directly visible picture content. We further explored the problems involved with answering such a question for human raters, and investigated retest and inter-rater reliability. A qualitative analysis led to the insight that identifying descriptive tags reliably is not a clear-cut task even for humans, because of differences in definition, knowledge, mindset, perception and disambiguation. With the goal of devising an algorithm which is able to reliably identify the “shows”-relation between a tag and a picture, we further investigated the feasibility of tag-based algorithms, which should base their decisions solely on the tags. Given a positive result, namely that within a given platform the usage pattern of one tag 8 9
is consistent enough to allow for tag-based decisions, we implemented a proof-of-concept solution which agrees with a human rater in ≈85% of the cases. We argue that this result is satisfactory compared with the results from retest reliability (95.1%) and inter-rater reliability (92.6%). Furthermore, generalisability of our results can be assumed on the one hand because of the high inter-rater reliability between the two raters, and on the other hand because of the similar tag distribution with respect to WordNet categories of Set A when compared to a much larger dataset described in [7]. Future work includes the investigation of alternative knowledge structures on which a tag-based algorithm can rely, such as OpenCyc [11]. Another open question is whether it makes sense to personalise such an algorithm and how personalisation could be incorporated into the general algorithm design. In order to generalise our approach to arbitrary resources, such as for instance web bookmarks, more research is necessary.
References
1. Golder, S.A., Huberman, B.A.: Usage patterns of collaborative tagging systems. Journal of Information Science 32(2), 198–208 (2006)
2. Mika, P.: Ontologies are us: A unified model of social networks and semantics. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 522–536. Springer, Heidelberg (2005)
3. Schmitz, P.: Inducing ontology from Flickr tags. In: Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland (May 2006)
4. Rattenbury, T., Good, N., Naaman, M.: Towards automatic extraction of event and place semantics from Flickr tags. In: SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference, pp. 103–110. ACM Press, New York (2007)
5. Volkmer, T., Thom, J.A., Tahaghoghi, S.M.M.: Modeling human judgment of digital imagery for multimedia retrieval. IEEE Transactions on Multimedia 9(5), 967–974 (2007)
6. Bechhofer, S., Carr, L., Goble, C.A., Kampa, S., Miles-Board, T.: The semantics of semantic annotation. In: Meersman, R., Tari, Z., et al. (eds.) CoopIS 2002, DOA 2002, and ODBASE 2002. LNCS, vol. 2519, pp. 1152–1167. Springer, Heidelberg (2002)
7. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Huai, J., Chen, R., Hon, H.W., Liu, Y., Ma, W.Y., Tomkins, A., Zhang, X. (eds.) WWW 2008, pp. 327–336. ACM, New York (2008)
8. Cohen, J., Cohen, P., West, S.G.: Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Mahwah (2003)
9. Fellbaum, C.: A semantic network of English: The mother of all WordNets. Computers and the Humanities 32, 209–220 (1998)
10. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
11. OpenCyc: http://www.opencyc.org/ (last visited: October 31, 2008)
Statement-Based Semantic Annotation of Media Resources
Wolfgang Weiss1, Tobias Bürger2, Robert Villa3, Punitha P.3, and Wolfgang Halb1
1 Institute of Information Systems, JOANNEUM RESEARCH Forschungsges. mbH, Graz, Austria [email protected]
2 Semantic Technology Institute (STI), Innsbruck, Austria [email protected]
3 University of Glasgow, UK {villar,punitha}@dcs.gla.ac.uk
Abstract. Currently the media production domain lacks efficient ways to organize and search for media assets. Ontology-based applications have been identified as a viable solution to this problem; however, they are sometimes too complex for non-experienced users. We present a fast and easy-to-use approach to create semantic annotations and relationships of media resources. The approach is implemented in the SALERO Intelligent Media Annotation & Search system. It combines the simplicity of free-text tagging with the power of semantic technologies and thereby makes a compromise in the complexity of full semantic annotations. We present the implementation of the approach in the system and an evaluation of different user interface techniques for creating annotations. Keywords: Semantic annotation, information storage and retrieval, semantic media asset management.
1 Introduction
The management of media resources in media production is a continuous challenge due to growing amounts of content. Because of the well-known limitations of automatic annotation, manual annotation of media is still required. We present a statement-based semantic annotation approach which allows fast and easy creation of semantic annotations of media resources. The approach is implemented in the Intelligent Media Annotation & Search (IMAS) system (http://salero.joanneum.at/imas/), which is being developed within the European project SALERO (http://www.salero.eu). An integral part of the work being done in SALERO is the management of media objects with semantic technologies, which is addressed by the IMAS system by enabling their semantic annotation and retrieval. The use of semantic technologies reduces the problem of ambiguity
in search by using existing, well-defined vocabularies, it allows us to do query expansion, and it helps to deal with multilinguality. During prototypical development iterations of our system we have experienced that most paradigms applied in semantic annotation tools are not suitable for inexperienced users, who are typically used to keyword-based tagging and suffer from information overload when confronted with complex annotation tasks and user interfaces. Our aim was thus to develop an approach which is faster and easier to use for our targeted user group, while making a compromise in the complexity of full semantic annotations. Besides describing the content of each media resource, the approach allows media resources to be related to other media resources. In the following we present the IMAS system and the implemented annotation approach. The system is based on the principles described in [1]. These principles describe methodologies to support users in the process of manual semantic annotation, including (i) the selection of adequate ontology elements and (ii) the extension of ontologies at annotation time. Furthermore, we present an evaluation of different user interface techniques for creating annotations. The remainder of this paper is organized as follows: firstly, we present the IMAS system (Section 2). Secondly, we present the statement-based semantic annotation approach and its evaluation (Section 3). Then we situate our work with respect to related work in the area (Section 4) and conclude the paper with a summary and an outlook on future work (Section 5).
2 System Description and Design Principles
The main aim of the IMAS is to allow easy annotation of media assets for later retrieval and reuse by users in media production. In order to support this, it has been built based on the following design principles:
1. Designed for content creators. The target users of the system are non-technically experienced content creators in the domain of media production.
2. Easy to use. The interface provides Web 2.0-based interaction mechanisms to make the annotation process as easy as possible.
3. Global annotations. To facilitate the annotation process, we only allow global annotation of media resources instead of annotating parts of them.
4. Statement-based annotation process. We allow users to create statements, which use ontological elements, to describe the content of media resources.
5. Ontology extension during use. We allow users to easily extend the ontology during use, based on the principles described in [1].
6. Portability of the system. In order to port the system to other domains, only the underlying annotation ontology has to be adapted.
7. Integration of semantic and content-based search. The system provides an integrative view onto results from different search engines and thereby provides a fallback solution which is able to retrieve objects without annotations too.
Fig. 1. IMAS System Overview
The IMAS integrates two systems whose functionalities are offered as a set of Web services, i.e. the Semantic Services and the Content-based Services. The architecture of the IMAS system is shown in [Fig. 1].
2.1 Semantic Services
IMAS is realized on top of the SALERO Semantic Workbench (cf. [2]), which not only offers a graphical user interface to engineer ontologies but also a set of services which provide ontology management functionality to other applications. Most notably, the services offer persistent storage of ontologies and annotations and the retrieval of stored information.
2.2 Content-Based Services
The Content-based Services offer functionality for the indexing and retrieval of image, video and textual information. The aim of these services is to complement the semantic search services and to act as a fall-back system in cases where material is not indexed by the semantic services. As such, their emphasis is on automatic indexing techniques, which can be used to retrieve images, text or video without manual annotation. Textual information is indexed using standard Information Retrieval techniques (the Terrier system is used [3]); image and video data are indexed by extracting low-level visual features based on the MPEG-7 standard, as currently implemented in the ACE toolbox [4].
2.3 Search System
To search for media objects, the following input options are available to create a query: (i) free text, (ii) semantic concepts, (iii) statements, and (iv) images. Free text search is executed in both the Semantic Services and the Content-based Services. The concept-based and statement-based search is expanded in the Semantic Services. Via the exemplary images, a query is submitted to the Content-based Services. The results of both systems are integrated in a late-fusion fashion. In our system, a round robin mechanism combined with a polling-based result fusion technique is adopted to fuse results. Further details of the backend services can be found in [5,6,7].
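As an illustration of the round-robin part of this late fusion, the sketch below (ours, not the IMAS implementation) interleaves two ranked result lists while dropping duplicates; the polling-based details used in IMAS are not described here and are therefore omitted.

```python
from itertools import zip_longest

def round_robin_fuse(*ranked_lists):
    """Interleave ranked result lists (best first), dropping duplicates.
    A sketch of late fusion by round robin over several search engines."""
    fused, seen = [], set()
    for rank_slice in zip_longest(*ranked_lists):
        for item in rank_slice:
            if item is not None and item not in seen:
                seen.add(item)
                fused.append(item)
    return fused

semantic_hits = ['img7', 'img2', 'img9']          # hypothetical result IDs
content_hits = ['img2', 'img5', 'img9', 'img1']
print(round_robin_fuse(semantic_hits, content_hits))
# ['img7', 'img2', 'img5', 'img9', 'img1']
```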
3 Statement-Based Annotation
The IMAS end user application is an integrated Web-based application which can be used to annotate and search for media objects. As illustrated in [Fig. 1], it consumes functionality of (i) the Semantic Services, which are used to add, update and delete semantic annotations as well as to search the annotation repository, and (ii) the Content-based Services, which are used to retrieve media resources based on their intrinsic features such as colour, histograms or shapes.
3.1 Usage
The application allows the annotation of arbitrary resources which are stored in preconfigurable media repositories. In order to ease the annotation process for our target user group, media resources are annotated globally instead of region- or segment-based. Media resources are annotated by creating statements and by relating them to other media resources. Annotation statements contain semantic elements which are defined in the annotation ontology (see also Sec. 3.2). The annotation statements are formalized according to the annotation ontology and represent natural-language-like statements about the content of the media resource. Statements are of the form <Concept isRelatedTo {Concept1 ... Conceptn}>, i.e. triples where a concept can be set in relation to other concepts. Using statements with semantic elements is a compromise in complexity between loose and fully semantically described annotations. [Fig. 2] illustrates statements with an example image from the Tiny Planets universe (http://www.mytinyplanets.com/). To create such statements, three different input options are available, as shown in [Fig. 3]: (1) combining concepts via drag-and-drop, (2) selecting concepts consecutively, and (3) using the text box as a command line interface in the spirit of [8] with auto-completion. Input option three is optimally suited for frequent users, while input options one and two are ideal for users who rarely create annotations.
Fig. 2. Example of Annotation Statements
Fig. 3. Creation of Annotation Statements
Fig. 4. Creation of relationships
An additional possibility to annotate the content of media resources in the IMAS system is to relate them to each other. Hereby, we can describe that one media resource is, for instance, a part, a revision, or a transformation of another media resource. This allows us to use statements of the related media resources, to keep track of revisions of the media resources, or to suggest alternatives in the search result. The relationship ontology with its properties is described in Sec. 3.2. To create relationships (see also [Fig. 4]) for selected media resources (1), the user drags a source media resource from the file browser (2) and drops it on the desired relationship (e.g. the revision) in the relationship panel (3). This action creates the following annotation (4): <SourceResource is a revision of TargetResource>.
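To make the statement and relationship annotations more tangible, the following sketch shows how such triples could be written down with rdflib. The FRBR namespace is the one quoted in Sec. 3.2 and the Annotea schema is the one referenced there as well; all other URIs and the exact triple structure are placeholders of this sketch, since the concrete SALERO ontology URIs are not given in the paper.

```python
from rdflib import Graph, Namespace, URIRef

FRBR = Namespace("http://purl.org/vocab/frbr/core#")        # quoted in Sec. 3.2
ANNOTEA = Namespace("http://www.w3.org/2000/10/annotation-ns#")
EX = Namespace("http://example.org/salero/")                 # placeholder namespace

g = Graph()
g.bind("frbr", FRBR)

picture = URIRef("http://example.org/media/scene42.png")     # hypothetical media resources
draft = URIRef("http://example.org/media/scene42_draft.png")
bing, book = EX.Bing, EX.Book

# Schematic rendering of the statement "<Bing isRelatedTo {Book}>" attached to the picture.
g.add((picture, ANNOTEA.related, bing))
g.add((bing, EX.isRelatedTo, book))

# "<SourceResource is a revision of TargetResource>", as in the IMAS example above.
g.add((draft, FRBR.revision, picture))

print(g.serialize(format="turtle"))
```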
3.2 Ontologies
To represent the semantic annotations we developed two ontologies according to our needs. The first one is the SALERO annotation ontology (cf. [Fig. 5]), for describing media resources, authors of media resources, projects in which they are used, and annotations. The property annotea:related, derived from the
Fig. 5. The SALERO annotation ontology
Annotea Annotation Schema (http://www.w3.org/2000/10/annotation-ns), is used to describe the annotation statements. The subclasses of Annotation and AnnotationConcept form the domain-specific parts of the ontology, which have to be adapted if someone wants to use the annotation tool in a different domain. The subclasses of Annotation describe media-specific annotation types such as annotations of audio, image or video. The subclasses of AnnotationConcept represent the domain-specific part of the annotation ontologies and are used to describe the content of media resources. They currently include the following classes:
– Character: The actors of a movie, e.g. Bing and Bong.
– CharacterParts: Parts of the body of the characters, e.g. Hand, Nose, or Toe.
– Expression: Includes verbs and adjectives to describe the behaviour of characters, e.g. smiling, dancing, open, or wet.
– Location: A geographical or imaginary location, e.g. Barcelona or Outer Space.
– Object: A tangible or visible entity, e.g. Balloon, Umbrella, or Cake.
The scope of the second ontology is to describe relationships between media resources and how they are most probably derived from or based on each other. The relationships of the Functional Requirements for Bibliographic Records (FRBR, http://www.ifla.org/VII/s13/wgfrbr/index.htm) model provide a solid ground to describe the relationships of media resources in the media production domain in general and in the SALERO project in particular. The relationships are supposed to enhance the browsing experience of media professionals in their media collections. The relationships of our ontology were chosen based on a general analysis of the domain based on related work, an analysis of images from the MyTinyPlanets collection, and a set of expert interviews. It contains a subset of the FRBR core ontology (http://vocab.org/frbr/core.html) and two additional relationships which were not covered by the FRBR core ontology (frbr: http://purl.org/vocab/frbr/core#) (cf. [9,10]):
– frbr:adaptation. A media resource, which is based on another media resource, exchanges parts of the original content, e.g. in a new image a tree is exchanged for a space ship.
– frbr:alternate. An alternative (file) format of the same content, e.g. jpg instead of png.
– frbr:imitation. A real-world scene is imitated in a cartoon, e.g. a scene of "Star Wars" is imitated by Bing & Bong in "Tiny Planets".
– frbr:part. A media resource contains a part of another media resource, e.g. the hands of the character Bing are a part of the final animation of Bing, supposing that the hands are modelled in a different file.
– frbr:reconfiguration. A rearrangement of the content of a media resource, e.g. a new scene is based on an existing scene with the same content such as trees, space ships, characters. In the new scene the content is locally rearranged.
– frbr:revision. A newer or other version of a media resource, e.g. a new version of a space ship, which is based on an existing space ship, uses a different texture for its surface.
– frbr:transformation. From a sketch to a (3-d) model, e.g. the first step in creating a 3-d model is to draw a sketch on a sheet of paper. The next step is to create a 3-d model in the computer, which is a transformation of the sketch.
– frbr:translation. A translation into a different language, e.g. the embedded text and audio of a clip are translated from English to French.
– duplicate. For duplicated media resources, e.g. the same file is stored at a different location.
– version. The super property for frbr:adaptation, frbr:reconfiguration, frbr:revision and frbr:transformation.
The property relationships is the super property for all relationship properties; its range and domain are MediaResource.
3.3 Usability Test
An initial evaluation of the annotation aspect of the IMAS has already been carried out. Our aim was (i) to find major obstacles in the annotation user interface as well as in the annotation process and (ii) to compare the semantic statement-based approach with other existing approaches.
Evaluation Methodology: We recruited 9 participants from our institute who were not involved in the project or in the realm of semantic annotation. The subjects ranged in age from 25 to 40 and all are software developers. We created two groups. The first group of five participants compared the annotation process of the IMAS application with the desktop application Google Picasa (http://picasa.google.com/). In the second group we measured the time needed for creating annotations with (i) IMAS, (ii) a free text tagging approach, similar to Flickr (http://www.flickr.com/), and (iii) creating fully semantic annotations with PhotoStuff [11]. For later analysis we made screen captures of the user actions, and conspicuous behaviour was noted by the observer. Furthermore, the usability test included a short user questionnaire in which the participants had to answer the following questions:
– What are the positive and negative impressions of each tool?
– What was the most difficult task?
– Which tool would you use if you had to annotate 1000 images?
Before the test began, each subject got an introduction to the usability test and to the different annotation applications. Then the users had to annotate the test data with each application. The test data consisted of ten images from the Tiny Planets universe (http://www.mytinyplanets.com/) together with descriptive text for each image. For example,
the descriptive text for [Fig. 2] is: "Bing has a bag on his shoulder and a book in his hands. Bong and alien are smiling."
Results: The success rate for completing the annotation tasks with IMAS is 93%. The reasons why not all participants were able to successfully complete all tasks are (i) user interface and design problems and (ii) that the semantic statement-based approach was not clear to every participant and thus produced wrong or incomplete annotations. A successful example, done by a participant with IMAS, includes the following statements for the image in [Fig. 2]:
– Bing is related to: Shoulder bag, Shoulder
– Bing is related to: Book, Hand
– Bong is related to: smiling
– Alien is related to: smiling
and the following tags created with Google Picasa: "alien, bing, bong, book, hand, shoulder bag on his shoulder, smiling". These examples describe the content of the image well, according to its descriptive text, and fulfil our quality requirements for a successful annotation.
Fig. 6. Comparison of task completion times (bar chart of task completion time in seconds for the free text Tag Tool, IMAS, and PhotoStuff)
[Fig. 6] shows the time measurements for completing the tasks where the statement-based approach is compared to a free text tagging approach and to creating fully semantic annotations with PhotoStuff. Creating annotations with PhotoStuff was the most time consuming approach (median 60s; mean 69.9s for
creating the annotations of a single media resource). The subjects complained that they had to carry out long-winded, recurrent tasks, such as selecting concepts and manually creating instances. The fastest approach was the simple free text tagging approach (median 19s; mean 18.6s), although the subjects asked for a system with an auto-completion feature. Task completion time using IMAS with the statement-based approach ranks between the two other systems, with a median of 30s and a mean of 36.3s. We observed that most subjects first used the concept tables (see also [Fig. 3] and Sec. 3.1) and that, after annotating approximately three media resources, the subjects tended to use only the command line interface with auto-completion. The questionnaire revealed the following facts: the users liked the auto-completion feature, as the following user statement demonstrates: "I highly appreciate that the system suggests concepts to me I can use." In the subjects' opinion this feature helps to efficiently create specific, suitable annotations. Furthermore, this was a crucial reason why for 8 out of 9 participants IMAS is the first choice for annotating 1000 media resources. One participant prefers to use a simple free text tagging approach. On the other hand, the participants also made suggestions to improve the annotation process: "A copy and paste functionality of concepts and statements would be fine." "Existing annotations should be editable." The users also complained that the text box does not work as expected in some situations, e.g. when pressing the space bar to select a concept.
4 Related Work
The organisation, classification, and retrieval of media objects is an ongoing challenge in games and media production. Semantic technologies have been identified as a viable solution to overcome some of the problems in this area [2]. A wide range of multimedia annotation tools [12,13] already offer functionality to attach ontological annotations to (parts of) the multimedia content, and some offer reasoning services on top of them to semi-automatically create annotations based on existing annotations. The K-Space Annotation Tool [14] provides a framework around the Core Ontology for Multimedia for efficient and rich semantic annotations of multimedia content. PhotoStuff [11] allows the use of any ontology for the annotation of images and is available as a standalone desktop application. A Web-based demonstrator for browsing and searching with very limited functionality is also available. Imagenotion [15] already provides an integrated environment for the collaborative semantic annotation of images and image parts. User tests showed that the use of standard ontologies and tools is not generally suitable, which led to the development of a method where ontologies consist of imagenotions that graphically represent a semantic notion through an image.
A first step in closing the semantic gap between low-level signal processing and high-level semantic descriptions is to use a multimedia ontology infrastructure. The Core Ontology on Multimedia (COMM) [16] aims to provide a high-quality multimedia ontology that is compatible with existing (semantic) Web technologies. The MPEG-7-based multimedia ontology makes it possible to describe algorithms and digital data, decompositions, region locations, content and media annotations, semantic annotations and example annotations. The objective of the AIM@SHAPE [17] ontology for virtual humans is to provide a semantic layer to reconstruct, store, retrieve and reuse content and knowledge of graphical representations of humans. The ontology allows the representation of, amongst other things, shapes, parts of bodies, animations and emotions. The aim of the W3C Annotea [18] project is to enhance collaboration via shared metadata based on Web annotations, bookmarks, and their combinations. The project encourages users to create Annotea objects, including annotations, replies, bookmarks and topics. For this purpose, an RDF schema can be used which defines all necessary properties.
5 Conclusion and Outlook
In this paper we have presented a semantic statement-based approach for fast and easy annotation of media resources in the realm of media production. This concept is implemented in the Intelligent Media Annotation & Search application (http://salero.joanneum.at/imas/), which also allows relationships between media resources to be created. Creating annotations using statements with semantic elements is a compromise in complexity and expressiveness between loose and full semantic descriptions. We have developed two ontologies to store annotations and relationships. The system integrates semantic and content-based search to provide a fall-back and an alternative retrieval system. The initial usability test has shown that the approach of semantic statement-based annotation is not as fast as free text tagging but much faster than creating full semantic annotations with PhotoStuff. It is planned to explore a more sophisticated fusion mechanism for the search system. Thematic browsing through the content, based on the ontologies, would also be feasible. An important item is a further evaluation of IMAS, especially of the search system in combination with the annotations, to determine the limitations of this approach.
Acknowledgements
The research leading to this paper was partially supported by the European Commission under contract "IST-FP6-027122" (SALERO). Images from "My Tiny Planets" are by courtesy of Peppers Ghost Productions (http://www.peppersghost.com).
The authors would like to thank their colleagues who contributed to the implementation of the components described here, especially Georg Thallinger, Gert Kienast, Philip Hofmair, Roland Mörzinger and Christian Ammendola.
References
1. Bürger, T., Ammendola, C.: A user centered annotation methodology for multimedia content. In: Poster Proceedings of ESWC 2008 (2008)
2. Bürger, T.: The need for formalizing media semantics in the games and entertainment industry. Journal for Universal Computer Science, JUCS (June 2008)
3. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A High Performance and Scalable Information Retrieval Platform. In: Proc. of ACM SIGIR 2006 Workshop on Open Source Information Retrieval, OSIR (2006)
4. O'Connor, N., Cooke, E., le Borgne, H., Blighe, M., Adamek, T.: The ace-toolbox: low-level audiovisual feature extraction for retrieval and classification. In: 2nd IEE European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, pp. 55–60 (2005)
5. Punitha, P., Jose, J.M.: Topic prerogative feature selection using multiple query examples for automatic video retrieval. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, Boston (in press, 2009)
6. Punitha, P., Jose, J.M.: How effective are low-level features for video retrieval (submitted for review) (2009)
7. Weiss, W., Bürger, T., Villa, R., Swamy, P., Halb, W.: SALERO Intelligent Media Annotation & Search. In: Proceedings of I-SEMANTICS 2009 - International Conference on Semantic Systems, Graz, Austria (2009)
8. Raskin, A.: The Linguistic Command Line. In: ACM Interactions, January/February 2008, pp. 19–22 (2008)
9. IFLA Study Group: Functional requirements for bibliographic records, final report (February 2009), http://www.ifla.org/files/cataloguing/frbr/frbr_2008.pdf
10. Riva, P.: Introducing the functional requirements for bibliographic records and related IFLA developments. In: ASIS&T - The American Society for Information Science and Technology (August/September 2007)
11. Halaschek-Wiener, C., Golbeck, J., Schain, A., Grove, M., Parsia, B., Hendler, J.A.: PhotoStuff - An Image Annotation Tool for the Semantic Web. In: Poster Proceedings of the 4th International Semantic Web Conference (2005)
12. Obrenovic, Z., Bürger, T., Popolizio, P., Troncy, R.: Multimedia semantics: Overview of relevant tools and resources (September 2007), http://www.w3.org/2005/Incubator/mmsem/wiki/Tools_and_Resources
13. Simperl, E., Tempich, C., Bürger, T.: Methodologies for the creation of semantic data. In: Sicilia, M.A. (ed.) Handbook of Metadata, Semantics and Ontologies. World Scientific Publishing Co., Singapore (2009)
14. Saathoff, C., Schenk, S., Scherp, A.: KAT: The K-Space Annotation Tool. In: Proceedings of the SAMT 2008 Demo and Poster Session (2008)
15. Walter, A., Nagypál, G.: IMAGENOTION - Collaborative Semantic Annotation of Images and Image Parts and Work Integrated Creation of Ontologies. In: Proceedings of 1st Conference on Social Semantic Web (CSSW), pp. 161–166 (2007)
16. Arndt, R., Troncy, R., Staab, S., Hardman, L., Vacura, M.: COMM: Designing a Well-Founded Multimedia Ontology for the Web. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 30–43. Springer, Heidelberg (2007)
17. Garcia-Rojas, A., Vexo, F., Thalmann, D., Raouzaiou, A., Karpouzis, K., Kollias, S.: Emotional body expression parameters in virtual human ontology. In: Proceedings of 1st Int. Workshop on Shapes and Semantics, Matsushima, Japan, June 2006, pp. 63–70 (2006)
18. Koivunen, M.R.: Annotea and semantic web supported collaboration. In: Proceedings of UserSWeb: Workshop on End User Aspects of the Semantic Web, Heraklion, Crete (2005)
Large Scale Tag Recommendation Using Different Image Representations

Rabeeh Abbasi, Marcin Grzegorzek, and Steffen Staab

ISWeb - Information Systems and Semantic Web, University of Koblenz-Landau, Universitätsstrasse 1, 56070 Koblenz, Germany
{abbasi,marcin,staab}@uni-koblenz.de
Abstract. Nowadays, geographical coordinates (geo-tags), social annotations (tags), and low-level features are available in large image datasets. In our paper, we exploit these three kinds of image descriptions to suggest possible annotations for new images uploaded to a social tagging system. In order to compare the benefits each of these description types brings to a tag recommender system on its own, we investigate them independently of each other. First, the existing data collection is clustered separately for the geographical coordinates, tags, and low-level features. Additionally, random clustering is performed in order to provide a baseline for experimental results. Once a new image has been uploaded to the system, it is assigned to one of the clusters using either its geographical or low-level representation. Finally, the most representative tags for the resulting cluster are suggested to the user for annotation of the new image. Large-scale experiments performed for more than 400,000 images compare the different image representation techniques in terms of precision and recall in tag recommendation.
1 Introduction
With the explosive growth of the Web and recent developments in digital media technology, the number of images on the Web has grown tremendously. Online photo services such as Flickr and Zooomr allow users to share their pictures with family, friends, and the online community at large. An important functionality of these services is that users manually annotate their pictures using so-called tags, which describe their contents or provide additional contextual and semantic information. Tags are used for navigating, finding, and browsing resources and thus provide an immediate benefit for users. In practice, however, users often tag their pictures fully manually, which is time-consuming and therefore inconvenient and expensive. For this reason, it is important to automate this process by developing so-called tag recommender systems that assist users in the tagging phase. Although this research area has now been active for a couple of years [1,2,6,7], the existing recommendation strategies are preliminary and their performance for generic scenarios rather moderate. The basic idea of recommending tags for a new image is to reuse tags assigned to similar images which have been stored in
the data collection before. One of the most challenging problems here is to find those similar images in a large-scale photo collection. Early approaches aimed to solve this image retrieval problem exclusively with low-level features [13], which turned out to be almost impossible in generic environments at large scale. Their performance was acceptable only for certain domain-specific applications such as content-based medical image retrieval [12]. Nowadays, state-of-the-art imaging devices provide pictures together with geographical coordinates (geo-tags) stating precisely where they have been acquired. Therefore, more and more researchers make use of this additional information when designing tag recommender systems, with quite promising results [4,8]. Most recent tag recommendation approaches combine different image description types (geo-tags, tags, low-level features) in order to achieve reasonable results [9,11]. However, one can observe a lack of research activities comparing the benefits each of these description types brings to a tag recommender system on its own. In our paper, we exploit these three kinds of image descriptions independently of each other to suggest possible annotations for new images uploaded to a collaborative tagging system. First, we cluster the existing large-scale data collection separately for the geo-tags, tags, and low-level features. Additionally, we perform random clustering in order to provide a baseline for the experimental results. Once a new image has been uploaded to the system, it is assigned to one of the clusters using either its geographical or its low-level representation. Finally, the most representative tags of the resulting cluster are suggested to the user for annotation of the new image. Large-scale experiments performed for as many as 413,848 images compare the different image representation techniques in terms of precision and recall in tag recommendation. The paper is structured as follows. Section 2 gives an overview of our tag recommendation system. Section 3 briefly explains the content description methods (features) used in social media, especially those used in our framework. Section 4 explains how image annotations are generated in our framework. In Section 5, we describe the dataset used for the experiments and evaluate tag recommender systems of the proposed architecture at large scale. The tests compare the different image representation techniques in terms of precision and recall in tag recommendation. Section 6 concludes our investigations and the results presented in this paper.
2 System Overview
We split the overall system for tag recommendation into two parts: training and tag recommendation. The system is trained on the image features available in social media; once trained, it is used for recommending tags for new images. The training and tag recommendation phases are briefly described below.

Training: In the training phase, images are clustered based on their features. A cluster contains images that are homogeneous with respect to the type of features
used for clustering. In this work, we considered geographical coordinates, low-level image features and tags as image features. As an example, a cluster based on geographical coordinates might represent the images taken at a particular location, a cluster based on low-level features might contain images showing buildings or a beach, and a cluster based on tagging data might represent concepts like concert or river. The clustering process used in this work is described in Section 4.1. The representative tags of a set of homogeneous images (i.e. images in a cluster) are used to annotate new images. The method for identifying representative tags is described in Section 4.2.

Tag Recommendation: To recommend tags for a new image, we map the image to its closest cluster and assign the representative tags of that cluster to the new image. The methods for classifying an image to its closest cluster and recommending tags are described in Section 4.3. A sketch of this two-phase workflow is given below. In the following section, we describe the features that we use in our experiments, which are also available in folksonomies at a large scale.
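The two-phase structure can be summarised in a short Python sketch (illustrative only; the helper callables cluster_images, representative_tags and closest_cluster are hypothetical placeholders for the concrete methods detailed in Section 4):

```python
# Illustrative skeleton of the training / tag recommendation split described
# above. The three helpers are hypothetical placeholders for the concrete
# methods of Section 4 (K-Means clustering, top-s tag ranking, classification).

class TagRecommender:
    def __init__(self, cluster_images, representative_tags, closest_cluster):
        self.cluster_images = cluster_images              # Sec. 4.1
        self.representative_tags = representative_tags    # Sec. 4.2
        self.closest_cluster = closest_cluster            # Sec. 4.3
        self.clusters = {}
        self.cluster_tags = {}

    def train(self, features, tags_per_image):
        """Training: cluster the images and store each cluster's top tags."""
        self.clusters = self.cluster_images(features)
        self.cluster_tags = {
            cid: self.representative_tags(members, tags_per_image)
            for cid, members in self.clusters.items()
        }

    def recommend(self, new_image_features, n_tags=5):
        """Recommendation: map the new image to its closest cluster."""
        cid = self.closest_cluster(new_image_features, self.clusters)
        return self.cluster_tags[cid][:n_tags]
```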
3 Features in Social Media
To analyze the effect of different types of features on the performance of tag recommendation, we use three different image features in our experiments, namely geographical coordinates (G), low-level image features (L), and tags (T). The details of these features are given below.

Geographical Coordinates: With the advancement of camera and mobile technologies, many devices are now available on the market that are able to capture the location of an image using a built-in or external sensor. In addition to capturing the location of an image with a GPS device, some folksonomies like Flickr allow users to add geographical coordinates to their images by providing a map interface where users can place their images on a map. Due to this ease of use, many images in Flickr are enriched with geographical information. In the CoPhIR dataset [3], around 4 million out of 54 million images are annotated with geographical coordinates. The number of geographically annotated images is expected to increase in the future as more devices become able to capture geographical coordinates. We represent the geographical coordinates of the images in a two-dimensional vector space G ∈ ℝ^2. Each row vector g_i of the feature space G represents the geographical coordinates of image i.

Low-level Image Features: There are five different types of low-level MPEG-7 features available in the CoPhIR dataset for its 54M images. Table 1 shows the properties and dimensions of these low-level features. Based on initial experimental results, we consider two low-level features for evaluation, the MPEG-7 Edge Histogram Descriptor (EHD) and Color Layout (CL), which outperformed the other available low-level image features. EHD represents the local edge distribution and CL represents the color and spatial information in
Table 1. Properties and dimensions of low-level features available in CoPhIR dataset

Low-level Feature     | Properties                     | Dims
Scalable Color        | Color histogram                | 64
Color Structure       | Localized color distributions  | 64
Color Layout          | Color and spatial information  | 12
Edge Histogram        | Local-edge distribution        | 80
Homogeneous Texture   | Texture                        | 62
the images. We represent the low-level image features based on EHD and CL in 80- and 12-dimensional feature spaces L_E ∈ ℝ^80 and L_C ∈ ℝ^12, respectively. A row vector i of the feature space L_E or L_C represents the edge histogram or color layout of image i, respectively.

Tags: Tags are freely chosen keywords associated with the images. There is no restriction on selecting a tag for an image. A tag might represent a concept in an image, describe the image itself, or represent the context of the image (e.g. location, event, time, etc.). On average, only a few tags are associated with each image: over the 54M images of the CoPhIR dataset, each image has on average 3.1 tags. We represent the tags of the images as an n_t-dimensional vector space T ∈ ℝ^{n_t}, where n_t is the number of tags in the dataset. A row vector t_i* of the vector space T represents a resource whose non-zero values indicate the tags associated with resource i. A column vector t_*j represents a tag vector whose non-zero values indicate the resources associated with tag j. A value t_ij represents the number of times resource i is associated with tag j. The images in all feature spaces are indexed in the same order: for an image i, the row vector g_i represents its geographical coordinates, l_i represents its low-level image features, and t_i represents the tags associated with the same image i.
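As an illustration of these feature spaces, the following sketch sets up matrices with the dimensionalities given above (the numbers of images and tags are made-up example values, not taken from the dataset):

```python
import numpy as np

n_images, n_tags = 10_000, 5_000   # illustrative sizes, not from the paper

# G: one (latitude, longitude) pair per image
G = np.zeros((n_images, 2))

# L_E / L_C: MPEG-7 Edge Histogram (80-d) and Color Layout (12-d) descriptors
L_E = np.zeros((n_images, 80))
L_C = np.zeros((n_images, 12))

# T: tag matrix; T[i, j] counts how often image i is associated with tag j,
# so row t_i* describes an image and column t_*j describes a tag
T = np.zeros((n_images, n_tags))

# All feature spaces index the images in the same order, so row i of G,
# L_E/L_C and T always refers to the same image i.
```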
4 Tag Recommendation
This section explains the proposed tag recommendation system in detail. In the training phase, the resources are first clustered (Sec. 4.1), and then the representative tags of each cluster are identified (Sec. 4.2). In the tag recommendation phase, a new resource is mapped to its closest cluster and the representative tags of that cluster are recommended for the new image (Sec. 4.3).
4.1 Clustering
Although many sophisticated clustering algorithms exist, the literature is still sparse when it comes to clustering high-dimensional and large datasets. We use the K-Means clustering algorithm in our experiments. K-Means is capable of
Input: Feature space F ∈ {G, L, T, D}, number of clusters k
Output: A set of k clusters
Method:
1. Randomly select k images from feature space F as the initial cluster centroids
2. Assign each image to the closest cluster
3. Update the cluster centroids
4. If the cluster centroids changed, repeat steps 2-3

Fig. 1. K-Means clustering algorithm
clustering very large and high-dimensional datasets. Of course, other clustering methods can also be employed in the framework if one wishes to fine-tune the performance or improve the results. The K-Means algorithm we used is described in Figure 1. In the following, we describe in detail how we set the different parameters for K-Means.

Number of clusters: There is no generally accepted rule for setting the number of clusters in K-Means. For our experiments we use the number of clusters suggested by Mardia et al. [10, page 365]. We define the number of clusters for n images as follows:

    k = sqrt(n / 2)    (1)

By using k as defined in the above equation, we get the same number of clusters for each feature space.

Initial Cluster Centroids: In K-Means clustering, the quality of the clustering also depends on the selection of the initial cluster centroids. For our experiments, k images are randomly selected. The same set of randomly selected images is used as the initial cluster centroids for each feature space. Selecting the same set of images for different feature spaces avoids an accidental advantage of one feature space over another based on the initial centroids.

Computing distance/similarity between resources: During the clustering process, each image is assigned to its closest cluster (Fig. 1, step 2). We need a distance measure to compute the distance between an image and its closest centroid. The most popular distance measure is the Euclidean distance [5, page 388]. The Euclidean distance between two m-dimensional vectors f and c is defined as follows:

    euclidean(f, c) = sqrt( Σ_{i=1}^{m} (f_i - c_i)^2 )    (2)
We use the Euclidean distance for non-text feature spaces (i.e. the geographical, low-level, and random feature spaces). For text (or tag) based feature spaces it is common
to use the cosine similarity [5, page 397]. We use the cosine similarity to compute the similarity between image tags (in feature space T) and cluster centroids. The cosine similarity between two m-dimensional vectors f and c is defined as follows:

    cosine(f, c) = (f^T · c) / (||f|| ||c||)    (3)
Experimental results show that the cosine similarity performs significantly better than the Euclidean distance for tag/text-based features. To compare different distance measures, we also evaluated the results with the Manhattan distance for non-text-based features; there was no significant improvement when using the Manhattan distance. The Manhattan distance between two m-dimensional vectors f and c is defined as follows:

    manhattan(f, c) = Σ_{i=1}^{m} |f_i - c_i|    (4)
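The distance measures (2)-(4) and the K-Means loop of Fig. 1, with k chosen according to Eq. (1), could be sketched as follows (a simplified illustration; converting the cosine similarity into a distance so that the same assignment step can be reused is an implementation choice of this sketch, not prescribed by the paper):

```python
import numpy as np

def euclidean(f, c):
    return float(np.sqrt(np.sum((f - c) ** 2)))          # Eq. (2)

def cosine_similarity(f, c):
    return float(f @ c) / (np.linalg.norm(f) * np.linalg.norm(c))  # Eq. (3)

def cosine_distance(f, c):
    return 1.0 - cosine_similarity(f, c)                 # turn (3) into a distance

def manhattan(f, c):
    return float(np.sum(np.abs(f - c)))                  # Eq. (4)

def kmeans(F, distance=euclidean, max_iter=100, seed=0):
    """K-Means as outlined in Fig. 1 for a feature space F (one row per image)."""
    n = F.shape[0]
    k = max(1, int(round(np.sqrt(n / 2))))                # Eq. (1), Mardia et al.
    rng = np.random.default_rng(seed)
    centroids = F[rng.choice(n, size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # step 2: assign each image to its closest centroid
        labels = np.array([np.argmin([distance(f, c) for c in centroids])
                           for f in F])
        # step 3: update the centroids (keep old centroid for empty clusters)
        new_centroids = np.array([F[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):          # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids
```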
4.2 Identifying Representative Tags
After clustering the images into k clusters, we identify the representative tags for each cluster; the most representative tags of a cluster are the ones recommended for a new image. To identify the representative tags of each cluster, we rank the tags by user frequency in descending order: the rank of a tag is higher the more users have used it, and vice versa. We associate the top s tags with the cluster c, and denote the set of the s most representative tags associated with a cluster c as c_T. A sketch of this ranking step follows.
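```python
from collections import defaultdict

def representative_tags(cluster_image_ids, tag_users_per_image, s=10):
    """Rank a cluster's tags by user frequency and keep the top s (the set c_T).

    tag_users_per_image maps an image id to a list of (tag, user_id) pairs;
    this data layout is an assumption made here for illustration.
    """
    users_per_tag = defaultdict(set)
    for image_id in cluster_image_ids:
        for tag, user_id in tag_users_per_image.get(image_id, []):
            users_per_tag[tag].add(user_id)
    # a tag ranks higher the more distinct users have used it
    ranked = sorted(users_per_tag,
                    key=lambda t: len(users_per_tag[t]), reverse=True)
    return ranked[:s]
```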
4.3 Classification and Tag Recommendation
Once we have clustered the images and identified the representative tags of these clusters, we can recommend the representative tags of the closest cluster for a new image. The image is mapped to the cluster whose centroid is at minimum distance from the image, and the most representative tags associated with that cluster are assigned to the new image. We assume that we have the geographical coordinates and low-level features of the new image, but no tags associated with it. For clusters based on the geographical or low-level feature space, we can directly measure the distance between the geographical or low-level features of the new image and the cluster centroids. For tag-based clustering, however, we do not have tags for the new image; therefore we have to express the centroids of the tag-based clusters in terms of either geographical or low-level features. For clusters based on geographical coordinates, we classify the new image into the cluster whose centroid is at minimum geographical distance from the new image. For low-level clusters, we classify the new image based on the distance between its low-level features and the cluster centroids. For tag-based clusters, as we do not have any tags for the new image, we classify the new image based on the distance between its geographical
coordinates and the mean of the geographical coordinates of the tag-based clusters. This mismatch between the feature space used for tag-based clustering and that of the new image negatively affects the results of tag-based clustering. To sum up, the tag recommendation process consists of the following two steps (a code sketch follows the list):

1. Find the cluster centroid c closest to the image f (using the geographical mean as cluster centroid for geographical (G) and tag (T) based clusters, and the low-level mean as cluster centroid for clusters based on low-level (L) features)
2. Recommend the tags c_T associated with the cluster c for the new image
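A minimal sketch of these two steps (variable names and the data layout are illustrative assumptions):

```python
import numpy as np

def recommend_for_new_image(new_geo, new_lowlevel, centroids, cluster_tags,
                            clustering_type="G"):
    """Step 1: find the closest cluster centroid; step 2: return its tags c_T.

    For G- and T-based clusterings the centroids are geographical means, so the
    new image is compared via its geo-coordinates; for L-based clusterings the
    centroids are low-level means and the low-level features are used instead.
    """
    query = new_lowlevel if clustering_type == "L" else new_geo
    distances = [np.linalg.norm(query - c) for c in centroids]
    closest = int(np.argmin(distances))
    return cluster_tags[closest]
```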
5 Experiments and Results
In this section the experiments and results are presented. The image dataset is briefly described in Section 5.1, the distinction between training and test data is explained in Section 5.2, followed by the evaluation method in Section 5.3. Section 5.4 presents the comprehensive results achieved in our work.
5.1 Image Dataset
The CoPhIR dataset [3] consists of images uploaded to Flickr by hundreds of thousands of different users, which makes the dataset very heterogeneous. One can find images of very different types like portraits, landscapes, people, architecture, screenshots, etc. To perform an evaluation of the different types of features (geo-tags, tags, low-level) at a reasonably large scale, we created a subset of the original CoPhIR dataset. We selected the images taken in the national capitals (http://en.wikipedia.org/wiki/National_capitals) of all the world's countries. For this purpose, we considered all images with a Euclidean distance (in terms of latitude and longitude) from the center of a capital city not higher than 0.1. We ignored capital cities which had fewer than 1,000 images; this resulted in a set of 58 cities. To keep the experiments scalable, we randomly selected 30,000 images for cities which had more than 30,000 images; there were only three such cities: Paris, London, and Washington DC. In the end, we had images of 58 capital cities, ranging from 1,000 to 30,000 images per city with an average of 8,000 images per city. The total number of images in our evaluation dataset was 413,848. For scalability, particularly for the low-level image features, images are trained and evaluated separately for each city.

Baseline: In order to compare the effectiveness of the different image features, we created a random feature space for the images. We assign a random value between 0 and 1 to each image in the dataset as its random feature and consider these random features as the baseline for comparison. The same clustering methods are applied to the random features as to the other features. The random feature space is uni-dimensional and is represented as D ∈ ℝ.
5.2 Training and Test Data
It is important to carefully select the training and test datasets, because when a user uploads images to Flickr, he can perform batch operations on a set of images. For example, he can assign the same tags or geographical coordinates to all images in a batch. It is also possible that the images have similar low-level features, e.g. if the images belong to a beach or a concert. If we randomly split the images into test and training datasets, there is a chance that some images belonging to a user are used for training, while other images of the same user end up in the test dataset used for evaluation. Such a random split may affect the final results, because a test image might be mapped to a cluster containing images from the same user with features similar to the test image. It is then very likely that the test image gets annotated with perfect tags, as the tags of both test and training images were provided by the same user. To make the evaluation transparent, instead of randomly splitting the resources into training and test datasets, we split the users. For each city, we use the resources of 75% of the users for training and the resources of the remaining 25% of the users as the test dataset. No image in the test dataset is annotated by a user who has also annotated images in the training dataset. After splitting the users into training and test datasets, we use 310,590 images for training the system and 103,258 images as ground truth for evaluating the system; a sketch of this user-based split is given at the end of this section. Another aspect of fair evaluation is the quality of the tags. Some tags are very common in both test and training datasets. These tags mostly represent city or country names, which can be suggested by looking into a geographical database. Some common tags might not be very specific, e.g. the tags geotagged, 2007, travel, etc. Very common tags also affect the evaluation results, as they are abundant in both test and training datasets and are suggested for almost every test image, which results in higher precision and recall values. To make the evaluation more transparent, we do not consider the ten most frequent tags of each city and we also ignore the frequent tags geotagged and geotag, because all images in our dataset are geo-tagged and most of them carry these two tags. For each city, we also remove very rare tags, which might be incorrectly spelled tags or tags specific to a particular user. For this reason, for each city, we ignore those tags which are used by fewer than three users.
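A sketch of the user-based split (75% of a city's users for training, 25% for testing), assuming a mapping from user id to that user's images:

```python
import random

def split_by_users(images_per_user, train_fraction=0.75, seed=0):
    """Split images so that no user contributes to both training and test sets."""
    users = sorted(images_per_user)
    random.Random(seed).shuffle(users)
    cut = int(len(users) * train_fraction)
    train_users, test_users = users[:cut], users[cut:]
    train_images = [img for u in train_users for img in images_per_user[u]]
    test_images = [img for u in test_users for img in images_per_user[u]]
    return train_images, test_images
```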
5.3 Evaluation
We consider the tags associated with the 103,258 test images as ground truth. The images in the ground truth are tagged by different users, and since there is no restriction on the selection of tags for a resource, the tags in the ground truth are very noisy. The noise in the data leads to inferior absolute results, but the overall results still allow a comparative analysis of the different feature spaces. We evaluate the methods using standard measures from information retrieval: precision P, recall R, and F-measure F. The evaluation measures are defined as follows:
    P = (Number of correctly suggested tags) / (Number of suggested tags)    (5)

    R = (Number of correctly suggested tags) / (Number of expected tags)    (6)

    F = (2 × P × R) / (P + R)    (7)
In addition to the standard precision and recall measures, we also computed the macro precision P_m, macro recall R_m, and macro F-measure F_m over tags as follows:

    P_m = ( Σ_{t ∈ Tags Suggested} (# of times t correctly suggested) / (# of times t suggested) ) / (# of tags suggested)    (8)

    R_m = ( Σ_{t ∈ Tags Expected} (# of times t correctly suggested) / (# of times t expected) ) / (# of tags expected)    (9)

    F_m = (2 × P_m × R_m) / (P_m + R_m)    (10)
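The micro measures (5)-(7) and the macro measures (8)-(10) could be computed as in the following sketch, where suggested and expected are lists of tag lists, one entry per test image (a data layout assumed here for illustration):

```python
from collections import defaultdict

def micro_measures(suggested, expected):
    """Precision, recall and F-measure as in Eqs. (5)-(7)."""
    correct = sum(len(set(s) & set(e)) for s, e in zip(suggested, expected))
    n_suggested = sum(len(s) for s in suggested)
    n_expected = sum(len(e) for e in expected)
    p = correct / n_suggested if n_suggested else 0.0
    r = correct / n_expected if n_expected else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def macro_measures(suggested, expected):
    """Macro precision, recall and F-measure over tags as in Eqs. (8)-(10)."""
    n_sugg, n_corr, n_exp = defaultdict(int), defaultdict(int), defaultdict(int)
    for s, e in zip(suggested, expected):
        for t in set(s):
            n_sugg[t] += 1
            if t in e:
                n_corr[t] += 1
        for t in set(e):
            n_exp[t] += 1
    p_m = sum(n_corr[t] / n_sugg[t] for t in n_sugg) / len(n_sugg) if n_sugg else 0.0
    r_m = sum(n_corr[t] / n_exp[t] for t in n_exp) / len(n_exp) if n_exp else 0.0
    f_m = 2 * p_m * r_m / (p_m + r_m) if (p_m + r_m) else 0.0
    return p_m, r_m, f_m
```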
5.4 Results
The results presented in this section give a comparative view of tag recommendation based on the different types of features. The automated evaluation on the one hand makes an evaluation at large scale possible, but on the other hand the ground truth (test data) might contain invalid tags. We try to make the evaluation transparent and more meaningful by filtering certain types of tags (see Section 5.2). Removing very common tags causes a certain decrease in the evaluation figures, but we believe that the filtering makes the evaluation fair. We have also evaluated the results without filtering the dataset, and in that case even the random feature space reaches an F-measure of 0.42. This is because very common tags are recommended for the test images and there is always a major overlap between the common tags of training and test data. The precision, recall, and F-measure values presented in this section might appear low to the reader, but one should keep in mind the filtering applied to the dataset to make the evaluation transparent. Table 2 consists of nine charts (ch_{i∈{1,2,3}, j∈{1,2,3}}) presenting the experimental results. The charts in the first row (ch_{1, j∈{1,2,3}}) depict the so-called micro average evaluation and were generated in accordance with the evaluation criteria (5), (6), and (7)
Table 2. Result charts (ch_{i∈{1,2,3}, j∈{1,2,3}})
ch1,1 – Micro Precision
ch1,2 – Micro Recall
ch1,3 – Micro F-Measure
ch2,1 – Macro Precision
ch2,2 – Macro Recall
ch2,3 – Macro F-Measure
ch3,1 – Micro F-Measure comparing results of two different low-level features Edge Histogram Descriptor (EHD) and Color Layout (CL)
ch3,2 – Micro F-Measure comparison of Cosine (Cos) and Euclidean (Eucl) distances for tag/text based features
ch3,3 – Micro F-Measure comparison of Manhattan (Manh) and Euclidean (Eucl) distances for non-text based features. Dark lines show the results obtained using the Manhattan distance and gray lines show the results obtained using the Euclidean distance
respectively. As one can see, in all three cases the results are significantly better when using geo-tags for image description. The performance of tag recommendation using low-level features or textual tags differs only slightly from the results based on random clustering. For exactly one recommended tag, the precision amounts to 0.1385 for geo-tags, 0.0502 for low-level features, 0.0451 for textual tags, and 0.0338 for random clustering. The charts in the second row (ch_{2, j∈{1,2,3}}) present the so-called macro average over tags evaluation and were generated in accordance with the evaluation criteria
(8), (9), and (10), respectively. Similar to the micro average evaluation, the results here are significantly better for geo-tags, while the performance for textual tags, low-level features, and random clustering is almost the same. For exactly one recommended tag, the macro precision amounts to 0.1584 for geo-tags, 0.0521 for low-level features, and 0.0414 for textual tags, while the baseline is 0.0312. The third row of charts (ch_{3, j∈{1,2,3}}) in Table 2 contains some further evaluations. In the charts of the first and second rows (ch_{i∈{1,2}, j∈{1,2,3}}), the Edge Histogram Descriptor (EHD) was applied whenever low-level features were used, and the cosine similarity was used for the tag-based feature space. This choice has an experimental reason. As can be seen in chart ch_{3,1}, the EHD performs slightly better than the Color Layout (CL) in terms of micro F-measure, and chart ch_{3,2} shows a clear advantage of the cosine distance over the Euclidean distance for tag-based features. Finally, chart ch_{3,3} explains why using the simple Euclidean distance turned out to be sufficient in our approach: the results remain almost the same when using the Manhattan distance.
6 Conclusion
In our paper, we exploited three kinds of image description techniques, namely geo-tags, tags, and low-level features, to suggest possible annotations for new images uploaded to a social tagging system. In order to compare the benefits each of these description types brings to a tag recommender system on its own, we investigated them independently of each other. The evaluation was done on a large-scale image database: for the experiments we used the CoPhIR dataset [3], which includes images uploaded to Flickr by hundreds of thousands of different users. The processing chain of our algorithm for generating image annotations comprises (i) clustering the images, (ii) finding representative tags for the clusters, and (iii) classifying new images and recommending tags. The results showed that geo-tags are the most helpful image descriptors for tag recommendation, while textual tags and low-level features provide only a slightly better performance than the random baseline. In the future, we will keep investigating the tag recommendation problem for large-scale heterogeneous image archives. We will further develop our framework to allow comprehensive experimental studies. We will also investigate the problem for more domain-dependent data collections.
Acknowledgments
This work has been partially supported by the European project Semiotic Dynamics in Online Social Communities (Tagora, FP6-2005-34721). We would also like to acknowledge the Higher Education Commission of Pakistan (HEC) and the German Academic Exchange Service (DAAD) for providing a scholarship and support to Rabeeh Abbasi for conducting his PhD.
References
1. Adrian, B., Sauermann, L., Roth-Berghofer, T.: Contag: A semantic tag recommendation system. In: Pellegrini, T., Schaffert, S. (eds.) Proceedings of I-Semantics 2007, September 2007, pp. 297–304. JUCS (2007)
2. Basile, P., Gendarmi, D., Lanubile, F., Semeraro, G.: Recommending smart tags in a social bookmarking system. In: Bridging the Gap between Semantic Web and Web 2.0 (SemNet 2007), pp. 22–29 (2007)
3. Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., Rabitti, F.: CoPhIR: a test collection for content-based image retrieval. CoRR, abs/0905.4627v2 (2009)
4. Cristani, M., Perina, A., Castellani, U., Murino, V.: Content visualization and management of geo-located image databases. In: CHI 2008: CHI 2008 extended abstracts on Human factors in computing systems, pp. 2823–2828. ACM, New York (2008)
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
6. Heymann, P., Ramage, D., Garcia-Molina, H.: Social tag prediction. In: SIGIR 2008: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 531–538. ACM, New York (2008)
7. Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., Stumme, G.: Tag recommendations in folksonomies. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 506–514. Springer, Heidelberg (2007)
8. Kennedy, L., Naaman, M., Ahern, S., Nair, R., Rattenbury, T.: How flickr helps us make sense of the world: context and content in community-contributed media collections. In: MULTIMEDIA 2007: Proceedings of the 15th international conference on Multimedia, pp. 631–640. ACM, New York (2007)
9. Kennedy, L.S., Naaman, M.: Generating diverse and representative image search results for landmarks. In: WWW 2008: Proceedings of the 17th international conference on World Wide Web, pp. 297–306. ACM, New York (2008)
10. Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)
11. Moëllic, P.-A., Haugeard, J.-E., Pitel, G.: Image clustering based on a shared nearest neighbors approach for tagged collections. In: CIVR 2008: Proceedings of the 2008 international conference on Content-based image and video retrieval, pp. 269–278. ACM, New York (2008)
12. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics 73(1), 1–23 (2003)
13. Pentland, A., Picard, R., Sclaroff, S.: Tools for content-based manipulation of image databases. International Journal of Computer Vision 18(3), 233–254 (1996)
Interoperable Multimedia Metadata through Similarity-Based Semantic Web Service Discovery

Stefan Dietze¹, Neil Benn¹, John Domingue¹, Alex Conconi², and Fabio Cattaneo²

¹ Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
{s.dietze,n.j.l.benn,j.b.domingue}@open.ac.uk
² TXT eSolutions, Via Frigia 27, 20126 Milano, Italy
{alex.conconi,fabio.cattaneo}@txt.it
Abstract. The increasing availability of multimedia (MM) resources, Web services as well as content, on the Web raises the need to automatically discover and process resources out of distributed repositories. However, the heterogeneity of applied metadata schemas and vocabularies – ranging from XML-based schemas such as MPEG-7 to formal knowledge representation approaches – raises interoperability problems. To enable MM metadata interoperability by means of automated similarity-computation, we propose a hybrid representation approach which combines symbolic MM metadata representations with a grounding in so-called Conceptual Spaces (CS). In that, we enable automatic computation of similarities across distinct metadata vocabularies and schemas in terms of spatial distances in shared CS. Moreover, such a vector-based approach is particularly well suited to represent MM metadata, given that a majority of MM parameters is provided in terms of quantified metrics. To prove the feasibility of our approach, we provide a prototypical implementation facilitating similarity-based discovery of publicly available MM services, aiming at federated MM content retrieval out of heterogeneous repositories. Keywords: Semantic Web Services, Multimedia, Metadata, Vector Spaces.
1 Introduction

A continuously increasing amount of digital multimedia (MM) content is available on the Web, ranging from user-generated video content and commercial Video on Demand (VoD) portfolios to a broad range of streaming and IPTV resources and corresponding metadata records [19]. Besides, it has become common practice throughout the last decade to expose all sorts of MM content and metadata stored in a particular repository through a set of Web services, which provide Web-based access to software functionalities for processing MM content and metadata, i.e. to retrieve, transcode or scale MM assets [21]. In line with the increasing usage of the term Web service in a broader sense, in the following we will use it synonymously with any kind of software functionality which is accessible through HTTP or any other IP-based layer, ranging from rather light-weight APIs and REST-based interfaces to standard Web service technology such as SOAP [22], UDDI [23] and WSDL [24].
Hence, the increasing accessibility of distributed MM resources – content as well as services – raises the need to automatically discover and compose distributed content. In that, the highly heterogeneous nature of MM resources distributed across distinct repositories leads to the following key challenges:

C1. Discovery of distributed MM services.
C2. Discovery of distributed MM content.
However, w.r.t. these goals, several issues apply:

Concurrent metadata schemes and vocabularies. Distinct approaches to metadata representation exist, ranging from light-weight tagging approaches as deployed within user-driven websites such as YouTube (http://www.youtube.com) and general-purpose metadata standards such as Dublin Core [5] to fully-fledged domain-specific metadata standards such as MPEG-7 [10]. Besides, concurrent vocabularies – differing in terminology, syntax or language – are widely used to provide metadata records, leading to further heterogeneities and ambiguities [11][19]. This issue also applies to Web service metadata provided via syntactic descriptions such as WSDL [24] or semantic annotations based on OWL-S [12] or WSMO [25].

Lack of metadata comprehensibility and semantic meaningfulness. Metadata records lack expressivity due to merely syntactic annotations – usually based on XML schemas – which do not exploit the semantics of the structures and terminologies used [1][9][20]. In addition, current MM metadata schemas usually focus on the low-level parameters describing the actual format and audio-visual characteristics of MM assets, although a combined representation of both the actual content and its audio-visual format is required [14]. Moreover, even approaches such as [18] which exploit formal semantic representations, e.g. based on Semantic Web (SW) technologies such as OWL (http://www.w3.org/OWL/) or RDF-S (http://www.w3.org/RDFS/), rely on either a common agreement on a shared conceptualisation or the formal representation of mappings, which is costly and error-prone. These issues hinder the automatic composition and processing of MM metadata and resources and hence lead to interoperability issues.

Lack of fuzzy matchmaking approaches. Current approaches to matching a certain request with available MM resources usually perform strict one-to-one matchmaking and require the subscription to a certain vocabulary by both providers and consumers. In that, only resources from a highly limited number of repositories which represent an exact match with the requested parameters are retrieved, while similar and otherwise related resources which are potentially useful are left aside.

Consequently, in order to enable interoperability between heterogeneous MM resource metadata, representation approaches are required which are meaningful enough to implicitly infer inherent similarities across concurrent sets of MM annotations. In previous work [4], the authors proposed a representational approach combining symbolic knowledge representation mechanisms – as used by current MM resource metadata approaches – and SW technologies, with a representation in so-called Conceptual Spaces (CS) [8]. The latter consider the representation of
knowledge entities, such as the ones described in MM metadata, through geometrical vector spaces where measurable quality criteria represent individual dimensions. Particular metadata records, i.e. instances, are represented as members, i.e. particular vectors, in a CS, which facilitates the computation of similarities by means of spatial distances. Here, we propose the application of our hybrid representational approach to model metadata of MM resources – i.e. MM services and MM content – in order to enable the computation of similarities across heterogeneous repositories. In particular, low-level audio-visual MM characteristics, which are usually described by means of quantified attributes based on certain metrics, such as the MPEG-7 [10] descriptors Dominant Color or Homogeneous Texture, lend themselves to being represented in terms of vectors. Consequently, our hybrid representational approach qualifies well to tackle MM metadata interoperability. The remainder of the paper is organized as follows. Section 2 provides an overview of related work in the area of MM service and content metadata interoperability. Our approach to representing MM metadata is introduced in Section 3, followed by a prototypical application utilising our approach for similarity-based MM resource discovery in Section 4. Section 5 concludes and discusses our work.
2 Related Work

To satisfy the content need of a specific consumer, a federated MM content provisioning engine needs to discover (C1) the appropriate MM services (i.e. repositories) and (C2) the appropriate content. The following figure depicts this vision:

Fig. 1. Discovery of distributed MM services and content
Given that both MM services and content utilize particular metadata vocabularies and schemas, approaching C1 and C2 requires taking into account related work from the areas of MM service metadata as well as MM content metadata interoperability.

2.1 MM Service Discovery through Semantic Web Services

With respect to C1, Semantic Web Services (SWS) technology aims at the automatic discovery, orchestration and invocation of distributed services on the basis of
comprehensive semantic descriptions. SWS are supported through representation standards such as WSMO [25] and OWL-S [12]. We particularly refer to the Web Service Modelling Ontology (WSMO), an established SWS reference ontology and framework. WSMO is currently supported through dedicated reasoners, such as the Internet Reasoning Service IRS-III [2] and WSMX [26], which act as broker environments for SWS. In that, a SWS broker mediates between a service requester and one or more service providers. Based on a client request, the reasoner discovers potentially relevant SWS, invokes selected services and mediates potential mismatches. However, the domain-independent nature of SWS reference models requires their derivation to facilitate the representation of certain domain-specific contexts. While SWS aim at the automatic discovery of distributed Web services based on semantic metadata, current approaches usually rely on either the subscription to a common vocabulary and schema – i.e. a common domain ontology – or the manual definition of mappings between distinct service ontologies. In that, the previously introduced issues (Section 1) also apply to SWS technologies, calling for approaches to deal with heterogeneities between distributed SWS. Approaches such as [15] aim at partially addressing the interoperability issue by resolving heterogeneities based on mapping approaches. For instance, [27] provides an attempt to support similarity detection for mediation within SWS composition by exploiting syntactic similarities between SWS representations. However, it can be stated that current approaches rely on the definition of a priori mappings, the agreement on a shared vocabulary or the exploitation of semi-automatic ontology mapping approaches. Hence, providing a more generic solution to automatically resolve heterogeneities between heterogeneous SWS remains a central challenge.

2.2 MM Metadata Interoperability

With respect to C2, a broad variety of research aims at interoperability between distributed MM (content) metadata. In general, the need to enrich non-semantic MM metadata with formal semantics in order to enable more comprehensive query and retrieval facilities is widely accepted [16]. For instance, [19] proposes an approach to semantically enrich MPEG-7 and TV-Anytime metadata through formal semantic expressions. In addition, [18] provides a way of formally expressing the semantics of MPEG-7 profiles. While increasing the expressiveness of MPEG-7 based metadata, this work is limited to MPEG-7 exclusively. This also applies to the work proposed in [7], which provides an OWL expression of the MPEG-7 information model. In [9], the author provides a core ontology for annotating MM content to address interoperability. However, this approach relies on the subscription to a common vocabulary/schema – i.e. the suggested core ontology – which is not feasible in Web-scale scenarios. An entirely MPEG-21 based approach for interoperable MM communication is proposed in [17]. The need to automatically discover and compose Web services to enable processing of MM content is expressed in [21], where the authors propose an approach based on OWL-S. However, the interoperability issues between heterogeneous symbolic service annotations (Section 2.1) also apply here. While several approaches try to tackle MM metadata interoperability, it can be stated that the current state of the art usually relies on the subscription to common (upper-level) vocabularies/schemas or the manual definition of mappings.
Hence, issues arise
when attempting to apply such approaches in Web-scale scenarios [11]. Therefore, analogous to the field of MM service annotations (Section 2.1), we claim that methodologies are required which allow for a more flexible alignment of distinct vocabularies.
3 Approach

With respect to the previously introduced issues (Sections 1 and 2), we claim that basing MM metadata representations on merely symbolic representations does not fully enable semantic meaningfulness [4] and hence limits the automatic identification of similarities across distinct schemas and vocabularies. In order to enable interoperability between heterogeneous MM resource metadata – representing MM content or services – representation approaches are required which are semantically meaningful enough to implicitly infer inherent similarities. In that, we argue that a refinement of symbolic MM metadata through so-called Conceptual Spaces (CS) is better suited to overcome interoperability issues. While previous work [3][4] has shown that this approach can be applied to support interoperability between ontologies, here we apply it to facilitate interoperability between MM services and repositories.

3.1 Grounding MM Metadata in Multiple Conceptual Spaces

We propose a two-fold representational approach – combining MM domain ontologies with corresponding representations based on multiple CS – to enable (a) similarity computation across concurrent MM metadata schemas and vocabularies and (b) the conjoint representation of low-level audio-visual features and the content semantics. In that, we consider the representation of a set of n schema entities (concepts) E of a set of MM metadata records (ontology) O through a set of n Conceptual Spaces CS. Note that we particularly foresee the application of this approach to the metadata of both MM services and content. Schema entities in the case of MM content are, for instance, MPEG-7 descriptors such as Scalable Color or Edge Histogram. In the case of MM services, a schema entity could be, for instance, a WSMO ontology concept. MM metadata values (instances) are represented as members, i.e. vectors, in the respective CS. While still benefiting from implicit similarity information within a CS, our hybrid approach maintains the advantages of symbolic MM metadata representations and comprehensive domain ontologies, i.e. the ability to represent arbitrary relations and axioms. In order to be able to refine and represent ontological concepts within a CS, we formalised the CS model into an ontology [4]. Hence, a CS can simply be instantiated in order to represent a particular MM metadata schema entity. Referring to [8], we formalise a CS as a vector space defined through quality dimensions d_i. Each dimension is associated with a certain metric scale, e.g. a ratio, interval or ordinal scale. To reflect the impact of a specific quality dimension on the entire CS, we consider a prominence value p for each dimension [8]. A particular member M – representing a particular value of a schema entity – in the CS is described through a vector defined by valued dimensions v_i. Following this vision, for instance, the MPEG-7 schema entity Dominant Color could be represented through a CS defined by means of RGB values, where each of the spectrum colors
represents one particular dimension of the CS. A certain shade of blue would then be represented through a member M1, i.e. a vector with M1 = {(124, 177, 236)}. Alignment between symbolic MM metadata representations and their corresponding CS (members) is achieved by referring the respective symbolic representation to the corresponding CS ontology containing the respective CS and member instances. In that, ontological MM metadata representations would import the CS ontology, while XML-based metadata could utilise an XML serialization of the CS ontology as a particular controlled vocabulary. Hence, content and service semantics which are represented through particular domain ontologies are refined through CS to enable similarity computation between distinct metadata sets.

3.2 Similarity-Based Discovery of MM Resources

We define the semantic similarity between two members of a CS as a function of the Euclidean distance between the points representing each of the members. Hence, with respect to [4], given a CS definition CS and two members V and U, defined by v_1, v_2, ..., v_n and u_1, u_2, ..., u_n within CS, the distance between V and U is calculated as a normalised function of their Euclidean distance. For further details, please refer to [3][4]. In order to facilitate automated similarity computation between distinct MM metadata vocabularies and schemas, we provided a Web service (WS_sim) capable of computing similarities between multiple members in multiple CS. This Web service makes it possible to automatically identify similarities between multiple MM metadata records, and hence to automatically select the most appropriate (i.e. the most similar) MM metadata record for a given request. In that, given a set of MM metadata records, for instance based on formal semantics or XML, and a set of corresponding CS representations which refine the MM metadata schema and its values by means of vectors, WS_sim is able to compute similarities and consequently to map and mediate between concurrent metadata schemas and vocabularies. This Web service is provided with the actual MM metadata request R and the x MM metadata records MM_i that are potentially relevant for R: R ∪ {MM_1, MM_2, ..., MM_x}. R is provided as a set of measurements, i.e. vectors {v_1..v_n} representing a set of m members M(R) in the available CS, which describe the desired metadata values, e.g. values measuring a certain MPEG-7 descriptor or certain criteria describing MM Web service capabilities, such as a specific Quality of Service (QoS). Also, each MM_i contains a set of concepts (schema entities) C = {c_1..c_m} and instances (entity values) I = {i_1..i_n}. For each M_i within R the corresponding CS representations CS = {CS_1..CS_m} are retrieved by WS_sim from the available CS ontology [4]. Similarly, for each MM_j the members M(MM_j) – which refine the instances of MM_j and are represented in one of the conceptual spaces CS_1..CS_m – are retrieved: CS ∪ M(R) ∪ {M(MM_1), M(MM_2), ..., M(MM_x)}. Based on the above ontological descriptions, for each member v_l within M(R), the Euclidean distances to any member of all M(MM_j) which is represented in the same space CS_j as v_l are computed. Consequently, a set of x sets of distances is computed as Dist(MM_i) = {Dist(R, MM_1), Dist(R, MM_2) .. Dist(R, MM_x)}, where each Dist(R, MM_j) contains a set of distances {dist_1..dist_n} and each dist_k represents the distance between one particular member v_i of R and one member refining one instance of the
capabilities of MM_j. Hence, the overall similarity between the request R and any available MM_j can be defined as the reciprocal of the mean value of the individual distances between all instances of their respective capability descriptions:
    Sim(R, MM_j) = ( mean(Dist(R, MM_j)) )^(-1) = ( ( Σ_{k=1}^{n} dist_k ) / n )^(-1)
Finally, a set of x similarity values – each indicating the similarity between the request R and one of the x available MM records MM_j – is computed by WS_sim: Output(WS_sim) = {Sim(R, MM_1), Sim(R, MM_2), .., Sim(R, MM_x)}. As a result, the most similar MM_j, i.e. the closest MM record, can be selected and invoked. In order to ensure a certain degree of overlap between the actual request and the selected MM record, we also defined a threshold similarity value T which determines the minimum similarity required.
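A sketch of this selection step (the dictionary-based data layout is an assumption made for illustration): distances are computed between members of the same conceptual space, averaged per candidate record, inverted, and compared against the threshold T:

```python
import numpy as np

def similarity(request_members, record_members):
    """Sim(R, MM_j): reciprocal of the mean member distance between R and MM_j.

    Both arguments map a conceptual-space id to a list of member vectors;
    distances are only computed between members of the same space.
    """
    distances = []
    for cs_id, request_vectors in request_members.items():
        for v in request_vectors:
            for u in record_members.get(cs_id, []):
                distances.append(np.linalg.norm(np.asarray(v) - np.asarray(u)))
    if not distances:
        return 0.0
    mean_distance = float(np.mean(distances))
    return float("inf") if mean_distance == 0 else 1.0 / mean_distance

def select_record(request_members, candidate_records, threshold):
    """Pick the most similar MM record, provided it reaches the threshold T."""
    scores = {name: similarity(request_members, members)
              for name, members in candidate_records.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```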
4 Application – Similarity-Based Selection of Video Services

We provided a prototypical implementation which aims at similarity-based retrieval of public MM content. Note that, instead of applying the representational approach to individual MM content metadata, our prototypical application uses the approach to annotate MM (Web) services which operate on top of distributed MM content repositories. The available services were annotated following the representational approach proposed in Section 3.1. Hence, our proof-of-concept application facilitates the similarity-based selection of MM services (i.e. C1 in Section 1), which in turn process and retrieve MM content (C2). In that, federated retrieval and processing of MM metadata is supported to facilitate interoperability. Our application makes use of standard SWS technology based on WSMO and IRS-III (Section 2.1) to achieve this vision. The application dynamically discovers services which were created in the context of the EC-funded project NoTube (http://projects.kmi.open.ac.uk/notube/) and make use of the YouTube API (http://code.google.com/intl/en/apis/youtube/) as well as data feeds provided by BBC Backstage (http://backstage.bbc.co.uk/) and Open Video (http://www.open-video.org/).

4.1 Representing MM Services through Multiple CS

In fact, five different Web services were provided, each able to retrieve content from a distinct repository through keyword-based searches. WS1 is able to retrieve content from the YouTube channel of The Open University (http://www.youtube.com/ou), while WS2 provides YouTube content associated with the entertainment category following the YouTube vocabulary. WS3 performs keyword-based searches on top of the Open Video repository, while WS4 operates on top of the news metadata feeds provided by BBC Backstage. In addition, WS5 provides YouTube content suitable for mobile devices.
[Figure 2 depicts the request SWS6: get-video-request and the five annotated services SWS1–SWS5 (OU-youtube, entertain-youtube, open-video, bbc-backstage, mobile-youtube), which correspond to the Web services WS1–WS5; each is refined through purpose members (O:Purp) in CS1 (Purpose Space) and environment members (O:Env) in CS2 (Environment Space).]
Fig. 2. MM service metadata refined in two distinct CS
Based on the SWS reference model WSMO, we provided service annotations following the approach described in Section 3. In particular, we annotated the Web services in terms of the purpose for which they serve MM content and the technical environment supported by the delivered content. In that, a simplified space (CS1: Purpose Space in Figure 2) was defined to refine the notion of purpose by using three dimensions: {((p1*information), (p2*education), (p3*leisure))} = CS1. The dimensions of CS1 are measured on a ratio scale ranging from 0 to 100. For instance, a member P1 in CS1 described by the vector {(0, 100, 0)} would indicate a rather educational purpose. In addition, a second space (CS2: Environment Space in Figure 2) was provided to represent technical environments in terms of dimensions measuring the available resolution and bandwidth: {((p1*resolution), (p2*bandwidth))} = CS2. For simplification, the dimensions of CS2 were also measured on a ratio scale. However, we intend to refine the dimensions by applying an interval scale, so that actual resolution and bandwidth measurements can be represented. Each dimension was weighted equally, with a prominence of 1 in all cases.

By applying the representational approach proposed here, each concept of the involved heterogeneous SWS representations of the underlying MM services was refined as a shared CS, while instances – used to define MM services and MM requests – were defined as members, i.e. vectors. In that, assumptions (Ass) of available MM services were described independently in terms of simple conjunctions of instances which were individually refined as vectors in shared CS, as shown in Table 1.

Table 1. Assumptions of involved SWS (requests) described as vectors in CS1 and CS2
AssSWSi = (P1SWSi ∪ P2SWSi ∪ ... ∪ PnSWSi) ∪ (E1SWSi ∪ E2SWSi ∪ ... ∪ EmSWSi)

SWS   | Members Pi in CS1 (purpose)                     | Members Ej in CS2 (environment)
SWS1  | P1(SWS1)={(0, 100, 0)}                          | E1(SWS1)={(100, 100)}
SWS2  | P1(SWS2)={(0, 0, 100)}                          | E1(SWS2)={(100, 100)}
SWS3  | P1(SWS3)={(50, 50, 0)}                          | E1(SWS3)={(100, 100)}
SWS4  | P1(SWS4)={(100, 0, 0)}                          | E1(SWS4)={(100, 100)}
SWS5  | P1(SWS5)={(100, 0, 0)}, P2(SWS5)={(0, 100, 0)}  | E1(SWS5)={(10, 10)}
Each service was associated with a set of members (vectors) in CS1 and CS2 to represent its purpose and the targeted environment. For instance, SWS3, which provides resources from the Open Video repository that are in fact of a rather educational or informational nature, was associated with a corresponding purpose vector {(50, 50, 0)}. Since SWS5 represents a Web service dedicated to MM content suitable for mobiles, a vector {(10, 10)} indicating low resolution and bandwidth values was associated with it.

4.2 Similarity-Based Selection of MM Services and Content

An AJAX-based user interface (Fig. 3) was provided which allows users to define MM content requests by providing measurements describing their context, i.e. the purpose and environment, and search input parameters, i.e. a set of keywords. For instance, a user provides a request R with the search keyword “Aerospace” together with measurements which correspond to the following vectors: P1(R)={(60, 55, 5)} in CS1 and P2(R)={(95, 90)} in CS2. These vectors indicate the need for content which serves the need for education or information and which supports a rather high resolution environment.

Table 2. Automatically computed similarities between request R and available SWS

SWS   | Similarity
SWS1  | 0.023162405
SWS2  | 0.014675636
SWS3  | 0.08536871
SWS4  | 0.02519804
SWS5  | 0.01085659
Fig. 3. Screenshot of AJAX interface depicting MM metadata retrieved from the Open Video repository after similarity-based selection & invocation of MM services and metadata
Though no MM service matches these criteria exactly, at runtime similarities are calculated between R and the related SWS (SWS1–SWS5) through the similarity computation service WSsim described in Section 3.2. This led to the calculation of the similarity values shown in Table 2. Given these similarities, our reasoning environment automatically selects the most similar MM service (SWS3) and triggers its invocation. As illustrated above, our application utilises our representational mechanism (Section 3.1) to support similarity-based selection of distributed MM services. Hence, though it deploys our representational approach to MM services rather than to MM content, our proof-of-concept prototype illustrates the applicability of the approach for similarity-based MM metadata discovery.
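For illustration, the data of Table 1 and the request above can be fed into the sketch given at the end of Section 3.2. This is again a hypothetical reconstruction: WSsim additionally normalises distances, and its exact aggregation over several members in the same space is not spelled out here, so the resulting scores should only be read as an approximation of Table 2.

```python
# Service assumptions from Table 1, keyed by conceptual space
services = {
    "SWS1": {"CS1": [(0, 100, 0)],              "CS2": [(100, 100)]},
    "SWS2": {"CS1": [(0, 0, 100)],              "CS2": [(100, 100)]},
    "SWS3": {"CS1": [(50, 50, 0)],              "CS2": [(100, 100)]},
    "SWS4": {"CS1": [(100, 0, 0)],              "CS2": [(100, 100)]},
    "SWS5": {"CS1": [(100, 0, 0), (0, 100, 0)], "CS2": [(10, 10)]},
}

# Request R: educational/informational purpose, high-resolution environment
request = {"CS1": [(60, 55, 5)], "CS2": [(95, 90)]}

best, scores = select(request, services, threshold=0.01)
print(best)   # SWS3 obtains the highest score, in line with the selection reported above
```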
5 Conclusions

In order to facilitate interoperability between heterogeneous MM resources distributed across distinct repositories, we identified two major challenges – the discovery of appropriate MM services and the retrieval of the most appropriate MM content. However, addressing these challenges requires interoperability between concurrent metadata annotation schemas and vocabularies. To facilitate such interoperability, we proposed a two-fold representational approach. By representing MM annotation schema entities as dedicated vector spaces, i.e. CS, and corresponding values as vectors, similarities become computable by means of distance metrics. Our approach is realised through a dedicated CS ontology. To prove the feasibility of the approach, we introduced a prototypical application which utilises our representational approach to support discovery of MM services across distributed MM repositories. As a result, we enable similarity-based discovery of the most appropriate MM service for a given request and, hence, federated MM content and metadata searches across distributed repositories. However, while the current matchmaking algorithm considers instance similarity as the exclusive suitability measure, future work will deal with the combined consideration of logical expressions and instance similarity.

We claim that our representational approach is particularly applicable to the MM domain, where the majority of descriptors are based on quantified metrics and hence well suited for metric-based representations such as vector spaces. The authors would like to highlight that providing the representations proposed here requires an additional effort, which needs to be investigated within future work. In this respect, please note that certain CS, for instance the one describing the notion of color, are reusable given that they are required for a variety of MM parameters. As another restriction, our approach requires that distinct parties share common CS. However, given the widespread usage of upper-level ontologies such as DOLCE [6], SUMO [13] or OpenCyc⁹ together with the availability of common MM metadata standards [10] and ontologies [1][18], the agreement on common CS becomes increasingly feasible. Future work will be concerned with the evaluation of the effort required to utilise our representational model, and also with carrying out further case studies.
⁹ http://www.opencyc.org/
References
[1] Arndt, R., Troncy, R., Staab, S., Hardman, L., Vacura, M.: COMM: Designing a Well-Founded Multimedia Ontology for the Web. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 30–43. Springer, Heidelberg (2007)
[2] Cabral, L., Domingue, J., Galizia, S., Gugliotta, A., Norton, B., Tanasescu, V., Pedrinaci, C.: IRS-III: A Broker for Semantic Web Services based Applications. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 201–214. Springer, Heidelberg (2006)
[3] Dietze, S., Gugliotta, A., Domingue, J.: Conceptual Situation Spaces for Situation-Driven Processes. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 599–613. Springer, Heidelberg (2008)
[4] Dietze, S., Domingue, J.: Exploiting Conceptual Spaces for Ontology Integration. In: Data Integration through Semantic Technology (DIST 2008) Workshop at the 3rd Asian Semantic Web Conference (ASWC 2008), Bangkok, Thailand (2008)
[5] Dublin Core Metadata Initiative: Dublin Core Metadata Terms (2006), http://dublincore.org/documents/dcmi-terms/
[6] Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L.: Sweetening Ontologies with DOLCE. In: Gómez-Pérez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, p. 166. Springer, Heidelberg (2002)
[7] Garcia, R., Celma, O.: Semantic Integration and Retrieval of Multimedia Metadata. In: 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media (2005)
[8] Gärdenfors, P.: Conceptual Spaces – The Geometry of Thought. MIT Press, Cambridge (2000)
[9] Hunter, J.: Enhancing the semantic interoperability of multimedia through a core ontology. IEEE Trans. Circuits Syst. Video Techn. 13(1), 49–58 (2003)
[10] Moving Picture Experts Group: ISO/IEC JTC1/SC29/WG11 – MPEG-7, http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
[11] Nack, F., van Ossenbruggen, J., Hardman, L.: That Obscure Object of Desire: Multimedia Metadata on the Web (Part II). IEEE MultiMedia 12(1), 54–63 (2005)
[12] OWL-S 1.0, http://www.daml.org/services/owl-s/1.0/
[13] Pease, A., Niles, I., Li, J.: The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications. In: AAAI 2002 Workshop on Ontologies and the Semantic Web, Working Notes (2002)
[14] Petridis, K., Bloehdorn, S., Saathoff, C., Simou, N., Dasiopoulou, S., Tzouvaras, V., Handschuh, S., Avrithis, Y., Kompatsiaris, I., Staab, S.: Knowledge Representation and Semantic Annotation of Multimedia Content. IEE Proceedings on Vision, Image and Signal Processing, Special Issue on Knowledge-Based Digital Media Processing 153(3), 255–262 (2006)
[15] Radetzki, U., Cremers, A., Iris, B.: A framework for mediator-based composition of service-oriented software. In: ICWS, pp. 752–755. IEEE Computer Society, Los Alamitos (2004)
[16] Stamou, G., van Ossenbruggen, J., Pan, J.Z., Schreiber, G., Smith, J.R.: Multimedia annotations on the semantic Web. IEEE Multimedia 13(1), 86–90 (2006)
[17] Timmerer, C., Hellwagner, H.: Interoperable Adaptive Multimedia Communication. IEEE Multimedia Magazine 12(1), 74–79 (2005)
[18] Troncy, R., Bailer, W., Hausenblas, M., Hofmair, P., Schlatte, R.: Enabling Multimedia Metadata Interoperability by Defining Formal Semantics of MPEG-7 Profiles. In: Avrithis, Y., Kompatsiaris, Y., Staab, S., O'Connor, N.E. (eds.) SAMT 2006. LNCS, vol. 4306, pp. 41–55. Springer, Heidelberg (2006)
[19] Tsinaraki, C., Polydoros, P., Kazasis, F., Christodoulakis, S.: Ontology-based Semantic Indexing for MPEG-7 and TV-Anytime Audiovisual Content. Special Issue of the Multimedia Tools and Applications Journal on Video Segmentation for Semantic Annotation and Transcoding (2004)
[20] Van Ossenbruggen, J., Nack, F., Hardman, L.: That Obscure Object of Desire: Multimedia Metadata on the Web (Part I). IEEE MultiMedia 11(4), 38–48 (2004)
[21] Wagner, M., Kellerer, W.: Web services selection for distributed composition of multimedia content. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, October 10–16 (2004)
[22] W3C: Simple Object Access Protocol (SOAP), Version 1.2 Part 0: Primer (2003), http://www.w3.org/TR/soap12-part0/
[23] W3C: Universal Description, Discovery and Integration: UDDI Spec Technical Committee Specification v. 3.0 (2003), http://uddi.org/pubs/uddi-v3.0.120031014.htm
[24] W3C: WSDL: Web Services Description Language (WSDL) 1.1 (2001), http://www.w3.org/TR/2001/NOTE-wsdl-20010315
[25] WSMO Working Group: D2v1.0: Web Service Modeling Ontology (WSMO). Working Draft (2004), http://www.wsmo.org/2004/d2/v1.0/
[26] WSMX Working Group: The Web Service Modelling eXecution environment (2007), http://www.wsmx.org/
[27] Wu, Z., Ranabahu, A., Gomadam, K., Sheth, A.P., Miller, J.A.: Automatic Composition of Semantic Web Services using Process Mediation. In: Proceedings of the 9th International Conference on Enterprise Information Systems (ICEIS 2007), Funchal, Portugal, June 2007, pp. 453–461 (2007)
Semantic Expression and Execution of B2B Contracts on Multimedia Content

Víctor Rodríguez-Doncel and Jaime Delgado

Distributed Multimedia Applications Group, Universitat Politècnica de Catalunya, Jordi Girona 1-3, 08034 Barcelona, Spain
{victorr,jaime.delgado}@ac.upc.edu
Abstract. Business to business commerce of audiovisual material can be governed by electronic contracts, in the same way as digital licenses govern business to consumer transactions. The digital licenses for end users have been expressed either in proprietary formats or in standard Rights Expression Languages, and they can be seen as the electronic replacement of distribution contracts and end user licenses. However, these languages fail to replace the rest of the contracts agreed along the complete Intellectual Property value chain. To represent the electronic counterparts of these contracts, a schema based on the eContracts standard and the Media Value Chain Ontology is presented here. It has been conceived to deal with a broader set of parties, to handle typical clauses found in audiovisual market contracts, and to govern every transaction performed on IP objects.

Keywords: Contract, license, DRM, Intellectual Property, Ontology, MPEG-21.
1 Introduction

Audio and video distribution has perhaps never received as much public attention as it does today. From the legal changes around intellectual property (IP) to the newest gadget with wondrous capabilities appearing on the market, mass consumption of media is in the public spotlight. Less socially visible, but of no less economic importance, are the transactions of multimedia material within business-to-business (B2B) boundaries. Technologies to serve media to the consumer have developed very fast, in a controlled manner through web portals and Digital Rights Management (DRM) systems and in uncontrolled manners through parallel channels like P2P networks, fast download servers, etc.

In the business-to-consumer (B2C) sector, economic transactions have been kept relatively simple – the consumer pays and in exchange can download a file or gain access to a football match stream, for example. Digital licenses, expressed in one of the existing Rights Expression Languages (REL), allow some degree of complexity, where the transaction can be conditioned on the satisfaction of some conditions (e.g., of temporal or territorial nature), and can define more precisely the action the user can
make (perhaps render, but not store or print). In the B2B sector of audiovisual content, transactions happen in a similar way: there is a flow of money and a flow of content in opposite directions, both of which can take place in the digital space. The essence remains the same as in the B2C case, but complexities arise in the conditions and the nature of the agreements, and a pre-filled license does not suffice. Written contracts regulate the economic transactions instead of digital licenses, and technology is not relied upon as it is in the retail segment. The authors of this paper believe that part of this lack of acceptance of digital systems to create, manage, and execute the agreements is due to the lack of maturity of the technology, which has so far failed to express satisfactorily the terms of real contracts in a digital language and to manage and execute them accordingly.

This paper reviews previous attempts at expressing contracts in a digital language and their role as enforcement agents in information systems. It then focuses on the case of the audiovisual B2B sector, which presents some recurrent patterns in contract structure and a very well defined candidate environment for their use: DRM systems. The paper finally evaluates more complex contracts in the context of their execution and their role as steering documents.
2 Contract Representation

2.1 Contract Representation Overview

B2B transactions in the audiovisual market have been regulated with narrative contracts. Contracts are legally binding agreements and they are made of mutual promises between two or more parties to do (or refrain from doing) something. The terms of a contract may be expressed in writing or orally, implied by conduct, industry custom and law, or by a combination of these.

Efforts to represent contracts electronically are not new – they are as old as computers – and even making them part of digital systems is not new. Along with the development of computer science and network communications, the electronic representation of contracts played an increasingly active role. Thus, in the earliest Electronic Data Interchange (EDI) standards, about forty years ago, only bills and invoices were exchanged, but slowly the exchanged messages became richer in their expressivity and their role in integrated information systems grew in importance. Besides proprietary systems, where information acquired an ad-hoc structure, there have been some remarkable attempts to structure the information in contracts electronically. COSMOS [1] was an e-commerce architecture supporting catalogue browsing, contract negotiation and contract execution. It defined a contract model in UML and proposed a CORBA-based software architecture in a coherent manner. DocLog [2] was an electronic contract representation language introduced in 2000 with an ‘XML-like’ structure, which anticipated the next generation of XML-based contract representations. When XML was mature enough, it was seen as a good container for contract clauses, and thus the new format specifications came in the form of an XML Schema or a DTD (Document Type Definition). An effort to achieve a common XML contract representation was the Contract Expression Language (CEL) [3],
developed by the Content Reference Forum. It formalized a language that enabled machine-readable representation of typical terms found in content distribution contracts and was compliant with the Business Collaboration Framework [4], but it was not finally standardized. In the following years, the advent of the Semantic Web reached the contract expression formats, and new representations evolved from the syntactic level to the semantic one ([5][6][7]), with domain ontologies being developed in the KIF (Knowledge Interchange Format) or OWL (Web Ontology Language) languages. Climbing further up the Semantic Web layered model, first RuleML and later SWRL (Semantic Web Rule Language) were adopted as the new model container for electronic contracts, given that a contract declares a set of rules [8]. SWRL provides a Web-oriented abstract syntax and declarative knowledge representation semantics for rules, but the concrete syntax can take the form of RDF (Resource Description Framework), thus providing a seamless integration with OWL ontologies. Some of these contract models have also been aimed at governing Information Technology systems [9][10].

2.2 Contract Representation with eContracts

Currently, the most widely acknowledged standard is eContracts, promoted by the OASIS consortium. Its electronic contract representation again banks on XML, and it has gained rapid acceptance. It is the culmination of the work of the LegalXML eContracts Technical Committee, which started in 2002 to evaluate a possible eContracts Schema and achieved its final form in 2007 [11]. The model proposed in this paper uses the eContracts standard as a framework for the execution of contracts in the audiovisual B2B sector. With eContracts as the container, the contents are expressed with the help of some existing ontologies in the field, like the Media Value Chain Ontology [21], and the principles of deontic logic used for contract execution in other formalizations.
Fig. 1. Top elements in the eContracts XML contract
eContracts is a standard aimed at representing general contracts, having no particular field in scope. eContracts documents are composed of general paragraphs and clauses, the main XML elements being ec:item, ec:title, ec:block and ec:text, with the item element used recursively. Figure 1 shows the XML schema for the root and the main elements (the root element is, of course, ec:contract). ec:metadata elements in contracts allow the specification of the contract date, creator or title (using the Dublin Core metadata elements). Contract parties are declared in the ec:contract-front part, with the clauses placed under the ec:body element.

2.3 Representation of Audiovisual Contracts

The most common clauses found in the body of multimedia content contracts have been studied [12] by analyzing a number of paper contracts in the sector. Besides the metadata and header clauses, the main kinds of clauses are listed here:
• Rights. Actions that the licensee can execute.
• Resource. Object of the actions to be executed.
• Report and Auditing. Obligation to report on sales or action executions.
• Fee. Fee to be paid in exchange for a transfer of IP or content delivery.
• Territory. Spatial condition imposed on the execution of an action.
• Term. Temporal condition imposed on the execution of an action.
• Confidentiality. Prohibition to release information around the agreement.
• Disclaimer. Denial of responsibilities on certain issues.
• Jurisdiction. Agreed legal frame and court in case of dispute.
• Breach and termination. Conditions under which the contract ends or is terminated.
These kinds of clauses may constitute refinements of the existing eContracts elements like ec:item or ec:block. But a better refinement of the structure could be given by specifying the nature of the clause. Clauses can be classified according to one of the standard deontic logic notions of ‘prohibition’, ‘obligation’ and ‘permission’. Some other clauses are auxiliary to the former, only describing facts agreed by both parties (assertions). Each of the clauses in the contract can thus be classified as one of them (prohibition, obligation, permission or assertion), see Table 1. Contract metadata and the contract front also constitute assertions. The namespace aec stands for audiovisual electronic contracts schema.

Table 1. Kinds of clauses and their meaning

Kind of clause       | Meaning                        | Example clause
1. aec:permission    | What the licensee can do       | Licensee rights
2. aec:prohibition   | What the licensee cannot do    | Confidentiality
3. aec:obligation    | What the licensee must do      | Fee, territory, term
4. aec:assertion     | What both parties agree it is  | Jurisdiction
ec:metadata          | Data on the contract itself    | Contract date
ec:contract-front    | Contract heading               | Contract parties
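As a rough illustration of this classification step, the sketch below tags clause types with one of the four deontic kinds. It is purely hypothetical: the enum values mirror the aec terms of Table 1, but the lookup table and all identifiers are our own invention, not part of the eContracts schema or of any tooling described by the authors.

```python
from enum import Enum

class ClauseKind(Enum):
    PERMISSION = "aec:permission"    # what the licensee can do
    PROHIBITION = "aec:prohibition"  # what the licensee cannot do
    OBLIGATION = "aec:obligation"    # what the licensee must do
    ASSERTION = "aec:assertion"      # what both parties agree it is

# Mapping restricted to the example clauses listed in Table 1
KIND_BY_CLAUSE_TYPE = {
    "rights": ClauseKind.PERMISSION,
    "confidentiality": ClauseKind.PROHIBITION,
    "fee": ClauseKind.OBLIGATION,
    "territory": ClauseKind.OBLIGATION,
    "term": ClauseKind.OBLIGATION,
    "jurisdiction": ClauseKind.ASSERTION,
}

def classify(clause_type: str) -> ClauseKind:
    # Unlisted clause types fall back to plain assertions
    return KIND_BY_CLAUSE_TYPE.get(clause_type.lower(), ClauseKind.ASSERTION)

print(classify("Territory"))   # ClauseKind.OBLIGATION
```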
3 Contract Representation with Rights Expression Languages and the Media Value Chain Ontology

A parallel and practical effort in automating the trade of multimedia assets has been made with Digital Rights Management (DRM) systems, with electronic licenses as the contract-like elements governing the transactions. These licenses are expressed in a Rights Expression Language (REL) and they can be seen as effective electronic contracts that are being enforced. Examples of RELs are MPEG-21 REL [13] and the Open Digital Rights Language (ODRL) [14]. This joint analysis of RELs as electronic contracts is not new [12], but so far it has not received enough attention. In both MPEG-21 REL and OMA DRM, an XML file contains a license which expresses the rights one of the parties has and the conditions that have to hold. And in both cases, these licenses can represent either in-force contracts or license offers. Compared with Section 2.3, the contract clauses representable in a REL license are only those directly needed for rights enforcement: the parties clause, the rights clause, and some of their enforceable conditions (fee, territory, grant etc.).

3.1 Parties in REL Licenses

Licenses always refer to two parties, like most real contracts. Actually, an MPEG-21 license may contain several grants, each of them with a different party, but then we can consider the grant as the basic license unit. In MPEG-21 language, parties are called issuer and principal, while in ODRL they are directly referred to as parties, classified as ‘end users’ and ‘rights holders’. No more information is given about who these parties might be, except that they are uniquely identified, and that one of them (the rights issuer) electronically signs the document. In the framework of MPEG-21, the concept of ‘user’ includes “individuals, consumers, communities, organizations, corporations, consortia, governments and other standards bodies and initiatives around the world” [15]. In ODRL, parties can be humans, organizations, and defined roles. According to the standards, users are only defined by the actions they perform, but judging by the expressivity of both RELs, licenses can only refer to end users and distributors. This is enough for most DRM platforms, but a contract model should consider all the user roles appearing in the complete media value chain.

3.2 Rights and Conditions in REL Licenses

Rights and conditions are declared together in REL licenses, and they are declared in only one direction: rights mean rights of the licensee, and conditions mean conditions that have to be met for the licensee to execute his rights. In practice, contracts in the audiovisual sector include clauses in both directions: each of the parties may have rights, obligations and prohibitions. The rights defined in MPEG-21 REL and ODRL (Table 2) are focused on only one of the directions, and they are not enough to describe all the actions that are permitted in contracts. The list of actions and rights needed to express the contract information is given in Table 3, which was elaborated after a systematic analysis of existing contracts in the audiovisual market [12].
Table 2. MPEG-21 REL rights and ODRL permissions

MPEG-21 REL rights
  End user: Enlarge, Reduce, Move, Adapt, Extract, Embed, Play, Print, Execute, Install, Uninstall, Delete
  Distributor: Issue, Revoke, Obtain, Modify

ODRL permissions
  Usage (end user): Display, Print, Play, Execute
  Reuse (end user): Modify, Excerpt, Annotate, Aggregate
  Asset Management (end user): Move, Duplicate, Delete, Verify, Backup/Restore, Install/Uninstall
  Transfer (distributor): Sell, Lend, Give, Lease
Table 3. Main actions and rights to be considered in a contract representation

Most common rights appearing in contracts: Reproduce, Download, Upload, MakeAvailable, PubliclyPerform, Broadcast, Copy, Print, Record, Modify, Adapt, Convert, Transcode, Remix, Distribute, Lease, License, Promote, Stream, Translate, Advertise, Dub, Transmit, Exhibit, Sell
The comparison shows that MPEG-21 rights and ODRL permissions do not completely represent the information expressed in B2B contracts, and although RELs foresee mechanisms for extending the rights list, the main unaddressed issue is that they were not conceived for B2B.

3.3 Parties and Rights in the Media Value Chain Ontology

Much of the mismatch between REL expressivity and contract reality can be reduced with the help of the Media Value Chain Ontology (MVCO) [16]. The XML representation of contracts in the form of REL licenses is of limited expressivity compared to the ontology-based contracts presented in Section 2. However, none of the domain ontologies in Section 2 has been applied in the context of a content distribution system or a DRM system. The Media Value Chain Ontology is a semantic representation of intellectual property along the value chain, conceived in the framework of the MPEG-21 standard. The MVCO is based on work by the authors [17] and on an ontology that is part of the Interoperable DRM Platform (IDP) published by the Digital Media Project. The Media Value Chain Ontology is represented using the expressivity of OWL-DL, and thus each class is well defined and related to a set of attributes and to other classes in a very precise way. In practice, applications can be deployed where the particular users, IP entities, actions etc. are instances of the ontology. The model defines the minimal set of kinds of Intellectual Property, the roles of the users interacting with them, and the relevant actions regarding Intellectual Property law.
Although the MVCO was not intended specifically to describe contracts, its vocabulary is useful given that contracts on audiovisual material are essentially contracts on its intellectual property. If every contract represents an agreement between two parties who belong to the value chain, contracts can be classified according to the signing parties. Figure 2 shows the typical names of the contract types and relates them to the parties, including the contract between End User and Distributor (usually an oral contract).

[Figure 2 relates the value-chain roles – Creator, Adaptor, Instantiator, Producer, Distributor, Broadcaster and End User – through contract types such as the Adaptation, Synchronization, Execution, Performance, Edition, Exploitation, Distribution, Broadcast and Cable TV contracts, and the End User license.]
Fig. 2. Kinds of contracts in the media value chain
An example of an eContracts fragment declaring the parties using the MVCO is given in Figure 3.

<ec:contract xmlns="urn:oasis:names:tc:eContracts:1:0">
  <ec:contract-front>
    <ec:parties>
      <ec:party><mvco:Distributor rdf:about="#Alice"/></ec:party>
      <ec:party><mvco:EndUser rdf:about="#Bob"/></ec:party>
    </ec:parties>
  </ec:contract-front>
</ec:contract>

Fig. 3. eContracts parties declaration using the MVCO expressions
The IP value chain considers the different kinds of IP entities linked from the original work to the final product, as can be seen in Figure 4. There is a parallel between the chain of IP and the chain of contracts. Figure 4 shows the IP entities along the value chain, starting from work as the original abstract conception of an artist and finishing in the product as the most elaborate IP entity ready to be enjoyed by the end user.
Fig. 4. The IP Value Chain
The MVCO can add much of its vocabulary to the semantics of contract representation. For example, three of its main classes are “Action” (comparable to the rights analyzed in the previous section), “User” (comparable to the parties in the contracts) and “IP Entity” (a classification of the objects of the contract from an IP point of view). Table 4 lists these classes and their immediate subclasses.

Table 4. Main classes of the ontology

Root classes | Subclasses
IP Entities  | Work, Adaptation, Manifestation, Instance, Copy, Product
Roles        | Creator, Adaptor, Instantiator, Producer, Distributor, EndUser
Actions      | TransformingActions (adapt, perform, etc.), EndUserActions (play, etc.)
Among the different parties and interests in the value chain, we may find creators, adaptors, performers, producers, distributors or broadcasters, all of them adding value to the product, and all of them tied by agreements in which Intellectual Property rights are handed over in exchange for economic compensation. Contract parties, if expressed in the terms of the MVCO, belong to one of the following classes: Creators, Adaptors, Instantiators, Producers, Distributors and EndUsers. Being broader in scope, the MVCO provides a good starting point for deriving new actions and parties to be used in a contract vocabulary.

3.4 RDF Representation of Deontic Clauses with MVCO

OWL-DL is a Description Logics knowledge representation language whose expressions can be mapped to a first-order predicate logic system. Predicates are verb phrase templates that describe properties of objects, or a relationship among objects represented by the variables (e.g. “Bob is a Creator”). As the given statements representing
the domain knowledge constitute a formal deductive system, the ontology can be queried (e.g. “has Bob created any Work?”). For each syntactically correct expression, the OWL-DL ontology is able to assert its truth value: either true, false or unknown (for the latter case, note that OWL uses the open world assumption). All of the above makes OWL an ideal means to handle the truth of propositions. However, not all propositions in the English language (or in human thinking) convey a truth value. Commands, questions or deontic expressions cannot be said to be true or false, and contracts carry their most valuable information in sentences like these (e.g. “Party A must pay party B on a yearly basis”). This kind of expression lies in the field of deontic logic [22]. Modal logics are concerned with other modalities of existence (usually necessity or possibility), and introduce two new monadic operators related like this:
◊P ↔ ¬□¬P  and  □P ↔ ¬◊¬P

Deontic logic is a kind of modal logic of the highest interest for representing contracts: in place of the operators □ and ◊ we can read “Obligation” and “Permission” (in the above expressions, “P is obligatory” is equivalent to “it is not permitted not P”). Actually, only one of the two operators is strictly necessary, as the second can be deduced from the first, but both are usually kept for readability. In these expressions, P is no more than an alethic proposition. The MVCO defines a class “Fact” with a definite truth value (overcoming the open world assumption, which allowed an unknown state), and an object property “hasRequired” which, linked to a Permission, enables the expression of obligations. The most important clauses found in multimedia content contracts, as defined in Section 2.3, are either alethic sentences (we call them Assertions and represent them with the symbol P) or deontic expressions, the latter being either a Permission (◊P), a Prohibition (¬◊P) or an Obligation (¬◊¬P). Similar approaches to the treatment of contracts can also be found in the literature [20]. Clauses which are liable to be enforced are enclosed in the aec:enforceable element.

3.5 Semantic Contract Representation

We now have all the ingredients to describe a semantic representation of audiovisual contracts. First, we have a good structure given by the eContracts standard, which is naturally extendable. A general-purpose standard like eContracts, with only 51 XML elements, cannot cover the details of any particular domain, but it can be refined. Then, we have a rich vocabulary of specific terms from the DRM world (specified formally in the RELs), which is complemented by the elements of the semantic model of the complete value chain given in the Media Value Chain Ontology. Note that even though the MVCO is an OWL-DL ontology, the declaration of its individuals is given in RDF and is, as such, easily integrable into the eContracts structure. Note also that multimedia objects, if properly annotated with multimedia semantics, can bear information about their intellectual property nature, making the integration easier. A real eContracts clause carrying RDF triples of the MVCO can take the form shown in Figure 5.
00 <ec:body>
01   <ec:item>
02     <aec:enforceable>
03       <mvco:Permission rdf:about="#Permission000">
04         <mvco:permitsAction rdf:resource="#Action000"/>
05         <mvco:issuedBy rdf:resource="#Alice"/>
06         <mvco:hasRequired rdf:resource="#Germany"/>
07       </mvco:Permission>
08
09       <mvco:MakeAdaptation rdf:about="#Action000">
10         <mvco:actedBy rdf:resource="#Bob"/>
11         <mvco:actedOver rdf:resource="#mywork1"/>
12       </mvco:MakeAdaptation>
13       <mvco:Fact rdf:about="#Germany">
14         <rdfs:label>ISO:DE</rdfs:label>
15       </mvco:Fact>
16       <mvco:Work rdf:about="#mywork1">
17         <mvco:hasRightsOwner rdf:resource="#Alice"/>
18       </mvco:Work>
19
20     </aec:enforceable>
21   </ec:item>
22 </ec:body>
Fig. 5. eContracts clause integrated with a MVCO Permission
The XML snippet in Figure 5 asserts that there is a work called mywork1 (lines 16-18), that there is a Fact called Germany (lines 13-15), and that there is an action consisting of Bob “making an adaptation over mywork1” (lines 09-12). It also declares that there is a permission given by Alice for Bob to make an adaptation provided that it is in Germany (lines 03-07). All this information is enforceable (line 02).
4 Execution of Semantic Audiovisual Contracts

What is called the execution of contracts is no more than the authorisation of some media transfers (or of the keys to decrypt them) and the automatic dispatch of payment orders. Both the transfer of bits and the transfer of money can be ordered as the result of the execution of some rules given in the audiovisual eContract, fired by certain events (purchase, consumption, etc.). Figure 6 depicts a general execution environment for audiovisual contracts, where the “semantic expression of contracts” is the eContracts extension we have described together with the MVCO elements and their extensions, the semantic expression of the context describes the environment (spatial, temporal, etc.), also as RDF, and the firing events are user requests for rights execution or purchases (contract proposal acceptances). The detailed syntax of these expressions exceeds the scope of this paper, but they are a set of SWRL rules which ultimately deliver an authorisation result and dispatch event reports.
Fig. 6. Execution environment for audiovisual electronic contracts
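As a rough illustration of how such an authorisation step could work, the following sketch checks a requested action against a permission whose required facts must hold in the current context. It is a deliberately simplified, hypothetical model: the class and attribute names loosely mirror the MVCO terms used above, and it stands in for the SWRL rule evaluation the authors describe rather than reproducing it.

```python
from dataclasses import dataclass, field

@dataclass
class Permission:
    issued_by: str            # rights owner granting the permission
    action: str               # e.g. "MakeAdaptation"
    acted_by: str             # licensee allowed to act
    acted_over: str           # IP entity the action operates on
    required_facts: set = field(default_factory=set)  # facts that must hold, e.g. {"Germany"}

def authorise(request, permissions, context_facts):
    """Return True if some permission covers the requested action and all of
    its required facts hold in the current execution context."""
    action, user, ip_entity = request
    for p in permissions:
        if (p.action, p.acted_by, p.acted_over) == (action, user, ip_entity) \
                and p.required_facts <= context_facts:
            return True
    return False

# Example mirroring Figure 5: Alice permits Bob to adapt mywork1 in Germany
perms = [Permission("Alice", "MakeAdaptation", "Bob", "mywork1", {"Germany"})]
print(authorise(("MakeAdaptation", "Bob", "mywork1"), perms, {"Germany"}))  # True
print(authorise(("MakeAdaptation", "Bob", "mywork1"), perms, {"Spain"}))    # False
```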
5 Conclusions

This work acknowledges REL licenses as the governing element in DRM systems for B2C distribution of multimedia content and regards such licenses as the digital version of end-user or distributor contracts. However, after an analysis of real contracts in the B2B market for IP content, it was observed that more flexibility is required to cope with the complexity of those narrative contracts. On the other hand, other electronic contract representations lack the formalism needed to steer content distribution systems. The MVCO, a recently presented ontology of the IP value chain model, may overcome the limitations of the existing RELs and may merge well into the OASIS eContracts structure. This combination can govern a content distribution system with all the value chain players if some additions are made. In particular, an event description system is needed, as well as an authorisation mechanism capable of processing the dynamic events, the current context and these MVCO-extended eContracts. The execution of SWRL rules can determine this authorisation and make the electronic contracts truly semantic containers.
References
1. Kobryn, C., Atkinson, C., Milosevic, Z.: Electronic Contracting with COSMOS – How to Establish, Negotiate and Execute Electronic Contracts on the Internet. In: 2nd Int. Enterprise Distributed Object Computing Workshop (EDOC 1998), CA, USA (1998)
2. Tan, Y.-H., Thoen, W.: DocLog: an electronic contract representation language. In: Proceedings of the 11th International Workshop on Database and Expert Systems Applications (September 2000)
3. CRF Content Reference Forum: CEL: Contract Expression Language (2002), http://www.crforum.org/candidate/CELWG002_celspec.doc
4. Hofreiter, B., Huemer, C.: UN/CEFACT's Business Collaboration Framework – Motivation and Basic Concepts. In: Proceedings of the Multi-Konferenz Wirtschaftsinformatik, Germany (March 2004)
5. Kabilan, V., Johannesson, P.: Semantic Representation of Contract Knowledge using Multi Tier Ontology. In: Proceedings of the First International Workshop on Semantic Web and Databases, Germany (June 2003)
6. Llorente, S., Delgado, J., Rodríguez, E., Barrio, R., Longo, I., Bixio, F.: Generation of Standardised Rights Expressions from Contracts: An Ontology Approach? In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2005. LNCS, vol. 3762, pp. 836–845. Springer, Heidelberg (2005)
7. Yan, Y., Zhang, J., Yan, M.: Ontology Modeling for Contract: Using OWL to Express Semantic Relations. In: Proceedings of the 10th IEEE International Enterprise Distributed Object Computing Conference, pp. 409–412. IEEE Computer Society, Los Alamitos (2006)
8. Paschke, A., Bichler, M., Dietrich, J.: ContractLog: An Approach to Rule Based Monitoring and Execution of Service Level Agreements. In: Adi, A., Stoutenburg, S., Tabet, S. (eds.) RuleML 2005. LNCS, vol. 3791, pp. 209–217. Springer, Heidelberg (2005)
9. Morciniec, M., Salle, M., Monahan, B.: Towards Regulating Electronic Communities with Contracts. In: Proceedings of the 2nd Workshop on Norms and Institution in Multi-agent Systems, Canada (June 2001)
10. Krishna, P.R., Karlapaplem, K., Dani, A.R.: From Contracts to E-Contracts: Modeling and Enactment. Information Technology and Management 6(4), 363–387 (2005)
11. Leff, L., Meyer, P. (eds.): OASIS LegalXML eContracts Version 1.0 Committee Specification (2007), http://docs.oasis-open.org/legalxml-econtracts/CS01/legalxml-econtracts-specification-1.0.pdf
12. Rodríguez, V., Delgado, J., Rodríguez, E.: From Narrative Contracts to Electronic Licenses: A Guided Translation Process for the Case of Audiovisual Content Management. In: Proceedings of the 3rd International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, Spain (November 2007)
13. ISO/IEC 21000-5:2004, Information technology – Multimedia framework (MPEG-21) – Part 5: Rights Expression Language (2004)
14. Ianella, R.: Open Digital Rights Language (ODRL) Version 1.1, W3C (September 2002), http://www.w3.org/TR/odrl/
15. Bormans, J., Hill, K.: MPEG-21 Overview v.5, ISO/IEC JTC1/SC29/WG11/N5231 (October 2002)
16. MPEG-21, Media Value Chain Ontology, Committee Draft, ISO/IEC JTC1/SC29/WG11 N10264, Busan, South Korea (October 2008)
17. Gauvin, M., Delgado, J., Rodriguez-Doncel, V.: Proposed RRD Text for Approved Document No 2 – Technical Reference: Architecture, v. 2.1, Digital Media Project DMP0952/AHG40 (July 2007)
18. Guth, S., Simon, B., Zdun, E.: A Contract and Rights Management Framework Design for Interacting Brokers. In: Proc. of the 36th Hawaii International Conference on System Sciences (HICSS), Big Island, Hawaii, USA (January 2003)
19. AXMEDIS: Specification of Axmedis, AX4HOME Architecture, Automatic Production of Cross Media Content for Multi-Channel Distribution, DE12.1.3.1 (July 2007)
20. Prisacariu, C., Schneider, G.: A Formal Language for Electronic Contracts. In: Bonsangue, M.M., Johnsen, E.B. (eds.) FMOODS 2007. LNCS, vol. 4468, pp. 174–189. Springer, Heidelberg (2007)
21. Rodríguez, V., Gauvin, M., Delgado, J.: An Ontology for the Expression of Intellectual Property Entities and Relations. In: Proceedings of the 5th International Workshop on Security in Information Systems (WOSIS 2007), Portugal (April 2007)
22. Rodríguez, J.: Lógica deóntica: Concepto y Sistemas. Universidad de Valencia, Secretariado de Publicaciones, Valencia (1978)
23. Rodríguez, V., Delgado, J.: Multimedia Content Distribution Governed by Ontology-Represented Contracts. In: Workshop on Multimedia Ontologies and Artificial Intelligence Techniques in Law, Netherlands (December 2008)
A Conceptual Model for Publishing Multimedia Content on the Semantic Web

Tobias Bürger and Elena Simperl

Semantic Technology Institute (STI), University of Innsbruck, Innsbruck, Austria
{tobias.buerger,elena.simperl}@sti2.at
Abstract. Retrieving multimedia remains a challenge on the Web of the 21st century. This is due, among other things, to the inherent limitations of machine-driven multimedia understanding, limitations which equally hold on the Web 2.0 or on its semantic counterpart. Describing multimedia resources through metadata is thus often seen as the only viable way to enable efficient multimedia retrieval. Assuming the availability of metadata descriptions, effectively indexing multimedia requires mediating among the wide range of metadata formats presently used to annotate or describe these resources. Semantic technologies have been identified as a potential solution for this large-scale interoperability problem. The work presented in this paper builds upon this last statement. We introduce RICO, a conceptual model and a set of ontologies to mark up multimedia content embedded in Web pages and to deploy such multimedia descriptions on the Semantic Web. By using semantic metadata referencing formal ontologies, we not only provide a basis for the uniform description of multimedia resources on the Web, but also enable automatic mediation between metadata standards, and intelligent multimedia retrieval features.
1 Introduction
In 2001, Tim Berners-Lee et al. introduced the idea of an augmented World Wide Web in which information is meaningful for machines as well as for humans [1]. Since then, and encouraged by phenomena such as the Web 2.0, the Web has turned into a global multimedia environment in which billions of professionally produced, but also more and more end-user generated, content items are published, shared, and used every day. In the last decades much research has been done on how to analyze, organize, and annotate the rapidly increasing amounts of multimedia to enable better search, retrieval, and reuse of content (cf. [2]). Still, a generic solution applicable at the scale of the current and future Web has not been delivered yet. This is due, among other things, to the so-called “Semantic Gap” [3], which refers to the inherent gap between low-level multimedia features, which can feasibly be extracted automatically, and high-level, semantic ones, which cannot be accurately derived without human intervention [4]. The advent of Web 2.0-style
technologies, notably tagging systems, can be seen as a preliminary answer to this problem. The availability of tags, however, does not solve all issues related to multimedia retrieval. A first, still open issue is related to the relevance of the available metadata within the retrieval process. There seems to be a gap between the multimedia descriptions available and the queries end-users issue in Web search engines when looking for multimedia content, as shown, for example, in a recent study on image search behavior [5]. A second, equally complex issue is the indexing of multimedia content which can be found on the Web. Efficient indexing techniques require rich descriptions of multimedia resources covering a multitude of aspects [6]: low-level features such as color histograms or motion vectors, high-level features such as aboutness, and contextual information referring to the ways content can be used. The diversity of multimedia types, and of the description standards thereof, hampers efficient indexing as well. Unlike in classical information retrieval, multimedia retrieval needs techniques to search, index and rank images, videos and music, as well as more interactive resources such as slideshows, learning material, or even 3D presentations. A last issue worth mentioning is the level of granularity at which metadata are typically provided. A significant share of multimedia content remains largely hidden to multimedia search engines due to the fact that fine-granular descriptions of such content are not available, or are poorly exploited. Parts of multimedia resources are not retrievable, as metadata is available at most for resources as a whole. Furthermore, information about usage or license conditions is not explicitly linked to the respective multimedia resources, even though this information is mostly available in textual form.

Providing a common interface to search and retrieve the various types of multimedia available online is challenging. This interface needs to abstract from media types as such, to automatically mediate among the various schemes and formats presently used to annotate and describe multimedia, and to cover all aspects previously mentioned in order to enable fully-fledged multimedia retrieval. In this paper we propose RICO, a conceptual model to describe multimedia resources on the Web. The model implements a multimedia resource-centric view of conventional Web pages, thus being equally applicable to multimedia content published within Web pages (e.g., images embedded in news stories), to social media, and to professional media licensing sites. It is implemented as a set of Semantic Web ontologies (using W3C recommendations such as RDF and OWL) and uses RDFa [7] to mark up multimedia resources inline in HTML pages. By using semantic metadata referencing formal ontologies, RICO not only provides a basis for the uniform description of multimedia resources on the Web, but also enables automatic mediation between metadata standards and intelligent multimedia retrieval features such as automatic reasoning about the contents and structure of multimedia resources and their usage conditions. Compared to related work in the field of semantic multimedia (cf. Section 4), our main contributions are the comprehensiveness of the conceptual model and the ways it can be deployed on the Semantic Web. RICO captures in-depth information about the multimedia resources themselves, but also contextual
information about the current or prospective usage of these resources. RICO allows descriptions, annotations and multimedia resources to be related to each other by means of structural or logical relations. In doing so, it goes beyond the functionality of standards such as MPEG-7, whose Multimedia Description Schemes are restricted to the structure of multimedia resources and associated components, objects and events. The usage context of a particular resource – rights, reviews, opinions, etc. – is also taken into account. This information is bundled together with the resource itself as a logical package, similarly to standards such as METS [8], IMS Content Packaging [9] or OAI-ORE [10].

The remainder of this paper is organized as follows: We present the RICO conceptual model in Section 2 and the way it can be used on the Semantic Web in Section 3. We compare our approach with related work in Section 4 before summarizing our achievements and outlining future directions of research and development in Section 5.
2 The RICO Conceptual Model
The primary aim of the RICO model is to provide a metadata framework for the description of multimedia content and accompanying information on the Web. RICO is intended to group the available information about a multimedia resource into a compound information package in order to make it available as a single accessible unit of value. This idea is illustrated in Figure 4: there, all information which is available about an image on a Web site is grouped into a single package. The overall design of the RICO model is based on the requirements outlined in [11,12]: the RICO model uses semantics to reduce the issue of ambiguity and to enable interoperability between different standards. Furthermore, richer descriptions are provided along with multimedia resources: on the one hand, descriptions published together with multimedia content on Web sites are explicitly linked to the content itself in order to enable more efficient indexing. On the other hand, multimedia content is accompanied by rich metadata sets following a metadata model which captures a multitude of features proven to be relevant for efficient retrieval purposes (cf. Section 2.2).
Fig. 1. The RICO Conceptual Model: Coverage of Different Metadata Types
The RICO model is separated into two parts (cf. Figure 1): The first part is called the RICO data model and covers structural metadata to describe information about what makes up the information package. The second part, the RICO metadata model, covers descriptive, administrative, and use metadata. Both the data and the metadata model within RICO are implemented as Semantic Web ontologies, which are introduced in Section 2.3.

2.1 The RICO Data Model
The general aim of the RICO data model is to offer a set of well-defined concepts to describe different aspects of multimedia resources and accompanying information on Web pages. As such, its aim is to lay a graph over content published on the Web, associating descriptions in a Web page with it, and to bind metadata to content and descriptive information. As a baseline for our model we use the MPEG-21 Digital Item Declaration (DID) Abstract Model [13]. The assessment of different models was made in accordance with a general comparative framework for multimedia content models introduced in [14]. MPEG-21 fulfills the basic characteristics of an adaptable data model proposed in this framework, namely granular description of fragments, media elements (resources), grouped resources (or components), and the possibility to identify and select (parts of) resources.
Fig. 2. Main Elements Within the MPEG-21 Digital Item Declaration Model Used in RICO
The basic parts of an MPEG-21 Digital Item which are interesting from this perspective are depicted in Figure 2: they most notably include containers, in which identifiable digital assets (items) are included and which contain (multimedia) resources. Fragments of these resources can be selected, and both resources and their fragments can be described via descriptors or annotated via annotations. Descriptors represent author-given metadata, while annotations provide additional information contributed by other users in the life cycle of a resource. The RICO data model is most notably realized by a Semantic Web ontology which covers the MPEG-21 DID Abstract Model and, amongst others, makes the
semantic types of relations between media resources, components, or fragments and their descriptors explicit. By that, the RICO DID ontology overcomes some problems of the MPEG-21 DID XML Schema that can be traced back to its lack of formal semantics. The nesting of elements in the schema, especially for descriptive elements, is ambiguous – sometimes the descriptions belong to the parent element in the XML Schema, sometimes to the siblings of the respective element. The MPEG-21 DID Abstract Model, and implicitly our ontological model thereof, is compatible with the OAI Abstract Information Model [15], as shown in [16], and with learning content models such as IMS Content Packaging [9] or METS [8], as proven by the RAMLET working group¹.

¹ cf. http://www.ieeeltsc.org/working-groups/wg11CMI/ramlet

2.2 The RICO Metadata Model
The RICO metadata model captures different types of metadata which have been identified as important for multimedia retrieval by media professionals [17]:

Bibliographic metadata is concerned with the authorship of content and includes fields such as identification, naming, publication or categorization.
Technical metadata typically describes physical properties such as format, bit-rate and what are mostly called low-level features of content [2].
Classification metadata includes keywords or tags, but also domain-specific classification information.
Evaluative metadata includes ratings and qualitative assessments of the content.
Relational metadata covers relations between the content and other related external resources. Relations may be explicitly specified through the design of the content object or external objects, or could be derived from observations or the usage history.
Rights metadata describes the terms of use of a multimedia resource.
Functional metadata refers to operations that may be supported to alter the presentation of the content, to customize or personalize the content, or to provide access to different versions of the content.

The RICO metadata model can be further extended with domain-specific aspects such as educational characteristics for a learning object or preservation data for archival information. In this respect our approach is in line with the vision of Resource Profiles introduced in [18]. At the level of the metadata model we differentiate between authoritative and non-authoritative metadata. The former is contributed by the author (creator) of the multimedia content and reflects persistent information about it. The latter is provided by consumers or other third parties and refers to contextual and mostly changing aspects related to the content [19]. Authoritative metadata is typically created manually, whilst non-authoritative metadata may either be explicitly provided (through annotations, reviews, ratings, etc.) or automatically generated
(through harvesting, usage analysis or the like). In order to support efficient filtering and optimal usage, these different metadata types are conceptually separated into so-called “metadata sets”.
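To make the notion of metadata sets concrete, the following minimal Python sketch shows one possible way of representing the separation between authoritative and non-authoritative metadata; the class and field names are illustrative and are not part of the RICO specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MetadataSet:
    """A named group of statements about one resource, kept separate so that
    authoritative and non-authoritative data can be filtered independently."""
    provider: str                 # who contributed the set (author, platform, end user)
    authoritative: bool           # True only for author/creator-supplied sets
    statements: Dict[str, object] = field(default_factory=dict)

@dataclass
class ResourceDescription:
    resource_uri: str
    metadata_sets: List[MetadataSet] = field(default_factory=list)

    def filtered(self, authoritative_only: bool) -> List[MetadataSet]:
        # Efficient filtering is the point of the separation: consumers can
        # ignore volatile third-party data or, conversely, rank by it.
        return [s for s in self.metadata_sets
                if s.authoritative or not authoritative_only]

desc = ResourceDescription("http://example.org/photo42.jpg")
desc.metadata_sets.append(MetadataSet("author", True, {"dc:title": "Sunset"}))
desc.metadata_sets.append(MetadataSet("platform", False, {"rating": 4.5}))
print(len(desc.filtered(authoritative_only=True)))   # -> 1
```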
2.3 The RICO Ontologies
The data and metadata models introduced in the previous sections are implemented as a set of Semantic Web ontologies in order to publish semantic descriptions of multimedia resources on the Web. The ontologies were built to increase the interoperability of media resource descriptions on the Web. To be compatible with popular ontologies, and to ease the uptake of the RICO model, the RICO ontologies reuse or build on existing ontologies. They define their own terms only when the needed terms do not exist in any suitable ontology or when their meaning should be defined more precisely; in the latter case, the new terms are linked to other vocabularies via OWL/RDF(S) modeling constructs. Where possible, terms are aligned to other vocabularies via mappings.
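As an illustration of this reuse-and-align strategy, the following sketch uses the rdflib library to declare a new term and link it to an existing vocabulary; the RICO namespace URI and the chosen alignment axiom are hypothetical examples, not the actual axioms of the RICO ontologies.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Hypothetical namespace; the real RICO/MARCO URIs are defined by the authors.
RICO = Namespace("http://example.org/rico#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
g.bind("rico", RICO)
g.bind("dc", DC)

# A newly defined term...
g.add((RICO.keyword, RDF.type, OWL.DatatypeProperty))
# ...aligned to an existing vocabulary instead of being redefined from scratch.
g.add((RICO.keyword, RDFS.subPropertyOf, DC.subject))
# Reused terms are simply referenced by their original URIs.
g.add((RICO.Item, RDFS.subClassOf, OWL.Thing))

print(g.serialize(format="turtle"))
```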
Fig. 3. The Reusable Intelligent Content Objects (RICO) Ontologies (import graph relating RICO to MARCO – including the MARCO Intra- and Inter-Relations and Ratings ontologies – MPEG-21 DID, DigitalMedia, Dublin Core and Dublin Core Terms, FOAF, Annotea Annotations, the Tag ontology, the FRBR Relationship model, the Commerce vocabulary and the CC Copyright vocabulary)
The import graph of the RICO ontologies is shown in Figure 3; the dark-gray and white ontologies were built in the course of this work, while the others were reused (and adapted). The RICO Core ontology imports several other ontologies as follows:
– The MPEG-21 DID ontology implementing the RICO data model.
– The Mindswap Digital Media ontology (http://www.mindswap.org/2005/owl/digital-media), which is used to type multimedia resources.
– The FOAF ontology for representing users and their properties (an OWL-DL version was used; cf. http://www.mindswap.org/2003/owl/foaf).
– The Annotea annotations ontology to represent annotations which can be attached to the content and to support referencing of classification information (cf. [20]); the Annotea ontology available at http://www.w3.org/2000/10/annotation-ns# was rebuilt in OWL.
– The OWL-Lite version of the Dublin Core ontology (http://www.dublincore.org) to attach basic descriptive properties to the elements of the RICO data model.
– The MARCO Core ontology, which covers a core set of properties commonly used on the Web to describe bibliographic aspects, classifications of resources, and rights. It particularly makes use of
• the MARCO Inter- and Intra-Relations ontology, which reflects relations between media resources and between descriptions and the resources (the MARCO Inter-Relations ontology is based on the Functional Requirements for Bibliographic Records (FRBR) model [21]);
• the MARCO Ratings ontology, which reflects ratings about qualities of media resources.
The MARCO Core ontology further reuses (parts of) the Dublin Core Terms ontology (http://dublincore.org/documents/dcmi-terms/), the Tag ontology (http://www.holygoat.co.uk/projects/tags/), and the Commerce vocabulary (http://digitalbazaar.com/commerce).
2.4 RICO Profiles
The RICO model and its ontologies are generic with respect to media types: they are not restricted to a particular set of media types, but can be applied to the full range of media resources available on the Web. This means, on the one hand, that RICO provides a core set of elements which can be used for description and retrieval across media types. On the other hand, the model and the ontologies have to be extended in order to optimize their application for specific media types. This can be done with RICO profiles. The profiling approach was chosen because it increases the reusability of the RICO model, whilst extensions and customizations remain easily possible. Media resources of different types vary significantly with respect to many properties; therefore different technical metadata formats, and different mechanisms to identify fragments or to specify licenses, are needed. For technical metadata, a wide range of ontologies has been proposed (cf. [2]). For the identification of media fragments, different media-specific solutions exist, such as those in MPEG-7, MPEG-21, SMIL, SVG, or CMML (the Continuous Media Markup Language); see http://www.w3.org/2008/WebVideo/Fragments/wiki/State_of_the_Art for an overview. Furthermore, specialized licensing schemes are available, such as PLUS (the Picture Licensing Universal Standard).
For the description of images on the Web we implemented a basic RICO profile, the RICO4Images profile. This profile uses the EXIF ontology as provided by Kanzaki (http://www.kanzaki.com/ns/exif) to describe technical aspects and makes use of the svgOutline mechanism to localize image regions, as illustrated in [22]. This mechanism suggests using an SVG [23] description to specify regions in images. Furthermore, the profile instantiates the MARCO Ratings ontology to enable ratings of image qualities. Steps to define further profiles are outlined in [17].
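The following sketch illustrates, under assumed vocabulary names, how an image region localized by an SVG outline might be described in RDF in the spirit of the svgOutline mechanism; the namespace and the property names (isFragmentOf, svgOutline, depicts) are placeholders rather than the actual RICO4Images terms.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/rico4images#")   # placeholder namespace
g = Graph()
g.bind("ex", EX)

image = EX["photo42.jpg"]
region = EX["photo42-region1"]

# The region is a fragment of the image, localized by an inline SVG outline
# (here a simple polygon) rather than by a media-specific byte range.
svg_outline = ('<svg xmlns="http://www.w3.org/2000/svg">'
               '<polygon points="120,40 260,40 260,180 120,180"/></svg>')

g.add((region, RDF.type, EX.ImageRegion))
g.add((region, EX.isFragmentOf, image))
g.add((region, EX.svgOutline, Literal(svg_outline)))
g.add((region, EX.depicts, Literal("the lighthouse")))

print(g.serialize(format="turtle"))
```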
3 The Usage of the RICO Conceptual Model
The RICO model can be used to describe images published for different purposes such as illustration of Web pages, sharing, or selling as shown in [17].
Fig. 4. An Image Hosted by Flickr Described Using the RICO Model
3.1 Describing Multimedia Content Using the RICO Model
Figure 4 shows an arbitrary image available at Flickr. RICO information is explicitly marked up and related to the image (see numbers 1–12 in Figure 4). The example contains three metadata sets: one authoritative set, which was created by the owner of the image, and two non-authoritative sets, which are provided
by the hosting platform and by an end user through a comment (see number 3 in the figure). The figure shows only how visible information is related to the image. However, additional (non-visible) metadata could also be provided, e.g., by adding a detailed description of different scenes in a video or further semantic descriptions of the content of the image. Further details on how the RICO ontologies can be used to describe images are provided in [12,17].
3.2 Embedding RICO Descriptions into HTML Using RDFa
The semantic descriptions according to the RICO model can be published as a compound package together with the multimedia resource within an HTML page. Typically, compound packages are published using XML to link resources and their metadata (e.g., in MPEG-21, IMS Content Packaging [9] or METS [8]). In our case, the compound package information is published inline within an HTML page using RDFa. RDFa is a serialization syntax for the RDF data model; it defines how an RDF graph is embedded in an (X)HTML page using a set of defined attributes such as @about, @rel, @instanceof, and others. By using ramm.x (“RDFa enabled multimedia metadata”) [24], external metadata descriptions can additionally be included in a RICO description by reference. An example showing the description of the Flickr image using the RICO model is provided on the Web (http://tobiasbuerger.com/icontent/rico/example/), along with a demonstration of the automatic extraction of RDF triples from the HTML page using available online services.
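As a hedged illustration of what such an RDFa-annotated page encodes, the following rdflib sketch builds the kind of RDF graph an online extractor would return for a minimal image description; the resource URIs are invented and only standard Dublin Core and FOAF terms are used, not the RICO vocabulary itself.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, FOAF

# Triples equivalent to RDFa markup such as:
#   <div about="http://flickr.com/photos/alice/42">
#     <span property="dc:title">Sunset</span>
#     <a rel="dc:creator" href="http://example.org/alice">Alice</a>
#   </div>
# An RDFa-aware extractor applied to the HTML page yields this graph.
photo = URIRef("http://flickr.com/photos/alice/42")
alice = URIRef("http://example.org/alice")

g = Graph()
g.add((photo, DC.title, Literal("Sunset")))
g.add((photo, DC.creator, alice))
g.add((alice, FOAF.name, Literal("Alice")))

print(g.serialize(format="turtle"))
```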
3.3 Tools
In order to demonstrate the model we implemented a Firefox browser plug-in which can be used to manually annotate images using the RICO ontologies. The plug-in stores annotations directly in the DOM of the HTML page using RDFa. It allows ontologies to be loaded for annotation, and the generated RDF data to be exported and validated. The plug-in is available on request.
4 Related Work
Especially in recent years, much research has been done on the specification of ontologies that aim to combine traditional multimedia description models to allow reasoning over the structure and semantics of multimedia content (cf. [11,2]). The RICO conceptual model and its associated ontologies can be used to mark up illustrative images in conventional Web pages, images or videos hosted
by social media sharing sites, slides and images embedded in commercial offerings, or multimedia content deployed in blogs and wikis. The aim of the model is not to provide a sophisticated description facility for structural and content-related semantics, as is the case with MPEG-7 [25] or one of the recently proposed MPEG-7-based ontologies such as COMM [26]. We rather propose a common core, grounded in available standards from the multimedia area and other related areas, to describe existing Web pages from the perspective of multimedia resources through a basic model which can be extended with existing ontologies. One model with a similar scope is the hMedia microformat (http://wiki.digitalbazaar.com/en/Media_Info_Microformat): it provides a basic vocabulary to mark up multimedia resources on Web sites using property-value pairs. The ramm.x model [24] also has a similar scope. It provides a small but extensible vocabulary for marking up multimedia resources and embedding semantic metadata inside HTML pages. It mainly covers the inclusion of traditional metadata via service transformations, but does not provide a detailed vocabulary to describe multimedia resources. The aim of the SMIL MetaInformation module (http://www.w3.org/TR/2007/WD-SMIL3-20070713/smil-metadata.html) is to publish RDF-based metadata in SMIL presentations; it is very general and does not specify how it can be used in concrete applications. Another relevant approach is Adobe's Extensible Metadata Platform (XMP, http://www.adobe.com/products/xmp/), which specifies means to publish RDF-based metadata in PDFs and other document formats. Furthermore, we want to acknowledge the work done by Creative Commons to describe and embed licensing data using RDF, which is exploited in searches by Yahoo and in Flickr (see http://search.creativecommons.org/). The scope of the W3C work on an ontology for media resources on the Web (http://www.w3.org/2008/WebVideo/Annotations/) is comparable to the scope of the MARCO Core ontology introduced in Section 2.3. More heavyweight approaches include Intelligent Content models, as previously assessed for example in [27], which cover a broad range of aspects; most of these approaches are too heavyweight for our purposes. Traditional, non-semantic models include the standardized framework of MPEG-7 [25] and packaging formats from the archival or eLearning domains such as the IMS Content Packaging format [9], the Metadata Encoding and Transmission Standard (METS) [8], the MPEG-21 Digital Item Declaration (DID) [13] and, most recently, the OAI Object Reuse and Exchange model (OAI-ORE) [10]. The OAI-ORE model was designed as an exchange format for scholarly works. Its compound object model is similar to ours, as it also provides facilities to publish semantic descriptions as an overlay graph over Web pages [28]. The approach, however, does not focus on multimedia aspects, but on grouping resources into one single manageable entity.
5 Conclusions
In this paper we presented the RICO conceptual model. The model can be used to describe multimedia resources which are embedded in typical Web pages, or published on arbitrary media sharing platforms or commercial sites on the Web. It consists of a data model that supports adaptability of content, a metadata model that enables efficient retrieval of content and a deployment mechanism to publish RICO information inline in HTML pages. The model is implemented using a set of Semantic Web ontologies which build on existing standards. In order to ensure interoperability with existing standards in the field, we based our work on the MPEG-21 DID model, which itself is compatible with METS, OAI and others. For the same purpose we reused several established ontologies or ontology-like formats such as the Dublin Core schema, the FOAF ontology, the Annotea annotation scheme and the Mindswap Digital Media ontology. We evaluated the compatibility of the metadata model with existing standards by defining mappings, as outlined in [17]. Our next steps include further tools to foster the uptake of the model and to automate metadata generation. Future versions of the RICO model will incorporate results of the W3C Media Annotations and Media Fragments working groups (http://www.w3.org/2008/WebVideo/) on fragment identification and media resource metadata.
Acknowledgments. The research leading to this paper was partially supported by the European Commission under contract IST-FP6-027122 “SALERO”.
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 28–37 (2001)
2. Bürger, T., Hausenblas, M.: Metadata standards and ontologies for multimedia content. In: Handbook of Metadata, Semantics and Ontologies. World Scientific Publishing Co., Singapore (2010) (to be published)
3. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)
4. Hollink, L., Schreiber, G., Wielinga, B., Worring, M.: Classification of user image descriptions. Int. J. Hum.-Comput. Stud. 61(5), 601–626 (2004)
5. Pu, H.T.: An analysis of failed queries for web image retrieval. J. Inf. Sci. 34(3), 275–289 (2008)
6. Gong, Z., Leong Hou, U., Cheang, C.W.: Web image indexing by using associated texts. Knowl. Inf. Syst. 10(2), 243–264 (2006)
7. Adida, B., Birbeck, M. (eds.): RDFa primer. W3C Working Group Note, October 14 (2008), http://www.w3.org/TR/xhtml-rdfa-primer/
8. The Library of Congress: Metadata encoding and transmission standard: Primer and reference material, version 1.6 (September 2007), http://www.loc.gov/standards/mets/mets-schemadocs.html
9. IMS Global Learning Consortium: IMS content packaging information model, version 1.1.4 final specification (October 2004), http://www.imsglobal.org/content/packaging/cpv1p1p4/imscp_infov1p1p4.html
10. Lagoze, C., van de Sompel, H., Johnston, P., Nelson, M., Sanderson, R., Warner, S.: ORE user guide – primer (October 2008), http://www.openarchives.org/ore/1.0/primer.html
11. Bürger, T., Hausenblas, M.: Why real-world multimedia assets fail to enter the semantic web. In: Proc. of the Int. Workshop on Semantic Authoring, Annotation and Knowledge Markup, SAAKM 2007 (2007)
12. Bürger, T.: Towards increased reuse: Exploiting content related and social features of multimedia content on the semantic web. In: Proceedings of the Workshop on Interacting with Multimedia Content on the Web (IMC-SSW), co-located with SAMT 2008, December 3-5, Koblenz, Germany (2008)
13. Bormans, J., Hill, K.: MPEG-21 overview v.5, ISO/IEC JTC1/SC29/WG11 (2002)
14. Boll, S., Klas, W.: ZYX – a multimedia document model for reuse and adaptation of multimedia content. IEEE Transactions on Knowledge and Data Engineering 13(3), 361–382 (2001)
15. CCSDS: Reference model for an open archival information system. Blue Book 1, Consultative Committee for Space Data Systems, CCSDS Secretariat, Program Integration Division (Code M-3), National Aeronautics and Space Administration, Washington, DC 20546, USA (January 2002)
16. Bekaert, J., Kooning, E.D., van de Sompel, H.: Representing digital assets using MPEG-21 digital item declaration. International Journal on Digital Libraries 6(2), 159–173 (2006)
17. Bürger, T.: An Intelligent Content Model for the Semantic Web. PhD thesis, University of Innsbruck (June 2009)
18. Downes, S.: Resource profiles. Journal of Interactive Media in Education, Special Issue on the Educational Semantic Web 5 (2004)
19. Recker, M., Wiley, D.: A non-authoritative educational metadata ontology for filtering and recommending learning objects. Journal of Interactive Learning Environments (2001)
20. Kahan, J., Koivunen, M.R.: Annotea: an open RDF infrastructure for shared web annotations. In: Proceedings of the 10th International World Wide Web Conference, pp. 623–632 (2001)
21. IFLA Study Group on the Functional Requirements for Bibliographic Records: Functional requirements for bibliographic records, final report (1998), http://www.ifla.org/VII/s13/frbr/frbr.htm
22. Troncy, R., van Ossenbruggen, J., Pan, J.Z., Stamou, G.: Image annotation on the semantic web. W3C Incubator Group Report (August 14, 2007), http://www.w3.org/2005/Incubator/mmsem/XGR-image-annotation/
23. Jackson, D. (ed.): Scalable vector graphics (SVG) full 1.2 specification. W3C Working Draft (April 13, 2005), http://www.w3.org/TR/SVG12/
24. Hausenblas, M., Bailer, W., Bürger, T., Troncy, R.: Ramm.x: Deploying multimedia metadata on the semantic web. In: Proceedings of SAMT 2007, December 4-7, Genova, Italy (2007)
25. Martínez-Sánchez, J.M., Koenen, R., Pereira, F.: MPEG-7: The generic multimedia content description standard, part 1. IEEE MultiMedia 9(2), 78–87 (2002)
26. Arndt, R., Troncy, R., Staab, S., Hardman, L., Vacura, M.: COMM: Designing a well-founded multimedia ontology for the web. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 30–43. Springer, Heidelberg (2007)
27. Bürger, T., Toma, I., Shafiq, O., Dögl, D.: State of the art in SWS, grid computing and intelligent content objects – can they meet? GRISINO Deliverable D1.1 (October 2006), http://www.grisino.at
28. van de Sompel, H., Lagoze, C.: Interoperability for the discovery, use, and re-use of units of scholarly communication. CTWatch Quarterly 3(3) (August 2007)
CAIN-21: An Extensible and Metadata-Driven Multimedia Adaptation Engine in the MPEG-21 Framework*
Fernando López1, José M. Martínez1, and Narciso García2
1 VPULab, EPS – Universidad Autónoma de Madrid, Spain {f.lopez,josem.martinez}@uam.es
2 GTI, ETSIT – Universidad Politécnica de Madrid, Spain [email protected]
Abstract. This paper presents the CAIN-21 multimedia adaptation engine, which facilitates the integration of pluggable multimedia adaptation modules, chooses the chain of adaptations to perform and manages its execution. Evolving from CAIN, CAIN-21 complies better with the MPEG-21 framework. Its new features and improvements are discussed in this paper. In addition, its pros and cons are explained with respect to other multimedia adaptation engines, including early CAIN.
Keywords: multimedia, adaptation, metadata, MPEG-7, MPEG-21.
1 Introduction
The variety of multimedia formats and devices has significantly increased over time, and continues to do so. Multimedia content providers need to distribute their photos, videos and audio to a wide range of devices, independently of the underlying delivery technology. This challenge has been framed by the Universal Multimedia Access (UMA) paradigm [1], where content adaptation plays an important role. The purpose of multimedia adaptation is to carry out changes in the multimedia content in order to enable its consumption on different terminals, operating over different access networks and satisfying the preferences of the multimedia system and users as far as possible. Frequently, multimedia adaptation tools are intended to adapt a specific kind of media (e.g. audio, video, images), or even a specific media format (e.g. H.264/AVC), to a set of constraints imposed by the usage environment: namely the terminal, the network and the preferences of the multimedia system and users. The MPEG-21 [2] standard has taken into account the importance of interoperability among multimedia services. It proposes a framework that facilitates the integration
* Work financed by the European Commission (IST-IP-001765 – aceMedia, IST-FP6-027685 – MESH), the Spanish Government (TEC2007-65400 – SemanticVideo), the Spanish Ministerio de Educación y Ciencia (through the FPU fellowship grant issued to the first author) and the Comunidad de Madrid (S-0505/TIC-0223 – ProMultiDis-CM).
of multimedia services and standards in order to harmonise multimedia technologies. In addition, the MPEG-21 framework has introduced general multimedia elements that can be instantiated in different scenarios. The MPEG-21 framework and its elements have frequently been used (for example in [3][4][5]) to describe multimedia systems. This paper presents an adaptation engine named CAIN-21 (Content Adaptation INtegrator in the MPEG-21 framework), which has evolved from CAIN [6]. The source code, along with an online demo of its functionalities, can be publicly accessed at cain21.sourceforge.net. The main objective of CAIN-21 is to provide a framework in which different multimedia adaptation tools can be integrated and tested. With the extensibility mechanism that CAIN-21 incorporates, adaptation tools can be added in a pluggable manner. Moreover, this adaptation engine includes an automatic decision process that enables the quick generation of new types of multimedia adaptations. Specifically, by combining existing multimedia adaptation tools, CAIN-21 is capable of addressing a wider range of multimedia adaptations. In the rest of this paper, Section 2 reviews the state of the art concerning metadata-driven multimedia adaptation. Section 3 explains the main features and elements of CAIN-21; this section also reviews early CAIN and explains the new features and advantages that the new multimedia adaptation engine has incorporated. Section 4 provides a comparative analysis between CAIN-21 and other multimedia adaptation engines. Finally, Section 5 concludes the paper and gathers the advantages of the adaptation techniques explained in this paper.
2 State of the Art
Metadata-driven multimedia adaptation [7] makes use of metadata to automatically decide the adaptation to perform and subsequently execute it. This metadata may come from both manual annotations and automatic content analysis techniques. The MPEG-7 [8] standard specifies description tools for both sources of metadata. In addition, the MPEG-21 [2] standard specifies the metadata to describe the complete multimedia system. Metadata-driven content adaptation has been studied at resource level [3], at system level [4] and at scene level [5][9]. Initial proposals for metadata-driven adaptation were implemented using XML to describe both the multimedia content and the usage environment [10][11]. After their standardization by ISO, the MPEG-7 and MPEG-21 standards gained popularity in the content adaptation community. Descriptions have frequently been classified as in [12]: (1) Content descriptions, with information on the media resource along with its metadata: title, media format, variations, natural language description, audiovisual features, etc. MPEG-7 has frequently been used to specify the metadata representation. In the MPEG-21 framework, a Component element comprises the media resource and its corresponding metadata; a Digital Item (DI) is a container of Component elements, additional nested DIs and metadata. (2) Service provider environment descriptions, which usually contain restrictions imposed by the service provider. Streaming bandwidth and privacy policies are examples of these descriptions. The MPEG-21 Part-7 Universal Constraints Description Tools (UCD Tools) have frequently been used in this case [12][13][14]. (3)
Usage environment descriptions, which describe the content consumer's environment. Examples of these descriptions are the features of the terminal, the network characteristics or the preferences of the multimedia system and users. The MPEG-21 Part 7 Usage Environment Description Tools (UED Tools) have been used in this case [3][4][12]. Roughly speaking, the constraints of the content provider as well as those of the consumer are described with the UCD Tools, while the usage environment of the consumer is described with the UED Tools. The adaptation engine is situated between both extremes: it can be implemented in the content provider's multimedia server or in the content consumer's terminal, and it can even be located in a proxy, an idea that has been proposed to perform distributed multimedia adaptation [15]. Within the multimedia adaptation engine, metadata-driven approaches such as variation selection, transcoding or transmoding have been examined [16]. Metadata-driven adaptation has also been used to perform scalable media adaptation [13][14]. In this area, the AdaptationQoS Tools [17] have frequently been used to decide the optimal adaptation of a media resource. First, these methods store the set of feasible adaptations in the AdaptationQoS (AQoS) description. Second, they eliminate the solutions that do not satisfy all the UCD limit constraints or UED constraints. Third, they use the UCD optimization constraints to decide the optimal solution. The Bitstream Syntax Description Tools (BSD Tools) and BSDLink Tools [18] may assist the adaptation process by describing the high-level structure of the stream; by executing Extensible Stylesheet Language Transformations (XSLT) over the BSD, the resource adaptation is then carried out efficiently. In the area of scene level adaptation, metadata-driven adaptation has been used to transcode web pages [19]. In this case, external annotations are associated with existing web documents in order to transcode them according to the constraints and preferences of the user. Another complementary approach for dealing with multimedia content has been proposed by the Semantic Web community, which advocates ontology-based approaches [20][21][22] so that the machine is capable of understanding both the content and the relationships among its metadata descriptions. In [23], Semantic Web technologies are used to perform scene-of-interest selection and scalable video frame rate adaptation.
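The following toy sketch illustrates this three-step decision pattern (enumerate feasible adaptations, filter by limit constraints, optimise an objective); the property names, values and utility figures are invented for illustration and do not correspond to any particular AdaptationQoS instance.

```python
# Each candidate adaptation is a point in the AdaptationQoS space:
# output parameters plus the utility the provider associates with it.
candidates = [
    {"bitrate": 1200, "width": 640, "utility": 0.9},
    {"bitrate": 600,  "width": 320, "utility": 0.6},
    {"bitrate": 300,  "width": 176, "utility": 0.3},
]

# Limit constraints gathered from UCD/UED descriptions (e.g. network and display).
limits = {"bitrate": 800, "width": 400}

def satisfies(candidate):
    return all(candidate[key] <= limit for key, limit in limits.items())

# Steps 1-2: keep only the feasible adaptations; step 3: optimise the objective.
feasible = [c for c in candidates if satisfies(c)]
best = max(feasible, key=lambda c: c["utility"])
print(best)   # -> the 600 kbit/s, 320-pixel-wide variant
```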
3 CAIN-21: Architecture, Functionalities and Improvements
3.1 Interfaces
CAIN-21 serves adaptation requests through two external programming interfaces (see Fig. 1 below): (1) The media level transcoding interface is devoted to performing blind adaptation (i.e. semantic-less adaptation) of a media resource. In addition to the media level, this interface is also capable of performing system level adaptation of videos composed of one or more audio and visual streams. The transcoding operations defined in the media level transcoding interface are implemented in the TLib module, which includes conventional software libraries such as ffmpeg and imagemagick as well as JNI-based custom libraries. (2) The DI level adaptation interface is in charge of performing system level (semantic or blind) adaptations in which metadata is used to guide the adaptation.
The DI level adaptation interface deals with three types of MPEG-21 DIs: (1) the Content DI conveys the media resource along with its metadata; (2) the Context DIs are repositories that contain the information of the usage environment along with the information that the adaptation engine employs in deciding and executing the most suitable adaptation (i.e. the adaptation capabilities and the properties to be taken into account); (3) the Configuration DI states which terminal, network and user – from the ones available in the Context DIs – must be used to serve the adaptation. The purpose of the Configuration DI [24] is to decouple the Content DI and the Context DIs. Instead of storing information about the adaptation state in the Content DI (as the MPEG-21 Part-2 Choice mechanism or the MPEG-21 Part-7 DIA Configuration do), the Configuration DI stores all the information related to the adaptation that is to be carried out. In this way, the Content DI is not modified when the target usage environment (located in a Context DI) changes. Usage examples and further explanations of this mechanism can be found in [24]. In CAIN-21, metadata-driven adaptation is performed through the DI level interface and at Component level. An MPEG-21 Component includes a media resource (in the Resource element) and its metadata (in the Descriptor element). The DI level adaptation interface provides two different operations: the first modifies an existing Component and the second adds a new Component element to the DI. More specifically: (1) transform() takes a Component from the Content DI and modifies its media resource and metadata in order to adapt it to the usage environment; (2) addVariation() takes a Component from the DI and creates a new Component ready to be consumed in the usage environment; this adapted Component is added to the Content DI at the end of the adaptation.
3.2 Architecture
This section provides a detailed description of the modules that CAIN-21 incorporates. Fig. 1 depicts CAIN-21's functional modules and the control flow along the adaptation process.
Adaptation Management Module (AMM)
The AMM is in charge of coordinating the whole DI level adaptation process. The modules below the AMM perform different tasks initiated by the AMM.
Adaptation Decision Module (ADM) and Adaptation Execution Module (AEM)
In recent years a great deal of research has been carried out on automatically constructing multimedia adaptation plans (see e.g. [21][22][25]). The distinction between an Adaptation Decision Module (ADM) and an Adaptation Execution Module (AEM) has frequently been proposed [5][12][13][25]. CAIN-21 also adopts this distinction, and a great deal of work has been done to automate the decision and execution process. In CAIN-21, the pluggable adaptation modules are implemented by means of Component Adaptation Tools (CATs). The ADM uses metadata to decide [26] the sequence of conversions and parameters that should be executed over a specific Component element of the Content DI. Subsequently, the AEM executes the corresponding sequence of CATs over this Component.
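A minimal sketch of this decide/execute separation is given below; the class and method names are illustrative and do not reproduce CAIN-21's actual interfaces.

```python
from typing import List, Tuple

Conversion = Tuple[str, dict]          # (CAT name, parameters)

class AdaptationDecisionModule:
    def decide(self, component: dict, context: dict) -> List[Conversion]:
        """Return a sequence of conversions; here a trivial one-step plan."""
        if component["format"] != context["terminal_format"]:
            return [("TranscodeCAT", {"target": context["terminal_format"]})]
        return []

class AdaptationExecutionModule:
    def execute(self, component: dict, plan: List[Conversion]) -> dict:
        for cat, params in plan:
            # A real CAT would rewrite the resource and its Descriptor metadata.
            component = dict(component, format=params["target"], adapted_by=cat)
        return component

adm, aem = AdaptationDecisionModule(), AdaptationExecutionModule()
comp = {"format": "mpeg2", "resource": "movie.mpg"}
plan = adm.decide(comp, {"terminal_format": "h264"})
print(aem.execute(comp, plan))
```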
Fig. 1. Modules and control flow within CAIN-21
Caching mechanisms can speed up either or both of these steps by avoiding deciding or executing several times adaptations that comprise the same parameters.
Context Repository
The Context Repository encompasses three types of Context DIs (see Fig. 1). The Usage Environment DI is a Context DI that describes the available usage environments using the MPEG-21 UED Tools. Each CAT Capabilities DI describes the different conversions that one CAT is capable of undertaking; each conversion has a set of valid input and output properties along with their corresponding values. CAIN-21 has introduced an addressing mechanism [24] in which changes in the underlying metadata descriptors do not imply changes in its source code. With this mechanism, metadata is handled by means of properties. The Properties DI is intended to store a set of keys and corresponding XPath/XPointer expressions.
Parsing Module (PM)
The PM is the module in charge of resolving the values of the aforementioned properties. Firstly, the PM accesses the Properties DI in order to obtain the set of property keys and corresponding XPath/XPointer expressions. Secondly, on resolving these expressions, the values of these properties are generated. During this step the rest of the metadata is loaded from the Content DI, Configuration DI, Usage Environment DI and CAT Capabilities DIs. At the end of this step all the metadata is represented as a set of properties. The values of these properties can be multi-valued (e.g. bitrate = [1000..200000], audio_format = {aac, mp3}).
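The following sketch illustrates the Properties DI idea of resolving property values on demand from XPath expressions; the descriptor document and the expressions are simplified stand-ins for the real MPEG-21/MPEG-7 metadata, and the lxml library is used here purely for illustration.

```python
from lxml import etree

# A simplified stand-in for a UED descriptor (real documents use MPEG-21 DIA).
ued = etree.fromstring(
    "<UsageEnvironment>"
    "  <Terminal><Display width='320' height='240'/></Terminal>"
    "  <Network maxBitrate='600000'/>"
    "</UsageEnvironment>")

# The Properties-DI idea: property keys mapped to XPath expressions, so adding
# or renaming a descriptor only changes this table, never the engine's code.
properties_di = {
    "display_width": "string(//Terminal/Display/@width)",
    "max_bitrate":   "string(//Network/@maxBitrate)",
}

def resolve(key):
    # Evaluated on demand: only properties the decision process asks for are computed.
    return ued.xpath(properties_di[key])

print(resolve("display_width"), resolve("max_bitrate"))   # -> 320 600000
```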
Coupling Module (CM)
There exists a wide variety of multimedia representation standards and systems, and CAIN-21 is designed to be integrated into heterogeneous multimedia systems. The CM is the gateway for other components that may use external (i.e., non-MPEG-21) technology to represent multimedia (e.g. HTML, SMIL, NewsML or MPEG-4 BIFS). The CM enables the integration of CAIN-21's adaptation services into those heterogeneous multimedia systems. For this purpose, the module transforms the external representation of multimedia into an MPEG-21 compliant input DI that can be processed by CAIN-21. In addition, the CM is in charge of transforming the adapted output DI back into its external representation. Different instances of the CM are interchangeable modules created to interact with specific external systems.
3.3 Control Flow
Numbers in Fig. 1 indicate the control flow of the tasks involved in the adaptation process. (1) When interacting with external systems, the CM transforms the external multimedia representation into MPEG-21 compliant DIs that CAIN-21 is capable of processing. (2) The AMM is in charge of coordinating the whole DI level adaptation process. (3) When a Content DI along with a Configuration DI arrives (via the transform() or addVariation() operations of the DI level interface described above), the AMM invokes the ADM and the AEM to decide (4) and execute (5) the corresponding adaptation over a specific Component. (6) The CATs frequently employ the TLib in order to adapt the media resource. The CATs might also change or append information to the Descriptor element of the Component so that subsequent CATs may use it. (7) When all the conversions of the sequence have been executed, (8) the AMM returns the adapted Content DI to the caller. (9) Frequently, the adapted Content DI may need to be transformed into an external representation; in this case, the CM performs this transformation.
3.4 Early CAIN
Early versions of CAIN received a media resource, an MPEG-7 description of this resource and an MPEG-21 description of the usage context. After its execution, CAIN produced an adapted media resource along with its MPEG-7 media description. Subsequent versions of CAIN developed its modular architecture. Fig. 2 shows the architecture as published in [6]. In this version, CAIN comprises a Decision Module (DM) and a set of adaptation operations called Content Adaptation Tools (CATs), Encoders and Decoders. In response to an external invocation, the multimedia resource, the MPEG-7 description of this resource and an MPEG-21 description of the context were parsed, and the DM performed three main steps in sequence: (1) selecting the target media parameters, (2) selecting an adaptation tool capable of performing the adaptation and, lastly, (3) launching the selected CAT/Encoder/Decoder. The next step in the development of CAIN was twofold [27]: (1) the development of an extensibility mechanism and (2) the development of a formal decision process capable of dealing with this extensibility. The extensibility mechanism proposes the
use of a CAT Capabilities document to describe the adaptation capabilities of each pluggable CAT. The automatic decision mechanism was intended to select a CAT capable of performing the adaptation and the parameters to use.
Fig. 2. Early CAIN
3.5 From CAIN to CAIN-21
The first set of changes relates to the PM. In early CAIN, the PM was in charge of parsing all the metadata: the MPEG-7 Part 5 MediaDescriptionType describing the media and the MPEG-21 Part 7 description of the UED. The changes undergone by the PM have been in two directions. Firstly, these elements are now represented by means of the Content DI, Context DIs and Configuration DI, as explained above. Secondly, and more importantly, the PM implemented in CAIN required changes in the source code of CAIN whenever a description was added or changed. CAIN-21 has introduced a mechanism [24] in which metadata is handled through properties stored in the Properties DI (see Section 3.2). In this way, changes in the set of properties managed by the adaptation engine imply only changes in the Properties DI. Additionally, with this on-demand mechanism only the values of the properties used by the decision-making process are evaluated. During its execution, early CAIN received a document with the UED, and the existence of only one Terminal, Network and User element was assumed; different UED documents represented different adaptation environments. CAIN-21 gathers the UED in a Context DI referred to as the Usage Environment DI. With CAIN-21, more than one Terminal, Network and User element can be stored in the Usage Environment DI. Once CAIN-21 is deployed in a multimedia system, the target Terminal,
Network and User elements can be addressed by means of the Adaptation Request Configuration Tool (ARC Tool) [24]. As explained in Section 3.4, CAIN incorporated three kinds of adaptation modules (see Fig. 2 (b)): Content Adaptation Tools (CATs), Encoders and Decoders. The third major change in CAIN-21 gathered all of these modules under the concept of Component Adaptation Tool (CAT). In CAIN-21, adaptations are always performed at Component level; an MPEG-21 Component includes a media resource and its metadata. Early CAIN included only one operation named adapt() with four parameters [27]: the input content, the output format, the output folder and a set of properties reserved for future functionalities. CAIN-21 divided this operation into the two operations transform() and addVariation() explained in Section 3.1. Additionally, in CAIN-21, the number of parameters is variable and determined as a subset of the properties gathered from metadata. Further advances in the implementation of an automatic decision mechanism have motivated changes in the CAT Capabilities description mechanism. As further explained in [24][26], in order to express disjunction in the adaptation capabilities, each CAT Capabilities document has to be divided into several ConversionCapabilitiesType elements. In order to allow the existence of multi-valued properties, the CAT Capabilities description mechanism was changed. In this way, the properties of the CAT capabilities may be single-valued (e.g. format = {mpeg2}), multi-valued (e.g. color_space = {rgb, grayscale}), ranges (e.g. bitrate = [100..400000]) or compound properties (e.g. frame_size = {144x176, 288x352}). Ranges are also allowed in compound properties (e.g. frame_size = [10..5000]x[10..5000]). In the current CAT Capabilities description model [24], each ConversionCapabilitiesType element contains preconditions and postconditions [26], which are used by the automatic decision-making process. The adaptation of large media resources such as videos may imply long delays if the resource needs to be adapted before being delivered. Early CAIN only supported the Forward Adaptation Mode [28], in which adaptation takes place as soon as the user requests the resource. The client characteristics, preferences and natural environment can be taken into account, but if the resource adaptation process is time consuming, the user has to wait until the whole resource is adapted. This kind of adaptation is useful for small resources (e.g. images), but it is undesirable for long resources (e.g. video or audio). CAIN-21 introduces the Online Adaptation Mode, in which the media resource can start to be delivered before the whole media resource has been adapted. Lastly, in CAIN-21 a great deal of research has been devoted to automatically constructing multimedia adaptation plans. As explained in [27], early CAIN was "not truly extensible in the sense it is currently, that is, it was possible to add additional CATs but it was needed to code or recode some parts in the core of CAIN". In order to automate the decision-making process, an Artificial Intelligence planner [21][26] was implemented in the ADM of CAIN-21. This approach uses a description of the input and output parameters of the CATs as preconditions and postconditions, respectively. The planner (under development in [24]) computes a sequence of zero or more CATs, along with their parameters, in order to adapt the media to the usage environment.
CAIN was only capable of selecting and executing one CAT in order to perform the adaptation. CAIN-21 can perform multistep adaptation, i.e., CAIN-21 can
find and execute sequences of conversions of any length. Another difference between CAIN and CAIN-21 is that the former was only capable of finding one of the feasible solutions to address the adaptation problem, whereas the latter computes all the feasible adaptation solutions (sequences of conversions). Subsequently, only one of these sequences needs to be executed in order to adapt the content. Deciding which of these sequences best suits the usage environment is a work in progress.
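The following toy sketch illustrates how precondition/postcondition descriptions can be chained to enumerate every feasible sequence of conversions, in the spirit of the planner described above; the capability entries and format names are invented and far simpler than real ConversionCapabilitiesType elements.

```python
# Toy stand-ins for ConversionCapabilitiesType elements: each conversion accepts
# a (possibly multi-valued) set of input formats as its precondition and yields
# one output format as its postcondition.
CONVERSIONS = {
    "mpeg2_to_h264": ({"mpeg2", "mpeg1"}, "h264"),
    "h264_to_3gp":   ({"h264"},           "3gp"),
    "mpeg2_to_3gp":  ({"mpeg2"},          "3gp"),
}

def plans(current, goal, used=()):
    """Yield every sequence of conversions that turns `current` into `goal`."""
    if current == goal:
        yield list(used)
        return
    for name, (accepted, produced) in CONVERSIONS.items():
        if current in accepted and name not in used:     # precondition satisfied
            yield from plans(produced, goal, used + (name,))

# All feasible multi-step adaptations from MPEG-2 video to a 3GP variant:
for sequence in plans("mpeg2", "3gp"):
    print(sequence)
# -> ['mpeg2_to_h264', 'h264_to_3gp'] and ['mpeg2_to_3gp']
```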
4 Comparison with Other Multimedia Adaptation Approaches
This section provides a comparative review of CAIN-21 and four other multimedia adaptation engines that work in the MPEG-21 framework: koMMa [21], BSD [14], DCAF [12] and NinSuna [23]. The comparison is based on five aspects: decision-making method, completeness of the computed solutions (i.e. whether the underlying algorithm finds all the solutions), type of multimedia content addressed, semantic adaptation, and the MPEG-21 Part-7 DIA tools supporting the adaptation. With reference to the decision-making method, BSD uses optimization methods to adapt the underlying resource. Even though BSD can theoretically be used to adapt any resource, it is only practical with scalable resources, because scalable resources are made up of blocks that can easily be modified or removed. DCAF proposes genetic algorithms to perform these optimizations on general video. NinSuna incorporates BSD-like frame-rate reduction. On the other hand, koMMa and CAIN-21 propose knowledge-based methods. The deterministic description of the operations defined in koMMa implies that their outputs are completely defined. Conversely, CAIN-21 allows partially defining the behaviour of the conversions [26], thereby allowing the use of optimization methods inside CATs. Optimization methods usually obtain a complete solution, i.e. all the feasible solutions are obtained and ranked: this is the case for BSD and DCAF. With reference to forward and backward search methods, forward search methods such as the one implemented in koMMa find only one feasible solution (the first one found), whereas backward search planning methods, such as the current version of CAIN-21, obtain a complete set of solutions. NinSuna does not specify the completeness of its decisions. With reference to the supported media, BSD is particularly effective in dealing with scalable media, whilst DCAF and NinSuna deal with general video resources. koMMa and CAIN-21 are intended to deal with a wider range of media resources; currently, CAIN-21 can manage images, audio and video. BSD and DCAF use the gBSD tool [17] and AdaptationQoS with IOPins linked to semantics to annotate the video stream on a semantic level. NinSuna provides semantic adaptation in the Scenes of Interest (SOIs) selection and in the frame-rate reduction. CAIN-21 makes use of Regions of Interest (ROIs) to drive the Image2Video adaptation [29]. Roughly speaking, the UCD Tools have been used in the optimization methods (BSD and DCAF), and the UED Tools have been used in both optimization and knowledge-based methods. The UCD Tools are frequently related to scalable multimedia content, and CAIN-21 enables the use of the UCD Tools inside the CATs that deal with such scalable content.
Table 1. Comparison of multimedia adaptation approaches
                       | koMMa           | BSD            | DCAF           | NinSuna             | CAIN-21
Decision-making method | Knowledge-based | Optimization   | Optimization   | BSD frame reduction | Knowledge-based + Optimization
Complete solutions     | No              | Ranking        | Ranking        | Unspecified         | Knowledge-based + Ranking
Multimedia content     | Images + Video  | Scalable media | Video          | Video               | Images + Video + Audio
Semantic adaptation    | No              | gBSD + AQoS    | gBSD + AQoS    | SOIs + BSD          | ROIs
UCD Tools              | No              | Yes            | Yes            | Unspecified         | Inside CATs
As Table 1 shows, CAIN-21 combines the two major decision-making methods and implements a complete algorithm, i.e., an algorithm that identifies all the feasible adaptations that produce content satisfying the usage environment constraints. CAIN-21 is designed for extensibility and supports a wider range of multimedia content than the rest of the adaptation engines. Indeed, CAIN-21 is theoretically capable of managing any content that can be represented as a DI. The BSD Tools and UCD Tools that implement the optimization method have been kept apart from the knowledge-based decision mechanism and transferred to the CATs.
5 Conclusions
This paper has presented an extensible and metadata-driven multimedia adaptation engine named CAIN-21. The extensibility mechanism enables the use of existing multimedia adaptation operations in different adaptation scenarios; by combining these adaptation operations, a wider range of adaptation scenarios can be addressed. Metadata-driven multimedia adaptation facilitates the automatic identification of sequences of conversions that adapt the media resources to the usage environment. The general architecture of CAIN-21 includes the distinction between the Adaptation Decision Module (ADM) and the Adaptation Execution Module (AEM), an approach frequently found in the literature. The Properties DI is an innovative design element that accesses multimedia properties on demand and allows changes in the adaptation capabilities of CAIN-21 without modifying the underlying source code. In order to provide extensibility, CAIN-21 proposes the notion of Component Adaptation Tools (CATs). Even though CAIN-21 uses the MPEG-21 framework to deal with multimedia, it can also provide adaptations to external (non-MPEG-21 compliant) multimedia representation systems by means of the Coupling Module (CM). The comparative analysis shows that CAIN-21 combines knowledge-based and optimization decision methods and is capable of managing a wider range of multimedia content than the other engines. Its extensibility mechanism paves the way for an efficient and effective means of increasing the range of content and formats that can be managed.
Acknowledgements. We would like to thank all the people who during these years have contributed to the final status of CAIN-21: Victor Valdés, Javier Molina, Álvaro García, Víctor Fernández-Carjabales, Jesús Bescós, Fernando Barreiro, Luis Herranz, Daniel de Pedro, Guillermo López and Javier Sanz.
References
[1] Vetro, A., Christopoulos, C., Ebrahimi, T.: Universal Multimedia Access (special issue). IEEE Signal Processing Magazine 20(2) (2003)
[2] Burnett, I.S., Pereira, F., de Walle, R.V., Koenen, R. (eds.): The MPEG-21 Book. John Wiley and Sons, Chichester (2006)
[3] Sun, H., Vetro, A., Asai, K.: Resource Adaptation Based on MPEG-21 Usage Environment Descriptions. In: Proceedings of ISCAS 2003, pp. 536–539 (2003)
[4] Rong, L., Burnett, I.: Dynamic multimedia adaptation and updating of media streams with MPEG-21. In: Proceedings of CCNC, pp. 436–441 (2004)
[5] Pellan, B., Concolato, C.: Metadata-driven Dynamic Scene Adaptation. In: Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), pp. 67–70 (2007)
[6] Martínez, J.M., Valdés, V., Bescos, J., Herranz, L.: Introducing CAIN: A metadata-driven content adaptation manager integrating heterogeneous content adaptation tools. In: Proceedings of WIAMIS 2005, pp. 5–10 (2005)
[7] van Beek, P., Smith, J.R., Ebrahimi, T., Suzuki, T., Askelof, J.: Metadata-driven multimedia access. IEEE Signal Processing Magazine 20(2), 40–52 (2003)
[8] Manjunath, B.S., Salembier, P., Sikora, T. (eds.): Introduction to MPEG-7 Multimedia Content Description Interface. Wiley & Sons, Chichester (2002)
[9] Asadi, M.K., Dufourd, J.C.: Context-Aware Semantic Adaptation of Multimedia Presentations. In: Proceedings of ICME 2005, pp. 362–365 (2005)
[10] Villard, L., Roisin, C., Layaïda, N.: A XML-based multimedia document processing model for content adaptation. In: King, P., Munson, E.V. (eds.) PODDP 2000 and DDEP 2000. LNCS, vol. 2023, pp. 104–119. Springer, Heidelberg (2004)
[11] Phan, T., Zorpas, G., Bagrodia, R.: An Extensible and Scalable Content Adaptation Pipeline Architecture to Support Heterogeneous Clients. In: Proceedings of DCS 2002, pp. 507–516 (2002)
[12] Sofokleous, A.A., Angelides, M.C.: DCAF: An MPEG-21 Dynamic Content Adaptation Framework. Multimedia Tools and Applications 40(2), 151–182 (2008)
[13] Mukherjee, D., Delfosse, E., Kim, J.G., Wang, Y.: Optimal adaptation decision-taking for terminal and network quality-of-service. IEEE Transactions on Multimedia 7(3), 454–462 (2005)
[14] Kofler, I., Seidl, J., Timmerer, C., Hellwagner, H., Djama, I., Ahmed, T.: Using MPEG-21 for cross-layer multimedia content adaptation. Journal on Signal, Image and Video Processing 2(4), 355–370 (2008)
[15] Hutter, A., Amon, P., Panis, G., Delfosse, E., Ransburg, M., Hellwagner, H.: Automatic adaptation of streaming multimedia content in a dynamic and distributed environment. In: Proceedings of ICIP 2005, pp. 716–719 (2005)
[16] Chung-Sheng, L., Mohan, R., Smith, J.R.: Multimedia content description in the InfoPyramid. In: Proceedings of ICASSP 1998, vol. 6, pp. 3789–3792 (1998)
[17] ISO/IEC 21000-7:2004: Information technology – Multimedia framework (MPEG-21) – Part 7: Digital Item Adaptation
[18] Deursen, D.V., de Neve, W., de Schrijver, D., de Walle, R.V.: gBFlavor: a new tool for fast and automatic generation of generic bitstream syntax descriptions. Multimedia Tools and Applications 40(3), 453–494 (2008)
[19] Hori, M., Kondoh, G., Ono, K., Hirose, S.i., Singhal, S.: Annotation-based Web content transcoding. In: Proceedings of WWWCCN 2000, pp. 197–211 (2000)
[20] Stamou, G., van Ossenbruggen, J., Pan, J.Z., Schreiber, G., Smith, J.Z.: Multimedia annotations on the semantic Web. IEEE Multimedia 13(1), 86–90 (2006)
[21] Jannach, D., Leopold, K., Timmerer, C., Hellwagner, H.: A Knowledge-based Framework for Multimedia Adaptation. Applied Intelligence 24(2), 109–125 (2006)
[22] Barbosa, V., Andrade, M.T.: MULTICAO: A Semantic Approach to Context-aware Adaptation Decision Taking. In: Proceedings of WIAMIS 2009, pp. 133–136 (2009)
[23] Deursen, D.V., Lancker, W.V.: NinSuna: a Format-independent, Semantic-aware Multimedia Content Adaptation Platform. In: Proceedings of the 10th IEEE International Symposium on Multimedia, pp. 491–492 (2008)
[24] López, F., Martínez, J.M., García, N.: Towards a fully MPEG-21 compliant adaptation engine: complementary description tools and architectural models. In: Proceedings of AMR 2008 (2008)
[25] Lum, W.Y., Lau, F.C.M.: A context-aware decision engine for content adaptation. IEEE Pervasive Computing 1(3), 41–49 (2002)
[26] López, F., Jannach, D., Martínez, J.M., Timmerer, C., Hellwagner, H., García, N.: Multimedia Adaptation Decisions Modelled as Non-Deterministic Operations. In: Proceedings of WIAMIS 2008, pp. 46–49 (2008)
[27] Molina, J., Martínez, J.M., Valdés, V., López, F.: Extensibility of adaptation capabilities in the CAIN content adaptation engine. In: Poster and Demo Proceedings of SAMT 2006, pp. 29–30 (2006)
[28] Lee, J.Y.B.: Scalable continuous media streaming systems. Wiley & Sons, Chichester (2007)
[29] López, F., Martínez, J.M., García, N.: Automatic adaptation decision-making using an image to video adaptation tool in the MPEG-21 framework. In: Proceedings of the 10th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2009), pp. 222–225 (2009)
Shot Boundary Detection Based on Eigen Coefficients and Small Eigen Value
Punitha P. and Joemon M. Jose
Department of Computing Science, University of Glasgow, Glasgow, United Kingdom {punitha,jj}@dcs.gla.ac.uk
Abstract. Detection of shot boundaries in a video has been an active research area for quite a long time, until the TRECVID community almost declared it a solved problem. A problem is assumed to be solved when no significant improvement is being achieved over the state-of-the-art methodologies. However, certain aspects can still be researched and improved. For instance, finding appropriate parameters instead of empirical thresholds to detect shot boundaries is very challenging and is still being researched. In this paper, we present a fast, adaptive and non-parametric approach for detecting shot boundaries. An appearance-based model is used to compute the difference between two subsequent frames. These frame distances are then used to locate the shot boundaries. The proposed shot boundary detection algorithm uses an asymmetric region of support that automatically adapts to the shot boundaries. Experiments have been conducted to verify the effectiveness and applicability of the proposed method for adaptive shot segmentation.
Keywords: Video retrieval, eigen value, non-parametric, shot boundary detection.
1 Introduction
The proliferation of video production has made content-based video retrieval, and the underlying areas such as visual content interpretation, analysis and management, one of the most acclaimed and focused areas of research. Regardless of its length, a video is usually handled in smaller chunks, either as a set of keyframes or as shorter video clips. This universal approach has made the very first step in video analysis, shot boundary detection, an indispensable component of any video analysis and interpretation system [5, 15]. Due to the complexity and variety of shot boundaries, estimating the correct shot boundaries is challenging. A shot, in general, is a video segment in which the visual content remains consistent. Shot boundary detection is difficult because various video editing tools result in a range of shot boundary types, e.g., abrupt, dissolve, fade in, fade out, etc. Shot transitions can be of two types, abrupt and gradual; an abrupt shot transition is usually easier to detect than a gradual one. Many different algorithms have been proposed in the literature [2, 3] to detect different types of transitions, and there have even been some attempts to handle all transitions at once [6]. It is very difficult to
detect all types of shot transitions using a single approach, as the transitions are highly complex; one such attempt can still be found in [6]. Shot boundary detection algorithms are based either on the pixel-based difference between frames [14, 12, 10, 13, 8, 9, 6] or on the motion difference along the temporal axis [3]. Regardless of the approach used, shot boundary detection algorithms require a prefixed threshold on the difference value. The most challenging part of shot segmentation is fixing such parameters or thresholds, irrespective of the video genre and the type of shot transition, to determine a shot boundary. It is difficult to detect all types of shot transitions at the same time with a fixed parameter. Fuzzy logic based algorithms [2] are not very sensitive to noise, unlike direct threshold based approaches, but still require threshold selection. Very few attempts can be found in the literature to detect shot transitions adaptively. An attempt to adaptively select a global threshold to achieve a high shot cut detection rate was made in [17]. An approach based on adaptive selection of the threshold, using the average of the weighted variance of the change from the previously detected shot to the current frame and the next frame, was proposed in [11] and was reported to produce 94% accurate results. In this paper, we propose an adaptive non-parametric method for detecting shot boundaries, without being specific to the type of transition. The method proposed to adaptively find the shot boundaries is inspired by a non-parametric corner point detection method [7, 1]. The distance between two subsequent frames is computed in terms of the eigen distance between the two frames. This vector of distances, with its index, can be perceived as the profile signature of an open contour; therefore, each distance with its index forms a co-ordinate point on a two-dimensional plane. The adapted corner detection method then determines the shot boundaries by computing the statistical and geometrical properties associated with the small eigen value at each co-ordinate point. For each point representing a frame, three features, viz. region of support, confidence value and curvature value, are computed, and based on these features the true shot boundaries are located. The results indicate that the proposed algorithm is effective for various types of video sequences, and the combination of the features helps in detecting all types of shot boundaries. The remaining part of this paper is organised as follows. Section 2 defines the various video transitions. Section 3 presents the proposed shot segmentation methodology, which adapts the corner detection mechanism of [1, 7]. The performance metrics used to evaluate the proposed method and the experimental results are then presented, and the final section concludes the paper.
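To make the analogy concrete, the sketch below computes, for each point of a frame-distance profile, the smaller eigenvalue of the covariance of its region of support, which becomes large exactly where the profile turns sharply; a fixed symmetric window and a simple peak read-out are used here for brevity, standing in for the adaptive asymmetric region of support and the confidence and curvature features of the actual method.

```python
import numpy as np

def small_eigenvalues(distances, k=5):
    """Smaller covariance eigenvalue of each point's region of support on the
    (index, distance) curve; large values indicate corner-like points."""
    pts = np.column_stack([np.arange(len(distances)), distances])
    ev = np.zeros(len(pts))
    for i in range(len(pts)):
        lo, hi = max(0, i - k), min(len(pts), i + k + 1)
        cov = np.cov(pts[lo:hi].T)
        ev[i] = np.linalg.eigvalsh(cov)[0]      # eigvalsh returns ascending order
    return ev

# Synthetic profile: small inter-frame distances with one abrupt jump (a cut).
d = np.concatenate([np.full(40, 0.1), [2.5], np.full(40, 0.12)])
d += 0.01 * np.random.default_rng(0).standard_normal(d.size)
ev = small_eigenvalues(d)
print(int(np.argmax(ev)))   # index near the simulated boundary (40)
```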
2 Video Transitions A video V, in the 2D+t domain, can be seen as a sequence of frames f_t and can be described by V = {f_t | t = 1, ..., T}, where T is the number of frames in the video. The way in which any two video shots are joined together is called the transition [9]. The most common transition is the cut, in which there is a sudden and abrupt change between two frames. Fade is characterised by a progressive darkening of a shot until the last frame becomes completely black. The next most common transition
is the cross fade (mix or dissolve), where one shot gradually fades into the next. Fades have a slower, more relaxed feel than a cut. A flash is an increase in luminosity over a few frames, which is common in television news videos. Other advanced transitions include wipes and digital effects, which are complex changes leading into the next shot, such as colour replacement, animated effects, pixelization, focus drops, lighting effects, a pan from one person to another, or a zoom from a mid-shot to a close-up. In the next section, we propose an algorithm that can adaptively detect all shot boundaries, irrespective of the type of shot transition.
3 The Proposed Shot Segmentation Algorithm This section describes the proposed feature extraction, frame difference computation and the algorithm to detect shot boundaries caused by cut, flash, wipe and cross fade (mix, dissolve). The problem of shot boundary detection can be related to that of corner point detection in objects, due to the following underlying similarities. (i) Corner points on a shape curve are effective primitives for shape representation and analysis; similarly, shot boundaries in videos are effective primitives for video representation. (ii) Corner points on a digital boundary are found at locations where the nature of the boundary changes abruptly and significantly; on similar lines, shot boundaries in a video also lie at locations where the frames change abruptly and significantly. However, this definition only makes sense when the video or object is viewed from a global perspective and not at the local frame or point level. 3.1 Feature Extraction and Frame Difference Computation As pixel-based difference analysis remains one of the most preferred approaches to finding the dissimilarity between frames, we also use a pixel-based frame difference to locate the shot boundaries. PCA is a linear method for data feature extraction: a mathematical technique used to analyse correlated random variables in order to reduce the dimensionality of a data set. This reduction is achieved by selecting the first few principal components, which capture the most relevant features for classifying a group of objects to be recognised. Given the ith frame as the current frame, its previous frame is considered as the reference frame. Both frames are normalised for their intensity values in the RGB plane and an average intensity frame is obtained. Two reflective frames lying on opposite sides of the average frame are found. The eigen vectors and eigen values of the two reflective frames are then computed and used to calculate the representative eigen coefficients of the two frames. Let F be the matrix representing the two frames under process and A be the average of the normalised F. The average matrix is obtained by replicating the columns with the average vector to simplify the matrix subtraction. Then,
X = F − A    (1)
A diagonal matrix D of eigen values and a full matrix V, whose columns are the corresponding eigen vectors such that XV = VD, are computed. Using X, V and D, the eigen coefficient matrix E_coeff is computed. E_coeff is a 2x2 matrix in which the first row corresponds to the reference frame and the second row corresponds to the current frame. The difference between the eigen coefficients of the two frames is recorded, along with the current frame number, as a boundary point. The process is repeated for all frames in sequence, and the boundary points B = {(i, d_i) | i = 2, ..., n}, where n is the number of frames in the video and d_i is the eigen coefficient difference between frames i−1 and i, lie in the integer and real space. These boundary points are then used to locate the shot boundaries using the corner-detection-inspired non-parametric approach.
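A minimal sketch of this frame-difference step is given below, assuming frames are available as NumPy arrays. The exact construction of the eigen coefficient matrix is only partially specified above, so projecting the mean-subtracted frame rows onto the eigenvectors of their 2x2 covariance is one plausible reading, not the definitive implementation.

```python
import numpy as np

def eigen_frame_distance(prev_frame, curr_frame):
    """Eigen-distance between two consecutive frames (one reading of Sec. 3.1)."""
    f1 = prev_frame.astype(np.float64).ravel()
    f2 = curr_frame.astype(np.float64).ravel()
    # stack the two intensity-normalised frames as rows of a 2 x N matrix F
    F = np.vstack([f1 / (np.linalg.norm(f1) + 1e-12),
                   f2 / (np.linalg.norm(f2) + 1e-12)])
    X = F - F.mean(axis=0, keepdims=True)   # reflective (mean-subtracted) frames
    C = X @ X.T                             # 2 x 2 covariance of the two rows
    _, V = np.linalg.eigh(C)                # eigenvectors of the covariance
    E = V.T @ X                             # eigen coefficients of both frames
    return float(np.linalg.norm(E[0] - E[1]))

def boundary_points(frames):
    """Build the (frame index, eigen distance) points of the open contour."""
    return [(i, eigen_frame_distance(frames[i - 1], frames[i]))
            for i in range(1, len(frames))]
```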
3.2 Shot Boundary Detection In this section, we first present an overview of the corner detection method proposed in [7, 1] and then show how it is adapted for shot boundary detection. An Overview of the Non-parametric Corner Detection Approach [7, 1]. The method in [7, 1] localises true corner points using local features by means of an asymmetric region of support; this subsection gives an excerpt from the original work to keep the flow smooth. Let C = {p_i = (x_i, y_i), i = 1, 2, ..., n}, where p_{i+1} is a neighbour of p_i (mod n), be a closed curve described in a clockwise direction. Let p_k be the point of interest for which the region of support has to be determined in order to decide whether p_k is a corner point or not. The region of support of p_k consists of a left arm L_k, a right arm R_k, and the point p_k itself, that is, L_k = {p_{k−l}, ..., p_{k−1}} and R_k = {p_{k+1}, ..., p_{k+r}}, where l and r denote, respectively, the sizes of the left and right arms. These sizes are decided adaptively based on local properties of the contour and are not necessarily equal, implying that the region of support is not necessarily symmetric. It has been shown in [16] that, for a straight line segment, the small eigen value in the continuous domain is zero, regardless of its length and orientation. Hence, the maximum sequence of points on either side of the point p_k for which the small eigen value
associated with the covariance matrix of the sequence of points approximates zero is selected as the points of the respective arm. To compute the right arm, initially R_k = {p_k, p_{k+1}}. Let (x̄, ȳ) be the geometrical centroid of R_k and let c_11, c_12, c_21, c_22 be the entries of the covariance matrix of R_k. The small eigenvalue λ is computed as given in equation (3):
λ = [(c_11 + c_22) − √((c_11 − c_22)² + 4 c_12²)] / 2    (3)
Since any two points form a straight line, the small eigenvalue associated with two points is zero. The set R_k is then updated by adding the next point p_{k+2} in sequence, and the small eigenvalue of the updated set is computed. In order to avoid recomputation of the small eigenvalue as and when the set is updated, the small eigenvalue is computed incrementally from the previous set information, as in [12]. This process of updating the set by adding the next point in sequence, and computing the small eigen value associated with the updated set, is repeated until the eigenvalue no longer approximates zero. Once this condition is met, the value of r is taken as the size of the right arm and the points accumulated so far form the right arm of point p_k. The left arm is computed in a similar manner, but with the points preceding p_k. Finally, the region of support of point p_k is the set of points in sequence from p_{k−l} to p_{k+r}. Since the points p_{k−l} and p_{k+r} mark the end points of a segment, their confidence values are incremented by one. It can be noticed that the size of the determined region of support varies from point to point depending on the local property of the curve in the vicinity of the point of interest; thus, the method adaptively determines a region of support which is not necessarily symmetric. Once the region of support of a point is determined, the curvature at that point is estimated as the reciprocal of the angle made at that point by its left and right arms. Determining the region of support of all points yields, for each point, the size of its region of support, the curvature at the point, and also how many times (the limit value) the point has been an endpoint of the region of support of other points. It is observed experimentally that the limit value, the size of the region of support and the curvature of an actual corner point are relatively larger than the respective values of its neighbours. Non-parametric Shot Boundary Detection. Since, after the computation of the frame differences, each frame number and its distance can be visualised as a point on an open contour, we refer to each point p(x, y) in the 2D plane, with x representing the frame number and y representing the eigen distance of frame x to its next frame in the video. Hence, similarly to the approach explained in the previous section, for each point p three features are extracted: (i) the region of support, (ii) the curvature at p, formed by the two arms, and (iii) the confidence value, as described earlier.
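A minimal sketch of the small-eigenvalue test used to grow one arm of the region of support, assuming the boundary points are (index, distance) pairs; the tolerance eps below is an illustrative stand-in for "approximates zero", and the incremental eigenvalue update of [12] is omitted for clarity.

```python
import numpy as np

def small_eigenvalue(points):
    """Small eigenvalue of the 2x2 covariance matrix of a point set (eq. 3)."""
    P = np.asarray(points, dtype=np.float64)
    d = P - P.mean(axis=0)                       # centre on the geometric centroid
    c11, c22 = np.mean(d[:, 0] ** 2), np.mean(d[:, 1] ** 2)
    c12 = np.mean(d[:, 0] * d[:, 1])
    return 0.5 * ((c11 + c22) - np.sqrt((c11 - c22) ** 2 + 4.0 * c12 ** 2))

def grow_arm(points, k, step, eps=1e-3):
    """Grow one arm of the region of support of point k.

    step = +1 grows the right arm, step = -1 the left arm.  Points are
    appended while the small eigenvalue of the accumulated set stays
    approximately zero, i.e. while the points remain nearly collinear.
    """
    arm = [points[k]]
    j = k + step
    while 0 <= j < len(points):
        if small_eigenvalue(arm + [points[j]]) > eps:
            break
        arm.append(points[j])
        j += step
    return arm[1:]                               # exclude the point k itself
```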
Now, if p_k is the point of interest for which the region of support has to be determined in order to decide whether p_k marks a shot boundary, the region of support of p_k, with left arm L_k and right arm R_k, is computed as explained above. Once the region of support of a point is determined, the curvature at that point is estimated in three different ways, labelled κ1, κ2 and κ3. κ1 is computed as the reciprocal of the angle made at that point by the left and right arms; κ2 and κ3 are computed from the arm slopes as given by (4) and (5), respectively, where m_L is the slope of the line joining the points of the left arm and m_R is the slope of the line joining the points of the right arm. κ1 helps in detecting cuts, i.e., abrupt changes in the frames that form a clear shot boundary, but it is not very helpful for finding shot boundaries caused by gradual changes between frames. This problem generally arises when the subsequent frame difference is considered, as this difference remains small when there is only a slight change between frames. At this point it seems plausible to assume that if, instead of computing the difference between subsequent frames, the difference were computed with respect to a third frame, which could be a template frame/image, it would provide more informative and discriminating differences. However, this is not the case, as the selection of a template image/frame becomes a real challenge and the resulting difference would depend completely on the template selected. Instead, to overcome this problem, we use the Von Mises probability density function, as used in [5], to find the distribution of angles at a point on the contour; it is given by (6).
To make the shot boundary detection more effective, instead of deciding whether a frame is a shot boundary frame by looking only at the curvature corresponding to the frame, we also use the region of support and the confidence value at the point. For a frame marking a shot boundary, the three features (the confidence value, the region of support and the curvature) are expected to be larger than those of the neighbouring frames, as shown in Figs. 1-5 for the first few frames of a video. Following some experimentation, it was observed that shot cuts and flashes are evident and easy to detect by the curvature κ1, although they can also be ascertained with the region
Fig. 1. Frame Difference
Fig. 2. Confidence values for first 1000 frames of a video
Fig. 3. Region of support for first 1000 frames of a video
Fig. 4. Curvature values helpful to detect shot cuts
Fig. 5. Curvature values for gradual change
of support and the confidence value. However, when it comes to detecting shot boundaries caused by gradual changes in the frames, the curvature alone is not sufficient. Therefore, selecting points on the boundary curve corresponding to frames with locally maximal curvature and locally maximal confidence values, together with a longer region of support, becomes essential and effective.
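One plausible reading of this decision rule, as a sketch: a frame is reported as a gradual boundary when its curvature and confidence value are local maxima and its region of support is comparatively long. The window size and length factor below are illustrative choices, not values given in the paper.

```python
def is_local_max(values, i, window=2):
    """True if values[i] is the maximum within a small symmetric window."""
    lo, hi = max(0, i - window), min(len(values), i + window + 1)
    return values[i] >= max(values[lo:hi])

def gradual_boundaries(curvature, confidence, support_len, length_factor=1.5):
    """Frames whose curvature and confidence peak together over a long support."""
    mean_len = sum(support_len) / len(support_len)
    return [i for i in range(len(curvature))
            if is_local_max(curvature, i)
            and is_local_max(confidence, i)
            and support_len[i] >= length_factor * mean_len]
```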
4 Performance Metric To test the robustness and effectiveness of the proposed algorithm, we manually identified the shot boundaries in a number of videos; these are used as the reference for classifying each detection as correct, false or missed. Accordingly, #Total represents the number of reference shot boundaries and #Correct represents the number of events
correctly detected, #False represents the number of detected shot boundaries that do not actually correspond to shot boundaries, and #Missed represents the number of undetected shot boundaries. Two performance metrics, the hit rate (HR) and the error rate (ER), are used to evaluate the proposed methodology. They are given in (7) and (8) below:
HR = #Correct / #Total    (7)
ER = (#Missed + #False) / #Total    (8)
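A quick numeric check of (7) and (8); the counts used here are hypothetical, not results from the paper.

```python
def hit_rate(correct, total):
    return correct / total

def error_rate(missed, false, total):
    return (missed + false) / total

# hypothetical counts: 253 reference boundaries, 230 correct, 23 missed, 20 false
print(hit_rate(230, 253))        # ~0.909
print(error_rate(23, 20, 253))   # ~0.170
```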
5 Experimental Results The approach presented in this paper was applied to two genres of videos, news and commercial TV program sequences. The video sequences included cuts, fades, dissolves (blur and mix), and wipes. The sequences together lasted 3739 seconds, with 112086 frames. The dataset comprised 79 cuts, 11 flashes, 45 fades, 113 crossfades (mix, blur and dissolve), and 5 wipes. Table 1. Shot cut detection rate
     Feature used                              HR     FR
A    Curvature κ1 + confidence value           0.42   0.57
B    Curvature κ2 + confidence value           0.50   0.85
C    Curvature κ3 + confidence value           0.57   0.64
D    Region of support + confidence value      0.42   0.57
E    Union of above results                    1.00   0.42
The frame difference between subsequent frames of a video was computed using the eigen distance computation explained in Section 3.1. Each frame index and its eigen distance to the next frame was regarded as a point on an open contour. The task then reduces to finding the points on this contour which could be possible shot boundaries, the analogue of corner points in the object detection domain. For every point representing a frame of the video, the three features, (i) the region of support, (ii) the curvature at the point, formed by the two arms, and (iii) the confidence value, were extracted as explained in Section 3. Using these features, for the videos considered in the experiments, we detected all 79 cuts with the curvature κ1, with 100% accuracy, but only very few flash, fade and dissolve shot boundaries were detected. When the common peaks of the curvature κ2 and the confidence value were used, we were able to find more boundaries caused by fades, but this also resulted in more false detections. Using κ3 in combination with the peak points of the confidence not only recovered shots undetected in the previous setups but also had a minimal fault rate; however, a small number of shot boundaries remained undetected. The combination of a longer region
of support and the confidence values was able to detect most of the shots left undetected in the above setups. However, when all three features were used in combination, many shots went undetected. Instead of using all three features together, we therefore aggregated the results of the earlier-mentioned combinations and were able to detect all shots, though at an additional fault rate. The results are tabulated in Table 1.
Fig. 6. Hit Rate (dotted line) and Error Rate (solid line) in boundary detection for various combinations of features
6 Discussion and Conclusion In this paper, we have explored a model which overcomes the need for varying thresholds in shot boundary detection. The paper presents an adaptive and non-parametric approach to shot boundary detection that is not limited to a specific type of shot transition. The difference between two subsequent frames of a video is computed in terms of their eigen distance, and the frame difference together with the frame number is regarded as the coordinates of a point on an open contour. The corner detection method is then used to determine the shot boundaries based on the automatic computation of the region of support for each frame, using the statistical and geometrical properties associated with the small eigen value of the covariance matrix of a sequence of connected points on the open contour. The results obtained for a synthetic video of length 4000 seconds, shown in Figure 6, indicate that we achieve a high detection rate when the results of all the features listed in the 'Feature used' column of Table 1 are combined; however, there is a trade-off with the false detection rate. As future work, we aim to reduce the false shot boundary detection rate by considering other heuristics. In addition, though the method was not proposed to
distinguish various types of transition, the experimental results open up avenues for exploring the detection of the transition type with the help of the above-mentioned features.
Acknowledgement This research was supported by the EU commission, FP-027122-SALERO.
References
1. Dinesh, R., Guru, D.S.: Efficient Non-parametric corner detection: An approach based on small eigenvalue. In: IEEE Fourth Canadian Conference on Computer and Robot Vision (2007)
2. Fang, H., Jiang, J., Feng, Y.: A fuzzy logic approach for detection of video shot boundaries. Pattern Recognition 39, 2092–2100 (2006)
3. Fang, H., Jiang, J.: Predictive based cross line for fast motion estimation in MPEG-4 videos. In: Proceedings of IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology, San Jose, USA, January 2004, pp. 175–183 (2004)
4. Feldman, J., Singh, M.: Information along contours and object boundaries. Psychological Review 112, 243–252 (2005)
5. Goyal, A., Punitha, P., Hopfgartner, F., Jose, J.M.: Split and Merge Based Story Segmentation in News Videos. In: Boughanem, M., et al. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 766–770. Springer, Heidelberg (2009)
6. Guimaraes, S.J.F., Couprie, M., Araujo, A.D.A., Leite, N.J.: Video segmentation based on 2D image analysis. Pattern Recognition Letters 24, 947–957 (2003)
7. Guru, D.S., Dinesh, R.: Non-parametric adaptive region of support useful for corner detection: a novel approach. Pattern Recognition 37, 165–168 (2004)
8. Hanjalic, A.: Shot boundary detection: unravelled and resolved? IEEE Trans. Circuits Syst. Video Technol. 12(2) (2002)
9. Huang, C., Liao, B.: A robust scene-change detection method for video segmentation. IEEE Trans. Circuits Syst. Video Technol. 11(12) (2001)
10. Jadon, R.S., Chaudhury, S., Biswas, K.K.: A fuzzy theoretic approach for video segmentation using syntactic features. Pattern Recognition Letters 22, 1359–1369 (2001)
11. Kim, W.H., Jeong, Y.J., Moon, K.S., Kim, J.N.: Adaptive shot change detection technique for real time operation on PMP. In: Proc. of Third Intl. Conf. on Convergence and Hybrid Technology, pp. 295–298 (2008)
12. Liu, T.Y., Lo, K.T., Zhang, X.D., Feng, J.: A new cut detection algorithm with constant false alarm ratio for video segmentation. J. Vis. Commun. Image R. 15, 132–144 (2004)
13. Lo, C., Wang, S.: Video segmentation using a histogram based fuzzy c-means clustering algorithm. Computing, Standards and Interface 23, 429–438 (2001)
14. Porter, S., Mirmehdi, M., Thomas, B.: Temporal video segmentation and classification of edit effects. Image and Vision Computing 21, 1098–1106 (2003)
15. Punitha, P., Urruty, T., Feng, Y., Halvey, M., Goyal, A., Hannah, D., Klampanos, I., Stathopoulos, V., Villa, R., Jose, J.: Glasgow University at TRECVID 2008. TRECVID Workshop at NIST, Gaithersburg, MD, USA (2008)
16. Teh, C.H., Chin, R.T.: On the detection of dominant points on digital curves. IEEE Trans. Pattern Analysis and Machine Intelligence 11, 856–859 (1989)
17. Whitehead, B.P., Laganiere, R.: Feature based cut detection with automatic threshold selection. In: Enser, P.G.B., Kompatsiaris, Y., O'Connor, N.E., Smeaton, A., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 410–418. Springer, Heidelberg (2004)
Shape-Based Autotagging of 3D Models for Retrieval Ryutarou Ohbuchi and Shun Kawamura Graduate School of Medicine and Engineering, University of Yamanashi, 4-30 Takeda, Kofu-shi, Yamanashi-ken, 400-8511, Japan [email protected], [email protected]
Abstract. This paper describes an automatic annotation, or autotagging, algorithm that attaches textual tags to 3D models based on their shape and semantic classes. The proposed method employs Manifold Ranking by Zhou et al, an algorithm that takes into account both local and global distributions of feature points, for tag relevance computation. Using Manifold Ranking, our method propagates multiple tags attached to a training subset of models in a database to the other tag-less models. After the relevance values for multiple tags are computed for tag-less points, the method selects, based on the distribution of feature points for each tag, the threshold at which the tag is selected or discarded for the points. Experimental evaluation of the method using a text-based 3D model retrieval setting showed that the proposed method is effective in autotagging 3D shape models. Keywords: 3D geometric modeling, content based retrieval, automatic annotation, manifold ranking, text tags, semantic retrieval.
1 Introduction Specification of a query is one of the most fundamental issues in retrieving multimedia objects such as images and 3D geometric models. These data objects may be searched either by example(s) or by text(s), each approach having its own advantages and disadvantages. To retrieve 3D models, a query for the Query-By-Example (QBE) approach could be sketches in either 2D [12] or 3D of the desired shape, an example 3D model, or a photograph of a real-world object having the desired shape. Most of the published work on 3D model retrieval used the QBE approach [2, 5, 6, 7, 9, 10, 12, 14]. While the QBE approach is quite powerful and useful in many application scenarios, it does have drawbacks. For example, it is often impractical to find a 3D model that has a shape similar enough to the desired one, or a user may find sketching a complex shape difficult. Capturing the semantic aspect of the desired shape is also difficult with the QBE approach. Previous work used either single-class learning via relevance feedback [6, 7] or off-line multi-class learning [10] for semantics. An alternative, the Query-By-Text (QBT) approach, employs a text as a query, as in text-based search engines. While the QBT is disadvantageous in directly specifying a desired shape, it has an advantage in describing the semantic content of the desired shape.
For example, the word "mammal" could be used to retrieve both a dolphin and a cat, which have very different geometric shapes. The issue in QBT of 3D models is that most 3D models are not associated with text tags, and that adding consistent tags manually to a large number of 3D models can be quite difficult. An automatic or semi-automatic approach, e.g., propagation of tags from a set of tagged models to a large number of tag-less models, is necessary. The authors know of only two examples, one by Zhang et al. [16] and the other by Goldfeder et al. [3], that studied autotagging of 3D models. The method by Zhang et al. [16] associates a list of attributes, not text tags, with a 3D model. The method employs biased kernel regression to propagate probabilities of the attributes from the manually tagged source models to the tag-less models. For retrieval, the method treats the list of probabilities as a feature vector and computes distances among feature vectors; thus, it is not strictly a QBT method. The method by Goldfeder et al. [3] is a QBT approach, in which text tags are propagated from tagged 3D models to tag-less 3D models. To propagate tags, it uses the local structure of the shape feature space via nearest neighbor search. For experiments, they used a set of 192,343 mesh-based 3D models having text tags in a snapshot of Google 3D Warehouse (G3W). The G3W is an evolving set of a large number of 3D models contributed by users, and it contains errors and inconsistencies in its tags. Unfortunately, the G3W snapshot they used is not available for use at this time. Goldfeder et al. evaluated the tagging performance of their method by using a retrieval task, comparing the retrieval performance of their autotagging-based QBT algorithm with that of a QBE algorithm that uses 3D models as its queries. The paper [3] reported that the QBT performed equally well or better than the QBE algorithm they compared against. In this paper, we propose a novel off-line semi-supervised algorithm for autotagging 3D model collections that exploits both local and global distributions of tagged and tag-less shape features by means of the Manifold Ranking (MR) algorithm by Zhou et al. [17, 18]. From a set of shape features extracted from a set of tagged 3D models that share a same text tag, the MR algorithm diffuses a relevance rank value indicating the likelihood of tag-less models having the tag. This MR process is performed once for each keyword to determine the set of tags a tag-less model should receive. To discard an unreliable tag attached to a tag-less model, the compactness of the distribution of the tagged models sharing the tag is taken into consideration. An evaluation of the proposed algorithm using a 3D model retrieval scenario showed that a QBT retrieval method using the text tags added by our method significantly outperforms a QBE retrieval method that uses 3D shape feature comparison.
2 Autotagging 3D Models via Multiple Manifold Ranking Our algorithm, whose outline is shown in Figure 1, consists of the following three steps. Details of the steps will be described in the following subsections.
(1) Shape feature extraction: Extract a shape feature from the input 3D model.
(2) Tag propagation: For each tag, propagate, in the shape feature space, the likelihood of a 3D model having the tag from the tagged "training example" 3D models to the tag-less models. The tag likelihood propagation is performed by
using the Manifold Ranking (MR) algorithm by Zhou et al. [17, 18]. Repeat the MR-based tag likelihood propagation for all the tags.
(3) Tag selection: A tag-less model now has multiple likelihood values for multiple tags. Select, based on the mutual distance of the tagged and tag-less models, the most likely tags to be attached to the model.
Fig. 1. The method estimates a Tag Relevance Rank (TRR), a likelihood of a model having a text tag, by Manifold Ranking [17, 18]. Estimation of TRR is performed for every tag. Then most relevant of the tags are selected for a 3D model by using tag-specific threshold values.
2.1 Feature Extraction So long as a shape descriptor, or feature, of a 3D model is a vector, it can be used in the proposed autotagging algorithm. A feature for comparing 3D shapes is required to have a set of invariances. Typically, invariance to shape representation, invariance to geometrical transformation up to similarity transformation, and invariance to geometrical and topological noise are expected. One might need to compare a CAD model defined as a 3D solid by using curved surface patches with a polygon soup model or a point set model. Invariance against the three rotational degrees of freedom is not trivial to achieve [14]. Topological and geometrical noise, such as holes, variation in vertex connectivity in a mesh, and displacements of vertices, needs to be tolerated. Depending on the application, additional invariance, such as invariance against joint articulation, may be required. For the experiments described in this paper, we used our extension of Wahl's Surflet Pair Relation Histograms (SPRH) [15], a shape feature that has nice invariance properties, is readily available, and has reasonable retrieval performance. The Windows 32-bit executable of our extension of the SPRH is available at our web site [8]. The SPRH was originally developed for oriented point set models acquired by laser range scanners. We added a step to convert a surface-based model to an oriented point-set model by using quasi-Monte Carlo sampling of surfaces [10]. As a result, the extended SPRH is able to accept popular surface-based shape representations such as polygon soups and closed polygonal meshes. The SPRH feature is a joint histogram of quadruples (δ, α, β, γ), in which δ is the distance and α, β, and γ are the angles between a pair of oriented points. By default, each of the four quantities has 5 bins, making an SPRH feature a 5^4 = 625-dimensional vector. The SPRH is invariant to similarity
transformation without requiring pose normalization. It is also insensitive to various noises, such as variation in mesh tessellation or connectivity, holes, or small vertex displacements. 2.2 Tag Propagation and Selection Our method uses the Manifold Ranking (MR) algorithm [17, 18] to estimate the relevance of a text tag to tag-less 3D models. The MR algorithm is a graph-based learning algorithm that can be used in unsupervised, supervised, or semi-supervised mode; our method uses it in semi-supervised mode. Intuitively, the MR resembles solving a diffusion equation on an irregular-connectivity mesh in a high-dimensional feature space. The mesh is created by connecting the feature points in the feature space by their proximity. For example, given the 625D feature of the SPRH, the mesh is embedded in a 625D space with its feature points having 625D coordinates. In our autotagging framework, the mesh for the MR is created from the union of the feature vectors of the tagged as well as the tag-less 3D models. Typically, a text tag is associated with multiple tagged (or "source") 3D models, so that an MR process has multiple sources for diffusion. Given N text tags, or keywords, the proposed algorithm runs N such MR processes to diffuse a Tag Relevance Rank (TRR) value for each of the N tags onto all the tag-less 3D models. At the equilibrium of iteratively solving the diffusion equation, the higher the diffused TRR value, the higher the likelihood of the tag-less model having the tag. Let us assume propagation of a tag from a set of tagged 3D models to the other 3D models without the tag. Let χ = {x_1, ..., x_s, x_{s+1}, ..., x_t, x_{t+1}, ..., x_n} be a set of n features in an m-dimensional feature space R^m, in which the first t points, x_1 to x_t, are the feature points of the tagged models. Among the tagged points, the points x_1 to x_s are the source set, and the points x_{s+1} to x_t are the probe set. This splitting of the tagged points is done to estimate a TRR threshold value th, which decides whether a tag should be attached to a tag-less point or not. The source set and the probe set are of equal size and are drawn randomly from the set of t points sharing the same tag. To put it briefly, by performing TRR diffusion within a set of models sharing a tag, the tightness of the distribution of the tag is estimated in order to compute th. Details on tag selection are explained below. The rest of the points, x_{t+1} to x_n, are the tag-less ones for which the TRR values need to be computed. Let d : χ × χ → R denote a distance metric on χ, e.g., the L1-norm or cosine distance, that assigns a pair of points x_i and x_j a distance d(x_i, x_j). Let f : χ → R be a ranking function that assigns each x_i a ranking score f_i, thereby forming a rank vector f = [f_1, ..., f_n]^T. Let the n-dimensional binary-valued vector y = [y_1, ..., y_n]^T be a label vector, in which y_i = 1 for the tagged ("source") points and y_i = 0 for the rest, i.e., the points to which tags are to be assigned. We first create the affinity matrix W, where W_ij indicates the similarity between samples x_i and x_j:
W_ij = exp(−d(x_i, x_j) / σ)  if i ≠ j,   W_ij = 0  otherwise    (1)
The distance metric d(x_i, x_j) used in forming W affects ranking performance; we compare several distance measures in Section 3.1. The positive parameter σ defines the radius of influence. Note that W_ii = 0, since there is no arc connecting a point with itself. The matrix W is symmetric with non-negative entries. A normalized graph Laplacian S is then defined as
S = D^(−1/2) W D^(−1/2)    (2)
where D is a diagonal matrix in which D_ii equals the sum of the i-th row of W, that is, D_ii = Σ_j W_ij. The ranking vector f = [f_1, ..., f_n]^T can then be estimated by iterating the following until convergence:
f(t+1) = α S f(t) + (1 − α) y    (3)
As the equilibrium is reached, each tag-less point has a TRR value associated with it, indicating the likelihood of the point having the tag. We call the ranking vector at the equilibrium f*. The algorithm performs the tag propagation process described above N times, once for each of the N tags. After all the tags have been diffused, a point in the feature space has N positive TRR values corresponding to the N tags. A TRR value indicates how relevant the corresponding tag is to the point. The system must now decide which subset of the N tags should be attached to the point. To decide, our method employs a simple probing to estimate a cut-off threshold th of the TRR for each tag. As described above, the method randomly splits the set of t tagged points sharing the same tag into two (roughly) equal-sized subsets, the source set and the probe set. After the MR, the TRR values of the points in the probe set, that is, the points x_{s+1} to x_t in the set of features χ, are used to estimate the threshold th by the following equation:
th = min over s+1 ≤ j ≤ t of (f_j*)    (4)
The threshold th is computed for each tag. If the TRR of a tag at a tag-less point is greater than or equal to th for that tag, the tag is attached to the point. The tag is also given its TRR value as its confidence value at the point. In general, a point may be associated with multiple tags if more than one tag has a TRR value greater than or equal to its respective threshold. As a reviewer pointed out, the use of the minimum to compute the threshold th is debatable, as it is not robust against outliers. However, given the small size of some of the classes in the benchmark dataset, e.g., just 4 models per class and 2 models in the probe set, we could not come up with a better method that is statistically robust. In the example of Figure 2, the tagged points belonging to the tag "car" have the tightest distribution, with the threshold th=0.8. The TRR of the "car" tag at the tag-less point in question is 0.5, which is below the threshold th=0.8; thus, the "car" tag is not attached to the point. Similarly, the "animal" tag is not attached to the point. The "chair" tag, on the other hand, has TRR=0.5, while the threshold for the "chair" tag is th=0.4; thus, the point is associated with the tag "chair".
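A minimal sketch of the tag-propagation and threshold step, assuming a precomputed pairwise distance matrix over all feature points; sigma, alpha and the convergence tolerance are illustrative values, not those used in the paper.

```python
import numpy as np

def manifold_rank(dist, source_idx, sigma=0.5, alpha=0.99, tol=1e-6, max_iter=1000):
    """Propagate a Tag Relevance Rank (TRR) from source points (eqs. 1-3)."""
    n = dist.shape[0]
    W = np.exp(-dist / sigma)
    np.fill_diagonal(W, 0.0)                             # W_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^-1/2 W D^-1/2
    y = np.zeros(n)
    y[source_idx] = 1.0                                  # label vector
    f = y.copy()
    for _ in range(max_iter):
        f_new = alpha * (S @ f) + (1.0 - alpha) * y
        if np.linalg.norm(f_new - f) < tol:
            break
        f = f_new
    return f

def tag_threshold(f, probe_idx):
    """Threshold th = minimum TRR over the probe subset (eq. 4)."""
    return float(np.min(f[probe_idx]))

# A tag is attached to tag-less point i when f[i] >= tag_threshold(f, probe_idx).
```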
Fig. 2. A tag is attached to a feature if its Tag Relevance Rank (TRR) propagated from features with the tag exceeds the threshold th specific to the tag. The threshold is the estimated minimum TRR of the probe set of features having the tag.
3 Experiments and Results As research on autotagging, or automatic annotation, of 3D models has just started, there is no established benchmark for an objective evaluation. Ideally, one should use a large enough corpus, something similar to the snapshot of G3W containing 190k models used by Goldfeder et al., for tagging. Unfortunately, the G3W snapshot used by Goldfeder et al. is not available to us. In addition to the database, one also needs a method to evaluate the tagging performance numerically and objectively. We resorted to using a well-accepted benchmark for QBE 3D model retrieval, the Princeton Shape Benchmark (PSB) [13], for our evaluation. The 1,814-model PSB contains two equal-sized subsets, the train set and the test set, each consisting of 907 models. We regard the class labels of the PSB train set as text tags and transfer the tags from the train set to the test set. We then regard the retrieval performance obtained under the QBT framework as the performance index of the autotagging method. The PSB has a 4-level class hierarchy, from the most detailed "Base" level classes to the most abstract "Coarse 3" level classes containing only two classes: "natural" and "manmade". Of these four levels, we use the Base level classes only in the following experiments. The Base level itself has four internal levels of hierarchy. If leaf nodes of the Base level are counted, there are about 90 classes in the Base level of both the train and test sets. Note that some of the classes in the Base level of the train set do not have counterparts in the Base level of the test set, and vice versa. If both leaf and non-leaf nodes of the Base level classes are counted, there are about 130 classes. (To be exact, there are 129 classes in the train set and 131 classes in the test set.) Note also that the size of a class at a leaf node of the Base level can be quite small, ranging from just four models per class to a couple of dozen per class. As the feature vector, we used the SPRH by Wahl [15], modified to accept the polygon soup and polygonal mesh models found in the PSB. Note that the performance of the SPRH is only modest by the current state of the art. Here we do not aim for the highest retrieval performance, but rather want to evaluate our autotagging algorithm by using a QBT framework.
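For reference, the R-precision measure used throughout the evaluation can be computed as below; this is the standard definition, shown as a small sketch rather than the authors' evaluation code.

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant items for the query."""
    relevant = set(relevant_ids)
    R = len(relevant)
    top_r = ranked_ids[:R]
    return sum(1 for item in top_r if item in relevant) / R
```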
3.1 Distance Measures for the Similarity Matrix This set of experiments compares, via retrieval performance, the autotagging performance of various distance measures used in forming the affinity matrix W in equation (1). For this experiment, we used 63 class labels that are common to the train and test sets of the PSB. These 63 classes are selected from the Base level and include both leaf and non-leaf nodes. They correspond to 811 models in the train set and 823 models in the test set. We used the 3D models in the train set as the tagged models and propagated their class labels to the models in the test set. We compared five distance measures, namely the L_k-norm with k=0.5, 1.0, and 2.0, the cosine (cos) measure, and the Kullback-Leibler divergence (KLD). In the following equations, x = (x_i) and y = (y_i) are the feature vectors and n is the dimension of the vectors. The L_k-norm is defined by the following equation:
d_k(x, y) = [ Σ_{i=1}^{n} |x_i − y_i|^k ]^(1/k)    (5)
The original paper on MR used the Euclidean distance, or L_2-norm (k=2.0), to form the affinity matrix [17, 18]. The Manhattan distance with k=1.0 is often said to perform better in higher dimensions. Aggarwal et al. [1] argued that k<1.0, e.g., k=0.5, works even better in high dimensions than k=1.0. As the cosine measure is a measure of similarity having the range [0, 1], we converted it to a distance using the following equation:
d_cos(x, y) = 1 − (x · y) / (‖x‖ ‖y‖)    (6)
The KLD is sometimes referred to as information divergence, or relative entropy, and is not a distance metric, for it is not symmetric. We use a version of the KLD that is symmetric, as below:
d_KLD(x, y) = Σ_{i=1}^{n} (y_i − x_i) ln(y_i / x_i)    (7)
Figure 3 shows the retrieval performance, in R-precision, of the various distance measures d(x_i, x_j) used in forming the affinity matrix W of equation (1). Both the L_0.5-norm and the KLD performed equally well, finishing at the top. The L_2-norm employed in the original papers on manifold ranking [17, 18] finished last in this case. The result will likely differ for other features and databases with different distributions of feature points; however, it is definitely worth experimenting with several different distance measures. We use the L_0.5-norm for the experiments that follow.
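A minimal sketch of the compared distance measures (eqs. 5-7). The small epsilon added in the KLD to guard against zero histogram bins is an implementation assumption, not a detail given in the paper.

```python
import numpy as np

def lk_norm(x, y, k=0.5):
    """L_k distance (eq. 5); k = 0.5, 1.0 or 2.0 in the experiments."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def cosine_distance(x, y):
    """1 minus cosine similarity (eq. 6)."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

def symmetric_kld(x, y, eps=1e-12):
    """Symmetrised Kullback-Leibler divergence (eq. 7)."""
    x, y = x + eps, y + eps
    return np.sum((y - x) * np.log(y / x))
```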
[Figure 3 data: R-precision per distance measure: L0.5 41.3%, L1 39.4%, L2 37.1%, cosine 38.5%, KLD 41.0%]
Fig. 3. Distance measures used in forming the affinity matrix W (Equation (1)) and retrieval performance measured in R-precision
3.2 Query by Text vs. Query by (3D Model) Example In this set of experiments, in an effort to quantify autotagging performance, the retrieval performances of the QBT and QBE frameworks are compared. The QBE retrieval and the autotagging step for the QBT used the same shape feature, the SPRH modified by us to accept polygon-based models. As tags, we used the 21 class labels of the 21 leaf nodes in the Base level that are common to the PSB train and test sets. This is different from the 63 class labels used in the previous experiment, which included both leaf and non-leaf nodes in the Base level of the PSB. This experiment used leaf nodes only, since performance evaluation in the QBE framework can only use leaf nodes as queries; that is, it is not possible for the QBE framework to query by using a "super shape" corresponding to a super class. Figure 4 shows the recall-precision plots for the QBT and QBE frameworks. The QBT outperformed the QBE almost everywhere except at the lowest recall. The QBE won at the lowest recall since the top-ranked retrieval in the QBE is always correct, as it is the query itself; for the QBT retrieval, however, the top-ranked result may be wrong if the tagging step made mistakes. Figure 5 compares the retrieval performance of the QBT and the QBE when only leaf classes in the Base level are considered (case "leaf") and when both leaf and non-leaf classes in the Base level are considered (case "all"). The R-precision values are averaged over the classes and models used in each experiment. R-precision for both QBT and QBE dropped when non-leaf classes were included in the queries; still, the QBT performed better than the QBE in both cases. We observe that the QBT retrieval performance of average R-precision=55% is on a par with some of the state-of-the-art QBE methods evaluated by using the same database but with a different set of classes. (For reference, the Light Field Descriptor (LFD) by Chen et al. [2] has average R-precision=45% if evaluated by using all the 93 leaf-node classes of the Base level PSB test set.) Such an apparently good performance of the QBT may be explained as follows. The proposed algorithm adds a tag by propagating TRR from multiple feature points sharing the tag. This resembles the case in which the MR is used in a relevance-feedback-based 3D model retrieval system [11]. The system described in [11] produced a high
Fig. 4. Recall-precision plot for the QBE and QBT retrieval experiments. Overall, the QBT using the automatically added tags significantly outperformed the QBE using the shape feature.
[Figure 5 data: average R-precision, leaf case: QBE 40.6%, QBT 55.2%; all case: QBE 36.8%, QBT 41.3%]
Fig. 5. The QBT based on the autotagging performed significantly better than the QBE for both the “leaf” case, which included only leaf classes in the Base level PSB, and the “all” case, which included both leaf and non-leaf classes in the Base level PSB.
retrieval performance after multiple 3D models were marked as relevant over a few relevance feedback iterations. 3.3 Tagging Examples Figures 6, 7, and 8 show examples of tags attached to models by the proposed algorithm, for successful, unsuccessful, and mixed cases, respectively. The tags in red are correct tags, while those in blue are incorrect. The size of the letters indicates the estimated quality of the tag; the larger the letters, the more confident the system is about the tag.
Fig. 6. Examples of tags (successful cases.)
Fig. 7. Examples of tags (unsuccessful cases.)
Fig. 8. Examples of tags (mixed cases.)
Successful examples in Figure 6 are associated with both correct and incorrect tags. However, the incorrect tags should matter little in practice as they have low confidence values. Unsuccessful examples, e.g., a balloon (m1345) mistakenly tagged as “body part” and “head”, seem to suggest the need for a better shape feature as well as more training examples. For example, a multiresolution feature acquisition method similar to the one used in [10] might help.
4 Conclusion and Future Work In this paper, we proposed and evaluated a shape-based algorithm for automatic annotation, or autotagging, of 3D models with text keywords by learning the tags from a corpus of tagged 3D models. The algorithm first extracts shape features from 3D models both with tags (the training corpus) and without tags (the autotagging targets). It then computes, for each tag, the relevance of each 3D model to the tag. To compute the relevance, the algorithm takes into account the distribution of shape features in the ambient feature space with both local and global consistency by using the Manifold Ranking (MR) algorithm of Zhou et al. [17, 18]. The algorithm chooses those tags that are reasonably high in confidence, based on the mutual distance of the tagged models sharing the tag. We evaluated the autotagging algorithm in a 3D model retrieval scenario, assuming that a good tagging result should produce good retrieval performance under the Query By Text (QBT) framework. Our experiment showed that the QBT retrieval employing the automatically added tags significantly outperformed the Query By Example (QBE) retrieval employing query 3D models and their shape features. In the future, we would like to employ a more "realistic" corpus, e.g., larger in size and with noise and errors present, to evaluate the proposed algorithm. To this end, the availability of a snapshot of the G3W would greatly benefit the research community. Assuming the availability of a large corpus, we need to improve the scalability of the MR algorithm so that it can handle a large number of features. We would also like to enhance the method, for example by using a feature refined by other learning-based algorithms, or by using a combination of multiple shape features.
Acknowledgements This research has been funded in part by the Ministry of Education, Culture, Sports, Sciences, and Technology of Japan (No. 18300068).
References
1. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, p. 420. Springer, Heidelberg (2000)
2. Chen, D.-Y., Tian, X.-P., Shen, Y.-T., Ouh-young, M.: On Visual Similarity Based 3D Model Retrieval. Computer Graphics Forum 22(3), 223–232 (2003)
3. Goldfeder, C., Allen, P.: Autotagging to Improve Text Search for 3D Models. In: Proc. 8th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2008, pp. 355–358 (2008)
4. Google 3D Warehouse, http://sketchup.google.com/3dwarehouse/
5. Iyer, M., Jayanti, S., Lou, K., Kalyanaraman, Y., Ramani, K.: Three Dimensional Shape Searching: State-of-the-art Review and Future Trends. Computer Aided Design 5(15), 509–530 (2005)
6. Leifman, G., Meir, R., Tal, A.: Semantic-oriented 3d shape retrieval using relevance feedback. The Visual Computer 21(8-10), 865–875 (2005)
7. Novotni, M., Park, G.-J., Wessel, R., Klein, R.: Evaluation of Kernel Based Methods for Relevance Feedback in 3D Shape Retrieval. In: Proc. The Fourth International Workshop on Content-Based Multimedia Indexing, CBMI 2005 (2005)
8. Ohbuchi, R., et al.: Modified SPRH Windows 32bit, http://www.kki.yamanashi.ac.jp/~ohbuchi/research/research_index.html
9. Ohbuchi, R., Takei, T.: Shape-Similarity Comparison of 3D Shapes Using Alpha Shapes. In: Proc. Pacific Graphics 2003, pp. 293–302 (2003)
10. Ohbuchi, R., Yamamoto, A., Kobayashi, J.: Learning semantic categories for 3D Model Retrieval. In: Proc. ACM MIR 2007, pp. 31–40 (2007)
11. Ohbuchi, R., Shimizu, T.: Ranking on semantic manifold for shape-based 3d model retrieval. In: ACM MIR 2008, pp. 411–418 (2008)
12. Pu, J., Lou, K., Ramani, K.: A 2D Sketch-Based User Interface for 3D CAD Model Retrieval. Computer Aided Design and Application 2(6), 717–727 (2005)
13. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton Shape Benchmark. In: Proc. Shape Modeling International, SMI 2004, pp. 167–178 (2004), http://shape.cs.princeton.edu/search.html
14. Tangelder, J., Veltkamp, R.C.: A survey of content based 3D shape retrieval methods. Multimedia Tools and Applications 39(3), 441–471 (2008)
15. Wahl, E., Hillenbrand, U., Hirzinger, G.: Surflet-Pair-Relation Histograms: A Statistical 3D-Shape Representation for Rapid Classification. In: Proc. 3DIM 2003, pp. 474–481 (2003)
16. Zhang, C., Chen, T.: An Active Learning Framework for Content-Based Information Retrieval. IEEE Trans. Multimedia 4(2) (June 2002)
17. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with Local and Global Consistency. In: Proc. NIPS 2003 (2003)
18. Zhou, D., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B.: Ranking on Data Manifolds. In: Proc. NIPS 2003 (2003)
PixGeo: Geographically Grounding Touristic Personal Photographs Rodrigo F. Carvalho and Fabio Ciravegna The University of Sheffield, Dept. of Computer Science, Organisations, Information and Knowledge group (OAK) 211 Portobello, Regent Court, Sheffield - UK, S1 4DP {R.Carvalho,F.Ciravegna}@dcs.shef.ac.uk http://oak.shef.ac.uk/
Abstract. Realizing the potential of digital media in the home relies on the existence of detailed, semantically unambiguous metadata. However, in the domain of personal photography, the generation of such metadata remains an unsolved issue. This paper introduces PixGeo, a solution for geographically grounding touristic personal photographs based on exploiting the strong contextual connections among photographs in a user's collection. We experiment with building temporal clusters within test collections and leverage an 8 million image dataset from Flickr to perform scene matching using k-NN. In the evaluation we find that the approach performs 30% better than chance for grounding photos. Keywords: Photographs, Images, Annotation, Geographic-Grounding, Geo-Tagging, Context, PixGeo.
1 Introduction
The market launch of the first low-cost point-and-shoot camera in February 1900 by Kodak (the Kodak Brownie1) gave the masses access to technology that was until then only available to professionals. Mass-produced portable cameras have since been well known for supporting activities such as tourism, vacationing and amateur photography. But especially tourism, as noted by Susan Sontag [14]: "(. . . ) photography develops in tandem with one of the most characteristic of modern activities: tourism (. . . ) It seems positively unnatural to travel for pleasure without taking a camera along (. . . )". With the advent of digital cameras, the number of photos taken when travelling has increased tremendously, along with the existing challenges of organising and annotating such ever-expanding collections (estimates point to 375 petabytes or 787.5 billion photos per year [4]). In the domain of personal photography, and more specifically within family-centered communities, there is a growing need to exploit the existence of semantic metadata for facilitating the management of photographic collections.
http://en.wikipedia.org/wiki/Brownie (camera)
However, previous studies by Frohlich et al. [7] and Miller and Edward [11] have found that users in this domain are unlikely to make extensive contributions to manually generating semantic metadata about their ever-expanding photographic collections. This limits the reach of existing approaches that rely heavily on direct user input or on the existence of comprehensive textual data related to photographs. Geographic location is a useful piece of metadata when devising methodologies for annotating personal photographs [13]. Our goal is to investigate how context data can be used to estimate the geographic location of touristic photographs.
1.1 Personal Photos and Context Awareness
Much like in a forest, where the shadow cast by tall trees will always affect the height of their neighbours, in personal photography the semantic content of photographs in a collection will always be influenced by that of other photographs in the same collection, according to a number of contextual aspects. The digital and physical environments of a photo are the most prominent aspects of context in this domain. The physical environment is defined in terms of any information that is known about a photograph as a single entity, such as its timestamp, GPS coordinates or indoor/outdoor classifications. This information can be used to make direct inferences about the contents of images, such as what the photo may be depicting considering where it was taken. On the other hand, the digital environment is defined in terms of all photographs within the same collection, thus treating them as inherently collective entities. The digital context can be used for maximising the effect that the semantics of one photograph may have on the collection as a whole. For instance, it is likely that two temporally close images were taken in the same physical location. When viewing the above statements from a purely systematic and technical point of view, Bazire and Brézillon offer a definition of context that captures the essence of what context awareness signifies in the photographic domain: it is "(. . . ) a set of constraints that influence the behaviour of a system (a user or a computer) embedded in a given task" [1]. So what is suggested here is that the knowledge, information and data about a single photograph have not only intra-photographic influence (i.e. within itself), but also inter-photographic influence, where they affect the semantics of its neighbouring photos, whatever the definition of neighbouring may be. The sections to follow demonstrate that, by applying the notions of context outlined above in the domain of touristic photography, we are able to estimate the physical locations of users' touristic photos with promising accuracy. The methodology uses a temporal clustering approach for contextualising photographs, coupled with a purely data-driven scene matching approach. We evaluate the methodology by attempting to geographically ground photographs to within a 200 meter radius of the actual landmark depicted.
2 Geo-tagged Dataset
In order to support a purely data-driven approach to scene recognition, we make use of a dataset of 106 million images collected from Flickr and processed with a range of MPEG-7 [10] descriptors (CoPhIR [2]). In order to estimate the geographic grounding of photographs, all geo-tagged images were extracted from the original dataset, which resulted in a dataset of over 8 million images. Previous studies of image features for exploiting the correlation between image properties and geographic location have demonstrated that an assortment of both colour and texture descriptors should be used for scene matching ([8], [12]). For this task, the following MPEG-7 image descriptors were used:
– Colour Layout: given that it is a very compact and resolution-invariant representation of colour, and that it is built for use in high-speed image retrieval systems, it is seen as appropriate for this domain, where the scale of images may vary widely together with the number of processed images.
– Colour Structure: represents an image by both the colour distribution (e.g. histogram) and the local spatial structure of the colour. It is used here to complement the image's colour layout.
– Edge Histogram: represents the spatial distribution of five types of edges. Texture features help distinguish geographically correlated properties [8].
– Homogeneous Texture: characterises the region texture using the mean energy and the energy deviation from a set of frequency channels. It is used here to complement the edge histogram for better accuracy.
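As a sketch of how distances over these four descriptors might be combined for scene matching, assuming the per-descriptor vectors have already been extracted (e.g. as distributed with CoPhIR); the equal weighting follows the description in Section 3, while the use of L2 per descriptor and pre-normalisation are assumptions, not details given in the paper.

```python
import numpy as np

DESCRIPTORS = ("colour_layout", "colour_structure",
               "edge_histogram", "homogeneous_texture")

def descriptor_distance(a, b):
    """Equally weighted sum of per-descriptor distances between two images.

    `a` and `b` are dicts mapping descriptor name -> 1-D numpy array,
    assumed to be pre-normalised to comparable ranges.
    """
    total = 0.0
    for name in DESCRIPTORS:
        total += np.linalg.norm(a[name] - b[name])
    return total / len(DESCRIPTORS)
```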
3 PixGeo
PixGeo adopts a context-aware approach for geographically grounding touristic personal photographs. It interweaves a photo's physical and digital contexts to produce an accurate estimate of its geographic location. The approach is composed of several key steps for contextualising not only the photographs to be geo-tagged, but also the large dataset of 8 million geo-tagged training images. What the approach tries to achieve is to maximise the impact that little user input has on the annotation of personal photographs by using k-NN, an instance-based learning algorithm, to find visual matches that are geographically coherent for temporally close photographs. Figure 1 visually describes the proposed method. Considering that the entire geo-tagged dataset is pre-indexed with the relevant visual descriptors at a preprocessing stage, the algorithm steps are explained in detail next. User Input. We obtain an input collection C and a high-level geographic context from the user. Let C = {p1, p2, . . . , pn} be the collection of photos to be geographically grounded and coords(C) = (latitude_c, longitude_c) the coordinates for the initial geographic context of these photos at the city/region level as given by the user. This geographic context might be given in the form of latitude and longitude coordinates (e.g. collected from drag-and-drop map
[Figure 1 block diagram: the user provides a geo-location for the batch of input images, which aids in geographically contextualising the new un-annotated images; the labelled components are un-annotated images, MPEG-7 feature extractors, a temporal clusterer (events), geographic contextualising, a k-NN training model, a visual matcher, a geographic coherence step, and the resulting annotated (geographically grounded) images.]
Fig. 1. The PixGeo approach
This geographic context might be given in the form of latitude and longitude coordinates (e.g. collected from drag-and-drop map interfaces) or as more precise textual annotations (so long as the annotation can be unambiguously grounded to a physical location indicating the city or region depicted in the photographs). The main assumption behind the approach is that we can rely on a little geographic input about a user's photographs. Previous studies have found that it is reasonable to expect such input from users, who are likely to organise their own photos in folders and quite often label them with the location and event depicted ([11], [9]).

Contextualising Training Instances. The input provided by the user plays a crucial role in the contextualisation of both test and training instances. In a domain where visual scene matching from pixel data alone has not yet proved its reliability, it is necessary to consider alternative solutions that maximise the relevance of training data while reducing the number of noisy instances, thus increasing classification competence. Applying a priori knowledge of the location of photographs allows us to do just that by performing instance selection [3] on the original geo-tagged dataset before classification. In PixGeo, accuracy is paramount, and the possibility of noisy matches is countered by initially filtering out training instances that are unlikely to be relevant to the geographic grounding task. A contextualised dataset Dgeo−c is obtained from the original 8 million image dataset Dgeo = {p1, p2, . . . , pn} where
Dgeo−c = {pi ∈ Dgeo | distance(coords(pi), coords(C)) ≤ d}    (1)
where the distance function finds the geographic distance between two points and d can be determined heuristically (10 km in this instance). What we now have is a smaller, contextualised dataset where Dgeo−c ⊂ Dgeo.

Temporal Context. In order to maximise the effect that the semantics of one photograph has over others in the collection, the digital context and inter-photographic relations are defined here by the temporal closeness between photos. In the domain of touristic photographs, a reliable characteristic of collections about a single trip or place visited is that temporal clusters tend to form around the many sub-events that construct a visual journal of where the person has been. Each separate temporal cluster is likely to represent a place visited with a number of photos of the same type of scene, and by grouping images using their temporal closeness, PixGeo maximises the chances of finding geographically coherent visual matches. The temporal association between images has been shown to produce reliable results for location classification in other works [5]. The collection C can therefore also be defined as Cevents = {T1, T2, . . . , Tn} where ∀Tn ∈ Cevents, Tn ⊂ C, and there is no intersection between elements of Cevents. The algorithm used for finding temporal clusters within a collection of images is that proposed by Zhao and Liu [16]. Using an event clustering algorithm allows us to evaluate the above assumption that each place visited is neatly encapsulated within temporal clusters or sub-events.

Scene Matching. Once photographs are subdivided into temporally close sub-events, PixGeo attempts to find for each image a number of possible visual matches for geographically grounding the images in the temporal cluster in question. The scene matching exploits the fact that most people will not feel they have extracted the most value out of the place they have visited unless they spend some time at the landmarks that are most representative of where they are. For instance, it is highly unlikely for someone to visit York and not see the city's Minster, or for someone to visit London and not go to Westminster or the British Museum. This is clearly reflected in the dataset used for this task, where geographic places tend to be represented by a number of well-known and well-photographed landmarks. Figures 2a and 2b exemplify this for geo-tagged samples of the cities of York and London, where the more densely photographed areas are clearly around well-known landmarks. This characteristic of the dataset, and of personal photography in general, suggests that there is a great opportunity for seeking visual matches for the well-represented areas of a city in the input collection. For each input image in our test set we build the same features as discussed in Section 2 and compute the distance in each feature space to all contextualised instances in Dgeo−c. Each feature's distances are given the same weight so that they influence the matching of scenes equally.
[Figure 2 panels: (a) York, with labels for York Minster, The Shambles and Clifford's Tower (York Castle); (b) London, with labels for the British Museum, Piccadilly Circus, Trafalgar Square and Westminster.]
Fig. 2. Density of photos taken in areas of the cities of York and London in the UK. Notice the concentration of photos around areas of high touristic interest.
For each query photo px in each temporal cluster in Cevents we use the aggregate feature distances (late fusion) to find a set of nearest visual neighbours Vx = {v1, v2, . . . , vn} in Dgeo−c, such that Vx ⊂ Dgeo−c and

∀px ∈ C, ∃Vx ∧ |Vx| ≤ k    (2)

Vx = {vi ∈ Dgeo−c | similarity(px, vi) ≥ s}    (3)
where the similarity function indicates visual similarity only, s is a visual similarity threshold and k is the maximum number of nearest neighbours. In order to reduce the impact that classes with more frequent examples have on the prediction for a new vector, the classes of the input image were predicted based on a threshold distance that aims to filter out false positive matches caused by the locations most frequently depicted in the database.

Geographic Coherence. One of the major challenges in understanding the semantic contents of images is that the pixel information they convey is often not unique enough to associate images (or parts of them) with a single semantic concept. A typical example is observed when comparing images of a sunset with other predominantly orange images with a red tint (e.g. fruits). For datasets of mixed nature, this has a noticeable impact on the accuracy of image-to-image matching. This is the case for the dataset we work with here, and despite contextualising training instances, the results are still prone to suffer from image-to-image matching problems. This is due to the highly mixed nature of images in the training dataset, as well as the still large volume of training images we are left with even after filtering (e.g. |Dgeo−c| for York leaves us with more than 7K training instances). In order to circumvent potential image matching problems, PixGeo incorporates the notion of geographic coherence into the approach.
The assumption we make at this point about the nature of tourist photographs is that each temporal cluster is highly likely to focus on recording the visit to one place or area, so the visual matches found for images within each temporal cluster should reflect this. Once k visual matches are found for each input image, PixGeo then seeks to find a single geographically coherent cluster G for the collection, formed by a subset of the visual matches found in a temporal cluster of Cevents. G is defined by exhaustive pairwise comparisons between the visual matches found for each photograph in the temporal cluster. Two visual matches vn and vm are considered to be geographically coherent if

distance(vn, vm) ≤ g    (4)
where g can be defined heuristically (0.2 km in this instance). In the case where more than one geographically coherent set is found within Tn, the one with the highest cardinality is retained as the geographic grounding for the temporal cluster. At this point, a heuristic is used for selecting the most confident geographically grounded temporal cluster as an anchor. Reasoning over the temporal distance between the other temporal clusters and this anchor would yield very accurate estimates of the coordinates of the remaining images in the collection. This discussion, however, is beyond the scope of this paper; in the following sections the focus will be on evaluating PixGeo for its capacity to find a single most confident geographically grounded temporal cluster for the collection.
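To make the sequence of steps above concrete, the sketch below strings together the three operations just described: instance selection within d = 10 km of the user-supplied context (Eq. 1), thresholded k-NN visual matching per temporal cluster (Eqs. 2–3), and selection of the largest geographically coherent set of matches with g = 0.2 km (Eq. 4). The record layout, the similarity callback and the greedy grouping are simplifying assumptions; the paper performs exhaustive pairwise comparisons and does not prescribe this data layout.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def contextualise(dataset, context_coords, d_km=10.0):
    """Instance selection (Eq. 1): keep training photos within d_km of the
    user-supplied city/region coordinates."""
    return [p for p in dataset if haversine_km(p["coords"], context_coords) <= d_km]

def visual_matches(photo, training, similarity, k=5, s=0.5):
    """Up to k nearest visual neighbours whose similarity exceeds the
    threshold s (Eqs. 2-3); `similarity` is any visual similarity function."""
    scored = [(similarity(photo, t), t) for t in training]
    scored = [(sim, t) for sim, t in scored if sim >= s]
    scored.sort(key=lambda st: st[0], reverse=True)
    return [t for _, t in scored[:k]]

def coherent_grounding(cluster_matches, g_km=0.2):
    """Largest geographically coherent group among the matches found for one
    temporal cluster (Eq. 4). The greedy grouping around each candidate is an
    approximation of the exhaustive pairwise comparison described in the text."""
    matches = [m for per_photo in cluster_matches for m in per_photo]
    best = []
    for seed in matches:
        group = [m for m in matches
                 if haversine_km(seed["coords"], m["coords"]) <= g_km]
        if len(group) > len(best):
            best = group
    return best
```

Each training or match record is assumed to be a dictionary with at least a "coords" field holding a (lat, lon) pair; the visual similarity function would be built on the fused descriptor distances discussed in Section 2.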
4 Experimental Results
The evaluation considers the accuracy of the most confident geographic grounding produced for 6 collections depicting the city of York (http://en.wikipedia.org/wiki/York), in the UK, randomly picked from the original dataset. These totalled 485 photographs, with an average of 69 photos per collection. In order to avoid skewing the results towards the most popular depictions of the city (see figure 2a), none of the 6 user collections were used as part of the training dataset. As a result, the evaluation was carried out manually, as none of the photos in the test collections were geo-tagged. It was ensured that each of the 6 collections depicted a variety of places around the city and was not dominated by a single location.

The evaluation was carried out by fixing the geographic context given by the user on the city of York, which resulted in the selection of just under 7K contextually relevant images. The k-NN model was then built from these images and the PixGeo approach described previously was applied to the input images. In this instance, the top-ranked temporal cluster within each collection was the one with the largest number of geographically coherent scene matches, normalised against the total number of scene matches found. This was then picked as the cluster with the most confident geographic grounding, and the location of each photo it encapsulates was considered a true positive if it
was within a strict 200 meter radius of the actual location of the landmark depicted. This value was selected according to the parameter used for grouping geographically coherent matches.

A standard precision-recall metric does not reflect the performance of the algorithm in this instance, given that we focus on evaluating its effectiveness at finding a single most confident geographically grounded temporal cluster for each collection. Instead, precision is mapped to the overall percentage of images correctly classified. Recall is analysed in terms of the average number of visual suggestions made, compared to the average number of such suggestions that are geographically coherent within the temporal cluster in question. This allows us to visualise the effectiveness of PixGeo at reducing noise while maintaining classification accuracy. The results are compared against a baseline that selects random visual matches for each image in each temporal cluster within the test collections. This is sufficient for testing the context-aware approach, since the random selection of visual matches will lead to a random selection of temporal clusters as the most confidently geographically grounded.

The results in figure 3a suggest the approach works well and produces accurate geographic grounding for a large portion of images in the most confident geographically grounded temporal clusters in the test set. They also demonstrate that PixGeo performs approximately 30% better than chance. When experimenting with higher values of k, there is a clear increase in the percentage of correctly grounded temporal clusters, matched by an expected increase in the number of visual matches found for images within each temporal cluster (see figure 3c). The relatively stable number of geographically coherent matches throughout the experiments suggests that the increase in visual matches for higher values of k is mostly composed of noise, and that if the trend continues, using higher values of k could be detrimental to the performance of the classifier. This demonstrates that PixGeo is able not only to obtain accurate results after contextualising its training and testing datasets, but also to maintain this accuracy after testing the geographic coherence of temporal clusters within each test collection.

Figure 3b reveals that the number of photographs in each temporal cluster has some influence on the precision of the estimated geographic grounding. This may indicate that a larger number of images in a temporal cluster provides a better-defined context to be used in the geographic estimation. This bias towards larger temporal clusters for determining well-defined contexts can be exploited in future versions of PixGeo. Given that precision is calculated at the level of images, it is possible to confirm the assumption that each temporal cluster is highly likely to focus on recording one place or area. One observation is that the spot of highest photographic density in the city of York (see figure 2a) had a strong impact on the classification: nearly all of the temporal clusters heuristically selected as the strongest geographic grounding in each collection pointed to the cathedral.
[Figure 3 plots: (a) precision for 5-NN, 10-NN and 20-NN against a random baseline; (b) precision against temporal cluster size (clusters of 5 to 27 photos); (c) average number of visual matches, geographically coherent matches and photos per cluster for 5-NN, 10-NN and 20-NN.]
Fig. 3. Evaluation figures: (a) Percentage of correct geographically grounded images; (b) Plot of precision against size of temporal cluster selected; (c) average number of visual matches, geographically coherent matches and temporal cluster sizes.
When informally experimenting with the approach for the city of London, the results showed a slight deviation from what was found for the city of York due to existing limitations of the approach. One such limitation is that Exif information about the photographs such as focal length or data about the quality of the image itself (e.g. blurry images are unlikely to produce good visual matches) is not taken into consideration to eliminate potentially noisy training and test instances. These problems tended to occur in collections containing very poor quality photos or depictions of very specific subjects (e.g. artistic and very focused). When this was not the case, the approach did show very promising results when dealing with a large contextualised dataset of just under 190K images of London as shown by figure 4.
Fig. 4. Examples of visual matches found for two photographs in the same temporal cluster about the British Museum in the city of London. Test image to the left, smaller images represent visual matches. Figure (a) shows that, visually, the British Museum could not be recognised from its front entrance alone, as it shares many visual similarities with the entrance to the Bank of England. Figure (b), also in the same temporal cluster as (a), strengthens the geographic coherence around the British Museum.
An approach to eliminating noise in the training images may consider the use of knowledge bases such as DBpedia (http://dbpedia.org/About) for identifying geographic clusters around actual landmarks or places of interest.
5 Related Work
The problem of labelling photographs has been studied intensively in the past few years. Many approaches exist that solve different parts of the problem, but only recently, with the success of Web 2.0 websites such as Flickr (http://www.flickr.com) and Facebook (http://www.facebook.com), have techniques been developed that leverage unprecedented amounts of data collected from these sites for tackling difficult computer vision problems. Stone et al. [15] make use of a user's social connections as well as community-produced face annotations on a large number of photos collected from Facebook for building face recognition databases. They show that face recognition methods can be improved by exploiting a user's social context, especially when combined with a large collection of tagged photos from the popular social networking site.
Most notably, Hays and Efros [8] proposed one of the first methodologies for leveraging an image dataset with over 6 million geo-tagged instances collected from Flickr for geographically locating single images. Their approach makes use of an instance-based learner for producing probability distributions of the potential geographic coordinates of input images from the geo-tagged visual matches found. Crandall et al. [6] work with a dataset of 35 million photographs from Flickr in order to study the interplay between visual, textual and temporal features in estimating a photograph's geographic location. Moxley et al. [12], on the other hand, make use of a 100,000-image dataset from Flickr to demonstrate that by combining tag ranking techniques based on both geographic context and content-based image analysis, it is possible to suggest geographically relevant tags for photos newly tagged with GPS coordinates.

In contrast to existing approaches, what PixGeo is able to demonstrate is that, despite the advantages offered by large image datasets, there is a strong need for contextualisation in order to obtain accurate results, especially when considering problems such as geo-tagging photographs. While previous approaches [8] demonstrate an accuracy of 200 km for just over 50% of their test set, we are able to build on existing work and bring accuracy up to under 200 meters for the domain of personal photography.
6 Conclusion
A novel approach for geographically grounding personal photographs has been introduced by applying the notion that photos are generally tied together not only by their physical context but, crucially, by their digital context as well. PixGeo leverages the power of 8 million geo-tagged photographs from Flickr to apply the context-aware paradigm to the domain of personal photography and generate entirely new metadata for users' collections. From a user's perspective, PixGeo is capable of turning a single geographic input about the collection into much finer-grained geographic annotations. The main benefit of the approach is that it produces metadata that serves as the point of entry for a wide range of tasks, from tag generation approaches to visualisation techniques to finer-grained user profiling (e.g. recommendation of touristic spots based on previously visited landmarks).

Finally, PixGeo is the first algorithm that successfully applies the concepts of digital and physical contexts for geographically grounding photographs. It also demonstrates that solutions to the problem of scene matching that once involved unfeasible amounts of work can now be tackled from different perspectives by making use of the masses of annotated data that currently exist on the web. Future work on PixGeo will focus on studying methodologies for geographically grounding the other temporal clusters within a collection based on the anchor point found in the first iteration of the algorithm, and on fine-tuning the algorithm to select more appropriate photographs for comparison.
The fine-tuning should involve filtering out poor-quality images (e.g. blurry or over-exposed) or images that are too focused (e.g. artistic or zoomed in). A better methodology for filtering out poor training instances (i.e. those that do not depict landmarks) would also improve the resulting matches.
References
1. Bazire, M., Brézillon, P.: Understanding context before using it. In: Dey, A.K., Kokinov, B., Leake, D.B., Turner, R. (eds.) CONTEXT 2005. LNCS (LNAI), vol. 3554, pp. 29–40. Springer, Heidelberg (2005)
2. Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., Rabitti, F.: CoPhIR: a test collection for content-based image retrieval. CoRR, abs/0905.4627v2 (2009)
3. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov. 6(2), 153–172 (2002)
4. Brilakis, I.K.: Content Based Integration of Construction Site Images in AEC/FM Model Based Systems. PhD thesis, University of Illinois (2005)
5. Carvalho, R.F., Chapman, S., Ciravegna, F.: Attributing semantics to personal photographs. Multimedia Tools Appl. 42(1), 73–96 (2009)
6. Crandall, D., Backstrom, L., Huttenlocher, D., Kleinberg, J.: Mapping the world's photos. In: WWW (2009)
7. Frohlich, D., Kuchinsky, A., Pering, C., Don, A., Ariss, S.: Requirements for photoware, New Orleans, Louisiana, USA, pp. 166–175 (2002)
8. Hays, J., Efros, A.A.: Im2gps: Estimating geographic information from a single image. In: Proc. of Computer Vision and Pattern Recognition (2008)
9. Kirk, D.S., Sellen, A.J., Rother, C., Wood, K.R.: Understanding photowork. In: SIGCHI, Montréal, Québec, Canada (April 2006)
10. Manjunath, B., Ohm, J.-R., Vasudevan, V.V., Yamada, A.: Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology 11, 703–715 (2001)
11. Miller, A., Edwards, K.W.: Give and take: A study of consumer photo-sharing culture and practice. In: CHI, San Jose, California, USA. ACM Press, New York (2007)
12. Moxley, E., Kleban, J., Manjunath, B.S.: Spirittagger: a geo-aware tag suggestion tool mined from flickr. In: MIR 2008: Proceedings of the 1st ACM Conference on Multimedia Information Retrieval. ACM, New York (2008)
13. Naaman, M., Paepcke, A., Garcia-Molina, H.: From where to what: metadata sharing for digital photographs with geographic coordinates. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.) CoopIS 2003, DOA 2003, and ODBASE 2003. LNCS, vol. 2888, pp. 196–217. Springer, Heidelberg (2003)
14. Sontag, S.: On Photography. Farrar, Straus and Giroux, New York (1977)
15. Stone, Z., Zickler, T., Darrell, T.: Autotagging Facebook: Social network context improves photo annotation. IEEE, Los Alamitos (2008)
16. Zhao, M., Liu, S.: Automatic person annotation of family photo album. In: Sundaram, H., Naphade, M., Smith, J.R., Rui, Y. (eds.) CIVR 2006. LNCS, vol. 4071, pp. 163–172. Springer, Heidelberg (2006)
Method for Identifying Task Hardships by Analyzing Operational Logs of Instruction Videos
Junzo Kamahara1, Takashi Nagamatsu1, Yuki Fukuhara1, Yohei Kaieda1, and Yutaka Ishii2
1 Graduate School of Maritime Sciences, Kobe University, 5-1-1 Fukae-minami, Higashi-Nada, Kobe 658-0022, Japan
[email protected], [email protected]
2 Information Science and Technology Center, Kobe University, 1-1 Rokkoudai, Nada, Kobe 657-8501, Japan
[email protected]
Abstract. We propose a new identification method that aids in the development of multimedia contents for task instruction. Our method can identify the difficult parts of a task in an instruction video by analyzing the operation logs of the multimedia player used by a learner. The experimental results show that we can identify those video segments that the learners find difficult to learn from. This method can also identify hardships that the expert did not anticipate. Keywords: Multimedia Authoring, User Behavior.
1 Introduction

For effective self-learning of a skilled task, the learner requires video-based multimedia contents showing how to perform the task. For creating such multimedia contents, we have developed a skill acquisition support system [1], which is a multimedia presentation system that displays advice texts superimposed on a video of an expert performing a task. The advice texts are annotations taken from an interview of the expert to aid self-learning. Our system consists of three subsystems that perform the following operations: recording the expert performing the task, authoring annotations of the expert's advice, and playing the instructional contents in a player. Authoring contents is a tedious task; determining which part of the video requires advice text is particularly time consuming. Content analysis of the instruction video to identify which segments would be difficult for the user to understand is not easy. To overcome these limitations, we propose a new identification method that can help identify the segments of the instruction video that are relatively difficult to understand. Our method identifies the hardships in the task shown in the instruction video by analyzing the operation logs of the player used by the learner; a part of the task identified as a hardship is a part of the expert's movement that a learner feels is difficult to perform. According to [2], a user's intention can be estimated from his or her browsing behavior. However, those intentions are based only on browsing of the contents and do not elucidate their semantic nature. In this article, we deal with task hardships as indicators of the semantics of the instructional content.
2 Proposed Method for Task Hardships Identification

The proposed identification method, which reduces the author's burden of determining the parts of a video that require annotation, can be described in the following steps:
1) An original video of the expert performing a task is recorded as the instruction video.
2) The learners practice the task by repeatedly watching the video on our instruction-video player, which does not display advice text.
3) The operation logs are analyzed to determine the parts that are difficult for the learners to imitate.
4) The author annotates advice texts that are taken from the expert's interview.
Fig. 1 shows the instruction-video player interface used in our study. We implemented this player as a subset of a SMIL [3] multimedia player. The video is displayed at the center of the window. The user can start and stop the video using the play/pause button, and can fast-forward or rewind the video using the position slider provided in the window. The user can also change the playback speed using the speed slider, which is additionally controlled by the mouse scroll wheel. The status of the play/pause button and the movement of the sliders reflect the semantics of the user's interaction while watching the video. We assume that if the user finds a segment of the video difficult, the user will use one of the above functions of the player to view the details of that part. Therefore, collecting and analyzing data on the behavior of the users helps determine the parts of hardship. For this, we added functions to the player for recording operation logs, which consist of the event handled by the interface, the playback rate with respect to normal speed, the time position of the video, and the actual time elapsed since the start of the video. A minimal sketch of such a log record is given below.
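The sketch assumes a simple in-memory structure holding the four logged quantities named above; the event names and the class layout are assumptions made for illustration, since the paper only specifies which quantities are logged.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogEntry:
    event: str         # e.g. "play", "pause", "seek", "speed_change" (names assumed)
    speed: float       # playback rate with respect to normal speed
    media_time: float  # time position in the video, in seconds
    wall_time: float   # actual time elapsed since the start, in seconds

@dataclass
class OperationLog:
    start: float = field(default_factory=time.time)
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, event, speed, media_time):
        """Called from the player's event handlers."""
        self.entries.append(LogEntry(event, speed, media_time, time.time() - self.start))

# Example: a learner plays, slows down and then pauses the video.
log = OperationLog()
log.record("play", 1.0, 0.0)
log.record("speed_change", 0.5, 12.3)
log.record("pause", 0.0, 15.8)
```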
Fig. 1. Instruction-video player
3 Experiment

To confirm that aggregated data on users' behavior can represent the hardships of a task, we conducted experiments whose results reveal that the analysis of the operation logs can be effectively used to identify the segments of task hardship. In this experiment, we compared three sets of data about the video segments: the recorded user behavior, a questionnaire asking the users about the task hardships, and the expert's interview. We selected rope work as the skilled task.
This task is somewhat difficult to learn on one's own because it involves a series of operations. The expert selected for this task was an associate professor who teaches rope work at our university. We recorded a video of the expert performing the rope work task. After recording the video, the expert was asked to give verbal instructions for the task, which would then be used to author the advice texts. During this, the expert was allowed to watch his video on the player and asked to specify the instances where the advice texts were to be inserted. We then extracted the corresponding parts of the video from this interview.

3.1 Experiment on Non-experts Practicing the Task

The objective of this experiment was to record the operation logs of the instruction-video player being used by the learners to view the video. During this experiment, no advice text was displayed. The subjects selected as learners were ten male university students who had no knowledge of, or experience with, rope work. The subjects were asked to complete the task played on the player; the video demonstrated how to tie five knots. After completing the task, the subjects were asked to answer a questionnaire indicating, in terms of media time, the segments of the task that they felt were difficult to perform. Fig. 2 shows a subject performing the task by using the player. Fig. 3 shows the experimental result of one subject. The horizontal and vertical axes indicate the media time and the actual time, respectively. The positive and negative slopes represent forward and backward playback of the video, respectively.
Fig. 2. Subject performing task
Fig. 3. Experimental result obtained from one subject
4 Analysis of the Experimental Results

We hypothesized that the video segments that were played forward more than twice, or those in which the pause button was pressed at least once, represent the difficult parts of the task. Fig. 4 shows the number of times forward playback was run and the total pause time recorded for one subject; in this figure, the temporal resolution of media time is 0.1 s. The video segments where the expert offered advice to the learners and those in which the learners indicated difficulties in the questionnaire are also included in Fig. 4.
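The following sketch shows one way the operation logs could be turned into candidate hardship segments under this hypothesis: media time is binned at the 0.1 s resolution mentioned above, and a bin is flagged if it was played forward more than twice or paused in at least once. The input format (playback intervals and pause positions) is an assumption of the sketch, not the actual log format of the player.

```python
from collections import defaultdict

RES = 0.1  # temporal resolution of media time in seconds, as used in the paper

def segment_bins(start, end, res=RES):
    """Indices of the res-sized media-time bins covered by [start, end)."""
    b = int(start / res)
    while b * res < end:
        yield b
        b += 1

def find_hardships(forward_plays, pauses, min_plays=3):
    """Flag media-time bins that were played forward more than twice
    (>= min_plays) or paused in at least once, following the hypothesis in
    the text. forward_plays is a list of (media_start, media_end) intervals
    played forward; pauses is a list of media-time positions at which the
    pause button was pressed."""
    play_count = defaultdict(int)
    for start, end in forward_plays:
        for b in segment_bins(start, end):
            play_count[b] += 1
    paused_bins = {int(t / RES) for t in pauses}
    flagged = {b for b, c in play_count.items() if c >= min_plays} | paused_bins
    return sorted(round(b * RES, 1) for b in flagged)

# Toy usage: the segment around 10-12 s is replayed three times and paused in.
print(find_hardships([(0, 20), (10, 12), (10, 12)], pauses=[11.0]))
```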
The results of the analysis of the data obtained from the ten subjects are as follows. At the segments that they felt were difficult, subjects played the video more than two times in 82.8% (24/29) of cases and paused the video in 41.4% (12/29) of cases. Conversely, the subjects felt that 69.2% (27/39) of the video segments that were either played more than two times or paused were difficult. The video segments that were either played more than two times or paused covered 93.1% (27/29) of the video segments that the learners indicated as difficult. This indicates that we can effectively identify the hardships in a task from user behavior. Furthermore, we found three video segments indicated as difficult parts by the learners that were not included in the video segments for which the expert's advice was given; this result implies that an expert cannot anticipate all the advice required by the learners.
[Figure 4 plots, against media time (min:sec), the number of forward playback selections and the total pause time (sec) for one subject, together with markers for the parts the learner indicated as difficult and the segments covered by the expert's advice.]
Fig. 4. Analysis of experimental data of one subject
5 Conclusion

We proposed a new method for identifying hardships in a task by analyzing the operation logs of learners using the player; this method helps reduce the content author's burden of determining the parts of a task that require the expert's advice. The experimental results show that we can identify the segments of a video that learners find difficult to learn from. The method could also effectively identify task hardships that the expert did not anticipate. With the identified task hardships, the author of instructional content can focus on annotating the expert's advice only on the corresponding video segments.
References
1. Nagamatsu, T., Kaieda, Y., Kamahara, J., Shimada, H.: Development of a Skill Acquisition Support System Using Expert's Eye Movement. In: Smith, M.J., Salvendy, G. (eds.) HCII 2007. LNCS, vol. 4558, pp. 430–439. Springer, Heidelberg (2007)
2. Syeda-Mahmood, T., Ponceleon, D.: Learning Video Browsing Behavior and its Application in the Generation of Video Previews. In: 9th ACM International Conference on Multimedia (MULTIMEDIA 2001), vol. 9, pp. 119–128. ACM, New York (2001)
3. SMIL – Synchronized Multimedia Integration Language, http://www.w3.org/AudioVideo/
Multimodal Semantic Analysis of Public Transport Movements
Wolfgang Halb and Helmut Neuschmied
Institute of Information Systems, JOANNEUM RESEARCH, Steyrergasse 17, 8010 Graz, Austria
[email protected]
Abstract. We present a system for multimodal, semantic analysis of person movements that incorporates data from surveillance cameras, weather sensors, and third-party information providers. The interactive demonstration will show the automated creation of a survey of passenger transfer behavior at a public transport hub. Such information is vital for public transportation planning and the presented approach increases the cost-effectiveness and data accuracy as compared to traditional methods.
1 Introduction
It is very important for public transport and infrastructure providers to have exact measures of passenger movements at public transport hubs (such as train or bus stations) and within the entire transportation network. Figures about changes of mode or type of transport (e.g., from train to bus, from private to public transport) are also valuable for public transportation planning, and accurate statistics are a key factor for designing efficient and sustainable public transport [1]. Traditionally, obtaining these numbers has been expensive because it involves human observers manually counting the number of passengers over a number of days. More advanced approaches using Bluetooth hardware for wirelessly recording passenger movements [2] have the disadvantage that only passengers with a Bluetooth-enabled mobile phone can be captured. In the presented approach, passenger movement data is extracted from surveillance videos and enhanced with additional information. The use of multiple cameras makes it possible to cover larger areas and improves the detection rate. When fixed cameras are used, it is possible to monitor movements over a long period of time, thus enabling comprehensive statistics. The integration of data from mobile camera locations is also supported. Through a semantic analysis of video and other related data, detailed information about persons' movements can be gained. The system has been designed to always ensure the privacy of passengers, and persons are never identified.
2 The System in Brief
A system for multimodal, semantic analysis of person movements has been developed that incorporates data from surveillance cameras, weather sensors, and
third-party information providers. It has been developed for analysing passenger movements at public transportation hubs but can also be used for the analysis of person movements in any other setting (e.g., pedestrian movements in shopping malls). The components of the system are (i) visual semantic analysis, (ii) a contextual data interface, and (iii) a visualization of the combined statistics.
2.1 Visual Semantic Analysis
The visual semantic analysis is based on video input from surveillance cameras. Basic underlying principles of how the video input is processed are described in [3]. State-of-the-art object detection algorithms are used to detect people anonymously in the video input. The object detection is based on Histograms of Oriented Gradients (HOG) [4], and the object trackers used have been refined and improved so that it is also possible to detect and track persons in crowded scenes and to use multiple cameras for more accurate results. It is even possible to use video input from infrared cameras in the case of poor lighting conditions at night. As an intermediate result, trajectories of person movements are available that describe the position of each person at a given point in time, but they do not contain any additional semantics. This information is further processed and enhanced with a description of the persons' actions.

In the user interface, certain areas can be defined and linked to metadata about the area. Each time a person enters or exits an area, this information is recorded. Together with the metadata about the area, the actual event semantics can be assessed. The area metadata contains information about the characteristics of the area, such as whether the area belongs to the station or not. If an arrival/departure platform is located within the area, information about the serving bus/train lines is also included. Examples of further specialized areas are bike racks or parking lots. All information gained through the visual semantic analysis is stored in a central database for generating the passenger movement statistics. This database does not contain any sensitive information, as it only contains data extracted from the visual analysis without storing any image content. It is not possible to determine a person's identity based on this information.
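As a rough illustration of how area metadata and trajectories combine into events, the sketch below checks an anonymous trajectory against user-defined areas and emits enter/exit events tagged with the area metadata. Rectangular areas, the field names and the coordinate units are simplifying assumptions made for the example; the actual system is not restricted to this representation.

```python
from dataclasses import dataclass, field

@dataclass
class Area:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    metadata: dict = field(default_factory=dict)  # e.g. {"station": True, "lines": ["Bus 31"]}

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def area_events(track, areas):
    """Convert an anonymous trajectory (a list of (timestamp, x, y) points)
    into enter/exit events for each user-defined area, tagged with the area
    metadata."""
    events = []
    for area in areas:
        inside = False
        for t, x, y in track:
            now_inside = area.contains(x, y)
            if now_inside and not inside:
                events.append((t, "enter", area.name, area.metadata))
            elif inside and not now_inside:
                events.append((t, "exit", area.name, area.metadata))
            inside = now_inside
    return sorted(events, key=lambda e: e[0])

# Toy usage: a short track crossing a platform area.
platform = Area("platform A", 0.0, 0.0, 5.0, 2.0, {"station": True, "lines": ["Bus 31", "Bus 58"]})
track = [(0.0, -1.0, 1.0), (1.0, 1.0, 1.0), (2.0, 4.0, 1.0), (3.0, 6.0, 1.0)]
print(area_events(track, [platform]))
```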
2.2 Contextual Data Interface
The contextual data interface is responsible for the acquisition of additional information that can enrich the detected passenger movements. Currently, modules for the integration of weather information from weather sensors and of bus/train schedules are implemented. The system has been designed for flexible use and thus allows a range of further information providers to be included. Bus/train schedules are used for determining the line a passenger has used: when multiple lines are served from a single platform, knowing which platform a passenger departed from is not sufficient to determine the line used, and the additional information from the schedules is needed.
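A minimal sketch of this schedule lookup is shown below, assuming a hypothetical schedule table and a tolerance window for matching a detected platform departure to the closest scheduled departure; neither the data layout nor the tolerance value is taken from the paper.

```python
from datetime import datetime, timedelta

# Hypothetical schedule: (platform, scheduled departure time, line) entries.
schedule = [
    ("A", datetime(2009, 12, 2, 8, 0), "Bus 31"),
    ("A", datetime(2009, 12, 2, 8, 4), "Bus 58"),
    ("B", datetime(2009, 12, 2, 8, 2), "Tram 6"),
]

def line_for_departure(platform, event_time, schedule, tolerance=timedelta(minutes=3)):
    """Resolve the line a passenger used by matching a platform departure
    event against the closest scheduled departure within a tolerance window."""
    candidates = [(abs(dep - event_time), line)
                  for plat, dep, line in schedule
                  if plat == platform and abs(dep - event_time) <= tolerance]
    return min(candidates)[1] if candidates else None

print(line_for_departure("A", datetime(2009, 12, 2, 8, 3), schedule))  # -> Bus 58
```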
2.3 Combined Statistics
Passenger movement statistics make use of all data that has been collected through the visual semantic analysis and the contextual data interface. Based on this information it is possible to determine the following events:
– Arrival at/departure from the station with public transport (and the bus/train line used)
– Entering/leaving the station and the mode of transport used (pedestrian, bike, car)
– Transfer between different public transport lines
Statistics can be accessed in near real time through a web-based user interface. They contain information about the events described above and thus present the degree of public transport line utilization. The presentation is highly customizable and can show daily, hourly, or peak-time movements, comparisons between different periods of time, and so forth. By taking weather data into account, it can be analysed whether certain weather conditions (e.g., extreme temperatures, precipitation, etc.) influence passenger movement behavior. This information is extremely valuable for public transportation planning to predict future passenger movement trends.
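As an illustration of how such events could be rolled up into line-utilization figures, the sketch below counts events per hour, line and weather condition; the event record layout is assumed for the example, since the real system reads these events from its central database and renders them in the web interface.

```python
from collections import Counter
from datetime import datetime

def hourly_line_utilisation(events):
    """Aggregate resolved passenger events into counts per hour of day,
    line and (optionally) weather condition."""
    counts = Counter()
    for e in events:
        counts[(e["time"].hour, e["line"], e.get("weather"))] += 1
    return counts

# Toy usage with three assumed event records.
events = [
    {"time": datetime(2009, 12, 2, 8, 3), "line": "Bus 58", "weather": "rain"},
    {"time": datetime(2009, 12, 2, 8, 40), "line": "Bus 58", "weather": "rain"},
    {"time": datetime(2009, 12, 2, 9, 5), "line": "Tram 6", "weather": "clear"},
]
print(hourly_line_utilisation(events))
```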
Fig. 1. Person movement trajectories
3 Demonstration
The demonstration will be highly interactive and show a complete sample use case for a multimodal, semantic analysis of passenger movements at a public transportation hub. For the demonstration, pre-recorded surveillance camera footage from a real setup will be used. It will be shown how the visual semantic analysis can be configured to link visual regions with additional metadata about each region. The actual visual computations will be shown (see Figure 1 for an example of detected persons, their trajectories, and area crossings in a night scene recorded with an infrared camera), as will the resulting statistics and the possible views. Through this demonstration the viewer will be able to understand the entire processing workflow and see the different possibilities for visualizing the results.

Acknowledgements. The research leading to this paper was partially funded by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT) in the 'IV2Splus - Intelligente Verkehrssysteme und Services plus' programme under project nr. 816031 (NET FLOW). We would also like to thank GRAZ AG VERKEHRSBETRIEBE (GVB) for making the video recordings possible.
References
1. MacKay, D.J.C.: Sustainable Energy - Without the Hot Air. UIT (December 2008)
2. Kostakos, V.: Using Bluetooth to capture passenger trips on public transport buses. CoRR, abs/0806.0874 (August 2008)
3. Paletta, L., Wiesenhofer, S., Brandle, N., Sidla, O., Lypetskyy, Y.: Visual surveillance system for monitoring of passenger flows at public transportation junctions. In: Proceedings of the Intelligent Transportation Systems, pp. 862–867. IEEE, Los Alamitos (2005)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Washington, DC, USA, vol. 1, pp. 886–893. IEEE Computer Society, Los Alamitos
CorpVis: An Online Emotional Speech Corpora Visualisation Interface
Charlie Cullen, Brian Vaughan, John McAuley, and Evin McCarthy
Digital Media Centre, Dublin Institute of Technology, Aungier Street, Dublin, Ireland
[email protected]
Abstract. Our research in emotional speech analysis has led to the construction of several dedicated high quality, online corpora of natural emotional speech assets. The requirements for querying, retrieval and organization of assets based on both their metadata descriptors and their analysis data led to the construction of a suitable interface for data visualization and corpus management. The CorpVis interface is intended to assist collaborative work between several speech research groups working with us in this area, allowing online collaboration and distribution of assets to be performed. This paper details the current CorpVis interface into our corpora, and the work performed to achieve this. Keywords: Emotional speech, online speech corpora, data visualization.
1 Introduction

Existing emotional speech corpora are often maintained offline, with interface and visualization tools being developed on an individual basis, if at all. Although useful work has been performed on multidimensional scaling analysis [1, 2] and tool development [3-5], no online solution for emotional speech corpora visualization currently exists. Online data visualization is a growing field of research, with examples such as Google Finance [6], Amazon [7] and Flickr [8] becoming a popular means of providing interactive user interfaces into large, real-time data sources. This suggests a useful means of implementing a similarly scalable and reusable data visualization toolset for emotional speech corpora. Our team has used these components to form the basis of an online emotional speech corpora interface, CorpVis. The CorpVis interface is designed to allow speech researchers to visualize corpora assets, query the visualization in real time and view the results in detail at asset level.
2 Emotional Speech Corpora

Work in the SALERO project [9] has developed several corpora of high-quality emotional speech assets [10, 11]. These corpora form the basis of work in emotional speech analysis and synthesis, linguistic convergence and machine learning. The only defined attempt at corpus metadata standardization performed thus far is the ISLE Metadata Initiative (IMDI) [12].
In IMDI, separate session bundles are grouped logically under an overarching project. Each session in turn relates to a specific type of content, and involves various actors who produce the speech assets that are analyzed for acoustic, linguistic and emotional information. The hierarchical groupings used by IMDI often contain complex sets of descriptors, with redundancy of these descriptors being a difficult issue to resolve when seeking to avoid large amounts of effort in the cataloguing of speech assets. For this reason, elements of the IMDI schema are suppressed by the CorpVis interface, to ensure the overall data visualization is kept as simple as possible. Each asset in the corpus is analyzed for acoustic data including pitch, intensity, contour [13] and voice quality information such as jitter and shimmer [14]. In addition, manual annotation of linguistic information (such as IPA transcription) can also be stored with the analysis data within a standard SMIL format XML file. Emotional data [15] is also included, though rating of emotions requires group listening tests [16] to validate each result.
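By way of illustration only, the snippet below builds a small XML fragment bundling acoustic, linguistic and emotional data for one asset; the element and attribute names are invented for the example and do not reproduce the actual CorpVis or SMIL schema.

```python
import xml.etree.ElementTree as ET

# Purely illustrative element layout for one speech asset's analysis data.
asset = ET.Element("asset", id="session01_act03")
acoustic = ET.SubElement(asset, "acoustic")
ET.SubElement(acoustic, "pitch", mean="182.4", unit="Hz")
ET.SubElement(acoustic, "intensity", mean="64.1", unit="dB")
ET.SubElement(acoustic, "voiceQuality", jitter="0.012", shimmer="0.034")
ET.SubElement(asset, "linguistic", ipa="ˈhɛləʊ")       # manual IPA transcription
ET.SubElement(asset, "emotion", activation="0.6", evaluation="-0.2")

print(ET.tostring(asset, encoding="unicode"))
```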
3 CorpVis Visualization Implementation

The CorpVis application is divided into three separate tiers: a presentation tier developed using Adobe Flex; an application tier developed using Ruby on Rails, with support for SMIL through the REXML ruby-gem; and a data tier consisting of a MySQL database. The presentation tier implements the corpus visualization. The visualization tool delivers project, session, asset and emotional dimension information in a
Fig. 1. Screen shot of the CorpVis corpora visualisation tool. Tutorials on how to use the tool and what it does are provided for the user on the DMC website.
Fig. 2. Screen shot of the CorpVis asset viewer tool. This tool provides a visualization of all acoustic, linguistic and emotional data relating to a single asset. The vowels in a speech act and their prominence are shown in the top chart alongside pitch, intensity and emotional dimension curves. The bottom graphs show pitch, intensity and voice quality for the selected vowel. The user can also add annotations (not shown) that will update the corpora database.
single interactive screen. The user can also visualize gender, age and acoustic data relating to each asset grouping. If analysis of an individual asset is required, a separate asset analysis screen is launched (Figure 2).

A full demonstration of the CorpVis interface is available online at www.dmc.dit.ie. Due to ethical considerations in emotional speech research, full access can be obtained by directly contacting members of the research team.
4 Ongoing and Future Work

This paper has given a brief introduction to the CorpVis emotional speech corpora visualisation tool. Development is ongoing, aiming to further streamline the operation of the interface and to provide more flexible visualization and querying options. Automated emotional and linguistic analysis is also currently being implemented, leveraging machine learning algorithms to provide a completely automatic analysis method for all speech assets in the corpus.

Acknowledgments. The research leading to this paper was partially supported by the European Commission under contract IST-FP6-027122 "SALERO".
References
[1] Yamakawa, K., Matsui, T., Itahashi, S.: MDS-based Visualization Method for Multiple Speech Corpus Features. IEICE Technical Report 108, 35–40 (2008)
[2] Batliner, A., Steidl, S., Hacker, C., Nöth, E.: Private emotions versus social interaction: a data-driven approach towards analysing emotion in speech. User Modeling and User-Adapted Interaction 18, 175–206 (2008)
[3] Sjölander, K., Beskow, J.: WaveSurfer - an open source speech tool. In: Sixth International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China, vol. 4, pp. 464–467 (2000)
[4] Kubat, R., DeCamp, P., Roy, B.: TotalRecall: Visualization and Semi-Automatic Annotation of Very Large Audio-Visual Corpora. In: Proceedings of the 9th International Conference on Multimodal Interfaces, Nagoya, pp. 208–215 (2007)
[5] Barras, C., Geoffrois, E., Wu, Z., Liberman, M.: Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication 33, 5–22 (2001)
[6] Brendan Meutzner Consulting: Google Finance Visualisation Tool, Webmaster (2008)
[7] Bannur, S.: Querying Amazon through webservice (2008)
[8] France Telecom Research and Development LLC: Pikeo: Share your world, explore another (2008)
[9] Haas, W., Thallinger, G., Cano, P., Cullen, C., Bürger, T.: SALERO - Semantic Audiovisual Entertainment Reusable Objects. In: Avrithis, Y., Kompatsiaris, Y., Staab, S., O'Connor, N.E. (eds.) SAMT 2006. LNCS, vol. 4306. Springer, Heidelberg (2006)
[10] Cullen, C., Vaughan, B., Kousidis, S.: Emotional speech corpus construction, annotation and distribution. In: The Sixth International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco (2008)
[11] Cullen, C., Vaughan, B., Kousidis, S., Wang, Y., McDonnell, C., Campbell, D.: Generation of High Quality Audio Natural Emotional Speech Corpus using Task Based Mood Induction. In: International Conference on Multidisciplinary Information Sciences and Technologies, Extremadura, Merida (2006)
[12] ISLE: IMDI (ISLE Metadata Initiative), Metadata Elements for Session Descriptions. Draft Proposal Version 3.0.3 (2003)
[13] Cullen, C., Vaughan, B., Kousidis, S., Reilly, F.: A vowel-stress emotional speech analysis method. In: The 5th International Conference on Cybernetics and Information Technologies, Systems and Applications, CITSA 2008, Genoa, Italy (2007)
[14] Ozdas, A., Shiavi, R.G., Silverman, S.E., Silverman, M.K., Wilkes, D.M.: Investigation of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal risk. IEEE Transactions on Biomedical Engineering 51, 1530–1540 (2004)
[15] Cowie, R., Cornelius, R.R.: Describing the emotional states that are expressed in speech. Speech Communication, Special Issue on Speech and Emotion 40, 5–32 (2003)
[16] Vaughan, B., Cullen, C.: Emotional Speech Corpus Creation, Structure, Distribution and Re-Use. In: Young Researchers Workshop in Speech Technology (YRWST 2009), Dublin, Ireland (2009)
Incremental Context Creation and Its Effects on Semantic Query Precision
Alexandra Dumitrescu and Simone Santini
Escuela Politécnica Superior, Universidad Autónoma de Madrid
Abstract. We briefly describe the results of an experimental study on the incremental creation of context out of the results of targeted queries, and discuss the increase in retrieval precision that results from the incremental enrichment of context.
1 Introduction

This paper is, in essence, a progress report on an activity that was presented last year at this very conference series. Last year, we presented a conceptual model (and its practical incarnation) that used the documents in a person's computer to create a context, and used it to conduct semantic searches on the web. The theoretical bases of the semantic model that we use are radically different from the Tarskian semantics of the semantic web, and can be traced on the one hand to the hermeneutic tradition and, on the other hand, to the Anglo-Saxon philosophy of language that assumed the findings of the second Wittgenstein [3]. Last year we presented our model and evaluated its behavior using complete and fixed contexts. We took two rich collections of documents, about computing and neurophysiology respectively, and determined how much the presence of this context would improve the precision of the results of generic queries in these two areas. The results were quite encouraging, sometimes more than doubling the precision of the same query without context. In this report, we briefly discuss an experimental study of context formation. We used the system iteratively to see how quickly, starting from a situation without any context information, we could build up a context that allowed the considerable improvements that we observed last year. In order to make the paper reasonably self-contained, we include a brief description of our context model. In [4], the context was based on a set of directories with a sub-directory relationship that caused some complication in the derivation of the model. Since the results that we describe in this paper are based on the contents of a single directory, we will describe a simplified model that does not take structure into account, and that is sufficient for our present purposes. For details on the complete model, the reader is referred to [4].
This work was supported in part by Consejería de Educación, Comunidad Autónoma de Madrid, under the grant CCG08-UAM/TIC/4303, Búsqueda basada en contexto como alternativa semántica al modelo ontológico. Simone Santini was in part supported by the Ramón y Cajal initiative of the Ministerio de Educación y Ciencia. Alexandra Dumitrescu was in part supported by the European Social Fund, Universidad Autónoma de Madrid.
2 Context Definition

Our starting point for the creation of context is a collection of documents. In the complete system, these are the documents contained in the working directory from which the user starts a query, as well as the documents retrieved and downloaded in the course of previous queries. In this case, we are interested in studying the process of context formation, so we will not consider any document in the working directory, making the experimental assumption that the first query is made from an empty context without any document. The context is created by downloading query results and accumulating them. We wish to emphasize that this is not the way our system is meant to be used: in general, all the documents relevant to a given activity are used to make up the context. The choice of this particular operational mode is in accord with the experimental design that we present in this paper.

We use two context representations, the second being built on top of the first. The first representation, which in the general model we call the syntagma, is syntactic, and in the present incarnation of the system is a point cloud representation. The second, called the seme, is a semantic representation implemented, in this case, as a self-organizing map that constitutes a latent semantic manifold, that is, a non-linear low-dimensional subspace of the word space that captures important semantic regularities among words. The technique is based on the self-organizing map WEBSOM [1], but while WEBSOM and other latent semantic techniques have been used so far mainly for the representation of data bases, we shall use them as a context representation.

Consider that we have already executed a number of queries and collected certain documents that are considered to be relevant. We join all these documents into a single large document and apply standard algorithms for stopword removal and stemming. The result is a series of stems of significant words, from which we consider pairs of consecutive stems (word pairs). Let [t1, . . . , tW] be the words (terms) of the document, (ij) the pair formed by the word ti followed by tj, P the set of all pairs found in the documents, and Nij the number of times that the pair (ij) appears. The pair (ij) is given the weight

wij = Nij / Σ(hk)∈P Nhk

The pairs are represented in a vector space whose axes are the words t1, . . . , tW. In this space, the pair (ij), with weight wij, is represented by the point pij = (0, . . . , wij, 0, . . . , wij, 0, . . . , 0), whose two non-zero components are at positions i and j. All these points form the point cloud ID.
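A compact sketch of this construction, using sparse dictionaries instead of full W-dimensional vectors, might look as follows; the sparse representation is an implementation convenience, not part of the model.

```python
from collections import Counter

def point_cloud(stems):
    """Build the point cloud from a list of stems: consecutive stem pairs,
    weighted by their relative frequency. Each point is kept sparse, as a
    {term: weight} dictionary with non-zero entries only on axes ti and tj."""
    pairs = Counter(zip(stems, stems[1:]))
    total = sum(pairs.values())
    cloud = []
    for (ti, tj), n in pairs.items():
        w = n / total
        cloud.append({ti: w, tj: w})
    return cloud

# Toy usage on an already stemmed, stopword-free word sequence.
stems = ["semant", "queri", "context", "semant", "queri"]
print(point_cloud(stems))
```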
The point cloud thus built is used as the training data for a self-organizing map deployed in the term space. The map is a grid of elements called neurons, each one of which is a point in the word space and is identified by two integer indices: [μν] = (uμν1, . . . , uμνT), with 1 ≤ μ ≤ N, 1 ≤ ν ≤ M. The map is discrete and two-dimensional, with the 4-neighborhood topology. On it we define a neighborhood function, h(t, n), which depends on two parameters t, n ∈ N; n is the graph distance between the neuron whose neighborhood we are determining and another neuron, and t is a time parameter that increases as learning proceeds. The function decreases with the distance from the given neuron and "shrinks" with time. We also define a learning parameter α(t), decreasing with time.
All the points in I^D are presented to the map. We call the presentation of a point p ∈ I^D an event, and the presentation of all the points of I^D an epoch. Learning consists of a number of epochs, counted by a counter t. The neurons of the map are at first spread randomly in the word space; then, for each event consisting of the presentation of a point p, we apply the standard Kohonen map learning algorithm [2]. When learning stabilizes, the result is the latent semantic manifold.

In order to form a query, we begin with the terms entered by the user, which we call the inquiry; it may be a set of keywords, a sentence, or even a whole paragraph. We process it by stopword removal and stemming. The result is a series of stems (keywords) Y = {t_k1, ..., t_kq}. For the sake of generality, we assume that the user has associated weights {u_k1, ..., u_kq} with these terms. The inquiry can thus be represented as a point q in the word space.

The inquiry modifies the context by subjecting it to a sort of partial learning. Let [∗] be the neuron in the map closest to the inquiry point q. The map is updated, through a learning iteration, in such a way that the neuron [∗] gets closer to the point q by a factor φ, with 0 < φ ≤ 1. This is the target context [μν]' of the query. The target semantics for our query is given by the difference between the target context and the original one: [μν]~ = [μν]' − [μν]. The values [μν]~ in a neighborhood of [∗] constitute our complete query expression.
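The query-formation step can be sketched as follows; for brevity only the winning neuron is moved here, whereas a full learning iteration would also move its neighbors, and the neighborhood radius is an assumed parameter:

import numpy as np

def target_semantics(neurons, q, phi=0.5, radius=1):
    """Return the map difference [mu nu]~ in a grid neighborhood of the winning neuron."""
    N, M, _ = neurons.shape
    dists = np.linalg.norm(neurons - q, axis=2)
    star = np.unravel_index(np.argmin(dists), (N, M))   # neuron [*] closest to q
    updated = neurons.copy()
    updated[star] += phi * (q - neurons[star])           # partial learning toward q
    diff = updated - neurons                              # target context minus original context
    mu, nu = star
    rows = slice(max(mu - radius, 0), min(mu + radius + 1, N))
    cols = slice(max(nu - radius, 0), min(nu + radius + 1, M))
    return diff[rows, cols]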
3 Context Evolution

The test procedure of the following experiments is incremental. Each user starts with an empty context, a target context (either computing or neurophysiology), and a list of queries. Each "run" centers on a single query. The user is given the query word and uses it to retrieve documents with our system, which is based on the Google search engine [4]. Out of the results returned by the search engine, the user chooses a number of them that she considers relevant. These documents are used as an input to the procedure described in the previous section to build a level 1 context. The same query is now repeated with the level 1 context, its precision is noted, and the results that the user downloads are added to the context to create a level 2 context.

Fig. 1. Precision of the results for the computing context (no context, level 1 context, level 2 context)

The process is then repeated: the same
query is sent using the level 2 context, and its precision is noted. The results of this first test are shown in Figure 1. The number n in abscissa is the number of results that we consider when measuring the precision, and the ordinate is the precision of the first n results. Note that the inclusion of the level 1 context results in a significant improvement, while the inclusion of the level 2 context leaves the results statistically unchanged for n > 3 and appears to decrease the precision of the first two results. From an analysis of the documents retrieved, it seems that, with the second iteration, the context is somehow expanded. That is, at least with the users considered, the level 2 context appears to be less specific than the level 1 context, leading to the loss of precision. In order to analyze the phenomenon further, we selected a few queries and repeated the procedure up to the level 4 context. The results are shown in Figure 2.
Fig. 2. Precision of the results for the computing context (left) and the neurophysiology context (right) as a function of the level of context (computing queries: relational, rewriting, active, bug, address; neurophysiology queries: neighbouring, concentration, nerve, weak, disorders)
We can notice that the presence of "bumps" in the context is quite frequent, although the general trend, as expected, is towards an increase in precision with the context level. Here, too, we observe a phenomenon that we already pointed out in [4]: certain queries, mainly in the neurophysiology context, are very specific and carry with them enough context information that adding context results only in marginal improvement. The query "nerve" is an example of this phenomenon. The precision seems to plateau after three or four iterations, indicating that by then the context is sufficiently formed to specify the semantics of the desired query.
References

1. Kaski, S.: Computationally efficient approximation of a probabilistic model for document representation in the WEBSOM full-text analysis method. Neural Processing Letters 5(2) (1997)
2. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (2001)
3. Santini, S.: Ontology: use and abuse. In: Boujemaa, N., Detyniecki, M., Nürnberger, A. (eds.) AMR 2007. LNCS, vol. 4918, pp. 17–31. Springer, Heidelberg (2008)
4. Santini, S., Dumitrescu, A.: Context as a non-ontological determinant of semantics. In: Duke, D., Hardman, L., Hauptmann, A., Paulus, D., Staab, S. (eds.) SAMT 2008. LNCS, vol. 5392, pp. 121–136. Springer, Heidelberg (2008)
OntoFilm: A Core Ontology for Film Production Ajay Chakravarthy, Richard Beales, Nikos Matskanis, and Xiaoyu Yang IT Innovation Centre, 2 Venture Road, Southampton, SO16 7NP United Kingdom {ajc,rmb,nm,kxy}@it-innovation.soton.ac.uk
Abstract. In this paper we present OntoFilm, a core ontology for film production. OntoFilm provides a standardized model which conceptualizes the domain and workflows used at various stages of the film production process starting from pre-production and planning, shooting on set, right through to editing and post-production. The main contributions in this paper are: we discuss how OntoFilm models the semantics necessary to interpret these workflows consistently for all users (Directors, DoP’s, grips, post-production, lighting). We also discuss how our ontology forms a common bridge between the low level descriptive metadata generated for the video footage and the high level semantics used in software tools during the production process. Keywords: Ontologies, Semantic Web, Film Production, Post Production, Digital Media, Semantic Gap.
1 Introduction

The process of film production is lengthy, time-consuming and often an expensive affair. Film directors currently rely on paper-based storyboards or manually created 'pre-vis' animations to record and express their creative intent on how a scene could be shot. There is no formal way to describe this intent. Similarly, in post-production a tremendous amount of metadata is available describing the footage that has been captured, but this is in the form of paper notes and must be sorted and input into the post-production software manually. This is a time-consuming, laborious and error-prone process. There is a need for a standardized knowledge model which will act as a common knowledge sharing platform between the software tools used throughout film production. This research is being carried out as part of the ANSWER project. ANSWER is a novel approach to the creative process of film making. ANSWER introduces a symbolic language for directors to create detailed scene descriptions, much like a musician would use music scores to compose a song. ANSWER also aims to provide tools and technologies which will enable scene authoring through filming notation. Users (directors) will be able to define and visualize film scenes before shooting begins. The aim of our research within ANSWER is to create a conceptual knowledge model which facilitates knowledge flow among the various software components used and is a first step towards bridging the semantic gap between high level pre-production
1 www.answer-project.org/
concepts and low level post-production metadata. In its initial stages OntoFilm is mainly concerned with modelling film production. However, the film and games industries are converging in the way that their creative content is authored, so we will extend the results to include the needs of the games industry, thus offering a bridge between digital media production and animation for game design. In this paper we will present how we have attempted to model the workflow in three main stages of film production, i.e. pre-production, on-location, and post-production. Prior to starting development on OntoFilm we have studied state of the art multimedia ontologies such as MPEG-7 [2] and COMM [1]. We have reused concepts from the MPEG-7 ontology (e.g. for Camera Motion, Motion Trajectories and Motion Activity).
2 Pre-production Model

The pre-production model covers the concepts used during the planning stage of the film. Script writers and film directors use paper-based scripts and storyboards to plan shots within the film. Often directors add additional notes on script elements to record their creative ideas. This additional metadata is very useful in post-production when locating footage and during video and dialogue synchronization. However, paper-based scripts cannot be used directly to produce machine-understandable metadata. OntoFilm introduces the concept of a Shooting Script element. After consultation with directors and analysis of existing film and television scripts, we have identified the main elements of a movie script and their relationships, such as Scene, Shot, Character Name, Dialogue, Parenthetical and Transition Element. The Movie Script Markup Language (MSML) is a movie script schema specification developed as part of the ANSWER ontology specification after studying existing schemas used within the film domain, such as Celtx. MSML is encoded within OntoFilm and provides the semantics necessary for software tools to annotate and parse movie scripts according to this schema. Figure 1 shows the conceptual schema of MSML.
Fig. 1. Movie Script Markup Language conceptual schema
2 http://wiki.celtx.com/index.php?title=The_Script_Editor
The graphical user interface developed within ANSWER allows directors to visualize and annotate scripts just as they would on a paper-based script, the difference being that all the annotated concepts are persisted as OWL instances. Further, the director can select any element on the script and add free-text annotations to record his/her creative intent. Other pre-production concepts include various physical setups required on set such as Optical Attachments, Optical Components, Costume, Props, and Shots (e.g. long shot, POV shot, medium shot, etc.). Detailed descriptions of the aforementioned classes are published in [4]. We have done initial experimentation with SWRL-based rule modelling, e.g. for automatic classification of the different Cameras used during filming. Figure 2 shows a rule for classifying Film Camera.
Fig. 2. SWRL rule showing classification conditions for Film Camera
The isFilm_Camera rule classifies a Camera instance according to whether it has Emulsion Film as its capturing medium. The camera's hasCropFactor data property can then be calculated, taking into account its classification and the film or sensor size, by calcFilmCropFactor and calcDigitalCropFactor.
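The same logic, written out as a plain-Python sketch rather than SWRL (class, property and constant names are illustrative and may differ from the actual ontology):

FULL_FRAME_DIAGONAL_MM = 43.27   # assumed 36x24 mm reference frame

def classify_camera(capturing_media):
    # mirrors the isFilm_Camera condition on the capturing medium
    return "Film_Camera" if capturing_media == "Emulsion_Film" else "Digital_Camera"

def crop_factor(media_diagonal_mm):
    # corresponds to calcFilmCropFactor / calcDigitalCropFactor in the SWRL rules
    return FULL_FRAME_DIAGONAL_MM / media_diagonal_mm

camera_class = classify_camera("Emulsion_Film")   # -> "Film_Camera"
factor = crop_factor(43.27)                        # -> 1.0 for a full-frame medium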
3 On-Set Production Model

Directors often follow a set of filming conventions while shooting a scene. Although these rules are not mandatory, they are recommended to avoid causing confusion to the viewing audience. In OntoFilm we have conceptualized these rules with Film Grammar concepts. A few examples of these rules include:

• 180 Degree Rule: This rule deals with the spatial relation between characters on-screen. It is used to maintain consistent screen direction between characters or between a character and an object.
• 30 Degree Rule: If a shot is going from one character to another without an intervening shot of something else, the camera angle should change by at least 30 degrees.
The on-set production model also includes the Director Notation (DN) ontology. DN is a symbolic language which can be used to create detailed scene descriptions during film shooting [3]. The graphical user interface developed in ANSWER will allow directors to create detailed scene descriptions using DN. A rule engine automatically translates these DN scores and generates animated pre-visualizations. The DN ontology is a system-level ontology which acts as a bridge between the user tools for describing a scene with DN and the rule engine which translates the DN scores into animations. Details of the DirectorNotation ontology are not expanded here, since they are outside the scope of this paper.
3 http://www.w3.org/Submission/SWRL/
4 Post-Production Model

Our approach to modelling post-production for OntoFilm has been to focus on the workflow involved, as it was considered that this would draw attention to the points at which metadata was lost or needed to be transferred manually. A significant amount of machine-readable metadata is automatically generated in post-production to describe the footage from the point at which it is ingested (digitised), e.g. descriptions of the media on which the files were held and of the software processes applied to alter the visual appearance of the footage. Furthermore, low-level video features, such as regions of colour or movement, can be annotated automatically using the MPEG-7 standard. However, there is a stark disconnect between these various technical parameters and the high-level features that they contain or represent. The result is that post-production operators can easily identify footage that had been scanned by a particular machine, but cannot find a sequence of shots that matches a scene description, e.g. protagonists in a car chase – this information is buried in the reams of paper notes that accompany the production. We have modelled the various post-production steps, e.g. Edit_Room (which involves assembly of selected material from the video footage produced on a day-to-day basis) and Commercial_Preview (where a candidate edit is screened in a cinema to judge audience reaction), and linked them not only to technical descriptions of the media involved, but also to scene descriptions from the On-Set Production model and planning notes from the Pre-Production model. In this way, it is now possible to construct a unified semantic representation of the action portrayed in a scene, the actors, props and locations involved, the planning rationale for creating the scene, the technical means with which it was acquired, and the processing applied in post-production. For example, the low-level footage descriptor Scene Tint (the predominant colour of the shot) relates to the high-level concept of Content (which describes the mood of the scene).
5 Conclusions and Future Work

The OntoFilm ontology is being developed after direct consultations with film directors (STEFI Productions), post-production software developers (DFT) and game developers (Larian Studios). The feedback from these collaborative meetings confirmed the need for a common ontological framework for the digital media domain, which will act as a bridge for knowledge transfer between these domains consistently for all users. We are currently investigating integrating OntoFilm with the COMM API. The COMM ontology combines the advantages of extensibility and scalability of a web-based solution with the accumulated experience of MPEG-7 [1]. The main advantage of doing this would be that software programs will easily be able to access the ontology through programmatic interfaces. We are also looking at
4 This information has been gathered after technical meetings with the post-production team of DFT Bones@Weiterstadt, Germany.
5 http://www.stefi.gr/
6 http://www.dft-film.com/
7 http://www.larian.com/
current efforts in the field of digital media ontologies, namely the iMP and Salero projects. However, our work differs from these multimedia ontologies because OntoFilm is specific to film production and covers this domain in great depth with a rich set of semantics. Software programs developed for digital media production will be able to leverage our ontology directly.
References

1. Arndt, R., Troncy, R., Staab, S., Hardman, L., Vacura, M.: COMM: Designing a Well-Founded Multimedia Ontology for the Web. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 30–43. Springer, Heidelberg (2007)
2. Martínez, J.M.: MPEG-7 Overview. International Organization for Standardisation (ISO/IEC/JTC1/SC29/WG11), Coding of Moving Pictures and Audio, Palma de Mallorca (October 2004)
3. Yannopoulos, A., Savrami, K., Varvarigou, T.: DirectorNotation as a Tool for AmI & Intelligent Content: an Introduction by Example. Accepted for publication in the International Journal of Cognitive Informatics and Natural Intelligence, IJCiNi (2008)
4. D4.1B: Mid-Term Ontology Report, ANSWER Project Deliverable (May 2009)
8 http://imp-project.eu/
9 http://www.salero.eu/
RelFinder: Revealing Relationships in RDF Knowledge Bases

Philipp Heim¹, Sebastian Hellmann², Jens Lehmann², Steffen Lohmann³, and Timo Stegemann³

¹ University of Stuttgart, Visualization and Interactive Systems, Universitätsstr. 38, 70569 Stuttgart, Germany, [email protected]
² University of Leipzig, Agile Knowledge Engineering and Semantic Web, Johannisgasse 26, 04103 Leipzig, Germany, {hellmann,lehmann}@informatik.uni-leipzig.de
³ University of Duisburg-Essen, Interactive Systems and Interaction Design, Lotharstr. 65, 47057 Duisburg, Germany, {steffen.lohmann,timo.stegemann}@uni-due.de
Abstract. The Semantic Web has recently seen a rise of large knowledge bases (such as DBpedia) that are freely accessible via SPARQL endpoints. The structured representation of the contained information opens up new possibilities in the way it can be accessed and queried. In this paper, we present an approach that extracts a graph covering relationships between two objects of interest. We show an interactive visualization of this graph that supports the systematic analysis of the found relationships by providing highlighting, previewing, and filtering features. Keywords: semantic user interfaces, semantic web, relationship discovery, linked data, dbpedia, graph visualization.
1 Introduction
The Semantic Web enables answers to new kinds of user questions. Unlike searching for keywords in Web pages (as e.g. in Google), information can be accessed according to its semantics. The information is stored in structured form in knowledge bases that use formal languages such as RDF or OWL and consist of statements about real-world objects like 'Washington' or 'Barack Obama'. Each object has a unique identifier (URI) and is usually assigned to ontological classes, such as 'city' or 'person', and an arbitrary number of properties that define links between the objects (e.g., 'lives in'). Given this semantically annotated and linked data, new ways to reveal relationships within the contained information are possible. A common visualization for linked data is the graph, such as the Paged Graph Visualization [2]. In order to find relationships in these visualizations, users normally apply one of the following two strategies: They either choose a starting point
1 http://www.w3.org/RDF/
2 http://www.w3.org/TR/owl-features/
and incrementally explore the graph by following certain edges, or they start with a visualization of the entire graph and then filter out irrelevant data. Some more sophisticated solutions are based on the concept of faceted search. The tool gFacet [4], for instance, groups object data into facets that are represented by nodes and can be used to filter a user-defined result set. However, all these approaches require the user to manually explore the visualization in order to find relationships between two objects of interest. This kind of trial-and-error search can be very time-consuming, especially in large knowledge bases that contain many data links. As a solution to this problem, we propose an approach that automatically reveals relationships between two known objects and displays them as a graph. The relationships are found by an algorithm that is based on a concept proposed in [5] and that can be applied to large knowledge bases, such as DBpedia [1] or the whole LOD-Cloud. Since the graph that visualizes the relationships can still become large, we added interactive features and filtering options to the user interface that enable a reduction of displayed nodes and facilitate understanding. We present an implementation of this approach – the RelFinder – and demonstrate its applicability by an example from the knowledge base DBpedia [1].
2 RelFinder
The RelFinder is implemented in Adobe Flex and runs in all Web browsers with an installed Flash Player. In the following, we first explain its general functionality before describing the involved mechanisms in more detail. The search terms that are entered by the user in the two input fields in the upper left corner (Fig. 1, A) get mapped to unique objects of the knowledge base. These constitute the left and right starting nodes in the graph visualization (Fig. 1, B), which are then connected by relations and objects found in between them by the algorithm. If a certain node is selected, all graph elements that connect this node with the starting nodes are highlighted, forming one or more paths through the graph (Fig. 1, C). In addition, further information about the selected object is displayed in the sidebar (Fig. 1, D). Filters can be applied to increase or reduce the number of relationships that are shown in the graph and to focus on certain aspects of interest (Fig. 1, E).
2.1 Disambiguation
Ideally, the search terms that are entered by the user can be uniquely matched to objects of the knowledge base without any disambiguation. However, if multiple matches are possible (e.g., in case of homonyms, polysemes, or incomplete user input) the user is supported by a disambiguation feature. Generally, a list of objects with labels that enclose the search terms is already shown below the
3 http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
4 http://www.adobe.com/products/flex
5 The current version of the RelFinder is accessible at http://relfinder.dbpedia.org
Fig. 1. Revealing relationships between Kurt Gödel and Albert Einstein
input box while the user enters the terms (Fig. 1, A). This disambiguation list results from a query against the SPARQL endpoint of the selected knowledge base. The following code shows the DBpedia-optimized SPARQL query for the user input 'Einstein':

SELECT ?s ?l count(?s) as ?count WHERE {
  ?someobj ?p ?s .
  ?s rdfs:label ?l .
  ?l bif:contains '"Einstein"' .
  FILTER (!regex(str(?s), '^http://dbpedia.org/resource/Category:')).
  FILTER (!regex(str(?s), '^http://dbpedia.org/resource/List')).
  FILTER (!regex(str(?s), '^http://sw.opencyc.org/')).
  FILTER (lang(?l) = 'en').
  FILTER (!isLiteral(?someobj)).
} ORDER BY DESC(?count) LIMIT 20
The disambiguation list is sorted by relevance using the 'count' value (or alternatively a string comparison if 'count' is not supported by the endpoint). 'count' is also used to decide if a user's search term can be automatically matched to an
6 A configuration file allows one to freely define the queried endpoint, the element that is queried (typically 'rdfs:label'), and properties that should be ignored. It is also possible to deactivate specific syntax elements, such as 'count' or 'bif:contains', in case a SPARQL endpoint does not support these.
object of the knowledge base or if a manual disambiguation is necessary. An automatic match is performed in one of the following two cases: 1) if the user input and the label of the most relevant object are completely equal, or 2) if the user input is contained in the label of the most relevant object and this object has a much higher count value than the second most relevant object of the disambiguation list (ten times higher by default). Thus, the automatic disambiguation is rather defensive in order to prevent false matches. If the user does not select an entry from the disambiguation list and if no automatic match is possible, the entries from the disambiguation list are shown again in a pop-up dialog that explicitly asks the user to provide the intended meaning of the search term by selecting the corresponding object.
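A compact sketch of this decision rule (helper names are hypothetical; the actual RelFinder logic may differ in details):

def auto_match(user_input, candidates, dominance=10):
    """candidates: list of (label, count) sorted by descending count."""
    if not candidates:
        return None
    top_label, top_count = candidates[0]
    if user_input.lower() == top_label.lower():
        return top_label                               # case 1: label equals the input
    second_count = candidates[1][1] if len(candidates) > 1 else 0
    if user_input.lower() in top_label.lower() and top_count >= dominance * second_count:
        return top_label                               # case 2: contained and dominant count
    return None                                        # otherwise ask the user

print(auto_match("Einstein", [("Albert Einstein", 4200), ("Einstein (crater)", 35)]))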
2.2 Searching for Relationships
A query building process composed of several SPARQL queries searches for relationships between the given objects of interest. Since the shortest connection is not known in advance, the process searches iteratively for connections of increasing length, starting from zero. As a constraint, the direction of the property relations within each connection chain is only allowed to change once. We defined this constraint for performance reasons and because multiple changes in the direction of the edges are difficult to follow and understand for the user. If our objects of interest are a and b, this results in the following search patterns:

a → ··· → b
a ← ··· ← b
a → ··· → c ← ··· ← b
a ← ··· ← c → ··· → b

Thus, we are looking either for one-way relationships (first two lines) or for those with an object c in between such that there is a one-way relationship each from a and from b to c, or from c to a and b (last two lines). Note that c is not known in advance but found within the searching process. The algorithm has several parameters: 1) It can be configured to suppress circles in extracted relationships. With the help of SPARQL filters, any object is only allowed to occur once in each connecting relationship. 2) Objects and properties can be ignored using regular expression patterns on their labels or URIs, which is useful if someone is not interested in certain objects or properties. More importantly, also structural relations between objects can be omitted, such as whether two objects belong to the same class or to the same part of a class hierarchy (i.e., ignoring rdf:type and rdfs:subClassOf properties). We decided to remove these by default, since they normally yield a multitude of relationships of minor interest which can be better explored in more traditional ways such as hierarchy browsers. 3) A maximum length of the returned relationships can be defined. 4) The SPARQL endpoint to use can be configured. An exemplary SPARQL query that searches for relationships of the type Kurt Gödel → ?of1 → ?c ← ?os1 ← Albert Einstein is given here (filter omitted):
SELECT * WHERE {
  db:Kurt_Gödel ?pf1 ?of1 .
  ?of1 ?pf2 ?c .
  db:Albert_Einstein ?ps1 ?os1 .
  ?os1 ?ps2 ?c .
  FILTER ...
} LIMIT 20
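A sketch of how such queries could be generated for growing path lengths in the middle-object case; the helper below is illustrative only and omits the circle-suppression and ignore-pattern filters described above:

def middle_object_query(a_uri, b_uri, left_len, right_len, limit=20):
    """Build a SPARQL pattern of the form a -> ... -> c <- ... <- b."""
    triples, prev = [], f"<{a_uri}>"
    for i in range(1, left_len + 1):
        obj = "?c" if i == left_len else f"?of{i}"
        triples.append(f"{prev} ?pf{i} {obj} .")       # chain from a toward c
        prev = obj
    prev = f"<{b_uri}>"
    for i in range(1, right_len + 1):
        obj = "?c" if i == right_len else f"?os{i}"
        triples.append(f"{prev} ?ps{i} {obj} .")       # chain from b toward c
        prev = obj
    return "SELECT * WHERE {\n  " + "\n  ".join(triples) + "\n} LIMIT " + str(limit)

print(middle_object_query("http://dbpedia.org/resource/Kurt_G%C3%B6del",
                          "http://dbpedia.org/resource/Albert_Einstein", 2, 2))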
2.3 Graph Visualization
The found relationships are added one by one to the graph, beginning with the shortest (i.e., direct relationships and relationships with only one object in between, if there are any). All objects are visualized as nodes connected by edges that are labeled and directed according to the property relation they represent (Fig. 1, B). Since the labels of the edges are crucial for understanding the relationships, they serve as flexible articulations in the force-directed layout [3], which reduces overlaps but cannot completely avoid them.

Interactive Features. To further reduce overlaps in the graph, we implemented a pinning feature that enables users to manually drag single nodes away from agglomerations and forces them to stay at the position where they were dropped (pinned nodes are indicated by needle symbols, as can be seen in Fig. 1, F). Especially in situations where many nodes are connected by many edges and are thus likely to overlap in the automatic layout, manual adjustments in combination with our pinning feature are helpful to produce an understandable graph layout that facilitates visual tracking. As already mentioned, visual tracking is additionally supported by the possibility of highlighting all paths that connect a selected node with the starting nodes (Fig. 1, C). If a certain node is selected in the graph, further information about the corresponding object is displayed in the sidebar (e.g., in the case of DBpedia these are a title, short abstract, and image extracted from Wikipedia, Fig. 1, D). Moreover, the ontological class an object belongs to is highlighted in the list that is shown in the sidebar (Fig. 1, E). Vice versa, all corresponding nodes in the graph are highlighted if an ontological class is selected from the list.

Filtering Options. The shown relationships can be filtered in two ways: 1) according to their length (i.e., the number of objects in between) and 2) according to the ontological classes the objects belong to. For instance, relationships consisting of several objects could be regarded as too far-fetched, or objects belonging to certain classes might not be of interest for a user's goals (Fig. 1, E) and are therefore removed from the graph. Filtering helps to reduce the number of displayed relationships in the graph and can hence prevent the graph from getting overly cluttered. For each search process, the filters are automatically set to initial values that avoid an overcluttered graph, if possible.
3 Conclusion and Outlook
We introduced an approach in this paper that uses properties in semantically annotated data to automatically find relationships between any pair of user-defined objects and visualizes them in a force-directed graph layout. The RelFinder can therefore save a lot of time which would otherwise be lost in searching for these relationships manually. Since the number of found relationships can be large, we additionally provide two types of filters that can be used to reduce the displayed relationships according to their length and the ontological classes they belong to, and thus allow focusing on only a relevant part of the relationships. Together with a pinning feature that lets users rearrange the graph layout manually, we therefore provide a semi-automatic approach. The basic mechanisms of the RelFinder work with every SPARQL endpoint and can therefore be applied to arbitrary knowledge bases with only little configuration effort. Since the SPARQL queries are composed on the client, they are comparably independent of server routines and resources. We are currently testing the RelFinder on different knowledge bases in various scenarios. Future work includes a comprehensive evaluation that further examines the potentials and benefits of querying structured knowledge bases compared to common Web search with respect to the discovery and analysis of relationships between certain objects of interest.
References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Deligiannidis, L., Kochut, K., Sheth, A.: RDF data exploration and visualization. In: Proceedings of the ACM First Workshop on CyberInfrastructure 2007, pp. 39–46. ACM Press, New York (2007)
3. Fruchterman, T., Reingold, E.: Graph drawing by force-directed placement. Softw. Pract. Exper. 21(11), 1129–1164 (1991)
4. Heim, P., Ziegler, J., Lohmann, S.: gFacet: A browser for the web of data. In: Proceedings of the International Workshop on Interacting with Multimedia Content in the Social Semantic Web (IMC-SSW 2008), pp. 49–58. CEUR-WS (2008)
5. Lehmann, J., Schüppel, J., Auer, S.: Discovering unknown connections – the DBpedia relationship finder. In: Proceedings of the 1st SABRE Conference on Social Semantic Web, CSSW (2007)
Image Annotation Refinement Using Web-Based Keyword Correlation

Ainhoa Llorente, Enrico Motta, and Stefan Rüger

Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, U.K. {a.llorente,e.motta,s.rueger}@open.ac.uk
Abstract. This paper describes a novel approach that automatically refines the image annotations generated by a non-parametric density estimation model. We re-rank these initial annotations following a heuristic algorithm, which uses semantic relatedness measures based on keyword correlation on the Web. Existing approaches that rely on keyword co-occurrence can exhibit limitations, as their performance depends on the quality and coverage provided by the training data. Additionally, WordNet-based correlation approaches are not able to cope with words that are not in the thesaurus. We illustrate the effectiveness of our Web-based approach by showing some promising results obtained on two datasets, Corel 5k and ImageCLEF2009. Keywords: Automated image annotation, Normalized Google Distance, semantic similarity.
1 Introduction and Related Work
Automated image annotation refers to the process of learning statistical models from a training set of pre-annotated images in order to generate annotations for unseen images using visual feature extracting technology. The early attempts were focused on algorithms that explore the correlation between words and image features. More recently, there are some efforts [1,2,3,4,5,6,7,8] which attempt to benefit from exploiting the correlation between words computed using semantic similarity measures. According to the knowledge base used as source of information, this new generation of algorithms can be classified into those that use a corpus as the training set [6], those that employ a thesaurus such as WordNet [7], and those that perform the statistical correlation using the Web [8]. For each one of the mentioned categories, a variety of semantic similarity measures have been proposed. The most important limitation, affecting approaches that rely on a training set, is that they are limited to the scope of the topics represented in the collection. This information may not be enough for detecting annotations that are not correlated with others. For instance, if our collection contains a lot of images of animals in a circus and just a few of animals in the wildlife, the co-occurrence approach
will penalize combinations such as "lion" and "savannah" while promoting associations such as "lion" and "chair", given that the training set is dominated by images of lions in a circus (i.e. consider a lion tamer controlling a lion with a chair). Additionally, WordNet based correlation approaches need to handle some words that do not exist or have no available relations with other words of the thesaurus. Our research follows the approach that uses Web-based semantic similarity measures in an attempt to overcome the limitations of approaches based on correlations in the training set or in WordNet. The semantic similarity measure used in this research is based on the Normalized Google Distance (NGD) between two terms x and y that was defined by Cilibrasi and Vitányi [9] as:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}),   (1)
where f(x) and f(y) are the counts for search terms x and y using Google and f(x, y) is the number of web pages found on which both x and y occur. N is the total number of web pages searched by Google which, in 2007, was estimated to be more than 8bn pages. Cilibrasi and Vitányi call it the Normalized Web Distance (NWD) when any web-based search engine is used as the source of frequencies. Gracia and Mena [10] applied a transformation to Equation 1, proposing their Web-based semantic relatedness measure between x and y as:

relWeb(x, y) = e^(−2 NWD(x, y))   (2)
This transformation was done in order to get a proper relatedness measure that is a bounded value (in the range [0, 1]) and at the same time increases inversely to distance. The rest of this paper is organised as follows: Section 2 describes our algorithm while Section 3 shows our experiments and results. Finally, Section 4 contains our conclusions and plans for future work.
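Both measures can be computed directly from page-hit counts; the sketch below assumes the counts f(x), f(y), f(x, y) and the index size N have already been obtained from a search engine API (how they are obtained is left to the caller):

from math import log, exp

def ngd(fx, fy, fxy, n_pages):
    num = max(log(fx), log(fy)) - log(fxy)
    den = log(n_pages) - min(log(fx), log(fy))
    return num / den

def rel_web(fx, fy, fxy, n_pages):
    # bounded in [0, 1]; higher values mean more related terms
    return exp(-2.0 * ngd(fx, fy, fxy, n_pages))

# Toy example with made-up counts
print(rel_web(fx=1_000_000, fy=800_000, fxy=50_000, n_pages=8_000_000_000))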
2 Image Annotation Refinement
The model used as baseline annotation is that developed by Yavlinsky et al. [11], who use global features together with non-parametric density estimation. This algorithm yields a probability value, p(ωj | Ji), of each keyword ωj being present in an image Ji of the test set, which is used as a confidence score for our candidate annotations. The candidate annotations are the five with the highest scores. Thus, the confidence score of keyword ωj is:

Confidence score(ωj) = p(ωj | Ji) ≈ p(ωj | x1, ..., xd),   (3)
where x = (x1 , ..., xd ) is a vector of real-valued image features. As a result of our algorithm’s focus on refining baseline image annotations, we concentrate on
test images with at least one accurate keyword among the candidates. In order to meet this requirement, we select images from the test set whose confidence score is greater than a threshold that is set after performing cross-validation on the training set. Then, we apply the semantic similarity measure defined in Equation 2 to pairs of candidate annotations until we detect candidates which are not semantically related to the others. Therefore, we reduce the confidence score of these "noisy" candidates. Finally, we get rid of these irrelevant annotations after re-ranking and selecting, again, the five with the highest confidence scores.
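The following sketch illustrates this refinement step; the relatedness threshold and the penalty applied to "noisy" candidates are illustrative assumptions rather than the values used in the reported experiments:

def refine(candidates, relatedness, threshold=0.1, penalty=0.5, keep=5):
    """candidates: {keyword: confidence}; relatedness(a, b) -> value in [0, 1]."""
    words = list(candidates)
    refined = {}
    for w in words:
        others = [relatedness(w, v) for v in words if v != w]
        avg = sum(others) / len(others) if others else 0.0
        factor = penalty if avg < threshold else 1.0   # weakly related candidate is penalized
        refined[w] = candidates[w] * factor
    return sorted(refined, key=refined.get, reverse=True)[:keep]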
3 Experimental Work and Results
The proposed algorithm was evaluated on the Corel 5k and the ImageCLEF2009 collections (Table 1). The Corel 5k dataset is a collection with a training set of 4,500 images and a test set of 500 images. The vocabulary is made up of 374 words. The ImageCLEF2009 dataset is made up of 5,000 training images and 13,000 test images. The vocabulary consists of 53 words. In both cases, we followed a 10-fold cross-validation for tuning the system parameters. Performance is evaluated with the mean average precision (MAP), which is the average precision, over all queries, at the ranks where recall changes, where relevant items occur.

Table 1. Results obtained for the two datasets

Collection      Metric   Baseline   Google   Yahoo
Corel 5k        MAP      0.2861     0.2882   0.2901
ImageCLEF2009   MAP      0.2613     0.2736   0.2720
We present our results in Table 1, where we have used two different Web-search engines. Independently of the search engine used, we observe an increment in the performance over the baseline. For the Corel 5k, an increment of 1.4% in the MAP is obtained, while for the ImageCLEF2009 we get a 4.7% increment over the baseline. In both cases, we used a combination of Tamura texture and CIELAB colour descriptors.
4 Conclusions and Future Work
We have presented a new approach in automated image annotation that prunes "noisy" keywords using semantic similarity measures by means of Web-search engines such as Google and Yahoo. One of the main advantages of using this approach compared to others is the flexibility that comes from the use of a Web-based search engine, where you can type almost any word and expect to get some results. Another point is that it is not necessary to know the sense
of the word, contrary to WordNet, where a disambiguation task is necessary before estimating the semantic similarity. However, the most important benefit of this approach is that we are not limited to the scope of topics provided by a training set, as happens in statistical correlation cases. As future work, we plan to extend this approach and use Wikipedia as the source of background knowledge, a hybrid solution that can be considered both a Web-based and a thesaurus-based measure. Acknowledgments. Thanks to Jorge Gracia for his helpful comments. This work was partially funded by the EU Pharos project (IST-FP6-45035) and by Santander Corporation.
References

1. Jin, R., Chai, J.Y., Si, L.: Effective automatic image annotation via a coherent language model and active learning. In: Proceedings of the 12th International ACM Conference on Multimedia, pp. 892–899 (2004)
2. Jin, Y., Khan, L., Wang, L., Awad, M.: Image annotations by combining multiple evidence & WordNet. In: Proceedings of the 13th International ACM Conference on Multimedia, pp. 706–715 (2005)
3. Liu, J., Li, M., Ma, W.Y., Liu, Q., Lu, H.: An adaptive graph model for automatic image annotation. In: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 61–70 (2006)
4. Zhou, X., Wang, M., Zhang, Q., Zhang, J., Shi, B.: Automatic image annotation by an iterative approach: incorporating keyword correlations and region matching. In: Proceedings of the International ACM Conference on Image and Video Retrieval, pp. 25–32 (2007)
5. Llorente, A., Rüger, S.: Using second order statistics to enhance automated image annotation. In: Boughanem, M., et al. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 570–577. Springer, Heidelberg (2009)
6. Wang, C., Jing, F., Zhang, L., Zhang, H.J.: Image annotation refinement using random walk with restarts. In: Proceedings of the 14th Annual ACM International Conference on Multimedia, pp. 647–650. ACM, New York (2006)
7. Jin, Y., Wang, L., Khan, L.: Improving image annotations using WordNet. In: Candan, K.S., Celentano, A. (eds.) MIS 2005. LNCS, vol. 3665, pp. 115–130. Springer, Heidelberg (2005)
8. Liu, J., Wang, B., Li, M., Li, Z., Ma, W., Lu, H., Ma, S.: Dual cross-media relevance model for image annotation. In: Proceedings of the 15th International Conference on Multimedia, pp. 605–614 (2007)
9. Cilibrasi, R., Vitányi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
10. Gracia, J., Mena, E.: Web-based measure of semantic relatedness. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 136–150. Springer, Heidelberg (2008)
11. Yavlinsky, A., Schofield, E., Rüger, S.: Automated image annotation using global features and robust nonparametric density estimation. In: Proceedings of the International ACM Conference on Image and Video Retrieval, pp. 507–517 (2005)
Automatic Rating and Selection of Digital Photographs Daniel Kormann, Peter Dunker, and Ronny Paduschek Fraunhofer Institute for Digital Media Technology, Ehrenbergstrasse 31, 98693 Ilmenau, Germany [email protected], [email protected], [email protected]
Abstract. This paper presents methods for automatically rating and selecting digital photographs. The importance of each photograph is estimated by analyzing its content as well as its time-metadata. The presence of people is estimated by combining face and skin detection. Finally, the appeal of each photograph is calculated using a trained SVM classifier. The results of a conducted user study show that the automatically obtained rating coincides well with the perception of the test persons.
1 Introduction
The digitalization has led to an ever-increasing size of personal photo collections. Today, consumers often take hundreds or even thousands of pictures of an event. For many intended purposes, like the creation of slideshows or image galleries, a rating of the images would be desirable, so that the top-scoring images could be automatically selected. Existing approaches solve this task by screening for low-quality images or by considering certain image content aspects like colorfulness [1]. Other works deal with the estimation of aesthetics in digital images [2]. Compared to the above mentioned work, our method focuses on the selection of the "best" images by considering various criteria.
2 Overview
We want to single out the aspects upon which a human observer would rate digital photographs. Based on a study of Savakis et al. [7] as well as our own online survey, the following three criteria can be stated, which are all considered within our system, resulting in separate scores:

Image appeal – Is the image appealing, is it a successful photograph?
Image importance – Does the image show an important subject or event?
Presence of people – Does the image show people, friends and family?
2.1 Image Importance
In order to calculate an importance score for each image i, we consider three aspects: First, if something interesting is happening during an event, the photographic rate rises. Hence, we calculate the local photographic rate fi for each
image i using a weighted window. Second, if a photographer takes many pictures of one subject, this implies that the subject is of a certain importance to him. To detect whether two images are showing the same subject, the SIFT algorithm is used [6]. If there is a reasonable number of matching SIFT keypoint descriptors between two images, they are regarded as showing the same subject. si denotes the number of images showing the same subject as image i. Third, if a photographer takes two or more near-identical images, it implies that this particular shot is of special importance to him. Such duplicate images are detected using the MPEG-7 color layout descriptor [4], where an empirically determined threshold defines whether two images are regarded as duplicates. Similar to above, di denotes the number of duplicate images of image i. A weighted combination results in a final importance score

s_i^importance = u_1 · f_i/f_max + u_2 · d_i/d_max + u_3 · s_i/s_max,

where f_max, s_max and d_max are the maximum values for the entire photo collection, and u_1, u_2, u_3 are weighting factors.
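As a small illustration, the importance scores can be computed as follows once the per-image values fi, di and si are available from the analysis steps above (the equal default weights are an arbitrary assumption):

def importance_scores(f, d, s, u=(1/3, 1/3, 1/3)):
    """f, d, s: per-image photographic rate, duplicate count and same-subject count."""
    f_max, d_max, s_max = (max(f) or 1), (max(d) or 1), (max(s) or 1)
    return [u[0] * fi / f_max + u[1] * di / d_max + u[2] * si / s_max
            for fi, di, si in zip(f, d, s)]

scores = importance_scores(f=[0.2, 0.8, 0.5], d=[0, 2, 1], s=[1, 4, 2])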
2.2 Image Appeal
Machine learning techniques are used to calculate an appeal score for each image. Therefore, 15 low- and mid-level image features were designed, covering the aspects technical quality, composition, simplicity and colorfulness (see Tab. 1). Fourteen test persons of different age groups were asked to provide personal photographs of particularly high and low image appeal. Thus, a training database consisting of 840 images was built. Using this database, an SVM classifier was trained. The resulting SVM model is used to estimate the image appeal. The probability estimation for the "high appeal" class membership of image i is used as the final appeal score s_i^appeal.

Table 1. Overview of selected appeal classification features

technical quality: highlight/shadow clipping, sharpness
simplicity: number of salient regions, number of distinct hues
colorfulness: color distribution, standard deviation hue-channel
composition: rule of thirds, centrality
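A minimal sketch of the appeal classification step described above, assuming the 15 features have already been extracted into a matrix X with labels y (1 = high appeal, 0 = low appeal); scikit-learn is used here purely for illustration:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_appeal_model(X, y):
    model = make_pipeline(StandardScaler(), SVC(probability=True))
    model.fit(X, y)
    return model

def appeal_score(model, features):
    # probability of the "high appeal" class (column order follows sorted labels 0, 1)
    return model.predict_proba([features])[0][1]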
2.3 Presence of People
In order to rate the presence of people in each photograph i, we combine face and skin detection. The number of faces ni in each image as well as their relative size ai are obtained using a face detector based on Haar features and AdaBoost training. The relative amount of skin si in each image is calculated using a skin detector trained on skin colors. All three aspects are combined to a final people score

s_i^people = v_1 · log(n_i) + v_2 · a_i + v_3 · s_i,

where v_1, v_2, v_3 are weighting factors.
Fig. 1. Example of an event with calculated appeal score (AS) and importance score (IS) and the resulting 3 selected images (green) after duplicate removal
2.4 Ranking Framework
The three calculated scores are combined to obtain a final combined score for ranking the images:

s_i^combined = w_1 · s_i^importance + w_2 · s_i^appeal + w_3 · s_i^people.

The weighting wi of each score can be determined by the user or specified by certain presets, according to the application scenario. Due to the calculated importance score, the top-scoring images usually contain many pictures showing the same subject. We again use SIFT keypoints for detecting similar subjects. With this information, only the highest-rated images which show different subjects can be selected (as depicted in Fig. 1). The representativity of the selected images can be increased by performing event detection prior to the ranking process, and selecting the highest-rated images from each event separately. To detect events, we combine and modify two existing approaches [3] [5]. The time gaps between images are first clustered using the k-means algorithm (k = 2). These initial events are then further partitioned by looking for outliers.
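A simplified sketch of the event detection step (k-means on time gaps, without the outlier-based refinement mentioned above); timestamps are assumed to be in seconds:

import numpy as np
from sklearn.cluster import KMeans

def detect_events(timestamps):
    ts = sorted(timestamps)
    gaps = np.diff(ts).reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(gaps)
    boundary_cluster = labels[np.argmax(gaps)]        # cluster containing the largest gap
    events, current = [], [ts[0]]
    for t, lab in zip(ts[1:], labels):
        if lab == boundary_cluster:                   # large gap: start a new event
            events.append(current)
            current = []
        current.append(t)
    events.append(current)
    return events

events = detect_events([0, 60, 130, 4000, 4050, 9000])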
3 Evaluation
As the user's satisfaction with the rating of the images is the main measure for the quality of our method, a user study was conducted. Fourteen participants provided personal photo collections. Images were rated and selected using different presets. Each test person rated the selected images of his/her own photo collection by filling in a questionnaire, which was designed according to guidelines of qualitative research, e.g. using 5-point Likert scales and cross-check questions. The boxplot in Fig. 2 shows one exemplary result of the user study. It represents the overall satisfaction of the test persons with the selected images. As can be clearly seen, the images selected by our method obtained much better overall ratings than randomly selected images. The test results in general are promising. The calculated image appeal was reflected well by the user ratings. Throughout, selections generated by our method obtained considerably better ratings compared to random selections.
Fig. 2. Overall user-satisfaction with the selected images (1-worst, 5-best)
4 Conclusions
We briefly presented our method for rating and selecting personal digital photographs, which rates images based on various criteria, decreases redundancy and increases representativity amongst the selected images. A conducted user study delivered promising results, and verifies our concept of the automatic selection of images.
References

1. Boll, S., Sandhaus, P., Scherp, A., Thieme, S.: MetaXa – Context- and Content-Driven Metadata Enhancement for Personal Photo Books. In: Cham, T.-J., Cai, J., Dorai, C., Rajan, D., Chua, T.-S., Chia, L.-T. (eds.) MMM 2007. LNCS, vol. 4351, pp. 332–343. Springer, Heidelberg (2007)
2. Datta, R., Joshi, D., Li, J., Wang, J.: Studying Aesthetics in Photographic Images Using a Computational Approach. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) IWSM 2000. LNCS (LNAI), vol. 5796, pp. 152–162. Springer, Heidelberg (2009)
3. Graham, A., Garcia-Molina, H., Paepcke, A., Winograd, T.: Time as essence for photo browsing through personal digital libraries. In: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 326–335. ACM Press, New York (2002)
4. Kasutani, E., Yamada, A.: The MPEG-7 color layout descriptor: a compact image feature description for high-speed image/video segment retrieval. In: Proceedings of the 2001 International Conference on Image Processing, vol. 1 (2001)
5. Loui, A., Savakis, A.: Automated event clustering and quality screening of consumer pictures for digital albuming. IEEE Transactions on Multimedia 5(3), 390–402 (2003)
6. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Savakis, A., Etz, S., Loui, A., et al.: Evaluation of image appeal in consumer photography. In: Proceedings SPIE – The International Society for Optical Engineering, pp. 111–121 (2000)
Author Index
Abbasi, Rabeeh  65
Baeza-Yates, Ricardo  1
Beales, Richard  177
Benn, Neil  77
Bürger, Tobias  52, 101
Carvalho, Rodrigo F.  149
Cattaneo, Fabio  77
Chakravarthy, Ajay  177
Ciravegna, Fabio  149
Conconi, Alex  77
Cullen, Charlie  169
Cusano, Claudio  28
Delgado, Jaime  89
Dietze, Stefan  77
Domingue, John  77
Dumitrescu, Alexandra  173
Dunker, Peter  192
Ertl, Thomas  16
Feng, Yue  3
Fukuhara, Yuki  161
García, Narciso  114
Grzegorzek, Marcin  65
Halb, Wolfgang  52, 165
Heim, Philipp  16, 182
Hellmann, Sebastian  182
Ishii, Yutaka  161
Jose, Joemon M.  3, 126
Kaieda, Yohei  161
Kamahara, Junzo  161
Kawamura, Shun  137
Kormann, Daniel  192
Kump, Barbara  40
Leelanupab, Teerapong  3
Lehmann, Jens  182
Lindstaedt, Stefanie  40
Llorente, Ainhoa  188
Lohmann, Steffen  16, 182
López, Fernando  114
Martínez, José M.  114
Matskanis, Nikos  177
McCarthy, Evin  169
McAuley, John  169
Motta, Enrico  188
Nagamatsu, Takashi  161
Neuschmied, Helmut  165
Ohbuchi, Ryutarou  137
P., Punitha  52, 126
Paduschek, Ronny  192
Pammer, Viktoria  40
Rodríguez-Doncel, Víctor  89
Rüger, Stefan  2, 188
Santini, Simone  28, 173
Schettini, Raimondo  28
Simperl, Elena  101
Staab, Steffen  65
Stathopoulos, Vassilios  3
Stegemann, Timo  182
Tetzlaff, Lena  16
Vaughan, Brian  169
Villa, Robert  52
Weiss, Wolfgang  52
Yang, Xiaoyu  177
Ziegler, Jürgen  16