and syntax to get the paragraph(s) related to each image from the article, using the following steps: 1. Images are mentioned in paragraphs in several ways, such as: Fig/Figure n, where n is the figure number, or Figs/Figures n1, n2, ..., nn to refer to several figures. An image name may contain a letter after the number, e.g. Fig 1b, so a regular expression is used to search for figure references in all paragraphs. 2. If the above fails and the image name contains a letter after the figure number, search using the figure number only, without the letter. 3. If the above fails, search in HTML tags in addition to the normal text, since some files contain links to the figure without explicitly mentioning its name.
4. If the above fails, search the text for the word Figure or Fig alone, without any number, since some articles have only one figure and refer to it without a number. 5. HTML tags do not distinguish between the text under the image (the image caption) and the normal text of the article, so check whether the found paragraph is the image caption and ignore it in this case. 6. If the paragraph(s) found contain terms that do not appear in the image caption, add them to the image file; this is done after normalizing both the caption and the paragraph. (A regular-expression sketch of the figure-reference matching of step 1 is given below.) According to [11], textual retrieval achieved the best performance when only caption and title are used, while adding the article abstract or the whole article gave poor results. We therefore used the 2008 collection with title and caption to determine the best methods for indexing and retrieval with the Lemur toolkit. The Lemur toolkit supports indexing of several document types; we prepared the documents in TREC text format. For indexing, we used an Indri index and tested the Porter and Krovetz stemmers; the retrieval models supported by Lemur (KL-divergence, vector space, tf.idf, Okapi and InQuery) were tested for retrieval. The best result was obtained with Indri indexing and the Okapi model. We then used the free stop word list published by the Information Retrieval Group [10], which gave better performance, and we extended the stop word list with common query terms that are not relevant to the medical domain, such as 'show me', 'image', and 'photo', which slightly improved the performance. The pseudo relevance feedback included in the Lemur toolkit was applied, and several feedback settings were checked to determine the best one. These settings were applied after adding the relevant paragraph(s) to the 2008 collection of titles and captions, and there was about 30.
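As an illustration of step 1, the figure-reference search can be implemented with a single regular expression. The pattern below is a minimal sketch and only an assumption about how such matching could look; the exact expression used by the ISSR system is not given in the paper.

```python
import re

# Matches "Fig. 1", "Figure 2b", "Figs 3, 4" or "Figures 5a and 6" style references.
# The trailing letter (e.g. "1b") is captured so that a fallback search on the bare
# number (step 2) can be derived from the same match.
FIGURE_REF = re.compile(
    r"\b(?:Fig(?:ure)?s?\.?)\s*"                # Fig / Figs / Figure / Figures, optional dot
    r"(\d+[a-zA-Z]?"                            # first figure number, optional letter
    r"(?:\s*(?:,|and|&)\s*\d+[a-zA-Z]?)*)",     # optional list: "1, 2 and 3b"
    re.IGNORECASE,
)

def paragraphs_mentioning(figure_number, paragraphs):
    """Return the paragraphs that reference the given figure number."""
    hits = []
    for p in paragraphs:
        for match in FIGURE_REF.finditer(p):
            numbers = re.findall(r"\d+[a-zA-Z]?", match.group(1))
            # step 2: also accept a match on the bare number without its letter
            bare = figure_number.rstrip("abcdefgh")
            if figure_number in numbers or bare in (n.rstrip("abcdefgh") for n in numbers):
                hits.append(p)
                break
    return hits
```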
3  Experimental Results
All runs are text-only; 5 of them are for English queries, 2 for French and 2 for German. The five runs for the English queries are: Run 1 uses only title and caption as text features. Run 2 uses title, caption and the added paragraph(s). Run 3 uses only title and caption with pseudo relevance feedback. Run 4 uses title, caption and the added paragraph(s) with pseudo relevance feedback. Run 5 uses title, caption and the added paragraph(s) with pseudo relevance feedback and the updated stop word list. Two runs were submitted for French queries after translating them into English with the automatic Google translation tool: Run 6 uses title, caption and the added paragraph(s) with the updated stop word list. Run 7 uses title, caption and the added paragraph(s) with the updated stop word list and pseudo relevance feedback. Two runs were submitted for German queries after translating them into English with the same automatic Google translation tool: Run 8 uses
title, caption and added paragraph(s) as text features with the updated stop word list. Run 9 uses title, caption and added paragraph(s) as text features with the updated stop word list and pseudo relevance feedback. Table 1. Results of ISSR nine submitted runs. All of them use textual features only for retrieval. They are all automatic, no manual feedback was involved (AUTO).
¶: added paragraphs, PRF: Pseudo Relevance Feedback, USWL: Updated Stop Word List, MAP: Mean Average Precision, R-Prec: R-Precision.

#  Run Name        Language  ¶    PRF  USWL  MAP     R-Prec  Recall
1  ISSR Text 1     English   No   No   No    0.3499  0.3827  0.7269
2  ISSR Text 2     English   Yes  No   No    0.3315  0.3652  0.7485
3  ISSR Text 1rfb  English   No   Yes  No    0.277   0.3014  0.6791
4  ISSR Text 4     English   Yes  Yes  No    0.2672  0.2916  0.7341
5  ISSR Text 5     English   Yes  Yes  Yes   0.2692  0.2945  0.7358
6  ISSR Text FR 1  French    Yes  No   Yes   0.2951  0.3354  0.6956
7  ISSR Text FR 2  French    Yes  Yes  Yes   0.3111  0.3338  0.7667
8  ISSR Text DE 1  German    Yes  No   Yes   0.1997  0.2314  0.6808
9  ISSR Text DE 2  German    Yes  Yes  Yes   0.1981  0.2197  0.6088
4  Results and Analysis
The results in Table 1 show that the best results are obtained when only the image's caption and title are used as features. Using paragraphs in addition to title and caption decreased the performance, in contrast to the results of the same experiment on the 2008 collection. Recall increased after adding paragraphs (run 2 is better than run 1) and increased slightly after using the updated stop word list (run 5 is better than run 4); these results are similar to those on the 2008 collection. Using the domain-specific stop word list also increased the MAP (run 5 is better than run 4). The decrease in MAP after adding paragraphs was not expected, but a careful analysis of the results of each query shows an improvement in recall, as expected, since more relevant terms are added to the image annotation and therefore more relevant images are retrieved. The proposed method increases recall for 60% of the queries, while recall decreased for 28% and was unchanged for 12%, as shown in Table 2. We believe that improving recall is important for the ImageCLEFmed collection, which can be regarded as a small collection (about 84,000 images), where recall plays a more important role than in larger collections that can rely on information redundancy [13]. One reason for this unexpected result is that the additional text contains many noise terms that are irrelevant to the image. Even worse, some paragraphs mention several figures, so words relevant to one image are added to the other, irrelevant images mentioned in the paragraph. This lowers the similarity between many documents and the query, so that they receive a low rank in the retrieved list.
Table 2. Number of queries affected by adding paragraphs to the title and image caption, for MAP, R-precision and recall, over the 25 English queries

Effect     MAP (queries / ratio)  R-prec (queries / ratio)  Recall (queries / ratio)
Increased  9  / 36%               11 / 44%                  15 / 60%
Same       0  / 0%                3  / 12%                  3  / 12%
Decreased  16 / 64%               11 / 44%                  7  / 28%
This is the reason for the decreased MAP, since its value depends on document ranking. This noise did not decrease retrieval MAP on the 2008 collection for two reasons. Firstly, a significant number of articles in the 2008 dataset did not have HTML files in April 2009, when we were preparing the test data, so no irrelevant text was added to the images in these articles and they kept the same rank in the retrieved list. Secondly, 1605 images (about 2.4% of the collection) had no captions at all, and 471 of them are relevant to at least one query; by applying the proposed technique, annotations are added to these images, which improves their retrieval. Pseudo Relevance Feedback (PRF) is used in runs 3, 4, 5, 7, and 9. PRF is a simple and usually successful query expansion technique in which the most frequent words in the top k documents are used to expand the query. In our case, the first k documents either do not include relevant documents or, if they do, these contain a lot of noise because of the added paragraphs, so many irrelevant terms are added to the modified queries (compare run 4 with run 2). Our English retrieval results ranked 4th among the 13 participating groups in the textual runs, and our best run, with 0.3499 MAP, was the 13th of all 52 textual runs, while the MAP of the overall best run was 0.4293. Our best run with added paragraphs, with 0.3315 MAP, placed us 7th of the 13 groups and was the 18th best of all 52 runs. For the multilingual retrieval task, French and German queries were translated using the Google online machine translation service [8]. The translated French queries increased the MAP over the original English queries by about 15% (run 7 is better than run 5). On the other hand, the German queries decreased the MAP by 26% (run 9 is lower than run 5). However, as the Google translator is a general machine translation system and not a domain-specific tool, imperfect handling of medical terms was expected.
5  Conclusion and Future Work
A simple syntax-based technique for adding relevant text to image annotations is proposed, and this technique is tested for image retrieval using the Lemur toolkit. The results show that it is a promising approach. We intend to enhance this approach using semantic extraction methods such as shallow NLP techniques
or statistical approaches to extract only the relevant sentences from the paragraph referring to the image, instead of adding the whole paragraph, in order to reduce noise terms. Acknowledgments. We thank the ImageCLEFmed track organizers for providing the 2008 data set.
References
1. The Cross Language Image Retrieval Track, http://imageclef.org/
2. Hersh, W., Müller, H.: Image Retrieval in Medicine: The ImageCLEF Medical Image Retrieval Evaluation. ASIS&T Bulletin (March 2007)
3. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medicine - clinical benefits and future directions. International Journal of Medical Informatics 73, 1-23 (2004)
4. Croft, B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice. Addison-Wesley, Reading (2009)
5. Talvensaari, T.: Comparable Corpora in Cross-Language Information Retrieval. University of Tampere, Faculty of Information Sciences (2008)
6. Pirkola, A.: The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval. In: SIGIR 1998 Cross-language Information Retrieval (1998)
7. Hiemstra, D.: Using language models for information retrieval. Ph.D. Thesis, Centre for Telematics and Information Technology (2001)
8. Och, F.J.: Statistical machine translation: Foundations and recent advances. Tutorial at MT Summit X (2005)
9. The Lemur Toolkit for Language Modeling and Information Retrieval, http://www.lemurproject.org
10. IR Linguistic Utilities, http://ir.dcs.gla.ac.uk/resources/linguistic_utils
11. Diaz-Galiano, M., Garcia-Cumbreras, M.A., Martin-Valdivia, M.T., Urena-Lopez, L.A., Montejo-Raez, A.: SINAI at ImageCLEFmed 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
12. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track (2009)
13. Müller, H., Patrick, R.: Query and Document Translation by Automatic Text Categorization: A Simple Approach to Establish a Strong Textual Baseline for ImageCLEFmed 2006. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 57-61. Springer, Heidelberg (2006)
An Integrated Approach for Medical Image Retrieval through Combining Textual and Visual Features
Zheng Ye(1,2), Xiangji Huang(2), Qinmin Hu(2), and Hongfei Lin(1)
(1) Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning, 116023, China
(2) Information Retrieval and Knowledge Management Lab, York University, Toronto, Canada
{yezheng,jhuang}@yorku.ca, vhu@cse.yorku.ca, hflin@dlut.edu.cn
Abstract. In this paper, we present an empirical study of monolingual medical image retrieval through a series of experiments in the ImageCLEFmed 2009 task. There are three main goals. First, we evaluate traditional well-known weighting models from the text retrieval domain, such as BM25, TFIDF and the Language Model (LM), for context-based image retrieval. Second, we evaluate statistical-based feedback models and ontology-based feedback models. Third, we investigate how content-based image retrieval can be integrated with these two basic technologies from the traditional text retrieval domain. The experimental results show that: 1) traditional weighting models work well for the context-based medical image retrieval task, especially when their parameters are tuned properly; 2) statistical-based feedback models can further improve retrieval performance when a small number of documents is used for feedback, whereas medical image retrieval does not benefit from the ontology-based query expansion method used in this paper; 3) retrieval performance can be slightly boosted by an integrated retrieval approach. Keywords: CBIR, Visual and Textual Retrieval, Weighting Model, Pseudo Relevance Feedback, Ontologies, MeSH.
1  Introduction
Medical images are becoming more important in research, diagnosis and teaching as they become available in digital form. Image retrieval techniques can facilitate these processes. In image retrieval, there are two main approaches: context-based image retrieval and content-based image retrieval (CBIR). The first uses the textual context of an image (e.g. its annotation or surrounding text) for retrieval. The techniques used are similar to traditional text retrieval. However, the textual description of an image is usually short and noisy, which differs from traditional ad hoc text collections, so it is necessary to test and adapt traditional weighting models for this particular task. The second one
is CBIR, which uses low-level image features to retrieve images similar to an example. Existing techniques can only extract fairly simple features from images, which may not be representative enough for retrieval purposes. For example, a CBIR system based solely on color features could return an image of blue sky for a query whose example image shows a blue car. In order to address the problems stated above, we first evaluate traditional well-known weighting models from the text retrieval domain, such as BM25, TFIDF and LM, for context-based image retrieval [1]. Second, on the basis of the baseline results, we use statistical-based pseudo relevance feedback and ontology-based query expansion approaches to further enhance retrieval performance. Finally, we note that some images in the same article share the same contextual text (e.g. for comparison purposes), and often only one of these images is the one we are looking for, so it is impossible to filter out the other images using context-based image retrieval alone. We therefore propose to incorporate image content features to complement context-based image retrieval. The remainder of this paper is organized as follows. In Section 2, we describe the basic retrieval models. In Section 3, we present the statistical-based and ontology-based query expansion models used in our experiments. In Section 4, we propose an integrated approach for medical image retrieval. In Section 5, we present the experimental results and analyses. In Section 6, we conclude the paper with a discussion of our findings and a look at future work.
2  Weighting Models
In previous medical image retrieval tasks, a number of different information retrieval (IR) toolkits, such as Lemur, Jirs and Lucene, were used as the basic retrieval systems for context-based medical image retrieval [2,3]. However, there is no systematic comparison of different weighting models for the ImageCLEFmed task. In addition, it is not clear whether the default parameters of these models, empirically tuned for traditional ad hoc datasets, are optimal. In this paper, we compare four well-known weighting models: BM25 [4], JM-LM [5], TFIDF [6] and DFR [7]. The corresponding term weighting functions are as follows.

- BM25:
  w = \frac{(k_1+1)\,tf}{k_1((1-b)+b\,dl/avdl)+tf} \cdot \log\frac{N-n+0.5}{n+0.5} \cdot \frac{(k_3+1)\,qtf}{k_3+qtf}   (1)

- JM-LM:
  w = \left(1 + \frac{\mu}{1-\mu} \cdot \frac{tf \cdot FreqTotColl}{dl \cdot F_t}\right)   (2)

- TFIDF:
  w = qtf \cdot \frac{k_1\,tf}{tf + k_1(1-b+b\,dl/avdl)} \cdot \log\left(1+\frac{N}{n}\right)   (3)

- DFR:
  w = TF \cdot qtf \cdot NORM \cdot \log_e\frac{N+1}{n_{exp}}   (4)
  TF = tf \cdot \log_2(1 + avdl/dl)
  NORM = (tf+1)/(df \cdot (TF+1))
  n_{exp} = idf \cdot (1 - e^{-qtf/df})

where w is the weight of a query term, N is the number of indexed documents in the collection, n is the number of documents containing the term, tf is the within-document term frequency, qtf is the within-query term frequency, dl is the length of the document, avdl is the average document length, and the k_i are tuning constants (which depend on the database and possibly on the nature of the queries and are determined empirically).
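To make the BM25 weighting in Eq. (1) concrete, the following sketch computes the weight of a single query term. It is a minimal illustration, not the authors' implementation; the default parameter values k1 = 1.2, k3 = 8 and b = 0.75 are the ones reported later in Section 5.1.

```python
import math

def bm25_weight(tf, qtf, n, N, dl, avdl, k1=1.2, k3=8.0, b=0.75):
    """BM25 weight of one query term for one document (Eq. 1).

    tf   : term frequency in the document
    qtf  : term frequency in the query
    n    : number of documents containing the term
    N    : number of documents in the collection
    dl   : document length, avdl: average document length
    """
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    idf_part = math.log((N - n + 0.5) / (n + 0.5))
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return tf_part * idf_part * qtf_part

# A document's score for a query is the sum of the weights of the query terms
# that occur in it.
```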
3  Query Expansion
Pseudo-relevance feedback (PRF) via query expansion (QE) has been proven to be effective in many information retrieval (IR) tasks. In this section, we introduce two query expansion methods, namely a statistical-based feedback method in 3.1 and an ontology-based query expansion method in 3.2.

3.1  Query Expansion with the Bose-Einstein Distribution
The pseudo relevance feedback method used in our experiments is the DFR-based term weighting model described in [7]. The basic idea of these term weighting models for query expansion is to measure the divergence of a term's distribution in a pseudo relevance set from its distribution in the whole collection: the higher this divergence, the more likely the term is related to the query topic. We use the Bo1 weighting model in this set of experiments. The Bo1 term weighting model is based on Bose-Einstein statistics. Using this model, the weight of a term t in the exp_doc top-ranked documents is given by:

w(t) = tf_x \cdot \log_2\frac{1+P_n}{P_n} + \log_2(1+P_n)   (5)

where exp_doc usually ranges from 3 to 10 [7]. Another parameter involved in the query expansion mechanism is exp_term, namely the number of terms extracted from the exp_doc top-ranked documents; exp_term is usually larger than exp_doc [7]. P_n is given by F/N, where F is the frequency of the term in the collection and N is the number of documents in the collection. tf_x is the frequency of the term in the exp_doc top-ranked documents.
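A small sketch of Bo1-based expansion follows, under the assumption that term statistics are available as simple dictionaries; the retrieval machinery of the actual system is not shown.

```python
import math
from collections import Counter

def bo1_expansion_terms(feedback_docs, coll_freq, N, exp_term=10):
    """Rank candidate expansion terms with the Bo1 model (Eq. 5).

    feedback_docs : list of token lists, the exp_doc top-ranked documents
    coll_freq     : dict term -> frequency F of the term in the whole collection
    N             : number of documents in the collection
    """
    tf_x = Counter(t for doc in feedback_docs for t in doc)
    scored = {}
    for term, tfx in tf_x.items():
        p_n = coll_freq.get(term, 1) / N          # P_n = F / N
        scored[term] = tfx * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    # keep the exp_term highest-weighted terms as query expansion candidates
    return sorted(scored, key=scored.get, reverse=True)[:exp_term]
```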
3.2  Query Expansion with MeSH Ontology
In the medical domain, terms are highly synonymous and ambiguous. This motivates us to investigate the use of an ontology to expand the original query terms.
The Medical Subject Headings (MeSH1) is a thesaurus developed by the National Library of Medicine of the United States. MeSH contains two organization files, namely an alphabetic list with bags of synonymous and related terms and a hierarchical organization of descriptors associated with the terms. A term is composed of one or more words. We use the longest-match approach to recognize MeSH terms in a query: if all the words of a term are present in the original query, we add its synonymous terms to the query. To compare the words of a particular term with those of the query, we convert all words to lowercase and do not remove stop words in this step. In order to reduce the noise introduced by expanding the original query, only three categories of MeSH terms (A: Anatomy, C: Diseases, E: Analytical, Diagnostic and Therapeutic Techniques and Equipment) have been used for query expansion. Table 6 in Section 5.2 presents the MeSH-based query expansion results under four different basic weighting models.
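The longest-match expansion described above can be sketched as follows. The structure of the MeSH dictionary (a mapping from a multi-word term to its bag of synonyms, restricted to categories A, C and E) is assumed for illustration; it is not the format actually used by the authors.

```python
def expand_with_mesh(query_words, mesh_synonyms):
    """Add the synonyms of every MeSH term whose words all appear in the query.

    query_words   : list of lowercased query words (stop words kept)
    mesh_synonyms : dict mapping a MeSH term (tuple of words) -> list of synonym strings
    """
    query_set = set(query_words)
    expansion = []
    for term, synonyms in mesh_synonyms.items():
        if set(term) <= query_set:      # all words of the term occur in the query
            expansion.extend(synonyms)
    return query_words + expansion

# usage sketch (hypothetical entries):
# mesh = {("lung", "cancer"): ["pulmonary neoplasm"], ("ct",): ["computed tomography"]}
# expand_with_mesh("show me ct images of lung cancer".split(), mesh)
```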
4  An Integrated Approach
Content-based image retrieval (CBIR) systems enable users to search a large image database by issuing an image sample, in which the actual contents of the image are analyzed. The contents of an image refer to its features - colors, shapes, textures, or any other information that can be derived from the image itself. This kind of technology sounds interesting and promising. The key issue in CBIR is to extract representative features to describe an image, which is a very difficult research topic. According to the ImageCLEFmed conference notes [3], CBIR always performs poorly, while context-based image retrieval can always achieve good performance in terms of MAP. However, content features are also needed, especially when context information is not easy to obtain or when a number of images share the same context. This motivates us to combine the two technologies. In particular, we explore three representative features for medical image retrieval.

1. Color and Edge Directivity Descriptor (CEDD): a low-level feature which incorporates color and texture information in a histogram [8].
2. Tamura Histogram Descriptor: features coarseness, contrast, directionality, line-likeness, regularity, and roughness. The relative brightness of pairs of pixels is computed so that the degree of contrast, regularity, coarseness and directionality can be estimated [9].
3. Color Histogram Descriptor: retrieving images based on color similarity is achieved by computing a color histogram for each image that identifies the proportion of pixels within the image holding specific values (which humans express as colors). Current research attempts to segment color proportions by region and by spatial relationship among several color regions. Examining images based on the colors they contain is one of the most widely used techniques because it does not depend on image size or orientation (a minimal sketch of this computation is given below).

1 http://www.nlm.nih.gov/mesh/
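The color histogram descriptor in item 3 can be computed as in the sketch below, which assumes the image is available as an RGB array (e.g. loaded with an imaging library); the bin count of 8 per channel is an illustrative choice, not the setting used in the paper.

```python
import numpy as np

def color_histogram(rgb_image, bins_per_channel=8):
    """Normalized joint RGB histogram: the proportion of pixels in each color bin.

    rgb_image : numpy array of shape (height, width, 3) with values in 0..255
    """
    pixels = rgb_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return hist.flatten() / pixels.shape[0]     # proportions sum to 1

def histogram_similarity(h1, h2):
    """Histogram intersection, a common similarity for color histograms."""
    return float(np.minimum(h1, h2).sum())
```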
The final ranking score is obtained by merging the context-based similarity score (S_context) and the content-based similarity score (S_content). In particular, we use a linear combination:

score = (1 - \lambda) \cdot S_{context} + \lambda \cdot S_{content}   (6)
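A sketch of this late fusion, assuming both retrieval components return a score per image identifier; images missing from one list are treated as having score 0, which is our assumption rather than something specified in the paper.

```python
def fuse_scores(context_scores, content_scores, lam=0.2):
    """Linear combination of context-based and content-based scores (Eq. 6)."""
    all_ids = set(context_scores) | set(content_scores)
    fused = {
        img_id: (1 - lam) * context_scores.get(img_id, 0.0)
                + lam * content_scores.get(img_id, 0.0)
        for img_id in all_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```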
5  Experiments

5.1  Experiments Setting
The dataset used in our experiments is the dataset of the 2009 medical image retrieval task. It contains over 70,000 images from articles published in Radiology and Radiographics, including the text of the captions and a link to the HTML of the full-text articles. More information on the dataset and topics can be found in [2]. For the preprocessing of the image contextual text, we use whitespace to separate words for indexing and searching, and stop words are removed; besides these two simple steps, no further processing is applied. In our experiments, the default values of k_1, k_3 and b in the BM25 function are 1.2, 8 and 0.75 respectively; the default value of μ in the JM-LM model is 0.15.
5.2  Experimental Results and Analyses
In this section, we first discuss the influence of two basic retrieval techniques on context-based medical image retrieval. Then, we present the experimental results of the integrated retrieval approach.

Influence of the Basic Weighting Model. In Table 1, we present the top official textual results of the ImageCLEFmed 2009 track for comparison. The run marked with superscript 'o' is our best official textual run; thereafter, all our official runs are marked in the same way. From Table 2, we can see that the DFR weighting model achieved the best performance under the default settings.

Table 1. MAP performance of the top official textual runs

Runs                    MAP
LIRIS maxMPTT extMPTT   0.4293
sinai CTM t             0.3795
york.In expB2c1.0^o     0.3685
ISSR text 1             0.3499
ceb-essie2-automatic    0.3484
deu run1                0.3389
pivoted clef2009        0.3362
ISSR Text 2             0.3315
BiTeM EN                0.3206
Table 2. Comparison of four weighting models with default settings: BM25, JM-LM, TFIDF and DFR

Runs  BM25^o  JM-LM   TFIDF   DFR
MAP   0.3515  0.3444  0.3608  0.3730

Table 3. MAP performance of the BM25 model with different values of the parameter b

b    0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
MAP  0.3560  0.3554  0.3586  0.3577  0.3553  0.3543  0.3515  0.3503  0.3490  0.3450

Table 4. MAP performance of the JM-LM model with different values of the parameter μ

μ    0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
MAP  0.3401  0.3486  0.3542  0.3603  0.3677  0.3734  0.3774  0.3796  0.3817  0.0390
As for parameter tuning, we test different settings of b in BM25 and μ in JM-LM. From Table 3, we can see that the BM25 model is fairly robust to changes in its tuning parameter. For the JM-LM model, Table 4 shows that the performance increases significantly as μ grows; when μ = 0.9, the JM-LM weighting model outperforms the DFR weighting model. Although JM-LM is not as stable as BM25, it is still promising if the parameter is tuned properly.

Influence of Query Expansion. The main goal of this set of experiments is to investigate how many top documents and terms should be used for query expansion. Due to space limitations, we only present experimental results on the basis of the BM25 model; we observed similar behavior with the other three models. From Table 5 we can see that, in general, the performance can be boosted if the parameters are set properly. However, when the number of documents used for query expansion increases from 5 to 10, the performance drops quickly. The results in Table 5 suggest that only a very small number of documents are useful for query expansion in context-based medical image retrieval. Table 6 presents the ontology-based query expansion results under four different weighting models. The ontology-based query expansion approach does not work as well as we expected. Our conjecture is that MeSH-based query expansion may also bring negative terms into the query, especially abbreviations. In addition, when JM-LM is used as the basic retrieval model, the performance drops remarkably, which is further evidence that the performance of JM-LM is not stable.

Performance of the Integrated Approach. From Table 7, we can see that the retrieval performance can be slightly boosted by integrating content features. Among the three features, CEDD improves the performance most; however, the improvements are marginal. More representative features need to be developed for retrieval purposes.
Table 5. MAP performance of query expansion - baseline MAP = 0.3730 (rows: number of feedback documents; columns: number of expansion terms)

docs/terms   5       10      20      30      50      70      100
5            0.3901  0.3940  0.3947  0.3961  0.3958  0.3954  0.3963 (6.25%)
10           0.3576  0.3622  0.3643  0.3670  0.3672  0.3688  0.3683
20           0.3443  0.3533  0.3520  0.3526  0.3541  0.3561  0.3613
30           0.3371  0.3415  0.3377  0.3401  0.3434  0.3432  0.3448
50           0.3388  0.3405  0.3345  0.3351  0.3372  0.3379  0.3391

Table 6. MAP performance of MeSH-based query expansion

Runs     BM25      JM-LM   TFIDF   DFR
Not-FB   0.3515    0.3444  0.3608  0.3730
MeSH-FB  0.3458^o  0.3056  0.3529  0.3685^o

Table 7. MAP performance of the integrated approach (BM25 as basic weighting model)

Runs     BM25 baseline  CEDD    Tamura  Color
λ = 0.1  0.3515         0.3515  0.3514  0.3529
λ = 0.2  0.3515         0.3552  0.3544  0.3524
λ = 0.3  0.3515         0.3616  0.3544  0.3516

6  Conclusions
In this study, we first evaluate four well-known weighting models for context-based medical image retrieval. The performance of the four weighting models is comparable, but the DFR weighting model works best under default settings. The JM-LM model is not stable for this task; however, if its parameter is tuned properly, it is still promising. Second, we investigate query expansion technologies for the medical image retrieval task. In general, the statistical-based QE method outperforms the ontology-based method, and the experimental results suggest that only a small number of top-ranked documents are useful for statistical-based QE. Ontology-based methods sound interesting and useful, but their actual performance was not good in our experiments; more sophisticated processing is needed for this kind of method. Finally, we explore three content features for content-based medical image retrieval. The experimental results show that retrieval performance can only be slightly improved: the current features extracted from images may not be representative enough to capture their characteristics, and better features are required to improve CBIR. In future work, we plan to work in two directions. First, we will use data-driven approaches to choose optimal parameters for statistical-based QE. Second, we will explore the correlation of different content features of images; if more features can be integrated into medical image retrieval properly, we believe the retrieval performance can be further improved.
Acknowledgements. We thank the reviewers for their valuable and constructive comments. This research is jointly supported by NSERC of Canada, the Early Researcher/Premier's Research Excellence Award, the Natural Science Foundation of China (No. 60373095 and 60673039) and the National High Tech Research and Development Plan of China (2006AA01Z151).
References
1. Zhai, C.: Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2(3), 137-213 (2008)
2. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 Medical Image Retrieval Track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Müller, H., Deselaers, T., Kim, E., Kalpathy-Cramer, J., Deserno, T., Clough, P., Hersh, W.: Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 472-491. Springer, Heidelberg (2008)
4. Hancock-Beaulieu, M., Gatford, M., Huang, X., Robertson, S.E., Walker, S., Williams, P.W.: Okapi at TREC-5. In: Text REtrieval Conference (TREC) TREC-5 Proceedings (1996)
5. Zhang, R., Yi, C., Zheng, Z., Metzler, D., Nie, J.: Search Result Re-ranking by Feedback Control Adjustment for Time-Sensitive Query. In: Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), pp. 165-168 (2009)
6. Singhal, A., Buckley, C., Mitra, M.: Pivoted Document Length Normalization. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21-29. ACM, New York (1996)
7. Amati, G.: Probabilistic Models for Information Retrieval based on Divergence From Randomness. PhD thesis, Department of Computing Science, University of Glasgow (2003)
8. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and Edge Directivity Descriptor. A Compact Descriptor for Image Indexing and Retrieval. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 312-322. Springer, Heidelberg (2008)
9. Tamura, H., Mori, S., Yamawaki, T.: Textural Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man and Cybernetics 8, 460-473 (1978)
Analysis Combination and Pseudo Relevance Feedback in Conceptual Language Model: LIRIS Participation at ImageCLEFmed
Loïc Maisonnasse(1), Farah Harrathi(1), Catherine Roussey(1,2), and Sylvie Calabretto(1)
(1) Université de Lyon, CNRS, INSA de Lyon, Université Lyon 1, LIRIS UMR 5205, Villeurbanne, France
firstname.lastname@liris.cnrs.fr
(2) Cemagref, 24 Av. des Landais, BP 50085, 63172 Aubière, France
Abstract. This paper presents the LIRIS contribution to the CLEF 2009 medical retrieval task (ImageCLEFmed). Our model makes use of the textual part of the corpus and of the medical knowledge found in the Unified Medical Language System (UMLS) knowledge sources. As proposed in [6] last year, we use a conceptual representation for each sentence and a language modeling approach. We test two versions of the conceptual unigram language model: one that uses the log-probability of the query and a second one that computes the Kullback-Leibler divergence. We use different concept detection methods and combine these detection methods on queries and documents. This year we mainly test the impact of using additional analyses of the queries. We also test combinations on French queries, where we combine translation and analysis in order to compensate for the lack of French terms in UMLS; this provides good results, close to the English ones. To complete these combinations we propose a pseudo relevance feedback method. This approach uses the n first retrieved documents to form one pseudo query that is used in the Kullback-Leibler model to complement the original query. The results show that extending the queries with such an approach improves the results.
1  Introduction
The previous ImageCLEFmed tracks have shown the advantages of conceptual indexing (see [6]). Such indexing allows one to better capture the content of queries and documents and to match them at an abstract semantic level. On such conceptual representations, [5] proposed a conceptual language modeling approach that handles different conceptual representations of documents or queries. In this paper we extend this approach in several ways. The rsv value in [5] is computed through a simple query likelihood; here we also evaluate the use of a Kullback-Leibler divergence, as proposed in many language model approaches. Then we
compare combinations of conceptual representations under the divergence rather than under the likelihood. In last year's participation we used two analyses for documents and queries; as the results presented in [5] show that combining analyses on queries is an easy way to improve the results, this year we make use of two supplementary analyses on the queries. One of them is a new concept detection method that uses only statistical techniques. Finally, we complete this model with a pseudo relevance feedback extension of the queries based on our language model approach. This paper first presents the different extensions of our conceptual model. Then we detail the different document and query analyses. Finally we show and discuss the results obtained at CLEF 2009.
2  Conceptual Model
We rely on a language model defined over concepts, as proposed in [5], which we refer to as the conceptual unigram model. We assume that a query q is composed of a set C of concepts, each concept being independent of the others conditionally on a document model. First, we compute the rsv of this approach by simply computing the log-probability of the concept set C given a model M_d of the document d:

RSV_{log}(q,d) = \log P(C|M_d) = \sum_{c_i \in C} \log\left(P(c_i|M_d)^{\#(c_i,q)}\right)   (1)

where \#(c_i,q) denotes the number of times concept c_i occurs in the query q. The quantity P(c_i|M_d) is estimated through maximum likelihood with Jelinek-Mercer smoothing:

P(c_i|M_d) = (1-\lambda_u)\,\frac{|c_i|_d}{|*|_d} + \lambda_u\,\frac{|c_i|_D}{|*|_D}

where |c_i|_d (respectively |c_i|_D) is the frequency of concept c_i in the document d (respectively in the collection D), and |*|_d (respectively |*|_D) is the size of d, i.e. the number of concepts in d (respectively in the collection). In a second approach, we compute the rsv of a query q for a document d by using the Kullback-Leibler divergence between the document model M_d estimated over d and the query model M_q estimated over q:

RSV_{kld}(q,d) = -D(M_q \| M_d) = -\sum_{c_i \in C} P(c_i|M_q)\,\log\frac{P(c_i|M_q)}{P(c_i|M_d)}
               = \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_d) - \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_q)   (2)

Since the last term corresponds to the query entropy and does not affect document ranking, we only compute:

RSV_{kld}(q,d) \propto \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_d)   (3)

where P(c_i|M_d) is estimated as previously, and P(c_i|M_q) is computed through maximum likelihood on the query as P(c_i|M_q) = |c_i|_q / |*|_q, where |c_i|_q is the frequency of concept c_i in the query and |*|_q is the size of q.
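A compact sketch of the smoothed scoring in Eqs. (1)-(3), assuming concept counts are held in plain dictionaries; it is meant to illustrate the formulas, not to reproduce the authors' system.

```python
import math

def p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len, lam=0.5):
    """Jelinek-Mercer smoothed P(c | M_d). lam plays the role of lambda_u;
    its value here is illustrative, not the one used in the experiments."""
    return ((1 - lam) * doc_counts.get(c, 0) / doc_len
            + lam * coll_counts.get(c, 0) / coll_len)

def rsv_log(query_counts, doc_counts, coll_counts, doc_len, coll_len):
    """Log-probability score of Eq. (1); assumes every query concept occurs in the collection."""
    return sum(qc * math.log(p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len))
               for c, qc in query_counts.items())

def rsv_kld(query_counts, doc_counts, coll_counts, doc_len, coll_len):
    """Rank-equivalent KL score of Eq. (3): sum over c of P(c|M_q) * log P(c|M_d)."""
    q_len = sum(query_counts.values())
    return sum((qc / q_len) * math.log(p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len))
               for c, qc in query_counts.items())
```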
2.1  Model Combination
We present here the method used to combine different sets of concepts (i.e. concepts obtained from different analyses of queries and/or documents) with the two rsv defined above. We use the results obtained in [5] to select the best combinations on queries and documents. First, we group the different analyses of a query. To do so, we assume that a query is represented by a set of concept sets Q = {C_q}, and that the probability of this set given a document model is computed as the product of the probabilities of each concept set C_q. Whether the rsv uses the log-probability (RSV_log) or the divergence (RSV_kld), the combination is computed through a sum over the different query representations:

RSV(Q,d) \propto \sum_{C_q \in Q} RSV(C_q, d)   (4)
where RSV(C_q,d) is either RSV_log (equation 1) or RSV_kld (equation 3). With this fusion, the best rsv is obtained for a document model that can generate all analyses of the query with high probability. Second, we group the different analyses d of a document, D = {d}. We assume that a query can be generated by different models of the same document (i.e. a set of models corresponding to each analysis d of D). Based on the results of [5], we keep the highest probability among the different models:

RSV(Q,D) = \operatorname{argmax}_{d \in D} RSV(Q,d)   (5)

With this method, documents are ranked, for a given query, according to their best document model.

2.2  Pseudo Relevance Feedback
Based on the n first results retrieved for one query set Q by one RSV (equation 4), we compute a pseudo relevance feedback score PRF. This score corresponds to the rsv obtained by the pseudo query Q_fd, built by merging the n first documents retrieved with the query Q, added with a smoothing parameter to the results obtained by the original query Q:

PRF(Q_{fd}, d) = (1-\lambda_{prf})\,RSV(Q,d) + \lambda_{prf}\,RSV(Q_{fd},d)   (6)

where RSV(Q,d) is either RSV_log or RSV_kld, and RSV(Q_fd,d) is the same type of rsv applied to the pseudo-query Q_fd, which corresponds to the merging of the n first results retrieved by RSV(Q,d). λ_prf is a smoothing parameter that gives lower or higher importance to the pseudo query. If different collection analyses are used, we finally merge these results using equation 5.
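The feedback step of Eq. (6) can be sketched as follows; rsv is assumed to be one of the scoring functions above, and concatenating the top-n documents into a single pseudo-query is a simplification of the merging described in the text.

```python
from collections import Counter

def prf_score(query_counts, ranked_docs, all_doc_counts, rsv, n=100, lam_prf=0.5):
    """Interpolate the original query score with the score of a pseudo-query
    built from the concept counts of the n first retrieved documents (Eq. 6).

    ranked_docs    : document ids ordered by RSV(Q, d)
    all_doc_counts : dict doc_id -> concept count dict
    rsv            : scoring function taking (query_counts, doc_id)
    """
    pseudo_query = Counter()
    for doc_id in ranked_docs[:n]:
        pseudo_query.update(all_doc_counts[doc_id])
    return {
        doc_id: (1 - lam_prf) * rsv(query_counts, doc_id)
                + lam_prf * rsv(pseudo_query, doc_id)
        for doc_id in all_doc_counts
    }
```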
3  Concept Detection
UMLS is a good candidate as a knowledge source for medical text indexing. It is more than a terminology because it describes terms with associated concepts.
This knowledge source is large (more than 1 million concepts and 5.5 million terms in 17 languages). UMLS is not an ontology, as there is no formal description of concepts, but its large set of terms and their variants, specific to the medical domain, enables full-scale conceptual indexing. In UMLS, all concepts are assigned to at least one semantic type from the Semantic Network. This provides a consistent categorization of all concepts in the meta-thesaurus at the relatively general level represented in the Semantic Network. The Semantic Network also contains relations between concepts, which allow one to derive relations between the concepts found in documents (and queries).

3.1  Linguistic Detection Process
The detection of concepts from a thesaurus based on a linguistic analysis of the document is a relatively well-established process. It consists of four major steps (refer to [5] for details):

1. Morpho-syntactic analysis (POS tagging) of the document with lemmatization of inflected word forms;
2. Filtering of empty words on the basis of their grammatical class;
3. Detection in the document of words or phrases appearing in the meta-thesaurus;
4. Optional filtering of the identified concepts.

3.2  Statistical Detection Process
We developed a statistical method of concept detection that can be applied to several languages without any linguistic analysis. This method replaces the morpho-syntactic analysis (steps 1 and 2 of the previous section) by statistical processing. Our method is composed of four main steps:

1. Empty word and simple term extraction based on corpus analysis;
2. Compound term extraction;
3. Concept detection;
4. Concept filtering.
The last two steps (3 and 4) are similar to the linguistic detection process, so we do not describe them in the following paragraphs. Empty Word and Simple Term Extraction. Empty words are words that have no discriminative power to identify a specific document in a corpus, because they are distributed evenly over all documents. They can be stop words or general words such as the days of the week. In order to extract the empty words of the document collection we use two corpora: the indexing corpus and the support corpus. The support corpus should be in the same language as the indexing corpus but should deal with another domain. For example, in our experiments the indexing corpus is about medicine and the support corpus is about law (the
European Parliament collection1). We define an empty word as a word that belongs to both the indexing corpus and the support corpus and whose frequency in the two corpora is above a threshold fixed experimentally. Simple terms are the words of the indexing corpus that are not detected as empty words. Compound Term Extraction. We consider a compound term (a term composed of more than one word) as a kind of word collocation. According to [1], words involved in a collocation can be detected under two assumptions: (1) the words must appear together significantly more often than expected by chance; (2) the words should appear in a relatively rigid way because of syntactic constraints. [3] uses the Mutual Information (MI) measure to extract collocations of two words. Unfortunately, the MI measure is not able to extract compound terms containing empty words, and it is not adapted to extracting compound terms of more than two words. We therefore adapt the Mutual Information measure to avoid these two drawbacks. Considering two words m1 and m2, our Adapted Mutual Information (AMI) is:

AMI(m_1, m_2) = \log_2 \frac{P(m_1, m_2)}{P(m_1)^2}   if m_2 is an empty word,
AMI(m_1, m_2) = \log_2 \frac{P(m_1, m_2)}{P(m_1)\,P(m_2)}   otherwise.   (7)

P(m1) is estimated by counting the number of observations of m1 in the collection and normalizing by N, the size of the collection; P(m1, m2) is estimated by counting the number of times that m1 is followed by m2 and normalizing by N. The term extraction process is iterative and incremental: the compound terms of iteration i+1 (i.e. of length i+1 words) are built from the terms of iteration i (of length i words), starting from the simple terms of length one. For each pair (term found so far, another word of the indexing corpus), we compute the AMI; if the AMI is above a threshold, the new compound term is added to the starting list of the next iteration. In our experiments we fixed the AMI threshold to 15. The iterations continue as long as new compound terms are extracted.
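A sketch of the first iteration (two-word terms) of this extraction follows; longer compounds would be built by repeating the step, as described above. The corpus representation (a list of token lists) is an assumption for illustration.

```python
import math
from collections import Counter

def extract_two_word_terms(corpus, empty_words, threshold=15.0):
    """First AMI iteration: extract two-word compound terms (Eq. 7).

    corpus      : list of documents, each a list of tokens
    empty_words : set of words considered empty
    """
    N = sum(len(doc) for doc in corpus)
    unigram = Counter(tok for doc in corpus for tok in doc)
    bigram = Counter((doc[k], doc[k + 1]) for doc in corpus for k in range(len(doc) - 1))

    compounds = []
    for (m1, m2), pair_count in bigram.items():
        p12 = pair_count / N
        p1 = unigram[m1] / N
        if m2 in empty_words:
            ami = math.log2(p12 / (p1 ** 2))
        else:
            ami = math.log2(p12 / (p1 * (unigram[m2] / N)))
        if ami > threshold:
            compounds.append((m1, m2))
    return compounds

# Longer terms are built incrementally: each retained term of length i is combined
# with a following word and kept if its AMI again exceeds the threshold.
```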
3.3  Linguistic Detection versus Statistical Detection
We test our new statistical concept detection process on the CLEFmed 2007 collection using UMLS, and compare the statistical detection with the results obtained by [4] using linguistic techniques on the same collection and UMLS. In [4], three linguistic analyzers are used prior to concept detection: MetaMap (MM), MiniPar (MP) and TreeTagger (TT). The results obtained by these analyzers and by our statistical method, which we call FA, are given in Table 1. The linguistic methods perform slightly better in MAP, while the statistical method performs better in P@5. Thus we can conclude
1 http://www.statmt.org/europarl/
that our statistical method of concept detection gives results similar to those obtained with linguistic techniques.

Table 1. Comparison of statistical versus linguistic concept detection on CLEFmed 2007 (Δ values: relative change of the statistical method FA with respect to each linguistic analysis)

Method       Analysis  MAP    P@5    Δ MAP   Δ P@5
Linguistic   MM        0.246  0.357  -0.81%  19.05%
Linguistic   MP        0.246  0.424  -0.81%  0.24%
Linguistic   TT        0.258  0.462  -5.43%  -8.01%
Statistical  FA        0.244  0.425

3.4  Our Four Detection Processes
Based on the previous results, we use these four analyses in our experiments. From these analyses, we use MP and TT to analyse the collection, and we pick some of them to analyse the queries depending on the run. This year we also test this combination approach on French queries, where we first detect concepts with our term-mapping tools and the French version of TreeTagger. We then translate the French queries into English with the Google API2 and extract concepts from this English translation with the MP and TT analyses.
4  Evaluation
We train our methods on the CLEFmed 2008 corpus and run the best parameters obtained on the CLEFmed 2009 corpus [2]. On this year's collection, we submitted 10 runs exploring different variations of our model. Since the previous year's results showed that merging query analyses improves the results, this year we test the impact of adding new analyses on the queries only. We first test three model variations:

- (UNI.log) the conceptual unigram model (as defined in 1);
- (UNI.kld) the conceptual unigram model with the divergence (as defined in 3);
- (PRF.kld) the conceptual unigram model combined with pseudo relevance feedback (as defined in 6).

For each model, we test it on the collection analysed by two detection methods, MiniPar and TreeTagger (MPTT), using the model combination method proposed in Section 2.1, and with the three following query analyses:

- (MPTT) groups the MP and TT analyses;
- (MMMPTT) groups the two preceding analyses with the MM one;
- (MMMPTTFA) groups the three preceding analyses with the FA one.

2 http://code.google.com/intl/fr/apis/ajaxlanguage/documentation/
4.1  Results
For each method we use the best parameters obtained on the ImageCLEFmed 2008 corpus for MAP and apply these parameters to the new 2009 collection. We first compare, in Table 2, the results of the two rsv definitions for MAP and for different query merging strategies. The results show that the two rsv give close results on the 2008 queries. On the 2009 queries, our best result is obtained with the log-probability and two analyses (MPTT) of the query. Using the four analyses (MMMPTTFA), the log-probability and the KL-divergence give close results. As presented before, we also test our combination model on French queries: from these queries we obtain different concept sets by merging detection methods and by translating, or not, the query into English in order to find the UMLS concepts that are not linked to French terms. This method obtains a good result of 0.377 in MAP, which shows that the combination methods can also be applied to translated queries. We then test our pseudo relevance feedback method: we query with RSV_kld and apply the relevance feedback; the results are presented in Table 3. On the 2008 queries, the best results are obtained with the pseudo query built from the 100 first documents initially retrieved, and merging more analyses of the query improves the results. On 2009 the results are also good, but the best results are obtained using only two analyses (MPTT).

Table 2. Results (MAP) for different query analysis combinations, for the two unigram models

                 MPTT           MMMPTT         MMMPTTFA
                 2008   2009    2008   2009    2008   2009
log-probability  0.280  0.420   0.276  0.412   0.281  0.410
KL-divergence    0.279  -       -      -       -      0.416
Table 3. Results (MAP) for different sizes of the pseudo relevance feedback with the Kullback-Leibler divergence and different query analyses

size of the         MPTT           MMMPTT         MPTTFA  MMMPTTFA
pseudo query (n)    2008   2009    2008   2009    2009    2009
20                  0.279          0.281
50                  0.289          0.290
100                 0.292  0.429   0.299  0.416   0.424   0.418

5  Conclusion
Using the conceptual language model provides good performance in medical IR, and merging conceptual analyses still improves the results. This year we explored a variation of this model by testing the use of the Kullback-Leibler divergence, and we improved it by integrating pseudo relevance feedback. The
two model variations provide good but similar results. Adding pseudo relevance feedback improves the results, providing the best MAP of the 2009 CLEF campaign. We also carried out an experiment on French queries where we used the combination method to compensate for the lack of French terms in UMLS; these results show that the combination methods can also be used over various concept detection methods.
References
[1] Smadja, F.A.: Retrieving collocations from text: Xtract. Computational Linguistics 19(1), 143-177 (1993)
[2] Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009)
[3] Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29 (1990)
[4] Chevallet, J.P., Maisonnasse, L., Gaussier, E.: Combinaison d'analyses sémantiques pour la recherche d'information médicale. In: Atelier RISE (Recherche d'Information SEmantique), conférence INFORSID (May 2009)
[5] Gaussier, E., Maisonnasse, L., Chevallet, J.P.: Model fusion in conceptual language modeling. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956. Springer, Heidelberg (2008)
[6] Gaussier, E., Maisonnasse, L., Chevallet, J.P.: Multiplying concept sources for graph modeling. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 585-592. Springer, Heidelberg (2008)
The MedGIFT Group at ImageCLEF 2009
Xin Zhou(1), Ivan Eggel(2), and Henning Müller(1,2)
(1) Geneva University Hospitals and University of Geneva, Switzerland
(2) University of Applied Sciences Western Switzerland, Sierre, Switzerland
henning.mueller@sim.hcuge.ch
Abstract. MedGIFT is a medical imaging research group of the Geneva University Hospitals and the University of Geneva, Switzerland. Since 2004 the group has participated in ImageCLEF each year, focusing on the medical imaging tasks. For the medical image retrieval task, two existing retrieval engines were used: the GNU Image Finding Tool (GIFT) for visual retrieval and Apache Lucene for text. Various strategies were applied to improve retrieval performance. In total, 16 runs were submitted, 10 for the image-based topics and 6 for the case-based topics. The baseline GIFT setup used for the past three years obtained the best results among all our submissions. For medical image annotation two approaches were tested. One approach uses GIFT for retrieval and kNN (k-Nearest Neighbors) for classification; the second uses the Scale-Invariant Feature Transform (SIFT) with a Support Vector Machine (SVM) classifier. Three runs were submitted, two with the GIFT-kNN approach and one using the common results of the two approaches. The GIFT-kNN approach gave stable results. The SIFT-SVM approach did not achieve the expected performance, most likely because the SVM kernel used was not optimized.
1  Introduction
A medical retrieval task has been part of ImageCLEF1 since 2004 [1,2]. The MedGIFT2 research group has participated in all these competitions using the same technology as a baseline and tried to improve the performance of this baseline over time. GIFT3 (the GNU Image Finding Tool, [3]) has been the technology used for visual retrieval. Visual runs using GIFT have also been made available to other participants of ImageCLEF. For text retrieval, Lucene4 was employed in 2009. The full text of the articles was indexed with no optimization. More information concerning the setup and collections of the medical retrieval task can be found in [4].

1 http://www.imageclef.org/
2 http://www.sim.hcuge.ch/medgift/
3 http://www.gnu.org/software/gift/
4 http://lucene.apache.org/
2  Retrieval Tools Reused
This section describes the basic technologies used for retrieval.

2.1  Text Retrieval Approach
The text retrieval used in 2009 is based on Lucene. No specific terminologies such as MeSH (Medical Subject Headings) were used. Only one textual run was submitted. The texts were indexed entirely from the HTML (Hyper Text Markup Language), removing links and metadata. The query text was not modified.
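Extracting the indexable text from the HTML articles, as described above, can be done with Python's standard html.parser; this is only an illustrative sketch, not the indexing code actually used with Lucene.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles, link anchors and the head."""
    SKIP = {"script", "style", "a", "head"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```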
2.2  Visual Retrieval Techniques
GIFT has been used for the visual retrieval for the past five years. This tool is open source and can be used by other participants of ImageCLEF as well. The goal of using standard GIFT is also to provide a baseline to facilitate the evaluation of other techniques. GIFT uses a partitioning of the image into fixed regions to obtain local features. During the last 3 years, the performance obtained by GIFT remained unsatisfying. Various strategies were tried out in order to get improvements, such as integration of aspect–ratio as feature, automatic query expansion and threshold optimization for axes for the annotation task. In ImageCLEF 2009, query expansion with negative examples was carried out for the image retrieval task, and SIFT features were integrated into the image annotation task.
3  Results
This section describes our main results for the two medical tasks.

3.1  Medical Image Retrieval
All runs were obtained using GIFT with 8 gray levels. Various strategies were tried to increase performance. One strategy is to query with the images belonging to one topic separately and then to combine the obtained results. Another strategy is to apply negative feedback using the query images of other topics, as we assume that the topics are sufficiently different. Adding the aspect ratio is another feature that has worked well in the past. In total, 16 automatic runs were submitted: 2 textual, 10 visual and 4 mixed; 10 runs were for the image-based retrieval topics and 6 for the case-based topics. Runs are labeled by the strategies applied. The labels and their meanings are:
- txt: textual retrieval;
- vis: visual retrieval;
- mix: combination of textual and visual retrieval;
- sep: one query per image is performed to produce a list of similar images for each query image;
– AR: adding aspect ratio;
– NgRan: query expansion by randomly taking images from other topics as negative examples;
– sum: basic results fusion: if one item has several similarity scores, the sum of all scores is used;
– max: basic results fusion: if one item has several similarity scores, the maximum value is used;
– 0.x: for a mixed run, 0.x is the weight for the visual retrieval and (1 − 0.x) for the textual retrieval;
– EN: the language used for textual retrieval is English;
– BySim: for results fusion, each result is weighted by the similarity score given by Lucene/GIFT;
– ByFreq: for results fusion, each result is weighted by the number of appearances.
The results of the 25 ad–hoc topics are shown in Table 1 and those of the case–based topics (26–30) are shown in Table 2. Mean average precision (MAP), binary preference (Bpref), and early precisions (P10, P30) are used as measures.

Table 1. Results of the runs for the image–based topics

Run                                  Run type  MAP     Bpref   P10    P30     num rel ret
best textual run (LIRIS)             Textual   0.4293  0.4568  0.664  0.552   1814
HES-SO-VS txt EN                     Textual   0.3179  0.3498  0.600  0.4987  1462
MedGIFT vis GIFT8 (best visual run)  Visual    0.0153  0.0347  0.068  0.0467  284
MedGIFT vis sep max                  Visual    0.0131  0.0276  0.076  0.056   266
MedGIFT vis sep sum AR               Visual    0.013   0.0303  0.072  0.052   262
MedGIFT vis sep sum                  Visual    0.0114  0.0282  0.052  0.0573  259
MedGIFT vis sep max AR               Visual    0.0102  0.0303  0.076  0.0547  253
MedGIFT vis sum negRan               Visual    0.0098  0.028   0.044  0.053   210
MedGIFT vis max negRan               Visual    0.0079  0.0248  0.044  0.044   201
best automatic mixed run (DEU)       Mixed     0.3682  0.386   0.544  0.4827  1753
MedGIFT mix 0.3NegRan EN             Mixed     0.29    0.3216  0.604  0.516   1176
MedGIFT mix 0.5 EN                   Mixed     0.2097  0.2456  0.592  0.4293  848
MedGIFT mix 0.5NegRan EN             Mixed     0.1354  0.1691  0.488  0.3267  547
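The fusion labels above (sum, max, and the 0.x linear weighting) can be illustrated with the minimal sketch below. It is not the actual medGIFT code; the data layout and the absence of any score normalization are simplifying assumptions made for the example.

```python
# Hedged sketch of the result-fusion rules named above; data layout is assumed.
def fuse_runs(result_lists, mode="sum"):
    """Fuse several ranked lists; each list maps image id -> similarity score."""
    fused = {}
    for results in result_lists:
        for image_id, score in results.items():
            if mode == "sum":                      # "sum" label: add all scores
                fused[image_id] = fused.get(image_id, 0.0) + score
            else:                                  # "max" label: keep the best score
                fused[image_id] = max(fused.get(image_id, 0.0), score)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def mix_visual_textual(visual, textual, visual_weight=0.3):
    """"0.x" label: linear combination of the visual and textual scores."""
    fused = {}
    for image_id in set(visual) | set(textual):
        fused[image_id] = (visual_weight * visual.get(image_id, 0.0)
                           + (1.0 - visual_weight) * textual.get(image_id, 0.0))
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```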
Image–Based Topics. In total, 59 textual runs were submitted for ImageCLEFmed 2009. The average score (MAP) for the textual runs is around 0.3. The Lucene search engine with a standard setup (HES–SO–VS txt.txt) performed slightly better than the average. The best textual runs used mapping of text to Medical Subject Headings (MeSH) or the Unified Medical Language System (UMLS) to reach an improvement [5,6,7]. 5 groups submitted 16 visual runs. Our best run is the baseline that used GIFT with 8 gray levels (MedGIFT vis GIFT8.txt). The baseline obtained the highest MAP among all visual runs. The run using the one query per image strategy was officially ranked as second but it outperformed the other visual
runs on early precision. As the performance was fairly limited, additional tests were performed and are described in Section 3.1. The second best visual run was submitted by the Image and Text Integration (ITI) group from the National Library of Medicine. Various low level global features were used and a linear combination of these features was applied [8]. SVMs were used to map visual features to semantic terms based on a predefined visual concept tree built from the consolidated ImageCLEFmed collection. Despite the integration of a visual concept tree with machine learning, the results were not extremely high. There were 29 mixed textual/visual runs. The MedGIFT runs are among the five best runs. However, as the textual runs outperform the visual runs, many mixed runs are not even as good as the corresponding textual runs. Compared with our textual baseline run, all mixed runs obtained worse performance. Several other groups had similar conclusions [8,9]. York University declared that the Color and Edge Directivity Descriptor (CEDD) slightly boosted the performance of a textual run [10]. Both the group from York University and our group used a similar linear combination strategy for fusing the results. Considering the fact that the visual runs submitted by York University obtained the worst results among all submitted visual runs, the improvement detected by York University might require further investigation. The best mixed run is from the DEU group, which combined visual and textual features into a single feature matrix [11]. The results show that fusion in the feature space can obtain good results.
Follow–Up Analyses. Follow–up analyses were performed once the ground truth was available. Using the one query per image strategy and negative query expansion did not improve visual retrieval. The performance for each topic with the three main techniques is shown in Figure 1. The similarities among the topic images of each topic are shown in Figure 2 to distinguish homogeneous and heterogeneous topics. To obtain the similarity among topic images, all topic images were indexed and queries with each topic image were performed. In a pairwise comparison, the images of one topic were analyzed. The result shown is the average score over all pairwise comparisons per topic using the GIFT baseline run. For the submitted visual runs with negative query expansion, negative examples were randomly selected. In an additional approach, negative examples were selected based on the similarity score obtained through visual queries. These runs slightly outperformed the submitted runs using negative examples. In Figure 1, the new run using one negative example is also presented. Comparing the baseline (MedGIFT vis GIFT8) with one query per image (MedGIFT vis sep max) shows that the performance for a topic is not correlated with the similarity among the topic images. On the one hand, topics 2, 4, 14, 24, 25 contain images with little similarity, with only topic 2 being improved using one query per image. On the other hand, topics 5, 15, 23 contain very similar images and still the one query per image strategy gave significantly better results. For all other topics the baseline obtained better scores. Using negative examples outperformed the baseline run and the one query per image strategy only rarely (for example topics 6, 16, 18).
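The per-topic homogeneity analysis behind Figure 2 (average similarity over all pairs of query images of one topic) can be expressed with the short sketch below; the `similarity` function is a placeholder for a GIFT query of one topic image against another and is not the actual medGIFT implementation.

```python
from itertools import combinations

def topic_homogeneity(topic_images, similarity):
    """Average similarity over all pairs of query images of one topic.

    `similarity(a, b)` is assumed to return the score of image b in the result
    list obtained when querying with image a (placeholder for a GIFT query).
    """
    pairs = list(combinations(topic_images, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```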
Fig. 1. The performance obtained by the GIFT configurations (medGIFT_vis_GIFT8 baseline, medGIFT_vis_sep_max, medGIFT_vis_neg1) for visual retrieval per topic
Fig. 2. The similarity among the images of one topic showing whether the images depict a similar topic or different aspects of the search topic
Case–Based Topics. In 2009, the MedGIFT group submitted 1 mixed run, 4 visual runs and 1 textual run for case–based topics. In total, 11 textual runs, 2 mixed runs and 5 visual runs were submitted for this task. In Table 2, the three best runs of other groups are shown; all of them were textual. The MedGIFT runs were the best for visual and mixed retrieval. Our best textual run used Lucene with its standard configuration (HES-SO-VS txt case). By combining the visual
run with a textual run (MedGIFT mix 0.5BySim EN), the MAP decreased significantly but slightly more relevant cases could be found.

Table 2. Results of the runs for the case–based retrieval topics

Run                         Run type  MAP     Bpref   P10   P30     num rel ret
ceb-cases-essie2-automatic  Textual   0.3355  0.2766  0.34  0.2267  74
sinai TA cbt                Textual   0.2626  0.2264  0.34  0.2267  89
aueb ipl                    Textual   0.1912  0.1252  0.24  0.1867  93
HES-SO-VS txt case          Textual   0.1906  0.1531  0.32  0.2     71
MedGIFT mix 0.5BySim EN     Mixed     0.0655  0.0488  0.14  0.0867  74
MedGIFT vis maxBySim AR     Visual    0.021   0.029   0.04  0.0533  41
MedGIFT vis sumBySim AR     Visual    0.019   0.026   0.06  0.0533  42
MedGIFT vis maxByFreq AR    Visual    0.0025  0.0035  0     0.0067  26
MedGIFT vis sumByFreq AR    Visual    0.0025  0.0035  0     0.0067  26

3.2 Medical Image Annotation
In the medical image annotation task, 6 groups submitted a total of 18 runs. Three of these runs were submitted by the MedGIFT group. Two runs used the same strategy as in the past 2 years:
– using GIFT to find a list of similar images;
– reordering the list by integrating the aspect ratio;
– using 5 nearest neighbors (5NN) to perform the classification for each axis by voting using descending weights.
Details can be found in the papers of ImageCLEF 2007 [12] and 2008 [13]. One run was submitted to test a SIFT–SVM approach. The standard Gaussian kernel was used for the SVMs. No optimizations of the SVMs were tried. As the results of the SIFT–SVM approach were not optimal, we used this run in combination with one of our standard runs for the submission. In both cases, the N most similar images were retrieved for each test image and then used for the classification. The results are shown in Table 3. Best results were obtained using GIFT–5NN as in the past years. Using a combination with SIFT–SVM gave worse results.

Table 3. Results of the runs submitted to the medical image annotation task

Run ID                           2005   2006   2007    2008    SUM
best system (TAU Biomed)         356    263    64.3    169.5   852.8
second best system (IDIAP)       393    260    67.23   178.93  899.16
GE GIFT8 AR0.2 vdca5 th0.5.run   618    507    190.73  317.53  1633.26
GE GIFT16 AR0.1 vdca5 th0.5.run  641    527    210.93  380.41  1759.34
GE GIFT8 SIFT commun.run         791.5  612.5  272.69  420.91  2097.6
Two groups (Biomed and IDIAP) submitted runs significantly outperforming all other techniques. Very similar techniques were used, as Biomed was inspired by IDIAP [14]. Their system uses the following approach:
– extract local features from a sub–set of images using random points;
– use k–means clustering to create a dictionary of visual words;
– sample each image with a denser grid and represent each image as a histogram of the visual words;
– train a classifier using SVMs with a χ2 kernel.
This approach has proven to obtain the best results for the past three years.
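The bag-of-visual-words pipeline listed above could be sketched as follows. This is a hedged illustration, not the TAU Biomed or IDIAP code: the local feature extractor is left as a placeholder, and the vocabulary size and the scikit-learn-based SVM setup are assumptions made for the example.

```python
# Hedged sketch of a bag-of-visual-words pipeline with a chi-squared SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def build_vocabulary(descriptor_sets, n_words=500):
    """k-means clustering of local descriptors into a dictionary of visual words."""
    all_descriptors = np.vstack(descriptor_sets)   # one row per local descriptor
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    """Represent one image as a normalized histogram of visual-word occurrences."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()

def train_chi2_svm(histograms, labels):
    """SVM on the visual-word histograms with a chi-squared kernel."""
    X = np.asarray(histograms)
    return SVC(kernel=chi2_kernel).fit(X, labels)  # callable kernel on histograms
```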
4 Conclusions
This paper summarizes the participation of the MedGIFT group in ImageCLEF 2009. The medical image retrieval and medical image annotation tasks were addressed. A preliminary analysis of our results for the medical retrieval task shows that visual retrieval is able to improve early precision. The overall performance (measured by MAP) of mixed–media runs relied highly on the performance of the textual run. Textual/visual run fusion strategies require further study, as currently the MAP of mixed runs is often lower than that of the corresponding textual run. Additional analyses were carried out to better understand the obtained results. Query performance of a topic is not directly related to the similarity among the images of the topic. There is still a big performance gap between textual and visual retrieval. Keywords are naturally linked to semantic topics, and thus for semantic topics text–based approaches perform much better, although even for the visual topics the text retrieval results are better. Using SVMs together with local features based on salient points was shown to obtain reasonable results but requires further optimization, as our results were far from those of the groups obtaining the best results.
Acknowledgments This study was partially supported by the Swiss National Science Foundation (Grant 200020–118638/1), the HES–SO (BeMeVIS), and the European Union in the 6th Framework Program through KnowARC (Grant IST 032691).
References 1. Clough, P., Müller, H., Deselaers, T., Grubinger, M., Lehmann, T.M., Jensen, J., Hersh, W.: The CLEF 2005 cross–language image retrieval track. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, pp. 535–557. Springer, Heidelberg (2006)
2. Clough, P., Müller, H., Sanderson, M.: The CLEF cross–language image retrieval track (ImageCLEF) 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 597–613. Springer, Heidelberg (2005) 3. Squire, D.M., Müller, W., Müller, H., Pun, T.: Content–based query of image databases: inspirations from text retrieval. In: Ersboll, B.K., Johansen, P. (eds.) Pattern Recognition Letters (Selected Papers from The 11th Scandinavian Conference on Image Analysis SCIA 1999), vol. 21(13-14), pp. 1193–1198 (2000) 4. Müller, H., Kalpathy-Cramer, J., Eggers, I., Bedrick, S., Said, R., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the 2009 medical image retrieval task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010) 5. Maisonnasse, L., Harrathi, F.: Analysis combination and pseudo relevance feedback in conceptual language model. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010) 6. Lana-Serrano, S., Villena-Román, J., González-Cristóbal, J.C.: MIRACLE at ImageCLEFmed 2009: Reevaluating strategies for automatic topic expansion. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 7. Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A., García-Cumbreras, M.A.: SINAI at ImageCLEF 2009 medical task. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 8. Simpson, M., Rahman, M.M., Demner-Fushman, D., Antani, S., Thoma, G.R.: Text– and content–based approaches to image retrieval for the ImageCLEF 2009 medical retrieval track. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 9. Boutsis, I., Kalamboukis, T.: Combined content–based and semantic image retrieval. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 10. Ye, Z., Huang, X., Lin, H.: Towards a better performance for medical image retrieval using an integrated approach. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 11. Berber, T., Alpkoçak, A.: DEU at ImageCLEFmed 2009: Evaluating re-ranking and integrated retrieval model. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 12. Zhou, X., Depeursinge, A., Müller, H.: Hierarchical classification using a frequency–based weighting and simple visual features. Pattern Recognition Letters 29(15), 2011–2017 (2008) 13. Zhou, X., Gobeill, J., Müller, H.: The MedGIFT group at ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 712–718. Springer, Heidelberg (2009) 14. Avni, U., Goldberger, J., Greenspan, H.: TAU MIPLAB at ImageClef 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
An Extended Vector Space Model for Content-Based Image Retrieval Tolga Berber and Adil Alpkocak Dokuz Eylul University, Dept. of Computer Engineering, Tinaztepe Buca, 35160, Izmir, Turkey {tberber,alpkocak}@cs.deu.edu.tr
Abstract. This paper describes the participation of Dokuz Eylul University in the ImageCLEF 2009 medical retrieval task. This year, we proposed a new model for content-based image retrieval combining both textual and visual information in the same space. It simply extends the traditional vector space model of text retrieval with visual terms. The proposed model also helps to narrow the semantic gap problem of content-based image retrieval. Experiments showed that our proposed system improves the performance of textual retrieval methods by adding visual terms. The proposed method was evaluated on the ImageCLEFmed 2009 dataset and it achieved the best performance among the participants in automatic mixed retrieval including both text and visual features. Keywords: Content-based Image Retrieval, Vector Space Model, Semantic Gap, Visual Terms.
1 Introduction
Content-Based Image Retrieval (CBIR) systems aim to use image content to search and retrieve images from image collections. Instead of precise pixel-to-pixel matching, CBIR systems use low-level image features such as color distribution, texture, shape, etc. to define image content. Because low-level features are insufficient to define the semantics of an image, the result sets generated by CBIR systems may not satisfy the user's information need. This problem is called the semantic gap problem. The model we present in this paper is an integrated retrieval system which extends the well-known textual information retrieval technique with visual terms. The proposed model aims to narrow the semantic gap by helping to map low-level features to high-level textual semantic concepts. Moreover, this combination of the textual and visual modalities into one model also makes it possible to query a textual database with visual content or a visual database with textual content. Consequently, images can also be defined with semantic concepts instead of low-level features. The rest of this paper is structured as follows. In Section 2, our proposal is described. Section 3 contains the experimental results on the ImageCLEFmed 2009 dataset. Finally, Section 4 concludes the paper and gives a look at possible future work on this subject.
2 The Integrated Retrieval Model
The proposed model is an extension of the classical vector space model (VSM) that focuses on integrating both textual and visual features into one model, so it is called the Integrated Retrieval Model (IRM). We also aim to narrow the semantic gap of content-based image retrieval by mapping low-level visual features to the classical VSM, where a document is represented as a vector of terms. Hence, a document repository D becomes a sparse matrix whose rows are document vectors and whose columns are term vectors, as follows:
D = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{bmatrix} \quad (1)

where w_{i,j} is the weight of term j in document i, and n and m are the term and document counts, respectively. The literature proposes a large number of weighting schemes [1][2][3]. We used the pivoted unique term weighting scheme proposed in [4] and [5]. The system modifies the D matrix (Eq. 1) by adding visual terms representing the visual content. Formally, the document-term matrix becomes:
D = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} & i_{1,n+1} & i_{1,n+2} & \cdots & i_{1,n+k} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} & i_{m,n+1} & i_{m,n+2} & \cdots & i_{m,n+k} \end{bmatrix} \quad (2)
where i_{i,j} is the weight of visual term j in document i, and k is the number of visual terms. Visual and textual features are normalized independently. In sum, IRM extends the traditional text-based VSM with visual features. Initially, we used two simple visual terms representing the color information of the image. The first visual term counts the number of gray pixels and thus represents the proportion of the image that is grayscale; the second visual term is its complement, i.e., the proportion of color pixels in the image.
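As an illustration only, the two visual terms could be computed as in the sketch below; the use of Pillow/NumPy and the tolerance used to decide that a pixel is "gray" are assumptions, not the authors' implementation.

```python
# Hedged sketch: fraction of (near-)gray pixels and its complement as the two
# visual terms. Library choice and the gray tolerance are assumptions.
import numpy as np
from PIL import Image

def gray_ratio_terms(path, tolerance=8):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=int)
    spread = rgb.max(axis=2) - rgb.min(axis=2)   # 0 for perfectly gray pixels
    gray_fraction = float((spread <= tolerance).mean())
    return gray_fraction, 1.0 - gray_fraction    # (grayscale term, color term)
```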
3 Experimentations
In order to evaluate the proposed method, we conducted five runs with the ImageCLEFmed 2009 dataset [6]; however, we present only two of them here. We preprocessed all 74902 documents, each consisting of the combination of title and captions. First, all documents were converted into lowercase. All numbers and some punctuation characters like the dash (-) and apostrophe (') were removed. However, some of the non-letter characters like the comma (,) and slash (/) were replaced with a space. This is because the dash character conveys an important role, as in X-Ray and T3-MR. Then, we chose the words surrounded by spaces as index terms. For each image in the data set, we added the two visual terms described in the previous section. Altogether, the total number of indexing terms became 33615. After the preprocessing phase, we implemented text-only retrieval on the dataset. Here, we normalized the text term weights as shown in [2], and we simply calculated the dot
product of the query and document vectors as the similarity function. Then, the top 1000 documents with the highest similarity scores were selected as the result set for each query. The first row of Table 1, whose run identifier is deu_baseline, shows the results we obtained from this experiment. This run was ranked in 16th position, since we used a very simple retrieval method without any enhancement.

Table 1. Results of experimentations with the ImageCLEFmed 2009 dataset

Run Identifier  NumRel  RelRet  MAP    P@5    P@10   P@30   P@100  Rank  Run Type
deu_baseline    2362    1742    0.339  0.584  0.520  0.448  0.303  16    Text
deu_IRM         2362    1754    0.368  0.632  0.544  0.483  0.324  1     Mixed
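The ranking step described before Table 1 amounts to a dot product between the query vector and each document vector over the textual terms plus the visual terms; the minimal sketch below assumes dense NumPy vectors, which is a simplification of an actual sparse index.

```python
# Hedged sketch of dot-product ranking over the extended (textual + visual)
# document vectors; dense matrices are an assumption made for brevity.
import numpy as np

def rank_documents(doc_matrix, query_vector, top_k=1000):
    """doc_matrix: (m documents x n+k terms); query_vector: (n+k,)."""
    scores = doc_matrix @ query_vector           # dot-product similarity
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```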
The second experiment concerns the integrated retrieval method, which adds the two visual terms to the previous experiment. The results are shown in the second row of Table 1. These results show that IRM performed better than our baseline retrieval in all measures. Furthermore, this performance gain was obtained by using a simple visual feature. Figure 1 illustrates the precision and recall values of our experiments. IRM outperformed the classical vector space model with respect to recall at all precision levels. These results show that combining textual retrieval techniques with good visual features positively affects the results and improves the system performance.
Fig. 1. Precision-Recall graph of baseline and IRM runs
4 Discussion and Future Work
In this paper, we proposed a new content-based image retrieval model combining both the visual and textual features of a document in the same model. This model may thus help to bridge the semantic gap problem of content-based image retrieval systems. We evaluated the proposed approach on the ImageCLEFmed 2009 dataset and our method
achieved the best performance among the participants in the mixed automatic run track. Experimental results showed that the proposed method performs better than any other automatic mixed retrieval approach even when a simple visual feature is used. The integrated retrieval model is a starting point, and the ultimate goal of the system is to close the semantic gap in visual information retrieval systems. It is promising that the usage of a simple visual term gives better results than most of the textual models. It is also expected that the addition of new visual features to the model will improve system performance.
Acknowledgements This work is supported by Turkish National Science Foundation (TÜBİTAK) under project number 107E217.
References 1. Amati, G.: Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, Department of Computing Science, University of Glasgow (2003) 2. Hancock-Beaulieu, M., Gatford, M., Huang, X., Robertson, S.E., Walker, S., Williams, P.W.: Okapi at TREC-5. In: Text REtrieval Conference (TREC) TREC-5 Proceedings (1996) 3. Zheng, Z., Metzler, D., Zhang, R., Yi, C., Nie, J.: Search result re-ranking by feedback control adjustment for time-sensitive query. In: Proceedings of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (2009) 4. Singhal, A., Buckley, C., Mitra, M.: Pivoted document length normalization. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21–29. ACM, New York (1996) 5. Chisholm, E., Kolda, T.G.: New term weighting formulas for the vector space method in information retrieval. Technical report (1999) 6. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the ImageCLEF 2009 Medical Image Retrieval Track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
Using Media Fusion and Domain Dimensions to Improve Precision in Medical Image Retrieval Saïd Radhouani, Jayashree Kalpathy-Cramer, Steven Bedrick, Brian Bakke, and William Hersh Department of Medical Informatics & Clinical Epidemiology Oregon Health and Science University (OHSU) Portland, OR, USA radhouan@ohsu.edu
Abstract. In this paper, we focus on improving retrieval performance, especially early precision, in the task of solving medical multimodal queries. The queries we deal with consist of a visual component, given as a set of image-examples, and textual annotation, provided as a set of words. The queries’ semantic content can be classified along three domain dimensions: anatomy, pathology, and modality. To solve these queries, we interpret their semantic content using both textual and visual data. Medical images often are accompanied by textual annotations, which in turn typically include explicit mention of their image’s anatomy or pathology. Annotations rarely include explicit mention of image modality, however. To address this, we use an image’s visual features to identify its modality. Our system thereby performs image retrieval by combining purely visual information about an image with information derived from its textual annotations. In order to experimentally evaluate our approach, we performed a set of experiments using the 2009 ImageCLEFmed collection using our integrated system as well as a purely textual retrieval system. Our integrated approach consistently outperformed our text-only system by 43% in MAP and by 71% in precision within the top 5 retrieved documents. We conclude that this improved performance is due to our method of combining visual and textual features. Keywords: Medical Image Retrieval, Performance Evaluation, Image Classification, Image Modality Extraction, Domain Dimensions, Media Fusion.
1 Introduction
Advances in digital imaging technologies and the increasing prevalence of Picture Archival and Communication Systems (PACS) have led to a substantial growth in the number of digital images stored in hospitals and medical systems in recent years. Medical images can form an essential component of a patient's health record, and the ability to retrieve them is useful to a variety of important clinical tasks, including diagnosis, education, and research.
Image retrieval systems (IRS) do not currently perform as well as their text counterparts [1]. Medical and other IRS have historically relied on indexing annotations or captions associated with the images. The last decade, however, has seen advancements in the area of content-based image retrieval (CBIR) [2,3]. Although CBIR systems have demonstrated success in fairly constrained medical domains (e.g., dermatology, chest x-rays, lung CT images, etc.), they have demonstrated poor performance when applied to databases with a wide spectrum of imaging modalities, anatomies, and pathologies [1,4,5,6]. In this paper, we address the problem of solving medical multimodal queries. We do this by focusing on improving early precision (i.e., precision at top 5 documents), as early precision is believed to be important to many users of search systems [7]. The queries we deal with contain visual data, given as a set of image-examples, as well as textual data in the form of a set of words that can be categorized along three dimensions: anatomy, pathology, and modality. In previous work, we have shown that it is possible to automatically identify the modality of a medical image using purely visual means [8]. However, it can be challenging to identify the anatomy or the pathology (e.g., slight fracture of a femur) relying solely on visual analysis of an image, particularly in unconstrained settings. On the other hand, it is often relatively straightforward to identify an image's anatomy and pathology using textual analysis of the image's accompanying annotations, but these same text-based approaches often fail to correctly identify an image's modality. This is due to the fact that while image annotations frequently include explicit mention of the image's anatomical or pathological features, the annotations often leave the modality unsaid. To overcome these problems and therefore improve retrieval performance, especially early precision, we propose to merge the results of textual and visual search techniques [9,8,10]. In this paper, we first present a brief description of our system (Section 2). We next describe the evaluation experiments we performed using our visual-based search technique and text-based search technique (Section 3). We then discuss the obtained results in Section 4. Finally, we conclude this paper and provide some perspectives (Section 5).
2 Adaptive Medical Image Retrieval System
Starting in 2007, we created and have continued to develop a multimodal image retrieval system based on an open-source framework that allows the incorporation of user search preferences. We designed a flexible database schema that allows us to easily incorporate new collections while facilitating retrieval using both text and visual techniques. We used the Ruby programming language, with the open source Ruby on Rails (http://www.rubyonrails.org/) web application framework. To manage the mappings between the images
and their associated metadata, we used the PostgreSQL (http://www.postgresql.org/) relational database system. Our system also indexes the title and caption fields for each image. Our system features a customizable query parser which allows the user to choose between several options to improve the precision of the search. In addition to stop word removal and stemming, our parser can identify a query's desired search modality, if specified. The results can then be filtered to only return images whose purported modality matches that of the query. The system is also linked to the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus; the user may choose to perform manual or automatic query expansion using synonyms generated from the Metathesaurus. Medical images often begin life with rich metadata in the form of DICOM headers describing their imaging modality or anatomy. However, since most teaching or on-line image collections are made up of compressed standalone JPEG files, it is very common for medical images to exist "in the wild" sans metadata. In previous work [8], we described a modality classifier that could identify the imaging modality for medical images using supervised machine learning. We extended that work to the new dataset used for ImageCLEF 2009. One of the biggest challenges in creating such a modality classifier is creating a labeled training dataset of sufficient size and quality. In 2009, as in 2008, we used a regular-expression based text parser, written in Ruby, to extract the modality from the image captions for all images in the collection, when available. Images where a unique modality could be identified based on the caption were used for training the modality classifier. Our text processing module is very simple. After applying a stop-word list, we index documents and queries using the Vector Space Model (VSM). During the retrieval process, a list of relevant documents can be ranked with regard to their relevance to the corresponding query. While no specific treatment was applied to documents, we theorized that it would be useful to use external knowledge to interpret the queries' semantic content. Indeed, each query contains a precise description of a user need materialized by a set of words belonging to three semantic categories: modality of the image (e.g., MRI, x-ray, etc.), anatomy (e.g., leg, head, etc.), and pathology (e.g., cancer, fracture, etc.). We call these categories "domain dimensions" and define them as follows: "A dimension of a domain is a concept used to express the themes in this domain" [11]. The idea behind our approach is that, in a given domain, a theme can be developed with reference to a set of dimensions of this domain. For instance, a physician wishing to write a report about a medical image first focuses on a domain ("medicine"). Next, they refer to specific dimensions of this domain (e.g., "anatomy"), then choose words from this dimension (e.g., "femur"), and finally write their report. In order to solve multimodal medical queries, we proposed using the domain dimensions to interpret their semantic content. To do so, we first needed to define the dimensions. For this purpose, we used external resources, such as ontologies or thesauri, to define our dimensions as hierarchies of concepts. Every
concept is denoted by a set of words. Our system was then able to use these dimensions to extract query semantics by attempting to identify, for each query, what dimensions may or may not be present. Then, our system uses boolean operators to combine the extracted query dimensions and thereby constrain the raw search results. Our query process contains two main steps. The first step consists of using the initial user-supplied query text to search for documents based on the VSM. The result of this step is a list of ranked documents D. The second step consists of selecting, from D, those documents that satisfy the Boolean expression formulated based on the domain dimensions.
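A minimal sketch of this two-step process is given below; the representation of the dimension vocabularies and the helper names are assumptions made for illustration and are not the authors' implementation.

```python
# Hedged sketch: VSM ranking followed by a Boolean filter over the domain
# dimensions extracted from the query. Data structures are assumptions.
def dimension_filter(ranked_docs, query_dimensions, doc_terms):
    """Keep only documents whose indexed words cover every dimension in the query.

    ranked_docs: list of (doc_id, score) from the VSM step, best first.
    query_dimensions: e.g. {"anatomy": {"femur"}, "pathology": {"fracture"}}.
    doc_terms: dict doc_id -> set of indexed words.
    """
    filtered = []
    for doc_id, score in ranked_docs:
        terms = doc_terms[doc_id]
        if all(terms & words for words in query_dimensions.values() if words):
            filtered.append((doc_id, score))
    return filtered
```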
3 Experimental Evaluation
In order to experimentally evaluate our approach, we performed a set of experiments using the 2009 ImageCLEF collection, which consists of 74,902 medical images together with textual annotations for each one [12]. The following sections describe these experiments and their results. In total, 9 runs were performed; all of them are automatic; two of them were based entirely on textual data and the remaining eight are based on both textual and visual data. Our baseline run, labeled "no mod," is based on the VSM, where each document/query is represented by a vector of words. The result of this run will be compared to those obtained by the other runs, which are based on domain dimensions and/or a combination of textual and visual data.
3.1 Modality Extraction-Based Experiences
We performed two mixed runs that used the automatically extracted modality to filter results. The custom query parser first extracted the desired modality from the query, if it existed. The “mod1” run used the custom parser to remove stop-words from the query and limit the results to the desired modality. This run was expected to have high precision but potentially lower recall as it did not
Fig. 1. Improvement in early precision with the use of modality classification (precision at P5, P15, P30, P100 and P200 for the baseline run and the modality filtered run)
use any term expansion. Also, if the modality classifier was not accurate or the modality extraction from the textual query was too strict, the results could be limited. In order to try to increase the recall, we also performed a run, labeled "umls," where term expansion based on the UMLS Metathesaurus was used. As can be seen in Figure 1, there is an improvement, especially in early precision, with the use of modality filtration.
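The modality extraction discussed above was done with a regular-expression parser written in Ruby; the Python sketch below only illustrates the idea, and its pattern list and modality names are assumptions rather than the original rules.

```python
# Hedged sketch of regular-expression modality extraction from captions or
# query text; the patterns below are illustrative, not the original Ruby rules.
import re

MODALITY_PATTERNS = {
    "ct":         r"\b(ct|computed tomograph\w*)\b",
    "mri":        r"\b(mri?|magnetic resonance)\b",
    "x-ray":      r"\b(x[- ]?ray|radiograph\w*)\b",
    "ultrasound": r"\b(ultrasound|sonograph\w*)\b",
}

def extract_modality(text):
    """Return the single modality mentioned in `text`, or None if absent/ambiguous."""
    found = {name for name, pattern in MODALITY_PATTERNS.items()
             if re.search(pattern, text.lower())}
    return found.pop() if len(found) == 1 else None
```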
3.2 Dimensions-Based Experiences
To define the domain dimensions, we utilized the UMLS Metathesaurus and its database of semantic types. The types we ultimately included were as follows:
– Anatomy: "Body Part, Organ, or Organ Component," "Body Space or Junction," "Body Location or Region," and "Cell"
– Pathology: "Sign or Symptom," "Finding," "Pathologic Function," "Injury or Poisoning," "Disease or Syndrome," "Neoplastic Process," "Neoplasms," "Anatomical Abnormality," "Congenital Abnormality," and "Acquired Abnormality"
– Modality: "Manufactured Object" and "Diagnostic Procedure"
After automatically extracting the dimensions from each query, we used them to perform the following runs:
R1: We used the dimensions to rank the retrieved documents for a given query into three tiers. Documents that contained the three dimensions belonging to the query were considered to be most relevant, and were therefore highly ranked and placed in the first result tier. These were followed by those documents that were only missing the modality. Finally, we placed documents that contain at least one of the query dimensions in the third tier. Since the modality was not always explicitly described in the text, we used the visual data to extract it from each image. Thereafter, we used it to re-rank the document list obtained using the textual data. For example, documents whose images had an extracted modality were ranked at the top of the results for queries that specified a modality. In all the following runs, we applied this technique using the result of the "mod1" run. For simplicity of writing, we call this process "checking modality."
R2: Re-rank the document list obtained during run R1 by "checking modality."
R3: Re-rank the document list obtained during run "no mod" by "checking modality."
R4: Select, for each query, only documents that contain the Anatomy and the Pathology dimensions relevant to the query. The obtained results are re-ranked by "checking modality."
R5: From the result of the "R2" run, we randomly selected one image from each article. In CLEF documents, a textual article might contain more than one image (as in, e.g., the case of multiple figures being in the same published paper). In this run, if an article is retrieved by the textual approach, we randomly select one of its images (instead of keeping all images as we had done for the other runs).
R6: We selected from the result of the "R1" run only those documents for which a modality had been extracted during the "mod1" run.
4 Results
For each query, the results are measured by the mean average precision (MAP) of the top 1000 documents, precision at 10 documents (p@10), and precision at 5 documents (p@5). The results given by the baseline run are 0.1223 (MAP), 0.416 (p@10), and 0.38 (p@5). The obtained results are presented in Table 1, where rows correspond to the runs and values correspond to their results. We also include data from the best runs (based on MAP and p@5) in the ImageCLEF 2009 campaign.

Table 1. Results of our experimental evaluation

Run Name     MAP     P@5    P@10
umls         0.1753  0.712  0.664
mod1         0.1698  0.592  0.552
R1           0.1756  0.592  0.536
R2           0.1582  0.624  0.54
R3           0.1511  0.608  0.524
R4           0.1147  0.6    0.484
R5           0.1133  0.584  0.516
R6           0.1646  0.68   0.612
no mod       0.1223  0.416  0.38
Best in p@5  0.3775  0.744  0.716
Best in MAP  0.4293  0.696  0.664
As described above, our system has been designed to maximize precision, often at the expense of recall. Since we do not use any advanced natural language processing, and we filter images based on purported modality, we were expecting a relatively low recall. Consequently, as the MAP is highly dependent on and limited by recall, we believe that it makes more sense to compare our results to those obtained by the other ImageCLEF 2009 participants in terms of our early precision levels (p@5 or p@10). We have generally high early precision levels, in particular in our "umls" run (p@5 = 0.712). This was the second best p@5 result obtained during the ImageCLEF 2009 campaign; the absolute best was 0.744 and was obtained from an interactive run. Therefore, our "umls" run had the highest fully-automatic p@5 in ImageCLEF 2009. Independent of the other ImageCLEF 2009 participants, most of our runs outperform our baseline, reaching an improvement of 43% in MAP (R1) and 71% in p@5 (umls). From the result of the "R1" run, we notice that the use of domain dimensions is of great interest in solving medical multimodal queries. Indeed, by using domain dimensions, we highlight the "relevant words" that describe the queries' semantic content. Using these words, the system can retrieve only
documents that contain the anatomy, the modality, and the pathology described in the query text. We notice that at p@5 and p@10, all our mixed runs outperform our baseline. This is not surprising, because the first ranked documents are those that have been retrieved both by the text-based search technique and the visual-based search technique. This supports conclusions drawn from our previous work, which found that the retrieval performance can be improved demonstrably by merging the results of textual and visual search techniques [9,8,10]. Two of our runs displayed decreased MAP when compared with our baseline. The first of these was the "R4" run, wherein the modality dimension was ignored during the querying process. This decrease might be explained by the fact that the modality is described in some documents, and its use is thought to be beneficial; for example, consider our "R1" run, wherein we used the three dimensions and obtained our highest performance. The second lower-than-baseline run was our "R5" run, in which only one image was randomly selected from each article. An increase in the performance would have been surprising, since this technique is not accurate at all, and there is a high risk that the selected image is irrelevant. Indeed, it is better to keep all images of each article; thus, if one of them is relevant to the corresponding query, it will be retrieved.
5 Conclusions
In order to improve early precision in the task of solving medical multimodal queries, we combined a text-based search technique with a visual-based one. The first technique consisted of using domain dimensions to highlight relevant words that describe the queries' semantic content. While anatomy and pathology are relatively easy to identify from textual documents, it can be extremely difficult, if not impossible, to identify an image's modality using solely textual features. We therefore experienced the best results when we combined our textual techniques with a visual-based search technique that automatically extracts modality information from images visually. The obtained results in terms of precision at p@5 and p@10 are very encouraging and outperform our baseline. Among the ImageCLEF 2009 participants, we obtained the second best overall and best automatic result in terms of precision at p@5. However, in terms of MAP, even though our results outperform our baseline, they are significantly below the best performance obtained in ImageCLEF 2009. This was expected, since this year our sole focus was on improving early precision, and we did not use any sophisticated natural language processing to improve our system's recall. Our future work will attempt to address this issue. We believe that our current text-based search technique has significant room for improvement. We plan to use further textual processing, such as term expansion and pseudo-relevance feedback, in order to improve our recall, and hope to be able to compare our results to the best ImageCLEF 2009 performance.
Acknowledgements We acknowledge the support of NLM Training Grant 2T15LM007088, NLM Grant 1K99LM009889-01A1, NSF Grant ITR-0325160, and the Swiss National Science Foundation grant PBGE22-121204.
References 1. Hersh, W.R., Müller, H., Jensen, J.R., Yang, J., Gorman, P.N., Ruch, P.: Advancing biomedical image retrieval: Development and analysis of a test collection. J. Am. Med. Inform. Assoc., M2082 (June 2006) 2. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000) 3. Tagare, H.D., Jaffe, C.C., Duncan, J.: Medical image databases: A content-based retrieval approach. J. Am. Med. Inform. Assoc. 4(3), 184–198 (1997) 4. Aisen, A.M., Broderick, L.S., Winer-Muram, H., Brodley, C.E., Kak, A.C., Pavlopoulou, C., Dy, J., Shyu, C.R., Marchiori, A.: Automated storage and retrieval of thin-section CT images to assist diagnosis: System description and preliminary assessment. Radiology 228(1), 265–270 (2003) 5. Schmid-Saugeon, P., Guillod, J., Thiran, J.P.: Towards a computer-aided diagnosis system for pigmented skin lesions. Computerized Medical Imaging and Graphics: The Official Journal of the Computerized Medical Imaging Society 27(1), 65–78 (2003) 6. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. International Journal of Medical Informatics 73(1), 1–23 (2004) 7. Jansen, B., Spink, A.: How are we searching the world wide web? A comparison of nine search engines. Information Processing and Management 42(1), 248–263 (2006) 8. Kalpathy-Cramer, J., Hersh, W.: Automatic image modality based classification and annotation to improve medical image retrieval. Studies in Health Technology and Informatics 129(Pt 2), 1334–1338 (2007) 9. Hersh, W., Kalpathy-Cramer, J., Jensen, J.: Medical image retrieval and automated annotation: OHSU at ImageCLEF 2006. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 660–669. Springer, Heidelberg (2007) 10. Radhouani, S., Lim, J.H., Chevallet, J.P., Falquet, G.: Combining textual and visual ontologies to solve medical multimodal queries. In: IEEE International Conference on Multimedia and Expo, pp. 1853–1856 (2006) 11. Radhouani, S.: Un modèle de recherche d'information orienté précision fondé sur les dimensions de domaine. PhD thesis, University of Geneva, Switzerland, and University of Grenoble, France (2008) 12. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E.: Overview of the medical retrieval task at ImageCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
ImageCLEF 2009 Medical Image Annotation Task: PCTs for Hierarchical Multi-Label Classification Ivica Dimitrovski, Dragi Kocev, Suzana Loskovska, and Sašo Džeroski
Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia; Department of Computer Science, Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia {ivicad,suze}@feit.ukim.edu.mk, {dragi.kocev,saso.dzeroski}@ijs.si
Abstract. In this paper, we describe an approach to the automatic medical image annotation task of the 2009 CLEF cross-language image retrieval campaign (ImageCLEF). This work focuses on the process of feature extraction from radiological images and their hierarchical multi-label classification. To extract features from the images we use two different techniques: edge histogram descriptor (EHD) and Scale Invariant Feature Transform (SIFT) histogram. To annotate the images, we use predictive clustering trees (PCTs) which are able to handle target concepts that are organized in a hierarchy, i.e., perform hierarchical multi-label classification. Furthermore, we construct ensembles (Bagging and Random Forests) that use PCTs as base classifiers: this improves the predictive/classification performance.
1 Introduction
The amount of medical images produced is constantly growing. Manual description and annotation of each image is time consuming, expensive and impractical. This calls for the development of image annotation algorithms that can perform the task reliably. Automatic annotation classifies an image into one of a set of classes. If the classes are organized in a hierarchy and several of them can be assigned to an image, we are talking about hierarchical multi-label classification (HMLC). This paper describes our approach to the medical image annotation task of ImageCLEF 2009 (for details see [1]). The objective of this task is to provide the IRMA (Image Retrieval in Medical Applications) code [2] for each image of a given set of previously unseen medical (radiological) images. The IRMA coding system consists of four axes: technical axis (T, image modality), directional axis (D, body orientation), anatomical axis (A, body region examined) and biological axis (B, biological system examined). The database of medical images contains 12677 fully annotated radiographs (training dataset for the classifier) and 1733 testing images without labels. The annotation should be performed by using the four different annotation label sets (the competitions from 2005-2008) in turn. The code is strictly hierarchical because each sub-code element is connected to only one code element. This characteristic of the IRMA code allows us to exploit the
code hierarchy and construct an automatic annotation system based on predictive clustering trees for hierarchical multi-label classification [3]. This approach is directly applicable to the datasets of ImageCLEF2007 and ImageCLEF2008, where the images were labeled according to the IRMA code scheme. To apply the same algorithm to the ImageCLEF2005 and ImageCLEF2006 datasets, we mapped the class numbers to the corresponding IRMA codes. Some images from the ImageCLEF2005 dataset can belong to more than one IRMA code. In the classification process, we use the most general IRMA code (that contains 0) to describe these images. Automatic image classification/annotation relies on numerical features that are computed from the image pixel values. In our approach, we use an edge histogram descriptor (to extract the global features of the images) and a SIFT histogram (to extract the local features from the images). We combine the feature vectors (histograms) with simple concatenation in a single vector with 2080 features. The purpose of the concatenation of the global and the local features is to tackle the problem of intra-class variability vs. inter-class similarity and the different distribution of images between the training and the testing dataset (the testing dataset contains many images of some classes that are under-represented in the training set). Tomassi et al. [4] show that high and mid level combination of the different feature extraction techniques yields better results when SVMs are used as classifiers. In our work, we use ensembles of predictive clustering trees [3,5]. The ensembles of trees, such as random forests, can effectively exploit the information provided by the large number of features. Thus, we expect that concatenation of the feature extraction techniques yields better performance than the other combination methods. The remainder of the paper is organized as follows: Section 2 describes the techniques for feature extraction from images. Section 3 introduces predictive clustering trees and their use for HMLC. In Section 4, we explain the experimental setup. Section 5 reports the obtained results. Conclusions and a summary are given in Section 6, where we also discuss some directions for further work.
2 Feature Extraction from Images
This section describes the techniques for feature extraction from images that we use to describe the X-ray images from ImageCLEF 2009. We briefly describe the edge histogram descriptor and the scale invariant feature transform. To learn a classifier and to annotate the images from the testing set, we use the feature vector obtained by simple concatenation of the features obtained from these two techniques.
Edge Histogram Descriptor: Edge detection is a fundamental problem of computer vision and has been widely investigated [6]. The goal of edge detection is to mark the points in a digital image at which the luminous intensity changes sharply. An edge representation of an image drastically reduces the amount of data to be processed, yet it retains important information about the shapes of objects in the scene. Edges in images constitute important features to represent their content. One way of representing important edge features is to use a histogram. An edge histogram in the image space represents the frequency and the directionality of the brightness changes in the image. To represent it, MPEG-7 contains edge histogram
descriptors (EHD). These basically represent the distribution of five types of edges in each local area called a sub-image. The sub-images are defined by dividing the image space into 4×4 non-overlapping blocks. Thus, the image partition always yields 16 equal-sized sub-images, regardless of the size of the original image. To characterize the sub-images, we then generate a histogram of edge distribution for each sub-image. Edges in the sub-images are categorized into five types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges. Thus, the histogram for each sub-image represents the relative frequency of occurrence of the five types of edges in the corresponding sub-image. As a result, each local histogram contains five bins. Each bin corresponds to one of the five edge types. Since there are 16 sub-images in the image, a total of 5×16=80 histogram bins are required. Note that each of the 80 histogram bins has its own semantics in terms of location and edge type. Edge detection is performed using the Canny edge detection algorithm [7].
SIFT histogram: Many different techniques for detecting and describing local image regions have been developed [8]. The Scale Invariant Feature Transform (SIFT) was proposed as a method of extracting and describing key-points which are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint [8]. For content based image retrieval, good response times are required and this is hard to achieve when using the huge amount of data contained in the local feature descriptors. The descriptors using local features can be extremely big because an image may contain many key-points, each described by a 128 dimensional vector. To reduce the descriptor size, we use histograms of local features [9]. With this approach, the amount of data is reduced by estimating the distribution of local feature values for every image. The creation of these histograms is a three step procedure. First, the key-points are extracted from all database images, where a key-point is described with a 128 dimensional vector of numerical values. For the key-point extraction and descriptor calculation, we use the default parameters proposed by Lowe [8]. The key-points are clustered into 2000 clusters using k-means. Afterwards, for each key-point we discard all information except the identifier of the most similar cluster center. A histogram of the occurring patch-cluster identifiers is created for each image. To be independent of the total number of key-points in an image, the histogram bins are normalized to sum to 1. This results in a 2000 dimensional histogram.
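A simplified sketch of such an 80-bin edge histogram is shown below. It is only an illustration: the 2x2 filter masks, the block size and the edge threshold are assumptions and do not reproduce the exact MPEG-7 reference implementation, nor the Canny-based edge detection used by the authors.

```python
# Hedged sketch of an MPEG-7-style 80-bin (4x4 sub-images x 5 edge types)
# edge histogram; masks, block size and threshold are illustrative assumptions.
import numpy as np

EDGE_FILTERS = {                                       # 2x2 masks: vertical,
    0: np.array([[1.0, -1.0], [1.0, -1.0]]),           # horizontal, 45-degree,
    1: np.array([[1.0, 1.0], [-1.0, -1.0]]),           # 135-degree, and
    2: np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),   # non-directional
    3: np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),
    4: np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def edge_histogram(gray, block=8, threshold=11.0):
    """gray: 2-D intensity array; returns a normalized 80-bin (4x4x5) descriptor."""
    h, w = gray.shape
    hist = np.zeros((4, 4, 5))
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            b = gray[i:i + block, j:j + block].astype(float)
            half = block // 2
            means = np.array([[b[:half, :half].mean(), b[:half, half:].mean()],
                              [b[half:, :half].mean(), b[half:, half:].mean()]])
            strengths = {k: abs((f * means).sum()) for k, f in EDGE_FILTERS.items()}
            edge_type, strength = max(strengths.items(), key=lambda kv: kv[1])
            if strength >= threshold:                  # count the dominant edge type
                hist[min(4 * i // h, 3), min(4 * j // w, 3), edge_type] += 1
    total = hist.sum()
    return (hist / total).ravel() if total else hist.ravel()
```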
3 Ensembles of PCTs
In this section, we discuss the approach we use to classify the data at hand. We briefly describe the predictive clustering trees (PCT) framework, its use for HMLC and the learning of ensembles.
PCTs for Hierarchical Multi-Label Classification: In the PCT framework [5], a tree is viewed as a hierarchy of clusters: the top node corresponds to one cluster containing all data, which is recursively partitioned into smaller clusters while moving down the tree. PCTs can be constructed with a standard “top-down induction of
decision trees” (TDIDT) algorithm. The heuristic for selecting the tests is the reduction in variance caused by partitioning the instances. Maximizing the variance reduction maximizes cluster homogeneity and improves predictive performance. A leaf of a PCT is labeled with/predicts the prototype of the set of examples belonging to it. With instantiation of the variance and prototype functions, the PCTs can handle different types of data, e.g., multiple targets [10] or time series [11]. A detailed description of the PCT framework can be found in [5]. To apply PCTs to the task of HMLC, the example labels are represented as vectors with Boolean components. The i-th component of the vector is 1 if the example belongs to class c_i and 0 otherwise (see Fig. 1). The variance of a set of examples S is defined as the average squared distance between each example's label v_i and the mean label \overline{v} of the set, i.e.,

Var(S) = \frac{\sum_i d(v_i, \overline{v})^2}{|S|} \quad (1)
The higher levels of the hierarchy are more important: an error in the upper levels costs more than an error on the lower levels. Considering that, a weighted Euclidean distance is used as the distance measure:

d(v_1, v_2) = \sqrt{\sum_i w(c_i)\,(v_{1,i} - v_{2,i})^2} \quad (2)
where v_{k,i} is the i-th component of the class vector v_k of an instance x_k, and the class weights w(c) decrease with the depth of the class in the hierarchy. In the case of HMLC, the notion of majority class does not apply in a straightforward manner. Each leaf in the tree stores the mean \overline{v} of the vectors of the examples that are sorted into that leaf. The i-th component \overline{v}_i of \overline{v} is the proportion of examples in the leaf that belong to class c_i. An example arriving in the leaf can be predicted to belong to class c_i if \overline{v}_i is above some threshold t_i. The threshold can be chosen by a domain expert. A detailed description of PCTs for HMLC can be found in [3].
Fig. 1. A toy hierarchy. Class label names reflect the position in the hierarchy, e.g., ‘2/1’ is a subclass of ‘2’. The set of classes {1, 2, 2/2} is indicated in bold in the hierarchy and is represented as a vector.
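The weighted distance and variance of Eqs. (1) and (2) can be written down directly, as in the sketch below; the exponential decay of the class weights with depth and its decay factor are assumptions made for illustration (the text above only states that the weights decrease with depth).

```python
# Hedged sketch of the HMLC variance (Eq. 1) and weighted Euclidean distance
# (Eq. 2); the weight decay w(c) = w0 ** depth(c) is an illustrative choice.
import numpy as np

def class_weights(depths, w0=0.75):
    """One weight per class, decreasing with the depth of the class in the hierarchy."""
    return np.array([w0 ** d for d in depths])

def weighted_distance(v1, v2, weights):
    diff = np.asarray(v1, float) - np.asarray(v2, float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

def label_variance(label_vectors, weights):
    """Average squared weighted distance of each 0/1 label vector to the set mean."""
    labels = np.asarray(label_vectors, float)
    mean = labels.mean(axis=0)
    return float(np.mean([weighted_distance(v, mean, weights) ** 2 for v in labels]))
```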
Ensemble Methods: An ensemble classifier is a set of classifiers. Each new example is classified by combining the predictions of every classifier from the ensemble.
These predictions can be combined by taking the average (for regression tasks) or the majority vote (for classification tasks) [12,13], or by taking more complex combinations. We have adopted the PCTs for HMLC as base classifiers. Averaging is applied to combine the predictions of the different trees because the leaf's prototype is the proportion of examples of different classes that belong to it. Just like for the base classifiers, a threshold should be specified to make a prediction. We consider two ensemble learning techniques that have primarily been used in the context of decision trees: bagging and random forests. Bagging [12] constructs the different classifiers by making bootstrap replicates of the training set and using each of these replicates to construct one classifier. Each bootstrap sample is obtained by randomly sampling training instances, with replacement, from the original training set, until a number of instances is obtained equal to the size of the training set. Bagging is applicable to any type of learning algorithm. A random forest [13] is an ensemble of trees, where diversity among the predictors is obtained both by bootstrap sampling and by changing the feature set during learning. More precisely, at each node in the decision tree, a random subset of the input attributes is taken, and the best feature is selected from this subset (instead of the set of all attributes). The number of attributes that are retained is given by a function f of the total number of input attributes x (e.g., f(x) = 1, f(x) = √x, f(x) = ⌊log2 x⌋ + 1, …). By setting f(x) = x, we obtain the bagging procedure. PCTs for HMLC are used as base classifiers.
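The two sources of diversity described above (bootstrap replicates and per-node random feature subsets) and the averaging of the trees' class-vector predictions could look like the sketch below; the PCT induction procedure itself and the prediction interface of the trees are placeholders, not the authors' implementation.

```python
# Hedged sketch of bagging / random-forest ingredients around PCT base
# classifiers; tree construction itself is assumed to exist and is not shown.
import numpy as np

def bootstrap_sample(X, y, rng):
    """Bootstrap replicate of (X, y); arrays are assumed to be NumPy arrays."""
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    return X[idx], y[idx]

def random_feature_subset(n_features, rng):
    size = int(np.log2(n_features)) + 1          # f(x) = floor(log2 x) + 1
    return rng.choice(n_features, size=size, replace=False)

def ensemble_predict(trees, x):
    """Average the class-probability vectors predicted by the individual trees."""
    return np.mean([tree.predict(x) for tree in trees], axis=0)
```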
4 Experimental Design We decided to split the training images into training and development images. To tune the system for the different distributions of images across classes in the training set and the test set, we generated several splits where the distributions of the images differed (in varying ways) between the training and development data. We constructed a classifier for each axis from the IRMA code separately (see Section 1). From each of the datasets, we learn a PCT for HMLC and ensembles of PCTs (Bagging and Random Forests). The ensembles consisted of 100 un-pruned trees. The feature subset size for Random Forests was set to 11 (using the formula f(2080) = ⌊log₂(2080)⌋). To compare the performance of a single tree and an ensemble we use Precision-Recall (PR) curves. These curves are obtained by varying the value of the classification threshold: a given threshold corresponds to a single point on the PR curve. For more information, see [3]. According to these experiments and previous research, the ensembles of PCTs have higher performance as compared to a single PCT when used for hierarchical annotation of medical images [14]. Furthermore, the Bagging and Random Forest methods give similar results. Because the Random Forest method is much faster than the Bagging method, we submitted only the results for the Random Forest method. To select an optimal value of the threshold (t), we performed validation on the different development sets. The threshold values that give the best results were used for the prediction of the unlabelled radiographs according to the four different classification schemes (see Section 1).
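The PR curves used for model selection can be traced with a few lines of code: each threshold t yields one (recall, precision) point computed over all (example, class) pairs. This is a schematic re-implementation under stated assumptions, not the evaluation code of [3].

import numpy as np

def pr_curve(prototypes, targets, thresholds=np.linspace(0.02, 0.98, 49)):
    # prototypes: (n_examples, n_classes) predicted class proportions
    # targets:    (n_examples, n_classes) 0/1 hierarchical labels
    points = []
    for t in thresholds:
        pred = prototypes >= t
        tp = np.logical_and(pred, targets == 1).sum()
        fp = np.logical_and(pred, targets == 0).sum()
        fn = np.logical_and(~pred, targets == 1).sum()
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision, t))
    return points   # the area under this curve approximates the reported AU PRC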
Fig. 2. Example images with same value for axis D, but different values for the axis combining D with the first code from A
To reduce the intra-class variability for axis D and improve the prediction performance, we decided to modify the hierarchy for this axis and include the first code of axis A from the corresponding IRMA code. Fig. 2 presents example images that have the same code for axis D, but are visually very different. After inclusion of the first code from the axis A, these images belong to different classes.
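The modification of the hierarchy for axis D can be expressed as a simple relabelling of the IRMA codes. The sketch below assumes the usual dash-separated layout of the code (TTTT-DDD-AAA-BBB); the function name and the example code are illustrative only.

def refine_axis_d(irma_code):
    # prepend the first character of axis A to the axis-D label, so that visually
    # different images sharing the same D code end up in different classes
    t, d, a, b = irma_code.split("-")
    return f"{a[0]}{d}"

print(refine_axis_d("1121-127-700-500"))   # -> '7127'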
5 Results For the ImageCLEF 2009 medical annotation task, we submitted one run. In this task, our result was third among the participating groups, with a total error score of 1352.56. The results for the particular datasets are presented in Table 1. From the results, we can note the high error for the annotations from ImageCLEF2005 and ImageCLEF2006. Recall that we pre-processed the images and the classes from 2005 and 2006 were mapped to an IRMA code. One class from the ImageCLEF2005 annotation corresponds to multiple labels from the hierarchical annotation of the IRMA code, and we used the most general class. This prevented the classifier from making more specific predictions. The performance for ImageCLEF2008 is worse than the performance for ImageCLEF2007 because ImageCLEF2008 has a bigger hierarchy and more test images. Similar conclusions can be made by analyzing the PR curves shown in Fig. 3. For each of the axes (T, D, A and B) we present three PR curves that correspond to the different annotation schemes. The PR curves for the 2006 and 2007 coding schemes are equal because we simply mapped the class numbers to the corresponding IRMA codes. From the presented values of the AU PRC (Area Under the average Precision-Recall Curve) it can be seen that we obtain the best results for the ImageCLEF2007 dataset. The AU PRC values for the ImageCLEF2005 dataset are very low considering the total number of classes, but this is mainly because we did not apply a one-to-one mapping as for the ImageCLEF2006 dataset.
Table 1. Error score for the medical image annotation task and AU PRC per axis, using random forests of PCTs for HMLC

Annotation label sets   Error score   Number of wildcards (*)   AU PRC / RF
                                                                Axis T   Axis D   Axis A   Axis B
2005                    549           0                         0.9990   0.7712   0.7059   0.9843
2006                    433           0                         0.9998   0.8177   0.7419   0.9948
2007                    128.1         2550                      0.9998   0.8177   0.7419   0.9948
2008                    242.26        2613                      0.9995   0.7488   0.6621   0.9760
The excellent performance for the prediction task for axes T and B is due to the simplicity of the problem: the hierarchies along these axes contain only a few nodes (8 and 19 nodes for ImageCLEF2008, respectively). This means that each node in the hierarchy contains a large portion of the examples, thus learning a good classifier is not a difficult task. The classifiers for the other two axes have satisfactory predictive performance, but here the predictive task is somewhat more difficult (especially for axis A). The sizes of the hierarchies along the A and D axes for ImageCLEF2008 are 202 and 88 nodes, respectively.
Fig. 3. Precision-Recall curves for the random forest predictions of the codes for T, D, A and B axis, respectively, for the four different competition tasks. The PR curves for the axes T and B are close to each other for each year. For the axes D and A, the upper PR curves are for the years 2006/07, the lower ones are for 2008 and the PR curves in the middle are for 2005.
6 Conclusions This paper presents a hierarchical multi-label classification approach to medical image annotation. For efficient image representation, we use edge histogram descriptor and SIFT histograms. The predictive modeling problem that we consider is to learn PCTs and ensembles of PCTs that predict a hierarchical annotation of an X-ray image. Using these approaches, we obtained good predictive performance and ranked third on the ImageCLEF 2009 competition. There are several ways to further improve the predictive performance of the proposed approach. First, one could try to tackle the shift in distribution of images between the training and the testing set. One solution is to develop extensions of the PCT approach that can handle such differences. Another approach is to generate virtual samples of the images that are underrepresented in the training set by rotation,
translation and manipulation of contrast and brightness. Second, better performance may be obtained by post-processing the output from the ensembles and by reducing the dependence on the thresholding: instead of the hard threshold, use the raw probabilities. Third, we could use additional feature extraction techniques and combine them using different combination schemes (other than concatenation). In summary, we presented a general approach to hierarchical image annotation. The approach can be easily extended with new feature extraction methods, and can thus be applied to other domains. It can also handle hierarchies of arbitrary size and structure (bigger hierarchies, hierarchies that are organized as trees or as directed acyclic graphs).
References 1. Tommasi, T., Caputo, B., Welter, P., Guld, M.O., Deserno, T.M.: Overview of the CLEF 2009 medical image annotation track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 2. Lehmann, T.M., Schubert, H., Keysers, D., Kohnen, M., Wein, B.B.: The IRMA code for unique classification of medical images. In: Proc. of SPIE - Medical Imaging 2003, vol. 5033, pp. 440–451 (2003) 3. Vens, C., Struyf, J., Schietgat, L., Dzeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Machine Learning 73(2), 185–214 (2008) 4. Tommasi, T., Orabona, F., Caputo, B.: Discriminative cue integration for medical image annotation. Pattern Recognition Letters 29(15), 1996–2002 (2008) 5. Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: Proc. of the 15th ICML, pp. 55–63 (1998) 6. Ziou, D., Tabbone, S.: Edge Detection Techniques an Overview. International Journal of Pattern Recognition and Image Analysis 8(4), 537–559 (1998) 7. Canny, J.F.: A computational approach to edge detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986) 8. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 9. Deselaers, T., Keysers, D., Ney, H.: Discriminative training for object recognition using image patches. In: CVPR 2005, San Diego, CA, vol. 2, pp. 157–162 (2005) 10. Kocev, D., Vens, C., Struyf, J., Dzeroski, S.: Ensembles of Multi-Objective Decision Trees. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 624–631. Springer, Heidelberg (2007) 11. Dzeroski, S., Gjorgjioski, V., Slavkov, I., Struyf, J.: Analysis of Time Series Data with Predictive Clustering Trees. In: Džeroski, S., Struyf, J. (eds.) KDID 2006. LNCS, vol. 4747, pp. 63–80. Springer, Heidelberg (2007) 12. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996) 13. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001) 14. Dimitrovski, I., Kocev, D., Loskovska, S., Dzeroski, S.: Hierarchical annotation of medical images. In: Proc. of the 11th International Multiconference − IS 2008, Ljubljana, Slovenia, pp. 170–174 (2008)
Dense Simple Features for Fast and Accurate Medical X-Ray Annotation Uri Avni1 , Hayit Greenspan1 , and Jacob Goldberger2 1
BioMedical Engineering, Tel-Aviv University 2 Engineering School, Bar-Ilan University
Abstract. We present a simple, fast and accurate image categorization system, applied to medical image databases within the ImageCLEF 2009 medical annotation task. The methodology presented is based on local representation of the image content, using a bag of visual words approach in multiple scales, with a kernel based SVM classifier. The system was ranked first in this challenge, with total error score of 852.8.
1 Introduction This work presents a classification system that is based on the bag-of-visual-words (BoW) paradigm, a recently introduced concept that has been successfully applied to scenery image classification tasks (see e.g. [1,2,3]). Approaches based on local features were presented in recent ImageCLEF medical annotation challenges [4,5], including the most successful ones (e.g., [6,7]). In 2006 Deselaers et al. [6] displayed the best medical annotation results using a local features approach, where the features are local patches of different sizes taken at every position, and scaled to a common size. No dictionary was used; rather, the feature space was quantized uniformly in every dimension and the image was represented as a sparse histogram in the quantized space. Tommasi et al. [7] had the highest score in 2007 and 2008. In this work both global and local features were used, with different integration techniques. As the local features, modified SIFT descriptors were used, sampled randomly. Local features on four image quadrants were learned and represented separately. Nowak et al. [3] showed that the number of patches is the single most influential parameter governing performance. Based on several of the above conclusions, and with several differences, we present a classification system with the following characteristics: We sample patches densely, and show that in this case simple features give comparable classification accuracy to SIFT features, while taking significantly less computation time. We argue that building a visual dictionary from the data does not significantly compromise the computation time. A visual dictionary can be built from a select group of images, with computation time that does not depend on the database size. Moreover, when spatial coordinates are included in the features, indexing dictionary words by the coordinates significantly accelerates the lookup process. An overview and detailed description of the proposed classification system is provided in Section 2. The experimental validation and sensitivity analysis is described in Section 3.
2 A Classification System Based on a Dictionary of Visual-Words We review the classification system we have implemented for the ImageCLEF 2009 medical annotation challenge. Key components are shown in the flow-diagram in Fig. 1. Patch extraction Given an image, feature detection is used to extract several small local patches. Each small patch shows a localized view of the image content. These patches are considered as candidates for basic elements, or “words”. The patch size needs to be larger than a few pixels across, in order to capture higher-level semantics such as edges or corners. At the same time, the patch size should not be too large if it is to serve as a common building block of many images. Common feature detection approaches include using a regular sampling grid, a random selection of points, or the selection of points with high information content using salient point detectors. We utilize all the information in the image, by sampling rectangular patches of fixed size N × N around every pixel. Feature space description Following the feature detection step, the feature representation method involve representing the patches using feature descriptors. In this step, a large random subset of images is used (ignoring their labels). We extract patches of size N × N using a regular grid, and normalize each patch by subtracting its mean gray level, and dividing it by its standard deviation. This step insures invariance to local changes in brightness, provides local contrast enhancement and augments the information within a patch. Patches that have a single intensity value are abundant in x-ray images. These patches are common in all categories, much like stop-words in text documents. These patches are ignored. We are left with a large collection of several million vectors of length N 2 . To reduce both the computational complexity of the algorithm and the level of noise, we apply a principal component analysis procedure (PCA) to this initial patch collection. The first few components of the PCA, which are the components with the largest eigenvalues, serve as a basis for the information description. A popular alternative approach to raw patches is the SIFT representation [8] which is beneficial in scenery images [9], where object scales can vary. We examine this option in the experiments defining the system parameter set. In addition to patch content information represented either by PCA coefficients or SIFT descriptors, we add the patch center coordinates to the feature vector. This addition introduces spatial information into the image representation, without the need to explicitly model the spatial dependency between patches. Quantization The final step of the bag-of-words model is to convert vector represented patches into visual words and to generate a representative dictionary. A visual word can be considered as a representative of several similar patches. A frequently-used method is to perform K-means clustering over the vectors of the initial collection, and then cluster them into K groups in the feature space. The resultant cluster centers serve as a vocabulary of K visual words. Due to the fact that we included spatial coordinates as part of the feature space, the visual words have a localization component in them, which is reflected as a spatial spread of the words in the image plane. Words are more dense in
areas with greater variability across images in the database. Dictionary words are stored in a kd-tree, indexed by the spatial coordinates.

From an input image to a representative histogram. A given (training or testing) image can now be represented by a unique distribution over the generated dictionary of words. In our implementation, patches are extracted from every pixel in the image. For an x-ray image of size 512 × 512 there are typically several hundreds of thousands of non-empty patches. The patches are projected into the selected feature space, and translated (quantized) to indices by looking up the most similar feature-vector in the generated dictionary. Using the spatial indexation of dictionary words, the dictionary lookup process is accelerated by comparing a new patch only to dictionary words at a certain radius from it. The dictionary generation process and the shift from a given image to its representative histogram are shown in Figure 1, left column and right column, respectively. Note that as a result of including spatial features, both the image local content and the spatial layout are preserved in the discrete histogram representation. Multi-scale image information may in some cases provide additional information that supports the required discrimination. To address this we repeat the dictionary building process for scaled-down replications of the input image, using the same patch size. The image representation in this case is a 1-D concatenation of histograms from varying scales. This process, demonstrated in Figure 2, provides a richer image representation. It does not imply scale invariance, as in [6]. In our experiments we found that objects of interest in radiographs appear at a roughly similar size-range across all images, thus invariance to scale is not a necessity.

Classification. Image classification is performed using a non-linear multiclass Support Vector Machine (SVM) with different kernels. We examined several non-linear kernels, commonly used with histogram data:

– Histogram intersection kernel [10]: K(x, y) = Σ_i min(x_i, y_i)
– Radial Basis Function kernel: K(x, y) = exp(−γ ‖x − y‖²)
– χ² kernel: K(x, y) = exp(−γ Σ_i (x_i − y_i)² / (x_i + y_i))
Note that histogram intersection has no free kernel parameters, which makes it convenient for fast parameter evaluation. The two other kernels have a free tradeoff parameter γ, and require careful optimization. In order to classify multiple categories, we use the one-vs-one extension of the binary classifier, where N (N − 1)/2 binary classifiers are trained for all pairs of categories in the dataset. Whenever an unknown image is classified with a binary classifier it casts one vote for its preferred class, and the final result is the class with the most votes. Since each binary classifier runs independently, parallelization of both training and testing phases of the SVM is straightforward. It is implemented as a parallel enhancement of the LIBSVM library1. 1
http://www.csie.ntu.edu.tw/∼cjlin/libsvm
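The three kernels above can be written directly on the histogram vectors. The sketch below uses NumPy and, purely as an illustration, feeds a precomputed χ² Gram matrix to scikit-learn's SVC (whose default multiclass strategy is the same one-vs-one scheme described in the text); the authors' own runs used a parallel enhancement of LIBSVM, not this code, and the toy data are invented.

import numpy as np
from sklearn.svm import SVC

def hist_intersection(X, Y):
    # K(x, y) = sum_i min(x_i, y_i), no free kernel parameter
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def chi2_kernel(X, Y, gamma=1.0, eps=1e-12):
    # K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
    D = np.array([[((x - y) ** 2 / (x + y + eps)).sum() for y in Y] for x in X])
    return np.exp(-gamma * D)

# toy histograms: 4 "images", 6 dictionary words
X = np.random.RandomState(0).dirichlet(np.ones(6), size=4)
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="precomputed").fit(chi2_kernel(X, X), y)
print(clf.predict(chi2_kernel(X, X)))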
Fig. 1. Dictionary building and image representation flow chart
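A condensed sketch of the pipeline in Fig. 1, with scikit-learn standing in for the authors' implementation. The patch size, the 7 PCA components, the spatial weighting and the 1000-word vocabulary follow values reported later in the paper; the σ-free normalization, the skipping of constant patches and everything else in this listing are assumptions (in particular, the paper's kd-tree-accelerated lookup is replaced here by a plain nearest-centroid assignment).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def dense_patches(img, n=9, spatial_weight=3.0):
    # N x N patches around every pixel (subsample in practice), variance-normalized,
    # with weighted patch-center coordinates appended
    feats = []
    h, w = img.shape
    for y in range(0, h - n):
        for x in range(0, w - n):
            p = img[y:y + n, x:x + n].astype(float).ravel()
            if p.std() < 1e-6:              # drop single-intensity "stop-word" patches
                continue
            p = (p - p.mean()) / p.std()    # local brightness/contrast normalization
            cx = spatial_weight * (2.0 * x / w - 1.0)
            cy = spatial_weight * (2.0 * y / h - 1.0)
            feats.append(np.concatenate([p, [cx, cy]]))
    return np.array(feats)

def build_dictionary(patch_sets, n_components=7, n_words=1000):
    allp = np.vstack(patch_sets)
    pca = PCA(n_components=n_components).fit(allp[:, :-2])
    reduced = np.hstack([pca.transform(allp[:, :-2]), allp[:, -2:]])
    km = KMeans(n_clusters=n_words, n_init=3).fit(reduced)
    return pca, km

def image_histogram(img, pca, km):
    p = dense_patches(img)
    words = km.predict(np.hstack([pca.transform(p[:, :-2]), p[:, -2:]]))
    return np.bincount(words, minlength=km.n_clusters) / len(words)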
Fig. 2. Image representation at multiple scales
3 Experiments and Results A key component in using the BoW paradigm in a categorization task is the tuning of the system parameters. An optimization step is thus required for a given task and image archive. We focus on three components of the system: finding the optimal set of local features, finding the optimal dictionary size, and optimizing the classifier parameters. We use the 2007 labels of the IRMA database. Each of the 116 categories is treated as a separate label, disregarding the hierarchical nature of the IRMA code. We optimized the system parameters using several cross-validation experiments. In the following experiments, 10,667 images were used for training and 2000 randomly drawn images were used for testing and verification. The optimization is performed independently in three steps: finding the optimal set of local features, finding the optimal dictionary size, and optimizing the classifier parameters. Local features. We examined three feature extraction strategies: raw patches, raw patches with normalized variance, and SIFT descriptors. In all cases we added spatial coordinates to the feature vector. We used dense extraction of features around every pixel in the image. There are often strong artifacts near the image border that are not relevant to the image category, so a 5% margin from the image border was ignored. The feature extraction step produces about 100,000 to 200,000 features from a single image. It is our experience that X-ray images from the same category usually appear in a similar scale and orientation in a given archive. In this task the invariance of the SIFT features to scale and orientation is therefore not necessary. We used SIFT descriptors taken at a single scale, without aligning the orientation, as in [7]. Raw patches and normalized patches of size N × N were dimensionally reduced using PCA. Table 1 summarizes the classification results of the three feature sets. Normalizing patch variance improves the classification rate compared to raw patches. The gain can be attributed to the local contrast invariance achieved in this step. In this task, using normalized patches proved marginally preferable to SIFT descriptors in terms of classification accuracy. However, when using raw patches, the feature extraction step is significantly faster than with SIFT descriptors, as seen in Figure 3. The majority of the running time was spent in the image representation step; this step takes over 3 seconds per image with the SIFT features, and less than half a second with the simpler raw patches. Time was measured on a dual quad-core Intel Xeon 2.33 GHz. In the following sections variance-normalized raw patches are used as features. Figure 4(a) depicts the effect of using 4 to 10 components for variance-normalized raw patches. It can be seen that the number of components has a minimal effect on classification accuracy. The addition of spatial coordinates to the feature set, on the other hand, improves the classification performance noticeably, as seen in Figure 4(b).

Table 1. Comparison of different features

Features      Average %   Standard Deviation
Raw Patches   88.43       0.32
SIFT          90.80       0.41
Normalized    91.29       0.56
Fig. 3. Running time using SIFT descriptors and normalized raw patches
Fig. 4. (a): Effect of the number of PCA components in a patch on classification accuracy. (b) Effect of spatial features: Weight of spatial features (x-axis); Classification accuracy (y-axis).
We found that when using 7 PCA components, the optimal range for the x, y coordinates was [−3, 3]. Bars show means and standard deviations from 20 experiments running on 1000 random test images. We next investigated the appropriate number of words in the dictionary. As Figure 5 shows, increasing the number of dictionary words proved useful up to 1000 words. Adding additional words after that point increased the computational time with no evident improvement in the classification rate. Combining the above, the classification system used normalized raw patch features, with 7 PCA components, spatial features with weight [-3,3], and 1000 visual words. Using the SVM with a histogram intersection kernel achieved a classification accuracy of 91.29%. We next examined two additional kernel types with the SVM classifier, the Radial basis function and χ2 kernels. We used the optimal features and dictionary size, consistently across all experiments. For these kernel types the SVM cost parameter C, and free kernel parameter γ were scanned simultaneously over a grid to find the classifier’s optimal working point. The results of these experiments are summarized in Table 2. The χ2 kernel is ranked first by a small margin with 91.62% accuracy, followed by the RBF kernel with 91.45%. As a final experiment, we take information from multiple image scales into account by repeating the dictionary creation step on scaled-down versions of the original image. The image representation thus was a concatenation of histograms built on the single scale dictionaries. We used 3 scales: the original image, 1/2 size and 1/8 size. Using 3
Fig. 5. Effect of dictionary size on classification accuracy

Table 2. Comparison of SVM kernel types, for 1-scale and 3-scale models

Kernel                   Average % 1-scale   Average % 3-scales
Radial Basis             91.45               91.59
Histogram Intersection   91.29               91.89
χ2                       91.62               91.95
Fig. 6. Detecting category ’posteroanterior, left hand’: (a),(b),(c),(d) Correctly classified. (e) False negative, misclassified as ’left anterior oblique, left hand’. False positives come from categories: (f) anteroposterior, left carpal joint (g) anteroposterior, left foot (h) right anterior oblique, right foot.
scales further improved the accuracy for all kernels, as seen in the right-most column of Table 2. The average classification accuracy with the χ2 kernel is 91.95%. Figure 6 demonstrates the subtlety of the challenge by examining the classification accuracy on one category: ’Posteroanterior, Left hand’. In this run there are 2000 random test images, with 57 images from the examined category. Out of which, 56 were
Table 3. Error score of the submitted medical image annotation run, and the second best result. Lower is better.

Run                       2005   2006   2007    2008     Sum
This work                 356    263    64.3    169.5    852.8
Second best - Idiap [4]   393    260    67.23   178.93   899.16
correctly detected by the described system, for eg. (a,b,c,d). Only one image, (e), was falsely classified - it was detected as ’Left anterior oblique, Left hand’ (false negative). 3 images from other categories, (f,g,h), were misclassified as ’Posteroanterior, Left hand’ (false positives). The system parameters, tuned to the labels from 2007, were applied to the 4 labeling sets of the ImageCLEF 2009 medical annotation challenge. The results are presented in Table 3. Our system was ranked first on 3 of the 4 labeling sets (2005, 2007 and 2008), and first in the overall error score. To conclude, in this study we applied a visual words approach to medical image annotation. We showed that using dense sampling in multiple scales while keeping the features simple makes the system both accurate and computationally efficient. The system ranked first in ImageCLEF 2009 medical annotation challenge. Currently we are looking at extending the capabilities to categorizations of healthy vs pathology cases, as well as within-pathology identifications.
References 1. Varma, M., Zisserman, A.: Texture classification: are filter banks necessary? In: CVPR, vol. 2, pp. 691–698 (2003) 2. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR, vol. 2, pp. 524–531 (2005) 3. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503. Springer, Heidelberg (2006) 4. Tommasi, T., Caputo, B., Welter, P., Deserno, T.M.: Overview of the CLEF 2009 medical image annotation track. In: CLEF Working Notes (2009), http://www.clef-campaign.org/2009/working_notes 5. Deselaers, T., Deserno, T.M.: Medical image annotation in ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 523–530. Springer, Heidelberg (2009) 6. Deselaers, T., Hegerath, A., Keysers, D., Ney, H.: Sparse patch-histograms for object classification in cluttered images. In: Franke, K., M¨uller, K.-R., Nickolay, B., Sch¨afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 202–211. Springer, Heidelberg (2006) 7. Tommasi, T., Orabona, F., Caputo, B.: Discriminative cue integration for medical image annotation. Pattern Recogn. Lett. 29(15), 1996–2002 (2008) 8. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, vol. 2, pp. 1150–1157 (1999) 9. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73(2), 213–238 (2007) 10. Barla, A., Odone, F., Verri, A.: Histogram intersection kernel for image classification. In: ICIP, vol. 3 (2003)
Automated X-Ray Image Annotation Single versus Ensemble of Support Vector Machines Devrim Unay1 , Octavian Soldea2 , Sureyya Ozogur-Akyuz3, Mujdat Cetin2 , and Aytul Ercil2 1 2
Electrical and Electronics Engineering, Bahcesehir University, Istanbul, Turkey Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey 3 Mathematics and Computer Science, Bahcesehir University, Istanbul, Turkey {devrim.unay,sureyya.akyuz}@bahcesehir.edu.tr, {octavian,mcetin,aytulercil}@sabanciuniv.edu
Abstract. Advances in medical imaging technology have led to an exponential growth in the number of digital images that need to be acquired, analyzed, classified, stored and retrieved in medical centers. As a result, medical image classification and retrieval have recently gained high interest in the scientific community. Despite several attempts, such as the yearly-held ImageCLEF Medical Image Annotation Challenge, the proposed solutions are still far from being sufficiently accurate for real-life implementations. In this paper we summarize the technical details of our experiments for the ImageCLEF 2009 medical image annotation challenge. We use a direct and two ensemble classification schemes that employ local binary patterns as image descriptors. The direct scheme employs a single SVM to automatically annotate X-ray images. The two proposed ensemble schemes divide the classification task into sub-problems. The first ensemble scheme exploits ensemble SVMs trained on IRMA sub-codes. The second learns from subgroups of data defined by the frequency of classes. Our experiments show that ensemble annotation by training individual SVMs over each IRMA sub-code dominates its rivals in annotation accuracy at the cost of increased processing time relative to the direct scheme.
1
Introduction
Digital medical images, such as standard radiographs (X-Ray) and computed tomography (CT) images, represent a large part of the data that need to be stored, archived, retrieved, and shared among medical centers. Manual labeling of this data is not only time consuming, but also error-prone due to inter/intraobserver variations. In order to realize an accurate classification of digital medical images one needs to develop automatic tools that allow high performance image annotation, i.e. a given image is automatically labeled with a text or a code without any user interaction.
This work was supported in part by the Marie Curie Programme of the European Commission under FP6 IRonDB project MTK-CT-2006-047217.
Several attempts in the field of medical images have been performed in the past, such as the WebMRIS system [1] for cervical spinal X-Ray images, and the ASSERT system [2] for CT images of the lung. While these efforts consider retrieving a specific body part only, other initiatives have been taken in order to retrieve multiple body parts. The yearly held ImageCLEF Medical Image Annotation challenge, run as part of the Cross-Language Evaluation Forum (CLEF) campaign, aims in automatic classification of an X-Ray image archive containing more than 12,000 images randomly taken from the medical routine. The dataset contains images of different body parts of people from different ages, of different genders, under varying viewing angles and with or without pathologies. A potent classification system requires the image data to be translated into a more compact and more manageable representation containing descriptive features. Several feature representations have been investigated in the past for such a classification task. Among others, image features, such as average value over the complete image or its sub-regions [3] and color histograms [4], have been investigated. Recently in [5], texture features like local binary patterns (LBP) [6] have been shown to outperform other types of low-level image features in classification of X-Ray images. Subsequently in [7], it has been shown that retaining only the relevant local binary pattern features achieves comparable classification accuracies with smaller feature sets, thus leading to reduced processing time and storage space requirements. A less investigated path is to exploit from hierarchical organization of medical data, such as the ImageCLEF data labeled by the IRMA coding system, using ensemble classifiers. Accordingly, in this paper we explore the annotation performance of two ensemble classification schemes based on IRMA sub-codes and frequency of classes, and compare them to the well-known single-classifier scheme over the ImageCLEF-2009 Medical Annotation dataset. The paper is organized as follows. Section 2 presents our feature extraction and classification steps in detail. Then, in Section 3 we introduce the image database and the experimental evaluation performed. And finally, Sections 4 and 5, present corresponding results and our conclusions, respectively.
2 2.1
Method Feature Extraction
We extract spatially enhanced local binary patterns as features from each image in the database. LBP [6] is a gray-scale invariant local texture descriptor with low computational complexity. The LBP operator labels image pixels by thresholding a neighborhood of each pixel with the center value and considering the results as a binary number. Formally, given a pixel at (x_c, y_c), the resulting LBP code can be expressed as:

LBP_{P,R}(x_c, y_c) = Σ_{n=0}^{P−1} s(i_n − i_c) · 2^n    (1)
Fig. 1. The image is divided into 4x4 non-overlapping sub-regions from which LBP histograms are extracted and concatenated into a single, spatially enhanced histogram
where n runs over the P neighbors of the central pixel, i_c and i_n are the gray-level values of the central and the neighboring pixels, and s(x) is 1 if x ≥ 0 and 0 otherwise. Eventually, a histogram of the labeled image f_l(x, y) can be defined as

H_i = Σ_{x,y} I(f_l(x, y) = i),   i = 0, . . . , L − 1    (2)
where L is the number of different labels produced by the LBP operator, and I(A) is 1 if A is true and 0 otherwise. The derived LBP histogram contains information about the distribution of local micro-patterns, such as edges and flat areas, over the image. Because not all LBP codes are informative [6], we use uniform version of LBP and reduce the number of informative codes from 256 to 59 (58 informative bins + one bin for noisy patterns). In order to obtain a more local description, we divide images into 4x4 non-overlapping sub-regions and concatenate the LBP histograms extracted from each region into a single, spatially enhanced feature histogram, as in [5] (Figure 1). Finally, we obtain a total of 944 features per image, and each feature is linearly scaled to [-1,+1] range before presented to the classifier. 2.2
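A minimal re-implementation of this descriptor is sketched below (the 8-neighbour uniform-LBP mapping and the 4×4 grid follow the description above; scikit-image's local_binary_pattern could be used instead, and the neighbour ordering and helper names here are assumptions):

import numpy as np

def uniform_map(p=8):
    # map each 8-bit LBP code to one of 58 uniform bins plus one bin for the rest
    table, nxt = {}, 0
    for code in range(2 ** p):
        bits = [(code >> i) & 1 for i in range(p)]
        transitions = sum(bits[i] != bits[(i + 1) % p] for i in range(p))
        table[code] = nxt if transitions <= 2 else 58
        if transitions <= 2:
            nxt += 1
    return table

def lbp_histograms(img, grid=4):
    table = uniform_map()
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        codes |= ((img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] >= center) << bit)
    labels = np.vectorize(table.get)(codes)
    feats = []
    for rows in np.array_split(labels, grid, axis=0):      # 4x4 non-overlapping sub-regions
        for cell in np.array_split(rows, grid, axis=1):
            feats.append(np.bincount(cell.ravel(), minlength=59))
    return np.concatenate(feats)                            # 16 * 59 = 944 features, as in the text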
Image Annotation
In this work we use a support vector machine (SVM) based learning framework to automatically annotate the images. SVM [8] is a popular machine learning algorithm that provides good results for general classification tasks in the computer vision and medical domains: e.g. nine of the ten best models in ImageCLEFmed 2006 competition were based on SVM [9]. In a nutshell, SVM maps data to a higher-dimensional space using kernel functions and performs linear discrimination in that space by simultaneously minimizing classification error and maximizing geometric margin between classes.
Fig. 2. Illustration of ensemble classification based on IRMA sub-codes. A separate SVM is trained for each sub-code, and final decision is formed by concatenating predictions of each SVM.
Among all available kernel functions for data mapping in SVM, the Gaussian radial basis function is the most popular choice, and therefore it is used here. In this work we used the LibSVM1 library (version 2.89) for SVM and empirically found its optimum parameters on the dataset. Direct Annotation Scheme (D). In this scheme, we classify images by using a single SVM with a one-versus-all multi-class model. In contrast, the ensemble schemes break the annotation task down into sub-problems by dividing the data into subgroups based on 1) IRMA sub-codes, and 2) frequency of classes. Ensemble Annotation by IRMA sub-codes (E-1). In the IRMA coding system, images are categorized in a hierarchical manner based on four sub-codes describing image modality, image orientation, body region examined, and biological system investigated. Accordingly, in this scheme we train a separate SVM for each sub-code and merge their predictions to form the final decision, as illustrated in Figure 2. Ensemble Annotation by frequency of classes (E-2). On the contrary, this ensemble scheme successively divides the data into sub-groups based on frequency of classes and trains a separate SVM on each sub-group (Figure 3). Let L1, L2, . . . , Ln be the set of classes in the training set and m ∈ N be a positive integer parameter. Without loss of generality, assume L1, L2, . . . , Ln are sorted in their decreasing cardinality values. We divide the training set into a sequence of clusters C1, C2, . . . , Ck, such that C1 = {L1, L2, . . . , Lm, U1}, C2 = {Lm+1, Lm+2, . . . , L2m, U2}, and so on, where U1 = ∪_{i=m+1}^{n} Li and U2 = ∪_{i=2m+1}^{n} Li (see Figure 3). For each Ci we train an SVM. Let Si be the SVM trained on Ci. When classifying, we begin from S1. If S1 suggests one of the L1, L2, . . . , Lm labels, then we consider this result a valid classification. If the result is U1, then we proceed further to S2. We follow this procedure recursively, until we eventually reach Sk, which finishes the classification procedure. Note that we adjust Ck to include only Li labels.
Available at http://www.csie.ntu.edu.tw/ cjlin/libsvm
Fig. 3. Illustration of second ensemble SVM scheme for m = 2. The first cluster, C1 , consists of classes {L1 , L2 , U1 } . The second cluster, C2 , consists of {L3 , L4 , U2 } , and so on.
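A sketch of the E-2 cascade (cluster construction and cascaded prediction). The per-cluster classifier objects are placeholders for the SVMs, m is the split parameter of Section 2.2, and the "U*" label names are an assumption made only for this illustration.

def build_clusters(labels_sorted, m):
    # labels_sorted: class names in decreasing order of frequency
    clusters, rest = [], list(labels_sorted)
    while len(rest) > m:
        head, rest = rest[:m], rest[m:]
        clusters.append(head + [f"U{len(clusters) + 1}"])   # Ui stands for the union of the remaining classes
    clusters.append(rest)                                    # last cluster Ck: only real labels
    return clusters

def cascade_predict(x, svms, clusters):
    # svms[i] predicts one label from clusters[i]; a 'U*' label means "defer to the next stage"
    for svm, cluster in zip(svms, clusters):
        label = svm(x)
        if not label.startswith("U"):
            return label
    return label   # the last stage always returns a real class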
3 3.1
Experimental Setup Image Data
The database released for the ImageCLEF-2009 Medical Annotation task includes 12677 fully classified (2D) radiographs for training and a separate test set consisting of 2000 radiographs. The aim is to automatically classify the test set using four different label sets including 57 to 193 distinct classes. A more detailed explanation of the database and the tasks can be found in [10]. 3.2
Evaluation
We evaluate our SVM-based learning using two schemes depending on the availability of test data labels: 1)5-fold cross validation if test data labels are missing, and 2)ImageCLEF error counting scheme, otherwise. In the former scheme, the training database is partitioned into five subsets. Each subset is used once for testing while the rest are used for training, and the final result is assigned as the average of the five validations. Note that for each validation all classes were equally divided among the folds. We measure the overall classification performance using accuracy, which is the number of correct predictions divided by the total number of images. To the contrary, the error counting scheme is introduced by the challenge organizers to compare all runs submitted. Further details on this scheme can be found in [10]. 3.3
Runs Submitted
As Computer Vision and Pattern Analysis (VPA) Laboratory of Sabanci University, we submitted three different runs to the ImageCLEF 2009 medical image annotation task. One obtained by the direct scheme (VPA-SABANCI-1), and two with the ensemble schemes (VPA-SABANCI-2 and -3). For each run, the optimum parameter setting was realized by trial-and-error.
4
Results
In this section, we present the results obtained by the proposed annotation schemes. In Table 1 we observe the results realized on the training database with 5-fold cross-validation. Ensemble scheme based on IRMA sub-codes clearly outperforms others, especially in terms of the 2007, 2008 and overall accuracies. Table 1. Performance of VPA-SABANCI runs on training data
Run             Type   2005   2006   2007   2008   Average   (accuracy, %)
VPA-SABANCI-1   D      88.0   83.2   83.2   83.1   84.4
VPA-SABANCI-2   E-1    88.0   83.2   91.7   93.0   89.0
VPA-SABANCI-3   E-2    83.3   77.4   77.6   77.6   79.0
Table 2 provides a detailed performance comparison of the direct scheme and the IRMA sub-codes based ensemble one over the 2007 and 2008 labels. Simplifying the classification task by training a separate SVM over each sub-code considerably improves the final accuracy relative to the usage of a single SVM. Furthermore, the 2008 accuracies of the individual SVMs exceed those of 2007 despite the higher number of classes (thus a more difficult classification problem). The underlying reason for this observation may be attributed to the more realistic labels of 2008.

Table 2. Efficacy of ensemble classification based on IRMA sub-codes. Values in parentheses refer to the number of distinct classes for that sub-task.

                     Ensemble by IRMA sub-codes                               Direct
                     SVM1      SVM2       SVM3       SVM4       Final
2007 accuracy (%)    96.7(5)   85.6(27)   88.0(66)   96.4(6)    91.7          83.2
2008 accuracy (%)    99.2(6)   86.3(34)   88.0(97)   98.5(11)   93.0          83.1
The results achieved on the test dataset in terms of prediction errors are presented in Table 3, together with the results of the best run realized in the challenge for comparison. As observed, the IRMA sub-codes based ensemble scheme (E-1) outperforms its rivals again. With this performance, the VPA-SABANCI-2 run is ranked 7th among the 18 runs submitted to the competition. Compared to our solution, the best run of the challenge exploits multiresolution analysis. Figure 4 displays exemplary confusions realized by the best performing VPA-SABANCI-2 run for a class with few samples that leads to low recognition performance (19.5%), which may be partly due to the low number of examples, and partly because of the high visual similarity between the confused classes and the reference class (most confusions are between images of the same body part, i.e. the head; note that, at manual categorization, these images were assigned to different labels because of variation in image acquisition, such as view angle).
Table 3. Performance of VPA-SABANCI runs, in comparison with the best run of the challenge, on test data. D refers to direct scheme, while E-1 and E-2 refer to ensemble schemes based on IRMA code and data distribution, respectively.
Run                    Type   2005   2006   2007     2008     Sum      (error score)
VPA-SABANCI-1          D      578    462    201.31   272.61   1513.92
VPA-SABANCI-2          E-1    578    462    155.05   261.16   1456.21
VPA-SABANCI-3          E-2    587    498    169.33   300.44   1554.77
TAUbiomed (best run)   –      356    263    64.30    169.50   852.80
Fig. 4. Exemplary confusions realized by the proposed approach for a class with relatively low accuracy. Reference class with the corresponding label, number-of-examples, accuracy, and a representative X-ray image are shown on the left, while the three most-observed confusions in descending order are displayed to the right.

Table 4. Computational expense of VPA-SABANCI runs for testing on a PC with a 2.40GHz processor and 6GB RAM. T = 1.83 min, M = 140 MB, and k = #classes/m, m being the split parameter defined in Section 2.2. Typically, k > 4 in our case.

Run             Type   CPU Time   Memory Usage
VPA-SABANCI-1   D      T          M
VPA-SABANCI-2   E-1    4T         M
VPA-SABANCI-3   E-2    kT         M
Table 4 demonstrates the computational requirements of the proposed schemes for testing. As observed, the ensemble schemes require over four times the resources of the direct scheme on a single-processor architecture. Nevertheless, this additional requirement can be canceled out by parallel processing.
5
Conclusion
In this paper we have introduced a classification work with the aim of automatically annotating X-Ray images. We have explored the annotation performances
of two ensemble classification schemes based on individual SVMs trained on IRMA sub-codes and frequency of classes, and compared the results with the popular single-classifier scheme. Our experiments on the ImageCLEF-2009 Medical Annotation database revealed that breaking the annotation problem down to sub-problems by training individual SVMs over each IRMA sub-code outperforms its rivals in terms of annotation accuracy with the compromise of increased computational expense.
References 1. Long, L.R., Pillemer, S.R., Lawrence, R.C., Goh, G.H., Neve, L., Thoma, G.R.: WebMIRS: web-based medical information retrieval system. In: Sethi, I.K., Jain, R.C. (eds.) Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 3312, pp. 392–403 (December 1997) 2. Shyu, C.R., Brodley, C.E., Kak, A.C., Kosaka, A., Aisen, A.M., Broderick, L.S.: Assert: a physician-in-the-loop content-based retrieval system for hrct image databases. Comput. Vis. Image Underst. 75(1-2), 111–132 (1999) 3. Rahman, M.M., Desai, B.C., Bhattacharya, P.: Medical image retrieval with probabilistic multi-class support vector machine classifiers and adaptive similarity fusion. Computerized Medical Imaging and Graphics 32(2), 95–108 (2008) 4. Mueen, A., Sapian Baba, M., Zainuddin, R.: Multilevel feature extraction and x-ray image classification. J. Applied Sciences 7(8), 1224–1229 (2007) 5. Jacquet, V., Jeanne, V., Unay, D.: Automatic detection of body parts in x-ray images. In: IEEE Computer Society Workshop on Mathematical Methods in Biomedical Image Analysis, MMBIA (2009) 6. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 7. Unay, D., Soldea, O., Ekin, A., Cetin, M., Ercil, A.: Automatic Annotation of X-ray Images: A Study on Attribute Selection. In: Medical Content-based Retrieval for Clinical Decision Support (MCBR-CDS) Workshop in Conjunction with MICCAI 2009 (2009) 8. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998) 9. M¨ uller, H., Deselaers, T., Deserno, T., Clough, P., Kim, E., Hersh, W.: Overview of the imageCLEFmed, medical retrieval and medical annotation tasks. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 595–608. Springer, Heidelberg (2007) 10. Tommasi, T., Caputo, B., Welter, P., G¨ uld, M.O., Deserno, T.M.: Overview of the clef 2009 medical image annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
Topological Localization of Mobile Robots Using Probabilistic Support Vector Classification Yan Gao and Yiqun Li Department of Computer Vision and Image Understanding Institute for Infocomm Research, Singapore {ygao,yqli}@i2r.a-star.edu.sg
Abstract. Topologically localizing a mobile robot using visual information alone is a difficult problem. We propose a localization system that comprises Gaussian derivatives as raw local descriptors, a three-tier spatial pyramid of histograms as the image descriptor, and probabilistic multi-class support vector machines for classification. Based on the probability estimate, the proposed system is able to predict the unknown class which corresponds to locations that are not imaged in the training sequence. To exploit the continuity of the sequence, a smoothing procedure can be applied, which is shown to be simple yet effective.
1
Introduction
The RobotVision@ImageCLEF1 addresses the problem of topological localization of a mobile robot using visual information. The objective is to determine the topological location of a robot based on images acquired with a perspective camera mounted on the robot platform. The obligatory track requires the decision to be based on a single test image. The optional track, on the other hand, allows the use of images acquired before the current test image in a sequence of recording to exploit the continuity of the sequence. Our localization system is mainly built upon a probabilistic multi-class support vector classifier that is included in the LIBSVM software[2]. By making use of the probability estimate of the classification, we are able to handle the unknown class which corresponds to locations that have not been imaged in the training sequence. Simple yet efficient local descriptors based on Gaussian derivatives on the lightness component of the LAB color space are used. Histograms of the local descriptors are extracted using a three-tier spatial pyramid. For the optional task, a smoothing procedure is proposed to enhance the prediction of the current test image based on the predictions made on earlier images in the sequence.
2
The Data
The training and validation data are from the IDOL2 database[6]. The image sequences in the database are acquired using the MobileRobots PowerBot 1
http://imageclef.org/2009/robot
robot platform. In the final experiment for the ImageCLEF robot vision task, one sequence is chosen as the training data, while a previously unreleased image sequence is used for testing. The training sequence consists of 1034 image frames of 5 classes according to the robot’s topological location, namely, one-person office(BO), corridor(CR), two-persons office(EO), kitchen(KT), and printer area(PA). The test sequence consists of 1690 image frames classified into 6 classes, 5 of which are the same as those of the training sequence, and one additional unknown(UK) class corresponding to the additional rooms that are not imaged previously. The test sequence is acquired 20 months after the training sequence. For more details please refer to [1].
3 3.1
System Description Feature Extraction
Each training image in the training sequence is first converted to a representation in the LAB color space. Normalized Gaussian derivatives are obtained on the L component. Altogether 5 partial derivatives (Lx, Ly, Lxx, Lyy, Lxy) are computed on the smoothed L component using Gaussian filters. A code book is built on the computed Gaussian derivatives using k-means clustering, with k = 32. The computed Gaussian derivatives are then quantized into the 32 codewords. A three-tier spatial pyramid [5] of histograms is then obtained on each image, with the three tiers made up by the whole image, 2 × 2 sub-images, and 4 × 4 sub-images. The histograms have 32 bins which correspond to the 32 codewords. Altogether there are 1 + 2 × 2 + 4 × 4 = 21 histograms. Each image is represented by a 21 × 32 = 672 dimensional feature vector. An illustration of the feature extraction process is given in figure 1. 3.2
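A rough sketch of this feature extractor follows (the paper's own implementation is in MATLAB; the Gaussian σ, the use of SciPy/scikit-learn and the helper names are assumptions):

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def gaussian_derivatives(L, sigma=1.5):
    # Lx, Ly, Lxx, Lyy, Lxy computed on the smoothed lightness channel
    orders = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 1)]
    return np.stack([ndimage.gaussian_filter(L, sigma, order=o) for o in orders], axis=-1)

def pyramid_histogram(L, codebook):
    d = gaussian_derivatives(L).reshape(-1, 5)
    words = codebook.predict(d).reshape(L.shape)
    feats = []
    for cells in (1, 2, 4):                      # three tiers: whole image, 2x2, 4x4
        for rows in np.array_split(words, cells, axis=0):
            for cell in np.array_split(rows, cells, axis=1):
                feats.append(np.bincount(cell.ravel(), minlength=32))
    return np.concatenate(feats)                 # 21 histograms x 32 bins = 672 features

# codebook = KMeans(n_clusters=32).fit(sample_of_derivative_vectors_from_training_images)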
The Classifier
In this section we describe the probabilistic multi-class support vector machine used as the classifier. Binary support vector classification. Over recent years, support vector machines [8] have developed into a state-of-the-art classifier. Let D be a training data set and D = {(x, y)} ∈ (R^d × Y)^{|D|}, where R^d denotes a d-dimensional input space, Y = {±1} denotes the label set of x, and |D| is the size of D. Given D, SVM finds an optimal separating hyperplane which classifies the two classes with the minimal expected test error. Let ⟨w∗, x⟩ + b∗ = 0 denote this hyperplane, where w∗ and b∗ are the normal vector and bias, respectively. w∗ and b∗ can be found by minimizing

Φ(w) = (1/2) ‖w‖² + C Σ_{i=1}^{|D|} ξ_i^p ,   subject to: y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i , i = 1, · · · , |D|    (1)
Histogram
Fig. 1. The original image from the test sequence, its L component of the LAB color space, 5 Gaussian derivatives (Lx Ly Lxx, Lyy, Lxy). The image is then quantized into 32 bins. A histogram is obtained by concatenating histograms obtained on the whole image, the 2 × 2 sub-images, and the 4 × 4 sub-images.
where ξ_i (ξ_i ≥ 0) is the i-th slack variable and C is the regularization parameter controlling the trade-off between function complexity and training error. p = 1 or 2 corresponds to the case of L-1 norm or L-2 norm based soft margin, respectively. The solution of the optimization problem is

w∗ = Σ_{i=1}^{|D|} α_i y_i x_i ,   b∗ = 1 − ⟨w∗, x_s⟩    (2)

where α_i is a non-negative coefficient of x_i and x_s is a support vector. In this way, the optimal separating hyperplane is expressed as

⟨w∗, x⟩ + b∗ = Σ_{i=1}^{|D|} α_i y_i ⟨x_i, x⟩ + b∗ = 0    (3)
In the classification stage, f(x) = ⟨w∗, x⟩ + b∗ is used as the decision function, and a test sample x is labeled as y = sgn[f(x)], where sgn(·) denotes the sign function. The kernel trick [4] can be conveniently embedded into SVM to handle nonlinearly separable patterns. A kernel, k, is defined to be k(x, y) = ⟨φ(x), φ(y)⟩, where φ(·) is the associated mapping from a feature space, R^n, to a kernel space, F. This mapping is often nonlinear, and the dimensionality of F can be high or even infinite. The nonlinearly separable patterns in R^n can become linearly separable in F with higher probability. Hence, the optimal separating hyperplane is constructed in F instead of R^n, and it becomes

f(x) = Σ_{i=1}^{|D|} α_i y_i ⟨φ(x_i), φ(x)⟩ + b∗ = Σ_{i=1}^{|D|} α_i y_i k(x_i, x) + b∗ .    (4)
i=1
The χ2 kernel has been found to give good performance on histogram-based features [3]:

k(x_i, x_j) = exp(−γ χ²(x_i, x_j)) = exp( −γ Σ_{q=1}^{d} (x_i^{(q)} − x_j^{(q)})² / (x_i^{(q)} + x_j^{(q)}) )    (5)
where x_i^{(q)} refers to the q-th dimension of the feature vector x_i, which also corresponds to a bin in the histogram. Probabilistic estimate for binary support vector classification. Instead of predicting the label, one can instead estimate a class probability P(y = +1|x) and P(y = −1|x). In [7], the probability is approximated by

P(y = +1|x) = 1 / (1 + e^{A f(x) + B}) ,   P(y = −1|x) = 1 − P(y = +1|x)    (6)
where f(x) is the decision value of the SVM classifier, and A and B are determined by solving a maximum likelihood problem [7]. Probabilistic estimate for multi-class support vector classification. A multi-class classification problem can be solved by solving pairwise binary classification problems. For a multi-class classification problem that classifies each input x into one of M classes y ∈ Y = {1, ..., M}, pairwise class probabilities r_{mn} = P(y = m | y = m or n, x), m, n = 1, ..., M, m ≠ n, are first computed. The objective is to estimate the multi-class probabilities p_m = P(y = m|x), m = 1, ..., M. In [9] it is proposed that

p = argmin_{p_m, m=1,...,M} Σ_{m=1}^{M} Σ_{n: n≠m} (r_{nm} p_m − r_{mn} p_n)² ,   subject to Σ_{m=1}^{M} p_m = 1 , p_m ≥ 0 , ∀m.    (7)
m=1
It can be proven that equation (7) has a unique solution under mild conditions and can be solved using a linear system of equations. The decision of the multiclass classifier is then given by δ = arg maxm pm
3.3
Handling the Unknown Class
Since the test sequence includes images taken in additional rooms not recorded in the training sequence, we make use of the probability estimate of the SVMs to determine when to classify an image into the unknown (UK) class. The approach taken is to classify images with low confidence of belonging to the winning class into the unknown class. Recall that Σ_m p_m = 1, and the decision made by the SVMs is δ = arg max_m p_m. The probability of the sample being from the winning class is P(y = δ|x) = p_δ. If p_δ is small, it means that the classifier is not very confident about putting the sample into the winning class δ. In this case, the sample is classified as the unknown class. Formally,

y = UK if p_δ < T,    (8)
where T is a threshold that can be tuned on the validation set. 3.4
Exploiting Continuity of the Sequence
The optional track of the robot vision task allows the use of earlier images for the prediction of the current test image. Assume that a sequence of predictions has been made, denoted by y_1, ..., y_i. The smoothing procedure makes a correction to the prediction of the current image y_i. Denote the corrected prediction of y_i as y_i′; then y_i′ is given by

y_i′ = mode(y_{i−h+1}, y_{i−h+2}, ..., y_i),    (9)

where mode is a function that finds the value that occurs most frequently. Note that the original prediction y_i is not overwritten by y_i′: the smoothing procedure on the following images makes use of y_i, not y_i′. Otherwise the same label would propagate through the whole sequence. One drawback of the smoothing procedure is that when the robot enters a new room, the prediction will be delayed by h images.
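The smoothing of Eq. (9) amounts to a sliding-window majority vote over the raw predictions. A short sketch, with the window length h as above and invented example labels:

from collections import Counter

def smooth(predictions, h=20):
    # corrected label for frame i = most frequent raw label among the last h frames;
    # the raw predictions themselves are never overwritten
    corrected = []
    for i in range(len(predictions)):
        window = predictions[max(0, i - h + 1): i + 1]
        corrected.append(Counter(window).most_common(1)[0][0])
    return corrected

print(smooth(["CR", "CR", "KT", "CR", "KT", "KT", "KT"], h=3))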
4 Experiments and Results
The classification system is coded in MATLAB, while we make use of the LIBSVM software as the classifier. We have modified the LIBSVM source code to include the χ2 kernel. The parameters of the classifier include the kernel parameter γ = 5, the regularization parameter of the SVM C = 1, the threshold T = 0.6 for classification into the unknown class, and the number of earlier images h = 20 used for the smoothing procedure in the optional task. All are tuned on the validation set. The classification performance is evaluated according to the rules set for the task: 1 point is awarded for each correctly classified image, -0.5 points for each misclassified image, and 0 points for each image that is not classified (not implemented in the proposed system). Our system gives a score of 784, which corresponds to an accuracy rate of 64.26%. This is ranked 4th in the competition (the
top score is 793). For the optional task, after applying the proposed smoothing procedure, the score rises to 884.5, which corresponds to an accuracy rate of 68.22%. This is ranked 3rd in the competition (the top score is 916.5). The benefit brought by the probabilistic multi-class support vector machines lies in the classification of the unknown class: if we do not make use of the probability estimate to determine the unknown class, the performance drops to a score of 700. The proposed localization system is computationally efficient in both the training and classification phases. On an Intel Core2 Duo CPU E8400 PC, the extraction of pyramid histograms of Gaussian derivatives on each image frame takes about 0.3 seconds (MATLAB implementation). The training of a probabilistic multi-class support vector machine using LIBSVM takes 25 seconds on the whole training set of 1034 images. Classification of each test image takes only about 0.004 seconds.
5 Conclusion
We propose a robot localization system based on the probabilistic multi-class support vector machines. By incorporating the prediction of the unknown class based on the probability estimates, the performance improves from a score of 700 to 784, or 12%. By applying a smoothing procedure to exploit continuity of the sequence, the performance further improves from 784 to 884.5, or 12.8%.
References

1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5), 1055-1064 (1999)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178 (2006)
6. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), San Diego, CA, USA (October 2007)
7. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61-74. MIT Press, Cambridge (1999)
8. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
9. Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975-1005 (2004)
The University of Amsterdam’s Concept Detection System at ImageCLEF 2009 Koen E.A. van de Sande, Theo Gevers, and Arnold W.M. Smeulders Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands http://www.colordescriptors.com
Abstract. Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.
1 Introduction
Robust image retrieval is highly relevant in a world that is adapting to visual communication. Online services like Flickr show that the sheer number of photos available online is too much for any human to grasp. Many people place their entire photo album on the internet. Most commercial image search engines provide access to photos based on text or other metadata, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, associated text or (social) tagging. This results in disappointing retrieval performance when the visual content is not mentioned or properly reflected in the associated text. In addition, when the photos originate from non-English speaking countries, such as China or Germany, querying the content becomes much harder. To cater for robust image retrieval, the promising solutions from the literature are in majority concept-based [1], where detectors are related to objects (like trees), scenes (like a desert), and people (like a big group). Any one of those brings an understanding of the current content. The elements in such a lexicon offer users a semantic entry by allowing them to query on presence or absence of visual content elements.
[Figure 1 (pipeline diagram): input image → point sampling strategy (Harris-Laplace, dense sampling) → color descriptors (ColorSIFT) → bag-of-words (hard assignment, soft assignment, 2x2 spatial pyramid) → fixed-length feature vectors.]
Fig. 1. University of Amsterdam's ImageCLEF 2009 concept detection scheme. The scheme serves as the blueprint for the organization of Section 2.
The Large-Scale Visual Concept Detection Task [2] evaluates 53 visual concept detectors. The concepts used are from the personal photo album domain: beach holidays, snow, plants, indoor, mountains, still-life, small group of people, portrait. For more information on the dataset and concepts used, see the overview paper [2]. Based on our previous work on concept detection [3,4,5], we have focused on improving the robustness of the visual features used in our concept detectors. Systems with the best performance in image retrieval [3,6] and video retrieval [4,7] use combinations of multiple features for concept detection. The basis for these combinations is formed by good color features and multiple point sampling strategies. This paper is organized as follows. Section 2 defines our concept detection system. Section 3 details our experiments and results. Finally, in section 4, conclusions are drawn.
2 Concept Detection System
We perceive concept detection as a combined computer vision and machine learning problem. The first step is to represent an image using a fixed-length feature vector. Given a visual feature vector xi , the aim is then to obtain a measure, which indicates whether semantic concept C is present in photo i. We may choose from various visual feature extraction methods to obtain xi , and use a supervised machine learning approach to learn the appearance relation between C and xi . The supervised machine learning process is composed of two phases: training and testing. In the first phase, the optimal configuration of features is learned from the training data. In the second phase, the classifier assigns a probability p(C|xi ) to each input feature vector for each semantic concept C.
2.1 Point Sampling Strategy
The visual appearance of a concept has a strong dependency on the viewpoint under which it is recorded. Salient point methods [8] introduce robustness against viewpoint changes by selecting points which can be recovered under different perspectives. Another solution is to simply use many points, which is achieved by dense sampling. We summarize our sampling approach in Figure 1: Harris-Laplace and dense point selection, and a spatial pyramid.¹

Harris-Laplace point detector. In order to determine salient points, Harris-Laplace relies on a Harris corner detector. By applying it on multiple scales, it is possible to select the characteristic scale of a local corner using the Laplacian operator [8]. Hence, for each corner the Harris-Laplace detector selects a scale-invariant point if the local image structure under a Laplacian operator has a stable maximum.

Dense point detector. For concepts with many homogeneous areas, like scenes, corners are often rare. Hence, for these concepts relying on a Harris-Laplace detector can be suboptimal. To counter this shortcoming of Harris-Laplace, random and dense sampling strategies have been proposed [9,10]. We employ dense sampling, which samples an image grid in a uniform fashion using a fixed pixel interval between regions. We use an interval distance of 6 pixels and sample at multiple scales (σ = 1.2 and σ = 2.0).

Spatial pyramid. Both Harris-Laplace and dense sampling give an equal weight to all keypoints, irrespective of their spatial location in the image. To overcome this limitation, Lazebnik et al. [11] suggest to repeatedly sample fixed subregions of an image, e.g. 1x1, 2x2, 4x4, etc., and to aggregate the different resolutions into a so-called spatial pyramid. Since every region is an image in itself, the spatial pyramid can be used in combination with both the Harris-Laplace point detector and dense point sampling [12]. For the ideal spatial pyramid configuration, some claim 2x2 is sufficient [11], while others suggest to include 1x3 as well [6]. We use a spatial pyramid of 1x1, 2x2, and 1x3 in our experiments.
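For concreteness, the dense grid and the spatial-pyramid cell lookup can be sketched as follows. The 6-pixel step, the scales 1.2 and 2.0, and the 1x1/2x2/1x3 tilings are taken from the text; the grid offset and the orientation of the 1x3 tiling (three horizontal bars) are assumptions of this sketch.

```python
import numpy as np

def dense_grid(width, height, step=6, scales=(1.2, 2.0)):
    """Enumerate dense sampling points: a uniform grid with a fixed pixel interval,
    replicated at each sampling scale."""
    xs = np.arange(step // 2, width, step)
    ys = np.arange(step // 2, height, step)
    return [(x, y, s) for s in scales for y in ys for x in xs]

def pyramid_cell_indices(x, y, width, height, tilings=((1, 1), (2, 2), (3, 1))):
    """Map an image point to its cell index in each spatial-pyramid tiling (rows, cols)."""
    cells = []
    for rows, cols in tilings:
        r = min(int(y * rows / height), rows - 1)
        c = min(int(x * cols / width), cols - 1)
        cells.append(r * cols + c)
    return cells
```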
2.2 Color Descriptor Extraction
In the previous section, we addressed the dependency of the visual appearance of semantic concepts on the viewpoint under which they are recorded. However, the lighting conditions during photography also play an important role. We [3] analyzed the properties of color descriptors under classes of illumination changes within the diagonal model of illumination change, and specifically for data sets consisting of Flickr images. In ImageCLEF, the images used also originate from Flickr. Here we use the four color descriptors from the recommendation table in [3]. The descriptors are computed around salient points obtained from the Harris-Laplace detector and dense sampling. For the color descriptors in Figure 1, each of those four descriptors can be inserted.

¹ Software to perform point sampling, color descriptor computation and the hard and soft assignment is available from http://www.colordescriptors.com
SIFT. The SIFT feature proposed by Lowe [13] describes the local shape of a region using edge orientation histograms. The gradient of an image is shift-invariant: taking the derivative cancels out offsets [3]. Under light intensity changes, i.e. a scaling of the intensity channel, the gradient direction and the relative gradient magnitude remain the same. Because the SIFT feature is normalized, gradient magnitude changes have no effect on the final feature. To compute SIFT features, we use the version described by Lowe [13].

OpponentSIFT. OpponentSIFT describes all the channels in the opponent color space using SIFT features. The information in the O3 channel is equal to the intensity information, while the other channels describe the color information in the image. The feature normalization, as effective in SIFT, cancels out any local changes in light intensity.

C-SIFT. The C-SIFT feature uses the C invariant [14], which can be intuitively seen as the gradient (or derivative) of the normalized opponent color space O1/I and O2/I. The I intensity channel remains unchanged. C-SIFT is known to be scale-invariant with respect to light intensity. See [15,3] for a detailed evaluation.

RGB-SIFT. For RGB-SIFT, the SIFT feature is computed for each RGB channel independently. Due to the normalizations performed within SIFT, it is equal to transformed color SIFT [3]. The feature is scale-invariant, shift-invariant, and invariant to light color changes and shift.
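As a small illustration of the opponent color space referred to above, the conversion from RGB can be sketched as follows; the normalization constants follow the common definition used in [3], and this is a sketch rather than the exact implementation behind the descriptors.

```python
import numpy as np

def rgb_to_opponent(rgb):
    """Convert an RGB image (H x W x 3 float array) to the opponent color space.

    O1 and O2 carry the color information, O3 carries the intensity information.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    O1 = (R - G) / np.sqrt(2.0)
    O2 = (R + G - 2.0 * B) / np.sqrt(6.0)
    O3 = (R + G + B) / np.sqrt(3.0)
    return np.stack([O1, O2, O3], axis=-1)
```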
2.3 Bag-of-Words Model
We follow the well-known bag-of-words model, also known as the codebook approach, see e.g. [16,10,17,18,3]. First, we assign visual descriptors to discrete codewords predefined in a codebook. Then, we use the frequency distribution of the codewords as a feature vector representing an image. We construct a codebook with a maximum size of 4096 using k-means clustering. An important issue is codeword assignment. A comparison of codeword assignment methods is presented in [18]. Here we only detail two codeword assignment methods:

– Hard assignment. Given a codebook of codewords, the traditional codebook approach assigns each descriptor to the single best representative codeword in the codebook. Basically, an image is represented by a histogram of codeword frequencies describing the probability density over codewords.
– Soft assignment. The traditional codebook approach may be improved by using soft assignment through kernel codebooks. A kernel codebook uses a kernel function to smooth the hard assignment of image features to codewords. Out of the various forms of kernel codebooks, we selected codeword uncertainty based on its empirical performance [18].

Each of the possible sampling methods from Section 2.1, coupled with each visual descriptor from Section 2.2 and an assignment approach, results in a separate visual codebook. An example is a codebook based on dense sampling of RGB-SIFT
features in combination with hard-assignment. Naturally, various configurations can be used to combine multiple of these choices. For simplicity, we employ equal weights in our experiments when combining different features.
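A hedged sketch of the two assignment schemes described above is given below: with sigma=None each descriptor votes for its nearest codeword (hard assignment), otherwise every codeword receives a Gaussian-weighted vote. The Gaussian form and the parameter sigma are assumptions standing in for the codeword-uncertainty formulation of [18].

```python
import numpy as np
from scipy.spatial.distance import cdist

def bow_histogram(descriptors, codebook, sigma=None):
    """Build a bag-of-words vector from local descriptors (n x d) and a codebook (k x d)."""
    d = cdist(descriptors, codebook)                   # (n, k) distances to codewords
    if sigma is None:                                  # hard assignment
        hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    else:                                              # soft assignment (kernel codebook)
        w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True) + 1e-12      # each descriptor distributes one vote
        hist = w.sum(axis=0)
    return hist / hist.sum()
```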
2.4 Machine Learning
The supervised machine learning process is composed of two phases: training and testing. In the first phase, the optimal configuration of features is learned from the training data. From all machine learning approaches on offer to learn the appearance relation between C and x_i, the support vector machine is commonly regarded as a solid choice [19]. Here we use the LIBSVM implementation [20] with probabilistic output [21]. The parameter of the support vector machine we optimize is C. In order to handle the imbalance in the number of positive versus negative training examples, we fix the weights of the positive and negative class by estimation from the class priors on the training data. It was shown by Zhang et al. [17] that in a codebook approach to concept detection the earth mover's distance and the χ2 kernel are to be preferred. We employ the χ2 kernel, as it is less expensive in terms of computation. In the second machine learning phase, the classifier assigns a probability p(C|x_i) to each input feature vector for each semantic concept C, i.e. the trained model is applied to the test data.
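The training step can be sketched with scikit-learn's support vector classifier on a precomputed χ2 kernel. The exact weighting rule below (each class weighted by the other class's prior) is an assumption, since the text only states that the weights are fixed from the class priors, and the authors actually use LIBSVM rather than scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_concept_detector(X_train, y, C=1.0, gamma=1.0):
    """Train one binary concept detector (labels in {-1, +1}) on a precomputed chi2 kernel."""
    y = np.asarray(y)
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)
    n = len(y)
    weights = {1: np.sum(y == -1) / n, -1: np.sum(y == 1) / n}   # rarer class weighted up
    clf = SVC(C=C, kernel="precomputed", probability=True, class_weight=weights)
    clf.fit(K_train, y)
    return clf

# Illustrative usage for p(C | x_i) on test images:
# clf = train_concept_detector(X_train, y_train)
# K_test = chi2_kernel(X_test, X_train)
# probabilities = clf.predict_proba(K_test)[:, list(clf.classes_).index(1)]
```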
3 Concept Detection Experiments

3.1 Submitted Runs

We have submitted five different runs. All runs use both Harris-Laplace and dense sampling with the SVM classifier. We do not use the EXIF metadata provided for the photos.

– OpponentSIFT: single color descriptor with hard assignment.
– 2-SIFT: uses OpponentSIFT and SIFT descriptors.
– 4-SIFT: uses OpponentSIFT, C-SIFT, RGB-SIFT and SIFT descriptors.
– Soft 4-SIFT: uses OpponentSIFT, C-SIFT, RGB-SIFT and SIFT descriptors with soft assignment. The soft assignment parameters have been taken from our PASCAL VOC 2008 system [3].
– Rescaled 4-SIFT: the same ordering of images as 4-SIFT, but with all concept detector outputs linearly scaled so the number of images with a score > 0.5 is equal to the concept prior probability in the training set.
3.2 Evaluation per Concept
In table 1, the overall scores for the evaluation of concept detectors are shown. As for the evaluation of single detectors only the ranking of the images within a single concept matters, the rescaled version of 4-SIFT achieves the exact same performance as 4-SIFT. We note that the 4-SIFT run with hard assignment
Table 1. Overall results of our runs evaluated over all concepts in the Photo Annotation task using the equal error rate (EER) and the area under the curve (AUC)

Run name       Codebook          Average EER   Average AUC
4-SIFT         Hard-assignment   0.2345        0.8387
Soft 4-SIFT    Soft-assignment   0.2355        0.8375
2-SIFT         Hard-assignment   0.2435        0.8300
OpponentSIFT   Hard-assignment   0.2530        0.8217
Table 2. Results per concept for our runs in the Large-Scale Visual Concept Detection Task using the Area Under the Curve. The highest score per concept is highlighted using a grey background. The concepts are ordered by their highest score.

Concept                4-SIFT   Soft 4-SIFT   2-SIFT   Opp. SIFT
Clouds                 0.958    0.958         0.951    0.945
Sunset-Sunrise         0.953    0.954         0.947    0.946
Sky                    0.945    0.948         0.935    0.930
Landscape-Nature       0.944    0.942         0.940    0.936
Sea                    0.935    0.930         0.932    0.926
Mountains              0.934    0.931         0.930    0.922
Lake                   0.911    0.903         0.912    0.900
Beach-Holidays         0.906    0.907         0.898    0.884
Trees                  0.903    0.902         0.892    0.881
Water                  0.901    0.903         0.892    0.886
Night                  0.898    0.895         0.895    0.892
River                  0.897    0.889         0.891    0.883
Outdoor                0.890    0.896         0.879    0.871
Food                   0.895    0.895         0.881    0.877
Desert                 0.891    0.865         0.891    0.884
Building-Sights        0.880    0.882         0.873    0.861
Big-Group              0.881    0.877         0.870    0.858
Plants                 0.877    0.881         0.853    0.839
Flowers                0.868    0.875         0.846    0.836
Autumn                 0.870    0.866         0.863    0.849
Portrait               0.865    0.864         0.857    0.846
Underexposed           0.858    0.859         0.857    0.854
No-Persons             0.850    0.858         0.837    0.826
Partly-Blurred         0.852    0.852         0.845    0.830
Winter                 0.843    0.846         0.832    0.828
Snow                   0.846    0.845         0.829    0.825
Day                    0.841    0.845         0.831    0.824
No-Blur                0.843    0.845         0.836    0.823
No-Visual-Time         0.833    0.835         0.822    0.815
Indoor                 0.830    0.835         0.823    0.810
Family-Friends         0.834    0.834         0.822    0.813
Partylife              0.834    0.834         0.831    0.819
Vehicle                0.832    0.832         0.832    0.822
Animals                0.818    0.828         0.811    0.797
Citylife               0.826    0.826         0.819    0.813
Still-Life             0.824    0.825         0.808    0.795
Spring                 0.822    0.801         0.812    0.791
Canvas                 0.817    0.810         0.803    0.790
Summer                 0.813    0.813         0.791    0.782
Macro                  0.812    0.791         0.805    0.795
No-Visual-Season       0.805    0.806         0.794    0.782
Small-Group            0.792    0.795         0.784    0.776
Single-Person          0.792    0.795         0.780    0.769
Out-of-focus           0.792    0.781         0.784    0.774
No-Visual-Place        0.789    0.786         0.781    0.779
Overexposed            0.788    0.782         0.777    0.771
Neutral-Illumination   0.778    0.783         0.775    0.774
Sunny                  0.763    0.765         0.744    0.741
Motion-Blur            0.744    0.747         0.725    0.710
Sports                 0.695    0.695         0.679    0.673
Aesthetic-Impression   0.658    0.662         0.657    0.657
Overall-Quality        0.656    0.656         0.653    0.658
Fancy                  0.565    0.559         0.580    0.583
Average                0.8387   0.8375        0.8300   0.8217
achieves not only the highest performance amongst our runs, but also over all other runs submitted to the Large-Scale Visual Concept Detection task. In table 2, the Area Under the Curve scores have been split out per concept. We observe that the three aesthetic concepts have the lowest scores. This comes as no surprise, because these concepts are highly subjective: even human annotators only agree around 80% of the time with each other. For virtually all concepts besides the aesthetic ones, either the Soft 4-SIFT or the Hard 4-SIFT is the best run. This confirms our beliefs that these (color) descriptors are not redundant when used in combinations. Therefore, we recommend the use of these 4 descriptors instead of 1 or 2. The difference in overall performance between the Soft 4-SIFT or the Hard 4-SIFT run is quite small. Because the soft codebook
Table 3. Results using the hierarchical evaluation measures for our runs in the Large-Scale Visual Concept Detection Task

                                     Average Annotation Score
Run name          Codebook          with agreement   without agreement
Soft 4-SIFT       Soft-assignment   0.7647           0.7400
4-SIFT            Hard-assignment   0.7623           0.7374
2-SIFT            Hard-assignment   0.7581           0.7329
OpponentSIFT      Hard-assignment   0.7491           0.7232
Rescaled 4-SIFT   Hard-assignment   0.7398           0.7199
assignment smoothing parameter was directly taken from a different dataset, we expect that the soft assignment run could be improved if the soft assignment parameter was selected with cross-validation on the training set. Together, our runs obtain the highest Area Under the Curve scores for 40 out of 53 concepts in the Photo Annotation task (20 for Soft 4-SIFT, 17 for 4-SIFT and 3 for the other runs). This analysis has shown us that our system is falling behind for concepts that correspond to conditions we have included invariance against. Our method is designed to be robust to unsharp images, so for Out-of-focus, Partly-Blurred and No-Blur there are better approaches possible. For the concepts Overexposed, Underexposed, Neutral-Illumination, Night and Sunny, recognizing how the scene is illuminated is very important. Because we are using invariant color descriptors, a lot of the discriminative lighting information is no longer present in the descriptors. Again, there should be better approaches possible for these concepts, such as estimating the color temperature and overall light intensity.
3.3 Evaluation per Image
For the hierarchical evaluation, overall results are shown in table 3. When compared to the evaluation per concept, the Soft 4-SIFT run is now slightly better than the normal 4-SIFT run. Our attempt to improve performance for the hierarchical evaluation measure using a linear rescaling of the concept likelihoods has had the opposite effect: the normal 4-SIFT run is better than the Rescaled 4-SIFT run. Therefore, further investigation into building a cascade of concept classifiers is needed, as simply using the individual concept classifiers with their class priors does not work.
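The Rescaled 4-SIFT run described in Section 3.1 only constrains how many images end up above 0.5; one possible monotone (order-preserving) realization of that constraint is sketched below. This particular piecewise-linear mapping is an assumption of the sketch, not necessarily the transformation the authors used.

```python
import numpy as np

def rescale_to_prior(scores, prior):
    """Rescale detector outputs in [0, 1] so that a fraction `prior` of the images
    scores above 0.5, while preserving the original ranking."""
    scores = np.asarray(scores, dtype=float)
    t = np.quantile(scores, 1.0 - prior)        # threshold that puts `prior` of images on top
    high = 0.5 + 0.5 * (scores - t) / max(1.0 - t, 1e-12)
    low = 0.5 * scores / max(t, 1e-12)
    return np.where(scores >= t, high, low)
```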
4 Conclusion
Our focus on invariant visual features for concept detection in ImageCLEF 2009 has been successful. It has resulted in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 individual concepts, we obtain the best performance of all submissions to the task. For the hierarchical evaluation, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors. Acknowledgements. Special thanks to Cees Snoek, Jasper Uijlings and Jan van Gemert for providing valuable input and their cooperation over the years.
References 1. Snoek, C.G.M., Worring, M.: Concept-based video retrieval. FTIR 4(2), 215–322 (2009) 2. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 3. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on PAMI (in press, 2010) 4. Snoek, C.G.M., van de Sande, K.E.A., de Rooij, O., Huurnink, B., van Gemert, J.C., Uijlings, J.R.R., et al.: The MediaMill TRECVID 2008 semantic video search engine. In: TRECVID Workshop (2008) 5. Uijlings, J.R.R., Smeulders, A.W.M., Scha, R.J.H.: Real-time bag-of-words, approximately. In: ACM CIVR (2009) 6. Marszalek, M., Schmid, C., Harzallah, H., van de Weijer, J.: Learning object representations for visual object class recognition. In: Visual Recognition Challenge Workshop, in Conjunction with IEEE ICCV (2007) 7. Wang, D., Liu, X., Luo, L., Li, J., Zhang, B.: Video diver: generic video indexing with diverse features. In: ACM MIR, Augsburg, Germany, pp. 61–70 (2007) 8. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: A survey. FTCGV 3(3), 177–280 (2008) 9. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: IEEE CVPR, vol. 2, pp. 524–531 (2005) 10. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: IEEE ICCV, Beijing, China, pp. 604–610 (2005) 11. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE CVPR, vol. 2, pp. 2169–2178 (2006) 12. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: A comparison of color features for visual concept classification. In: ACM CIVR, pp. 141–150 (2008) 13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 14. Geusebroek, J.M., van den Boomgaard, R., Smeulders, A.W.M., Geerts, H.: Color invariance. IEEE Transactions on PAMI 23(12), 1338–1350 (2001) 15. Burghouts, G.J., Geusebroek, J.M.: Performance evaluation of local color invariants. CVIU 113, 48–62 (2009) 16. Leung, T.K., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV 43(1), 29–44 (2001) 17. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV 73(2), 213–238 (2007) 18. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Visual word ambiguity. IEEE Transactions on PAMI (in press, 2010) 19. Vapnik, V.N.: The Nature of Statistical Learning Theory. 2nd edn (2000) 20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~ cjlin/libsvm 21. Lin, H.T., Lin, C.J., Weng, R.C.: A note on Platt’s probabilistic outputs for support vector machines. ML 68(3), 267–276 (2007)
Enhancing Recognition of Visual Concepts with Primitive Color Histograms via Non-sparse Multiple Kernel Learning Alexander Binder1 and Motoaki Kawanabe1,2 1
Fraunhofer Institute FIRST, Kekuléstr. 7, 12489 Berlin, Germany {alexander.binder,motoaki.kawanabe}@first.fraunhofer.de 2 TU Berlin, Franklinstr. 28/29, 10587 Berlin, Germany
Abstract. In order to achieve good performance in image annotation tasks, it is necessary to combine information from various image features. In recent competitions on photo annotation, many groups employed bag-of-words (BoW) representations based on SIFT descriptors over various color channels. In fact, it has been observed that adding other, less informative features to the standard BoW degrades recognition performance. In this contribution, we will show that even primitive color histograms can enhance the standard classifiers in the ImageCLEF 2009 photo annotation task, if the feature weights are tuned optimally by the non-sparse multiple kernel learning (MKL) proposed by Kloft et al. Additionally, we will propose a sorting scheme of image subregions to deal with spatial variability within each visual concept.
1 Introduction
Recent research results show that combining information from various image features is inevitable to achieve good performance in image annotation tasks. With the support vector machine (SVM) [1,2], this is implemented by mixing kernels (similarities between images) constructed from different image descriptors with appropriate weights. For instance, the average kernel with uniform weights or the optimal kernel trained by multiple kernel learning (called ℓ1-MKL later) have been used so far. Since the sparse ℓ1-MKL tends to overfit by ignoring quite a few kernels, Kloft et al. [3] proposed the non-sparse MKL with an ℓp-regularizer (p ≥ 1), which bridges the average kernel (p = ∞) and ℓ1-MKL. The non-sparse MKL has been successfully applied to object classification tasks; it could outperform the two baseline methods by optimizing the tuning parameter p ≥ 1 through cross validation. In particular, it is useful for combining less informative features such as color histograms with the standard bag-of-words (BoW) representations [4]. We will show that with ℓp-MKL additional simple features can enhance the classification performance for some visual concepts in the ImageCLEF 2009 photo annotation task [5], while with the average kernel they just degrade recognition rates. Since the images are not aligned, we will also propose a sorting scheme of image subregions to deal with the spatial variability when computing similarities between different images.
2 Features and Kernels Used in Our Experiments
Features. For the following experiments, we prepared two kinds of image features: one is the BoW representation based on the SIFT descriptors [6] and the other is the pyramid histograms [7] of color intensities (PHoCol). The BoW features were constructed in a standard way. By the code used in [8], the SIFT descriptors were computed on a dense grid of step size six over multiple color channels: red, green, blue, and grey. Then, for both the grey and the combined red-green-blue channels, 4000 visual words (prototypes) were generated by using k-means clustering with large sets of SIFT descriptors selected randomly from the training images, in analogy to [9]. For each image, one of the visual words was assigned to the base SIFT at each grid point and the set of words was summarized in a histogram within each cell of the spatial tilings 1 × 1, 2 × 2 and 3 × 1 [7]. Finally, we obtained 6 BoW features (2 colors × 3 pyramid levels). On the other hand, the PHoCol features were computed by making histograms of color intensities with 10 bins within each cell of the spatial tilings 4 × 4 and 8 × 8 for various color channels: grey, opponent color 1, opponent color 2, normalized red, normalized green, normalized blue. The finer pyramid levels were considered because the intensity histograms usually contain only little information.

Sorting the color histograms. The spatial pyramid representation [7] is very useful, in particular, when annotating aligned images, because we can incorporate spatial relations of visual contents in images properly. However, if we want to use histograms on higher-level pyramid tilings (4 × 4 and 8 × 8) as parts of input features for general digital photos of the annotation task, it is necessary to handle large spatial variability within each visual concept. Therefore, we propose to sort the cells within a pyramid tiling according to the slant of the histograms. Mathematically, our sort criterion sl(h) is defined as

a[h]_i = \sum_{k \le i} h_k, \qquad \bar{a}[h]_i = \frac{a[h]_i}{\sum_k a[h]_k}, \qquad sl(h) = -\sum_i \bar{a}[h]_i \ln(\bar{a}[h]_i).   (1)

The idea behind the criterion can be explained intuitively. The accumulation process a[h] maps a histogram h with only one peak at the highest intensity bin to the minimum entropy distribution (Fig. 1, left) and one with only one peak at the lowest intensity bin to the maximum entropy distribution (Fig. 1, right). If the original histogram h is flat, the accumulated histogram a[h] becomes a linearly increasing function which has an entropy in between the two extremes (Fig. 1, middle). On the other hand, it is natural to think that all possible permutations π are not equally likely in the sorting of the image cells. In many cases, spatial positions of visual concepts can change more horizontally than vertically (e.g. sky, sea). Therefore, we introduced a sort cost in order to punish large changes of the vertical positions of the image cells before and after sorting:

sc(\pi) = C \sum_k \max(v(\pi(k)) - v(k), 0).   (2)
Fig. 1. Explanation of the slant score. Upper: intensity histograms. Lower: corresponding histograms accumulated.
Here v(i) denotes the vertical position of the i-th cell within a pyramid tiling, and the constant C is chosen such that the sort cost is upper-bounded by one and lies in a similar range as the χ2-distance between the color histograms. The sort cost is used to modify the PHoCol kernels: when comparing images x and y, the squared difference between the sort costs is added to the χ2-distance between the color histograms,

k(x, y) = \exp\big[-\sigma \{ d_{\chi^2}(h_x, h_y) + (sc_x - sc_y)^2 \}\big].   (3)
In our experiments, we computed the sorted PHoCol features on 4 × 4 and 8 × 8 pyramid tilings and constructed kernels with and without the sort cost modification. Although the intensity-based features have a lesser performance as standalone image descriptors even after the sorting modifications, combining them with the standard BoW representations can enhance performances in some of the 53 classification problems in the ImageCLEF09 task with almost no additional computation costs. Kernels. We used the χ2 -kernel except for the cases that the sort cost was incorporated. The kernel width was set to be the mean of the inner χ2 -distances computed over the training data. All kernels were normalized.
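A small sketch of Eqs. (1)-(3) is given below. The value of C and the convention for the permutation in the sort cost (which of the two cell positions enters first) are assumptions of the sketch; in the paper C is chosen so that the cost is bounded by one, and the kernel width σ is set to the mean χ2-distance on the training data.

```python
import numpy as np

def slant_score(h):
    """Eq. (1): entropy of the normalized accumulated histogram."""
    a = np.cumsum(np.asarray(h, dtype=float))
    a /= a.sum()
    a = a[a > 0]                                  # treat 0 * ln(0) as 0
    return -np.sum(a * np.log(a))

def sort_cells(cell_histograms, cols, C=0.1):
    """Sort the cells of one pyramid tiling by slant and return the sort cost of Eq. (2).

    cell_histograms are given in row-major order; v(i) = i // cols is the row of cell i.
    """
    order = np.argsort([slant_score(h) for h in cell_histograms])
    cost = C * sum(max((new // cols) - (old // cols), 0)
                   for new, old in enumerate(order))
    return [cell_histograms[i] for i in order], cost

def chi2_distance(hx, hy, eps=1e-10):
    hx, hy = np.asarray(hx, dtype=float), np.asarray(hy, dtype=float)
    return np.sum((hx - hy) ** 2 / (hx + hy + eps))

def phocol_kernel(hx, scx, hy, scy, sigma=1.0):
    """Eq. (3): chi2 kernel on sorted histograms, modified by the sort-cost difference."""
    return np.exp(-sigma * (chi2_distance(hx, hy) + (scx - scy) ** 2))
```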
3 Experimental Results
We aim at showing an improvement over a gold standard represented by BoW features with the average kernel, while lacking the ground truth on the test data. Therefore we evaluated all settings using 10-fold cross validation on the ImageCLEF09 photo annotation training data, consisting of 5000 images. This allows us to perform statistical testing and to predict generalization errors for selecting better methods/models. The BoW baseline is a reduced version of our ImageCLEF submission described in the working notes. The submitted version gave rise to results behind the ISIS and INRIA-LEAR groups by AUC
(margins 0.022, 0.006) and by EER also behind CVIUI2R (margins 0.02, 0.005, 0.001). XRCE and CVIUI2R performed better by the hierarchy measure (margins 0.013, 0.014). We report in this section a performance comparison between SVMs with the average kernels, the sparse ℓ1-MKL, and the non-sparse ℓp-MKL [3]. In ℓp-MKL, the tuning parameter p is selected for each class from the set {1.0625, 1.125, 1.25, 1.5, 2} by cross validation scores, and the regularization parameter of the SVMs was fixed to one. We chose the average precision (AP) as the evaluation criterion, which is also employed in the Pascal VOC Challenges, due to its sensitivity to smaller changes even when AUC values are already saturated above 0.9. This rank-based measure is invariant against the actual choice of a bias. We did not employ the equal error rate (EER), because it can suffer from the unbalanced sizes of the ImageCLEF09 annotations. We remark that several classes have fewer than 100 positive samples, and generally no learning algorithm generalizes well in such cases. We will pose four questions and present experimental results to answer them in the following.

Does MKL help for combining the bag-of-words features? Our first question is whether the MKL techniques are useful compared to the average kernel SVMs for combining the default 6 BoW features. The upper panel of Fig. 2 shows the performance differences between ℓp-MKL with class-wise optimal p's and SVMs with the average kernels over all 53 categories. The classes are sorted as in the guidelines of the ImageCLEF09 photo annotation task. In general, we see just minor improvements by applying ℓp-MKL in 33 out of 53 classes, and for only one class it achieved a major gain. Seemingly, the chosen BoW features have on average similar information content. The cross-validated scores (average AP 0.4435 and average AUC 0.8118) of the baseline method imply that these 6 BoW features contributed mostly to the final results of our group achieved on the test data of the annotation task. On the other hand, the lower panel of Fig. 2 indicates that the canonical ℓ1-MKL is not a good idea in this case. On average over all classes ℓ1-MKL gives worse results compared to the baseline. We attribute this to the harmful effects of sparsity in noisy image data. Our observations are quantitatively supported by a Wilcoxon signed rank test (significance level α = 0.05) which can tell the significance of the performance differences. For ℓp-MKL vs the average kernel SVM, we have 10 statistically significant classes with 5 gains and 5 losses, while there are 12 statistically significant losses and only one gain in the comparison between ℓ1-MKL and the average kernel baseline.

Do sorted PHoCol features improve the BoW baseline? To answer this question we compared classifiers which take both the BoW and PHoCol features with baselines which rely only on the BoW representations. For each of the two cases and each class, we selected the best result in the AP score among various settings which will be explained later.
Fig. 2. Class-wise performance differences when combining the 6 BoW features. Upper: ℓp-MKL vs the average kernel SVM. Lower: ℓ1-MKL vs the average kernel SVM. Baseline mean AP 0.4434, mean AUC 0.8118.
For combinations of the BoW and PHoCol features, we considered the six sets of base kernels in Table 1. For each set, the kernel weights are learned by ℓp-MKL with the tuning parameters p ∈ {1, 1.0625, 1.25, 1.5, 2}. The baselines with the 6 BoW features only were also computed, by taking the best result from ℓp-MKL and the average kernel SVM. In Fig. 3, we can see several classes with larger improvements over the BoW baseline by employing the full setup including PHoCol features with the sort modification and the optimal kernel weights learned by ℓp-MKL. We also see slight decreases of the AP score on 6 classes out of all 53, where the worst setback is just of size 0.004. In fact, these are rather minor compared to the large gains on their complement. Note that the combinations with PHoCol did not include the average kernel SVM as an option, while the best performances with the BoW only could be achieved by the average kernel SVM. Thanks to the flexibility of ℓp-MKL, classification performances of the larger combination (PHoCol+BoW) were never much worse than those of the standard BoW classifiers, even though the PHoCols are much less informative.
Table 1. The sets of base kernels tested

set no.   BoWs    sorted PHoCols (color)         sort costs   spatial tiling
1         all 6   opponent color 1 & 2           no           both 4 × 4 & 8 × 8
2         all 6   opponent color 1 & 2           yes          both 4 × 4 & 8 × 8
3         all 6   grey                           no           both 4 × 4 & 8 × 8
4         all 6   grey                           yes          both 4 × 4 & 8 × 8
5         all 6   normalized red, green, blue    no           both 4 × 4 & 8 × 8
6         all 6   normalized red, green, blue    yes          both 4 × 4 & 8 × 8
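Once the per-class weights have been learned, the combined kernel used by the SVM is simply a weighted sum of the base kernel matrices; the sketch below only forms that sum and leaves out the actual ℓp-constrained weight optimization of [3], so the uniform default corresponds to the average-kernel baseline.

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Form the mixed kernel K = sum_m beta_m * K_m from base kernel matrices.

    kernels: array-like of shape (M, n, n); beta: weights of length M.
    beta=None gives uniform weights (the average-kernel baseline); in lp-MKL the
    weights are learned jointly with the SVM subject to ||beta||_p <= 1, beta >= 0.
    """
    kernels = np.asarray(kernels, dtype=float)
    if beta is None:
        beta = np.full(len(kernels), 1.0 / len(kernels))
    return np.tensordot(np.asarray(beta, dtype=float), kernels, axes=1)
```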
Fig. 3. Class-wise performance gains by the combination of the PHoCol and BoW over the standard BoW only. The baseline has mean AP 0.4434 and mean AUC 0.8118.
The gains were statistically significant according to Wilcoxon signed rank test with the level α = 0.05 on the 9 classes: Winter (13), Sky (21), Day (28), Sunset Sunrise (32), Underexposed (38), Neutral Illumination (39), Big Group (46), No Persons(47) and Aesthetic Impression (51) in Fig. 3. This is not surprising, as we would expect for these outdoor classes to have a certain color profile, while the two ’No’ and ’Neutral’ classes have a large number of samples for generalization via the learning algorithm. We remark that the sorted PHoCol features are very fast to compute and that the MKL training times are negligible compared to those necessary for computing SIFT descriptors, clustering and assigning visual words. Actually we could compute the PHoCol kernels on the fly. In summary, the result of this experiment shows that combining additional features with lower standalone performance can further improve recognition performances. In the next experiment we show that the non-sparse MKL is the key to the gain brought by the sorted PHoCol features. Does averaging suffice for combining extra PHoCol features? We consider again the same two cases (i.e. PHoCol+BoW vs Bow only) as in the previous experiments. In the first case, the average kernels are always used as the
Fig. 4. Class-wise performance differences. Upper: combined PHoCol and BoW with the average kernel vs the baseline with BoW only. Lower: combined PHoCol and BoW with ℓp-MKL vs the same features with the average kernel.
combination of the base kernels in each set instead of ℓp-MKL, and the best AP score was obtained for each class and each case. The performances of the second case were calculated in the same way as in the last experiment. From the upper panel of Fig. 4, we see a mixed result with more losses than gains. That is, the average kernels of PHoCol and BoW rather degrade the performance compared to the baselines with BoW only. Additionally, for the combination of PHoCol and BoW, we compared ℓp-MKL with the average kernel SVMs in the lower panel of Fig. 4. This result shows clearly that the average kernel fails in the combination of highly and poorly expressive features throughout most classes. We conclude that the non-sparse MKL techniques are essential to achieve further gains by combining extra sorted PHoCol features with the BoW representations.

Does the sort modification improve the PHoCol features? The default PHoCol features gave substantially better performances for the classes snow and desert, on which the sorted ones improve only somewhat compared to BoW models. We assume that the higher importance of color, together with the low spatial variability of color distributions in these concepts, explains the gap. The default PHoCols without sorting degraded performances strongly in three other
classes, where the sorted version does not lead to losses. In this sense, the sorting modification seems to make classifiers more stable on average over all classes.
4 Conclusions
We have shown that primitive color histograms can further enhance recognition performance over the standard procedure using BoW representations for most visual concepts of the ImageCLEF 2009 photo annotation task, if they are combined optimally by the recently developed non-sparse MKL techniques. This fact was not known before, and nobody has pursued this direction, because the average kernels constructed from such heterogeneous features degrade classification performance substantially due to high noise in the least informative kernels. Furthermore, we gave insights and evidence on when ℓp-MKL is particularly useful: it can achieve better performance when combining informative and noisy features, even if the average kernel SVMs and the sparse ℓ1-MKL fail.

Acknowledgements. We would like to thank Shinichi Nakajima, Marius Kloft, Ulf Brefeld and Klaus-Robert Müller for fruitful discussions. This work was supported in part by the Federal Ministry of Economics and Technology of Germany (BMWi) under the project THESEUS (01MQ07018).
References

1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
2. Müller, K.R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12(2), 181-201 (2001)
3. Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Müller, K.R., Zien, A.: Efficient and accurate Lp-norm multiple kernel learning. In: Advances in Neural Information Processing Systems (NIPS) (2009)
4. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV 2004, Prague, Czech Republic, pp. 1-22 (May 2004)
5. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large-scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comp. Vis. 60(2), 91-110 (2004)
7. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of CVPR 2006, New York, USA, pp. 2169-2178 (2006)
8. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pat. Anal. & Mach. Intel. 27(10), 1615-1630 (2005)
9. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pat. Anal. & Mach. Intel. (2010)
Using SIFT Method for Global Topological Localization for Indoor Environments Emanuela Boroş, George Roşca, and Adrian Iftene UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania {emanuela.boros,george.rosca,adiftene}@info.uaic.ro
Abstract. This paper gives a brief description of our system as one solution to the problem of global topological localization for indoor environments. The experiment involves analyzing images acquired with a perspective camera mounted on a robot platform and applying a feature-based method (SIFT) and two main systems in order to search and classify the given images. To obtain acceptable results and improved performance, the algorithm reaches two main maturity levels: one capable of running in real time while taking care of the computer's resources, and the other capable of correctly classifying the input images. One of the principal benefits of the developed system is a server-client architecture that brings efficiency, along with statistical methods that improve the quality of the data.
1 Introduction

A proper understanding of human learning is important to consider when making any decision. Our need to imitate the human capability of learning has become a main purpose of science, and so methods are needed to summarize, describe and classify collections of data. Recognizing places where we have been before and remembering objects that we have seen are difficult operations to imitate or interpret. This paper addresses the Robot Vision task¹, hosted for the first time in 2009 by ImageCLEF. The task addresses the problem of topological localization of a mobile robot using visual information. Specifically, we were asked to determine the topological location of a robot based on images acquired with a perspective camera mounted on a robot platform. We received training data consisting of an image sequence recorded in a five-room subsection of an indoor environment under fixed illumination conditions and at a given time [1]. Our system is able to globally localize the robot, i.e. to estimate the robot's position even if the robot is passively moved from one place to another within the mapped area. The process of recognizing the robot's position amounts to answering the question Where are you? (with possible answers: I'm in the kitchen, I'm in the corridor, etc.). This is achieved by combining an improved algorithm for object recognition from [5] with an algorithm for classifying amounts of data. The system is divided into two main managers for classifying images: one that we call the brute finder and the other the managed finder. The managed finder is an
Robot Vision task: http://www.imageclef.org/2009/robot
algorithmic search that picks the most representative images for every room. The brute finder applies an algorithm for extracting image features to every image in every room, creating meta files for all the images without exception. The rest of the paper is organized as follows: in section 2 we describe our system (the UAIC system), separated into its main components. Results and evaluation of the system are reported in sections 3 and 4. In the last section, conclusions regarding our participation in the Robot Vision task at ImageCLEF 2009 are drawn.
2 UAIC System

Our system is based mainly on the SIFT algorithm [5, 6] and aims to implement a method for extracting distinctive invariant features [3, 4] from images that can be used to perform reliable matching between different views of an object or scene (see Figure 1). The features have to be invariant to the many changes that images may undergo: translations, rotations, scale and luminance changes can all make two pictures of the same scene differ [2]. It is virtually impossible to compare two images using traditional methods such as a direct comparison of gray values, even though this would be really simple with an existing API (Java Advanced Imaging API²).
Fig. 1. UAIC system used in Robot Vision task
The vision of the project is to offer a source of information about local points of interest. The purpose of the information and visualized objects is
Java Advanced Imaging API (JAI): http://java.sun.com/javase/technologies/desktop/media/
to organize and communicate valuable data to people, so they can derive increased knowledge that guides their thinking and behavior. The architecture of the system is similar to a server-client architecture, and it is possible to handle several requests at a time. Therefore, one of the maturity levels of the system is the possibility of running in real time thanks to this ability. The server supports a training stage with the data from the IDOL database [7]. In addition, the server is responsible for the method for extracting distinctive invariant features from images (SIFT) that can be used to perform reliable matching between different views of an object or scene. The client is based on knowledge that it receives from the server on request, testing the new images. We did not choose a mechanism of incremental learning; we chose a statistical one, as people learn through observation, trial-and-error and experiment. As we know, learning happens during interaction. We manage the images (the interactions with objects in human learning) so that they become a system for storing their features.

2.1 The Server Module

This module has two parts: one necessary for training and one necessary for classifying, both based on key points (points of interest) extracted from images.

2.1.1 Trainer Component

Training and validation were performed on a subset of the publicly available IDOL2 database [7]. The database contains image sequences acquired in a five-room subsection of an office environment, under three different illumination settings and over a time frame of 6 months. The test sequences were acquired in the same environment, 20 months after the training data, and contain additional rooms that were not imaged previously. This means that variations such as changes in illumination conditions, changes in the rooms, people moving around and changes in the robot's path could take place. In the next step, the server loads all the key-point files obtained with the SIFT algorithm once and waits for requests. According to the number of images considered per room, we have two types of classifiers: the brute classifier and the managed classifier. The brute classifier uses all the available meta files in the training and in the finding process. The managed classifier creates representative meta files for the representative images from the training data. When it gets through all the steps explained in the SIFT algorithm subsection, it chooses only the images that have approximately 10-16 percent similarity with images treated before. In the end, we obtain only about 10 out of 50-60 images, and thus 10 meta files (with their key points): the most representative images for, in this case, every room that has been loaded as a training directory. With these two methods, we have trained our application twice (which took us 2 days): once for all pictures and once for the representative images.
lessens the damaging effects of image transformations to a certain extent. Features extracted by SIFT are invariant to image scaling and rotation, and partially invariant to photometric changes. This is the key step in achieving invariance to rotation, as the key point descriptor can be represented relative to this orientation and therefore achieves invariance to image rotation. All key points are written to meta files that represent the database for the server.

2.1.3 Classifier Component

In this iteration, the database for the management of points of interest is created and browsed using the SIFT algorithm. Access to the database is done by the server, which loads the files with key points and waits for requests. The first classifier, called the brute classifier, loads all the meta files into memory for training and for classifying. The second, called the managed classifier, creates representative meta files for the representative images from the batch. When it gets through all the steps explained in the SIFT algorithm subsection, it chooses only the images that have approximately 10-16 percent similarity with images treated before. In the end these selected pictures represent the most representative images, and we obtain only about 10 out of 50-60 images. Only these images, with their corresponding meta files (containing the key points), are then considered in the training and classification processes.

2.2 The Client Module

The client module is the tester and it has two phases: a naive one and a more precise one, both based on comparison. This module sends one image at a time to the server and receives a list of results that represents, in this case, the list of rooms to which the test images belong.
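One possible reading of the representative-image selection described above is a greedy filter over the keypoint-match ratio; the sketch below is only an illustration under that assumption, and the match_ratio callback (fraction of SIFT keypoints of one image matched in another) is a hypothetical helper, not part of the described system.

```python
def select_representatives(images, match_ratio, threshold=0.15):
    """Greedily keep an image only if it overlaps with every already kept image
    by no more than `threshold` (one reading of the 10-16 percent criterion)."""
    kept = []
    for img in images:
        if all(match_ratio(img, rep) <= threshold for rep in kept):
            kept.append(img)
    return kept
```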
3 Results

Experiments were done testing both of the server's methods of classifying images. The training in the case of selecting the most representative images from the provided batches of images takes longer than the other method, of course. The results in this case were far from what we were expecting: lower than in the case of using the brute finder. The results were more conclusive in the second case. As we knew that a new room was introduced in the robot's path, we had to test the system in two situations: one with the unknown room treated (not recognizing the room) and one with the unknown room not treated. Plain search is the process of getting the results for topological localization with the brute finder. Of the 21 runs submitted in the Robot Vision task, 5 were ours (see Table 1). The results were more conclusive with the brute finder even though it took a long time to complete the training. The get-representative method did not give the expected results, but it is faster in comparison to the brute method.
Table 1. UAIC runs in the Robot Vision Task

Run ID   Details                                                                                Score   Ranking
155      Full search using all frames; run duration: 2.5 days on one computer                   787.0   2
157      Run duration: 2.5 days for this run                                                    787.0   3
156      Search using representative pictures from all rooms; run duration: 30 minutes on one computer   599.5   5
158      Run duration: 30 minutes for one run                                                   595.5   6
159      Wise search on representative images with Unknown threaded                             296.5   13
4 Evaluation
In this section we try to identify the strengths and weaknesses of our approach. To that end we compare the results obtained for two of our better runs: the first, called Plain search with unknown rooms treated (PlainSearchUK), and the second, called Plain search with unknown rooms not treated (PlainSearchNoUK). For both runs we submitted 1690 values; for the first we obtained 1088 correct values, while for the second we obtained 963 correct values. As we can see from the graphical representation in Figure 2, the success rate with unknown rooms not treated (No UK) is better on the known rooms than in the other case, with unknown images treated. In the No UK case the brute finder found more correct values for the rooms that the robot knew (i.e. that the server had learned) than in the UK case. Because, when unknown rooms are not treated, the classifier (server) had to assign every image to a category (room), the percentage for every room increased significantly. The best results were obtained for the CR and KT rooms; this was possible because of the larger number of images representing those rooms and, of course, more key points. This means that the statistical methods applied need to be improved.
Fig. 2. Comparison between PlainSearchUK and PlainSearchNoUK Runs
5 Conclusions
This paper presents the UAIC system which took part in the Robot Vision task. We applied a feature-based method (SIFT) and two main systems in order to search and classify the given images. The first system uses the most important/representative images for each room category. The second system is a brute-force one, and the results in this case are statistically significant. From the analysis we deduce that the methods used behave better when the rooms involved in the comparison have more key points. Future work will focus on more productive statistical methods and perhaps on integrating the brute finder with the managed one.
References
1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Lindeberg, T.: Scale-space theory in computer vision. Kluwer Academic Publishers, Dordrecht (1994)
3. Lindeberg, T.: Feature detection with automatic scale selection. Technical report ISRN KTH NA/P-96/18-SE. Department of Numerical Analysis and Computing Science, Royal Institute of Technology, S-100 44 Stockholm, Sweden (1996)
4. Lindeberg, T., Bretzner, L.: Real-time scale selection in hybrid multi-scale representations. In: Griffin, L.D., Lillholm, M. (eds.) Scale-Space 2003. LNCS, vol. 2695, pp. 148–163. Springer, Heidelberg (2003)
5. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, Corfu, Greece, pp. 1150–1157 (1999)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proceedings of IROS, San Diego, CA, USA (2007)
UAIC at ImageCLEF 2009 Photo Annotation Task

Adrian Iftene, Loredana Vamanu, and Cosmina Croitoru

UAIC: Faculty of Computer Science, "Alexandru Ioan Cuza" University, Romania
{adiftene,loredana.vamanu,cosmina.croitoru}@info.uaic.ro
Abstract. In this paper, the UAIC system participating in the ImageCLEF 2009 Photo Annotation task is described. The UAIC team's debut in the ImageCLEF competition has enriched us with the experience of developing our first system for the Photo Annotation task, paving the way for subsequent ImageCLEF participations. Evaluation of the modules used showed that more work is needed to capture as many situations as possible.
1 Introduction
The ImageCLEF 2009 visual concept detection and annotation task used training and test sets consisting of thousands of images from the Flickr image database. All images had multiple annotations with references to holistic visual concepts and were annotated at an image-based level. The visual concepts were organized in a small ontology with 53 concepts, which could be used by the participants for the annotation task. For the image classification we used four components: (1) the first uses face recognition, (2) the second uses the training data, (3) the third uses the associated exif file, and (4) the fourth uses default values calculated according to the degree of occurrence in the training set. In what follows we describe our system and its main components, analyze the results obtained and discuss the experience gained.
2 The UAIC System
The system has four main components. The first component tries to identify people's faces in every image and then, according to the number of faces, performs the classification. The second uses for classification the clusters built from the training data and calculates, for every image, the minimum distance between the image and the clusters. The third uses for classification details extracted from the associated exif file. If none of these components can classify the image, the fourth component uses default values determined from the training set.
Face Recognition Module. Some categories imply the presence of people, such as Abstract Categories (through the concepts Family-Friends, PartyLife, Beach-Holidays, CityLife), Activity (through the concept Sports), and of course Persons and Representation (with the concept Portrait). We used Faint1, a Java library that recognizes whether there are any faces in a given photo.
1 Faint: http://faint.sourceforge.net/
It also reports how many faces there are and what percentage of the picture each face covers. In this way we were able to decide whether there is a big group, a small one, or whether the photo is a portrait. Unfortunately, when the lighting is not normal it works well in only about 80% of cases, compared with daytime pictures (according to our estimates).
Clustering using Training Data Module. We used a similarity-based process for finding some concepts. For this, we selected the most representative pictures for some concepts from the test data and used JAI2 (Java Advanced Imaging API) to manipulate images easily, together with a small program that calculates a similarity rate between the clusters of photos and the photo to be annotated. It was hard to find the most representative photos for the concepts and to build their associated clusters, as every concept can look very different in different seasons, at different times of day, etc.; but the hardest part was deciding the acceptance rate. Using the training data, we ran the program on images that we expected to be annotated with one or more of the concepts illustrated by the pictures in our small clusters and noted the rates. We did the same with pictures that should not be annotated with one of those concepts but that were very similar to the comparison pictures, and we noted those rates as well. In the end we settled on a compromise average rate, which became our threshold. This algorithm could of course be improved: a separate rate could be calculated for every cluster, which might make the program more accurate. The concepts we tried to annotate in this way were CityLife, Clouds, Snow, Desert and Sea.
Exif Processing Module. We processed the exif information of every picture and, according to the parameters of the camera with which the picture was taken, we were able to annotate concepts related, for example, to illumination, while also correlating them with other concepts found; some more abstract concepts like CityLife or Landscape-Nature could also be found this way. Because the information taken from the exif can be inaccurate and some of our limits can be subjective, concepts discovered in this way that were not clear-cut were set to 0.5.
Default Values Module. Five categories contain disjoint concepts, which implies that only one concept from such a category can be found in a photo. Taking this into consideration, if no concept from a disjoint category was discovered by any other method, a default value is inserted. The default concept is selected according to its degree of occurrence in the training set.
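As a rough illustration, the default-values step could be implemented along the following lines; this is a sketch under stated assumptions (annotations stored as sets of concept names per image, hypothetical function names), not the authors' code.

```python
# Hypothetical sketch of the default-values module: for each category of mutually
# exclusive concepts, fall back to the concept that occurs most often in the
# training annotations when no other module produced a decision.
from collections import Counter

def training_frequencies(train_annotations, category_concepts):
    """train_annotations: list of sets of concept names, one set per training image."""
    counts = Counter()
    for concepts in train_annotations:
        counts.update(concepts & category_concepts)
    return counts

def apply_defaults(image_concepts, disjoint_categories, train_annotations):
    """image_concepts: set of concepts already assigned to the test image."""
    for category in disjoint_categories:          # each category is a set of concept names
        if not (image_concepts & category):       # nothing assigned from this category yet
            counts = training_frequencies(train_annotations, category)
            if counts:
                image_concepts.add(counts.most_common(1)[0][0])
    return image_concepts
```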
3 Results
The run time needed for the test data was 24 hours. It took so long because of the similarity process, which had to compare each photo with 7 to 20 photos for every category built from the training data. We submitted only one run, in which we took into consideration the relation between categories and the hierarchical order. The scores for the hierarchical measure were over 65% both with and without annotator agreement; unfortunately, the results were not as good when the evaluation was made per concept, as we only tried to annotate 30 out of the 53 concepts (the average results are presented in Table 1 [3]).
2 JAI: http://java.sun.com/javase/technologies/desktop/media/jai/
Table 1. Average values for EER and AUC for the UAIC run

Run ID                               Average EER   Average AUC
UAIC_34_2_1244812428616_changed      0.4797        0.105589
Our detailed per-class AUC results are presented in Figure 2. We can see that many results are zero, while the average was 0.105589 and the best value was 0.469569, obtained for the winter class (number 13 in the figure below). Besides the winter class (13), we obtained good results for the classes autumn (12), night (29), day (28) and spring (10). In all these cases classification was done using the module that processes the exif file.
Fig. 2. AUC - Detailed Results on Classes
The lowest AUC values were obtained for classes for which we have no classification rules: partylife (1), family-friends (2), beach-holidays (3), etc. In all these cases none of our defined rules can be applied and the AUC value remains zero. We computed some statistics on the results obtained for every annotation technique we used, in order to determine which worked and which failed (see Table 2).

Table 2. Average values per method used to annotate

Method Applied by Module     EER        AUC
Additional File Processing   0.447955   0.225782
Default Values               0.467783   0.146464
Image Similarity             0.467672   0.132259
Face Recognition             0.487723   0.081350
Without Values               0.502011   0
The best results came from the exif processing; this helped us annotate photos with concepts from categories such as Season, Time of Day, Place and Blurring, as well as more abstract concepts such as Landscape-Nature. The results could have been improved if the limits used to annotate a concept from the camera parameters had been chosen more objectively. The use of image similarity gave fairly good results. One factor that probably contributed to incorrect annotations with this method is the choice of the most representative pictures for a concept, which is very subjective. The alternative of using many different pictures showing different views of a concept is not good enough either, as it can harm the similarity process, and it must also be taken into consideration that comparing two photos is computationally expensive. As for the face recognition method, we were expecting better results. What can probably be improved here is to find an algorithm that also recognizes a person even when the entire face is not captured in the photo.
4 Conclusions
The system we built for this task has four main components. The first component tries to identify people's faces in every image and then, according to the number of faces, performs the classification. The second uses for classification the clusters built from the training data and calculates, for every image, the minimum distance between the image and the clusters. The third uses for classification details extracted from the associated exif file. If none of these components can classify the image, the fourth module does so using default values. From our run evaluation we conclude that some of the applied rules are better than others. In the future, based on a detailed analysis of the results, we will try to apply the rules in descending order of their quality. We also intend to use prediction (as in [1]), in combination with a method that extracts feature vectors for each of the images, similar to [2].
References
1. Abonyi, J., Feil, B.: Cluster Analysis for Data Mining and System Identification. Birkhäuser, Basel (2007)
2. Hare, J.H., Lewis, P.H.: IAM@ImageCLEFPhotoAnnotation 2009: Naïve application of a linear-algebraic semantic space. In: CLEF Working Notes 2009, Corfu, Greece (2009)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Learning Global and Regional Features for Photo Annotation

Jiquan Ngiam and Hanlin Goh

Institute for Infocomm Research
1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore 138632
{jngiam,hlgoh}@i2r.a-star.edu.sg
Abstract. This paper describes a method that learns a variety of features to perform photo annotation. We introduce concept-specific regional features and combine them with global features. The regional features were extracted through a novel region selection algorithm based on Multiple Instance Learning. Supervised classification for photo annotation was learned using Support Vector Machines with extended Gaussian Kernels over the χ2 distance, together with a simple greedy feature selection. The method was evaluated using the ImageCLEF 2009 Photo Annotation task and competitive benchmarking results were achieved.
1 Introduction
In the ImageCLEF 2009 Photo annotation task, the concepts involved ranged from holistic concepts (e.g. landscape, seasons) to regional image elements (e.g. person, mountain), which only involved a sub-region of the image. This broad range of concepts necessitates the use of a large feature set. The set of features was coupled with feature selection using a simple greedy algorithm. For the regional concepts, we experimented with a new idea involving concept-specific region selection. We hypothesize that if we can find the relevant region supporting a concept, features from this region will be good indicators of whether the concept exists in the image. For these concepts, global features provide contextual information while regional features help improve classification. Our method follows the framework in [1], involving Support Vector Machines and extended Gaussian Kernels over the χ2 distance. The framework provides a structured approach for integrating the various global and local features.
2 Image Feature Extraction
2.1 Global Feature Extraction
The global features listed in Table 1 were computed over the entire image. Each type of feature provides a histogram describing the image. In features where a quantized HSV space was used, 12 Hue, 3 Saturation and 3 Value bins were employed, resulting in a total of 108 bins. The bins are of equal width in each dimension. The choice of these parameters was motivated by [2].
Table 1. List of global features extracted (feature, dimensionality, description)

HSV Histogram (108): Quantized HSV histogram.
Color Auto Correlogram [2] (432): Computed over a quantized HSV space with 4 distances: 1, 3, 5 and 7. For each color-distance pair (c, d), the probability of finding the same color at exactly distance d away was computed.
Color Coherence Vector [3] (216): Computed over a quantized HSV space with two states: coherent and incoherent. The τ parameter was set to 1% of the image size.
Census Transform [4] (256): A simple transformation of each pixel into an 8-bit value based on its 8 surrounding neighbors, using two states: either '>=' or '<' its neighbor.
Edge Orientation Histogram (37): Edge orientation histogram computed with the Canny edge detector. Each pixel is assigned to either an edge (with orientation) or non-edge. Orientations are quantized into 5-degree angle bins. An additional bin is concatenated for non-edges.
Interest Point SIFT [5] (500): SIFT features quantized into a visual-words dictionary of 500 visual words using k-means clustering.
Densely Sampled SIFT [6] (1500): SIFT features densely sampled at 10 pixel intervals, with 4 scales (4, 8, 12, 16 px radius) and 1 orientation. Features are quantized into 1500 visual words by k-means clustering.

2.2 Region Selection
The classification problem for some concepts can be framed in the Multiple Instance Learning (MIL) framework. The concept (e.g. Mountain) exists if and only if a region within the image demonstrates the concept. Hence, one is motivated to consider whether it is possible to improve classification performance by considering only the appropriate region(s) in an image. In our method, a region is defined as a bounding box, within which image features can be computed in a manner similar to the extraction of global features, but only for the pixels inside the bounding box. Hence, a descriptor for a region is a feature vector that is extracted based only on the region in a bounding box. Therefore, the problem of finding regional features is reduced to one of finding suitable bounding boxes for each image in a MIL setting, whereby each image is a bag of regions and a region is considered true iff it contains the target concept. Furthermore, a bag is true iff it contains a true region. EM-Diverse Density [7] was used together with Efficient Subwindow Search (ESS) [8] to search for a target concept with good diverse density. ESS considers all
possible bounding boxes, but requires multiple restarts since the algorithm is susceptible to local minima. For photo annotation, a region selector is learned for each concept based on the densely sampled SIFT features. From each region, three histograms, namely 1) HSV, 2) interest point SIFT and 3) densely sampled SIFT histograms, form the concept-specific regional features.
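To make the regional-feature idea concrete, the sketch below shows one way a bag-of-visual-words histogram could be computed from only the descriptors falling inside a selected bounding box; the dense sampling, the k-means codebook and the array layout are assumptions, not the authors' implementation.

```python
# Minimal sketch: turn a selected bounding box into a regional bag-of-visual-words
# histogram by keeping only the densely sampled descriptors inside the box.
import numpy as np

def regional_bow_histogram(points, descriptors, codebook, box):
    """points: (N, 2) array of (x, y) sample locations; descriptors: (N, D) array;
    codebook: a fitted k-means model; box: (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    inside = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
              (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
    if not inside.any():
        return np.zeros(codebook.n_clusters)
    words = codebook.predict(descriptors[inside])            # quantize to visual words
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                                 # L1-normalized histogram
```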
3 Learning and Feature Selection
3.1 Support Vector Machine
To obtain the final classification, each concept was considered independently using an SVM with extended Gaussian kernels over the χ2 distance [1]:

K(S_i, S_j) = \exp\Big( -\sum_{f \in \text{features}} \frac{1}{\mu_f} \, \chi^2\big(f(S_i), f(S_j)\big) \Big)    (1)
where μf is the average χ2 distance for a particular feature. μf is used to normalize the distances across different features. Both global and regional features were treated in the same manner.
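A small sketch of how such a combined kernel matrix could be computed is given below; it assumes L1-normalized histogram features and follows the exponentiated (extended Gaussian) form of Eq. (1), and it is an illustration rather than the authors' code. The resulting matrix could, for instance, be passed to an SVM that accepts a precomputed kernel.

```python
# Sketch of the combined extended Gaussian chi-square kernel of Eq. (1): each
# feature contributes its chi-square distance matrix, scaled by the average
# chi-square distance mu_f for that feature.
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_distance_matrix(X):
    """Pairwise chi-square distances for one feature; X: (n_images, n_bins)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = chi2_distance(X[i], X[j])
    return D

def combined_kernel(feature_matrices):
    """feature_matrices: list of (n_images, n_bins_f) arrays, one per feature."""
    total = 0.0
    for X in feature_matrices:
        D = chi2_distance_matrix(X)
        mu = D[np.triu_indices_from(D, k=1)].mean()   # average chi-square distance
        total = total + D / mu
    return np.exp(-total)                              # extended Gaussian kernel
```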
3.2 Greedy Feature Selection
Since different features work well with different concepts, weighting was incorporated into the kernel function to combine the different features. However, learning these weights is non-trivial and one often resorts to ad-hoc methods such as genetic algorithms. Here, we use a simple greedy algorithm for feature selection, as follows.

Algorithm. Greedy Feature Selection
1: F = {all global features}
2: For each feature f ∈ F: compute the error rate if f is removed
3: Remove the feature which results in the best improvement
4: Repeat (2-3) until removing any feature results in worse performance
5: Consider each feature f ∉ F: compute the error rate if f is added
6: Consider each feature f ∈ F: compute the error rate if f is removed
7: Add or remove the feature which gives the best improvement
8: Repeat (5-7) until a local optimum is reached
9: Return F
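The steps above translate fairly directly into code; in the sketch below, evaluate_error is a placeholder for the cross-validated error rate obtained with a given feature subset, and feature names are assumed to be strings.

```python
# Sketch of the greedy feature selection algorithm above. evaluate_error(features)
# stands in for the validation error of a classifier built on that feature subset.
def greedy_feature_selection(all_features, evaluate_error):
    selected = set(all_features)
    best_err = evaluate_error(selected)

    # Steps 2-4: backward elimination while removals keep improving.
    improved = True
    while improved and len(selected) > 1:
        improved = False
        err, feat = min((evaluate_error(selected - {f}), f) for f in selected)
        if err < best_err:
            selected.remove(feat)
            best_err, improved = err, True

    # Steps 5-8: single additions or removals until a local optimum is reached.
    improved = True
    while improved:
        improved = False
        moves = [('add', f, evaluate_error(selected | {f}))
                 for f in all_features if f not in selected]
        if len(selected) > 1:
            moves += [('remove', f, evaluate_error(selected - {f})) for f in selected]
        if not moves:
            break
        op, feat, err = min(moves, key=lambda m: m[2])
        if err < best_err:
            if op == 'add':
                selected.add(feat)
            else:
                selected.remove(feat)
            best_err, improved = err, True
    return selected
```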
4 Method Evaluation and Results
To evaluate the performance of the method and benchmark it against other methods from various groups worldwide, we used the data set provided for the ImageCLEF Photo Annotation 2009 task. The data set consists of 53 concepts
Table 2. Results evaluated with average EER and average AUC

Experiment Setting              Average EER   Average AUC
Global and regional features    0.253296      0.813893
Global features only            0.255945      0.811421
spanning from visual elements (trees, people) to abstract concepts (aesthetics, blur). Although an ontology was provided, our method did not use the ontology heavily, except for handling disjoint cases. The training and test sets consist of 5000 and 13,000 images respectively. Two different experimental settings using different features were evaluated. The first used all global and regional features, while only global features were used in the second setting. Their performances in terms of average equal error rate (EER) and average area under ROC curve (AUC) are shown in Table 2. We observe that regional features help to improve the performance slightly, though the difference is insignificant. While a new hierarchical measure was introduced by the organizers of this task, we did not specifically optimize for it because there was a lack of information regarding the annotator agreement values. Interestingly, for this evaluation metric, the experiment setting that used only global features negligibly outperformed the other setting that used all features.
References
1. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73(2), 213–238 (2007)
2. Ojala, T., Rautiainen, M., Matinmikko, E., Aittola, M.: Semantic image retrieval with HSV correlograms. In: 12th Scandinavian Conf. on Image Analysis, pp. 621–627 (2001)
3. Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: MULTIMEDIA 1996: Proceedings of the Fourth ACM International Conference on Multimedia, pp. 65–73. ACM, New York (1996)
4. Wu, J., Rehg, J.M.: Where am I: Place instance and category recognition using spatial PACT. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
6. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4), 712–727 (2008)
7. Zhang, Q., Goldman, S.A.: EM-DD: An improved multiple-instance learning technique. In: Advances in Neural Information Processing Systems, pp. 1073–1080. MIT Press, Cambridge (2001)
8. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8 (2008)
Improving Image Annotation in Imbalanced Classification Problems with Ranking SVM

Ali Fakeri-Tabrizi, Sabrina Tollari, Nicolas Usunier, and Patrick Gallinari

Université Pierre et Marie Curie - Paris 6,
Laboratoire d'Informatique de Paris 6 - UMR CNRS 7606
104 avenue du président Kennedy, 75016 Paris, France
firstname.lastname@lip6.fr
Abstract. We try to overcome the imbalanced data set problem in image annotation by choosing a convenient loss function for learning the classifier. Instead of training a standard SVM, we use a Ranking SVM in which the chosen loss function is helpful in the case of imbalanced data. We compare the Ranking SVM to a classical SVM with different visual features. We observe that Ranking SVM always improves the prediction quality, and can perform up to 23% better than the classical SVM.
1 Introduction
The goal of image annotation is to label an image with one or more concepts that it represents. Given a training set of labeled images, the standard way to solve this problem is to define several binary classification problems. For a given concept, the positive examples are the images that are labeled with this concept, and the negative examples are the other images. One of the main difficulties of this approach is that the datasets are generally highly imbalanced: it often happens that a vast majority (or minority) of the training images are labeled with a particular concept. For example, in the VCDT 2009 [3], the concept Neutral Illumination appears in 4656 images out of 5000, and the concept Beach Holidays has only 78 occurrences. The performance of standard classification algorithms is highly affected by this imbalance, because they are designed to optimize the accuracy (or, equivalently, the misclassification rate). In such a case, the accuracy is biased towards the majority class: taking the example of the concept Beach Holidays, a classifier that never predicts the presence of this concept has an accuracy of 1 − 78/5000 ≈ 98%. Thus it is possible to obtain a very good accuracy with a useless classifier. In this paper, we propose to change the learning algorithm to avoid the bias of the accuracy measure. To that end, during training, we use a ranking algorithm, namely a Ranking SVM [2]. Similarly to a standard SVM, a Ranking SVM learns a function that gives scores to examples, and the class label is decided by thresholding the score. Unlike a standard SVM, the ranking algorithm we use aims at optimizing the Area Under the ROC Curve (AUC) instead of the accuracy. This boils down to maximizing the probability that a positive example
has a greater score than a negative example. So, the loss of the Ranking SVM does not suffer from the same bias as the accuracy measure, and we can expect much better performance on imbalanced datasets. We present experiments with various visual features on data from the VCDT 2009. The results show that Ranking SVM can perform up to 23% better than a classical SVM.
2 SVM and Ranking SVM
Consider N training examples (x_i, y_i), i = 1, ..., N, with y_i ∈ {−1, 1}, and an embedding φ mapping the observations x_i into a feature space endowed with a dot product ⟨·,·⟩. The "classical" SVM finds a hyperplane ⟨w, φ(x)⟩ + b = 0 that separates the two classes as well as possible. The goal is thus to optimize the accuracy, or, equivalently, to minimize the misclassification rate E defined as:

E = \frac{1}{N} \sum_{i=1}^{N} [[\, y_i (\langle w, \phi(x_i)\rangle + b) \le 0 \,]]    (1)

where [[·]] is the indicator function. Formally, the primal formulation of the SVM is

\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i    (2)

subject to y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i and \xi_i \ge 0, where C is a user-specified regularization parameter. The embedding φ may be defined implicitly by a kernel function K(x_i, x_j). In this paper, however, we will only consider linear kernels.
The Ranking SVM does not strive to separate the two classes, but rather learns a score function of the form ⟨w, φ(x)⟩ that gives greater scores to positive examples than to negative ones. We choose to optimize the Area Under the ROC Curve (AUC) as in [2]. The AUC is the probability that a positive example has a greater score than a negative one, and can be computed as:

AUC = 1 - \frac{SwappedPairs}{p \cdot n}    (3)

where p (resp. n) is the number of positive (resp. negative) examples, and:

SwappedPairs = |\{ (i, j) : y_i > y_j \text{ and } \langle w, \phi(x_i)\rangle < \langle w, \phi(x_j)\rangle \}|    (4)

SwappedPairs is the number of pairs of examples that are in the wrong order. The Ranking SVM uses a training criterion similar to the classical SVM, but using pairs of examples rather than single examples. The primal formulation is:

\min_{w,\xi} \; \|w\|^2 + C \sum_{i,j : y_i > y_j} \xi_{i,j}    (5)

under the constraints \langle w, \phi(x_i)\rangle \ge 1 + \langle w, \phi(x_j)\rangle - \xi_{i,j} and \xi_{i,j} \ge 0.
Table 1. Equal Error Rate (EER) and Area Under the ROC Curve (AUC) scores obtained by SVM and Ranking SVM with various visual features. Ranking SVM, which deals with imbalanced data, always obtains the best results. N: number of dimensions.

Visual Descriptor   N      Classifier     EER            AUC
Random              -      -              0.495          0.506
HSV                 51     SVM            0.460          0.551
HSV                 51     Ranking SVM    0.378 (-18%)   0.669 (+21%)
SIFT                1024   SVM            0.451          0.561
SIFT                1024   Ranking SVM    0.350 (-22%)   0.690 (+23%)
SIFT+HSV            1075   SVM            0.459          0.552
SIFT+HSV            1075   Ranking SVM    0.378 (-18%)   0.669 (+21%)
Mixed+PCA           180    SVM            0.353          0.694
Mixed+PCA           180    Ranking SVM    0.294 (-17%)   0.771 (+11%)
Strictly speaking, a Ranking SVM does not learn a classifier. A classifier can however be obtained by comparing the scores with an appropriate threshold. In the following, the classifier is obtained by comparing the score to 0: if an observation x is such that ⟨w, φ(x)⟩ > 0, then we predict that x is in the positive class, otherwise we predict it in the negative class. Although this choice may not be optimal, it is a simple decision rule that gives good results in practice.
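The following short sketch illustrates the quantities in Eqs. (3)-(4) and the zero-threshold decision rule; the scores themselves would come from a trained ranking model (the paper uses SVMperf), so this is only an illustrative helper, not the authors' implementation.

```python
# Sketch of the AUC definition via swapped pairs (Eqs. 3-4) and the simple
# thresholding at 0 used to turn ranking scores into class predictions.
import numpy as np

def auc_from_scores(scores, labels):
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == -1]
    p, n = len(pos), len(neg)
    if p == 0 or n == 0:
        return float('nan')
    # a swapped pair is a positive example scored strictly below a negative one
    swapped = sum(int(np.sum(neg > s)) for s in pos)
    return 1.0 - swapped / (p * n)

def predict(scores, threshold=0.0):
    return np.where(np.asarray(scores) > threshold, 1, -1)
```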
3 Experiments
The corpus is composed of the 5000 Flickr images from the training set of the VCDT 2009 [3]. We randomly divide this set into two parts: a training set of 3000 images and a test set of 2000 images. Each image is tagged with one or more of the 53 hierarchical visual concepts. We want to show that using an appropriate loss function improves the performance for various visual features. First, we segment images into 3 horizontal regions and extract HSV features. Second, we extract SIFT keypoints and then cluster them to obtain a visual dictionary. Third, we perform an early fusion by concatenating the HSV and SIFT spaces (SIFT+HSV). Fourth, we use a concatenation of various visual features from 3 labs proposed by the AVEIR consortium [1], reduced using PCA (Mixed+PCA). To compare the performance of the classical SVM and of the Ranking SVM, we use SVMperf1, in which an SVM classifier and a Ranking SVM are implemented. We only consider linear kernels. Table 1 gives the results obtained by SVM and Ranking SVM with the various visual features on the 2000 test images. First, we notice that we obtain low results using SVM on HSV, SIFT and SIFT+HSV, and better results with Mixed+PCA. The early fusion of HSV and SIFT does not give better results than SIFT alone. For all experiments, the Ranking SVM gives better results than the SVM. Finally, using the Ranking SVM, the best results are obtained with the Mixed+PCA data. In Figure 1, we compare the differences in EER between SVM and Ranking SVM, both on Mixed+PCA, as a function of
1 http://svmlight.joachims.org/svm_perf.html
Fig. 1. Differences in EER between SVM and Ranking SVM, on Mixed+PCA, as a function of the number of positive examples in the training data set. The difference is bigger when the number of positive examples in the training set is small or large.
the number of positive examples in the training data set. We see that when the number of positive examples is small (or large), the difference is clearly visible. This means that, compared to the classical SVM, a Ranking SVM is particularly efficient when the data are highly imbalanced.
4 Conclusion
This work shows that the choice of the loss function is important in the case of imbalanced data sets. Using a Ranking SVM can improve the results. We also see that the results depend on the feature types. Still, for all the features, the use of the Ranking SVM improves performance by up to 23% compared to the classical SVM. As a perspective, we can study the importance of the decision threshold used with the Ranking SVM; we may increase performance by choosing a more appropriate threshold than 0. Acknowledgment. This work was partially supported by the French National Agency of Research (ANR-06-MDCA-002 AVEIR project).
References
1. Glotin, H., Fakeri-Tabrizi, A., Mulhem, P., Ferecatu, M., Zhao, Z.-Q., Tollari, S., Quenot, G., Sahbi, H., Dumont, E., Gallinari, P.: Comparison of various AVEIR visual concept detectors with an index of carefulness. In: CLEF Working Notes (2009)
2. Joachims, T.: A support vector method for multivariate performance measures. In: International Conference on Machine Learning, ICML (2005)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
University of Glasgow at ImageCLEF 2009 Robot Vision Task: A Rule Based Approach

Yue Feng, Martin Halvey, and Joemon M. Jose

Department of Computing Science, University of Glasgow, Glasgow, G12 8RZ, UK
{yuefeng,halvey,jj}@dcs.gla.ac.uk
Abstract. For the University of Glasgow's submission to the ImageCLEF 2009 Robot Vision task, a large set of interest points was extracted using an edge corner detector, and these points were used to represent each image. The RANSAC method [1] was then applied to estimate the similarity between test and training images based on the number of matched pairs of points. The location of the robot was then annotated based on the training image containing the highest number of matched point pairs with the test image. A set of decision rules with respect to the trajectory behaviour of the robot's motion was defined to refine the final results. An illumination filter was also applied in two of the runs in order to reduce illumination effects.
1 Introduction
We describe the approaches and results for 3 independent runs submitted by the University of Glasgow for the ImageCLEF 2009 Robot Vision task [2]. For this task training and validation were performed on a subset of the publicly available IDOL2 Database [3]. The database contains image sequences acquired in a 5-room subsection of an office setting, under 3 different illumination settings. Our strategy is to analyse the visual content of the test sequence and compare it with the training sequence to determine location. This approach is able to automatically and efficiently detect whether two images are similar, or find similar images within a database of images. This image matching approach estimates the visual distance between an unannotated frame and each frame in the training sequence and returns a ranked list, and the unannotated frame is annotated the same as the highest ranked training image. The image matching techniques are combined with knowledge of robot motion and trajectory to determine the robot's location. In addition, an illumination filter is integrated into one of the runs to minimise lighting effects with the goal of improving the predictive accuracy.
2 Methodology and Approach
As both the training and test sequences are captured using the same camera in the same geographic condition, it is assumed that frames taken in the same location will contain similar content and geometric information. Motivated by this assumption, our image matching algorithm consists of the following successive stages: (1) A corner detection method is used to create an initial group of points of interest (POI); (2) The
RANSAC algorithm [1] is applied to establish point correspondences between two frames and to calculate the fundamental matrix [4] (this matrix encodes an epipolar constraint on the general motion and rigid structure; it is used to compute the geometric information for refining matched point pairs); (3) the number of refined matched points is regarded as the similarity between the two frames. POI are used instead of all of the pixels in the frame in order to reduce computational cost. As corners are local image features characterised by locations where variations of intensity in both the X and Y directions are high, it is easier to detect and compare POI in such areas, for example edges or textures. In order to exploit this, a Harris corner detector [5] was employed to initialise the POI, as it has strong invariance to rotation, scale, illumination variation and image noise. The Harris corner detector uses the local autocorrelation function to measure local changes of the signal and detect corner positions in each frame. The next step is to use a point matching technique to establish point correspondences between two frames. The point matching method generates putative matches between two frames by looking for points that are maximally correlated with each other inside a window surrounding each point. Only points that correlate strongly with each other in both directions are returned. Given the initial set P of POI, a model parameter X used to check whether a point fits is first estimated using N points chosen at random from P. The number of points in P that fit the model, with values of X within a user-given tolerance T, is then counted. If this number is satisfactory, the estimate is regarded as a fit and the operation terminates with success. These operations are carried out in a loop over all the POI. In this work, T is set at 95%; this high threshold reduces the number of points of interest. The initial matching pairs may contain mismatches, so a post-processing step to refine the results is needed. Under the assumption that frames taken in the same location contain similar geometric information, a fundamental matrix [4] was applied. Given the initial matching points, the fundamental matrix F can be estimated from a minimum of seven point correspondences. Its seven parameters represent the only geometric information about the cameras that can be obtained through point correspondence alone. The computed F is then applied to all the matching pairs in order to eliminate incorrectly matched ones: a matched point pair should satisfy the epipolar constraint x'ᵀFx = 0, where x and x' are the corresponding points in the two images, and Fx describes an epipolar line on which the corresponding point x' in the other image must lie. After applying the matrix F to all the paired points, the number of matched point pairs remaining is regarded as the similarity between the two images. In order to localise the robot, we assume that the robot's position can be retrieved by finding the most similar frame in the training sequence. Given the results of point matching, each test frame can be annotated as being from one of the possible rooms, and the trajectory of the robot can be generated, represented by the extracted annotation information frame by frame.
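A rough sketch of this similarity score is shown below, assuming OpenCV; it substitutes SIFT descriptor matching for the paper's Harris-corner correlation windows, and the thresholds are illustrative assumptions rather than the authors' settings.

```python
# Rough sketch of the frame-similarity step: match keypoints between two frames,
# fit a fundamental matrix with RANSAC, and use the surviving inlier count as
# the similarity score (inliers approximately satisfy x'.T F x = 0).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def frame_similarity(img_a, img_b, ratio=0.75):
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 8:                         # not enough correspondences to estimate F
        return 0
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.95)
    if F is None:
        return 0
    return int(mask.sum())                    # number of geometrically consistent pairs
```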
By studying the training sequences released as part of the ImageCLEF 2009 Robot Vision training and test sets, we find that (i) the robot does not move "randomly", (ii) the period of time that the robot stays in one room is always more than 0.5 seconds, which corresponds to more than 12 continuous frames, and (iii) the robot always enters a room from outside and then exits it to the place
where it came from rather than to a different place. Based on the above observations, a set of rules to help determine the location of the robot at any given time was devised.
Rule 1: The robot will not stay in one place for a period of less than 20 frames. If rule 1 is violated and the location before the falsely detected period is the same as the location after it, the location of the false period is revised and annotated the same as the previous period.
Rule 2: If the location of the robot changes from room A to room B without passing through the corridor and printing area, there must be a false detection. If this rule is violated, a window of N frames is applied at the location boundary to recalculate the similarity: the similarities between the test image and its top 10 matched training images are summed as the recalculated similarity, and the training frames with the highest score are used to annotate the current frame with a location.
Rule 3: Since the test sequence contains additional rooms that were not imaged previously, no corresponding frames in the training set could be used to annotate these rooms. We define a rule that any frame with fewer than 15 matched point pairs against the training frames is annotated as an unknown room.
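A simplified sketch of how Rules 1 and 3 could be applied as post-processing is given below; Rule 2's boundary re-scoring is omitted, and the thresholds (20 frames, 15 matched pairs) are taken from the text. This is an illustrative reimplementation, not the authors' code.

```python
# Simplified sketch of the rule-based refinement: Rule 3 marks frames with too few
# matched pairs as an unknown room, and Rule 1 suppresses stays shorter than
# 20 frames when the surrounding label is the same on both sides.
def apply_rules(labels, match_counts, min_pairs=15, min_stay=20):
    # Rule 3: too few matched point pairs -> unknown room
    labels = ['UNKNOWN' if c < min_pairs else lab
              for lab, c in zip(labels, match_counts)]

    # Rule 1: collapse short runs bounded by the same label on both sides
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append([labels[start], start, i])   # [label, start, end)
            start = i
    for k in range(1, len(runs) - 1):
        label, s, e = runs[k]
        if e - s < min_stay and runs[k - 1][0] == runs[k + 1][0]:
            runs[k][0] = runs[k - 1][0]

    smoothed = []
    for label, s, e in runs:
        smoothed.extend([label] * (e - s))
    return smoothed
```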
3 Results and Evaluation
Three runs were submitted for the ImageCLEF 2009 Robot Vision task, each using a different combination of the image matching, decision rules and illumination filter approaches.
Run 1: Uses every first frame out of 5 continuous frames of both the training and testing sequences for image matching; this is followed by the application of the rule-based model to refine the results.
Run 2: Uses all of the frames in the training and testing sets for image matching, followed by the application of the rule-based model. The illumination filter is applied for pre-processing the frames.
Run 3: Uses every first frame out of 5 continuous frames of both the training and test sequences for image matching, followed by the application of the rule-based model to refine the results. In addition, an illumination filter called Retinex [6] was applied to improve the visual rendering of frames in which the lighting conditions are not good.
For runs 1 and 3 we reduce the number of frames considered in order to reduce redundancy: since every second consists of 25 frames, there is little change across 5 consecutive frames. Once a keyframe has been annotated, all the frames in that shot are annotated in the same way. Since the computational cost of our approach is linear, this keyframe representation can reduce the processing time by 80%. However, there is a chance that this may result in false detections as the robot changes location. All 3 runs were submitted for official evaluation, and the benefits of these approaches were measured using precision and the ImageCLEF score. The score is used as the measure of the overall performance of the systems and is calculated as follows: +1.0 for each correctly annotated frame, +1.0 for each correct detection of an unknown room, -0.5 points for each incorrectly annotated frame and 0 points for each image that was not annotated. Table 1 shows the results of the three runs. It can be seen clearly that the second run achieved the highest accuracy and score overall; this run was the best performing run for the obligatory task at the ImageCLEF 2009 Robot Vision task [2]. Comparing our 1st and 2nd runs, the 2nd run improves the accuracy from 59% to 68.5%, demonstrating that additional frames for training and testing can improve the image matching results. The illumination filter, however, does not improve performance.
Table 1. Results of the 3 submitted runs (1689 test frames in total)

Run               Accuracy   Score
Run 1 (Baseline)  59%        650.5
Run 2             68.5%      890.5
Run 3             25.9%      -188
4 Conclusion
In this paper we have described a vision-based localization framework for a mobile robot and applied this approach as part of the Robot Vision task for ImageCLEF 2009. This approach is applicable to indoor environments to identify the current location of the robot. The novelty of this approach is a methodology for obtaining image similarity using POI-based image matching, together with rule-based reasoning simulating the moving behaviour of the robot to refine the annotation results. The evaluation results show that our proposed method achieved the second highest score among all the submissions to the Robot Vision task in ImageCLEF 2009. The experimental results show that using a coarse-to-fine approach is successful, since it considers the visual information along with the motion behaviour of the mobile robot. The results also reflect the magnitude of the difficulties in robot vision, such as how to annotate an unknown room correctly; we believe we have gained insight into the practical problems and will use our findings in future work. Acknowledgments. This research is supported by the European Commission under contract FP6-027122-SALERO.
References
1. Lu, L., Dai, X., Hager, G.: Efficient particle filtering using RANSAC with application to 3D face tracking. Image and Vision Computing 24(6) (January 2006)
2. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. IROS (2007)
4. Zhong, H.X., Pang, Y.J., Feng, Y.P.: A new approach to estimating fundamental matrix. Image and Vision Computing 24(1) (2006)
5. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. Alvey Conf. (1987)
6. Rahman, Z., Jobson, D.J.: Retinex processing for automatic image enhancement. Journal of Electronic Imaging 13(1) (2004)
A Fast Visual Word Frequency - Inverse Image Frequency for Detector of Rare Concepts

Emilie Dumont 1,2, Hervé Glotin 1,2, Sébastien Paris 1, and Zhong-Qiu Zhao 3

1 Sciences and Information Lab. LSIS UMR CNRS 6168, France
2 University of Sud Toulon-Var, France
3 College of Computer Science and Information Engineering, Hefei University of Technology, China
Abstract. In this paper we propose an original image retrieval model inspired by the vector space information retrieval model. For different features and different scales we build a visual concept dictionary composed of visual words intended to represent a semantic concept, and we then represent an image by the frequency of the visual words within the image. Image similarity is then computed as in the textual domain, where a textual document is represented by a vector in which each component is the frequency of occurrence of a specific textual word in that document. We adapt the common text-based paradigm by using the TF-IDF weighting scheme to construct a WF-IIF weighting scheme in our Multi-Scale Visual Dictionary (MSVD) vector space model. The experiments are conducted on the 2009 ImageCLEF Visual Concept Detection campaign. We compare WF-IIF to the usual direct Support Vector Machine (SVM) algorithm and show that, on average over all concepts, SVM and WF-IIF give the same Area Under the Curve (AUC). We then discuss the fusion process that should enhance the whole system, as well as some particular properties of the MSVD, which should be less dependent on the training set size of each concept than the SVM.
1 Introduction
Visual document indexing and retrieval from digital libraries have been extensively studied for decades. In the literature, there has been a large variety of approaches proposed to retrieve images efficiently for users. A content-based image retrieval (CBIR) approach relies on certain low-level image features, such as color, shape, and texture, for retrieving images. A major drawback of this approach is that there is a 'semantic gap' between low-level features of images and high-level human concepts. Image analysis is a typical domain for which a high degree of abstraction from low-level methods is required, and where the semantic gap immediately affects the user. Recently, the main issue is how to relate low-level image features to high-level semantic concepts because if image content is to be identified to understand the meaning of an image, the only available independent information is the low-level pixel data. To recognize the displayed scenes from the raw data of an image
the algorithms for the selection and manipulation of pixels must be combined and parametrized in an adequate manner and finally linked to a natural description. In other words, research focuses on how to extract from low-level features semantics that approximate well the user's interpretation of the image content (objects, themes, events). The state-of-the-art techniques for reducing this gap fall mainly into three categories: i) using ontologies to define high-level concepts, ii) using machine learning tools to associate low-level features with query concepts, and iii) introducing relevance feedback into the retrieval process to improve responses. We propose here a new approach to semantic interpretation that uses a multi-scale mid-level visual representation. Images are systematically decomposed into regions of different sizes, and regions are represented according to several features. These different aspects are fused in stages to obtain a more complex image representation, in which an image is a vector of visual word frequencies. These visual words are defined by a concept Multi-Scale Visual Dictionary (MSVD). This original multi-scale analysis is intended to be robust to the large variety of visual concepts. Related approaches without this multi-scale extension can be found in the literature. Picard was the first to develop the general concept of a visual thesaurus, by transferring the main idea of a text dictionary to a visual dictionary [1]. One year later, she proposed examples of a visual dictionary based on texture, in particular the FourEyes system [2], but no experiment was carried out to show the quality of these systems. A first family of methods consists in building a visual dictionary from the feature vectors of regions of segmented images. In [3], the authors use a self-organizing map to select visual elements; in [4], SVMs are trained on image regions of a small number of images belonging to seven semantic categories. In [5,6], regions are clustered by similar visual features with competitive agglomeration clustering, and images are then represented as vectors based on this dictionary. The semantic content of those visual elements depends primarily on the quality of the segmentation.
2 Model Word Frequency - Inverse Image Frequency through a Multi-Scale Visual Dictionary
We propose to use a Multi-Scale Visual Dictionary to represent an image, where each visual word is intended to represent a semantic concept. Images are systematically decomposed into regions of different sizes, and regions are represented according to several features. These different aspects are fused in stages to obtain a more complex image representation, in which an image is a vector of visual word frequencies. This original multi-scale analysis is expected to be robust to the large variety of visual concepts. In a second step, we propose an adaptation of the common text-based paradigm. TF-IDF [7] is a classical information retrieval term weighting model, which estimates the importance of a term in a given textual document by multiplying the raw term frequency (TF) of the term in the document by the term's inverse document frequency (IDF) weight. Our image-based classifier is analogous to the TF-IDF approach, where we define a term
by a visual word, and we call this method Word Frequency - Inverse Image Frequency (WF-IIF).
2.1 Multi-Scale Visual Dictionary
Visual Atoms. Visual atoms, or visual elements, are intended to represent semantic concepts; they should be automatically computable, and an image should be automatically describable in terms of these visual elements. This visual representation should also be related to the content of the image. A visual element is an image area, i.e. images are split into a regular grid. The size of the grid is obviously a very important factor in the whole process: a smaller grid allows a more precise description with fewer visual elements, while a bigger grid may contain more information but requires a larger number of elements. We therefore propose a multi-scale process that integrates all the grid sizes and selects the best ones.
Global Visual Dictionary. Unlike the textual domain, there is no universal dictionary, so our first step is to automatically create a visual dictionary composed of a large set of visual words, each representing a sub-concept. Each visual element is represented by different feature vectors (color, texture, edge, ...). For each feature and grid size, we cluster the visual elements using the K-means algorithm, with a predefined number of clusters and the Euclidean distance, in order to group visual elements and to smooth some visual artefacts. K-means is one of the most popular iterative descent clustering methods; it is intended for situations in which all variables are quantitative, and the squared Euclidean distance is chosen as the dissimilarity measure. Then, for each cluster, we select the medoid to be a visual word w_i, and these words compose the visual dictionary of a feature, i.e. W = {w_i}, i = 1, ..., P, with P = K × nF × nG, where K, nF and nG denote the number of clusters in the K-means, the number of features computed for each block and the number of grids, respectively.
Image Transcription by Block Matching. Based on the visual dictionary, we replace each visual element by the nearest visual word (one of the medoids) in the dictionary. To match a block to a visual word, we find the visual word that minimizes a distance measure between the block's visual elements and the visual word. In this stage, every block of an image is matched to one of the visual words from the visual dictionary, and the image representation is then based on the frequency of the visual words within the image, for each feature. This is similar to the textual domain, where a textual document is represented by a vector in which each component is the frequency of occurrence of a specific textual word in that document.
Visual Vocabulary Reduction. In the textual domain, "bag-of-words" representations are surprisingly effective for text classification. The representation is high dimensional though, containing many words that are uninformative for text categorization, like "the", "a", etc. These uninformative words result in reduced
generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to remove the least relevant visual words from the bag-of-words representation. In a visual document, visual words do not have the same importance in determining the presence of a concept, so we want to select the most discriminative visual words for a given concept to compose a Visual Concept Dictionary associated with this concept. We use classical methods such as document frequency thresholding (DF), word frequency thresholding (WF), information gain (IG) [8], mutual information (MI) [9] and entropy-based reduction [10].
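The sketch below illustrates the dictionary construction and block transcription described above, assuming scikit-learn's KMeans and one descriptor matrix per feature and grid size; it is a minimal illustration, not the authors' implementation.

```python
# Sketch of the per-feature dictionary: cluster block descriptors with k-means and
# keep, for each cluster, the medoid (the real block closest to the centroid) as
# the visual word; then transcribe an image into a visual-word frequency histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(block_descriptors, k=250, seed=0):
    """block_descriptors: (n_blocks, dim) array for one feature and one grid size."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(block_descriptors)
    words = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        dists = np.linalg.norm(block_descriptors[members] - km.cluster_centers_[c], axis=1)
        words.append(block_descriptors[members[np.argmin(dists)]])   # the medoid
    return np.vstack(words)

def transcribe_image(block_descriptors, words):
    """Map each block to its nearest visual word and return a frequency histogram."""
    d = np.linalg.norm(block_descriptors[:, None, :] - words[None, :, :], axis=2)
    assignment = d.argmin(axis=1)
    hist = np.bincount(assignment, minlength=len(words)).astype(float)
    return hist / hist.sum()
```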
2.2 Vector Based Visual Concept Detection
We use the TF-IDF weighting scheme in the vector space model, together with the cosine similarity, to determine the similarity between a visual document and a concept. An image is represented as a vector whose dimensionality P is the number of words in the MSVD; each dimension corresponds to a separate visual word, and if a word occurs in the image its value in the vector is non-zero. Several different ways of computing word weights have been developed. One of the best known schemes is TF-IDF weighting: the term frequency TF is the number of times the term occurs in a document, while the IDF is the inverse of the number of documents in which the word occurs. In our case a document is an image, so we use the Word Frequency - Inverse Image Frequency (WF-IIF) model. The visual word count in a given image is simply the number of times a given visual word appears in that image. This count gives a measure of the importance of the visual word w^c_i within the particular image I_j. Thus we have the word frequency, defined as follows:

wf^c_{i,j} = \frac{n^c_{i,j}}{\sum_k n^c_{k,j}}    (1)

where n^c_{i,j} is the number of occurrences of the considered visual word w^c_i in image I_j for the concept c, and the denominator is the sum of the numbers of occurrences of all visual words in image I_j for the concept c. The inverse image frequency is a measure of the general importance of the visual word, obtained by dividing the number of all images by the number of images containing the visual word and then taking the logarithm of that quotient:

iif^c_i = \log \frac{|I^c|}{|\{I : w^c_i \in I^c\}|}    (2)

where |I^c| is the total number of images in the corpus in which the concept c appears, and |{I : w^c_i ∈ I^c}| is the number of images in which the visual word w^c_i appears (that is, n^c_{i,j} ≠ 0). Relevancy rankings of images in a visual keyword search can then be calculated, following document similarity theory, by comparing the angles between each image vector and the original query vector, where the query is represented by the same kind of vector as the images.
Here, for a particular image I to classify, the query vector is v^c_1 = E_j[wf^c_{i,j} iif^c_i], where wf^c_{i,j} and iif^c_i are computed off-line on the training set, and the image vector is v^c_2 = wf^c_i iif^c_i, where wf^c_i is computed on I. Using the cosine, the similarity cos θ_c between v^c_2 and the query vector v^c_1 can then be calculated.
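The whole WF-IIF scoring pipeline can be sketched as follows, assuming raw visual-word count matrices per image; averaging the weighted training vectors to obtain the per-concept query vector follows the description above, but the function names and array layout are illustrative assumptions.

```python
# Sketch of WF-IIF scoring: word frequencies per image (Eq. 1), inverse image
# frequencies (Eq. 2), a concept query vector averaged over the concept's training
# images, and a cosine similarity for a new image.
import numpy as np

def word_frequencies(counts):
    """counts: (n_images, n_words) raw visual-word counts."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)

def inverse_image_frequencies(counts):
    n_images = counts.shape[0]
    has_word = (counts > 0).sum(axis=0)
    return np.log(n_images / np.maximum(has_word, 1))

def concept_query_vector(train_counts):
    """train_counts: counts for the training images labeled with the concept."""
    wf = word_frequencies(train_counts)
    iif = inverse_image_frequencies(train_counts)
    return (wf * iif).mean(axis=0), iif

def concept_score(test_counts, query_vec, iif):
    v2 = word_frequencies(test_counts[None, :])[0] * iif
    denom = np.linalg.norm(query_vec) * np.linalg.norm(v2)
    return float(query_vec @ v2 / denom) if denom > 0 else 0.0
```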
3
Experiments
Experiments are conducted on the image data used in the Photo Annotation task of ImageCLEF 2009 [11]. It represents a total of 8000 images annotated with 53 concepts. The criterion commonly used to measure the ranking quality of a classification algorithm is the mean area under the ROC curve, averaged over all the concepts and denoted AUC. A large variety of features offers a better representation of concepts, since these concepts can differ greatly and combine very variable characteristics. For example, for a concept such as sky, sea or forest, colour or texture will have a big impact, while for a concept such as face, edge and colour will be favoured. So we extract an HSV histogram, an edge histogram, a Gabor filter histogram, generalized Fourier descriptors [12], and the profile entropy feature [13]. The size of the grid is obviously a very important factor in the whole process. A smaller grid allows a more precise description with fewer visual elements, while a bigger grid may contain more information but requires a larger number of elements. We choose to use a multi-scale process. In order to take full advantage of our multi-scale approach, we use different grid sizes: 1 × 1, 2 × 2, 4 × 4, 8 × 8, and 2 × 4, 4 × 2, 8 × 4, 4 × 8, so nG = 8. To construct our global visual dictionary, we must define the number of clusters, and likewise the number of visual words kept during the vocabulary reduction. These parameters were set by experimental results on the development test set. For the global visual dictionary, we varied the number of clusters from 50 to 2500. For the number of visual words in the visual concept dictionary, we tested values from 10 to 10000. We optimize our parameters on parts of the initial training set of 5000 images: it is split into a new training set of 2500 images, a validation set of 2000 images and a test set of 500 images.
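For illustration, a possible implementation of the multi-scale grid splitting and of the dictionary construction could look like the sketch below; it uses scikit-learn's K-means and approximates each cluster medoid by the member closest to the centroid, which is a simplification of the procedure described above, and the grid list and parameter values are only those reported in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

GRIDS = [(1, 1), (2, 2), (4, 4), (8, 8), (2, 4), (4, 2), (8, 4), (4, 8)]  # nG = 8

def grid_blocks(image, rows, cols):
    """Split an image (H x W x C array) into a rows x cols grid of blocks."""
    h, w = image.shape[0] // rows, image.shape[1] // cols
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def build_dictionary(block_features, n_clusters=250):
    """block_features: (n_blocks, dim) descriptors of one feature type and one grid size.
    Returns one representative (medoid-like) vector per K-means cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(block_features)
    words = []
    for k in range(n_clusters):
        members = block_features[km.labels_ == k]
        if len(members) == 0:
            continue
        d = np.linalg.norm(members - km.cluster_centers_[k], axis=1)
        words.append(members[np.argmin(d)])   # member closest to the centroid
    return np.vstack(words)
```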
4
Results
Figure 1 shows the AUC results of our method for different parameters. Based on these results, and in order to have the best compromise between AUC and computing time, we chose to use 250 clusters and 6000 visual words. With these parameters, we obtain an AUC equal to 0.668. We then trained our model again on the whole training set (5000 images) and tested it on the test set of 3000 images, obtaining an AUC equal to 0.688.
Fig. 1. AUC results on the validation set according to the number of visual words in the Visual Concept Dictionary, for 100, 250, 500, 1000 and 2500 clusters.
4.1
Vocabulary Reduction
Visual words do not have the same importance in determining the presence of a concept. We select the most discriminative visual words for a given concept to compose a Visual Concept Dictionary. Automatic word selection methods such as information gain (IG), mutual information (MI), and so on are commonly applied in text categorization. In order to select the words, we tested the various methods described previously, giving the results in Table 1. We see that the IG, MI and Ent methods perform similarly; we use information gain (IG).

Table 1. AUC and EER results for different vocabulary reduction methods

        WF      DF      IG      MI      Ent
AUC     0.680   0.668   0.688   0.680   0.687
EER     0.365   0.375   0.356   0.364   0.360
4.2
Comparison with Classical SVM
We compare our WF-IIF method with a classical SVM [14] using the RBF kernel and the one-against-all strategy, and exactly the same information as our MSVD. First, for each concept, we run Linear Discriminant Analysis (LDA) on the concatenation of all features to reduce the impact of the high dimensionality. Then, we train a support vector machine for every concept on the LDA feature, whose outputs are considered as the confidences with which the samples belong to the concept. We use the same sets as for the visual dictionary: a training set to train the SVM models, validation sets to optimize parameters, and the final test set to evaluate the method. We also compare with the best ImageCLEF 2009 system, although the results are not on the same data. The ISIS group [15] applies a system that is based on four main steps: a spatial pyramid approach with salient point detection, SIFT feature extraction, codebook transformation, and a final learning step based on an SVM with a χ2 kernel.
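A comparable LDA+SVM baseline can be sketched with scikit-learn as follows; the kernel parameters and the use of probability outputs as confidences are illustrative choices, not the exact settings of [14].

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_lda_svm(features, concept_labels, gamma=1.0, C=1.0):
    """One binary LDA+SVM model per concept.
    features: (n_images, dim) concatenated visual features;
    concept_labels: (n_images, n_concepts) binary annotation matrix."""
    models = []
    for c in range(concept_labels.shape[1]):
        y = concept_labels[:, c]
        lda = LinearDiscriminantAnalysis(n_components=1).fit(features, y)
        svm = SVC(kernel="rbf", gamma=gamma, C=C, probability=True).fit(lda.transform(features), y)
        models.append((lda, svm))
    return models

def predict_confidences(models, features):
    """Confidence of each concept for each test image."""
    return np.column_stack([svm.predict_proba(lda.transform(features))[:, 1]
                            for lda, svm in models])
```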
Fig. 2. AUC results by concept for the Visual Dictionary vs the LDA+SVM method and the best ImageCLEF2009 system.
Fig. 3. AUC results by concept frequency for the WF-IIF vs the LDA+SVM method (AUC plotted against concept frequency for the Visual Dictionary and SVM runs).
On average, the LDA+SVM method obtains an AUC equal to 0.653. In the ImageCLEF campaign evaluation, this method obtained 0.72. The performance is lower with our data set, which is a part of the global ImageCLEF data, so the comparison with the best results is biased.
5
Conclusion
We can see that on average WF-IIF is more competitive than LDA+SVM. In particular, better results are obtained especially for concepts with rare occurrence (#positives/#negatives ≤ 1/10), probably due to the lack of positive examples in the SVM training; see Figure 3, where concepts are sorted by the number of positive samples in the training set. Moreover, our WF-IIF model on the MSVD needs only a few minutes for training on a Pentium IV 3 GHz with 4 GB of RAM, compared with the LDA+SVM model, which costs more than 5 hours. The test processing of our WF-IIF on the MSVD is also faster than that of the LDA+SVM.
Acknowledgement. This work was partially supported by the French National Agency of Research: ANR-06-MDCA-002 AVEIR project and ANR Blanc ANCL.
References
1. Picard, R.W.: Toward a visual thesaurus. In: Springer-Verlag Workshops in Computing, MIRO (1995)
2. Picard, R.W.: A society of models for video and image libraries (1996)
3. Zhang, R., Zhang, Z.M.: Hidden semantic concept discovery in region based image retrieval. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 996–1001 (2004)
4. Lim, J.H.: Categorizing visual contents by matching visual "keywords". In: Huijsmans, D.P., Smeulders, A.W.M. (eds.) VISUAL 1999. LNCS, vol. 1614, pp. 367–374. Springer, Heidelberg (1999)
5. Fauqueur, J., Boujemaa, N.: Mental image search by boolean composition of region categories. In: Multimedia Tools and Applications, pp. 95–117 (2004)
6. Souvannavong, F., Hohl, L., Mérialdo, B., Huet, B.: Enhancing latent semantic analysis video object retrieval with structural information. In: IEEE International Conference on Image Processing, ICIP 2004, Singapore, October 24-27 (2004)
7. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York (1986)
8. Mitchell, T.: Machine Learning (October 1997)
9. Seymore, K., Chen, S., Rosenfeld, R.: Nonlinear interpolation of topic models for language model adaptation. In: Proceedings of ICSLP-1998, vol. 6, pp. 2503–2506 (1998)
10. Jensen, R., Shen, Q.: Fuzzy-rough data reduction with ant colony optimization. Fuzzy Sets and Systems (March 2004)
11. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large-scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
12. Smach, F., Lemaître, C., Gauthier, J.P., Miteran, J., Atri, M.: Generalized Fourier descriptors with applications to objects recognition in SVM context. J. Math. Imaging Vis. 30(1), 43–71 (2008)
13. Glotin, H., Zhao, Z., Ayache, S.: Efficient image concept indexing by harmonic and arithmetic profiles entropy. In: Proceedings of the 2009 IEEE International Conference on Image Processing (ICIP 2009), Cairo, Egypt, November 7-11 (2009)
14. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
15. van de Sande, K., Gevers, T., Smeulders, A.: The University of Amsterdam's concept detection system at ImageCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Exploring the Semantics behind a Collection to Improve Automated Image Annotation
Ainhoa Llorente, Enrico Motta, and Stefan Rüger
Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom
{a.llorente,e.motta,s.rueger}@open.ac.uk
Abstract. The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Best results correspond to approaches based on statistical correlation as they do not depend on a prior disambiguation phase like WordNet and Wikipedia. Further work needs to be done to assess whether proper disambiguation schemas might improve their performance.
1
Introduction
Early attempts in automated image annotation were focused on algorithms that explored the correlation between words and image features. More recently, some efforts have benefited from exploiting the correlation between words by computing semantic similarity measures. In this work, we use the terms semantic similarity and semantic relatedness interchangeably. Nevertheless, we refer to the definition by Miller and Charles [1], who consider semantic similarity as the degree to which two words can be interchanged in the same context. Thus, we propose a model that automatically refines the image annotation keywords generated by a non-parametric density estimation approach by considering semantic relatedness measures. The underlying problem that we attempt to correct is that annotations generated by probabilistic models present poor performance as a result of too many "noisy" keywords. By "noisy" keywords, we mean those which are not consistent with the rest of the image annotations and, in addition to that, are incorrect. Semantic measures will improve the accuracy of these probabilistic models, allowing these new combined semantic-based models to be further investigated. As there exist numerous semantic relatedness measures, and each one of them works with different knowledge bases, we extend the model presented in [2] to new measures that perform the knowledge extraction using WordNet, Wikipedia, and the World Wide Web through search engines.
The ultimate goal of this research is to explore how semantics can help an automated image annotation system. In order to achieve this, we examine several semantic relatedness measures, studying their effect on a subset of the MIR-Flickr 25k collection, the proposed dataset for the Photo Annotation Task [3] in the latest edition, 2009, of the ImageCLEF competition. The rest of this paper is structured as follows. Section 2 introduces our model as well as the applied semantic measures. Then, Section 3 describes the experiments carried out on the image collection provided by ImageCLEF2009. Section 4 discusses the results and finally, conclusions are presented in Section 5.
2
Model Description
The baseline approach is based on the probabilistic framework developed by Yavlinsky et al. [4], who used global features together with a non-parametric density estimation to model the conditional probability of an image given a word. The density estimation is accomplished using a Gaussian kernel. A key aspect of this approach is the global visual features used. The algorithm described combines the CIELAB colour feature with the Tamura texture. The process for extracting each of these features is as follows: each image is divided into nine equal rectangular tiles, and the mean and second central moment per channel are calculated in each tile. The resulting feature vector is obtained by concatenating all the vectors extracted in each tile. In what follows, some of the semantic relatedness measures used in this approach are introduced. Due to space constraints, we refer to exhaustive reviews found in the literature whenever appropriate.
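A minimal sketch of the colour part of this global feature (3 × 3 tiles, mean and second central moment per CIELAB channel) is given below; the RGB-to-CIELAB conversion via scikit-image and the function name are our own choices, and the Tamura texture component is omitted.

```python
import numpy as np
from skimage.color import rgb2lab

def tile_colour_moments(image_rgb, grid=3):
    """Mean and second central moment per CIELAB channel on a grid x grid tiling."""
    lab = rgb2lab(image_rgb)
    h, w = lab.shape[0] // grid, lab.shape[1] // grid
    feats = []
    for r in range(grid):
        for c in range(grid):
            tile = lab[r * h:(r + 1) * h, c * w:(c + 1) * w].reshape(-1, 3)
            feats.extend(tile.mean(axis=0))   # mean per channel
            feats.extend(tile.var(axis=0))    # second central moment per channel
    return np.asarray(feats)                  # 9 tiles x 6 values = 54 dimensions
```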
2.1
Training Set Correlation
This approach is introduced in [2] where the training set is computed to generate a co-occurrence matrix that represents the probabilities of the frequency of two vocabulary words appearing together in a given image. This algorithm was previously tested on the Corel5k dataset and in the collection provided by the 2008 ImageCLEF edition showing promising results. 2.2
Web-Based Correlation
The most important limitation affecting approaches that rely on keyword correlation in the training set is that they are limited to the scope of the topics represented in the collection. Consequently, a web-based approach is proposed that makes use of web search engines as a knowledge base. Thus, the semantic relatedness between concepts x and y is defined by Gracia and Mena [5] as:

rel(x, y) = e^{-2 · NWD(x, y)},   (1)

where NWD stands for Normalized Web Distance, which is a generalisation of the Normalized Google Distance defined by Cilibrasi and Vitányi [6].
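Assuming the page-hit counts f(x), f(y), f(x, y) and the index size N have already been obtained from a search engine (the querying itself is not shown here), the measure can be computed as in the following sketch.

```python
from math import exp, log

def normalized_web_distance(fx, fy, fxy, n_pages):
    """Normalized Google/Web Distance from page-hit counts (all counts assumed >= 1)."""
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n_pages) - min(lx, ly))

def web_relatedness(fx, fy, fxy, n_pages):
    """rel(x, y) = exp(-2 * NWD(x, y)), as in equation (1)."""
    return exp(-2.0 * normalized_web_distance(fx, fy, fxy, n_pages))
```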
2.3
WordNet Measures
A fair amount of thesaurus-based semantic relatedness measures were proposed and investigated on the WordNet hierarchy of nouns (see [7] for a detailed review). The best result was achieved by Jiang and Conrath using a combination of statistical measures and taxonomic analysis. This was accomplished using the list of 30 noun pairs proposed by Miller and Charles in [1]. During our training phase (Section 3.1), we applied several WordNet semantic measures (Jiang and Conrath [8], Hirst and St-Onge [9], Resnik [10] and, Adapted Lesk [11]) to 1,378 pair of words obtained from our vocabulary. The best performing was the adapted Lesk measure proposed by Banerjee and Pedersen, closely followed by Jiang and Conrath’s relatedness measure. Banerjee and Pedersen defined the extended gloss overlap measure which computes the relatedness between two synsets by comparing the glosses of synsets related to them through explicit relations provided by the thesaurus. 2.4
Wikipedia Measures
According to a review by Medelyan et al. [12], the computation of semantic relatedness using Wikipedia has been addressed from three different point of views; one that applies WordNet-based techniques to Wikipedia followed by [13]; another that uses vector model techniques to compare similarity of Wikipedia articles proposed by Gabrilovich and Markovitch in [14]; and, the final one, which explores the Wikipedia as a hyperlinked structure introduced by Milne and Witten in [15]. The approach adopted in this research is the last one as it is less computationally expensive than the others that work with the whole content of Wikipedia. Milne and Witten proposed their Wikipedia Link-based Measure (WLM) which extracts semantic relatedness measure between two concepts using the hyperlink structure of Wikipedia. Thus, the semantic relatedness between two concepts is estimated by computing the angle between the vectors of the links found between the Wikipedia’s articles whose title matches each one of the concepts.
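A simplified sketch of such a link-based relatedness is given below; unlike the original WLM, it uses unweighted binary link vectors and plain cosine similarity, so it should be read as an approximation of the measure rather than a faithful reimplementation.

```python
import numpy as np

def link_vector(article_links, all_articles):
    """Binary vector indicating which articles the given Wikipedia article links to."""
    index = {a: i for i, a in enumerate(all_articles)}
    v = np.zeros(len(all_articles))
    for target in article_links:
        if target in index:
            v[index[target]] = 1.0
    return v

def wikipedia_link_relatedness(links_a, links_b, all_articles):
    """Cosine of the angle between the two link vectors (WLM-style relatedness)."""
    va = link_vector(links_a, all_articles)
    vb = link_vector(links_b, all_articles)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom > 0 else 0.0
```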
3
Experimental Work
In this paper, we describe the experiments carried out for the Photo Annotation Task for the ImageCLEF2009 campaign. The main goal of this task is, as described in [3], given a training set of 5,000 images manually annotated with words coming from a vocabulary of 53 visual concepts, to automatically provide annotations for a test set of 13,000 images. 3.1
Training Phase
Before submitting our runs to ImageCLEF2009, we made a preliminary study about which method performs better. In order to accomplish this goal, we performed a 10-fold cross validation on the training set. Thus, we divided the training set of the collection into two parts: a training set of 4,500 images and a validation set of 500. The validation set was used to tune the model parameters. During this training phase, we use as evaluation measure the mean average precision (MAP), as it has been shown to have especially good discriminatory and stable capabilities among evaluation measures. For a given query, average precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over all queries. We consider as queries all the words that are able to annotate an image in the test set; in our case this is the whole vocabulary of 53 words. We evaluated the performance of several semantic measures using various knowledge sources, as indicated in Table 1. The final goal of this training phase is to select the best performing measure per method. As noted from the results, methods based on word correlation outperform methods based on a thesaurus such as WordNet or Wikipedia. The poorest performance corresponds to the Wikipedia Link-based Measure.

Table 1. Comparative performance on the held-out data for our proposed methods using different semantic relatedness measures. Results are expressed in terms of mean average precision (MAP). In the third column, Δ represents the percentage of improvement of the method over the baseline. Best performing results are marked with an asterisk.

Method                               MAP        Δ
Baseline                             0.2613     -
Training Set Correlation             0.2720     4.09%
Wikipedia Link-based Measure         0.2681     2.60%
Web-based Correlation (Yahoo)        0.2720     4.09%
Web-based Correlation (Google)*      0.2736*    4.71%*
WordNet: Hirst and St-Onge (HSO)     0.2675     2.37%
WordNet: Resnik (RES)                0.2685     2.76%
WordNet: Jiang and Conrath (JCN)     0.2720     4.09%
WordNet: Adapted Lesk (LESK)         0.2721     4.13%

3.2
Discussion
The fact that many words of the proposed vocabulary are not included in WordNet or in Wikipedia adds a further complication to the process of computing some semantic relatedness measures. Thus, we followed the same approach adopted by [16], which consists in replacing some words by others similar to them. However, these replacements were rather difficult to accomplish, as we needed to select a word semantically and, at the same time, visually similar to the original one. Especially difficult were the words that represent a negation such as "no visual season", "no visual place", etc. In other cases, the replacement consisted in finding the corresponding noun for a given adjective, as in the case of "indoor", "outdoor", "sunny", "overexposed" or "underexposed". In addition to that, the computation of some semantic relatedness measures implies a prior disambiguation task. This occurs, again, in the case of Wikipedia and WordNet. Both of them automatically assign to every word its most usual sense. In the case of WordNet, this sense corresponds to the first sense in the synset (word#n#1), while in Wikipedia it corresponds to the most probable sense of the word according to the content stored in the Wikipedia database. Surprisingly, both methods present similar disambiguation capabilities of around 70% accuracy, with WordNet being slightly better. Table 2 shows some unlucky examples. Unfortunately, the most popular sense of a word does not necessarily match the sense of the word attributed in our collection. Consequently, these inaccuracies in the disambiguation process translate into poor performance for the resulting methods. This explains the results of Table 1, where Google achieves the best performance, as it does not need to do any disambiguation task. This result is closely followed by word correlation using the training set as source of knowledge. Finally, and confirming our previous expectations, the WordNet and Wikipedia semantic relatedness measures obtained the lowest results. Among the WordNet results, the Jiang and Conrath (JCN) measure is narrowly beaten by Adapted Lesk.

Table 2. Examples of Word Sense Disambiguation (WSD) using Wikipedia and WordNet. The wrong disambiguations are highlighted in bold characters.

Word            | Wikipedia                 | WordNet: word#n#1
Indoor          | The Inside                | "the region that is inside of something"
Outdoor         | Outside (magazine)        | "the region that is outside of something"
Canvas          | Canvas                    | "the setting for a fictional account"
Still Life      | Still                     | "a static photograph"
Macro           | Macro (computer science)  | "a single computer instruction"
Overexposed     | Light                     | "electromagnetic radiation"
Underexposed    | Darkness                  | "absence of light or illumination"
Plants          | Plant                     | "building for industrial labor"
Partly Blurred  | Bokeh                     | "a hazy or indistinct representation"
Small Group     | Group (auto racing)       | "a number of entities considered as a unit"
Big Group       | Hunter-gatherer           | "a group of persons together in one place"
4
Analysis of Results
Due to the limitations on the number of runs to be submitted to the ImageCLEF2009 competition, we propose our top four performing runs according to the training process described in Section 3.1. At the end of it, the training and validation sets were merged again to form a new training set of 5,000 images that was used to predict the annotations in the test set of 13,000 images. Thus, we submitted the following runs: correlation based on the training set, web-based correlation using Google, semantic relatedness using WordNet based on the Adapted Lesk measure and, finally, the Wikipedia Link-based measure using Wikipedia. Evaluation of the results was done using the two metrics proposed by the ImageCLEF organisers. The first one is based on ROC curves and proposes as measures the Equal Error Rate (EER) and the Area under the Curve (AUC), while the second metric is the Ontology Score (OS) proposed by [17], which takes into consideration the hierarchical form adopted by the vocabulary. Table 3 shows the results obtained by the proposed algorithms. These results are rather in tune with the results previously computed during our training process. As expected, the best results correspond to word correlation, either using the training set or using a web-based search engine like Google. Results for the OS metric are presented in Table 4. They corroborate the previous results computed using the ranked retrieval metric or the metric based on ROC curves. The only variation is that, depending on the metric, Web-based correlation outperforms training set correlation. It is worth noting that the emphasis of this research is placed on the analysis of the performance of the different semantic relatedness measures more than on the baseline run. However, we were able to perform an additional run with a more adequate selection of image features together with a kernel function, obtaining a significantly better EER value of 0.309021. This result was achieved by combining Tamura and Gabor texture with HSV and CIELAB colour descriptors and using a Laplacian kernel function instead of the Gaussian mentioned before.

Table 3. Evaluation performance of the proposed algorithms under the EER and AUC metrics. A random run is included for comparison purposes. The best performing result is marked with an asterisk. Note that the lower the EER, the better the performance of the annotation algorithm.

Algorithm                        EER         AUC
Training Set Correlation*        0.352478*   0.689410*
Web-based Correlation (Google)   0.352485    0.689407
WordNet: Adapted Lesk            0.352612    0.689342
Wikipedia Link-based Measure     0.356945    0.684821
Random                           0.500280    0.499307

Table 4. Evaluation performance of the proposed algorithms under the Ontology Score (OS) metric, considering the agreement among annotators or without it. A random run is included for comparison purposes. The best performing result is marked with an asterisk. In this case, the higher the OS, the better the performance of the annotation algorithm.

Algorithm                        With Agreement   Without Agreement
Web-based Correlation (Google)*  0.6180272*       0.57583610*
Training Set Correlation         0.6179764        0.57577974
WordNet: Adapted Lesk            0.6172693        0.57497290
Wikipedia Link-based Measure     0.4205571        0.35027474
Random                           0.3843171        0.35097164
5
Conclusions
The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Several metrics are employed to evaluate the results: the MAP ranked retrieval metric, the ROC-curve-based EER metric and the proposed Ontology-based Score. Disparity among the results under the three metrics is not significant. Thus, we observe that the best performance is achieved using correlation approaches. This is due to the fact that they do not rely on a prior disambiguation process like WordNet and Wikipedia. Not surprisingly, the worst result corresponds to the semantic measure based on Wikipedia. The reasons behind it might be found in the strong dependency of the semantic relatedness measure on a proper word disambiguation. The disambiguation in Wikipedia is automatically performed by selecting the most probable sense of the word according to the content stored in the Wikipedia database. Most of the vocabulary words do not correspond to real visual features and at the same time present difficult semantics. Consequently, we predicted, and subsequently checked, lower results for concepts classified into categories such as "Seasons", "Time of the day", "Picture representation", "Illumination", "Quality Blurring" and especially the most subjective one, "Quality Aesthetics". Further analysis is needed to determine whether the performance of WordNet and Wikipedia can be improved by incorporating robust disambiguation schemas.

Acknowledgments. This work was partially funded by the EU-Pharos project under grant number IST-FP6-45035 and by Santander Universities.
References
1. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Journal of Language and Cognitive Processes 6, 1–28 (1991)
2. Llorente, A., Rüger, S.: Using second order statistics to enhance automated image annotation. In: Proceedings of the 31st European Conference on Information Retrieval, vol. 5478, pp. 570–577 (2009)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 Large Scale – Visual Concept Detection and Annotation Task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
4. Yavlinsky, A., Schofield, E., Rüger, S.: Automated image annotation using global features and robust nonparametric density estimation. In: Proceedings of the International ACM Conference on Image and Video Retrieval, pp. 507–517 (2005)
5. Gracia, J., Mena, E.: Web-based measure of semantic relatedness. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 136–150. Springer, Heidelberg (2008)
6. Cilibrasi, R., Vitányi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
7. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1), 13–47 (2006)
8. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the International Conference on Research in Computational Linguistics (1997)
9. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms. In: WordNet: A Lexical Database for English, pp. 305–332. The MIT Press, Cambridge (1998)
10. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453 (1995)
11. Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence (2003)
12. Medelyan, O., Milne, D., Legg, C., Witten, I.H.: Mining meaning from Wikipedia. International Journal of Human-Computer Studies 67(9), 716–754 (2009)
13. Ponzetto, S., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research (JAIR) 30, 181–212 (2007)
14. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611 (2007)
15. Milne, D., Witten, I.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence (2008)
16. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., Soroa, A.: A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of NAACL-HLT (2009)
17. Nowak, S., Lukashevich, H.: Multilabel classification evaluation using ontology information. In: Proceedings of the ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (2009)
Multi-cue Discriminative Place Recognition Li Xing and Andrzej Pronobis Centre for Autonomous Systems, The Royal Institute of Technology SE100-44 Stockholm, Sweden {lixing,pronobis}@kth.se
Abstract. In this paper we report on our successful participation in the RobotVision challenge in the ImageCLEF 2009 campaign. We present a place recognition system that employs four different discriminative models trained on different global and local visual cues. In order to provide robust recognition, the outputs generated by the models are combined using a discriminative accumulation method. Moreover, the system is able to provide an indication of the confidence of its decision. We analyse the properties and performance of the system on the training and validation data and report the final score obtained on the test run which ranked first in the obligatory track of the RobotVision task.
1
Introduction
This paper presents the place recognition algorithm based on multiple visual cues that was applied to the RobotVision task of the ImageCLEF 2009 campaign. The task addressed the problem of visual indoor place recognition applied to robot topological localization. Participants were given training, validation and test sequences capturing the appearance of an office environment under various conditions [1]. The task was to build a system able to answer the question “where are you?” (I am in the kitchen, in the corridor, etc) when presented with a test sequence imaging rooms seen during training, or additional rooms that were not imaged in the training sequence. The results could be submitted for two separate tracks: (a) obligatory, in case of which each single image had to be classified independently; (b) optional, where the temporal continuity of the sequences could be exploited to improve the robustness of the system. For more information about the task and the dataset used for the challenge, we refer the reader to the RobotVision@ImageCLEF’09 overview paper [2]. The visual place recognition system presented in this paper obtained the highest score in the obligatory track and constituted a basis for our approach used in the optional track. The system relies on four discriminative models trained on different visual cues that capture both global and local appearance of a scene. In order to increase the robustness of the system, the cues are integrated efficiently using a high-level accumulation scheme that operates on the separate models
This work was supported by the EU FP7 integrated project ICT-215181-CogX. The support is gratefully acknowledged.
adapted to the properties of each cue. Additionally, in the optional track, we used a simple temporal accumulation technique which exploits the continuity of the image sequences to refine the results. Since the misclassifications were penalized in the competition, we experimented with an ignorance detection technique relying on the estimated confidence of the decision. Visual place recognition is a vastly researched topic in the robotics and computer vision communities and several different approaches have been proposed to the problem considered in the competition. The main differences between the approaches relate to the way the scene is perceived and thus the visual cues extracted from the input images. There are two main groups of approaches using either global or local image features. Typically, SIFT [3] and SURF [4] are applied as local features, either using a matching strategy [5,6] or the bag-of-words approach [7,8]. Global features are also commonly used for place recognition and such representations as gist of a scene [9], CRFH [10], or PACT [11] were proposed. Recently, several authors observed that robustness and efficiency of the recognition system can be improved by combining information provided by both types of cues (global and local) [5, 12]. Our approach belongs to this group and four different types of features previously used in the domain of place recognition have been used in the presented system. The rest of the paper gives a description of the structure and components of our place recognition system (Section 2). Then, we describe the initial experiments performed on the training and validation data (Section 3). We explain the procedure applied for parameter selection and study the properties of the cue integration and confidence estimation algorithms. Finally, we present the results obtained on the test sequence and our ranking in the competition (Section 4). The paper concludes with a summary and possible avenues for future research.
2
The Visual Place Recognition System
This section describes our approach to visual place classification. Our method is fully supervised and assumes that during training, each place (room) is represented by a collection of labeled data which captures its intrinsic visual properties under various viewpoints, at a fixed time and illumination setting. During testing, the algorithm is presented with data samples acquired under different conditions and after some time. The goal is to recognize correctly each single data sample provided to the system. The rest of the section describes the structure and components of the system. 2.1
System Overview
The architecture of the system is illustrated in Fig. 1. We use four different cues extracted independently from the visual input. We see that there is a separate path for each cue. Every path consists of two main building blocks: a feature extractor and a classifier. Thus separate decisions can be obtained for every cue. The outputs encoding the confidence of single-cue classifiers are combined using a discriminative accumulation scheme.
Fig. 1. Structure of the multi-cue visual place recognition system
2.2
Visual Features
The system relies on visual cues based on global and local image features. Global features are derived from the whole image and thus can capture general properties of the whole scene. In contrast, local features are computed locally, from distinct parts of an image. This makes them much more robust to occlusions and viewpoint variations. In order to capture different aspects of the environment, we combine cues produced by four different feature extractors.

Composed Receptive Field Histograms (CRFH). CRFH [13] is a multi-dimensional statistical representation (a histogram) of the occurrence of responses of several image descriptors applied to the whole image. Each dimension corresponds to one descriptor and the cells of the histogram count the pixels sharing similar responses of all descriptors. This approach allows capturing various properties of the image as well as relations that occur between them. On the basis of the evaluation in [10], we build the histograms from second order Gaussian derivative filters applied to the illumination channel at two scales.

PCA of Census Transform Histograms (PACT). The Census Transform (CT) [11] is a non-parametric local transform designed for establishing correspondence between local patches. The Census Transform compares the intensity values of a pixel with its eight neighboring pixels, as illustrated in Figure 2. A histogram of the CT values encodes both local and global information of the image. PACT [11] is a global representation that extracts the CT histograms for several image patches organized in a grid and applies Principal Component Analysis (PCA) to the resulting vector.

Scale Invariant Feature Transform (SIFT). As one of the local representations, we used a combination of the SIFT descriptor [3] and the scale, rotation and translation invariant Harris-Laplace corner detector [14]. The SIFT descriptor represents local image patches around interest points, characterized by coordinates in the scale space, in the form of histograms of gradient directions.
Fig. 2. Illustration of the Census Transform [11]
Speed-Up Robust Features (SURF). SURF [4] is a scale- and rotation-invariant local detector and descriptor which is designed to approximate the performance of previously proposed schemes while being much more computationally efficient. This is obtained by using integral images, a Hessian matrix-based measure for the detector and a distribution of Haar-wavelet responses for the descriptor. 2.3
Place Models
Based on its state-of-the-art performance in several visual recognition domains [15, 16], we used the Support Vector Machine classifier [17] to build the models of places for each cue. The choice of the kernel function is a key ingredient for the good performance of SVMs and we selected specialized kernels for each cue. Based on results reported in the literature, we chose in this paper the χ2 kernel [18] for CRFH, the Gaussian (RBF) kernel [17] for PACT and the match kernel [19] for both local features. In order to extend the binary SVM to multiple classes, we used the one-against-all strategy for which one SVM is trained for each class separating the class from all other classes. SVMs do not provide any out-of-the-box solution for estimating confidence of the decision; however, it is possible to derive confidence information and hypotheses ranking from the distances between the samples and the hyperplanes. In this work, we experimented with the distance-based methods proposed in [5], which define confidence as a measure of unambiguity of the final decision. 2.4
Cue Integration and Temporal Accumulation
As indicated in [5], different properties of visual cues result in different performance and error patterns on the place classification task. The role of the cue integration scheme is to exploit this fact in order to increase the overall performance. Our place recognition system uses the Discriminative Accumulation Scheme (DAS) [16] that was proposed for the place classification problem in [5]. It accumulates multiple cues by turning classifiers into experts. The basic idea is to consider real-valued outputs of a multi-class discriminative classifier as an indication of a soft decision for each class. Then, all of the outputs obtained from the various cues are summed together, i.e. linearly accumulated. In the presented system, this can be expressed by the equation O_Σ = a · O_CRFH + b · O_PACT + c · O_SIFT + d · O_SURF, where a, b, c, d are the weights assigned to each cue and a + b + c + d = 1. The vectors O represent the outputs of the multi-class classifiers for each cue. We used a very similar scheme to improve the robustness of the system operating on image sequences. For this, we exploited the continuity of the sequences
and accumulated the outputs (of a single cue or integrated cues) for the current sample and N previously classified samples. The result of accumulation was then used as the final decision of the system.
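A minimal sketch of the linear accumulation and of the temporal averaging over the last N frames is given below, using the weights eventually selected in Section 3.2; the data structures and function names are our own illustrative choices.

```python
import numpy as np

def das_integration(outputs, weights):
    """Discriminative accumulation: weighted sum of per-cue classifier output vectors.
    outputs: dict cue -> (n_classes,) array; weights: dict cue -> float summing to 1."""
    return sum(weights[c] * outputs[c] for c in outputs)

def temporal_accumulation(history, current, n_past=4):
    """Average the integrated outputs of the current frame and the n_past previous ones."""
    window = (history[-n_past:] + [current]) if history else [current]
    return np.mean(window, axis=0)

# weights selected in the experiments: a = 0.1 (CRFH), b = 0.15 (PACT), c = 0.75 (SIFT), d = 0 (SURF)
weights = {"CRFH": 0.1, "PACT": 0.15, "SIFT": 0.75, "SURF": 0.0}
```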
3
Experiments on the Training and Validation Data
We conducted several series of experiments on the training and validation data in order to analyze the behavior of our system and select parameters. We present the analysis and results in successive subsections. 3.1
Selection of the Model Parameters
The first set of experiments was aimed at finding the values of parameters of the place models, i.e. the SVM error penalty C and the kernel parameters. The experiments were performed separately for each visual cue (CRFH, PACT, SIFT and SURF). To find the parameters, we performed cross validation on the training and validation data. For every training set, we selected parameters that resulted in highest classification rate on all available test sets acquired under different conditions. The classification rate was calculated in a similar way as the final score used in the competition i.e. as the percentage of correctly classified images in the whole testing sequence. Figure 3 presents the results obtained for the experiments with the dum-night3 training set which was selected for the final run of the competition. It is apparent that the model based on the SIFT features provides the highest recognition rate on average. However, we can also see that different cues have different characteristics as their performance changes according to different patterns. This suggests that the overall performance of the system could be increased by integrating the outputs of the models. 3.2
Cue Integration and Temporal Accumulation
The next step was to integrate the outputs of the models and choose the proper values of the DAS weights for each model. We performed an exhaustive search for the weights on the training and validation data independently for each training set. Then, we selected the values that provided the highest average classification rate over all test sets. The results are presented in Figure 3. This weight selection procedure revealed that once SIFT is used as one of the cues, there is no benefit of adding SURF (the weight for SURF was selected to be 0). This is not surprising since SURF captures similar information as SIFT, while employing some heuristics in order to make the feature extraction process more efficient. According to the results presented in the previous section, those heuristics decrease the overall performance of the system, while not introducing any additional knowledge. Figure 4 illustrates how the average classification rates for the dum-night3 training set and all test sets changed for various values of the weights used for CRFH, PACT and SIFT (the weight used for SURF is assumed to be 0). The following weights were selected and used for further experiments: a = 0.1, b = 0.15, c = 0.75, d = 0. We performed similar experiments to find the number of past samples we should accumulate over in order to refine the results in case of the optional track. The results revealed that we obtain the highest score when 4 past test samples are accumulated with equal weights with the currently classified sample.

Fig. 3. Classification rates for the best model parameters and the dum-night3 training set (CRFH, PACT, SIFT, SURF and DAS on the dum-cloudy1, dum-cloudy2, dum-sunny1 and dum-sunny2 test sets). Results are given separately for each test set as well as averaged over all sets.

Fig. 4. Classification rates obtained for various values of the DAS weights a (CRFH), b (PACT) and c (SIFT).

3.3
Confidence Estimation
According to the performance measure used in the competition, the classification errors were penalized. Therefore, we experimented with an ignorance detection mechanism based on the confidence of the decision produced by the system. In order to simulate the case of unknown rooms in the test set, we always removed one room from the training set. Figure 5a-e presents the obtained average results. We gradually increased the value of confidence that is required in order to accept the decision of the system and measured the statistics of the accepted and rejected decisions. In both cases, we measured the percentage of test samples that were classified correctly, misclassified or unknown during training. We can see from the plots that the confidence thresholding procedure rejects mostly samples from unknown rooms and samples that would be incorrectly classified. This increases the classification rate for the accepted samples. At the same time, the plots show the score used for the competition calculated for the accepted samples only. If we use the penalty equal to 0.5 points for each misclassified sample (as used in the competition), the number of rejected errors must be twice as large as the number of rejected samples that would be classified correctly. As a result, the ignorance detection scheme provided only a slight improvement of the final score and we decided not to use confidence thresholding for the final run. However, as shown in Figure 5, if the penalty was increased to 1 point, the improvement would be significant.

Fig. 5. Average results of the experiments with confidence-based ignorance detection for separate cues and for the integrated cues ((a) CRFH, (b) PACT, (c) SIFT, (d) SURF, (e) DAS, (f) legend; each panel plots the percentage of samples and the score against the confidence threshold).

Table 1. Results and scores obtained on the final test set.

(a) Scores and ranks:

           Obligatory Track   Optional Track
Score      793.0              853.0
Rank       1                  4

(b) Confusion matrix. Values in brackets are for the optional track:

True \ Predicted   1-person Office  Corridor    2-person Office  Kitchen     Printer Area  Unkn. Room
1-person Office    119 (129)        25 (23)     12 (8)           4 (0)       0 (0)         0 (0)
Corridor           4 (2)            570 (580)   6 (3)            10 (6)      1 (0)         0 (0)
2-person Office    1 (0)            4 (0)       131 (134)        25 (27)     0 (0)         0 (0)
Kitchen            1 (0)            5 (0)       2 (0)            152 (161)   1 (0)         0 (0)
Printer Area       5 (0)            138 (139)   10 (7)           3 (2)       120 (128)     0 (0)
Unkn. Room         13 (14)          206 (206)   22 (24)          11 (6)      89 (91)       0 (0)
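Returning to the confidence-based rejection evaluated above, a minimal sketch of the thresholding step is given below; it uses the gap between the two largest accumulated outputs as the confidence value, which is one of the distance-based measures of unambiguity mentioned in Section 2.3, not necessarily the exact variant used in our experiments.

```python
import numpy as np

def classify_with_rejection(integrated_outputs, class_names, threshold):
    """Accept the top class only if its margin over the runner-up exceeds the threshold."""
    order = np.argsort(integrated_outputs)[::-1]
    confidence = integrated_outputs[order[0]] - integrated_outputs[order[1]]
    if confidence < threshold:
        return None, confidence           # rejected: possibly an unknown room
    return class_names[order[0]], confidence
```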
4
The Final Test Run
The test sequence and the ID of the training sequence (dum-cloudy3 ) were released in the final round of the competition. For the final run, we used the parameters identified on the training and validation data. In order to obtain the results for the obligatory track, we applied the models independently to each image in the test sequence and integrated the results using the selected weights. We did not perform ignorance detection. In order to obtain the results for the optional task, we applied the temporal averaging to the results submitted to
the obligatory track. Table 1a presents our scores and ranks in both tracks. Table 1b shows the confusion matrix for the test set. We can see that the temporal averaging filtered out many single misclassifications in the test sequence.
5
Conclusions
In this paper we presented our place recognition system applied to the RobotVision task of the ImageCLEF'09 campaign. Through the use of multiple visual cues integrated using a high-level discriminative accumulation scheme, we obtained a system that provided robust recognition despite different types of variations introduced by changing illumination and long-term human activity. The most difficult aspect of the task turned out to be the novel class detection. We showed that the confidence of the classifier can be used to reject unknown or misclassified samples. However, we did not provide any principled way to detect the cases when the classifier dealt with a novel room. Our future work will concentrate on that issue.
References
1. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. of IROS 2007 (2007)
2. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 Robot Vision task (2009)
3. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004)
4. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
5. Pronobis, A., Caputo, B.: Confidence-based cue integration for visual place recognition. In: Proc. of IROS 2007 (2007)
6. Valgren, C., Lilienthal, A.J.: Incremental spectral clustering and seasons: Appearance-based localization in outdoor env. In: Proc. of ICRA 2008 (2008)
7. Filliat, D.: A visual bag of words method for interactive qualitative localization and mapping. In: Proc. of ICRA 2007 (2007)
8. Cummins, M., Newman, P.: FAB-MAP: Probabilistic localization and mapping in the space of appearance. International Journal of Robotics Research 27(6) (2008)
9. Torralba, A., Murphy, K.P., Freeman, W.T., Rubin, M.A.: Context-based vision system for place and object recognition. In: Proc. of ICCV 2003 (2003)
10. Pronobis, A., Caputo, B., Jensfelt, P., Christensen, H.I.: A discriminative approach to robust visual place recognition. In: Proc. of IROS 2006 (2006)
11. Wu, J., Rehg, J.M.: Where am I: Place instance and category recognition using spatial PACT. In: Proc. of CVPR 2008 (2008)
12. Weiss, C., Tamimi, H., Masselli, A., Zell, A.: A hybrid approach for vision-based outdoor robot localization using global and local image features. In: Proc. of IROS 2007 (2007)
13. Linde, O., Lindeberg, T.: Object recognition using composed receptive field histograms of higher dimensionality. In: Proc. of ICPR 2004 (2004)
14. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: Proc. of ICCV 2001 (2001)
15. Pronobis, A., Martínez Mozos, O., Caputo, B.: SVM-based discriminative accumulation scheme for place recognition. In: Proc. of ICRA 2008 (2008)
16. Nilsback, M.E., Caputo, B.: Cue integration through discriminative accumulation. In: Proc. of CVPR 2004 (2004)
17. Cristianini, N., Taylor, J.S.: An Introduction to SVMs and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
18. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5) (1999)
19. Wallraven, C., Caputo, B., Graf, A.: Recognition with local features: the kernel recipe. In: Proc. of ICCV 2003 (2003)
MRIM-LIG at ImageCLEF 2009: Robotvision, Image Annotation and Retrieval Tasks
Trong-Ton Pham1, Loïc Maisonnasse2, Philippe Mulhem1, Jean-Pierre Chevallet1, Georges Quénot1, and Rami Al Batal1
1
Laboratoire Informatique de Grenoble (LIG), Grenoble University, CNRS, LIG 2 Laboratoire d’InfoRmatique en Image et Systemes d’information (LIRIS) {Trong-Ton.Pham,Philippe.Mulhem,Jean-Pierre.Chevallet}@imag.fr, {Georges.Quenot,Rami.Albatal}@imag.fr
Abstract. This paper mainly describes the experiments that have been conducted by the MRIM group at the LIG in Grenoble for the ImageCLEF 2009 campaign, focusing on the work done for the Robotvision task. The proposal for this task is to study the behaviour of a generative approach inspired by the language model of information retrieval. To fit the specificity of the Robotvision task, we added post-processing to take into account the fact that images belong only to a few classes (rooms) and that images are not independent from each other (i.e., the robot cannot be in three different rooms within one second). The results obtained still need improvement, but the applicability of such a language model to the Robotvision case is shown. Some results related to the Image Retrieval task and the Image Annotation task are also presented.
1
Introduction
We describe here the different experiments that have been conducted by the MRIM group at the LIG in Grenoble for the ImageCLEF 2009 campaign, and more specifically for the Robotvision task. Our goal for this task was to study the use of language models in a context where we try to guess in which room a robot is in a partially known environment. Language models for text retrieval were proposed ten years ago, and behave very well when all the data cannot be directly extracted from the corpus. We have already proposed such an application for image retrieval in [10], achieving very good results. We decided to focus on the challenging task represented by the Robotvision task in CLEF 2009. We also participated in the Image Retrieval and the Image Annotation tasks for CLEF 2009, and we briefly discuss, because of space constraints, some of our proposals and results. The paper is organized as follows. First we describe the Robotvision task in Section 2, our proposal based on language models and the results obtained. In this section, we focus on the features that were used to represent the images, before describing the language model defined on such a representation and the post-processing that took advantage of the specificity of the Robotvision task. Because the MRIM-LIG research group participated in two other image related
tasks, we briefly describe in Section 3 our main proposals and findings for the image annotation and the image retrieval tasks. We conclude in Section 4.
2
Robotvision Track
2.1
Task Description
The Robotvision task at CLEF 2009 [1] aims at determining "the topological location of a robot based on images acquired with a perspective camera mounted on a robot platform." A robot moves on a building floor, going across several (six) rooms, and an automatic process has to indicate, for each image of a video sequence shot by the robot, in which room the robot is. In the test video, an additional, unknown room (which was not given in the training set) is present and also has to be tagged automatically. The full video set is the IDOL video database [6].
Image Representation
We have applied a visual language modeling framework for the Robotvision task. This generative model is quite standard in the Information Retrieval field, and already lead to good results for visual scene recognition [10]. Before explaining in detail the language modeling approach, we fix some elements related to the feature extractions of images. To cover the different classes of features that could be relevant, we have extracted color, texture, and region of interest features in our proposal. These features are: HSV color histogram: we extract the color information from HSV color space. One image is represented by a concatenation of n×n histograms, according to non overlapping rectangular patches defined from a n×n grid applied on the image. Each histogram has 512 dimensions; Multi-scale canny edge histogram: we used Canny operator to detect the contour of objects as presented in [15]. An 80-dimensional vector was used to capture magnitudes and gradient of the contours for each patch. This information is extracted from a grid of m×m for each image; Color SIFT: SIFT features are extracted using D. Lowe’s detector [5]. Region around the keypoint is described by a 128-dimensional vector for each R, G, B channel. Based on the usual bag of visual words approach, we construct for each of the features above a visual vocabulary of 500 visual words using k-means clustering algorithm. Each visual word is designated to a concept c. Each image will then be represented using theses concepts and the language model proposed is built on these concepts. 2.3
2.3 Visual Language Modeling
The language modeling approach to information retrieval has existed since the end of the 1990s [11]. In this framework, the relevance status value of a document for a given query is estimated by the probability of generating the query from the document. Even though this approach was originally proposed for unigrams (i.e.
isolated terms), several extensions have been proposed to deal with n-grams (i.e. sequences of n terms) [12,13] and, more recently, with relationships between terms and graphs. Thus, [3] proposes (a) the use of a dependency parser to represent documents and queries, and (b) an extension of the language modeling approach to deal with such trees. [8,9] further extend this approach with a model compatible with general graphs, such as the ones obtained by a conceptual analysis of documents and queries. Other approaches (such as [2,4]) have respectively used probabilistic networks and kernels to capture spatial relationships between regions in an image. In the case of [2], the estimation of the region probabilities relies on an EM algorithm, which is sensitive to the initial probability values. In contrast, in the model we propose, the likelihood function is convex and has a global maximum. In the case of [4], the kernel used only considers the three closest regions to a given region. In [10], we presented the image as a probabilistic graph which allows capturing the visual complexity of an image. Images are represented by a set of weighted concepts, connected through a set of directed associations. The concepts aim at characterizing the content of the image whereas the associations express the spatial relations between concepts. Our assumption is that the concepts are represented by non-overlapping regions extracted from images. In this competition, the images acquired by the robot are of poor quality, and we decided not to take into account the relationships between concepts. We thus assume that each document image d (and equivalently each query image q) is represented by a set of weighted concepts WC. The concepts correspond to the visual words used to represent the image. The weight of a concept captures the number of occurrences of this concept in the image. Denoting C the set of concepts over the whole collection, WC can be defined as a set of pairs (c, w(c, d)), where c is an element of C and w(c, d) is the number of times c occurs in the document image d. We are then in a context similar to the usual language model for text retrieval. We then rely on a language model defined over concepts, as proposed in [7], which we refer to as the Conceptual Unigram Model. We assume that a query q or a document d is composed of a set WC of weighted concepts, each concept being conditionally independent of the others. Unlike [7], which computes a query likelihood, we evaluate the relevance status value rsv of a document image d for a query q using a generalized formula, the negative Kullback-Leibler divergence, noted D. This divergence is computed between two probability distributions: the document model Md computed over the document image d and the query model Mq computed over the query image q. Assuming the concept independence hypothesis, this leads to:

RSV_kld(q, d) = −D(M_q || M_d)    (1)
             ∝ Σ_{c_i ∈ C} P(c_i | M_q) log P(c_i | M_d)    (2)
where P(c_i | M_d) and P(c_i | M_q) are the probabilities of the concept c_i in the models estimated over the document d and the query q respectively. If we assume multinomial models for M_d and M_q, P(c_i | M_d) is estimated through maximum likelihood
(as is standard in the language modeling approach to IR), using Jelinek-Mercer smoothing:
P(c_i | M_d) = (1 − λ_u) · F_d(c_i) / F_d + λ_u · F_c(c_i) / F_c    (3)
where F_d(c_i) represents the sum of the weights of c_i in all graphs from the document image d, and F_d is the sum of all the document concept weights in d. The functions F_c are similar, but defined over the whole collection (i.e. over the union of all the images from all the documents of the collection). The parameter λ_u helps take into account reliable information when the information from a given document is scarce. The quantity P(c_i | M_q) is estimated through maximum likelihood without smoothing on the query. The final result L(q_i) for one query image i is a list of the images d_j from the learning set ranked according to the RSV_kld(q_i, d_j) value.
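The following sketch illustrates Equations (1)-(3) in Python; the concept-count dictionaries `doc_counts` and `coll_counts`, and the smoothing value λ_u = 0.5, are assumptions made for illustration only, since the paper does not report the parameter value.

```python
import math

def p_concept(c, doc_counts, coll_counts, lam=0.5):
    """P(c | M_d) with Jelinek-Mercer smoothing (Eq. 3)."""
    f_d = sum(doc_counts.values()) or 1
    f_c = sum(coll_counts.values()) or 1
    return (1 - lam) * doc_counts.get(c, 0) / f_d + lam * coll_counts.get(c, 0) / f_c

def rsv_kld(query_counts, doc_counts, coll_counts, lam=0.5):
    """Rank-equivalent form of -D(M_q || M_d): sum over query concepts (Eq. 1-2)."""
    f_q = sum(query_counts.values()) or 1
    score = 0.0
    for c, w in query_counts.items():
        p_q = w / f_q                 # maximum likelihood estimate, no smoothing on the query
        score += p_q * math.log(p_concept(c, doc_counts, coll_counts, lam))
    return score
```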
2.4 Post-Processing of the Results
As we just mentioned, in this basic case we may associate the query image with the room id of the best-ranked image. However, because we represent each image with several features and because we have several images of each room in the training set, we post-process this basic result:
– Fusion: an image is represented independently for each feature considered (color with a given grid, texture with a given grid, regions of interest). Each of these representations leads to different matching results using the language model. We choose to make a late fusion of the three results using a linear combination:

RSV(Q, D) = Σ_i RSV_kld(q_i, d_i)    (4)
where Q and D correspond to the image query and documents, and q_i and d_i describe the query and the document according to a feature i.
– Grouping training images by their room: assuming that the closest training image of a query image is not sufficient to determine the room because of its intrinsic ambiguity, we propose to group the results of the n best images for each room. We are then able to compute a ranked list of rooms RL instead of an image list for each query image:

RL_q = [(r, RSV_r(q, r))]    with    RSV_r(q, r) = Σ_{d ∈ f_n-best(q, r)} RSV(q, d)    (5)
where r corresponds to a room and f_n-best is a function that selects the n images with the best RSV belonging to the room r.
– Filtering the unknown room: in the test set of the Robotvision task, we know that one additional room is added. To tackle this point, we assume that if a room r is recognized, then the matching value for r is significantly larger than the matching values for the other rooms, especially compared to the room with the lowest matching value. So, if this difference is large (> β), we consider that there is a significant difference and we keep the tag r for the image. Otherwise we consider the image room tag as unknown. In our experiments, we fixed the threshold β to 0.003.
– Smoothing window: we exploit the visual continuity in a sequence of images by smoothing the result along the temporal axis. To do so, we use a flat smoothing window (i.e., all the images in the window have the same weight) centered on the current image. In the experiments, we chose a window width of w = 40 (i.e. 20 images before and after the classified image). A short code sketch of these post-processing steps follows the list.
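A minimal sketch of these post-processing steps is given below; it is an illustration rather than the authors' code, the room/score data structures are assumptions, and the temporal smoothing is rendered here as a majority vote over labels, which is only one possible reading of the flat window.

```python
from collections import defaultdict

def room_scores(ranked, n_best=15):
    """ranked: list of (room_id, rsv) pairs for one query image, best first."""
    per_room = defaultdict(list)
    for room, rsv in ranked:
        per_room[room].append(rsv)
    # sum the n best RSV values of each room (Eq. 5)
    return {room: sum(sorted(scores, reverse=True)[:n_best])
            for room, scores in per_room.items()}

def decide_room(scores, beta=0.003):
    """Tag as 'unknown' when the best room does not clearly dominate the worst one."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, worst = ordered[0], ordered[-1]
    return best[0] if best[1] - worst[1] > beta else "unknown"

def smooth(labels, w=40):
    """Flat window of width w centred on each frame (majority-vote reading)."""
    half, out = w // 2, []
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        out.append(max(set(window), key=window.count))
    return out
```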
2.5 Validating Process
The validation aims at evaluating the robustness of the algorithms to visual variations that occur over time due to changing conditions and human activity. We trained our system with the night3 condition set and tested it against all the other conditions from the validation set. Our objective was to understand the behavior of our system under changing conditions and with different types of features. We first study the models one by one. We built 3 different language models corresponding to the 3 types of visual features. The training set used is the night3 set. The models Mc and Me correspond to the color histogram and the edge histograms generated from a 5×5 grid. The model Ms corresponds to the color SIFT feature extracted from interest points. The recognition rates according to several validation sets are presented in Table 1.

Table 1. Results obtained with 3 visual language models (Mc, Me, Ms)

Train   Validation  HSV (Mc)  Edge (Me)  SIFT color (Ms)
night3  night2      84.24%    59.45%     79.20%
night3  cloudy2     39.33%    58.62%     60.60%
night3  sunny2      29.04%    52.37%     54.78%
We noticed that, in the same condition (e.g. night-night), the HSV color histogram model Mc outperforms the two other models. However, under different conditions, the result of this model drops significantly (from 84% to 29%). On the other hand, the edge model (Me) and the SIFT color model (Ms) are more robust to the change of conditions. In the worst condition (night-sunny), they still obtain a recognition rate of 52% for Me and 55% for Ms. As a result, we chose to consider only the edge histogram and SIFT features for the official runs. Then, we studied the impact of the post-processing on the ranked lists of the models Me and Ms on the recognition rate in Table 2.
Table 2. Result of the post-processing steps based on the 2 models Me and Ms

Train   Validation  Fusion  Regrouping  Filtering      Smoothing
night3  sunny2      62%     67% (n=15)  72% (β=0.003)  92% (k=20)
The fusion of the 2 models leads to an overall improvement of 8%. The regrouping step helps prominent rooms to stand out in the score list by averaging each room's n best scores. The filtering, using the threshold β=0.003, eliminates some of the uncertain decisions. Finally, the smoothing step with a window size of 40 increases the performance on a sequence of images significantly, by more than 20% compared to the initial result.
2.6 Submitted Runs and Results
For the official test, we constructed 3 models based on the validating process. We eliminated the HSV histogram model because of its poor performance across different lighting conditions and because there was little chance of having the same condition. We used the same visual vocabulary of 500 visual concepts generated for the night3 set. Each model provided a ranked result corresponding to the released test sequence. The post-processing steps were performed as in the validating process, employing the same parameters. The visual language models built for the competition are: Me1, a visual language model based on edge histograms extracted from a 10×10 patch division; Me2, a visual language model based on edge histograms extracted from a 5×5 patch division; and Ms, a visual language model based on color SIFT local features. Our test was performed on a quad-core 2.00 GHz computer with 8 GB of memory. The training took about 3 hours on the whole night3 set. Classification of the test sequence was executed in real time. Based on the 3 visual models constructed, we submitted 4 valid runs to the ImageCLEF evaluation (our runs with smoothing windows were not valid):
– 01-LIG-Me1Me2Ms: linear fusion of the results coming from the 3 models (score = 328). We consider this run as our baseline;
– 02-LIG-Me1Me2Ms-Rk15: re-ranking the result of 01-LIG-Me1Me2Ms with the regrouping of the top 15 scores for each room (score = 415);
– 03-LIG-Me1Me2Ms-Rk15-Fil003: if the difference between the 1st and the 4th scores in the ranked list is too small (i.e. below β = 0.003), we remove that image from the result list (score = 456.5);
– 05-LIG-Me1Ms-Rk15: same as 02-LIG-Me1Me2Ms-Rk15 but with the fusion of only 2 types of image representation (score = 25).
These results show that the grouping increases results by 27% compared to the baseline. Adding a filtering step after the grouping increases the results again, gaining more than 39% compared to the baseline. The use of SIFT features is also validated: the result obtained by the run 05-LIG-Me1Ms-Rk15 is not good, even after grouping the results by room. Our best run, 03-LIG-Me1Me2Ms-Rk15-Fil003, for the obligatory track is ranked at 12th place among 21 runs submitted
overall. We conclude from these results that post-processing is a must in the context of Robotvision room recognition.
3 Image Retrieval and Image Annotation Tasks Results
This paper focuses on the Robotvision task, but the MRIM-LIG group also submitted results for the image annotation and the image retrieval tasks. For the image annotation task, we tested a simple late fusion (selection of the best) based on three different sets of features: RGB colors, SIFT features, and an early fusion of HSV color space and Gabor filter energies. We tested two learning frameworks using SVM classifiers: a simple one-against-all, and a multiple one-against-all inspired by the work of Tahir, Kittler, Mikolajczyk and Yan called Inverse Random Under Sampling [14]. As post-processing, we applied to all our different runs a linear scaling so as to fit the a priori probabilities of the learning set. We afterwards took the concept hierarchy into account in the following way: a) when conflicts occur (for instance the tag Day and the tag Night are both associated with one image of the test set), we keep the tag with the larger value unchanged and decrease (linearly) the values of all the other conflicting tags; b) we propagate the concept values in a bottom-up way if the value of the generic concept is increased, otherwise we do not update the pre-existing values. The best result that we obtained was 0.384 for equal error rate (rank 34 of 74 runs) and 0.591 for recognition rate (rank 45 of 74). These results need to be studied further. For the image retrieval task, we focused on a way to generate subqueries, corresponding to potential clusters for the diversity process. We extracted the ten words most co-occurring with the query words, and used these words in conjunction with the initial query to generate sub-queries. One interesting result comes from the fact that, for a text+image run, the result we obtained for the 25 last queries (the ones for which we had to generate sub-queries) was ranked 6th. This result encourages us to further study the behavior of our proposal.
4 Conclusion
To summarize our work on the Robotvision task, we have presented a novel approach for the localization of a mobile robot using visual language modeling. Theoretically, this model fits within the standard language modeling approach, which is well developed for IR. At the same time, this model helps to capture the generality of the visual concepts associated with the regions from a single image or a sequence of images. The validation process has shown a good recognition rate of our system under different illumination conditions. We believe that a good extension of this model is possible in a real scenario of scene recognition (more precisely for robot self-localization). With the addition of more visual features and an increase in system robustness, this could be a suitable approach for future recognition systems. For the two other tasks in which we participated, we achieved average results. For the image retrieval task we will study the diversity algorithm more specifically in the future.
Acknowledgment
This work was partly supported by: a) the French National Agency of Research (ANR-06-MDCA-002), b) the Quaero Programme, funded by OSEO, French State agency for innovation, and c) the Région Rhône-Alpes (LIMA project).
References
1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient learning and exhaustive recognition. In: Conference on Computer Vision and Pattern Recognition (2005)
3. Gao, J., Nie, J.-Y., Wu, G., Cao, G.: Dependence language model for information retrieval. In: ACM SIGIR 2004, pp. 170–177 (2004)
4. Gosselin, P., Cord, M., Philipp-Foliguet, S.: Kernels on bags of fuzzy regions for fast object retrieval. In: International Conference on Image Processing (2007)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 91–110 (2004)
6. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. IROS 2007 (2007)
7. Maisonnasse, L., Gaussier, E., Chevallet, J.P.: Model fusion in conceptual language modeling. In: ECIR 2009, pp. 240–251 (2009)
8. Maisonnasse, L., Gaussier, E., Chevallet, J.: Revisiting the dependence language model for information retrieval. In: Poster SIGIR 2007 (2007)
9. Maisonnasse, L., Gaussier, E., Chevallet, J.: Multiplying concept sources for graph modeling. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 585–592. Springer, Heidelberg (2008)
10. Pham, T.T., Maisonnasse, L., Mulhem, P., Gaussier, E.: Visual language model for scene recognition. In: Proceedings of SinFra 2009, Singapore (2009)
11. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: ACM SIGIR 1998, pp. 275–281 (1998)
12. Song, F., Croft, W.B.: General language model for information retrieval. In: CIKM 1999, pp. 316–321 (1999)
13. Srikanth, M., Srikanth, R.: Biterm language models for document retrieval. In: Research and Development in Information Retrieval, pp. 425–426 (2002)
14. Tahir, M.A., Kittler, J., Mikolajczyk, K., Yan, F.: A multiple expert approach to the class imbalance problem using inverse random under sampling. In: Multiple Classifier Systems, Reykjavik, Iceland, pp. 82–91 (2009)
15. Won, C.S., Park, D.K., Park, S.-J.: Efficient use of MPEG-7 edge histogram descriptor. ETRI Journal 24(1) (2002)
The ImageCLEF Management System

Ivan Eggel¹ and Henning Müller²

¹ Business Information Systems, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland
² Medical Informatics, University and Hospitals of Geneva, Switzerland
ivan.eggel@hevs.ch
Abstract. The ImageCLEF image retrieval track has been part of CLEF (Cross Language Evaluation Forum) since 2003. Organizing ImageCLEF, with its large participation of research groups, involves a considerable amount of work and data to manage. The goal of the management system described in this paper was to reduce manual work and professionalize the structures for organizing ImageCLEF. Having all ImageCLEF sub tracks in a single run submission system reduces the work of the organizers and makes submissions easier for participants. The system was developed as a web application using Java and JavaServer Faces (JSF) on Glassfish with a Postgres 8.3 database. The main functionality consists of user, collection and subtrack management as well as run submission. The system has two main user groups, participants and administrators. The main tasks for participants are to register for subtasks and then submit runs. Administrators create collections for the sub tasks and can define the data and constraints for submissions. The described system was used for ImageCLEF 2009 with 86 registered users and more than 300 submitted runs in 7 subtracks. The system has proved to significantly reduce manual work and will be used for upcoming ImageCLEF events and other evaluation campaigns.
1 Introduction
ImageCLEF is the cross-language image retrieval track, which is run as part of the Cross Language Evaluation Forum (CLEF). ImageCLEF¹ has seen participation from both academic and commercial research groups worldwide from communities including cross-language information retrieval (CLIR), content-based image retrieval (CBIR) and human-computer interaction. The main objective of ImageCLEF is to advance the field of image retrieval and offer evaluation in various fields of image information retrieval. The mixed use of text and visual features has been identified as important because little knowledge exists on such combinations, and most research groups work either on text or on images but only few work on both. By making visual and textual baseline results available, ImageCLEF gives participants data and tasks to obtain the information that they do not have themselves [1,2]. ImageCLEF 2009 was divided into 7 subtracks (tasks), each of which provides an image collection:
¹ http://www.imageclef.org/
– ImageCLEFmed: medical retrieval;
– ImageCLEFmed-annotation-IRMA: automatic medical image annotation task for the IRMA (Image Retrieval in Medical Applications) data set;
– ImageCLEFmed-annotation-nodules: automatic medical image annotation for lung nodules;
– ImageCLEFphoto: photographic retrieval;
– ImageCLEFphoto-annotation: annotation of images using a simple ontology;
– ImageCLEFwiki: image retrieval from a collection of Wikipedia images;
– ImageCLEFrobot: robotic image analysis.
ImageCLEF has been part of CLEF since 2003, with the number of registered research groups having grown from 4 in 2003 to 86 in 2009. Given the ever-growing number of participants, it has become increasingly difficult to manage the registration, the communication with participants and the run submission manually. The data includes a copyright agreement for CLEF, submitted runs, the tasks a user registered for, and contact details for each participant. Registered groups received passwords for the data download of each of the sub tasks, which were sent manually upon signature of the copyright agreement. The many manual steps created misunderstandings, data inconsistencies, and a large amount of email requests. After several years of experience with much manual work, a computer-based solution was created in 2009. In this paper we present the developed system, based on Java and JSF (JavaServer Faces), to manage ImageCLEF events without replacing other already existing tools such as Easychair² for review management or DIRECT, used to evaluate results in several other CLEF tasks [3]. The new system was developed to integrate into the ImageCLEF structure and to facilitate organizational issues. This includes a run submission interface that avoids every task developing its own solution.
2 Methods
For the implementation of the system we relied on Java and JSF running on Glassfish v2.1. For data storage a Postgres 8.2 database was employed. The bridge between Java and Postgres was established with a Postgres JDBC 3 driver. Other technologies used for client-side interaction were pure JavaScript and AJAX. The server used an Intel Xeon Dual Core 1.6 GHz processor with 2 GB of RAM and a total disk space of 244 GB, running on SuSE Linux.
3 Results
The ImageCLEF management system³ mainly handles 4 functions: the management of users, collections, sub tracks and runs. The possibility of dynamic sub track creation makes the system usable for other events, and participant data can be transferred from one event to another. Supporting a new event mainly requires setting up a new database, making the application flexible.
² http://www.easychair.org/
³ http://medgift.unige.ch:8080/ICPR2010/faces/Login.jsp
3.1 User Management
Account Types. Generally, there are two user groups in the management system: participants and administrators. Participants are users whose goal is to participate in one or more ImageCLEF tasks and submit runs. After the registration and the validation of the copyright agreement by the organizers, a user is allowed to submit runs. Administrators are users with the rights to set up and modify the system with essential data, e.g. creating subtracks or deleting users. They can also act as participants for run submissions. Usually, all ImageCLEF organizers have their own administrator accounts. To become an administrator the user needs to be registered as a participant. An existing administrator can then convert an existing participant account into an administrator account.
User Registration. Each participating group can register easily and quickly. A link for the registration on the initial login page guides the user to the registration process. For security reasons it is not possible to register as an administrator, so it is necessary to register as a participant first. To complete the registration, the following information needs to be provided:
– group name (e.g. name of association, university, etc.);
– group e-mail address (representative for the group);
– group address;
– group country;
– first name of contact person;
– last name of contact person;
– phone number of contact person (not mandatory);
– selection of sub tracks the participant wishes to participate in.
After submitting the registration form, the system validates all input fields and (in case of validity) stores the participant's registration information in the database; at the same time, the login password is sent to the participant by e-mail.
General Resources/Tasks of User Management. There are several resources and tasks for user management, which include viewing a list of all users and users' details, updating and deleting a user, as well as validating pending participant signatures. In Figure 1 the list of all users shows a table with users row by row. Every row represents a user, with the possibility to navigate to the detail and update pages by clicking the corresponding links in the table. There is also a delete button in each row, which removes the user from the database. Only administrators are allowed to delete participants; however, it is not permitted to remove another administrator account. It is possible for every user, regardless of being an administrator or a participant, to view a user detail page, with the restriction that participants are not able to see the list of submitted runs within another user's page (see Figure 2). The system also provides an update function. While participants can only update their own accounts, administrators are allowed to update all participants they wish to. Only administrators possess the authorization to validate a participant's signature for the copyright agreement.
Fig. 1. List of all the users, allowing sorting by various criteria and offering different views
Fig. 2. The view of the details of one user
3.2 Collection Management
A collection describes a dataset of images used for the retrieval. Since all subtracks are associated with a collection, the creation of a collection has to be performed before adding a sub track. Theoretically, the same collection can be part of several sub tracks. Any administrator can create new collections. For a new collection the user needs to provide information such as the name of the collection, the number of images in the collection and the address of its location on the web. Additionally, the user has to provide an imagenames file, i.e. a file containing the names of all images in the collection with one image name per line. Providing this file is essential to perform checks on run submissions, i.e. whether the images specified in a submitted run file are contained in the collection. Only administrators can perform updates on existing collections if necessary. The update page provides the possibility to change ordinary collection information as well as to exchange the imagenames file.
3.3 Subtrack Management
Each subtrack determines a beginning and an end date, preventing participants from submitting runs for this subtrack when the time period for submission is over. Every subtrack allows only a limited number of submitted runs per participant. Like all organizational tasks, creating a new subtrack is only possible for administrators. The interface for the creation of new subtracks asks the user to provide information such as the name of the subtrack, the maximal number of runs allowed, as well as the start and end dates of the task. Providing these dates prevents a participant from submitting runs for this task before the task starts or after the task has finished. It is equally important to select the collection associated with the subtrack, which demands the prior creation of at least one collection. In a task view, all submitted runs for the task are listed in a table (only accessible to administrators). Administrators also have the privilege to download all submitted runs for the task in one zip file. All participants in the subtrack are listed.
3.4 Runs
Run submission is one of the central functions of the presented system. Each participant has the opportunity to submit runs. Administrators can act as participants and thus submit runs as well. Figure 3 shows an example of a run submission. The main item of a run submission is the run file itself, which can be uploaded on the same page. After the file upload and before storing the metadata in the database, the system executes a run file validation. Due to varying file formats among the tasks, specific validators were created for each task. In case of an invalid file the transaction is discarded, i.e. the data is not stored in the system and an error message notifies the user, avoiding the submission of runs in an incorrect format. Likewise, the validator ensures that each image specified in the run file is part of the collection. All this avoids the submission of incorrect run files and thus manual work for the organizers.
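The validation logic can be illustrated by the following sketch; the real system is implemented in Java/JSF, so this Python version, as well as the assumed run-file layout with the image identifier in the last column, is purely illustrative.

```python
def load_image_names(imagenames_file):
    """Read the collection's imagenames file (one image name per line) into a set."""
    with open(imagenames_file, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def validate_run(run_file, collection_images):
    """Collect validation errors; an empty list means the run would be accepted."""
    errors = []
    with open(run_file, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                continue
            image_name = fields[-1]     # assumption: image id is the last column
            if image_name not in collection_images:
                errors.append(f"line {lineno}: '{image_name}' not in collection")
    return errors
```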
Fig. 3. Example of a run submission
Administrators have the possibility to see all submitted runs in a table, whereas ordinary participants are only allowed to see their own runs. The simplest way for an administrator to view his own or another user's submitted runs is to inspect the user's detail page. For administrators, a table with all submitted runs of all users also appears on the initial sub track page. A useful feature for administrators is the opportunity to download all runs of a subtrack in one zip file. The system generates (at runtime) a zip file including all runs of a particular task. The same page equally provides the facility to download a zipped file of run metadata XML files, with each file corresponding to a run. After submission it is still permitted to modify one's own runs by replacing the run file or by altering meta information on the run.
4 System Use in 2009
The registration interface of the system provided an easy way for users to register themselves for ImageCLEF 2009. The system counted 86 registered users from 30 countries. 10 of these users were also system administrators, the rest normal ImageCLEF participants. ImageCLEF 2009 consisted of 7 sub tracks (see Table 1). With 37 participants, the ImageCLEFphoto-annotation task had the largest number of participants, whereas the RobotVision task with its 16 participants recorded the smallest number. As shown in Table 1, participants of the ImageCLEFmed task submitted 124 runs in total, which was the highest number of submitted runs per
Table 1. ImageCLEF tasks with number of users and submitted runs

Task                              # users  # runs
ImageCLEFmed                      34       124
ImageCLEFmed-annotation-IRMA      23       19
ImageCLEFmed-annotation-nodules   20       0
ImageCLEFphoto                    34       0
ImageCLEFphoto-annotation         37       74
ImageCLEFwiki                     30       57
RobotVision                       16       32
TOTAL                             86       306
subtrack, although the task did not have the largest number of participants. The high number of submitted runs was partly due to ImageCLEFmed being divided into image-based and case-based topics, allowing groups to submit twice as many runs. Both ImageCLEFmed-annotation tasks as well as ImageCLEFphoto did not use the system's run submission interface and used other tools. However, it is foreseen that all tasks will use the system for run submission in the future. A total of 39 participants did not submit any run on the system. Some of these participants only participated in tasks that did not use the described interface, and others finally did not submit any runs at all. Sometimes groups registered with more than one email address; in these cases we asked the groups to remove the additional identifiers and have a unique submission point per group.
5 Conclusion
This paper briefly presents a solution to reduce manual and redundant work for benchmarking events such as ImageCLEF. The goal was to complement already existing systems such as DIRECT or Easychair and supply the missing functionality. All seven ImageCLEF tasks were integrated, and almost all participants who registered for ImageCLEF via the paper-based registration also registered electronically. Not all tasks used the provided run submission interface, but this is foreseen for the future. With 86 registered users and more than 300 submitted runs, the prototype system proved to work in a stable and reliable manner. Several small changes were made to the system based on comments from the users, particularly in the early registration phase. Reminder emails for forgotten passwords were added, as well as several views and restrictions of views on the data. In the first version, run file updates were not possible once the run was submitted; this was changed. The renaming of the original run file names by the system after submission, which was meant to unify the submitted names based on the identifiers given inside the files, caused confusion: some participants were then unable to properly identify their runs without a certain effort. To avoid this, the system will keep the original names of run files in the future. There is also more flexibility in the metadata for each of the runs before submission, but the goal is to harmonize this across tasks as much as possible.
The management system enormously reduced the manual interaction between participants and organizers of ImageCLEF. As the standard CLEF registration was still on paper with a signed copyright agreement, the electronic system made it possible to have one contact with participants and then make all information available at a single point of entry: the ImageCLEF web pages and, with them, the registration system. Passwords did not need to be sent to participants manually; access was organized through the system. Having a single submission interface also lowered the entry burden for participants of several sub tasks. Having only fully validated runs avoided a large amount of manual work for cleaning the data and contacting participants.
Acknowledgements
This work was partially supported by the BeMeVIS project of the University of Applied Sciences Western Switzerland (HES-SO).
References
1. Clough, P., Müller, H., Deselaers, T., Grubinger, M., Lehmann, T.M., Jensen, J., Hersh, W.: The CLEF 2005 cross-language image retrieval track. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, pp. 535–557. Springer, Heidelberg (2006)
2. Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.): CLEF 2007. LNCS, vol. 5152. Springer, Heidelberg (2008)
3. Nunzio, G.M.D., Ferro, N.: DIRECT: A system for evaluating information access components of digital libraries. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds.) ECDL 2005. LNCS, vol. 3652, pp. 483–484. Springer, Heidelberg (2005)
Interest Point and Segmentation-Based Photo Annotation

Bálint Daróczy, István Petrás, András A. Benczúr, Zsolt Fekete, Dávid Nemeskey, Dávid Siklósi, and Zsuzsa Weiner

Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences
{benczur,daroczyb,zsfekete,ndavid,petras,sdavid,weiner}@ilab.sztaki.hu
http://www.sztaki.hu
Abstract. Our approach to the ImageCLEF 2009 tasks is based on image segmentation, SIFT keypoints and Okapi BM25-based text retrieval. We use feature vectors to describe the visual content of an image segment, a keypoint or the entire image. The features include color histograms, a shape descriptor, a 2D Fourier transform of a segment and an orientation histogram of detected keypoints. We trained a Gaussian Mixture Model (GMM) to cluster the feature vectors extracted from the image segments and keypoints independently. The normalized Fisher gradient vector computed from the GMM of SIFT descriptors is a well-known technique to represent an image with a single vector. Novel to our method is the combination of Fisher vectors for keypoints with those of the image segments to improve classification accuracy. We also introduce correlation-based combining methods to further improve classification quality.
1 Introduction
In this paper we describe our approach to the ImageCLEF Photo, WikipediaMM and Photo Annotation 2009 evaluation campaigns [11,17,12]. The first two campaigns are ad-hoc image retrieval tasks: find as many relevant images as possible in the image collections. The third campaign requires image classification into 53 concepts organized in a small ontology. The key feature of our solution in the first two cases is to combine text-based and content-based image retrieval. Our method is similar to the one we applied in 2008 for ImageCLEF Photo [7]. Our CBIR method is based on the segmentation of the image and on the comparison of the features of the segments. We use the Hungarian Academy of Sciences search engine [3] as our information retrieval system; it is based on Okapi BM25 [16] and query expansion by a thesaurus.
2 Image Processing
We transform images into a feature space both to define their similarity for ad hoc retrieval and to apply classifiers over them for annotation. For image processing we employ both SIFT keypoints [9] and image segmentation [6,14,5,10]. While SIFT is a standard procedure, we describe our home-developed segmenter in more detail below. Our iterative segmentation algorithm [2] is based on a graph of the image pixels where the eight neighbors of a pixel are connected by edges. The weight of an edge is equal to the Euclidean distance of the pixels in the RGB space. We proceed in order of increasing edge weight as in a minimum spanning tree algorithm, except that we do not merge segments if their size and the similarity of their boundary edges are above a threshold. The algorithm consists of several iterations of the above minimum-spanning-tree-type procedure. In the first iteration we join strongly coherent pixels into segments. In further iterations we gradually increase the limits in order to enlarge segments and reach a required number of them. We performed color, shape, orientation and texture feature extraction over the segments and the environment of keypoints of images. This resulted in approximately 500–7000 keypoint descriptors in 128 dimensions and approximately two hundred segment descriptors in 350 dimensions. The following features were extracted for each segment: mean RGB histogram; mean HSV histogram; normalized RGB histogram; normalized HSV histogram; normalized contrast histogram; shape moments (up to 3rd order); DFT phase and amplitude. For ad hoc image retrieval we considered segmentation-based image similarity only. We extracted features for color histogram, shape and texture information for every segment. In addition we used contrast and 2D Fourier coefficients. An asymmetric distance function is defined in the above feature space as d(D_i, D_j) = min_k dist(S_ik, S_j), where {S_dt : t ≥ 1} denotes the set of segments of image D_d. Finally, the image similarity rank was obtained by subtracting the above distance from a sufficiently large constant.
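One literal reading of this distance is sketched below (our illustration, assuming NumPy arrays of segment descriptors); the exact aggregation of per-segment distances in the original system may differ.

```python
import numpy as np

def image_distance(segments_i, segments_j):
    """segments_*: (n_segments, n_features) arrays of segment feature vectors."""
    diff = segments_i[:, None, :] - segments_j[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distance of each segment pair
    # d(D_i, D_j): distance of the best-matching segment pair, as in the formula above
    return pairwise.min()

def similarity_rank(dist, large_const=1e6):
    """Similarity score obtained by subtracting the distance from a large constant."""
    return large_const - dist
```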
3 The Base Text Search Engine
We used the Hungarian Academy of Sciences search engine [3] as our information retrieval system, based on Okapi BM25 ranking [16] with the proximity of query terms taken into account [15,4]. We employed stopword removal and stemming by the Porter stemmer. We extended the stopword list with terms such as “photo” or “image” that are frequently used in annotations but do not have a distinctive meaning in this task. We applied query term weighting to distinguish definite and rough query terms; the latter may be obtained from the topic description or a thesaurus. We multiplied the BM25 score of each query term by its weight; the sum of the scores gave the final rank.
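A hedged sketch of this weighted BM25 scoring is shown below; the k1 and b values and the idf form are standard defaults rather than the values used by the authors, and the term-statistics dictionaries are assumed inputs.

```python
import math

def bm25_score(query_weights, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """query_weights: {term: weight}; doc_tf: {term: frequency in the document};
    df: {term: document frequency}; returns the weighted sum of BM25 term scores."""
    score = 0.0
    for term, w in query_weights.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += w * idf * norm      # each term's BM25 score scaled by its query weight
    return score
```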
We used a linear combination of the text-based and image-similarity-based scores for ad hoc retrieval. We considered the text-based score more accurate and used a small weight for the content-based score.
4 The WikipediaMM Task
We preprocessed the annotation text by regular expressions to remove author and copyright information. We made no differentiation between the title and the body of the annotation. Since file names often contain relevant keywords, often as substrings, we gave a score proportional to the length of the matching substring. Since the indexing of all substrings is infeasible, we only performed this step for those documents that already matched at least one query term in their body. For the WikipediaMM task we also deployed query expansion by an online thesaurus¹. We added groups of synonyms with reduced weight so that only the scores of the first few best-performing synonyms were added to the final score, to avoid overscoring long lists of synonyms.

Table 1. WikipediaMM ad hoc search evaluation

Run                                MAP     P10     P20
Image                              0.0068  0.0244  0.0144
Text                               0.1587  0.2668  0.2133
Image+Text                         0.1619  0.2778  0.2233
Text+Thesaurus                     0.1556  0.2800  0.2356
Text+Thesaurus lower weight        0.1656  0.2888  0.2399
Image+Text+Thesaurus lower weight  0.1684  0.2867  0.2355
1st place: DEUCENG, txt            0.2397  0.4000  0.3133
2nd place: LAHC, txt+img           0.2178  0.3378  0.2811
As seen in Table 1, our CBIR score improved performance in terms of MAP at the price of worse early precision, and expansion by the thesaurus improved the performance in a similar sense. The results of the winning and second-place teams are shown in the last rows.
5 The Photo Retrieval Task: Optimizing for Diversity
We preprocessed the annotation text by regular expressions to remove photographer and agency information. This step was particularly important to get rid of the false positives for Belgium-related queries, as the majority of the images have the Belga News Agency as their annotated source. Since the annotation was very noisy, we could only approximately cleanse the corpus.
¹ http://thesaurus.com/
Table 2. ImageCLEF Photo ad hoc search evaluation

Run               F-measure  P5    P20   CR5     CR20    MAP
Text CT           0.6449     0.5   0.64  0.5106  0.6363  0.49
Text              0.6394     0.52  0.68  0.4719  0.6430  0.50
Image+Text CT     0.6315     0.49  0.64  0.4319  0.6407  0.48
Image             0.1727     0.02  0.03  0.2282  0.2826  0
1st place: Xerox  0.80       –     –     –       –       0.29
2nd place: Inria  0.76       –     –     –       –       0.08
As the main difference from the WikipediaMM task, since almost all queries were related to names of people or places, we did not deploy the thesaurus. Some of the topics had a description (denoted by CT in the topic set as well as in Table 2) that we added with weight 0.1. We modified our method to achieve greater diversity within the top 20. For each topic in the ImageCLEF Photo set, relevant images were manually clustered into sub-topics. Evaluation was based on two measures: precision at 20 and cluster recall at rank 20, the percentage of different clusters represented in the top 20. The topics of this task were of two different types and we processed them separately in order to optimize for cluster recall. The first set of topics included subtopics; we merged the hit lists of the subtopics one by one. The last subtopic typically contained terms from other subtopics negated; we fed the query with negation into the retrieval engine. The other class of topics had no subtopics; here we proceeded as follows. Let Orig(i) be the i-th document (0 ≤ i < 999) and OrigSc(i) be the score of this element on the original list for a given query Q_j. We modified these scores by giving penalties to the scores of the documents based on their Kullback-Leibler divergence. We used the following algorithm.
Algorithm 1. Re-ranking
1. New(0) = Orig(0) and NewSc(0) = OrigSc(0)
2. For i = 1 to 20
   (a) New(i) = argmax_k {CL_i(k) | i ≤ k < 999}
   (b) NewSc(i) = max {CL_i(k) | i ≤ k < 999}
   (c) For ℓ = 0 to (i − 1): NewSc(ℓ) = NewSc(ℓ) + c(i)
Here CL_i(k) = OrigSc(k) + α Σ_{ℓ=0}^{i−1} KL(ℓ, k), where α is a tunable parameter and KL(ℓ, k) is the Kullback-Leibler distance of the ℓ-th and k-th documents. We used a correction term c(i) at Step (2c) to ensure that the new scores will also be in descending order.
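The following Python rendering of Algorithm 1 is an illustration under stated assumptions: the value of α and the exact form of the correction term c(i) are not given in the text, so illustrative choices are made here, and `kl` stands for a user-supplied Kullback-Leibler distance between two documents.

```python
def rerank(docs, scores, kl, alpha=0.1, top=20):
    """Greedy diversity re-ranking of a result list (our reading of Algorithm 1).

    docs:   documents in decreasing original-score order
    scores: the corresponding OrigSc values
    kl:     kl(d1, d2) -> Kullback-Leibler distance between two documents
    Only the first `top` positions are re-ordered; the tail stays unchanged.
    """
    docs, scores = list(docs), list(scores)
    new_scores = [scores[0]]
    for i in range(1, min(top, len(docs))):
        # CL_i(k): original score plus alpha times the KL distance to the
        # documents already placed at positions 0..i-1 (rewards diversity)
        def cl(k):
            return scores[k] + alpha * sum(kl(docs[l], docs[k]) for l in range(i))
        best_k = max(range(i, len(docs)), key=cl)
        best_score = cl(best_k)
        docs[i], docs[best_k] = docs[best_k], docs[i]
        scores[i], scores[best_k] = scores[best_k], scores[i]
        # correction term c(i): shift earlier scores so they stay above the new one
        c_i = max(0.0, best_score - new_scores[-1]) + 1e-9
        for l in range(i):
            new_scores[l] += c_i
        new_scores.append(best_score)
    return docs, new_scores
```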
6 The Photo Annotation Task
The Photo Annotation data consisted of 5000 annotated training images and 13000 test images. Our overall procedure is shown in Fig. 1. We used the bag-of-visual-words (BOV) approach with the Fisher kernel method for images [13,1]. The feature vectors were composed of SIFT keypoint descriptors and the color image segment descriptors such as shape and color histograms, as described in Section 2. We used a held-out set to rank each row from the Fisher kernel. After computing the results for all of the 53 concepts, a matrix of dimensionality N × 53 holds the concept detection results, where N is the number of images. The concept detection results from different kernels can be combined. We followed a correlation-based approach that exploits the connections between the concepts in the training annotation. It is described in Section 6.2.
Fig. 1. Our image annotation procedure
6.1 Feature Generation and Modeling
To reduce the size of the feature vectors we modeled them with g = 64 Gaussians. The classical EM algorithm with a diagonal covariance matrix assumption was used for the computation of the mixture parameters. To get fixed-sized image descriptors we computed (g − 1) + g × D × 2 dimensional normalized Fisher vectors per image [13,1], where D = 128 is the dimension of the low-level feature vectors. The t × t Fisher kernel matrix contained the L1 distances between all training images, where t = 5000 is the number of training images. We computed the Fisher kernels for several low-level feature type combinations, such as SIFT + image segments. We used the resulting Fisher kernels to train binary linear classifiers (the L2-regularized logistic regression classifier from the LibLinear package [8]) for each of the k = 53 concepts. For prediction we used the s × t kernel matrix with the trained linear classifiers, where s = 13000 denotes the number of test images.
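The construction of such normalized Fisher vectors can be sketched as follows; this is a simplified illustration based on the usual formulation [13], not the authors' code. scikit-learn is assumed, the L1 normalization matches the use of L1 distances in the kernel, and dropping one weight component to obtain g − 1 free weight parameters is one common convention.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors, g=64, seed=0):
    """Fit a diagonal-covariance GMM to a large sample of low-level descriptors."""
    gmm = GaussianMixture(n_components=g, covariance_type="diag", random_state=seed)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(gmm, x):
    """x: (T, D) descriptors of one image -> (g-1 + 2*g*D)-dimensional Fisher vector."""
    T, D = x.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (g,), (g,D), (g,D)
    gamma = gmm.predict_proba(x)                              # (T, g) posteriors
    s0 = gamma.sum(axis=0)                                    # soft counts per Gaussian
    # note: diff is (T, g, D); for very large T, process descriptors in batches
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    g_w = (s0 / T - w)[1:] / np.sqrt(w[1:])                   # g-1 free weight parameters
    fv = np.concatenate([g_w, g_mu.ravel(), g_var.ravel()])
    return fv / (np.abs(fv).sum() + 1e-12)                    # L1 normalization
```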
Fig. 2. Relations of the concepts. (a) Fragment of the CLEF 2009 ontology provided to the participants [11]. (b) Part of the auto-correlation matrix (denoted by CC in the text) of the training annotation, visualized as a graph. Positive weights mean positive correlation between concepts. (c) Part of the implication matrix (denoted by CI in the text). A connection may be expressed in verbal form, e.g. 'Landscape Nature is implied by Lake, Mountains and Water'. (d) Portrait implications. Compared with (a), it reflects the "hasPerson" relation.
6.2 Annotation-Based Combining
The ontology provided to the participants is an explicit description of the relations between the concepts. However, the user annotation of the training data contains implicit, weighted graphs of concepts. These graphs can be directed or undirected and can be used to enhance the results of the predictions. The following relations can be extracted from the user annotations.
Correlation: the co-occurrence of ConceptX and ConceptY is computed (Fig. 2b). Let A be the t × k annotation matrix. Each entry of A is either 0 or 1. Moreover, let C_C = [c_ij] be the k × k symmetric correlation matrix where c_ij = corr(a_i, a_j), a_i is the i-th column of A, and corr(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y) is the sample correlation coefficient.
Implication: ConceptB → ConceptA (see Fig. 2c and 2d).
Table 3. ImageCLEF 2009 Photo Annotation results. "Impl", "Cross" and "Cross and Impl Optimized" correspond to weighting with P_I, P_C and P_opt respectively.

Method                       EER       AUC
Segmentation                 0.372925  0.672872
SIFT                         0.350944  0.698616
SIFT + Segmentation          0.296315  0.771324
SIFT + Segmentation + Impl   0.291622  0.774395
SIFT + Segmentation + Cross  0.288599  0.776256
SIFT + Segm. + Cross + Impl  0.282956  0.780710
1st place: ISIS              0.234476  0.838699
2nd place: LEAR              0.249469  0.823105
For this, one computes the conditional probability matrix C_I = [c′_ij], where c′_ij = P(Concept_i | Concept_j) and i, j = 0..52 index the concepts. Based on the two matrices (or graphs) C_C and C_I, the relations of the concepts in the ontology can be characterized. The quality of the annotations can also be judged. For example: mutually exclusive concepts should have a negative correlation close to -100%; too many or too few edges may mean that the concept in the annotation is not discriminative enough. Using the matrices C_C and C_I we exploited the common knowledge of the annotations about the relationships between the concepts. With these matrices we re-weighted the output of the predictors. We also considered additional matrices produced by raising C_C and C_I to the p-th power element-wise, where p ∈ [1, 5]. Each concept corresponds to the same row in C_C, C_I and their element-wise powers. For a given concept the algorithm selected the row from C_C, C_I and their element-wise powers that yielded the maximal AUC. Let C_max denote the final weight matrix. Its rows are selected according to the following: C_max^i = max_{p,q}(C_C^i, C_I^i, (C_C^i)^p, (C_I^i)^q), where p, q ∈ [1..5], C^i denotes the i-th row of the matrix and i = 0..52 indexes the concepts. Let P denote the t × k matrix composed of the outputs of the predictors. Rows correspond to images, while columns correspond to concepts. The combined predictions were computed as P_C = P · C_C, P_I = P · C_I and P_opt = P · C_max. Our results are shown in Table 3. The AUC results of the best two teams were 0.838 and 0.823 respectively.
7 Conclusions
For image classification, we successfully combined a pure keypoint-based and a region-based method, two image processing algorithms that complement each other. Exploiting the training annotation improved the results. For image retrieval, our content-based score improved the text score in combination. The use of the thesaurus and other query expansion techniques increased the performance. We made minimal effort to optimize for diversity, while our results were strong in MAP.
References
1. Ah-Pine, J., Cifarelli, C., Clinchant, S., Csurka, G., Renders, J.: XRCE's participation to ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) Evaluating Systems for Multilingual and Multimodal Information Access. LNCS, vol. 5706. Springer, Heidelberg (2009)
2. Daróczy, B., et al.: SZTAKI @ ImageCLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (2009)
3. Benczúr, A.A., Csalogány, K., Friedman, E., Fogaras, D., Sarlós, T., Uher, M., Windhager, E.: Searching a small national domain—preliminary report. In: Proceedings of the 12th World Wide Web Conference (WWW), Budapest, Hungary (2003), http://datamining.sztaki.hu/?q=en/en-publications
4. Büttcher, S., Clarke, C.L.A., Lushman, B.: Term proximity scoring for ad-hoc retrieval on very large text collections. In: SIGIR 2006, pp. 621–622. ACM Press, New York (2006)
5. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1026–1038 (2002)
6. Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. J. Mach. Learn. Res. 5, 913–939 (2004)
7. Daróczy, B., Fekete, Z., Brendel, M., Rácz, S., Benczúr, A., Siklósi, D., Pereszlényi, A.: Cross-modal image retrieval with parameter tuning. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
8. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research 9, 1871–1874 (2008)
9. Lowe, D.: Object recognition from local scale-invariant features. In: International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)
10. Lv, Q., Charikar, M., Li, K.: Image similarity search with compact data structures. In: CIKM 2004: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 208–217. ACM Press, New York (2004)
11. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
12. Paramita, M., Sanderson, M., Clough, P.: Diversity in photo retrieval: overview of the ImageCLEFPhoto task 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
13. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
14. Prasad, B.G., Biswas, K.K., Gupta, S.K.: Region-based image retrieval using integrated color, shape, and location index. Comput. Vis. Image Underst. 94(1-3), 193–233 (2004)
15. Rasolofo, Y., Savoy, J.: Term proximity scoring for keyword-based retrieval systems. In: Sebastiani, F. (ed.) ECIR 2003. LNCS, vol. 2633, pp. 207–218. Springer, Heidelberg (2003)
16. Robertson, S.E., Jones, K.S.: Relevance weighting of search terms. In: Document Retrieval Systems, pp. 143–160. Taylor Graham Publishing, London (1988)
17. Tsikrika, T., Kludas, J.: Overview of the WikipediaMM task at ImageCLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (2009)
University of Jaén at ImageCLEF 2009: Medical and Photo Tasks

Miguel A. García-Cumbreras, Manuel Carlos Díaz-Galiano, María Teresa Martín-Valdivia, Arturo Montejo-Raez, and L. Alfonso Ureña-López

SINAI Research Group, Computer Science Department, University of Jaén, Spain
{magc,mcdiaz,maite,amontejo,laurena}@ujaen.es
Abstract. This paper describes the participation of the SINAI research group in the medical and photo retrieval ImageCLEF tasks. The approach for medical retrieval continues our use of the MeSH ontology for query expansion, but compares it to term expansion within the documents of the collection. Regarding the photo retrieval task, diversity of the top results has been pursued by applying a clustering algorithm. For both tasks, results and discussion are included. In general, no relevant findings were obtained with the novel approaches applied.
1 Introduction
This paper presents the participation of the SINAI research group in two different ImageCLEF tasks: the ImageCLEF medical retrieval task [5] and the ImageCLEF photo retrieval task [6]. For medical retrieval, our main goal is to study the expansion of the collection and the queries using the MeSH ontology. In previous years we have experimented with the expansion of the queries with medical ontologies [4,3]; now we want to compare those results with the expansion of terms in the collection. Other previous work includes the development of a system that tests different aspects, such as the application of Information Gain in order to improve the results [2], the expansion of the topics with the MeSH¹ ontology [3], and the expansion of the topics with the UMLS² metathesaurus, which uses less but more specific textual information [4]. As regards the photo retrieval approach, the task for 2009 is different from the evaluation in 2008, and the organizers give special value to the diversity of results. Given a query, the goal is to retrieve a relevant set of images at the top of a ranked list. Text and visual information can be used to improve the retrieval methods, and the main evaluation points are the use of pseudo-relevance feedback (PRF), query expansion, IR systems with different weighting functions, and clustering or filtering methods applied over the cluster terms. Our system makes use of text information, not visual information, to improve the retrieval
¹ http://www.nlm.nih.gov/mesh/
² http://www.nlm.nih.gov/research/umls/
methods, and a new method has been implemented to cluster images. The rest of the paper is organized in two sections, one for the medical task and another for the photo task. Each section includes the description of the system, the collection and the query treatment, the experiments and the results obtained. In addition, conclusions and discussion of further work are presented in Section 4.
2 Medical Task
2.1 System Description
A new collection was introduced in this task in 2008. We created different textual collections using information about this collection from the web [4]. In 2009, the medical task has been separated into two subtasks, one based on image retrieval (ad hoc) and another one based on medical case retrieval. For this reason, we have created three different textual collections, two collections for image based retrieval and one for case based retrieval. The collections are:
– C: contains the caption of the image, to be used in image based retrieval.
– CT: contains the caption of the image and the title of the article, to be used in image based retrieval.
– TA: contains the title and the text of the full article, to be used in medical case based retrieval.
We have experimented with expansion of the textual collection. Our initial aim was to expand the textual collection in the same way as the query expansion performed in last year's experiments [4]. Nevertheless, the time required to expand even the minimal textual collection with UMLS and the MetaMap software was excessive. For this reason, we have only used the MeSH ontology to expand collections and topics. From the collections used in image based retrieval, we have created two collections by expanding the text using the MeSH ontology. These new collections have been named CM and CTM, respectively. We have not used the collection with full articles for image based retrieval for two reasons: the first one is that we want to experiment with the expansion of the collection with minimal textual information; the second reason is that the expansion of a big textual collection requires a large amount of time.
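To make the collection-expansion step concrete, the following is a minimal sketch of the general idea (our illustration, not the SINAI code); mesh_synonyms is a hypothetical dictionary that would be loaded from the MeSH ontology.

```python
# Illustrative sketch (not the SINAI implementation) of expanding a text
# collection with an ontology: for every term that matches a MeSH entry,
# append its synonyms to the document text.
def expand_document(text, mesh_synonyms):
    tokens = text.lower().split()
    expansion = []
    for token in tokens:
        expansion.extend(mesh_synonyms.get(token, []))
    return text + " " + " ".join(expansion)

# Toy usage with a hypothetical dictionary entry:
mesh_synonyms = {"x-ray": ["radiography", "roentgenography"]}
print(expand_document("Chest x-ray of the patient", mesh_synonyms))
```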
2.2 Experiments
In our experiments we have expanded the whole set of topics, obtaining the following:
– t: original topic set, used in ad hoc retrieval.
– tM: topic set for ad hoc retrieval, expanded with the MeSH ontology.
– cbt: original topic set, used in case based retrieval.
– cbtM: topic set for case based retrieval, expanded with the MeSH ontology.
The dataset of the collection has been indexed using the Lemur IR system (http://www.lemurproject.org/), applying the KL-divergence weighting function and using Pseudo-Relevance Feedback (PRF). Table 1 shows the mean average precision (MAP) of the image based retrieval experiments. Contrary to expectations, these results show that query expansion does not improve the performance of the system. The expansion using only the collection with more textual information (the CTM collection) obtains the best results. The last row shows the best textual run in the competition (LIRIS maxMPTT extMPTT). Table 2 shows the results of the medical case based experiments. As we can see in this table, query expansion does not improve the results. The last row shows the best mixed-mode system in the competition (ceb-cases-essie2-automatic).

Table 1. MAP values of image based experiments
Collection   t        tM
C            0.3289   0.2754
CT           0.3569   0.3077
CM           0.3124   0.2838
CTM          0.3795   0.3286
Best         0.43     –

Table 2. Results of case based experiments
Collection   Topics   MAP
TA           cbt      0.2626
TA           cbtM     0.2605
Best         –        0.34

3 Photo Task
3.1 System Description
As mentioned in the introduction, in 2009 the main goal is to explore the benefits of increasing the diversity of results. Figure 1 shows a general scheme of the system developed. In the first module of the system, different sets of topics have been built by combining the title of the query, the words of each cluster and the title of the last cluster. In a second step the English collection has been preprocessed as usual (English stopword removal and Porter's stemmer [7]), and the documents have been indexed using the Lemur retrieval system with the Okapi weighting function and Pseudo-Relevance Feedback (PRF).
Fig. 1. General scheme of the SINAI system at ImagePhoto 2009
Then, these sets of topics were run over the IR system, and a list of relevant documents was obtained.
Clustering Subsystem. It has been found that retrieval performance increases when there is variability among the top results returned for a query; in some cases it is more desirable to have fewer but more varied items in this list [1]. In order to increase variability, a clustering system has been applied. The idea behind it is rather simple: re-arrange the most relevant documents so that documents belonging to different clusters are promoted to the top of the list. We have applied k-means to every list of results returned by the Lemur IR system. This has been done using the Rapid Miner tool (http://rapid-i.com/). The clustering algorithm groups these results, without any concern for ranking, into four different groups (this is the average number of clusters for training documents as specified in their metadata). Once each of the documents in the list has been labeled with its resulting cluster index, the list is reordered according to the result of the clustering process. Therefore, we fill the list by alternating documents from different clusters.
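A rough sketch of this cluster-based re-ranking, assuming scikit-learn and TF-IDF document vectors (the actual system used Rapid Miner, so this is only an illustration of the idea):

```python
# Sketch of cluster-based diversification: cluster the retrieved documents
# into k groups and interleave documents drawn from different clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def diversify(doc_ids, doc_texts, k=4):
    vectors = TfidfVectorizer().fit_transform(doc_texts)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    # Keep the original retrieval order within each cluster.
    buckets = [[d for d, l in zip(doc_ids, labels) if l == c] for c in range(k)]
    reranked = []
    while any(buckets):
        for bucket in buckets:          # alternate over clusters
            if bucket:
                reranked.append(bucket.pop(0))
    return reranked
```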
3.2 Experiments
In our ImagePhoto system we have tested the following configurations:
1. SINAI1 - Baseline. It is the baseline experiment. It uses Lemur as the IR system with automatic feedback. The weighting function applied was Okapi. The topic used is only the query title.
2. SINAI2 - title and final cluster. This experiment combines the query title with the title of the final cluster that appears in the topics file. Lemur also uses Okapi as the weighting function and PRF.
3. SINAI3 - title and all clusters. This experiment combines the query title with all the words that appear in the titles of all the clusters. Lemur also uses Okapi as the weighting function and PRF.
4. SINAI4 - clustering. The query title and each cluster title (except the last one, which combines all of them) are run against the index generated by the IR system. Several lists of relevant documents are retrieved, and the clustering module combines them to obtain the final list of relevant documents. The aim of this experiment is to increase the diversity of the retrieved results using a clustering algorithm.

Table 3 shows the results obtained in our four experiments. The last row shows the best system in the competition with only text (InfoComm group).

Table 3. SINAI experiment results for the ImagePhoto tasks
Experiment            CR10     P10     MAP      F-measure
sinai1 T TXT          0.4580   0.796   0.4454   0.5814
sinai2 TCT TXT        0.3798   0.58    0.3286   0.4590
sinai3 TCT TXT        0.5210   0.778   0.4567   0.6241
sinai4 TCT TXT        0.4356   0.474   0.2233   0.4540
LRI2R TI TXT (Best)   0.671    0.848   –        0.7492
4 Discussion and Conclusions
In the medical retrieval task, we have used topic and collection expansion. The topic expansion has been carried out in the same way as in the previous year. The expansion of the collection improves the results when the topics are expanded too. However, the obtained results are not satisfactory because we do not obtain the same results as in previous years. Although the collection is the same as in 2008, in 2009 we rebuilt the collection via the Web. This new collection is different from that generated in 2008. We conducted experiments with the 2008 collection and the results are similar to previous years, although the MAP values are lower than those obtained in 2009. This may indicate that there has been an error in the generation of the collection and the results are not relevant. In the photo retrieval task, we have experimented with different kinds of cluster combination. However, as we can see in Table 3, the application of clustering does not improve the results greatly. In fact, only the run SINAI3, which combines the original query title and the titles of all the clusters, overcomes
the baseline case SINAI1 that only uses the original title. Unfortunately, the experiment SINAI4 that applies our clustering and fusion approach has achieved the worst results. Thus, the obtained results show that it is necessary to continue investigating the clustering solution for diversity. In addition, the use of visual information could improve the final system.
Acknowledgements This work has been supported by the Regional Government of Andalucia (Spain) under excellence project GeOasis (P08-41999), the Spanish Government under project Text-Mess TIMOM (TIN2006-15265-C06-03) and the University of Jaen local project RFC/PP2008/UJA-08-16-14.
References
1. Chen, H., Karger, D.R.: Less is more: probabilistic models for retrieving fewer relevant documents. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 429–436. ACM, Seattle (2006)
2. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Montejo-Ráez, A., Ureña López, L.: Using Information Gain to Improve the ImageCLEF 2006 Collection. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 711–714. Springer, Heidelberg (2007)
3. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Montejo-Ráez, A., Ureña López, L.: Integrating MeSH Ontology to Improve Medical Information Retrieval. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 601–606. Springer, Heidelberg (2008)
4. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Ureña-López, L., Montejo-Ráez, A.: Query Expansion on Medical Image Retrieval: MeSH vs. UMLS. In: Peters, C., et al. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 732–735. Springer, Heidelberg (2009)
5. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the CLEF 2009 medical image retrieval task (2009)
6. Paramita, M.L., Sanderson, M., Clough, P.: Diversity in photo retrieval: Overview of the ImageCLEFphoto task 2009 (2009)
7. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980), http://portal.acm.org/citation.cfm?id=275705
Overview of VideoCLEF 2009: New Perspectives on Speech-Based Multimedia Content Enrichment
Martha Larson (1), Eamonn Newman (2), and Gareth J.F. Jones (2)
(1) Multimedia Information Retrieval Lab, Delft University of Technology, 2628 CD Delft, Netherlands
(2) Centre for Digital Video Processing, Dublin City University, Dublin 9, Ireland
m.a.larson@tudelft.nl, {enewman,gjones}@computing.dcu.ie
Abstract. VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the “Beeldenstorm” collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called “Finding Related Resources Across Languages,” involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language “Beeldenstorm” collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.
1 Introduction
VideoCLEF 2009 (http://www.multimediaeval.org/videoclef09/videoclef09.html) was a track of the CLEF (http://www.clef-campaign.org) benchmark campaign devoted to tasks aimed at improving access to video content in multilingual environments.
The overall goal of the VideoCLEF benchmarking initiative, now referred to as "MediaEval" (http://www.multimediaeval.org), is to develop new, forward-looking multimedia retrieval tasks and data sets with which to evaluate these tasks. During VideoCLEF 2009, three tasks were carried out. The Subject Classification Task required participants to automatically tag videos with subject theme labels (e.g., 'factories,' 'physics,' 'poverty', 'cultural identity' and 'zoos'). The Affect Task, also called "Narrative peak detection," involved automatically detecting dramatic tension in short-form documentaries. Finally, "Finding Related Resources Across Languages," referred to as the Linking Task, required participants to automatically link video to Web content that is in a different language, but on the same subject. The data sets for these tasks contained Dutch-language television content supplied by the Netherlands Institute of Sound and Vision (http://www.beeldengeluid.nl; called in Dutch Beeld & Geluid), which is one of the largest audio/video archives in Europe. Each participating site had access to video data, speech recognition transcripts, shot boundaries, shot-level keyframes and archival metadata supplied by VideoCLEF. Sites developed their own approaches to the tasks and were allowed to choose the method and features that they found most appropriate. Seven groups made submissions of task results for evaluation. In 2009, the VideoCLEF track ran for the first time as a full track within the Cross-Language Evaluation Forum (CLEF) evaluation campaign. The track was piloted last year as VideoCLEF 2008 [11]. The VideoCLEF track is the successor to the Cross-Language Speech Retrieval (CL-SR) track, which ran at CLEF from 2005 to 2007 [12]. VideoCLEF seeks to extend the results of CL-SR to the broader challenge of video retrieval. VideoCLEF is intended to complement the TRECVid benchmark [15] by running tasks related to the topic or subject matter treated by video and emphasizing the importance of speech and language (e.g., via speech recognition transcripts). TRECVid has traditionally focused on objects, entities and scenes that are depicted in the visual channel. In contrast, VideoCLEF concentrates on what is described in a video, in other words, what a video is about. This paper describes the data sets and the tasks of VideoCLEF 2009 and summarizes the results achieved by the participating sites. We finish with a conclusion and an outlook for MediaEval 2010. For additional information concerning individual approaches used in 2009, please refer to the papers of the individual sites in this volume.
1.1 Data
VideoCLEF 2009 used two data sets, both containing Dutch-language television programs. Note that these programs are predominantly documentaries with the addition of some talk shows. This means that the data contains a great deal of conversational speech, including opinionated and subjective speech and speech that has been only loosely planned. In this way, the VideoCLEF data is different and more challenging than broadcast news data, which largely involves scripted speech.
The VideoCLEF 2009 Subject Classification Task ran on TRECVid 2007 and 2008 data from Beeld & Geluid. The Affect Task and Linking Task both ran on a data set containing material from the short-form documentary Beeldenstorm, also supplied by Beeld & Geluid. For both data sets, Dutch-language speech recognition transcripts were supplied by the University of Twente [5]. The shot segmentation and the shot-level keyframe data were provided by Dublin City University [1]. Further details are given in the following. TRECVid 2007/2008 data set. In 2009, VideoCLEF attempted to encourage cross-over from the TRECVid community by recycling the TRECVid data set (http://www-nlpir.nist.gov/projects/tv2007/tv2007.html#3) for the Subject Classification Task. Notice that the Subject Classification Task is a fundamentally different task from what ran at TRECVid in 2007 and 2008. Subject Classification involves automatically assigning subject labels to videos at the episode level. The subject matter of the entire video is important, not just the concepts visible in the visual channel and not just the shot-level topic. Classifying video, i.e., taking a video and assigning it a topic class subject label, is exactly what the archive staff does at Beeld & Geluid when they annotate video material that is to be stored in the archive. The class labels used for the VideoCLEF 2009 Subject Classification Task are a subset of the labels that are used by archive staff. As a result, we have gold standard topic class labels with which to evaluate classification. Additionally, we can be relatively certain that if these labels are already used for retrieval of material from the archive then they are relevant for video search in an archive setting, and, we assume, beyond. Original Dutch-language examples of subject labels can be examined in the archive's search engine at http://zoeken.beeldengeluid.nl: a keyword search returns a results list with a column labeled Trefwoorden (keywords); these are the topic class subject labels used in the archive. In the VideoCLEF 2009 Subject Classification Task, archivist-assigned subject labels were used as ground truth. The training set is a large subset of TRECVid 2007 and contains 212 videos. The test set is a large subset of TRECVid 2008 and contains 206 videos.
In total 46 labels were used: aanslagen (attacks), armoede (poverty), burgeroorlogen (civil wars), criminaliteit (crime), culturele identiteit (cultural identity), dagelijks leven (daily life), dieren (animals), dierentuinen (zoos), economie (economy), etnische minderheden (ethnic minorities), fabrieken (factories), families (families), gehandicapten (disabled), geneeskunde (medicine), geneesmiddelen (pharmaceutical drug), genocide (genocide), geschiedenis (history), gezinnen (families), havens (harbors), hersenen (brain), illegalen (undocumented immigrants), journalisten (journalist), kinderen (children), landschappen (landscapes), media (media), militairen (military personnel), musea (museums), muziek (music), natuur (nature), natuurkunde (physics), ouderen (seniors), pers (press), politiek (politics), processen (lawsuits), rechtszittingen (court hearings), reizen (travel), taal (language), verkiezingen (elections), verkiezingscampagnes (electoral campaigns), voedsel (food), voetbal (soccer), vogels (birds), vrouwen (women), wederopbouw (reconstruction), wetenschappelijk onderzoek (scientific research), ziekenhuizen (hospitals).
The videos are mostly drawn from a subset of the overall Beeld & Geluid collection called Academia (http://www.academia.nl/), which is a collection that was created for use in research and educational settings. The Academia collection currently contains about 7,000 hours of video. In general, each video in the training and test set is an individual episode of a television show. Their length varies widely, with the average length being around 30 minutes. Note that the VideoCLEF 2009 Subject Classification set excludes several videos in the TRECVid collection for which archival metadata was not available. We would also like to explicitly point out that participants were not required to make use of the training data set, but were free to collect their own training data if they wished.
Beeldenstorm data set. For both the Affect Task and the Linking Task a data set consisting of 45 episodes of the documentary series Beeldenstorm (Eng. Iconoclasm) was used. The Beeldenstorm series consists of short-form Dutch-language video documentaries about the visual arts. Each episode lasts approximately eight minutes. Beeldenstorm is hosted by Prof. Henk van Os, known and widely appreciated not only for his art expertise, but also for his narrative ability (http://www.avro.nl/tv/programmas a-z/beeldenstorm/). This data set is also supplied by Beeld & Geluid, but it is mutually exclusive with the TRECVid 2007/2008 data set. The narrative ability of Prof. van Os makes the Beeldenstorm set an interesting corpus to use for affect detection, and the domain of visual arts offers a wide number of possibilities for interesting multimedia links for the linking task. Finally, the fact that each episode is short makes it possible for assessors to watch the entire episode when creating the ground truth. Knowledge of the complete context is important for relevance judgments for cross-language related resources and also for defining narrative peaks. The ground truth for the Affect Task and Linking Task was created by a team of three Dutch-speaking assessors during a nine-day assessment and annotation event at Dublin City University referred to as Dublin Days. The videos were annotated with the ground truth with the support of the Anvil Video Annotation Research Tool (http://www.anvil-software.de/) [8]. Anvil makes it possible to generate frame-accurate video annotations in a graphic interface. Particularly important for our purposes was the support offered by Anvil for user-defined annotation schemes. Details of the ground truth creation are included in the discussions of the individual tasks in the following section.
2 Subject Classification Task
2.1 Task
The goal of the Subject Classification Task is automatic subject tagging. Semantic-theme-based subject tags are assigned automatically to videos. The purpose of
these tags is to make the videos findable to users who are searching and browsing the collection. The information needs (i.e., queries) of the users are not specified at the time of tagging. In VideoCLEF 2009, the Subject Classification Task had the specific goal of reproducing the subject labels that were hand assigned to the test set videos by archivists at Beeld & Geluid. Since these subject labels are currently in use to archive and retrieve video in the setting of a large archive, we are confident about their usefulness for search and browsing in real-world information retrieval scenarios. The Subject Classification Task was introduced during the VideoCLEF 2008 pilot [11]. In 2009, the number of videos in the collection was increased from 50 to 418 and the number of subject labels increased from 10 to 46.
2.2 Evaluation
The Subject Classification Task is evaluated using Mean Average Precision (MAP). This choice of score is motivated by the popularity of techniques that approach the subject tagging task as an information retrieval problem. These techniques return, for each subject label, a ranked list of videos that should receive that label. MAP is calculated by taking the mean of the Average Precision over all subject labels. For each subject label, precision scores are calculated by moving down the results list and calculating precision at each position where a relevant document is retrieved. Average Precision is calculated by taking the average of the precision at each position. Calculations were performed using version 8.1 of the trec_eval scoring package (http://trec.nist.gov/trec_eval).
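As an illustration of how this metric is computed, here is a simplified sketch (not the trec_eval implementation; it follows the usual convention of dividing by the total number of relevant items per label):

```python
# Simplified AP/MAP computation for the subject tagging evaluation.
def average_precision(ranked_videos, relevant):
    """AP for one subject label: precision is taken at each rank where a
    relevant video appears, then averaged over the relevant set."""
    hits, precisions = 0, []
    for rank, vid in enumerate(ranked_videos, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: {label: ranked list of video ids}; qrels: {label: set of relevant ids}."""
    return sum(average_precision(runs[l], qrels[l]) for l in qrels) / len(qrels)
```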
2.3 Techniques
Computer Science, Chemnitz University of Technology, Germany (see also [9]) The task was treated as an information retrieval task. The test set was indexed using an information retrieval system and was queried using the subject labels as queries. Documents returned as relevant to a given subject label were tagged with that label. The number of documents receiving a given label was controlled by a threshold. The submitted runs varied with respect to whether or not the archival metadata was indexed in addition to the speech recognition transcripts. They also varied with respect to whether expansion was applied to the class label (i.e., the query). Expansion was performed by augmenting the original query with the most frequent term occurring in the top five documents returned by an initial retrieval round. If fewer than two documents were returned, queries were expanded using a thesaurus. SINAI Research Group, University of Jaén, Spain (see also [13]) The SINAI group (SINAI stands for Sistemas Inteligentes de Acceso a la Información) approached the task as a categorization problem, training SVMs using the training data provided. One run, SINAI svm nometadata, extracted feature vectors from the speech transcripts alone, and one run, SINAI svm withmetadata, made use of both speech recognition transcripts and metadata.
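A simplified sketch of the "classification as retrieval" strategy described above (our illustration, not the Chemnitz system; the TF-IDF features and the threshold value are assumptions):

```python
# Index the transcripts, use each subject label as a query, and tag every
# video whose similarity score exceeds a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tag_videos(video_ids, transcripts, labels, threshold=0.05):
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(transcripts)    # one row per video
    label_matrix = vectorizer.transform(labels)           # one row per subject label
    scores = cosine_similarity(label_matrix, doc_matrix)  # labels x videos
    tags = {v: [] for v in video_ids}
    for li, label in enumerate(labels):
        for vi, vid in enumerate(video_ids):
            if scores[li, vi] >= threshold:   # threshold is arbitrary here
                tags[vid].append(label)
    return tags
```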
Computer Science, Alexandru Ioan Cuza University, Romania (see also [2]) A training set was created by using subject category labels to select documents from Wikipedia and also from the Web at large (using Google). The training set was used to create a category file for each subject category containing a set of informative terms representative of that category. Two categorization methods were applied: one made use of information retrieval techniques to match the speech recognition transcripts of the videos to the category files, and the other made use of a Naive Bayes multinomial classifier to classify the videos into the classes represented by the category files.
2.4 Results
The MAP results of the task are reported in Table 1. The results confirm the viability of techniques that approach the Subject Classification Task as an information retrieval task. Such techniques proved useful in VideoCLEF 2008 [11] and also provide the best results in 2009, where the size of the collection and the label set increased. Also, consistent with VideoCLEF 2008 observations, performance is better when archival metadata is used in addition to speech recognition transcripts.

Table 1. Subject Classification Results, Test Set
run ID                      MAP
cut1 sc asr baseline        0.0067
cut2 sc asr expanded        0.0842
cut3 sc asr meta baseline   0.2586
cut4 sc asr meta expanded   0.2531
cut5 sc asr meta expanded   0.3813
SINAI svm nometadata        0.0023
SINAI svm withmetadata      0.0028
In the wake of VideoCLEF 2008, we decided that we wanted to provide a training data set of videos accompanied by speech transcripts in 2009 to see whether training classifiers on data from the same domain as the test data would improve performance. The runs submitted this year demonstrate the efficacy of an approach that combines Web data and information retrieval techniques. A supervised approach which uses same-domain training data cannot easily achieve the same level of performance. These results leave open the question of how much training data is necessary in order for a supervised approach to compete with the information retrieval approach. In all cases, runs that make use of metadata outperform runs that make use of ASR transcripts only. These performance differences demonstrate the high value of using metadata, if available, to supplement ASR transcripts in order to generate class labels for videos.
The Alexandru Ioan Cuza University (UAIC) team reported results on the training set only. Note that because they train using data that they have collected themselves, the training set constitutes, for the purposes of their experiments, a separate, unseen test set. The results are not, however, directly comparable to those given in Table 1. We do not repeat them here, but rather refer the interested reader to the UAIC team paper [2]. Here, we include the comment that the best UAIC run involved using both general Web and Wikipedia training data and then combining the output of the information retrieval approach (which they find improves the quality of the first-best label) and the output of a Naive Bayes classifier (which they find contributes to the overall label quality). The UAIC results are consistent with our overall conclusion that it is better to collect training data from external sources rather than to use the training set. We believe that there are two possible sources to which we can attribute the failure of the training set to allow the training of high quality classifiers. First, the training set was relatively small, including only 212 videos. Although some semantic categories are represented by a fair number of video items, other categories may have as few as two items associated with them in the training set. Second, the transcripts of the training set contain a high level of speech recognition errors, which means that important terms might be mis-recognized and thus fail to occur in the transcripts at all, or fail to occur with the proper distribution. There is general awareness shared by VideoCLEF participants that although MAP is a useful tool, it may not be the ideal evaluation metric for this task. The reader can refer to the papers of the Chemnitz [9] and SINAI [13] teams for additional discussion and results reported with additional performance metrics. The ultimate goal of subject tagging is to generate a set of tags for each video that will allow users to find that video while searching or browsing. The utility of a tag assigned to a given video is therefore not entirely independent of the other tags assigned. Under the current formulation of the task, the presence or absence of the tag is the only information that is of use to the searcher. The ranking of a video in a list of videos that are assigned the same tag is for this reason not directly relevant to the utility of that tag for the user. Future work must necessarily involve developing appropriate metrics for evaluating the usefulness to users of the sets of tags assigned to multimedia items.
3 Affect Task
3.1 Task
The goal of the Affect Task at VideoCLEF 2009 was to automatically detect narrative peaks in documentaries. Narrative peaks were defined to be those places in a video where viewers report feeling a heightened emotional effect due to dramatic tension. This task was new in 2009. The ultimate aim of the Affect Task is to move beyond the information content of the video and to analyze the video with respect to characteristics that are important for viewers, but not related to the video topic.
Narrative peak detection builds on and extends work in affective analysis of video content carried out in the areas of sports and movies, cf. e.g., [4]. Viewers perceive an affective peak in sports videos due to tension arising from the spontaneous interaction of players within the constraints of the physical world and the rules and conventions of the game. Viewers perceive an affective peak in a movie due to the action or the plot line, which is carefully planned by the script writer and the filmmaker. Narrative peaks in documentaries are a new domain in so far as they cannot be considered to fall into either category. Documentaries convey information and often have storylines, but do not have the all-dominating plot trajectory of a movie. Documentaries often include extemporaneous narrative or interviews, and therefore also have a spontaneous component. The affective curve experienced by a viewer watching a documentary can be expected to be relatively subtly modulated. It is important to differentiate narrative peak detection from other cases of affect detection, such as hotspot detection in meetings. Hotspots are moments during meetings where people are highly involved in the discussion [16]. Hotspots can be self-reported by meeting participants or annotated in meeting video by viewers. In either case, it is the participant and not the viewer whose affective reaction is being detected. We chose the Beeldenstorm series for the narrative peak detection task in order to make the task as simple and straightforward as possible in its initial year. Beeldenstorm features a single speaker, the host Prof. van Os, and covers a topical domain, the visual arts, that is rich enough to be interesting, yet is relatively constrained. These characteristics help us to control for the effects of personal style of the host and of viewer familiarity with topic in the affect and appeal task. Further, as mentioned above, the fact that the documentaries are short makes it possible for annotators to watch them in their entirety when annotating narrative peaks.
3.2 Evaluation
For the purposes of evaluation, as mentioned above, three Dutch speakers annotated the Beeldenstorm collection by each identifying the three top narrative peaks in each video. Annotators were asked to mark the peaks where they felt the dramatic tension reached its highest level. They were not supplied with an explicit definition of a narrative peak. Instead, all annotators needed to form independent opinions of where they perceived narrative peaks. In order to make the task less abstract, they were supplied with the information that the Beeldenstorm series is associated with humorous and moving moments. They were told that they could use this information to formulate their notion of what constitutes a narrative peak. Peaks were required to be a maximum of ten seconds in length. Although the annotators did not consult with each other about specific peaks, the team did engage in discussion during the definition process. The discussion ensured that there was underlying consensus about the approach to the task.
In particular, it was necessary to check that annotators understood that a peak must be a high point in the storyline as measured by their perceptions of their own emotional reaction. Dramatic objects or facts in the spoken or visual content that were not part of the storyline as it was created by the narrator/producer were not considered narrative peaks. Regions in the video where the annotator guessed that the speaker or producer had intended there to be a peak, but where the annotator did not feel any dramatic tension, were not considered to be peaks. An example of this would be a joke that the annotator did not understand completely. The first two episodes for which the annotators defined peaks were discarded in order to assure that the annotators' perception of a narrative peak had stabilized. This warm-up exercise was particularly important in light of the fact that, at the end of the annotation effort, assessors reported that it was necessary to become familiar with the style and allow an affinity for the series to develop before they started to feel an emotional reaction to narrative peaks in the video. The peaks identified by the assessors were considered to be a reflection of underlying "true" peaks in the narrative of the video. We assumed that the variation between assessors is the result of noise due to effects such as personal idiosyncrasies. In order to generate a ground truth most highly reflective of "true" peaks, the peaks identified by the assessors were merged. The assessment team consisted of three members who each identified three peaks in 45 videos for a total of 405 marked peaks. The assessors were able to give a rough estimate of the minimum distance between peaks and, on the basis of their observations, it was decided to consider two peaks that overlapped by at least two seconds to be the same peak. After merging the peaks, 293 of the 405 peaks turned out to be distinct. The merging process was carried out by fitting a 10 second window to overlapping assessor peaks in order to ensure that merged peaks could never exceed the specified peak length of 10 seconds. Evaluation involved the application of two scoring methods, the point-based approach and the peak-based approach. Under point-based scoring, the peaks chosen by each assessor are assessed without merging. A hypothesized peak receives a point in every case in which it falls within eight seconds of an assessor peak. The run score is the total number of points earned by all peak hypotheses in the run. A single episode can earn a run between three points (assessors chose completely different peaks) and nine points (assessors all chose the same peaks). There are no episodes in the set that fall at either of these extremes. The distribution of the peaks in the files is such that the best possible run would earn 246 points. Under peak-based scoring, a hypothesis is counted as correct if it falls within an 8 second window of a peak representing a merger of assessor annotations. Three different types of merged reference peaks are defined for peak-based scoring. Three different peak-based scores are reported that differ in the number of assessors required to agree in order for a region in the video to be considered a peak. Of the 293 total peaks identified, 203 peaks are "personal peaks" (peaks identified by only one assessor), 90 are "pair peaks" (peaks that are identified by at least two assessors) and 22 are "general peaks" (peaks upon which all three assessors agreed).
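The point-based scoring can be sketched as follows (our reading of the rules, assuming peaks are represented by their start times in seconds):

```python
# Illustrative sketch of the point-based scoring, not the official script.
def point_based_score(hypothesized_peaks, assessor_peaks, window=8.0):
    """hypothesized_peaks: peak times proposed by a system for one episode.
    assessor_peaks: unmerged peak times from all assessors.
    A hypothesis earns one point for every assessor peak within the window."""
    score = 0
    for hyp in hypothesized_peaks:
        score += sum(1 for ref in assessor_peaks if abs(hyp - ref) <= window)
    return score
```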
3.3 Techniques
Narrative peak detection techniques were developed that used the visual channel, the audio channel and the speech recognition transcript. Each group took a different approach. Computer Science, Alexandru Ioan Cuza University, Romania (see also [2]) Based on the hypothesis that speakers raise their voices at narrative peaks, three runs were developed that made use of the intensity of the audio signal. A score was computed for each group of words that involved a comparison of intensity means and other statistics for sequential groups of words. The top three scoring points were hypothesized as peaks. Computer Vision and Multimedia Laboratory, University of Geneva, Switzerland (see also [7]) The assumption was made that dramatic peaks correspond to the introduction of a new topic and thus correspond to a change in word use as reflected in the speech recognition transcripts. Additionally, the video and audio channel effects assumed to be indicative of peaks were explored. Finally, a weighting was deployed that gave more emphasis to positions at which peaks were expected to occur based on the distribution of peaks in the development data. The weighting is used in unige-cvml1, unige-cvml2 and unige-cvml3. Run unige-cvml1 uses text features alone. Run unige-cvml3 uses text plus elevated speaker pitch. Run unige-cvml2 uses text, elevated pitch and quick changes in the video. Run unige-cvml4 uses text only and no weighting. Run unige-cvml5 sets peaks randomly to provide a random baseline for comparison. Delft University of Technology and University of Twente, Netherlands (see also [10]) Only features extracted from the speech transcripts were exploited. Run duotu09fix predicted peaks at fixed points chosen by analyzing the development data. Run duotu09ind used indicator words as cues of narrative peaks. Indicator words were chosen by analyzing the development data. Run duotu09rep applied the assumption that word repetition, reflecting the use of an important rhetorical device, would indicate a peak. Run duotu09pro used pronouns as indicators of audience-directed speech and assumed that high pronoun densities would correspond to points where viewers feel maximum involvement. Run duotu09rat exploited the affective scores of words, building on the hypothesis that use of affective speech characterizes narrative peaks.
3.4 Results
The results of the task are reported in Table 2. The results make clear that it is quite challenging to effectively support the detection of narrative peaks using audio and video features. Recall that unige-cvml5 is a randomly generated run. Most runs failed to yield results appreciably better than this random baseline. The best scoring approaches exploited the speech recognition transcripts, in particular the occurrence of pronouns reflecting audience-directed speech and the use of words with high affective ratings.
Table 2. Narrative peak detection results
run ID        point-based   peak-based ≥1 assessor   peak-based ≥2 assessors   peak-based 3 assessors
                            (personal peaks)         (pair peaks)              (general peaks)
duotu09fix    47            28                       8                         4
duotu09ind    55            38                       12                        2
duotu09rep    30            21                       7                         0
duotu09pro    63            44                       17                        4
duotu09rat    59            33                       18                        6
unige-cvml1   39            32                       6                         0
unige-cvml2   41            30                       11                        2
unige-cvml3   42            31                       8                         0
unige-cvml4   43            31                       9                         0
unige-cvml5   43            32                       8                         3
uaic-run1     33            26                       7                         2
uaic-run2     41            29                       10                        3
uaic-run3     33            24                       7                         2
Because of the newness of the Narrative Peak Detection Task, the method of scoring is still a subject of discussion. The scoring method was designed such that algorithms were given as much credit as possible for agreement between the peaks they hypothesized and the peaks chosen by the annotators. See the papers of individual participants [7] [10] for some additional discussion.
4 Linking Task
4.1 Task
The Linking Task, also called “Finding Related Resources Across Languages,” involves linking episodes of the Beeldenstorm documentary (Dutch language) to Wikipedia articles about related subject matter (English language). This task was new in 2009. Participants were supplied with 165 multimedia anchors, short (ca. 10 seconds) segments, pre-defined in the 45 episodes that make up the Beeldenstorm collection. For each anchor, participants were asked to automatically generate a list of English language Wikipedia pages relevant to the anchor, ordered from the most to the least relevant. Notice that this task was designed by the task organizers such that it goes beyond a named-entity linking task. Although a multimedia anchor may contain a named entity (e.g., a person, place or organization) that is mentioned in the speech channel, the anchors have been carefully chosen by the task organizers so that this is not always the case. The topic being discussed in the video at the point of the anchor may not be explicitly named. Also, the representation of a topic in the video may be split between the visual and the speech channel.
4.2 Evaluation
The ground truth for the linking task was created by the assessors. We adapted the four graded relevance levels used in [6] for application in the Linking Task. Level 3 links are referred to as primary links and are defined as "highly relevant – the page is the single page most relevant for supporting understanding of the video in the region of the anchor." There is only a single primary link per multimedia anchor, representing the one best page to which that anchor can be linked. Level 2 links are referred to as secondary links and are defined as "fairly relevant – the page treats a subtopic (aspect) of the video in the region of the anchor." The final two levels, Level 1 (defined as "marginally relevant – the page is not appropriate for the anchor") and Level 0 (defined as "irrelevant – the page is unrelated to the anchor"), were conflated and regarded as irrelevant. Links classified as Level 1 are generic links, e.g., "painting," or links involving a specific word that is mentioned but is not really central to the topic of the video at that point. Primary link evaluation. For each video, the primary link was defined by consensus among three assessors. The assessors were required to watch the entire episode so as to have the context to decide the primary link. Primary links were evaluated using recall (correct links/total links) and Mean Reciprocal Rank (MRR). Related resource evaluation. For each video, a set of related resources was defined. This set necessarily includes the primary link. It also includes other secondary links that the assessors found relevant. Only one assessor needed to find a secondary link relevant for it to be included. However, the assessors agreed on the general criteria to be applied when choosing a secondary link. Related resources were evaluated with MRR. The list of secondary links is not exhaustive; for this reason, no recall score is reported.
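As an illustration, the MRR evaluation used here can be sketched as follows (our own minimal sketch): the reciprocal rank of the first relevant target page, averaged over all multimedia anchors.

```python
# Minimal MRR sketch for the linking task.
def mean_reciprocal_rank(ranked_links, ground_truth):
    """ranked_links: {anchor_id: ordered list of Wikipedia page titles}.
    ground_truth: {anchor_id: set of relevant page titles}."""
    total = 0.0
    for anchor, ranking in ranked_links.items():
        relevant = ground_truth.get(anchor, set())
        rr = 0.0
        for rank, page in enumerate(ranking, start=1):
            if page in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_links) if ranked_links else 0.0
```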
4.3 Techniques
Centre for Digital Video Processing, Dublin City University, Ireland (see also [3]) The words spoken between the start point and the end point of the multimedia anchor (as transcribed in the speech recognition transcript) were used as a query and fired off against an index of Wikipedia. For dcu run1 and dcu run2 the Dutch Wikipedia was queried and the corresponding English page was returned. Stemming was applied in dcu run2. Dutch pages did not always have corresponding English pages. For dcu run3, the query was translated first and fired off against an English language Wikipedia index. For dcu run4 a Dutch query expanded using pseudo-relevance feedback was used. TNO Information and Communication Technology, Netherlands (see also [14]) A set of existing approaches was combined in order to implement a sophisticated baseline to provide a starting point for future research. A wikify tool was used to
find links in the Dutch speech recognition transcripts and in English translations of the transcripts. Particular attention was given to proper names, with one strategy giving preference to links to articles with proper-name titles and another strategy ensuring that proper name information was preserved under translation.
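An abstract sketch of the DCU-style strategy of querying the Dutch Wikipedia and crossing over via interlanguage links (our illustration; search_nl_wikipedia and interlanguage_link are hypothetical helpers standing in for an IR index and a link table extracted from a Wikipedia dump):

```python
# Sketch: query Dutch Wikipedia with the anchor transcript, then follow each
# article's interlanguage link to its English counterpart.
def link_anchor(anchor_transcript, search_nl_wikipedia, interlanguage_link, top_k=10):
    ranked_nl_pages = search_nl_wikipedia(anchor_transcript, top_k)
    english_targets = []
    for nl_title in ranked_nl_pages:
        en_title = interlanguage_link(nl_title, target_lang="en")
        if en_title is not None:   # not every Dutch page has an English page
            english_targets.append(en_title)
    return english_targets
```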
4.4 Results
The results of the task are reported in Table 3 (primary link evaluation) and Table 4 (related resource evaluation). The best run used a combination of different strategies, referred to by TNO as a "cocktail." The techniques applied by DCU achieved a lower overall score, but demonstrate that in general it is better not to translate the query, but rather to query Wikipedia in the source language and then cross over to the target language by using Wikipedia's own article-level links between languages. Note that the difference is in reality not as extreme as suggested by Table 3 (i.e., by dcu run1 vs. dcu run3). A subsequent version of the dcu run3 experiment (not reported in Table 3) that makes use of a version of Wikipedia that has been cleaned up by removing clutter (e.g., articles scheduled for deletion and meta-articles containing discussion) achieves an MRR of 0.171 for primary links. Insight into the difference between the DCU approach and the TNO approach is offered by an analysis that makes a query-by-query comparison between specific runs and average performance. DCU runs provide an improvement over average performance for more queries than the TNO run [14].

Table 3. Linking results: Primary link evaluation. Raw count correct and MRR.
run ID      raw   MRR
dcu run1    44    0.182
dcu run2    44    0.182
dcu run3    13    0.056
dcu run4    38    0.144
tno run1    57    0.230
tno run2    55    0.215
tno run3    58    0.251
tno run4    44    0.182
tno run5    47    0.197

Table 4. Linking results: Related resource evaluation. MRR.
run ID      MRR
dcu run1    0.268
dcu run2    0.275
dcu run3    0.090
dcu run4    0.190
tno run1    0.460
tno run2    0.428
tno run3    0.484
tno run4    0.392
tno run5    0.368
5 Conclusions and Outlook
In 2009, VideoCLEF participants carried out three tasks, Subject Classification, Narrative Peak Detection and Finding Related Resources Across Languages. These tasks generate enrichment for spoken content that can be used to provide improvement in multimedia access and retrieval. With the exception of the Narrative Peak Detection Task, participants concentrated largely on features derived from the speech recognition transcripts and
did not exploit other audio information or information derived from the visual channel. Looking towards next year, we will continue to encourage participants to use a wider range of features. We see the Subject Classification Task as developing increasingly towards a tag recommendation task, where systems are required to assign tags to videos. The tag set might not necessarily be known in advance. We expect that the formulation of this task as an information retrieval task will continue to prove useful and helpful, although we wish to move to metrics for evaluation that will better reflect the utility of the assigned tags for real-world search or browsing. In 2010, VideoCLEF will change its name to MediaEval (http://www.multimediaeval.org/) and its sponsorship will be taken over by PetaMedia (http://www.petamedia.eu/), a Network of Excellence dedicated to research and development aimed at improving multimedia access and retrieval. In 2010, several different data sets will be used. In particular, we introduce data sets containing creative commons data collected from the Web (predominantly English language) that will be used in addition to data sets from Beeld & Geluid (predominantly Dutch data). We will offer a tagging task, an affect task and a linking task as in 2009, but we will extend our task set to include new tasks, in particular geo-tagging and multimodal passage retrieval. The goal of MediaEval is to promote cooperation between sites and projects in the area of benchmarking, moving towards the common aim of "Innovation and Education via Evaluation."
Acknowledgements. We are grateful to TrebleCLEF (http://www.trebleclef.eu/), a Coordination Action of the European Commission's Seventh Framework Programme, for a grant that made possible the creation of a data set for the Narrative Peak Detection Task and the Linking Task. Thank you to the University of Twente for supplying the speech recognition transcripts and to the Netherlands Institute of Sound and Vision (Beeld & Geluid) for supplying the video. Thank you to Dublin City University for providing the shot segmentation and keyframes and also for hosting the team of Dutch-speaking video assessors during the Dublin Days event. We would also like to express our appreciation to Michael Kipp for use of the Anvil Video Annotation Research Tool. The work that went into the organization of VideoCLEF 2009 has been supported, in part, by the PetaMedia Network of Excellence and has received funding from the European Commission's Seventh Framework Programme under grant agreement no. 216444.
References
1. Calic, J., Sav, S., Izquierdo, E., Marlow, S., Murphy, N., O'Connor, N.: Temporal video segmentation for real-time key frame extraction. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP (2002)
2. Dobrilă, T.-A., Diaconaşu, M.-C., Lungu, I.-D., Iftene, A.: UAIC: Participation in VideoCLEF task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Gyarmati, Á., Jones, G.J.F.: When to cross over? Cross-language linking using Wikipedia for VideoCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
4. Hanjalic, A., Xu, L.-Q.: Affective video content representation and modeling. IEEE Transactions on Multimedia 7(1), 143–154 (2005)
5. Huijbregts, M., Ordelman, R., de Jong, F.: Annotation of heterogeneous multimedia content using automatic speech recognition. In: Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT (2007)
6. Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology 53(13), 1120–1129 (2002)
7. Kierkels, J.J.M., Soleymani, M., Pun, T.: Identification of narrative peaks in video clips: Text features perform best. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
8. Kipp, M.: Anvil – a generic annotation tool for multimodal dialogue. In: Proceedings of Eurospeech, pp. 1367–1370 (2001)
9. Kürsten, J., Eibl, M.: Video classification as IR task: Experiments and observations. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
10. Larson, M., Jochems, B., Smits, E., Ordelman, R.: A cocktail approach to the VideoCLEF 2009 linking task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
11. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) Evaluating Systems for Multilingual and Multimodal Information Access. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009)
12. Pecina, P., Hoffmannová, P., Jones, G.J.F., Zhang, Y., Oard, D.W.: Overview of the CLEF-2007 Cross-Language Speech Retrieval track. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 674–686. Springer, Heidelberg (2008)
13. Perea-Ortega, J.M., Montejo-Ráez, A., Martín-Valdivia, M.T., Ureña López, L.A.: Using Support Vector Machines as learning algorithm for video categorization. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
14. Raaijmakers, S., Versloot, C., de Wit, J.: A cocktail approach to the VideoCLEF 2009 linking task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
15. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVID. In: Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR), pp. 321–330. ACM, New York (2006)
16. Wrede, B., Shriberg, E.: Spotting "hot spots" in meetings: Human judgments and prosodic cues. In: Proceedings of Eurospeech, pp. 2805–2808 (2003)
Methods for Classifying Videos by Subject and Detecting Narrative Peak Points
Tudor-Alexandru Dobrilă, Mihail-Ciprian Diaconaşu, Irina-Diana Lungu, and Adrian Iftene
UAIC: Faculty of Computer Science, "Alexandru Ioan Cuza" University, Romania
{tudor.dobrila,ciprian.diaconasu,diana.lungu,adiftene}@info.uaic.ro
Abstract. 2009 marked the first participation at the VideoCLEF evaluation campaign of UAIC (Universitatea "Alexandru Ioan Cuza", the "Al. I. Cuza" University of Iași). Our group built two separate systems for the "Subject Classification" and "Affect Detection" tasks. For the first task we created two resources, starting from Wikipedia pages and pages identified with Google, and used two tools for classification: Lucene and Weka. For the second task we extracted the audio component from a given video file using FFmpeg. After that, we computed the average amplitude of each word from the transcript by applying the Fast Fourier Transform algorithm in order to analyze the sound. A brief description of our systems' components is given in this paper.
1 Introduction
VideoCLEF 2009 (http://www.cdvp.dcu.ie/VideoCLEF/) required participants to carry out cross-language classification, retrieval and analysis tasks on a video collection containing documentaries and talk shows. In 2009, the collection extended the corpus used for the 2008 VideoCLEF pilot track. Two classification tasks were evaluated: "Subject Classification", which involves automatically tagging videos with subject labels, and "Affect and Appeal", which involves classifying videos according to characteristics beyond their semantic content. Our team participated in the following tasks: Subject Classification (in which participants had to automatically tag videos with subject labels such as 'Archeology', 'Dance', 'History', 'Music', etc.) and Affect Detection (in which participants had to identify narrative peaks, points within a video where viewers report increased dramatic tension, using a combination of video and speech/audio features).
2 Subject Classification In order to classify a video using its transcripts we perform four steps: (1) For each category we extract from Wikipedia and Google web pages related to the video; (2) From the documents obtained at Step 1 we extract only relevant words and compute 1 2
Univeristatea “Alexandru Ioan Cuza” (“Al. I. Cuza” University of Iași). VideoCLEF: http://www.cdvp.dcu.ie/VideoCLEF/
for each term a normalized value using its number of appearances; (3) We perform the same action as in Step 2 on the video transcripts; (4) The terms obtained at Step 2 are grouped into a list of categories given a priori and, using the list of words from Step 3 and a classification tool (Lucene or Weka), we classify the video into one of these categories.

Extract Relevant Words from Wikipedia: We used CategoryTree3 to analyze Wikipedia’s category structure as a tree. The query URL was created based on the language, the name of the category and the depth of the search within the tree. We performed queries for each category and obtained Wikipedia pages, which were later sorted by relevance. From the source of each page we extracted the content of the paragraph tags, transformed all words to lower case, lemmatized them and counted their number of appearances. In the end we computed, for each term in each category, a normalized score as the ratio between its number of appearances and the total number of appearances of all words in that category.

Extract Relevant Words from Google: This part is similar to the part performed on Wikipedia, except that terms from the “keywords” meta tag of relevant search results were extracted as well.

Lucene adds indexing and searching capabilities to applications [1]. Instead of directly indexing the files created at the previous steps for each category, we generated other files in which every word’s score is proportional to its number of appearances in the corresponding files. This way the score returned by Lucene is greater if the word from the file associated with a category has a higher number of appearances.

The Weka4 workbench [2] contains a collection of visualization tools and algorithms for data analysis and predictive modeling. For each category file (model file) and transcript file (test file), we create an ARFF file (Attribute-Relation File Format). Using a filter provided by the Weka tool, the content of the newly created files is transformed into instances. Each instance is classified by assigning it a score, and the one with the highest score is the result that Weka offers.

System Evaluation

In Table 1, we report the results of the evaluation in terms of mean average precision (MAP) using the trec_eval tool on training data5. We have evaluated nine runs, using different combinations of resources and classification algorithms.

Table 1. UAIC Runs on Training Data

Tools\Resources    Google   Wikipedia   Google and Wikipedia
Lucene             0.12     0.17        0.20
Weka               0.14     0.30        0.35
Lucene and Weka    0.19     0.33        0.45
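A minimal sketch of the normalized term scores (Steps 2–3) and the resulting category assignment (Step 4) is given below. It is an illustrative simplification of the Lucene/Weka-based classifiers actually used, with lemmatization replaced by plain whitespace tokenization.

```python
from collections import Counter

def normalized_scores(text):
    """Steps 2-3: lower-case, count appearances, normalize by the total count."""
    words = text.lower().split()              # lemmatization omitted in this sketch
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def classify(transcript, category_models):
    """Step 4: score the transcript against each category model and return
    the category whose terms overlap the transcript most strongly."""
    transcript_scores = normalized_scores(transcript)
    best_label, best_score = None, float("-inf")
    for label, model in category_models.items():
        score = sum(s * model.get(w, 0.0) for w, s in transcript_scores.items())
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# category_models maps each given category, e.g. "Music" or "Archeology",
# to normalized_scores() of the text harvested from Wikipedia and Google.
```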
3 CategoryTree: http://www.mediawiki.org/wiki/Extension:CategoryTree
4 Weka: http://www.cs.waikato.ac.nz/ml/weka/
5 During the evaluation campaign we did not send a run on test data; the data in this table were evaluated by us on the training files provided by the organizers.
The least conclusive results were obtained using Lucene and resources extracted from Google. Classification results using resources from Wikipedia with either the Lucene or the Weka tool are more representative, because the information extracted from this source is more concise. The best results were obtained when resources from both Google and Wikipedia were used. Lucene proved to be more useful when more results for a single input were needed, but the Weka tool using the Naive Bayes Multinomial classifier led to a single, more conclusive result. Combining both resources and the two tools is much more effective in terms of the accuracy of the results.
3 Affect Detection

Our work is based on the assumption that a narrative peak is a point in the video where the narrator raises his voice within a given phrase, in order to emphasize a certain idea. This means that a group of words is said more intensely than the preceding words and, since this applies in any language, we were able to develop a language-independent application using statistical analysis. This is why our approach is based on two aspects of the video: the sound and the ASR transcript.

The first step is the extraction of the audio from a given video file, which we accomplished with the use of FFmpeg6. We then computed the average amplitude of each word from the transcript, by applying the Fast Fourier Transform (FFT7) algorithm on the audio signal. The amplitude of a point in complex form X is defined as the ratio between the intensity of the frequency in X (as calculated by FFT) and the total number of points in the time-domain signal. FFT proved to be successful, because it helped establish the relation between neighboring words in terms of the way they are pronounced by the narrator.

Next, we computed a score for each group of words (spanning between 5 and 10 seconds) based on the previous group of words. The score is a weighted mean of several metrics, listed in Table 2. In the end, we considered only the top 3 scores, which were exported in .anvil format for later use in the Anvil Player. We submitted 3 runs with the following characteristics:

Table 2. Affect Detection: characteristics of UAIC runs

Run 1: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Quartile Coefficients of Dispersion of Current Group and Previous Group
Run 2: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Quartile Coefficients of Dispersion of Current Group and Previous Group
       • Ratio of Coefficients of Variation of Current Group and Previous Group
Run 3: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Coefficients of Variation of Current Group and Previous Group
6 FFmpeg: http://ffmpeg.org/
7 FFT: http://en.wikipedia.org/wiki/Fast_Fourier_transform
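A rough sketch of the amplitude and group-score computation described above is shown below (using NumPy). The alignment of ASR words to audio samples, the weights of the weighted mean and the exact set of metrics are illustrative assumptions; only the two Run 1 metrics are shown.

```python
import numpy as np

def word_amplitude(samples):
    """Average amplitude of one word: |FFT bin| divided by the number of
    time-domain points, averaged over all bins."""
    spectrum = np.fft.fft(samples)
    return float(np.mean(np.abs(spectrum)) / len(samples))

def quartile_dispersion(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

def group_score(current, previous, w_mean=0.5, w_disp=0.5):
    """Run 1-style score of a 5-10 s word group relative to the previous group:
    a weighted mean of the ratio of mean amplitudes and the ratio of quartile
    coefficients of dispersion (the weights here are placeholders)."""
    ratio_means = np.mean(current) / np.mean(previous)
    ratio_disp = quartile_dispersion(current) / quartile_dispersion(previous)
    return w_mean * ratio_means + w_disp * ratio_disp

# current / previous: lists of word_amplitude() values for consecutive word groups;
# the three highest-scoring groups of an episode are reported as narrative peaks.
```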
In total, 60 hours of assessor time were devoted to creating the reference files of the narrative peaks for the 45 Beeldenstorm episodes used in the VideoCLEF 2009 Affect Task. Three assessors watched each of the 45 test files and marked their top three narrative peaks using the Anvil tool. Our best run (Run 2) was obtained when more statistical measures were incorporated into the final weighted sum that gave the score of a group of words. This could be improved by adding other metrics (e.g. the coefficient of correlation) and by properly adjusting the weights. Our method was successful when the narrator raised his voice in order to emphasize a certain idea, but failed when the semantic meaning of the words played an important role within a narrative peak.

Table 3. UAIC Runs Evaluation

Run ID   Point based scoring   Peaks based scoring 1   Peaks based scoring 2   Peaks based scoring 3
Run 1    33                    26                      7                       2
Run 2    41                    29                      10                      3
Run 3    33                    24                      7                       2
4 Conclusions

This paper presents UAIC’s system which took part in the VideoCLEF 2009 evaluation campaign. Our group built two separate systems for the “Subject Classification” and “Affect Detection” tasks. For the Subject Classification task we created two resources, built from Wikipedia pages and from results returned by the Google search engine. These resources are then used by the Lucene and Weka tools for classification. For the Affect Detection task we extracted the audio component from a given video file using FFmpeg. The audio signal is analyzed with the Fast Fourier Transform algorithm and scores are given to groups of neighboring words.
References

1. Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications Co. (2005)
2. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
3. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., Gonzalo, J., Jones, G.J.F., Müller, H., Tsikrika, T., Kalpathy-Kramer, J. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Using Support Vector Machines as Learning Algorithm for Video Categorization

José Manuel Perea-Ortega, Arturo Montejo-Ráez, María Teresa Martín-Valdivia, and L. Alfonso Ureña-López

SINAI Research Group, Computer Science Department, University of Jaén, Campus Las Lagunillas, Edificio A3, E-23071, Jaén, Spain
{jmperea,amontejo,maite,laurena}@ujaen.es
http://sinai.ujaen.es
Abstract. This paper describes a supervised learning approach to classify Automatic Speech Recognition (ASR) transcripts from videos. A training collection was generated using the data provided by the VideoCLEF 2009 framework. These data contained metadata files about videos. The Support Vector Machines (SVM) learning algorithm was used in order to evaluate two main experiments: using the metadata files for generating the training corpus and without using them. The obtained results show the expected increase in precision due to the use of metadata in the classification of the test videos.
1 Introduction
Multimedia content-based retrieval is a challenging research field that has drawn significant attention in the multimedia research community [5]. With the rapid growth of multimedia data, methods for effective indexing and search of visual content are decisive. Specifically, interest in multimedia Information Retrieval (IR) systems has grown in recent years, as can be seen at conferences such as the ACM International Conference on Multimedia Information Retrieval (ACM MIR1) or the TREC Video Retrieval Evaluation (TRECVID2) conference. Our group has some experience in this field, using an approach based on the fusion of text-based retrieval and image-based retrieval [1]. Video categorization can be considered a subtask of multimedia content-based retrieval. VideoCLEF3 is a recent track of CLEF4 whose aim is to evaluate and improve access to video content in a multilingual environment. One of the main subtasks it proposes is the Subject Classification task, which is about automatically tagging videos with subject theme labels (e.g., “factories”, “poverty”, “cultural identity”, “zoos”, ...) [4].
1 http://press.liacs.nl/mir2008/index.html
2 http://www-nlpir.nist.gov/projects/trecvid
3 http://www.cdvp.dcu.ie/VideoCLEF
4 Cross Language Evaluation Forum, http://www.clef-campaign.org
In this paper, two experiments on the Subject Classification task are described. One main approach has been followed: supervised categorization using the SVM algorithm [2]. Additionally, two corpora have been generated: one using the metadata files provided by the VideoCLEF 2009 framework and one without them. The paper is organized as follows: Section 2 describes the approach followed in this work. Then, in Section 3, experiments and results are shown. Finally, in Section 4, the conclusions and further work are presented.
2 The Supervised Learning Approach
2.1 Generating the Training Data
The VideoCLEF 2009 Subject Classification task ran on TRECVid 2007/2008 data from Beeld & Geluid5. The training corpus consists of 262 XML files. These ASR files belong to VideoCLEF 2008 (50 files) and TRECVID 2007 (212 files). Additionally, there are some metadata files about the videos provided by the VideoCLEF organization [4]. For generating the training data, the content of the FreeTextAnnotation labels from the ASR files was extracted. Therefore, a TREC file per document was built. Additionally, the content of the description abstract labels from the metadata files was added to generate the learning corpus with metadata. Preprocessing of the training corpora consisted of filtering stopwords and applying a stemmer. Because all the original files are in Dutch, the Snowball stopword list for Dutch6, which contains 101 stopwords, and the Snowball Dutch stemmer7 were used.
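A sketch of the preprocessing step is shown below; it uses NLTK's Dutch stopword list and Snowball stemmer as stand-ins for the Snowball resources cited above, so the exact stopword set may differ slightly.

```python
import re
from nltk.corpus import stopwords               # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

DUTCH_STOPWORDS = set(stopwords.words("dutch"))
STEMMER = SnowballStemmer("dutch")

def preprocess(text):
    """Lower-case, tokenize, remove Dutch stopwords and stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return " ".join(STEMMER.stem(t) for t in tokens if t not in DUTCH_STOPWORDS)

# Each training document is built from the FreeTextAnnotation content of an ASR
# file, optionally extended with the corresponding metadata text.
```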
2.2 Using SVM as an ASR Classifier
Automatic tagging of videos with subject labels can be seen as a categorization problem, using the speech transcriptions of the test videos as the documents to classify. One of the successful uses of SVM algorithms is the task of text categorization into a fixed number of predefined categories based on document content. The commonly used representation of text documents in the field of IR provides a natural mapping for the construction of the Mercer kernels used in SVM algorithms. For the experiments and analysis carried out in this paper, the Rapid Miner8 framework was selected. This toolkit provides several machine learning algorithms, such as SVM, along with other interesting features. The learning algorithm selected for testing the supervised strategy is the Support Vector Machine [2]. SVM has been used in classification mode, with a 3-degree RBF kernel, the nu parameter equal to 0.5 and epsilon set to 0.0001, with the p-value at 0.1. The rest of the parameters were set to 0. A brief description of the experiments and their results using each generated corpus is given below.
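The same setup can be approximated with scikit-learn instead of Rapid Miner. The sketch below mirrors the parameter values quoted above (RBF kernel, degree 3, nu = 0.5, tolerance 0.0001), but it is not the authors' actual pipeline, and the toy documents are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import NuSVC

# Placeholder data; in practice these are the preprocessed transcripts and labels.
train_docs = ["muziek concert orkest", "opgraving archeoloog romeins", "dans ballet"]
train_labels = ["Music", "Archeology", "Dance"]
test_docs = ["het orkest speelt een concert"]

classifier = make_pipeline(
    TfidfVectorizer(),                              # bag-of-words / tf.idf mapping
    NuSVC(kernel="rbf", nu=0.5, degree=3, tol=1e-4),
)
classifier.fit(train_docs, train_labels)
print(classifier.predict(test_docs))                # predicted subject label(s)
```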
5 The Netherlands Institute of Sound and Vision (called Beeld & Geluid in Dutch)
6 http://snowball.tartarus.org/algorithms/dutch/stop.txt
7 http://snowball.tartarus.org/algorithms/dutch/stemmer.html
8 Rapid Miner is available from http://rapid-i.com
3 Experiments and Results
The Subject Classification task was introduced during VideoCLEF 2008 as a pilot task [3]. In 2009, the number of videos in the collection was increased from 50 to 418 and the number of subject labels from 10 to 46. This task is usually evaluated using Mean Average Precision (MAP), but the R-Precision measure has also been calculated. In 2008, the approach used in our participation in the VideoCLEF classification task was to use an Information Retrieval (IR) system as a classification architecture [7]. We collected topical data from the Internet by submitting the thematic class labels as queries to the Google search engine. The queries were derived from the speech transcripts, and a video was assigned the label corresponding to the top-ranked document returned when the video transcript text was used as a query. This approach was taken because the VideoCLEF 2008 collection provided development and test data, but no training data. In contrast, the approach followed in this paper is a first approximation to the automatic tagging of videos using a supervised learning scheme, for which the SVM algorithm has been selected. During the generation of the training corpus, two experiments have been evaluated: using the metadata files provided by the VideoCLEF organization and not using them. The results obtained are shown in Table 1.

Table 1. Experiments and results using SVM as learning algorithm

Learning corpus           MAP      R-prec
Using metadata            0.0028   0.0089
Without using metadata    0.0023   0.0061
Analyzing the results, it can be observed that using metadata during the generation of the training corpus improves the average precision of video classification by about 21.7% compared to not using metadata for generating the learning corpus. Consistent with the VideoCLEF 2008 observations, performance is better when archival metadata is used in addition to speech recognition transcripts.
4 Conclusions and Further Work
The use of metadata as a valuable source of information in text categorization was already applied some time ago, for example in the categorization of full-text papers enriched with their bibliographic records [6]. The results of the experiments suggest that training classifiers on speech transcripts from the same domain of videos could be a good strategy for the future. We expect to continue this work by applying a multi-label classifier instead of the multi-class SVM algorithm used so far. Additionally, the semantics of the speech transcriptions will also be investigated by studying how the inclusion of
synonyms from external resources such as WordNet9 affects the generated corpora and can further improve the performance of our system. On top of that, a method for detecting the linguistic register of the documents to be classified would serve as a selector for a suitable training corpus.
Acknowledgments

This paper has been partially supported by a grant from the Spanish Government, project TEXT-COOL 2.0 (TIN2009-13391-C04-02), a grant from the Andalusian Government, project GeOasis (P08-TIC-41999), and a grant from the University of Jaén, project RFC/PP2008/UJA-08-16-14.
References

1. Díaz-Galiano, M.C., Perea-Ortega, J.M., Martín-Valdivia, M.T., Montejo-Ráez, A., Ureña-López, L.: SINAI at TRECVID 2007. In: Proceedings of the TRECVID 2007 Workshop (2007)
2. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
3. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2008: Automatic Generation of Topic-Based Feeds for Dual Language Audio-Visual Content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009)
4. Larson, M., Newman, E., Jones, G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
5. Li, J., Chang, S.F., Lesk, M., Lienhart, R., Luo, J., Smeulders, A.W.M.: New challenges in multimedia research for the increasingly connected and fast growing digital society. In: Multimedia Information Retrieval, pp. 3–10. ACM, New York (2007)
6. Montejo-Ráez, A., Ureña-López, L.A., Steinberger, R.: Text categorization using bibliographic records: beyond document content. Sociedad Española para el Procesamiento del Lenguaje Natural (35) (2005)
7. Perea-Ortega, J.M., Montejo-Ráez, A., Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A.: Using an Information Retrieval System for Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 927–930. Springer, Heidelberg (2009)
9 http://wordnet.princeton.edu/
Video Classification as IR Task: Experiments and Observations

Jens Kürsten and Maximilian Eibl

Chemnitz University of Technology, Faculty of Computer Science, Chair Computer Science and Media, Straße der Nationen 62, 09111 Chemnitz, Germany
{jens.kuersten,eibl}@cs.tu-chemnitz.de
Abstract. This paper describes experiments we conducted in conjunction with the VideoCLEF 2009 classification task. In our second participation in the task we experimented with treating classification as an IR problem and used the Xtrieval framework [1] to run our experiments. We confirmed that the IR approach achieves strong results although the data set was changed. We proposed an automatic threshold to limit the number of labels per document. Query expansion performed better than the corresponding baseline experiments in terms of mean average precision. We also found that combining the ASR transcriptions and the archival metadata improved the classification performance unless query expansion was used.
1 Introduction and Motivation
This article describes a system and its configuration, which we used for participation in the VideoCLEF classification task. The task [2] was to categorize dual-language video into 46 different classes based on provided ASR transcripts and additional archival metadata. Each of the given video documents can have no, one or even multiple labels. Hence the task can be characterized as a real-world scenario in the field of automatic classification. Our participation in the task is motivated by its close relation to our research project sachsMedia1. The main goals of the project are twofold. The first objective is the automatic extraction of low-level features from audio and video for automated annotation of poorly described content in archives. On the other hand, sachsMedia aims to support local TV stations in Saxony in replacing their analog distribution technology with innovative digital distribution services. A special problem of the broadcast companies is the accessibility of their archives for end users. The remainder of the article is organized as follows. In section 2 we briefly review existing approaches and describe the system architecture and its basic
1 Funded by the Entrepreneurial Regions program of the German Federal Ministry of Education and Research from April 2007 to March 2012.
configuration. In section 3 we present and interpret the results of preliminary and officially submitted experiments. A summary of our findings is given in section 4. The final section concludes the experiments with respect to our expectations and gives an outlook on future work.
2 System Architecture and Configuration
Since the classification task was an enhanced modification of last year’s VideoCLEF classification task [3], we give a brief review of previously used approaches. There were two distinct ways to approach the classification task: (a) collecting training data from external sources like general Web content or Wikipedia to train a text classifier, or (b) treating the problem as an information retrieval task. Villena-Román and Lana-Serrano [4] combined both ideas by obtaining training data from Wikipedia and assigning the class labels to the indexed training data. The metadata from the video documents were used as queries on the training corpus and the dominant label of the retrieved documents was assigned as the class label. Newman and Jones [5] as well as Perea-Ortega et al. [6] approached the problem as an IR task and achieved similarly strong performance. Kürsten et al. [7] and He et al. [8] tried to solve the problem with state-of-the-art classifiers like k-NN and SVM. Both used Wikipedia articles for training.
2.1 Resources
Given the impressions from last year’s evaluation and the huge success of the IR approaches, as well as the enhancement of the task to a larger number of class labels and more documents, we decided to treat the problem as an IR task to verify these results. Hence we used the Xtrieval framework [1] to create an index on the provided metadata. This index was composed of three fields, one with the ASR output, another with the archival metadata and a third containing both. A language-specific stopword list2 and the Dutch stemmer from the Snowball project3 were applied to process the tokens. We used the class labels to query our video document index. Within our framework we decided to use the Lucene4 retrieval core with its default vector-based IR model. An English thesaurus5 in combination with the Google AJAX language API6 was applied for query expansion purposes in the retrieval stage.
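The classification-as-IR idea can be illustrated independently of Lucene and Xtrieval: index one tf.idf document per video (ASR output, archival metadata, or both) and run each class label as a query. The sketch below uses scikit-learn and is a simplification, not the actual system configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_videos(docs, class_labels, labels_per_doc=1):
    """docs: one text string per video; class_labels: the 46 labels used as queries.
    Returns the top-scoring labels assigned to each document."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    query_matrix = vectorizer.transform(class_labels)
    scores = cosine_similarity(query_matrix, doc_matrix)     # label x document
    assigned = {i: [] for i in range(len(docs))}
    for li, label in enumerate(class_labels):
        for di in scores[li].argsort()[::-1]:
            if scores[li, di] > 0:
                assigned[di].append((label, scores[li, di]))
    # keep only the labels_per_doc best labels per document (the LpD limit)
    return {di: sorted(pairs, key=lambda p: -p[1])[:labels_per_doc]
            for di, pairs in assigned.items()}
```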
2.2 System Configuration and Parameters
The following list briefly explains our system parameters and their values in the experimental evaluation. Figure 1 illustrates the general workflow of the system.
2 http://snowball.tartarus.org/algorithms/dutch/stop.txt
3 http://snowball.tartarus.org/algorithms/dutch/stemmer.html
4 http://lucene.apache.org
5 http://de.openoffice.org/spellcheck/about-spellcheck-detail.html#thesaurus
6 http://code.google.com/apis/ajaxlanguage/documentation
– Source Field (SF): The metadata source was varied to indicate which source is most reliable and whether any combination yields an improvement of the classification or not.
– Multi-label Limit (LpD): The number of correct labels is usually related to the content of a video document. Therefore we investigated the relation between the number of assigned labels per document and the classification performance. Another related question is whether an automatic, content-specific threshold might be superior to fixed threshold values. Thus, we compared fixed thresholds to an automatic threshold (see Equation 1).
– Pseudo-Relevance Feedback (PRF): We performed some initial experiments on the training data to identify promising values for the number of terms and documents to use. We found that selecting a single term from a small set of only five documents was beneficial for this specific task and data set. Using more terms dramatically decreased the classification performance.
– Cross-Language Thesaurus Query Expansion (CLTQE): We used cross-language thesaurus query expansion for those queries which returned less than two documents. Again, only the first returned term was extracted and fed back to the system to reformulate the query, for the same reason as in the case of PRF.

The automatic threshold TLpD is based on the scores of the retrieved documents. Thereby RSVavg denotes the average score and RSVmax the maximum score of the documents retrieved, and Numdocs stands for the total number of documents retrieved for a specific class label. Please note that the explanation of the formula given in [9] was not correct.
Fig. 1. General System Architecture (workflow from the class labels through query formulation, token processing with stopword removal and stemming, query expansion via PRF and CLTQE, and the label-per-document limit, to the document list produced by the Xtrieval framework with the Lucene API)
T_{LpD} = RSV_{avg} + 2 \cdot \frac{RSV_{max} - RSV_{avg}}{Num_{docs}}    (1)
3
Experimental Setup and Results
In this section we report results that were obtained by running various system configurations on the test data. The experimental results on the training data are completely reported in [9]. Regarding the evaluation of the task we had a problem with calculating the measures. The MAP values reported by trec eval and our Xtrieval framework had marginal variations due to the fact that our system allows to return two documents with identical RSV. Unfortunately we were neither able to correct the behavior of our system nor could we find out when or why the trec eval tool reorders our result sets. Since the evaluation results had only small variations (see tables 1 and 2 in [9]) we do only report MAP values calculated by our framework to avoid confusion. Furthermore we present results for additional experiments that were not officially submitted. Column captions 2-5 of all result tables in the following subsections refer to specific system parameters that were introduced in section 2.2. Please note that the utilization of the threshold formula is denoted with x in column LpD. Experiments that were submitted for official evaluation are denoted with *. The performance of the experiments is reported with respect to overall sum of assigned labels (SumL), the average ratio of correct classifications (CR), average recall (AR) as well as mean average precision (MAP) and the F-Measure calculated over CR and AR. 3.1
Baseline Experiments
Table 1 contains results for our experiments without any query expansion. The only difference in the reported runs was the metadata source (SF) that was used in the retrieval stage. It is obvious that the best results in terms of AR and MAP were achieved when the ASR output and the archival metadata was used. The highest correct classification rate was obtained by using only archival metadata terms.
Video Classification as IR Task
381
Table 1. Results for Baseline Experiments ID SF LpD SumL CR AR MAP cut1 l1 base* asr 1 27 0.0741 0.0102 0.0104 cut2 l1 base meta 1 63 0.6349 0.2010 0.2003 cut3 l1 base* meta + asr 1 112 0.5000 0.2814 0.2541
3.2
F-Meas 0.0177 0.3053 0.3601
Experiments with Query Expansion
In the following list of experiments we used two types of query expansion. First we applied the PRF approach on all queries. It was briefly described in section 2.2. Additionally the CLTQE method was implemented to handle cases in which no or only few documents were returned. Table 2 is divided into 3 blocks depending on how many labels per document were allowed. It is obvious that using only archival metadata resulted in highest MAP. Average recall was similar for all experiments using archival metadata or combining archival metadata and ASR transcripts. Looking at the correct classification rate we observed that highest rates were achieved for experiments, where the number of assigned labels for each document were restricted to 1. Without this restriction the correct classification rate decreased dramatically. Using the proposed restriction formula from section 2.2 resulted in a balance of CR and MAP. The evaluation with respect to the F-Measure shows highest performance for the combination of archival metadata and ASR output. Table 2. Results using Query Expansion ID cut4 l0 qe cut5 l0 base cut6 l0 qe cut7 l1 qe* cut8 l1 base cut9 l1 qe* cut10 lx base cut11 lx qe
3.3
SF LpD SumL CR AR MAP F-Meas asr ∞ 1,571 0.0350 0.2764 0.1036 0.0621 meta ∞ 1,933 0.0792 0.7688 0.4391 0.1435 meta + asr ∞ 2,276 0.0690 0.7889 0.4389 0.1269 asr 1 158 0.1266 0.1005 0.0904 0.1120 meta 1 196 0.3776 0.3719 0.2867 0.3747 meta + asr 1 196 0.3622 0.3568 0.2561 0.3595 meta x 396 0.2879 0.5729 0.4115 0.3832 meta + asr x 482 0.2427 0.5879 0.4130 0.3436
Impact of Different Query Expansion Methods
This section deals with the effects of the two automatic expansion techniques. Therefore we switched PRF and CLTQE on and off for selected experiments from section 3.2 and aggregated the results. Table 3 is divided into 2 blocks corresponding to different values for threshold LpD, namely LpD=1 for 1 label per document and LpD=x, where formula (1) from section 2.2 was used.
382
J. K¨ ursten and M. Eibl Table 3. Comparing the Impact of Query Expansion Approaches
ID cut2 l1 base cut12 l1 base cut13 l1 base cut3 l1 base* cut14 l1 qe cut15 l1 qe cut16 lx base cut17 lx base cut18 lx base cut19 lx qe cut20 lx qe cut21 lx qe
SF meta meta meta meta meta meta meta meta meta meta meta meta
+ asr + asr + asr
+ asr + asr + asr
PRF CLTQE LpD SumL CR AR MAP F-Meas no no 1 63 0.6349 0.2010 0.2003 0.3053 yes no 1 195 0.3846 0.3769 0.3055 0.3807 no yes 1 68 0.6176 0.2111 0.2033 0.3146 no no 1 112 0.5000 0.2814 0.2541 0.3601 yes no 1 196 0.3622 0.3568 0.2619 0.3595 no yes 1 112 0.4821 0.2714 0.2275 0.3473 no no x 84 0.5714 0.2412 0.2386 0.3392 yes no x 366 0.3060 0.5628 0.4140 0.3965 no yes x 92 0.5543 0.2563 0.2418 0.3505 no no x 162 0.4383 0.3568 0.2978 0.3934 yes no x 466 0.2489 0.5829 0.4108 0.3489 no yes x 169 0.4083 0.3467 0.2707 0.3750
The results show that the automatic feedback approach is superior to the thesaurus expansion in all experiments. This observation complies with our expectation, because CLTQE was only used in rather rare cases, where no or only few documents matched the given class label. Interestingly using CLTQE results in very small gains in terms of MAP and only when the source field for retrieval was archival metadata (compare ID’s cut2 to cut13 and cut16 to cut18). The CLTQE approach decreased retrieval performance in experiments where both source fields were used. 3.4
General Observations and Interpretation
The best correct classification rates (CR) were achieved without using any form of query expansion (see ID’s cut2, cut3 and cut19) for all data sources used. The best overall CR was achieved by using only archival metadata in the retrieval phase (see ID cut2). Since the archival metadata fields contain intellectual annotations this is a very straightforward finding. Using archival metadata only also resulted in best performance in terms of MAP and AR. Nevertheless the gap to the best results when combining ASR output with archival metadata is very small (compare ID cut5 to cut6 or cut10 to cut11). Regarding our proposed automatic threshold calculation for limitation of the number of assigned labels per document the results are twofold. On the one hand there is a slight improvement in terms of MAP and AR compared to a fixed threshold LpD=1 assigned labels per document. On the other hand the overall correct classification rate (CR) decreases in the same magnitude as MAP and AR are increasing. The interpretation of our experimental results led us to the conclusion that using MAP for evaluating a multi-label classification task is somehow questionable. In our point of view the main reason is that MAP does not take into account the overall correct classification rate CR. Take a close look on the two best performing experiments using archival metadata and ASR transcriptions in table 2 (see ID’s cut6 and cut11). The difference in terms of MAP is about 6%, but
Video Classification as IR Task
383
the gain in terms of CR is about 352%. In our opinion in a real world scenario where assigning class labels to video documents should be completely automatic it would be essential to take into account the overall ratio of correctly assigned labels. We used the F-measure composed of AR and CR to derive an evaluation measure, which takes into account the overall precision of the classification, recall and the total number of assigned labels. Regarding the F-measure the best overall performance was achieved by using our proposed threshold formula on the archival metadata (see ID cut17). Nevertheless the gap between using intellectual metadata only and its combination with automatic metadata like ASR output was fairly small (compare ID’s cut17 to cut19 or cut12 to cut14).
4
Result Analysis - Summary
The following list provides a short summary of our observations and findings from the participation in the VideoCLEF classification task in 2009. – Classification as an IR task: According to the observations from last year, we conclude that treating the given task as a traditional IR task with some modifications is a quite successful approach. – Metadata Sources: Combining ASR output and archival metadata improves MAP and AR when no query expansion was used. However, best performance was achieved by querying archival metadata fields only and using QE. – Label Limits: We compared an automatically calculated threshold to low manual set thresholds and found that the automatic threshold works better in terms of MAP and AR. – Query Expansion: Automatic pseudo-relevance feedback improved the results in terms of MAP in all experiments. The impact of the CLTQE was very small and it even decreased performance when both fields (intellectual and automatic metadata) were queried. – Evaluation Measure: In our opinion using MAP as evaluation measure for a multi-label classification task is questionable. Therfore we also calculated the F-measure based on CR and AR.
5
Conclusion and Future Work
This year we used the Xtrieval framework for the VideoCLEF classification task. With our experimental evaluation we can confirm the observations from last year, where approaches treating the task as IR problem were most successful. We proposed an automatic threshold to limit the number of assigned labels per document to preserve high correct classification rates. This seems to be an issue that could be worked on in the future. A manual restriction of assigned labels per document is not an appropriate solution in a real world problem, where possibly hundreds of thousand video documents have to be labeled with maybe hundreds of different topic labels. Furthermore one could try to evaluate different retrieval models and try to combine the results from those models to gain a better overall
384
J. K¨ ursten and M. Eibl
performance. Finally, it should be evaluated whether assigning field boosts to the metadata sources could improve performance when intellectual annotations are combined with automatically extracted metadata.
Acknowledgments We would like to thank the VideoCLEF organizers and the Netherlands Institute of Sound and Vision (Beeld & Geluid) for providing the data sources for the task. This work was accomplished in conjunction with the project sachsMedia, which is funded by the Entrepreneurial Regions 8 program of the German Federal Ministry of Education and Research.
References 1. K¨ ursten, J., Wilhelm, T., Eibl, M.: Extensible Retrieval and Evaluation Framework: Xtrieval. In: Workshop Proceedings of LWA 2008: Lernen - Wissen - Adaption, W¨ urzburg (October 2008) 2. Larson, M., Newman, E., Jones, J.F.G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 3. Larson, M., Newman, E., Jones, J.F.G.: Overview of VideoCLEF 2008: Automatic Generation of Topic-based Feeds for Dual Language Audio-Visual Content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009) 4. Villena-Rom´ an, J., Lana-Serrano, S.: MIRACLE at VideoCLEF 2008: Topic Identification and Keyframe Extraction in Dual Language Videos. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 572–576. Springer, Heidelberg (2009) 5. Newman, E., Jones, G.J.F.: DCU at VideoClef 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 923–926. Springer, Heidelberg (2009) 6. Perea-Ortega, J.M., Montejo-Ra´ez, A., Mart´ın-Valdivia, M.T.: Using an Information Retrieval System for Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 927–930. Springer, Heidelberg (2009) 7. K¨ ursten, J., Richter, D., Eibl, M.: VideoCLEF 2008: ASR Classification with Wikipedia Categories. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 931–934. Springer, Heidelberg (2009) 8. He, J., Zhang, X., Weerkamp, W., Larson, M.: Metadata and Multilinguality in Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 935–938. Springer, Heidelberg (2009) 9. K¨ ursten, J., Eibl, M.: Chemnitz at VideoCLEF 2009: Experiments and Observations on Treating Classification as IR Task. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 30-2 October (2009) 8
8 The Innovation Initiative for the New German Federal States.
Exploiting Speech Recognition Transcripts for Narrative Peak Detection in Short-Form Documentaries

Martha Larson1, Bart Jochems2, Ewine Smits1, and Roeland Ordelman2

1 Multimedia Information Retrieval Lab, Delft University of Technology, 2628 CD Delft, Netherlands
2 Human Media Interaction, University of Twente, 7500 AE Enschede, Netherlands
{m.a.larson,e.a.p.smits}@tudelft.nl, b.e.h.jochems@student.utwente.nl, ordelman@ewi.utwente.nl
Abstract. Narrative peaks are points at which the viewer perceives a spike in the level of dramatic tension within the narrative flow of a video. This paper reports on four approaches to narrative peak detection in television documentaries that were developed by a joint team consisting of members from Delft University of Technology and the University of Twente within the framework of the VideoCLEF 2009 Affect Detection task. The approaches make use of speech recognition transcripts and seek to exploit various sources of evidence in order to automatically identify narrative peaks. These sources include speaker style (word choice), stylistic devices (use of repetitions), strategies strengthening viewers’ feelings of involvement (direct audience address) and emotional speech. These approaches are compared to a challenging baseline that predicts the presence of narrative peaks at fixed points in the video, presumed to be dictated by natural narrative rhythm or production convention. Two approaches deliver top narrative peak detection results. One uses counts of personal pronouns to identify points in the video where viewers feel most directly involved. The other uses affective word ratings to calculate scores reflecting emotional language.
1 Introduction
While watching video content, viewers feel fluctuations in their emotional response that can be attributed to their perception of changes in the level of dramatic tension. In the literature on affective analysis of video, two types of content have received particular attention: sports games and movies [1]. These two cases differ with respect to the source of viewer-perceived dramatic tension. In the case of sports, tension spikes arise as a result of the unpredictable interactions of the players within the rules and physical constraints of the game. In the case of movies, dramatic tension is carefully crafted into the content by a team including scriptwriters, performers, special effects experts, directors and producers. The difference between the two cases is the amount and nature of human intention – i.e., premeditation, planning, intervention – involved in the
creation of the sequence of events that plays out over time (and space). We refer to that sequence as a narrative and to high points in the dramatic tension within that narrative as narrative peaks. We are interested in investigating a third case of video content, namely television documentaries. We consider documentaries to be a form of “edu-tainment,” whose purpose is both to inform and entertain the audience. The approaches described and tested here have been developed in order to detect narrative peaks within documentary videos. Our work differs in an important respect from previous work in the domains of sports and movies. Dramatic tension in documentaries is never completely spontaneous – the narrative curve follows a previously laid out plan, for example a script or an outline, that is carried out during the process of production. However, dramatic tension is characteristically less tightly controlled in a documentary than it would be in a movie. In a movie, the entire content is subordinated to the plot, whereas a documentary may follow one or more story lines, but it simultaneously pursues the goal of providing the viewer with factual subject matter. Because of these differences, we chose to dedicate separate and specific attention to the affective analysis of documentaries and in particular to the automatic detection of narrative peaks. This paper reports on joint work carried out by research groups at two universities in the Netherlands, Delft University of Technology and the University of Twente, on the Affect Detection task of the VideoCLEF1 track of the 2009 Cross-Language Evaluation Forum (CLEF)2 benchmark evaluations. The Affect Detection task involves automatically identifying narrative peaks in short-form documentaries. In the rest of this paper, we first give a brief description of the data and the task. Then, we present the approach that we took to the task and give the details of the algorithms used in each of the five runs that we submitted. We report the results achieved by these runs and then conclude with a summary and outlook.
2 Experimental Setup
2.1 Data Set and Task Definition
The data set for the VideoCLEF 2009 Affect Detection task consisted of 45 episodes from the Dutch-language short-form documentary series called Beeldenstorm (in English, ‘Iconoclasm’). The series treats topics in the visual arts, integrating elements from history, culture and current events. Beeldenstorm is hosted by Prof. Henk van Os, known not only for his art expertise, but also for his narrative ability. Henk van Os is highly acclaimed and appreciated in the Netherlands, where he has established his ability to appeal to a broad audience.3 Constraining the corpus to contain episodes from Beeldenstorm limits the spoken content to a single speaker speaking within the style of a single documentary
1 http://www.multimediaeval.org/videoclef09/videoclef09.html
2 http://www.clef-campaign.org/
3 http://www.avro.nl/tv/programmas a-z/beeldenstorm/
series. This limitation is imposed in order to help control effects that could be introduced by variability in style or skill. Experimentation on the ability of algorithms to transfer performance to other domains is planned for the future. An additional advantage of using the Beeldenstorm series is that the episodes are relatively short, approximately eight minutes in length. Because they are short, the assessors who create the ground truth for the test collection (discussed below) are able to watch each video in its entirety. It is essential for assessors to watch the entire video in order to judge relative rises in tension over the course of the narrative. In short, the Beeldenstorm program provides a highly suitable corpus for developing and evaluating algorithms for narrative peak detection.

Ground truth was created for Beeldenstorm by a team of assessors who speak Dutch natively or at an advanced level. The assessors were told that the Beeldenstorm series is known to contain humorous and moving moments and that they could use that information to formulate an opinion of what constitutes a narrative peak. They were asked to mark the three points in the video where their perception of the level of dramatic tension reached the highest peaks. Peaks were required to be a maximum of ten seconds in length.

For the Affect Detection task of VideoCLEF 2009, task participants were supplied with an example set containing five Beeldenstorm episodes in which example narrative peaks had been identified by a human assessor. On the basis of their observations and generalizations concerning the peaks marked in the example set, the task participants designed algorithms capable of automatically detecting similar peaks in the test set. The test set contained 45 videos and was mutually exclusive with the example set. Participants were required to identify the three highest peaks in each episode. Up to five different runs (i.e., system outputs created according to different experimental conditions) could be submitted. Further details about the data set and the Affect Detection task for VideoCLEF 2009 can be found in the track overview paper [3]. Participants were provided with additional resources accompanying the test data, including transcripts generated by an automatic speech recognition system [2]. Our approaches, described in the next section, focus on exploiting the contents of the speech transcripts for the purpose of automatically detecting narrative peaks.
2.2 Narrative Peak Detection Approaches
Our approaches consist of a sophisticated baseline and four other techniques for using speech recognition transcripts to automatically detect narrative peaks. We describe each algorithm in turn.

Fixing Time Points (duotu09fix). Our baseline approach duotu09fix4 hypothesizes fixed time points for three narrative peaks in each episode. These points were set at fixed distances from the start of each video: (1) 44 secs, (2) 7 mins 9 secs and (3) 3 mins 40 secs. They were selected by analyzing the peak
4 duotu is an acronym indicating the combined efforts of Delft University of Technology and the University of Twente.
positions in the example set and choosing three that appeared typical. They are independent of episode content and are the same for every episode. We chose this approach in order to establish a baseline against which our speech-transcript-based peak detection algorithms can be compared. Because the narrative structure of the episodes adheres to some basic patterns, presumably due to natural narrative rhythm or production convention, choosing fixed time points is actually a quite competitive approach and constitutes a challenging baseline.

Counting Indicator Words (duotu09ind). We viewed the example videos and examined the words that were spoken during the narrative peaks that the assessor had marked in these videos. We formulated the hypothesis that the speaker applies a narrow range of strategies for creating narrative peaks in the documentary. These strategies might be reflected in a relatively limited vocabulary of words that could be used as indicators in order to predict the position of narrative peaks. We compiled a list of narrative peak indicators by analyzing the words spoken during each of the example peaks, selecting words and word stems that seemed relatively independent of the topic at that point in the video and that could plausibly be characteristic of the general word use of the speaker during peaks. The duotu09ind algorithm detects narrative peaks using the following sequence of steps. First, a set of all possible peak candidates was established by moving a 10-second sliding window over the speech recognition transcripts, advancing the window by one word at each step. Each peak candidate is maximally 10 seconds in length, but can be shorter if the speech in the window lasts for less than the 10-second duration of the window. Peak candidates of less than three seconds in length are discarded. Then, the peak candidates are ranked with respect to the raw count of the indicator words that they contain. The size limitation of the sliding window already introduces a normalizing effect, and for this reason we do not undertake further normalization of the raw counts. Finally, peak candidates are chosen from the ranked list, starting at the top, until a total of three peaks has been selected. If a candidate has a midpoint that falls within eight seconds of the midpoint of a previously selected candidate in the list, that candidate is discarded and the next candidate from the list is considered instead.

Counting Word Repetitions (duotu09rep). Analysis of the word distributions in the example set suggested that repetition may be a stylistic device that is deployed to create peaks. The duotu09rep algorithm uses the same list of peak candidates described in the explanation of duotu09ind. The peak candidates are ranked by the number of occurrences they contain of words that occur multiple times. In order to eliminate the impact of function words, stop word removal is performed before the peak candidates are scored. Three peaks are selected starting from the top of the ranked list of peak candidates, using the same procedure as was described above.

Counting First and Second Person Pronouns (duotu09pro). We conjecture that dramatic tension rises along with the level to which the viewers feel that they are directly involved in the video content they are watching.
The duotu09pro approach identifies two possible conditions of heightened viewer involvement: when viewers feel that the speaker in the videos is addressing them directly or as individuals, or, second, when viewers feel that the speaker is sharing something personal. In the duotu09pro approach we use second person pronominal forms (e.g., u, ‘you’; uw, ‘your’) to identify audience-directed speech and first person pronominal forms (e.g., ik, ‘I’) to identify personal revelation of the speaker. The duotu09pro algorithm uses the same list of peak candidates and the same method of choosing from the ranked candidate lists that was used in duotu09ind and duotu09rep. For duotu09pro, the candidates are ranked according to the raw count of first and second person pronominal forms that they contain. Again, no normalization was applied to the raw count.

Calculating Affective Ratings (duotu09rat). The duotu09rat approach uses an affective rating score that is calculated in a straightforward manner using known affective levels of words in order to identify narrative peaks. The approach makes use of Whissell’s Dictionary of Affect in Language [5] as deployed in the implementation of [4], which is available online5. This dictionary of words and scores focuses on the scales of pleasantness and arousal levels. The scales are called evaluation and activation and they both range from -1.00 to 1.00. Under our approach, narrative peaks are identified with a high-arousal emotion combined with either a very pleasant or unpleasant emotion. In order to score words, we combine the evaluation and the activation scores into an overall affective word score, calculated using Equation 1:

wordscore = evaluation^2 + activation^2    (1)

If a certain word has a negative arousal, its wordscore is set to zero. In this way, wordscore captures high arousal only. In order to apply the dictionary, we first translate the Dutch-language speech recognition transcripts into English using the Google Language API6. The duotu09rat algorithm uses the same list of peak candidates used in duotu09ind, duotu09rep and duotu09pro. Candidates are ranked according to the average wordscore of the words that they contain, calculated using Equation 2:

rating = \frac{1}{N} \sum_{n=1}^{N} wordscore_n    (2)

Here, N is the number of words contained within a peak candidate that are included in Whissell’s Dictionary. Selection of peaks proceeds as in the other approaches, with the exception that the peak proximity condition was set to be more stringent: edges of peaks are required to be 4 secs apart from each other. The imposition of the more stringent condition reflects a design decision made in regard to the implementation and does not represent an optimized value. The wordscore curve for an example episode is illustrated in Figure 1. The peaks hypothesized by the system are indicated with circles.
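The candidate-generation and selection procedure shared by duotu09ind, duotu09rep, duotu09pro and duotu09rat can be sketched as below. Words are assumed to be (token, start, end) triples from the ASR transcript; the scoring function shown counts pronominal forms as in duotu09pro, and the pronoun list beyond ik/u/uw is an illustrative assumption.

```python
def peak_candidates(words, window=10.0, min_len=3.0):
    """Slide a 10-second window over the transcript, advancing by one word."""
    candidates = []
    for i, (_, start, _) in enumerate(words):
        group = [w for w in words[i:] if w[2] <= start + window]
        if group and group[-1][2] - start >= min_len:
            candidates.append(group)
    return candidates

PRONOUNS = {"ik", "mij", "mijn", "u", "uw"}   # first/second person forms (illustrative)

def pronoun_score(group):
    return sum(1 for token, _, _ in group if token.lower() in PRONOUNS)

def select_peaks(candidates, score_fn, n_peaks=3, min_gap=8.0):
    """Take candidates from the ranked list, skipping any whose midpoint lies
    within min_gap seconds of an already selected peak."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    chosen, midpoints = [], []
    for group in ranked:
        mid = (group[0][1] + group[-1][2]) / 2.0
        if all(abs(mid - m) > min_gap for m in midpoints):
            chosen.append(group)
            midpoints.append(mid)
        if len(chosen) == n_peaks:
            break
    return chosen

# peaks = select_peaks(peak_candidates(asr_words), pronoun_score)
```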
5 http://technology.calumet.purdue.edu/met/gneff/NeffPubl.html
6 http://code.google.com/intl/nl/apis/ajaxlanguage
Fig. 1. Plot of affective score over the course of an example video (Beeldenstorm episode Kluizenaars in de kunst, ‘Hermits in art’). The three top peaks identified by duotu09rat are marked with circles.
3 Experimental Results
We tested our five experimental approaches on the 45 videos in the test set. Evaluation of results was carried out by comparing the peak positions hypothesized by each experimental system with peak positions that were set by human assessors. In total, three assessors viewed each of the test videos and set peaks at the three points where they felt most highly affected by the narrative tension created by the video content. In total the assessors identified 293 distinct narrative peaks in the 45 test episodes. Peaks identified by different assessors were considered to be the same peak if they overlapped by at least two seconds. This value was set on the basis of observations by the assessors on characteristic distances between peaks. Overlapping peaks were merged by fitting the overlapped region with a ten-second window. This process was applied so that merged peaks could never exceed the specified peak length of ten seconds.

Two methods of scoring the experiments were applied, the point-based approach and the peak-based approach. Under point-based scoring, a peak hypothesis scores a point for each assessor who selected a reference peak that is within eight seconds of that hypothesis peak. The total number of points returned by the run is the reported run score. A single episode can earn a run between three points (assessors chose completely different peaks) and nine points (assessors all chose the same peaks). In reality, however, no episode falls at either of these extremes. The distribution of the peaks in the files is such that a perfect run would earn 246 points. Under peak-based scoring, the total number of correct peaks is reported as the run score. Three different types of reference peaks are defined for peak-based scoring. The difference is related to the number of assessors required to agree for a point in the video to be counted as a peak. Of the 293 peaks identified, 203 are “personal peaks” (peaks identified by only one assessor), 90 are “pair peaks” (peaks identified by at least two assessors) and 22 are “general peaks” (peaks upon which all three assessors agreed). Peak-based scores are reported separately for each of these types of peaks. A summary of the results of the evaluation is given in Table 1.
Table 1. Narrative peak detection results

measure                     duotu09fix   duotu09ind   duotu09rep   duotu09pro   duotu09rat
point-based                 47           55           30           63           59
peak-based ("personal")     28           38           21           44           33
peak-based ("pair")         8            12           7            17           18
peak-based ("general")      4            2            0            4            6
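For reference, the point-based scoring described above can be sketched as follows; hypothesis and reference peaks are (start, end) pairs in seconds, and measuring the eight-second tolerance between peak midpoints is a simplifying assumption of this sketch.

```python
def midpoint(peak):
    start, end = peak
    return (start + end) / 2.0

def point_based_score(hypotheses, assessor_peaks, tolerance=8.0):
    """One point for each assessor whose marked peak lies within `tolerance`
    seconds of a hypothesized peak; assessor_peaks holds one list per assessor."""
    points = 0
    for hyp in hypotheses:
        for peaks in assessor_peaks:
            if any(abs(midpoint(hyp) - midpoint(ref)) <= tolerance for ref in peaks):
                points += 1
    return points
```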
From these results it can be seen that duotu09pro, the approach that counted first and second person pronouns, and duotu09rat, the approach that made use of affective word scores are the best performing approaches. The approach relying on a list of peak indicator words, i.e., duotu09ind, performed surprisingly well considering that the list was formulated on the basis of a very limited number of examples.
4 Conclusion and Outlook
We have proposed four approaches to the automatic detection of narrative peaks in short-form documentaries and have evaluated these approaches within the framework of the VideoCLEF 2009 Affect Detection task, which uses a test set consisting of episodes from the Dutch-language documentary on the visual arts called Beeldenstorm. Our proposed approaches exploit speech recognition transcripts. The two most successful algorithms are based on the idea that narrative peaks are perceived where particularly emotional speech is being used (duotu09rat) or where the viewer feels specifically addressed by or involved in the video (duotu09pro). These two approaches easily beat both the random baseline and a challenging baseline approach hypothesizing narrative peaks at set positions in the video. Approaches based on capturing speaking style, either by using a set of indicator words typical for the speaker or by trying to determine where repetition is being used as a stylistic device, proved less helpful. However, the experiments reported here are not extensive enough to exclude the possibility that they would perform well given a different implementation. Future work will involve returning to many of the questions opened here; for example, while selecting peak-indicator words, we noticed that contrasts introduced by the word ‘but’ appear to often be associated with narrative peaks. Stylistic devices in addition to repetition, for example the use of questions, could also prove to be helpful. Under our approach, peak candidates are represented by their spoken content. We would also like to investigate enriching the representations of peak candidates using words derived from surrounding regions in the speech transcripts or from an appropriate external text collection. Finally,
we intend to develop peak detection methods based on the combination of information sources, in particular exploring whether information derived from pronoun occurrences can enhance affect-based ratings.
Acknowledgements. The work was carried out within the PetaMedia Network of Excellence and has received funding from the European Commission's 7th Framework Program under grant agreement no. 216444.
Identification of Narrative Peaks in Video Clips: Text Features Perform Best

Joep J.M. Kierkels1,2, Mohammad Soleymani2, and Thierry Pun2

1 Department of medical physics, TweeSteden hospital, 5042AD Tilburg, the Netherlands
2 Computer vision and multimedia laboratory (CVML), Computer Science Department, University of Geneva, Battelle Campus, Building A, 7 Route de Drize, CH-1227 Carouge, Geneva, Switzerland
jkierkels@tsz.nl, {mohammad.soleymani,thierry.pun}@unige.ch
Abstract. A methodology is proposed to identify narrative peaks in video clips. Three basic clip properties are evaluated, reflecting video-, audio- and text-related features of the clip. Furthermore, the expected distribution of narrative peaks throughout the clip is determined and exploited when predicting peaks. Results show that only the text-related feature, based on the usage of distinct words throughout the clip, and the expected peak distribution are of use when finding the peaks. On the training set, our best detector had an accuracy of 47% in finding narrative peaks. On the test set, this accuracy dropped to 24%.
1 Introduction

A challenging issue in content-based video analysis techniques is the detection of sections that evoke increased levels of interest or attention in viewers of videos. Once such sections are detected, a summary of a clip can be created which allows for faster browsing through relevant sections. This saves valuable time for any viewer who merely wants to see an overview of the clip. Past studies on highlight detection often focus on analyzing sports videos [1], in which highlights usually show abrupt changes in content features. Although clips usually contain audio, video, and spoken text content, many existing approaches focus on merely one of these [2;3]. In the current paper, we will attempt to compare and show results for all three modalities. The proposed methodology to identify narrative peaks in video clips was presented at the VideoCLEF 2009 subtask on "Affect and Appeal" [4]. The clips that were given in this subtask were all taken from a Dutch program called "Beeldenstorm". They were in Dutch, had durations between seven and nine minutes, consisted of video and audio, and had speech transcripts available. Detection accuracy was determined by comparison against manual annotations of narrative peaks provided by three annotators. The annotators were either native Dutch speakers or fluent in Dutch. Each annotator chose the three highest affective peaks of each episode.
While viewing the clips, finding clear indicators as to which specific audiovisual features could be used to identify narrative peaks was not straightforward, even by looking at the annotations that were provided with the training set. Furthermore, we noticed that there was little consistency among the annotators because, across annotators, more than three distinct narrative peaks were indicated for all clips. This led to the conclusion that tailoring any detection method to a single person's view on narrative peaks would not be fruitful, and hence we decided to work only with basic features. These features are expected to be indicators of narrative peaks that are common to most observers, including the annotators. Our approach for detecting peaks consists of a top-down search for relevant features, i.e., first we computed possibly relevant features and then we investigated which of these features really enhanced detection accuracy. Three different modalities were treated separately. First, video, in MPEG-1 format, was used to determine at what place in the clip frames showed the largest change compared to a preceding key frame. Second, audio, in MPEG layer 3 format, was used to determine at what place in the clip the speaker has an elevated pitch or an increased speech volume. Third, text, taken from the available metadata xml files in MPEG-7 format, was used to determine at what place in the clip the speaker introduced a new topic. In addition, the expected distribution of narrative peaks over clips was considered. Details on how all these steps were implemented are given in Section 2, followed by results of our approach on the given training data in Section 3. Discussion of the obtained results and evaluations is given in Section 4. In Section 5 several conclusions are drawn from these results. In the VideoCLEF subtask, the focus of detecting segments of increased interest is on the data, i.e., we analyze parts of the shown video clip to predict their impact on a viewer. There exists, however, a second approach to identifying segments of increased interest, which focuses not on the data but directly on the reactions of a viewer, e.g., by monitoring physiological activity such as heart rate [5] or by filming facial expressions [6]. Based on such reactions, the affective state of a viewer can be estimated and one can estimate levels of excitation, attention and interest in a viewer [7]. By themselves, physiological activity measures can thus be used to estimate interest, but they could also be used to validate the outcomes of data-based techniques.
2 Feature Extraction

For the different modalities, feature extraction will be described separately in the following subsections. As the topic of detecting affective peaks is quite unexplored, only basic features were implemented. This provides an initial idea of which features are useful, and future studies could focus on enhancing the relevant basic features. Feature extraction was implemented using MATLAB (Mathworks Inc).

2.1 Video Features

Our key assumption for video features was that dramatic tension is related to big changes in video. It is a film editor's choice to include such changes along time [8],
and this may be used to stress the importance of certain parts in the clip. The proposed narrative peak detector outputs a 10 s window of enhanced dramatic tension; for videos with a frame rate of 25 frames per second, frame-level precision is therefore unnecessarily fine and merely slows down computations. Hence only the key frames (I-frames) are treated.
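The exact frame-difference measure used for the video feature is not specified in the text recovered here, so the sketch below should be read as an assumed reconstruction in Python (the authors used MATLAB): I-frames are taken to be already decoded into grayscale arrays, the change between consecutive I-frames is measured as mean absolute pixel difference, and the 10 s smoothing, scaling and zero-mean step applied to the other modalities is assumed to apply here as well.

import numpy as np

def video_feature(keyframes, clip_len_s, win=10):
    # keyframes: list of (timestamp_in_seconds, grayscale I-frame as a 2-D array)
    video = np.zeros(clip_len_s)
    for (t0, f0), (t1, f1) in zip(keyframes, keyframes[1:]):
        sec = min(int(t1), clip_len_s - 1)
        # assumed change measure: mean absolute pixel difference to the preceding I-frame
        video[sec] = np.abs(f1.astype(float) - f0.astype(float)).mean()
    kernel = np.ones(win) / win
    smoothed = np.convolve(video, kernel, mode="same")   # average over a 10 s window
    peak = np.abs(smoothed).max()
    scaled = smoothed / peak if peak > 0 else smoothed    # maximum absolute value of one
    return scaled - scaled.mean()                         # set to zero mean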
Fig. 1. Illustration of single-modality feature values (arbitrary score plotted against time in seconds) computed over time. A: Video feature, B: Audio features, C: Text feature. All plots are based on the episode with the identification code (ID) "BG_37016".
2.2 Audio Features

The key assumption for audio was that a speaker has an elevated pitch or an increased speech volume when applying dramatic tension, as suggested in [9;10]. The audio is encoded at a 44.1 kHz sampling rate in MPEG layer 3 format. The audio signals only contain speech, except for short opening and closing credits at the start and end of each episode. The audio signal is divided into 0.5 s segments, for which the average pitch of the speaker's voice is computed by imposing a Kaiser window and applying a Fast Fourier Transform. In the transformed signal, the frequency with maximum power is determined and is assumed to be the average pitch of the speaker's voice over this window. Next, the difference in average pitch between subsequent segments is computed. If a segment's average pitch is less than 2.5 times as high as the pitch of the preceding segment, its pitch value is set to zero. This way, only those segments with a strong increase in pitch (a supposed indicator of dramatic tension) are kept. Speech volume is determined by computing the averaged absolute value of the audio signal within the 0.5 s segment. As a final step, the resulting signals for pitch and volume are both smoothed by averaging over a 10 s window, and each smoothed signal is scaled to have a maximum absolute value of one and subsequently to have a mean of zero. Next, they are down-sampled by a factor of 2, resulting in vectors audio1 and audio2 which both contain one value per second, as illustrated in Fig. 1B.

2.3 Text Features

The main assumption for text is that dramatic tension starts with the introduction of a new topic, and hence involves the introduction of new vocabulary related to this topic. Text transcripts are obtained from the available metadata xml files. The absolute
occurrence frequency for each word was computed. Words that occurred only once were considered to be non-specific and were ignored. Words that occurred more than five times were considered too general and were also ignored. The remaining set of words is considered to be topic-specific. Based on this set of words, we estimated where the changes in used vocabulary are the largest. A vector v filled with zeros was initialized, having a length equal to the number of seconds in the clip. For each remaining word, its first and last appearance in the metadata container was determined and rounded off to whole seconds; subsequently, all elements in v between the elements corresponding to these timestamps are increased by one. Again, the resulting vector v is averaged over a 10 s window, scaled and set to zero mean. The resulting vector text is illustrated in Fig. 1C.

2.4 Distribution of Narrative Peaks

A clip is directed by a program director and is intended to hold the attention of the viewer. To this end, it is expected that points of dramatic tension are distributed over the duration of the whole clip, and that not all moments during a clip are equally likely to have dramatic tension. For each dramatic tension point indicated by the annotators, its time of occurrence was determined (mean of start and stop timestamp) and a histogram, illustrated in Fig. 2, was created based on these occurrences. Based on this histogram, a weighting vector w was created for each recording. Vector w contains one element for each second of the clip. Each element's value is determined according to the histogram.
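Returning to the text feature of Sect. 2.3, the following Python sketch (ours, not the authors' MATLAB implementation) shows one way the word-span vector v could be computed; it assumes word-level timestamps are available from the metadata files, and all names are illustrative.

import numpy as np
from collections import Counter

def text_feature(word_times, clip_len_s, win=10):
    # word_times: list of (word, time_in_seconds) pairs taken from the transcript
    counts = Counter(w for w, _ in word_times)
    # ignore words occurring only once (non-specific) or more than five times (too general)
    topical = {w for w, c in counts.items() if 1 < c <= 5}
    v = np.zeros(clip_len_s)
    for w in topical:
        times = [t for word, t in word_times if word == w]
        first, last = int(round(min(times))), int(round(max(times)))
        v[first:last + 1] += 1      # the word is considered active between first and last use
    kernel = np.ones(win) / win
    v = np.convolve(v, kernel, mode="same")   # average over a 10 s window
    peak = np.abs(v).max()
    v = v / peak if peak > 0 else v           # scale to a maximum absolute value of one
    return v - v.mean()                       # set to zero mean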
Fig. 2. Histogram (peak count per time bin, time in seconds) that illustrates when dramatic tension points occur in the clips according to the annotators. Note that during the first several seconds there is no tension point at all.
2.5 Fusion and Selection

For fusion of the features, our approach consisted of giving equal importance to all used features. After fusion, the weights vector w can be applied and the final indicator of dramatic tension drama is derived as (shown for all three features):

drama = w · ( video + (audio1 + audio2)/2 + text )^T    (2)
The estimated three points of increased dramatic tension are then obtained by selecting the three maxima from drama. The three top estimates for dramatic points are constructed by selecting the intervals starting 5s before these peaks and ending 5s afterwards. If either the second or third highest point in drama is within 10s of the
highest point, the point is ignored in order to avoid having an overlap between the detected segments of increased dramatic tension. In those cases, the next highest point is used (provided that the new point is not within 10 s).

Table 1. Schemes for feature combinations

Scheme number  Used features       Weights
1              Video               Yes
2              Audio               Yes
3              Text                Yes
4              Video, Audio        Yes
5              Video, Text         Yes
6              Audio, Text         Yes
7              Video, Audio, Text  Yes
8              Text                No
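To make the fusion and selection step of Sect. 2.5 concrete, the sketch below applies Eq. (2) and then picks three non-overlapping maxima. It is a Python illustration written for this text, not the authors' code; in particular, the element-wise application of the weight vector w is an assumption.

import numpy as np

def select_peaks(video, audio1, audio2, text, w, n_peaks=3, min_gap_s=10):
    # all inputs are per-second vectors of equal length; w is the weighting vector
    drama = w * (video + (audio1 + audio2) / 2.0 + text)   # Eq. (2), applied element-wise
    order = np.argsort(drama)[::-1]                        # candidate seconds, best first
    chosen = []
    for sec in order:
        if all(abs(int(sec) - p) >= min_gap_s for p in chosen):
            chosen.append(int(sec))
        if len(chosen) == n_peaks:
            break
    # each estimate is the interval from 5 s before to 5 s after the selected peak
    return [(p - 5, p + 5) for p in chosen]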
3 Evaluation Schemes and Results

Different combinations of the derived features were made and subsequently evaluated against the training data. The schemes tested are listed in Table 1. If no weights are used (Scheme 8), vector w contains only ones. Scoring of evaluation results is performed based on agreement with the reviewers' annotations. Each time a detected peak coincides with (at least) one reviewer's annotation, a point is added. A maximum of three points can thus be scored per clip and, since there are five clips in the training set, the maximum score for any scheme is 15. The obtained scores are shown in Table 2.

Table 2. Results on the training set. The video ID codes in the dataset start with "BG_".

Scheme number  BG_36941  BG_37007  BG_37016  BG_37036  BG_37111  Total
1              0         0         1         1         1         3
2              2         1         1         1         1         6
3              2         1         1         2         1         7
4              0         1         2         1         1         5
5              1         2         2         1         0         6
6              2         1         1         2         1         7
7              1         1         2         1         0         5
8              0         1         1         1         0         3
4 Discussion

As can be seen in Table 2, the best-performing schemes on the training samples are scheme 3 and scheme 6, which both result in 7 accurately predicted narrative peaks and hence an accuracy of 47%. These two schemes both include the text-based feature and the weights vector. Scheme 6 also contains the audio-based feature but does not achieve increased accuracy from this inclusion. Considering that there is also strong disagreement between annotators, an accuracy of 47% (compared against the joint annotations of three annotators) shows the potential of using the automated narrative peak detector. The fact that this best-performing scheme is based only on a
text-based feature corresponds well to the initial observation that there is no clear audiovisual characteristic of a narrative peak when observing the clips. Five schemes have been evaluated using the test samples, mainly corresponding to some of the different schemes that were previously used in Table 1. The results of these five methods on the test data, and their explanations, are given in Table 3. For number 5, all narrative peaks were randomly selected (for comparison with random-level detection). Evaluation of these runs was performed in two ways: Peak-based (similar to the scoring system on the training data) and Point-based, which can be explained as follows: if a detected peak coincides with the annotations of more than one reviewer, multiple points are added. Hence the maximum score for a clip can be nine when the annotators fully agree on segments, while the maximum remains three when they fully disagree.

Table 3. Results on the test set

run number (scheme nr)  Score (Peak-based)  Score (Point-based)
1 (3)                   33                  39
2 (7)                   30                  41
3 (6)                   33                  42
4 (8)                   32                  43
5 (-)                   32                  43
The difference between the two scoring systems lies in the fact that the Point-based scoring system awards more than one point to segments which were selected by more than one annotator. If annotators agree on segments with increased dramatic tension, there will be (in total over three annotators) fewer annotated segments and hence the probability that by chance our automated approach selects an annotated segment will decrease. Therefore, awarding more points to the detection of these less probable segments seems logical. Moreover, a segment on which all annotators agree must be a really relevant segment of increased tension. On the other hand, this Point-based approach gives equal points to having just one correctly detected segment in a clip (annotated by all three annotators) and to detecting all three segments correctly (each of them by one annotator). Since our runs were selected based on the results that were obtained using the Peak-based scoring system, results on the test data are mainly compared to this scoring. First of all, it should be noted that results are never far better than random level, as can be seen by comparing to run number 5. Surprisingly, the Peak-based and Point-based scores show a distinctly different ranking of the runs. Run 1 performed the worst under the Point-based scoring, yet it performed best under the Peak-based scoring system. Based on the results obtained on the clips in the training set, it was expected that runs 1 and 3 would perform best. This is clearly reflected in the results we obtain when using the same evaluation method on the test clips, the Peak-based evaluation. However, with the Point-based scoring system this effect disappears. This may indicate that the main feature that we used, the text-based feature based on the introduction of a new topic, does not properly reflect the notion of dramatic tension for all annotators, but is biased towards a single annotator.
Each video clip in the dataset was only annotated for its top three narrative peaks. The lack of a fully annotated dataset with all possible narrative peaks made it difficult to study the effect of narrative peaks on low-level content features. Had all the narrative peaks at different levels been available on a larger dataset, the correlation between the corresponding low-level content features could have been computed. The significance of these features for estimating narrative peaks could therefore have been further investigated.
5 Conclusions

The narrative peak detection subtask described in the VideoCLEF 2009 Benchmark Evaluation has proven to be a challenging and difficult one. Failing to see obvious features when viewing the clips and only seeing a mild connection between new topics and dramatic tension peaks, we resorted to the detection of the start of new topics in the text annotations of the provided video clips and the use of some basic video- and audio-based features. In our initial evaluation based on the training clips, the text-based feature proved to be the most relevant one and hence our submitted evaluation runs were centered on this feature. When using a consistent evaluation of training and test clips, the text-based feature also led to our best results on the test data. The overall detection accuracy based on the text-based feature dropped from 47% correct detection on the training data to 24% on the test data. It should be stated that results on the test data were just mildly above random level. The randomly drawn results by chance performed better than random level; the simulated random-level results are 40 for the Point-based and 30 for the Peak-based scoring schemes. The reported results based on the Point-based scoring differed strongly from the results obtained using the scoring system that was employed on the training data. It was shown that, although using the peak distribution as a data-driven method enhanced the results on the training data, the same approach cannot be generalized due to its bias toward the annotations on the training samples. In fact, the number of narrative peaks is unknown for any given video. The most precise annotation of such documentary clips can be obtained from the original script writer and the narrator himself. Not having access to these resources, more annotators should annotate the videos. These annotators should be able to choose freely any number of narrative peaks. To improve the peak detection, a larger dataset is needed to compute the significance of correlations between features and narrative peaks. Given the challenging nature of the task, it is our strong belief that the indication that text-based features (related to the introduction of new topics) perform well is a valuable contribution in the search for an improved dramatic tension detector.

Acknowledgments. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2011] under grant agreement n° 216444 (see Article II.30. of the Grant Agreement), NoE PetaMedia. The work of Soleymani and Pun is supported in part by the Swiss National Science Foundation.
References
1. Hanjalic, A.: Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Transactions on Multimedia 7(6), 1114-1122 (2005)
2. Gao, Y., Wang, W.B., Yong, J.H., Gu, H.J.: Dynamic video summarization using two-level redundancy detection. Multimedia Tools and Applications 42(2), 233-250 (2009)
3. Otsuka, I., Nakane, K., Divakaran, A., Hatanaka, K., Ogawa, M.: A highlight scene detection and video summarization system using audio feature for a Personal Video Recorder. IEEE Transactions on Consumer Electronics 51(1), 112-116 (2005)
4. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
5. Soleymani, M., Chanel, G., Kierkels, J.J.M., Pun, T.: Affective Characterization of Movie Scenes Based on Multimedia Content Analysis and User's Physiological Emotional Responses. In: IEEE International Symposium on Multimedia (2008)
6. Valstar, M.F., Gunes, H., Pantic, M.: How to Distinguish Posed from Spontaneous Smiles using Geometric Features. In: ACM Int'l Conf. Multimodal Interfaces, ICMI 2007 (2007)
7. Kierkels, J.J.M., Pun, T.: Towards detection of interest during movie scenes. In: PetaMedia Workshop on Implicit, Human-Centered Tagging (HCT 2008), Abstract only (2008)
8. May, J., Dean, M.P., Barnard, P.J.: Using film cutting techniques in interface design. Human-Computer Interaction 18(4), 325-372 (2003)
9. Alku, P., Vintturi, J., Vilkman, E.: Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation. Speech Communication 38(3-4), 321-334 (2002)
10. Wennerstrom, A.: Intonation and evaluation in oral narratives. Journal of Pragmatics 33(8), 1183-1206 (2001)
A Cocktail Approach to the VideoCLEF'09 Linking Task

Stephan Raaijmakers, Corné Versloot, and Joost de Wit

TNO Information and Communication Technology, Delft, The Netherlands
{stephan.raaijmakers,corne.versloot,joost.dewit}@tno.nl
Abstract. In this paper, we describe the TNO approach to the Finding Related Resources or linking task of VideoCLEF'09; additional information about the task can be found in Larson et al. [7]. Our system consists of a weighted combination of off-the-shelf and proprietary modules, including the Wikipedia Miner toolkit of the University of Waikato. Using this cocktail of largely off-the-shelf technology allows for setting a baseline for future approaches to this task. This work is supported by the European IST Programme Project FP6-0033812; this paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.
1 Introduction
The Finding Related Resources or linking task of VideoCLEF'09 consists of relating Dutch automatically transcribed TV speech to English Wikipedia content. For a total of 45 video episodes, a total of 165 anchors (speech transcripts) have to be linked to related Wikipedia articles. Technology emerging from this task will contribute to a better understanding of Dutch video for non-native speakers. The TNO approach to this problem consists of a cocktail of off-the-shelf techniques. Central to our approach is the use of the Wikipedia Miner toolkit developed by researchers at the University of Waikato (http://wikipedia-miner.sourceforge.net; see Milne and Witten [9]). The so-called Wikifier functionality of the toolkit detects Wikipedia topics from raw text, and generates cross-links from input text to a relevance-ranked list of Wikipedia pages. Here, 'topic' means a Wikipedia topic label, i.e. an element from the Wikipedia ontology, e.g. 'Monarchy of Spain' or 'rebellion'. We investigated two possible options for bridging the gap between Dutch input text and English Wikipedia pages: translating queries to English prior to the detection of Wikipedia topics, and translating Wikipedia topics detected in Dutch texts to English Wikipedia topics. In the latter case, the use of Wikipedia allows for an abstraction of raw queries to Wikipedia topics, for which the translation process in theory is less complicated and error-prone. In addition, we deploy a specially developed part-of-speech tagger for uncapitalized speech transcripts that is used to reconstruct proper names.
2 Related Work
The problem of cross-lingual link detection is an established topic on the agenda of cross-lingual retrieval, e.g. in the Topic Detection and Tracking community (e.g. Chen and Ku [2]). Recently, Van Gael and Zhu [4] proposed a graph-based clustering method (correlation clustering) for cross-linking news articles in multiple languages to the same event. In Smet and Moens [3], a method is proposed for cross-linking resources to the same (news) events in Dutch and English using probabilistic (latent Dirichlet) topic models, omitting the need for translation services or dictionaries. The current problem, linking Dutch text to English Wikipedia pages, is related to this type of cross-lingual, event-based linking in the sense that Dutch ’text’ (speech transcripts) is to be linked to English text (Wikipedia pages) tagged for a certain ’event’ (the topic of the Wikipedia page). There are also strong connections with the relatively recent topic of learning to rank (e.g. Liu [8]), as the result of cross-linking is a list of ranked Wikipedia pages.
3 System Setup
In this section, we describe the setup of our system. We start with the description of the essential ingredients of our system, followed by the definition of a number of linking strategies based on these ingredients. The linking strategies are combined into scenarios for our runs (Sect. 4). Fig. 1 illustrates our setup. For the translation of Dutch text to English (following Adafre and de Rijke [1]), we used the Yahoo! Babel Fish translation service (http://babelfish.yahoo.com/). An example of the output of this service is the following:

- Dutch input: als in 1566 de beeldenstorm heeft plaatsgevonden, één van de grootste opstanden tegen de inquisitie, keert willem zich definitief tegen de koning van spanje
- English translation: if in 1566 the picture storm has taken place, one of the largest insurrections against the inquisitie, turn himself willem definitively against the king of Spain

Since people, organizations and locations often have entries in Wikipedia, accurate proper name detection seems important for this task. Erroneous translation to English of Dutch names (e.g. 'Frans Hals' becoming 'French Neck') should be avoided. Proper name detection prior to translation allows for exempting the detected names from translation. A complicating factor is formed by the fact that the transcribed speech in the various broadcastings is in lowercase, which makes the recognition of proper names challenging, since important capitalization features can no longer be used. To address this problem, we trained a maximum entropy part-of-speech tagger: an instance of the Stanford tagger (http://nlp.stanford.edu/software/tagger.shtml)
(see Toutanova and Manning [10]). The tagger was trained on a 700K part-of-speech tagged corpus of Dutch, after having decapitalized the training data. The feature space consists of a 5-cell bidirectional window addressing part-of-speech ambiguities and prefix and suffix features up to a size of 3. The imperfect English translation by Babel Fish was observed to be the main reason for erroneous Wikifier results. In order to omit the translation step, we ported the English Wikifier of the Wikipedia Miner toolkit to Dutch, for which we used the Dutch Wikipedia dump and Perl scripts provided by developers of the Wikipedia Miner toolkit. The resulting Dutch Wikifier ('NL Wikifier' in Fig. 1) has exactly the same functionality as the English version, but unfortunately contains far fewer pages than the English version (a factor of 6 fewer). Even so, the translation process is now narrowed down to translating detected Wikipedia topics (the output of the Dutch Wikifier) to English Wikipedia topics. For the latter, we implemented a simple database facility (to which we shall refer as 'the English Topic Finder') that uses the cross-lingual links between topics in the Wikipedia database for carrying out the translation of Dutch topics to English topics. An example of the output of the English and Dutch Wikifiers for the query presented above is the following:

- Output English Wikifier: 1566, Charles I of England, Image, Monarchy, Monarchy of Spain, Rebellion, Spain, The Picture
- Output Dutch Wikifier: 1566, Beeldenstorm, Inquisitie, Koning (*titel), Lijst van koningen van Spanje, Spanje, Willem I van Holland

The different rankings of the various detected topics are represented as a tag cloud with different font sizes, and can be extracted as numerical scores from the output. In order to be able to entirely bypass the Wikipedia Miner toolkit, we deployed the Lucene search engine (Hatcher and Gospodnetic [5]) for performing the matching of raw, translated text with Wikipedia pages. Lucene was used to index the Dutch Wikipedia with the standard Lucene indexing options. Dutch speech transcripts were simply provided to Lucene as a disjunctive (OR) query, with Lucene returning the best matching Dutch Wikipedia pages for the query. The set of techniques just described leads to a total of four basic linking strategies. Of the various combinatorial possibilities of these strategies, we selected five promising combinations, each of which corresponds to a submitted run. The basic linking strategies are the following. Strategy 1: proper names only (the top row in Fig. 1). Following proper name recognition, a quasi-document is created that only consists of all recognized proper names. The Dutch Wikifier is used to produce a ranked list of Dutch Wikipedia pages for this quasi-document. Subsequently, the topics of these pages are linked to English Wikipedia pages with the English Topic Finder. Strategy 2: proper names preservation (second row in Fig. 1). Dutch text is translated to English with Babel Fish. Any proper names found in the part-of-speech tagged
Fig. 1. TNO system setup
Dutch text are added to the translated text as untranslated text, after which the English Wikifier is applied, producing a ranked list of matching Wikipedia pages. Strategy 3: topic-to-topic linking (3rd row from the top in Fig. 1). The original Dutch text is wikified using the Dutch Wikifier, producing a ranked list of Wikipedia pages. The topics of these pages are subsequently linked to English Wikipedia pages with the English Topic finder. Strategy 4: text-to-page linking (bottom row in Fig. 1). After Lucene has matched queries with Dutch Wikipedia pages, the English Topic Finder tries to find the corresponding English Wikipedia pages for the Dutch topics in the pages returned by Lucene. This strategy omits the use of the Wikifier and was used as a fall-back option, in case none of the other modules delivered a result. A thresholded merging algorithm removes any results below a hand-estimated threshold and blends the remaining results into a single ordered list of Wikipedia topics, using again hand-estimated weights for the various sources of these results. Several different merging schemata were used for different runs; these are discussed in Sect. 4.
4 Run Scenarios
In this section, we describe the configurations of the 5 runs we submitted. We were specifically interested in the effect of proper name recognition, the relative contributions of the Dutch and English Wikifiers, and the effect of full-text Babel Fish translation as compared to a topic-to-topic translation approach. Run 1: All four linking strategies were used to produce the first run. A weighted merger (’Merger’ in Fig. 1) was used to merge the results from the different strategies. The merger works as follows: 1. English Wikipedia pages referring to proper names are uniformly ranked before all other results.
2. The rankings produced by the second linking strategy (rankEN) and third linking strategy (rankDU) for any returned Wikipedia page p are combined according to the following scheme:

rank(p) = ((rankEN(p) × 0.2) + (rankDU(p) × 0.8)) × 1.4    (1)
The Dutch score was found to be more relevant than the English one (hence the 0.8 vs. 0.2 weights). The sum of the Dutch and English score is boosted with an additional factor of 1.4, awarding the fact that both linking strategies come up with the same result. 3. Pages found by Linking Strat. 2 but not by Linking Strat. 3 are added to the result and their ranking score is boosted with a factor of 1.1. 4. Pages found by Linking Strat. 3 but not by Linking Strat. 2 are added to the result (but their ranking score is not boosted). 5. If Linking Strats. 1 to 3 did not produce results, the results of Linking Strat. 4 are added to the result. Run 2: Run 2 is the same as Run 1 with the exception that Linking Strat. 1 is left out. Run 3: Run 3 is similar to Run 1, but does not boost results at the merging stage, and averages the rankings of the second and third linking strategy. This means that the weights used by the merger in Run 1 (0.8, 0.2 and 1.4) are respectively 0.5, 0.5 and 1.0 for this run. Run 4: Run 4 only uses Linking Strats. 1 and 3. This means that no translation from Dutch to English is performed. In the result set, the Wikipedia pages returned by Linking Strat. 1 are ordered before the results from Linking Strat. 3. Run 5: Run 5 uses all linking strategies except Linking Strat. 1 (it omits proper name detection). In this run a different merging strategy is used: 1. If Linking Strat. 2 produces any results, add those to the final result set and then stop. 2. If Linking Strat. 2 produces no results, but Linking Strat. 3 does, add those to the final result and stop. 3. If none of the preceding linking strategies produces any results, add the results from Linking Strat. 4 to the final result set.
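A compact sketch of the Run 1 merger described above is given below (Python, written for this text). The thresholding step and the exact score scales are omitted, and the data structures are assumptions; only the weights of Eq. (1), the 1.1 boost and the ordering rules follow the description.

def merge_run1(proper_name_pages, rank_en, rank_du, fallback_pages):
    scores = {}
    for p in set(rank_en) | set(rank_du):
        if p in rank_en and p in rank_du:
            scores[p] = (rank_en[p] * 0.2 + rank_du[p] * 0.8) * 1.4   # Eq. (1)
        elif p in rank_en:
            scores[p] = rank_en[p] * 1.1    # found by strategy 2 only: boosted by 1.1
        else:
            scores[p] = rank_du[p]          # found by strategy 3 only: not boosted
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not proper_name_pages and not ranked:
        ranked = list(fallback_pages)       # strategy 4 results as a fall-back
    # proper-name pages (strategy 1) are uniformly ranked before all other results
    return list(proper_name_pages) + [p for p in ranked if p not in proper_name_pages]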
5 Results and Discussion
For VideoCLEF'09, two groups submitted runs for the linking task: Dublin City University (DCU) and TNO. Two evaluation methods were applied by the task organizers to the submitted results. A team of assessors first achieved consensus on a primary link (the most important or descriptive Wikipedia article), with a minimum consensus among 3 people.
Table 1. Left table: recall and MRR for the primary link evaluation (average DCU scores were 0.21 and 0.14, respectively). Right table: MRR for the secondary link evaluation (average DCU score was 0.21).

Primary link evaluation:
Run          Recall  MRR
1            0.345   0.23
2            0.333   0.215
3            0.352   0.251
4            0.267   0.182
5            0.285   0.197
Average TNO  0.32    0.215

Secondary link evaluation:
Run          MRR
1            0.46
2            0.428
3            0.484
4            0.392
5            0.368
Average TNO  0.43
All queries in each submitted run were scored for Mean Reciprocal Rank (MRR) for this primary link, as well as for recall. (For a response r = r_1, ..., r_|Q| to a ranking task, MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i, with rank_i the rank of answer r_i with respect to the correct answer.) Subsequently, the annotators agreed on a set of related resources that necessarily included the primary link, in addition to secondary relevant links (minimum consensus of one person). Since this list of secondary links is non-exhaustive, for this measure only Mean Reciprocal Rank is reported, and not recall. As it turns out, the unweighted combination of results (Run 3) outperforms all other runs, followed by the thresholded, weighted combination (Run 1). This indicates that the weights in the merging step are suboptimal; for subsequent runs, these weights can now be estimated from the ground truth data that has become available from the initial run of this task. Merging unweighted results is generally better than applying an if-then-else schema: Run 2 clearly outperforms Run 5. Omitting proper name recognition results in a noticeable drop of performance under both evaluation methods, underlining the importance of proper names for this task. This is in line with the findings of e.g. Chen and Ku [2]. For the primary links, leaving out the 'proper names only' strategy leads to a drop of MRR from 0.23 (Run 1) to 0.215 (Run 2). Leaving out text translation and 'proper name preservation' triggers a drop of MRR from 0.23 (Run 1) to 0.182 (Run 4). While various additional correlations between performance and experimental options are open to exploration here, these findings underline the importance of proper names for this task. In addition to the recall and MRR scores, the assessment team distributed the graded relevance scores (Kekäläinen and Järvelin [6]) assigned to all queries. In Figs. 2 and 3, we plotted the difference per query of the averaged relevance score to the total average obtained relevance scores for both DCU and TNO runs. For every video, we averaged the relevance scores of the hits reported by DCU and TNO. Subsequently, for every TNO run, we averaged relevance scores for every query, and measured the difference with the averaged DCU and TNO runs. For TNO, Runs 1 and 3 produce the best results, with only a small number of queries below the mean.
Fig. 2. Difference plots of the various TNO runs (tno_run1 to tno_run5) compared to the averaged relevance scores of DCU and TNO (x-axis: ordered queries; y-axis: difference with mean relevance score, TNO+DCU)
Fig. 3. Difference plots of the various DCU runs (dcu_run1 to dcu_run4) compared to the averaged relevance scores of DCU and TNO (x-axis: ordered queries; y-axis: difference with mean relevance score, TNO+DCU)
Most of the relevance results obtained from these runs are around the mean, showing that, from the perspective of relevance quality, our best runs produce average results. DCU, on the other hand, appears to produce a higher proportion of relatively high-quality relevance results.
6 Conclusions
In this contribution, we have taken a technological and off-the-shelf-oriented approach to the problem of linking Dutch transcripts to English Wikipedia pages.
Using a blend of commonly available software resources (Babel Fish, the Waikato Wikipedia Miner Toolkit, Lucene, and the Stanford maximum entropy partof-speech tagger), we demonstrated that an unweighted combination produces competitive results. We hope to have demonstrated that this low-entry approach can be used as a baseline level that can inspire future approaches to this problem. A more accurate estimation of weights for the contribution of several sources of information can be carried out in future benchmarks, now that the VideoClef annotators have produced ground truth ranking data.
References
1. Adafre, S.F., de Rijke, M.: Finding Similar Sentences across Multiple Languages in Wikipedia. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 62-69 (2006)
2. Chen, H.-H., Ku, L.-W.: An NLP & IR approach to topic detection. Kluwer Academic Publishers, Norwell (2002)
3. De Smet, W., Moens, M.-F.: Cross-language linking of news stories on the web using interlingual topic modelling. In: SWSM 2009: Proceedings of the 2nd ACM Workshop on Social Web Search and Mining, pp. 57-64. ACM, New York (2009)
4. van Gael, J., Zhu, X.: Correlation clustering for crosslingual link detection. In: Veloso, M.M. (ed.) IJCAI, pp. 1744-1749 (2007)
5. Hatcher, E., Gospodnetic, O.: Lucene in Action. In Action series. Manning Publications Co., Greenwich (2004)
6. Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. J. Am. Soc. Inf. Sci. Technol. 53(13), 1120-1129 (2002)
7. Larson, M., Newman, E., Jones, G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
8. Liu, T.-Y.: Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3(3), 225-331 (2009)
9. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Mining (CIKM 2008), pp. 509-518. ACM Press, New York (2008)
10. Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70 (2000)
When to Cross Over? Cross-Language Linking Using Wikipedia for VideoCLEF 2009

Ágnes Gyarmati and Gareth J.F. Jones

Centre for Digital Video Processing, Dublin City University, Dublin 9, Ireland
{agyarmati,gjones}@computing.dcu.ie
Abstract. We describe Dublin City University (DCU)’s participation in the VideoCLEF 2009 Linking Task. Two approaches were implemented using the Lemur information retrieval toolkit. Both approaches first extracted a search query from the transcriptions of the Dutch TV broadcasts. One method first performed search on a Dutch Wikipedia archive, then followed links to corresponding pages in the English Wikipedia. The other method first translated the extracted query using machine translation and then searched the English Wikipedia collection directly. We found that using the original Dutch transcription query for searching the Dutch Wikipedia yielded better results.
1 Introduction
The VideoCLEF Linking Task involved locating content related to sections of an automated speech recognition (ASR) transcription cross-lingually. Elements of a Dutch ASR transcription were to be linked to related pages in an English Wikipedia collection [1]. We submitted four runs by implementing two different approaches to solve the task. Because of the difference between the source language (Dutch) and the target language (English), a switch between the languages is required at some point in the system. Our two approaches differed in the switching method. One approach performed the search in a Dutch Wikipedia archive with the exact words (either stemmed or not) and then returned the corresponding links pointing to the English Wikipedia pages. The other one first performed an automatic machine translation of the Dutch query into English; the translated query was then used to search the English Wikipedia archive directly.
2 System Description
For our experiments we used the Wikipedia dump dated May 30th 2009 for the English archive, and the dump dated May 31st 2009 for the Dutch Wikipedia collection. In a simple preprocessing phase, we eliminated some information irrelevant to the task, e.g. information about users, comments, links to other
languages we did not need. For indexing and retrieving, we used the Indri model of the open source Lemur Toolkit [2]. English texts were stemmed using Lemur's built-in stemmer, while Dutch texts were stemmed using Oleander's implementation [5] of Snowball's Dutch stemmer algorithm [6]. We used stopword lists provided by Snowball for both languages. Queries were formed based on sequences of words extracted from the ASR transcripts using the word timing information in the transcript file. For each of the anchors defined by the task, the transcript was searched from the anchor starting point until the given end point, and the word sequence between these boundaries was extracted as the query. These sequences were used directly as queries for retrieval from the Dutch collection. The Dutch Wikipedia's links pointing to the corresponding articles of the English version were returned as the solution for each anchor point in the transcript. For the other approach, queries were translated automatically from Dutch to English using the query translation component developed for the MultiMatch project [3]. This translation tool uses the WorldLingo machine translation engine augmented with a bilingual dictionary from the cultural heritage domain automatically extracted from the multilingual Wikipedia. The translated query was used to search the English Wikipedia archive.
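The anchor-based query extraction just described can be sketched as follows; the transcript format (per-word start times) and the variable names are assumptions made for this illustration, and only the boundary-based selection follows the description above.

def extract_query(transcript_words, anchor_start, anchor_end):
    # transcript_words: list of (start_time_in_seconds, word) pairs from the ASR output
    return " ".join(w for t, w in transcript_words if anchor_start <= t <= anchor_end)

# Example with made-up timings for one anchor:
words = [(12.0, "als"), (12.4, "in"), (12.7, "1566"), (13.2, "de"), (13.5, "beeldenstorm")]
print(extract_query(words, 12.0, 13.6))   # -> als in 1566 de beeldenstorm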
3 Run Configurations
Here we describe the four runs we submitted to the Linking Task, plus an additional one performed subsequently.

1. Dutch. The Dutch Wikipedia was indexed without stemming or stopping. Retrieval was performed on the Dutch collection, returning the relevant links from the English collection.
2. Dutch stemmed. Identical to Run 1, except that the Dutch Wikipedia text is stemmed and stopped as described in Sect. 2.
3. English. This run represents the second approach, with stop word removal and stemming applied to the English documents and queries. The translated query was applied to the indexed English Wikipedia.
4. Dutch with blind relevance feedback. This run is almost identical to Run 1, with a difference in parameter settings for Lemur to perform blind relevance feedback. Lemur/Indri uses a relevance model for query expansion; for details see [4]. The first 10 retrieved documents were assumed relevant and queries were expanded by 5 terms (a schematic sketch of this expansion step is given after this list).
5. English (referred to as Run 3′). This is an amended version of Run 3, with the difference of an improved preprocessing phase applied to the English Wikipedia, disregarding irrelevant pages as described in Sect. 4.
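The blind relevance feedback configuration of Run 4 (top 10 documents assumed relevant, 5 expansion terms) can be illustrated schematically as below. This is a generic term-frequency sketch and not the Indri relevance model actually used by Lemur; all names are ours.

from collections import Counter

def expand_query(query_terms, ranked_docs, fb_docs=10, fb_terms=5):
    # ranked_docs: retrieval results, each document given as a list of terms
    counts = Counter()
    seen = set(query_terms)
    for doc in ranked_docs[:fb_docs]:          # assume the top fb_docs documents are relevant
        counts.update(t for t in doc if t not in seen)
    expansion = [t for t, _ in counts.most_common(fb_terms)]
    return list(query_terms) + expansion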
4 Results
The Linking Task was assessed by the organisers as a known-item task. The top-ranked relevant link for each anchor is referred to as a primary link, and all other relevant links identified by the assessors as secondary links [1].
Table 1. Scores for Related Links

Run     Recall (prim)  MRR (prim)  MRR (sec)
Run 1   0.267          0.182       0.268
Run 2   0.267          0.182       0.275
Run 3   0.079          0.056       0.090
Run 4   0.230          0.144       0.190
Run 3′  0.230          0.171       -
Table 1 shows Recall and Mean Reciprocal Rank (MRR) for primary links, and MRR values for secondary links. Recall cannot be calculated for secondary links due to the lack of an exhaustive identification of secondary links. Table 1 also includes Run 3′, evaluated automatically using the same set of primary links as in the official evaluation. Secondary links have been omitted as we could not provide the required additional manual case-by-case evaluation by the assessors. Runs 1 and 2 achieved the highest scores. Although they do yield slightly different output, the decision of whether to stem and stop the text does not alter the results statistically for primary links, while stemming and stopping (Run 2) did improve results a little in finding secondary links. Run 4, which used blind relevance feedback to expand the queries, was not effective here. Setting the optimal parameters for this process would require further experimentation, and either this or alternative expansion methods may produce better results. The main problem of retrieving from the Dutch collection lies in the differences between the English and the Dutch versions of Wikipedia. Although the English site contains a significantly larger number of articles, there are articles that have no equivalent pages cross-lingually, due to different structuring or cultural differences. Systems 1, 2 and 4 might (and in fact did) come up with relevant links at some points which were lost when looking for a direct link to an English page. Thus a weak point of our approach is that some hits from the Dutch Wikipedia might get lost in the English output due to the lack of an equivalent English article. In the extreme case, our system might return no output at all if none of the hits for a given anchor are linked to any page in the English Wikipedia. Run 3 performed significantly worse. This might be due to two aspects of the switch to the English collection. First, the query text was translated automatically from Dutch to English, which in itself carries a risk of translation errors due to misinterpretation of the query or weaknesses in the translation dictionaries. While the MultiMatch translation tool has a vocabulary expanded to include many concepts from the domain of cultural heritage, there are many specialist concepts in the ASR transcription which are not included in its translation vocabulary. Approximately 3.5% of Dutch words were left untranslated (in addition to names). Some of these turned out to be important expressions, e.g. rariteitenkabinet 'cabinet of curiosities', which were in fact successfully retrieved by the systems for Runs 1 and 2 (although ranked lower than desired). The other main problem we encountered in Run 3 lay in the English Wikipedia and our limited experience concerning its structure. The downloadable dump
includes a large number of pages that look like useful articles, but are in fact not. These articles include old articles set for deletion and meta-articles containing discussion of an existing, previous or future article. We were not aware of these articles during the initial development phase, but this had a significant impact on our results: about 18.5% of the links returned in Run 3 proved to be invalid articles. Run 3′ reflects results where the English Wikipedia archive has been cleaned up to remove these irrelevant pages prior to indexing. As shown in Table 1, this cleanup produces a significant improvement in performance. A similar cleanup applied to the Dutch collection would produce a new ranking of Dutch documents. However, very few of the Dutch pages which would be deleted in cleanup are actually retrieved or have a link to English pages, and thus any changes in the Dutch archive will have no noticeable effect on evaluation of the overall system output.
5 Conclusions
In this paper we have outlined the two approaches used in our submissions to the Linking Task at VideoCLEF 2009. We found using the source language for retrieval to be more effective than switching to the target language in an early phase. This result may be different if translation of the query for the second method were to be improved. Both methods could be expected to benefit from the ongoing development of Wikipedia collections.
Acknowledgements This work is funded by a grant under the Science Foundation Ireland Research Frontiers Programme 2008. We are grateful to Eamonn Newman for assistance with the MultiMatch translation tool.
References
1. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
2. The Lemur Toolkit, http://www.lemurproject.org/
3. Jones, G.J.F., Fantino, F., Newman, E., Zhang, Y.: Domain-Specific Query Translation for Multilingual Information Access Using Machine Translation Augmented With Dictionaries Mined From Wikipedia. In: Proceedings of the 2nd International Workshop on Cross Lingual Information Access - Addressing the Information Need of Multilingual Societies (CLIA-2008), Hyderabad, India, pp. 34-41 (2008)
4. Don, M.: Indri Retrieval Model Overview, http://ciir.cs.umass.edu/~metzler/indriretmodel.html
5. Oleander Stemming Library, http://sourceforge.net/projects/porterstemmers/
6. Snowball, http://snowball.tartarus.org/
Author Index
Adda, Gilles I-289 Agirre, Eneko I-36, I-166, I-273 Agosti, Maristella I-508 Ah-Pine, Julien II-124 Al Batal, Rami II-324 AleAhmad, Abolfazl I-110 Alegria, I˜ naki I-174 Alink, W. I-468 Alpkocak, Adil II-219 Anderka, Maik I-50 Ansa, Olatz I-273 Arafa, Waleed II-189 Araujo, Lourdes I-245, I-253 Arregi, Xabier I-273 Avni, Uri II-239 Azzopardi, Leif I-480 Bakke, Brian II-72, II-223 Barat, C´ecile II-164 Barbu Mititelu, Verginica I-257 Basile, Pierpaolo I-150 Batista, David I-305 Becks, Daniela I-491 Bedrick, Steven II-72, II-223 Benavent, Xaro II-142 Bencz´ ur, Andr´ as A. II-340 Benzineb, Karim I-502 Berber, Tolga II-219 Bergler, Sabine II-150 Bernard, Guillaume I-289 Bernhard, Delphine I-120, I-598 Besan¸con, Romaric I-342 Bilinski, Eric I-289 Binder, Alexander II-269 Blackwood, Graeme W. I-578 Borbinha, Jos´e I-90 Borges, Thyago Bohrer I-135 Boro¸s, Emanuela II-277 Bosca, Alessio I-544 Buscaldi, Davide I-128, I-197, I-223, I-438 Byrne, William I-578 Cabral, Lu´ıs Miguel I-212 Calabretto, Sylvie II-203
Can, Burcu I-641 Caputo, Annalina I-150 Caputo, Barbara II-85, II-110 Cardoso, Nuno I-305, I-318 Ceau¸su, Alexandru I-257 Cetin, Mujdat II-247 Chan, Erwin I-658 Chaudiron, St´ephane I-342 Chevallet, Jean-Pierre II-324 Chin, Pok II-37 Choukri, Khalid I-342 Clinchant, Stephane II-124 Clough, Paul II-13, II-45 Comas, Pere R. I-197, I-297 Cornacchia, Roberto I-468 Correa, Santiago I-223, I-438 Cristea, Dan I-229 Croitoru, Cosmina II-283 Csurka, Gabriela II-124 Damankesh, Asma I-366 Dar´ oczy, B´ alint II-340 Dehdari, Jon I-98 Denos, Nathalie I-354 de Pablo-S´ anchez, C´esar I-281 Deserno, Thomas M. II-85 de Ves, Esther II-142 de Vries, Arjen P. I-468 de Wit, Joost II-401 D’hondt, Eva I-497 Diacona¸su, Mihail-Ciprian II-369 D´ıaz-Galiano, Manuel Carlos I-381, II-185, II-348 Dimitrovski, Ivica II-231 Dini, Luca I-544 Di Nunzio, Giorgio Maria I-36, I-508 Dobril˘ a, Tudor-Alexandru II-369 Dolamic, Ljiljana I-102 Doran, Christine I-508 Dornescu, Iustin I-326 Dr˘ agu¸sanu, Cristian-Alexandru I-362 Ducottet, Christophe II-164 Dumont, Emilie II-299 Dunker, Peter II-94 Dˇzeroski, Saˇso II-231
414
Author Index
Eggel, Ivan  II-72, II-211, II-332
Eibl, Maximilian  I-570, II-377
El Demerdash, Osama  II-150
Ercil, Aytul  II-247
Fakeri-Tabrizi, Ali  II-291
Falquet, Gilles  I-502
Fautsch, Claire  I-476
Fekete, Zsolt  II-340
Feng, Yue  II-295
Fernández, Javi  I-158
Ferrés, Daniel  I-322
Ferro, Nicola  I-13, I-552
Flach, Peter  I-625, I-633
Fluhr, Christian  I-374
Forăscu, Corina  I-174
Forner, Pamela  I-174
Galibert, Olivier  I-197, I-289
Gallinari, Patrick  II-291
Gao, Yan  II-255
García-Cumbreras, Miguel A.  II-348
García-Serrano, Ana  II-142
Garrido, Guillermo  I-245, I-253
Garrote Salazar, Marta  I-281
Gaussier, Eric  I-354
Géry, Mathias  II-164
Gevers, Theo  II-261
Ghorab, M. Rami  I-518
Giampiccolo, Danilo  I-174
Glöckner, Ingo  I-265
Glotin, Hervé  II-299
Gobeill, Julien  I-444
Goh, Hanlin  II-287
Goldberger, Jacob  II-239
Golénia, Bruno  I-625, I-633
Gómez, José M.  I-158
Goñi, José Miguel  II-142
Gonzalo, Julio  II-13, II-21
Goyal, Anuj  II-133
Graf, Erik  I-480
Granados, Ruben  II-142
Granitzer, Michael  I-142
Greenspan, Hayit  II-239
Grigoriu, Alecsandru  I-362
Güld, Mark Oliver  II-85
Gurevych, Iryna  I-120, I-452
Guyot, Jacques  I-502
Gyarmati, Ágnes  II-409
Habibian, AmirHossein  I-110
Halvey, Martin  II-133, II-295
Hansen, Preben  I-460
Harman, Donna  I-552
Harrathi, Farah  II-203
Hartrumpf, Sven  I-310
Herbert, Benjamin  I-452
Hersh, William  II-72, II-223
Hollingshead, Kristy  I-649
Hu, Qinmin  II-195
Huang, Xiangji  II-195
Husarciuc, Maria  I-229
Ibrahim, Ragia  II-189
Iftene, Adrian  I-229, I-362, I-426, I-534, II-277, II-283, II-369
Inkpen, Diana  II-157
Ionescu, Ovidiu  I-426
Ion, Radu  I-257
Irimia, Elena  I-257
Izquierdo, Rubén  I-158
Jadidinejad, Amir Hossein  I-70, I-98
Järvelin, Anni  I-460
Järvelin, Antti  I-460
Jochems, Bart  II-385
Jones, Gareth J.F.  I-58, I-410, I-518, II-172, II-354, II-409
Jose, Joemon M.  II-133, II-295
Juffinger, Andreas  I-142
Kahn Jr., Charles E.  II-72
Kalpathy-Cramer, Jayashree  II-72, II-223
Karlgren, Jussi  II-13
Kawanabe, Motoaki  II-269
Kern, Roman  I-142
Kierkels, Joep J.M.  II-393
Kludas, Jana  II-60
Kocev, Dragi  II-231
Koelle, Ralph  I-538
Kohonen, Oskar  I-609
Kölle, Ralph  I-491
Kosseim, Leila  II-150
Kurimo, Mikko  I-578
Kürsten, Jens  I-570, II-377
Lagus, Krista  I-609
Laïb, Meriama  I-342
Lamm, Katrin  I-538
Langlais, Philippe  I-617
Largeron, Christine  II-164
Larson, Martha  II-354, II-385
Larson, Ray R.  I-86, I-334, I-566
Lavallée, Jean-François  I-617
Le Borgne, Hervé  II-177
Leelanupab, Teerapong  II-133
Lemaître, Cédric  II-164
Lestari Paramita, Monica  II-45
Leveling, Johannes  I-58, I-310, I-410, I-518, II-172
Li, Yiqun  II-255
Lignos, Constantine  I-658
Lin, Hongfei  II-195
Lipka, Nedim  I-50
Lopez de Lacalle, Maddalen  I-273
Llopis, Fernando  II-120
Llorente, Ainhoa  II-307
Lloret, Elena  II-29
Lopez, Patrice  I-430
López-Ostenero, Fernando  II-21
Lopez-Pellicer, Francisco J.  I-305
Losada, David E.  I-418
Loskovska, Suzana  II-231
Lungu, Irina-Diana  II-369
Machado, Jorge  I-90
Magdy, Walid  I-410
Mahmoudi, Fariborz  I-70, I-98
Maisonnasse, Loïc  II-203, II-324
Manandhar, Suresh  I-641
Mandl, Thomas  I-36, I-491, I-508, I-538
Mani, Inderjeet  I-508
Marcus, Mitchell P.  I-658
Martínez, Paloma  I-281
Martins, Bruno  I-90
Martín-Valdivia, María Teresa  II-185, II-348, II-373
Min, Jinming  II-172
Moëllic, Pierre-Alain  II-177
Monson, Christian  I-649, I-666
Montejo-Ráez, Arturo  I-381, II-348, II-373
Moreau, Nicolas  I-174, I-197
Moreira, Viviane P.  I-135
Moreno Schneider, Julián  I-281
Moriceau, Véronique  I-237
Moruz, Alex  I-229
Mostefa, Djamel  I-197, I-342
Motta, Enrico  II-307
Moulin, Christophe  II-164
Mulhem, Philippe  II-324
Müller, Henning  II-72, II-211, II-332
Muñoz, Rafael  II-120
Myoupo, Débora  II-177
Navarro-Colorado, Borja  II-29
Navarro, Sergio  II-120
Nemeskey, Dávid  II-340
Newman, Eamonn  II-354
Ngiam, Jiquan  II-287
Nowak, Stefanie  II-94
Oakes, Michael  I-526
Oancea, George-Răzvan  I-426
Ordelman, Roeland  II-385
Oroumchian, Farhad  I-366
Osenova, Petya  I-174
Otegi, Arantxa  I-36, I-166, I-273
Ozogur-Akyuz, Sureyya  II-247
Paris, Sébastien  II-299
Pasche, Emilie  I-444
Peinado, Víctor  II-13, II-21
Pelzer, Björn  I-265
Peñas, Anselmo  I-174, I-245, I-253
Perea-Ortega, José Manuel  I-381, II-185, II-373
Pérez-Iglesias, Joaquín  I-245, I-253
Peters, Carol  I-1, I-13, II-1
Petrás, István  II-340
Pham, Trong-Ton  II-324
Piroi, Florina  I-385
Pistol, Ionuț  I-229
Popescu, Adrian  II-177
Pronobis, Andrzej  II-110, II-315
Puchol-Blasco, Marcel  II-29
Pun, Thierry  II-393
Punitha, P.  II-133
Qamar, Ali Mustafa  I-354
Quénot, Georges  II-324
Raaijmakers, Stephan  II-401
Radhouani, Saïd  II-72, II-223
Roark, Brian  I-649, I-666
Roda, Giovanna  I-385
Rodrigo, Álvaro  I-174, I-245, I-253
Rodríguez, Horacio  I-322
Romary, Laurent  I-430
Ronald, John Anton Chrisostom  I-374
Roșca, George  II-277
Rosset, Sophie  I-197, I-289
Rossi, Aurélie  I-374
Rosso, Paolo  I-128, I-197, I-223, I-438
Roussey, Catherine  II-203
Ruch, Patrick  I-444
Rüger, Stefan  II-307
Ruiz, Miguel E.  II-37
Sanderson, Mark  II-45
Santos, Diana  I-212
Saralegi, Xabier  I-273
Savoy, Jacques  I-102, I-476
Schulz, Julia Maria  I-508
Semeraro, Giovanni  I-150
Shaalan, Khaled  I-366
Shakery, Azadeh  I-110
Siklósi, Dávid  II-340
Silva, Mário J.  I-305
Smeulders, Arnold W.M.  II-261
Smits, Ewine  II-385
Soldea, Octavian  II-247
Soleymani, Mohammad  II-393
Spiegler, Sebastian  I-625, I-633
Ștefănescu, Dan  I-257
Stein, Benno  I-50
Sutcliffe, Richard  I-174
Szarvas, György  I-452
Tait, John  I-385
Tannier, Xavier  I-237
Tchoukalov, Tzvetan  I-666
Teodoro, Douglas  I-444
Terol, Rafael M.  II-29
Timimi, Ismaïl  I-342
Tollari, Sabrina  II-291
Tomlinson, Stephen  I-78
Tommasi, Tatiana  II-85
Toucedo, José Carlos  I-418
Trandabăț, Diana  I-229
Tsikrika, Theodora  II-60
Tufiș, Dan  I-257
Turmo, Jordi  I-197, I-297
Turunen, Ville T.  I-578
Unay, Devrim  II-247
Ureña-López, L. Alfonso  I-381, II-185, II-348, II-373
Usunier, Nicolas  II-291
Vamanu, Loredana  II-283
van de Sande, Koen E.A.  II-261
van Rijsbergen, Keith  I-480
Vázquez, Sonia  II-29
Verberne, Suzan  I-497
Versloot, Corné  II-401
Vicente-Díez, María Teresa  I-281
Virpioja, Sami  I-578, I-609
Wade, Vincent  I-58, I-62, I-518
Weiner, Zsuzsa  II-340
Welter, Petra  II-85
Wilkins, Peter  II-172
Wolf, Elisabeth  I-120
Womser-Hacker, Christa  I-491
Xing, Li  II-110, II-315
Xu, Yan  I-526
Yang, Charles  I-658
Yeh, Alexander  I-508
Ye, Zheng  II-195
Zaragoza, Hugo  I-166, I-273
Zenz, Veronika  I-385
Zhao, Zhong-Qiu  II-299
Zhou, Dong  I-58, I-62, I-518
Zhou, Xin  II-211
Zhu, Qian  II-157
Zuccon, Guido  II-133