Fig. 1. System Evaluation Decision Matrix for system comparison. (The matrix crosses the outcome of the performance test — P(s1) == P(s2) versus P(s1) >> P(s2) or P(s1) << P(s2) — with the outcome of the assessment test — A(s1) == A(s2) versus A(s1) >> A(s2) or A(s1) << A(s2). When assessment is comparable, the null hypothesis that s1 is equal to s2 is retained or rejected with a low degree of uncertainty; when assessment differs, it is retained or rejected with a high degree of uncertainty.)
We have the following research hypothesis:

H0: P(s1) = P(s2)    H1: P(s1) ≠ P(s2)

To determine the level of confidence we can place on this test, we test the supplementary hypothesis using a corresponding Assessment Precision metric A(), which denotes the Ap at a corresponding measurement depth k (i.e. A@10, MAA, etc.):

H0: A(s1) = A(s2)    H1: A(s1) ≠ A(s2)

This forms a contingency table of four possible outcomes of interest, displayed in Figure 1. Significance is denoted as either no difference (==) or a significant difference (<<, >>), i.e. s1 is significantly better (>>) than s2. We assume that statistical significance is determined using an appropriate test such as the Wilcoxon signed rank test, paired t-test or ANOVA [13].

For Case 1, the null hypothesis that P(s1) == P(s2) and A(s1) == A(s2) cannot be rejected. We define this as a "strong" case because the levels of assessment for both s1 and s2 are equal, that is, the proportion of information used to estimate performance was comparable.

For Case 2, the null hypothesis that P(s1) == P(s2) cannot be rejected either; however, the proportion of information used to estimate the performance of the two systems was not comparable. In other words, the result list of one system
was comprised of a significantly larger proportion of assessed documents than the other, causing a degree of uncertainty in this comparison. It is unknown from this test whether, under comparable conditions, the null hypothesis P(s1) == P(s2) would still hold or not. We therefore define this as a "weak" case.

For Case 3, also a "strong" case, the null hypothesis that P(s1) == P(s2) is rejected. We can place a high degree of confidence in this outcome, as we have either a scenario where both systems share similar proportions of assessed documents, or, in special scenarios, the system with significantly higher performance has significantly fewer documents assessed than the other system. In other words, even with further information about this system it could not match (or better) the opposing system.

Finally, for Case 4, another "weak" case, the null hypothesis that P(s1) == P(s2) is rejected, although we cannot place a high degree of confidence in this outcome, as the system with significantly higher performance also reported a significantly larger proportion of assessed documents. This does not indicate that the system with a smaller proportion of assessed documents would share similar performance under equal conditions, but instead flags a potential problem with this comparison.

Of the weaker outcomes, Case 2 is particularly interesting, as both systems have similar performance, but this performance is based on different proportions of assessed documents. What is interesting is that the system with significantly fewer assessed documents could potentially be retrieving a wider diversity of documents, with respect to the pool, and some of these documents may be relevant [7]. A subsequent research question would be to investigate why the systems perform as well as each other. As both systems have equal system performance but unequal levels of assessment, the system with the lower level of assessment may potentially improve performance when compared under even conditions. Further investigation may provide stronger supporting evidence.

At this stage a number of steps could be taken. If the goal of the comparison is precision orientated, then system comparison could be made at a shallower measurement depth to ensure the likelihood of parity. By doing so we are assuming that at shallower depths systems will have relatively equal proportions of documents assessed. If both systems have contributed to the pooling process then this assumption should hold up until the pooling depth has been reached; however, if a system has not contributed to the pool this may not be the case. The previous step may lead to the creation of test collections with an emphasis on shallow measurement depth [13]. If the goal is to compare a minimal number of systems using shallow measurements, where re-usability of the test collection is not important, such a strategy could also be adopted by research groups. For example, provided with enough resources, further checking of the unassessed documents can be made, adopting strategies such as that outlined by Carterette et al. [8]. Alternatively, comparisons could be made across different test collections where conditions remain even. This step assumes such collection(s) exist, although collections can be evaluated for such properties using the suite of Assessment measures. Finally, this reinforces the need, when building test collections, to include novel systems in the pooling process [14].
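To make the decision matrix concrete, the sketch below shows one way the four cases could be derived programmatically. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-topic scores for a Performance metric P (e.g. P@k or AP) and an Assessment Precision metric A, taken here as the proportion of judged documents in the top k, and it uses the Wilcoxon signed rank test mentioned above via SciPy. The function and variable names are hypothetical.

```python
import statistics
from scipy.stats import wilcoxon

def assessment_precision_at_k(ranked_docs, judged_docs, k):
    """Ap@k: proportion of the top-k retrieved documents that have been judged
    (assumption: this matches the paper's notion of 'level of assessment')."""
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if d in judged_docs) / float(len(top_k))

def decision_case(p_s1, p_s2, a_s1, a_s2, alpha=0.05):
    """Classify a pair-wise comparison into Cases 1-4 of the decision matrix.
    p_* and a_* are per-topic Performance and Assessment Precision scores."""
    _, p_perf = wilcoxon(p_s1, p_s2)      # paired test on performance
    _, p_assess = wilcoxon(a_s1, a_s2)    # paired test on assessment

    perf_differs = p_perf < alpha
    assess_differs = p_assess < alpha

    if not perf_differs:
        # Case 1: comparable assessment; Case 2: unequal assessment (weak).
        return 1 if not assess_differs else 2

    # Performance differs: identify the better system and compare assessment.
    better_is_s1 = statistics.mean(p_s1) > statistics.mean(p_s2)
    if not assess_differs:
        return 3  # strong: performance differs under comparable assessment
    more_assessed_is_s1 = statistics.mean(a_s1) > statistics.mean(a_s2)
    if better_is_s1 != more_assessed_is_s1:
        return 3  # special scenario: the better system had fewer judged documents
    return 4      # weak: the better system also had more judged documents

# Example with made-up per-topic scores for two systems over five topics.
p1, p2 = [0.40, 0.35, 0.50, 0.42, 0.38], [0.20, 0.25, 0.30, 0.22, 0.19]
a1, a2 = [0.90, 0.85, 0.88, 0.92, 0.87], [0.89, 0.86, 0.90, 0.91, 0.86]
print(decision_case(p1, p2, a1, a2))
```

In the experiments reported below an ANOVA with Bonferroni correction is used across all systems; the paired test here is only the per-pair building block of that procedure.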
3 Experimental Analysis
To demonstrate the application of the Assessment Precision measure within the evaluation process, we conducted an empirical analysis to evaluate both its utility and to provide further justification for its introduction. Our first objective was to examine the officially submitted runs to TREC over a number of collections, spanning a range of years (the official runs are available at http://trec.nist.gov/results.html). By using the official runs we could investigate the level of uncertainty during performance comparisons of runs included in the system pool across these collections. Our second objective was to evaluate the implications of measurement depth with respect to the level of assessment between systems at various cut-off values. Using the Assessment Precision metrics, we were investigating what effect using a measurement depth deeper than the pooling depth may have on system comparisons. This is related to the argument that relative system performance can be compared if the conditions remain even for both systems. We then focused our attention on runs from particular collections, such as the Robust-HARD 2005 collection (which combined runs from Robust 2005 and HARD 2005 to form the pool), which has been identified as potentially problematic to use due to title bias [14]. The aim is to better understand the problems cited with this collection, with particular focus on runs that weight topic titles and runs that do not.

For each collection we first analysed each possible pair-wise system comparison of the officially submitted runs using the decision matrix (see Figure 1). To test for significance across all systems we used the ANOVA test. If significant differences in terms of performance and assessment across the systems of a collection were found, we performed a follow-up Bonferroni multiple comparison to identify which systems differed significantly both in terms of performance and assessment. We repeated this experiment across numerous Performance and Assessment Precision metrics, spanning a range of measurement depths, including the Performance metrics P@10, P@30, P@100, P@500, P@1000 and MAP, and the Assessment Precision metrics Ap@10, Ap@30, Ap@100, Ap@500, Ap@1000 and MAA. For each metric, we counted the number of comparisons that fell into each outcome, i.e. Cases 1-4 (see Figure 1 for an outline of each case).

Table 1 presents the proportion of overall system comparisons that fall into each case. Rows indicate different test collections while columns represent different measures, increasing by measurement depth. Each entry represents the proportion of system comparisons that fall into that case, e.g. for the TREC 3 @10 entry, 69% of pair-wise system comparisons fall in Case 1, 7% in Case 2, 17% in Case 3 and 7% in Case 4, where there were 780 pair-wise comparisons performed overall.

The first thing we were interested in was the proportion of significant pair-wise differences in terms of system performance across the various test collections, specifically to test what extent increasing measurement depth had on this proportion. The table provides the proportion of both "strong" and "weak" significant differences between systems for each metric.
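As a rough sketch of how the per-case proportions reported in Table 1 could be tallied, the snippet below iterates over all run pairs for a collection. It assumes per-topic score lists for each official run and a pair-wise classifier such as the decision_case() function sketched earlier; note that the paper's own procedure uses ANOVA followed by Bonferroni multiple comparisons rather than independent pair-wise tests, so this is only an approximation.

```python
from itertools import combinations
from collections import Counter

def case_proportions(perf_scores, assess_scores, classify, alpha=0.05):
    """Tally the proportion of pair-wise comparisons per decision-matrix case.

    perf_scores / assess_scores: dicts mapping a run name to its list of
    per-topic Performance / Assessment Precision values (assumed format).
    classify: a function such as decision_case() from the earlier sketch.
    """
    counts = Counter()
    pairs = list(combinations(sorted(perf_scores), 2))
    for s1, s2 in pairs:
        case = classify(perf_scores[s1], perf_scores[s2],
                        assess_scores[s1], assess_scores[s2], alpha)
        counts[case] += 1
    total = float(len(pairs)) or 1.0
    return {case: counts[case] / total for case in (1, 2, 3, 4)}
```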
Table 1. The proportion of comparisons for each case across the TREC collections. For each collection, the first row of each pair gives the proportions for Case 1 and Case 2, and the second row gives Case 3 and Case 4; the number of pair-wise comparisons is shown in parentheses.

Collection          @10          @30          @100         @500         @1000        MAP/MAA
TREC 3 (780)        0.69  0.07   0.67  0.14   0.58  0.32   0.44  0.49   0.41  0.57   0.64  0.13
                    0.17  0.07   0.11  0.09   0.05  0.05   0.00  0.06   0.00  0.02   0.04  0.19
TREC 4 (528)        0.72  0.12   0.65  0.16   0.43  0.39   0.35  0.48   0.33  0.50   0.50  0.25
                    0.16  0.00   0.19  0.00   0.05  0.13   0.02  0.15   0.01  0.16   0.02  0.24
TREC 6 (3081)       0.60  0.16   0.55  0.26   0.45  0.42   0.45  0.44   0.44  0.45   0.60  0.20
                    0.13  0.11   0.10  0.10   0.033 0.10   0.01  0.10   0.00  0.10   0.02  0.18
TREC 8 (8000)       0.70  0.04   0.69  0.07   0.52  0.31   0.51  0.36   0.57  0.33   0.57  0.22
                    0.19  0.06   0.17  0.07   0.01  0.16   0.00  0.13   0.00  0.10   0.04  0.18
WEB 04 (561)        0.35  0.00   0.03  0.25   0.02  0.19   0.01  0.19   0.01  0.19   0.02  0.37
                    0.65  0.00   0.42  0.30   0.51  0.28   0.53  0.26   0.53  0.27   0.40  0.20
ROBUST 03 (2145)    0.83  0.01   0.74  0.09   0.55  0.20   0.52  0.22   0.57  0.17   0.65  0.14
                    0.03  0.13   0.01  0.16   0.00  0.25   0.00  0.26   0.00  0.27   0.00  0.21
ROBUST 05 (1485)    0.77  0.12   0.66  0.23   0.53  0.38   0.57  0.35   0.60  0.31   0.68  0.19
                    0.07  0.03   0.07  0.04   0.00  0.10   0.00  0.08   0.00  0.08   0.00  0.13
HARD 05 (2775)      0.71  0.20   0.66  0.25   0.57  0.39   0.57  0.42   0.57  0.42   0.72  0.18
                    0.05  0.05   0.05  0.04   0.00  0.04   0.00  0.02   0.00  0.01   0.00  0.10
From the reported results, a noticeable trend was that for the majority of collections the proportion of significant differences decreased as the measurement depth increased, the exception being the Robust 2003 collection. This trend is the converse of the intuition stated in [7], whereby increasing measurement depth also increased discrimination between systems. The intuition is that good systems will continue to retrieve relevant documents beyond the pooling depth, which will have been discovered by other runs. From these results it would appear that discrimination between the set of systems lessens as the measurement depth increases. For some collections, such as TREC-3, 6, 8 and the Web 2004 collections, this becomes more pronounced as measurement depth is increased beyond the pooling depth.

We then examined the proportion of significant differences between systems that fall into either the "strong" or "weak" case. A common trend across collections was that, as measurement depth increased, the proportion of "strong" comparisons decreased while the proportion of "weak" cases increased. To illustrate, consider first P@10 for the TREC-3 collection in Table 1. Moving from P@10 to P@30, the proportion of comparisons falling into the "strong" case drops from 17% to 11%. Conversely, there is an increase in "weak" cases from 7% to 8.7%. This trend remains as we continue increasing measurement depth towards P@1000. Using MAP, which is calculated over all 1000 documents, 3.7% of significant differences are "strong" compared to 19% "weak" cases. This trend is common across the other collections, and appears to be an indication of the amount of information that is used to estimate system performance: increasing measurement depth results in a higher level of uncertainty in system comparison.

A similar trend is also followed for system comparisons where the null hypothesis that both systems have equal performance cannot be rejected. As measurement depth increases, the proportion of "strong" cases decreases, resulting in a
Fig. 2. Comparison of runs from Robust 2003 (top) and Robust 2005 (bottom)
larger proportion of cases where one system has a significantly larger number of assessed documents than another.

We then examined in closer detail what conditions would result in a swap from a "strong" to a "weak" comparison, and vice versa, when increasing measurement depth. As a case study we present a comparison of two sets of systems from the Robust 2003 (Figure 2, top) and Robust 2005 tracks (Figure 2, bottom). In both figures we display the P@ (left) and Ap@ (right) metrics at various ranks for both systems. Error bars are displayed to show variation across the set of topics and significance between systems.

The first example (Figure 2, top) illustrates a comparison of systems where conditions remain relatively even for both systems across various ranks. System s1 has significantly higher precision than s2 up until P@500. If we examine assessment, s1 also has higher assessment, but not significantly so, with the exception of A@100. This example illustrates that even with systems that have comparable conditions in terms of assessment, the practice of using a measurement depth larger than the pooling depth can cause uncertainty in comparisons, such as at A@100.

We also examined two runs from the Robust 2005 track, where it has been identified that there is a bias in the collection towards documents that have query terms in the title [14]. The two runs were from the same research group, with system s3 placing emphasis on keywords appearing in the title of documents, while s4 uses an external collection to expand the original query. Initially system s4 reports significantly higher precision across the 50 topics. As the measurement depth increases this improvement becomes marginal until both runs
share similar performance. If we examine the levels of assessment across the same ranks, we find that assessment is equal until we reach A@100, after which s4 decreases in assessment with respect to s3. For this collection the pool depth is 47. This result reflects the findings in [14], where the performance of s4 is underestimated once a larger measurement depth is used. System s3 begins to return more assessed (and relevant) documents than s4, which in turn is returning more unique and unassessed documents. Given the study of Zobel [7], who investigated the rate of discovering new relevant documents beyond pool depth, it is uncertain whether, if both systems shared similar levels of assessed documents, their performance would converge.
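A Figure 2-style comparison can be produced with a few lines of plotting code. The sketch below is illustrative only: it assumes per-topic P@k and Ap@k values are already available for each run at a set of rank cut-offs, and plots topic means with error bars (standard error is one plausible choice; the paper does not specify how its error bars were computed). All names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_run_comparison(cutoffs, p_by_run, a_by_run):
    """p_by_run / a_by_run: dicts mapping run name -> array of shape
    (num_topics, num_cutoffs) holding per-topic P@k and Ap@k values."""
    fig, (ax_p, ax_a) = plt.subplots(1, 2, figsize=(10, 4))
    for name in p_by_run:
        p = np.asarray(p_by_run[name])
        a = np.asarray(a_by_run[name])
        # Mean over topics, with standard-error bars showing topic variation.
        ax_p.errorbar(cutoffs, p.mean(axis=0),
                      yerr=p.std(axis=0) / np.sqrt(len(p)), label=name)
        ax_a.errorbar(cutoffs, a.mean(axis=0),
                      yerr=a.std(axis=0) / np.sqrt(len(a)), label=name)
    ax_p.set_xlabel("rank cut-off k"); ax_p.set_ylabel("P@k")
    ax_a.set_xlabel("rank cut-off k"); ax_a.set_ylabel("Ap@k")
    ax_p.legend()
    plt.tight_layout()
    plt.show()
```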
4 Discussion and Conclusion
In this paper we argued that uncertainty should be identified during the evaluation process when using incomplete relevance judgements. We therefore proposed a new set of metrics based on the level of assessment, which can be used to provide an indication of uncertainty during system comparisons. If the level of assessment between systems is similar, we believe that a fair comparison can be made; otherwise, uncertainty has been introduced into the evaluation. By using the System Evaluation Decision matrix we can make stronger claims of significance (or not), and guide subsequent research to decide when further testing is required (e.g. shallower measurement depth, different collections, etc.). The advantage is that those comparisons that may not be fair can be detected and investigated accordingly.

During the course of our empirical study, where we employed the extended evaluation methodology, we found evidence to suggest that the use of a measurement depth larger than the pooling depth weakens claims of significance in performance. This was a concern that was previously raised, but not confirmed, by Zobel [7]. Our results indicate that as the measurement depth increases beyond the pooling depth, uncertainty across many system comparisons also increases, and, interestingly, the discrimination between systems weakens. Consequently, this supports the conclusions drawn by Sanderson and Zobel [13], who stated that metrics which consider early precision, such as P@10, can be used to accurately discriminate between systems. This decrease in discrimination at deeper measurement depths may result from the higher variation in performance estimates across topics stemming from the lack of assessment at these depths. Also, the majority of systems, what Aslam and Savell refer to as the "popular" systems [11], appear to discover a similar proportion of relevant documents once the measurement depth increases. However, the performance of the best systems may still be underestimated because they return documents that are unique relative to the documents returned by the "popular" systems.

In general, the relative rankings of systems tend to remain stable across measurement depths and varying levels of incompleteness. However, it is those cases where a system changes in ranking because of the effects of incompleteness that should be detected, in particular comparisons resulting in Case 2 in the decision matrix. This is because it is these systems, which are
novel and considerably different to the "popular" systems, that require more consideration during evaluation.

In this paper we have introduced an extended retrieval evaluation methodology which uses Assessment Precision to determine whether comparisons between competing systems are made under similar conditions. The adoption of this methodology leads to a finer-grained analysis during the evaluation, as the necessary context is provided to draw firmer conclusions. Future work will examine the implications and usage of this methodology in greater detail, as well as investigate issues relating to Assessment Precision, such as the relationship between Precision and the level of assessment.
Acknowledgements. We would like to thank Fabio Crestani, Kalervo Järvelin, David Harper, and the anonymous reviewers for their valuable comments.
References

1. Salton, G.: The state of retrieval system evaluation. Information Processing & Management 28 (1992) 441–449
2. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th ACM SIGIR Conference. (2004) 25–32
3. Voorhees, E.M., Harman, D.K., eds.: TREC: Experiment and Evaluation in Information Retrieval. MIT Press, Cambridge, Massachusetts (2005)
4. Spärck Jones, K., Van Rijsbergen, C.J.: Report on the need for and provision of an "ideal" information retrieval test collection. Technical report, British Library Research and Development Report 5266, University of Cambridge (1975)
5. Blair, D.C.: Some thoughts on the reported results of TREC. Information Processing & Management 38 (2002) 445–451
6. Voorhees, E., Harman, D.: Letters to the editor. Information Processing & Management 39 (2003) 153–156
7. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of the 21st ACM SIGIR. (1998) 307–314
8. Carterette, B., Allan, J., Sitaraman, R.: Minimal test collections for retrieval evaluation. In: Proceedings of the 29th ACM SIGIR, Seattle, WA (2006) 268–275
9. Cormack, G.V., Palmer, C.R., Clarke, C.L.A.: Efficient construction of large test collections. In: Proceedings of the 21st ACM SIGIR. (1998) 282–289
10. Sanderson, M., Joho, H.: Forming test collections with no system pooling. In: Proceedings of the 27th ACM SIGIR. (2004) 33–40
11. Aslam, J.A., Savell, R.: On the effectiveness of evaluating retrieval systems in the absence of relevance judgments. In: Proceedings of the 26th ACM SIGIR, Toronto, Canada (2003) 361–362
12. Soboroff, I., Nicholas, C., Cahan, P.: Ranking retrieval systems without relevance judgments. In: Proceedings of the 24th ACM SIGIR. (2001) 66–73
13. Sanderson, M., Zobel, J.: Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proceedings of the 28th ACM SIGIR. (2005) 162–169
14. Buckley, C., Dimmick, D., Soboroff, I., Voorhees, E.: Bias and the limits of pooling. In: Proceedings of the 29th ACM SIGIR Conference, Seattle, WA (2006) 619–620
Evaluating Query-Independent Object Features for Relevancy Prediction

Andres R. Masegosa 1, Hideo Joho 2, and Joemon M. Jose 2

1 Department of Computer Science and A.I., University of Granada, Spain
2 Department of Computing Science, University of Glasgow, UK
[email protected], {hideo,jj}@dcs.gla.ac.uk
Abstract. This paper presents a series of experiments investigating the effectiveness of query-independent features extracted from retrieved objects to predict relevancy. Features were grouped into a set of conceptual categories, and individually evaluated based on click-through data collected in a laboratory-setting user study. The results showed that while textual and visual features were useful for relevancy prediction in a topic-independent condition, a range of features can be effective when topic knowledge was available. We also re-visited the original study from the perspective of significant features identified by our experiments.
1 Introduction

There has been a growing interest in leveraging contexts in different aspects of Interactive Information Retrieval (IIR) systems [1,2,3]. While the IR community might not have a consensus regarding what exactly a context is, progress has been made on the understanding of IR in contexts. For example, Ingwersen and Järvelin [4] propose a model of context stratification which includes a wide range of features in the information seeking and retrieval environment. The model offers a structured focus for the work on finding the potentially significant contexts to improve the performance of IIR systems. Some of the proposed strata relevant to this work are: work task features; interaction features; and document features.

One way to identify significant contextual features is to investigate their relationship to the relevancy of retrieved objects. For example, Kelly and Belkin [5] found that the reading time of documents can vary significantly across topics, and thus it can be difficult to predict document relevancy from it. Fox et al. [6] applied a machine learning technique to model interaction features with respect to document relevancy. Another way to find significant features is to observe the effect of features in an IR technique such as relevance feedback. For instance, White et al. [7] investigated the effects of topic complexity, search experience, and search stage on the performance of implicit relevance feedback. Furthermore, the relationship between the context strata is important to understand the significance of features. For example, Freund et al. [8] suggest that document genres can be indicative of the type of topics in a workplace environment.

This work was supported by the ALGRA project (TIN2004-06204-C03-02), an FPU scholarship (AP2004-4678) and EPSRC (Ref: EP/C004108/1).
In this paper, we present a series of experiments investigating the effectiveness of query-independent object features to predict document relevancy. Our evaluation was based on experimental data collected in a laboratory-based user study with 24 participants searching four different topics. Participants were asked to bookmark a document when perceived relevant information was found. Object features were extracted from click-through documents and bookmarked documents. The objective of using such data was two-fold. First, we can decrease the uncertainty of users' underlying information needs inferred from interaction data, compared to a search engine's transaction log. Second, we were interested in re-visiting the result of the original study from the perspective of significant features identified by the experiments.

The rest of this paper is structured as follows. Section 2 describes our methodology to extract a set of potential object features, and other operations to improve the performance of the relevancy prediction. Section 3 presents a series of experiments investigating the effectiveness of the object features. Section 4 discusses the implications of our findings. Finally, Section 5 concludes the paper with future work.
2 Methodology

This section describes our methodology used to extract object features and other operations applied to the features to improve the performance of relevancy prediction. An overview of the classifiers used in our experiment and the performance measures is also discussed.

2.1 Conceptual Categories of Object Features

The first step towards finding effective features for relevancy prediction was to identify candidate features that can be extracted from retrieved documents. Based on some informal experimentation and a literature survey, we identified approximately 150 object features. To increase the understanding of candidate features in relevancy prediction, we then grouped them into a set of conceptual feature categories. The structure used for the categorisation is shown in Table 1. As can be seen, there are four main categories: document textual features, visual/layout features, structural features, and other selective features. The objective of the main categories was to group candidate features into a set of independent functionalities played in a document. Therefore, we do not claim that our categorisation is ideal for all applications. On the contrary, the structure of features should be revised as an investigation progresses. Nevertheless, the structure shown in Table 1 was a good starting point for us to investigate the effectiveness of object features. An overview of the main categories is as follows.

Document textual features: This category consists of features that are related to the textual contents of documents. Examples include the number of words in a document and in anchor texts, the number of upper-case words, the number of digits, and Shannon's entropy value [9] for a document and anchor texts.
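As an illustration of the document textual features just listed, the following sketch computes a few of them from raw text. It is a simplified stand-in for the extraction tools the authors actually used ([10], [11]), and the exact feature definitions here are assumptions.

```python
import math
import re
from collections import Counter

def document_textual_features(text):
    """Compute a handful of DOC-style features from a document's text."""
    words = re.findall(r"\w+", text)
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())

    # Shannon entropy of the word distribution [9].
    if total:
        probs = [c / total for c in counts.values()]
        entropy = -sum(p * math.log2(p) for p in probs)
    else:
        entropy = 0.0

    return {
        "num_words": len(words),
        "num_uppercase_words": sum(1 for w in words if w.isupper()),
        "num_digits": sum(ch.isdigit() for ch in text),
        "entropy": entropy,
    }

print(document_textual_features("Dust allergy in the workplace: 10 practical steps."))
```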
Table 1. Conceptual categories for object features

Main Category               Sub category                       Code   Feature size
Document textual features                                      DOC    14
Visual/Layout features      Visual Appearance                  V-VS   28
                            Visual HTML tags                   V-TG   27
                            Visual HTML attributes             V-AT   16
Structural features                                            STR    18
Other selective features    Selective words in anchor texts    O-AC   11
                            Selective words in document        O-WD   11
                            Selective HTML tags                O-TG   7
                            Selective HTML attributes          O-AT   16
Visual/Layout features: This category consists of features that are related to visual or layout aspects of documents. There are three sub-categories: Visual appearance includes features such as the number and dimensions of (background) images and foreground/background colours; Visual HTML tags includes a set of HTML tags such as font, li, and table; Visual HTML attributes includes attributes used across the HTML tags such as style, border, and face.

Structural features: This category consists of features that are related to the hyperlink and site structure of documents. Examples include the depth of a document in a URL, the number of outlinks, and PageRank scores.

Other selective features: This category consists of features that do not necessarily fit into the above categories. There are four sub-categories: Selective words in document includes the presence of selective words such as address, search, and help; Selective words in anchor texts is the same as above but extracted from anchor texts only; Selective HTML tags includes a set of HTML tags such as form, object, and script; Selective HTML attributes includes lang, onclick, src, etc.

Extraction of the features was carried out by a mixture of tools such as [10] and [11]. The following sections describe the proposed methodology to build the classifiers, to select significant features and, finally, to validate the results.

2.2 Probabilistic Classification Approach

The classification problem can be seen as the task of predicting a given feature of an object using another set of features of the same object. In the probabilistic classification paradigm, the classification problem can be described by two types of random variables:

Class Variable: C. This random variable is the variable to be predicted, and takes one of a finite set of states {c1, ..., ck}. In our case, C contains one state for each possible prediction: C = {Relevance, Non-Relevance}.

Predictive/Attribute Variables: X = {X1, X2, ..., Xn}. Each variable has a set of possible states (discrete variables) or continuous values (continuous variables). For simplicity, we only discuss discrete variables here. Then, Xi = {x_{i,1}, ..., x_{i,k_i}} is the set of possible states of the random variable Xi. In our case, X is the set of variables described in Section 2.1.
In this model, our objective is to learn the following probability distribution:

P(C | X1, X2, ..., Xn) = ( P(c1 | X1, ..., Xn), ..., P(ck | X1, ..., Xn) )

In other words, the probability of the class variable given the set of attribute variables. The prediction of the class (Relevance or Non-Relevance) is based on the highest a posteriori probability. This probability distribution is estimated from a set of data D = {D1, ..., DM}, where each Di contains an instantiation of the predictive features and of the class for object number i:

Di = (x_{1,j1}, x_{2,j2}, ..., x_{n,jn}, c_j)
In our study, a great number of the attribute variables were continuous. However, the literature suggests that the performance of the classifiers can be more robust when the variables are discrete. Therefore, we used the equal frequency discretisation method [12] to split the continuous variables into 10 intervals.

Another aspect to consider in the classification was the balance of the class variable distribution (i.e., the portion of relevant and non-relevant documents in a data set). Imbalanced data is known to deteriorate the performance of a classifier [13]. We took the following approach to address the issue. When there were a large number of cases, we randomly removed cases from the larger class until the portions were balanced. When there were a small number of cases, we used a resampling method to balance the data. Although this resampling method was a good technique to correct imbalanced data, it was also possible to over-estimate the performance of the classifier. We used the AUC measure [12] to detect such over-estimation.

2.3 Classifiers Used

While a single Bayesian network approach was used by [6], we were interested in using several classifiers and reporting the result of the best performing classifier. This was because a single classifier was unlikely to show the significance of attribute variables in a complex dependency structure. We selected four classifiers that have been proved to be successful in machine learning classification. An overview of the classifiers used in our experiments is as follows.

Naive Bayes [14]. This is one of the best known probabilistic classifiers. The main assumption in this model is that all attribute variables depend on the class variable and are independent of each other given the class.

AODE [15]. This classifier can use multiple representations of a problem space to predict the class variable. A disadvantage is that this classifier cannot show an explicit relationship between variables.

HNB [16]. This classifier creates a hypothetical variable to represent the relationship between the attribute variables. The resulting representation is then used to predict the class variable. HNB inherits the structural simplicity of Naive Bayes and can be trained without mining the dependency structure.

K2-MDL. This classifier is a variant of the Bayesian network classifier [17] where the structure is learnt by the K2 algorithm [18].
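The preprocessing steps described in Section 2.2 — equal frequency discretisation into 10 intervals and balancing of the class distribution — could be reproduced along the following lines. This is a sketch under stated assumptions rather than the authors' code; it uses pandas' quantile-based binning as the equal frequency discretiser and simple random under-/over-sampling for balancing, with a hypothetical size threshold deciding between the two.

```python
import numpy as np
import pandas as pd

def discretise_equal_frequency(df, columns, bins=10):
    """Split each continuous column into `bins` equal-frequency intervals."""
    out = df.copy()
    for col in columns:
        # duplicates="drop" guards against columns with many repeated values.
        out[col] = pd.qcut(df[col], q=bins, labels=False, duplicates="drop")
    return out

def balance_classes(df, class_col, small_threshold=200, seed=0):
    """Balance a binary class: under-sample the larger class when there are
    many cases, otherwise re-sample the smaller class with replacement."""
    rng = np.random.default_rng(seed)
    groups = [g for _, g in df.groupby(class_col)]   # assumes two classes
    large, small = sorted(groups, key=len, reverse=True)
    if len(df) >= small_threshold:
        keep = rng.choice(large.index, size=len(small), replace=False)
        return pd.concat([df.loc[keep], small])
    extra = rng.choice(small.index, size=len(large) - len(small), replace=True)
    return pd.concat([large, small, df.loc[extra]])
```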
2.4 Feature Selection Scheme

Feature selection in the supervised classification paradigm aims to find a minimum set of attribute variables that can achieve the best performance. The selection of significant features in the problem space can prevent the classifiers from introducing noisy evidence in the training stage. Feature selection can also reduce the number of variables to be considered in the problem space and thus facilitate our understanding of the significant variables. While several techniques have been proposed for feature selection [19], we used a wrapper method which selects a set of the best features based on the AODE classifier. The actual selection process was similar to the cross-validation method described in the following section. The final set of features was generated from the features that were selected in at least N% of the repeated cross-validation runs. We used 50%, 80%, and 90% as the cutoff levels in the feature selection. We found that the overall performance did not vary significantly over the cutoff levels. Therefore, we only report the results for 90% in the experiments, since it consists of the smallest number of significant features.

2.5 Classification Validation Scheme

With the aim of providing a robust estimation of the accuracy of a classifier, the data set was partitioned into two separate sets: the training data set was used to build the classifier and the test data set was used to estimate the performance. The K-fold cross-validation method was used to partition the data set as follows. The data set D was divided into K random subsets of the same size, D1, ..., DK, and the validation process was repeated K times. In other words, in step i, with i = 1, ..., K, a training data set was defined as Ti = D \ Di, the subset Di was used as a test set, and the accuracy was measured on it. The mean of the K accuracy measures was reported as the final estimated performance of the classifier. In our study, a 10-fold cross-validation was repeated 10 times to measure the performance (i.e., based on 100 repeated estimations).
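A minimal version of this validation scheme — 10-fold cross-validation repeated 10 times, i.e. 100 accuracy estimates — might look as follows. scikit-learn's CategoricalNB is used purely as a convenient stand-in for the Naive Bayes classifier described above (the study also used AODE, HNB and K2-MDL, which are not part of scikit-learn); X is assumed to hold the discretised feature matrix and y the Relevance/Non-Relevance labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB

def repeated_cv_accuracy(X, y, folds=10, repeats=10, seed=0):
    """Mean and standard deviation of accuracy over folds x repeats splits."""
    cv = RepeatedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
    scores = cross_val_score(CategoricalNB(), X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Toy example: 200 documents, 5 discretised features with 10 states each.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 5))
y = rng.integers(0, 2, size=200)
print(repeated_cv_accuracy(X, y))
```

A wrapper-style feature selection as described in Section 2.4 would simply repeat this estimate for candidate feature subsets and keep the features chosen in at least N% of the runs.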
3 Experiments

This section presents a series of experiments which investigated the effectiveness of query-independent contextual features to predict the relevancy of click-through documents. The accuracy of prediction is defined by the portion of correct predictions in the total number of click-through documents. A correct prediction is a sum of the true positive and true negative cases (i.e., predicting a relevant document as relevant, and predicting a non-relevant document as non-relevant). For example, when the data consist of 50 relevant and 50 non-relevant documents, and when 30 relevant and 40 non-relevant documents are correctly predicted, then the performance is (30 + 40) / (50 + 50) = 70/100 = 0.7, i.e. 70%.

Throughout this section, the results are presented in two groups of data sets. The first group is based on all click-through data without the distinction of individual topics. The second group is based on the data within individual topics.
Table 2. Baseline performance of relevancy prediction

            Click-through   Relevant   Non-Rel   Baseline (%)   Balanced (%)
No topic    737             375        362       50.9
Topic 1     203             123        80        60.6           50.0
Topic 2     173             83         90        52.1
Topic 3     154             69         85        55.2
Topic 4     207             100        107       51.7
The former is referred to as the topic-independent set, and the latter is referred to as the topic-dependent set. This grouping enables us to examine the effect of topic knowledge in relevancy prediction, and how the effectiveness of the features differs in the two conditions.

The section is structured as follows. First, the baseline performance of relevancy prediction is established by looking at the portion of relevant/non-relevant documents in the click-through data set. Second, the effect of contextual features in each category is examined. Then, the effects of several operations on the contextual features are presented: feature selection, feature combination, and the use of highly relevant documents.

3.1 Baseline Performance

A total of 1038 click-through documents were extracted from our user study of 24 participants searching four different search topics [20]. Of those, 375 were unique relevant and 362 were unique non-relevant documents. Therefore, the baseline performance of relevancy prediction was set to 50.9% in the topic-independent set (denoted as No topic in the tables). The portion of relevant/non-relevant documents varied across the four topics. The baseline performance was taken from whichever was the higher portion, as shown in Table 2. Note that a relatively large difference was found between the number of relevant and non-relevant documents in Topic 1. To measure an accurate performance of the classifiers, we generated a balanced data set by random sampling for Topic 1 (shown in the 6th column of Table 2). In the following analysis, the performance based on the balanced set is used for Topic 1. No change was found to be necessary for the rest of the data set.

3.2 Effect of Contextual Features

The first experiment examined the effect of contextual features in the individual context categories. In this experiment, the classifiers used only the features defined in each category to predict the relevancy, and the same procedure was repeated for all categories. The performance of relevancy prediction was compared to the baseline performance and the relative improvement is shown in Table 3. The bottom row of the table shows the average improvement across the four topics (but not including the topic-independent set). The statistically significant differences are highlighted in bold in the table. We used the t-test (p < 0.05) for the statistical tests throughout the study. As can be seen, the features in the DOC and V-AT categories were found to be useful for improving the relevancy prediction in the topic-independent set. While the V-VS category was found to be effective in Topic 1, the overall effect of individual categories
Table 3. Effect of contextual features DOC V-VS V-TG V-AT STR O-AC O-WD O-TG O-AT Mean No topic 50.9 +5.7 +2.5 +2.2 +3.9 +0.4 +1.2 -2.1 -0.9 +2.2 +1.7 Topic 1 Topic 2 Topic 3 Topic 4
50.0 +4.1 +11.3 -2.7 52.1 +0.3 +7.3 +6.1 55.2 +0.6 +4.4 -2.6 51.7 -5.1 -2.6 +0.6
Mean
+4.2 -2.2 +5.4 +2.3
+2.0 +5.7 +2.9 -9.2 +1.1 +8.1 +2.1 +1.4
+1.1 +4.6 +0.7 +2.7 +1.7 +1.4
-4.0 -4.3 -0.8 +3.1
+3.1 +2.7 +2.0 +4.8 +4.8 +6.3 +0.6 -0.7
+2.9 +0.9 +3.0 +0.2
-1.6 +1.9 +3.1
+1.7
appeared to be weak across the topics. Furthermore, the performance of most categories appeared to be inconsistent across the topics. The exceptions were the STR and O-TG categories, but the differences were not significant. The following sections present the effects of several operations on the features to improve the performance. 3.3 Effect of Feature Selection In the previous experiment, all features were used to predict the document relevancy in the individual categories. One way to improve the performance is to use only a subset of features that are likely to contribute to the prediction, which is called a feature selection. There are several methods of the feature selection. In this study, we used the features that were selected 90% of times in the repeated tests on the training set. The feature selection was carried out in the indivdual categories and the result of relevancy prediction is shown in Table 4. Table 4. Effect of feature selection with 90% cutoff DOC V-VS V-TG V-AT STR O-AC O-WD O-TG O-AT Mean No topic 50.9 +4.6 +1.4 +3.1 +3.8 +2.4 +2.9 +2.0 +0.3 +1.1 +2.4 Topic 1 Topic 2 Topic 3 Topic 4
50.0 0.0 +1.4 +10.8 +7.4 +11.5 52.1 0.0 -3.8 +5.8 +4.8 +5.2 55.2 0.0 +8.4 0.0 +6.2 +0.7 51.2 +2.3 0.0 +3.5 +1.3 +5.8
Mean
+6.1 +3.7 +10.1 +5.4
+1.4 +1.5 +4.6 +4.7 +5.1 +5.6
+5.8 +4.1 +3.8 +0.9 +2.6 -9.0 -1.4 +6.4 -3.3 +8.6 -1.5 +5.4
+5.6 +1.1 +3.0 +3.4
+3.2 +2.4
+3.1
-0.4
From the far right column (Mean) of Table 4, there appears to be an overall positive effect of the feature selection, compared to Table 3. However, in the topic-independent set, the performance of the significant categories (i.e., DOC and V-AT) was degraded by the feature selection. This suggests that a greater number of the features should be considered in the individual categories when no topic knowledge was available for the relevancy prediction.

On the other hand, a significant improvement was found in several categories of the topic-dependent sets when the feature selection was carried out. The results show that, for example, Topic 1 is likely to benefit from the features in the V-TG and STR categories. What is more important is that the significant category is likely to differ across the topics.
A.R. Masegosa, H. Joho, and J.M. Jose Table 5. Effect of feature combination No topic 50.9 Topic 1 Topic 2 Topic 3 Topic 4
Best Cat Combined High Rel +5.7 +5.6 +11.6
50.0 52.1 55.2 51.7
+11.5 +5.8 +10.1 +8.6
Mean
+10.3 +9.2 +9.9 +2.2 +7.4
In fact, no single category contributed to a significant improvement on more than one topic. Topic 2 appeared to be a particularly difficult topic to find effective features. This suggests that the effectiveness of query-independent features is fairly topic-dependent. 3.4 Effect of Feature Combination So far, we have examined the performance of the individual feature categories. The features selected in the previous experiment appeared to have a varied effectiveness over the topics. We further investigated the performance of the context features by combining the selected features into a single category. The advantage of the feature combination is that the classifiers do not have to find a particular category for the relevancy prediction. The result is shown in Table 5. In the table, the performance of the best category in the previous experiment is shown in the 3rd column (Best Cat), and the performance of the combined feature is shown in the 4th column. As can be seen, the overall performance of the feature combination appears to be comparable to the best performing category in the previous experiment, except Topic 4. And, the effect appears to be consistent across the two data sets. In particular, Topic 2 was found to benefit from the feature combination significantly. The mean value of four topics (the bottom row of Table 5) suggests that the performance of combined feature is likely to be more consistent than any single feature category. We also tested the different cutoff levels (50%, 80%, and 90%) of the feature selection before the combination, and a similar performance was found over the cutoff levels. 3.5 Effect of Highly Relevant Documents The last experiment in this paper looked at the effect of highly relevant documents for the relevancy prediction. In the literature, the importance of highly relevant documents has been suggested in the evaluation of IR systems [21]. In this study, the highly relevant documents were determined when the document was judged as relevant by at least two participants in the same topic. While this criterion was not based on a graded relevance judgement, it enabled us to select a reasonable number of relevant documents whose relevancy was shared by participants. There were a total of 96 documents that were judged by at least two participants. Of those, 69 were relevant and 27 were non-relevant. Similar to Topic 1’s data set, we needed to balance the portion of relevant /non-relevant documents for the this analysis.
Table 6. Effect of highly relevant documents (without feature selection)

                  DOC     V-VS    V-TG    V-AT    STR     O-AC    O-WD    O-TG    O-AT    Mean
All rel (50.9)    +5.7    +2.5    +2.2    +3.9    +0.4    +1.2    -2.1    -0.9    +2.2    +1.7
High rel (50.0)   +15.2   +15.5   +16.7   +8.2*   +12.0   +4.9    +0.3    +11.6   +6.2    +9.8
* An over-estimation was detected in this result, thus not considered.
We used a replacement method which was often used in machine learning (See Section 2.2). The method is known to be robust to measure the performance in a similar situation, and to detect any anomalies in the results. The effect of highly relevant documents is shown in Table 6. We only report the result of the topic-independent set since the data was too small to measure the individual topic performance. As can be seen, the significant improvements were found in several categories when the highly relevant documents were targeted for the relevancy prediction. The result also shows that the features from a wider range of categories can be considered for the prediction in the topic-independent set. We also measured the effect of feature combination based on the highly relevant documents, and the result is shown in the 5th column of Table 5. As can be seen, a respectable improvement can be achieved without selecting the best performing category. These results show that the use of highly relevant documents can be a more effective way to predict the document relevancy than the other methods examined in this study. This is interesting because the classifiers usually perform worse when the size of the training data decreases.
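The balancing step for the highly relevant subset (69 relevant versus 27 non-relevant documents) relies on resampling with replacement, as noted above. A minimal sketch of that idea follows; it is an assumption-level illustration rather than the authors' exact procedure, and the over-estimation risk it introduces is exactly why the AUC check mentioned in Section 2.2 matters.

```python
import numpy as np

def oversample_with_replacement(minority_idx, majority_idx, seed=0):
    """Return a balanced set of indices by re-drawing the minority class
    (here the 27 non-relevant documents) with replacement."""
    rng = np.random.default_rng(seed)
    extra = rng.choice(minority_idx,
                       size=len(majority_idx) - len(minority_idx),
                       replace=True)
    return np.concatenate([majority_idx, minority_idx, extra])

relevant = np.arange(0, 69)        # indices of the 69 highly relevant documents
non_relevant = np.arange(69, 96)   # indices of the 27 non-relevant documents
balanced = oversample_with_replacement(non_relevant, relevant)
print(len(balanced))  # 138 = 69 + 69
```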
4 Discussion This section discusses the implications of our experimental results. We also re-visit the result of the original user study from the perspective of the significant features identified by the experiments. 4.1 Effectiveness of Query-Independent Features The findings from our experiments have several implications for the use of queryindependent object features to predict the document relevancy. First, the set of effective features can be different when the prediction is performed with/without topic knowledge. In the topic-independent set, the textual document features and visual/layout HTML attributes are likely to be significant to predict the document relevancy. In our experimental conditions, the feature selection or feature combination were found to make little improvement on the performance. However, a simple filtering to select highly relevant documents was found to be effective to improve the performance. In the topic-dependent set, on the other hand, many categories can be effective for the relevancy prediction. However, the effectiveness of individual categories can vary across the topics. The results show that the feature selection and feature combination can be effective for improving the performance in this set. Another implication of our results is that an additional classifier for a topic detection should be used supplementary to the relevance prediction. Such a two-stage approach would allow us to use a topic-dependent significant category effectively, thus, can be promising to improve the performance.
A.R. Masegosa, H. Joho, and J.M. Jose Table 7. Minimum set of significant featuresÝ Topic 1 Topic Dust allergy in workplace DOC V-VS V-TG V-AT STR
O-AC
Topic 2 Music piracy on Internet
Topic 3 Petrol price change
Topic 4 Art galleries and museums in Rome
imageBDiskSize meta, li URL-Levels HtmlLink
PR-Page, link URL-Domain numlinksAnchor
contact email search
contact help email
search help tel search, address
O-WD O-TG O-AT Ý The significant features with 90% cutoff is highlighted with bold. The rest are based on 80% cutoff in the feature selection.
Our preliminary test to predict the four topics using the relevancy classifier showed an accuracy of between 45 to 60% with the average of 55%. A further investigation is under way for the integrated approach. The results also suggest that the textual document features such as the entropy are rarely effective within the individual topics. This was contrast to their performance in the topic-independent set. Our speculation is that the entropy and other document level features might have a low discriminating power to separate relevant documents from non-relevant. Other features that occur less frequently in the data set appear to have a higher discriminating power. Therefore, a similar phenomenon that motivates the idea of inverse document frequency [22] might be applicable to indicate the significance of query-independent features. This also supports our approach to use a range of objects’ features to predict the document relevancy. 4.2 Re-examination of the Original Study Our experiments were based on the experimental results of a user study carried out in a laboratory setting. A motivation for using such data was to decrease an uncertainty of users’ underlying information needs in the experiments. A distinct objective was to reexamine the result of the original study from the perspective of the significant features identified by the experiments. In this section, we discuss the findings of such analysis. Table 7 shows the minimum set of the query-independent features that contributed to a significant improvement in the individual topics. The minimum set was determined by the multiple cutoff levels (80% and 90%, See Section 2.4) in the feature selection to increase the number of indicative features. In Topic 1, participants were asked to find the information on the potential solution to dust allergy in a workplace. Some perceived relevant documents contained a list of
steps to reduce the dust inside a building. Therefore, the li tag in the visual feature category was a significant indicator of the document relevancy. The depth of document in a web site appeared to vary in this topic compared to the other topics. As discussed before, Topic 2 was a difficult topic to find significant features. In this topic, participants were asked to find the information on the damage of music piracy on Internet. The structural features in this topic suggest that participants were able to find relevant information in the top ranked documents from a limited number of URL domains. The most unexpected result was Topic 3 where the disk size of background images (imageBDiskSize) was found to be a significant indicator of the document relevancy. A close examination showed that the background image information was more helpful for predicting non-relevant documents than relevant documents. The selective words in the anchor texts appeared to be useful for this topic. The result of Topic 4 was also interesting. We initially expected that the visual features were likely to be significant in this topic, but this was not the case. Instead, the selective words in a document were found to be a significant indicator of the document relevancy. We speculate that since most click-through documents contained a variety of images in this topic, their discriminating power was lower than we had expected. However, since participants were asked to find the information on a particular location, the words such as address was found to be significant. As can be seen, the result of the re-examination of the original work was a mixture of re-assurance and puzzlement. More importantly, however, the significant features appeared to offer us a pointer for the further examination of the original study. In this sense, the re-examination of the original work based on the significant features can supplementarily used in the evaluation of user studies.
5 Conclusion and Future Work This paper presented a series of experiments which investigated the effectiveness of the query-independent features to predict the document relevancy. The experimental results from a user study were used to extract the various features of retrieved objects. Our results show that the document-level textual features and visual features can be indicative of the document relevancy in an topic-independent situation. The use of highly relevant documents can improve the performance significantly. When the type of topics was known, a wider range of features can be effective for the relevancy prediction. However, the effectiveness of the features is likely to vary across the topics. Overall, these findings highlight the importance of investigating the significance of objects’ features from the perspective of the topics and aggregated relevance assessments. In this study, we investigated the features from retrieved objects. We are conducting a similar experiment based on the interaction features in the other parts of the user logs, searchers features gained from the participants” background information and finally, subjective perceptions on the topic characteristics established by the questionnaires. We anticipate that the features from the additional context strata can facilitate the understanding of the original user study. We also plan to evaluate the features extracted from another user study to investigate the robustness of the significant features identified by this study.
References 1. Ingwersen, P., Belkin, N.: Information retrieval in context - IRiX: workshop at SIGIR 2004. SIGIR Forum 38(2) (2004) 50–52 2. Ingwersen, P., J¨arvelin, K.: Information retrieval in context: IRiX. SIGIR Forum 39(2) (2005) 31–39 3. Ruthven, I., Borlund, P., Ingwersen, P., Belkin, N., Tombros, A., Vakkari, P., eds.: Proceedings of the 1st IIiX Symposium, Copenhagen, Denmark (2006) 4. Ingwersen, P., J¨arvelin, K.: The Turn: Integration of Information Seeking and Retrieval in Context. Springer (2006) 5. Kelly, D., Belkin, N.J.: Display time as implicit feedback: understanding task effects. In: Proceedings of the 27th SIGIR Conference, Sheffield, United Kingdom, ACM Press (2004) 377–384 1009057. 6. Fox, S., Karnawat, K., Mydland, M., Dumais, S., White, T.: Evaluating implicit measures to improve web search. ACM Transactions on Information Systems 23(2) (2005) 147–168 1059982. 7. White, R.W., Ruthven, I., Jose, J.M.: A study of factors affecting the utility of implicit relevance feedback. In: Proceedings of the 28th SIGIR Conference, Salvador, Brazil, ACM (2005) 35–42 8. Freund, L., Toms, E.G., Clarke, C.L.A.: Modeling task-genre relationships for ir in the workspace. In: Proceedings of the 28th SIGIR Conference, Salvador, Brazil, ACM (2005) 441–448 9. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois (1949) 10. (Html parser. ”http://htmlparser.sourceforge.net/”) 11. (Firefox add-ons. ”https://addons.mozilla.org/”) 12. Duda, R.O., Hart, P.E.: Pattern Classification. Wiley Interscience (2000) 13. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5) (2002) 429–449 14. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley Sons, New York (1973) 15. Webb, G.I., Boughton, J.R., Wang, Z.: Not so naive bayes: aggregating one-dependence estimators. Mach. Learn. 58(1) (2005) 5–24 16. H. Zhang, L.J., Su, J.: Hidden naive bayes. In: Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), (AAAI Press(2005).) 17. Pearl, J.: Probabilistic Reasoning with Intelligent Systems. Morgan & Kaufman, San Mateo (1988) 18. Cooper, G., Herskovits, E.: A bayesian method for the induction of probabilistic networks from data. Machine Learning 9 (1992) 309–347 19. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2) (1997) 273–324 20. Joho, H., Jose, J.M.: Slicing and dicing the information space using local contexts. In: Proceedings of the First Symposium on Information Interaction in Context (IIiX), Copenhagen, Denmark (2006) 111–126 21. J¨arvelin, K., Kek¨al¨ainen, J.: Ir evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd SIGIR Conference., Athens, Greece, ACM (2000) 41–48 22. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1) (1972) 11–21
The Utility of Information Extraction in the Classification of Books

Tom Betts, Maria Milosavljevic, and Jon Oberlander

School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
[email protected], {mmilosav,jon}@inf.ed.ac.uk
http://www.inf.ed.ac.uk
Abstract. We describe work on automatically assigning classification labels to books using the Library of Congress Classification scheme. This task is non-trivial due to the volume and variety of books that exist. We explore the utility of Information Extraction (IE) techniques within this text categorisation (TC) task, automatically extracting structured information from the full text of books. Experimental evaluation of performance involves a corpus of books from Project Gutenberg. Results indicate that a classifier which combines methods and tools from IE and TC significantly improves over a state-of-the-art text classifier, achieving a classification performance of Fβ=1 = 0.8099. Keyword: Information Extraction, Named Entity Recognition, Book Categorisation, Project Gutenberg, Ontologies, Digital Libraries.
1 Introduction
Books have long been classified in ontologies to help users locate relevant titles. The emergence of digital libraries and on-line resellers has released the classification of books from the constraints of a physical space, allowing the use of multi-labelling, and even folksonomies.1 This is very useful, since books are often inconsistently classified in different systems, or not satisfactorily classified at all. Text categorisation (TC) is the task of classifying texts, based on their content, into one or more predefined categories. Popular applications include filtering email messages for spam, or indexing newswire articles. Information Extraction (IE) is a process by which we can identify structure within unstructured natural language text. The IE task consists of a series of subtasks, one of which is Named Entity Recognition (NER), which is concerned with finding entities within a text such as people, locations and organisations. Intuitively, we expect that reducing unstructured text into a structured format should provide leverage in TC. Although there has been considerable research in both TC and IE, only a limited amount of work has combined them. We have found no other work which uses content-based categorisation on full-text books or large texts. Our 1
Our approach is closest to [12] in terms of representation, combining phrases (where we use named entities) with bag-of-words; cf. the models discussed in Section 3. Here, we set out to show that by using IE, information derived automatically from the full text of books can lead to good TC performance. We first determine the type of information that can easily be extracted from books using NER. Second, we explore how this information can be incorporated into a TC task and develop a framework for its inclusion. Finally, we show that the use of this information leads to a statistically significant improvement in performance when compared to a state-of-the-art text classifier.
2 Motivation and Context
Major players in the text industry have a considerable interest in the digitisation of books. For example, Amazon.com are scanning books for their ‘Search Inside the Book’ programme, the Million Book Project at Carnegie Mellon has scanned over 600,000 texts (see http://www.library.cmu.edu/Libraries/MBP FAQ.html) and Google is expected to digitise 10.5 million books for their Google Books project [4]. For the purpose of this work, we use books from Project Gutenberg (http://www.gutenberg.org), which contains over 19,000 copyright-free electronic texts. Categorising books is non-trivial due to subjectivity and the sheer volume and variety of titles available. Manual classification is time consuming and, therefore, impractical for large collections such as digital libraries or online retailers, who stock millions of different titles. Hence, content-based classification may prove a useful alternative. However, the challenges of heterogeneity and scalability distinguish this problem from traditional automated text categorisation. For example, the length of full-text books may be orders of magnitude greater than newswire stories or e-mail messages classified in past work. In this work, we use the Library of Congress Classification (LoCC, as outlined at http://www.loc.gov/catdir/cpso/lcco/lcco.html), an ontology for categorising books according to their subject, used by most academic and research libraries. Although LoCC was created to satisfy the constraints of a physical library, it can also be used to categorise digital collections. Because a library cannot contain many copies of the same book, books are traditionally placed into only one or at most two LoCC categories. But digital collections have no such constraints, and in fact, assigning multiple categories is highly desirable in order to help diverse users locate relevant books. Nürnberg et al. [13] provide further detail on the differences between digital and physical libraries.
2.1 Approaches to Book Categorisation
Manually categorising books has long been the concern of librarians. They assign books to predefined categories from hierarchical ontologies such as LoCC or the Dewey Decimal Classification. Early work on the automated categorisation of library texts (not necessarily books) focussed on the content of abstracts [2].
A more recent approach to categorising books has been to use structured metadata, such as titles or subject headings—a less hierarchically structured form of subject annotation for books [8,9,6]. Frank and Paynter’s [6] work is closest to ours, taking a machine learning approach to categorisation. Using Library of Congress Subject Headings, they report an accuracy of ∼ 55% when categorising 50,000 books into 4214 categories. However, their classification is based on curated metadata, which may be susceptible to annotator inconsistency; by contrast, our approach is content-based. Mooney and Roy [11] use content-based text categorisation in a recommender system for books. They use both structured metadata and unstructured text for categorisation, but limit the use of text to product descriptions, editorial and user reviews.
3 Methods and Tools
Our central hypothesis is that structured information that can be automatically extracted from unstructured book texts is useful for the purpose of categorising them. In testing this hypothesis, we use Named Entity Recognition (NER) in addition to a standard bag-of-words approach. NER can be achieved with good performance, suffers acceptable domain coupling, and should be computationally tractable [5]. Furthermore, named entities can provide important cues to the topic of a text. Thus, we pursue NER as a technique for the extraction of structured information, and we use pre-existing tools to extract entities from each book text. We explore the use of names of people, organisations, locations and dates, based on the assumption that these entities should help to discriminate between the categories in our corpus.
3.1 Corpora
We use full-text books from Project Gutenberg, using their published LoCC as the ‘gold standard’. Gutenberg texts have been assigned a LoCC including schedule, denoted by the first letter (e.g. D: History: General and Eastern Hemisphere), and subclass, denoted by subsequent letters (e.g. DA: History: General and Eastern Hemisphere: Great Britain, Ireland, Central Europe). In order to experiment with the extraction of structured information from the available texts, and effectively assess its usefulness in categorisation, we created two corpora containing all English texts from a subset of Gutenberg categories. Our development corpus is used to experiment with and develop our classifiers. It includes 810 texts from 28 categories, including subclasses of schedules B (Philosophy, Psychology, Religion) and D (History: General and Eastern Hemisphere). These categories are selected based on their availability, the intuition that named entities may discriminate classes, and their content-based similarity. Our extended corpus (comprising 1029 texts in 52 categories) combines the development corpus with texts from schedules H (Social sciences) and Q (Science). This provides further texts which may exhibit similarity with the development corpus; however, we do not manually alter our models specifically to discriminate H or Q categories. There are no features that exploit domain
knowledge specific to these categories. Schedule Q, containing physical science subcategories, poses challenges by adding subject diversity, disparate linguistic style and texts which may not be easily discriminated by named entities.
3.2 Models for Representing Texts
In order to assess the utility of named entities in our categorisation task, we created three models that can be used to represent a text:

BOW Model: Bag-of-Words. The first model uses a multinomial bag-of-words to represent the text. This model is considered state-of-the-art for TC, and is the baseline against which we evaluate other techniques.

NER Model: Bag-of-Named-Entities. There have been several efforts to find more sophisticated representations of texts that capture additional ‘knowledge’ or ‘meaning’ present in a text. Techniques include the use of n-gram phrases [10,15,7,12,3] and word senses [15,12] instead of words. We create a representation of each book text as a variant of a bag-of-phrases, using n-gram entities extracted with NER. The classification of entities is exploited so that entities with the same string value, but different types (e.g. Chelsea-LOC and Chelsea-PER), are treated as different features.

GAZ Model: Generalising Named Entities. One apparent difficulty with the use of phrases in TC is that they suffer from inferior statistical qualities when compared to word-based models [12,14]. A named entity model will have similar problems, since it is not clear that the same named entity will be observed across texts. An attempt is made to reduce sparse data difficulties by generalising the observed named entities into a smaller number of synthetic features. We incorporate domain knowledge to create features that exploit ontological structure. For example, we observe that the subclasses of schedule D (History) are organised by geographical region. Features are created using geographic gazetteers, defined according to the region of each category. For subclasses of schedule B (Philosophy, Psychology, Religion), we can use gazetteers of names of relevant people and organisations.

Although we evaluate the baseline BOW, NER and GAZ models individually, previous work with phrases for TC [12,10] has combined a bag-of-words feature space with phrases to overcome their inferior statistical qualities [14]. We use a similar approach, creating a continuous feature space by combining the feature spaces of individual models. The models are: NER-GAZ combines the feature spaces of NER and GAZ, exploiting GAZ as smoothing, as discussed above; BOW-NER combines NER features with the baseline BOW feature space, helping to reduce the negative effects of phrase-based representations, such as lack of partial matching and misclassified or missing named entities; and, BOW-NER-GAZ involves the combination of all three individual models.
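As a concrete illustration of these representations, the sketch below builds BOW, NER and GAZ feature counts for a toy text and concatenates them into a hybrid feature space. It is our own minimal example, not the authors' implementation: the entity list stands in for real NER output, the gazetteer is invented, and the feature-naming conventions are ours.

```python
# A minimal sketch of the BOW, NER and GAZ representations and their combination.
# The NER output and the gazetteer are toy stand-ins; the paper's features were
# derived with a real NER tool and LoCC-specific gazetteers.
from collections import Counter

def bow_features(tokens):
    # Multinomial bag-of-words: raw term counts.
    return Counter(tokens)

def ner_features(entities):
    # Bag-of-named-entities: the same string with different types
    # (e.g. "chelsea"/LOC vs "chelsea"/PER) yields distinct features.
    return Counter(f"{text.lower()}#{etype}" for text, etype in entities)

def gaz_features(entities, gazetteers):
    # Generalise entities into synthetic features, one per gazetteer
    # (e.g. a gazetteer of British place names for LoCC subclass DA).
    feats = Counter()
    for text, _ in entities:
        for gaz_name, members in gazetteers.items():
            if text.lower() in members:
                feats[f"GAZ={gaz_name}"] += 1
    return feats

def combine(*feature_dicts):
    # Hybrid models (BOW-NER, NER-GAZ, BOW-NER-GAZ) concatenate feature spaces.
    combined = Counter()
    for d in feature_dicts:
        combined.update(d)
    return combined

tokens = ["the", "history", "of", "london", "and", "chelsea"]
entities = [("London", "LOC"), ("Chelsea", "LOC")]
gazetteers = {"DA_great_britain": {"london", "chelsea", "edinburgh"}}

bow, ner, gaz = bow_features(tokens), ner_features(entities), gaz_features(entities, gazetteers)
print(combine(bow, ner, gaz))   # BOW-NER-GAZ representation of the toy text
```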
3.3 Classifier Configurations
This task is a multiclass problem because Gutenberg texts are only assigned one of the available LoCC. We evaluate two approaches to multiclass classification.

Multiclass. This requires training a single classifier to distinguish between all categories, where the objective function we optimise is the accuracy of classification when assigning one of many categories. This classifier exploits the mutual exclusivity of class labels during training, whereas binary classifiers combined in one-vs-all classification (described below) treat category assignment decisions independently (it is only during the final assignment that mutual exclusivity is exploited). The classifier used in these experiments is a multiclass generalisation of SVM, SVMmulticlass [17].

One-vs-All. We evaluate one-vs-all classification because it can assign multiple labels to a book (motivated in Section 2). Unfortunately, training and evaluating classifiers combined using one-vs-all may be computationally expensive, due to the need to use one classifier per category. However, it follows that the complexity of individual classifiers should be relatively low (compared to the direct approach described above). We use multiple instances of SVMlight [16]. We make classifications by selecting the classification that we are most confident about (direct comparisons are made between binary classifiers even though we would not expect these confidences to be normalised, as discussed in Section 5) and pursue two approaches to category assignment. A relaxed classification is made by assigning the category of the classifier outputting the largest positive value. A forced classification is made by assigning the ‘least bad’ classification—selecting the largest value, regardless of whether positive or negative. The relaxed classifier, which comes closest to generalising to a multilabel classification task, guarantees that at most one category is assigned, but does not guarantee to assign any. The intuitive result of forcing a classification would be an improvement in recall, but at the expense of precision.
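The relaxed and forced assignment policies reduce to a small decision rule over the per-category decision values. The sketch below is our own illustration rather than the SVMlight pipeline used in the paper; `scores` is a hypothetical mapping from category labels to signed classifier outputs.

```python
# Relaxed vs. forced one-vs-all assignment from per-category SVM decision values.
# A positive score means the binary classifier for that category accepts the text.

def relaxed_assignment(scores):
    # Assign the category with the largest positive score; possibly assign nothing.
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score > 0 else None

def forced_assignment(scores):
    # Always assign the "least bad" category, regardless of sign.
    return max(scores.items(), key=lambda kv: kv[1])[0]

scores = {"B": -0.42, "D": -0.10, "H": -0.73}   # toy decision values
print(relaxed_assignment(scores))  # None: no classifier is confident
print(forced_assignment(scores))   # 'D': the least negative score wins
```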
3.4 Term Reduction and Feature Selection
Initial experiments were conducted using the entire feature space, and even with our relatively small corpus, execution was prohibitively slow. Two processes can reduce the size of models—the removal of infrequent terms and feature selection. Term reduction involves removing terms that occur with high or low frequency. Due to the scale of texts being represented, we opt for maximum reduction, removing terms that occur fewer than five times in the corpus [14]. This provides an initial reduction in model size but is not sufficiently aggressive. However, it is a vital process because the technique that we use for feature selection, χ2 [19], cannot provide accurate estimates for low frequency terms. We use χ2 to estimate the lack of independence between class labels and features. In particular, we use Yang and Pedersen’s [19] definition of χ2max to obtain a single value for each feature. Features are sorted by χ2max, and the n (a variable in our experiments) largest are selected for inclusion in a model.
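A minimal sketch of this selection step is given below. It follows the standard χ² contingency formulation and keeps the n terms with the highest χ²max; it is not the authors' code, and for brevity the term-reduction step is approximated with a document-frequency threshold rather than the corpus-frequency threshold described above.

```python
# A sketch of chi-square-max feature selection in the spirit of Yang & Pedersen:
# score each (term, class) pair from document-level contingency counts, keep the
# maximum over classes, and select the n highest-scoring terms.
from collections import defaultdict

def chi2_max_selection(docs, labels, n, min_df=5):
    N = len(docs)
    df = defaultdict(int)                 # term -> document frequency
    df_c = defaultdict(int)               # (term, class) -> docs of class containing term
    class_count = defaultdict(int)
    for terms, label in zip(docs, labels):
        class_count[label] += 1
        for t in set(terms):
            df[t] += 1
            df_c[(t, label)] += 1

    scores = {}
    for t, dft in df.items():
        if dft < min_df:                  # crude term reduction on infrequent terms
            continue
        best = 0.0
        for c, nc in class_count.items():
            A = df_c.get((t, c), 0)       # docs of class c containing t
            B = dft - A                   # docs of other classes containing t
            C = nc - A                    # docs of class c without t
            D = N - nc - B                # docs of other classes without t
            denom = (A + C) * (B + D) * (A + B) * (C + D)
            if denom:
                best = max(best, N * (A * D - C * B) ** 2 / denom)
        scores[t] = best                  # chi-square-max over classes
    return sorted(scores, key=scores.get, reverse=True)[:n]

docs = [["war", "king"], ["king", "castle"], ["atom", "energy"], ["energy", "war"]]
labels = ["D", "D", "Q", "Q"]
print(chi2_max_selection(docs, labels, n=2, min_df=1))
```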
3.5 Evaluation
There is considerable choice of metrics for performance evaluation, including micro-averages, a per-document metric, and macro-averages, performing averaging over categories. Given that a motivation for this work is a reduction in human effort required to categorise texts, it follows that a goal is accurate classification of as many texts as possible, rather than accurate classification of as many categories, since the latter could result in the accurate categorisation of very few texts. Evaluation and optimisation, therefore, focus on micro-average performance. Fβ=1 is used as the primary metric for binary decisions and we occasionally discuss this in terms of precision and recall to provide further insight.
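For reference, the micro-averaged quantities behind these scores can be written in terms of the per-category true positives (TP_c), false positives (FP_c) and false negatives (FN_c); this is the standard formulation, stated here for convenience rather than taken from the paper:

P_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \quad R_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}, \quad F_{\beta=1} = \frac{2\, P_{micro}\, R_{micro}}{P_{micro} + R_{micro}}

Macro-averaging would instead average the per-category F values, which is why it can be dominated by sparsely populated categories.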
4 Results
Due to the quantity of metrics used for analysing experimental output, it is not feasible to comprehensively document all our results in this paper. For clarity, situation-specific visualisations and statistics are included to facilitate this discussion, and an overview of micro-Fβ=1 for each classifier, displaying results for every model, is provided in Fig. 1. A comprehensive account of the results can be found in [1]. Unless otherwise stated, metrics quoted are micro-averages. The best result obtained using our baseline model was Fβ=1 = 0.7914, and this was improved to Fβ=1 = 0.8099 when incorporating named entities.
4.1 Baseline: Bag-of-Words
We begin this evaluation with our baseline, noting that micro-Fβ=1 is indicated in Fig. 1. Both classifiers that guarantee to make exactly one classification, one-vs-all forced (Fig. 1A) and direct multiclass (Fig. 1C) classifiers, exhibit relatively straightforward optimisation of Fβ=1, which increases as features are added. However, for the forced one-vs-all classifier, performance begins to fall after 5000 features, labelled (a). In the relaxed one-vs-all classifier (Fig. 1B), Fβ=1 performance begins a continuous decline after only 500 features, indicated (b). Between points (b) and (c), Fβ=1 falls from 0.7264 to 0.6564, which corresponds to a fall in the number of correct classifications from 507 to 405. This can be explained by a fall in certainty of classifications as the number of features increases. It follows that the classifier makes fewer category assignments, and precision increases, because the most ‘uncertain’ texts were those most likely to be incorrectly classified. Increasing the number of model features eventually reduces the ability of classifiers to generalise because the trained classifiers will tend to overfit, given our limited training data. This is also true for forced one-vs-all classifiers. Continuing to force classifications leads to a decrease in Fβ=1, which we observe at point (a). From Fig. 1C we can see that multiclass classification appears to avoid overfitting (in the limited dimensions that we explore), and may be more robust because class assignment decisions are dependent on one another, whereas local decisions in one-vs-all classifiers are independent.
[Figure 1 comprises three plots, (A) Forced One-vs-All, (B) Relaxed One-vs-All and (C) Multiclass, each showing micro-average F1 against the number of features in the model for the BOW, NER, BOW-NER, BOW-NER-GAZ and NER-GAZ models; the labelled points (a)–(g) are referred to in the text.]
Fig. 1. Micro-average Fβ=1 as number of features is varied for each classifier configuration (one per plot) and document model (all models per plot)
Comparing the classifier configurations, we observe that one-vs-all classifiers achieve respectable performance with very few dimensions, whereas multiclass classifiers require larger representations. Multiclass and one-vs-all forced classifiers exhibit the best Fβ=1, which is broadly similar between them (∼ 0.79), suggesting that the requirement for assigning exactly one class can be effectively exploited. Although the relaxed classifier performs worst overall, it achieves a precision of 0.955, the highest in the evaluation, but at the expense of recall.
4.2 Combining Models
In this section we evaluate the hybrid models, discussed in Section 3.2, that facilitate the combination of features from individual models.

NER-GAZ: Named Entities With Smoothing. Motivating the GAZ model in Section 3.2, we discussed the need for smoothing of NER model features to prevent overfitting. By combining features from NER and GAZ, we hope to improve recall in the NER model. In Fig. 1A and 1B we observe a general rise in Fβ=1 when comparing the NER-GAZ and NER models, which we attribute to an increase in average recall. Although generally NER-GAZ offers improvements over the NER model, optimal Fβ=1, indicated (d) and (e), is only marginally improved (relaxed), or not at all (forced). However, the generally increased Fβ=1 offered by NER-GAZ is desirable in order to aid generalisation without exhaustive searches of parameter space. As seen in Fig. 1C, multiclass classification appears largely unaffected by the addition of GAZ features to the NER model.

BOW-NER and BOW-NER-GAZ: Combining Word and Named Entity Based Models. By including named entities in addition to word-based features, we hope to improve precision without reducing recall. When evaluating performance of optimised models (summarised in Table 1), both appear to improve performance over any of their constituents. Furthermore, for one-vs-all classifiers this improvement appears substantial. We can see from the model performance in Fig. 1A and 1B that these models tend to perform similarly to their BOW counterparts. Given that the BOW feature space is denser, this is perhaps unsurprising, and indicates that BOW features have greater influence over the performance of these hybrid models than NER or GAZ features. In Fig. 1A we see that forced one-vs-all classifiers exhibit consistent improvement for BOW-NER and BOW-NER-GAZ models, compared to the baseline BOW model. They also improve relaxed classifiers when the number of features included in the model is < 6000, labelled (f); however, with more features a BOW classifier outperforms the combined models. As previously discussed for the baseline, overall performance of relaxed classifiers suffers from poor recall. Compounding this with the precision-improving characteristics of NER features may be the cause of this eventual decline in combined model performance. The opposite appears true of multiclass classification, in which Fβ=1 grows more slowly for the combined models than the baseline. In fact, multiclass does not benefit from the combined models until the final dimension that we sample, labelled (g), where the combined models offer marginal improvement.
Table 1. Development corpus: optimal micro-Fβ=1

Model         One-vs-All (Forced)    One-vs-All (Relaxed)   Multiclass
              Features   Fβ=1        Features   Fβ=1        Features   Fβ=1
BOW           5000       0.7901      500        0.7264      20000      0.7914
NER           500        0.6765      50         0.6676      20000      0.7198
GAZ           28         0.6407      28         0.6499      28         0.5802
NER-GAZ       1000       0.6765      25         0.6689      20000      0.7222
BOW-NER       5000       0.8074      1000       0.7424      30000      0.7926
BOW-NER-GAZ   5000       0.8099      1000       0.7447      30000      0.7938

4.3 Development Corpus: Discussion of Results
Table 1 contains the optimal parameters (number of features in the model) for every model and classifier configuration, and the respective development corpus Fβ=1. For comparison, a naïve classifier could obtain Fβ=1 = 0.1843 on the development corpus by classifying every document as the most numerous category. We note that the BOW-NER-GAZ model consistently produces the best Fβ=1. Although the improvements for the combined feature spaces do not always appear substantial over their discrete constituents, it is satisfying that the combined models at least exhibit the improvements intended by design. NER-GAZ improves recall and categorisation performance on texts from unpopular categories (over the NER model), and the BOW-NER and BOW-NER-GAZ models improve the baseline model by enhancing precision, which in many cases leads to an increase in Fβ=1.

Statistical Significance. In order to determine whether the performance increases reported between models and our baseline are likely to have occurred by chance, we perform statistical significance testing. We use the micro sign test (or s-test) [18], which compares two systems (A and B) based on the binary decisions that they make. This test evaluates models at a micro level, which aligns with our aim to optimise micro-Fβ=1. The s-test produces a (one-sided) P-value for the hypothesis that system A performs better than system B. A subset of the results of this testing are given in Table 2. A smaller P-value indicates a more significant result, and we assume that a P-value > 0.1 indicates that the improvement reported for system A over B is not statistically significant. Using the s-test, we find that two results are significant, and these are both forced one-vs-all classifications. The BOW-NER-GAZ model offers the most significant improvement over the baseline, indicated by the smallest P-value.
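To make the s-test concrete, the sketch below computes its one-sided P-value as a binomial tail over the decisions on which the two systems disagree in correctness; the counts in the example are hypothetical, and ties are discarded as is usual for a sign test.

```python
# One-sided micro sign test (s-test): among binary decisions where systems A and B
# disagree in correctness, test whether A is right more often than chance.
from math import comb

def sign_test_p(n_a_only, n_b_only):
    # P-value for H1 "A performs better than B": under H0, each disagreement
    # favours A with probability 0.5, so sum the upper binomial tail.
    n = n_a_only + n_b_only
    return sum(comb(n, k) for k in range(n_a_only, n + 1)) / 2 ** n

# Hypothetical counts: A correct where B is wrong on 40 decisions, reverse on 24.
print(round(sign_test_p(40, 24), 4))   # ~0.03 -> significant at the 0.05 level
```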
4.4 Extended Corpus Evaluation
We analyse classification performance on our extended corpus, as introduced in Section 3.1. Models were trained and evaluated using the parameters established on the development corpus. The results are given in Table 3.
Table 2. Statistical significance test results for development corpus using s-test

System A        System B   Classifier           s-test*
BOW-NER         BOW        Forced One-vs-All    >
BOW-NER-GAZ     BOW        Forced One-vs-All    ≫
BOW-NER         BOW        Relaxed One-vs-All   ∼
BOW-NER-GAZ     BOW        Relaxed One-vs-All   ∼
BOW-NER         BOW        Multiclass           ∼
BOW-NER-GAZ     BOW        Multiclass           ∼

* where “≫” indicates P-value ≤ 0.05; “>” indicates 0.05 < P-value ≤ 0.10; and “∼” indicates P-value > 0.10
Table 3. Extended corpus: micro-Fβ=1 based on development corpus parameters

Model         One-vs-All (Forced)    One-vs-All (Relaxed)   Multiclass
              Features   Fβ=1        Features   Fβ=1        Features   Fβ=1
BOW           5000       0.6673      500        0.6359      20000      0.6683
NER           500        0.2917      50         0.4761      20000      0.6341
GAZ           28         0.3522      28         0.3506      28         0.3844
NER-GAZ       1000       0.5141      25         0.4907      20000      0.55
BOW-NER       5000       0.7092      1000       0.6360      30000      0.6761
BOW-NER-GAZ   5000       0.7092      1000       0.6360      30000      0.6761
It is encouraging to see similar patterns between these results and those for the development corpus given in Table 1. In particular, we see that both BOW-NER and BOW-NER-GAZ models achieved equal or better performance than the baseline, and in the case of forced one-vs-all and multiclass classifiers, this improvement was larger than that reported for the development corpus. It is interesting that the BOW-NER and BOW-NER-GAZ models performed identically for all classifiers. Further inspection reveals that no GAZ features were selected for use in the BOW-NER-GAZ model, presumably because these features are noisy, given that they are not tailored to the new categories. Although performance is universally worse than for the development corpus, this is to be expected, and we cannot isolate the cause of this penalty. The extended corpus is a harder corpus to categorise: it has more categories; the new categories are poorly populated with texts; it contains subcategories that are less likely to be easily discriminated using named entities; and we do not tailor our approach to this corpus as we did during development.
5 Conclusions and Further Work
Our classification system makes progress towards the classification of books using their full text. The extraction of structured data from full-text books, using
Information Extraction, is explored and evaluated for the purpose of assigning a single LoCC to each text. Furthermore, the techniques developed can be generalised to assign multiple category labels to a single text, although this is not evaluated here and is left for further work. In addition to assigning a categorisation, the extracted metadata (such as dates, and names of people and locations) may be useful for creating domain specific interfaces, such as maps for browsing the distribution of geographic entities, or timelines for browsing dates. Examples can be seen in the Perseus Digital Library and Gutenkarte (http://www.perseus.tufts.edu and http://www.gutenkarte.org). We have seen that there are implications for precision and recall when selecting classifiers, and that we can make very precise classifications using a relaxed one-vs-all classifier, or gain higher recall by selecting a forced architecture. Furthermore, we have also seen that one-vs-all classifiers require far fewer features than multiclass classifiers to perform optimally. This is also true of NER models, which required fewer dimensions to achieve optimal performance. This observation may be critical for the scalability of a solution to larger corpora. We also found that in one-vs-all classifiers, adding features tended to be precision improving, but at the expense of recall, leading to a fall in Fβ=1. This overfitting is most likely an artifact of limited training data. Of the architectures, direct multiclass classification appears to be the most robust and in the feature space sampled, we found little evidence of overfitting. However, given that the data is partitioned differently for each binary classifier used in one-vs-all, the respective hyperplanes and distances from hyperplanes to data points (and hence confidences) will not be normalised by default. Although it is not expected to affect the consistency of our results, it may be suboptimal to make decisions by direct comparison of these output confidences. We found that combining our models based on named entities with bag-of-words representations resulted in an increase in performance over the baseline system, and furthermore, that this improvement has some statistical significance. We expect that these techniques could be widely applied to text categorisation and that the results described may not be specific to books. Following the work in [6], we intend to develop mechanisms for evaluating the degree of incorrectness in misclassifications of books. In particular, we note that many misclassifications are in fact too general or too specific rather than strictly incorrect. In order to assess the utility of a less strict classification metric, we aim to define a ‘lenient’ metric as partial success, whereby, for example, a text from category DA may have been classified as D, or vice versa. We will then use this in order to assess the extent to which a classifier appears to make at least partially correct decisions.
Acknowledgments
The authors would like to thank Robert Japp and Sharon Givon for their input in this work.
References
1. Betts, T.: Using Text Mining to Place Books into an Ontology. Masters thesis, University of Edinburgh, Edinburgh, UK (2006)
2. Borko, H.: Measuring the reliability of subject classification by men and machines. American Documentation 15 (1964) 268–273
3. Caropreso, M.F., Matwin, S., Sebastiani, F.: A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In: Text Databases and Document Management: Theory and Practice. Idea Group Publishing (2001) 78–102
4. Crane, G.: What Do You Do with a Million Books? D-Lib Magazine 12:3 (2006)
5. Curran, J.R., Clark, S.: Language independent NER using a maximum entropy tagger. In: Proceedings of CoNLL-03, the Seventh Conference on Natural Language Learning. Edmonton, Canada (2003) 164–167
6. Frank, E., Paynter, G.W.: Predicting Library of Congress Classifications from Library of Congress Subject Headings. J. of the American Society for Information Science and Technology 55:3 (2004) 214–227
7. Fürnkranz, J.: A study using n-gram features for text categorization. Technical Report OEFAI-TR-9830, Austrian Institute for Artificial Intelligence (1998)
8. Hamill, K.A., Zamora, A.: The Use of Titles for Automatic Document Classification. J. of the American Society for Information Science 31:6 (1980) 396–402
9. Larson, R.R.: Experiments in automatic Library of Congress Classification. J. of the American Society for Information Science 43:2 (1992) 130–148
10. Mladenić, D., Grobelnik, M.: Word sequences as features in text learning. In: Proceedings of ERK-98, the Seventh Electrotechnical and Computer Science Conference. Ljubljana, Slovenia (1998) 145–148
11. Mooney, R.J., Roy, L.: Content-based book recommending using learning for text categorization. In: Proceedings of DL-00, 5th ACM Conference on Digital Libraries. San Antonio, US (2000) 195–204
12. Moschitti, A., Basili, R.: Complex linguistic features for text classification: a comprehensive study. In: Proceedings of ECIR'04, Sunderland, UK (2004)
13. Nürnberg, P.J., Furuta, R., Leggett, J.J., Marshall, C.C., Shipmann, F.M.: Digital Libraries: Issues and Architectures. In: Proceedings of the 1995 ACM Digital Libraries Conference. Austin, TX (1995) 147–153
14. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34:1 (2002) 1–47
15. Scott, S., Matwin, S.: Feature engineering for text classification. In: Proceedings of ICML-99. Bled, Slovenia. Morgan Kaufmann Publishers (1999) 379–388
16. Joachims, T.: SVMlight 6.01, http://svmlight.joachims.org (2004)
17. Joachims, T.: SVMmulticlass 1.01, http://svmlight.joachims.org/svm_multiclass.html (2004)
18. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd ACM SIGIR. Berkeley, US (1999) 42–49
19. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of ICML-97. Nashville, US (1997) 412–420
Combined Syntactic and Semantic Kernels for Text Classification
Stephan Bloehdorn and Alessandro Moschitti
Institute AIFB, University of Karlsruhe, Germany, [email protected]
University of Rome ‘Tor Vergata’, Italy, [email protected]
Abstract. The exploitation of syntactic structures and semantic background knowledge has always been an appealing subject in the context of text retrieval and information management. The usefulness of this kind of information has been shown most prominently in highly specialized tasks, such as classification in Question Answering (QA) scenarios. So far, however, additional syntactic or semantic information has been used only individually. In this paper, we propose a principled approach for jointly exploiting both types of information. We propose a new type of kernel, the Semantic Syntactic Tree Kernel (SSTK), which incorporates linguistic structures, e.g. syntactic dependencies, and semantic background knowledge, e.g. term similarity based on WordNet, to automatically learn question categories in QA. We show the power of this approach in a series of experiments with a well known Question Classification dataset.
1 Introduction
Text Classification (TC) systems [1], which aim at automatically classifying textual input into categories according to some criterion of interest, are a major application domain of modern machine learning techniques. Pioneered by [2], Support Vector Machines (SVMs) have been heavily used for text classification tasks, typically showing good results. In the simplest case, such systems use the standard bag-of-words feature representation, which encodes the input as vectors whose dimensions correspond to the terms in the overall training corpus. The inner product (or the cosine) between two such vectors is used as kernel, hence making the similarity of two documents dependent only on the number of terms they share. This approach has an appealing simplicity and has produced good results in cases where sufficient training data is available. However, several modifications to this rather flat representation have been shown to improve the overall performance in selected scenarios. In particular, there has been interest in incorporating information about (i) the syntactic structure of the input texts and (ii) the semantic dependencies within the used terminology. Kernel-based learning algorithms like Support Vector Machines have become a prominent framework for using such a-priori knowledge about the problem domain by means of a specific choice of the employed kernel function.
On the one hand, Tree Kernels [3,4] have been used as a powerful way to encode the syntactic structure of the textual input in the form of parse trees and have shown good results in many natural language applications. On the other hand, Semantic Smoothing Kernels [5,6,7] exploit background information from semantic reference structures such as WordNet to make different, though semantically similar, terms contribute to the overall similarity of the input instances. Intuitively, the power of this kind of kernels is most apparent when too little training data is available to build stable models by counting occurrences of identical terminology only. While both approaches seem intuitive and powerful, language draws from both syntax and lexical semantics; therefore, finding principled techniques for combining kernels on linguistic structures, e.g. Tree Kernels, with Semantic Kernels appears to be a promising research line. In this paper, we propose a new type of kernel, the Semantic Syntactic Tree Kernel (SSTK), which exploits linguistic structure and background knowledge about the semantic dependencies of terms at the same time. More technically, the proposed kernel uses semantic smoothing to improve the matching of tree fragments containing terminal nodes. We show the power of this approach in a series of experiments from the Question Classification (QC) domain. Question Classification [8] aims at detecting the type of a question, e.g. whether it asks for a person or for an organization, which is critical to locate and extract the right answers in question answering systems. A major challenge of Question Classification compared to standard Text Classification settings is that questions typically contain only extremely few words, which makes this setting a typical victim of data sparseness. Previous work has shown that Tree Kernels as well as Semantic Smoothing Kernels were individually capable of improving effectiveness in QC tasks. Our evaluation studies confirm these findings and indicate a consistent further improvement of the results when the proposed combined kernel is used. Our new Syntactic Semantic Tree Kernel improves the state-of-the-art in Question Classification, which makes it a prototype of a possible future full-fledged natural language kernel. The remainder of this paper is structured as follows. Section 2 introduces kernel methods and some related work. Sections 3, 4 and 5 describe the design of the Semantic Smoothing Kernels, Tree Kernels and the combined Semantic Syntactic Tree Kernels, respectively. Section 6 gives an account on the performance of these in a series of evaluation experiments from the QC domain. We conclude in section 7 with a summary of the contributions, final remarks and a discussion of envisioned future work.
2 Kernel Methods and Related Work
Support Vector Machines [9] are state-of-the-art learning methods based on the older idea of linear classification. The distinguishing feature of SVMs is the theoretically well motivated and efficient training strategy for determining the separating hyperplane based on the margin maximization principle. In our context,
however, the interesting property of SVMs is their capability of naturally incorporating data-specific notions of item similarity by means of a corresponding kernel function. Formally, any function κ on X that satisfies κ(x, z) = ⟨φ(x), φ(z)⟩ is a valid kernel, whereby X is the input domain under consideration and φ is a suitable mapping from X to a feature (Hilbert) space F. Kernels can be designed by either choosing an explicit mapping function φ and incorporating it into an inner product or by directly defining the kernel function κ while making sure that it complies with the requirement of being a positive semi-definite function. Several closure properties aid the construction of valid kernels from known valid kernels. In particular, kernels are closed under sum, product, multiplication by a positive scalar and combination with well-known kernel modifiers. In particular, a given kernel κ can be normalized using the cosine normalization modifier given by κ′(x, y) = κ(x, y) / √(κ(x, x) κ(y, y)) to produce kernel evaluations (i.e. similarity measures) normalized to absolute values between 0 and 1. The reader is referred to the rich literature for further information on SVMs and kernel methods, e.g. [10] for a comprehensive introduction. Lexical semantic kernels were initially introduced in [5] using inverted path length as a similarity measure and subsequently revisited in [11,12], each time based on different design principles. Semantic kernels based on superconcept representations were investigated in [6] and [7]. As an alternative, [11] have put Semantic Kernels into the context of Latent Semantic Indexing. Tree Kernels were firstly introduced in [3] and experimented with the Voted Perceptron for the parse-tree re-ranking task. The combination with the original PCFG model improved the syntactic parsing. In [13], two kernels over syntactic shallow parser structures were devised for the extraction of linguistic relations, e.g. person-affiliation. To measure the similarity between two nodes, the Contiguous String Kernel and the Sparse String Kernel were used. In [14] such kernels were slightly generalized by providing a matching function for the node pairs. The time complexity for their computation limited the experiments on a data set of just 200 news items. In [15], a feature description language was used to extract structural features from the syntactic shallow parse trees associated with named entities. The experiments on named entity categorization showed that too many irrelevant tree fragments may cause overfitting. In [16] Tree Kernels were firstly proposed for semantic role classification. The combination between such kernel and a polynomial kernel of standard features improved the state-of-the-art. To our knowledge, no other work has so far combined the syntactic and semantic properties of natural language in a principled way as proposed in our approach.
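As a small illustration of the cosine normalization modifier (our own sketch, operating on a precomputed Gram matrix rather than on a kernel function):

```python
# Cosine-normalise a precomputed kernel (Gram) matrix K so that
# K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]); the diagonal becomes 1 and,
# for a PSD matrix with non-negative entries, values lie in [0, 1].
import numpy as np

def cosine_normalize(K):
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

K = np.array([[4.0, 2.0], [2.0, 9.0]])   # toy Gram matrix
print(cosine_normalize(K))
```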
3 Semantic Similarity Kernels
In this section we describe the first component of our new kernel, the Semantic Smoothing Kernel, which combines semantic similarity of terms with the standard bag-of-words representation.
Table 1. Measures of semantic similarity

Inverted Path Length: sim_{IPL}(c_1, c_2) = \frac{1}{(1 + d(c_1, c_2))^{\alpha}}

Wu & Palmer: sim_{WUP}(c_1, c_2) = \frac{2\, dep(lso(c_1, c_2))}{d(c_1, lso(c_1, c_2)) + d(c_2, lso(c_1, c_2)) + 2\, dep(lso(c_1, c_2))}

Resnik: sim_{RES}(c_1, c_2) = -\log P(lso(c_1, c_2))

Lin: sim_{LIN}(c_1, c_2) = \frac{2 \log P(lso(c_1, c_2))}{\log P(c_1) + \log P(c_2)}

3.1 Semantic Networks and Similarity
The formal description of semantic kernels requires the introduction of some definitions. We denote terms as t1, t2, . . . ∈ T and concepts as c1, c2, . . . ∈ C; we also sometimes use the somewhat informal disambiguation operator c(·) to map terms to concept representations. To compute useful notions of semantic similarity among the input terms, we employ semantic reference structures which we call, for simplicity, Semantic Networks. These can be seen as directed graphs semantically linking concepts by means of taxonomic relations (e.g. [cat] is-a [mammal]). Research in Computational Linguistics has led to a variety of well-known measures of semantic similarity in semantic networks. The measures relevant in the context of this paper are summarized in Table 1. These measures make use of several notions. (i) The distance (d) of two concepts c1 and c2 is the number of superconcept edges between c1 and c2. (ii) The depth (dep) of a concept refers to the distance of the concept to the unique root node (if the structure is not a perfect tree structure, we use the minimal depth). (iii) The lowest super ordinate (lso) of two concepts refers to the concept with maximal depth that subsumes them both. (iv) The probability P(c) of encountering a concept c can be estimated from corpus statistics. When probabilities are used, the measures follow the trail of information theory in quantifying the information content (IC) of an observation as the negative log likelihood. We point the interested reader to [17] for a detailed and recent survey of the field.
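The sketch below illustrates these notions on a toy is-a hierarchy; the taxonomy and the concept probabilities are invented for the example, and only the Wu & Palmer and Lin measures are shown.

```python
# A self-contained sketch of d, dep and lso on a toy is-a hierarchy, plus the
# Wu & Palmer and Lin measures; probabilities are invented for illustration.
import math

parent = {"cat": "feline", "tomcat": "cat", "feline": "mammal",
          "dog": "canine", "canine": "mammal", "mammal": "entity"}
prob = {"entity": 1.0, "mammal": 0.4, "feline": 0.1, "canine": 0.1,
        "cat": 0.05, "tomcat": 0.01, "dog": 0.06}

def ancestors(c):
    # Path from c up to the root, including c itself (deepest first).
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def lso(c1, c2):
    # Lowest super ordinate: deepest ancestor shared by c1 and c2.
    shared = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in shared)

def dep(c):
    return len(ancestors(c)) - 1           # edges from c to the root

def d(c1, c2):
    l = lso(c1, c2)                         # edges between c1 and c2 via their lso
    return (dep(c1) - dep(l)) + (dep(c2) - dep(l))

def sim_wup(c1, c2):
    l = lso(c1, c2)
    return 2 * dep(l) / (d(c1, l) + d(c2, l) + 2 * dep(l))

def sim_lin(c1, c2):
    return 2 * math.log(prob[lso(c1, c2)]) / (math.log(prob[c1]) + math.log(prob[c2]))

print(lso("tomcat", "dog"), sim_wup("cat", "tomcat"), sim_lin("cat", "dog"))
```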
3.2 Semantic Similarity Kernels Based on Superconcepts
In this section, we introduce a class of kernel functions defined on terms that can be embedded in other kernels that make (in whatever way) use of term matching (for simplicity, we restrict our attention to defining kernel functions for concepts and leave the details of the consideration of lexical ambiguity to the next section).
Table 2. Weighting Schemes for the Superconcept Kernel κS

full: No weighting, i.e. SC(c̄)j = 1 for all superconcepts cj of c̄ and SC(c̄)j = 0 otherwise.
full-ic: Weighting using the information content of SC(c̄)j, i.e. SC(c̄)j = simRES(c̄, cj).
path-1: Weighting based on inverted path length, i.e. SC(c̄)j = simIPL(c̄, cj) for all superconcepts cj of c̄ and SC(c̄)j = 0 otherwise, using the parameter α = 1.
path-2: The same but using the parameter α = 2.
lin: Weighting using the Lin similarity measure, i.e. SC(c̄)j = simLIN(c̄, cj).
wup: Weighting using the Wu & Palmer similarity measure, i.e. SC(c̄)j = simWUP(c̄, cj).
Definition 1 (Superconcept Kernel). The Superconcept Kernel κS for two concepts ci, cj ∈ C is given by κS(ci, cj) = ⟨SC(ci), SC(cj)⟩, whereby SC(·) is a function C → R^|C| that maps each concept to a real vector whose dimensions correspond to (super-)concepts present in the employed semantic network and the respective entries are determined by a particular weighting scheme.

This idea, recently investigated in [7], is based on the observation that the more two concepts are similar, the more common superconcepts they share. A similar approach has been proposed in [6], however focusing on the simple case of giving the superconcepts in the mapping full and equal weight while varying the number of superconcepts that are considered. Obviously, κS is a valid kernel as it is defined explicitly in terms of a dot product computation. So far, however, we have left the details of the function SC(·) that maps concepts to their superconcepts unspecified. In the same way as in earlier work [7], we have investigated the use of different weighting schemes for the representation of the superconcepts, motivated by the following considerations:

1. The weight a superconcept SC(c̄)j receives in the vectorial description of concept c̄ should be influenced by its distance from c̄.
2. The weight a superconcept SC(c̄)j receives in the vectorial description of concept c̄ should be influenced by its overall depth in the semantic network.

We have used the measures of semantic similarity introduced in Table 1 as weighting schemes, summarized in Table 2. The different weighting schemes behave differently with respect to the above motivations. While full does not implement any of them, full-ic considers rationale 2 while path-1 and path-2 consider rationale 1. The schemes lin and wup reflect combinations of both rationales. The superconcept kernel κS can be normalized to [0, 1] in the usual way using the cosine normalization modifier. The concept kernel κS(·, ·) can be used directly in conjunction with the standard linear kernel by means of a simple Semantic Smoothing Kernel.

Definition 2 (Semantic Smoothing Kernel). The Semantic Smoothing Kernel κŜ for two term vectors (input texts) x, z ∈ X is given by

κŜ(x, z) = x⊤ C KS C⊤ z
[Figure 1 shows the parse tree of the sentence fragment ‘‘brought a cat’’ together with a selection of the tree fragments (substructures) that it contains.]
Fig. 1. A tree with some of its fragments
where KS is a square symmetric matrix whose entries represent the kernel evaluations between the concepts, and C denotes the matrix that encodes the evaluations of the disambiguation function c(·), mapping the concept dimensions to the term dimensions that constitute the input space X.
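A minimal numpy sketch of Definition 2, under our reading of the formula above (term vectors are mapped into concept space by the disambiguation matrix C and compared through KS); the vocabulary, concept inventory and similarity values are invented.

```python
# Semantic Smoothing Kernel: kappa(x, z) = x^T C K_S C^T z, where x, z are term
# vectors, C (|T| x |C|) encodes the term-to-concept disambiguation and K_S holds
# the superconcept-kernel evaluations between concepts.
import numpy as np

terms = ["cat", "tomcat", "note"]          # toy vocabulary T
concepts = ["CAT", "TOMCAT", "NOTE"]       # toy concept inventory C

# Each term mapped to its (single, most frequent) concept.
C = np.eye(3)

# Invented normalised concept similarities (K_S must be symmetric and PSD).
K_S = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.1],
                [0.1, 0.1, 1.0]])

def smoothing_kernel(x, z):
    return x @ C @ K_S @ C.T @ z

x = np.array([1.0, 0.0, 0.0])   # "cat"
z = np.array([0.0, 1.0, 0.0])   # "tomcat"
w = np.array([0.0, 0.0, 1.0])   # "note"
print(smoothing_kernel(x, z), smoothing_kernel(x, w))   # 0.9 vs 0.1
```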
4 Tree Kernels for Syntactic Structures
The main rationale behind Tree Kernels is to represent trees in terms of their substructures (fragments). The kernel function then counts the number of tree subparts common to both argument trees. We define a tree as a connected directed graph with no cycles. Trees are denoted as T1, T2, . . .; tree nodes are denoted as n1, n2, . . .; and the set of nodes in tree Ti is denoted as NTi. We denote the set of all substructures (fragments) that occur in a given set of trees as {f1, f2, . . .} = F. As the structures we will work with are parse trees, each node with its children is associated with the execution of a grammar production rule. The labels of the leaf nodes of the parse trees correspond to terms, i.e. terminal symbols, whereas the pre-terminal symbols are the parents of leaves. As an example, Figure 1 shows a parse tree of the sentence fragment ‘‘brought a cat’’ with some of its substructures.

Definition 3 (Tree Kernel (Collins & Duffy, 2001)). Given two trees T1 and T2 we define the (Subset-) Tree Kernel as:

\kappa_T(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2)

where \Delta(n_1, n_2) = \sum_{i=1}^{|F|} I_i(n_1) I_i(n_2), and where Ii(n) is an indicator function which determines whether fragment fi is rooted in node n. Δ is equal to the number of common fragments rooted at nodes n1 and n2. We can compute it more efficiently as follows:

1. if the productions at n1 and n2 are different then Δ(n1, n2) = 0;
2. if the productions at n1 and n2 are the same, and n1 and n2 only have leaf children (i.e. the argument nodes are pre-terminal symbols) then Δ(n1, n2) = 1;
3. if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals, then

\Delta(n_1, n_2) = \prod_{j=1}^{nc(n_1)} \left(1 + \Delta(ch^j_{n_1}, ch^j_{n_2})\right)

where nc(n1) is the number of children of n1 and ch^j_n is the j-th child of node n. Note that, since the productions are the same, nc(n1) = nc(n2). Of course, the kernel can again be normalized using the cosine normalization modifier. Additionally, a decay factor λ can be added by modifying steps (2) and (3) as follows:

2. \Delta(n_1, n_2) = \lambda,
3. \Delta(n_1, n_2) = \lambda \prod_{j=1}^{nc(n_1)} \left(1 + \Delta(ch^j_{n_1}, ch^j_{n_2})\right).

As an example, Figure 1 shows a parse tree of the sentence fragment ‘‘brought a cat’’ with some of the substructures that the tree kernel uses to represent it (the number of such fragments can be obtained by evaluating the kernel function between the tree and itself).
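The recursion can be implemented directly. The sketch below is our own rendering, with parse trees encoded as nested tuples (label, children...); it is not the SVMlight-TK implementation used later in the paper.

```python
# A direct implementation of the Delta recursion for the (Subset) Tree Kernel,
# including the decay factor lambda; terminals are plain strings, other nodes
# are tuples of the form (label, child_1, ..., child_k).
def production(node):
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2, lam=0.4):
    if production(n1) != production(n2):                 # step 1
        return 0.0
    if is_preterminal(n1):                               # step 2 (with decay)
        return lam
    k = lam                                              # step 3 (with decay)
    for c1, c2 in zip(n1[1:], n2[1:]):
        k *= 1.0 + delta(c1, c2, lam)
    return k

def nodes(tree):
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def tree_kernel(t1, t2, lam=0.4):
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ("VP", ("V", "brought"), ("NP", ("D", "a"), ("N", "cat")))
t2 = ("VP", ("V", "brought"), ("NP", ("D", "a"), ("N", "tomcat")))
print(tree_kernel(t1, t1), tree_kernel(t1, t2))
```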
5 Designing Semantic Syntactic Tree Kernels
The Tree Kernel introduced in the previous section relies on the intuition of counting all common substructures of two trees. However, if two trees have similar structures but employ different though related terminology at the leaves, they will not be matched. From a semantic point of view, this is an evident drawback as ‘‘brought a cat’’ should be more related to ‘‘brought a tomcat’’ than to ‘‘brought a note’’. In analogy with the semantic smoothing kernels for the bag-of-words kernel as described in Section 3.2, we are now interested in also counting partial matches between tree fragments. A partial match occurs when two fragments differ only by their terminal symbols, e.g. [N [cat]] and [N [tomcat]]. In this case the match should give a contribution smaller than 1, depending on the semantic similarity of the respective terminal nodes. For this purpose, we first define the similarity of two such tree fragments.

Definition 4 (Tree Fragment Similarity Kernel). For two tree fragments f1, f2 ∈ F, we define the Tree Fragment Similarity Kernel as:

\kappa_F(f_1, f_2) = comp(f_1, f_2) \prod_{t=1}^{nt(f_1)} \kappa_S(f_1(t), f_2(t))
where comp(f1, f2) (compatible) is 1 if f1 differs from f2 only in the terminal nodes and is 0 otherwise, nt(fi) is the number of terminal nodes and fi(t) is the t-th terminal symbol of fi (numbered from left to right). Note that, as the tree fragments need to be compatible, they have the same number of terminal symbols at compatible positions.
Conceptually, this means that the similarity of two tree fragments is above zero only if the tree fragments have an identical structure. The fragment similarity is evaluated as the product of all semantic similarities of corresponding terminal nodes (i.e. sitting at identical positions). It is maximal if all pairs have a similarity score of 1. We now define the overall tree kernel as the sum over the evaluations of κF over all pairs of tree fragments in the argument trees. Technically, this means changing the summation in the second formula of Definition 3 as in the following definition.

Definition 5 (Semantic Syntactic Tree Kernel). Given two trees T1 and T2 we define the Semantic Syntactic Tree Kernel as:

\kappa_T(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2)

where \Delta(n_1, n_2) = \sum_{i=1}^{|F|} \sum_{j=1}^{|F|} I_i(n_1) I_j(n_2) \kappa_F(f_i, f_j).
Obviously, the naive evaluation of this kernel would require even more computation and memory than for the naive computation of the standard kernel, as also all compatible pairs of tree fragments would need to be considered in the summation. Luckily, this enhanced kernel can be evaluated in the same way as the standard tree kernel by adding the following step

0. if n1 and n2 are pre-terminals and label(n1) = label(n2) then \Delta(n_1, n_2) = \lambda\, \kappa_S(ch^1_{n_1}, ch^1_{n_2}),

as the first condition of the Δ function definition (Section 4), where label(ni) is the label of node ni and κS is a term similarity kernel, e.g. based on the superconcept kernel defined in Section 3.2. Note that: (a) since n1 and n2 are pre-terminals of a parse tree they can have only one child (i.e. ch^1_{n_1} and ch^1_{n_2}) and such children are words, and (b) Step 2 is no longer necessary. Besides the novelty of taking into account tree fragments that are not identical, it should be noted that the lexical semantic similarity is constrained within syntactic structures, which limits errors/noise due to incorrect (or, as in our case, not provided) word sense disambiguation.
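The modification amounts to a single extra base case in the Δ recursion. The sketch below is self-contained (re-declaring the toy tree encoding of the earlier sketch) and adds this step 0; the term-similarity function is a toy stand-in for the normalised superconcept kernel κS.

```python
# Extending the Delta recursion with the SSTK "step 0": compatible pre-terminals
# (same label) contribute lambda times the term similarity of their words instead
# of requiring an exact lexical match, and the old step 2 is no longer needed.
def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)

def production(node):
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def delta_sstk(n1, n2, term_sim, lam=0.4):
    if is_preterminal(n1) and is_preterminal(n2) and n1[0] == n2[0]:   # step 0
        return lam * term_sim(n1[1], n2[1])
    if production(n1) != production(n2):                               # step 1
        return 0.0
    k = lam                                                            # step 3
    for c1, c2 in zip(n1[1:], n2[1:]):
        k *= 1.0 + delta_sstk(c1, c2, term_sim, lam)
    return k

def toy_sim(w1, w2):
    # Stand-in for a normalised term-similarity kernel such as kappa_S.
    if w1 == w2:
        return 1.0
    return 0.9 if {w1, w2} == {"cat", "tomcat"} else 0.0

n_cat, n_tomcat, n_note = ("N", "cat"), ("N", "tomcat"), ("N", "note")
print(delta_sstk(n_cat, n_tomcat, toy_sim), delta_sstk(n_cat, n_note, toy_sim))
```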
6 Experimental Evaluation
In a series of experiments we aimed at showing that our approach is effective for IR and text mining applications. For this purpose, we experimented with the TREC question classification corpus for advanced retrieval based on the Question Answering paradigm.
6.1 Experimental Setup
The long tradition of QA in TREC has produced a large question set used by several researchers which can be exploited for experiments on Question Classification. According to [8], we can define question classification “to be the task
that, given a question, maps it to one of k classes, which provide a semantic constraint on the sought-after answer". Such questions are categorized according to different taxonomies of different grains. We consider the same dataset and classification problem as described in [18,8]. The dataset consists of free text questions and is freely available at http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/. It is divided into 5,500 questions for training (selected from the 4500 English questions published by USC (Hovy et al., 2001), 500 questions annotated for rare classes and the 894 questions from TREC 8 and TREC 9) and the 500 TREC 10 questions for testing. Each of these questions is labeled with exactly one class of the coarse grained classification scheme (see [18]) consisting of the following 6 classes: Abbreviations, Descriptions (e.g. definition and manner), Entity (e.g. animal, body and color), Human (e.g. group and individual), Location (e.g. city and country) and Numeric (e.g. code and date). We have implemented the kernels introduced in Sections 3–5 within the SVMlight-TK software available at ai-nlp.info.uniroma2.it/moschitti/, which encodes tree kernel functions in SVM-light [19]. In all experiments, we used the noun hierarchy of WordNet (http://wordnet.princeton.edu/) as the underlying semantic network. For word sense disambiguation, we used a simplifying assumption in mapping each term to its most frequent noun sense (if it exists). Note that this approach implies an inherent word sense disambiguation side effect, likely to have a negative impact on the results. The results can thus also be seen as a pessimistic estimate. Kernel similarities that were undefined because of a missing mapping to a noun synset were implicitly assumed to take the default values (i.e. zero for distinct and one for identical terms, respectively).
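The paper does not state which WordNet interface was used; purely as an illustration, the mapping to the most frequent noun sense and the default values for unmapped terms could be realised with NLTK's WordNet API (which lists synsets in frequency order), as sketched below.

```python
# Illustrative only: map each term to its most frequent WordNet noun sense and
# fall back to the default similarity values when a term has no noun synset.
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def most_frequent_noun_sense(term):
    synsets = wn.synsets(term, pos=wn.NOUN)   # ordered by sense frequency
    return synsets[0] if synsets else None

def term_similarity(t1, t2):
    s1, s2 = most_frequent_noun_sense(t1), most_frequent_noun_sense(t2)
    if s1 is None or s2 is None:
        return 1.0 if t1 == t2 else 0.0       # default values for unmapped terms
    return s1.wup_similarity(s2) or 0.0

print(term_similarity("cat", "tomcat"), term_similarity("cat", "xyzzy"))
```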
6.2 Evaluation of Superconcept Smoothing Kernels
In a first experiment we investigated the effect of simply smoothing term features based on the idea of the Semantic Smoothing Kernel. Questions were preprocessed using the usual steps, namely tokenization, lemmatization and stopword removal, leading to a total number of 8,075 distinct TFIDF-weighted features. We performed binary classification experiments on each of the 6 question types for different settings of the ’soft margin’ parameter c. Table 3 summarizes the absolute macro F1 as well as the micro F1 values obtained in the question classification setting. The best values per setting of c are highlighted. The results indicate that the smoothing of the term features considerably affects the performance compared to the simple linear kernel baseline. According to the results, the lin scheme seems to achieve the best overall performance with a relative improvement of 9.32% for the macro F1 value in the case of c = 3 (i.e. the setting for which the linear kernel achieves its maximum). For a more detailed description of these experiments, refer to [7]. For QC, however, the traditional bag-of-words (bow) appears to be somewhat insufficient as it ignores structure and function words that are likely to influence the results.
7
http://l2r.cs.uiuc.edu/∼ cogcomp/Data/QA/QC/ These are selected from the 4500 English questions published by USC (Hovy et al., 2001), 500 questions annotated for rare classes and the 894 questions from TREC 8 and TREC 9. http://wordnet.princeton.edu/
Table 3. Absolute macro and micro F1 results for QC, for different values of c and different semantic smoothing kernels. The best results per setting of c are highlighted.
Macro-averaging (soft margin parameter c)
kernel    0.1    0.2    0.3    1.0    2.0    3.0
linear    0.21   0.38   0.47   0.62   0.63   0.64
full      0.38   0.49   0.55   0.61   0.61   0.68
full-ic   0.53   0.53   0.53   0.62   0.55   0.55
path-1    0.25   0.42   0.51   0.64   0.64   0.64
path-2    0.22   0.39   0.47   0.63   0.65   0.64
lin       0.36   0.49   0.56   0.64   0.62   0.70
wup       0.34   0.49   0.54   0.62   0.61   0.69

Micro-averaging (soft margin parameter c)
kernel    0.1    0.2    0.3    1.0    2.0    3.0
linear    0.09   0.25   0.34   0.55   0.57   0.58
full      0.27   0.38   0.45   0.55   0.56   0.68
full-ic   0.47   0.46   0.47   0.60   0.49   0.48
path-1    0.14   0.32   0.40   0.57   0.58   0.59
path-2    0.08   0.28   0.37   0.57   0.59   0.58
lin       0.27   0.37   0.47   0.57   0.57   0.69
wup       0.23   0.37   0.45   0.56   0.56   0.68
Table 4. Accuracy of SSTK on QC, for different values of λ parameter and different semantic smoothing kernels. The best λ settings are highlighted.

                   Accuracy for different values of the λ parameter
kernel             λ=0.4   λ=0.05  λ=0.01  λ=0.005 λ=0.001
linear (bow)       0.905 (does not depend on λ)
string matching    0.890   0.910   0.914   0.914   0.912
full               0.904   0.924   0.918   0.922   0.920
full-ic            0.908   0.922   0.916   0.918   0.918
path-1             0.906   0.918   0.912   0.918   0.916
path-2             0.896   0.914   0.914   0.916   0.916
lin                0.908   0.924   0.918   0.922   0.922
wup                0.908   0.926   0.918   0.922   0.922

6.3 Evaluation of Syntactic Semantic Tree Kernels
In these experiments, we used the same experimental setup as used in [18], as it contains the most comprehensive comparison of experiments on the QC corpus introduced above. As literature results are given in terms of the accuracy of the multi-classification of the TREC questions, to compare with them, we designed a multiclassifier based on the scores of the SVMs. For each question, we selected the class associated with the maximum score. In this experiment, we compared the linear kernel based on bag-of-words (this time including function words), the original STK and the new SSTK as introduced in Section 5 with different term similarities (we again used the superconcept kernels as term similarities; however, in contrast to the previous section, we used normalized versions). The question parse trees were obtained by running Charniak’s parser. Table 4 reports the results of the multiclassifier based on SSTK. Column 1 shows the type of similarity used in the semantic syntactic tree kernel function, where string matching means
We again used the superconcept kernels as term similarities, however, in contrast to the previous section we used normalized versions.
Combined Syntactic and Semantic Kernels for Text Classification
317
that the original tree kernel is used. Columns from 2 to 6 report the accuracy of the multiclassifier according to several values of the λ parameter9 (see Section 5). We note that (a) our basic Syntactic Tree Kernel improves the state-of-the-art accuracy, i.e. 91.4% vs. 90% of [18], (b) this is further improved when we use one of the semantic smoothing kernel and (c) the Wu-Palmer similarity achieves the highest accuracy, i.e. 92.6%.
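The multiclassification scheme described above (one binary SVM per question class, then the class with the maximum score) can be sketched as follows; the score matrix here is purely hypothetical and stands in for the decision values of the trained binary SVMs.

```python
import numpy as np

def multiclass_from_binary_scores(scores):
    """scores: (n_questions, n_classes) decision values, one column per
    binary one-vs-rest SVM; returns the highest-scoring class per question."""
    return np.argmax(scores, axis=1)

def accuracy(predicted, gold):
    return float(np.mean(predicted == gold))

# hypothetical decision values for 4 questions and 3 classes
scores = np.array([[ 1.2, -0.3,  0.1],
                   [-0.5,  0.8,  0.2],
                   [ 0.0,  0.1,  0.9],
                   [ 0.4,  0.3, -1.0]])
gold = np.array([0, 1, 2, 0])
pred = multiclass_from_binary_scores(scores)
print(accuracy(pred, gold))  # 1.0 on this toy example
```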
7 Conclusion
In this paper, we have investigated how the syntactic structures of natural language texts can be exploited simultaneously with semantic background knowledge on term similarity. For this purpose, we have proposed a new family of kernels called Semantic Syntactic Tree Kernels (SSTK), which is based on Tree and Semantic Smoothing Kernels. We have motivated this class of kernels by counting all compatible tree fragments of two parse trees weighted by their joint terminology. To our knowledge, no other work has so far combined the syntactic and semantic properties of natural language in such a principled way. We conducted a series of experiments on the TREC question classification data. Our results indicate that the newly proposed Semantic Syntactic Tree Kernels outperform the conventional linear/semantic kernels as well as tree kernels, improving the state of the art in Question Classification. In the future, it would be interesting to study different models of lexical semantics, e.g. latent semantic kernels, with kernels based on different syntactic/semantic structures.
References
1. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1) (2002) 1–47
2. Joachims, T.: Text categorization with Support Vector Machines: learning with many relevant features. In: Proceedings of ECML, Chemnitz, DE (1998)
3. Collins, M., Duffy, N.: Convolution kernels for natural language. In: NIPS, MIT Press (2001)
4. Moschitti, A.: Efficient convolution kernels for dependency and constituent syntactic trees. In: Proceedings of ECML, Berlin, Germany (2006)
5. Siolas, G., d'Alche Buc, F.: Support Vector Machines based on a semantic kernel for text categorization. In: IJCNN. Volume 5 (2000)
6. Mavroeidis, D., Tsatsaronis, G., Vazirgiannis, M., Theobald, M., Weikum, G.: Word sense disambiguation for exploiting hierarchical thesauri in text classification. In: PKDD (2005)
7. Bloehdorn, S., Basili, R., Cammisa, M., Moschitti, A.: Semantic kernels for text classification based on topological measures of feature similarity. In: Proceedings of ICDM (2006)
8. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics (COLING) (2002)
9. Vapnik, V., Golowich, S.E., Smola, A.J.: Support vector method for function approximation, regression estimation and signal processing. In: NIPS (1996)
10. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
11. Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent Semantic Kernels. Journal of Intelligent Information Systems 18(2-3) (2002) 127–152
12. Basili, R., Cammisa, M., Moschitti, A.: A semantic kernel to exploit linguistic knowledge. In: AI*IA: Advances in Artificial Intelligence (2005)
13. Zelenko, D., Aone, C., Richardella, A.: Kernel methods for relation extraction. Journal of Machine Learning Research (2003)
14. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Proceedings of ACL (2004)
15. Cumby, C., Roth, D.: Kernel methods for relational learning. In: Proceedings of the Twentieth International Conference (ICML 2003) (2003)
16. Moschitti, A.: A study on convolution kernels for shallow semantic parsing. In: Proceedings of ACL (2004)
17. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1) (2006) 13–47
18. Zhang, D., Lee, W.S.: Question classification using Support Vector Machines. In: Proceedings of SIGIR (2003)
19. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods (1999)
Fast Large-Scale Spectral Clustering by Sequential Shrinkage Optimization

Tie-Yan Liu (1), Huai-Yuan Yang (2), Xin Zheng (3), Tao Qin (3), and Wei-Ying Ma (1)
(1) Microsoft Research Asia, 4F, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100080, P.R. China
(2) Peking University, Beijing, 100871, P.R. China
(3) Tsinghua University, Beijing, 100084, P.R. China
* This work was performed at Microsoft Research Asia.
Abstract. In many applications, we need to cluster large-scale data objects. However, some recently proposed clustering algorithms such as spectral clustering can hardly handle large-scale applications due to the complexity issue, although their effectiveness has been demonstrated in previous work. In this paper, we propose a fast solver for spectral clustering. In contrast to traditional spectral clustering algorithms that first solve an eigenvalue decomposition problem, and then employ a clustering heuristic to obtain labels for the data points, our new approach sequentially decides the labels of relatively well-separated data points. Because the scale of the problem shrinks quickly during this process, it can be much faster than the traditional methods. Experiments on both synthetic data and a large collection of product records show that our algorithm can achieve significant improvement in speed as compared to traditional spectral clustering algorithms.
1 Introduction
Clustering, which refers to the unsupervised identification of groups of similar data objects, has been a long-studied research topic in the machine learning community, and many different clustering algorithms have been proposed. Among these algorithms, the recently-proposed spectral clustering algorithm is one of the most promising methods, especially for datasets with latent structure [2][8]. However, in many applications nowadays, we need to cluster large-scale data objects. In this scenario, traditional spectral clustering algorithms suffer from high computational complexity, which restricts their wide application. To our knowledge, most of the existing work on spectral clustering has focused only on the theoretical part and left the computational issues to numerical solvers. As a result, there is no efficient implementation of spectral clustering. In order to tackle this problem, we need to understand why spectral clustering is time consuming. Traditional spectral clustering algorithms consist of two steps: 1) relax the original formulation of spectral clustering, which is a discrete
optimization problem, to a continuous optimization problem, and solve it to get the continuous embedding of the data objects by eigenvalue decomposition (EVD); 2) employ a clustering heuristic, such as k-means, to obtain the discrete labels from the continuous embedding. The computational complexities of both steps are not negligible. Step 1 (EVD) is an expensive process even with state-of-the-art EVD solvers. For example, the computational complexity of the Lanczos [4] and Preconditioned Conjugate Gradient methods [3][5][6] is O(kn^2 m), where k is the number of eigenvectors to be computed, n is the number of data points and m is the number of iterations. It is clear that this is a heavy computational load. In Step 2, if k-means [13] is employed, the computational complexity is O(knr), where r is the number of iterations. This is also a considerable part of the overall complexity. Based on the above analysis, if we want to accelerate spectral clustering, we need to accelerate the EVD solver, and/or save the computations of the clustering heuristic such as k-means. For this purpose, we monitored the iterative process of spectral clustering. We found that some data points converge to their true embeddings quickly, while it takes a relatively longer time for others to converge. Furthermore, it is those data points which are easy to separate (i.e. they have very strong correlations with some points and very weak correlations with others) that converge fast. Based on this observation, we propose that it is not necessary to treat all the data points equally while computing the solution of spectral clustering: the iterations for the well-separated points can stop early while others stop later. In addition, considering that the original clustering task is actually defined in the discrete space (i.e. what we want is the label of the data objects rather than the embeddings), we even argue that it is not necessary to spend so much time on getting a precise solution of the continuous relaxation. Instead, once the embeddings are good enough for the clustering purpose, we can stop without any influence on the final results. Based on the above discussions, we propose a very fast spectral clustering algorithm in this paper. Technically, instead of solving the relaxed continuous optimization and clustering the embeddings, we recover the discrete solution of the original problem directly by an iterative method. That is, we start the optimization process with an EVD solver. Note that, instead of using it to get the precise eigenvectors, we stop early after only a few iterations. Then we identify some already-well-converged data points, and fix their embedding values to discrete labels. Their values will not be changed in the future iterations of the optimization, so the scale of the subsequent optimization problem is reduced. By using this shrinkage strategy sequentially, we can get the discrete labels of all the data points eventually. In such a way, we not only avoid over-computing a precise EVD solution, but also avoid the additional computations of the clustering heuristics. Experimental results show that our algorithm can achieve a significant improvement in speed as compared with traditional spectral clustering algorithms. The rest of the paper is organized as follows: in Section 2, some concepts of spectral clustering are reviewed; in Section 3, our algorithm, named Sequential
Shrinkage Optimization (SSO), is introduced. Experimental results are reported in Section 4, which show that our algorithm can improve the efficiency of spectral clustering considerably without significant accuracy reduction. Conclusions and future work are given in Section 5.
2 Review of Spectral Clustering
In spectral clustering, the dataset to be clustered is usually represented by a weighted graph G = (V, E, W), where V is the set of data objects; E is the set of edges between data objects, with e_{ij} denoting the weight of an edge in E; and W denotes a diagonal matrix whose diagonal elements are the weights of the data objects. We further define the adjacency matrix M of the graph G as

M_{ij} = e_{ij} if \langle i, j \rangle \in E, and M_{ij} = 0 otherwise.    (1)

Then the clustering problem can be regarded as a graph partitioning problem, whose objective is to find clusters V1, V2, V3, ... such that the sum of the edge weights between different clusters is minimized.

2.1 Mathematical Formulation of Spectral Clustering
The simplest case of spectral clustering is two-way spectral clustering, which minimizes the objective function defined below:

obj(V1, V2) = \frac{cut(V1, V2)}{weight(V2)} + \frac{cut(V2, V1)}{weight(V1)}    (2)

where

cut(V1, V2) = \sum_{i \in V1, j \in V2, \langle i,j \rangle \in E} e_{ij}    (3)

weight(V_i) = \sum_{j \in V_i} W_j    (4)

Different definitions of W will yield different spectral clustering algorithms [2]. In particular, if the weight of a data object is defined by its weighted degree, the corresponding algorithm is called Normalized Cut; while if the weight is defined as the unweighted degree (number of edges), the corresponding algorithm is named Ratio Cut, etc. For simplicity, we will only discuss Normalized Cut in this paper, but all the results apply to other versions of spectral clustering. By defining indicators of the data objects as follows,

q^{(i)} = +\sqrt{\eta_2 / \eta_1} if i \in V1,   q^{(i)} = -\sqrt{\eta_1 / \eta_2} if i \in V2    (5)

where \eta_1 = weight(V1) and \eta_2 = weight(V2), we can rewrite the objective (2) as a Rayleigh quotient,

\min_q \frac{q^T L q}{q^T W q}, s.t. q^T W e = 0    (6)
where L = W − M. If we relax the above discrete optimization problem to a continuous version, the solution q is equal to the eigenvector associated with the second smallest eigenvalue of the generalized eigenvalue problem Lv = λWv. Similarly, one can prove that k-way spectral clustering corresponds to solving for the k smallest eigenvalues and their corresponding eigenvectors.
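For orientation, a dense, minimal sketch of the conventional two-step pipeline that this paper takes as its starting point is given below: solve Lv = λWv, take the second smallest eigenvector, and threshold it for the two-way case. This is an illustration only (pure numpy, no sparsity), not the authors' implementation; it assumes all nodes have positive weighted degree.

```python
import numpy as np

def two_way_normalized_cut(M):
    """M: symmetric adjacency matrix (Eq. 1). Returns a 0/1 label vector from
    the second smallest generalized eigenvector of L v = lambda W v, where
    W = diag(weighted degrees) and L = W - M."""
    deg = M.sum(axis=1)                       # weighted degree of each node
    L = np.diag(deg) - M
    # reduce L v = lambda W v to a standard symmetric problem via W^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(deg)           # assumes no isolated nodes
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    v = d_inv_sqrt * eigvecs[:, 1]            # back-transform: v = W^{-1/2} u
    return (v > 0).astype(int)                # threshold the indicator at zero
```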
2.2 Existing Solvers for Spectral Clustering
As mentioned in the introduction, most spectral clustering algorithms leave the EVD problem to existing numerical EVD solvers. Although EVD has been extensively investigated for decades, it is still the most time-consuming component of spectral clustering. State-of-the-art EVD solvers can be categorized into three classes: local EVD solvers, global EVD solvers and extreme EVD solvers. The power method is the simplest local EVD solver, which only computes the maximum eigenvalue and its associated eigenvector [1]. The QR-decomposition method is a typical global EVD solver [4], which computes all the eigenvalues and eigenvectors. Extreme EVD solvers only compute several extreme (smallest or largest) eigenvalues and the corresponding eigenvectors [12]; typical examples include the Lanczos [10] and Preconditioned Conjugate Gradient (PCG) methods [5][6][12]. Considering the process of spectral clustering, it is clear that extreme EVD solvers are most suitable for spectral clustering. To our knowledge, the computational complexity of Lanczos and PCG is O(kn^2 m), where k is the number of eigenvalues to compute, n is the number of data objects and m is the number of iterations. Besides, the obtained eigenvectors still have to be clustered, which yields an additional computational load. For example, if we employ k-means to fulfill this task, another O(knr) computational complexity is added, where r denotes the number of iterations. In order to make spectral clustering algorithms scalable enough for real-world applications, we need to accelerate both of the aforementioned computations. This is the motivation of our proposed sequential shrinkage optimization method.
3 Sequential Shrinkage Optimization (SSO)
In this section, we describe our fast implementation of spectral clustering, i.e. Sequential Shrinkage Optimization (SSO). First of all, we recall the Conjugate Gradient (CG) algorithm and then propose an SSO algorithm based on non-linear CG. After that, we provide an even faster SSO algorithm based on linear CG.

3.1 Linear and Non-linear Conjugate Gradient (CG)
CG [4] has been one of the most important algorithms in numerical algebra and continuous optimization. The original (linear) CG algorithm solves the following quadratic optimization problem:

\min f(q) = \frac{1}{2} q^T A q - b^T q + c    (7)
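For reference, a textbook sketch of the linear CG iteration for problem (7) (equivalently, solving Aq = b for symmetric positive definite A) is shown below; this is a generic, unpreconditioned illustration, not the authors' Fortran implementation.

```python
import numpy as np

def conjugate_gradient(A, b, q0=None, tol=1e-8, max_iter=1000):
    """Minimize f(q) = 0.5 q^T A q - b^T q for symmetric positive definite A,
    i.e. solve A q = b, by the standard (unpreconditioned) CG iteration."""
    n = b.shape[0]
    q = np.zeros(n) if q0 is None else q0.astype(float).copy()
    r = b - A @ q                      # residual = negative gradient of f at q
    p = r.copy()                       # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact line search along p
        q += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # next A-conjugate direction
        rs_old = rs_new
    return q
```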
After applying some preconditioning techniques, Preconditioned CG (PCG) is recognized as one of the state-of-the-art methods for quadratic optimization. In the rest of this paper, we use CG and PCG interchangeably, because all the preconditioning techniques can be applied to any version of CG. Although (linear) CG is originally designed for the quadratic problem (7), many CG-based algorithms have also been developed to solve general continuous optimization problems. These algorithms are called non-linear CG (NLCG) [4][7]. As a special case, the generalized EVD problem can also be solved by NLCG, because it is equivalent to the following continuous optimization problem:

\min_q q^T A q, s.t. q^T W q = 1    (8)
To our knowledge, many variations of NLCG-based EVD solvers are among the state-of-the-art EVD solvers, including PCG [5], LOBPCG [5][6], etc. Lanczos, the most famous extreme EVD solver, has also been shown to be a kind of NLCG method [4]. To summarize, both the CG optimizer and NLCG-based EVD solvers are the state of the art in their fields. Therefore, we will base our further discussions upon these two algorithms.

3.2 Non-linear SSO
In this subsection, we propose a fast spectral clustering algorithm, which we name non-linear sequential shrinkage optimization (NLSSO). When using NLCG to solve for eigenvectors, some interesting and informative phenomena imply that it is possible to accelerate the clustering procedure. We monitored the process by which the variable q of problem (6) gradually approximates the precise eigenvector of Lv = λWv. We observed that the variables corresponding to well-separated data objects converge relatively faster than the other variables, and their absolute values are usually larger than those of the other variables. This shows that these variables can be easily identified, and the optimization over them can then be stopped earlier without much influence on the final clustering results. In this way, the scale of the subsequent optimization can be reduced and the efficiency can be improved. This is not only an implementation trade-off: it is also reasonable from a theoretical point of view, as elaborated in the next subsection. Based on the above observations, our NLSSO method works as follows. In the beginning, we start with NLCG to optimize the objective (6). After a few iterations (e.g. < 1% of the problem scale n), we obtain an initial guess of the final solution q. According to this initial guess, we fix the largest positive variables to a certain positive value (positive label) and the smallest negative variables to a certain negative value (negative label). In other words, the well-separated points are fixed early and are not further optimized. In particular, we denote by q1 and q2 the fixed and unfixed parts of q, containing p and n−p elements respectively. Without loss of generality, we suppose that

q = [q1, q2]^T    (9)
Since the solution q should be conjugate orthogonal to e (see (6)), when we fix the values of the variables in q1, we need to guarantee that at the end of each iteration step, q1 is conjugate orthogonal to e1 [9], i.e.

q1^T W e1 = 0    (10)

For this purpose, for each fixed variable i, we set its value as follows:

q1^{(i)} = +\sqrt{\eta_2 / \eta_1} if q1^{(i)} > 0,   q1^{(i)} = -\sqrt{\eta_1 / \eta_2} if q1^{(i)} < 0    (11)

where \eta_1 = \sum_{q1^{(i)} > 0, i \in V_F} W_i and \eta_2 = \sum_{q1^{(i)} < 0, i \in V_F} W_i. After that, according to (9), we divide the matrices L and W into blocks as below,

L = \begin{pmatrix} L1 & L12 \\ L21 & L2 \end{pmatrix},   W = \begin{pmatrix} W1 & 0 \\ 0 & W2 \end{pmatrix}    (12)

where L1, L12, L21, L2 are p-by-p, p-by-(n−p), (n−p)-by-p, and (n−p)-by-(n−p) matrices, and rewrite the objective function (6) as a function of q2:

\min \frac{[q1; q2]^T \begin{pmatrix} L1 & L12 \\ L21 & L2 \end{pmatrix} [q1; q2]}{[q1; q2]^T \begin{pmatrix} W1 & 0 \\ 0 & W2 \end{pmatrix} [q1; q2]}, s.t. [q1; q2]^T \begin{pmatrix} W1 & 0 \\ 0 & W2 \end{pmatrix} [e1; e2] = 0    (13)

or, equivalently,

\min T(q2) = \frac{q2^T L2 q2 + 2 q1^T L12 q2 + q1^T L1 q1}{q2^T W2 q2 + q1^T W1 q1}    (14)

s.t. q2^T W2 e2 + q1^T W1 e1 = 0    (15)
As mentioned above, equation (10) holds because of the settings in (11). Therefore, constraint (15) is gradually satisfied as more and more variables are fixed. Approximately speaking, we leave constraint (15) aside; then problem (14) itself becomes a continuous optimization problem, and it is clear that this problem can be solved using the NLCG algorithm. If we solve (14) with NLCG, after another several iterations we can fix more variables according to the same criterion as mentioned above. In this way, the scale of the optimization problem is reduced throughout the whole optimization procedure. Moreover, the labels of the data objects are naturally obtained without further clustering (see (11)).
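A schematic sketch of the fixing step only (Eqs. (9)-(12)) is given below: given the current embedding q and the node weights on the diagonal of W, the entries with the largest magnitudes are fixed to the two discrete values of Eq. (11), and the blocks needed for the reduced problem (14) are extracted. The fraction fixed per round, the dense-matrix setting and all names are assumptions of this illustration; the NLCG optimizer itself is omitted.

```python
import numpy as np

def fix_well_separated(q, w_diag, L, frac=0.5):
    """One sequential-shrinkage fixing step.
    q      : current embedding of the still-free variables
    w_diag : corresponding diagonal entries of W (node weights)
    L      : corresponding block of the Laplacian L = W - M
    frac   : fraction of the remaining variables to fix in this round
    Returns the fixed values q1, the index sets, and the blocks L1, L12, L2.
    Assumes both signs occur among the fixed entries (else eta1 or eta2 is 0)."""
    n = q.shape[0]
    p = max(1, int(frac * n))
    fixed = np.argsort(-np.abs(q))[:p]        # largest |q(i)| converge first
    free = np.setdiff1d(np.arange(n), fixed)

    pos_mask = q[fixed] > 0
    eta1 = w_diag[fixed[pos_mask]].sum()      # sum of W_i over fixed positives
    eta2 = w_diag[fixed[~pos_mask]].sum()     # sum of W_i over fixed negatives

    q1 = np.empty(p)
    q1[pos_mask] = +np.sqrt(eta2 / eta1)      # Eq. (11); keeps q1^T W e1 = 0
    q1[~pos_mask] = -np.sqrt(eta1 / eta2)

    # blocks of Eq. (12), used in the reduced objective (14)
    L1 = L[np.ix_(fixed, fixed)]
    L12 = L[np.ix_(fixed, free)]
    L2 = L[np.ix_(free, free)]
    return q1, fixed, free, L1, L12, L2
```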
3.3 Linear SSO
Although NLSSO can speed up spectral clustering, the nonlinear nature of NLCG still makes NLSSO inefficient. Actually, the reason why we use NLCG to
solve problem (14) is that this problem is beyond the capability of linear CG. In order to tackle this challenge, we propose adopting a linearization strategy so that the NLCG in each step of NLSSO can be replaced with linear CG, which is much faster. In particular, we remove the denominator of (14) and only preserve its numerator, which has a similar format to (7):

H(q2) = q2^T L2 q2 + 2 q1^T L12 q2 + q1^T L1 q1    (16)
We can prove that under certain conditions, problem (14) and problem (16) are almost equivalent. More precisely, the optimum of problem (16), q2*, and the optimum of problem (14), q2**, satisfy q2** = λ q2*. It is not difficult to see that this kind of scaling transformation does not affect the clustering results. Therefore, in this way we actually obtain a linear optimization algorithm which leads to almost the same solution as the original non-linear problem. In the following paragraphs, we discuss under which condition (14) and (16) are equivalent. For this purpose, we introduce Proposition 1.

Proposition 1. Denote by q2** the optimum of (14), and by q2* the optimum of (16). Then we have q2** ≈ λ q2*, if W2 L2^{-1} ≈ I.

Proof. Omitted due to space restrictions.

It is clear that if L2 is strongly diagonally dominant, we will have W2 L2^{-1} ≈ I. However, in most cases this is not naturally guaranteed. To solve this problem, we propose using a preprocessing step that makes the condition hold, and we prove that such preprocessing does not change the final embeddings of spectral clustering, owing to Proposition 2.

Proposition 2. If q and λ are a pair of eigenvector and eigenvalue of the generalized eigenvalue problem

Lq = λWq    (17)

then q is also an eigenvector of the following generalized eigenvalue problem:

(L + tW) q = \frac{λ + t}{t + 1} (W + tW) q    (18)

Proof. Omitted due to space restrictions.

With Proposition 2, we can add tW to both W and L with a sufficiently large t, without affecting the resulting eigenvectors. This makes L diagonally dominant and thus W2 L2^{-1} ≈ I, i.e. the condition of Proposition 1 is satisfied. As a result, except for the first NLCG initialization, all other NLCG runs can be replaced with CG. In this way, the speed of convergence can be improved considerably, since the resulting problem (16) is quadratic and CG performs very well on it.
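A sketch of one shrinkage round after the linearization is shown below: shift L by tW (Proposition 2) so that L2 is diagonally dominant, then minimize the quadratic (16) over the free variables with plain CG, which reduces to solving L2 q2 = -L12^T q1. The choice of t and the reuse of the conjugate_gradient and fix_well_separated sketches above are assumptions of this illustration, not the authors' exact procedure.

```python
import numpy as np

def linear_sso_round(L, w_diag, q, q1, fixed, free, t=10.0):
    """One Linear SSO round on the free variables.
    Minimizing H(q2) = q2^T L2 q2 + 2 q1^T L12 q2 + const (Eq. 16) amounts to
    solving the linear system L2 q2 = -L12^T q1, handled here by plain CG."""
    # Proposition 2: adding t*W to L (and W) leaves the eigenvectors unchanged
    # while making L, and hence its principal block L2, diagonally dominant.
    L_shift = L + t * np.diag(w_diag)

    L2 = L_shift[np.ix_(free, free)]     # (n-p) x (n-p), symmetric positive definite
    L12 = L_shift[np.ix_(fixed, free)]   # p x (n-p)

    rhs = -L12.T @ q1                    # gradient of H w.r.t. q2 set to zero
    q2 = conjugate_gradient(L2, rhs, q0=q[free])  # warm start from current embedding
    return q2
```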
4 Experimental Results
In this section, we report experimental results that show the efficiency and effectiveness of our proposed LSSO algorithm. For comparison, CG-based and Lanczos-based Normalized Cut methods were also tested. Note that since we mainly discuss the two-way clustering problem in this work, all the algorithms are compared based on their two-way versions. LSSO and the CG-based EVD solver were implemented in Fortran 90 without any advanced code optimization. The Lanczos-based EVD solver was implemented by the function eigs() in MATLAB 6.5; in order to make it as fast as possible, we generated a C++ project using Visual C++ 6.0 which calls the function eigs() from the MATLAB Library, which has been commercially optimized. In our experiments, two criteria are used for evaluation: error rate and time cost. The error rate is defined as the ratio of misclassified data points with respect to the ground-truth labels, and time cost is measured in seconds. In the first experiment, we show the working procedure of our algorithm on a toy problem, and then compare the precision and speed of our LSSO algorithm with the reference algorithms on a larger and more complicated synthetic dataset. In the second experiment, we use a real-world product database from a commercial e-shopping website for further evaluation.

4.1 Toy Problems
In this subsection, we first constructed a two-circle dataset with 500 data points (see Fig.1) to show how our approach works. Note that the former 300 points belong to the outer circle, while the latter 200 belong to the inner circle.
Fig. 1. A two-circle dataset
For this dataset, the second smallest eigenvector of Lv = λWv is plotted as a dashed line, and the clustering label as a solid line, in Fig.2(a). We then plot the working procedure of LSSO in Fig.2(b)-(d). In these subfigures, the x-axis indexes the data points, and the y-axis represents their embedding values.
Fig. 2. Working procedure of EVD-based spectral clustering (a) and LSSO at the end of iterations 3 (b), 6 (c) and 16 (d)
Fig. 3. Two extreme cases in the Density Datasets: (a) the No.1 Density Dataset and (b) the No.20 Density Dataset
The unfixed variables are plotted as dashed lines, and the fixed ones as solid lines. We can see that at the end of iterations 3, 6, and 16, the set of unfixed variables sequentially shrinks and the fixed variables converge to the ground-truth labels. Secondly, we compare LSSO with the CG-based and Lanczos-based spectral clustering methods. For this purpose, we constructed a group of datasets covering different kinds of data distributions. Each dataset contains two Gaussian balls in the 2-D Euclidean space. For ease of reference, we refer to these datasets as "Density Datasets". Each ball contains 2000 data points, and the distance between the centers of the two balls is 20. The variance of the right ball is fixed to 1, while the variance of the left ball changes from 1/10 to 10 so as to generate 20 datasets. The two extreme datasets (the No.1 dataset and the No.20 dataset) are shown in Fig.3, and the comparison of error rate / time cost is plotted in Fig.4, in which the x-axis represents the number of the dataset in this group (i.e. from No.1 to No.20).
Fig. 4. (a) Error rate and (b) time cost comparison for the Density Datasets
From the above experimental results we can see that, on the one hand, LSSO has similar or even better accuracy (in terms of error rate) compared to the Lanczos- and CG-based spectral clustering methods. On the other hand, in most cases LSSO is much faster than the Lanczos- and CG-based methods.

4.2 Product Data Clustering
In this subsection, we report the experimental results on a real-world product database, which is the backend database of a commercial e-shopping website. There are over 5 million product records in this database, which have been manually classified into 20 categories. Each product record has a short description of no more than 256 words. For these descriptions, we first lower-cased the characters and then removed stop words such as articles, prepositions, pronouns and conjunctions. We then used term frequencies to build the feature vectors, so that the inner product in the vector space gives the weights in the adjacency matrix. In order to evaluate our algorithm, we sampled the product database and constructed a subset with 5000 records for each category. For ease of reference, this subset is named "Dataset 5000". Note that the records in this dataset may have multiple labels. We conducted experiments on every pair of categories, obtaining ten different sub-experiments. The error rate and time cost for each sub-experiment are listed in Fig.5, where the x-axis corresponds to the ID of the sub-experiment (see Table 1). Note that in the implementation of our LSSO method, we fix 50% of the remaining variables in each sequential shrinkage step. From Fig.5, we can see that our algorithm leads to a significant improvement in speed over the CG- and Lanczos-based spectral clustering methods. At the same time, the error rate of our approach is also lower than that of the CG- and Lanczos-based algorithms for some sub-experiments. This is very interesting, since we had expected that there would be some accuracy loss for LSSO because of the sequential shrinkage. The reason for this result may be that, when the well-separated data points are fixed, they serve as prior knowledge, so the subsequent clustering becomes semi-supervised and thus yields better clustering results.
Table 1. ID of experiments

ID   Pair of Categories
1    C1 - C2
2    C1 - C3
3    C1 - C4
4    C1 - C5
5    C2 - C3
6    C2 - C4
7    C2 - C5
8    C3 - C4
9    C3 - C5
10   C4 - C5
Fig. 5. (a) Error rate and (b) time cost comparison for Dataset 5000
To sum up, through the above experiments we find that LSSO is better than the reference algorithms in most cases, especially for large-scale datasets, in terms of both faster speed and higher performance. We also assert that our approach can scale up to much larger datasets if the sparseness of the data is exploited. Owing to the parallelizability of matrix-vector multiplication, our approach can also be easily applied in distributed systems.
5 Conclusions
Spectral clustering faces great challenges when handling large datasets, since both its space and computational complexities become bottlenecks. In this paper, we proposed a solution for speeding up spectral clustering, and demonstrated the effectiveness of our approach on many two-way clustering examples. Although we have not covered how our approach applies to k-way spectral clustering due to space restrictions, it is not very difficult to generalize the idea of sequential shrinkage optimization to the k-way case. Besides, since we only discuss how to accelerate spectral clustering in this paper, we will also work on reducing the space complexity of spectral clustering in the future. Our ultimate goal is to make spectral clustering not only a theoretically beautiful algorithm, but also a practically workable technique.
References
1. Bunk, B.: Conjugate Gradient Algorithm to Compute the Low-lying Eigenvalues of the Dirac Operator in Lattice QCD. Computer Physics Communication (1994)
2. Ding, C.: A Tutorial on Spectral Clustering. ICML 21 (2004)
3. Feng, Y. T., Owen, D. R. J.: Conjugate Gradient Methods for Solving the Smallest Eigenpair of Large Symmetric Eigenvalue Problems. International Journal for Numerical Methods in Engineering (1996)
4. Golub, G. H., Loan, C. F. V.: Matrix Computations. John Hopkins University Press (1996)
5. Knyazev, A. V.: Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method. SIAM Journal of Scientific Computing (2001)
6. Knyazev, A. V.: Preconditioned eigensolvers: practical algorithms. Technical Report UCD-CCM 143, University of Colorado at Denver (1999)
7. Nocedal, J., Wright, S. J.: Numerical Optimization. Springer Series in Operations Research (2000)
8. Ng, A. Y., Jordan, M. I., Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. NIPS 14 (2001)
9. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 888-905
10. Sorensen, D. C.: Implicitly Restarted Arnoldi/Lanczos Methods for Large Scale Eigenvalue Calculations. Parallel Numerical Algorithms (1995)
11. Stanimire, T., Langou, J., Canning, A., et al.: Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures. Technical Report UT-CS-05-559, The University of Tennessee, Knoxville (2005)
12. Yang, X. P., Sarkar, T. K., Arvas, E.: A Survey of Conjugate Gradient Algorithms for Solution of Extreme Eigen-problems of a Symmetric Matrix. IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (1989) 1550-1556
13. Yu, S. X., Shi, J. B.: Multi-class Spectral Clustering. ICCV 9 (2003)
A Probabilistic Model for Clustering Text Documents with Multiple Fields

Shanfeng Zhu (1), Ichigaku Takigawa (1), Shuqin Zhang (2), and Hiroshi Mamitsuka (1)
(1) Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. {zhusf,takigawa,mami}@kuicr.kyoto-u.ac.jp
(2) Department of Mathematics, The University of Hong Kong, Hong Kong. [email protected]
Abstract. We address the problem of clustering documents with multiple fields, such as scientific literature with the distinct fields title, abstract, keywords, main text and references. By taking into consideration the distinct word distribution of each field, we propose a new probabilistic model, the Field Independent Clustering Model (FICM), for clustering documents with multiple fields. The benefits of FICM come not only from integrating the discrimination abilities of each field but also from the power of selecting the most suitable component probabilistic model for each field. We examined the performance of FICM on the problem of clustering biomedical documents with three fields (title, abstract and MeSH). From the genomics track data of TREC 2004 and TREC 2005, we randomly generated 60 datasets where the number of classes in each dataset ranged from 3 to 12. By applying the appropriate configuration of generative models for each field, FICM outperformed a classical multinomial model in 59 out of the total 60 datasets, of which 47 were statistically significant at the 95% level, and FICM also outperformed a multivariate Bernoulli model in 52 out of the total 60 datasets, of which 36 were statistically significant at the 95% level.
1 Introduction
Document clustering has become an increasingly important text mining and information retrieval technique due to the abundance of text documents, such as emails, news, web pages, and scientific articles [1]. As an unsupervised learning method, it can explore text collections without any prior knowledge by grouping documents into different topics, which assists in the navigation and location of the documents of interest. Clustering approaches can be divided into partitional (flat) clustering and hierarchical clustering [6]. In a partitional clustering approach, documents are binned into K disjoint groups (K often given as an a priori parameter) based on some optimization criterion. In a hierarchical clustering method, by contrast, the obtained clusters are organized into a tree structure. In terms of the algorithms used for clustering, we can also divide partitional clustering methods into two categories: discriminative (similarity-based) methods and generative (model-based) methods [14].
In contrast to similarity-based methods, (probabilistic) model-based methods usually have lower computational complexity, and they provide an intuitive explanation for each cluster through a corresponding model [14]. Various model-based clustering algorithms have already been proposed to tackle the problem of clustering high-dimensional and very sparse text documents, such as multivariate Bernoulli [8], multinomial [8], and von Mises-Fisher models [2]. Here we focus on probabilistic model-based partitional clustering algorithms. Existing model-based algorithms consider each document as an integrated object, whereas in many cases a document actually consists of several distinct fields. One typical example is scientific documents, which are composed of multiple fields such as title, abstract, keywords, main text, and references. Since each field has a distinguished role in presenting the idea of a document, these fields have different distributions over different vocabularies. In fact, the title, which consists of around 10 words, usually summarizes the topic of a document, while the abstract, which uses between 100 and 200 words, often briefly describes the motivation, a proposed solution, and experimental results. Assuming that each field has a distinct word distribution, we propose a new probabilistic model, which we call the Field Independent Clustering Model (FICM), where each field of a document is independently generated given the specific cluster to which the document belongs. In addition, each field can be modelled by the most suitable probabilistic model, instead of using the same model for all fields. Thus FICM takes advantage of both the discrimination ability of each field and the power of selecting the best model for each field. Rigouste et al. proposed the multinomial mixture model for document clustering, where each document is generated by a mixture of multinomial models [11]. In contrast, we focus on documents with multiple fields, where different fields can be generated by different models. We examined our model by applying it to the problem of clustering biomedical scientific documents, which has been gaining increasing interest in bioinformatics. The largest public biomedical literature database is MEDLINE [12], which provides many informative fields, including PubMed ID, Title, Authors, Institution, Source, Journal Subset, MeSH (Medical Subject Headings) and Abstract. Out of these fields, we selected the three most important ones: Title, Abstract and MeSH. MeSH is a controlled vocabulary thesaurus defined by the National Library of Medicine [10]; it includes a set of description terms organized in a hierarchical structure which are used to index articles in the MEDLINE database. Considering the wide usage of the classic multivariate Bernoulli and multinomial models, we compared the performance of FICM with these models in clustering 60 randomly generated datasets from the TREC 2004 and TREC 2005 genomics track data, where the number of classes in each dataset ranges from 3 to 12. Our experimental results showed that, with suitable configurations of generative models for each field, FICM outperformed the classical multivariate Bernoulli model and the multinomial model significantly in the majority of cases.
2 FICM for Clustering Documents with Multiple Fields
In this section, after introducing the classical multivariate Bernoulli model and multinomial model for clustering text documents, we propose FICM for clustering text documents with multiple fields.

2.1 Notations
We first give the notation used in this paper. Let D be a set of documents, and d be a random variable for a document in D. Let Z be the set of classes (topics) of D, and z be a random class variable in Z. C is the set of fields, and c is a variable representing a specific field in C; in this work, this can be the title, abstract or MeSH. Let D_c be the set of documents considering only field c, and d_c be field c (e.g. the title) of a document d. We denote by W the set of words appearing in D, by W_c the set of words appearing in D_c, and by w a word. Let N_{w,d} be the frequency of w in d, and N_{w,d_c} the frequency of w in d_c. Let B_{w,d} be 1 if w appears in d, and 0 otherwise; similarly, let B_{w,d_c} be 1 if w appears in d_c, and 0 otherwise.

2.2 Multivariate Bernoulli Model
The multivariate Bernoulli model only considers whether a word occurs or not in a document; that is, each document is represented by a binary vector. For a document d in class z, the probability of w appearing in d is p(w|z), while the probability of the absence of w from d is 1 − p(w|z). Each word in a document is assumed to be independently generated given the cluster the document belongs to, and each document in the data set is also assumed to be independently generated. We then try to maximize the log-likelihood of generating the whole set of documents D. The maximum likelihood estimators of this model can be obtained by the following EM (Expectation and Maximization) algorithm, which repeats the E- and M-steps alternately until some stopping condition is satisfied. In the M-step, we employ a Laplacian prior to avoid zero probabilities, which corresponds to maximum a-posteriori (MAP) parameter estimation; similar techniques will also be employed in the later models.

Probabilistic structure:

L(D) = \sum_{d \in D} \log p(d) = \sum_{d \in D} \log \sum_{z \in Z} p(z) p(d|z) = \sum_{d \in D} \log \sum_{z \in Z} p(z) \prod_{w \in d} p(w|z)^{B_{w,d}} (1 - p(w|z))^{1 - B_{w,d}}

E-step:

p(z|d) \propto p(z) p(d|z) = p(z) \prod_{w \in d} p(w|z)^{B_{w,d}} (1 - p(w|z))^{1 - B_{w,d}}

M-step (with Laplacian smoothing):

p(z) \propto \sum_{d \in D} p(z|d),   p(w|z) = \frac{1 + \sum_{d \in D} p(z|d) \cdot B_{w,d}}{2 + \sum_{d \in D} p(z|d)}
2.3 Multinomial Model
In contrast to the multivariate Bernoulli model, the multinomial model considers the number of occurrences of each word w in document d and assumes that, given a cluster z, it generates each word in a document independently under the constraint \sum_{w \in W} p(w|z) = 1. Note that both the multivariate Bernoulli model and the multinomial model treat all occurrences of a word in the same way, without considering in which field they occur. The probabilistic structure of the multinomial model and its corresponding E- and M-steps are shown below.

Probabilistic structure:

L(D) = \sum_{d \in D} \log p(d) = \sum_{d \in D} \log \sum_{z \in Z} p(z) p(d|z) = \sum_{d \in D} \log \sum_{z \in Z} p(z) \prod_{w \in d} p(w|z)^{N_{w,d}}

E-step:

p(z|d) \propto p(z) p(d|z) = p(z) \prod_{w \in d} p(w|z)^{N_{w,d}}

M-step (with Laplacian smoothing):

p(z) \propto \sum_{d \in D} p(z|d),   p(w|z) = \frac{1 + \sum_{d \in D} p(z|d) \cdot N_{w,d}}{|W| + \sum_{w' \in W} \sum_{d \in D} p(z|d) \cdot N_{w',d}}
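A compact sketch of the multinomial mixture EM above is given below, operating on a dense document-term count matrix; the E-step is done in log space for numerical stability and the M-step uses Laplacian smoothing as in the formulas. Array names and the random initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def multinomial_em(N, K, n_iter=30, seed=0):
    """N: (n_docs, n_words) term-frequency matrix; K: number of clusters.
    Returns soft assignments p(z|d) and the parameters p(z), p(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    resp = rng.random((n_docs, K))                 # random soft init of p(z|d)
    resp /= resp.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step with Laplacian smoothing
        pz = resp.sum(axis=0) / n_docs             # p(z)
        pwz = 1.0 + resp.T @ N                     # (K, n_words): 1 + sum_d p(z|d) N_{w,d}
        pwz /= pwz.sum(axis=1, keepdims=True)      # denominator |W| + sum_{w'} sum_d ...
        # E-step in log space: log p(z|d) ~ log p(z) + sum_w N_{w,d} log p(w|z)
        log_post = np.log(pz)[None, :] + N @ np.log(pwz).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, pz, pwz
```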
2.4 FICM
The basic idea of FICM is that each component field of a document is independently generated by an underlying probabilistic model given the cluster to which the document belongs. Although in practice a document may not fully obey this rule, this kind of independence assumption has been widely used and has been found to be very successful in machine learning algorithms [9]. In fact, this kind of assumption is also made in the classical multivariate Bernoulli and multinomial models, which assume that each word in a document is independently generated by some underlying probabilistic model. Under this assumption, the probability of generating a document d is given as follows:

p(d) = \sum_{z \in Z} p(z) p(d|z) = \sum_{z \in Z} p(z) \prod_{c \in C} p(d_c|z)
Here p(d_c|z) is the probability of generating the set of words that appear in field c given the underlying cluster z. The strength of FICM comes from the integration of the clustering ability of each field, which can be improved by choosing a good probabilistic model for each field. FICM is more like a framework, since the specific implementation of FICM depends on the probabilistic model used for each field. Let C_b be the set of fields modelled by the multivariate Bernoulli model, and C_m the set of fields modelled by the multinomial model. We can then derive the probabilistic structure of FICM as below, together with the E- and M-steps of the EM algorithm used to estimate the parameters of this model.

Probabilistic structure:

L(D) = \sum_{d \in D} \log p(d) = \sum_{d \in D} \log \sum_{z \in Z} p(z) \prod_{c \in C_b} p(d_c|z) \times \prod_{c' \in C_m} p(d_{c'}|z)
     = \sum_{d \in D} \log \sum_{z \in Z} p(z) \prod_{c \in C_b} \prod_{w \in d_c} p(w|z,c)^{B_{w,d_c}} (1 - p(w|z,c))^{1 - B_{w,d_c}} \times \prod_{c' \in C_m} \prod_{w \in d_{c'}} p(w|z,c')^{N_{w,d_{c'}}}

E-step:

p(z|d) \propto p(z) p(d|z) = p(z) \prod_{c \in C_b} p(d_c|z) \times \prod_{c' \in C_m} p(d_{c'}|z)
       = p(z) \prod_{c \in C_b} \prod_{w \in d_c} p(w|z,c)^{B_{w,d_c}} (1 - p(w|z,c))^{1 - B_{w,d_c}} \times \prod_{c' \in C_m} \prod_{w \in d_{c'}} p(w|z,c')^{N_{w,d_{c'}}}

M-step (with Laplacian smoothing):

p(z) \propto \sum_{d \in D} p(z|d)

p(w|z,c) = \frac{1 + \sum_{d \in D} p(z|d) \cdot B_{w,d_c}}{2 + \sum_{d \in D} p(z|d)}   if c \in C_b

p(w|z,c') = \frac{1 + \sum_{d \in D} p(z|d) \cdot N_{w,d_{c'}}}{|W_{c'}| + \sum_{w' \in W_{c'}} \sum_{d \in D} p(z|d) \cdot N_{w',d_{c'}}}   if c' \in C_m

In the simplest case, we can use the same probabilistic model for all fields, i.e. C_b = C or C_m = C, which we call the Fielding Bernoulli Model and the Fielding Multinomial Model, respectively.
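A sketch of the FICM E-step only is shown below, illustrating how the per-field log-likelihoods are combined; B is a binary matrix for a Bernoulli-modelled field (e.g. MeSH) and N a count matrix for a multinomial-modelled field (e.g. title or abstract). The Bernoulli term here runs over the whole field vocabulary (the standard multivariate Bernoulli form); the M-step follows the per-field formulas above and is omitted, and all names are illustrative.

```python
import numpy as np

def ficm_e_step(pz, bernoulli_fields, multinomial_fields):
    """pz: (K,) cluster priors p(z).
    bernoulli_fields:   list of (B, pwzc), B (n_docs, |Wc|) binary occurrences,
                        pwzc (K, |Wc|) Bernoulli parameters p(w|z,c).
    multinomial_fields: list of (N, pwzc), N (n_docs, |Wc|) term counts,
                        pwzc (K, |Wc|) multinomial parameters p(w|z,c).
    Returns the responsibilities p(z|d) of shape (n_docs, K)."""
    log_post = np.log(pz)[None, :]
    for B, pwzc in bernoulli_fields:
        # sum_w [ B log p(w|z,c) + (1 - B) log(1 - p(w|z,c)) ]
        log_post = log_post + B @ np.log(pwzc).T + (1.0 - B) @ np.log(1.0 - pwzc).T
    for N, pwzc in multinomial_fields:
        # sum_w N_{w,d_c} log p(w|z,c)
        log_post = log_post + N @ np.log(pwzc).T
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize before exp
    resp = np.exp(log_post)
    return resp / resp.sum(axis=1, keepdims=True)
```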
3 Experimental Results
3.1 Evaluation Criteria
We used an external measure to assess the quality of the clustering algorithms, assuming that we already know the correct labeling of each document. That is, we evaluated the performance of the clustering algorithms by comparing the predicted clustering result with the correct (true) class labeling. Several external measures have been proposed to evaluate the quality of clustering, such as purity, average entropy, precision, recall, F-measure and mutual information [3]. Since mutual information has been deemed superior to measures such as purity and entropy, which are biased toward a large number of clusters, we use it as the measure in our work [14, 15]. Normalized mutual information is defined as

NMI = \frac{I(X; Y)}{\sqrt{H(X) \cdot H(Y)}}

where X and Y are the predicted clusters and the correct class labels, respectively, I(X;Y) is the mutual information between X and Y, and H(X) and H(Y) are the entropies of X and Y, respectively. Zhong et al. [14, 15] proposed a sample estimate to calculate NMI,

NMI = \frac{\sum_{h,l} n_{h,l} \log \left( \frac{n \cdot n_{h,l}}{n_h n_l} \right)}{\sqrt{\left( \sum_h n_h \log \frac{n_h}{n} \right) \left( \sum_l n_l \log \frac{n_l}{n} \right)}}

where n is the total number of documents in the whole collection, n_h is the number of documents in class h (gold standard), n_l is the number of documents in cluster l (predicted), and n_{h,l} is the number of documents in both class h and cluster l. The NMI value ranges from zero to one, where a value of zero means that the result is close to a random partitioning, and a value of one means that the result is almost identical to the true class labeling.
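The sample estimate translates directly into code; the short sketch below (illustrative, with the contingency counts computed from the two label vectors) follows that formula literally.

```python
import numpy as np

def normalized_mutual_information(labels_true, labels_pred):
    """Sample-estimate NMI between true classes and predicted clusters."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # contingency counts n_{h,l}
    nhl = np.array([[np.sum((labels_true == h) & (labels_pred == l))
                     for l in clusters] for h in classes], dtype=float)
    nh = nhl.sum(axis=1)   # class sizes n_h
    nl = nhl.sum(axis=0)   # cluster sizes n_l
    mask = nhl > 0
    num = np.sum(nhl[mask] * np.log(n * nhl[mask] / np.outer(nh, nl)[mask]))
    den = np.sqrt(np.sum(nh * np.log(nh / n)) * np.sum(nl * np.log(nl / n)))
    return num / den
```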
3.2 Dataset
We used the datasets obtained from the TREC (http://trec.nist.gov/) genomics track 2004 and 2005 [4, 5]. Both of these include an ad hoc retrieval task, which requires an information retrieval system to return documents relevant to specific information needs. The documents for both the 2004 and 2005 ad hoc retrieval tasks are a 10-year (1994 to 2003) MEDLINE subset, which has a total of 4,591,008 records. In the genomics track of each year, 50 topics were formulated by real biologists as information needs. For each topic, the top records returned by each retrieval system were pooled and further assessed by biologists to determine their relevance. After eliminating the documents that are related to more than one topic or have any empty fields (title, abstract or MeSH), we obtained 39 (total 4400
documents) and 24 (total 2317 documents) topics that have 10 or more relevant documents in TREC 2004 and 2005, respectively. We then randomly generated 30 different datasets for each year by choosing three or more topics (at most 12) for each generated dataset. For each specific class number, three different datasets were generated. Similar methods of randomly generating test datasets from a base dataset have also been adopted by other researchers [13]. We emphasize that our datasets are of high quality, since the topic of each document was examined by biologists. The name of each dataset encodes the year of the TREC data from which it was generated, the number of topics in the dataset, and the order of the dataset among all datasets with the same number of topics from the same year, such as "200503a", "200408b" and "200512c". For example, "200503a" means that this dataset, which was generated from TREC 2005, is the first one of those having three classes. From the TREC data, we extracted not only the whole text of each record, but also the three distinct fields (title, abstract and MeSH) of each record. In the pre-processing step, we removed stop words and used Porter's stemming algorithm to tokenize the documents. Similar to the work by Zhong et al. [15], we removed those words that appear in fewer than three documents; for each field, we also removed the words that appear in fewer than three documents. Due to space limitations, we only show the statistics of six example datasets, together with the minimum, maximum and mean over all 60 datasets, in Table 1. We can see that these 60 datasets vary greatly in some important characteristics: the number of documents varies from 92 to 1608, the number of classes from 3 to 12, the balance from 0.021 to 0.789 and the number of distinct words in each dataset from 710 to 3993, which makes them a very good testbed for comparing different clustering algorithms.
3.3 Experimental Settings
We used NMI to compare the performance of FICM with the other models. When we use the multivariate Bernoulli model on documents with the title field only, the abstract field only, the MeSH field only, or all fields, we call the models (and the results) bert, bera, berm, and berw, respectively; the Fielding Bernoulli model is called fber. Similarly, when we use the multinomial model, the corresponding names are mnst, mnsa, mnsm, mnsw and fmns. For fair comparisons, we fixed the number of iterations of the EM algorithms at 30 in all cases. In addition, K, the number of classes, was given as an a priori parameter to all models. Furthermore, to reduce the possible bias caused by the random initial partition, we ran each experiment 100 times, and averages, standard deviations and paired t-tests were computed for comparison. FICM allows us to assign a different probabilistic model to each field. In this experiment, we focused on the following two settings: one is to assign the multinomial model to the title and abstract fields, and the multivariate Bernoulli model to the MeSH field, which is denoted as fmmb; the other is to assign the multinomial model to the title field, and the multivariate Bernoulli model to the abstract and MeSH fields, which is denoted as fmbb.
Table 1. Summary of some example datasets and the minimum, maximum and mean of all 60 datasets randomly generated from the TREC genomics track 2004 and 2005. Nd is the number of documents, W is the number of distinct words (tokens), K is the number of classes, Nl is the average number of words in each document, Balance is the size ratio of the smallest class to the largest class, Nt (Na, Nm) is the average number of words in the title (abstract, MeSH) field, and Wt (Wa, Wm) is the number of distinct words in the title (abstract, MeSH) field.

Data            Nd     W       K    Nl     Balance  Nt   Na     Nm    Wt     Wa      Wm
200403a         102    711     3    126.1  0.7895   5.7  81.6   33.7  88     616     206
200503a         570    2109    3    142.3  0.2203   7.7  97.5   34.6  336    1920    532
200408a         710    2593    8    156.1  0.1217   7.8  108.7  37.2  480    2401    685
200508a         453    2013    8    143.5  0.0798   7.6  101.6  31.3  328    1825    551
200412a         1295   3625    12   166.2  0.0216   9    112.3  43    776    3349    1013
200512a         1357   3187    12   145.7  0.0211   8.7  101.6  34    658    2937    852
min of all 60   92     710     3    126.1  0.0211   5.7  81.6   30.3  76     604     205
max of all 60   1608   3993    12   177.8  0.7895   9.8  120.3  49    902    3710    1171
mean of all 60  776.5  2511.9  7.5  154.3  0.0889   8.2  105.5  38.1  470.3  2304.7  697.1
That is, we fixed the multinomial and multivariate Bernoulli models to the title and MeSH fields, respectively. The reason for this is as follows. MeSH terms are originally organized in a hierarchical structure, but this hierarchical information is lost in the multinomial model and the multivariate Bernoulli model, which treat every word in a flat way. Some general words, such as "human" and "genetics", appear in the MeSH field very frequently, meaning that these words might not be useful for clustering. We think that the multinomial model, which favors frequent terms in clustering, might get stuck in this situation, while the multivariate Bernoulli model, which considers binary occurrences only, might be more robust against it. On the other hand, words in the title field are all important for capturing the document topic. In this situation, the multinomial model, in which the probabilities of generating the distinct words sum to 1, would be more suitable than the multivariate Bernoulli model, which has to deal with each word independently.

3.4 Results
Comparison of bert, bera, berm, berw, fber and fmbb. We first examined the performance of the clustering models based on the multivariate Bernoulli model, together with fmbb, which replaces the Bernoulli model for the title field in fber with the multinomial model. Table 2 shows the results for six example datasets as well as the average over all 60 datasets; in this table, the model achieving the highest NMI is shown in bold. The fmbb model achieved the highest average NMI of 0.736 over all 60 datasets, followed by fber with 0.733 and berw with 0.714.
Table 2. Performance in terms of NMI (mean±standard deviation) for six example datasets and the average over all 60 datasets

Data            bert      bera      berm      berw      fber      fmbb
200403a         .667±.12  .818±.12  .886±.13  .846±.12  .876±.12  .888±.09
200503a         .667±.11  .784±.17  .578±.11  .733±.18  .752±.18  .760±.19
200408a         .671±.07  .765±.06  .756±.05  .808±.04  .820±.05  .834±.04
200508a         .705±.05  .842±.06  .817±.04  .860±.05  .849±.05  .864±.04
200412a         .557±.06  .708±.05  .668±.04  .741±.05  .739±.05  .763±.05
200512a         .534±.06  .604±.06  .520±.04  .595±.05  .635±.05  .643±.05
Mean of all 60  .638±.06  .700±.07  .670±.05  .714±.06  .733±.06  .736±.06
model on the whole text was better than any of bert, bera, or berm, which uses only one field. Furthermore, Table 3 shows the paired t-test indicating the statistical significance of improvements. The threshold of the p-value for the paired t-test is set to 0.05. For each examined hypothesis, e.g. that model x outperforms model y (x > y), we compared the performance of x with y in each dataset by paired t-test, and then retrieve a pair of two numbers (+ ab , − dc ), which means that x outperforms y in b datasets, of which a cases are statistically significant, and y outperforms x in d datasets, of which c cases are statistically significant. For our experiment, b + d = 60, and x > y is true only when a > c and b > d. Hereafter we will call this pair SPair. Table 3 indicates that berw outperformed bert, bera and berm significantly. We are especially interested in the comparison among berw, fber and fmbb, which utilize all three fields for clustering. We found that both fber and fmbb outperformed berw, being statistically significant in the majority of cases, while fmbb was slightly better than fber, which was proved by 7 3 , - 20 ). The results can be summarized as the performance ordering3 SPair (+ 40 of the compared models: f mbb > f ber > berw > (bert, bera, berm). Comparison of mnst, mnsa, mnsm, mnsw, fmns and fmmb Second, we examined the performance of the clustering models based on the multinomial model, and fmmb which replaces the multinomial model for the MeSH field in fmns with the multivariate Bernoulli model. Similarly, the experimental results are shown in Tables 4 and 5. Based on the average result of all 60 datasets in Table 4, fmmb achieved the highest N M I of 0.735, which was followed by fmns achieving 0.718 and mnsw achieving 0.712. We further investigated the significance of these improvements in Table 5. The mnsw method 3
We put bert, bera and berm in the same bracket, since we have no interest in comparing the performance of Bernoulli models over different fields.
340
S. Zhu et al.
Table 4. Performance in terms of N M I (mean±standard deviation) for some example datasets and average of all 60 datasets Data 200403a 200503a 200408a 200508a 200412a 200512a mean of all 60
mnst .850±.09 .667±.13 .769±.04 .821±.04 .746±.04 .645±.03 .711±.05
mnsa .865±.13 .700±.16 .798±.04 .825±.05 .765±.04 .670±.04 .704±.06
mnsm .858±.07 .507±.06 .737±.04 .771±.04 .678±.03 .458±.04 .623±.04
mnsw .879±.10 .716±.18 .809±.05 .822±.04 .774±.03 .670 ±.04 .712±.05
fmns fmmb .878±.12 .872±.14 .715±.18 .756±.18 .818±.04 .834±.05 .834±.04 .861 ±.04 .784±.04 .800±.04 .674±.04 .692±.04 .718 ±.06 .735 ±.06
Table 5. Paired t-tests over all 60 datasets at the 95% level

mnsw > mnst: (+25/32, -21/28)
mnsw > mnsa: (+19/41, -7/19)
mnsw > mnsm: (+55/58, -1/2)
fmns > mnsw: (+11/50, -1/10)
fmmb > mnsw: (+47/59, -0/1)
fmmb > fmns: (+38/56, -0/4)
outperformed mnsa and mnsm clearly, but surpassed mnst only slightly, suggesting that the few distinguished words in the title alone can be very effective for document clustering. On the other hand, fmmb outperformed both mnsw and fmns significantly, by SPairs (+47/59, -0/1) and (+38/56, -0/4), while fmns was slightly better than mnsw, by SPair (+11/50, -1/10). In this round of comparison, we obtained the following ordering: fmmb > fmns > mnsw > (mnst, mnsa, mnsm).

Summarization. In the above two rounds of experiments, compared with the classical multivariate Bernoulli model berw and the multinomial model mnsw, their direct extensions via FICM (fber and fmns, respectively) already improved the clustering performance in the majority of the datasets. This was especially significant in the case of fber, which achieved a higher NMI value than berw in 52 out of all 60 datasets, of which 30 were statistically significant at the 95% level. Furthermore, by choosing the most suitable probabilistic model for each field rather than the same model for all fields, we could obtain even better clustering results. This was shown by the performance improvement of fmmb over fmns, and of fmbb over fber. Overall, fmmb outperformed mnsw in 59 out of all 60 datasets, of which 47 were statistically significant at the 95% level, and fmbb outperformed berw in 52 out of 60 datasets, of which 36 were statistically significant. This indicates that FICM is methodologically better than the existing models, which is attributed to the integration of the discrimination abilities of the fields, each with its distinct word distribution, and to the power of choosing a suitable generative model for each field.

Discussion. To maximize the performance of FICM, the choice of a suitable generative model for each field becomes a crucial problem. We have shown that choosing the
multinomial model and the multivariate Bernoulli model for the title and MeSH fields, respectively, achieves the best clustering results. This choice is also justified by the significant improvement of mnst over bert, in terms of SPair (+47/51, -6/9), and that of berm over mnsm, in terms of SPair (+51/55, -2/5). Regarding the abstract field, the choice of model did not lead to a significant difference in performance. We summarize these results in the first three columns of Table 6. We then compared the performance of fmbb with fmmb with respect to the number of classes (topics) in the datasets. More precisely, we split the 60 datasets into two groups by the number of topics. If the number of topics was less than 8, fmbb outperformed fmmb in 27 out of all 30 cases, of which 19 were statistically significant at the 95% level; on the contrary, when the number of topics was 8 or more, fmmb outperformed fmbb in 24 out of 30 cases, of which 20 were statistically significant at the 95% level. This result is summarized in Table 6 as well. Using the same split of the datasets, we found a similar tendency in the performance of mnsa and bera. That is, bera outperformed mnsa, with SPair (+23/25, -3/5), for datasets with less than eight topics, while mnsa outperformed bera, with SPair (+19/22, -2/8), for the other datasets. This correspondence further verifies that the strength of FICM comes from the discrimination ability of each field. As shown in Table 6, we can choose fmbb for clustering datasets that are relatively small with few topics, and fmmb for clustering datasets that are relatively large with more topics. This is also consistent with McCallum and Nigam's findings that the multivariate Bernoulli model performs very well on datasets of small vocabulary size, but that the multinomial model usually performs better on datasets of large vocabulary size [8].

Table 6. Paired t-test results in the comparison of bert with mnst, bera with mnsa, berm with mnsm, and fmbb with fmmb over all 60 sets

mnst>bert       mnsa>bera        berm>mnsm      fmbb>fmmb        fmbb>fmmb       fmmb>fmbb
                                                                 (# topics < 8)  (# topics >= 8)
(+47/51, -6/9)  (+22/27, -25/33) (+51/55, -2/5) (+21/33, -22/27) (+19/27, -1/3)  (+20/24, -2/6)
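To make the SPair bookkeeping concrete, the sketch below shows how such a pair could be computed from per-dataset NMI scores of two models. It is only an illustration under our own naming (the paper does not provide code), and it assumes that each dataset was clustered several times so that a paired t-test over runs is meaningful.

from scipy.stats import ttest_rel

def spair(nmi_x, nmi_y, alpha=0.05):
    """Return the SPair (+a/b, -c/d) comparing model x with model y.

    nmi_x[k] and nmi_y[k] are the NMI values obtained by the two models over
    repeated runs on dataset k.  b (resp. d) counts the datasets where x
    (resp. y) has the higher mean NMI; a (resp. c) counts how many of those
    differences are significant under a paired t-test at level alpha.
    """
    a = b = c = d = 0
    for runs_x, runs_y in zip(nmi_x, nmi_y):
        mean_x = sum(runs_x) / len(runs_x)
        mean_y = sum(runs_y) / len(runs_y)
        _, p = ttest_rel(runs_x, runs_y)   # paired t-test over the runs
        if mean_x > mean_y:
            b += 1
            a += int(p < alpha)
        elif mean_y > mean_x:
            d += 1
            c += int(p < alpha)
    return (a, b), (c, d)

With x > y declared only when a > c and b > d, this mirrors the decision rule stated above.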
4 Conclusion and Future Work
The strength of FICM comes from the integration of the discrimination ability of each field and the power of choosing the most suitable generative model for each field. Our experimental results show that a direct extension of the classical multivariate Bernoulli and multinomial models using all fields already achieves better performance, and that by configuring each field with a suitable model we can obtain even better clustering results. Here the number of documents in each dataset ranges from 92 to 1680. Although this range may cover the most frequent cases in practice, we would like to examine the performance of FICM on larger datasets in the future. It would also be very interesting to apply FICM to other types of text documents, which would require some prior knowledge or preliminary experiments to choose a suitable model for each field.
Acknowledgements. This work is supported in part by a JSPS (Japan Society for the Promotion of Science) Postdoctoral Fellowship. We would also like to thank the anonymous reviewers for their very helpful suggestions.
References

1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. Addison Wesley, New York (1999)
2. Banerjee, A., Dhillon, I., Ghosh, J., Sra, S.: Generative Model-based Clustering of Directional Data. In: Proceedings of SIGKDD'03 (2003) 19-28
3. Ghosh, J.: Scalable clustering methods for data mining. In: Handbook of Data Mining, editor: Ye, N., Erlbaum, L. (2003)
4. Hersh, WR., Bhupatiraju, RT., Ross, L., Johnson, P., Cohen, AM., Kraemer, DF.: TREC 2004 Genomics Track Overview. In: Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), editor: Voorhees, EM. and Buckland, LP. (2004)
5. Hersh, WR., Cohen, A., Bhupatiraju, RT., Johnson, P., Hearst, M.: TREC 2005 Genomics Track Overview. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), editor: Voorhees, EM. and Buckland, LP. (2005)
6. Jain, AK., Murty, MN., Flynn, PJ.: Data Clustering: A Review. ACM Computing Surveys 31(3) (1999) 264–323
7. Meila, M., Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods. Machine Learning 42(1/2) (2001) 9–29
8. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization (1998) 41–48
9. Lewis, DD.: Naive (Bayes) at forty: the independence assumption in information retrieval. In: Proceedings of ECML-98
10. Nelson, SJ., Schopen, M., Savage, AG., Schulman, J., Arluk, N.: The MeSH Translation Maintenance System: Structure, Interface Design, and Implementation. In: Proceedings of the 11th World Congress on Medical Informatics, editor: Fieschi, M. et al. (2004) 67–69
11. Rigouste, L., Cappé, O., Yvon, F.: Evaluation of a Probabilistic Method for Unsupervised Text Clustering. In: International Symposium on Applied Stochastic Models and Data Analysis, Brest, France (2005)
12. Wheeler, D. et al.: Database resources of the National Center for Biotechnology Information. Nucl. Acids Res. 33 (2005) D39–D45
13. Yoo, I., Hu, X.: A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE. In: ACM/IEEE Joint Conference on Digital Libraries, JCDL 2006, editor: Marchionini, G. et al. (2006) 220–229
14. Zhong, S., Ghosh, J.: A unified framework for model-based clustering. Journal of Machine Learning Research 4 (2003) 1001–1037
15. Zhong, S., Ghosh, J.: Generative model-based document clustering: a comparative study. Knowledge and Information Systems 8(3) (2005) 374–384
Personalized Communities in a Distributed Recommender System

Sylvain Castagnos and Anne Boyer
LORIA - Université Nancy 2, Campus Scientifique - B.P. 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
{sylvain.castagnos, anne.boyer}@loria.fr
Abstract. The amount of data in information systems increases exponentially, and it becomes more and more difficult to extract the most relevant information within a very short time. Among others, collaborative filtering processes help users to find interesting items by modeling their preferences and by comparing them with users having the same tastes. Nevertheless, there are a lot of aspects to consider when implementing such a recommender system; in particular, the number of potential users and the confidential nature of some data have to be taken into account. This paper introduces a new distributed recommender system based on a user-based filtering algorithm. Our model has been transposed for Peer-to-Peer architectures. It has been especially designed to deal with problems of scalability and privacy. Moreover, it adapts its prediction computations to the density of the user neighborhood.
1 Introduction
With the development of information and communication technologies, the size of information systems all over the world has increased exponentially. Consequently, it becomes harder and harder for users to identify relevant items in a reasonable time, even when using a powerful search engine. Collaborative filtering techniques [1] are a good way to cope with this difficulty. Collaborative filtering amounts to matching the active user with a set of persons having the same tastes, based on his/her preferences and his/her past actions. This approach starts from the principle that users who liked the same items have the same topics of interest. Thus, it is possible to predict the relevancy of data for the active user by taking advantage of the experiences of a similar population. There are several fundamental problems when implementing a collaborative filtering algorithm. In this paper, we particularly pay attention to the following significant limitations for industrial use:
– scalability and system reactivity: there are potentially several thousand users and items to manage in real time;
– intrusions into privacy: we have to be careful to be as unintrusive as possible and at least to guarantee the anonymity of users;
– novelty in predictions: according to the context, users want to have more or less new recommendations. Sometimes their main concern is to retrieve the items that they have rated highly, even if it means having fewer new recommendations. This is why we introduce an adaptive minimum-correlation threshold of the neighborhood which evolves in accordance with user expectations.
We propose an algorithm which is based on an analysis of usage. It relies on a distributed user-based collaborative filtering technique. Our model has been integrated in a document sharing system called "SofoS"¹. Our algorithm is implemented on a Peer-to-Peer architecture because of the document platform context. In a lot of companies, documents are referenced using a common codification that may require a central server² but are stored on users' devices. The distribution of computations and contents matches the constraints of scalability and reactivity. In this paper, we will first present the related work on collaborative filtering approaches. We will then introduce our Peer-to-Peer user-centered model which offers the advantage of being fully distributed. We called this model "Adaptive User-centered Recommender Algorithm" (AURA). It provides a service which builds a virtual community of interests centered on the active user by selecting his/her nearest neighbors. As the model is ego-centered, the active user can define the expected prediction quality by specifying the minimum-correlation threshold. AURA is an anytime algorithm which furthermore requires very little computation time and memory space. As we want to constantly improve our model and the document sharing platform, we are incrementally and modularly developing them on a JXTA platform³.
2 Related Work
In centralized collaborative filtering approaches, finding the closest neighbors among several thousands of candidates in real time without offline computations may be unrealistic [2]. By contrast, decentralization of data is practical to comply with privacy rules, as long as anonymity is fulfilled [3]. This is the reason why more and more researchers investigate various means of distributing collaborative filtering algorithms. This also presents the advantage of giving users ownership of their profiles, so that they can be re-used in several applications.⁴ We can mention research on P2P architectures, multi-agent systems and decentralized models (client/server, shared databases). There are several ways to classify collaborative filtering algorithms. In [4], the authors have identified, among existing techniques, two major classes of algorithms: memory-based and model-based algorithms. Memory-based techniques
¹ SofoS is the acronym for "Sharing Our Files On the System".
² This makes it possible to have document IDs and to identify documents easily.
³ http://www.jxta.org/
⁴ As the owner of the profile, the user can apply it to different pieces of software. In centralized approaches, there must be as many profiles as services for one user.
offer the advantage of being very reactive, by immediately integrating modifications of user profiles into the system. They also guarantee the quality of recommendations. However, Breese et al. [4] are unanimous in thinking that their scalability is problematic: even if these methods work well with small-sized examples, it is difficult to scale them to situations characterized by a great number of documents or users. Indeed, the time and space complexities of the algorithms are serious considerations for big databases. According to Pennock et al. [5], model-based algorithms constitute an alternative to the problem of combinatorial complexity. Furthermore, they perceive in these models an added value beyond the function of prediction: they highlight some correlations in data, thus proposing an intuitive reason for recommendations or simply making the hypotheses more explicit. However, these methods are not dynamic enough and they react badly to the insertion of new content into the database. Moreover, they require a penalizing learning phase for the user. Another way to classify collaborative filtering techniques is to consider user-based methods in opposition to item-based algorithms. For example, we have explored a distributed user-based approach within a client/server context in [6]. In this model, implicit criteria are used to generate explicit ratings. These votes are anonymously sent to the server. An offline clustering algorithm is then applied and group profiles are sent to clients. The identification phase is done on the client side in order to cope with privacy. This model also deals with sparsity and scalability. We highlight the added value of a user-based approach in the situation where users are relatively stable, whereas the set of items may often vary considerably. On the contrary, Miller et al. [7] show the great potential of distributed item-based algorithms. They propose a P2P version of the item-item algorithm. In this way, they address the problems of portability (even on mobile devices), privacy and security with a high quality of recommendations. Their model can adapt to different P2P configurations. Beyond the different possible implementations, we can see there are a lot of open questions raised by the industrial use of collaborative filtering. Canny [3] concentrates on ways to provide powerful privacy protection by computing a "public" aggregate for each community without disclosing individual users' data. Furthermore, his approach is based on homomorphic encryption to protect personal data and on a probabilistic factor analysis model which handles missing data without requiring default values for them. Privacy protection is provided by a P2P protocol. Berkovsky et al. [8] also deal with privacy concerns in P2P recommender systems. They address the problem by electing super-peers whose role is to compute an average profile of a sub-population. Standard peers have to contact all these super-peers and to exploit these average profiles to compute predictions. In this way, they never access the public profile of a particular user. We can also cite the work of Han et al. [9], which addresses the problem of privacy protection and scalability in a distributed collaborative filtering algorithm called PipeCF. Both user database management and prediction computation are split between several devices. This approach has been implemented on Peer-to-Peer overlay networks through a distributed hash table method.
In this paper, we introduce a new hybrid method called AURA. It combines the reactivity of memory-based techniques with the data correlation of model-based approaches by using an iterative clustering algorithm. Moreover, AURA is a user-based model which is completely distributed at the user scale. It has been integrated in the SofoS document platform and relies on a P2P architecture in order to distribute prediction computations, content and profiles. We design our model to tackle, among others, the problems of scalability and privacy.
3 SofoS
SofoS is a document platform that uses a recommender system to provide users with content. Once it is installed, users can share and/or search documents, as they do on P2P applications like Napster. We conceived it in such a way that it is as open as possible to different existing kinds of data: hypertext files, documents, music, videos, etc. The goal of SofoS is also to assist users in finding the most relevant sources of information efficiently. This is why we add the AURA recommender module to the system. We assume that users can get pieces of information either by using our system or by surfing the web. SofoS consequently makes it possible to take visited websites into account in the prediction computations. We are implementing SofoS in a generic environment for Peer-to-Peer services, called JXTA. This choice is motivated by the fact that it is widely used in our research community. In [7], the authors highlight the fact that there are several types of possible architectures for P2P systems. We can cite those with a central server (such as Napster), random discovery ones⁵ (such as Gnutella or KaZaA), transitive traversal architectures, content addressable structures and secure blackboards. We conceived our model with the idea that it could be adapted to different types of architectures. However, in this paper, we will illustrate our claims by basing our examples on the random approach, even if others may have an added value. The following subsection presents the AURA algorithm.
3.1 AURA Algorithm
We presume that each peer in SofoS corresponds to a single user on a given device.⁶ For this reason, we have conceived the platform in such a way that users have to open a session with a login and a password before using the application. In this way, several persons can use the same computer (for example, the different members of a family) without disrupting their respective profiles. That is why each user on a given peer of the system has his/her own profile and a single ID. The session data remain on the local machine in order to enhance privacy. There is no central server required since sessions are only used to distinguish users on a given peer.
⁵ Some of these architectures are totally distributed. Others mix centralized and distributed approaches but elect super-peers whose role is to partially manage subgroups of peers in the system.
⁶ We can easily distinguish devices since SofoS has to be installed on users' computers.
For each user, we use a hash function requiring the IP address and the login in order to generate his/her ID on his/her computer. This use of a hash function H is suitable, since it has the following features:
– non-reversible: knowing "y", it is hard to find "x" such that H(x) = y;
– no collision: it is hard to find "x" and "y" such that H(x) = H(y);
– knowing "x" and H, it is easy to compute H(x);
– H(x) has a fixed size.
In this way, an ID does not allow identification of the name or IP address of the corresponding user. The communication module uses an IP multicast address to broadcast the packets containing addressees' IDs. In order to reduce the information flow, we can optionally elect a super-peer which keeps a list of IDs whose session is active: before sending a message, a peer can ask if the addressee is connected. If the super-peer has no signal from a peer for a while, it removes the corresponding ID from the list. Users can both share items on the platform and provide feedback about the websites they consult. Each item has a profile on the platform. In addition to the available documents, each peer owns 7 pieces of information: a personal profile, a public profile, a group profile and 4 lists of IDs (list "A" for IDs of peers belonging to its group, list "B" for those which exceed the minimum-correlation threshold as explained below, list "C" for the black-listed IDs and list "O" for IDs of peers which have added the active user's profile to their group profile). An example of the system run is shown in figure 1.
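As an illustration of this ID-generation step, the following sketch derives an ID with a standard cryptographic hash; the choice of SHA-256 and the exact input encoding are our own assumptions, since the paper does not name a specific hash function.

import hashlib

def make_user_id(ip_address: str, login: str) -> str:
    """Non-reversible, fixed-size user ID derived from the IP address and login."""
    return hashlib.sha256(f"{ip_address}:{login}".encode("utf-8")).hexdigest()

# Two users of the same machine obtain distinct IDs:
print(make_user_id("192.0.2.10", "alice"))
print(make_user_id("192.0.2.10", "bob"))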
Fig. 1. Run of AURA
In order to build the personal profile of the active user ua, we use both explicit and implicit criteria. The active user can always check the list of items that he/she shares or has consulted. He/She can explicitly rate each of these items on a scale of values from 1 to 5. The active user can also initialize his/her personal profile with a set of criteria⁷ proposed in the interface, in order to partially address the cold-start problem. This offers the advantage of completing the profile with more consistency and of finding similarities with other users more quickly, since everyone can fill in the same criteria rating form.
⁷ Ideally, the set of items in the criteria set should cover all the implicit categories that users can find on the platform.
We assume that, despite the explicit voluntary completion of profiles, there are a lot of missing data. We consequently add to AURA a user modeling function, as we did in [6]. The explicit ratings and the estimated numerical votes constitute the active user's personal profile. The public profile is the part of the personal profile that the active user agrees to share with others. The algorithm also has to build a group profile. It represents the preferences of a virtual community of interests, and has been especially designed to be as close as possible to the active user's expectations. In order to do that, the peer of the active user asks for the public profiles of all the peers it can reach through the platform. Then, for each of these profiles, it computes a similarity measure with the personal profile of the active user. The active user can indirectly define a minimum-correlation threshold which corresponds to the radius of his/her trust circle (cf. infra, figure 2).
Fig. 2. Virtual community centered on ua
If the result is lower than this fixed threshold, which is specific to each user, the ID of the peer is added to the list "A" and the corresponding profile is included in the group profile of the active user, using the procedure of Table 1.

Table 1. Add a public profile to the group profile

Procedure AddToGroupProfile(public profile of un)
  W = W + |w(ua, un)|
  for each item i do
    (ul,i) = (ul,i) * (W - |w(ua, un)|)
    (ul,i) = ((ul,i) + w(ua, un) * (un,i)) / W
  end for

With: (ul,i) the rating for item i in the group profile; (un,i) the rating of user n for item i; W the sum of |w(ua, ui)|, which is stored; w(ua, un) the correlation coefficient between the active user ua and un.
We used the Pearson correlation coefficient to compute similarity, since the literature shows it works well [10]. Of course, if this similarity measure is higher than the threshold, we add the ID of the peer to the list "B". The list "C" is used to systematically ignore some peers. It makes it possible to improve trust – that is
to say the confidence that users have in the recommendations – by identifying malicious users. The trust increasing process will not be considered in this paper. When his/her personal profile changes, the active user has the possibility to update his/her public profile pa. In this case, the active peer has to contact every peer⁸ whose ID is in the list "O". Each of these peers re-computes the similarity measure. If it exceeds the threshold, the profile pa has to be removed from the group profile, using the procedure of Table 2. Otherwise, pa has to be updated in the group profile, that is to say the peer must remove the old profile and add the new one.

Table 2. Remove a public profile from the group profile

Procedure RemoveToGroupProfile(old profile of un)
  W = W - |w(ua, un)|
  for each item i do
    (ul,i) = (ul,i) * (W + |w(ua, un)|)
    (ul,i) = ((ul,i) - w(ua, un) * (un,i)) / W
  end for
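Read together, Tables 1 and 2 describe one incremental, weighted running average over neighbours' ratings. The sketch below mirrors them in Python; it is only an illustration (class and variable names are ours), with W standing for the stored sum of |w(ua, un)| as in the paper.

class GroupProfile:
    """Incremental community profile built from neighbours' public profiles."""

    def __init__(self, items):
        self.ratings = {i: 0.0 for i in items}   # (u_l,i) in the paper's notation
        self.W = 0.0                             # running sum of |w(u_a, u_n)|

    def add(self, public_profile, w):
        """AddToGroupProfile: fold in a neighbour's ratings with Pearson weight w."""
        self.W += abs(w)
        for i, r in public_profile.items():
            # un-normalise with the old sum (W - |w|), add the new weighted vote,
            # then re-normalise with the updated sum W
            self.ratings[i] = (self.ratings[i] * (self.W - abs(w)) + w * r) / self.W

    def remove(self, old_profile, w):
        """RemoveToGroupProfile: undo a previously added public profile."""
        self.W -= abs(w)
        if self.W == 0.0:                        # guard not present in the paper's pseudocode
            self.ratings = {i: 0.0 for i in self.ratings}
            return
        for i, r in old_profile.items():
            self.ratings[i] = (self.ratings[i] * (self.W + abs(w)) - w * r) / self.W

    def update(self, old_profile, new_profile, w_old, w_new):
        """A profile update is a removal of the old version followed by an addition."""
        self.remove(old_profile, w_old)
        self.add(new_profile, w_new)

Because only W and the aggregated ratings are stored, no neighbour's individual votes need to be kept after the update, which is what keeps the memory footprint of the approach small.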
By convention, we use the notation <id, p> for the peer-addition packet, that is to say new arrivals. <id, p, s> corresponds to the packet of a peer which is already connected and sends data to a new arrival; "s" is the threshold value. There is no need to specify the threshold value in the peer-addition packet, since there is a default value (|correlation| >= 0). Finally, <id, pt-1, pt, s> is the notation for the update packet. In each of these packets, the first parameter corresponds to the ID of the source of the message. In order to simplify the notation, we do not include the addressees' IDs in figure 3.
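For illustration, the three packet types can be represented as plain data structures; the field names below are ours, while the tuples follow the notation just introduced.

from dataclasses import dataclass
from typing import Dict

Profile = Dict[str, float]          # item id -> rating

@dataclass
class PeerAdditionPacket:           # < id, p >, sent by a new arrival
    sender_id: str
    public_profile: Profile

@dataclass
class ReplyPacket:                  # < id, p, s >, sent by an already connected peer
    sender_id: str
    public_profile: Profile
    threshold: float

@dataclass
class UpdatePacket:                 # < id, p_{t-1}, p_t, s >, sent after a profile change
    sender_id: str
    old_profile: Profile
    new_profile: Profile
    threshold: float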
Fig. 3. Example of user interactions
Figure 3 illustrates how the system works. In this example, we consider 3 of the 5 users from figure 1. We show the registers of the active user ua and 8
⁸ A packet is broadcast with a header containing peers' IDs, the old profile and the new public profile.
the user u4. At time t1, the active user ua tries to contact, for the first time, other peers by sending his/her public profile and his/her ID to neighbors. This is the packet <ida, pa>. u3 receives the packet and answers at t2. ua computes the distance between the public profiles p3 and pa. As the Pearson coefficient is inevitably within the default threshold limit, ua adds id3 to his/her list "A". If the computed correlation coefficient is higher than "s3", which is the threshold of u3, ua adds id3 to his/her list "O". Meanwhile, some of the reached peers will add pa to their list "A" if the correlation is higher than their threshold (this is the case for u3). At time t2, u4 arrives on the platform and sends a packet to u3. At time t3, u3 replies to u4 and sends the packet of u4 to peers that he/she already knows. ua receives it and adds id4 to his/her list "A". He/She also adds id4 to the list "O", since u4 is a new arrival and has a default threshold. At time t4, ua consequently gives his/her public profile to u4. At the same time, u4 has changed his/her threshold and considers that ua is too far in the user/item representation space, that is to say the correlation coefficient between ua and u4 exceeds the limit. Thus, u4 adds ida to the list "B". In the packet <ida, pa, sa>, "sa" allows u4 to know that he/she must complete the list "O" with ida. Finally, u4 updates his/her public profile. Afterwards, he/she notifies the change to the IDs in the list "O". This is the packet <id4, p4,t4, p4,t5, s4>. p4,t4 and p4,t5 are respectively the old and new public profiles of u4. When ua receives this packet, he/she updates the list "O" by removing id4 since s4 is too high for him/her.
3.2 Adaptive Minimum-Correlation Threshold
As shown in the previous subsection, the active user can indirectly define the minimum-correlation threshold that other people must reach in order to be a member of his/her community (the radius of the circle in figure 2). Concretely, a high correlation threshold means that the users taken into account in prediction computations are very close to the active user. Recommendations will consequently be extremely similar to his/her own preferences. On the contrary, a low correlation threshold sets forth the will of the active user to stay aware of generalist information by integrating distant users' preferences. In this way, the user avoids frozen suggestions by accepting novelty. In the SofoS interface, a slider allows the active user to ask for personalized or generalist recommendations. This allows AURA to know the degree to which it can modify the threshold⁹. The default threshold value is 0, which means that we take all the peers into account. The default threshold step is 0.1, but it can be adapted to the density of the population. As shown in figure 4, we split the interval of the Pearson coefficient's possible values [-1; +1] into subsets. For each subset, we keep the count of peers which have got in touch with the active user and whose correlation coefficient is
⁹ By "threshold", we mean the minimum absolute value of the Pearson coefficients to consider in the group profile computation. For example, if the system sets the threshold to 0.1, it means that only peers ui whose correlation coefficient |w(ua, ui)| is higher than 0.1 will be included in the group profile of the active user.
contained in the interval corresponding to the subset. Thus, when a user sends a packet to ua , the Pearson coefficient is computed in order to know if the active user’s group profile has to be updated according to the threshold value. At the same time, we update the corresponding values in the population distribution histogram. For example, if ua receives an update packet and the Pearson coefficient changes from 0.71 to 0.89, we decrement the register of the interval [0.7; 0.8) and we increment the register of the interval [0.8; 0.9). In this way, we constantly have the population density for each interval.
Fig. 4. Adaptive threshold based on density
When the total number of users whose Pearson coefficient is higher than (threshold + 0.1) exceeds a given limit (dashed line in figure 4), we increase the threshold. If there are too many users in the next subset, the threshold increase is lower. For the moment, the maximum threshold value is 0.2 for users who want a high degree of novelty and 0.9 for those who expect recommendations close to their preferences.¹⁰ These values have been arbitrarily chosen. We plan to run statistical tests to automatically determine the ideal thresholds according to the context.
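The density bookkeeping can be sketched as follows; the bucket width of 0.1 and the idea of a population limit come from the text, while the concrete adaptation loop and all names are ours and only illustrative.

import math

class ThresholdAdapter:
    """Histogram of neighbours' Pearson coefficients over [-1, 1] in 0.1-wide
    buckets, used to raise the minimum-correlation threshold when enough
    close neighbours are available."""

    def __init__(self, limit=50, max_threshold=0.9):
        self.counts = {}              # bucket lower bound -> number of peers
        self.threshold = 0.0          # default: every peer is considered
        self.limit = limit            # population limit (dashed line in figure 4)
        self.max_threshold = max_threshold

    @staticmethod
    def _bucket(coeff):
        return math.floor(coeff * 10 + 1e-9) / 10.0   # e.g. 0.71 -> 0.7, 0.89 -> 0.8

    def on_new_peer(self, coeff):
        b = self._bucket(coeff)
        self.counts[b] = self.counts.get(b, 0) + 1
        self._adapt()

    def on_update(self, old_coeff, new_coeff):
        old_b, new_b = self._bucket(old_coeff), self._bucket(new_coeff)
        if old_b != new_b:            # e.g. 0.71 -> 0.89: move the peer between registers
            self.counts[old_b] = self.counts.get(old_b, 1) - 1
            self.counts[new_b] = self.counts.get(new_b, 0) + 1
        self._adapt()

    def _adapt(self):
        # Raise the threshold while more than `limit` peers lie above threshold + 0.1.
        while self.threshold + 0.1 <= self.max_threshold + 1e-9:
            above = sum(n for b, n in self.counts.items()
                        if b >= self.threshold + 0.1 - 1e-9)
            if above <= self.limit:
                break
            self.threshold = round(self.threshold + 0.1, 1)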
4 Discussion
In order to define the degree of privacy of our recommender system, we refer to the 4 axes of personalization [11]. Cranor assumes that an ideal system should be based on an explicit data collection method, transient profiles, user-initiated involvement and non-invasive predictions. In our system, the users have complete access to their preferences. They control what to share with others and when. Only numerical votes are exchanged and the logs of user actions are transient. Even when the active user does not want to share his/her preferences, it is possible to compute predictions since the public profiles of other peers are temporarily available on the active user's device. Each user has a single ID, but anonymity is ensured by the fact that there is no table linking IDs and identities. This privacy-enhanced process requires more network traffic than in [8], but it allows the system to perform user-centered rather than community-centered predictions.
¹⁰ That is to say, they want to retrieve items that they have rated highly.
As regards scalability, our model no longer suffers from limitations since the algorithms used to compute group profiles and predictions are in O(b), where b is the number of commonly rated items between two users, computations being made incrementally in a stochastic context. In return, AURA requires quite a lot of network traffic. This is particularly true if we use a random discovery architecture. Other P2P structures can improve communications [7]. Furthermore, we assume that the quality of predictions in a real situation should be better – provided that we find enough neighbors – since the virtual community of interests on each peer is centered on the active user (cf. infra, figure 2). We can influence the degree of personalization by adjusting the threshold according to the density of the active user's neighborhood. The system just has to increase the threshold in order to ensure that users retrieve the items that they have rated highly among their recommendations. To highlight this phenomenon, we generated a rating matrix of 1,000 users and 1,000 items. The votes follow a Gaussian law, and the average number of neighbors as a function of the Pearson coefficient can be seen in figure 5. We randomly removed 20% of these votes and applied the AURA algorithm. Then, we computed the Recall, which measures how often a list of recommendations contains an item that the user has already rated in his/her top 10. When increasing the threshold in the system, this measure becomes higher.
Fig. 5. On the left, average distribution of users as regards Pearson coefficient. On the right, recall as threshold grows.
We have also evaluated our model in terms of prediction relevancy. We used the Mean Absolute Error (MAE). MAE is a widely used metric which shows the deviation between predictions and real user-specified values. Consequently, we computed the average error between the predictions and 100,000 ratings of the GroupLens test set¹¹. We simulated the arrival of peers by progressively adding new profiles. As shown in figure 6, we get predictions as good as those obtained with the PocketLens algorithm [7]. PocketLens relies on a distributed item-based approach. This comparison
¹¹ http://www.grouplens.org/
Fig. 6. MAE as neighborhood size grows
consequently demonstrates that AURA provides results as relevant as those of an efficient item-based collaborative filtering approach. Finally, we compared our recommender system with two centralized algorithms (Item-Item [2] and the Correlation-based Collaborative Filter CorrCF [12]) to illustrate the added value of the distributed approach. In order to determine the computation times of these algorithms, we generated random public profiles with different numbers of items. In this simulation, the votes of each user follow a Gaussian distribution centered on the middle of the representation space. Moreover, only 1% of the data in the generated profiles is missing.¹² Since Item-Item and CorrCF are centralized, we first aggregate the profiles in a vote matrix. The results of the tests in terms of computation time are shown in Table 3. The reported times for the AURA algorithm do not include the duration required to scan the network in search of public profiles. Of course, the difference between AURA and the two others is mainly due to the fact that we use as many peers as users for the computations. However, these results illustrate the considerable gain in comparison with centralized techniques. AURA makes real-time predictions possible. There is no need for offline computations, since we can take 10,000 profiles and 150 items into account in less than half a second. Moreover, the system does not have to wait until all similarity measures have been computed. As the algorithm is incremental, we can stop considering other peers at any moment.

Table 3. Computation times of three collaborative filtering algorithms

           100 items                150 items                1000 items
Users      AURA   CorrCF  It-It     AURA   CorrCF  It-It     AURA   CorrCF  It-It
200        0"01   2"60    2"14      0"01   3"17    2"71      0"07   11"09   52"74
1,000      0"03   30"22   8"56      0"05   40"68   12"84     0"30   3'06"   3'25"
10,000     0"31   7:30'   1'22"     0"48   -       2'05"     1"90   49'28"  -
100,000    3"04   -       -         -      -       -         -      -       -
¹² Only 1% of missing data is not realistic, but it potentially increases the computation time, which is what is interesting in this case.
5 Conclusion
SofoS is a document sharing platform including a recommender system. To cope with numerous problems specific to information retrieval, we proposed a Peer-to-Peer collaborative filtering model which is totally distributed. It allows real-time personalization and manages the degree of personalization that users want. We implemented it on a JXTA platform, which has been used by researchers all over the world. We show in this paper that we can deal with important problems such as scalability, privacy and quality. We highlight the benefits of our system with an offline performance analysis. We plan to validate these points by testing our model with real users in real conditions. Our algorithm is anytime and incremental. Contrary to PocketLens, our model is user-based because we consider that the set of items can change. Even if an item is deleted, we can continue to exploit its ratings in the prediction computations. Moreover, the stochastic context of our model allows the system to update the modified profiles instead of resetting all the knowledge about neighbors. Finally, our model consumes very little memory because it does not need to store neighbors' ratings, a similarity matrix, a dot-product matrix and so on. It only requires the sum of Pearson coefficients and four lists of user IDs. Currently, we are developing our protocols further to cope with other limitations, such as trust and security aspects, by using specific communication protocols as in [13].
References

1. Goldberg, D., Nichols, D., Oki, B., Terry, D.: Using collaborative filtering to weave an information tapestry. In: Communications of the ACM, Special Issue on Information Filtering, Volume 35(12), ACM Press (1992) 61–70
2. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: World Wide Web (2001) 285–295
3. Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security and Privacy, Oakland, CA (May 2002) 45–57
4. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), San Francisco, CA (July 1998)
5. Pennock, D.M., Horvitz, E., Lawrence, S., Giles, C.L.: Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-2000), San Francisco, USA, Morgan Kaufmann Publishers (2000)
6. Castagnos, S., Boyer, A.: A client/server user-based collaborative filtering algorithm: Model and implementation. In: Proceedings of the 17th European Conference on Artificial Intelligence (ECAI2006), Riva del Garda, Italy (August 2006)
7. Miller, B.N., Konstan, J.A., Riedl, J.: PocketLens: Toward a personal recommender system. In: ACM Transactions on Information Systems, Volume 22 (July 2004)
8. Berkovsky, S., Eytani, Y., Kuflik, T., Ricci, F.: Hierarchical neighborhood topology for privacy enhanced collaborative filtering. In: CHI 2006 Workshop on Privacy-Enhanced Personalization (PEP2006), Montreal, Canada (April 2006)
9. Han, P., Xie, B., Yang, F., Wang, J., Shen, R.: A novel distributed collaborative filtering algorithm and its implementation on P2P overlay network. In: Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD04), Sydney, Australia (May 2004)
10. Shardanand, U., Maes, P.: Social information filtering: Algorithms for automating "word of mouth". In: Proceedings of ACM CHI'95 Conference on Human Factors in Computing Systems, Volume 1 (1995) 210–217
11. Cranor, L.F.: Hey, that's personal! In: International User Modeling Conference (UM05) (2005)
12. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina, ACM (1994) 175–186
13. Polat, H., Du, W.: SVD-based collaborative filtering with privacy. In: Proceedings of the ACM Symposium on Applied Computing, Cyprus (2004)
Information Recovery and Discovery in Collaborative Web Search

Maurice Coyle and Barry Smyth
Adaptive Information Cluster, School of Computer Science & Informatics, University College Dublin, Belfield, Dublin 4, Ireland
{maurice.coyle,barry.smyth}@ucd.ie
Abstract. When we search for information we are usually either trying to recover something that we have found in the past or trying to discover some new information. In this paper we will evaluate how the collaborative Web search technique, which personalizes search results for communities of like-minded users, can help in recovery- and discovery-type search tasks in a corporate search scenario.
1 Introduction
Finding new ways to improve the quality of Web search engines has the potential to have a significant impact when it comes to helping people to find the right information at the right time. Modern web search engines face a number of challenges, such as the scale and growth rate of the Web and the needs and expertise of the average Web user. For example, Web search engines experience difficulties in disambiguating the vague queries that are now commonplace in Web search; the typical search query might only contain 2 or 3 keywords, which rarely help to clearly identify the searcher’s needs [11]. The result has been an intensive research effort to develop new techniques for improving Web search quality. Early successes are now well known with, for example, Page & Brin [2] famously incorporating page connectivity information into result ranking in what came to be a unique feature of Google. Many researchers have sought recently to include context information as a way to improve the selection and ranking of search results with some explicitly requesting such information from searchers [1,8], while others infer context implicitly by observing a user’s actions or activities [3,6]. Indeed to many, the next generation of search technology will focus on the provision of personalized search, by learning about the long-term preferences of users and inferring their short-term needs, in order to promote results that are likely to be more relevant to individuals. When we think about search it is often useful to distinguish between two types of search activity. Many search sessions are focused on information recovery in the sense that the searcher is looking for information that they have seen before or that they have previously searched for and found. Alternatively, there
This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.
are those sessions that are focused on information discovery in the sense that the searcher is looking for new information. O'Day and Jeffries [10] highlight how searcher behaviour often combines elements of recovery and discovery as searchers often engage in series of interconnected searches over time and explore topics in an undirected fashion. [7] presents the results of a limited laboratory study which shows that searchers' ability to re-find information is based on their familiarity with the search task and the frequency with which they perform the same or similar tasks. [14] presents the results of an analysis of query logs which shows that users frequently engage in re-finding behaviour. While the analyses in [7,10,14] focus on the individual searcher, in this paper we consider similar behaviours within a community of related searchers and show how the search patterns of such communities can be harnessed to provide useful assistance during individual search sessions. The field of exploratory search (see [16]) seeks to aid users in acquiring and aggregating new information. In that context, Marchionini [9] distinguishes between three related types of search activity: lookup is concerned with fact retrieval or known item search and is analogous to recovery-oriented search, while learning and investigating activities are more exploratory in nature and can be viewed as being discovery-oriented. In [15] the concept of interaction histories is introduced to demonstrate that users' information discovery through browsing can be enhanced by leveraging past users' actions. As stated in that work, "we all benefit from experience; preferably someone else's". In this paper we utilise an approach to personalized Web search known as collaborative Web search (CWS), described in detail in [12,13], which is specifically designed to aid in the recovery and discovery of information by harnessing the search patterns of a community of like-minded searchers. CWS is designed to work in concert with some underlying search engine(s), post-processing result-lists to better match the inferred interests of the community. In short, results that have been selected by community members in the past are likely to be promoted for similar queries in the future. Since, using CWS, searchers benefit from the searching histories of their community members, the learning- and discovery-oriented searching tasks that have been performed by community members in the past can be leveraged so that when future users perform similar tasks, they are more recovery-oriented. Other work using the core CWS technique [12,13] has focused on a particular architecture and promotion mechanism. In this paper we will make a number of contributions. First, to tackle some problems experienced by CWS implementations in the past, such as reduced usage due to user mistrust and poor performance, we will describe a new proxy-based architecture that facilitates improved integration opportunities with underlying search engines. Second, we will describe recent interface work that goes beyond the simple promotion of results to explore how such promotions should be communicated or explained to searchers and the type of information that can make up such explanations. Finally, we will describe the results from a recent live-user trial of CWS in a corporate search context that, for the first time, looks specifically at how CWS
promotions aid in the recovery and discovery of information and shows how leveraging the search histories of other users for these promotions is preferable to using only individuals’ searching histories.
2 Collaborative Web Search
Collaborative Web search (CWS) complements the results of an underlying search engine by promoting results that are likely to be relevant to the community's interests and preferences. While the basic CWS approach has been described in other work [13], we will review the core technique in this section and describe how we have extended it to accommodate a more flexible deployment architecture and a novel explanation-oriented interface.
2.1 Community-Based Promotion
For a target search query, CWS combines a default result-list, RS, from a standard search engine with a set of recommended (promoted) results, RP, drawn from the community's past search history. To do this the search histories of a given community, C, of users (C = {U1, ..., Un}) are stored anonymously in a hit matrix, H^C, such that each row corresponds to some query qi and each column to some selected result page pj. The value stored in H^C_ij refers to the number of times that page pj has been selected for query qi by members of C. Each hit-matrix acts as a repository of community search experiences: the results that the community members have found to be relevant for their queries. When responding to a new target query, qT, H^C is used to identify and rank results that have been regularly selected in the past. These results can then be promoted ahead of any results returned by the underlying search engine according to their estimated relevance with respect to the community's interests. The relevance of a result pj in relation to a query qi can be estimated by the relative frequency with which pj has been selected for qi in the past, as shown in Equation 1. More generally, we can pool the results that have been selected for queries that are similar to qT (see Equation 2) and rank each result according to the weighted model of relevance (see Equation 3, where WRel(pj, qT, q1, ..., qn) represents the weighted relevance score of page pj for query qT and Exists(pj, qi) is a function whose value is 1 if Hij is non-zero and 0 otherwise), which weights each individual result's relevance by the similarity of the associated query to qT.

Relevance(pj, qi) = Hij / Σ_{∀j} Hij    (1)

Sim(q, q') = |q ∩ q'| / |q ∪ q'|    (2)

WRel(pj, qT, q1, ..., qn) = Σ_{i=1..n} Relevance(pj, qi) · Sim(qT, qi) / Σ_{i=1..n} Exists(pj, qi) · Sim(qT, qi)    (3)
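Assuming the hit matrix is stored as a nested dictionary keyed by query and then by page (a representation of our choosing), Equations 1-3 translate directly into the following sketch; it is illustrative only and not the deployed implementation.

def sim(q, q2):
    """Equation 2: overlap between two queries, treated as sets of terms."""
    a, b = set(q.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def relevance(H, page, query):
    """Equation 1: relative selection frequency of `page` for `query`."""
    row = H.get(query, {})
    total = sum(row.values())
    return row.get(page, 0) / total if total else 0.0

def weighted_relevance(H, page, target_query):
    """Equation 3: relevance of `page` pooled over queries similar to the target."""
    num = den = 0.0
    for q, row in H.items():
        s = sim(target_query, q)
        if s == 0.0:
            continue
        num += relevance(H, page, q) * s
        if row.get(page, 0) > 0:                 # Exists(p_j, q_i)
            den += s
    return num / den if den else 0.0

def promotions(H, target_query, k=3):
    """Candidate pages ranked for promotion ahead of the default result-list."""
    candidates = {p for row in H.values() for p in row}
    ranked = sorted(candidates,
                    key=lambda p: weighted_relevance(H, p, target_query),
                    reverse=True)
    return ranked[:k]

# Hypothetical hit matrix: H[query][page] = number of selections by the community
H = {"michael jordan": {"pageA": 12, "pageB": 3},
     "michael jordan machine learning": {"pageA": 7}}
print(promotions(H, "michael jordan"))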
Although CWS depends on the underlying search service to provide relevant results somewhere in the result list for a given query, it enhances any search service in a number of ways. Firstly, since CWS can return results that have been selected for similar queries, relevant results which may not have appeared at all for the target query can be included. Also, if a relevant result is not returned in the first page of results, the chances of a searcher finding it diminish greatly. With CWS, there only needs to be a small minority of determined searchers willing to forage for these relevant results and select them so that in subsequent searches, they can be promoted to the first result page.
2.2 A Proxy-Based Architecture for Collaborative Web Search
In this paper we detail an architecture for implementing CWS which provides for a more seamless and flexible integration with existing search engines. CWS has previously been implemented as a form of meta-search, post-processing the results of some underlying search engine(s) and presenting these re-ranked results through a dedicated search interface. The problem with this approach is the need to convince users to try an alternative search service and learn a new search interface. The architecture presented here facilitates a very different form of integration with an underlying search engine, by using a proxy-based approach (see Figure 1) to intercept queries on their way to the underlying search engine and manipulate the results that are returned from this engine back to the searcher; basically users get to use their favourite search engine in the normal way, but CWS promotions are incorporated into the result-lists directly via the proxy.
Fig. 1. Proxied architecture for a CWS system
Briefly, the proxy agent (currently a standard Squid¹ proxy) operates to transparently filter URL requests from all configured browsers. In this case, the proxy redirects Google requests (e.g. query submissions, result selections, etc.) to the CWS server. For example, consider a user Ui submitting query qT to Google. This request is redirected to the CWS server, whereupon two things happen. First, the query is passed on to Google and the result-list RS is returned in the normal way. Second, in parallel, the query is also used to generate a ranked set of promotion candidates, RP, from Ui's community hit matrix, as outlined above. These promotion candidates are annotated by the explanation engine (see Section 2.3) to present the searcher with a graphical representation of their community history. Result lists RP and RS are merged and the resulting list Rfinal is returned to the user; typically this merge involves promoting the k (e.g., k = 3) most relevant promotions to the head of the result-list. Figures 2 and 3 present example screen shots for the result-list returned by Google for the query 'Michael Jordan'. In the case of Figure 2 we see the default Google result-list, with results for the basketball star dominating. In Figure 3, however, we see a result-list that has been modified by our proxy-based version of CWS, trained by a community of artificial intelligence and machine learning researchers. The results appear to be standard Google results but now we see that the top 3 results are promotions for the well-known Berkeley professor.
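The merge step itself can be kept very simple, as in this sketch (illustrative only; the deployed system rewrites the actual Google result page): promoted results are moved to the head of the list and duplicates are dropped from the remainder.

def merge_results(promoted, default_results, k=3):
    """Place up to k promoted results at the head of the default result-list and
    drop duplicates from the remainder, so each result appears only once."""
    head = promoted[:k]
    seen = set(head)
    return head + [r for r in default_results if r not in seen]

# R_P from the hit matrix, R_S from the underlying engine (hypothetical page ids):
r_final = merge_results(["p7", "p2"], ["p1", "p2", "p3", "p4"])
# -> ["p7", "p2", "p1", "p3", "p4"]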
Fig. 2. The result list returned by Google in response to the query ‘michael jordan’
2.3 An Explanation-Oriented Interface
One of the contributions of this paper relates to the manner in which promoted results are presented to the searcher. We propose that there are potential benefits to be gained from exposing the searcher to the search history of a promoted result, to help them better understand the reasons behind its promotion.
¹ http://www.squid-cache.org
Fig. 3. The result list returned by CWS in response to the query ‘michael jordan’ issued within a community with a shared interest in computer science. The extra explanation information available by mousing over each promoted result icon type is also shown.
Thus, we introduce an explanation engine to the CWS architecture, which annotates promoted results with one or more explanation icons designed to capture different aspects of the result's community history. Specifically, three basic types of explanation icon are used: popularity icons convey the overall selection popularity of a result among the target community; recency icons indicate how recently the result has been selected by a community member and can help the searcher to judge how current a particular promotion might be; finally, related queries icons indicate whether the result in question has been previously selected by the community for any queries that are related (similar) to the current target query. In Figure 3 we can see how each of the top 3 promotions are annotated with different subsets of the above icons, depending on the selection history of the result in question. Moreover, 'mousing-over' an explanation icon will reveal additional explanatory detail to the searcher; see Figure 3. For example, mousing-over the related queries icon will present a list of the most common related queries that have led to selections for the result in question. These related queries can themselves be selected by the searcher to initiate a new search.
3 Evaluation
The current proxy-based architecture and explanation-oriented interface have been used as the basis of a long-term trial of the CWS approach in a corporate search scenario. In this section we will describe recent results drawn from this trial, paying particular attention to promoted results and the origin of these promotions as indicators of recovery and discovery behaviour on the part of searchers. Specifically, we are interested in how CWS can help to aid recovery and discovery tasks among the community of searchers. For an evaluation of the utility of the explanation icon enhancements to the CWS interface, see [4].
3.1 Setup
The objective of this trial was to explore the impact of CWS in a realistic search setting. The trial participants included the 70+ employees, aged from their early 20s to early 50s, of a local Dublin software company. The CWS architecture was configured so that all Google search requests were redirected through the CWS system. The search experience was based on the standard Google interface, with a maximum of 3 results promoted (and annotated with explanations) in any session; if more than 3 promotions were available then un-promoted results were annotated with explanation icons but left in their default Google position. As evidence in favour of the new proxy-based architecture, feedback from trial participants has been positive, both in terms of the availability of the familiar Google interface and search response times. This contrasts with previous trials, which involved a custom interface, as participants were often reluctant to learn the new search interface or to depart from their trusted Google. The results presented here are drawn from just over 10 weeks of usage (from 27-07-2006 to 09-10-2006) and cover a total of 12,621 individual search sessions. One of the challenges in evaluating new search technologies in a natural setting is how to evaluate the quality of individual search sessions. Ideally we would like to capture direct feedback from users as they search when it comes to the relevance of search results (promotions versus standard results in this case). For example, it would be relatively straightforward to ask users to provide feedback during each session or as they selected specific results. Unfortunately, this was not possible in the current trial because participants were eager to ensure that their search experience did not deviate from the norm, and were unwilling to accept pop-ups, form-filling or any other type of direct feedback. As an alternative, we will use an indirect measure of search session relevance, based on the concept of a successful session [12,13]. A successful session is defined as one where at least one search result was selected, indicating that the searcher found at least one (apparently) relevant result. While this is a relatively crude measure of overall search performance it allows us to compare search sessions in a systematic way. Finally, in investigating recovery and discovery behaviours, we will refer to the clickthru probability of promotion types (see Section 3.3 for a description of the different types of promotions that were considered). This represents the chance that a promotion of a given type will be selected, and is calculated as the number of promotions of that type that were selected as a percentage of the total number of promotions of that type that were displayed.
3.2 Success Rates: Discovery vs Recovery Tasks
To begin with we compare search sessions with promotions (promoted sessions) to search sessions without promotions (standard sessions) in terms of their success rates. During the course of the 10 week trial, approximately 22% of sessions were promoted sessions and the success rates are presented in Figure 4. These results indicate that, on average, sessions with promotions are more likely to be successful (62%) than standard sessions (48%) containing only Google results, a relative benefit of almost 30% due to the community-based promotion of results.
Fig. 4. The success rates for sessions containing promotions compared to those without promotions
This difference in success rates also speaks to some basic differences when it comes to recovery and discovery tasks. Standard sessions can be considered to be examples of users engaging in discovery tasks because neither they nor their colleagues have searched for similar information previously, otherwise promotions would have been made based on these past sessions. Of course this view is somewhat simplified because potentially there are standard sessions which are recovery tasks but whose related queries occurred before the start of the trial. Intuitively, discovery is likely to represent a more challenging search task compared to recovery. In the case of the latter the searcher is likely to remember something about queries that they may have used in the past to locate a particular result. The same is obviously not true in discovery, since there are no such past queries. Thus, we would expect a higher success rate for recovery type sessions than discovery type sessions.

This intuition is borne out by the higher success rates of promoted sessions. Promoted sessions are likely to include many recovery type sessions because the promotions will come from the past sessions of the current searcher. However some promoted sessions are also likely to contain promotions from the past sessions of other users. From the current searcher's perspective these are discovery sessions, because they themselves have not previously searched for this information. However, from the perspective of the community as a whole, they are more like recovery sessions, because others have successfully searched for this information before. We will see in a later section how CWS can effectively transform a challenging discovery task (for a given searcher) into a more manageable recovery task, by promoting the selections of others, and in so doing improve their success rates.

3.3 Promotion Frequency and Clickthru Probability
Ultimately in this paper we are interested in better understanding the source of promotion success as it relates to recovery and discovery tasks. For any given search session, promotions may originate from the target searcher’s own past
search sessions (we will refer to these as self promotions) or from the search sessions of other community members (we will refer to these as other promotions). We also consider promotions that have originated both from the searcher's own past and from the past sessions of others; a self & other promotion. As mentioned briefly in the section above, sessions containing self promotions are related to recovery tasks for the searcher, whilst sessions containing only the promotions of others are discovery tasks from the viewpoint of the searcher. The key issue is the relative value of these different types of promotions.

To begin with, in the bar graph in Figure 5, we show the overall frequency of each type of promotion and their selection rates. In total, during the course of the trial nearly 5000 individual promotions were made and Figure 5 tells us that about 75% of these were self promotions with just over 20% coming from the search sessions of other users. We also see that over all of the promotions that were selected, just under 60% of these were self promotions whereas nearly 40% were from other users. This is interesting since the frequency of self promotions tells us that searchers are often engaged in recovery-type search activity. However, despite this apparent focus on recovery we find users frequently selecting results that have been previously selected by other users (other promotions). In this way CWS promotions are helping users to discover new information even during recovery tasks; information that, presumably, they had previously missed. This is clarified in the line graph in Figure 5, where we present the overall probability that a given promotion type will be selected. We see that self promotions are only selected about 12% of the time, whereas the promotions from other community members are selected more than twice as often (25%). We are finding that in a very practical sense, the past search behaviour of other users is proving to be twice as useful - when it is available - as the past search behaviour of the current searcher. In a corporate setting this is important, given the time that the average knowledge-worker spends (or wastes) searching
for information. According to a recent IDC report [5], the average knowledge-worker spends 25% of their time searching for information and an enterprise employing 1,000 knowledge workers will waste nearly $2.5 million per year (at an opportunity-cost of $15 million) due to an inability to locate and retrieve information. Different employees have different levels of search expertise and these results suggest that CWS has an important role to play when it comes to sharing this expertise, because everyone can benefit from the searches of others.

Fig. 5. Percentage of all promotions and selected promotions that had each source type (bars); probability that a result from each source will be selected (line)

3.4 A Session-Based Analysis
The previous results present averages over all promotions and as such do not tell us about the promotion success at the level of individual search sessions, which is arguably more important when it comes to understanding granular search tasks. To begin with, we measure the relative success rates of sessions containing the 3 types of promotions (see the bar graph in Figure 6). For sessions that contain self promotions only (approximately 75% of promoted sessions), we find that they are successful just under 60% of the time. In contrast, about 23% of sessions contain promotions that only come from other users, and more than 65% of these sessions are successful. Finally, for the 2% of sessions that contain both types of promotions (self & other) we find success rates in excess of 71%. These results reconfirm the value of promotions (of other users in particular). Remember from Figure 4 that the baseline average success rate of standard sessions (in which discovery tasks are likely to dominate) is only 48%. Using (self) promotions to aid recovery increases success rates from 48% to almost 60%, a 25% relative increase. Using the promotions of others to effectively convert a discovery task into a recovery task (from the standpoint of the searcher) increases the success rate even further, to over 65%, a relative increase of more than 35%.
Fig. 6. The success rates for sessions containing promotions for each promotion source type (bars) and the probability that a session containing promotions from a given source will result in one of the promotions being selected (line)
Finally, it is worth considering the probability that at least one promotion will be selected from these promoted session types (see the line graph in Figure 6), as an analogue to the clickthru probabilities presented in the previous section. For example, we see that on average, in sessions that only contain self promotions, there is a 30% probability that any one of these promotions will be selected. In contrast, for sessions containing the promotions of other searchers, this selection probability almost doubles to nearly 60%. And in sessions with both self and other promotions, the selection probability grows to nearly 70%.
4 Conclusions
Collaborative Web search seeks to improve the quality of an underlying search engine by promoting results that have been previously selected by a community of searchers. In this paper we have introduced a new architecture for collaborative Web search and an explanation-based interface for the presentation of promoted results. In addition, we have presented the results of a live-user trial and considered their implications from the perspective of how they relate to recovery and discovery activity among searchers in a corporate search setting. The strength of CWS is twofold: (1) it helps to highlight previously selected results (from similar past queries) in recovery tasks; (2) it helps to convert some discovery tasks into recovery tasks by highlighting the promotions of other community members. In relation to (1) the success rate increases by 25% (from 48% to 60%) and in relation to (2) the success rate increases by more than 35% (from 48% to above 65%). Thus we can conclude that this type of community-based result promotion is providing useful assistance to searchers and, in particular, that the past sessions of other users are playing a particularly powerful role when it comes to making result recommendations; searchers are more than twice as likely to select the promotions of others - when they are made - as their own. These benefits can have a significant impact on many search scenarios. Our results indicate that CWS can assist searchers in at least 25% of their sessions, improving their success rates by an average of 30%, which is an improvement of nearly 8% across all sessions. Translating these improvements into saved search time can have a significant impact on enterprise productivity. For example, using the data in [5] suggests an annual saving for a 1,000-employee company of up to $200,000 and a recovered opportunity-cost of up to $1.2 million.
References

1. Krishna Bharat. SearchPad: Explicit Capture of Search Context to Support Web Search. In Proceedings of the Ninth International World-Wide Web Conference (WWW '00), pages 493–501. North-Holland Publishing Co., 2000.
2. Sergey Brin and Larry Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
3. J. Budzik and K. Hammond. User Interactions with Everyday Applications as Context for Just-In-Time Information Access. In Proceedings of the 3rd International Conference on Intelligent User Interfaces (IUI '00), pages 44–51. ACM Press, 2000.
4. M. Coyle and B. Smyth. On the Community-Based Explanation of Search Results. In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI '07), page (in press). ACM Press, 2007. Hawaii, U.S.A.
5. Susan Feldman and Chris Sherman. The High Cost of Not Finding Information. IDC White Paper, IDC Group, 2000.
6. L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: the concept revisited. In Proceedings of the 10th International Conference on the World Wide Web (WWW '01), pages 406–414. ACM Press, 2001.
7. Robert G. Capra III and Manuel A. Pérez-Quiñones. Using web search engines to find and refind information. Computer, 38(10):36–42, 2005.
8. E. Glover, S. Lawrence, M. D. Gordon, W. P. Birmingham, and C. Lee Giles. Web Search - Your Way. Communications of the ACM, 44(12):97–102, 2000.
9. Gary Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41–46, 2006.
10. Vicki L. O'Day and Robin Jeffries. Orienteering in an information landscape: how information seekers get from here to there. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '93), pages 438–445, New York, NY, USA, 1993. ACM Press.
11. Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz. Analysis of a Very Large AltaVista Query Log. Technical Report 1998-014, Digital SRC, 1998.
12. Barry Smyth, Evelyn Balfe, Oisín Boydell, Keith Bradley, Peter Briggs, Maurice Coyle, and Jill Freyne. A Live-user Evaluation of Collaborative Web Search. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pages 1419–1424. Morgan Kaufmann, 2005. Edinburgh, Scotland.
13. Barry Smyth, Evelyn Balfe, Jill Freyne, Peter Briggs, Maurice Coyle, and Oisín Boydell. Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction: The Journal of Personalization Research, 14(5):383–423, 2004.
14. Jaime Teevan, Eytan Adar, Rosie Jones, and Michael Potts. History repeats itself: repeat queries in Yahoo's logs. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pages 703–704, New York, NY, USA, 2006. ACM Press.
15. A. Wexelblat and P. Maes. Footprints: History-Rich Web Browsing. In Proceedings of the Third International Conference on Computer-Assisted Information Retrieval (RIAO '97), pages 75–84, 1997. Montreal, Quebec, Canada.
16. Ryen W. White, Bill Kules, Steven M. Drucker, and Monica M. C. Schraefel. Supporting exploratory search. Communications of the ACM Special Issue, 49(4), 2006.
Collaborative Filtering Based on Transitive Correlations Between Items

Alexandros Nanopoulos
Department of Informatics, Aristotle University of Thessaloniki, Greece
Abstract. With existing collaborative filtering algorithms, a user has to rate a sufficient number of items, before receiving reliable recommendations. To overcome this limitation, we provide the insight that correlations between items can form a network, in which we examine transitive correlations between items. The emergence of power laws in such networks signifies the existence of items with substantially more transitive correlations. The proposed algorithm finds highly correlative items and provides effective recommendations by adapting to user preferences. We also develop pruning criteria that reduce computation time. Detailed experimental results illustrate the superiority of the proposed method.
1 Introduction
Collaborative Filtering (CF) is a method that provides personalized recommendations, based on suggestions of users with similar preferences. CF helps users to identify interesting information, especially when the number of items renders their individual examination prohibitive.

CF was initially developed for information-filtering systems, which allowed users to find relevant documents based on previous comments by other users, e.g., the Information Tapestry project [5]. During the recent years, following its success, the use of CF has been extended to several new domains. In the World-Wide Web, non-guided search usually returns numerous irrelevant results. Active Filtering (AF) [10] is a type of CF that tries to solve this problem by allowing users to form lists of commonly used links and recommend them to others. CF has also attracted significant attention in the area of retail and especially in ecommerce [9,14]. In these areas, CF matches preferences of the target user with those of similar users, based on purchase or browsing patterns, e.g., Amazon, TiVo. Finally, CF has been proposed for Web personalization, by customizing the content and structure of a web site to meet the specific needs of users [11].

CF algorithms often are categorized either as memory-based or as model-based. Memory-based algorithms operate directly on user-preference data and make predictions based on the most similar past users. In the following, data pertaining to user preferences are called ratings. Model-based algorithms first develop a model from the training data, which is then used to make predictions. Research results and practical experience have reported that memory-based algorithms attain good performance in terms of accuracy [2,7]. Moreover, they are simple, since in contrast to model-based algorithms, they do not require
complex tuning of parameters. These characteristics have led to the wide acceptance of memory-based CF algorithms in many real-world applications, most notably Amazon [9]. Recent research has attempted to combine these two classes, e.g., [17,19].

Motivation. With a memory-based CF algorithm, a new user has to rate a sufficient number of items, before the algorithm is able to provide reliable and accurate recommendations. However, in most real-world applications of CF, e.g., ecommerce, web-personalization, it is crucial for users to receive good recommendations at the early steps of their interaction. Without this requirement, factors like customer attraction, retention, or satisfaction are significantly affected. Therefore, the aforementioned problem constitutes one of the main limitations of memory-based algorithms. The reason is that these algorithms use only direct correlations between items or users, which are imprecise when the new user has not rated many items. The problem is further intensified by the effect called sparsity, which occurs because past users tend to rate only a very small fraction of items.

Following a different direction, we identify that direct correlations between items can be viewed as a kind of network. In this framework, the way items are connected through paths can be considered as transitive correlations between them. Moreover, we recognize that some items are significantly more connected with others via paths. This property is a valuable source of information when existing ratings do not suffice, and indicates the suitability of such items for recommendation purposes. These issues have not been examined so far in a similar framework. What is, therefore, required is a new method that will take them into account in order to significantly improve the quality of recommendations.

Contribution. In this paper, we propose a novel algorithm, which considers the issues discussed previously. Our contributions are summarized as follows: (a) We provide the insight that, when considering the direct correlations between items as a network, some items clearly emerge to be substantially more accessible from others through paths. (b) We define the notion of accessibility matrix, which helps to find separately for each item, the items that are more reachable from it. This differs from existing network analysis methods, which only find highly accessible items at a global level. The reason is that, given a new user, we want to recommend items that are most reachable from (correlated with) the items he has already rated, thus to adapt to his preferences. (c) To clearly understand the characteristics of the proposed approach, we analyze the distribution of values in accessibility matrices and detect that it tends to form power laws. This fact verifies our assumption about the existence of items that are much more highly accessible than others. (d) To address issues pertaining to the time required to form the accessibility matrix, we develop pruning criteria that reduce the search space. (e) We present detailed experimental results that compare the performance of the proposed method against existing memory-based CF algorithms and existing network analysis methods, using real and synthetic data sets.

The rest of this paper is organized as follows. Section 2 summarizes the related work. The proposed approach is described in Section 3. In Section 4 we describe
the qualitative characteristics of our method, whereas experimental results are given in Section 5. Finally, Section 6 concludes this paper.
2 Related Work
Memory-based CF includes user-based methods and item-based methods. User-based methods [2,6,13] use ratings from similar users to make predictions for a new user. In contrast, item-based methods [4,15] rely on similarity between items instead of users. Previous research [4,9,15] has shown the clear advantage of item-based versus user-based methods. Thus, for comparison purposes, we consider the method proposed in [15], which is henceforth denoted as IB.

Preference about items can be expressed either explicitly on a rating scale, e.g., 1–5, or implicitly, by recording user actions, like “page visited” or “item purchased”. The latter approach results in binary-valued log-archives, which can be conceptualized as ratings on the 0/1 scale. In this scale, 1 represents ‘relevance’ or ‘likeness’ similarly to the notion of ‘relevance’ in text retrieval [18]. Since users are unlikely to provide explicit ratings (they may consider it annoying or time demanding) [3], the implicit, so-called “log-based”, acquisition of user preferences makes CF more favorable in practice. For this reason, we focus on this category and assume that ratings are binary. However, it is possible to extend the proposed approach for multi-scale ratings as well.

Recent research has considered the sparsity problem. Schein et al. [16] propose to combine content and collaborative data under a single probabilistic framework. For binary ratings, Huang et al. [8] use transitive associations between users and items in the bipartite user-item graph. Transitive associations are formed by spreading activation algorithms. Experiments in [8] show that this method outperforms IB (and other classic CF algorithms) when data are very sparse, but otherwise its performance degrades due to the over-activation effect. Differently from [8], our approach considers transitive correlations between items, not between users and items. Thus, it does not rely on complex spreading activation schemes, but naturally extends the widely accepted item-based algorithms with the notion of transitive correlations. Also, our focus is not on data sparsity in general, but on how to provide good recommendations when the new user has not rated a sufficient number of items (a case of the cold-start problem [16]). Finally, to address sparsity, Papagelis et al. [12] propose a kind of social network, and they find transitive associations between users in the form of trust inferences. Their approach extends user-based methods, whereas we focus on item-based methods, because of their advantage over user-based methods.
3 Proposed Methodology

3.1 Network Formation
In this section we describe how to form a network based on direct correlations between items. Assume that I is the domain of all items. As described, we assume
boolean transactions. A set, Y, of items rated with 1 by a user, is denoted as a transaction, thus Y ⊆ I. D henceforth denotes a collection of user transactions. To measure correlation between items, we need some simple definitions.

Definition 1 (Occurrence probability). The occurrence probability for a set of items X ⊆ I, denoted as P(X), is defined as the relative number of D's transactions that contain X, i.e.: P(X) = |{Y | Y ∈ D, X ⊆ Y}| / |D|.

When X = {i}, i ∈ I (X is a singleton), then P(X) = P({i}) denotes the occurrence probability of item i.

Definition 2 (Conditional occurrence probability). Let i, j be two items of I. Given the occurrence of i, the conditional occurrence probability of j is defined as C(j|i) = P({i, j}) / P({i}).

C(j|i) measures the correlation between each pair of items i, j. A network G can be formed as a directed weighted graph, where for each item i ∈ I we have a corresponding vertex in G. Between two items i, j we have an arrow in G if C(j|i) > 0. To an arrow between i and j we assign a weight equal to C(j|i).

Example 1. Assume a collection of 7 past transactions, depicted in Figure 1a (with IDs a–g). The corresponding network is depicted in Figure 1b. For instance, there is an arrow from vertex 1 to vertex 3 with weight 0.33, because P({1}) = 3/7, P({3}) = 4/7, and P({1, 3}) = 1/7. Therefore, the weight is equal to C(3|1) = P({1, 3})/P({1}) = 1/3. In the inverse direction, the arrow is from vertex 3 to vertex 1 with weight 0.25. There is no arrow from vertex 1 to vertex 2, because C(2|1) = 0.
Fig. 1. Example of network formation
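The network of Definitions 1 and 2 can be built directly from the transaction collection D. The following sketch is one possible implementation, not the author's code; it assumes transactions are given as Python sets of item identifiers.

```python
from collections import defaultdict
from itertools import permutations

def build_network(transactions):
    """Directed weighted graph of Section 3.1: edge (i, j) carries C(j|i).

    transactions: iterable of sets of items (each set is one user transaction).
    Returns a dict mapping (i, j) -> C(j|i), with edges only where C(j|i) > 0.
    """
    item_count = defaultdict(int)   # number of transactions containing i
    pair_count = defaultdict(int)   # number of transactions containing {i, j}
    for t in transactions:
        for i in t:
            item_count[i] += 1
        for i, j in permutations(t, 2):
            pair_count[(i, j)] += 1
    # C(j|i) = P({i, j}) / P({i}); the 1/|D| factors cancel out.
    return {(i, j): pair_count[(i, j)] / item_count[i] for (i, j) in pair_count}
```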
3.2 Computing the Accessibility
In a network G, a path p = i1, . . ., in of length n is a sequence of successive arrows of the form ij → ij+1 between n items of G. For a path p, s(p) denotes
the starting item of p and e(p) the ending item. A path p implies a transitive correlation between items s(p) and e(p). To measure the amount of correlation, we assume that the transition from item s(p) to item e(p) through path p represents a Markov chain, thus we define the corresponding transitive probability.

Definition 3 (Transitive probability). For two items i1 and in for which there exists a path p = i1, . . ., in, we define the transitive probability through p as T(i1, in | p) = Π_{j=1..n−1} C(ij+1 | ij), ij ∈ p, 1 ≤ j ≤ n.

Example 2. In the network of Figure 1b, for the path p = 1, 3, 2, we have that T(1, 2 | p) = C(3|1) · C(2|3) = 1/3 · 1/2 = 0.167.

To measure how accessible an item is from another item, we define the accessibility matrix based on the notion of acyclic path. A path p = i1, . . ., in is acyclic when no vertex appears more than once within it, i.e., ij ≠ ik for j ≠ k.

Definition 4 (Accessibility matrix). For two items i, j ∈ I, let P(i, j) be the set of all acyclic paths for which s(p) = i and e(p) = j. We form the accessibility matrix A, whose elements are given as A(i, j) = Σ_{p∈P(i,j)} T(i, j | p).

A(i, j) measures the overall transitive correlation between items i and j. Notice that A is not symmetric. Since we are interested in transitive correlations, the diagonal of the A matrix is of no use and henceforth is omitted.

Example 3. For the network of Figure 1b, to compute A(1, 2), we measure the transitive probability of all 34 acyclic paths that connect item 1 to item 2 (their lengths are between 3 and 7). This is depicted in Figure 2a. Next, we take the summation, which results in A(1, 2) = 2.27. In the same way, we can compute the entire accessibility matrix A, which is given in Figure 2b.

(a)
Path                             T(1, 2 | path)
1 → 3 → 2                        0.167
1 → 7 → 2                        0.167
1 → 3 → 4 → 2                    0.042
1 → 3 → 7 → 2                    0.125
1 → 6 → 3 → 2                    0.167
...                              ...
1 → 5 → 6 → 7 → 4 → 3 → 2        0.021
                                 A(1, 2) = 2.27

(b)
      1     2     3     4     5     6     7
1     -    2.27  2.19  1.83  1.22  1.43  2.19
2    2.27   -    2.21  1.44  1.98  1.81  2.17
3    1.65  1.66   -    1.39  1.43  1.31  2.04
4    2.75  2.16  2.78   -    2.39  2.19  2.51
5    1.82  2.96  2.86  2.39   -    1.60  2.86
6    2.15  2.71  2.63  2.19  1.60   -    2.62
7    1.65  1.63  2.04  1.25  1.43  1.31   -
Fig. 2. (a) Example of how to compute A(1, 2). (b) The resulting A matrix.
To form the accessibility matrix, for the moment we simply assume that we take into account all acyclic paths between each pair of vertexes. These paths are derived by performing a traversal over the network. In Section 3.4 we describe the use of pruning criteria that constrain the number of examined paths.
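Definition 4 can be realised with a depth-first enumeration of acyclic paths. The sketch below is an illustrative brute-force version, not the author's implementation (the pruning criteria of Section 3.4 are omitted here); it consumes the edge map produced by the network-construction sketch above.

```python
from collections import defaultdict

def accessibility_matrix(edges, items):
    """A(i, j) = sum over all acyclic paths p from i to j of T(i, j | p).

    edges: dict mapping (i, j) -> C(j|i).
    Brute force: enumerates every acyclic path, viable only for small networks.
    """
    out = defaultdict(list)
    for (i, j), w in edges.items():
        out[i].append((j, w))

    A = defaultdict(float)

    def extend(start, node, prob, visited):
        for nxt, w in out[node]:
            if nxt in visited:          # keep the path acyclic
                continue
            p = prob * w                # transitive probability of Definition 3
            A[(start, nxt)] += p        # every acyclic path contributes to A
            extend(start, nxt, p, visited | {nxt})

    for i in items:
        extend(i, i, 1.0, {i})
    return A
```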
3.3 Generating Recommendations
To find which items to recommend to a new user, we propose the PB algorithm:

Algorithm: PB
Input: the domain of items I, the current transaction X of the new user, the accessibility matrix A, the required number of recommendations M.
Output: M recommended items.
1. For each j ∈ I we define its R(j) value as follows: R(j) = Σ_{i∈X} A(i, j) if j ∉ X, and R(j) = 0 otherwise.
2. Recommend the M items with the highest R(j) value.
We call this algorithm Path-based (PB), because it is based on the consideration of accessibility through paths. Intuitively, PB recommends the items with the highest overall accessibility from the items of the current transaction. In case that only M′ < M items have a positive R(j) value, we resolve the tie by selecting the remaining M − M′ items j (j ∉ X) that have the highest overall occurrence probability Σ_{i∈X} P({i, j}).

Example 4. In the example of Figure 1a, assume that the transaction of the new user is X = {6}. Also assume that we want to recommend M = 2 items. From Figure 2b we find that A(6, 2) and A(6, 3) are the two highest values in the 6-th row of matrix A. Therefore, the two recommended items are 2 and 3.
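The two PB steps can be transcribed almost directly; the sketch below is one possible rendering rather than the paper's code, and the helper structures A and pair_prob are assumed to be precomputed.

```python
def pb_recommend(A, X, items, M, pair_prob):
    """Path-based (PB) recommendation.

    A: accessibility values as a dict (i, j) -> A(i, j).
    X: set of items in the new user's current transaction.
    pair_prob: dict mapping frozenset({i, j}) -> P({i, j}), for tie resolution.
    """
    candidates = [j for j in items if j not in X]
    R = {j: sum(A.get((i, j), 0.0) for i in X) for j in candidates}

    ranked = sorted((j for j in candidates if R[j] > 0),
                    key=lambda j: R[j], reverse=True)[:M]
    if len(ranked) < M:
        # Fill the remaining slots with the items of highest overall
        # occurrence probability sum_{i in X} P({i, j}).
        rest = sorted((j for j in candidates if j not in ranked),
                      key=lambda j: sum(pair_prob.get(frozenset({i, j}), 0.0)
                                        for i in X),
                      reverse=True)
        ranked += rest[:M - len(ranked)]
    return ranked
```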
3.4 Pruning Criteria
The number of all acyclic paths between the vertexes of a directed graph (network) grows exponentially with the size of the graph. To manage the computational cost, we develop three pruning criteria that drastically reduce the number of considered paths without affecting the detection of the most accessible items. Recall that paths are formed with a traversal of the network. All criteria are applied to the current path that is extended at each step of this procedure. The description of the criteria follows.

Maximum path length (L). We stop the extension of the current path when its length exceeds a parameter L. In a long path, the transitive probability between its two ends decreases rapidly. By omitting long paths, we exclude small transitive probabilities, thus the values in the accessibility matrix are not much affected.

Maximum conditional probability (C). We stop the extension of the current path p when the value T(s(p), e(p)|p) is less than a parameter C (0 ≤ C ≤ 1). This criterion complements the previous one and filters out short paths which do not have a significant transitive probability between their ending vertexes.
Maximum fraction of followed neighbors (N). When extending the current path p, we consider only a fraction N (0 ≤ N ≤ 1) of neighboring vertexes, i.e., those for which there exists an arc from e(p) to them. This criterion reduces the search space of the graph-traversal procedure. We have examined several heuristics about how to select neighbors, e.g., select the N neighbors with the highest conditional probability w.r.t. the current node. However, we did not find significant differences among them. For this reason we use the simplest one, that is, we select at random among them.
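The three criteria slot into the traversal of the earlier accessibility sketch as early-exit checks. The version below is a hedged illustration of that idea, not the paper's implementation; the parameter names follow L, C, and N above, and neighbours are sampled at random, matching the simple heuristic described.

```python
import random

def pruned_extend(out, A, start, node, prob, visited, L, C, N):
    """Extend the current path subject to the three pruning criteria."""
    if len(visited) > L:        # maximum path length exceeded
        return
    if prob < C:                # transitive probability already below threshold
        return
    neighbours = out.get(node, [])
    if not neighbours:
        return
    k = max(1, int(N * len(neighbours)))          # fraction of neighbours
    for nxt, w in random.sample(neighbours, k):   # chosen at random
        if nxt in visited:
            continue
        A[(start, nxt)] += prob * w
        pruned_extend(out, A, start, nxt, prob * w, visited | {nxt}, L, C, N)
```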
4 Understanding the Characteristics of PB

4.1 Distribution of Values in the Accessibility Matrix
To understand the characteristics of PB, and the motivation behind it, we examine the distribution of values inside the accessibility matrix A. We use several real data sets that are related with various application areas of CF (retail, web personalization, etc). Due to space constraints, we present results for the following: (a) The BMS-POS data set (POS): contains several years' worth of point-of-sale data from a large electronics retailer (available at http://fimi.cs.helsinki.fi/data/). (b) The Clarknet log file (Clarknet): two weeks' worth of HTTP requests to the site of the ClarkNet access provider; the log was processed to create transactions (available at http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html). (c) The Entree Chicago Recommendation Data (Entree): a record of user interactions with the Entree Chicago restaurant recommendation system (available at http://kdd.ics.uci.edu/databases/entree/entree.html).

We start by examining the distribution of occurrence probabilities P(i) for all items i. The results are illustrated in Figure 3. In all plots, the horizontal axis contains the values of occurrence probabilities, whereas the vertical axis contains the number of items (denoted as count) that have a particular occurrence probability. Both axes are plotted in logarithmic scale. We additionally include the corresponding fitting curves (mean square error, linear or quadratic, depending on the case). For the POS data set, the distribution follows a power law, because in the log-log scale it shows itself to be linear. This means that we can easily distinguish a set of few items that appear substantially more frequently than the others. For the Entree data set, although the distribution is not uniform, it does not follow a power law. The distribution for Clarknet is between the previous two cases: it does not follow a power law, but it is more skewed than the case of Entree. Therefore, the three examined data sets represent three different degrees of skewness in terms of the distribution of occurrence probabilities.

Next, we examine the distribution of values inside the accessibility matrix. The results are depicted in Figure 4 (to ease the presentation, the zero values are omitted; in all measurements L was set to 6).
Fig. 3. Distribution of occurrence probabilities for (a) POS, (b) Clarknet, (c) Entree
Fig. 4. Distribution of accessibility values for (a) POS, (b) Clarknet, (c) Entree
In the horizontal axis, each point represents a value in the accessibility matrix (denoted as accessibility), whereas the vertical axis contains the number (denoted as count) of items that have a particular accessibility. As previously, both axes are in logarithmic scale and the plots include the corresponding fitting curves. For all three data sets, the distribution follows a power law, as it presents a linear trend in the log-log scale. This means that correlations between items are formed in a way that some few items become clearly distinguishable from the others by having a significantly larger number of transitive correlations. Therefore, it is possible for PB to detect such items.

4.2 Qualitative Comparison
In this section we present a qualitative comparison of PB, through which we understand better the characteristics of PB and its advantage against other possible solutions. A simple heuristic is to recommend the items with the highest occurrence probability values. Henceforth, this heuristic is called MF. For instance, in the example of Figure 1, MF recommends items 3 and 7. MF assumes that the most
“popular” items will be preferred by the new user as well. We previously observed that P(i) values do not always follow a power law. Hence, we may not be able to find a small set of distinguished items based on their occurrence probability. Moreover, MF provides recommendations independently from the items that the new user has already rated. In contrast, PB takes into account the preferences of the new user, because it recommends the items that are more correlated with those already rated.

IB [4,15] considers only direct correlations between items. For example, for the case in Figure 1, IB recommends items 1 and 5, because they have the smallest cosine distance with the already rated item 6. However, as described, these direct correlations may be imprecise when the user has not rated many items. Since IB looks only for direct correlations, differently from PB, it cannot reach items like 2.

Finally, we can develop a heuristic that is based on existing methods for network analysis, and the most prominent case is Page Rank. Therefore, we can form a matrix with probabilities of direct transition between each pair of items (for each item, the sum of probabilities is normalized to the range [0, 1]) and recommend the items with the highest Page Rank. Henceforth, this heuristic is called PR. In the example of Figure 1, the two items with the highest Page Rank are 1 and 3. Although PR considers transitive correlations, similarly to MF, it does not take into account the items that the new user has already rated. Thus, PR ignores the fact that item 2 is the most accessible from item 6 (the item already rated) but is not the most globally accessible item. In the following section we present experimental results that validate the presented qualitative comparison.
5 Experimental Results
In our experiments we compared all examined algorithms (PB, MF, IB, PR) using the real data sets that were described in the previous section. Since we consider binary ratings (see Section 2), our objective is to predict the relevant items. Thus, the examined evaluation metric is not the mean square error (MSE). Instead, we employ precision, recall, and the F measure (denoted as F1), which are widely accepted in information retrieval. Assume that an algorithm recommends to a user a set of items S1, whereas the user, after the recommendation, actually rates a set of items S2. Precision is the ratio |S1 ∩ S2| / |S1|, whereas recall is |S1 ∩ S2| / |S2|. F1 is an even combination of them: F1 = 2 · precision · recall / (precision + recall).

We separately measured the sensitivity of PB against the pruning criteria and their impact on its execution time. For PB, the default value for L is 4, for C is 0.3, and for N is 0.6. The remaining parameters for all algorithms were tuned to the best values, e.g., the number of nearest neighbors required by IB, the convergence tolerance for PR, etc. Each measurement is the result of 5-fold cross validation. Every transaction in the evaluation set is divided into two parts. The first part, denoted as past, contains the items that are treated as known and are used to generate recommendations.
Their number is denoted as past size and is given as a percentage relative to the transaction size. The second part contains the items that are treated as unknown, i.e., those we want to predict.

In our first measurement we examine the precision versus recall that results from varying the number of recommended items. For the POS data set the results are illustrated in Figure 5a (past size is 10%). As mentioned, for POS the distribution of P(i) values follows a power law. This is the best case for MF, because it can effectively distinguish the appropriate items to recommend. In contrast, because of the small past size and the skewed distribution of P(i), IB cannot detect precise direct correlations and its resulting precision and recall are very low. PB is able to overcome this problem, because it considers transitive correlations. Thus, its performance is comparable to MF's. With increasing past size for the POS data set (Figure 5b), IB improves its performance but is still clearly outperformed by the others. For larger past size, PB attains a small improvement over MF, as it can find more and better transitive correlations.
Fig. 5. Results for precision vs. recall and F1
Next we examine the Clarknet data set. The results for precision versus recall are illustrated in Figure 5c (past size is 10%). Since the distribution of P (i) values does not follow a power law, the precision/recall values of MF substantially degrade, compared to the case of POS data set. In contrast, for IB we notice a
significant improvement. Nevertheless, PB compares favorably to the other algorithms. The reason is that, as we saw, both in POS and Clarknet, there exist items with much higher accessibility than others, thus PB can detect and provide them as good recommendations, in both cases. Focusing on the impact of past size (Figure 5d), we see that for the problem we are interested in, i.e., quality of recommendation for small past size, PB is superior. When past size gets significantly large, as expected, PB and IB converge to the same performance, because IB becomes able to detect precise direct correlations. Analogous conclusions are derived for the Entree data set (Figures 5e–f).

To further understand the impact of the data set's characteristics, we examined synthetic data generated with the system of [1] and in particular the T10I4D100K data set (see [1] for the notation). We modified the generator and, by assuming a mapping of items into consecutive integer numbers, we impose that, for item i, the occurrence probability is P(i) = (1 / Σ_{∀j∈I} 1/j^{−θ}) · 1/i^{−θ}. The first factor is the normalization constant, whereas θ (0 ≤ θ ≤ 1) controls the amount of skewness (θ = 0 results in a uniform distribution, θ = 1 in a pure Zipf distribution). The resulting F1 values versus θ are illustrated in Figure 6a. As expected, with increasing θ, the performance of MF improves substantially, because the distribution of P(i) becomes more skewed. In contrast, F1 for IB and PB slightly reduces, as the number of direct and transitive correlations reduces when some items appear significantly more frequently than the others. Nevertheless, PB presents the best F1 in all cases.
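A skew-controlled item distribution of the kind the modified generator uses can be sketched as follows. This is an illustration written in the canonical Zipf form P(i) ∝ i^(−θ); the direction in which item indices map to probabilities is an assumption here and does not affect the overall skewness being controlled.

```python
def item_probabilities(num_items, theta):
    """Skew-controlled item distribution: P(i) proportional to i**(-theta).

    theta = 0 gives a uniform distribution; theta = 1 gives a pure Zipf
    distribution. The normalisation constant plays the role of the first
    factor in the formula above.
    """
    weights = [i ** (-theta) for i in range(1, num_items + 1)]
    z = sum(weights)
    return [w / z for w in weights]
```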
Fig. 6. (a) F1 vs. θ (amount of skewness), (b) F1 vs. density, (c) PB vs. PR
We also examined the impact of density, i.e., the percentage of rated items, by varying the average number of items in the transactions. The resulting F1 values versus density (given as a percentage of |I|) are illustrated in Figure 6b. In this measurement, the past size increases proportionally to the average transaction size. Therefore, for higher density, the F1 values for IB and PB converge to the same point. However, in most real-world applications, density values are quite low (sparsity). The comparison of PB against PR is presented separately, because PR presented the worst performance among all algorithms. For the Clarknet data set,
we measured F1 against the number of provided recommendations. Figure 6c depicts the results, where F1 values are given in logarithmic scale. This result shows the inadequacy of PR, which, differently from PB, does not take into account the user preferences.

Finally, we measured the sensitivity of PB against the pruning criteria. Due to space limitations, we present results only for the Clarknet data set, whereas analogous results were obtained for the other data sets. For the Clarknet data set, the resulting F1 values versus the thresholds L, C, and N are given in Figures 7a, b, and c, respectively. In the same figures we also plot (using the right vertical axis) the relative execution time for the formation of the accessibility matrix. For L, very small values do not allow PB to capture transitive correlations. However, after an adequate value, F1 stabilizes. In contrast, execution time increases rapidly with increasing L. Therefore, we need to select an appropriately small value. Analogous reasoning holds for N. Regarding C, higher values reduce execution time. However, very high values should be avoided, because they prevent the formation of transitive relations. In general, we notice that for a wide range of all threshold values, F1 for PB does not change abruptly, a fact that indicates its robustness. To support the feasibility of PB, we mention that for the Clarknet data set, when using the default values for the pruning thresholds, the absolute execution time for the formation of the accessibility matrix was less than 1 min.

Fig. 7. F1 vs.: (a) L, (b) C, (c) N
6 Conclusions
We have addressed the problem of how to provide qualitative recommendations when the new user has not rated a sufficient number of items. We have proposed a novel CF algorithm that finds transitive correlations in a network formed by the direct correlations between the items. With measurements on real data we have examined the characteristics of the proposed method and justified our approach. Detailed experimental results have illustrated its superiority over existing CF methods. In our future work we will examine how to enrich the proposed network of correlations with information about the content, and the combination of the proposed method with a transitive closure algorithm.
References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Technical Report RJ 9839, IBM Almaden Research Center, San Jose, California, 1994.
2. J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 43–52, 1998.
3. M. Claypool, M. W. P. Le, and D. C. Brown. Implicit interest indicators. In Proc. IUI Conf., 2001.
4. M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. on Information Systems, 22(1):143–177, 2004.
5. D. Goldberg, D. Nichols, M. Brian, and D. Terry. Using collaborative filtering to weave an information tapestry. ACM Communications, 35(12):61–70, 1992.
6. J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proc. ACM SIGIR Conf., pages 230–237, 1999.
7. J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. on Information Systems, 22(1):5–53, 2004.
8. Z. Huang, H. Chen, and D. Zeng. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Trans. on Information Systems, 22(1):116–142, 2004.
9. G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
10. D. Maltz and K. Ehrlich. Pointing the way: Active collaborative filtering. In Proc. CHI Conf., pages 202–209, 1995.
11. B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Improving the effectiveness of collaborative filtering on anonymous web usage data. In Proc. Workshop on Intelligent Techniques for Web Personalization, pages 53–60, 2001.
12. M. Papagelis, D. Plexousakis, and T. Kutsuras. Alleviating the sparsity problem of collaborative filtering using trust inferences. In Proc. iTrust Conf., pages 224–239, 2005.
13. P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering on netnews. In Proc. Conf. Computer Supported Collaborative Work, pages 175–186, 1994.
14. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. In Proc. ACM Electronic Commerce Conf., pages 158–167, 2000.
15. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. WWW Conf., pages 285–295, 2001.
16. A. I. Schein, A. Popescul, and L. H. Ungar. Methods and metrics for cold-start recommendations. In Proc. ACM SIGIR Conf., 2002.
17. J. Wang, A. P. de Vries, and M. J. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proc. ACM SIGIR Conf., 2006.
18. J. Wang, A. P. de Vries, and M. J. T. Reinders. A user-item relevance model for log-based collaborative filtering. In Proc. ECIR Conf., pages 37–48, 2006.
19. G. Xue, C. Lin, Q. Yang, et al. Scalable collaborative filtering using cluster-based smoothing. In Proc. ACM SIGIR Conf., pages 114–121, 2005.
Entropy-Based Authorship Search in Large Document Collections

Ying Zhao and Justin Zobel
School of Computer Science and Information Technology, RMIT University
GPO Box 2476V, Melbourne, Australia
{yizhao,jz}@cs.rmit.edu.au
Abstract. The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic, and are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory. We propose relative entropy of style markers for ranking, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found from within large collections. Although effectiveness does degrade as collection size is increased, with even 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-art approaches in terms of the collection size and the number of candidate authors.
1 Introduction

The purpose of authorship search (AS) is to find within a large collection the documents that appear to have been written by a given author. That is, given documents of known authorship, the task is to find other documents by the same author. AS has not previously been widely investigated, but is related to authorship attribution (AA), the task of identifying the authorship of unknown documents given a corpus of known authorship. AA and AS are valuable in applications such as plagiarism detection, literary analysis, and forensics. However, none of the AA approaches has been scaled to large document collections. For example, for the Federalist Papers, the authorship of 65 documents is explored [17]. Holmes et al. used 17 journal articles [13], while Koppel et al. used 21 English books [19]. Diederich et al. used a collection with seven authors and around 100 texts for each author [8]. Hoover's collection had 50 documents [14]. We previously used 4900 documents, so far the largest AA collection [28,30].

Our method for AS is motivated by information retrieval (IR) techniques, where matching is determined by computing the similarity between queries and documents, but there are significant differences. A key difference is choice of index terms; IR techniques make use of content-bearing words, while in AS it is necessary to identify style markers. We explore use of function words and part-of-speech (POS) tags. Another
potential difference is choice of similarity measure. We propose relative entropy as the similarity measure for AS, inspired by the language models used in IR. For data, we use collections of 10,700 to 500,700 newswire articles from the TREC collections. Our results show that, with sufficiently large queries, matching documents can be found from within large collections. While results are mixed, with even the largest document collection precision in the top 10 is 44.2%. The best results were achieved with stop words; use of style markers was less effective, while standard similarity measures were, compared to relative entropy, nearly useless. We also investigated the applicability of the proposed AS approach to AA, finding that, with a large volume of text by an unattributed author, authorship could be identified with reasonable reliability, greatly improving on previous methods. With less training text and on larger collections, the accuracy of attribution was not as good. Overall, however, we have demonstrated for the first time the feasibility of AA and AS on large document collections. As we have since shown for a small collection, the method is also highly effective for literature [29].
2 Document Search

IR systems are used to search for documents that satisfy users' information needs [2]. Current IR systems usually deal with large and heterogeneous collections and typically take as input queries of a few words, returning as a response a list of documents deemed most likely to be relevant. Search involves two stages, index term extraction and similarity computation. For documents in English, indexing involves separating the text into words, case-folding, stopping, and stemming [32].

Various models have been proposed as bases for measurement of similarity between documents and queries. One is the vector-space model [2], where items are represented as vectors. The assumption is that similar documents should be separated by a relatively small angle. Plausibly a similar assumption would apply to AS: given appropriate style markers, the distribution of style markers in the documents by an author should be similar. An alternative is the use of models that derive estimates for the probability that a document is relevant to a query. The BM25 model is one of the most successful probabilistic models in IR [24]. Whether such a model is suitable for AS is, intuitively, not clear, but given the success of BM25 in IR it is reasonable to consider its use for AS.

Language models, also based on probability theory, were originally motivated by tasks such as speech recognition. These models are used to estimate the probability distributions of words or word sequences. In IR, language models are used to estimate the likelihood that a given document and query could have been generated by the same model [7]. Given a document d and a model Md inferred from d, language models estimate the probability that the model Md could have generated the query q. Smoothing techniques are applied to assign probabilities for missing terms [6,12,27]. Although language models have elements that are counter-intuitive (suggesting, for example, that queries comprised of common words are more likely than queries comprised of words that are specific to a given document), they are a highly effective approach to IR. In this paper a key contribution is exploration of whether language models are suitable for AS.
3 Authorship Attribution

The purpose of AA is to identify documents that are written by a particular author. A range of AA methods have been proposed in recent research. Despite the differences amongst all these approaches, the framework of an AA technique involves two stages in general: extraction of document representations and making attribution decisions.

Document representations are comprised of style markers. Both lexical and grammatical style markers have been proposed for AA. A simple approach is to use lexical markers, that is, function words and punctuation symbols [8,13,15]. More sophisticated syntactic and grammatical components can be extracted by natural language processing (NLP) techniques, such as part-of-speech (POS) tags [16,25]. However, these more advanced style markers do not necessarily produce better performance for AA [30]. A particular issue is that the idiosyncratic grammatical patterns that are particular to an author may not be identified due to the lack of observations of such patterns in the process of training the parser. That is, NLP is not only error prone, but is likely to make errors on the most significant elements of the data.

In the attribution stage a variety of classification methods have been investigated. Principal component analysis (PCA) has been used in several approaches to AA [5,13]. Hoover [14] examined the scalability of PCA to large corpora or multi-class AA, in which the number of author candidates is greater than 2, finding only 25% accuracy given 50 samples by a total of 27 authors, suggesting that PCA would not scale to large numbers of authors. Machine learning approaches such as support vector machines (SVMs) [8,18] are considered to be competitive alternatives for AA. SVMs are effective when provided with sufficient samples and features. However SVMs are not always superior to other methods when given small numbers of samples for training, which is often the case in AA. Computational cost is another issue of SVMs. Bayesian networks are less effective and more computationally expensive than SVMs [11,23].

Language models have also been proposed for AA. Benedetto et al. used a compression-based language model [4], based on a standard compression suite; however, Goodman [9] was unable to reproduce the result, and the method is not plausible. A Markov model has also been applied to AA by Khmelev et al. [17], in which the features are individual characters. Good accuracy was achieved on data collected from the Gutenberg project, but the accuracy may be overestimated, due to the duplicate texts provided by Gutenberg. For example the number of distinct texts by Burroughs is only 9, but Khmelev et al. included 25 of his works in their experiments.

Most of these AA methods are not directly applicable to search tasks. In a search system a query is evaluated by ranking the similarities measured between the query and each document individually in the collection. The result is a list of top-ranked documents. In contrast to search, there is no ranking required for AA; instead, an explicit decision is made for each unknown document individually. Documents are required for training to learn a model for a particular author in AA. There is no document-by-document calculation involved. AA techniques have not been applied to search problems. In this paper we propose what we believe is the first AS mechanism.
4 Relative Entropy for Authorship Search

In AA and AS, the underlying assumption is that there are patterns or characteristics of an author's writing that can be automatically extracted and then used to distinguish their work from that of others. Given appropriate style markers, the distribution with which they are observed should be similar in all of the author's documents, regardless of topic. Distributions can be compared via their entropy, and we therefore propose use of the Kullback-Leibler divergence (KLD, or relative entropy) as a similarity measure for AA. The distributions need to be estimated, and we propose use of the language models that have been successfully applied in IR [6,20,27]. We used a similar approach for AA [30], but it was not clear that such an approach could be used for AS.

Entropy measures the uncertainty of a random variable X, where, in this application, each x \in X could be a token such as a word or other lexical feature, and p(x) is the probability mass function of X. The KLD quantifies the dissimilarity between two distributions. In the context of AS, we can build entropy models for the queries and the documents. The differences between query models and document models can be measured by relative entropy as:

KLD(d \,\|\, q) = \sum_{x \in X} p_d(x) \log_2 \frac{p_d(x)}{p_q(x)}    (1)
The divergence is calculated between the query and every document in the collection. The documents whose entropy has the lowest divergence from the query are the most likely to share authorship and thus should be the highest ranked. However, if p(x) is zero for some symbol the divergence is undefined. To address this issue, we use Dirichlet smoothing to estimate probabilities [27]:

\hat{p}_d(x) = \frac{f_{x,d} + \mu \, p_B(x)}{|d| + \mu}    (2)

Here x ranges over the style markers used for document representations and f_{x,d} is the frequency of token x in document d. The notation |d| = \sum_{x \in d} f_{x,d} represents the number of token occurrences in d, and p_B(x) is the probability of the token x in the background model, which provides statistics on the tokens. The parameter \mu controls the mixture of the document model and the background model. The background probabilities dominate for short documents, in which the evidence for the in-document probabilities is weak; when the document is longer, the influence of the background model is reduced. In principle the background model could be any source of typical statistics for token occurrences.

Additionally, a set of style markers is required. Some researchers have found that function words are effective [1,8,13,15,18], while use of part-of-speech tags has also been considered [30]. We make use of both kinds of marker, but, in agreement with our earlier work [30], find that function words are superior.
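To make the ranking procedure concrete, the following is a minimal sketch of KLD scoring with Dirichlet-smoothed models; it is our own illustration rather than the authors' implementation, the function and variable names are ours, the smoothing parameter value is arbitrary, and for simplicity both the document and the query models are smoothed and the sum is restricted to the tokens observed in either text.

```python
import math
from collections import Counter

def dirichlet_model(tokens, background, mu=1000.0):
    """Estimate a smoothed unigram model p_hat_d(x) as in Eq. (2)."""
    counts = Counter(tokens)
    doc_len = len(tokens)
    # background is a dict of token -> p_B(x); a tiny floor avoids zero probabilities
    return lambda x: (counts[x] + mu * background.get(x, 1e-9)) / (doc_len + mu)

def kld_score(doc_tokens, query_tokens, background, mu=1000.0):
    """Relative entropy KLD(d || q) as in Eq. (1); lower means more similar."""
    p_d = dirichlet_model(doc_tokens, background, mu)
    p_q = dirichlet_model(query_tokens, background, mu)
    vocab = set(doc_tokens) | set(query_tokens)   # simplification of the full event space
    return sum(p_d(x) * math.log2(p_d(x) / p_q(x)) for x in vocab)

def rank_by_authorship(query_tokens, collection, background, mu=1000.0):
    """Rank documents by increasing divergence from the query model."""
    scored = [(doc_id, kld_score(tokens, query_tokens, background, mu))
              for doc_id, tokens in collection.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

In use, `collection` would map document identifiers to their streams of style markers (e.g. function words), and `background` would hold the token probabilities of the background corpus.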
5 Experiments

As data, we use collections of documents extracted from the TREC corpus [10]. There are large numbers of documents included in TREC that can be used to evaluate the
proposed search technique, as the author of each document is identified. We believe that this data presents a difficult challenge for AA or AS, as, compared to novelists or poets, journalists do not necessarily have a strong authorial style, and the work may have been edited to make it consistent with a publication standard.

We develop three collections of documents to evaluate the proposed AS system, which consist of 10,700, 100,700, and 500,700 documents respectively. We call these the 10k-collection, 100k-collection, and 500k-collection. The documents in the 10k-collection and 100k-collection are from the AP subcollection of TREC; the 500k-collection consists of documents from AP, WSJ, and SJM. Metadata, including author names, is discarded. As authors, we select the seven1 that we earlier used for AA [28,30]. These authors are regular contributors to AP and each of them has over 800 documents in the TREC corpus. We randomly select 100 documents of each author and include them as part of each of the collections, giving in total the extra 700 documents in each case. All queries and documents are pre-processed to obtain the style markers that are applied to the system; query construction is discussed below. The background models of the different types of style markers used in all experiments are derived from the AP collection of over 250,000 documents; an alternative would have been to use the collection as the background model in each case, but we decided to hold the background model constant across all experiments.

We evaluate our proposed authorship search system from several perspectives. Scalability is examined by considering effectiveness on collections of different sizes. We run the experiments with different kinds of style marker. The differences between KLD-based search and other retrieval techniques are tested. Finally we explore use of the AS approach as an AA method.

Feasibility and scale in size. In the first experiment we examine whether AS is feasible for small and large collections. The first seven queries used in this experiment are generated by concatenating 500 randomly selected documents written by each of the seven authors. These documents are distinct from the 100 documents that are included as part of each collection. We call these the 500-document queries. The style markers are function words. The next seven queries are formed by concatenating the 100 documents that are included in the collection; we call these the 100-included queries.

The numbers of correct matches in the top-100 ranked documents are in Table 1, for the 10k-collection. Amongst the 500-document queries, those based on the documents of Currier and Dishneau are the most effective, while the query based on the documents of Beamish is much less effective. The 100-included queries are slightly better than the 500-document queries in most cases, despite being based on less text, and are highly consistent with the 500-document queries, suggesting that the style of some authors is easier to identify than that of others.

Overall precision and recall is plotted in Figure 1, based on the 500-document queries on all three collections. We achieve average p@10 (precision at 10 documents retrieved) of 84.2% on the 10k-collection, 74.2% on the 100k-collection, and 30.0% on the 500k-collection.
1 The authors are Barry Schweid, Chet Currier, Dave Skidmore, David Dishneau, Don Kendall, Martin Crutsinger, and Rita Beamish.
Table 1. The number of correct matches in the top 100 documents in response to each query, on the 10k-collection

                Schweid  Currier  Skidmore  Dishneau  Kendall  Crutsinger  Beamish
500-document      48       61       35        61        44        52         30
100-included      59       58       49        61        46        56         37
Fig. 1. Precision versus recall for 500-document queries on each of the three collections: 10k, 100k, and 500k
Thus, while the density of correct matches falls from 1% to 0.02%, effectiveness drops more slowly. Achievement of high recall is much more difficult with the largest collection, but the results show that, with a large query, AS is indeed feasible on even half a million documents.

Another dimension of scale is the volume of training data available. In the experiments above we had a large volume of text per author. With less text, effectiveness may decline. For each author we constructed five 100-document queries and twenty-five 20-document queries; average results are shown in Figure 2. It can be seen that reducing the amount of training data does indeed reduce effectiveness. For low levels of recall, queries of 100 documents (whether 100-included or queries comprised of another 100 documents) lead to reasonable effectiveness; indeed, whether or not the documents are included has a surprisingly low effect on the results, demonstrating that style as measured by function words must be moderately consistent within the work of an author. However, queries of 20 documents are much less effective. While reasonable numbers of correct documents are still found in the top 10 to 50 answers, subsequent results are poor.

Style markers. In text categorization, documents are usually indexed or represented by the topic words occurring in the documents [3,21,22,26]. However, in AA whether topic words are appropriate style markers is controversial; some researchers have used them, but most have not. In this experiment we contrasted use of function words and topic words for AA, using the 10k-collection. Results are shown in Figure 3. In this figure, the uppermost curve uses the 500-document queries and is the same as in Figure 1; the dashed line is the comparable result for queries of topic words; and the solid line is based on topic words and the 100-included queries. As can be seen, AS with topic
Fig. 2. Effectiveness for queries composed of 20–500 documents, on the 10k-collection
Fig. 3. Comparison of using different indexing methods: function words versus topic words on the 10k-collection
words completely failed for authorship search; results are little better than random. The results show that topic words are misleading in characterizing authors' writing style.

Other kinds of style marker are more plausible. For the next experiment, we used NLTK (a natural language toolkit)2 to extract part-of-speech (POS) tags from documents. That is, in this approach, each document is represented by a stream of POS tags. We used a list of 183 POS tags, then indexed documents with function words, POS tags, and both combined. Results are shown in Figure 4. Function words consistently lead to greater effectiveness than POS tags, which is consistent with our previous work in AA [30]. The combination of function words and POS tags leads to even greater effectiveness. With the smallest 10k-collection, function words are almost as good as the combined features, and both of them achieve the same p@10 of 84.2%. However, with larger collections the advantage of combination increases. On the 500k-collection, function words achieve 30.0% p@10; addition of POS tags increases this to 44.2%. These results show that, even though POS tags by themselves do not yield good effectiveness, they are helpful additional evidence of style.

2 Available from http://nltk.sourceforge.net/index.html
Fig. 4. Effectiveness of different style markers (function words S, POS tags, and the combination S-POS) on the 10k (upper) and 500k (lower) collections, using the 500-included queries
KLD ranking versus other measures. In this experiment we compare similarity measures. In addition to KLD we used three measures that have been successfully used in IR: BM25 and the vector-space measures BB-BCI-BCA and BB-ACB-BCA [31,32]. Results are in Figure 5. The IR similarity measures are surprisingly poor; none has proved suitable for AS. The BM25 measure is slightly better than the other two vector-space models, but none is usable. The reason why these measures are ineffective for AS is unclear and needs further investigation.

Applicability to authorship attribution. In this experiment we examine whether our AS approach can be used for AA. Instead of returning a list of documents that are judged likely to have the same authorship as the query, an explicit authorship is returned for the query. The proposed AA approach is as follows. We have a query for which authorship is unknown. Using search, a list of l top-ranked documents is returned. These are of known authorship, with k distinct authors and, for each author a, a count f_a of the number of documents by a in the list; thus l = \sum_a f_a. A simple way to attribute authorship is to select the a with the largest f_a. More strictly, a threshold t, where 0 ≤ t ≤ 1, can be selected so that the query is assigned to a particular author a only if a = argmax_{a'}(f_{a'}) and f_a / l ≥ t. Increasing t should reduce the likelihood of incorrect attribution.
Fig. 5. Effectiveness of different similarity measures on the 10k-collection, using the 500-document queries
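The voting-based attribution procedure described above can be summarised with a short sketch; this is our own illustration rather than the authors' code, and it assumes a `search` function that returns the (known) authors of the top-l documents ranked by KLD against the query, as in the search experiments.

```python
from collections import Counter

def attribute_author(query_tokens, search, l=10, t=0.0):
    """Voting-based authorship attribution.

    search(query_tokens, l) is assumed to return a list of author labels for
    the top-l ranked documents. Returns the winning author, or None when the
    vote falls below threshold t.
    """
    top_authors = search(query_tokens, l)      # authors of the top-l documents
    votes = Counter(top_authors)               # f_a for each author a
    author, f_a = votes.most_common(1)[0]      # a = argmax_a f_a
    if f_a / float(l) >= t:                    # require f_a / l >= t
        return author
    return None
```

With t = 0 this simply attributes the query to the author with the largest f_a, which is the setting used in the experiments below; raising t trades coverage for fewer incorrect attributions.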
To test these methods we built two collections from the AP data. The 10k-vote collection includes 10,000 documents from 342 authors, and the 100k-vote collection consists of 100,000 documents by 2229 authors. In both collections, 100 documents of each of the seven test authors are included. Overall the number of texts per author varies from 1 to 835. In both collections more than 10% of the distinct authors have written over 100 documents each. All documents in 10k-vote have identified authorship, while in the 100k-vote collection more than 90% of the texts have identified authorship. As style markers we use the combination of function words and POS tags.

Results from previous experiments show that it is feasible to search for documents written by the same author as that of the query, given a group of documents of known authorship as the query. In this experiment the authorship of the query is unknown and is to be identified. Here, 500-document queries are unreasonably large. We experimented with queries that are formed from individual documents and from 10-document sets; none of the query documents are in the collections. Results are shown in Table 2, using the threshold t = 0, so that attribution is made to the author with the largest f_a. Evaluation is based on the top l ranked documents, for l from 10 to 100.

As can be seen, queries can be effectively attributed using the 10k-vote collection using only the top 10 documents retrieved; with both 1-document and 10-document queries, increasing l is not helpful. With 1-document queries, the overall correctness of attribution is 51.0%. Previous methods achieve this accuracy only on small collections. Greater attribution effectiveness is achieved with 10-document queries, giving overall 74.3% correct attribution.

Table 2. Voting results for authorship attribution, showing the number of queries (1-document and 10-document queries) correctly attributed, on the 10k-vote collection, in the top 10, 20, 40, 60, 80, and 100 answers retrieved. There were 700 1-document queries and 70 10-document queries.
Queries   Nq    Number of answers retrieved
                 10    20    40    60    80    100
1-doc     700   357   343   334   346   335   337
10-doc     70    52    55    58    56    55    56
Table 3. Voting-based AA results for each author; for each author there are 100 1-document queries and 10 10-document queries on 10k-vote, and 20 1-document queries and 5 10-document queries on 100k-vote. On the 100k-vote collection, for some authors only negligible numbers of correct documents were found; these are shown as negl. Entries give the number correctly attributed / the average number correct in the top 10.

Collection   Queries         Schweid  Currier  Skidmore  Dishneau  Kendall  Crutsinger  Beamish
10k-vote     Q 1-doc /100    39/3.2   69/9.2   36/4.4    76/9.8    58/4.8   54/5.5      25/2.7
             Q 10-doc /10     8/3.6   10/8.0    1/2.0    10/10.0   10/7.4   10/6.3       3/3.0
100k-vote    Q 1-doc /20     negl.    14/4.8   negl.     15/5.2     8/2.9   negl.       negl.
             Q 10-doc /5     negl.     3/7.0   negl.      5/7.4     5/4.4   negl.       negl.
There has been no previous attempt at multi-class AA with more than a few authors. Both the number of authors and the size of the collection are much more substantial than in all previous AA work. We have observed strong inconsistencies amongst queries based on the work of different authors. Results extracted from top-10 lists are shown in Table 3. As can be observed, queries using documents by Currier and Dishneau are more effective than other queries, not only in accuracy of AA but also in confidence. This observation is consistent with results from previous search experiments. The confidence is indicated by the average number of correct documents in the top-k ranked list. For instance, on the 10k-vote collection, the 100 1-document queries of Dishneau can be correctly attributed at 76% accuracy, providing around 98% confidence. Note that, unsurprisingly, the effectiveness of attribution for the 10-document queries is generally better than for the 1-document queries. We also tested the proposed method on the 100k-vote collection, which has over 2000 known authors. This experiment is much less successful, with near-zero accuracy in four of the seven cases. Interestingly, these failures correspond to the results of lower confidence on the 10k-vote collection. For queries based on documents by Currier and Dishneau, the attribution accuracies are respectively 70% and 75%, suggesting 48% and 52% confidence. Again, use of 10-document queries leads to greater effectiveness. However, it can be seen that AA on large collections with large numbers of authors remains a challenge.
6 Conclusion

We have explored the novel task of authorship search. Our proposal is that simple entropy-based statistics and characterization of documents by distributions of style markers can be used to find documents by an author, given some training documents by that author. Our experiments show that such a method can be highly successful for collections of moderate size. The proposed similarity measure, the Kullback-Leibler divergence, which is used to compute relative entropy, is far more effective than standard measures drawn from information retrieval. As style markers, both function words and part-of-speech tags are effective; for large collections, combined use of both kinds of marker
led to even better results. Reasonable effectiveness can be achieved on collections of even half a million documents. To our knowledge our approach is the first that is able to search a large collection for documents written by a particular author. The success of the method is highlighted by the fact that we have used experimental data, newswire articles, that we regard as challenging for this task: in contrast to material drawn from sources such as literature, we would not expect human readers to be aware of strong stylistic differences between the authors. The proposed search approach can also be applied to author attribution. Previous methods struggle to correctly attribute authorship when given more than a few hundred documents or more than a few authors. Our method has reasonable accuracy with 10,000 documents and several hundred authors. While it did not successfully scale further in our experiments, this approach is nonetheless much more effective than previous methods and is a clear demonstration that authorship attribution can be applied on realistic collections.
References

1. H. Baayen, H. V. Halteren, A. Neijt, and F. Tweedie. An experiment in authorship attribution. 6th JADT, 2002.
2. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman, May 1999.
3. R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res., 3:1183–1208, 2003.
4. D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. The American Physical Society, 88(4), 2002.
5. J. N. G. Binongo. Who wrote the 15th book of Oz? An application of multivariate statistics to authorship attribution. Computational Linguistics, 16(2):9–17, 2003.
6. S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In A. Joshi and M. Palmer, editors, Proc. 34th Annual Meeting of the Association for Computational Linguistics, pages 310–318. Morgan Kaufmann, 1996.
7. W. B. Croft and J. Lafferty. Language Modeling for Information Retrieval. Kluwer Academic Publishers, Norwell, MA, USA, 2003.
8. J. Diederich, J. Kindermann, E. Leopold, and G. Paass. Authorship attribution with support vector machines. Applied Intelligence, 19(1-2):109–123, 2003.
9. J. Goodman. Extended comment on language trees and zipping, 2002.
10. D. Harman. Overview of the second text retrieval conf. (TREC-2). Information Processing & Management, 31(3):271–289, 1995.
11. D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
12. D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In Proc. 25th ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 35–41. ACM Press, 2002.
13. D. I. Holmes, M. Robertson, and R. Paez. Stephen Crane and the New York Tribune: A case study in traditional and non-traditional authorship attribution. Computers and the Humanities, 35(3):315–331, 2001.
14. D. L. Hoover. Statistical stylistics and authorship attribution: an empirical investigation. Literary and Linguistic Computing, 16:421–444, 2001.
15. P. Juola and H. Baayen. A controlled-corpus experiment in authorship identification by cross-entropy. Literary and Linguistic Computing, 2003.
16. A. Kaster, S. Siersdorfer, and G. Weikum. Combining text and linguistic document representations for authorship attribution. In SIGIR Workshop: Stylistic Analysis of Text for Information Access, August 2005.
17. D. V. Khmelev and F. Tweedie. Using Markov chains for identification of writers. Literary and Linguistic Computing, 16(4):229–307, 2002.
18. M. Koppel and J. Schler. Exploiting stylistic idiosyncrasies for authorship attribution. In IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.
19. M. Koppel and J. Schler. Authorship verification as a one-class classification problem. In Proc. 21st Int. Conf. on Machine Learning. ACM Press, 2004.
20. O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proc. 27th ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 194–201. ACM Press, 2004.
21. Y. S. Lai and C. H. Wu. Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology. ACM Transactions on Asian Language Information Processing, 1(1):34–64, 2002.
22. D. D. Lewis, Y. M. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, 2004.
23. B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
24. K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Inf. Process. Manage., 36(6):779–840, 2000.
25. E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35(2):193–214, 2001.
26. Y. M. Yang. A study on thresholding strategies for text categorization. In Proc. 24th ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 137–145. ACM Press, 2001.
27. C. X. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.
28. Y. Zhao and J. Zobel. Effective authorship attribution using function words. In Proc. 2nd AIRS Asian Information Retrieval Symposium, pages 174–190. Springer, 2005.
29. Y. Zhao and J. Zobel. Search with style: authorship attribution in classic literature. In Proc. Thirtieth Australasian Computer Science Conference (ACSC), to appear. ACM Press, 2007.
30. Y. Zhao, J. Zobel, and P. Vines. Using relative entropy for authorship attribution. In Proc. 3rd AIRS Asian Information Retrieval Symposium, pages 92–105. Springer, 2006.
31. J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18–34, Spring 1998.
32. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38:1–56, 2006.
Use of Topicality and Information Measures to Improve Document Representation for Story Link Detection

Chirag Shah and Koji Eguchi
National Institute of Informatics (NII), Tokyo 101-8430, Japan
{chirag,eguchi}@nii.ac.jp
(C. Shah is now with the University of North Carolina at Chapel Hill, USA; K. Eguchi is now also with Kobe University, Japan.)
Abstract. Several information organization, access, and filtering systems can benefit from different kinds of document representations than those used in traditional Information Retrieval (IR). Topic Detection and Tracking (TDT) is an example of such a domain. In this paper we demonstrate that traditional methods for term weighing do not capture topical information, and that this leads to inadequate representation of documents for TDT applications. We present various hypotheses regarding the factors that can help in improving the document representation for Story Link Detection (SLD), a core task of TDT. These hypotheses are tested using various TDT corpora. From our experiments and analysis we found that in order to obtain a faithful representation of documents in the TDT domain, we not only need to capture a term's importance in the traditional IR sense, but also evaluate its topical behavior. Along with defining this behavior, we propose a novel measure that captures a term's importance at the corpus level as well as its discriminating power for topics. This new measure leads to a much better document representation, as reflected by the significant improvements in the results.
1 Introduction

Document representation is one of the most common and crucial stages of an information organization and access system. Several methods and models of document representation have been proposed based on the target application. Examples include vector space representations [1], probabilistic language models [2], graph-based models [3,4], etc. Some of them are general enough to be applicable to almost any IR-based application. However, some tasks demand a different approach to document representation. Topic Detection and Tracking (TDT) [5] is one such domain. In this paper we analyze the peculiarities of TDT and propose a novel approach for document representation. In particular, we focus on term weighing and use Story Link Detection (SLD), a core task of TDT, as the target application.

We can identify three major components in an IR implementation on the system side:
1. The units of representation
2. The method of weighing these units
3. Matching criteria between a query and documents or between document and document
There have been a few studies focusing on the first of these components that demonstrated that using named entities as units of representation is a good idea for TDT applications [6,7,8]. Some studies suggested using noun phrases along with named entities [9]. There are also many popular methods for weighing the terms, such as TF [10], TFIDF [1], etc. A considerable amount of work has also been done on comparing two streams of text, including cosine similarity [1], information-theoretic distance [11,12], etc. In this paper our focus is on the second aspect: the ways to weigh the given units. Therefore, throughout our experiments we will use the same units (all the words) and the same matching criteria (cosine similarity of two texts).

We started this work by using some traditional IR methods for document representation for the SLD task. As we moved from one method of term weighing to another, we realized their shortcomings. Our experiments and analysis showed that none of the traditional methods captured the topical information needed to weigh the terms, which we found was essential for the TDT domain. This led us into investigating a better way to capture and incorporate such information. The main contributions of our work reported here are this realization and the methods that emerged from it.

The rest of the paper is organized as follows. In the next section we provide some background on the TDT domain and more details of the SLD task. This section highlights the uniqueness of TDT and motivates considering non-traditional measures for term weighing. In section 3 we present our hypotheses regarding term weighing and describe the systems that we implemented to test these hypotheses. The results obtained from TDT corpora are shown and a unique approach that combines a term's usual importance as well as its topical significance is proposed. Additional results and analysis are given in section 4. We finally conclude the paper with some pointers to future work in section 5.
2 Background

In this section we review TDT research and provide more specifics of the SLD task. We also show the uniqueness of TDT tasks and contrast them with traditional IR systems. With this explanation, we try to motivate the use of topical information to represent news stories.

2.1 TDT

The Topic Detection and Tracking (TDT) research has provided a standard platform for addressing event-based organization of broadcast news and evaluating such systems [5]. The governing motivation behind such research was to provide a core technology for a system that would monitor broadcast news and alert an analyst to new and interesting events happening around the world. The research program of TDT focuses on five tasks: story segmentation, first story detection, cluster detection, tracking, and story link detection. Each is viewed as a component technology whose solution will help address the broader problem of event-based news organization. The details of each of these tasks are beyond the scope of this paper. We, instead, want to focus on the general characteristics of TDT and the specific details of the story link detection task.
To appreciate the unique nature of TDT, it is important to understand the notion of a topic. In TDT, a topic is defined to be a set of news stories that are strongly related by some seminal real-world event. For instance, when hurricane Katrina hit the coast, it became the seminal event that triggered a new topic. The stories that discussed the origin of this hurricane, the damage it did to the places, the rescue efforts, etc. were all parts of the original topic. Stories on another hurricane (occurring in the same region or time) could make up another topic. This shows an important contrast with typical IR. Along with hurricane Katrina, a query "hurricane" will bring up the documents about other hurricane related events. On the other hand, for the query "hurricane Katrina", some of the stories that followed the original event of hurricane Katrina may not be considered as "about" hurricane Katrina by the traditional IR measures and would be ranked very low in the retrieval set. This contrast indicates that the notion of an event-based topic is narrower than a subject-based topic; it is built upon its triggering event. Hereafter, we focus on dealing with an event-based topic rather than a subject-based one. A typical topic would have a start time and it would fade off from the news at some point of time. Since it is a specific event, it happened not only at some particular time, but in a specific location, and usually with an identifiable set of participants. In other words, a topic is well defined in scope and possibly in time. In this paper we ignore the temporal nature of a topic and focus only on the scope.

2.2 SLD

The Story Link Detection task evaluates a TDT system that detects whether two stories are "linked" by the same event. For TDT, two stories are linked if they discuss the same event. Unlike other TDT tasks, link detection was not motivated by a hypothetical application; rather, the task of detecting when stories are linked is a "kernel" function from which the other TDT tasks can be built.

The Uniqueness of SLD. For this task, a set of pairs of documents are given to compare with each other, and we need to declare if they are on the same topic, i.e., an event-based topic as defined in section 2.1. It is important to note that although this might seem similar to finding document similarity, there is a fine difference here. It is possible that two documents do not share many common terms but belong to the same topic. It is also possible that two documents may have several terms that match with one another, but they talk about different topics. We illustrate this with an example. Figure 1 shows excerpts from two stories of different topics. The matching terms between them are highlighted. If we run any regular term-based representation for these documents, we are likely to get a match while comparing them. However, we can see that they are actually talking about different events and therefore are not on the same topic.

Evaluation. The performance of any TDT system is characterized in terms of the probabilities of missed detection (P_Miss) and false alarm errors (P_Fa) [13]. These error probabilities are linearly combined into a single detection cost, C_Det, by assigning costs to missed detection and false alarm errors and specifying an a priori probability of a target. The resulting formula is

C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{Fa} \cdot P_{Fa} \cdot (1 - P_{Target})    (1)
Fig. 1. Example of two stories on different topics getting a match with a word-based technique
where P_Miss = (number of missed detections)/(number of targets), P_Fa = (number of false alarms)/(number of non-targets), C_Miss and C_Fa are the pre-specified costs of a missed detection and a false alarm, respectively, and P_Target is the a priori probability of finding a target. This detection cost is then normalized as given below:

(C_{Det})_{Norm} = \frac{C_{Det}}{\min(C_{Miss} \cdot P_{Target},\; C_{Fa} \cdot (1 - P_{Target}))}    (2)
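As an illustration only (not part of the original paper), the normalized detection cost of Equations (1) and (2) can be computed with a few lines of code. The default costs and prior below are the values conventionally used in TDT evaluations and should be checked against the official evaluation plan for a given year.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost C_Det as in Eq. (1)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized cost (C_Det)_Norm as in Eq. (2); lower is better."""
    c_det = detection_cost(p_miss, p_fa, c_miss, c_fa, p_target)
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))

# Example: a system that misses 20% of targets and false-alarms on 1% of non-targets
print(normalized_detection_cost(p_miss=0.20, p_fa=0.01))
```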
Recent Work on SLD. Although SLD is considered to be the core technology for almost all of the tasks in TDT, not much has been done specifically for SLD. The efforts have rather been focused on more practical applications such as new event detection and tracking. The following list enumerates some of the latest trends as demonstrated by various sites for the SLD task in TDT 2004 [10].1

– Vector Space Model. This was used by the CMU1, NEU1, UIowa4, and UMass2 systems. NEU used TF weighting, unlike the more popular TFIDF weighting. CMU's system was the same as their 2002 submission with re-tuned decision thresholds. CMU also submitted two variations of their baseline: a HIGHP submission, which manually adjusts the decision thresholds of the baseline system towards higher precision (i.e. lower false alarm rate), and a HIGHR submission, which manually adjusts the decision thresholds of the baseline system towards higher recall (i.e. lower miss rate). Another variation was the UIowa3 system, which created the vectors using noun phrases.
1 The systems are identified with run-tags here as per their official NIST submission. For details, see [10].
– Relevance Models. The UMass1 and UMass3 systems used this technique. A relevance model [11] uses the new story as the query and all the past stories as the corpus [14]. It then performs retrieval and expands the new story with top words. Finally, the similarity between the models of the new story and the training stories (from past stories) is computed using Kullback-Leibler (KL) divergence.
– Voting Scheme. CMU's MULTI submission used a majority voting scheme among five separate similarity measures. Three of these measures were the similarity ratio, the Mountford coefficient, and the Lance-Williams measure. The other two similarity measures were windowed versions of the Mountford coefficient and a BLEU-like symmetric n-gram overlap measure (giving equal weight to unigrams and bigrams).
– Graph Comparison. The UIowa1 system used Cook-Holder graph similarity on noun phrases for finding the similarity between stories.
– Edit Distance Based Similarity. The UIowa2 system used normalized Levenshtein edit distance-based similarity for comparison.
3 Hypotheses and Proposed Methods

In this section we present various hypotheses about the factors that can affect the document representation. To be specific, these hypotheses relate to the term weighing schemes.

3.1 Hypothesis-1: Capturing a Term's Importance at the Document and/or Collection Level Provides a Faithful Representation

Method-1. TFIDF on all the words (baseline). TFIDF-based representation of documents is widely used in the document similarity [1], document classification [15], and document clustering [16] literature. We adopt this approach as our baseline, which is a typical bag-of-words approach. TF values were found using the following equation:

TF(t, d) = \frac{TF_{raw}(t, d)}{TF_{raw}(t, d) + 0.5 + 1.5 \cdot \frac{DocLen(d)}{Avg\_DocLen}}    (3)

where TF_{raw}(t, d) is the raw frequency of term t in a given document d, DocLen(d) is the length of the given document, and Avg_DocLen is the average length of the documents. IDF values were found using the following formulation:

IDF(t) = \log\left(\frac{N + 1}{N_t + 0.5}\right)    (4)

where N is the total number of documents in the corpus and N_t is the number of documents in which term t occurs. We construct vectors for each document using TF(t, d) × IDF(t) and then find the cosine between the two vectors. This score becomes the similarity measurement for the given documents. If this score is above a threshold, then the given pair of stories is said to be on the same topic. Later we shall see how to derive this threshold from the training data.
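A minimal sketch of this baseline follows; it is our own illustration rather than the authors' code, it assumes tokenized stories and IDF statistics precomputed from a training corpus, and the decision threshold is left as a parameter.

```python
import math
from collections import Counter

def tf(term_count, doc_len, avg_doc_len):
    """TF(t, d) as in Eq. (3)."""
    return term_count / (term_count + 0.5 + 1.5 * doc_len / avg_doc_len)

def idf(n_docs, doc_freq):
    """IDF(t) as in Eq. (4), computed on the training corpus."""
    return math.log((n_docs + 1) / (doc_freq + 0.5))

def tfidf_vector(tokens, idf_table, avg_doc_len):
    """Build a sparse TFIDF vector for one story."""
    counts = Counter(tokens)
    return {t: tf(c, len(tokens), avg_doc_len) * idf_table.get(t, 0.0)
            for t, c in counts.items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

def linked(story1, story2, idf_table, avg_doc_len, threshold):
    """Declare two stories linked if their TFIDF cosine exceeds the threshold."""
    v1 = tfidf_vector(story1, idf_table, avg_doc_len)
    v2 = tfidf_vector(story2, idf_table, avg_doc_len)
    return cosine(v1, v2) >= threshold
```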
Method-2. Information Content (IC). The information content (IC) of a term t is found using the following equation:

IC(t) = -\log_2(P(t|C))    (5)

where P(t|C) is the probability of term t occurring in corpus C, which is defined as

P(t|C) = \frac{freq(t)}{\text{total number of terms in } C}    (6)
where freq(t) is the frequency of term t in the given corpus.

Method-3. Pointwise KL (PKL) divergence scores. KL divergence and its variations are widely used in the language modeling framework [11,12]. In general, it is useful for finding how different two probability distributions are. We conjectured that the further a term's behavior differs from the behavior of the world, the more useful it is for the representation. In other words, if a term is unique, it is more important than terms that are not, such as stopwords. This intuition can easily be converted into a language modeling framework that uses pointwise KL divergence [17,18]. The traditional KL divergence, modified for pointwise computation, results in the following formulation:

PKL(t, d) = P(t|d) \log \frac{P(t|d)}{P(t|C)}    (7)

where PKL(t, d) is the pointwise KL divergence between document d and collection C with respect to term t, P(t|d) is the probability of term t occurring in a given document d, and P(t|C) is the probability of t occurring in C, which is calculated as given in equation (6). P(t|d) is computed as follows:

P(t|d) = \frac{TF(t, d)}{\text{total number of terms in } d}    (8)
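For concreteness, the following sketch (ours, not the authors') shows how the IC and PKL term weights of Equations (5)-(8) could be computed from simple corpus and document statistics; terms unseen in the training corpus are simply skipped here, which is one possible handling.

```python
import math
from collections import Counter

def collection_prob(term, corpus_counts, corpus_size):
    """P(t|C) as in Eq. (6)."""
    return corpus_counts.get(term, 0) / corpus_size

def information_content(term, corpus_counts, corpus_size):
    """IC(t) as in Eq. (5); infinite for terms unseen in the training corpus."""
    p = collection_prob(term, corpus_counts, corpus_size)
    return -math.log2(p) if p > 0 else float("inf")

def pkl_weights(doc_tokens, corpus_counts, corpus_size):
    """PKL(t, d) as in Eq. (7), using P(t|d) from Eq. (8)."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    weights = {}
    for term, c in counts.items():
        p_d = c / doc_len
        p_c = collection_prob(term, corpus_counts, corpus_size)
        if p_c > 0:
            weights[term] = p_d * math.log(p_d / p_c)
    return weights
```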
We implemented these three weighing schemes using the TDT2 corpus for training and the TDT3 corpus for testing.2 This means that for method 1, IDF values for all the terms were found using TDT2 (training corpus) and combined with TF values found using TDT3 (testing) to compute the TFIDF scores. For method 2, we found the information content of each term using TDT2 and used it to represent the terms of TDT3. For method 3, P(t|C) values were calculated using the TDT2 corpus and P(t|d) using TDT3. The results of these systems are displayed in Figure 2. It is important to note here that the lower the cost, the better the system.

3.2 Hypothesis-2: Capturing the Topical Nature of a Term Provides a Better Representation for TDT-Like Applications

We realized that in none of the methods tried so far is the information about topics explicitly captured.

2 This is based on the assumption that, in the context of TDT, we cannot know in advance what the whole corpus looks like.
Fig. 2. Comparison of normalized detection cost for various systems and corpora
It seemed interesting and useful to us to see what would happen if we incorporated such information while weighing the terms. However, it was not clear to us how exactly we could go about defining and capturing topicality. Following are two possible methods of doing so. They are based on some intuition, but are by no means the best ways to capture the topicality of terms.

Method-4. Topicality scores. We started with a very simple approach of defining and capturing topicality. Our formulation was basically derived from the following intuition.
– In the documents on the same topic, a term that occurs frequently is more useful than the terms that are not very frequent.
– A term that is more frequent in a topic and less frequent in other topics is more useful.

It can be seen that the desired characteristics of topical terms are very similar to the formulation of TFIDF: the first point is similar to finding TF and the second point is similar to finding IDF. However, there are some basic differences here, mainly in point two. In the case of normal IDF, we are interested in calculating how frequently the term occurs in the entire corpus, whereas in the formulation given here, we are interested in how frequent the term is in the documents of different topics. This is translated into the following formula:

Topicality(t) = (probability of t occurring in a topic) × (probability of t not occurring in any other topic)    (9)
             = \max_{T_i \in C} \left( P(t|T_i) \cdot \bar{P}_{j \neq i}(t|T_j) \right)    (10)
             = \max_{T_i \in C} \left( P(t|T_i) \cdot (1 - P_{j \neq i}(t|T_j)) \right)    (11)

Here, T_i is the i-th topic. We can compute P(t|T_i) as follows:

P(t|T_i) = \frac{\text{frequency of } t \text{ in the documents of topic } T_i}{\text{total number of terms in topic } T_i}    (12)
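A sketch of how these topicality scores could be computed from a topic-labelled training corpus is given below; it is our own illustration, and the way P_{j≠i}(t|T_j) is aggregated over the other topics (by pooling their counts) is an assumption about the intended reading of Equation (11).

```python
def topic_prob(term, topic_counts, topic_size):
    """P(t|T_i) as in Eq. (12)."""
    return topic_counts.get(term, 0) / topic_size

def topicality(term, topics):
    """Topicality(t) as in Eq. (11).

    topics maps topic id -> Counter of term frequencies. The probability of the
    term in the "other" topics is approximated by pooling their counts.
    """
    sizes = {tid: sum(counts.values()) for tid, counts in topics.items()}
    best = 0.0
    for tid, counts in topics.items():
        p_in = topic_prob(term, counts, sizes[tid])
        other_count = sum(c.get(term, 0) for j, c in topics.items() if j != tid)
        other_size = sum(s for j, s in sizes.items() if j != tid)
        p_other = other_count / other_size if other_size else 0.0
        best = max(best, p_in * (1.0 - p_other))
    return best
```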
Method-5. Topical Information Content. Earlier we proposed to use the information content of a term as a means to find its weight in the document representation. That information content was measured with respect to the corpus. We now change it slightly to evaluate it with respect to the topics. This new measure is defined below:

TIC(t) = \max_{T_i \in C} \left( -\log(P(t|T_i)) \right)    (13)

where P(t|T_i) is the probability of term t given topic T_i. Once again, we used the TDT2 corpus for training and TDT3 for testing. This means that for method 4, the Topicality was computed using the TDT2 corpus and used for representing TDT3 terms. Similarly, for method 5, the information content with respect to the topics was found on TDT2 and used for TDT3. The results, along with the previous three systems, are shown in Figure 3.
Fig. 3. Comparison of normalized detection cost for various systems and corpora
As we can see from these results, using only topical information resulted in better performance over the baseline, but did not give any advantage over the other methods of term weighing that did not use topical information. However, if the importance of a term in the collection and that in a topic are two relatively orthogonal factors, then we might be able to combine them in a way that would outperform a method that uses only one of them. Thus, we came up with the following hypothesis.

3.3 Hypothesis-3: Combining a Term's Importance at Collection Level and Its Topicality Provides a Better Representation Than That of Either of Them Used Separately

Method-6. Topical KL (TKL) divergence scores. We now approach our pointwise KL (PKL) divergence calculation with topicality in the formulation. Earlier we defined the PKL divergence with respect to the document and the corpus. Now we measure it with respect to the topic and the corpus. This new formulation is given below:
TKL(t) = \max_{T_i \in C} \left( P(t|T_i) \log \frac{P(t|T_i)}{P(t|C)} \right)    (14)
where T_i is a topic and C is the corpus. Once again, we find the score for each topic for a given term and take the maximum. While representing a document, each term is weighted by its corresponding topical KL divergence score. We found both P(t|T_i) and P(t|C) using our training corpus TDT2 and used these values to compute a term's weight on the TDT3 corpus. The result obtained by this system, along with all the previous systems, is displayed in Figure 4. As we can see, our proposed approach obtains the least cost. The actual values of these costs are given in Table 1.
Fig. 4. Comparison of normalized detection cost for various systems and corpora
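The topical KL weighting just described can be sketched as follows; this is our own illustration rather than the authors' code, and it reuses the topic-conditional and corpus probabilities defined above, both estimated on the training corpus.

```python
import math

def tkl_weight(term, topics, corpus_counts, corpus_size):
    """TKL(t) as in Eq. (14): max over topics of P(t|T_i) * log(P(t|T_i)/P(t|C))."""
    p_c = corpus_counts.get(term, 0) / corpus_size
    if p_c == 0:
        return 0.0                      # term unseen in the training corpus
    best = 0.0
    for counts in topics.values():      # topics: topic id -> Counter of term counts
        size = sum(counts.values())
        p_t = counts.get(term, 0) / size
        if p_t > 0:
            best = max(best, p_t * math.log(p_t / p_c))
    return best

# At link-detection time each story vector uses TKL(t) as the term weight,
# and two stories are compared with the same cosine similarity as the baseline.
```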
To measure how significant these changes were, we used the standard two-tailed paired t-test (parametric) as well as McNemar's statistical significance test [19] (non-parametric). McNemar's test uses only the number of discordant pairs, that is, the pairs for which the two given techniques make different decisions. Let us define the pairs that transferred from YES to NO to be R and the pairs that transferred from NO to YES to be S. We then calculate the Chi-Square value using the following equation:

\chi^2 = \frac{(|R - S| - 1)^2}{R + S}    (15)
Since McNemar’s test also calculates the p value, we can put these two things together and claim that if there were really no association between the given techniques, there is probability p of chance that the observed odds ratio would be so far from 1.0 (no association) as given by the Chi-Square above.
Table 1. Cost for various systems. Training on TDT2, testing on TDT3

System                        Normalized detection cost   Improvement over the baseline
TFIDF scores (baseline)              1.7758                          –
Information content                  0.9971                          43.85%
PKL divergence                       0.8570                          51.74%
Topicality scores                    1.0197                          42.58%
Topical information content          1.1526                          35.09%
Topical KL scores                    0.7749                          56.36%
Table 2 shows the results of significance tests using these two methods, taking as the basis the baseline results obtained using TFIDF scores for term weighing. The results indicate that the improvements of our methods over the baseline system can be said to be statistically significant.

Table 2. Results of significance tests

System                        Two-tailed paired t-test   McNemar's test
                              p-value                    p-value
Information content           0.0000                     0.0000
PKL divergence                0.0000                     0.0000
Topicality scores             0.0000                     0.0000
Topical information content   1.13e-103                  7.47e-307
Topical KL scores             0.0000                     0.0000
4 Additional Experiments and Analysis

In order to support our hypothesis-3 with additional empirical results, we also carried out experiments using the TDT4 corpus. This corpus is significantly different from the TDT2 and TDT3 corpora. The results of our baseline, plain topicality, and topical KL systems with the TDT4 corpus are given in Tables 3 and 4. As we can see, using merely topicality scores does worse than the baseline, but our proposed system of topical KL scores still outperforms the baseline. Since the testing corpus differs quite a bit from the training corpora in these experiments, the improvements achieved by our proposed method are not as dramatic as the results reported in Table 1, but they are still significant. Table 5 shows how the improvements given by our topical KL scores compare in significance with the baseline method using TFIDF scores; the results indicate that the improvements of our method can still be said to be statistically significant relative to the baseline method.

We found these results interesting and carried out further analysis. It turns out that although the vocabulary of TDT4 is not very different from those of TDT2 and TDT3, the nature of the topics is quite different. Thus, using only topical information from TDT2
Table 3. Training on TDT2, testing on TDT4

System                    Normalized Cost   Improvement over the baseline
TFIDF scores (baseline)       1.1104                    –
Topicality scores             1.1392                   -2.59%
Topical KL scores             1.0383                    6.49%

Table 4. Training on TDT3, testing on TDT4

System                    Normalized Cost   Improvement over the baseline
TFIDF scores (baseline)       0.9254                    –
Topicality scores             0.9915                   -7.14%
Topical KL scores             0.8983                    2.93%
Table 5. Results of significance tests

Training   Testing   Two-tailed paired t-test   McNemar's test
                     p-value                    p-value
TDT2       TDT4      0.0000                     0.0000
TDT3       TDT4      0.0000                     2.46e-05
or TDT3 on the TDT4 corpus hurts the performance. On the other hand, our proposed system, which uniquely combines the topical as well as the overall corpus information, still does the best. This indicates the robustness of our proposed approach.
5 Conclusion

In this paper we presented a novel approach for term weighing that incorporates a term's importance at the collection level while at the same time capturing its topicality. With our analysis and experiments, we showed that traditional IR techniques of term weighing do not consider the topical information that is essential for TDT tasks. We selected SLD, a core task of TDT, as our target application. Through a set of hypothesis tests, experiments, and analysis we realized that while traditional IR methods of term weighing do not capture topical information explicitly, the information that they provide is still very useful. We then proposed a unique way of combining topical information with the information that a traditional IR term weighing scheme provides. This method consistently outperformed the baseline across all the TDT corpora. Another advantage of this model is that, since it is based on well-studied information-theoretic and probabilistic frameworks, it becomes easier and more effective to analyze and understand it. The experiments reported here are done for SLD. However, since SLD is at the core of TDT, we conjecture that our proposed methods, which gave significant improvements, should help in other TDT tasks as well.
References

1. Salton, G., ed.: Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley (1989)
2. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Research and Development in Information Retrieval. (1998) 275–281
3. Tomita, J., Hayashi, Y.: Improving Effectiveness and Efficiency of Web Search by Graph-based Text Representation. In: Proceedings of The Ninth World Wide Web Conference. (2000)
4. Collins-Thompson, K., Callan, J.: Query expansion using random walk models. In: CIKM. (2005)
5. Allan, J., ed.: Topic Detection and Tracking. Kluwer Academic Publishers (2002)
6. Allan, J., Jin, H., Rajman, M., Wayne, C., Gildea, D., Lavrenko, V., Hoberman, R., Caputo, D.: Topic-based novelty detection. Technical report, Center for Language and Speech Processing, Johns Hopkins University (1999)
7. Kumaran, G., Allan, J.: Text classification and named entities for New Event Detection. In: Proceedings of ACM SIGIR. (2004) 297–304
8. Shah, C., Croft, W.B., Jensen, D.: Representing Documents with Named Entities for Story Link Detection (SLD). In: CIKM. (2006)
9. Eichmann, D.: Experiments with tracking / detection / etc. using entities and noun phrases. Technical report, University of Iowa (2001)
10. Fiscus, J., Wheatley, B.: Overview of the TDT 2004 evaluation and results. Technical report, NIST (2004)
11. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: Proceedings of the 24th annual international ACM SIGIR conference. (2001) 120–127
12. Zhai, C., Lafferty, J.: Model-based feedback in the language modeling approach to information retrieval. In: Proceedings of the 10th International Conference on Information and Knowledge Management. (2001) 403–410
13. Fiscus, J.G., Doddington, G.R.: Topic detection and tracking evaluation overview. In Allan, J., ed.: Topic Detection and Tracking. Kluwer Academic Publishers (2002) 17–32
14. Lavrenko, V., Allan, J., DeGuzman, J.E., LaFlamme, D., Pollard, V., Thomas, S.: Relevance models for topic detection and tracking. In: Proceedings of Human Language Technologies Conference. (2002) 104–110
15. Lewis, D.D.: Evaluating and optimizing autonomous text classification systems. In Fox, E.A., Ingwersen, P., Fidel, R., eds.: Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, Seattle, US, ACM Press, New York, US (1995) 246–254
16. Kanungo, T., Mount, D.M., Netanyahu, N., Piatko, C., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Analysis and Machine Intelligence 24 (2002) 881–892
17. Tomokiyo, T., Hurst, M.: A Language Model Approach to Keyphrase Extraction. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. (2003) 33–40
18. Kelly, D., Diaz, F., Belkin, N.J., Allan, J.: A User-Centered Approach to Evaluating Topic Models. In: Proceedings of the 26th European Conference on Information Retrieval (ECIR 2004). (2004) 27–41
19. Gillick, L., Cox, S.: Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In: Proceedings of IEEE's ICASSP 1989. (1989) 532–535
Ad Hoc Retrieval of Documents with Topical Opinion

Jason Skomorowski and Olga Vechtomova
University of Waterloo, 200 University Avenue West, Waterloo, ON, Canada, N2L 3G1
{jcskomor, ovechtom}@uwaterloo.ca
Abstract. With a growing amount of subjective content distributed across the Web, there is a need for a domain-independent information retrieval system that would support ad hoc retrieval of documents expressing opinions on a specific topic of the user's query. In this paper we present a lightweight method for ad hoc retrieval of documents which contain subjective content on the topic of the query. Documents are ranked by the likelihood each document expresses an opinion on a query term, approximated as the likelihood any occurrence of the query term is modified by a subjective adjective. Domain-independent user-based evaluation of the proposed method was conducted, and shows statistically significant gains over the baseline system.
1 Introduction

Users searching for information on the Web may have more complex information needs than simply finding any documents on a certain subject matter. For instance they may want to find documents containing other people's opinions on a certain topic, e.g. product reviews, as opposed to documents with objective content, such as technical specifications. In this work we address the problem of ad hoc retrieval of documents that express opinion on a specific topic.

There exist a large number of documents with opinionated content on the Web; however, they are scattered across multiple locations, such as individual websites, Usenet groups and web logs ("blogs"). If a person wants to find opinions on a certain subject they have to go to specific websites which might contain such content, for instance, IMDb for film reviews or Amazon for the reviews of books and CDs. Alternatively, they may add words with subjective connotation, such as "review" and "opinion", to their queries. However, it is obvious that only a small fraction of documents expressing opinion on a topic would actually contain words such as "review" or "opinion". There is a clear need for a domain-independent search engine that would support ad hoc retrieval of documents containing opinion about the topic expressed in the user's query. This paper sets out to fill this need by proposing a domain-independent method for ad hoc retrieval of documents containing opinion about a query topic.

We propose a lightweight method for ad hoc retrieval of documents which express subjective content about the topic of the query. Documents are ranked by the likelihood each document expresses an opinion on a query term, approximated as the likelihood the query term occurrences in a document are modified by subjective adjectives. For our experiments we use a manually constructed list of subjective adjectives, proposed in [1]. Our method calculates the probability of a noun at a certain distance from an adjective being the target of that adjective. Probabilities at different
distances are precomputed using a parsed training corpus. As part of our approach we have also developed a method of locating a noun modified by an adjective (i.e. resolving an adjective target), which demonstrated high accuracy in our evaluation.

While many elements of language can be used to express subjective content, adjectives are one of the major means of expressing value judgement in English. Our approach of using adjectives as markers of subjective content targeted at the concept expressed in the query relies on the assumption that users frequently want to find opinions about a single entity, such as a product, person, company, travel destination, activity, etc. Such an entity is typically expressed as a noun, a noun phrase or a gerund (a verb with -ing suffix which can act as a noun), and consequently queries of this type consist of either a single query term or a phrase. While it is true that users may be interested in opinions about more complex subjects, such as "The effect of global warming on the environment", opinions on such subjects are likely to be expressed by a greater diversity of more complex language structures (clauses, entire sentences or even paragraphs), and therefore require more sophisticated discourse processing tools. These types of queries are outside the scope of the current work.

In this work we also propose a method of topical opinion ranking by the likelihood a document expresses opinions on the collocates of the query terms, i.e. words strongly associated with them in the corpus. The rationale is that an author may express opinion about an entity indirectly, by referring to its related concepts, such as parts or attributes of a car as opposed to the car itself.

The proposed approach is well suited to real-time document retrieval: the computationally expensive task of resolving adjective targets in the training corpus and calculating probabilities of subjective adjectives modifying nouns at various distances is done once at pre-search time, whereas at search time the system only needs to find instances of query terms and subjective adjectives, as well as the distances between them. A sketch of this search-time scoring idea is given after the paper overview below.

The rest of the paper is organised as follows: in section 2 we review related work, and in section 3 we describe our methodology, including adjective target resolution algorithms and document ranking methods. Section 4 presents the evaluation, section 5 discusses the evaluation results, and section 6 concludes the paper and outlines future research directions.
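The following is a minimal sketch of the search-time scoring described above; it is our own illustration rather than the authors' implementation. The name `target_prob` stands for a hypothetical precomputed table mapping the signed distance between an adjective and a noun to the probability that the noun is the adjective's target, and the tokenised document is assumed to carry part-of-speech tags.

```python
def score_document(tokens, query_term, subjective_adjectives, target_prob, max_dist=5):
    """Sum, over occurrences of the query term, of the probability that the
    nearest subjective adjective is modifying that occurrence.

    tokens: list of (word, pos) pairs for the document.
    target_prob: dict mapping signed distance (adjective -> noun) to a
    probability estimated from a parsed training corpus.
    """
    adj_positions = [i for i, (w, pos) in enumerate(tokens)
                     if pos.startswith("JJ") and w.lower() in subjective_adjectives]
    score = 0.0
    for i, (w, pos) in enumerate(tokens):
        if w.lower() != query_term:
            continue
        # signed distances from nearby subjective adjectives to this occurrence
        candidates = [i - j for j in adj_positions if abs(i - j) <= max_dist]
        if candidates:
            dist = min(candidates, key=abs)        # nearest subjective adjective
            score += target_prob.get(dist, 0.0)
        # occurrences with no nearby subjective adjective contribute nothing
    return score
```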
2 Related Work

Although sentiment and subjective language represent a growing research area, work on identifying language that is both subjective and on topic is limited. Hurst and Nigam [2] propose a method of identifying sentences that are relevant to some topic and express opinion on it. First, to determine if a document is relevant to a topic, they use a machine learning approach, trained on hand-labeled documents, and if the classifier predicts the whole document as topically relevant, they apply the same classifier to predict topical relevance of each sentence. For the sentences predicted topically relevant, they apply a sentiment analyser, which relies on a set of heuristic rules and a hand-crafted domain-specific lexicon of subjective words, marked with polarity (positive or negative). Yi et al. [3] propose to extract positive and negative opinions about specific features of a topic. By feature terms they mean terms that have either a part-of or attribute-of relationship with the given topic or with a known feature of the
topic. Their method first determines candidate feature terms based on structural heuristics, then narrows the selection using either the mixture language model or the log-likelihood ratio. A pattern-dependent comparison is then made to a sentiment lexicon gathered from a variety of linguistic resources.

There exists a larger body of research directed towards document classification by sentiment polarity [4, 5, 6, 7]. The focus of these works is on classifying reviews as either positive or negative. A review can be viewed as an example of topical subjectivity, with the writer's opinion being a subjective expression on the topic of the item being reviewed. Pang et al. [6] evaluate several machine learning algorithms to classify film reviews as either containing positive or negative opinions. Dave et al. [4] propose and evaluate a number of algorithms for selecting features for document classification by positive and negative sentiment using machine learning approaches. Turney [7] proposes an unsupervised algorithm for classifying reviews as positive or negative. He proposes to identify whether a phrase in a review has a positive or negative connotation by measuring its mutual information with the words "excellent" and "poor". A review's polarity is predicted from the average semantic orientation (positive or negative) of the phrases it contains. The method, evaluated on 410 reviews from Epinions in four different domains, showed accuracy between 66% and 84% depending on the domain. Hu and Liu [5] developed a method of identifying frequent features of a specific review item, and finding opinion words from reviews by extracting adjectives most proximate to the terms representing frequent features. Of this prior work, the approach of Hu and Liu is most closely related to ours because of its use of adjective proximity.
3 Methodology

In order to determine whether an opinion is given on a topic, we need not only to identify subjectivity in the document, but also to determine whether that subjectivity is directed at the topic in question. Adjectives have often been defined in terms of their use as a direct noun modifier, and while Baker favours a more general definition for his crosslinguistic study, he agrees that this generalisation holds across "a great many languages" [8]. Not only do adjectives tend to have clear targets, they are also one of the primary means of expressing opinions. While the role played by the adjective can vary widely between languages, value judgement is among the four core semantic types associated with this part of speech [9]. Support for this is found in a study by Bruce and Wiebe which shows that the presence of adjectives correlates with subjectivity [10].

The general approach of our work is to rank documents by the likelihood that a document expresses an opinion on a query term, approximating it as the likelihood that the query term occurrences in a document are modified by subjective adjectives. Applying syntactic parsing at search time in order to determine whether a query term instance is the target of a subjective adjective in a document would be computationally expensive. We therefore chose to use a training corpus with marked adjective targets to calculate, for each position relative to a subjective adjective, the probability that it contains the adjective's target noun. At search time we only have to determine the distance between an instance of the query term and the nearest subjective adjective, and look up the probability that the adjective modifies a noun at this distance. The document score is then calculated as the sum of such probabilities. For this
approach we need: a list of subjective adjectives; positional information of index terms in documents; the probability that an adjective modifies a noun at a given distance from it; and a corpus where adjectives and their targets are marked for calculating such probabilities. A list of subjective adjectives can be created manually or automatically, for example, using machine learning techniques. In our work we used a list of 1336 subjective adjectives manually composed by Hatzivassiloglou and McKeown [1]. There also exist many automatic methods of learning subjective language [e.g., 11, 12], which could be used instead. Positional information of index terms in a document is recorded in a typical IR system's index, and is therefore easily obtainable. To calculate the probability that an adjective modifies a noun at a certain distance we need a corpus with marked adjectives and their targets. Such a corpus, however, is not available, nor is there an existing method for resolving adjective targets. We therefore developed our own method of resolving adjective targets, which is presented in Section 3.1.

3.1 Resolving Adjective Targets in English

English adjectives are characteristically used either attributively or predicatively [13]. Attributive usage is where a noun is modified directly, typically premodified (e.g., the blue sky). Predicative usage links the adjective to the subject with a copular verb such as "be" (e.g., the sky is blue). Other, less frequent constructions include objective complements of verbs such as "make" and "prove" (e.g., made the sky blue), resultative secondary predicates [8] (e.g., dyed the sky blue), and degree phrases [14] (e.g., blue as the sky; more blue than the sky). Since we do not require maximum precision for our application, we focus our target resolution on only the most frequent usages, attributive and predicative. Identifying resultative secondary predicates would require a list of verbs that can be used in such constructs, which is unavailable. Determining the specifics of other usages of adjectives is complicated by the numerous syntactic applications of "as", "than", "make" and other words involved in these constructs.

In order to identify what part of a sentence is being modified by a given adjective, syntactic information is needed. For our approach, we need to know the part of speech (POS) of words and the boundaries of noun phrases, therefore we require a POS tagger and a parser. After evaluating a variety of tools, the SNoW Shallow Parser [15] was found to have a good balance of precision and speed.

3.1.1 Resolving Attributive Use of Adjectives

In the attributive case, the noun phrase to which the adjective refers is the one containing it. In order to determine noun phrase boundaries we use the parser. Manually examining a random sample of 200 subjective adjectives used attributively, we found that the parser fails to find appropriate phrase boundaries in 6.5% of these instances. Most errors involve the parser ending the noun phrase because it has mistagged a noun usage as a verb, or erroneously saw an adjective where none exists. A notable limitation of this approach is that it does not account for other noun phrases potentially modified by the adjective via coordinate conjunctions, prepositional phrases, and other constructs. However, it is difficult to identify the correct target in such constructs without knowledge of their meaning, as demonstrated by the following examples:
− Sudbury is famous for its colourful culture and people. (the people are colourful);
− The artist uses colourful pastels and charcoal. (the charcoal is not colourful);
− A delicious bowl of ice cream. (the ice cream is delicious);
− A ridiculous amount of pasta. (the pasta is not ridiculous).
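Before turning to the predicative case, the following is a rough sketch of the attributive step just described. It is only illustrative: the paper uses the SNoW Shallow Parser, whereas this sketch assumes spaCy's noun chunker as a stand-in, and it takes the last noun of the containing noun phrase as the head, as stated above.

```python
# Illustrative only: spaCy stands in for the shallow parser used in the paper.
import spacy

nlp = spacy.load("en_core_web_sm")

def attributive_targets(text):
    """Map each adjective inside a noun phrase to the last noun of that phrase."""
    doc = nlp(text)
    pairs = []
    for chunk in doc.noun_chunks:
        nouns = [tok for tok in chunk if tok.pos_ in ("NOUN", "PROPN")]
        if not nouns:
            continue
        head = nouns[-1]                      # last noun of the phrase = head
        for tok in chunk:
            if tok.pos_ == "ADJ":
                pairs.append((tok.text, head.text))
    return pairs

print(attributive_targets("An exciting sea adventure began under the deep blue sky."))
# expected, roughly: [('exciting', 'adventure'), ('deep', 'sky'), ('blue', 'sky')]
```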
3.1.2 Resolving Predicative Use of Adjectives

If an adjective occurs outside of a noun phrase, it is likely to be used predicatively. In this case we read back from the adjective to see if there is a copular verb (we used a list of copular verbs from [16]) present before it and, if so, assume the preceding noun phrase to be the subject of that verb and thus predicatively modified by the adjective in question. We employ a variety of measures to improve the accuracy of this approach:
− Only cases where the parser tags the copular verb as actually being used as a verb are considered.
− Clauses delimited from the verb by commas are bypassed when searching for the subject (e.g., The ice-cream, prepared fresh this afternoon, is delicious).
− Situations where there is an intervening noun between the adjective and the copular verb are not counted as a predicative construct, because it is most likely that the adjective is used as an objective complement of a different verb (e.g., The ice-cream is made with strawberries and is quite delicious).
− Noun phrases preceded by prepositions are skipped when looking for a potential target, as these form a prepositional phrase and are not the subject of the link verb (e.g., The ice-cream in the fridge is old.).
The evaluation of the predicative adjective target resolution algorithm was conducted on a random sample of 200 subjective adjectives used predicatively in the AQUAINT corpus. The target noun phrase was identified correctly in 86% of the cases; in 11% of the cases the error was due to the parser, and only 3% of the errors were caused by our method. One frequent cause of parser error was that contractions of "not" such as "wasn't" and "didn't" were erroneously tagged as nouns. While some potential heuristics present themselves, further refinement is left to later work, as additional precision is not necessary to explore search applications and is made irrelevant by parser error.

3.2 Statistics on Adjective Usage

Using the above method and a corpus of text, we can calculate the probability of a noun being the target of an adjective at a certain distance from it. A noun is considered to be the target of an adjective when it is the head of the noun phrase that the adjective modifies, as determined by the method described in Section 3.1. We consider the last noun in the noun phrase as the head. The probability Pi that a noun is the target of (i.e. modified by) an adjective at distance i is calculated according to Eq. 1:
$$P_i = \frac{T_i}{K_i} \qquad (1)$$
Where: Ti – the total number of nouns which are targets of any subjective adjective separated by distance i; Ki – the total number of nouns separated by distance i from a subjective adjective.

For example, in calculating the probability P1 that a noun is the target of an adjective which immediately precedes it in text (i.e. the noun is located in position 1 relative to the adjective), "deep sea" would count towards T1 because the adjective "deep" modifies the noun "sea". On the other hand, "exciting sea adventure" would not count towards T1, because the adjective "exciting" does not modify "sea", but "adventure", which is the head of the noun phrase "sea adventure". Both examples would count towards K1, because in both of them we have a noun immediately preceded by an adjective.

Table 1. Probabilities of a noun being modified by an adjective at different distances

Distance (i) of noun   All adjectives                      Subjective adjectives
from adjective         Proper    Common    All             Proper    Common    All
                       nouns     nouns     nouns           nouns     nouns     nouns
-10                    0.0026    0.0011    0.0012          0.007     0.0024    0.0026
-9                     0.003     0.0016    0.0017          0.0084    0.0033    0.0036
-8                     0.0037    0.0021    0.0022          0.0098    0.0048    0.0051
-7                     0.0052    0.0031    0.0032          0.0141    0.0068    0.0072
-6                     0.0073    0.0045    0.0047          0.0194    0.01      0.0105
-5                     0.0112    0.0065    0.0069          0.031     0.0147    0.0156
-4                     0.0206    0.0105    0.0112          0.061     0.025     0.027
-3                     0.0414    0.0218    0.0232          0.1265    0.0545    0.0585
-2                     0.0568    0.0294    0.0313          0.1657    0.0712    0.0765
-1                     0.0077    0.0029    0.0033          0.0068    0.0014    0.0017
1                      0.331     0.6689    0.6451          0.1971    0.5886    0.5666
2                      0.1775    0.1741    0.1743          0.1283    0.1517    0.1504
3                      0.1761    0.0489    0.0579          0.1133    0.04      0.0441
4                      0.0911    0.0143    0.0197          0.0441    0.0123    0.0141
5                      0.0326    0.0041    0.0061          0.017     0.0034    0.0042
6                      0.0109    0.0014    0.0021          0.0073    0.0011    0.0014
7                      0.0041    0.0005    0.0008          0.0028    0.0004    0.0005
8                      0.0022    0.0002    0.0004          0.0021    0.0002    0.0003
9                      0.0012    0.0001    0.0002          0.0013    0.0001    0.0001
10                     0.0004    0.0001    0.0001          0.0002    0         0
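As an illustration of Eq. 1, the sketch below estimates these probabilities from the output of the target resolution step. The input format (a signed distance plus a target flag for each adjective-noun pair) is an assumption made for the example.

```python
from collections import Counter

def distance_probabilities(observations, max_dist=10):
    """Estimate P_i = T_i / K_i from (distance, is_target) observations."""
    targets, totals = Counter(), Counter()
    for distance, is_target in observations:
        if distance == 0 or abs(distance) > max_dist:
            continue
        totals[distance] += 1          # contributes to K_i
        if is_target:
            targets[distance] += 1     # contributes to T_i
    return {d: targets[d] / totals[d] for d in totals}

# "deep sea" contributes a target at distance 1; "exciting sea adventure" gives a
# non-target noun at distance 1 ("sea") and a target at distance 2 ("adventure").
obs = [(1, True), (1, False), (2, True)]
print(distance_probabilities(obs))     # {1: 0.5, 2: 1.0}
```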
Table 1 contains the probabilities that a noun, at some position relative to an adjective at position 0, is the target of that adjective. We only calculated probabilities for positions up to +/-10 words away from an adjective, based on the average sentence size of 21 words. The probabilities were calculated from the AQUAINT corpus. As can be seen from Table 1, the position immediately following a subjective adjective (position 1) has the highest probability (0.5666) of containing the target of the adjective (see the last column of Table 1). The position with the next highest probability of containing the target is one word further after the adjective (position 2), which is due to cases where the target is the head of a longer noun phrase with an intervening modifier. Position -2 has the next highest probability of containing the target noun of a subjective adjective, which represents predicative use of adjectives.
Out of all adjectives, 77% are used attributively and 9% predicatively. When restricted to subjective adjectives, the count becomes 65% attributive and 20% predicative. One explanation for the larger proportion of subjective adjectives used predicatively compared to all adjectives may be that subjectivity is more often directed at proper nouns. Proper nouns do not usually take prenominal adjectives [17], so this attributive usage would need to be written predicatively instead (e.g., one is more likely to say "tall person" or "Jane is tall", but less likely "tall Jane").

3.3 Document Ranking

The goal of our method is to rank documents by the likelihood that they express opinions on the query concept. Our method, therefore, attempts to rank documents by topical subjectivity, i.e. expression of opinion about the query topic. Document ranking is performed by locating all instances of subjective adjectives (we used the list of manually tagged subjective adjectives from [1]) in the document and computing the aggregate probability that they refer to occurrences of the query term, based on the precomputed probabilities described in the previous section.

In more detail, a document score is calculated as follows: first the user's query term (or phrase) is used to find a set of top N ranked documents using a best-match IR system. In each document all instances of the query term and subjective adjectives are identified. For each occurrence of the query term, we determine if there are any subjective adjectives within 10 words either side, and note the distance separating them. The probability of a subjective adjective referring to a query term instance occurring i words away is referenced from the precomputed statistics in Table 1 (in the evaluation we used the proper noun statistics, as most of the user queries were proper nouns). We use the sum of these probabilities as the probability that the document contains an opinion on the query topic. The sum of probabilities is calculated according to the inclusion-exclusion formula [18] for n non-mutually exclusive events (Eq. 2):
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \ldots + (-1)^{n+1} P(A_1 \ldots A_n) \qquad (2)$$
where Ai denotes the co-occurrence of a query term with a subjective adjective at distance i in the document, and P(Ai) is the precomputed probability (from Table 1) that at distance i a subjective adjective modifies a noun. The instance of the inclusion-exclusion formula for three events (i.e. three query term – subjective adjective co-occurrence pairs) is presented in Eq. 3:

$$P(A_i \cup A_j \cup A_k) = P(A_i) + P(A_j) + P(A_k) - P(A_i A_j) - P(A_i A_k) - P(A_j A_k) + P(A_i A_j A_k) \qquad (3)$$
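The sketch below illustrates this scoring step. It assumes the co-occurrence events are treated as independent, in which case the inclusion-exclusion sum of Eq. 2 reduces to 1 − ∏(1 − P(Ai)); the distance list and the probability table are passed in from the steps described above.

```python
def opinion_score(cooccurrence_distances, p_i):
    """
    cooccurrence_distances: signed distance between each query-term occurrence and
        its nearest subjective adjective (within +/-10 words).
    p_i: precomputed probabilities from Table 1, keyed by distance.
    """
    prob_no_target = 1.0
    for d in cooccurrence_distances:
        prob_no_target *= 1.0 - p_i.get(d, 0.0)
    # probability that at least one nearby subjective adjective targets a query term
    return 1.0 - prob_no_target

p_i = {1: 0.5666, 2: 0.1504, -2: 0.0765}   # a few entries from Table 1 (subjective, all nouns)
print(opinion_score([1, -2], p_i))         # approximately 0.60
```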
3.4 Collocates of Query Terms as Opinion Targets

A document can express opinion not directly about the concept represented by the query term, but about related concepts, which can be more general or specific. For example, an author may talk subjectively about a film by expressing opinions on the actors' performance, a particular scene, or the work of the director in general. Another example would be
someone giving a review of an automobile model by talking about its specific features or components, such as fuel efficiency, comfort, engine or accumulator. In this work we propose a method of using collocates, words significantly co-occurring in the contexts of query terms in the corpus, as representatives of concepts related to the query topic. Specifically, our approach consists of first finding collocates of a query term, and then calculating a document score which is an aggregate probability that subjective adjectives modify the original query term instances plus instances of their collocates. The next section describes the method used for collocate selection.

3.4.1 Collocate Selection Method

A large number of statistical methods have been used to find and rank collocates, such as Mutual Information [19], Z-score [20], Log-Likelihood ratio and the chi-square test [21]. We can view the problem of finding related terms for opinion scoring as similar to query expansion. The difference is that we do not explicitly add additional terms to the query, but use their probabilities of being the target of a subjective adjective as additional evidence that the document expresses opinion on the query topic. It is outside the scope of the present work to evaluate different term association measures, therefore we chose to use one term association measure, Z-score, which showed good performance in query expansion experiments [20]. Systematic comparison of different term selection measures is left for future work.

Z-score is a statistic for hypothesis testing, i.e. for assessing whether a certain event is due to chance or not. When used for collocation selection, Z-score tests whether the co-occurrence of two words is due to factors other than chance. It is similar to the t-score measure proposed by Church et al. [19]. We used the method for extracting collocates and calculating Z-score proposed in [20]. The procedure and parameters we used for selecting collocates are as follows: in the 50 top ranked documents retrieved in response to the user's query term, all terms surrounding instances of the query term within a window of 20 words (10 words either side of the query term instance) are extracted. In cases where windows surrounding query term instances overlap, terms are extracted only once. All extracted terms are then ranked according to the modified Z-score in Eq. 4 [20], and up to 12 top-ranked terms are selected for use in our method. All collocates with a Z-score less than the significance threshold of 1.6 were rejected.
$$Z = \frac{f_r(x,y) - \dfrac{f_c(y)\, f_r(x)\, v_x(R)}{N}}{\sqrt{\dfrac{f_c(y)\, f_r(x)\, v_x(R)}{N}}} \qquad (4)$$
Where: R – the set of top retrieved documents; f_r(x,y) – the joint frequency of x and y in R; f_c(y) – the frequency of y in the corpus; f_r(x) – the frequency of x in R; v_x(R) – the average window size around x in the relevant documents; N – the corpus size. More information about the modified Z-score and its derivation can be found in [20]. The values chosen for the parameters in our study (the window size and the number of Z-ranked collocates selected) are those that showed the best results in the query expansion experiments of [20]. It is left for future work to systematically evaluate which parameters perform best in our task.
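A rough sketch of this selection step follows. The corpus statistics are assumed to be supplied by the retrieval system; the threshold of 1.6 and the cut-off of 12 collocates are the values stated above, and the square-root normalisation follows the modified Z-score of [20].

```python
import math

def z_score(f_r_xy, f_c_y, f_r_x, v_x_R, N):
    """Modified Z-score (Eq. 4) for a candidate collocate y of query term x."""
    expected = f_c_y * f_r_x * v_x_R / N        # expected co-occurrence count
    if expected <= 0:
        return 0.0
    return (f_r_xy - expected) / math.sqrt(expected)

def select_collocates(candidates, threshold=1.6, top_k=12):
    """candidates: iterable of (term, f_r_xy, f_c_y, f_r_x, v_x_R, N) tuples."""
    scored = [(term, z_score(*stats)) for term, *stats in candidates]
    scored = [(term, z) for term, z in scored if z >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```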
Table 2 shows a list of collocates selected for a sample of the queries submitted by users in our evaluation experiment, which will be described in the next section.

Table 2. Collocates selected for a sample of queries submitted by users in the evaluation experiment

Bill Gates     wealth, dynamite, Microsoft, rich, interview, Napoleon, dollars, he, man, person, short, say
Egypt          ancient, guardian, pyramids, tour, arab, egyptian, travel, Nile, Cairo, modern, country, history
J.K. Rowling   Bloomsbury, Potter, Harry, author, interview, book, books, site
JDeveloper     Oracle, 10g, soa, oc4j, Webgalileo, Oracle9i, ide, bpel, clover, jsf, adf, java
iPod           nano, Apple, iTunes, grayscale, 30gb, generation, mini, dock, gb, shuffle, applecare, playback
3.4.2 Document Ranking Using Collocates

After the set of collocates of the query term is selected, a score is calculated for each of the top N documents retrieved in response to the user's query as follows: in each document all instances of the query term, its collocates and subjective adjectives are identified. For each occurrence of the query term and its collocates, we determine if there are any subjective adjectives within 10 words and note the distance separating them. For each subjective adjective we obtain the probability from the precomputed statistics (Section 3.2) that it refers to a query term or collocate instance occurring i words away. Aggregate probabilities are calculated according to Eq. 2 (Section 3.3).
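A short sketch of this extension follows, reusing the scoring function outlined in Section 3.3; `find_cooccurrence_distances` is a hypothetical helper that returns the adjective distances for occurrences of a term in a document.

```python
def collocation_opinion_score(document, query_term, collocates, p_i,
                              find_cooccurrence_distances):
    """Aggregate opinion probability over the query term and its collocates."""
    distances = []
    for term in [query_term] + list(collocates):
        distances.extend(find_cooccurrence_distances(document, term, window=10))
    return opinion_score(distances, p_i)   # opinion_score as sketched in Section 3.3
```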
4 Evaluation We conducted a user-based evaluation of the proposed approach. Altogether 33 users, solicited from the University of Waterloo graduate student mailing list, voluntarily participated in the evaluation. The form requesting users to submit their queries contained the following instructions: "Please enter a word or phrase identifying some person/item/entity, about which you are curious to see opinions. This should complete the sentence: "I'd like to know what the Web thinks of ____________". The form also contained a text field where users were asked to enter a more detailed description of their information need for future analysis. We used Google to retrieve the initial set of documents in response to the users' queries. The retrieved documents consist of all results obtainable via the Google API up to one thousand. Because of the limit on the number of documents that Google can return per day via its API, it was not possible to simulate the search process in real time. Users were therefore asked to come back in a few days after query submission in order to do the relevance judgements. Each user submitted one query, and in total for 33 queries 1192 documents were judged. We evaluated two methods of ranking documents by topical opinion: "Opinion" method using only original query terms (section 3.3); "Collocation opinion" method using original query terms plus their collocates (section 3.4.2).
The baseline against which the above two methods are evaluated is the original Google ranking. The reason for selecting Google as the baseline is that it is one of the most widely used state-of-the-art Web search engines, which users may use to satisfy their opinion information needs. Also, to our knowledge there is no publicly available Web search engine which specifically retrieves documents containing opinions on the subject of the user's query. For each topic, up to the top 1000 documents retrieved by Google are re-ranked using "Opinion" and "Collocation Opinion" methods. Top 15 ranked documents from each of the three ranked document sets, the "Opinion", "Collocation Opinion" and Google, are extracted, and presented in the random order to the user. By randomizing the order in which documents in the three sets are presented to the user, we ensure that the user is unable to infer which method was used to retrieve each document. It also removes the possibility of user bias due to ranking order. The decision of including only the top 15 documents from each of the three retrieved sets in the results list was made so that the relevance judgement task was not too time-consuming for users. Users were asked to judge the full text of each document in the list as one of the following: 1. Containing an opinion about the query topic ("query relevance"); 2. Containing an opinion about something closely related to the query topic ("relevance to a related topic"); 3. Containing no opinion about the query or related topics.
5 Results

The performance of the two methods "Opinion" and "Collocation Opinion" was evaluated by means of Precision at 10 retrieved documents (P@10) and Precision at 15 retrieved documents (P@15), using "query relevance" and "relevance to a related topic" judgements. Results are presented in Table 3.

Table 3. Evaluation results (* indicates that a run has a statistically significant difference from the baseline, paired t-test, P<0.05)

                       Query relevance        Relevance to a related topic
Method                 P@10       P@15        P@10       P@15
Google (baseline)      0.3758     0.3717      0.5424     0.5455
Opinion                0.5182*    0.4990*     0.6636*    0.6626*
Collocation opinion    0.4727*    0.4747*     0.6363*    0.6404*
As can be seen from Table 3 all runs significantly (paired t-test, P<0.05) improved performance of topical opinion retrieval over the baseline. The use of only query terms in estimating the likelihood of the document expressing opinion on the query topic performs better than the use of collocates. Using query relevance judgements, out of 33 queries, "opinion" ranking method improves P@15 of 24, deteriorates 6, and does not affect 3 queries, while "collocation opinion" method improves P@15 of
21 queries, deteriorates 6, and does not affect 6 (Fig. 1). In P@10 "opinion" method improves performance in 21 cases, deteriorates 6, and does not affect 6, while "collocation opinion" improves the performance of 20 queries, deteriorates 10, and does not affect 3. Using related topic judgements, "opinion" improves P@15 of 24, deteriorates 6, and does not affect 3 queries, while "collocation opinion" method improves P@15 of 23 queries, deteriorates 7, and does not affect 3 queries. In P@10 "opinion" method improves performance in 22 queries, deteriorates 6, and does not affect 5. "Collocation opinion" also improves P@10 of 22 queries, but deteriorates 11.

Fig. 1. Precision at 15 (P@15) results for Opinion and Collocation Opinion ("Query Relevance" judgements) relative to the baseline. [Bar chart of per-query P@15 differences from the baseline; the plotted values are not reproduced here.]
Our analysis also shows that "collocation opinion" method has better P@10 than "opinion" in 8 queries in the "query relevance" judgements, however it has better P@10 in 10 queries in the "related topic relevance" judgements. This suggests that the "collocation opinion" method may be more suitable for situations where the user's information need is broad, and she may be interested in documents expressing opinions on subjects related to the query topic. As can be seen from Fig. 1, the topics that benefit most from the “Collocation Opinion” method are those where the author is likely to express opinion about related concepts, such as features or characteristics of the topic entity. For instance, the topic “iPod” performed better with the use of collocates. Some of the collocates selected for this topic represent features of iPod, namely “playback” and “shuffle”, and it is likely that authors express their opinion about iPod through such features. Examples of other topics that performed better with “Collocation Opinion” are “Google”, “Wikipedia”, “Intuos3” and “Bill Gates”.
6 Conclusions and Future Work

In this paper we proposed a computationally lightweight algorithm for topical opinion retrieval. As an element of our technique, we developed a method for adjective target resolution in English, which demonstrated high accuracy. We conducted a thorough user-based evaluation of the developed method in an unrestricted domain using the Web as the corpus. Comparison of the developed method to a state-of-the-art search engine (Google) shows statistically significant gains, demonstrating that the system is useful in resolving genuine topical opinion needs of real users. Since at present no domain-independent topical opinion search engine exists, our experiment demonstrates potential uses of such a system, and the types of queries that people may ask.

In addition to the subjectivity of adjectives, the system could incorporate additional meta-information on each adjective, including polarity (positive and negative) and intensity. This would enable more expressive queries to be formulated, limiting which subset of adjectives is applied. For example, a company might be most interested in the superlatively negative comments about its brand, or a consumer might prefer a balance of both positive and negative opinions to find more thorough product evaluations. A metric for opinion quality is one direction for this line of research and could incorporate other indicators of a substantiated rather than casual opinion. We plan to evaluate the above methods and their further extensions by means of user-based evaluations and test collections such as the one created in the Blog track of TREC.
References 1. V. Hatzivassiloglou and K. R. McKeown, Predicting the semantic orientation of adjectives. In Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics, pp. 174-181, 1997. 2. M. Hurst and K. Nigam, Retrieving topical sentiments from online document collections. In Proceedings of the 11th conference on document recognition and retrieval, 2004. 3. J. Yi, T. Nasukawa, R. Bunescu, and W. Niblack, Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the 3rd IEEE International Conference on Data Mining, 2003. 4. K. Dave, S. Lawrence, and D. M. Pennock, Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In Proceedings of the 12th World Wide Web Conference, 2003. 5. M. Hu and B. Liu, Mining opinion features in customer reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2004. 6. B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 conference on empirical methods in natural language processing, 2002. 7. P. D. Turney, Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pp. 417-424, 2002. 8. M. C. Baker, Lexical categories: verbs, nouns and adjectives. Cambridge University Press, 2003. 9. R. M. W. Dixon and A. Y. Aikhenvald, Adjective classes. A crosslinguistic typology. Oxford University Press, 2004.
10. R. F. Bruce and J. M. Wiebe, Recognizing subjectivity: a case study in manual tagging. Natural Language Engineering, vol. 5, no. 2, pp. 187-205, 1999. 11. J. Wiebe, Learning subjective adjectives from corpora. In Proceedings of the 17th National Conference on Artificial Intelligence, pp. 735-740, 2000. 12. P. D. Turney and M. L. Littman, Unsupervised learning of semantic orientation from a hundred-billion-word corpus. Tech. Rep. EGB-1094, National Research Council Canada, 2002. 13. S. Greenbaum, The Oxford English grammar. Oxford University Press, 1996. 14. P. Rijkhoek, On Degree Phrases and Result Clauses. PhD thesis, University of Groningen, Groningen, 1998. 15. X. Li and D. Roth, Exploring evidence for shallow parsing. In Proceedings of the Annual Conference on Computational Natural Language Learning, 2001. 16. J. Sinclair (Ed.) Collins Cobuild English Grammar. Harper Collins, 1990. 17. Z. Vendler, Adjectives and nominalizations, p. 86. Mouton & Co. N.V., The Hague, 1968. 18. J. Pitman, Probability, p. 559. Springer-Verlag, New York, 1993. 19. K. Church, W. Gale, P. Hanks, and D. Hindle, Lexical substitutability. In: Atkins B.T.S. and Zampoli A. (eds.) Computational Approaches to the Lexicon. Oxford University Press, 1994. pp. 153-177. 20. O. Vechtomova, S. E. Robertson, and S. Jones, Query expansion with long-span collocates. Information Retrieval, vol. 6, pp. 251-273, 2003. 21. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, 1999.
Probabilistic Models for Expert Finding

Hui Fang and ChengXiang Zhai
University of Illinois at Urbana-Champaign, IL, USA
{hfang,czhai}@cs.uiuc.edu
Abstract. A common task in many applications is to find persons who are knowledgeable about a given topic (i.e., expert finding). In this paper, we propose and develop a general probabilistic framework for studying expert finding problem and derive two families of generative models (candidate generation models and topic generation models) from the framework. These models subsume most existing language models proposed for expert finding. We further propose several techniques to improve the estimation of the proposed models, including incorporating topic expansion, using a mixture model to model candidate mentions in the supporting documents, and defining an email count-based prior in the topic generation model. Our experiments show that the proposed estimation strategies are all effective to improve retrieval accuracy.
1 Introduction

The problem of expert finding is concerned with finding the experts on a specified topic. This problem has many real-world applications. For example, organizers of a conference need to assign submissions to the PC members based on their research interests and expertise. Customer service of a company needs to decide which staff should be assigned to solve a given complaint. Currently, people have to manually identify the experts, which is obviously labor-intensive and time-consuming. Thus, it would be very interesting to study how to automatically identify experts for a specified expertise area.

As a retrieval task, expert finding has recently attracted much attention mostly due to the launching of the Enterprise track [4,11] of TREC [12]. The task setup in the Enterprise track includes the following three components: (1) a supporting document collection; (2) a list of expert candidates, which are specified by names and emails; (3) a set of topics (i.e., descriptions of expertise). The task of expert finding is to rank the expert candidates for a given topic query based on the information from the data collection. Expert finding is similar to the traditional ad hoc retrieval task in the sense that both tasks are to find relevant information items for a given topic. The key challenge in expert finding is to infer the association between a person (i.e., candidate expert) and an expertise area (i.e., topic) from the supporting document collection.

Participants in the Enterprise track have tried various methods. The methods mainly fall into two categories: profile-based methods and document-based methods. In profile-based methods [5,1,7,2], researchers would first build a term-based
expertise profile (called document reorganization in [5]) for each candidate, and rank the candidate experts based on the relevance scores of their profiles for a given topic by using traditional ad hoc retrieval models. In document-based methods [3,2,8], instead of creating such term-based expertise profiles, the researchers use the supporting documents as a “bridge” and rank the candidates based on the co-occurrences of topic and candidate mentions in the supporting documents. However, the existing methods are not general and usually rely on heuristics, such as rule-based methods to detect the candidate mentions in the supporting documents, to achieve reasonable retrieval accuracy. In this paper, we develop a general probabilistic framework for studying expert finding. We derive two families of generative models based on the framework - candidate generation models and topic generation models. The derived models are analogous to the probabilistic models derived in [6] for traditional ad hoc retrieval. These models cover most existing probabilistic models for expert finding, including probabilistic versions of both profile-based and documentbased methods [3,2,1,8]. We further develop several techniques to improve the estimation of the proposed models, including incorporating topic expansion, using a mixture model to put different weights on the matching of different representations of an expert candidate, and defining a candidate prior for topic generation models based on the counts of email matches in the supporting documents. Evaluation on two standard TREC test collections shows that both families of models perform well empirically. In addition, we discover that how the judgements are made can affect the relative performance of different models. Experiment results also show that putting different weights on the matching of different representations of candidates with a simple mixture model is beneficial. Furthermore, it is also beneficial to compute the candidate prior for topic generation models based on the email counts in the supporting documents. However, topic expansion only improves the performance slightly when optimized.
2 A Probabilistic Framework for Expert Finding

Recognizing the similarity between expert finding and the traditional ad hoc retrieval task, we can apply the probabilistic ranking principle [10] to develop a general probabilistic framework for expert finding. Specifically, we will rank the candidates according to the probability that a candidate is "relevant" to the topic (i.e., expertise) specified in a query, and the key challenge is to compute this probability. Formally, suppose S = {d_1, ..., d_{|S|}} is a collection of supporting documents. Let t = t_1, t_2, ..., t_n be the description of a topic, where t_i is a term in the description. Let c be an expert candidate whose email and name are denoted as e(c) and n(c), respectively. Let R be a binary random variable to denote relevance (1 for relevant and 0 for non-relevant). Given a query t and an expert candidate c, we are interested in estimating the conditional probability p(R = 1|c, t), i.e., the probability that candidate c is relevant to topic t. After using the odds ratio to rank the candidates and applying Bayes' Theorem we have
$$p(R=1|c,t) \stackrel{\text{rank}}{=} \frac{p(R=1|c,t)}{p(R=0|c,t)} \stackrel{\text{rank}}{=} \frac{p(c,t|R=1)}{p(c,t|R=0)} \qquad (1)$$
where $\stackrel{\text{rank}}{=}$ means "equivalence for ranking the candidates". We now discuss two different ways to factor the conditional probabilities p(c, t|R = 1) and p(c, t|R = 0). They correspond to two different families of probabilistic models, which are referred to as candidate generation models and topic generation models, respectively. The high-level derivation is in spirit the same as in [6] if we take a topic (i.e., t) as Q and a candidate (i.e., c) as D.

2.1 Candidate Generation Models
One way to factor the conditional probabilities in Equation 1 is as follows:

$$\frac{p(c,t|R=1)}{p(c,t|R=0)} = \frac{p(c|t,R=1)\,p(t|R=1)}{p(c|t,R=0)\,p(t|R=0)}$$
Here we assume that an expert candidate c is "generated" by a probabilistic model based on a query t. Thus, this family of probabilistic models is referred to as candidate generation models. Since p(t|R = 1) and p(t|R = 0) are independent of the candidates, they can be ignored for the purpose of ranking candidates. Thus, the general retrieval function of the candidate generation model is:

$$p(R=1|c,t) \stackrel{\text{rank}}{=} \frac{p(c|t,R=1)}{p(c|t,R=0)},$$
where p(c|t, R = 1) is the probability of candidate c given the "expert generative model" of topic t, while p(c|t, R = 0) is the probability of c given the "non-expert generative model" of t. Thus the main question is how to estimate p(c|t, R = 1) and p(c|t, R = 0). For p(c|t, R = 1), we may use the supporting documents to connect t and c in the following way:

$$p(c|t,R=1) = \sum_{d\in S} p(c|d,t,R=1) \times p(d|t,R=1) \approx \sum_{d\in S} p(c|d,R=1) \times p(d|t,R=1) \qquad (2)$$
Here we assume that t and c are independent given the document d and the event R = 1. Intuitively, this formula is quite reasonable: p(d|t, R = 1) allows us to model the probability that a document d matches a topic t while p(c|d, R = 1) allows us to model the probability that a supporting document mentions a candidate c. A document d with higher values for both p(c|d, R = 1) and p(d|t, R = 1) would contribute more to the estimation of p(c|t, R = 1), which intuitively makes sense. Indeed, this is essentially to exploit the co-occurrences of the topic terms and candidate mentions in the supporting documents, an idea already used in the existing work [2,3]. Unfortunately, it is not immediately clear how we should estimate p(c|t, R = 0), the "non-expert" model for topic t. The difficulty comes from the fact that
we do not have evidence for a candidate not to be an expert for t. Thus as a possibly inaccurate simplification, we simply assume that p(c|t, R = 0) is uniformly distributed, leaving more accurate estimation as our future work. This assumption allows us to drop p(c|t, R = 0) and rank candidates solely based on p(c|t, R = 1), which we estimate using Equation 2. To compute p(d|t, R = 1) efficiently, we apply Bayes' Theorem and rewrite the equation in the following way:

$$p(c|t,R=1) = \sum_{d\in S} p(c|d,R=1) \times \frac{p(t|d,R=1)\,p(d|R=1)}{\sum_{d'\in S} p(t|d',R=1)\,p(d'|R=1)}$$
Since $\sum_{d'\in S} p(t|d',R=1)\,p(d'|R=1)$ is the same for all the candidates, it can be dropped for ranking. p(d|R = 1) can be regarded as a prior on d that can be exploited to favor a certain type of documents (e.g., email messages) in S. For simplicity, we assume that p(d|R = 1) is uniform, which leads to

$$p(R=1|c,t) \stackrel{\text{rank}}{=} \sum_{d\in S} p(c|d,R=1) \times p(t|d,R=1), \qquad (3)$$
where p(t|d, R = 1) is the probability that the topic t is relevant to the document d and p(c|d, R = 1) is the probability that the candidate c is mentioned in the document d. Both of them can be computed efficiently by using an existing probabilistic retrieval model [14]. The candidate generation model shown in Equation 2 covers the two-stage model proposed in [3] as a special case and can be regarded as a probabilistic version of the document-based approaches to expert finding.
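As an illustration, a minimal sketch of Eq. 3 is given below. The component probabilities p(c|d, R = 1) and p(t|d, R = 1) are assumed to have been computed already (e.g., by a query-likelihood retrieval run) and are represented as plain dictionaries keyed by document id.

```python
def candidate_generation_score(p_c_given_d, p_t_given_d):
    """Eq. 3: sum over supporting documents of p(c|d, R=1) * p(t|d, R=1)."""
    return sum(p_cd * p_t_given_d.get(d, 0.0) for d, p_cd in p_c_given_d.items())

p_c = {"d1": 0.4, "d7": 0.1}       # documents mentioning candidate c
p_t = {"d1": 0.02, "d3": 0.05}     # documents matching topic t
print(candidate_generation_score(p_c, p_t))   # 0.4 * 0.02 = 0.008
```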
2.2 Topic Generation Models
The other way to factor the conditional probabilities in Equation 1 is as follows:

$$\frac{p(c,t|R=1)}{p(c,t|R=0)} = \frac{p(t|c,R=1)\,p(c|R=1)}{p(t|c,R=0)\,p(c|R=0)}$$
Here a topic t is assumed to be "generated" by a probabilistic model based on an expert candidate c, thus we call this family of probabilistic models topic generation models. The general retrieval function of topic generation models is:

$$p(R=1|c,t) \stackrel{\text{rank}}{=} \frac{p(c|R=1)}{p(c|R=0)} \times \frac{p(t|c,R=1)}{p(t|c,R=0)},$$
where p(t|c, R = 1) is the probability of topic t according to the “expertise topic model” of candidate c, p(t|c, R = 0) is the probability of topic t according to “non-expertise topic model” of candidate c, p(c|R = 1) is the prior probability that c is an expert, and p(c|R = 0) is the prior probability that c is not an expert. The expert finding problem is thus reduced to the problem of estimating these probabilities. As in the case of candidate generation models, we make a possibly inaccurate simplification assumption that p(t|c, R = 0) is uniform due to the lack of appropriate data for estimating it. Thus the major task is to estimate p(t|c, R = 1), the
probability that t describes the expertise of candidate c. We discuss two possible ways to estimate it: profile-based estimation and document-based estimation, which correspond to the two categories of the existing methods for expert finding.

Profile-based estimation: The idea of profile-based estimation is to first estimate an expertise profile language model θc for every expert candidate c, and then compute the likelihood of t given the profile language model θc, i.e.:

$$p(t|c,R=1) \approx p(t|\theta_c, R=1) \qquad (4)$$
Naturally, the key challenge is to estimate the expertise profile θc for a candidate. One possible estimation method proposed in [2] is as follows:

$$p(t|\theta_c, R=1) = \prod_{t_i \in t} p(t_i|\theta_c, R=1)^{count(t_i,t)} \qquad (5)$$

$$= \prod_{t_i \in t} \Big( \sum_{d\in S} p(t_i|d, R=1)\, p(d|c, R=1) \Big)^{count(t_i,t)} \qquad (6)$$
where count(t_i, t) is the count of term t_i in the query t. This model (i.e., Model 1 in [2]) was shown to perform consistently worse than the other model in [2], but no clear explanation was given. When viewing the method in our probabilistic framework, we see that a main reason why this method does not work well is that the estimation of p(t_i|θc, R = 1) is not accurate. Specifically, the problem lies in the fact that a supporting document matching candidate c well (i.e., with a high value of p(d|c, R = 1)) may not necessarily support that the candidate is an expert on a topic (i.e., p(t|d) may be low). Based on Equation 6, as long as document d contains one query term t_i and mentions the candidate c, the document would be regarded as a useful document to support that c is an expert on topic t. This is clearly inaccurate because the document might not match the whole query concept even though it matches one query term very well. Based on this analysis, it would be reasonable to hypothesize that when we selectively use the documents that truly reflect the expertise of candidate c, instead of using every document in the collection as shown in Equation 7, such profile-based estimation will perform better. We have not further explored this direction; instead, we will use the document-based estimation method (described below) in our experiments.

Document-based estimation: Instead of creating an expertise profile language model, we could use supporting documents to connect a candidate and a topic as in the candidate generation models. Specifically, making similar assumptions as we have made in estimating candidate generation models, we have

$$p(t|c,R=1) = \sum_{d\in S} p(t|d,c,R=1) \times p(d|c,R=1) \approx \sum_{d\in S} p(t|d,R=1) \times p(d|c,R=1).$$
Thus, the expert candidates can be ranked according to

$$p(R=1|c,t) \stackrel{\text{rank}}{=} \frac{p(c|R=1)}{p(c|R=0)} \times \sum_{d\in S} \big(p(t|d,R=1) \times p(d|c,R=1)\big) \qquad (7)$$
Similarly, we rewrite p(d|c, R = 1) in terms of p(c|d, R = 1), which can be efficiently computed by treating representations of the candidate c as a query and using a standard probabilistic retrieval method to compute p(c|d, R = 1). In addition, as in the candidate generation models, we assume p(d|R = 1) is uniform, which leads to

$$p(R=1|c,t) \stackrel{\text{rank}}{=} \frac{p(c|R=1)}{p(c|R=0)} \times \sum_{d\in S}\Big(p(t|d,R=1) \times \frac{p(c|d,R=1)}{\sum_{d'\in S} p(c|d',R=1)}\Big), \qquad (8)$$
where p(c|R = 1) is the prior probability that a candidate c is an expert, p(c|R = 0) is the prior probability that a candidate is not an expert, p(t|d, R = 1) is the probability that topic t is discussed in document d, and p(c|d, R = 1) is the probability that an expert candidate c is mentioned in the document d. Comparing the two models in Equations (3) and (8), a main difference is that the topic generation model contains a candidate normalizer, $N(c) = \sum_{d\in S} p(c|d,R=1)$, while the candidate generation model does not. Since a larger N(c) indicates a more popular c, this normalizer would penalize a popular candidate. As we will show in Section 3, this normalization factor could provide an explanation for the performance difference between the two models. Model 2 proposed in [2] and the model used in [8] are both special cases of the model we presented in Equation 7 when p(c|R = 1) and p(c|R = 0) are assumed to be uniform. As we will show later, it is possible to specify a non-uniform prior to improve retrieval accuracy.
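Continuing the sketch above, the topic generation score of Eq. 8 differs only in the candidate normalizer N(c) and the prior ratio; both inputs are again assumed to be precomputed.

```python
def topic_generation_score(p_c_given_d, p_t_given_d, prior_ratio=1.0):
    """Eq. 8: prior ratio times the N(c)-normalised co-occurrence sum."""
    normalizer = sum(p_c_given_d.values())          # N(c), penalises popular candidates
    if normalizer == 0.0:
        return 0.0
    score = sum(p_t_given_d.get(d, 0.0) * (p_cd / normalizer)
                for d, p_cd in p_c_given_d.items())
    return prior_ratio * score

p_c = {"d1": 0.4, "d7": 0.1}
p_t = {"d1": 0.02, "d3": 0.05}
print(topic_generation_score(p_c, p_t))   # (0.02 * 0.4) / 0.5 = 0.016
```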
2.3 Estimation of Component Models
So far, we have derived two families of generative models, as shown in Equation 3 and Equation 8. They contain the following three component models: (1) p(c|d, R = 1), the probability that candidate c is mentioned in document d; (2) p(t|d, R = 1), the probability that topic t is discussed in document d; (3) p(c|R = 1) and p(c|R = 0), the prior probabilities of a candidate. We now discuss how we estimate each of them.

Modeling candidate mentions p(c|d, R = 1): In general, p(c|d, R = 1) can be computed by treating the description of candidate c (e.g., name and/or email) as a query and using a standard retrieval method to score document d. The simplest method is to concatenate the email e(c) and the name n(c) to form a query to represent candidate c. p(c|d, R = 1) could thus be computed as

$$p(c|d,R=1) = p(\text{``}e(c), n(c)\text{''}\,|\,d,R=1) \qquad (9)$$
However, intuitively, a name and an email have different characteristics. For example, using email alone to identify an expert could generate high-precision results, while using name alone to identify an expert may lead to high-recall results due to partial matching of names. Thus intuitively, to achieve the best results, it would be reasonable to combine them with appropriate weights. To capture
this intuition, we propose to model p(c|d, R = 1) using a mixture model involving both p(e(c)|d, R = 1) and p(n(c)|d, R = 1). That is, p(c|d, R = 1) can be computed as

$$p(c|d,R=1) = \lambda_e \cdot p(e(c)|d,R=1) + (1-\lambda_e) \cdot p(n(c)|d,R=1), \qquad (10)$$
where λe is the weight of the email model. Since e(c) and n(c) are both text, p(e(c)|d, R = 1) and p(n(c)|d, R = 1) can be computed using the query generation retrieval model with Dirichlet prior smoothing as described in [14].

Modeling the topic-document relationship p(t|d, R = 1): Since t is a piece of text, p(t|d, R = 1) can be computed using the query generation retrieval method with Dirichlet prior smoothing [14]. However, the original topic description t tends to be quite short, so it may not be informative. We thus propose to use a pseudo feedback method (e.g., the model-based feedback method proposed in [13]) to estimate an enriched topic query model θt, and incorporate this query model into our candidate finding model by generalizing the topic likelihood p(t|d, R = 1) to the cross entropy of the query model θt and the document model θd estimated from d using Dirichlet prior smoothing. That is,

$$p(t|d,R=1) \propto \exp\Big(\sum_{w} p(w|\theta_t)\,\log p(w|\theta_d)\Big). \qquad (11)$$
Clearly, if we set θt to the empirical word distribution in t, this would be equivalent to the original topic likelihood.

Setting the candidate prior p(c|R = 1)/p(c|R = 0): p(c|R = 1) and p(c|R = 0) are the prior probabilities of a candidate. In most existing work [3,2], they are assumed to be uniform. However, as shown in Section 3, a reasonable prior can help improve the retrieval accuracy. Based on Bayes' Theorem, we can rewrite the candidate prior as

$$\frac{p(c|R=1)}{p(c|R=0)} \stackrel{\text{rank}}{=} \frac{p(R=1|c)}{p(R=0|c)} = \frac{p(R=1|c)}{1-p(R=1|c)},$$
where p(R = 1|c) is the probability that a candidate is an expert. To estimate it, we may reasonably assume that a candidate whose email has been mentioned many times has a high prior probability of being an expert. We adapt a formula that is similar to the BM25 term frequency normalization formula [9], i.e.,

$$p(R=1|c) \propto \frac{count(e(c),S)}{2 \times count(e(c),S) + \beta}.$$

Thus, the candidate prior is

$$\frac{p(c|R=1)}{p(c|R=0)} \propto \frac{count(e(c),S)}{count(e(c),S) + \beta}, \qquad (12)$$
where count(e(c), S) is the count of mentions of the email of candidate c in the collection S, and β is the parameter to control the skewness of the prior. A larger β would reduce the skewness of the prior (i.e., leading to a weaker prior), thus it can be interpreted as being inversely proportional to our confidence in this prior.
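A compact sketch of the three component models of this section is given below: Dirichlet-smoothed query likelihood for the text models, the email/name mixture of Eq. 10, and the email-count prior of Eq. 12. All corpus statistics (term counts, collection probabilities, email counts) are assumed to be supplied by the indexing stage, and the default parameter values are placeholders rather than the tuned values from the paper.

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_prob, mu=2000):
    """Dirichlet-smoothed p(query | document), returned in probability space."""
    log_p = 0.0
    for w in query_terms:
        p_w = (doc_tf.get(w, 0) + mu * coll_prob.get(w, 1e-9)) / (doc_len + mu)
        log_p += math.log(p_w)
    return math.exp(log_p)

def candidate_mention_prob(email_terms, name_terms, doc_tf, doc_len, coll_prob,
                           lambda_e=0.5, mu_e=2000, mu_n=2000):
    """Eq. 10: mixture of the email model and the name model."""
    p_email = query_likelihood(email_terms, doc_tf, doc_len, coll_prob, mu_e)
    p_name = query_likelihood(name_terms, doc_tf, doc_len, coll_prob, mu_n)
    return lambda_e * p_email + (1.0 - lambda_e) * p_name

def candidate_prior_ratio(email_count, beta=100.0):
    """Eq. 12: prior ratio count(e(c), S) / (count(e(c), S) + beta)."""
    return email_count / (email_count + beta)
```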
3 Experiments
We evaluate the proposed models on two TREC enterprise collections [4]: (1) Ent05: the W3C corpus with topics EX01-EX50. The topics are the names of working groups. This is the collection used in the Enterprise track of 2005. (2) Ent06: the W3C corpus with topics EX51-EX105. The topics were contributed by the participants of the Enterprise track in 2006, and represent real-world information needs. This is the collection used in the Enterprise track of 2006. We use the entire corpus, and only the titles of the topic descriptions. We performed minimal preprocessing: we apply stemming with a Porter stemmer and no stop words are removed. We evaluate the methods with mean average precision (MAP), which is the official evaluation measure of the expert finding task in the Enterprise track.

We conduct six sets of experiments. First, we evaluate the proposed models and compare them with the baseline models. Second, we examine the effectiveness of using a mixture model (i.e., Equation 10) to model the candidate mentions. Third, we study the effectiveness of the topic expansion described in Equation 11. Fourth, we demonstrate the effectiveness of the candidate prior proposed in Equation 12. Fifth, we examine the parameter sensitivity. Finally, we compare our results to the official TREC runs.
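For reference, MAP over the ranked candidate lists can be computed as in the sketch below, with the relevance judgements given as a set of relevant candidates per topic.

```python
def average_precision(ranked_candidates, relevant):
    hits, precision_sum = 0, 0.0
    for rank, cand in enumerate(ranked_candidates, start=1):
        if cand in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(per_topic_runs):
    """per_topic_runs: list of (ranked candidate list, set of relevant candidates)."""
    return sum(average_precision(r, rel) for r, rel in per_topic_runs) / len(per_topic_runs)

print(mean_average_precision([(["c2", "c1", "c5"], {"c1", "c4"})]))   # 0.25
```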
3.1 Comparison of Proposed Models
In Table 1, we compare the optimal performance of the two proposed models with a state-of-the-art baseline model. "Cand-gen (mixture)" is the candidate generation model (Equation (3)) estimated using the mixture model in Equation (10). "Topic-gen (mixture+prior)" is the topic generation model (Equation (8)) estimated using the mixture model in Equation (10) and the prior in Equation (12). In both models, p(t|d, R = 1) is computed using the query generation model with Dirichlet prior smoothing as described in [14]. "Baseline" is the topic generation model in Equation (8) estimated using Equation 9 and a uniform prior; this model is essentially similar to Model 2 proposed in [2].

Table 1. Performance comparison of different models

Models                       Ent05    Ent06
Baseline                     0.151    0.191
Cand-gen (mixture)           0.186    0.449
Topic-gen (mixture+prior)    0.196    0.334
Table 1 shows that our proposed models perform much better than the baseline model. In addition, the optimal performance of the topic generation model on Ent05 is 0.196, which is better than the best reported performance (i.e., 0.1894) of a similar model using rule-based name matching (i.e., Model 2 in [2]). Compared with the topic generation model, the candidate generation model performs slightly worse on Ent05 but much better on Ent06. When looking into it,
Table 2. Average popularity of candidates

        Judgements    Results of Topic-gen    Results of Cand-gen
Ent05   0.91          1.18                    2.27
Ent06   2.37          1.21                    2.02
we found that this performance difference may be caused by the different ways of creating judgements in these two years' Enterprise track. In Ent05, the topics are the names of working groups, and the judgements are independent of the document collection. But in Ent06, the topics are contributed by the participants, and the judgements are created based on the information from the document collection. Such different ways of creating judgements directly affect the characteristics of the relevant experts, especially their occurrences in the collection (i.e., the popularity of the experts). Using the normalizer $N(c) = \sum_{d\in S} p(c|d,R=1)$ as a measure of candidate popularity, we show some popularity statistics in Table 2. We see that, as expected, the relevant experts in Ent06 are more popular in the collection compared with those identified in Ent05, which means that we should expect penalizing popular experts to help more (or harm less) for Ent05 than for Ent06. Indeed, the fact that the topic-generation model performs better than the candidate-generation model on Ent05 but worse on Ent06 suggests that such penalization is beneficial for Ent05 but harmful for Ent06; this is because the two models differ mainly in the penalization of popular candidates, as discussed in Section 2. From Table 2, we can also see that the average popularity of the candidates returned by the topic-generation model is indeed much lower than that returned by the candidate-generation model. It is surprising that the different ways in which Ent05 and Ent06 were created can affect the relative performance of the two models, making it interesting to further explore how to create appropriate test collections for expert finding.
3.2
Effectiveness of Mixture Model for Candidate Mentions
We discussed two ways to model candidate mentions as shown in Equation 9 (denoted as “merge”) and in Equation 10 (denoted as “mixture”). We compare these two estimations and report the results in Table 3. The results show that it is more effective to use a mixture model to model the candidate mentions.

Table 3. Effectiveness of mixture model for candidate mentions

                               Ent05          Ent06
Cand-gen           merge       0.130          0.280
                   mixture     0.186 (43%)    0.449 (60%)
Topic-gen (prior)  merge       0.155          0.193
                   mixture     0.196 (26%)    0.334 (73%)
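As a rough illustration of the “mixture” estimation, the sketch below combines an email-matching probability and a name-matching probability with a weight λe, in the spirit of the later description of λe as “the relative weight on email matching to name matching (as in Equation 10)”. The exact form of Equation 10 is not reproduced in this section, so the function below is an assumption made for illustration only.

def mixture_candidate_mention(p_email, p_name, lambda_e=0.5):
    # Mixture estimate of observing candidate c's mention in a document:
    # lambda_e weights the email representation, (1 - lambda_e) the name
    # representation. Both inputs are probabilities in [0, 1].
    return lambda_e * p_email + (1.0 - lambda_e) * p_name

# Example: a document that matches the candidate's email strongly
# but the name only weakly (hypothetical numbers).
print(mixture_candidate_mention(p_email=0.8, p_name=0.1, lambda_e=0.7))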
3.3
Effectiveness of Topic Expansion
We proposed two possible ways to model the topic-document relationship: (1) “query likelihood”, which is the query generation model described in [14]; (2) “feedback”, which is the model-based feedback method described in [13]. We compare these two strategies and report the optimal performance in Table 4. The results show that topic expansion consistently improves the performance when optimized, but the improvement is smaller than that observed in the traditional ad hoc retrieval problem [13].

Table 4. Effectiveness of topic expansion

                            Ent05                            Ent06
                            query likelihood  expansion      query likelihood  expansion
Cand-gen (mixture)          0.186             0.196 (5%)     0.449             0.465 (4%)
Topic-gen (mixture+prior)   0.196             0.204 (4%)     0.334             0.359 (7%)
3.4
Effectiveness of Candidate Prior in Topic Generation Models
In the topic generation models, we proposed to compute the candidate prior based on the counts of emails using Equation 12, which is denoted as “emailprior”. In contrast, most existing work assumes that the candidate prior is uniform, which is denoted as “uniformprior”. Table 5 shows the performance difference between these two strategies. As shown in Table 5, incorporating the email-based prior can improve the performance. It is interesting that when we estimate the candidate mentions with the mixture model as in Equation 10, the email-based prior improves the performance more.

Table 5. Performance comparison for candidate prior

Topic-gen                    Ent05          Ent06
merge     uniformprior       0.151          0.191
          emailprior         0.155 (3%)     0.193 (1%)
mixture   uniformprior       0.172          0.204
          emailprior         0.196 (14%)    0.334 (64%)
3.5
Parameter Sensitivity
There are four parameters in both the candidate generation models and the topic generation models: μt, μe, μn and λe. Topic generation models have one extra parameter for the candidate prior, i.e., β. μt, μe and μn are the Dirichlet prior smoothing parameters used to compute p(t|d, R = 1), p(e(c)|d, R = 1) and p(n(c)|d, R = 1), respectively. We examine the parameter sensitivity for these three parameters in Figure 1.
Fig. 1. Performance Sensitivity of μt (upper), μe (lower left) and μn (lower right). MAP is plotted against each parameter (0–10000) for the Ent05 and Ent06 candidate-generation and topic-generation runs.
Fig. 2. Performance Sensitivity of λe (left) and β (right). MAP is plotted against λe (0–1) for all four runs, and against β (1–10) for the Ent05 and Ent06 topic-generation runs.
In every case, we change the value of one parameter while fixing the values of the other parameters. The plots show that the performance is relatively more stable to changes of μe than to changes of μt and μn. It is interesting that the optimal values of the μ's are generally around 100, which is much smaller than 2000, the recommended value in traditional ad hoc retrieval [14]. λe controls the relative weight on email matching versus name matching (as in Equation 10). We now examine how it affects the performance. The left plot in Figure 2 shows that the performance is relatively stable when λe < 0.9. β is a prior confidence parameter defined in Equation 12. The right plot in Figure 2 shows the parameter sensitivity curve of this parameter. We observe that the optimal values of β are different for the two data sets. Larger β leads
to better performance on Ent06, while it leads to worse performance on Ent05. This observation implies that we could be more confident in the prior on Ent05 than on Ent06, which might be related to the different characteristics of the two sets of queries. We leave further analysis as future work.
3.6
Comparison with TREC Results
Our best results are 0.204 for Ent05 and 0.465 for Ent06 as shown in Table 4. Compared with the official results of the TREC Enterprise track [4,11], our best results would be in the top 5 for Ent05 and top 10 for Ent06. The top performing systems tend to use various kinds of heuristics, which we did not use. Since our models are general and orthogonal to many of these heuristics, we can expect our models to perform even better when we add these heuristics.
4
Conclusions and Future Work
In this paper, we present a general probabilistic framework to solve the problem of expert finding. We derive two families of generative models, i.e., candidate generation models and topic generation models. These models cover most existing probabilistic models for expert finding. To improve the estimation of the proposed models, we further propose the following three techniques: (1) a mixture model for modeling the candidate mentions, which allows us to put different weights on different representations of an expert candidate; (2) topic expansion for modeling the topic-document relationship, which expands the original queries with more informative terms; (3) an email-based candidate prior, which provides a better estimation of the prior probability that a candidate is an expert. Empirical results have demonstrated the effectiveness of the proposed models and estimation strategies. There are many interesting future research directions. First, we need to study how to automatically set all the parameters through statistical estimation. Second, we could explore alternative ways to estimate the components in the models. Finally, it would be interesting to study how to estimate a reasonable expertise profile for every expert candidate in a principled way, since such profiles can potentially be used for many other tasks in addition to expert finding.
Acknowledgments
This material is based in part upon work supported by the National Science Foundation under award numbers IIS-0347933 and IIS-0428472. We thank Lixin Zhou and three anonymous reviewers for their useful comments.
References

1. L. Azzopardi, K. Balog, and M. de Rijke. Language modeling approaches for enterprise tasks. In Proceedings of TREC-05, 2006.
2. K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proceedings of SIGIR-06, 2006.
3. Y. Cao, J. Liu, S. Bao, and H. Li. Research on expert search at the enterprise track of TREC 2005. In Proceedings of TREC-05, 2006.
4. N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC-2005 enterprise track. In Proceedings of TREC-05, 2006.
5. Y. Fu, W. Yu, Y. Li, Y. Liu, M. Zhang, and S. Ma. THUIR at TREC 2005: Enterprise track. In Proceedings of TREC-05, 2006.
6. J. Lafferty and C. Zhai. Probabilistic relevance models based on document and query generation. Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, 13, 2003.
7. C. Macdonald, B. He, V. Plachouras, and I. Ounis. University of Glasgow at TREC 2005: Experiments in terabyte and enterprise tracks with Terrier. In Proceedings of TREC-05, 2006.
8. D. Petkova and W. B. Croft. UMass notebook 2006: Enterprise track. In Proceedings of TREC-06, 2007.
9. S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of SIGIR'94, 1994.
10. S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33(4):294–304, Dec. 1977.
11. I. Soboroff, A. P. de Vries, and N. Craswell. Overview of the TREC 2006 enterprise track. In Proceedings of TREC-06, 2007.
12. E. Voorhees and D. Harman, editors. Proceedings of Text REtrieval Conference (TREC1-9). NIST Special Publications, 2001. http://trec.nist.gov/pubs.html.
13. C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of CIKM-01, 2001.
14. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR-01, 2001.
Using Relevance Feedback in Expert Search Craig Macdonald and Iadh Ounis Department of Computing Science, University of Glasgow, G12 8QQ, UK {craigm,ounis}@dcs.gla.ac.uk
Abstract. In Enterprise settings, expert search is considered an important task. In this search task, the user has a need for expertise - for instance, they require assistance from someone about a topic of interest. An expert search system assists users with their “expertise need” by suggesting people with relevant expertise on the topic of interest. In this work, we apply an expert search approach that does not explicitly rank candidates in response to a query, but instead implicitly ranks candidates by taking into account a ranking of documents with respect to the query topic. Pseudo-relevance feedback, also known as query expansion, has been shown to improve retrieval performance in adhoc search tasks. In this work, we investigate to what extent query expansion can be applied in an expert search task to improve the accuracy of the generated ranking of candidates. We define two approaches for query expansion: one is based on the initial ranking of documents for the query topic; the second is based on the final ranking of candidates. The aims of this paper are two-fold: firstly, to determine if query expansion can be successfully applied in the expert search task, and secondly, to ascertain if either of the two forms of query expansion can provide robust, improved retrieval performance. We perform a thorough evaluation contrasting the two query expansion approaches in the context of the TREC 2005 and 2006 Enterprise tracks.
1
Introduction
In large Enterprise settings with vast amounts of digitised information, it is often important that a user is not only able to identify documents that are relevant to a topic of interest, but also to find people that have relevant expertise on the topic. People are a critical source of information because they can explain and provide arguments about why specific decisions were made [5]. Hence, in addition to classical document Information Retrieval (IR) systems, there is a growing interest in the research community in building accurate expert search systems. An expert search system aids a user in their “expertise need” by identifying people with relevant expertise on the topic of interest. The retrieval performance of an expert search system is very important. If an expert search system suggests incorrect experts, then this could lead the user to contact these people inappropriately. Similar to document IR systems, the accuracy of an expert search system can be measured using the traditional
IR evaluation measures: precision, the accuracy of suggested candidate experts; and recall, the number of candidate experts with relevant expertise retrieved. Expert search has been a retrieval task in the Enterprise track of the Text REtrieval Conference (TREC) since 2005 [4], aiming to evaluate expert search approaches. Pseudo-relevance feedback (PRF) [10] has been used in adhoc search tasks to improve the performance of document IR systems. PRF describes the process of automatically examining top-ranked documents in an IR system ranking, and using information from these documents to improve the ranking of documents. This is done by assuming that the top-ranked documents are relevant, and using information from this ‘pseudo-relevant set’ to improve the accuracy of the ranking by expanding the initial query and re-weighting the query terms¹. In this paper, we explore how query expansion can be applied in an expert search task. To this end, we experiment with an expert search system that is based on the voting model for expert search [8]. In this model, documents are firstly associated with candidates to represent the candidates' expertise. Then the voting model considers the ranking of documents with respect to the query, in order to generate an accurate ranking of candidates. The voting model for expert search is interesting for these experiments, as we can apply query expansion using the underlying ranking of documents as the pseudo-relevant set. Moreover, we investigate how query expansion can be applied if the ranking of candidates is used as the pseudo-relevant set. We call these approaches document-centric and candidate-centric query expansion, respectively. In this work, our objectives are twofold: firstly, to determine if query expansion can be successfully applied in expert search; and secondly, to analyse both forms of query expansion, allowing conclusions to be drawn concerning the applicability and effectiveness of both approaches. In order to fully understand the applicability of query expansion, we experiment using two statistically different models from the Divergence from Randomness (DFR) framework for extracting informative terms from the pseudo-relevant set: one model based on Bose-Einstein statistics, which is similar to Rocchio [10], and one based on Kullback-Leibler divergence [1]. Furthermore, we experiment using two different voting techniques for ranking candidates, using the topics and relevance assessments for the W3C collection from the TREC 2005 and 2006 Expert Search tasks. Conclusions are drawn across the two voting techniques applied. Section 2 provides further detail on the model for expert search that we employ in this work, and demonstrates the baselines achieved using this approach. Section 3 defines how query expansion can be applied to expert search. We experimentally investigate both approaches for query expansion in Section 4. Section 5 investigates the effect of varying the parameters of query expansion, to assess the maximum potential and stability of each approach. Section 6 provides concluding remarks and suggestions for future work.
¹ In this work, we use the terms pseudo-relevance feedback and query expansion interchangeably.
2
Expert Search
Modern expert search systems for Enterprise settings work by using documents to form the profile of textual evidence for each candidate. The candidate's profile represents the expertise of the candidate expert to the expert search system. This documentary evidence can take many forms, such as intranet documents, documents or emails authored by the candidates, or even emails sent by the candidate or web pages visited by the candidate (see [8] for an overview). In this work, the profile of a candidate is considered to be the set of documents associated with the candidate. These candidate profiles can then be used to rank candidates automatically in response to a query. This work uses the voting approach for expert search proposed by Macdonald & Ounis in [8], which considers the problem of expert search as a voting process. Instead of directly ranking candidates, it considers the ranking of documents with respect to the query Q, denoted by R(Q). The ranking of candidates can then be modelled as a voting process, from the retrieved documents in R(Q) to the profiles of candidates: every time a document is retrieved and is associated with a candidate, this is a vote for that candidate to have relevant expertise to Q. The ranking of the candidate profiles can then be determined by aggregating the votes of the documents. Eleven voting techniques for ranking experts were defined in [8], which each employ various sources of evidence that can be derived from the ranking of documents with respect to the query topic. In this work, we only use the CombSUM and expCombMNZ voting techniques, because they provide robust results on the W3C collection. The CombSUM technique ranks candidates by considering the sum of the relevance scores of the documents associated with each candidate's profile. Hence the relevance score of a candidate expert C with respect to a query Q, score_cand(C, Q), is:

score_cand(C, Q) = Σ_{d ∈ R(Q) ∩ profile(C)} score(d, Q)    (1)

where profile(C) is the set of documents associated with candidate C, and score(d, Q) is the relevance score of the document in the document ranking R(Q). For expCombMNZ, the relevance score of a candidate C's expertise to a query Q is given by:

score_cand_expCombMNZ(C, Q) = |R(Q) ∩ profile(C)| · Σ_{d ∈ R(Q) ∩ profile(C)} exp(score(d, Q))    (2)

where |R(Q) ∩ profile(C)| is the number of documents from the profile of candidate C that are in the ranking R(Q), and exp() is the exponential function. expCombMNZ is similar to CombSUM, but also includes a second component which takes into account the number of documents in R(Q) associated with each candidate, hence explicitly modelling the number of votes made by the documents for each candidate. The exponential function boosts candidates that are associated with highly scored documents (strong votes).
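The two voting techniques can be sketched in a few lines of Python. This is a minimal illustration of Equations (1) and (2); the dictionaries used for the document ranking and the candidate profiles are assumed input formats, not the actual data structures of the system described here.

import math

def comb_sum(doc_scores, profile):
    # Equation (1): sum of the scores of profile documents retrieved in R(Q).
    # doc_scores: dict doc_id -> score(d, Q); profile: set of doc_ids.
    return sum(s for d, s in doc_scores.items() if d in profile)

def exp_comb_mnz(doc_scores, profile):
    # Equation (2): number of votes times the sum of exp(score).
    voting_docs = [s for d, s in doc_scores.items() if d in profile]
    if not voting_docs:
        return 0.0
    return len(voting_docs) * sum(math.exp(s) for s in voting_docs)

def rank_candidates(doc_scores, profiles, technique=comb_sum):
    # profiles: dict candidate -> set of associated doc_ids.
    scores = {c: technique(doc_scores, p) for c, p in profiles.items()}
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Toy example with hypothetical retrieval scores.
ranking = {"d1": 2.1, "d2": 1.4, "d3": 0.9}
profiles = {"alice": {"d1", "d3"}, "bob": {"d2"}}
print(rank_candidates(ranking, profiles, exp_comb_mnz))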
In the remainder of this section, we first define the strong baselines that we deploy for our experiments. Secondly, we provide details on two statistically different query expansion (QE) techniques based on the Divergence from Randomness (DFR) framework; we employ two QE techniques in our experiments to ensure that the conclusions we draw are general.
2.1
Baselines
In this section, we define our experimental setup and the baselines we use in this work. Our experiments are carried out in the setting of the Expert Search tasks of the TREC Enterprise track, 2005 and 2006. The TREC W3C collection is indexed using Terrier [9], removing standard stopwords and applying the first two steps of Porter's stemming algorithm. Initial experimental results have shown that applying only this weaker form of stemming results in increased high precision without degradation in mean average precision (MAP) for this task. Next, we generate the profiles of documentary evidence of expertise for the candidates: for each candidate, documents which contain an exact match of the candidate's full name are used as the profile of the candidate. From the two TREC expert search tasks, we have a total of 99 topics with relevance assessments. Documents are ranked using the DLH13 document weighting model from the DFR framework. The DLH13 document weighting model is a generalisation of the parameter-free hypergeometric DFR model in a binomial case [2,7]. The hypergeometric model assumes that the document is a sample, and the population is the collection. For the DLH13 document weighting model, the relevance score of a document d for a query Q is given by:

score(d, Q) = Σ_{t∈Q} qtw/(tf + 0.5) · ( tf · log2((tf · avg_l/l) · (N/F)) + 0.5 · log2(2π · tf · (1 − tf/l)) )    (3)

where tf is the term frequency of the term t in document d, F is the frequency of the query term in the collection and N is the number of documents in the collection. l is the length of the document d in tokens, and avg_l is the average document length in the whole collection. The query term weight qtw is given by qtf/qtf_max, where qtf is the query term frequency and qtf_max is the maximum query term frequency among the query terms. We chose to experiment using DLH13 because it has no term frequency normalisation parameter that requires tuning, as this is assumed to be inherent to the model. Moreover, DLH13 performs robustly on many collections and tasks without any need for parameter tuning [7]. By applying DLH13, we remove the presence of any term frequency normalisation parameter in our experiments.
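As a sanity check of Equation (3), a small Python sketch of a DLH13-style scorer is given below. It follows the formula as reconstructed above; the collection statistics passed in (N, F, average length) are assumed to come from an index, and this is an illustrative re-implementation rather than Terrier's own code.

import math

def dlh13_score(query_tf, doc_tf, doc_len, avg_len, num_docs, coll_freq):
    # Score one document for a query under a DLH13-style weighting (Eq. 3).
    # query_tf: dict term -> qtf; doc_tf: dict term -> tf in the document;
    # coll_freq: dict term -> F, total frequency of the term in the collection.
    qtf_max = max(query_tf.values())
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        qtw = qtf / qtf_max
        f_c = coll_freq[term]
        main = tf * math.log2((tf * avg_len / doc_len) * (num_docs / f_c))
        correction = 0.5 * math.log2(2 * math.pi * tf * (1 - tf / doc_len))
        score += (qtw / (tf + 0.5)) * (main + correction)
    return score

# Hypothetical statistics for a two-term query.
print(dlh13_score({"expert": 1, "search": 1}, {"expert": 3, "search": 1},
                  doc_len=250, avg_len=940.0, num_docs=331037,
                  coll_freq={"expert": 12000, "search": 45000}))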
In this work, we could also experiment with other weighting models. However, it was shown that the relative performance rankings of the voting techniques were concordant across a selection of weighting models on the same W3C collection [8]. This suggests that conclusions drawn using one document weighting model should be applicable to any other state-of-the-art model. Table 1 shows the retrieval performances achieved by the baseline expert search approach we employ in this paper. We report the MAP and P@10 evaluation measures. The retrieval performance is reported on the TREC 2005 and 2006 topics. In addition, we also report the median MAP run for each year (the median P@10 runs are not available).

Table 1. Baseline performances of CombSUM and expCombMNZ, using the DLH weighting model, on the 2005 and 2006 TREC Enterprise track expert search tasks. For TREC 2005, topics only had one field, while for TREC 2006, we use title-only (short) queries. Mean average precision (MAP) and precision at 10 (P@10) are reported. The median MAP runs of all participants from the respective year of TREC are given. Moreover, the best result for each measure is emphasised.

              TREC 2005           TREC 2006
              MAP      P@10       MAP      P@10
Median        0.1402   -          0.3412   -
CombSUM       0.2037   0.3240     0.5188   0.6388
expCombMNZ    0.2037   0.3100     0.5502   0.6837

As apparent from Table 1, the voting
techniques are clearly performing well above the median run for both years. Moreover, these results for the TREC 2005 Enterprise task are similar to those of the 3rd top group participating that year. For TREC 2006, the ranking results are not yet publicly available, but with such a large margin over the median run, these results appear strong. The voting approach is robust and general, as it is not dependent on heuristics based on the enterprise collection used. In [3], Balog et al. defined a language modelling approach for expert search. However, in contrast to the voting approach by Macdonald & Ounis, their approach can only be applied in a language modelling setting. The voting model approach is more flexible, because any approach (including language modelling) can be used to generate the document ranking R(Q). However, there are some similarities between the two approaches. In particular, for the voting approach, if Hiemstra's language modelling approach [6] was used to generate R(Q), and CombSUM applied to combine the scores for candidates, then this would be identical to the candidate ranking formula of Equation (4) in [3]. For this reason, we do not experiment using the Balog et al. language modelling approach, as its characteristics are encapsulated in the CombSUM voting technique².
² In fact, experimental evaluation of Balog et al.'s approach on the same profile set provides similar results to Hiemstra's language model combined with the CombSUM voting technique.
Because the voting approach allows any IR technique to be used to generate the ranking of documents R(Q), we wish to determine the extent to which the performance of the approach can be improved by increasing the quality of the document ranking. An obvious way to apply QE is to use the top-ranked documents in R(Q) as the pseudo-relevant set. However, we also propose an alternative approach for applying query expansion, namely using the top-ranked candidates as the pseudo-relevant set.
2.2
Query Expansion Models
In our investigation into query expansion (QE) in expert search, we need to determine if the QE model employed has any effect on the conclusions concerning our two approaches for query expansion. Hence, we employ two statistically different QE models from the DFR framework [1], known as term weighting models, for extracting informative terms from the pseudo-relevant set of top-ranked documents. DFR term weighting models measure the informativeness of a term by considering the divergence of the term occurrence in the pseudo-relevant set from a random distribution. Terrier provides various DFR-based term weighting models for query expansion. We experiment with two term weighting models to understand the importance of the choice of model. One term weighting model, known as Bo1, is based on Bose-Einstein statistics and is similar to Rocchio [1]. The other is based on the Kullback-Leibler (KL) divergence between the pseudo-relevant set sample and the collection. In Bo1, the informativeness w(t) of a term t is given by:

w(t) = tf_x · log2((1 + P_n)/P_n) + log2(1 + P_n)    (4)

where tf_x is the frequency of the term in the pseudo-relevant set, and P_n is given by F/N. F is the term frequency of the query term in the whole collection and N is the number of documents in the collection. Alternatively, w(t) can be calculated using the Kullback-Leibler divergence term weighting model [1]:

w(t) = P_x · log2(P_x/P_c)    (5)

where P_x = tf_x/l_x and P_c = F/token_c. l_x is the size in tokens of the pseudo-relevant set, and token_c is the total number of tokens in the collection. Note that unlike Bo1, KL uses the size of the pseudo-relevant set while measuring divergence. Using either Bo1 or KL to define w(t), the top exp term informative terms are identified from the top exp doc ranked documents, and these are added to the query (exp term ≥ 1, exp doc ≥ 2). The default setting for these parameters is exp doc = 3 and exp term = 10, suggested by Amati in [1] after extensive experiments. Finally, for both the Bo1 and KL term weighting models, the query term frequency qtw of an expanded query term is given by [1]:

qtw = qtw + w(t)/w_max(t)    (6)
where w_max(t) is the maximum w(t) of the expanded query terms. qtw is initially 0 if the query term was not in the original query.
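The following sketch illustrates how the Bo1 and KL term weighting models of Equations (4) and (5), together with the query re-weighting of Equation (6), could be used to expand a query. It is a simplified illustration assuming bag-of-words statistics are already available; it is not Terrier's implementation, and the assumption that original query terms start with qtw = 1.0 is made only for brevity.

import math
from collections import Counter

def bo1_weight(tf_x, coll_freq, num_docs):
    # Equation (4): Bose-Einstein (Bo1) informativeness, with P_n = F / N.
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)

def kl_weight(tf_x, set_len, coll_freq, coll_tokens):
    # Equation (5): KL informativeness, with P_x = tf_x/l_x and P_c = F/token_c.
    p_x = tf_x / set_len
    p_c = coll_freq / coll_tokens
    return p_x * math.log2(p_x / p_c)

def expand_query_bo1(query_terms, pseudo_rel_tokens, coll_freq, num_docs, exp_term=10):
    # Identify the exp_term most informative terms in the pseudo-relevant set
    # and re-weight them with Equation (6).
    counts = Counter(pseudo_rel_tokens)
    w = {t: bo1_weight(tf, coll_freq.get(t, 1), num_docs) for t, tf in counts.items()}
    top = sorted(w, key=w.get, reverse=True)[:exp_term]
    w_max = max(w[t] for t in top)
    qtw = {t: 1.0 for t in query_terms}
    for t in top:
        qtw[t] = qtw.get(t, 0.0) + w[t] / w_max
    return qtw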
3
Applying QE in Expert Search Task
Our work concerns the applicability of QE to expert search. The application of QE in adhoc search tasks is known to improve retrieval performance. Using the voting model described in Section 2, it can be seen that the quality of the generated ranking of candidates is dependent on how well R(Q) ranks documents associated with relevant candidates. Then any improvement in the quality of the document ranking should improve the accuracy of the retrieved candidate ranking, because the document ranking votes will be more accurate, and hence the aggregated ranking of candidates will be more accurate. We call document-centric query expansion the approach that considers the top-ranked documents of the document ranking R(Q) as the pseudo-relevant set. We hypothesise that the candidate ranking generated by applying a voting technique to the refined document ranking will have increased retrieval performance, when compared to applying the voting technique to the initial R(Q). Moreover, we propose a second approach called candidate-centric query expansion, where the pseudo-relevant set is taken from the final ranking of candidates generated by a query. If the top-ranked candidates are defined to be the pseudo-relevant set, then we can extract informative terms from the corresponding candidate profiles, and use these to generate a refined ranking of documents. In using this expanded query, we hypothesise that the document ranking will become nearer to the expertise area of the initially top-ranked candidates, and hence the generated candidate ranking will likely include more candidates with relevant expertise (a sketch contrasting the two pseudo-relevant sets is given after Table 2). In the following section, we assess the usefulness of both forms of QE compared to the baseline approaches defined in Section 2. It is of note that, typically, each candidate profile will have many associated documents. Hence, applying candidate-centric QE will consider far more tokens of text in the top-ranked candidates than applying document-centric QE. In particular, Table 2 details the statistics of the documents of the W3C collection, and the document-candidate associations we use in this work. Of particular note is the size in tokens of profiles compared to documents - the average profile size is 76 times larger than the average document, while the largest candidate profile is a massive 444 times larger than the largest document in the collection. Due to the large difference between candidate profiles and documents, it is possible that the default settings of exp doc = 3 and exp term = 10 may not be suitable for candidate-centric query expansion. In the remainder of the paper, we assess whether the default settings are in fact suitable for document-centric and, in particular, candidate-centric query expansion.
Table 2. Statistics of the W3C collection, and of the candidate-document associations used in this work

W3C Collection
Number of Documents                               331,037
Size of Collection (tokens)                       310,720,411
Average size of a Document (tokens)               9,385
Largest Document (tokens)                         50,001
Number of Candidates                              1,092
Size of all Candidate Profiles (tokens)           779,840,190
Average size of a Candidate Profile (documents)   913
Average size of a Candidate Profile (tokens)      714,139
Largest Candidate Profile (documents)             88,080
Largest Candidate Profile (tokens)                22,182,816
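The distinction between the two forms of QE is essentially the choice of pseudo-relevant set. The sketch below shows one way this choice could be wired up on top of ranking and expansion functions like those sketched earlier; the variable names and the use of plain token lists are illustrative assumptions rather than the system's actual interfaces.

def document_centric_pseudo_set(doc_ranking, doc_tokens, exp_doc=3):
    # Take the text of the exp_doc top-ranked documents in R(Q).
    # doc_ranking: list of (doc_id, score) sorted by score descending.
    top_docs = [d for d, _score in doc_ranking[:exp_doc]]
    tokens = []
    for d in top_docs:
        tokens.extend(doc_tokens[d])
    return tokens

def candidate_centric_pseudo_set(cand_ranking, profiles, doc_tokens, exp_doc=3):
    # Take the text of all documents in the profiles of the exp_doc
    # top-ranked candidates; profiles can be orders of magnitude larger
    # than single documents (cf. Table 2).
    top_cands = [c for c, _score in cand_ranking[:exp_doc]]
    tokens = []
    for c in top_cands:
        for d in profiles[c]:
            tokens.extend(doc_tokens[d])
    return tokens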
4
Experimental Results
Table 3 shows the results of document-centric and candidate-centric forms of QE, using both the Bo1 and KL term weighting models. For both Bo1 and KL, the default setting of extracting the top exp term = 10 most informative terms from the top exp doc = 3 ranked documents or candidates [1] is applied. Statistically significant improvements over the baselines are identified using the Wilcoxon signed rank test. At first inspection, it appears that query expansion can be successfully applied in an Expert Search task to increase retrieval performance.

Table 3. Results for query expansion using the Bo1 and KL term weighting models. Results are shown for the baseline runs, with document-centric query expansion (DocQE) and candidate-centric query expansion (CandQE). The best results for each measure, term weighting model and voting technique combination are emphasised. Statistically significant improvements (p ≤ 0.05) over the corresponding baseline are marked by *, while significant improvements (p ≤ 0.01) are denoted **.

                              TREC 2005            TREC 2006
                              MAP      P@10        MAP      P@10
Baselines  CombSUM            0.2037   0.3240      0.5188   0.6388
           expCombMNZ         0.2037   0.3100      0.5502   0.6837
Bo1
DocQE      CombSUM            0.1742   0.2860      0.5216   0.6510
           expCombMNZ         0.2185   0.3340*     0.5606   0.6959
CandQE     CombSUM            0.1473   0.2240      0.4203   0.5388
           expCombMNZ         0.1760   0.2500      0.4554   0.5939
KL
DocQE      CombSUM            0.1805   0.2880      0.5296   0.6490
           expCombMNZ         0.2231*  0.3400**    0.5689*  0.7020
CandQE     CombSUM            0.1627   0.2560      0.5195   0.6265
           expCombMNZ         0.2031   0.3100      0.5600   0.6592
Moreover, document-centric QE outperforms candidate-centric QE on both MAP and P@10, in all settings. It is possible that the default setting of exp doc and exp term used is not suitable for candidate-centric query expansion, because of the size of the candidate profiles being considered in the pseudo-relevant set. In particular, it can be seen that applying document-centric QE results in an increase over the baseline for the TREC 2006 topics, and when using expCombMNZ for the TREC 2005 topics; some of these improvements are statistically significant (p ≤ 0.05). Compared to the respective baselines, applying candidate-centric QE results in a degradation in performance for most settings using the TREC 2005 topics. Document-centric QE provides an increase in MAP and P@10 over the baselines, except when using CombSUM for the TREC 2005 topics. Overall, the KL term weighting model achieves larger improvements over the baselines in terms of MAP and P@10 than Bo1 does. This is interesting, as previous thorough experiments on various test collections show that Bo1 performs consistently better than KL on adhoc search tasks [1]. Note also that applying document-centric QE to expCombMNZ will always result in an increase in performance if it increased the performance of the CombSUM baseline. This can be explained by the fact that the refined document ranking generated by applying QE is identical in both cases. It appears then that expCombMNZ is better than CombSUM at converting the refined document ranking into a ranking of candidates, in line with the same results for unrefined document rankings. Moreover, this follows the persistent high performance of expCombMNZ observed by Macdonald & Ounis in [8]. QE using documents has been well tested in classical IR systems, so it is no surprise that it performs well here in increasing the quality of the document ranking. However, as discussed in Section 3, candidate profiles are many times larger than standard documents, so it is possible that the default setting of exp term = 10, exp doc = 3 is not as suitable for candidate-centric QE. In the next section, we assess the extent to which the setting of the QE parameters can affect the retrieval performance of either form of QE.
5
Effect of Query Expansion Parameters
In this section, we investigate the extent to which the parameters for QE have an effect on the retrieval performance of the expert search task. The parameters of query expansion are exp doc, the number of top-ranked documents or candidates to be considered as the pseudo-relevant set, and exp term, the number of informative terms to be added to the query. To fully investigate their effect, we perform a large-scale evaluation of many parameter combinations. We aim to conclude whether document-centric or candidate-centric QE is more stable with respect to various parameter settings, and to obtain a better comparison of the two forms of QE, as well as of the term weighting model employed. For these experiments, we use the expCombMNZ voting technique, using only the TREC 2006 topics, as this is the best performing setting (see Section 4).
Fig. 1. Surface plots of MAP for expCombMNZ, using the Bo1 term weighting model, when the exp doc documents or candidates and exp term terms query expansion parameters are varied. (a) Document-Centric Query Expansion (‘TREC 2006, expCombMNZ, Bo1 DocQE’); (b) Candidate-Centric Query Expansion (‘TREC 2006, expCombMNZ, Bo1 CandQE’). Axes: exp doc (documents/candidates), exp term (terms), MAP.
To assess the stability of the approaches with respect to exp term and exp doc, we vary them and record the MAP of the generated run. In particular, we vary 2 ≤ exp doc ≤ 20 and 1 ≤ exp term ≤ 30. This generates a matrix of 570 points per setting. Figures 1 & 2 present surface plots of the Bo1 and KL QE settings, using the expCombMNZ voting technique. In each figure, (a) uses document-centric query expansion, and (b) uses candidate-centric query expansion³. Figure 1 (a) shows that the number of documents used as the pseudo-relevant set in document-centric QE has some effect on the retrieval performance of the generated ranking of candidates. In particular, it appears that using the 3 top-ranked documents is not a good setting, as can be seen from the crevice running across the surface plot at exp doc = 3; exp doc = 2 and exp doc ≥ 4 are better settings. With respect to the terms considered in document-centric QE, using fewer than 10 terms means a drop-off in MAP, while for exp term ≥ 10 the retrieval performance is stable. Indeed, the best performance achieved in Figure 1 (a) is MAP 0.5799, for exp term = 16 and exp doc = 15, compared to 0.5606 from Table 3 with the default setting (exp term = 10, exp doc = 3). Figure 1 (b) shows that as more terms are considered in candidate-centric QE, the MAP degrades. In particular, expanding the query by only 1 term (exp term = 1) still does not achieve the baseline MAP of 0.5502 from Table 3. In this case, varying the number of candidate profiles considered by the QE mechanism has little effect for a low number of terms. As the number of terms increases to 30, considering fewer profiles is favoured. The best setting in this figure is exp term = 2 and exp doc = 6, which gives a markedly better MAP of 0.5306, compared to the default setting of 0.4554. For Figure 2 (a), using the KL term weighting model, the patterns are similar to those observed for Bo1 in Figure 1 (a). Again, the crevice at exp doc = 3 is apparent. In addition, there is also a slight crevice in MAP at exp term = 11 for exp doc > 10. For candidate-centric query expansion (Figure 2 (b)), as the number of terms considered increases, there is again a decrease in MAP, but not as noticeable as in Figure 1 (b). Moreover, MAP is not as stable as exp term increases.
³ Note that some figures have different orientation to allow easier viewing.
Fig. 2. Surface plots of MAP for expCombMNZ, using the KL term weighting model, when the exp doc documents or candidates and exp term terms query expansion parameters are varied. (a) Document-Centric Query Expansion (‘TREC 2006, expCombMNZ, KL DocQE’); (b) Candidate-Centric Query Expansion (‘TREC 2006, expCombMNZ, KL CandQE’). Axes: exp doc (documents/candidates), exp term (terms), MAP.
Best settings for Figure 2 are (a) exp term = 24, exp doc = 20 (MAP 0.5827), and (b) exp term = 6, exp doc = 3 (MAP 0.5627), compared to the default settings of 0.5689 and 0.5600 respectively. Overall, our large-scale experiments have allowed us to draw some conclusions concerning the applicability and stability of both forms of query expansion. Document-centric QE performs robustly, although exp doc and exp term should not be too small: in particular, a fairly flat MAP surface is exhibited for exp term ≥ 6 and exp doc ≥ 10. For candidate-centric query expansion, MAP is influenced more markedly as exp doc and exp term are varied. From our experiments, 3 ≤ exp doc ≤ 8 and exp term ≤ 5 exhibit the most stable MAP surfaces for this form of query expansion. In particular, the quality of terms decreases rapidly, which is possibly due to the large and varied size of candidate profiles. In summary, overall it appears that document-centric QE is the more stable and effective of the two approaches.
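The parameter study described above amounts to a simple grid sweep. A minimal sketch of such a sweep is shown below; run_expert_search and mean_average_precision are placeholder functions standing in for the retrieval system and the evaluation of a run, and are assumptions rather than actual components of this work.

def parameter_sweep(run_expert_search, mean_average_precision, topics, qrels):
    # Sweep 2 <= exp_doc <= 20 and 1 <= exp_term <= 30: 19 x 30 = 570 points,
    # recording MAP at each point, as in the surface plots of Figures 1 and 2.
    results = {}
    for exp_doc in range(2, 21):
        for exp_term in range(1, 31):
            run = run_expert_search(topics, exp_doc=exp_doc, exp_term=exp_term)
            results[(exp_doc, exp_term)] = mean_average_precision(run, qrels)
    best = max(results, key=results.get)
    return results, best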
6
Conclusions
We have investigated pseudo-relevance feedback QE in an Enterprise expert search setting. It was shown how query expansion can be applied in two different manners in the context of the voting approach for expert search, namely document-centric and candidate-centric QE. Experiments were carried out using two different voting techniques, and two different query expansion term weighting models. Topics from the TREC 2005 and 2006 Enterprise track Expert Search tasks were used. The results showed that firstly, QE can be successfully applied in expert search and secondly, using the default setting for query expansion, document-centric QE outperforms candidate-centric QE. By performing a large-scale evaluation of the effect of the QE parameter settings, we observed that document-centric QE is stable with exp term ≥ 6 and exp doc ≥ 10. In contrast, candidate-centric QE was observed to be stable with
respect to the number of candidate profiles considered (exp doc), but increasing the number of expansion terms caused degradations in retrieval performance. Overall, document-centric QE was more stable and consistently outperformed candidate-centric QE. The major difference when performing candidate-centric QE is that candidate profiles can be extremely large when compared to the documents considered in document-centric QE. We hypothesise that modern QE techniques struggle to identify informative terms when presented with such a large sample. In particular, the more terms identified by candidate-centric QE, the worse the retrieval performance. This also explains the better performance of the KL term weighting model for candidate-centric QE, as KL accounts for the size of the pseudo-relevant set when measuring the informativeness of terms. Another possible explanation for the less stable performance of candidate-centric QE is ‘topic drift’. A candidate profile contains many documents that represent the various interests of a candidate. When candidate-centric QE is performed, the expanded query terms may describe other common, non-relevant interests of the candidates in the pseudo-relevant set, causing more candidates with these incorrect interests to be retrieved erroneously. Topic drift is less likely to occur with document-centric QE, as documents are smaller and more likely to be about a single topic. In the future, we would like to develop advanced forms of QE suitable for use with candidates. This would combine the best properties of document-centric and candidate-centric QE by only considering the top-ranked documents from the top-ranked candidates' profiles as the pseudo-relevant set. An alternative possible approach for extracting informative terms from top-ranked candidate profiles might involve clustering the documents in each profile, to identify important interest areas of the candidates.
References

1. G. Amati. Probabilistic Models for Information Retrieval based on Divergence from Randomness. PhD thesis, Department of Computing Science, University of Glasgow, 2003.
2. G. Amati. Frequentist and Bayesian approach to information retrieval. In Advances in Information Retrieval, 28th European Conference on IR Research, ECIR 2006, pages 13–24. Springer, April 2006.
3. K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proceedings of the 29th ACM SIGIR 2006, pages 43–50, Seattle, WA, August 2006.
4. N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC-2005 Enterprise Track. In Proceedings of the 14th Text REtrieval Conference (TREC-2005), 2005.
5. M. Hertzum and A. M. Pejtersen. The information-seeking practices of engineers: searching for documents as well as for people. Inf. Process. Manage., 36(5):761–778, 2000.
6. D. Hiemstra. Using language models for information retrieval. PhD thesis, Centre for Telematics and Information Technology, University of Twente, 2001.
7. C. Macdonald, B. He, V. Plachouras, and I. Ounis. University of Glasgow at TREC 2005: Experiments in Terabyte and Enterprise tracks with Terrier. In Proceedings of the 14th Text REtrieval Conference (TREC 2005), 2005.
8. C. Macdonald and I. Ounis. Voting for candidates: Adapting data fusion techniques for an expert search task. In Proceedings of the 15th ACM CIKM 2006, Arlington, VA, November 2006.
9. I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A high performance and scalable information retrieval platform. In Proceedings of the OSIR Workshop 2006, pages 18–25, August 2006.
10. J. Rocchio. Relevance feedback in information retrieval. Prentice-Hall, Englewood Cliffs, NJ.
Using Topic Shifts for Focussed Access to XML Repositories Elham Ashoori and Mounia Lalmas Queen Mary, University of London London, E1 4NS, UK {elham,mounia}@dcs.qmul.ac.uk
Abstract. In focussed XML retrieval, a retrieval unit is an XML element that not only contains information relevant to a user query, but is also specific to the query. INEX defines a relevant element to be at the right level of granularity if it is exhaustive and specific to the user's request, i.e., it discusses fully the topic requested in the user's query and no other topics. The exhaustivity and specificity dimensions are both expressed in terms of the “quantity” of topics discussed within each element. We therefore propose to use the number of topic shifts in an XML element to express the “quantity” of topics discussed in an element, as a means to capture specificity. We experimented with a number of element-specific smoothing methods within the language modelling framework. These methods enable us to adjust the amount of smoothing required for each XML element depending on its number of topic shifts, to capture specificity. Using the number of topic shifts combined with element length improves retrieval effectiveness, thus indicating that the number of topic shifts is useful evidence in focussed XML retrieval.
1
Introduction
Content-oriented XML¹ retrieval systems aim at supporting more precise access to XML repositories by retrieving XML document components (the so-called XML elements) instead of whole documents in response to users' queries. Therefore, in principle, XML elements of any granularity (for example a paragraph or the section enclosing it) are potential answers to a query, as long as they are relevant. However, the child element (paragraph) may be more focussed on the topic than its parent element (the section), which may contain additional irrelevant content. In this case, the child element is a better element to retrieve than its parent element, because not only is it relevant to the query, but it is also specific to the query. Thus the aim of an XML retrieval system is to provide focussed access to XML repositories by returning the most appropriate units of retrieval for a given query. To identify what constitutes a most appropriate unit of retrieval, INEX, the INitiative for the Evaluation of XML retrieval (e.g., [6]), defined relevance in terms of
¹ XML stands for eXtensible Markup Language – see http://www.w3.org/
two dimensions, exhaustivity and specificity, each defined on a graded scale. These two dimensions are respectively defined as “how exhaustively an element discusses the topic of request” and “how focussed an element is on the topic of request (i.e., discusses no other irrelevant topics)” [5]. The combination of these two relevance dimensions is used to identify relevant elements that are both exhaustive and specific to the user's query. In IR, the main factors that affect the importance of a term in a text – in terms of how good it is at representing the content of the text – are the term frequency, the inverse document frequency, and the document length. To incorporate the “specificity” dimension in XML retrieval, we propose to exploit the “number of topic shifts” as another factor affecting the importance of a term in a text. We incorporate the “number of topic shifts” in the smoothing process within the language modelling framework. In the language modelling approach, smoothing refers to adjusting the maximum likelihood estimator for the element language model so as to correct inaccuracy arising from data sparseness. In the smoothing process, the probabilities of terms seen in an element are discounted, mainly by combining the element language model with the collection language model, thus assigning a non-zero probability to the unseen terms. We use the “Dirichlet smoothing” approach [14] as the smoothing framework. Dirichlet smoothing is one of the popular document-dependent smoothing methods. Using this approach in XML retrieval enables us to adjust the amount of smoothing dynamically according to the features of elements (e.g., length, topic shifts). In this work, we explore various ways to incorporate topic shifts in the smoothing process, either individually or combined with the length of XML elements. We investigate whether using topic shifts individually, or combined with length, provides a better smoothing approach for focussed access to XML documents. The paper is organised as follows. Section 2 discusses related work. In Section 3, we define the notion of topic shifts and how we calculate it. In Section 4, we describe the language modelling formalism used to perform focussed retrieval. Section 5 describes the methodology used to compare the proposed topic shifts-based smoothing process, including the INEX test collection used to carry out this investigation. The experiments and results are discussed in Section 6. Section 7 concludes the paper, with some thoughts for future work.
2
Related Work
The language modeling approach to IR is a sound and flexible framework, not only in content-oriented XML retrieval (e.g., [9,13,12]), but also for IR research in general [3]. The basic idea of this approach is to estimate a language model for each document, and rank documents with respect to the likelihood that the query can be generated from the estimated language models. The retrieval performance of language modelling approaches has been shown to be sensitive to the smoothing parameters, both in ad hoc IR [14] and in XML retrieval [9]. Although smoothing is essential in language modelling due to data sparseness, Zhai et al. [14] introduced another role for it. They showed that the effect of
smoothing is very sensitive to the type of queries (long, short), which results in a new role for smoothing, query modelling, to “explain the common and non-informative words in a query”. Following the query modelling role of smoothing, Hiemstra [8] introduced term-specific smoothing and used feedback documents for estimating term-specific smoothing parameters. In the context of ad hoc XML element retrieval, Kamps et al. [9] used a multinomial language model with Jelinek-Mercer smoothing, which is a linear interpolation of the element language model and the collection model. In this approach the smoothing parameter is fixed for all elements. They showed that the smoothing parameter indirectly introduces a length bias by increasing the importance of the presence of query terms in the retrieved elements. In this approach, the optimal amount of smoothing depends on the relevance assessments. If, during assessment, the assessors favor the longer elements in the collection, little smoothing is required. More precisely, it was shown that a high amount of smoothing leads to the retrieval of shorter elements. This work is different from ours, as we propose an element-dependent smoothing approach, where the amount of smoothing can be dynamically adjusted according to features of elements (in this paper, length and topic shifts). This allows us to investigate the effect of topic shifts (and element length) by making them “parameters” of the language modelling framework².
3
Topic Shifts
In this section, we describe how we measure the number of topic shifts of the elements forming an XML document. We use the number of topic shifts in an XML element to express the “quantity” of topics discussed in the element. For this purpose, both the logical structure and a semantic decomposition of the XML document are needed. Whereas the logical structure of XML documents is readily available through their XML markup, their semantic decomposition needs to be extracted. To achieve that, we apply a topic segmentation algorithm based on lexical cohesion, TextTiling³ [7], which has been successfully used in several IR applications (e.g., [2]). The underlying assumption of topic segmentation algorithms based on lexical cohesion is that a change in vocabulary signifies that a topic shift occurs. This results in topic shifts being detected by examining the lexical similarity of adjacent text segments. TextTiling is a linear segmentation algorithm that considers the discourse unit to correspond to a paragraph, and therefore subdivides the text into multi-paragraph segments. TextTiling is performed in three steps. In the first step, after performing tokenization, the text is divided into pseudo-sentences of size W, called token-sequences. Next, these token-sequences are grouped together into blocks of size K.
3
This work is also different from ours, as the former is concerned with the task of estimating the relevance of XML elements, and not the focussed access to XML elements (see Section 5). http://elib.cs.berkeley.edu/src/texttiles/
Using Topic Shifts for Focussed Access to XML Repositories
447
boundary for a semantic segment. To identify the actual boundaries, a depth score is computed for each potential boundary, by using the similarity scores assigned to the neighbouring gaps between blocks, and by applying a smoothing process. The algorithm determines the number of segments, referred to as tiles, by considering segment boundaries to correspond to gaps with depth scores above a certain threshold. The detected boundaries are then adjusted to correspond to the actual discourse unit breaks, i.e., the paragraph breaks. The semantic decomposition of an XML document is used as a basis to calculate the number of topic shifts in each XML element forming that document. We consider that a topic shift occurs (i) when one segment ends and another segment starts, or (ii) when the starting (ending) point of an XML element coincides with the starting (ending) point of a semantic segment. The number of topic shifts in an XML element e in document d is defined as: actual topic shif ts(e, d) + 1
(1)
where actual topic shif ts(e, d) are the actual occurrences of topic shifts in element e of document d. We are adding 1 to avoid zero values. With the above definition, the larger the number of topic shifts, the more topics are discussed in the element, which would indicate that the content of element is less focussed with respect to the overall topic discussed in the element. By considering the number of topic shifts occurring in an element instead of the number of topics discussed (in our case modelled as the number of tiles), we are able to distinguish the cases where the topic shift occurs not within the actual content of an element, but at its boundaries.
4
Element-Specific Smoothing Using Topic Shifts
Since XML elements of any granularity are potential answers to a query, we estimate a language model for each XML element in the collection. The element language model is smoothed using a Dirichlet prior [14] with the collection language model as the reference model. If we estimate a language model for each element, then the relevance of an element e to a given query q is computed as how likely the query can be generated from the language model for that element. We rank elements based on the likelihood for a query q = (t1 , t2 , ..., tn ) to be generated from an element e as: P (t1 , .., tn |e) =
=
n
((1 −
i=1
=
n i=1
n c(ti , e) + μP (ti |C) ) ( μ + |e| i=1
c(ti , e) μ μ ) + P (ti |C)) μ + |e| |e| μ + |e|
((1 − αe )Pml (ti |e) + αe P (ti |C))
(2)
(3)
(4)
448
E. Ashoori and M. Lalmas
where ti is a query term in q, c(ti , e) is the number of occurrences of the query term ti in element e, μ is a constant, |e| is the number of terms in element e, i ,e) is the probability of observing term ti in element e, estiPml (ti |e) = c(t|e| mated using the maximum likelihood estimation, (ti ) – P (ti |C) = ef ef (t) is the probability of observing query term ti in the colt lection where ef (t) is the number of XML elements in which the term t occurs. μ is an element-dependent constant which is related to how much – αe = μ+|e| probability mass will be allocated to unseen query terms, i.e., the amount of smoothing.
– – – – –
Since the maximum likelihood estimator will generally underestimate the probability of any term unseen in the element, the main purpose of smoothing is to improve the accuracy of the term probability estimation. If we are concerned with the exhaustivity dimension of relevance, then we may expect that most of the query terms to appear in an element for the element to be retrieved. In this case, one would expect that the term probability estimates are more reliable for long elements as they contain more terms compared to the short elements. Therefore, a shorter element needs to be more smoothed with the collection model compared to a longer element. This shows that a higher value of αe is needed to capture exhaustivity in small elements. Smoothing with Dirichlet prior (Equation 4) satisfies this requirement as the value of αe depends on the length of the elements. The above smoothing process is reasonable if we are not concerned with the specificity dimension. In INEX, specificity is automatically measured by calculating the ratio of the relevant content in an XML element (see Section 5). This implies that unseen terms are less of an issue for small elements compared to the above case. Therefore a smaller amount of smoothing (a lower value of αe ) is needed to capture specificity in small elements than the amount of smoothing required to capture exhaustivity. Due to this contradictory behaviour in the required amount of smoothing – if we want to capture both exhaustivity and specificity – Equation 4 in its current version cannot be used to capture both relevance dimensions if only length is taken into account. To accommodate for the specificity dimension, we propose to set αe , the amount of smoothing, to be proportional to the number of topic shifts in the element. The idea of incorporating topic shifts in this manner originates from the fact that if the number of topic shifts in an element is low and an element is relevant, then it is likely to contain less non-relevant information compared to the case where a high number of topic shifts exists. It might be argued that in general, when the length of an element increases, it is highly likely that it will discuss more topics. However, this is not always the case, as it was shown in [1], where the number of topic shifts of parent
elements was compared to that of their children. Even though the length increases from children to their parents, the number of topic shifts stays the same in the majority of cases, i.e. it does not vary when the length increases. As topic shifts and length are two distinct sources of evidence [1], we explore several ways to compute the element-dependent constant αe (the amount of smoothing) in Equation 4 as a function of element length, number of topic shifts, and a combination of both. We replace length by the number of topic shifts in the original formula to compare how useful these two retrieval settings are for capturing exhaustivity. Next, we replace length with the inverse of length and the inverse of topic shifts to capture specificity. Finally, we combine length and topic shifts in a retrieval setting to capture both exhaustivity and specificity. We thus experiment with five different retrieval approaches (a sketch of the corresponding αe computations is given after this list):

1. αe = μ/(μ + |e|) implies that longer elements need less smoothing. This approach is the original Bayesian smoothing with Dirichlet priors. We refer to this approach as our baseline approach (L).
2. αe = μ/(μ + 1/|e|) implies that shorter elements need less smoothing. This means that the presence of a query term in an element is rewarded if the number of terms in the element is small. We refer to this approach as (1/L).
3. αe = μ/(μ + |T|) implies that elements with a high number of topic shifts need less smoothing. We refer to this approach as (T).
4. αe = μ/(μ + 1/|T|) implies that elements with a lower number of topic shifts need less smoothing. This means that the presence of a query term in an element is rewarded if the number of topic shifts in the element is low. We refer to this approach as (1/T).
5. αe = μ/(μ + |e|/|T|) implies that elements should be smoothed based on the average number of terms per topic shift in the element. This is an approximation of the average number of terms per topic in an XML element. In this way, we return to the normal interpretation of smoothing in Equation 4, but consider a more refined version of length. In this case we differentiate between two elements with equal length and different numbers of topic shifts, so that the presence of a query term in the element with a lower number of topic shifts is rewarded. We refer to this approach as (L/T).
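The following helper is a hypothetical illustration of the five settings above; the function name, signature and error handling are ours, not the paper's:

```python
def alpha_e(mu, length, topic_shifts, approach="L"):
    """Amount of smoothing alpha_e for the five retrieval settings above.
    length is |e| (number of terms in the element); topic_shifts is |T|."""
    if approach == "L":            # baseline: Dirichlet prior on element length
        return mu / (mu + length)
    if approach == "1/L":          # shorter elements get less smoothing
        return mu / (mu + 1.0 / length)
    if approach == "T":            # many topic shifts -> less smoothing
        return mu / (mu + topic_shifts)
    if approach == "1/T":          # few topic shifts -> less smoothing
        return mu / (mu + 1.0 / topic_shifts)
    if approach == "L/T":          # average number of terms per topic shift
        return mu / (mu + length / topic_shifts)
    raise ValueError("unknown approach: %s" % approach)
```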
5 Methodology
In our work, we use the INEX-2005 test collection. The INEX collection, Version 1.8, contains 16,819 scientific articles from 24 IEEE Computer Society journals, marked up in XML, consisting of over 10 million elements of varying length. We use the title field of the 29 content-only (CO) topics of Version 2005-003 of the INEX 2005 data set (in INEX 2005, these topics are referred to as CO+S). CO topics are requests that ignore the document structure and contain only content-related conditions, and at this stage of our work, are sufficient for our investigation. We evaluate our approaches against
the relevance assessments Version adhoc2005-assessments-v7. For INEX 2005, exhaustivity is measured on a 3+1-point scale: highly exhaustive (e=2), somewhat exhaustive (e=1), not exhaustive (e=0) and too small (e=?). In this work, we ignore too small elements. The specificity dimension is automatically measured on a continuous scale [0,1], by calculating the ratio of the relevant content of an XML element after the assessors highlighted text fragments containing only relevant information. The official evaluation metrics employed in INEX 2005 are the eXtended Cumulated Gain (XCG) metrics [10], which include the user-oriented measure of extended cumulated gain (nxCG[i]) and the system-oriented effort-precision/gain-recall measure (MAep). For a given rank i, nxCG[i] reflects the relative gain the user accumulated up to that rank, compared to the gain he/she could have attained if the system had produced the optimum best ranking. The effort-precision ep at a given gain-recall value gr is defined as the number of visited ranks required to reach a given level of gain relative to the total gain that can be obtained. The non-interpolated mean average effort-precision, MAep, is calculated by averaging the effort-precision values measured at natural recall points, i.e., whenever a relevant XML element is found in the ranking. INEX employs quantization functions to combine the two graded relevance dimensions, by providing a relative ordering of the various combinations of e-s values and a mapping of these to a single relevance scale in [0, 1], as required by the XCG metrics. In INEX 2005, two quantization functions were used:

fstrict(e, s) := 1 if e = 2 and s = 1; 0 otherwise.                     (5)

fgeneralised(e, s) := e · s if e ∈ {1, 2}; 0 otherwise.                 (6)
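Read directly from Equations (5) and (6), the two quantization functions can be written as the following small functions, where e is the exhaustivity grade and s the specificity ratio in [0, 1] (the function names are ours):

```python
def f_strict(e, s):
    """Strict quantization (Eq. 5): credit only highly exhaustive, fully specific elements."""
    return 1.0 if e == 2 and s == 1.0 else 0.0

def f_generalised(e, s):
    """Generalised quantization (Eq. 6): credit elements by their degree of relevance."""
    return e * s if e in (1, 2) else 0.0
```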
The strict quantization function fstrict is used to evaluate retrieval methods with respect to their capability of retrieving highly exhaustive and highly specific elements (e=2, s=1). The generalised quantization function fgeneralised credits elements according to their degree of relevance, hence allowing the modelling of varying levels of user satisfaction gained from not fully specific and highly exhaustive elements, i.e., less relevant elements. The retrieval task addressed in this work is focussed XML retrieval. INEX has defined various XML retrieval scenarios, each corresponding to a specific task. In the focussed task, the aim is for XML retrieval systems to return to the user a non-overlapping ranked list of the most exhaustive and specific elements on a relevant path (a relevant path is a path within the XML tree of a given XML document, whose root node is the root element and whose leaf node is a relevant element that has no or only irrelevant descendants). The five approaches described in the previous section will rank elements, for example, considering the number of topic shifts. They will, however, not produce an overlap-free ranking. There are sophisticated ways to remove overlapping elements (e.g., [11]). In this work we restrict ourselves to a
post-filtering on the retrieved ranked list by selecting the highest scored element from each of the paths, as our main interest here is to investigate how the proposed smoothing approaches can help retrieval effectiveness. To calculate the number of topic shifts in each XML element, our first step is to decompose the INEX XML documents into semantic segments through the application of TextTiling. We consider the discourse units in TextTiling to correspond to paragraph XML elements. We consider paragraph elements to be the lowest possible level of granularity of a retrieval unit. Although this can be viewed as collection-dependent and might change from one collection to the next, it is likely that for many XML content-oriented collections, meaningful content will occur mainly at paragraph level and above. We set the TextTiling parameters to W = 10 and K = 6, based on the heuristic of setting W · K equal to the average paragraph length (in terms of the number of terms) [7]. After the application of TextTiling, we compute the number of topic shifts in elements.
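As an illustration of this step, NLTK ships a TextTiling implementation that exposes the W and K parameters; the paper does not name an implementation, and counting segments per element as below is a simplification of the procedure in [1]:

```python
from nltk.tokenize import TextTilingTokenizer

def count_topic_segments(element_text, w=10, k=6):
    """Segment an element's text with TextTiling (W=10, K=6 as above) and return
    the number of resulting segments; the number of topic shifts can then be
    taken as (segments - 1). TextTiling expects blank-line paragraph breaks and
    may fail on very short texts, in which case we fall back to one segment."""
    tokenizer = TextTilingTokenizer(w=w, k=k)
    try:
        segments = tokenizer.tokenize(element_text)
    except ValueError:
        return 1
    return len(segments)
```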
6 Experiments and Results
In this section, we report on the experiments, and their results, that were carried out in order to investigate the effects of topic shifts in the smoothing process. We experiment with a wide range of μ values in [0, 20000] to study the behaviour of each individual retrieval approach. To compare the five smoothing approaches (L, 1/L, T, 1/T, L/T), we select the best run (in terms of MAep) for each approach and then compare the behaviour of these best runs based on nxCG. For each of the approaches, the top 1500 ranked elements are returned as answers for each of the CO topics. For the user-oriented evaluation, we report nxCG at three different early cut-off points (10, 25, 50). In addition, the nxCG graphs for both the full rank and the early rank levels are given. For the system-oriented evaluation, MAep is reported. For both evaluations, both the strict and generalised quantization functions are used. To determine whether the differences in performance between two approaches are statistically significant, we use bootstrapping significance testing [4]. Improvements at confidence levels 95% and 99% over the baseline are respectively marked with + and ++. Similarly, decreases in performance at confidence levels of 95% and 99% are marked with − and −−. Table 1 shows a summary of the results. This table presents, for each quantization function, the results for both the user- and the system-oriented evaluation of the five retrieval approaches, with the L approach acting as the baseline. We first consider the results in terms of mean average effort precision (MAep), shown in Figure 1 and the last column of Table 1. Under the generalised quantization function, MAep ranks the L/T approach, followed by L, above the other approaches, reflecting that on average the user needs to spend less effort when scanning the output of L/T to achieve the same level of gain. However, the difference is not significant. To obtain a better understanding, we look at the performance at different values of the parameter μ.
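A generic paired bootstrap test over per-topic scores, in the spirit of Efron and Tibshirani [4], could look as follows; this is a sketch under our own assumptions, not the exact procedure used in the paper:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10000, seed=0):
    """One-sided paired bootstrap test on per-topic effectiveness scores.
    Returns an estimate of the probability that system A's mean improvement
    over system B is not positive; small values indicate significance."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]   # resample topics with replacement
        if sum(sample) / len(sample) <= 0.0:
            not_better += 1
    return not_better / n_resamples
```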
Table 1. Focussed retrieval task: MAep and nxCG at different cut-off points, considering L as the baseline

Generalised quantization:
Approach  μ       nxCG@10         nxCG@25         nxCG@50         MAep
L         256     0.2634          0.2387          0.2258          0.0938
1/L       0.01    0.2333 (-11%)   0.2358 (-1.2%)  0.2198 (-2.5%)  0.0911 (-2.9%)
T         7       0.2638 (0.1%)   0.2391 (1.6%)   0.2187 (-3.1%)  0.0899 (-4.2%)
1/T       0.15    0.2388 (-9.3%)  0.2297 (-3.7%)  0.2138 (-5.3%)  0.0894 (-4.7%)
L/T       448     0.2603 (-1.1%)  0.2506 (4.9%)   0.2424 (7.4%)   0.0992 (5.8%)

Strict quantization:
Approach  μ       nxCG@10         nxCG@25         nxCG@50             MAep
L         256     0.069           0.1295          0.1351              0.029
1/L       0.01    0.0863 (25%)    0.15 (15.8%)    0.1513 (12%)        0.0305 (5.2%)
T         8       0.069 (0%)      0.1333 (2.9%)   0.1411 (4.4%)       0.0251 (-13.4%)
1/T       0.7     0.0863 (25%)    0.1515 (17%)    0.1688 (25%)        0.0304 (4.8%)
L/T       1280    0.0687 (-0.4%)  0.1446 (11.7%)  0.1843 (36.4%)(++)  0.0308 (6.2%)
Figure 1(b) shows the mean average effort precision for μ between [0, 20000]. We observe that the L/T approach shows better performance than L regardless of the value of μ, which indicates that elements with equal length and a smaller number of topic shifts require less smoothing. This is due to the fact that in the L/T approach, the presence of a query term in an element with a lower number of topic shifts is rewarded.

Under the strict quantization function, results show that L/T, 1/L and 1/T are the most effective approaches, with almost the same MAep, whereas T did not perform particularly well. The 1/T approach applies less smoothing to elements with fewer topic shifts, represented by the lower value of αe. The 1/L approach goes in the opposite direction of the standard Dirichlet smoothing and applies more smoothing to larger elements. These results support the argument in Section 4, in which we suggested that Dirichlet smoothing in its standard formulation is not sufficient to satisfy the "specificity" dimension of relevance. These results show that for retrieving highly specific and highly exhaustive elements, in the strict case, less smoothing is required for elements that are either small or contain fewer topic shifts than for those that are longer or contain a higher number of topic shifts. Similar to the behaviour observed for the generalised quantization function, L/T shows better performance than L for most values of μ. Overall, the L/T approach, where the number of topic shifts combined with length affects the amount of smoothing, performs better than the other retrieval approaches when evaluated using the system-oriented measures.

Next, we focus on the user-oriented evaluation and discuss the results obtained using the nxCG measure, shown in Figure 2 and Table 1. Under the generalised quantization function, the baseline approach, L, shows better performance at the very early ranks (1% of ranks) (see Figure 2(d)). For the other rank positions, the combination of length and topic shifts (L/T) is the most effective approach (see Figure 2(c)). These approaches are useful to satisfy a user who also gains from less relevant elements, i.e., not fully specific and highly exhaustive elements. However, for retrieving highly specific and highly exhaustive elements, the strict case, both the 1/L and 1/T approaches improve the results at the early cut-
Fig. 1. Mean Average effort precision (MAep) for different element-specific smoothing approaches, plotted against μ: (a) strict, (b) general
off points 10 and 25, where we obtain improvements of (+25%, 17.8%) and (25%, 15%), respectively (see Table 1). Similar to the system-oriented evaluation, these results again support our argument in Section 4, i.e. the smoothing process should be treated differently depending on the relevance dimensions. To conclude, the use of the number of topic shifts led to improved performance, particularly when combined with element length, thus indicating that the number of topic shifts is useful evidence in focussed XML retrieval.
Fig. 2. Evaluation based on normalised eXtended Cumulated Gain (nxCG), plotted against rank as a percentage: (a) strict, (b) strict, early ranks, (c) general, (d) general, early ranks
7 Conclusion
INEX defines a relevant element to be at the right level of granularity if it is exhaustive to the user request – i.e., it fully discusses the topic requested in the user's query – and it is specific to that user request – i.e., it does not discuss other topics. The exhaustivity and specificity dimensions are both expressed in terms of the "quantity" of topics discussed within each element. We therefore use the number of topic shifts in an XML element to express the "quantity" of topics discussed in an element. We experimented with a number of element-specific smoothing methods within the language modelling framework. These methods enable us to adjust the amount of smoothing required for each XML element with respect to the specificity and exhaustivity dimensions of relevance. Using the number of topic shifts combined with element length improved retrieval effectiveness, thus indicating that the number of topic shifts is useful evidence in focussed XML retrieval. Our other finding was that the smoothing process should be treated differently if we are concerned with the specificity dimension. For our future work, our first aim is to go beyond the element level for smoothing and provide term-specific smoothing based on the number of topic shifts and the distribution of the terms inside XML elements. Secondly, we will
investigate the effects of applying the proposed smoothing approach on the Wikipedia XML collection, which is the collection used in INEX 2006 (http://inex.is.informatik.uni-duisburg.de/2006/index.html). In another direction, we will investigate whether other segmentation algorithms are better suited for XML documents, and whether we can eventually obtain other, more effective means to calculate the number of topic shifts of XML elements.
Acknowledgments. This work was carried out in the context of INEX, an activity of the DELOS Network of Excellence.
References
1. E. Ashoori, M. Lalmas and T. Tsikrika. Examining Topic Shifts in Content-Oriented XML Retrieval, submitted, 2006.
2. C. Caracciolo and M. de Rijke. Generating and retrieving text segments for focused access to scientific documents. In Proceedings of ECIR 2006, pages 350–361, 2006.
3. W. B. Croft and J. Lafferty. Language Modeling for Information Retrieval. Kluwer Academic Publishers, 2003.
4. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
5. N. Fuhr and M. Lalmas. Report on the INEX 2003 Workshop, Schloss Dagstuhl, 15-17 December 2003. SIGIR Forum, 38(1):42–47, June 2004.
6. N. Fuhr, M. Lalmas, S. Malik, and G. Kazai, editors. Advances in XML Information Retrieval and Evaluation: Fourth Workshop of the INitiative for the Evaluation of XML Retrieval (INEX 2005), 2006.
7. M. A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 9–16, 1994.
8. D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In Proceedings of the 25th ACM SIGIR Conference, pages 35–41, 2002.
9. J. Kamps, M. de Rijke, and B. Sigurbjörnsson. The importance of length normalization for XML retrieval. Information Retrieval, 8(4):631–654, 2005.
10. G. Kazai and M. Lalmas. INEX 2005 Evaluation Metrics. In Fuhr et al. [6].
11. V. Mihajlovic, G. Ramírez, T. Westerveld, D. Hiemstra, H. E. Blok, and A. P. de Vries. TIJAH Scratches INEX 2005: Vague Element Selection, Image Search, Overlap, and Relevance Feedback. In Fuhr et al. [6].
12. P. Ogilvie and J. Callan. Hierarchical language models for XML component retrieval. In Proceedings of the INEX 2004 Workshop, pages 224–237, 2005.
13. G. Ramírez, T. Westerveld, and A. P. de Vries. Using structural relationships for focused XML retrieval. In Proceedings of FQAS 2006, 2006.
14. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th ACM SIGIR Conference, pages 334–342, 2001.
Feature- and Query-Based Table of Contents Generation for XML Documents Zolt´an Szl´ avik, Anastasios Tombros, and Mounia Lalmas Department of Computer Science, Queen Mary University of London
Abstract. The availability of a document's logical structure in XML retrieval allows retrieval systems to return document portions (elements) instead of whole documents. This helps searchers focus their attention on the relevant content within a document. However, other elements, e.g. siblings or parents of retrieved elements, may also be important as they provide context to the retrieved elements. The use of a table of contents (TOC) offers an overview of a document and shows the most important elements and their relations to each other. In this paper, we investigate what searchers think is important in automatic TOC generation. We ask searchers to indicate their preferences for element features (depth, length, relevance) in order to generate TOCs that help them complete information seeking tasks. We investigate what these preferences are, and what are the characteristics of the TOCs generated by searchers' settings. The results have implications for the design of intelligent TOC generation approaches for XML retrieval.
1 Introduction
As the eXtensible Markup Language (XML) is becoming increasingly used, retrieval engines that allow search within collections of XML documents are being developed. In addition to textual information, XML documents provide a markup that allows the representation of the logical structure of XML documents in content-oriented retrieval. The logical units, called elements, are encoded in a tree-like structure by XML tags. The logical structure allows retrieval systems to return document portions that may provide better searcher satisfaction, as the retrieved information will be more focussed. XML retrieval has received large interest over the last few years, mainly through the INEX initiative [5,4]. The interactive aspect of XML IR has recently been investigated through the interactive track at INEX (iTrack) [16,10]. One of the findings of the iTrack was the importance of the hierarchically structured presentation of documents and elements in the result list, and the table of contents of the documents [9,8]. The table of contents (TOC), in particular, provided context to the retrieved elements. The use of TOCs offered an overview of the document structure and showed the relevant elements and their relations to each other. This was appreciated by many searchers [11].
In the systems used in the iTrack, TOCs had two main limitations: i) they were static, i.e. the same TOCs for a given document were displayed for all searcher queries, and ii) they were manually defined, i.e. before the documents were used in the systems, they had to be analysed and several (types of) elements selected to be included in TOCs, and this selection was not automatic. These limitations were verified in [13,14], where it was also investigated what a useful TOC should be like according to searchers of an interactive IR system. It was found that a TOC should reflect which elements are possibly relevant to the searchers’ queries, i.e. TOC items of relevant elements are more important to be shown in the TOC than those of non relevant ones. It was also found that the display of a TOC should depend on the length of the elements, e.g. longer sections are more important to include in the TOC. Another finding was that it is important to consider how deeply an element is in the document structure, i.e. for most of the XML documents currently being used in XML retrieval, a one or two-level deep TOC is probably too shallow, while four or five level deep TOCs may sometimes be too deep to help searchers with their information seeking task. Based on the above, in this work we follow an approach to automatically generate “dynamic” TOCs by considering the characteristics of XML documents and the searcher’s query. The TOC generation algorithm is inspired by early text summarisation (sentence extraction) systems (e.g. [3,17]). We consider features of elements such as their length, depth and relevance and combine these in order to determine whether an element should have a reference in the TOC. Our approach allows us to create a TOC for any XML document, and the TOC will be biased towards the searcher’s query. In this paper, we investigate how this approach can be used in a user-oriented system, and what features are more important in TOC generation than others according to searchers. We also examine what the size of a TOC should be so that searchers can find the relevant information effectively. To answer these questions, we created a system. Searchers were asked to consider information seeking tasks and to find relevant information within XML documents. They were allowed to adjust the importance of element features (i.e. length, depth, relevance). By adjusting these, searchers were able to alter the characteristics of the current TOC and the aim was to generate an appropriate TOC for documents in the context of the current query. We recorded these searcher preferences along with questionnaires and analysed them. The system and the methodology followed are described in Section 2, followed by the detailed description of the TOC generation algorithm (Section 3). Section 4 presents and analyses the results, and we close with discussion and conclusions in Section 5.
2 Experimental Setup
We followed a methodology that is based on simulated work task situations [1]. Searchers were given work task descriptions so they could search for relevant information. During their search, they were asked to identify the best TOC of
the current document with respect to the current work task by adjusting the importance of element features. Searchers were asked to read the work task descriptions, proceed to the document view of as many documents as they wished, and adjust their preferences for three element features (length, relevance, depth) and a threshold by moving sliders on the interface. By adjusting the sliders, searchers were able to alter the characteristics of the current TOC. When they felt that the displayed TOC was helpful enough to assist them in finding relevant information, they could move on to the next document or topic. Participants were asked to fill in questionnaires before and after the experiment; they were given a detailed introduction to their task, and no time restrictions were imposed on finishing the experiment. After filling in the entry questionnaire and having read the introduction, searchers were presented with the first topic description and links to the corresponding relevant documents (Figure 1). The order of the displayed topics was randomised to avoid any effect caused by one particular order.
Fig. 1. A topic description and links to its documents
After choosing a document, the document view was shown (Figure 2). This consisted of four main parts:
– on the left, sliders associated with element features were shown. These needed to be adjusted by searchers to generate TOCs;
– on the bottom left, the generated TOC was shown;
– on the right hand side, the contents of the document or element was presented. These changed when searchers clicked on an item in the TOC;
– on the top left corner, links to the topic description, next topic and final page were displayed.
By clicking on the 'finish' link, the exit questionnaire was shown, where we recorded information about the searchers' perception of the system and TOC generation, e.g. the strategies searchers used when adjusting the sliders on the main screen. Documents from two XML document collections – IEEE and Wikipedia [2] – were used (these are the test collections used in INEX). Ten topics that were available for these collections were selected at random and converted into work task descriptions. This gave us five work
Fig. 2. Screen shot of the main screen with sliders, TOC and element display
tasks for each collection. For each topic, three to five relevant documents were selected. Relevant documents were obtained by formulating queries from the topic descriptions, running these queries in the TopX system (the system used in INEX for the collection exploration phase) [15], and selecting the most relevant documents from the result list. The retrieval status values of elements were saved to be used in the TOC generation (Section 3). Documents of various sizes were selected: the shortest document contained 334 bytes of text, while the longest contained 49KB. We made an effort to select documents from both collections for a particular topic; there were only two topics where relevant documents were found in both collections. As a result of the topic and document selection, 33 documents and 10 topics were selected from the two collections. These provide an appropriate level of diversity, thus ensuring that our results are not biased by topic and document selection.
3 TOC Generation
In this section, the algorithm to generate TOCs is described. The algorithm aims to identify, among a set of XML elements, those that will form the TOC. It makes use of an element score that is calculated for every element in consideration. If the score of an element is higher than a certain threshold value (described below), the element is considered as a TOC element. Ancestors of such elements, i.e.
elements higher in the XML hierarchy, are also used to place the TOC elements into context. For example, a section reference in a TOC without the chapter it is in would be just 'floating' in the TOC. These selected elements' titles are displayed as TOC items. If no title is available, the first 25 characters of the text are shown. The ancestor-descendant relation of elements is reflected, as in a standard TOC, by indentation (Figure 2). The score of an element is computed using three features of the element: its depth, length and relevance. The first two are element-based features, whereas the third is query-based. These features have been shown to be important characteristics in various XML retrieval tasks [4], although other features can also be taken into account.

Depth score. Each element receives a depth score between zero and one, based on where it is in the structure of the document. In our case, an article element is always at depth level one (i.e. it is the root element in the tree structure). Descendants of a depth level one element are at depth level two (e.g. /article[1]/section[4]), etc. [13] found that elements at depth level three of a TOC were the most important to access the relevant content, whereas the adjacent levels (two and four) were found less important, and so on. Sigurbjörnsson ([12], Ch. 8) also found, using the IEEE collection, that searchers mostly visited level 2-3 elements while looking for relevant information. Hammer-Aebi et al. [6] confirmed that searchers found the highest number of relevant elements at levels two to four. Since the latter work used a different XML collection (Lonely Planet [18]), the importance of these levels seems to be general for XML collections. To reflect these findings, the following scoring function was used to calculate an element's depth score (Equation 1):

Sdepth(e) = 1 if depth(e) = 3; 0.66 if depth(e) ∈ {2, 4}; 0.33 if depth(e) ∈ {1, 5}; 0 otherwise        (1)

where Sdepth(e) denotes the depth score of element e.

Length score. Each element receives a length score, which is also normalised to one. The normalisation is done on a logarithmic scale [7], where the longest element of the document, i.e. the root element, receives the maximum score of one (Equation 2):

Slength(e) = log(TextLength(e)) / log(TextLength(root))        (2)

where Slength(e) is the length score of element e, root is the root element of the document structure and TextLength denotes the number of characters of the element.

Relevance score. A score between zero and one is used to reflect how relevant an element is to the current search topic. The scores were those given by the
search engine used in INEX for the collection exploration [15] (i.e. a normalised retrieval status value). The RSVs are obtained in the document selection phase (described in Section 2).

Feature weighting. The scores of the above three features are combined so that we can emphasize the importance of one feature over another. This is done by using a weighted linear combination of the feature scores (Equation 3). Searchers are allowed to set the weights themselves. This allows us to investigate what searchers find important in TOC generation, and also, to determine what weights should be used to generate TOCs based on such features.

S(e) = Σ_{f∈F} W(f) · Sf(e)        (3)
where S(e) denotes the overall score of element e, F is the set of the three features, W (f ) is the weight of feature f and Sf (e) denotes the score that is given to element e based on its feature f . Threshold. To determine the lowest score an element must achieve in order to be included in the TOC, we use a threshold value. As well as the feature weights described above, this value is set by the searchers of the system. This allows us to determine what the desirable size of a TOC should be. In our algorithm, if the threshold is set to 100% only elements with the maximum depth, relevance and length scores will be included in the TOC (i.e. Sf (e) = 1). If the threshold is set to zero, every element with greater than zero score will be in the TOC. We use a default value of 50%.
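Putting the three feature scores, the weighted combination of Equation 3 and the threshold together, a TOC selection step could be sketched as follows; the element representation, helper names and the treatment of the threshold as an absolute cut-off are our assumptions, not the paper's:

```python
import math

def depth_score(depth):
    """S_depth (Eq. 1): depth level 3 scores highest, adjacent levels less."""
    return {3: 1.0, 2: 0.66, 4: 0.66, 1: 0.33, 5: 0.33}.get(depth, 0.0)

def length_score(text_length, root_text_length):
    """S_length (Eq. 2): logarithmic normalisation against the root element."""
    if text_length <= 1 or root_text_length <= 1:
        return 0.0
    return math.log(text_length) / math.log(root_text_length)

def select_toc_elements(elements, weights, threshold):
    """Score every candidate element with Eq. 3 and keep those above the threshold.
    Each element is assumed to be a dict with 'id', 'depth', 'text_length',
    'root_text_length' and a normalised 'relevance' (RSV in [0, 1]); weights maps
    'depth', 'length' and 'relevance' to W(f). Ancestors of the selected elements
    would still have to be added afterwards to give them context (not shown)."""
    selected = []
    for e in elements:
        feature_scores = {
            "depth": depth_score(e["depth"]),
            "length": length_score(e["text_length"], e["root_text_length"]),
            "relevance": e["relevance"],
        }
        overall = sum(weights[f] * feature_scores[f] for f in feature_scores)  # Eq. 3
        if overall >= threshold:
            selected.append(e["id"])
    return selected
```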
4 Results and Analysis
In this section we describe and analyse the results of our experiment. We start with results regarding participation (Section 4.1), followed by a detailed analysis of the collected data. We investigate what slider values were set and what were the main characteristics of the generated TOCs when searchers were finished with a document (Section 4.2), whether there were differences in terms of preferences among searchers (Section 4.3), and whether differences could be found among documents of the two collections and among the topics used (Section 4.4).

4.1 Participation and Questionnaires
50 searchers, mainly with a computer science background, responded to our call for participation. To record information only from searchers who spent a significant amount of time participating in the experiment (thus providing usable data), we filtered the log data so that only sessions involving at least three different documents were kept and analysed. As a result, 31 searcher sessions were analysed, in which participants used an average of 7.74 (out of the maximum ten) topics and 15.58 (out of the maximum 33) documents. This gave us 483
different settings. The slider values after a searcher finished with a document were considered as a single setting. In the post-experiment questionnaires, searchers indicated how easy they had found it to learn and understand the usage of the system. On a seven-point scale, they indicated averages of 4.85 and 4.64 respectively, where 1 meant 'not easy at all' and 7 meant 'extremely easy'. This shows that the use of the system was not an issue in our study. On the same scale, searchers indicated an average value of 3.71 for the question 'How easy was it to set up the sliders to get a good table of contents?', which shows that getting the desired TOC took some effort. In their comments, searchers indicated various understandings of the sliders, but most of them had a strategy for setting them up and, according to the logs, all of the searchers used the sliders extensively.

4.2 Sliders and TOC Characteristics
On average, the depth, length and threshold slider values were around the default value (50), while the relevance slider was in a higher average position (75.79). The standard deviation of the depth, relevance and length sliders was the same (25), and the threshold slider had a lower (15) standard deviation. This indicates that searchers found the relevance feature to select TOC elements more important than the other two. Indeed, the importance of relevance to access the relevant content is the most intuitive of the three features, but results indicate that other features also needed to be considered if one wanted to place the relevant elements into context, i.e. to show related contents. The average threshold slider value of 50% with standard deviation of 15 shows that the agreement among searchers regarding the threshold was higher than those of the three features. In addition, we did not find a tendency of different slider values for documents of various sizes, thus the average values seem to be appropriate for XML documents of any size in our setting. An average TOC consisted of 19 items which is, on average, 8.16% of all the elements in the original XML document. We examined how the size of the TOCs is related to the size of the documents. Since there was high correlation between the length of the documents and the number of elements they contained, we use these two measures as ‘size’ interchangeably. We found that the more elements a document contained, the smaller the proportion of the number of TOC elements to the number of document elements was (Figure 3). This means that a long document does not necessarily need a long TOC. This is because a too long TOC does not help searchers gain an overview of the contents of the document because they need to gain an overview of the contents of the TOC first. This clearly indicates that a TOC generation algorithm has to perform particularly well for longer documents, as the TOC algorithm is much more selective in element items in longer documents. To also examine the distribution of the length of elements in the TOCs and in the documents, we created five size categories based on the length of text in elements (Figure 4.a). We found that the distribution of element lengths of documents (light gray in Figure 4.a) follows a bell curve where most of the elements
Fig. 3. Number of elements in the documents and the TOC elements’ ratio to it
cover 10-100 characters of text, i.e. there are many elements with the size of a short sentence. Elements longer than 10KB are very rare in our documents, such elements can be e.g., the root (i.e. article) element of a long document. The distribution of the length of elements included in the TOCs (dark gray in Figure 4.a) is also a bell curve having its peak at the category of 100-1000 character long elements. Slightly shorter and longer elements are less frequent in the TOCs while very long and very short elements do not occur frequently in the TOCs. This shows that the length distribution of the elements in the documents and TOCs is of the same nature, only the parameters are different. The implication of this is that a TOC has to be constructed to reflect the original structure of the document. We also examined the distribution of depth levels in the TOCs and document elements. (Figure 4.b). We considered eight depth levels because deeper than the eighth level there were very few, if any, elements in a document, and none of the displayed TOCs were deeper than seven levels. The distribution of element depths of documents follows, similar to the length distribution, a bell curve: most of the elements are between the fourth and sixth level, there is only one
Fig. 4. TOC items and document elements at various size categories (a) and depth levels (b)
element at level one (which is the root element of a document's tree structure) and very few elements are below the seventh level. The depth distribution of the TOC elements is still a bell curve, with the peak at levels three and four, which is consistent with a finding in [14]. In other words, the depth distributions found in the TOCs and the documents are of the same nature; only the parameters are different. Considering both the length and depth distributions, the TOCs reflect the main characteristics of the documents. The TOCs can therefore be viewed as extracts of the document structure, so not only is the algorithm based on summarisation, but its output is also a summary (i.e. that of the document structure).

4.3 Searchers
Searchers used different slider setting strategies to generate the best TOC. Nonetheless, the majority of them did set the relevance slider high. This high view of relevance was confirmed in the post experiment questionnaires. The questionnaires show that the relevance slider was usually set first, which also shows its importance. Some of the searchers set the length, depth and relevance sliders first and changed the threshold slider slightly document after document. In this process, setting the most appropriate threshold was found difficult for most searchers, especially because they had an ‘ideal’ TOC in their minds that was to be reached by adjusting the sliders. According to the questionnaires, an ideal TOC contained those, and only those, elements that had been found useful when searchers had been experimenting with the settings for the TOC of the current document. Although not all of them followed these strategies, the TOCs generated did not differ very much for different searchers: apart from a few searchers, the TOCs were not longer than twenty items, and searchers also seemed to agree in terms of length categories and depth levels (Figure 5). Based on the above, it seems that the size of a TOC does not significantly depend on individuals. Indeed, a TOC should rarely contain more element references than
Fig. 5. Number of TOC items at (a) various length categories and (b) depth levels for searchers
some fixed value, in our case twenty. Our study shows that it is important to select the best (maximum twenty) elements appropriately.

4.4 Collections and Topics
We investigated the possible differences between TOCs generated for documents of the two collections. There were slightly more documents used from the Wikipedia collection (20) than from the IEEE (13), and therefore 158 slider values were examined for the IEEE collection and 325 for Wikipedia documents. We did not find differences in the average slider values with respect to the two collections; settings for one collection also seemed satisfactory for the other. The generated TOCs’ characteristics were not significantly different either. Although the structure of the two collections’ documents are similar to each other, we do not expect that these results will be different for other XML documents. This is because documents of XML collections must have exactly one root element, which has several child elements etc., so we expect that the algorithm described in this paper can be used for any XML documents. We also investigated whether there were differences in settings and TOCs among the ten topics. The distribution of the slider values did not reveal great differences: the relevance values were always higher than that of other sliders and the depth-length-threshold triplet’s order was slightly different for some topics but these were always closely around the default value of 50. This shows that for the ten topics we used, searchers did not need to use very different settings to obtain an acceptable TOC. However, the number of topics does not guarantee that we covered a wide enough range of topics and task types. To compare the results for different task types (e.g. finding background information vs. answering a question), more topics are needed from different task types. The number of TOC elements were between 14 and 27 for 8 of the 10 topics, which is around 9% of the number of elements in the documents. There was one topic (w2) for which the average number of TOC elements was as low as 7, and for another topic (w5), this number was 32. These two extreme values are closely related to the number of elements the topics’ documents had, i.e. documents for (w2) were very short while documents for (w5) contained the longest document. This shows that longer documents may require longer TOCs, but since the ratio of the number of elements in TOCs and documents is different for shorter and longer documents (see Figure 3), size differences in the TOCs should not be linearly proportional to document sizes.
5 Discussion and Conclusions
We have studied searchers’ preferences of element features in automatic TOC generation for XML retrieval. The features and searchers’ preferences (as weights) can be used to select those elements that will form the TOC. We have considered three features: depth, length of the elements and their relevance to the current query. We have conducted a user study to investigate which of these features are
important in TOC generation, and what are the characteristics of TOCs that were generated by searchers’ feature preferences. Our algorithm offers a mapping from the set of elements in the document to that of the TOC, where the most important elements of the documents are selected as TOC elements and the distribution of length and depth of elements remains very similar. The result of the mapping is an extract of the document structure, which, by organising it into a table of contents, can be used to help searchers find relevant content inside an XML document. In this study, we found that a TOC generation algorithm like the one introduced in this paper has to consider the relevance of an element, i.e. it should be highly query-biased. The other two element features used, i.e. length and depth of an element, should also be considered and the weight of these two should be lower than that of the relevance feature. It is also understood that TOCs should not be large in size, i.e. longer documents should still have a relatively small TOC. This also shows that automatic TOC generation has to be more carefully designed when dealing with longer documents. To ensure better results, the TOC algorithm can be extended to include other element features such as e.g. tag names, titles of elements. Our data also suggest that the size of a TOC does not significantly depend on individual searchers. The selection of the most important elements is much more important, and here may be worth considering searchers’ individual preferences. We suggest that if a TOC algorithm selects more than a certain number (e.g. 20) of TOC elements (e.g. 30), the top scored elements (i.e. top 20) should be kept regardless of what threshold value the algorithm uses. If the number of TOC elements is lower than this number (e.g. 10), these elements should be used to construct the TOC. During the analysis of the collected data we learned that searchers had several strategies to get the best TOCs. Although not every one of the searchers understood the concepts of the features completely, they actively used the sliders and created TOCs that, according to questionnaires, were suitably good to access the relevant content more easily. To conclude, we have developed an algorithm to automatically generate tables of contents for XML documents. The algorithm uses features of elements to select those that will form the TOC. Different TOCs are generated for different queries which we think may help searchers access the relevant content more quickly. In our experiment, we investigated which features are important in TOC generation, and what are the characteristics of TOCs that are generated by searchers’ feature preferences. The work presented here is part of a wider work that aims at developing and evaluating methods that summarise the content and the structure of documents for structured document retrieval. We believe that the effective combination of the two types of summarisation can help searchers focus on only the useful contents of the documents, decrease the time searchers spend on finding relevant elements, and thus, increase user satisfaction.
Acknowledgments. The authors wish to acknowledge the participants of the study. This work was partly funded by the Nuffield Foundation (grant NAL/01081/G) and DELOS, the Network of Excellence in Digital Libraries.
References
1. P. Borlund. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 2003.
2. L. Denoyer and P. Gallinari. The Wikipedia XML Corpus. SIGIR Forum, 40(1):64–69, 2006.
3. H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, 1969.
4. N. Fuhr, M. Lalmas, S. Malik, and G. Kazai, editors. Proceedings of INEX 2005, volume 3977, 2006.
5. N. Fuhr, M. Lalmas, S. Malik, and Z. Szlávik, editors. Proceedings of INEX 2004, volume 3493, 2005.
6. B. Hammer-Aebi, K. W. Christensen, H. Lund, and B. Larsen. Users, structured documents and overlap: interactive searching of elements and the influence of context on search behaviour. In Proceedings of IIiX, pages 46–55, 2006.
7. J. Kamps, M. de Rijke, and B. Sigurbjörnsson. Length normalization in XML retrieval. In Proceedings of ACM SIGIR, pages 80–87, 2004.
8. J. Kamps and B. Sigurbjörnsson. What do users think of an XML element retrieval system? In Fuhr et al. [4], pages 411–421.
9. H. Kim and H. Son. Users interaction with the hierarchically structured presentation in XML document retrieval. In Fuhr et al. [4], pages 422–431.
10. B. Larsen, S. Malik, and A. Tombros. The interactive track at INEX 2005. In Fuhr et al. [4], pages 398–410.
11. S. Malik, C.-P. Klas, N. Fuhr, B. Larsen, and A. Tombros. Designing a user interface for interactive retrieval of structured documents - lessons learned from the INEX interactive track. In Proceedings of ECDL 2006, pages 291–302, 2006.
12. B. Sigurbjörnsson. Focused Information Access using XML Element Retrieval. PhD thesis, Faculty of Science, University of Amsterdam, 2006.
13. Z. Szlávik, A. Tombros, and M. Lalmas. Investigating the use of summarisation for interactive XML retrieval. In F. Crestani and G. Pasi, editors, Proceedings of ACM SAC-IARS'06, pages 1068–1072, 2006.
14. Z. Szlávik, A. Tombros, and M. Lalmas. The use of summaries in XML retrieval. In Proceedings of ECDL 2006, pages 75–86, 2006.
15. M. Theobald, R. Schenkel, and G. Weikum. An efficient and versatile query engine for TopX search. In Proceedings of VLDB, pages 625–636, 2005.
16. A. Tombros, B. Larsen, and S. Malik. The interactive track at INEX 2004. In Fuhr et al. [5], pages 422–435.
17. A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In Proceedings of ACM SIGIR, pages 2–10, 1998.
18. R. van Zwol, G. Kazai, and M. Lalmas. INEX 2005 multimedia track. In Fuhr et al. [4], pages 497–510.
Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom {ben,ounis}@dcs.gla.ac.uk
Abstract. Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.
1 Introduction
In Information Retrieval (IR), it is a crucial issue to rank retrieved documents in decreasing order of relevance. A recent survey on the query logs from real Web search engine users concluded that the users rarely look beyond the top returned documents [9]. Therefore, it is important to rank the highly relevant documents at the top of the retrieved list. Usually, the document ranking is based on a weighting model. In particular, most weighting models apply a term frequency (tf) normalisation method to normalise term frequency, the number of occurrences of the query term in the document. Various tf normalisation methods have been proposed in the literature, e.g. the pivoted normalisation [16] in the vector space model [15], the normalisation method of the BM25 weighting model [13], normalisation 2 [1] and normalisation 3 [1,8] in the Divergence from Randomness (DFR) framework [1]. All the above mentioned normalisation methods normalise term frequency according to document length, i.e. the number of tokens in the document. Each of the above
mentioned normalisation methods involve the use of a hyper-parameter. The setting of these hyper-parameter values usually has an important impact on the retrieval performance of an IR system [2,7,8]. Recently, Robertson et al. and Zaragoza et al. proposed the per-field normalisation technique, which normalises term frequency on a per-field basis [14,18], by extending BM25's normalisation method [13]. The resulting field-based weighting model is called BM25F. Using BM25F, the retrieval process is performed on indices of different document fields, such as the body, title, and anchor text of incoming links. Following [14,18], Macdonald et al. extended the PL2 DFR weighting model by employing the per-field normalisation 2F [10]. Compared with tf normalisation on a single field, on one hand, per-field normalisation can significantly boost the retrieval performance, particularly for Web search [12,18]. On the other hand, per-field normalisation has a hyper-parameter for each document field used. Therefore, per-field normalisation has more hyper-parameters than tf normalisation on a single index of the whole collection, which requires a heavier training process to set the hyper-parameter values. Similarly to tf normalisation on a single field, the setting of the hyper-parameter values of per-field normalisation can significantly affect the retrieval performance. In particular, the optimal hyper-parameter setting of a document field, which provides the best retrieval performance, varies across different collections [12]. As a consequence, training is required on each given new collection to guarantee an effective retrieval performance.

In this paper, we study how the per-field normalisation hyper-parameter setting is related to the resulting retrieval performance. Our study follows Harter and Amati's idea that there is a linear relationship between term frequency and document length [1,6]. This linear relationship is indicated by the linear correlation between these two variables [1,6]. The study of this paper is focused on a typical Web search task, namely the named-page finding search task [17]. The main contributions of this paper are two-fold. First, we provide a better understanding of per-field normalisation. We suggest that the purpose of per-field normalisation is to adjust the linear relationship between term frequency and field length, i.e. the number of tokens in the field. This is our main argument. Experiments are conducted to study how per-field normalisation adjusts this linear relationship. Second, we devise and evaluate an automatic hyper-parameter setting method for the per-field normalisation hyper-parameters. The proposed method does not need relevance assessment for tuning, making it particularly practical in an operational and realistic setting.

The rest of this paper is organised as follows. Section 2 introduces the current main per-field normalisation techniques. Section 3 describes our main argument in this paper. Sections 4 and 5 describe the experimental methodology for investigating the linear relationship between field length and term frequency, and analyse the experimental results, respectively. Section 6 devises an automatic method for setting the per-field normalisation hyper-parameter values, which is evaluated in Section 7. Finally, Section 8 concludes the paper and suggests possible future work.
2 Per-field Normalisation
In the context of field-based retrieval, Robertson et al. proposed the idea of normalising term frequency on a per-field basis [14]. The extended BM25 field-based weighting model, called BM25F, assigns the relevance score of a document d for a query q as follows:

score(d, q) = Σ_{t∈q} w(1) · ((k1 + 1) · tfn) / (k1 + tfn) · ((k3 + 1) · qtf) / (k3 + qtf)        (1)

where qtf is the query term frequency. k1 and k3 are parameters. The default setting is k1 = 1.2 and k3 = 1000 [13]. w(1) is the idf factor, which is given by:

w(1) = log2 ((N − Nt + 0.5) / (Nt + 0.5))

where N is the number of documents in the whole collection. Nt is the document frequency of term t. The above BM25F's weighting function is the same as the one of BM25 [13]. Instead of normalising term frequency on a single index, BM25F applies a per-field normalisation method to assign the normalised term frequency tfn [18]:

tfn = Σf wf · tfnf = Σf wf · tff / ((1 − bf) + bf · lf / avg_lf)        (2)
where wf is the weight of a field f , which reflects the relative contribution of a field to the document ranking. tf nf is the normalised term frequency on field f . tff is the frequency of the query term in the field f of the document. bf is the term frequency normalisation hyper-parameter of field f . lf is field length, namely the number of tokens in field f of the document. avg lf is the average length of field f in the collection. Moreover, following [18], Macdonald et al. extended the PL2 weighting model to cope with different document fields, within the Divergence from Randomness (DFR) probabilistic framework [1]. The idea of the DFR framework is to infer the informativeness of a query term in a document by measuring the divergence of the term’s distribution in the document from a random distribution. The larger the divergence is, the more informative the query term is in the document. The PL2F field-based weighting model has the following weighting function: score(d, q) =
Moreover, following [18], Macdonald et al. extended the PL2 weighting model to cope with different document fields, within the Divergence from Randomness (DFR) probabilistic framework [1]. The idea of the DFR framework is to infer the informativeness of a query term in a document by measuring the divergence of the term's distribution in the document from a random distribution. The larger the divergence is, the more informative the query term is in the document. The PL2F field-based weighting model has the following weighting function:

score(d, q) = \sum_{t \in q} qtw \cdot \frac{1}{tfn + 1} \left( tfn \cdot \log_2 \frac{tfn}{\lambda} + (\lambda - tfn) \cdot \log_2 e + 0.5 \cdot \log_2(2\pi \cdot tfn) \right)   (3)
where λ is the mean and variance of a Poisson distribution. It is given by λ = tfc /N . tfc is the frequency of the query term in the collection, and N is the
number of documents in the collection. In PL2F, the normalised term frequency tfn is given by the so-called Normalisation 2F as follows:

tfn = \sum_{f} w_f \cdot tfn_f = \sum_{f} w_f \cdot tf_f \cdot \log_2\left(1 + c_f \cdot \frac{avg\_l_f}{l_f}\right), \quad (c_f > 0)   (4)
where c_f is the hyper-parameter of field f. tfn_f is the normalised term frequency on field f. l_f is the length of field f in the document, and avg_l_f is the average field length in the collection. w_f is the weight of field f. In the above Normalisation 2F, the term frequency is normalised in each field, and each field f has a hyper-parameter c_f with a field weight w_f. The above Normalisation 2F is based on the assumption that term density is decreasing with document length [1].

In addition, the PL3 weighting model [1,8], which applies the Dirichlet priors for tf normalisation, can be extended in a similar way to deal with field-based retrieval. The resulting field-based PL3 weighting model has the same weighting function as PL2F in Equation (3). The normalised term frequency tfn is given by Normalisation 3F as follows:

tfn = \sum_{f} w_f \cdot tfn_f = \sum_{f} w_f \cdot \frac{tf_f + \mu_f \cdot \frac{tfc_f}{lc_f}}{l_f + \mu_f} \cdot \mu_f   (5)
where w_f and tf_f are the weight and term frequency of field f in the document, respectively. tfn_f is the normalised term frequency on field f. μ_f is the hyper-parameter of field f. lc_f is the number of tokens in field f in the whole collection. l_f is the field length in the document. tfc_f is the frequency of the query term in field f in the whole collection.

As shown by previous experiments, compared with applying tf normalisation on a single index, per-field normalisation is particularly effective for Web search, such as named-page finding [12,18]. However, per-field normalisation has an associated hyper-parameter for each document field. The setting of these hyper-parameters is a crucial issue, which has an important impact on the retrieval performance. In this paper, this issue is studied by following the idea of measuring the linear relationship between field length and term frequency [6,1]. Based on this idea, we further understand the purpose of per-field normalisation, as described in the next section.
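As a companion to Equations (4) and (5), the sketch below computes the per-field normalised term frequency under Normalisation 2F and Normalisation 3F. The field statistics are illustrative assumptions, not figures from the paper.

```python
import math

def normalisation_2f(fields):
    """Normalisation 2F (Equation 4): sum_f w_f * tf_f * log2(1 + c_f * avg_l_f / l_f)."""
    return sum(f["w"] * f["tf"] * math.log2(1.0 + f["c"] * f["avg_l"] / f["l"])
               for f in fields)

def normalisation_3f(fields):
    """Normalisation 3F (Equation 5): sum_f w_f * (tf_f + mu_f * tfc_f / lc_f) / (l_f + mu_f) * mu_f."""
    return sum(f["w"] * (f["tf"] + f["mu"] * f["tfc"] / f["lc"]) / (f["l"] + f["mu"]) * f["mu"]
               for f in fields)

# Illustrative statistics for one query term in a document (assumed values).
fields = [
    {"w": 1.0, "tf": 3, "l": 450.0, "avg_l": 500.0, "c": 1.0, "mu": 300.0,
     "tfc": 90_000.0, "lc": 5e8},                                  # body field
    {"w": 4.0, "tf": 1, "l": 6.0,   "avg_l": 8.0,   "c": 5.0, "mu": 20.0,
     "tfc": 4_000.0,  "lc": 1e7},                                  # title field
]

print(normalisation_2f(fields), normalisation_3f(fields))
```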
3 The Purpose of Per-field Normalisation
In the context of tf normalisation on a single index, Harter [6] and Amati [1] suggested that document length and term frequency have a linear relationship. Such a linear relationship can be indicated by the linear correlation between these two variables. Following their idea, in the context of per-field normalisation, we suggest that field length and term frequency also have a linear relationship, which can be indicated by the linear correlation between them. In this paper,
following the definition of correlation in [4], the linear correlation between field length and normalised term frequency is given by:

\rho(tfn_f, l_f) = \frac{COV(tfn_f, l_f)}{\sigma(tfn_f)\,\sigma(l_f)}   (6)
where tfn_f is the normalised term frequency on field f, and l_f is the field length. COV stands for covariance and σ stands for the standard deviation. Note that the use of a tf normalisation method changes term frequency. Therefore, the normalised term frequency, instead of term frequency, is considered in our study.

We suggest that the aim of tf normalisation on a document field is to adjust the linear relationship between field length and term frequency. Applying different hyper-parameter settings results in different correlation values, which indicate different degrees of linear dependence between field length and term frequency. In our study, we investigate how per-field normalisation affects the correlation ρ(tfn_f, l_f). In particular, from our experiments, we expect to find a pattern that may help in proposing an automatic method for setting the hyper-parameter values, without using relevance assessment for tuning.
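A minimal sketch of how ρ(tfn_f, l_f) in Equation (6) can be estimated from a sample of (term frequency, field length) observations. The sample data and the use of Normalisation 2 inside the helper are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Linear correlation (Equation 6): COV(x, y) / (sigma(x) * sigma(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def rho_tfn_length(samples, c, avg_l):
    """samples: (tf_f, l_f) pairs for a query term in one field; Normalisation 2 is applied."""
    tfns = [tf * math.log2(1.0 + c * avg_l / l) for tf, l in samples]
    lengths = [l for _, l in samples]
    return pearson(tfns, lengths)

# Toy sample of (term frequency, field length) pairs -- illustrative only.
samples = [(1, 120.0), (2, 300.0), (1, 80.0), (4, 900.0), (3, 420.0), (2, 250.0)]
for c in (0.5, 1.0, 8.0):
    print(c, rho_tfn_length(samples, c=c, avg_l=350.0))
```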
4 Experimental Setting and Methodology
Our experiments are conducted using Terrier [11]. Two TREC Web test collections, namely the .GOV and the .GOV2 collections, are used in our experiments. These two collections are the only currently available ones for the named-page finding task. The .GOV collection is a 1.25 million pages crawl of the .gov domain. The .GOV2 collection is a later crawl of the .gov domain, which contains 25,205,179 Web documents and 426 Gigabytes of uncompressed data (information on the collections can be found at http://ir.dcs.gla.ac.uk/test collections/). This collection has been employed in the TREC Terabyte track since 2004. In addition, a named-page finding task has been run in the Terabyte track since 2005. .GOV2 is currently the largest TREC test collection. The indices of these two collections are created with the body, anchor text and title fields, respectively. Porter's stemmer and standard stopword removal are applied. The test queries used are the 525 topics used in the TREC 2002-2004 Web track named-page finding tasks [17] on the .GOV collection, and the 252 topics used in the TREC 2005 Terabyte track named-page finding task [3] on the .GOV2 collection. The evaluation measure used is mean reciprocal rank (MRR), which is the official measure in TREC for the named-page finding task [17].

In our experiments, we investigate the linear relationship between field length and normalised term frequency. This linear relationship is indicated by the linear correlation between these two variables. Three field-based weighting models in the literature, namely PL2F, BM25F and PL3F, are used in this study. The first step of the experiments is to optimise the three weighting models used, which provides a basis for our study.
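Since MRR is the only effectiveness measure used in this paper, a minimal sketch of its computation is given below; the ranked lists and relevance sets are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query id -> ranked list of document ids;
    relevant: query id -> set of correct named pages."""
    rr = []
    for qid, ranked in rankings.items():
        score = 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[qid]:
                score = 1.0 / rank
                break
        rr.append(score)
    return sum(rr) / len(rr)

# Hypothetical example: the named page is found at ranks 1 and 3.
rankings = {"q1": ["d3", "d7"], "q2": ["d1", "d9", "d4"]}
relevant = {"q1": {"d3"}, "q2": {"d4"}}
print(mean_reciprocal_rank(rankings, relevant))  # (1/1 + 1/3) / 2
```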
For each of the field-based weighting models, we need to optimise six parameters, namely the hyper-parameters and the weights of the three indexed document fields. Our optimisation process follows the one for BM25F applied in [18]. However, we apply manual data sweeping, instead of automatic optimisation, as applied in [18]. This is because in our previous experiments, we found that manual data sweeping with a small enough granularity can usually lead to a better optimised retrieval performance than automatic optimisation. Following [18], we set the field weight of body to 1 to reduce the cost of optimisation. For the remaining five parameters, the optimisation process is described as follows:

1. On each field, we optimise the hyper-parameter of the field, while disabling the other two fields. The optimised hyper-parameter settings are obtained by multiple-step data sweeping with from-large-to-small granularities. Data sweeping is performed within a reasonable range of values. This range is [1, 32] for PL2F, (0, 1] for BM25F and (0, 100000] for PL3F. The minimal granularity is 0.1 for PL2F, 0.05 for BM25F and 10 for PL3F.
2. We optimise the field weights of anchor text and title by a two-step two-dimensional data sweeping within [0, 20], while setting the hyper-parameter values to the ones optimised in the first step. The granularities in the two data sweeping steps are 1 and 0.1, respectively.

We only briefly describe the optimisation process, for lack of space. The obtained optimised parameter values are provided in Table 1.

Table 1. The optimal hyper-parameter settings and weights of the body, anchor text, and title fields, using three field-based weighting models, on the two collections used

Hyper-parameter settings:
         PL2F (c_f)                 BM25F (b_f)                PL3F (mu_f)
Coll.    Body   Anchor  Title      Body   Anchor  Title       Body   Anchor  Title
.GOV     0.8    15.0    5.5        0.85   0.10    0.45        300    50000   10
.GOV2    1.2    2.6     2.6        0.85   0.90    0.50        300    40      20

Field weights:
         PL2F (w_f)                 BM25F (w_f)                PL3F (w_f)
Coll.    Body   Anchor  Title      Body   Anchor  Title       Body   Anchor  Title
.GOV     1      1.1     4.6        1      8.1     12.4        1      0.4     18.2
.GOV2    1      3.4     2.6        1      6.0     8.6         1      1.1     12.0

The second step of the experiments is to investigate the linear relationship between field length and normalised term frequency. This linear relationship is indicated by ρ(tfn_f, l_f), the linear correlation between these two variables. In particular, we study how the optimised hyper-parameter values are related to ρ(tfn_f, l_f). For the three different document fields used, we plot the hyper-parameter values against ρ(tfn_f, l_f), in order to study how the linear relationship between tfn_f and l_f varies on different document fields. We also look at the proportion of the optimal ρ(tfn_f, l_f) value to the maximum ρ(tfn_f, l_f) value with respect to all possible hyper-parameter values. By doing this, we expect to find a pattern that may indicate the optimal hyper-parameter setting, which can lead to a practical approach for setting the hyper-parameter values. The analysis of the related experimental results is provided in the next section.
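The multi-step, from-large-to-small data sweeping described above can be sketched as follows. The helper `evaluate_mrr`, which stands for a retrieval run followed by an MRR computation, and the numeric defaults are assumptions for illustration, not part of the paper.

```python
def sweep(evaluate_mrr, lo, hi, steps=8, rounds=3):
    """Multiple-step data sweeping with from-large-to-small granularities:
    each round scans `steps` evenly spaced values in [lo, hi] and then
    narrows the range around the best-performing value."""
    best_x, best_score = None, float("-inf")
    for _ in range(rounds):
        grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
        for x in grid:
            score = evaluate_mrr(x)
            if score > best_score:
                best_x, best_score = x, score
        width = (hi - lo) / (steps - 1)
        lo, hi = max(lo, best_x - width), min(hi, best_x + width)
    return best_x, best_score

# Usage (hypothetical): sweep the PL2F hyper-parameter c_f of the body field in [1, 32].
# best_c, best_mrr = sweep(lambda c: run_and_evaluate(c_body=c), 1.0, 32.0)
```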
5 The Linear Relationship Between Field Length and Normalised Term Frequency
Table 1 contains the optimised hyper-parameter values and field weights after the data sweeping process. From Table 1, we observe that, across the two collections used, on one hand, for the body and title fields, the optimised hyper-parameter settings are relatively similar. On the other hand, for the anchor text field, the hyper-parameter settings are largely different across the two collections used (e.g. 40 vs. 50000 using PL3F). A possible explanation is that the weighting models used assume a random distribution of a query term in the collection. This assumption is usually true for written text, such as body and title. Differently from these two fields, the anchor text of a Web page is extracted from its incoming links. Eiron & McCurley concluded that the anchor text of a Web page usually has only one or two repeatedly occurring unique terms [5]. Consequently, in the anchor text field, the curve of a query term's distribution looks like a Beta(0.5, 2) distribution, i.e. a Beta distribution with α = 0.5 and β = 2 [4], because a query term usually has a large number of occurrences in the anchor text of some Web pages, and does not appear at all in the anchor text of other Web pages. Therefore, the optimised hyper-parameter setting for the anchor text field is unpredictable and can be largely different across different collections.

Next, we study the linear relationship between field length and normalised term frequency. Figure 1 plots the linear correlation between these two variables. From Figure 1, we find that the linear correlation ρ(tfn_f, l_f) varies with the use of different hyper-parameter values. In particular, in all the cases, the plotted curve has a lowest point, which corresponds to the maximum negative ρ(tfn_f, l_f) value. To further analyse the linear relationship between field length and normalised term frequency, Table 2 provides the resulting ρ(tfn_f, l_f) values of the optimised hyper-parameter settings. The values in parentheses are the ratio_f values, given by:

ratio_f = \frac{\rho_{opt}(tfn_f, l_f)}{\rho_{max}(tfn_f, l_f)}   (7)
where ρ_opt(tfn_f, l_f) is the ρ(tfn_f, l_f) value given by the optimised hyper-parameter setting. ρ_max(tfn_f, l_f) is the maximum negative ρ(tfn_f, l_f) value that corresponds to the lowest points of the curves in Figure 1.

From Table 2, for the body and title fields, we find that the ρ_opt(tfn_f, l_f) value seems to be proportional to the maximum negative ρ(tfn_f, l_f) value. The ratio_f values for the body and title fields are similar to each other. For the anchor text field, we do not have the same observation, probably because of the repeatedly occurring tokens of the query terms in this field. Based on the observation from Table 2, in the next section, we devise an automatic method for estimating the hyper-parameter values of the body and title fields. For the anchor text field, we simply apply the optimised hyper-parameter setting after data sweeping.
Fig. 1. The linear correlation ρ(tfn_f, l_f) against the tf normalisation hyper-parameters on the two collections used. The six panels plot ρ(tfn_f, l_f) for the body, anchor text and title fields: (a) Normalisation 2 on .GOV, (b) BM25's normalisation method on .GOV, (c) Normalisation 3 on .GOV, (d) Normalisation 2 on .GOV2, (e) BM25's normalisation method on .GOV2, (f) Normalisation 3 on .GOV2.
Table 2. The optimal ρ(tfn_f, l_f) values and the corresponding ratio_f for the three document fields. Mean ratio refers to the mean of the ratio_f values over the three weighting models used. Side specifies on which side of the curves (in Figure 1) the optimal ρ(tfn_f, l_f) value locates.

            rho_body (ratio_body)   rho_anchor (ratio_anchor)   rho_title (ratio_title)
PL2F
.GOV        -0.3232 (0.9364)        0.002456 (-0.01629)         -0.4265 (0.9699)
.GOV2       -0.2543 (0.9447)        -0.03282 (0.4361)           -0.2332 (0.9688)
Side        Increasing              Increasing                  Decreasing
BM25F
.GOV        -0.3389 (0.9763)        0.04932 (-0.3263)           -0.4325 (0.9414)
.GOV2       -0.2697 (0.9947)        -0.07142 (0.9393)           -0.2489 (0.7385)
Side        Decreasing              Decreasing                  Increasing
PL3F
.GOV        -0.3097 (0.8918)        0.1658 (-1.0967)            -0.4546 (0.9886)
.GOV2       -0.2527 (0.9313)        -0.02301 (0.3898)           -0.3157 (0.9312)
Side        Increasing              Increasing                  Decreasing
Mean ratio
.GOV        0.9348                  -0.4798                     0.9666
.GOV2       0.9568                  0.5884                      0.8795

6 Method prop for Setting the Per-field Normalisation Hyper-parameter Values
In the previous section, across the two collections used, we found that on both the body and title fields, the optimal ρ(tfn_f, l_f) value is proportional to the maximum negative ρ(tfn_f, l_f) value. Therefore, on these two fields, we can estimate the hyper-parameter value which gives a ρ(tfn_f, l_f) value that is proportional to the ratio_f value (see Table 2). Using the above suggested solution, we make the following hypothesis:

Hypothesis H(prop): For the given body or title field, across different collections, the optimal hyper-parameter values provide a constant ratio of the optimal ρ(tfn_f, l_f) value divided by the maximum negative ρ(tfn_f, l_f) value.

The above Hypothesis H(prop) implies that, for a given body or title field, the optimal ρ(tfn_f, l_f) value is proportional to the maximum negative ρ(tfn_f, l_f) value. Using Hypothesis H(prop), for a given collection, we can estimate the hyper-parameter value that satisfies ρ(tfn_f, l_f) = ρ_max(tfn_f, l_f) · ratio_f, where ρ_max(tfn_f, l_f) is the maximum negative ρ(tfn_f, l_f) value. ratio_f is given by Equation (7). On the two collections used, the ratio_f values of body and title are listed in Table 2. We denote the above described approach by method prop.

To apply method prop, we need to create a bidirectional mapping between a hyper-parameter and ρ(tfn_f, l_f). Each ρ(tfn_f, l_f) value should correspond to
a unique hyper-parameter value, and vice versa. In fact, from Figure 1, we can see that a ρ(tfn_f, l_f) value corresponds to two different hyper-parameter values: one is on the increasing side of the curve, and the other is on the decreasing side of the curve. Therefore, by looking at the curves in Figure 1, we identify at which side of the curve the optimal ρ(tfn_f, l_f) value is located (see Table 2). For the application of method prop, we need a training collection to obtain the assumed constant ratio_f. The training process for computing this constant ratio_f needs to be done only once. For a given new collection, we do not need any associated relevance judgement. Finally, we apply the hyper-parameter setting that results in the ratio_f value on the given query set, instead of optimising the hyper-parameter by maximising the retrieval performance using relevance judgement. For a given collection, the tuning process takes place before the retrieval process. There is no additional overhead during retrieval. We evaluate method prop in the next section.
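A sketch of method prop under Hypothesis H(prop): the curve of ρ(tfn_f, l_f) against the hyper-parameter is scanned to locate its maximum negative value, and the hyper-parameter is then searched on the indicated monotone side for the value whose correlation is closest to ρ_max · ratio_f. The helper `rho` (computing the correlation for a given hyper-parameter value, as in Section 3) and the numeric ranges in the usage comment are assumptions for illustration.

```python
def estimate_hyperparameter(rho, candidates, ratio_f, side="increasing"):
    """Method prop (sketch): pick the hyper-parameter whose correlation
    rho(tfn_f, l_f) is closest to rho_max * ratio_f, restricted to the
    increasing or decreasing side of the curve (see Table 2)."""
    curve = [(x, rho(x)) for x in candidates]   # candidates assumed sorted ascending
    i_min, (x_min, rho_max) = min(enumerate(curve), key=lambda p: p[1][1])
    target = rho_max * ratio_f
    search = curve[i_min:] if side == "increasing" else curve[:i_min + 1]
    best_x, _ = min(search, key=lambda p: abs(p[1] - target))
    return best_x

# Usage (hypothetical): estimate c_f for the body field with PL2F, using the
# mean ratio 0.9568 obtained on the training collection (cf. Table 2).
# c_body = estimate_hyperparameter(rho_body, [x / 10 for x in range(10, 321)], 0.9568)
```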
7 Evaluation of Method prop
For evaluating method prop, we conduct a two-fold holdout evaluation on the two collections, namely .GOV and .GOV2. In each fold of the holdout evaluation, we use one collection for training, in order to compute the assumed constant ratio_f value. The other collection is used for testing. The assumed constant ratio_f value is the mean of the ratio_f values of the three weighting models used, obtained on the training collection (see the mean ratio_f values in Table 2). In addition to the test queries used in Section 5, we also experiment with the 181 latest TREC topics used in the TREC 2006 named-page finding task. Note that method prop is only applied for the body and title fields. For the anchor text field, we apply the optimised hyper-parameter setting obtained by data sweeping.

We compare the retrieval performance obtained using the hyper-parameter setting, estimated by method prop, with the optimised retrieval performance using data sweeping. We suggest that a large (>5%) and statistically significant difference between the obtained MRRs indicates a failure of method prop in estimating the hyper-parameter setting. Otherwise, we conclude that method prop is effective for the named-page finding retrieval task, when different document fields are used. The statistical test used is the sign test (for MRR, the sign test is more appropriate than the Wilcoxon test).

The evaluation results are listed in Table 3. In two cases, we observe that the estimated hyper-parameters result in MRRs that are higher than the optimised ones by data sweeping (see the MRR values in italics in Table 3). We suggest that this is because the optimisation procedure optimises the hyper-parameter of each document field separately. However, the optimised hyper-parameter setting of each individual field may not necessarily lead to the optimised retrieval performance, when different fields are summed up together. From Table 3, we observe that in all the nine cases, method prop provides a retrieval performance that is as good as the one optimised by data sweeping. We find no large (>5%) difference
between MRR(opt) and MRR(prop). Therefore, we conclude that method prop, based on Hypothesis H(prop), is effective on the two collections used.

To summarise, in a practical setting, the assumed constant ratio_f value is obtained on a training collection only once. For a given new collection, we recommend applying method prop for the body and title fields without the use of relevance assessment. For the anchor text field, we recommend applying an empirical hyper-parameter setting.

Table 3. Evaluation results for method prop. Columns body and title provide the hyper-parameter settings for the body and title fields, estimated by method prop. MRR(opt) and MRR(prop) are the MRRs obtained by data sweeping and by method prop, respectively. diff is the difference between the two MRR values in percentage. p-value is given by the sign test.

Topics                         body    title    MRR(opt)  MRR(prop)  diff (%)  p-value
PL2F
TREC 2002-2004 on .GOV         0.62    1.03     0.7294    0.7018     -3.78     3.03e-05
TREC 2005 on .GOV2             1.34    21.65    0.4341    0.4522     +4.17     0.0422
TREC 2006 on .GOV2             1.06    20.54    0.4736    0.4733     ~0        0.630
BM25F
TREC 2002-2004 on .GOV         0.81    0.63     0.7142    0.7145     ~0        0.497
TREC 2005 on .GOV2             0.69    0.17     0.4738    0.4522     -4.56     0.583
TREC 2006 on .GOV2             0.74    0.19     0.4405    0.4392     ~0        0.798
PL3F
TREC 2002-2004 on .GOV         170     2.40     0.6390    0.6140     -3.91     0.0115
TREC 2005 on .GOV2             291     27.23    0.3721    0.3751     ~0        0.00259
TREC 2006 on .GOV2             234     23.59    0.4470    0.4259     -4.72     1.03e-07
8 Conclusions and Future Work
In this paper, we have provided a better understanding of per-field normalisation, based on Harter and Amati's idea that there is a linear relationship between term frequency and document length. We argue that the purpose of per-field normalisation is to adjust the linear relationship between term frequency and field length. Based on this argument, we have conducted a study of setting the per-field normalisation hyper-parameters, based on experimentation on two TREC Web collections for named-page finding. From the experiments, we have the following important finding: for the body and title fields, using three different field-based weighting models, the optimal ρ(tfn_f, l_f) value, given by the optimised hyper-parameter value, is proportional to the maximum negative ρ(tfn_f, l_f) value across the two collections used. Another important finding is that the optimised hyper-parameter settings for the anchor text field are largely different across the two collections used. We suggest that this is because of the repeatedly occurring tokens of the query terms in the anchor text field. Based on the above findings, we proposed an automatic method for setting the
per-field normalisation hyper-parameters, called method prop, for the body and title fields. The proposed method does not require relevance assessment for tuning. According to the evaluation results, method prop was shown to be effective on the two test collections used, with 958 associated test queries. For the application of method prop in practice, we recommend applying method prop for the document fields of written text, such as the body and title fields. For the anchor text field, we recommend applying an empirical hyper-parameter setting, obtained by training using relevance assessment.

The reported experiments in this paper were conducted for the named-page finding retrieval task, on two different TREC collections, including the large-scale .GOV2 collection. We have also conducted experiments for ad-hoc retrieval on various TREC test collections, from which we had similar observations to those in this paper. For lack of space, we only focus on the named-page finding retrieval task in this paper. In the future, we will further study whether method prop is general enough to cope with other Web search tasks, in the context of field-based retrieval. Moreover, because of the repeatedly occurring terms in anchor text, it is difficult to estimate the hyper-parameter setting for this field. A possible solution is to apply an absolute discount on the term frequency in this field, before per-field normalisation is applied. We will also investigate this issue in future work.
References
1. G. Amati. Probabilistic Models for Information Retrieval based on Divergence from Randomness. PhD thesis, University of Glasgow, 2003.
2. A. Chowdhury, M. C. McCabe, D. Grossman, and O. Frieder. Document normalization revisited. In Proceedings of ACM SIGIR 2002.
3. C. Clarke, F. Scholer, and I. Soboroff. Overview of the TREC-2005 Terabyte Track. In Proceedings of TREC 2005.
4. M. DeGroot. Probability and Statistics. Addison Wesley, 2nd edition, 1989.
5. N. Eiron and K. McCurley. Analysis of anchor text for web search. In Proceedings of ACM SIGIR 2003, 2003. URL: http://mccurley.org/papers/anchor.pdf.
6. S. Harter. A probabilistic approach to automatic keyword indexing. PhD thesis, The University of Chicago, 1974.
7. B. He and I. Ounis. Term frequency normalisation tuning for BM25 and DFR model. In Proceedings of ECIR 2005, 2005.
8. B. He and I. Ounis. A study of the Dirichlet Priors for term frequency normalisation. In Proceedings of ACM SIGIR 2005, 2005.
9. B. Jansen and A. Spink. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing & Management, 42(1), 2006.
10. C. Macdonald, B. He, V. Plachouras, and I. Ounis. University of Glasgow at TREC 2005: experiments in Terabyte and Enterprise tracks with Terrier. In Proceedings of TREC 2005.
11. I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A high performance and scalable Information Retrieval platform. In Proceedings of ACM SIGIR OSIR Workshop 2006.
12. V. Plachouras. Selective Web Information Retrieval. PhD thesis, University of Glasgow, 2006.
13. S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive. In Proceedings of TREC 7, 1998.
14. S. E. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of ACM CIKM 2004.
15. G. Salton. The SMART Retrieval System. Prentice Hall, New Jersey, 1971.
16. A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of ACM SIGIR 1996.
17. E. Voorhees. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, 2005.
18. H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. E. Robertson. Microsoft Cambridge at TREC 13: Web and Hard Tracks. In Proceedings of TREC 2004.
Combining Evidence for Relevance Criteria: A Framework and Experiments in Web Retrieval

Theodora Tsikrika and Mounia Lalmas
Department of Computer Science, Queen Mary, University of London, UK
{theodora, mounia}@dcs.qmul.ac.uk
http://qmir.dcs.qmul.ac.uk
Abstract. We present a framework that assesses relevance with respect to several relevance criteria, by combining the query-dependent and query-independent evidence indicating these criteria. This combination of evidence is modelled in a uniform way, irrespective of whether the evidence is associated with a single document or related documents. The framework is formally expressed within Dempster-Shafer theory. It is evaluated for web retrieval in the context of TREC's Topic Distillation task. Our results indicate that aggregating content-based evidence from the linked pages of a page is beneficial, and that the additional incorporation of their homepage evidence further improves the effectiveness.

Keywords: Dempster-Shafer theory, topic distillation, best entry point.
1 Motivation, Background, and Aim
In ad hoc Information Retrieval (IR), multiple criteria are applied when assessing the relevance of documents. The relevance criterion at the heart of IR, and the one usually employed by IR systems, is the topical relevance (or topicality) of documents [1]. From a user's perspective, though, empirical studies have reached a consensus that users are influenced by beyond topical factors when assessing retrieved documents [1]. Therefore, IR systems need to consider beyond topical relevance criteria. For instance, on the Web, due to its size and unregulated nature, users desire authoritative information, without explicitly stating so.

An IR system assesses relevance by using evidence of relevance in its retrieval function. In essence, each source of evidence indicates relevance with respect to a specific criterion. For instance, content-based evidence is used for capturing a document's topicality. In web environments, link-based query-independent evidence, such as a page's PageRank [2], indicates a page's authority. Algorithms such as HITS [11], on the other hand, express a query-dependent view of a page's authority, i.e. its topical authority. In addition, URL-based query-independent evidence (e.g. URL length [14] or URL types [12]) is used for assessing a page's “homepageness” [5] (i.e. how likely it is for a page to be a site's homepage).

To assess relevance that reflects various criteria, IR systems combine evidence indicating the criteria of interest. The predominant combination of evidence approaches that incorporate, in a principled manner, evidence indicating
beyond topical criteria are probabilistic frameworks. These estimate the belief in relevance given query-dependent and query-independent features. For instance, in language modelling frameworks (e.g. [12]), prior probabilities of relevance, estimated using query-independent features, are embedded in the framework, and combined with the (content-based) language modelling probability. Other frameworks (e.g. [7]) transform each feature’s value into a feature-based relevance score, and subsequently linearly combine all available relevance scores. Our aim is similar: to estimate, in a principled manner, the belief in relevance, when various criteria are of interest, by combining the (query-independent and query-dependent) features indicating these (or any combination of these) criteria. However, unlike others, our aim also is to estimate the belief in each of the relevance criteria of interest by decomposing relevance into the criteria involved. For instance, for web retrieval, by decomposing relevance into topicality, authority and homepageness, the combination of evidence for relevance criteria allows us to estimate the belief, for each page, in being each of the following: on the topic, a topical authority, a topical homepage, a topical authoritative homepage, and any other possible combination of these criteria. In addition, since on the Web, and other hyperlinked environments, users browse, they assess a web page in terms of the information it contains, and the information it provides access to [4,11], i.e. as an entry point to the Web’s structure. In particular, when many interlinked pages from the same site are retrieved, users would rather not be presented with all of them, but with only a Best Entry Point (BEP) [4] to the site, i.e. a page at a suitable level in the site’s hierarchy providing access, by browsing, to the relevant information in the site. For instance, BEPs could correspond to homepages, viewed as good entry points for users to follow the flow of information in the site, or to be presented with an overview of its content [9]. To identify BEPs as topical homepages, web IR systems could combine content-based and homepage evidence. Alternatively, they could assess each page in terms of its own features and those of the pages it provides access to. Such approaches aggregate the features of each page with those of its linked pages, by propagating them through the site’s structure [4]. The aggregation can be performed by propagating: (i) term weights [10,15] or (ii) relevance scores [13,15]. The former identifies BEPs with respect to topicality, whereas the latter is able to capture multiple relevance criteria, depending on the relevance scores incorporated in the aggregation. For instance, by aggregating relevance scores indicating the topicality and authority of pages, we can model BEPs that provide access to pages containing authoritative information on a topic. This flexibility has not been fully exploited in the context of the Web, where relevance scores reflecting only a single criterion, e.g. content-based relevance scores reflecting topicality [13,15,5,6], are usually aggregated. Therefore, our aim is twofold: (i) to assess relevance with respect to relevance criteria, by combining the evidence indicating these criteria, and (ii) to model this combination of evidence in a uniform way, irrespective of whether the evidence is associated with a single information item (e.g. a single web page) or with related information items (e.g. linked web pages). 
Although our aim is to provide
a framework applicable to various environments, we focus on web retrieval, where we assess each web page as an entry point with respect to any relevance criterion, given either its own features, or also those of its linked pages. To estimate the belief in relevance by combining the available evidence, various formalisms for reasoning with uncertainty can be employed. We explore the possibility of modelling our framework using Dempster-Shafer theory of evidence [17], an alternative formalism to probability theory. We consider this theory to be useful at the conceptual design level, and for providing guidance in expressing and performing the combination of evidence. We apply our framework to a web retrieval task, TREC’s Topic Distillation [5,6], an informational task concerned with retrieving for a broad topic key resources, interpreted as BEPs (that correspond to homepages) of sites providing credible information on the topic. Section 2 introduces Dempster-Shafer theory. The framework, expressed within this theory, is described in Section 3. It is evaluated for TREC’s Topic Distillation task. Section 4 describes the experimental setting, and presents and discusses the results of the experiments. Section 5 provides some concluding remarks.
2 Dempster-Shafer Theory of Evidence
Dempster-Shafer (DS) theory of evidence is a formalism for representing, manipulating and revising degrees of belief rendered by multiple sources of evidence to a common set of propositions. It concerns the same concepts as those considered by Bayesian probability theory. It does not rely, though, on the probabilistic quantification of degrees of belief, but on a more general system based on belief functions. This theory was developed by Shafer [17], based on Dempster's earlier work [8]. We summarise the necessary background of the theory, by adopting Shafer's [17] initial terminology, notation and interpretation of the formalism.

Frame of discernment. Suppose we are concerned with the value of some quantity θ and the set of its possible values is Θ. In DS theory, this set Θ of exhaustive and mutually exclusive events is called frame of discernment. There is a one-to-one correspondence between subsets of Θ and propositions. The propositions of interest could be: “the value of θ is in A”, A ⊆ Θ. If A = {a}, a ∈ Θ, the proposition is expressed as “the value of θ is a” and constitutes an elementary proposition. Non-elementary propositions are disjunctions of elementary ones.

Basic probability assignment. The belief committed to a proposition given some evidence is quantified by a function m : 2^Θ → [0, 1] called a basic probability assignment (bpa). Bpas can assign belief to any proposition in the frame and not only to the elementary ones. No belief can ever be assigned to the false proposition (m(∅) = 0) and the sum of all bpas must equal 1: \sum_{A \subseteq Θ} m(A) = 1. The quantity m(A) represents the belief committed exactly to A, which due to lack of evidence (ignorance) cannot be committed to any proper subset of A. Belief assignments are carried out only for propositions for which there is evidence. Consequently, committing belief to a proposition A does not necessarily imply that the remaining belief is committed to its negation ¬A. Therefore,
if m(A) = 0.6, and there is no further evidence for or against A or any other proposition in Θ, then the remaining 1 − 0.6 = 0.4 is assigned to the frame: m(Θ) = 0.4. This represents a state of ignorance and implies that this remaining belief could be assigned to any proposition in Θ, when new evidence becomes available. Complete ignorance with respect to the frame Θ is represented by the vacuous bpa: m(Θ) = 1 and m(A) = 0, ∀A ⊂ Θ. In any case, if m(A) > 0, A is called a focal element. The focal elements and associated bpa define a body of evidence. We can also obtain a δ-discounted bpa m^δ (0 ≤ δ ≤ 1) from the original bpa m as follows: m^δ(A) = δ · m(A), ∀A ⊂ Θ and m^δ(Θ) = δ · m(Θ) + 1 − δ. The discounting factor δ represents a form of knowledge on the reliability of the body of evidence.

Belief function. Given a body of evidence with bpa m, one can compute the total belief committed to a proposition A ⊆ Θ. This is done with a belief function Bel : 2^Θ → [0, 1] defined upon m, so that it considers the belief assigned to the more specific propositions (i.e. to the subsets) of A: Bel(A) = \sum_{B \subseteq A} m(B).

Dempster's combination rule. This rule aggregates two distinct bodies of evidence, with bpas m_1 and m_2, defined within the same frame Θ, into one body of evidence defined by a bpa m on the same frame:

m(A) = (m_1 \oplus m_2)(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}

The rule is commutative and associative. It computes a measure of agreement between two bodies of evidence concerning propositions discerned from a common frame. It focuses only on propositions that both bodies of evidence support. The numerator is the sum over all conjunctions that support a proposition. The denominator is a normalisation factor ensuring m is a bpa. Combining bpa m_1 with a vacuous bpa m_v has no effect on m_1: m_1 ⊕ m_v = m_1.
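A compact sketch of a bpa, Dempster's combination rule and the belief function over the two-element frame {R, ¬R}, with propositions represented as frozensets of elementary events; the representation is an illustrative choice, and the two bpas reproduce the page z example used in Section 3.1 below (scores 0.6 and 0.7).

```python
from itertools import product

def combine(m1, m2):
    """Dempster's combination rule: m(A) is proportional to the sum of
    m1(B) * m2(C) over all B, C with non-empty intersection A."""
    raw, conflict = {}, 0.0
    for (B, x), (C, y) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            raw[A] = raw.get(A, 0.0) + x * y
        else:
            conflict += x * y
    return {A: v / (1.0 - conflict) for A, v in raw.items()}

def belief(m, A):
    """Bel(A) = sum of m(B) over all B that are subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

THETA = frozenset({"R", "notR"})
R = frozenset({"R"})

m1 = {R: 0.6, THETA: 0.4}   # one source of evidence
m2 = {R: 0.7, THETA: 0.3}   # a second, distinct source
m = combine(m1, m2)
print(m[R], m[THETA], belief(m, R))   # 0.88, 0.12, 0.88
```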
3 Combining Evidence for Relevance Criteria
This section presents our framework, expressed within DS theory, for modelling the combination (and aggregation) of evidence (features) for relevance criteria. Our presentation focuses on web retrieval. In Section 3.1, we assess the relevance of each page, given either its own features, or also those of its linked pages, without considering what the underlying criteria are. In Section 3.2, we extend our framework and explicitly consider the criteria of interest. In both cases, we consider that the features' values have been transformed to relevance scores.
3.1 The Basic Framework: Combining Evidence for Relevance
We define the frame of discernment Θ in terms of the relevance criteria of interest. When we only consider the relevance of web pages without explicitly specifying the underlying criteria, the elements of Θ are defined as the mutually exclusive propositions θ0 = {¬R} and θ1 = {R}. Proposition {R} reflects “a good point to enter for accessing R information”, R being relevant. Each page x, referred to as object ox , is represented by a body of evidence defined in Θ. Its associated bpa mx (A) quantifies the belief in A ⊆ Θ, given all available evidence for x.
Representation. When a single source of evidence of relevance is available, mx({R}) (denoted mx(R) for simplicity) quantifies the degree to which this evidence indicates that this is a good point to enter to access relevant information. Suppose page x has a relevance score 0.6 given evidence e; then mx(R) = 0.6. The remaining belief is assigned to Θ, mx(Θ) = 0.4, representing that, at this stage, we have no further evidence for any other proposition in Θ. The total belief is Belx(R) = mx(R). Suppose page y has a zero relevance score given evidence e. A first approach is to associate oy with a vacuous bpa my(Θ) = 1. This expresses complete ignorance with respect to Θ, i.e. we consider that our evidence does not allow us to assign any belief in {R} or {¬R}. However, we do know that, given evidence e, page y was assessed as non relevant. This can be used to express our belief in {¬R}. Therefore, a second approach is to set 0 < my(¬R) < 1. When more than one source of evidence is available for each page, a separate bpa is defined in terms of each source of evidence taken into account. Suppose that page z is assigned relevance score 0.6 given evidence e1, and relevance score 0.7 given evidence e2. Then, page z is represented by 2 separate bpas: mz:e1(R) = 0.6 (mz:e1(Θ) = 0.4) and mz:e2(R) = 0.7 (mz:e2(Θ) = 0.3).

Combination. To combine the available evidence associated with a page, we combine the bodies of evidence using Dempster's combination rule. For instance, given page z as above, the combination yields mz = mz:e1 ⊕ mz:e2, with mz(R) = (mz:e1(R) · mz:e2(R) + mz:e1(R) · mz:e2(Θ) + mz:e1(Θ) · mz:e2(R))/1 = 0.6 · 0.7 + 0.6 · 0.3 + 0.4 · 0.7 = 0.88 and mz(Θ) = (mz:e1(Θ) · mz:e2(Θ))/1 = 0.4 · 0.3 = 0.12.

Aggregation. To assess each page in terms of its own features and those of the pages it provides access to, we aggregate the bodies of evidence of linked pages belonging to the same site using Dempster's combination rule. Consider the web sites in Figure 1. Each page i, referred to as object oi, is represented by a body of evidence in Θ and its associated bpa is mi. Given the evidence from page p (site A) and its linked children pages ck, k = 1, ..., 5, the aggregation is expressed as: mp,c1−5 = mp ⊕ mc1 ⊕ ... ⊕ mc5. As the user enters site A at page p, the actual information accessed is the one contained in p. The information contained in its children should be considered as “potential” [13], since the user needs to traverse the links in order to fully access it. Hence, the contribution of evidence from the children as a whole should be weighed appropriately, to reflect the uncertainty associated with their propagation to the parent page. This is expressed with a propagation (or fading [13]) factor, and is modelled by a discounted bpa. For instance, the bpa associated with the aggregate oc1−5, formed from the children of page p, is m^{prop}_{c1−5}, where prop is the propagation factor. This is expressed as: mp,c1−5 = mp ⊕ m^{prop}_{c1−5}. We can also express the contribution of each child ock in forming the aggregate. The extent of this contribution, referred to as accessibility (acc) [16], is modelled by a discounted bpa m^{acc_k}_{ck}. The bpa for c1−5 is: mc1−5 = m^{acc_1}_{c1} ⊕ ... ⊕ m^{acc_5}_{c5}. Thus, the belief in {R} is (see also the definition of discounted bpas):
mc1−5(R) = (acc_1 · mc1(R)) ⊕ ... ⊕ (acc_5 · mc5(R))   (1)
mp,c1−5(R) = mp(R) ⊕ (prop · mc1−5(R))   (2)

Fig. 1. Examples of linked pages in web sites

Table 1. Examples of aggregation methods applied to site A

Aggregation method acc1:
o_i        m_i(R)  m_i(¬R)  m_i(Θ)  Bel_i(R)
o_c1       0.8     0        0.2     0.8
o_c2       0.6     0        0.4     0.6
o_c1-2     0.92    0        0.08    0.92
o_c3       0       0        1       0
o_c1-3     0.92    0        0.08    0.92
o_c4       0       0        1       0
o_c1-4     0.92    0        0.08    0.92
o_c5       0       0        1       0
o_c1-5     0.92    0        0.08    0.92
o_p        0       0        1       0
o_p,c1-5   0.92    0        0.08    0.92

Aggregation method accn:
o_i        m_i(R)  m_i(¬R)  m_i(Θ)  Bel_i(R)
o_c1       0.16    0        0.84    0.16
o_c2       0.12    0        0.88    0.12
o_c1-2     0.261   0        0.739   0.261
o_c3       0       0        1       0
o_c1-3     0.261   0        0.739   0.261
o_c4       0       0        1       0
o_c1-4     0.261   0        0.739   0.261
o_c5       0       0        1       0
o_c1-5     0.261   0        0.739   0.261
o_p        0       0        1       0
o_p,c1-5   0.261   0        0.739   0.261

Aggregation method notR:
o_i        m_i(R)  m_i(¬R)  m_i(Θ)  Bel_i(R)
o_c1       0.8     0        0.2     0.8
o_c2       0.6     0        0.4     0.6
o_c1-2     0.92    0        0.08    0.92
o_c3       0       0.1      0.9     0
o_c1-3     0.91    0.01     0.08    0.91
o_c4       0       0.1      0.9     0
o_c1-4     0.90    0.02     0.08    0.90
o_c5       0       0.1      0.9     0
o_c1-5     0.89    0.03     0.08    0.89
o_p        0       0.1      0.9     0
o_p,c1-5   0.88    0.04     0.08    0.88
To determine the BEP in a site with respect to relevance, we rank the pages in the site by their total belief in {R}: Bel(R) = m(R).

Aggregation methods. By appropriately setting the accessibility and propagation factors, we can express various aggregation methods. Suppose that only pages c1, c2 (site A) and pages y1, y2 (site B) are assigned non-zero relevance scores given all available evidence e, i.e. are retrieved given evidence e. Suppose also that the bpas for these retrieved pages (given evidence e) are: mc1(R) = my1(R) = 0.8 and mc2(R) = my2(R) = 0.6, and that we associate the non-retrieved pages with vacuous bpas: mj(Θ) = 1, j = {p, c3, c4, c5, x, y3}.

Method acc1 sets the accessibility of each child, i.e. its individual contribution to the aggregation, equal to 1. The aggregation of objects oc1, oc2 (site A) yields object oc1−2 (Table 1). The belief of the aggregate object in {R}, mc1−2(R) = 0.92, is greater than that of either of its component objects. Since the non-retrieved children, oc3, oc4 and oc5, are associated with vacuous bpas, their aggregations with oc1−2, for forming oc1−5, leave mc1−2 unaffected, i.e. mc1−5(R) = mc1−2(R) = 0.92. Similarly for site B, my1−3(R) = my1−2(R) = 0.92. If the propagation factor is uniformly set across sites (e.g. prop = 1), the belief in pages p and x is the same, despite having different numbers of non-retrieved children. Method acc1 considers only the contribution of the retrieved children. To model page x as a better BEP than page p, since it provides access to less non-relevant information, we need to consider the non-retrieved pages. One way
is to set the propagation factor prop = 1/n (n is the number of children). Another is to set the accessibility acc = 1/n (method accn in Table 1). With a propagation factor uniformly set across sites (e.g. prop = 1), page x is now considered a better BEP than page p (mx,y1−3(R) = 0.416 > 0.261 = mp,c1−5(R)). Method accn greatly discounts the contribution of the children in the aggregation.

Method notR explicitly takes into account the non-retrieved pages, by modelling them not with a vacuous bpa, but with a bpa that assigns belief to {¬R}. Suppose mj(¬R) = 0.1, j = {p, c3, c4, c5}. We set acc = 1 and form object oc1−2 as before. Objects oc1−2, oc3 support conflicting propositions. Their aggregation erodes the beliefs in them, and mc1−2(R) = 0.92 becomes mc1−3(R) = (mc1−2(R) · mc3(R) + mc1−2(R) · mc3(Θ) + mc1−2(Θ) · mc3(R)) / (1 − mc1−2(R) · mc3(¬R) − mc1−2(¬R) · mc3(R)) = (0.92 · 0 + 0.92 · 0.9 + 0.08 · 0) / (1 − 0.92 · 0.1 − 0 · 0) = 0.91 (Table 1). Greater values of mi(¬R) for non-retrieved pages lead to even greater erosion. Also, the more non-retrieved children are included in the aggregation, the more the belief in {R} is reduced. By setting prop = 1, mp,c1−5(R) = 0.88 < 0.90 = mx,y1−3(R). The values of m(¬R) can be determined experimentally or by evidence reflecting, for instance, the system's reliability or the query's difficulty.

This aggregation of linked pages belonging to the same site can be performed in an ascending manner (bottom-up propagation), starting from the pages deepest in the site's hierarchy. To remove the cycles from the site's structure, we construct a sitemap tree (similarly to [15]), using only Down type links (i.e. those linking pages with those below in the site's directory path [9]). Alternatively, we can perform a 1step propagation, by considering for each page only its immediate neighbours (not necessarily just those connected with Down links).
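The aggregation methods above can be expressed on top of the combination rule sketched in Section 2; the figures reproduce the site A example (children c1, c2 retrieved with scores 0.8 and 0.6). This sketch assumes the `combine` function, `THETA` and `R` from that earlier sketch.

```python
# Assumes combine(), THETA and R from the earlier Dempster-Shafer sketch.

def discount(m, delta):
    """delta-discounted bpa: m_delta(A) = delta * m(A) for A != THETA,
    m_delta(THETA) = delta * m(THETA) + 1 - delta."""
    out = {A: delta * v for A, v in m.items() if A != THETA}
    out[THETA] = delta * m.get(THETA, 0.0) + 1.0 - delta
    return out

def aggregate(parent, children, acc, prop):
    """Equations (1)-(2): combine accessibility-discounted children,
    then propagate the result to the parent with factor prop."""
    agg = {THETA: 1.0}                       # vacuous bpa as neutral element
    for child, a in zip(children, acc):
        agg = combine(agg, discount(child, a))
    return combine(parent, discount(agg, prop))

NOT_R = frozenset({"notR"})
vacuous = {THETA: 1.0}
c1, c2 = {R: 0.8, THETA: 0.2}, {R: 0.6, THETA: 0.4}

# Method acc1 on site A: retrieved children fully accessible, prop = 1.
m = aggregate(vacuous, [c1, c2, vacuous, vacuous, vacuous], acc=[1] * 5, prop=1.0)
print(round(m[R], 2))   # 0.92, as in Table 1

# Method notR: non-retrieved pages assign 0.1 to {notR} instead of being vacuous.
non = {NOT_R: 0.1, THETA: 0.9}
m = aggregate(non, [c1, c2, non, non, non], acc=[1] * 5, prop=1.0)
print(round(m[R], 2))   # 0.88
```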
3.2 The Extended Framework: Combining Evidence for Relevance Criteria
In this section, we explicitly consider the criteria underlying relevance.

Frame of discernment. The frame Θ is constructed based on the set of criteria of interest: E = {e1, · · · , eE}. The mutually exclusive elementary propositions of Θ are all the possible Boolean conjunctions of all the elements ei ∈ E, containing either ei or ¬ei. There are 2^E elements in Θ, each denoted as θ_{b1 b2 ···bn}, with b1 b2 · · · bn an n-bit binary number, such that θ_{b1 b2 ···bn} corresponds to the proposition “x1 ∧ x2 ∧ · · · ∧ xn”, where xi = ei if bi = 1, and xi = ¬ei if bi = 0. Suppose the criteria of interest are topicality (T), authority (A), and homepageness (HP): E = {T, HP, A}. Then, the propositions forming the frame Θ are listed in Table 2. For instance, θ111 corresponds to {T ∧ HP ∧ A}, reflecting that a page is “a good point to enter to access homepages containing authoritative information on the topic”. Analogously, {T ∧ A} reflects that a page is “a good point to enter to access topical and authoritative information”. Therefore, θ111 provides a more refined representation of the notion of topical relevance compared to {T ∧ A}, {T ∧ HP} or {T}. In this work, we focus on E = {T, HP}. (Due to space limitations, we do not report on criterion {A}, that we also considered.)
Table 2. Propositions forming the frame of discernment Θ in the extended framework

θ000: ¬T ∧ ¬HP ∧ ¬A    θ100: T ∧ ¬HP ∧ ¬A
θ010: ¬T ∧ HP ∧ ¬A     θ110: T ∧ HP ∧ ¬A
θ001: ¬T ∧ ¬HP ∧ A     θ101: T ∧ ¬HP ∧ A
θ011: ¬T ∧ HP ∧ A      θ111: T ∧ HP ∧ A
Table 3. Combination in the extended frame (site A)

o_i      m_i(T)  m_i(HP)  m_i(T∧HP)  m_i(Θ)   Bel_i(T)  Bel_i(HP)  Bel_i(T∧HP)
o_c1:C   0.8     0        0          0.2      0.8       0          0
o_c1:U   0       0.6      0          0.4      0         0.6        0
o_c1     0.32    0.12     0.48       0.08     0.8       0.6        0.48
o_c2:C   0.6     0        0          0.4      0.6       0          0
o_c2:U   0       0.7      0          0.3      0         0.7        0
o_c2     0.18    0.28     0.42       0.12     0.6       0.7        0.42

Table 4. Aggregation in the extended frame (site A)

o_i        m_i(T)  m_i(HP)  m_i(T∧HP)  m_i(Θ)   Bel_i(T)  Bel_i(HP)  Bel_i(T∧HP)
o_c1       0.32    0.12     0.48       0.08     0.8       0.6        0.48
o_c2       0.18    0.28     0.42       0.12     0.6       0.7        0.42
o_c1-2     0.11    0.07     0.81       0.01     0.92      0.88       0.81
o_c3       0       0        0          1        0         0          0
...
o_p,c1-5   0.11    0.07     0.81       0.01     0.92      0.88       0.81
Representation. Consider that we have two sources of evidence for each page: one capturing topicality {T}, and the other homepageness {HP}. Then, each page is represented by two separate bpas. Suppose the content-based (C) score, for page c1 (site A), reflecting its topicality, is 0.8 and its URL-based (U) one, reflecting its homepageness, is 0.6. Then, we have mc1:C and mc1:U (Table 3).

Combination. The combination mc1 = mc1:C ⊕ mc1:U (Table 3) assigns belief to propositions {T}, {HP}, and their conjunction {T ∧ HP}. Given the initial belief mc1:C(T) = 0.8, we were unable to draw any finer distinction about the type of topicality supported, i.e. {T ∧ HP} or {T ∧ ¬HP}. Once evidence for {HP} became available, some of this initial belief was assigned to subset {T ∧ HP}, but the total belief in {T}, Belc1(T) = mc1(T) + mc1(T ∧ HP) = 0.8, remained the same. We also combine the evidence for oc2: mc2 = mc2:C ⊕ mc2:U (Table 3).

Aggregation. Suppose we use aggregation method acc1. The aggregation mc1−2 = mc1 ⊕ mc2 (Table 4) further redistributes the belief among non-disjoint propositions. The aggregation in terms of Bel is not affected by this distribution of belief, since it is only concerned with the total belief assigned to propositions. For instance, Belc1−2(T) = Belc1(T) ⊕ Belc2(T) = 0.8 ⊕ 0.6 = 0.92. Furthermore, this, in essence, corresponds to Belc1:C(T) ⊕ Belc2:C(T) = 0.8 ⊕ 0.6 = 0.92, i.e. the aggregation in terms of Bel, irrespective of the additional evidence incorporated, produces the same results as if a single source is considered. Aggregating with oc3, oc4, oc5, and op, while setting prop = 1, leads to mp,c1−5 (Table 4).

When multiple evidence are aggregated, we can produce many rankings using the belief Bel in different propositions, and determine BEPs with respect to different criteria. For instance, Bel(T) identifies BEPs for accessing topically relevant pages, while Bel(T ∧ HP) BEPs for accessing topical homepages. Advantages of using DS theory include assigning belief to criteria for which there is evidence, rather than, of necessity, to every criterion. Also, we can assign belief to a set of propositions, without having to distribute belief among its individual propositions. Finally, the relaxation of the law of additivity (m(A) + m(¬A) ≤ 1) allows us to flexibly represent web pages given the available evidence.
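With frozensets of the elementary propositions of Table 2 (restricted to E = {T, HP}), the same `combine` and `belief` functions sketched in Section 2 reproduce the Table 3 numbers for page c1; the frame encoding below is an illustrative choice.

```python
# Assumes combine() and belief() from the earlier Dempster-Shafer sketch.

# Elementary propositions of the frame for E = {T, HP} (cf. Table 2, ignoring A).
t_hp, t_nhp, nt_hp, nt_nhp = "T&HP", "T&~HP", "~T&HP", "~T&~HP"
THETA2 = frozenset({t_hp, t_nhp, nt_hp, nt_nhp})
T = frozenset({t_hp, t_nhp})        # "on the topic"
HP = frozenset({t_hp, nt_hp})       # "a homepage"
T_AND_HP = frozenset({t_hp})        # "a topical homepage"

# Page c1: content-based score 0.8 (topicality), URL-based score 0.6 (homepageness).
m_c = {T: 0.8, THETA2: 0.2}
m_u = {HP: 0.6, THETA2: 0.4}
m_c1 = combine(m_c, m_u)

print(round(m_c1[T], 2), round(m_c1[HP], 2), round(m_c1[T_AND_HP], 2))   # 0.32 0.12 0.48
print(round(belief(m_c1, T), 2), round(belief(m_c1, T_AND_HP), 2))       # 0.8 0.48
```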
4 Experiments
We perform evaluation experiments using the .GOV corpus and the topics and relevance assessments from TREC's Topic Distillation (TD) task (50 topics from TD2003 [5] and 75 topics from TD2004 [6]). We index the pages in the collection by combining their content and incoming anchor text. We apply stopword removal and stemming and use the weighting scheme and retrieval component employed in InQuery [3]. This content-based retrieval approach (C) is our baseline. To select the BEP from each site, we group, by their domain name, the top 500 pages retrieved by C, and apply aggregation approaches to each group.

We perform the aggregation in our extended DS framework with criteria of interest topicality (T) and homepageness (HP), i.e. we form Θ based on E = {T, HP}. In Section 4.1, we focus on topicality and rank the pages by their Bel(T). In Section 4.2, we also consider their homepageness and rank them by their Bel(T ∧ HP). The belief m{T} is quantified by the content-based relevance score (C), whereas m{HP} by a query-independent URL-based relevance score (U), computed using each page's URL path length: 1/log2(urlpathlen + 1) [14].

We apply the following aggregation methods: DS acc1(prop), DS accn(prop), and DS notR(prop, notT), where prop is the propagation factor, and notT the belief experimentally assigned to proposition {¬T} for pages not in the top 500 retrieved by C: m(¬T) = notT. We compare these DS aggregations to linear combination (LC) aggregations, which can be considered to derive from equations (1) and (2) (Section 3.1), by replacing DS combination (⊕) with addition (+). These aggregation methods are LC acc1(prop) and LC accn(prop). For each of these aggregation methods (DS acc1, LC acc1, DS accn, LC accn, DS notR), we apply the propagation strategies: bottom-up and 1step Down. These two strategies consider only the Down type links and our results indicate that they perform similarly. Therefore, we only present the more efficient 1step Down propagation. We also apply 1step propagation by aggregating linked pages connected with all, not only Down, types of intra-site links (1step All).

To tune parameters prop and notT, we use TD2003 as our training set, with TD2004 becoming our test set. Tuning prop involved an exploration from 0.1 to 1 at step 0.1, and tuning notT an exploration from 0.1 to 0.9 at step 0.1. These tunings aimed at maximising P@10. We select P@10 because, in TD2003, mean average precision (MAP) and R-precision (precision at R, R = number of relevant documents for a query) are more sensitive than P@10 [18]. We also set prop = 1/n (n = number of children), which led to poor results and is not reported. In all the presented tables, the effectiveness values improving over the baseline are depicted in bold. Statistically significant results, indicated by a *, are determined by applying a Wilcoxon matched-pairs signed ranks test (α = 0.05).
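The per-query significance testing mentioned above can be carried out as sketched here, assuming SciPy is available; the two arrays of per-topic scores are hypothetical values for a baseline run and an aggregation run.

```python
from scipy.stats import wilcoxon

# Hypothetical per-topic effectiveness scores for two runs on the same topics.
baseline = [0.20, 0.10, 0.30, 0.05, 0.25, 0.15, 0.40, 0.10, 0.50, 0.00]
bep_run  = [0.30, 0.15, 0.35, 0.10, 0.20, 0.25, 0.45, 0.20, 0.55, 0.10]

stat, p_value = wilcoxon(baseline, bep_run)   # matched-pairs signed ranks test
print(p_value < 0.05)                         # significant at alpha = 0.05?
```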
4.1 Experiments in Aggregating Evidence for Topicality
First, we select the BEP from each site with respect to topical relevance criteria, i.e. we select pages that provide access to topically relevant information. To this end, we aggregate, in a DS or linear manner, the content-based relevance scores of linked pages. Previous research has already indicated that the within-site linear aggregation of content-based relevance scores is effective for Topic Distillation [15]. Our objectives are: (i) to examine the effectiveness of this aggregation when modelled within our DS framework (and also compare it to a linear aggregation) and (ii) to gain an insight into the workings of the aggregation, by studying the effect of the various aggregation methods and propagation strategies.

In the training set, for most propagation strategies (except for DS accn and LC accn for 1step Down), the lower the contribution of the children as a whole (determined by the propagation factor), the better the results. The best results were achieved for prop = 0.1. Also, the more links were considered (1step All vs. 1step Down), the more the effectiveness improved, suggesting that evidence from pages connected with all types of links is beneficial. We applied each propagation strategy, with its most effective parameter(s) for each aggregation method, to our test set (Table 5).

Table 5. Aggregating content-based evidence (top 500 pages retrieved by C), TD2004

Run                                                MAP     P@5      P@10     R-Prec.
C (baseline)                                       0.1237  0.2187   0.1893   0.1622
BEP DS acc1  1step Down  (prop = 0.1)              0.1064  0.2480*  0.2013   0.1628
BEP DS acc1  1step All   (prop = 0.1)              0.1347  0.2827*  0.2187*  0.1974*
BEP LC acc1  1step Down  (prop = 0.1)              0.0998  0.2213   0.1947   0.1537
BEP LC acc1  1step All   (prop = 0.1)              0.1136  0.2453*  0.2120*  0.1873*
BEP DS accn  1step Down  (prop = 0.9)              0.0977  0.2213   0.1880   0.1541
BEP DS accn  1step All   (prop = 0.1)              0.1121  0.2373*  0.2027   0.1638
BEP LC accn  1step Down  (prop = 0.8)              0.0981  0.2213   0.1880   0.1541
BEP LC accn  1step All   (prop = 0.1)              0.1121  0.2453*  0.2013   0.1579
BEP DS notR  1step Down  (prop = 0.2, notT = 0.1)  0.1069  0.2373   0.2000   0.1566
BEP DS notR  1step All   (prop = 0.1, notT = 0.1)  0.1116  0.2533*  0.2067*  0.1669

Our training set observation, that considering low contributing evidence from all children is beneficial, is confirmed by our test set results. While the contribution of the children pages as a whole is determined by prop, the contribution of each individual child is determined by the aggregation method. Method acc1 considers only the children retrieved by C. Method accn greatly discounts the contribution of each child and thus is indirectly affected by the non-retrieved children. Method DS notR is directly affected by non-retrieved children, with their contribution expressed through notT. In the training set, acc1 was the most effective method, followed by notR, whereas accn did not perform particularly well. These observations are confirmed by the test set results, indicating that although the contribution of the retrieved children should be low (expressed through low prop values), it should not be too greatly discounted (as achieved by accn). This is further supported by DS notR being most effective for low notT values, i.e. notT = 0.1, which discount the contribution of retrieved pages more gradually than accn (see Table 1). The most effective methods, DS acc1, LC acc1, and DS notR, for 1step All, improve P@10 significantly over the baseline, with DS acc1 1step All also improving MAP.

Overall, our results confirm previous findings that aggregating content-based evidence from the retrieved children of a web page is beneficial for Topic Distillation [15,5,6]. Our framework allowed us to study these aggregations further,
indicating that the contribution of the retrieved children should be low, but not too greatly discounted. In addition, considering only the immediate neighbours is sufficient, with the most effective and robust strategy (1step All) taking into account all linked (immediate) children. These findings apply for both DS and linear aggregations, with the DS aggregation performing comparatively better. Our DS framework also provides the expressiveness and flexibility to incorporate evidence for additional relevance criteria, e.g. homepageness, discussed next.
4.2 Experiments in Combining and Aggregating Evidence for Topicality and Homepageness
These experiments aim at assessing relevance with respect to the topicality (T) and homepageness (HP) relevance criteria, by considering the available evidence capturing each criterion, i.e. the content-based (C) and URL-based (U) scores of web pages. First, we combine, for each page, its two scores, producing, in essence, a reranking of the C baseline. Next, we express within our DS framework the aggregation of these two scores of linked pages belonging to the same site. In that way, we identify each site's BEP as the page that provides access to homepages containing topically relevant information. We denote this aggregation as T⊕HP.

The combination of the C and U scores is performed in our DS framework and compared to a linear combination. The DS combination is expressed as m(T)⊕m(HP), resulting in belief also assigned to m(T∧HP) (see Table 3). We rerank the top 500 pages retrieved by C in terms of Bel(T∧HP) = m(T∧HP). Since this corresponds, in essence, to a multiplication of the C and U scores, we denote it as CU. The linear combination is expressed as C + w·U. We tune w in TD2003 for values 0.1 to 1 at step 0.1, and achieve the best results for w = 0.2. Both combinations improve significantly over C in TD2004 (Table 6), confirming the usefulness of homepage evidence for this task [5,6]. They are also more effective than the aggregations of content-based evidence (see Table 5).

Next, we perform the T⊕HP aggregation, using CU and C+0.2U as baselines. Our training set results for the T⊕HP aggregation indicate that, when also considering homepage evidence, the contribution of the retrieved children is still beneficial, but should be greatly discounted. In fact, accn was the most effective, followed by notR and then acc1, with all achieving their best results for prop = 0.1. Also, considering only a few of the children might be sufficient, since 1step Down performed comparably to 1step All. We apply the most effective approaches for the T⊕HP aggregation to the test set (Table 6). Our training set observations are confirmed by our test set results. The most effective method is accn, with 1step Down improving P@10 significantly over all baselines. Method notR is slightly less effective, but still improves, though not significantly, over the baselines, whereas acc1 only improves over the content-based baseline (C).

Previous research has examined either the aggregation of content-based evidence from linked pages, or the combination of content-based and homepage evidence for a single page. We integrate these approaches, and indicate that by incorporating evidence beyond content when aggregating linked pages, as modelled by our DS framework, we can further improve the effectiveness.
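To make the combination step concrete, the following is a minimal sketch (not the authors' code) of the two reranking strategies just described: the DS-style combination, which in essence multiplies the content-based and URL-based scores (CU), and the linear combination C + w·U with w = 0.2. The page identifiers and scores are hypothetical.

```python
# Illustrative sketch of reranking pages retrieved by the content-only run C
# by combining content-based (C) and URL-based (U) scores.
pages = {
    # page_id: (content_score_C, url_score_U)  -- made-up values
    "siteA/index.html": (0.62, 0.90),
    "siteA/docs/page3.html": (0.71, 0.20),
    "siteB/index.html": (0.40, 0.85),
}

def rerank_cu(pages):
    """DS-style combination: belief in T AND HP reduces to the product of the two scores."""
    return sorted(pages, key=lambda p: pages[p][0] * pages[p][1], reverse=True)

def rerank_linear(pages, w=0.2):
    """Linear combination C + w*U (w tuned to 0.2 in the paper)."""
    return sorted(pages, key=lambda p: pages[p][0] + w * pages[p][1], reverse=True)

print("CU ranking:     ", rerank_cu(pages))
print("C+0.2U ranking: ", rerank_linear(pages))
```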
Table 6. Combining/Aggregating content- and URL-based evidence (top 500 pages retrieved by C), TD2004

Run                                                    MAP      P@5      P@10     R-Prec.
C (baseline for CU and C+0.2U)                         0.1237   0.2187   0.1893   0.1622
CU (baseline for BEP T⊕HP approaches)                  0.1478*  0.2827*  0.2227*  0.1881*
C+0.2U (baseline for BEP T⊕HP approaches)              0.1504*  0.2827*  0.2213*  0.1752
BEP T⊕HP DS acc1, 1step Down, prop = 0.1               0.0971   0.1947   0.1800   0.1623
BEP T⊕HP DS acc1, 1step All, prop = 0.1                0.1014   0.2267   0.1920   0.1663
BEP T⊕HP DS accn, 1step Down, prop = 0.1               0.1312   0.2720   0.2413*  0.1985
BEP T⊕HP DS accn, 1step All, prop = 0.1                0.1287   0.2773   0.2373   0.2027
BEP T⊕HP DS notR, 1step Down, prop = 0.1, notT = 0.9   0.1238   0.2560   0.2387   0.1934
BEP T⊕HP DS notR, 1step All, prop = 0.1, notT = 0.1    0.1245   0.2773   0.2200   0.1876

5 Conclusions
We proposed a framework that assesses relevance with respect to any of the relevance criteria of interest, by combining the evidence indicating these criteria, derived both from a web page and its linked web pages. We estimate the belief in relevance and perform this combination using the Dempster-Shafer (DS) theory of evidence. The expressiveness and flexibility of the framework are demonstrated by the ease with which the combination with respect to any relevance criterion is expressed, the aggregation of evidence from linked pages is incorporated, and negated evidence can be considered. We evaluated the framework in the context of TREC's Topic Distillation task, and in terms of the topicality and homepageness relevance criteria. Our experiments indicated the effectiveness of aggregating content-based evidence on its own, or together with homepage evidence, and allowed us to study the workings of aggregation methods.
References
1. C. L. Barry. User-defined relevance criteria: An exploratory study. JASIS, 45(3):149–159, 1994.
2. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
3. J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In DEXA'92, pages 78–83, 1992.
4. Y. Chiaramella. Information retrieval and structured documents. In European Summer School in IR, volume 1980 of LNCS, pages 286–309, 2001.
5. N. Craswell and D. Hawking. Overview of the TREC-2003 web track. In TREC-2003.
6. N. Craswell and D. Hawking. Overview of the TREC-2004 web track. In TREC-2004.
7. N. Craswell, S. Robertson, H. Zaragoza, and M. Taylor. Relevance weighting for query independent evidence. In SIGIR'05, pages 416–423, 2005.
8. A. Dempster. A generalization of Bayesian inference. Journal of the Royal Statistical Society, 30:205–247, 1968.
9. N. Eiron and K. S. McCurley. Untangling compound documents on the web. In ACM Hypertext and Hypermedia conference, pages 85–94, 2003.
10. N. Fuhr, M. Lalmas, S. Malik, and Z. Szlavik, editors. Proceedings of INEX 2004.
11. J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
12. W. Kraaij, T. Westerveld, and D. Hiemstra. The importance of prior probabilities for entry page search. In SIGIR'02, pages 27–34, 2002.
13. M. Marchiori. The quest for correct information on the web: hyper search engines. In Proceedings of the 6th WWW conference, pages 1225–1235, 1997.
14. V. Plachouras and I. Ounis. Usefulness of hyperlink structure for query-biased topic distillation. In SIGIR'04, pages 448–455, 2004.
15. T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen, and W.-Y. Ma. A study of relevance propagation for web search. In SIGIR'05, pages 408–415, 2005.
16. T. Roelleke, M. Lalmas, G. Kazai, I. Ruthven, and S. Quicker. The accessibility dimension for structured document retrieval. In ECIR'02, pages 382–402, 2002.
17. G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976.
18. I. Soboroff. On evaluating web search with very few relevant documents. In SIGIR'04, pages 530–531, 2004.
Classifier Fusion for SVM-Based Multimedia Semantic Indexing
Stéphane Ayache, Georges Quénot, and Jérôme Gensel
Laboratoire d'Informatique de Grenoble (LIG), 385 rue de la Bibliothèque - BP 53, 38041 Grenoble Cedex 9, France
Abstract. Concept indexing in multimedia libraries is very useful for user searching and browsing, but it is a very challenging research problem as well. Combining several modalities, features or concepts is one of the key issues for bridging the gap between signal and semantics. In this paper, we present three fusion schemes inspired by the classical early and late fusion schemes. First, we present a kernel-based fusion scheme which takes advantage of the kernel basis of classifiers such as SVMs. Second, we integrate a new normalization process into the early fusion scheme. Third, we present a contextual late fusion scheme to merge classification scores of several concepts. We conducted experiments in the framework of the official TRECVID'06 evaluation campaign and we obtained significant improvements with the proposed fusion schemes relative to the usual fusion schemes.
1 Introduction
In order to retrieve multimedia documents from huge digital libraries, the needs for concept-based indexing are rapidly growing. Finding concepts in multimedia documents, such as video sequences, is one of the main objectives of the content-based semantic indexing community. Hence, new issues are arising on the combination (fusion) of several features, modalities and/or intermediate concepts to obtain a better accuracy of concept detection. For instance, an efficient fusion scheme must enhance concept indexing in multimedia documents by merging visual and textual modalities, color and texture modalities, or global and local features. Using a generic framework, usual approaches propose either to merge data on a concatenated vector before achieving classification [1, 2], or to perform several classifications and then to merge confidence scores using a higher level classifier [6, 11] by means of a stacking technique [16]. Called "early" and "late" fusion [13], those approaches are easy to implement and provide state of the art performance. However, such fusion schemes are not always able to outperform unimodal classifiers, especially when one of the modalities provides much better accuracy than the others or when one has to handle imbalanced input features. Such situations are particularly frequent in the field of multimedia indexing due to the diversity of concepts with regard to the extracted features.
Using kernel-based classifiers, for instance Support Vector Machines, recent approaches have been proposed that take advantage of some useful kernel properties. They aim to merge features at the kernel level before performing the concept classification. Kernel-based data fusion has been successfully applied in biology to the problem of predicting the function of yeast proteins [8]. [8, 15] propose efficient algorithms to learn simultaneously the parameters of each unimodal kernel and the parameters of the combining function. In this paper, we study and compare three fusion schemes in the scope of semantic video indexing. The first one takes advantage of some useful kernel properties: we present a simple algorithm which merges unimodal kernels before performing the concept classification using an SVM classifier. In such a way, features are combined at the earliest possible step when using a kernel-based classifier. The second fusion scheme is derived from the early fusion scheme. We normalize each individual feature vector so that their average norms become equal, in order to reduce the problem of imbalanced input features. The third fusion scheme is a late-like fusion scheme; it performs fusion at the concept level, taking into account the classification scores of 39 concepts from visual and textual modalities. In Section 2, we briefly present Support Vector Machines and some required knowledge about kernels. In Section 3, we describe the proposed fusion schemes and give some background information for formally comparing them with other fusion schemes. In Section 4, we describe the experiments conducted using the TRECVID'06 [7] corpus and metrics.
2 Kernel-Based Classifier
Kernel-based methods, such as KFD, KPCA or SVM [12], have provided successful tools for solving many recognition problems. One of the reasons for this success is the use of kernels, which overcome the problem of non-linearly separable data sets by mapping the initial problem into a higher dimensional space. The main idea behind kernel-based classifiers is that the similarity between examples in a data set gives much information about the patterns that may be present in these data.
2.1 Support Vector Machines
Support Vector Machines (SVM) have shown their capacities in pattern recognition and have been widely used for classification in CBIR. SVM is formalized as an optimization problem which finds the best hyperplane separating relevant and irrelevant vectors by maximizing the size of the margin between both sets. The use of a kernel allows the algorithm to find the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space may be of higher dimensionality than the original one. Thus, though the classifier separator is a hyperplane in the high-dimensional feature space it may be non-linear in the original input space. Furthermore, if the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well
regularized and the infinite dimension does not spoil the results. In a two-class case, the decision function for a test sample x has the following form:
g(x) = Σ_i α_i y_i K(x_i, x) − b
where K(x_i, x) is the value of a kernel function for the training sample x_i and the test sample x, y_i the class label of x_i (+1 or −1), α_i the learned weight of the training sample x_i, and b is a learned threshold parameter. The training samples with weight α_i > 0 are usually called "support vectors".
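As an illustration of how this decision function is evaluated, the following is a small, self-contained Python sketch (an assumption; the paper gives no implementation) with made-up support vectors, weights and threshold.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    # g(x) = sum_i alpha_i * y_i * K(x_i, x) - b
    return sum(a * y * rbf_kernel(sv, x, sigma)
               for a, y, sv in zip(alphas, labels, support_vectors)) - b

# Toy example with hypothetical support vectors and learned parameters
support_vectors = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.8, 0.8]          # learned weights (alpha_i > 0)
labels = [-1, +1]            # class labels y_i
b = 0.0                      # learned threshold

x = np.array([0.9, 0.8])
score = svm_decision(x, support_vectors, alphas, labels, b)
print("g(x) =", score, "-> predicted class:", +1 if score > 0 else -1)
```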
2.2 Kernel Matrices
A kernel matrix is a similarity matrix where each entry represents a measure of similarity between two sample vectors, and it must be positive definite (i.e. satisfy Mercer's condition) to ensure that the optimization problem is convex. Therefore, a clever choice of the kernel (or similarity) function is essential to obtain a positive definite kernel matrix. Nevertheless, some unproved positive definite kernels such as EMD-based kernels or Log kernels have been successfully used in image recognition [18, 3]. Kernel matrices satisfying Mercer's condition have interesting properties which provide modularity and the derivation of kernels from other kernels. If K and K′ are kernels, then the following are also kernels (not exhaustive):
– aK + bK′, for a > 0 and b ≥ 0
– K × K′, where × is the entrywise product
The most commonly used kernel function with SVM classifiers in multimedia indexing is the RBF kernel, defined as follows:
K(x, y) = exp(−‖x − y‖² / (2σ²))
where ‖·‖ denotes the L2 norm, x and y are two sample vectors, and σ is the width of the Gaussian kernel, generally determined using cross-validation. RBF kernels have exhibited good generalization properties in many classification problems. However, the use of a simple Euclidean distance implies small variations in the kernel value in high dimensional feature spaces.
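The following sketch (Python/NumPy, which the paper does not prescribe) builds RBF kernel matrices and numerically checks the properties mentioned above: symmetry, positive semi-definiteness, and the fact that weighted sums and entrywise products of kernels are again kernels.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, sigma):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))                 # 6 samples, 10-dimensional features

K1 = rbf_kernel_matrix(X, X, sigma=1.0)
K2 = rbf_kernel_matrix(X, X, sigma=2.0)
combined = 2.0 * K1 + 0.5 * K2               # aK + bK' is again a kernel
product = K1 * K2                            # entrywise product is again a kernel

for K in (K1, K2, combined, product):
    assert np.allclose(K, K.T)                       # symmetric
    assert np.all(np.linalg.eigvalsh(K) > -1e-10)    # positive semi-definite
print("all matrices are valid Mercer kernel matrices")
```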
3 Fusion Schemes
We present in this section three fusion schemes inspired by the usual early and late fusion schemes. Those schemes use a classifier to learn the relations between modality components at different abstraction levels. Figure 1 describes the process of the early and late fusion schemes. The feature extraction (FE) process extracts and creates a vector for each modality of the video item. We show the SVM process as two main steps: first, the construction of the kernel; then, the Learning or Classification (L/C) process aims to assign a classification score to the video item.
Fig. 1. Classical Early and Late Fusion schemes: (a) Early Fusion, (b) Late Fusion
Merging all the descriptors into a single flat classifier leads to a fully integrated fusion strategy since the fusion classifier obtains all the information from all sources. The advantage of such a scheme is its capacity to learn the regularities formed by the components independently from the modalities. Also, it is easy to use as it just consists in concatenating the various data in a single vector. The main disadvantage is the use of a unique approach (classifier and/or kernel) to merge different types of information. Assuming an RBF kernel and two sample vectors x and y from sets of features 1 and 2, the classical early fusion scheme leads to the following kernel:
K(x, y) = exp(−‖x − y‖² / (2σ²)) = exp(−(‖x_1 − y_1‖² + ‖x_2 − y_2‖²) / (2σ²)) = exp(−‖x_1 − y_1‖² / (2σ²)) · exp(−‖x_2 − y_2‖² / (2σ²))
This formulation shows that, using an SVM classifier with RBF kernels, an early fusion scheme is equivalent to multiplying unimodal kernels which share the same σ parameter. The σ parameter is often fixed by cross-validation; it is then optimal for the concatenated vectors, but not necessarily for each modality.
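This equivalence is easy to verify numerically. The sketch below (Python/NumPy, with random vectors standing in for two modalities) checks that the RBF kernel of the concatenated vectors equals the product of the per-modality RBF kernels when a single σ is shared.

```python
import numpy as np

def rbf(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
x1, y1 = rng.normal(size=20), rng.normal(size=20)   # modality 1 (e.g. visual)
x2, y2 = rng.normal(size=5), rng.normal(size=5)     # modality 2 (e.g. text)
sigma = 1.5                                          # single shared parameter

# Early fusion: RBF kernel on the concatenated vectors ...
k_early = rbf(np.concatenate([x1, x2]), np.concatenate([y1, y2]), sigma)
# ... equals the product of the per-modality RBF kernels with the same sigma.
k_product = rbf(x1, y1, sigma) * rbf(x2, y2, sigma)

print(np.isclose(k_early, k_product))   # True
```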
Late fusion is performed on top of several classifiers. It has been presented using different formalisms, such as meta-classification, which aims to re-classify the classification results produced by other classifiers [9]. The closest theory to illustrate late fusion is stacked generalization (stacking) [16], which is part of the ensemble methods [5]. The idea behind ensemble learning methods (e.g. bagging, boosting, stacking) is to improve generalization by training more than one model on each problem (e.g. training 10 SVMs instead of just one) and then combining their predictions by averaging, by voting or by other methods. Using stacking, the combination is achieved by a final classifier which provides the final result. Hence, in the context of multimedia indexing, the late fusion scheme consists in performing a first classification separately on each modality and then merging the outputs using a higher level classifier. In this way, in contrast with early fusion, one can use different classifier algorithms and different training sets according to the modalities. Furthermore, the late fusion scheme also makes it possible to combine various classifiers for the same modality. However, the significant dimensionality reduction induced by the stacked classifiers might be a disadvantage as the fusion classifier cannot fully benefit from the correlation among the sources of information.
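The sketch below illustrates the stacking idea with synthetic data. It assumes scikit-learn as the classifier library (the paper does not name its implementation); in a real setting the stacked classifier would be trained on cross-validated unimodal scores rather than on scores from the same training split.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
y = rng.integers(0, 2, size=n)                        # concept present / absent
X_visual = rng.normal(size=(n, 50)) + y[:, None]      # synthetic "visual" features
X_text = rng.normal(size=(n, 10)) + 0.5 * y[:, None]  # synthetic "text" features

# First layer: one classifier per modality (here both SVMs, but they could differ).
clf_visual = SVC(kernel="rbf", probability=True).fit(X_visual[:150], y[:150])
clf_text = SVC(kernel="rbf", probability=True).fit(X_text[:150], y[:150])

# Second layer (stacking): a classifier over the unimodal confidence scores.
scores = np.column_stack([
    clf_visual.predict_proba(X_visual)[:, 1],
    clf_text.predict_proba(X_text)[:, 1],
])
meta = LogisticRegression().fit(scores[:150], y[:150])

print("late-fusion accuracy on held-out items:", meta.score(scores[150:], y[150:]))
```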
3.1 Kernel Fusion
Kernel combination is currently an active topic in the field of machine learning. It takes advantage of kernel-based classifier algorithms. The advantages of merging modalities at the kernel level are numerous. First, it makes it possible to choose the kernel functions according to the modalities. For instance, color histograms can take advantage of specific histogram matching distances. Likewise, the textual modality can be categorized using appropriate kernels such as string kernels [10] or word-sequence kernels [4]. Kernel fusion also makes it possible to model the data with more appropriate parameters. Merging modalities using an early fusion scheme leads to modelling the data with a single kernel function. Consequently, when using an RBF kernel, a single σ parameter is expected to "fit" properly the relations between sample vectors, whereas it makes much more sense to train a combined RBF kernel using one σ per modality. The combination of unimodal kernels keeps as much information as possible from each modality. A combined RBF kernel has the following form: K_c(x, y) = F(K_1(x_1, y_1), ..., K_M(x_M, y_M)), where K_c(x, y) is the combined kernel value for samples x and y, K_1, ..., K_M are the considered unimodal RBF kernels, F is the combining function over the M modalities, and x_m and y_m are the sample vectors for modality m. Figure 2 shows the kernel fusion process: the unimodal kernels are merged using a fusion function in order to create the multimodal kernel. Then, learning and classification steps aim to assign a classification score to the video item. One of the main issues in current kernel research is the learning of such combined kernels. Called Multiple Kernel Learning, it aims to learn at the
same time the parameters of all the unimodal kernels and the parameters of the combining function [15].

Fig. 2. Kernel Fusion scheme

In our experiments, we used a very simple strategy to create combined kernels. The following algorithm describes the steps to create a combined kernel:
1. Construct each unimodal kernel K_m.
2. Perform cross-validation on each unimodal kernel to fix its parameters.
3. Construct the combined kernel using the combining function F.
4. Perform cross-validation to optimize the parameters of F.
This algorithm assumes that the best parameters of the unimodal kernels are suitable enough to allow efficient generalization of the combined kernel. Combining individual kernels using a product operator is highly comparable to the classic early fusion scheme where feature vectors are just concatenated. The difference is that by performing kernel fusion, each modality m is associated with its own kernel parameters (i.e. σ_m). Furthermore, due to the product operator, this combination might lead to sparse kernels and provide poor generalization. We used the sum operator instead of the product operator to try to avoid too sparse kernel representations. Summing unimodal kernels should be more suitable for concept detection when the features extracted from a single modality are noisy and lead to incorrect detection. We actually combine unimodal kernels by linear combination (weighted sum). Using RBF unimodal kernels, combined kernels are defined by the following formula:
K_c(x, y) = Σ_m w_m · exp(−‖x_m − y_m‖² / (2σ_m²))
where σ_m is the RBF parameter of kernel m and w_m is the weight of the associated modality. The w_m can be fixed a priori or by cross-validation. In the conducted experiments, we optimized the w_m on the training set.
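A minimal sketch of this weighted-sum kernel fusion is shown below, on synthetic two-modality data. It assumes scikit-learn's support for precomputed kernels; the σ_m and w_m values are placeholders standing in for values obtained by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_matrix(X, Y, sigma):
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d / (2 * sigma**2))

rng = np.random.default_rng(3)
n = 120
y = rng.integers(0, 2, size=n)
Xm = [rng.normal(size=(n, 40)) + y[:, None],       # modality 1
      rng.normal(size=(n, 8)) + 0.5 * y[:, None]]  # modality 2

sigmas = [2.0, 1.0]     # per-modality sigma_m (placeholders for cross-validated values)
weights = [0.7, 0.3]    # modality weights w_m (placeholders for optimised values)

# K_c = sum_m w_m * exp(-||x_m - y_m||^2 / (2 sigma_m^2))
K = sum(w * rbf_matrix(X, X, s) for w, s, X in zip(weights, sigmas, Xm))

tr, te = slice(0, 90), slice(90, n)
clf = SVC(kernel="precomputed").fit(K[tr, tr], y[tr])
print("kernel-fusion accuracy:", clf.score(K[te, tr], y[te]))
```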
3.2 Normalized Early Fusion
The number of extracted features depends upon the modalities and the type of the features. Hence, an early fusion scheme based on simple vector concatenation is much affected by the vector which has the highest number of inputs. Such fusion should have an impact on the classification, especially with an RBF kernel, which is based on the Euclidean distance between training samples. In traditional SVM implementations, a normalization process is integrated which aims to transform each input into the same range (e.g. [0..1], [−1..1]) in order to unbias the Euclidean distance. But, for the purpose of merging features, this normalization does not take into account the number of inputs from the source features. The goal of the normalized early fusion scheme is to avoid the problem of imbalanced feature inputs by reprocessing each feature vector before concatenation. We normalize each entry of the concatenated vector so that the average norm of each source vector is about the same. The normalization formula becomes:
x_i = (x_i − min_i) / ((max_i − min_i) × Card(x_i))
where x_i is an input of the feature vector x, min_i and max_i are respectively the minimum and maximum value of the i-th input among the training samples, and Card(x_i) is the number of dimensions of the source vector of x_i.
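The following sketch (Python/NumPy, not the authors' code) applies this normalization to two hypothetical source vectors of very different dimensionality before concatenating them.

```python
import numpy as np

def normalized_concat(vectors_per_modality):
    """Scale every input to [0, 1] and divide by the dimensionality of its source
    vector, so each modality contributes a similar average norm, then concatenate.
    Follows x_i = (x_i - min_i) / ((max_i - min_i) * Card(x_i))."""
    parts = []
    for X in vectors_per_modality:          # X: (n_samples, n_dims) for one modality
        mins = X.min(axis=0)
        maxs = X.max(axis=0)
        span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
        parts.append((X - mins) / (span * X.shape[1]))
    return np.hstack(parts)

rng = np.random.default_rng(4)
X_visual = rng.normal(size=(6, 1560))   # e.g. 1560 visual dimensions
X_text = rng.normal(size=(6, 100))      # e.g. 100 textual dimensions
X = normalized_concat([X_visual, X_text])
print(X.shape)                           # (6, 1660)
```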
3.3 Contextual-Late Fusion
The usual late fusion scheme first classifies each concept using the individual modalities and then merges the scores in a second-layer classifier. Here, we generalize this scheme by considering more than a single concept. Contextual information has been widely exploited in multimedia indexing [14, 11]. Here, the second-layer (stacked) classifier is able to exploit the contextual relations between the different concepts. The proposed scheme merges the unimodal classification scores of a set of several concepts, in order to exploit both multimodal and conceptual contexts. Assume that we have M modalities (e.g. visual, audio and text) and C concepts (e.g. Car, Face, Outdoor, Bus, etc.). In the classic late fusion scheme, the stacked classifier merges M scores to classify each of the C concepts; the contextual late fusion scheme merges M × C classification scores to classify the C concepts.
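The sketch below illustrates the difference in input size with synthetic scores: each of the C per-concept stacked classifiers receives the full M × C vector of unimodal scores rather than only the M scores of its own concept. Logistic regression stands in for the unspecified second-layer classifier; the data are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
M, C, n = 2, 39, 300
scores = rng.random(size=(n, M * C))            # stacked unimodal scores per shot
labels = rng.integers(0, 2, size=(n, C))        # ground truth per concept (synthetic)

# Classic late fusion would feed only the M scores of concept c to its classifier;
# contextual late fusion feeds all M*C scores to each per-concept stacked classifier.
contextual_models = [
    LogisticRegression(max_iter=1000).fit(scores[:200], labels[:200, c])
    for c in range(C)
]
preds = np.column_stack([m.predict(scores[200:]) for m in contextual_models])
print(preds.shape)   # (100, 39): one prediction per held-out shot and concept
```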
4 Experiments
We have evaluated and compared the presented fusion schemes in the framework of the TRECVID’06 evaluation campaign. The objective of the “high level feature extraction task” is to find video shots containing a visual appearance of 39 predefined concepts (high level features). For each concept, an ordered list of 2000 relevant shots should be returned by the competing systems. The Inferred Average Precision (IAP) [17] on the returned lists computed using the
trec_eval tool is used as the evaluation metric. We compare the three proposed fusion schemes with the commonly used early and late fusion schemes, as well as with unimodal approaches. We have extracted features from the visual and textual modalities; we present them in the following section.
4.1 Visual and Text Features
The features used in this evaluation are mid-level semantic features. Visual features are based on the concatenation of several intermediate concept classification scores detected at a patch level. Those visual "local concepts" are automatically extracted in each key frame, which is split into 260 (20 × 13) overlapping patches of 32 × 32 pixels. Local descriptors (low-level features) include: color (9 color moments in RGB space), texture (8 orientations × 3 scales Gabor filters) and motion vector (extracted by optical flow). An SVM classifier is trained in order to detect a set of 15 visual concepts (e.g. vegetation, sky, skin) selected from the LSCOM ontology. Those intermediate concepts have been selected as they can be extracted at patch level. For each of the 39 concepts, we manually associated a subset of 6 intermediate visual concepts. Thus, visual feature vectors contain 1560 dimensions (6 × 260).

Text features are derived from the speech transcription results. We used 100 categories of the TREC Reuters collection to classify each speech segment. The advantages of extracting such concepts from the Reuters collection are that they cover a large panel of news topics, like the TRECVID collection, and they are obviously human understandable. Thus, they can be used for video search tasks. Examples of such topics are: economics, disasters, sports and weather. The Reuters collection contains about 800,000 text news items from the years 1996 and 1997. We constructed a vector representation for each speech segment by applying a stop-list and stemming. Also, in order to avoid noisy classification, we reduced the number of input terms. While the whole collection contains more than 250,000 terms, we have experimentally found that considering the top 2,500 most frequently occurring terms gives the best classification results on the Reuters collection. We built a prototype vector for each topic category on the Reuters corpus and apply Rocchio classification to each speech segment. Such granularity is expected to provide robustness in terms of covered concepts as each speaker turn should be related to a single topic. Our assumption is that the statistical distributions of the Reuters corpus and of the TRECVID transcriptions are similar enough to obtain relevant results. Finally, the vector of text features has 100 dimensions. More explanation about those features can be found in [2].
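As an illustration of the Rocchio step, the sketch below scores a single speech segment against hypothetical topic prototype vectors by cosine similarity; the vocabulary, prototypes and weights are made up, not taken from the Reuters data.

```python
import numpy as np

# Illustrative Rocchio-style topic scoring: each topic is represented by a
# prototype term vector (the centroid of its training documents); a speech
# segment is assigned the topics whose prototypes are most similar to it.
vocab = ["market", "storm", "goal", "rain", "shares"]

prototypes = {                       # hypothetical topic centroids over `vocab`
    "economics": np.array([0.9, 0.0, 0.0, 0.0, 0.8]),
    "weather":   np.array([0.0, 0.8, 0.0, 0.9, 0.0]),
    "sports":    np.array([0.0, 0.1, 0.9, 0.1, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

segment = np.array([0.0, 1.0, 0.0, 2.0, 0.0])   # term counts for one speech segment
scores = {topic: cosine(segment, p) for topic, p in prototypes.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```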
4.2 Comparison of Fusion Schemes
The goal of the experiment is to study how imbalanced input features and large differences in the performance of unimodal classifiers are managed by the various fusion schemes. We show the results for the unimodal runs and we compare all proposed fusion schemes and the usual early and late schemes. The following 20
concepts have been assessed for the TRECVID'06 evaluation campaign: Sports, Weather, Office, Meeting, Desert, Mountain, Waterscape, Corporate Leader, Police/Security, Military, Animal, Computer/TV Screen, US Flag, Airplane, Car, Truck, People Marching, Explosion/Fire, Maps, Chart.
The results presented in this paper are based on those 20 concepts. We do not study each individual concept's results here due to lack of space. Table 1 shows the Mean Inferred Average Precision (MIAP) obtained over the 20 assessed concepts. The first two entries refer to the unimodal runs, and the following two correspond to the state-of-the-art fusion schemes. The three fusion schemes described in this paper are the normalized early fusion, kernel fusion and contextual late fusion entries. We also show the median MIAP obtained over all of the TRECVID'06 participants.

Table 1. Mean IAP of the 20 TRECVID'06 concepts

Visual                   0.0634
Text                     0.0080
Classical Early Fusion   0.0735
Classical Late Fusion    0.0597
Normalized Early Fusion  0.0884
Kernel Fusion            0.0805
Contextual Late Fusion   0.0753
Median                   0.0680
Unimodal Runs. We observe that the two unimodal runs are very different in terms of accuracy; the MIAP of the visual-based classification is almost 7 times higher than that of text-based concept detection. This is probably due to the nature of the assessed concepts, which seem to be hard to detect using the text modality. The difficulty of detecting concepts from the text modality is also probably due to the poor quality of automatic speech transcription (and translation) in some videos of the collection. This point is actually interesting for evaluating the ability of the various fusion schemes to handle such heterogeneous data. The features we want to merge lead to different accuracies and are also imbalanced regarding the number of input features.

Classic Early and Late Fusion Schemes. The two classical fusion schemes do not merge the unimodal features equally well. While early fusion is able to outperform both unimodal runs, the late fusion scheme achieves poorer accuracy than the visual run. This might be due to the low number of dimensions handled by the stacked classifier. The early fusion scheme exploits the context provided by all of the local visual features and the textual features. The gain obtained by such fusion means that those two modalities provide distinct kinds of information. The merged features are, somehow, complementary.

Early Based Fusion Schemes. The gain obtained by the normalized fusion scheme is the largest among the fusion schemes. Processing the unimodal features by
re-equilibrating them according to the number of dimensions is a determining factor in significantly outperforming the unimodal runs. In this way, despite the different numbers of dimensions, both the visual and textual modalities have the same impact on concept classification. This normalization process leads to a gain of almost 17% (in MIAP) compared to the classic early fusion scheme, which simply normalizes inputs into a common range, and of 28% compared to the best unimodal run. The gain obtained by the kernel fusion scheme is less significant than the gain obtained by the normalized fusion run. However, compared to the classic early fusion, it seems that a combination using the sum operator leads to better accuracy than multiplying kernels (which is somehow what the classic early fusion does). Furthermore, it is important to notice that the σ parameters are selected first by cross-validation on the unimodal kernels and that we optimize the linear combination separately. We can expect that an integrated framework which learns the σ_m and w_m parameters simultaneously should lead to better results.

Contextual-Late Fusion Scheme. Contextual-late fusion is directly comparable with the classical late fusion scheme. This fusion scheme takes into account the context from the scores of other concepts detected in the same shot. By doing so, the context from other concepts leads to a gain of 26%. Furthermore, we observe that the MIAP obtained using the contextual late fusion scheme is almost the same as the one obtained for the classical early fusion scheme. In order to go further in this study, it could be interesting to evaluate the impact of the number and/or accuracy of the concepts used in the context.

We notice that both unimodal runs lead to poorer accuracy than the median of the TRECVID'06 participants. This may be due to the basic and not so optimized features used in our experiments. However, the gains induced by the three fusion schemes presented in this paper lead to better accuracy than the median. We think that an optimization of the choice of descriptors for each modality could enhance the accuracy of both unimodal and multimodal runs.
5 Conclusion
We investigated three fusion schemes derived from the classical early and late fusion schemes when using an SVM classifier. We have shown that all of the presented strategies perform on average better than the best unimodal run on the concept detection task of TRECVID'06. Furthermore, those fusion schemes outperform the median of the TRECVID'06 participants over all of their runs. Kernel fusion makes it possible to take advantage of the individual modalities, each with a set of suitable parameters. Normalized early fusion is a good way to re-equilibrate the influence of the individual modalities. Finally, contextual-late fusion allows the integration of context information from the unimodal classification scores of other concepts. We studied the influence of those fusion schemes on a set of 20 concepts, and did not analyze individual concept variations. As argued in [14], it is possible
that one strategy performs differently than another depending on the nature of the concepts. It could be interesting to go further in this direction. Also, the nature of the combined features differs depending on the fusion scheme: early fusion is based on low- or intermediate-level features, whereas late fusion merges unimodal classification scores of high-level features. It could be interesting to merge those two heterogeneous kinds of features in an integrated fusion scheme.
Bibliography
[1] S. Ayache, G. Quénot, J. Gensel, and S. Satoh. CLIPS-LSR-NII experiments at TRECVID 2005. In proceedings of the TRECVID Workshop, 2005.
[2] S. Ayache, G. Quénot, J. Gensel, and S. Satoh. Using topic concepts for semantic video shots classification. In proceedings of CIVR, 2006.
[3] S. Boughorbel, J. Tarel, and N. Boujemaa. Conditionally positive definite kernels for SVM based image recognition. In proceedings of ICME, 2005.
[4] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-sequence kernels. Journal of Machine Learning Research, 2003.
[5] T. G. Dietterich. Ensemble methods in machine learning. In Lecture Notes in Computer Science, 2000.
[6] G. Iyengar and H. Nock. Discriminative model fusion for semantic concept detection and annotation in video. In proceedings of ACM Multimedia, 2003.
[7] W. Kraaij, P. Over, T. Ianeva, and A. F. Smeaton. TRECVID 2006 – An Introduction, 2006. http://www-nlpir.nist.gov/projects/tvpubs/tv6.papers/tv6intro.pdf.
[8] G. R. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast. In Proceedings of the Pacific Symposium on Biocomputing, pages 300–311, 2004.
[9] W. Lin, R. Jin, and A. Hauptmann. Meta-classification of multimedia classifiers. In Proceedings of the First International Workshop on Knowledge Discovery, 2002.
[10] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins. Text classification using string kernels. NIPS, 2000.
[11] M. Naphade. On supervision and statistical learning for semantic multimedia analysis. Journal of Visual Communication and Image Representation, 2004.
[12] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[13] C. Snoek, M. Worring, and A. Smeulders. Early versus late fusion in semantic video analysis. In proceedings of ACM Multimedia, 2005.
[14] C. G. Snoek, M. Worring, J.-M. Geusebroek, D. C. Koelma, F. J. Seinstra, and A. W. Smeulders. The semantic pathfinder for generic news video indexing. In Proceedings of ICME, 2006.
[15] S. Sonnenburg, G. Ratsch, and C. Schafer. A general and efficient multiple kernel learning algorithm. In proceedings of NIPS, 2005.
[16] D. H. Wolpert. Stacked generalization. Journal of Neural Networks, 1990.
[17] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In Proceedings of CIKM, 2006.
[18] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. In Beyond Patches workshop, in conjunction with CVPR, 2006.
Search of Spoken Documents Retrieves Well Recognized Transcripts
Mark Sanderson and Xiao Mang Shou
Department of Information Studies, University of Sheffield, Western Bank, Sheffield, S10 2TN, UK
{m.sanderson, x.m.shou}@shef.ac.uk
Abstract. This paper presents a series of analyses and experiments on spoken document retrieval systems: search engines that retrieve transcripts produced by speech recognizers. Results show that transcripts that match queries well tend to be recognized more accurately than transcripts that match a query less well. This result was described in past literature; however, no study or explanation of the effect has been provided until now. This paper provides such an analysis, showing a relationship between word error rate and query length. The paper expands on past research by increasing the number of recognition systems that are tested, as well as showing the effect in an operational speech retrieval system. Potential future lines of enquiry are also described.
1 Introduction
The Spoken Document Retrieval (SDR) track was part of TREC from 1997 (TREC 6) to 2000 (TREC 9). During this period, substantial research and experimentation was conducted in speech retrieval. The work focused on retrieval of radio and TV broadcast news: high quality recordings of generally clearly spoken scripted speech. The overall result of the track (as reported in the summary paper by Garofolo et al, 2000) was that retrieval of transcripts generated by a speech recognition system was almost as effective as retrieval of transcripts generated by hand, with proper expansion techniques. Garofolo et al also presented results showing that there appeared to be a relationship between WER and retrieval effectiveness. They showed that for topics where retrieval was effective, the WER of retrieved items tended to be low. The authors speculated that hard to recognize documents may also be hard to retrieve. A more detailed analysis of the reasons for the success of spoken document retrieval was described by Allan in his review of SDR research (2002). Allan pointed out that the documents that were most relevant to a query were ones that had query words repeated many times (i.e. the words had a high term frequency, tf, within the document). The repetition of query words within a document provided the recognition system with multiple opportunities to spot the query words correctly. Documents that contained query words only once may not have had such word occurrences spotted by a recognizer and therefore were less likely to be retrieved; however, such documents were also less likely to be relevant to the query, so failing to retrieve them was not particularly important. In fact, the failure to recognize single occurrences of terms in non-relevant documents may offer an advantage in SDR over text retrieval, as the speech document will not be retrieved. Allan reported that
retrieval from spoken document collections with a high Word Error Rate (WER) resulted in poorer effectiveness than that resulting from retrieval over a collection with a low WER. Allan also reported that this inverse relationship between WER and retrieval effectiveness was linear. Following on from those two review papers, additional analysis of the SDR track data was conducted by Shou, Sanderson and Tuffs (2003), who reported work describing the variation of the word error rates of retrieved documents across the ranking. In the paper, it was shown that across the groups who submitted runs to the TREC SDR track, top ranked documents in each run had a lower WER than documents that were further down the ranking. Some speculations on the reasons for this effect were provided, but little evidence of a reason was reported. This paper provides such evidence. The paper starts with an overview of past work, followed by a series of experiments that expand on the work reported in the 2003 paper.
2 Past Work
Beyond Garofolo et al's observation of a relationship between effective topics and WER, little past work on the relationship between effectiveness, document rank and word error in recognized transcripts has been reported. However, some related research has been published, which is now described. In the internal workings of a speech recognition system, an audio segment of speech is recognized into a lattice of possible text strings, each string a hypothesis of what was spoken. The hypotheses are compared to the acoustic and language models stored in the speech recognizer. Based on both models, a confidence score is assigned to each word in each hypothesis, signifying the probability that the word was spoken. The sequence of words with the highest scores is chosen as the text string the recognizer will output. It can be expected that the higher the score assigned to a word, the more confident one can be that the recognizer's selection was correct. Zechner and Waibel investigated summarization of spoken documents (2000) and the use of confidence scores to improve summarization quality. Their summarizer ranked passages of a spoken document by their similarity to the overall document. Summary quality was computed by counting the number of relevant words (manually identified in human transcription) found within the summary. It was found that if the ranking formula was adjusted to prefer passages holding words with high confidence scores, the quality of the summaries increased by up to 15%. With Zechner and Waibel an approximation of word error rate (i.e. the confidence scores) was used to influence a ranking algorithm to improve the quality of the top ranked passages. Given such success, one might assume that similar use of confidence scores in information retrieval ranking algorithms would also be beneficial. However, attempts to improve retrieval effectiveness through use of the scores have at best been marginally successful (see Siegler et al, 1998, Johnson et al, 1999). Sanderson and Crestani conducted preliminary investigations of retrieval from a collection composed of both hand transcribed (containing only human errors) and speech recognized documents (with a level of word error within them) (1998). Two
versions of each spoken document were placed into a collection, one hand transcribed and one speech recognized. By having pairs of identical documents in the collection, the only difference in the two subsets of the mixed collection was the errors in the speech recognized set. If one was to retrieve on such a collection, any difference in rank positions of documents from the two subsets would be due to the error in the second set. Sanderson and Crestani reported that retrieval from such a collection resulted in the hand transcribed documents being retrieved at higher rank positions than the speech recognized documents. By experimenting with two retrieval ranking algorithms, Sanderson and Crestani were able to show that the predominant reason for the hand transcribed documents being ranked higher than the recognized ones was word errors reducing the tf weight assigned to words in the recognized documents, therefore making such a document receive a lower score than that assigned to the hand transcribed version when ranking documents relative to a query. Sanderson and Crestani assumed that documents in the recognized collection had a uniform word error rate and did not explore the effect of different word error rates across such a collection. Neither was the investigation run across a range of retrieval systems or outputs from other speech recognition systems. Further research in retrieval from similar forms of collection was conducted by Jones and Lam-Adesina (2002).
3 Experiments on the Extent of the Effect of WER and Rank Position
In their paper, Shou, Sanderson and Tuffs (2003) presented evidence of variation of WER across rankings. That work is expanded on here. In the past paper, the speech recognized transcripts of the one hundred hours of audio data making up the TREC-7 SDR collection were collected from six of the groups participating in the speech track. In addition, the runs submitted by each group were also gathered: these hold the ranked list of documents retrieved for each topic by each group's retrieval system. The collection had an accompanying accurate manually generated text transcript, which allowed WERs to be computed for each document at each rank position for each topic within each collected transcript. A scatter plot of the WER of retrieved documents against their rank position was produced for each of the six transcripts. In addition, one of the six transcripts, from AT&T, had two forms of retrieval system search over it, which resulted in seven plots. The seven data sets are now described.

1. derasru-s1, UK Defence Evaluation and Research Agency (DERA, Nowell, 1998). Here a large vocabulary continuous speech recognizer (50,000 word vocabulary plus 500 bigrams) developed by DERA was used to generate the transcript. Its average word error rate was 66.4%. Retrieval was based on the Okapi system. The topics of the TREC track were syntactically tagged. Certain syntactic patterns were used to identify keywords of the topic text. Selected topic keywords were expanded with synonyms and sometimes with hypernyms taken from the WordNet thesaurus. When keywords were ambiguous, the commonest synset was chosen to provide expansion terms.
2. derasru-s2: using the same retrieval set up as derasru-s1, the speech recognizer had an additional processing step, which reduced the error rate to 61.5%. Here the audio data was segmented into different streams depending on the quality of audio recording found within parts of the TREC spoken document collection. Audio recordings identified as being speech over telephones, for example, were recognized differently from segments judged to be recorded to a higher quality.

3. att-s1, AT&T. Recognition was performed using an in-house speech recognition system that produced transcripts with a 32.4% WER. The vocabulary size of the system was not stated in the paper describing the AT&T submission to TREC (Singhal et al, 1998). Retrieval was based on the SMART retrieval system with a phrase identification process operating on TREC topic text and pseudo-relevance feedback used to expand topics with additional terms. The form of feedback used was a method referred to as collection enrichment: here the first search of the pseudo-relevance feedback stage was conducted on a large collection of news articles and not the relatively small SDR collection.

4. att-s2. For the second AT&T submitted run, the same recognition system was used, but retrieval was altered to include a document expansion step. Here, in the same manner that topic text was expanded using pseudo-relevance feedback, each recognized transcript was expanded by searching a large collection of newspaper texts with the transcript text as a query. The transcript was expanded with terms found to commonly co-occur in top retrieved newspaper articles. This run produced better retrieval results than att-s1.

5. dragon-s1, Dragon systems and the University of Massachusetts. This was a combined submission using a speech recognizer from Dragon and retrieval using the UMass Inquery retrieval system (Allan et al, 1998). The recognizer used a 57,000 word vocabulary. It produced transcripts with an error rate of 29.8%. Prior to retrieval, topic text was processed to locate phrases, which were then searched as phrases. Certain proper nouns were expanded with synonyms. A form of pseudo-relevance feedback (known as local context analysis) was used to expand topic texts with additional terms taken from the recognized transcript collection.

6. shef-s1, University of Sheffield with collaborators at Cambridge University (Abberley et al, 1998). Recognition was performed using the Abbot recognizer system with a vocabulary of 65,532 words, producing a transcript with a 35.9% WER. Retrieval was performed with a locally built IR system using Okapi-style BM25 weights.

7. cuhtk-s1, University of Cambridge (Johnson et al, 1998). Recognition was performed using the HTK speech toolkit, recognizing from a 65,000 word vocabulary. The resulting transcript had a 24.8% WER. Retrieval used the Okapi system using BM25 weights. Expansion of selected topic terms with synonyms and with additional terms using pseudo-relevance feedback was used, as was phrase spotting in topic text. Matches on proper nouns and nouns were preferred over adjectives, adverbs and verbs as this strategy was found to bring improvements in retrieval effectiveness.

As can be seen from the descriptions, the seven runs represent a relatively diverse set of retrieval and recognition approaches. The average WER of the transcripts ranged
from 24.8% to 66%. Note that two further recognizer transcripts were produced and archived in this year of TREC, nist-b1 and nist-b2 (Garofolo et al, 1999). However, no associated retrieval runs performed on these transcripts were located and so they were not used in this experiment.
3.1 The Experiment
For each run, rankings for each of the 23 topics (51-73) were gathered from the TREC web site. NIST's sclite software was used to calculate the WER of each document retrieved in the top 200 rank positions. Since sclite only calculates WER based on speaker id, the original recognized transcripts were modified by replacing speaker ids with document ids so that WER could be measured on each document. After obtaining the WER of each story across all systems, the average error at each rank position across the 23 queries was calculated and graphed.
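For reference, WER is the word-level edit distance between the reference (manual) transcript and the recognizer output, divided by the number of reference words. A minimal sketch of the computation (which sclite performs, with additional alignment reporting) is given below.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed with a standard word-level edit distance."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return 100.0 * dp[len(r)][len(h)] / max(len(r), 1)

ref = "the spoken document retrieval track was part of trec"
hyp = "the spoke document retrieval track was part trec"
print(round(word_error_rate(ref, hyp), 1), "%")   # one substitution + one deletion
```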
Fig. 1. Document rank (x-axis) vs. word error rate (y-axis) for dragon-s1 system
Fig. 2. Graph of Figure 1 with y-axis adjusted to focus on majority of retrieved documents
The graph (in Figure 1) shows a slight increase in error rate for recognized documents at higher rank positions. A small set of documents with a very high error rate across the ranking was observed (the twelve points at the top of the scatter plot). The reason for this effect was investigated and found to be related to mistaken insertions of large amounts of text into short documents by the recognizer (such erroneous documents were found in all six transcripts). Ignoring these few high error rate documents by focusing the scatter plot on the main band of documents reveals the trend of increasing error rate more clearly. It can be seen that top ranked documents (those on the left side of the graph) have a lower word error rate than those ranked further down the ranking. The plot shown in Figure 2 was repeated for the other six runs and the results are displayed in Figures 3–8. Across all runs, the average WER for the very top ranked documents (those in the top 10) is lower than the WER for documents in the wider part of the ranking. Such differences in WER are also shown in Table 1, where the average WER is calculated over the top 10, 50 and 200 rank positions, and it can be seen that for all recognizers and runs the WER is lower for higher ranked documents.
Fig. 3. derasru-s1, rank vs. WER
Fig. 4. derasru-s2, rank vs. WER
Fig. 5. att-s1, rank vs. WER
Fig. 6. att-s2, rank vs. WER
Fig. 7. shef-s1, rank vs. WER
Fig. 8. cuhtk-s1, rank vs. WER
The slight, though consistent trend measured across all data sets provides evidence that when retrieving speech recognized documents, those with lower word error rates tend to be ranked higher. The trend also appears to occur independent of the mix of retrieval strategies used across the runs (e.g. different weighting schemes, use of pseudo-relevance feedback, use of document expansion, etc) and independent of the accuracy of the speech recognizer used.
Table 1. Word Error Rate differences for top 10, 50 and 200 retrieved documents

Run         Average WER in top 10 (%)  Average WER in top 50 (%)  Average WER in top 200 (%)
derasru-s1  57.1                       58.8                       62.8
derasru-s2  50.5                       53.1                       56.2
att-s1      25.5                       27.4                       29.1
att-s2      24.8                       26.8                       29.0
dragon-s1   22.8                       25.1                       26.9
shef-s1     28.4                       32.7                       33.4
cuhtk-s1    20.6                       22.2                       23.4
Although the trend is consistent across the data sets, it is not immediately clear what the cause of such a trend is: one explanation is that top ranked documents tend to contain a broader range of query words than those documents ranked lower. Another explanation mentioned by Sanderson and Shou (2002) is that transcripts of spoken documents containing query words assigned a high tf weight – which tend to be ranked highly by retrieval systems – often have a lower overall WER. Determining which of these possible causes might explain the observed effect was the subject of the next experiment.
4 Determining the Cause of the Effect
As valuable as it can be to examine the search output of other research groups' retrieval systems (as was conducted in Section 3), analyzing the ranked output of a system that one has no access to is often limiting. This is because a common consequence of such analysis is the discovery that new experiments need to be conducted to generate different versions of the data, which requires access to the retrieval system of other research groups, something that is rarely possible. Therefore, in order to conduct more detailed analysis of WER in retrieved documents, the six recognized transcripts used in the experiment of Section 3 along with the two NIST transcripts (nist-b1 and nist-b2) were indexed and searched so that new search output could be created for further experimentation. The aim of the experiments was to examine the relationship between WER, tf weights and the number of words in common between a query and a document. In the experiment, the average WER of top ranked documents retrieved by queries of different lengths was measured. The TREC-7 SDR collection holds only 23 topics. In order to produce a larger number of topics of different lengths, (non-stop) words were randomly sampled without repeated words from each of the topics. The number of words sampled was varied, producing sets of topics of length 1, 2, 5, 10 and 15. Each of the 23 topics was sampled 1,000 times for each of the five different lengths. The queries were submitted to two versions of the GLASS search engine, an in-house IR system that implements Robertson et al's BM25 ranking algorithm (1995) as well as a simple quorum scoring (coordination level matching) algorithm that ranks documents by the number of query words found in a matching document (making no use of tf, idf weights or of document length normalization). No relevance feedback or other expansion methods were employed in either algorithm.
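The sketch below (not the GLASS code) illustrates the two ranking functions on a toy collection: quorum/coordination-level matching counts the distinct query words present in a document, while a common BM25 variant also uses tf, document frequency and length normalization.

```python
import math
from collections import Counter

docs = {
    "d1": "storm damage reported across the region after heavy storm winds",
    "d2": "the election results were announced after the storm of protest",
    "d3": "weather forecast predicts storm and heavy rain damage",
}
query = "storm damage rain".split()

N = len(docs)
tokenised = {d: t.split() for d, t in docs.items()}
avgdl = sum(len(t) for t in tokenised.values()) / N
df = Counter(w for t in tokenised.values() for w in set(t))   # document frequency

def quorum(doc):
    # Coordination level matching: number of distinct query words in the document.
    return sum(1 for w in set(query) if w in tokenised[doc])

def bm25(doc, k1=1.2, b=0.75):
    tf = Counter(tokenised[doc])
    dl = len(tokenised[doc])
    score = 0.0
    for w in query:
        if tf[w] == 0:
            continue
        idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
        score += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * dl / avgdl))
    return score

print(sorted(docs, key=quorum, reverse=True))
print(sorted(docs, key=bm25, reverse=True))
```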
Table 2. The average WER measured across the ten top ranked documents retrieved by quorum scoring for each of the 1,000 topics randomly sampled Topic length
cuhtks1
dragon98s1
atts1
shefs1
nistb1
nistb2
derasrus2
derasrus1
1
19.3
22.0
22.2
25.8
26.7
39.5
51.4
52.3
2 5
18.9 16.7
22.7 21.0
22.1 21.1
25.9 23.9
26.4 24.3
39.1 38.0
50.6 44.6
52.7 48.6
10 15
15.9 15.5
19.7 19.6
20.4 19.9
22.1 21.2
23.5 23.0
36.9 36.7
42.9 42.1
47.1 46.4
Table 3. Precision at 10 measured in the retrieved documents shown in Table 2 Topic length
cuhtks1
dragon98s1
atts1
shefs1
nistb1
nistb2
derasrus2
derasrus1
1 2
0.06 0.09
0.07 0.08
0.06 0.09
0.06 0.09
0.05 0.08
0.05 0.07
0.06 0.07
0.06 0.07
5 10
0.20 0.26
0.18 0.23
0.19 0.26
0.19 0.25
0.18 0.24
0.15 0.20
0.17 0.21
0.17 0.23
15
0.28
0.24
0.27
0.27
0.25
0.21
0.21
0.23
Table 4. The average WER measured across the ten top ranked documents retrieved by BM25 for each of the 1,000 topics randomly sampled

Topic length   cuhtks1   dragon98s1   atts1   shefs1   nistb1   nistb2   derasrus2   derasrus1
1              18.1      21.7         22.3    25.3     25.7     39.0     46.9        51.0
2              17.8      22.2         22.4    25.2     25.2     39.2     46.3        50.7
5              17.2      21.1         22.2    24.4     23.9     38.7     44.3        50.6
10             17.2      20.9         21.9    24.1     24.1     38.8     43.0        49.4
15             16.9      20.7         21.9    23.9     24.1     38.4     42.1        48.9
Table 5. Precision at 10 measured in the retrieved documents shown in Table 4

Topic length   cuhtks1   dragon98s1   atts1   shefs1   nistb1   nistb2   derasrus2   derasrus1
1              0.10      0.09         0.09    0.09     0.08     0.08     0.08        0.08
2              0.16      0.15         0.15    0.15     0.15     0.13     0.13        0.13
5              0.30      0.28         0.28    0.28     0.28     0.24     0.25        0.27
10             0.36      0.35         0.35    0.35     0.35     0.29     0.31        0.33
15             0.37      0.36         0.35    0.36     0.36     0.30     0.33        0.33
The results of this experiment are shown in Table 2 and Table 4, which record average WER, and in Table 3 and Table 5, which display precision measured at rank ten. As can be seen, across all eight transcripts and for both forms of ranking algorithm, as topic length increases the WER measured in the top ranked documents falls while precision at 10 rises. From the results with quorum scoring, it can be concluded that the reduction in WER shown in Table 2 was caused by a change in the top ranked documents: as topic length increases, the top ranked documents hold more query words, and documents that match on a broader range of query words tend to have a lower WER. While a relationship between the rank position of recognized documents and their WER has been observed in the past, to the best of our knowledge a causal effect has not been determined before. From the results in Table 2, we conclude that the process of retrieval itself is locating documents that have a lower WER.

Table 6. Comparison of average word error rate (WER) measured across the eight transcripts shown in Table 2 and in Table 4

Topic length      1       2       5       10        15
av. WER BM25      31.3    31.1    30.3    29.9      29.6
av. WER Quorum    32.4    32.3    29.8    28.6      28.1
difference        -1.2    -1.2    0.5     1.4       1.6
t-test (p)        0.058   0.057   0.099   **0.001   **0.001
The number of words in common between a document and a query is not the full story, however: for topics of length one, for all but one transcript, WERs are lower using BM25 ranking (Table 4) than when using quorum scoring (Table 2). Here, the top ranked documents retrieved by a single word query using BM25 are those documents in the collection that contain the query word repeated the most times (normalized by document length). Observing a query word repeated many times in a document would appear to be an indicator that the document was recognized well. The comparison of WERs is summarized in Table 6. The amount of WER reduction is relatively small, and for topics of length one or two the difference is not significant. Comparing the error rates across the two ranking algorithms for longer queries (five, ten or fifteen words), the quorum scoring algorithm retrieves documents with lower WERs, and for the longest query lengths the differences between quorum and BM25 are significant. However, it must be remembered that quorum scoring, though retrieving documents with low WERs, is not retrieving the most relevant documents: across the tables, precision at ten is consistently higher for BM25 ranking. We believe that this effect arises because the documents top ranked by BM25 match on fewer query words than those top ranked by quorum scoring, but with higher tf weights; a query word with a high tf is one that was repeatedly recognized, so BM25 tends to rank more highly documents with fewer, but more frequently occurring, matching terms.
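Significance figures of the kind reported in Table 6 can be obtained with a standard paired t-test over per-topic average WERs from the two ranking algorithms. The sketch below uses SciPy; the arrays are placeholders for illustration, not the per-topic data underlying the table.

```python
from scipy import stats

# Placeholder per-topic average WERs for one topic length (not the paper's data).
wer_bm25 = [31.0, 30.2, 29.5, 30.8, 29.9, 30.1]
wer_quorum = [28.9, 28.1, 27.8, 29.0, 28.4, 28.6]

t_stat, p_value = stats.ttest_rel(wer_bm25, wer_quorum)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```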
5 Experiments with Manual Calculation of WER on Top Ranked SpeechBot Snippets

To provide further confirmation of the results in Section 4, measurements were made of the word error rate in the snippets of top ranked transcripts retrieved by a publicly available spoken document retrieval system, SpeechBot (Van Thong et al., 2000; Moreno et al., 2000). We wished to test whether the correlation between word error rate and document ranking generalizes to other systems using different speech recognition technologies. A white paper published on the engine's Web site (Quinn, 2000) described how the engine indexed streaming spoken audio using a speech recognizer. Several thousand hours of audio data were crawled and stored in a searchable collection composed mainly of content from US-based radio stations, predominantly news, current affairs and phone-in shows. The snippets in the result list summary presented by SpeechBot were brief sections of speech transcript that strongly matched a user query, most likely selected by a within-document passage ranking approach. The WER of each retrieved snippet was computed manually, by comparing the snippet text against the corresponding part of the audio recording (listened to by a person) and noting any inserted, deleted or substituted words. The WER was calculated as the total number of errors divided by the total number of words in the returned snippets, consistent with NIST's WER calculation tool sclite, which was used in the TREC SDR track. Because the majority of the SpeechBot collection was audio news, 34 current affairs queries were created for the experiment. The number of words in the examined snippets ranged from twenty to forty. It was found that audio files were not available for some of the retrieved results (usually old audio recordings dated before 1999 or the recordings of certain shows). The authors were also made aware that a number of the transcripts used by SpeechBot for certain radio programs, such as PBS's News Hours, are manually written transcripts and not generated by an SR system (Quinn, 2000); such transcripts were ignored as well. Therefore, 311 out of a possible 350 snippets were assessed. The average WER measured within the snippets was 19.29%, with a standard deviation of 14.04%; among the snippets, the maximum calculated WER was 68.75% and the minimum was 0%. The measured rate was substantially lower than the estimated 50% WER reported to exist across the whole SpeechBot collection (Quinn, 2000). This constitutes further evidence of the retrieval process assigning high rank to well recognized documents.
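The snippet-level WER described above (total errors over total reference words, as computed by NIST's sclite) reduces to a word-level edit-distance alignment. A minimal sketch, not the NIST tool itself:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard edit-distance alignment over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit operations to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def collection_wer(pairs):
    """Aggregate WER over (reference, hypothesis) snippet pairs, as in Section 5:
    total errors divided by total reference words."""
    errors = sum(word_error_rate(r, h) * len(r.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words
```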
6 Conclusions and Future Work

This paper described experiments demonstrating that, when there is variability in the word error rate across the documents of a speech recognized collection, retrieval systems tend to rank most highly those documents with low word error. This effect was demonstrated through experimentation on an operational spoken document retrieval system as well as through a series of analyses across multiple speech recognizers and retrieval algorithms. It was shown that documents holding many query words tend to have low WER.
We plan to extend our investigation to other retrieval research areas where documents containing varying levels of error are retrieved. Research topics such as retrieval of transcripts produced by Optical Character Recognition (OCR) of scanned document images, or retrieval of documents translated into a different language, may be worthy of further investigation. When retrieving OCR'ed documents, determining whether the top ranked documents are more readable or are the product of a better scan would be a straightforward experiment to undertake. A potentially more intriguing question is whether, in the context of cross-language information retrieval, top ranked documents are better translated than those retrieved further down the ranked list. To the best of our knowledge, this question has not been addressed within the cross-language research community.
References

Abberley, D., Renals, S., Cook, G. (1998) Retrieval of broadcast news documents with the THISL system, in the proceedings of IEEE ICASSP, 3781-3784
Allan, J., Callan, J., Sanderson, M., Xu, J. (1998) INQUERY and TREC-7, in the proceedings of the 7th Text REtrieval Conference (TREC 7)
Allan, J. (2002) Perspectives on Information Retrieval and Speech, in Information Retrieval Techniques for Speech Applications, Coden, Brown and Srinivasan, editors, Lecture Notes in Computer Science, Volume 2273, 1-10
Garofolo, J.S., Voorhees, E.M., Auzanne, C.G.P., Stanford, M., Lund, B.A. (1999) TREC-7 Spoken Document Retrieval Track Overview and Results, in the proceedings of the DARPA Broadcast News Workshop
Garofolo, J.S., Auzanne, C.G.P., Voorhees, E.M. (2000) The TREC Spoken Document Retrieval Track: A Success Story, in the proceedings of RIAO
Johnson, S.E., Jourlin, P., Moore, G.L., Spärck Jones, K., Woodland, P.C. (1998) Spoken Document Retrieval for TREC-7 at Cambridge University, in the proceedings of the 7th Text REtrieval Conference (TREC 7)
Johnson, S.E., Jourlin, P., Spärck Jones, K., Woodland, P.C. (1999) Spoken Document Retrieval for TREC-8 at Cambridge University, in the proceedings of the 8th Text REtrieval Conference (TREC 8)
Jones, G.J.F., Lam-Adesina, A.M. (2002) An Investigation of Mixed-Media Information Retrieval, in the proceedings of the 6th European Conference on Research and Development for Digital Libraries (ECDL), 463-478
Moreno, P., Van Thong, P.M., Logan, B., Fidler, B., Maffey, K., Moores, M. (2000) SpeechBot: A Content-based Search Index for Multimedia on the Web, in the proceedings of the 1st IEEE Pacific-Rim Conference on Multimedia (IEEE-PCM 2000)
Nowell, P. (1998) Experiments in Spoken Document Retrieval at DERA-SRU, in the proceedings of the 7th Text REtrieval Conference (TREC 7)
Quinn, E. (2000) SpeechBot: The First Internet Site for Content-Based Indexing of Streaming Spoken Audio, Technical Whitepaper, Compaq Computer Corporation, Cambridge, Massachusetts, USA
Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M. (1995) Okapi at TREC-3, in the proceedings of the 3rd Text REtrieval Conference (TREC 3)
Sanderson, M., Crestani, F. (1998) Mixing and Merging for Spoken Document Retrieval, in the proceedings of the 2nd European Conference on Digital Libraries, Heraklion, Greece, Lecture Notes in Computer Science N. 1513, Springer Verlag, 397-407
Sanderson, M., Shou, X.M. (2002) Speech and Hand Transcribed Retrieval, Lecture Notes in Computer Science N. 2273, Information Retrieval Techniques for Speech Applications, Springer, 78-85
Shou, X.M., Sanderson, M., Tuffs, N. (2003) The Relationship of Word Error Rate to Document Ranking, in the proceedings of the AAAI Spring Symposium Intelligent Multimedia Knowledge Management Workshop, Technical Report SS-03-08, ISBN 1-57735-190-8, 28-33
Siegler, M., Berger, A., Witbrock, M., Hauptmann, A. (1998) Experiments in Spoken Document Retrieval at CMU, in the proceedings of the 7th Text REtrieval Conference (TREC 7)
Singhal, A., Choi, J., Hindle, D., Lewis, D.D., Pereira, F. (1998) AT&T at TREC-7, in the proceedings of the 7th Text REtrieval Conference (TREC 7)
Van Thong, J.M., Goddeau, D., Litvinova, A., Logan, B., Moreno, P., Swain, M. (2000) SpeechBot: A Speech Recognition based Audio Indexing System for the Web, in the proceedings of the International Conference on Computer-Assisted Information Retrieval, Recherche d'Informations Assistée par Ordinateur (RIAO 2000), 106-115
Zechner, K., Waibel, A. (2000) Minimizing Word Error Rate in Textual Summaries of Spoken Language, in the proceedings of NAACL-ANLP-2000, 186-193
Natural Language Processing for Usage Based Indexing of Web Resources
Anne Boyer and Armelle Brun
INRIA Lorraine – BP 239 – 54506 Vandœuvre-lès-Nancy, France
{boyer, brun}@loria.fr
Abstract. The identification of reliable and interesting items on the Internet is becoming more and more difficult and time consuming. This position paper describes our intended work in the framework of multimedia information retrieval by browsing techniques within web navigation. It relies on a usage-based indexing of resources: we ignore the nature, the content and the structure of resources. We describe a new approach that takes advantage of the similarity between statistical language modeling and document retrieval systems. A syntax of usage is computed, defining a Statistical Grammar of Usage (SGU). An SGU enables the classification of resources in order to build a personalized navigation assistant tool. It relies both on collaborative filtering, to compute virtual communities of users, and on classical statistical language models. The resulting SGU is a community-dependent SGU.
1 Introduction
The amount of available information has increased exponentially in recent years due to the development of information and communication technologies and the worldwide success of Internet applications. Therefore, the identification of reliable and interesting items becomes more and more difficult and time consuming, even for skilled people using dedicated tools such as powerful search engines. Due to the huge amount of online resources, the major difficulty is no longer to know whether a pertinent document is available, but to identify the most reliable and interesting items among the overwhelming stream of available information. A key factor of success in information retrieval and delivery is the development of powerful tools that are easy to use for a large audience. Different approaches to resource retrieval have been explored, such as content analysis, keyword indexing and identification, topic detection, etc. [1]. A major difficulty inherent in such approaches is that one keyword may have different meanings depending on the user, his/her context and the history of his/her past navigations. Moreover, two different keywords may have similar meanings, depending on the context. Expressing a query is a difficult task for many people, and a lot of research and industrial projects deal with query assistance. Furthermore, automatic indexing of multimedia resources is still a hard research problem. To cope with these difficulties (query expression, multimedia
indexing, etc.), we decided to investigate another way, by ignoring the content, the nature, the format and the structure of resources. This paper describes our intended work, relying on our past research on both collaborative filtering [2,3] and statistical language modeling [4,5]. Our objective is to provide a new web browsing tool based on an analysis of usage. This tool enables multimedia information retrieval by browsing techniques without expressing any query, meaning that users are modelled without requiring any preference elicitation. This approach makes it possible to manage heterogeneous items (video, audio, textual, multimedia) with a single treatment, which is an advantage since classical methods require dedicated tools for resource tagging. We plan to extract frequent patterns of consultation by taking advantage of the analogy between language-based statistical modeling and resource retrieval. These frequent patterns will allow the design of a syntax of usage, relying on the hypothesis that there is a logic and coherency defining implicit "rules" inside a navigation. The resulting Statistical Grammar of Usage (SGU) enables a classification, clustering and selection of resources to design personalized filtering. In the next section, the problem of retrieving resources when browsing is stated and our approach based on the use of statistical language models is detailed. The following section presents the most popular statistical language models and their appropriateness to web browsing. Section 4 puts forward the community-based Statistical Grammar of Usage we design. Then, discussion and perspectives conclude the paper.
2 Principle of Our Web Browsing Tool
Our web browsing tool helps users during a navigation process: it suggests pertinent items to a specific user, given his/her past navigation and his/her context. The aim is to compute the pertinence of any resource. The pertinence of a resource is defined as the interest of a user for it and is used to compute predictions of resources (the higher the pertinence of a resource, the higher its probability of being suggested to the user). First, we hypothesize an implicit search: the active user has no explicit query to formulate. Secondly, we consider as a consultation a sequence of one or more items dedicated to a given search. A multi-navigation is the mix of different consultations within a single browsing process. A resource is any item (textual, audio, video or multimedia document, web page, hyperlink, forum, blog, website, etc.), viewed as an elementary and indivisible entity without any information about its format, its content or any semantic or topical indexing. The only data describing a resource a priori is a normalized mark, called an identifier, which makes it possible to identify and locate it. Our approach relies on an analysis of usage. A usage is any data explicitly or implicitly left by the user during navigation. For example, history of consultation, click-streams or log files are implicit data about the interest of the visited items for the active user. This interest measure can be either explicit information, such as votes or annotations, or an estimation computed from implicit data [6].
An advantage of our approach is that it only takes into account a measure of the user's interest in a given resource, which is directly linked to the pertinence criterion: the user's satisfaction. Let us recall that we have decided to ignore any structural or thematic information about a resource. Our approach computes a personalized indexing of resources, not in terms of their intrinsic nature but in terms of a more subjective yet more reliable and pertinent criterion, i.e. the user's context, preferences and habits. This is the reason why this approach manages heterogeneous resources with a single treatment. The question to solve is the following: how can we estimate the a priori pertinence of a resource for a given user? The difficulty lies in data sparsity: we have no appreciation of a resource if the user has not seen it, and usually many resources have not been seen by this user. To compute the a priori pertinence of a resource, we plan to design a grammar of usage. As a grammar of language is the set of rules describing the relations between words, a grammar of usage is the set of rules describing the relations between resources. A grammar of language estimates whether a word is pertinent given the beginning of a sentence; a grammar of usage allows us to estimate whether a resource is relevant for a specific user given his/her previous consultations. There is no a priori grammar of usage, as the Internet is a dynamic and changing environment. A means of coping with the difficulty of designing an a priori grammar is the use of a statistical approach based on usage analysis. Since huge usage corpora are available (log files, clickstreams, etc.), it is possible to make the regularities in resource consultations explicit. This statistical approach can be investigated in a similar way to language modeling based on statistical models. The resulting grammar is called a Statistical Grammar of Usage (SGU). It enables the computation of the probability of a resource given the active user and his/her navigation sequence; this probability measures the pertinence of the resource. An SGU trained on the whole usage corpus is a general grammar, since it is learned for all users in all contexts. The accuracy of such a grammar is insufficient and, furthermore, the presupposed logic and coherency between users becomes too strong and unrealistic a hypothesis. Given two users, it seems unlikely that they exhibit the same resource consultation behavior: the SGU has to be personalized. Nevertheless, learning a user-specific SGU requires a large amount of data for each user, and it is unrealistic to wait until enough data has been collected to train it. This is the reason why we determine groups of users with similar behavior, called communities. Thus we plan to compute an SGU for each community and design a community-based SGU. Users are preclassified into a set of coherent communities in terms of resource consultation behavior. Collaborative filtering techniques are a means of building coherent communities in terms of usage; this gathering can be compared to topic classification in natural language processing. The principle of collaborative filtering [7] is to associate the active user with a set of users having the same tastes, based on his/her preferences and past visited resources. This approach relies on a first hypothesis that users who like the same documents have the same topics of
interest, and on a second hypothesis that people have relatively constant likings. Thus, it is possible to predict resources likely to match the user's expectations by taking advantage of the experience of his/her community. A first comment on usual collaborative filtering techniques is that the structure of navigation is ignored; however, this aspect can be crucial in some applications such as web browsing. For example, a user may not like a resource because he/she has not previously read a prerequisite resource. Thus the SGU will submit a resource when it becomes pertinent for a user, for example when he/she has read all prerequisites. As statistical language models emphasize the order of words in sentences, it seems interesting to determine whether such models and collaborative filtering can be used together to improve the quality of suggestions.
3 Statistical Language Models

3.1 Overview of Statistical Language Models
The role of a statistical language model (SLM) is to assign a likelihood to a given sentence (or sequence of words) in a language [8]. An SLM is defined as a set of probabilities associated with sequences of words; these probabilities reflect the likelihood of those sequences. SLMs are widely used in various natural language applications such as automatic translation, automatic speech recognition, etc. Let the word sequence be W = w_1, ..., w_S. The probability of W is computed as the product of the conditional probabilities of each word w_i in the sentence. To estimate these probabilities, a vocabulary V = {w_j} is defined, and the probabilities of the sequences of words are estimated on a training text corpus.
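As a reading aid (this display is our own restatement of the sentence above, not an equation reproduced from the paper), the likelihood an SLM assigns to W decomposes by the chain rule:

P(W) = P(w_1, \ldots, w_S) = \prod_{i=1}^{S} P(w_i \mid w_1, \ldots, w_{i-1})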
3.2 How Can Web Browsing Take Advantage of SLM?
Web browsing and statistical language modeling seem to be similar in several respects. First, statistical language modeling uses a vocabulary made up of words; this set can be viewed as similar to the set of resources R of the web. Then, the text corpus is made up of sentences of words, which can be viewed as similar to the sequences of consultations of the usage corpus: a sequence of S words in a sentence is similar to a sequence of consultations of S resources. Finally, the presence of a word in a sentence mainly depends on its previous words, as the consultation of a resource mainly depends on the preceding consultations. Given these similarities, we can naturally investigate the exploitation of SLMs in a web browsing assistant. As noticed in the previous section, these models have the characteristic that the order of the elements in the history is crucial; this aspect may be important for specific resources in web browsing. However, we have to notice that web browsing and natural language processing have two major differences. The first one is that a user may mix different queries within a single history (we call this "multi-navigation"), whereas it is unrealistic to mix different sentences when speaking or writing. This first remark brings us to consider a generalization of SLMs to integrate multi-navigation into the browsing process. The second one is that natural language
exhibits stronger constraints: each word in a sentence is important, and deleting or adding a word may change the meaning of the sentence. Web browsing is not so sensitive, and adding or deleting a specific resource within a navigation may have no impact. We therefore have to consider permissive models, able to take into account less constrained histories such as navigations.
3.3 n-Grams Language Models
Due to computational constraints and the reliability of the probability estimates, the whole history h_i of w_i cannot be systematically used to compute the probability of W. Classical SLMs aim at reducing the size of the history while not decreasing performance. n-grams models reduce the history of a word to its n − 1 previous words; these models are the most commonly used in natural language applications. The n-grams model can be directly used in web browsing assistance. In the previous section, we put forward that the quality of the model will be increased if it is dedicated to a community and trained on the corresponding community usage corpus. Thus, the usage corpus is split into community usage corpora and a model is trained on each community corpus. Let c_j be a community and h_j = R_1, ..., R_{i−1} a sequence of consultations of resources. For each resource R_i ∈ R, the n-grams model computes the probability P_n(R_i | R_{i−n+1}, ..., R_{i−1}, c_j). The history h_j is reduced to the n − 1 last resources consulted; other resources are discarded. Thus, this model assumes that the consultation of a resource R_i does not mainly depend on resources consulted far from R_i. As previously mentioned, the behavior of users is less constrained than language: adding or deleting a resource in a sequence of consultations has a lower influence on the result of the search than adding or deleting a word in a sentence. This model does not ideally match our retrieval problem, since the history considered is the exact sequence of consultations R_{i−n+1} ... R_{i−1}, which may be too restrictive in the general case. However, this model may be suitable for frequent sequences of consultations, which can be considered as "patterns of consultation": they are assigned a high probability, thus increasing the probability of resources inside such sequences. It would be interesting to take such "patterns of consultation" into account in a more adequate way. As n-grams models exhibit strong constraints, we are also interested in more permissive models; trigger-based language models seem to be more adequate for less constrained histories such as navigations.
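A minimal sketch of such a community-dependent n-grams model follows, with maximum-likelihood estimates and no smoothing; the data structures and function names are our own illustration, not code from the paper.

```python
from collections import defaultdict

def train_ngram(community_sessions, n=2):
    """community_sessions: list of consultation sequences (lists of resource ids)
    for one community. Returns maximum-likelihood n-gram counts."""
    context_counts = defaultdict(int)
    ngram_counts = defaultdict(int)
    for session in community_sessions:
        for i in range(len(session)):
            context = tuple(session[max(0, i - n + 1):i])
            ngram_counts[(context, session[i])] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def p_ngram(resource, history, ngram_counts, context_counts, n=2):
    """P_n(R_i | R_{i-n+1}, ..., R_{i-1}, c_j) for one community's model."""
    context = tuple(history[-(n - 1):]) if n > 1 else tuple()
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[(context, resource)] / context_counts[context]
```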
3.4 Trigger-Based Language Models
Trigger-based models [9] aim at considering long-time dependence between two words (w_x and w_y, for instance). Dependence is measured by Mutual Information (MI) [10]. This measure can easily integrate long-time dependence by using a distance parameter d: d is the maximum number of words occurring between w_x and w_y, so a window of d words is considered.
A couple (w_x, w_y) with a high MI value means that w_x and w_y are highly correlated and that the presence of w_x raises the probability of occurrence of w_y, within a maximal distance of d words; (w_x, w_y) is named a trigger. This model considers only highly correlated pairs of words (corresponding to high MI values); useless pairs are discarded. The resulting set is called S. Given the history h_j = w_1, ..., w_{i−1}, the trigger model computes the probability of w_i as:

P_t(w_i | h_j) = [ Σ_{w_j ∈ h_j} δ_{w_j, w_i, h_j, S} ] / [ Σ_{w_j ∈ h_j} Σ_{w_t ∈ V} δ_{w_j, w_t, h_j, S} ]    (1)

with δ_{w_j, w_i, h_j, S} = 1 if (w_j, w_i) ∈ S and d_j(w_j, w_i) ≤ d, and 0 otherwise, where d_j(w_j, w_i) is the distance, in words, between w_j and w_i in h_j.

In our web browsing assistant tool, the trigger model is made up of triggers of resources (R_x, R_y): the consultation of R_x triggers the consultation of R_y, at a maximal distance of d resources. As the MI measure is not symmetric (MI(R_x; R_y) ≠ MI(R_y; R_x)), this model integrates order between resources, which may be crucial for specific resources. The advantage of such a model is the long-time dependence between the two resources: in a consultation, two resources can be viewed at various distances without changing the meaning of the consultation. Trigger models make it possible to model this kind of influence, where the value of the distance between items is not discriminant but the order of occurrence is meaningful. Such a model is less constrained than n-grams models and seems adequate for the navigation problem. Similarly to the n-grams model, a trigger model is developed for each community c_j. MI values are computed for each couple of resources and for each community, and a set of the most related triggers (S_{c_j}) is extracted for each community c_j. The probability of a resource R_i, given the community c_j, its corresponding set of triggers S_{c_j} and the sequence of consultations h_j = R_1, ..., R_{i−1}, is:

P_t(R_i | h_j, c_j) = [ Σ_{R_x ∈ h_j} δ_{R_x, R_i, h_j, S_{c_j}} ] / [ Σ_{R_x ∈ h_j} Σ_{R_y ∈ R} δ_{R_x, R_y, h_j, S_{c_j}} ]    (2)

with δ_{R_x, R_i, h_j, S_{c_j}} = 1 if (R_x, R_i) ∈ S_{c_j} and d_j(R_x, R_i) ≤ d, and 0 otherwise, where d_j(R_x, R_i) is the distance between R_x and R_i in history h_j.
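A sketch of the trigger-based probability of Equation (2), assuming the set of community triggers S_{c_j} has already been selected by mutual information; the function signature and the distance handling are our own simplifications, not code from the paper.

```python
def p_trigger(resource, history, all_resources, triggers, d):
    """Equation (2): the share of firing triggers that point to `resource`.
    triggers is the community's retained set of ordered pairs (R_x, R_y);
    d is the maximum distance, counted in consulted resources."""
    def delta(r_x, r_y):
        if (r_x, r_y) not in triggers:
            return 0
        # Simplification: distance measured from the first occurrence of r_x
        # to the end of the history, where the candidate would be consulted.
        dist = len(history) - history.index(r_x)
        return 1 if dist <= d else 0

    numerator = sum(delta(r_x, resource) for r_x in history)
    denominator = sum(delta(r_x, r_y) for r_x in history for r_y in all_resources)
    return numerator / denominator if denominator else 0.0
```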
4 Towards a Community-Based SGU
The SGU we propose in this article has the advantage of considering both the community of the active user and his/her consultation history (the sequence of consultations), whereas state-of-the-art models usually exploit only the set of consultations. The use of this model relies on two steps:
4.1 Determination of the Community of the Active User
The first objective is to compute a set of user communities based on an analysis of usage. To achieve this goal, we use collaborative filtering techniques. The set of users is split into classes by using a recursive k-means-like algorithm [2]; the similarity between two users is estimated as the mean of the distances over the resources they have both voted on [11]. The whole corpus is then split into community sub-corpora, each made up of the usage data of the users in that community. A user is then assigned to the closest community using the same similarity measure.
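A sketch of the community-assignment step, under the assumption that user profiles are vote vectors over resources and that similarity is the mean distance over commonly voted resources, as described above; the clustering itself (the recursive k-means-like algorithm of [2]) is not reproduced here, and all names are illustrative.

```python
def user_distance(votes_a, votes_b):
    """Mean absolute difference over commonly voted resources; returns None
    when the two users share no rated resource."""
    common = set(votes_a) & set(votes_b)
    if not common:
        return None
    return sum(abs(votes_a[r] - votes_b[r]) for r in common) / len(common)

def assign_community(user_votes, community_centroids):
    """Assign a user to the closest community centroid (a vote profile)."""
    best, best_dist = None, float("inf")
    for community, centroid in community_centroids.items():
        dist = user_distance(user_votes, centroid)
        if dist is not None and dist < best_dist:
            best, best_dist = community, dist
    return best
```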
4.2 Computation of the Probability of a Resource
Given the community c_j of user U_j and his/her history h_j, the computation of the probability of a resource R_i relies on three sub-models based on the language models presented in Section 3. The first sub-model computes the probability P_n(R_i | h_j, c_j) by exploiting the resource-sequence probabilities of the n-grams model. The second sub-model is the trigger model; it computes the probability P_t(R_i | h_j, c_j). The last sub-model is devoted to resources outside the training corpus: an a priori probability P_a(R_i | c_j) is set for each resource R_i ∈ R. The resulting model, which can be viewed as a community-based SGU, computes the linear combination of the three previously described sub-models:

P(R_i | h_j, c_j) = λ_n P_n(R_i | h_j, c_j) + λ_t P_t(R_i | h_j, c_j) + λ_a P_a(R_i | c_j)    (3)
where the λ weights are optimized with the EM algorithm [12] on a development corpus. Thus, given a user U_j and his/her history h_j, we first determine the community c_j he/she belongs to. Then, the probability of any available resource is computed with the SGU learned for this community, and the N most likely resources are selected. Systematically selecting resources from this same subset of likely items would prevent any novelty in the suggestions. To enable novelty, we randomly select a subset of unlikely resources (SUR) that is added to the previous subset of N likely resources (SLR) to build the set of candidates (SC). We determine the suggested resources for a specific user using a roulette wheel: we assign a sector of the wheel to each resource in SC, with the size of the sector proportional to the probability of occurrence of this resource as given by the SGU. One or several resources from SC are then drawn independently using this roulette-wheel principle and are submitted to user U_j.
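A sketch of Equation (3) and of the roulette-wheel suggestion step; the sub-model callables, the candidate set and the parameter names are placeholders for the components described above, not the authors' implementation.

```python
import random

def sgu_probability(resource, history, community, p_n, p_t, p_a, lambdas):
    """Equation (3): linear interpolation of the three sub-models.
    p_n, p_t and p_a are callables standing in for the n-grams, trigger
    and a priori sub-models; lambdas = (lambda_n, lambda_t, lambda_a)."""
    l_n, l_t, l_a = lambdas
    return (l_n * p_n(resource, history, community)
            + l_t * p_t(resource, history, community)
            + l_a * p_a(resource, community))

def roulette_suggest(candidates, probability, how_many=3):
    """Roulette-wheel draw: each candidate gets a sector proportional to its
    SGU probability; draws are independent (with replacement)."""
    weights = [probability[c] for c in candidates]
    if sum(weights) == 0:
        return random.sample(candidates, k=min(how_many, len(candidates)))
    return random.choices(candidates, weights=weights, k=how_many)
```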
5 Discussion and Perspectives
This paper has described a new web browsing assistant based on usage analysis and natural language processing. The approach avoids the difficult task of content indexing and facilitates the management of heterogeneous resources. Similarities between SLMs and web browsing are put forward; therefore the integration of SLMs is investigated. The resulting model is a Statistical Grammar of Usage (SGU).
As a single SGU may be inefficient, it has to be personalized. To tackle data sparsity, a preclassification of users into communities is performed, and community-based SGUs are then proposed. A second contribution consists in the design of community-based SGUs that predict the sequentiality of resources during navigation. Moreover, a community-based SGU builds an a posteriori structure of navigation based on a subjective but reliable measure of the pertinence of a resource for a user. Consequently, it performs a personalized indexing of resources based on usage analysis. As the collaborative filtering techniques used to build communities and the triggers used to suggest resources have both proved their efficiency in their respective domains, a first perspective is the validation of the community-based SGU in terms of quality of predictions in web browsing. A second perspective is the use of the community-based SGU to compute a personalized classification of resources, depending not only on topics but also on the user's preferences and context.
References

1. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, New York, 1999.
2. S. Castagnos and A. Boyer, "A client/server user-based collaborative filtering algorithm: model and implementation," in Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), Riva del Garda, Italy, August 2006.
3. S. Castagnos and A. Boyer, "Frac+: A distributed collaborative filtering model for client/server architectures," in 2nd Conference on Web Information Systems and Technologies (WEBIST 2006), Setúbal, Portugal, 2006.
4. K. Smaïli, A. Brun, I. Zitouni, and J.P. Haton, "Automatic and manual clustering for large vocabulary speech recognition: A comparative study," in European Conference on Speech Communication and Technology, Budapest, Hungary, 1999.
5. A. Brun, K. Smaïli, and J.P. Haton, "Contribution to topic identification by using word similarity," in International Conference on Spoken Language Processing (ICSLP 2002), 2002.
6. P. Chan, "A non-invasive learning approach to building web user profiles," in 5th International Conference on Knowledge Discovery and Data Mining – Workshop on Web Usage Analysis and User Profiling, San Diego, USA, August 1999.
7. J. Herlocker, J. Konstan, L. Terveen, and J. Riedl, "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 5–53, January 2004.
8. R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here," 2000.
9. R. Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling," Computer Speech and Language, vol. 10, pp. 187–228, 1996.
10. N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963.
11. U. Shardanand and P. Maes, "Social information filtering: algorithms for automating 'word of mouth'," in Proceedings of the ACM CHI'95 Conference on Human Factors in Computing Systems, 1995, vol. 1, pp. 210–217.
12. A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, 1977.
Harnessing Trust in Social Search
Peter Briggs and Barry Smyth
Adaptive Information Cluster, School of Computer Science & Informatics, University College Dublin, Ireland
{firstname.surname}@ucd.ie
Abstract. The social Web emphasises the increased role of millions of users in the creation of a new type of online content, often expressed in the form of opinions or judgements. This has led to some novel approaches to information access that take advantage of user opinions and activities as a way to guide users as they browse or search for information. We describe a social search technique that harnesses the experiences of a network of searchers to generate result recommendations that can complement the search results that are returned by some standard Web search engine.
1 Introduction
The world of the Web has changed in some very important ways recently. As if to emphasise this new “release” we have the all too frequently hyped Web 2.0 designation, but, more importantly, we can see how these changes are being driven by the (re-)establishment of the end-user as a critical component of the Web experience. The people-power driving this social Web extends beyond the content-layer and into the computational-layer that underpins a new breed of information access technologies such as social search engines. Put simply, social search techniques attempt to harness the views, opinions, and behaviour of users to improve Web search. This includes the new generation of social tagging and bookmarking services, which allow users to label and share all sorts of content, from pictures (Flickr) to news articles (Digg) and from blogs (Technorati) to Web pages (Del.icio.us). However, the concept also includes a range of fundamentally new search engine technologies, which attempt to use human relevance judgments to replace or augment more traditional forms of algorithmic search. In this paper we focus on a novel approach to social search, one that is informed by the idea that, when we search, frequently there will have been other people who have previously searched for the same or similar information. As such, these people, or more precisely their search histories, can be used to make useful recommendations for our current search. We describe a model of social search that has evolved from the Collaborative Web Search (CWS) approach
This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.
described by [7], which harnesses the search patterns of communities of likeminded searchers in order to identify relevant results for promotion. While CWS assumes a centralised community-based search engine, in this paper we describe a distributed model of collaborative Web search in which groups of searchers are organised as a peer-to-peer search network that facilitates inter-searcher co-operation at search time. We also describe and evaluate how an explicit model of user-user trust drives the evolution of the search network, enabling the propagation of queries through the network and influencing the rank of recommended results in the final result-list.
2 Social Information Access

2.1 Social Navigation and Recommendation
Human judgement can come in a variety of forms and be harnessed in many different ways. Perhaps the simplest form of human judgement relates to link-following behaviour as users browse: the selection of a particular link to follow can be viewed as an expression of interest in the link by the user. Many researchers have sought to record and leverage this type of information to adapt the structure of a Web site according to a user's navigation preferences. For example, the work of Smyth & Cotter [9] describes such an approach to personalizing the structure of mobile portals. The browsing behaviour of users has also been used to indicate important sections of online documents [3], and to construct navigation trails for users [11,2]. Many recommendation technologies have always relied very directly on the judgments of groups of users as the basis for their recommendations. For example, automated collaborative filtering (ACF) [6,4] is a popular approach to social recommendation in which users are profiled according to the ratings that they attach to particular content items. These ratings can take the form of explicit judgments on the part of users, e.g. the PTV system's [8] profiles comprised manually rated TV programs. Alternatively, ACF systems can infer judgments by monitoring other user activities, e.g. Amazon converts purchasing behaviours into judgments. To generate a set of recommendations for a particular user, an ACF system will identify a set of similar users (users whose ratings are highly correlated with the target user) and suggest items that these similar users have liked in the past, avoiding items that they have disliked. Of particular relevance, [5] discusses the role of trust in such recommender systems.
2.2 Social Search
Human judgments have played a role in Web search for some time. Google's PageRank [1] technique harnesses human judgments through link analysis. More directly, an early search engine called DirectHit ranked results by their selection popularity. The idea that human relevancy judgments may help to improve Web search is a powerful concept, and has led to the development of a number of so-called social search initiatives. For example, the work of [10] builds user profiles
from the bookmark collections of users in order to implement a collaborative filtering style approach to search result recommendation. Recent research has sought to take the concept of social search one step further. For example, the work of [7] on collaborative Web search (CWS) is especially important in this paper. CWS is a meta-search engine that focuses on a community-based approach to Web search in which each search community is made up of a set of like-minded searchers; for example, a motoring Web site will attract an ad-hoc community of car enthusiasts. CWS stores the search behaviour of a community of like-minded users in a matrix (hit-matrix, H_C) such that H_C(i, j) is the number of times that the members of a community C have selected page p_j for query q_i. When a new query q_T is submitted by a community member, a set of result recommendations is generated by selecting results that have been submitted for similar queries; ultimately these recommendations are promoted within the result-list returned by some underlying search engine. Query similarity is typically measured in terms of query-term overlap (see Equation 1) and results are ranked by their weighted relevance. The relevance of p_j to q_i is calculated as the proportion of times that p_j has been selected for q_i (see Equation 2), and these local relevance values are combined across all similar queries and weighted by the similarity of these queries to q_T; see Equation 3.

Sim(q, q') = |q ∩ q'| / |q ∪ q'|    (1)

Relevance(p_j, q_i) = H_C(i, j) / Σ_{∀j} H_C(i, j)    (2)

WRel(p_j, q_T, q_1, ..., q_n) = [ Σ_{i=1...n} Relevance(p_j, q_i) · Sim(q_T, q_i) ] / [ Σ_{i=1...n} Exists(p_j, q_i) · Sim(q_T, q_i) ]    (3)
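A sketch of Equations (1)–(3) over a hit-matrix stored as a nested dictionary; the data layout and names are illustrative, and none of the tie-breaking or thresholding details of [7] are included.

```python
def sim(q1, q2):
    """Equation (1): term overlap between two queries, each a collection of terms."""
    q1, q2 = set(q1), set(q2)
    return len(q1 & q2) / len(q1 | q2) if (q1 | q2) else 0.0

def relevance(hits, query, page):
    """Equation (2): proportion of all selections for `query` that went to `page`.
    hits maps a query (tuple of terms) to a dict {page: selection count}."""
    row = hits.get(query, {})
    total = sum(row.values())
    return row.get(page, 0) / total if total else 0.0

def weighted_relevance(hits, target_query, page):
    """Equation (3): combine per-query relevance over all similar past queries."""
    num = den = 0.0
    for past_query, row in hits.items():
        s = sim(target_query, past_query)
        if s == 0.0:
            continue
        num += relevance(hits, past_query, page) * s
        den += (1 if row.get(page, 0) > 0 else 0) * s
    return num / den if den else 0.0
```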
3 A Distributed Model of Collaborative Web Search
The key contribution of this paper is to recast the centralised model of CWS as a network of search partners connected by trust-weighted linkages that will evolve over time. Accordingly, search communities emerge as a consequence of network evolution, avoiding the need for the pre-defined communities required by the current model of CWS described by [7].
3.1 Building a Trust-Enhanced Search Network
Our new network-based model of CWS associates each user u with a hit-matrix, Hu , which locally stores their personal search history. In turn each user is supported by a CWS agent that works in the usual way (as outlined in the previous section) to generate recommendations from their personal hit-matrix (and an underlying search engine) in response to their queries. Of course, the limitation
is that these recommendations are now drawn from an individual user's personal search history, instead of a much richer community search history. The solution is to facilitate the sharing of recommendations between user search agents in a collaborative fashion. Thus we need to connect individual search agents to form a search network. Connected search agents can cooperate directly during future searches as queries and results are propagated across these connections.
3.2 Trust-Based Ranking
By connecting two users u_T and u_S, we can allow them to collaborate during future searches so that when u_T enters a new query, recommendations can come from his/her own hit-matrix but also from the hit-matrix of u_S; we will discuss query propagation through the network in the next section. For now it is important to understand that the strength of the connection between a pair of users is based on the response of one user to recommendations from another. The trust score between users u_T and u_S is determined by the proportion of u_S's recommendations that u_T has selected; see Equation 4, where SelRecs(u_S, u_T) is the number of times that u_T has selected one of u_S's recommended results, and NumRecs(u_S, u_T) is the total number of recommendations made by u_S to u_T. Note that Trust(u_T, u_S) ≠ Trust(u_S, u_T), so the search network takes the form of a directed graph; see Figure 1.

Trust(u_T, u_S) = SelRecs(u_S, u_T) / NumRecs(u_S, u_T)    (4)
Recommendations that originate from users with high trust scores should be considered more reliable than recommendations that originate from users with lower trust scores. This is captured by the trust-based relevance formula shown in Equation 5; a recommendation from uS to uT is weighted by the level of trust that uT has for the recommendations of uS based on past experiences.
Fig. 1. Example of a search network
The final list of result recommendations can then be ranked according to their trust-weighted relevance scores as in the standard model of CWS.

TRel(u_T, u_S, p_j, q_T, q_1, ..., q_n) = Trust(u_T, u_S) · WRel_{u_S}(p_j, q_T, q_1, ..., q_n)    (5)
3.3 Query Propagation
In our search network, users are not all interconnected; for example, in Figure 1 we see that user u_T is connected to users u_1 and u_4. However, this does not mean that u_T is limited to receiving recommendations from only these two directly connected users. We allow queries to propagate through the network at search time, along chains of connected users, with recommendations propagated back to the target user. In this way, a query q_T submitted by u_T is first propagated to u_1 and u_4 and then on to users u_2, u_3, and u_5, and so on, as shown in Figure 1. If a user has recommendations for q_T then these are returned to the intermediary who passed on the query, and so on back up the chain to u_T. Recommendations that originate from an indirect search partner are weighted with a trust score that is the combination of the trust scores of the sequence of partners along the chain. Thus, as shown in Equation 6, when u_T is connected to u_S by a sequence of user connections, u_1, ..., u_n, then the trust score between u_T and u_S (used in the relevance calculation, Equation 5) is given by Trust'(u_T, u_S).

Trust'(u_T, u_S) = Trust(u_T, u_1) · ... · Trust(u_n, u_S)    (6)
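A sketch of Equations (4)–(6): direct trust from recommendation history, multiplicative propagation along a chain of intermediaries, and the trust-weighted relevance of Equation (5). The dictionary-based graph representation is our own assumption, not code from the paper.

```python
def direct_trust(sel_recs, num_recs):
    """Equation (4): the share of u_S's recommendations that u_T selected."""
    return sel_recs / num_recs if num_recs else 0.0

def chain_trust(trust, chain):
    """Equation (6): trust propagated along a chain [u_T, u_1, ..., u_n, u_S].
    trust[(a, b)] holds the direct trust a places in b's recommendations."""
    score = 1.0
    for a, b in zip(chain, chain[1:]):
        score *= trust.get((a, b), 0.0)
    return score

def trust_weighted_relevance(trust_score, local_w_rel):
    """Equation (5): weight the recommender's local WRel by the (possibly
    propagated) trust score."""
    return trust_score * local_w_rel
```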
3.4 Maintaining the Search Network
Our model allows for the creation of new links if users happen to select results that have been recommended by a distant search partner. For instance, in Figure 1, uT selects a recommendation from u2 , thereby leading to the creation of a new link between uT and u2 . Currently we are exploring a variety of heuristics for determining the initial trust scores of such connections. At the end of each search session the selections of the target user, in this case uT , are used to update the trust scores of all users who contributed a recommendation. If a user contributed a recommendation that has been selected then their trust score will increase, if not it will decrease. Once again, we are currently evaluating a number of different trust models as alternatives to the proportional model described in Equation 4. In addition, we can use a simple threshold model to delete connections between users if their trust-scores fall below a certain level, thus eliminating the influence of low-quality agents within the network. This is an important point because it also provides a useful mechanism for dealing with malicious users and search spam. Such users will suffer from falling trust scores and will eventually become disconnected from the search network as recommendation providers.
4 Evaluation

4.1 Setup
The data for our evaluation comes from the profiles of 50 users of the online social bookmarking site Del.icio.us. For each user we view their bookmarked pages as proxies for search results that they are interested in, and the tags they associate with these pages as the queries that they might have used when searching for these pages. In order to achieve some degree of overlapping user interest, we downloaded the profiles of the top 50 users who tagged http://www.foafproject.org (the home page of the Friend of a Friend project for encouraging the use of machine-readable metadata to express online relationships between people). For each user, we retrieved all of their bookmarked URLs and the tag(s) used to label them. On average we downloaded 406 bookmarked pages and tagsets for each user, with each profile containing an average of 242 unique tags (or queries). We then created a hit-matrix for each user, associating every bookmarked page with its respective queries (tags) to simulate a search history for each individual. Each user hit-matrix and CWS agent serves as a node on our search network and below we describe the initialisation of this network. For the purposes of this experiment we are primarily interested in the ability of a network of collaborating searchers to generate relevant recommendations from their hit-matrices, and so in what follows we do not rely on an underlying search engine to produce complete result-lists. We adopt a leave-one-out approach to the evaluation: for each user we simulate a search for each one of their queries, q_T, in each case temporarily removing that row from their hit-matrix, and comparing the recommended search results from the search network to the known selections that this user has made for this query. A result r_S suggested by some user u_S, for a query q_T submitted by user u_T, is considered relevant if and only if r_S is actually contained within the hit-matrix of u_T as a selection for q_T. In this way we can measure precision and recall for each q_T by comparing the recommendations made to u_T to this user's prior selections for q_T. This method is repeated for a number of iterations, each iteration or epoch replaying the leave-one-out strategy for all users to allow the trust scores, which start out at 0.5 for all connected users, to evolve. In the initial stages of the experiment each network node (user) is connected to 10 other user nodes at random. During the experiment, the calculation of trust scores between users is delayed until such time that at least 10 recommendations have been exchanged between those users, ensuring that there is sufficient recommendation history between them to calculate a reliable trust score.
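A sketch of the strict relevance test and the per-query precision/recall bookkeeping used in this leave-one-out protocol; the data structures are our own illustration, and the recommendation call is a placeholder for the search network.

```python
def evaluate_query(target_hits, query, recommended, k=10):
    """Strict relevance test from the protocol above: a recommended page is
    relevant iff the target user's own hit-matrix lists it as a selection for
    this exact query (tag). target_hits maps query -> {page: count}."""
    relevant = set(target_hits.get(query, {}))
    top_k = recommended[:k]
    hits = sum(1 for page in top_k if page in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```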
4.2 The Evolution of Search Performance
Ultimately we are interested in the ability of the search network to deliver improved search performance as it evolves, and so we compute the average precision
Fig. 2. (a) Precision vs. recall for result-lists sizes (k) from 1 to 10. Numbers alongside epoch-1 nodes indicate the value of k for this precision-recall pair. (b) F-measure for k = 1...10. (c) Percentage of sessions with ≥ 1 relevant top-k result.
and recall scores for the recommended result-lists of varying sizes (k = 1...10) at the end of each epoch. We present the results for epochs 1,5,10,15, and 20 as a standard precision and recall graph in Figure 2(a). The first thing to note is that the precision and recall values are very low, but this is more an artifact of the strict test for relevance used in this experiment (it requires that both the result recommender and the target user labelled the recommended URL with the same tag/query) than a reflection of search quality. Nevertheless, we see marked increases in the precision and recall statistics of the search network as it evolves over epochs. Figure 2(b) looks at the F-measure (the harmonic mean of precision and recall) over epochs, and again shows that the quality of the recommendations made by the network improves over time. In Figure 2(c) we present an alternative performance graph which computes the average percentage of sessions that include at least one relevant result within the top k recommendations in sessions where recommendations are actually made. Once again we see a steady increase as the search network evolves. For example, during epoch 1, successful results are found in the top result-list position about 3% of the time, rising to just over 9% of the time if we consider the top 10 result-list positions. By epoch 20, this success rate has more than doubled for k = 1, with a success rate of over 6% at this position, and increased to nearly 11% for the top 10 results.
5 Conclusions
Social search techniques attempt to harness human judgments as a way to guide search. In this paper we have built on the work of [7] and the collaborative Web search technique, one of the shortcomings of which has been the need for predefined communities of searchers. We have introduced a new distributed model of CWS that generates community-focused search result recommendations from a network of trusted search partners. Communities develop over time, and preliminary experimental evidence supports the view that as trust relationships evolve the recommendations become increasingly accurate. A number of issues remain to be resolved. For example we have described one approach to initialising and updating the search network at search time and highlighted a number of options for alternatives. Our future work will look to explore these alternatives and provide a more comprehensive evaluation of different options, and to examine the scalability of our network model as the number of users increases.
References

1. S. Brin and L. Page. The Anatomy of a Large-Scale Web Search Engine. In Proceedings of the 7th International World-Wide Web Conference, volume 30, pages 107–117. Networks and ISDN Systems, 1998.
2. R. Farzan and P. Brusilovsky. Social Navigation Support Through Annotation-Based Group Modeling, volume 3538. Springer, 2005.
3. W. C. Hill, J. D. Hollan, D. Wroblewski, and T. McCandless. Edit Wear and Read Wear. In CHI '92: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3–9, New York, NY, USA, 1992. ACM Press.
4. J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gorgan, and J. Riedl. GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3):77–87, 1997.
5. P. Massa and P. Avesani. Trust-Aware Collaborative Filtering for Recommender Systems. In CoopIS/DOA/ODBASE (1), pages 492–508, 2004.
6. U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth". In Proceedings of the Conference on Human Factors in Computing Systems, pages 210–217. ACM Press, New York, USA, 1995.
7. B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. Exploiting Query Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User Modeling and User-Adapted Interaction: The Journal of Personalization Research, 14(5):383–423, 2004.
8. B. Smyth and P. Cotter. PTV: Intelligent Personalised TV Guides. In Proceedings of the 12th Conference on Innovative Applications of Artificial Intelligence (IAAI-2000). AAAI Press, 2000.
9. B. Smyth and P. Cotter. Personalized Adaptive Navigation for Mobile Portals. In ECAI, pages 608–612, 2002.
10. A. Voss and T. Kreifelts. SOAP: Social Agents Providing People with Useful Information. In Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work, pages 291–298. ACM Press, 1997.
11. A. Wexelblat and P. Maes. Footprints: History-Rich Web Browsing. In Proceedings of the Third International Conference on Computer-Assisted Information Retrieval, Montreal, Quebec, Canada, 1997.
How to Compare Bilingual to Monolingual Cross-Language Information Retrieval
Franco Crivellari, Giorgio Maria Di Nunzio, and Nicola Ferro
Department of Information Engineering – University of Padua
Via Gradenigo, 6/a – 35131 Padova – Italy
{crive, dinunzio, ferro}@dei.unipd.it
Abstract. The study of cross-lingual Information Retrieval Systems (IRSs) and a deep analysis of system performances should provide guidelines, hints, and directions to drive the design and development of the next generation MultiLingual Information Access (MLIA) systems. In addition, effective tools for interpreting and comparing the experimental results should be made easily available to the research community. To this end, we propose a twofold methodology for the evaluation of Cross Language Information Retrieval (CLIR) systems: statistical analyses to provide MLIA researchers with quantitative and more sophisticated analysis techniques; and graphical tools to allow for a more qualitative comparison and an easier presentation of the results. We provide concrete examples about how the proposed methodology can be applied by studying the monolingual and bilingual tasks of the Cross-Language Evaluation Forum (CLEF) 2005 and 2006 campaigns.
1 Introduction
The growing interest in MultiLingual Information Access (MLIA) is witnessed by the international activities which promote the access, use, and search of digital contents available in multiple languages and in a distributed setting, that is, digital contents held in different places by different organisations. In particular, the Cross-Language Evaluation Forum (CLEF)1 aims at evaluating MLIA systems which operate on European languages in both monolingual and cross-lingual contexts. The experimental evaluation carried out on MLIA systems takes on a twofold meaning: on the one hand, it should provide guidelines, hints, and directions to drive the design and development of the next generation MLIA systems; on the other hand, effective tools for interpreting and comparing the experimental results should be made easily available to the community. We focus our attention on the study of cross-lingual IRSs and on a deep analysis of performance comparison between systems which perform monolingual tasks, i.e. querying and finding documents in one language, with respect to those which perform bilingual tasks, i.e. querying in one language and finding documents in another language.

1 http://www.clef-campaign.org/
The work presented in this paper aims at improving the current way of comparing bilingual and monolingual retrieval, which is basically a comparison of two averages of performance measures, and strives to provide better methods and tools for assessing the performances. A twofold methodology for the evaluation of CLIR systems is proposed: statistical analyses to provide MLIA researchers with quantitative and more sophisticated analysis techniques; graphical tools to allow for a more qualitative comparison and an easier presentation of the results. We discuss an example of application of the proposed methodology by studying the monolingual and bilingual tasks of the CLEF 2005 and 2006 campaigns. Note that these application examples also serve the purpose of validating the proposed methodology in a real setting. The paper is organized as follows: Section 2 introduces the proposed methodology; Section 3 describes the experimental setting used for applying the proposed methodology, provides the application examples and reports the experimental results; finally, Section 4 draws some conclusions and provides an outlook for future work.
2 Cross-Lingual Comparison Methodology
A common method used to evaluate how good bilingual retrieval systems are is to compare results against monolingual baselines. The Mean Average Precision (MAP) is often used as a summary indicator among the different performance figures available in Information Retrieval. For example, the overviews of the last CLEF workshops report figures where the MAP of a bilingual IRS is around 80% of the MAP of a monolingual IRS for the main European languages [1,2,3]. Even the recent literature on CLIR evaluation [5] compares performances between a monolingual baseline MAP and the MAP of a bilingual approach. However, some hints of a deeper analysis can be found: statistical and query-by-query performance analyses. In order to go beyond the simple comparison of MAP values between monolingual and bilingual performances, we propose a comparison methodology consisting of two complementary techniques which are both based on a comparison of results on single topics: a deep statistical analysis of both the monolingual and the bilingual tasks, described in Section 2.1; and a graphical comparison of both the monolingual and the bilingual tasks, described in Section 2.2.
2.1 Statistical Analysis Methodology
As pointed out by [4], a statistical methodology for judging whether measured differences between retrieval methods can be considered statistically significant is needed and, in line with this, CLEF usually performs statistical tests on the collected experiments [1,2] to assess their performances. On the other hand, these statistical tests are usually aimed at investigating the differences among the experiments within the same task, e.g. the monolingual French experiments alone or the bilingual French experiments alone, but they do not perform any kind of
cross-task analysis, i.e. some kind of direct comparison between monolingual and bilingual tasks.
Given the average performance for each single topic of the monolingual and bilingual task, we want to study the distribution of these performances and employ different statistical tests to verify the following conditions:
1. the distributions of the performances are similar. This suggests that bilingual systems behave in a similar way with respect to monolingual ones, which represent our empirical baseline of the best attainable performances;
2. the variances of the two distributions are similar. This suggests that even though passing from one language to another causes a decrease in the performances, the effect of the translation does not increase the dispersion of performances, which would add more uncertainty;
3. the means of the two distributions are different and, in particular, the mean of the monolingual distribution is greater than the mean of the bilingual one. This suggests some loss of performance due to the effect of the translations from one language to another.
Note that we do not aim to demonstrate whether all these conditions simultaneously hold or not. Rather, we want to develop an analysis methodology which allows researchers to gain better insights into these conditions. We can anticipate here from Section 3 that these conditions do not hold simultaneously for all the monolingual and bilingual tasks we have analysed, even though the general claim is usually complied with.
In order to verify the first condition, since the distribution of the experiments is unknown, we can adopt a quantile-quantile plot, which allows us to compare the distribution of the monolingual experiments with respect to the distribution of the bilingual experiments. In a quantile-quantile plot the quantiles of the two distributions are increasingly ordered and compared and, if the samples do come from the same distribution, the plot will be linear.
The last two conditions are analyzed and studied by means of statistical tests for the equality of two variances and for the equality of two means; the tests that are used in the paper (the F-test and the t-test, respectively) assume that the collected data are normally distributed. Therefore, before proceeding, we need to verify the normality of the involved distributions by using graphical tools for inspection (e.g. the boxplot or the normality plot) or normality tests (e.g. the Lilliefors test or the Jarque-Bera test). However, if the normality assumption is violated, a transformation of the data should be performed. The transformation for proportion measures that range from 0 to 1 is the arcsin-root transformation, which Tague-Sutcliffe recommends for use with precision/recall measures.
After the check on the normality of the data, a test for the equality of variances, the F-test, is carried out to check whether the distributions have the same variance or not; this step allows us to verify the second condition. Finally, in order to assess whether the mean of the monolingual performances is greater than the bilingual one, a t-test is used. In particular, since we have two paired sets (monolingual and bilingual) of m measured values, where m is the number of topics, the paired t-test is used to determine whether they differ from each other in a significant way, under the assumptions that the paired differences are independent and identically normally distributed. This step allows us to verify the third condition reported above.
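As an illustration of how this sequence of tests could be run, the sketch below uses Python with NumPy and SciPy; the paper itself used MATLAB and its Statistics Toolbox (see Section 3), so this is only a rough equivalent, and the per-topic mean average precision vectors mono and bili, the choice of the Jarque-Bera test, and the one-sided alternative are assumptions of the sketch rather than the authors' exact procedure.

```python
import numpy as np
from scipy import stats

def compare_tasks(mono, bili, alpha=0.05):
    """Compare per-topic mean average precision of a monolingual and a bilingual task.

    mono, bili : 1-D arrays of length m (one value per topic, in the same topic order).
    """
    mono, bili = np.asarray(mono, float), np.asarray(bili, float)

    # Normality check (the Jarque-Bera test is one of the tests named above);
    # if normality is violated, apply the arcsin-root transformation for proportions.
    if stats.jarque_bera(mono)[1] < alpha or stats.jarque_bera(bili)[1] < alpha:
        mono, bili = np.arcsin(np.sqrt(mono)), np.arcsin(np.sqrt(bili))

    # F-test for the equality of the two variances (two-sided p-value).
    f = mono.var(ddof=1) / bili.var(ddof=1)
    dfn, dfd = len(mono) - 1, len(bili) - 1
    p_var = 2 * min(stats.f.sf(f, dfn, dfd), stats.f.cdf(f, dfn, dfd))

    # Paired t-test on the means, one-sided alternative "monolingual > bilingual"
    # (the `alternative` keyword requires SciPy >= 1.6).
    _, p_mean = stats.ttest_rel(mono, bili, alternative='greater')

    return {'p_equal_variances': p_var, 'p_mono_greater_than_bili': p_mean}
```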
2.2 Graphical Comparison Methodology
The graphical tool which allows us to easily compare the performances for each topic of the monolingual and bilingual tasks, and to gain a visual explanation of the behavior of the two distributions, first needs a retrieval effectiveness measure to be used as a performance indicator. Then, we compute a descriptive statistic for the selected measure for each topic. In our case, the average precision and the mean were used, respectively. Finally, we increasingly order the monolingual topics by the computed descriptive statistic; the bilingual topics are ordered in the same order as the monolingual ones, because we are performing a topic-by-topic comparison and we want to compare a monolingual topic with the corresponding bilingual one. Note that ordering of the bilingual topics is usually different from what we would obtain if we increasingly ordered the bilingual topics by the computed descriptive statistic. From this ordered data we can produce two plots which provide us with an indicator for summarizing the trend of the monolingual and bilingual distributions. Examples of these plots are shown in Figure 1. For space reasons, the graphical examples are limited to only two plots. Figure 1a shows the first plot where, for each topic, the monolingual performances on the x-axis, ordered increasingly, are plotted against the corresponding bilingual performances on the y-axis. If monolingual and bilingual behave in a similar way, the points are placed close and around the bisector of the first and third quadrant; on the other hand, if the monolingual performs better than the bilingual, the points are shifted towards the right whereas they are shifted symmetrically if the bilingual performs better than the monolingual. This plot provides us with a qualitative estimate about whether the three conditions introduced in the previous section hold: in that case, the plot would appear roughly linear and the points would be shifted towards the right. It is important to stress that this plot resembles a quantile-quantile plot but with an important difference: in a quantile-quantile plot the two distributions are independently ordered by their increasing quantiles, while in this plot the bilingual distribution is in the same order as the monolingual one. Figure 1b presents a different view of the same data: for each topic on the x-axis, we plot on the y-axis both the monolingual (circles) and the bilingual performances (squares). This representation allows us to directly inspect the differences of the performances in a topic-by-topic fashion and provides us with hints about which topics require a deeper investigation because, for example, performances are too low or differences in the performances are too great. Moreover, this plot also allows us to qualitatively assess the three conditions reported in the previous section: in that case, the bilingual points would have a trend roughly similar to the monolingual ones and they would be below the monolingual ones.
[Figure 1 comprises two plots: (a) Example of direct comparison — the bilingual mean of average precision (y-axis) plotted against the monolingual mean of average precision (x-axis), topic by topic; (b) Comparison on French 2006 data — the mean of average precision (y-axis) of the monolingual and bilingual runs plotted per topic (x-axis), each with a linear fit.]
Fig. 1. Example of monolingual vs bilingual French 2006 comparison plots
Figure 1b also shows an example of linear fit: if the three conditions introduced in the previous section hold, the two straight lines would have roughly the same slope and the bilingual line would be right shifted with respect to the monolingual one.
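A minimal Matplotlib sketch of the two plots, assuming the same per-topic mean average precision vectors mono and bili as before, could look as follows (an illustration only, not the tool used by the authors):

```python
import numpy as np
import matplotlib.pyplot as plt

def comparison_plots(mono, bili):
    """Draw the two views of Figure 1 for one monolingual/bilingual task pair."""
    order = np.argsort(mono)                      # order topics by monolingual performance
    mono, bili = np.asarray(mono)[order], np.asarray(bili)[order]
    topics = np.arange(len(mono))

    fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

    # (a) direct comparison: monolingual on x, bilingual on y, bisector as reference
    ax_a.scatter(mono, bili)
    ax_a.plot([0, 1], [0, 1], 'k--')
    ax_a.set_xlabel('Monolingual mean of average precision')
    ax_a.set_ylabel('Bilingual mean of average precision')

    # (b) topic-by-topic comparison with a least-squares linear fit for each task
    for values, label, marker in [(mono, 'monolingual', 'o'), (bili, 'bilingual', 's')]:
        ax_b.plot(topics, values, marker, label=label)
        slope, intercept = np.polyfit(topics, values, 1)
        ax_b.plot(topics, slope * topics + intercept, label=label + ' linear fit')
    ax_b.set_xlabel('topic')
    ax_b.set_ylabel('mean of average precision')
    ax_b.legend()
    plt.show()
```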
3 Experimental Setting
The experimental collections and experiments used are fully described in [1,2]. In the CLEF 2005 and 2006 campaigns the languages of the target collection used for the monolingual and bilingual tasks were the same: Bulgarian, English, French, Hungarian, and Portuguese. However, we decided to take only those experiments of a bilingual task that have English as the source language in order to remove outliers from the data (for example extremely low performances due to very difficult languages such as Hindi or Amharic). Moreover, since we needed a sufficient number of experiments for each task to have reliable statistical analyses, we selected the tasks with the most experiments. These constraints led us to choose monolingual French (38 experiments for 2005 and 27 for 2006), monolingual Portuguese (32 experiments for 2005, 37 for 2006), bilingual French (12 experiments for 2005, 8 for 2006), and bilingual Portuguese (19 experiments for 2005, 18 for 2006). Remember that each one of these tasks has 50 topics. For each task, we built a matrix n×m of n experiments and m topics where at position (i, j), with 1 ≤ i ≤ n and 1 ≤ j ≤ m, we have the average precision (AP) of experiment ei on topic tj. Then, we took the mean of the transformed performances by columns, that is, we took the average performances for each topic. As a result we had a vector for each task, like: v_task^T = [mean_1 mean_2 . . . mean_m], where mean_1 is the mean calculated for the first column, that is, the first topic of the task. The aim of the experimental analysis is to study the distribution of the mean of both the monolingual and bilingual tasks and compare them.
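As a small sketch of this reduction (with random placeholder values standing in for a real average precision matrix), the per-topic vector is a single column-wise mean:

```python
import numpy as np

# ap[i, j] = average precision of experiment e_i on topic t_j (placeholder values).
n_experiments, m_topics = 38, 50
ap = np.random.rand(n_experiments, m_topics)

v_task = ap.mean(axis=0)          # one mean per topic: [mean_1, ..., mean_m]
```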
Table 1. Variance tests (F-tests) and Two-samples Paired t-test on CLEF 2005 data. Values in bold indicate hypotheses rejected.

                     H0: σ²mono = σ²bili   H0: σ²mono <= σ²bili   H0: σ²mono >= σ²bili
French p-value             0.8281                 0.5859                 0.4141
Portuguese p-value         0.9661                 0.4831                 0.5169

                     H0: μmono = μbili     H0: μmono <= μbili     H0: μmono >= μbili
French p-value             0.8532                 0.4266                 0.5734
Portuguese p-value         0.0000                 0.0000                 1.0000
The results presented are divided by year (2005 and 2006) and language (French and Portuguese). First the result of the normality test is presented, then the results of the analysis of variance are shown, and finally the analysis of the mean is discussed. Each calculation was carried out using MATLAB (version 7.2 R2006a) and the MATLAB Statistics Toolbox (version 5.2 R2006a).

3.1 Statistical Analysis Methodology
Since the data resulted normal after a normality test, no arcsin-root transformation was adopted. In all the analyses, an alpha level of 5% was used.
CLEF 2005. The first analysis examines the variances of the data of the monolingual and the bilingual tasks. In Table 1, the results for the monolingual French vs bilingual French and the monolingual Portuguese vs bilingual Portuguese are presented. All the hypotheses are shown, starting from the most important one: that the variances of the monolingual, σ²mono, and the bilingual, σ²bili, are equal. The other two hypotheses are important because the outcome shows that it is better not to reject them instead of accepting the alternative hypothesis, which is, in those cases, that σ²mono is either greater or less than σ²bili.
The second analysis considers the means of the monolingual, μmono, and bilingual, μbili, performances. Even though the hypothesis stated in Section 2.1, that is, that the mean of the monolingual performances is greater than the bilingual one, is the main one, we believe it is important to consider all the aspects of the analysis. For this reason, we have presented the results for all the hypotheses in Table 1. It is interesting to see that the French tests all result in favor of the null hypothesis, that is to say it is preferable never to accept the alternative hypotheses that μmono is either greater or less than μbili. On the other hand, the analysis of the Portuguese tasks shows that, with the combination of all the hypotheses, there is strong evidence that the mean of the performance of the monolingual Portuguese is greater than the bilingual one.
CLEF 2006. The analyses of the variances of the data of the monolingual and the bilingual tasks are shown in Table 2 for both the monolingual French vs bilingual French and the monolingual Portuguese vs bilingual Portuguese. All the tests confirm the hypothesis that the variances of the monolingual and bilingual tasks are equal.
Table 2. Variance tests (F-tests) and Two-samples Paired t-test on CLEF 2006 data. Values in bold indicate hypotheses rejected.

                     H0: σ²mono = σ²bili   H0: σ²mono <= σ²bili   H0: σ²mono >= σ²bili
French p-value             0.8019                 0.4009                 0.5991
Portuguese p-value         0.4270                 0.7865                 0.2135

                     H0: μmono = μbili     H0: μmono <= μbili     H0: μmono >= μbili
French p-value             0.6860                 0.3430                 0.6570
Portuguese p-value         0.0001                 0.0001                 0.9999
The two-sample paired t-tests on the means of the performances, shown in Table 2, confirm the outcome of CLEF 2005: the tests on the French tasks are all in favor of the null hypothesis, that is to say the means are equal; the tests on the Portuguese tasks confirm that there is strong evidence that the mean of the performance of the monolingual Portuguese is greater than the bilingual one.

3.2 Graphical Comparison Methodology
In addition to the statistical analyses, we also present an effective graphical tool that gives a visual explanation of the behavior of the distributions of the monolingual and bilingual performances. Figures and plots were already shown in Section 2.2 and we cannot report the complete set of plots here for space reasons. On the other hand, we would like to comment on those plots in the light of the statistical analyses carried out in the previous section. First, testing whether two distributions have similar shape and testing the normality of data can be done by means of standard tools such as the quantile-quantile plot and the normal probability plot. The quantile-quantile plots show that any monolingual-bilingual pair, both for French and Portuguese, has a regular linear trend, that is to say the shapes of the distributions are similar. The normal probability plot also shows the same regularity, which is sometimes violated along the tails of the distributions. Second, the analysis of the performance on single topics can be appreciated in Figure 1: Figure 1a shows the performances on French 2006 data, ordered in the way explained in Section 2.2. This plot shows that there is a strong correlation between monolingual and bilingual performances; this correlation is above 0.80 for each pair of French and Portuguese tasks. Moreover, the cloud of points is located around the bisector of the first and third quadrant (above and below it), which means that, in this case, it is difficult to decide whether the monolingual is better than the bilingual or not. This confirms the results of the test on the means for French. Instead, the plot for the Portuguese (not shown here) shows a cloud of points which is more below than above the line, confirming that the monolingual Portuguese performs generally better than the bilingual. In Figure 1b, a linear interpolation of the French 2006 tasks is performed. The two lines are very close and cross each other; this figure clearly shows that even the linear interpolation of the monolingual and bilingual French data confirms that, in this case, the monolingual and bilingual performances are equal. Notice that we also have an indication of when the monolingual performance is better or worse than the bilingual; for example, for low performances bilingual performs better than monolingual, while for high performances monolingual performs better.
4 Conclusions and Future Work
In this paper, we proposed a methodology which exploits both statistical analyses and graphical tools for the evaluation of MLIA systems. The statistical analysis provides MLIA researchers with guidelines to drive the design and development of the next generation MLIA systems; the graphical tool provides a means to interpret experimental results and to present them easily to other research communities. We provided concrete examples of how the proposed methodology can be applied through the analysis of the monolingual and bilingual tasks of the CLEF 2005 and 2006 campaigns.
Acknowledgements. The work reported in this paper has been partially supported by the DELOS Network of Excellence on Digital Libraries, as part of the Information Society Technologies (IST) Program of the European Commission (Contract G038507618). The authors would like to warmly thank Carol Peters, coordinator of CLEF, for her continuous support and advice.
References

1. G. M. Di Nunzio, N. Ferro, G. J. F. Jones, and C. Peters. CLEF 2005: Ad Hoc Track Overview. In Accessing Multilingual Information Repositories: Sixth Workshop of the Cross–Language Evaluation Forum (CLEF 2005). Revised Selected Papers, pages 11–36. LNCS 4022, Springer, Heidelberg, Germany, 2006.
2. G. M. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2006: Ad Hoc Track Overview. In Working Notes for the CLEF 2006 Workshop. http://www.clef-campaign.org/2006/working notes/workingnotes2006/dinunzioOCLEF2006.pdf, 2006.
3. J. Gonzalo and C. Peters. The Impact of Evaluation on Multilingual Text Retrieval. In Proc. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), pages 603–604. ACM Press, New York, USA, 2005.
4. D. Hull. Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), pages 329–338. ACM Press, New York, USA, 1993.
5. J. Wang and D. W. Oard. Combining Bidirectional Translation and Synonymy for Cross-Language Information Retrieval. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 202–209. ACM Press, New York, USA, 2006.
Multilingual Text Classification Using Ontologies

Gerard de Melo and Stefan Siersdorfer

Max Planck Institute for Computer Science, Saarbrücken, Germany
{demelo,stesi}@mpi-inf.mpg.de
Abstract. In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
1 Introduction
Text classification (TC) is the process of associating text documents with the classes considered most appropriate, thereby distinguishing topics such as particle physics from optical physics. Research in this area, despite the considerable amount of work on cross-lingual information retrieval, has almost entirely neglected cases of documents being provided in multiple languages. Apart from truly multilingual environments as in large parts of Africa, people all over the world work with a lingua franca such as English or Spanish in addition to their native languages. Therefore, most applications of TC, e.g. digital libraries, news wire filtering as well as web page and e-mail categorization, also turn out to be interesting applications of multilingual text classification (MLTC), where documents given in different languages are to be classified by topic or similar criteria. In this paper, we provide linguistic arguments against existing approaches and devise a novel solution that exploits background knowledge from ontologies and lexical resources. Section 2 discusses related work in this area, followed by Section 3, which briefly recapitulates fundamental ideas in TC and delivers arguments against existing approaches. Section 4 then presents Ontology Region Mapping as an alternative, which is then evaluated in Section 5, while the concluding section outlines the implications for continued research in this area.

2 Related Work
There has been research on MLTC in the case of enough training documents being available for every language [1,2], however such scenarios are not particularly interesting as they can be resolved with separate monolingual solutions. A more universal strategy is to use translation to ensure that all documents are available in a single language [3,4,5], which also corresponds to the dominant approach in cross-lingual information retrieval (CLIR) [6]. However, our work shows that translations alone entail suboptimal TC results. An alternative approach to monolingual TC [7,8,9] and CLIR [10] related to the path pursued in our work relies on ontologies or thesauri to construct concept-based representations. While some authors pay respect to hypernyms and other directly related concepts [11,12,13], our approach is apparently the first to use an activation spread model in TC or CLIR. Multilingual solutions based on latent semantic analysis (LSA) have also been studied [14,15], however LSA differs from our approach in that it does not use formal background knowledge, but rather identifies concepts implicitly present in a set of documents, computed statistically by detecting terms with similar occurrence patterns.
3 Background
A classification is an assignment of class labels to objects such as text documents, and automatic text classification is the process of establishing and deploying classifiers that approximate text classifications made by human experts. When a set of pre-classified training documents is available, and an appropriate representation of their contents as numerical feature vectors is constructed, one of several learning algorithms can be employed to learn a classification, e.g. the Support Vector Machine (SVM) [16], which distinguishes two classes by using the hyperplane that maximizes the distance to the closest positive and negative examples as a binary decision surface. The conventional way of establishing the vector space for text documents involves well-known techniques such as stopword removal and stemming as preprocessing steps to computing TF-IDF values that are used to construct feature vectors based on the bag-of-words model [17]. Machine translation has been proposed as an ad hoc means of making multilingual document sets amenable to such TC processing [3]. However, certain drawbacks of the bag-of-words model then become particularly severe, e.g. when Spanish ‘coche’ is generally mapped to ‘car ’, whereas French ‘voiture’ is translated to ‘automobile’, the learning algorithm remains unaware of the synonymy. Consider also that AltaVista Babel Fish [18] translates Spanish ‘Me duele la cabeza’ to ‘It hurts the head to me’, which does not contain the word ‘headache’. Simple concept mappings have been used for alternative text representations in monolingual TC. Rather than using the original terms, one considers the concepts associated with their meanings as e.g. captured by WordNet [19]. Although the idea of mapping from several languages to language-neutral concepts seems particularly attractive, it may have a detrimental effect on the efficiency. For instance, lemmatizing inflected words to their base forms means that the distinct
base forms of ‘protected’ and ‘protection’ prevent the two from being identified. Furthermore, WordNet lists many senses of the word ‘school’, of which, in TC, at least seven should be seen as a thematic cluster rather than being distinguished, including school as an educational institution, as a building, as the process of being educated, etc. The idiosyncrasies of different languages pose additional problems, e.g. the English term ‘woods’ is much narrower than the French ‘bois’, so the two might not be mapped to the same concept. In Japanese and Chinese, there are separate words for older and younger sisters. German (as well as several other languages) allows for almost arbitrary compounds such as ‘Friedensnobelpreisträgerin’ (woman awarded the Nobel Peace Prize).
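As a point of reference, the conventional setup just described (stop-word removal, TF-IDF bag-of-words vectors, and a linear SVM over translated documents) might be sketched as follows with scikit-learn; this is an illustration only — the experiments later in the paper use the SVMlight implementation, and the toy documents and labels here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: English training documents and machine-translated test documents.
train_docs = ["the car drives on the road", "the parliament passed a new law on trade"]
train_labels = [0, 1]
test_docs = ["it hurts the head to me", "the automobile needs new tires"]

# Conventional baseline: stop-word removal + TF-IDF bag-of-words + linear SVM.
baseline = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
baseline.fit(train_docs, train_labels)
print(baseline.predict(test_docs))
```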
4 Ontology Region Mapping
As all of the problems mentioned above involve terms being treated as distinct despite being closely related, we present Ontology Region Mapping (ORM) as a novel approach, where ontologies are construed as semantic networks in which entire regions of related concepts are considered relevant, rather than just individual ones. Our approach first maps terms to the concepts they are immediately associated with and then explores further related concepts.

4.1 Ontologies and Ontology Mapping Functions
An ontology is a theory of what possesses being in the world or in a domain. For our purposes, the requirements are a set of concept identifiers (concept terms) and a function τ providing information about how they are connected. For a concept term c, τ(c) should deliver a finite set of entries (ci, ri, wi), where ri indicates the type of relation (hypernymy, antonymy, etc.), ci is a concept term, and wi ∈ [0, 1] is a relation weight specifying to what degree c and ci are related. Additionally, we construct ontology mapping functions that map document terms t from languages such as Spanish to such concepts, returning a set of pairs (c, w) where c is a concept term that possibly represents t's meaning and w is c's weight, i.e. the degree of relevance of c estimated with respect to the local context in which t occurred (the words surrounding t in the document). In our implementation, the functions look up terms in the English and Spanish WordNet [20], which serve as our ontological resources, using the lemmatized base form when no entry exists for the inflected form. In order to determine to what degree the concepts listed in WordNet for a term t are relevant in a particular context, part-of-speech information determined via morphological analysis is used to eliminate certain candidates. The remaining ones are then distinguished using an existing word sense disambiguation technique [21], where additional context strings are constructed for the candidate concepts by concatenating their human language description provided by WordNet with the respective descriptions of their immediate holonyms, hyponyms, as well as two levels of hypernyms. The similarity of two context strings for a document term t and a concept, respectively, is assessed by creating feature vectors for them using bag-of-words TF-IDF weighting and
then applying the cosine measure. Our approach deviates from [21] in that we do not merely select the concept with the highest score, because many related senses might be equally relevant when classifying. Instead, all candidate concepts are maintained with their cosine values, normalized with respect to the sum of all values, as their respective weights.
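The context-string disambiguation step just described could be sketched roughly as follows (an illustration under assumptions, not the authors' code; the candidate concept identifiers, the example gloss strings, and the use of a locally fitted TF-IDF model are all hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def concept_weights(local_context, candidates):
    """candidates: dict mapping concept ids to context strings built from WordNet glosses
    (the concept's description plus those of holonyms, hyponyms and two hypernym levels).
    Returns normalized relevance weights for all candidates instead of a single best sense."""
    ids = list(candidates)
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([local_context] + [candidates[c] for c in ids])
    sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
    total = sims.sum() or 1.0                     # avoid division by zero
    return {c: s / total for c, s in zip(ids, sims)}

# Example: weights for two senses of "school" given the words surrounding the term.
print(concept_weights(
    "children attend lessons and teachers give classes",
    {"school.n.01": "an educational institution where teachers instruct pupils",
     "school.n.05": "a large group of fish swimming together"}))
```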
4.2 Weight Propagation
The mapping functions map document terms to the concept terms that immediately represent their respective meanings. ORM, however, not only maps to individual concepts but to entire regions of concepts by propagating a part of a concept’s weight to related concepts. For every relation type r, an associated relation type weight βr ∈ [0, 1) is used, e.g. 0.8 for hypernym concept terms and 0.2 for hyponym terms. If a mapping function linked a document term to some concept term c0 with weight 1.0, the relation type weights mentioned above would provide the direct parent hypernym of c0 a weight of 0.8, the grandparent would obtain 0.64, and so on, until the values fall below some predetermined threshold. The amount of weight passed on is additionally also governed by the fixed relation weights stored in the ontology (cf. Section 4.1). When multiple paths from a starting concept term c0 to another term c exist, the path that maximizes the weight of c is chosen. For this, Algorithm 4.1, inspired by the A* search algorithm [22], traverses the graph while avoiding cycles and suboptimal paths (see Fig. 1).
Fig. 1. Suboptimal paths: If c0 has weight 1.0 and 80% is passed to hypernyms and 40% for similarity, then the direct path from c0 to c would only yield a weight of 0.5 · 0.4 = 0.2 for c, whereas for the indirect path we have (1.0 · 0.8)² = 0.64
The algorithm’s objective is to determine the optimal weights for related concepts and then accordingly update global concept term counts ctc c that represent the sum of all weights assigned to a concept while processing an entire document. A list of nodes to be visited is maintained, sorted by weight and initially only containing the starting concept c0 in conjunction with its weight wc0 . The node c with the highest weight is then repeatedly removed from this list and the counter ctc c is incremented by c’s weight. The algorithm evaluates all neighbours of c, computes their weights and adds them to the list, provided the new weight is greater than a pre-determined threshold wmin as well as any previously computed weight for that particular neighbour. A parameter space search heuristic can be used to empirically determine suitable values for wmin and the βr values. It can be shown that this algorithm always chooses the optimal weight and terminates if the parameter constraints are fulfilled. In order to decrease the runtime, one
may add a |closed| < k condition to the while-loop, causing the algorithm to visit only the k highest-ranking concepts.

Algorithm 4.1. Ontology-relation-based feature weighting
Input: initial concept c0 with weight w_c0 from an ontology with relation function τ, initial term counts ctc_c for concept terms c, a relation type weight β_r < 1 for every relation type r, weight propagation threshold w_min > 0
Objective: update concept term counts ctc_c for all relevant concepts c from the ontology
 1: weight_c0 ← w_c0, weight_c ← −∞ for all c ≠ c0
 2: open ← {c0}, closed ← ∅
 3: while open ≠ ∅ do
 4:   choose concept c with greatest weight_c from open
 5:   open ← open \ {c}, closed ← closed ∪ {c}          (move c to closed)
 6:   ctc_c ← ctc_c + weight_c                           (increase concept term count)
 7:   for each relation entry (ci, ri, wi) ∈ τ(c) do     (visit neighbours ci of c)
 8:     w ← β_ri · weight_c · wi
 9:     if w ≥ w_min and ci ∉ closed then                (proceed only if over threshold)
10:       open ← open ∪ {ci}
11:       weight_ci ← max{weight_ci, w}
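A Python rendering of this best-first propagation could look like the sketch below (an illustration, not the authors' implementation); the relation function, relation type weights, threshold and optional k cut-off are assumed inputs, and Python's min-heap is used with negated weights.

```python
import heapq

def propagate(c0, w0, tau, beta, ctc, w_min=0.05, k=None):
    """Best-first propagation of the weight w0 of concept c0 to related concepts.

    tau(c) -> iterable of (neighbour, relation_type, relation_weight) entries
    beta   -> dict of relation type weights in [0, 1)
    ctc    -> dict of concept term counts, updated in place
    """
    weight = {c0: w0}
    open_heap = [(-w0, c0)]                    # heapq is a min-heap, so weights are negated
    closed = set()
    while open_heap:
        neg_w, c = heapq.heappop(open_heap)
        if c in closed or -neg_w < weight[c]:  # skip visited nodes and stale heap entries
            continue
        closed.add(c)
        ctc[c] = ctc.get(c, 0.0) + weight[c]   # increase the concept term count
        if k is not None and len(closed) >= k: # optional |closed| < k early stop
            break
        for ci, ri, wi in tau(c):              # visit the neighbours of c
            w = beta[ri] * weight[c] * wi
            if w >= w_min and ci not in closed and w > weight.get(ci, float('-inf')):
                weight[ci] = w
                heapq.heappush(open_heap, (-w, ci))
    return ctc
```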
4.3 General Procedure
Instead of multilingual ontologies, one may also use translated documents. In both cases, each document is tokenized and stop words are removed using a fixed list, resulting in a sequence of terms d = (d1, . . . , dl). For each term, an appropriate mapping function then returns a list of corresponding concepts with associated weights. These are then each submitted as input to Algorithm 4.1 with their respective weight such that the concept term counts ctc_c of any additionally relevant concept terms are updated, too. Despite the non-integral values of these concept term counts, one can proceed to compute concept TF-IDF scores similar to those in conventional TC. While a normalization of the ctc_c to concept term frequencies ctf(d, c) is straightforward, the notion of occurrence required for document frequencies does not emanate from our definition of concept term counts as it does in the case of conventional term counts, for it is unclear whether concept terms with a minuscule weight qualify as occurring in the document. We thus use a threshold α ∈ [0, 1] and define ctfidf_α(d, c) as ctf(d, c) · log(1/cdf_α(c)), where cdf_α(c) returns the fraction of all training documents for which ctc_c > α is obtained. The feature space is then constructed by associating each concept term with a separate dimension, and the respective ctfidf_α values can be used to create feature vectors, which are finally normalized. Though not ordinarily covered by mapping functions, technical terms as well as names of people or organizations, for instance, might be crucial when categorizing a document. Hence, an extended setup may be considered, where the ctc_c are combined with conventional term counts. The ctfidf_α values are then computed globally with respect to all such term counts and the feature space has dimensions for concept terms as well as for stems of original document terms.
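Assuming the per-document concept term count dictionaries produced by the propagation step, the ctfidf weighting of this section could be computed along the following lines (a sketch; in practice the cdf statistics would be taken from the training collection only):

```python
import math

def ctfidf_vectors(doc_counts, alpha=0.5):
    """doc_counts: one dict per training document mapping concept terms to counts ctc_c.
    Returns normalized ctfidf_alpha feature vectors (dicts) for the same documents."""
    n_docs = len(doc_counts)
    # cdf_alpha(c): fraction of documents whose concept term count for c exceeds alpha.
    df = {}
    for counts in doc_counts:
        for c, v in counts.items():
            if v > alpha:
                df[c] = df.get(c, 0) + 1
    vectors = []
    for counts in doc_counts:
        total = sum(counts.values()) or 1.0
        vec = {c: (v / total) * math.log(n_docs / df[c])
               for c, v in counts.items() if df.get(c, 0) > 0}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({c: w / norm for c, w in vec.items()})
    return vectors
```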
5 Evaluation

5.1 Experimental Setup and Tuning
In order to evaluate our solution we performed a large number of cross-lingual tests using the SVMlight implementation [23] of Vapnik's SVM with its default settings. We imported WordNet 2.1 [19] for the English language, and additionally applied mapping tables [24] to the Spanish WordNet [20] to synchronize the two resources. A Japanese version of WordNet does not exist, so only translation-based approaches were tested in that case. All translations were performed by AltaVista's Babel Fish Translation service [18]. Two datasets were extracted from Reuters Corpus Vol. 1 and 2 (RCV1, RCV2) using English training (RCV1) and Spanish test documents (RCV2): one based on topic codes and another one on geographical codes (industry codes could not be used due to inconsistencies between RCV1 and RCV2). An additional dataset with Japanese test documents was generated from Wikipedia [25]. As virtually all TC problems can be reduced to binary ones [17], we tested 105 binary problems per dataset, resulting from 15 randomly selected classes, with 100 training and 600 test documents (Wikipedia: 300) per setup, also selected randomly, however with equal numbers of positive and negative examples in order to avoid biased error rates. A separate validation set was generated based on the same principles as our Reuters topic dataset and then used to tune the relation type weights for hypernyms, holonyms, derivations, etc., as well as other parameters. We chose a value of 0.5 for the α in our ctfidf_α formula. For each setup, we also determined the most suitable numbers of features for feature selection based on Information Gain, which turned out to be 1000 for Ontology Region Mapping (ORM).

5.2 Results and Discussion
First of all, Table 1 shows that the conventional bag-of-words method without any translation whatsoever (B) worked surprisingly well, probably due to named entities and because of the relatedness of English and Spanish. Nonetheless, the error rates are unsatisfactory for production use and similar results cannot be achieved for arbitrary language pairs. For Japanese, in fact, this method could not be used directly as advanced tokenization heuristics would be required. As expected, the translation approach T leads to significant improvements. Applying ORM clearly is beneficial to efficiency compared with a simple concept mapping setup without propagation (CM). The error rates depend on the ontology employed. Better results than with the English/Spanish WordNet setup (CM and ORM) may be obtained by using our ORM approach with translations (TORM), even more so by also including the document terms in the final representation (TORM+T). This is a positive result, implying that ORM with the freely available English WordNet as well as translation software, which also tends to be more available than multilingual lexical resources, suffices for MLTC, as in the case of Japanese, for which a WordNet version currently does not exist.
Table 1. Test results for the Reuters English-Spanish and Wikipedia English-Japanese datasets (micro-averaged F1 scores in %, average error rates in % with 95% confidence intervals), where B: conventional bag-of-words method without translations, CM: simple concept mapping approach without weight propagation, ORM: Ontology Region Mapping, ORM+B: Ontology Region Mapping combined with bag-of-words, T: bag-of-words from English translations, TCM/TORM/TORM+T: same as CM/ORM/ORM+B but with English translations as input.

Reuters English-Spanish
              Topics                    Geography
              F1      error rate        F1      error rate
B             80.97   18.61 ±0.30       81.86   18.12 ±0.30
CM            89.23   10.49 ±0.24       85.74   14.58 ±0.28
ORM           89.53   10.36 ±0.24       87.33   12.97 ±0.26
ORM+B         91.88    8.04 ±0.21       91.92    8.22 ±0.21
T             90.96    8.80 ±0.22       88.76   11.43 ±0.25
TCM           90.75    9.06 ±0.22       91.12    9.16 ±0.23
TORM          91.12    8.74 ±0.22       93.89    6.28 ±0.19
TORM+T        92.46    7.43 ±0.20       94.44    5.68 ±0.18

Wikipedia English-Japanese
              F1      error rate
T             86.26   14.00 ±0.38
TCM           85.38   15.10 ±0.40
TORM          86.67   13.52 ±0.38
TORM+T        87.29   12.86 ±0.37
For news and encyclopedic articles, outperforming the T method is a rather difficult task in light of the considerable discriminatory power of the terms in the introduction paragraphs. Nonetheless, our methods delivered superior results that are statistically significant. Geographical references, in contrast, are often less explicit, so our methods pay off even more. Given that the relation type weights were tuned with respect to the Reuters topic-based validation set, we may presume that even better results than the ones indicated are achievable.
6 Conclusions and Future Work
In the past, many attempts to use natural language processing for monolingual TC have failed to deliver convincing results [17]. A linguistic analysis led us to a novel approach called Ontology Region Mapping, where related concepts, too, are taken into consideration when mapping from terms to concepts, so additional background knowledge is exploited, which is particularly useful in multilingual settings. Our experimental evaluation confirmed our intuitions. In the future, we would like to devise strategies for constructing multilingual resources that integrate more background knowledge and better reflect the semantic relatedness of concepts than WordNet. Additionally, a more sophisticated word-to-concept mapping setup could be used that recognizes compounds and disambiguates better. Finally, it could be explored how well ORM performs for multilingual information retrieval and text clustering. Indeed, we believe our feature weighting approach or extensions of it to have a wide range of interesting applications, in multilingual as well as monolingual settings, because it captures the general meaning of a text more adequately than established schemes.
References

1. Bel, N., Koster, C.H.A., Villegas, M.: Cross-lingual text categorization. Proc. ECDL 2003 (2003) 126–139
2. García Adeva, J.J., Calvo, R.A., de Ipiña, D.L.: Multilingual approaches to text categorisation. Europ. J. for the Informatics Professional VI(3) (2005) 43–51
3. Jalam, R.: Apprentissage automatique et catégorisation de textes multilingues. PhD thesis, Université Lumière Lyon 2, Lyon, France (2003)
4. Olsson, J.S., et al.: Cross-language text classification. Proc. SIGIR 2005 (2005) 645–646
5. Rigutini, L., et al.: An EM based training algorithm for cross-language text categorization. In: Proc. Web Intelligence 2005, Washington, DC, USA (2005) 529–535
6. Oard, D.W., Dorr, B.J.: A survey of multilingual text retrieval. Technical report, University of Maryland at College Park, College Park, MD, USA (1996)
7. de Buenaga Rodríguez, M., et al.: Using WordNet to complement training information in text categorization. Proc. 2nd RANLP (1997)
8. Moschitti, A., Basili, R.: Complex linguistic features for text classification: a comprehensive study. Adv. in IR, Proc. ECIR 2004 (2004)
9. Ifrim, G., Theobald, M., Weikum, G.: Learning word-to-concept mappings for automatic text classification. Proc. 22nd ICML - LWS (2005) 18–26
10. Verdejo, F., Gonzalo, J., Peñas, A., et al.: Evaluating wordnets in cross-language text retrieval. Proceedings LREC 2000 (2000)
11. Scott, S., Matwin, S.: Text classification using WordNet hypernyms. Proc. Worksh. Usage of WordNet in NLP Systems at COLING-98 (1998) 38–44
12. Bloehdorn, S., Hotho, A.: Boosting for text classification with semantic features. Proc. Worksh. on Mining for/from the Semantic Web at KDD 2004 (2004) 70–87
13. Ramakrishnanan, G., et al.: Text representation with WordNet synsets using soft sense disambiguation. Ing. systèmes d'information 8(3) (2003) 55–70
14. Gliozzo, A.M., et al.: Cross language text categ. by acq. multil. domain models from comp. corpora. Proc. ACL Worksh. Building and Using Parallel Texts (2005)
15. Dumais, S.T., et al.: Automatic cross-language retrieval using latent semantic indexing. AAAI Symposium on CrossLanguage Text and Speech Retrieval (1997)
16. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA (1995)
17. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1) (2002) 1–47
18. AltaVista: Babel fish translation. http://babelfish.altavista.com/ (2006)
19. Fellbaum, C., ed.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press (1998)
20. Farreres, X., Rigau, G., Rodríguez, H.: Using WordNet for building WordNets. Proc. Conf. Use of WordNet in NLP Systems (1998) 65–72
21. Theobald, M., Schenkel, R., Weikum, G.: Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data. 6th Intl. Worksh. Web and Databases (2003) 1–6
22. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, USA (1995)
23. Joachims, T.: Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines (1999)
24. Daudé, J., et al.: Making Wordnet mappings robust. Proc. Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN) (2003)
25. Wikimedia Foundation: Wikipedia. http://www.wikipedia.org/ (2006)
Using Visual-Textual Mutual Information and Entropy for Inter-modal Document Indexing

Jean Martinet and Shin'ichi Satoh

National Institute of Informatics, Multimedia Information Research Division, Tokyo, Japan
{jean, satoh}@nii.ac.jp
Abstract. This paper presents a contribution in the domain of automatic visual document indexing based on inter-modal analysis, in the form of a statistical indexing model. The approach is based on intermodal document analysis, which consists in modeling and learning some relationships between several modalities from a data set of annotated documents in order to extract semantics. When one of the modalities is textual, the learned associations can be used to predict a textual index for visual data from a new document (image or video). More specifically, the presented approach relies on a learning process in which associations between visual and textual information are characterized by the mutual information of the modalities. Besides, the model uses the information entropy of the distribution of the visual modality against the textual modality as a second source to select relevant indexing terms. We have implemented the proposed information theoretic model, and the results of experiments assessing its performance on two collections (image and video) show that information theory is an interesting framework to automatically annotate documents. Keywords: Indexing model, mutual information, entropy, inter-modal analysis.
1 Introduction
The increasing amount of multimedia information is almost useless if users have no means to efficiently access the specific information they search for. There is a crucial need to index multimedia documents in order to provide efficient access to them. Given this huge quantity of information, the indexing process must be automated. A promising approach to automatically annotate multimedia documents is to exploit the information redundancy across modalities with inter-modal analysis [1]. In human-computer interaction, a modality refers to the type of communication channel used to convey or acquire information. Multimodality is defined in [2] as “the capacity of the system to communicate with a user along different types of communication channels and to extract and convey meaning automatically”. In multimedia content analysis, however, a modality rather refers to the nature of a signal. A modality, which is embedded in one
medium1, is the particular way in which the information is encoded for presentation to humans. Multi-modal (or more specifically inter-modal) analysis refers to the process of combining several modalities – for instance in a synergistic (collaborative) way – in order to automatically extract semantics. In particular, when one of the modalities is textual, inter-modal analysis can be used to automatically annotate other modalities. For instance, for image and video annotation, the visual modality contains the visual part of image and video documents, and the textual modality contains any textual resource that can be used as a description of the content of the document (e.g. document name, annotation, text from a web page containing the document, open and closed captions, etc.). Applications of inter-modal analysis include automatic document – or image region – annotation, organization of a document collection, and text illustration (finding visual documents related to a given text in order to illustrate it). We introduce in this paper an information theoretic framework for the task of indexing visual data. Our approach uses mutual information as a measure of the strength of the dependency between modalities. Besides, we consider Shannon's information entropy [3] of the distribution of the visual modality over the textual modality as a quality criterion to select annotation keywords. Intuitively, the quality of a keyword refers to its discriminative power in selecting images. The entropy provides a straightforward way to unify two aspects of a distribution, namely the size of its range and its uniformity. Given some visual data, a textual annotation can be generated according to both the strength of the dependency with the visual data and the quality of the annotation.
2 Related Work
A number of approaches have been proposed to discover relationships between several modalities. In particular, much effort has been made to learn associations between visual features and text from an annotated data set and to apply the learned associations to predict annotation keywords for unseen documents. Different models and machine learning techniques have been developed for this purpose. For instance, the co-occurrence model by Mori et al. [4] counts co-occurrences between annotation keywords and image features in a train set, and generates keywords for test images according to the highest co-occurrence values. Later, the problem has also been examined using statistical machine translation ([5] and [6]), where an analogy is made with the problem of translating words from a source language (visual features) to a target language (annotation keywords). Other models such as Latent Dirichlet Allocation (LDA) [7] and relevance language models [8] have been applied to automatic image annotation. Information theory has been used by Salton [9] to measure the extent to which two index terms i and j deviate from independence, by evaluating the Expected Mutual Information Measure (EMIM). Information theory has also been applied to multi-modal signal analysis. In particular, the mutual information measure
1 Used as the singular of media. Note that a single medium – using one type of communication channel – may contain several modalities.
(while being often used in machine learning for feature selection) has also been applied to multi-modal analysis, namely for the problem of volumetric medical image registration [10]. This problem consists in aligning data from different sources (magnetic resonance images and tomography images for instance), each taken as a modality. In this case, one wants to maximize mutual information to increase the dependency between features and to minimize the mapping error between modality feature spaces. In Sections 3 and 4, we describe how we use the mutual information and entropy in the proposed annotation model. In Section 6, we show and discuss experimental results, and Section 7 gives a conclusion.
3 Visual-Textual Mutual Information
We choose to demonstrate the proposed approach in the specific context where the modalities are visual and textual. Documents are represented with a pair of low-level visual feature data and textual data, written D = (F, A), where F = {f1, . . . , fn} denotes the set of quantized feature vectors of the visual modality and A = {a1, . . . , am} denotes the set of annotation keywords of the textual modality. Feature vectors are clustered, quantized, and further identified by their closest centroid so that the visual modality can be represented with a reasonably small number of discrete visual descriptors. Considering visual feature vectors and annotation keywords as binary random variables Fi and Aj denoting the occurrence of fi and aj in F and A respectively, we are interested in finding the dependency between Fi and Aj in order to predict an annotation A from the feature set F of a new document. The prediction is generated using two sources: the mutual information of Fi and Aj, and the conditional entropy of the distribution of the feature vectors given a keyword. The mutual information of Fi and Aj measures the mutual dependency of the two random variables by quantifying the distance between the joint distribution of Fi and Aj and the product of their marginal distributions. Formally, the mutual information I(Fi, Aj) of Fi and Aj is defined as:

I(F_i, A_j) = \sum_{k \in \{0,1\}} \sum_{l \in \{0,1\}} p(F_i = k, A_j = l) \log \frac{p(F_i = k, A_j = l)}{p(F_i = k)\, p(A_j = l)}    (1)

where p(Fi, Aj) is the joint probability distribution of Fi and Aj, and p(Fi) and p(Aj) are the marginal probability distributions of Fi and Aj respectively. The mutual information, which is a non-negative and symmetric measure, equals 0 when Fi and Aj are independent. A high value indicates a strong dependency. The maximum value is reached when the random variables are identical: Fi and Aj always occur in the same context, which means that all information conveyed by Fi is shared by Aj. The annotation model uses the mutual information to represent the strength of associations between visual feature vectors and keywords.
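A direct plug-in estimate of Equation 1 over a training set can be sketched as follows; the binary occurrence matrices F and A (one row per training document) are assumptions of this sketch, not part of the original formulation.

```python
import numpy as np

def mutual_information(F, A, eps=1e-12):
    """Estimate I(F_i, A_j) for all pairs from binary occurrence data.

    F : (d, n) array, F[doc, i] = 1 if quantized feature vector f_i occurs in the document.
    A : (d, m) array, A[doc, j] = 1 if keyword a_j occurs in the annotation.
    Returns an (n, m) matrix of mutual information values.
    """
    d = F.shape[0]
    mi = np.zeros((F.shape[1], A.shape[1]))
    for k in (0, 1):
        for l in (0, 1):
            Fk = F if k else 1 - F
            Al = A if l else 1 - A
            p_joint = (Fk.T @ Al) / d                            # p(F_i = k, A_j = l)
            p_marg = np.outer(Fk.mean(axis=0), Al.mean(axis=0))  # p(F_i = k) p(A_j = l)
            mi += p_joint * np.log((p_joint + eps) / (p_marg + eps))
    return mi
```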
552
4
J. Martinet and S. Satoh
Entropy of Feature Vectors Distribution
An intuitive understanding of the information entropy relates to the amount of uncertainty about an event associated with a given probability distribution. Considering the distribution of the visual feature vectors associated to an annotation keyword, the entropy measures the uncertainty of this association. If we denote by F|aj the random variable yielding the conditional distribution of visual feature vectors in the context of aj, then p(F = fi|aj) (or p(fi|aj)) is the probability of fi being associated to aj. The information entropy of the distribution of F|aj is defined as:

H(F|a_j) = - \sum_{i=1}^{n} p(f_i|a_j) \log p(f_i|a_j)    (2)
This entropy can be seen as a unified measure of both the number of feature vectors to which the keyword is associated and the uniformity (flatness) of the distribution. A large range together with a flat distribution will result in a high entropy. In this case, the annotation keyword is scattered across visual feature vectors and there is no preferential association with some particular feature vectors. The maximum value corresponds to the case where the keyword is uniformly distributed over feature vectors. Inversely, a small range and a sharp distribution will result in a low entropy. In this case, the annotation keyword is considered accurate and discriminative from an annotation point of view. The minimal value is 0, corresponding to the case where the keyword is associated with one feature vector with probability 1. This value also corresponds to the unlikely case where the keyword is associated with no feature vector. The proposed model uses entropy to select accurate and discriminative annotation keywords, that is to say keywords with low entropy.
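The per-keyword entropy can be estimated from the same kind of occurrence data; again, the binary matrices F and A below are assumed inputs of this sketch rather than part of the paper's formulation.

```python
import numpy as np

def keyword_entropies(F, A, eps=1e-12):
    """Return a length-m vector with H(F|a_j) for every annotation keyword a_j,
    estimated from binary occurrence matrices F (d x n) and A (d x m)."""
    counts = F.T @ A                                          # co-occurrences of f_i with a_j
    p = counts / (counts.sum(axis=0, keepdims=True) + eps)    # column-wise p(f_i | a_j)
    return -(p * np.log(p + eps)).sum(axis=0)
```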
5 Annotation Model for an Unknown Image
Both mutual information and entropy contribute to select annotation keywords for a new image. The first source measures the local quality L(F, aj) of a keyword aj as a descriptor of the document D = (F, ∅) to be annotated, for which A is unknown:

L(F, a_j) = \sum_{i=1}^{n} I(F_i, A_j)    (3)

This value is high when aj is a good descriptor of F. The second source quantifies the global quality G(aj) of a keyword aj in the context of the whole document collection, regardless of the new specific document to be annotated:

G(a_j) = \frac{1}{H(F|a_j) + \epsilon}    (4)

The small value ε added under the ratio guarantees that the denominator is never 0. This value is high when aj is associated with few visual feature vectors, meaning that it is a good discriminator of the images of the collection.
Given a new document D for which only the set of visual feature vectors F = {f1, . . . , fn} is known, we define the prediction weight w(F, aj) of an annotation keyword aj as follows:

w(F, aj) = L(F, aj) × G(aj)   (5)

Annotation keywords for a new image are generated according to Equation (5). The justification for using two sources is inspired by the standard tf.idf weighting scheme [12] widely used in text retrieval, in the sense that a keyword can be considered a “good” annotation keyword if it both describes the image well and discriminates documents in the collection well.
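Putting the two sources together, the following sketch ranks candidate keywords for a new image by the prediction weight of (5), reusing the two helpers sketched above; the value of ε and the number of returned keywords are illustrative choices, not values reported in the paper.

```python
def annotate(new_visual_words, vocabulary, docs, top_k=5, eps=1e-6):
    """Rank keywords for a new image given only its visual words (equations (3)-(5))."""
    scored = []
    for a in vocabulary:
        local = sum(mutual_information(docs, f, a) for f in new_visual_words)  # L(F, a)
        global_q = 1.0 / (conditional_entropy(docs, a) + eps)                  # G(a)
        scored.append((local * global_q, a))                                   # w(F, a)
    scored.sort(reverse=True)
    return [a for _, a in scored[:top_k]]
```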
6 Experimental Results
In this section, we describe and discuss the results of experiments carried out to assess the effectiveness of the proposed annotation model. Low-level visual features include standard color and texture descriptors. Vectors of dimension 168 are composed of 128-bin HSV histograms and 40 Gabor wavelet coefficients. Vectors are extracted from 32 × 32 fixed-size square patches of images, and clustered and quantized using a variation of the ISODATA algorithm. Euclidean distance is used for the clustering algorithm and quantization process. The performance of the proposed model has been evaluated with two annotated data collections: one containing still images from a home photograph collection and the other containing keyframes extracted from a collection of video news. The still images are drawn from a personal home photograph collection, which has been thoroughly annotated manually. A dedicated ontology describing landscape elements and general physical objects likely to appear in home photographs has been defined for this collection. This ontology comprises 196 entries, and the collection contains 800 annotated images, with an average of 6.7 annotation keywords per image. The video collection is a subset of the TRECVID 2004 data provided by NIST, which is composed of CNN and ABC broadcast news videos. The keyframes provided by CLIPS-IMAG are used for the visual modality. For the textual modality, the experiments are based on keywords derived from automatic speech recognition transcripts provided by LIMSI. Words from raw speech transcripts are tagged with Schmid's probabilistic part-of-speech tagger [13], and only the base form (lemma) of nouns is kept in the annotation vocabulary. In total, the vocabulary contains over 4,000 words, and this collection contains about 10,000 annotated keyframes, with an average of 4.63 annotation keywords per image. Each collection is randomly divided into training and test sets with proportions 0.7 and 0.3, respectively. The proposed approach is compared with two models: a co-occurrence model counting the co-occurrence values between feature vectors and keywords, and a Naive Bayes classifier estimating the conditional posterior probability p(Aj|F) of keywords given a set of feature vectors from a test image. The test documents are annotated with the proposed model (resp. co-occurrence model, Naive Bayes classifier) by generating the words with the highest prediction weight (resp. co-occurrence
values, conditional posterior probabilities). We consider the case where the number of predicted keywords is equal to the number of ground-truth annotation keywords, and the case where the number of predicted keywords is fixed. Parameters for the three models are estimated on the training set, and the performance is evaluated by comparing image ground-truth annotations with the predictions of the models on the test set.
Fig. 1. Evaluation results for the image collection, in the two cases of generated prediction length: (a) variable-length annotation, (b) fixed-length annotation
Figure 1 (resp. Figure 2) shows evaluation results for the image (resp. video) test collection. We used standard evaluation measures to assess the performance of the models: the prediction precision of a document is the number of correctly predicted words divided by the number of annotation words. The average prediction precision (app) is the prediction precision averaged over all documents in the test set. Average recall and precision values are evaluated for all words in the annotation vocabulary, each used as a single-term query. Results show that the information-theoretic model is globally more effective than the two other models in terms of the considered evaluation measures. The model gives the highest app and average recall values for the two collections. However, for the video collection (Figure 2), these values are comparable to those of the co-occurrence model. As for the Naive Bayes classifier, while it achieves the highest average precision values for the image collection, its average precision values are comparable to the two other models for the video collection. In general, for all models the recall is higher in the case of fixed-length prediction. This is because the fixed length is set to 10, which is larger than the average ground-truth annotation length. Therefore, images are annotated with more keywords and more images are retrieved. In this case, as one can expect, the precision is decreased, particularly
for the video collection. We note that all values are lower for the video collection. Indeed, one can expect less evidence in the association between the keyframes and the ASR output than for the manually annotated image collection.
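A minimal sketch of the average prediction precision described above, assuming predictions and ground-truth annotations are given as parallel lists of keyword collections:

```python
def average_prediction_precision(predictions, ground_truths):
    """app: mean, over test documents, of |correct predictions| / |ground-truth keywords|."""
    scores = []
    for pred, truth in zip(predictions, ground_truths):
        if truth:
            scores.append(len(set(pred) & set(truth)) / len(truth))
    return sum(scores) / len(scores) if scores else 0.0
```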
Fig. 2. Evaluation results for the video collection, in the two cases of generated prediction length: (a) variable-length annotation, (b) fixed-length annotation
We selected those two collections because they are very different in terms of type and quality of images and of annotation process. While the first one is composed of high-quality photographs with an accurate manual description of the visual content, the second one is composed of average-quality keyframes with an annotation extracted automatically from the speech, which is more loosely related to the visual content. We highlight that for the video collection, we do not filter high- or low-frequency words, as the entropy measure selects relevant words according to the visual feature distribution, independently of their frequency. Indeed, a high-frequency word can be given the same entropy value as a low-frequency word if it is always observed in the same context. But in the general case, high-frequency words are associated more sparsely with feature vectors, yielding higher entropy.
7 Conclusion
This paper describes a document indexing model based on inter-modal analysis. This model is based on the mutual information of visual and textual modalities and on the entropy of the distribution of visual features for annotation keywords. The model uses these two sources to generate annotation keywords for a new document. The main contribution of this work is to demonstrate the effectiveness of a first information-theoretic approach to the problem of automatic indexing.
We believe that information theory is a well-suited framework for semantic extraction from inter-modal analysis, and that it requires further investigation. Further research includes a refinement of the annotation value definition, in order to integrate a selection of the “best” (or most representative) feature vectors from a new image before generating the annotation.
Acknowledgments The authors wish to thank Prof. Yves Chiaramella for the image test collection. Jean Martinet is supported by the Japan Society for the Promotion of Science.
References
1. Cees G.M. Snoek and Marcel Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 25(1):5–35, 2005.
2. L. Nigay and J. Coutaz. A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. In Proceedings of INTERCHI '93, pages 172–178. ACM Press: New York, 1993.
3. C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.
4. Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM'99 First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
5. Kobus Barnard, Pinar Duygulu, and David Forsyth. Recognition as translating images into text. In Internet Imaging IX, Electronic Imaging, 2003.
6. Muhammet Bastan and Pinar Duygulu. Recognizing objects and scenes in news videos. In Proceedings of the International Conference on Image and Video Retrieval (CIVR 2006), 2006.
7. David Blei and Michael Jordan. Modeling annotated data. In 26th International Conference on Research and Development in Information Retrieval (SIGIR'03), New York, 2003.
8. J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 119–126, 2003.
9. G. Salton, C. Buckley, and C. T. Yu. An evaluation of term dependence models in information retrieval. In SIGIR '82: Proceedings of the 5th Annual ACM Conference on Research and Development in Information Retrieval, pages 151–173, New York, NY, USA, 1982. Springer-Verlag New York, Inc.
10. William M. Wells, Paul Viola, Hideki Atsumi, and Shin Nakajima. Multi-modal volume registration by maximization of mutual information. Medical Image Analysis, 1(1):35–51, 1996.
11. T. Butz and J.-Ph. Thiran. Multi-modal signal processing: An information theoretical framework. Technical Report 02.01, Signal Processing Institute (ITS), Swiss Federal Institute of Technology (EPFL), 2002.
12. G. Salton. The SMART Retrieval System. Prentice Hall, 1971.
13. Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.
A Study of Global Inference Algorithms in Multi-document Summarization

Ryan McDonald
Google Research
76 Ninth Avenue, New York, NY 10011
[email protected]
Abstract. In this work we study the theoretical and empirical properties of various global inference algorithms for multi-document summarization. We start by defining a general framework for inference in summarization. We then present three algorithms: The first is a greedy approximate method, the second a dynamic programming approach based on solutions to the knapsack problem, and the third is an exact algorithm that uses an Integer Linear Programming formulation of the problem. We empirically evaluate all three algorithms and show that, relative to the exact solution, the dynamic programming algorithm provides near optimal results with preferable scaling properties.
1 Introduction

Automatically producing summaries from large sources of text is one of the oldest studied problems in both IR and NLP [7,13]. The expanding use of mobile devices and the proliferation of information across the electronic medium make the need for such technology imperative. In this paper we study the specific problem of producing summaries from clusters of related documents – commonly known as multi-document summarization. In particular, we examine a standard paradigm where summaries are built by extracting relevant textual units from the documents [5,9,11,18]. When building summaries from multiple documents, systems generally attempt to optimize three properties:
– Relevance: Summaries should contain informative textual units that are relevant to the user.
– Redundancy: Summaries should not contain multiple textual units that convey the same information.
– Length: Summaries are bounded in length.
Optimizing all three properties jointly is a challenging task and is an example of a global inference problem. This is because the inclusion of relevant textual units relies not only on properties of the units themselves, but also on properties of every other textual unit in the summary. Unlike single-document summarization, redundancy is particularly important since it is likely that textual units from different documents will convey the same information. Forcing summaries to obey a length constraint is a common set-up in summarization as it allows for a fair empirical comparison between different possible
outputs [1,12]. It also represents an important “real world” scenario where summaries are generated in order to be displayed on small screens, such as mobile devices. The global inference problem is typically solved in two ways. The first is to optimize relevance and redundancy separately. For example, the work of McKeown et al. [14] presents a two-stage system in which textual units are initially clustered, and then representative units are chosen from each cluster to be included into the final summary. The second approach is to treat the problem truly as one of global inference and optimize all criteria in tandem. Goldstein et al. [9] presented one of the first global models through the use of the maximum marginal relevance (MMR) criteria, which scores sentences under consideration as a weighted combination of relevance plus redundancy with sentences already in the summary. Summaries are then created with an approximate greedy procedure that incrementally includes the sentence that maximizes this criterion. More recently, Filatova and Hatzivassiloglou [8] described a novel global model for their event-based summarization framework and showed that inference within it is equivalent to a known NP-hard problem, which led to a greedy approximate algorithm with proven theoretical guarantees. Daumé et al. [6] formulate the summarization problem in a supervised structured learning setting and present a new learning algorithm that sets model parameters relative to an approximate global inference algorithm. In this work we start by defining a general summarization framework. We then present and briefly analyze three inference algorithms. The first is a greedy approximate method that is similar in nature to the MMR algorithm of Goldstein et al. [9]. The second algorithm is an approximate dynamic programming approach based on solutions to the knapsack problem. The third algorithm uses an Integer Linear Programming (ILP) formulation that is solved through a standard branch-and-bound algorithm to provide an exact solution. We empirically evaluate all three algorithms and show that, relative to the exact solution, the dynamic programming algorithm provides competitive results with preferable scaling properties.
2 Global Inference

As input we assume a document collection D = {D1 , . . . , Dk }. Each document contains a set of textual units D = {t1 , . . . , tm }, which can be words, sentences, paragraphs, etc. For simplicity, we represent the document collection as the set of all textual units from all the documents in the collection, i.e., D = {t1 , . . . , tn } where ti ∈ D iff ∃ ti ∈ Dj ∈ D. We let S ⊆ D be the set of textual units constituting a summary. We define two primary scoring functions:
1. Rel(i): The relevance of textual unit ti participating in the summary.
2. Red(i, j): The redundancy between textual units ti and tj . Higher values correspond to higher overlap in content.
These scoring functions are completely arbitrary and should be defined by domain experts. For instance, scores can include a term to indicate similarity to a specific query for query-focused summarization or include terms involving entities, coherence,
domain specific features, etc. Scores can also be set by supervised learning algorithms when training data is available [18]. Finally, we will define the function l(i) to indicate the length of textual unit ti . Length is also arbitrary and can represent characters, words, phrases, etc. As in most summarization studies, we assume as input an integer K that the length of any valid summary cannot exceed. Formally, the multi-document summarization inference problem can be written as:

S = arg max_{S⊆D} s(S) = arg max_{S⊆D} [ Σ_{ti∈S} Rel(i) − Σ_{ti,tj∈S, i<j} Red(i, j) ]   (1)

such that Σ_{ti∈S} l(i) ≤ K
We refer to s(S) as the score of summary S. We assume that redundancy scores are symmetric and the summation of scores is over i < j to prevent counting each more than once. If desired, we could unevenly weight the relevance and redundancy scores to prefer one at the expense of the other. It is also worth mentioning that the redundancy factors in Equation 1 are pairwise. This is a slight deviation from many systems, in which the redundancy of unit ti is calculated considering the rest of the summary in its entirety. For now, we have simplified the redundancy factor to a sum of pairwise relationships because it will allow us to define an Integer Linear Programming formulation in Section 2.1. In turn, this will allow us to compare our approximate algorithms to an upper bound in performance. It can be shown that solving the inference problem in Equation 1 is NP-hard. For space reasons we omit the proof. It is not difficult to show that the major source of intractability is the redundancy terms in Equation 1. When the redundancy terms are removed, the problem is still NP-hard and can be shown to be equivalent to the 0-1 knapsack problem [4]. There does exist an O(Kn) algorithm for solving the knapsack problem, but this only makes it pseudo-polynomial, since K is represented as log K bits in the input. However, for the summarization problem K is typically on the order of hundreds, making such solutions feasible. We will exploit this fact in Section 2.1.

2.1 Global Inference Algorithms

Greedy Algorithm. A simple approximate procedure for optimizing Equation 1 is to begin by including highly relevant textual units, and then to iteratively add new units that maximize the objective. This algorithm is outlined in Figure 1a and is a variant of MMR-style algorithms. The advantage of this algorithm is that it is simple and computationally efficient. The runtime of this algorithm is in the worst case O(n log n + Kn) due to the sorting of n items and because each iteration of the loop takes O(n) and the loop will iterate at most K times. This assumes the unlikely scenario when all sentences have a length of one. In practice, the loop only iterates a small number of times. We also assume that calculating s(S) is O(1) when it is really a function of loop iterations, which again makes it negligible. It is not difficult to produce examples for which this greedy procedure will fail.
Input: D = {t1 , . . . , tn }, K

(a) Greedy Algorithm
1. sort D so that Rel(i) > Rel(i + 1) ∀i
2. S = {t1}
3. while Σ_{ti∈S} l(i) < K
4.   tj = arg max_{tj ∈ D−S} s(S ∪ {tj})
5.   S = S ∪ {tj}
6. return S

(b) Knapsack Algorithm
1. S[i][0] = {} ∀ 1 ≤ i ≤ n
2. for i: 1 . . . n
3.   for k: 1 . . . K
4.     S′ = S[i − 1][k]
5.     S′′ = S[i − 1][k − l(i)] ∪ {ti}
6.     if s(S′′) > s(S′) then
7.       S[i][k] = S′′
8.     else
9.       S[i][k] = S′
10. return arg max_{S[n][k], k≤K} s(S[n][k])

Fig. 1. (a) A greedy approximate algorithm. (b) A dynamic programming algorithm based on a solution to the knapsack problem.
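A direct Python rendering of the two procedures in Figure 1 may make the bookkeeping concrete. This is only a sketch under stated assumptions: Rel, Red and length are caller-supplied functions over textual units, and the summary score s(S) is recomputed from scratch rather than incrementally, so no attention is paid to efficiency.

```python
def score(S, rel, red):
    """s(S): total relevance minus pairwise redundancy of summary S (a list of units)."""
    total = sum(rel(t) for t in S)
    total -= sum(red(S[a], S[b]) for a in range(len(S)) for b in range(a + 1, len(S)))
    return total

def greedy_summary(units, K, rel, red, length):
    """Figure 1a: seed with the most relevant unit, then add greedily.
    As in the pseudocode, the length test happens before each addition,
    so the summary may slightly exceed K."""
    ranked = sorted(units, key=rel, reverse=True)
    S, candidates = [ranked[0]], set(ranked[1:])
    while sum(length(t) for t in S) < K and candidates:
        best = max(candidates, key=lambda t: score(S + [t], rel, red))
        S.append(best)
        candidates.remove(best)
    return S

def knapsack_summary(units, K, rel, red, length):
    """Figure 1b: dynamic program over (unit index, exact summary length k)."""
    n = len(units)
    # T[i][k]: a high-scoring summary of total length exactly k drawn from units[:i],
    # or None if no such summary has been built.
    T = [[None] * (K + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = []
    for i in range(1, n + 1):
        li = length(units[i - 1])
        for k in range(1, K + 1):
            without = T[i - 1][k]
            with_i = None
            if li <= k and T[i - 1][k - li] is not None:
                with_i = T[i - 1][k - li] + [units[i - 1]]
            if with_i is not None and (
                without is None or score(with_i, rel, red) > score(without, rel, red)
            ):
                T[i][k] = with_i
            else:
                T[i][k] = without
    feasible = [S for S in T[n] if S is not None]
    return max(feasible, key=lambda S: score(S, rel, red))
```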
In particular, the choice of including the most relevant sentence in the summary (Figure 1a, line 2) can cause error propagation. Consider the case of a very long and highly relevant sentence. This sentence may contain a lot of relevant information, but it may also contain a lot of noise. Including this sentence in the summary will help maximize relevance at the cost of limiting the amount of remaining space for other sentences.

Dynamic Programming Algorithm. To alleviate this problem we devise a dynamic programming solution. Recall that the input to the problem is a set of textual units, D = {t1 , . . . , tn }, and an integer K. Let S[i][k], where i ≤ n and k ≤ K, be a high scoring summary of exactly length k that can only contain textual units in the set {t1 , . . . , ti }. Figure 1b provides an algorithm for filling in this table. This algorithm is based on a solution to the 0-1 knapsack problem [4]. In that problem the goal is to fill a knapsack of capacity K with a set of items, each having a certain weight and value. The optimal solution maximizes the overall value of selected items without the total weight of these items exceeding K. Clearly, if one could ignore the redundancy terms in Equation 1, the summarization problem and knapsack problem would be equivalent, i.e., value equals relevance and weight equals length. Of course, redundancy terms are critical when constructing summaries and we cannot ignore them. The crux of the algorithm is in lines 4-10. To populate S[i][k] of the table, we consider two possible summaries. The first is S[i − 1][k], which is a high scoring summary of length k using textual units {t1 , . . . , ti−1 }. The second is a high scoring summary of length k − l(i) plus the current unit ti . S[i][k] is then set to whichever one has the highest score. The knapsack problem is structured so that the principle of optimality holds. That is, if for i′ < i and k′ ≤ k, S[i′][k′] stores the optimal solution, then S[i][k] will also store the optimal solution. However, the additional redundancy factors in the multi-document summarization problem, which are included in the score calculations of line 6, break this principle, making this solution only approximate for our purposes. The advantage of using a knapsack-style algorithm is that it eliminates the errors caused by the greedy algorithm inserting longer sentences and limiting the space for future inclusions. The runtime of this algorithm is O(Kn) if we again assume that s(S) ∈ O(1). However, this time K is not a worst-case scenario, but a fixed lower bound on runtime.
maximize   Σ_i αi Rel(i) − Σ_{i<j} αij Red(i, j)

such that ∀i, j:
(1) αi , αij ∈ {0, 1}
(2) Σ_i αi l(i) ≤ K
(3) αij − αi ≤ 0
(4) αij − αj ≤ 0
(5) αi + αj − αij ≤ 1

Fig. 2. ILP formulation of global inference
Even so, most summarization systems typically set K on the order of 100 to 500, making such solutions easily computable (see Section 3). Note also that the correctness of the algorithm as given in Figure 1 is based on the assumption that there is a valid summary of every length k ≤ K. It is not difficult to modify the algorithm and remove this assumption by checking that both S′ and S′′ truly have a length of k. One additional augmentation that can be made to both the greedy and knapsack algorithms is the inclusion of a beam during inference. This was implemented but found to have little impact on accuracy unless a beam of substantial size was used.

ILP Formulation. It would be desirable to compare the previous two algorithms with an exact solution to determine how much accuracy is lost due to approximations. Fortunately there is a method to do this in our framework through the use of Integer Linear Programming (ILP). ILP techniques have been used in the past to solve many intractable inference problems in both IR and NLP. This includes applications to relation and entity classification [17], sentence compression [3], temporal link analysis [2], as well as syntactic and semantic parsing [15,16]. An ILP is a constrained optimization problem, where both the cost function and constraints are linear in a set of integer variables. Solving arbitrary ILPs is an NP-hard problem. However, ILPs are a well studied optimization problem with efficient branch and bound algorithms for finding the optimal solution. Modern commercial ILP solvers can typically solve moderately large optimizations in a matter of seconds. We use the GNU Linear Programming Kit¹, which is a free optimization package. The multi-document global inference problem can be formulated as the ILP in Figure 2. In this formulation we include indicator variables αi and αij , which are 1 when a textual unit or a pair of textual units is included in a summary. The goal of the ILP is to set these indicator variables to maximize the payoff subject to a set of constraints that guarantee the validity of the solution. The first constraint simply states that the indicator variables are binary. The second constraint states that for all sentences included in the summary, the sum of their lengths must be less than our predefined maximum. Constraints (3) to (5) ensure a valid solution. Constraints (3) and (4) simply state that if the summary includes both the units ti and tj then we have to include them individually as well. Constraint (5) is the inverse of (3) and (4).

2.2 Implementation Details

When implementing each algorithm it is important for the scale of the score functions to be comparable. Otherwise, the algorithms will naturally favor either relevancy or
http://www.gnu.org/software/glpk/
redundancy. Furthermore, there are quadratically many redundancy factors in a summary score compared to relevance factors. Depending on the scoring functions this can lead to summaries with a small number of very long sentences or a lot of very short sentences. One way to avoid this is to add new constraints specifying a desired range for sentence lengths. Alternatively, we found that replacing every score with its z-score alleviated many of these problems since that guaranteed both positive and negative values. When scores are predominantly negative, then the algorithms return summaries much shorter than K. This is simply fixed by changing the constraints to force summary lengths to be between K-c and K, where c is some reasonably sized constant. In the ILP formulation, the number of constraints is quadratic in the total number of textual units. Furthermore, the coefficient matrix of this problem is not unimodular [17]. As a result, the ILP algorithm does not scale well. To alleviate this problem, each algorithm passed through a preprocessing stage that sorted all textual units by relevance. Every textual unit not in the top 100 was discarded as unlikely to be in the final summary. In this way, all algorithms ran under the same conditions.
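The ILP of Figure 2 can also be handed to an off-the-shelf solver. The paper uses the GNU Linear Programming Kit; the sketch below instead expresses the same program with the PuLP modelling library, purely as an illustration and not as the authors' setup, with rel, red and length again assumed to be supplied by the caller.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def ilp_summary(units, K, rel, red, length):
    """Exact inference for Equation 1 via the ILP of Figure 2."""
    n = len(units)
    prob = LpProblem("summarization", LpMaximize)
    a = [LpVariable(f"a_{i}", cat=LpBinary) for i in range(n)]
    ab = {(i, j): LpVariable(f"a_{i}_{j}", cat=LpBinary)
          for i in range(n) for j in range(i + 1, n)}
    # Objective: relevance of selected units minus redundancy of selected pairs.
    prob += lpSum(a[i] * rel(units[i]) for i in range(n)) - \
            lpSum(ab[i, j] * red(units[i], units[j]) for (i, j) in ab)
    # Constraint (2): length bound.
    prob += lpSum(a[i] * length(units[i]) for i in range(n)) <= K
    # Constraints (3)-(5): consistency between unit and pair indicators.
    for (i, j), v in ab.items():
        prob += v - a[i] <= 0
        prob += v - a[j] <= 0
        prob += a[i] + a[j] - v <= 1
    prob.solve()
    return [units[i] for i in range(n) if a[i].value() == 1]
```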
3 Experiments

In this study we used sentences as textual units. Each textual unit, document and document collection is represented as a bag-of-words vector with tf*idf values. Length bounds are always in terms of words. In addition to the three algorithms described in this paper, we also ran a very simple baseline that is identical to the greedy algorithm, but does not include redundancy when scoring summaries. We ran two primary sets of experiments, the first on generic summarization and the second on query-focused summarization. Results are reported using the ROUGE evaluation package [12]. ROUGE is an n-gram recall metric for an automated summary relative to a set of valid reference summaries. We report ROUGE-1 and ROUGE-2 scores, which capture unigram and bigram recall. In the generic setting, a system is given a document collection D and a length bound K, and is asked to produce a summary that is most representative of the entire document collection. For these experiments, we used the DUC 2002 data set [10]. This data set contained 59 document collections, each having at least one manually created summary for lengths 50, 100, 200. We define the score functions as follows:

Rel(i) = POS(ti, D)^{-1} + SIM(ti, D)   (where ti ∈ D and D ∈ D)
Red(i, j) = SIM(ti, tj)

where POS(t, D) is the position of textual unit t in document D and SIM(a, b) is the cosine similarity between two vectors. Relevance scores prefer sentences that are near the beginning of documents and are maximally informative about the entire document collection. Again, these score functions are general and we only use these particular scoring criteria because the data is drawn from news sources. Results are shown in Table 1a. The first thing to note is that incorporating redundancy information does improve scores, verifying previous work [8,9]. Next, we see that scores for the sub-optimal knapsack algorithm are very near the scores for the exact ILP algorithm and are even sometimes slightly better.
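A sketch of these scoring functions for the generic setting, assuming each textual unit has been mapped to a sparse tf*idf vector (a dict from term to weight) and a 1-based position in its source document; the data structures are illustrative rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tf*idf vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def make_scorers(unit_vectors, collection_vector, positions):
    """Build Rel and Red for the generic DUC 2002 setup described above.

    unit_vectors: dict unit -> tf*idf vector of that sentence.
    collection_vector: tf*idf vector of the whole document collection.
    positions: dict unit -> 1-based position of the sentence in its document.
    """
    def rel(t):
        return 1.0 / positions[t] + cosine(unit_vectors[t], collection_vector)
    def red(t1, t2):
        return cosine(unit_vectors[t1], unit_vectors[t2])
    return rel, red
```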
Table 1. (a) Results for generic summarization experiments using the DUC 2002 data set. Each cell contains the ROUGE-1 and ROUGE-2 scores (R1 / R2). (b) Results for query-focused summarization experiments using the DUC 2005 data set.

(a)
           Summary Length
           50           100          200
Baseline   26.6 / 5.3   33.0 / 6.8   39.4 / 9.6
Greedy     26.8 / 5.1   33.5 / 6.9   40.1 / 9.5
Knapsack   27.9 / 5.9   34.8 / 7.3   41.2 / 10.0
ILP        28.1 / 5.8   34.6 / 7.2   41.5 / 10.3

(b)
Baseline   34.4 / 5.4
Greedy     35.0 / 5.7
Knapsack   35.7 / 6.2
ILP        35.8 / 6.1
This is due to the fact that redundancy scores are more influential in the ILP solution. Highly relevant, but semantically different, sentences will often contain identical terms (i.e., person names or places). These sentences are then forced to compete with one another when constructing the summary, when it may be desirable to include them both. The final point we will make is that the greedy algorithm's performance is consistently lower than the knapsack algorithm's. An analysis of the resulting summaries suggests that indeed long sentences are getting included early, making it difficult to add relevant sentences later in the procedure. The query-focused setting requires summaries to be relevant to a particular query that has been supplied by the user. For these experiments, we used the DUC 2005 data sets [5]. This data consists of 50 document collections, each with a corresponding query. For each collection and query, multiple manually constructed summaries of 250 words were provided. The redundancy score of the system remained unchanged from the previous experiment. However, for a document collection D and query Q, the relevance score was changed to the following:

Rel(i) = SIM(ti, Q) + SIM(ti, D)

Thus, relevance is an equally weighted combination of similarity to the query and similarity to the entire document collection. Results are shown in Table 1b. Again we see that the knapsack algorithm outperforms the greedy algorithm and has a score comparable to the ILP system. Another important property of these systems is the efficiency with which they produce a summary relative to a document collection. For document collections with 50 textual units (i.e., |D| = 50), the greedy, knapsack and ILP algorithms take 5, 8 and 25 seconds on average to produce a summary. For document collections with 100 textual units, the algorithms take 7, 16 and 282 seconds. This trend continues for the ILP solution as |D| grows, making it infeasible to use for large real world data sets.
4 Conclusions

In this work we studied three algorithms for global inference in multi-document summarization. We found that a dynamic programming algorithm based on solutions to the knapsack problem provided near optimal results with preferable scaling properties, relative to both a greedy algorithm and an exact algorithm that uses Integer Linear Programming.
References
1. Document Understanding Conference (DUC). http://duc.nist.gov
2. P. Bramsen, P. Deshpande, Y.K. Lee, and R. Barzilay. Inducing temporal graphs. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2006.
3. J. Clarke and M. Lapata. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2006.
4. T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill, 1990.
5. H.T. Dang. Overview of DUC 2005. In Proceedings of the Document Understanding Conference (DUC), 2005. http://duc.nist.gov
6. Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. 2006. In submission.
7. H.P. Edmundson. New methods in automatic extracting. Journal of the Association for Computing Machinery, 1(23), 1968.
8. E. Filatova and V. Hatzivassiloglou. A formal model for information selection in multi-sentence text extraction. In Proceedings of the International Conference on Computational Linguistics (COLING), 2004.
9. J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In Proceedings of the ANLP/NAACL Workshop on Automatic Summarization, 2000.
10. U. Hahn and D. Harman, editors. Proceedings of the Document Understanding Conference (DUC), 2002. http://duc.nist.gov
11. J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of the Annual Conference of the ACM Special Interest Group on Information Retrieval (SIGIR), 1995.
12. C.Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Joint Conference on Human Language Technology and North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2003.
13. P.H. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 1959.
14. K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay, and Eleazar Eskin. Towards multidocument summarization by reformulation: Progress and prospects. In Proceedings of the Annual Conference of the American Association for Artificial Intelligence (AAAI), 1999.
15. V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Semantic role labeling via integer linear programming inference. In Proceedings of the International Conference on Computational Linguistics (COLING), 2004.
16. S. Riedel and J. Clarke. Incremental integer linear programming for non-projective dependency parsing. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2006.
17. D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 2004.
18. S. Teufel and M. Moens. Sentence extraction as a classification task. In Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization, 1997.
Document Representation Using Global Association Distance Model

José E. Medina-Pagola, Ansel Y. Rodríguez, Abdel Hechavarría, and José Hernández Palancar
Advanced Technologies Application Center (CENATAV), 7th Avenue # 21812 % 218 and 222, Siboney, Playa, Havana City, Cuba
{jmedina,arodriguez,ahechavarria,palancar}@cenatav.co.cu
Abstract. Text information processing depends critically on the proper representation of documents. Traditional models, like the vector space model, have significant limitations because they do not consider semantic relations amongst terms. In this paper we analyze a document representation using the association graph scheme and present a new approach called Global Association Distance Model (GADM). At the end, we compare GADM using K-NN classifier with the classical vector space model and the association graph model.
1 Introduction

Nowadays, due to the rapid scientific and technological advances, there are great creation, storage and data distribution capacities. This situation has boosted the necessity of new tools to transform this big amount of data into useful information or knowledge for decision makers. This transformation process is known as Knowledge Discovery in Databases (KDD). Recent studies and analyses have concluded that complex data require a high number of components to be completely described. This data has to be embedded in high-dimensional spaces (from tens to thousands of dimensions). Examples are spectrophotometer data, gene expression data, pictures and texts. In this paper, we focus our analysis on textual data and their representation. The representation model that is used critically affects almost any text processing task, such as information retrieval, classification, clustering, summarization, question-answering, etc. The vector space model is the classic one and by far the most used model. Nevertheless, some studies have shown that the weakness of this model is that it leaves out the semantic complexity of the textual data. As terms appear related to other terms in any document, their meanings strongly depend on the meanings of the surrounding terms; even more, term meanings emerge from mutual sense reinforcement. If we assume that sentences are the main semantic unit in texts, then mutual sense reinforcements or associations amongst their terms should be the strongest possible. Nevertheless, it is well known that these reinforcements or associations are feasible in other contexts as, for instance, paragraphs or groups of them. The association graph scheme is an approach that includes this consideration.
In this paper we propose a new document representation model using the mutual sense reinforcement among terms, called Global Association Distance Model (GADM). Section 2 is dedicated to related work. In section 3 we analyze the strengths and weaknesses of the Association Graph Model. In section 4 we describe GADM. Experimental results are discussed in the last section.
2 Related Work

Document categorization, clustering and information retrieval tasks are often based on a good data representation. At a high level, the manipulation of textual data can be described as a series of processing steps that transform the original document representation to another one, simpler and more useful to be processed automatically by computers. This usually involves enriching the document content by adding information, using background knowledge, normalizing terms, etc. At the start of the process, the textual data may exist as a paper, for instance, and the final representation of the preprocessing could be a straight ASCII text enriched with some additional information. This preprocessing final representation is used to represent data in a useful way for computer calculation. These terms could be organized in different forms but, in general, they are considered as groups or bags of words, usually structured using a vector space model [1]. In this model, the term sequences, or their syntactical relations, are not analyzed; therefore, they are considered as unigrams supposing an independence of their occurrences. In the vector space model, each document is a vector of terms. The values of these vectors could be assumed as weights, considering the following interpretations [2]:
– Boolean: Each term is associated with a Boolean value representing whether it is present or not in a document.
– Tf (Term Frequency): Each term is associated with a frequency of appearance in a document, absolute or normalized.
– Tf-Idf (Term Frequency - Inverse Document Frequency): The term is associated with its frequency, adjusted by the inverse of the number of documents containing each term.
These vectors of terms are used in a second stage, among other tasks, to analyze the similarities between documents, or groups of them, using different measures such as the cosine, applied to the angle between the vectors, defined as [2]:

sim(di, dj) = cos(di, dj) = (di • dj) / (||di|| ∗ ||dj||) = ( Σ_r wir wjr ) / sqrt( (Σ_r wir^2) (Σ_r wjr^2) )   (1)

where di, dj are the vectors of documents i, j, ||di||, ||dj|| the norms of the vectors, and wir, wjr are the term weights in the vectors di, dj, respectively. Other common measures are Dice and Jaccard coefficients.
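As a concrete illustration of the three weighting interpretations listed above, the following is a minimal sketch using one common tf and idf convention; the particular normalization is an assumption, not the exact formulation of [2].

```python
import math
from collections import Counter

def vectorize(docs, scheme="tfidf"):
    """Build term-weight vectors for tokenized documents under the three interpretations:
    Boolean, Tf (normalized frequency) and Tf-Idf."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))               # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: tf[t] / len(doc) for t in tf}
        else:  # tf-idf
            vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors
```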
Alternative approaches to the vector space model are the language models. These consider the probabilities of occurrence of a phrase S in a language M, indicated by P(S/M). However, the phrases are usually reduced to one term, assuming again unigrams and independence among them. An example of this model is the Kullback-Leibler Divergence (a variation of the cross-entropy), defined as:

D(di || dj) = Σ_t P(t/di) log [ P(t/di) / P(t/dj) ]   (2)

This expression could be combined in both directions to obtain a similarity measure, as was pointed out by Feldman and Dagan [3]. Another implementation is the proposal of Kou and Gardarin [4]. This proposal is a kind of language model, considering the similarities between two documents as:

sim(di, dj) = di • dj = Σ_r wir wjr + Σ_r Σ_{s≠r} wir wjs (tr • ts)   (3)
where wir and wjs , using Kou-Gardarin terminology, are the term weights in document vectors di , dj , respectively, and (tr • ts ) is the a priori correlation between terms tr and ts . Actually, the authors included in the first part of the expression the self-correlation in tr , considering that (tr • tr ) = 1. The authors propose the estimation of the correlation through a training process. As can be noticed, those correlations express the probabilities P (tr , ts /M ) of phrases containing the terms tr , ts in a language M . Besides, that expression could be reduced to the cosine measure (normalized by the length of the vectors) if the term independence is considered and, for that reason, the correlation (tr • ts ) is zero. Another vector space model is the Topic-based Vector Space Model (TVSM) [5]. The basic premise of the TVSM is the existence of a space R which only has positive axis intercepts. Each dimension of R represents a fundamental topic. It is assumed that all fundamental topics are independent from the others. In this model, each document is represented as a vector of term-vectors; each termvector is a set of weights between the term and the fundamental topics. The approaches mentioned above are variants of the Generalized Vector Space Model proposed by S.K.M Wong et al. [6]. In their work, they expressed that there was no satisfactory way of computing term correlations based on automatic indexing scheme. We believe that up to the present time that limitation has not been solved yet. Although several authors have proposed different methods of recognizing term correlations in the retrieval process, those methods try to model the semantic dimension by a global distribution of terms, but not with a local evaluation of documents. In general, it could be assumed that the better the semantic representation of the information retrieved and discriminated, the better this information is mined.
3 Association Graph Model
The Association Graph Model assumes that a same term in two documents could designate different concepts [7]. Besides, two terms could have different relations, according to the subject of each document, and those relations could exist only in the context of some documents, forming a specific group, and regardless of the relations in a global dimension or language. In order to model the relation between two terms in a document, the shortest physical distance between those terms was considered. The basic premise of this model can be expressed as follows: Two documents should be closer if the number of common terms is greater and the shortest physical distances among them in each document are similar. Considering the distance by paragraph, without ignoring the natural co-occurrence when appearing in the same sentence, and considering (pr, nr), (ps, ns) the paragraph and sentence numbers of terms tr and ts respectively, the physical distance between these terms in a document i is defined as follows:

D^i_rs = 1, if (r = s) ∨ [(pr = ps) ∧ (nr = ns)];  D^i_rs = |pr − ps| + 2, otherwise.   (4)

Observe that the minimum value of D^i_rs, as could be expected, is not zero, but one. This consideration is only a convenient assumption for expressions defined further on. According to this, a document is modeled by a graph, where the vertices are the distinguished terms and the edges are their relations, weighted by their distances. Notice that this is a fully connected graph, where any term has some relation (stronger or not according to the distance) with the others. In spite of the fact that the physical distance between two terms is the basic notion that supports the association graph scheme, the model proposed included the frequency of the terms. Therefore, the Association Graph Model was defined as a graph representation weighted by vertex, considering the weights of the distinguished terms, and by edge, considering the physical distance between the adjacent terms. As the main components of any graph are the edges, and trying to synthesize the strength of the association in a pair of values, the vector A^i_rs was proposed as the edge weight of the related terms tr, ts in a document i, defined as:

A^i_rs = ( w^i_r / D^i_rs , w^i_s / D^i_rs )   (5)
where w^i_r and w^i_s are the weights of the terms tr and ts, respectively, in a document i. Since the weight A^i_rs is a two-dimensional vector, the association strength is evaluated as the Euclidean norm ||A^i_rs||. In this case, the strength is greater if the term weights are greater and the distance between the terms is shorter. Besides, the upper value of the edge weight is (w^i_r, w^i_s) and the lower tends to (0, 0) when the distance is very long.
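A minimal sketch of the distance of (4) and the edge weight of (5), assuming each term of a document comes with the list of its (paragraph, sentence) occurrence positions and a precomputed weight; this input format is an assumption for illustration.

```python
import math

def distance(occ_r, occ_s):
    """Equation (4) for two occurrences of different terms:
    1 if they share paragraph and sentence, otherwise |pr - ps| + 2."""
    (pr, nr), (ps, ns) = occ_r, occ_s
    return 1 if (pr, nr) == (ps, ns) else abs(pr - ps) + 2

def shortest_distance(positions_r, positions_s):
    """Shortest physical distance D_rs between terms t_r and t_s over all occurrence pairs."""
    return min(distance(a, b) for a in positions_r for b in positions_s)

def edge_weight(w_r, w_s, d_rs):
    """Equation (5): the two-dimensional edge weight A_rs and its Euclidean norm."""
    a_rs = (w_r / d_rs, w_s / d_rs)
    return a_rs, math.hypot(*a_rs)
```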
As an association graph does not possess a structural or spatial representation, the similarity between two documents was defined considering each graph as a set of edges. The measure proposed is called Weighted Coverage and it is defined as follows:

sim(di, dj) = (1/2) ( Σ_{tr,ts∈Tij} S^ij_rs ||A^i_rs|| ) / ( Σ_{tr,ts∈Ti} ||A^i_rs|| ) + (1/2) ( Σ_{tr,ts∈Tij} S^ij_rs ||A^j_rs|| ) / ( Σ_{tr,ts∈Tj} ||A^j_rs|| )   (6)

The weight S^ij_rs represents a similarity measure between A^i_rs and A^j_rs. This weight is defined as:

S^ij_rs = cos(A^i_rs, A^j_rs) ∗ ( 1 − (1/2) ( ||A^i_rs|| − ||A^j_rs|| )^2 )   (7)
where the first part of the expression represents the cosine between those vectors, defined in a similar way as (1). Although this model shows good performance, at least in classification tasks, it has some drawbacks: it intrinsically keeps the frequency point of view, conserving the basis of the vector space model; its experimental results are not clearly better than those of the vector space model; and its computational, temporal and spatial costs are relatively higher than those of the vector space model. Aiming at solving these drawbacks, other assumptions, also based on the association graph scheme, were analyzed.
4 Global Association Distance Model

In order to evaluate the effect of the frequency over the weighted coverage measure, an experiment was performed, considering and removing the frequency in (5). But this experiment produced a surprising result: the performance of the association graph model when it includes term frequency was very similar to the one removing the frequency. These results suggested that frequency does not seem to be such an important term feature, and only the physical distance between terms is enough. The above observation led us to analyze the basic premise of the association graph scheme and to consider other assumptions that simplify the complexity of the association graph model. We know that in the classical vector space model the term relevance is assumed by the number of occurrences in a document. Nevertheless, if we want to include its semantic context, we should consider not only its occurrence but also its co-occurrences (and association strength) with other terms in sentences, paragraphs and so on. For that reason, we keep expression (4), taking the formula 1/√(D_rs) – we omitted the superscript i only for simplicity considerations – as a similarity expression for evaluation purposes.
In order to simplify the association graph model, based on a set of weighted edges, we consider the preliminary ideas of the vector space model proposed by Salton, i.e. to consider “a document space consisting of documents ... , each identified by one or more index terms ... weighted according to their importance” [1]. In our case, the importance of a term tr in a document d is related to its global association strength and can be calculated as follows:

g_tr = Σ_{ts∈d} 1/√(D_rs)   (8)
Therefore, the Global Association Distance Model can be defined as a vector space model where each term is weighted by its global association strength. So, a document d can be modeled by a vector d⃗:

d⃗ = (g_t1, ..., g_tn)   (9)
5
Experiments and Analysis
In order to evaluate the proposed model, TREC-51 in Spanish and Reuter-217582 in English corpora were used, considering topic classification. TREC-5 is a AFP news corpus with 695 news published during 1994-1995 and classified in 22 topics. Reuter 21758 is a Reuters LTD news corpus with 21758 news classified in 119 topics; due to memory restrictions we randomly selected 800 news items. All news were preprocessed, normalizing and lemmatizing their terms with TreeTager [8]. We compared the Global Association Distance Model (GADM) with the standard Vector Space Model (VSM) and the Association Graph Model (AGM). A K-NN classifier was used with k = 1, 5, 10, 15, 20. The experimental evaluation was done using k-fold with k = 10. Three evaluation measures of performance are commonly used for a single category or topic [9]: Precision is the quotient of the correctly assigned and the ones assigned to the category, Recall is the quotient of the correctly assigned recision and the ones belonging to the category; and F − measure = 2∗Recall∗P Recall+P recision . For evaluating the performance average across categories, there are two conventional methods: – Macro-averaging performance: Scores are computed by a simple average of the performance measures for each category. Macro-averaged performance score gives equal weights to every category or topic, regardless of its frequency. 1 2
http://trec.nist.gov/pubs/trec5 ftp://canberra.cs.umass.edu/pub/reuters
– Micro-averaging performance: Scores are computed by first accumulating the corresponding variables in the per-category expressions, and then using those global quantities to compute the scores. Micro-averaged performance score gives equal weights to every document.
Tables 1 and 2, and their corresponding Fig. 1(a) and Fig. 1(b), show the experiment results. As can be noticed, GADM mostly outperforms both models, VSM and AGM, for all values of k. This shows that the use of a vector of global association strengths can actually improve the performance of document processing, at least for classification tasks using K-NN.
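A minimal sketch of the two averaging schemes, assuming per-category counts of true positives, false positives and false negatives have already been collected:

```python
def f1(tp, fp, fn):
    """F-measure from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(counts):
    """counts: dict category -> (tp, fp, fn). Equal weight to every category."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """Pool the counts first, then compute a single F-measure (equal weight to every document)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)
```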
Table 1. AFP classification results (F-measure)

k    | Macro-averaging performance     | Micro-averaging performance
     | VSM       AGM       GADM        | VSM       AGM       GADM
1    | 0.618342  0.621376  0.628072    | 0.852003  0.864963  0.874922
5    | 0.613379  0.618476  0.619129    | 0.872003  0.876437  0.873555
10   | 0.597129  0.576622  0.597338    | 0.859100  0.856458  0.864963
15   | 0.574918  0.551977  0.583869    | 0.844880  0.848993  0.863341
20   | 0.538896  0.523529  0.551356    | 0.828865  0.831809  0.846111
Table 2. Reuters classification results (F-measure)

k    | Macro-averaging performance     | Micro-averaging performance
     | VSM       AGM       GADM        | VSM       AGM       GADM
1    | 0.042262  0.042397  0.045202    | 0.662500  0.642500  0.682500
5    | 0.035451  0.034263  0.039288    | 0.703750  0.686250  0.717500
10   | 0.031761  0.028553  0.033216    | 0.710000  0.681250  0.731250
15   | 0.027341  0.026778  0.029069    | 0.705000  0.676250  0.716250
20   | 0.025304  0.023235  0.026642    | 0.698750  0.672500  0.706250
Fig. 1. Classification results (F-measure): (a) AFP, (b) Reuters; macro- and micro-averaging
Although these results are only preliminary, they show that our proposed model behaves well and seems to be a better solution when a semantic approach is considered.
6 Conclusion

In order to achieve a better document representation, we have proposed a new concept named global association strength, which uses semantic relations amongst terms in documents. We also presented a new document representation model named the Global Association Distance Model, with a complexity no greater than the classic vector space model, and showing better results in document classification. The experimental results have shown interesting improvements. Although other experiments must be done, the proposed model gives promising results and opens new lines of work.
References
1. Salton, G.: The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, New Jersey, (1971).
2. Berry, M.: Survey of Text Mining, Clustering, Classification and Retrieval. Springer, (2004).
3. Feldman, R., Dagan, I.: Knowledge Discovery in Textual Databases (KDT). In Proc. of the First International Conference on Data Mining and Knowledge Discovery, KDD'95, Montreal, (1995) 112–117.
4. Kou, H., Gardarin, G.: Similarity Model and Term Association for Document Categorization. NLDB 2002, Lecture Notes in Computer Science, Vol. 2553, Springer-Verlag Berlin Heidelberg New York, (2002) 223–229.
5. Becker, J., Kuropka, D.: Topic-based Vector Space Model. In Proc. of Business Information Systems (BIS) 2003, (2003).
6. Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized Vector Space Model in Information Retrieval. Proc. of the 8th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, New York, ACM 11 (1985).
7. Medina-Pagola, J.E., Guevara-Martinez, E., Hernández-Palancar, J., Hechavarría-Díaz, A., Hernández-León, R.: Similarity Measures in Documents using Association Graphs. In Proc. of CIARP 2005, Lecture Notes in Computer Science, Vol. 3773, (2005) 741–751.
8. Schmid, H.: Probabilistic Part-Of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994).
9. Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, (1999) 67–88.
Sentence Level Sentiment Analysis in the Presence of Conjuncts Using Linguistic Analysis

Arun Meena and T.V. Prabhakar
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur 208016 Kanpur, UP, India
{arunm, tvp}@iitk.ac.in
Abstract. In this paper we present an approach to extract sentiments associated with a phrase or sentence. Sentiment analysis has been attempted mostly for documents, typically a review or a news item. Conjunctions have a substantial impact on the overall sentiment of a sentence, so here we present how atomic sentiments of individual phrases combine in the presence of conjuncts to decide the overall sentiment of a sentence. We used word dependencies and dependency trees to analyze the sentence constructs and were able to get results close to 80%. We have also analyzed the effect of WordNet on the accuracy of the results over the General Inquirer.

Keywords: Sentiment analysis, favorability analysis, text mining, information extraction, semantic orientation, text classification.
1 Introduction

In recent years, there has been a rapid growth of web content, especially on-line discussion groups, review sites and blogs. These are highly personal and typically express opinions. To organize this information, automatic text categorization and identification of sentiment polarity are very useful. Most work done in this field has been focused on topic-based categorization, which is sorting the documents according to their subject content. Sentiment classification is a special case of the text categorization problem, where the classification is done on the basis of the attitude expressed by the authors. Sentiment analysis requires a deep understanding of the document under analysis because the concern here is how the sentiment is being communicated. Previous attempts at solving this problem, for example [2], [4], focused on the use of machine learning methods (N-gram, etc.), ignoring the importance of analysing the language being used to communicate sentiments. Therefore, we need to find new methods to improve sentiment classification by exploring linguistic techniques. Our work differs from earlier work in four main aspects: (1) our focus is not on classifying each review as a whole but on classifying each sentence in a review. (2) We give more consideration to the language properties of the sentence and to understanding the sentence constructs; for each sentence we recognize the subjects of the feeling and the feature being described. (3) We concentrate on the effects of conjunctions and sentence constructions, which have not been researched for
sentiment analysis. (4) Our method does not need a training set since it depends on linguistic analysis.
2 Previous Work

The cornerstone of sentiment analysis is Pang and Lee's 2002 paper [2]. The authors of that paper compare Naive Bayes, Maximum Entropy, and Support Vector Machine approaches to classifying the sentiment of movie reviews. They explain the relatively poor performance of these methods as a result of sentiment analysis requiring a deeper understanding of the document under analysis. Document-level sentiment classification assumes the whole document to have a single overall sentiment [1], [9], [11], [15]. In [14] the authors assigned sentiment to words, but they relied on quantitative information such as the frequencies of word associations or statistical predictions of favorability. A number of researchers have also explored learning words and phrases with prior positive or negative polarity (another term is semantic orientation) [1], [10]. Although we use a similar technique, we do not limit ourselves to a fixed word list; instead we use WordNet to find the semantic orientation of words that are not found in the General Inquirer word list. An approach similar to ours is taken by Matsumoto et al. in [3]. The authors of that paper recognize that word order and syntactic relations between words are extremely important for sentiment classification, and therefore it is imperative that they are not discarded. They construct a dependency tree for each sentence and then prune it to create subtrees for classification. Our work is closest to this work but still differs considerably, as we do not train on the trees; we use POS tagging and dependency trees to analyze the sentence constructs. We analyze in detail the effects of conjunctions on the overall semantic orientation of the sentence. Our analysis is not confined to adjectives and verbs, as we also deal with nouns, adverbs, conjunctions and prepositions which act as feeling words or affect the sentiment of a phrase.
3 Sentiment Analysis

The essential issue in sentiment analysis is to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions towards the subject. Thus, sentiment analysis involves identification of sentiment expressions, the polarity and strength of the expressions, and their relationship to the subject. Oftentimes, two words that are syntactically linked in a sentence are separated by several words. In these cases, N-gram models with a small N would fail to extract a correlation between the two words, but we use typed dependencies and the dependency tree to deal with this Non-Local Dependency Problem. Another problem is the Word Sense Disambiguation Problem. Consider the two sentences "I love this story" and "this is a love story": the first sentence communicates a positive sentiment, whereas the second is an objective statement with neutral sentiment. By looking at the typed dependencies of the words (love, story) in the first
sentence, one can identify that they have a direct object relation {dobj(love-2, story-4)}, which identifies it as a sentence carrying a sentiment, while in the second sentence {nn(story-5, love-4)} love acts only as a noun modifier of the word story, stating that the story is a love story and identifying it as an objective sentence.
Role of Conjunctions: A conjunction is a word that links words, phrases, or clauses, and it may be used to indicate the relationship between the ideas expressed in a clause and the ideas expressed in the rest of the sentence. Conjunctions play a vital role in deciding the overall polarity of a sentence. They often flip the sentiment to the opposite orientation or add to the strength of the sentiment. For example: "The Pacifica is exceptionally quiet most of the time, but it suffers some engine blare under hard acceleration." If we only consider the word exceptionally, we will mistake the sentiment for positive. However, the word but in the sentence changes its orientation; the sentiment is actually negative. The difficulty with conjunctions is that they can occur almost anywhere in the structure of a sentence, which demands a thorough analysis of the sentence construct, as we need to find the main clause of a sentence in order to decide the sentence-level polarity.
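The dependency-based distinction between the two "love story" sentences can be sketched with a few lines of code. The snippet below uses spaCy as a stand-in for the Stanford Lex-Parser used in the paper (an assumption on our part, requiring spaCy and its small English model to be installed); spaCy labels noun compounds "compound" where the Stanford typed dependencies above use "nn".

```python
# Minimal sketch: inspect the dependency relation linking two words, so that
# "I love this story" (dobj -> subjective) can be told apart from
# "This is a love story" (compound, Stanford "nn" -> objective).
import spacy

nlp = spacy.load("en_core_web_sm")

def relation_between(sentence: str, head_word: str, dep_word: str) -> str:
    """Return the dependency label linking dep_word to head_word, if any."""
    doc = nlp(sentence)
    for token in doc:
        if token.text.lower() == dep_word and token.head.text.lower() == head_word:
            return token.dep_
    return "none"

print(relation_between("I love this story", "love", "story"))     # expected: dobj
print(relation_between("This is a love story", "story", "love"))  # expected: compound (Stanford: nn)
```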
4 Sentiment Classification (Evaluation)

Step 1 does the POS tagging, generates the dependency tree and gives the typed dependencies of the words; for this the "Stanford Lex-Parser" [7] is used. We then select the feeling words (a feeling word is anchored by some substantive meaning and describes an author's attitude towards the subject). In Steps 2 and 3, we determine the presence of a conjunction and, if present, we identify all the individual phrases containing sentiments. Step 4 calculates the polarity of individual phrases using the default polarity calculation method with the help of the General Inquirer word list [13] or WordNet to get the semantic orientation of the words.
[Figure 1 flowchart — boxes: INPUT (Sentence); STEP 1: POS-tagging, create dependency tree; STEPS 2/3: determine if conjunction analysis is required, identify phrases containing sentiments; STEP 4: calculate polarities of individual phrases (use DPC/ML; polarities of individual words via GI/WordNet); STEP 5: apply conjunction rules (depends on Step 2); STEP 6: OUTPUT — phrases with sentiments, overall sentiment of the sentence]
Fig. 1. Steps in the process of sentiment classification using our system
If a conjunction is found, Step 5 applies a rule (from a rule set we created for a very comprehensive list of conjunctions) to the sentence, depending upon the type of conjuncts. The rules help in deciding the main clause of the sentence, i.e. the phrase that will decide the overall polarity of the sentence. If no conjunction is found, then the default method for finding the polarity of the sentiment expression is used. Finally, in Step 6 the total polarity of the sentence is decided.

4.1 Conjunction Analysis

Full syntactic parsing plays an important role in extracting sentiments correctly, because the local structures obtained by a shallow parser are not always reliable [8]. We start by passing the current sentence to the lex-parser; the output of the lex-parser is the dependency tree with POS tagging and the typed dependencies of the words.
Example: The following example explains the process of sentiment classification with the help of the dependency tree and the typed dependencies of the words in a sentence. The tagset used is the Penn tagset, where JJ = adjective, NN = noun, VP = verb phrase, etc.
Example 1. Supergirl is definitely a terrific DVD package, but a very lousy movie
(ROOT (S (NP (NNP Supergirl)) (VP (VBZ is) (ADVP (RB definitely)) (NP (NP (DT a) (JJ terrific) (NNP DVD) (NN package)) (, ,) (CC but) (NP (DT a) (RB very) (JJ lousy) (NN movie)))) (. .)))
nsubj(package-7, Supergirl-1) cop(package-7, is-2) advmod(package-7, definitely-3) det(package-7, a-4) amod(package-7, terrific-5) nn(package-7, DVD-6) det(movie-13, a-10) advmod(movie-13, very-11) amod(movie-13, lousy-12) conj_but(package-7, movie-13)
In this example, if we just analyze the whole sentence with the default polarity calculator, the result will be {+1 (terrific, package); -1 (lousy, movie)}. (The word lousy is not present in the GI word list, so the semantic orientation of this word is searched for using WordNet, as described in Section 4.3.) The total polarity of the sentence would therefore be {+1 - 1 = 0}, i.e. neutral, but from the sentence it is clear that the author did not like the movie, so the orientation should be negative. Now, as we have incorporated the effects of conjunctions into the sentiment analysis, the analysis will be somewhat different. The individual polarities of both phrases will be as above, but the two phrases are connected by the conjunction but, joining (terrific package, lousy movie), which are (NP, NP). Comparing the tags and the conjuncts with the rules from the rule file, it is clear that the second phrase is the main
clause and therefore it will be used to decide the overall polarity of the sentence, which is negative (-1).

4.2 Conjunction Rules

We have compiled rules to analyze the effects of more than 80 conjunctions (including all types of conjunctions), with 10 rules on average for each conjunction. A sample rule is given below:
Table 1. Rules for usage and effects of the conjunction but
The rule is for the conjunction but; it also shows the class and subclass of the conjunction. Each rule tag describes a rule for different conjuncts. According to the first rule, if the left clause and the right clause of the conjunction are NN — for example, everyone/NN but/CC John/NN is/VBZ present/JJ — the polarity of the right NN will be the opposite of the polarity of the left NN. Therefore, as the sentence is positive towards everyone, it is negative towards John; this is what the rule describes. Similar conclusions hold for the other rules.

4.3 Default Polarity Calculation (DPC)

We start the polarity classification by identifying the positive and negative words using the General Inquirer (GI) [13]. While determining the orientation of a word, if the word is not found in the GI list, we look the word up in the WordNet dictionary and search all its synonyms for semantic orientation in the GI word list (as synonyms generally have the same semantic orientation), which helps us determine the polarity of the word. While calculating the polarity of a word we have also considered the effects of negations (good is positive, while not good is negative) [12]. Further, the effects of words like very, little, rather, etc., which intensify or diminish the polarity of a word, have been analyzed. One problem with the method of counting positive and negative terms is that we may need to remove the suffix of a given term in order to see if it exists in our list of terms. To do this we use the stemming algorithm of the "Stanford Lex-Parser" [7].

4.4 Overall Sentiment Determination

Once we get the individual polarities of the phrases, we decide the polarity of the sentence as described earlier. Product review analysis is also possible with this sentiment analyzer; one can even provide a specific subject or subjects for which to find the polarity.
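The default polarity calculation of Section 4.3 can be sketched roughly as follows. The GI_POSITIVE and GI_NEGATIVE sets are hypothetical placeholders for the General Inquirer categories, the WordNet fallback uses NLTK's WordNet interface (an assumption; the authors do not state which WordNet API they used), and the NLTK WordNet corpus is assumed to be available.

```python
# Sketch of the default polarity calculation (DPC): General Inquirer lookup,
# WordNet-synonym fallback, and simple negation handling.
from nltk.corpus import wordnet as wn

GI_POSITIVE = {"good", "terrific", "great"}   # placeholder subset of the GI list
GI_NEGATIVE = {"bad", "worry", "problem"}     # placeholder subset of the GI list

def gi_polarity(word):
    if word in GI_POSITIVE:
        return +1
    if word in GI_NEGATIVE:
        return -1
    return 0

def word_polarity(word, negated=False):
    """Polarity of a single word: GI first, then WordNet synonyms as fallback."""
    polarity = gi_polarity(word)
    if polarity == 0:
        # Synonyms generally share semantic orientation, so try them against GI.
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                polarity = gi_polarity(lemma.lower())
                if polarity != 0:
                    break
            if polarity != 0:
                break
    # Negation flips the orientation ("good" vs. "not good").
    return -polarity if negated else polarity

print(word_polarity("terrific"))            # +1 from GI
print(word_polarity("lousy"))               # resolved via WordNet synonyms, if any map to GI
print(word_polarity("good", negated=True))  # -1
```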
Example 2. An example explaining how conjunctions change the whole sentiment of the sentence.
[INPUT] The notchy gear box was a worry and needed some time getting used to, but today with a three month old Aveo I can safely say it was no problem.
[OUTPUT]
Finding Polarity:
  notchy: -1 {gear box}
  worry: -1 {gear box}
  Conjunction Found [and]: Current Polarity = -2
  Conjunction Found [but]: Current Polarity = 0
  problem: -1 (preceding no found): +1 {gear box, worry}
Total Polarity: [-1 and 0 but +1]
  and(-1, 0) = -1
  but(-1, +1) = +1
Final Polarity: +1
Once each sentence in a given text is evaluated, combining sentence-level ratings into a global score is still a tough problem. Summarization of the results depends on the query type: one may want all the phrases containing sentiments, or one may need the polarity of the entire sentence, in which case the polarities of the individual phrases are combined using conjunction analysis to find the overall sentiment of the sentence.
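A minimal sketch of how phrase polarities could be folded together through conjunctions is given below, mirroring the trace in Example 2. The specific operator behaviour for and and but is inferred from that trace and is an assumption, not the authors' full rule set.

```python
# Folding phrase polarities through conjunctions, as in Example 2:
# "but" lets the right-hand clause dominate, "and" keeps the accumulated
# (left) sentiment unless it is neutral.  This mirrors the printed trace
# (and(-1, 0) = -1, but(-1, +1) = +1) and is only an illustrative assumption.
def combine(left, conjunction, right):
    if conjunction == "but":
        return right if right != 0 else left
    if conjunction == "and":
        return left if left != 0 else right
    return left + right  # fallback for conjunctions without a special rule here

def sentence_polarity(phrase_polarities, conjunctions):
    """phrase_polarities: [p0, p1, ...]; conjunctions[i] joins p_i and p_{i+1}."""
    total = phrase_polarities[0]
    for conj, polarity in zip(conjunctions, phrase_polarities[1:]):
        total = combine(total, conj, polarity)
    return total

# Example 2: polarities [-1, 0, +1] joined by "and" then "but"  ->  +1
print(sentence_polarity([-1, 0, +1], ["and", "but"]))
```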
5 Experimental Results

Our experiments use car reviews as the dataset, compiled from different car review sites like http://www.motortrend.com, http://wardsautoworld.com, etc. The dataset contains more than 10,000 pre-labeled sentences, 5000 positive and 5000 negative. We first look at the results of the machine learning algorithms. More than 60% (40,000/64,000) of the sentences in the movie review dataset contained one or more conjunctions and, as expected, the results were very poor (less than 40%). It is clear from the results that an algorithm trained on documents cannot be used for sentence-level analysis; furthermore, machine learning cannot be used efficiently for sentences with conjunction(s).

Table 2. Results of learning algorithm (Naïve Bayes) in presence of one or more conjunctions

Tested on sentences with     Trained with No Conjuncts   Trained with Conjuncts
No Conjunction               70%                         73%
1 Conjunction                51%                         58%
More than 1 Conjunction      40%                         55%
Our system gave a poor result on the movie dataset, i.e. just 39%, as the reviews were labeled at the document level: there were many sentences with overall negative polarity in the positive reviews and vice versa. For the evaluations, we check whether the polarity of the sentiment is appropriately assigned to the given subject in each input in terms of the sentiment expression in the output, and calculate precision and recall.

Table 3. Accuracy of the various classification algorithms considered in this paper

                             Machine Learning   DPC                          Conjunction Analysis
                                                GI     GI with WordNet       GI     GI with WordNet
Sentences with No Conjuncts  72%                54%    66%                   …      …
Sentences with Conjuncts     56%                39%    51%                   62%    78%
6 Conclusions and Future Work

From Table 3, we can observe that the use of WordNet substantially enhances the accuracy of the sentiment analysis. We can also see that conjunction analysis improves sentiment classification by more than 25%, and it is clear from the results that the machine learning algorithm is superior to DPC. Performing base-level sentiment analysis using a learning algorithm and employing conjunct analysis for combining these phrase-level sentiments into sentence-level sentiments appears likely to result in better accuracy. Our current system requires manual development of sentiment lexicons, and we need to modify and add sentiment terms for new domains, so automated generation of sentiment lexicons, in order to reduce human intervention in dictionary maintenance, will also be a priority. This will improve precision and recall for new domains. We believe that our major challenge lies in the conjunction rules; we need to find a way of dealing with situations for which no rule is specified. In addition, we can implement named entity tagging for domain-specific information, which should help remove objective sentences to a greater extent.
References 1. Peter Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 417-424, 2002. 2. B. Pang, L. Lee, S. Vaithyanathan, ”Thumbs up? Sentiment Classification using Machine Learning Techniques,” Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, 2002. 3. S. Matsumoto, H. Takamura, M. Okumura,”Sentiment Classification using Word SubSequences and Dependency Sub-Tree,” Proceedings of PAKDD, 2005. 4. B. Pang, L. Lee,”Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” Proceedings of the ACL, 2005.
5. Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of EMNLP. 6. Tetsuya Nasukawa, and Jeonghee Yi, “Sentiment Analysis: Capturing Favorability Using Natural Language Processing” K-CAP’03, October, 2003, pp. 70-77. 7. The Stanford Natural Language Processing Group (http://nlp.stanford.edu/software/lexparser.shtml) 8. Kanayama Hiroshi, Nasukawa Tetsuya and Watanabe Hideo. Deeper Sentiment Analysis Using Translation Technology, pp. 4-5. 9. Kushal Dave, Steve Lawrence, and David M. Pennock, “Mining the peanut gallery: Opinion extraction and semantic classification of product reviews,” in Proceedings of the 12th International World Wide Web Conference (WWW-2003), 2003. 10. Vasileios Hatzivassiloglou and Kathleen R. McKeown, “Predicting the semantic orientation of adjectives,” in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-1997), 1997. 11. Bo Pang and Lillian Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42th Annual Meeting of the Association for Computational Linguistics (ACL-2004), 2004. 12. Livia Polanyi and Annie Zaenen. Contextual valence shifters. In Proceedings of the AAAI Symposium on Exploring Attitude and Affect in Text: Theories and Applications (published as AAAI technical report SS-04-07), 2004. 13. Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie, and associates. The General Inquirer: A Computer Approach to Content Analysis. The MIT Press, 1966. 14. S. Morinaga, K. Yamanishi, K. Teteishi, and T. Fukushima. Mining product reputations on the web. In Proceedings of the ACM SIGKDD Conference, 2002. 15. Beineke, Philip, Trevor Hastie, Christopher Manning, and Shivakumar Vaithyanathan. 2004. Exploring sentiment summarization. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications (AAAI tech report SS-04-07).
PageRank: When Order Changes

Massimo Melucci and Luca Pretto

University of Padova, Department of Information Engineering
{massimo.melucci,luca.pretto}@dei.unipd.it
Abstract. As PageRank is a ranking algorithm, it is of prime interest to study the order induced by its values on webpages. In this paper a thorough mathematical analysis of PageRank-induced order changes when the damping factor varies is provided. Conditions that do not allow variations in the order are studied, and the mechanisms that make the order change are mathematically investigated. Moreover the influence on the order of a truncation in the actual computation of PageRank through a power series is analysed. Experiments carried out on a large Web digraph to integrate the mathematical analysis show that PageRank — while working on a real digraph — tends to hinder variations in the order of large rankings, presenting a high stability in its induced order both in the face of large variations of the damping factor value and in the face of truncations in its computation.
1 Introduction
As PageRank is an algorithm to rank webpages or websites, questions regarding the order induced by PageRank values are in principle of interest in many real-life applications. Our main contribution regards a mathematical analysis of the situations in which a variation of the value of the damping factor α forces a variation in the order induced by PageRank. More precisely, if px(α) denotes the PageRank of the page x when the damping factor is α, it is our interest to answer questions such as: If pi(α) < pj(α), does α1 such that pi(α1) > pj(α1) exist? What are the mechanisms that cause changes in the order, and what are the mechanisms that hinder them? Do digraphs in which a change is impossible exist? Which webpages are likely to be favoured to the detriment of others by a change in α? What is the influence on the order of a truncation in the actual computation of PageRank through a power series? Experiments were carried out on a large Web digraph to integrate the mathematical analysis and to test the hypothesis that the order induced by the PageRank values computed on the whole digraph is significantly different from the order induced by the PageRank values computed on the local, smaller digraph around every webpage. To study the problem of PageRank-induced order changes with a mathematical perspective is technically much more difficult than studying PageRank value changes. When value changes are studied, in fact, a linear system is under investigation, but when order changes are examined linearity is lost, so that traditional linear algebra tools become useless.
2 Preliminary Results
The basic facts on PageRank used in this paper can be found in [4]. Let us denote by W the transition probability matrix of the Web Markov chain, i.e. the Markov chain represented by the Web digraph after the dangling nodes problem is addressed. In order to obtain irreducibility and aperiodicity, the Web Markov chain is modified: when in a state i, an existing arc is followed with probability α, while a jump towards a randomly chosen state is performed with probability 1 − α, where α ∈ [0, 1) is the damping factor. This modification of the Web Markov chain will be called PageRank Markov chain, and its associated digraph will be called PageRank digraph. The PageRank vector is the limit probability vector of the PageRank Markov chain; it can be obtained by solving p(α) =
\frac{1-\alpha}{N}\,\mathbf{1} + \alpha W^{T} p(\alpha) \qquad (1)
where N is the total number of webpages under consideration and 1 is an N × 1 vector of ones. Moreover, each entry of p(α) is a continuous function in the interval [0, 1) [1]. Let us denote by I the identity matrix. When α < 1, I − αWT is nonsingular [5, Lemma B.1], and so (1) gives p(α) =
\frac{1-\alpha}{N}\,(I - \alpha W^{T})^{-1}\mathbf{1}. \qquad (2)
The inverse matrix (I - αW^T)^{-1} can be expressed as

(I - \alpha W^{T})^{-1} = I + \alpha W^{T} + \alpha^{2}(W^{T})^{2} + \cdots = \sum_{n=0}^{+\infty} \alpha^{n}(W^{T})^{n}

and so by considering expression (2) we get:

p(\alpha) = \frac{1-\alpha}{N} \sum_{n=0}^{+\infty} \alpha^{n}(W^{T})^{n}\,\mathbf{1}. \qquad (3)

Since the order induced by PageRank is affected neither by the factor (1 - α) nor by the first member of the sum, what governs the order is the power series

\alpha W^{T}\frac{1}{N}\mathbf{1} + \alpha^{2}(W^{T})^{2}\frac{1}{N}\mathbf{1} + \cdots + \alpha^{n}(W^{T})^{n}\frac{1}{N}\mathbf{1} + \cdots \qquad (4)
Let us denote by Pr[A] the probability of the event A. It is worth noting that the generic term (W^{T})^{n}\frac{1}{N}\mathbf{1} gives a vector whose entry j is the probability Pr[x(n) = j] that the Web Markov chain x(·) is in state j after n steps, when the initial distribution of the chain is the uniform one. Therefore the order induced by PageRank depends on the transitory behaviour of the Web Markov chain. In order to understand the behaviour of the PageRank-induced order when α varies, Formula (4) must be considered under different points of view — this is what will be done in the following sections. From Formula (4) it is straightforward to observe that two pages i, j do not change their relative order when the damping factor varies if Pr[x(n) = i] ≥ Pr[x(n) = j] for every n = 1, 2, 3, . . ., or Pr[x(n) = i] ≤ Pr[x(n) = j] for every n = 1, 2, 3, . . ., when the initial distribution of the chain is the uniform one.
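Formula (3) lends itself directly to a truncated numerical computation. The sketch below implements this with NumPy on a hypothetical toy transition matrix W (not one of the paper's examples); it is only meant to make the power-series view concrete.

```python
# Truncated power-series computation of PageRank, following Formula (3):
# p(alpha) ~= (1 - alpha) * sum_{k=0..n} alpha^k (W^T)^k (1/N) 1.
import numpy as np

def pagerank_power_series(W, alpha, n_terms):
    N = W.shape[0]
    term = np.ones(N) / N        # k = 0 term: (1/N) * 1
    total = term.copy()
    for _ in range(n_terms):
        term = alpha * (W.T @ term)   # next power-series term
        total += term
    return (1 - alpha) * total

# Hypothetical row-stochastic transition matrix of a 4-page toy digraph.
W = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.25, 0.25, 0.25, 0.25]])

for alpha in (0.1, 0.5, 0.8):
    p = pagerank_power_series(W, alpha, n_terms=50)
    # np.argsort gives the ranking induced by the (truncated) PageRank values.
    print(alpha, np.argsort(-p))
```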
3 When the Order May Change

3.1 Global Behaviour
Let us focus our attention on the global behaviour of order changes when the damping factor varies from 0 to 1. The changes that occur on the basis of the general, i.e. global, structure of the digraph, independently from its specific, ‘local’, characteristics will henceforth be known as ‘global’ changes. Let us consider first the case α ≈ 0. To study order problems mathematically in this case, it is convenient to write p(α) =
\frac{1-\alpha}{N}(I - \alpha W^{T})^{-1}\mathbf{1} = \frac{1-\alpha}{N}\bigl(\mathbf{1} + \alpha W^{T}\mathbf{1} + o(\alpha)\mathbf{1}\bigr). \qquad (5)
When α ≈ 0, but α ≠ 0, the pages' relative order is governed by the vector W^{T}\mathbf{1}, that is, for the generic page i, by the sum \sum_{h \to i} \frac{1}{d_{h}}, where d_h is the out-degree of page h and 'h → i' indicates that the sum is extended to all and only the pages which have at least one link to page i. In the rest of this paper this quantity will be called the weighted in-degree of page i. PageRank's continuity and Formula (5) allow one to state that if page i has a weighted in-degree greater than page j, then pi(α) > pj(α), at least when α ≈ 0. Let us consider now the other extreme, that is the case α ≈ 1. In this case, the study can be carried out by considering the PageRank Markov chain. It is known that the limit lim_{α→1⁻} p(α) exists, and that when α → 1⁻ the PageRank value of all the states that in the Web Markov chain are inessential is 0, that is, all the PageRank values are swallowed up by the closed communicating classes of the Web Markov chain, that is by the rank sinks of the Web Markov chain [1]. These results, showing that a global PageRank shift occurs from inessential towards essential states when α moves from 0 to 1, suggest that this shift can change the PageRank-induced order. Of course, this can happen if in the Web Markov chain inessential states exist. Studies on the Web digraph structure show that this is the case; indeed, most of the states are inessential [2]. In the case of a Web Markov chain with inessential states this shift of PageRank might not generate an order change (comparing the two cases α ≈ 0 and α ≈ 1) only in rather pathological cases, practically impossible in the Web digraph structure: this could happen, indeed, if the PageRank shift did not influence the PageRank-induced order present when α ≈ 0. In Figure 1(a) a Web Markov chain with a rank sink is depicted, from which it is easy to obtain:

p_{1}(\alpha) = p_{2}(\alpha) = p_{3}(\alpha) = \frac{1-\alpha}{5}, \quad p_{4}(\alpha) = \frac{1+2\alpha-3\alpha^{2}}{5}, \quad p_{5}(\alpha) = \frac{1+\alpha+3\alpha^{2}}{5}

so that an order change between page 4 and 5 occurs when α = 1/6. In Figure 1(b), on the contrary, no order change occurs when α moves from 0 to 1, because the initial order persists even when all the PageRank values are swallowed up by the rank sink.
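The crossover at α = 1/6 claimed for Figure 1(a) can be checked directly from the closed forms above; the snippet below simply evaluates the two polynomials and is our own verification aid under the reconstructed formulas, not part of the paper.

```python
# p4 > p5  iff  1 + 2a - 3a^2 > 1 + a + 3a^2  iff  a > 6a^2, i.e. a < 1/6 (for a > 0).
for a in (0.1, 1/6, 0.25):
    p4 = (1 + 2*a - 3*a**2) / 5
    p5 = (1 + a + 3*a**2) / 5
    print(f"alpha={a:.3f}  p4={p4:.4f}  p5={p5:.4f}  p4>p5: {p4 > p5}")
```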
3.2 Local Behaviour
Hitherto it has been shown that PageRank-induced order is governed by the weighted in-degree of each page when α ≈ 0, and by the swallowing-up in the
Fig. 1. A Web Markov chain with a rank sink which causes a change in order (a), and a Web Markov chain with a rank sink without changes in order (b)
rank sinks when α ≈ 1. What happens between these two extremes? More precisely: Are the order changes that may occur when α moves from 0 to 1 only due to a PageRank shift towards the rank sinks (global influence), or might they also be due to a local influence, different from the one seen up to now? The following example shows that global influence on its own cannot explain all the PageRank-induced order changes. Example 1. Let us consider the Web digraph structure in Figure 2. Since this
Fig. 2. A strongly connected Web digraph in which the PageRank-induced order changes when the damping factor varies
digraph is strongly connected, that is, the Web Markov chain it represents is irreducible, no PageRank shift towards a rank sink can occur. Nevertheless, the PageRank-induced order of the pages of the digraph changes when α varies. In fact, elementary computations give:

p_{3}(\alpha) = p_{1}(\alpha)(1 + 2\alpha), \qquad p_{4}(\alpha) = p_{1}(\alpha)(1 + \alpha + 2\alpha^{2})

and so, while p1(α) is always less than the other PageRank values, p3(α) > p4(α) when α < 1/2 and p3(α) < p4(α) when α > 1/2. Note that the symmetry of the digraph gives: p1(α) = p2(α).
These order changes can be explained with reference to the local structure of the digraph. For the purposes of the present analysis it should be noted that in (3) the PageRank-induced order is governed by:

\alpha W^{T}\mathbf{1} + \alpha^{2}(W^{T})^{2}\mathbf{1} + \cdots + \alpha^{n}(W^{T})^{n}\mathbf{1} + \cdots

Let us focus our attention on two pages i and j. As regards page i there is:

\alpha \sum_{h_{1}\to i} \frac{1}{d_{h_{1}}} + \alpha^{2} \sum_{h_{2}\to h_{1}} \sum_{h_{1}\to i} \frac{1}{d_{h_{2}}}\frac{1}{d_{h_{1}}} + \cdots + \alpha^{n} \sum_{h_{n}\to h_{n-1}} \sum_{h_{n-1}\to h_{n-2}} \cdots \sum_{h_{1}\to i} \frac{1}{d_{h_{n}}}\frac{1}{d_{h_{n-1}}}\cdots\frac{1}{d_{h_{1}}} + \cdots \qquad (6)
which is a weighted sum; the generic member

B(n, i) = \sum_{h_{n}\to h_{n-1}} \sum_{h_{n-1}\to h_{n-2}} \cdots \sum_{h_{1}\to i} \frac{1}{d_{h_{n}}}\frac{1}{d_{h_{n-1}}}\cdots\frac{1}{d_{h_{1}}}
gives the sum of the probabilities of all and only the paths that end in page i after n steps. Formula (6) shows that the sums of the probabilities of the paths that end in page i — which in their turn depend on the topology of the digraph and on the weights of the links — govern the order induced locally by PageRank. More precisely, when local behaviour is under examination, only that subgraph of the Web digraph which is made up of the pages that lead to page i should be considered. This digraph will henceforth be referred to as the back digraph of page i. Now, when α is close to 0 the order is governed by the part of the back digraph near the page, while when α is close to 1 a broader part should be considered. Definition 1. The back digraph of page i expands at step n with regard to page j if B(n, i) > B(n, j). B(n, i) − B(n, j) is the size of the expansion. Now, it can be easily proved that the two pages i and j may change their relative order when α varies only if an expansion change between the back digraphs of pages i and j takes place, that is only if n and m exist such that the back digraph of page i expands at step n with regard to page j, and the back digraph of page j expands at step m with regard to page i. This fact gives a necessary condition for an order change to take place; note that, for instance, in the Web digraph in Figure 2, the back digraph of page 3 expands with regard to page 4 at step 1, while the back digraph of page 4 expands with regard to page 3 at step 2. On the other hand, this condition is far from being a sufficient condition for a local change in order to take place. This fact has essentially two causes: 1. The size of the expansions should be considered; for example, an expansion of the back digraph of page i with regard to page j with a size s cannot be compensated for by a subsequent expansion of the back digraph of page j with regard to page i with a smaller size. Moreover, as a consequence
of the damping effects of the damping factor, the size of the compensating expansion necessarily gets bigger the further the subsequent expansion is from the first. 2. The more subsequent expansions of the back digraph of page i with regard to page j take place, the harder it becomes for page j to compensate for these expansions, and so allow an order change.
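B(n, i) can be computed for every page at once by propagating the all-ones vector backwards through the link structure, since B(n, i) is the i-th entry of (W^T)^n 1. The sketch below does this with a hypothetical weighted adjacency matrix and is only meant to illustrate Definition 1.

```python
# B(n, i) = sum over length-n paths ending in i of the product of 1/d_h along
# the path; in matrix form this is the i-th entry of (W^T)^n 1, so repeated
# multiplication by W^T yields B(n, .) for all pages at once.
import numpy as np

def expansion_profile(W, max_n):
    """Return a (max_n x N) array whose row n-1 holds B(n, i) for every page i."""
    vec = np.ones(W.shape[0])
    rows = []
    for _ in range(max_n):
        vec = W.T @ vec          # one more backward step through the links
        rows.append(vec.copy())
    return np.array(rows)

# Hypothetical 4-page digraph (row-stochastic W); compare pages i and j step by step.
W = np.array([[0.0, 0.5, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.25, 0.25, 0.25, 0.25]])
B = expansion_profile(W, max_n=5)
i, j = 2, 3
for n, row in enumerate(B, start=1):
    verdict = "i expands" if row[i] > row[j] else ("j expands" if row[j] > row[i] else "tie")
    print(f"n={n}: B(n,{i})={row[i]:.3f}  B(n,{j})={row[j]:.3f}  {verdict}")
```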
4 Effects of Truncation
The mathematical analysis expounded up to now is also of use in throwing light on another important question: how is the PageRank-induced order affected by a truncation in PageRank computation? The power series (3) computes the PageRank of each webpage after an infinite number of iterations. For obvious reasons, the computation has to be stopped after a finite number n, that is, the power series has to be truncated. Such a truncation may affect the PageRank value of each webpage, although to varying extents. Can this truncation also affect the relative order of the pages? It should be noted that, from a mathematical point of view, when local behaviour is studied, our attention is focused on the first few terms of (3) — a truncation, in a sense — while when global behaviour is under examination, our attention ranges over the whole series, i.e. the terms with a very high value of n are also considered. So when a truncation at step n occurs, only those local effects involving expansion that take place no later than at step n are taken into consideration, while global effects are wholly ignored. While the effects of partly ignoring the expansions depend on the local structure of the Web digraph, and so cannot be completely foreseen, the mathematical analysis on global behaviour shows that, when α is near to 1, the webpages belonging to a rank sink are liable to be the most damaged, that is, their PageRank-induced ranking positions tend to be lower than they would be if no truncation occurred.
5 Experiments
The experiments fundamentally aimed at investigating: (1) The correlations between the various PageRank-induced rankings computed for different values of α, and (2) the correlation between the PageRank-induced ranking and the ranking induced by a truncated computation of PageRank. In every experiment the correlations between rankings were assessed by Kendall’s τ [3]. The experiments were carried out on the 10GB Web Track (WT10g) test collection of the Text Retrieval Conference (TREC). The results were computed as follows. First, the value of Expression (4) was computed for α = 0.1, 0.5, 0.8 and for n = 50 — for n = 50 it was observed that a ranking no longer changes, i.e., the value of the sum can vary but the relative order does not. Then the value of Expression (4) was computed for the same α’s and for n = 1, 2, 5, 10, 50. τ was then computed for each α between each pair of rankings, that is, the ranking after n = 50 and the ranking after a different n.
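The correlation analysis can be reproduced in outline with SciPy's Kendall's τ. The snippet below reuses the toy truncated power-series routine from Section 2 in a self-contained form; W is again a hypothetical stand-in for a real Web digraph, not the WT10g collection.

```python
# Kendall's tau between the ranking induced after n = 50 power-series terms and
# the rankings after smaller n, for several damping factors.
import numpy as np
from scipy.stats import kendalltau

def truncated_pagerank(W, alpha, n_terms):
    term = np.ones(W.shape[0]) / W.shape[0]
    total = term.copy()
    for _ in range(n_terms):
        term = alpha * (W.T @ term)
        total += term
    return (1 - alpha) * total

W = np.array([[0.0, 0.5, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.25, 0.25, 0.25, 0.25]])

for alpha in (0.1, 0.5, 0.8):
    reference = truncated_pagerank(W, alpha, 50)
    for n in (1, 2, 5, 10):
        # tau on the score vectors equals tau on the induced rankings, since
        # Kendall's tau is invariant under monotone transformations.
        tau, _ = kendalltau(reference, truncated_pagerank(W, alpha, n))
        print(f"alpha={alpha}  n={n}  tau={tau:.3f}")
```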
Table 1. Correlation between 10-page subrankings with α = 0.1, 0.5 and n = 1, 2, 5, 10

α = 0.1
From  To    n=1    n=2    n=5    n=10
1     N     0.999  1.000  1.000  1.000
1     10    0.956  0.956  0.956  0.956
11    20    0.733  0.956  1.000  1.000
21    30    0.556  0.911  0.956  0.956
31    40    0.600  0.911  1.000  1.000
41    50    0.733  0.911  1.000  1.000
51    60    0.200  1.000  1.000  1.000
61    70    0.600  0.822  1.000  1.000
71    80    0.022  0.911  1.000  1.000
81    90    0.289  0.822  0.956  0.956
91    100   0.556  0.956  1.000  1.000

α = 0.5
From  To    n=1     n=2     n=5    n=10
1     N     0.997   0.998   0.998  1.000
1     10    0.867   0.956   0.956  0.956
11    20    0.644   0.778   0.911  1.000
21    30    -0.644  -0.467  0.822  0.956
31    40    0.289   0.244   0.644  0.956
41    50    -0.022  0.022   0.467  0.911
51    60    -0.022  -0.022  0.778  0.956
61    70    0.378   0.511   0.733  1.000
71    80    0.111   0.111   0.511  0.956
81    90    0.156   0.244   0.778  1.000
91    100   0.333   0.289   0.511  1.000
As regards the complete N-webpage rankings, the results are reported in the first line of Table 1 and on the left of Table 2 for α = 0.1, 0.5, 0.8, respectively — in the first line, N means that all the webpages were involved in computing the correlation. Another analysis was conducted by investigating the behaviour of the subrankings computed at every 10 webpages of the ranking sorted by decreasing final PageRank values pi. To this aim the correlation with the ranking induced by pi^(n) — the PageRank values after n steps — was computed. The results are reported in Table 1 for α = 0.1 and α = 0.5 and on the left of Table 2 for α = 0.8. The correlations increased as n increased, as one would expect. The influence of α on the variation of the correlations observed at different subrankings was less strong, as shown in Table 2, which reports the correlations between the subrankings at α = 0.1 after n = 50 steps and the ones at α = 0.5, 0.8 after n = 50 steps; for example, the correlation between the ranking after n = 50 steps with α = 0.1 and the ranking after n = 50 steps with α = 0.5 is reported in the third column of Table 2. In particular, if α = 0.8, the convergence of τ to the state of perfect correlation is quite slow.

Table 2. Left: correlation between 10-page subrankings with α = 0.8 and n = 1, 2, 5, 10. Right: correlation between 10-page subrankings with n = 50, α = 0.1 and subrankings with α = 0.5, 0.8.

From  To    n=1     n=2     n=5     n=10
1     N     0.998   0.998   0.997   0.996
1     10    0.244   0.289   0.200   0.333
11    20    0.511   0.511   0.600   0.911
21    30    0.244   0.200   0.333   0.778
31    40    0.022   0.067   0.022   0.511
41    50    -0.244  -0.467  -0.422  0.378
51    60    -0.156  0.067   0.289   0.733
61    70    -0.111  -0.111  -0.022  0.333
71    80    0.200   0.200   0.422   0.022
81    90    -0.111  -0.111  -0.289  0.200
91    100   0.289   0.156   0.156   0.156

From  To    α=0.5   α=0.8
1     N     0.999   0.998
1     10    0.911   0.778
11    20    0.289   0.111
21    30    0.689   0.600
31    40    -0.022  -0.200
41    50    0.689   0.556
51    60    0.778   0.733
61    70    0.244   0.244
71    80    0.511   0.467
81    90    0.333   0.200
91    100   -0.200  -0.289

To sum up, the experiments showed that PageRank-induced order is stable enough if the damping factor is close to 0 and when a truncation in the computation is effected after n steps. As far as concerns the top-ranked webpages, some instability can be observed if α tends to 1. To test whether the damping factor affects the final ranking, the correlation was also computed between the rankings at n = 50 and at different α's independently of n. It was found that the correlations were very high if all the N webpages were involved in the computation, whereas the correlations varied if the subrankings were instead considered. When all the N webpages were considered, the experiments showed that PageRank-induced order is considerably stable in large digraphs, both when variations in the damping factor values take place and when a truncation in the computation is effected. The experiments showed that the effects of PageRank shift towards rank sinks are negligible in large digraphs, at least if α is not close to 1. The whole body of these considerations suggests that the PageRank definition has a mathematical structure that naturally tends to hinder order changes in practice. Moreover, the influence of the back digraph near each page is predominant, especially when α is not too large.
References 1. P. Boldi, M. Santini, and S. Vigna. PageRank as a function of the damping factor. In Proc. of WWW, pages 557–566. ACM, May 2005. 2. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In Proc. of WWW, 2000. 3. M. Kendall. Rank Correlation Methods. Charles Griffin & Co. Ltd., fourth edition, 1975. 4. A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, 2006. 5. E. Seneta. Non-negative Matrices and Markov Chains. Springer, New York, second edition, 1981.
Model Tree Learning for Query Term Weighting in Question Answering

Christof Monz

Department of Computer Science, Queen Mary, University of London
Mile End Road, London E1 4NS, United Kingdom
[email protected]
www.dcs.qmul.ac.uk/~christof
Abstract. Question answering systems rely on retrieval components to identify documents that contain an answer to a user’s question. The formulation of queries that are used for retrieving those documents has a strong impact on the effectiveness of the retrieval component. Here, we focus on predicting the importance of terms from the original question. We use model tree machine learning techniques in order to assign weights to query terms according to their usefulness for identifying documents that contain an answer. Incorporating the learned weights into a state-of-the-art retrieval system results in statistically significant improvements.
1 Introduction

Current question answering systems rely on document retrieval as a means of identifying documents which are likely to contain an answer to a user's question. The documents returned by the retrieval engine are then further analyzed by computationally more expensive techniques to identify an answer to a given question. The effectiveness of the retrieval component is critical for the performance of a question answering system: If the retrieval system fails to find any relevant documents for a question, further processing steps to find an answer will inevitably fail as well. In particular, the way queries are formulated has a strong impact on retrieval effectiveness. Boolean retrieval is especially sensitive to query formulation. Using all content words from the question for the query can steer the retrieval process in a wrong direction. For example, consider the question What is the abbreviation for the London stock exchange?. It seems natural to include the word abbreviation in a boolean query like:
(1) abbreviation AND london AND stock AND exchange
However, most documents that contain an answer to this question express it in the form of '. . . London Stock Exchange (LSE). . . ,' not using the term abbreviation or one of its morphological variants at all. Hence, in boolean retrieval, a query such as (1) might be too strict and retrieve no documents at all. Vector-space retrieval is less strict with respect to presence or absence of query terms in documents, but a similar problem arises. Even vector-space retrieval approaches will prefer documents containing the term abbreviation over those that do not contain it,
although, as discussed above, most documents providing an answer to the question above do not contain it. The reason is that the term abbreviation receives a high term weight, because it is much less frequent in the collection than the other terms and is therefore considered a well-discriminating term for this query. Our approach tries to predict the importance of a term for a given question by applying machine learning techniques. The learned importance weights are then used to improve the retrieval engine. Learning query term weights is appealing in the context of question answering because a user’s information need is expressed as a well-formed sentence as opposed to sets of keywords, used in regular information retrieval.
2 Related Work Previous work on query formulation for question answering has mainly been done for Web question answering. Brill et al. [2] focus on formulating query strings that approximate the way an answer is likely to be expressed. This involves automatically mapping the syntax of an interrogative to the syntax of a declarative sentence. Documents are required to either match one of the strings or the boolean query. However, they do not address the issue of term weighting. Pas¸ca [11] does address the issue of term selection and term relevance. His work is closely related to the work presented in this paper. For query formulation, he distinguishes between three types of terms: high-relevance, medium-relevance, and lowrelevance query terms. Deciding which class a given term belongs to is based on a number of rules, some of which are also integrated in our approach. Although machine learning techniques have been used before to find answer extraction patterns, see, e.g., [8,14], they have not been applied before to query formulation in the context of question answering. On the other hand, machine learning has been applied to query formulation in the context of ad hoc retrieval. Cooper et al. [4] use logistic regression to assign weights to matching clues, such as the number of times a query term occurs in the query, the number of times a query term occurs in a document, the idf score of a matching term, and the number of distinct terms common to both query and document. [3] did apply machine learning techniques for selecting query terms, but it was done in the context of relevance feedback retrieval. The work by Mayfield and McNamee [9] and Agichtein et al. [1] is to some extent complementary to our work as it expands retrieval queries with terms or phrases that are likely to be found in the context of phrases which are of the expected answer type. While the approach in [9] is based on simple co-occurrence statistics, [1] also incorporate the BM25 retrieval weighting scheme to assign weights to the expansion phrases.
3 Optimal Query Term Selection In this section we estimate the effect query formulation, in the form of term selection, can have on retrieval performance, by using the TREC-9, TREC-10, and TREC-11 data sets consisting of 500 different questions each and the AQUAINT document collection. For evaluating the retrieval effectiveness, we used NIST’s judgment files. Each file indicates for each submitted answer-document-id pair, whether the answer is correct.
Questions with no correct answer were disregarded. The documents that are known to contain a correct answer form the gold standard for our evaluation, which contains 480, 433, and 455 questions for TREC-9, TREC-10, and TREC-11, respectively.¹ In order to compute the optimal term selection for each question, we compare all possible ways of selecting terms from a question. That is, given a question q in which the set of terms T occurs, we consider all possible subsets of T, and evaluate the respective performances (a sketch of this enumeration is given below Table 1). More formally, the set of term selection variants (tsv) is defined as tsv(q) = POW(T). For each question in the three data sets, we determined the query variant with the highest average precision. Table 1 shows the performance gains that can be achieved by using an oracle to pick the optimal query variant.

Table 1. Comparison of the a@n scores of optimal retrieval queries to baseline runs

       TREC-9                    TREC-10                   TREC-11
a@n    Lnu.ltc  opt              Lnu.ltc  opt              Lnu.ltc  opt
a@5    0.700    0.823 (+17.6%)   0.649    0.749 (+15.4%)   0.523    0.690 (+31.9%)
a@10   0.785    0.890 (+13.4%)   0.734    0.815 (+11.0%)   0.626    0.767 (+22.5%)
a@20   0.845    0.921 (+9.0%)    0.801    0.887 (+10.7%)   0.705    0.824 (+16.9%)
a@50   0.914    0.956 (+4.6%)    0.875    0.924 (+5.6%)    0.795    0.881 (+10.8%)
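A minimal sketch of how the term-selection variants tsv(q) could be enumerated and the oracle variant picked is given below; retrieve_and_score is a hypothetical stand-in for running the retrieval engine and computing average precision, which the paper does with its own Lnu.ltc system.

```python
# Enumerate term subsets of a question (tsv(q) = POW(T)) and pick the variant
# with the highest average precision.  retrieve_and_score is a hypothetical
# placeholder for issuing the query and evaluating it against the gold standard.
from itertools import combinations

def term_selection_variants(terms):
    for k in range(1, len(terms) + 1):
        for subset in combinations(terms, k):
            yield set(subset)

def oracle_variant(terms, retrieve_and_score):
    best_variant, best_ap = None, -1.0
    for variant in term_selection_variants(terms):
        ap = retrieve_and_score(variant)      # average precision of this query variant
        if ap > best_ap:
            best_variant, best_ap = variant, ap
    return best_variant, best_ap

# Example with a dummy scorer that penalizes long queries containing "abbreviation".
dummy = lambda v: (0.5 if "abbreviation" in v else 1.0) / len(v)
print(oracle_variant({"abbreviation", "london", "stock", "exchange"}, dummy))
```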
In the context of question answering, it is common to measure the effectiveness of a retrieval system in terms of answer-at-n (a@n) which is the percentage of questions for which the system returned at least one document containing an answer in the top-n ranked documents. For our baseline we did not use blind relevance feedback, as Monz [10] has shown that simple Lnu.ltc weighting with stemming outperforms approaches using blind feedback for question answering. As one could expect, query formulation has a significant impact on the overall performance of a retrieval system, even if query formulation is just based on term selection without expanding the queries with semantically related terms. This comparison shows that much can be gained from better query formulation, but, of course the problem of identifying an optimal query without having any relevance assessments remains open. In the remainder we explore ways to solve this issue. Two retrieval methods a and b are compared by one-tailed statistical significance testing, using the bootstrap method [5]. Improvements at a confidence level of 95% are marked with “ ” and at a confidence level of 99% with “ ”.
4 Computing Query Term Weights

Our approach is to use the different query variants of a question to distinguish between terms that help retrieve relevant documents, and terms that harm the retrieval effectiveness for that particular question. In the previous section, we considered only one single best-performing query variant, but often there are several almost equally well-performing query variants. Looking at
¹ The original TREC-9 data set contains 243 questions that are re-formulations of the questions in the main TREC-9 set. For our experiments, these variants were disregarded.
the ranked query variants reveals that some terms occur more frequently in higher-ranked variants than other terms. An analysis of the distribution of query terms over the ranked query variants allows one to assign a weight to each query term: if a term occurs mainly in query variants that have a high average precision it should receive a high weight, whereas a term that occurs mainly in query variants that have a low average precision should receive a low weight. Thus, the weight of a query term depends on two factors: the average precisions of the query variants in which the term occurs (its presence weight, w⁺(t)), and the average precisions of the query variants in which the term does not occur (its absence weight, w⁻(t)). Presence and absence weights are normalized by the sum of the average precisions of all query variants, so the weights will range between 0 and 1. Given a question q and all its query variants tsv(q), the presence weight of term t is computed as:

w^{+}(t) = \frac{\sum_{q' \in tsv(q),\, t \in q'} avg\_prec(q')}{\sum_{q' \in tsv(q)} avg\_prec(q')} \qquad (2)

The absence weight of term t is computed as:

w^{-}(t) = \frac{\sum_{q' \in tsv(q),\, t \notin q'} avg\_prec(q')}{\sum_{q' \in tsv(q)} avg\_prec(q')} \qquad (3)

The presence and absence weights of a term t can be combined into a single weight by subtracting the absence weight from the presence weight, which we call the gain of term t: gain(t) = w⁺(t) − w⁻(t). If a query term has a positive gain it should be included in the query, but excluded if its gain is negative. This approach of computing term weights assumes that terms occur independently of each other. This assumption does not hold in practice, but it is commonly used in information retrieval and allows us to simplify the computation of term weights.
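These definitions translate directly into code; avg_prec below is a hypothetical mapping from each query variant (a frozenset of terms) to its measured average precision.

```python
# Presence weight, absence weight, and gain of a term, following (2) and (3).
def term_gain(term, avg_prec):
    total = sum(avg_prec.values())
    present = sum(ap for variant, ap in avg_prec.items() if term in variant)
    w_plus = present / total
    w_minus = (total - present) / total
    return w_plus - w_minus          # gain(t) = w+(t) - w-(t)

# Toy data: average precision per query variant (hypothetical values).
avg_prec = {
    frozenset({"abbreviation", "london", "stock", "exchange"}): 0.10,
    frozenset({"london", "stock", "exchange"}): 0.40,
    frozenset({"abbreviation"}): 0.05,
}
for t in ("abbreviation", "london"):
    print(t, round(term_gain(t, avg_prec), 3))
```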
5 Representing Terms by Feature Sets In the previous section, the computation of the term weights was based on the distribution of the terms themselves over the query variants. This is problematic for two reasons. First, the same term can have a high gain in one query, and a low gain in another. Second, if the learning algorithm is based on the surface terms themselves, it cannot assign weights to terms that did not occur in the training data. The first problem is a direct consequence of the term independence assumption. It could be solved by conditioning the weight of a term on a number of terms that also occur in the question, but then data sparseness becomes even more of an issue. One way to address both problems is to represent terms and their contexts in a more abstract manner. Here, we use a set of features that represent certain characteristics of a term and its role in a question. The list of features contains information about the term’s part-of-speech, whether it semantically includes other terms in the question, the type
of question it occurs in, etc. As mentioned above, some of the features capture aspects inherent to a term, such as part-of-speech, while others capture contextual aspects, such as semantic inclusion. Table 2 lists all features. The features question focus, superlative, quoted, number of leaves, modified noun, and person name are based on [11].

Table 2. List of features for question words

Feature          Value
part-of-speech   Penn Treebank part-of-speech tag
question focus   Whether the word is part of the question focus
superlative      Whether the question contains a superlative adjective
question class   A fixed list of question classes
classif. word    Whether the word was used to classify the question
multpl. occurr.  Whether the word occurs more than once in the question
quoted           Whether the word occurs between quotation marks
modified noun    Whether the word is a noun that is preceded (modified) by another noun
term ratio       1/m, where m is the number of unique terms in the question
no. leaves       The number of hyponyms in WordNet that do not have any further hyponyms
location         Whether the word is part of a location name
abbreviation     Whether the word is an abbreviation
upper case       Whether the word starts with an uppercase letter
person name      What part of a person's name is the word, if applicable
honorific        Whether the word is a honorific term (e.g., Dr.)
no. edges        The number of edges pointing to a word in the dependency parse graph of the question
hypernym         Whether the word is a hypernym of another question word
relative idf     The relative idf compared to the other words in the question
Most of the features in Table 2 are fairly general and self-explanatory, but some are more specific and do require some further explanation. The focus of a question is a phrase describing a semantic type of which the answer is an instance. For example, in the question In what country did croquet originate?, the focus is country. The answer to this question, which is France, is an instance of country. Whether a word is part of the question focus affects query formulation, because many documents containing an answer to the question do not explicate the instance relation. Classifying words help determine the type of a question. E.g., in the question What province is Calgary located in?, the word located indicates that the question is a location question. However, words that are good indicators for question classification are infrequent in answers as expressed in documents. For instance, it is rather unlikely that the word located is used in a declarative sentence answering this question. The number of incoming edges refers to the dependency parse graph of the question, which is generated by MINIPAR [7]. If a word has a larger number of incoming edges, several words in the question are in a modifier or argument relationship with this word, and therefore it is more likely to play a central role in the question. For each word in any question, the features listed in Table 2 are extracted automatically, by using off-the-shelf tools, including a POS tagger, dependency parser, and named-entity recognizer.
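A few of the simpler features can be computed with nothing more than string handling; the sketch below extracts four of them for each question word and leaves the parser- and WordNet-based features out. The tokenization, the feature names and the crude abbreviation test are our own simplifications.

```python
# Extracting a handful of the simpler Table 2 features per question word.
# Parser-based features (question focus, no. edges, ...) are omitted here.
def simple_features(question: str):
    words = question.rstrip("?").split()
    unique_terms = {w.lower() for w in words}
    features = {}
    for w in words:
        features[w] = {
            "upper_case": w[0].isupper(),
            "abbreviation": w.isupper() and len(w) > 1,   # crude heuristic
            "multpl_occurr": sum(1 for x in words if x.lower() == w.lower()) > 1,
            "term_ratio": 1.0 / len(unique_terms),
        }
    return features

for word, feats in simple_features("What is the abbreviation for the London stock exchange?").items():
    print(word, feats)
```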
6 Learning Term Weights

Instead of using heuristics for predicting query term importance, we apply machine learning techniques to assign a weight to each term in the question, where the actual query that is used for retrieval will include these weights. The input for the learning algorithm is the set of feature vectors as described in the previous section, and the classes are the terms' gains as described in Section 4. For the purpose of learning term weights, the machine learning algorithm should learn to predict the degree of the query term's usefulness for query formulation. Decision trees, naive Bayes, and linear regression all allow for interval classification and generate transparent classification rules. Naive Bayes classification is known to perform well for nominal classification, but performs badly for interval classification [6]. The best-known algorithm for decision tree learning is Quinlan's C4.5 [13], but C4.5 cannot deal with cases where the class to be learned is not a label but a real number. M5 [12], on the other hand, which is an extension of C4.5, does allow for this type of continuous classification, also referred to as regression. The M5 algorithm builds model trees combining conventional decision tree learning with the possibility of linear regression models at the leaves of the tree. The resulting representation is transparent because the decision structure is clear and the regression models are normally easily interpretable. The idea of model trees is largely based on the concept of regression trees. The advantage of M5 is that model trees are generally much smaller than regression trees and have proved to be more accurate in a number of tasks; see [12]. M5 is suited for learning query term weights because it combines decision tree learning with linear regression, which allows it to consider dependencies between features. The learning algorithm used here is M5′ [15], which is a reconstruction of Quinlan's M5 algorithm. M5′ is part of the WEKA ML package [16].
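M5′ itself lives in WEKA; as a rough stand-in in Python one can fit a plain regression tree to the (feature vector, gain) pairs. This loses the linear models at the leaves, so it is only an approximation of the model-tree idea, and the training data below is fabricated for illustration.

```python
# Regression-tree stand-in for M5' model trees: predict a term's gain from its
# feature vector.  scikit-learn has no model-tree learner, so DecisionTreeRegressor
# is used purely as an approximation; X and y are made-up toy data.
from sklearn.tree import DecisionTreeRegressor

# Each row: [upper_case, abbreviation, multpl_occurr, term_ratio]  (toy encoding)
X = [[1, 0, 0, 0.25],
     [0, 0, 1, 0.25],
     [0, 0, 0, 0.20],
     [1, 1, 0, 0.20]]
y = [0.45, -0.10, 0.05, 0.60]        # observed gains for the training terms

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(model.predict([[1, 0, 0, 0.25]]))   # predicted weight for a new term
```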
7 Results

For our experiments, the weight of a query term depends on two factors: the frequency of a term in a document, and the collection frequency, i.e., the number of documents containing that term. If we want to integrate the learned term weights, as described above, the computation of the retrieval status value (RSV) has to be adapted appropriately. We use the learned query term weights in combination with the original retrieval status value that resulted from computing the similarity between a query q and a document d according to the Lnu.ltc weighting scheme, which results in the new retrieval status value RSV_L, defined as follows:

RSV_{L}(q, d) = \sum_{t \in q \cap d} RSV(q, d) \cdot weight(fr(t, q)) \cdot idf(t)
Here, fr(t, q) is the feature representation of term t in query q, and weight(fr(t, q)) is the learned weight, which results from applying the M5′ model tree to t's feature vector. RSV(q, d) is the document similarity according to the Lnu.ltc weighting scheme, and
idf(t) is the idf value of term t, i.e., log₂(N/df_t), where N is the collection size and df_t is the number of documents containing t. For the evaluation three different model trees were generated, one for each of the TREC data sets. The model tree for assigning weights to terms in the TREC-9 data set was trained on feature representations of words from TREC-10 and TREC-11 (2854 instances), the model tree for the TREC-10 data set used feature representations from TREC-9 and TREC-11 (3167 instances), and the model tree for the TREC-11 data set used feature representations from TREC-9 and TREC-10 (2769 instances). First, we considered the performance with respect to the answer-at-n (a@n) measure. Table 3 shows the results of using learned query term weights in contrast to the Lnu.ltc baseline. The improvements are rather modest, although still statistically significant in some cases.

Table 3. Comparison of the a@n scores of learned-weights retrieval (LWR) to the baseline

       TREC-9                   TREC-10                  TREC-11
a@n    Lnu.ltc  LWR             Lnu.ltc  LWR             Lnu.ltc  LWR
a@5    0.700    0.727 (+3.7%)   0.649    0.654 (+0.1%)   0.523    0.547 (+4.6%)
a@10   0.785    0.806 (+2.7%)   0.734    0.730 (-0.1%)   0.626    0.637 (+1.8%)
a@20   0.845    0.863 (+2.1%)   0.801    0.804 (0.0%)    0.705    0.732 (+3.8%)
a@50   0.914    0.908 (-0.1%)   0.875    0.859 (-1.8%)   0.795    0.815 (+2.5%)
8 Conclusions

In this paper we investigated to what extent it is possible to learn query term weights for better query formulation. As we have seen in Section 3, keyword selection has a strong impact on the performance of the retrieval component. In order to learn query term weights, we considered all possible ways of selecting terms from the original question for query formulation, and used the performance results of each possible formulation in order to determine individual query term weights. Query terms are represented as sets of features on which the M5′ model tree learning algorithm is trained. The resulting model trees confirm some of the heuristics and intuitions for keyword selection that can be found in the literature; see, e.g., [11]. The improvements are modest for a@n, yet statistically significant in some cases, and stayed behind the potential improvements optimal query selection can yield. On the other hand, our term weight learning approach yields significantly better results than the baseline for mean average precision. Hence question answering systems that are more sensitive to the rank and number of retrieved documents can benefit from using our approach. In some cases the issue of whether a term is helpful for retrieving answer documents simply depends on idiosyncrasies of the documents that contain an answer, but we do not believe that this had a noticeable impact on our results, as we used large and varied training data to generalize properly.
References

1. Eugene Agichtein, Steve Lawrence, and Luis Gravano. Learning to find answers to questions on the web. ACM Transactions on Internet Technology, 4(2):129–162, 2004.
2. Eric Brill, Susan Dumais, and Michelle Banko. An analysis of the AskMSR question-answering system. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2002), pages 257–264, 2002.
3. Hsinchun Chen, Ganesan Shankaranarayanan, Linlin She, and Anand Iyer. A machine learning approach to inductive query by examples: An experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing. Journal of the American Society for Information Science, 49(8):693–705, 1998.
4. William Cooper, Aitao Chen, and Frederic Gey. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In Proc. of the 2nd Text REtrieval Conference, pages 57–66, 1993.
5. Brad Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.
6. Eibe Frank, Leonard Trigg, Geoffrey Holmes, and Ian H. Witten. Naive Bayes for regression. Machine Learning, 41(1):5–25, 2000.
7. Dekang Lin. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on the Evaluation of Parsing Systems, 1998.
8. Lucian Vlad Lita and Jaime Carbonell. Unsupervised question answering data acquisition from local corpora. In Proceedings of the Thirteenth Conference on Information and Knowledge Management (CIKM 2004), pages 607–614, 2004.
9. James Mayfield and Paul McNamee. JHU/APL at TREC 2005: QA retrieval and robust tracks. In Ellen M. Voorhees and Lori P. Buckland, editors, Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), 2005. NIST Special Publication: SP 500-266.
10. Christof Monz. Document retrieval in the context of question answering. In Fabrizio Sebastiani, editor, Proceedings of the 25th European Conference on Information Retrieval Research (ECIR-03), LNCS 2633, pages 571–579. Springer, 2003.
11. Marius Paşca. High-Performance Open-Domain Question Answering from Large Text Collections. PhD thesis, Southern Methodist University, 2001.
12. John R. Quinlan. Learning with continuous classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 343–348, 1992.
13. John R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
14. Deepak Ravichandran and Eduard Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 41–47, 2002.
15. Yong Wang and Ian H. Witten. Induction of model trees for predicting continuous classes. In Proceedings of the Poster Papers of the European Conference on Machine Learning (ECML), pages 128–137, 1997.
16. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
Examining Repetition in User Search Behavior

Mark Sanderson¹ and Susan Dumais²

¹ Department of Information Studies, University of Sheffield, Sheffield, S1 4DP, UK
² Microsoft Research, Redmond, WA, USA
[email protected], [email protected]
Abstract. This paper describes analyses of the repeated use of search engines. It is shown that users commonly re-issue queries, either to examine search results deeply or simply to query again, often days or weeks later. Hourly and weekly periodicities in behavior are observed for both queries and clicks. Navigational queries were found to be repeated differently from others.
1 Introduction

With the advent of large-scale logging of users' activities on search engines, analysis of those logs has produced much valuable information. Early examples of such work come from Broder (2002), who highlighted the differing forms of queries that users of Web search engines issue. He classified queries into navigational queries (where the goal is to find a particular web site); informational queries (where the user is seeking information on a particular topic) and transactional queries (where the user is looking to find sites, which themselves have to be searched to locate required information). In almost a decade of research in web search engine query logs, much of the published work has analyzed small samples of logs, often from a single day. The many works of Jansen & Spink (summarized in their 2006 paper) cover no more than a sample of a day's worth of activity from a particular search engine. Although such analyses provide insights about short-term interactions with search engines, they do not shed light on longer-term patterns, which are of interest in this paper. An early log study by Silverstein et al. (1998) summarized characteristics of almost a billion queries collected over a 43 day period of time. However, the authors did not specifically look at temporal effects or individual usage over time. A temporal analysis of queries covering a week's worth of data was recently reported by Beitzel et al. (2004). Their analyses focused on daily periodicities for queries in different topical categories. More recently, Teevan et al. (2006) examined the search behaviors of 114 anonymized users over the course of one year. In this work, users' repetition of queries and items clicked on in search result lists were of interest. Teevan et al. found that across the year, 33% of user queries were repetitions of queries previously issued by the same user. Repetition across users was lower at around 18%. Teevan et al. also separated out navigational queries, which they defined as queries issued at least twice and where the same URL was clicked in the result list for each query. They found that 71% of repeated queries were navigational; if duplicate repeat queries were eliminated, this number fell to 47%. They also examined clicks: the item a user chose in the search result list. They found that 29% of clicks recorded in the logs had been clicked on by the same user before. Teevan et al. identified different patterns of repetition and described types of user search behavior that fitted the patterns.
From this study, it was clear that search repetition is common. A more detailed study of repetition was undertaken in this paper to better understand search behavior over time with the goal of improving the search experience. We started with simple measurements of the repetition of queries and clicks, but it became clear that many forms of repetition in user behavior exist. Therefore the analysis was broadened to explore repetition and periodicities in general.
2 Data Set Examined

The query log analyzed was gathered from a major Web search engine for a 3 month period, from 9th Jan. – 13th Apr. 2006, consisting of approximately 3.3 million queries and 7.7 million search result clicks gathered from 324,000 unique users. Users were selected from a voluntary opt-in feedback system. The query text and the date/time when the query was received by the search engine were recorded. In addition, the URL of items in the search result list that were clicked on, the query that generated that result list, and the rank position of the clicked item were recorded. When users retrieved additional search results for a query, such events were treated as a separate query. This differs from Teevan et al.'s work where all identical queries issued by a user less than thirty minutes apart were treated as the same query. Their approach attempted to capture the notion of a search session. Any such approximation of sessions is prone to error, so we chose to examine the query data in its rawer form. Users were identified by an anonymised ID associated with a user account on a particular PC. As is the case with most log analyses, if a user has more than one computer, each with the opt-in feedback system working, they have multiple IDs. Conversely, if more than one person used the same account on a PC, they were amalgamated into a single user.
3 Initial Analysis of Query and Click Repetitions

The first analysis conducted on this data replicated the initial analyses of Teevan et al. determining the level of repetition in queries and in clicked results. Of the 3.3 million queries submitted, 1.62 million were unique to a single user (although many queries were repeated across users); the rest (1.68 million) were submitted more than once by a user. Repeat queries represented a little over 50% of all the queries submitted. This compares to the 33% observed by Teevan et al. We speculate that the different proportion of repeat queries is due to the difference in definition of what counts as a repeat query. The repeated queries were examined to determine how many were navigational queries. Based on Teevan et al.'s definition (same query and same URL click), around 80% of the 1.68 million repeat queries were navigational queries. This compares with the 71% observed by Teevan et al. From this analysis and those published before, it is clear that users repeat queries often on a search engine. The search result clicks of users were also examined. Of the 7.6 million clicks recorded in the three month period, 1.3 million (17.5%) were found to be clicks accessed more than once by individual users. Within that group of repeat clicks, 83%
were from the same query, and 17% were from different queries. A similar ratio of repeat clicks from same or different queries was found by Teevan et al. We next examined temporal differences in click patterns over time, which Teevan and colleagues did not analyze. Specifically, we examined the number of repeat clicks over time as well as whether repeat clicks were more or less likely to come from the same query as the time between repeat clicks increased.

3.1 Change in Repetition for Varying Differences in Time
Figure 1 shows a histogram of the counts of repeated clicks as a function of the number of days between the two clicks. The number of same click pairs steadily declines as the difference in time between the click events grows – searchers are more likely to click on the same URL in close temporal proximity. The curve drops off smoothly for several months and then more suddenly around 90 days due to windowing effects in the query logs, which cover only 94 days.
Fig. 1. Histogram of counts of same click events in the query log binned by the Δ in days between the two events. Note in all graphs in the paper, the number of paired events = the number of events-1.
Fig. 2. As Figure 1, with the addition of: same click pairs resulting from the same query (middle curve); same click pairs resulting from different queries (lower curve)
The repeat click data was sub-divided into two sets, click pairs resulting from the same query and click pairs resulting from different queries. Figure 2 shows this breakdown. The upper of the two new lines shows click pairs resulting from the same query; the lower line shows clicks resulting from different queries. As can be seen in the graph, the number of repeat clicks from different queries is substantially lower than from the same query. The gap between the two lines decreases slightly as the difference in time between clicks grows. Table 1 charts the proportional difference between the two curves, calculated by the following formula:

(Same Queries − Different Queries) / Total Queries
As the Δ between the paired events of a user clicking on a particular search result URL grows, the user is more likely to reach that URL via a different search query.
Table 1. Table of the percentage relative difference between the same query and different query lines in Figure 2
| Δ (days)            | 0   | 14  | 28  | 42  | 56  | 70  |
| Relative difference | 80% | 79% | 77% | 76% | 77% | 74% |
3.2 Periodicities in Repetitions

It can also be seen in the histograms in Figures 1 and 2 that there is a seven day periodicity in the data. From this data we can infer that if a user uses a search engine on a particular day of the week, they are more likely to re-use the engine on the same day in the following weeks. The weekly periodicity is observable for pairs of click events that occur months apart. A more detailed analysis of the log data revealed that the periodicity was due to a weekend effect. Users who access the search engine on a weekend are more likely to use the engine again on a weekend than on a weekday. It was found that if one observes a user event on a weekend, the probability of that user's next event also happening on a weekend was 55% (by chance, the probability was 28.6%, 2/7). For weekdays, the probability of the next event also occurring on a weekday was 81%, by chance it was 71% (5/7). The frequency of occurrence of search events on the engine was also different for each day (see Table 2). This combination of factors leads to the observed periodicity.

Table 2. Distribution of queries by days of the week
| Sun | Mon | Tues | Wed | Thurs | Fri | Sat |
| 14% | 16% | 15%  | 15% | 14%   | 13% | 13% |
Recalling the definition of user set out in Section 2 (an ID associated with a user account on a particular PC), we conclude that accounts on a PC used for searching on a weekday tend to be used more for searching on other weekdays and that accounts used on a weekend tend to be used more on other weekends.
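The weekend-transition probabilities reported above can be estimated directly from event timestamps. The following is a small illustrative sketch; the (user_id, timestamp) representation is an assumption made for the example, not the actual log schema.

```python
from collections import defaultdict

def weekend_transition_probability(events):
    """events: iterable of (user_id, datetime) pairs from the query/click log."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)
    weekend_pairs = weekend_total = 0
    for times in by_user.values():
        times.sort()
        for prev, nxt in zip(times, times[1:]):   # consecutive events of one user
            if prev.weekday() >= 5:               # Saturday=5, Sunday=6
                weekend_total += 1
                if nxt.weekday() >= 5:
                    weekend_pairs += 1
    return weekend_pairs / weekend_total if weekend_total else 0.0
```

The same loop with the condition on weekdays gives the corresponding weekday figure, and comparing both against the 2/7 and 5/7 chance baselines reproduces the comparison made above.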
4 Further Temporal Analysis

From the histogram in Figure 1, it was clear that a number of factors were influencing the shape of the graph: the 94 day windowing effect (which caused the slope and sharp tail off) and the weekend effect (which caused the 7 day periodicity). Therefore, we further analyzed the data using a series of normalizations to remove such effects.

4.1 Normalizing the Data

In order to examine just the windowing effect in the data (independent of any repetition), query events in the log were randomly paired ignoring which user or query they came from. A histogram of the events binned by the number of days between the two events is shown in Figure 3. The windowing effect is clear. We can now use this curve to remove the windowing artifact from other analyses. Queries issued by the same user were randomly paired. A histogram of this plot is shown in
Figure 4: both the windowing and weekend effects are present. To remove the windowing effect, the data in Figure 4 was normalized using the randomly paired data in Figure 3 producing the graph in Figure 5. The normalization formula is shown below: the count c at a certain Δi (expressed as a fraction of the total counts) is divided by a similarly calculated fraction from the normalizing data.
( c(Δ_i) / Σ_{t=0}^{94} c(Δ_t) ) / ( cn(Δ_i) / Σ_{t=0}^{94} cn(Δ_t) )
Fig. 3. Histogram of randomly paired events
Fig. 4. Histogram of randomly paired events from the same user
In this normalized view, the horizontal line in Figure 5 crossing the y-axis at 1 is where the data points in Figure 3 would be plotted. Anything appearing above or below the line constitutes a deviation from the norm. Plots above the line are events occurring more often than found in the normalizing data; in contrast, anything plotted below occurs less often.
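A minimal sketch of this normalization is shown below; it simply divides the day-binned fraction of same-user pairs by the corresponding fraction from the randomly paired (normalizing) data, so a value of 1 means "exactly as often as chance". The list-of-Δ input format is an assumption made for illustration.

```python
from collections import Counter

def normalized_histogram(deltas_same_user, deltas_random, max_delta=94):
    c = Counter(deltas_same_user)    # c(Δ): counts from same-user pairs
    cn = Counter(deltas_random)      # cn(Δ): counts from randomly paired events
    total_c = sum(c.values()) or 1
    total_cn = sum(cn.values()) or 1
    norm = {}
    for d in range(max_delta + 1):
        frac_c = c.get(d, 0) / total_c
        frac_cn = cn.get(d, 0) / total_cn
        norm[d] = frac_c / frac_cn if frac_cn > 0 else 0.0
    return norm                      # values above 1 occur more often than chance
```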
Fig. 5. Normalized histogram of events from the same user (Figure 4 normalized by data in Figure 3). Y-axis is a normalized count of the number of events in each bin.
From these graphs it can be seen that if a user issues a query to a search engine, the chance of them issuing another query on the same day (Δ=0) is 3-4 times higher than would be expected by chance. Users are more likely than chance to re-use the search engine for a period of up to 20-21 days after issuing a query. Two search
events by the same user with a difference greater than 21 days are less likely to occur than would be expected by chance. This graph shows that, on average, users' search engine use tends to be bursty. If we observe users searching, we are likely to observe them searching again relatively soon, probably within the next three weeks; beyond that time, however, there is an increasing chance they may not be observed again.

4.2 Analyzing Query Repetition

Replicating the methodology used above, we examined user behavior with repeated queries. Events in the logs from the same user issuing the same query were randomly paired and a histogram of those events binned by the time difference in days was plotted. The graph produced was normalized by the data in Figure 4, so as to eliminate the windowing & weekend effects as well as the burstiness of user search behavior. The results are shown in Figure 6, where it can be seen that users' re-issuing of queries to a search engine is more bursty than user search engine re-use. If a user issues a particular query, they are likely to re-issue that query again within the following 7 days. After that, however, the chance of observing the same query from the same user reduces. Users appear to have a limited interest in pursuing a particular query. Whether this is because the user's information need was satisfied or because the user gave up is left for study in future work.
Fig. 6. Normalized histogram of counts of the same user issuing the same query
Fig. 7. Normalized histogram of repeat navigational queries issued by the same user
4.2.1 Examining Query Types

We examined whether the pattern of query repetition observed so far varied for different query types. Based on heuristics like those identified by Teevan et al. and by Lee et al. (2005), we generated a list of navigational queries. Queries in the logs that matched items in this list were randomly paired and normalized by the data used to generate the graph in Figure 6. The results are shown in Figure 7. As can be seen, if a user issues a navigational query, we are less likely to observe the same query being issued again within a few days than would be expected from general repeat query behavior by users (Figure 6). Thus the burst of repeat queries observed in Figure 6 appears due to non-navigational queries, which tend to be more information seeking focused. From this examination of the data sets, we conclude that repeat query behavior is different depending on the nature of the users' query. Navigational queries
are less likely to be repeated by users within a few days than queries with a more information seeking focus, and navigational queries are more likely to be repeated at later points in time. A final aspect of repeat searching behavior was examined: queries repeatedly submitted to a search engine by different users. For this analysis, search requests from different users were examined. It was found that certain such queries occurred in bursts. Within this set, queries that were topical for the time period covered by the query logs such as “April fools day” and “spring cleaning tips” were found as well as novelty or news queries such as “33-pound cat”, “Grammy music awards”, “testing of a scramjet engine”, etc. It was found that such queries had a higher than expected frequency of occurrence for around 1-2 weeks.
5 Hourly Analysis of User Queries

The presence of hourly periodicities in user behavior was also examined. The data used to produce Figure 5 was re-binned to hourly differences between events to produce the results in Figure 8. As can be seen, users' use of search engines follows a strong 24 hour periodicity. Users who query at a particular time on one day are likely to query at that same time on a different day. The data in Figure 6, which shows repeated queries from the same user, was similarly re-binned (shown in Figure 9). Remembering that the data in Figure 6 was normalized to remove windowing & weekend effects as well as user search periodicities, it is striking that users issuing the same query seem to show a stronger 24 hour periodicity in their behavior than is observed in general user search behavior.
Fig. 8. Histogram of the first 22 days of randomly paired query log events from the same user. Note points on the x axis mark out weeks.
Fig. 9. Histogram of the first 22 days of randomly paired events from the same user issuing the same query
It is not entirely clear why users repeatedly searching with the same query would be more likely to do so at the same time of day. A preliminary examination of these regular queries revealed that some queries seemed to be associated with a particular time of day, such as queries related to a TV show (e.g. “deal or no deal”, “american idol”). Others appeared to be queries that a user issued regularly to monitor a particular event or topic (e.g. queries for lotteries; see also Kellar et al., 2006 for a
description of monitoring queries). Although there was nothing about these queries that in itself would indicate they should be issued at a common hour, there was an indication in the data that the peaks were more likely to occur on weekdays than on weekends. One might speculate that during the week, the times when search engines are used are regulated by the structure of people's work and school lives. The exact reason for this periodicity is left to future work.
6 Conclusions and Future Work

This paper presented an analysis of repetitions in user search behavior. Many queries and URL clicks are repeated over time. Users show both seven day and 24 hour periodicities in their use of search engines and these periods repeat and stay consistent over many weeks. Use of search engines is also bursty, with users tending to repeatedly use search engines within a short period of time (e.g. a few weeks). Users re-issuing queries displayed an even stronger burstiness, although for navigational queries the opposite was observed, with users unlikely to re-issue navigational queries within a few days of first issuing the query. Queries that are repeatedly issued by different users were also examined and found to be related to temporally varying events or news (see also Vlachos et al., 2004). The work in this paper constitutes a preliminary analysis of the topic of repetitions in user interactions with search engines. All analyses presented in this paper described the general behavior of a large user population. We have not yet examined the variation within the averages and the degree to which individuals deviate from the norm. It is also unclear to what extent the periodicities that we have observed are related uniquely to search engine use or are a reflection of general use of the Web or even of general computer use. An examination of such behavior would be one avenue of future work.
References

Andrei Broder. A taxonomy of web search. SIGIR Forum 36(2), Fall 2002, 3-10, 2002.
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David A. Grossman and Ophir Frieder. Hourly analysis of a very large topically categorized web query log. ACM SIGIR 2004, 321-328, 2004.
Bernard J. Jansen, Amanda Spink. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing and Management, 42(1), 248-263, 2006.
Melanie Kellar, Carolyn Watters and Michael Shepherd. A goal-based classification of web information tasks. ASIST, 2292-2315, 2006.
Uichin Lee, Zhenuy Lui and Junghoo Cho. Automatic identification of user goals in web search. WWW 2005, 391-400, 2005.
Craig Silverstein, Monica Henzinger, Hannes Marais and Michael Moricz. Analysis of a very large web search engine query log. DEC SRC Technical Note 1998-014, 1998.
Jaime Teevan, Eytan Adar, Rosie Jones and Michael Pott. History repeats itself: Repeat queries in Yahoo's logs. ACM SIGIR 2006, 703-704, 2006.
Michail Vlachos, Christopher Meek, Zografoula Vagena, Dimitris Gunopulos. Identifying similarities, periodicities and bursts for online search queries. ACM SIGMOD 2004, 131-142, 2004.
Popularity Weighted Ranking for Academic Digital Libraries

Yang Sun and C. Lee Giles

Information Sciences and Technology, The Pennsylvania State University, University Park, PA, 16801, USA
Abstract. We propose a popularity weighted ranking algorithm for academic digital libraries that uses the popularity factor of a publication venue, overcoming the limitations of impact factors. We compare our method with the naive PageRank, citation counts and the HITS algorithm, three popular measures currently used to rank papers beyond lexical similarity. The ranking results are evaluated by the discounted cumulative gain (DCG) method using four human evaluators. We show that our proposed ranking algorithm improves the DCG performance by 8.5% on average compared to naive PageRank, 16.3% compared to citation count and 23.2% compared to HITS. The algorithm is also evaluated by clickthrough data from the CiteSeer usage log.
Keywords: weighted ranking, citation analysis, digital library.
1 Introduction
Effectively indexing and retrieving information from large document databases continues to be a challenging task. Automated digital libraries make it easier for users to access and use these databases. In Web search, the PageRank [14] and HITS [9] algorithms, created to measure importance or authority as a ranking factor, showed great success compared to lexical similarity measures. Citation count is also widely used in evaluating the importance of a paper. However, unweighted citation counting often does not accurately describe the impact of papers [11]. The publication process of academic papers makes the citation graph much different from the Web graph. First, publication date and content of papers usually do not change over time, whereas those of Web pages can. Second, the typical citation graph of academic papers is an acyclic digraph without loops (there are rare exceptions to this). It is also common that a paper does not cite future papers. (Except in unusual cases where papers can cite unpublished work that is published later. In that case these papers can be treated as multiple versions.) Thus, the interpretation of the naive PageRank algorithm would be problematic [13,17]. We introduce a popularity factor weighted ranking algorithm based on PageRank with significantly improved performance for ranking academic papers. Our contributions are as follows:
– We define a new popularity factor that reflects the influence of publication venues and overcomes the limitations of a venue's impact factor.
– A popularity factor weighted ranking score of a paper in the proposed ranking method is defined by the weighted citations from other papers and the popularity factor of its publication venue and is implemented on the CiteSeer metadata.
– A user study with four evaluators shows the improved performance of this new algorithm. We also use clickthrough data from CiteSeer usage log to validate the results.
2 Weighted Ranking
According to information foraging theory [15], users of information retrieval systems will evaluate the value of documents by information cues (such as title, author, venue, citation count, publication date of a paper in academic digital libraries) and follow the most valuable document. The more cues they encounter, the better they can evaluate the value. Lexical similarity only shows a limited information cue about a document. Citation count as an information cue is usually considered to be strongly correlated to academic document impact [12]. Although it is widely used in academic evaluation, citation count has limitations which make it less precise. The citation count of an individual paper by itself does not reflect the different citation structure in each discipline [19]. The papers with high impact and the ones with low impact are treated the same in citation count [4].

2.1 Weighted Ranking Method
In our research users are modeled as optimal information foragers who evaluate the cost of pursuing a document by information cues and follow the most valuable document [15]. According to this user model, the reference in a high impact paper will have a high probability to be followed by users. The citations of a paper should be viewed as weighted links. Not only the count of citations but also the impact of citations matters in this sense. Furthermore, the quality of the publication venue where a document is published is also an important information cue for users to evaluate the value of a document. Popularity Factor. All serious research publication venues have a peer review process for publishing papers. It is fair to consider that the impact of a paper is partially reflected by where the paper is published. Impact factors of journals are widely used in evaluating the quality of publication venues[3]. There are limitations about the definition and the usage of impact factors. The impact factor is not normalized across research areas[6]. The calculation of an impact factor only considers a 3 year period. But important papers may receive many citations after 3 years[18]. Conferences are not considered in the calculation. Conferences often play very important roles in computer and information science research because of their timeliness and popularity.
We introduce the popularity factor to consider venue as an information cue and to reflect the influence of a publication venue. The popularity factor is defined based on citation analysis of publication venues. Note that the popularity factor does not distinguish journals from conference or workshop proceedings. The popularity factor of a publication venue v in a given year is defined by Equation 1:

PF(v, t) = (n_v / N) · Σ_{i ∈ P} [ PF(i, t) × w(i) / N(i) ]    (1)
where PF(v,t) is the popularity factor of publication venue v in a given year t, P is the set of publication venues i which cite v in that year, and n_v is the number of papers published in venue v in that year. If N is the total number of papers published in that year in a particular discipline, n_v/N represents the probability of a reader having access to a paper published in the publication venue v. Let w(i) be the weight which represents the frequency with which venue i cites venue v. N(i) is the total number of references generated by venue i, and PF(v,t) is normalized so that the squares of the scores sum to 1 for reasons of convergence: Σ_v (PF(v,t))² = 1. It has a range from 0 to 1, with larger values ranked higher. In our definition, a discipline is considered as a collection of related publication venues. The total number of papers in a discipline can be obtained by counting all the papers in all venues of the discipline. Multiple popularity factor values for one venue may occur in different disciplines. According to our definition, this popularity factor differs from the impact factor by considering the impact of all publication venues, recent to long-ago papers, and the probability of reader access. These differences overcome several shortcomings of the impact factor and provide a robust and reliable measure for publication venues. The popularity factor is computed with a simple iterative algorithm and achieves convergence after 18 iterations [9,14]. The top 5 venues ranked by popularity factor in 2004 based on our database (which includes primarily Computer Science and Engineering papers) are listed in Table 1.

Table 1. Popularity factors for computer science venues in 2004. Conferences and journals are both included.

| Popularity factor | Name                               |
| 0.05868           | INFOCOM                            |
| 0.04277           | ACM SIGPLAN                        |
| 0.04027           | ACM SIGCOMM                        |
| 0.02731           | Human factors in computing systems |
| 0.02622           | Mobile computing and networking    |
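A minimal sketch of this iterative computation is given below, assuming per-venue paper counts, per-venue reference totals, and venue-to-venue citation counts are available from the metadata; the variable names and data layout are illustrative, not the actual CiteSeer implementation.

```python
import math

def popularity_factors(cites, n_papers, n_refs, iterations=18):
    """cites[i][v]: times venue i cites venue v (w(i) for target v);
    n_papers[v]: papers published in venue v; n_refs[i]: references made by i."""
    venues = list(n_papers.keys())
    total = sum(n_papers.values())                            # N, papers in the discipline
    pf = {v: 1.0 / math.sqrt(len(venues)) for v in venues}    # uniform, unit-norm start
    for _ in range(iterations):
        new_pf = {}
        for v in venues:
            # sum over venues i citing v of PF(i) * w(i) / N(i)
            s = sum(pf[i] * cites.get(i, {}).get(v, 0) / n_refs[i]
                    for i in venues if n_refs.get(i, 0))
            new_pf[v] = (n_papers[v] / total) * s
        norm = math.sqrt(sum(x * x for x in new_pf.values())) or 1.0
        pf = {v: x / norm for v, x in new_pf.items()}         # Σ_v PF(v)² = 1
    return pf
```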
Ranking Score. A paper will nearly always be cited after it is published. A citation relationship in a published document should not change over time (revisions to technical reports may be an exception). Figure 1 illustrates the temporal effect of citation graphs.
Fig. 1. A schematic illustration of a citation graph. Each circle represents a paper. Arrows represent citations. It can be seen that citation graphs typically do not have backward citations. They are acyclic digraphs.
With the popularity factor and temporal effect, we define the ranking score R(d_T) of an academic paper d at a previous time T in Equation 2 as

R(d_T) = PF(v_{d_T}) + Σ_{t > T, d_t ∈ D} R(d_t) / N(d_t)    (2)
where R(d_t) is the ranking score of a paper d_t which is published at time t and cites paper d_T. D is the set of papers which cite d_T. N(d_t) is the number of references in paper d_t. PF(v_{d_T}) is the popularity factor of the publication venue v where paper d_T is published. The ranking score has a range from 0 to very large numbers. The vector PF is considered as an initial score of a paper when there is no citation record for this paper. This ranking assumes that the ranking score of a previously published paper will not have any impact on later published ones. Our algorithm does not permit bidirectional citations. If a paper is cited before it is published (not a common case), the paper will be considered to have two versions and the two versions will have separate rankings. The citation graph can be constructed as a function of publication time because of the temporal property of papers. The adjacency matrix of the graph can then be sorted to form a strict upper triangular matrix. Then, the equation of ranking scores can be written as a system of n linear equations with n unknowns, which has a single unique solution. As such there are no convergence issues. Notice that the ranking function has a computational complexity of O(nm), where n is the number of papers in the database and m is the average number of citations to a paper. m typically ranges from 0 to less than 50000 with a power law distribution for academic papers [16], making the ranking algorithm scalable for large digital libraries. The evaluation method and the results are discussed in the next section in order to demonstrate how our algorithm improves the ranking performance.
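Because the citation graph is acyclic in time, Equation 2 can be evaluated in a single pass from the newest papers to the oldest, with no iteration. The sketch below illustrates this under the assumption that each paper record carries its year, venue, and list of cited paper ids; these field names are placeholders rather than the CiteSeer schema.

```python
def ranking_scores(papers, pf):
    """papers: objects with .id, .year, .venue, .cites (ids of cited papers);
    pf: popularity factors per venue.  Returns a score per paper id."""
    scores = {p.id: pf.get(p.venue, 0.0) for p in papers}   # PF(v_d) as initial score
    # newest first: every citing paper's score is final before it is distributed
    for p in sorted(papers, key=lambda x: x.year, reverse=True):
        if not p.cites:
            continue
        share = scores[p.id] / len(p.cites)                 # R(d_t) / N(d_t)
        for cited in p.cites:
            if cited in scores:
                scores[cited] += share
    return scores
```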
3 Evaluation
Evaluating the quality of ranking algorithms used in an information retrieval system is a nontrivial problem[5,7]. Most evaluation methods involve human
evaluators judging the relevance of the retrieved documents to specific queries. In this research we compare our ranking method to naive PageRank, citation count and HITS using the discounted cumulative gain method with four human evaluators. Two of the evaluators were graduate students from the computer science department whose research interest is in data mining and Web search. One is a research programmer who has experience in machine learning program development. The other evaluator is a software engineer with a Master's degree in computer science. The four ranking algorithms are implemented in a basic information retrieval system which indexes 536,724 papers' metadata from the CiteSeer database [1] using Lucene [2], in which tf-idf is used as the basic ranking function. The evaluation is expected to compare the performance of the re-ranking functions beyond lexical similarity. The top 5 papers ranked by our algorithm, with comparison to naive PageRank, citation count and the authority scores of HITS, are shown in Table 2.

Table 2. Ranking scores of the top 5 papers in computer science

| Title                                                                                  | Weighted | PageRank | Citation | HITS    |
| The Anatomy of a Large-Scale Hypertextual Web Search Engine - Brin (1998)              | 0.13504  | 0.11924  | 521      | 0.89659 |
| Boosting a Weak Learning Algorithm By Majority - Freund (1995)                         | 0.07568  | 0.06363  | 174      | 0.27771 |
| A Tutorial on Learning With Bayesian Networks - Heckerman (1996)                       | 0.04321  | 0.04048  | 203      | 0.20106 |
| Irrelevant Features and the Subset Selection Problem - John (1994)                     | 0.04097  | 0.05434  | 290      | 0.14429 |
| Dynamic Itemset Counting and Implication Rules for Market Basket Data - Motwani (1997) | 0.03944  | 0.04392  | 232      | 0.43546 |

3.1 Evaluation Method
The discounted cumulative gain (DCG) measure [8,20] considers the ranking of papers in addition to the relevance score from evaluators. The assumption in DCG is that the lower the ranked position of a paper, the less valuable it is for the user, because it is less likely to be examined by users. Human evaluators are required to rate the ranked papers with a score from 0 to 2, where 0 represents non-relevance, 1 represents marginal relevance, and 2 represents high relevance. The DCG value is computed separately at each relevance level. The agreement among evaluators is examined by the kappa statistic to show the confidence level of the evaluation results. The kappa statistic is a standard measure of agreement with categorical data to assess the degree to which two or more raters agree in assigning data to categories [10]. Fifty queries were selected from the CiteSeer query log. Papers are ranked by the four ranking algorithms separately in addition to the tf-idf lexical similarity measure. The top twenty papers are mixed and presented to four human evaluators (using four evaluators is considered enough for DCG evaluation methods [8,20]).
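For reference, the DCG score for one query can be computed along the following lines. This is a sketch of the standard Järvelin & Kekäläinen formulation with log base 2; the exact discount used in the experiments is an assumption here.

```python
import math

def dcg(relevances):
    """relevances: graded judgements (0, 1 or 2) in ranked order."""
    score = 0.0
    for rank, rel in enumerate(relevances, start=1):
        score += rel / max(1.0, math.log2(rank))   # no discount at rank 1
    return score

# e.g. judgements for the top 5 results of one query
print(dcg([2, 1, 0, 2, 1]))
```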
3.2 Results
Kappa Measure. The agreement among the four evaluators was examined for each of the 50 queries. The average kappa agreement among the four evaluators is 53%, which is considered to be in the range of moderate agreement [10]. Precision Recall. The precision-recall curves at different relevance levels for the four algorithms, computed using standard methods [8], are presented in Figure 2.
Fig. 2. Precision-Recall of the four ranking algorithms at relevance level 1(a) and 2(b)
DCG. The DCG vector curves for ranks 1-20 are shown in Figure 3. The DCG scores are shown in Table 3. Both the curves and the statistics show that our algorithm significantly outperforms the other three algorithms for documents ranked after rank 10.
Fig. 3. DCG at various document rankings
3.3 Validity of Results
To validate the results, a pairwise measure of each ranking algorithm is calculated using the clickthrough data extracted from the CiteSeer usage log. For any pair of papers, if the clickthrough rate of the higher-ranked paper is larger than that of the lower-ranked paper, we consider the pair correctly ranked. The average pairwise accuracy of each algorithm is listed in Table 4.
Table 3. The DCG score for the four ranking algorithms

|                 | Our Algorithm | naive PageRank | Citation Count | HITS      |
| DCG @ rank 1    | 1.3           | 1.54           | 1.28           | 1.05      |
| DCG @ rank 5    | 3.49          | 3.33           | 2.87           | 2.49      |
| DCG @ rank 10   | 4.18          | 3.71           | 3.24           | 3.06      |
| DCG @ rank 15   | 4.69          | 3.86           | 3.49           | 3.25      |
| DCG @ rank 20   | 4.86          | 3.86           | 3.71           | 3.36      |
| Ave. DCG(1-10)  | 3.28±0.87     | 3.12±0.68      | 2.69±0.62      | 2.35±0.64 |
| Ave. DCG(10-20) | 4.68±0.16     | 3.85±0.04      | 3.54±0.14      | 3.25±0.11 |
Table 4. Average pairwise accuracy based on Clickthrough data

|               | Our Algorithm | naive PageRank | Citation Count | HITS   |
| Ave. accuracy | 74.18%        | 70.29%         | 67.1%          | 67.13% |

4 Discussion and Conclusions
A new weighted ranking score of a paper was defined by the weighted citations from other papers and the popularity factor of its publication venue. A ranking system based on the Lucene index and CiteSeer metadata was built and used to evaluate our algorithm in comparison with other popular ranking algorithms. The algorithm was evaluated by the DCG method using human evaluators, and we compared its results to the rankings of naive PageRank, citation counts and HITS. The comparison results show that the weighted ranking algorithm improves the DCG ranking performance by 8.5% compared to naive PageRank, 16.3% compared to citation count and 23.2% compared to HITS. We also use the clickthrough data from the CiteSeer usage log to validate the results. This leads us to believe that our weighted ranking algorithm is more accurate than those currently being used. Our evaluation experiment shows that the user agreement on paper rankings is not very high. Effective personalized ranking algorithms would most likely satisfy the diversity of most user information needs.
References

1. Citeseer, http://citeseer.ist.psu.edu.
2. D. Cutting, "The Lucene Search Engine," http://lucene.apache.org/, 2006.
3. E. Garfield, "The impact factor," Current Contents, 25, 3-7, 1994.
4. S. Harnard, "The New World of Webmetric Performance Indicators: Mandating, Monitoring, Measuring and Maximising Research Impact in the Open Access Age," Proc. of the 1st ECSP in Biomedicine and Medicine, 2006.
5. D. Hawking, N. Craswell, P. Thistlewaite, and D. Harman, "Results and challenges in Web search evaluation," Proc. of the 8th International World Wide Web Conference, 1321-1330, 1999.
6. F. Hecht, B. Hecht, and A. Sandberg, "The journal "impact factor": a misnamed, misleading, misused measure," Cancer Genet Cytogenet, 104(2), 77-81, 1998.
7. D. Hull, "Using statistical testing in the evaluation of retrieval experiments," Proceedings of the 16th annual international ACM SIGIR Conference, 329-228, 1993.
8. K. Jarvelin and J. Kekalainen, "IR evaluation methods for retrieving highly relevant documents," Proc. of the 23rd SIGIR conference, 41-48, 2000.
9. J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of ACM, 48, 604-632, 1999.
10. R. J. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, 33, 159-174, 1977.
11. S. Lehmann, B. Lautrup, and A. D. Jackson, "Citation networks in high energy physics," Physical Review E68, 026113 (2003).
12. F. Narin, "Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity," Cherryhill, N.J.: Computer Horizons, 1976.
13. Z. Nie, Y. Zhang, J. Wen, and W. Ma, "Object-Level Ranking: Bringing Order to Web Objects," Proc. of the 14th International World Wide Web Conference, 2005.
14. L. Page and S. Brin, "The PageRank citation ranking: bringing order to the web," tech. report SIDL-WP-1999-0120, Stanford University, Nov. 1999.
15. P. Pirolli and S. Card, "Information foraging in information access environments," Proc. of the SIGCHI conference, 51-58, 1995.
16. S. Redner, "How Popular is Your Paper? An Empirical Study of the Citation Distribution," European Physical Journal B, 4, 131-134, 1998.
17. M. Richardson, A. Prakash, and E. Brill, "Beyond PageRank: Machine Learning for Static Ranking," Proc. of the 15th International World Wide Web Conference, 2006.
18. P. Seglen, "Why the impact factor of journals should not be used for evaluating research," British Medical Journal, 314(7079), 498-502, 1997.
19. Thomson Corporation, "In Cites: Analysis Of," http://www.incites.com/analysis/, 2005.
20. E. Voorhees, "Evaluation by Highly Relevant Documents," Proc. of the 24th annual international ACM SIGIR conference, 74-82, 2001.
Naming Functions for the Vector Space Model

Yannis Tzitzikas and Yannis Theoharis

Computer Science Department, University of Crete, Greece, and Institute of Computer Science, FORTH-ICS, Greece
{tzitzik, theohari}@ics.forth.gr
Abstract. The Vector Space Model (VSM) is probably the most widely used model for retrieving information from text collections (and recently over other kinds of corpora). Assuming this model, we study the problem of finding the best query that "names" (or describes) a given (unordered or ordered) set of objects. We formulate several variations of this problem and we provide methods and algorithms for solving them.
1 Introduction
This paper elaborates on the "naming problem", i.e. the problem of finding a query (or the best query) that describes (names) a given (unordered or ordered) set of objects. The motivation for studying this problem is that several variations of it are useful in a number of tasks (e.g. for relevance feedback, document representation learning). In addition, naming functions can be exploited for providing flexible information access services; for instance, they can be exploited for realizing the context-based interaction scheme described in [9]. More precisely, naming functions can be exploited for supporting an alternative query formulation process. The user selects a number of objects (i.e. a set A′) and asks the system to formulate the query that "describes" these objects. The system returns a query q and subsequently the user can change it in a way that reflects his/her information need. Roughly this resembles the Query By Example (QBE) process in relational databases. It also resembles the relevance feedback mechanisms in IR systems. For example, consider a user who has formulated a query q and has been given an answer A. She can select a subset A′ of the current answer A consisting of those elements of A which she finds relevant to her information need. Subsequently, the system has to change the query appropriately. Notice that here the user is not directly driven to the answer of the new query (as is done in classical relevance feedback mechanisms). Instead, she is given the new query, so she has the opportunity to modify it before submitting it for evaluation. Probably this intermediate step could alleviate the problem of unexpected results that frequently occur after a feedback cycle. Notice that if A′ ⊆ A then it is like giving explicit negative relevance feedback (to the elements of A − A′), and "tacit" positive relevance feedback (to the elements of A′). However, this interaction scheme is more general than the classical feedback mechanisms of IR as it allows giving positive feedback to documents not already in the current answer. In addition, naming functions can help users to be attuned
with the indexing and query language of the system. It is not hard to foresee additional applications of naming functions, in building mediators (metasearchers) and in cross-language retrieval. For these reasons, we believe that the naming problem deserves separate attention and analytical elaboration. The rest of this paper is organized as follows. Section 2 states the naming problem and Section 3 provides solutions assuming the vector space model. Section 4 reports some experimental results. Finally, Section 5 concludes the paper and identifies issues for further research.
2 The Naming Problem
An IR system S, hereafter source, can be viewed as a function S : Q → A, where Q is the set of all queries that S can answer, and A is the set of all answers to those queries, i.e. A = { S(q) | q ∈ Q }. As the focus is on retrieval queries, it is assumed that A is a subset of P(O), the powerset of O, where O is the set of all objects stored at the source. In order to capture sources that return sets of ranked objects (based on best-match retrieval models), we shall use B's to denote ordered answers (and A's for unordered ones). Given an ordered set B, we use B(k) to denote the first k elements of B and B{k} to denote the set of elements that appear in B(k). We shall also use the notation {B} to denote the set of elements that appear in B (in other words {B} = B{|B|}). So if B = ⟨o1, o2, o3⟩ and B′ = ⟨o3, o2, o1⟩, then we can write {B} = {B′}. Furthermore, we use B|A to denote the restriction of B on the set A, i.e. the ordered set obtained if we exclude from B those elements that do not belong to A. Similarly, {B}|A denotes the restriction of {B} on the set A, i.e. the unordered set obtained if we exclude from {B} those elements that do not belong to A.

2.1 The Naming Problem for Unordered Sets
Def. 1. A query q is an exact name of A, and we write q = n(A), if S(q){|A|} = A. This means that the first |A| elements of S(q) are occupied by elements of A.
Def. 2. A query q is the best upper name of A, and we write q = n+(A), if there exists m ≥ |A| such that S(q){m}|A = A, where m is the minimum possible. This means that the restriction of the set of the first m (m ≥ |A|) elements of S(q) over A is A.
Def. 3. A query q is the best lower name of A, and we write q = n−(A), if there exists 0 < m ≤ |A| such that S(q){m} ⊆ A, where m is the maximum possible. This means that the first m (m ≤ |A|) elements of S(q) are occupied by elements of A.
However, more often than not, none of these three names exists. For this reason, below we give a more relaxed definition, which generalizes both the upper and the lower name definitions.
Def. 4. A query q is a relaxed name of A, and we write q = n (A), if |S(q){m} ∩ A| = j, where m ≥ j > 0.
If m = j = |A| then q is an exact name. If m > j = |A|, then q is an upper name (the best one, if m is the least possible). If m = j < |A|, then q is a lower name (the best one, if m is the greatest possible).

2.2 The Naming Problem for Ordered Sets
Here we elaborate on the naming problem for ordered sets. Specifically, we consider weakly ordered sets, as this captures both sets and linearly ordered sets. We use ⟨o1, o2, o3⟩ to denote a linearly ordered answer where o1 is the most relevant object, and ⟨o1, {o2, o3}, o4⟩ to denote a weakly ordered answer where o2 and o3 are equally relevant (e.g. they have the same degree of relevance).
Def. 5. Let B be a weakly ordered set. A query q is:
– an exact name of B (and we write q = n(B)), if S(q)(|{B}|) = B.
– the best upper name of B (and we write q = n+(B)), if S(q)(m)|{B} = B, where m is the minimum possible.
– the best lower name of B (and we write q = n−(B)), if S(q)(m) = B(m), where m is the maximum possible (and m > 0).
– a relaxed name of B (and we write q = n (B)), if S(q)(j)|{B(m)} = B(m), where m is the maximum and j is the minimum possible (and m > 0).
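The unordered-set definitions can be checked mechanically once the ranked answer S(q) is available. The following is a small illustrative sketch for Defs. 1-3; the list-of-ids representation of S(q) is an assumption made for the example.

```python
def name_type(ranked_answer, A):
    """ranked_answer: S(q) as a ranked list of object ids; A: the target set."""
    A = set(A)
    exact = set(ranked_answer[:len(A)]) == A                # Def. 1: S(q){|A|} = A
    # Def. 2: smallest m such that S(q){m}|A = A
    upper_m = next((i + 1 for i in range(len(ranked_answer))
                    if A <= set(ranked_answer[:i + 1])), None)
    # Def. 3: largest m such that S(q){m} ⊆ A (m = 0 means no lower name)
    lower_m = 0
    while lower_m < len(ranked_answer) and ranked_answer[lower_m] in A:
        lower_m += 1
    return {"exact": exact, "upper_m": upper_m, "lower_m": lower_m}
```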
3 Naming Functions for the Vector Space Model
The Vector Space Model (VSM) is one of the classical models for information retrieval over text collections and has been recently used for information retrieval over XML documents [2] and ontology-based sources [3]. Let T = {t1, . . . , tk} be the set of different words that appear in the documents of the collection O (we ignore stemming, stopwords). According to the VSM, the degree of relevance equals the cosine of the angle between the vectors q and oj.

3.1 Naming Functions for Unordered Sets
As one can imagine, it is very difficult to find the upper name for a given set of objects assuming the classical index structures that IR systems and Web search engines currently use (i.e. inverted files). However, several methods could be used to find an approximate upper name, the most well known example being the best query of the Rocchio method [6]. To be more specific, below we list a number of conditions that a candidate query q could satisfy: (a) it minimizes the maximum distance between q and an element of A, (b) it minimizes the average distance between q and the elements of A, (c) it is the result of subtracting from a query q1 that satisfies (b) a query q2 that minimizes the average distance between itself and all those elements not in A. A query that satisfies (b) is q = (1/|A|) Σ_{d∈A} d. A query that satisfies (c) is q = (1/|A|) Σ_{d∈A} d − (1/|Obj\A|) Σ_{d∉A} d, the so-called "best query" on which the
standard Rocchio method for relevance feedback is based. A query that satisfies (a) is q = (1/2)(a1 + a2), where a1, a2 are the most distant (i.e., least similar) documents of A. Note that none of these methods guarantees that the resulting query is the best upper name (as defined in Def. 2). In particular, (a) can be better than (b), as depicted in Figure 1 (left). In this case, A comprises 3 elements, i.e., a1, a2 and a3. y is an element that does not belong to A. o is the query vector that method (b) yields, while u is the one that method (a) yields. Obviously, o will rank the documents in the order ⟨a2, a1, y, a3⟩, while u will rank them as ⟨a2, {a1, a3}, y⟩. Also, (a) can be better than (c), as depicted in Figure 1 (right). In that case y1, y2 are documents not in A, o is the query vector that method (c) yields, while u is the one that method (a) yields. Note that the vector q2 = y1 + y2 (i.e., Σ_{d∉A} d) has the same direction as q1 = (1/3)(a1 + a2 + a3) (i.e., (1/|A|) Σ_{d∈A} d) and thus the resulting query o will rank the documents in the order ⟨a2, a1, {y1, y2}, a3⟩, while u will rank them as ⟨y2, a2, {a1, a3}, y1⟩. The latter is a better upper name than the former according to Def. 2, because it has m=4, while the former has m=5. We should stress that the query yielded by method (a) can be evaluated more efficiently than the queries yielded by the other two methods, since it only comprises the words of the two most distant documents. Methods (b) and (c) yield a relaxed name. Specifically, and with respect to Def. 4, we can say that method (b) attempts to maximize j, irrespective of the value of m. On the other hand, method (c) attempts to maximize j and minimize m − j simultaneously. Hence, method (c) yields a better approximation of the best lower name than (b).
Fig. 1. Different methods to approximate the upper name
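The three candidate queries can be written down compactly. The sketch below is illustrative only, with documents represented as NumPy vectors; it is not code from the paper.

```python
import numpy as np

def name_query_avg(A):                     # method (b): centroid of A
    return np.mean(A, axis=0)

def name_query_rocchio(A, rest):           # method (c): Rocchio-style "best query"
    return np.mean(A, axis=0) - np.mean(rest, axis=0)

def name_query_midpoint(A):                # method (a): midpoint of the two least
    def cos(x, y):                         # similar documents of A
        return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
    a1, a2 = min(((x, y) for i, x in enumerate(A) for y in A[i + 1:]),
                 key=lambda pair: cos(*pair))
    return 0.5 * (a1 + a2)
```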
Below, we elaborate on method (a) to approximate the upper name. Let a1 and a2 be the most distant (least similar) elements of A, i.e. cos(a1, a2) = min({cos(a, b) | a, b ∈ A}). Since the measure of similarity equals the cosine between the corresponding vectors, and the maximum angle between two document vectors equals π/2 (because the vector coordinates are positive), and cosine
is monotonically decreasing (with maximum value at 0 and minimum at π2 ), it follows that the most distant documents will be those that form the bigger angle. If there exist more than one such pairs a1 , a2 , then we choose arbitrarily one of them. The two less similar vectors of A define the area(a1 , a2 ), which is the geometric space of all vectors of [0, 1]|T | such that ∀b ∈ area(a1 , a2 ), |(b, a1 )| < |(a 1 , a2 )| and |(b, a2 )| < |(a 1 , a2 )|. As one can easily see, every query q ∈ area(a1 , a2 ) ranks each document of area(a1 , a2 ) higher than either a1 or a2 . The query that minimizes the maximum distance between q and an element of A is the vector qA = 12 (a1 + a2 ) whose direction is that of the vector that corresponds to the composed force of a1 and a2 . So we can decide whether there is an exact name in two steps: (1) We find the vectors a1 and a2 , i.e.: (a1 , a2 ) = arga,a min({cos(a, a ) | a, a ∈ A}) (2) we check if there are documents in area(a1 , a2 ) that do not belong to A and how many they are. Let denote with n the number of these documents. If n = 0, then the exact name exists and it equals to qA , i.e. n(A) = qA . Otherwise, i.e. if n > 0, then there is not an exact name, because, since qA ∈ area(a1 , a2 ), it ranks each of these n documents of O \ A higher than either a1 or a2 . Clearly, the cost of Step (1) is O(|A|2 ), while the second step of the algorithm costs O(|O|) time, since we need to iterate over the whole collection in the worst case. However, in most cases (and with the availability of an index) we can just evaluate the query qA and see if the first |A| elements of the answer is A. Let’s now consider the case where there is not an exact name. In that case we would like to find an upper name and a lower name (ideally the best ones). Note that qA (as defined earlier) will not rank first the elements of A, however it will rank the elements of A quite high. Specifically, it will be S(qA ){|A| + n}|A = A, i.e. qA is an upper name of A. One can easily see that qA is definitely the best upper name because for any other query q, if S(q){j}|A = A then certainly j ≥ |A| + n. Finding the best lower name means finding a query q that ranks first the biggest number of elements of A, i.e. finding the maximum m (m ≤ |A|) such that S(q){m} ⊆ A. One expensive solution that would certainly yield the lower name would be to find the biggest subset of A that has an exact name (using the method presented earlier). The time complexity of this method is exponential with respect to |A|. Another method is presented next. For each pair b, b ∈ (O \A) area(a1 , a2 ) we first investigate the area(b, b ): we want only elements of A to be there and if this is the case we compute their number. At the end we select the pair that gave the bigger number and consider as lower name the ”composed force” of that pair1 . The time complexity of this method is O(|O|3 ), since all possible pairs of 1
¹ If there were an element not in A, then the result of this query would contain it.
(b, b′) are |O|^2 in number, and for each such pair the method iterates over the whole collection in the worst case. Again, a multidimensional index would greatly reduce the cost of computing (O \ A) ∩ area(a1, a2), as well as the cost of computing area(b, b′).
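As an illustration of the two-step procedure described above, the following Python sketch finds the two most distant documents of A, forms qA = 1/2 (a1 + a2), and checks by query evaluation whether it is an exact or an (best) upper name. The vector representation, the tie handling, and the helper names are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def name_query(collection, A_ids):
    """collection: dict doc_id -> term-weight vector; A_ids: ids of the answer set A."""
    A = [collection[d] for d in A_ids]
    # Step (1): find the two most distant (least similar) documents of A -- O(|A|^2) pairs
    _, i, j = min((cos(A[p], A[q]), p, q)
                  for p in range(len(A)) for q in range(p + 1, len(A)))
    q_A = 0.5 * (A[i] + A[j])          # direction of the "composed force" of a1 and a2
    # Step (2): evaluate q_A; if its top-|A| answers are exactly A, it is an exact name,
    # otherwise it is an upper name of A (ties are ignored in this sketch).
    ranking = sorted(collection, key=lambda d: cos(q_A, collection[d]), reverse=True)
    exact = set(ranking[:len(A_ids)]) == set(A_ids)
    return q_A, exact
```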
3.2 Naming Functions for Ordered Sets
A common subproblem implied by all variations of the naming problem is to investigate how many documents of B are topologically located in the "right" order. For instance, if B = a1, a2, a3 and a3 rests in area(a1, a2), then no exact name exists for B. Let ai be the i-th in order document of B (a1 stands for the most relevant and a_|B| for the least relevant document of B), i.e. B = a1, ..., a_|B|. It is evident that a relaxed name exists if ∃i s.t. 2 ≤ i ≤ |B| and sim(a1, a2) ≥ ... ≥ sim(a1, ai). In that case a1 is the relaxed name of B. If i = |B| then a1 is an upper or an exact name of B. Specifically, it is an exact name of B if S(a1)(|B|) = B, otherwise it is an upper name of it. Finally, if i < |B| and S(a1)(i) = B(i), then a1 is a lower name of B. The algorithm LinearlySorted, shown below, returns the maximum i. The complexity of Alg. LinearlySorted is O(|B|). To check whether a1 is a lower, upper or exact name, a query evaluation step is needed. If B is not a linearly ordered but a weakly ordered set, e.g. {a1, a2}, a3, then we treat the class of the most highly ranked elements in B (in our example the set {a1, a2}) as we did in Section 3.1, i.e. we try to find an exact name of it. This results in a query q{a1,a2} that is either an exact or an upper name. Then, we replace in B the set {a1, a2} with q{a1,a2} and use the algorithm LinearlySorted with some slight modifications. Specifically, we view B as a linear order of sets, i.e. B = c1, ..., cj, where ci ⊆ O, and let |B| = j. The variation of the algorithm LinearlySorted for this case is the algorithm WeaklySorted shown below.

Alg. LinearlySorted
Input: B = a1, ..., a_|B|
(1) M := 1
(2) for i = 2 to |B| do
(3)   if sim(a1, ai) > M
(4)   then return i-1
(5)   else M := sim(a1, ai)
(6) return i

Alg. WeaklySorted
Input: B = c1, ..., cj
(1) M := 1
(2) for i = 2 to |B| do
(3)   X := max({sim(n(c1), ak) | ak ∈ ci})
(4)   if X > M then return i-1
(5)   else M := X
(6) return i
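For reference, a direct Python transcription of Alg. LinearlySorted follows (the WeaklySorted variant only replaces the similarity in step (3) with the maximum over the class ci). The sim argument is assumed to be the cosine similarity used throughout the paper.

```python
def linearly_sorted(B, sim):
    """B: documents in the desired (decreasing-relevance) order; sim(a, b): similarity.
    Returns the largest i such that sim(B[0], B[1]) >= ... >= sim(B[0], B[i-1])."""
    M = 1.0                                  # cosine similarity is at most 1
    for i in range(2, len(B) + 1):           # positions 2..|B|
        x = sim(B[0], B[i - 1])
        if x > M:                            # order violated: similarity increased again
            return i - 1
        M = x
    return len(B)
```

If the returned value equals |B|, then B[0] is an upper (or, after query evaluation, exact) name of B; otherwise it can only be a lower name of the corresponding prefix of B.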
4 Experimental Evaluation
We have implemented naming functions on top of the experimental Web Search Engine, GRoogle (http://groogle.csd.uoc.gr:8080/groogle/), in order to investigate whether our approach can be applied to large collections. Experiments were carried out on a "normal" PC (Pentium IV 3.2GHz, 512MB RAM, Suse Linux v9.1). We used a collection crawled from www.csd.uoc.gr and www.ics.forth.gr. We considered two subsets of it, one consisting of 1,000 documents and 40,000 terms and another consisting of 5,000 documents and 250,000 terms. In order to
select the documents that play the role of the answer set A (or B for ordered sets), we formulated a random query and defined A and B using the first 10/100 documents. We repeated our experiment 10 times and computed the average times. Recall that the computation of naming functions comprises two steps: a) find the name query, b) find what kind of name (upper/exact) this query is. The second step requires evaluating the name query, and clearly this depends on the underlying IR system. Since the name computed at the first step is certainly an upper name for the case of unordered sets (and a relaxed name for the case of ordered sets), it is reasonable to measure the first step separately. Also, note that for ordered sets we always consider the first document of B as name and thus the cost of the first step is negligible. Table 1 reports the execution times (ta denotes the time to find the query, tb the time to evaluate it and tc the time to identify what kind of name the query is). Recall that T denotes the vocabulary of the collection. The main conclusion drawn is that the most costly task is to evaluate the name query in order to decide whether it is an upper/exact name. For this reason, and in order to show the effect of the length of the computed name, we evaluated it once by considering only the 5 best (i.e., highest weighted) terms and once more with the 10 best terms. Issues for reducing the query evaluation time are beyond the scope of this paper. From the results of the evaluation we can say that our method is efficient. Specifically, the cost of the first step depends only on the size of A and not on the collection. Concerning the effect of the vocabulary size, the cost of the first step is independent of it; more precisely it depends on the vocabulary of the documents in A (for the computation of the inner product of two vectors we only consider the non-zero coordinates of the vectors, which are usually much fewer than the vocabulary size).

Table 1. Execution times (in sec) for computing names of unordered and ordered sets

Collection            Unordered                                  Ordered
|O|   |T|   |{A}|   ta      tb (query terms)         tc       ta    tb (query terms)         tc
1K    40K   10      0.015   1.566 (5) - 3.174 (10)   0.001    0     1.878 (5) - 3.575 (10)   0.008
1K    40K   100     0.328   1.56 (5) - 3.176 (10)    0.005    0     1.624 (5) - 3.192 (10)   0.004
5K    255K  10      0.015   112.1 (5) - 262.5 (10)   0.001    0     131.2 (5) - 264.7 (10)   0.048
5K    255K  100     0.328   116.0 (5) - 251.1 (10)   0.006    0     153.4 (5) - 271.1 (10)   0.152
5 Concluding Remarks
Naming functions for taxonomy-based sources were originally proposed in [8] and were subsequently exploited for defining a flexible interaction scheme for information sources [9]. However, they were defined only for sources that return sets of objects. In this paper we formulated the naming problem for sources that return ordered sets of objects. Subsequently, we provided algorithms for computing naming functions appropriate for systems that are based on the vector space model. Somewhat related work includes [4,7,1].
Although we provided polynomial algorithms for the naming function problem, they are by no means optimal, and a lot of work has to be devoted to finding more efficient algorithms and appropriate indexing structures that would allow the application of this interaction scheme on large corpora of documents. For instance, indexes like [5] could be exploited for speeding up the computation of naming functions. Another issue for further research and experimental investigation is how we could shorten the usually long queries that are computed by the naming functions. For instance, a user might prefer to receive only short queries, e.g. queries that contain at most k terms. One method to address this requirement would be to reformulate the naming problem by introducing one additional parameter (i.e. the maximum number of terms desired). An alternative approach (followed in Section 4) would be to present to the user only the best k terms of the queries that the current naming functions compute, i.e. the k terms having the highest weights. The relative evaluation of these two approaches is an open issue. Acknowledgement. This work was partially supported by the EU project CASPAR (FP6-2005-IST-033572).
References
1. R. S. Bot and Y. B. Wu. "Improving Document Representations Using Relevance Feedback: The RFA Algorithm". In Procs of the 13th ACM intern. Conf. on Information and Knowledge Management, Washington, USA, 2004.
2. Vinay Kakade and Prabhakar Raghavan. "Encoding XML in Vector Spaces". In Proceedings of the 27th European Conference on Information Retrieval, Santiago de Compostela, Spain, 2005.
3. Latifur Khan, Dennis McLeod, and Eduard Hovy. "Retrieval effectiveness of an ontology-based model for information selection". The International Journal on Very Large Data Bases, 13(1):71-85, January 2004.
4. Massimo Melucci. "Context Modeling and Discovery Using Vector Space Bases". In Proceedings of the 14th ACM international Conference on Information and Knowledge Management, Bremen, Germany, 2005.
5. G. Qian, S. Sural, Y. Gu, and S. Pramanik. "Similarity between Euclidean and cosine angle distance for nearest neighbor queries". In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 1232-1237, 2004.
6. J.J. Rocchio. "Relevance Feedback in Information Retrieval". In G. Salton, editor, The SMART Retrieval System. Prentice Hall, Englewood Cliffs, NJ, 1971.
7. Xuehua Shen, Bin Tan, and ChengXiang Zhai. "Implicit User Modeling for Personalized Search". In Proceedings of the 14th ACM international Conference on Information and Knowledge Management, Bremen, Germany, 2005.
8. Y. Tzitzikas and C. Meghini. "Ostensive Automatic Schema Mapping for Taxonomy-based Peer-to-Peer Systems". In 7th Intern. Workshop on Cooperative Information Agents, CIA-2003, pages 78-92, Helsinki, Finland, August 2003.
9. Y. Tzitzikas, C. Meghini, and N. Spyratos. "A Unified Interaction Scheme for Information Sources". J. of Intelligent Information Systems, 26(1):75-93, Jan. 2006.
Effective Use of Semantic Structure in XML Retrieval
Roelof van Zwol¹ and Tim van Loosbroek²
¹ Yahoo! Research, Barcelona, Spain
² Utrecht University, Department of Computer Science, Utrecht, The Netherlands
Abstract. The objective of XML retrieval is to return relevant XML document fragments that answer a given user information need, by exploiting the document structure. The focus in this article is on automatically deriving and using semantic XML structure to enhance the retrieval performance of XML retrieval systems. Based on a naive approach for named entity detection, we discuss how the structure of an XML document can be enriched using the Reuters 21587 news collection. Based on a retrieval performance experiment, we study the effect of the additional semantic structure on the retrieval performance of our XSee search engine for XML documents. The experiment provides some initial evidence that an XML retrieval system significantly benefits from having meaningful XML structure.
1 Introduction
With XML becoming the de-facto standard for tagging content and exchanging information over the Internet, it becomes possible to exploit the structure of a document when searching for information. However, the majority of the XML document collections available today are based on a mixture of HTML-based presentation tags and logical tags, such as chapter, section, and image. These types of tags have little to no semantic meaning to the users. The success of XML search engines not only depends on efficient and effective retrieval strategies, but also on the availability of meaningful structure in the document collections being searched, and the capability of the user to use this structure as part of his information need [11]. The focus in this article is on (1) automatically deriving semantic (∼ meaningful) structure based on named entity detection, and (2) evaluating the impact of the additional semantic structure on the retrieval performance of XML retrieval systems, and in particular on our XSee search engine [12]. INEX, the Initiative for the Evaluation of XML Retrieval [4], provides an international forum for the evaluation of XML element retrieval systems. The objective is to exploit the available structural information in documents, to implement a more focused retrieval strategy and to return document components, the so-called XML elements, instead of complete documents, in response to a user query. This has its impact on all facets of the retrieval process: query formulation, retrieval strategy, and result presentation.
Within INEX, the user information need is expressed in different formats, using content-only or content-and-structure approaches. The content-only (CO) approach reflects the keyword-based approach commonly used for information retrieval on the Internet. The content-and-structure (CAS) approach uses a combination of textual and structural clues, expressed in NEXI [8]. NEXI is the query language adopted by INEX, and is heavily based on XPath. If structural clues are provided in the information need, it is open to the system to use these clues in a strict (SCAS) or vague (VCAS) fashion. The effect of having semantic structure in both the information need and document structure is studied in a small-scale retrieval performance experiment, of which the results are presented here.
Organization. Section 2 elaborates on the motivation for the research presented in this article and formulates the two hypotheses that will be put to the test. In Section 3 our approach for named entity detection and semantic tagging of the Reuters 21587 news collection is presented. The retrieval model of our XSee search engine is presented in Section 4, to provide some insights in the underlying theories used for XML retrieval. Section 5 describes the setup and results of the experiment, while the conclusions are presented in Section 6.
2 Motivation and Hypotheses
The underlying motivation for the research described in this article is that using the structure of an XML document allows the system to retrieve relevant fragments of information more effectively. As in passage retrieval, returning only those pieces of information to the user for a given information need allows the user to work more efficiently. One advantage of XML retrieval over passage retrieval should be that the available document structure improves answering the user's information need. Recent work by [7] on using XML document structure reveals that using structural clues (hints) with a vague interpretation does not by definition aid the retrieval performance, as the user is bad at guessing where in the document the information should be found. As a consequence, we believe that the underlying document structure should be meaningful to the user, to allow him to effectively use the structure of a document in the information need. Unfortunately, this is not the case for the majority of the available XML document collections. In Figure 1 a classification of XML element types is given in three categories based on the nature of the encapsulated content. Traditionally HTML is used to present the contents of a document, allowing the author to make appealing presentations for the reader.
[Fig. 1. Types of XML structure]
Typical examples of such elements are h2, b, and i tags, which allow the author to emphasize particular phrases of text. In XML collections these tags are still used, together with a set of logical elements such as section, image, and title. Logical elements are frequently used in XML standards such as docbook and newsML to define the role of a text fragment. However, logical elements still miss the expressive power of what is defined here as semantic structure. Intuitively, semantic elements label the contents of a text. This can be a large XML fragment, such as a cultural description of a destination, or a textual phrase that identifies a person or location. The latter are typical examples of semantic elements, which can automatically be derived using named-entity detection. Semantic tagging of large XML fragments is a complex task, which is hard to perform automatically. A number of initiatives employed in the Semantic Web movement address this issue, but in general it is assumed to be a task for the author.
2.1 Hypotheses
The main objective of this article is to find evidence that having a richer semantic document structure contributes to the retrieval performance of XML retrieval systems. Therefore the following two hypotheses have been defined.
Hypothesis 1: XML retrieval systems can effectively exploit structure to increase retrieval performance.
The first hypothesis states that an XML retrieval system, in this case XSee, benefits from additional structure that is present in a document, even if the user information request does not contain structural clues. The underlying assumption is that the XML retrieval systems use a ranking mechanism that prefers small elements containing relevant information over larger elements with the same amount of relevant information. In the case of named entities, the XML retrieval systems should be able to judge such elements to be too small to be relevant units of information to be returned to the users, since they do not provide the contextual information needed to answer a user information need [6].
Hypothesis 2: XML retrieval systems can effectively exploit semantic structure to increase retrieval performance, when this structure is used by the user.
Hypothesis 2 extends Hypothesis 1 by focusing on those cases where structural clues form part of the information request.
3 Named Entity Detection and Semantic Tagging of the Reuters News Collection
In this section we’ll give a summary of our naive approach for named entity detection. A more extensive description and evaluation of the approach can be found in [9]. Furthermore, we discuss the effect of the semantic enrichment on the structure of the Reuters news collection.
Automated detection of named entities in text has been practised for a long time and many different and sophisticated approaches [3,1] are available today. Using linguistic, syntactic, or document patterns, a set of machine-learned extraction rules is normally derived that allows for the detection of named entities in a set of documents. In these cases, an annotated training set is needed to bootstrap the system. The approach used for named entity detection here is based on a combination of heuristically derived rules and pre-defined thesauri. The first step focuses on the detection of possible candidates using regular expressions that aim at finding sequences of words starting with a capital. The second step involves eliminating potential false positives, e.g. words found at the beginning of a sentence that are unlikely to be part of a named entity. Next, we use dictionaries of person, company, and organization names to categorize the named entities, as well as gazetteers for the identification of place names. In the final step of the named entity detection phase, contextual clues are used to categorize the remaining (unidentified) candidates. Examples of such contextual clues are: "... minister", "... organization ...", "... inc.", etc. Those candidates that were not successfully recognized are abandoned. Based on the derived set of named entities, the Reuters collection is extended by tagging the matching text fragments with the corresponding named-entity tags. Table 1 provides some statistics, which show how the structure and content of the Reuters 21587 collection is affected by our approach. In total four new element types have been added to the collection, using the approach for named-entity detection as described above, i.e. person, company, organization, location. From Table 1 it can be seen that this doubles the number of XML elements in the collection, and drastically reduces the average (text) size of the leaf nodes. In [9] the quality of the named-entity detection is evaluated. It is clear that this approach is far from optimal, and fails to classify a huge number of named entities, which results in a low recall. However, it proved to be an accurate approach for maintaining a high precision in classifying named entities, i.e. avoiding false positives. We believe that the latter is crucial to the success of effective use of semantic (∼ meaningful) structure in XML retrieval.

Table 1. Collection statistics for Reuters 21587

                   Original   Semantic
documents          20841      20841
XML elements       209135     427854
avg. leaf size     23.6       10.6
max. node depth    3          4
avg. node depth    2.6        3.42
unique elements    6          11
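The naive detection pipeline described in this section can be sketched as follows in Python. The dictionaries, gazetteer entries and contextual-clue patterns shown here are illustrative placeholders; the actual system uses much larger, pre-defined resources.

```python
import re

# Illustrative resources (assumptions); the real system uses full dictionaries/gazetteers.
PERSON_DICT = {"Bill Gates"}
COMPANY_DICT = {"Intel", "IBM"}
LOCATION_GAZ = {"Glasgow", "Utrecht"}
CANDIDATE = re.compile(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*")   # capitalised word sequences
COMPANY_SUFFIX = re.compile(r"^\s*(inc\.|corp\.|ltd\.)", re.I)

def detect_entities(text):
    entities = []
    for m in CANDIDATE.finditer(text):
        cand = m.group(0)
        at_sentence_start = m.start() == 0 or text[:m.start()].rstrip().endswith((".", "!", "?"))
        if at_sentence_start and " " not in cand:
            continue                                   # likely false positive, discard
        if cand in PERSON_DICT:
            entities.append((cand, "person"))
        elif cand in COMPANY_DICT:
            entities.append((cand, "company"))
        elif cand in LOCATION_GAZ:
            entities.append((cand, "location"))
        elif COMPANY_SUFFIX.match(text[m.end():]):
            entities.append((cand, "company"))         # contextual clue, e.g. "... Inc."
        # unrecognised candidates are abandoned, trading recall for precision
    return entities
```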
4 The XSee XML Search Engine
The XSee XML SEarch Engine [12] is built for participation in INEX 2006. In this section, a brief overview of the internals of the XSee system is given to
provide insight into the particulars of an XML retrieval system in general, but also as a reference for the explanation of the results of the experiment described in Section 5.
Data Structure. During the indexing phase, an inverted file is derived from the XML collection containing the triples (file, path, term). This inverted file is processed into the following data structure:
NodeTermWeight(node_id, term_weight)
TermIndex(term, stemmed_term, start, end)
Node(id, parent, name, file, path, depth, size)
NodeTermWeight is a large table that contains an entry for each term that is contained by a node. The entry stores the unique id of that node together with the termweight for that term/node combination; the table is sorted on term. The TermIndex contains an entry for each unique term, its stemmed variant, and a start/end position that specifies a continuous slice of the NodeTermWeight table, which contains the selected node/termweights for that term. The Node table contains information about each node: its id, parent node id, element name, file, the unique path to that node, the path depth, and the number of terms contained by that node, respectively.
Retrieval Model. At indexing time, a termweight for a term q and a node n is calculated as:
termweight(n, q) = tf_{n,q} / TF_q, if isleaf(n);
termweight(n, q) = tf_{n,q} / TF_q + Σ_{c ∈ children(n)} D_node · termweight(c, q), otherwise.   (1)
For each queryterm q in the query, it is determined whether this term is contained in the node. The sum over the corresponding termweights is taken and multiplied by the termfactor Fterm to the power of the number of unique query terms occuring in node n. This assures that nodes containing all the query terms will be pushed to the top of the ranking. This simple model proves to be effective in terms of retrieval performance [12], but also allows for better scalability compared to the GPX and B3-SDR models [5,11], which are based on the same principles.
5 Experiment
In this section the setup and results of the retrieval performance experiment are described.
5.1 Setup
For the experiment we have used two versions of the Reuters 21587 collection of news items: one version without the semantic markup (Original), and one version with the semantic markup (Semantic), as discussed in Section 3. Furthermore we have created a set of 15 topics that describe information needs containing named entities. For each topic a description, narrative, and three different title versions are available, as shown in Example 1.
Example 1. Topic 14
Description: Find out what the connection is between Intel and IBM.
Narrative: IBM decided to use Intel processor chips inside their PCs.
CO-title: IBM Intel
CAS-title: //par[about(., IBM) and about(., Intel)]
Sem. CAS-title: //par[about(.//company, IBM) and about(.//company, Intel)]
The CO-title only contains keywords, while the CAS-title specifies the desired element of retrieval. Finally, the Semantic CAS-title specifies that both IBM and Intel are companies. The methodology used to evaluate our ideas on effective use of semantic tagging for XML retrieval is based on the TREC evaluation procedure. We will therefore use binary relevance judgements, which are determined at XML element level, and have been gathered with TERS [10], a general purpose testbed for the evaluation of retrieval systems, which implements the blind pooling review method for collecting the judgements. If an XML element is considered relevant, all its ancestors are also by definition relevant. Instead of evaluating multiple search engines, we will only discuss the results of our XSee XML retrieval engine. For creating the assessment pools we have used an additional XML retrieval engine and selected the top 100 results of each topic into the assessment pool. This will not guarantee that the relevance judgements are complete. Therefore, and also based on the small number of topics used, we will report the binary preference (bpref) besides the commonly accepted mean average precision (MAP) and recall-precision curves. In [2] it is discussed that the binary preference is more stable in such situations, and therefore more reliable.
5.2 Results
Table 2.a shows the MAP and bpref for the CO-title based runs on both collections. For both measures, the results on the Semantic collection are better than the Original collection. In fact we found that the differences are significant using a T-test (MAP: t[15] = −3.090, p < 0.01). This means that we have found
evidence to support our first hypothesis, which stated that the XSee system can effectively exploit structure to improve retrieval performance. In more detail, the precision at document cutoff graph of Figure 2.a shows how this impacts the precision of the ranking. With the exception of the first item in the ranking, the results on the Semantic collection are always better. At present we have no plausible explanation for the dip at the top of the ranking. Investigation of the performance on the (strict) CAS-based runs is split in three categories: one CAS-title run on the Original collection, and two CAS-title runs on the Semantic collection. Of the two runs, one uses the semantic clues as part of the information request. The results of Table 2.b show that the best results are obtained with the semantic CAS-title run on the Semantic collection. Second best is the CAS-title run on the Semantic collection. In both cases we find that the results are significantly different (MAP: t[15] = −2.817, p < 0.015 and t[15] = −2.475, p < 0.030). The comparison of the CAS-title runs on both collections provides supporting evidence for our first hypothesis, as was the case for the CO-based runs. Based on the comparison of the two different CAS-title runs on the Semantic collection, we can conclude that our XSee system can effectively exploit semantic structure to increase retrieval performance, when this structure is used by the user (Hypothesis 2). Inspecting the graph that plots the precision at document cutoff values for the CAS-title runs (Figure 2.b) shows that the use of semantic clues in the information need on the Semantic collection has a particularly positive influence on the precision at the top of the ranking.

Table 2. Summary

(a) CO titles
        Original   Semantic
MAP     0.3418     0.3867
bpref   0.5558     0.5846

(b) CAS titles
        CAS title     CAS title     Semantic CAS title
        on Original   on Semantic   on Semantic
MAP     0.4624        0.4963        0.6378
bpref   0.687         0.7072        0.7182
[Fig. 2. Precision (after X elements retrieved) at document cutoff values. Panel (a) CO-titles: Reuters Original vs. Reuters Semantic. Panel (b) CAS-titles: CAS-title on Reuters Original, CAS-title on Reuters Semantic, and Semantic CAS-title on Reuters Semantic.]
6 Conclusions
In this article we have shown in which cases our XSee search engine for XML retrieval can effectively exploit the available structure. Furthermore, we have pleaded for more meaningful and semantic structure in XML documents, and have indicated how this can be achieved using named entity recognition. For this approach to be effective, we believe that aiming at a high precision in the tagging process is more crucial to the effectiveness of XML retrieval than having a high recall in terms of full named entity detection coverage. Based on a small-scale retrieval performance experiment we have found evidence that XML retrieval systems, and XSee in particular, can effectively exploit the additional structure (Hypothesis 1), as it allows for identifying and retrieving more relevant XML elements in the top of the ranking. This is especially true when the users specify semantically structured clues in the information request on the semantic Reuters collection, as stated in Hypothesis 2.
References
1. Alias-i. Lingpipe - http://www.alias-i.com/lingpipe/, 2006.
2. Chris Buckley and Ellen M. Voorhees. Retrieval evaluation with incomplete information. In SIGIR'04, pages 25-32. ACM Press, 2004.
3. M. Ciaramita and Y. Altun. Named-Entity Recognition in Novel Domains with External Lexical Knowledge. In Workshop on Advances in Structured Learning for Text and Speech Processing (NIPS 2005), 2005.
4. N. Fuhr, M. Lalmas, S. Malik, and G. Kazai. Advances in XML Information Retrieval and Evaluation, volume 3977 of LNCS. Springer-Verlag, 2006.
5. Shlomo Geva. GPX: Gardens Point XML information retrieval at INEX 2005. In Advances in XML Information Retrieval, volume 3977 of LNCS, 2006.
6. Georgina Ramirez, Thijs Westerveld, and Arjen P. de Vries. Using small XML elements to support relevance. In SIGIR'06, pages 693-694. ACM Press, 2006.
7. Andrew Trotman and Mounia Lalmas. Why Structural Hints in Queries do not Help XML-Retrieval. In SIGIR'06, Seattle, Washington, USA, August 2006. ACM Press.
8. Andrew Trotman and Börkur Sigurbjörnsson. Narrowed Extended XPath I (NEXI). In Advances in XML Information Retrieval, number 3493 in LNCS, pages 16-40, 2005.
9. Tim M. van Loosbroek. An ad hoc approach for creating a semantic enhanced document collection. Master's thesis, Department of Computer Sciences, Utrecht University, April 2006.
10. Herre van Oostendorp and Roelof van Zwol. Google's "I'm feeling lucky", truly a gamble? In Web Information Systems - WISE'04, pages 378-390, November 2004.
11. Roelof van Zwol. B3-SDR and effective use of structural hints. In Advances in XML Information Retrieval, volume 3977 of LNCS, Dagstuhl, Germany, 2006. Springer.
12. Roelof van Zwol. XSee: Structure Xposed. In Advances in XML Information Retrieval, LNCS, Dagstuhl, Germany, 2007. To appear. Springer.
Searching Documents Based on Relevance and Type
Jun Xu¹, Yunbo Cao¹, Hang Li¹, Nick Craswell², and Yalou Huang³
¹ Microsoft Research Asia, No. 49 Zhichun Road, Beijing, China
² Microsoft Research Cambridge, UK
³ Nankai University, No. 94 Weijin Road, Tianjin, China
{junxu, Yunbo.Cao, hangli, nickcr}@microsoft.com, [email protected]
Abstract. This paper extends previous work on document retrieval and document type classification, addressing the problem of 'typed search'. Specifically, given a query and a designated document type, the search system retrieves and ranks documents not only based on the relevance to the query, but also based on the likelihood of being the designated document type. The paper formalizes the problem in a general framework consisting of 'relevance model' and 'type model'. The relevance model indicates whether or not a document is relevant to a query. The type model indicates whether or not a document belongs to the designated document type. We consider three methods for combining the models: linear combination of scores, thresholding on the type score, and a hybrid of the previous two methods. We take course page search and instruction document search as examples and have conducted a series of experiments. Experimental results show our proposed approaches can significantly outperform the baseline methods.
1 Introduction
Traditionally, the document search problem can be described as follows. The user submits a query to the search system and the search system attempts to return documents that the user will find relevant. In many cases, the user not only has an idea of what 'document content' they are looking for, but also what 'type of document'. For example, sometimes users know that they want to search for information from technical papers, homepages, or instruction documents. In this paper we consider a setting for search, which we refer to as 'typed search'. In typed search, we ask the user to enter a query as usual and at the same time allow them to designate the document type which they want. Then the system returns documents that are not only relevant to the query, but also likely to be of the designated type. Assuming the user indeed will be more satisfied by documents of the requested type, a typed search system is potentially more effective than a system where the user cannot specify a document type. Web search engines like Google, MSN Search, and Yahoo! already provide search-by-type features, but usually in cases where a perfect distinction can be made between types. For example, by selecting scopes such as Image search or News search the user can specify the type of result that they require. Similarly there may be operators such as 'filetype:pdf'. Not all document types are easily specified, and in some cases it
may even be difficult for humans. In this paper we consider useful document types where it is not easy to make a perfect type classifier. A number of papers considered the need for document genre classification and the development of type classifiers [14],[15]. This paper considers the problem of typed search, assuming that a type has already been identified that is helpful to end-users and a classifier can be developed. For this specific search problem, past research was limited to search on special types of documents such as homepages [1],[2],[7],[13]. This paper aims at being a thorough investigation of this search problem. We choose two document types as examples in our experiments, namely course page search (search for course web pages of colleges) and instruction document search (search for instruction documents). We try to answer the following three questions which we think are crucial for constructing typed search systems. (1) Is it possible to develop a general framework for typed search? (2) What is the best strategy for combining the relevance information and type information? (3) Is it possible to construct typed search systems that are easily extended to different types and easily adapted to different domains? For our general framework, we propose the use of two probabilistic models for typed search: a relevance model and a type model. The former represents the relevance of documents to queries, and the latter represents the likelihood of documents being of the designated type. In our experiments we use BM25 as the relevance model and Logistic Regression as the type model. We propose three methods for combining the relevance and type models: linear combination, thresholding, and a hybrid method using both thresholding and linear combination. We found that linear combination and thresholding can work well with the default parameter settings. Hybrid can take advantage of the other two strategies and works best, but it needs parameter tuning. Our methods outperform the baselines on both instruction document search and course page search. It is also possible to conduct domain adaptation, applying a type model trained on one corpus to a separate corpus. Therefore, it is feasible to create generic typed search systems.
2 Related Work
Our work on document types is related to work on identifying and classifying documents by genre, such as [6],[10],[14],[15]. We use the term 'type' to indicate maximum generality. Our framework can be applied for any given type of document, even if not everyone would agree that it constitutes a distinct 'genre'. Some authors use 'genre' and 'type' interchangeably, defining genre as: a document type based on similarity of form and purpose [4]. Homepage search can be regarded as a specific typed search. Much research work was conducted on that issue. For instance, TREC had a task called 'home/named page finding' [2]. Many systems were developed for the task [1],[7],[13]. In homepage search, both relevance information and type information are needed in web page ranking. For example, information about the URL can be used to indicate type [13], because homepages tend to have shallow URLs ending in '/'. In [16], users have the option of specifying some concepts (e.g., a catalog, a call for papers, etc.) of interest when submitting a query.
Table 1. Two views of documents

                      Relevant   Irrelevant
Designated type       A          B
Not designated type   C          D
Our experiments apply BM25[11] and Logistic Regression[5]. BM25 was introduced as part of Okapi, a system for document retrieval with a probabilistic approach. Specifically, BM25 attempts to rank documents in order of decreasing probability of relevance. Logistic Regression (LR) is one model for classification. LR outputs probability values rather than confidence scores in classification.
3 Problem Description
The search system has a mechanism allowing users to designate the types of documents they want to search. The type of a document (or web page) represents the genre or the functional category of the document. Users can use a menu or a special search operator to designate document types. If the user knows what type of document they wish to find, they can select that type. They then type a search query as usual. The search system receives the query and the document type. It automatically retrieves and ranks documents on the basis of not only relevance but also the likelihood of being the specified type. Typed search is useful for helping users to find information. Traditional information retrieval conducts search on the basis of the relevance of documents to the query [8],[9],[12]. Similarly, typed search needs to assure that the retrieved documents are relevant to the query. However, typed search also needs to assure that the retrieved documents belong to the designated document type. Table 1 shows two views of documents. From Table 1, we see that A is the set of documents that we want to collect in typed search. By introducing types into search, one can drastically reduce the number of documents returned to users. Various document types can be considered, such as resume, blog, homepage, email, etc. For a recent example of a detailed study of document types, see [4]. Here, we assume that the search provider (e.g., librarian, search engine designer) can identify typed searches that are valuable to users, and apply our typed search approach.
4 Our Approach
4.1 General Framework
We propose a general framework for ranking in typed search. Given a query q and a document d, we rank the documents with the conditional probability Pr(r, t | q, d), where r and t take 1 or 0 as values and they denote 'relevant or not' and 'in the designated type or not'. In instruction document search, for example, t = 1 means that a document is an instruction document. In typed search, we rank documents using the probability scores of documents. Here, we assume that r and t are conditionally independent given q and d. We further assume that t is only dependent on d, not on q. Hence we have,
Pr(r, t | q, d) ≈ Pr(r | q, d) · Pr(t | q, d) ≈ Pr(r | q, d) · Pr(t | d)   (1)
We take Equation (1) as a general model for typed search. We call the two submodels Pr(r | q, d) and Pr(t | d) the 'relevance model' and the 'type model', respectively. The relevance model judges whether or not a document is relevant to the query. The type model judges whether or not a document is in the designated document type.
4.2 Relevance Model and Type Model
Given a query and a document, the relevance model outputs a relevance score. In typed search, for a given query, we create a list of <document, relevance_score> pairs using the relevance model. In this paper, we employ BM25 [11] as the relevance model. In practice, for indexing, we index the title and the body of a document separately, calculate a BM25 score for each, then linearly combine the scores. We view this combination of scores for the title and the body as the relevance_score. Given a document, the type model outputs a type score. In typed search, we create a list of <document, type_score> pairs using the type model. We take a statistical machine learning approach to constructing a type model. More specifically, given a training data set D = {x_i, y_i}_{i=1}^{n}, we construct a model Pr(y | x) that can minimize the number of errors when predicting y given x (generalization error). Here x_i ∈ X and y_i ∈ {1, −1}; x represents a document and y represents whether or not a document is a document in the designated type. When applied to a new document x, the model predicts the corresponding y and outputs the score of the prediction. In this paper, we adopt Logistic Regression [5] as our type model. Logistic Regression calculates the 'type probability' Pr(y = 1 | x) of a document. In our approach, we actually use the type_score of a document:
type_score = log( Pr(y = 1 | x) / (1 − Pr(y = 1 | x)) )   (2)
4.3 Combining Strategy
We propose three strategies for combining the scores calculated by the relevance and type models: linear combination, thresholding, and hybrid, respectively. We rank documents using the combined scores in typed search. In linear combination, we calculate ranking_score by linearly interpolating relevance_score and type_score:
ranking_score = λ · type_score + (1 − λ) · relevance_score   (3)
where λ ∈ [0, 1] is a parameter. In thresholding, we calculate ranking_score by discretizing type_score to 1 or 0 based on a predetermined threshold:
ranking_score = relevance_score if Pr(y = 1 | x) > θ; 0 otherwise   (4)
Here θ ∈ [0, 1] is a threshold. In cases where we are certain of our type_score we can apply a strict threshold, as is the case for News search on the Web (all pages that are not news are strictly filtered out). If we are less confident about our type_score we can apply a lower threshold or try the linear combination strategy. In the hybrid method, we calculate ranking_score using both linear combination and thresholding. Here λ ∈ [0, 1] and θ ∈ [0, 1] are parameters:
ranking_score = λ · type_score + (1 − λ) · relevance_score if Pr(y = 1 | x) > θ; 0 otherwise   (5)
How to determine the parameter values (λ and θ) is an issue we need to consider. In linear combination, λ can be set to 0.5 by default, since Equation (3) is equivalent to Equation (1) when λ = 0.5. In thresholding, θ can also be set to 0.5 by default, since the document is likely to belong to the type when Pr(y = 1 | x) > 0.5. In our experiments, we found that nearly the best search performances can be achieved when λ and θ are 0.5. In the hybrid method, however, λ and θ have to be tuned empirically.
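A minimal Python sketch of the three combining strategies follows. The inputs rel (the BM25-based relevance_score) and p_type (the Logistic Regression probability Pr(y = 1 | x)) are assumed to come from the two models described above; bm25_score and pr_type in the usage comment are hypothetical helper names.

```python
import math

def type_score(p_type: float) -> float:          # Equation (2): log-odds of Pr(y = 1 | x)
    return math.log(p_type / (1.0 - p_type))

def linear(rel: float, p_type: float, lam: float = 0.5) -> float:               # Equation (3)
    return lam * type_score(p_type) + (1.0 - lam) * rel

def thresholding(rel: float, p_type: float, theta: float = 0.5) -> float:       # Equation (4)
    return rel if p_type > theta else 0.0

def hybrid(rel: float, p_type: float, lam: float = 0.5, theta: float = 0.5) -> float:  # Equation (5)
    return linear(rel, p_type, lam) if p_type > theta else 0.0

# Ranking: sort documents by the combined score, e.g.
# docs.sort(key=lambda d: hybrid(bm25_score(d, query), pr_type(d)), reverse=True)
```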
5 Experiments
In our first experiment, the document type is 'course page', a page describing a course, as would be available on a university website. In the second experiment, the document type is 'instruction document', for example an online manual. As the first baseline method, we solely use the scores of BM25 to conduct ranking. This can also be seen as an extreme case of our proposed approach in which the parameter λ in Equation (3) equals 0. As the second baseline method, we solely use the scores of Logistic Regression in ranking, as explained in Section 4.2. This baseline is the other extreme case of our proposed approach, when the parameter λ in Equation (3) equals 1. As the third baseline method, we add keywords to queries and employ BM25. For course search, we combine the original query words with the keyword 'course' to generate a new query. (We tried to use other keywords, but they did not work well). For instruction document search, we combine the query with 'howto' and 'how to'. As the fourth baseline method, we use BM25 to conduct ranking, then employ rules to filter the results. As rules, we implement the major features in the type model. The third and fourth baselines are the simplest ways of combining type information and relevance information in typed search ranking. Hereafter, we denote the four baseline methods as 'BM25', 'Logistic Regression', 'BM25 + Keyword', and 'BM25 + Heuristics'. (We denote our methods as 'Combined (linear)', 'Combined (thresholding)', and 'Combined (hybrid)'.) We make use of MAP (Mean Average Precision) and MRR (Mean Reciprocal Rank) for evaluation of typed search.
5.1 Course Page Search
Data Set. We used the 'Four Universities Data Set'¹ in the experiment. The 8,282 WWW-pages in the data set were assigned to six category labels (Course, Student, Staff, Department, Project, and Other). In our case, course pages are positive examples.
¹ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
[Fig. 1. Performance of linear combination w.r.t. λ for course page search (MAP and MRR).]
[Fig. 2. Performance of thresholding w.r.t. θ for course page search (MAP and MRR).]
[Fig. 3. Performance of hybrid w.r.t. λ and θ for course page search (MAP and MRR).]
[Fig. 4. Performance of course page search (MAP and MRR for Logistic Regression, BM25, BM25 + Keyword, BM25 + Heuristics, Combined (thresholding), Combined (linear), and Combined (hybrid)).]
As search queries, we collected the course names from the web sites of the Computer Science department of CMU² and MIT Open Courseware, Electrical Engineering and Computer Science³. For each query, we retrieved at most the top 100 course web pages. The retrieved web pages were judged manually, one by one, by human annotators as to whether they are really relevant to the query. In this way, all the retrieved documents for each query got the labels A, B, or C-D as in Section 3. Our final query set contains 52 queries. On average, a query has 7.8 relevant web documents.
Experiment on Typed Search. We compared the performances of the typed search ranking methods. We randomly divided the queries into four even subsets and carried out 4-fold cross-validation. The results reported below are thus those averaged over the four trials. We tried various values for the parameters λ and θ in our methods of linear combination, thresholding and hybrid. Fig. 1, Fig. 2, and Fig. 3 show the performance curves when the parameters change. The best result is accomplished by hybrid, if we happen to know the parameter values. Fig. 4 shows the results of our proposed methods (linear combination, thresholding, and hybrid method) together with the baseline methods. Our methods perform much
² https://acis.as.cmu.edu/gale2/open/Schedule/SOCServlet?Formname=ByDept
³ http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/index.htm
0.7
0.8 0.7
Logistic Regression BM25 BM25 + Keyword BM25 + Heuristics Combined (thresholding) Combined (linear) Combined (hybrid)
0.6 0.5 0.4 0.3 0.2 0.1 0 MAP
MRR
Fig. 5. Performance of instruction search
0.6 Logistic Regression
0.5
BM25
0.4
BM25 + Keyword BM25 + Heuristics
0.3
Combined (thresholding)
0.2
Combined (linear) Combined (hybrid)
0.1 0.0 MAP
MRR
Fig. 6. Performance of domain adaptation.
better than baselines. The results indicate that the traditional information retrieval approach cannot solve the problem of ‘typed search’ well. Our approach of combing type information and relevance information is effective. It is not surprising to see that the baseline ‘BM25’ cannot work well for the task, because it is designed for search of relevant documents. As we have discussed in Section 3, typed search needs consider both relevance and document type. For a similar reason, it is also not surprising to observe that the baseline ‘Logistic Regression’ cannot achieve good result. For ‘BM25+Keyword’, it is hard to construct a query that can filter out documents that not belong to the desired type. For ‘BM25+Heuristics’, it is hard to make a combination between BM25 and rules. Thus, employing probabilistic models for both relevance and type ranking, as in our approach, appears to be a reasonable choice. 5.2 Instruction Document Search In the experiment, we investigated typed search applied to instruction document. We created a document set by crawling from the intranet of an international company. We also collected all the real queries about instruction document search from the query log of a search engine on the intranet. We created a dataset which contains 50 queries, similar to Section 5.1. 61 documents are labeled with A, 352 with B, and the others with C-D. We conducted experiment with 5-fold cross-validation. The results are reported in Fig. 5. Our methods outperform baseline methods for instruction document search. 5.3 Domain Adaptation for Instruction Document Search In the experiment, we tested whether a generic model can be constructed for typed search. Specifically, we investigated whether a type model trained on intranet can still work well on TREC W3C corpus [3]. We created a data set which contains 50 queries, similar to Section 5.1. Among the documents, 74 are labeled with A, 361 with B and others with C-D. Fig. 6 shows the results. The experimental results show that our method achieves good result on the TREC W3C corpus, although the type model is trained on a different domain.
636
J. Xu et al.
6 Conclusions
In the paper, we have studied 'typed search': search of documents based not only on relevance, but also on document type. Our typed search framework combines two models: a relevance model and a type model. We employed BM25 and Logistic Regression as the relevance model and the type model, respectively. Three methods are proposed for combining the models to obtain a final ranking score. Using course page search and instruction document search as examples, we have conducted experiments with real-world data. Experimental results indicate that our proposed approaches are effective for typed search and perform significantly better than the baselines. Experimental results also indicate that our proposed approach can perform consistently well across different domains.
References 1. Craswell, N., Hawking, D., and Robertson S.E.: Effective site finding using link anchor information. In: Proc. of the 24th SIGIR conference, New Orleans, USA. (2001), 250-257 2. Craswell, N. and Hawking, D.: Overview of the TREC-2004 Web Track. In: NIST Special Publication: 500-261, the Thirteen Text REtrieval Conference. (2004) 3. Craswell, N., Vries, A., and Soboroff, I.: Overview of the TREC-2005 Enterprise Track. The Fourteenth Text Retrieval Conference. (2005) 4. Freund, L., Toms, E.G., and Clarke, C.L.: Modeling task-genre relationships for IR in the workplace. In: Proc. of the 28th ACM SIGIR Conference, Salvador, Brazil. (2005) 5. Hastie, T., Tibshirani, R., and Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001) 6. Kessler, B., Nunberg G., and Schutze, H.: Automatic Detection of Text Genre. In: Proc. of the 35th Association for Computational Linguistics, Madrid, Spain. (1997), 32–38 7. Kraajj, W., Westerveld, T., and Hiemstra, D.: The Importance of Prior Probabilities for Entry Page Search. In: Proc. of the 25th ACM SIGIR conference. (2002) 8. Mizzaro, S: Relevance: The whole history. Journal of the American Society for Information Science 48(9): (1997), 810-832 9. Mizzaro, S: How many relevancies in information retrieval? Interacting With Computers, 10(3):305-322, Vol. 10. (1998) 321-351. 10. Rauber, A. and Müller-Kögler, A.: Integrating automatic genre analysis into digital libraries. In: Proc. of the 1st ACM/IEEE Joint Conf. on Digital Libraries. Virginia, USA. (2001), 1-10. 11. Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., and Payne, A.: Okapi at TREC4. In: Proc. of the 4th Text REtrieval Conference. (1996), 73-96 12. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill (1983). 13. Song, R., Wen, J.R., Shi, S., Xin, G., Liu, T.Y., Qin, T., Zheng, X., Zhang, J., Xue, G., and Ma, W.Y.: Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In: NIST Special Publication: 500-261, the 13th Text REtrieval Conference. (2004) 14. Voorhees, E.: Overview of the TREC 2003 Question Answering Track. In: Proc. of the 12th Annual Text Retrieval Conference. (2003). 15. Zaragoza, H., Craswell, N., Taylor, M., Saria, S., and Robertson, S.: Microsoft Cambridge at TREC 13: Web and Hard Tracks. In the 13th Text REtrieval Conference. (2004) 16. Matsuda, K. and Fukushima, T.: Task Oriented World Wide Web Retrieval by Document Type Classification. In: Proc. of the 8th CIKM. Kansas, USA. (1999)
Investigation of the Effectiveness of Cross-Media Indexing
Murat Yakıcı and Fabio Crestani
i-lab group, Department of Computer and Information Sciences, University of Strathclyde
26 Richmond Street, Livingstone Tower, G1 1XH, Glasgow, UK
{murat.yakici, fabio.crestani}@cis.strath.ac.uk
Abstract. Cross-media analysis and indexing leverage the individual potential of each indexing information provided by different modalities, such as speech, text and image, to improve the effectiveness of information retrieval and filtering in later stages. The process does not only constitute generating a merged representation of the digital content, such as MPEG-7, but also enriching it in order to help remedy the imprecision and noise introduced during the low-level analysis phases. It has been hypothesized that a system that combines different media descriptions of the same multi-modal audio-visual segment in a semantic space will perform better at retrieval and filtering time. In order to validate this hypothesis, we have developed a cross-media indexing system which utilises the Multiple Evidence approach by establishing links among the modality specific textual descriptions in order to depict topical similarity.
1 Introduction
A major challenge lies in developing representations suitable for crossing media and languages in the processes of retrieval, filtering, categorization and summarisation. The process of cross-media indexing consists of building relationships among concepts extracted from different modalities such as image, speech and text while reducing their weaknesses. A robust indexing model could reconcile and recognize the relationships among concepts identified in these individual modalities and build not only a unified representation but also enrich the descriptions. Therefore, a cross-media analysis system should aim at minimising uncertainty, imprecision and inconsistency across the indexing performed by the single modalities. The aim of the EU-IST Reveal-This (R-T) project is to design, develop, test a complete umbrella infrastructure that will integrate a whole range of information access technologies across media and languages. A critical objective of the project is to benefit from multi-modal analysis and indexing techniques that can extend the possibilities of content push and pull technologies and increase the quality of content provided. In order to address the challenges mentioned previously and leverage the potential of different modalities, we have developed a prototype cross-media indexing system as part of the project. The G. Amati, C. Carpineto, and G. Romano (Eds.): ECIR 2007, LNCS 4425, pp. 637–644, 2007. c Springer-Verlag Berlin Heidelberg 2007
prototype implements a standard way of identifying, accessing, manipulating, storing and retrieving semantic links between modalities. The prototype system constitutes the last layer of the chain of processing in the Cross-Media Analysis and Indexing subsystem of the R-T system. It utilises the Multiple Evidence approach by establishing links among the modal descriptions in order to depict topical similarity in the semantic textual space. The prototype's input is a merged representation of the high-level features that are extracted by the speech processing component (speaker turns, transcriptions, persons, topics), the text-processing component (named entities, terms, facts), the image processing component (faces, key-frames, visual features), the image and the text categorizers. The media files were automatically segmented to form stories. After this stage, the descriptions are merged. This representation is further processed by our prototype to produce a unified view of the content and enrichment. In this study, we want to know whether cross-media indexing using the Multiple Evidence approach is effective or not. We admit that cross-media indexing is not an easy task, nor can such a question be answered quickly. Therefore, we break down the question into several testable hypotheses and investigate them one by one. We state our generic hypothesis as whether the Multiple Evidence approach performs better than a baseline system, given the processed test collection, queries and graded relevance judgements. We also state specific hypotheses as follows:
1. No modality is dominant over other modalities, which means every modality has a significant contribution to effective retrieval.
2. Imprecision in the speech modality (recognition errors) can be remedied by other modalities for effective retrieval.
3. Document expansion may remedy the imprecision of the speech modality.
We have now given an overview of the cross-media indexing task, the project and the prototype system, as well as our research questions. In the rest of the paper, we describe the indexing models, the evaluation strategy that is followed for our hypotheses and the prototype, and discuss our initial results. The last section summarises key points of the paper and explores future directions.
2 Indexing Model
The prototype utilises Dempster-Shafer's Theory of Evidence [1] for establishing links among the modal descriptions in order to depict topical similarity in the textual space. The theory has been extensively studied in image retrieval [2] and structured document retrieval [3], but has never been applied in such a context. Dempster's work lays the foundations of a non-Bayesian theory of probability, which was then extended by Shafer. The theory combines two or more bodies of evidence defined within the same frame of discernment T into one body of evidence. We used the work of Lalmas [3] in a slightly different way: in our application of the model, we consider different modalities as sources of evidence.
The existence of a term t of a document d in a modality m is counted as evidence supporting the topical similarity hypothesis. Therefore, each modality is treated as a probability assignment function, also called a Basic Probability Assignment (BPA). For the details of our approach and a short review of related work, please refer to our previous works [4,5].
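The details of how the BPAs are built and combined are given in [4,5]; as a rough illustration of the underlying mechanism only, the sketch below applies Dempster's rule of combination to two hand-picked BPAs coming from a speech and an image modality over a toy frame of discernment. The function name, the frame and the mass values are illustrative assumptions, not the prototype's actual implementation.

```python
from itertools import product

def combine_dempster(m1, m2):
    """Dempster's rule of combination for two basic probability assignments (BPAs).

    m1, m2: dicts mapping frozenset(hypotheses) -> mass, each summing to 1.
    Returns the combined BPA, renormalised by the conflict mass K.
    """
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y  # mass assigned to contradictory focal elements
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Toy frame of discernment: does the story support topic "t"?
T = frozenset({"t", "not_t"})
speech_bpa = {frozenset({"t"}): 0.6, T: 0.4}   # evidence from the speech modality
image_bpa  = {frozenset({"t"}): 0.3, T: 0.7}   # weaker evidence from the image modality
print(combine_dempster(speech_bpa, image_bpa))
```

Combining the two modalities strengthens the mass assigned to the topical hypothesis compared with either source alone, which is the intuition behind the Multiple Evidence approach.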
3 Evaluation
Evaluating the accuracy and robustness of the individual processing modalities should be done in their own domain and with specially crafted test beds. However, when it comes to evaluating the performance of the merged and enriched representation produced by these processors, a different solution is needed. In an attempt to validate our arguments, and thus our prototype system, we considered the construction of a test collection using expert users. For building such a collection, one needs a set of "documents" (in our case video/radio segments, which we call stories), a set of queries, and a set of documents that have been found to be relevant to the queries (having exhaustively gone through the collection so that these are all the relevant documents contained in the collection).
3.1 Building of the Test Collection
For the purposes of the project, we collected parallel TV and radio broadcast content in the English and Greek languages, approximately 30+30 hours, covering the news (40%), politics (21%) and travel (35%) domains. The collection contains multi-modal documents of different genres, which guarantees a significant variety in modality-specific characteristics. We reserved a four-hour English subset of the content for generating the cross-media representations, queries and relevance judgements used specifically to test our cross-media indexing prototype (see Table 1). Despite the subset being relatively small, our intention was to have a data collection completely annotated with graded expert relevance judgements. Admittedly, manual annotation of every single segment is a labour- and time-intensive task, especially when using non-binary relevance judgements; this is one of the trade-offs one confronts when building a test collection. For the same reasons, we were not able to manually create transcriptions or annotate features in the different modalities. Manual transcripts and feature annotations would be valuable for making the data set complete, as they would provide an established data set for comparison with other work and for improving and evaluating individual and combined performances. Their absence led us to consider using other well-known collections that can be acquired to test some of our other hypotheses. Nevertheless, considering the characteristics of the modalities and the content of our collection, we created a set of 87 queries. Three experts were involved in the annotation task. They were asked to go through the collection exhaustively for each query, manually annotate categories and provide relevance judgements. The experts were also asked to justify their judgements when considering stories
Table 1. Description of the subset of the test collection used for testing the Cross-Media Indexing prototype

Domain                                                Medium        Duration   Stories
Politics (EU Plenary sessions and Press Conferences)  Radio/Video   01:09:58   31
News                                                  Radio/Video   00:44:26   33
Travel                                                Radio         00:28:12   24
Travel                                                Video         03:01:44   188
Video sub-total                                                     03:46:10   252
Radio sub-total                                                     00:28:12   24
Grand total                                                         04:14:23   276
as "partially relevant", to indicate how strict or lenient they had been in their judgements. Segments that were not mentioned at all were assumed to have been judged irrelevant by the annotators. For the analysis of the media files (the subset collection), we defined a processing work-flow so that every time one of the processors is improved, a new version of the collection can be regenerated. This approach enabled us to resolve dependencies amongst the processing modules and, more importantly, to test the effects of relative improvements of individual processors and their contribution to the overall enriched representation.
3.2 Retrieval Experiments
In order to assess the performance of our model, we prepared a baseline system which utilised the Vector Space Model with the tf-idf weighting scheme. Both systems were run using the same query set over the processed content, on every single modality and on multiple modalities (see Section 1 for the extracted features). Figure 1 depicts the precision vs. recall values from our experiments utilising multiple modalities.

Fig. 1. Precision vs. Recall values for the Baseline and the Dempster-Shafer systems

The results suggest that our Dempster-Shafer (DS) approach performs significantly worse than the baseline system at the high recall end; at the high precision end, the results seem to be identical. Before drawing any quick conclusions and rejecting our general hypothesis, we wanted to make sure whether the root cause of the problem was our proposed model. Despite being better than the DS model, the precision and recall values for the baseline system do not seem very high either. This may also be the result of factors such as the size of the collection or the imprecision of the individual modalities. Further analysis of our empirical results also showed that there is no positive performance contribution from modalities other than speech, on which basis we reject our first and second hypotheses. In some cases, although not significantly, it was observed that including the modalities other than speech decreased the performance of both the baseline and the DS system; in other words, running both systems on speech-only content yielded better performance. Given our set-up and experiments, we conclude that
the speech modality seems to be dominant over the other modalities. This might be due to three reasons, in addition to those mentioned above: 1) the test queries might be biased towards speech; 2) the size of the vocabulary is larger in this modality; 3) reducing every single modality to the textual space may not appropriately capture relations. In order to understand the effects of the queries (at least partially), we analysed the Mean Average Precision (MAP) values of both systems to see for which queries the performance of the prototype decreases or increases. The queries were grouped according to their domain, and we took the difference between the baseline system's MAP values and DS's. A positive value indicates that the baseline system's performance is better than DS's for the specific query; this is particularly the case for queries belonging to the travel domain. Although not conspicuous at first sight, DS performs slightly better for the politics domain. From prior knowledge we know that the acoustics of the politics domain are quite different from those of the other domains, and that the ASR module is well trained in this domain but less well in the others. It is therefore not surprising to see the speech processor doing a better job in politics than in news and travel. However, DS seems to be competitive in the news domain as well. This diversity may well be one of the reasons for not catching up with the baseline.
3.3 Discussion
Given that speech is dominant over all other modalities and there is no significant contribution from them, we returned to our third specific hypothesis. Admittedly, the ASR output (D_asr) is less accurate than manual transcripts (D_mt). As mentioned before, document expansion techniques can be employed to reasonably repair the ASR, where the expanded representation (D_asr-exp) is closer to the original speech (D_org) representation (bearing in mind that manual transcripts may themselves differ slightly from the original speech). Ideally, a retrieval model would therefore return the same document if we have an ASR representation that is closer to the original document. We formalise this assumption
as D_org ≈ D_mt > D_asr-exp > D_asr. However, we also acknowledge that document expansion models should be used carefully, as they may insert terms into the document which might move non-relevant documents higher in the ranking.
3.4 Document Expansion from the Web
Singhal et al. [6] make several assumptions when using document expansion for Spoken Document Retrieval, and advise that using a parallel corpus is beneficial for obtaining good results. Given the purposes and dimensions of our project, acquiring a parallel corpus covering the same period of time and diversity of domains is unrealistic. However, in order to apply document expansion with a corpus that might show some resemblance in topicality, we decided to use the WWW. Initially, Google search engine results were used to expand each story in the collection. Unlike running the ASR documents as queries themselves [6], we selected the top 10 representative and discriminative terms from the speech modality of each story to be used as queries. The selection of terms was done with a slightly different version of the Kullback-Leibler Divergence based algorithm applied in [7]. KL is typically used for measuring the difference between two probability distributions [8]. When applied to the problem of measuring the distance between two term distributions, as in Language Modelling [9], KL estimates the relative entropy between the probability of a term t occurring in the actual collection Θ_c (i.e. p(t|Θ_c)) and the probability of the term t occurring in the estimated Topic Language Model Θ_d (i.e. p(t|Θ_d)). KL is defined as

KL(\Theta_d \| \Theta_c) = \sum_{t \in V} p(t|\Theta_d) \log \frac{p(t|\Theta_d)}{p(t|\Theta_c)}   (1)

where

p(t|\Theta_c) = \frac{n(t, \Theta_c)}{\sum_{t \in \Theta_c} n(t, \Theta_c)}   (2)

and

p(t|\Theta_d) = \frac{n(t, d) + \alpha}{\left( \sum_{t \in d} n(t, d) \right) + \alpha}   (3)

The expression n(t, d) is the number of times term t occurs in a document d. The sparsity problem within Θ_d is handled by Laplace smoothing: a non-zero constant α is introduced to alleviate the zero probabilities [9]. The smaller the KL divergence, the closer the document is to the actual collection; a zero KL score indicates two identical distributions. In our case, we are interested in each term t's contribution to the KL score for a document d, instead of determining the difference between two term distributions. The greater the contribution to the document model, the higher the KL score will be. Therefore, for each term t the contribution is calculated as follows:

KL_d(t) = p(t|\Theta_d) \log \frac{p(t|\Theta_d)}{p(t|\Theta_c)}   (4)
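As a rough illustration of this term-selection step, the sketch below computes the per-term contribution KL_d(t) of equation (4) for one story's speech transcript against collection statistics and returns the top-ranked terms. The function name, the toy data and the way collection counts are supplied are illustrative assumptions rather than the prototype's actual implementation.

```python
import math
from collections import Counter

def top_kl_terms(doc_tokens, collection_counts, alpha=1.0, k=10):
    """Rank the terms of one document by their contribution KL_d(t) (Eq. 4).

    doc_tokens: tokens from the story's speech transcript.
    collection_counts: Counter of term frequencies over the whole collection.
    alpha: smoothing constant for the document model (Eq. 3).
    """
    doc_counts = Counter(doc_tokens)
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())

    contributions = {}
    for t, n_td in doc_counts.items():
        p_d = (n_td + alpha) / (doc_len + alpha)        # smoothed document model
        p_c = collection_counts.get(t, 0) / coll_len    # collection model (Eq. 2)
        if p_c == 0:                                    # unseen in the collection: skip
            continue
        contributions[t] = p_d * math.log(p_d / p_c)    # log base does not affect ranking
    return sorted(contributions, key=contributions.get, reverse=True)[:k]

# Example: pick query terms for one story (toy counts, not real collection statistics).
story_tokens = "parliament budget vote commission budget debate".split()
collection_counts = Counter({"parliament": 50, "budget": 20, "vote": 80,
                             "commission": 40, "debate": 60, "the": 5000})
print(top_kl_terms(story_tokens, collection_counts, k=3))
```

The returned terms would then be issued as web queries, as described next.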
The top 20 ranked terms in a document model are then ranked further according to each term's representative (generality) and discriminative (specificity) properties: the first p(t|Θ_d) factor in equation 4 determines the representativeness, and the later log component determines the discriminativeness. This approach makes it possible to measure the effectiveness of the two categories of terms in addition to our main objectives. The top ten qualifying terms are then used as queries submitted to the search engines. The contents of the first ten links were gathered, and each story's speech content was expanded using the formula suggested by [10] and applied in [6], without additional weights. We continued our experiments using the base index (BI), the base index extended by representative queries (IRQ) and the base index extended by discriminative queries (IDQ). Figure 2 depicts the precision and recall values for the baseline and DS over the BI, IRQ and IDQ indexes. Although not significantly different from the baseline, it is observed that the DS system shows some improvement over the IRQ index in the left-hand graph.
Fig. 2. The graph on the left illustrates the precision and recall values for the Baseline, DS and DS expanded with Representative and Discriminative queries. The graph on the right illustrates the precision values at various levels.
In addition to these experiments, we also indexed the ASR-only content and its expanded versions using the Terrier system in order to run eight other retrieval models [11]. We found that there is no significant performance improvement for any retrieval model over these three indexes (F=1.694, Significance=0.184). The results also suggest that none of the retrieval models performs significantly better than the others (F=0.80, Significance=0.999).
4 Conclusions and Future Work
In this paper, we presented our approach to cross-media indexing and our initial experiments. Given the available parameter space, it is too early to say
that the Dempster-Shafer approach is not working. The hypothesis of cross-media indexing and the models in use are still an open research area; there are many issues, variables and problems yet to be tackled for reliable system performance. There might be other cross-media models that can combine different modalities in their own feature spaces or equalize them in one modality such as text. We will be working on various ways to improve processes such as "cross-media" query generation and the document expansion models presented in this study. In order to further investigate the effects of document expansion models using the web, we have acquired the TDT-2 and AQUAINT collections.
References
1. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976)
2. Jose, J.M., Harper, D.J.: A retrieval mechanism for semi-structured photographic collections. In: Proceedings of DEXA, Springer-Verlag (1997)
3. Lalmas, M., Moutogianni, E.: A Dempster-Shafer indexing for the focussed retrieval of a hierarchically structured document space: Implementation and experiments on a web museum collection. In: Proceedings of RIAO, Paris, France (2000)
4. Yakici, M., Crestani, F.: Design of a cross-media indexing system. In: Proceedings of ECDL (2006) 477-480
5. Yakici, M., Crestani, F.: Design and implementation of a cross-media indexing system for the Reveal-This system. In: Proceedings of the Axmedis Conference (2006)
6. Singhal, A., Pereira, F.: Document expansion for speech retrieval. In: Proceedings of ACM SIGIR'99, Berkeley, California, United States, ACM Press (1999)
7. Baillie, M., Elsweiler, D., Nicol, E., Ruthven, I., Sweeney, S., Yakici, M., Crestani, F., Landoni, M.: University of Strathclyde: the i-labs first big day out at TREC HARD. In Voorhees, E.M., Buckland, L.M., eds.: Proceedings of the Fourteenth Text Retrieval Conference (TREC-14), NIST Special Publication (2005)
8. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
9. Xu, J., Croft, W.B.: Cluster-based language models for distributed retrieval. In: Proceedings of ACM SIGIR'99, New York, NY, USA, ACM Press (1999) 254-261
10. Rocchio, J.J.: Relevance feedback in information retrieval. In Salton, G., ed.: The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, NJ, USA, Prentice-Hall (1971) 313-323
11. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A High Performance and Scalable Information Retrieval Platform. In: Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006) (2006)
Improve Ranking by Using Image Information

Qing Yu, Shuming Shi, Zhiwei Li, Ji-Rong Wen, and Wei-Ying Ma
Microsoft Research Asia
{qingyu, shumings, zli, jrwen, wyma}@microsoft.com

Abstract. This paper explores the feasibility of including image information embedded in Web pages in relevance computation to improve search performance. In determining the ranking of Web pages against a given query, most (if not all) modern Web search engines consider two kinds of factors: text information (including title, URL, body text, anchor text, etc.) and static ranking (e.g. PageRank [1]). Although images have been widely used to help represent Web pages and carry valuable information, little work has been done to take advantage of them in computing the relevance score of a Web page given a query. We propose, in this paper, a framework to include image information in ranking functions. Preliminary experimental results show that, when image information is used properly, ranking results can be improved.

Keywords: Web search, image information, image importance, relevance.
1 Introduction

Web search engines commonly evaluate Web pages based on two factors [1]: text relevance scores and static ranks. The former describes how relevant a page is according to its text contents; static rank is a query-independent page quality measurement. Multimedia information (especially images) carries valuable information and plays an important role in representing a Web page to end users, just as the saying goes: "a picture is worth a thousand words." However, current Web search engines tend to discard multimedia information within Web pages when computing relevance scores, and therefore lose some important pieces of information from the original Web pages. In this paper, we study the problem of how to leverage image information within Web pages to improve Web search rankings from two aspects. On one side, the relevance between an image and a text query is evaluated based on automatically extracted image annotations. On the other side, we deduce an image's importance for a page by analyzing the image's content. Finally, the two aspects of an image are combined by a multi-evidence combination method.
2 Our Approach

In this section, we provide a detailed description of our approach. First, because images cannot be used directly for indexing and retrieval, it is required to annotate each image by assigning some pieces of text to it. Then, for each image, its relevance score to a query can be computed on its text description by adopting
existing document retrieval techniques. Third, we use a machine learning approach to compute each image's importance in its Web page, as different images in a Web page play different roles and therefore have different levels of importance. In the fourth step, we compute an overall image score for a Web page based on the information of all images in the page. This score is then combined, in the final step, with other evidence about the page to get the final score, which is ultimately used for ranking. In the following sections, we describe these steps in a little more detail.

2.1 Image Annotation and Relevance Score Computation

Ideally, in order to use image information for text search, we would acquire the precise equivalent text information represented by an image and then index these text segments. However, it can be difficult (if not impossible) to automatically get the precise text representation of an image. Fortunately, some research work [3][8] has shown that the ALT text, URL and surrounding text of an image can be treated as an approximate text representation of the image, as shown in Fig. 1. We have chosen to employ these text segments as the approximate image annotation. With this kind of annotation, each image can be transformed into a small structured document with multiple text fields. The precision of the approximate annotation is a key factor in the final performance of our approach: if incorrect text descriptions are added to an image, the calculated image relevance score will be biased. In [2], the authors discuss several kinds of surrounding-text extraction methods. In this paper, we adopt a rule-based DOM-tree parsing method, which focuses on dealing with table elements. Extensive engineering efforts were made, which are omitted here due to space limitations. Although not perfect, we found that the extracted surrounding text is commonly an acceptable image description.
Fig. 1. Image text fields and annotations
In the previous step, each image has been represented by a structured document with multiple text fields. Therefore, existing IR techniques for structured document retrieval can be employed to compute a relevance score for each image. A simple approach is to calculate a relevance score for each text field separately and then linearly combine the scores. Further, assuming the text fields of an image are F_i (i = 1, ..., m), the relevance score for the image can be computed as

R = \sum_{i=1}^{m} w_i \cdot R_i

where w_i is the weight of field F_i, and R_i is the relevance score of field
F_i with respect to the query. For each field, the relevance score R_i can be computed by adopting some existing IR model. We use the BM25 formula [9]:

R_i = \sum_{t \in Q} \frac{(k_1 + 1) \cdot tf(t, F_i) \cdot W(t)}{k_1 \cdot \left( (1 - b) + b \cdot \frac{fl_i}{avfl_i} \right) + tf(t, F_i)}   (1)

where tf(t, F_i) is the occurrence frequency of the term t within the specific field F_i, fl_i and avfl_i are the field length and average field length of F_i, and k_1 and b are two parameters. W(t) is the term weight, which can be computed by

W(t) = \log \frac{N - df(t) + 0.5}{df(t)}

where N is the total number of fields of all images contained in the collection and df(t) is the number of structured documents containing term t.
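As a rough sketch of this scoring scheme, the code below computes a BM25 score per annotation field and combines the fields with linear weights, following Eq. (1). The field names, weights, parameter values and statistics (document frequencies, average field lengths) are illustrative assumptions supplied by hand rather than values computed from a real index.

```python
import math

def bm25_field(query_terms, field_tokens, df, n_fields, avg_len, k1=1.2, b=0.75):
    """BM25 relevance of one text field of an image (Eq. 1)."""
    score, fl = 0.0, len(field_tokens)
    for t in query_terms:
        tf = field_tokens.count(t)
        if tf == 0:
            continue
        w_t = math.log((n_fields - df.get(t, 0) + 0.5) / max(df.get(t, 0), 1))
        score += (k1 + 1) * tf * w_t / (k1 * ((1 - b) + b * fl / avg_len) + tf)
    return score

def image_relevance(query_terms, fields, weights, df, n_fields, avg_lens):
    """Linear combination R = sum_i w_i * R_i over the image's annotation fields."""
    return sum(weights[name] * bm25_field(query_terms, tokens, df, n_fields, avg_lens[name])
               for name, tokens in fields.items())

# Hypothetical usage: three annotation fields for one image.
fields = {"alt": "red sports car".split(),
          "url": "images car racing jpg".split(),
          "surrounding": "the fastest sports car of the year".split()}
weights = {"alt": 0.5, "url": 0.2, "surrounding": 0.3}
df = {"car": 120, "sports": 45, "red": 30}
avg_lens = {"alt": 4.0, "url": 5.0, "surrounding": 12.0}
print(image_relevance(["sports", "car"], fields, weights, df, n_fields=1000, avg_lens=avg_lens))
```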
2.2 Image Importance Computation

Images within a Web page tend to be highly heterogeneous in terms of their functionality and importance [4]. Types of images include topical images (images which help to clarify the main topic of a Web page), advertising images, banners, logos, formatting images, and others. Some of them carry more detailed information than others and are therefore judged more important to the page containing them. By contrast, some images are inserted into a page only for decoration, advertisement or navigation, and are therefore relatively less important, or even harmful when used for ranking. To make the best use of image information in general Web search, it is necessary to differentiate between important and less important images. We use a real-valued score, called the image importance score, to depict the importance of an image within a Web page. We adopt a machine learning approach to automatically learn the importance score based on three kinds of features: image level, page level, and site level, as listed in Table 1.

Table 1. Features used for computing image importance scores

Feature kind   Feature name
Image level    size, width/height ratio, blurriness, contrast, colorfulness, photo vs. graphic
Page level     relative pos, relative size, relative width/height ratio
Site level     IsInnerSite, FrequencyInSite
Image level features are extracted from the image content. Features in this category include size, width/height ratio, blurriness, contrast, colorfulness, and others. These features are mainly used to represent the quality of an image. (We ignore low-level image features such as color histograms in this paper for two reasons: first, they are relatively expensive to include in the feature set, as they are typically high-dimensional; second, they are relatively less informative for our scenario.) Page level features are used to describe the relationship between an image and the page containing it, or the relationship between different images within the same page. The feature RelativePos describes the relative horizontal and vertical position of an image in a Web page; relative positions are used to achieve comparable results among different Web pages. The feature Relative size represents the percentage of the
image area covering the Web page, and the relative width/height ratio feature describes the relationship between the image shape and the page shape. Site level features are mainly used to represent the relationship between an image and the Web site containing it. IsInnerSite indicates whether the URL of the image is on the same site as its hosting page; many advertising images and other non-important images are located on different sites from the page on which they are displayed. FrequencyInSite represents the number of times the image appears in (the different Web pages of) the given site. Its functionality is similar to DF (document frequency: the number of documents containing a term within a collection) in information retrieval ranking functions. We use a machine learning method to deduce image importance, and a model is trained from a pre-labeled training sample set T. Each labeled sample (image) can be expressed as (x_{i,j}, y_{i,j}), where x_{i,j} is the extracted feature vector of image i in page j and y_{i,j} is its labeled importance (degree). We crawled 2,000 Web pages and the images contained within them (6,152 images in total). Each image is assigned an importance level, namely 0 (useless), 1 (important) or 2 (highly important). To train the model, we need to find a function f such that
\sum_{(x, y) \in T} \left( f(x_{i,j}) - y_{i,j} \right)^2

is minimized.
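The paper solves this with a linear Ranking-SVM, as stated next; purely as an illustration of the objective above, the sketch below fits a linear importance-scoring function to labeled feature vectors by ordinary least squares. The feature layout and the toy data are assumptions for the example, not the actual training set or learner.

```python
import numpy as np

def fit_linear_importance(features, labels):
    """Fit w so that sum_i (w . x_i - y_i)^2 is minimized (ordinary least squares)."""
    X = np.asarray(features, dtype=float)           # one row of features per labeled image
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # bias term
    y = np.asarray(labels, dtype=float)             # importance degrees 0, 1 or 2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_importance(w, feature_vector):
    return float(np.dot(w, np.append(np.asarray(feature_vector, dtype=float), 1.0)))

# Toy example with three features per image (e.g. relative size, relative pos, IsInnerSite).
X_train = [[0.30, 0.2, 1], [0.02, 0.9, 0], [0.25, 0.1, 1]]
y_train = [2, 0, 2]
w = fit_linear_importance(X_train, y_train)
print(predict_importance(w, [0.28, 0.15, 1]))
```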
We use a linear Ranking-SVM [7] to solve this problem.

2.3 Page's Overall Image Score

After getting an image's relevance score R and importance weight I, we can naturally compute the image's overall score as S = R · I. However, a Web page commonly contains multiple images, so a function f is needed to combine all images together to get the page's overall image score. The most straightforward way of computing the overall image score is to sum all image scores; another simple way is to use the maximal image score as the overall image score. These two straightforward methods both have apparent disadvantages. Web pages typically have varying numbers of images; if we sum scores directly, the resulting overall image score is unreasonably high for pages that contain many images. Considering page A with one image of score 1.0 and page B with 20 images of score 0.1 each, it is apparently not reasonable to assign page B an overall image score of 2.0 (= 0.1 * 20, twice page A's score). The disadvantage of the maximum method is that it loses a lot of information by only considering the most important image in each Web page. In this paper, we utilize the homogeneous evidence combination technique [10] to combine the image scores; in a little more detail, the formula given in Eq. (2) is adopted, in which c is a parameter controlling the degree of discount. After computing a page's overall image score S_img, we can simply combine it linearly with the page's text relevance score S_text and static rank S_static: S_total = λ_1 · S_img + λ_2 · S_text + λ_3 · S_static, where λ_1 + λ_2 + λ_3 = 1.
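As a rough sketch of how these scores come together, the code below computes per-image scores S = R · I, aggregates them into a page-level image score, and interpolates with the text and static scores. The paper's actual aggregation is the non-linear homogeneous evidence combination of Eq. (2), which is not reproduced here; the discounted-sum aggregator below is only a stand-in, and the weights and data are illustrative.

```python
def page_image_score(image_scores, c=0.5):
    """Aggregate per-image scores (already R * I) into one page-level score.

    Stand-in aggregator: the largest score counts fully and each further image
    is discounted by c, sitting between the pure 'max' and 'sum' strategies
    discussed above. The paper itself uses the formula of Eq. (2) from [10].
    """
    total, weight = 0.0, 1.0
    for s in sorted(image_scores, reverse=True):
        total += weight * s
        weight *= c
    return total

def final_page_score(image_scores, text_score, static_score,
                     lambdas=(0.2, 0.6, 0.2), c=0.5):
    """S_total = l1 * S_img + l2 * S_text + l3 * S_static, with l1 + l2 + l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * page_image_score(image_scores, c) + l2 * text_score + l3 * static_score

# Example: a page with three images (relevance and importance already multiplied).
print(final_page_score([0.8, 0.3, 0.05], text_score=2.1, static_score=0.4))
```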
Most image-related applications require heavy computation; however, our approach can be implemented efficiently. Among all the steps of our approach, only the image relevance score computation is done online. Web pages typically contain many images, and processing them may be costly, so we first remove noisy images before the online computation. This pre-processing step increases efficiency greatly with little impact on search quality.
3 Experiments

Experimental Setup. In order to examine our method, we crawled Web pages and their images from a large website (http://www.msn.com; we did not use a TREC dataset in our experiments because those datasets only contain the plain text of Web pages, while images were not downloaded, and even the latest dataset was generated two years ago, so many pages and their images may have been deleted and we could not re-crawl the required images). This website contains a huge number of Web pages covering plenty of information and various Web page types and styles. In total we crawled about 7M Web pages, with about 25% of the pages containing images (after removing noisy images). We randomly selected 83 queries from the query log of the site's search engine. These queries were manually classified into 13 categories. For each query, we labeled the top 30 results returned by each of the ranking algorithms tested in the following experiments, assigning each page a relevance value from 1 (meaning 'poor match') to 5 (meaning 'perfect match'). We adopted the nDCG metric [6] to measure the performance. In our experiments, the discount factor b was fixed to 2 and the gain values for levels 1 to 5 were 0, 1, 3, 7 and 15, respectively.

Overall Results. The baseline retrieval system adopts the BM25 formula to compute relevance scores on the body, title, anchor, and URL fields, and the resulting text score is linearly combined with the PageRank [1] score to get the baseline score. For our approach, the overall image score is computed and linearly combined with the baseline score. All the algorithms are tested under their optimal parameters. To effectively find approximately optimal parameter settings, we take an iterative approach: at each iteration step, we free two parameters for tuning (by grid search) and fix the other parameters. Table 2 shows the preliminary comparison of the baseline method and our method. The search performance is improved when image information is considered; such preliminary experimental results verify that image information can be useful for improving web search performance. From Figure 2, we can further observe that image information is especially useful for queries on which the baseline system does not perform well. We divided all queries into five ranges based on the performance of the baseline system. We can clearly see that the performance improvement is significant on those queries whose baseline performance is very low. In contrast, for queries on which the baseline system already behaves well, little performance improvement can be achieved (the performance even slightly decreased on queries in the [0.8, 1] range). For queries in most ranges, the performance is improved moderately by including image information.
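As a rough illustration of the evaluation metric under the settings described in the experimental setup above (gains 0, 1, 3, 7, 15 for relevance levels 1-5 and discount factor b = 2), the sketch below computes nDCG@k for one ranked list; the exact DCG variant used by [6] may differ in detail, so this formulation is an assumption of the example.

```python
import math

GAIN = {1: 0, 2: 1, 3: 3, 4: 7, 5: 15}   # gain values for relevance levels 1..5

def dcg_at_k(levels, k, b=2):
    """Discounted cumulative gain: ranks at or beyond b are discounted by log_b(rank)."""
    dcg = 0.0
    for rank, level in enumerate(levels[:k], start=1):
        gain = GAIN[level]
        dcg += gain if rank < b else gain / math.log(rank, b)
    return dcg

def ndcg_at_k(levels, k, b=2):
    ideal = sorted(levels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k, b)
    return dcg_at_k(levels, k, b) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: labeled relevance levels of the top-10 results for one query.
labels = [5, 3, 4, 1, 2, 3, 1, 1, 2, 1]
print(ndcg_at_k(labels, 3), ndcg_at_k(labels, 10))
```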
Query Categories. Queries can be divided into categories according to certain criteria, and it is not difficult to imagine that different categories of queries may see different degrees of performance improvement when image information is considered. To achieve a better understanding of this, we manually classified the queries into 13 categories (some queries belong to more than one category). The categories and the performance improvements (measured by nDCG@3) for each category are shown in Table 4. We can see that the improvements are relatively high for some categories, such as information resources, services and people, while the performance on some categories remains unchanged or becomes slightly worse. If queries could be automatically classified into categories (we must admit that this is hard to do) and assigned different combination weights for the image and baseline scores, the overall performance would be further improved.
Table 2. General Performance Comparison

nDCG   Baseline   With Image   %
@1     0.616      0.638        3.56%
@3     0.604      0.641        6.08%
@5     0.588      0.622        5.70%
@10    0.535      0.548        2.45%

Table 3. Stat. of avg importance for different labeling levels

Labeled Relevance Level   Total importance
5 (Best)                  1.022
4                         0.547
3                         0.454
2                         0.832
1                         0.397

Table 4. Performance improvements for different query categories (nDCG@3)

Category                Query #   Improve   %
Information resources   6         0.166     35.40%
Services                6         0.166     35.40%
People                  10        0.057     9.75%
Products                9         0.047     12.44%
Knowledge               7         0.042     6.63%
Geography               6         0.031     6.17%
Entertainment           14        0.028     4.45%
Companies               6         0.003     0.47%
Hobby                   5         0.000     0.00%
Holiday/Events          5         0.000     0.00%
Navigational            5         -0.021    -2.43%
Others                  13        0.000     0.00%
Alternative Techniques. As we have seen in Section 2, our approach comprises several steps. Although straightforward methodologies are used in some steps, the technologies used in other steps are crucial to achieving the reported performance gains. First, we adopted a machine learning approach to automatically compute image importance within a Web page. Fig. 3 illustrates its effect by comparison with an alternative that simply uses uniform image importance: although the "uniform importance" approach performs slightly better than the baseline, differentiating image importance brings a much greater performance improvement. In the overall image score computation step, we use the homogeneous evidence combination technique to compute a page's overall image score based on all image information on the page. Fig. 4 shows the performance of the various methods described in Section 2.3. As
expected, our approach performs the best among them. In our approach, we compute an overall image score against a given query for each page; that is, we adopt a query-dependent approach. A simpler and attractive alternative is to use image information as query-independent page quality evidence [5]. A primary precondition for doing so is that good search results should be more likely to contain high-quality images. Does this precondition actually hold? To answer this question, we provide statistics on the labeled search result data. Table 3 summarizes the mean of a page's total importance for each relevance level, which is the sum of the importance of all images within a page and can indicate whether the page contains high-quality images. We can observe from the table that, although the best search results (with label 5) are commonly more likely to contain high-quality images than those of other levels, this assumption does not hold for some other levels (e.g. the quality of images in level-2 pages is on average higher than the quality of images in level-3 and level-4 pages). Fig. 5 shows the experimental results of the above query-independent image usage. It is clear from the figure that this kind of usage is quite ineffective relative to our query-dependent approach.
Fig. 2. Performance improvement on queries with different baseline performance ranges

Fig. 3. Effect of differentiating image importance
Fig. 4. Performance comparison among various image score combination techniques

Fig. 5. The performance of using image information as query-independent evidence
4 Related Work

In the image retrieval field, some work has been done on keyword-based image indexing and search [2][3][8]. These systems extract text information related to an
image as its annotation from its host web page; images are then indexed using their text information. In content-based image retrieval [11], the contents of images are used to calculate the similarity between a query image and candidate images. Some methods have been proposed that use images as query-independent evidence of page quality, such as [5]. In contrast, our method explores the semantic information of images and assigns each image an importance score. Some works give different weights to different parts of the text on a page based on visual information such as font size, color, bold attributes or Web blocks [12]. However, these methods only weight text differently; they do not use extra information beyond the text.
5 Conclusion

The important role of images in helping to represent Web pages motivates us to utilize image information for relevance ranking. Computing a query-dependent overall image score for a Web page based on all the images embedded in it has distinct advantages. Experimental results show that when image information is used properly, ranking performance can be improved. Our approach is practical and can be easily integrated into existing Web search engines.
References
1. Brin, S., and Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW-7, 1998.
2. Cai, D., He, X., Li, Z., Ma, W.Y., and Wen, J.R.: Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information. In: Proc. of the 12th ACM International Conference on Multimedia, 2004.
3. Frankel, C., Swain, M., and Athitsos, V.: Webseer: an Image Search Engine for the World Wide Web. University of Chicago Technical Report TR96-14, 1996.
4. Hu, J., and Bagga, A.: Categorizing Images in Web Documents. SPIE Document Recognition and Retrieval (DR&R X), Santa Clara, January 2003.
5. Ivory, M.Y., and Hearst, M.A.: Improving Web Site Design. IEEE Internet Computing, v.6 n.2, p.56-63, March 2002.
6. Jarvelin, K. and Kekalainen, J.: IR evaluation methods for retrieving highly relevant documents. In: SIGIR 2000.
7. Joachims, T.: Optimizing Search Engines Using Clickthrough Data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
8. Lempel, R., and Soffer, A.: PicASHOW: Pictorial Authority Search by Hyperlinks on the Web. In: Proc. 10th Int. World Wide Web Conf., pp. 438-448, Hong Kong, China, 2001.
9. Robertson, S.E., Walker, S., and Beaulieu, M.: Okapi at TREC-7: Automatic Ad-hoc, Filtering, VLC and Filtering Tracks. In: Proceedings of TREC'99.
10. Shi, S., Song, R., and Wen, J.R.: Latent Additivity: Combining Homogeneous Evidence. Technical report MSR-TR-2006-110, Microsoft Research, August 2006.
11. Smith, J.R. and Chang, S.: VisualSeek: A Fully Automated Content-Based Image Query System. In: Proc. ACM Conf. Multimedia, ACM Press, New York, 1996.
12. Song, R., Liu, H., Wen, J.R., and Ma, W.Y.: Learning Block Importance Models for Web Pages. WWW 2004.
N-Step PageRank for Web Search

Li Zhang1, Tao Qin2, Tie-Yan Liu3, Ying Bao4, and Hang Li3

1 Department of Mathematics, Beijing Jiaotong University, Beijing, 100044, P.R. China, [email protected]
2 MSPLAB, Dept. Electronic Engineering, Tsinghua University, Beijing, 100084, P.R. China, [email protected]
3 Microsoft Research Asia, No. 49, Zhichun Road, Haidian District, Beijing, 100080, P.R. China, {tyliu,hangli}@microsoft.com
4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100080, P.R. China, [email protected]
Abstract. PageRank has been widely used to measure the importance of web pages based on their interconnections in the web graph. Mathematically speaking, PageRank can be explained using a Markov random walk model, in which only the direct outlinks of a page contribute to its transition probability. In this paper, we propose improving the PageRank algorithm by looking N steps ahead when constructing the transition probability matrix. The motivation comes from the similar "looking N steps ahead" strategy that is successfully used in computer chess. Specifically, we assume that if the random surfer knows the N-step outlinks of each web page, he/she can make a better decision on which page to navigate to next. It is clear that the classical PageRank algorithm is a special case of our proposed N-step PageRank method. Experimental results on the dataset of the TREC Web track show that our proposed algorithm can boost the search accuracy of classical PageRank by more than 15% in terms of mean average precision.
1 Introduction

PageRank [6] is one of the most successful link analysis algorithms for Web search. PageRank simulates a random walk on the web graph (nodes in the graph represent web pages, and edges represent hyperlinks), and uses the stationary probability of visiting each web page to represent the importance of that page. Consider a random surfer that is visiting web page a at present. At each successive step, the surfer proceeds from page a to a web page randomly chosen from all the pages that a links to. Take Fig. 1 for instance: there are three hyperlinks starting from page a, to pages b, c and d respectively, so the surfer will visit each of these three pages with a probability of 1/3. In other words, the transition probability of this Markov random walk only depends on the information of the page that is currently being visited. To our knowledge, however, such a Markov model is not always the best choice in many real-world applications. Take the computer chess game for example. The key to
the victory of the computer "Deep Blue" [2] over a human is that it can evaluate positions many more steps ahead than a human being can in the same amount of time. That is, if one has more information, one may have more opportunities to make the right decision. Similarly, we argue that if the surfer in the PageRank model has more information than the direct outlinks (for example, how many pages can be reached by N-step jumps after choosing one of the direct outlinks), he/she may choose the next step with quite different probabilities in order to maximize his/her information gain. This is the motivation of our proposed N-step PageRank method. It is clear that N-step PageRank is a generalized version of classical PageRank.
Fig. 1. Illustration of N-step
To better elaborate on this new method, in Section 2, we will briefly introduce the classical PageRank algorithm. Then in Section 3, we will give the mathematical formulation of the N-step PageRank and prove how it can be effectively computed. Experimental results are reported in Section 4. Finally, we give the conclusions and future work in the last section.
2 PageRank Review

Since our N-step PageRank is based on PageRank, we shall introduce PageRank in this section. The directed link graph of the Web is usually modeled as G = <V, E>, where V = {1, 2, ..., n} is the set of vertices, the elements of which correspond to all the pages on the Web, and E = {<i, j> | i, j ∈ V} is the set of edges, the elements of which correspond to the links between web pages (from page i to page j). Based on this modeling, we can further define the adjacency matrix A of the link graph as follows:

a_{ij} := \begin{cases} 1, & \text{if } <i, j> \in E \\ 0, & \text{otherwise} \end{cases}   (1)

That is, if there is a hyperlink from page i to page j, a_{ij} = 1; otherwise, a_{ij} = 0.
Most link analysis algorithms, including PageRank, are based on this adjacency matrix. PageRank simulates a random walk process on the Web to calculate the importance of each web page. Suppose there is a surfer on an arbitrary page. At each step, he/she transits to one of the destination pages of the hyperlinks on the current page with probability α, or to another page chosen randomly from the graph with probability 1 − α. Normalizing each non-zero row of the adjacency matrix A by its sum, we obtain an initial transition matrix P. After dealing with zero rows (if the sum of a row is 0, the elements in this row are all given the value 1/n after normalization [5]), we obtain a probability transition matrix P̄. The above random walk behavior can then be modeled as
\bar{P} = \alpha P + (1 - \alpha) U   (2)

where U is a uniform probability transition matrix, all elements of which equal 1/n (n being the dimension of U). If we use π = (π_1, π_2, ..., π_n)^T to denote the stationary distribution of the probability transition matrix P̄, then by the ergodic theory [3] we have:

\lim_{m \to \infty} E\left\{ \frac{1}{m} \sum_{k=0}^{m-1} I_{\{\text{visiting } i \text{ at the } k\text{-th step}\}} \right\} = \lim_{m \to \infty} \frac{1}{m} \sum_{k=0}^{m-1} \bar{p}_{ji}^{(k)} := \pi_i, \quad \text{a.s.}   (3)

With the above interpretation, it is reasonable to assume that the more clicks a web page receives, the more important it is; the clicking ratio can then be interpreted as a measurement of the relative importance of web pages. Thus the stationary distribution π = (π_1, π_2, ..., π_n)^T can be regarded as the vector of rank scores of the web pages. In fact, π can be computed through the following iterative process:

\pi(t + 1) = \bar{P}^T \pi(t)   (4)

Furthermore, it has been proved that the stationary value of π corresponds to the principal eigenvector of P̄^T when P̄^T has a unique principal eigenvalue [5]. For clarity, we will denote the above PageRank algorithm [6] as classical PageRank in the following discussions.
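As a rough sketch of the classical algorithm just reviewed, the code below builds P̄ from an adjacency matrix (including the zero-row and uniform-jump handling of Eq. (2)) and runs the power iteration of Eq. (4). The damping value, the toy graph and the convergence threshold are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def classical_pagerank(A, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration pi(t+1) = P_bar^T pi(t) for the classical PageRank model."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    row_sums = A.sum(axis=1)
    # Row-normalise A; rows with no outlinks become uniform (value 1/n).
    P = np.where(row_sums[:, None] > 0, A / np.maximum(row_sums, 1)[:, None], 1.0 / n)
    P_bar = alpha * P + (1 - alpha) / n          # Eq. (2): U has all entries equal to 1/n
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = P_bar.T @ pi                    # Eq. (4)
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

# Toy graph: page a (index 0) links to b, c, d, as in Fig. 1; the remaining links are assumed.
A = [[0, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]
print(classical_pagerank(A))
```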
3 N-Step PageRank

In classical PageRank, when the surfer chooses the next web page, he/she uses only the information of the direct outlinks of the current page, i.e., chooses one of the outlink pages with equal probability. In other words, classical PageRank assumes that outlinks are non-distinguishable to the surfer. In this paper, we argue that outlinks can actually be distinguished in many respects. For example, the surfer may find more useful information or more hyperlinks to new pages after clicking one outlink than the
other. Inspired by the "look N steps ahead" strategy in computer chess, we propose using the information contained in the next N steps of surfing to represent the information capacity of an outlink, and thus to distinguish different outlinks. To make this clearer, take Fig. 1 as an example. Suppose a user is browsing web page a. According to the classical PageRank algorithm, when he selects the next web page he can only choose from b, c, and d with equal probability. In our N-step PageRank algorithm, the user has more information with which to decide which page to select; that is, he also knows the outlinks of pages b, c, and d. For the case of looking 2 steps ahead, the probabilities that he selects b, c, and d can be defined as proportional to the outlink counts of b, c, and d, that is, p_ab = 4/9, p_ac = 3/9, and p_ad = 2/9. For the case of looking 3 steps ahead, the probabilities that he selects b, c, and d are proportional to the number of web pages he can reach after the next two steps. Recursively, we conclude that for the case of looking N steps ahead, the probabilities that the surfer selects b, c, and d are proportional to the number of web pages he can reach after the next (N−1) steps.

3.1 Transition Matrix P^(N)
After the above intuitive explanation of the N-step PageRank method, we now give the expression of the corresponding transition probability matrix. For two arbitrary vertices i and j, we have

p_{ij}^{(N)} = \frac{d_j^{(N-1)} \, 1_{\{(i,j) \in e(G)\}}}{\sum_{(i,k) \in e(G)} d_k^{(N-1)}}

where d_j^{(N)} is the number of vertices reached after vertex j jumps N steps, and d^{(0)} = (1, 1, \ldots, 1)_n^T.

Theorem 1. The transition matrix of the N-step PageRank algorithm, P^(N), can be computed as follows:

P^{(N)} = (D^{(N)})^{-1} A D^{(N-1)}   (5)

where A is the adjacency matrix of the directed graph G, D^(N) is the diagonal matrix generated by the vector d^(N), d^{(N)} = A d^{(N-1)} = A^N d^{(0)}, and the elements of d^(N) are the vertex counts reached after each vertex jumps N steps. Here (D^{(N)})^{-1} denotes the extended inverse matrix: if a certain diagonal element of D^(N) is 0, the corresponding element of (D^{(N)})^{-1} is 0.

Proof: Considering

A D^{(N-1)} = \begin{bmatrix}
a_{11} d_1^{(N-1)} & a_{12} d_2^{(N-1)} & \cdots & a_{1n} d_n^{(N-1)} \\
a_{21} d_1^{(N-1)} & a_{22} d_2^{(N-1)} & \cdots & a_{2n} d_n^{(N-1)} \\
\vdots & \vdots & & \vdots \\
a_{n1} d_1^{(N-1)} & a_{n2} d_2^{(N-1)} & \cdots & a_{nn} d_n^{(N-1)}
\end{bmatrix},

we have

(D^{(N)})^{-1} A D^{(N-1)} = \begin{bmatrix}
\frac{a_{11} d_1^{(N-1)}}{d_1^{(N)}} & \frac{a_{12} d_2^{(N-1)}}{d_1^{(N)}} & \cdots & \frac{a_{1n} d_n^{(N-1)}}{d_1^{(N)}} \\
\frac{a_{21} d_1^{(N-1)}}{d_2^{(N)}} & \frac{a_{22} d_2^{(N-1)}}{d_2^{(N)}} & \cdots & \frac{a_{2n} d_n^{(N-1)}}{d_2^{(N)}} \\
\vdots & \vdots & & \vdots \\
\frac{a_{n1} d_1^{(N-1)}}{d_n^{(N)}} & \frac{a_{n2} d_2^{(N-1)}}{d_n^{(N)}} & \cdots & \frac{a_{nn} d_n^{(N-1)}}{d_n^{(N)}}
\end{bmatrix}

Note that d^{(N)} = A d^{(N-1)} and d_i^{(N)} = \sum_{j=1}^{n} a_{ij} d_j^{(N-1)}; it is then easy to verify that

p_{ij}^{(N)} = \frac{d_j^{(N-1)} \, 1_{\{(i,j) \in e(G)\}}}{\sum_{(i,k) \in e(G)} d_k^{(N-1)}} = \left[ (D^{(N)})^{-1} A D^{(N-1)} \right]_{ij}
Similar to the classical PageRank algorithm, we can obtain irreducible random matrix
P ( N ) from P ( N ) . According to the ergodic theory [3], the corresponding Markov (N ) (N ) chain has a unique stationary distribution π , and π = ( P ( N ) )T π ( N ) . We still use
π (N )
to measure the importance of web pages. We can use power iteration
method to compute the final distribution π , just like equation (5). In this sub section, we will discuss the convergence rate of the corresponding power iteration method. (N )
Theorem 2. The stationary distribution is
π ( N ) = α ( P ( N ) )T π ( N ) + (α (∑ riπ i( N ) ) + 1n (1 − α ))en , i∈D
where π (0) denotes the initial vector of the iteration, π ( N ) (t ) denotes the vector after t steps of iteration, and D is a collection of nodes, (N )
n
D = {i | ∑ pij( N ) = 0, i ∈ V } i =1
658
L. Zhang et al.
And the convergence rate is
π ( N ) − π ( N ) (t ) 1 ≤ α t π ( N ) − π ( N ) (0) 1 ≤ 2α t Proof: Omitted since the proof process is very similar to that of classical PageRank.
4 Experimental Results 4.1 Settings and Results
To compare our N-step PageRank algorithms with classical PageRank (PR) [6], we chose the topic distillation task in Web track of TREC 2003 as the benchmark. For each query, we first use BM25 [7] to get a list relevance pages. We then choose the top 2000 pages from this list, and combine the relevance score with importance score as follow: scorecombination = (1 − α ) × scoreimpor tan ce + α × scorerelevance To evaluate the search accuracy, we adopted two widely-used criteria in our experiments: mean precision at n (P@n) [8], and mean average precision (MAP) [8]. The MAPs of two PageRank algorithms under investigation are shown in Fig.2. We can see that our 2-step PageRank almost uniformly outperforms classical PageRank, which shows the effectiveness of considering multi-step hyperlink information.
2-Step PageRank
Classical PageRank
0.2 0.19 0.18 0.17 MAP
0.16 0.15 0.14 0.13 0.12 0.11 0.1 0
0.05
0.1
0.15
0.2
0.25
0.3
alpha
Fig. 2. Search accuracy comparison
We list the best performance of these two algorithms in Table 1 for further comparison. From this table, the best performance of 2-step PageRank is much better than that of classical PageRank. The MAP of 2-step PageRank is more than 2.5 points higher
than that of classical PageRank, which corresponds to over 15% relative improvement, and 2-step PageRank obtains more than 6% relative improvement over classical PageRank in terms of P@10. Note that the combination of importance score and relevance score outperforms the relevance score alone, which demonstrates the value of link analysis.
Table 1. Comparison of different ranking algorithms

Ranking methods                Best MAP   Best P@10
Relevance only                 0.1243     0.104
2-step PageRank + relevance    0.1891     0.134
PageRank + relevance           0.1639     0.126
Best result on TREC 2003 [1]   0.1543     0.128
4.2 The Influence of Loops
A real-world Web graph may contain loops. One may argue that these loops will influence the result of the N-step PageRank algorithm, and that the hyperlink structure is too complex to quantify this influence. For example, in Fig. 3, the loop b → u → v → b will cause web page b to be double counted when computing the outdegree of web page b in 3-step PageRank.
Fig. 3. Illustration of a circle in 3-step outlinks
To investigate the influence of loops, we conducted a small experiment for the case of N = 2. After eliminating loops while counting the 2-step outlinks of a web page, we found that the ranking result is almost the same as the previous result. We will investigate more complicated cases in future work.
5 Conclusions and Future Work

Inspired by computer chess, in this paper we pointed out that the random walk model in the classical PageRank algorithm can be improved by considering more
information. Specifically, we modified the transition matrix of the classical PageRank algorithm by using multi-step outlink information, and proposed the N-step PageRank algorithm. Experiments on the topic distillation task of TREC 2003 showed that the new algorithm outperforms the classical PageRank algorithm. In future work, we plan to investigate the following problems: (1) we have not given an explicit relationship between the stationary distribution of N-step PageRank and that of classical PageRank in this paper; although the transition matrices of the two algorithms do not seem very complex, they cannot be transformed into each other by elementary transformations, so it is not clear what happens to the corresponding eigenvectors. (2) As mentioned above, the "look N steps ahead" model is intended to leverage more information when computing page importance, and the proposed N-step PageRank algorithm is just a simple implementation. We will study how to make use of other information, such as website structure.
Acknowledgments

We would like to thank the anonymous reviewers for their hard work.
References
1. Craswell, N., Hawking, D.: Overview of the TREC 2003 Web Track. In: The Twelfth Text Retrieval Conference (TREC 2003), 2003.
2. Hsu, F.-h.: Behind Deep Blue. Princeton University Press, Princeton, NJ, 2002.
3. Kallenberg, O.: Foundations of Modern Probability, p. 152.
4. Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM, Vol. 46, No. 5, pp. 604-622, 1999.
5. Ng, A.Y., Zheng, A.X., and Jordan, M.I.: Link analysis, eigenvectors, and stability. In: Proc. 17th International Joint Conference on Artificial Intelligence, 2001.
6. Page, L., Brin, S., Motwani, R., and Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA, 1998.
7. Robertson, S.E.: Overview of the Okapi projects. Journal of Documentation, Vol. 53, No. 1, 1997, pp. 3-7.
8. Salton, G. and McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
Authorship Attribution Via Combination of Evidence

Ying Zhao and Phil Vines

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia
{yizhao,phil}@cs.rmit.edu.au
Abstract. Authorship attribution is the process of determining who wrote a particular document. We have found that different systems work well for particular sets of authors but not others. In this paper, we propose three authorship attribution systems, based on different ways of combining existing methodologies. All systems show better effectiveness than the state-of-the-art methods.
1 Introduction

Authorship attribution systems use stylistic elements to classify documents according to authorship. Apart from solving literary disputes, authorship attribution is being used in forensics, to identify the authorship of emails and newsgroup postings, and can be used to provide legal support [1,4]. In this paper we study the effect of combining different attribution systems in various ways. Rather than simply providing a marginal performance improvement, we have found a principled way of combining methods that leads to a statistically significant improvement in classification effectiveness. Additionally, when a number of different models all attribute a document to the same author, we can have a higher degree of confidence in the correctness of the attribution, which is likely to be important in legal situations. In our previous work, the results showed that different methods seem to provide more effective discrimination for some author pairs but not others [12]. We conjecture that some authors have stylistic habits that result in the use of one form of style markers in a manner that is more distinctive than others. In this paper we propose three systems, all based on the principle that methods using Kullback-Leibler divergence can be combined [12].
2 Background

There are two main steps involved in any authorship attribution system: feature extraction, and classification based on the extracted features. The simplest test is two-class authorship attribution, where a document is classified according to which of two pre-defined authors it is more similar to. An alternative evaluation is the somewhat harder multi-class classification problem, where a classification decision is made amongst n authors, with n > 2. In this paper we evaluate the proposed systems on both types of attribution task. The extraction of stylometric features is the key to effective authorship attribution. Both lexical markers, such as function words and punctuation symbols [2,5,11],
and more advanced grammatical markers obtained using natural language processing (NLP) [3,7,12] have been proposed. On the other hand a variety of different classification methods have been used, including support vector machines (SVMs) [5], Bayesian networks [9,11], and principal component analysis (PCA) [3,6]. In previous work we have proposed an entropy and uni-gram language model based technique for authorship attribution. The method has outperformed the state-of-the-art techniques, including SVM and Bayesian networks. The core algorithm of computation is reviewed below, followed by the description of 3 new proposed systems. All systems are an improvement of the baseline results1 achieved previously. The technique we have used involves building language models for each set of training documents and for the test document, based on a set of pre-defined features. In this work, we have experimented with four different types of features: Function words (FW), Part-of-speech tags (POS),2 Function words with the corresponding POS tags (F/P), and Function words in a first order Markov model (2Gram). Once the language models are constructed, we use Kullback-Leibler divergence to measure the difference in the models for the training documents and the individual test documents to be attributed. Kullback-Leibler divergence is used to calculate the relative entropy, thus quantifying the difference between two distributions. We incorporate language models for approximation of feature distributions into the computation of Kullback-Leibler divergence. Formally, the divergence-based model is given by:

KLD(p_d || p_C) = Σ_{c ∈ C_d} [(f_{c,d} + μ·p_B(c)) / (|d| + μ)] · log2 { [(f_{c,d} + μ·p_B(c)) / (|d| + μ)] / [(f_{c,C} + μ·p_B(c)) / (|C| + μ)] }    (1)
where p_C is the model built from a group of training texts that are of a particular author category and p_d is the single document model, with which the document is to be identified; f_{c,d} (respectively f_{c,C}) denotes the frequency of feature c in the document (author category) and |d| (respectively |C|) the corresponding total number of feature occurrences. Since a given document will not normally contain all the features of each type that we pre-defined, a smoothing technique is used to estimate the likelihood of missing features in the documents. Dirichlet Prior, also known as Bayesian Smoothing [8], is one of the most effective smoothing approaches in IR [10]. It re-allocates the probabilities using the collection model p_B(c). μ is a smoothing parameter to adjust the significance of feature c from p_B(c). In general for short documents, the background probabilities dominate, on the principle that the in-document authorial evidence is weak; as the document length grows, however, the influence of the background model diminishes. Choice of an appropriate value for μ is a tuning stage in the use of this model. The document with unknown authorship is assigned to the author model that has the smallest divergence calculated amongst all available author models.
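As a rough illustration of how this scoring can be implemented, the following Python sketch computes the smoothed divergence and picks the closest author model. It is not the authors' code: the function names, the default μ value and the assumption that the background model covers every feature seen in a document are ours.

```python
import math
from collections import Counter

def dirichlet_prob(count, total, p_background, mu=100.0):
    """Dirichlet-smoothed estimate of one feature's probability."""
    return (count + mu * p_background) / (total + mu)

def kld_score(doc_counts, author_counts, background_counts, mu=100.0):
    """KL divergence between the (smoothed) document model and one author model.
    All arguments are Counters of feature -> frequency; the background model is built
    from the whole collection, so every feature seen in a document has p_B(c) > 0."""
    doc_len = sum(doc_counts.values())
    author_len = sum(author_counts.values())
    bg_len = sum(background_counts.values())
    score = 0.0
    for feat, f_cd in doc_counts.items():      # sum over features present in the document
        p_b = background_counts[feat] / bg_len
        p_d = dirichlet_prob(f_cd, doc_len, p_b, mu)
        p_c = dirichlet_prob(author_counts.get(feat, 0), author_len, p_b, mu)
        score += p_d * math.log2(p_d / p_c)
    return score

def attribute(doc_counts, author_models, background_counts):
    """Assign the document to the author whose model has the smallest divergence."""
    return min(author_models,
               key=lambda a: kld_score(doc_counts, author_models[a], background_counts))
```

In practice μ would be tuned on held-out data, as noted above; the value 100.0 here is only a placeholder.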
3 Methodology We have previously experimented with systems that build single models using multiple feature types, but with little success [12]. It seems that the distinguishing features peculiar to given author combinations have a greater tendency to be overwhelmed by other features.
1 The baseline results related to this work have been re-presented in Table 1.
2 We use the NLTK (Natural Language ToolKit) package that is available from http://nltk.sourceforge.net/index.html
Fig. 1. The voting system (in the depicted example, attribution is made to A1 with Vote(A1) = 0.75 and Vote(A2) = 0.25)
Fig. 2. The prediction system
Fig. 3. The additive system
In this paper we propose three authorship attribution systems, all based on different ways of combining the evidence resulting from models that use a single feature type for classification.
3.1 Voting System The idea of the voting system is simple: several different methods are used to attribute a document with unknown authorship to an author. The author that gets the most votes wins. In this system, each model is built based on a different type of style markers. The framework of the voting system is depicted in Figure 1. Given several types of style markers m_1, ..., m_j, we need to build and train j feature models for each author candidate A_i. The divergence is then measured between a document model and an author model that consists of j different feature models:

Vote(A_i) = |{ M_j : i = argmin_{i'} KLD( M_j(d) || M_j(A_{i'}) ) }|    (2)

where M_j(d) and M_j(A_i) denote the models built from feature type m_j for the document and for author A_i respectively. According to Equation 2, a document with unknown authorship d gets one vote for author A_i from feature model M_j if the divergence between d and A_i for feature model M_j is the smallest amongst all author candidates. Therefore each author A_i can receive at most j votes. Vote(A_i) is the number of votes that are received by a particular author A_i (Vote(A_i) ≤ j).
Informally, the author with the most votes wins. Formally, a document d can be attributed to a particular author A_i:

d → A_i  if  Vote(A_i) / ν ≥ t    (3)

where ν ≤ j, j being the number of feature models available, and t is a threshold for making the attribution that takes a value in the range (0, 1]. The value of t depends on the number of author candidates and the number of types of feature models. Additionally, t determines the strictness of the voting system: the bigger the value of t, the more strict the system is. Using binary authorship attribution as an example, a typical range of values for t is t > 1/2 if ν is an even number, otherwise t ≥ 1/2. The methodology of this approach also provides a principled way of merging various approaches with little modification while enhancing the attribution accuracy and the reliability of the results.
3.2 Two-Stage Model Prediction System In the previous voting system, both the training and testing samples are modeled based on individual types of style markers for each individual author. In contrast we propose a two-stage model prediction system, which is less expensive in terms of computational cost compared to that of the voting system. The outline of the two-stage model prediction system is shown in Figure 2. It is based on the observation that different types of style markers are more effective for different authors [12]. The idea of this model is to dynamically predict the best feature model for each attribution task individually, and then use the predicted feature model for attribution. Given j different types of style markers available for the system, feature models M_1, ..., M_j are built. The training documents for each author are then split into two groups. The first group is used to train each of the feature models. The second group is used as test data to determine which model has the most effective discrimination ability amongst a pair, or group, of authors. That feature model is then used to attribute the documents with unknown authorship.
3.3 Additive System The additive system uses the same modeling process as the voting system. The outline of the additive system is shown in Figure 3. For both training and test samples, full modeling is required for each author and each individual type of style markers. However the additive system is different from the voting system in that different rules are used for the final attribution. Rather than giving a unit vote by each feature model, we calculate the “size” of the vote in accordance with the magnitude of divergence that each feature model produces. For example, three models might slightly favor author A, whereas one model may favor author B. In the simple voting model, this would select author A, whereas in this model author B may be selected if the divergence measured by one feature model is small enough. In our experiments we have effectively summed the divergences produced by each feature model (by setting q_j = 1/ν, 1 ≤ j ≤ ν); however, it is possible that in a more sophisticated system we may also assign a different weight
q_j to each of the models. Formally, this model measures divergence, and attributes a document as follows:

KLD(d, A_i) = Σ_j q_j · KLD( M_j(d) || M_j(A_i) ),  where Σ_j q_j = 1    (4)

From this, the document d is attributed to author A_i if i = argmin_{i'} KLD(d, A_{i'}).
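To make the combination rules concrete, the sketch below shows the voting and additive decisions on top of per-feature-model divergence scores; the data layout (a nested dictionary scores[model][author]) and the equal default weights are illustrative assumptions rather than details taken from the paper.

```python
def vote_attribution(scores, t):
    """Voting rule: each feature model votes for its closest author;
    attribute only if the winner's share of votes reaches threshold t in (0, 1]."""
    votes = {}
    for model_scores in scores.values():
        winner = min(model_scores, key=model_scores.get)
        votes[winner] = votes.get(winner, 0) + 1
    best = max(votes, key=votes.get)
    if votes[best] / len(scores) >= t:
        return best
    return None                      # below threshold: leave the document unattributed

def additive_attribution(scores, weights=None):
    """Additive rule: attribute to the author with the smallest weighted sum of divergences."""
    models = list(scores)
    if weights is None:              # equal weights q_j = 1/len(models), as in the experiments
        weights = {j: 1.0 / len(models) for j in models}
    authors = scores[models[0]].keys()
    combined = {a: sum(weights[j] * scores[j][a] for j in models) for a in authors}
    return min(combined, key=combined.get)
```

The two-stage prediction system needs no extra combination code of this kind: it selects one feature model on held-out training data and then applies that model alone.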
4 Experiments and Results The experimental setup is consistent with our previous work [11,12]. Seven newswire authors are selected. Each author has contributed over 800 documents and is therefore considered to have a reasonable writing style. The average document length is 724 words. All experiments in this paper are based on this collection of AP newswire articles. Our baseline results are shown in Table 1. In this paper we have applied all 4 types of style markers to the three proposed systems. Therefore j = 4, and the value i refers to the number of authors that are to be differentiated in the experiments.
Voting system. We implemented the voting system as described in Figure 1. We have experimented with binary authorship attribution using both three and four different types of feature models. Thus, the threshold t is set at t ≥ 0.5 (t = 1 or t = 0.75 when voting with 4 feature models; t = 1 or t = 0.67 when voting with 3 feature models). Results are shown in Table 2, given different numbers of sample documents for training. In Table 2, Rec_voting shows the percentage of the total number of test documents that were attributed with correct authorship, and is somewhat akin to a recall measure. Since this system requires a threshold to be met before a document can be attributed, some documents remain unattributed if they do not meet the threshold. Pre_voting shows the number of documents that were correctly attributed with authorship as a percentage of those for which attribution decisions were made (i.e. that meet the threshold), and is somewhat akin to a measure of precision. It can be seen that when the best 3 feature models are used and a lower threshold t = 0.67 is set, the system performs better than the baseline in terms of the percentage of documents correctly attributed. On the other hand, if the highest threshold t = 1 is used, we can have a higher degree of confidence that the attributed documents have been attributed correctly (up to 98.7% using 4 feature models with sufficient training samples).
Table 1. The effectiveness (the percentage of test documents attributed with correct authorship) for 2, 3, 4, and 5-class attribution. The data is extracted from the AP collection [11], with function words as features, using Dirichlet smoothing.
    25 training docs            100 training docs           400 training docs
N   FW    POS   F/P   2Gram     FW    POS   F/P   2Gram     FW    POS   F/P   2Gram
2   88.2  87.9  87.3  82.0      91.7  89.5  90.7  87.1      92.8  89.6  92.1  92.6
3   81.8  82.1  81.1  72.0      86.6  83.7  85.9  79.5      88.3  84.1  88.1  88.1
4   77.1  78.0  76.7  64.9      82.7  79.8  82.2  74.3      84.9  80.4  85.2  84.8
5   73.5  74.8  73.2  59.6      79.7  76.9  79.2  70.5      82.2  77.5  82.9  82.1
Table 2. The effectiveness of the voting system on binary authorship attribution. Results are compared to the best results obtained using individual features.

             25 training docs         100 training docs        400 training docs
             4/1   4/.75  3/1  3/.67  4/1   4/.75  3/1  3/.67  4/1   4/.75  3/1  3/.67
Rec_voting   70.9  86.0   79.6 89.2   75.7  90.2   83.6 92.2   81.1  92.1   83.6 92.9
Pre_voting   95.4  91.5   93.9 89.2   97.9  94.7   96.0 92.2   98.7  95.2   95.7 92.9
Table 3. The effectiveness of using the two-stage model prediction system for 2, 3, 4, and 5-class authorship attribution tasks
N Class          Mc (# of combinations)             Mc (pre)   Acc.   Acc. (pre)
                 FW   POS   FW/POS   2GramFW
2 (/21)          6    4     4        7              9          92.3   93.5
3 (/35)          10   4     9        12             21         87.5   88.8
4 (/35)          11   3     16       9              19         83.8   85.2
5 (/21)          5    1     12       3              11         80.9   82.4
overall (/112)   33   12    40       32             60
Finally, instead of implementing a new model voting system from scratch, the methodology provides a principled way of integrating existing approaches with little modification of the approach itself. Researchers have proposed many different ways to perform authorship attribution [3,5,6,9]. However, the open challenge is the lack of benchmarks and a consistent experimental setup, which has made it difficult to compare these approaches. The proposed voting system is a plausible alternative that makes use of existing approaches as far as possible. There is no need to modify the approach itself; instead its output can be used as the input to the voting system.
Two-stage model prediction system. We implemented the two-stage model prediction system described in Section 3.2. 400 training samples are used in total so as to be comparable with other experiments; however, these are divided into two sets. As shown in Figure 2, the first set is used to train each of the feature models and the second set is used to predict the best feature model to use for that author combination. The system is evaluated with both binary and multi-class authorship attribution; results are presented in Table 3. As shown in Table 3, 9 correct predictions are made out of 21 attribution tasks in total for binary classification. In cases where the best system was not chosen, the second best system was usually chosen, and in these instances there was usually not a lot of difference between the top systems. Importantly the prediction system avoids choosing a system that is particularly bad for a given author combination. We have experimented with authorship attribution tasks from 2-class up to 5-class authorship attribution across all possible author combinations. Thus we have in total 112 attribution tasks, given 7 authors in total (C(7,2) + C(7,3) + C(7,4) + C(7,5)). The likelihood for an individual type of style markers to be the best choice for the task is 35.7% (40/112) in the experiment.
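The task count above can be verified directly with a short snippet (purely illustrative, not part of the original experiments):

```python
from math import comb

# number of author combinations for 2- to 5-class tasks with 7 authors
tasks = sum(comb(7, n) for n in (2, 3, 4, 5))
print(tasks)  # 21 + 35 + 35 + 21 = 112
```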
Table 4. The comparison between results from the additive system and the best baseline results

              25 training docs          100 training docs         400 training docs
M_selected    2     3     4     5       2     3     4     5       2     3     4     5
FW+POS        90.3  85.4  81.7  78.6    92.6  88.3  85.3  82.9    92.9  88.6  85.0  82.9
FW+F/P        88.4  82.2  77.8  74.3    91.6  86.5  82.8  79.8    92.7  88.3  85.1  82.6
FW+2Gram      87.3  80.0  74.6  70.2    92.2  86.9  82.8  79.5    95.0  91.9  89.5  87.6
POS+F/P       89.5  84.2  80.4  77.3    92.5  88.4  85.4  82.9    92.9  89.0  86.2  83.9
POS+2Gram     86.2  78.6  73.0  68.5    91.3  85.4  81.2  77.8    95.1  92.1  89.7  89.6
F/P+2Gram     87.3  80.6  75.6  71.6    91.8  86.5  82.7  79.7    95.2  92.4  90.2  88.4
-FW           88.8  83.2  79.0  75.7    92.7  88.1  84.7  81.9    95.5  92.7  90.7  89.0
-POS          88.8  82.8  78.4  74.8    92.6  87.8  84.3  81.5    95.5  92.8  90.8  89.3
-FW/POS       89.1  83.3  78.9  75.3    93.0  88.2  84.5  81.5    95.7  93.2  91.2  89.6
-2Gram        89.4  84.0  80.0  76.7    92.5  88.8  85.2  82.8    93.1  89.0  86.0  83.5
all           89.4  83.9  79.8  76.5    93.1  88.7  85.4  82.7    95.7  93.2  91.5  90.0
In contrast, by model prediction the likelihood can be increased to 53.6% (60/112). Approximately 18% improvement is achieved in terms of the probability of picking the best feature model for a particular attribution task. On the other hand, the overall attribution accuracies are also improved by the two-stage model prediction system. On average around 1.5% improvement is achieved in terms of the overall accuracy, shown as the last column in Table 3.
Additive system. Finally we implemented the additive system. We evaluated the system by varying the number of feature models and the number of available training samples that are provided to the system. Given 4 types of features in total, we provided all possible combinations of features; the numbers of combinations are C(4,2), C(4,3), and C(4,4). Results of the experiments are presented in Table 4, given different numbers of training samples. In the table a “+” indicates the combination of individual types of style markers, while “-” indicates the exclusion of that type of style markers. It can be seen that the additive system gives the best effectiveness amongst all three proposed systems. The best result is achieved by modeling all 4 types of features with 400 documents per author as the training data. 90% effectiveness is achieved for the most difficult 5 class attribution test. This is an 8% improvement on the best result achieved using any of the individual models with the same experimental setup as shown in Table 1. In addition, as mentioned earlier, these results are also far superior to those which can be obtained simply by a combination of different types of style markers into one model [12]. In addition the choice of style markers affects the effectiveness of the additive system. Given sufficient training samples (400 per author), for instance, we observed that the combination of “FW/POS” and “2GramFW” gives the best results amongst all feature model pairs, while using all types of features but “FW/POS” gives the best results of any three types of feature models. Table 5 shows the results of significance tests between the performance of the best baseline system from Table 1, and the best additive system using all feature models. The additive system performs significantly better, especially with the harder multi-class attribution tasks, for which the p-values are extremely small.
Table 5. The significance test between the best baseline results and the best additive modeling results across all attribution tasks. (400 documents for training).
               2 class   3 class     4 class     5 class
p-value        0.018     1.291e-07   2.053e-13   5.177e-13
significant    Y         Y           Y           Y
5 Conclusions We have shown in this paper that there is no one type of style markers that always works best for all attribution tasks. The suitability of the types of style markers is author dependent and task dependent. Importantly, it is shown from previous work that simply adding more features into a given model does not guarantee better results. We propose three systems in this paper that effectively make use of the advantages of different types of style markers. The results achieved by these systems, especially the additive system, are shown to be significantly more effective than any previous method based on an individual type of style markers. In addition, the voting system provides a principled way of integrating existing approaches with little modification while enhancing the attribution performance. Finally, the two-stage model prediction system is much less expensive in terms of computational cost, while providing not only higher attribution accuracy but also better choices of style markers in relation to a particular attribution task.
References 1. S. Argamon, M. Saric, and S. S. Stein. Style mining of electronic messages for multiple authorship discrimination: First results. In Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 475–480. ACM Press, 2003. 2. H. Baayen, H. V. Halteren, A. Neijt, and F. Tweedie. An experiment in authorship attribution. 6th JADT, 2002. 3. H. Baayen, H. V. Halteren, and F. Tweedie. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3):121–132, 1996. 4. C. E. Chaski. Computational stylistics in forensic author identification. In SIGIR workshop: Stylistic Analysis of Text For Information Access, August 2005. 5. J. Diederich, J. Kindermann, E. Leopold, and G. Paass. Authorship attribution with support vector machines. Applied Intelligence, 19(1-2):109–123, 2003. 6. D. I. Holmes, M. Robertson, and R. Paez. Stephen Crane and the New York Tribune: A case study in traditional and non-traditional authorship attribution. Computers and the Humanities, 35(3):315–331, 2001. 7. A. Kaster, S. Siersdorfer, and G. Weikum. Combining text and linguistic document representations for authorship attribution. In SIGIR workshop: Stylistic Analysis of Text For Information Access, August 2005. 8. D. Mackay and L. Peto. A hierarchical dirichlet language model. Nat. Lang. Eng., 1(3):289–307, 1995.
9. A. Sarkar, A. D. Roeck, and P. H. Garthwaite. Term re-occurrence measures for analyzing style. In SIGIR workshop: Stylistic Analysis of Text For Information Access, August 2005. 10. C. X. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004. 11. Y. Zhao and J. Zobel. Effective authorship attribution using function words. In Proc. 2nd AIRS Asian Information Retrieval Symposium, pages 174–190. Springer, 2005. 12. Y. Zhao, J. Zobel, and P. Vines. Using relative entropy for authorship attribution. In Proc. 3rd AIRS Asian Information Retrieval Symposium, pages 92–105. Springer, 2006.
Cross-Document Entity Tracking Roxana Angheluta and Marie-Francine Moens Katholieke Universiteit Leuven [email protected],[email protected]
1 Introduction
The main focus of the current work is to analyze useful features for linking and disambiguating person entities across documents. The more general problem of linking and disambiguating any kind of entity is known as entity detection and tracking (EDT) or noun phrase coreference resolution. EDT has applications in many important areas of information retrieval: clustering results in search engines when looking for a particular person; answering questions such as “Who was Woodward’s source in the Plame scandal?” with “senior administration official” or “Richard Armitage”; and information fusion from multiple documents. In the current work person entities are limited to names and nominal entities. We emphasize the linguistic aspect of cross-document EDT: testing novel features useful in EDT across documents, such as the syntactic and semantic characteristics of the entities. The most important class of new features are contextual features, at varying levels of detail: events, related named-entities, and local context. The validity of the features is evaluated on a corpus annotated for cross-document coreference resolution of person names and nominals, and also on a corpus annotated only for names.
2 Corpora
In order to build a dedicated corpus for the task, we downloaded and annotated articles linked by GoogleNews (http://news.google.com/), further called the GoogleNews corpus. After removing erroneous articles and cleaning the HTML source code, the final corpus contained 103 articles (statistics in Table 1, second column). Documents were parsed, parse trees were manually corrected and annotated for coreference resolution of person names both within- and across-documents.1 We also tested some of the features on the corpus mentioned in [2]2, further called the Named-entity corpus (statistics in Table 1, third column).
3 Features
Each mention is characterized by a number of features. Some of them (called indirect features) are not used directly in coreference resolution, but rather in the computation of other features.
1 For within-document coreference annotation we used WordFreak [6] and followed the guidelines defined in the Message Understanding Conference MUC7 [3].
2 Available from: http://l2r.cs.uiuc.edu/~cogcomp/Data/MIRROR/
Table 1. Corpora used in the experiments
                                                GoogleNews   Named-entity
Total number of documents                       103          294
Average document length (words)                 516          882
Total number of person mentions                 3169         4237
  Names                                         1621         4237
  Nominals                                      1548         -
Average number of person mentions per document  30.76        14.41
Total number of entities                        607          1323
Most of the indirect features were computed using the OpenNlp package [4]. They include: head noun phrase (head NP), head noun, part of speech (POS) tags, position, and type of mention (NE or nominal). Other features are used directly in cross-document coreference resolution. They are grouped in different classes as follows:
Compatibility features: In order for two mentions to match, they have to agree in gender and number. The gender is computed heuristically using lists of first names and titles and also WordNet [7] synsets. The number is detected from the POS tag of the head noun.
String matching features: We have used a normalized substring matching between the head NPs of the two mentions.
Contextual features: Various contextual features based on the redundancy of the data signal coreference relations. They are the focus of the current work and they are valid only across documents.
1. Events: If two documents speak about the same event3, entities appearing in them have higher chances to corefer. In the current experiments, for the Named-entity corpus, documents are clustered into events using a similarity function based on the vocabulary: the cosine between vectors of tf*idf weighted terms. For the GoogleNews corpus, we use the events detected automatically by Google.
2. Local context: Another contextual feature is the local context of mentions (as in [1]). In the current experiments we have used the sentence to which the mention belongs. The cosine between vectors of terms weighted with tf*idf is used as the similarity measure, binned in the interval [0-1].
3. Related named-entities (NE): At a more fine-grained level than events and local contexts, one can look at the named-entities related to the current mention. Often, a person can be fully described by the entities related to it, such as other persons, locations, and organizations. We look at named-entities that are present inside the same sentence as the mention. The similarity between two mentions is computed as the normalized number of common related named-entities.
3 Defined in the Topic Detection and Tracking challenge [5]: the same topic happening at the same time in the same place.
4 Similarity Function and Algorithm
The similarity function between mentions works as follows: compatibility features are first used as hard constraints for non-coreference, and then string matching and contextual features are linearly combined into a final similarity metric, normalized in [0,1]. The similarity function is extended to a proximity measure between coreference chains: hard constraints for mentions remain hard constraints for the chains, based on at least one pair of mentions. Additionally, 2 more constraints are imposed: 1) name compatibility, i.e. the most representative name from each chain, if it exists, should be the same; and 2) head compatibility, i.e. the chains should have at least one pair of mentions with the same head noun. The final metric for compatible chains is computed as a linear combination of the maximum similarity between pairs of mentions for strings and related named-entities, and the similarity between the concatenation of the local contexts. The clustering algorithm used is a variant of single-pass to solve the problem of dependency on the order of the input.4 For the GoogleNews corpus, we assume that within-document coreference resolution has been solved; we cluster correct coreference chains inside documents. For the Named-entity corpus we cluster mentions. We compare our results with a baseline: cluster within-document coreference chains/mentions using only local context.5
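The following Python sketch illustrates one way such a chain-level similarity could be assembled; the weights, the chain representation and the use of a chain-level named-entity overlap (instead of the maximum over mention pairs described above) are simplifying assumptions for illustration only.

```python
import math

def substring_sim(a, b):
    """Length of the longest common substring, normalized by the longer string."""
    a, b = a.lower(), b.lower()
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best / max(len(a), len(b)) if a and b else 0.0

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term-weight (e.g. tf*idf) vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    na = math.sqrt(sum(w * w for w in vec_a.values()))
    nb = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chain_similarity(chain_a, chain_b, w_string=0.5, w_context=0.3, w_ne=0.2):
    """Linear combination of string match, local-context cosine and shared named entities.

    Each chain is a dict with: 'strings' (set of head NP strings), 'heads' (set of head
    nouns), 'name' (most representative name or None), 'context' (term-weight vector of
    the concatenated local contexts), 'nes' (set of related named entities)."""
    # hard constraints: representative names and at least one shared head noun
    if chain_a['name'] and chain_b['name'] and chain_a['name'] != chain_b['name']:
        return 0.0
    if not (chain_a['heads'] & chain_b['heads']):
        return 0.0
    s_string = max((substring_sim(x, y)
                    for x in chain_a['strings'] for y in chain_b['strings']), default=0.0)
    s_context = cosine(chain_a['context'], chain_b['context'])
    union = chain_a['nes'] | chain_b['nes']
    s_ne = len(chain_a['nes'] & chain_b['nes']) / len(union) if union else 0.0
    return w_string * s_string + w_context * s_context + w_ne * s_ne
```

A single-pass style clustering would then merge two chains whenever this score exceeds the empirically tuned threshold.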
5 Evaluation and Results
Similarly to [2], we evaluated with the F-measure applied to pairs of objects which corefer: F = 2PR / (P + R), where P = Precision = percentage of correctly predicted pairs and R = Recall = percentage of correct pairs that have been predicted. The best result on the GoogleNews corpus has been obtained using string-matching, events and local context: 67.72% F-measure. The baseline got a very low result (9.52%), signaling that local context alone is not a very discriminative feature for person entity disambiguation and linking. The relative contribution of each of the non-compatibility features is represented in Table 2.6 For each combination of features we show the best run in terms of feature weights. We begin with all features and iteratively remove them one by one so that we get the best performance. The result obtained using all features is 66.37% while the result obtained using only compatibility features is 40.67%. In the Named-entity corpus we cluster mentions. We tested 2 combinations of features: string and string + events. String-matching features alone are already responsible for 78.64% of the F-measure. When adding events the results increase up to 90.38% F-measure, which is comparable with the figures reported in [2]: for persons, they obtained 87.4%, 92.7% and 90.5% F-measure, depending on the model used.
4 The threshold has been tuned empirically for the run involving only string-matching and the same value was kept for the other experiments.
5 Feature weights have been varied; multiple experiments have been run for the same combination of features. No efforts have been made to tune the threshold specifically for the baseline.
6 Compatibility features have been used in all runs except the baseline.
Table 2. F-measure obtained by clustering coreference chains on the GoogleNews corpus. Different combinations of features are considered, eliminating at each step the feature whose removal leads to best performance.

        String-matching   Local context   Events   Related NE   All
        53.69             64.22           64.51    67.72        66.37
        53.46             64.21           66.69
        46.79             63.00
None    40.67
Our results confirm that names are easy to link and to disambiguate, based only on string-matching features. Although the corpus is not really suited for event detection (documents are far apart in time), preclustering documents proved to be a useful feature. Also in this corpus, the baseline obtained a low score: 16.53% F-measure.
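For reference, the pair-based precision, recall and F-measure used in this evaluation can be computed as in the sketch below (an illustrative helper, not code from the paper):

```python
from itertools import combinations

def coreferent_pairs(clusters):
    """All unordered pairs of mentions placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def pairwise_f(predicted, gold):
    """Precision, recall and F-measure over coreferent mention pairs."""
    pred, true = coreferent_pairs(predicted), coreferent_pairs(gold)
    correct = len(pred & true)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(true) if true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy example: two predicted clusters vs. the gold clustering
print(pairwise_f([{'m1', 'm2'}, {'m3', 'm4', 'm5'}],
                 [{'m1', 'm2', 'm3'}, {'m4', 'm5'}]))   # (0.5, 0.5, 0.5)
```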
6 Conclusions
In this paper, we focused on cross-document entity tracking, or noun phrase coreference resolution, with regard to person entities, where we restricted the problem to proper name and nominal description disambiguation and linking. We have suggested a number of features useful for linking coreference chains across documents, some of which are novel, i.e., clustering documents according to events and related named-entities. We have evaluated their usefulness on two corpora. From the results, we concluded that string-matching features are the most relevant for detecting cross-document links, followed by local context, events and related named-entities. Names are the easiest to resolve, and string-matching features alone yield high results. Local contexts alone are not discriminative.
References 1. Bagga, A., Baldwin, B.: Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL’98), pp. 79-85, August 1998. 2. Li, X., Morie, P., Roth, D.: Robust Reading: Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004. 3. MUC-7 Coreference Task Definition (version 3.0). Message Understanding Conference Proceedings, 1998. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html 4. OpenNlp http://opennlp.sourceforge.net/ 5. Topic Detection and Tracking http://www.nist.gov/speech/tests/tdt/ 6. WordFreak http://wordfreak.sourceforge.net/ 7. Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database. ISBN-10: 0-262-06197-X
Enterprise People and Skill Discovery Using Tolerant Retrieval and Visualization Jan Brunnert, Omar Alonso, and Dirk Riehle Hasso Plattner Institut, Potsdam, Germany [email protected] SAP Research, Palo Alto, USA {omar.alonso, dirk.riehle}@sap.com Abstract. Understanding an enterprise’s workforce and skill-set can be seen as the key to understanding an organization’s capabilities. In today’s large organizations it has become increasingly difficult to find people that have specific skills or expertise or to explore and understand the overall picture of an organization’s portfolio of topic expertise. This article presents a case study of analyzing and visualizing such expertise with the goal of enabling human users to assess and quickly find people with a desired skill set. Our approach is based on techniques like n-grams, clustering, and visualization for improving the user search experience for people and skills.
1 Introduction Expertise identification requires data which is usually scattered among different enterprise systems, such as groupware, address books or human resources systems. Accessing this data is often difficult and not aimed at exploring organizational capabilities by topic. Search functionality is often string-based and many searches have to be submitted until the user can put together a mental picture of how different topics relate and who the relevant employees are. Enterprise people search is of critical importance: when decisions during a business workflow require an expert, when a co-worker needs help or when assisting customers, it is critical to quickly find people that have certain expertise or interests. Cohen et al. [1] pointed out that it is crucial to understand where in the company expertise resides in order to establish efficient communication. By having a visual overview of the expertise map, the knowledge quality in an organization can be assessed. As outlined by Ashrafi et al. [2], understanding the quality of knowledge in an organization can be used to sense opportunities, develop strategies and implement them effectively and efficiently. The goal in our case study was to implement a search and visualization application designed to be powerful yet easy to use. Search and retrieval capabilities have been identified by Stein and Zwass [3] as being of importance to the success of enterprise knowledge management. Because the spelling of names is not always apparent in large international organizations, we had to apply forgiving search strategies. The application provides a novel interface that improves the user experience. Skills can be browsed using visual interaction components like a tag cloud and a graph visualizing connections between tags. A secondary objective of the project was to evaluate how quickly one can bootstrap such an application using only open source components.
2 Tolerant Retrieval and Cluster Clouds Users frequently enter imprecise queries that contain spelling errors or, in the case of names, phonetic variations. By using n-grams the application returns good search results even when terms or names were misspelled or not clearly known. Providing a result set with similar terms, users can quickly identify the correct term even when they are uncertain what they are looking for and initiate browsing from this starting point. Employees usually provide very coarse-grained, generic keywords describing their area of expertise, while users will search for very fine-grained, specific skills [4]. Also, in international organizations like SAP, the spelling of names is not often clear. A meaningful tag cloud of skills is automatically generated as proposed by John and Seligman in [5].
Fig. 1. AJAX auto complete-search, user list, tag clouds, and topic-graph visualization
By clustering the tags around the most frequent terms into sets of related terms, the total number of visible keywords is reduced into fewer, more meaningful entities. These skill-clusters are automatically discovered from the supplied set of keywords. For each keyword the total number of elements in a cluster determines the size of the cluster. This clustered set can now be displayed in a tag cloud that projects the size to the keywords’ font-size, giving an intuitive view of the importance of a given skill. When drilling down, the application shows terms similar to the selected tag, in effect showing the keywords that are contained in the cluster, and the relevant users having these skills are shown. When selecting users, their skills are shown in the visualization pane for further exploration. To support the user on their mental journey, the user’s navigation path is always visible and the breadcrumbs can be used to return to previous steps. The clustering algorithm starts by picking the most frequently used keyword and then searches for terms having a similarity value (as determined by the trigram module) within a certain range. The total number of occurrences of these words is then calculated and saved with the picked keyword. All words identified in this step are removed from the set of available words. This process iterates until no more words can be found. The total number of clusters generated from the data set is 2880.
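A minimal Python sketch of this greedy clustering is shown below; it substitutes a plain trigram overlap for the pg_trgm similarity used in the real system, and the similarity cut-off of 0.4 and the toy keyword counts are invented for illustration.

```python
from collections import Counter

def trigrams(s):
    s = ' ' + s.lower() + ' '          # pad so word boundaries contribute trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Jaccard-style trigram overlap, roughly what pg_trgm's similarity() returns."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cluster_keywords(keyword_counts, min_sim=0.4):
    """Greedy clustering: repeatedly pick the most frequent remaining keyword and
    absorb all keywords within the similarity range; the cluster size is the sum
    of occurrences of its members."""
    remaining = dict(keyword_counts)
    clusters = []
    while remaining:
        seed = max(remaining, key=remaining.get)
        members = [k for k in remaining if trigram_similarity(seed, k) >= min_sim]
        size = sum(remaining[k] for k in members)
        clusters.append((seed, members, size))
        for k in members:
            del remaining[k]
    return clusters

# toy usage with made-up keyword frequencies
counts = Counter({'java': 40, 'javascript': 25, 'j2ee': 10, 'sales': 30, 'salesforce': 5})
for seed, members, size in cluster_keywords(counts):
    print(seed, members, size)
```

The cluster size computed this way is what drives the font size in the tag cloud.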
The application also puts terms in relation to each other in order to visualize them in a graph. We decided to generate the graph based on keyword co-occurrence as suggested by Widdows et al. [6], with the edges being weighted by the number of co-occurrences. Two keywords are connected when one or more persons have specified both keywords in their profile (6355 pairs were identified in the production set).
3 Architecture and Implementation The browser client follows a two-view paradigm: for every action the user takes, relevant data is displayed in all view modules. For example, when selecting a term, matching persons are shown and the tag cloud is repopulated with related tags. Likewise, selecting a person shows the person’s keywords for further drilling down. The entire interface is built on the Dojo toolkit [7]. All rendering is done in these visualization components, so that the client application just has to pass on the data received from the server into the visualization modules. The result is a design where model, view and controller are nicely separated. A custom tag cloud component was developed to make the use of tag clouds easier in other applications. It automatically adjusts font-size within a given range depending on a tag’s associated weight and features a fade-over effect for an increased visual browsing experience. The graph connecting keywords based on their co-occurrence in persons’ profiles is displayed using the prefuse toolkit, which uses a force-based algorithm to align the graph [8]. A back-end had to be built to accept data from multiple data sources that are loaded into a single database. In the prototype, keywords are extracted from free-text fields in an XML dump of the corporate address book. The data is then loaded into a PostgreSQL relational database. The database schema contains trigram indexes on the person names and keywords, which is made possible by the pg_trgm [9] module. The database currently contains all of SAP’s employees world-wide, around 41000 keywords (15000 distinct), showing the power of the setup on a real-life production dataset. The backend provides facilities to cluster similar keywords and determine the importance of the cluster in the organization. This is accomplished by having a stored procedure in the database. Co-occurrence computation is also done close to the database using a PL/pgSQL procedure. Search queries coming from the AJAX-interface are passed on to the PostgreSQL backend via the web server application layer. The returned JSON data can be used to drive multiple visualization modules. Data is served to the prefuse applet in a similar way, providing XML data of edges and nodes in the graph, derived from the co-occurrence data.
Fig. 2. Architectural Overview (a browser client with the people-search AJAX interface, tag cloud component and prefuse applet built on the dojo toolkit; an Apache/PHP application tier with HTTP search and graph data interfaces; and a PostgreSQL database tier holding the relational data, the trigram index (pg_trgm), and the clustering and co-occurrence procedures, fed by a data cleaner/loader from raw XML data)
4 User Evaluation A preliminary evaluation survey was conducted among users to get feedback on the prototype. All ten users that responded to the questionnaire reported that the tool was useful compared to the existing system. On a scale of 1 to 5 (1=bad, 5=excellent), the people search feature averaged 3.4, the skill search 3.8, the cluster cloud 3.87, and the graph visualization 3.4. For the short development lifecycle of the prototype, the results are encouraging.
5 Future Work The next step is the integration of additional data sources such as document collections with rich meta-data [10], e-mail archives [11] and publications [12]. Temporal and spatial search mechanisms are also planned.
References [1] Cohen, W.M., and Levinthal, D.A. “Absorptive Capacity: A New Perspective on Learning and Innovation” Administrative Science Quarterly (35:1), 1990, 128-152 [2] Ashrafi, N. et al. “Boosting Enterprise Agility via IT Knowledge Management Capabilities” Proceedings of the 39th Hawaii International Conference on System Sciences, 2006 [3] Stein, E.W. and Zwass, V. “Actualizing Organizational Memory with Information Systems” Information Systems Research (6:2), 1995, 85-117 [4] Balog, K. et al. “Formal Models for Expert Finding in Enterprise Corpora” SIGIR’06, Seattle, WA, USA. [5] John, A. and Seligmann, D. “Collaborative Tagging and Expertise in the Enterprise” WWW 2006, Edinburgh, UK. [6] Widdows, D. et al. “Visualisation Techniques for Analysing Meaning” Fifth International Conference on Text, Speech and Dialogue, Brno, Czech Republic, 2002, 107-115. [7] http://dojotoolkit.org/ [8] http://www.prefuse.org/ [9] http://www.sai.msu.su/~megera/postgres/gist/ [10] Reichling, T. et al. “Matching Human Actors based on their Texts: Design and Evaluation of an Instance of the ExpertFinding Framework” GROUP’05, Sanibel Island, FL, USA. [11] Campbell, C.S. et al. “Expertise Identification using Email Communications” CIKM’03, New Orleans, LA, USA. [12] Tho, Q.T. et al. “A Web Mining Approach for Finding Expertise in Research Areas” Proceedings of the 2003 International Conference on Cyberworlds (CW’03).
Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection Marta Capdevila Dalmau and Oscar W. Márquez Flórez University of Vigo, Telecommunication Engineering School, Signal and Communications Processing Dpt. Rúa Maxwell s/n, Campus Universitario, 36310 Vigo, Spain {martacap, omarquez}@gts.tsc.uvigo.es http://www.gts.tsc.uvigo.es
Abstract. Distributional Clustering has been shown to be an effective and powerful approach to supervised term extraction aimed at reducing the original indexing space dimensionality for Automatic Text Categorization [2]. In a recent paper [1] we introduced a new Signal Processing approach to Distributional Clustering which reached categorization results on the 20 Newsgroups1 dataset similar to those obtained by other information-theoretic approaches [3][4][5]. Here we re-validate our method by showing that the 90-categories Reuters-215782 benchmark collection can be indexed with a minimum loss of categorization accuracy (around 2% with the Naïve Bayes categorizer) with only 50 clusters. Keywords: Automatic text categorization, Distributional clustering, Signal processing, Variance, Correlation coefficient.
1 Introduction
In the field of Automatic Text Categorization, the high dimensionality and data-sparseness of the indexing term space is problematic for most of the categorizers commonly used [2]. Distributional Clustering [3][4][5] is a supervised clustering technique that has been shown to be very effective at reducing the document indexing space with residual loss in categorization accuracy. We recently introduced [1] a new approach to Distributional Clustering based on techniques commonly used in Signal Processing [6] in Communications Engineering and validated its performance on the 20 Newsgroups dataset, obtaining results similar to the ones published by other information-theoretic approaches [3][4][5]. Here we re-confirm the potential of our Signal Processing approach to Distributional Clustering of terms on the Reuters-21578 benchmark collection, which differs from the 20 Newsgroups dataset in its uneven distribution over categories.
1 http://people.csail.mit.edu/jrennie/20Newsgroups/
2 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
2 Method
In text categorization, we are given a pre-labelled training set of documents3. On the basis of this information, any term t_k can then be characterized by its probability distribution function over the discrete variable category C defined in {c_1, ..., c_|C|}, |C| being the number of distinct categories, that is

f_k : c_i → f_k(c_i) = P(c_i | t_k),  with  Σ_{i=1}^{|C|} f_k(c_i) = 1    (1)

In our Signal Processing approach, we interpret t_k as being a probabilistic signal that encloses the random variable category. At a first stage, all signals that present a very flat distribution, and thus have a low variance4, are not informative of the variable category and will be considered as noisy signals that can be eliminated with virtually no loss of information. Subsequently, signals with similar probability distribution can be clustered together since they contain the same category information. The degree of similarity between signals is measured by the correlation coefficient5, which estimates the interdependence between both signals (it equals 0 if they are statistically independent and, in the other extreme, ±1 if they are linearly dependent). Correlation coefficient:

ρ_{f_j f_k} = (1 / (σ_{f_j} σ_{f_k})) · E[(f_j − m_{f_j})(f_k − m_{f_k})]    (2)
Initially we have adopted an agglomerative hard clustering6 algorithm [1] inspired by the method proposed in [3]. We have implemented three further variants of this algorithm. The static window of dimension M (M being the final number of clusters) used in the original algorithm has been dynamically expanded in order to avoid the merging of poorly correlated clusters in case no other merging could be done. This implies the implementation of a dynamic window expansion/compression approach. Furthermore, soft clustering7 has been implemented in both static and dynamic window approaches.
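The two signal measures at the heart of the method can be sketched as follows; the sketch uses the equi-probable (sample) interpretation of variance and correlation, and the toy counts are invented for illustration.

```python
import numpy as np

def category_distributions(counts):
    """counts: (terms x categories) matrix of raw term occurrences per category.
    Returns f_k(c_i) = P(c_i | t_k); every row sums to 1 (rows assumed non-empty)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def keep_informative(f, var_threshold=0.002):
    """Drop 'noisy' terms whose distribution over categories is too flat (low variance)."""
    return f[f.var(axis=1) > var_threshold]

def correlation(f_j, f_k):
    """Correlation coefficient between two term 'signals' over the category variable."""
    return float(np.corrcoef(f_j, f_k)[0, 1])

# toy example: 3 terms over 4 categories
counts = [[10, 0, 0, 0],   # concentrated in one category -> informative
          [9, 1, 0, 0],    # similar shape -> highly correlated with the first term
          [3, 3, 3, 3]]    # flat -> zero variance, treated as noise
f = category_distributions(counts)
print(f.var(axis=1))                 # [0.1875, 0.1425, 0.0]
print(keep_informative(f).shape)     # (2, 4): the flat term is removed
print(correlation(f[0], f[1]))       # close to 1: candidates for the same cluster
```

The agglomerative variants described above then repeatedly merge the most highly correlated term signals (or clusters) within the window.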
3 Experimental Scenario
We have used a common subset of the Reuters-21578 ModApté training/test partition which only considers the set of 90 categories with at least one positive training example and one positive test example.
3 In the following, we are using the classic bag-of-words indexing [2].
4 The variance provides a measure of the spread of the values of f_k relative to its mean m_{f_k} and is defined as σ²_{f_k} = E[(f_k − m_{f_k})²] = E[f_k²] − m²_{f_k}.
5 E[·] denotes the expectation operation and m_{f_k} = E[f_k] the mean of f_k.
6 Hard clustering allocates each term to a single cluster.
7 In a soft clustering approach a single term can belong to more than one cluster (i.e. clusters may overlap).
The distribution of categories in this corpus is highly skewed, with 36.7% of the documents in the most common category (i.e. label ’earn’), and overlapped (i.e. a document may have a multi-category tag). After removing stopwords (using the list provided by the Weka8 software) and non-alphabetical words, the collection training set is indexed by 32,539 words. Upon this, we further eliminated terms occurring in less than 4 documents or appearing less than 4 times in the whole dataset. The variance threshold has been empirically set to 0.002, which results in 10,859 informative words (7,483 if we use the sample variance introduced in the next paragraph) remaining after the elimination of noisy words. We applied our four clustering variants on the former Reuters-21578 ModApté pre-processed training set. We have used both the variance and correlation coefficient as defined in the probabilistic approach of Section 2 and also under a statistical interpretation that assumes equi-probable categories9 (i.e. P(c_i) = 1/|C|). The initial threshold for the correlation coefficient was set to its maximum value 1 and its decremental step in the contraction phase of the dynamic window approaches was empirically set to 0.00001. Finally, the whole collection of documents has been indexed in the resulting space of clusters. The quality of the clustering has then been evaluated quantitatively by means of the categorization accuracy of the well-known multinomial Naïve Bayes classifier [2] over the test set. The results obtained have been referenced to the analogous categorization accuracy of the same pre-processed collection (removal of stopwords and non-alphabetical words) indexed in the feature space resulting from classic Information Gain and Chi-square term selection functions10.
Fig. 1. Categorization accuracy of Distributional Clustering vs. Information Gain and Chi-square with the Naïve Bayes categorizer. Dispersion and similarity measures used: (left) variance and correlation coefficient; (right) sample variance and sample correlation coefficient, i.e. assuming equi-probable categories.
8 Our clustering implementation is based on the Weka 3 Data Mining Software [7].
9 Statistical measures are referred to as sample variance and sample correlation coefficient.
10 For graphical simplicity only the curves corresponding to the Static window hard clustering and Dynamic window soft clustering algorithms are represented in Figure 1.
4 Discussion of Results and Conclusions
Figure 1 shows results similar to the ones obtained by previous research on Distributional clustering [1][3][4][5] using the 20 Newsgroups dataset (i.e. the 20-categories Newsgroups dataset can be indexed with only 20 clusters). In particular, when using the statistical measures (right side of Figure 1), the curves corresponding to our Distributional clustering algorithms present an abrupt initial increase up to 50 clusters (accuracy 67-68%) and from there they asymptotically reach an accuracy of around 70%. In other words, the indexing space can be reduced from the original 32,539 words to 50 clusters (i.e. three orders of magnitude) with a residual loss of categorization accuracy of 2%. The 90-categories Reuters-21578 collection can thus be painlessly indexed with only 50 clusters. The curves shown on the left side of Figure 1 correspond to the proper probabilistic measures and show a softer evolution: with a small number of clusters (less than 50) the clustering works better than with the statistical assumption, while the other way round occurs from this point on. The interpretation of these results is heavily affected by the fact that Reuters-21578 is such an unevenly distributed collection: categorization errors in the most common categories have a strong impact on overall categorization accuracy. We are currently deepening our research on this aspect, introducing other measures of similarity. We are also working on the tuning of the clustering parameters such as the variance threshold and the dynamic window contraction step in order to optimize the overall categorization results. Finally, we are planning the design of a full categorizer based on our Signal Processing approach. Acknowledgments. For this research Marta Capdevila was supported in part by a predoctoral grant from the Xunta de Galicia regional government (Spain).
References 1. Marta Capdevila, Oscar W. Márquez: A signal processing approach to distributional clustering of terms in automatic text categorization. Proceedings of INSCIT2006, I Int. Conf. on Multidisciplinary Information Sci. and Tech., 2006. 2. Fabrizio Sebastiani: Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, March 2002. 3. L. Douglas Baker, Andrew Kachites McCallum: Distributional Clustering of Words for Text Classification. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, 1998. 4. Noam Slonim, Naftali Tishby: The Power of Word Clusters for Text Classification. 23rd European Colloquium on Information Retrieval Research, 2001. 5. Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar: A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification. Journal of Machine Learning Research 3, 1265-1287, 2003. 6. A. Bruce Carlson: Communication Systems: an Introduction to Signals and Noise in Electrical Communications. McGraw-Hill Book Company, third edition, 1986. 7. Ian H. Witten and Eibe Frank: Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
Overall Comparison at the Standard Levels of Recall of Multiple Retrieval Methods with the Friedman Test José M. Casanova1, Manuel A. Presedo Quindimil2, and Álvaro Barreiro1
1 IRLab, Department of Computer Science, 2 Department of Mathematics, University of A Coruña, Campus de Elviña s/n, 15071, A Coruña, Spain {jcasanova,mpresedo,barreiro}@udc.es http://www.dc.fi.udc.es/irlab/
Abstract. We propose a new application of the Friedman statistical test of significance to compare multiple retrieval methods. After measuring the average precision at the eleven standard levels of recall, our application of the Friedman test provides a global comparison of the methods. In some experiments this test provides additional and useful information to decide if methods are different.
1 Introduction
Evaluation is a basic need in Information Retrieval (IR). In order to assess whether a retrieval method performs better than others or not, it is necessary to apply a test of significance, because metric comparisons are strictly valid only for the same collection and queries. The tests of significance determine if the difference is not caused by chance, based on statistical evidence. Different applications of these tests have been described [1] and widely used in IR. Evaluation is still an open issue that affects the basis of IR. Recent work assesses that “significance substantially increases the reliability of retrieval effectiveness” [2]. The Friedman test is a non-parametric statistical significance test that can be employed with ordinal data [3]. In the ordinary use of this test, it is applied to the Mean Average Precision (MAP) or other metrics using the query as the block variable [4,5]. We propose to use the Friedman test to compare multiple retrieval methods using the eleven standard levels of recall as the block variable. This new application of the Friedman test provides a global comparison through the levels of recall.
2 The Friedman Test
The Friedman significance test [6] allows the comparison of multiple methods, in situations where the random variable is ordinal (rank-order) and the block variables are mutually independent.
Let b be the number of blocks, k the number of methods to be compared and X the random variable. The function R(X_ij) returns the rank of method j in the i-th block. In case of tied values, the final rank is the average of the corresponding tied ranking scores. Let R_j = Σ_{i=1}^{b} R(X_ij) be the sum of ranks for a method. Then the following values A and B are calculated as:

A = Σ_{i=1}^{b} Σ_{j=1}^{k} R(X_ij)²,    B = (1/b) Σ_{j=1}^{k} R_j²    (1)

The statistic T is defined as:

T = (b − 1) [B − b k (k + 1)² / 4] / (A − B)    (2)

The null hypothesis states that the methods are the same, and it is rejected at an α level of significance if T is greater than the 1 − α quantile of the F distribution (Fisher-Snedecor distribution) with (k − 1) and (b − 1)(k − 1) degrees of freedom. Paired comparisons among the methods are done when the null hypothesis is rejected. If methods i and j are significantly different the following inequality is satisfied:

|R_j − R_i| > t_{1−α/2} · [ 2b(A − B) / ((b − 1)(k − 1)) ]^{1/2}    (3)

where t_{1−α/2} is the 1 − α/2 quantile of the t-distribution with (b − 1)(k − 1) degrees of freedom.
3 The Friedman Test with the Standard Levels of Recall as the Block Variable
The Friedman test has been used to determine if differences in MAP and other precision metrics are significant, always using the query as the block variable. We propose a novel application of the Friedman test using as the random variable (X_ij) the interpolated average precision at the eleven standard levels of recall, i.e. the level of recall acts as the block variable. Intuitively, using the Friedman test as described allows a global comparison of the common precision/recall figures, but with the support of a test of significance. The results obtained with this new test can differ from those obtained when MAP acts as the random variable and the query as the block variable. We motivate our evaluation method by the need for an analytical method to compare precision/recall curves. Our approach uses only the information of the ranking position of each method at the 11 standard levels of recall. The independence assumption regarding the precision at the 11 standard levels of recall may not always hold. If the values were dependent, a statistical method for the analysis of repeated measurements [7] could be applied instead of the Friedman test.
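A small self-contained sketch of the test in this setting is given below, following the F-form described in Section 2; the eleven rows of interpolated precision values are invented purely for illustration, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist, t as t_dist

def friedman_f_test(X, alpha=0.05):
    """F-form of the Friedman test (equations (1)-(3) above).
    X: (b blocks x k methods); here each row is one of the 11 standard recall levels and
    each column the interpolated average precision of one retrieval method.
    Assumes A > B, i.e. the per-block rankings are not identical in every block."""
    X = np.asarray(X, dtype=float)
    b, k = X.shape
    ranks = np.apply_along_axis(rankdata, 1, X)   # rank methods within each block (ties averaged)
    A = (ranks ** 2).sum()
    R = ranks.sum(axis=0)                         # R_j, sum of ranks per method
    B = (R ** 2).sum() / b
    T = (b - 1) * (B - b * k * (k + 1) ** 2 / 4) / (A - B)
    critical = f_dist.ppf(1 - alpha, k - 1, (b - 1) * (k - 1))
    # least significant difference on rank sums for paired comparisons after rejecting H0
    lsd = t_dist.ppf(1 - alpha / 2, (b - 1) * (k - 1)) * np.sqrt(2 * b * (A - B) / ((b - 1) * (k - 1)))
    return T, critical, R, lsd

# invented interpolated precision values at the 11 standard recall levels for 3 methods
P = np.array([[0.78, 0.80, 0.74], [0.70, 0.73, 0.66], [0.63, 0.67, 0.60],
              [0.55, 0.60, 0.57], [0.48, 0.53, 0.50], [0.41, 0.46, 0.38],
              [0.34, 0.39, 0.31], [0.27, 0.32, 0.29], [0.20, 0.25, 0.18],
              [0.13, 0.18, 0.11], [0.08, 0.11, 0.05]])
T, critical, R, lsd = friedman_f_test(P)
print(T > critical)                               # True: reject H0, the methods differ
print([(i, j) for i in range(3) for j in range(i) if abs(R[i] - R[j]) > lsd])
```

The same function applies unchanged to the ordinary setting with queries as blocks; only the shape and meaning of the input matrix change.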
4 Experimental Results
To illustrate the use of this test we compared the behaviour of several smoothing methods for language models in the ad-hoc retrieval task. We tried to assess whether the following methods described in [8] are different: Jelinek-Mercer (jm), Dirichlet (dir) and absolute discount (ad). The implementation was developed using the retrieval framework Lemur Toolkit (http://www.lemurproject.org). After setting up the best smoothing parameters, we ran the different retrieval methods. Then we computed the MAP and the interpolated average precision at every standard level of recall to evaluate the results. In order to analyse whether there was a significant difference among the three smoothing methods, we applied the Friedman test to the MAP values and to each of the precision values at the eleven levels of recall, using the query as the block variable. The level of significance of the tests was fixed at α = 0.05. For example, for the WEB8 collection (TREC small web), using the titles from topics 401-450 with the smoothing parameters jm λ = 0.01, ad ρ = 0.80 and dir μ = 2200, the Friedman test using the query as the block variable shows that MAP values are significantly different for the three methods, and average interpolated precision values are also significantly different at the eleven standard levels of recall for the three methods. The Friedman test using the level of recall as the block variable also shows that there are significant differences among the three methods. After doing paired comparisons, the test on MAP values reveals that Dirichlet is better than the other two methods. The tests applied to the precisions at the levels of recall indicate that Dirichlet is significantly better than absolute discounting for all levels of recall, and better than Jelinek-Mercer for six of the eleven levels. The Friedman test using the recall level as the block variable confirms that Dirichlet is significantly better than the other methods. Now we describe another scenario, where the proposal presented in this paper complements the information provided by previous tests. For the FBIS collection (TREC disk 5), using only the titles from topics 351-400 with the smoothing parameters jm λ = 0.05, ad ρ = 0.75 and dir μ = 4000, the Friedman test using the query as the block variable only finds significant differences among the three methods for precision values at levels of recall from 0.40 to 0.80 and at level 1.00, but it does not find significantly different MAP values. Paired comparisons support that Dirichlet performs better than the other two methods. The Friedman test using the level of recall as the block variable indicates that the three methods are significantly different, and paired comparisons also show that Dirichlet performs better than the others. Therefore, in this experiment the use of the recall level as the block variable for the Friedman test reinforces that Dirichlet is the smoothing algorithm that outperforms the others. Another interesting experiment is the following. For the LATIMES collection (TREC disk 5), using the title, description and narrative from topics 351-400 with the smoothing parameters jm λ = 0.80, ad ρ = 0.75 and dir μ = 3000, the Friedman test using the query as the block variable shows that MAP values
are not significantly different for the three methods, and average interpolated precision values are only significantly different at the recall levels 0.30, 0.70 and 0.80. The Friedman test using the level of recall as the block variable also shows that there are no statistically significant differences among the three methods.
5 Conclusions and Further Work
We proposed a new application of the Friedman significance test using the levels of recall as the block variable. The use of this variant gives additional information to previous applications of the Friedman test that used the query as the block variable. In this way the test provides an approximation to the comparison of precision/recall curves. We illustrated the method, giving evidence that in some experiments the proposed test helps to decide whether the methods are different or not. We plan to apply our method using the collection as the block variable and MAP or another single measure as the random variable, because this ensures the independence assumption. An alternative to our method for the cases where independence may not hold could be the application of statistical methods for the analysis of repeated measurements. Future work will also address the problem of determining the utility of the test of significance following the methodology introduced by Zobel in [9].

Acknowledgements. The work reported here was co-funded by the SEUI and FEDER funds under the project MEC TIN2005-08521-C02-02 and "Xunta de Galicia" under project PGIDIT06PXIC10501PN. José M. Casanova is supported by a grant of DXID of the "Xunta de Galicia". We also thank the support of the "Galician Network of NLP & IR" Ref. 2006/03.
References
1. Hull, D.: Using statistical testing in the evaluation of retrieval experiments. In: Proc. of ACM SIGIR '93 (1993) 329–338
2. Sanderson, M., Zobel, J.: Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proc. of ACM SIGIR '05 (2005) 162–169
3. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC (2000)
4. Hull, D.: Stemming algorithms: A case study for detailed evaluation. JASIS 47(1) (1996) 70–84
5. Kekäläinen, J., Järvelin, K.: The impact of query structure and query expansion on retrieval performance. In: Proc. of ACM SIGIR '98 (1998) 130–137
6. Conover, W.: Practical Nonparametric Statistics. 2nd edn. John Wiley & Sons, Inc. (1980)
7. Davis, C.: Statistical Methods for Analysis of Repeated Measurements. Springer-Verlag NY, Inc. (2002)
8. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2) (2004) 179–214
9. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proc. of ACM SIGIR '98 (1998) 307–314
Building a Desktop Search Test-Bed

Sergey Chernov (1), Pavel Serdyukov (2), Paul-Alexandru Chirita (1), Gianluca Demartini (1), and Wolfgang Nejdl (1)

(1) L3S / University of Hannover, Appelstr. 9a, D-30167 Hannover, Germany
(2) Database Group, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{chernov, chirita, demartini, nejdl}@l3s.de, [email protected]
Abstract. In the last years several top-quality papers utilized temporary Desktop data and/or browsing activity logs for experimental evaluation. Building a common testbed for the Personal Information Management community is thus becoming an indispensable task. In this paper we present a possible dataset design and discuss the means to create it.
1 Introduction

In the last years several top-quality papers utilized Desktop data and/or activity logs for experimental evaluation. For example, in [4], the authors used indexed Desktop resources (i.e., files, etc.) from 15 Microsoft employees of various professions with about 80 queries selected from their previous searches. In [3] the Google search sessions of 10 computer science researchers were logged for 6 months to gather a set of realistic search queries. Similarly, several papers from Yahoo [2], Microsoft [1] and Google [5] presented approaches to mining their search engine logs for personalization. We want to provide a common public Desktop-specific dataset for this research community. The most closely related dataset creation effort is the TREC-2006 Enterprise Track (http://www.ins.cwi.nl/projects/trec-ent/). Enterprise search considers a user who searches the data of an organisation in order to complete some task. The most relevant analogy between Enterprise search and Desktop search is the variety of items which compose the collection (e.g., in the TREC-2006 Enterprise Track collection e-mails, CVS logs, web pages, wiki pages, and personal home pages are available). The biggest difference between the two collections is the presence of personal documents and especially activity logs (e.g., resource read/write time stamps, etc.) within the Desktop dataset. In this paper we present an approach we envision for generating such a Desktop dataset. We plan our new dataset to include activity logs containing the history of each file, email or clipboard usage. This dataset will provide a basis for designing and evaluating special-purpose retrieval algorithms for different Desktop search tasks.
2 Dataset Design

File Formats and Metadata. The data for the Desktop dataset will be collected among the participating research groups. We are going to store several file formats: TXT, HTML, PDF, DOC, XLS, PPT, MP3 (tags only), JPG, GIF, and BMP.
Table 1. Logged Applications and File Formats

Acrobat Reader: PDF files
MS Word: DOC, TXT, RTF
MS Excel: XLS
MS Powerpoint: PPT
MS Internet Explorer: HTML
MS Outlook: PST
Mozilla Firefox: HTML
Mozilla Thunderbird: MSF and empty extension, mbox format
Table 2. Timeline and Permanent Logged Information (logged information, and the item types it applies to)

Permanent information:
- URL: HTML
- Author: all files
- Recipients: email messages
- Metadata tags: MP3
- Has/is attachment: emails and attachments
- Saved picture's URL and saving time: graphic files

Timeline information:
- Time of being in focus: all files
- Time of being opened: all files
- Being edited: all files
- History of moving/renaming: all files
- Request type (bookmark, clicked link, typed URL): HTML
- Adding/editing an entry in calendar and tasks: Outlook Journal
- Being printed: all files
- Search queries in Google/MSN Search/Yahoo!/etc.: search fields in Internet browsers
- Clicked links: URL
- Text selections from the clipboard: text pieces within a file and the filename
- Bookmarking time: bookmarks in Internet browsers
- Chat client properties: status, contacts' statuses, sent filenames and links
- Running applications: task queue
- IP address: user's address and addresses the user connects to
- Email status: change between "received" and "read" status
Each group willing to test its system would then submit 1-2 Desktop dumps, using logging tools for the applications listed in Table 1. The set of logged applications can be extended in the future. The loggers save the information described in Table 2.

Data Gathering. As the privacy issue is very important here, we propose two options for possible information gathering.

1. Optimistic approach. We assume there are volunteers ready to contribute some of their personal information to the community, given that this information would be redistributed only among a restricted group of participants. As a test case, we gave two laptops to students for half a year. They were able to use them for free, but the condition was that all the information on these laptops would be available for future research. They were also warned not to store highly private information like passwords or credit card numbers. As this approach worked well, we expect that all participating groups will find similar reasonable incentives to attract more volunteers.
2. Pessimistic approach. While some people are ready to share information with their close friends and colleagues, they do not like to disclose it to outsiders. For this case, there is a way to keep information available only to a small number of people: personal data is collected from the participating groups by some coordinators and preprocessed into a publicly available uniform XML format. Every group can adapt its search prototypes to this format and submit binary files to the coordinators. Runs are then produced locally by a coordinator and results are sent back to the participants. This way, only trusted coordinators have access to the actual documents, while it is still possible for all participants to evaluate their results. Similar schemes have been tested in the TREC Spam Track (http://plg.uwaterloo.ca/~gvcormac/spam/), and it might be a necessary solution for future TREC tracks as well, whenever they involve highly private data (i.e. medical, legal, etc.).
3 Relevance Assessments and Evaluation As we are aiming at real world tasks and data, we want to reuse real queries from Desktop users. Since every Desktop is a unique set of information, its user should be involved in both query development and relevance assessment. Thus, Desktop contributors should be ready to give 10 queries selected from their everyday tasks. This also solves the problem of subjective query evaluation, since users know best their information needs. In this setting queries are designed for the collection of a single user, but some more general scenarios can be designed as well, for example finding relevant documents in every considered Desktop. It is thus possible to see the test collection as partitioned in sub-collections that represent single Desktops with their own queries and relevance assessments. This solution would be very related to the MrX collection used in the TREC SPAM Track, which is formed by a set of emails of an unknown person. The query can have the following format: –
As users know their documents very well, the assessment phase should go faster than normal TREC assessments. For the task of known-item search, the assessments are quite easy, since only one document (at most several duplicates) is considered relevant. For the ad hoc search task we expect users to spend about 3-4 hours on relevance assessments per query.
4 Proposed Tasks

1. Ad Hoc Retrieval Task. Ad hoc search is the classic type of text retrieval, where the user believes she has relevant information somewhere. Several documents can contain pieces of the necessary data, but she does not remember whether or where she stored them, and she is not sure which keywords are best to find them.
2. Known-Item Retrieval Task. Targeted or known-item search is the most common task in the Desktop environment. Here the user wants to find a specific document on the Desktop, but does not know where it is stored or what its exact title is. This document can be an email, a working paper, etc. The task considers that the user has some knowledge about the context in which the document has been used before. Possible additional query fields are: time period, location, topical description of the task in whose scope the document had been used, etc.
3. Folder Retrieval Task. It is very popular among users to have their personal items topically organized in folders. Later they may search not for a specific document, but for a group of documents in order to use it as a whole: browse them manually, reorganize them or send them to a colleague. The retrieval system should be able to estimate the relevance of folders and sub-folders using simple keyword queries.
5 Conclusion Building a Desktop IR testbed seems to be more challenging than creating a Web Search or an XML Retrieval dataset. In this paper we presented the concrete parameters for defining the features of such a Desktop Dataset and discussed the possible means for creating it, as well as utilizing it for algorithm assessments.
References
1. E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–10, New York, NY, USA, 2006. ACM Press.
2. R. Kraft, C. C. Chang, F. Maghoul, and R. Kumar. Searching with context. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 477–486, New York, NY, USA, 2006. ACM Press.
3. F. Qiu and J. Cho. Automatic identification of user interest for personalized search. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 727–736, New York, NY, USA, 2006. ACM Press.
4. J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 449–456, New York, NY, USA, 2005. ACM Press.
5. B. Yang and G. Jeh. Retroactive answering of search queries. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 457–466, New York, NY, USA, 2006. ACM Press.
Hierarchical Browsing of Video Key Frames

Gianluigi Ciocca and Raimondo Schettini
DISCo, Università degli Studi di Milano-Bicocca, Via Bicocca degli Arcimboldi 8, 20126, Milano, Italy
{ciocca,schettini}@disco.unimib.it

Abstract. We propose an innovative, general-purpose method for the selection and hierarchical representation of the key frames of a video sequence for video summarization. It is able to create a hierarchical storyboard that the user may easily browse. The method is composed of three different steps. The first removes meaningless key frames, using supervised classification performed by a neural network on the basis of pictorial features and a visual attention model algorithm. The second step groups the key frames into clusters to allow a multilevel summary, using both low-level and high-level features. The third step identifies the default summary level that is shown to the users: starting from this set of key frames, the users can then browse the video content at different levels of detail.
1 Introduction

The growing interest of consumers in the acquisition of and access to visual information has created a demand for new technologies to represent, model, index and retrieve multimedia data. Very large databases of images and videos require efficient algorithms that enable fast browsing and access to the information pursued [1]. We focus our attention here on the creation of a visual summary using still images, called key frames, extracted from the video. Key frame based video representation views video abstraction as a problem of mapping an entire segment to some small number of representative images. The extraction of key frames must be automatic and content-based so that they maintain the salient content of the video while avoiding all redundancy. Regardless of the key frame extraction strategy adopted (e.g. [2], [3], [4]), the visual summaries obtained may contain uninteresting and meaningless key frames, such as overexposed and underexposed key frames or close-ups with very few details. Moreover, successive very similar key frames can be extracted, disturbing the visual summary since they convey the same information. Finally, summaries formed by hundreds or even thousands of key frames are dispersive since they require intensive browsing. We propose an innovative method for the selection and hierarchical representation of the key frames of a video sequence for video summarization.
2 Proposed Method

The method presented in this paper makes it possible to create exhaustive and non-redundant visual video summaries. The method comprises three steps (Figure 1): the first removes the meaningless key frames, the second groups the key frames into clusters, and the third identifies the default summary level.
Fig. 1. The pipeline of the proposed method (key frames removal: image quality and "informativeness" features with neural network classification; key frames grouping: pictorial and semantic features with hierarchical clustering; level of presentation: Fisher's analysis).
2.1 Detection of Meaningless Key Frames

To remove meaningless key frames, we extract a set of features that describe them in terms of quality and information supplied. To assess the image quality we have chosen some of the features used in assessing the quality of images taken by digital cameras, such as the percentage of dark and bright pixels, for identifying overexposed and underexposed images; the dynamic range of the image, for detecting flat-looking images; and the color balance. The amount of information that key frames convey is assessed via a visual attention algorithm that detects the Regions of Interest (ROI) from a saliency map [5]. Only portions of the input image associated with relevant information are detected. Intuitively, if the percentage of ROIs is low or the ROIs are distributed outside the central part of the image, the corresponding key frame should be discarded. In practice, a trained neural network classifier makes the decision to discard a key frame. The neural network has eight input nodes that correspond to the computed features, two hidden layers with eight nodes each, and a single final node whose output can be interpreted as the probability that the processed key frame should be discarded. The learning process is based on the back-propagation algorithm with bias terms and momentum [6].

2.2 Hierarchical Key Frames Representation

Similar key frames are grouped by a hierarchical clustering algorithm. Initially each key frame corresponds to a cluster and represents it. The clustering is performed analyzing only pairs of consecutive clusters in order to retain their temporal ordering. At each step, the two clusters with the smallest maximum pair-wise distance (complete link strategy) are merged. The representative key frame of the cluster
resulting from the merging of two clusters is selected from the two representative frames as the one with the higher percentage of ROIs. Two clusters are compared using both low-level and high-level features. Low-level features are used to describe the pictorial content of the key frames while high-level features describe their semantic content. The low-level features used are the color histogram, the wavelet statistics and the edge direction histogram [4]. The high-level features are obtained by applying the classification strategies proposed in [7], where the images are classified as indoor, outdoor or close-up images. Instead of assigning an image to a single class, we have chosen to use the probabilities that a given image belongs to each of the three classes as a semantic histogram signature. The similarity degree between two clusters is the combination of the normalized similarity degrees computed on both the low-level and high-level features. The representative key frames are used to visually represent the clusters in the hierarchical visual summary. The clusters are arranged according to the chronology of the representative key frames.

2.3 Optimal Video Summary

The idea underlying the strategy for detecting the optimal summary level is that the frame differences used in merging the clusters in the previous step are considered merging costs, that is, the costs of reducing the summary by one key frame. In the initial phase of the clustering process, the costs will be relatively small, that is, the merged clusters will have small merging costs because the representative key frames are similar, and their merging will not significantly reduce the summary information content. As clustering continues, the merging costs will increase, meaning that the representative key frames will be increasingly dissimilar, and their merging will result in a summary with less information content. We select the level at which the merging costs rise significantly as the level of the optimal summary. The summaries at lower levels contain more key frames and redundant information. The summaries at higher levels contain few key frames. We apply Fisher's discriminant analysis [8] to the sequence of increasing merging costs; by minimizing the intra-group differences while maximizing their inter-group separation, it makes it possible to separate low from high merging costs. The global maximum in the Fisher's analysis curve corresponds to the default clustering level in the summary presentation. If the maximum is very close to the initial or final clustering levels, we assume that, for that summary, the default presentation summary does not differ from the original summary.
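A minimal sketch of how the default level could be picked from the merging costs under this criterion is shown below; it reflects our reading of the text, with made-up names, and is not the authors' implementation.

```python
# Sketch: scan every split of the increasing merging-cost sequence and
# keep the one maximising Fisher's criterion (squared distance between
# the group means divided by the summed group variances).
import numpy as np


def default_summary_level(merge_costs):
    costs = np.asarray(merge_costs, dtype=float)
    best_split, best_score = 1, -np.inf
    for s in range(1, len(costs)):
        low, high = costs[:s], costs[s:]
        score = (low.mean() - high.mean()) ** 2 / (low.var() + high.var() + 1e-12)
        if score > best_score:
            best_split, best_score = s, score
    return best_split        # merges before this index count as "low cost"
```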
3 Results and Conclusions

Since all the summary post-processing steps (removal of meaningless key frames and classification) rely on semantic information which no objective quality measure can effectively incorporate, the post-processing algorithm was heuristically tested by domain experts on a set of videos belonging to [9]. These experts manage video footage on a daily basis, manually extracting relevant information from the videos to use for content cataloguing and publication through distributed channels. The test set was composed of 14 non-professional videos, about 4 hours of footage. The experts
evaluated the processed summary in terms of compactness and information content, as well as the effectiveness of the multilevel summary. They judged the results positive, and thus our post-processing pipeline has been used in the AESS archive. Some results can be found at http://www.ivl.disco.unimib.it/Activities/VideoIndexing.html. As the post-processing algorithm does not use previous knowledge about the video content, nor makes any assumption about the input data, it can be used in different domains as a general-purpose algorithm. Nevertheless, some improvements can be made. The key frame removal stage could be extended with more pictorial quality features in order to better cover the many factors that can cause a user to reject a frame (e.g. wrong skin tone, half faces, etc.). The hierarchical key frame grouping stage could also be extended. We have introduced three generic classes for the classification of the key frames, but more classes can be added to the decision trees to enlarge the semantic dictionary.
References
1. Dimitrova, N., Zhang, H., Shahraray, B., Sezan, M., Huang, T., Zakhor, A.: Applications of video-content analysis and retrieval. IEEE MultiMedia, 9(3), (2002), 44-55.
2. Hanjalic, A., Lagendijk, R. L., Biemond, J.: A new method for key frame based video content representation. Image Databases and Multimedia Search, World Scientific, 1998.
3. Han, S., Yoon, K., Kweon, I.: A new technique for shot detection and key frames selection in histogram space. Proc. 12th W. on Im. Pr. and Im. Und., (2000), 475-479.
4. Ciocca, G., Schettini, R.: Dynamic key-frame extraction for video summarization. Journal of Real-Time Image Processing, 1(1), (2006), 69-88.
5. Corchs, S., Ciocca, G., Schettini, R., Deco, G.: Video summarization using a neurodynamical model of visual attention. IEEE Int. W. MM. Sig. Proc., (2004), 71-74.
6. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Parallel Distributed Processing, MIT Press, Volume 1, (1986), 318-362.
7. Schettini, R., Brambilla, C., Cusano, C., Ciocca, G.: Automatic classification of digital photographs based on decision forests. Int. J. Patt. Rec. and A. I., 18(5), (2004), 819-846.
8. Fisher, R. A.: The use of multiple measures in taxonomic problems. Ann. Eugenics, Volume 7, (1936), 179-188.
9. AESS, Archivio di Etnografia e Storia Sociale: http://aess.itc.cnr.it/index.htm.
Active Learning with History-Based Query Selection for Text Categorisation

Michael Davy and Saturnino Luz
Department of Computer Science, Trinity College Dublin
{Michael.Davy, Saturnino.Luz}@cs.tcd.ie
Abstract. Automated text categorisation systems learn a generalised hypothesis from large numbers of labelled examples. However, in many domains labelled data is scarce and expensive to obtain. Active learning is a technique that has been shown to reduce the amount of training data required to produce an accurate hypothesis. This paper proposes a novel method of incorporating predictions made in previous iterations of active learning into the selection of informative unlabelled examples. We show empirically how this method can lead to increased classification accuracy compared to alternative techniques.
1 Introduction
Automated text categorisation [1] systems are designed to label unseen documents with a set of predefined categories. They use inductive learning algorithms to construct a hypothesis of the target concept from a given set of labelled examples (training data), which can then be used to classify unseen data. The more labelled training data available to the learner, the greater the chances of inducing an accurate hypothesis (PAC learnability [2]). However, there are many domains where labelled data is difficult to obtain while unlabelled data is abundant. Active learning [3] has been shown to significantly reduce the number of examples required to induce an accurate hypothesis. In this paper we examine the query selection function of active learning, which is responsible for selecting informative unlabelled examples. We propose a method of incorporating classifier predictions from previous iterations of active learning into the query selection function, and we show how this method can lead to increased accuracy over alternative techniques in text classification problems.
2 Active Learning
Traditional learning techniques are passive since they have no control over the training data: it is simply supplied to them. Passive learners require large amounts of training data to ensure there are enough informative examples for the induction of an accurate hypothesis. Active learning gives the learner control over its training data. An active learner is supplied with a small amount of initial training data but can supplement this by selecting informative unlabelled examples and requesting their correct labels from an oracle. Labels are obtained only for the selected examples, which equates to labelling only the informative examples, thereby reducing the number of labelled examples required. (Supported by the Irish Research Council for Science, Engineering and Technology.)

Algorithm 1. Generic Active Learner
Input: tr - training data
Input: ul - unlabelled examples
for i = 0 to stopping criteria met do
    Φi = Induce(tr)              // induce classifier
    s = QuerySelect(ul, Φi)      // select query example
    l = Oracle(s)                // obtain query's label
    ul ← ul \ {s}                // remove example from ul
    tr ← tr ∪ {(s, l)}           // update training data
Output: C = Induce(tr)
Algorithm 1 outlines a generic active learner. In each iteration a probabilistic classifier (Φi) is induced from all the known labelled data. The classifier is used to give predictions on the unlabelled data (ul). A query example (s) is selected, removed from the unlabelled data (we use pool-based active learning), and its true label (l) is requested from the oracle. The training data is updated with the newly labelled example and the next iteration begins. The final output is a classifier trained on all labelled data.
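For concreteness, a sketch of the loop of Algorithm 1 in Python follows; `induce`, `query_select` and `oracle` are caller-supplied placeholders, not an existing library API.

```python
# Sketch of the generic pool-based active learner of Algorithm 1.
def active_learn(train, pool, induce, query_select, oracle, budget=100):
    for _ in range(budget):              # stopping criterion: label budget
        clf = induce(train)              # induce classifier on labelled data
        s = query_select(pool, clf)      # pick the most informative example
        label = oracle(s)                # ask the oracle for its true label
        pool.remove(s)                   # remove the example from the pool
        train.append((s, label))         # update the training data
    return induce(train)                 # final classifier on all labels
```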
3 History-Based Query Selection
The success of active learning relies on its ability to detect informative unlabelled examples. Uncertainty Sampling (US) [3] is perhaps the best known method for query selection. The query is selected as the unlabelled example the current classifier is most uncertain about ($s = \arg\min_{x \in ul} |\Phi_i(x) - 0.5|$). Each iteration of active learning produces predictions for the unlabelled examples. These predictions are ephemeral, that is, after the query example has been selected they are discarded. We propose a novel method which stores and utilises the historical predictions in order to find informative unlabelled examples. Predictions for a particular unlabelled example can be retrieved for all iterations of active learning to date, allowing us to study their variance. Below we propose two methods that incorporate historical predictions into the query selection function.
History Uncertainty Sampling (HUS) is an extension to US where we redefine uncertainty as the sum of the uncertainty of the last k predictions, as shown in Equation 1. Unlabelled examples are ranked according to their HUS value and the query example is selected as the one with the lowest value.

$$s = \arg\min_{x \in ul} \sum_{j=1}^{k} |\Phi_{i-j}(x) - 0.5| \qquad (1)$$
History KLD. The past k predictions can be thought of as the output of a committee of size k. In this context we can measure uncertainty as the disagreement among committee members, using Kullback-Leibler divergence to the mean [4]. The KL divergence of a particular example $x$ for a committee of $k$ members is calculated as in Equation 2, where the class distribution of an individual committee member is $P_m(C|x)$ and the mean distribution of the committee is given by $P_{avg}(C|x) = (\sum_m P_m(C|x))/k$. The KL divergence between two class distributions $P_1(C)$ and $P_2(C)$ with categories $C = \{c_1, \ldots, c_{|C|}\}$ is given in Equation 3.

$$KLD(x) = \frac{1}{k} \sum_{m=1}^{k} D(P_m(C|x)\,\|\,P_{avg}(C|x)) \qquad (2)$$

$$D(P_1(C)\,\|\,P_2(C)) = \sum_{j=1}^{|C|} P_1(c_j) \log \frac{P_1(c_j)}{P_2(c_j)} \qquad (3)$$

$$s = \arg\max_{x \in ul} KLD(x) \qquad (4)$$
We rank the unlabelled examples according to Equation 4 and select the highest value example. History KLD will select those examples which have erratic label assignments.
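A sketch of HKLD scoring for a binary task is given below; `history[x]` is assumed to hold the last k positive-class probabilities produced for unlabelled example x, and all names are illustrative rather than taken from the paper.

```python
# Sketch of History KLD query selection (eqs. (2)-(4)) for binary classes.
import math


def hkld(probs, eps=1e-12):
    dists = [(p, 1.0 - p) for p in probs]                       # P_m(C|x)
    k = len(dists)
    avg = tuple(sum(d[c] for d in dists) / k for c in (0, 1))   # P_avg(C|x)

    def kl(p, q):                                               # eq. (3)
        return sum(pc * math.log((pc + eps) / (qc + eps)) for pc, qc in zip(p, q))

    return sum(kl(d, avg) for d in dists) / k                   # eq. (2)


def select_query(history):
    # eq. (4): pick the example whose recent predictions disagree the most
    return max(history, key=lambda x: hkld(history[x]))
```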
4 Experimental Results
We used the four one-v-one binary problems of the 20 Newsgroup dataset as given in [5]. Documents were tokenised on non-word characters, stopwords removed and stemming applied using the Porter algorithm. Feature selection was not performed. In each iteration of active learning we selected just one query example. The classifier used was a 2-Nearest Neighbour using a normalised linear kernel. Initial training data consisted of 4 positive and 4 negative examples. The history depth (k) was tested for k = {2, 3, 5, 7}. Figure 1 shows the Macro-averaged accuracy from 10 trials of 10-Fold cross validation where active learning is halted after 100 iterations. In all four problems accuracy was increased when using HKLD. Accuracy increases over US using HKLD were (A) 2.178%, (B) 0.946%, (C) 2.027% and (D) 3.887%. All increases are statistically significant using a paired t-test (α = 0.05). We believe HUS does not improve accuracy as it fails to capture the variation in class prediction.
Fig. 1. 20NG results with history depth (k = 3) on the four binary problems: (A) Atheism-Religion, (B) Baseball-Cryptography, (C) Graphics-X, (D) Windows-Hardware. Accuracy is given on the Y axis and the iterations of active learning on the X axis; each plot compares US, HUS and HKLD. Active learning was stopped after 100 iterations.
5 Conclusions and Future Work
Incorporating history information using HKLD can lead to increased accuracy when compared to US. The depth of history used can have an impact on the improvement obtained when using the history methods. We wish to explore ways to select an optimal history depth in addition to early stopping mechanisms for the active learning process.
References
1. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34(1) (2002) 1–47
2. Mitchell, T.: Machine Learning. McGraw-Hill Higher Education (1997)
3. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. Proceedings of the 17th International ACM SIGIR (1994) 3–12
4. McCallum, A., Nigam, K.: Employing EM in Pool-Based Active Learning for Text Classification. Proceedings of the 15th International Conference on Machine Learning (1998) 350–358
5. Schohn, G., Cohn, D.: Less is more: Active learning with Support Vector Machines. Proceedings of the 17th International Conference on Machine Learning (2000) 285–286
Fighting Link Spam with a Two-Stage Ranking Strategy

Guang-Gang Geng, Chun-Heng Wang, Qiu-Dan Li, and Yuan-Ping Zhu
Key Laboratory of Complex System and Intelligent Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China
{guanggang.geng,chunheng.wang,qiudan.li,yuanping.zhu}@ia.ac.cn
Abstract. Most existing techniques for combating web spam focus on spam detection itself, separated from the ranking process. In this paper, we propose a two-stage ranking strategy which makes good use of the hyperlink information among websites and each website's internal structure information. The proposed method incorporates web spam detection into the ranking process and penalizes the ranking scores of potential spam pages, instead of removing them arbitrarily. Preliminary experimental results show that our method is feasible and effective.
1 Introduction
Link spam is an attempt to unfairly gain a high search engine ranking for a web page, without improving the user experience, by trickily manipulating the link graph. The issue of link spam is important not only because it can render significant gains in the rankings of target pages, but also because many instances of it are very hard to detect [5]. So far, most of the techniques for combating link spam, especially classification-based techniques, focus on the spam detection itself [1][3][6][7]. In these works, web spam identification is regarded as a preprocessing step of ranking, which increases the CPU burden. Additionally, after removing the "spam", the remaining "clean" data contains more dangling pages and pages without inlinks; furthermore, because there are no definite criteria for judging web spam, arbitrarily removing detected "spam" pages may mean that some potentially useful pages can never be retrieved. Based on the fact that users generally only explore the first few search engine results, and by analyzing the characteristics of link spam, we propose a two-stage ranking strategy which makes full use of inter-site hyperlink information and intra-site structure information.
2 Two-Stage Ranking Strategy
Several link spams are depicted in Fig. 1, where a website is denoted by a dashed rectangle, the solid rectangles represent webpages, and the lines with arrows denote hyperlinks.
Fig. 1. Typical link spams: (a) a link alliance and (b) a link farm.
By exploiting the link structures of link spam, we find that pages in a spam farm are densely connected, and many common pages exist in both the inlinks and outlinks of these pages. The site-level close connection relationship is also ubiquitous in link alliances, such as Fig. 1(a). On the other hand, supporters of a reputable page should not overly depend on one another, i.e. they should be spread across different qualities. One possible way to combat web spam is therefore to decrease the influence of intra-site links and penalize overly dependent relations in the web graph. This is exactly the motivation of our paper. In this section, a two-stage ranking strategy is proposed.

Site Level Web Model and Page Level Web Model. The site-level Web model is defined as a directed graph $G_S = (V_S, E_S)$, where $V_S$ represents the set of websites and $E_S = \{\langle i,j,k \rangle \mid i,j \in V_S, k \in N\}$ is the collection of hyperlinks between websites, $k$ being the number of hyperlinks. For penalizing dense connection relationships between websites, we define the adjacency matrix $A$ of the site-level web graph as follows:

$$A_{ij} = \begin{cases} \log_m(k+t), & \text{if } \langle i,j,k \rangle \in E_S \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $m \in N$ and $t$ is a regulating parameter. Correspondingly, the page-level Web model is defined as $G_P = (V_P, E_P)$, where $V_P = \{1, 2, \cdots, n\}$ is the collection of pages and $E_P = \{\langle i,j \rangle \mid i,j \in V_P\}$ is the set of edges, each of which represents a hyperlink. Different from the traditional Web model, this model does not take intra-site hyperlinks into account. For convenience, we denote the website where page $i$ lies as $Site(P_i)$.

Site Level Link Analysis. Site-level link analysis is similar in spirit to the original PageRank algorithm. We first construct a probability transition matrix $M$ by renormalizing each row of $A$. The importance of website $S_i$ is denoted as $SR(S_i)$, which can be computed as follows:

$$SR(S_i) = \varepsilon_1/N + (1-\varepsilon_1) \cdot \sum_{\langle j,i,k \rangle \in E_S} (SR_j \cdot M_{ji}) \qquad (2)$$

where $\varepsilon_1$ is a dampening factor and $N$ is the number of nodes in $G_S$.
Page Importance. G. R. Xue et al. [2] proposed a Dissipative Heat Conductance model to compute the importance of pages in a hierarchical structure. However, they treated all hyperlinks identically. We argue that different hyperlinks should have different weights in the voting process. The weight of the hyperlink from page $r$ to page $l$ is denoted as $W_{rl}$, computed with the formula

$$W_{rl} = \begin{cases} 1/(1+\ln(L_r \cdot L_l)), & \text{if } \langle r,l \rangle \in E_P \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $L_r$ and $L_l$ are the levels of page $r$ and page $l$ in their respective websites. If page $r$ is on the highest level, let $L_r = 1$, and $L_r$ increases by one when $r$ goes down to the next level of the website tree. Using this weighted adjacency matrix, we firstly normalize the weights of outlinks from the same website by their sum and obtain a new weight matrix $\overline{W}$. Secondly, we use formula (4) to penalize the hyperlinks from one site to a single page:

$$\overline{W}_{rl} = \overline{W}_{rl} \cdot \log_m(k+m-1)/k, \qquad (4)$$

where $k$ is the number of links from $Site(P_r)$ to page $l$. The page importance can then be computed as follows:

$$PR_l = \varepsilon_2/N + (1-\varepsilon_2) \cdot \sum_{\langle r,l \rangle \in E_P} \big(\overline{W}_{rl} \cdot SR(Site(P_r))\big) \qquad (5)$$

Compared with the iterative PageRank [4] algorithm, once the SR values have been obtained, the page importance can be computed in one step with formula (5).
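A hedged sketch of the two stages is given below; the dictionary-based data layout and parameter defaults are our own simplification, and the level-based weighting of equations (3)-(4) is assumed to have been applied already when building the page inlink weights.

```python
# Sketch of the two-stage computation: site-level SR by power iteration
# over the log-damped site graph (eqs. (1)-(2)), then a one-step page
# score (eq. (5)).
import math
from collections import defaultdict


def site_rank(site_links, n_sites, eps=0.15, m=math.e, t=math.e - 1, iters=50):
    # site_links: {(i, j): k} meaning site i points to site j with k hyperlinks
    A = {(i, j): math.log(k + t, m) for (i, j), k in site_links.items()}
    row_sum = defaultdict(float)
    for (i, _), w in A.items():
        row_sum[i] += w
    M = {(i, j): w / row_sum[i] for (i, j), w in A.items()}   # row-normalised
    sr = [1.0 / n_sites] * n_sites
    for _ in range(iters):
        nxt = [eps / n_sites] * n_sites
        for (i, j), mij in M.items():
            nxt[j] += (1 - eps) * sr[i] * mij
        sr = nxt
    return sr


def page_score(page_inlinks, site_of, sr, n_pages, eps=0.15):
    # page_inlinks: {page l: [(page r, penalised weight of r->l), ...]}, eq. (5)
    return {l: eps / n_pages + (1 - eps) * sum(w * sr[site_of[r]] for r, w in rs)
            for l, rs in page_inlinks.items()}
```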
3 Experiments
The experimental collection is a set of 16,367,815 pages from the .cn domain, downloaded in 2005. Some obvious content spam had been removed in the preprocessing stage with heuristic rules provided by [6]. Given the large size of the collection, instead of classifying all the pages into spam and normal, we first computed the ranking results, and then 11 postgraduates manually inspected a few pages of the corresponding websites to classify the result pages. Using 50 queries drawn from daily life, they also marked the most relevant documents among the returned results as relevant pages. The average number of relevant documents per query is 13.75. In the experiments, BM2500 [8] was used as the relevance weighting function. The top 2000 results according to the relevance score were chosen, and then the relevance (BM) and importance (PR) scores were combined as μ · BMi + (1 − μ) · PRi. Fig. 2 shows the comparison of the average spam number in the top n ranked results using our algorithm and PageRank. We can see that the proposed two-stage ranking method can effectively penalize web spam, especially in the top 20 ranked documents. Given that users generally only explore the first few search engine items, this result is encouraging, in spite of the performance dropping when more top-ranked results are considered.
Fig. 2. Comparison of the average link spam number in the top n ranked results (two-stage ranking vs. PageRank).
Fig. 3. Comparison of IR precision (P@10 and MAP) for PageRank and two-stage ranking.
As we can see in Fig. 3, the proposed algorithm significantly outperforms PageRank on average precision and P@10 by 10.7% and 8.1% respectively. All the data in Fig. 2 and Fig. 3 are obtained with the same parameters, i.e. ε1 = ε2 = 0.15, m = 2.71828, t = m − 1 and μ = 0.85.
4 Conclusions
In this paper, a two-stage ranking strategy was proposed, which penalizes link spam using inter-site hyperlink information and intra-site structure information. Experimental results showed that the proposed method penalizes spam pages well and improves retrieval precision effectively.
References
1. A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher. SpamRank: Fully Automatic Link Spam Detection. In Proc. of AIRWeb'05, Chiba, Japan, May 2005.
2. G. R. Xue, Q. Yang, H. J. Zeng, Y. Yu, and Z. Chen. Exploiting the Hierarchical Structure for Link Analysis. In Proc. of SIGIR'05, Salvador, Brazil, August 2005.
3. A. L. C. Carvalho, P. A. Chirita, E. S. Moura, P. Calado, and W. Nejdl. Site Level Noise Removal for Search Engines. In Proc. of the World Wide Web Conference, May 2006.
4. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.
5. Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. Technical Report, September 2005.
6. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting Spam Web Pages through Content Analysis. In Proc. of the World Wide Web Conference, May 2006.
7. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. In Proc. of WebKDD'06, August 2006.
8. S. E. Robertson. Overview of the Okapi Projects. Journal of Documentation, Vol. 53, No. 1, 3-7, 1997.
Improving Naive Bayes Text Classifier Using Smoothing Methods

Feng He and Xiaoqing Ding
Dept. of Electronic Engineering, Tsinghua University, 100084, Beijing, China
State Key Laboratory of Intelligent Technology and Systems
Abstract. The performance of naive Bayes text classifier is greatly influenced by parameter estimation, while the large vocabulary and scarce labeled training set bring difficulty in parameter estimation. In this paper, several smoothing methods are introduced to estimate parameters in naive Bayes text classifier. The proposed approaches can achieve better and more stable performance than Laplace smoothing.
1 Introduction

Text categorization is the task of automatically assigning one or more predefined class labels to new text documents. Many machine learning approaches have been introduced for text categorization in the literature [1]. The naive Bayes (NB) classifier is a simple Bayesian classifier. It assumes that the occurrence of each word in a document is independent of the position and neighbors of that word. Two types of NB text classifiers have been extensively compared in [2]. Since the multinomial model is almost uniformly better than the Bernoulli model, we focus on the multinomial model text classifier. The multinomial model is a generative model. It uses labeled document sets to estimate the parameters in training. A new text document is labeled with the class which is most likely to generate the document according to the learned model. In real-life tasks, the vocabulary often contains tens of thousands of words. Since labeling training documents by human labor is tedious and time consuming, it is usually hard to collect enough labeled documents for training. This data sparseness problem brings difficulty to parameter estimation for the NB text classifier. Various feature selection methods have been proposed for text categorization [3], by which only a reduced vocabulary is used for model training and testing. These feature selection methods have demonstrated their effectiveness in text categorization. However, in real-life applications it is often hard to determine a suitable feature size for the text classifier. Actually, the multinomial model is a unigram language model [2]. In language modeling, the n-gram model is also subject to the data sparseness problem, and several smoothing methods have been proposed to tackle this [4]. The smoothing techniques not only avoid zero probabilities for unseen words, but also achieve an improved estimation for the whole model. Thus, in this paper, these smoothing methods are introduced from the language modeling area to estimate the parameters of the NB text classifier.
2 Multinomial Naive Bayesian Model

In the multinomial model, regardless of each word's position and context in the document, a document is expressed as a vector of word weights with dimension $|V|$, where $V$ denotes the vocabulary. We denote a document as $d_i$, $i = 1, 2, \ldots, D$, where $D$ is the total number of training documents; a word as $w_t$, $t = 1, 2, \ldots, |V|$; the class labels as $c_j$, $j = 1, 2, \ldots, C$, where $C$ is the total number of classes; and the number of times word $w_t$ occurs in document $d_i$ as $n(w_t, d_i)$. According to the independence assumption, document $d_i$ is represented as $(n(w_1,d_i), n(w_2,d_i), \ldots, n(w_{|V|},d_i))$. The class-conditional probability of the document is given by the multinomial distribution, as follows:

$$P(d_i|c_j) = P(|d_i|)\,\frac{|d_i|!}{\prod_{t=1}^{|V|} n(w_t,d_i)!}\,\prod_{t=1}^{|V|} P(w_t|c_j)^{n(w_t,d_i)},$$
wherein $P(w_t|c_j)$ is the probability of each word in each class. To avoid zero probabilities, Laplace smoothing is conventionally applied to estimate $P(w_t|c_j)$:

$$P(w_t|c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t,d_i)}{|V| + \sum_{t=1}^{|V|} \sum_{d_i \in c_j} n(w_t,d_i)}.$$
Based on the learned multinomial model, a new document is labeled with the class that has the highest posterior probability; applying Bayes' rule, we have:

$$C(d_i) = \arg\max_{c_j}\Big(\log P(c_j) + \sum_{t=1}^{|V|} P(w_t|d_i)\,\log P(w_t|c_j)\Big).$$
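A compact sketch of this decision rule with Laplace smoothing follows; `docs` maps each class label to a list of per-document word-count dictionaries, and all names are illustrative rather than taken from the paper.

```python
# Sketch of the multinomial NB classifier with Laplace smoothing.
import math
from collections import Counter


def train_laplace(docs, vocab):
    priors, cond = {}, {}
    n_docs = sum(len(ds) for ds in docs.values())
    for c, class_docs in docs.items():
        priors[c] = len(class_docs) / n_docs
        counts = Counter()
        for d in class_docs:
            counts.update(d)
        total = sum(counts.values())
        cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, cond


def classify(doc, priors, cond):
    # doc is assumed to contain only vocabulary words
    n = sum(doc.values()) or 1

    def score(c):        # log P(c) + sum_t P(w_t|d) log P(w_t|c)
        return math.log(priors[c]) + sum(
            (cnt / n) * math.log(cond[c][w]) for w, cnt in doc.items())

    return max(priors, key=score)
```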
3 Smoothing Algorithms

These smoothing methods aim to redistribute the probability mass by discounting the probabilities of seen words and assigning the extra probability mass to unseen words. The smoothing algorithms considered are absolute smoothing, Good-Turing smoothing, linear smoothing and Witten-Bell smoothing, listed below.

Absolute Smoothing:
$$P_{ab}(w_k|c_j) = \begin{cases} \dfrac{r-\delta}{T}, & r > 0 \\[6pt] \dfrac{\delta \sum_{i>0} N_i}{T \cdot N_0}, & r = 0 \end{cases}$$

The small constant $\delta$ in absolute smoothing has already been determined in a previous study [5], arriving at $\delta = N_1/(N_1 + 2N_2)$.

Good-Turing Smoothing:
$$P_{gt}(w_k|c_j) = \begin{cases} \dfrac{r^*}{T}, & 0 < r \le K \\[6pt] \dfrac{r}{T}, & r > K \\[6pt] \dfrac{N_1}{N_0\,T}, & r = 0 \end{cases}$$

Linear Smoothing:
$$P_{ln}(w_k|c_j) = \begin{cases} \left(1-\dfrac{N_1}{|V|}\right) \cdot \dfrac{r}{T}, & r > 0 \\[6pt] \dfrac{N_1}{|V| \cdot \left(|V| - \sum_{i>0} N_i\right)}, & r = 0 \end{cases}$$

Witten-Bell Smoothing:
$$P_{wt}(w_k|c_j) = \begin{cases} \dfrac{r}{T + |V| - N_0}, & r > 0 \\[6pt] \dfrac{|V| - N_0}{N_0 \cdot (T + |V| - N_0)}, & r = 0 \end{cases}$$

wherein $r = \sum_{d \in c_j} n(w_k, d)$, $T = \sum_{k=1}^{|V|} \sum_{d \in c_j} n(w_k, d)$, $N_i = \sum_{k=1}^{|V|} I\big(\sum_{d \in c_j} n(w_k, d) = i\big)$, and $I(x) = 1$ if $x$ is true, $0$ if $x$ is false.
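To make the estimators concrete, a sketch of two of them (absolute and Witten-Bell; the others follow the same pattern) is given below, based on the reconstructed formulas above; it is illustrative only, and not the authors' code.

```python
# Sketch of per-class smoothed estimators; `counts[w]` is r for class c_j
# and `vocab` is the full vocabulary V.
from collections import Counter


def class_stats(counts, vocab):
    T = sum(counts.get(w, 0) for w in vocab)          # total tokens in class
    N = Counter(counts.get(w, 0) for w in vocab)      # N_i: #types with count i
    return T, N


def p_absolute(w, counts, vocab):
    T, N = class_stats(counts, vocab)
    delta = N[1] / (N[1] + 2 * N[2])                  # delta = N1 / (N1 + 2*N2)
    r = counts.get(w, 0)
    if r > 0:
        return (r - delta) / T
    seen_types = len(vocab) - N[0]                    # sum_{i>0} N_i
    return delta * seen_types / (T * N[0])


def p_witten_bell(w, counts, vocab):
    T, N = class_stats(counts, vocab)
    seen_types = len(vocab) - N[0]                    # V - N_0
    r = counts.get(w, 0)
    if r > 0:
        return r / (T + seen_types)
    return seen_types / (N[0] * (T + seen_types))
```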
4 Experimental Results

This section validates the performance of the newly introduced smoothing algorithms by empirically comparing them with Laplace smoothing. The experiments are conducted on Chinese text documents.

Data Set and Performance Measure. The data set comprises the documents of 10 topic classes from the China Encyclopedia corpus. It contains 18,478 text documents. Each document is assigned a single topic label. Among the 10 topic classes, the largest one is military affairs, which consists of 2683 documents, while the smallest one is sports, which consists of 700 documents. Since there is no natural delimiter between Chinese words, the popular maximum matching method is used, with a dictionary of 119,817 vocabulary entries, to segment Chinese texts into word strings. Then, stop-words and words that occur only once are removed. After this tokenization, the data set has a vocabulary size of 138,743. For this multi-class, single-label categorization task, we adopt classification accuracy as our performance measure.

Experiments and Results. The NB text classifier is run with randomly selected train-test splits. To focus on the issue of parameter estimation, in particular the effect of training set size and feature size on probability estimation in the NB text classifier, we gradually change the size of the training set by varying the ratio of the training set to the whole data set from 1% to 90%, and change the number of features obtained by the information gain method [3] from 500 words to the full vocabulary. We run 10 training and testing trials under each feature size and training set size, and take the average accuracy as the result reported here. Figure 1 shows results when the ratio of the training set to the whole data set is 90%. Here, the smoothing methods have similar performance, except that linear smoothing is slightly inferior. Figures 2 and 3 show results for training ratios of 2% and 1%, respectively. We can see that, as the feature size increases, the accuracy obtained with Laplace smoothing and linear smoothing decreases evidently, while for the other methods the performance is stable and relatively immune to the change of feature size. Even with the full vocabulary, the other three methods can achieve results comparable to Laplace smoothing with an optimally chosen feature size. We also conducted experiments with other training set ratios; due to space limitations, they are not shown here.
Figure 4 shows results for different training ratios, with the feature size optimally selected under each training ratio. The traditional Laplace method is slightly but consistently inferior to the proposed smoothing methods, with the exception of the linear smoothing method.
Fig. 1. A comparison of smoothing methods for different feature sizes, while the ratio of training set to whole data set is 90%.
Fig. 2. A comparison of smoothing methods for different feature sizes, while the ratio of training set to whole data set is 2%.
Fig. 3. A comparison of smoothing methods for different feature sizes, while the ratio of training set to whole data set is 1%.
Fig. 4. A comparison of smoothing methods for different training set sizes, while the feature sizes are optimally selected under each given size of training set.
5 Conclusion and Future Work

In this paper, several smoothing algorithms widely applied in language modeling are introduced for parameter estimation in the NB text classifier. The effectiveness of these smoothing methods is demonstrated through an empirical study. The proposed approaches avoid the burden of repeating the feature selection procedure to determine a suitable feature size, especially in the case of a small training set, thus achieving substantial improvement over the standard NB text classifier. Even without the
computation-intensive feature selection process, the proposed smoothing methods can obtain experimental results comparable to Laplace smoothing with an optimally chosen feature size. Future work includes applying these smoothing algorithms to data sets other than the Chinese corpus. Furthermore, we will continue to study the issue of scarce labeled training data in text categorization.
References
[1] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1), pp. 1–47, 2002.
[2] A. McCallum and K. Nigam, A comparison of event models for naive Bayes text classification, In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41–48.
[3] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, In D. H. Fisher, editor, Proceedings of ICML'97, pp. 412–420, Nashville, US, 1997.
[4] S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, In Joshi, A., Palmer, M., eds., Proceedings of the Annual Meeting of the ACL'96, pp. 310–318, 1996.
[5] Ney, Hermann, Ute Essen, and Reinhard Kneser, On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, pp. 1-38, 1994.
Term Selection and Query Operations for Video Retrieval

Bouke Huurnink and Maarten de Rijke
ISLA, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{bhuurnin,mdr}@science.uva.nl
Abstract. We investigate the influence of term selection and query operations on the text retrieval component of video search. Our main finding is that the greatest gain is to be found in the combination of character n-grams, stemmed text, and proximity terms.
1 Introduction
Widespread availability of digital video technology has led to increasing amounts of digital video data. Evidently, we need retrieval systems and algorithms to help us search through it. Automated Speech Recognition (ASR) transcripts are often the only text information source available for video search; the challenge is that this text is subject to speech recognition errors and that the spoken language used is substantially different from that in written query text. Additionally, written queries may request information contained in the visual modality. Past experiments have shown that ASR transcriptions of video are a valuable source of information [8]. Despite this fact, most research in video retrieval centers around multimodal analysis. Typically, only standard text retrieval methods are used for search of speech extracted from video. As part of an agenda aimed at optimizing the textual components of video retrieval, we report on experiments with different types of text representation for video retrieval. Specifically, we consider character n-grams (i.e., sub-word units), stemming, and proximity terms (i.e., multi-word units) and determine their impact on text retrieval effectiveness for video search. Our main finding is that video retrieval performance can be significantly improved through combination of term selection strategies.
2 Experimental Setup
For evaluation purposes we use the TREC Video Retrieval Evaluation (TRECVID) [8] datasets from 2003–2006. The combined dataset yields over 300 hours of news broadcast video which has been automatically segmented into over 190,000 shots—the basic units of retrieval. It is accompanied by 95 topics, each consisting of a short natural language description of the visual content that is desired from relevant shots. For each topic a ground truth has been manually created. Other components of the TRECVID dataset utilised for our retrieval experiments
include ASR transcripts, machine translation text, and news story boundary annotations, all of which have been generated automatically. Each shot is associated with the (English language) transcription of words spoken during that shot, as well as the transcription of the news story in which it occurs. By incorporating the surrounding news story we compensate for the temporal mismatch between the occurrence of an entity in speech and its occurrence in the associated video. We focus on the effects of textual term selection for video retrieval. Therefore we choose to use an existing, well researched, vector space model as our retrieval mechanism, rather than tweaking parameters for this specific task. Indexing and retrieval is done using Lucene [4]. Evaluation of the ranked shots is done using Mean Average Precision (MAP) [8].
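For reference, the sketch below computes (Mean) Average Precision over ranked shot lists; it is a generic illustration with an assumed run/qrels layout, not the official TRECVID evaluation tool.

# Sketch: Average Precision and Mean Average Precision for ranked shots.
def average_precision(ranked_shots, relevant):
    hits, precisions = 0, []
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    # runs: {topic_id: [shot_id, ...]}, qrels: {topic_id: set(relevant_ids)}
    aps = [average_precision(runs[t], qrels.get(t, set())) for t in runs]
    return sum(aps) / len(aps) if aps else 0.0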
3 Experiments
We ran three sets of experiments, one on character n-gram tokenization, one on stemming, and one on the use of proximity terms. For each set of experiments we determine the winning algorithm and perform a comparison to the text baseline. Our baseline representation is an index containing words as they occur in the ASR with stopwords removed. To see to what extent the representation methods just listed have complementary effects, we ran several combination experiments. Character N-Gram Tokenization. Character n-gram tokenization has been shown to boost retrieval in certain situations [5]. E.g., the use of character 4-grams has proved useful for retrieval from English newspapers [2]. We investigate the effects of n-gramming at different sizes of n (and in different combinations). We follow the tokenization strategy used in [5], creating overlapping, cross-word character n-grams, using values for n from 3 to 7. Stemming. For our stemming experiments we used the Porter stemming algorithm [7] to normalise morphological variants of words. Proximity Terms. Our third set of experiments follows [6], who found the use of proximity to be beneficial in web queries, prioritising results in which query words occur close together. Here we experiment with word n-grams, varying the magnitude of n and the proximity required between query words. We investigate the effects of retrieving consecutive sequences of 1 to 7 query terms, and combinations thereof. We also explore the effects of proximity terms—where terms are required to occur in a window of n words—and vary the window size. Run Combinations. Here, we evaluate all linear combinations of runs produced using the optimal term selection strategies determined in the previous three experiments [3].
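A minimal sketch of the term selection strategies just described is given below; it is illustrative only, and the exact normalisation used in the experiments (stopword removal, Porter stemming, Lucene query construction) is not reproduced here.

# Sketch: overlapping, cross-word character n-grams and consecutive
# word sequences (word n-grams) as alternative term representations.
def char_ngrams(text, n=5):
    # Treat the text as one character stream; single spaces are kept so
    # that n-grams may span word boundaries (cross-word n-grams).
    stream = " ".join(text.lower().split())
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

def word_sequences(tokens, n=2):
    # Consecutive sequences of n terms; for proximity terms these would be
    # matched within a window of non-query terms at retrieval time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("video retrieval", n=5)[:4])
print(word_sequences("find shots of a helicopter".split(), n=2))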
4 Results and Analysis
Table 1 provides an overview of the best performing settings for each of the methods identified in the previous section, and for their combinations. Due to
Table 1. Best performing settings per method and for combinations; Δ indicates the percentage change compared to the baseline. Significant changes are indicated with ∗ and ∗∗ (two-tailed Wilcoxon Matched-pair Signed-Ranks Test, improvements at the 0.05 and 0.01 level, respectively).

Individual
Method          MAP       Δ
Baseline        0.0609    –
Char. n-grams   0.0574    −5.7
Stemming        0.0647    +6.2
Prox. terms     0.0627∗   +3.0

Combinations
Methods                               MAP       Δ
Char. n-grams                         0.0596    −2.1
Char. n-grams + stem.                 0.0647    +6.3
Char. n-grams + Prox. terms           0.0616    +1.2
Prox. terms + stem.                   0.0658    +8.2
Char. n-grams + stem. + Prox. terms   0.0691∗∗  +13.5
Fig. 1. (Left): Per-topic change in average precision compared to the baseline, using stemming. (Right): Per-topic change in average precision compared to the baseline, using the grand combination strategy. Topics sorted in descending order by change over the baseline.
space limitations we cannot provide detailed tables per method; instead, we briefly discuss the findings per group of experiments. Character N-Gram Tokenization. The best performing method here is 5-grams (which differs from the best settings reported in [2, 5], where 4-grams were found to be optimal). It performs below the baseline, but not significantly. The best performing combination of character n-gram tokenizations combines 4, 6, and 7-grams and performs somewhat better, but still below the baseline. Stemming. Retrieval on stemmed ASR text outperformed all single n-gram techniques, with a MAP of 0.0647; the difference with the baseline was not significant, though: as Figure 1 (Left) suggests, all gains were offset by losses of practically the same size. Proximity Terms. Proximity terms (allowing up to 10 non-query terms to occur between query terms) outperformed the baseline, and did so significantly, as they did in [6]. The best results were achieved by using 2-word sequences for the proximity query, and combining these with the baseline run (i.e., “1-word sequences”). Combinations. Turning to combinations of different methods, we observe that the following best combinations all outperformed the baseline (but not significantly): character n-grams plus stemming, character n-grams plus proximity terms, and proximity terms plus stemming. The grand combination that combines runs created using the best settings for each of character n-gramming, stemming,
and proximity terms results in a run that significantly outperforms the baseline (at the 0.01 level). Figure 1(Right) shows the change in topic average precision values using the final grand strategy. Space does not allow an analysis of the effects on individual topics, but it is evident that the majority of topics show an increase in performance.
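The sketch below shows one simple way to fuse normalised run scores linearly and to test per-topic average precision differences with the Wilcoxon matched-pairs signed-ranks test used in Table 1; the min-max normalisation and equal weights are illustrative assumptions, not necessarily the combination scheme of [3].

# Sketch: linear run combination and per-topic significance testing.
from collections import defaultdict
from scipy.stats import wilcoxon

def normalise(run):
    # run: {shot_id: score}; min-max normalisation per run (an assumption).
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in run.items()}

def combine(runs, weights=None):
    weights = weights or [1.0] * len(runs)
    fused = defaultdict(float)
    for w, run in zip(weights, runs):
        for d, s in normalise(run).items():
            fused[d] += w * s
    return sorted(fused, key=fused.get, reverse=True)

def significantly_different(ap_baseline, ap_system, alpha=0.05):
    # Paired, two-tailed Wilcoxon signed-ranks test over per-topic AP values.
    _, p = wilcoxon(ap_baseline, ap_system)
    return p < alpha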
5 Conclusion
We examined whether term selection alone can be used to significantly improve video retrieval performance. We see that the textual component of video retrieval is similar to other forms of retrieval in that the use of proximity term pairs significantly improves retrieval effectiveness. Stemming also improves performance, while character n-gramming proves not to be directly useful for video retrieval. However, a combination of character n-grams, stemmed terms, and proximity terms results in the best performance. In future work we plan to explore further avenues for improving the effectiveness of textual search for video retrieval. These include query and document expansion techniques [9] and the linking of textual topics to visual detectors [1].
Acknowledgments Both authors were supported by the Netherlands Organization for Scientific Research (NWO) MUNCH project under project number 640.002.501. Maarten de Rijke was also supported by NWO under project numbers 017.001.190, 22080-001, 264-70-050, 354-20-005, 600.065.120, 612-13-001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, and by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.
References 1. C.G.M. Snoek, B. Huurnink, L. Hollink, M. de Rijke, G. Schreiber and M. Worring Adding semantics to detectors for video retrieval. Journal paper (submitted). 2. V. Hollink, J. Kamps, C. Monz, and M. de Rijke. Monolingual document retrieval for European languages. Inf. Retr., 7(1-2):33–52, 2004. 3. J. Kamps and M. de Rijke. The effectiveness of combining information retrieval strategies for European languages. In Proc. 19th Annual ACM Symp. Applied Computing, pages 1073–1077, 2004. 4. Lucene. The Lucene search engine, 2006. http://lucene.apache.org/. 5. P. McNamee and J. Mayfield. Character n-gram tokenization for European language text retrieval. Inf. Retr., 7(1-2):73–97, 2004. 6. G. Mishne and M. de Rijke. Boosting web retrieval through query operations. In Proc. ECIR 2005, LNCS 3408, pages 502–516. 2005. 7. M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316. 1997. 8. A.F. Smeaton, P. Over, and W. Kraaij. TRECVID: Evaluating the effectiveness of information retrieval tasks on digital video. In ACM Multimedia, 2004. 9. E. Voorhees and J. Garofolo. Retrieving noisy text. In E. Voorhees and D. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval. 2005.
An Effective Threshold-Based Neighbor Selection in Collaborative Filtering Taek-Hun Kim and Sung-Bong Yang Dept. of Computer Science, Yonsei University Seoul, 120-749, Korea {kimthun,yang}@cs.yonsei.ac.kr
Abstract. In this paper we present a recommender system using an effective threshold-based neighbor selection in collaborative filtering. The proposed method uses the substitute neighbors of the test customer, who may have unusual preferences or may be a first rater. The experimental results show that recommender systems using the proposed method find proper neighbors and provide good prediction quality.
1 Introduction
A recommender system generally utilizes an information filtering technique called collaborative filtering. Since collaborative filtering is based on the ratings of neighbors who have similar preferences, it is very important to select the neighbors properly to improve prediction quality. Although collaborative filtering is widely used for recommender systems, many issues remain unresolved, such as privacy, user trust, attributes, scalability, and sparsity, including the so-called cold start problem and the first rater problem [5]. In this paper we present recommender systems that use an effective threshold-based neighbor selection. The proposed method consists of two stages for selecting the neighbors. In the first stage we find the best n substitute neighbors of the test customer. We then find the proper neighbors using thresholds on the similarities and dissimilarities computed with respect to the substitute neighbors. The experimental results show that the proposed recommender system selects meaningful neighbors and achieves high prediction quality. Therefore, the proposed system can be a choice for solving the first rater problem in collaborative filtering.
2 An Effective Threshold-Based Neighbor Selection
The threshold-based neighbor selection method selects the neighbors who belong to a certain range with respect to the similarities of the preferences. The number of neighbors selected by this method varies because it selects neighbors according to a certain threshold value δ.
We propose an effective threshold-based neighbor selection method in collaborative filtering, which we call SNS. It is designed to solve the first rater problem in collaborative filtering and can be applied to clustering-based recommender systems. The proposed method utilizes the substitute neighbors of the test customer in the process of finding the neighbors for high prediction quality. SNS first finds the best n substitute neighbors. It then searches for the proper neighbors who have similarities either larger than δH or smaller than δL, where δH and δL are threshold values for the Pearson correlation coefficients. Note that as the threshold values change, so does the size of the neighborhood. The pseudo code for SNS is shown in Algorithm 1.

Algorithm 1. A pseudo code for SNS
1: Unmark all the customers in the input dataset S;
2: n = N; {n counts the number of substitute neighbors N}
3: while (n is not 0) do
4:   Find the substitute neighbor s who has the best similarity with the test customer t in the dataset S;
5:   Mark s;
6:   Insert s into Q; {Q temporarily holds the neighbors for a threshold-based search}
7:   Add s to Neighbors;
8:   Subtract 1 from n;
9: end while
10: while (Q is not empty) do
11:   Delete w from Q;
12:   for each unmarked neighbor x of w do
13:     if the weight similarity between x and w is either greater than δH or less than δL then
14:       Mark x;
15:       Add x to Neighbors;
16:     end if
17:   end for
18: end while
19: return Neighbors;
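A compact Python transcription of Algorithm 1 might look as follows; this is a sketch under simplifying assumptions (ratings stored as dictionaries, every other customer considered a candidate neighbour of w), not the authors' implementation.

# Sketch of SNS: substitute-neighbour selection followed by a
# threshold-based search using Pearson correlation as the similarity.
from math import sqrt

def pearson(ra, rb):
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = (sqrt(sum((ra[i] - ma) ** 2 for i in common))
           * sqrt(sum((rb[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

def sns(test_ratings, dataset, N, delta_h, delta_l):
    # dataset: {customer_id: {item_id: rating}}, test customer excluded.
    neighbours, marked, queue = [], set(), []
    # Stage 1: the best N substitute neighbours of the test customer.
    ranked = sorted(dataset, key=lambda c: pearson(dataset[c], test_ratings),
                    reverse=True)
    for s in ranked[:N]:
        marked.add(s)
        queue.append(s)
        neighbours.append(s)
    # Stage 2: threshold-based search around the substitute neighbours.
    while queue:
        w = queue.pop(0)
        for x in dataset:
            if x in marked:
                continue
            sim = pearson(dataset[x], dataset[w])
            if sim > delta_h or sim < delta_l:
                marked.add(x)
                neighbours.append(x)
    return neighbours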
3 Experimental Results
In the experiments we used the MovieLens dataset of the GroupLens Research Center [3]. We used two types of evaluation metrics: prediction accuracy metrics and recommendation list accuracy metrics. One of the statistical prediction accuracy metrics is MAE. MAE is the mean of the errors of the actual customer ratings against the predicted ratings in an individual prediction [2]. Precision and recall are also used for evaluating recommendation lists in the information retrieval community [1]. The standard F-measure is used in order to evaluate the quality as a single measure [4]. It is given by Equation (1).
F-measure = (2 · Precision · Recall) / (Precision + Recall).   (1)

For the experiment, we randomly chose 10% of all the customers in the dataset as the test customers. The rest of the customers are regarded as the training customers. For each test customer, we randomly chose ten different movies that are actually rated by the test customer as the test movies. The final experimental results are averaged over the results of ten different test sets for statistical significance. For comparing the recommender systems, we implemented three recommender systems for the experiments. The first system is collaborative filtering with the threshold-based neighbor selection, called TNS. The second one is collaborative filtering with the clustering-based neighbor selection, called CNS. The third one is collaborative filtering with the proposed refined threshold-based neighbor selection, called SNS. For TNS and SNS, three different variants have been implemented. The first is TNS (or SNS) with positive neighbors (TNSp (SNSp)). The second is TNS (or SNS) with negative neighbors (TNSn (SNSn)). The last is TNS (or SNS) with both positive and negative neighbors (TNSa (SNSa)). For SNS, we implemented n different settings according to the best n substitute neighbors (SNSn). For CNS, we used the k-means clustering method for the clustering. The experimental results are shown in Table 1 and Table 2. We determined the parameters which gave us the smallest MAEs and the largest precision, recall, and F-measure through various experiments. The experimental results show that SNS has performance approximately equal to TNS. For both TNS and SNS, TNSa (SNSa) outperforms the other variants for both the prediction accuracy and the recommendation list accuracy. For SNSa, SNSa1 is better than SNSan for the prediction accuracy but not for the recommendation list accuracy. The clustering-based systems have worse prediction quality than the threshold-based systems; however, within the clustering-based setting, TNS and SNS again show approximately equal performance, just as the non-clustering systems do. This fact means that the proposed neighbor selection is very useful for recommender systems because it can be substituted for the threshold-based
Table 1. The experimental results

System   MAE        Precision   Recall     F-measure
TNSp     0.739936   0.739142    0.773789   0.755284
TNSn     0.814013   0.688163    0.726987   0.706330
TNSa     0.739440   0.740468    0.776128   0.756053
SNSp1    0.739697   0.739770    0.773321   0.755274
SNSn1    0.848319   0.685816    0.721465   0.702517
SNSa1    0.739494   0.740010    0.777172   0.756606
SNSan    0.739503   0.741065    0.773923   0.756849
Table 2. The experimental results based on the clustering

System   MAE        Precision   Recall     F-measure
CNS      0.750977   0.733624    0.769937   0.751082
TNSa     0.750465   0.733748    0.770103   0.751082
SNSa1    0.750563   0.733922    0.770679   0.751355
SNSan    0.750787   0.734199    0.770657   0.751665
neighbor selection. Therefore, it could be used in recommender systems to solve the first rater problem.
4 Conclusions
Although collaborative filtering can be regarded as a good choice for a recommender system, there is still much room for improvement in prediction quality. In this paper we proposed an effective threshold-based neighbor selection that uses substitute neighbors to find the proper neighbors. The substitute neighbors are the neighbors of the test customer, who may have unusual preferences or may be a first rater. The experimental results show that the proposed recommender system finds meaningful neighbors using substitute neighbors and that it is also useful for clustering-based recommender systems. Therefore, clustering-based collaborative filtering using the proposed method could be a choice for the very large scale dataset problem and also for the first rater problem.
Acknowledgements We thank the GroupLens Research Center for permitting us to use the MovieLens dataset. This work has been supported by the BK21 Research Center for Intelligent Mobile Software at Yonsei University in Korea.
References 1. H. Nguyen, P. Haddawy: The Decision-Theoretic Interactive Video Advisor. Proceedings of the Conference on UAI. (1999) 2. J.S. Breese, D. Heckerman, and C. Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the Conference on UAI. (1998) 3. MovieLens dataset, GroupLens Research Center, url: http://www.grouplens.org/. 4. R.J. Mooney, L. Roy: Content-Based Book Recommending Using Learning for Text Categorization. Proceedings of the ACM Conference on Digital Libraries. (2000) 5. N.Yamamoto, M. Saito, M. Miyazaki, H.Koike. Recommendation Algorithm Focused on Individual Viewpoints. Proceedings of the Conference on CCNC. (2005)
Combining Multiple Sources of Evidence in XML Multimedia Documents: An Inference Network Incorporating Element Language Models Zhigang Kong and Mounia Lalmas Department of Computer Science, Queen Mary, University of London {cskzg, mounia}@dcs.qmul.ac.uk Abstract. This work makes use of the semantic structure and logical structure in XML documents, and their combination to represent and retrieve XML multimedia content. We develop a Bayesian network incorporating element language models for the retrieval of a mixture of text and image. In addition, an element-based collection language model is used in the element language model smoothing. The proposed approach was successfully evaluated on the INEX 2005 multimedia data set.
1 Introduction We believe that structure can play an essential role in the retrieval of multimedia content in XML multimedia documents. The proposed approach makes use of the semantic structure and logical structure in XML documents, and their combination for representing and retrieving XML multimedia document content. This work develops a general framework for combining multiple sources of evidence from various structured elements in XML multimedia documents. The multimedia content here refers to any type of multimedia data or a mixture of text and multimedia data. An element language model is applied upon each XML element. The framework combines the language models associated with the elements used to perform the retrieval of the multimedia content, using the Inference network model. An element-based smoothing method for the element language model is proposed. The general framework has been applied in the context of a retrieval task based on a mixture of text and image retrieval on the INEX 2005 Multimedia collection [4].
2 The Proposed Approach
XML documents are composed of structured elements that are nested within each other in a hierarchical tree. There are logical relationships between these nested elements. The elements logically surrounding a given element can be used to provide additional sources of evidence for representing this given element. In addition, the elements closer to it in the document hierarchy could provide more accurate representation than those further away. XML document contents are surrounded by text markups. These, which here refer to the element names, can provide meaningful semantics that can be viewed as metadata describing the nature of the element. This semantics can also be used for the
representation and retrieval of elements. We call the structure based on the meaningful markup semantic structure. The two types of structures are based on different characteristics. The former is identified according to the document logical hierarchy and the latter is classified according to the semantics of the markups (the names of the elements). They work in different ways to represent multimedia contents in XML documents. It would be expected that the combination of these two types of representations could lead to more effective retrieval of multimedia content than retrieval based on only one of them. We can directly query the semantically structured elements as well as query the multimedia data. This can be viewed as a retrieval of a mixture of text and multimedia data. Each semantically structured element (or the multimedia data) is represented by its logically surrounding elements (as shown in the left network of Figure 1). Furthermore, there could be other semantic structures (not the queried structures) in the XML documents. We can further structure the elements logically surrounding a queried element into semantic ones and non-semantic ones (as presented in the right network of Figure 1). This could improve the representation based on the surrounding elements. We use the first method (the left network of Figure 1) in this work as the test collection is not suitable for evaluating the second method. Due to the nested tree structure of XML documents, those structured elements are disjoint from each other so that they have non-overlapping content.
3 An Inference Network Incorporating Element Language Models
The inference network framework was explicitly designed for combining multiple representations and retrieval algorithms [1]. However, the heuristic estimation formulas (such as tf-idf) used in [3] do not correspond well to real probabilities. To address this, [1, 2] incorporate the language model in the Bayesian network. We follow this idea but apply it to XML multimedia retrieval. Figure 1 shows the Bayesian network combining the structured elements. In the networks, the node D models a document. The one or more nodes S model the queried elements, the element containing multimedia data and/or the semantically structured elements. The nodes between D and S model the elements logically surrounding the queried elements (the S nodes), where own is the queried element itself, 1st is its parent element, and so on. The Q node models a query and the I node models the information need. The probability of a structured element can be estimated as the probability of it generating the query. In this work, the Dirichlet prior is used for smoothing:

P(q|e) = ∏_{t∈q} (λ1 P(t|e) + λ2 P(t|C))   (1)

P(t|e) = f(t,e) / |e|   (2)

λ1 = |e| / (|e| + μ),   λ2 = μ / (|e| + μ)   (3)
where μ is a parameter set to the average length of the same type of structured elements in the collection; for example, if image elements are queried, it is the average length of all image elements in the collection. P(t|C) is a collection language model used for smoothing. However, we focus on estimating the structured elements instead of the entire document. Therefore, we compute P(t|C) as the probability of observing the term in the collection of the same type of queried elements. For our previous example, it is the probability of observing the term in all image elements in the collection.
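A rough sketch of scoring a queried element with this smoothed element language model (Eqs. 1-3) is given below; tokenisation, the element inventory, and the handling of zero probabilities are simplifying assumptions.

# Sketch: element language model with Dirichlet smoothing against an
# element-based collection model built from elements of the same type.
from collections import Counter
from math import log

def element_collection_model(elements_of_same_type):
    counts = Counter()
    for terms in elements_of_same_type:
        counts.update(terms)
    total = sum(counts.values())
    return lambda t: counts[t] / total if total else 0.0

def element_log_likelihood(query_terms, element_terms, p_coll, mu):
    tf = Counter(element_terms)
    length = len(element_terms)
    lam1 = length / (length + mu)
    lam2 = mu / (length + mu)
    score = 0.0
    for t in query_terms:
        p = lam1 * (tf[t] / length if length else 0.0) + lam2 * p_coll(t)
        score += log(p) if p > 0 else log(1e-12)   # crude floor for zero probabilities
    return score

# mu would be set to the average length of elements of the queried type.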
Fig. 1. The Bayesian networks for combining structured elements
Fig. 2. Combining structured elements (precision-recall curves for the "combining" and "further combining" runs)

Fig. 3. Using different reference models (precision-recall curves for smoothing with the standard collection language model and with the element-based collection language model)
4 Retrieving a Mixture of Text and Image
We evaluate our approach using the INEX 2005 multimedia collection, which consists of 2633 images in 462 XML documents. It contains INEX CAS topics made of content and structural constraints. This work directly queries the image element and the structural constraints (viewing them as semantic structures). Figure 2 shows the results of our approach. First, we combine the semantically structured elements using their own contents. Then we further combine the logically surrounding elements to estimate their probabilities. The MAP and precision@10 are 0.4079 and 0.4105 in the former and 0.4408 and 0.4158 in the latter. Compared with the best official submission of the INEX 2005 multimedia track, the latter increases by 58.62% (MAP) and 33.91% (precision@10).
Further combining the logically surrounding elements of each semantically structured element achieves an 8.07% improvement. We performed experiments to study each semantically structured element (as a query contains several structural constraints). The elements are grouped according to their depth. The results show that combining surrounding elements increases performance by 56.38% (MAP) and 31.86% (precision@10) for depth-2 elements, and 16.46% (MAP) and 30.73% (precision@10) for depth-3 elements. The overall improvement of combining the logically surrounding elements is much lower than the improvement for each structured element. This is because most (12 out of 19) of the topics query for the document root element (/destination), which cannot be improved. For those not querying for the root element, the overall improvement is 16.00%, which is 98.27% higher than that (8.07%) of all topics. Therefore, we can expect further improvement in the overall performance, as there are only 7 topics not querying the document root element. Figure 3 shows the results using different reference models in the smoothing. The first uses the standard collection language model and the second uses our element-based collection language model. The MAP of the former is 0.4184 and that of the latter is 0.4408. The latter increases 5.35% over the former. As discussed above, most topics query for a document root element. In this situation, the element-based collection language model is the same as the standard collection language model. When restricted to topics not querying for a root element, the results show that the MAP of using the element-based collection language model increases by 16.10% over that of using the standard collection language model. Therefore, we expect the element-based collection language model to lead to improved effectiveness.
5 Conclusions and Future Work This work makes use of the combination of semantic structure and logical structure in XML documents to represent and retrieve XML multimedia content. An inference network incorporating element language models was developed. In addition, we used an element-based collection language model. The experiments performed on the INEX 2005 multimedia collection showed promising results. Future work needs to be carried out into the use of the framework within a larger XML multimedia document collection.
References 1. Croft , W. B. (2000). Combining approaches to information retrieval. In W. B. Croft, editor, Advances in Information Retrieval, pages 1--36. Kluwer Academic Publishers. 2. Metzler, D. and Croft, W. B. (2004). Combining the language model and inference network approaches to retrieval, IP&M, 40(5):735-750. 3. Turtle, H. and Croft, W. (1991). Evaluation of an inference network-based retrieval model. ACM TOIS, 9(3):187–222. 4. van Zwol, R., Kazai, G, and Lalmas, M. (2006). INEX 2005 multimedia track. Advances in XML Information Retrieval and Evaluation, INEX 2005 Workshop.
Language Model Based Query Classification Andreas Merkel and Dietrich Klakow Spoken Language Systems Saarland University D-66123 Saarbrücken, Germany
Abstract. In this paper we propose a new way of using language models in query classification for question answering systems. We used a Bayes classifier as classification paradigm. Experimental results show that our approach outperforms current classification methods like Naive Bayes and SVM.
1 Introduction
Unlike most current information retrieval systems, which just return documents to keyword queries, a question answering (QA) system tries to return accurate answers to natural language questions. So, the step of retrieving relevant documents is just a part of a complex QA system. In order to find the one correct answer, such a system has to “understand” the meaning of a question. For example, if the question is “In what country is Luxor?”, the answer should be a country name and not a date. That means the question has to be analyzed in a separate step. There, the question is classified into several semantic categories. This categorization not only helps to find and verify the answer by reducing the search space but also may determine the search strategies for further QA modules ([1]). Normally, most QA systems use no more than 20 coarse classes for classification, but in this paper we decided to use the taxonomy proposed by [1], which takes 6 coarse and 50 fine grained classes into account. We used the fine grained classification in our experiments because [1] showed that it is more useful for locating and verifying answers. Based on this categorization we used a Bayes classifier with language models as classification paradigm. We show that this approach outperforms systems in the current literature.
2 Methods

2.1 Language Models for Classification
Next we have to introduce a suitable classification framework. Ponte and Croft suggested in [3] the use of language models for information retrieval from text collections and showed that they can outperform more traditional methods. We propose to use this technique also for the classification of questions.
As already mentioned we used the Bayes classifier as categorization paradigm. The Bayes classifier is defined by

ĉ = argmax_c P(Q|c) P(c)   (1)
which provides minimum error rate if all probabilities are exactly known. Here, P(Q|c) is the probability of a question Q given a semantic class c. P(c) is the prior for that class. P(Q|c) is a language model trained on the class c. In the case of unigram language models, P(Q|c) is calculated as the product of P(w|c) for all w in Q. The major advantage of the language modeling (LM) approach is that a huge amount of techniques is available to estimate and smooth probabilities even if there is just little training data available. On average, there are only about 100 training questions per question type. Unlike in previous work ([5]), we do not assume a uniform prior. P(c) can be considered as a unigram language model on semantic classes. As all classes are seen at least four times and therefore sufficiently often, there is no need for smoothing at all and relative frequencies can be used to estimate the language model. Next we will describe how to estimate the probability P(w|c). It is essential to avoid zero probabilities because that would exclude specific terms from the classification. Hence language model smoothing methods come into play.

2.2 Absolute Discounting
Absolute discounting and its variants are the most popular smoothing techniques in speech recognition. It is defined by

Pδ(w|c) = max(N(w,c) − δ, 0) / Σ_w N(w,c) + (δB / Σ_w N(w,c)) · PBG(w|c)   (2)
where N(w,c) is the frequency of observations of the term w together with class c. PBG(w|c) is a background model used for smoothing, δ is the smoothing parameter, and B denotes how often N(w,c) is larger than δ.

2.3 Dirichlet Prior
Using a Dirichlet prior results in

Pμ(w|c) = (N(w,c) + μ PBG(w|c)) / (Σ_w N(w,c) + μ)   (3)
where μ is a smoothing parameter to be determined on the development data.

2.4 Linear Interpolation
Linear interpolation was first introduced by Jelinek and Mercer [2] and hence some people refer to it also as Jelinek-Mercer smoothing. It is defined by

Pλ(w|c) = (1 − λ) · N(w,c) / Σ_w N(w,c) + λ · PBG(w|c)   (4)
where N (w, c) are frequencies on the training data, λ is a smoothing parameter to be tuned on the development data.
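The following sketch puts Eqs. (1)-(4) together as a question classifier; the training-data layout, the choice of background model (the pooled training questions), and the default smoothing parameters are illustrative assumptions.

# Sketch: Bayes classifier over class language models with three
# smoothing methods (Dirichlet prior, absolute discounting, linear
# interpolation). Not the authors' implementation.
from collections import Counter, defaultdict
from math import log

def train(questions):                       # questions: [(class_label, [tokens])]
    class_counts, word_counts = Counter(), defaultdict(Counter)
    for c, tokens in questions:
        class_counts[c] += 1
        word_counts[c].update(tokens)
    n = sum(class_counts.values())
    prior = {c: class_counts[c] / n for c in class_counts}   # P(c)
    background = Counter()
    for c in word_counts:
        background.update(word_counts[c])
    bg_total = sum(background.values())
    p_bg = lambda w: background[w] / bg_total if bg_total else 0.0
    return prior, word_counts, p_bg

def p_w_given_c(w, counts, p_bg, method="dirichlet", delta=0.7, mu=100, lam=0.2):
    total = sum(counts.values()) or 1
    if method == "dirichlet":                                 # Eq. (3)
        return (counts[w] + mu * p_bg(w)) / (total + mu)
    if method == "absolute":                                  # Eq. (2)
        B = sum(1 for c in counts.values() if c > delta)
        return max(counts[w] - delta, 0) / total + (delta * B / total) * p_bg(w)
    return (1 - lam) * counts[w] / total + lam * p_bg(w)      # Eq. (4)

def classify(tokens, prior, word_counts, p_bg, **kw):         # Eq. (1)
    def score(c):
        s = log(prior[c])
        for w in tokens:
            s += log(max(p_w_given_c(w, word_counts[c], p_bg, **kw), 1e-12))
        return s
    return max(prior, key=score)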
3 Experiments

3.1 Data
For classification we used 6 coarse and 50 fine grained classes as defined in [1]. So, for example, the coarse class LOCATION contains the fine classes city, country, mountain, other and state, whereas HUMAN consists of group, individual, title and description, and so on. As training data for our experiments we used the 5,500 questions provided by the Cognitive Computing Group at the University of Illinois at Urbana-Champaign.1 For the evaluation task we used the TREC 10 dataset2 consisting of 500 questions. Both training and test sets are labeled with the corresponding coarse and fine classes.

3.2 Results
As examples from our experiments, Fig. 1 shows the results for the classification with linear interpolation and absolute discounting as smoothing methods. The x-axis shows the interpolation weight and the y-axis shows the accuracy (curves are for the fine grained classification).
Fig. 1. Accuracy vs. interpolation weight for linear interpolation (a) and absolute discounting (b)
The plot on the left hand side (a) shows the linear interpolation smoothing method. It has a maximum near λ = 0.2 and is relatively independent of the interpolation weight. The graph on the right hand side shows the absolute discounting method. In contrast to linear interpolation, it strongly depends on the discounting parameter and has its maximum at δ = 1.
1 http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/
2 http://trec.nist.gov/
Table 1 compares results from [4] with the proposed LM approach for various machine learning algorithms. They used two different feature sets (bag-of-words and bigram features), so we always print the better result. Our LM approach utilizes optimized bigram features. In particular, we used a log-linear interpolation between a bigram and a unigram distribution.

Table 1. Comparison of various algorithms (Naive Bayes, ..., SVM) investigated in [4] with the proposed language model based approach, denoted by LM in the table

Algorithm        Accuracy   Error Rate
Naive Bayes      67.8%      ±2.5%
Neural Network   68.8%      ±2.5%
Decision Tree    77.0%      ±2.1%
SVM              80.2%      ±2.0%
LM               80.8%      ±2.0%
The table shows that our approach is much better than Naive Bayes. This difference is probably due to the smoothing techniques used. The SVM is the best algorithm from [4]; however, it is outperformed by the LM approach by a small margin, although in terms of error rate the LM is not significantly better than the SVM. The exact error rates are shown in the last column of the table.
4 Conclusion
In this paper we presented a language modeling approach for query classification based on a Bayes classifier. We experimented with different smoothing methods and various unigram and bigram models. As a result we showed that our proposed approach outperforms current categorization methods: it is significantly better than classification with Decision Trees and as good as SVM.
References 1. Xin Li and Dan Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (2002). 2. Hermann Ney, Ute Essen and Reinhard Kneser On Structuring Probabilistic Dependencies in Stochastic Language Modeling. In Computer Speech and Language 8 (1994) 1-38. 3. Jay M. Ponte and Bruce Croft A Language Modeling Approach to Information Retrieval. In Proceedings SIGIR (1998). 4. Dell Zhang and Wee Sun Lee Question Classification using Support Vector Machines. In Proceedings SIGIR (2003). 5. Chengxiang Zhai and John Lafferty A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In Proceedings SIGIR (2001).
Integration of Text and Audio Features for Genre Classification in Music Information Retrieval Robert Neumayer and Andreas Rauber Vienna University of Technology Institute for Software Technology and Interactive Systems {neumayer,rauber}@ifs.tuwien.ac.at
Abstract. Multimedia content can be described in versatile ways as its essence is not limited to one view. For music data these multiple views could be a song’s audio features as well as its lyrics. Both of these modalities have their advantages as text may be easier to search in and could cover more of the ‘content semantics’ of a song, while omitting other types of semantic categorisation. (Psycho)acoustic feature sets, on the other hand, provide the means to identify tracks that ‘sound similar’ while less supporting other kinds of semantic categorisation. Those discerning characteristics of different feature sets meet users’ differing information needs. We will explain the nature of text and audio feature sets which describe the same audio tracks. Moreover, we will propose the use of textual data on top of low level audio features for music genre classification. Further, we will show the impact of different combinations of audio features and textual features based on content words.
1 Introduction
The large-scale adoption of new business models for digital audio material is happening already. What many content providers and online music vendors are still missing are appropriate means of presenting their media to their users according to different information needs. Amazon1 or last.fm2 have shown the potential of recommendation engines based on the mining of transactional data. It further is an intrinsic need for every Music Information Retrieval system to include not only recommendation or playlist generation engines, but also possibilities for searching and browsing. Music Information Retrieval has made huge progress in terms of devising sophisticated descriptors for the acoustic content of music. Research in this direction facilitates numerous content-based search scenarios, such as query by humming, or organisation tasks, such as genre classification, playlist generation, and browsing access by perceived sound similarity. Song lyrics cover semantic information about a song's contents on a level that could never be covered by audio features only.
1 http://www.amazon.com
2 http://www.last.fm
Many users may rather be interested in songs that cover similar topics than in songs that sound alike. Some musical genres, such as Christmas carols, can only be detected by including textual features, as they occur across many different musical genres and the definition of the genre itself is rather done on a topic level. We thus combine both textual and audio information for music genre classification, i.e. automatically assigning musical genres to tracks based on audio features as well as content words in song lyrics. The remainder of this paper is organised as follows. Section 2 provides a brief overview of related work. This is followed by a description of our classification experiments in Section 3, as well as an outlook on current work in Section 4.
2 Related Work
The area of Music Information Retrieval has been heavily researched, particularly focusing on audio feature extraction. First experiments with content-based Music Information Retrieval were reported in [1] as well as [6], the focus being on automatic genre classification of music. In this paper we use a modified version of the Rhythm Patterns features, previously used within the SOMeJB system [5]. Based on that feature set, [2] shows that the Statistical Spectrum Descriptors yield relatively good results at a manageable dimensionality. A sophisticated semantic and structural analysis including language identification of songs based on lyrics is conducted in [4]. Artist similarity based on song lyrics is performed in [3], pointing out that similarity retrieval using lyrics is inferior to acoustic similarity. It is also suggested that a combination of lyrics and acoustic similarity could improve results, further motivating the research reported in this paper.
3 Experiments
Due to the lack of public benchmark corpora, we created a parallel corpus of audio and song lyrics files of a music collection of 9.758 titles organised into 41 genres. Class sizes ranged from only a few songs for the ‘Classical’ genre to about 1.900 songs for ‘Punk Rock’. In order to utilise the information contained in music for genre classification, we describe sets of audio features derived from the waveform of audio tracks as well as the bag-of-word features for song lyrics.

3.1 Audio Features
Three features were computed from audio tracks in standard PCM format with 44.1 kHz sampling frequency (e.g. decoded MP3 files). Rhythm Patterns (RP) [5], also called Fluctuation Patterns, denote a matrix representation of fluctuations on critical bands (parts of it describe rhythm in the narrow sense), resulting in a 1.440 dimensional feature space. Statistical Spectrum Descriptors (SSDs, 168 dimensions) are statistical moments derived from a psycho-acoustically transformed spectrogram [2]. Rhythm Histograms (RH, 60 dimensions) are calculated as the sums of the magnitudes of each modulation frequency bin of all 24 critical bands.
Table 1. Classification accuracies for different types and combinations of audio features and features based on lyrics. The experiments A1 - A3 denote audio-only, L1 - L4 lyrics-only, and C1 - C3 features combined from audio and lyrics feature sets. The type column shows the types of feature sets used, dimensionality notes the resulting dimensionality of the data.

Name   Type           Dimensionality   Classification Accuracy
A1     RH             60               .3100
A2     SSD            168              .4168
A3     RP             1440             .4128
L1     LYRICS         60               .2451
L2     LYRICS         168              .3204
L3     LYRICS         1440             .4445
L4     LYRICS         3000             .4708
C1     LYRICS + RH    120              .3268
C2     LYRICS + SSD   336              .4817
C3     LYRICS + RP    2880             .4841

3.2 Lyrics Features
For every piece of music, three lyrics portals were accessed, using artist name and track title as queries. If the results from lyrc.com.ar were of reasonable size, these lyrics were assigned to the track. If lyrc.com.ar failed, sing365lyrics.com was checked for validity, then oldielyrics.com. All lyrics were processed using the bag-of-words model and weighted by tf-idf information. Feature selection was done via document frequency thresholding, i.e. the omission of terms that occur in a very high or very low number of documents. For the matrices used for the experiments, terms occurring in more than half of the documents were omitted; the lower threshold was then adjusted to meet the desired dimensionality. Downscaling was performed to different dimensionalities matching the dimensionalities of the audio feature spaces.
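A sketch of this feature extraction step, assuming scikit-learn's TfidfVectorizer as a stand-in for the original processing chain, is given below; the lower document-frequency threshold and the target dimensionality are placeholders.

# Sketch: bag-of-words lyrics features with tf-idf weighting and
# document-frequency thresholding.
from sklearn.feature_extraction.text import TfidfVectorizer

def lyrics_features(lyrics_texts, target_dim=3000):
    vectorizer = TfidfVectorizer(max_df=0.5,       # drop terms in > half of the documents
                                 min_df=5,         # assumed lower threshold
                                 max_features=target_dim)
    X = vectorizer.fit_transform(lyrics_texts)     # documents x terms matrix
    return X, vectorizer.get_feature_names_out()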
3.3 Classification Results
Table 1 shows classification accuracies for a set of experiments based on audio and lyrics features as well as combinations thereof. Experiments were performed with Weka's implementation of Support Vector Machines using ten-fold stratified cross validation. Results shown are the macro-averaged classification accuracies. Results show that a combination of lyrics and audio features improves overall classification performance. The best results were achieved by the ‘LYRICS + RP’ setting (C3), closely followed by the ‘LYRICS + SSD’ experiment (C2). The higher the dimensionality of the lyrics data, the higher its classification accuracy, implying that there is even more discriminating information contained in lyrics (see experiments L1 - L4), which is not covered in this context because of the limitations of the simple concatenation approach. For combination experiments (C1 - C3) we use balanced combinations of features,
i.e. the dimensionality of the lyrics component always equals the dimensionality of the audio feature component. For statistical significance testing we used a paired T-test at a significance level of α = .05. Results showed that A2 performs better than A1 (p = .0143), but there is no significant difference between A2 and A3 (p = .9353). Further, it is shown that C3 performed better than both A2 (p = .1934) and L3 (p = .0129). However, the results are not significantly different from experiment L4 (p = .1755), leading to the conclusion that high-dimensional lyrics data alone is a strong basis for a classifier. Hence a classifier based on different numbers of lyrics and audio features, e.g. more dimensions in the lyrics than in the audio space, might further improve classification accuracy. Yet, by combining lyrics and audio (C2) the same performance was achieved at a fraction of the dimensionality.
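The sketch below illustrates one such combination experiment: equal-sized lyrics and audio feature blocks are concatenated, evaluated with ten-fold stratified cross-validation using an SVM, and compared with a paired t-test. The specific classifier (LinearSVC rather than Weka's SMO), dense feature arrays, and fold-level pairing are assumptions.

# Sketch: combining balanced lyrics and audio feature blocks and testing
# for significant differences between two settings.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def fold_accuracies(features, labels, folds=10):
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    return cross_val_score(LinearSVC(), features, labels, cv=cv)

def compare(lyrics, audio, labels):
    combined = np.hstack([lyrics, audio])      # balanced: equal block dimensionality
    acc_audio = fold_accuracies(audio, labels)
    acc_comb = fold_accuracies(combined, labels)
    _, p = ttest_rel(acc_audio, acc_comb)      # paired over the (identical) folds
    return acc_audio.mean(), acc_comb.mean(), p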
4 Conclusions and Future Work
We showed that the combination of multi-modal features for information retrieval increases classification accuracy. Future work will deal with better means of combining classification results. Ensemble methods might prove useful, overcoming the limitation of implicit feature weighting encountered in the current setting. Additionally, stylistic features for text genre classification are currently being integrated.
References 1. Jonathan Foote. An overview of audio information retrieval. Multimedia Systems, 7(1):2–10, 1999. 2. Thomas Lidy and Andreas Rauber. Evaluation of feature extractors and psychoacoustic transformations for music genre classification. In Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR 2005), pages 34– 41, London, UK, September 11-15 2005. 3. Beth Logan, Andrew Kositsky, and Pedro Moreno. Semantic analysis of song lyrics. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo, ICME 2004, 27-30 June 2004, Taipei, Taiwan. IEEE, 2004. ´ 4. Jose P. G. Mahedero, Alvaro Mart´ınez, Pedro Cano, Markus Koppenberger, and Fabien Gouyon. Natural language processing of lyrics. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 475–478, New York, NY, USA, 2005. ACM Press. 5. Andreas Rauber, Elias Pampalk, and Dieter Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In Proceedings of the 3rd International Symposium on Music Information Retrieval, pages 71–80, Paris, France, October 13-17 2002. 6. George Tzanetakis and Perry Cook. Marsyas: A framework for audio analysis. Organized Sound, 4(30), 2000.
Retrieval Method for Video Content in Different Format Based on Spatiotemporal Features Xuefeng Pan1,2 , Jintao Li1 , Yongdong Zhang1 , Sheng Tang1 , and Juan Cao1,2 1
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China {xfpan,jtli,zhyd,ts,caojuan}@ict.ac.cn 2 Graduate School of Chinese Academy of Sciences, Beijing 100080, China
Abstract. In this paper a robust video content retrieval method based on spatiotemporal features is proposed. To date, most video retrieval methods use the characteristics of video key frames. This kind of frame-based method is not robust enough across different video formats. With our method, the temporal variation of visual information is represented using a spatiotemporal slice. The DCT is then used to extract features from the slice. With this kind of feature, a robust video content retrieval algorithm is developed. The experiment results show that the proposed feature is robust to variations in video format.
1 Introduction
With the advances in multimedia and Internet applications, techniques for video retrieval are in increasing demand. The existing approaches to clip-based retrieval with visual cues are mainly based on image retrieval techniques. The most common of these are (1) to use color histograms of key frames, (2) to match key frames using color and texture combined with motion information, and (3) to use correlation between frames using ordinal measures [1]. As pointed out in [2], ordinal-measure-based techniques give better performance compared with color- or motion-based methods. But the number of partitions is critical when the display format of the content has been changed [3]. It is known that two video clips with the same content but compressed in different formats may have distinct color or texture characteristics. The color- or texture-based features used in image matching are therefore no longer suitable for retrieving the same video content in a different format. Due to this consideration, in this paper we present a retrieval method based on the spatiotemporal slice for matching video clips with the same content but in different formats. A spatiotemporal slice is a collection of scans in the same position of every frame, which indicates the coherency of the video [5]. Ngo used the color and texture of spatiotemporal slices for video clustering and retrieval in [4], but retrieval of video content in different formats was not discussed in Ngo's paper. In this paper, we will develop a method using the low-frequency AC DCT coefficients of spatiotemporal slice blocks to match video clips with the same content but compressed in different formats.
2 Clips Retrieval
The horizontal slice, which is the collection of scans at the middle row of every frame, is used in our method. The slice is converted into a gray image and the height of the slice is resized to N × N. Then the resized gray slice is segmented into N blocks. By analyzing these blocks with the discrete cosine transform (DCT), the variation information of the video clip is generated. A sample grayed slice and its blocks are shown in Figure 1. In this way we can transform the clip retrieval problem into a block
Fig. 1. Grayed slice and blocks from a video clip (n=32)
sequence matching problem. Because most of the DCT energy is compacted in the low-frequency AC DCT coefficients, the first P low-frequency AC DCT coefficients of each block are gathered to form a feature vector for the blocks and the video content. Let CQ = (Q1, Q2, · · ·, Qm) and CT = (T1, T2, · · ·, Tn) (m ≤ n) denote the feature vector groups of the query clip Q and the target clip T. Qi = <Qi[0], ..., Qi[P − 1]> denotes the feature vector of the ith block. We use the Video Feature Vector Similarity (VFVS) to locate Q in T. The matching process consists of the following steps. (1) Shift CQ along CT from the beginning of CT. Let CTi = (Yi, Yi+1, · · ·, Yi+m−1) denote the part of CT that has the same block number as CQ, starting at block i (i ≤ n − m + 1). (2) Compute the VFVS between CQ and CTi using:

VFVS(CTi, CQ) = 1 − Σ_{i=1}^{L} d(CTi, CQ) / Σ_{i=1}^{L} (abs(CTi) + abs(CQ))   (1)

where d(·) is the distance metric defined on feature vectors (here the L1 distance is applied), and abs(·) is the sum of the absolute values of the elements of a feature vector. (3) The local maxima of the VFVS(CQ, CTi) (i = 0, · · ·, n − m + 1) values which are above a certain threshold are taken as matches.
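A simplified NumPy sketch of the block feature extraction and VFVS matching described above follows; frame decoding, the resizing step, the coefficient ordering, and the value of P are assumptions, and local-maximum detection is reduced to simple thresholding.

# Sketch: slice-block DCT features and VFVS-based clip location.
import numpy as np
from scipy.fft import dct

def block_features(slice_gray, n_blocks=32, p=8):
    # slice_gray: 2-D array, one row per frame (middle scanline of each frame).
    blocks = np.array_split(slice_gray, n_blocks, axis=0)
    feats = []
    for b in blocks:
        coeffs = dct(dct(b, axis=0, norm="ortho"), axis=1, norm="ortho")
        flat = coeffs.flatten()          # crude ordering; a zig-zag scan is omitted
        feats.append(flat[1:p + 1])      # first P low-frequency AC coefficients
    return feats

def vfvs(ct_window, cq):
    # Eq. (1): one minus the summed L1 distances over corresponding blocks,
    # normalised by the summed absolute feature magnitudes.
    num = sum(np.abs(t - q).sum() for t, q in zip(ct_window, cq))
    den = sum(np.abs(t).sum() + np.abs(q).sum() for t, q in zip(ct_window, cq))
    return 1.0 - num / den if den else 0.0

def locate(cq, ct, threshold=0.4):
    # The paper keeps local maxima above the threshold; here every position
    # whose score exceeds the threshold is returned for simplicity.
    m = len(cq)
    return [i for i in range(len(ct) - m + 1) if vfvs(ct[i:i + m], cq) >= threshold]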
3 Experiments and Discussion
In this part we present the effectiveness of the proposed algorithm with experimental results.
Firstly, a 30-minute MPEG-1 video is divided into 24 equal clips and the boundaries of each query clip are multiples of the block size N. Then each clip is taken as a query clip Q. The algorithm proposed in Section 2 is used to locate Q in the original video T. In the experiments, all clips are correctly located. The similarity between Q and the sub-sequences of T is presented in Table 1.

Table 1. Query clips locating with aligned block boundary

Similarity   Ssame   Sdiff
Mean         1.0     0.232157
Stdev.       0.0     0.041292
Note: Ssame: similarity between the query clips and the located clips having the same content; Sdiff: similarity between the query clips and other clips having different content; Mean is the average value of the similarity; Stdev. is the standard deviation of the similarity.

Table 2. Query clips locating with unaligned block boundary

Similarity   Ssimi      Sdiff
Mean         0.573746   0.226036
Stdev.       0.169228   0.033999
Note: Ssimi: the larger of the similarity values between the query clips and the clips starting at frame 32m and 32(m+1); Sdiff: similarity between the query clips and other clips.
Because the feature used in this paper is slice-block based, in practice the block boundaries of query clips may not be aligned with the block boundaries of the target video. In the experiments, the block size is set as N = 32. We chose the sub-sequences of the target clip T starting at frame 32m + i (i = 0, 1, 2, ..., 31) as query clips Q, so the block boundaries of the query clips Q and the target clip T are not aligned. The similarity between Q and the sub-sequences of T is presented in Table 2. Furthermore, the minimum of Ssame is 0.40695, which is larger than Mean + 4 × Stdev of Sdiff in Table 1. The similarity between Q and the sub-sequences of T is assumed to obey a Gaussian distribution; according to its properties, the proposed method can distinguish content identical to Q from different content with probability larger than 0.9999. Four query clips Q are transcoded into 3 different formats. The average similarity between Q and the reformatted clips is presented in Table 3. The threshold of similarity for clip matching is set at 0.4 to locate the position of the query clip Q in the target clip T. In the experiments we locate all 12 reformatted clips in T with no false positives.
Table 3. Similarity between original clips and reformatted clips

Reformatted clip      Similarity
352 × 288 in AVI      0.88623
320 × 180 in AVI      0.87302
320 × 180 in MPEG1    0.90397

4 Conclusion and Future Work
A robust video content retrieval method is proposed in this paper. The proposed method adopts a novel feature extraction scheme for video content representation. It differs from most existing video retrieval methods in that it no longer uses the characteristics of every single frame. The spatiotemporal features extracted with the slice DCT are sensitive to variations in clip content and robust to changes in video format. With this kind of feature, we develop a robust video clip retrieval algorithm. The experiment results show that the proposed feature is robust across different video formats. There are still improvements to make to our method. For example, the feature could be more compact, and the feature vector matching method could be more efficient. The next target is to revise the method in this paper for large-scale video data. Acknowledgements. This work is supported by the Key Project of Beijing Natural Science Foundation (4051004), and Beijing Science and Technology Planning Program of China (D0106008040291, Z0004024040231).
Combination of Document Priors in Web Information Retrieval
Jie Peng and Iadh Ounis
Department of Computing Science, University of Glasgow, United Kingdom
{pj, ounis}@dcs.gla.ac.uk
Abstract. Query independent features (also called document priors), such as the number of incoming links to a document, its PageRank, or the length of its associated URL, have been explored to boost the retrieval effectiveness of Web Information Retrieval (IR) systems. The combination of such query independent features could further enhance the retrieval performance. However, most current combination approaches are based on heuristics, which ignore the possible dependence between the document priors. In this paper, we present a novel and robust method for combining document priors in a principled way. We use a conditional probability rule, which is derived from Kolmogorov’s axioms. In particular, we investigate the retrieval performance attainable by our combination of priors method, in comparison to the use of single priors and a heuristic prior combination method. Furthermore, we examine when and how document priors should be combined.
1 Introduction
In Information Retrieval (IR), a document can have query-dependent and query-independent features. Query-dependent features relate to the characteristics of the document, which are specific to the queries and cannot be used before we receive the queries (e.g. the relevance of the document content to a given query). Query-independent features, also referred to as document priors, are features that do not depend on the queries. These document priors can be used to enhance the retrieval performance of a Web IR system, regardless of the query. For example, the number of incoming links to a document (Inlinks), its PageRank, or the length of its associated URL have been shown to be useful in some Web search tasks, such as the Homepage finding and Named Page finding tasks [2] [3]. The language modelling approach to IR provides an elegant framework to integrate single document priors into the retrieval process [3]. However, it is not clear how several document priors should be combined in a principled way. Indeed, most previous work considered either combining document prior probabilities in a heuristic way, usually assuming that document priors are independent from each other [3], or handtuning a linear combination of the priors [4]. Indeed, documents with a high PageRank score usually have a high number of incoming links, suggesting that the PageRank and Inlinks priors are often correlated. In addition, handtuning a linear combination of prior scores is heuristic, and not
very practical in a realistic setting, where relevance judgements are not always available. In this paper, we present a novel and robust method for combining document priors in a principled way. We use a conditional probability rule, which is derived from Kolmogorov’s axioms. The objective of the paper is two-fold: Firstly, we examine how effective our proposed method for the combination of priors is compared with the usually adopted heuristic approach. Secondly, we investigate whether the combination of document priors leads to a better retrieval performance compared to a baseline with and without the use of single priors. In particular, we examine when document priors should be combined.
2 Use of Single Priors in Language Modelling
In language modelling, the probability P(D|Q) of a document D being generated by a query Q is estimated as follows [1]:

P(D|Q) = \frac{P(D) \cdot P(Q|D)}{P(Q)}    (1)
P(D) is a document prior probability. P(Q) can be ignored since it does not depend on the documents and, therefore, does not affect the ranking of documents. P(Q|D) is given by [1]:

P(Q|D) = \sum_{i=1}^{n} \log\left(1 + \frac{\lambda \cdot tf(t_i,d) \cdot \sum_t cf(t)}{(1-\lambda) \cdot cf(t_i) \cdot \sum_t tf(t,d)}\right)

where λ is a constant between 0 and 1, and n is the number of query terms. tf(t_i, d) is the term frequency of query term t_i in a document d; \sum_t tf(t,d) is the length of document d, i.e. the number of tokens in the document; cf(t_i) is the term frequency of query term t_i in the collection, and \sum_t cf(t) is the total number of tokens in the collection. In the above language modelling approach, the document prior P(D) usually refers to a single document prior probability. This prior can be omitted from Equation (1), if all documents have a uniform prior. However, it is possible to consider multiple priors for each given document. In this case, it is important to combine the document prior probabilities in a principled way, taking into account the possible dependence between the considered document priors. In the next section, we propose a novel method for appropriately combining document priors.
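As a sketch of how a single prior enters this ranking formula (computed in log space, with P(Q) dropped because it is rank-invariant), assuming simple dictionary-based term statistics; the function and parameter names are illustrative, and λ = 0.9 is the value used in the paper's experiments:

```python
import math

def lm_score(query_terms, doc_tf, coll_cf, lam=0.9, prior=1.0):
    """Rank-equivalent score for one document: log P(D) plus the query
    log-likelihood of Section 2, i.e.
    sum_i log(1 + lam*tf(ti,d)*sum_t cf(t) / ((1-lam)*cf(ti)*sum_t tf(t,d)))."""
    doc_len = sum(doc_tf.values())      # sum_t tf(t, d): tokens in d
    coll_len = sum(coll_cf.values())    # sum_t cf(t): tokens in the collection
    score = math.log(prior)             # the document prior P(D)
    for t in query_terms:
        tf, cf = doc_tf.get(t, 0), coll_cf.get(t, 0)
        if cf == 0 or doc_len == 0:
            continue
        score += math.log(1.0 + (lam * tf * coll_len) /
                          ((1.0 - lam) * cf * doc_len))
    return score
```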
3 Combination of Multiple Document Priors
For combining document priors, most of the current approaches either assume that the document priors are independent [3], or handtune a linear combination of the priors [4]. In the case the priors are assumed to be independent, the following formula is often used to combine two document priors p1 and p2:

P(D)_{p_1 \oplus p_2} = P(D)_{p_1} \cdot P(D)_{p_2}    (2)
where P(D)_{p_1} is the document prior probability related to prior p1; P(D)_{p_2} is the document prior probability related to prior p2; and P(D)_{p_1 \oplus p_2} is the document prior probability referring to the combination of both priors p1 and p2.
However, as mentioned in Section 1, the document priors are not necessarily independent. Therefore, we propose a different approach for combining the prior probabilities. We use a conditional probability rule that is based on Kolmogorov's axioms, given as follows:

P(D)_{p_1 \oplus p_2} = P(P(D)_{p_2} \mid P(D)_{p_1}) \cdot P(D)_{p_1}    (3)
where P (D)p1 is the document prior probability related to prior p1 , called the base prior probability; P (P (D)p2 |P (D)p1 ) is the conditional probability related to prior p2 , given the prior p1 ; P (D)p1 ⊕p2 is the joint probability of both priors p1 and p2 occurring. Note that the above conditional probability rule can be easily extended to more than two priors. P (P (D)p2 |P (D)p1 ) can be estimated from a set of relevance judgements as follows: Firstly, we divide the prior probability P (D)p1 into several equal size bins on a logscale. Secondly, inside each bin, we divide the prior probability P (D)p2 into several equal size subset bins, again on a logscale. Finally, the conditional probability of P (P (D)p2 |P (D)p1 ) in each subset bin is the number of target documents divided by the number of documents in that subset bin.
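A sketch of this binning estimate, assuming equal-width bins on a log scale and a boolean array marking the target documents; the exact bin counts and the definition of the target set follow the paper's training setup and are not specified here, so they appear as parameters:

```python
import numpy as np

def log_bins(values, n):
    """Assign each value to one of n equal-width bins on a log scale."""
    logs = np.log(values)
    edges = np.linspace(logs.min(), logs.max(), n + 1)[1:-1]
    return np.digitize(logs, edges)

def conditional_prior(p1, p2, is_target, n_bins=8, n_subbins=8):
    """Estimate P(P(D)_p2 | P(D)_p1) by the two-level binning described above.
    p1, p2    : per-document prior probabilities
    is_target : boolean flags marking the target documents in the judgements
    Returns cond[i, j] = fraction of target documents in sub-bin j of bin i."""
    p1, p2, is_target = map(np.asarray, (p1, p2, is_target))
    b1 = log_bins(p1, n_bins)
    cond = np.zeros((n_bins, n_subbins))
    for i in range(n_bins):
        in_bin = (b1 == i)
        if not in_bin.any():
            continue
        b2 = log_bins(p2[in_bin], n_subbins)
        for j in range(n_subbins):
            sel = in_bin.copy()
            sel[in_bin] = (b2 == j)
            if sel.any():
                cond[i, j] = is_target[sel].sum() / sel.sum()
    return cond
```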
4 Experiments and Analysis
In this paper, we consider four well-established document priors, namely Page-Rank (PR), information-to-noise ratio (ITN) [5], document length (DL), and the document URL score [3]. We use the standard .GOV Web test collection, and its corresponding TREC 2003 and TREC 2004 Homepage and Named Page finding topic and relevance assessment sets. The official evaluation measure for both tasks is the Mean Reciprocal Rank (MRR). Firstly, we assess the performance of each of the four single priors (see Table 1). Our baseline (BL) is a language modelling approach, where all documents have a uniform prior probability. From Table 1, we can see that, in general, the single document priors can improve the retrieval performance on the used tasks. The only exception is the ITN prior, which leads to a degradation of the retrieval performance in most cases. Secondly, we investigate the combination of every pair of priors using our proposed combination approach, and compare it to the performance of the corresponding single priors. Note that the used base prior probability is important in Equation (3). From Table 1, we observe that several combinations of document priors lead to an enhanced MRR score, when we use an effective document prior as base. In particular, combining the best single priors usually leads to an enhanced retrieval performance, compared to their single use. Finally, we compare our proposed method to a heuristic combination approach, where the priors are assumed to be independent. From Table 1, we observe that our combination way generally outperforms the heuristic method,
Table 1. MRR for the Named Page and Homepage tasks. We use λ = 0.9 in all experiments. The best retrieval performance is highlighted in bold, and the base prior probabilities are highlighted in italic. Runs statistically different from the best run (Wilcoxon Matched-Pairs Signed-Ranks Test, p < 0.05) are underlined. Note that for lack of space, only the most commonly used priors for each task are combined.

Named Page Finding (MRR)       TREC 2003   TREC 2004
BL                             0.4366      0.3533
BL+PR                          0.4539      0.3588
BL+DL                          0.4546      0.4116
BL+ITN                         0.4186      0.3583
Our Proposed Method
BL+PR+DL                       0.3730      0.3117
BL+DL+PR                       0.4732      0.4365
BL+PR+ITN                      0.3755      0.3098
BL+ITN+PR                      0.4894      0.4021
BL+DL+ITN                      0.4787      0.4130
BL+ITN+DL                      0.4377      0.3495
Priors Independence Assumption Method
BL+PR+DL                       0.4674      0.4065
BL+PR+ITN                      0.4815      0.3867
BL+DL+ITN                      0.4232      0.3704

Homepage Finding (MRR)         TREC 2003   TREC 2004
BL                             0.2363      0.1200
BL+PR                          0.4339      0.3558
BL+URL                         0.4738      0.3976
BL+ITN                         0.1980      0.0980
Our Proposed Method
BL+PR+URL                      0.5247      0.4062
BL+URL+PR                      0.5424      0.4446
BL+PR+ITN                      0.4059      0.3385
BL+ITN+PR                      0.3889      0.3696
BL+URL+ITN                     0.4729      0.4133
BL+ITN+URL                     0.4615      0.3508
Priors Independence Assumption Method
BL+PR+URL                      0.5409      0.4110
BL+PR+ITN                      0.3526      0.3166
BL+URL+ITN                     0.4551      0.3470
when the best of the two combined document priors is used as the base prior. The only exception is related to the ITN prior, when it is used as a base prior to combine with the PageRank or URL prior; this combination seems to work very well. Further investigation is required to understand the behaviour of ITN. Overall, our proposed technique can always outperform the heuristic method. The above results are consistent across both retrieval tasks used. In addition, we observe that, except for the TREC 2003 Named Page finding task, using the two best single priors leads to the best overall MRR performance.
5 Conclusion
We have investigated the retrieval performance attainable with query-independent features, in the form of document prior probabilities on two Web search tasks, using a standard Web test collection. We showed that our proposed conditional combination method increases the retrieval performance over the respective single priors, when we use the two best-performing single priors. In addition, we observed that our technique can always outperform a heuristic method, which assumes the independence of priors.
References 1. Hiemstra, D.: Using Language Models for Information Retrieval. PhD thesis. (2001) 2. Kamps, J., Mishne, G., de Rijke, M.: Language Models for Searching in Web Corpora. In Proc. of TREC 2004, Gaithersburg, MD, (2004) 3. Kraaij, W., Westerveld,T., Hiemstra, D.: The Importance of Prior Probabilities for Entry Page Search. In Proc. of SIGIR 2002, Finland, (2002) 4. Metzler, D., Strohman, T., Zhou, Y., Croft, W.B.: Indri at TREC 2005: Terabyte Track. In Proc. of TREC 2005, Gaithersburg, MD, (2005) 5. Zhu X.L., Gauch, S.: Incorporating Quality Metrics in Centralized / Distributed Information Retrieval on the WWW. In Proc. of SIGIR 2000, Athens, (2000)
Enhancing Expert Search Through Query Modeling
Pavel Serdyukov¹, Sergey Chernov², and Wolfgang Nejdl²
¹ Database Group, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
² L3S / University of Hannover, Appelstr. 9a, D-30167 Hannover, Germany
[email protected], {chernov, nejdl}@l3s.de
Abstract. Expert finding is a very common task among enterprise search activities, while its usual retrieval performance is far from the quality of Web search. Query modeling helps to improve traditional document retrieval, so we propose to apply it in a new setting. We adopt a general framework of language modeling for expert finding, and show how expert language models can be used for advanced query modeling. A preliminary experimental evaluation on the TREC Enterprise Track 2006 collection shows that our method improves the retrieval precision on the expert finding task.
1 The Expert Finding Task

New challenges for the information retrieval research community are posed by the emerging field of Enterprise Search [2]. The diversity of complex information needs within a typical enterprise, together with the heterogeneity of Intranet data, makes it difficult to improve the quality of search in general. Instead, researchers concentrate on several important search tasks. One important example of such a task is finding a relevant expert within an organization. This problem implies that a user needs to find the most knowledgeable expert to answer her query personally. The user submits several keywords to a local Intranet search engine and receives a set of experts, ranked by their likelihood of being an expert for the query. The current developments in expert search are driven by the Expert Finding task within the TREC 2006 Enterprise Track initiative¹. So far, one of the most comprehensive descriptions of the problem and of possible solutions using the language modeling approach is presented in [1]. We also adopt a theoretically sound language modeling method, while using different techniques for model estimation and ranking. Numerous ad-hoc query expansion and language model based query modeling methods operate on the top-k ranked documents. At the same time, nobody has applied these methods in the scope of the expert finding task, which appears to be an omission in our opinion. Our algorithm performs query modeling which consists of pseudo-relevance feedback and query expansion. To the best of our knowledge, this is the first study of query modeling applied to the expert search task. The preliminary evaluation on the official TREC Enterprise Track 2006 test collection shows that our method improves the retrieval performance.
¹ http://www.ins.cwi.nl/projects/trec-ent/wiki/index.php
2 Expert Finding as a 2-Step Process

A comprehensive description of a language modeling approach to the expert finding task is presented in [1]. We adopt the notation from this work and omit some details of model estimation; the interested reader can refer to the original paper. Step 1 of our expert finding method is similar to the Model 1 approach from [1], while Step 2 contains the actual refinement and is essentially the core of our proposal.

2.1 Step 1: Using Language Model for Expert Ranking

The basic idea of language modeling is to estimate a language model for each expert, and then to rank experts by the cross-entropy of the estimated query model w.r.t. the expert language model [3]. In our setup, each document d in the collection is associated with a candidate ca, and the association is defined as a(d, ca). The expert finding problem, according to the probability ranking principle in IR, is rephrased as: "What is the probability of a candidate ca being an expert given the query q?" Each candidate ca is represented by a multinomial probability distribution p(t|ca) over a term vocabulary. The expert language model θca is computed as the maximum likelihood estimate of the term generation probability, smoothed by the background language model. The query q is also represented by a probability distribution p(t|q), and the query language model is denoted as θq. So, the system output should contain the ranking of candidates in descending order of the cross-entropy between the language models θq and θca. The cross-entropy of the query w.r.t. the expert models is computed as shown in Eq. 1:

ExpertScore_{ca}(q) = -\sum_{t \in q} p(t|\theta_q) \log p(t|\theta_{ca})    (1)
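The following is a minimal sketch of this Step 1 scoring. The dictionary-based inputs and the Jelinek–Mercer-style mixture with weight mu are assumptions made for illustration, since the paper only states that the candidate model is smoothed with the background model without fixing the exact form:

```python
import math
from collections import defaultdict

def expert_scores(query_model, assoc, doc_terms, background, mu=0.5):
    """ExpertScore_ca(q) = -sum_{t in q} p(t|theta_q) * log p(t|theta_ca), Eq. (1).

    query_model : {term: p(t|theta_q)}
    assoc       : {candidate: [doc_id, ...]}, the associations a(d, ca)
    doc_terms   : {doc_id: {term: count}}
    background  : {term: collection probability}, used for smoothing
    """
    scores = {}
    for ca, docs in assoc.items():
        counts = defaultdict(int)
        for d in docs:
            for t, c in doc_terms[d].items():
                counts[t] += c
        total = sum(counts.values()) or 1
        score = 0.0
        for t, p_q in query_model.items():
            # maximum likelihood estimate smoothed with the background model
            p_ca = (1 - mu) * counts[t] / total + mu * background.get(t, 1e-9)
            score += -p_q * math.log(p_ca)
        scores[ca] = score
    return scores
```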
The top-k experts with the highest scores are returned to the system (not to the user) as a result of Step 1, where k is set empirically. So far we have described the state-of-the-art approach, while Step 2 contains our enhancement for expert search.

2.2 Step 2: Expert Ranking Refinement Using Query Modeling

In order to model a user query more precisely, we need a source of additional knowledge about her information need. Traditionally, the top-k documents for the query have served in IR as such a source and were used to build an expanded and detailed query model. Expert search is a task which differs noticeably from standard document retrieval. Users search not for specific pieces of information, but for people who are actually generators and/or collectors of the information. This means that even though the query can be very specific, the experts in this topic may have expertise in related topics too. Moreover, the broader their expertise, the higher the chances that they can consult on a more specific question. Therefore, we need to utilize two sources of evidence about the user information need in the context of the expert finding task:
1. The top-k documents retrieved from the whole collection (using the classic LM approach to document retrieval).
2. The top-k persons which we could consider relevant experts (retrieved in Step 1).
The first source enriches our knowledge about the initial user information need, whereas the second one makes it less specific and relaxes the query towards a broader topic. So, as a new query model we use a mixture of two query models: a document-based model (DocumentBasedNewθq) built on the top-k documents and an expert-based model (ExpertBasedNewθq) built on the top-k experts:

p(t|New\theta_q) = \lambda \, p(t|DocumentBasedNew\theta_q) + (1-\lambda) \, p(t|ExpertBasedNew\theta_q)    (2)

For estimating both query models, instead of the methods proposed in [1], we use a principled and theoretically sound method by Zhai and Lafferty from [3], which in our previous experiments for distributed IR outperformed other similar algorithms. Once it is computed, we mix the new query model with the initial query to prevent topic drift. As a result, we build a new expert ranking using the expanded query and the term generation probabilities. In Eq. 3 we again measure a cross-entropy, but using the new query model:

NewExpertScore_{ca}(q) = -\sum_{t \in q} p(t|New\theta_q) \log p(t|\theta_{ca})    (3)
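A sketch of the query-model mixture of Eq. (2); the {term: probability} dictionaries and the value of lambda are illustrative, and the actual models are estimated with the method of Zhai and Lafferty [3], which is not reproduced here. The expert_scores sketch from Step 1 can then be reused with the expanded model to compute Eq. (3).

```python
def mix_query_models(doc_based, expert_based, lam=0.5):
    """p(t|New theta_q), the mixture of Eq. (2); both inputs are
    {term: probability} models built from the top-k documents and the
    top-k experts of Step 1, and lam is the mixture weight lambda."""
    terms = set(doc_based) | set(expert_based)
    return {t: lam * doc_based.get(t, 0.0) + (1 - lam) * expert_based.get(t, 0.0)
            for t in terms}

# Step 2 then re-ranks the candidates with the expanded model (Eq. 3), e.g.:
# new_scores = expert_scores(mix_query_models(doc_q, exp_q),
#                            assoc, doc_terms, background)
```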
3 Preliminary Results and Discussion

In our experiments we used the W3C collection, provided by the TREC 2006 Enterprise Track, and the Lucene² open source information retrieval library. We indexed the mailing lists of the W3C dataset³ and searched for the Title query part of the official topics of the Expert Finding task 2006. The comparison between the precision at the first 10 results (P@10) of the baseline method (Step 1 only) and our method (Step 1 and Step 2) is presented in Fig. 1 and Fig. 2.
Fig. 1. Performance of the baseline language modeling ranking and query modeling approach
² http://lucene.apache.org/
³ For a rapid experimental setup we used only the mailing list part, while we are planning to evaluate our method on the whole collection later.
Fig. 2. Difference in performance of the baseline language modeling ranking and query modeling approach
We observe that the improvement of our method is promising, while not significant in the current experiment. Our method is effective when the average precision is already high at Step 1, and fails where the average precision is below the median. This is explainable, since our method uses the best top-k experts and documents from Step 1 for the subsequent query modeling: if the initial ranking is poor, the query modeling is poor too. However, the precision for the best queries was improved by 10-20%, so this method is suitable to apply on top of already effective retrieval systems. It appears that a prediction of query performance could be crucial for query modeling. The further study of expert-search-specific query modeling and the prediction of query performance is the main focus of our future research.
References 1. K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In SIGIR ’06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, USA, pages 43–50. ACM Press, 2006. 2. D. Hawking. Challenges in enterprise search. In Proceedings of the Australasian Database Conference ADC2004, pages 15–26, Dunedin, New Zealand, 2004. 3. C. Zhai and J. D. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM’01: Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, November 5-10, 2001, pages 403–410, 2001.
A Hierarchical Consensus Architecture for Robust Document Clustering
Xavier Sevillano, Germán Cobo, Francesc Alías, and Joan Claudi Socoró
Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona (Spain)
{xavis,gcobo,falias,jclaudi}@salle.url.edu

Abstract. A major problem encountered by text clustering practitioners is the difficulty of determining a priori which is the optimal text representation and clustering technique for a given clustering problem. As a step towards building robust document partitioning systems, we present a strategy based on a hierarchical consensus clustering architecture that operates on a wide diversity of document representations and partitions. The conducted experiments show that the proposed method is capable of yielding a consensus clustering that is comparable to the best individual clustering available even in the presence of a large number of poor individual labelings, outperforming classic non-hierarchical consensus approaches in terms of performance and computational cost.
1 Introduction
The low availability of labeled document collections has made document clustering techniques become a necessary tool to organize unlabeled corpora according to their thematic contents. However, finding a thematically meaningful partition of an unlabeled document collection is not a straightforward task. This is mainly due to the difficulty of blindly choosing the optimal document representation and clustering method that, for a given clustering problem, ensure the best match between the true labeling of the documents and the partitioning results. That is, the performance of document clustering systems requires finding representations of documents that reflect their contents to a maximum extent. Unfortunately, it is difficult to determine a priori the optimal type of representation and its dimensionality given a particular document clustering problem. We call this situation the data representation dependence effect. Moreover, the application of different clustering methods on the same data often yields different partitions. This gives rise to what we call the algorithm dependence effect. Due to both effects, obtaining the optimal partition of an unlabeled corpus on a single run of a clustering algorithm fed by a specific text representation is a rather challenging aim. Allowing for these circumstances, in this work we define a generic framework for deriving a robust partition of a document collection by building a cluster ensemble upon a wide range of text representations and partitions, following a hierarchical consensus strategy. This proposal is a step towards setting text clustering practitioners free from the obligation of blindly choosing a single document representation and clustering technique.
2 Hierarchical Consensus Clustering Architecture
The hierarchical consensus clustering architecture is a modular and flexible proposal that allows to obtain a robust partition of the document collection subject to clustering. In this paper, it is assumed that the only knowledge available is the number of clusters we want to group the documents in (i.e. the expected number of thematic categories). Figure 1 depicts the specific implementation of the hierarchical consensus clustering architecture employed in this work.

Fig. 1. Hierarchical consensus clustering architecture using k-means (KM) clustering and document representations based on terms, LSA, ICA and NMF
In this paper, the document corpus is initially represented in a term-based vector space (term-by-document matrix X), using the tfidf weighting scheme [4]. However, it is a commonplace that the efficiency of clustering systems can be improved by the use of document representations based on feature extraction techniques [5]. Therefore, alternative document representations are derived by applying Latent Semantic Analysis (LSA) [1], Independent Component Analysis (ICA) [2] and Non-negative Matrix Factorization (NMF) [3], with dimensionalities D ranging from Dmin to Dmax. The interval [Dmin, Dmax] should be wide enough so that the optimal dimensionality of each representation is included in it, but need not be equal for all the representations. Subsequently, diverse partitions are created by running the k-means (KM) algorithm Nruns times with random centroid initialization on each distinct representation (a more generic strategy for introducing algorithm diversity at this stage would consist in applying distinct clustering algorithms on the same data), yielding what we call individual clusterings, denoted as λ_1, ..., λ_{Nruns} in figure 1. These clusterings form a cluster ensemble which is fed into a first consensus stage, which possibly allows us to overcome the algorithm dependence effect. As a result, a consensus labeling for each distinct document representation is obtained (e.g. for LSA, {λ_{LSA,Dmin}, ..., λ_{LSA,Dmax}}, or λ_{terms} for terms). Next, the labelings corresponding to each feature extraction based document representation method are fed into a second consensus stage, which yields a single labeling per representation scheme (i.e. λ_{LSA}, λ_{ICA} and λ_{NMF} in figure 1). Finally, these labelings (together with λ_{terms}) are used for building the final consensus clustering λ_F. Hopefully, λ_F will be very similar to the best individual clustering available, thus overcoming the data representation dependence effect.
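A compact sketch of the staged consensus just described, assuming the individual k-means labelings have already been computed and that consensus_fn stands for any consensus function over a list of labelings (e.g. CSPA, HGPA or MCLA [6], not implemented here); the data layout is an illustrative assumption:

```python
def hierarchical_consensus(individual, consensus_fn):
    """Staged combination of the individual k-means labelings.

    individual   : {(representation, dimensionality): [labeling, ...]},
                   one labeling per k-means run
    consensus_fn : any consensus function over a list of labelings
                   (e.g. CSPA, HGPA or MCLA [6]); not implemented here
    """
    # Stage 1: one consensus labeling per representation and dimensionality
    stage1 = {key: consensus_fn(runs) for key, runs in individual.items()}

    # Stage 2: one consensus labeling per representation scheme
    per_scheme = {}
    for (rep, _dim), labeling in stage1.items():
        per_scheme.setdefault(rep, []).append(labeling)
    stage2 = {rep: labelings[0] if len(labelings) == 1 else consensus_fn(labelings)
              for rep, labelings in per_scheme.items()}

    # Final stage: consensus across representation schemes (lambda_F)
    return consensus_fn(list(stage2.values()))
```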
3 Experiments and Discussion
Experiments have been conducted on the miniNewsgroups corpus, a subset of the 20 Newsgroups document collection that contains 100 documents from each newsgroup. More specifically, six categories of the miniNewsgroups corpus have been used in the following experiments (comp.graphics, rec.autos, sci.crypt, misc.forsale, talk.politics.misc and talk.religion.misc). In order to build the consensus clusterings, we have compared the following consensus functions: Cluster-Similarity Partitioning Algorithm (CSPA), HyperGraph Partitioning Algorithm (HGPA) and Meta-Clustering Algorithm (MCLA) [6]. The tunable parameters of the hierarchical architecture depicted in figure 1 have been set to Dmin = 2, Dmax = 50 and Nruns = 30, hence giving rise to a total of 4440 individual clusterings. Clustering results are evaluated in terms of the F1 measure [4] with respect to the true category membership of each document. Figure 2a presents the F1 measure of the individual and consensus clusterings at the first stage of the hierarchy on a particular case (LSA with D = 20). It can be observed that fairly diverse F1 measure values are achieved across the 30 KM algorithm runs, which somehow emulates the algorithm dependence effect. However, the MCLA and CSPA consensus functions are capable of keeping close to the best individual clusterings, while HGPA does not. In figure 2b, the results of building an LSA consensus clustering on the labelings output by the previous consensus stage are presented, illustrating the influence of the document representation dimensionality. Again, the MCLA and CSPA consensus functions yield λ_{LSA} clusterings which are very close or even slightly better than the best clustering input to the consensus function. Finally, the consensus λ_F is evaluated by means of the F1 measure histogram depicted in figure 2c, thus comparing the final labeling output by each consensus function to each one of the 4440 individual clusterings. Once more, the MCLA and CSPA are the best performing consensus functions. But more importantly, their ability to keep track of the best input clusterings is notable, as there are only 11 (out of 4440) individual clusterings superior to the λ_F consensus clustering derived through MCLA (19 for CSPA).

Fig. 2. F1 measure of consensus clusterings: (a) across 30 runs of the KM algorithm, (b) across the [Dmin = 2, Dmax = 50] dimensionality range, and (c) F1 histogram comparison between the final consensus and individual clusterings

In other words, the proposed hierarchical consensus clustering architecture has yielded (using the MCLA or CSPA consensus functions) a final clustering that is better than 99.5% of the individual clusterings, achieving an F1 measure that, in relative terms, is
only a 5.5% worse than the best individual clustering. These results reflect high robustness against the data representation and algorithm dependence effects. Deeper evaluation requires comparing the proposed hierarchical architecture with the classic flat consensus approach (i.e. operating on the 4440 individual clusterings at once). The execution of flat consensus via the MCLA function was halted due to huge memory requirements, whereas the computation of its hierarchical counterpart took 72 seconds of CPU time under Matlab R2006a on a PC P4 3GHz/1GB RAM. As regards hierarchical CSPA and HGPA consensus, their computation was 3 to 13 times faster than flat consensus. Moreover, our hierarchical proposal obtains higher F1 scores (relative increases as high as 38%). To conclude, this paper has presented a novel strategy for robust document clustering by means of an open hierarchical consensus clustering architecture that operates on highly diverse document representations and partitions. As a result, a final consensus labeling comparable to the best input clustering can be obtained, even in the presence of many poor clusterings. The proposed hierarchical architecture outperforms classic non-hierarchical consensus approaches in terms of both performance and computational cost.
References [1] Deerwester, S., Dumais, S.-T., Furnas, G.-W., Landauer, T.-K. and Harshman, R.: Indexing by Latent Semantic Analysis. Journal American Society Information Science, Vol. 6, Nr. 41 (1990) 391–407 [2] Kolenda, T., Hansen, L.K. and Sigurdsson, S.: Independent Components in Text. In: Girolami, M. (ed.): Advances in Independent Component Analysis. SpringerVerlag, Berlin Heidelberg New York (2000) 241–262 [3] Lee, D.D. and Seung, H.S.: Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature, 401, pp. 788–791 (1999) [4] Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1-47 (2002) [5] Shafiei, M., Wang, S., Zhang, R., Milios, E., Tang, B., Tougas, J. and Spiteri, R.: A Systematic Study of Document Representation and Dimension Reduction for Text Clustering. Technical Report CS-2006-05. Dalhousie University (2006) [6] Strehl, A. and Ghosh, J.: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. JMLR, Vol. 3, (2002) 583–617
Summarisation and Novelty: An Experimental Investigation
Simon Sweeney¹, Fabio Crestani¹, and David E. Losada²
¹ Dept. Computer and Information Sciences, University of Strathclyde, Glasgow, Scotland, UK
{simon, fabioc}@cis.strath.ac.uk
² Depto. de Electrónica y Computación, Universidad de Santiago de Compostela, Spain
[email protected]
1 Generating Novel Summaries
The continued development of mobile device technologies, their supporting infrastructures and associated services is important to meet the anytime, anywhere information access demands of today's users. The growing need to deliver information on request, in a form that can be readily and easily digested on the move, continues to be a challenge. Automatic text summarisation is a potential solution to achieving device-friendly content for devices that have limited display screens. An effective way to produce a short summary may be to include only novel information. However, producing a summary that only contains novel sentences (assuming we employ sentence extraction to build summaries) might imply a loss of context. In this paper we considered summarisation with novelty detection, where information is not only condensed but an attempt is also made to remove redundancy. We adopted two strategies to produce summaries that incorporate novelty in different ways: an incremental summary (SumNi) and a constant length summary (SumNc). We compared the performance of groups of users with each of the test systems. The aim was to establish whether a summary that contains only novel sentences provides a sufficient basis to determine the relevance of a document, or whether we need to include additional sentences in the summary to provide context. Key decisions made at the outset, which influence the production of summaries, relate to the number of summary levels and the length of summaries. We restrict the number of summary levels to 3, primarily to avoid overburdening users in the experimental tasks. In terms of summary length, for each document a number of sentences equal to 7% of its length (with a minimum of 2 sentences and a maximum of 6 sentences) was used [2]. Finally, we make use of an approach similar to NewWords in [1] as our first attempt to take account of novelty when building summaries. For a full description of the algorithm used to produce the experimental summaries please refer to an extended version of this paper to appear. The starting point for generating our novel summaries is an initial seed summary, Sum1, which is a query-biased summary. The query-biased summarisation
system used to produce the summaries for the experiment was the same as that described in [3]. The length of this query-biased summary, l1, is determined as a percentage of the original document length. Given a ranked set of sentences sr1, sr2, ..., srn (relevance-based ranking), Sum1 is composed of the top l1 sentences, ordered as they appear in the original document. Subsequent summaries are generated to include only novel information, and reflect previously seen summary content. To avoid the presentation of material that the user has already seen, the focus is on the sentences which, in the original (relevance-based) rank, were ranked right after the ones selected for Sum1. There are two different ways to produce the next summaries. The first method increases length (Ni) and increments the size of the next summary to be l2 = K * l1, where K = 2, for example, as is the case reported here. This method produces a new summary where all of the material which appeared in Sum1 is also present in SumNi2. The second method maintains a constant length (Nc) and takes a very different approach, producing a new summary, SumNc2, whose size l2 is equal to l1. The idea here is to avoid the presentation of material that the user has already seen, and instead focus on the sentences which, in the original (relevance-based) rank, were ranked right after the ones selected for Sum1. That is, SumNc2 will be composed of sentences selected from sr(l1+1), sr(l1+2), ..., srn. In contrast, the increasing length method includes both the new sentences and the material already seen, which we consider as the context. To estimate how novel the candidate sentences are, a history log composed of previously seen sentences is formed. Each candidate sentence has a relevance score greater than zero. Sentences with a zero relevance score are not included, to remove those sentences considered 'not relevant' which may be novel but off-topic with respect to the query. Next, a WordsSeen list is generated from the history log. The novelty score is based on the proportion of new words with respect to the WordsSeen list, compared to all words in the sentence. We compute this as the count of the number of new words divided by the sentence size, including only those words in the sentence that have been stopped and stemmed. To combine the novelty score with the relevance-based score we apply a weighting to the novelty score to emphasise novelty scoring over the previous scoring matrix for a sentence. The final score for a candidate sentence is then the sum of the novelty score and the existing relevance score. Candidate sentences are then ranked according to the combined score. On the basis of the score ranking and of the required size, a summary is produced. The top scoring candidate sentences form the final summary. The final stage of the process involves reordering the summary sentences according to their ordinal position in the original document.
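The novelty re-ranking just described can be sketched as follows; the weighting factor, tokenisation, stop list and stemmer are assumptions made for illustration, as the paper does not fix their exact values.

```python
def novelty_rerank(candidates, history, n_sentences, novelty_weight=2.0,
                   stopwords=None, stem=lambda w: w):
    """Combine relevance with a NewWords-style novelty score and pick the
    top sentences for the next summary.

    candidates : list of (sentence_text, relevance_score), relevance > 0
    history    : sentences already shown to the user (the history log)
    """
    stopwords = stopwords or set()
    tokenize = lambda s: [stem(w) for w in s.lower().split() if w not in stopwords]
    words_seen = {w for s in history for w in tokenize(s)}

    rescored = []
    for text, rel in candidates:
        words = tokenize(text)
        if not words:
            continue
        new_words = sum(1 for w in words if w not in words_seen)
        novelty = new_words / len(words)          # proportion of new words
        rescored.append((rel + novelty_weight * novelty, text))

    # top-scoring sentences form the summary; they would finally be re-ordered
    # by their original position in the document (offsets omitted here)
    return [text for _, text in sorted(rescored, reverse=True)[:n_sentences]]
```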
2 Experimental Investigation
The documents used in the experiment were taken from the AQUAINT collection and consisted of newswire stories. A total of 5 randomly selected TREC queries were used and, for each query, the 10 top-ranking documents served as an input for
summary generation. The experimental measures to assess the effectiveness of user relevance judgements were the time to complete the task, precision (P), recall (R) and decision-correctness (DC). We define DC as the sum of the number of documents marked correctly as relevant, plus the number of documents correctly marked as non-relevant, out of the total number of documents marked for that query. We recruited 20 users to form four experimental groups for the user study. Participants were recruited from members of staff and postgraduate students of the Department of CIS at the University of Strathclyde. For the experiment, each user was given 5 queries, and for each query, the top 10 retrieved documents. These 10 documents were represented as 5 documents summarised using a technique which included novelty, SumN, and 5 summarised using a baseline technique that did not use novelty detection, SumB, which were query-biased summaries. For each document there are three levels of summary, Sum1, Sum2, and Sum3. The experimental procedure can be described as follows. Following an initial briefing, users were presented with a list of 5 queries. The title and the description of each query (i.e., the 'title' and 'description' fields of the respective TREC topic¹) provided the necessary background to their 'information need' to allow users to make relevance judgements. For each query, an initial period was allowed to read and digest the query details. Following this, the first of the 10 documents was presented to users, and timing for that specific document started. Users were shown documents from the list where the content for a document consisted of the level 1, 2 and 3 summaries (e.g. SumNc1, SumNc2, and SumNc3). Having seen summary SumNc3, users were required to make a decision as to whether to mark the document as relevant, or non-relevant. After indicating their decision, users were presented with the first summary of the next document. The process was repeated until all queries had been evaluated. Once all query tasks were completed a simple online questionnaire was given to the users. The key quantitative data of interest, user decisions and the individual summary timing data, were recorded in log files. We now report results from the experiment. Due to restrictions of space we are unable to present a full analysis of all the data produced during the experimentation. Table 1 provides a view of the results in the context of the experimental methodology, depicting the allocation of users to groups and associated summary types.

Table 1. Average performance across all queries for the different summary types
Group   Type    DC      P       R       Time (secs)
1&4     SumBi   0.764   0.822   0.845   66
2&3     SumBc   0.768   0.850   0.798   53
2&3     SumNi   0.776   0.809   0.852   64
1&4     SumNc   0.760   0.803   0.752   63

¹ Examples of TREC topics are available at http://trec.nist.gov/data/testq eng.html

The results show a slight increase in DC and R performance with summaries that provide novelty with additional context, SumNi. For P, the baseline summary with a constant length, SumBc, performs best. However, the margins of improvement are somewhat minimal. Appropriate statistical tests found no significant difference in the overall results for the different approaches. Interestingly, the margin of difference in the time spent on SumNi compared to SumNc does not agree with what we might normally expect. A possible reason to explain the similarity in viewing times could be that users may skim the longer summaries, glancing over familiar parts, content already seen, and instead focusing on the new parts. The baseline summaries follow a more expected pattern, though again the margin of difference is small. A further observation from the table is the similarity in time spent viewing summaries between SumNi and SumNc, compared to the greater level of separation observed between SumBi and SumBc. It could be argued, then, that when we increase the size of the summary using the baseline approach the user takes more time to digest it, whereas the increasing length summary reads better if it was constructed using novelty.
3 Conclusions and Future Work
In conclusion, findings from the user study suggest that there is little difference in performance (DC, P and R) between novel summaries that include context (SumNi) and those that contain only novel information (SumNc). Therefore, for mobile information access, where issues of bandwidth and screen size are paramount, we can conclude that an effective way to produce a short summary is to build one that includes only novel information. However, the lack of improvement over the baseline does place doubt over the merit of building novel summaries and will require more investigation. Extensions to the work we have presented include investigating the performance of a more refined approach to novelty detection, beyond a simple count of new words. A further point of interest is to study the effects of permitting users to make decisions at any level, to investigate summary level preference and whether there is a corresponding impact on accuracy.
References 1. J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In Proceedings of ACM SIGIR’03, pages 314–321, Toronto, Canada, July 2003. 2. S. Sweeney and F. Crestani. Effective search results summary size and device screen size: Is there a relationship? Information Processing and Management, 42(4):1056– 1074, 2006. 3. S. Sweeney, F. Crestani, and A. Tombros. Mobile Delivery of News using Hierarchically Query-Biased Summaries. In Proceedings of ACM SAC’02, pages 634–639, Madrid, Spain, March 2002.
A Layered Approach to Context-Dependent User Modelling
Elena Vildjiounaite and Sanna Kallio
VTT Technical Research Centre of Finland, Kaytovayla 1, 90580, Oulu, Finland
{FirstName.LastName}@vtt.fi

Abstract. This work presents a method for explicit acquisition of context-dependent user preferences (preferences which change depending on a user situation, e.g., higher interest in outdoor activities if it is sunny than if it is raining) for the Smart Home – an intelligent environment which recognises the contexts of its inhabitants (such as presence of people, activities, events, weather etc.) via home and mobile devices and provides personalized proactive support to the users. Since the set of personally important situations which affect user preferences is user-dependent, and since many situations can be described only in fuzzy terms, we provide users with an easy way to develop a personal context ontology and to map it fuzzily into the common ontology via a GUI. Backward mapping, by estimating the probability of occurrence of a user-defined situation, allows retrieval of preferences from all components of the user model.

Keywords: User Model, Context Awareness, Smart Home.
1 Introduction

Context dependency of user preferences in Information and Multimedia Retrieval was first explored by location-based services, and after that other context types, such as time and social context (people co-located together), were shown to be useful for TV [1] and movie [2] recommendations and for tourist guides [3]. However, these works do not aim at explicit acquisition of long-term context-dependent user preferences, due to the large diversity of interests and of situations which affect interests. Nevertheless, methods for explicit acquisition of user preferences, including context-dependent ones, are needed because users want to stay in control of the system and to configure the most important settings. (Users expressed the requirement to stay in control during a user study [4] in the Amigo project [5], which develops user services for the home environment.) Currently, explicit acquisition of long-term context-dependent preferences is being developed only for certain applications, such as personalising the behaviour of mobile phones [6] and configuring how a Smart Home should record users' conversations and events [7]. Configuring these applications usually requires users to look through many contexts, predefined by an application designer, and to find values for each aspect (when, where, who etc.) of each situation. Consequently, each situation is described as a chain of fairly low-level context types and values, such as in the rule "if
environments, capable of recognition of many context types. Furthermore, this approach makes it difficult to use user calendar entries, expressed in custom terms, as a context source. This work suggests a method to create context-dependent user profiles via a GUI, which provides a convenient way to define personal fuzzy concepts of the context ontology and to map them into the predefined ontology and sensor data.
2 Context-Dependent User Model and Context Reasoning

An intelligent information retrieval agent aims at providing users with answers to their queries in such a way that these answers are at the top of the results' list (this is especially important for the small screens of mobile devices), and even at anticipating the users' needs by collecting and presenting information/multimedia which users need in their current situation proactively, e.g., by showing appropriate advertisements or news. In order to do so, an agent needs a user model in the form "context" – "preferences for this context". The Amigo system can acquire user preferences in different ways:
• explicit modelling: users can set preferences via a GUI: e.g., high interest in comedies and no interest in thrillers if a user is accompanied by kids (as opposed to his usual high interest in thrillers); high interest in ICT topics if a user writes a document in a work office; high interest in recipes for open fire cooking if a user is hosting a barbecue party in the near future.
• stereotypes-based modelling: preference values for common situations can be set by application designers, such as high interest in near-future buses from an airport if a user is in the airport and searches for a bus schedule via his phone.
• dynamic modelling: learning preferences from observations of user actions, e.g., that a user is more interested in inexpensive hotels in rural areas when he searches from home before holidays, and that the same person prefers hotels in the city centre when searching for hotels from the work location for a business trip.
In order to react to users' information needs, the system should recognize users' current and future contexts via physical (e.g., location) and logical (e.g., entries in a user calendar) sensors and find the corresponding preferences. The problem is that context recognized by the physical sensors is usually expressed as a set of designer-defined descriptors (e.g., numerical values of time, street address, IDs of devices around a user etc.), which is inconvenient for users, because they think in higher-level personal semantic descriptors. On the other hand, logical sensors often provide data in such personal terms (e.g., in one user's calendar a "Masks" entry denotes her child's birthday masquerade party; and before and during the "Masks" event this user needs information about new food recipes, games and rental of costumes). Thus, we suggest allowing users to add their own (custom) terms to the ontology of context descriptors provided by application designers, because it helps in mapping between different sources of information (physical sensors and stereotypes provide raw data or data in designer-defined terms, whereas logical sensors provide data in personal terms, see Fig. 1), and because it saves the users from the need to build long chains of context descriptors manually as many times as they set different preference values for some situation. Instead, users need to edit a set of designer-defined context descriptors, built by an application for them, only when they add a new (custom) term.
Fig. 1. Layers of Context-Dependent User Model
Since, as a regular (annual) event, a "Masks" party can be described only fuzzily (a party at home, with 8-12 guests about 4-12 years old, on a Saturday or Sunday 2-10 days after the child's birthday), we developed a GUI for explicit acquisition of user preferences which allows users to define such personal fuzzy contexts and to set preferences for any personal or predefined context term, as well as for combinations of contexts. The GUI presents to the users a tree of preference keys and a context ontology tree, and initially sets a default value of 0 for all keys. The users can set as many different values for the same key in different contexts as they want by attaching one or several contexts to each value. Users can also set context-independent preferences: the value with no contexts attached is treated as valid always, except for the contexts whose degree of similarity with the attached context exceeds a certain threshold. Fig. 2 shows how the user sets context-dependent and independent values for different information topics and specifies when he needs information on these topics.
Fig. 2. GUI for acquisition of user preferences, which allows users to express contextdependency of their preferences and to create personal fuzzy concepts of context ontology
Fig. 2 also shows that when the users define personally important contexts by adding custom terms under existing concepts of the context ontology (as well as when users specify preferences for such personal contexts), a list of subcomponents of the personal context is shown in a separate window, so that the users can edit the values. The set of subcomponents into which the custom context is split depends on its parent in the context ontology tree: e.g., for the parent of the new term "Masks", the "PartyAtHome" concept, the most important components are indoor location and participants. For the "Masks" event also the month and the approximate day of week (weekend) are known, but for many other personal events they take a default value "any time". Mapping from sensor data into custom contexts (for retrieval of user preferences) is done by the Context Reasoning module, which takes as input the current context and the interaction history, both expressed in designer-defined terms (context comes as a set of triples: context value – time stamp – confidence) and in terms of logical sensors, and estimates the probability that a certain situation occurs now, or in the near future.
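Purely as an illustration of how such a layered model might be queried once the situation probabilities are available, the sketch below picks a value per preference key; this is not the Amigo system's actual API, and the data layout, threshold and function names are assumptions.

```python
def preferences_for_context(user_model, situation_probs, threshold=0.5):
    """Pick a value for each preference key given the recognised situation.

    user_model      : {key: [(context_name_or_None, value), ...]}, as set via
                      the GUI; a None context marks the context-independent value
    situation_probs : {context_name: estimated probability of that situation}
    """
    chosen = {}
    for key, entries in user_model.items():
        best_value, best_prob = 0, 0.0          # 0 is the GUI's default value
        for ctx, value in entries:
            if ctx is None:
                continue
            p = situation_probs.get(ctx, 0.0)
            if p >= threshold and p > best_prob:
                best_value, best_prob = value, p
        if best_prob == 0.0:                    # no situation matched: fall back
            defaults = [v for c, v in entries if c is None]
            best_value = defaults[0] if defaults else 0
        chosen[key] = best_value
    return chosen
```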
3 Conclusions and Future Work

This work presented the layers of a user model for context-aware applications, aiming at delivering the desired information/multimedia to users in different situations. Since users want to control such applications, we propose a method to set context-dependent preferences via a GUI, which serves two goals: first, it provides users with a convenient way to express preferences for personally important situations in custom terms; and second, it helps in context recognition when custom terms are encountered, e.g., in a user calendar. Next, we will collect more data on its use in different applications.

Acknowledgments. This work was done in the EU Amigo project (contract IST-004182).
References 1. Ardissono, L., Gena, C., Torasso, P., Bellifemine, F., Chiarotto, A., Difino, A., Negro, B., User Modeling and Recommendation Techniques for Personalized electronic Program Guides, Personalization and User Adaptive Interaction in Digital Television, 6, 2004, 3-26. 2. Adomavicius, G., Sankaranarayanan, R., Sen, Sh., Tughilin, A., Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach, ACM Trans. Inf. Syst., 23(1), 2005, 103-145. 3. Ardissono, L., Goy, A., Petrone, G., Segnan, M., Torasso, P., Intrigue: Personalized Recommendation of Tourist Attractions for Desktop and Handset Devices, in Applied Artificial Intelligence, 17(8-9) pp. 687-714, 2003 4. http://www.amigo-project.org 5. Röcker, C., Janse, M., Portolan, Streitz, N., User Requirements for Intelligent Home Environments: A Scenario-Driven Approach and Empirical Cross-Cultural Study, 2005. 6. Korpipää, P., Häkkilä, J., Kela, J., Ronkainen, S., Känsälä, I., Utilising Context Ontology in Mobile Device Application Personalisation, Proc. MuM 2004, 133-140. 7. Truong, Kh., Huang, E., Abowd, G., CAMP: A Magnetic Poetry Interface for End-User Programming of Capture Applications for the Home, Proc. Ubicomp 2004, pp. 143-160
A Bayesian Approach for Learning Document Type Relevance
Peter C.K. Yeung, Stefan Büttcher, Charles L.A. Clarke, and Maheedhar Kolla
University of Waterloo, Waterloo ON, Canada
{p2yeung, sbuettch, claclark, mkolla}@plg.uwaterloo.ca
Abstract. Retrieval accuracy can be improved by considering which document type should be filtered out and which should be ranked higher in the result list. Hence, document type can be used as a key factor for building a re-ranking retrieval model. We take a simple approach for considering document type in the retrieval process. We adapt the BM25 scoring function to weight term frequency based on the document type and take the Bayesian approach to estimate the appropriate weight for each type. Experimental results show that our approach improves on search precision by as much as 19%.
1 Introduction
Okapi BM25 [2] is a popular choice for scoring document relevance based on term frequency, document length, and other collection statistics. Recently, Robertson et al. [3] introduced a modified version of BM25 that incorporates weights for the different fields of a structured document. The intuition behind this approach is to rank structured documents according to the importance of each structure; such structures include document title, author, abstract, and content. For our purpose, we study the importance of one particular document structure, the document type, and an approach to estimating the appropriate weight for each document type. To estimate the weight of each document type, we take a Bayesian approach and calculate the likelihood that documents of a given type are relevant. This probability can then be scaled and used as the weight for the given type. In our experiments, we use the set of document judgments to sample and compute all prior and posterior probabilities in Bayes' theorem. Results show that our method improves search precision by as much as 19% at 5 documents.
2 Weighting Document Types

2.1 BM25 Retrieval Model
For query terms $T_1, T_2, \ldots, T_n$, the BM25 relevance score of a document $D$ is

$S_{BM25}^{(D)} = \sum_{i=1}^{n} w_{T_i} \cdot \frac{(k_1 + 1) \cdot f_{D,T_i}}{f_{D,T_i} + k_1 \cdot \left( (1-b) + b \cdot \frac{|D|}{avgdl} \right)}$   (1)
where $f_{D,T_i}$ is the frequency of $T_i$ in document $D$, $|D|$ is the length of document $D$, and $avgdl$ is the average document length in the collection. $k_1$ and $b$ are free parameters, which we set to $k_1 = 1.2$ and $b = 0.75$. $w_{T_i}$ is the inverse document frequency weight, which reflects the significance of $T_i$ in determining document relevance:

$w_{T_i} = \log \frac{\#\text{ documents}}{\#\text{ documents containing } T_i}$   (2)

Robertson et al. [3] modified BM25 to consider structured documents by computing a linear combination of term frequencies. For each term in a query, its frequency in document $D$ is treated as a combination of its unweighted frequency $f_{D,T_i}$ and the corresponding weight $w_j$:

$\bar{f}_{D,T_i} = \sum_{j=1}^{N} w_j \cdot f_{D,T_i}$   (3)
For our purpose, document type is the only structure that needs to be weighted. The weight for each document type depends on the relevance of its documents. For example, if type A contains more relevant documents than type B, then its weight is larger than type B's. In essence, terms appearing in type A documents are weighted heavily and cause BM25 to produce higher relevance scores for these documents.
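The following Python sketch shows how Eqs. (1)-(3) combine when document type is the only weighted structure: the type weight scales the raw term frequency before it enters the BM25 saturation formula. Function and parameter names (e.g. bm25_type_weighted, type_weights) are ours, introduced only for illustration, and are not taken from the paper.

```python
import math

def bm25_type_weighted(query_terms, doc, doc_type, type_weights,
                       doc_freqs, num_docs, avgdl, k1=1.2, b=0.75):
    """Score one document with BM25, scaling term frequencies by the
    weight of the document's type (a sketch of Eqs. 1-3)."""
    score = 0.0
    dl = sum(doc.values())                    # |D|: document length (raw term counts)
    w_type = type_weights.get(doc_type, 1.0)  # w_j for this document's type
    for t in query_terms:
        f = w_type * doc.get(t, 0.0)          # type-weighted term frequency, Eq. (3)
        if f == 0.0:
            continue
        # Eq. (2); assumes t occurs in at least one collection document
        idf = math.log(num_docs / doc_freqs[t])
        # Eq. (1) with the weighted frequency substituted for f_{D,T_i}
        score += idf * ((k1 + 1) * f) / (f + k1 * ((1 - b) + b * dl / avgdl))
    return score
```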
2.2 Estimating Weights for Document Types
Another challenge in our approach is to determine the appropriate weight for each document type. The weight influences how BM25 scores a document of the corresponding type relative to documents of other types. Hence, the weights should reflect the relative difference in document relevance between all document types. A simple way to estimate the weight for a document type is to consider the likelihood that its documents are relevant. We apply Bayes' theorem to calculate the probability that, given a type $DT$, its documents are relevant, $\Pr(Rel \mid DT)$:

$\Pr(Rel \mid DT) = \frac{\Pr(Rel) \cdot \Pr(DT \mid Rel)}{\Pr(DT)}$   (4)
where $\Pr(DT \mid Rel)$ is the posterior probability that a given relevant document belongs to type $DT$, $\Pr(Rel)$ is the prior probability that a document is relevant, and $\Pr(DT)$ is the prior probability that a document of type $DT$ is retrieved, which also serves as the normalization constant.

2.3 Normalization
We use the set of document judgments to calculate the probabilities (or to train our model). However, since the set of document judgments used in our experiments is incomplete, $\Pr(DT)$ is not an appropriate normalization constant. Therefore, we need to introduce a normalization constant, $\alpha$:

$w_j = \alpha \cdot \Pr(Rel) \cdot \Pr(DT_j \mid Rel)$   (5)
Consider the situation where there is no difference in relevance between document types; then the weights should have no effect on the document score, reverting our method to the unstructured case where $w_j = 1$ for all $j$. Thus, $\alpha$ is used to scale the weight of each document type so that the sum of all $w_j$ equals the number of document types defined, $N$:

$N = \alpha \cdot \Pr(Rel) \cdot \sum_{j=1}^{N} \Pr(DT_j \mid Rel)$   (6)
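A minimal Python sketch of this weighting scheme is given below. Because the constraint in Eq. (6) fixes the product $\alpha \cdot \Pr(Rel)$, the prior $\Pr(Rel)$ cancels and each weight reduces to $w_j = N \cdot \Pr(DT_j \mid Rel) / \sum_k \Pr(DT_k \mid Rel)$. The function name and input conventions are our own assumptions, not the authors' code.

```python
from collections import Counter

def estimate_type_weights(relevant_doc_types, all_types):
    """Estimate a weight per document type from relevance judgments,
    scaled so that the weights sum to the number of types N (Eqs. 4-6).

    relevant_doc_types: the type of every judged-relevant document in the
    training topics; all_types: the N document types defined.
    """
    n_types = len(all_types)
    counts = Counter(relevant_doc_types)
    total_rel = sum(counts.values())
    # Pr(DT_j | Rel): fraction of relevant documents belonging to type j
    p_type_given_rel = {t: counts.get(t, 0) / total_rel for t in all_types}
    # alpha * Pr(Rel) cancels once we force sum_j w_j = N (Eq. 6):
    # w_j = N * Pr(DT_j | Rel) / sum_k Pr(DT_k | Rel)
    norm = sum(p_type_given_rel.values())  # equals 1 if every type is covered
    return {t: n_types * p / norm for t, p in p_type_given_rel.items()}
```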
3 Experimental Setup and Results

3.1 The Corpus
For our experiments, we employ the W3C collection used in the TREC 2006 Enterprise track [1]. The W3C collection contains 331,037 documents with a total uncompressed size of 5.7 gigabytes. These documents are categorized into six different types: mailing lists (lists), public CVS repository (dev), public pages (www), wiki pages (esw), personal pages (people), and other pages (other). Table 1 shows the number of documents in each type and their average document sizes.

Table 1. W3C collection

Scope    Corpus Size (GB)   # Docs    Avg Doc Size (KB)
www      1.043               45,975   23.8
lists    1.855              198,394    9.8
dev      2.578               62,509   43.2
people   0.003                1,016    3.6
other    0.047                3,538   14.1
esw      0.181               19,605    9.7
all      5.7                331,037   18.1
Our experiments are limited by the nature of the expert search task in the TREC Enterprise track. The topics used by this task yield an incomplete set of document judgments: some documents may be relevant to a topic but do not provide the name of an expert, so although they contain relevant information, they are not identified in the set of document judgments. On the other hand, any supporting document identified for an expert can be treated as relevant to the topic.

3.2 Results
For the expert search task, there were 54 different topics relating to the W3C collection. We separate this set of topics into training and testing sets: 50% of the topics, along with their relevance judgments, form the training set, and the other 50% of the topics form the testing set. The training set is used to calculate the weight for each document type, while the testing set is used to evaluate the performance of our method. Experiments were carried out with each half of the topics taking turns as the training set and the testing set.
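This two-fold procedure can be sketched as follows in Python; the helper names (train_weights_fn, rank_fn, qrels) are placeholders for the training, ranking, and judgment components described above and are not part of the original system.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """P@K: fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def two_fold_evaluation(topic_halves, train_weights_fn, rank_fn, qrels, k=5):
    """Each half of the topics takes a turn as the training set (to fit the
    type weights) and as the test set (to measure mean P@K)."""
    results = {}
    for train, test in [(topic_halves[0], topic_halves[1]),
                        (topic_halves[1], topic_halves[0])]:
        weights = train_weights_fn(train)   # e.g. estimate_type_weights(...)
        scores = [precision_at_k(rank_fn(t, weights), qrels[t], k) for t in test]
        results[tuple(test)] = sum(scores) / len(scores)
    return results
```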
Fig. 1. Precision at K (P@K) for topics 52-78, using topics 79-105 for training; the curves compare the BM25 baseline and the weighted-types model for K up to 25.

Fig. 2. Precision at K (P@K) for topics 79-105, using topics 52-78 for training; the curves compare the BM25 baseline and the weighted-types model for K up to 25.
Using topics 79-105 as the training set, Figure 1 shows that P@5 improves from 0.5615 to 0.6692, a 19% increase over the BM25 baseline. Using topics 52-78 as the training set, Figure 2 shows that P@5 improves from 0.7043 to 0.7739, a 10% increase over BM25.
4 Conclusion
In this poster, we presented a method for weighting document types and a Bayesian approach to estimating the appropriate weight for each type. Our approach is simple: it incorporates document type into the retrieval process by adapting the BM25 scoring function. Experimental results show that our approach improves search precision on the TREC W3C collection.
References
1. Craswell, N., Soboroff, I., de Vries, A.: Overview of the TREC 2006 Enterprise Track. To be published, 2006.
2. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gull, A., Lau, M.: Okapi at TREC-3. In: Proceedings of the Text REtrieval Conference, November 1994.
3. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 Extension to Multiple Weighted Fields. In: CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 42-49. ACM Press, New York, NY, USA, 2004.