Communications in Computer and Information Science
91
Maristella Agosti, Floriana Esposito, Costantino Thanos (Eds.)

Digital Libraries
6th Italian Research Conference, IRCDL 2010
Padua, Italy, January 28-29, 2010
Revised Selected Papers
Volume Editors

Maristella Agosti
Università degli Studi di Padova, Padua, Italy
E-mail: [email protected]

Floriana Esposito
Università di Bari, Bari, Italy
E-mail: [email protected]

Costantino Thanos
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo", Pisa, Italy
E-mail: [email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): H.3, H.5, H.4, J.1, H.2, H.2.8
ISSN 1865-0929
ISBN-10 3-642-15849-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15849-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Preface
This volume contains the revised accepted papers from among those presented at the 6th Italian Research Conference on Digital Libraries (IRCDL 2010), which was held at the Department of Information Engineering of the University of Padua, Italy, during January 28-29, 2010.

The well-established aim of IRCDL is to bring together Italian researchers interested in the different methods and techniques that allow the building and operation of Digital Libraries. A national Program Committee was set up, composed of 15 members representing the most active Italian research groups on Digital Libraries. Seventeen of the papers presented at the conference were accepted for inclusion in this volume. Selected authors submitted expanded versions of their conference papers; these were reviewed again, and the papers resulting from this selection appear in these proceedings.

The topics covered, which testify to the broad interests of the community, are:

– System Interoperability and Data Integration, including emerging interoperability issues and technologies promoting interoperability
– Infrastructures, Metadata Creation and Content Management for Digital Libraries, such as methods and techniques for text summarization and key phrase extraction, ontology-based annotation, event-centric provenance models, and semantic relatedness
– Information Access and Search for Digital Library Systems, from mathematical symbol indexing to term-based text retrieval and disambiguation techniques, and from audio-based information retrieval to video content-based identification, classification and retrieval
– User Interfaces for Digital Libraries, including interactive visual representations, collaborative user interfaces, multimodal user interfaces for multimedia Digital Libraries, and personalization models

The volume also contains the reports on the two keynote addresses and a communication on a relevant national project. This year is the first time that the IRCDL proceedings have been published in the Springer CCIS series.

The IRCDL series of national conferences was originally conceived and organized in the context of the activities of DELOS, the Network of Excellence on Digital Libraries (http://www.delos.info/), partially funded by the European Union under the Sixth Framework Program from 2004 to 2007. The first IRCDL conference took place in 2005, as an opportunity for Italian researchers to present recent results of their research activities related to the wide world of Digital Libraries. In particular, young researchers were (and still are) invited to submit the results of their ongoing research, to be presented in a friendly and relaxed atmosphere that facilitates constructive discussion and exchange of opinions.
Thanks to the initial support of DELOS, and later to the support of both the DELOS Association and the Department of Information Engineering of the University of Padua, IRCDL continued in the subsequent years and has become a fixture as a yearly meeting point for Italian researchers working on Digital Libraries and related topics. Detailed information about IRCDL can be found at the conference home page (http://ims.dei.unipd.it/ircdl/home.html), which also contains links to the previous IRCDL editions.

Here we would like to thank those institutions and individuals who made this conference possible: the Program Committee members; the Department of Information Engineering of the University of Padua; the members of the same department who contributed to the organization of the event, namely Maria Bernini, Emanuele Di Buccio, Marco Dussin, and Nicola Montecchio; and the members of the University of Padua Library Centre who contributed to the organization of the registration and the management of the on-line presentations, namely Yuri Carrer, Francesca Moro, and Ornella Volpato. Finally, we take this opportunity to also thank Fabrizio Falchi of ISTI CNR, Pisa, who helped in the revision of the final papers.

To conclude, we would like to point out that, in addition to the enthusiastic participation of the young researchers and the good will of the members of the various committees, much of the credit for having the IRCDL series of conferences today goes to DELOS. As a matter of fact, DELOS started its activities more than ten years ago as a working group under the ESPRIT Program, then continued as a Thematic Network under the Fifth Framework Program, and after that as a Network of Excellence under the Sixth Framework Program. It is generally recognized that during these years DELOS made a substantial contribution to the establishment in Europe of a research community on Digital Libraries. At the end of 2007 the funding of the DELOS Network of Excellence came to an end. In order to keep the "DELOS spirit" alive, the DELOS Association was established as a not-for-profit organization, with the main aim of continuing as much as possible the DELOS activities by promoting research in the field of Digital Libraries. This includes the commitment to supporting the new edition of IRCDL, which, as customary, will be held in January 2011. A call for participation for IRCDL 2011 will be circulated, but meanwhile we invite all researchers with research interests in Digital Libraries to start thinking about possible contributions to next year's conference.

June 2010
Maristella Agosti Floriana Esposito Costantino Thanos
Organization
General Chair

Costantino Thanos (ISTI CNR, Pisa)
Program Chairs

Maristella Agosti (University of Padua)
Floriana Esposito (University of Bari)
Program Committee

Giuseppe Amato (ISTI CNR, Pisa)
Marco Bertini (University of Florence)
Leonardo Candela (ISTI CNR, Pisa)
Tiziana Catarci (University of Rome "La Sapienza")
Alberto Del Bimbo (University of Florence)
Stefano Ferilli (University of Bari)
Nicola Ferro (University of Padua)
Maria Guercio (University of Urbino "Carlo Bo")
Carlo Meghini (ISTI CNR, Pisa)
Nicola Orio (University of Padua)
Fausto Rabitti (ISTI CNR, Pisa)
Pasquale Savino (ISTI CNR, Pisa)
Anna Maria Tammaro (University of Parma)
Letizia Tanca (Politecnico di Milano)
Carlo Tasso (University of Udine)
Organizing Committee

Department of Information Engineering, University of Padua:
Maria Bernini, Emanuele Di Buccio, Marco Dussin, Nicola Montecchio

Library Centre - CAB, University of Padua:
Yuri Carrer, Francesca Moro, Ornella Volpato
Supporting Institutions

IRCDL 2010 benefited from the support of the following organizations:

DELOS Association
Institute for Information Science and Technologies of the Italian National Research Council (ISTI-CNR), Pisa, Italy
Department of Information Engineering, University of Padua, Italy
Table of Contents

Keynote Addresses

Digital Cultural Content: National and European Projects and Strategies (Rossella Caffo) 1
Archival Information Systems in Italy and the National Archival Portal (Stefano Vitali) 5

System Interoperability and Data Integration

Making Digital Library Content Interoperable (Leonardo Candela, Donatella Castelli, and Costantino Thanos) 13
Integrating a Content-Based Recommender System into Digital Libraries for Cultural Heritage (Cataldo Musto, Fedelucio Narducci, Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro) 27

Infrastructures, Metadata Creation and Management

Digital Stacks: Turning a Current Prototype into an Operational Service (Giovanni Bergamin and Maurizio Messina) 39
A First National Italian Register for Digital Resources for Both Cultural and Scientific Communities (Communication) (Maurizio Lunghi) 47
FAST and NESTOR: How to Exploit Annotation Hierarchies (Nicola Ferro and Gianmaria Silvello) 55
A New Domain Independent Keyphrase Extraction System (Nirmala Pudota, Antonina Dattolo, Andrea Baruzzo, and Carlo Tasso) 67
An Event-Centric Provenance Model for Digital Libraries (Donatella Castelli, Leonardo Candela, Paolo Manghi, Pasquale Pagano, Cristina Tang, and Costantino Thanos) 79
A Digital Library Effort to Support the Building of Grammatical Resources for Italian Dialects (Maristella Agosti, Paola Benincà, Giorgio Maria Di Nunzio, Riccardo Miotto, and Diego Pescarini) 89

Representation, Indexing and Retrieval in Digital Libraries

Interactive Visual Representations of Complex Information Structures (Gianpaolo D'Amico, Alberto Del Bimbo, and Marco Meoni) 101
Mathematical Symbol Indexing for Digital Libraries (Simone Marinai, Beatrice Miotti, and Giovanni Soda) 113
Using Explicit Word Co-occurrences to Improve Term-Based Text Retrieval (Stefano Ferilli, Marenglen Biba, Teresa M.A. Basile, and Floriana Esposito) 125
Semantic Relatedness Approach for Named Entity Disambiguation (Anna Lisa Gentile, Ziqi Zhang, Lei Xia, and José Iria) 137
Merging Structural and Taxonomic Similarity for Text Retrieval Using Relational Descriptions (Stefano Ferilli, Marenglen Biba, Nicola Di Mauro, Teresa M.A. Basile, and Floriana Esposito) 149

Handling Audio-Visual and Non-traditional Objects

Audio Objects Access: Tools for the Preservation of the Cultural Heritage (Sergio Canazza and Nicola Orio) 161
Toward Conversation Retrieval (Matteo Magnani and Danilo Montesi) 173
Improving Classification and Retrieval of Illuminated Manuscript with Semantic Information (Costantino Grana, Daniele Borghesani, and Rita Cucchiara) 183
Content-Based Cover Song Identification in Music Digital Libraries (Riccardo Miotto, Nicola Montecchio, and Nicola Orio) 195
Toward an Audio Digital Library 2.0: Smash, a Social Music Archive of SHellac Phonographic Discs (Sergio Canazza and Antonina Dattolo) 205

Author Index 219
Digital Cultural Content: National and European Projects and Strategies

Rossella Caffo
Ministero per i Beni e le Attività Culturali, Istituto centrale per il catalogo unico delle biblioteche italiane e per le informazioni bibliografiche (ICCU), Roma, Italy
Abstract. For many years the Central Institute for the Single Catalogue of Italian Libraries (ICCU) of the Italian Ministry for Cultural Heritage has been involved in the coordination of European and national projects that promote the digitization and online accessibility of cultural heritage.
1 MINERVA (2002-2008) - www.minervaeurope.org

The first initiative coordinated by the Ministry was MINERVA, the network of the ministries of culture of the Member States that worked for the harmonization of digitization efforts. Between 2002 and 2008 MINERVA produced several results in terms of concrete tools and strategic actions, and today it is recognized as a trustworthy name. The project was articulated in three phases, each of them coordinated by MiBAC:

• 2002-2005: MINERVA
• 2004-2006: MINERVA Plus, aimed at enlarging the consortium to new Member States (for a total of 29 countries, i.e. the current 27 Member States of the European Union, plus Israel and Russia)
• 2006: MINERVA eC, which, on the basis of the achieved results, developed supporting actions for the development of the European Digital Library, EUROPEANA.

The MINERVA eC activities are related to:

• Workshops and seminars on:
  o MINERVA tools for the digitization of cultural heritage
  o Quality of cultural websites
• Plenary meetings, in coordination with the rotating Presidency of the European Union
• Guidelines and studies: IPR guidelines, technical guidelines, a study on user needs, a report on content interoperability, best practices on digitization, cost reduction in digitization, multilingualism, quality, accessibility and usability of contents, and an annual report on digitization.
MINERVA operated through the coordination of national digitization policies and programs and supported the National Representatives Group for digitization (NRG) in order to facilitate the creation of added value products and services shared at the European level, to improve awareness of the state-of-the-art in the sector, to contribute to overcoming fragmentation and duplication of digitization activities of cultural and scientific content, and to maximize cooperation among the Member States.
2 MICHAEL (2004-2008) - www.michael-culture.org

MICHAEL was a MINERVA spin-off; it realized a European multilingual portal which provides integrated access to European cultural heritage through the collections of museums, libraries, archives and other cultural institutions and organizations. MICHAEL has a distributed organization involving thousands of institutions belonging to every domain at the national, regional and local level. At the national level MiBAC involved all its sectors and offices: State archives (about 130), State libraries (almost 50), and hundreds of museums, heritage offices, etc., covered at the regional level through the MiBAC Regional Directorates. MiBAC stipulated cooperation agreements with all the 20 Italian Regions and with the Universities, coordinated through CRUI-Padua University.

MICHAEL describes both the collections and their context. The MICHAEL data model is suitable for describing digital collections belonging to all cultural heritage sectors and for recording contextual information relating them to Institutions (creator, owner, keeper, manager), Projects/programs (funding), Services/products (giving access) and Physical collections (represented in full or in part).

The national MICHAEL portals can be visited at the following addresses:

Italy: http://www.michael-culture.it
Estonia: http://www.michael-culture.kul.ee
France: http://www.numerique.culture.fr
Germany: http://www.michael-portal.de
Greece: http://www.michael-culture.gr
Israel: http://www.michael-culture.org.il
Netherlands: http://www.michael-culture.nl
Spain: http://www.michael-culture.es
Finland: http://www.michael-culture.fi
Czech Republic: http://www.michael-culture.cz
Slovakia: http://www.michael-culture.sk
Sweden: http://www.michael-culture.se

The MICHAEL consortium (19 countries and 40 partners) has established an International Association, MICHAEL-Culture AISBL, to enable the service to be sustainable and continue in the future. The MICHAEL-Culture Association is a member of the European Digital Library Foundation, which is developing Europeana.
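The data model just outlined can be pictured as a small data structure. The following Python sketch is purely illustrative (the class and field names are invented for this example and are not the actual MICHAEL schema elements): it shows a digital collection description linked to the contextual entities listed above, namely institutions, projects/programs, services/products and physical collections.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Institution:
    name: str
    role: str            # creator, owner, keeper or manager

@dataclass
class DigitalCollection:
    title: str
    domain: str                                                     # museum, library, archive, ...
    institutions: List[Institution] = field(default_factory=list)
    projects: List[str] = field(default_factory=list)               # projects/programs (funding)
    services: List[str] = field(default_factory=list)               # services/products giving access
    physical_collections: List[str] = field(default_factory=list)   # represented in full or in part

collection = DigitalCollection(
    title="Digitized manuscripts (example)",
    domain="library",
    institutions=[Institution("Example State Library", "keeper")],
    projects=["Example regional digitization program"],
    services=["Online image viewer"],
    physical_collections=["Manuscript collection (partially represented)"],
)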
3 CULTURAITALIA - www.culturaitalia.it

CulturaItalia is the Italian Culture Portal. It is a cross-domain initiative based on the results of past digitization projects coordinated by the Ministry (mainly MINERVA and MICHAEL) and on the most widespread international interoperability standards.
In Italy, CulturaItalia is the first portal to offer a single and integrated access point to Italian cultural heritage. The Portal is a collaborative project, since it is being carried out in cooperation among all the Ministry's offices, the Regions, Universities, and private and local Institutions. CulturaItalia aims to communicate the various aspects of Italian culture (heritage, landscape, cinema, music, literature, etc.) and to make digital cultural content available to a wide audience. CulturaItalia is organized as a unique index of data and metadata harvested from several metadata repositories. A specific Application Profile (PicoAp) has been developed for CulturaItalia to describe the digital objects and the tangible resources of all the cultural heritage sectors. The metadata are described by the PicoAp and classified through the PICO Thesaurus, a controlled vocabulary conceived for assigning the digital resources to the index and the thematic menus. Through metadata harvesting and the implementation of the PicoAp, CulturaItalia is therefore able to aggregate the digital resources of hundreds of Italian "digital libraries". CulturaItalia, like the other national multidisciplinary aggregators, can contribute to Europeana by making available the databases, the cooperative resources, the contacts and the agreements stipulated with hundreds of Institutions. CulturaItalia is the national aggregator towards Europeana, the European digital library.
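To make the aggregation mechanism described above more concrete, the following Python sketch shows, in a deliberately simplified form, how a harvested metadata record could be normalized to an application-profile record and classified with a controlled vocabulary before being added to the index. The field names and the vocabulary terms are hypothetical: they are not the actual PicoAp elements or PICO Thesaurus entries.

# Hypothetical subset of a controlled vocabulary used to route resources
# to the index and the thematic menus (not the real PICO Thesaurus).
THESAURUS = {"manuscripts", "photographs", "works of art"}

def to_profile_record(harvested: dict) -> dict:
    """Normalize a harvested record into an application-profile record."""
    subject = harvested.get("subject", "").lower()
    return {
        "title": harvested.get("title"),
        "provider": harvested.get("provider"),
        "subject": subject if subject in THESAURUS else None,   # classification step
        "landing_page": harvested.get("identifier"),
    }

harvested_records = [   # records obtained from the providers' metadata repositories
    {"title": "Codex example", "provider": "Example library",
     "subject": "Manuscripts", "identifier": "http://example.org/objects/1"},
]
index = [to_profile_record(r) for r in harvested_records]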
4 ATHENA (2008-2011) - www.athenaeurope.org

Thanks to the results achieved by MINERVA, MICHAEL and CulturaItalia and to the work carried out by the network of experts created within these projects, a new European project, ATHENA, started at the beginning of November 2008. ATHENA is a best practice network financed by the European Commission eContentplus program and coordinated by MiBAC. The consortium is made up of 20 Member States plus Israel, Russia, and Azerbaijan. It involves 109 important museums and other European cultural institutions. ATHENA aims to:
• reinforce, support and encourage the participation of museums and other institutions coming from those sectors of cultural heritage not fully involved yet in the EDL;
• contribute to the integration of the different sectors of cultural heritage, in cooperation with other projects more directly focused on libraries and archives, with the overall objective of merging all these different contributions into the EDL;
• develop a set of plug-ins to be integrated within the EDL, to facilitate access to the digital content belonging to European museums.
ATHENA will also produce a set of scalable tools, recommendations and guidelines, focusing on multilingualism and semantics, metadata and thesauri, data structures and IPR issues, to be used within museums for supporting internal digitization activities and facilitating the integration of their digital content into the EDL. All these outputs will be based on standards and guidelines agreed by the partner countries for harmonized access to the content, and will be easily applicable. The final aim of ATHENA is to bring together relevant stakeholders and content owners from all over Europe, evaluate and integrate standards and tools for facilitating the inclusion of new digital content into the EDL, thus conveying to the user the original and multifaceted experience of all the European cultural heritage. ATHENA will work in close cooperation with existing projects (e.g. EDLnet and MICHAEL, both present in ATHENA) and develop intense clustering activities with other relevant projects.
5 DC-NET (2009-2011) - http://www.dc-net.org

DC-NET is an ERA-NET (European Research Area Network) project, financed by the European Commission under the e-Infrastructure - Capacities Programme of the FP7 and coordinated by MiBAC-ICCU. It started on 1 December 2009 and involves eight Ministries of Culture from eight European countries. The final aim of the DC-NET project is to generate a powerful and comprehensive plan of joint activities for the implementation of a new data and service e-Infrastructure for the virtual research community in the Digital Cultural Heritage. The new Digital Cultural Heritage e-Infrastructure will be based on the enhancement of the MICHAEL platform and will deploy a wide range of end-to-end services and tools facilitating the integration and increase of research capacities in the sector. The new e-Infrastructure will be targeted towards a multidisciplinary virtual research community on digital cultural heritage that is demanding more and more empowered functions (access, search, storage, usability, etc.) to improve its scientific collaboration and innovation. DC-NET aims to:
• create a place for the dialog between the community of digital cultural heritage and the e-Infrastructures, generating a common awareness of the reciprocal research items;
• involve all the relevant stakeholders through a program of seminars, workshops, meetings and conferences;
• enlarge the network to involve all the countries who are willing to join DC-NET;
• create a joint commitment among the participating countries to implement the joint activities plan.
Archival Information Systems in Italy and the National Archival Portal

Stefano Vitali
Director of Archival Supervising Office for the Emilia Romagna Region, Italy
1 Archival Description and Information Systems

The glossary of the International Standard for Archival Description (General), drawn up by the International Council on Archives for the development of archival information systems, defines archival description as "an accurate representation of a unit of description and its component parts, if any, by capturing, analyzing, organizing and recording information that serves to identify, manage, locate and explain archival materials and the context and records systems which produced it." This definition summarizes the fundamental problems which need to be tackled when developing archival information systems, which in a digital environment are the equivalent of the paper finding aids, such as guides and inventories, traditionally prepared to enable access to and consultation of materials held in archival institutions.

The definition underlines the fact that archives are first of all complex "objects". Indeed they are made up of a collection of entities and of the relations which link them to each other. These relations create strong and specific bonds - "determined", as some scholars of twentieth century archival science have called them, to stress the fact that these bonds are generated by the common origin of the entities - which even in the digital environment cannot be ignored. It is precisely the nature of these links which distinguishes archives from other "objects" in the realm of cultural heritage (e.g. books and works of art), which in general are perceived as individual and unrelated entities. Archives (or "fonds", as they are also called) are in fact made up of series, which in turn can be organized in sub-series, which are formed of archival units (files, registers and so on) that have a homogeneous nature and can in turn be divided into sub-units containing items such as letters, reports, contracts or even photographs, drawings, audio-video recordings and so on. This implies that each of these entities can only be correctly identified and interpreted in relation to the entity it belongs to and from which it inherits certain characteristics. This obviously influences the manner of their representation and the retrieval of these representations.

The prevailing solution, in the development of digital systems of archival description, has been to represent these relations with hierarchical metaphors which place each entity in a vertical relationship of subordination to the entity it belongs to, i.e. to the "father" entity (e.g. a series with the fond it belongs to). The hierarchical representation is further complicated by the fact that the entities that belong to the same father - and are therefore related to each other by a horizontal-type relationship - need to be represented according to a significant sequence which reflects the position that they have in the logical and/or material order of the archive (e.g. in a company archive the series of the articles of incorporation will be located before the series of the deliberations of the board of directors). Both the vertical relations and the horizontal ones, established according to a pre-established sequence, form the complex archival context of a determined entity, and this context contributes in a fundamental manner to its identification.

However, a similarly fundamental role in archival description is played by other types of contexts, which in a certain sense are external to the archives themselves. Archives are in fact historical entities which, like sediment, are slowly accumulated within a certain space-time context, generally speaking as the outcome of the practical activity of certain subjects (corporate bodies, such as institutions or organizations, families and persons). The context in which a certain archive is created is therefore in itself an essential part of the descriptive system of archives. Each archive should therefore be related to one or more creators who presided over its accumulation, and their history, functions, activities, etc. should be described. In addition, when archives held by a number of institutions are described in the same information system, these institutions also have to be described to help users locate the archives described. The so-called archival institutions, together with archival materials and creators, therefore constitute the essential entities which in general make up information systems of archival description, in accordance with the three standards issued by the International Council on Archives over the last decade. These standards are: the International Standard Archival Description (General) or ISAD (G), regarding the description of archival fonds and their common components; the International Standard Archival Authority Records (Corporate Bodies, Persons, Families) or ISAAR (CPF), regarding the preparation of authority records for creators; and the International Standard for Describing Institutions with Archival Holdings (ISDIAH), which instead is dedicated to describing archival institutions (Fig. 1). In many systems of archival description, however, the context has a broader meaning, and therefore the entities described and related to each other and, in either a direct or indirect way, to the archival materials are more numerous and include the political-institutional settings and jurisdictions where the creators operated, or the previous archival institutions where the archival materials were kept. Other entities can be added to these, such as the descriptions of finding aids existing for a certain fond, bibliographic references, other information resources, etc., with the outcome of creating relatively complex systems (Fig. 2).
Fig. 1. Essential entities: Archival institutions (ISDIAH), Fonds (ISAD-G), Creators (ISAAR-CPF).

Fig. 2. Complex systems: Archival institutions (ISDIAH), Fonds (ISAD-G), Creators (ISAAR-CPF), Political/institutional context, Jurisdictions, Finding aids.
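The vertical and horizontal relations discussed in this section can be illustrated with a minimal sketch. The following Python fragment is not tied to any specific descriptive system; it simply shows a unit of description that keeps a reference to the entity it belongs to (the "father") and an ordered list of the entities belonging to it, from which its archival context can be derived.

class UnitOfDescription:
    def __init__(self, title, level):
        self.title = title        # e.g. "Articles of incorporation"
        self.level = level        # fonds, series, sub-series, file, item, ...
        self.parent = None        # vertical relation: the entity it belongs to
        self.children = []        # horizontal relations, kept in their significant sequence

    def add(self, child):
        child.parent = self
        self.children.append(child)   # the position in the list encodes the order
        return child

    def archival_context(self):
        """Chain of ancestors, from the fonds down to this unit."""
        node, path = self, []
        while node is not None:
            path.append(node.title)
            node = node.parent
        return " > ".join(reversed(path))

fonds = UnitOfDescription("Example company archive", "fonds")
incorporation = fonds.add(UnitOfDescription("Articles of incorporation", "series"))
deliberations = fonds.add(UnitOfDescription("Deliberations of the board of directors", "series"))
file_1 = incorporation.add(UnitOfDescription("File 1", "file"))
print(file_1.archival_context())   # Example company archive > Articles of incorporation > File 1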
In terms of the nature of the information stored in these systems, it should be noted that an archival description is typically a collection of structured data (e.g. dates), semi-structured data (e.g. the physical or logical extent of the archival material, i.e. its quantity, bulk, or size), and narrative texts which may be quite large (e.g. the institutional history or biography of the creator). These data are often uncertain, problematic and attributed according to certain criteria. Precisely because of this problematic nature, the sources of the data have to be indicated too and placed in a historical context. All of this obviously influences the way this information is represented, treated and retrieved within the system.
2 Archival Information Systems and the Catalogue of Archival Resources of the National Archival Portal [1]

Over the last decade numerous systems of archival description have been developed in Italy, each with characteristics similar to those mentioned above. These systems were developed by the State Archival Administration at the national level (e.g. the General Guide of State Archives [2], the State Archive Information System or SIAS [3], the Unified System of the Archival Supervising Offices or SIUSA [4], the Multimedia Historical Archive of the Mediterranean [5]), by some of its local branches (e.g. the State Archives of Florence [6], Milan [7], Bologna [8], Naples [9], Venice [10]), by some of the Regions (e.g. Lombardy [11], Emilia-Romagna [12], Piedmont [13] or Umbria [14]), by other local bodies (e.g. the Historical Archives of the Province of Trento [15]), by individual cultural and non-cultural institutions (e.g. the "Giorgio Agosti" Piedmontese Institute for the History of the Resistance and Contemporary Society [16], the Giangiacomo Feltrinelli Foundation [17], the Senate of the Republic [18], the Chamber of Deputies [19] and many others) or by collections of "federated" bodies (e.g. the Network of Institutions of the Resistance [20] or the project Twentieth Century Archives [21]). Alongside the development of these systems there has also been a growing need to establish links, data exchanges and increasing levels of interoperability between them, and much has been debated on how to achieve this [22].

One of the outcomes of this debate has been the project developed by the Directorate General of the Ministry of Cultural Assets and Activities, which intends to develop a catalog of archival resources, or CAT, within the National Archival Portal, dedicated to joining the initiatives of the various institutions which hold archives in Italy (the State archival administration, Regions, local independent bodies, universities, cultural institutions, etc.) and to promoting knowledge of the Italian archival heritage among a wide national and international audience. Through the development of a common access point and the provision of concise information regarding the national archival heritage, the CAT aims at being a tool for connecting the existing systems without replacing any of them, but rather making them more visible and enhancing their specific characteristics. An operation of this kind can only be carried out because the systems developed in recent years, despite the diversity of the software used and the different aspects of the descriptive formats, share the same conceptual model and a common reference to the international archival standards mentioned above.

The CAT, therefore, shall outline a general map of the national archival heritage, able to provide initial orientation for researchers and direct them towards the more detailed information resources present in the systems participating in the National Archival Portal. It will include descriptive records of archival institutions, fonds or archival aggregations, finding aids and creators. It will be populated and updated through procedures which favor methodologies of harvesting of data from the participating systems based on the OAI-PMH protocol. However, other methods of importing data shall not be excluded, such as the upload of an XML file in a specific area of the system according to predefined formats, or the direct input of the data into the CAT via web templates. Thanks to this plurality of techniques, the aim is to obtain the participation in the Portal even of less technologically equipped archival institutions. The identification of the descriptive elements for fonds and other archival aggregations, for creators and for finding aids has been based on the idea of subsidiarity between systems, and makes reference primarily to those descriptive elements considered mandatory in the international standards, with the integration of a few others which are for the most part considered essential in our archival tradition. Concision in any free-text fields shall be ensured by the provision of a maximum number of characters. The data which shall be used to populate and update the CAT database, as a rule imported from the systems participating in the project, shall be published without any modification. Each of these records shall contain a direct link to the corresponding record present in the system of data provenance. The descriptions of archives will also be connected to those of any digital reproduction project made accessible on the Portal. Since it cannot be ruled out that more than one description of the same archival complex, finding aid or creator, coming from different systems, may converge in the CAT, each imported record is linked to a record describing the relevant system of origin, so as to provide context for its origin and characteristics.

For creators, however, a further effort towards standardization is planned, with the aim of offering the user higher quality information: an authority file of creators will be progressively developed by a special editorial staff throughout the country, and it shall become not only the main access point for searching and navigating in the CAT but also a reference point at the national level for the identification of corporate bodies, persons and families and the formulation of their names. This authority file could also be the reference point for the systems participating in the Portal in the preparation of their own descriptions of creators, when they do not consider it appropriate, as would be desirable, to directly entrust it with the overall management of the descriptions of the creators. Lastly, it could function as an interface and connection with similar authority files present in the catalogs and descriptive systems of other sectors of cultural heritage, such as the National Library Service.

Notes
1. This section is a re-working of the speech given at the 14th "Archivwissenschaftliches Kolloquium" of the Marburg Archive School (1-2 December 2009) and will be published in part in the "IBC" journal.
2. Available in its first version at http://www.maas.ccr.it/h3/h3.exe/aguida/findex and in the new 2009 version at http://guidagenerale.maas.ccr.it/.
3. http://www.archivi-sias.it/.
4. http://siusa.archivi.beniculturali.it/.
5. http://www.archividelmediterraneo.org/portal/faces/public/guest/.
6. See the Information System of the State Archive of Florence, or SIASFI, URL: http://www.archiviodistato.firenze.it/siasfi/.
7. See the on-line Guide of the State Archive of Milan, URL: http://archiviodistatomilano.it/patrimonio/guida-on-line/.
8. See the Archival Heritage of the State Archive of Bologna, URL: http://patrimonio.archiviodistatobologna.it/asbo-xdams/.
9. See the Archival Heritage of the State Archive of Napoli, URL: http://patrimonio.archiviodistatonapoli.it/xdams-asna/.
10. See the on-line Guide of the State Archive of Venice, or SIASVe, URL: http://www.archiviodistatovenezia.it/siasve/cgi-bin/pagina.pl.
11. See the archival section of the Lombardy cultural heritage portal, URL: http://www.lombardiabeniculturali.it/archivi/.
12. See the IBC Archives, URL: http://archivi.ibc.regione.emilia-romagna.it/ibc-cms/.
13. See the Guarini web archives, URL: http://www.regione.piemonte.it/guaw/MenuAction.do.
14. See .DOC - Information System of the Umbrian Archives, URL: http://www.piau.regioneumbria.eu/default.aspx.
15. See the on-line inventories on the site of the Historical Archives of the Province of Trento, URL: http://www.trentinocultura.net/catalogo/cat_fondi_arch/cat_inventari_h.asp.
16. See ArchOS, Integrated System of Archival Catalogs, URL: http://metarchivi.istoreto.it/.
17. See the on-line Archives of the Foundation, URL: http://www.fondazionefeltrinelli.it/feltrinelli-cms/cms.find?flagfind=quickAccess&type=1&munu_str=0_6_0&numDoc=95.
18. See the Archives Project of the Senate of the Republic, which gathers together digitalized descriptions and materials, including fonds preserved at other institutions, URL: http://www.archivionline.senato.it/.
19. See the site of the Historical Archive of the Chamber of Deputies, URL: http://archivio.camera.it/archivio/public/home.jsp?&f=10371.
20. See the Guide to the Historical Archives of the Resistance, URL: http://beniculturali.ilc.cnr.it/insmli/guida.HTM.
21. See the Archives of the Twentieth Century. Collective memory on-line, URL: http://www.archividelnovecento.it/archivinovecento/.
22. See for example Verso un Sistema Archivistico Nazionale?, special issue of Archivi e Computer, XIII (2004), 2, edited by the author of this paper.
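Since the CAT will be populated mainly by harvesting data from the participating systems through the OAI-PMH protocol, a minimal harvesting loop may help to fix ideas. The sketch below uses only standard OAI-PMH verbs and namespaces and assumes that a participating system exposes Dublin Core records; the endpoint URL is hypothetical, and the step that would actually store or refresh a CAT record is only hinted at in a comment.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Iterate over all records exposed by an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            header = record.find(OAI + "header")
            yield {
                "identifier": header.findtext(OAI + "identifier"),
                "datestamp": header.findtext(OAI + "datestamp"),
                "title": record.findtext(".//" + DC + "title"),
            }
        token = tree.findtext(".//" + OAI + "resumptionToken")
        if not token:          # empty or missing token: no more pages
            break
        params = {"verb": "ListRecords", "resumptionToken": token}

for record in harvest("http://archives.example.it/oai"):   # hypothetical provider
    print(record)   # here the CAT would create or update the corresponding record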
3 The Definition and Character of the SAN Standard

The architecture of the CAT, its contents and the standards of communication with the participating systems were designed thanks to intense work of comparison and debate which went on throughout 2009 within working groups appointed in the context of the State-Local Authorities Joint Technical Committee for the Definition of Archival Standards. The working groups saw the important participation of different institutional and geographical entities. In addition to the choice of the descriptive elements for the various entities included in the CAT [23] and the definition of the formats and protocols for importing data into the CAT database, the working groups have developed methodologies for drawing up the authority files of creators. During the first few months of 2010 the formats for the metadata of the digital resources shall also be issued, which will be made available on the National Archival Portal according to methodologies not dissimilar from those used for the CAT. The digital archive of the Portal should contain thumbnails of the images and essential information, which allow the user to search among the digital resources available, make a preliminary selection and then be directly addressed to the harvested systems for quality viewing of the digital reproductions of archival documents.

For exporting the descriptions from the existing systems to the CAT, an XML exchange format has been developed. It has been named the "SAN exchange format". It is based on three schemas, each of which includes a subset of elements of the most widespread standards at the international level, i.e. the Encoded Archival Description for archival aggregations and for finding aids, and the recently issued Encoded Archival Context for creators (Corporate Bodies, Persons, Families) [24]. A special exchange format in XML has been developed for the initial population of the records regarding the creators, which, when the system is up and running, will be produced by the editorial staff of the Portal. In the development of the XML schema to which the systems should conform in the generation of export files for the CAT, an approach has been adopted which can be described as "record centric". Each description of an archival aggregation (first level or subordinate), finding aid or creator to be exported from the systems into the CAT will always have a corresponding record in the export-import file. The relations between the various entities, also with regard to archival aggregations, shall therefore be made explicit through the indication of the identity code of the other connected entities in the export-import file. The XML schemas of the four entities (archival institutions, creators, archival aggregations, finding aids), integrated with the control information required for the correct execution of the import procedures, were lastly grouped together in an overall import-export format, which is available to all managers of archival systems who desire to participate in the National Archival Portal and contribute to the effort of building a single access point to the Italian archival resources present on the Web [25]. The open and cooperative model for the definition of the SAN standards has already achieved broadly satisfying results. The consolidation of these results will derive from their broad use and shall constitute the precondition for further steps forward in defining exchange formats which embrace other, more complex aspects and components of archival description.

Notes
23. See the Technical Subcommittee for the Definition of Metadata (…), Tracciati descrittivi del CAT: soggetti, URL: http://ims.dei.unipd.it/data/san/metadati/docs/2009-04-17_Documento-conclusivo-sui-tracciati-CAT.pdf (provisional address).
24. See the documentation on the relative site, URL: http://eac.staatsbibliothek-berlin.de/.
25. The overall schema and the illustrative documentation are provisionally available, respectively, at the following URLs: http://gilgamesh.unipv.it/cat-import/cat-import.xsd and http://gilgamesh.unipv.it/cat-import/cat-import.html - id6.
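The "record centric" approach can be illustrated with a small, purely indicative example. This is not the actual SAN exchange format, whose schemas are referenced in note 25; it only shows the key idea that every entity is exported as a self-contained record and that relations are expressed through the identifiers of the connected records.

records = [
    {"id": "inst-001", "type": "archival-institution", "name": "Example State Archive"},
    {"id": "crea-001", "type": "creator", "entity": "corporate body", "name": "Example Municipality"},
    {"id": "aggr-001", "type": "archival-aggregation", "level": "fonds",
     "title": "Example municipal archive", "held_by": "inst-001", "created_by": ["crea-001"]},
    {"id": "aggr-002", "type": "archival-aggregation", "level": "series",
     "title": "Deliberations", "part_of": "aggr-001"},
    {"id": "find-001", "type": "finding-aid",
     "title": "Inventory of the municipal archive", "describes": "aggr-001"},
]

# No nesting is needed: a relation is resolved by following the identifier.
by_id = {r["id"]: r for r in records}
series = by_id["aggr-002"]
print(series["title"], "is part of", by_id[series["part_of"]]["title"])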
Making Digital Library Content Interoperable

Leonardo Candela, Donatella Castelli, and Costantino Thanos
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" – CNR, Pisa, Italy
{name.surname}@isti.cnr.it
Abstract. The demand for powerful and rich Digital Libraries able to support a large variety of interdisciplinary activities has increased the need for "building by re-use" and sharing, especially when dealing with the content space. Interoperability is a central issue in satisfying these needs. Despite its importance, and the many attempts made in the past to address it, the solutions to this problem are still very limited today. The main reasons for this slow progress are the lack of any systematic approach for addressing the issue and the scarce knowledge of the adopted solutions, which too often remain confined to the systems they were designed for. To overcome this limitation, this paper proposes an Interoperability Framework for describing and analyzing interoperability problems and solutions related to the use of content resources. It also discusses the many facets content interoperability has and provides a comprehensive and annotated portfolio of existing approaches and solutions to this challenging issue.
1 Introduction

Interoperability is among the most critical issues to be faced when building systems as "collections" of independently developed constituents (systems in their own right) that should cooperate and rely on each other to accomplish larger tasks. Digital Library (DL) interoperability is an issue that has affected the Digital Library domain since its beginning. It was explicitly mentioned among the challenges of the Digital Library Initiative (Challenge Four) [11] in the early nineties. At that time the issue was formulated as follows: to establish protocols and standards to facilitate the assembly of distributed digital libraries. Recently, the demand for powerful and rich DLs able to support a large variety of interdisciplinary activities has increased the need for resource sharing. Interoperability solutions, which lie at the core of any approach supporting such sharing, have consequently become even more important than in the past. Despite these facts and the critical role interoperability has, there is no developed theory driving the resolution of interoperability issues when they manifest. Actually, there is no single definition of interoperability which is accepted by the overall community, nor by the Digital Library community. Wegner [34] defines interoperability as "the ability of two or more software components to cooperate despite differences in language, interface, and execution platform. It is a scalable form of reusability, being concerned with the reuse of server resources by clients whose accessing mechanisms may be plug-incompatible with sockets of the server". He also identifies in interface standardization and interface bridging two of the major mechanisms for interoperation. Heiler [13] defines interoperability as "the ability to exchange services and data"
with one another. It is based on agreements between requesters and providers on, for example, message passing protocols, procedure names, error codes, and argument types". He also defines semantic interoperability as ensuring "that these exchanges make sense – that the requester and the provider have a common understanding of the 'meanings' of the requested services and data. Semantic interoperability is based on agreements on, for example, algorithms for computing requested values, the expected side effects of a requested procedure, or the source or accuracy of requested data elements". Park and Ram [23] define semantic interoperability as "the knowledge-level interoperability that provides cooperating businesses with the ability to bridge semantic conflicts arising from differences in implicit meanings, perspectives, and assumptions, thus creating a semantically compatible information environment based on the agreed concepts between different business entities". They define syntactic interoperability as "the application-level interoperability that allows multiple software components to cooperate even though their implementation languages, interfaces, and execution platforms are different" [26]. In addition to that, they state that emerging standards, such as XML and Web Services based on SOAP (Simple Object Access Protocol), UDDI (Universal Description, Discovery, and Integration), and WSDL (Web Service Description Language), can resolve many application-level interoperability problems.

As recognized by Paepcke et al. [22] ten years ago, over the years systems designers have developed different approaches and solutions to achieve interoperability. They have put in place a pragmatic approach and started to implement solutions blending into each other by combining various ways of dealing with the issues, including standards and mediators. Too often these remain confined to the systems they have been designed for and lead to "from-scratch" development and duplication of effort whenever similar interoperability scenarios occur in different contexts.

This paper tackles the interoperability problem from a different perspective. This results from the understanding that the multitude of definitions sketched above, as well as the need to have blending solutions, are a consequence of the fact that interoperability is a very multifaceted and challenging issue that is not yet fully modeled in its own right. The paper focuses on digital library content interoperability, i.e. the problem arising whenever two or more Digital Library "systems" are willing to interoperate by exploiting each other's content resources. The systems involved have to remove the barriers resulting from the different models and "ways to manage" underlying their resources. The aim is to contribute to a better understanding of this interoperability problem and of the relative solutions through a systematic and organized approach. The paper refrains from introducing its own definition of content interoperability in favor of an Interoperability Framework aiming at identifying the various facets characterizing this exemplar of interoperability. By exploiting this framework, "interoperability" problems and solutions can be modeled in a multifaceted space. This study is part of a more comprehensive approach to the Digital Library interoperability problem, addressed from different perspectives (content, user, functionality, policy, quality, architecture), conducted as part of DL.org (www.dlorg.eu), an EU 7th FP project.

The remainder of this paper is structured as follows. Section 2 introduces the many facets content interoperability has and proposes a systematic approach to the understanding of this issue. Section 3 identifies the most important properties characterizing an information object from the interoperability point of view and briefly reviews the techniques and formalisms proposed in the literature for modeling them. Section 4 identifies significant levels of content interoperability and gives the corresponding definitions. Section 5 describes and comments on existing approaches enabling interoperability. Finally, Section 6 presents concluding remarks and future plans.
2 A Content Interoperability Framework in a Nutshell

According to the DELOS Reference Model [2], content is one of the six domains characterizing the Digital Library universe. In particular, this domain contains all the entities held or included in a system to represent information in all its forms. Information Object is the most general concept characterizing the Content Domain. An Information Object is an instance of an abstract data type and represents any unit of information managed in the Digital Library universe, including text documents, images, sound documents, multimedia documents and 3-D objects, as well as data sets and databases. Information Objects also include composite objects and collections of Information Objects.

Any interoperability scenario involves two or more "systems" and one or more "resources" about which the involved systems are willing to be interoperable. For the sake of modeling, interoperability among many systems can always be reduced to "interoperation" between pairs of actors, one of which performs an operation for the other. At any given time, one of the two actors plays the role of provider of the resource to be exchanged, while the other plays the role of consumer of this resource.

Content interoperability is a multi-layered and very context-specific concept. It encompasses different levels along a multidimensional spectrum. Therefore, rather than aiming for a single, "one-size-fits-all" definition, it seems more promising to carefully identify the properties characterizing the DL content and to define different levels of interoperability supporting technical and operational aspects of the interaction between content providers and consumers. In the setting described above, any content interoperability scenario can be characterized by a framework consisting of the following four complementary axes:

– resource model (cf. Section 3). Any resource is described by a set of properties that capture its essential characteristics. The larger the set of properties of which producer and consumer share the same understanding, the wider the exploitation that the consumer can make of the resource;
– interoperability level (cf. Section 4). The same understanding of a model can occur at different levels of "completeness". These levels constrain the type of interoperation that can occur. Typical exemplars of interoperability levels are syntactic, i.e. provider and consumer agree on the representation of the resource model or part of it, and semantic, i.e. provider and consumer agree on the meaning of the resource model or part of it;
– reconciliation function (cf. Section 5). A given level of interoperability can be achieved using different approaches. Reconciliation functions can vary along a multidimensional spectrum, including the dimension from "unilateral" to "collaborative" approaches and the dimension from "non-regulatory" to "regulatory" approaches [10]. Moreover, a reconciliation function should materialize in some architectural components implementing it and in a protocol through which the partaking systems operate. Two notable exemplars of reconciliation functions are standards and mediators. Standards are among the most consolidated approaches to achieve interoperability, while mediators have been introduced to guarantee a high level of autonomy for the partaking systems;
– benchmark. Each reconciliation function has its own strengths and weaknesses, costs and benefits. Benchmarks characterize these features of a reconciliation function. They may include, for example, effectiveness, i.e. the measure of how successful the approach is in achieving the expected result, efficiency, i.e. the measure of the ratio between the cost of the approach and the result achieved, and flexibility, i.e. the measure of how change-tolerant the proposed approach is.

As this paper aims at presenting the problems of and the solutions to interoperability from the perspective of the "content" domain, in the following sections we will use the above framework to analyze interoperability scenarios in which the resources involved are Information Objects. Note, however, that the described framework (in terms of its characterizing axes) is generic enough to be easily adapted to interoperability scenarios involving other kinds of resources.
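As a compact illustration of the framework, the four axes can be read as the fields of a simple data structure describing a concrete interoperability scenario. The following Python sketch is only a didactic rendering of the axes listed above; the example values are illustrative and not a normative classification.

from dataclasses import dataclass

@dataclass
class ContentInteroperabilityScenario:
    provider: str
    consumer: str
    resource_model: list     # the properties both parties share an understanding of
    level: str               # e.g. "syntactic" or "semantic"
    reconciliation: str      # e.g. a standard or a mediator, and its protocol
    benchmark: dict          # effectiveness, efficiency, flexibility, ...

scenario = ContentInteroperabilityScenario(
    provider="Repository A",
    consumer="Portal B",
    resource_model=["identifier", "metadata"],
    level="syntactic",
    reconciliation="standard (shared metadata schema)",
    benchmark={"effectiveness": "high", "efficiency": "medium", "flexibility": "low"},
)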
3 Digital Library Content Modeling

Operating on the DL content means operating on the Information Objects that populate it. Interoperability with respect to Content is thus achieved when the provider and the consumer systems are interoperable with respect to these Information Objects. The model of an Information Object captures its distinguishing properties. When considered from an interoperability point of view, the model should capture both (a) properties concurring to form the state of the Information Object and (b) properties concurring to form the setting of the Information Object. Among the former set of properties we discuss below the modeling of the identifier, format, metadata, quality and protection, while among the latter we discuss the modeling of context and provenance.

Information Object Identifier. Information Object Identifiers are tokens bound to Information Objects that distinguish them from other Information Objects within a certain scope. They play a role that is similar to that of the Uniform Resource Identifiers (URIs) in the architecture of the World Wide Web [14], i.e. they represent a cornerstone in a scenario in which any party can share information with any other, since they make it possible to identify such shared information. As discussed in [30], such identifiers should be "persistent" and "actionable", i.e. they should give access to the resources and should continue to provide this access even when the associated resources are moved to other locations or even to other organizations. Identifier interoperability is necessary for the purpose of referring to the target Information Objects in the same way in the provider and consumer contexts.
Identifiers are often modeled using one of the many available standards. These include the Uniform Resource Name (URN), the Digital Object Identifier (DOI), the Persistent URL (PURL), the Handle system and the Archival Resource Key (ARK) [30]. In addition to these standards, there are other approaches based on content-based identification ("fingerprinting").

Information Object Format. An Information Object Format captures the structural (and sometimes operational) properties of Information Objects. It is a formal and intensional characterization of all the Information Objects having a given "type" or "data model". According to the Reference Model [2], Information Objects conceptually represent DL content in terms of a "graph" of digital objects associated with each other through relationships whose "label", i.e. name, expresses the nature of their association. The Format captures this kind of structure, including any constraint. Format interoperability is necessary to enable the consumer of the objects to safely and/or efficiently execute operations over them on the basis of the structural "assumptions" declared by the associated Information Object Format. From the modeling perspective, Information Object Formats range from rigid data models, where the model basically expresses "one" Information Object model allowing for light customizations (e.g. the DSpace [29], Greenstone [37] and EPrints [18] data models), to flexible models, where the model can potentially describe "any" information object model (e.g. the Fedora data model [15]). The current trend goes in the direction of data models for complex information entities such as resource aggregations or compound resources [4,3,17].

Information Object Metadata. Metadata is any structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource (an Information Object in our terminology) [21]. Metadata is often called data about data or information about information. Because of the potentially broad coverage of the term "metadata", the majority of interoperability problems risk falling into this category. In fact, many different metadata schemes are being developed in a variety of user environments and disciplines to better serve specific needs and to capture, through metadata, the distinguishing properties of the resources that are deemed relevant to the scope. Metadata interoperability is necessary to enable the consumer of the object to gather, or be informed of, the characteristics of the Information Object on which the partaking systems are willing to interoperate. The wider the set of resource properties captured through the metadata, the larger the potential understanding the consumer might achieve and the richer the functionality it will be able to realize by exploiting the reached understanding. From the modeling perspective, Information Object Metadata capture a set of properties and, because of this, classic data structures such as key-value models are exploited to represent them. Several schemas have been produced for this purpose, e.g. Dublin Core (http://dublincore.org/), MAchine-Readable Cataloging (MARC, http://www.loc.gov/marc/), the Metadata Encoding and Transmission Standard (METS, http://www.loc.gov/standards/mets/), the Metadata Object Description Schema (MODS, http://www.loc.gov/standards/mods/), and ISO 19115 (http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26020). The majority of them are dedicated to capturing bibliographic information, including cataloguing and classification details, and are encoded in XML. In addition to such schemas, others conceived to serve application-specific needs are continuously defined. In order to promote the interoperability of these application-specific schemas, the application profile [12] approach represents a good practice. Exemplars are the Darwin Core (http://www.tdwg.org/activities/darwincore/) and the Europeana Semantic Element Set (http://www.europeana.eu/).
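To make the key-value view of metadata concrete, the sketch below shows how a Dublin Core-like record for an Information Object might be represented and checked for a few required elements; the element values and the "required" rule are illustrative assumptions for the example, not a prescription of any particular schema.

```python
# A minimal, illustrative Dublin Core-style record as a key-value structure.
# The values and the "required" list below are assumptions made for the
# sake of the example, not mandated by the Dublin Core specification.
record = {
    "identifier": "urn:example:object:42",   # hypothetical identifier
    "title": "Portrait of a Lady",
    "creator": "Unknown painter",
    "date": "1510",
    "format": "image/tiff",
    "rights": "Access restricted to registered users",
}

REQUIRED = ["identifier", "title", "creator"]

def missing_elements(rec):
    """Return the required elements that are absent or empty."""
    return [e for e in REQUIRED if not rec.get(e)]

if __name__ == "__main__":
    gaps = missing_elements(record)
    print("record is complete" if not gaps else f"missing: {gaps}")
```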
Information Object Quality. Quality is a kind of meta-property, as it describes various "characteristics" of Information Object properties and sub-properties. Information Object Quality can be pragmatically defined as "the fitness for use of the information provided". It is a multi-faceted concept whose definition involves different dimensions, each capturing a specific aspect of object quality. More specifically, quality dimensions or parameters can refer either to the extension of data, i.e. to data values, or to their intension, i.e. to their schema/format. In addition, the need to capture the "quality of the quality", i.e. how the quality parameter values are produced or assessed, is a critical aspect to be considered. The data quality literature provides a thorough classification of data quality dimensions. By analyzing existing classifications, it is possible to define a basic set of data quality dimensions, including accuracy, completeness, consistency, and timeliness [1]. Because of the fundamental and pervasive role quality plays in any Information System, the Digital Library Reference Model [2] includes an entire domain to capture it. Quality interoperability is necessary to enable the consumer to exploit every kind of Information Object in a conscious manner, i.e. being aware of the qualitative aspects of this kind of information and thus being able to put proper actions in place on top of it. From the modeling perspective, quality characteristics closely resemble metadata; in fact, they can be considered a kind of metadata. Because of this, quality aspects might be part of a metadata record. A machine-readable Quality Profile containing quality assertions should be associated with an Information Object.

Information Object Protection. This is a highly complex problem that includes three sub-problems: security, integrity, and privacy. Security refers to the protection of content against accidental or intentional disclosure to unauthorized users or unauthorized uses. Integrity refers to the process of ensuring that the content remains an accurate reflection of the universe of discourse it is modeling or representing. Privacy refers to the rights of content providers to determine when, how, and to what extent their content is to be transmitted to content consumers. Protection represents one aspect of Policy, i.e. the set of conditions, rules, terms or regulations governing the operation of any Digital Library.
Because of the fundamental and pervasive role played by Policy, the Digital Library Reference Model [2] includes an entire domain to capture it. Information Object Protection becomes a concern when exchanges cross the "trust boundary"; beyond this logical line of demarcation it is rarely possible for an originating entity to assume that all potential recipients are authorized to access all the information they are capable of discovering and consuming. Information Object Protection interoperability is necessary to enable the consumer to be aware of the policies governing the Information Object and thus to put proper actions in place on top of it. From the modeling perspective, this kind of policy is information that can be stored in the metadata attached to the Information Objects, e.g. in the "rights" element of Dublin Core. In addition, there are specific languages for representing policies in a declarative manner, such as the eXtensible Access Control Markup Language (XACML, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml).

Information Object Context. Context is the set of all "setting" information that can be used to characterize the relation between the Information Object and the "external world" [9] surrounding it. Context represents distinguishing and complementary information that enriches the informative payload captured by the Information Object itself. Information Object Context interoperability is necessary to enable the consumer of the Information Object to behave as a context-aware system, i.e. a system that is conscious of the situations surrounding the Information Object and can adapt its consumption accordingly. From the modeling perspective, this information closely resembles metadata; in fact, it can be considered a kind of metadata. Strang and Linnhoff-Popien [28] pointed out the most relevant approaches to context modeling, including key-value pairs, markup schemes and ontology-based ones. Najar et al. [20] recently revised this survey to capture uses of context ranging from content adaptation to service adaptation.

Information Object Provenance. Provenance, also called lineage, pertains to the derivation history of the Information Object starting from its original sources, that is, it describes the process that led the object to its current state. It is a description of the origin and/or of the descendant line of data. Keeping track of provenance has become, in the last decade, crucial for the correct exploitation of data in a wide variety of application domains. Information Object Provenance interoperability is necessary to enable the consumer of the Information Object to be aware of the history leading to its current state and thus to perform exploitation actions that take this knowledge into account. From the modeling perspective, a number of provenance models have been proposed, ranging from generic models such as OPM [19], which aims to model any kind of provenance, to domain-specific models such as the FlyWeb provenance model [38]. The more domain-specific a model is, the more restrictive the domain of its provenance subjects is. These models usually materialize in XML or RDF files.
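As an illustration of how such provenance information might materialize, the sketch below records a simple derivation chain as a list of events and walks it back to the original source; the event vocabulary ("derivedFrom", "generatedBy") is a made-up minimal example, not the OPM or FlyWeb vocabulary.

```python
import json

# A deliberately minimal provenance record for one Information Object:
# each event says which agent produced which object from which source.
# The field names are illustrative, not taken from OPM or any standard.
provenance = [
    {"object": "urn:example:img:42-v2", "derivedFrom": "urn:example:img:42-v1",
     "generatedBy": "tiff-to-jpeg conversion", "agent": "ingest-service", "date": "2009-11-03"},
    {"object": "urn:example:img:42-v1", "derivedFrom": None,
     "generatedBy": "scanner capture", "agent": "digitization-lab", "date": "2009-10-20"},
]

def history(obj_id, events):
    """Walk the derivation chain backwards from obj_id to its original source."""
    by_object = {e["object"]: e for e in events}
    while obj_id in by_object:
        event = by_object[obj_id]
        yield event
        obj_id = event["derivedFrom"]

if __name__ == "__main__":
    for step in history("urn:example:img:42-v2", provenance):
        print(json.dumps(step))
```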
4 Levels of Content Interoperability

As discussed above, the same understanding of a model can occur at different levels of "completeness". In addition, since the Information Object model comprises several characteristics (properties), different levels of completeness can be achieved among the systems involved in an interoperability scenario, both with respect to the overall set of these characteristics and with respect to specific ones. The following levels are considered to be relevant for content interoperability.

– Technical/Basic Interoperability is mainly implemented at any level of the Information Object model, i.e. with respect to any characteristic described above. Common tools and interfaces provide the consumer with a superficial uniformity of the characteristics of the provider Information Objects, which allows the consumer to access them. However, when implementing this level of interoperability abstraction, the task of providing any coherence of the content relies on human intelligence.
– Syntactic Interoperability is concerned with ensuring that the abstract syntax of the "target" Information Object characteristics, in particular the metadata and related ones, is understandable by any other application (recipient) that was not initially developed for this purpose.
– Semantic Interoperability is concerned with ensuring that the precise meaning of the "target" Information Object features is understandable by any other application (recipient) that was not initially developed for this purpose. Semantic interoperability is achieved only when Information Object producer and consumer agree on the meaning of the Information Object (actually of its properties) they exchange.
– Operational Interoperability is concerned with ensuring the effective use of the "target" Information Object by the recipient in order to perform a specific task. This recipient ability is guaranteed by the fact that both originator and recipient share the same understanding with respect to the data quality property.
– Secure Interoperability is concerned with ensuring secure Information Object "exchanges" between the involved systems. This must be conducted with sufficient context so that the purpose to which the recipient applies the received Information Object is consistent with its use as intended by the originator [6].

These levels of interoperability may be subject to dependencies: operational/secure interoperability is only possible if semantic interoperability is ensured; semantic interoperability is only possible if syntactic interoperability is ensured; syntactic interoperability is only possible if technical interoperability is achieved.
5 Content Reconciliation Approaches

The most common solutions and approaches to Content reconciliation can be classified into two main classes: standard-based and mediator-based. Standard-based approaches rely on the usage of an agreed standard (or a combination of standards) that achieves a certain amount of homogeneity between the involved systems. Mediator-based approaches are based on the development of a component specifically conceived to host the interoperability machinery, i.e. a component mediating between the involved systems and aiming to reconcile the content heterogeneity.
However, because of the amount of heterogeneity to be reconciled, solutions properly mixing approaches belonging to the two classes can be successfully deployed. In the remainder of this section we describe the most commonly exploited ones with respect to the Information Object properties previously discussed.

5.1 Standard-Based Approaches

Standards, either de jure or de facto, represent one of the most common and well-recognized approaches to attacking interoperability issues at any level and in any domain. In this context, the term "standard" is intended in the very wide sense of a commonly agreed specification. Moreover, it is important to recall here that the success or failure of standards does not depend on technical merits only, as social and business considerations also come into play. Potentially, standards are everywhere, i.e. a standard can be defined to characterize every single aspect of a "system". Because of this, the list of standards reported in this section does not aim to be exhaustive or complete with respect to the standardization initiatives. The standards of interest for Content Interoperability can be classified into two main non-disjoint classes: standards for content representation and standards for content exchange. Exemplars of the first class are the various formats and schemas discussed in Section 3 that are exploited to represent content features, e.g. Dublin Core for metadata and MPEG-21 for the Information Object Format, as well as generic standards like XML (http://www.w3.org/XML/) and RDF (http://www.w3.org/RDF/). Exemplars of the second class are generic standards like RSS and Atom as well as OAI-PMH [16] and OAI-ORE [32,31], two well-known approaches to interoperability in the Digital Library content domain. In addition to these traditional standardization initiatives, we include approaches like Application Profiles and Derivation in this category.

Application Profiles. Even within a particular information community, there are different user requirements and special local needs. The details provided in a particular schema may not meet the needs of all user groups, and there is often no schema that meets all needs. To accommodate individual needs, an application profile [12] might be defined. In this approach, an existing schema is used as the basis for description in a particular digital library or repository, while individual needs are met through a set of specific application guidelines or policies, or through adaptation or modification, by creating an application profile for a particular interest group or user community.

Derivation. In this approach [5], a new schema is derived from an existing one. In a collection of digital repositories where different components have different needs and different requirements regarding description details, an existing complex schema may be used as the "source" or "model" from which new and simpler individual schemas may be derived. Specific derivation methods include adaptation, modification, expansion, partial adaptation, translation, etc. In each case, the new schema is dependent on the source schema.
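Returning to the exchange standards mentioned above, OAI-PMH in particular, the following sketch shows what a minimal harvesting loop might look like; the repository base URL is hypothetical, and error handling beyond basic resumption-token paging is omitted.

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.org/oai"  # hypothetical endpoint

def list_records(base_url, metadata_prefix="oai_dc"):
    """Yield <record> elements, following resumption tokens until exhausted."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    while url:
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(f"{OAI_NS}record"):
            yield record
        token = tree.find(f".//{OAI_NS}resumptionToken")
        url = (f"{base_url}?verb=ListRecords&resumptionToken={token.text}"
               if token is not None and token.text else None)

if __name__ == "__main__":
    for rec in list_records(BASE_URL):
        header = rec.find(f"{OAI_NS}header/{OAI_NS}identifier")
        print(header.text if header is not None else "(no identifier)")
```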
5.2 Mediator-Based Approaches

As already mentioned, a key concept enabling content interoperation among heterogeneous systems is mediation [35]. This concept has been used to cope with many heterogeneity dimensions, ranging from terminology to representation formats, transfer protocols, semantics, etc. [27,36]. The content mediation concept is implemented by a mediator, which is a software device that supports (a) a mediation schema capturing user (originator and recipient) requirements, and (b) an intermediation function that describes how to represent the distributed information object sources in terms of the mediation schema. A key feature characterizing a mediation process is the kind of reconciliation function implemented by the mediator. There are three main approaches:

– Mapping, which refers to how information object structures, properties and relationships are mapped from one representation scheme/formalism to another one that is equivalent from the semantic point of view.
– Matching, which refers to the action of verifying whether two strings/patterns match, or whether semantically heterogeneous information objects match.
– Integration, which refers to the action of combining information objects residing in different heterogeneous sources and providing users with a unified view of these objects (or combining domain knowledge that is expressed in domain ontologies).

At each level of content interoperability identified in Section 4 a specific mediation process might be applied. Technical mediation enables the linking of systems and services through the use of common tools, open interfaces, interconnection services, and middleware. Syntactic mediation is mainly implemented at the information object metadata level and makes it possible to bridge the differences between metadata formats at the syntactic level. Semantic mediation enables the bridging of the differences between the exchanged information objects at the semantic level, thus allowing information objects to be exchanged according to semantic matching. Operational mediation guarantees that both the information object originator and recipient share the same quality dimensions described in the quality profile associated with the exchanged information object. Protection mediation enables the recipient to use the received information object without violating the security, integrity and privacy constraints associated with it, and to protect it from unauthorized users. The approaches put in place by the mediator service are based on preliminary knowledge of the heterogeneities among the partaking entities and on how to reconcile them, including the usage of a pivot schema or lingua franca and a series of mappings and rewriting rules. These make it possible to realize crosswalks and might be based on the use of some ontology.

Crosswalks. A crosswalk is a mapping of the elements, semantics, and syntax from one scheme to another. Currently, crosswalks are by far the most commonly used approach to enable interoperability between and among metadata schemes.
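A minimal sketch of what a crosswalk might look like in practice follows; the source field names and the mapping table are invented for illustration and do not reproduce any published crosswalk between real schemes.

```python
# A toy crosswalk: rename fields of a source record into Dublin Core-like
# elements. The source field names and the mapping are illustrative only.
CROSSWALK = {
    "main_title": "title",
    "author_name": "creator",
    "publication_year": "date",
    "topic": "subject",
}

def apply_crosswalk(source_record, crosswalk=CROSSWALK):
    """Return a new record whose keys follow the target scheme."""
    target = {}
    for source_field, value in source_record.items():
        target_field = crosswalk.get(source_field)
        if target_field is None:
            continue  # fields without a mapping are simply dropped
        target.setdefault(target_field, []).append(value)
    return target

if __name__ == "__main__":
    record = {"main_title": "Annunciation", "author_name": "Raphael",
              "publication_year": "1502", "topic": "painting"}
    print(apply_crosswalk(record))
```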
Ontology-based Approaches. Ontologies were developed by the Artificial Intelligence community to facilitate knowledge sharing and reuse, and they are largely used for representing domain knowledge. An ontology is a formal, explicit specification of a shared abstract model of some domain knowledge in the world that identifies that domain's relevant concepts [8]. Ontologies have been extensively used in supporting all three content mediation approaches, i.e. mapping, matching and integration, because they provide an explicit and machine-understandable conceptualization of a domain. They have been used in one of the following three ways [33]. In the single ontology approach, all source schemas are directly related to a shared global ontology that provides a uniform interface to the user [7]. In the multiple ontology approach, each data source is described by its own (local) ontology separately; instead of using a common ontology, local ontologies are mapped to each other. In the hybrid ontology approach, a combination of the two preceding approaches is used. An ontology provides a framework within which the semantic matching/mapping process can be carried out by identifying and purging semantic divergence. Semantic divergence occurs where the semantic relationship between the ontology and the representation is not direct and straightforward [24].
6 Concluding Remarks

In this paper we have reported some preliminary results concerning the study of DL Content Interoperability. In particular, we have presented an Interoperability Framework with the aim of contributing to a better understanding of this challenging problem and of its possible solutions through a systematic and organized approach. The most important properties of content, from the interoperability perspective, have been introduced and discussed. The main techniques for modeling them have been reviewed. The relevant levels of content interoperability have been identified and defined. Finally, a number of content interoperability approaches have been presented and discussed. The proposed Interoperability Framework is currently being used to collect interoperability requirements and to design appropriate solutions to this problem in the context of the D4Science-II project (www.d4science.eu). This project aims at creating an initial ecosystem of interoperable data infrastructures and repository systems capable of exploiting each other's content resources. The type of content managed by the components of this ecosystem is very heterogeneous, which makes the process of requirement collection, analysis and design particularly complex. So far the framework has guided a systematic collection of requirements from the different actors, and this systematic approach has largely facilitated the analysis phase. No significant gap has been identified in the Content Framework in this initial phase. We expect a more thorough evaluation of the framework during the design phase, when the solutions will also have to be modeled and described.

Acknowledgments. The work reported has been partially supported by the DL.org Coordination and Support Action, within FP7 of the European Commission (ICT-2007.4.3, Contract No. 231551). We are grateful for many helpful suggestions from Detlev Balzer, Stefan Gradmann, C.H.J.P. Hendriks, Carlo Meghini, Luc Moreau and John Mylopoulos, members of the DL.org Content Working Group.
References

1. Batini, C., Scannapieco, M.: Data Quality: Concepts, methodologies and techniques. Springer, Heidelberg (2006)
2. Candela, L., Castelli, D., Ferro, N., Ioannidis, Y., Koutrika, G., Meghini, C., Pagano, P., Ross, S., Soergel, D., Agosti, M., Dobreva, M., Katifori, V., Schuldt, H.: The DELOS Digital Library Reference Model - Foundations for Digital Libraries. In: DELOS: a Network of Excellence on Digital Libraries (February 2008) ISSN 1818-8044, ISBN 2-912335-37-X
3. Candela, L., Castelli, D., Manghi, P., Mikulicic, M., Pagano, P.: On Foundations of Typed Data Models for Digital Libraries. In: Fifth Italian Research Conference on Digital Library Management Systems, IRCDL 2009 (2009)
4. Candela, L., Castelli, D., Pagano, P., Simi, M.: From Heterogeneous Information Spaces to Virtual Documents. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds.) ICADL 2005. LNCS, vol. 3815, pp. 11–22. Springer, Heidelberg (2005)
5. Chan, L.M., Zeng, M.L.: Metadata Interoperability and Standardization - A Study of Methodology Part I: Achieving Interoperability at the Schema Level. D-Lib Magazine 12(6) (June 2006)
6. Connors, C.L., Malloy, M.A., Masek, E.V.: Enabling Secure Interoperability Among Federated National Entities: It's a Matter of Trust. Technical report, MITRE Corporation (2007)
7. Cruz, I.F., Xiao, H.: Using a layered approach for interoperability on the semantic web, p. 221 (2003)
8. Cruz, I.F., Xiao, H.: The role of ontologies in data integration. Journal of Engineering Intelligent Systems 13, 245–252 (2005)
9. Dey, A.K.: Understanding and using context. Personal Ubiquitous Comput. 5(1), 4–7 (2001)
10. Gasser, U., Palfrey, J.: Breaking Down Digital Barriers - When and How Interoperability Drives Innovation. Berkman Publication Series (November 2007)
11. Griffin, S.M.: NSF/DARPA/NASA Digital Libraries Initiative - A Program Manager's Perspective. D-Lib Magazine (July/August 1998)
12. Heery, R., Patel, M.: Application profiles: mixing and matching metadata schemas. Ariadne 25 (2000)
13. Heiler, S.: Semantic interoperability. ACM Comput. Surv. 27(2), 271–273 (1995)
14. Jacobs, I., Walsh, N.: Architecture of the World Wide Web, vol. 1. Technical report, W3C (December 2004)
15. Lagoze, C., Payette, S., Shin, E., Wilper, C.: Fedora: An Architecture for Complex Objects and their Relationships. Journal of Digital Libraries, Special Issue on Complex Objects (2005)
16. Lagoze, C., Van de Sompel, H.: The open archives initiative: building a low-barrier interoperability framework. In: Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 54–62. ACM Press, New York (2001)
17. Lagoze, C., Van de Sompel, H., Johnston, P., Nelson, M., Sanderson, R., Warner, S.: ORE Specification - Abstract Data Model. Technical report, Open Archives Initiative (2008)
18. Millington, P., Nixon, W.J.: EPrints 3 Pre-Launch Briefing. Ariadne 50 (2007)
19. Moreau, L., Plale, B., Miles, S., Goble, C., Missier, P., Barga, R., Simmhan, Y., Futrelle, J., McGrath, R., Myers, J., Paulson, P., Bowers, S., Ludaescher, B., Kwasnikowska, N., Van den Bussche, J., Ellkvist, T., Freire, J., Groth, P.: The open provenance model (v1.01). Technical report, University of Southampton (July 2008)
20. Najar, S., Saidani, O., Kirsch-Pinheiro, M., Souveyet, C., Nurcan, S.: Semantic representation of context models: a framework for analyzing and understanding. In: CIAO 2009: Proceedings of the 1st Workshop on Context, Information and Ontologies, pp. 1–10. ACM, New York (2009)
21. National Information Standards Organization: Understanding Metadata. NISO Press (2004)
22. Paepcke, A., Chang, C.-C.K., Winograd, T., García-Molina, H.: Interoperability for Digital Libraries Worldwide. Communications of the ACM 41(4), 33–42 (1998)
23. Park, J., Ram, S.: Information Systems Interoperability: What Lies Beneath? ACM Trans. Inf. Syst. 22(4), 595–632 (2004)
24. Partridge, C.: The role of ontology in integrating semantically heterogeneous databases. Technical Report 05/02, LADSEB-CNR (2002)
25. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001)
26. Ram, S., Park, J., Lee, D.: Digital libraries for the next millennium: Challenges and research directions. Information Systems Frontiers 1(1), 75–94 (1999)
27. Spalazzese, R., Inverardi, P., Issarny, V.: Towards a Formalization of Mediating Connectors for on the Fly Interoperability. In: Joint Working IEEE/IFIP Conference on Software Architecture 2009 & European Conference on Software Architecture 2009, Cambridge, United Kingdom, CONNECT (2009)
28. Strang, T., Linnhoff-Popien, C.: A context modeling survey. In: Davies, N., Mynatt, E.D., Siio, I. (eds.) UbiComp 2004. LNCS, vol. 3205. Springer, Heidelberg (2004)
29. Tansley, R., Bass, M., Stuve, D., Branschofsky, M., Chudnov, D., McClellan, G., Smith, M.: The DSpace Institutional Digital Repository System: current functionality. In: Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 87–97. IEEE Computer Society, Los Alamitos (2003)
30. Tonkin, E.: Persistent Identifiers: Considering the Options. Ariadne 56 (2008)
31. Van de Sompel, H., Lagoze, C., Bekaert, J., Liu, X., Payette, S., Warner, S.: An Interoperable Fabric for Scholarly Value Chains. D-Lib Magazine 12(10) (October 2006)
32. Van de Sompel, H., Payette, S., Erickson, J., Lagoze, C., Warner, S.: Rethinking Scholarly Communication - Building the System that Scholars Deserve. D-Lib Magazine 10(9) (September 2004)
33. Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., Hübner, S.: Ontology-based integration of information - a survey of existing approaches. In: Stuckenschmidt, H. (ed.) IJCAI 2001 Workshop: Ontologies and Information Sharing, pp. 108–117 (2001)
34. Wegner, P.: Interoperability. ACM Comput. Surv. 28(1), 285–287 (1996)
35. Wiederhold, G.: Mediators in the Architecture of Future Information Systems. Computer 25(3), 38–49 (1992)
36. Wiederhold, G., Genesereth, M.: The conceptual basis for mediation services. IEEE Expert: Intelligent Systems and Their Applications 12(5), 38–47 (1997)
37. Witten, I., Bainbridge, D., Boddie, S.: Greenstone - Open-Source Digital Library Software. D-Lib Magazine 7(10) (October 2001)
38. Zhao, J., Miles, A., Klyne, G., Shotton, D.: Linked data and provenance in biological data webs. Briefings in Bioinformatics 10(2), 139–152 (2009)
Integrating a Content-Based Recommender System into Digital Libraries for Cultural Heritage

Cataldo Musto, Fedelucio Narducci, Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro

Department of Computer Science, University of Bari "Aldo Moro", Italy
{musto,narducci,lops,degemmis,semeraro}@di.uniba.it
http://www.di.uniba.it/
Abstract. Throughout the last decade, the area of Digital Libraries (DL) has attracted more and more interest from both the research and development communities. Likewise, since the release of new platforms enriches them with new features and makes DL more powerful and effective, the number of web sites integrating this kind of tool is rapidly growing. In this paper we propose an approach for the exploitation of digital libraries for personalization purposes in a cultural heritage scenario. Specifically, we integrated FIRSt (Folksonomy-based Item Recommender syStem), a content-based recommender system developed at the University of Bari, and Fedora, a flexible digital library architecture, in a framework for the adaptive fruition of cultural heritage implemented within the activities of the CHAT research project. In this scenario, the role of the digital library was to store information (both textual and multimedia) about paintings gathered from the Vatican Picture Gallery and to provide it in a multimodal and personalized way through a PDA device given to a user before her visit to the museum. This paper describes the system architecture of our recommender system and its integration into the framework implemented for the CHAT project, showing how this recommendation model has been applied to recommend the artworks located at the Vatican Picture Gallery (Pinacoteca Vaticana), providing users with a personalized museum tour tailored to their tastes. The experimental evaluation we performed also confirmed that these recommendation services are really able to capture the actual user preferences, thus improving their experience in cultural heritage fruition.

Keywords: Recommender Systems, Digital Libraries, Machine Learning, Personalization, Filtering.
1 Introduction
The amount of information available on the Web and in Digital Libraries is increasing over time. In this context, the role of user modeling and personalized
information access is becoming crucial: users need personalized support in sifting through large amounts of retrieved information according to their interests. Information filtering systems, relying on this idea, adapt their behavior to individual users by learning their preferences during the interaction, in order to construct a profile of the user that can later be exploited in selecting relevant items. Nowadays Recommender Systems (RS) represent the main area where principles and techniques of Information Filtering are applied. In general, among the different recommendation techniques that have been put forward in studies on this matter, the content-based and the collaborative filtering approaches are the most widely adopted to date. This work is focused on content-based recommender systems. In the next section we will introduce FIRSt (Folksonomy-based Item Recommender syStem): it represents the core of this paper, since in this work we tried to exploit its accuracy in producing recommendations for personalization goals in a real-world application. Specifically, we exposed FIRSt basic features through a set of web services and we integrated them with a flexible digital library architecture called Fedora (http://www.fedora-commons.org/). In this scenario, the goal of the digital library was to store information (both textual and multimedia) about a set of paintings and to provide it in an adaptive (namely, multimodal and personalized) way through a mobile device. In this way, we showed that content-based recommender systems and digital libraries can be exploited for personalization goals in a museum scenario, letting Vatican Picture Gallery visitors receive suggestions about artworks they could be interested in and tailoring museum tours to their tastes. This research has been conducted within the CHAT project (Cultural Heritage fruition and e-learning applications of new Advanced multimodal Technologies), which aims at developing new systems and services for multimodal fruition of cultural heritage content. Data has been gathered from the collections of the Vatican Picture Gallery, for which both images and detailed textual information about paintings are available, and users involved in the study were asked to both rate and annotate them with tags. The paper is organized as follows. Section 2 introduces the general problem of information filtering and recommender systems; the architecture of FIRSt is described in Section 3, whereas Section 4 focuses on the design and development of the web services exposing FIRSt functionalities. The experimental session carried out to evaluate the effectiveness of the implemented web services is presented in Section 5. Related work is briefly analyzed in Section 6, while conclusions and directions for future work are drawn in the last section.
2 Information Filtering and Recommender Systems
As proved by the continuous growth of web sites which embody recommender systems as a way of personalizing their content for users, nowadays these systems
represent the main field of application of principles and techniques coming from Information Filtering (IF) [9]. At Amazon.com, recommendation algorithms are used to personalize the online store for each customer, for example showing programming titles to a software engineer and baby toys to a new mother [5]. Recommendation approaches can be generally classified into two main categories: content-based and collaborative ones. Content-based systems analyze a set of documents, usually textual descriptions of the items previously rated by an individual user, and build a model or profile of user interests based on the features of the objects rated by that user [8]. In this approach the static content associated with items (the plot of a film, the description of an artwork, etc.) is usually exploited. The profile is then used to recommend new relevant items. Collaborative recommender systems differ from content-based ones in that user opinions are used instead of content. User ratings about objects are gathered and stored in a centralized or distributed database. To provide recommendations to user X, the system first computes the neighborhood of that user (i.e. the subset of users that have a taste similar to X). Similarity in taste is measured by computing the closeness of ratings for objects that were rated by both users. The system then recommends objects that users in X's neighborhood indicated they liked, provided that they have not yet been rated by X. Although each type of filtering method has its own weaknesses and strengths [12,1], in this work we focus our attention on a single class of recommenders, introducing in the next section the general architecture of FIRSt, which represents the core of the personalization mechanisms designed for the Vatican Picture Gallery scenario.
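A minimal sketch of the neighborhood computation just described is shown below; the ratings, the cosine measure and the neighborhood size are illustrative choices, not the method used by any specific system discussed in this paper.

```python
from math import sqrt

# Toy user-item rating matrix (1-5 scale); users and ratings are invented.
ratings = {
    "X":   {"opus1": 5, "opus2": 3, "opus4": 4},
    "ann": {"opus1": 4, "opus2": 3, "opus3": 5, "opus4": 4},
    "bob": {"opus1": 1, "opus2": 5, "opus3": 2},
}

def cosine(u, v):
    """Cosine similarity over the items rated by both users."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def neighborhood(target, all_ratings, k=1):
    """Return the k users most similar in taste to the target user."""
    others = [(cosine(all_ratings[target], r), user)
              for user, r in all_ratings.items() if user != target]
    return [user for _, user in sorted(others, reverse=True)[:k]]

if __name__ == "__main__":
    print(neighborhood("X", ratings, k=2))
```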
3 FIRSt: Folksonomy-Based Item Recommender syStem
FIRSt is a semantic content-based recommender system capable of providing recommendations for items in several domains (e.g., movies, music, books), provided that descriptions of the items are available as text documents (e.g. plot summaries, reviews, short abstracts) [6]. In the context of cultural heritage personalization, for example, an artwork can be represented by at least three textual components (called slots), namely artist, title, and description. The founding idea behind FIRSt is to include folksonomies in a classic content-based recommendation model, integrating static content describing items with dynamic user-generated content (namely tags, collected through social tagging of the items to be recommended) in the process of learning user profiles. Tags are collected during the training step by letting users:

1. express their preferences for items through a numerical rating;
2. annotate rated items with free tags.
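A hypothetical slot-based representation of a rated and tagged artwork is sketched below; the structure is only meant to visualize the notion of slots plus user feedback, and does not reproduce FIRSt's internal format.

```python
# Illustrative only: an item described by textual slots, plus the tags and
# rating contributed by one user during the training phase.
artwork = {
    "id": "opus-017",
    "slots": {
        "artist": "Raphael",
        "title": "Transfiguration",
        "description": "Altarpiece depicting the transfigured Christ above "
                       "the apostles; oil on wood.",
    },
}

user_feedback = {
    "user": "u12",
    "item": "opus-017",
    "rating": 5,                                      # numerical preference (1-5)
    "tags": ["renaissance", "altarpiece", "christ"],  # free annotations
}

def tagged_text(item, feedback):
    """Concatenate the static slots with the user's tags as a single text view."""
    parts = list(item["slots"].values()) + feedback["tags"]
    return " ".join(parts)

if __name__ == "__main__":
    print(tagged_text(artwork, user_feedback)[:80], "...")
```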
Fig. 1. Screenshot of Learning Platform
Tags are then stored in an additional slot, different from those containing static content, and are exploited in the profile learning phase in order to include them in the user profiles. The general architecture of FIRSt is depicted in Figure 2.

Fig. 2. FIRSt General Architecture

The recommendation process is performed in three steps, each of which is handled by a separate component:

– Content Analyzer – it introduces semantics in the recommendation process by analyzing documents and tags in order to identify relevant concepts representing the content. This process selects, among all the possible meanings (senses) of each polysemous word, the correct one according to the context in which the word occurs. In this way, documents and tags are represented using concepts instead of keywords, in an attempt to overcome the problems due to natural language ambiguity. The final outcome of the preprocessing step is a repository of disambiguated documents. This semantic indexing is strongly based on natural language processing techniques, such as Word Sense Disambiguation (WSD), and heavily relies on linguistic knowledge stored in the WordNet [7] lexical ontology. Semantic indexing of content is performed by the Content Analyzer, which relies on META (Multi Language Text Analyzer), a natural language processing tool developed at the University of Bari that is able to deal with documents in English or Italian [2]. The adopted WSD strategy is not described here, because it has already been published in [11].
– Profile Learner – it implements a supervised learning technique for learning a probabilistic model of user interests from disambiguated documents rated by the user according to her interests. This model represents the semantic profile, which includes those concepts that turn out to be the most indicative of the user preferences. In FIRSt the problem of learning user profiles is cast as a binary Text Categorization task [10], since each document has to be classified as interesting or not with respect to the user preferences. The algorithm for inferring user profiles is naïve Bayes text learning, widely adopted in content-based recommenders. Details about the implemented algorithm are provided in [4].
– Recommender – it exploits the user profile to suggest relevant documents by matching the concepts contained in the semantic profile against those contained in the documents to be recommended.

The outcome of the experiments conducted in [3] demonstrated that the integration of tags in the recommendation process increases the predictive accuracy of the recommender. In this work we continue this analysis by showing how this higher accuracy can be exploited for personalization goals in a real-world application: artwork recommendation in the Vatican Picture Gallery scenario. The strategy adopted to personalize the services for the artwork recommendation scenario might also be exploited for the design of recommendation services for e-commerce applications.
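To illustrate the kind of binary text categorization performed by the Profile Learner, the sketch below trains a naïve Bayes classifier on a handful of invented item descriptions labeled as interesting or not; it uses scikit-learn for brevity and is not the actual FIRSt implementation, which works on WordNet concepts rather than raw words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: item descriptions rated by one user,
# mapped to the binary classes "likes" (1) and "dislikes" (0).
documents = [
    "renaissance altarpiece depicting the madonna with angels",
    "baroque still life with flowers and fruit",
    "renaissance fresco fragment with saints",
    "nineteenth century landscape with river and trees",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
model = MultinomialNB().fit(X, labels)

# Score an unseen description: probability of the class "likes".
new_item = ["renaissance panel with the annunciation"]
prob_likes = model.predict_proba(vectorizer.transform(new_item))[0][1]
print(f"classification score for 'likes': {prob_likes:.2f}")
```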
4 CHAT Project Overview
CHAT is a research project which aims at developing a platform for multimodal fruition of cultural heritage content.
Fig. 3. Data modeling in Fedora Digital Library
We exposed FIRSt basic features through a set of Web Services and integrated them in an adaptive platform for multimodal and personalized access to museum collections, letting Vatican Picture Gallery visitors receive suggestions about artworks they could be interested in and tailoring museum tours to their tastes. All the information about a museum is contained in specific data structures and is stored in the Fedora digital library. The contents in Fedora are represented by digital objects. In our data modeling (Figure 3) we have three main digital objects: opus, room, and author. Some relations are defined among them:

– hasCollectionMember, which relates a room with an opus (the inverse relation is hasLocation);
– isAuthorOf, which relates an author with an opus (the inverse relation is hasAuthor).

Every digital object in Fedora can have one or more datastreams (image, audio, video, text). For each opus (painting) we have the following datastreams: description (txt or html), image (jpg format), audio (wav or mp3), video (mpeg); author and room have only textual content. All this data is static and not personalized: thus, starting from the same request (for example, a more detailed textual description of an artwork), all users obtain the same answer. How can we improve the quality of the information shown to the visitors? To address this question, FIRSt comes into play.
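Before describing how FIRSt personalizes access, a minimal, purely illustrative encoding of the data model sketched above might look as follows; the identifiers, field names and relation handling are assumptions made for the example and do not reflect the actual Fedora object model or the CHAT repository.

```python
# Three illustrative digital objects and the relations among them.
# Identifiers and structure are hypothetical, not actual Fedora objects.
objects = {
    "author:raphael": {"type": "author", "datastreams": {"text": "Raffaello Sanzio, 1483-1520"}},
    "room:VIII":      {"type": "room",   "datastreams": {"text": "Room VIII, Raphael"}},
    "opus:transfig":  {"type": "opus",
                       "datastreams": {"description": "transfig.html", "image": "transfig.jpg",
                                       "audio": "transfig.mp3", "video": "transfig.mpeg"}},
}

relations = [
    ("room:VIII", "hasCollectionMember", "opus:transfig"),   # inverse: hasLocation
    ("author:raphael", "isAuthorOf", "opus:transfig"),       # inverse: hasAuthor
]

def related(subject, predicate):
    """Return the objects linked to `subject` by `predicate`."""
    return [o for s, p, o in relations if s == subject and p == predicate]

if __name__ == "__main__":
    for opus in related("room:VIII", "hasCollectionMember"):
        print(opus, objects[opus]["datastreams"]["image"])
```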
Fig. 4. CHAT Adaptive Dialog Manager for multimodal and personalized fruition
In the expected scenario, every visitor entering the museum is provided with a device (PDA/smart phone) with a specific application installed. Thanks to some localization sensors, it is possible to know in which room of the museum the user is; while coming through the doorway, the visitor can acquire detailed information on each painting in that room. The core of the CHAT system architecture is the Adaptive Dialog Manager (Figure 4), whose purpose is to manage the personalization mechanisms in order to let visitors receive suggestions about artworks they could be interested in. The Adaptive Dialog Manager embeds many components, called reasoners, each of which manages a different type of information (about the environment, such as noise and brightness; about the user, such as age and interaction speed; and about user tastes) coming from different input channels and localization sensors. All the data gathered from each reasoner are merged by the Adaptive Dialog Manager, which exploits the user profile to find the most interesting items and the most appropriate way to present them, such as just audio, audio and text, video, etc.

4.1 Scenario
FIRSt manages the content-based profile, which contains all the information about user preferences on artworks (data that cannot be omitted for personalization goals). The typical steps of a scenario for creating content-based profiles are:

1. Registration to the Museum Portal. In a preliminary phase the user has to subscribe to a dedicated portal. After entering typical demographic data, such as age, sex, education, explicit interests, etc., the user performs a training phase by rating some artworks, randomly chosen from the available ones. After the completion of this step, all the user preferences about artworks are stored by FIRSt;
2. Construction of the User Profile. Once the training phase is finished, FIRSt builds a profile for each user containing the information that turns out to be most indicative of the user preferences;
3. Personalization of Results. When the user visits the museum, she will enjoy additional intelligent personalized services based on her own user profile, such as the filtering of non-relevant items or the building of a personalized museum tour according to her preferences.

Specifically, we found two situations where personalized access could improve the user experience:

– when data about user interests are gathered, splitting paintings into two disjoint sets, interesting and not interesting, could be useful to build a personalized tour;
– when the user enters a room, providing suggestions about paintings she could be interested in can be useful.

To fulfil these specific requirements, we developed two Web Services exposing these functionalities:

– Tour: the main idea behind this service is to provide the target user with the subset of items she could be interested in (according to her profile). It is possible to return the first n items for a single user, for example to obtain the most significant paintings for her. In this service there is no room-based partition. This service has been used in CHAT to build a personalized museum tour showing visitors a list of paintings ordered according to their interests;
– Room: it provides the target user with content related to the items located in a specific room of the museum. This service takes as input the room identifier (provided by environment sensors) and returns all the artworks in that location. Room can be adapted to a specific user by exploiting the FIRSt system which, according to the information stored in the user profile, is able to rank the paintings according to the user interests. This service has been used in CHAT to discover the most interesting rooms in the museum. This could be very useful when the visitor does not have much time to visit the entire museum; in that case, the visitor could plan her visit by starting from the first n locations of interest suggested by the system.
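The sketch below illustrates how two such services might be wired on top of a generic scoring function; the function names, signatures and scoring logic are hypothetical and do not correspond to the actual FIRSt web service interfaces.

```python
# Hypothetical back-end for the two services described above.
# `profile_score(user_id, opus_id)` stands in for the recommender's
# classification score for the class "likes"; here it is just a stub.
SCORES = {("u12", "opus-01"): 0.91, ("u12", "opus-02"): 0.35, ("u12", "opus-03"): 0.72}
ROOMS = {"room-VIII": ["opus-01", "opus-02"], "room-IX": ["opus-03"]}

def profile_score(user_id, opus_id):
    return SCORES.get((user_id, opus_id), 0.0)

def tour_service(user_id, n=10):
    """Return the top-n artworks for the user, ignoring room boundaries."""
    all_opera = [o for opera in ROOMS.values() for o in opera]
    ranked = sorted(all_opera, key=lambda o: profile_score(user_id, o), reverse=True)
    return ranked[:n]

def room_service(user_id, room_id):
    """Return the artworks of one room, ranked by the user's profile."""
    opera = ROOMS.get(room_id, [])
    return sorted(opera, key=lambda o: profile_score(user_id, o), reverse=True)

if __name__ == "__main__":
    print(tour_service("u12", n=2))
    print(room_service("u12", "room-VIII"))
```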
5 Experimental Evaluation
The goal of the experimental evaluation was to understand whether the use of FIRSt in this scenario brings a substantial improvement of the user experience in museum collection fruition. The test has been carried out by using an online platform which allows registered users to train the system by rating some paintings belonging to the art gallery. The rating scale varies from 1 (dislikes) to 5 (likes). A group of 30 users has been recruited for the test. The task has been divided into three phases:
1. Registration and login;
2. Rating the artworks;
3. Tagging the artworks.

After the registration, each user rated 45 artworks taken from the Vatican Picture Gallery web site and annotated them with a set of free tags. In order to evaluate the effectiveness of the services, we adopted the Normalized Distance-based Performance Measure (NDPM) [16], which compares the ranking imposed by the user ratings with the one computed by FIRSt. More specifically, NDPM measures the distance between the votes given by a single user u and the votes predicted by the system s for a set of items. Given a pair of items $t_i, t_j$ in the Test Set $T$ of a user, the distance between them is calculated through the following schema:

$$
\delta_{>_u,>_s}(t_i, t_j) =
\begin{cases}
2 & \iff (t_i >_u t_j \wedge t_j >_s t_i) \vee (t_i >_s t_j \wedge t_j >_u t_i) \\
1 & \iff (t_i >_s t_j \vee t_j >_s t_i) \wedge t_i \approx_u t_j \\
0 & \iff \text{otherwise}
\end{cases}
\qquad (1)
$$

The value of NDPM on the Test Set $T$ is calculated through the following equation, where $n$ is the number of pairs of items:

$$
NDPM_{>_u,>_s}(T) = \frac{\sum_{i \neq j} \delta_{>_u,>_s}(t_i, t_j)}{2 \cdot n}
\qquad (2)
$$

For the Room service, a single room was set as test set, in order to measure the distance between the ranking imposed on the paintings in a room by the user ratings and the ranking predicted by FIRSt. The methodology of the experiment was as follows:

1. the Training Set ($TS_i$) for user $u_i$, $i = 1..30$, is built by including 50% of all the ratings given by $u_i$ (randomly selected);
2. the profile for $u_i$ is built by FIRSt by exploiting the ratings in $TS_i$;
3. the profile is used for the computation of the classification scores for the class likes for the paintings not included in $TS_i$;
4. the scores computed by FIRSt and the ratings given by the users on the paintings not included in $TS_i$ are compared.

The test was carried out for 3 rooms in which paintings are located.
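A small sketch of how Equations (1) and (2) might be computed is given below; the example ratings and predicted scores are invented, and the pair enumeration follows the definition above.

```python
from itertools import combinations

def delta(pair, user_rank, system_rank):
    """Distance for one pair of items, following Equation (1)."""
    ti, tj = pair
    u, s = user_rank, system_rank
    if (u[ti] > u[tj] and s[tj] > s[ti]) or (s[ti] > s[tj] and u[tj] > u[ti]):
        return 2            # contradictory orderings
    if (s[ti] > s[tj] or s[tj] > s[ti]) and u[ti] == u[tj]:
        return 1            # system orders items the user considers equivalent
    return 0

def ndpm(user_rank, system_rank):
    """Equation (2): sum of pair distances over twice the number of pairs."""
    pairs = list(combinations(user_rank, 2))
    return sum(delta(p, user_rank, system_rank) for p in pairs) / (2 * len(pairs))

if __name__ == "__main__":
    user_ratings = {"opus1": 5, "opus2": 3, "opus3": 3, "opus4": 1}    # invented
    system_scores = {"opus1": 0.9, "opus2": 0.4, "opus3": 0.7, "opus4": 0.2}
    print(round(ndpm(user_ratings, system_scores), 3))
```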
Generally, NDPM values lower than 0.5 reveal acceptable agreement between the two rankings. From the results reported in Table 1, it can be noticed that the average NDPM is lower than 0.5. In particular, values are lower than 0.5 for 19 users out of 30 (63%), highlighted in bold in the table. Among these users, NDPM for 9 of them is even lower than 0.4, thus revealing that the ranking of paintings proposed by FIRSt is very effective for 30% of the population involved in the test. The main conclusion which can be drawn from the experiment is that the service is capable of providing a quite effective user experience in museum fruition.

Table 1. NDPM for each user (averaged on 3 rooms)

User NDPM   User NDPM   User NDPM   User NDPM   User NDPM
u1   0.56   u2   0.48   u3   0.53   u4   0.65   u5   0.57
u6   0.52   u7   0.38   u8   0.54   u9   0.39   u10  0.39
u11  0.46   u12  0.51   u13  0.49   u14  0.36   u15  0.35
u16  0.43   u17  0.45   u18  0.36   u19  0.46   u20  0.35
u21  0.51   u22  0.47   u23  0.39   u24  0.55   u25  0.36
u26  0.46   u27  0.39   u28  0.42   u29  0.60   u30  0.55
                                                Avg  0.47

6 Related Work
Museums have already recognized the importance of providing visitors with personalized access to artifacts. The PEACH (Personal Experience with Active Cultural Heritage) [13] and CHIP (Cultural Heritage Information Personalization) [15] projects are only two examples of the research efforts devoted to supporting visitors in enjoying a personalized experience and tour when visiting artwork collections. In particular, the recommender system developed within CHIP aims to provide personalized access to the collections of the Rijksmuseum in Amsterdam. It combines Semantic Web technologies and content-based algorithms for inferring visitors' preferences from a set of scored artifacts and then recommending other artworks and related content topics. The Steve.museum consortium [14] has begun to explore the use of social tagging and folksonomies in cultural heritage personalization scenarios, to increase audience engagement with museums' collections. Supporting social tagging of artifacts and providing access based on the resulting folksonomy opens museum collections to new interpretations, which reflect visitors' perspectives rather than curators' ones, and helps to bridge the gap between the professional language of the curator and the popular language of the museum visitors. Preliminary explorations conducted at the Metropolitan Museum of Art of New York have shown that professional perspectives differ significantly from those of naïve visitors. Hence, if tags are associated with artworks, the resulting folksonomy can be used as a different and valuable source of information to be carefully taken into account when providing recommendations to museum visitors. As in the above-mentioned works, we have proposed a solution to the challenging task of identifying user interests from tags. Since the main problem lies in the fact that tags are freely chosen by users and their actual meaning is usually not very clear, the distinguishing feature of our approach is a strategy for the "semantic" interpretation of tags by means of WordNet.
7 Conclusions and Future Work
In this paper we investigated the design of recommendation services based on folksonomies and their concrete exploitation in real-world applications. We
evaluated the FIRSt recommender system in the cultural heritage domain by integrating the system in an adaptive platform for multimodal and personalized access to museum collections. In this scenario the role of the Fedora Digital Library was to store information about paintings and to provide it in an adaptive way according to user interests. Each visitor, equipped with a mobile terminal, enjoys an intelligent guide service which helps her to find the most interesting artworks according to her profile and to contextual information (such as her current location in the museum, noise level, brightness, etc.). Experimental evaluations showed that FIRSt is capable of improving the user museum experience by ranking artworks according to visitor tastes, included in the user profiles. The profiles are automatically inferred from both the static content describing the artworks and the tags chosen by visitors to freely annotate preferred artworks. The personalized ranking allows building services for adaptive museum tours. Since FIRSt is capable of providing recommendations for items in several domains, provided that descriptions of the items are available as text documents (e.g. plot summaries, reviews, short abstracts), we will try to investigate its application in different scenarios such as book or movie recommendation.
Acknowledgments. This research was partially funded by MIUR (Ministero dell'Università e della Ricerca) under the contract Legge 297/99, Prot. 691 CHAT "Cultural Heritage fruition & e-Learning applications of new Advanced (multimodal) Technologies" (2006-08). The authors are grateful to Massimo Bux for his effort in developing the services and performing the experimental evaluation.
References

1. Balabanovic, M., Shoham, Y.: Fab: Content-based, Collaborative Recommendation. Communications of the ACM 40(3), 66–72 (1997)
2. Basile, P., de Gemmis, M., Gentile, A.L., Iaquinta, L., Lops, P., Semeraro, G.: META - MultilanguagE Text Analyzer. In: Proceedings of the Language and Speech Technology Conference - LangTech 2008, Rome, Italy, February 28-29, pp. 137–140 (2008)
3. Basile, P., de Gemmis, M., Lops, P., Semeraro, G., Bux, M., Musto, C., Narducci, F.: FIRSt: a Content-based Recommender System Integrating Tags for Cultural Heritage Personalization. In: Nesi, P., Ng, K., Delgado, J. (eds.) Proceedings of the 4th International Conference on Automated Solutions for Cross Media Content and Multi-channel Distribution (AXMEDIS 2008) - Workshop Panels and Industrial Applications, Florence, Italy, November 17-19, pp. 103–106. Firenze University Press (2008)
4. Degemmis, M., Lops, P., Semeraro, G., Basile, P.: Integrating Tags in a Semantic Content-based Recommender. In: Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25, pp. 163–170 (2008)
5. Linden, G., Smith, B., York, J.: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing 7(1), 76–80 (2003)
6. Lops, P., Degemmis, M., Semeraro, G.: Improving Social Filtering Techniques Through WordNet-Based User Profiles. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS (LNAI), vol. 4511, pp. 268–277. Springer, Heidelberg (2007)
7. Miller, G.: WordNet: An On-Line Lexical Database. International Journal of Lexicography 3(4) (1990) (Special Issue)
8. Mladenic, D.: Text-learning and related intelligent agents: a survey. IEEE Intelligent Systems 14(4), 44–54 (1999)
9. Resnick, P., Varian, H.: Recommender Systems. Communications of the ACM 40(3), 56–58 (1997)
10. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1) (2002)
11. Semeraro, G., Degemmis, M., Lops, P., Basile, P.: Combining Learning and Word Sense Disambiguation for Intelligent User Profiling. In: Veloso, M.M. (ed.) Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2856–2861 (2007) ISBN 978-1-57735-298-3
12. Shardanand, U., Maes, P.: Social Information Filtering: Algorithms for Automating "Word of Mouth". In: Proceedings of ACM CHI 1995 Conference on Human Factors in Computing Systems, vol. 1, pp. 210–217 (1995)
13. Stock, O., Zancanaro, M., Busetta, P., Callaway, C.B., Krüger, A., Kruppa, M., Kuflik, T., Not, E., Rocchi, C.: Adaptive, intelligent presentation of information for the museum visitor in PEACH. User Model. User-Adapt. Interact. 17(3), 257–304 (2007)
14. Trant, J., Wyman, B.: Investigating social tagging and folksonomy in art museums with steve.museum. In: Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland (May 2006)
15. Wang, Y., Aroyo, L., Stash, N., Rutledge, L.: Interactive user modeling for personalized access to museum collections: The Rijksmuseum case study. In: User Modeling, pp. 385–389 (2007)
16. Yao, Y.Y.: Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science 46(2), 133–145 (1995)
Digital Stacks: Turning a Current Prototype into an Operational Service

Giovanni Bergamin and Maurizio Messina

Biblioteca Nazionale Centrale di Firenze - Biblioteca Nazionale Marciana, Venezia, Italy
[email protected],
[email protected]
Abstract. The presentation outlines the Digital Stacks project, whose aim is to set up a prototype of a long-term digital preservation system for electronic documents published in Italy and made public via digital communication networks, according to the legal deposit law. In the first part the technical architecture and the metadata management problems are outlined; the second part concerns the legal and agreements framework of the project, the organizational model and the service model. The sustainability issue is also briefly addressed.
1 Introduction

The Digital Stacks project aims to set up a prototype of a long-term digital preservation system for electronic documents published in Italy and made public via digital communication networks, according to the legal deposit law (L. 106/2004, DPR 252/2006). The project was originally established in 2006 by the Fondazione Rinascimento Digitale, by the Biblioteca Nazionale Centrale di Firenze and by the Biblioteca Nazionale Centrale di Roma. The first part of this presentation will take into account the technical architecture of Digital Stacks, but, of course, it is well known that digital preservation is more than just a technical process. Strategies to avoid bit loss or to prevent hardware and software dependencies are only a part of the issue. Digital Stacks of course has to deal with other problems, including economic implications (sustainability), selection problems (what is important to preserve for future generations), legal aspects, and cooperation between legal deposit institutions1. Some of these aspects will be addressed in the second part of the presentation. For the purposes of the project, Digital Preservation could be defined as a public service to be provided by trusted digital repositories in order to ensure, for deposited digital resources, viability, "renderability", authenticity and availability for designated communities2.

1 Brian Lavoie, Lorcan Dempsey, Thirteen ways of looking at ... digital preservation, <D-Lib Magazine> 10(2004), 7/8 http://www.dlib.org/dlib/july04/lavoie/07lavoie.html
2 This definition is based on: a) Trustworthy Repositories Audit & Certification (TRAC) http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf (for the concept of "trusted digital repositories"); b) Luciana Duranti, Un quadro teorico per le politiche, le strategie e gli standards di conservazione digitale: la prospettiva concettuale di InterPARES [A theoretical framework for digital preservation policies, strategies and standards: the conceptual perspective of InterPARES], <Bibliotime>, 9(2006), 1 http://didattica.spbo.unibo.it/bibliotime/num-ix-1/duranti.htm (to assess the authenticity of a digital resource, the public service must be able to establish its identity and demonstrate its integrity); c) PREMIS 2.0, 2008, PREservation Metadata: Implementation Strategies, http://www.loc.gov/standards/premis/ (for the concepts of "Viability: Property of being readable from media" and "Renderability" - "Render: To make - [by the means of a computer] - a Digital Object perceptible to a user, by displaying (for visual materials), playing (for audio materials), or other means appropriate to the Format of the Digital Object"); d) OAIS. Reference model for an Open Archival Information System, ISO 14721:2003 (for the concept of archive and designated community: "an organization that intends to preserve information for access and use by a designated community").
The name of the project intentionally recalls the stacks of legal deposit libraries. As stated by a pioneering European project on digital preservation (NEDLIB)3: "For us, as memory organizations, this means we have to move from paper-based stacks to digital stacks". In most respects digital stacks are comparable to conventional ones: digital resources must be preserved for the long term; digital stacks grow as new resources are added; modification and deletion are not an option; it is impossible to predict the usage frequency of stored digital resources; and it is likely that some resources will seldom or never be used4. It is worth noting that, nine years later, a search for "digital stacks" on Google returns the same expression used in the context of digital preservation: "Digital stacks: rather than boxes, shelves, and climate controlled environments, digital information must be stored in containers, file systems, and secure servers"5.
2 Technical Architecture
The aim of the project was to set up an infrastructure based on a long-term framework. Taking into account the fact that component failures are the norm rather than the exception6, the infrastructure is based on data replication (different machines located in different sites) and on simple and widespread hardware components, non vendor-dependent, which can easily be replaced (in other words, simple personal computers). The infrastructure does not rely on custom or proprietary software but is based on an open source operating system and utilities (widespread acceptance means fewer dependencies).
b) Luciana Duranti, Un quadro teorico per le politiche, le strategie e gli standards di conservazione digitale: la prospettiva concettuale di InterPARES, Bibliotime, 9 (2006), 1, http://didattica.spbo.unibo.it/bibliotime/num-ix-1/duranti.htm (to assess the authenticity of a digital resource, the public service must be able to establish its identity and demonstrate its integrity); c) PREMIS 2.0, 2008, PREservation Metadata: Implementation Strategies, http://www.loc.gov/standards/premis/ (for the concepts of "Viability: Property of being readable from media" and "Renderability" - "Render: To make - [by the means of a computer] - a Digital Object perceptible to a user, by displaying (for visual materials), playing (for audio materials), or other means appropriate to the Format of the Digital Object"); d) OAIS. Reference model for an Open Archival Information System, ISO 14721:2003 (for the concepts of archive and designated community: "an organization that intends to preserve information for access and use by a designated community").
3 NEDLIB = Networked European Deposit Library, 1997-2000: http://nedlib.kb.nl/
4 The large-scale archival storage of digital objects / Jim Linden, Sean Martin, Richard Masters, and Roderic Parker, 2005, http://www.dpconline.org/docs/dpctw04-03.pdf
5 http://www.pedalspreservation.org/About/stacks.aspx
6 The Google file system / Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 2003, http://labs.google.com/papers/gfs-sosp2003.pdf
Nowadays an ordinary personal computer can easily store up to 8 TB (equipped with four 2000 GB hard disks) using widespread and inexpensive SATA7 technology. Data replication relies on an open source disk synchronization utility (rsync8); to avoid hardware dependencies (e.g. disk controllers), RAID9 is not used.
It is worth noting that in the passage from prototype to service we changed the dark archive architecture. For this site the original plan was to use an offline storage system (e.g. LTO10 tapes). However, for the operational service we decided to use the same technology used in the two "light archives" (i.e. online storage using just simple personal computers). Note that the use of the term online does not change the purpose of the dark archive, which is "to function as a repository for information that can be used as a fail-safe during disaster recovery"11. Even though LTO is a robust and reliable solution, it introduces technology dependencies (e.g. "robots") and media management problems. For the same reasons we decided not to use an HSM12 (Hierarchical Storage Management) system, since the various implementations are based on proprietary systems.
Comparing all the costs of online and offline storage is not an easy task. For instance, regarding SATA disks we can say that their cost is decreasing day by day while their capacity is increasing, but it is difficult to estimate the so-called total cost of ownership of a tape-based solution13. Taking into account all the pros and cons, we concluded that the most convenient solution is online storage on simple and easily replaceable personal computers ("easily replaceable" means replaceable with no or minor impact on the overall architecture). The only drawback to this approach is in fact an ecological problem: the power consumption of the storage computers and the resulting carbon dioxide emissions. However, in recent years "green computing" technology (i.e. more energy-efficient computers) has been gaining widespread market awareness. Moreover, the Solid State Drive (SSD)14 is a rapidly developing technology and could significantly reduce the energy consumption of the storage computers in the near future.
The current Digital Stacks prototype is now turning into an operational service based on two main deposit sites (managed by the Biblioteca Nazionale Centrale di Firenze and by the Biblioteca Nazionale Centrale di Roma) and a dark archive (managed by the Biblioteca Nazionale Marciana). Of course the Fondazione Rinascimento Digitale will continue to support and promote the Digital Stacks operational service.
Each main site is composed of a set of autonomous and independent nodes. In turn, each node on a given site has a mirror node on the other site: the Digital Stacks service does not rely on a "master site / mirror site" architecture and each site will contain, in a symmetrical way, both master nodes and mirror nodes (see Figure 1).
7 http://it.wikipedia.org/wiki/Serial_ATA
8 http://it.wikipedia.org/wiki/Rsync
9 http://it.wikipedia.org/wiki/RAID
10 http://en.wikipedia.org/wiki/Linear_Tape-Open
11 http://www.webopedia.com/TERM/D/dark_archive.html
12 http://en.wikipedia.org/wiki/Hierarchical_storage_management
13 http://digitalcuration.blogspot.com/2009/07/online-and-offline-storage-cost-and.html
14 http://en.wikipedia.org/wiki/Solid-state_drive
Each physical file is stored in two copies on different computers within the same node, and the mirror node on the other main site holds two further copies. The dark archive also contains two copies of the file on two different computers. As a result, within Digital Stacks each physical file is replicated six times.
Fig. 1. Digital Stacks technical architecture overview
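To make the replication scheme concrete, the following is a minimal sketch of how one master node could push its store to its mirror node and to the dark archive using rsync over SSH. The hostnames, paths and synchronization policy are illustrative assumptions, not the project's actual configuration.

```python
import subprocess

# Illustrative replica targets for one master node; actual hostnames and
# paths would be defined by the Digital Stacks deployment, not here.
LOCAL_STORE = "/srv/digital-stacks/node01/"
REPLICAS = [
    "storage@mirror-node.example.org:/srv/digital-stacks/node01/",
    "storage@dark-archive.example.org:/srv/digital-stacks/node01/",
]

def push_replica(source: str, target: str) -> None:
    """Synchronize the local store to one replica with rsync.

    -a preserves ownership, permissions and timestamps; --checksum compares
    file contents instead of trusting size and modification time. There is
    deliberately no --delete flag: stored resources are never removed.
    """
    subprocess.run(["rsync", "-a", "--checksum", source, target], check=True)

if __name__ == "__main__":
    for replica in REPLICAS:
        push_replica(LOCAL_STORE, replica)
```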
Setting up one main site in Florence, close to the Arno river, and the dark archive in Venice, with its well-known "acqua alta" (high tide) problem, could pose a significant threat to the security of the overall service. One important decision was therefore to locate all the hardware at external data centers (or colocation centers15). Certification to the ISO 2700116 international security standard will be the basic prerequisite for the selection of a data center. The three institutions (Florence, Rome and Venice) will select three different data centers owned and managed by three different companies (to reduce the risk of "domino" effects). Moreover, we decided that the three colocation centers must be at least 200 km apart (to reduce the risk of natural threats). This architecture
15 http://en.wikipedia.org/wiki/Colocation_centre
16 ISO/IEC 27001:2005 "specifies the requirements for establishing, implementing, operating, monitoring, reviewing, maintaining and improving a documented Information Security Management System within the context of the organization's overall business risks"
based on certification to the ISO 27001 international security standard will form the basis for a domain-specific certification of Digital Stacks as a trusted digital repository (during the prototype phase we tried to apply DRAMBORA17, but TRAC18 was also taken into account).
3 Metadata
The Digital Stacks core is quite simple. Digital Stacks can ingest two kinds of files:
• data wrapped in WARC containers: a WARC (ISO 28500) container aggregates digital objects for ease of storage in a conventional file system19 (a minimal example is sketched below);
• metadata wrapped in MPEG21-DIDL containers20: MPEG21-DIDL (ISO 21000) is a simple and agnostic container suitable for the representation of digital resources (sets of metadata conformant to different schemas).
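As an illustration of the first ingestion path, the sketch below wraps a harvested file into a WARC record. It assumes the third-party warcio Python library, and the file names and target URI are illustrative; none of this is prescribed by the Digital Stacks project itself.

```python
from warcio.warcwriter import WARCWriter

# Wrap a harvested payload into a WARC (ISO 28500) container.
# The target URI and file names are illustrative only.
def wrap_in_warc(payload_path: str, warc_path: str, target_uri: str) -> None:
    with open(payload_path, "rb") as payload, open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        record = writer.create_warc_record(
            target_uri,
            "resource",        # a stand-alone harvested object
            payload=payload,
        )
        writer.write_record(record)

if __name__ == "__main__":
    wrap_in_warc("thesis.pdf", "deposit-0001.warc.gz", "http://example.org/thesis.pdf")
```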
To conclude this first part, it is worth noting that within this architecture Digital Stacks has to face the metadata management problem (also known as the "lake or river model"21). A long-term archive cannot rely on the "lake model" (stores of metadata based on a few schemas and fed by a few principal sources). A long-term archive has to deal with stores of metadata based on schemas22 that can change over time and which are fed by many streams: it can only be based on a "river model". In a long-term archive there will be different metadata schemas originating from, using the PREMIS language, different agents (e.g. OAI-PMH metadata harvesters, metadata extractors like JHOVE, librarians, etc.). Every schema is subject to change over time. Semantic overlaps between elements belonging to different schemas (e.g. PREMIS, MIX) will probably be the norm rather than the exception. Since metadata are the only means for controlling data, it is essential to control metadata to avoid the risk of a "Babel model". We are currently working on this, taking into account the fact that there are no ready-made tools available. There are some interesting directions: crosswalks like MORFROM23 (a demonstration OCLC web service, limited to bibliographic metadata) and DSpace future plans ("HP and MIT also have a research project called SIMILE that is investigating how to support arbitrary metadata schemas using RDF24"). This is not an easy task: incidentally, it seems that the web site of the project is no longer being updated.
17 http://www.repositoryaudit.eu/
18 Trustworthy Repositories Audit & Certification (TRAC), http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf
19 ISO 28500:2009 "specifies the WARC file format: to store both the payload content and control information from mainstream Internet application layer protocols, such as the Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), and File Transfer Protocol (FTP); to store arbitrary metadata linked to other stored data"
20 ISO/IEC 21000-2:2005: "The Digital Item Declaration Model describes a set of abstract terms and concepts to form a useful model for defining Digital Items [...], is based upon the terms and concepts defined in the above model. It contains the normative description of the syntax and semantics of each of the DIDL elements, as represented in XML".
21 http://orweblog.oclc.org/archives/001754.html
22 Schema is used here as in http://www.w3.org/XML/Schema: "XML Schemas express shared vocabularies and allow machines to carry out rules made by people"
23 Toward element-level interoperability in bibliographic metadata / Carol Jean Godby, Devon Smith, Eric Childress, 2008, http://journal.code4lib.org/articles/54
4 Legal Framework and Service Model
The second part of the presentation concerns both the legal and agreements framework of the project and the service model we propose. The most recent Italian law on legal deposit (L. 106/2004, DPR 252/2006) provides for a trial period for legal deposit, on a voluntary basis, of electronic documents, which are defined by the law as "documents disseminated via digital communication network". This legislation can be regarded as a strong commitment for national libraries to set up the foundations of a Digital Preservation Network that could, on the basis of the results of the trial period or just for specific components, also encompass electronic resources of other domains, different from those of the libraries. As is well known, "commitment" is one of the requirements of a trusted digital repository25. The test is funded by MiBAC, General Direction for Libraries; the Fondazione Rinascimento Digitale (FRD) will support the project with human and financial resources, as with the former project Magazzini digitali. As stated, the test will be carried out by the National Library of Florence (BNCF) and the National Library of Rome (BNCR), as main sites for preservation and access, and the Marciana National Library of Venice (BNM), which will act as an off-line dark archive for preservation purposes and redundancy, but not for public access. We would like to point out here the following three main goals:
• To implement an organizational model suitable for creating the national and regional archives of electronic publishing production, as provided by the law, and for being extended on a larger scale;
• To implement a service model suitable for balancing the right-holders' interests in the protection of contents with the final users' interests in accessing the contents;
• To implement a system suitable for ensuring long-term preservation and access to digital contents, as well as their authenticity (identity and integrity).
In order to achieve these goals a legal and agreements framework is needed, also for balancing the different interests of all the involved stakeholders:
• An agreement between the three MiBAC libraries and the FRD, in order to set specific roles and responsibilities of each institution from different points of view (scientific, technical, operational and financial), and to set up a steering committee for all management, monitoring and results assessment activities. It will also be of utmost importance to define an organizational and financial sustainability plan for the period after the 36-month trial. The signature of this agreement is currently underway;
• An agreement about the access to and the use of legal deposit digital contents, to be signed between the three National Libraries and each electronic publisher (or electronic content provider) joining the test. The current Italian legislation (Art. 38, paragraph 2, DPR 3 May 2006, n. 252) provides for free access via computer network to legal deposit documents that are originally freely accessible on the net, and access restricted to registered users inside the deposit institutions' premises for documents whose access is originally subject to a license. In both cases the copyright law must be adhered to. The agreement should provide for the following points:
  – BNCF and BNCR will periodically harvest the agreed publisher's electronic documents (harvesting is the cheaper and easier way of feeding the archive, also from the publishers' point of view, provided that copyright is adhered to);
  – In the case of license-subject documents, the publisher will provide the libraries with all the necessary clearances, and the file formats will also be agreed (WARC etc.);
  – Documents will be stored in multiple copies in BNCF and BNCR, and off-line in BNM; the libraries will be allowed to store the documents in ISO 27001 certified external data centers;
  – Digital archives will be ISO 14721:2003 OAIS compliant, and will be certified as trusted;
  – BNCF, BNCR and BNM will ensure long-term preservation of and access to the deposited documents, and will track any changes in the same documents;
  – BNCF, BNCR and BNM will be allowed to perform any necessary actions (refreshing, duplication, migration etc.) in order to achieve long-term preservation and access of the deposited documents;
  – Only registered users will be allowed to access and consult the documents subject to license, on multiple workstations (without printers and USB ports) on the Local Area Networks of BNCF and BNCR; all user actions will be tracked;
  – File printing and/or downloading will be subject to specific agreements; a compensation system for right-holders will be provided for if necessary (e.g. for protected documents not available on the publisher's web site);
  – Access and consultation will also be allowed to regional deposit libraries, in the same way, but only to deposited documents of those publishers whose registered office is in the same region as the deposit library (this could be a critical point to agree upon, but it is in line with the law and with the Italian tradition of legal deposit of analogue material).
24 http://www.dspace.org/faq/FAQ.html
25 Trustworthy Repositories Audit & Certification (TRAC), http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf
For the purpose of extending the test basis, the project will take the following main types of electronic resources into account:
• Legal deposit born-digital resources, i.e. e-journals, and also Ph.D. digital theses, resulting from specific agreements with universities;
• Digital resources resulting from digitization projects funded by the Italian Digital Library Initiative, mainly in the memory institutions range and only for master copies.
The last issue of this presentation concerns sustainability. As is known, access to born-digital e-journals is normally subject to a license. A typical provision of these licenses concerns perpetual access to the licensed contents. It is a provision of the utmost importance for libraries and their users, and the only way for libraries to maintain over time the availability of the contents they have paid for. At the same time, however, it is a provision that can be fulfilled only through a dedicated organizational and technical infrastructure, i.e. a trusted digital repository. It is unlikely that publishers will manage such an infrastructure, so this kind of service could be provided by the legal deposit libraries network, and its value could be part of the negotiation with publishers26.
26 A comparative study of e-journals archiving solutions. A JISC funded investigation. Final report, May 2008 / Terry Morrow, Neil Beagrie, Maggie Jones, Julia Chruszcz, http://www.slainte.org.uk/news/archive/0805/jiscejournalreport.pdf
A First National Italian Register for Digital Resources for Both Cultural and Scientific Communities (Communication)
Maurizio Lunghi
Fondazione Rinascimento Digitale, Firenze, Italy
[email protected]
Abstract. In this paper we present an Italian initiative, involving relevant research institutions and national libraries, aimed at implementing an NBN Persistent Identifiers (PI) infrastructure based on a novel hardware/software architecture. We describe a distributed and hierarchical approach for the management of an NBN namespace and illustrate assignment policies and identifier resolution strategies based on request forwarding mechanisms. We describe interaction and synergy with the ‘Magazzini Digitali’ project for the legal deposit of digital contents just launched by the Italian Ministry of Culture. Finally, we draw some conclusions and point out the future directions of our work.
1 Introduction
Stable and certified referencing of Internet resources is crucial for digital library applications, not only to identify a resource in a trustable and certified way, but also to guarantee continuous access to it over time. Current initiatives like the European Digital Library (EDL) and Europeana clearly show the need for a certified and stable digital resource reference mechanism in the cultural and scientific domains. The lack of confidence in digital resource reliability hinders the use of the Digital Library as a platform for preservation, research, citation and dissemination of digital contents. A trustworthy solution is to associate to any digital resource of interest a Persistent Identifier (PI) that certifies its authenticity and ensures its accessibility. Several technological proposals are available, but the current scenario shows that we cannot expect or impose a unique PI technology or only one central registry for the entire world. Moreover, different user communities do not commonly agree about the granularity of what an identifier should point to. In the library domain the National Bibliography Number (NBN, RFC 3188) has been defined and is currently promoted by the CENL. This standard identifier format assumes that the national libraries are responsible for the national name registers. The first implementations of NBN registers in Europe are available at the German and Swedish national libraries. In Italy we are currently developing a novel NBN architecture with a strong participation from the scientific community, led by the National Research Council (CNR) through its Central Library and ITC Service. We have designed a hierarchical
distributed system, in order to overcome the criticalities of a centralised system and to reduce the high management costs implied by a unique resolution service. Our approach implies a central node responsible for the NBN:IT top-level Italian domain, and lower-level nodes each responsible for managing one of the Italian sub-domains (NBN:IT:UR, NBN:IT:UR:CNR, NBN:IT:FRD, etc.). The number of levels within this hierarchy is virtually unlimited. Only the nodes at the lowest level harvest metadata from the actual repositories and create NBN identifiers. The upper level nodes just harvest new NBN records from their child nodes and store them within their databases. In this way each node keeps all the NBN records belonging to its sub-domain. It is easy to see that within this architecture the responsibility for name creation/resolution is distributed and information about persistent identifiers is replicated in multiple sites, thus providing the necessary redundancy and resilience for implementing a reliable service.
2 Persistent Identifier Standards
The association of a Persistent Identifier (PI) with a digital resource can be used to certify its content authenticity, provenance and managing rights, and to provide an actual locator. The only guarantee of the actual persistence of identifier systems is the commitment shown by the organizations that assign, manage, and resolve the identifiers. At present some technological solutions are available but no general agreement has been reached among the different user communities. In the following we provide a brief description of the most widely diffused ones.
The Digital Object Identifier (DOI) system is a business-oriented solution widely adopted by the publishing industry, which provides administrative tools and a Digital Rights Management (DRM) system. The Archival Resource Key (ARK) is a URL-based persistent identification standard, which provides peculiar functionalities that are not featured by the other PI schemata, e.g. the capability of separating the univocal identifier assigned to a resource from the potentially multiple addresses that may act as a proxy to the final resource. The Handle System is a technology specification for assigning, managing, and resolving persistent identifiers for digital objects and other resources on the Internet. The specified protocols enable a distributed computer system to store identifiers (names, or handles) of digital resources and resolve those handles into the information necessary to locate, access, and otherwise make use of the resources. That information can be changed as needed to reflect the current state and/or location of the identified resource without changing the handle. Finally, the Persistent URL (PURL) is simply a redirect table of URLs, and it is up to the system manager to implement policies for authenticity, rights and trustability, while the Library of Congress Control Number (LCCN) is a persistent identifier system with an associated permanent URL service (the LCCN permanent service), which is similar to PURL but with a reliable policy regarding identifier trustability and stability.
This overview shows that it is not viable to impose a unique PI technology and that the success of a solution is related to the credibility of the institution that promotes it. Moreover, the granularity of the objects that persistent identifiers need to be assigned to is widely different in each user application sector.
The National Bibliography Number (NBN) is a URN namespace under the responsibility of national libraries. The NBN namespace, as a Namespace Identifier (NID), has been registered and adopted by the Nordic Metadata Projects upon request of the CDNL and the CENL. Unlike URLs, URNs are not directly actionable (browsers generally do not know what to do with a URN), because they have no associated global infrastructure that enables resolution. Although several implementations have been made, each proposing its own means for resolution through the use of plug-ins or proxy servers, an infrastructure that enables large-scale resolution has not been implemented. Moreover, each URN name-domain is isolated from the other systems and, in particular, the resolution service is specific (and different) for each domain. Each national library uses its own NBN strings independently, with separate implementations in individual systems, no coordination with other national libraries and no commonly agreed formats. In fact, several national libraries have developed their own NBN systems for national and international research projects; several implementations are currently in use, each with different metadata descriptions or granularity levels. In our opinion NBN is a credible candidate technology for an international and open persistent identifier infrastructure, mainly because it is based on an open standard and supports the distribution of the responsibility for the different sub-namespaces, thus allowing the single institutions to keep control over the persistent identifiers assigned to their resources.
3 The NBN Initiative in Italy
The project for the development of an Italian NBN register/resolver started in 2007 as a collaboration between "Fondazione Rinascimento Digitale" (FRD), the National Library in Florence (BNCF), the University of Milan (UNIMI) and "Consorzio Interuniversitario Lombardo per l'elaborazione automatica" (CILEA). After one year of work a first prototype was released, demonstrating the viability of the hierarchical approach. The second and current phase of the Italian NBN initiative is based on a different partnership involving Agenzia Spaziale Italiana (ASI), Consiglio Nazionale delle Ricerche (CNR), Biblioteca Nazionale Centrale di Firenze (BNCF), Biblioteca Nazionale Centrale di Roma (BNCR), Istituto Centrale per il Catalogo Unico (ICCU), Fondazione Rinascimento Digitale (FRD) and Università di Milano (UniMi). At the beginning of 2009 the Italian National Research Council (CNR) developed a second prototype.
Objectives
The project aims at:
– creating a national stable, trustable and certified register of digital objects to be adopted by cultural and educational institutions;
– allowing easier and wider access to the digital resources produced by Italian cultural institutions, including material digitised or not yet published;
– encouraging the adoption of long-term preservation policies by making service costs and responsibilities more sustainable, while preserving the institutional workflow of digital publishing procedures;
– extending as much as possible the adoption of the NBN technology and the user network in Italy;
– creating redundant mechanisms both for the duplication of name registers and, in some cases, also for the digital resources themselves;
– overcoming the limitations imposed by a centralised system and distributing the high management costs implied by a unique resolution service, while preserving authoritative control.
The proposed architecture (see Figure 1) introduces some elements of flexibility and additional features. At the highest level there is a root node, which is responsible for the top-level domain (IT in our case). The root node delegates the responsibility for the different second-level domains (e.g. IT:UR, IT:FRD, etc.) to second-level naming authorities. Sub-domain responsibility can be further delegated using a virtually unlimited number of sub-levels (e.g. IT:UR:CNR, IT:UR:UNIMI, etc.). At the bottom of this hierarchy there are the leaf nodes, which are the only ones that harvest publication metadata from the actual repositories and assign unique identifiers to digital objects. Each agency adheres to the policy defined by its parent node and consistently defines the policies its child nodes must adhere to. It is easy to see that this hierarchical, multi-level, distributed approach implies that the responsibility for PI generation and resolution can be recursively delegated to lower-level sub-naming authorities, each managing a portion of the domain name space. Given the similarity of the addressed problems, some ideas have been borrowed from the DNS service. Within our architecture each node harvests PI information from its child nodes and is able to directly resolve all identifiers belonging to its domain and sub-domains. Besides, it can query other nodes to resolve NBN identifiers not belonging to its domain.
Fig. 1. The multi-level distributed architecture
This implies that every node can resolve every NBN item generated within the NBN:IT sub-namespace, either by looking up its own tables or by querying other nodes. In the latter case the query result is cached locally in order to speed up subsequent interrogations regarding the same identifier. This redundancy of service access points and information storage locations increases the reliability of the whole infrastructure by eliminating single points of failure. Besides, reliability increases as the number of joining institutions grows. In our opinion a distributed architecture also increases scalability and performance, while leaving unaltered the publishing workflows defined for the different repositories. A resolution sketch is given after the guidelines below.
Organisational requirements
Each participating agency should indicate an administrative reference person, who is responsible for policy compliance as regards the registration and resolving procedures as well as for the relationships with the upper- and lower-level agencies, and a technical reference person, who is responsible for the hardware, software and network infrastructure.
Guidelines
The policy should define rules for:
– generating well-formed PIs;
– identifying the digital resources which "deserve" a PI;
– identifying resource granularity for PI assignment (paper, paper section, book, book chapter, etc.);
– auditing repositories in order to assess their weaknesses and their strengths (the DRAMBORA toolkit may help in this area).
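The following minimal sketch illustrates the delegation, forwarding and caching logic described in this section: leaf agencies assign identifiers, upper-level registries harvest them, and each node resolves unknown identifiers by querying its ancestors and caching the answers. The class, the method names and the example namespaces are illustrative assumptions, not the actual implementation of the Italian NBN register.

```python
from typing import Dict, List, Optional

class NBNNode:
    """One naming authority in the hierarchical NBN:IT architecture."""

    def __init__(self, domain: str, parent: Optional["NBNNode"] = None):
        self.domain = domain                # e.g. "NBN:IT" or "NBN:IT:UR:CNR"
        self.parent = parent
        self.children: List["NBNNode"] = []
        self.records: Dict[str, str] = {}   # identifier -> location (URL)
        self.cache: Dict[str, str] = {}     # answers obtained from other nodes
        if parent is not None:
            parent.children.append(self)

    def register(self, identifier: str, url: str) -> None:
        """A leaf agency assigns an identifier; the harvesting performed by
        the upper-level registries is modelled synchronously for simplicity."""
        node: Optional["NBNNode"] = self
        while node is not None:
            node.records[identifier] = url
            node = node.parent

    def resolve(self, identifier: str) -> Optional[str]:
        """Resolve locally, then via the cache, then by querying ancestors."""
        if identifier in self.records:
            return self.records[identifier]
        if identifier in self.cache:
            return self.cache[identifier]
        node = self.parent
        while node is not None:             # the root knows every NBN:IT record
            if identifier in node.records:
                self.cache[identifier] = node.records[identifier]
                return self.cache[identifier]
            node = node.parent
        return None

# Illustrative hierarchy: root, one intermediate registry, two leaf agencies.
root = NBNNode("NBN:IT")
ur = NBNNode("NBN:IT:UR", parent=root)
cnr = NBNNode("NBN:IT:UR:CNR", parent=ur)
frd = NBNNode("NBN:IT:FRD", parent=root)

cnr.register("NBN:IT:UR:CNR-0001", "http://example.org/objects/0001")
print(frd.resolve("NBN:IT:UR:CNR-0001"))  # resolved via the root and cached at FRD
```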
4 Magazzini Digitali / Digital Stacks
The Digital Stacks project, established in 2006 by the Fondazione Rinascimento Digitale and by the Biblioteca Nazionale Centrale di Firenze, now relies on an infrastructure based on two main deposit sites (managed by the Biblioteca Nazionale Centrale di Firenze and by the Biblioteca Nazionale Centrale di Roma) and a dark archive (managed by the Biblioteca Nazionale Marciana, Venezia). The name of the project, Magazzini Digitali (Digital Stacks), intentionally recalls the term used to refer to the stacks of legal deposit libraries. In most respects digital stacks are comparable to conventional ones: digital resources must be preserved for the long term; digital stacks grow as new resources are added; modification and deletion are not an option; it is impossible to predict the usage frequency of stored digital resources; and it is likely that some resources will seldom or never be used. The aim of the project is to set up an infrastructure based on a long-term framework. Taking into account the fact that component failures are the norm rather than the exception, the infrastructure is based on data replication (different machines located in different sites) and on simple and widespread hardware components, non vendor-dependent, that can easily be replaced (just simple personal computers).
The infrastructure does not rely on custom or proprietary software but is based on an open source operating system and utilities (widespread acceptance means fewer dependencies). The infrastructure has been developed by the Ministry of Culture in order to offer the first service for the legal deposit of digital contents in Italy. The experimentation just launched started with doctoral theses, and some universities have already joined the test bed. Deposit is naturally coupled with the PI assignment of the digital resources, so we encourage universities to adopt the appropriate policy and install the free software in order to become second-level agencies that generate NBN names for their own resources.
5 Conclusions
The development of a strong policy for persistent identifiers of digital resources from both the cultural and scientific communities is very important and constitutes a structural element for the future of our information society. Moreover, the Italian development of an NBN register has been original and innovative with respect to what other European countries have done in this area. In parallel, the Magazzini Digitali project set up a national infrastructure for the legal deposit of digital resources, offering for the first time such a strategic service to user communities. The synergy between the two projects promises a serious and robust approach to the long-term management and use of these digital resources.
References
1. Hakala, J.: Using national bibliography numbers as uniform resource names. RFC 3188 (2001), http://www.ietf.org/rfc/rfc3188.txt
2. Kunze, J.: The ARK Persistent Identifier Scheme. Internet Draft (2007), http://tools.ietf.org/html/draft-kunze-ark-14
3. Lagoze, C., Van de Sompel, H.: The Open Archives Initiative Protocol for Metadata Harvesting, version 2.0. Technical report, Open Archives Initiative (2002), http://www.openarchives.org/OAI/openarchivesprotocol.html
4. DCC Workshop on Persistent Identifiers, Wolfson Medical Building, University of Glasgow (June 30 - July 1, 2005), http://www.dcc.ac.uk/events/pi-2005/
5. ERPANET Workshop on Persistent Identifiers, University College Cork, Cork, Ireland (June 17-18, 2004), http://www.erpanet.org/events/2004/cork/index.php
6. Dublin Core Metadata Initiative: Dublin Core Metadata Element Set, Version 1.1, http://dublincore.org/documents/dces/
7. Bellini, E., Cirinnà, C., Lunghi, M.: Persistent Identifiers for Cultural Heritage. Digital Preservation Europe Briefing Paper (2008), http://www.digitalpreservationeurope.eu/publications/briefs/persistent_identifiers.pdf
8. Bellini, E., Lunghi, M., Damiani, E., Fugazza, C.: Semantics-aware Resolution of Multi-part Persistent Identifiers. In: WCKS 2008 Conference (2008)
9. CENL Task Force on Persistent Identifiers, Report 2007 (2007), http://www.nlib.ee/cenl/docs/CENL_Taskforce_PI_Report_2006.pdf
10. National Library of Australia: PADI (Preserving Access to Digital Information) - Persistent Identifiers (2002), http://www.nla.gov.au/padi/topics/36.html#article
11. Relationship Between URNs, Handles, and PURLs. Library of Congress, National Digital Library Program, http://lcweb2.loc.gov/ammem/award/docs/PURLhandle.html
FAST and NESTOR: How to Exploit Annotation Hierarchies
Nicola Ferro and Gianmaria Silvello
Department of Information Engineering, University of Padua, Italy
{ferro, silvello}@dei.unipd.it
Abstract. In this paper we present the annotation model implemented by the Flexible Annotation Service Tool (FAST) and the set-theoretical data models defined in the NEsted SeTs for Object hieRarchies (NESTOR) framework. We show how annotations assume a tree structure that can be exploited by NESTOR to improve the access to and exchange of Digital Objects (DOs) between Digital Libraries (DLs) in a distributed environment.
1 Motivations
DLs are becoming the predominant means to manage, exchange and retrieve cultural heritage resources. DLs can be seen as tools for managing the information resources of different kinds of organizations, ranging from libraries and museums to archives. In these different contexts, DLs permit the management of wide and diverse corpora of resources, which range from books and archival documents to multimedia resources, as pointed out in [8]. Furthermore, DLs are not only systems which permit users to manage, exchange and retrieve digital objects or metadata; they are increasingly becoming part of the user's work. DLs can enable the intellectual production process and support user cooperation and the exchange of ideas. In this way, DLs not only foster access to knowledge, but are also part of knowledge creation and evolution.
The evolution and transmission of knowledge has always been an interactive process between scientists or field experts, and annotations have always been one of the main tools for this kind of interaction. In the digital era, annotations are still a means of intellectual collaboration and in DLs they are considered first-class digital objects [7]. They are also adopted in a variety of different contexts, such as content enrichment, data curation, collaborative and learning applications, and social networks, as well as in various information management systems, such as the Web (semantic and not) and databases. As an example, in Figure 1 we can see different kinds of annotations commenting on an archival document managed by an archival system1. Documents in an archive are organized in a hierarchy [4] and the annotations may need to be attached both to the content - the actual documents - and to the structure - the
1 Sistema Informativo Unificato per le Soprintendenze Archivistiche (SIUSA), http://siusa.archivi.beniculturali.it/
Fig. 1. Different kinds of annotation annotating an archival document
relationships between the documents - of an archive. Furthermore, in Figure 1 we can see that an annotation can be a textual comment (i.e. "really interesting", which annotates the content of the document) or the expression of a user information need (i.e. "How can I get there?", which annotates an element of the archival structure); the map is an annotation provided by a user and conceptually attached to another annotation and not to an archival component.
The previous example shows that annotations are a quite complex concept comprising a number of different aspects; in order to deal with this heterogeneity, a formal model for digital annotations has been proposed in [1], and this model has been adopted by the FAST annotation service for representing and managing annotations. Furthermore, we pointed out the importance of annotating both the content and the structure of digital objects; in [6] we presented the NESTOR framework, which is based on two set data models, alternative to the tree, that permit us to model hierarchies handling structure and content in an independent way. The clear distinction and autonomous treatment of content and structure elements in the NESTOR framework eases the adoption of services which consider the structural elements defining the organization of the objects of interest as relevant as the objects themselves. Indeed, it can be exploited by FAST, enabling a natural way to annotate both the content and the structure of hierarchies of objects. Moreover, we shall see that annotations themselves can be shaped into a hierarchy and thus modeled through the NESTOR framework as well, enabling a uniform representation of both annotated objects and annotations. The use of NESTOR with annotations concerns: the data structure used to attach annotations to hierarchies and contents, the way in which annotations are accessed and exchanged in a distributed environment, and the annotation search strategies.
The paper is organized as follows: Section 2 points out the characteristics of the FAST annotation model, highlighting the hypertext created by annotations in a DL and their hierarchical structure. Section 3 introduces the theoretical
foundations of the NESTOR framework. Section 4 describes how NESTOR can be applied to the presented annotation model and points out the advantages of this approach. Finally, Section 5 draws some final remarks.
2 FAST Annotation Model
FAST adopts and implements the formal model for annotations proposed in [1], which has also been embedded in the reference model for digital libraries developed by DELOS, the European network of excellence on digital libraries [2]. According to this model, an annotation is a compound multimedia object constituted by different signs of annotation which materialize the annotation itself; for example, we can have textual signs, which contain the textual content of the annotation, image signs, if the annotation is made up of images, and so on. In turn, each sign is characterized by one or more meanings of annotation which specify the semantics of the sign; for example, we can have a sign whose meaning corresponds to the title field of the Dublin Core (DC) metadata schema2, in the case of a metadatum annotation, or we can have a sign carrying a question of the author about a document, whose meaning may be question or similar. Every annotation is uniquely identified by the pair (namespace, identifier).
In the following, we need a terminology to distinguish between two kinds of digital objects: the generic ones managed by a digital library or available on the Web, which we call documents, and the ones that are annotations. Therefore, when we use the generic term digital object, we mean a digital object that can be either a document or an annotation. Annotations can be associated with a digital object, which can be either a document or an annotation, by two types of link:
– annotate link: it permits linking an annotation to a part of a digital object. By means of an annotate link, an annotation can annotate one or more parts of a digital object, expressing intra-digital object relationships between the different parts of the annotated digital object. An important constraint is that an annotation can annotate one and only one digital object.
– relate-to link: it is intended to allow an annotation only to relate to one or more parts of other digital objects, but not the annotated one. Therefore, this kind of link lets the annotation express inter-digital object relationships, meaning that the annotation creates a relationship between the annotated digital object and the other digital objects related to it. By means of relate-to links an annotation can link to several digital objects.
From these definitions annotations can be seen as linking means between digital objects. Annotations permit us to create new relationships between the components of a digital object, between different digital objects of the same DL, or between digital objects belonging to different DLs. As shown in [1], the set of digital objects and annotations forms a labeled directed acyclic graph called the document-annotation hypertext.
2 http://www.dublincore.org/
Definition 1. Let A be the set of annotations, D the set of documents, and DO = A ∪ D the set of digital objects, which are either annotations or documents. The document-annotation hypertext is a labeled directed graph

Hda = (DO, Eda ⊆ A × DO, lda)    (2.1)

where:
– DO = A ∪ D is the set of vertices;
– Eda is the set of edges;
– lda : Eda → LT, with LT = {annotate, relate-to}, is the labeling function which associates the corresponding link type to each edge.

It has been proved in [1] that Hda does not contain loops or cycles. Furthermore, we know that each annotation must annotate exactly one digital object; thus for each document there is a unique tree of annotations, constituted by "annotate" edges, that can be rooted in the document. For each annotation we can determine the unique path to the document root of the tree to which the annotation belongs. In this work we aim to point out the backbone of annotations, which is composed of the "annotate" links; we are interested in independent threads of annotations rather than in the general structure of the annotation graph. The following proposition, which is extensively described and proved in [1], outlines this aspect of annotations.

Proposition 1. Let H′da = (DO′, E′da) be the subgraph of Hda such that:
– E′da = {e ∈ Eda | lda(e) = annotate};
– DO′ = {do ∈ DO | ∃e ∈ E′da, e = (a, do)}.

H′da is the subgraph whose edges are of the annotate kind and whose vertices are incident with at least one of these edges. Let Ĥ′da = (DO′, Ê′da) be the underlying graph of H′da, which is the undirected version of H′da. The following properties hold:
– Ĥ′da is a forest;
– every tree in Ĥ′da contains a unique document vertex d.

In Figure 2 we can see an example of a document-annotation hypertext, the directed acyclic graph (H′da) formed by the "annotate" links once the "relate-to" links are removed, and the forest Ĥ′da. The formal model for annotations provided a sound basis for designing and developing an XML Schema for the FAST annotation service. The FAST XSchema3 allows annotations and related entities to be represented and exchanged in a well-defined XML format. The FAST XSchema encodes both the content and the metadata about the annotation. For instance, it encodes the data about the user who created the annotation and about the groups sharing the annotation; at the same time the FAST XSchema permits us to encode the "annotate" and the "relate-to" link information.
3 http://ims.dei.unipd.it/xml/fast-schema-instance
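To make Definition 1 and Proposition 1 concrete, the sketch below builds a small document-annotation hypertext and extracts the forest induced by the "annotate" edges. The data structures and names are illustrative assumptions and are not part of the FAST implementation.

```python
from collections import defaultdict

# A tiny document-annotation hypertext: vertices are documents ("d*") or
# annotations ("a*"); each edge goes from an annotation to a digital object
# and is labeled either "annotate" or "relate-to" (Definition 1).
edges = [
    ("a1", "d1", "annotate"),
    ("a2", "a1", "annotate"),
    ("a3", "d1", "annotate"),
    ("a4", "d3", "annotate"),
    ("a2", "d3", "relate-to"),   # inter-object relationship, not an annotation edge
]

def annotate_forest(edges):
    """Keep only the "annotate" edges and group the annotations by the
    document rooting each annotation tree (Proposition 1: it is a forest)."""
    # Each annotation annotates exactly one digital object, so a dict works.
    parent = {a: do for a, do, label in edges if label == "annotate"}
    trees = defaultdict(list)
    for annotation in parent:
        node = annotation
        while node in parent:        # walk the unique chain of "annotate" edges
            node = parent[node]
        trees[node].append(annotation)
    return dict(trees)

print(annotate_forest(edges))
# {'d1': ['a1', 'a2', 'a3'], 'd3': ['a4']}
```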
Fig. 2. A document-annotation hypertext Hda and the forest Ĥ′da composed of two trees created considering only the "annotate" links
3 The NESTOR Framework
We propose two set data models called the Nested Set Model (NS-M) and the Inverse Nested Set Model (INS-M), based on an organization of nested sets. The foundational idea behind these set data models is that an opportune set organization can maintain all the features of a tree data structure with the addition of some new relevant functionalities. We define these functionalities in terms of flexibility of the model, rapid selection and isolation of easily specified subsets of data, and extraction of only those data necessary to satisfy specific needs. The most intuitive way to understand how these models work is to relate them to the well-known tree data structure. Thus, we informally present the two data models by means of examples of mapping between them and a sample tree.
The first model we present is the Nested Set Model (NS-M). An organization of sets in the NS-M is a collection of sets in which any pair of sets is either disjoint or one contains the other. In Figure 3 (b) we can see a tree mapped into the NS-M, represented by means of an Euler-Venn diagram. We can see that each node of the tree is mapped into a set, where child nodes become proper subsets of the set created from the parent node. Every set other than the root set is a subset of at least one other set; the set corresponding to the tree root is the only set without any supersets, and every set in the hierarchy is a subset of the root set. The external nodes are sets with no subsets. The tree structure is maintained thanks to the nested organization, and the relationships between the sets are expressed by the set inclusion order.
Fig. 3. (a) A tree. (b) The Euler-Venn diagram of an NS-M. (c) The DocBall representation of an INS-M.
The second data model is the Inverse Nested Set Model (INS-M). We can say that a tree is mapped into the INS-M by transforming each node into a set, where each parent node becomes a subset of the sets created from its children. The set created from the tree's root is the only set with no subsets, and the root set is a proper subset of all the sets in the set organization. The leaves are the sets with no supersets, and they are the sets containing all the sets created from the nodes composing the tree path from that leaf to the root. An important aspect of the INS-M is that the intersection of every couple of sets obtained from two nodes is always a set representing a node in the tree. The intersection of all the sets in the INS-M is the set mapped from the root of the tree. In Figure 3 (c) we can see a tree mapped into an INS-M, represented through a DocBall representation [11]. The representation of the INS-M by means of Euler-Venn diagrams is not very expressive and can be confusing for the reader; for these reasons we use the DocBall representation. We exploit the DocBall's ability to show the structure of an object and to represent the "inclusion order of one or more elements in another one" [11]. The DocBall is composed of a set of circular sectors arranged in concentric rings, as shown in Figure 3 (c). In a DocBall each ring represents a level of the hierarchy, with the center (level 0) representing the root. In a ring, the circular sectors represent the nodes in the corresponding level. We use the DocBall to represent the INS-M; thus for us each circular sector corresponds to a set.
It is worthwhile for the rest of the work to define some basic concepts of set theory, namely the family of subsets and the subfamily of subsets, with reference to [3] for their treatment. However, we assume the reader is familiar with the basic concepts of ZFC axiomatic set theory [9], which we cannot extensively treat here for space reasons.
Definition 2. Let A be a set, I a non-empty set and C a collection of subsets of A. Then a bijective function A : I −→ C is a family of subsets of A. We call I the index set and we say that the collection C is indexed by I. We use the notation {Ai}i∈I to indicate the family A; the notation Ai ∈ {Ai}i∈I means that ∃ i ∈ I | A(i) = Ai. We call a subfamily of {Ai}i∈I the restriction of A to J ⊆ I and we denote this with {Bj}j∈J ⊆ {Ai}i∈I.
Definition 3. Let {Ai}i∈I be a family. We define {Ai}i∈I to be a linearly ordered family if ∀Aj, Ak ∈ {Ai}i∈I, Aj ⊆ Ak ∨ Ak ⊆ Aj. Furthermore, we can say that a family {Ai}i∈I is a linearly ordered family if every two sets in {Ai}i∈I are comparable. In the literature a linearly ordered family is also called a chain.
Definition 4. Let {Ai}i∈I be a family. We define {Ai}i∈I to be a topped family if ∃Ak ∈ {Ai}i∈I | ∀Aj ∈ {Ai}i∈I, Aj ⊆ Ak. If there is no such Ak ∈ {Ai}i∈I | ∀Aj ∈ {Ai}i∈I, Aj ⊆ Ak, then {Ai}i∈I is defined to be a topless family.
Definition 5. Let A be a set and let {Ai}i∈I be a family. Then {Ai}i∈I is a Nested Set family if:

A ∈ {Ai}i∈I,    (3.1)
∅ ∉ {Ai}i∈I,    (3.2)
∀Ah, Ak ∈ {Ai}i∈I, h ≠ k | Ah ∩ Ak ≠ ∅ ⇒ Ah ⊂ Ak ∨ Ak ⊂ Ah.    (3.3)
Thus, we define a Nested Set family (NS-F) as a family in which three conditions must hold. The first condition (3.1) states that the set A, which contains all the sets in the family, must belong to the NS-F. The second condition (3.2) states that the empty set does not belong to the NS-F, and the last condition (3.3) states that the intersection of every couple of distinct sets in the NS-F is not the empty set only if one set is a proper subset of the other one.
Theorem 2. Let T(V, E) be a tree and let Φ be a family where I = V and ∀vi ∈ V, Vvi = Γ+V(vi). Then {Vvi}vi∈V is a Nested Set family.
This theorem defines how a tree is mapped into an NS-F; the proof and an extensive description can be found in [6]. In the same way we can define the Inverse Nested Set Model (INS-M):
Definition 6. Let A be a set, let {Ai}i∈I be a family and let {Bj}j∈J ⊆ {Ai}i∈I be a subfamily. Then {Ai}i∈I is an Inverse Nested Set family if:

∅ ∉ {Ai}i∈I,    (3.4)
⋂j∈J Bj ∈ {Ai}i∈I,    (3.5)
∃Bk ∈ {Bj}j∈J | ∀Bh ∈ {Bj}j∈J, Bh ⊆ Bk ⇒ ∀Bh, Bg ∈ {Bj}j∈J, Bh ⊆ Bg ∨ Bg ⊆ Bh.    (3.6)
Thus, we define an Inverse Nested Set family (INS-F) as a family in which three conditions must hold. The first condition (3.4) states that the empty set does not belong to the INS-F. The second condition (3.5) states that the intersection of every subfamily of the INS-F belongs to the INS-F itself. Condition (3.6) states that every subfamily of an INS-F can be a topped family only if it is linearly ordered; alternatively, we can say that every subfamily of an INS-F must be a topless family or a chain.
Theorem 3. Let T(V, E) be a tree and let Ψ be a family where I = V and ∀vi ∈ V, Vvi = Γ−V(vi). Then {Vvi}vi∈V is an Inverse Nested Set family.
Differently from Theorem 2, we report the proof of this theorem because it is slightly different from the one presented in [6].
Proof. By definition of the set of the ancestors of a node, ∀vi ∈ V, |Vvi| = |Γ−V(vi)| ≥ 1 and so ∅ ∉ {Vvi}vi∈V (condition 3.4).
Let {Bvj}vj∈J be a subfamily of {Vvi}vi∈V. We prove condition 3.5 by induction on the cardinality of J. |J| = 1 is the base case and it means that every subfamily {Bvj}vj∈J ⊆ {Vvi}vi∈V is composed only of one set Bv1, whose intersection is the set itself and belongs to the family {Vvi}vi∈V by definition. For |J| = n − 1 we assume that ∃ vn−1 ∈ V | ⋂vj∈J Bvj = Bvn−1 ∈ {Vvi}vi∈V; equivalently, we can say that ∃ vn−1 ∈ V | ⋂vj∈J Γ−V(vj) = Γ−V(vn−1); thus Γ−V(vn−1) is a set of nodes that is composed of common ancestors of the n − 1 considered nodes. For |J| = n, we have to show that ∃ vt ∈ V | ∀ vn ∈ J, Bvn−1 ∩ Bvn = Bvt ∈ {Vvi}vi∈V. This is equivalent to showing that ∃ vt ∈ V | ∀ vn ∈ J, Γ−V(vn−1) ∩ Γ−V(vn) = Γ−V(vt). Ab absurdo, suppose that ∃ vn ∈ J | ∀ vt ∈ V, Γ−V(vn−1) ∩ Γ−V(vn) ≠ Γ−V(vt). This would mean that vn has no ancestors in J and, consequently, in V; at the same time, this would mean that vn is an ancestor of no node in J and, consequently, in V. But this means that V is the set of nodes of a forest and not of a tree.
Now, we have to prove condition 3.6. Let {Bvj}vj∈J be a subfamily of {Vvi}vi∈V. Ab absurdo, suppose that ∃Bvk ∈ {Bvj}vj∈J | ∀Bvh ∈ {Bvj}vj∈J, Bvh ⊆ Bvk ⇒ ∃Bvh, Bvg ∈ {Bvj}vj∈J | Bvh ⊄ Bvg ∧ Bvg ⊄ Bvh. This means that {Bvj}vj∈J is a topped but not linearly ordered family. This means that we can find Bvg, Bvh, Bvk ∈ {Bvj}vj∈J | ((Bvh ∩ Bvk ≠ ∅) ∧ (Bvh ∪ Bvk ⊂ Bvg) ∧ (Bvh ⊄ Bvk) ∧ (Bvk ⊄ Bvh)) ⇒ ∃vh, vk, vg ∈ V | ((Γ−V(vh) ∩ Γ−V(vk) ≠ ∅) ∧ (Γ−V(vh) ∪ Γ−V(vk) ⊆ Γ−V(vg)) ∧ (Γ−V(vh) ⊄ Γ−V(vk)) ∧ (Γ−V(vk) ⊄ Γ−V(vh))). This means that there are two distinct paths from the root of T to vg, one through vh and one through vk; thus δ−V(vg) = 2, and so T is not a tree.
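The two mappings of Theorems 2 and 3 can be illustrated with a small sketch that, for each node of a sample tree, builds the set of its descendants (NS-F) or of its ancestors (INS-F) and checks the defining properties. The tree and the helper names are illustrative assumptions, not code from the NESTOR framework.

```python
# Sample tree as child -> parent (the root "a" has no entry); node names are
# illustrative only.
parent = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}
nodes = set(parent) | set(parent.values())

def ancestors(v):
    """Γ−(v): the node itself plus all of its ancestors up to the root (INS-M)."""
    out = {v}
    while v in parent:
        v = parent[v]
        out.add(v)
    return out

def descendants(v):
    """Γ+(v): the node itself plus all of its descendants (NS-M)."""
    return {u for u in nodes if v in ancestors(u)}

ns_family = {v: frozenset(descendants(v)) for v in nodes}   # Theorem 2
ins_family = {v: frozenset(ancestors(v)) for v in nodes}    # Theorem 3

# Spot-check the defining properties: in the NS-F any two sets are either
# disjoint or nested; in the INS-F the intersection of two sets is again a
# set of the family (the set of the nearest common ancestor).
for x in ns_family.values():
    for y in ns_family.values():
        assert not (x & y) or x <= y or y <= x
assert ins_family["d"] & ins_family["f"] == ins_family["a"]
print(sorted(ns_family["b"]), sorted(ins_family["d"]))
# ['b', 'd', 'e'] ['a', 'b', 'd']
```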
4 FAST and NESTOR: A Set-Theoretic View of Annotation Hierarchies
From the data model perspective, in the context of DLs we have to take into account both the organization of digital resources and the organization of annotations. In Definition 1 we treated both annotations and documents as Digital Objects (DO); it is worthwhile for the rest of the work to maintain this notation and, as a generalization, we indicate as documents every resource type managed by a DL which is not an annotation. The NESTOR framework can be applied to all DO organizations with a tree structure. In the following we show how NESTOR can be applied to a tree where
the nodes are documents and where each node can be the root of a sub-tree of annotations. The union of the document tree and the annotation sub-trees forms a DO tree that can be uniformly mapped into one of the set data models formalized in the NESTOR framework.
Let T(DT, ET) be a document tree, where DT = {di}, 1 ≤ i ≤ n, is the set of documents representing the nodes of the tree and ET ⊂ DT × DT is the set of edges connecting the nodes. From Proposition 1 we know that for each di ∈ DT there may exist a tree Hda[di] of annotations rooted in di; Hda is the forest given by the union of all the trees Hda[di]. From Definition 1 we know that DO = D ∪ A; thus we can define the tree TH(DOTH, ETH) where DOTH = DT ∪ DO and ETH = ET ∪ Eda. TH(DOTH, ETH) is a DO tree because its nodes may be both documents and annotations. In Figure 4 we can see a document tree, two annotation sub-trees and how the union of these trees forms a DO tree. From this figure we can see that we have to deal with three trees, each of them enabling access to a different granularity level of information: T(DT, ET) permits us to access only the documents managed by a DL, Hda permits us to access only the annotations and the annotated documents (the roots of the annotation trees), and TH(DOTH, ETH) permits us to access the whole resource space composed of documents and annotations.
All the DOs belonging to the tree TH(DOTH, ETH) can be encoded in eXtensible Markup Language (XML) files; for instance, if T(DT, ET) represents an archival tree, each node (document) would represent a division of the archive, such as a fonds, sub-fonds or series, that can be encoded in an XML file. The annotations, as well, are encoded following the FAST XSchema and are thus treated as XML files. All the relationships between the nodes of TH(DOTH, ETH) are hard coded
Fig. 4. A sample document tree T (DT , ET ) with two nodes which are roots of two annotation sub-trees
Indeed, FAST XSchema permits us to encode the “annotate” link information, and a similar approach is adopted by the Encoded Archival Description (EAD) metadata format [10] in the archival context. The adoption of the NESTOR framework enables the separation of the information about the hierarchical structure of DOs from their content, because the structural links are mapped into inclusion dependencies between sets. In the case of annotations, only the “annotate” links can be treated by means of NESTOR because they form a tree structure; instead, the “relate-to” links form a directed acyclic graph that is outside the scope of NESTOR. Following Theorems 2 and 3, the document tree, the annotation forest and the DO tree can be straightforwardly mapped into an equivalent number of NS-F or INS-F. We define {Ti }i∈I to be the family mapped from T (DT , ET ), {Hj }j∈J [dt ] to be the family mapped from Hda [dt ], where H = {{Hj }j∈J [dt ]} with dt ∈ DT is the collection of families {Hj }j∈J , and {T Hk }k∈K to be the family mapped from T H (DOT H , ET H ). If the mapping from the trees to the families is done following Theorem 2 we obtain a collection of NS-F, as we can see in Figure 5; if instead it is done following Theorem 3 we obtain a collection of INS-F. In Figure 5 we represent only the structure of the NS-M, but every set can contain some elements such as XML files encoding the content of DOs. By means of NESTOR we separate the content from the hierarchical structure of DOs, enabling a flexible way to access and exchange DOs in a distributed environment. In a distributed environment where two or more DLs exchange DOs we have to consider not only how to exchange the DOs but also which additional information may be necessary to properly understand the meaning of the exchanged DOs. For instance, if we consider the annotation Ai ∈ Hda , in order to understand its meaning we need, at least, to access the document annotated by Ai . In order to infer the context of Ai we need the DOs in the path from Ai to the root
Fig. 5. The trees in Figure 4 mapped into NS-Families
of the tree Hda . The information brought by the root of the annotation tree may not be sufficient to understand the proper meaning of Ai , so we may need to reconstruct the path from Ai up to the root of the document tree T (DT , ET ). By means of NESTOR these operations can be easily done by mapping the trees into the INS-M, which fosters the reconstruction of the upper levels of the hierarchy. The reconstruction of the path then becomes a series of set operations that do not involve the content of the DOs (for instance the XML files encoding the DOs). If we consider the INS-F {T Hk }k∈K we can determine the nearest common ancestor of two or more annotations simply by intersecting the sets to which they belong; indeed, suppose we want to know the correlation between two annotations belonging to the same DO tree: by means of the INS-M we can determine their nearest common document through a single intersection operation between sets. On the other hand, we can decide to use the NS-M because it is useful to determine the descendants of a node. If we consider the NS-F {Ti }i∈I , starting from a document we can reconstruct all its descendants in the document space; if we consider the collection of families H we can reconstruct all the annotations related to a document or to a particular annotation; and if we consider the NS-F mapped from T H (DOT H , ET H ), starting from a document we can easily reconstruct all its descendants both in the document and in the annotation spaces. Furthermore, the use of set data models permits us to change the structure of the hierarchy without affecting the content of the DOs; indeed, any variation in the hierarchical structure will affect only the inclusion order between the sets and not the data that is encoded in the XML files.
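These set operations can be sketched in a few lines of Python (a simplified illustration of ours; the DO identifiers are hypothetical and only the tree structure matters): the INS-M reduces the search for the nearest common document of two annotations to one intersection, while the NS-M yields the descendants of a node.

from collections import defaultdict

# Hypothetical DO tree given as child -> parent, over documents (d*) and annotations (a*).
parent = {"d2": "d1", "d3": "d1", "d4": "d1",
          "a1": "d1", "a2": "a1", "a3": "a1",
          "a6": "d4", "a7": "a6"}
children = defaultdict(list)
for child, par in parent.items():
    children[par].append(child)

def ancestor_set(node):
    # INS-M: the set mapped from a node holds the node and all of its ancestors.
    s = {node}
    while node in parent:
        node = parent[node]
        s.add(node)
    return s

def descendant_set(node):
    # NS-M: the set mapped from a node holds the node and all of its descendants.
    s, stack = {node}, [node]
    while stack:
        for child in children[stack.pop()]:
            s.add(child)
            stack.append(child)
    return s

def nearest_common_ancestor(a, b):
    # One intersection between INS-M sets gives all common ancestors;
    # the deepest of them is the nearest common document or annotation.
    common = ancestor_set(a) & ancestor_set(b)
    return max(common, key=lambda n: len(ancestor_set(n)))

print(nearest_common_ancestor("a2", "a3"))  # a1: two annotations on the same thread
print(nearest_common_ancestor("a3", "a7"))  # d1: their nearest common document
print(sorted(descendant_set("d4")))         # ['a6', 'a7', 'd4']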
5
Final Remarks
In this paper we have described how the NESTOR framework can be applied to the FAST system, enabling a set-theoretical view of annotation hierarchies. We have shown that documents and annotations are often organized into tree data structures and how these hierarchies can be joined together in a tree representing the whole information space composed of documents and annotations. Furthermore, we have pointed out some of the advantages of using the set data models defined in the NESTOR framework. Future work will concern the definition of how to exchange annotations between DLs in a distributed environment, exploiting the set-theoretical extension of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) described in [6]. Furthermore, we shall analyze how the application of NESTOR affects the FAST search framework [5]. From the text-based search point of view, the NESTOR framework can be exploited to display the search results in the proper context of the hierarchy and not only in a flat ranked list.
Acknowledgments The work reported has been supported by a grant from the Italian Veneto Region. The study is also partially supported by EuropeanaConnect (Contract ECP2008-DILI-52800).
References 1. Agosti, M., Ferro, N.: A formal model of annotations of digital content. ACM Trans. Inf. Syst. 26(1) (2007) 2. Candela, L., Castelli, D., Ferro, N., Koutrika, G., Meghini, C., Pagano, P., Ross, S., Soergel, D., Agosti, M., Dobreva, M., Katifori, V., Schuldt, H.: The DELOS Digital Library Reference Model. Foundations for Digital Libraries. ISTI-CNR at Gruppo ALI, Pisa (November 2007) 3. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order, 2nd edn. Cambridge University Press, Cambridge (2002) 4. Duranti, L.: Diplomatics: New Uses for an Old Science. Society of American Archivists and Association of Canadian Archivists in association with Scarecrow Press (1998) 5. Ferro, N.: Annotation Search: The FAST Way. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, pp. 15–26. Springer, Heidelberg (2009) 6. Ferro, N., Silvello, G.: The NESTOR Framework: How to Handle Hierarchical Data Structures. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, pp. 215–226. Springer, Heidelberg (2009) 7. Haslhofer, B., Jochum, W., King, R., Sadilek, C., Schellner, K.: The LEMO Annotation Framework: Weaving Multimedia Annotations With the Web. International Journal on Digital Libraries 10(1), 15–32 (2009) 8. Ioannidis, Y.E., Maier, D., Abiteboul, S., Buneman, P., Davidson, S.B., Fox, E.A., Halevy, A.Y., Knoblock, C.A., Rabitti, F., Schek, H.J., Weikum, G.: Digital Library Information-Technology Infrastructures. International Journal on Digital Libraries 5(4), 266–274 (2005) 9. Jech, T.: Set Theory-The Third Millenium edn. Springer, Heidelberg (2003) 10. Pitti, D.V.: Encoded Archival Description. An Introduction and Overview. D-Lib Magazine 5(11) (1999) 11. Vegas, J., Crestani, F., de la Fuente, P.: Context Representation for Web Search Results. Journal of Information Science 33(1), 77–94 (2007)
A New Domain Independent Keyphrase Extraction System Nirmala Pudota, Antonina Dattolo, Andrea Baruzzo, and Carlo Tasso Artificial Intelligence Lab Department of Mathematics and Computer Science University of Udine, Italy {nirmala.pudota,antonina.dattolo,andrea.baruzzo,carlo.tasso}@uniud.it
Abstract. In this paper we present a keyphrase extraction system that can extract potential phrases from a single document in an unsupervised, domain-independent way. We extract word n-grams from the input document. We incorporate linguistic knowledge (i.e., part-of-speech tags) and statistical information (i.e., frequency, position, lifespan) of each n-gram in defining candidate phrases and their respective feature sets. The proposed approach can be applied to any document; however, in order to assess the effectiveness of the system for digital libraries, we have carried out the evaluation on a set of scientific documents and compared our results with current keyphrase extraction systems.
1
Introduction
A keyphrase is a short phrase (typically containing one to three words) that provides a key idea of a document. A keyphrase list is a short list of keyphrases (typically five to fifteen phrases) that reflects the content of a single document, capturing the main topics discussed and providing a brief summary of its content. If keyphrases are attached to every document, a user can easily choose which documents to read and/or understand the relationships among documents. Document keyphrases are used successfully in Information Retrieval (IR) and Natural Language Processing (NLP) tasks, such as document indexing [9], clustering [10], classification [14], and summarization [4]. Among all of them, document indexing is one important application of automatic keyphrase generation in digital libraries, where the major part of publications is usually not associated with keyphrases. Furthermore, keyphrases are well exploited for other tasks such as thesaurus creation [13], subject metadata enrichment [25], query expansion [21], and automatic tagging [18]. Despite having many applications, only a small percentage of documents have keyphrases assigned to them. For instance, in digital libraries, authors assign keyphrases to their documents when they are instructed to do so [9]; other digital content, like news or magazine articles, usually does not have keyphrases
The authors acknowledge the financial support of the Italian Ministry of Education, University and Research (MIUR) within the FIRB project number RBIN04M8S8.
since it is neither mandatory nor necessary for the document authors to provide keyphrases. Manually assigning keyphrases to documents is tedious, time-consuming, as well as expensive. Therefore, automatic methods that generate keyphrases for a given document are beneficial. Witten et al. [24] defined two fundamental approaches for automatic keyphrase generation: 1. Keyphrase assignment: in this case, the set of possible keyphrases is limited to a predefined vocabulary of terms (e.g., subject headings, classification schemes, thesauri). The task is to classify documents, based on their content, into different keyphrase classes that correspond to the terms of a pre-defined list. In this process, the document can be associated with keyphrases constituted by words (or n-grams) that are not contained in the document. 2. Keyphrase extraction: in contrast to the previous case, keyphrase extraction selects the most indicative phrases present in the input document. In this process, the selection of keyphrases does not depend on any vocabulary and such phrases are supposed to be available in the document itself. In this paper, we concentrate on the keyphrase extraction problem, leaving aside the more general task of keyphrase assignment. The work presented here is part of a wider research project, PIRATES (Personalized Intelligent tag Recommendation and Annotation TEStbed) [2,3,7], a framework for personalized content retrieval, annotation, and classification. Using an integrated set of tools, the PIRATES framework lets users experiment, customize, and personalize the way they retrieve, filter, and organize the large amount of information available on the Web. Furthermore, the framework undertakes a novel approach that automates typical manual tasks such as content annotation and tagging, by means of personalized tag recommendations and other forms of textual annotations (e.g., keyphrases). The rest of this paper is organized as follows: Section 2 introduces the related work. The proposed domain independent keyphrase extraction system is described in detail in Section 3. The empirical evaluation is presented in Section 4 and finally we conclude the paper in Section 5.
2
Related Work
Keyphrase extraction methods usually work in two stages: (i) a candidate identification stage, which identifies all possible phrases from the document, and (ii) a selection stage, which selects only a few candidate phrases as keyphrases. Existing methods for keyphrase extraction can be divided into supervised and unsupervised approaches, illustrated in the following: A. The supervised approach treats the problem as a classification task. In this approach, a model is constructed by using training documents that already have keyphrases assigned (by humans) to them. This model is applied in order to select keyphrases from previously unseen documents. Turney
(developer of Extractor 1 ) [22] was the first to formulate keyphrase extraction as a supervised learning problem. According to him, all phrases in a document are potential keyphrases, but only phrases that match human-assigned ones are correct keyphrases. Turney uses a set of parametric heuristic rules and a genetic algorithm for extraction. Another notable keyphrase extraction system is KEA (Keyphrase Extraction Algorithm) [24]; it builds a classifier based on Bayes’ theorem using training documents, and then it uses the classifier to extract keyphrases from new documents. In both training and extraction, KEA analyzes the input document depending on orthographic boundaries (such as punctuation marks and newlines) in order to find candidate phrases. In KEA two features are exploited: tf×idf (term frequency × inverse document frequency) and the first occurrence of the term. Hulth [11] introduces linguistic knowledge (i.e., part-of-speech (pos) tags) in determining candidate sets: 56 potential pos-patterns are used by Hulth in identifying candidate phrases in the text. The experimentation carried out by Hulth has shown that, using pos tags as a feature in candidate selection, a significant improvement of the keyphrase extraction results can be achieved. Another system that relies on linguistic features is LAKE (Learning Algorithm for Keyphrase Extraction) [8]: it exploits linguistic knowledge for candidate identification and it applies a Naive Bayes classifier in the final keyphrase selection. All the above systems need training data to a small or large extent in order to construct an extraction system. However, acquiring training data with known keyphrases is not always feasible and human assignment is time-consuming. Furthermore, a model that is trained on a specific domain does not always yield better classification results in other domains.
http://www.extractor.com/ Note that unsupervised approaches might use tools like POS taggers which rely on supervised approaches. However, as such tools are usually already available for most languages, we consider an approach unsupervised if it does not make use of any training documents that already have keyphrases assigned to them.
utilizes document cluster information to extract keyphrases from a single document is presented in [23]. Employing graph-based ranking methods for keyphrase extraction is another widely used unsupervised approach, exploited for example in [16]. In such methods, a document is represented as a term graph based on term relatedness, and then a graph-based ranking algorithm (similar to the PageRank algorithm [6]) is applied to assign a score to each term. Term relatedness is approximated between terms that co-occur within a pre-defined window size. Keyphrase extraction systems that are developed by following an unsupervised approach are in general domain independent, since they are not constrained by any specific training documents.
3
Domain Independent Keyphrase Extraction (DIKpE) System Description
A domain independent keyphrase extraction approach, which does not require any training data, has many applications. For instance, it can be useful for a user who wants to quickly know the content of a new Web page, or who wants to know the main claim of a paper at hand. In such cases, a keyphrase extraction approach that can be applied without a corpus3 of the same kind of documents is very useful. Simple term frequency is sometimes sufficient to get an overview of the document; however, more powerful techniques are desirable. Our approach can be applied to any document without the need of a corpus: it is solely based on a single document. In the following, we provide a detailed description of our approach. The general workflow of the DIKpE system is shown in Fig. 1 and is illustrated in detail in the following subsections 3.1, 3.2, and 3.3. We follow three main steps: (i) extract candidate phrases from the document; (ii) calculate feature values for the candidates; (iii) compute a score for each candidate phrase from its feature values and rank the candidate phrases based on their respective scores, so that the highest ranked phrases are assigned as keyphrases. 3.1
Step1: Candidate Phrase Extraction
This step is divided into the following substeps: – Format conversion. We assume that the input document can be in any format (e.g., pdf), and as our approach only deals with textual input, our system first exploits document converters to extract the text from the given input document. – Cleaning and Sentence delimiting. The plain text form is then processed to delimit sentences, following the assumption that no keyphrase is located across two sentences.
A collection of documents.
Fig. 1. Workflow in DIKpE system
Separating sentences by inserting a sentence boundary is the main aim of this step. The result of this step is a set of sentences, each containing a sequence of tokens bounded by the sentence delimiter. – POS tagging and n-gram extraction. We assign a pos tag (noun, adjective, verb, etc.) to each token in the cleaned text by using the Stanford log-linear part-of-speech tagger4 . The Stanford pos tagger uses 36 types5 of pos tags (for documents written in Italian, an Italian pos tagger developed using an n-gram model trained on the La Repubblica corpus6 is utilized). The assigned pos tags are later utilized for filtering candidate phrases and in calculating the pos value feature. The next step in our procedure is to extract n-grams. We have observed that in the dataset utilized for the experimentation, phrases that are constituted by more than 3 words are rarely assigned as keyphrases, so, in our process, we set the value of ‘n’ to the maximum value 3. We extract all possible subsequences of phrases up to 3 words (uni-grams, bi-grams, and tri-grams). – Stemming and Stopword removing. From the extracted n-grams, we remove all phrases7 that start and/or end with a stopword and phrases containing the sentence delimiter. Partial stemming (i.e., unifying plural forms and singular forms which mean essentially the same thing) is performed using the first step of the Porter stemmer algorithm [20]. To reduce the size of the candidate phrase set, we have filtered out some candidate phrases by using their pos tagging information. Uni-grams that are not labeled as noun, adjective, or verb are filtered out. For bi-grams and tri-grams, only the pos-patterns defined by Justeson and Katz [12] and other patterns that include adjective and verb forms are considered.
http://nlp.stanford.edu/software/tagger.shtml. pos tagging follows the Penn Treebank tagging scheme. http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica In our use of this term, we mean any n-gram (n=1,2,3) phrase.
– Separating n-gram lists. Generally, in a document, uni-grams are more frequent than bi-grams, bi-grams are more frequent than tri-grams, and so on. In the calculation of the phrase frequency feature (explained in Section 3.2), this introduces a bias towards n-grams with a small value of ‘n’. In order to solve this problem, we have separated n-grams of different lengths (n=1, n=2, and n=3) and arranged them in three different lists. These lists are treated separately in the calculation of the feature sets and in the final keyphrase selection. As a result of step 1, we obtain a separate list of uni-gram, bi-gram, and tri-gram candidate phrases (with corresponding pos tags) per document, after the stemming and stopword removal explained above. 3.2
Step2: Feature Calculation
The candidate phrase extraction step is followed by a feature calculation step that characterizes each candidate phrase by statistical and linguistic properties. Five features are computed for each candidate phrase: phrase frequency, pos value, phrase depth, phrase last occurrence, and phrase lifespan, illustrated in the following. – phrase frequency: this feature is the same as the classical term frequency (tf) metric, but instead of calculating it with respect to the whole length of the document, we compute it with respect to each n-gram list. With a separate list for each n-gram in hand, the phrase frequency for a phrase P in a list L is: frequency(P, L) = freq(P, L) / size(L), where: • freq(P, L) is the number of times P occurs in L; • size(L) is the total number of phrases included in L. – pos value: as described in [1], most author-assigned keyphrases for a document turn out to be noun phrases. For this reason, in our approach, we stress the presence of a noun in a candidate phrase while computing a pos value for the phrase. A pos value is assigned to each phrase by calculating the number of nouns (singular or plural) and normalizing it by the total number of terms in the phrase. For instance, in a tri-gram phrase, if all tokens are noun forms, then the pos value of the phrase is 1; if two tokens are noun forms, then the pos value is 0.66; and if one noun is present, the value is 0.33. All remaining phrases which do not include at least one noun form are assigned the pos value 0.25. The same strategy is followed for bi-gram and uni-gram phrases. – phrase depth: this feature reflects the belief that important phrases often appear in the initial part of the document, especially in news articles and scientific publications (e.g., abstract, introduction). We compute the position
in the document where the phrase first appears. The phrase depth value for a phrase P in a document D is: depth(P, D) = 1 − [first_index(P) / size(D)], where first_index(P) is the number of words preceding the phrase’s first appearance and size(D) is the total number of words in D. The result is a number between 0 and 1. Highest values represent the presence of a phrase at the very beginning of the document. For instance, if a phrase appears at the 16th position, while the whole document contains 700 words, the phrase depth value is 0.97, indicating the first appearance at the beginning of the document. – phrase last occurrence: we also give importance to phrases that appear at the end of the document, since keyphrases may also appear in the last parts of a document, as in the case of scientific articles (i.e., in the conclusion and discussion parts). The last occurrence value of a phrase is calculated as the number of words preceding the last occurrence of the phrase normalized by the total number of words in the document. The last occurrence value for a phrase P in a document D is: last_occurrence(P, D) = last_index(P) / size(D), where last_index(P) is the number of words preceding the phrase’s last appearance and size(D) is the total number of words in D. For instance, if a phrase appears for the last time at the 500th position in a document that contains 700 words, then the phrase last occurrence value is 0.71. – phrase lifespan: the span value of a phrase depends on the portion of the text that is covered by the phrase. The covered portion of the text is the distance between the first occurrence position and the last occurrence position of the phrase in the document. The lifespan value is computed by calculating the difference between the phrase last occurrence and the phrase first occurrence. The lifespan value for a phrase P in a document D is: lifespan(P, D) = [last_index(P) − first_index(P)] / size(D), where last_index(P) is the number of words preceding the phrase’s last appearance, first_index(P) is the number of words preceding the phrase’s first appearance, and size(D) is the total number of words in D. The result is a number between 0 and 1. Highest values mean that the phrase is introduced at the beginning of the document and carried until the end of the document. Phrases that appear only once throughout the document have a lifespan value of 0. As a result of step 2, we get a feature vector for each candidate phrase in the three n-gram lists.
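The five features can be computed directly from word offsets. The following Python sketch (a simplified rendering of the formulas above; tokenization, POS tagging and the separate n-gram lists are assumed to come from Step 1, and the helper names are ours) reproduces the worked examples of the depth and last occurrence features.

def phrase_positions(doc_tokens, phrase_tokens):
    # Word offsets at which the phrase occurs in the document.
    n = len(phrase_tokens)
    return [i for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == phrase_tokens]

def features(phrase, ngram_list, doc_tokens, noun_count):
    # Compute the five DIKpE features; ngram_list is the list (L1, L2 or L3)
    # containing all candidate occurrences of the same length as the phrase.
    positions = phrase_positions(doc_tokens, phrase.split())
    size_d = len(doc_tokens)
    first_index, last_index = positions[0], positions[-1]

    frequency = ngram_list.count(phrase) / len(ngram_list)
    # pos value: share of nouns in the phrase, 0.25 if the phrase contains no noun.
    pos_value = noun_count / len(phrase.split()) if noun_count else 0.25
    depth = 1 - first_index / size_d
    last_occurrence = last_index / size_d
    lifespan = (last_index - first_index) / size_d
    return [frequency, pos_value, depth, last_occurrence, lifespan]

# Toy usage: a 700-word document where "digital library" appears at offsets 16 and 500,
# so depth is about 0.97 and last occurrence about 0.71, as in the examples above.
doc = ["w"] * 700
doc[16:18] = ["digital", "library"]
doc[500:502] = ["digital", "library"]
bigrams = ["digital library", "digital library", "annotation model", "set theory"]
print(features("digital library", bigrams, doc, noun_count=2))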
3.3
Step3: Scoring and Ranking
In this step a score is assigned to each candidate phrase, which is later exploited for the selection of the most appropriate phrases as representatives of the document. The score of each candidate phrase is calculated as a linear combination of the 5 features. We call the resulting score value the keyphraseness of the candidate phrase. The keyphraseness of a phrase P with non-empty feature set {f1, f2, ..., f5} and non-negative weights {w1, w2, ..., w5} is: keyphraseness(P) = (Σ_{i=1..5} wi·fi) / (Σ_{i=1..5} wi). In this initial stage of the research, we assign equal weights to all features, yielding the computation of the average. Therefore: keyphraseness(P) = (1/n) Σ_{i=1..n} fi, where: – n is the total number of features (i.e., 5 in our case); – f1 is the phrase frequency; – f2 is the phrase depth; – f3 is the phrase pos value; – f4 is the phrase last occurrence; – f5 is the phrase lifespan.
Producing Final Keyphrases. The scoring process produces three separate lists L1, L2, and L3 containing, respectively, all the uni-grams, bi-grams and tri-grams with their keyphraseness values. We then select from each list the keyphrases which are considered to be the most important. In order to produce the ‘k’ final keyphrases, we have followed the same strategy that was utilized in [15]. In every list, the candidate phrases are ranked in descending order based on their keyphraseness values. The top 20% (i.e., 20% of ‘k’) keyphrases are selected from L3, the top 40% (i.e., 40% of ‘k’) are selected from L2, and the remaining 40% of ‘k’ keyphrases are selected from L1. In this way the top k keyphrases for the given document are extracted.
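A minimal sketch of the scoring and selection step (assuming the feature vectors from Step 2 are already available; the equal-weight average and the 20/40/40 split follow the description above, while the data structures and example values are ours):

def keyphraseness(feature_vector, weights=None):
    # Weighted average of the features; with no weights this is the plain mean.
    if weights is None:
        weights = [1.0] * len(feature_vector)
    return sum(w * f for w, f in zip(weights, feature_vector)) / sum(weights)

def select_keyphrases(l1, l2, l3, k=10):
    # l1, l2, l3 map uni-, bi- and tri-gram candidates to their feature vectors;
    # 20% of k comes from tri-grams, 40% from bi-grams and 40% from uni-grams.
    def ranked(candidates):
        scored = {p: keyphraseness(f) for p, f in candidates.items()}
        return sorted(scored, key=scored.get, reverse=True)

    n3 = round(0.2 * k)
    n2 = round(0.4 * k)
    n1 = k - n3 - n2
    return ranked(l3)[:n3] + ranked(l2)[:n2] + ranked(l1)[:n1]

# Toy usage with made-up feature vectors (frequency, pos value, depth, last occ., lifespan).
unigrams = {"keyphrase": [0.4, 1.0, 0.9, 0.8, 0.7], "extraction": [0.3, 1.0, 0.6, 0.9, 0.5]}
bigrams = {"keyphrase extraction": [0.5, 1.0, 0.97, 0.71, 0.69]}
trigrams = {"automatic keyphrase extraction": [0.2, 1.0, 0.95, 0.4, 0.3]}
print(select_keyphrases(unigrams, bigrams, trigrams, k=5))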
4
Evaluation
The effectiveness and efficiency of our system have been tested on a publicly available keyphrase extraction dataset [19] which contains 215 full-length documents from different computer science subjects. Each document in the dataset contains a first set of keyphrases assigned by the paper’s authors and a second set of keyphrases assigned by volunteers familiar with computer science papers. DIKpE is evaluated by computing the number of matches between the
keyphrases attached to the document and the keyphrases extracted automatically. The same partial stemming strategy exploited in candidate phrase selection (see Section 3.1) is used also in matching keyphrases. For instance, given the keyphrase set S1 {component library, facet-based component retrieval, ranking algorithm, component rank, retrieval system} attached to a document and the set S2 {component library system, web search engine, component library, component ranks, retrieval systems, software components} suggested by our system, the number of exact matches is 3: {component library, component rank, retrieval system}. We have carried out two experiments in order to test our system’s performance. For the first experiment, we have considered the keyphrase extraction works presented by Nguyen&Kan [19] and KEA [24] as baseline systems. From the available 215 documents, Nguyen&Kan took 120 documents to compare their approach with KEA. The maximum number of keyphrases for each document (i.e., ‘k’) is set to ten in Nguyen&Kan. We have taken their results [19] as reference, and in the first experiment we have worked on 120 documents randomly selected from the 215 documents. In both experiments, we removed the bibliography section from each document in the dataset in order to better utilize the phrase last occurrence feature. Table 1 shows the average number of exact matches of the three algorithms when 10 keyphrases are extracted from each document: our system significantly outperforms the other two. For the second experiment, we have extracted keyphrases for all 215 documents and compared our approach exclusively with the results provided by KEA. We have utilized a total of 70 documents (with keyphrases assigned by authors) extracted from the 215-document dataset to train the KEA algorithm. For each document, we extracted the 7, 15 and 20 top keyphrases using both our approach and KEA. The results are shown in Table 2: it is clear that even though our system does not undertake any training activity, it greatly outperforms KEA. A sample output of the DIKpE system for three sample documents is shown in Table 3. For each document the top seven keyphrases extracted by DIKpE are
Table 1. Overall Performances
System       Average # of exact matches
KEA          3.03
Nguyen&Kan   3.25
DIKpE        4.75
Table 2. Performance of DIKpE compared to KEA
Keyphrases extracted   Average number of exact matches
                       KEA     DIKpE
7                      2.05    3.52
15                     2.95    4.93
20                     3.08    5.02
Table 3. Top seven keyphrases extracted by the DIKpE system for three sample documents
Document #26. Accelerating 3D Convolution using Graphics Hardware.
– keyphrases assigned by the document authors: convolution; hardware acceleration; volume visualization
– keyphrases assigned by volunteers: 3D convolution; filtering; visualization; volume rendering
– keyphrases assigned by the DIKpE system: high pass filters; volume rendering; filter kernels; 3d convolution; convolution; visualization; filtering
Document #57. Contour-based Partial Object Recognition using Symmetry in Image Databases.
– keyphrases assigned by the document authors: object; image; contour; recognition; symmetry
– keyphrases assigned by volunteers: occlusion; object recognition; symmetry; contour estimation
– keyphrases assigned by the DIKpE system: partial object recognition; object recognition; objects in images; occlusion of objects; objects; symmetry; contours
Document #136. Measuring eGovernment Impact: Existing practices and shortcomings.
– keyphrases assigned by the document authors: e-government; law; interoperability; architectures; measurement; evaluation; benchmark
– keyphrases assigned by volunteers: measurement; e-government; public administration; business process
– keyphrases assigned by the DIKpE system: measuring e-government impact; business process; e-governmental services; public administration; e-government; measurement; business
presented: the first row lists the keyphrases assigned by the document authors, the second row the keyphrases assigned by volunteers, and the third row the keyphrases automatically extracted by the DIKpE system, several of which match author- or volunteer-assigned keyphrases. Even if some keyphrases of DIKpE do not match any of the assigned keyphrases, they are still correctly related to the main theme of the document.
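The match counting used in Tables 1–3 can be reproduced with a small helper; the sketch below is our own approximation, where the partial stemming of Section 3.1 is mimicked by stripping a trailing ‘s’ to unify singular and plural forms.

def normalize(phrase):
    # Approximate partial stemming: unify plural and singular word forms.
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in phrase.lower().split())

def exact_matches(assigned, extracted):
    # Count extracted keyphrases that match an assigned one after normalization.
    assigned_norm = {normalize(p) for p in assigned}
    return sum(1 for p in extracted if normalize(p) in assigned_norm)

s1 = ["component library", "facet-based component retrieval", "ranking algorithm",
      "component rank", "retrieval system"]
s2 = ["component library system", "web search engine", "component library",
      "component ranks", "retrieval systems", "software components"]
print(exact_matches(s1, s2))  # 3: component library, component rank(s), retrieval system(s)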
5
Conclusion and Future Work
In this paper, we have presented an innovative and hybrid approach to keyphrase extraction that works on a single document without any previous parameter tuning. Our current work focuses on the integration of the DIKpE system with the other tools of the PIRATES framework, in order to exploit the keyphrase extraction method for the automatic tagging task. Further work will focus on the
evaluation procedure. We assumed here that a keyphrase extraction system is optimal if it provides the same keyphrases that an author defines for his or her documents. However, in general there may exist many other keyphrases (different from those pre-assigned by authors) that are also appropriate for summarizing a given document. Thus, a further aspect to consider is the human subjectivity in assigning keyphrases, considering also adaptive personalization techniques for tuning the extraction process to the specific user’s interests. In this paper, we evaluated DIKpE performance on scientific publications, which are in general well structured and lengthy; in the future, we are planning to test the effectiveness of the system on short documents such as news or blog entries. Finally, as future work, we plan to investigate different ways to compute the coefficients of the linear combination of features. We also need to concentrate on a better way to decide the number of keyphrases to be extracted by the system, instead of using a fixed number.
References 1. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Hamilton, H.J. (ed.) Canadian AI 2000. LNCS (LNAI), vol. 1822, pp. 40–52. Springer, Heidelberg (2000) 2. Baruzzo, A., Dattolo, A., Pudota, N., Tasso, C.: A general framework for personalized text classification and annotation. In: Houben, G.-J., McCalla, G., Pianesi, F., Zancanaro, M. (eds.) UMAP 2009. LNCS, vol. 5535, pp. 31–39. Springer, Heidelberg (2009) 3. Baruzzo, A., Dattolo, A., Pudota, N., Tasso, C.: Recommending new tags using domain-ontologies. In: IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, vol. 3, pp. 409–412. IEEE, Milan (2009) 4. Berger, A.L., Mittal, V.O.: Ocelot: a system for summarizing web pages. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 144–151. ACM, New York (2000) 5. Bracewell, D.B., Ren, F., Kuroiwa, S.: Multilingual single document keyword extraction for information retrieval. In: Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, pp. 517–522 (2005) 6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks 30(1-7), 107–117 (1998) 7. Dattolo, A., Ferrara, F., Tasso, C.: Supporting personalized user concept spaces and recommendations for a publication sharing system. In: Houben, G.-J., McCalla, G., Pianesi, F., Zancanaro, M. (eds.) UMAP 2009. LNCS, vol. 5535, pp. 325–330. Springer, Heidelberg (2009) 8. D’Avanzo, E., Magnini, B., Vallin, A.: Keyphrase extraction for summarization purposes: the lake system at duc2004. In: DUC Workshop, Human Language Technology conference/North American chapter of the Association for Computational Linguistics annual meeting, Boston, USA (2004) 9. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 668–673. Morgan Kaufmann Publishers, San Francisco (1999)
10. Hammouda, K.M., Matute, D.N., Kamel, M.S.: Corephrase: Keyphrase extraction for document clustering. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 265–274. Springer, Heidelberg (2005) 11. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on Empirical Methods in Natural Language Processing, pp. 216–223. Association for Computational Linguistics, Morristown (2003) 12. Justeson, J., Katz, S.: Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1, 9–27 (1995) 13. Kosovac, B., Vanier, D.J., Froese, T.M.: Use of keyphrase extraction software for creation of an AEC/FM thesaurus. Electronic Journal of Information Technology in Construction 5, 25–36 (2000) 14. Krulwich, B., Burkey, C.: Learning user information interests through the extraction of semantically significant phrases. In: Hearst, M., Hirsh, H. (eds.) AAAI 1996 Spring Symposium on Machine Learning in Information Access, pp. 110–112. AAAI Press, California (1996) 15. Kumar, N., Srinathan, K.: Automatic keyphrase extraction from scientific documents using n-gram filtration technique. In: Proceedings of the Eight ACM symposium on Document engineering, pp. 199–208. ACM, New York (2008) 16. Litvak, M., Last, M.: Graph-based keyword extraction for single-document summarization. In: Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17–24. ACL, Morristown (2008) 17. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 257–266. ACL, Singapore (2009) 18. Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 1318–1327. ACL, Singapore (2009) 19. Nguyen, T.D., Kan, M.Y.: Keyphrase extraction in scientific publications. In: Goh, D.H.L., Cao, T.H., Sølvberg, I., Rasmussen, E.M. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007) 20. Porter, M.F.: An algorithm for suffix stripping. Readings in information retrieval, 313–316 (1997) 21. Song, M., Song, I.Y., Allen, R.B., Obradovic, Z.: Keyphrase extraction-based query expansion in digital libraries. In: Proceedings of the 6th ACM/IEEE-CS joint Conference on Digital libraries, pp. 202–209. ACM, New York (2006) 22. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval 2(4), 303–336 (2000) 23. Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the 23rd National Confernce on Artificial Intelligence, pp. 855–860. AAAI Press, Chicago (2008) 24. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: Kea: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries, pp. 254–255. ACM, New York (1999) 25. Wu, Y.F.B., Li, Q.: Document keyphrases as subject metadata: incorporating document key concepts in search results. Information Retrieval 11(3), 229–249 (2008)
An Event-Centric Provenance Model for Digital Libraries Donatella Castelli, Leonardo Candela, Paolo Manghi, Pasquale Pagano, Cristina Tang, and Costantino Thanos Istituto di Scienza e Tecnologie dell’Informazione “Alessandro Faedo” – CNR, Pisa, Italy {name.surname}@isti.cnr.it
Abstract. Provenance is intended as the description of the origin and/or of the descendant line of data. In the last decade, keeping track of provenance has become crucial for the correct exploitation of data in a wide variety of application domains. The rapid evolution of digital libraries, which have today become advanced systems for the integration and management of cross-domain digital objects, has recently called for models capturing the aspects of data provenance in this application field. However, there is no common definition of digital library provenance, and existing solutions address the problem only from the perspective of specific application scenarios. In this paper we propose a provenance model for digital libraries, inspired by approaches and experiences in the e-Science and cultural heritage worlds and based on the notion of an event occurred to an object. The model aims at capturing the specificities of provenance for digital library objects in order to provide practitioners and researchers in the field with common DL-specific provenance description languages.
1 Introduction To use a computer science definition, “provenance, also called lineage, is the term used to describe the source and derivation of data” (ref. Peter Buneman [8]). Nowadays, keeping track of provenance is becoming crucial for the correct exploitation of data in a wide variety of application domains. For example, the physics and medical science communities are not only interested in the data resulting from their experiments, but also in: (i) the origin of the data, i.e. the physical or virtual location where it was originally produced, or (ii) the descendant line of the data, i.e. the sequence of actions that followed its production and determined its transformation into a sequence of intermediate digital manifestations. Lately, provenance has become increasingly important for Digital Libraries (DLs). DLs, which started out as a digital replica of traditional libraries, have witnessed a rapid evolution in the last two decades. In particular, content is no longer limited to digital text documents described by bibliographic metadata records, and functionality is no longer restricted to ingestion and metadata-based search of such objects. DLs today handle so-called information objects [9], intended as digital objects of any file format associated to metadata records of different kinds (e.g. geo-spatial information, licensing) and to other objects, to form graphs of arbitrary complexity. Similarly, functionality is richer and much more complex than in the past, in order to cope with such graph-oriented data
models and the variety of application domains. Due to their cross-domain nature, DLs number among their challenges that of introducing ways for keeping track of the provenance of information objects and offering functionality to exploit such information at best. Provenance has been studied extensively by the e-Science community, where several models have been proposed [18,20] to capture the notion of provenance of objects generated by scientific workflows. A scientific workflow consists of a set of orchestrated processes generating output objects from a set of input objects. In these applicative scenarios provenance is typically recorded at the time at which the workflow is executed, and output objects are typically “versioned” and not “modified” by the workflows. However, the life-cycle of DL information objects differs from e-Science’s, and such models can hardly be adopted in this context. Here, objects are processed (i.e. deleted, updated, generated) by actions fired by independent actors within different jurisdictions, and based on input objects possibly originated in multiple environments and not necessarily obeying the same management rules (e.g. versioning). DLs require a provenance model in sync with these peculiar processing patterns. A number of application-specific solutions can be found in the literature, solving provenance representation from the perspective of particular DL domains. However, none of them aims at capturing the notion of provenance in a broader sense, that is, beyond the boundaries of individual DL solutions, capturing the commonalities of DL information objects in that respect. In this paper we intend to address this issue and propose a novel model for DL provenance based on the notion of event. Unlike other provenance models [18,16], which capture the notion of causality between different objects of interest, our model claims that the provenance of an object is described by its history, expressed as the sequence of events that happened to the object since its birth. The objective of the model is two-fold: providing guidelines and best practices for DL designers and developers to deal with provenance for DLs and, at the same time, facilitating DL system interoperability in that respect. Moreover, the model mitigates the long-standing granularity problem [7], in which the expressiveness of a provenance model is constrained by the granularity level of objects. The remainder of the paper is organized as follows: Section 2 describes our provenance model, while Section 3 introduces a case study to exemplify the usage of the model and show how it can tackle the granularity problem. Finally, Section 4 illustrates the contributions of this work in the context of related works and Section 5 ends the paper.
2 Description of the Model In this section we propose an event-centric model, whose structure is illustrated in Figure 1. According to the model, events are happenings that occur to objects, here intended as instances of reference objects, which in turn represent the entities of interest. More in detail: Definition. Reference objects are conceptual objects classified as such by an extrinsic authority. Reference objects can be abstract, such as concepts and ideas, or concrete, such as people and places.
Fig. 1. Relationships and entities of the model
A reference object can have multiple instances over time, called objects. Definition. An Object represents an instance of a reference object and is uniquely identifiable. An object can be a physical materialization, a digital surrogate or a digital instantiation of a reference object. All the relationships and entities of the model are shown in Figure 1. In this model, we are concerned only with the identity of the entities. The instances of these entities may be expressed in the form of free text, in a machine-parsable ontology or in a grammar-defined language. The modeling of these entities is outside the scope of this model. In this model, objects are created through an event related with one reference object. Such a relationship is represented by the relationship hasReference. Different objects relating to the same reference object can be considered as distinct versions of it. Definition. The relationship hasReference maps every object onto exactly one reference object. Definition. An Event is a happening that has an effect on a reference object, involving at least one object relating to that reference object. The provenance of an object is the list of all the events that happened to the reference object related to that object. Therefore, the provenance of an object is not established by the chain of all the intermediate versions of the object since its birth, but by tracing all the objects relating to the same reference object. This approach makes it possible to partially reconstruct provenance even when some versions of a reference object are missing. This feature is useful when the life cycle of the objects is not confined to a controlled environment. Sometimes we are only interested in a particular type of event that happened to an object. For example, we may be interested in the “transfer of custody” events of an artwork or only
in its “restoration” events. Therefore, we introduce the concept of type into the model to enable this kind of event filtering. Definition. The entity Type describes the type of an event. Definition. The relationship hasType maps an event onto zero or one type. An event may be intentionally triggered by an agent or may be the result of the coincident actions of multiple players. Example events are “the fall of the Berlin Wall” and “update of census statistics”. An event is related to objects through the two relationships hasInputObject and happenedTo. Definition. The relationship hasInputObject maps an event onto zero or more objects. It identifies all the objects necessary for the happening of an event. Definition. The relationship happenedTo maps an event onto exactly one object. This property identifies the object subjected to an event. The object subjected to an event may represent an initial version or a new version of a reference object. In such cases, the event would be, respectively, a creation event or a modification event, which takes a previous version of the reference object as input. An event has a number of attributes. How an event happened is described by a description. Example descriptions are “change all letters to lowercase” and “compute by exponential smoothing the updated value of the temperature reported by a sensor”. Definition. Description is an entity that describes how an event happened. Definition. The relationship how maps an event onto zero or one description. An event happens in a particular place at a particular time. Understanding where an event happened can be useful in establishing the legal aspects of an event. For example, the use of a particular cipher may be legal in one country while illegal in another one. Understanding when an event happened enables us to sort events chronologically. Whether an event is instantaneous or has a duration depends on the time with which it is associated. Definition. Place represents any referenceable physical location. Definition. The relationship where establishes where an event happened. It maps an event onto zero or one place. Definition. Time comprises all temporal notions. We assume that instances of this entity are comparable. Definition. The relationship when establishes when an event happened. It maps an event onto zero or one time. An event can be influenced by a number of factors. When an event is artificial, it may be controlled by agents. In this case, the agents may have rationales for why they triggered the event. For example, “the Apollo 11 mission is controlled by NASA”. The outcome of an event may be affected by some parameters.
Definition. The entity Agent comprises people, groups or organizations capable of controlling an event. An agent controls an event when he/she can initiate and terminate the event at will. Definition. The relationship controlledBy maps an event onto zero or more agents. This relationship describes which agents control an event. Definition. The entity Rationale represents the motivation for why an agent has triggered an event. Definition. The relationship why describes the rationale behind an agent’s decision to trigger an event and is only defined for events controlled by some agents. It maps an event onto zero or one rationale. Definition. The entity Parameter represents a value, not an entity. Definition. The relationship hasInputParameter maps an event onto zero or more parameters. It identifies parameters that affect the outcome of an Event. We have thus seen all the entities and relationships of the model. We assume that the entities are expressed in models defined elsewhere in a digital library and that instances of these entities can be uniquely referenced.
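To make the entities and relationships concrete, the following Python sketch (our own rendering; the attribute names mirror the relationships defined above, but the classes themselves are not prescribed by the model, and Place, Time, Rationale and Parameter are simplified to plain values) encodes reference objects, objects and events, together with the provenance of an object as the chronologically sorted list of events.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class ReferenceObject:               # the conceptual entity of interest
    identifier: str

@dataclass(frozen=True)
class Object:                        # a uniquely identifiable instance of a reference object
    identifier: str
    has_reference: ReferenceObject   # hasReference: exactly one reference object

@dataclass
class Event:
    happened_to: Object                                     # happenedTo: exactly one object
    has_input: List[Object] = field(default_factory=list)   # hasInputObject: zero or more
    type: Optional[str] = None                               # hasType
    description: Optional[str] = None                        # how
    place: Optional[str] = None                              # where
    time: Optional[float] = None                             # when (assumed comparable)
    controlled_by: List[str] = field(default_factory=list)   # controlledBy: agents
    rationale: Optional[str] = None                           # why
    parameters: List[str] = field(default_factory=list)       # hasInputParameter

def provenance(obj: Object, all_events: List[Event]) -> List[Event]:
    # The provenance of an object: all events that happened to objects sharing its
    # reference object, sorted chronologically (events without a time are ignored here).
    trail = [e for e in all_events
             if e.happened_to.has_reference == obj.has_reference and e.time is not None]
    return sorted(trail, key=lambda e: e.time)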
3 Case Study: AquaMaps In this section, we illustrate through an example how, by using the proposed model, provenance can be queried at an arbitrary granularity level even when events are only captured at their native granularity. The example is taken from an application, called AquaMaps, implemented in the framework of the D4Science project [3]. D4Science (DIstributed colLaboratories Infrastructure on Grid ENabled Technology 4 Science - Jan 2008-Dec 2009) is a project co-funded by the European Commission's Seventh Framework Programme for Research and Technological Development. Its major outcome is a production e-Infrastructure enabling on-demand resource sharing across organization boundaries. Through its capabilities this e-Infrastructure accelerates multidisciplinary research by overcoming barriers related to heterogeneity (especially related to access to shared content), sustainability and scalability. In the context of D4Science, content can be of a very heterogeneous nature, ranging from textual and multimedia documents to experimental and sensor data, to images and other products generated by elaborating existing data, to compound objects made of heterogeneous parts. The D4Science e-infrastructure also supports the notion of Virtual Research Environments (VREs), i.e. integrated environments providing seamless access to the needed resources as well as facilities for communication, collaboration and any kind of interaction among scientists and researchers. VREs are built by dynamically aggregating the needed constituents, i.e. data collections, services and computing resources, after hiring them on demand through the e-Infrastructure [3]. A D4Science VRE offers a virtual view of an information space populated by objects that
can be composed of different parts, each of which can be derived through specialized elaborations from different heterogeneous sources. While the flexibility for sharing and re-use supported by D4Science provides great potential to VRE users, it also largely increases the importance of all the requirements that motivate the need for provenance information, i.e. authenticity and data quality, reproducibility, policy management, etc. Moreover, it also makes the association of provenance information with the information objects and their parts particularly challenging. One of the VREs that are currently supported by the D4Science Infrastructure is AquaMaps [2]. This VRE inherits its name from a homonymous service which implements an approach to generate model-based, large-scale predictions of the currently known natural occurrence of marine species. The AquaMaps service allows the biodiversity community to establish/predict species geographic distribution based on so-called “species ecological envelopes”. In order to enhance this service, a dedicated VRE has been deployed in D4Science to provide biodiversity scientists with an experimentation environment, providing seamless access to a potentially large array of data sources and facilities. Information objects in AquaMaps can be elementary objects or compound objects, e.g., a dataset expressed in the form of a relational table, whose records are elementary objects related through part-of relationships to the compound object of the table. Since the modification of one object may lead to the implicit modification of another object, e.g. the one that contains the former, a provenance model for AquaMaps must be able to capture these implicit modifications. One of the main functionalities of AquaMaps is to generate fish occurrence prediction maps for marine species. Fish occurrence data are collected from a number of sources such as OBIS [5] and GBIF [4]; these sources may in turn be aggregators that harvest data from other sources. The collected data are then fed into Fishbase [1] and may be curated. Prediction maps are generated from data in Fishbase based on prediction formulae provided by experts. Each prediction map is composed of half-degree cells. The prediction map of a half-degree cell indicates the probability of occurrence of a certain species in that location. One provenance query is to find out why we get a certain value for a particular half-degree cell. The main difficulty of such a query arises from the so-called “granularity problem”. Previous research efforts in provenance have recognized this problem and attempted [11] to enforce a single granularity level. However, the end result was either a too coarse-grained model that does not capture enough information or a too fine-grained model that is too laborious to maintain. Most importantly, in order to match the granularity of processes to that of objects, we would need to introduce counterintuitive fictitious processes. Instead, in our model, modifications to tables and tuples can be modeled as events at their native granularity, i.e., if a process is applied to a table, then it is modeled as an event that takes a table as input. The following example queries are written in set notation. The computation process is shown in Listing 1. In this example, we are interested in computing all the events that happened to the information object Salmon. Line 2 computes all the information objects that contain the object Salmon. Line 3 computes all the reference objects of these information objects.
Line 4 computes the time at which the information object Salmon came into existence.
Listing 1. Example showing how to retrieve all the events that modified the information object Salmon. We assume that the data model and the provenance model are interoperable by the function ‘find objects that contain(r)’ and that time is comparable.
1. r = record of Salmon
2. containing objects = find objects that contain(r)
3. reference objects = {o.hasReference : o ∈ {r} ∪ containing objects}
4. t = e.when, where e ∈ all events ∧ e.happenedTo = r
5. events = {e : e ∈ all events ∧ e.happenedTo.hasReference ∈ reference objects ∧ e.when ≤ t}
6. provenance(r) = sort(events)
Under the assumption that time is comparable, line 5 computes all the events that happened to the reference objects before that time. Line 6 sorts the events in chronological order.
The second query is to find out the contributors along the data collection chain. The difficulty lies in understanding when to attribute credit, because not all events that happened to an object are significant, and the criteria for establishing which events are significant should be domain specific. In AquaMaps, all the data source providers and aggregators should be credited. We assume that all the data objects are brought into the system by an event of type ‘create’ and that this event is controlledBy its data provider. Listing 2 shows how the contributors to the information object Salmon can be computed. The first 5 lines are identical to those in Listing 1. Line 6 defines a set ‘contributing events’ of event types whose values are assumed to come from a controlled vocabulary. An agent of an event whose type is in ‘contributing events’ is considered to be a contributor. Therefore, by extending the model with a controlled vocabulary and defining a ‘contributing events’ set, we can appropriately credit the contributors. The same approach can be applied to find out the copyright holders of an information object in a DL.

Listing 2. Example showing the retrieval of all the contributors to the information object ‘Salmon’
1. r = record of Salmon
2. containing objects = find objects that contain(r)
3. reference objects = {o.hasReference : o ∈ {r} ∪ containing objects}
4. t = e.when, where e ∈ all events ∧ e.happenedTo = r
5. events = {e : e ∈ all events ∧ e.happenedTo.hasReference ∈ reference objects ∧ e.when ≤ t}
6. contributing events = {‘creation’, ‘harvest’}
7. creditors(r) = {e.controlledBy : e ∈ events ∧ e.type ∈ contributing events}
The third query is to find an explanation for the presence of a certain information object. The difficulty of this query is similar to that of the first one, i.e., being capable of returning the result independently of the data processing granularity. An example is shown in Listing 3. The function contains(object1, object2) used in the listing returns true if object2 is part of object1 or if object2 equals object1. Line 4 computes the events that created the information object Salmon, which may have been created in a creation event at the granularity level of the tuple or in a modification event at the granularity level of the table. In both cases, the creation event can be correctly retrieved.
Listing 3. Example showing how we can explain the presence of the information object ‘Salmon’
1. r = record of Salmon
2. containing objects = find objects that contain(r)
3. reference objects = {o.hasReference : o ∈ {r} ∪ containing objects}
4. events = {e : e ∈ all events ∧ e.happenedTo.hasReference ∈ reference objects ∧ !contains(e.hasInput, r) ∧ contains(e.happenedTo, r)}
5. rationale = {e.why : e ∈ events}
We have seen that under the reasonable assumption that the part-of relationship is captured by the data model, we can mitigate the long-standing granularity problem with a relatively simple provenance model.
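To make the model concrete, the following sketch shows one possible in-memory rendering of the entities used in the listings and of the query of Listing 1. It is only an illustration: the class and attribute names mirror the notation of the listings (happenedTo, hasReference, controlledBy), and the helper find_objects_that_contain is assumed to be supplied by the data model; none of this is an existing D4Science or AquaMaps API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class InformationObject:
    oid: str
    has_reference: str                      # stable reference shared by all versions of the object
    parts: List["InformationObject"] = field(default_factory=list)  # part-of relationship

@dataclass
class Event:
    etype: str                              # e.g. 'creation', 'harvest', 'modification'
    happened_to: InformationObject          # object at its native granularity (tuple or table)
    when: datetime
    controlled_by: str = ""                 # agent firing the event
    why: str = ""                           # rationale attached to the event

def find_objects_that_contain(r, objects):
    # assumed to be provided by the data model: compound objects having r among their parts
    return [o for o in objects if r in o.parts]

def provenance(r, all_events, all_objects):
    """Mirror of Listing 1: the events that affected r or any object containing it."""
    containing = find_objects_that_contain(r, all_objects)
    references = {o.has_reference for o in [r] + containing}
    # time at which r came into existence (earliest event that happened to r)
    t = min(e.when for e in all_events if e.happened_to is r)
    events = [e for e in all_events
              if e.happened_to.has_reference in references and e.when <= t]
    return sorted(events, key=lambda e: e.when)   # chronological order
```

The creditors query of Listing 2 follows the same pattern, with an additional filter of the event type against a controlled vocabulary such as {'creation', 'harvest'}.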
4 Related Work and Contributions
Provenance has been studied extensively by the e-Science community [21]. The objects of concern are data generated by scientific workflows, and object provenance is described by a static description of the workflow as “the set of actions executed over the object”. Workflow-based provenance models assume that each object has its own provenance trail, which can be represented by the versions of the object generated at different steps of the workflow or by the processes of the workflow that contributed to the creation of the object. Some provenance models focus on the trail-of-versions approach [13]. Moreau et al. [17] focused instead on the trail-of-processes approach. The authors surveyed the provenance requirements of a number of use cases and, through the organization of a series of provenance challenges [6], defined a directed acyclic graph model suitable for describing causal relationships between objects of interest. Complementary to the workflow-based approach taken by the e-Science community lies the event-centric approach. In the event-centric approach, the provenance of an object is described by the chains of events that affected its status. This approach is typically taken by the cultural heritage community to describe historical events [15], typically intended as meetings between physical and/or abstract historical entities in some space-time context. Such models focus on the interweaving of the entities rather than on the reason and outcome of the meeting. Ram and Liu [19] proposed a generic, full-blown event-based model to describe the semantics of provenance. The event-based provenance model proposed in this paper differs from this one in two main aspects. First of all, the model reflects the importance, typical of digital libraries, of keeping track of the agents firing an event. Understanding how an event happened and what parameters directly influenced an event are important in this context. This is not the case for historical events, which are identified a posteriori for their symbolic significance and are generally caused by the coincidence of actions of independent users. Secondly, Ram and Liu’s model views provenance as information that enriches information objects, while our model views provenance as the glue that links information objects. In this sense, differently from process provenance approaches [14], in our model events are orthogonal to objects. Events regarding an object do not belong to the provenance briefcase of the object but are shared among all objects involved in the
event. Decoupling objects from events in the model has two immediate implementation benefits: (i) the approach is well suited to distributed/parallel computing scenarios, where participating services can collaboratively operate over shared objects; (ii) since the size of provenance data can easily exceed that of the objects [10], keeping the two apart reduces redundancy and contributes to system scalability. In an optimized implementation, our work lies between lazy provenance and eager provenance [12], in that events are captured prior to the provenance request, while the list of all the events that happened to an information object is generated upon provenance request. By leaving the decision on when to generate provenance trails exogenous to the model, it is possible to make time-space trade-offs on a per-application basis; for example, we can pre-compute and cache provenance trails for a certain set of objects.
5 Conclusions and Future Work
Digital Libraries have rapidly evolved over the last decades into advanced systems for the integration and management of cross-domain digital objects. As such, keeping track of the provenance information of the objects of a digital library, possibly operated over, imported and controlled by different agents through processes of different trustworthiness, becomes a crucial issue for the users of such systems. On the other hand, existing solutions typically address problem-specific provenance requirements and do not follow a general-purpose modeling approach. In this paper we proposed an event-based provenance model for digital libraries, inspired by well-known provenance models in the fields of e-Science and cultural heritage. Driven by the experience acquired in such fields, we considered a number of real-case digital library scenarios and defined a provenance model which captures what we believe are the essential aspects for describing the provenance of digital library information objects. In the future we plan to implement a provenance service, devised to be easily integrated into any digital library system to offer support for provenance management based on our event model. The aim is to be able to use the service in the existing production systems of D4Science, to endow the AquaMaps VRE with provenance support.
Acknowledgments. The work reported has been partially supported by the DL.org Coordination and Support Action, within FP7 of the European Commission, ICT-2007.4.3 (Contract No. 231551).
References
1. A Global Information System on Fishes, http://www.fishbase.org/
2. AquaMaps, http://www.aquamaps.org/
3. D4Science, http://www.d4science.eu/
4. Global Biodiversity Information Facility, http://www.gbif.org/
5. Ocean Biogeographic Information System, http://www.iobis.org/
6. Provenance Challenges, http://twiki.ipaw.info/bin/view/Challenge/
7. Braun, U., Garfinkel, S.L., Holland, D.A., Muniswamy-Reddy, K.-K., Seltzer, M.I.: Issues in automatic provenance collection. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, pp. 171–183. Springer, Heidelberg (2006)
8. Buneman, P., Khanna, S., Tan, W.-C.: Computing provenance and annotations for views (2002)
9. Candela, L., Castelli, D., Ferro, N., Ioannidis, Y., Koutrika, G., Meghini, C., Pagano, P., Ross, S., Soergel, D., Agosti, M., Dobreva, M., Katifori, V., Schuldt, H.: The DELOS Digital Library Reference Model - Foundations for Digital Libraries. DELOS: a Network of Excellence on Digital Libraries (February 2008) ISSN 1818-8044, ISBN 2-912335-37-X
10. Chapman, A., Jagadish, H.V.: Issues in building practical provenance systems
11. Cheng, X., Pizarro, R., Tong, Y., Zoltick, B., Luo, Q., Weinberger, D.R., Mattay, V.S.: Bioswarm-pipeline: a light-weight, extensible batch processing system for efficient biomedical data processing. Frontiers in Neuroinformatics 3 (2009)
12. Tan, W.-C.: Research problems in data provenance. IEEE Data Engineering Bulletin 27, 45–52 (2004)
13. Cui, Y., Widom, J., Wiener, J.L.: Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems 25(2), 179–227 (2000)
14. Davidson, S.B., Freire, J.: Provenance and scientific workflows: challenges and opportunities. In: Proceedings of ACM SIGMOD, pp. 1345–1350 (2008)
15. Doerr, M., Ore, C.-E., Stead, S.: The CIDOC Conceptual Reference Model - A New Standard for Knowledge Sharing. In: ER (Tutorials, Posters, Panels & Industrial Contributions), pp. 51–56 (2007)
16. Foster, I., Vöckler, J., Wilde, M., Zhao, Y.: Chimera: A virtual data system for representing, querying, and automating data derivation. In: Proceedings of the 14th Conference on Scientific and Statistical Database Management, pp. 37–46 (2002)
17. Miles, S., Groth, P., Branco, M., Moreau, L.: The requirements of recording and using provenance in e-science experiments. Technical report, Journal of Grid Computing (2005)
18. Moreau, L., Freire, J., Futrelle, J., Mcgrath, R.E., Myers, J., Paulson, P.: The open provenance model: An overview. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 323–326. Springer, Heidelberg (2008)
19. Ram, S., Liu, J.: Understanding the semantics of data provenance to support active conceptual modeling. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 1–12. Springer, Heidelberg (2006)
20. Sahoo, S., Barga, R., Goldstein, J., Sheth, A.: Provenance algebra and materialized view-based provenance management. Technical report, Microsoft Research (2008)
21. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34, 31–36 (2005)
A Digital Library Effort to Support the Building of Grammatical Resources for Italian Dialects
Maristella Agosti1, Paola Benincà2, Giorgio Maria Di Nunzio1, Riccardo Miotto1, and Diego Pescarini2
1 Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131 Padua, Italy
{maristella.agosti,giorgiomaria.dinunzio,riccardo.miotto}@unipd.it
2 Department of Linguistics and Performing Arts, University of Padua, Via Beato Pellegrino 1, 35137 Padua, Italy
{paola.beninca,diego.pescarini}@unipd.it
Abstract. In this paper we present the results of a project, named ASIt, which provides linguists with a crucial test bed for formal hypotheses concerning human language. In particular, ASIt aims to capture cross-linguistic variants of grammatical structures within a sample of about 200 Italian Dialects. Since dialects are rarely recognized as official languages, first of all linguists need a dedicated digital library system providing the tools for the unambiguous identification of each dialect on the basis of geographical, administrative and geo-linguistic parameters. Secondly, the information access component of the digital library system needs to be designed to allow users to search the occurrences of a specific grammatical structure (e.g. a relative clause or a particular word order) rather than a specific word. Thirdly, since ASIt has been specifically geared to the needs of linguists, user-friendly graphical interfaces need to be created to give easy access to the language resource and to make its building easier and distributed. The paper reports on the ways these three main aims have been achieved.
1 Introduction
Since the 1990s, the explosion of corpus-based research and the use of automatic learning algorithms have heightened the pace of growth of language resources and, as a consequence, many corpora have been built automatically by means of machine learning techniques. However, this may have led to a reduction in the quality of the corpora themselves [1]. In order to make a linguistic resource usable for machines and for humans, a number of issues need to be addressed: crawling, downloading, cleaning, normalizing, and annotating the data are only some of the steps that need to be carried out to produce valuable content [2]. Data quality has a cost, and human intervention is required to achieve the highest quality possible for a resource of usable scientific data. From a computer science
point of view, curated databases [3] are a possible solution for designing, controlling and maintaining collections that are consistent, integral and of high quality. A curated database is a database whose content has been collected with a great deal of human effort and which has certain characteristics: data have been edited from existing sources and their provenance is documented; raw data are annotated to enrich their interpretation and description; the database has to be updated regularly by curators, who can be technicians, computer scientists, or linguists, depending on the type of maintenance task that has to be conducted. In this setting of multidisciplinary collaboration it is important to use all competences synergistically, with the aim of building a new research approach for the production of new knowledge which would otherwise be impossible to create. In the present contribution we show the results of a multidisciplinary collaboration which synergistically makes use of the competences of two different teams, one of linguists and one of computer scientists. The two teams have collaborated in envisioning, designing and developing a digital library system able to manage a manually curated resource of dialectal data named ASIt (Atlante Sintattico d’Italia, Syntactic Atlas of Italy; http://asit.maldura.unipd.it/), which provides linguists with a crucial test bed for formal hypotheses concerning human language. From the computational point of view, the project aims to implement a digital library system that enables the management of a resource of curated dialect data and provides access to grammatical data, also through an advanced user interface specifically designed to update and annotate the primary data. The paper is organised as follows: Section 2 outlines the peculiarities of the ASIt project and the main lines of the methodology adopted; Section 3 presents the requirements on the tagging system, which are strictly linked to both the specificity of the linguistic data and the formal theory that the system has to deal with; Section 4 presents the main characteristics of the digital library system that manages and gives access to the ASIt linguistic resource; Section 5 reports on some conclusions and future work.
2 Methodology
The manually curated resource of dialectal data stored and managed by the digital library system was collected by means of questionnaires consisting of sets of Italian sentences, each sentence having many parallel dialectal translations. However, the ASIt data resource differs from other multilingual parallel corpora in the following three aspects:
1. It contains data on about 200 Italian Dialects. Since dialects are rarely recognized as official languages, linguists need a dedicated digital library system providing the unambiguous identification of each dialect on the basis of geographical, administrative and geo-linguistic parameters.
2. It aims to capture cross-linguistic variants of grammatical structures. In other words, the information access component needs to allow users to search
the occurrences of a specific grammatical structure, e.g. a relative clause or a particular word order, rather than a specific word or meaning, even though a specific word such as “who” or “what” is a possible target, it appears in the data in many different dialectal forms. 3. It has been specifically geared to the needs of linguists: user-friendly graphical interfaces have therefore been created to give easy access to language resources and to make the building of the language resources easy and distributed. Given the originality of the ASIt enterprise and the granularity of the collected data, the two teams decided to work synergistically to build a new piece of knowledge which was unattainable otherwise if the two teams were working separately or only in a support/assistance way. Moreover, a synergic approach has already been adopted in other scientific challenges, in similar but nonetheless different areas of research, and it has produced valid scientific results [4]. A number of considerations were made in terms of the approach to be used in the design of the procedures for the management, storage and maintenance of the data that were to be produced in the course of the development of the project. The main aim of the project is the preparation of a co-ordinated collection of Italian dialects; this co-ordinated collection can be conceived only because the present research team is building on previous and long-lasting research that has produced intermediate and basic results, some of which have been documented in [5,6]2 . This means that the data the ASIt project has produced is based on long-standing experience of data collection, documentation, and preservation. Therefore, we decided to design and maintain the ASIt collection of data and to bring them in line with the rules of definition and maintenance of “data curation”, as defined in the e-Science Data Curation Report [7]: “The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purposes, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose”. As a consequence, a major target of the ASIt project has been the design and development of a “curated database” of Italian dialects. The actions we have undertaken to design this new curated database have benefited from previous experience gained in the curation of data for experimental evaluation campaigns in the field of Information Retrieval [8]. The research conducted in parallel in that field has improved the understanding of systems operating in the area of languages to produce curated databases that allow the re-use of data to generate new knowledge and the maintenance of unique observational data which is impossible to re-create [9], as is the case for data collection campaigns through questionnaires of the ASIt project. 2
Since the early part of the project focused mainly on Northern Italian dialects, ASIt was formerly called ASIS (Atlante Sintattico dell’Italia Settentrionale, Syntactic Atlas of Northern Italy).
3 The ASIt Enterprise
The ASIt enterprise builds on a long-standing tradition of collecting and analysing linguistic corpora, which has originated different efforts and projects over the years. The part of ASIt referred to here has focused on the design of a new digital library system that was required to manage and give access to a curated data resource in ways that are innovative for the final users, together with the possibility of using the same digital library system in different scientific contexts. The design has been conducted in a way that makes it possible to use the digital library system and the data resource also as part of other scientific efforts, making the comparison among phenomena of different dialects easier, speeding up the subsequent process of formal analysis, and making both the system and the resource scalable and expandable for other purposes. One of these further efforts has recently been undertaken using Cimbrian as a test case for synchronic and diachronic language variation, and it needs to be dealt with within the ASIt enterprise (http://ims.dei.unipd.it/websites/cimbrian/project). In the context of the management of the Cimbrian variation, the granularity of the data resource is going to change from the sentence level to the word level, but the digital library system is only going to be expanded, and not redesigned, thanks to its original design approach supporting modularity and scalability.
3.1 Corpus
Dialectal data stored in the curated resource were gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. These data and information were collected by means of questionnaires formed by sets of Italian sentences: dialectal speakers were asked to translate them into their dialects and write their translations in the questionnaire; therefore, each questionnaire is associated with many parallel dialectal translations. At present, there are eight different questionnaires written in Italian and almost 450 corresponding questionnaires written in more than 200 different dialects, for a total of more than 45,000 sentences and more than 10,000 tags stored in the data resource managed by the digital library system.
3.2 Remarks on the Annotation System
The design of a tagset for corpus annotation is normally carried out in compliance with international standards — e.g. CES (Corpus Encoding Standard, http://www.cs.vassar.edu/CES/) — which in turn are based on the specifications of SGML (Standard Generalized Markup Language, http://www.w3.org/MarkUp/SGML/) and on international guidelines like EAGLE (Expert Advisory
Group on Language Engineering Standard, http://www.ilc.cnr.it/EAGLES96/home.html) and TEI (Text Encoding Initiative, http://www.tei-c.org/index.xml) guidelines. According to these standards, each tagset is formed by several sub-systems responsible for the identification and description of different linguistic “units”: text, section, paragraph, clause, word, etc. Given the objectives of the ASIt enterprise, we have focused on the tagging of sentence-level phenomena, which according to the EAGLE guidelines should in turn depend on two kinds of annotation:
– Morphosyntactic annotation: part of speech (POS) tagging;
– Syntactic annotation: annotation of the structure of sentences by means of a phrase-structure parse or dependency parse.
A tagset based on this distinction is normally designed to be used in combination with software applications processing linguistic data on the basis of probabilistic algorithms, which assign every lexical item a POS tag and, subsequently, derive the structure of the clause from the bottom up. First of all, it is worth noting that the ASIt enterprise has a different objective, being a scientific project aiming to account for minimally different variants within a sample of closely related languages. As a consequence, while other tagsets are designed to carry out a gross linguistic analysis of a vast corpus, the ASIt tagset aims to capture fine-grained grammatical differences by comparing various dialectal translations of the same sentence. Moreover, in order to pin down these subtle asymmetries, the linguistic analysis must be carried out manually. Given its peculiarities, the ASIt team does not need a thorough POS disambiguation, since the ’trivial’ identification of basic parts of speech (e.g. Nouns vs Verbs) is not enough to capture cross-linguistic differences between closely related languages. Secondly, the linguistic variants displayed by Italian Dialects cannot be reduced to lexical distinctions, i.e. syntactic differences are in general unpredictable on the basis of the properties of single lexical items. We therefore need a specific tagset designed to capture sentence-level phenomena without taking into consideration POS tags. The requirements at the basis of the ASIt tagset are finally recapitulated below:
– objective: scientific, theoretical;
– focus on grammar;
– analysis: top down;
– minimal unit of analysis: sentence;
– completeness: we lack the analysis of POS;
– accuracy: complete, but the analysis has to be carried out manually.
3.3 Granularity of the Tagged Phenomena
To explain why the needs of ASIt are so special we have to take into consideration two different aspects: 1. the nature of Italian dialects, and 2. the kind of linguistic theory the ASIt data resource aims to be related to.
The Italian dialectal area presents a kind of variation that involves parametric choices affecting many general aspects of syntax, morphology, and phonology. If we concentrate on syntax, we find, for example, a phenomenon that cuts Italy in two, namely subject clitic pronouns: the dialects of Northern Italy have an obligatory subject clitic in at least one person of the verb (some have a subject clitic for all persons of the inflected verb), while Central and Southern Dialects never display subject clitics. Since the nature of clitics is one major topic of theoretical reflection, Italy offers an impressive range of possible variations of this phenomenon. The kind of information we want to gather from the data resource involves for example not only the presence of a certain element, but also the absence of an element that can be omitted supposedly only in some constructions and in conjunction with specific characteristics of the language. For example, the complementiser che (“that”) is optionally omitted in Italian varieties in subordinates with subjunctive, conditional, or future tense (mood); so we must have a tag mentioning the “absence of an element”. Furthermore, in Southern Dialects there are varieties with two (or even three) specialised complementisers, sensitive to the modality of the subordinate clause and to elements moved in the periphery of the sentence. In this case too the presence is just as important as the absence of a given complementiser. 3.4
Building the Tagset
On the basis of requirements such as those outlined above, we have selected a list of tags capturing relevant phenomena, namely, grammatical properties that are expected to discriminate between dialects, i.e. between grammars. Examples of phenomena sensitive to linguistic variation are: the presence of subject clitics, the syntactic behaviour of different verbal classes (e.g. transitive, unergative and unaccusative verbs), the distribution of negative words (e.g. negation marker “not” and negative indefinites like “none” or “any”), the alternation of finite and infinite verbs in subordinate clauses. Many features that are captured by traditional POS tagsets are clearly not relevant, because they cannot identify grammatical variants within the sample of languages under investigation. For instance, a tag distinguishing concrete vs abstract nouns is relevant to disambiguate different meanings of the same string (e.g. “spirit of wine” vs “spirit of the times”), but, at the same time, concreteness is not expected to play any role in determining grammatical variants within the Italo-Romance domain. In contrast, a semantic feature of nouns like mass vs count is relevant for many phenomena, as is the case for relational or kinship or inalienable possession. In general, we focussed only on POS features that are relevant to the analysis of sentence-level phenomena. Moreover, we included several POS tags in our tagset that allow us to distinguish classes of lexical items displaying peculiar syntactic behaviours: for instance, different classes of adverbs (like manner, aspectual) occupy different positions in the structure of the clause, kinship nouns
differ from other nouns in refusing definite articles and/or requiring possessive adjectives, meteorological verbs in some dialects can obligatorily require an expletive subject. While these classes are not taken into consideration by standard tagsets, they are relevant to our aims because they are involved in syntactic phenomena which distinguish Italian Dialects from each other. Unlike standard tagsets, we also need a subgroup of tags identifying invisible phenomena, e.g. absence of an overt subject, absence of an expected complementizer, absence of a clitic pronoun within a sequence. Standard tagsets are expected to identify and specify any lexical item of the text under analysis, whereas the ASIt system must be sensitive to unpronounced/absent items. Moreover, given the scientific purposes of the ASIt enterprise, our tagset has to be open to new tags in order to capture linguistic phenomena that, according to our new findings, become crucial in distinguishing between grammars.
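As an illustration of how such an open, sentence-level tagset could be represented and queried, the fragment below attaches tags (including tags for "absent" elements) to translated sentences and retrieves the dialects exhibiting a given phenomenon. The tag names and the toy data are invented for the example and do not reproduce the actual ASIt vocabulary.

```python
# Hypothetical sentence-level tags; "absence" phenomena are ordinary tags as well.
TAGSET = {
    "subject_clitic_present",
    "subject_clitic_absent",
    "complementizer_omitted",
    "double_complementizer",
    "negative_concord",
}

# (dialect, sentence, tags) triples; placeholder data used only for illustration.
SENTENCES = [
    ("dialect A", "sentence 1", {"subject_clitic_present"}),
    ("dialect B", "sentence 2", {"subject_clitic_absent", "negative_concord"}),
    ("dialect A", "sentence 3", {"complementizer_omitted"}),
]

def dialects_with(tag, sentences=SENTENCES):
    """Return the dialects in which at least one sentence exhibits the tagged phenomenon."""
    assert tag in TAGSET, f"unknown tag: {tag}"
    return sorted({dialect for dialect, _, tags in sentences if tag in tags})

print(dialects_with("subject_clitic_present"))   # ['dialect A']
```

Because the tagset is just a controlled set of labels, adding a tag for a newly observed phenomenon does not require any change to the query logic.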
4 The Digital Library System
In this section we report on the design and construction of the digital library system and of its information access component for dealing with curated data resources of Italian dialects. A three-phase approach was adopted: at the beginning, the world of interest was represented at a high level by means of a conceptual representation based on the analysis of requirements; afterwards, the world of interest was progressively refined to obtain the logical model of the data of interest; lastly, the digital library system and the interface to access the data were implemented and verified. In order to efficiently store and manage the amount of data recorded in the questionnaires, the interviews, and the tagged sentences, the component of the digital library system that manages and stores the data is based on the relational database approach, designing and developing a specific relational schema. In the following subsections we briefly sketch the three main parts of the digital library system: the linguistic data resource of Italian Dialects, the component that permits user interaction for updating and creating new linguistic annotations, and the information access component for retrieving and using the data resource.
4.1 A Data Resource of Italian Dialects
The design of the data resource schema required careful attention to the conceptual schema representing the data at an abstract level. However, prior to the conceptual schema, we carried out a thorough analysis of the requirements in order to generalise, identify, and isolate the main entities, which can be grouped into three broad areas:
– The point of inquiry, which is the location where a given dialect is spoken;
– The administrative area (namely, region and province) the location belongs to;
– The geo-linguistic area, i.e. the linguistic group the dialect belongs to.
Fig. 1. The conceptual schema of the data resource of Italian dialects: the three main areas of interest are shown with ovals. Attributes of the entities have been removed for better readability.
The conceptual design has generated a conceptual schema in which these three areas are considered of central interest, so they were depicted in the schema that was produced at the end of the conceptual design procedure. The schema is reported in Figure 1, where the three broad areas of interest are represented by ovals. The “point of inquiry” contains most of the data of the whole data resource. Within this area the following entities were identified and defined:
– DIALECT, the name of the dialect (which normally corresponds to the name of the town or city where the dialect is spoken);
– ACTOR, the user, who can have different roles, namely the person who prepared the questionnaire, the speaker who translated the sentences into a dialect, the editor of the questionnaire, and so on;
– QUESTIONNAIRE, the sets of sentences (either in Italian or in the dialect);
– SENTENCE, the units forming the questionnaires;
– TAG, the linguistic tags specifying the grammatical properties of each sentence.
The entity DIALECT connects all the different parts of the data resource and therefore acts as a cornerstone, while the administrative and geo-linguistic areas complete the definition of each dialect, specifying the geographical area where it is spoken and the linguistic subgroup to which it belongs.
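A minimal sketch of the core of such a relational schema is given below. The production system is based on PostgreSQL; SQLite is used here only to keep the example self-contained and runnable, and the table and column names are illustrative assumptions rather than the actual ASIt schema.

```python
import sqlite3

DDL = """
CREATE TABLE dialect   (id INTEGER PRIMARY KEY, name TEXT, province TEXT, region TEXT,
                        linguistic_group TEXT, latitude REAL, longitude REAL);
CREATE TABLE actor     (id INTEGER PRIMARY KEY, name TEXT, role TEXT);  -- compiler, speaker, editor, ...
CREATE TABLE questionnaire (id INTEGER PRIMARY KEY, number INTEGER,
                        dialect_id INTEGER REFERENCES dialect(id));     -- NULL for the Italian originals
CREATE TABLE sentence  (id INTEGER PRIMARY KEY,
                        questionnaire_id INTEGER REFERENCES questionnaire(id),
                        position INTEGER, text TEXT,
                        translation_of INTEGER REFERENCES sentence(id)); -- link to the Italian source sentence
CREATE TABLE tag       (id INTEGER PRIMARY KEY, label TEXT, class TEXT);
CREATE TABLE sentence_tag (sentence_id INTEGER REFERENCES sentence(id),
                        tag_id INTEGER REFERENCES tag(id),
                        PRIMARY KEY (sentence_id, tag_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Example query: all dialects having at least one sentence annotated with a given tag.
QUERY = """
SELECT DISTINCT d.name
FROM dialect d
JOIN questionnaire q ON q.dialect_id = d.id
JOIN sentence s      ON s.questionnaire_id = q.id
JOIN sentence_tag st ON st.sentence_id = s.id
JOIN tag t           ON t.id = st.tag_id
WHERE t.label = ?;
"""
rows = conn.execute(QUERY, ("subject_clitic_present",)).fetchall()
print(rows)   # [] on this empty sketch database
```

Keeping the dialect table as the hub of the joins mirrors the cornerstone role of the DIALECT entity described above.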
4.2 Managing the Tagset
The data resource schema was designed to allow linguists to easily access and analyse the data by retrieving phenomena and/or grammatical elements of one or more geographic locations, together with the characteristics of a specific phenomenon, the co-occurrence of different linguistic phenomena in the same dialect and the context conditioning the differences. In order to reach these objectives, the data can be retrieved on the basis of a set of 194 tags which specify the grammatical properties of each sentence. The tagging interface has been carefully designed by considering the information gathered during the requirements analysis phase, and in particular:
– The 194 tags exploited by the ASIt system have been grouped into grammatical classes to allow the editor to efficiently manage the whole tagset. Examples of the defined classes include tags regarding subjects, verbs, interrogative/exclamative clauses, clitic pronouns, etc.
– The list of tags associated with each Italian sentence can be automatically associated with the corresponding dialect translation. The editor can then simply add/remove some tags in order to capture the differences between the Italian input and its dialectal translation; a sketch of this step is given below.
Translations can be entered either for an already recorded dialect or by inserting new varieties. In the latter case, some additional details about the new variety of dialect are required, in particular the geographic area. The interface aims at providing users with a very easy and intuitive tool for translating sentences by following a sequence of actions. First of all, the editor is required to choose the Italian questionnaire to translate, and only once this has been done does he/she gain access to the other parts of the interface. At this point, the user has to define the dialect, and after this he/she will be able to choose the Italian sentence and insert its translation along with the respective tags. The interface allows for all the most common data editing operations, in particular saving, inserting, and updating.
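The tag-propagation step mentioned in the second point above can be sketched as follows; the function name and the data shapes are assumptions made for the example, not the actual ASIt implementation.

```python
def propose_tags_for_translation(italian_tags, to_add=frozenset(), to_remove=frozenset()):
    """Start from the tags of the Italian sentence and let the editor adjust them."""
    return (set(italian_tags) | set(to_add)) - set(to_remove)

# The Italian sentence is annotated once; the dialectal translation inherits its tags,
# and the editor only records the differences observed in the dialect.
italian_tags = {"subordinate_clause", "complementizer_present"}
dialect_tags = propose_tags_for_translation(
    italian_tags,
    to_add={"complementizer_omitted"},
    to_remove={"complementizer_present"},
)
print(sorted(dialect_tags))   # ['complementizer_omitted', 'subordinate_clause']
```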
4.3 Accessing the Linguistic Data Resource
The digital library system makes use of the PostgreSQL (http://www.postgresql.org/) relational database management system as the component for the management of permanent data. The digital library system stores all the data of interest, among which are the questionnaires, the sentences, the different translations of the sentences, geographic information about the dialects, and all the grammar details related to the tags.
The data resource can be accessed through an interface which allows for searching the data and filtering the results. The information access component, which supports functions similar to those of a search engine, is open to public use and has been designed to be accessible online. Among the available functions, it allows the user either to visualize the results in the Web page or to download them into a spreadsheet for further analysis of possible correlations. Search operations can be performed in different ways, ranging from simple searches specified by just some tags to more articulated ones obtained by imposing filters on the retrieved data. The filters are mainly related to geographical information, but the number of the questionnaire, as well as a particular sentence, can also be selected. The results are tabulated according to the retrieved Italian sentences, immediately followed by sub-sections representing all the different translations, together with details about dialects and grammar tags. The visualization of the results is also characterized by a show/hide mechanism for the set of translations of an Italian sentence, thus providing the user with a cleaner and more compact representation of the results. The Web-based interface was designed combining HTML and JavaScript for the graphic part, while JavaServer Pages (JSP) technology was used to define the connections with the data resource and all the dynamic operations. JSP was chosen among other scripting languages mainly for its high portability and the possibility of easily connecting to PostgreSQL. Moreover, an approach based on Ajax (shorthand for asynchronous JavaScript and XML) was exploited to minimize the exchange of data between the clients and the main server. A screen-shot of the search interface is shown in Figure 2.
Fig. 2. The interface for searching grammatical phenomena in the database of the Italian dialects
Fig. 3. Distribution of grammatical phenomena in questionnaires using GeoRSS tagging and Google maps APIs
Besides the different search options, the digital library system also allows for the visualization of the geographical distribution of grammatical phenomena. This can be done by exploiting the geographical coordinates of each location, which are kept in the data resource. Given these coordinates, the system automatically creates one of the geotagging formats (GeoRSS, http://www.georss.org/; KML, http://www.opengeospatial.org/standards/kml/; etc.) and exploits the Google Maps APIs (http://maps.google.it/) to visualize it. An example of the usage of this API is shown in Figure 3, which displays the distribution of a subset of the points of inquiry. This option is very important because a user can graphically view how the dialects are distributed throughout the country, and perform further analysis based on these visualizations.
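One possible way to produce such a geotagged view is sketched below: given the coordinates stored for each point of inquiry, a small KML document is generated and can then be handed to the Google Maps API. The document structure follows the public KML 2.2 schema, while the input points are invented placeholders.

```python
import xml.etree.ElementTree as ET

def points_to_kml(points):
    """points: iterable of (name, latitude, longitude) tuples."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for name, lat, lon in points:
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML expects longitude,latitude[,altitude]
        ET.SubElement(point, "coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")

# Placeholder points of inquiry, used only to show the expected output format.
print(points_to_kml([("Padua", 45.4064, 11.8768), ("Bari", 41.1171, 16.8719)]))
```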
5 Conclusions and Future Work
Since Summer 2009 the digital library system has been in common use by the team of linguists, whose useful feedback has enabled the refinement and improvement of user interaction with the system. The user functions that have been provided are consistent with the purposes of the project, aiming to speed up the comparison of syntactic structures across dialects. The results achieved have been considered of interest as a starting platform for addressing a wider set of languages and phenomena: from the end of 2009,
the members of the ASIt team have been taking part in a new project aiming to include in the digital library data from German dialects spoken in Italy and to enrich the system with a thorough POS tagset. A Web site reporting on this project has recently been made available to the public (http://ims.dei.unipd.it/websites/cimbrian/home), from which it will be possible to monitor the evolution of the results reported here.
References
1. Spärck Jones, K.: Computational linguistics: What about the linguistics? Computational Linguistics 33, 437–441 (2007)
2. Kilgarriff, A.: Googleology is bad science. Computational Linguistics 33, 147–151 (2007)
3. Buneman, P.: Curated databases. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, p. 2. Springer, Heidelberg (2009)
4. Agosti, M.: Information Access using the Guide of User Requirements. In: Agosti, M. (ed.) Access through Search Engines and Digital Libraries, pp. 1–12. Springer, Heidelberg (2008)
5. Benincà, P.: I dati dell’ASIS e la sintassi diacronica. In: Banfi, E., et al. (eds.) Atti del convegno internazionale di studi, Trento, Ottobre 21–23, pp. 131–141. Niemeyer, Tübingen (1995)
6. Benincà, P., Poletto, C.: The ASIS enterprise: a view on the construction of a syntactic atlas for the Northern Italian Dialects. In: Bentzen, K., Vangsnes, Ø.A. (eds.) Nordlyd. Monographic issue on Scandinavian Dialects Syntax, vol. 34, pp. 35–52 (2007)
7. Lord, P., Macdonald, A.: e-Science Curation Report. Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision. The JISC Committee for the Support of Research, JCSR (2003)
8. Agosti, M., Di Nunzio, G.M., Ferro, N.: Scientific data of an evaluation campaign: Do we properly deal with them? In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 11–20. Springer, Heidelberg (2007)
9. Agosti, M., Di Nunzio, G.M., Ferro, N.: The importance of scientific data curation for evaluation campaigns. In: Thanos, C., Borri, F., Candela, L. (eds.) Digital Libraries: Research and Development. LNCS, vol. 4877, pp. 157–166. Springer, Heidelberg (2007)
Interactive Visual Representations of Complex Information Structures
Gianpaolo D’Amico, Alberto Del Bimbo, and Marco Meoni
Media Integration Communication Center, University of Florence, Italy
{damico,delbimbo}@dsi.unifi.it, [email protected]
http://www.micc.unifi.it
Abstract. One of the most challenging issues in managing the large and diverse data available on the World Wide Web is the design of interactive systems to organize and represent information, according to standard usability guidelines. In this paper we define a framework to collect and represent information from different web resources like search engines, real-time networks and multimedia distributed databases. A prototype system has been developed, following the Rich Internet Application paradigm, to allow end-users to visualize, browse and analyze documents and their relationships in a graph-based user interface. Different visual paradigms have been implemented and their effectiveness has been measured in usability tests with real users. Keywords: Information Visualization, Graph Drawing, Usability, Multimedia Databases, Social networks, Rich Internet Applications.
1 Introduction
In the last years the amount of information available on the Web has increased not only in size, but also in complexity. The problem of information overload has dramatically become a problem of information evolution. Most of the documents accessible through the Internet consist of multimedia data (audio, video, images), and websites like YouTube and Flickr have become very popular among end-users. In addition, social networks and blogging platforms like Facebook or Technorati give the possibility to add information and leave comments according to the user generated content paradigm. More recently, we have witnessed the diffusion of the Real-Time Web, a new phenomenon based on the real-time delivery of activity streams from users of web services, such as those provided by Twitter and Friendfeed. The complexity of the new structure of information has thus become a big issue in the field of user experience and web usability. Many attempts have recently been made to implement a commonly accepted solution to organize all these data (e.g. Google Wave), but there is not yet a standard framework for the presentation of complex information to the user.
In this paper we propose a solution for the visualization of information which uses advanced presentation techniques derived from the field of Information Visualization [1], with the goal of making large and complex content more accessible and usable to end-users. Our system consists of a graphical user interface for querying at the same time different web resources: a web clustering engine, two multimedia databases and a social network. Results are then organized and visualized according to a semantic strategy, using specific interactive features designed to explore and browse the structure of the data. Two different visual presentation paradigms have been implemented, performing extensive experimental analysis of the layout interfaces and measuring their effectiveness with real users. The remainder of this paper is organized as follows. In section 2 we propose the basic concepts regarding the framework of this work, and describe the architecture of the developed prototype application and the graphical user interface paradigms that have been designed. In section 3 we describe the experiments performed to evaluate the effectiveness of the proposed solutions, and finally conclusions and a discussion of future work are presented in section 4.
1.1 Previous Work
Advanced visualizations of information for large data repositories have been proposed in [2], where the authors implemented a tool for exploring the open shared knowledge databases Freebase [3] and Wikipedia. These systems are designed only to improve the visual representation of semantic web structures. Differently, the approach proposed in [4] develops the use of web clustering engines [5] as data sources for the visualization. These systems forward the user’s queries to classical web search engines, take back the results and organize them in categorized groups called clusters, in order to provide a semantic representation of the information to the user. However, a common trait of all these approaches is that they do not use information related to the user queries extracted from other repositories, like social networks or multimedia sharing services. In addition, all the previous visualization tools are built on top of a dedicated resource and cannot be extended to other information repositories.
2 The Visual Interactive Framework
Our framework consists of a web-based application which allows users to perform a query, extracts and merges results from diverse knowledge repositories, and lets users explore the information by means of an interactive graph-based user interface. The initial query is performed on a main data resource, which can be a web search engine or a simple database. Results are then used to query other repositories, which return related information, such as social networks, multimedia sharing platforms or real-time web services. All these results are then merged
and organized according to an algorithm which creates a structured description of the information that accounts for the semantic structures underlying the data. An interactive graphical user interface presents all the information according to information visualization techniques. The framework is not designed to handle only a specific data source, but to provide a generic architecture, in order to minimize the adjustments needed for different complex structures of information. The proposed system is implemented according to the Rich Internet Application (RIA) paradigm: the user interface runs in a Flash virtual machine inside a web browser. RIAs offer many advantages to the user experience, because users benefit from the best of the web and desktop application paradigms. In fact, following this approach, both high levels of interaction and collaboration (obtainable with a web application) and robustness in multimedia environments (typical of a desktop application) are guaranteed. Other advantages of this solution regard the deployment of the application, since installation is not required, because the application is updated only on the server; moreover it can run anywhere, regardless of what operating system is used, provided that a browser with the Flash plugin is available. The user interface is written in the ActionScript 3.0 programming language, using Adobe Flex. Figure 1 shows the system architecture, which can be divided into four main modules: main data source access, related content extractor, common resource description and merging, and user interface.
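Read as a pipeline, the four modules can be sketched as follows. The stub functions and the returned data shapes are assumptions made only to show the intended data flow; the real client is written in ActionScript 3.0/Flex, and Python is used here just to keep the sketch compact.

```python
def query_main_resource(query):
    # stub for the clustering search engine: returns labelled clusters of results
    return [{"label": "cluster A", "results": ["doc 1", "doc 2"]}]

def fetch_related_content(label):
    # stub for the video/image/social services queried with the cluster label
    return {"videos": [], "images": [], "social": []}

def merge(clusters):
    # attach related content to each cluster and build the node/edge description
    nodes = []
    for cluster in clusters:
        node = dict(cluster)
        node["related"] = fetch_related_content(cluster["label"])
        nodes.append(node)
    return {"nodes": nodes, "edges": []}

def handle_query(query):
    graph = merge(query_main_resource(query))
    return graph          # serialized to XML and sent to the RIA user interface

print(handle_query("linux"))
```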
2.1 Main Resource
The system submits the user’s queries to the main source of information, which can be a search engine or a database. In our implementation we have used a clustering search engine, which groups results according to a semantic proximity algorithm. In particular, we have used the Carrot2 [6] search engine with the Lingo clustering engine. The algorithm implemented by Lingo is a web search clustering algorithm that aims at discovering the thematic threads in search results, creating result groups with a label and a description that are meaningful to a human. The algorithm [7] extracts frequent phrases from the input documents, assuming that they are the most informative source of human-readable topic descriptions. The original term-document matrix is reduced using Singular Value Decomposition (SVD); then the engine tries to discover any existing latent structure of diverse topics in the search results. Finally, it matches group descriptions with the extracted topics and assigns relevant documents to them. This strategy provides an overview of the different topics related to the query, and helps the user find the desired information.
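The core idea of Lingo, namely reducing the term-document matrix with an SVD and then matching candidate phrase labels against the discovered latent topics, can be sketched as follows. The tiny matrices and label vectors are invented for the example; this is not the Carrot2 implementation.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = search results), e.g. tf weights.
A = np.array([
    [2., 0., 1., 0.],
    [1., 0., 2., 0.],
    [0., 3., 0., 1.],
    [0., 1., 0., 2.],
])

# Candidate label vectors (frequent phrases) expressed over the same term space.
labels = {"label 1": np.array([1., 1., 0., 0.]),
          "label 2": np.array([0., 0., 1., 1.])}

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent topics kept
topics = U[:, :k]                       # each column is an abstract topic direction

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Match each topic with the best-fitting human-readable label...
topic_labels = [max(labels, key=lambda l: abs(cosine(labels[l], topics[:, i])))
                for i in range(k)]

# ...and assign every document to the topic onto which it projects most strongly.
doc_topic = np.abs(topics.T @ A).argmax(axis=0)
clusters = {topic_labels[t]: [j for j in range(A.shape[1]) if doc_topic[j] == t]
            for t in range(k)}
print(clusters)    # e.g. {'label 2': [1, 3], 'label 1': [0, 2]}
```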
2.2 Related Content Resource
Information extracted from the main resource is processed and used to query some of the most common social networks and image and video sharing platforms.
Fig. 1. System architecture: (1) Main data source access, (2) Related content data source, (3) Combination of main source data and related data, conversion to interface format, (4) Graphical user interface
The objective is the enrichment of the information with related content, in order to provide the user with a larger number of information sources, improve information completeness and increase the user’s knowledge. Our system uses the publicly available APIs of YouTube as a source of video content, Flickr for images and pictures, and Technorati for social and real-time content.
2.3 Resource Description
All information extracted from the main and related content resources is converted into an XML graph data structure organized in entities (called nodes) and relations (edges) [8]: each set of results is associated with a node, and any relation among the results is associated with an edge. The structure of the data was designed to attain two goals: a) to be easily adaptable to all common data repositories or search engines, in order to implement a standardized representation of single elements, clusters, ranking information and semantic relations; b) to be lightweight, so that it can be easily transmitted and processed by the RIA application that runs within a browser plugin.
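The XML exchanged with the Flash client could look like the sketch below. The element and attribute names are assumptions chosen for the example, not the exact format used by the prototype.

```python
import xml.etree.ElementTree as ET

def build_graph_xml(nodes, edges):
    """nodes: dicts with id/label/rank and related-content counts; edges: (id, id) pairs."""
    graph = ET.Element("graph")
    for n in nodes:
        node = ET.SubElement(graph, "node", id=n["id"], label=n["label"],
                             rank=str(n["rank"]),
                             multimedia=str(n["multimedia"]), social=str(n["social"]))
        for item in n.get("results", []):
            ET.SubElement(node, "result", url=item)
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(graph, encoding="unicode")

xml = build_graph_xml(
    nodes=[{"id": "n1", "label": "cluster A", "rank": 0.9, "multimedia": 3, "social": 5,
            "results": ["http://example.org/doc1"]},
           {"id": "n2", "label": "cluster B", "rank": 0.4, "multimedia": 0, "social": 1}],
    edges=[("n1", "n2")],
)
print(xml)
```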
2.4 Graphical User Interface
The user interface was designed in order to optimize the comprehension of the data structures resulting from the diverse information sources. This objective is achieved using a graph representation [9,10], which maximizes data comprehension and the analysis of relations. Moreover, a visual sort ranking [11] allows the user to understand which is the best element among the results, both for the main resource and for the related contents. We propose two different types of graphic representations, described in the following, that summarize the authoritativeness of the sources, the presence of multimedia data and the presence of information obtained from social networks, in order to provide a comprehensive overview of the results using simple visualization paradigms. If there are data relationships then they are represented by graph edges, otherwise only elements with no relationships are represented. For each element we provide a graphic representation that visually summarizes the ranking of the main data resource, along with the multimedia content ranking and the social ranking. Below every element and beside the name of its group, we show buttons to access the related contents of the node, differing from each other by color (figure 2). The interface was designed to let users interact with the presentation of the search results, allowing drag & drop of nodes to correct errors of automatic node positioning, new refinement queries using a search bar, and delving deeper into queries related to a certain element by double clicking on the represented object. The graph is animated with smooth transitions between different visualizations [12]. The graphical representation of the graph is made using the open source graph drawing framework Birdeye Ravis [13].
Geometric paradigm. This paradigm is based on simple geometric properties. Groups are represented by a geometric shape differing in dimension, colour and shape type, as shown in figure 2. The dimension of the geometric shapes represents the authoritativeness of one group relative to the others, related to the original data source hierarchy. The order given by the main data source has to be as clear as possible, in order to represent the concept of authoritativeness among the contents found. Due to the importance of this concept, it has been connected to the clearest graphic concept: shape size. The most relevant group of contents is represented by the biggest shape, while less relevant contents are represented by small shapes. Color is connected to multimedia contents, because multimedia is to be considered as the “colour” given to textual information. The larger the amount of multimedia content associated with a node, the more the geometric shape is filled with the red colour; the smaller the amount of multimedia content, the more the colour tends to white, modifying the brightness value of the filling. The shape type is connected to the social sources: the more the content is obtained from social networks, the larger the number of sides of the geometric shape, starting from the basic shape of a triangle to represent content that has
Fig. 2. A description of geometric interface: each geometric shape is associated with a visual representation of its authoritativeness, social and multimedia ranking index and linked to similar nodes through relationship edges. Under the shape there are the contents access buttons.
Fig. 3. The prototype system running in a web browser. In fig. A is shown the geometric paradigm interface, in fig. B the urban paradigm visualization interface. The search results shown have been obtained searching with the “Linux” keyword.
no connection to social networks. This graphical concept is based on the idea that contents coming from social networks are able to give multilateralism to the original information.
Urban paradigm. This paradigm is based on concepts close to human perceptual skills: groups are represented by buildings of different shapes, with a different number of people close to them and a different number of trees around them.
Fig. 4. Classification of buildings according to importance: small house, house, small palace, palace, skyscraper. Bigger buildings correspond to more authoritative nodes.
As in the geometric paradigm, every visualized element is connected to a specific meaning of its group, trying to give the best semantic correspondence between the displayed graphics and the associated contents. The building type is the main element of this representation. It is connected to the node authoritativeness concept: the more representative the group contents, the bigger the building. Less representative nodes are shown as little houses; as the node authoritativeness index grows, the node is represented by bigger buildings, up to the biggest one: the skyscraper. Building size is to be considered an absolute representation of the node, differing from the geometric representation, in which we used a relative index. Information referring to related contents is represented by elements outside the building, like people near it and trees along it.
Fig. 5. Access to search results content: modal window with a list of results. For each result is shown a small caption, a link to original content and a thumbnail, if available. In fig. A is reported the list of web results, in fig. B the list of YouTube videos related to searched terms.
We represent the presence of multimedia information using trees: the more trees are drawn around the building, the greater the amount of multimedia content associated with the represented node. In the urban metaphor, trees are the element that gives colour to the city, as multimedia contents do to the information. The social content of a node is represented by people near the building: many people indicate a high social index.

Contents access. Users can inspect all the information sources related to each node of the graphs that contain the search results, such as web pages, blog entries, pictures and videos related to the subject of the query. Contents can be accessed by clicking on the buttons under each node name. Every button, when selected, is enlarged to help the user click on its area. Clicking on a button opens a modal window that lists the search results ordered by importance, each with a short description (figure 5). The other colored buttons let users access the related contents obtained from the related contents extractor.
3 Experimental Analysis
The system has been tested to evaluate how much an end-user can benefit from the graph-based presentation of semantically related documents and from the inclusion of multimedia data in the search results, all combined together in the interactive visualization. Testing was focused on evaluating the effectiveness of the two visual paradigms and the web usability [14] of the system. 11 users of varying background and expertise were selected to carry out the test, performed according to standard web usability tests [15]: 3 students and researchers of the Media Integration and Communication Center, 3 students of the Master in Multimedia Content Design and 5 non-technical users. The test was designed to be conducted in different sessions, and with different methods of operation:

Trained testing. Participants were first given a brief tutorial (lasting about 10 minutes) on the use of the experimental application and on the meaning of the visual paradigms.

Untrained testing. Participants were required to complete the experimental tasks without any prior knowledge of the application.

Each participant was asked to find a document, image or video about a topic using a keyword given by the test supervisor. Users were not allowed to modify the keyword or to refine the search by adding more keywords and, after starting the query, they were allowed to use only the mouse to interact with the system. The tasks assigned in the experiments were:

Task 1. Find an installation guide for the Ubuntu operating system through the keyword ubuntu.

Task 2. Find a web page describing the climate conditions that can be expected in Italy, using the keyword Italy.
Task 3. Find the name of the founder of the social network “Facebook”, using the keyword facebook.

Task 4. Find an image of one or more players of American Football, using the keyword football.

Tasks were followed by a short interview in which subjects were asked about their experiences and their understanding of the interface, the data representations and the visual paradigms. The number of mouse clicks used to complete each task and the time spent were recorded and used to evaluate the system [16]. To avoid the bias due to repeated search tasks, each user participated in only one of the two tests: 9 users were assigned to the “trained testing” (3 users for each results presentation paradigm: Google list, geometric and urban paradigms) and 2 users were assigned to the “untrained testing” (one for each visualization paradigm: geometric and urban). The users participating in the “trained testing” used only one of the interfaces. The “trained testing” comprises a comparison with the results obtainable using Google: some users had to complete the tasks of the test with the classical Google search engine, without refining the proposed keyword. We chose Google as the reference for the comparison since every user was familiar with its interface, so that we could consider them as “trained”, and because it can be considered the state of the art among web search engines that use ranked lists for the presentation of search results. The objective of this comparative test is to evaluate the effect of the knowledge increase that is the goal of the proposed visual representations. The objective of the untrained testing is to evaluate the usability of the interface.

3.1 Trained Testing Results
Tables 1 and 2 report the results of the trained testing experiments in terms of number of mouse clicks and time (in seconds), respectively.

Table 1. Number of mouse clicks used to complete the tasks assigned to the users participating in the trained testing
                        Task 1  Task 2  Task 3  Task 4
Google user 1              6       3       2       3
Google user 2              2       5       2       3
Google user 3              3       4       4       5
Google users avg.          3.7     4       2.7     3.7
Urban user 1              20      15       2       2
Urban user 2               4       2      15       2
Urban user 3               9       6       0       5
Urban users avg.          11       7.7     5.7     3
Geometric user 1           2       2       2       2
Geometric user 2           2       6       3       3
Geometric user 3           2       7      11       2
Geometric users avg.       2       5       5.3     2.3
Table 2. Number of seconds used to complete the tasks assigned to the users participating in the trained testing

                        Task 1  Task 2  Task 3  Task 4
User 1: Google            120     120      20      30
User 2: Google             70     110      30      15
User 3: Google            160      90      40      30
Google users avg.         116.7   106.7    30      25
User 4: urban             300     250     100     100
User 5: urban             180      25     300      20
User 6: urban             300     180      10     120
Urban users avg.          260     151.7   136.7    80
User 7: geometric          30      60      20      15
User 8: geometric          45      60     120      20
User 9: geometric          20     120     160      40
Geometric users avg.       31.7    80     100      25
The overall performance of the system is encouraging. Considering the geometric visualization paradigm, the average number of clicks is slightly higher than that required by the Google interface, because users have a better knowledge of the Google interface (which is also considerably leaner than the proposed system), but the average time spent is lower, thanks to the effectiveness of the visualization paradigm. The simplicity of the symbols used in the geometric paradigm imposes a lighter cognitive load than the urban representation paradigm; this effect is clearly shown also in the “untrained testing” results reported in the following. Some of the differences in terms of click number and time required for tasks 2 and 3, when using the proposed system, are due to the fact that the clustering process is performed anew for each query, which may lead to slightly different result sets. Users understood the meaning of the two visualization paradigms. The main difficulty with the geometric visualization is that it was not always easy to distinguish differences in terms of color and number of sides, while with the urban visualization users had issues in understanding the meaning of the different types of building.

Table 3. Number of clicks used to complete the tasks assigned to the users participating in the untrained testing
                        Task 1  Task 2  Task 3  Task 4
User 8: urban              35      48      20      10
User 9: urban              31      35      50       8
Urban users avg.           33      41.5    35       9
User 8: geometric          72      54      34      40
User 9: geometric          40       7      11       2
Geometric users avg.       56      30.5    22.5    21
Table 4. Number of seconds used to complete the tasks assigned to the users participating in the untrained testing

                        Task 1  Task 2  Task 3  Task 4
User 8: urban             420     500     310     100
User 9: urban             180     210     130      20
Urban users avg.          260     151.7   136.7    80
User 8: geometric         310     180     200     230
User 9: geometric         140      50      90      30
Geometric users avg.       31.7    80     100      25

3.2 Untrained Testing Results
Tables 3 and 4 report the results, in terms of clicks and seconds required to accomplish the tasks, for the “untrained testing”. As expected, the figures are much higher than in the previous test; however, the tests revealed that the main difficulties were due to the comprehension of the meaning of the buttons used to access the contents related to the search, which are not associated to any visual paradigm. For example, when required to search for images in task 4, the users had to figure out which button shows the images related to the search. The time required to complete the tasks using the urban visualization paradigm is 2.6 times higher than with the other representation, despite the fact that the average number of clicks is about the same: this is due to the fact that the more graphically detailed presentation requires more time to be understood than the abstract representation.
4 Conclusions
In this paper we presented a framework to visualize heterogeneous information from the World Wide Web. Given a query string, the proposed system extracts the results from a web clustering engine and represents them according to a graph-based visualization technique. The GUI allows the end-user to explore the information space and visualize related content extracted from different resources, such as multimedia databases and social networks. Two different visualization paradigms have been developed and tested in usability experiments, to evaluate their effectiveness in giving end-users a better comprehension of the categories and semantic relationships existing between the search results, thus achieving a more efficient retrieval of web documents. Experimental results demonstrate the effectiveness of the proposed solution. Future work will address an extended experimental evaluation with different user interfaces, to overcome the difficulties highlighted in the experiments, as well as an expansion of the methods used for the extraction and linking of multimedia content related to the textual searches.
References
1. Card, S.K., Mackinlay, J., Shneiderman, B.: Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco (January 1999)
2. Hirsch, C., Hosking, J., Grundy, J.: Interactive visualization tools for exploring the semantic graph of large knowledge spaces. In: Workshop on Visual Interfaces to the Social and the Semantic Web (VISSW 2009) (February 2009)
3. Bollacker, K., Cook, R., Tufts, P.: Freebase: a shared database of structured general human knowledge. In: AAAI 2007: Proceedings of the 22nd National Conference on Artificial Intelligence, pp. 1962–1963. AAAI Press, Menlo Park (2007)
4. Di Giacomo, E., Didimo, W., Grilli, L., Liotta, G., Palladino, P.: Whatsonweb+: An enhanced visual search clustering engine. In: Visualization Symposium, PacificVIS 2008, IEEE Pacific, pp. 167–174 (March 2008)
5. Ferragina, P., Gulli, A.: The anatomy of a hierarchical clustering engine for webpage, news and book snippets. In: Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 395–398 (2004)
6. Weiss, D., Osinski, S.: Carrot2: open source search results clustering engine
7. Osinski, S., Stefanowski, J., Weiss, D.: Lingo: Search results clustering algorithm based on singular value decomposition. In: Intelligent Information Systems, pp. 359–368 (2004)
8. Bondy, J.A., Murty, U.S.R.: Graph Theory (Graduate Texts in Mathematics). Springer, Heidelberg (2007)
9. Di Giacomo, E., Didimo, W., Grilli, L., Liotta, G.: Graph visualization techniques for web clustering engines. IEEE Transactions on Visualization and Computer Graphics 13(2), 294–304 (2007)
10. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics 6(1), 24–43 (2000)
11. Keller, T., Tergan, S.O.: Visualizing knowledge and information: An introduction. In: Knowledge and Information Visualization, pp. 1–23 (2005)
12. Misue, K., Eades, P., Lai, W., Sugiyama, K.: Layout adjustment and the mental map. Journal of Visual Languages & Computing 6(2), 183–210 (1995)
13. Birdeye information visualization and visual analytics library
14. Nielsen, J., Loranger, H.: Web Usability. Addison-Wesley, München
15. Krug, S.: Don't Make Me Think: A Common Sense Approach to Web Usability. New Riders Press, Indianapolis (October 2000)
16. Nielsen, J., Molich, R.: Heuristic evaluation of user interfaces. In: CHI 1990: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 249–256. ACM Press, New York (1990)
Mathematical Symbol Indexing for Digital Libraries

Simone Marinai, Beatrice Miotti, and Giovanni Soda

Dipartimento di Sistemi e Informatica, University of Florence, Italy
[email protected]
Abstract. In this paper we describe our recent research on mathematical symbol indexing and its possible application in the Digital Library domain. The proposed approach represents mathematical symbols by means of the Shape Context (SC) description. Indexed symbols are represented with a vector space-based method, but peculiar to our approach is the use of Self Organizing Maps (SOM) to perform the clustering instead of the commonly used k-means algorithm. The retrieval performance is measured on a large collection of mathematical symbols gathered from the widely used INFTY database.
1 Introduction
Nowadays, Digital Library technologies are well established and understood. This is proven by the large number of papers related to this topic published in the last few years and by the broad range of systems already available on the Web. In most cases DLs deal with digitized documents, books and journals that are represented as images, where Document Recognition techniques can be applied. For instance, in many cases the document images are processed by means of Optical Character Recognition (OCR) techniques in order to extract their textual content. One related research area, not yet fully implemented in DL systems, is based on Document Image Retrieval approaches, where relevant documents are identified relying only on image features [1]. In the last few years, most documents belonging to DLs are “born-digital” (rather than being digitized) and, as a consequence, some techniques have been proposed to perform retrieval over them in answer to a user query [2] [3]. Digital Libraries can now be considered as collections of digital contents which are available through the Internet, but not necessarily public. DL architectures and services are in continuous evolution because of their close contact with Web 2.0 technologies. In the era of social networks, users are called upon as main characters to build and maintain DLs reachable through the Internet [4]. The interaction between users, to develop and update DLs, is the aim of DL owners. Some examples of this new approach to DLs are Flickr, Facebook and other social networks, which ask for the users' help to increase the contents and to classify each element by means of keywords.
In our research we are interested in analyzing techniques which can be used in the phase of information extraction and retrieval from scanned or digital-born documents in a Digital Library. These techniques can be classified, according to the level at which they work, into three categories [5,1]. The first is free browsing and is the easiest to implement: a user browses through a document collection, looking for the desired information by visually checking the document images. The second is recognition-based retrieval, which is based on the recognition of the document contents. In this case the similarity between documents is evaluated at the symbolic level, assuming that a recognition engine can extract the full text of text-based documents or a set of metadata from multimedia documents. The textual information is then indexed and the retrieval is performed by means of keywords provided by the user. The recognition-based approach has the advantage that the similarity computation and result ranking have a lower computational cost. On the other hand, it has some limitations when OCR systems cannot perform well (for instance on very noisy documents or documents containing multi-lingual text) or are not yet fully developed (such as for mathematical symbol recognition, which is addressed in this paper). Some of the earliest methods adopted for the recognition-based approach, and in particular for OCR-based text retrieval, have been described in two comprehensive surveys [6,7]. A mixed approach is proposed in [8], where document image analysis techniques are used together with OCR engines and metadata extraction. The last category is based on recognition-free retrieval methods, which can also be regarded as content-based approaches. In this case the similarity is evaluated considering the actual content of the document images: features closely related to the document images, such as color, texture or shape, are extracted. The user can perform the retrieval with a Query by Example (QbE) approach, i.e. presenting a query image to the system and looking at a ranking of results. One advantage of a recognition-free approach is the possibility of looking for information without the need for specific background knowledge. For example, users may perform a QbE query with keywords in any language and the system does not need to know the language in the phase of document indexing [9]. On the other hand, even recognition-free approaches have some limitations, especially regarding the selection of the feature set. Most systems work with low-level features such as color, texture and shape, while only few systems are able to extract high-level or semantic features. In [10], [11] and [12] keyword spotting techniques have been proposed, based on Word Shape Coding and on sets of low-level features. A different approach has been proposed in [9], where words are indexed on the basis of character shapes. Due to the large number of scientific and technical documents that are nowadays available in Digital Libraries, many efforts have been devoted to building systems which are able to recognize the mathematical expressions embedded in printed documents. Because of the very large number of symbol classes and of the spatial relationships among symbols, OCR engines often fail in the recognition phase [13] [14].
According to [15], most systems for mathematical expression analysis are based on four main steps.

Layout Analysis is used to extract the layout of the document images. The most common techniques rely on connected component extraction and follow bottom-up (e.g. [16]) or top-down (e.g. [17]) approaches.

Symbol Segmentation is aimed at identifying each individual symbol, which in most cases corresponds to one connected component.

Symbol Recognition is mostly based on machine learning techniques. For instance, Takiguchi et al. [18] and Suzuki et al. [13] represent the symbols according to pixel intensity features and physical peculiarities.

Structural Analysis is performed in order to understand whole equations. Toyota et al. [19] build relation trees among the various symbols according to their mutual physical positions or to logic considerations.

Document image retrieval techniques have seldom been used to process mathematical expressions [20]. However, several researchers envisage the usefulness of search systems that could search for text and also for “fine-grain mathematical data” such as equations and functions [21]. Most search systems for mathematical documents rely on specific markup languages. For instance, the MathWebSearch system harvests the Web for formulae indexed with MathML or OpenMath representations [22]. In this paper, we present a system based on a recognition-free approach for the retrieval of mathematical symbols belonging to a collection of documents. We do not explicitly deal with symbol recognition, but we focus on the retrieval of mathematical symbols. This can be considered as a preliminary phase of mathematical formulae retrieval, which is the general aim of our work. The system described in this paper is made up of three steps: in the first step, for each symbol in the collection, we compute a set of features that are used to index it. In the second step, visual queries proposed by the user are analyzed in order to compute the same features computed during indexing. In the last step, the similarity between the query vector and each encoded collection element is evaluated and the results are ranked. The paper is organized as follows. In Sect. 2 we describe the indexing and retrieval method. The Infty database and the experiments are described in Sect. 3. Conclusions and future work are drawn in Sect. 4.
2 Mathematical Symbol Indexing
When checking two occurrences of a mathematical symbol, a human observer is able to glean over the differences and, based on the symbol shape, to assert that the two images represent the same symbol. This kind of visual analysis should be extended so that it can be used in an automatic process. Each image contains local interest points that concentrate most of its information; in particular, the shape of the symbol is a peculiarity of the object. To illustrate this feature of mathematical symbols we show in Fig. 1 some examples of queries with the corresponding top ten results as reported by our system.
Fig. 1. Examples of queries with the first 10 retrieved symbols. The first symbol is the query.
Two main approaches can be considered to describe an image (e.g. [23]): brightness-based approaches take into account the pixel values, whereas feature-based approaches involve the use of physical peculiarities of the symbol, such as edge elements and connected components. In printed documents the mathematical symbols correspond in most cases to separate connected components. In our work we therefore use a feature-based approach, which is more appropriate to describe the symbol shape. Since the number of symbol classes in mathematics is very large, we need a compact way to express the similarity among symbols of the same class even if they look different. The classes include characters from the Greek, German and Latin alphabets as well as mathematical symbols. Moreover, some of them appear in different styles, fonts and sizes. Additional details about the dataset that we used in our experiments are reported in Sect. 3. Among other approaches, we use Shape Contexts (SC) [23] to describe mathematical symbol shapes, considering both the internal and external contours. In general, similar shapes have similar descriptors, and therefore different symbols can be compared by considering their SCs and then establishing a similarity measure. The symbol image is processed to identify the internal and external contours, which are subsequently described as a set of points. A subset P of sampled points is then extracted as representative of the symbol shape (Fig. 2).
2.1 Keypoints Selection
An important point of the SC-based symbol representation is the identification of the keypoints on which Shape Contexts have to be evaluated. In the original paper [23] the keypoints are extracted from the internal and external symbol contours; in particular, the contours are sampled with a regular spacing between keypoints. We performed the first experiments (described in Sect. 3) following this approach. We also considered other approaches that identify keypoints by looking for salient contour points. To this purpose we considered both the corner and the local maximum curvature approaches. A corner point has two dominant and different edge directions in its neighborhood and can be detected considering the gradient of the image. In the second
Fig. 2. (a) An example of mathematical symbol. (b) The sampled point set used to compute SCs. (c) The logarithmic mask.
approach, we define a local maximum point as a point whose local contour has a curvature higher than a given threshold. The curvature can be estimated considering the angle between consecutive line segments. The segments are computed with a linear interpolation of the contour on the basis of a maximum distance between the contour points and the interpolating segment. This distance has an important role in the evaluation of the curvature: with a small distance the approach is sensitive to the noise in the image and too many keypoints are identified; on the opposite, with a large distance some relevant points are lost. Some symbols, for example the “0”, present curvature values that are almost constant and near to the average value, while other symbols, such as “[”, present only few points whose curvature values are interesting (the two corners and the endpoints). In order to alleviate these problems, the keypoints are selected among the points with a curvature greater than a threshold that is dynamically adjusted for each symbol, by setting it to the average of all the curvature values in the image. In Sect. 3 we compare the results obtained with these approaches for keypoint selection.
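A minimal sketch of this curvature-based selection is given below. It follows the idea described above (turning angle between consecutive segments, threshold set to the average curvature of the symbol), but for brevity it replaces the distance-bounded polygonal approximation with a fixed sub-sampling of the contour, so it should be read as an illustration rather than the authors' exact procedure.

```python
import numpy as np

def curvature_keypoints(contour):
    """Select keypoints on a closed contour by local curvature: estimate the
    curvature as the turning angle between consecutive contour segments and
    keep the points whose curvature exceeds the per-symbol average."""
    pts = np.asarray(contour, dtype=float)[::3]   # crude stand-in for segment fitting
    prev = np.roll(pts, 1, axis=0)
    nxt = np.roll(pts, -1, axis=0)
    v1 = pts - prev
    v2 = nxt - pts
    # Turning angle between consecutive segments (0 = locally straight contour).
    cos_a = (v1 * v2).sum(axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-9)
    curvature = np.arccos(np.clip(cos_a, -1.0, 1.0))
    threshold = curvature.mean()                  # dynamically adjusted threshold
    return pts[curvature > threshold]

# Example: the four corners of a square contour stand out against its edges.
square = ([(x, 0) for x in range(10)] + [(9, y) for y in range(1, 10)] +
          [(x, 9) for x in range(8, -1, -1)] + [(0, y) for y in range(8, 0, -1)])
print(curvature_keypoints(square))
```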
2.2 Shape Contexts Evaluation
The Shape Context for each point p_i in P can be computed by considering the relative position of the other points in P. The SC for p_i is obtained by computing a coarse histogram h_i whose bins are uniform in log-polar space (Fig. 2 (b), (c)), as described in the following. Let m be the cardinality of P and p_j be one of the remaining m−1 points in P. The point p_j is assigned to one bin according to the logarithm of the Euclidean distance between p_i and p_j and to the direction of the link between p_i and p_j. The histogram h_i is defined to be the Shape Context of p_i. The m SC vectors indirectly describe the whole symbol. It is clear that Shape Contexts are invariant to translations. The SC computation can be modified in order to obtain descriptions that are scale and rotation invariant [23]. For mathematical symbol indexing, rotation invariance can be misleading because we could confuse symbols such as 6 with 9 and ∪ with ∩. To deal with mathematical symbols we also adapted the SC computation, taking into account the SC radius (the maximum distance of the points included in the histogram) and the set of points to be considered in the histogram population.
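The following sketch shows how such a log-polar histogram can be computed for one keypoint. The 20-pixel radius and the choice of counting all symbol points (rather than only the points in P) reflect the description in the text, while the 5 radial and 12 angular bins follow the common Shape Context setting and are not taken from the paper, so they should be read as assumptions.

```python
import numpy as np

def shape_context(p_i, points, radius=20, n_r=5, n_theta=12):
    """Minimal Shape Context sketch for one keypoint p_i: every other symbol
    point within `radius` is assigned to a bin that is uniform in
    (log-distance, angle) space, and the flattened histogram is the descriptor."""
    pts = np.asarray(points, dtype=float)
    d = pts - np.asarray(p_i, dtype=float)
    dist = np.hypot(d[:, 0], d[:, 1])
    keep = (dist > 0) & (dist <= radius)          # points beyond the radius are ignored
    # Radial bins uniform in log-distance, angular bins uniform in angle.
    r_edges = np.logspace(np.log10(1.0), np.log10(radius), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, dist[keep], side="right") - 1, 0, n_r - 1)
    theta = np.arctan2(d[keep, 1], d[keep, 0]) % (2 * np.pi)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta), dtype=int)
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel()                           # one SC descriptor of length n_r * n_theta

# Example: descriptor of the top-left corner of a small grid of symbol points.
grid = [(x, y) for x in range(10) for y in range(10)]
print(shape_context((0, 0), grid).sum(), "points fell inside the radius")
```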
Large values of the SC radius allow the histogram to embrace the whole image, and therefore each SC is influenced by points very far from it. With a small radius we have to deal with the points that fall outside the last mask bins. To address the latter point, two alternatives are possible. In one solution all the external points are included in the last bins, which then have values significantly higher than the other bins. On the opposite, if the external points are not counted, the resulting SC describes only a small portion of the symbol shape, with the risk of losing information. To find the most suitable radius we performed several experiments, which are reported in detail in [24]. From these experiments it turned out that a halfway radius (20 pixels), a little bit smaller than the average image size, should be preferred in most cases. To increase the robustness against symbol noise, we compute h_i by counting the number of all the symbol points that belong to each bin, instead of considering only the points in P. In so doing, each SC bin is more populated and more informative. We followed this approach because the symbols in the Infty dataset are small (on average 20 x 30 pixels), and therefore the number of contour pixels is low and with the standard algorithm only a few bins of the SCs would contain some points. This choice is supported by some preliminary experiments that we described in [24].
2.3 SOM-Based Visual Dictionary
The comparison between the Shape Contexts of a query symbol and those of each indexed object can provide a very accurate evaluation of the similarity among symbols. However, the computational cost of a pairwise comparison is too high and cannot be afforded when dealing with large datasets. One typical solution to this problem is based on the transformation of the shape representation using techniques adopted in the vector space model of Information Retrieval. To this purpose, vector quantization is first performed by clustering the vector representations and then identifying each vector with the index of the cluster it belongs to. The clusters are in most cases identified by running the K-means algorithm on a subset of the objects to be indexed. In the textual analogy, each cluster is considered as a “visual word” and each symbol can be represented on the basis of the frequencies of each “visual word” in its description [25]. Although simple to implement, one limitation of the K-means clustering algorithm is that it does not take into account any similarity among clusters. In other words, points belonging to different clusters (or SCs corresponding to different “visual words”) contribute in the same way to the symbol similarity, whether the clusters are similar or dissimilar. The peculiarity of the approach described in this paper is the use of Self Organizing Maps (SOM) to perform the vector quantization. In this case, in contrast with K-means, the clusters are topologically ordered in the SOM map. As an example we depict in Fig. 3 a portion of an SOM, used to index the mathematical symbols, together with two symbols. Each cluster is pictorially depicted by reconstructing a virtual Shape Context that corresponds to the values in the SOM related to that particular
Fig. 3. Visual words obtained by SOM clustering. Each circle is a graphical representation of one cluster centroid. We show also two symbols with a reference to some visual words.
centroid. From the map it is clear that similar SCs are placed in neighboring neurons of the map. The similarity between the query vector and each element of the index is evaluated by means of the cosine similarity function. As proposed in [20], we have modified the cosine formula to take advantage of the topological ordering of the SOM map. In particular, we perform an inexact match between the vector representations of two symbols: for each element of the query vector that has no correspondent in the indexed vector, we consider its four or eight neighbors in the map and take as winner the maximum among them, weighted according to its position in the map.
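The sketch below illustrates this indexing scheme: Shape Contexts are quantized against a codebook of visual words (assumed here to have already been trained as an SOM, one centroid per neuron on a rows x cols grid), each symbol becomes a vector of visual-word frequencies, and similarity is a neighbor-aware variant of the cosine function in the spirit of sim4/sim8. The 0.5 down-weighting of neighboring matches is an assumption; the exact weighting is defined in [20].

```python
import numpy as np

def bag_of_visual_words(descriptors, codebook):
    """Quantize a symbol's Shape Contexts against the trained codebook and
    return the visual-word frequency vector used to index the symbol."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # winning neuron per descriptor
    return np.bincount(words, minlength=len(codebook)).astype(float)

def som_cosine(query, indexed, grid_shape, use_8_neighbors=False):
    """Cosine similarity with inexact matching through the SOM grid: a query
    visual word absent from the indexed symbol may still match via the
    adjacent neurons (4 or 8 of them), with a reduced weight."""
    rows, cols = grid_shape
    offs = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if use_8_neighbors:
        offs += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    matched = indexed.astype(float)
    for w in np.nonzero((query > 0) & (indexed == 0))[0]:
        r, c = divmod(w, cols)
        neigh = [indexed[(r + dr) * cols + (c + dc)]
                 for dr, dc in offs
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
        matched[w] = 0.5 * max(neigh, default=0.0)   # assumed down-weighting factor
    denom = np.linalg.norm(query) * np.linalg.norm(matched)
    return float(query @ matched / denom) if denom else 0.0

# Example with a tiny 2x2 map (4 visual words): the two symbols share no
# visual word, yet the similarity is non-zero thanks to the map topology.
codebook = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
q = bag_of_visual_words(np.array([[0.1, 0.0], [0.9, 0.1]]), codebook)
d = bag_of_visual_words(np.array([[0.0, 0.9], [1.0, 0.9]]), codebook)
print(som_cosine(q, d, grid_shape=(2, 2)))
```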
3 Experiments
We made our experiments on two datasets collected by the INFTY project [26]. The InftyCDB-3-A dataset consists of 188,752 symbols scanned at 400 dpi extracted from 19 articles printed by various publishers, and from other sources so as to cover all the most important mathematical symbols. The InftyCDB-3-B contains 70,637 symbols scanned at 600 dpi from 20 articles. Ground-truth information at the symbol level is provided for both datasets. It is important to notice that the same code has been assigned to different symbols that look similar (e.g. the summation symbol and the Greek letter Σ). In the two datasets (which consist of 346 pages) there are 393 different classes.
Table 1. Precision at 0% Recall for the K-means and SOM experiments

Methods         K-means    SOM
SC Std           74.69     79.93
SC AllPoints     73.17     86.81
Table 2. Precision at 0% Recall for the detection of keypoints based on the local maximum curvature and corner methods. Three sizes of the SOM are compared as well.

                 Curvature                    Corner
         10x10    10x20    20x20      10x10    10x20    20x20
sim      95.82    96.22    97.86      90.87    91.44    93.31
sim4     95.83    96.21    97.84      90.97    91.40    93.29
sim8     95.82    96.19    97.83      90.82    91.51    93.44
Before indexing the data, we computed the SC clusters on a set of 22,923 symbols, belonging to 53 pages randomly selected from the whole dataset. From each symbol we extract around 50 SCs, so that we used 1,102,049 feature vectors for clustering. We then indexed all the 259,357 symbols in the two datasets and performed several experiments to compare alternative approaches that can be used to index the data. To evaluate the retrieval results we use the Precision-Recall curves and a single numerical value, the Precision at 0% Recall, which is obtained through an interpolation procedure of the Precision-Recall curve as detailed in [27]. As usual, the Recall is defined as the fraction of the relevant symbols which have been retrieved: Recall = |tp| / (|tp| + |fn|); the Precision is defined as the fraction of retrieved symbols which are relevant: Precision = |tp| / (|tp| + |fp|); where tp (true positives) are the retrieved symbols which are relevant, fp (false positives) are the retrieved symbols which are not relevant, and fn are the relevant symbols which have not been retrieved. We computed the P-R curve for interpolation after estimating the precision at recall values that are multiples of 10%. In the experiments reported in this paper we made 392 queries randomly selected from the dataset. To obtain a correct comparison among methods, we always used the same set of 392 queries. To estimate the suitability of the SOM clustering we first compared it with the K-means clustering. Some preliminary experiments are reported in [20] and in [24]. The latter are summarized in Table 1, where we can verify that the SOM clustering together with a computation of SCs with all the symbol points (SC AllPoints) provides the best results. To compare
Table 3. Area under the curve (AUC) of the P-R curves in the case of 3920 queries

        sim        sim4       sim8
AUC     2345.08    2410.34    2383.76
the two approaches for keypoint selection described in Sect. 2.1, we performed some experiments using the same settings as in the previous experiments. In Table 2 and Fig. 4 we show the Precision at 0% Recall and the Precision-Recall curves for the two approaches. In the experiments we considered three map sizes (with 100, 200, and 400 neurons) and also three functions to compute
Fig. 4. Precision-Recall curves related to Table 2: (a) Curvature, (b) Corner
Fig. 5. Comparison of Precision-Recall curves
the symbol similarity. The similarity between query vectors and indexed elements is evaluated by means of the cosine similarity function and two variants of it, sim4 and sim8, explained in detail in [20]. From these experiments we can observe that larger maps provide in general better results. However, in this experiment there is no considerable difference among the three similarity functions. To compare the two methods we report in Fig. 5 the best plots of Fig. 4. From this figure it is clear that the curvature method is better than the other. From the previous experiments it is not possible to understand whether there is any advantage in the use of one similarity function with respect to the others. This is probably due to the low number of queries that we used. We therefore performed an additional experiment considering an SOM map with 400 centroids and the curvature approach. The other settings are fixed as before, but we performed 3920 queries. The results, evaluated with the Precision-Recall curves, show that the values of Precision at 0% Recall are nearly the same for all similarity functions. To compare the methods more accurately, we considered the area under these curves as a quality measure. The results are shown in Table 3. As we can see, the sim4 similarity function, although starting from a value similar to the others, tends to have higher values when the Recall increases.
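For reference, the evaluation measures used in this section (interpolated precision, Precision at 0% Recall, and the area under the P-R curve) can be computed as in the following sketch; the 11-point interpolation grid follows the standard procedure of [27], and the toy ranking at the end is only an illustration, not data from the experiments.

```python
import numpy as np

def interpolated_pr(relevant_flags, n_relevant, levels=np.linspace(0, 1, 11)):
    """Interpolated Precision-Recall curve for one ranked result list
    (1 = relevant, 0 = not relevant): the interpolated precision at recall
    level r is the maximum precision observed at any recall >= r, and the
    first value is the 'Precision at 0% Recall' used above."""
    flags = np.asarray(relevant_flags, dtype=float)
    tp = np.cumsum(flags)
    precision = tp / np.arange(1, len(flags) + 1)
    recall = tp / n_relevant
    interp = [precision[recall >= r].max() if (recall >= r).any() else 0.0
              for r in levels]
    return np.array(interp)

# Example: 392 (or 3920) queries would be averaged; here a single toy ranking.
curve = interpolated_pr([1, 0, 1, 1, 0, 0, 1], n_relevant=5)
auc = 0.1 * (curve[:-1] + curve[1:]).sum() / 2    # trapezoidal area under the curve
print("P at 0% recall:", curve[0], " AUC:", auc)
```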
4 Conclusions
In this paper, we proposed a technique for image-based symbol retrieval based on Shape Context representation encoded with a bag of visual words method. The peculiarity of the approach is the use of Self Organizing Maps for clustering the Shape Contexts into a suitable visual dictionary. We also compared various methods to compute the SCs as well as different clustering approaches. Experiments performed on a large and widely used data set containing both
alphanumeric and mathematical symbols allow us to positively evaluate the proposed approach. Future work includes a deeper comparison of various retrieval approaches. We also aim to extend the retrieval mechanism to incorporate the structural information of the formulae in the retrieval algorithm.
References
1. Marinai, S.: A Survey of Document Image Retrieval in Digital Libraries. In: Sulem, L.L. (ed.) Actes du 9ème Colloque International Francophone sur l'Ecrit et le Document, SDN 2006, pp. 193–198 (September 2006)
2. Chen, N., Shatkay, H., Blostein, D.: Use of figures in literature mining for biomedical digital libraries. In: Proc. DIAL, pp. 180–197 (2006)
3. Esposito, F., Ferilli, S., Basile, T., Mauro, N.D.: Automatic content-based indexing of digital documents through intelligent processing techniques. In: Proc. DIAL, pp. 204–219 (2006)
4. Gazan, R.: Social annotations in digital library collections. D-Lib Magazine 14(11/12) (2008)
5. Wan, G., Liu, Z.: Content-based information retrieval and digital libraries. Information Technology & Libraries 27, 41–47 (2008)
6. Doermann, D.: The indexing and retrieval of document images: A survey. Computer Vision and Image Understanding 70(3), 287–298 (1998)
7. Mitra, M., Chaudhuri, B.: Information retrieval from documents: A survey. Information Retrieval 2(2/3), 141–163 (2000)
8. Belaïd, A., Turcan, I., Pierrel, J.M., Belaïd, Y., Hadjamar, Y., Hadjamar, H.: Automatic indexing and reformulation of ancient dictionaries. In: DIAL 2004: Proceedings of the First International Workshop on Document Image Analysis for Libraries, Washington, DC, USA, p. 342. IEEE Computer Society, Los Alamitos (2004)
9. Marinai, S., Marino, E., Soda, G.: Font adaptive word indexing of modern printed documents. IEEE Transactions on PAMI 28(8), 1187–1199 (2006)
10. Bai, S., Li, L., Tan, C.: Keyword spotting in document images through word shape coding. In: ICDAR 2009: Proceedings of the Tenth International Conference on Document Analysis and Recognition, p. 331. IEEE Computer Society, Los Alamitos (2009)
11. Li, L., Lu, S.J., Tan, C.L.: A fast keyword-spotting technique. In: ICDAR 2007: Proceedings of the Ninth International Conference on Document Analysis and Recognition, Washington, DC, USA, pp. 68–72. IEEE Computer Society, Los Alamitos (2007)
12. Lu, S., Li, L., Tan, C.L.: Document image retrieval through word shape coding. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1913–1918 (2008)
13. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty: an integrated OCR system for mathematical documents. In: DocEng 2003: Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 95–104. ACM, New York (2003)
14. Garain, U., Chaudhuri, B.B., Chaudhuri, A.R.: Identification of embedded mathematical expressions in scanned documents. In: ICPR, vol. 1, pp. 384–387 (2004)
15. Guo, Y., Huang, L., Liu, C., Jiang, X.: An automatic mathematical expression understanding system. In: ICDAR 2007: Proceedings of the Ninth International Conference on Document Analysis and Recognition, Washington, DC, USA, vol. 2, pp. 719–723. IEEE Computer Society, Los Alamitos (2007)
16. Anil, K.J., Bin, Y.: Document representation and its application to page decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 294–308 (1998)
17. Chang, T.Y., Takiguchi, Y., Okada, M.: Physical structure segmentation with projection profile for mathematic formulae and graphics in academic paper images. In: ICDAR 2007: Proceedings of the Ninth International Conference on Document Analysis and Recognition, Washington, DC, USA, vol. 2, pp. 1193–1197. IEEE Computer Society, Los Alamitos (2007)
18. Takiguchi, Y., Okada, M., Miyake, Y.: A study on character recognition error correction at higher level recognition step for mathematical formulae understanding. In: 18th International Conference on Pattern Recognition, ICPR 2006, vol. 2, pp. 966–969 (2006)
19. Toyota, S., Uchida, S., Suzuki, M.: Structural analysis of mathematical formulae with verification based on formula description grammar. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS, vol. 3872, pp. 153–163. Springer, Heidelberg (2006)
20. Marinai, S., Miotti, B., Soda, G.: Mathematical symbol indexing using topologically ordered clusters of shape contexts. In: Int'l Conference on Document Analysis and Recognition, pp. 1041–1045 (2009)
21. Youssef, A.: Roles of math search in mathematics. In: Borwein, J.M., Farmer, W.M. (eds.) MKM 2006. LNCS (LNAI), vol. 4108, pp. 2–16. Springer, Heidelberg (2006)
22. Kohlhase, M., Sucan, I.: A search engine for mathematical formulae. In: Calmet, J., Ida, T., Wang, D. (eds.) AISC 2006. LNCS (LNAI), vol. 4120, pp. 241–253. Springer, Heidelberg (2006)
23. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
24. Marinai, S., Miotti, B., Soda, G.: Mathematical symbol indexing. In: AI*IA 2009: Proceedings of the XIth International Conference of the Italian Association for Artificial Intelligence, Reggio Emilia, on Emergent Perspectives in Artificial Intelligence, pp. 102–111. Springer, Heidelberg (2009)
25. Yang, J., Jiang, Y.G., Hauptmann, A.G., Ngo, C.W.: Evaluating bag-of-visual-words representations in scene classification. In: MIR 2007: Proceedings of the International Workshop on Multimedia Information Retrieval, pp. 197–206. ACM, New York (2007)
26. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty: an integrated OCR system for mathematical documents. In: DocEng 2003: Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 95–104. ACM, New York (2003)
27. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley, Reading (1999)
Using Explicit Word Co-occurrences to Improve Term-Based Text Retrieval

Stefano Ferilli¹, Marenglen Biba², Teresa M.A. Basile¹, and Floriana Esposito¹

¹ Dipartimento di Informatica, Università di Bari, via E. Orabona 4, 70125 Bari, Italia
{ferilli,basile,esposito}@di.uniba.it
² Computer Science Department, University of New York, Tirana, Rr. “Komuna e Parisit”, Tirana, Albania
[email protected]
Abstract. Reaching high precision and recall rates in the results of term-based queries on text collections is becoming more and more crucial, as the amount of available documents increases and their quality tends to decrease. In particular, retrieval techniques based on the strict correspondence between terms in the query and terms in the documents miss important and relevant documents where it just happens that the terms selected by their authors are slightly different from those used by the final user who issues the query. Our proposal is to explicitly consider term co-occurrences when building the vector space. Indeed, the presence in a document of terms different from, but related to, those in the query should strengthen the confidence that the document is relevant as well. Missing a query term in a document, but finding several terms strictly related to it, should equally support the hypothesis that the document is actually relevant. The computational perspective that embeds such relatedness consists of matrix operations that capture direct or indirect term co-occurrence in the collection. We propose two different approaches to enforce such a perspective, and run preliminary experiments on a prototypical implementation, suggesting that this technique is potentially profitable.
1 Introduction
The retrieval of interesting documents in digital libraries and repositories is today a hot problem, which is becoming harder and harder as the amount of available documents dramatically increases and their content tends to be of lower and lower quality. When a user issues a query, on one hand there is the need for a stricter selection of the returned documents, in order to filter out irrelevant ones; on the other hand, the retrieved documents should satisfy his specific interests from a semantic viewpoint. This is reflected in the classical evaluation measures used in Information Retrieval, precision and recall.
Almost all searches for documents are currently based on their textual content. The problem is that user interaction in information retrieval typically takes place at the purely lexical level, but term-based matching is clearly insufficient to even approximately catch the intended semantics of the query, because the syntactic aspects of sentences embed significant information that is missed by the bag-of-words approach. The exact matching of query terms with terms of documents in the repository implies a number of practical problems in retrieving the desired information; attempts to solve them usually result in oscillations between the two extremes of high recall with very low precision or high precision with very low recall, without being able to find a suitable trade-off and balance that is useful to the user. In addition to being excessively simplistic in itself, the term matching-based approach suffers also from problems and tricks that are intrinsic to Natural Language, such as word synonymy (different words having the same meaning) and polysemy (single words having several meanings in different contexts). However, the problem we want to face in this work is yet more advanced, and can be summarized in the following example. We clearly would like the following text:

The locomotive machine will leave from Platform C of Penn Station after letting all passengers getting down from the carriages and sleeping-cars.

to be retrieved as an answer to a query made up of just one term, ‘trains’. In fact, in the classical setting, this document would not be returned in the result set, because neither the exact word nor its stem appears in the text. Still worse, not even the concept of ‘train’ itself is present in it, although a large number of terms that are specific to the train domain are pervasive in the sentence. This complies with the one domain per sentence assumption [6]. In the following section, a brief overview of the techniques proposed in the literature for attacking the shortcomings of pure term-based text retrieval is provided. Then, the proposal of explicit exploitation of term co-occurrence related information is described in Section 3. Specifically, two different techniques are proposed to obtain such a result: SuMMa and LSE, described in two separate subsections. Subsequently, Section 4 discusses some of the pros and cons of the two approaches, showing that they have complementary advantages and disadvantages as regards time and space requirements for computation. Some preliminary but encouraging results obtained on a toy problem are also presented, suggesting that co-occurrences play a relevant role in improving the quality of term-based search results based only on the lexical level. Lastly, Section 5 concludes the paper and outlines current and future work that is planned to better assess the effectiveness of the proposed approach.
2 Related Work
Most techniques for Information Retrieval are based on some variation of the Vector Space [15], a geometrical interpretation of the problem in which each document is represented as a point in a multi-dimensional space, whose dimensions
are the terms, and where the coordinates are given by (some function of) the number of occurrences of that term in that document. Since pure occurrences would be affected by terms that are very frequent just because of the kind of collection, and hence are not significant for the specific documents, weighting functions are used that merge a local factor (saying how important a term is for a specific document) smoothed by a global factor (evaluating the spread of that term in the whole collection). The best-known example is TF*IDF (Term Frequency * Inverse Document Frequency) [14]. Practically, the space associated to a document collection consists of a very large matrix, called the Term-Document Matrix, whose rows correspond to the terms appearing at least once in the collection, and whose columns correspond to the documents in the collection. The user query can be considered as a document as well, and hence represented as a vector expressed in the same dimensions as the space, which allows it to be easily compared to each document in the collection (Euclidean distance is a straightforward way of doing this), and then to rank the results by increasing distance. Another problem is how to interpret the query terms. The usual default is considering them as a conjunction, i.e. as connected by an AND logical operator. Thus, only documents including all the terms expressed in the query are retrieved. This is intended to improve precision, but negatively affects recall. The opposite perspective, of considering the terms as a disjunction, i.e. as connected by an OR logical operator, is indeed exploited for widening the search (an option often reported under the specification ‘Find similar items’ in search engines). It returns result sets having much higher recall, but which are very difficult for the user to handle because of their low precision. An intermediate solution is allowing the user to enter complex queries, with any combination of the NOT, AND and OR logical operators and any nesting of parentheses, in order to better specify his intended target, but this requires setting up quite long and complex logical expressions even for simple queries, and very few users are acquainted with Boolean logic and able to properly exploit its operators. Thus, several techniques have been proposed to automatically improve the quality of the search results without charging inexperienced users with the task of better specifying their needs, but trying to better exploit classical queries made up of just a sequence of terms. One approach is called query expansion, and consists in extending the query actually entered by the user with additional terms computed in such a way as to hopefully result in more hits than those produced by the original query. A strategy for doing this consists in expanding the query with terms that co-occur in some document with those in the original query. Indeed, independently of the weighting function exploited for defining the Vector Space, terms not appearing in a document will have a null weight in the corresponding matrix cell for that document. Conversely, a significant presence of terms related to those in the query, although not exactly corresponding to them, should make up for such an absence, raising the degree of similarity. However, other studies have proved that this approach yields worse results than the original query, since the additional co-occurring terms are usually also very frequent in the collection independently of their relatedness to the query [12].
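For illustration, a minimal version of this classical setting (Term-Document Matrix with TF*IDF weights, query treated as a pseudo-document, ranking by increasing Euclidean distance) can be sketched as follows; the tokenization and the toy collection are of course simplifications and are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Build a minimal Term-Document Matrix with TF*IDF weighting:
    one row per term, one column per document."""
    docs = [doc.lower().split() for doc in documents]
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))           # document frequency
    n = len(docs)
    matrix = []
    for t in vocab:
        idf = math.log(n / df[t])
        matrix.append([d.count(t) * idf for d in docs])     # TF * IDF per document
    return vocab, matrix

def rank(query, documents):
    """Represent the query as a vector in the same space and rank the
    documents by increasing Euclidean distance from it."""
    vocab, m = tfidf_matrix(documents)
    q_terms = query.lower().split()
    q = [q_terms.count(t) for t in vocab]                    # raw term counts for the query
    dists = []
    for j in range(len(documents)):
        col = [m[i][j] for i in range(len(vocab))]
        dists.append(math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, col))))
    return sorted(range(len(documents)), key=lambda j: dists[j])

docs = ["the train leaves from the station",
        "passengers wait on the platform",
        "the locomotive pulls the carriages"]
# Purely lexical matching: the locomotive/carriages document gets no credit
# for being about trains, since it shares no query term.
print(rank("train station", docs))
```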
Another very famous approach is Latent Semantic Indexing (LSI) [8, 11], where the underlying assumption is that the actual meaning of a text is often obscured by the specific words chosen to express it, and hence usual term matching approaches are not suitable to catch such a meaning. The proposed solution is based on a mathematical technique called Singular Value Decomposition (SVD for short), which splits a given matrix into three matrices such that their matrix product yields again the original matrix. In the case of a Term-Document Matrix representing the vector space of a collection of documents, the three matrices obtained by SVD represent, respectively, the relationships of the terms with a number of abstract items that can be considered as the concepts underlying the document collection, the concepts themselves (corresponding to the ‘latent’ semantics underlying the collection) and the relationships between these concepts and the documents. By ignoring the concepts having less weight, and hence less relevance, and focusing on the k most important ones only, the matrix product of the resulting truncated matrices represents an approximation of the original vector space, where the most relevant relationships have been amplified, and hence have emerged, while the less relevant ones have been stripped off. Thus, the new weights reflect the hidden semantics, at the expense of the predominance of lexical information. The LSI approach has been widely appreciated in the literature [7], but is not free of problems. First of all, computing the SVD is computationally heavy even for medium-sized datasets (thousands of documents). Second, the choice of the number of relevant concepts to be considered is still debated among researchers: indeed, since the semantics is ‘latent’, these concepts are not ‘labelled’, and hence one actually does not know what he is keeping in and what he is keeping out when computing the truncated product. Thus, there are only indirect ways of validating such a technique. The interesting point in the LSI approach is that a document can be retrieved according to a query even if the terms in the query are not present in that document. Another semantic approach, which relies on an explicit representation of concepts, proposes to switch from the specific terms to their underlying concepts according to some standard. The idea is that, in this case, the semantics is directly plugged into the mechanism and hence will be considered during the retrieval phase. In other words, the space is no longer Terms by Documents, but rather Concepts by Documents. The first issue is what reference standard for concepts is to be used. Much of the literature agreed to exploit WordNet [10], a famous lexical database that identifies the concepts as the underlying sets of synonymous words that express them. Each such set, called a synset (synonymous set), is given a unique identifier, and several syntactic and semantic relationships are expressed among concepts and between concepts and words. Thus, WordNet is in many respects a perfect resource for bridging the gap between the lexical level (terms) and the semantic one (concepts). The problem in this case is that, due to polysemy, several different concepts can correspond to the same word in the query or in a document. Including all of them in the vector space representation of the document would often yield a much larger and less precise space than the one based on terms, which in turn would make more complex the proper retrieval of
documents. Conversely, trying to identify the only correct concept for each word requires an additional step of Word Sense Disambiguation (WSD) [2], another hot research topic, because no trivial and highly reliable technique yet exists for accomplishing such a task. An approach that mixes the latent semantics with clustering of documents is Concept Indexing [3], where documents are collected, according to some underlying semantics, into clusters such that documents in the same cluster are similar to each other, while documents belonging to different clusters are very different. Such a grouping can be carried out either in a supervised manner or in an unsupervised one, and exploits very efficient algorithms. Then, each cluster is considered as corresponding to a concept, and exploited for dimensionality reduction purposes in the perspective of the retrieval step. Specifically, the dimensionality reduction step takes place by considering as a dimension a representative for each cluster found. It can be an actual document in the cluster, or just a surrogate thereof, computed as a kind of average of the cluster components.
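As a reference for the LSI baseline discussed above (not the approach proposed in this paper), the truncated SVD can be sketched in a few lines; the toy Term-Document matrix and the choice k = 2 are illustrative only.

```python
import numpy as np

def lsi_space(A, k):
    """Truncated SVD as used by LSI: A (terms x documents) is factored into
    U * S * Vt and only the k strongest 'latent concepts' are kept. The
    product of the truncated factors is a rank-k approximation of A in which
    weak term/document relationships are stripped off."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    # term-concept factor, concept strengths, concept-document factor, smoothed space
    return Uk, Sk, Vtk, Uk @ Sk @ Vtk

# Toy Term-Document matrix: 5 terms x 4 documents.
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 1.]])
_, _, _, A_k = lsi_space(A, k=2)
print(np.round(A_k, 2))   # non-zero weights may appear where the original had zeros
```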
3 Exploitation of Co-occurrences
Our proposal for attacking the problem represented by the examples in the Introduction and improving the search results in a text collection is to exploit term co-occurrence to compute a modified version of the Vector Space in which related terms are considered significant for a document even if they do not explicitly appear in it. Indeed, studies in [13] have confirmed that statistics on word co-occurrence can closely simulate some human behaviours concerning Natural Language, and in other works some interest has also been put on term co-occurrence as a way of improving the access to digital libraries [1]. Our idea is similar to (and inspired by) the LSI approach, but our aim is explicitly introducing the co-occurrence factor in the vector space. Indeed, many papers state that term co-occurrence is taken into account by LSI, but no specific demonstration of how much influence co-occurrences have in the overall result seems available in the literature. Moreover, the approach we propose differs also from that in [12], because here the term co-occurrence is plugged directly into the vector space, and not used just for query expansion. More precisely, the way in which we propose to discover such a relationship is by finding sequences of documents in the collection that pairwise have at least one term in common. In this setting, different levels of co-occurrence can be defined. The most straightforward, called co-occurrence of order 1, refers to pairs of terms appearing in the same document. However, it is quite intuitive that the co-occurrence relation fulfils, in some ‘semantic’ sense if not in the mathematical sense, a kind of transitive property: if terms a and b appear in the same document, and hence can be considered as related, and terms b and c co-occur in another document, and hence can be considered as related as well, then also a and c, even in case they never co-occur together in the same document, can be considered in some sense related, due to their common relationship to
the intermediate term b. This is called a co-occurrence of order 2. Taking this approach further, a co-occurrence of order k between two terms t1 and t2 can be defined as the fact that there is a chain made up of k documents, ⟨d1, . . . , dk⟩, such that t1 ∈ d1, t2 ∈ dk, and any two adjacent documents di and di+1 in the chain have at least one term in common. Of course, the longer the chain, the less strict the relation, and hence the lower the value for that co-occurrence should be, down to the value 0, which should be reserved for the case of absolutely no (direct or indirect) connection between the two terms in the document collection. Thus, our proposal is to explicitly take into account co-occurrences when building the vector space for document retrieval. Consider the initial (possibly weighted) Term-Document Matrix A[n × m], where n is the number of terms and m is the number of documents in the collection. A matrix reporting the co-occurrence between any two terms in the collection will be a symmetric matrix sized n × n: let us call it T, and ignore for the moment the order of co-occurrences expressed by such a matrix. By multiplying this matrix and the original Term-Document Matrix, we obtain a new Term-Document Matrix

A′ = T × A        (1)
sized n × m, which re-weights the importance of each term in each document by explicitly taking into account also the term co-occurrences in the collection, so that its elements can be different from 0 even when a term does not appear in a document, but is in some way related to terms actually appearing in that document. Specifically, the value should be larger according to the closeness of such relationships and the number of terms in the document to which that term is related by some order of co-occurrence. As shown in the following, there are two different ways for computing the matrix T.
3.1 The Straightforward Approach: SuMMa
Co-occurrences can be introduced into a vector space by simply following the geometrical definitions of matrices. Indeed, the term co-occurrence of order 1 can be straightforwardly computed by multiplying A by its transposed matrix:

T = A × A^T        (2)
Now, T [n × n] is a Term-Term Matrix whose elements are 0 if and only if the two corresponding terms never appear in the same document, and take a value other than 0 otherwise. Then, co-occurrences of order 2 can be computed by multiplying T by itself and, in general, co-occurrences of order k can be obtained by multiplying T by itself k times:

T_k = T^k = T^{k-1} × T    for k > 1        (3)
Each matrix T_k has size [n × n], and its elements are 0 if and only if there is no chain of documents of length k in the document collection such that the two corresponding terms have a co-occurrence of order smaller than or equal
to k. If weights in the original matrix A are greater than or equal to 1, then the larger the value of a matrix item, the closer the co-occurrence between the two corresponding terms, and hence the stronger the relationship between them. More specifically, each value will be increased by all possible co-occurrences of order at most k that can be found in the document collection. We call the indexing scheme that follows this approach SuMMa (acronym of Successive Multiplication of Matrices). In order to catch even the slightest relatedness between terms, the power to be computed should be T^∞. Clearly, the longest chain possible in a collection including m documents will have length m, which represents a practical upper bound to the power to be computed, but it is likely to be considerably large anyway. However, in more realistic situations, there will be no need to actually compute T^m: applying progressive multiplications, it is possible to stop the procedure at the first k such that T^{k+1} does not change any 0 item with respect to T^k. Although this can significantly reduce the computation required, a different option is to define in advance the desired approximation, by specifying the largest order k that is considered significant, and hence that must be taken into account, and carrying out the multiplications up to just that power: T = T^k.
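A minimal sketch of the SuMMa indexing computation is reported below (Python with NumPy). The dense-matrix representation, the function names and the simple stopping loop are illustrative assumptions of this sketch, not the authors' prototype; the stopping criterion checks, as described above, whether a further multiplication turns any zero entry into a non-zero one.

```python
import numpy as np

def summa_cooccurrence(A, k_max=None):
    """Compute the term-term co-occurrence matrix T of Equations (2)-(3).

    A     : term-document matrix (n terms x m documents), possibly weighted.
    k_max : optional bound on the co-occurrence order to consider.
    """
    T = A @ A.T                      # order-1 co-occurrences, Eq. (2)
    Tk = T.copy()
    k = 1
    while k_max is None or k < k_max:
        T_next = Tk @ T              # order-(k+1) co-occurrences, Eq. (3)
        # stop as soon as no further zero entry becomes non-zero
        if not np.any((Tk == 0) & (T_next != 0)):
            break
        Tk, k = T_next, k + 1
    return Tk

def summa_index(A, k_max=None):
    """Re-weighted term-document matrix A' = T x A of Equation (1)."""
    return summa_cooccurrence(A, k_max) @ A
```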
3.2 The Theoretical Approach: LSE
A very interesting insight into Latent Semantic Analysis, on which LSI is based, has been provided in [4, 5]. There, the authors prove theoretical results according to which, in their opinion, co-occurrence of terms underlies the LSI technique. Actually, to be more precise, they prove that the co-occurrence of terms has an important connection to the SVD, but this does not prove, as a straightforward consequence, that (or to which extent) the very same connection is in some way expressed by the vector space resulting from the application of LSI. Let us first present the result in question. Given a Term-Document Matrix A[n × m], we already pointed out that, by applying SVD, it can be split into three distinct matrices:
– U[n × r], which represents the connection between the terms in the collection and the underlying latent concepts
– W[r × r], a diagonal matrix whose diagonal elements represent the latent concepts and their weight/importance in the collection
– V[m × r], which represents the connection between the documents in the collection and the underlying latent concepts
where n represents the number of terms in the collection, r the number of latent concepts underlying the collection and m the number of documents in the collection, such that:

A = U × W × V^T        (4)
By choosing the number k < r of relevant concepts to be considered, and stripping off the r − k elements of the diagonal of W having lower values, and the
corresponding columns in U and V, an approximation of the original vector space A more centered on the selected concepts can be obtained again as above:

A_k = U_k × W_k × V_k^T        (5)
Now, [4] demonstrated that, if instead of performing the above product, one performs the following:

T = U × W × W^T × U^T        (6)
or, equivalently, its truncated version:

T_k = U_k × W_k × W_k^T × U_k^T        (7)
the resulting matrix has value 0 in all elements for which no co-occurrence of any order exists in the given document collection, and a value different from 0 in all other cases, and that the smaller the order of co-occurrence between two words, the higher such a value. Thus, this matrix can be straightforwardly applied in our approach. We named the approach that exploits this kind of computation LSE (acronym of Latent Semantic with Explicit co-occurrences).
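Under the same caveats as the previous sketch, a corresponding rendering of the LSE computation follows: the SVD of A is truncated to k concepts and the term-term matrix of Equation (7) is built, after which Equation (1) can be applied exactly as in SuMMa.

```python
import numpy as np

def lse_cooccurrence(A, k):
    """Term-term matrix T_k = U_k W_k W_k^T U_k^T of Equation (7)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U W V^T, Eq. (4)
    Uk = U[:, :k]                                       # keep the k strongest concepts
    Wk = np.diag(s[:k])
    # exploit associativity: (U_k (W_k W_k^T)) U_k^T
    return Uk @ (Wk @ Wk.T) @ Uk.T

def lse_index(A, k):
    """Re-weighted term-document matrix A' = T_k x A of Equation (1)."""
    return lse_cooccurrence(A, k) @ A
```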
4 Discussion and Preliminary Results
A first, immediately apparent difference between SuMMa and LSE is in the order of co-occurrence that can be taken into account. Indeed, LSE yields at once a Term-Term Matrix that accounts for all possible co-occurrences of any order, while SuMMa requires a threshold k on the interesting co-occurrence order to be set preliminarily, or else needs to discover on-the-fly the k such that no higher-order co-occurrences can be found in the document collection. In any case, such a value can be large, and require longer computational times. In this respect, the user can decide which approach to use according to various considerations. If he needs to exploit the full set of co-occurrence orders, he can go for LSE. Conversely, if he wants to reduce the indexing computational time, or to purposely set a bound on the order of co-occurrences to be considered (indeed, an intuitive assumption can be that co-occurrences above a given order are not significant and can be safely ignored without loss of retrieval power), he can choose SuMMa. SuMMa is also useful in case one has time constraints, because it can be stopped at any multiplication step and still return a significant result (where only less significant co-occurrences have been ignored), whereas LSE requires all the needed computation to be carried out at once, and no significant intermediate result is available before the process completes. Empirically, a prototypical Java implementation of the two techniques revealed that the time and space requirements of the two approaches are complementary: specifically, space requirements are lower for the LSE approach, while SuMMa is faster. Progressively extending the cardinality of the
document dataset, the preliminary prototype has shown that SuMMa was able to handle matrices of up to about 1500 terms (present in 36 documents), while LSE reached about 2600 (present in 69 documents). As to time, in the above case of a document collection made up of 36 medium-length documents including a total of 1475 terms, applying SuMMa with k = 7 (a threshold that can be considered sufficient to include in the outcome all significant co-occurrences between terms) took about 1 minute for indexing on an Intel Dual Core processor running at 3 GHz, while the LSE approach with truncation k = 2 took 10 minutes for computing the SVD, plus an additional minute for the final matrix multiplication on the same architecture. A quick analysis of the two techniques can explain this behaviour. Since both must perform (1), the difference in complexity depends on the preliminary computations.

Space evaluation. In SuMMa, for (2), matrices T_{n×n} and A_{n×m} are needed, for a total of n^2 + nm values, which must also be kept for (3), where additionally T_k and T^{k-1} are needed, both of size n × n, hence yielding a total memory requirement of 3n^2 + nm. Conversely, LSE needs to store matrices U_{n×r}, W_{r×r}, V_{r×m} as a result of the SVD in (4), where W is diagonal (and hence can be reduced to just the vector of r diagonal values), and r (the rank of A) is usually comparable or equal to m. This results in a total of m^2 + 2nm + m values, which can be significantly reduced after truncation of r to k in (5). Then, for the computation of (7), T_{n×n}, U_{n×k} and W_{k×k} are needed, i.e. (representing W as a vector of k elements) n^2 + kn + k values. Hence, if n > m, step (3) is the most demanding one, which makes SuMMa the worse option, whereas if m > n the worst becomes LSE, because of step (4).

Time evaluation. (2) computes n^2 elements in T by m multiplications and m−1 additions, for a total of n^2·m steps, while k repetitions of (3) each compute n^2 elements by n multiplications and n−1 additions, for a total of k·n^3 steps. Thus, overall SuMMa requires 2(kn^3 + mn^2) steps (multiplications and sums). Then, the SVD has been proved to have complexity O(min(n^2·m, m^2·n)). As to (6), it can be performed, by the associative property of matrix multiplication, as (U_k × (W_k × W_k^T)) × U_k^T: W_k × W_k^T requires k^2 multiplications; the intermediate product results in an n × k matrix, each of whose elements is obtained by a single multiplication, for a total of kn; the external product produces an n × n matrix, each of whose elements is obtained by k multiplications and k−1 summations, for a total of kn^2. Thus, overall LSE requires O(min(n^2·m, m^2·n)) + kn^2 + kn + k^2 steps (which, considering again k comparable to m, becomes O(min(n^2·m, m^2·n)) + mn^2 + mn + m^2).

As to the quality of the results, the following experiment was run. The set of 36 categories included by WordNet Domains [9] under the section ‘Social Science’ was selected: social science, folklore, ethnology, anthropology, body care, health, military, school, university, pedagogy, publishing, sociology, artisanship, commerce, industry, aviation, vehicles, nautical, railway, transport, book keeping, enterprise, banking, money, exchange, finance, insurance, tax, economy, administration, law, diplomacy, politics, tourism, fashion, sexuality. Then, the Wikipedia (www.wikipedia.org) Web page for each such category was downloaded, and the resulting collection was indexed using both SuMMa and LSE. Thus, issuing a query, each of the retrieved documents represents a possible category for the
query, and hence this can be seen as an approach to the Text Categorization [16] problem. Note that this is not the ideal environment in which to apply the proposed technique: indeed, since a single document is present for each category, and the categories are quite disjoint (they should in principle represent a partition of the ‘Social Science’ section), few co-occurrences can be expected to be found. In order to avoid the problems concerning co-occurrences of frequent terms that are not very significant to the specific query interests, we used for the original matrix A the TF∗IDF weighting scheme [14], which should assign small values to terms that appear frequently and uniformly in the collection, and hence are not discriminative for particular queries. Of course, other weighting schemata that fulfil the same requirement can be adopted as well. The objective was retrieving all the Wikipedia pages related to a given query, but not the others. More precisely, since for any query it is always possible to assess a similarity to any document in the collection, the technique always returns the whole set of documents, ordered by decreasing similarity. Thus, the actual objective consisted in finding all the Web pages related to the query in the top ranking positions, and the others in the lower positions. The technique should work because one can assume that articles concerning the same subject share many specific technical terms for that subject, and hence the co-occurrence weight of such terms should be high. Clearly, a number of general terms related to social science can be found in all documents, and hence there is a chance of having at least a chain that can connect any article to any other, which represents an additional difficulty that stresses the approach and tests its robustness. Here we report two sample queries, and the corresponding top positions of the ranking. The former is a sentence actually present in the document ‘school’: “A school is an institution designed to allow and encourage students to learn, under the supervision of teachers”.
LSE: School > Pedagogy > Law > University > Banking > ...
SuMMa: School > Sociology > University > Ethnology > Aviation > ...
The latter is a pair of words that are not specifically taken from any indexed document: “commerce transport”.
LSE: Nautical > Transport > Vehicles > Commerce > Railway > ...
SuMMa: Tax > Administration > Transport > Politics > Vehicles > ...
In the former case, both techniques retrieved the correct document in the first position. Then, they retrieved several similar documents/categories in the following positions, although with different rankings. Interestingly, SuMMa returns ‘Ethnology’ in fourth position, which has some relatedness to the query, although no query term is present in the corresponding Web page. In the latter case, both techniques return sensible, although partly different, results in all positions. We can conclude that these preliminary results showed that the technique is able to select the relevant documents and place them in the top positions of the ranking. Although spurious documents are sometimes found, even in the first places, overall the top-ranked documents are actually related to the query, and hence
the precision is quite high in the first positions. An interesting point is that the technique seems to work even for short queries made up of a single word or very few words, while other techniques based on latent semantics require longer queries in order to better assess the network of relationships with the various documents.
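For concreteness, once the re-weighted matrix A′ has been built with either technique, a query can be ranked against it by standard cosine similarity between the query vector and the document columns; the sketch below (Python/NumPy) illustrates this retrieval step under that assumption and does not reproduce the authors' prototype.

```python
import numpy as np

def rank_documents(A_prime, query_terms, vocabulary):
    """Rank documents (columns of A') by cosine similarity to a term query.

    A_prime     : re-weighted term-document matrix (n terms x m documents)
    query_terms : list of query words
    vocabulary  : list of the n indexed terms, aligned with the rows of A_prime
    """
    q = np.zeros(A_prime.shape[0])
    for t in query_terms:
        if t in vocabulary:
            q[vocabulary.index(t)] = 1.0
    # cosine similarity between the query and every document column
    doc_norms = np.linalg.norm(A_prime, axis=0) + 1e-12
    sims = (q @ A_prime) / (np.linalg.norm(q) + 1e-12) / doc_norms
    return np.argsort(-sims)          # document indices, best first
```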
5 Conclusions
Reaching high precision and recall rates in the results of term-based queries on text collections is becoming more and more crucial, as the amount of available documents increases and their quality tends to decrease. In particular, retrieval techniques based on the strict correspondence between terms in the query and terms in the documents miss important and relevant documents when it just happens that the terms selected by their authors are slightly different from those used by the end user who issues the query. Several approaches proposed in the literature try to tackle this problem by switching to an (implicit or explicit) semantic level, but this solution introduces further problems that have not been completely solved yet. Our proposal is to remain at the purely lexical level, which ensures simpler handling, but to explicitly consider term co-occurrence when building the vector space. Indeed, although the actual presence of a query term in a document is clearly a significant hint of the relevance of the document, its absence does not necessarily mean that the document is irrelevant: the presence in a document of terms different from, but related to, those in the query should strengthen the confidence that the document is relevant as well. Missing a query term in a document, but finding several terms strictly related to it, should equally support the hypothesis that the document is actually relevant. The computational perspective of such a relatedness that we proposed to adopt consists in direct or indirect term co-occurrence in the collection. We proposed two different approaches to enforce such a perspective, and ran preliminary experiments on a prototypical implementation, which suggested that this technique is potentially profitable. Currently, more extensive experiments are being run to obtain a statistically significant assessment of the performance of the proposed technique in terms of precision and recall. Moreover, a more thorough comparison to other existing techniques, and in particular to LSI, is planned to highlight the strengths and weaknesses of each with respect to the other. Future work will also include the definition of techniques to threshold the search results, so as to avoid displaying the whole list. Indeed, not being tied to the actual presence of specific terms, the query result always includes in the ranking all documents in the collection. This means that also the precision/recall evaluation needs to be carried out after defining a precise strategy to select the results to be considered. Available options are using an absolute threshold for the final weight assigned to each document, or a fixed number of results to be returned independently of the actual weights, or a technique based on the difference in weight between two adjacent elements in the ranking, or a combination thereof. Other issues to be studied are related to the efficiency improvement in terms of space and time of
computation, in order to make the technique practically viable also on real-sized document collections.
References
[1] Buzydlowski, J.W., White, H.D., Lin, X.: Term co-occurrence analysis as an interface for digital libraries. In: Proceedings of the Joint Conference on Digital Libraries (2001)
[2] Ide, N., Véronis, J.: Word sense disambiguation: The state of the art. Computational Linguistics 24, 1–40 (1998)
[3] Karypis, G., Han, E.-H.: Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical Report 00-016, University of Minnesota, Minneapolis (March 2000)
[4] Kontostathis, A., Pottenger, W.M.: Detecting patterns in the LSI term-term matrix. In: Proceedings of the ICDM 2002 Workshop on Foundations of Data Mining and Discovery (2002)
[5] Kontostathis, A., Pottenger, W.M.: A framework for understanding latent semantic indexing (LSI) performance. Inf. Process. Manage. 42(1), 56–73 (2006)
[6] Krovetz, R.: More than one sense per discourse. NEC Princeton NJ Labs, Research Memorandum (1998)
[7] Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 111–140 (1997)
[8] Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)
[9] Magnini, B., Cavaglià, G.: Integrating subject field codes into WordNet. In: Proceedings of LREC 2000, Second International Conference on Language Resources and Evaluation, pp. 1413–1418 (2000)
[10] Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
[11] O'Brien, G.W.: Information management tools for updating an SVD-encoded indexing scheme. Technical Report CS-94-258, University of Tennessee, Knoxville (October 1994)
[12] Peat, H.J., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science 42(5), 378–383 (1991)
[13] Rapp, R.: The computation of word associations: Comparing syntagmatic and paradigmatic approaches. In: Proceedings of the 19th International Conference on Computational Linguistics (2002)
[14] Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
[15] Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)
[16] Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1–47 (2002)
Semantic Relatedness Approach for Named Entity Disambiguation

Anna Lisa Gentile1, Ziqi Zhang2, Lei Xia3, and José Iria4

1 Department of Computer Science, University of Bari, Italy – [email protected]
2 Department of Computer Science, The University of Sheffield, UK – {z.zhang,l.xia}@dcs.shef.ac.uk
3 Archaeology Data Service, University of York, UK – [email protected]
4 IBM Research - Zurich, Switzerland – [email protected]
Abstract. Natural Language is a means to express and discuss concepts, objects, events, i.e., it carries semantic contents. One of the ultimate aims of Natural Language Processing techniques is to identify the meaning of the text, providing effective ways to make a proper linkage between textual references and their referents, that is, real-world objects. This work addresses the problem of giving a sense to proper names in a text, that is, automatically associating words representing Named Entities with their referents. The proposed methodology for Named Entity Disambiguation is based on Semantic Relatedness Scores obtained with a graph-based model over Wikipedia. We show that, without building a Bag of Words representation of the text, but only considering named entities within the text, the proposed paradigm achieves results competitive with the state of the art on two different datasets.
1 Introduction

Reading a written text implies the comprehension of the information that words are carrying. Comprehension is an intrinsic capacity for a human, but not for a machine. Providing machines with such an ability, by anchoring meanings to words, can be considered a task with great significance for Artificial Intelligence. The focus of this work is on proper names, that is, on those words within a text that represent entities: we want to give a meaning to such pieces of text, which carry high information potential. Many tasks could benefit from such added value, such as Information Retrieval (for example, lots of Web Search queries concern Named Entities). We propose an automatic method to associate a unique sense (the referent, which will also be referred to in the remainder of this work as meaning, concept or simply sense) to each entity (the reference within the text), exploiting Wikipedia1 as a freely available
The work was done while this author was at The University of Sheffield, UK as visiting researcher. The work was done while this author was working at The University of Sheffield, UK.
1 http://en.wikipedia.org/wiki/Wikipedia
Knowledge Base, showing a novel solution for Named Entity Disambiguation (NED). We show the correctness of the proposed methodology with two experimental sessions, consolidating the results obtained in [1]. Our contributions are twofold. Firstly, we use a Semantic Relatedness approach to NED based on a random-walk model. Graph-based models have previously been applied to Word Sense Disambiguation (WSD) [2,3,4,5] but not experimented with for the problem of NED: to the best of our knowledge, previous approaches to NED were based on the Vector Space model, treating concepts and context texts as a bag of words [6,7], while graph-based models have been exploited for a specific type of NED, which is Person Name Disambiguation [8], or for specific domains, such as bibliographic citations [9]. The solution proposed in this work exploits Semantic Relatedness Scores (calculated with a random walk on a graph) as input for the disambiguation step. Secondly, we introduce a different way of representing the context of the target entity which, rather than consisting of surrounding words, is composed only of the other named entities present in the text. Our approach has the advantage of using relatedness scores independently for the NED task, that is, semantic relations are used as input for NED: this is useful in terms of a clear separation of two different functions, which can be refined and improved separately from each other. Compared to the best results by Cucerzan [7], which are accuracies of 91.4% and 88.3%, our method achieves competitive accuracies of 91.5% and 89.8% respectively, and it adds the benefit of having two clearly separate steps (relatedness scores, disambiguation), thus offering the prospect of improving results in both directions. The work is structured as follows: Section 2 proposes an overview of the NED task, with a focus on available solutions exploiting Wikipedia. Section 3 presents the proposed NED methodology, describing in detail the four designed steps. Section 4 presents the experiments carried out to validate the proposed solution, and finally conclusions close the paper.
2 Related Work

In Natural Language Processing, Named Entity Disambiguation is the problem of mapping mentions of entities in a text with the object they are referencing. It is a step further from Named Entity Recognition (NER), which involves the identification and classification of so-called named entities: expressions that refer to people, places, organizations, products, companies, and even dates, times, or monetary amounts, as stated in the Message Understanding Conferences (MUC) [10]. The NED process aims to create a mapping between the surface form of an entity and its unique dictionary meaning. A dictionary of all possible entity entries is assumed to be available. In this work we use Wikipedia as such a dictionary. Many studies that exploit Wikipedia as a knowledge source have recently emerged [11,12,13]. In particular, Wikipedia turned out to be very useful for the problem of Named Entities due to its greater coverage than other popular resources, such as WordNet [14], which, being closer to a dictionary, has little coverage of named entities [12]. Many previous works exploited Wikipedia for the task of NER, e.g., to extract gazetteers [15] or as an external source of features to use in a Conditional Random Field NER-tagger [16], or to improve entity ranking in the field
of Information Retrieval [17]. On the other hand, little has been carried out in the field of NED. The most related works on NED based on Wikipedia are those by Bunescu and Pasca [6] and Cucerzan [7]. Bunescu and Pasca consider the problem of NED as a ranking problem. The authors define a scoring function that takes into account the standard cosine similarity between words in the context of the query and words in the page content of Wikipedia entries, together with correlations between pages learned from the structure of the knowledge source (mostly using the Wikipedia Categories assigned to the pages). Their method achieved accuracy between 55.4% and 84.8% [6]. Cucerzan proposes a very similar approach: the vectorial representation of the document is compared with the vectorial representation of the Wikipedia entities. In more detail, the proposed system represents each entity of Wikipedia as an extended vector with two principal components, corresponding to context and category information; then it builds the same kind of vector for each document. The disambiguation process consists of maximizing the Context Agreement, that is, the overlap between the document vector for the entity to disambiguate and each possible entity vector. The best result for this approach is an accuracy of 91.4% [7]. Both described works are based on the Vector Space Model, which means that a pre-computation on the Wikipedia knowledge resource is needed to build the vector representation. What is more, their methods treat the content of a Wikipedia page as a bag-of-words (with the addition of category information), without taking into account other structural elements in Wikipedia. Another method for NED based on Wikipedia is the WibNED algorithm [18]. This method is an adaptation of the Lesk dictionary-based WSD algorithm [19], with the difference that in the WibNED algorithm the words to disambiguate are only those representing an entity. WibNED takes as input a document d = {w_1, . . . , w_j, e_{j+1}, w_{j+2}, . . . , w_h, e_{h+1}, w_{h+2}, . . . } and returns a list of Wikipedia URIs X = {s_1, s_2, . . . , s_k} in which each element s_i is obtained by disambiguating the target entity e_i on the basis of the information obtained from Wikipedia for each candidate URI (the Wikipedia page content of the URI) and the words in the context C of e_i. The context C of the target entity e_i is defined to be a window of n words to the left and another n words to the right, for a total of 2n surrounding words. If other entities occur in the context of the target entity, they are considered as words and not as entities. The main limitation of this approach is that it is merely based on word overlapping and there is no specific weight associated with different categories of words in the context (entities and common words are treated the same way). Contrary to these works, we propose a novel method, which uses a graph model combining multiple features extracted from Wikipedia. We calculate Semantic Relatedness over this graph and we exploit the obtained relatedness values to resolve the problem of NED. Graph-based models have been applied to a subtask of NED, which is Person Name Disambiguation, usually benefiting from the social networks in people-related tasks. Minkov et al. [8] consider extended similarity metrics for documents and other objects embedded in graphs, implemented via a lazy graph walk. They provide an instantiation of this framework for email data, where content, social networks and a timeline are integrated in a structural graph.
The suggested framework is evaluated for two email-related problems: disambiguating names in email documents, and threading. Resolving the
referent of a person name is also an important complement to the ability to perform named entity extraction for tasks like social network analysis or studies of social interaction in email. The authors model this problem as a search task: based on a name mention in an email message m, they formulate a query distribution V_q, and then retrieve a ranked list of person nodes. Experiments carried out on the Cspace corpus [20], manually annotated with personal names, show that reranking schemes based on the graph-walk similarity measures often outperform baseline methods, with a maximum accuracy of 83.8%. The main differences between this method and the one we propose within this work are that our method is applicable to all kinds of proper names, because it does not rely on resources such as social networks, and that relatedness scores can be used offline after they have been calculated. Semantic relatedness between words or concepts measures how much two words or concepts are related by encompassing all kinds of relations between them, such as hypernymy, hyponymy, antonymy and functional relations. There is a large body of literature on computing semantic relatedness between words or concepts using knowledge extracted from Wikipedia, such as [12] and [21]. However, the main limitation of these methods is that they only make use of one or two types of features; and they generally adapt WordNet-based approaches [22,14,23] by employing similar types of features extracted from Wikipedia. In contrast, we believe that other information content and structural elements in Wikipedia can also be useful for the semantic relatedness task, and that combining various features in an integrated model is crucial for improving performance. For this reason, we propose a random graph walk model based on a combination of features extracted from Wikipedia for computing semantic relatedness.
3 Methodology

Given a set of surfaces and their corresponding concept relatedness matrix R, our NED algorithm returns one sense (concept) for each surface, which is collectively determined by the other surfaces and their corresponding concepts. To achieve this goal the proposed method performs four main sequential steps:
1) each text is reduced to the list of surfaces of the appearing entities;
2) for each surface, Wikipedia is used to retrieve all its possible meanings (also denoted as concepts) and build a feature space for each of them;
3) all concepts, their features and relations are transformed into a graph representation: a random graph walk model is then applied to combine the effects of features and derive a relatedness score;
4) for each surface a single meaning is chosen, taking into account Semantic Relatedness within the entity graph.

3.1 Concept Retrieval

In more detail, as a starting point for the proposed methodology we assume that each text has been reduced to the list of its contained named entity surfaces, as can simply be obtained with a standard NER system, e.g., Yamcha [24]. Then, for each surface, Wikipedia is used to retrieve all its possible meanings and build a feature space for each of them. More precisely, we query Wikipedia using the surface to retrieve relevant pages. If
a surface matches an entry in Wikipedia, a page will be returned. If the surface has only one sense defined in Wikipedia then we have a single result: the page describing the concept that matches the surface form. We refer to this page as the sense page for the concept. Alternatively, a disambiguation page may be returned if the surface has several senses defined in Wikipedia. Such a page lists the different senses as links to other pages, with a short description of each one. For the purpose of this work, we deliberately choose the disambiguation page for every surface, which means we query Wikipedia by appending the string “(disambiguation)” to the surface words, follow every link on the returned page, and keep all sense pages for that surface. Thus, for every surface, we obtain a number of concepts (represented as sense pages) as input to our disambiguation algorithm. Once we have identified the relevant concepts and their sense pages for the input concept surface forms, we use the sense page retrieved from Wikipedia for each concept to build its feature space. We identify the following features that are potentially useful:
1. Words composing the titles of a page (title words): words in the title of a sense page, plus words from all its redirecting links in Wikipedia (different surfaces for the same concept).
2. Top n most frequently used words in the page (frequent words n): prior work makes use of words extracted from the entire page [12], or only those from the first paragraph [21]. In our work, we use the most frequent words, based on the intuition that word frequency indicates the importance of the word for representing a topic.
3. Words from categories (cat words) assigned to the page: each page in Wikipedia is assigned several category labels. These labels are organized as a taxonomy. We retrieve the category labels assigned to a page by performing a depth-limited search of depth 2, and split these labels into words.
4. Words from outgoing links on the page (link words): the intuition is that links on the page are more likely to be relevant to the topic, as suggested by Turdakov and Velikhov [25].
Thus, for each concept, we extract the above features from its sense page, and transform the text features into a graph conforming to the random walk model, which is used to compute Semantic Relatedness between meanings belonging to different surfaces.

3.2 Random Graph Walk Model

A random walk is a formalization of the intuitive idea of taking successive steps in a graph, each in a random direction [26]. Intuitively, the harder it is to arrive at a given node starting from another, the less related the two nodes are. The advantage of a random-walk model lies in its strength of seamlessly combining different features to arrive at one single measure of relatedness between two entities [27]. Specifically, we build an undirected weighted typed graph that encompasses all concepts identified in the page retrieval step and their extracted features. The graph is a 5-tuple G = (V, E, t, l, w), where V is the set of nodes representing the concepts and their features; E ⊆ V × V is the set of edges that connect concepts and their features, representing an undirected path from concepts to their features, and vice versa; t : V → T is the node type function (T = {t_1, . . . , t_{|T|}} is a set of types, e.g., concepts, title words, cat words, . . . ),
Fig. 1. The Graph representation model of concepts, features, and their relations. Circles indicate nodes (V) representing concepts and features; bold texts indicate types (T) of nodes; solid lines connecting nodes indicate edges (E), representing relations between concepts and features; italic texts indicate types (L) of edges. Different concepts may share features, enabling walks on the graph.
l : E → L is the edge label function (L = {l_1, . . . , l_{|L|}} is a set of labels that define relations between concepts and their features), and w : L → R is the label weight function that assigns a weight to an edge. Figure 1 shows a piece of the graph with the types and labels described before. Concepts sharing the same features will be connected via the edges that connect features and concepts. We define weights for each edge type, which, informally, determine the relevance of each feature for establishing the relatedness between any two concepts. Let L_{t_d} = {l(x, y) : (x, y) ∈ E ∧ t(x) = t_d} be the set of possible labels for edges leaving nodes of type t_d. We require that the weights form a probability distribution over L_{t_d}, i.e.,

∑_{l ∈ L_{t_d}} w(l) = 1
We build an adjacency matrix of locally appropriate similarity between nodes as

W_ij = ∑_{l_k ∈ L} w(l_k) / |{(i, ·) ∈ E : l(i, ·) = l_k}|   if (i, j) ∈ E,   and   W_ij = 0 otherwise        (1)
where W_ij is the i-th row and j-th column entry of W, indexed by V. The above equation distributes uniformly the weight of edges of the same type leaving a given node. The weight model (wm), that is, the weights associated with each type of edge in the graph, has been determined by applying a simulated annealing method [28]. The algorithm explores the search space of all possible combinations of feature weights and iteratively reduces the difference between a gold standard solution and that of our system. The algorithm allows us to run our method on one dataset in an iterative manner, where in each iteration the algorithm generates a random wm for the feature set and scores our system's results obtained with that model against the gold standard. If a wm obtained in the current iteration produces better results than the previous one, the simulated annealing
algorithm will attempt to adjust the weights based on that model in the next iterations. Thus, by running simulated annealing for a relatively large number of iterations, we expect the system performance to converge, by which we obtain the final optimum weight model for that feature set. This tuning has been done in advance using a standard testing dataset for semantic relatedness, the WordSimilarity-353 Test Collection [29], from which we empirically derived the optimum weight model for our chosen feature set. To simulate the random walk, we apply a matrix transformation using the formula P^(t)(j | i) = [(D^{-1} W)^t]_{ij}, as described by Iria et al. in [27], where D is the diagonal degree matrix given by D_ii = ∑_k W_ik, and t is a parameter representing the number of steps of the random walk. In our work, we have set t = 2 in order to compute the relatedness for walks that start in a concept and traverse one feature to arrive at another concept. Unlike PageRank [30], we are not interested in the stationary behavior of the model. The resulting matrix of this transition P^(t)(j | i) is a sparse, non-symmetric matrix filled with the probabilities of reaching node i from j after t steps. To transform probability into relatedness, we use the observation that the probability of walking from i to j and then coming back to i is always the same as starting from j, reaching i and then coming back to j. Thus we define a transformation function as:

Rel(i | j) = Rel(j | i) = (P^(t)(j | i) + P^(t)(i | j)) / 2        (2)

and we normalize the score to the range [0, 1] using:

Rel(i | j) = Rel(j | i) / max Rel(j | i)        (3)
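A compact sketch of this walk-and-symmetrize computation is given below (Python/NumPy); the weighted adjacency matrix W of Equation (1) is assumed to be already built, and the global-maximum normalization is one possible reading of Equation (3). This is an illustrative rendering, not the system's actual implementation.

```python
import numpy as np

def relatedness_from_walk(W, t=2):
    """Turn the typed-edge adjacency matrix W into a relatedness matrix.

    W : square adjacency matrix over concept and feature nodes (Equation (1)).
    t : number of random-walk steps (t=2: concept -> feature -> concept).
    """
    D_inv = np.diag(1.0 / (W.sum(axis=1) + 1e-12))    # inverse of the degree matrix D
    P = np.linalg.matrix_power(D_inv @ W, t)           # P^(t)(j|i) = [(D^-1 W)^t]_ij
    Rel = (P + P.T) / 2.0                              # Equation (2): symmetrize
    return Rel / (Rel.max() + 1e-12)                   # Equation (3): normalize
                                                       # (global maximum assumed here)
```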
3.3 Named Entity Disambiguation

The final step of the methodology consists of choosing a single meaning (concept) for each entity surface, exploiting the Semantic Relatedness scores derived from the graph. Given S = {s_1, . . . , s_n}, the set of surfaces in a document, C = {c_{1k}, . . . , c_{mk}} (with k = 1 · · · |S|), the set of all their possible senses (concepts), and R the matrix of relatedness Rel(i | j), with each cell indicating the strength of relatedness between concept c_{ik} and concept c_{jk′} (where k ≠ k′, that is, c_{ik} and c_{jk′} have different surface forms), we define the entity disambiguation algorithm as a function f : S → C that, given a set of surfaces S, returns the list of disambiguated concepts, using the concept relatedness matrix R. We define different kinds of such functions f and compare their results in Section 4. As a first and simple disambiguation function we define the highest method: we build cand_{ki}, the list of candidate winner concepts for each surface, with i being the candidate concept for surface k (k = 1 · · · |S|); if some surface k has more than one candidate winner, for each such surface with multiple i values we simply pick the concept that, among the candidates, has the highest value in the matrix R. The combination method calculates for each concept c_{ik} the sum of its relatedness with all concepts c_{jk′} from different surfaces (such that j ≠ i, k ≠ k′). Given V = {v_1, . . . , v_{|C|}}, the vector of such values, the function returns for each surface s_k the concept c_{ik} that has the maximum v_i.
The propagation method works as follows: taking as seed the highest similarity value in the matrix R, we fix the two concepts i and j giving that value; for their surface forms k and k′ we delete the rows and columns in the matrix R coming from the other concepts for the same surfaces (all c_{tk} and c_{tk′} with t ≠ i and t ≠ j). This step is repeated recursively, picking the next highest value in R. The stop condition consists of having one concept row in the matrix R for each surface form. In the following section we present our experiments and evaluation.
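As an illustration of the disambiguation step, the combination function can be rendered as follows (Python); representing surfaces and candidate concepts with plain dictionaries is an assumption of this sketch rather than the paper's data structures.

```python
def disambiguate_combination(candidates, rel):
    """Pick one concept per surface by the 'combination' criterion.

    candidates : dict surface -> list of candidate concept ids
    rel        : dict (concept_i, concept_j) -> relatedness score, defined for
                 concepts belonging to *different* surfaces (0 if absent)
    """
    chosen = {}
    for surface, concepts in candidates.items():
        # concepts proposed for all the other surfaces in the document
        others = [c for s, cs in candidates.items() if s != surface for c in cs]

        def total_relatedness(c):
            # sum relatedness in either storage order (upper-triangular matrix)
            return sum(rel.get((c, o), rel.get((o, c), 0.0)) for o in others)

        chosen[surface] = max(concepts, key=total_relatedness)
    return chosen
```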
4 Experiments

We performed the experiment with an “in vitro evaluation”, which consists of testing systems independently of any application, using specially constructed benchmarks. What we want to prove is that the usage of Semantic Relatedness scores is profitable for the issue of NED and that the graph of interconnections between entities is influential for the disambiguation decision. As a benchmark to test our system we used the data provided by Cucerzan in [7], which is publicly available2. The test data consist of two different datasets. Each dataset consists of several documents containing a list of Named Entities, labelled with the corresponding page in Wikipedia. As described in Section 3, we retrieve concepts for each surface and build a graph with all the identified possible concepts for each text. After running the Random Walk on the built graph and transforming the transition matrix into a relatedness matrix, we obtain an upper triangular matrix with a relatedness score between different concepts belonging to different surfaces. The first dataset we used for experiments, which we will refer to in what follows as NEWS, consists of 20 news stories: for each story the list of all entities is provided, annotated with the corresponding page in Wikipedia. The number of entities in each story can vary from 10 to 50. Some of the entities have a blank annotation, because they do not have a corresponding page in the Wikipedia collection: among all the identified entities, 370 are significantly annotated in the test data. As input for our system we started from the list of entities spotted in the benchmark data, and for each entity the list of all possible meanings is retrieved; e.g., for the surface “Alabama” the following concepts are retrieved: Alabama −→ [Alabama Claims | Genus | CSS Alabama | Alabama River | Alabama (people) | Noctuidae | Harvest (album) | USS Alabama | Alabama language | Alabama (band) | Moth | University of Alabama | Alabama, New York]
The second dataset, which we will refer to in what follows as WIKI, consists of 350 Wikipedia entity pages selected randomly. The text articles contain 5,812 entity surface forms. We performed our evaluation on 3,136 entities, discarding all non-recallable surfaces, that is, all those surfaces having no correspondence in the Wikipedia collection. Indeed, an error analysis carried out by Cucerzan on this dataset showed that it contains many surface forms with erroneous or out-of-date links, as reported in [7]. We evaluate performance in terms of accuracy, that is, the ratio of the number of correctly disambiguated entities to the total number of entities to disambiguate. The results obtained by applying all the defined disambiguation functions to the relatedness matrix are shown in
http://research.microsoft.com/users/silviu/WebAssistant/TestData
Table 1 for the NEWS dataset and in Table 2 for the WIKI dataset. Both tables also report the figures obtained by Cucerzan on the same datasets [7]. In the first experiment the best result equals the best available system at the state of the art, whose accuracy is 91.4%, and all the proposed functions are above the baseline of 51.7% (the baseline always returns the first available result). Among the three proposed methods, the combination method obtained the best result, equalling the best available system at the state of the art. The highest method achieves results below the state of the art of 91.4%, even if, with an accuracy of 82.2%, it is far above the baseline of 51.7%. The motivation can be that it takes into account only the best relatedness score for each concept to decide the sense assignment, without considering the rest of the scores. The propagation method works even worse, because it adds the propagation of errors to the disadvantage of the former. It reaches an accuracy of 68.7%, which lies between the baseline and the state of the art.

Table 1. Comparison of the proposed Named Entity Disambiguation functions on the NEWS dataset

  Literature Systems        Accuracy  |  Function      Accuracy
  Cucerzan baseline [7]     51.7%     |  Highest       82.2%
  Cucerzan [7]              91.4%     |  Combination   91.5%
                                      |  Propagation   68.7%
To consolidate this result we conducted the same experiment on the WIKI dataset.

Table 2. Comparison of the proposed Named Entity Disambiguation functions on the WIKI dataset

  Literature Systems        Accuracy  |  Function      Accuracy
  Cucerzan baseline [7]     86.2%     |  Highest       87.1%
  Cucerzan [7]              88.3%     |  Combination   89.8%
                                      |  Propagation   84.3%
The second experiment definitively confirms the trend reported in the first one. The three proposed disambiguation functions have the same behavior on the WIKI dataset. The combination method is the best one: it achieves an accuracy of 89.8%, outperforming the accuracy of 88.3% reported by the state-of-the-art system. The highest method is between the baseline and the state-of-the-art system, with an accuracy of 87.1%. The propagation method is the worst one: with an accuracy of 84.3%, it is below the baseline of 86.2%. As expected and already assessed in the NEWS experiment, the combination method performs much better than the others, outperforming the state-of-the-art system on the WIKI dataset. The motivation can be found in the fact that it considers the relatedness scores in their entirety, giving value to the interaction of all concepts instead of pairs of concepts. We consider this an encouraging outcome for the proposed novel method: the second experiment reinforces the results of the first one, and supports the correctness of the proposed methodology.
5 Conclusions

In this work we proposed a novel method for Named Entity Disambiguation. Experiments showed that the paradigm achieves significant results: the overall accuracy is 91.5% and 89.8% on two different datasets, which is competitive with the state of the art. The accuracy reached hints at the usefulness of Semantic Relatedness measures for the process of NED. A consideration must be made on the decision to use Wikipedia as the entity inventory: in terms of lexical coverage, especially with reference to specific domains, this choice imposes some limitations. It is the case of specific technical areas, such as Medicine, Biology, specific fields of engineering, etc., which can have little or no coverage within the Wikipedia source. However, the methodology is still applicable when the entity inventory is a different Knowledge Base: the only requirement is that we have coverage for the entities of interest and that it is possible to calculate relatedness scores between them. We could easily replace Wikipedia with a domain ontology, obtaining relatedness scores and then applying the same proposed NED step. Theoretically we might expect this to be the case but, obviously, experiments are needed to prove the efficacy of such an approach, and future work can follow this direction. Also, as future work, we plan to design new disambiguation functions over the relatedness matrix to achieve better results.
References 1. Gentile, A.L., Zhang, Z., Xia, L., Iria, J.: Graph-based Semantic Relatedness for Named Entity Disambiguation. In: Dicheva, D., Nikolov, R., Stefanova, E. (eds.) Proceedings of S3T 2009: International Conference on Software, Services and Semantic Technologies, Sofia, Bulgaria, October 28-29, pp. 13–20 (2009) 2. Agirre, E., Mart´ınez, D., L´opez de Lacalle, O., Soroa, A.: Two graph-based algorithms for state-of-the-art WSD. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, pp. 585–593. Association for Computational Linguistics (July 2006) 3. Mihalcea, R.: Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In: HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, The Association for Computational Linguistics (2005) 4. Navigli, R., Lapata, M.: Graph connectivity measures for unsupervised word sense disambiguation. In: Veloso, M.M. (ed.) IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1683–1688 (2007) 5. Sinha, R., Mihalcea, R.: Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity. In: Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), pp. 363–369. IEEE Computer Society, Los Alamitos (2007) 6. Bunescu, R.C., Pasca, M.: Using Encyclopedic Knowledge for Named Entity Disambiguation. In: EACL 2006, 11st Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, The Association for Computer Linguistics (2006) 7. Cucerzan, S.: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 708–716. Association for Computational Linguistics (June 2007)
8. Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual search and name disambiguation in email using graphs. In: Efthimiadis, E.N., Dumais, S.T., Hawking, D., J¨arvelin, K. (eds.) SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, pp. 27–34. ACM, New York (2006) 9. Kalashnikov, D.V., Mehrotra, S.: A probabilistic model for entity disambiguation using relationships. In: SIAM International Conference on Data Mining, SDM (2005) 10. Grishman, R., Sundheim, B.: Message Understanding Conference- 6: A Brief History. In: COLING, pp. 466–471 (1996) 11. Ponzetto, S.P., Strube, M.: Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution. In: Moore, R.C., Bilmes, J.A., Chu-Carroll, J., Sanderson, M. (eds.) Proceedings of HLT-NAACL, Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, ACL (2006) 12. Strube, M., Ponzetto, S.P.: WikiRelate! Computing Semantic Relatedness Using Wikipedia. In: Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pp. 1419–1424. AAAI Press, Menlo Park (2006) 13. Zesch, T., Gurevych, I., M¨uhlh¨auser, M.: Analyzing and Accessing Wikipedia as a Lexical Semantic Resource. In: Biannual Conference of the Society for Computational Linguistics and Language Technology (2007) 14. Leacock, C., Chodorow, M.: Combining local context and WordNet similarity for word sense identification. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press, Cambridge (1998) 15. Toral, A., Munoz, R.: A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia. In: Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy (April 2006) 16. Kazama, J., Torisawa, K.: Exploiting wikipedia as external knowledge for named entity recognition. In: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 698–707 (2007) 17. Vercoustre, A.M., Thom, J.A., Pehcevski, J.: Entity ranking in Wikipedia. In: Wainwright, R.L., Haddad, H. (eds.) Proceedings of the 2008 ACM Symposium on Applied Computing (SAC), pp. 1101–1106. ACM, New York (2008) 18. Gentile, A.L., Basile, P., Semeraro, G.: WibNED: Wikipedia Based Named Entity Disambiguation. In: Agosti, M., Esposito, F., Thanos, C. (eds.) Post-proceedings of the Fifth Italian Research Conference on Digital Libraries - IRCDL 2009: A Conference of the DELOS Association and the Department of Information Engineering of the University of Padua. Revised Selected Papers, DELOS: an Association for Digital Libraries, pp. 51–59 (2009) 19. Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 136–145. Springer, Heidelberg (2002) 20. Minkov, E., Wang, R., Cohen, W.W.: Extracting personal names from emails: Applying named entity recognition to informal text. In: Proceedings of HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, British Columbia, Canada (2005) 21. Zesch, T., M¨uller, C., Gurevych, I.: Using wiktionary for computing semantic relatedness. In: Fox, D., Gomes, C.P. (eds.) 
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, pp. 861–866. AAAI Press, Menlo Park (2008)
22. Banerjee, S., Pedersen, T.: Extended Gloss Overlaps as a Measure of Semantic Relatedness. In: Gottlob, G., Walsh, T. (eds.) IJCAI 2003, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 805–810. M. Kaufmann, San Francisco (2003) 23. Resnik, P.: Disambiguating noun groupings with respect to WordNet senses. In: Proceedings of the 3th Workshop on Very Large Corpora, pp. 54–68. ACL (1995) 24. Kudo, T., Matsumoto, Y.: Fast Methods for Kernel-Based Text Analysis. In: Hinrichs, E., Roth, D. (eds.) Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 24–31 (2003) 25. Turdakov, D., Velikhov, P.: Semantic relatedness metric for wikipedia concepts based on link analysis and its application to word sense disambiguation. In: Kuznetsov, S.D., Pleshachkov, P., Novikov, B., Shaporenkov, D. (eds.) SYRCoDIS. CEUR Workshop Proceedings, CEURWS.org, vol. 355 (2008) 26. Lov´asz, L.: Random walks on graphs: A survey. Combinatorics, Paul Erd¨os is Eighty 2, 353–398 (1996) 27. Iria, J., Xia, L., Zhang, Z.: Wit: Web people search disambiguation using random walks. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval 2007), Prague, Czech Republic, pp. 480–483. ACL (2007) 28. Nie, Z., Zhang, Y., Wen, J., Ma, W.: Object-level ranking: bringing order to web objects. In: WWW 2005: Proceedings of the 14th international conference on World Wide Web, pp. 567–574. ACM, New York (2005) 29. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Transactions on Information Systems 20(1), 116–131 (2002) 30. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Seventh International World-Wide Web Conference (WWW 1998). (1998)
Merging Structural and Taxonomic Similarity for Text Retrieval Using Relational Descriptions

Stefano Ferilli(1,3), Marenglen Biba(2), Nicola Di Mauro(1,3), Teresa M.A. Basile(1), and Floriana Esposito(1,3)

(1) Dipartimento di Informatica, Università di Bari, via E. Orabona, 4 - 70125 Bari, Italia
{ferilli,ndm,basile,esposito}@di.uniba.it
(2) Computer Science Department, University of New York, Tirana, Rr. "Komuna e Parisit", Tirana, Albania
[email protected]
(3) Centro Interdipartimentale per la Logica e sue Applicazioni, Università di Bari, via E. Orabona, 4 - 70125 Bari, Italia
Abstract. Information retrieval effectiveness has become a crucial issue with the enormous growth of available digital documents and the spread of Digital Libraries. Search and retrieval are mostly carried out on the textual content of documents, and traditionally only at the lexical level. However, pure term-based queries are very limited because most of the information in natural language is carried by the syntactic and logic structure of sentences. To take into account such a structure, powerful relational languages, such as first-order logic, must be exploited. However, logic formulæ constituents are typically uninterpreted (they are considered as purely syntactic entities), whereas words in natural language express underlying concepts that involve several implicit relationships, as those expressed in a taxonomy. This problem can be tackled by providing the logic interpreter with suitable taxonomic knowledge. This work proposes the exploitation of a similarity framework that includes both structural and taxonomic features to assess the similarity between First-Order Logic (Horn clause) descriptions of texts in natural language, in order to support more sophisticated information retrieval approaches than simple term-based queries. Evaluation on a sample case shows the viability of the solution, although further work is still needed to study the framework more deeply and to further refine it.
1 Introduction
The spread of digital technologies has caused a dramatic growth in the availability of documents in digital format, due to the easy creation and transmission thereof using networked computer systems. Hence the birth of several repositories, aimed at storing and providing such documents to interested final users. The shortcoming of this scenario lies in the problem of finding useful documents
that can satisfy an information need (the so-called information overload problem). Indeed, in legacy environments, few selected publications were available, and librarians could properly assess their content and consequently tag them for subsequent retrieval. Now, the amount of available documents is so huge that manual evaluation and tagging is infeasible. This represented a significant motivation for the invention of proper Information Retrieval techniques, that could rely on automatic techniques for document indexing and retrieval. More precisely, documents are almost always indexed based on their textual content, and are searched for by expressing textual queries. Due to the inborn complexity of Natural Language, information retrieval techniques have typically focused their attention on the lexical level, which seemed a good tradeoff between computational requirements and outcome effectiveness. The text is seen as a sequence of unrelated words (bag-of-words) and the query is expressed as a set of terms that are to be found in the documents. The weakness of such approaches, however, is that the syntactic and logical structure underlying the sentences is completely lost. Unfortunately, the real meaning of a sentence is mostly determined just by that level, and hence the quality of the term-based retrieval outcomes can be significantly affected by such a lack. Indeed, although much more computationally demanding than the simple bag-of-words approaches traditionally exploited in the literature, techniques that take into account the syntactic structure of sentences are very important to fully capture the information they convey. Reporters know very well that swapping the subject and the object in a sentence like “The dog bit the man” results in a very different interest of the underlying news. The landscape has slightly changed recently, due to the improved computational capabilities of current computer machines. Thus, considering the structural level of natural language sentences in text processing is no longer technically infeasible, although still hard. However, handling the structural aspects in Natural Language Processing (NLP for short) cannot be reduced to just building syntactic parsers for the various languages. Sophisticated techniques for representing and handling this kind of information are needed as well. First-Order Logic (or FOL) is a powerful representation language that allows relationships among objects to be expressed, which is often a non-negligible requirement in real-world and complex domains. Logic Programming [12] is a computer programming framework based on a FOL sub-language, which allows reasoning to be performed on knowledge expressed in the form of Horn clauses. Inductive Logic Programming (ILP) [14] aims at automatically learning logic programs from known examples of behaviour, and has proven to be a successful Machine Learning approach in domains where relations among objects must be expressed to fully capture the relevant information. One of the main reasons why FOL is a particularly complex framework compared to simple propositional or attribute-value ones relates to the problem of indeterminacy, meaning that different portions of one formula can be mapped in (often many) different ways onto portions of another. An obstacle towards fruitful application of FOL to NLP is the fact that, in the traditional FOL approach, predicates that make up the description language are defined by the knowledge engineer who is in charge of setting up the reasoning
or learning problem, and are uninterpreted by the systems. Conversely, some kinds of information need to be interpreted in order to be fully exploited, which requires a proper background knowledge to be set up. For instance, numeric information must be exploited referring to mathematical concepts such as number ordering relationships and arithmetic operations. Analogously, descriptions of natural language sentences obviously include words of the vocabulary, which are the expression of underlying concepts among which many implicit relationships exist that can be captured by a taxonomy. Being able to properly consider and handle such information is crucial for any successful application of pure FOL techniques to NLP [7]. An advantage of the logic framework is that a background knowledge can be defined and provided to help improve the performance or the effectiveness of the results. In the above case, a taxonomic background knowledge is needed. Unfortunately, unless the problem domain is very limited, natural language typically requires huge taxonomic information, and the problems of synonymy and polysemy introduce further complexity. In these cases, the use of existing state-of-the-art taxonomies can be a definite advantage. This work proposes the use of a framework for similarity assessment between FOL Horn clauses, enhanced to properly take into account also a taxonomic background knowledge, in order to find documents whose textual content is similar to a prototype sentence representing the query of the user. The basic similarity framework is borrowed from [6], while the taxonomic information is provided by WordNet. The next section shows how natural language can be described in FOL, and how the introduction of taxonomic background knowledge can support the exploitation of implicit relationships between the concepts underlying the descriptions. Then, Section 3 describes the similarity formula and framework for structural and taxonomic similarity assessment. Section 4 shows experiments that suggest the effectiveness of the proposed approach. Lastly, Section 5 concludes the paper and outlines future work directions.
2 NLP = FOL + Taxonomies
The considerations reported in the previous section motivate the adoption of both FOL and taxonomic information for the description of natural language sentences. To give an idea, consider the following sentences:

1 - “The boy wants a small dog”
2 - “The girl desires a yellow canary”
3 - “The hammer hits a small nail”
They structurally exhibit the same grammatical pattern, thus no hint is available to assess which is more similar to which. Going more in depth, at the lexical level, the only common word (‘small’) appears in sentences 1 and 3, which would suggest they are closer to each other than to sentence 2. However, it becomes clear that the first two are conceptually the most similar to each other as long as one knows and considers that ‘boy’ and ‘girl’ are two young persons, ‘to want’ and
Table 1. First-order logic language for structural description of sentences

subj(X,Y)      Y is the subject of sentence X
pred(X,Y)      Y is the predicate of sentence X
dir_obj(X,Y)   Y is the direct object of sentence X
ind_obj(X,Y)   Y is the indirect object of sentence X
noun(X,Y)      Y is a noun appearing in component X of the sentence
verb(X,Y)      Y is a verb appearing in component X of the sentence
adj(X,Y)       Y is an adjective appearing in component X of the sentence
adv(X,Y)       Y is an adverb appearing in component X of the sentence
prep(X,Y)      Y is a preposition appearing in component X of the sentence
sing(X)        the number of noun X is singular
pl(X)          the number of noun X is plural
past(X)        the tense of verb X is past
pres(X)        the tense of verb X is present
fut(X)         the tense of verb X is future
‘to desire’ are synonyms and ‘dog’ and ‘canary’ are two pets. From an information retrieval perspective, if 1 represents the user query, and {2,3} are the documents in the repository, we would like 2 to be returned first, while pure syntactic or lexical techniques would return 3 as the best matching solution. Note that the interesting case is that of sentences that are very close or identical grammatically, but very different in meaning; indeed, for sentences that differ already at the grammatical level the basic structural similarity framework described in [6] would be enough for assessing their degree of distance. There are several levels of the grammatical structure that can be exploited to describe natural language sentences at different levels of abstraction. Of course, the deeper the level, the more complex the description and the more computationally demanding its processing. The best grain-size to be exploited depends on the particular situation, and should represent a suitable tradeoff between expressive power and complexity. For demonstration purposes, in the following let us consider the very simple sentence structural description language reported in Table 1. Additionally, each noun, verb, adjective or adverb is described by the corresponding concept (or word) in the sentence, which is to be interpreted according to the taxonomy. To specify which literals are to be interpreted, suppose that they are enclosed as arguments of a tax/1 predicate. This yields, for the previous three sentences, the following descriptions:

s1 = sentence(s1) :- subj(s1,ss1), pred(s1,ps1), dir_obj(s1,ds1), noun(ss1,nss1), sing(nss1), tax(boy(nss1)), verb(ps1,vps1), pres(ps1), tax(want(vps1)), adj(ds1,ads1), tax(small(ads1)), noun(ds1,nds1), sing(nds1), tax(dog(nds1)).

s2 = sentence(s2) :- subj(s2,ss2), pred(s2,ps2), dir_obj(s2,ds2), noun(ss2,nss2), sing(nss2), tax(girl(nss2)), verb(ps2,vps2), pres(ps2), tax(desire(vps2)), adj(ds2,ads2), tax(yellow(ads2)), noun(ds2,nds2), sing(nds2), tax(canary(nds2)).
s3 = sentence(s3) :- subj(s3,ss3), pred(s3,ps3), dir_obj(s3,ds3), noun(ss3,nss3), sing(nss3), tax(hammer(nss3)), verb(ps3,vps3), pres(ps3), tax(hit(vps3)), adj(ds3,ads3), tax(small(ads3)), noun(ds3,nds3), sing(nds3), tax(nail(nds3)).
As already pointed out, setting up a general taxonomy is hard work, for which reason the availability of an already existing resource can be a valuable help in carrying out the task. In this example we will refer to the most famous taxonomy available nowadays, WordNet (WN) [13], which provides both the conceptual and the lexical level. Note that, if the concepts are not explicitly referenced in the description, but common words in natural language are used instead, due to the problem of polysemy (a word may correspond to many concepts) their similarity must somehow combine the similarities between each pair of concepts underlying the words. Such a combination can consist, for instance, of the average or maximum similarity among such pairs, or more sensibly can exploit the domain of discourse. A distance between groups of words (if necessary) can be obtained by couplewise working on the closest (i.e., taxonomically most similar) words in each group.
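As a purely illustrative sketch (not part of the original framework), the closest-pair combination just described could be implemented as follows; concepts_of and concept_similarity are hypothetical helpers standing, respectively, for a lookup of the concepts a word may express (e.g., its WordNet synsets) and for the taxonomic similarity of Section 3:

    def word_similarity(word1, word2, concepts_of, concept_similarity):
        # Similarity of two possibly polysemous words: the maximum similarity
        # over all pairs of concepts the two words may express.
        pairs = [(c1, c2) for c1 in concepts_of(word1) for c2 in concepts_of(word2)]
        if not pairs:
            return 0.0
        return max(concept_similarity(c1, c2) for c1, c2 in pairs)

    def group_distance(words1, words2, concepts_of, concept_similarity):
        # Distance between two groups of words, working couplewise on the
        # taxonomically closest words of each group.
        best = max(word_similarity(w1, w2, concepts_of, concept_similarity)
                   for w1 in words1 for w2 in words2)
        return 1.0 - best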
3 Similarity Framework
Many AI tasks can take advantage of techniques for comparing descriptions: subsumption procedures (to converge more quickly), flexible matching, instance-based classification techniques or clustering, generalization procedures (to focus on the components that are more likely to correspond to each other). Here, we are interested in the assessment of similarity between two natural language texts described by both lexical/syntactic features and by taxonomic references. Due to its complexity, few works exist on the comparison of FOL descriptions. In [6], a framework for computing the similarity between two Datalog Horn clauses has been provided, which is summarized in the following. Let us preliminarily recall some basic notions involved in Logic Programming. The arity of a predicate is the number of arguments it takes. A literal is an n-ary predicate, applied to n terms, possibly negated. Horn clauses are logical formulæ usually represented in Prolog style as l_0 :- l_1, . . . , l_n, where the l_i's are literals. It corresponds to an implication l_1 ∧ · · · ∧ l_n ⇒ l_0, to be interpreted as “l_0 (called head of the clause) is true, provided that l_1 and ... and l_n (called body of the clause) are all true”. Datalog [3] is, at least syntactically, a restriction of Prolog in which, without loss of generality [16], only variables and constants (i.e., no functions) are allowed as terms. A set of literals is linked if and only if each literal in the set has at least one term in common with another literal in the set. We will deal with the case of linked Datalog clauses. In the following, we will call compatible two sets or sequences of literals that can be mapped onto each other without yielding inconsistent term associations (i.e., a term in one formula cannot correspond to different terms in the other formula).
Intuitively, the evaluation of similarity between two items i' and i'' might be based both on parameters expressing the amount of common features, which should concur in a positive way to the similarity evaluation, and on the features of each item that are not owned by the other (defined as the residual of the former with respect to the latter), which should concur negatively to the whole similarity value assigned to them [11]:

   n, the number of features owned by i' but not by i'' (residual of i' wrt i'');
   l, the number of features owned both by i' and by i'';
   m, the number of features owned by i'' but not by i' (residual of i'' wrt i').

A similarity function that expresses the degree of similarity between i' and i'' based on the above parameters, and that has a better behaviour than other formulæ in the literature in cases in which any of the parameters is 0, is [6]:

   sf(i', i'') = sf(n, l, m) = 0.5 · (l + 1)/(l + n + 2) + 0.5 · (l + 1)/(l + m + 2)        (1)

It takes values in ]0, 1[, which resembles the theory of probability and hence can help human interpretation of the resulting value. When n = m = 0 it tends to the limit of 1 as the number of common features grows. The full-similarity value 1 is never reached, being reserved for two items that are exactly the same (i' = i''), which can be checked in advance. Consistently with the intuition that there is no limit to the number of different features owned by the two descriptions, which contribute to make them ever different, it is also always strictly greater than 0, and will tend to such a value as the number of non-shared features grows. Moreover, for n = l = m = 0 the function evaluates to 0.5, which can be considered intuitively correct for a case of maximum uncertainty. Note that each of the two terms refers specifically to one of the two items under comparison, and hence they could be weighted to reflect their importance. In FOL representations, usually terms denote objects, unary predicates represent object properties and n-ary predicates express relationships between objects; hence, the overall similarity must consider and properly mix all such components. The similarity between two clauses C' and C'' is guided by the similarity between their structural parts, expressed by the n-ary literals in their bodies, and is a function of the number of common and different objects and relationships between them, as provided by their least general generalization C = l_0 :- l_1, . . . , l_k. Specifically, we refer to the θOI generalization model [5]. The resulting formula is the following:

   fs(C', C'') = sf(k' - k, k, k'' - k) · sf(o' - o, o, o'' - o) + avg({sf_s(l'_i, l''_i)}_{i=1,...,k})

where k' is the number of literals and o' the number of terms in C', k'' is the number of literals and o'' the number of terms in C'', o is the number of terms in C, and l'_i ∈ C' and l''_i ∈ C'' are generalized by l_i for i = 1, . . . , k. The similarity of the literals is smoothed by adding the overall similarity in the number of overlapping and different literals and terms.
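Formula (1) is simple enough to transcribe directly; the following sketch (ours, not the authors' code) makes the limiting behaviours discussed above concrete:

    def sf(n, l, m):
        # Similarity from the residuals (n, m) and the common features (l),
        # as in formula (1); values lie strictly between 0 and 1.
        return 0.5 * (l + 1) / (l + n + 2) + 0.5 * (l + 1) / (l + m + 2)

    # sf(0, 0, 0)     -> 0.5   (maximum uncertainty)
    # sf(0, 100, 0)   -> ~0.99 (tends to 1 as the common features grow)
    # sf(100, 0, 100) -> ~0.01 (tends to 0 as the residuals grow)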
The similarity between two compatible n-ary literals l' and l'', in turn, depends on the multisets of n-ary predicates corresponding to the literals directly linked to them (a predicate can appear in multiple instantiations among these literals), called star, and on the similarity of their arguments:

   sf_s(l', l'') = sf(n_s, l_s, m_s) + avg{sf_o(t', t'')}_{t'/t'' ∈ θ}

where θ is the set of term associations that map l' onto l'' and S' and S'' are the stars of l' and l'', respectively:

   n_s = |S' \ S''|    l_s = |S' ∩ S''|    m_s = |S'' \ S'|

Lastly, the similarity between two terms t' and t'' is computed as follows:

   sf_o(t', t'') = sf(n_c, l_c, m_c) + sf(n_r, l_r, m_r)

where the former component takes into account the sets of properties (unary predicates) P' and P'' referred to t' and t'', respectively:

   n_c = |P' \ P''|    l_c = |P' ∩ P''|    m_c = |P'' \ P'|

and the latter component takes into account how many times the two objects play the same or different roles in the n-ary predicates; in this case, since an object might play the same role in many instances of the same relation, the multisets R' and R'' of roles played by t' and t'', respectively, are to be considered:

   n_r = |R' \ R''|    l_r = |R' ∩ R''|    m_r = |R'' \ R'|

Now, since the taxonomic predicates represent further information about the objects involved in a description, in addition to their properties and roles, term similarity is the component where the corresponding similarity can be introduced in the overall framework. Hence, the similarity between two terms becomes:

   sf_o(t', t'') = sf(n_c, l_c, m_c) + sf(n_r, l_r, m_r) + sf(n_t, l_t, m_t)

where the additional component refers to the similarity between the taxonomic information associated with the two terms t' and t''. In particular, it suffices to provide a way to assess the similarity between two concepts. Then, in case the taxonomic information is expressed in the form of words instead of concepts, either a Word Sense Disambiguation [8] technique is exploited to identify the single intended concept for polysemous words, or, according to the one-domain-per-discourse assumption, the similarity between two words can be referred to the closest pair of concepts associated with those words. In principle, in case of synonymy or polysemy, assuming consistency of domain among the words used in the same context [10], the similarity measure, by couplewise comparing all concepts underlying two words, can also suggest a ranking of the most probable senses for each, this way serving as a simple Word Sense Disambiguation procedure, or as a support to a more elaborate one.
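A minimal sketch of the term-level component, reusing the sf function sketched above; here properties, roles and (optionally) taxonomic ancestors are passed in as plain Python iterables, which is our simplification of the clause-handling machinery of [6]:

    from collections import Counter

    def multiset_params(a, b):
        # Residual and intersection sizes of two multisets.
        a, b = Counter(a), Counter(b)
        return sum((a - b).values()), sum((a & b).values()), sum((b - a).values())

    def sf_o(props1, props2, roles1, roles2, anc1=(), anc2=()):
        # Term similarity: properties + roles (+ the taxonomic component,
        # when ancestor sets are available).
        result = sf(*multiset_params(props1, props2)) + sf(*multiset_params(roles1, roles2))
        if anc1 or anc2:
            result += sf(*multiset_params(anc1, anc2))
        return result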
To assess the similarity between concepts in a given taxonomy, and indirectly the similarity between the words that express those concepts, (1) can be applied directly to the taxonomic relations. The most important and significant relationship among concepts expressed in any taxonomy is the generalization/specialization one, relating concepts or classes to their super- and sub-concepts or classes, respectively. According to the definition in [2], this yields a similarity measure rather than a full semantic relatedness measure, but we are currently working to extend it by taking into account relations other than hyponymy as well. Intuitively, the closer a common ancestor of two concepts c' and c'', the more they can be considered as similar to each other, and various distance measures proposed in the literature are based on the length of the paths that link the concepts to be compared to their closest common ancestor. In our case, (1) requires three parameters: one expressing the common information between the two objects to be compared, and the others expressing the information carried by each of the two but not by the other. If the taxonomy is a hierarchy, and hence can be represented as a tree, this ensures that the path connecting each node (concept) to the root (the most general concept) is unique: let us call <p'_1, . . . , p'_{n'}> the path related to c', and <p''_1, . . . , p''_{n''}> the path related to c''. Thus, given any two concepts, their closest common ancestor is uniquely identified, as the last element in common in the two paths: suppose this is the k-th element (i.e., ∀i = 1, . . . , k : p'_i = p''_i = p_i). Consequently, three sub-paths are induced: the sub-path in common, going from the root to such a common ancestor (<p_1, . . . , p_k>), and the two trailing sub-paths (<p'_{k+1}, . . . , p'_{n'}> and <p''_{k+1}, . . . , p''_{n''}>). Now, the former can be interpreted as the common information, and the latter as the residuals, and hence their lengths (n' - k, k, n'' - k) can serve as arguments (n, l, m) to apply the similarity formula. This represents a novelty with respect to other approaches in the literature, where only one or both of the trailing parts are typically exploited, and is also very intuitive, since the longer the path from the top concept to the common ancestor, the more the two concepts have in common, and the higher the returned similarity value. Actually, in real-world domains the taxonomy is not just a hierarchy, but rather it is a heterarchy, meaning that multiple inheritance must be taken into account and hence a concept can specialize many other concepts. This is very relevant as regards the similarity criterion stated above, since in a heterarchy the closest common ancestor and the paths linking two nodes are not unique, hence many incomparable common ancestors and paths between concepts can be found, and going to the single common one would very often result in overgeneralization. Our solution to this problem is to compute the whole set of ancestors of each concept, and then to consider as common information the intersection of such sets, and as residuals the two symmetric differences. Again this is fairly intuitive, since the number of common ancestors can be considered a good indicator of the common information and features between the two concepts, just as the number of different ancestors can provide a reasonable estimation of the different information and features they own.
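The ancestor-set criterion for heterarchies translates directly into a small function; all_ancestors below is a hypothetical helper (for WordNet it could be built, for instance, from the transitive closure of the hypernym relation), and sf is the function from the earlier sketch:

    def taxonomic_similarity(concept1, concept2, all_ancestors):
        # Concept similarity in a heterarchy: shared ancestors act as the common
        # information, the two symmetric differences as the residuals.
        a1, a2 = set(all_ancestors(concept1)), set(all_ancestors(concept2))
        return sf(len(a1 - a2), len(a1 & a2), len(a2 - a1))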
4 Evaluation
To assess the effectiveness of the proposed technique, two separate evaluations must be carried out. First of all, the taxonomic similarity measure alone must be proved effective in returning sensible similarity values for couples of concepts. Then, the overall similarity framework integrating both structural and taxonomic similarity assessment must be proved effective in evaluating the similarity
Table 2. Sample similarity values between WordNet words/concepts

Concept                        Concept                          Similarity
cat (wild) [102127808]         tiger (animal) [102129604]       0.910
cat (pet) [102121620]          tiger (animal) [102129604]       0.849
mouse (animal) [102330245]     cat (pet) [102121620]            0.775
mouse (device) [103793489]     computer (device) [103082979]    0.727
cat (pet) [102121620]          dog (pet) [102084071]            0.627
cat (wild) [102127808]         dog (pet) [102084071]            0.627
dog (pet) [102084071]          horse (domestic) [102374451]     0.542
mouse (animal) [102330245]     computer (device) [103082979]    0.394
mouse (animal) [102330245]     mouse (device) [103793489]       0.394
mouse (device) [103793489]     cat (pet) [102121620]            0.384
cat (domestic) [102121620]     computer (device) [103082979]    0.384
horse (domestic) [102374451]   horse (chess) [103624767]        0.339
between complex descriptions. In the case of NLP, the latter must additionally show its ability to overcome problems of interpretation due to the presence of polysemous words. We will show with some examples that both these requirements can be satisfied by the proposed approach. As to the taxonomic similarity assessment alone, consider the following words and concepts, and some of the corresponding similarity values reported in Table 2:

102330245 mouse (animal): ’any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails’
103793489 mouse (device): ’a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad’
103082979 computer (device): ’a machine for performing calculations automatically’
102121620 cat (pet): ’feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats’
102127808 cat (wild): ’any of several large cats typically able to roar and living in the wild’
102129604 tiger (animal): ’large feline of forests in most of Asia having a tawny coat with black stripes; endangered’
102084071 dog (pet): ’a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds’
102374451 horse (animal): ’solid-hoofed herbivorous quadruped domesticated since prehistoric times’
103624767 horse (chess): ’a chessman shaped to resemble the head of a horse; can move two squares horizontally and one vertically (or vice versa)’
At the level of concepts, it is possible to note that the similarity ranking is quite intuitive, in that less related concepts receive a lower value. The closest pairs are
‘wild cat’-‘tiger’ and ‘pet cat’-‘tiger’, followed by ‘mouse (animal)’-‘cat (pet)’, then by ‘mouse (device)’-‘computer (device)’, by ‘cat (pet)’-‘dog (pet)’ and by ‘dog (pet)’-‘horse (animal)’, all with similarity values above 0.5. Conversely, all odd pairs, mixing animals and devices or objects (including polysemic words), get very low values, below 0.4. Then, for checking the overall structural and taxonomic similarity assessment capability, let us go back to the sample sentences in Section 2 for an application of the proposed taxonomically-enhanced similarity framework to descriptions of sentences written in natural language. As a first step, the similarity between single words must be assessed. Applying the proposed procedure, the similarity values are as follows:

boy-girl = 0.75         boy-hammer = 0.435
girl-hammer = 0.435     want-desire = 0.826
want-hit = 0.361        desire-hit = 0.375
yellow-small = 0.562    small-small = 1
dog-canary = 0.667      dog-nail = 0.75
canary-nail = 0.386
It is possible to note that all similarities agree with the intuition, except the pair dog-nail, which gets a higher similarity value than dog-canary, due to the interpretations of ‘dog’ as ‘a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward’ and ‘nail’ as ‘a thin pointed piece of metal that is hammered into materials as a fastener’. Without considering taxonomic information, the generalization between s1 and s2 and between s2 and s3 is:

sentence(X) :- subj(X,Y), pred(X,W), dir_obj(X,Z), noun(Y,Y1), sing(Y1), verb(W,W1), pres(W1), adj(Z,Z1), noun(Z,Z2), sing(Z2).
while the generalization between s1 and s3 is:

sentence(X) :- subj(X,Y), pred(X,W), dir_obj(X,Z), noun(Y,Y1), sing(Y1), verb(W,W1), pres(W1), adj(Z,Z1), tax(small(Z1)), noun(Z,Z2), sing(Z2).
so that the latter, having an additional literal with respect to the former, will take a greater similarity value due to just the structural similarity of the two sentences, in spite of the very different content. Conversely, by considering the taxonomic similarity among words, the comparisons become: fs(s1,s2) = 2.444
fs(s1,s3) = 2.420
fs(s2,s3) = 2.318
where, indeed, the first two sentences neatly get the largest similarity value with respect to the other combinations. Notwithstanding the ‘dog-nail’ ambiguity, the overall similarity ranking between sentences is correct. In order to better understand the effect of the approach on the similarity values, other tests were performed on the following sample sentences:
1 - “the boy buys a jewel for a girl”
2 - “the girl receives a jewel from a boy”
3 - “a young man purchases a gem for a woman”
4 - “a young man purchases a precious stone for a woman”
having a different structure (e.g., transitive vs intransitive in 1-2), containing synonyms that could belong to different synsets (‘boy’ vs ‘young man’ in 1-3), or made up of multiple words instead of a single one (e.g., ‘gem’ vs ‘precious stone’ in 1-4). For these sentences the similarity values obtained were:

fs(1,2) = 2.383    fs(1,3) = 2.484    fs(1,4) = 2.511

5 Conclusions
Information retrieval effectiveness has become a crucial issue with the enormous growth of available digital documents and the spread of Digital Libraries. Search and retrieval are mostly carried out on the textual content of documents, and traditionally only at the lexical level. However, pure term-based queries are very limited because most of the information in natural language is carried by the syntactic and logic structure of sentences. To take into account such a structure, powerful relational languages, such as first-order logic, must be exploited. However, logic formulæ constituents are typically uninterpreted (they are considered as purely syntactic entities), whereas words in natural language express underlying concepts that involve several implicit relationships, as those expressed in a taxonomy. This problem can be tackled by providing the logic interpreter with suitable taxonomic background knowledge. This work proposed the exploitation of a similarity framework that includes both structural and taxonomic features to assess the similarity between First-Order Logic (Horn clause) descriptions of texts in natural language, in order to support more sophisticated information retrieval approaches than simple term-based queries. Although the proposed framework applies to any kind of structural description and taxonomy, being able to reuse an already existing taxonomy would be of great help. For this reason, the examples reported in this paper exploited the WordNet (WN) database, which can be naturally embedded in the proposed framework. Other works exist in the literature that combine, in various shapes and for different purposes, structural (and possibly logical) descriptions of sentences, some kind of similarity, and WN. Some concern Question Answering [17], others Textual Entailment [9, 4, 1, 15]. However, the taxonomy relationships exploited in these works, or the way in which the structure of sentences is handled, makes them unsuitable for our purpose. Evaluation on a sample case shows the viability of the solution, and its robustness with respect to problems due to lexical ambiguity and polysemy. Several non-organized small experiments on tens of sentences of different lengths have been carried out so far, confirming the sample results, but revealing large computational times (1-2 min for long sentences). Thus, a first direction for future work will concern efficiency improvement in order to make it scalable. After that, more
thorough experimentation and fine-tuning of the taxonomic similarity computation methodology by exploiting other relationships represented in WordNet will make sense. Also, the application of the proposed similarity framework to other problems, such as Word Sense Disambiguation in phrase structure analysis, would be an interesting direction deserving further investigation.
References
[1] Agichtein, E., Askew, W., Liu, Y.: Combining lexical, syntactic, and semantic evidence for textual entailment classification. In: Proc. 1st Text Analysis Conference, TAC (2008)
[2] Budanitsky, A., Hirst, G.: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Proc. Workshop on WordNet and Other Lexical Resources, 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh (2001)
[3] Ceri, S., Gottlob, G., Tanca, L.: Logic Programming and Databases. Springer, Heidelberg (1990)
[4] Clark, P., Harrison, P.: Recognizing textual entailment with logical inference. In: Proc. 1st Text Analysis Conference, TAC (2008)
[5] Esposito, F., Fanizzi, N., Ferilli, S., Semeraro, G.: A generalization model based on OI-implication for ideal theory refinement. Fundamenta Informaticæ 47(1-2), 15–33 (2001)
[6] Ferilli, S., Basile, T.M.A., Biba, M., Di Mauro, N., Esposito, F.: A general similarity framework for Horn clause logic. Fundamenta Informaticæ 90(1-2), 43–46 (2009)
[7] Ferilli, S., Fanizzi, N., Semeraro, G.: Learning logic models for automated text categorization. In: AI*IA 2001: Advances in Artificial Intelligence. Springer, Heidelberg (2001)
[8] Ide, N., Véronis, J.: Word sense disambiguation: The state of the art. Computational Linguistics 24, 1–40 (1998)
[9] Inkpen, D., Kipp, D., Nastase, V.: Machine learning experiments for textual entailment. In: Proc. 2nd PASCAL Recognising Textual Entailment Challenge, RTE-2 (2006)
[10] Krovetz, R.: More than one sense per discourse. In: NEC Princeton NJ Labs., Research Memorandum (1998)
[11] Lin, D.: An information-theoretic definition of similarity. In: Proc. 15th International Conf. on Machine Learning, pp. 296–304. Morgan Kaufmann, San Francisco (1998)
[12] Lloyd, J.W.: Foundations of Logic Programming, 2nd edn. Springer, Berlin (1987)
[13] Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
[14] Muggleton, S.: Inductive logic programming. New Generation Computing 8(4), 295–318 (1991)
[15] Pennacchiotti, M., Zanzotto, F.M.: Learning shallow semantic rules for textual entailment. In: Proc. International Conference on Recent Advances in Natural Language Processing, RANLP 2007 (2007)
[16] Rouveirol, C.: Extensions of inversion of resolution applied to theory completion. In: Inductive Logic Programming, pp. 64–90. Academic Press, London (1992)
[17] Vargas-Vera, M., Motta, E.: An ontology-driven similarity algorithm. Tech. Report kmi-04-16. Knowledge Media Institute (KMi), The Open University, UK (July 2004)
Audio Objects Access: Tools for the Preservation of the Cultural Heritage

Sergio Canazza(1) and Nicola Orio(2)

(1) Sound and Music Computing Group, Department of Information Engineering, University of Padova, Via Gradenigo 6/a, 35100, Padova, Italy
[email protected]
(2) Information Management Systems Research Group, Department of Information Engineering, University of Padova, Via Gradenigo 6/a, 35100, Padova, Italy
[email protected]
Abstract. The digital re-recording of analogue material, in particular phonographic discs, can be carried out using different approaches: mechanical, electro-mechanical, opto-mechanical, and opto-digital. In this paper, we investigate the differences among these approaches, using two novel methods that have been developed on purpose: a system for synthesizing audio signals from still images of phonographic discs and a tool for the automatic alignment of audio signals. The methods have been applied to two case studies, taken from a shellac disc. Results point out that this combined approach can be used as an effective tool for the preservation of and access to the audio documents.
1 Introduction
The opening up of archives and libraries to a large telecoms community represents a fundamental impulse for cultural and didactic development. Guaranteeing an easy and ample dissemination of some of the fundamental moments of the musical culture of our times is an act of democracy which cannot be renounced and which must be assured to future generations, even through the creation of new instruments for the acquisition, preservation and transmission of information. This is a crucial point, which is nowadays the core of reflection of the international archive community. If, on the one hand, scholars and the general public have begun paying greater attention to the recordings of artistic events, on the other hand, the systematic preservation and access to these documents is complicated by their diversified nature and amount. Within the group of documents commonly labeled analogue audio, sound recordings on discs are the most widespread in the world, from 1898 until about 1990. The common factor with this group of documents is the method of recording the information. This is done by means of a groove cut into the surface by a cutting stylus and modulated by the sounds, either directly in the case of acoustic recordings (shellac 78 rpm discs) or by electronic amplifiers (shellac 78 rpm or vinyl discs). The wide time span in which these formats (with different speeds, numbers of audio channels, and carrier chemistry characteristics) have been
developed makes it even harder to select the correct playing format for each carrier. The importance of transfer into the digital domain (active preservation) should be clear, especially for carriers at risk of disappearing, respecting the indications of the international archive community (see [2,3,1] for some guideline proposals and [26,5] for the ethics of audio document re-recording). It is well-known that the recording of an event can never be a neutral operation, since the timbre quality and the plastic value of the recorded sound, which are of great importance in contemporary music (electro-acoustic, pop/rock, ethnic music), are already determined by the choice of the number and arrangement of the microphones used during the recording. Moreover, the audio processing carried out by the tonmeister is a real interpretative element added to the recording of the event. Thus, musicological and historic-critical competence becomes essential for the identification and correct cataloguing of the information contained in audio documents. The combination of a technical and scientific background with historic-philological knowledge also becomes essential for preservative re-recording operations, which do not coincide completely with pure A/D transfer, as it is, unfortunately, often thought. The increased dimensionality of the data contained within an audio digital library should be dealt with by means of automatic annotation. The auditory information contained in the audio medium can be augmented with cross-modal cues. For instance, the visual and textual information carried by the cover, the label and possible attachments has to be acquired through photos and/or videos. The storage and representation of this valuable information is common practice and is usually based on well-known techniques for image and video processing, such as OCR, video segmentation and so on. We believe that it is interesting as well, even if not studied yet, to deal with other information regarding the carrier corruption and the imperfections that occurred during the A/D conversion. After a detailed overview of the debate that has evolved since the Seventies inside the archivist community on the active conservation of audio documents (Sec. 2), this work describes different approaches for the re-recording of phonographic discs (Sec. 3) and a tool to align the different audio signals (Sec. 4). Sec. 5 provides two case studies where the similarities and differences of the approaches to re-recording are highlighted. A number of applications are described in the concluding section.
2 Audio Archivists: A Discussion 30 Years Long
A reconnaissance of the most significant positions in the debate that has evolved since the Seventies inside the archivist community on the active conservation of audio documents points out at least three different points of view [23], described below.

2.1 William Storm
William Storm [27] identified two types of re-recording which are suitable from the archival point of view: 1) the sound preservation of audio history, and 2) the sound preservation of an artist. The first type of re-recording (Type I)
represents a level of reproduction defined as the perpetuation of the sound of an original recording as it was initially reproduced and heard by the people of the era. The second type of re-recording (Type II) was presented by Storm as a more ambitious research objective: it is characterized by the use of playback equipment other than that used originally with the intent of obtaining the live sound of original performers, transcending the limits of a historically faithful reproduction of the recording.

2.2 Dietrich Schüller
Schüller in [26] and in [5] points directly towards defining a procedure which guarantees the best signal quality in the re-recording by limiting the audio processing to the minimum. He goes on to an accurate investigation of signal alterations, which he classifies in two categories: (1) intentional and (2) unintentional. The former include recording, equalization, and noise reduction systems, while the latter are further divided into two groups: (i) caused by the imperfection of the recording technique of the time (distortions), and (ii) caused by misalignment of the recording equipment (wrong speed, deviation from the vertical cutting angle in cylinders or misalignment of the recording in magnetic tape). The choice whether or not to compensate for these alterations reveals different re-recording strategies: (A) the recording as it was heard in its time (Storm’s Audio History Type I); (B) the recording as it has been produced, precisely equalized for intentional recording equalizations (1), compensated for possible errors caused by misaligned recording equipment (2ii) and replayed on modern equipment to minimize replay distortions; (C) the recording as produced, with additional compensation for recording imperfections (2i).

2.3 George Brock-Nannestad
George Brock-Nannestad [6] examines the re-recording of acoustic phonographic recordings (pre-1925). In order to have scientific value, the re-recording work requires a complete integration between the historical-critical knowledge which is external to the signal and the objective knowledge which can be inferred by examining the carrier and the degradations highlighted by the analysis of the signal.

2.4 Guidelines for an Audio Preservation Protocol
Starting from these positions, we define the preservation copy as a digital data set that groups the information carried by the audio document, considered as an artifact. It aims to preserve the documentary unity, and its bibliographic equivalent is the facsimile or the diplomatic copy. Signal processing techniques are allowed only when they are aimed at carrier restoration. Differently from Schüller's position, it is our belief that in a preservation copy only the intentional alterations (1) must be compensated (correct equalization of the re-recording system and decoding of any possible intentional signal processing interventions). All the unintentional alterations (also the ones caused by misalignments of the
recording equipment) could be compensated only at the access copy level: these imperfections/distortions must be preserved because they witness the history of the audio document transmission. The A/D transfer process should represent the original document characteristics, from both the information and the material points of view, as it has arrived to us. According to the indications of the international archive community [2,3,1,16,15,17]:
1) the recording is transferred from the original carrier;
2) if necessary, the carrier is cleaned and restored so as to repair any climatic degradations which may compromise the quality of the signal;
3) re-recording equipment is chosen among the current professional equipment available in order not to introduce further distortions;
4) sampling frequency and bit rate must be chosen with respect to the archival sound record standard (at least 96 kHz / 24 bit, adopting the guideline: the worse the signal, the higher the resolution);
5) the digital audio file format should support high resolution; it should be transparent, with simple coding schemes, without data reduction.
It is important to highlight that this protocol fits very well in the ontology approach used in [21,22], developed in the field of digital object preservation, describing its internal relationships to support the preservation process. The ontology is an extension of the CIDOC Conceptual Reference Model (CIDOC-CRM), which is an ISO standard for describing cultural heritage [11,12,14].
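Purely as an illustration of requirements 1)-5) above (the field names and defaults are our own assumptions, not part of any cited standard or of CIDOC-CRM), the information to be fixed for a preservation copy could be captured in a record such as:

    from dataclasses import dataclass, field

    @dataclass
    class PreservationCopy:
        carrier_id: str                  # identifier of the original carrier
        carrier_type: str                # e.g. "shellac 78 rpm disc"
        playback_chain: str              # professional equipment used for the transfer
        sample_rate_hz: int = 96000      # at least 96 kHz
        bit_depth: int = 24              # at least 24 bit
        file_format: str = "PCM/WAV"     # transparent coding, no data reduction
        intentional_corrections: list = field(default_factory=list)  # e.g. equalization decoding
        documented_defects: list = field(default_factory=list)       # unintentional alterations, kept as-is

        def is_archival_grade(self) -> bool:
            return self.sample_rate_hz >= 96000 and self.bit_depth >= 24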
3 Phonographic Discs Re-recording Systems
Four typologies of playing equipment exist.
Mechanical. It uses a mechanical phonograph, the most common device for playing recorded sound from the 1870s through the 1950s, where the stylus is used to vibrate a diaphragm radiating through a horn.
Electro-mechanical. Turntable drive systems, configured for use with a pickup: (1) a piezo-electric crystal (where the mechanical movement of the stylus in the groove generates a proportional electrical voltage by creating stress within a crystal); (2) magnetic cartridges, moving magnet or moving coil. Both operate on the same physical principle of electromagnetic induction.
Opto-mechanical. A laser turntable is a phonograph that plays gramophone records using a laser device as the pickup, rather than a conventional diamond-tipped stylus [18]. This playback system has the advantage of never physically touching the disc during playback. The laser pickup uses five beams: one on each channel to track the sides of the groove, one on each channel to pick up the sound (just below the tracking beams), and a fifth to track the surface of the record and keep the pickup at a constant height, which compensates for record thickness and any warping. The lasers focus on a section of the groove above the level where a conventional stylus will have traveled, and below the typical depth of surface scratches, giving the possibility of like-new reproduction even from worn or scratched records. Using a laser pickup reduces many problems associated with physical styli: horizontal tracking angle error, leveling adjustment issues, channel-balance error, stereo crosstalk, anti-skating compensation,
acoustic feedback, problems tracking warped records, and cartridge hum pickup. Unfortunately, the laser turntable is extraordinarily sensitive to record cleanliness. Moreover, when a phonographic disc is inserted into the tray drawer and the drawer closed, the turntable reads the surface of the disc, displaying the number of tracks: the record must be black or opaque-colored; transparent or translucent records may not play at all. Laser turntables are constrained to the reflected laser spot only, are susceptible to damage and debris, and are very sensitive to surface reflectivity. The National Library of Canada and, since 2001, the Library of Congress in Washington DC have also used this system.
Opto-digital. Nowadays, automatic text scanning and optical character recognition are in wide use at major libraries; unlike that of texts, the A/D transfer of historical sound recordings is often an invasive process. Digital image processing techniques can be applied to the problem of extracting audio data from recorded grooves, acquired using an electronic camera or other imaging system. The images can be processed to extract the audio data. Such an approach offers a way to provide non-contact reconstruction and may in principle sample any region of the groove, also in the case of a broken disc. These scanning methods may form the basis of a strategy for: a) larger scale A/D transfer of mechanical recordings which retains maximal information (2D or 3D model of the grooves) about the native carrier; b) small scale A/D transfer processes, where there are insufficient resources (trained personnel and/or high-end equipment) for a traditional A/D transfer by means of turntables and A/D converters; c) the active preservation of carriers with heavy degradation (breakage, flaking, exudation). In the literature there are several approaches to this problem (see [13,9,28]). The authors have developed a system (Photos of GHOSTS: PoG [7]) that: a) is able to automatically recognize different rpm and to perform track separation; b) works with both low-cost hardware and untrained personnel; c) is robust with respect to dust and scratches; d) outputs de-noised and de-wowed audio, by means of novel restoration algorithms. An equalization curve can be chosen by the user: the system has hundreds of curves stored, each one with appropriate references (date, company, roll-off, turnover). Moreover, PoG allows the user to process the signal by means of several audio restoration algorithms [8,4].
Fig. 1. Waveforms of the audio signals taken with the four re-recording systems of La signorina sfinciusa: turntable (top-left); phonograph (bottom-left); laser (top-right); PoG (bottom-right)
The system uses a customized scanner device with a rotating lamp carriage in order to position every sector with the optimal alignment relative to the lamp (coaxially incident light). The software automatically finds the record center and radius from the scanned data, in order to perform groove rectification and track separation. Starting from the light intensity curve of the pixels in the scanned image, the groove is modeled and the audio samples are thus obtained. Fig. 1 shows a comparison of the waveforms of the audio signals extracted by means of the described equipment (excerpt of La signorina sfinciusa).
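The following toy sketch is not PoG's algorithm (which models the groove from light intensity curves and includes equalization and restoration); it only illustrates, under heavy simplification, the final step shared by image-based approaches: once the groove's lateral displacement has been unwound and resampled uniformly in time, the audio is roughly proportional to the stylus velocity, i.e., to the derivative of that displacement.

    import numpy as np

    def displacement_to_audio(displacement, sample_rate):
        # displacement: groove lateral displacement, uniformly sampled in time
        # (hypothetical input, e.g. obtained from a rectified groove image).
        velocity = np.gradient(displacement) * sample_rate   # d(position)/dt
        peak = float(np.max(np.abs(velocity)))
        return velocity / peak if peak > 0 else velocity     # normalized to [-1, 1]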
4 Audio Alignment
Audio alignment is normally used to compare the characteristics of different performances of a music work. The goal is to find, for each point in one performance, the corresponding point in the second performance. This information can be useful to highlight differences in expressive timings, such as the use of rallentando and accelerando, and the duration of notes and rests which can be modified by the performers. A graphical representation of the alignment curve, which matches pairs of points in the two performances in a bi-dimensional representation as shown in Fig. 2, gives a direct view of the main differences between the styles of two performances [10]. Alignment can be applied also to different versions of the same recording. For instance, in the case of electro-acoustic music, recordings published at
Fig. 2. Graphical representation of the local distance matrix. X-axis: the audio signal synthesized by the PoG system; y-axis: the audio signal extracted by means of the phonograph.
different times may have undergone different post-processing and editing phases [24]. In this case, alignment may highlight cuts and insertions of new material in the recordings, which may be difficult to detect manually, showing the usage of previously released material inside a new composition. We propose to apply alignment techniques to compare the re-recordings of a disc with the one obtained through the technique based on digital images described in the previous section. For instance, it is likely that the recording speeds differ slightly depending on the quality of the analogue equipment. Moreover, there can be local differences in sound quality due to a different sensitiveness to possible local damages on the record surface. Once the two recordings have been aligned it is possible to compare their sound quality in corresponding points, in order to assess objectively the quality of the re-recording. The comparison can be used to estimate which kind of equipment has been used for the analogue re-recordings, in the common case in which this information has not been maintained. Moreover, point-to-point comparison highlights critical parts of the recordings, where the effect of damages is more relevant, giving indications on how to proceed with possible digital restorations. In order to be aligned, the recordings have to be processed to extract features that capture the most relevant acoustic characteristics. The representations are then aligned by computing a local distance matrix between the two representations and by finding the optimal global path across this matrix. A popular approach to compute the global alignment is Dynamic Time Warping (DTW), which has been developed in the speech recognition research area [25] and applied, together with other techniques such as hidden Markov models, to audio matching [19,20,10]. As the name suggests, DTW can be computed efficiently using a dynamic programming approach. The first step thus regards the choice of the acoustic parameters that are to be used. Given the relevance of spectral information, the similarity function is normally based on the frequency representation of the signal. After a number of tests, we chose to focus on frequency resolution, using large analysis windows of 8192 points with a sampling rate of 48 kHz and a hopsize between two subsequent windows of 4096 points. With these parameters, each point to be aligned corresponds to an audio frame of about 0.17 seconds, while the time span between two subsequent points is 0.08 seconds. After choosing how to describe the digital recordings, a suitable distance function has to be chosen. We propose to use the cosine of the angle between the vectors representing the amplitude of the Fourier transform, which can be considered a measure of the correlation between the two spectra. Thus, given two recordings f and g, the local distance d(m, n) between two frames can be computed according to the equation

   d(m, n) = ( Σ_{i=1}^{K} F_m(i) G_n(i) ) / ( ||F_m|| ||G_n|| )        (1)
where F_m (G_n) is the magnitude spectrum of frame m (n) of recording f (g), while in our application K = 8192 points. Local distance can be represented by a distance matrix, as shown in Fig. 2, where the main similarities are along the
diagonal, large dark squares correspond to long sustained notes, and brighter areas represent a low degree of similarity between two frames. In order to reduce the computational cost, the local distance needs to be computed only in the proximity of the diagonal. After the local distance matrix is computed, DTW finds the best aligning path by computing the cumulative distance c(m, n) between the two recordings. We chose to force the global alignment to involve all the frames of both signals, so that for each frame in one of the recordings there is at least one corresponding frame in the second. In this way it is possible to compare the acoustic characteristics of any couple of recordings, for instance assessing how robust different analogue equipment is to a particular damage, highlighting its effect in the frequency representation.
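A compact sketch (ours, not the authors' implementation) of the framing, of the local measure of equation (1), and of a plain DTW pass. The band constraint around the diagonal mentioned above is omitted for brevity, and since equation (1) is a similarity rather than a distance, its value is negated before accumulation, which is one possible convention among several.

    import numpy as np

    def magnitude_frames(x, win=8192, hop=4096):
        # Magnitude spectra of successive frames (parameters as in the text).
        return np.array([np.abs(np.fft.rfft(x[i:i + win]))
                         for i in range(0, len(x) - win + 1, hop)])

    def local_distance(F, G):
        # Cosine measure of equation (1) between every pair of frames.
        F = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
        G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
        return F @ G.T

    def dtw(d):
        # Cumulative cost c(m, n) and optimal path over the full matrix.
        M, N = d.shape
        c = np.full((M, N), np.inf)
        c[0, 0] = -d[0, 0]
        for m in range(M):
            for n in range(N):
                if m == 0 and n == 0:
                    continue
                prev = min(c[m - 1, n] if m > 0 else np.inf,
                           c[m, n - 1] if n > 0 else np.inf,
                           c[m - 1, n - 1] if m > 0 and n > 0 else np.inf)
                c[m, n] = prev - d[m, n]
        path, (m, n) = [(M - 1, N - 1)], (M - 1, N - 1)
        while (m, n) != (0, 0):
            steps = [(m - 1, n), (m, n - 1), (m - 1, n - 1)]
            m, n = min((c[i, j], (i, j)) for i, j in steps if i >= 0 and j >= 0)[1]
            path.append((m, n))
        return c, path[::-1]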
5 Case Studies
As case studies we selected two shellac discs:
– Pasquale Abete, Fronne ’e limone (the term “fronne ’e limone” represents a popular music genre from Southern Italy), recorded in New York in May 1921; it is thus a mechanical recording and it has been played at 80 rpm.
– Leonardo Dia, La signorina sfinciusa (The funny girl), recorded in New York on the 24th of July 1929; in this case, since it is an American electric recording, it has been played at 78.26 rpm.
The audio signal was extracted in four ways:
1. Mechanical: His Master's Voice phonograph, Monarch model, equipped with an HMV Exhibition pickup and a Columbia horn. The re-recording was carried out in a 4 m2 room; the signal was recorded by means of a cardioid microphone Rhøde NT23, installed parallel to the horn axis, 7.5 cm away and centered with respect to the horn. The microphone was set in semi-cardioid position, in order to compensate for the room reverberation. A balanced wire transmitted the analogue signal to a portable A/D board Motu 828 MkII, where it was transferred into the digital domain, with a 48 kHz sample rate and a 24 bit resolution.
2. Electro-mechanical: Diapason turntable, model 14-A, with Diapason arm, Shure M44/7 pickup, Stylus Expert. The setting was: 78.26 rpm; 3.5 mil; 4 g; truncated elliptical; FFRR equalization curve. Prism A/D Converter Dream AD-2.
3. Opto-mechanical: ELP laser player, mod. LT-1XRC. The setting was: 78.26 rpm, FFRR equalization curve. Prism A/D Converter Dream AD-2.
4. Opto-digital: PoG system, with a photo taken at 4800 dpi, 8 bit grayscale, without digital correction.
Fig. 3. Periodograms of the audio signals (about 20 seconds each) taken with the four re-recording systems of Fronne ’e limone (left) and La signorina sfinciusa (right)
These periodograms highlight that, even if the turntable and the laser are more sensitive both to the original signal and to the artifacts (scratches and dust) on the carrier, the four systems give comparable results. A closer analysis of the turntable spectrum showed a peak at about 11 Hz, which is due to the mechanical vibrations of the arm and probably to a resonance between the arm and the motor. The same peak is not observed in the three other spectra. Similar considerations apply to the second case study. The method described in Sec. 4 was used to evaluate differences and similarities between these audio signals. At first we compared the playback speed of the mechanical equipment with PoG, which is a useful reference because it is not subject to mechanical variations. For the two case studies, we noticed that all the systems had constant speed, with no variation from the nominal value. Fig. 2 highlights this regular trend for the phonograph. We also computed and aligned the most relevant features proposed in the audio processing literature. In particular, the first four spectral moments (centroid, spread, skewness, and kurtosis), brightness, and spectral rolloff have been computed using an analysis window of 8192 points (all the signals were sampled at 48 kHz, 24 bit resolution). As can be noted from Fig. 4, spectral rolloff is a good indicator for highlighting the use of a phonograph, because the rolloff values are very close to the ones given by PoG. Moreover, we computed the mean and variance of the point-to-point differences of brightness values, using PoG as a reference.
Fig. 4. Rolloff of the audio signals taken with the four re-recording systems of Fronne ’e limone (left) and La signorina sfinciusa (right): phonograph (green, dash-dot), turntable (red, dashed), laser (blue, dotted), and PoG (black, solid)
Table 1. Mean and variance of the difference between brightness of three re-recording systems using PoG as a reference

Case Study              Equipment    Mean    Variance
Fronne ’e limone        phonograph    0.158   0.611
Fronne ’e limone        turntable     0.024   0.713
Fronne ’e limone        laser         1.734   1.647
La signorina sfinciusa  phonograph   -0.467   1.167
La signorina sfinciusa  turntable    -0.420   0.974
La signorina sfinciusa  laser         0.182   0.600
Results in Tab. 1 show that, at least for the two case studies, it is possible to use the mean brightness difference between the turntable acquisition and PoG in order to distinguish it from the laser acquisition.
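A sketch of how such frame-level spectral descriptors could be computed is given below; the rolloff ratio and brightness cutoff used here are common textbook choices and are only assumed to match the features referred to in the text.

```python
import numpy as np

def spectral_descriptors(mag, sr=48000, rolloff_ratio=0.85, cutoff_hz=1500.0):
    """Centroid, spread, rolloff and brightness for one magnitude spectrum
    `mag` (assumed one-sided, from an 8192-point analysis window)."""
    freqs = np.linspace(0.0, sr / 2.0, num=len(mag))
    energy = mag ** 2
    total = energy.sum() + 1e-12
    centroid = (freqs * energy).sum() / total
    spread = np.sqrt(((freqs - centroid) ** 2 * energy).sum() / total)
    # rolloff: frequency below which rolloff_ratio of the energy lies
    rolloff = freqs[np.searchsorted(np.cumsum(energy), rolloff_ratio * total)]
    # brightness: fraction of energy above a cutoff frequency
    brightness = energy[freqs >= cutoff_hz].sum() / total
    return centroid, spread, rolloff, brightness

def brightness_difference(frames_a, frames_ref):
    """Point-to-point brightness differences of two aligned recordings,
    using the second one (e.g. PoG) as a reference."""
    diff = [spectral_descriptors(a)[3] - spectral_descriptors(r)[3]
            for a, r in zip(frames_a, frames_ref)]
    return float(np.mean(diff)), float(np.var(diff))
```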
6
Conclusions
Audio archivists can take advantage of a variety of equipment for the re-recording of phonographic discs. We presented two case studies using four different paradigms: mechanical, electro-mechanical, opto-mechanical, and opto-digital. Although there are some differences in sensitivity with respect to both the audio signal and the corruptions, the four approaches give comparable results. The choice of the system depends on the aims of the audio archive, as described in Sect. 2, but it is also biased by the resources (human and economic) that are available for the A/D transfer task. The results of this study suggest different directions. A federation of libraries may have a number of different digital versions of the same material, and metadata may describe neither the A/D transfer nor the link to the original carrier. By applying the procedure described in this work, that is, synthesizing an audio signal using PoG and aligning it to the existing digital documents, it is possible to identify the kind of A/D process, its possible defects, and to retrieve the original analogue recording. It is common practice for large archives to store digital images of the carriers. Given the reduced costs of mass storage, it is likely that high quality pictures are also stored in the archive and can be used by PoG. Our approach can be exploited for monitoring the quality of A/D transfer, by using objective measures for checking playback speed, detecting missing samples or digital ticks, and verifying equalization curves. Monitoring can be carried out either on already transferred documents or during the A/D process.
For smaller archives, which cannot afford the personnel and technological costs, our results show that preservation copies can be created using only high resolution photos. The access copy can be synthesized on demand using PoG, or similar approaches developed in the meantime. Finally, if the disc is a unique copy of a historical recording, an opto-digital approach allows us to listen to the document (and create a preservation copy) in a non-invasive way (i.e., without the use of some kind of adhesive), even when the carrier is so seriously damaged that alternative re-recordings are unfeasible.
References 1. AES-11id-2006: AES Information document for Preservation of audio recordings – Extended term storage environment for multiple media archives. AES (2006) 2. AES22-1997: AES recommended practice for audio preservation and restoration – Storage and handling – Storage of polyester-base magnetic tape. AES (2003) 3. AES49-2005: AES standard for audio preservation and restoration – Magnetic tape – Care and handling practices for extended usage. AES (2005) 4. Bari, A., Canazza, S., Poli, G.D., Mian, G.: Toward a methodology for the restoration of electro-acoustic music. J. New Music Research 30(4), 365–374 (2001) 5. Boston, G.: Safeguarding the Documentary Heritage. A guide to Standards, Recommended Practices and Reference Literature Related to the Preservation of Documents of all kinds. UNESCO (1988) 6. Brock-Nannestad, G.: The objective basis for the production of high quality transfers from pre-1925 sound recordings. In: AES Preprint n ◦ 4610 Audio Engineering Society 103rd Convention, pp. 26–29, New York (1997) 7. Canazza, S., Dattolo, A.: Listening the photos. In: Proceedings of 25th Symposium on Applied Computing, March 22-26, Sierre, Switzerland (accepted for publication, 2010) 8. Canazza, S., Vidolin, A.: Preserving electroacoustic music. Journal of New Music Research 30(4), 351–363 (2001) 9. Cavaglieri, S., Johnsen, O., Bapst, F.: Optical retrieval and storage of analog sound recordings. In: Proceedings of AES 20th International Conference, Budapest, Hungary (October 2001) 10. Dixon, S., Widmer, G.: Match: a music alignment tool chest. In: Proceedings of the International Conference of Music Information Retrieval, pp. 492–497 (2005) 11. Doerr, M.: The cidoc crm – an ontological approach to semantic interoperability of metadata. AI Magazine 24(3), 75–92 (2003) 12. Doerr, M.: Increasing the power of semantic interoperability for the european library. ERCIM News 26, 26–27 (2006) 13. Fedeyev, V., Haber, C.: Reconstruction of mechanically recorded sound by image processing. Journal of Audio Engineering Society 51(12), 1172–1185 (2003) 14. Gill, T.: Building semantic bridges between museums, libraries and archives: the cidoc conceptual reference model. First Monday 9(5) (2004) 15. IASA-TC 03: The Safeguarding of the Audio Heritage: Ethics, Principles and Preservation Strategy. IASA Technical Committee (2005) 16. IASA-TC 04: Guidelines on the Production and Preservation of Digital Objects. IASA Technical Committee (2004)
17. IFLA/UNESCO: Safeguarding our Documentary Heritage / Conservation préventive du patrimoine documentaire / Salvaguardando nuestro patrimonio documental. CD-ROM Bi-lingual: English/French/Spanish. UNESCO “Memory of the World” Programme, French Ministry of Culture and Communication (2000) 18. Miller, D.: The Science of Musical Sounds. Macmillan, New York (1922) 19. Miotto, R., Orio, N.: Automatic identification of music works through audio matching. In: Proceedings of 11th European Conference on Digital Libraries, pp. 124–135 (2007) 20. Müller, M., Kurth, F., Clausen, F.: Audio matching via chroma-based statistical features. In: Proceedings of the International Conference of Music Information Retrieval, pp. 288–295 (2005) 21. Ng, K.N., Vu Pham, T., Ong, B., Mikroyannidis, A., Giaretta, D.: Preservation of interactive multimedia performances. International Journal of Metadata, Semantics and Ontologies 3(3), 183–196 (2008) 22. Ng, K.N., Vu Pham, T., Ong, B.: Ontology for Preservation of Interactive Multimedia Performances. In: Metadata and Semantics. Springer, US (2009) 23. Orcalli, A.: On the methodologies of audio restoration. Journal of New Music Research 30(4), 307–322 (2001) 24. Orio, N., Zattra, L.: Audio matching for the philological analysis of electro-acoustic music. In: Proceedings of the International Computer Music Conference, pp. 157–164 (2007) 25. Rabiner, L., Juang, B.: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs (1993) 26. Schüller, D.: Preserving the facts for the future: Principles and practices for the transfer of analog audio documents into the digital domain. Journal of Audio Engineering Society 49(7-8), 618–621 (2001) 27. Storm, W.: The establishment of international re-recording standards. Phonographic Bulletin 27, 5–12 (1980) 28. Stotzer, S., Johnsen, O., Bapst, F., Sudan, C., Ingol, R.: Phonographic sound extraction using image and signal processing. In: Proceedings of ICASSP 2004, vol. 4, pp. 289–292 (May 2004)
Toward Conversation Retrieval
Matteo Magnani and Danilo Montesi
Dept. of Computer Science, University of Bologna, Mura A. Zamboni 7, 40100 Bologna
[email protected], [email protected]
Abstract. Social Network Sites can be seen as very large information repositories containing millions of text messages usually organized into complex networks involving users interacting with each other at specific times. In this paper we discuss how traditional information retrieval techniques can be extended to deal with these social aspects. In particular we formalize the concept of conversation in the context of Social Network Sites and define constraints regulating ranking functions over conversations.
1
Introduction
Social Network Sites (SNSs) are among the most relevant places where information is created, exchanged and transformed, as witnessed by the number of their users and by their activity during events or campaigns like the terror attack in Mumbai in 2008 or the so-called Twitter revolution in Iran in 2009. If we look at the kind of interactions happening inside SNSs, these services can be seen as on-line third places [1]. Third places, like coffee-bars, are so called because they are distinct from the two usual social environments of home and workplace and are important because they enable specific communication patterns (like a peer conversation between an employer and an employee on topics not related to their working activity) and facilitate the emergence of political ideas and the aggregation of individuals. SNSs can be seen as very large information repositories with relevant potential applications — from practical areas like politics and marketing to more theoretical fields like social sciences and psychology. Therefore, being able to retrieve information from these repositories is very valuable. However, SNSs are more than just collections of text messages, making the retrieval process not trivial and requiring specific concepts and techniques. To highlight some of the characteristic features of SNSs we may consider the example of Friendfeed, a well known SNS recently acquired by Facebook, that
This work has been partially funded by Telecom Italia and by PRIN project Tecniche logiche e operazionali per interazione tra componenti.
offers features that can be associated both with Twitter, e.g., providing status updates, and with Facebook, e.g., creating complex conversations [2]. Using this SNS, users can follow each other, post entries, comment on these entries, and like them. This can be represented using the following relational schema:
User(ID, Type, Name, Description)
Following(Follower, Followed)
Entry(PostID, PostedBy, Timestamp, Text, Language)
Comment(PostID, EntryRef, PostedBy, Timestamp, Text, Language)
Like(User, EntryRef, Timestamp)
This schema represents a much more complex data model than it may appear at first glance. In Figure 1 we represent an example of a portion of a typical social data structure corresponding to this schema, to highlight its main features — multiple graphs, labeled arcs and unstructured text. From this example, it is easy to see that in SNSs text messages are embedded in complex structures, where who posts something and when it has been posted are as important as what has been posted. An information extraction process should thus model and exploit these aspects. More specifically, text messages (or posts) are only the basic bricks composing more complex and socially relevant conversations between communities of users. While it is certainly interesting to analyze each post on its own, it is also very important to be able to manipulate these complex structures, e.g., clustering conversations, retrieving the conversations about a given topic, or understanding the topic of a conversation. Evidently, all these information retrieval capabilities are based on the availability of a model to rank a set of conversations with respect to some information requirements. In this paper we deal with this problem. The problem of retrieving information from SNSs has already been addressed in the past, because of its theoretical and practical relevance. In particular,
Fig. 1. An example of a social data structure. Users (circles) can follow each other, post text entries, make comments (sometimes directed to specific users through the symbol @) and like other entries
there is a plethora of tools that can be used to monitor the usage of keywords or tags, e.g., Twitter Search. However, these tools work on simple text collections and do not consider their structure. Studies in Structured Information Retrieval [3], with specific reference to structured documents [4,5,6,7,8] have considered the problem of retrieving parts of documents, but this seems to be different from the organization of Social Network conversations, where we have no overlapping messages but connections between them and we should consider user interactions while ranking conversations. Studies in Hypertext Information Retrieval [9] and Web Information Retrieval have developed methods to consider connections between text documents, like Google’s PageRank [10], but these approaches do not include user interaction (e.g., the popularity of the author of a Web page) and do not provide means to compute the aggregate relevance of sequences of text messages into super-structures, like in SNS conversations. At the same time they can certainly be used to compute a user popularity index based on friendship connections, as it will be discussed later. This paper is organized as follows. In Section 2 we formalize the concept of conversation in a Friendfeed- or Facebook-like SNS, starting from simple text interactions between two users and composing them into larger structures. This formalization is necessary to define a ranking function for conversations. In Section 3 we define a set of properties that a ranking function for conversations should satisfy. We conclude the paper with some final remarks.
2
Modeling a Conversation
In Figure 2 we represent the basic components of a communication process. A sender codifies a message and sends it through a communication channel. The message is then decoded by the receiver to interpret it. When we deal with social network data, we know only some of these components, as illustrated in the lower part of the figure. In particular, we know the users exchanging the message and its textual representation. The channel enabling the communication is the Web page of the SNS, and it is constant in our discussion, therefore it will be omitted. In addition, we know the time when the message is posted by the sending user.
Fig. 2. A graphical representation of a step of a communication process (interaction), and its equivalent in SNS data
Definition 1 (Dyadic interaction). Let U be a set of people, T a set of timestamps, and M a set of text messages. A dyadic interaction is a tuple (t, u1 , u2 , m), where t ∈ T , u1 , u2 ∈ U, and m ∈ M. If I = (t, u1 , u2 , m) is a dyadic interaction, we will notate ts(I) = t (timestamp of the interaction), snd(I) = u1 (sender), rec(I) = u2 (receiver), and msg(I) = m (text message). A dyadic conversation is a chronological sequence of text messages exchanged between two users: Definition 2 (Dyadic conversation). A Dyadic conversation is a sequence (I1 , . . . , In ), where ∀i ∈ [1, n] Ii is a dyadic interaction and ∀i, j ∈ [1, n], i < j (ts(Ii ) < ts(Ij )).
Fig. 3. A graphical representation of a dyadic conversation
In complex social environments we usually experience conversations between sets of users. The concept of dyadic interaction can be easily extended to deal with multiple receivers: Definition 3 (Polyadic interaction). We model a polyadic conversation as a tuple (t, u1 , U2 , m), where t ∈ T , u1 ∈ U, U2 ⊆ U, and m ∈ M. If I = (t, u1 , U2 , m) is a polyadic interaction, we will notate rec(I) = U2 . It follows that a polyadic conversation is a chronological sequence of text messages exchanged between one sender and a set of receivers, where the people involved may change during the conversation. Definition 4 (Polyadic conversation). A polyadic conversation is a sequence (I1 , . . . , In ) where ∀i ∈ [1, n] Ii is a polyadic interaction and ∀i, j ∈ [1, n], i < j (ts(Ii ) < ts(Ij )).
3
Ranking Conversations
In this section we define some properties characterizing the functions that can be used to rank a set of conversations. To rank text messages we can compute their relevance with regard to some information requirements, e.g., using a vector space model and a list of keywords.
Fig. 4. A graphical representation of a polyadic conversation. Notice that during the conversation the set of receivers may change
However, the same sentence pronounced by two different people will have different degrees of importance — a message from a Prime Minister will probably be more important than a message from one of the authors of this paper, at least in some contexts. In addition, and with absolutely no reference to the previous example, the identity of the speaker may be much more important than what he is saying, making his social interactions very popular even when he is not saying anything meaningful. The computation of the importance of a conversation with respect to some information requirements thus also depends on the users involved in the communication. We will therefore use a concept of popularity of sender and receivers, depending again on some contextual information. Finally, the same people may exchange the same message, but at different times this may be more or less important — for example, a five-year-old message can be less important than very recent news. Therefore, we will use a measure of the timeliness of a social interaction.
3.1 Basic Functions
In the following we assume to know how to compute the relevance of a text message, the popularity of a user and the timeliness of a timestamp. All these functions will take respectively a text message, a user and a timestamp as input, in addition to a context within which the function should be evaluated. Relevance (rel : M × C → [0, 1], where C represents our information requirements) can be evaluated using any IR model. However, in our definitions we will require an independence property, stating that adding a new message to a conversation does not change the relevance of previous messages. Therefore, in case tf-idf-inspired measures are used, particular attention should be paid to the definition of the idf term. In the context of Social Network Sites, our information requirements can be expressed as a list of keywords, which will be omitted from the following definitions to enhance their readability. Popularity (pop : U × C → [0, 1]) can be defined in several different ways. The popularity of a person depends on the context (e.g., at a public institutional
event, or at a conference on computing), but in the context of an SNS we will typically compute it as a function of her followers (or friends). Also in this case, in the following definitions the context will be assumed to be constant, i.e., the specific SNS, and will not be represented in the equations. As for relevance, our definitions are based on an independence principle: sending a message to a user does not increase her popularity — an assumption that we may like to relax in future work. Finally, the timeliness (tml : T × C → [0, 1]) of an interaction can be defined in many ways. Considering the average duration of a conversation in SNSs (a few minutes on Friendfeed), we may decide not to consider this measure (e.g., using a constant function) or we could return a result proportional to the input timestamp.
3.2 Ranking Interactions
The properties defined in this subsection constrain a ranking function for social interactions, indicated as score, to be monotonic with respect to changes of its constituents. The first property regards the relevance of a text message, and states that if we take an interaction between two people at time t and we substitute the message with a more relevant message, the score of the interaction does not decrease:

if rel(m) ≥ rel(m′) then score((t, u1, u2, m)) ≥ score((t, u1, u2, m′))   (1)

Similarly, if we take an interaction and substitute one actor (sender or receiver) with a more popular actor, this will not decrease its score:

if pop(u1) ≥ pop(u1′) then score((t, u1, u2, m)) ≥ score((t, u1′, u2, m))   (2)

if pop(u2) ≥ pop(u2′) then score((t, u1, u2, m)) ≥ score((t, u1, u2′, m))   (3)

Finally, if we say something at a different time, when it is more timely, this does not decrease its score:

if tml(t) ≥ tml(t′) then score((t, u1, u2, m)) ≥ score((t′, u1, u2, m))   (4)
To compute the score of a polyadic interaction we need to extend the concept of popularity to sets of users. By definition, we set the popularity of no users to 0, and the popularity of a singleton to the popularity of the only user in the set:

pop({}) = 0   (5)

pop({u}) = pop(u)   (6)

Now we can see how to compute the popularity of a set of more than one user. If we add to a set a user with popularity equal or greater (resp. less) than the popularity of the set, the overall popularity will not decrease (resp. increase):

if pop(u) ≥ pop(U) then pop(U ∪ {u}) ≥ pop(U)   (7)

if pop(u) ≤ pop(U) then pop(U ∪ {u}) ≤ pop(U)   (8)
Similarly, if we substitute a user with another one which is more popular, the overall popularity will not decrease:

if pop(ui) ≥ pop(ui′) then pop((u1, . . . , ui, . . . , un)) ≥ pop((u1, . . . , ui′, . . . , un))   (9)
The ranking of polyadic interactions can now be defined by the same set of rules already defined for dyadic interactions (1, 2, 3 and 4), extended to allow sets of receivers. We will not rewrite these rules, as they are trivial revisions of the previous equations.
3.3 Ranking Conversations
Now, we can use a ranking function for interactions to rank a set of conversations. The first step consists in defining the score associated to an empty conversation as 0, and the score of a conversation with a single interaction as the score of that interaction:

score(()) = 0   (10)

score((I)) = score(I)   (11)

If we add to a conversation a new interaction whose score is greater (resp. less) than the one of the conversation, the overall score will not decrease (resp. increase):

if score(I) ≥ score((I1, . . . , In)) then score((I1, . . . , In, I)) ≥ score((I1, . . . , In))   (12)

if score(I) ≤ score((I1, . . . , In)) then score((I1, . . . , In, I)) ≤ score((I1, . . . , In))   (13)

This rule specifies that many irrelevant interactions will reduce the overall score of the conversation. Similarly, if we substitute an interaction with another one with a higher score, we will not decrease the score of the whole conversation:

if score(Ii) ≥ score(Ii′) then score((I1, . . . , Ii, . . . , In)) ≥ score((I1, . . . , Ii′, . . . , In))   (14)
It is worth noting that the values of the ranking functions and the combination function may vary significantly while still satisfying these constraints. While tests on real cases may help in choosing effective values, it is our opinion that users should be able to interact with the combination function so that they can specify how much each aspect (relevance, popularity, timeliness) should be weighted for each search task.
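As an illustration only (the paper deliberately leaves the concrete functions open), one combination that satisfies the monotonicity constraints above is a weighted product of relevance, popularity and timeliness per interaction, averaged over the conversation. The weights and the popularity model below are assumptions, not the authors' choices; the sketch uses the PolyadicInteraction objects defined earlier.

```python
# illustrative weights; any non-negative exponents preserve monotonicity
W_REL, W_SND, W_REC, W_TML = 1.0, 0.5, 0.5, 0.25

def popularity_of_set(pop, users):
    """pop() for a set of receivers: the mean member popularity is one
    choice consistent with properties (5)-(9)."""
    users = list(users)
    return sum(pop(u) for u in users) / len(users) if users else 0.0

def interaction_score(inter, rel, pop, tml):
    """Weighted product of relevance, popularity and timeliness; since all
    factors lie in [0, 1] and the exponents are non-negative, the score is
    monotone in each of them, as required by properties (1)-(4)."""
    return (rel(inter.message) ** W_REL
            * pop(inter.sender) ** W_SND
            * popularity_of_set(pop, inter.receivers) ** W_REC
            * tml(inter.timestamp) ** W_TML)

def conversation_score(conv, rel, pop, tml):
    """The mean interaction score satisfies properties (10)-(14)."""
    if not conv:
        return 0.0
    scores = [interaction_score(i, rel, pop, tml) for i in conv]
    return sum(scores) / len(scores)
```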
4
Conversational Density
The score of a conversation with respect to a given topic (or keyword) is not the only metric to be used to compute its importance. In fact, we may expect
that the comment rate of a message indicates a sort of degree of interest in that message or topic. As an example, consider a passionate political discussion: some people will not wait their turn to speak, and the more the conversation touches sensitive topics, the more people increase the frequency of their interactions, starting to speak at the same time. Unfortunately, SNSs do not give us a simple way to understand the loudness of a message and other fundamental aspects like non-verbal signals [11]. However, conversational density may tell us something more than a single message can. The fact that the frequency, or density, of interactions during a conversation may tell us something about the content of the message, its emotional charge, or its degree of interestingness, is also suggested by the analysis of real messages. As an example, consider Figure 5, where we represent the activity related to a discussion on the Friendfeed SNS using a frequency-modulation chart. Looking at what is happening inside the dense parts of this conversation is outside the scope of this paper, but it should be clear that density may be used as an indicator of something that is not otherwise directly stored in the data, which could for example indicate a degree of interest in that topic. We could even postulate the existence of thresholds of density, so that, for example, when an on-line conversation exceeds this threshold people can no longer follow all the conversational threads and new sub-topics and sub-conversations spring up inside the same chain of comments — a phenomenon that is sometimes perceived by SNS users. However, this analysis also lies outside the scope of this contribution, and we will develop these aspects in future work. From this discussion, it seems important to be able to model a concept of density of a conversation. Intuitively, we can define density as the number of interactions divided by the duration of the conversation, and this is certainly a reasonable choice. However, this would not allow us to highlight the fact that the conversation represented in Figure 5 contains some very dense regions. Therefore, we can consider the usage of alternative density functions.
Fig. 5. A graphical representation of different densities during the development of a conversation inside the Friendfeed SNS
When we have less than two interactions, there is no real conversation and no concept of density. By definition, we set the density of these conversations to 0:

dns(()) = dns((I)) = 0   (15)

Now consider two equal conversations. If we add an interaction to the first one and we add another interaction with a higher timestamp to the second one, the first conversation will become denser. This concept is illustrated in Figure 6(1).

if ts(I) ≤ ts(I′) then dns((I1, . . . , In, I)) ≥ dns((I1, . . . , In, I′))   (16)

If we put a new interaction inside a conversation, this will obviously increase its density. This is illustrated in Figure 6(2).

dns((I1, . . . , Ii, Ii+1, . . . , In)) < dns((I1, . . . , Ii, I, Ii+1, . . . , In))   (17)

Finally, we can move beyond the more intuitive concept of density defined as the number of interactions divided by the duration of the conversation, adding the following rule, which favors conversations with variable internal densities over regular conversations. According to this rule, and as represented in Figure 6(3), if we add the same interaction to two different conversations and the first is denser than the second, it will be denser also after the addition.

if dns((I1, . . . , In)) ≥ dns((I1′, . . . , Im′)) then dns((I1, . . . , In, I)) ≥ dns((I1′, . . . , Im′, I))   (18)
Fig. 6. Effect of adding new interactions (gray vertical lines) to a conversation on its density: vertical lines represent interactions, horizontal lines indicate the time
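As a concrete and deliberately simple possibility, the density function sketched below sums the reciprocals of the time gaps between consecutive interactions. It satisfies properties (15)-(17) and rewards locally dense regions; property (18) holds only under additional assumptions (e.g. the two conversations ending at the same time), so it is offered as an illustration rather than as the authors' definition.

```python
def density(conv):
    """Sum of reciprocal gaps between consecutive interactions.
    conv is a chronologically ordered list of PolyadicInteraction objects."""
    if len(conv) < 2:
        return 0.0                                  # property (15)
    gaps = [b.timestamp - a.timestamp for a, b in zip(conv, conv[1:])]
    # an interaction appended earlier shrinks the last gap -> higher density (16);
    # splitting a gap g into a + b gives 1/a + 1/b > 1/g -> higher density (17)
    return sum(1.0 / g for g in gaps if g > 0)
```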
5
Concluding Remarks
In this paper we have introduced some preliminary aspects of what we call conversation retrieval, an information retrieval activity which exploits structural aspects in addition to the exchanged text messages. In particular, we have presented a set of properties that should be satisfied to consider relevance, popularity and timeliness in the computation of the ranking of a conversation. Many alternative functions satisfying these properties can then be used, potentially
leading to different results — we consider this a strength of our proposal, because users may choose between alternative approaches to favor different aspects, e.g., text relevance or popularity. Moreover, we have discussed the importance of having an additional notion of density, which can be potentially used as a metric to associate degrees of interest or other emotional labels to conversations or parts of conversations. The next step of this research will consist in the implementation of a conversation retrieval system to be tested in real cases.
References 1. Oldenburg, R.: Third Place: Inspiring Stories about the Great Good Places at the Heart of Our Communities. Marlowe & Company (2000) 2. Celli, F., Di Lascio, F.M.L., Magnani, M., Pacelli, B., Rossi, L.: Social Network Data and Practices: the case of Friendfeed. In: International Conference on Social Computing, Behavioral Modeling, & Prediction. Springer, Berlin (2010) 3. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems 15(1), 32–66 (1997) 4. Lalmas, M.: Dempster-Shafer’s theory of evidence applied to structured documents: modelling uncertainty. In: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 110–118. ACM Press, New York (1997) 5. Fuhr, N., Großjohann, K.: XIRQL: A query language for information retrieval in XML documents. In: SIGIR Conference (2001) 6. Amer-Yahia, S., Fernandez, M.F., Srivastava, D., Xu, Y.: Phrase matching in XML. In: Proceedings of the International Conference on Very Large Data Bases (2003) 7. Amer-Yahia, S., Botev, C., Shanmugasundaram, J.: Texquery: a full-text search extension to XQuery. In: WWW (2004) 8. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: Flexpath: Flexible structure and full-text querying for xml. In: SIGMOD Conference (2004) 9. Agosti, M., Smeaton, A.F.: Information retrieval and hypertext. Kluwer Academic, Boston (1996) 10. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Computer Networks and ISDN Systems, pp. 107–117 (1998) 11. Watzlawick, P., Bavelas, J.B., Jackson, D.D.: Pragmatics of Human Communication: A Study of Interactional Patterns, Pathologies, and Paradoxes. W. W. Norton and Co. (1967)
Improving Classification and Retrieval of Illuminated Manuscript with Semantic Information
Costantino Grana, Daniele Borghesani, and Rita Cucchiara
Università degli Studi di Modena e Reggio Emilia, Via Vignolese 905/b - 41100 Modena
[email protected]
Abstract. In this paper we detail a proposal for the exploitation of expert-made commentaries in a unified system for illuminated manuscript image analysis. In particular we explore the possibility of improving the automatic segmentation of meaningful pictures, as well as the retrieval-by-similarity search engine, using clusters of keywords extracted from commentaries as semantic information.
1
Introduction
The availability of semantic data is a well known advantage for different image retrieval tasks. In the recent literature, a lot of works have been proposed to create, manage and further exploit semantic information to be used in multimedia systems. The reason is that information retrieval from textual data is quite successful. Semantic data are typically exploited in web searches in the form of tags (e.g., Google Images), but the process of tagging an image is known to be very tedious from a user perspective, and likewise the process of correctly linking textual information about an image to the image itself is very tricky and error-prone. Nevertheless, the amount of information and details held by a human-made description of an image is often very precious, and it cannot be fully extracted using techniques based on vision. Regarding the system globally, it is known that many artistic or historical documents cannot be made available to the public, due to their value and fragility, so museum visitors are usually very limited in their appreciation of this kind of artistic production. For this reason, the availability of digital versions of the artistic works, made accessible — both locally at the museums owning the original version and remotely — with suitable software, represents undoubtedly an intriguing possibility of enjoyment (from the tourist perspective) and study (from an expert perspective). Italy, in particular, has a huge collection of illuminated manuscripts, but many of them are not freely accessible to the public. These masterpieces contain thousands of valuable illustrations: different mythological and real animals, biblical episodes, court life illustrations, and some of them even testify to the first attempts
in exploring perspective for landscapes. Manual segmentation and annotation of all of them is dramatically time consuming. For this reason, accomplishing the same task with an automatic procedure is very desirable but, at the same time, really challenging due to the visual appearance of these pictures (their arrangement over the page, their various framing into the decorative parts, and so on). In this paper, we propose a solution for automatic manuscript segmentation and picture extraction. A modification of the bag-of-keypoints approach is used to efficiently apply it in the context of automatic categorization of artistic hand-drawn illustrations (i.e., separating illustrations depending on their content, e.g., people vs. animals). On top of this system, we integrated the knowledge of a complete commentary available for the specific manuscript we have. A standard keyword clustering approach (usually known as a tag cloud) has been used to find out the most relevant subjects within the entire book or in a smaller section; then we explored the correspondence between clusters of words and clusters of extracted pictures, to verify whether we can use textual data to help and improve the object recognition. The final goal is to provide automatic content-based functionalities, such as searches for similarity, comparison, and recognition of specific elements (people, life scenes, animals, etc.) in artistic manuscripts, including also textual search in the retrieval engine.
2
Related Work
The problem of image analysis and classification of historical manuscripts has become a significant subject of research in recent years, even if the availability of complete systems for the automatic management of illuminated manuscript digital libraries is quite limited. The AGORA [1] software computes a map of the foreground and the background and consequently proposes a user interface to assist in the creation of an XML annotation of the page components. The Madonne system [2] is another initiative to use document image analysis techniques for the purpose of preserving and exploiting cultural heritage documents. In [3], Le Bourgeois et al. highlighted some problems with acquisition and compression; then the authors gave a brief subdivision of document classes, and for each of them provided a proposal of analysis. They distinguished between medieval manuscripts, early printed documents of the Renaissance, authors' manuscripts from the 18th to the 19th century and, finally, administrative documents of the 18th-20th century: the authors perform color depth reduction, then a layout segmentation that is followed by the main body segmentation using text zone location. The feature analysis step uses some color, shape and geometrical features, and a PCA is performed in order to reduce the dimensionality. Finally, the classification stage implements a K-NN approach. The bag-of-keypoints approach has become increasingly popular and successful in many object recognition and scene categorization tasks. The first proposals constructed a vocabulary of visual words by extracting image patches, sampled from a grid [4]. More advanced approaches used an interest point detector to
select the most representative patches within the image [5]. The idea finally evolved toward the clustering and quantization of local invariant features into visual words, as initially proposed by [6] for object matching in videos. Later the same approach was exploited in [7], which proposed the use of visual words in a bag-of-words representation built from SIFT descriptors [8] and various classifiers for scene categorization. SIFT in particular was one of the first algorithms which combined an interest point detector and a local descriptor to gain good robustness to background clutter and good accuracy in description. Lately several aspects have been investigated. For example, as shown in [9], the bag-of-words approach creates a simple representation but potentially introduces synonymy and polysemy ambiguities, which can be solved using probabilistic latent semantic analysis (PLSA) in order to capture co-occurrence information between elements. In [10] the influence of different strategies for keypoint sampling on the categorization accuracy has been studied: the Laplacian of Gaussian (LoG), the Harris-Laplace detector used in [11], and random sampling. A recent comparison of vocabulary construction techniques is proposed in [12]. The idea of exploiting text in order to improve or integrate image (and video) retrieval is not new. In fact, many complex retrieval systems present a fusion stage in which visual features are somehow fused with audio and/or textual features. This process is usually referred to as multimodal fusion. For instance, in [13] and [14], in the context of video retrieval, Snoek et al. learned a list of concept-specific keywords, and based on this list they constructed a word frequency histogram from shot-based speech transcripts. In [15], Chen et al. aimed to improve web image search engines by extracting textual information from the image “environment” (tags, URLs, page content, etc.) and users’ logs. The text description (which semantically describes the image) is then combined with other low-level features extracted from the image itself to compute a similarity assessment. Textual information is also a very familiar approach for querying the system (since web search engines rely on it), so several works propose the query-by-keyword functionality along with the query-by-example and query-by-concept modalities. For references about this topic, please refer to [16].
3
Automatic Segmentation and Retrieval
In [17] we described a system and the techniques used for text extraction and picture segmentation of illuminated manuscripts. The goal of the automatic segmentation system is to subdivide the document into its main semantic parts, in order to enable the design of new processing modules to manage and analyze each part, relieving the user of the task of manual annotation of the whole book collection. In that work we also introduced a first module for content-based retrieval functionalities by visual similarity with an ad-hoc designed user interface. The module for text segmentation computes the autocorrelation matrix over gray-scale image patches and converts them into a polar representation called
direction histogram: a statistical framework able to handle angular datasets (i.e., a mixture of Von Mises distributions) generates a compact representation of such histograms, which are then the final features used to classify each block through an SVM classifier. The text-free parts of the image are then passed to a second module that separates plain background, decorations and miniatures. Here we use a sliding window approach and represent each window with a descriptor that joins color features (RGB and Enhanced HSV histograms) and texture features (Gradient Spatial Dependency Matrix, GSDM). As in [18], we exploited a Lipschitz embedding technique to reduce the dimensionality of the feature space and again used an SVM classifier to obtain the desired classification. Some examples of picture extraction results are shown in Fig. 1.
Fig. 1. Example of picture detection results
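A minimal sketch of this sliding-window classification step is given below; the color histogram used here is a simplified stand-in for the RGB/HSV and GSDM descriptors, and the window size, step and class labels are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def window_descriptor(window):
    """Color descriptor for one RGB window (H x W x 3, uint8); a simplified
    stand-in for the joint color + texture features of the paper."""
    hist, _ = np.histogramdd(window.reshape(-1, 3),
                             bins=(8, 8, 8), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def classify_page(page, clf, size=64, step=32):
    """Slide a window over the page and label each position with the SVM,
    e.g. 0 = background, 1 = decoration, 2 = miniature."""
    labels = {}
    for y in range(0, page.shape[0] - size + 1, step):
        for x in range(0, page.shape[1] - size + 1, step):
            feat = window_descriptor(page[y:y + size, x:x + size])
            labels[(y, x)] = clf.predict([feat])[0]
    return labels

# training (sketch): descriptors of manually labeled windows
# clf = SVC(kernel="rbf").fit(train_features, train_labels)
```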
4
Bag-of-Keypoints Classification
One of the most successful strategies to perform object and scene recognition is the bag-of-keypoints approach. The main idea comes from text categorization (bag-of-words), and it consists in defining, during the training phase:
a. a set of “words” that is rich enough to provide a representative description of each and all the classes;
b. the occurrences of these “words” for each class.
In our context, since we cannot directly extract high level semantic words, we can define “visual words” by appropriately clustering visual descriptors (e.g., keypoint descriptors): the set of cluster centroids creates the so-called vocabulary. After having counted, for each class, the occurrences of each word, the classification can then be easily performed by extracting the histogram of the visual words of an example, and then finding the class that has the most similar occurrence distribution.
Fig. 2. Some representatives of the people, animals and decorations classes
In [7] scene categorization is accomplished following this procedure, making use of the Harris affine detector as keypoint detector (with regions mapped to circles in order to normalize them for affine transformations) and SIFT as keypoint descriptor. In our system, for performance reasons, we preferred the use of SURF [19]: it is a very successful local descriptor, it relies on integral images for image convolutions, it uses a fast Hessian matrix based interest point detector, implemented with box filters (an approximation procedure again relying on integral images), and finally it uses a simple Haar wavelet distribution descriptor resulting in a 64-dimensional feature vector. These factors make SURF computationally more affordable than SIFT, and very similar in terms of accuracy of the point matching performance. The training set in our system is composed of patches of miniatures belonging to different classes. SURF keypoint descriptors are extracted over all patches and the visual vocabulary V is then made of the k cluster centroids obtained by running a k-means clustering procedure: V = {vi, i = 1 . . . k}, with vi a cluster centroid. Once the vocabulary is computed, each class is characterized by a specific distribution of visual word occurrences: therefore we obtain p(vi|Cj), for each class j and each visual word i. In order to avoid numerical problems later on, we apply Laplace smoothing. The number k is a key parameter of the whole process: a low k will generate a poorly descriptive vocabulary, while a high k will overfit the training data; therefore the training phase sweeps through several values of k, finding the best one through cross validation. For any new image patch I to classify, the SURF descriptors are extracted and each casts a vote for the closest cluster centroid; I can thus be described as a histogram over the visual words of the vocabulary: each bin N(vi) counts
the number of times the word vi has been extracted from the image, constituting the feature vector. The final classification is accomplished using Naïve Bayes, a simple but effective classification technique based on Bayes' rule: given the image I and a prior probability p(Cj) for the j-th class, the classifier assigns to I the class with the largest posterior p(Cj|I), computed according to Eq. 1 (thus assuming the independence of visual words within the image):

p(Cj|I) = p(Cj) ∏_{i=1}^{k} p(vi|Cj)^N(vi)   (1)
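A compact sketch of this pipeline (vocabulary construction, histogram of visual words, Naïve Bayes scoring as in Eq. 1) is shown below; it works on generic keypoint descriptors, so the descriptor extraction itself is only a placeholder for SURF.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(all_descriptors, k):
    """k-means over the training descriptors; the centroids are the visual words."""
    vocab, _ = kmeans2(all_descriptors.astype(float), k, minit="++")
    return vocab

def word_histogram(descriptors, vocab):
    """N(v_i): how many descriptors of a patch fall in each visual word."""
    words, _ = vq(descriptors.astype(float), vocab)
    return np.bincount(words, minlength=len(vocab))

def train_class_models(histograms_per_class, alpha=1.0):
    """p(v_i|C_j) with Laplace smoothing; one row per class."""
    counts = np.array([np.sum(h, axis=0) for h in histograms_per_class],
                      dtype=float) + alpha
    return counts / counts.sum(axis=1, keepdims=True)

def classify(descriptors, vocab, word_probs, priors):
    """Class with the largest log-posterior, following Eq. 1."""
    n = word_histogram(descriptors, vocab)
    log_post = np.log(priors) + (n * np.log(word_probs)).sum(axis=1)
    return int(np.argmax(log_post))
```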
5
Knowledge Summary by Tag Cloud
Given a text describing a dataset of images, the correlation between frequency (thus importance) of terms and visual content is conceptually straightforward. For this reason, a clustering of keywords in such a document will probably lead to highlighting the most important concepts (and thus visual objects) included in the dataset. The standard text processing approach in text retrieval systems is the following:
1. Parsing of the document into words.
2. Representing words by their stems. For example: “draw”, “drawing” and “drawn” are represented by the stem “draw”.
3. Definition of a stop list, i.e., a list of words to reject because they are common (such as “the”) or recur often in most documents and are thus not discriminant for a particular document.
4. Representation of the document as a vector of words and the relative frequency of occurrence within the document (different weighting techniques are possible; a minimal sketch of these steps is given below).
A very rough but effective procedure to extract important keywords from a document is the tag cloud. Tag clouds are a common visual representation of user-generated tags or, more generally, of the word content of a document, in which each word is displayed at a different size and/or color based on its incidence; they are typically employed to describe the content of the document itself. We employed this simple procedure to generate and select from the commentary some keywords to use in our tests.
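The sketch below illustrates steps 1-4 in a few lines; the stop list and the suffix-stripping stemmer are crude placeholders for a real stop list and stemmer, and the token pattern keeps accented characters since the commentary text is in Italian.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "a", "to", "is", "it"}  # illustrative stop list

def stem(word):
    # a crude suffix-stripping stand-in for a real stemmer (e.g. Porter/Snowball)
    for suffix in ("ing", "n", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Steps 1-4: tokenize, stem, drop stop words, count relative frequencies."""
    words = re.findall(r"[a-zàèéìòù]+", text.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# the most frequent terms are the candidate tag-cloud keywords
# top_keywords = sorted(term_frequencies(commentary).items(),
#                       key=lambda kv: kv[1], reverse=True)[:20]
```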
6
Retrieval by Similarity
The final application devised for this work is a system aimed at presenting all the details of illuminated manuscripts in a user friendly way. An interface is provided to the user, who can perform ranked image retrieval by content similarity: given a query picture, the corresponding histograms are compared using the histogram intersection metric, and the similarity values are normalized, fused and finally ranked, starting from the most similar to the query.
Fig. 3. Example of content-based retrieval of people. Navigating through the interface, the user can select the query image, and the system proposes the retrieval results ordered by descending appearance similarity.
As discussed in the next section, textual information can also be used along with visual descriptors to describe visual content. In fact, the scope of this work is precisely to investigate the relation between textual data (retrieved from commentaries) and pictures, in order to propose a multimodal search engine making joint use of both pictorial and textual information. So far, the similarity results provided as an example in Fig. 3 are computed using color histograms (HSV and RGB) as visual descriptors.
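A sketch of the ranking step is given below; the equal-weight fusion of the HSV and RGB similarities is an assumption, since the fusion weights are not specified in the text.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity between two normalized histograms (each summing to 1)."""
    return np.minimum(h1, h2).sum()

def rank_by_similarity(query, collection, weights=(0.5, 0.5)):
    """query and each item in collection are (hsv_hist, rgb_hist) pairs.
    Returns item indices sorted from most to least similar to the query."""
    scores = []
    for item in collection:
        sims = [histogram_intersection(q, h) for q, h in zip(query, item)]
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return sorted(range(len(collection)), key=lambda i: scores[i], reverse=True)
```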
7
Results
In this paper, we used the digitized pages of the Holy Bible of Borso d'Este, which is considered one of the best Renaissance illuminated manuscripts. Tests have been performed on a dataset of 320 high resolution digitized images (3894x2792), for a total amount of 640 pages. Each page of the dataset is an illuminated manuscript composed of a two-column layered text in Gothic font, spaced out with some decorated drop caps. The entire surrounding is highly decorated. The segmentation procedure described in Section 3 was run over the Bible pages, providing us with a set of valuable illustrations within the decoration texture (miniature illustrations of scenes, symbols, people and animals), while rejecting border decorations (ornaments) and text. Some samples are shown in Fig. 2. Once the pictures had been processed, a set of tag clouds was generated from the commentary. The first tag we analyzed was “Geremia” (Fig. 4), which is Italian for the prophet Jeremiah. A section of the Bible is dedicated to him, so in that section there are a lot of visual references to him. In the corresponding section of the commentary, the total word count is about 2900.
Fig. 4. Tag cloud generated for the word “Geremia”
In this section of the Bible, a total of 68 pictures was extracted from the illustrations of the pages. The same features used in the retrieval-by-similarity module described in Section 6 were extracted from the pictures and then clustered using the hierarchical Complete Link algorithm with an automatic selection of the clustering level. We obtained 4 clusters, and over the most populated one we obtained a recall of 73.68% and a precision of 84.85%. See Fig. 5 for some samples coming from this cluster. The second tag we analyzed was “cervo” (Fig. 6), which is Italian for deer. A lot of animals are depicted within this book, and the deer is one of the most frequent, so there are a lot of visual references to it. The total word count within the corresponding commentary pages in this case is about 9100. In the subset of pages where a deer is depicted, a total of 99 pictures was extracted from the illustrations of the pages. Visual features were extracted and then clustered. We obtained 3 clusters, and over the most populated one we obtained a recall of 84.31% and a precision of 72.88%. See Fig. 7 for some samples coming from this cluster.
Fig. 5. Samples from the most populated cluster generated by clustering visual features of pages containing the word “Geremia”
Fig. 6. Tag cloud generated for the word “cervo”
Fig. 7. Samples from the most populated cluster generated by clustering visual features of pages containing the word “cervo”
8
Conclusions
In this paper, we investigated the use of textual keywords from expert-made commentaries in a system for automatic segmentation and extraction of pictures from illuminated manuscripts. We showed that textual information (if available) can be considered a useful addition not only to help the categorization of the visual content, but also to give the user the possibility to search the visual content with textual keywords, providing a multimodal search engine analogous to common web search engines.
References 1. Ramel, J., Busson, S., Demonet, M.: AGORA: the interactive document image analysis tool of the BVH project. In: International Conference on Document Image Analysis for Libraries, pp. 145–155 (2006) 2. Ogier, J., Tombre, K.: Madonne: Document Image Analysis Techniques for Cultural Heritage Documents. In: Digital Cultural Heritage, Proceedings of 1st EVA Conference, Oesterreichische Computer Gesellschaft, pp. 107–114 (2006) 3. Le Bourgeois, F., Trinh, E., Allier, B., Eglin, V., Emptoz, H.: Document Images Analysis Solutions for Digital libraries. In: International Conference on Document Image Analysis for Libraries, pp. 2–24. IEEE Computer Society, Los Alamitos (2004)
4. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 594 (2006) 5. Agarwal, S., Awan, A.: Learning to detect objects in images via a sparse, partbased representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11), 1475–1490 (2004) 6. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos. In: International Conference on Computer Vision, vol. 2, pp. 1470–1477 (2003) 7. Dance, C.R., Csurka, G., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: ECCV Workshop on Statistical Learning in Computer Vision, pp. 1–22 (2004) 8. Lowe, D.: Object recognition from local scale-invariant features. In: International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999) 9. Quelhas, P., Monay, F., Odobez, J., Gatica-Perez, D., Tuytelaars, T.: A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(9), 1575–1589 (2007) 10. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503. Springer, Heidelberg (2006) 11. Lazebnik, S., Schmid, C., Ponce, J.: Affine-invariant local descriptors and neighborhood statistics for texture recognition. In: ICCV 2003: Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, p. 649. IEEE Computer Society, Los Alamitos (2003) 12. Uijlings, J.R.R., Smeulders, A.W.M., Scha, R.J.H.: Real-time bag of words, approximately. In: International Conference on Image and Video Retrieval (2009) 13. Snoek, C.G.M., Worring, M., Smeulders, A.W.M.: Early versus late fusion in semantic video analysis. In: MULTIMEDIA 2005: Proceedings of the 13th annual ACM international conference on Multimedia, pp. 399–402. ACM, New York (2005) 14. Snoek, C.G.M., Worring, M., Geusebroek, J.M., Koelma, D.C., Seinstra, F.J., Smeulders, A.W.M.: The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1678–1689 (2006) 15. Chen, Z., Wenyin, L., Zhang, F., Li, M.: Web mining for web image retrieval. J. Am. Soc. Inf. Sci. Technol. 52(10), 831–839 (2001) 16. Jing, F., Li, M., Zhang, H.J., Zhang, B.: A unified framework for image retrieval using keyword and visual features. IEEE Transactions on Image Processing 14(7), 979–989 (2005) 17. Grana, C., Borghesani, D., Cucchiara, R.: Describing Texture Directions with Von Mises Distributions. In: International Conference on Pattern Recognition (2008) 18. Hjaltason, G., Samet, H.: Properties of Embedding Methods for Similarity Searching in Metric Spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 530–549 (2003) 19. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf). Computer Vision and Image Understanding 110(3), 346–359 (2008)
Content-Based Cover Song Identification in Music Digital Libraries
Riccardo Miotto, Nicola Montecchio, and Nicola Orio
Department of Information Engineering, University of Padova, Padova, Italy
{riccardo.miotto,nicola.montecchio,nicola.orio}@dei.unipd.it
Abstract. In this paper we report the status of our research on the problem of content-based cover song identification in music digital libraries. An approach which exploits both harmonic and rhythmic facets of music is presented and evaluated against a test collection. Directions for future work are proposed, and particular attention is given to the scalability challenge.
1
Introduction
As digital libraries continue to gain an ever more pervasive role in the everyday experience of users, so do the information needs of users grow in complexity, up to the point that classic text and metadata-based information retrieval techniques are not suitable to be applied to multimedia content such as music. Content-based music identification has become an important research topic because it can provide tools to efficiently retrieve and organize music documents according to some measure of similarity. A prominent social phenomenon of recent years has been the increasing number of users joining social communities to upload their personal recordings and performances. The large availability of such non-commercial recordings puts a major interest towards the cover identification problem: generally, the term cover song defines a new rendition of a previously recorded song in genres such as rock and pop; cover songs can be either live or studio recordings with a potentially completely different arrangement. A typical approach to music identification is audio fingerprinting [1], which consists in a content-based signature of a music recording able to describe digital music even in the presence of noise, distortion, and compression [2]; several commercial systems have appeared that make use of such techniques to identify an audio recording, e.g. [3]. However, audio fingerprinting is most useful for retrieving the exact recording given as a query, while it is not explicitly designed for a cover identification task. On the contrary, cover identification approaches must be able to identify a song from the recording of a performance, yet independently from the particular performance. For example, the identification of live performances may not benefit
from the fingerprint of other performances, because most of the acoustic parameters may be different. Collecting all the possible live and cover versions of a music work is clearly unfeasible. As stated in [4], one of the most interesting challenges in Music Information Retrieval is to exploit the complex interaction of the many different music information facets: pitch (melodic), temporal (rhythmic), harmonic, timbral, editorial, textual, and bibliographic. The system presented here attempts to exploit in particular the harmonic and rhythmic facets. Cover music identification methodologies described in the literature generally exploit Chroma features to describe the harmonic content of music recordings. In particular, Chroma features have been widely exploited in [5], [6] and [7]. Since Chroma features are high dimensional, they considerably affect the computational time of search operations; efficiency then becomes a key issue if an identification system is offered to a large community of users, as in the case of a Web-based music search engine or a collaborative digital library. In [7], we proposed an efficient methodology to identify classical music recordings by applying the Locality Sensitive Hashing (LSH) paradigm [8], a general approach for handling high dimensional spaces that uses ad-hoc hashing functions to create collisions between vectors that are close in the space. LSH has been applied to the efficient search of different media [9]. This paper focuses on pop music identification, through the integration of feature descriptors that relate to two characterizing aspects of a song: harmonic and rhythmic content. The main idea is that cover songs usually preserve not only the harmonic-melodic characteristics of the original work but also its rhythmic profile. We show that combining information evidence from the two facets effectively improves the identification rate. This work builds on previously published material [7,10].
2 System Model
Evidence given by Chroma and rhythmic descriptors is combined into a single ranking of possible matching songs. The rhythmic descriptors used are Rhythm Histogram (RH) features [11], which were originally proposed for a genre classification task. While Chroma features have been thoroughly investigated previously and are provided with an efficient implementation, RH features have only recently been adopted by us, and their performance in terms of speed is not yet comparable; both aspects are described below. An overview of the system is depicted in Figure 1.
2.1 Chroma Features
A music descriptor widely applied to cover identification is the Chroma feature. Chroma features are related to the intensity associated with each of the 12 semitones within an octave, with all octaves folded together.
Fig. 1. Overview of the system model
The concept behind chroma is that the perceived quality of a chord depends only partially on the octaves in which the individual notes are played. Instead, what seems to be relevant are the pitch classes of the notes (the names of the notes on the chromatic scale) that form a chord. This robustness to changes in octaves is also exploited by artists who play a cover song: while the main melody is usually very similar to the original one, the accompaniment can have large variations without affecting the recognizability of the song. As described in [7], a Chroma vector c is a 12-dimensional vector of pitch classes, computed by processing a windowed signal with a Fourier transform. Following the approach proposed in [12], chroma features have been computed using the instantaneous frequency within each FFT bin to identify strong tonal components and to achieve higher resolution. In Figure 2(a) a Chroma vector corresponding to an A7 chord is depicted; Figure 2(b) shows the evolution of Chroma vectors over time for an excerpt of the song "Heroes" by D. Bowie. For each vector c, a quantization q is obtained by considering the ranks of the chroma pitch classes, instead of their absolute values, in order to obtain a general representation robust to variations due to different performing styles. In particular, rank-based quantization is carried out by computing the rank of the energy value in the various pitch classes. Rank-based quantization aims at a final compact representation, which can be obtained by considering that a vector q can be thought of as a twelve-digit
number represented in base k. A simple hashing function h can be computed by obtaining the decimal representation of this number, according to the equation

h = Σ_{i=1}^{12} k^(i-1) q_i    (1)
where additional hashing techniques can be applied to store the values h in a single array, which can be accessed in constant time. A typical technique is to compute the remainder of h divided by a carefully chosen prime number. The described approach is applied both to the songs in the collection and to the queries. With the main goal of efficiency, retrieval is carried out using the bag-of-words paradigm. Similarity between the query and the recordings in the collection is measured by simply counting the number of hashes they have in common. This measure of similarity does not take into account the distribution of hash values: in particular, the occurrence of a chroma hash inside a song and the frequency of a chroma hash across the collection of documents are not considered. The choice of this particular similarity measure was motivated by a number of tests using short queries of about 10 seconds, where this simple measure outperformed more complex ones [7]. Since the queries of a cover identification task can be complete recordings, the frequency of chroma hashes and their relative position along the song may become a relevant piece of information. In order to handle this issue, long music queries are divided into overlapping short sub-queries of fixed duration, and an independent retrieval task is carried out for each sub-query. A similar processing is applied to documents, which are divided into overlapping frames with a length comparable to that of the sub-queries. At the end, the result scores of the single retrieval runs are combined: in particular, preliminary evaluation showed that, in this context, the geometric mean outperformed all the other main fusion techniques reported in the literature [13]. A problem that may affect retrieval effectiveness is that the chroma-based representation is sensitive to transpositions. The problem is not dealt with in this paper, as the focus mainly resides in the integration with rhythmic features; it is however part of future work, and possible solutions are described in Section 4.
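As an illustration of this indexing scheme, the sketch below (Python with NumPy; the quantization depth k = 3, the prime, and the function names are our own illustrative assumptions, not the authors' implementation) rank-quantizes a chroma vector and hashes it according to Equation (1).

```python
import numpy as np

def rank_quantize(chroma, k=3):
    """Quantize a 12-dimensional chroma vector into k levels according to the rank
    of each pitch class (k = 3 is an illustrative choice)."""
    ranks = np.argsort(np.argsort(chroma))   # 0 = weakest pitch class, 11 = strongest
    return (ranks * k) // 12                 # map ranks to {0, ..., k-1}

def chroma_hash(chroma, k=3, prime=15485863):
    """Hash a chroma vector as in Eq. (1), h = sum_i k^(i-1) q_i, then reduce it
    modulo a (hypothetical) prime to index a fixed-size array."""
    q = rank_quantize(chroma, k)
    h = sum(int(q[i]) * k**i for i in range(12))   # k**i plays the role of k^(i-1) for 1-based i
    return h % prime

# toy usage: a chroma vector with an A7-like profile (A, C#, E, G emphasized)
c = np.zeros(12)
c[[0, 4, 7, 10]] = [1.0, 0.8, 0.9, 0.7]
print(chroma_hash(c))
```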
2.2 Rhythm Histogram Features
Rhythm Histogram features [11] are a descriptor for the general rhythmic characteristics of an audio document. In a RH the magnitudes of each modulation frequency bin for all the critical bands of the human auditory range are summed up to form a histogram of “rhythmic energy” per modulation frequency. In their original form, a single “global” RH represents a whole piece; in our approach, as is the case for Chroma features, a sequence of RHs is computed for each song by segmenting the audio into overlapping windows of 15 seconds, in order to be able to individually match parts of songs which might be characterized by different rhythmic structures (e.g. verse and chorus). Figures 2(c) and 2(d) show the global RH and the sequence of local RHs for David Bowie’s “Heroes”.
[Figure 2, four panels: (a) Single Chroma vector (x-axis: pitch class); (b) Chroma vectors evolution (x-axis: pitch class; y-axis: time [s]); (c) Global RH (x-axis: modulation frequency [Hz]); (d) Local RHs (x-axis: modulation frequency [Hz]; y-axis: segment position in the original audio [s]).]
Fig. 2. Chroma and Rhythm Histogram features for D. Bowie’s “Heroes”
The first step in the computation of the similarity between two songs a and b is the construction of the similarity matrix M, in which each entry m_ij is given by the cosine similarity of the i-th RH of a and the j-th RH of b. For each segment of a, the best matching segment of b (that is, the one with the highest cosine similarity) is retained, and the mean of these values over all segments of a is computed; a symmetric procedure is then applied to song b, and finally the average¹ of these two scores is returned as the similarity of a and b. Experimental results showed that this strategy performs slightly better than the simpler comparison of the global RHs. It is clear that this approach is computationally intensive, since the cosine similarity of the RHs must be computed for each song in the collection and for each segment pair. Possible optimizations, similar to the ones used for Chroma features, are under investigation. In Section 3.3 a straightforward strategy for reducing the computational load is proposed, based on the consideration that query songs can be compared to just a small subset of the songs in the collection while retaining the same precision in the results.
¹ A bi-directional comparison procedure is used in order to obtain a symmetric similarity measure. Experiments, however, showed that the simpler uni-directional comparison strategy yields similar results.
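A minimal sketch of this segment-wise comparison follows (Python with NumPy; variable names and the toy dimensions are our own assumptions, with each song given as a matrix of Rhythm Histograms, one row per 15-second segment).

```python
import numpy as np

def rh_similarity(rh_a, rh_b):
    """Bi-directional best-match similarity between two songs, each represented as a
    matrix of Rhythm Histograms (one row per 15-second segment)."""
    a = rh_a / np.linalg.norm(rh_a, axis=1, keepdims=True)   # row-normalize so that
    b = rh_b / np.linalg.norm(rh_b, axis=1, keepdims=True)   # dot products are cosines
    m = a @ b.T                        # similarity matrix M, m_ij = cos(RH_i(a), RH_j(b))
    score_ab = m.max(axis=1).mean()    # best match in b for each segment of a
    score_ba = m.max(axis=0).mean()    # best match in a for each segment of b
    return 0.5 * (score_ab + score_ba)

# toy usage: two songs with 6 and 8 segments of (assumed) 60 modulation bins each
rng = np.random.default_rng(0)
print(rh_similarity(rng.random((6, 60)), rng.random((8, 60))))
```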
2.3 Feature Combination
The similarity score s for a pair of songs, which governs the ranking returned by the system, is computed by combining the two scores c and r given by the Chroma features and the Rhythm Histogram features, respectively. Two strategies have been used:
– linear combination: s = (1 − α)c + αr, with α ∈ [0, 1]   (2)
– weighted product: s = c^(1−α) r^α, with α ∈ [0, 1]   (3)
As pointed out in Section 3.3, their performance is similar.
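The two fusion rules of Equations (2) and (3) translate directly into code; the sketch below (Python, with our own naming) assumes the two scores have already been brought to comparable ranges.

```python
def linear_combination(c, r, alpha):
    """Eq. (2): s = (1 - alpha) * c + alpha * r, with alpha in [0, 1]."""
    return (1.0 - alpha) * c + alpha * r

def weighted_product(c, r, alpha):
    """Eq. (3): s = c^(1 - alpha) * r^alpha, with alpha in [0, 1]."""
    return (c ** (1.0 - alpha)) * (r ** alpha)

# example with made-up scores: chroma score 0.08, rhythm score 0.97, alpha = 0.1
print(linear_combination(0.08, 0.97, 0.1), weighted_product(0.08, 0.97, 0.1))
```

Note that, since the two scores typically lie in rather different numerical ranges (compare the axes in Figure 3), the value of α cannot be read directly as the relative importance of the two facets.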
3 Experimental Results
Experimental results are presented to show how performance can be improved by combining the scores of the two feature descriptors used. The performance of the system is evaluated using Mean Reciprocal Rank (MRR) as a measure of precision.
3.1 Test Collection
The proposed approach has been evaluated on a test collection of 500 recordings of pop and rock songs, taken from the personal collections of the authors. The idea was to have a collection as close as possible to a real scenario. In fact, the indexed documents were all original versions of the music works – i.e., the studio album versions – for which it is expected that metadata are correctly stored and which can reasonably be used as the reference collection in a real system. The query set included 60 recordings of different versions of a subset of the collection, which were either live versions by the same artist who recorded the original song or studio or live covers by other artists. We decided to have a single correct match in the collection for each query, in order to balance the contribution of each different song. Queries had different durations, generally including the core of the music works – verses and choruses – plus possible introductions and endings, in particular in the live versions. All the audio files were stereophonic recordings (converted to monophonic in the audio feature extraction step) with a sampling rate of 44.1 kHz and stored in MP3 format at different bitrates (at most 192 kbps). In fact, in order to simulate a real context, we preferred a compressed format rather than an uncompressed one such as PCM.
3.2 Individual Features Results
The performance of Chroma features alone is already satisfying, with an MRR of 78.4%. Rhythm Histogram features, on the other hand, are less reliable, resulting in an MRR of 34.0%. If the RH feature scores are computed directly on the global RH (instead of subdividing the song and computing the best match for each segment), the MRR is 28.5%.
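For reference, Mean Reciprocal Rank can be computed as in the generic sketch below (Python; not tied to the authors' evaluation scripts), with each query contributing the reciprocal of the rank of its single correct match.

```python
def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """ranked_lists[i] is the ranking returned for query i (best match first);
    relevant_ids[i] is the identifier of its single correct match, which is
    assumed to appear somewhere in the ranking."""
    rr = [1.0 / (ranking.index(rel) + 1) for ranking, rel in zip(ranked_lists, relevant_ids)]
    return sum(rr) / len(rr)

# toy example: correct matches ranked 1st, 2nd and 4th -> MRR = (1 + 1/2 + 1/4) / 3
print(mean_reciprocal_rank([["a", "b"], ["c", "a"], ["d", "e", "f", "a"]], ["a", "a", "a"]))
```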
3.3 Combined Features Results
Figure 3 shows typical dispositions of the feature score pairs for some queries; each point in the feature space is associated with a comparison between a query song and a song in the collection, with the red circles marking the relevant matches. In particular, Figure 3(a) is an example of the best possible situation, in which both Chroma features and Rhythm Histogram features individually rank the correct match for the query song in the first position. Figure 3(b) depicts the most common situation, in which Chroma features correctly identify the match but RH features are misleading; the dual situation is reported in Figure 3(c), which is rare but represents significant evidence of the usefulness of RH features. Finally, Figure 3(d) presents the situation in which neither feature descriptor can correctly identify the best match.
[Figure 3, four scatter plots of Chroma similarity (x-axis) vs. Rhythm histogram similarity (y-axis), one per query: (a) Smoke on the water - Deep Purple; (b) Sweet child of mine - Guns N' Roses; (c) Sweet home Alabama - Lynyrd Skynyrd; (d) You shook me - AC/DC.]
Fig. 3. Disposition of song similarities in the feature space
Perhaps the most interesting disposition of score pairs in the feature space is the one depicted in Figure 4: neither feature can identify the matching song by itself, but a combination of the two is indeed able to rank it in the first position.
[Figure 4: scatter plot of Chroma similarity (x-axis) vs. Rhythm histogram similarity (y-axis).]
Fig. 4. Disposition of song similarities in the feature space for “All along the watchtower” by Jimi Hendrix
The two approaches to feature combination reported in Equations 2 and 3 have been tested for several values of the parameter α, which weights the influence of RH features in the score, and the resulting MRRs are depicted in Figure 5. For the optimal value of α, the MRR increases from 78.4% (using only Chroma features) to 82.0% and 81.6%, using a linear combination and a weighted product of the feature scores respectively. Similar performance is achieved using a single global RH for computing the similarity of songs, with an MRR of 81.5% in the case of the best linear combination of features. Even though the maximum MRR value is located in local peaks of the plot, which are probably due to the rather small size of the test collection, setting α in a rather large neighbourhood of its optimal value still yields a significant improvement in MRR. As anticipated in Section 2.2, it is clear that comparing a query song against the whole set of songs in the collection is unfeasible for a large collection, especially when comparing all the segments against each other. Fortunately, Chroma features are able to rank the relevant match in the first positions, and this can be done efficiently thanks to the hashing mechanisms discussed above; an effective solution is to exploit this robustness by reranking only the top t positions with the aid of Rhythm Histogram features: with t ranging from 15 to 50, the optimal MRR (82.0%) is unchanged for the collection used. Although the collection is very small, previous experiments with Chroma features on larger collections [7] have shown that the relevant matches for query songs are almost never ranked in very low positions, thus Rhythm Histogram features can be effectively exploited by computing them on just a very small fraction of the songs in the collection.
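A sketch of this reranking strategy is given below (Python; the function signature, the fused score and the default t = 30 are our own illustrative assumptions): the cheap hash-based chroma ranking produces a shortlist of t candidates, RH similarities are computed only for those, and the two scores are fused as in Equation (2).

```python
def rerank_top_t(chroma_scores, rh_score_fn, t=30, alpha=0.1):
    """chroma_scores: dict song_id -> chroma score for the query (cheap, hash-based).
    rh_score_fn(song_id) returns the expensive Rhythm Histogram similarity and is
    called only for the top-t chroma candidates; scores are fused as in Eq. (2)."""
    ranking = sorted(chroma_scores, key=chroma_scores.get, reverse=True)
    shortlist, rest = ranking[:t], ranking[t:]
    fused = {s: (1.0 - alpha) * chroma_scores[s] + alpha * rh_score_fn(s) for s in shortlist}
    # the reranked shortlist is followed by the remaining songs in chroma-only order
    return sorted(shortlist, key=fused.get, reverse=True) + rest
```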
[Figure 5: MRR as a function of the weight α for the two combination strategies.]
Fig. 5. MRR for the presented approaches to feature combination
4 Conclusion
The paper presented a methodology for the pop music cover identification problem in a music digital library, mainly focusing on the improvements given by the introduction of a rhythmic profile descriptor in addition to the description of harmonic content. Many directions for future work are yet to be explored, and the most promising ones are briefly reviewed below.
The modeling of harmonic content still lacks an effective solution for handling the possible transpositions in pitch of cover songs; in fact, if the cover song used as query and the original work stored in the collection are played in different tonalities, they will have totally different chroma sets. This problem can be addressed by considering that a transposition of n semitones results in a rotation of the Chroma vectors by n steps. Therefore, the tonality issue can be faced by simply computing all twelve transpositions for each query and identifying all the transposed versions, though this step is likely to decrease performance both in terms of effectiveness and efficiency; a sketch of this rotation idea is given at the end of this section. Alternatively, a methodology to estimate the key of a song may be exploited in order to transpose the chroma sets to a reference tonality [6]. Including key estimation algorithms in the proposed system will be an important part of future work.
While harmonic content description has been deeply studied, the investigation of rhythmic content descriptors is still at a very early stage. In particular, computational performance is the main concern, and to this aim a hash-based retrieval approach is under development.
Finally, it is clear that the evaluation of the system should be performed on a larger test collection. However, this step poses additional issues, related not only to the size of the data that has to be managed, but also to problems regarding music genres. In fact, some music genres are defined by a very characteristic rhythm (e.g. reggae); thus rhythmic descriptors might in such cases be detrimental to the final performance.
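As a purely illustrative sketch of the rotation property mentioned above (Python with NumPy; not part of the system described in this paper), a transposition by n semitones corresponds to rolling each chroma vector by n positions, so a query could be matched under all twelve transpositions by keeping the best score over the rotations.

```python
import numpy as np

def best_score_over_transpositions(query_chroma, match_fn):
    """query_chroma: array of shape (frames, 12); match_fn scores a (possibly
    rotated) chroma sequence against the collection. Rolling along the pitch-class
    axis by n positions simulates a transposition by n semitones."""
    return max(match_fn(np.roll(query_chroma, n, axis=1)) for n in range(12))
```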
References

1. Cano, P., Batlle, E., Kalker, T., Haitsma, J.: A review of audio fingerprinting. Journal of VLSI Signal Processing 41, 271–284 (2005)
2. Cano, P., Koppenberger, M., Wack, N.: Content-based music audio recommendation. In: Proceedings of the ACM International Conference on Multimedia, pp. 211–212 (2005)
3. Wang, A.: An industrial-strength audio search algorithm. In: Proceedings of ISMIR (2003)
4. Downie, J.: Music information retrieval. Annual Review of Information Science and Technology 37, 295–340 (2003)
5. Kurth, F., Muller, M.: Efficient index-based audio matching. IEEE Transactions on Audio, Speech, and Language Processing 16(2), 382–395 (2008)
6. Serra, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local alignment applied to cover song identification. IEEE Transactions on Audio, Speech, and Language Processing 16(6), 1138–1151 (2008)
7. Miotto, R., Orio, N.: A music identification system based on chroma indexing and statistical modeling. In: Proceedings of the International Conference on Music Information Retrieval, pp. 301–306 (2008)
8. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. The VLDB Journal, 518–529 (1999)
9. Slaney, M., Casey, M.: Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal Processing Magazine 25(2), 128–131 (2008)
10. Miotto, R., Montecchio, N.: Integration of chroma and rhythm histogram features in a music identification system. In: Proceedings of the Workshop on Exploring Musical Information Spaces, WEMIS (2009)
11. Lidy, T., Rauber, A.: Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: Proceedings of the International Conference on Music Information Retrieval, pp. 34–41 (2005)
12. Ellis, D., Poliner, G.: Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 4, pp. IV-1429–IV-1432 (2007)
13. Fox, E., Shaw, J.: Combination of multiple searches. In: Proceedings of the Second Text REtrieval Conference (TREC-2), pp. 243–249 (1994)
Toward an Audio Digital Library 2.0: Smash, a Social Music Archive of SHellac Phonographic Discs

Sergio Canazza¹ and Antonina Dattolo²

¹ University of Padova, Dep. of Information Engineering, Sound and Music Computing Group, Via Gradenigo, 6/B, 3513 Padova, Italy
[email protected], https://www.dei.unipd.it/~canazza/
² University of Udine, Dep. of Mathematics and Computer Science, Via delle Scienze 206, 33100 Udine, Italy
[email protected], www.dimi.uniud.it/antonina.dattolo/
Abstract. In the music field, an open issue is represented by the creation of innovative tools for the acquisition, preservation and sharing of information. The strong difficulties in preserving the original carriers, together with the dedicated equipment able to read each (often obsolete) format, encouraged the analog/digital (A/D) transfer of audio contents in order to make them available in digital libraries. Unfortunately, the A/D transfer is often an invasive process. This work proposes an innovative and non-invasive approach to audio extraction from complex source material, such as shellac phonographic discs: PoG (Photos Of Ghosts) is a new system able to reconstruct the audio signal from a still image of a disc surface. It is automatic, requires only low-cost hardware, recognizes different rpm and performs an automatic separation of the tracks; it is also robust with respect to dust and scratches.
1 Introduction
The availability of digital archives and libraries on the Web represents a fundamental impulse for cultural and didactic development. Guaranteeing an easy and ample dissemination of the music culture of our times is an act of democracy that must be assured to future generations: the creation of new tools for the acquisition, the preservation, and the transmission of information is nowadays a key challenge for the international archive and library communities [3]. Scholars and the general public have begun paying greater attention to the recordings of artistic events, but the systematic preservation and consultation of these documents are complicated by their diversified nature; in fact, the data contained in the recordings offer a multitude of information on their artistic and cultural life that goes beyond the audio signal itself. From the first recording on paper,
made in 1860 by Édouard-Léon Scott de Martinville ("Au Clair de la Lune", using his phonautograph¹), up to the modern Blu-ray Disc, what we have today in the audio carrier field is a Tower of Babel: a bunch of incompatible analog and digital approaches and carriers – paper, wire, wax cylinder, shellac disc, film, magnetic tape, vinyl record, magnetic and optical disc, to mention only the principal ones – without standard players able to read all of them. With time, the reading of some of those original documents has become difficult or even impossible. We know that some discs and cylinders are broken, that the resin in which the groove was engraved is sometimes dislocated, and that the groove itself is often damaged by the styli of the heads of that age. It is worth noting that, in the Seventies and Eighties of the 20th century, expert associations (AES, NARA, ARSC) were still concerned about the use of digital recording technology and digital storage media for long-term preservation. They recommended re-recording endangered materials on analogue magnetic tapes, because of: a) the rapid change and improvement of the technology, and thus the rapid obsolescence of hardware, digital formats and storage media; b) the lack of consensus regarding sample rate, bit depth and record format for sound archiving; c) the questionable stability and durability of the storage media. Digitization was considered primarily a method of providing access to rare, endangered, or distant materials – not a permanent solution for preservation. Smith, still in 1999, suggested that digitization should be considered a means for access, not preservation – "at least not yet" [14]. Nowadays, it is well known that preserving carriers and maintaining dedicated equipment for the ever growing number of formats in playable condition is hopeless, and the audio information stored in obsolete formats and carriers is at risk of disappearing. At the end of the 20th century, the traditional "preserve the original" paradigm shifted to the "distribution is preservation" [6] idea of digitizing the audio content and making it available using digital library technology. For these reasons, if, on the one hand, the importance of transferring into the digital domain (active preservation) at least the carriers at risk of disappearing is evident, respecting the indications of the international archivist community [1,13], on the other hand it becomes urgent to study and design new forms of organizing and sharing music archives, taking into account the radical revolution imposed by Web 2.0; in fact, the modalities in which people communicate, meet and share information have been strongly influenced by the wide popularity of user generated contents [4]: Facebook, YouTube, Wikipedia, Myspace are only some examples of widely known Web 2.0 applications. The rest of this introduction is organized in three subsections: the first (Subsection 1.1) discusses features and limitations of current Web 2.0 applications in the music field; the second (Subsection 1.2) synthesizes the innovative aspects of the contribution presented in this paper; finally, the last (Subsection 1.3) describes the chosen application domain, which is represented by shellac discs.
¹ Unlike Edison's similar 1877 invention, the phonograph, the phonautograph only created visual images of the sound and had no playback capabilities. Scott de Martinville's device was used only for scientific investigations of sound waves.
1.1 Current Web 2.0 Applications in the Music Field
In the field of music, several communities, applications and services based on the Web 2.0 perspective have been developed. They propose new modalities of social interaction both for music creation and fruition. Some open directories, such as All Things Web 2.0 (www.allthingsweb2.com), GO2WEB20 (go2web20.net), or Feedmyapp (feedmyapp.com), propose daily updated lists of them. Existing applications may be grouped, at least, into the following four macro-categories:
– Sharing music. This typology of applications proposes online communities allowing the users to share and promote free music and related news, broken down by genre and ranked by community votes. Some examples are Bebo (www.bebo.com) and Laudr Underground Music (www.laudr.com).
– Creating music. This typology of applications aims to create large collaborative databases containing audio snippets, samples, recordings and bleeps, enabling innovative forms of access, browsing and creation. Some of these applications, such as the Freesound Project (www.freesound.org), focus only on sounds, while others, such as ccMixter (ccmixter.org), also include songs and compositions. In many of these applications, such as Amie Street (amiestreet.com), musicians and fans can connect, share their videos/mp3s, promote live shows and discover the latest up-and-coming music.
– Analyzing the user's behavior. This typology of applications tracks how users interact with music online every day, and applies a set of metrics with different aims; among them, for example, NextBigSound (thenextbigsound.com) tracks the number of plays, views, fans, comments, and other key metrics for 700,000 artist profiles across major web properties like Facebook, MySpace, Last.fm, Twitter. The aim is to support professionals, such as agents, managers, artists or publicists, in the goal of understanding and increasing the online audience. Since the purchase decisions of a critical mass of consumers have moved online, NextBigSound accurately measures, reports, and uses the interaction to make decisions.
– Recommending music. This typology of applications collects the music selection history of the users and, inferring the individual user tastes from it, proposes new recommendations; on the basis of the user feedback of approval or disapproval of individual songs, the system calibrates the successive suggestions. Examples of music recommenders are Pandora (www.pandora.com), Last.fm (www.last.fm), and Discogs (short for discographies); the latter is the largest online database of vinyl discs and one of the largest online databases of electronic music releases. In it, a user can rate the correctness and completeness of the full set of data for an existing resource, as assessed by the users who have been automatically determined, by an undisclosed algorithm, to be experienced and reliable enough to express their votes. An item's "average" vote is displayed with the resource's data.
Notwithstanding the differences among the systems, all these services tend to divide users into two categories:
– the large group of music listeners, who mainly have the task of evaluating and recommending music;
– the restricted group of music content creators, who are required to have skills in the field of music composition or music performance.
Currently, the large group of music listeners can generally browse the music by genre, artist or album, but they can rarely use music features automatically extracted from the audio. An interesting example in this direction is MusicSurfer², which is able to mimic humans in listening, understanding and recommending music. It automatically extracts descriptions related to instrumentation, rhythm and harmony. Together with complex similarity measures, the descriptions allow users to navigate multimillion-track music collections in a flexible, efficient, and truly personalized way. As a music similarity opinion engine it can also generate smart playlists. A lightweight version of the similarity engine is available for embedding in portable devices, such as iPods.
1.2 Contribution of This Work
The analysis of current Web 2.0 applications dedicated to music highlights the presence of a large set of services, but also a set of general limitations:
1. current Web 2.0 applications do not provide users with features and metadata directly extracted from the audio signal;
2. the documentation that a user owns, such as the cover or a photo of a disc, is not used by the system in order to automatically extract metadata;
3. a user cannot use these applications for listening to his/her discs, or for comparing audio features.
With the aim of facing these limitations and proposing innovative modalities of fruition and preservation, we are currently working on a large project dedicated to the realization of a social music archive, named Smash (Social Music Archive of SHellac phonographic discs). In this paper, we focus our discussion on a specific component of Smash, called PoG (Photos Of Ghosts), created for reconstructing the audio signal from a still image of the surface of shellac phonographic discs. This activity represents an original contribution: in fact, although automatic text scanning and optical character recognition are in wide use at major libraries, the A/D transfer of historical sound recordings, unlike that of texts, is often an invasive process. In PoG this process is not invasive, since it is based on a photo: we add the still image of the shellac disc to the contextual information (such as still images of the covers, labels, possible annexes, and the mirror of the discs).
² Developed by the Music Technology Group (MTG) of the Universitat Pompeu Fabra in Barcelona. See musicsurfer.iua.upf.edu/
PoG enables the user to listen to the disc by using the photo, maintaining the link between the original real object (i.e. the shellac disc) of his/her discography, the contextual information and the metadata included in the database record. We strongly believe that this is the only way to preserve the history of the document transmission.
1.3 The Case Study Domain
We present some experimental results applying our approach to shellac discs. The shellac disc is a common audio mechanical carrier. In 1886-1901 the first engraved discs commercialized by Emile Berliner appeared, first with vertical grooves like phonographic cylinders, then with lateral grooves. Prior to the appearance of magnetic tapes, radio broadcasting was recorded live on discs. The phonograph disc is composed of a spiral groove obtained by casting or by direct cutting, in which a sound signal is recorded in the shape of a lateral or vertical modulation, or both if stereophonic.
1.4 Shellac Discs
The shellac disc is a common audio mechanical carrier. The common factor among mechanical carriers is the method of recording the information, which is obtained by means of a groove cut into the surface by a stylus modulated by the sound, either directly in the case of acoustic recordings or by electronic amplifiers. Mechanical carriers include phonograph cylinders and coarse groove gramophone, instantaneous and vinyl discs. Tab. 1 summarizes the typologies of these carriers [10].

Table 1. Typologies of analogue mechanical carriers

Carrier | Period | Composition | Stocks
cylinder – recordable | 1886-1950s | Wax | 300,000
cylinder – replicated | 1902-1929 | Wax and Nitrocellulose with plaster ("Blue Amberol") | 1,500,000
coarse groove disc – replicated | 1887-1960 | Mineral powders bound by organic binder ("shellac") | 10,000,000
coarse and microgroove discs – recordable ("instantaneous discs") | 1930-1950s | Acetate or nitrate cellulose coating on aluminum (or glass, steel, card) | 3,000,000
microgroove disc ("vinyl") – replicated | 1948- | Polyvinyl chloride - polyacetate co-polymer | 30,000,000
There are more than 1,000,000 shellac discs in worldwide audio archives. They are very important because some of these discs contain music that has never been re-recorded (R&B, Jazz, Ethnic, Western classical, etc.). The paper is organized as follows: in Section 2 we discuss related work; in Section 3 we present PoG, our system for audio data extraction, while in Section 4 we present some experimental results through two case studies. Finally, conclusions and future work end the paper.
2 Related Work
Some phonographs are able to play gramophone records using a laser beam as the pickup (laser turntable) [12]; this playback system has the advantage of never physically touching the record during playback: the laser beam traces the signal undulations in the record, without friction. Unfortunately, laser turntables are constrained to the reflected laser spot only, are very sensitive to the reflection coefficient of the used part of the disc, and are susceptible to damage and debris. In addition, the variation of the distance between the grooves can cause the simultaneous reading of several grooves. Digital image processing techniques can be applied to the problem of extracting audio data from recorded grooves, acquired using an electronic camera or other imaging system. The images can be processed to extract the audio data. The audio signal can be determined by measurement of the horizontal path of the groove. Such an approach offers a way to provide non-contact reconstruction and may in principle sample any region of the groove, even in the case of a broken disc. These scanning methods have several advantages:
a) delicate samples can be played without further damage;
b) broken samples can be re-assembled virtually;
c) the re-recording approach is independent of the record material and format (wax, metal, shellac, acetates, etc.);
d) effects of damage and debris (noise sources) can be reduced through image processing;
e) scratched regions can be interpolated;
f) discrete noise sources are resolved in the "spatial domain" where they originate, rather than being an effect in the audio playback;
g) dynamic effects of damage (skips, ringing) are absent;
h) classic distortions (wow, flutter, tracking errors, etc.) are absent or removed as geometrical corrections;
i) no mechanical method is needed to follow the groove;
j) they can be used for mass digitization.
In the literature, there are some approaches to the use of image processing to reconstruct sound [5,8,7]; in general, they can be based on: Electronic Cameras (2D or horizontal-only view, frame based); Confocal Scanning (3D or vertical+horizontal view, point based); Chromatic sensors (3D, point based); White Light Interferometry (3D, frame based).
In [5] a high resolution analog picture of each side of the disc is shot. The film becomes an intermediate storage medium. In order to listen to the sound, the picture is scanned using a high resolution circular scanner. The scanner is made of a glass turntable; a 2048-sensor CCD linear camera is mounted on a microscope lens above the glass. A light source located below the tray illuminates the film by transparency. Fadeyev et al. [8] apply a methodology partially derived from long-standing analysis methods used in high energy particle and nuclear physics to follow the trajectories of charged particles in magnetic fields using position sensitive radiation detectors [9]. The device used is the "Avant 400 Zip Smart Scope" manufactured by Optical Gauging Products. It consists of a video zoom microscope and a precision X-Y table. The accuracy of motion in the X-Y (horizontal) plane over the distance L (mm) is (2.5 + L/125) microns. The video camera had a CCD of 6.4 mm x 4.8 mm containing 768x494 pixels of dimension 8.4 x 9.8 microns. With appropriate lenses installed, it imaged a field of view ranging between approximately 260 x 200 microns and 1400 x 1080 microns. Both the systems listed above have been applied to shellac discs, with both mechanical and electric recordings. Despite the high Signal to Noise Ratio (SNR) of the extracted audio signal (more than 40 dB for a 78 rpm shellac disc), these techniques are not suitable for the typical European audio archive (which has small-medium dimensions), because the hardware equipment is expensive. In [7] a full three-dimensional (3D) measurement of the record surface is presented; in this study the color-coded confocal imaging method is considered, and in particular the model CHR150 probe, manufactured by STIL SA, is used. This probe is coupled to custom configured stage movement and read out through a computer. The stages are controlled by DC servo motors and read out by linear encoders. The linear stage resolution is 100 nanometers and the accuracy is 2 µm. This system gets very interesting results on audio cylinders (both wax and Amberol), but it needs several hours for scanning. Summarizing, it can be used for saving selected records, not for mass preservation.
3 Audio Data Extraction: Photos of GHOSTS (PoG)
Photos of Grooves and HOles, Supporting Tracks Separation (Photos of GHOSTS, or simply PoG) [11] is the system proposed in this work; it is distinguished by the following features:
– it is able to recognize different rpm and to perform track separation automatically;
– it does not require human intervention;
– it works with low-cost hardware;
– it is robust with respect to dust and scratches;
– it outputs de-noised and de-wowed audio, by means of novel restoration algorithms. The user can choose to apply an equalization curve among the hundreds stored in the system, each one with appropriate references (date, company, roll-off, turnover).
The software automatically finds the disc centre and radius from the scanned data, using tools developed in the consolidated literature on iris detection [11], for groove rectification and for track separation. Starting from the light intensity curve of the pixels in the scanned image, the groove is modeled and the audio samples are obtained [2]. The complete process is depicted in Fig. 1.
Fig. 1. Photos of GHOSTS schema
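As a rough illustration of this idea (a simplified sketch under our own assumptions, not the actual PoG groove model described in [2]): once a groove has been rectified into an image strip, its lateral modulation can be estimated column by column from the light intensity profile, and a crude audio estimate obtained from the resulting displacement curve.

```python
import numpy as np

def groove_strip_to_audio(strip):
    """strip: 2D grayscale array of a rectified groove (rows = lateral position,
    columns = position along the groove), with darker pixels assumed to mark the groove.
    Returns a crude, normalized audio estimate from the lateral groove displacement."""
    weights = 1.0 - strip / strip.max()                      # emphasize dark (groove) pixels
    rows = np.arange(strip.shape[0])[:, None]
    center = (weights * rows).sum(axis=0) / (weights.sum(axis=0) + 1e-12)  # groove center per column
    audio = np.diff(center)                                  # lateral velocity ~ recorded signal
    return audio / (np.abs(audio).max() + 1e-12)
```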
In particular, the system includes:
1. An innovative de-noise algorithm in the frequency domain [2], based on the use of a suppression rule which considers the psychoacoustic masking effect. The masking thresholds related to the original signal x(n) are not known a priori and have to be estimated. This estimation can be obtained by applying a standard STSA noise reduction technique, leading to an estimate of x(n) in the frequency domain; the relative masking thresholds m_k, defined as the non-negative thresholds under which the listener does not perceive additional noise, can then be calculated by using an appropriate psychoacoustic model. The obtained masking effect is incorporated into the EMSR technique [2], taking into consideration the masking threshold m_k for each frequency k of the STFT transform. A cost function depending on m_k is then created: its minimization gives the suppression rule for the noise reduction. This cost function can be a particularization of the mean square deviation that includes the masking thresholds, under which the cost of an error is equal to zero (a simplified sketch of this masking-based suppression idea is given at the end of this section).
2. Unlike the methods listed in Sec. 2, a set of 225 different equalization curves, which cover almost all electric recordings since 1925.
3. The design and realization of an ad-hoc prototype of a customized (very) low-cost scanner device; it is equipped with a rotating lamp carriage in order to position every sector with the optimal alignment relative to the lamp (coaxially incident light). In this way we improved the accuracy of the groove tracking step (by more than 20%, according to experimental results).
PoG may form the basis of a strategy for:
a) large scale A/D transfer of mechanical recordings, retaining maximal information (a 2D or 3D model of the grooves) about the native carrier;
b) small scale A/D transfer processes, where there are not sufficient resources (trained personnel and/or high-end equipment) for a traditional transfer by means of turntables and converters;
c) the active preservation of carriers with heavy degradation (breakage, flaking, exudation).
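The following sketch (Python with NumPy) gives a much simplified, purely illustrative version of the masking-based suppression idea described in point 1 above: for one STFT frame, only the portion of the estimated noise that exceeds the masking threshold is attenuated, with a spectral floor on the gain. It is not the EMSR-based rule actually used in PoG, and both the noise estimate and the thresholds are assumed to be given.

```python
import numpy as np

def masked_suppression(frame_spectrum, noise_power, masking_threshold, floor=0.1):
    """One STFT frame: complex bins X_k, an estimate of the noise power per bin,
    and the masking thresholds m_k (assumed to be provided by a psychoacoustic
    model). Noise that is already masked (below m_k) is left untouched; only the
    audible part of the noise is subtracted, with a spectral floor on the gain."""
    power = np.abs(frame_spectrum) ** 2
    audible_noise = np.maximum(noise_power - masking_threshold, 0.0)
    gain = np.sqrt(np.maximum(1.0 - audible_noise / np.maximum(power, 1e-12), floor ** 2))
    return gain * frame_spectrum
```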
4 Experimental Results
In this section we present the experimental results of applying the above-described audio data extraction technique. We conducted a series of experiments with real usage data from different international audio archives. A number of examples generated by the method described in this paper is available at: avires.dimi.uniud.it/tmp/DL/Experimental_Results.html
As a case study, we selected the 1929 double-sided 78 rpm 10" shellac disc Victor V-12067-A (BVE 53944-2) and focused our attention on the song La signorina sfinciusa (The funny girl). The performers are Leonardo Dia (voice), Alfredo Cibelli (mandolin), plus unknown performers (two guitars). New York, July 24th, 1929. 3'20". Since it is an immigration song, aimed at a poor market (made up of the Italian people who had emigrated to the United States of America), the audio quality of the recording is below the standard of that age. The considered 78 rpm disc is largely damaged: both sides have scratches. Moreover, some areas are particularly dark: we hypothesize that this corruption was caused by some washes (made before the disc was acquired by the audio archive of one of the authors of this paper) in which aggressive chemical substances were used. The corruptions cause evident distortions if the disc is played by means of a stylus. The audio signal was extracted in two ways:
1. By means of the Rek-O-Kut-Rodine 3 turntable; the A/D transfer was carried out with an RME Fireface 400 at 44.1 kHz, 16 bit; no equalization curve was applied.
2. Using the PoG system; the image was taken at 4800 dpi, 8 bit grayscale, without digital correction.
The clicks were removed in the image domain, applying a filter directly to the still image of the disc surface. A median filter is applied (a filter usually employed for reducing the effect of motion on an image). A selection including the click is
Fig. 2. Waveform of the audio signals sampled by means of turntable (top) and synthesized by means of PoG (bottom): it is evident the click corruptions in the top figure. X-axis: time (s); Y-axis: normalized amplitude
Fig. 3. Spectra of the audio signals sampled by means of turntable (top) and synthesized by means of PoG (bottom): in audio frequency range there aren’t artifact caused by PoG. X-axis: time (s); Y-axis: frequency (Hz)
Toward an Audio Digital Library 2.0: Smash, a Social Music Archive
215
drawn: then the filter searches for pixels of similar brightness, discarding pixels that differ too much from adjacent pixels, and replaces the center pixel with the median brightness value of the searched pixels. Finally, the hiss was reduced by means a de-noise algorithm in a frequency domain based on the use of a suppression rule, which considers the psychoacoustics masking effect (Sect. 3). Then, the signal was resampled at 44.1 kHz. Fig. 2 shows the waveforms (time domain: it can be noticed the click removal by EKF in the bottom figure) of both signals, where it is evident the de-hiss carried out by the de-noise process. In audio frequency range there aren’t artifact caused by PoG; moreover, it can be noticed the SNR enhancement obtained by means of de-noised algorithm (see Fig. 3).
5 Conclusion and Future Work
The system discussed in this work, PoG, proposes a new approach for extracting the audio signal from complex source material. The methodology is innovative and non-invasive; the audio signal is reconstructed using a still image of a disc surface. We focused our discussion on shellac phonographic discs, showing that PoG guarantees good performance even when the discs are heavily corrupted, and that it supports different formats (rpm, diameter, number of channels) without changing the equipment setup. In addition, it is important to highlight that, using PoG, the contextual information³ (matrix and catalogue numbers, recording year, authors' and performers' names, etc.) is included in the photographic documents (it can be read on the label and/or in the mirror of the disc), unlike what occurs with audio files (mp3, BWF, etc.), where this information is subjectively added by the user. Our aim is to study innovative models for preserving and sharing audio documents in the context of Web 2.0/3.0. This paper represents a first step in this direction; starting from PoG, future work will focus on the following points:
– we are carrying out a deep analysis and comparison between PoG and the different tools discussed in Section 2;
– three regions of the disc contain the audio information: the groove surface paths, the groove walls and the groove bottom. PoG calculates the audio signal by measurement of the path of the groove. Unfortunately this interface is often damaged by scratches and bumps, which can give rise to unreliable information. The audio signal can also be determined from the vertical and horizontal path on the bottom of the groove, but it is not very well defined, because each pressing has a different shape and the bottom of the groove often contains dirt.
³ Here, as is common practice in the audio processing community, we indicate as contextual information the additional content-independent information; we use the term metadata to indicate content-dependent information that can be automatically extracted from the audio signal.
In this sense, we believe that the only source of reliable information is the slope of the groove walls, which is not easy to measure. Our idea is to enhance PoG using interferometry techniques as well as an optical fibre that follows the groove of the disc: the reflected light (injected into the fibre by a laser) is transmitted by a lens to an X-Y position detector. By analyzing the data measured both by the interferometer and by the position detector of the optical fibre, we will potentially be able to model the slope of the groove walls;
– we are designing a new tool (codenamed Photos of GHOSTS 3D) able to produce a 3D model of the groove, using the same techniques discussed in the previous point. In this way, we will potentially be able to listen to the photos of wax (and Amberol) cylinders as well as of discs with vertical modulation;
– we are currently and actively working towards a first release of a social network prototype, in which PoG will be used as a new tool for sharing and preserving audio documents.
References

1. AES-11id-2006. AES Information document for Preservation of audio recordings – Extended term storage environment for multiple media archives. AES (2006)
2. Canazza, S.: Noise and Representation Systems: A Comparison among Audio Restoration Algorithms. Lulu Enterprise, USA (2007)
3. Candela, L., Castelli, D., Ioannidis, Y., Koutrika, G., Pagano, P., Ross, S., Schek, H.-J., Schuldt, H.: The Digital Library Manifesto. DELOS - Network of Excellence on Digital Libraries (2006)
4. Casoto, P., Dattolo, A., Pudota, N., Omero, P., Tasso, C.: Semantic tools for accessing, analysing, and extracting information from user generated contents: open issues and challenges. In: Murugesan, S. (ed.) Handbook of Research on Web 2.0, 3.0 and X.0: Technologies, Business and Social Applications, ch. XVIII (October 2009)
5. Cavaglieri, S., Johnsen, O., Bapst, F.: Optical retrieval and storage of analog sound recordings. In: Proceedings of the AES 20th International Conference, Budapest, Hungary (October 2001)
6. Cohen, E.: Preservation of audio in folk heritage collections in crisis. In: Proceedings of Council on Library and Information Resources, Washington, DC, USA, pp. 65–82 (2001)
7. Fadeyev, V., Haber, C., Maul, C., McBride, J.W., Golden, M.: Reconstruction of recorded sound from an Edison cylinder using three-dimensional non-contact optical surface metrology. J. Audio Eng. Soc. 53(6), 485–508 (2005)
8. Fadeyev, V., Haber, C.: Reconstruction of mechanically recorded sound by image processing. Journal of the Audio Engineering Society 51(12), 1172–1185 (2003)
9. Fruhwirth, R., Regler, M., Bock, R., Grote, H., Notz, D.: Data Analysis Techniques for High Energy Physics, 2nd edn. Cambridge University Press, Cambridge (August 2000)
10. IFLA-UNESCO: Safeguarding our Documentary Heritage / Conservation préventive du patrimoine documentaire / Salvaguardando nuestro patrimonio documental. CD-ROM, bi-lingual: English/French/Spanish. UNESCO "Memory of the World" Programme, French Ministry of Culture and Communication (2000)
11. Orio, N., Snidaro, L., Canazza, S.: Semi-automatic metadata extraction from shellac and vinyl discs. In: Proceedings of the Workshop on Digital Preservation Weaving Factory for Analogue Audio Collections (2008)
12. Resumé, R.R.: An optical turntable. Engineer's degree thesis, Stanford University, Stanford (1986)
13. Schüller, D.: The ethics of preservation, restoration, and re-issues of historical sound recordings. Journal of the Audio Engineering Society 39(12), 1014–1016 (1991)
14. Smith, A.: Why digitize? In: Proceedings of Council on Library and Information Resources, Washington, DC, USA (1999)
Author Index
Agosti, Maristella 89
Baruzzo, Andrea 67
Basile, Teresa M.A. 125, 149
Benincà, Paola 89
Bergamin, Giovanni 39
Biba, Marenglen 125, 149
Borghesani, Daniele 183
Caffo, Rossella 1
Canazza, Sergio 161, 205
Candela, Leonardo 13, 79
Castelli, Donatella 13, 79
Cucchiara, Rita 183
D'Amico, Gianpaolo 101
Dattolo, Antonina 67, 205
de Gemmis, Marco 27
Del Bimbo, Alberto 101
Di Mauro, Nicola 149
Di Nunzio, Giorgio Maria 89
Esposito, Floriana 125, 149
Ferilli, Stefano 125, 149
Ferro, Nicola 55
Gentile, Anna Lisa 137
Grana, Costantino 183
Iria, José 137
Lops, Pasquale 27
Lunghi, Maurizio 47
Magnani, Matteo 173
Manghi, Paolo 79
Marinai, Simone 113
Meoni, Marco 101
Messina, Maurizio 39
Miotti, Beatrice 113
Miotto, Riccardo 89, 195
Montecchio, Nicola 195
Montesi, Danilo 173
Musto, Cataldo 27
Narducci, Fedelucio 27
Orio, Nicola 161, 195
Pagano, Pasquale 79
Pescarini, Diego 89
Pudota, Nirmala 67
Semeraro, Giovanni 27
Silvello, Gianmaria 55
Soda, Giovanni 113
Tang, Cristina 79
Tasso, Carlo 67
Thanos, Costantino 13, 79
Vitali, Stefano 5
Xia, Lei 137
Zhang, Ziqi 137