and syntax to get the paragraph(s) related to each image from the article, using the following steps: 1. Images are mentioned in paragraphs in several ways, such as: Fig/Figure n, where n is the figure number, or Figs/Figures n1, n2, ..., nn to refer to several figures. An image name may contain a letter after the number, e.g. Fig 1b, so a regular expression is used to search for figure references in all paragraphs. 2. If the above fails and the image name contains a letter after the figure number, search using the figure number only, without the letter. 3. If the above fails, search in HTML tags in addition to the normal text, since some files contain links to the figure without explicitly mentioning its name.
4. If the above fails, search the text for the word Figure or Fig alone, without any number, since some articles have only one figure and refer to it without a number. 5. HTML tags do not distinguish between the text under the image (the image caption) and the normal text of the article, so check whether the found paragraph is the image caption and ignore it in this case. 6. If the paragraph(s) found contain terms that do not appear in the image caption, add them to the image file; this is done after normalizing both the caption and the paragraph. (A regular-expression sketch of the figure-reference matching of step 1 is given below.) According to [11], textual retrieval achieved the best performance when only caption and title are used, while adding the article abstract or the whole article gave poor results. We therefore used the 2008 collection with title and caption to determine the best methods for indexing and retrieval with the Lemur toolkit. The Lemur toolkit supports indexing of several document types; we prepared the documents in TREC text format. For indexing, we used an Indri index and tested the Porter and Krovetz stemmers; the retrieval models supported by Lemur (KL-divergence, vector space, tf.idf, Okapi and InQuery) were tested for retrieval. The best result was obtained with Indri indexing and the Okapi model. We then used the free stop word list published by the Information Retrieval Group [10], which gave better performance, and we extended the stop word list with common query terms that are not relevant to the medical domain, such as 'show me', 'image', and 'photo', which slightly improved the performance. The pseudo relevance feedback included in the Lemur toolkit was applied, and several feedback settings were checked to determine the best one. These settings were applied after adding the relevant paragraph(s) to the 2008 collection of titles and captions, and there was about 30.
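As an illustration of step 1, the figure-reference search can be implemented with a single regular expression. The pattern below is a minimal sketch and only an assumption about how such matching could look; the exact expression used by the ISSR system is not given in the paper.

```python
import re

# Matches "Fig. 1", "Figure 2b", "Figs 3, 4" or "Figures 5a and 6" style references.
# The trailing letter (e.g. "1b") is captured so that a fallback search on the bare
# number (step 2) can be derived from the same match.
FIGURE_REF = re.compile(
    r"\b(?:Fig(?:ure)?s?\.?)\s*"                # Fig / Figs / Figure / Figures, optional dot
    r"(\d+[a-zA-Z]?"                            # first figure number, optional letter
    r"(?:\s*(?:,|and|&)\s*\d+[a-zA-Z]?)*)",     # optional list: "1, 2 and 3b"
    re.IGNORECASE,
)

def paragraphs_mentioning(figure_number, paragraphs):
    """Return the paragraphs that reference the given figure number."""
    hits = []
    for p in paragraphs:
        for match in FIGURE_REF.finditer(p):
            numbers = re.findall(r"\d+[a-zA-Z]?", match.group(1))
            # step 2: also accept a match on the bare number without its letter
            bare = figure_number.rstrip("abcdefgh")
            if figure_number in numbers or bare in (n.rstrip("abcdefgh") for n in numbers):
                hits.append(p)
                break
    return hits
```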
3  Experimental Results
All runs are text-only; 5 of them are for English queries, 2 for French and 2 for German. The five runs for the English queries are: Run 1 uses only title and caption as text features. Run 2 uses title, caption and the added paragraph(s). Run 3 uses only title and caption with pseudo relevance feedback. Run 4 uses title, caption and the added paragraph(s) with pseudo relevance feedback. Run 5 uses title, caption and the added paragraph(s) with pseudo relevance feedback and the updated stop word list. Two runs were submitted for French queries after translating them into English with the automatic Google translation tool: Run 6 uses title, caption and the added paragraph(s) with the updated stop word list. Run 7 uses title, caption and the added paragraph(s) with the updated stop word list and pseudo relevance feedback. Two runs were submitted for German queries after translating them into English with the same automatic Google translation tool: Run 8 uses
title, caption and added paragraph(s) as text features with the updated stop word list. Run 9 uses title, caption and added paragraph(s) as text features with the updated stop word list and pseudo relevance feedback. Table 1. Results of ISSR nine submitted runs. All of them use textual features only for retrieval. They are all automatic, no manual feedback was involved (AUTO).
¶: added paragraphs, PRF: Pseudo Relevance Feedback, USWL: Updated Stop Word List, MAP: Mean Average Precision, R-Prec: R-Precision.

#  Run Name        Language  ¶    PRF  USWL  MAP     R-Prec  Recall
1  ISSR Text 1     English   No   No   No    0.3499  0.3827  0.7269
2  ISSR Text 2     English   Yes  No   No    0.3315  0.3652  0.7485
3  ISSR Text 1rfb  English   No   Yes  No    0.277   0.3014  0.6791
4  ISSR Text 4     English   Yes  Yes  No    0.2672  0.2916  0.7341
5  ISSR Text 5     English   Yes  Yes  Yes   0.2692  0.2945  0.7358
6  ISSR Text FR 1  French    Yes  No   Yes   0.2951  0.3354  0.6956
7  ISSR Text FR 2  French    Yes  Yes  Yes   0.3111  0.3338  0.7667
8  ISSR Text DE 1  German    Yes  No   Yes   0.1997  0.2314  0.6808
9  ISSR Text DE 2  German    Yes  Yes  Yes   0.1981  0.2197  0.6088
4  Results and Analysis
The results in Table 1 show that the best results are obtained when only the image's caption and title are used as features. Using paragraphs in addition to title and caption decreased the performance, in contrast to the results of the same experiment on the 2008 collection. Recall increased after adding paragraphs (run 2 is better than run 1) and increased slightly after using the updated stop word list (run 5 is better than run 4); these results are similar to those on the 2008 collection. Using the domain-specific stop word list also increased the MAP (run 5 is better than run 4). The decrease in MAP after adding paragraphs was not expected, but a careful analysis of the results of each query shows an improvement in recall, as expected, since more relevant terms are added to the image annotation and therefore more relevant images are retrieved. The proposed method increases recall for 60% of the queries, while recall decreased for 28% and was unchanged for 12%, as shown in Table 2. We believe that improving recall is important for the ImageCLEFmed collection, which can be regarded as a small collection (about 84,000 images), where recall plays a more important role than in larger collections that can rely on information redundancy [13]. One reason for this unexpected result is that the additional text contains many noise terms that are irrelevant to the image. Even worse, some paragraphs mention several figures, so words relevant to one image are added to the other, irrelevant images mentioned in the paragraph. This lowers the similarity between many documents and the query, so that they receive a low rank in the retrieved list.
Table 2. Number of queries affected by adding paragraphs to the title and image caption, for MAP, R-precision and recall, over the 25 English queries

Effect     MAP (queries / ratio)  R-prec (queries / ratio)  Recall (queries / ratio)
Increased  9  / 36%               11 / 44%                  15 / 60%
Same       0  / 0%                3  / 12%                  3  / 12%
Decreased  16 / 64%               11 / 44%                  7  / 28%
This is the reason for the decreased MAP, since its value depends on document ranking. This noise did not decrease retrieval MAP on the 2008 collection for two reasons. Firstly, a significant number of articles in the 2008 dataset did not have HTML files in April 2009, when we were preparing the test data, so no irrelevant text was added to the images in these articles and they kept the same rank in the retrieved list. Secondly, 1605 images (about 2.4% of the collection) had no captions at all, and 471 of them are relevant to at least one query; by applying the proposed technique, annotations are added to these images, which improves their retrieval. Pseudo Relevance Feedback (PRF) is used in runs 3, 4, 5, 7, and 9. PRF is a simple and usually successful query expansion technique in which the most frequent words in the top k documents are used to expand the query. In our case, the first k documents either do not include relevant documents or, if they do, these contain a lot of noise because of the added paragraphs, so many irrelevant terms are added to the modified queries (compare run 4 with run 2). Our English retrieval results ranked 4th among the 13 participating groups in the textual runs, and our best run, with 0.3499 MAP, was the 13th of all 52 textual runs, while the MAP of the overall best run was 0.4293. Our best run with added paragraphs, with 0.3315 MAP, placed us 7th of the 13 groups and was the 18th best of all 52 runs. For the multilingual retrieval task, French and German queries were translated using the Google online machine translation service [8]. The translated French queries increased the MAP over the original English queries by about 15% (run 7 is better than run 5). On the other hand, the German queries decreased the MAP by 26% (run 9 is lower than run 5). However, as the Google translator is a general machine translation system and not a domain-specific tool, imperfect handling of medical terms was expected.
5  Conclusion and Future Work
A simple syntax-based technique for adding relevant text to image annotations is proposed, and this technique is tested for image retrieval using the Lemur toolkit. The results show that it is a promising approach. We intend to enhance this approach using semantic extraction methods such as shallow NLP techniques
or statistical approaches to extract only the relevant sentences from the paragraph referring to the image, instead of adding the whole paragraph, in order to reduce noise terms. Acknowledgments. We thank the ImageCLEFmed track organizers for providing the 2008 data set.
References
1. The Cross Language Image Retrieval Track, http://imageclef.org/
2. Hersh, W., Müller, H.: Image Retrieval in Medicine: The ImageCLEF Medical Image Retrieval Evaluation. ASIS&T Bulletin (March 2007)
3. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medicine - clinical benefits and future directions. International Journal of Medical Informatics 73, 1-23 (2004)
4. Croft, B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice. Addison-Wesley, Reading (2009)
5. Talvensaari, T.: Comparable Corpora in Cross-Language Information Retrieval. University of Tampere, Faculty of Information Sciences (2008)
6. Pirkola, A.: The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval. In: SIGIR 1998 Cross-language Information Retrieval (1998)
7. Hiemstra, D.: Using language models for information retrieval. Ph.D. Thesis, Centre for Telematics and Information Technology (2001)
8. Och, F.J.: Statistical machine translation: Foundations and recent advances. Tutorial at MT Summit X (2005)
9. The Lemur Toolkit for Language Modeling and Information Retrieval, http://www.lemurproject.org
10. IR Linguistic Utilities, http://ir.dcs.gla.ac.uk/resources/linguistic_utils
11. Diaz-Galiano, M., Garcia-Cumbreras, M.A., Martin-Valdivia, M.T., Urena-Lopez, L.A., Montejo-Raez, A.: SINAI at ImageCLEFmed 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
12. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track (2009)
13. Müller, H., Patrick, R.: Query and Document Translation by Automatic Text Categorization: A Simple Approach to Establish a Strong Textual Baseline for ImageCLEFmed 2006. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 57-61. Springer, Heidelberg (2006)
An Integrated Approach for Medical Image Retrieval through Combining Textual and Visual Features
Zheng Ye(1,2), Xiangji Huang(2), Qinmin Hu(2), and Hongfei Lin(1)
(1) Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning, 116023, China
(2) Information Retrieval and Knowledge Management Lab, York University, Toronto, Canada
{yezheng,jhuang}@yorku.ca, vhu@cse.yorku.ca, hflin@dlut.edu.cn
Abstract. In this paper, we present an empirical study of monolingual medical image retrieval through a series of experiments in the ImageCLEFmed 2009 task. There are three main goals. First, we evaluate traditional well-known weighting models from the text retrieval domain, such as BM25, TFIDF and the Language Model (LM), for context-based image retrieval. Second, we evaluate statistical-based feedback models and ontology-based feedback models. Third, we investigate how content-based image retrieval can be integrated with these two basic technologies from the traditional text retrieval domain. The experimental results show that: 1) traditional weighting models work well for the context-based medical image retrieval task, especially when their parameters are tuned properly; 2) statistical-based feedback models can further improve retrieval performance when a small number of documents is used for feedback, whereas medical image retrieval does not benefit from the ontology-based query expansion method used in this paper; 3) retrieval performance can be slightly boosted by an integrated retrieval approach. Keywords: CBIR, Visual and Textual Retrieval, Weighting Model, Pseudo Relevance Feedback, Ontologies, MeSH.
1  Introduction
Medical images are becoming more important in research, diagnosis and teaching as they become available in digital form. Image retrieval techniques can facilitate these processes. In image retrieval, there are two main approaches: context-based image retrieval and content-based image retrieval (CBIR). The first uses the textual context of an image (e.g. its annotation or surrounding text) for retrieval. The techniques used are similar to traditional text retrieval. However, the textual description of an image is usually short and noisy, which differs from traditional ad hoc text collections, so it is necessary to test and adapt traditional weighting models for this particular task. The second one
is CBIR, which uses low-level image features to retrieve images similar to an example. Existing techniques can only extract fairly simple features from images, which may not be representative enough for retrieval purposes. For example, a CBIR system based solely on color features could return an image of blue sky for a query whose example image shows a blue car. In order to address the problems stated above, we first evaluate traditional well-known weighting models from the text retrieval domain, such as BM25, TFIDF and LM, for context-based image retrieval [1]. Second, on the basis of the baseline results, we use statistical-based pseudo relevance feedback and ontology-based query expansion approaches to further enhance retrieval performance. Finally, we note that some images in the same article share the same contextual text (e.g. for comparison purposes), and often only one of these images is the one we are looking for, so it is impossible to filter out the other images using context-based image retrieval alone. We therefore propose to incorporate image content features to complement context-based image retrieval. The remainder of this paper is organized as follows. In Section 2, we describe the basic retrieval models. In Section 3, we present the statistical-based and ontology-based query expansion models used in our experiments. In Section 4, we propose an integrated approach for medical image retrieval. In Section 5, we present the experimental results and analyses. In Section 6, we conclude the paper with a discussion of our findings and a look at future work.
2  Weighting Models
In previous medical image retrieval tasks, a number of different information retrieval (IR) toolkits, such as Lemur, Jirs and Lucene, were used as the basic retrieval systems for context-based medical image retrieval [2,3]. However, there is no systematic comparison of different weighting models for the ImageCLEFmed task. In addition, it is not clear whether the default parameters of these models, empirically tuned for traditional ad hoc datasets, are optimal. In this paper, we compare four well-known weighting models: BM25 [4], JM-LM [5], TFIDF [6] and DFR [7]. The corresponding term weighting functions are as follows.

- BM25:
  w = \frac{(k_1+1)\,tf}{k_1((1-b)+b\,dl/avdl)+tf} \cdot \log\frac{N-n+0.5}{n+0.5} \cdot \frac{(k_3+1)\,qtf}{k_3+qtf}   (1)

- JM-LM:
  w = \left(1 + \frac{\mu}{1-\mu} \cdot \frac{tf \cdot FreqTotColl}{dl \cdot F_t}\right)   (2)

- TFIDF:
  w = qtf \cdot \frac{k_1\,tf}{tf + k_1(1-b+b\,dl/avdl)} \cdot \log\left(1+\frac{N}{n}\right)   (3)

- DFR:
  w = TF \cdot qtf \cdot NORM \cdot \log_e\frac{N+1}{n_{exp}}   (4)
  TF = tf \cdot \log_2(1 + avdl/dl)
  NORM = (tf+1)/(df \cdot (TF+1))
  n_{exp} = idf \cdot (1 - e^{-qtf/df})

where w is the weight of a query term, N is the number of indexed documents in the collection, n is the number of documents containing the term, tf is the within-document term frequency, qtf is the within-query term frequency, dl is the length of the document, avdl is the average document length, and the k_i are tuning constants (which depend on the database and possibly on the nature of the queries and are determined empirically).
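To make the BM25 weighting in Eq. (1) concrete, the following sketch computes the weight of a single query term. It is a minimal illustration, not the authors' implementation; the default parameter values k1 = 1.2, k3 = 8 and b = 0.75 are the ones reported later in Section 5.1.

```python
import math

def bm25_weight(tf, qtf, n, N, dl, avdl, k1=1.2, k3=8.0, b=0.75):
    """BM25 weight of one query term for one document (Eq. 1).

    tf   : term frequency in the document
    qtf  : term frequency in the query
    n    : number of documents containing the term
    N    : number of documents in the collection
    dl   : document length, avdl: average document length
    """
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    idf_part = math.log((N - n + 0.5) / (n + 0.5))
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return tf_part * idf_part * qtf_part

# A document's score for a query is the sum of the weights of the query terms
# that occur in it.
```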
3  Query Expansion
Pseudo-relevance feedback (PRF) via query expansion (QE) has been proven to be effective in many information retrieval (IR) tasks. In this section, we introduce two query expansion methods, namely a statistical-based feedback method in 3.1 and an ontology-based query expansion method in 3.2.

3.1  Query Expansion with the Bose-Einstein Distribution
The pseudo relevance feedback method used in our experiments is the DFR-based term weighting model described in [7]. The basic idea of these term weighting models for query expansion is to measure the divergence of a term's distribution in a pseudo relevance set from its distribution in the whole collection: the higher this divergence, the more likely the term is related to the query topic. We use the Bo1 weighting model in this set of experiments. The Bo1 term weighting model is based on Bose-Einstein statistics. Using this model, the weight of a term t in the exp_doc top-ranked documents is given by:

w(t) = tf_x \cdot \log_2\frac{1+P_n}{P_n} + \log_2(1+P_n)   (5)

where exp_doc usually ranges from 3 to 10 [7]. Another parameter involved in the query expansion mechanism is exp_term, namely the number of terms extracted from the exp_doc top-ranked documents; exp_term is usually larger than exp_doc [7]. P_n is given by F/N, where F is the frequency of the term in the collection and N is the number of documents in the collection. tf_x is the frequency of the term in the exp_doc top-ranked documents.
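A small sketch of Bo1-based expansion follows, under the assumption that term statistics are available as simple dictionaries; the retrieval machinery of the actual system is not shown.

```python
import math
from collections import Counter

def bo1_expansion_terms(feedback_docs, coll_freq, N, exp_term=10):
    """Rank candidate expansion terms with the Bo1 model (Eq. 5).

    feedback_docs : list of token lists, the exp_doc top-ranked documents
    coll_freq     : dict term -> frequency F of the term in the whole collection
    N             : number of documents in the collection
    """
    tf_x = Counter(t for doc in feedback_docs for t in doc)
    scored = {}
    for term, tfx in tf_x.items():
        p_n = coll_freq.get(term, 1) / N          # P_n = F / N
        scored[term] = tfx * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    # keep the exp_term highest-weighted terms as query expansion candidates
    return sorted(scored, key=scored.get, reverse=True)[:exp_term]
```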
3.2  Query Expansion with MeSH Ontology
In the medical domain, terms are highly synonymous and ambiguous. This motivates us to investigate the use of an ontology to expand the original query terms.
The Medical Subject Headings (MeSH1) is a thesaurus developed by the National Library of Medicine of the United States. MeSH contains two organization files, namely an alphabetic list with bags of synonymous and related terms and a hierarchical organization of descriptors associated with the terms. A term is composed of one or more words. We use the longest-match approach to recognize MeSH terms in a query: if all the words of a term are present in the original query, we add its synonymous terms to the query. To compare the words of a particular term with those of the query, we convert all words to lowercase and do not remove stop words in this step. In order to reduce the noise introduced by expanding the original query, only three categories of MeSH terms (A: Anatomy, C: Diseases, E: Analytical, Diagnostic and Therapeutic Techniques and Equipment) have been used for query expansion. Table 6 in Section 5.2 presents the MeSH-based query expansion results under four different basic weighting models.
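The longest-match expansion described above can be sketched as follows. The structure of the MeSH dictionary (a mapping from a multi-word term to its bag of synonyms, restricted to categories A, C and E) is assumed for illustration; it is not the format actually used by the authors.

```python
def expand_with_mesh(query_words, mesh_synonyms):
    """Add the synonyms of every MeSH term whose words all appear in the query.

    query_words   : list of lowercased query words (stop words kept)
    mesh_synonyms : dict mapping a MeSH term (tuple of words) -> list of synonym strings
    """
    query_set = set(query_words)
    expansion = []
    for term, synonyms in mesh_synonyms.items():
        if set(term) <= query_set:      # all words of the term occur in the query
            expansion.extend(synonyms)
    return query_words + expansion

# usage sketch (hypothetical entries):
# mesh = {("lung", "cancer"): ["pulmonary neoplasm"], ("ct",): ["computed tomography"]}
# expand_with_mesh("show me ct images of lung cancer".split(), mesh)
```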
4  An Integrated Approach
Content-based image retrieval (CBIR) systems enable users to search a large image database by issuing an image sample, in which the actual contents of the image are analyzed. The contents of an image refer to its features - colors, shapes, textures, or any other information that can be derived from the image itself. This kind of technology sounds interesting and promising. The key issue in CBIR is to extract representative features to describe an image, which is a very difficult research topic. According to the ImageCLEFmed conference notes [3], CBIR always performs poorly, while context-based image retrieval can always achieve good performance in terms of MAP. However, content features are also needed, especially when context information is not easy to obtain or when a number of images share the same context. This motivates us to combine the two technologies. In particular, we explore three representative features for medical image retrieval.

1. Color and Edge Directivity Descriptor (CEDD): a low-level feature which incorporates color and texture information in a histogram [8].
2. Tamura Histogram Descriptor: features coarseness, contrast, directionality, line-likeness, regularity, and roughness. The relative brightness of pairs of pixels is computed so that the degree of contrast, regularity, coarseness and directionality can be estimated [9].
3. Color Histogram Descriptor: retrieving images based on color similarity is achieved by computing a color histogram for each image that identifies the proportion of pixels within the image holding specific values (which humans express as colors). Current research attempts to segment color proportions by region and by spatial relationship among several color regions. Examining images based on the colors they contain is one of the most widely used techniques because it does not depend on image size or orientation (a minimal sketch of this computation is given below).

1 http://www.nlm.nih.gov/mesh/
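The color histogram descriptor in item 3 can be computed as in the sketch below, which assumes the image is available as an RGB array (e.g. loaded with an imaging library); the bin count of 8 per channel is an illustrative choice, not the setting used in the paper.

```python
import numpy as np

def color_histogram(rgb_image, bins_per_channel=8):
    """Normalized joint RGB histogram: the proportion of pixels in each color bin.

    rgb_image : numpy array of shape (height, width, 3) with values in 0..255
    """
    pixels = rgb_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return hist.flatten() / pixels.shape[0]     # proportions sum to 1

def histogram_similarity(h1, h2):
    """Histogram intersection, a common similarity for color histograms."""
    return float(np.minimum(h1, h2).sum())
```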
The final ranking score is obtained by merging the context-based similarity score (S_context) and the content-based similarity score (S_content). In particular, we use a linear combination:

score = (1 - \lambda) \cdot S_{context} + \lambda \cdot S_{content}   (6)
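A sketch of this late fusion, assuming both retrieval components return a score per image identifier; images missing from one list are treated as having score 0, which is our assumption rather than something specified in the paper.

```python
def fuse_scores(context_scores, content_scores, lam=0.2):
    """Linear combination of context-based and content-based scores (Eq. 6)."""
    all_ids = set(context_scores) | set(content_scores)
    fused = {
        img_id: (1 - lam) * context_scores.get(img_id, 0.0)
                + lam * content_scores.get(img_id, 0.0)
        for img_id in all_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```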
5  Experiments

5.1  Experiments Setting
The dataset used in our experiments is the dataset of the 2009 medical image retrieval task. It contains over 70,000 images from articles published in Radiology and Radiographics, including the text of the captions and a link to the HTML of the full-text articles. More information on the dataset and topics can be found in [2]. For the preprocessing of the image contextual text, we use whitespace to separate words for indexing and searching, and stop words are removed; besides these two simple steps, no further processing is applied. In our experiments, the default values of k_1, k_3 and b in the BM25 function are 1.2, 8 and 0.75 respectively; the default value of μ in the JM-LM model is 0.15.
5.2  Experimental Results and Analyses
In this section, we first discuss the influence of two basic retrieval techniques on context-based medical image retrieval. Then, we present the experimental results of the integrated retrieval approach.

Influence of the Basic Weighting Model. In Table 1, we present the top official textual results of the ImageCLEFmed 2009 track for comparison. The run marked with superscript 'o' is our best official textual run; thereafter, all our official runs are marked in the same way. From Table 2, we can see that the DFR weighting model achieved the best performance under the default settings.

Table 1. MAP performance of the top official textual runs

Runs                    MAP
LIRIS maxMPTT extMPTT   0.4293
sinai CTM t             0.3795
york.In expB2c1.0^o     0.3685
ISSR text 1             0.3499
ceb-essie2-automatic    0.3484
deu run1                0.3389
pivoted clef2009        0.3362
ISSR Text 2             0.3315
BiTeM EN                0.3206
Table 2. Comparison of four weighting models with default settings: BM25, JM-LM, TFIDF and DFR

Runs  BM25^o  JM-LM   TFIDF   DFR
MAP   0.3515  0.3444  0.3608  0.3730

Table 3. MAP performance of the BM25 model with different values of the parameter b

b    0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
MAP  0.3560  0.3554  0.3586  0.3577  0.3553  0.3543  0.3515  0.3503  0.3490  0.3450

Table 4. MAP performance of the JM-LM model with different values of the parameter μ

μ    0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
MAP  0.3401  0.3486  0.3542  0.3603  0.3677  0.3734  0.3774  0.3796  0.3817  0.0390
As for parameter tuning, we test different settings of b in BM25 and μ in JM-LM. From Table 3, we can see that the BM25 model is fairly robust to changes in its tuning parameter. For the JM-LM model, Table 4 shows that the performance increases significantly as μ grows; when μ = 0.9, the JM-LM weighting model outperforms the DFR weighting model. Although JM-LM is not as stable as BM25, it is still promising if the parameter is tuned properly.

Influence of Query Expansion. The main goal of this set of experiments is to investigate how many top documents and terms should be used for query expansion. Due to space limitations, we only present experimental results on the basis of the BM25 model; we observed similar behavior with the other three models. From Table 5 we can see that, in general, the performance can be boosted if the parameters are set properly. However, when the number of documents used for query expansion increases from 5 to 10, the performance drops quickly. The results in Table 5 suggest that only a very small number of documents are useful for query expansion in context-based medical image retrieval. Table 6 presents the ontology-based query expansion results under four different weighting models. The ontology-based query expansion approach does not work as well as we expected. Our conjecture is that MeSH-based query expansion may also bring negative terms into the query, especially abbreviations. In addition, when JM-LM is used as the basic retrieval model, the performance drops remarkably, which is further evidence that the performance of JM-LM is not stable.

Performance of the Integrated Approach. From Table 7, we can see that the retrieval performance can be slightly boosted by integrating content features. Among the three features, CEDD improves the performance most; however, the improvements are marginal. More representative features need to be developed for retrieval purposes.
Table 5. MAP performance of query expansion - baseline MAP = 0.3730 (rows: number of feedback documents; columns: number of expansion terms)

docs/terms   5       10      20      30      50      70      100
5            0.3901  0.3940  0.3947  0.3961  0.3958  0.3954  0.3963 (6.25%)
10           0.3576  0.3622  0.3643  0.3670  0.3672  0.3688  0.3683
20           0.3443  0.3533  0.3520  0.3526  0.3541  0.3561  0.3613
30           0.3371  0.3415  0.3377  0.3401  0.3434  0.3432  0.3448
50           0.3388  0.3405  0.3345  0.3351  0.3372  0.3379  0.3391

Table 6. MAP performance of MeSH-based query expansion

Runs     BM25      JM-LM   TFIDF   DFR
Not-FB   0.3515    0.3444  0.3608  0.3730
MeSH-FB  0.3458^o  0.3056  0.3529  0.3685^o

Table 7. MAP performance of the integrated approach (BM25 as basic weighting model)

Runs     BM25 baseline  CEDD    Tamura  Color
λ = 0.1  0.3515         0.3515  0.3514  0.3529
λ = 0.2  0.3515         0.3552  0.3544  0.3524
λ = 0.3  0.3515         0.3616  0.3544  0.3516

6  Conclusions
In this study, we first evaluate four well-known weighting models for context-based medical image retrieval. The performance of the four weighting models is comparable, but the DFR weighting model works best under default settings. The JM-LM model is not stable for this task; however, if its parameter is tuned properly, it is still promising. Second, we investigate query expansion technologies for the medical image retrieval task. In general, the statistical-based QE method outperforms the ontology-based method, and the experimental results suggest that only a small number of top-ranked documents are useful for statistical-based QE. Ontology-based methods sound interesting and useful, but their actual performance was not good in our experiments; more sophisticated processing is needed for this kind of method. Finally, we explore three content features for content-based medical image retrieval. The experimental results show that retrieval performance can only be slightly improved: the current features extracted from images may not be representative enough to capture their characteristics, and better features are required to improve CBIR. In future work, we plan to work in two directions. First, we will use data-driven approaches to choose optimal parameters for statistical-based QE. Second, we will explore the correlation of different content features of images; if more features can be integrated into medical image retrieval properly, we believe the retrieval performance can be further improved.
Acknowledgements. We thank the reviewers for their valuable and constructive comments. This research is jointly supported by NSERC of Canada, the Early Researcher/Premier's Research Excellence Award, the Natural Science Foundation of China (No. 60373095 and 60673039) and the National High Tech Research and Development Plan of China (2006AA01Z151).
References
1. Zhai, C.: Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2(3), 137-213 (2008)
2. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 Medical Image Retrieval Track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Müller, H., Deselaers, T., Kim, E., Kalpathy-Cramer, J., Deserno, T., Clough, P., Hersh, W.: Overview of the ImageCLEFmed 2007 Medical Retrieval and Annotation Tasks. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 472-491. Springer, Heidelberg (2008)
4. Hancock-Beaulieu, M., Gatford, M., Huang, X., Robertson, S.E., Walker, S., Williams, P.W.: Okapi at TREC-5. In: Text REtrieval Conference (TREC) TREC-5 Proceedings (1996)
5. Zhang, R., Yi, C., Zheng, Z., Metzler, D., Nie, J.: Search Result Re-ranking by Feedback Control Adjustment for Time-Sensitive Query. In: Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), pp. 165-168 (2009)
6. Singhal, A., Buckley, C., Mitra, M.: Pivoted Document Length Normalization. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21-29. ACM, New York (1996)
7. Amati, G.: Probabilistic Models for Information Retrieval based on Divergence From Randomness. PhD thesis, Department of Computing Science, University of Glasgow (2003)
8. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and Edge Directivity Descriptor. A Compact Descriptor for Image Indexing and Retrieval. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 312-322. Springer, Heidelberg (2008)
9. Tamura, H., Mori, S., Yamawaki, T.: Textural Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man and Cybernetics 8, 460-473 (1978)
Analysis Combination and Pseudo Relevance Feedback in Conceptual Language Model: LIRIS Participation at ImageCLEFmed
Loïc Maisonnasse(1), Farah Harrathi(1), Catherine Roussey(1,2), and Sylvie Calabretto(1)
(1) Université de Lyon, CNRS, INSA de Lyon, Université Lyon 1, LIRIS UMR 5205, Villeurbanne, France
firstname.lastname@liris.cnrs.fr
(2) Cemagref, 24 Av. des Landais, BP 50085, 63172 Aubière, France
Abstract. This paper presents the LIRIS contribution to the CLEF 2009 medical retrieval task (ImageCLEFmed). Our model makes use of the textual part of the corpus and of the medical knowledge found in the Unified Medical Language System (UMLS) knowledge sources. As proposed in [6] last year, we use a conceptual representation for each sentence and a language modeling approach. We test two versions of the conceptual unigram language model: one that uses the log-probability of the query and a second one that computes the Kullback-Leibler divergence. We use different concept detection methods and combine these detection methods on queries and documents. This year we mainly test the impact of using additional analyses of the queries. We also test combinations on French queries, where we combine translation and analysis in order to compensate for the lack of French terms in UMLS; this provides good results, close to the English ones. To complete these combinations we propose a pseudo relevance feedback method. This approach uses the n first retrieved documents to form one pseudo query that is used in the Kullback-Leibler model to complement the original query. The results show that extending the queries with such an approach improves the results.
1  Introduction
The previous ImageCLEFmed tracks have shown the advantages of conceptual indexing (see [6]). Such indexing allows one to better capture the content of queries and documents and to match them at an abstract semantic level. On such conceptual representations, [5] proposed a conceptual language modeling approach that handles different conceptual representations of documents or queries. In this paper we extend this approach in several ways. The rsv value in [5] is computed through a simple query likelihood; here we also evaluate the use of a Kullback-Leibler divergence, as proposed in many language model approaches. Then we
compare combinations of conceptual representations under the divergence rather than under the likelihood. In last year's participation we used two analyses for documents and queries; as the results presented in [5] show that combining analyses on queries is an easy way to improve the results, this year we make use of two supplementary analyses on the queries. One of them is a new concept detection method that uses only statistical techniques. Finally, we complete this model with a pseudo relevance feedback extension of the queries based on our language model approach. This paper first presents the different extensions of our conceptual model. Then we detail the different document and query analyses. Finally we show and discuss the results obtained at CLEF 2009.
2  Conceptual Model
We rely on a language model defined over concepts, as proposed in [5], which we refer to as the conceptual unigram model. We assume that a query q is composed of a set C of concepts, each concept being independent of the others conditionally on a document model. First, we compute the rsv of this approach by simply computing the log-probability of the concept set C given a model M_d of the document d:

RSV_{log}(q,d) = \log P(C|M_d) = \sum_{c_i \in C} \log\left(P(c_i|M_d)^{\#(c_i,q)}\right)   (1)

where \#(c_i,q) denotes the number of times concept c_i occurs in the query q. The quantity P(c_i|M_d) is estimated through maximum likelihood with Jelinek-Mercer smoothing:

P(c_i|M_d) = (1-\lambda_u)\,\frac{|c_i|_d}{|*|_d} + \lambda_u\,\frac{|c_i|_D}{|*|_D}

where |c_i|_d (respectively |c_i|_D) is the frequency of concept c_i in the document d (respectively in the collection D), and |*|_d (respectively |*|_D) is the size of d, i.e. the number of concepts in d (respectively in the collection). In a second approach, we compute the rsv of a query q for a document d by using the Kullback-Leibler divergence between the document model M_d estimated over d and the query model M_q estimated over q:

RSV_{kld}(q,d) = -D(M_q \| M_d) = -\sum_{c_i \in C} P(c_i|M_q)\,\log\frac{P(c_i|M_q)}{P(c_i|M_d)}
               = \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_d) - \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_q)   (2)

Since the last term corresponds to the query entropy and does not affect document ranking, we only compute:

RSV_{kld}(q,d) \propto \sum_{c_i \in C} P(c_i|M_q)\,\log P(c_i|M_d)   (3)

where P(c_i|M_d) is estimated as previously, and P(c_i|M_q) is computed through maximum likelihood on the query as P(c_i|M_q) = |c_i|_q / |*|_q, where |c_i|_q is the frequency of concept c_i in the query and |*|_q is the size of q.
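A compact sketch of the smoothed scoring in Eqs. (1)-(3), assuming concept counts are held in plain dictionaries; it is meant to illustrate the formulas, not to reproduce the authors' system.

```python
import math

def p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len, lam=0.5):
    """Jelinek-Mercer smoothed P(c | M_d). lam plays the role of lambda_u;
    its value here is illustrative, not the one used in the experiments."""
    return ((1 - lam) * doc_counts.get(c, 0) / doc_len
            + lam * coll_counts.get(c, 0) / coll_len)

def rsv_log(query_counts, doc_counts, coll_counts, doc_len, coll_len):
    """Log-probability score of Eq. (1); assumes every query concept occurs in the collection."""
    return sum(qc * math.log(p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len))
               for c, qc in query_counts.items())

def rsv_kld(query_counts, doc_counts, coll_counts, doc_len, coll_len):
    """Rank-equivalent KL score of Eq. (3): sum over c of P(c|M_q) * log P(c|M_d)."""
    q_len = sum(query_counts.values())
    return sum((qc / q_len) * math.log(p_concept_doc(c, doc_counts, coll_counts, doc_len, coll_len))
               for c, qc in query_counts.items())
```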
2.1  Model Combination
We present here the method used to combine different sets of concepts (i.e. concepts obtained from different analyses of queries and/or documents) with the two rsv defined above. We use the results obtained in [5] to select the best combinations on queries and documents. First, we group the different analyses of a query. To do so, we assume that a query is represented by a set of concept sets Q = {C_q}, and that the probability of this set given a document model is computed as the product of the probabilities of each concept set C_q. Whether the rsv uses the log-probability (RSV_log) or the divergence (RSV_kld), the combination is computed through a sum over the different query representations:

RSV(Q,d) \propto \sum_{C_q \in Q} RSV(C_q, d)   (4)
where RSV(C_q,d) is either RSV_log (equation 1) or RSV_kld (equation 3). With this fusion, the best rsv is obtained for a document model that can generate all analyses of the query with high probability. Second, we group the different analyses d of a document, D = {d}. We assume that a query can be generated by different models of the same document (i.e. a set of models corresponding to each analysis d of D). Based on the results of [5], we keep the highest probability among the different models:

RSV(Q,D) = \operatorname{argmax}_{d \in D} RSV(Q,d)   (5)

With this method, documents are ranked, for a given query, according to their best document model.

2.2  Pseudo Relevance Feedback
Based on the n first results retrieved for one query set Q by one RSV (equation 4), we compute a pseudo relevance feedback score PRF. This score corresponds to the rsv obtained by the pseudo query Q_fd, built by merging the n first documents retrieved with the query Q, added with a smoothing parameter to the results obtained by the original query Q:

PRF(Q_{fd}, d) = (1-\lambda_{prf})\,RSV(Q,d) + \lambda_{prf}\,RSV(Q_{fd},d)   (6)

where RSV(Q,d) is either RSV_log or RSV_kld, and RSV(Q_fd,d) is the same type of rsv applied to the pseudo-query Q_fd, which corresponds to the merging of the n first results retrieved by RSV(Q,d). λ_prf is a smoothing parameter that gives lower or higher importance to the pseudo query. If different collection analyses are used, we finally merge these results using equation 5.
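The feedback step of Eq. (6) can be sketched as follows; rsv is assumed to be one of the scoring functions above, and concatenating the top-n documents into a single pseudo-query is a simplification of the merging described in the text.

```python
from collections import Counter

def prf_score(query_counts, ranked_docs, all_doc_counts, rsv, n=100, lam_prf=0.5):
    """Interpolate the original query score with the score of a pseudo-query
    built from the concept counts of the n first retrieved documents (Eq. 6).

    ranked_docs    : document ids ordered by RSV(Q, d)
    all_doc_counts : dict doc_id -> concept count dict
    rsv            : scoring function taking (query_counts, doc_id)
    """
    pseudo_query = Counter()
    for doc_id in ranked_docs[:n]:
        pseudo_query.update(all_doc_counts[doc_id])
    return {
        doc_id: (1 - lam_prf) * rsv(query_counts, doc_id)
                + lam_prf * rsv(pseudo_query, doc_id)
        for doc_id in all_doc_counts
    }
```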
3  Concept Detection
UMLS is a good candidate as a knowledge source for medical text indexing. It is more than a terminology because it describes terms with associated concepts.
This knowledge source is large (more than 1 million concepts and 5.5 million terms in 17 languages). UMLS is not an ontology, as there is no formal description of concepts, but its large set of terms and their variants, specific to the medical domain, enables full-scale conceptual indexing. In UMLS, all concepts are assigned to at least one semantic type from the Semantic Network. This provides a consistent categorization of all concepts in the meta-thesaurus at the relatively general level represented in the Semantic Network. The Semantic Network also contains relations between concepts, which allow one to derive relations between the concepts found in documents (and queries).

3.1  Linguistic Detection Process
The detection of concepts from a thesaurus based on a linguistic analysis of the document is a relatively well-established process. It consists of four major steps (refer to [5] for details):

1. Morpho-syntactic analysis (POS tagging) of the document with lemmatization of inflected word forms;
2. Filtering of empty words on the basis of their grammatical class;
3. Detection in the document of words or phrases appearing in the meta-thesaurus;
4. Optional filtering of the identified concepts.

3.2  Statistical Detection Process
We developed a statistical method of concept detection that can be applied to several languages without any linguistic analysis. This method replaces the morpho-syntactic analysis (steps 1 and 2 of the previous section) by statistical processing. Our method is composed of four main steps:

1. Empty word and simple term extraction based on corpus analysis;
2. Compound term extraction;
3. Concept detection;
4. Concept filtering.
The last two steps (3 and 4) are similar to the linguistic detection process, so we do not describe them in the following paragraphs. Empty Word and Simple Term Extraction. Empty words are words that have no discriminative power to identify a specific document in a corpus, because they are distributed evenly over all documents. They can be stop words or general words such as the days of the week. In order to extract the empty words of the document collection we use two corpora: the indexing corpus and the support corpus. The support corpus should be in the same language as the indexing corpus but should deal with another domain. For example, in our experiments the indexing corpus is about medicine and the support corpus is about law (the
European Parliament collection1). We define an empty word as a word that belongs to both the indexing corpus and the support corpus and whose frequency in the two corpora is above a threshold fixed experimentally. Simple terms are the words of the indexing corpus that are not detected as empty words. Compound Term Extraction. We consider a compound term (a term composed of more than one word) as a kind of word collocation. According to [1], words involved in a collocation can be detected under two assumptions: (1) the words must appear together significantly more often than expected by chance; (2) the words should appear in a relatively rigid way because of syntactic constraints. [3] uses the Mutual Information (MI) measure to extract collocations of two words. Unfortunately, the MI measure is not able to extract compound terms containing empty words, and it is not adapted to extracting compound terms of more than two words. We therefore adapt the Mutual Information measure to avoid these two drawbacks. Considering two words m1 and m2, our Adapted Mutual Information (AMI) is:

AMI(m_1, m_2) = \log_2 \frac{P(m_1, m_2)}{P(m_1)^2}   if m_2 is an empty word,
AMI(m_1, m_2) = \log_2 \frac{P(m_1, m_2)}{P(m_1)\,P(m_2)}   otherwise.   (7)

P(m1) is estimated by counting the number of observations of m1 in the collection and normalizing by N, the size of the collection; P(m1, m2) is estimated by counting the number of times that m1 is followed by m2 and normalizing by N. The term extraction process is iterative and incremental: the compound terms of iteration i+1 (i.e. of length i+1 words) are built from the terms of iteration i (of length i words), starting from the simple terms of length one. For each pair (term found so far, another word of the indexing corpus), we compute the AMI; if the AMI is above a threshold, the new compound term is added to the starting list of the next iteration. In our experiments we fixed the AMI threshold to 15. The iterations continue as long as new compound terms are extracted.
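A sketch of the first iteration (two-word terms) of this extraction follows; longer compounds would be built by repeating the step, as described above. The corpus representation (a list of token lists) is an assumption for illustration.

```python
import math
from collections import Counter

def extract_two_word_terms(corpus, empty_words, threshold=15.0):
    """First AMI iteration: extract two-word compound terms (Eq. 7).

    corpus      : list of documents, each a list of tokens
    empty_words : set of words considered empty
    """
    N = sum(len(doc) for doc in corpus)
    unigram = Counter(tok for doc in corpus for tok in doc)
    bigram = Counter((doc[k], doc[k + 1]) for doc in corpus for k in range(len(doc) - 1))

    compounds = []
    for (m1, m2), pair_count in bigram.items():
        p12 = pair_count / N
        p1 = unigram[m1] / N
        if m2 in empty_words:
            ami = math.log2(p12 / (p1 ** 2))
        else:
            ami = math.log2(p12 / (p1 * (unigram[m2] / N)))
        if ami > threshold:
            compounds.append((m1, m2))
    return compounds

# Longer terms are built incrementally: each retained term of length i is combined
# with a following word and kept if its AMI again exceeds the threshold.
```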
3.3  Linguistic Detection versus Statistical Detection
We test our new statistical concept detection process on the CLEFmed 2007 collection using UMLS, and compare the statistical detection with the results obtained by [4] using linguistic techniques on the same collection and UMLS. In [4], three linguistic analyzers are used prior to concept detection: MetaMap (MM), MiniPar (MP) and TreeTagger (TT). The results obtained by these analyzers and by our statistical method, which we call FA, are given in Table 1. The linguistic methods perform slightly better in MAP, while the statistical method performs better in P@5. Thus we can conclude
1 http://www.statmt.org/europarl/
that our statistical method of concept detection gives results similar to those obtained with linguistic techniques.

Table 1. Comparison of statistical versus linguistic concept detection on CLEFmed 2007 (Δ values: relative change of the statistical method FA with respect to each linguistic analysis)

Method       Analysis  MAP    P@5    Δ MAP   Δ P@5
Linguistic   MM        0.246  0.357  -0.81%  19.05%
Linguistic   MP        0.246  0.424  -0.81%  0.24%
Linguistic   TT        0.258  0.462  -5.43%  -8.01%
Statistical  FA        0.244  0.425

3.4  Our Four Detection Processes
Based on the previous results, we use these four analyses in our experiments. From these analyses, we use MP and TT to analyse the collection, and we pick some of them to analyse the queries depending on the run. This year we also test this combination approach on French queries, where we first detect concepts with our term-mapping tools and the French version of TreeTagger. We then translate the French queries into English with the Google API2 and extract concepts from this English translation with the MP and TT analyses.
4  Evaluation
We train our methods on the CLEFmed 2008 corpus and run the best parameters obtained on the CLEFmed 2009 corpus [2]. On this year's collection, we submitted 10 runs exploring different variations of our model. Since the previous year's results showed that merging query analyses improves the results, this year we test the impact of adding new analyses on the queries only. We first test three model variations:

- (UNI.log) the conceptual unigram model (as defined in 1);
- (UNI.kld) the conceptual unigram model with the divergence (as defined in 3);
- (PRF.kld) the conceptual unigram model combined with pseudo relevance feedback (as defined in 6).

For each model, we test it on the collection analysed by two detection methods, MiniPar and TreeTagger (MPTT), using the model combination method proposed in Section 2.1, and with the three following query analyses:

- (MPTT) groups the MP and TT analyses;
- (MMMPTT) groups the two preceding analyses with the MM one;
- (MMMPTTFA) groups the three preceding analyses with the FA one.

2 http://code.google.com/intl/fr/apis/ajaxlanguage/documentation/
4.1  Results
For each method we use the best parameters obtained on the ImageCLEFmed 2008 corpus for MAP and apply these parameters to the new 2009 collection. We first compare, in Table 2, the results of the two rsv definitions for MAP and for different query merging strategies. The results show that the two rsv give close results on the 2008 queries. On the 2009 queries, our best result is obtained with the log-probability and two analyses (MPTT) of the query. Using the four analyses (MMMPTTFA), the log-probability and the KL-divergence give close results. As presented before, we also test our combination model on French queries: from these queries we obtain different concept sets by merging detection methods and by translating, or not, the query into English in order to find the UMLS concepts that are not linked to French terms. This method obtains a good result of 0.377 in MAP, which shows that the combination methods can also be applied to translated queries. We then test our pseudo relevance feedback method: we query with RSV_kld and apply the relevance feedback; the results are presented in Table 3. On the 2008 queries, the best results are obtained with the pseudo query built from the 100 first documents initially retrieved, and merging more analyses of the query improves the results. On 2009 the results are also good, but the best results are obtained using only two analyses (MPTT).

Table 2. Results (MAP) for different query analysis combinations, for the two unigram models

                 MPTT           MMMPTT         MMMPTTFA
                 2008   2009    2008   2009    2008   2009
log-probability  0.280  0.420   0.276  0.412   0.281  0.410
KL-divergence    0.279  -       -      -       -      0.416
Table 3. Results (MAP) for different sizes of the pseudo relevance feedback with the Kullback-Leibler divergence and different query analyses

size of the         MPTT           MMMPTT         MPTTFA  MMMPTTFA
pseudo query (n)    2008   2009    2008   2009    2009    2009
20                  0.279          0.281
50                  0.289          0.290
100                 0.292  0.429   0.299  0.416   0.424   0.418

5  Conclusion
Using the conceptual language model provides good performance in medical IR, and merging conceptual analyses still improves the results. This year we explored a variation of this model by testing the use of the Kullback-Leibler divergence, and we improved it by integrating pseudo relevance feedback. The
two model variations provide good but similar results. Adding pseudo relevance feedback improves the results, providing the best MAP of the 2009 CLEF campaign. We also carried out an experiment on French queries where we used the combination method to compensate for the lack of French terms in UMLS; these results show that the combination methods can also be used over various concept detection methods.
References
[1] Smadja, F.A.: Retrieving collocations from text: Xtract. Computational Linguistics 19(1), 143-177 (1993)
[2] Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009)
[3] Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29 (1990)
[4] Chevallet, J.P., Maisonnasse, L., Gaussier, E.: Combinaison d'analyses sémantiques pour la recherche d'information médicale. In: Atelier RISE (Recherche d'Information SEmantique), conférence INFORSID (May 2009)
[5] Gaussier, E., Maisonnasse, L., Chevallet, J.P.: Model fusion in conceptual language modeling. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956. Springer, Heidelberg (2008)
[6] Gaussier, E., Maisonnasse, L., Chevallet, J.P.: Multiplying concept sources for graph modeling. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 585-592. Springer, Heidelberg (2008)
The MedGIFT Group at ImageCLEF 2009
Xin Zhou(1), Ivan Eggel(2), and Henning Müller(1,2)
(1) Geneva University Hospitals and University of Geneva, Switzerland
(2) University of Applied Sciences Western Switzerland, Sierre, Switzerland
henning.mueller@sim.hcuge.ch
Abstract. MedGIFT is a medical imaging research group of the Geneva University Hospitals and the University of Geneva, Switzerland. Since 2004 the group has participated in ImageCLEF each year, focusing on the medical imaging tasks. For the medical image retrieval task, two existing retrieval engines were used: the GNU Image Finding Tool (GIFT) for visual retrieval and Apache Lucene for text. Various strategies were applied to improve retrieval performance. In total, 16 runs were submitted, 10 for the image-based topics and 6 for the case-based topics. The baseline GIFT setup used for the past three years obtained the best results among all our submissions. For medical image annotation two approaches were tested. One approach uses GIFT for retrieval and kNN (k-Nearest Neighbors) for classification; the second uses the Scale-Invariant Feature Transform (SIFT) with a Support Vector Machine (SVM) classifier. Three runs were submitted, two with the GIFT-kNN approach and one using the common results of the two approaches. The GIFT-kNN approach gave stable results. The SIFT-SVM approach did not achieve the expected performance, most likely because the SVM kernel used was not optimized.
1  Introduction
A medical retrieval task has been part of ImageCLEF1 since 2004 [1,2]. The MedGIFT2 research group has participated in all these competitions using the same technology as a baseline and tried to improve the performance of this baseline over time. GIFT3 (the GNU Image Finding Tool, [3]) has been the technology used for visual retrieval. Visual runs using GIFT have also been made available to other participants of ImageCLEF. For text retrieval, Lucene4 was employed in 2009. The full text of the articles was indexed with no optimization. More information concerning the setup and collections of the medical retrieval task can be found in [4].

1 http://www.imageclef.org/
2 http://www.sim.hcuge.ch/medgift/
3 http://www.gnu.org/software/gift/
4 http://lucene.apache.org/
2  Retrieval Tools Reused
This section describes the basic technologies used for retrieval.

2.1  Text Retrieval Approach
The text retrieval used in 2009 is based on Lucene. No specific terminologies such as MeSH (Medical Subject Headings) were used. Only one textual run was submitted. The texts were indexed entirely from the HTML (Hyper Text Markup Language), removing links and metadata. The query text was not modified.
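Extracting the indexable text from the HTML articles, as described above, can be done with Python's standard html.parser; this is only an illustrative sketch, not the indexing code actually used with Lucene.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles, link anchors and the head."""
    SKIP = {"script", "style", "a", "head"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```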
2.2  Visual Retrieval Techniques
GIFT has been used for the visual retrieval for the past five years. This tool is open source and can be used by other participants of ImageCLEF as well. The goal of using standard GIFT is also to provide a baseline to facilitate the evaluation of other techniques. GIFT uses a partitioning of the image into fixed regions to obtain local features. During the last 3 years, the performance obtained by GIFT remained unsatisfying. Various strategies were tried out in order to get improvements, such as integration of aspect–ratio as feature, automatic query expansion and threshold optimization for axes for the annotation task. In ImageCLEF 2009, query expansion with negative examples was carried out for the image retrieval task, and SIFT features were integrated into the image annotation task.
3  Results
This section describes our main results for the two medical tasks.

3.1  Medical Image Retrieval
All runs were obtained using GIFT with 8 gray levels. Various strategies were tried to increase performance. One strategy is to query with the images belonging to one topic separately and then to combine the obtained results. Another strategy is to apply negative feedback using the query images of other topics, as we assume that the topics are sufficiently different. Adding the aspect ratio is another feature that has worked well in the past. In total, 16 automatic runs were submitted: 2 textual, 10 visual and 4 mixed; 10 runs were for the image-based retrieval topics and 6 for the case-based topics. Runs are labeled by the strategies applied. The labels and their meanings are:
- txt: textual retrieval;
- vis: visual retrieval;
- mix: combination of textual and visual retrieval;
- sep: one query per image is performed to produce a list of similar images for each query image;
– AR: adding aspect ratio;
– NgRan: query expansion by randomly taking images from other topics as negative examples;
– sum: basic results fusion: if one item has several similarity scores, the sum of all scores is used;
– max: basic results fusion: if one item has several similarity scores, the maximum value is used;
– 0.x: for a mixed run, 0.x is the weight for the visual retrieval and (1 − 0.x) for the textual retrieval;
– EN: the language used for textual retrieval is English;
– BySim: for results fusion, each result is weighted by the similarity score given by Lucene/GIFT;
– ByFreq: for results fusion, each result is weighted by the number of appearances.
The results of the 25 ad–hoc topics are shown in Table 1 and those of the case–based topics (26–30) are shown in Table 2. Mean average precision (MAP), binary preference (Bpref), and early precisions (P10, P30) are used as measures.

Table 1. Results of the runs for the image–based topics

Run                                  Run type  MAP     Bpref   P10    P30     num rel ret
best textual run (LIRIS)             Textual   0.4293  0.4568  0.664  0.552   1814
HES-SO-VS txt EN                     Textual   0.3179  0.3498  0.600  0.4987  1462
MedGIFT vis GIFT8 (best visual run)  Visual    0.0153  0.0347  0.068  0.0467  284
MedGIFT vis sep max                  Visual    0.0131  0.0276  0.076  0.056   266
MedGIFT vis sep sum AR               Visual    0.013   0.0303  0.072  0.052   262
MedGIFT vis sep sum                  Visual    0.0114  0.0282  0.052  0.0573  259
MedGIFT vis sep max AR               Visual    0.0102  0.0303  0.076  0.0547  253
MedGIFT vis sum negRan               Visual    0.0098  0.028   0.044  0.053   210
MedGIFT vis max negRan               Visual    0.0079  0.0248  0.044  0.044   201
best automatic mixed run (DEU)       Mixed     0.3682  0.386   0.544  0.4827  1753
MedGIFT mix 0.3NegRan EN             Mixed     0.29    0.3216  0.604  0.516   1176
MedGIFT mix 0.5 EN                   Mixed     0.2097  0.2456  0.592  0.4293  848
MedGIFT mix 0.5NegRan EN             Mixed     0.1354  0.1691  0.488  0.3267  547
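The fusion labels above (sum, max, and the 0.x linear weighting) can be illustrated with the minimal sketch below. It is not the actual medGIFT code; the data layout and the absence of any score normalization are simplifying assumptions made for the example.

```python
# Hedged sketch of the result-fusion rules named above; data layout is assumed.
def fuse_runs(result_lists, mode="sum"):
    """Fuse several ranked lists; each list maps image id -> similarity score."""
    fused = {}
    for results in result_lists:
        for image_id, score in results.items():
            if mode == "sum":                      # "sum" label: add all scores
                fused[image_id] = fused.get(image_id, 0.0) + score
            else:                                  # "max" label: keep the best score
                fused[image_id] = max(fused.get(image_id, 0.0), score)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def mix_visual_textual(visual, textual, visual_weight=0.3):
    """"0.x" label: linear combination of the visual and textual scores."""
    fused = {}
    for image_id in set(visual) | set(textual):
        fused[image_id] = (visual_weight * visual.get(image_id, 0.0)
                           + (1.0 - visual_weight) * textual.get(image_id, 0.0))
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```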
Image–Based Topics. In total, 59 textual runs were submitted for ImageCLEFmed 2009. The average score (MAP) for the textual runs is around 0.3. The Lucene search engine with a standard setup (HES–SO–VS txt.txt) performed slightly better than the average. The best textual runs used mapping of text to Medical Subject Headings (MeSH) or the Unified Medical Language System (UMLS) to reach an improvement [5,6,7]. 5 groups submitted 16 visual runs. Our best run is the baseline that used GIFT with 8 gray levels (MedGIFT vis GIFT8.txt). The baseline obtained the highest MAP among all visual runs. The run using the one query per image strategy was officially ranked as second but it outperformed the other visual
runs on early precision. As the performance was fairly limited, additional tests were performed and are described in Section 3.1. The second best visual run was submitted by the Image and Text Integration (ITI) group from the National Library of Medicine. Various low level global features were used and a linear combination of these features was applied [8]. SVMs were used to map visual features to semantic terms based on a predefined visual concept tree built from the consolidated ImageCLEFmed collection. Despite the integration of a visual concept tree with machine learning, the results were not extremely high. There were 29 mixed textual/visual runs. The MedGIFT runs are among the five best runs. However, as the textual runs outperform the visual runs, many mixed runs are not even as good as the corresponding textual runs. Compared with our textual baseline run, all mixed runs obtained worse performance. Several other groups had similar conclusions [8,9]. York University declared that the Color and Edge Directivity Descriptor (CEDD) slightly boosted the performance of a textual run [10]. Both the group from York University and our group used a similar linear combination strategy for fusing the results. Considering the fact that the visual runs submitted by York University obtained the worst results among all submitted visual runs, the improvement detected by York University might require further investigation. The best mixed run is from the DEU group, which combined visual and textual features into a single feature matrix [11]. The results show that fusion in the feature space can obtain good results.
Follow–Up Analyses. Follow–up analyses were performed once the ground truth was available. Using the one query per image strategy and negative query expansion did not improve visual retrieval. The performance for each topic with the three main techniques is shown in Figure 1. The similarities among the topic images of each topic are shown in Figure 2 to distinguish homogeneous and heterogeneous topics. To obtain the similarity among topic images, all topic images were indexed and queries with each topic image were performed. In a pairwise comparison, the images of one topic were analyzed. The result shown is the average score over all pairwise comparisons per topic using the GIFT baseline run. For the submitted visual runs with negative query expansion, negative examples were randomly selected. In an additional approach, negative examples were selected based on the similarity score obtained through visual queries. These runs slightly outperformed the submitted runs using negative examples. In Figure 1, the new run using one negative example is also presented. Comparing the baseline (MedGIFT vis GIFT8) with one query per image (MedGIFT vis sep max) shows that the performance for a topic is not correlated with the similarity among the topic images. On the one hand, topics 2, 4, 14, 24, 25 contain images with little similarity, with only topic 2 being improved using one query per image. On the other hand, topics 5, 15, 23 contain very similar images and still the one query per image strategy gave significantly better results. For all other topics the baseline obtained better scores. Using negative examples outperformed the baseline run and the one query per image strategy only rarely (for example topics 6, 16, 18).
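The per-topic homogeneity analysis behind Figure 2 (average similarity over all pairs of query images of one topic) can be expressed with the short sketch below; the `similarity` function is a placeholder for a GIFT query of one topic image against another and is not the actual medGIFT implementation.

```python
from itertools import combinations

def topic_homogeneity(topic_images, similarity):
    """Average similarity over all pairs of query images of one topic.

    `similarity(a, b)` is assumed to return the score of image b in the result
    list obtained when querying with image a (placeholder for a GIFT query).
    """
    pairs = list(combinations(topic_images, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```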
Fig. 1. The performance obtained by the GIFT configurations (medGIFT_vis_GIFT8 baseline, medGIFT_vis_sep_max, medGIFT_vis_neg1) for visual retrieval per topic
Fig. 2. The similarity among the images of one topic showing whether the images depict a similar topic or different aspects of the search topic
Case–Based Topics. In 2009, the MedGIFT group submitted 1 mixed run, 4 visual runs and 1 textual run for case–based topics. In total, 11 textual runs, 2 mixed runs and 5 visual runs were submitted for this task. In Table 2, the three best runs of other groups are shown; all of them were textual. The MedGIFT runs were the best for visual and mixed retrieval. Our best textual run used Lucene with its standard configuration (HES-SO-VS txt case). By combining the visual
run with a textual run (MedGIFT mix 0.5BySim EN), the MAP decreased significantly but slightly more relevant cases could be found.

Table 2. Results of the runs for the case–based retrieval topics

Run                         Run type  MAP     Bpref   P10   P30     num rel ret
ceb-cases-essie2-automatic  Textual   0.3355  0.2766  0.34  0.2267  74
sinai TA cbt                Textual   0.2626  0.2264  0.34  0.2267  89
aueb ipl                    Textual   0.1912  0.1252  0.24  0.1867  93
HES-SO-VS txt case          Textual   0.1906  0.1531  0.32  0.2     71
MedGIFT mix 0.5BySim EN     Mixed     0.0655  0.0488  0.14  0.0867  74
MedGIFT vis maxBySim AR     Visual    0.021   0.029   0.04  0.0533  41
MedGIFT vis sumBySim AR     Visual    0.019   0.026   0.06  0.0533  42
MedGIFT vis maxByFreq AR    Visual    0.0025  0.0035  0     0.0067  26
MedGIFT vis sumByFreq AR    Visual    0.0025  0.0035  0     0.0067  26

3.2 Medical Image Annotation
In the medical image annotation task, 6 groups submitted a total of 18 runs. Three of these runs were submitted by the MedGIFT group. Two runs used the same strategy as in the past 2 years:
– using GIFT to find a list of similar images;
– reordering the list by integrating the aspect ratio;
– using 5 nearest neighbors (5NN) to perform the classification for each axis by voting using descending weights.
Details can be found in the papers of ImageCLEF 2007 [12] and 2008 [13]. One run was submitted to test a SIFT–SVM approach. The standard Gaussian kernel was used for the SVMs. No optimizations of the SVMs were tried. As the results of the SIFT–SVM approach were not optimal, we used this run in combination with one of our standard runs for the submission. In both cases, the N most similar images were retrieved for each test image and then used for the classification. The results are shown in Table 3. Best results were obtained using GIFT–5NN as in the past years. Using a combination with SIFT–SVM gave worse results.

Table 3. Results of the runs submitted to the medical image annotation task

Run ID                           2005   2006   2007    2008    SUM
best system (TAU Biomed)         356    263    64.3    169.5   852.8
second best system (IDIAP)       393    260    67.23   178.93  899.16
GE GIFT8 AR0.2 vdca5 th0.5.run   618    507    190.73  317.53  1633.26
GE GIFT16 AR0.1 vdca5 th0.5.run  641    527    210.93  380.41  1759.34
GE GIFT8 SIFT commun.run         791.5  612.5  272.69  420.91  2097.6
Two groups (Biomed and IDIAP) submitted runs significantly outperforming all other techniques. Very similar techniques were used, as Biomed was inspired by IDIAP [14]. Their system uses the following approach:
– extract local features from a sub–set of images using random points;
– use k–means clustering to create a dictionary of visual words;
– sample each image with a denser grid and represent each image as a histogram of the visual words;
– train a classifier using SVMs with a χ2 kernel.
This approach has proven to obtain the best results for the past three years.
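The bag-of-visual-words pipeline listed above could be sketched as follows. This is a hedged illustration, not the TAU Biomed or IDIAP code: the local feature extractor is left as a placeholder, and the vocabulary size and the scikit-learn-based SVM setup are assumptions made for the example.

```python
# Hedged sketch of a bag-of-visual-words pipeline with a chi-squared SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def build_vocabulary(descriptor_sets, n_words=500):
    """k-means clustering of local descriptors into a dictionary of visual words."""
    all_descriptors = np.vstack(descriptor_sets)   # one row per local descriptor
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    """Represent one image as a normalized histogram of visual-word occurrences."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()

def train_chi2_svm(histograms, labels):
    """SVM on the visual-word histograms with a chi-squared kernel."""
    X = np.asarray(histograms)
    return SVC(kernel=chi2_kernel).fit(X, labels)  # callable kernel on histograms
```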
4 Conclusions
This paper summarizes the participation of the MedGIFT group in ImageCLEF 2009. The medical image retrieval and medical image annotation tasks were addressed. A preliminary analysis of our results for the medical retrieval task shows that visual retrieval is able to improve early precision. The overall performance (measured by MAP) of mixed–media runs relied highly on the performance of the textual run. Textual/visual run fusion strategies require further study, as currently the MAP of mixed runs is often lower than that of the corresponding textual run. Additional analyses were carried out to better understand the obtained results. Query performance of a topic is not directly related to the similarity among the images of the topic. There is still a big performance gap between textual and visual retrieval. Keywords are naturally linked to semantic topics, and thus for semantic topics text–based approaches perform much better, although even for the visual topics the text retrieval results are better. Using SVMs together with local features based on salient points was shown to obtain reasonable results but requires further optimization, as our results were far from those of the groups obtaining the best results.
Acknowledgments This study was partially supported by the Swiss National Science Foundation (Grant 200020–118638/1), the HES–SO (BeMeVIS), and the European Union in the 6th Framework Program through KnowARC (Grant IST 032691).
References 1. Clough, P., Müller, H., Deselaers, T., Grubinger, M., Lehmann, T.M., Jensen, J., Hersh, W.: The CLEF 2005 cross–language image retrieval track. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, pp. 535–557. Springer, Heidelberg (2006)
2. Clough, P., Müller, H., Sanderson, M.: The CLEF cross–language image retrieval track (ImageCLEF) 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 597–613. Springer, Heidelberg (2005) 3. Squire, D.M., Müller, W., Müller, H., Pun, T.: Content–based query of image databases: inspirations from text retrieval. In: Ersboll, B.K., Johansen, P. (eds.) Pattern Recognition Letters (Selected Papers from The 11th Scandinavian Conference on Image Analysis SCIA 1999), vol. 21(13-14), pp. 1193–1198 (2000) 4. Müller, H., Kalpathy-Cramer, J., Eggers, I., Bedrick, S., Said, R., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the 2009 medical image retrieval task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010) 5. Maisonnasse, L., Harrathi, F.: Analysis combination and pseudo relevance feedback in conceptual language model. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010) 6. Lana-Serrano, S., Villena-Román, J., González-Cristóbal, J.C.: MIRACLE at ImageCLEFmed 2009: Reevaluating strategies for automatic topic expansion. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 7. Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A., García-Cumbreras, M.A.: SINAI at ImageCLEF 2009 medical task. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 8. Simpson, M., Rahman, M.M., Demner-Fushman, D., Antani, S., Thoma, G.R.: Text– and content–based approaches to image retrieval for the ImageCLEF 2009 medical retrieval track. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 9. Boutsis, I., Kalamboukis, T.: Combined content–based and semantic image retrieval. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 10. Ye, Z., Huang, X., Lin, H.: Towards a better performance for medical image retrieval using an integrated approach. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 11. Berber, T., Alpkoçak, A.: DEU at ImageCLEFmed 2009: Evaluating re-ranking and integrated retrieval model. In: Working Notes of the 2009 CLEF Workshop, Corfu, Greece (September 2009) 12. Zhou, X., Depeursinge, A., Müller, H.: Hierarchical classification using a frequency–based weighting and simple visual features. Pattern Recognition Letters 29(15), 2011–2017 (2008) 13. Zhou, X., Gobeill, J., Müller, H.: The MedGIFT group at ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 712–718. Springer, Heidelberg (2009) 14. Avni, U., Goldberger, J., Greenspan, H.: TAU MIPLAB at ImageClef 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
An Extended Vector Space Model for Content-Based Image Retrieval Tolga Berber and Adil Alpkocak Dokuz Eylul University, Dept. of Computer Engineering, Tinaztepe Buca, 35160, Izmir, Turkey {tberber,alpkocak}@cs.deu.edu.tr
Abstract. This paper describes the participation of Dokuz Eylul University in the ImageCLEF 2009 medical retrieval task. This year, we proposed a new model for content-based image retrieval combining both textual and visual information in the same space. It simply extends the traditional vector space model of text retrieval with visual terms. The proposed model also helps to narrow the semantic gap problem of content-based image retrieval. Experiments showed that our proposed system improves the performance of textual retrieval methods by adding visual terms. The proposed method was evaluated on the ImageCLEFmed 2009 dataset and it achieved the best performance among the participants in automatic mixed retrieval including both text and visual features. Keywords: Content-based Image Retrieval, Vector Space Model, Semantic Gap, Visual Terms.
1 Introduction
Content-Based Image Retrieval (CBIR) systems aim to use image content to search and retrieve images from image collections. Instead of precise pixel-to-pixel matching, CBIR systems use low-level image features such as color distribution, texture, shape, etc. to define image content. Because low-level features are insufficient to define the semantics of an image, the result sets generated by CBIR systems may not satisfy the user's information need. This problem is called the semantic gap problem. The model we present in this paper is an integrated retrieval system which extends the well-known textual information retrieval technique with visual terms. The proposed model aims to narrow the semantic gap by helping to map low-level features to high-level textual semantic concepts. Moreover, this combination of the textual and visual modalities into one model also makes it possible to query a textual database with visual content or a visual database with textual content. Consequently, images can also be defined with semantic concepts instead of low-level features. The rest of this paper is structured as follows. In Section 2, our proposal is described. Section 3 contains the experimental results on the ImageCLEFmed 2009 dataset. Finally, Section 4 concludes the paper and gives a look at possible future work on this subject.
2 The Integrated Retrieval Model
The proposed model is an extension of the classical vector space model (VSM) that focuses on integrating both textual and visual features into one model, so it is called the Integrated Retrieval Model (IRM). We also aim to narrow the semantic gap of content-based image retrieval by mapping low-level visual features to the classical VSM, where a document is represented as a vector of terms. Hence, a document repository D becomes a sparse matrix whose rows are document vectors and whose columns are term vectors, as follows:
D = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{bmatrix} \quad (1)

where w_{i,j} is the weight of term j in document i, and n and m are the term and document counts, respectively. The literature proposes a large number of weighting schemes [1][2][3]. We used the pivoted unique term weighting scheme proposed in [4] and [5]. The system modifies the D matrix (Eq. 1) by adding visual terms representing the visual content. Formally, the document-term matrix becomes:
D = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} & i_{1,n+1} & i_{1,n+2} & \cdots & i_{1,n+k} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} & i_{m,n+1} & i_{m,n+2} & \cdots & i_{m,n+k} \end{bmatrix} \quad (2)
where i_{i,j} is the weight of visual term j in document i, and k is the number of visual terms. Visual and textual features are normalized independently. In sum, IRM extends the traditional text-based VSM with visual features. Initially, we used two simple visual terms representing the color information of the image. The first visual term counts the number of gray pixels and thus represents the proportion of the image that is grayscale; the second visual term is its complement, i.e., the proportion of color pixels in the image.
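As an illustration only, the two visual terms could be computed as in the sketch below; the use of Pillow/NumPy and the tolerance used to decide that a pixel is "gray" are assumptions, not the authors' implementation.

```python
# Hedged sketch: fraction of (near-)gray pixels and its complement as the two
# visual terms. Library choice and the gray tolerance are assumptions.
import numpy as np
from PIL import Image

def gray_ratio_terms(path, tolerance=8):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=int)
    spread = rgb.max(axis=2) - rgb.min(axis=2)   # 0 for perfectly gray pixels
    gray_fraction = float((spread <= tolerance).mean())
    return gray_fraction, 1.0 - gray_fraction    # (grayscale term, color term)
```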
3 Experimentations
In order to evaluate the proposed method, we conducted five runs with the ImageCLEFmed 2009 dataset [6]; however, we present only two of them here. We preprocessed all 74902 documents, each consisting of the combination of title and captions. First, all documents were converted into lowercase. All numbers and some punctuation characters like the dash (-) and apostrophe (') were removed. However, some of the non-letter characters like the comma (,) and slash (/) were replaced with a space. This is because the dash character conveys an important role, as in X-Ray and T3-MR. Then, we chose the words surrounded by spaces as index terms. For each image in the data set, we added the two visual terms described in the previous section. Altogether, the total number of indexing terms became 33615. After the preprocessing phase, we implemented text-only retrieval on the dataset. Here, we normalized the text term weights as shown in [2], and we simply calculated the dot
product of the query and document vectors as the similarity function. Then, the top 1000 documents with the highest similarity scores were selected as the result set for each query. The first row of Table 1, whose run identifier is deu_baseline, shows the results we obtained from this experiment. This run was ranked in 16th position, since we used a very simple retrieval method without any enhancement.

Table 1. Results of experimentations with the ImageCLEFmed 2009 dataset

Run Identifier  NumRel  RelRet  MAP    P@5    P@10   P@30   P@100  Rank  Run Type
deu_baseline    2362    1742    0.339  0.584  0.520  0.448  0.303  16    Text
deu_IRM         2362    1754    0.368  0.632  0.544  0.483  0.324  1     Mixed
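The ranking step described before Table 1 amounts to a dot product between the query vector and each document vector over the textual terms plus the visual terms; the minimal sketch below assumes dense NumPy vectors, which is a simplification of an actual sparse index.

```python
# Hedged sketch of dot-product ranking over the extended (textual + visual)
# document vectors; dense matrices are an assumption made for brevity.
import numpy as np

def rank_documents(doc_matrix, query_vector, top_k=1000):
    """doc_matrix: (m documents x n+k terms); query_vector: (n+k,)."""
    scores = doc_matrix @ query_vector           # dot-product similarity
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```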
The second experiment concerns the integrated retrieval method, which adds the two visual terms to the previous experiment. The results are shown in the second row of Table 1. These results show that IRM performed better than our baseline retrieval in all measures. Furthermore, this performance gain was obtained by using a simple visual feature. Figure 1 illustrates the precision and recall values of our experiments. IRM outperformed the classical vector space model with respect to recall at all precision levels. These results show that combining textual retrieval techniques with good visual features positively affects the results and improves the system performance.
Fig. 1. Precision-Recall graph of baseline and IRM runs
4 Discussion and Future Work
In this paper, we proposed a new content-based image retrieval model combining both the visual and textual features of a document in the same model. This model may thus help to bridge the semantic gap problem of content-based image retrieval systems. We evaluated the proposed approach on the ImageCLEFmed 2009 dataset and our method
achieved the best performance among the participants in the mixed automatic run track. Experimental results showed that the proposed method performs better than any other automatic mixed retrieval approach even when a simple visual feature is used. The integrated retrieval model is a starting point, and the ultimate goal of the system is to close the semantic gap in visual information retrieval systems. It is promising that the usage of a simple visual term gives better results than most of the textual models. It is also expected that the addition of new visual features to the model will improve system performance.
Acknowledgements This work is supported by Turkish National Science Foundation (TÜBİTAK) under project number 107E217.
References 1. Amati, G.: Probabilistic models for information retrieval based on divergence from randomness. PhD thesis, Department of Computing Science, University of Glasgow (2003) 2. Hancock-Beaulieu, M., Gatford, M., Huang, X., Robertson, S.E., Walker, S., Williams, P.W.: Okapi at TREC-5. In: Text REtrieval Conference (TREC) TREC-5 Proceedings (1996) 3. Zheng, Z., Metzler, D., Zhang, R., Yi, C., Nie, J.: Search result re-ranking by feedback control adjustment for time-sensitive query. In: Proceedings of North American Chapter of the Association for Computational Linguistics - Human Language Technologies (2009) 4. Singhal, A., Buckley, C., Mitra, M.: Pivoted document length normalization. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21–29. ACM, New York (1996) 5. Chisholm, E., Kolda, T.G.: New term weighting formulas for the vector space method in information retrieval. Technical report (1999) 6. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the ImageCLEF 2009 Medical Image Retrieval Track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
Using Media Fusion and Domain Dimensions to Improve Precision in Medical Image Retrieval Saïd Radhouani, Jayashree Kalpathy-Cramer, Steven Bedrick, Brian Bakke, and William Hersh Department of Medical Informatics & Clinical Epidemiology Oregon Health and Science University (OHSU) Portland, OR, USA radhouan@ohsu.edu
Abstract. In this paper, we focus on improving retrieval performance, especially early precision, in the task of solving medical multimodal queries. The queries we deal with consist of a visual component, given as a set of image-examples, and textual annotation, provided as a set of words. The queries’ semantic content can be classified along three domain dimensions: anatomy, pathology, and modality. To solve these queries, we interpret their semantic content using both textual and visual data. Medical images often are accompanied by textual annotations, which in turn typically include explicit mention of their image’s anatomy or pathology. Annotations rarely include explicit mention of image modality, however. To address this, we use an image’s visual features to identify its modality. Our system thereby performs image retrieval by combining purely visual information about an image with information derived from its textual annotations. In order to experimentally evaluate our approach, we performed a set of experiments using the 2009 ImageCLEFmed collection using our integrated system as well as a purely textual retrieval system. Our integrated approach consistently outperformed our text-only system by 43% in MAP and by 71% in precision within the top 5 retrieved documents. We conclude that this improved performance is due to our method of combining visual and textual features. Keywords: Medical Image Retrieval, Performance Evaluation, Image Classification, Image Modality Extraction, Domain Dimensions, Media Fusion.
1 Introduction
Advances in digital imaging technologies and the increasing prevalence of Picture Archival and Communication Systems (PACS) have led to a substantial growth in the number of digital images stored in hospitals and medical systems in recent years. Medical images can form an essential component of a patient's health record, and the ability to retrieve them is useful to a variety of important clinical tasks, including diagnosis, education, and research.
Image retrieval systems (IRS) do not currently perform as well as their text counterparts [1]. Medical and other IRS have historically relied on indexing annotations or captions associated with the images. The last decade, however, has seen advancements in the area of content-based image retrieval (CBIR) [2,3]. Although CBIR systems have demonstrated success in fairly constrained medical domains (e.g., dermatology, chest x-rays, lung CT images, etc.), they have demonstrated poor performance when applied to databases with a wide spectrum of imaging modalities, anatomies, and pathologies [1,4,5,6]. In this paper, we address the problem of solving medical multimodal queries. We do this by focusing on improving early precision (i.e., precision at top 5 documents), as early precision is believed to be important to many users of search systems [7]. The queries we deal with contain visual data, given as a set of image-examples, as well as textual data in the form of a set of words that can be categorized along three dimensions: anatomy, pathology, and modality. In previous work, we have shown that it is possible to automatically identify the modality of a medical image using purely visual means [8]. However, it can be challenging to identify the anatomy or the pathology (e.g., slight fracture of a femur) relying solely on visual analysis of an image, particularly in unconstrained settings. On the other hand, it is often relatively straightforward to identify an image's anatomy and pathology using textual analysis of the image's accompanying annotations, but these same text-based approaches often fail to correctly identify an image's modality. This is due to the fact that while image annotations frequently include explicit mention of the image's anatomical or pathological features, the annotations often leave the modality unsaid. To overcome these problems and therefore improve retrieval performance, especially early precision, we propose to merge the results of textual and visual search techniques [9,8,10]. In this paper, we first present a brief description of our system (Section 2). We next describe the evaluation experiments we performed using our visual-based search technique and text-based search technique (Section 3). We then discuss the obtained results in Section 4. Finally, we conclude this paper and provide some perspectives (Section 5).
2 Adaptive Medical Image Retrieval System
Starting in 2007, we created and have continued to develop a multimodal image retrieval system based on an open-source framework that allows the incorporation of user search preferences. We designed a flexible database schema that allows us to easily incorporate new collections while facilitating retrieval using both text and visual techniques. We used the Ruby programming language, with the open source Ruby on Rails (http://www.rubyonrails.org/) web application framework. To manage the mappings between the images
and their associated metadata, we used the PostgreSQL (http://www.postgresql.org/) relational database system. Our system also indexes the title and caption fields for each image. Our system features a customizable query parser which allows the user to choose between several options to improve the precision of the search. In addition to stop word removal and stemming, our parser can identify a query's desired search modality, if specified. The results can then be filtered to only return images whose purported modality matches that of the query. The system is also linked to the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus; the user may choose to perform manual or automatic query expansion using synonyms generated from the Metathesaurus. Medical images often begin life with rich metadata in the form of DICOM headers describing their imaging modality or anatomy. However, since most teaching or on-line image collections are made up of compressed standalone JPEG files, it is very common for medical images to exist "in the wild" sans metadata. In previous work [8], we described a modality classifier that could identify the imaging modality for medical images using supervised machine learning. We extended that work to the new dataset used for ImageCLEF 2009. One of the biggest challenges in creating such a modality classifier is creating a labeled training dataset of sufficient size and quality. In 2009, as in 2008, we used a regular-expression based text parser, written in Ruby, to extract the modality from the image captions for all images in the collection, when available. Images where a unique modality could be identified based on the caption were used for training the modality classifier. Our text processing module is very simple. After applying a stop-word list, we index documents and queries using the Vector Space Model (VSM). During the retrieval process, a list of relevant documents can be ranked with regard to their relevance to the corresponding query. While no specific treatment was applied to documents, we theorized that it would be useful to use external knowledge to interpret the queries' semantic content. Indeed, each query contains a precise description of a user need materialized by a set of words belonging to three semantic categories: modality of the image (e.g., MRI, x-ray, etc.), anatomy (e.g., leg, head, etc.), and pathology (e.g., cancer, fracture, etc.). We call these categories "domain dimensions" and define them as follows: "A dimension of a domain is a concept used to express the themes in this domain" [11]. The idea behind our approach is that, in a given domain, a theme can be developed with reference to a set of dimensions of this domain. For instance, a physician wishing to write a report about a medical image first focuses on a domain ("medicine"). Next, they refer to specific dimensions of this domain (e.g., "anatomy"), then choose words from this dimension (e.g., "femur"), and finally write their report. In order to solve multimodal medical queries, we proposed using the domain dimensions to interpret their semantic content. To do so, we first needed to define the dimensions. For this purpose, we used external resources, such as ontologies or thesauri, to define our dimensions as hierarchies of concepts. Every
concept is denoted by a set of words. Our system was then able to use these dimensions to extract query semantics by attempting to identify, for each query, what dimensions may or may not be present. Then, our system uses boolean operators to combine the extracted query dimensions and thereby constrain the raw search results. Our query process contains two main steps. The first step consists of using the initial user-supplied query text to search for documents based on the VSM. The result of this step is a list of ranked documents D. The second step consists of selecting, from D, those documents that satisfy the Boolean expression formulated based on the domain dimensions.
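A minimal sketch of this two-step process is given below; the representation of the dimension vocabularies and the helper names are assumptions made for illustration and are not the authors' implementation.

```python
# Hedged sketch: VSM ranking followed by a Boolean filter over the domain
# dimensions extracted from the query. Data structures are assumptions.
def dimension_filter(ranked_docs, query_dimensions, doc_terms):
    """Keep only documents whose indexed words cover every dimension in the query.

    ranked_docs: list of (doc_id, score) from the VSM step, best first.
    query_dimensions: e.g. {"anatomy": {"femur"}, "pathology": {"fracture"}}.
    doc_terms: dict doc_id -> set of indexed words.
    """
    filtered = []
    for doc_id, score in ranked_docs:
        terms = doc_terms[doc_id]
        if all(terms & words for words in query_dimensions.values() if words):
            filtered.append((doc_id, score))
    return filtered
```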
3 Experimental Evaluation
In order to experimentally evaluate our approach, we performed a set of experiments using the 2009 ImageCLEF collection, which consists of 74,902 medical images together with textual annotations for each one [12]. The following sections describe these experiments and their results. In total, 9 runs were performed; all of them are automatic; two of them were based entirely on textual data and the remaining eight are based on both textual and visual data. Our baseline run, labeled "no mod," is based on the VSM, where each document/query is represented by a vector of words. The result of this run will be compared to those obtained by the other runs, which are based on domain dimensions and/or a combination of textual and visual data.
3.1 Modality Extraction-Based Experiences
We performed two mixed runs that used the automatically extracted modality to filter results. The custom query parser first extracted the desired modality from the query, if it existed. The “mod1” run used the custom parser to remove stop-words from the query and limit the results to the desired modality. This run was expected to have high precision but potentially lower recall as it did not
Fig. 1. Improvement in early precision with the use of modality classification (precision at P5, P15, P30, P100 and P200 for the baseline run and the modality filtered run)
use any term expansion. Also, if the modality classifier was not accurate or the modality extraction from the textual query was too strict, the results could be limited. In order to try to increase the recall, we also performed a run, labeled "umls," where term expansion based on the UMLS Metathesaurus was used. As can be seen in Figure 1, there is an improvement, especially in early precision, with the use of modality filtration.
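The modality extraction discussed above was done with a regular-expression parser written in Ruby; the Python sketch below only illustrates the idea, and its pattern list and modality names are assumptions rather than the original rules.

```python
# Hedged sketch of regular-expression modality extraction from captions or
# query text; the patterns below are illustrative, not the original Ruby rules.
import re

MODALITY_PATTERNS = {
    "ct":         r"\b(ct|computed tomograph\w*)\b",
    "mri":        r"\b(mri?|magnetic resonance)\b",
    "x-ray":      r"\b(x[- ]?ray|radiograph\w*)\b",
    "ultrasound": r"\b(ultrasound|sonograph\w*)\b",
}

def extract_modality(text):
    """Return the single modality mentioned in `text`, or None if absent/ambiguous."""
    found = {name for name, pattern in MODALITY_PATTERNS.items()
             if re.search(pattern, text.lower())}
    return found.pop() if len(found) == 1 else None
```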
3.2 Dimensions-Based Experiences
To define the domain dimensions, we utilized the UMLS Metathesaurus and its database of semantic types. The types we ultimately included were as follows:
– Anatomy: "Body Part, Organ, or Organ Component," "Body Space or Junction," "Body Location or Region," and "Cell"
– Pathology: "Sign or Symptom," "Finding," "Pathologic Function," "Injury or Poisoning," "Disease or Syndrome," "Neoplastic Process," "Neoplasms," "Anatomical Abnormality," "Congenital Abnormality," and "Acquired Abnormality"
– Modality: "Manufactured Object" and "Diagnostic Procedure"
After automatically extracting the dimensions from each query, we used them to perform the following runs:
R1: We used the dimensions to rank the retrieved documents for a given query into three tiers. Documents that contained the three dimensions belonging to the query were considered to be most relevant, and were therefore highly ranked and placed in the first result tier. These were followed by those documents that were only missing the modality. Finally, we placed documents that contain at least one of the query dimensions in the third tier. Since the modality was not always explicitly described in the text, we used the visual data to extract it from each image. Thereafter, we used it to re-rank the document list obtained using the textual data. For example, documents whose images had an extracted modality were ranked at the top of the results for queries that specified a modality. In all the following runs, we applied this technique using the result of the "mod1" run. For simplicity of writing, we call this process "checking modality."
R2: Re-rank the document list obtained during run R1 by "checking modality."
R3: Re-rank the document list obtained during run "no mod" by "checking modality."
R4: Select, for each query, only documents that contain the Anatomy and the Pathology dimensions relevant to the query. The obtained results are re-ranked by "checking modality."
R5: From the result of the "R2" run, we randomly selected one image from each article. In CLEF documents, a textual article might contain more than one image (as in, e.g., the case of multiple figures being in the same published paper). In this run, if an article is retrieved by the textual approach, we randomly select one of its images (instead of keeping all images as we had done for the other runs).
R6: We selected from the result of the "R1" run only those documents for which a modality had been extracted during the "mod1" run.
4 Results
For each query, the results are measured by the mean average precision (MAP) of the top 1000 documents, precision at 10 documents (p@10), and precision at 5 documents (p@5). The results given by the baseline run are 0.1223 (MAP), 0.416 (p@10), and 0.38 (p@5). The obtained results are presented in Table 1, where rows correspond to the runs and values correspond to their results. We also include data from the best runs (based on MAP and p@5) in the ImageCLEF 2009 campaign.

Table 1. Results of our experimental evaluation

Run Name     MAP     P@5    P@10
umls         0.1753  0.712  0.664
mod1         0.1698  0.592  0.552
R1           0.1756  0.592  0.536
R2           0.1582  0.624  0.54
R3           0.1511  0.608  0.524
R4           0.1147  0.6    0.484
R5           0.1133  0.584  0.516
R6           0.1646  0.68   0.612
no mod       0.1223  0.416  0.38
Best in p@5  0.3775  0.744  0.716
Best in MAP  0.4293  0.696  0.664
As described above, our system has been designed to maximize precision, often at the expense of recall. Since we do not use any advanced natural language processing, and we filter images based on purported modality, we were expecting a relatively low recall. Consequently, as the MAP is highly dependent on and limited by recall, we believe that it makes more sense to compare our results to those obtained by the other ImageCLEF 2009 participants in terms of our early precision levels (p@5 or p@10). We have generally high early precision levels, in particular in our "umls" run (p@5 = 0.712). This was the second best p@5 result obtained during the ImageCLEF 2009 campaign; the absolute best was 0.744 and was obtained from an interactive run. Therefore, our "umls" run had the highest fully-automatic p@5 in ImageCLEF 2009. Independent of the other ImageCLEF 2009 participants, most of our runs outperform our baseline, reaching an improvement of 43% in MAP (R1) and 71% in p@5 (umls). From the result of the "R1" run, we notice that the use of domain dimensions is of great interest in solving medical multimodal queries. Indeed, by using domain dimensions, we highlight the "relevant words" that describe the queries' semantic content. Using these words, the system can retrieve only
documents that contain the anatomy, the modality, and the pathology described in the query text. We notice that at p@5 and p@10, all our mixed runs outperform our baseline. This is not surprising, because the first ranked documents are those that have been retrieved both by the text-based search technique and the visual-based search technique. This supports conclusions drawn from our previous work, which found that the retrieval performance can be improved demonstrably by merging the results of textual and visual search techniques [9,8,10]. Two of our runs displayed decreased MAP when compared with our baseline. The first of these was the "R4" run, wherein the modality dimension was ignored during the querying process. This decrease might be explained by the fact that the modality is described in some documents, and its use is thought to be beneficial; for example, consider our "R1" run, wherein we used the three dimensions and obtained our highest performance. The second lower-than-baseline run was our "R5" run, in which only one image was randomly selected from each article. An increase in the performance would have been surprising, since this technique is not accurate at all, and there is a high risk that the selected image is irrelevant. Indeed, it is better to keep all images of each article; thus, if one of them is relevant to the corresponding query, it will be retrieved.
5 Conclusions
In order to improve early precision in the task of solving medical multimodal queries, we combined a text-based search technique with a visual-based one. The first technique consisted of using domain dimensions to highlight relevant words that describe the queries' semantic content. While anatomy and pathology are relatively easy to identify from textual documents, it can be extremely difficult, if not impossible, to identify an image's modality using solely textual features. We therefore experienced the best results when we combined our textual techniques with a visual-based search technique that automatically extracts modality information from images visually. The obtained results in terms of precision at p@5 and p@10 are very encouraging and outperform our baseline. Among the ImageCLEF 2009 participants, we obtained the second best overall and best automatic result in terms of precision at p@5. However, in terms of MAP, even though our results outperform our baseline, they are significantly below the best performance obtained in ImageCLEF 2009. This was expected, since this year our sole focus was on improving early precision, and we did not use any sophisticated natural language processing to improve our system's recall. Our future work will attempt to address this issue. We believe that our current text-based search technique has significant room for improvement. We plan to use further textual processing, such as term expansion and pseudo-relevance feedback, in order to improve our recall, and hope to be able to compare our results to the best ImageCLEF 2009 performance.
Acknowledgements We acknowledge the support of NLM Training Grant 2T15LM007088, NLM Grant 1K99LM009889-01A1, NSF Grant ITR-0325160, and the Swiss National Science Foundation grant PBGE22-121204.
References 1. Hersh, W.R., Müller, H., Jensen, J.R., Yang, J., Gorman, P.N., Ruch, P.: Advancing biomedical image retrieval: Development and analysis of a test collection. J. Am. Med. Inform. Assoc., M2082 (June 2006) 2. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000) 3. Tagare, H.D., Jaffe, C.C., Duncan, J.: Medical image databases: A content-based retrieval approach. J. Am. Med. Inform. Assoc. 4(3), 184–198 (1997) 4. Aisen, A.M., Broderick, L.S., Winer-Muram, H., Brodley, C.E., Kak, A.C., Pavlopoulou, C., Dy, J., Shyu, C.R., Marchiori, A.: Automated storage and retrieval of thin-section CT images to assist diagnosis: System description and preliminary assessment. Radiology 228(1), 265–270 (2003) 5. Schmid-Saugeon, P., Guillod, J., Thiran, J.P.: Towards a computer-aided diagnosis system for pigmented skin lesions. Computerized Medical Imaging and Graphics: The Official Journal of the Computerized Medical Imaging Society 27(1), 65–78 (2003) 6. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. International Journal of Medical Informatics 73(1), 1–23 (2004) 7. Jansen, B., Spink, A.: How are we searching the world wide web? A comparison of nine search engines. Information Processing and Management 42(1), 248–263 (2006) 8. Kalpathy-Cramer, J., Hersh, W.: Automatic image modality based classification and annotation to improve medical image retrieval. Studies in Health Technology and Informatics 129(Pt 2), 1334–1338 (2007) 9. Hersh, W., Kalpathy-Cramer, J., Jensen, J.: Medical image retrieval and automated annotation: OHSU at ImageCLEF 2006. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 660–669. Springer, Heidelberg (2007) 10. Radhouani, S., Lim, J.H., Chevallet, J.P., Falquet, G.: Combining textual and visual ontologies to solve medical multimodal queries. In: IEEE International Conference on Multimedia and Expo, pp. 1853–1856 (2006) 11. Radhouani, S.: Un modèle de recherche d'information orienté précision fondé sur les dimensions de domaine. PhD thesis, University of Geneva, Switzerland, and University of Grenoble, France (2008) 12. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E.: Overview of the medical retrieval task at ImageCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
ImageCLEF 2009 Medical Image Annotation Task: PCTs for Hierarchical Multi-Label Classification Ivica Dimitrovski, Dragi Kocev, Suzana Loskovska, and Sašo Džeroski
Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia; Department of Computer Science, Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia {ivicad,suze}@feit.ukim.edu.mk, {dragi.kocev,saso.dzeroski}@ijs.si
Abstract. In this paper, we describe an approach to the automatic medical image annotation task of the 2009 CLEF cross-language image retrieval campaign (ImageCLEF). This work focuses on the process of feature extraction from radiological images and their hierarchical multi-label classification. To extract features from the images we use two different techniques: edge histogram descriptor (EHD) and Scale Invariant Feature Transform (SIFT) histogram. To annotate the images, we use predictive clustering trees (PCTs) which are able to handle target concepts that are organized in a hierarchy, i.e., perform hierarchical multi-label classification. Furthermore, we construct ensembles (Bagging and Random Forests) that use PCTs as base classifiers: this improves the predictive/classification performance.
1 Introduction
The amount of medical images produced is constantly growing. Manual description and annotation of each image is time consuming, expensive and impractical. This calls for the development of image annotation algorithms that can perform the task reliably. Automatic annotation classifies an image into one of a set of classes. If the classes are organized in a hierarchy and several of them can be assigned to an image, we are talking about hierarchical multi-label classification (HMLC). This paper describes our approach to the medical image annotation task of ImageCLEF 2009 (for details see [1]). The objective of this task is to provide the IRMA (Image Retrieval in Medical Applications) code [2] for each image of a given set of previously unseen medical (radiological) images. The IRMA coding system consists of four axes: technical axis (T, image modality), directional axis (D, body orientation), anatomical axis (A, body region examined) and biological axis (B, biological system examined). The database of medical images contains 12677 fully annotated radiographs (training dataset for the classifier) and 1733 testing images without labels. The annotation should be performed by using the four different annotation label sets (the competitions from 2005-2008) in turn. The code is strictly hierarchical because each sub-code element is connected to only one code element. This characteristic of the IRMA code allows us to exploit the
code hierarchy and construct an automatic annotation system based on predictive clustering trees for hierarchical multi-label classification [3]. This approach is directly applicable to the datasets of ImageCLEF2007 and ImageCLEF2008, where the images were labeled according to the IRMA code scheme. To apply the same algorithm to the ImageCLEF2005 and ImageCLEF2006 datasets, we mapped the class numbers to the corresponding IRMA codes. Some images from the ImageCLEF2005 dataset can belong to more than one IRMA code. In the classification process, we use the most general IRMA code (that contains 0) to describe these images. Automatic image classification/annotation relies on numerical features that are computed from the image pixel values. In our approach, we use an edge histogram descriptor (to extract the global features of the images) and a SIFT histogram (to extract the local features from the images). We combine the feature vectors (histograms) with simple concatenation in a single vector with 2080 features. The purpose of the concatenation of the global and the local features is to tackle the problem of intra-class variability vs. inter-class similarity and the different distribution of images between the training and the testing dataset (the testing dataset contains many images of some classes that are under-represented in the training set). Tomassi et al. [4] show that high and mid level combination of the different feature extraction techniques yields better results when SVMs are used as classifiers. In our work, we use ensembles of predictive clustering trees [3,5]. The ensembles of trees, such as random forests, can effectively exploit the information provided by the large number of features. Thus, we expect that concatenation of the feature extraction techniques yields better performance than the other combination methods. The remainder of the paper is organized as follows: Section 2 describes the techniques for feature extraction from images. Section 3 introduces predictive clustering trees and their use for HMLC. In Section 4, we explain the experimental setup. Section 5 reports the obtained results. Conclusions and a summary are given in Section 6, where we also discuss some directions for further work.
2 Feature Extraction from Images
This section describes the techniques for feature extraction from images that we use to describe the X-ray images from ImageCLEF 2009. We briefly describe the edge histogram descriptor and the scale invariant feature transform. To learn a classifier and to annotate the images from the testing set, we use the feature vector obtained by simple concatenation of the features obtained from these two techniques.
Edge Histogram Descriptor: Edge detection is a fundamental problem of computer vision and has been widely investigated [6]. The goal of edge detection is to mark the points in a digital image at which the luminous intensity changes sharply. An edge representation of an image drastically reduces the amount of data to be processed, yet it retains important information about the shapes of objects in the scene. Edges in images constitute important features to represent their content. One way of representing important edge features is to use a histogram. An edge histogram in the image space represents the frequency and the directionality of the brightness changes in the image. To represent it, MPEG-7 contains edge histogram
descriptors (EHD). These basically represent the distribution of five types of edges in each local area called a sub-image. The sub-images are defined by dividing the image space into 4×4 non-overlapping blocks. Thus, the image partition always yields 16 equal-sized sub-images, regardless of the size of the original image. To characterize the sub-images, we then generate a histogram of edge distribution for each sub-image. Edges in the sub-images are categorized into five types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges. Thus, the histogram for each sub-image represents the relative frequency of occurrence of the five types of edges in the corresponding sub-image. As a result, each local histogram contains five bins. Each bin corresponds to one of the five edge types. Since there are 16 sub-images in the image, a total of 5×16=80 histogram bins are required. Note that each of the 80 histogram bins has its own semantics in terms of location and edge type. Edge detection is performed using the Canny edge detection algorithm [7].
SIFT histogram: Many different techniques for detecting and describing local image regions have been developed [8]. The Scale Invariant Feature Transform (SIFT) was proposed as a method of extracting and describing key-points which are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint [8]. For content based image retrieval, good response times are required and this is hard to achieve when using the huge amount of data contained in the local feature descriptors. The descriptors using local features can be extremely big because an image may contain many key-points, each described by a 128 dimensional vector. To reduce the descriptor size, we use histograms of local features [9]. With this approach, the amount of data is reduced by estimating the distribution of local feature values for every image. The creation of these histograms is a three step procedure. First, the key-points are extracted from all database images, where a key-point is described with a 128 dimensional vector of numerical values. For the key-point extraction and descriptor calculation, we use the default parameters proposed by Lowe [8]. The key-points are clustered into 2000 clusters using k-means. Afterwards, for each key-point we discard all information except the identifier of the most similar cluster center. A histogram of the occurring patch-cluster identifiers is created for each image. To be independent of the total number of key-points in an image, the histogram bins are normalized to sum to 1. This results in a 2000 dimensional histogram.
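A simplified sketch of such an 80-bin edge histogram is shown below. It is only an illustration: the 2x2 filter masks, the block size and the edge threshold are assumptions and do not reproduce the exact MPEG-7 reference implementation, nor the Canny-based edge detection used by the authors.

```python
# Hedged sketch of an MPEG-7-style 80-bin (4x4 sub-images x 5 edge types)
# edge histogram; masks, block size and threshold are illustrative assumptions.
import numpy as np

EDGE_FILTERS = {                                       # 2x2 masks: vertical,
    0: np.array([[1.0, -1.0], [1.0, -1.0]]),           # horizontal, 45-degree,
    1: np.array([[1.0, 1.0], [-1.0, -1.0]]),           # 135-degree, and
    2: np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),   # non-directional
    3: np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),
    4: np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def edge_histogram(gray, block=8, threshold=11.0):
    """gray: 2-D intensity array; returns a normalized 80-bin (4x4x5) descriptor."""
    h, w = gray.shape
    hist = np.zeros((4, 4, 5))
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            b = gray[i:i + block, j:j + block].astype(float)
            half = block // 2
            means = np.array([[b[:half, :half].mean(), b[:half, half:].mean()],
                              [b[half:, :half].mean(), b[half:, half:].mean()]])
            strengths = {k: abs((f * means).sum()) for k, f in EDGE_FILTERS.items()}
            edge_type, strength = max(strengths.items(), key=lambda kv: kv[1])
            if strength >= threshold:                  # count the dominant edge type
                hist[min(4 * i // h, 3), min(4 * j // w, 3), edge_type] += 1
    total = hist.sum()
    return (hist / total).ravel() if total else hist.ravel()
```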
3 Ensembles of PCTs
In this section, we discuss the approach we use to classify the data at hand. We briefly describe the predictive clustering trees (PCT) framework, its use for HMLC and the learning of ensembles.
PCTs for Hierarchical Multi-Label Classification: In the PCT framework [5], a tree is viewed as a hierarchy of clusters: the top node corresponds to one cluster containing all data, which is recursively partitioned into smaller clusters while moving down the tree. PCTs can be constructed with a standard “top-down induction of
decision trees” (TDIDT) algorithm. The heuristic for selecting the tests is the reduction in variance caused by partitioning the instances. Maximizing the variance reduction maximizes cluster homogeneity and improves predictive performance. A leaf of a PCT is labeled with/predicts the prototype of the set of examples belonging to it. With instantiation of the variance and prototype functions, the PCTs can handle different types of data, e.g., multiple targets [10] or time series [11]. A detailed description of the PCT framework can be found in [5]. To apply PCTs to the task of HMLC, the example labels are represented as vectors with Boolean components. The i-th component of the vector is 1 if the example belongs to class c_i and 0 otherwise (see Fig. 1). The variance of a set of examples S is defined as the average squared distance between each example's label v_i and the mean label \overline{v} of the set, i.e.,

Var(S) = \frac{\sum_i d(v_i, \overline{v})^2}{|S|} \quad (1)
The higher levels of the hierarchy are more important: an error in the upper levels costs more than an error on the lower levels. Considering that, a weighted Euclidean distance is used as the distance measure:

d(v_1, v_2) = \sqrt{\sum_i w(c_i)\,(v_{1,i} - v_{2,i})^2} \quad (2)
where v_{k,i} is the i-th component of the class vector v_k of an instance x_k, and the class weights w(c) decrease with the depth of the class in the hierarchy. In the case of HMLC, the notion of majority class does not apply in a straightforward manner. Each leaf in the tree stores the mean \overline{v} of the vectors of the examples that are sorted into that leaf. The i-th component \overline{v}_i of \overline{v} is the proportion of examples in the leaf that belong to class c_i. An example arriving in the leaf can be predicted to belong to class c_i if \overline{v}_i is above some threshold t_i. The threshold can be chosen by a domain expert. A detailed description of PCTs for HMLC can be found in [3].
Fig. 1. A toy hierarchy. Class label names reflect the position in the hierarchy, e.g., ‘2/1’ is a subclass of ‘2’. The set of classes {1, 2, 2/2} is indicated in bold in the hierarchy and is represented as a vector.
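The weighted distance and variance of Eqs. (1) and (2) can be written down directly, as in the sketch below; the exponential decay of the class weights with depth and its decay factor are assumptions made for illustration (the text above only states that the weights decrease with depth).

```python
# Hedged sketch of the HMLC variance (Eq. 1) and weighted Euclidean distance
# (Eq. 2); the weight decay w(c) = w0 ** depth(c) is an illustrative choice.
import numpy as np

def class_weights(depths, w0=0.75):
    """One weight per class, decreasing with the depth of the class in the hierarchy."""
    return np.array([w0 ** d for d in depths])

def weighted_distance(v1, v2, weights):
    diff = np.asarray(v1, float) - np.asarray(v2, float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

def label_variance(label_vectors, weights):
    """Average squared weighted distance of each 0/1 label vector to the set mean."""
    labels = np.asarray(label_vectors, float)
    mean = labels.mean(axis=0)
    return float(np.mean([weighted_distance(v, mean, weights) ** 2 for v in labels]))
```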
Ensemble Methods: An ensemble classifier is a set of classifiers. Each new example is classified by combining the predictions of every classifier from the ensemble.
These predictions can be combined by taking the average (for regression tasks) or the majority vote (for classification tasks) [12,13], or by taking more complex combinations. We have adopted the PCTs for HMLC as base classifiers. Averaging is applied to combine the predictions of the different trees because the leaf's prototype is the proportion of examples of different classes that belong to it. Just like for the base classifiers, a threshold should be specified to make a prediction. We consider two ensemble learning techniques that have primarily been used in the context of decision trees: bagging and random forests. Bagging [12] constructs the different classifiers by making bootstrap replicates of the training set and using each of these replicates to construct one classifier. Each bootstrap sample is obtained by randomly sampling training instances, with replacement, from the original training set, until a number of instances is obtained equal to the size of the training set. Bagging is applicable to any type of learning algorithm. A random forest [13] is an ensemble of trees, where diversity among the predictors is obtained both by bootstrap sampling and by changing the feature set during learning. More precisely, at each node in the decision tree, a random subset of the input attributes is taken, and the best feature is selected from this subset (instead of the set of all attributes). The number of attributes that are retained is given by a function f of the total number of input attributes x (e.g., f(x) = 1, f(x) = √x, f(x) = ⌊log2 x⌋ + 1, …). By setting f(x) = x, we obtain the bagging procedure. PCTs for HMLC are used as base classifiers.
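The two sources of diversity described above (bootstrap replicates and per-node random feature subsets) and the averaging of the trees' class-vector predictions could look like the sketch below; the PCT induction procedure itself and the prediction interface of the trees are placeholders, not the authors' implementation.

```python
# Hedged sketch of bagging / random-forest ingredients around PCT base
# classifiers; tree construction itself is assumed to exist and is not shown.
import numpy as np

def bootstrap_sample(X, y, rng):
    """Bootstrap replicate of (X, y); arrays are assumed to be NumPy arrays."""
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    return X[idx], y[idx]

def random_feature_subset(n_features, rng):
    size = int(np.log2(n_features)) + 1          # f(x) = floor(log2 x) + 1
    return rng.choice(n_features, size=size, replace=False)

def ensemble_predict(trees, x):
    """Average the class-probability vectors predicted by the individual trees."""
    return np.mean([tree.predict(x) for tree in trees], axis=0)
```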
4 Experimental Design We decided to split the training images into training and development images. To tune the system for the different distributions of images across classes in the training set and the test set, we generated several splits where the distributions of the images differed (in varying ways) between the training and development data. We constructed a classifier for each axis from the IRMA code separately (see Section 1). From each of the datasets, we learn a PCT for HMLC and ensembles of PCTs (Bagging and Random Forests). The ensembles consisted of 100 un-pruned trees. The feature subset size for Random Forests was set to 11 (using the formula f(2080) = ⌊log₂(2080)⌋). To compare the performance of a single tree and an ensemble we use Precision-Recall (PR) curves. These curves are obtained by varying the value of the classification threshold: a given threshold corresponds to a single point on the PR curve. For more information, see [3]. According to these experiments and previous research, the ensembles of PCTs have higher performance as compared to a single PCT when used for hierarchical annotation of medical images [14]. Furthermore, the Bagging and Random Forest methods give similar results. Because the Random Forest method is much faster than the Bagging method, we submitted only the results for the Random Forest method. To select an optimal value of the threshold (t), we performed validation on the different development sets. The threshold values that give the best results were used for the prediction of the unlabelled radiographs according to the four different classification schemes (see Section 1).
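The PR curves used for model selection can be traced with a few lines of code: each threshold t yields one (recall, precision) point computed over all (example, class) pairs. This is a schematic re-implementation under stated assumptions, not the evaluation code of [3].

import numpy as np

def pr_curve(prototypes, targets, thresholds=np.linspace(0.02, 0.98, 49)):
    # prototypes: (n_examples, n_classes) predicted class proportions
    # targets:    (n_examples, n_classes) 0/1 hierarchical labels
    points = []
    for t in thresholds:
        pred = prototypes >= t
        tp = np.logical_and(pred, targets == 1).sum()
        fp = np.logical_and(pred, targets == 0).sum()
        fn = np.logical_and(~pred, targets == 1).sum()
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision, t))
    return points   # the area under this curve approximates the reported AU PRC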
Fig. 2. Example images with same value for axis D, but different values for the axis combining D with the first code from A
To reduce the intra-class variability for axis D and improve the prediction performance, we decided to modify the hierarchy for this axis and include the first code of axis A from the corresponding IRMA code. Fig. 2 presents example images that have the same code for axis D, but are visually very different. After inclusion of the first code from the axis A, these images belong to different classes.
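The modification of the hierarchy for axis D can be expressed as a simple relabelling of the IRMA codes. The sketch below assumes the usual dash-separated layout of the code (TTTT-DDD-AAA-BBB); the function name and the example code are illustrative only.

def refine_axis_d(irma_code):
    # prepend the first character of axis A to the axis-D label, so that visually
    # different images sharing the same D code end up in different classes
    t, d, a, b = irma_code.split("-")
    return f"{a[0]}{d}"

print(refine_axis_d("1121-127-700-500"))   # -> '7127'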
5 Results For the ImageCLEF 2009 medical annotation task, we submitted one run. In this task, our result was third among the participating groups, with a total error score of 1352.56. The results for the particular datasets are presented in Table 1. From the results, we can note the high error for the annotations from ImageCLEF2005 and ImageCLEF2006. Recall that we pre-processed the images and the classes from 2005 and 2006 were mapped to an IRMA code. One class from the ImageCLEF2005 annotation corresponds to multiple labels from the hierarchical annotation of the IRMA code, and we used the most general class. This prevented the classifier from making more specific predictions. The performance for ImageCLEF2008 is worse than the performance for ImageCLEF2007 because ImageCLEF2008 has a bigger hierarchy and more test images. Similar conclusions can be made by analyzing the PR curves shown in Fig. 3. For each of the axes (T, D, A and B) we present three PR curves that correspond to the different annotation schemes. The PR curves for the 2006 and 2007 coding schemes are equal because we simply mapped the class numbers to the corresponding IRMA codes. From the presented values of the AU PRC (Area Under the average Precision-Recall Curve) it can be seen that we obtain the best results for the ImageCLEF2007 dataset. The AU PRC values for the ImageCLEF2005 dataset are very low considering the total number of classes, but this is mainly because we did not apply a one-to-one mapping as for the ImageCLEF2006 dataset.
Table 1. Error score for the medical image annotation task and AU PRC per axis, using random forests of PCTs for HMLC

Annotation label sets   Error score   Number of wildcards (*)   AU PRC / RF
                                                                Axis T   Axis D   Axis A   Axis B
2005                    549           0                         0.9990   0.7712   0.7059   0.9843
2006                    433           0                         0.9998   0.8177   0.7419   0.9948
2007                    128.1         2550                      0.9998   0.8177   0.7419   0.9948
2008                    242.26        2613                      0.9995   0.7488   0.6621   0.9760
The excellent performance for the prediction task for axes T and B is due to the simplicity of the problem: the hierarchies along these axes contain only a few nodes (8 and 19 nodes for ImageCLEF2008, respectively). This means that each node in the hierarchy contains a large portion of the examples, thus learning a good classifier is not a difficult task. The classifiers for the other two axes have satisfactory predictive performance, but here the predictive task is somewhat more difficult (especially for axis A). The sizes of the hierarchies along the A and D axes for ImageCLEF2008 are 202 and 88 nodes, respectively.
Fig. 3. Precision-Recall curves for the random forest predictions of the codes for T, D, A and B axis, respectively, for the four different competition tasks. The PR curves for the axes T and B are close to each other for each year. For the axes D and A, the upper PR curves are for the years 2006/07, the lower ones are for 2008 and the PR curves in the middle are for 2005.
6 Conclusions This paper presents a hierarchical multi-label classification approach to medical image annotation. For efficient image representation, we use edge histogram descriptor and SIFT histograms. The predictive modeling problem that we consider is to learn PCTs and ensembles of PCTs that predict a hierarchical annotation of an X-ray image. Using these approaches, we obtained good predictive performance and ranked third on the ImageCLEF 2009 competition. There are several ways to further improve the predictive performance of the proposed approach. First, one could try to tackle the shift in distribution of images between the training and the testing set. One solution is to develop extensions of the PCT approach that can handle such differences. Another approach is to generate virtual samples of the images that are underrepresented in the training set by rotation,
translation and manipulation of contrast and brightness. Second, better performance may be obtained by post-processing the output from the ensembles and by reducing the dependence on the thresholding: instead of the hard threshold, use the raw probabilities. Third, we could use additional feature extraction techniques and combine them using different combination schemes (other than concatenation). In summary, we presented a general approach to hierarchical image annotation. The approach can be easily extended with new feature extraction methods, and can thus be applied to other domains. It can also handle hierarchies of arbitrary size and structure (bigger hierarchies, hierarchies that are organized as trees or as directed acyclic graphs).
References 1. Tommasi, T., Caputo, B., Welter, P., Guld, M.O., Deserno, T.M.: Overview of the CLEF 2009 medical image annotation track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 2. Lehmann, T.M., Schubert, H., Keysers, D., Kohnen, M., Wein, B.B.: The IRMA code for unique classification of medical images. In: Proc. of SPIE - Medical Imaging 2003, vol. 5033, pp. 440–451 (2003) 3. Vens, C., Struyf, J., Schietgat, L., Dzeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Machine Learning 73(2), 185–214 (2008) 4. Tommasi, T., Orabona, F., Caputo, B.: Discriminative cue integration for medical image annotation. Pattern Recognition Letters 29(15), 1996–2002 (2008) 5. Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: Proc. of the 15th ICML, pp. 55–63 (1998) 6. Ziou, D., Tabbone, S.: Edge Detection Techniques an Overview. International Journal of Pattern Recognition and Image Analysis 8(4), 537–559 (1998) 7. Canny, J.F.: A computational approach to edge detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986) 8. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 9. Deselaers, T., Keysers, D., Ney, H.: Discriminative training for object recognition using image patches. In: CVPR 2005, San Diego, CA, vol. 2, pp. 157–162 (2005) 10. Kocev, D., Vens, C., Struyf, J., Dzeroski, S.: Ensembles of Multi-Objective Decision Trees. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 624–631. Springer, Heidelberg (2007) 11. Dzeroski, S., Gjorgjioski, V., Slavkov, I., Struyf, J.: Analysis of Time Series Data with Predictive Clustering Trees. In: Džeroski, S., Struyf, J. (eds.) KDID 2006. LNCS, vol. 4747, pp. 63–80. Springer, Heidelberg (2007) 12. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996) 13. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001) 14. Dimitrovski, I., Kocev, D., Loskovska, S., Dzeroski, S.: Hierarchical annotation of medical images. In: Proc. of the 11th International Multiconference − IS 2008, Ljubljana, Slovenia, pp. 170–174 (2008)
Dense Simple Features for Fast and Accurate Medical X-Ray Annotation Uri Avni1 , Hayit Greenspan1 , and Jacob Goldberger2 1
BioMedical Engineering, Tel-Aviv University 2 Engineering School, Bar-Ilan University
Abstract. We present a simple, fast and accurate image categorization system, applied to medical image databases within the ImageCLEF 2009 medical annotation task. The methodology presented is based on local representation of the image content, using a bag of visual words approach in multiple scales, with a kernel based SVM classifier. The system was ranked first in this challenge, with total error score of 852.8.
1 Introduction This work presents a classification system that is based on the bag-of-visual-words (BoW) paradigm, a recently introduced concept that has been successfully applied to scenery image classification tasks (see e.g. [1,2,3]). Approaches based on local features were presented in recent ImageCLEF medical annotation challenges [4,5], including the most successful ones (e.g., [6,7]). In 2006 Deselaers et al. [6] displayed the best medical annotation results using a local features approach, where the features are local patches of different sizes taken at every position, and scaled to a common size. No dictionary was used; rather, the feature space was quantized uniformly in every dimension and the image was represented as a sparse histogram in the quantized space. Tommasi et al. [7] had the highest score in 2007 and 2008. In this work both global and local features were used, with different integration techniques. As the local features, modified SIFT descriptors were used, sampled randomly. Local features on four image quadrants were learned and represented separately. Nowak et al. [3] showed that the number of patches is the single most influential parameter governing performance. Based on several of the above conclusions, and with several differences, we present a classification system with the following characteristics: We sample patches densely, and show that in this case simple features give comparable classification accuracy to SIFT features, while taking significantly less computation time. We argue that building a visual dictionary from the data does not significantly compromise the computation time. A visual dictionary can be built from a select group of images, with computation time that does not depend on the database size. Moreover, when spatial coordinates are included in the features, indexing dictionary words by the coordinates significantly accelerates the lookup process. An overview and detailed description of the proposed classification system is provided in Section 2. The experimental validation and sensitivity analysis is described in Section 3.
2 A Classification System Based on a Dictionary of Visual-Words We review the classification system we have implemented for the ImageCLEF 2009 medical annotation challenge. Key components are shown in the flow-diagram in Fig. 1. Patch extraction Given an image, feature detection is used to extract several small local patches. Each small patch shows a localized view of the image content. These patches are considered as candidates for basic elements, or “words”. The patch size needs to be larger than a few pixels across, in order to capture higher-level semantics such as edges or corners. At the same time, the patch size should not be too large if it is to serve as a common building block of many images. Common feature detection approaches include using a regular sampling grid, a random selection of points, or the selection of points with high information content using salient point detectors. We utilize all the information in the image, by sampling rectangular patches of fixed size N × N around every pixel. Feature space description Following the feature detection step, the feature representation method involve representing the patches using feature descriptors. In this step, a large random subset of images is used (ignoring their labels). We extract patches of size N × N using a regular grid, and normalize each patch by subtracting its mean gray level, and dividing it by its standard deviation. This step insures invariance to local changes in brightness, provides local contrast enhancement and augments the information within a patch. Patches that have a single intensity value are abundant in x-ray images. These patches are common in all categories, much like stop-words in text documents. These patches are ignored. We are left with a large collection of several million vectors of length N 2 . To reduce both the computational complexity of the algorithm and the level of noise, we apply a principal component analysis procedure (PCA) to this initial patch collection. The first few components of the PCA, which are the components with the largest eigenvalues, serve as a basis for the information description. A popular alternative approach to raw patches is the SIFT representation [8] which is beneficial in scenery images [9], where object scales can vary. We examine this option in the experiments defining the system parameter set. In addition to patch content information represented either by PCA coefficients or SIFT descriptors, we add the patch center coordinates to the feature vector. This addition introduces spatial information into the image representation, without the need to explicitly model the spatial dependency between patches. Quantization The final step of the bag-of-words model is to convert vector represented patches into visual words and to generate a representative dictionary. A visual word can be considered as a representative of several similar patches. A frequently-used method is to perform K-means clustering over the vectors of the initial collection, and then cluster them into K groups in the feature space. The resultant cluster centers serve as a vocabulary of K visual words. Due to the fact that we included spatial coordinates as part of the feature space, the visual words have a localization component in them, which is reflected as a spatial spread of the words in the image plane. Words are more dense in
areas with greater variability across images in the database. Dictionary words are stored in a kd-tree, indexed by the spatial coordinates.

From an input image to a representative histogram. A given (training or testing) image can now be represented by a unique distribution over the generated dictionary of words. In our implementation, patches are extracted from every pixel in the image. For an x-ray image of size 512 × 512 there are typically several hundreds of thousands of non-empty patches. The patches are projected into the selected feature space, and translated (quantized) to indices by looking up the most similar feature-vector in the generated dictionary. Using the spatial indexation of dictionary words, the dictionary lookup process is accelerated by comparing a new patch only to dictionary words at a certain radius from it. The dictionary generation process and the shift from a given image to its representative histogram are shown in Figure 1, left column and right column, respectively. Note that as a result of including spatial features, both the image local content and the spatial layout are preserved in the discrete histogram representation. Multi-scale image information may in some cases provide additional information that supports the required discrimination. To address this we repeat the dictionary building process for scaled-down replications of the input image, using the same patch size. The image representation in this case is a 1-D concatenation of histograms from varying scales. This process, demonstrated in Figure 2, provides a richer image representation. It does not imply scale invariance, as in [6]. In our experiments we found that objects of interest in radiographs appear at a roughly similar size-range across all images, thus invariance to scale is not a necessity.

Classification. Image classification is performed using a non-linear multiclass Support Vector Machine (SVM) with different kernels. We examined several non-linear kernels, commonly used with histogram data:

– Histogram intersection kernel [10]: K(x, y) = Σ_i min(x_i, y_i)
– Radial Basis Function kernel: K(x, y) = exp(−γ ‖x − y‖²)
– χ² kernel: K(x, y) = exp(−γ Σ_i (x_i − y_i)² / (x_i + y_i))
Note that histogram intersection has no free kernel parameters, which makes it convenient for fast parameter evaluation. The two other kernels have a free tradeoff parameter γ, and require careful optimization. In order to classify multiple categories, we use the one-vs-one extension of the binary classifier, where N (N − 1)/2 binary classifiers are trained for all pairs of categories in the dataset. Whenever an unknown image is classified with a binary classifier it casts one vote for its preferred class, and the final result is the class with the most votes. Since each binary classifier runs independently, parallelization of both training and testing phases of the SVM is straightforward. It is implemented as a parallel enhancement of the LIBSVM library1. 1
http://www.csie.ntu.edu.tw/∼cjlin/libsvm
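The three kernels above can be written directly on the histogram vectors. The sketch below uses NumPy and, purely as an illustration, feeds a precomputed χ² Gram matrix to scikit-learn's SVC (whose default multiclass strategy is the same one-vs-one scheme described in the text); the authors' own runs used a parallel enhancement of LIBSVM, not this code, and the toy data are invented.

import numpy as np
from sklearn.svm import SVC

def hist_intersection(X, Y):
    # K(x, y) = sum_i min(x_i, y_i), no free kernel parameter
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def chi2_kernel(X, Y, gamma=1.0, eps=1e-12):
    # K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
    D = np.array([[((x - y) ** 2 / (x + y + eps)).sum() for y in Y] for x in X])
    return np.exp(-gamma * D)

# toy histograms: 4 "images", 6 dictionary words
X = np.random.RandomState(0).dirichlet(np.ones(6), size=4)
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="precomputed").fit(chi2_kernel(X, X), y)
print(clf.predict(chi2_kernel(X, X)))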
Fig. 1. Dictionary building and image representation flow chart
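A condensed sketch of the pipeline in Fig. 1, with scikit-learn standing in for the authors' implementation. The patch size, the 7 PCA components, the spatial weighting and the 1000-word vocabulary follow values reported later in the paper; the σ-free normalization, the skipping of constant patches and everything else in this listing are assumptions (in particular, the paper's kd-tree-accelerated lookup is replaced here by a plain nearest-centroid assignment).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def dense_patches(img, n=9, spatial_weight=3.0):
    # N x N patches around every pixel (subsample in practice), variance-normalized,
    # with weighted patch-center coordinates appended
    feats = []
    h, w = img.shape
    for y in range(0, h - n):
        for x in range(0, w - n):
            p = img[y:y + n, x:x + n].astype(float).ravel()
            if p.std() < 1e-6:              # drop single-intensity "stop-word" patches
                continue
            p = (p - p.mean()) / p.std()    # local brightness/contrast normalization
            cx = spatial_weight * (2.0 * x / w - 1.0)
            cy = spatial_weight * (2.0 * y / h - 1.0)
            feats.append(np.concatenate([p, [cx, cy]]))
    return np.array(feats)

def build_dictionary(patch_sets, n_components=7, n_words=1000):
    allp = np.vstack(patch_sets)
    pca = PCA(n_components=n_components).fit(allp[:, :-2])
    reduced = np.hstack([pca.transform(allp[:, :-2]), allp[:, -2:]])
    km = KMeans(n_clusters=n_words, n_init=3).fit(reduced)
    return pca, km

def image_histogram(img, pca, km):
    p = dense_patches(img)
    words = km.predict(np.hstack([pca.transform(p[:, :-2]), p[:, -2:]]))
    return np.bincount(words, minlength=km.n_clusters) / len(words)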
Fig. 2. Image representation at multiple scales
3 Experiments and Results A key component in using the BoW paradigm in a categorization task is the tuning of the system parameters. An optimization step is thus required for a given task and image archive. We focus on three components of the system: finding the optimal set of local features, finding the optimal dictionary size, and optimizing the classifier parameters. We use the 2007 labels of the IRMA database. Each of the 116 categories is treated as a separate label, disregarding the hierarchical nature of the IRMA code. We optimized the system parameters using several cross-validation experiments. In the following experiments, 10,667 images were used for training and 2000 randomly drawn images were used for testing and verification. The optimization is performed independently in three steps: finding the optimal set of local features, finding the optimal dictionary size, and optimizing the classifier parameters. Local features. We examined three feature extraction strategies: raw patches, raw patches with normalized variance, and SIFT descriptors. In all cases we added spatial coordinates to the feature vector. We used dense extraction of features around every pixel in the image. There are often strong artifacts near the image border that are not relevant to the image category, so a 5% margin from the image border was ignored. The feature extraction step produces about 100,000 to 200,000 features from a single image. It is our experience that X-ray images from the same category usually appear in a similar scale and orientation in a given archive. In this task the invariance of the SIFT features to scale and orientation is therefore not necessary. We used SIFT descriptors taken at a single scale, without aligning the orientation, as in [7]. Raw patches and normalized patches of size N × N were dimensionally reduced using PCA. Table 1 summarizes the classification results of the three feature sets. Normalizing patch variance improves the classification rate compared to raw patches. The gain can be attributed to the local contrast invariance achieved in this step. In this task, using normalized patches proved marginally preferable to SIFT descriptors in terms of classification accuracy. However, when using raw patches, the feature extraction step is significantly faster than with SIFT descriptors, as seen in Figure 3. The majority of the running time was spent in the image representation step; this step takes over 3 seconds per image with the SIFT features, and less than half a second with the simpler raw patches. Time was measured on a dual quad-core Intel Xeon 2.33 GHz. In the following sections variance-normalized raw patches are used as features. Figure 4(a) depicts the effect of using 4 to 10 components for variance-normalized raw patches. It can be seen that the number of components has a minimal effect on classification accuracy. The addition of spatial coordinates to the feature set, on the other hand, improves the classification performance noticeably, as seen in Figure 4(b).

Table 1. Comparison of different features

Features      Average %   Standard Deviation
Raw Patches   88.43       0.32
SIFT          90.80       0.41
Normalized    91.29       0.56
Fig. 3. Running time using SIFT descriptors and normalized raw patches
Fig. 4. (a): Effect of the number of PCA components in a patch on classification accuracy. (b) Effect of spatial features: Weight of spatial features (x-axis); Classification accuracy (y-axis).
We found that when using 7 PCA components, the optimal range for the x, y coordinates was [−3, 3]. Bars show means and standard deviations from 20 experiments running on 1000 random test images. We next investigated the appropriate number of words in the dictionary. As Figure 5 shows, increasing the number of dictionary words proved useful up to 1000 words. Adding additional words after that point increased the computational time with no evident improvement in the classification rate. Combining the above, the classification system used normalized raw patch features, with 7 PCA components, spatial features with weight [-3,3], and 1000 visual words. Using the SVM with a histogram intersection kernel achieved a classification accuracy of 91.29%. We next examined two additional kernel types with the SVM classifier, the Radial basis function and χ2 kernels. We used the optimal features and dictionary size, consistently across all experiments. For these kernel types the SVM cost parameter C, and free kernel parameter γ were scanned simultaneously over a grid to find the classifier’s optimal working point. The results of these experiments are summarized in Table 2. The χ2 kernel is ranked first by a small margin with 91.62% accuracy, followed by the RBF kernel with 91.45%. As a final experiment, we take information from multiple image scales into account by repeating the dictionary creation step on scaled-down versions of the original image. The image representation thus was a concatenation of histograms built on the single scale dictionaries. We used 3 scales: the original image, 1/2 size and 1/8 size. Using 3
Fig. 5. Effect of dictionary size on classification accuracy

Table 2. Comparison of SVM kernel types, for 1-scale and 3-scale models

Kernel                   Average % 1-scale   Average % 3-scales
Radial Basis             91.45               91.59
Histogram Intersection   91.29               91.89
χ2                       91.62               91.95
Fig. 6. Detecting category ’posteroanterior, left hand’: (a),(b),(c),(d) Correctly classified. (e) False negative, misclassified as ’left anterior oblique, left hand’. False positives come from categories: (f) anteroposterior, left carpal joint (g) anteroposterior, left foot (h) right anterior oblique, right foot.
scales further improved the accuracy for all kernels, as seen in the right-most column of Table 2. The average classification accuracy with the χ2 kernel is 91.95%. Figure 6 demonstrates the subtlety of the challenge by examining the classification accuracy on one category: ’Posteroanterior, Left hand’. In this run there are 2000 random test images, with 57 images from the examined category. Out of which, 56 were
Table 3. Error score of the submitted medical image annotation run, and the second best result. Lower is better.

Run                       2005   2006   2007    2008     Sum
This work                 356    263    64.3    169.5    852.8
Second best - Idiap [4]   393    260    67.23   178.93   899.16
correctly detected by the described system, for eg. (a,b,c,d). Only one image, (e), was falsely classified - it was detected as ’Left anterior oblique, Left hand’ (false negative). 3 images from other categories, (f,g,h), were misclassified as ’Posteroanterior, Left hand’ (false positives). The system parameters, tuned to the labels from 2007, were applied to the 4 labeling sets of the ImageCLEF 2009 medical annotation challenge. The results are presented in Table 3. Our system was ranked first on 3 of the 4 labeling sets (2005, 2007 and 2008), and first in the overall error score. To conclude, in this study we applied a visual words approach to medical image annotation. We showed that using dense sampling in multiple scales while keeping the features simple makes the system both accurate and computationally efficient. The system ranked first in ImageCLEF 2009 medical annotation challenge. Currently we are looking at extending the capabilities to categorizations of healthy vs pathology cases, as well as within-pathology identifications.
References 1. Varma, M., Zisserman, A.: Texture classification: are filter banks necessary? In: CVPR, vol. 2, pp. 691–698 (2003) 2. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR, vol. 2, pp. 524–531 (2005) 3. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503. Springer, Heidelberg (2006) 4. Tommasi, T., Caputo, B., Welter, P., Deserno, T.M.: Overview of the CLEF 2009 medical image annotation track. In: CLEF Working Notes (2009), http://www.clef-campaign.org/2009/working_notes 5. Deselaers, T., Deserno, T.M.: Medical image annotation in ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 523–530. Springer, Heidelberg (2009) 6. Deselaers, T., Hegerath, A., Keysers, D., Ney, H.: Sparse patch-histograms for object classification in cluttered images. In: Franke, K., M¨uller, K.-R., Nickolay, B., Sch¨afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 202–211. Springer, Heidelberg (2006) 7. Tommasi, T., Orabona, F., Caputo, B.: Discriminative cue integration for medical image annotation. Pattern Recogn. Lett. 29(15), 1996–2002 (2008) 8. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, vol. 2, pp. 1150–1157 (1999) 9. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73(2), 213–238 (2007) 10. Barla, A., Odone, F., Verri, A.: Histogram intersection kernel for image classification. In: ICIP, vol. 3 (2003)
Automated X-Ray Image Annotation Single versus Ensemble of Support Vector Machines Devrim Unay1 , Octavian Soldea2 , Sureyya Ozogur-Akyuz3, Mujdat Cetin2 , and Aytul Ercil2 1 2
Electrical and Electronics Engineering, Bahcesehir University, Istanbul, Turkey Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey 3 Mathematics and Computer Science, Bahcesehir University, Istanbul, Turkey {devrim.unay,sureyya.akyuz}@bahcesehir.edu.tr, {octavian,mcetin,aytulercil}@sabanciuniv.edu
Abstract. Advances in medical imaging technology have led to an exponential growth in the number of digital images that need to be acquired, analyzed, classified, stored and retrieved in medical centers. As a result, medical image classification and retrieval have recently gained high interest in the scientific community. Despite several attempts, such as the yearly-held ImageCLEF Medical Image Annotation Challenge, the proposed solutions are still far from being sufficiently accurate for real-life implementations. In this paper we summarize the technical details of our experiments for the ImageCLEF 2009 medical image annotation challenge. We use a direct and two ensemble classification schemes that employ local binary patterns as image descriptors. The direct scheme employs a single SVM to automatically annotate X-ray images. The two proposed ensemble schemes divide the classification task into sub-problems. The first ensemble scheme exploits ensemble SVMs trained on IRMA sub-codes. The second learns from subgroups of data defined by the frequency of classes. Our experiments show that ensemble annotation by training individual SVMs over each IRMA sub-code dominates its rivals in annotation accuracy at the cost of increased processing time relative to the direct scheme.
1
Introduction
Digital medical images, such as standard radiographs (X-Ray) and computed tomography (CT) images, represent a large part of the data that need to be stored, archived, retrieved, and shared among medical centers. Manual labeling of this data is not only time consuming, but also error-prone due to inter/intraobserver variations. In order to realize an accurate classification of digital medical images one needs to develop automatic tools that allow high performance image annotation, i.e. a given image is automatically labeled with a text or a code without any user interaction.
This work was supported in part by the Marie Curie Programme of the European Commission under FP6 IRonDB project MTK-CT-2006-047217.
Several attempts in the field of medical images have been performed in the past, such as the WebMRIS system [1] for cervical spinal X-Ray images, and the ASSERT system [2] for CT images of the lung. While these efforts consider retrieving a specific body part only, other initiatives have been taken in order to retrieve multiple body parts. The yearly held ImageCLEF Medical Image Annotation challenge, run as part of the Cross-Language Evaluation Forum (CLEF) campaign, aims in automatic classification of an X-Ray image archive containing more than 12,000 images randomly taken from the medical routine. The dataset contains images of different body parts of people from different ages, of different genders, under varying viewing angles and with or without pathologies. A potent classification system requires the image data to be translated into a more compact and more manageable representation containing descriptive features. Several feature representations have been investigated in the past for such a classification task. Among others, image features, such as average value over the complete image or its sub-regions [3] and color histograms [4], have been investigated. Recently in [5], texture features like local binary patterns (LBP) [6] have been shown to outperform other types of low-level image features in classification of X-Ray images. Subsequently in [7], it has been shown that retaining only the relevant local binary pattern features achieves comparable classification accuracies with smaller feature sets, thus leading to reduced processing time and storage space requirements. A less investigated path is to exploit from hierarchical organization of medical data, such as the ImageCLEF data labeled by the IRMA coding system, using ensemble classifiers. Accordingly, in this paper we explore the annotation performance of two ensemble classification schemes based on IRMA sub-codes and frequency of classes, and compare them to the well-known single-classifier scheme over the ImageCLEF-2009 Medical Annotation dataset. The paper is organized as follows. Section 2 presents our feature extraction and classification steps in detail. Then, in Section 3 we introduce the image database and the experimental evaluation performed. And finally, Sections 4 and 5, present corresponding results and our conclusions, respectively.
2 2.1
Method Feature Extraction
We extract spatially enhanced local binary patterns as features from each image in the database. LBP [6] is a gray-scale invariant local texture descriptor with low computational complexity. The LBP operator labels image pixels by thresholding a neighborhood of each pixel with the center value and considering the results as a binary number. Formally, given a pixel at (x_c, y_c), the resulting LBP code can be expressed as:

LBP_{P,R}(x_c, y_c) = Σ_{n=0}^{P−1} s(i_n − i_c) · 2^n    (1)
Fig. 1. The image is divided into 4x4 non-overlapping sub-regions from which LBP histograms are extracted and concatenated into a single, spatially enhanced histogram
where n runs over the P neighbors of the central pixel, i_c and i_n are the gray-level values of the central and the neighboring pixels, and s(x) is 1 if x ≥ 0 and 0 otherwise. Eventually, a histogram of the labeled image f_l(x, y) can be defined as

H_i = Σ_{x,y} I(f_l(x, y) = i),   i = 0, . . . , L − 1    (2)
where L is the number of different labels produced by the LBP operator, and I(A) is 1 if A is true and 0 otherwise. The derived LBP histogram contains information about the distribution of local micro-patterns, such as edges and flat areas, over the image. Because not all LBP codes are informative [6], we use uniform version of LBP and reduce the number of informative codes from 256 to 59 (58 informative bins + one bin for noisy patterns). In order to obtain a more local description, we divide images into 4x4 non-overlapping sub-regions and concatenate the LBP histograms extracted from each region into a single, spatially enhanced feature histogram, as in [5] (Figure 1). Finally, we obtain a total of 944 features per image, and each feature is linearly scaled to [-1,+1] range before presented to the classifier. 2.2
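A minimal re-implementation of this descriptor is sketched below (the 8-neighbour uniform-LBP mapping and the 4×4 grid follow the description above; scikit-image's local_binary_pattern could be used instead, and the neighbour ordering and helper names here are assumptions):

import numpy as np

def uniform_map(p=8):
    # map each 8-bit LBP code to one of 58 uniform bins plus one bin for the rest
    table, nxt = {}, 0
    for code in range(2 ** p):
        bits = [(code >> i) & 1 for i in range(p)]
        transitions = sum(bits[i] != bits[(i + 1) % p] for i in range(p))
        table[code] = nxt if transitions <= 2 else 58
        if transitions <= 2:
            nxt += 1
    return table

def lbp_histograms(img, grid=4):
    table = uniform_map()
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        codes |= ((img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] >= center) << bit)
    labels = np.vectorize(table.get)(codes)
    feats = []
    for rows in np.array_split(labels, grid, axis=0):      # 4x4 non-overlapping sub-regions
        for cell in np.array_split(rows, grid, axis=1):
            feats.append(np.bincount(cell.ravel(), minlength=59))
    return np.concatenate(feats)                            # 16 * 59 = 944 features, as in the text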
Image Annotation
In this work we use a support vector machine (SVM) based learning framework to automatically annotate the images. SVM [8] is a popular machine learning algorithm that provides good results for general classification tasks in the computer vision and medical domains: e.g. nine of the ten best models in ImageCLEFmed 2006 competition were based on SVM [9]. In a nutshell, SVM maps data to a higher-dimensional space using kernel functions and performs linear discrimination in that space by simultaneously minimizing classification error and maximizing geometric margin between classes.
Fig. 2. Illustration of ensemble classification based on IRMA sub-codes. A separate SVM is trained for each sub-code, and final decision is formed by concatenating predictions of each SVM.
Among all available kernel functions for data mapping in SVM, the Gaussian radial basis function is the most popular choice, and therefore it is used here. In this work we used the LibSVM1 library (version 2.89) for SVM and empirically found its optimum parameters on the dataset. Direct Annotation Scheme (D). In this scheme, we classify images by using a single SVM with a one-versus-all multi-class model. In contrast, the ensemble schemes break the annotation task down into sub-problems by dividing the data into subgroups based on 1) IRMA sub-codes, and 2) frequency of classes. Ensemble Annotation by IRMA sub-codes (E-1). In the IRMA coding system, images are categorized in a hierarchical manner based on four sub-codes describing image modality, image orientation, body region examined, and biological system investigated. Accordingly, in this scheme we train a separate SVM for each sub-code and merge their predictions to form the final decision, as illustrated in Figure 2. Ensemble Annotation by frequency of classes (E-2). On the contrary, this ensemble scheme successively divides the data into sub-groups based on frequency of classes and trains a separate SVM on each sub-group (Figure 3). Let L1, L2, . . . , Ln be the set of classes in the training set and m ∈ N be a positive integer parameter. Without loss of generality, assume L1, L2, . . . , Ln are sorted in their decreasing cardinality values. We divide the training set into a sequence of clusters C1, C2, . . . , Ck, such that C1 = {L1, L2, . . . , Lm, U1}, C2 = {Lm+1, Lm+2, . . . , L2m, U2}, and so on, where U1 = ∪_{i=m+1}^{n} Li and U2 = ∪_{i=2m+1}^{n} Li (see Figure 3). For each Ci we train an SVM. Let Si be the SVM trained on Ci. When classifying, we begin from S1. If S1 suggests one of the L1, L2, . . . , Lm labels, then we consider this result a valid classification. If the result is U1, then we proceed further to S2. We follow this procedure recursively, until we eventually reach Sk, which finishes the classification procedure. Note that we adjust Ck to include only Li labels.
Available at http://www.csie.ntu.edu.tw/ cjlin/libsvm
Fig. 3. Illustration of second ensemble SVM scheme for m = 2. The first cluster, C1 , consists of classes {L1 , L2 , U1 } . The second cluster, C2 , consists of {L3 , L4 , U2 } , and so on.
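A sketch of the E-2 cascade (cluster construction and cascaded prediction). The per-cluster classifier objects are placeholders for the SVMs, m is the split parameter of Section 2.2, and the "U*" label names are an assumption made only for this illustration.

def build_clusters(labels_sorted, m):
    # labels_sorted: class names in decreasing order of frequency
    clusters, rest = [], list(labels_sorted)
    while len(rest) > m:
        head, rest = rest[:m], rest[m:]
        clusters.append(head + [f"U{len(clusters) + 1}"])   # Ui stands for the union of the remaining classes
    clusters.append(rest)                                    # last cluster Ck: only real labels
    return clusters

def cascade_predict(x, svms, clusters):
    # svms[i] predicts one label from clusters[i]; a 'U*' label means "defer to the next stage"
    for svm, cluster in zip(svms, clusters):
        label = svm(x)
        if not label.startswith("U"):
            return label
    return label   # the last stage always returns a real class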
3 3.1
Experimental Setup Image Data
The database released for the ImageCLEF-2009 Medical Annotation task includes 12677 fully classified (2D) radiographs for training and a separate test set consisting of 2000 radiographs. The aim is to automatically classify the test set using four different label sets including 57 to 193 distinct classes. A more detailed explanation of the database and the tasks can be found in [10]. 3.2
Evaluation
We evaluate our SVM-based learning using two schemes depending on the availability of test data labels: 1)5-fold cross validation if test data labels are missing, and 2)ImageCLEF error counting scheme, otherwise. In the former scheme, the training database is partitioned into five subsets. Each subset is used once for testing while the rest are used for training, and the final result is assigned as the average of the five validations. Note that for each validation all classes were equally divided among the folds. We measure the overall classification performance using accuracy, which is the number of correct predictions divided by the total number of images. To the contrary, the error counting scheme is introduced by the challenge organizers to compare all runs submitted. Further details on this scheme can be found in [10]. 3.3
Runs Submitted
As Computer Vision and Pattern Analysis (VPA) Laboratory of Sabanci University, we submitted three different runs to the ImageCLEF 2009 medical image annotation task. One obtained by the direct scheme (VPA-SABANCI-1), and two with the ensemble schemes (VPA-SABANCI-2 and -3). For each run, the optimum parameter setting was realized by trial-and-error.
4
Results
In this section, we present the results obtained by the proposed annotation schemes. In Table 1 we observe the results realized on the training database with 5-fold cross-validation. Ensemble scheme based on IRMA sub-codes clearly outperforms others, especially in terms of the 2007, 2008 and overall accuracies. Table 1. Performance of VPA-SABANCI runs on training data
Run             Type   2005   2006   2007   2008   Average   (accuracy, %)
VPA-SABANCI-1   D      88.0   83.2   83.2   83.1   84.4
VPA-SABANCI-2   E-1    88.0   83.2   91.7   93.0   89.0
VPA-SABANCI-3   E-2    83.3   77.4   77.6   77.6   79.0
Table 2 provides a detailed performance comparison of the direct scheme and the IRMA sub-codes based ensemble one over the 2007 and 2008 labels. Simplifying the classification task by training a separate SVM over each sub-code considerably improves the final accuracy relative to the usage of a single SVM. Furthermore, the 2008 accuracies of the individual SVMs exceed those of 2007 despite the higher number of classes (thus a more difficult classification problem). The underlying reason for this observation may be attributed to the more realistic labels of 2008.

Table 2. Efficacy of ensemble classification based on IRMA sub-codes. Values in parentheses refer to the number of distinct classes for that sub-task.

                     Ensemble by IRMA sub-codes                               Direct
                     SVM1      SVM2       SVM3       SVM4       Final
2007 accuracy (%)    96.7(5)   85.6(27)   88.0(66)   96.4(6)    91.7          83.2
2008 accuracy (%)    99.2(6)   86.3(34)   88.0(97)   98.5(11)   93.0          83.1
The results achieved on the test dataset in terms of prediction errors are presented in Table 3, together with the results of the best run realized in the challenge for comparison. As observed, the IRMA sub-codes based ensemble scheme (E-1) outperforms its rivals again. With this performance, the VPA-SABANCI-2 run is ranked 7th among the 18 runs submitted to the competition. Compared to our solution, the best run of the challenge exploits multiresolution analysis. Figure 4 displays exemplary confusions realized by the best performing VPA-SABANCI-2 run for a class with few samples that leads to low recognition performance (19.5%), which may be partly due to the low number of examples, and partly because of the high visual similarity between the confused classes and the reference class (most confusions are between images of the same body part, i.e. the head; note that, at manual categorization, these images were assigned to different labels because of variation in image acquisition, such as view angle).
Table 3. Performance of VPA-SABANCI runs, in comparison with the best run of the challenge, on test data. D refers to direct scheme, while E-1 and E-2 refer to ensemble schemes based on IRMA code and data distribution, respectively.
Run                    Type   2005   2006   2007     2008     Sum      (error score)
VPA-SABANCI-1          D      578    462    201.31   272.61   1513.92
VPA-SABANCI-2          E-1    578    462    155.05   261.16   1456.21
VPA-SABANCI-3          E-2    587    498    169.33   300.44   1554.77
TAUbiomed (best run)   –      356    263    64.30    169.50   852.80
Fig. 4. Exemplary confusions realized by the proposed approach for a class with relatively low accuracy. Reference class with the corresponding label, number-of-examples, accuracy, and a representative X-ray image are shown on the left, while the three most-observed confusions in descending order are displayed to the right.

Table 4. Computational expense of VPA-SABANCI runs for testing on a PC with a 2.40GHz processor and 6GB RAM. T = 1.83 min, M = 140 MB, and k = #classes/m, m being the split parameter defined in Section 2.2. Typically, k > 4 in our case.

Run             Type   CPU Time   Memory Usage
VPA-SABANCI-1   D      T          M
VPA-SABANCI-2   E-1    4T         M
VPA-SABANCI-3   E-2    kT         M
Table 4 demonstrates the computational requirements of the proposed schemes for testing. As observed, the ensemble schemes require over four times the resources of the direct scheme on a single-processor architecture. Nevertheless, this additional requirement can be canceled out by parallel processing.
5
Conclusion
In this paper we have introduced a classification work with the aim of automatically annotating X-Ray images. We have explored the annotation performances
of two ensemble classification schemes based on individual SVMs trained on IRMA sub-codes and frequency of classes, and compared the results with the popular single-classifier scheme. Our experiments on the ImageCLEF-2009 Medical Annotation database revealed that breaking the annotation problem down to sub-problems by training individual SVMs over each IRMA sub-code outperforms its rivals in terms of annotation accuracy with the compromise of increased computational expense.
References 1. Long, L.R., Pillemer, S.R., Lawrence, R.C., Goh, G.H., Neve, L., Thoma, G.R.: WebMIRS: web-based medical information retrieval system. In: Sethi, I.K., Jain, R.C. (eds.) Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 3312, pp. 392–403 (December 1997) 2. Shyu, C.R., Brodley, C.E., Kak, A.C., Kosaka, A., Aisen, A.M., Broderick, L.S.: Assert: a physician-in-the-loop content-based retrieval system for hrct image databases. Comput. Vis. Image Underst. 75(1-2), 111–132 (1999) 3. Rahman, M.M., Desai, B.C., Bhattacharya, P.: Medical image retrieval with probabilistic multi-class support vector machine classifiers and adaptive similarity fusion. Computerized Medical Imaging and Graphics 32(2), 95–108 (2008) 4. Mueen, A., Sapian Baba, M., Zainuddin, R.: Multilevel feature extraction and x-ray image classification. J. Applied Sciences 7(8), 1224–1229 (2007) 5. Jacquet, V., Jeanne, V., Unay, D.: Automatic detection of body parts in x-ray images. In: IEEE Computer Society Workshop on Mathematical Methods in Biomedical Image Analysis, MMBIA (2009) 6. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 7. Unay, D., Soldea, O., Ekin, A., Cetin, M., Ercil, A.: Automatic Annotation of X-ray Images: A Study on Attribute Selection. In: Medical Content-based Retrieval for Clinical Decision Support (MCBR-CDS) Workshop in Conjunction with MICCAI 2009 (2009) 8. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998) 9. M¨ uller, H., Deselaers, T., Deserno, T., Clough, P., Kim, E., Hersh, W.: Overview of the imageCLEFmed, medical retrieval and medical annotation tasks. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 595–608. Springer, Heidelberg (2007) 10. Tommasi, T., Caputo, B., Welter, P., G¨ uld, M.O., Deserno, T.M.: Overview of the clef 2009 medical image annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
Topological Localization of Mobile Robots Using Probabilistic Support Vector Classification Yan Gao and Yiqun Li Department of Computer Vision and Image Understanding Institute for Infocomm Research, Singapore {ygao,yqli}@i2r.a-star.edu.sg
Abstract. Topologically localizing a mobile robot using visual information alone is a difficult problem. We propose a localization system that comprises Gaussian derivatives as raw local descriptors, a three-tier spatial pyramid of histograms as the image descriptor, and probabilistic multi-class support vector machines for classification. Based on the probability estimate, the proposed system is able to predict the unknown class which corresponds to locations that are not imaged in the training sequence. To exploit the continuity of the sequence, a smoothing procedure can be applied, which is shown to be simple yet effective.
1
Introduction
The RobotVision@ImageCLEF1 addresses the problem of topological localization of a mobile robot using visual information. The objective is to determine the topological location of a robot based on images acquired with a perspective camera mounted on the robot platform. The obligatory track requires the decision to be based on a single test image. The optional track, on the other hand, allows the use of images acquired before the current test image in a sequence of recording to exploit the continuity of the sequence. Our localization system is mainly built upon a probabilistic multi-class support vector classifier that is included in the LIBSVM software[2]. By making use of the probability estimate of the classification, we are able to handle the unknown class which corresponds to locations that have not been imaged in the training sequence. Simple yet efficient local descriptors based on Gaussian derivatives on the lightness component of the LAB color space are used. Histograms of the local descriptors are extracted using a three-tier spatial pyramid. For the optional task, a smoothing procedure is proposed to enhance the prediction of the current test image based on the predictions made on earlier images in the sequence.
2
The Data
The training and validation data are from the IDOL2 database[6]. The image sequences in the database are acquired using the MobileRobots PowerBot 1
http://imageclef.org/2009/robot
robot platform. In the final experiment for the ImageCLEF robot vision task, one sequence is chosen as the training data, while a previously unreleased image sequence is used for testing. The training sequence consists of 1034 image frames of 5 classes according to the robot’s topological location, namely, one-person office(BO), corridor(CR), two-persons office(EO), kitchen(KT), and printer area(PA). The test sequence consists of 1690 image frames classified into 6 classes, 5 of which are the same as those of the training sequence, and one additional unknown(UK) class corresponding to the additional rooms that are not imaged previously. The test sequence is acquired 20 months after the training sequence. For more details please refer to [1].
3 3.1
System Description Feature Extraction
Each training image in the training sequence is first converted to a representation in the LAB color space. Normalized Gaussian derivatives are obtained on the L component. Altogether 5 partial derivatives (Lx, Ly, Lxx, Lyy, Lxy) are computed on the smoothed L component using Gaussian filters. A code book is built on the computed Gaussian derivatives using k-means clustering, with k = 32. The computed Gaussian derivatives are then quantized into the 32 codewords. A three-tier spatial pyramid [5] of histograms is then obtained on each image, with the three tiers made up by the whole image, 2 × 2 sub-images, and 4 × 4 sub-images. The histograms have 32 bins which correspond to the 32 codewords. Altogether there are 1 + 2 × 2 + 4 × 4 = 21 histograms. Each image is represented by a 21 × 32 = 672 dimensional feature vector. An illustration of the feature extraction process is given in figure 1. 3.2
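A rough sketch of this feature extractor follows (the paper's own implementation is in MATLAB; the Gaussian σ, the use of SciPy/scikit-learn and the helper names are assumptions):

import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def gaussian_derivatives(L, sigma=1.5):
    # Lx, Ly, Lxx, Lyy, Lxy computed on the smoothed lightness channel
    orders = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 1)]
    return np.stack([ndimage.gaussian_filter(L, sigma, order=o) for o in orders], axis=-1)

def pyramid_histogram(L, codebook):
    d = gaussian_derivatives(L).reshape(-1, 5)
    words = codebook.predict(d).reshape(L.shape)
    feats = []
    for cells in (1, 2, 4):                      # three tiers: whole image, 2x2, 4x4
        for rows in np.array_split(words, cells, axis=0):
            for cell in np.array_split(rows, cells, axis=1):
                feats.append(np.bincount(cell.ravel(), minlength=32))
    return np.concatenate(feats)                 # 21 histograms x 32 bins = 672 features

# codebook = KMeans(n_clusters=32).fit(sample_of_derivative_vectors_from_training_images)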
The Classifier
In this section we describe the probabilistic multi-class support vector machine used as the classifier. Binary support vector classification. Over recent years, support vector machines [8] have developed into a state-of-the-art classifier. Let D be a training data set and D = {(x, y)} ∈ (R^d × Y)^{|D|}, where R^d denotes a d-dimensional input space, Y = {±1} denotes the label set of x, and |D| is the size of D. Given D, SVM finds an optimal separating hyperplane which classifies the two classes with the minimal expected test error. Let ⟨w∗, x⟩ + b∗ = 0 denote this hyperplane, where w∗ and b∗ are the normal vector and bias, respectively. w∗ and b∗ can be found by minimizing

Φ(w) = (1/2) ‖w‖² + C Σ_{i=1}^{|D|} ξ_i^p ,   subject to: y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i , i = 1, · · · , |D|    (1)
Histogram
Fig. 1. The original image from the test sequence, its L component of the LAB color space, 5 Gaussian derivatives (Lx Ly Lxx, Lyy, Lxy). The image is then quantized into 32 bins. A histogram is obtained by concatenating histograms obtained on the whole image, the 2 × 2 sub-images, and the 4 × 4 sub-images.
where ξ_i (ξ_i ≥ 0) is the i-th slack variable and C is the regularization parameter controlling the trade-off between function complexity and training error. p = 1 or 2 corresponds to the case of L-1 norm or L-2 norm based soft margin, respectively. The solution of the optimization problem is

w∗ = Σ_{i=1}^{|D|} α_i y_i x_i ,   b∗ = 1 − ⟨w∗, x_s⟩    (2)

where α_i is a non-negative coefficient of x_i and x_s is a support vector. In this way, the optimal separating hyperplane is expressed as

⟨w∗, x⟩ + b∗ = Σ_{i=1}^{|D|} α_i y_i ⟨x_i, x⟩ + b∗ = 0    (3)
In the classification stage, f(x) = ⟨w∗, x⟩ + b∗ is used as the decision function, and a test sample x is labeled as y = sgn[f(x)], where sgn(·) denotes the sign function. The kernel trick [4] can be conveniently embedded into SVM to handle nonlinearly separable patterns. A kernel, k, is defined to be k(x, y) = ⟨φ(x), φ(y)⟩, where φ(·) is the associated mapping from a feature space, R^n, to a kernel space, F. This mapping is often nonlinear, and the dimensionality of F can be high or even infinite. The nonlinearly separable patterns in R^n can become linearly separable in F with higher probability. Hence, the optimal separating hyperplane is constructed in F instead of R^n, and it becomes

f(x) = Σ_{i=1}^{|D|} α_i y_i ⟨φ(x_i), φ(x)⟩ + b∗ = Σ_{i=1}^{|D|} α_i y_i k(x_i, x) + b∗ .    (4)
i=1
The χ2 kernel has been found to give good performance on histogram-based features [3]:

k(x_i, x_j) = exp(−γ χ²(x_i, x_j)) = exp( −γ Σ_{q=1}^{d} (x_i^{(q)} − x_j^{(q)})² / (x_i^{(q)} + x_j^{(q)}) )    (5)
where x_i^{(q)} refers to the q-th dimension of the feature vector x_i, which also corresponds to a bin in the histogram. Probabilistic estimate for binary support vector classification. Instead of predicting the label, one can instead estimate a class probability P(y = +1|x) and P(y = −1|x). In [7], the probability is approximated by

P(y = +1|x) = 1 / (1 + e^{A f(x) + B}) ,   P(y = −1|x) = 1 − P(y = +1|x)    (6)
where f(x) is the decision value of the SVM classifier, and A and B are determined by solving a maximum likelihood problem [7]. Probabilistic estimate for multi-class support vector classification. A multi-class classification problem can be solved by solving pairwise binary classification problems. For a multi-class classification problem that classifies each input x into one of M classes y ∈ Y = {1, ..., M}, pairwise class probabilities r_{mn} = P(y = m | y = m or n, x), m, n = 1, ..., M, m ≠ n, are first computed. The objective is to estimate the multi-class probabilities p_m = P(y = m|x), m = 1, ..., M. In [9] it is proposed that

p = argmin_{p_m, m=1,...,M} Σ_{m=1}^{M} Σ_{n: n≠m} (r_{nm} p_m − r_{mn} p_n)² ,   subject to Σ_{m=1}^{M} p_m = 1 , p_m ≥ 0 , ∀m.    (7)
m=1
It can be proven that equation (7) has a unique solution under mild conditions and can be solved using a linear system of equations. The decision of the multiclass classifier is then given by δ = arg maxm pm
3.3
Handling the Unknown Class
Since the test sequence includes images taken in additional rooms not recorded in the training sequence, we make use of the probability estimate of the SVMs to determine when to classify an image into the unknown (UK) class. The approach taken is to classify images with low confidence of belonging to the winning class into the unknown class. Recall that Σ_m p_m = 1, and the decision made by the SVMs is δ = arg max_m p_m. The probability of the sample being from the winning class is P(y = δ|x) = p_δ. If p_δ is small, it means that the classifier is not very confident about putting the sample into the winning class δ. In this case, the sample is classified as the unknown class. Formally,

y = UK if p_δ < T,    (8)
where T is a threshold that can be tuned on the validation set. 3.4
Exploiting Continuity of the Sequence
The optional track of the robot vision task allows the use of earlier images for the prediction of the current test image. Assume that a sequence of predictions has been made, denoted by y_1, ..., y_i. The smoothing procedure makes a correction to the prediction of the current image y_i. Denote the corrected prediction of y_i as y_i′; then y_i′ is given by

y_i′ = mode(y_{i−h+1}, y_{i−h+2}, ..., y_i),    (9)

where mode is a function that finds the value that occurs most frequently. Note that the original prediction y_i is not overwritten by y_i′: the smoothing procedure on the following images makes use of y_i, not y_i′. Otherwise the same label would propagate through the whole sequence. One drawback of the smoothing procedure is that when the robot enters a new room, the prediction will be delayed by h images.
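The smoothing of Eq. (9) amounts to a sliding-window majority vote over the raw predictions. A short sketch, with the window length h as above and invented example labels:

from collections import Counter

def smooth(predictions, h=20):
    # corrected label for frame i = most frequent raw label among the last h frames;
    # the raw predictions themselves are never overwritten
    corrected = []
    for i in range(len(predictions)):
        window = predictions[max(0, i - h + 1): i + 1]
        corrected.append(Counter(window).most_common(1)[0][0])
    return corrected

print(smooth(["CR", "CR", "KT", "CR", "KT", "KT", "KT"], h=3))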
4 Experiments and Results
The classification system is coded in MATLAB, while we make use of the LIBSVM software as the classifier. We have modified the LIBSVM source code to include the χ2 kernel. The parameters of the classifier include the kernel parameter γ = 5, the regularization parameter of the SVM C = 1, the threshold T = 0.6 for classification into the unknown class, and the number of earlier images h = 20 used for the smoothing procedure in the optional task. All are tuned on the validation set. The classification performance is evaluated according to the rules set for the task: 1 point is awarded for each correctly classified image, -0.5 points for each misclassified image, and 0 points for each image that is not classified (not implemented in the proposed system). Our system gives a score of 784, which corresponds to an accuracy rate of 64.26%. This is ranked 4th in the competition (the
top score is 793). For the optional task, after applying the proposed smoothing procedure, the score rises to 884.5, which corresponds to an accuracy rate of 68.22%. This is ranked 3rd in the competition (the top score is 916.5). The benefit brought by the probabilistic multi-class support vector machines lies in the classification of the unknown class: if we do not make use of the probability estimate to determine the unknown class, the performance drops to a score of 700. The proposed localization system is computationally efficient in both the training and classification phases. On an Intel Core2 Duo CPU E8400 PC, the extraction of pyramid histograms of Gaussian derivatives on each image frame takes about 0.3 seconds (MATLAB implementation). The training of a probabilistic multi-class support vector machine using LIBSVM takes 25 seconds on the whole training set of 1034 images. Classification of each test image takes only about 0.004 seconds.
5 Conclusion
We propose a robot localization system based on the probabilistic multi-class support vector machines. By incorporating the prediction of the unknown class based on the probability estimates, the performance improves from a score of 700 to 784, or 12%. By applying a smoothing procedure to exploit continuity of the sequence, the performance further improves from 784 to 884.5, or 12.8%.
References

1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5), 1055-1064 (1999)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178 (2006)
6. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), San Diego, CA, USA (October 2007)
7. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61-74. MIT Press, Cambridge (1999)
8. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
9. Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975-1005 (2004)
The University of Amsterdam’s Concept Detection System at ImageCLEF 2009 Koen E.A. van de Sande, Theo Gevers, and Arnold W.M. Smeulders Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands http://www.colordescriptors.com
Abstract. Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.
1 Introduction
Robust image retrieval is highly relevant in a world that is adapting to visual communication. Online services like Flickr show that the sheer number of photos available online is too much for any human to grasp. Many people place their entire photo album on the internet. Most commercial image search engines provide access to photos based on text or other metadata, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, associated text or (social) tagging. This results in disappointing retrieval performance when the visual content is not mentioned or properly reflected in the associated text. In addition, when the photos originate from non-English speaking countries, such as China or Germany, querying the content becomes much harder. To cater for robust image retrieval, the promising solutions from the literature are in majority concept-based [1], where detectors are related to objects (like trees), scenes (like a desert), and people (like a big group). Any one of those brings an understanding of the current content. The elements in such a lexicon offer users a semantic entry by allowing them to query on presence or absence of visual content elements.
[Figure 1 (pipeline diagram): input image → point sampling strategy (Harris-Laplace, dense sampling) → color descriptors (ColorSIFT) → bag-of-words (hard assignment, soft assignment, 2x2 spatial pyramid) → fixed-length feature vectors.]
Fig. 1. University of Amsterdam's ImageCLEF 2009 concept detection scheme. The scheme serves as the blueprint for the organization of Section 2.
The Large-Scale Visual Concept Detection Task [2] evaluates 53 visual concept detectors. The concepts used are from the personal photo album domain: beach holidays, snow, plants, indoor, mountains, still-life, small group of people, portrait. For more information on the dataset and concepts used, see the overview paper [2]. Based on our previous work on concept detection [3,4,5], we have focused on improving the robustness of the visual features used in our concept detectors. Systems with the best performance in image retrieval [3,6] and video retrieval [4,7] use combinations of multiple features for concept detection. The basis for these combinations is formed by good color features and multiple point sampling strategies. This paper is organized as follows. Section 2 defines our concept detection system. Section 3 details our experiments and results. Finally, in section 4, conclusions are drawn.
2 Concept Detection System
We perceive concept detection as a combined computer vision and machine learning problem. The first step is to represent an image using a fixed-length feature vector. Given a visual feature vector xi , the aim is then to obtain a measure, which indicates whether semantic concept C is present in photo i. We may choose from various visual feature extraction methods to obtain xi , and use a supervised machine learning approach to learn the appearance relation between C and xi . The supervised machine learning process is composed of two phases: training and testing. In the first phase, the optimal configuration of features is learned from the training data. In the second phase, the classifier assigns a probability p(C|xi ) to each input feature vector for each semantic concept C.
2.1 Point Sampling Strategy
The visual appearance of a concept has a strong dependency on the viewpoint under which it is recorded. Salient point methods [8] introduce robustness against viewpoint changes by selecting points which can be recovered under different perspectives. Another solution is to simply use many points, which is achieved by dense sampling. We summarize our sampling approach in Figure 1: Harris-Laplace and dense point selection, and a spatial pyramid.¹

Harris-Laplace point detector. In order to determine salient points, Harris-Laplace relies on a Harris corner detector. By applying it on multiple scales, it is possible to select the characteristic scale of a local corner using the Laplacian operator [8]. Hence, for each corner the Harris-Laplace detector selects a scale-invariant point if the local image structure under a Laplacian operator has a stable maximum.

Dense point detector. For concepts with many homogeneous areas, like scenes, corners are often rare. Hence, for these concepts relying on a Harris-Laplace detector can be suboptimal. To counter this shortcoming of Harris-Laplace, random and dense sampling strategies have been proposed [9,10]. We employ dense sampling, which samples an image grid in a uniform fashion using a fixed pixel interval between regions. We use an interval distance of 6 pixels and sample at multiple scales (σ = 1.2 and σ = 2.0).

Spatial pyramid. Both Harris-Laplace and dense sampling give an equal weight to all keypoints, irrespective of their spatial location in the image. To overcome this limitation, Lazebnik et al. [11] suggest to repeatedly sample fixed subregions of an image, e.g. 1x1, 2x2, 4x4, etc., and to aggregate the different resolutions into a so-called spatial pyramid. Since every region is an image in itself, the spatial pyramid can be used in combination with both the Harris-Laplace point detector and dense point sampling [12]. For the ideal spatial pyramid configuration, some claim 2x2 is sufficient [11], while others suggest to include 1x3 as well [6]. We use a spatial pyramid of 1x1, 2x2, and 1x3 in our experiments.
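For concreteness, the dense grid and the spatial-pyramid cell lookup can be sketched as follows. The 6-pixel step, the scales 1.2 and 2.0, and the 1x1/2x2/1x3 tilings are taken from the text; the grid offset and the orientation of the 1x3 tiling (three horizontal bars) are assumptions of this sketch.

```python
import numpy as np

def dense_grid(width, height, step=6, scales=(1.2, 2.0)):
    """Enumerate dense sampling points: a uniform grid with a fixed pixel interval,
    replicated at each sampling scale."""
    xs = np.arange(step // 2, width, step)
    ys = np.arange(step // 2, height, step)
    return [(x, y, s) for s in scales for y in ys for x in xs]

def pyramid_cell_indices(x, y, width, height, tilings=((1, 1), (2, 2), (3, 1))):
    """Map an image point to its cell index in each spatial-pyramid tiling (rows, cols)."""
    cells = []
    for rows, cols in tilings:
        r = min(int(y * rows / height), rows - 1)
        c = min(int(x * cols / width), cols - 1)
        cells.append(r * cols + c)
    return cells
```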
2.2 Color Descriptor Extraction
In the previous section, we addressed the dependency of the visual appearance of semantic concepts on the viewpoint under which they are recorded. However, the lighting conditions during photography also play an important role. We [3] analyzed the properties of color descriptors under classes of illumination changes within the diagonal model of illumination change, and specifically for data sets consisting of Flickr images. In ImageCLEF, the images used also originate from Flickr. Here we use the four color descriptors from the recommendation table in [3]. The descriptors are computed around salient points obtained from the Harris-Laplace detector and dense sampling. For the color descriptors in Figure 1, each of those four descriptors can be inserted.

¹ Software to perform point sampling, color descriptor computation and the hard and soft assignment is available from http://www.colordescriptors.com
SIFT. The SIFT feature proposed by Lowe [13] describes the local shape of a region using edge orientation histograms. The gradient of an image is shift-invariant: taking the derivative cancels out offsets [3]. Under light intensity changes, i.e. a scaling of the intensity channel, the gradient direction and the relative gradient magnitude remain the same. Because the SIFT feature is normalized, gradient magnitude changes have no effect on the final feature. To compute SIFT features, we use the version described by Lowe [13].

OpponentSIFT. OpponentSIFT describes all the channels in the opponent color space using SIFT features. The information in the O3 channel is equal to the intensity information, while the other channels describe the color information in the image. The feature normalization, as effective in SIFT, cancels out any local changes in light intensity.

C-SIFT. The C-SIFT feature uses the C invariant [14], which can be intuitively seen as the gradient (or derivative) of the normalized opponent color space O1/I and O2/I. The I intensity channel remains unchanged. C-SIFT is known to be scale-invariant with respect to light intensity. See [15,3] for a detailed evaluation.

RGB-SIFT. For RGB-SIFT, the SIFT feature is computed for each RGB channel independently. Due to the normalizations performed within SIFT, it is equal to transformed color SIFT [3]. The feature is scale-invariant, shift-invariant, and invariant to light color changes and shift.
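As a small illustration of the opponent color space referred to above, the conversion from RGB can be sketched as follows; the normalization constants follow the common definition used in [3], and this is a sketch rather than the exact implementation behind the descriptors.

```python
import numpy as np

def rgb_to_opponent(rgb):
    """Convert an RGB image (H x W x 3 float array) to the opponent color space.

    O1 and O2 carry the color information, O3 carries the intensity information.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    O1 = (R - G) / np.sqrt(2.0)
    O2 = (R + G - 2.0 * B) / np.sqrt(6.0)
    O3 = (R + G + B) / np.sqrt(3.0)
    return np.stack([O1, O2, O3], axis=-1)
```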
2.3 Bag-of-Words Model
We follow the well-known bag-of-words model, also known as the codebook approach, see e.g. [16,10,17,18,3]. First, we assign visual descriptors to discrete codewords predefined in a codebook. Then, we use the frequency distribution of the codewords as a feature vector representing an image. We construct a codebook with a maximum size of 4096 using k-means clustering. An important issue is codeword assignment. A comparison of codeword assignment methods is presented in [18]. Here we only detail two codeword assignment methods:

– Hard assignment. Given a codebook of codewords, the traditional codebook approach assigns each descriptor to the single best representative codeword in the codebook. Basically, an image is represented by a histogram of codeword frequencies describing the probability density over codewords.
– Soft assignment. The traditional codebook approach may be improved by using soft assignment through kernel codebooks. A kernel codebook uses a kernel function to smooth the hard assignment of image features to codewords. Out of the various forms of kernel codebooks, we selected codeword uncertainty based on its empirical performance [18].

Each of the possible sampling methods from Section 2.1, coupled with each visual descriptor from Section 2.2 and an assignment approach, results in a separate visual codebook. An example is a codebook based on dense sampling of RGB-SIFT
features in combination with hard-assignment. Naturally, various configurations can be used to combine multiple of these choices. For simplicity, we employ equal weights in our experiments when combining different features.
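A hedged sketch of the two assignment schemes described above is given below: with sigma=None each descriptor votes for its nearest codeword (hard assignment), otherwise every codeword receives a Gaussian-weighted vote. The Gaussian form and the parameter sigma are assumptions standing in for the codeword-uncertainty formulation of [18].

```python
import numpy as np
from scipy.spatial.distance import cdist

def bow_histogram(descriptors, codebook, sigma=None):
    """Build a bag-of-words vector from local descriptors (n x d) and a codebook (k x d)."""
    d = cdist(descriptors, codebook)                   # (n, k) distances to codewords
    if sigma is None:                                  # hard assignment
        hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    else:                                              # soft assignment (kernel codebook)
        w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True) + 1e-12      # each descriptor distributes one vote
        hist = w.sum(axis=0)
    return hist / hist.sum()
```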
2.4 Machine Learning
The supervised machine learning process is composed of two phases: training and testing. In the first phase, the optimal configuration of features is learned from the training data. From all machine learning approaches on offer to learn the appearance relation between C and x_i, the support vector machine is commonly regarded as a solid choice [19]. Here we use the LIBSVM implementation [20] with probabilistic output [21]. The parameter of the support vector machine we optimize is C. In order to handle the imbalance in the number of positive versus negative training examples, we fix the weights of the positive and negative class by estimation from the class priors on the training data. It was shown by Zhang et al. [17] that in a codebook approach to concept detection the earth mover's distance and the χ2 kernel are to be preferred. We employ the χ2 kernel, as it is less expensive in terms of computation. In the second machine learning phase, the classifier assigns a probability p(C|x_i) to each input feature vector for each semantic concept C, i.e. the trained model is applied to the test data.
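The training step can be sketched with scikit-learn's support vector classifier on a precomputed χ2 kernel. The exact weighting rule below (each class weighted by the other class's prior) is an assumption, since the text only states that the weights are fixed from the class priors, and the authors actually use LIBSVM rather than scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_concept_detector(X_train, y, C=1.0, gamma=1.0):
    """Train one binary concept detector (labels in {-1, +1}) on a precomputed chi2 kernel."""
    y = np.asarray(y)
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)
    n = len(y)
    weights = {1: np.sum(y == -1) / n, -1: np.sum(y == 1) / n}   # rarer class weighted up
    clf = SVC(C=C, kernel="precomputed", probability=True, class_weight=weights)
    clf.fit(K_train, y)
    return clf

# Illustrative usage for p(C | x_i) on test images:
# clf = train_concept_detector(X_train, y_train)
# K_test = chi2_kernel(X_test, X_train)
# probabilities = clf.predict_proba(K_test)[:, list(clf.classes_).index(1)]
```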
3 Concept Detection Experiments

3.1 Submitted Runs

We have submitted five different runs. All runs use both Harris-Laplace and dense sampling with the SVM classifier. We do not use the EXIF metadata provided for the photos.

– OpponentSIFT: single color descriptor with hard assignment.
– 2-SIFT: uses OpponentSIFT and SIFT descriptors.
– 4-SIFT: uses OpponentSIFT, C-SIFT, RGB-SIFT and SIFT descriptors.
– Soft 4-SIFT: uses OpponentSIFT, C-SIFT, RGB-SIFT and SIFT descriptors with soft assignment. The soft assignment parameters have been taken from our PASCAL VOC 2008 system [3].
– Rescaled 4-SIFT: the same ordering of images as 4-SIFT, but with all concept detector outputs linearly scaled so the number of images with a score > 0.5 is equal to the concept prior probability in the training set.
3.2 Evaluation per Concept
In table 1, the overall scores for the evaluation of concept detectors are shown. As for the evaluation of single detectors only the ranking of the images within a single concept matters, the rescaled version of 4-SIFT achieves the exact same performance as 4-SIFT. We note that the 4-SIFT run with hard assignment
Table 1. Overall results of our runs evaluated over all concepts in the Photo Annotation task using the equal error rate (EER) and the area under the curve (AUC)

Run name       Codebook          Average EER   Average AUC
4-SIFT         Hard-assignment   0.2345        0.8387
Soft 4-SIFT    Soft-assignment   0.2355        0.8375
2-SIFT         Hard-assignment   0.2435        0.8300
OpponentSIFT   Hard-assignment   0.2530        0.8217
Table 2. Results per concept for our runs in the Large-Scale Visual Concept Detection Task using the Area Under the Curve. The highest score per concept is highlighted using a grey background. The concepts are ordered by their highest score.

Concept                4-SIFT   Soft 4-SIFT   2-SIFT   Opp. SIFT
Clouds                 0.958    0.958         0.951    0.945
Sunset-Sunrise         0.953    0.954         0.947    0.946
Sky                    0.945    0.948         0.935    0.930
Landscape-Nature       0.944    0.942         0.940    0.936
Sea                    0.935    0.930         0.932    0.926
Mountains              0.934    0.931         0.930    0.922
Lake                   0.911    0.903         0.912    0.900
Beach-Holidays         0.906    0.907         0.898    0.884
Trees                  0.903    0.902         0.892    0.881
Water                  0.901    0.903         0.892    0.886
Night                  0.898    0.895         0.895    0.892
River                  0.897    0.889         0.891    0.883
Outdoor                0.890    0.896         0.879    0.871
Food                   0.895    0.895         0.881    0.877
Desert                 0.891    0.865         0.891    0.884
Building-Sights        0.880    0.882         0.873    0.861
Big-Group              0.881    0.877         0.870    0.858
Plants                 0.877    0.881         0.853    0.839
Flowers                0.868    0.875         0.846    0.836
Autumn                 0.870    0.866         0.863    0.849
Portrait               0.865    0.864         0.857    0.846
Underexposed           0.858    0.859         0.857    0.854
No-Persons             0.850    0.858         0.837    0.826
Partly-Blurred         0.852    0.852         0.845    0.830
Winter                 0.843    0.846         0.832    0.828
Snow                   0.846    0.845         0.829    0.825
Day                    0.841    0.845         0.831    0.824
No-Blur                0.843    0.845         0.836    0.823
No-Visual-Time         0.833    0.835         0.822    0.815
Indoor                 0.830    0.835         0.823    0.810
Family-Friends         0.834    0.834         0.822    0.813
Partylife              0.834    0.834         0.831    0.819
Vehicle                0.832    0.832         0.832    0.822
Animals                0.818    0.828         0.811    0.797
Citylife               0.826    0.826         0.819    0.813
Still-Life             0.824    0.825         0.808    0.795
Spring                 0.822    0.801         0.812    0.791
Canvas                 0.817    0.810         0.803    0.790
Summer                 0.813    0.813         0.791    0.782
Macro                  0.812    0.791         0.805    0.795
No-Visual-Season       0.805    0.806         0.794    0.782
Small-Group            0.792    0.795         0.784    0.776
Single-Person          0.792    0.795         0.780    0.769
Out-of-focus           0.792    0.781         0.784    0.774
No-Visual-Place        0.789    0.786         0.781    0.779
Overexposed            0.788    0.782         0.777    0.771
Neutral-Illumination   0.778    0.783         0.775    0.774
Sunny                  0.763    0.765         0.744    0.741
Motion-Blur            0.744    0.747         0.725    0.710
Sports                 0.695    0.695         0.679    0.673
Aesthetic-Impression   0.658    0.662         0.657    0.657
Overall-Quality        0.656    0.656         0.653    0.658
Fancy                  0.565    0.559         0.580    0.583
Average                0.8387   0.8375        0.8300   0.8217
achieves not only the highest performance amongst our runs, but also over all other runs submitted to the Large-Scale Visual Concept Detection task. In table 2, the Area Under the Curve scores have been split out per concept. We observe that the three aesthetic concepts have the lowest scores. This comes as no surprise, because these concepts are highly subjective: even human annotators only agree around 80% of the time with each other. For virtually all concepts besides the aesthetic ones, either the Soft 4-SIFT or the Hard 4-SIFT is the best run. This confirms our beliefs that these (color) descriptors are not redundant when used in combinations. Therefore, we recommend the use of these 4 descriptors instead of 1 or 2. The difference in overall performance between the Soft 4-SIFT or the Hard 4-SIFT run is quite small. Because the soft codebook
Table 3. Results using the hierarchical evaluation measures for our runs in the Large-Scale Visual Concept Detection Task

                                     Average Annotation Score
Run name          Codebook          with agreement   without agreement
Soft 4-SIFT       Soft-assignment   0.7647           0.7400
4-SIFT            Hard-assignment   0.7623           0.7374
2-SIFT            Hard-assignment   0.7581           0.7329
OpponentSIFT      Hard-assignment   0.7491           0.7232
Rescaled 4-SIFT   Hard-assignment   0.7398           0.7199
assignment smoothing parameter was directly taken from a different dataset, we expect that the soft assignment run could be improved if the soft assignment parameter was selected with cross-validation on the training set. Together, our runs obtain the highest Area Under the Curve scores for 40 out of 53 concepts in the Photo Annotation task (20 for Soft 4-SIFT, 17 for 4-SIFT and 3 for the other runs). This analysis has shown us that our system is falling behind for concepts that correspond to conditions we have included invariance against. Our method is designed to be robust to unsharp images, so for Out-of-focus, Partly-Blurred and No-Blur there are better approaches possible. For the concepts Overexposed, Underexposed, Neutral-Illumination, Night and Sunny, recognizing how the scene is illuminated is very important. Because we are using invariant color descriptors, a lot of the discriminative lighting information is no longer present in the descriptors. Again, there should be better approaches possible for these concepts, such as estimating the color temperature and overall light intensity.
3.3 Evaluation per Image
For the hierarchical evaluation, overall results are shown in table 3. When compared to the evaluation per concept, the Soft 4-SIFT run is now slightly better than the normal 4-SIFT run. Our attempt to improve performance for the hierarchical evaluation measure using a linear rescaling of the concept likelihoods has had the opposite effect: the normal 4-SIFT run is better than the Rescaled 4-SIFT run. Therefore, further investigation into building a cascade of concept classifiers is needed, as simply using the individual concept classifiers with their class priors does not work.
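The Rescaled 4-SIFT run described in Section 3.1 only constrains how many images end up above 0.5; one possible monotone (order-preserving) realization of that constraint is sketched below. This particular piecewise-linear mapping is an assumption of the sketch, not necessarily the transformation the authors used.

```python
import numpy as np

def rescale_to_prior(scores, prior):
    """Rescale detector outputs in [0, 1] so that a fraction `prior` of the images
    scores above 0.5, while preserving the original ranking."""
    scores = np.asarray(scores, dtype=float)
    t = np.quantile(scores, 1.0 - prior)        # threshold that puts `prior` of images on top
    high = 0.5 + 0.5 * (scores - t) / max(1.0 - t, 1e-12)
    low = 0.5 * scores / max(t, 1e-12)
    return np.where(scores >= t, high, low)
```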
4 Conclusion
Our focus on invariant visual features for concept detection in ImageCLEF 2009 has been successful. It has resulted in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 individual concepts, we obtain the best performance of all submissions to the task. For the hierarchical evaluation, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors. Acknowledgements. Special thanks to Cees Snoek, Jasper Uijlings and Jan van Gemert for providing valuable input and their cooperation over the years.
References 1. Snoek, C.G.M., Worring, M.: Concept-based video retrieval. FTIR 4(2), 215–322 (2009) 2. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 3. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on PAMI (in press, 2010) 4. Snoek, C.G.M., van de Sande, K.E.A., de Rooij, O., Huurnink, B., van Gemert, J.C., Uijlings, J.R.R., et al.: The MediaMill TRECVID 2008 semantic video search engine. In: TRECVID Workshop (2008) 5. Uijlings, J.R.R., Smeulders, A.W.M., Scha, R.J.H.: Real-time bag-of-words, approximately. In: ACM CIVR (2009) 6. Marszalek, M., Schmid, C., Harzallah, H., van de Weijer, J.: Learning object representations for visual object class recognition. In: Visual Recognition Challenge Workshop, in Conjunction with IEEE ICCV (2007) 7. Wang, D., Liu, X., Luo, L., Li, J., Zhang, B.: Video diver: generic video indexing with diverse features. In: ACM MIR, Augsburg, Germany, pp. 61–70 (2007) 8. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: A survey. FTCGV 3(3), 177–280 (2008) 9. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: IEEE CVPR, vol. 2, pp. 524–531 (2005) 10. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: IEEE ICCV, Beijing, China, pp. 604–610 (2005) 11. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE CVPR, vol. 2, pp. 2169–2178 (2006) 12. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: A comparison of color features for visual concept classification. In: ACM CIVR, pp. 141–150 (2008) 13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 14. Geusebroek, J.M., van den Boomgaard, R., Smeulders, A.W.M., Geerts, H.: Color invariance. IEEE Transactions on PAMI 23(12), 1338–1350 (2001) 15. Burghouts, G.J., Geusebroek, J.M.: Performance evaluation of local color invariants. CVIU 113, 48–62 (2009) 16. Leung, T.K., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV 43(1), 29–44 (2001) 17. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV 73(2), 213–238 (2007) 18. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Visual word ambiguity. IEEE Transactions on PAMI (in press, 2010) 19. Vapnik, V.N.: The Nature of Statistical Learning Theory. 2nd edn (2000) 20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~ cjlin/libsvm 21. Lin, H.T., Lin, C.J., Weng, R.C.: A note on Platt’s probabilistic outputs for support vector machines. ML 68(3), 267–276 (2007)
Enhancing Recognition of Visual Concepts with Primitive Color Histograms via Non-sparse Multiple Kernel Learning Alexander Binder1 and Motoaki Kawanabe1,2 1
Fraunhofer Institute FIRST, Kekuléstr. 7, 12489 Berlin, Germany {alexander.binder,motoaki.kawanabe}@first.fraunhofer.de 2 TU Berlin, Franklinstr. 28/29, 10587 Berlin, Germany
Abstract. In order to achieve good performance in image annotation tasks, it is necessary to combine information from various image features. In recent competitions on photo annotation, many groups employed bag-of-words (BoW) representations based on SIFT descriptors over various color channels. In fact, it has been observed that adding other, less informative features to the standard BoW degrades recognition performance. In this contribution, we will show that even primitive color histograms can enhance the standard classifiers in the ImageCLEF 2009 photo annotation task, if the feature weights are tuned optimally by the non-sparse multiple kernel learning (MKL) proposed by Kloft et al. Additionally, we will propose a sorting scheme of image subregions to deal with spatial variability within each visual concept.
1 Introduction
Recent research results show that combining information from various image features is inevitable to achieve good performance in image annotation tasks. With the support vector machine (SVM) [1,2], this is implemented by mixing kernels (similarities between images) constructed from different image descriptors with appropriate weights. For instance, the average kernel with uniform weights or the optimal kernel trained by multiple kernel learning (called ℓ1-MKL later) have been used so far. Since the sparse ℓ1-MKL tends to overfit by ignoring quite a few kernels, Kloft et al. [3] proposed the non-sparse MKL with an ℓp-regularizer (p ≥ 1), which bridges the average kernel (p = ∞) and ℓ1-MKL. The non-sparse MKL has been successfully applied to object classification tasks; it could outperform the two baseline methods by optimizing the tuning parameter p ≥ 1 through cross validation. In particular, it is useful for combining less informative features such as color histograms with the standard bag-of-words (BoW) representations [4]. We will show that with ℓp-MKL additional simple features can enhance the classification performance for some visual concepts in the ImageCLEF 2009 photo annotation task [5], while with the average kernel they just degrade recognition rates. Since the images are not aligned, we will also propose a sorting scheme of image subregions to deal with the spatial variability when computing similarities between different images.
2 Features and Kernels Used in Our Experiments
Features. For the following experiments, we prepared two kinds of image features: one is the BoW representation based on the SIFT descriptors [6] and the other is the pyramid histograms [7] of color intensities (PHoCol). The BoW features were constructed in a standard way. By the code used in [8], the SIFT descriptors were computed on a dense grid of step size six over multiple color channels: red, green, blue, and grey. Then, for both the grey and the combined red-green-blue channels, 4000 visual words (prototypes) were generated by using k-means clustering with large sets of SIFT descriptors selected randomly from the training images, in analogy to [9]. For each image, one of the visual words was assigned to the base SIFT at each grid point and the set of words was summarized in a histogram within each cell of the spatial tilings 1 × 1, 2 × 2 and 3 × 1 [7]. Finally, we obtained 6 BoW features (2 colors × 3 pyramid levels). On the other hand, the PHoCol features were computed by making histograms of color intensities with 10 bins within each cell of the spatial tilings 4 × 4 and 8 × 8 for various color channels: grey, opponent color 1, opponent color 2, normalized red, normalized green, normalized blue. The finer pyramid levels were considered because the intensity histograms usually contain only little information.

Sorting the color histograms. The spatial pyramid representation [7] is very useful, in particular, when annotating aligned images, because we can incorporate spatial relations of visual contents in images properly. However, if we want to use histograms on higher-level pyramid tilings (4 × 4 and 8 × 8) as parts of input features for general digital photos of the annotation task, it is necessary to handle large spatial variability within each visual concept. Therefore, we propose to sort the cells within a pyramid tiling according to the slant of the histograms. Mathematically, our sort criterion sl(h) is defined as

a[h]_i = \sum_{k \le i} h_k, \qquad \bar{a}[h]_i = \frac{a[h]_i}{\sum_k a[h]_k}, \qquad sl(h) = -\sum_i \bar{a}[h]_i \ln(\bar{a}[h]_i).   (1)

The idea behind the criterion can be explained intuitively. The accumulation process a[h] maps a histogram h with only one peak at the highest intensity bin to the minimum entropy distribution (Fig. 1, left) and one with only one peak at the lowest intensity bin to the maximum entropy distribution (Fig. 1, right). If the original histogram h is flat, the accumulated histogram a[h] becomes a linearly increasing function which has an entropy in between the two extremes (Fig. 1, middle). On the other hand, it is natural to think that all possible permutations π are not equally likely in the sorting of the image cells. In many cases, spatial positions of visual concepts can change more horizontally than vertically (e.g. sky, sea). Therefore, we introduced a sort cost in order to punish large changes of the vertical positions of the image cells before and after sorting:

sc(\pi) = C \sum_k \max(v(\pi(k)) - v(k), 0).   (2)
Fig. 1. Explanation of the slant score. Upper: intensity histograms. Lower: corresponding histograms accumulated.
Here v(i) denotes the vertical position of the i-th cell within a pyramid tiling, and the constant C is chosen such that the sort cost is upper-bounded by one and lies in a similar range as the χ2-distance between the color histograms. The sort cost is used to modify the PHoCol kernels: when comparing images x and y, the squared difference between the sort costs is added to the χ2-distance between the color histograms,

k(x, y) = \exp\big[-\sigma \{ d_{\chi^2}(h_x, h_y) + (sc_x - sc_y)^2 \}\big].   (3)
In our experiments, we computed the sorted PHoCol features on 4 × 4 and 8 × 8 pyramid tilings and constructed kernels with and without the sort cost modification. Although the intensity-based features have a lesser performance as standalone image descriptors even after the sorting modifications, combining them with the standard BoW representations can enhance performances in some of the 53 classification problems in the ImageCLEF09 task with almost no additional computation costs. Kernels. We used the χ2 -kernel except for the cases that the sort cost was incorporated. The kernel width was set to be the mean of the inner χ2 -distances computed over the training data. All kernels were normalized.
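A small sketch of Eqs. (1)-(3) is given below. The value of C and the convention for the permutation in the sort cost (which of the two cell positions enters first) are assumptions of the sketch; in the paper C is chosen so that the cost is bounded by one, and the kernel width σ is set to the mean χ2-distance on the training data.

```python
import numpy as np

def slant_score(h):
    """Eq. (1): entropy of the normalized accumulated histogram."""
    a = np.cumsum(np.asarray(h, dtype=float))
    a /= a.sum()
    a = a[a > 0]                                  # treat 0 * ln(0) as 0
    return -np.sum(a * np.log(a))

def sort_cells(cell_histograms, cols, C=0.1):
    """Sort the cells of one pyramid tiling by slant and return the sort cost of Eq. (2).

    cell_histograms are given in row-major order; v(i) = i // cols is the row of cell i.
    """
    order = np.argsort([slant_score(h) for h in cell_histograms])
    cost = C * sum(max((new // cols) - (old // cols), 0)
                   for new, old in enumerate(order))
    return [cell_histograms[i] for i in order], cost

def chi2_distance(hx, hy, eps=1e-10):
    hx, hy = np.asarray(hx, dtype=float), np.asarray(hy, dtype=float)
    return np.sum((hx - hy) ** 2 / (hx + hy + eps))

def phocol_kernel(hx, scx, hy, scy, sigma=1.0):
    """Eq. (3): chi2 kernel on sorted histograms, modified by the sort-cost difference."""
    return np.exp(-sigma * (chi2_distance(hx, hy) + (scx - scy) ** 2))
```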
3 Experimental Results
We aim at showing an improvement over a gold standard represented by BoW features with the average kernel, while lacking the ground truth on the test data. Therefore we evaluated all settings using 10-fold cross validation on the ImageCLEF09 photo annotation training data, consisting of 5000 images. This allows us to perform statistical testing and to predict generalization errors for selecting better methods/models. The BoW baseline is a reduced version of our ImageCLEF submission described in the working notes. The submitted version gave rise to results behind the ISIS and INRIA-LEAR groups by AUC
(margins 0.022, 0.006) and by EER also behind CVIUI2R (margins 0.02, 0.005, 0.001). XRCE and CVIUI2R performed better by the hierarchy measure (margins 0.013, 0.014). We report in this section a performance comparison between SVMs with the average kernels, the sparse ℓ1-MKL, and the non-sparse ℓp-MKL [3]. In ℓp-MKL, the tuning parameter p is selected for each class from the set {1.0625, 1.125, 1.25, 1.5, 2} by cross validation scores, and the regularization parameter of the SVMs was fixed to one. We chose the average precision (AP) as the evaluation criterion, which is also employed in the Pascal VOC Challenges, due to its sensitivity to smaller changes even when AUC values are already saturated above 0.9. This rank-based measure is invariant against the actual choice of a bias. We did not employ the equal error rate (EER), because it can suffer from the unbalanced sizes of the ImageCLEF09 annotations. We remark that several classes have fewer than 100 positive samples, and generally no learning algorithm generalizes well in such cases. We will pose four questions and present experimental results to answer them in the following.

Does MKL help for combining the bag-of-words features? Our first question is whether the MKL techniques are useful compared to the average kernel SVMs for combining the default 6 BoW features. The upper panel of Fig. 2 shows the performance differences between ℓp-MKL with class-wise optimal p's and SVMs with the average kernels over all 53 categories. The classes are sorted as in the guidelines of the ImageCLEF09 photo annotation task. In general, we see just minor improvements by applying ℓp-MKL in 33 out of 53 classes, and for only one class it achieved a major gain. Seemingly, the chosen BoW features have on average similar information content. The cross-validated scores (average AP 0.4435 and average AUC 0.8118) of the baseline method imply that these 6 BoW features contributed mostly to the final results of our group achieved on the test data of the annotation task. On the other hand, the lower panel of Fig. 2 indicates that the canonical ℓ1-MKL is not a good idea in this case. On average over all classes ℓ1-MKL gives worse results compared to the baseline. We attribute this to the harmful effects of sparsity in noisy image data. Our observations are quantitatively supported by a Wilcoxon signed rank test (significance level α = 0.05) which can tell the significance of the performance differences. For ℓp-MKL vs the average kernel SVM, we have 10 statistically significant classes with 5 gains and 5 losses, while there are 12 statistically significant losses and only one gain in the comparison between ℓ1-MKL and the average kernel baseline.

Do sorted PHoCol features improve the BoW baseline? To answer this question we compared classifiers which take both the BoW and PHoCol features with baselines which rely only on the BoW representations. For each of the two cases and each class, we selected the best result in the AP score among various settings which will be explained later.
Fig. 2. Class-wise performance differences when combining the 6 BoW features. Upper: ℓp-MKL vs the average kernel SVM. Lower: ℓ1-MKL vs the average kernel SVM. Baseline mean AP 0.4434, mean AUC 0.8118.
For combinations of the BoW and PHoCol features, we considered the six sets of base kernels in Table 1. For each set, the kernel weights are learned by ℓp-MKL with the tuning parameters p ∈ {1, 1.0625, 1.25, 1.5, 2}. The baselines with the 6 BoW features only were also computed, by taking the best result from ℓp-MKL and the average kernel SVM. In Fig. 3, we can see several classes with larger improvements over the BoW baseline by employing the full setup including PHoCol features with the sort modification and the optimal kernel weights learned by ℓp-MKL. We also see slight decreases of the AP score on 6 classes out of all 53, where the worst setback is just of size 0.004. In fact, these are rather minor compared to the large gains on their complement. Note that the combinations with PHoCol did not include the average kernel SVM as an option, while the best performances with the BoW only could be achieved by the average kernel SVM. Thanks to the flexibility of ℓp-MKL, classification performances of the larger combination (PHoCol+BoW) were never much worse than those of the standard BoW classifiers, even though the PHoCols are much less informative.
Table 1. The sets of base kernels tested

set no.   BoWs    sorted PHoCols (color)         sort costs   spatial tiling
1         all 6   opponent color 1 & 2           no           both 4 × 4 & 8 × 8
2         all 6   opponent color 1 & 2           yes          both 4 × 4 & 8 × 8
3         all 6   grey                           no           both 4 × 4 & 8 × 8
4         all 6   grey                           yes          both 4 × 4 & 8 × 8
5         all 6   normalized red, green, blue    no           both 4 × 4 & 8 × 8
6         all 6   normalized red, green, blue    yes          both 4 × 4 & 8 × 8
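Once the per-class weights have been learned, the combined kernel used by the SVM is simply a weighted sum of the base kernel matrices; the sketch below only forms that sum and leaves out the actual ℓp-constrained weight optimization of [3], so the uniform default corresponds to the average-kernel baseline.

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Form the mixed kernel K = sum_m beta_m * K_m from base kernel matrices.

    kernels: array-like of shape (M, n, n); beta: weights of length M.
    beta=None gives uniform weights (the average-kernel baseline); in lp-MKL the
    weights are learned jointly with the SVM subject to ||beta||_p <= 1, beta >= 0.
    """
    kernels = np.asarray(kernels, dtype=float)
    if beta is None:
        beta = np.full(len(kernels), 1.0 / len(kernels))
    return np.tensordot(np.asarray(beta, dtype=float), kernels, axes=1)
```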
Fig. 3. Class-wise performance gains by the combination of the PHoCol and BoW over the standard BoW only. The baseline has mean AP 0.4434 and mean AUC 0.8118.
The gains were statistically significant according to Wilcoxon signed rank test with the level α = 0.05 on the 9 classes: Winter (13), Sky (21), Day (28), Sunset Sunrise (32), Underexposed (38), Neutral Illumination (39), Big Group (46), No Persons(47) and Aesthetic Impression (51) in Fig. 3. This is not surprising, as we would expect for these outdoor classes to have a certain color profile, while the two ’No’ and ’Neutral’ classes have a large number of samples for generalization via the learning algorithm. We remark that the sorted PHoCol features are very fast to compute and that the MKL training times are negligible compared to those necessary for computing SIFT descriptors, clustering and assigning visual words. Actually we could compute the PHoCol kernels on the fly. In summary, the result of this experiment shows that combining additional features with lower standalone performance can further improve recognition performances. In the next experiment we show that the non-sparse MKL is the key to the gain brought by the sorted PHoCol features. Does averaging suffice for combining extra PHoCol features? We consider again the same two cases (i.e. PHoCol+BoW vs Bow only) as in the previous experiments. In the first case, the average kernels are always used as the
Fig. 4. Class-wise performance differences. Upper: combined PHoCol and BoW with the average kernel vs the baseline with BoW only. Lower: combined PHoCol and BoW with ℓp-MKL vs the same features with the average kernel.
combination of the base kernels in each set instead of ℓp-MKL, and the best AP score was obtained for each class and each case. The performances of the second case were calculated in the same way as in the last experiment. From the upper panel of Fig. 4, we see a mixed result with more losses than gains. That is, the average kernels of PHoCol and BoW rather degrade the performance compared to the baselines with BoW only. Additionally, for the combination of PHoCol and BoW, we compared ℓp-MKL with the average kernel SVMs in the lower panel of Fig. 4. This result shows clearly that the average kernel fails in the combination of highly and poorly expressive features throughout most classes. We conclude that the non-sparse MKL techniques are essential to achieve further gains by combining extra sorted PHoCol features with the BoW representations.

Does the sort modification improve the PHoCol features? The default PHoCol features gave substantially better performances for the classes snow and desert, on which the sorted ones improve only somewhat compared to BoW models. We assume that the higher importance of color, together with the low spatial variability of color distributions in these concepts, explains the gap. The default PHoCols without sorting degraded performances strongly in three other
classes, where the sorted version does not lead to losses. In this sense, the sorting modification seems to make classifiers more stable on average over all classes.
4 Conclusions
We have shown that primitive color histograms can further enhance recognition performance over the standard procedure using BoW representations for most visual concepts of the ImageCLEF 2009 photo annotation task, if they are combined optimally by the recently developed non-sparse MKL techniques. This fact was not known before, and nobody has pursued this direction, because the average kernels constructed from such heterogeneous features degrade classification performance substantially due to high noise in the least informative kernels. Furthermore, we gave insights and evidence on when ℓp-MKL is particularly useful: it can achieve better performance when combining informative and noisy features, even if the average kernel SVMs and the sparse ℓ1-MKL fail.

Acknowledgements. We would like to thank Shinichi Nakajima, Marius Kloft, Ulf Brefeld and Klaus-Robert Müller for fruitful discussions. This work was supported in part by the Federal Ministry of Economics and Technology of Germany (BMWi) under the project THESEUS (01MQ07018).
References

1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
2. Müller, K.R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12(2), 181-201 (2001)
3. Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Müller, K.R., Zien, A.: Efficient and accurate Lp-norm multiple kernel learning. In: Advances in Neural Information Processing Systems (NIPS) (2009)
4. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV 2004, Prague, Czech Republic, pp. 1-22 (May 2004)
5. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large-scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comp. Vis. 60(2), 91-110 (2004)
7. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of CVPR 2006, New York, USA, pp. 2169-2178 (2006)
8. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pat. Anal. & Mach. Intel. 27(10), 1615-1630 (2005)
9. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pat. Anal. & Mach. Intel. (2010)
Using SIFT Method for Global Topological Localization for Indoor Environments Emanuela Boroş, George Roşca, and Adrian Iftene UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania {emanuela.boros,george.rosca,adiftene}@info.uaic.ro
Abstract. This paper gives a brief description of our system as one solution to the problem of global topological localization for indoor environments. The experiment involves analyzing images acquired with a perspective camera mounted on a robot platform and applying a feature-based method (SIFT) and two main systems in order to search and classify the given images. To obtain acceptable results and improved performance, the algorithm reaches two main maturity levels: one capable of running in real time while taking care of the computer's resources, and the other capable of correctly classifying the input images. One of the principal benefits of the developed system is a server-client architecture that brings efficiency, along with statistical methods that improve the quality of the data.
1 Introduction

A proper understanding of human learning is important to consider when making any decision. Our need to imitate the human capability of learning has become a main purpose of science, and so methods are needed to summarize, describe and classify collections of data. Recognizing places where we have been before and remembering objects that we have seen are difficult operations to imitate or interpret. This paper addresses the Robot Vision task¹, hosted for the first time in 2009 by ImageCLEF. The task addresses the problem of topological localization of a mobile robot using visual information. Specifically, we were asked to determine the topological location of a robot based on images acquired with a perspective camera mounted on a robot platform. We received training data consisting of an image sequence recorded in a five-room subsection of an indoor environment under fixed illumination conditions and at a given time [1]. Our system is able to globally localize the robot, i.e. to estimate the robot's position even if the robot is passively moved from one place to another within the mapped area. The process of recognizing the robot's position amounts to answering the question Where are you? (with possible answers: I'm in the kitchen, I'm in the corridor, etc.). This is achieved by combining an improved algorithm for object recognition from [5] with an algorithm for classifying amounts of data. The system is divided into two main managers for classifying images: one that we call the brute finder and the other the managed finder. The managed finder is an
Robot Vision task: http://www.imageclef.org/2009/robot
algorithmic search that picks the most representative images for every room. The brute finder applies an algorithm for extracting image features to every image in every room, creating meta files for all the images without exception. The rest of the paper is organized as follows: in section 2 we describe our system (the UAIC system), separated into its main components. Results and evaluation of the system are reported in sections 3 and 4. In the last section, conclusions regarding our participation in the Robot Vision task at ImageCLEF 2009 are drawn.
2 UAIC System

Our system is based mainly on the SIFT algorithm [5, 6] and aims to implement a method for extracting distinctive invariant features [3, 4] from images that can be used to perform reliable matching between different views of an object or scene (see Figure 1). The features have to be invariant to the many changes that images may undergo: translations, rotations, scale and luminance changes can all make two pictures of the same scene differ [2]. It is virtually impossible to compare two images using traditional methods such as a direct comparison of gray values, even though this would be really simple with an existing API (Java Advanced Imaging API²).
Fig. 1. UAIC system used in Robot Vision task
The vision of the project is to offer a source of information about local points of interest. The purpose of the information and visualized objects is
Java Advanced Imaging API (JAI): http://java.sun.com/javase/technologies/desktop/media/
to organize and communicate valuable data to people, so they can derive increased knowledge that guides their thinking and behavior. The architecture of the system is similar to a server-client architecture, and it is possible to handle several requests at a time. Therefore, one of the maturity levels of the system is the possibility of running in real time thanks to this ability. The server supports a training stage with the data from the IDOL database [7]. In addition, the server is responsible for the method for extracting distinctive invariant features from images (SIFT) that can be used to perform reliable matching between different views of an object or scene. The client is based on knowledge that it receives from the server on request, testing the new images. We did not choose a mechanism of incremental learning; we chose a statistical one, as people learn through observation, trial-and-error and experiment. As we know, learning happens during interaction. We manage the images (the interactions with objects in human learning) so that they become a system for storing their features.

2.1 The Server Module

This module has two parts: one necessary for training and one necessary for classifying, both based on key points (points of interest) extracted from images.

2.1.1 Trainer Component

Training and validation were performed on a subset of the publicly available IDOL2 database [7]. The database contains image sequences acquired in a five-room subsection of an office environment, under three different illumination settings and over a time frame of 6 months. The test sequences were acquired in the same environment, 20 months after the training data, and contain additional rooms that were not imaged previously. This means that variations such as changes in illumination conditions, changes in the rooms, people moving around and changes in the robot's path could take place. In the next step, the server loads all the key-point files obtained with the SIFT algorithm once and waits for requests. According to the number of images considered per room, we have two types of classifiers: the brute classifier and the managed classifier. The brute classifier uses all the available meta files in the training and in the finding process. The managed classifier creates representative meta files for the representative images from the training data. When it gets through all the steps explained in the SIFT algorithm subsection, it chooses only the images that have approximately 10-16 percent similarity with images treated before. In the end, we obtain only about 10 out of 50-60 images, and thus 10 meta files (with their key points): the most representative images for, in this case, every room that has been loaded as a training directory. With these two methods, we have trained our application twice (which took us 2 days): once for all pictures and once for the representative images.
lessens the damaging effects of image transformations to a certain extent. Features extracted by SIFT are invariant to image scaling and rotation, and partially invariant to photometric changes. This is the key step in achieving invariance to rotation, as the key point descriptor can be represented relative to this orientation and therefore achieves invariance to image rotation. All key points are written to meta files that represent the database for the server.

2.1.3 Classifier Component

In this iteration, the database for the management of points of interest is created and browsed using the SIFT algorithm. Access to the database is done by the server, which loads the files with key points and waits for requests. The first classifier, called the brute classifier, loads all the meta files into memory for training and for classifying. The second, called the managed classifier, creates representative meta files for the representative images from the batch. When it gets through all the steps explained in the SIFT algorithm subsection, it chooses only the images that have approximately 10-16 percent similarity with images treated before. In the end these selected pictures represent the most representative images, and we obtain only about 10 out of 50-60 images. Only these images, with their corresponding meta files (containing the key points), are then considered in the training and classification processes.

2.2 The Client Module

The client module is the tester and it has two phases: a naive one and a more precise one, both based on comparison. This module sends one image at a time to the server and receives a list of results that represents, in this case, the list of rooms to which the test images belong.
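One possible reading of the representative-image selection described above is a greedy filter over the keypoint-match ratio; the sketch below is only an illustration under that assumption, and the match_ratio callback (fraction of SIFT keypoints of one image matched in another) is a hypothetical helper, not part of the described system.

```python
def select_representatives(images, match_ratio, threshold=0.15):
    """Greedily keep an image only if it overlaps with every already kept image
    by no more than `threshold` (one reading of the 10-16 percent criterion)."""
    kept = []
    for img in images:
        if all(match_ratio(img, rep) <= threshold for rep in kept):
            kept.append(img)
    return kept
```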
3 Results

Experiments were done testing both of the server's methods of classifying images. The training in the case of selecting the most representative images from the provided batches of images takes longer than the other method, of course. The results in this case were far from what we were expecting: lower than in the case of using the brute finder. The results were more conclusive in the second case. As we knew that a new room was introduced in the robot's path, we had to test the system in two situations: one with the unknown room treated (not recognizing the room) and one with the unknown room not treated. Plain search is the process of getting the results for topological localization with the brute finder. Of the 21 runs submitted in the Robot Vision task, 5 were ours (see Table 1). The results were more conclusive with the brute finder even though it took a long time to complete the training. The get-representative method did not give the expected results, but it is faster in comparison to the brute method.
Table 1. UAIC runs in the Robot Vision Task

Run ID   Details                                                                                Score   Ranking
155      Full search using all frames; run duration: 2.5 days on one computer                   787.0   2
157      Run duration: 2.5 days for this run                                                    787.0   3
156      Search using representative pictures from all rooms; run duration: 30 minutes on one computer   599.5   5
158      Run duration: 30 minutes for one run                                                   595.5   6
159      Wise search on representative images with Unknown threaded                             296.5   13
4 Evaluation
In this section we try to identify the strengths and weaknesses of our approach. To that end we compare the results obtained for two of our better runs: the first, called Plain search with unknown rooms treated (PlainSearchUK), and the second, called Plain search with unknown rooms not treated (PlainSearchNoUK). For both runs we submitted 1690 values; for the first we obtained 1088 correct values, while for the second we obtained 963 correct values. As we can see from the graphical representation in Figure 2, the success rate with unknown rooms not treated (No UK) is better on the known rooms than in the other case, with unknown images treated. In the No UK case the brute finder found more correct values for the rooms that the robot knew (i.e. that the server had learned) than in the UK case. Because, when unknown rooms are not treated, the classifier (server) had to assign every image to a category (room), the percentage for every room increased significantly. The best results were obtained for the CR and KT rooms; this was possible because of the larger number of images representing those rooms and, of course, more key points. This means that the statistical methods applied need to be improved.
Fig. 2. Comparison between PlainSearchUK and PlainSearchNoUK Runs
5 Conclusions
This paper presents the UAIC system which took part in the Robot Vision task. We applied a feature-based method (SIFT) and two main systems in order to search and classify the given images. The first system uses the most important/representative images for each room category. The second system is a brute-force one, and the results in this case are statistically significant. From the analysis we deduce that the methods used behave better when the rooms involved in the comparison have more key points. Future work will focus on more productive statistical methods and perhaps on integrating the brute finder with the managed one.
References
1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Lindeberg, T.: Scale-space theory in computer vision. Kluwer Academic Publishers, Dordrecht (1994)
3. Lindeberg, T.: Feature detection with automatic scale selection. Technical report ISRN KTH NA/P-96/18-SE. Department of Numerical Analysis and Computing Science, Royal Institute of Technology, S-100 44 Stockholm, Sweden (1996)
4. Lindeberg, T., Bretzner, L.: Real-time scale selection in hybrid multi-scale representations. In: Griffin, L.D., Lillholm, M. (eds.) Scale-Space 2003. LNCS, vol. 2695, pp. 148–163. Springer, Heidelberg (2003)
5. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, Corfu, Greece, pp. 1150–1157 (1999)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proceedings of IROS, San Diego, CA, USA (2007)
UAIC at ImageCLEF 2009 Photo Annotation Task

Adrian Iftene, Loredana Vamanu, and Cosmina Croitoru

UAIC: Faculty of Computer Science, "Alexandru Ioan Cuza" University, Romania
{adiftene,loredana.vamanu,cosmina.croitoru}@info.uaic.ro
Abstract. In this paper, the UAIC system participating in the ImageCLEF 2009 Photo Annotation task is described. The UAIC team's debut in the ImageCLEF competition has enriched us with the experience of developing our first system for the Photo Annotation task, paving the way for subsequent ImageCLEF participations. Evaluation of the modules used showed that more work is needed to capture as many situations as possible.
1 Introduction
The ImageCLEF 2009 visual concept detection and annotation task used training and test sets consisting of thousands of images from the Flickr image database. All images had multiple annotations with references to holistic visual concepts and were annotated at an image-based level. The visual concepts were organized in a small ontology with 53 concepts, which could be used by the participants for the annotation task. For the image classification we used four components: (1) the first uses face recognition, (2) the second uses the training data, (3) the third uses the associated exif file, and (4) the fourth uses default values calculated according to the degree of occurrence in the training set. In what follows we describe our system and its main components, analyze the results obtained and discuss the experience gained.
2 The UAIC System
The system has four main components. The first component tries to identify people's faces in every image and then, according to the number of faces, performs the classification. The second uses for classification the clusters built from the training data and calculates, for every image, the minimum distance between the image and the clusters. The third uses for classification details extracted from the associated exif file. If none of these components can classify the image, the fourth component uses default values determined from the training set.
Face Recognition Module. Some categories imply the presence of people, such as Abstract Categories (through the concepts Family-Friends, PartyLife, Beach-Holidays, CityLife), Activity (through the concept Sports), and of course Persons and Representation (with the concept Portrait). We used Faint1, a Java library that recognizes whether there are any faces in a given photo.
1 Faint: http://faint.sourceforge.net/
It also reports how many faces there are and what percentage of the picture each face covers. In this way we were able to decide whether there is a big group, a small one, or whether the photo is a portrait. Unfortunately, when the lighting is not normal it works well in only about 80% of cases, compared with daytime pictures (according to our estimates).
Clustering using Training Data Module. We used a similarity-based process for finding some concepts. For this, we selected the most representative pictures for some concepts from the test data and used JAI2 (Java Advanced Imaging API) to manipulate images easily, together with a small program that calculates a similarity rate between the clusters of photos and the photo to be annotated. It was hard to find the most representative photos for the concepts and to build their associated clusters, as every concept can look very different in different seasons, at different times of day, etc.; but the hardest part was deciding the acceptance rate. Using the training data, we ran the program on images that we expected to be annotated with one or more of the concepts illustrated by the pictures in our small clusters and noted the rates. We did the same with pictures that should not be annotated with one of those concepts but that were very similar to the comparison pictures, and we noted those rates as well. In the end we settled on a compromise average rate, which became our threshold. This algorithm could of course be improved: a separate rate could be calculated for every cluster, which might make the program more accurate. The concepts we tried to annotate in this way were CityLife, Clouds, Snow, Desert and Sea.
Exif Processing Module. We processed the exif information of every picture and, according to the parameters of the camera with which the picture was taken, we were able to annotate concepts related, for example, to illumination, while also correlating them with other concepts found; some more abstract concepts like CityLife or Landscape-Nature could also be found this way. Because the information taken from the exif can be inaccurate and some of our limits can be subjective, concepts discovered in this way that were not clear-cut were set to 0.5.
Default Values Module. Five categories contain disjoint concepts, which implies that only one concept from such a category can be found in a photo. Taking this into consideration, if no concept from a disjoint category was discovered by any other method, a default value is inserted. The default concept is selected according to its degree of occurrence in the training set.
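As a rough illustration, the default-values step could be implemented along the following lines; this is a sketch under stated assumptions (annotations stored as sets of concept names per image, hypothetical function names), not the authors' code.

```python
# Hypothetical sketch of the default-values module: for each category of mutually
# exclusive concepts, fall back to the concept that occurs most often in the
# training annotations when no other module produced a decision.
from collections import Counter

def training_frequencies(train_annotations, category_concepts):
    """train_annotations: list of sets of concept names, one set per training image."""
    counts = Counter()
    for concepts in train_annotations:
        counts.update(concepts & category_concepts)
    return counts

def apply_defaults(image_concepts, disjoint_categories, train_annotations):
    """image_concepts: set of concepts already assigned to the test image."""
    for category in disjoint_categories:          # each category is a set of concept names
        if not (image_concepts & category):       # nothing assigned from this category yet
            counts = training_frequencies(train_annotations, category)
            if counts:
                image_concepts.add(counts.most_common(1)[0][0])
    return image_concepts
```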
3 Results
The run time needed for the test data was 24 hours. It took so long because of the similarity process, which had to compare each photo with 7 to 20 photos for every category built from the training data. We submitted only one run, in which we took into consideration the relation between categories and the hierarchical order. The scores for the hierarchical measure were over 65% both with and without annotator agreement; unfortunately, the results were not as good when the evaluation was made per concept, as we only tried to annotate 30 out of the 53 concepts (the average results are presented in Table 1 [3]).
2 JAI: http://java.sun.com/javase/technologies/desktop/media/jai/
Table 1. Average values for EER and AUC for the UAIC run

Run ID                               Average EER   Average AUC
UAIC_34_2_1244812428616_changed      0.4797        0.105589
Our detailed per-class AUC results are presented in Figure 2. We can see that many results are zero, while the average was 0.105589 and the best value was 0.469569, obtained for the winter class (number 13 in the figure below). Besides the winter class (13), we obtained good results for the classes autumn (12), night (29), day (28) and spring (10). In all these cases classification was done using the module that processes the exif file.
Fig. 2. AUC - Detailed Results on Classes
The lowest AUC values were obtained for classes for which we have no classification rules: partylife (1), family-friends (2), beach-holidays (3), etc. In all these cases none of our defined rules can be applied and the AUC value remains zero. We computed some statistics on the results obtained for every annotation technique we used, in order to determine which worked and which failed (see Table 2).

Table 2. Average values per method used to annotate

Method Applied by Module     EER        AUC
Additional File Processing   0.447955   0.225782
Default Values               0.467783   0.146464
Image Similarity             0.467672   0.132259
Face Recognition             0.487723   0.081350
Without Values               0.502011   0
The best results came from the exif processing; this helped us annotate photos with concepts from categories such as Season, Time of Day, Place and Blurring, as well as more abstract concepts such as Landscape-Nature. The results could have been improved if the limits used to annotate a concept from the camera parameters had been chosen more objectively. The use of image similarity gave fairly good results. One factor that probably contributed to incorrect annotations with this method is the choice of the most representative pictures for a concept, which is very subjective. The alternative of using many different pictures showing different views of a concept is not good enough either, as it can harm the similarity process, and it must also be taken into consideration that comparing two photos is computationally expensive. As for the face recognition method, we were expecting better results. What can probably be improved here is to find an algorithm that also recognizes a person even when the entire face is not captured in the photo.
4 Conclusions
The system we built for this task has four main components. The first component tries to identify people's faces in every image and then, according to the number of faces, performs the classification. The second uses for classification the clusters built from the training data and calculates, for every image, the minimum distance between the image and the clusters. The third uses for classification details extracted from the associated exif file. If none of these components can classify the image, the fourth module does so using default values. From our run evaluation we conclude that some of the applied rules are better than others. In the future, based on a detailed analysis of the results, we will try to apply the rules in descending order of their quality. We also intend to use prediction (as in [1]), in combination with a method that extracts feature vectors for each of the images, similar to [2].
References
1. Abonyi, J., Feil, B.: Cluster Analysis for Data Mining and System Identification. Birkhäuser, Basel (2007)
2. Hare, J.H., Lewis, P.H.: IAM@ImageCLEFPhotoAnnotation 2009: Naïve application of a linear-algebraic semantic space. In: CLEF Working Notes 2009, Corfu, Greece (2009)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 Large-Scale Visual Concept Detection and Annotation Task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Learning Global and Regional Features for Photo Annotation

Jiquan Ngiam and Hanlin Goh

Institute for Infocomm Research
1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore 138632
{jngiam,hlgoh}@i2r.a-star.edu.sg
Abstract. This paper describes a method that learns a variety of features to perform photo annotation. We introduce concept-specific regional features and combine them with global features. The regional features were extracted through a novel region selection algorithm based on Multiple Instance Learning. Supervised classification for photo annotation was learned using Support Vector Machines with extended Gaussian Kernels over the χ2 distance, together with a simple greedy feature selection. The method was evaluated using the ImageCLEF 2009 Photo Annotation task and competitive benchmarking results were achieved.
1 Introduction
In the ImageCLEF 2009 Photo annotation task, the concepts involved ranged from holistic concepts (e.g. landscape, seasons) to regional image elements (e.g. person, mountain), which only involved a sub-region of the image. This broad range of concepts necessitates the use of a large feature set. The set of features was coupled with feature selection using a simple greedy algorithm. For the regional concepts, we experimented with a new idea involving concept-specific region selection. We hypothesize that if we can find the relevant region supporting a concept, features from this region will be good indicators of whether the concept exists in the image. For these concepts, global features provide contextual information while regional features help improve classification. Our method follows the framework in [1], involving Support Vector Machines and extended Gaussian Kernels over the χ2 distance. The framework provides a structured approach for integrating the various global and local features.
2 Image Feature Extraction
2.1 Global Feature Extraction
The global features listed in Table 1 were computed over the entire image. Each type of feature provides a histogram describing the image. In features where a quantized HSV space was used, 12 Hue, 3 Saturation and 3 Value bins were employed, resulting in a total of 108 bins. The bins are of equal width in each dimension. The choice of these parameters was motivated by [2].
Table 1. List of global features extracted (feature, dimensionality, description)

HSV Histogram (108): Quantized HSV histogram.
Color Auto Correlogram [2] (432): Computed over a quantized HSV space with 4 distances: 1, 3, 5 and 7. For each color-distance pair (c, d), the probability of finding the same color at exactly distance d away was computed.
Color Coherence Vector [3] (216): Computed over a quantized HSV space with two states: coherent and incoherent. The τ parameter was set to 1% of the image size.
Census Transform [4] (256): A simple transformation of each pixel into an 8-bit value based on its 8 surrounding neighbors, using two states: either '>=' or '<' its neighbor.
Edge Orientation Histogram (37): Edge orientation histogram computed with the Canny edge detector. Each pixel is assigned to either an edge (with orientation) or non-edge. Orientations are quantized into 5-degree angle bins. An additional bin is concatenated for non-edges.
Interest Point SIFT [5] (500): SIFT features quantized into a visual-words dictionary of 500 visual words using k-means clustering.
Densely Sampled SIFT [6] (1500): SIFT features densely sampled at 10 pixel intervals, with 4 scales (4, 8, 12, 16 px radius) and 1 orientation. Features are quantized into 1500 visual words by k-means clustering.

2.2 Region Selection
The classification problem for some concepts can be framed in the Multiple Instance Learning (MIL) framework. The concept (e.g. Mountain) exists if and only if a region within the image demonstrates the concept. Hence, one is motivated to consider whether it is possible to improve classification performance by considering only the appropriate region(s) in an image. In our method, a region is defined as a bounding box, within which image features can be computed in a manner similar to the extraction of global features, but only for the pixels inside the bounding box. Hence, a descriptor for a region is a feature vector that is extracted based only on the region in a bounding box. Therefore, the problem of finding regional features is reduced to one of finding suitable bounding boxes for each image in a MIL setting, whereby each image is a bag of regions and a region is considered true iff it contains the target concept. Furthermore, a bag is true iff it contains a true region. EM-Diverse Density [7] was used together with Efficient Subwindow Search (ESS) [8] to search for a target concept with good diverse density. ESS considers all
possible bounding boxes, but requires multiple restarts since the algorithm is susceptible to local minima. For photo annotation, a region selector is learned for each concept based on the densely sampled SIFT features. From each region, three histograms, namely 1) HSV, 2) interest point SIFT and 3) densely sampled SIFT histograms, form the concept-specific regional features.
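To make the regional-feature idea concrete, the sketch below shows one way a bag-of-visual-words histogram could be computed from only the descriptors falling inside a selected bounding box; the dense sampling, the k-means codebook and the array layout are assumptions, not the authors' implementation.

```python
# Minimal sketch: turn a selected bounding box into a regional bag-of-visual-words
# histogram by keeping only the densely sampled descriptors inside the box.
import numpy as np

def regional_bow_histogram(points, descriptors, codebook, box):
    """points: (N, 2) array of (x, y) sample locations; descriptors: (N, D) array;
    codebook: a fitted k-means model; box: (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    inside = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
              (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
    if not inside.any():
        return np.zeros(codebook.n_clusters)
    words = codebook.predict(descriptors[inside])            # quantize to visual words
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                                 # L1-normalized histogram
```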
3 Learning and Feature Selection
3.1 Support Vector Machine
To obtain the final classification, each concept was considered independently using an SVM with extended Gaussian kernels over the χ2 distance [1]:

K(S_i, S_j) = \exp\Big( -\sum_{f \in \text{features}} \frac{1}{\mu_f} \, \chi^2\big(f(S_i), f(S_j)\big) \Big)    (1)
where μf is the average χ2 distance for a particular feature. μf is used to normalize the distances across different features. Both global and regional features were treated in the same manner.
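A small sketch of how such a combined kernel matrix could be computed is given below; it assumes L1-normalized histogram features and follows the exponentiated (extended Gaussian) form of Eq. (1), and it is an illustration rather than the authors' code. The resulting matrix could, for instance, be passed to an SVM that accepts a precomputed kernel.

```python
# Sketch of the combined extended Gaussian chi-square kernel of Eq. (1): each
# feature contributes its chi-square distance matrix, scaled by the average
# chi-square distance mu_f for that feature.
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_distance_matrix(X):
    """Pairwise chi-square distances for one feature; X: (n_images, n_bins)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = chi2_distance(X[i], X[j])
    return D

def combined_kernel(feature_matrices):
    """feature_matrices: list of (n_images, n_bins_f) arrays, one per feature."""
    total = 0.0
    for X in feature_matrices:
        D = chi2_distance_matrix(X)
        mu = D[np.triu_indices_from(D, k=1)].mean()   # average chi-square distance
        total = total + D / mu
    return np.exp(-total)                              # extended Gaussian kernel
```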
3.2 Greedy Feature Selection
Since different features work well with different concepts, weighting was incorporated into the kernel function to combine the different features. However, learning these weights is non-trivial and one often resorts to ad-hoc methods such as genetic algorithms. Here, we use a simple greedy algorithm for feature selection, as follows.

Algorithm. Greedy Feature Selection
1: F = {all global features}
2: For each feature f ∈ F: compute the error rate if f is removed
3: Remove the feature which results in the best improvement
4: Repeat (2-3) until removing any feature results in worse performance
5: Consider each feature f ∉ F: compute the error rate if f is added
6: Consider each feature f ∈ F: compute the error rate if f is removed
7: Add or remove the feature which gives the best improvement
8: Repeat (5-7) until a local optimum is reached
9: Return F
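The steps above translate fairly directly into code; in the sketch below, evaluate_error is a placeholder for the cross-validated error rate obtained with a given feature subset, and feature names are assumed to be strings.

```python
# Sketch of the greedy feature selection algorithm above. evaluate_error(features)
# stands in for the validation error of a classifier built on that feature subset.
def greedy_feature_selection(all_features, evaluate_error):
    selected = set(all_features)
    best_err = evaluate_error(selected)

    # Steps 2-4: backward elimination while removals keep improving.
    improved = True
    while improved and len(selected) > 1:
        improved = False
        err, feat = min((evaluate_error(selected - {f}), f) for f in selected)
        if err < best_err:
            selected.remove(feat)
            best_err, improved = err, True

    # Steps 5-8: single additions or removals until a local optimum is reached.
    improved = True
    while improved:
        improved = False
        moves = [('add', f, evaluate_error(selected | {f}))
                 for f in all_features if f not in selected]
        if len(selected) > 1:
            moves += [('remove', f, evaluate_error(selected - {f})) for f in selected]
        if not moves:
            break
        op, feat, err = min(moves, key=lambda m: m[2])
        if err < best_err:
            if op == 'add':
                selected.add(feat)
            else:
                selected.remove(feat)
            best_err, improved = err, True
    return selected
```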
4 Method Evaluation and Results
To evaluate the performance of the method and benchmark it against other methods from various groups worldwide, we used the data set provided for the ImageCLEF Photo Annotation 2009 task. The data set consists of 53 concepts
Table 2. Results evaluated with average EER and average AUC

Experiment Setting              Average EER   Average AUC
Global and regional features    0.253296      0.813893
Global features only            0.255945      0.811421
spanning from visual elements (trees, people) to abstract concepts (aesthetics, blur). Although an ontology was provided, our method did not use the ontology heavily, except for handling disjoint cases. The training and test sets consist of 5000 and 13,000 images respectively. Two different experimental settings using different features were evaluated. The first used all global and regional features, while only global features were used in the second setting. Their performances in terms of average equal error rate (EER) and average area under ROC curve (AUC) are shown in Table 2. We observe that regional features help to improve the performance slightly, though the difference is insignificant. While a new hierarchical measure was introduced by the organizers of this task, we did not specifically optimize for it because there was a lack of information regarding the annotator agreement values. Interestingly, for this evaluation metric, the experiment setting that used only global features negligibly outperformed the other setting that used all features.
References
1. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73(2), 213–238 (2007)
2. Ojala, T., Rautiainen, M., Matinmikko, E., Aittola, M.: Semantic image retrieval with HSV correlograms. In: 12th Scandinavian Conf. on Image Analysis, pp. 621–627 (2001)
3. Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: MULTIMEDIA 1996: Proceedings of the Fourth ACM International Conference on Multimedia, pp. 65–73. ACM, New York (1996)
4. Wu, J., Rehg, J.M.: Where am I: Place instance and category recognition using spatial PACT. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
6. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4), 712–727 (2008)
7. Zhang, Q., Goldman, S.A.: EM-DD: An improved multiple-instance learning technique. In: Advances in Neural Information Processing Systems, pp. 1073–1080. MIT Press, Cambridge (2001)
8. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8 (2008)
Improving Image Annotation in Imbalanced Classification Problems with Ranking SVM

Ali Fakeri-Tabrizi, Sabrina Tollari, Nicolas Usunier, and Patrick Gallinari

Université Pierre et Marie Curie - Paris 6,
Laboratoire d'Informatique de Paris 6 - UMR CNRS 7606
104 avenue du président Kennedy, 75016 Paris, France
firstname.lastname@lip6.fr
Abstract. We try to overcome the imbalanced data set problem in image annotation by choosing a convenient loss function for learning the classifier. Instead of training a standard SVM, we use a Ranking SVM in which the chosen loss function is helpful in the case of imbalanced data. We compare the Ranking SVM to a classical SVM with different visual features. We observe that Ranking SVM always improves the prediction quality, and can perform up to 23% better than the classical SVM.
1 Introduction
The goal of image annotation is to label an image with one or more concepts that it represents. Given a training set of labeled images, the standard way to solve this problem is to define several binary classification problems. For a given concept, the positive examples are the images that are labeled with this concept, and the negative examples are the other images. One of the main difficulties of this approach is that the datasets are generally highly imbalanced: it often happens that a vast majority (or minority) of the training images are labeled with a particular concept. For example, in the VCDT 2009 [3], the concept Neutral Illumination appears in 4656 images out of 5000, and the concept Beach Holidays has only 78 occurrences. The performance of standard classification algorithms is highly affected by this imbalance, because they are designed to optimize the accuracy (or, equivalently, the misclassification rate). In such a case, the accuracy is biased towards the majority class: taking the example of the concept Beach Holidays, a classifier that never predicts the presence of this concept has an accuracy of 1 − 78/5000 ≈ 98%. Thus it is possible to obtain a very good accuracy with a useless classifier. In this paper, we propose to change the learning algorithm to avoid the bias of the accuracy measure. To that end, during training, we use a ranking algorithm, namely a Ranking SVM [2]. Similarly to a standard SVM, a Ranking SVM learns a function that gives scores to examples, and the class label is decided by thresholding the score. Unlike a standard SVM, the ranking algorithm we use aims at optimizing the Area Under the ROC Curve (AUC) instead of the accuracy. This boils down to maximizing the probability that a positive example
has a greater score than a negative example. So, the loss of the Ranking SVM does not suffer from the same bias as the accuracy measure, and we can expect much better performance on imbalanced datasets. We present experiments with various visual features on data from the VCDT 2009. The results show that Ranking SVM can perform up to 23% better than a classical SVM.
2 SVM and Ranking SVM
Consider N training examples (x_i, y_i), i = 1, ..., N, with y_i ∈ {−1, 1}, and an embedding φ mapping the observations x_i into a feature space endowed with a dot product ⟨·,·⟩. The "classical" SVM finds a hyperplane ⟨w, φ(x)⟩ + b = 0 that separates the two classes as well as possible. The goal is thus to optimize the accuracy, or, equivalently, to minimize the misclassification rate E defined as:

E = \frac{1}{N} \sum_{i=1}^{N} [[\, y_i (\langle w, \phi(x_i)\rangle + b) \le 0 \,]]    (1)

where [[·]] is the indicator function. Formally, the primal formulation of the SVM is

\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i    (2)

subject to y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i and \xi_i \ge 0, where C is a user-specified regularization parameter. The embedding φ may be defined implicitly by a kernel function K(x_i, x_j). In this paper, however, we will only consider linear kernels.
The Ranking SVM does not strive to separate the two classes, but rather learns a score function of the form ⟨w, φ(x)⟩ that gives greater scores to positive examples than to negative ones. We choose to optimize the Area Under the ROC Curve (AUC) as in [2]. The AUC is the probability that a positive example has a greater score than a negative one, and can be computed as:

AUC = 1 - \frac{SwappedPairs}{p \cdot n}    (3)

where p (resp. n) is the number of positive (resp. negative) examples, and:

SwappedPairs = |\{ (i, j) : y_i > y_j \text{ and } \langle w, \phi(x_i)\rangle < \langle w, \phi(x_j)\rangle \}|    (4)

SwappedPairs is the number of pairs of examples that are in the wrong order. The Ranking SVM uses a training criterion similar to the classical SVM, but using pairs of examples rather than single examples. The primal formulation is:

\min_{w,\xi} \; \|w\|^2 + C \sum_{i,j : y_i > y_j} \xi_{i,j}    (5)

under the constraints \langle w, \phi(x_i)\rangle \ge 1 + \langle w, \phi(x_j)\rangle - \xi_{i,j} and \xi_{i,j} \ge 0.
Table 1. Equal Error Rate (EER) and Area Under the ROC Curve (AUC) scores obtained by SVM and Ranking SVM with various visual features. Ranking SVM, which deals with imbalanced data, always obtains the best results. N: number of dimensions.

Visual Descriptor   N      Classifier     EER            AUC
Random              -      -              0.495          0.506
HSV                 51     SVM            0.460          0.551
HSV                 51     Ranking SVM    0.378 (-18%)   0.669 (+21%)
SIFT                1024   SVM            0.451          0.561
SIFT                1024   Ranking SVM    0.350 (-22%)   0.690 (+23%)
SIFT+HSV            1075   SVM            0.459          0.552
SIFT+HSV            1075   Ranking SVM    0.378 (-18%)   0.669 (+21%)
Mixed+PCA           180    SVM            0.353          0.694
Mixed+PCA           180    Ranking SVM    0.294 (-17%)   0.771 (+11%)
Strictly speaking, a Ranking SVM does not learn a classifier. A classifier can however be obtained by comparing the scores with an appropriate threshold. In the following, the classifier is obtained by comparing the score to 0: if an observation x is such that ⟨w, φ(x)⟩ > 0, then we predict that x is in the positive class, otherwise we predict it in the negative class. Although this choice may not be optimal, it is a simple decision rule that gives good results in practice.
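The following short sketch illustrates the quantities in Eqs. (3)-(4) and the zero-threshold decision rule; the scores themselves would come from a trained ranking model (the paper uses SVMperf), so this is only an illustrative helper, not the authors' implementation.

```python
# Sketch of the AUC definition via swapped pairs (Eqs. 3-4) and the simple
# thresholding at 0 used to turn ranking scores into class predictions.
import numpy as np

def auc_from_scores(scores, labels):
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == -1]
    p, n = len(pos), len(neg)
    if p == 0 or n == 0:
        return float('nan')
    # a swapped pair is a positive example scored strictly below a negative one
    swapped = sum(int(np.sum(neg > s)) for s in pos)
    return 1.0 - swapped / (p * n)

def predict(scores, threshold=0.0):
    return np.where(np.asarray(scores) > threshold, 1, -1)
```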
3 Experiments
The corpus is composed of the 5000 Flickr images from the training set of the VCDT 2009 [3]. We randomly divide this set into two parts: a training set of 3000 images and a test set of 2000 images. Each image is tagged with one or more of the 53 hierarchical visual concepts. We want to show that using an appropriate loss function improves the performance for various visual features. First, we segment images into 3 horizontal regions and extract HSV features. Second, we extract SIFT keypoints and then cluster them to obtain a visual dictionary. Third, we perform an early fusion by concatenating the HSV and SIFT spaces (SIFT+HSV). Fourth, we use a concatenation of various visual features from 3 labs proposed by the AVEIR consortium [1], reduced using PCA (Mixed+PCA). To compare the performance of the classical SVM and of the Ranking SVM, we use SVMperf1, in which an SVM classifier and a Ranking SVM are implemented. We only consider linear kernels. Table 1 gives the results obtained by SVM and Ranking SVM with the various visual features on the 2000 test images. First, we notice that we obtain low results using SVM on HSV, SIFT and SIFT+HSV, and better results with Mixed+PCA. The early fusion of HSV and SIFT does not give better results than SIFT alone. For all experiments, the Ranking SVM gives better results than the SVM. Finally, using the Ranking SVM, the best results are obtained with the Mixed+PCA data. In Figure 1, we compare the differences in EER between SVM and Ranking SVM, both on Mixed+PCA, as a function of
1 http://svmlight.joachims.org/svm_perf.html
Fig. 1. Differences in EER between SVM and Ranking SVM, on Mixed+PCA, as a function of the number of positive examples in the training data set. The difference is bigger when the number of positive examples in the training set is small or large.
the number of positive examples in the training data set. We see that when the number of positive examples is small (or large), the difference is clearly visible. This means that, compared to the classical SVM, a Ranking SVM is particularly efficient when the data are highly imbalanced.
4 Conclusion
This work shows that the choice of the loss function is important in the case of imbalanced data sets. Using a Ranking SVM can improve the results. We also see that the results depend on the feature types. Still, for all the features, the use of the Ranking SVM improves performance by up to 23% compared to the classical SVM. As a perspective, we can study the importance of the decision threshold used with the Ranking SVM; we may increase performance by choosing a more appropriate threshold than 0. Acknowledgment. This work was partially supported by the French National Agency of Research (ANR-06-MDCA-002 AVEIR project).
References
1. Glotin, H., Fakeri-Tabrizi, A., Mulhem, P., Ferecatu, M., Zhao, Z.-Q., Tollari, S., Quenot, G., Sahbi, H., Dumont, E., Gallinari, P.: Comparison of various AVEIR visual concept detectors with an index of carefulness. In: CLEF Working Notes (2009)
2. Joachims, T.: A support vector method for multivariate performance measures. In: International Conference on Machine Learning, ICML (2005)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
University of Glasgow at ImageCLEF 2009 Robot Vision Task: A Rule Based Approach

Yue Feng, Martin Halvey, and Joemon M. Jose

Department of Computing Science, University of Glasgow, Glasgow, G12 8RZ, UK
{yuefeng,halvey,jj}@dcs.gla.ac.uk
Abstract. For the University of Glasgow's submission to the ImageCLEF 2009 Robot Vision task, a large set of interest points was extracted using an edge corner detector, and these points were used to represent each image. The RANSAC method [1] was then applied to estimate the similarity between test and training images based on the number of matched pairs of points. The location of the robot was then annotated based on the training image containing the highest number of matched point pairs with the test image. A set of decision rules with respect to the trajectory behaviour of the robot's motion was defined to refine the final results. An illumination filter was also applied in two of the runs in order to reduce illumination effects.
1 Introduction
We describe the approaches and results for 3 independent runs submitted by the University of Glasgow for the ImageCLEF 2009 Robot Vision task [2]. For this task training and validation were performed on a subset of the publicly available IDOL2 Database [3]. The database contains image sequences acquired in a 5-room subsection of an office setting, under 3 different illumination settings. Our strategy is to analyse the visual content of the test sequence and compare it with the training sequence to determine location. This approach is able to automatically and efficiently detect whether two images are similar, or find similar images within a database of images. This image matching approach estimates the visual distance between an unannotated frame and each frame in the training sequence and returns a ranked list, and the unannotated frame is annotated the same as the highest ranked training image. The image matching techniques are combined with knowledge of robot motion and trajectory to determine the robot's location. In addition, an illumination filter is integrated into one of the runs to minimise lighting effects with the goal of improving the predictive accuracy.
2 Methodology and Approach
As both the training and test sequences are captured using the same camera in the same geographic condition, it is assumed that frames taken in the same location will contain similar content and geometric information. Motivated by this assumption, our image matching algorithm consists of the following successive stages: (1) A corner detection method is used to create an initial group of points of interest (POI); (2) The
RANSAC algorithm [1] is applied to establish point correspondences between two frames and to calculate the fundamental matrix [4] (this matrix encodes an epipolar constraint on the general motion and rigid structure; it is used to compute the geometric information for refining matched point pairs); (3) the number of refined matched points is regarded as the similarity between the two frames. POI are used instead of all of the pixels in the frame in order to reduce computational cost. As corners are local image features characterised by locations where variations of intensity in both the X and Y directions are high, it is easier to detect and compare POI in such areas, for example edges or textures. In order to exploit this, a Harris corner detector [5] was employed to initialise the POI, as it has strong invariance to rotation, scale, illumination variation and image noise. The Harris corner detector uses the local autocorrelation function to measure local changes of the signal and detect corner positions in each frame. The next step is to use a point matching technique to establish point correspondences between two frames. The point matching method generates putative matches between two frames by looking for points that are maximally correlated with each other inside a window surrounding each point. Only points that correlate strongly with each other in both directions are returned. Given the initial set P of POI, a model parameter X used to check whether a point fits is first estimated using N points chosen at random from P. The number of points in P that fit the model, with values of X within a user-given tolerance T, is then counted. If this number is satisfactory, the estimate is regarded as a fit and the operation terminates with success. These operations are carried out in a loop over all the POI. In this work, T is set at 95%; this high threshold reduces the number of points of interest. The initial matching pairs may contain mismatches, so a post-processing step to refine the results is needed. Under the assumption that frames taken in the same location contain similar geometric information, a fundamental matrix [4] was applied. Given the initial matching points, the fundamental matrix F can be estimated from a minimum of seven point correspondences. Its seven parameters represent the only geometric information about the cameras that can be obtained through point correspondence alone. The computed F is then applied to all the matching pairs in order to eliminate incorrectly matched ones: a matched point pair should satisfy the epipolar constraint x'ᵀFx = 0, where x and x' are the corresponding points in the two images, and Fx describes an epipolar line on which the corresponding point x' in the other image must lie. After applying the matrix F to all the paired points, the number of matched point pairs remaining is regarded as the similarity between the two images. In order to localise the robot, we assume that the robot's position can be retrieved by finding the most similar frame in the training sequence. Given the results of point matching, each test frame can be annotated as being from one of the possible rooms, and the trajectory of the robot can be generated, represented by the extracted annotation information frame by frame.
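A rough sketch of this similarity score is shown below, assuming OpenCV; it substitutes SIFT descriptor matching for the paper's Harris-corner correlation windows, and the thresholds are illustrative assumptions rather than the authors' settings.

```python
# Rough sketch of the frame-similarity step: match keypoints between two frames,
# fit a fundamental matrix with RANSAC, and use the surviving inlier count as
# the similarity score (inliers approximately satisfy x'.T F x = 0).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def frame_similarity(img_a, img_b, ratio=0.75):
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 8:                         # not enough correspondences to estimate F
        return 0
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.95)
    if F is None:
        return 0
    return int(mask.sum())                    # number of geometrically consistent pairs
```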
By studying the training sequences released as part of the ImageCLEF 2009 Robot Vision training and test sets, we find that (i) the robot does not move "randomly", (ii) the period of time that the robot stays in one room is always more than 0.5 seconds, which corresponds to more than 12 continuous frames, and (iii) the robot always enters a room from outside and then exits it to the place
where it came from rather than to a different place. Based on the above observations, a set of rules to help determine the location of the robot at any given time was devised.
Rule 1: The robot will not stay in one place for a period of less than 20 frames. If rule 1 is violated and the location before the falsely detected period is the same as the location after it, the location of the false period is revised and annotated the same as the previous period.
Rule 2: If the location of the robot changes from room A to room B without passing through the corridor and printing area, there must be a false detection. If this rule is violated, a window of N frames is applied at the location boundary to recalculate the similarity: the similarities between the test image and its top 10 matched training images are summed as the recalculated similarity, and the training frames with the highest score are used to annotate the current frame with a location.
Rule 3: Since the test sequence contains additional rooms that were not imaged previously, no corresponding frames in the training set could be used to annotate these rooms. We define a rule that any frame with fewer than 15 matched point pairs against the training frames is annotated as an unknown room.
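A simplified sketch of how Rules 1 and 3 could be applied as post-processing is given below; Rule 2's boundary re-scoring is omitted, and the thresholds (20 frames, 15 matched pairs) are taken from the text. This is an illustrative reimplementation, not the authors' code.

```python
# Simplified sketch of the rule-based refinement: Rule 3 marks frames with too few
# matched pairs as an unknown room, and Rule 1 suppresses stays shorter than
# 20 frames when the surrounding label is the same on both sides.
def apply_rules(labels, match_counts, min_pairs=15, min_stay=20):
    # Rule 3: too few matched point pairs -> unknown room
    labels = ['UNKNOWN' if c < min_pairs else lab
              for lab, c in zip(labels, match_counts)]

    # Rule 1: collapse short runs bounded by the same label on both sides
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append([labels[start], start, i])   # [label, start, end)
            start = i
    for k in range(1, len(runs) - 1):
        label, s, e = runs[k]
        if e - s < min_stay and runs[k - 1][0] == runs[k + 1][0]:
            runs[k][0] = runs[k - 1][0]

    smoothed = []
    for label, s, e in runs:
        smoothed.extend([label] * (e - s))
    return smoothed
```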
3 Results and Evaluation
Three runs were submitted for the ImageCLEF 2009 Robot Vision task, each using a different combination of the image matching, decision rules and illumination filter approaches.
Run 1: Uses every first frame out of 5 continuous frames of both the training and testing sequences for image matching; this is followed by the application of the rule-based model to refine the results.
Run 2: Uses all of the frames in the training and testing sets for image matching, followed by the application of the rule-based model. The illumination filter is applied for pre-processing the frames.
Run 3: Uses every first frame out of 5 continuous frames of both the training and test sequences for image matching, followed by the application of the rule-based model to refine the results. In addition, an illumination filter called Retinex [6] was applied to improve the visual rendering of frames in which the lighting conditions are not good.
For runs 1 and 3 we reduce the number of frames considered in order to reduce redundancy: since every second consists of 25 frames, there is little change across 5 consecutive frames. Once a keyframe has been annotated, all the frames in that shot are annotated in the same way. Since the computational cost of our approach is linear, this keyframe representation can reduce the processing time by 80%. However, there is a chance that this may result in false detections as the robot changes location. All 3 runs were submitted for official evaluation, and the benefits of these approaches were measured using precision and the ImageCLEF score. The score is used as the measure of the overall performance of the systems and is calculated as follows: +1.0 for each correctly annotated frame, +1.0 for each correct detection of an unknown room, -0.5 points for each incorrectly annotated frame and 0 points for each image that was not annotated. Table 1 shows the results of the three runs. It can be seen clearly that the second run achieved the highest accuracy and score overall; this run was the best performing run for the obligatory task at the ImageCLEF 2009 Robot Vision task [2]. Comparing our 1st and 2nd runs, the 2nd run improves the accuracy from 59% to 68.5%, demonstrating that additional frames for training and testing can improve the image matching results. The illumination filter, however, does not improve performance.
Table 1. Results of the 3 submitted runs (1689 test frames in total)

Run               Accuracy   Score
Run 1 (Baseline)  59%        650.5
Run 2             68.5%      890.5
Run 3             25.9%      -188
4 Conclusion
In this paper we have described a vision-based localization framework for a mobile robot and applied this approach as part of the Robot Vision task for ImageCLEF 2009. This approach is applicable to indoor environments to identify the current location of the robot. The novelty of this approach is a methodology for obtaining image similarity using POI-based image matching, together with rule-based reasoning simulating the moving behaviour of the robot to refine the annotation results. The evaluation results show that our proposed method achieved the second highest score among all the submissions to the Robot Vision task in ImageCLEF 2009. The experimental results show that using a coarse-to-fine approach is successful, since it considers the visual information along with the motion behaviour of the mobile robot. The results also reflect the magnitude of the difficulties in robot vision, such as how to annotate an unknown room correctly; we believe we have gained insight into the practical problems and will use our findings in future work. Acknowledgments. This research is supported by the European Commission under contract FP6-027122-SALERO.
References
1. Lu, L., Dai, X., Hager, G.: Efficient particle filtering using RANSAC with application to 3D face tracking. Image and Vision Computing 24(6) (January 2006)
2. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. IROS (2007)
4. Zhong, H.X., Pang, Y.J., Feng, Y.P.: A new approach to estimating fundamental matrix. Image and Vision Computing 24(1) (2006)
5. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. Alvey Conf. (1987)
6. Rahman, Z., Jobson, D.J.: Retinex processing for automatic image enhancement. Journal of Electronic Imaging 13(1) (2004)
A Fast Visual Word Frequency - Inverse Image Frequency for Detector of Rare Concepts

Emilie Dumont 1,2, Hervé Glotin 1,2, Sébastien Paris 1, and Zhong-Qiu Zhao 3

1 Sciences and Information Lab. LSIS UMR CNRS 6168, France
2 University of Sud Toulon-Var, France
3 College of Computer Science and Information Engineering, Hefei University of Technology, China
Abstract. In this paper we propose an original image retrieval model inspired by the vector space information retrieval model. For different features and different scales we build a visual concept dictionary composed of visual words intended to represent a semantic concept, and we then represent an image by the frequency of the visual words within the image. Image similarity is then computed as in the textual domain, where a textual document is represented by a vector in which each component is the frequency of occurrence of a specific textual word in that document. We adapt the common text-based paradigm by using the TF-IDF weighting scheme to construct a WF-IIF weighting scheme in our Multi-Scale Visual Dictionary (MSVD) vector space model. The experiments are conducted on the 2009 ImageCLEF Visual Concept Detection campaign. We compare WF-IIF to the usual direct Support Vector Machine (SVM) algorithm and show that, on average over all concepts, SVM and WF-IIF give the same Area Under the Curve (AUC). We then discuss the fusion process that should enhance the whole system, as well as some particular properties of the MSVD, which should be less dependent on the training set size of each concept than the SVM.
1 Introduction
Visual document indexing and retrieval from digital libraries have been extensively studied for decades. In the literature, there has been a large variety of approaches proposed to retrieve images efficiently for users. A content-based image retrieval (CBIR) approach relies on certain low-level image features, such as color, shape, and texture, for retrieving images. A major drawback of this approach is that there is a 'semantic gap' between low-level features of images and high-level human concepts. Image analysis is a typical domain for which a high degree of abstraction from low-level methods is required, and where the semantic gap immediately affects the user. Recently, the main issue is how to relate low-level image features to high-level semantic concepts because if image content is to be identified to understand the meaning of an image, the only available independent information is the low-level pixel data. To recognize the displayed scenes from the raw data of an image
the algorithms for the selection and manipulation of pixels must be combined and parametrized in an adequate manner and finally linked to a natural description. In other words, research focuses on how to extract from low-level features semantics that approximate well the user's interpretation of the image content (objects, themes, events). The state-of-the-art techniques for reducing this gap fall mainly into three categories: i) using ontologies to define high-level concepts, ii) using machine learning tools to associate low-level features with query concepts, and iii) introducing relevance feedback into the retrieval process to improve responses. We propose here a new approach to semantic interpretation that uses a multi-scale mid-level visual representation. Images are systematically decomposed into regions of different sizes, and regions are represented according to several features. These different aspects are fused in stages to obtain a more complex image representation, in which an image is a vector of visual word frequencies. These visual words are defined by a concept Multi-Scale Visual Dictionary (MSVD). This original multi-scale analysis is intended to be robust to the large variety of visual concepts. Related approaches without this multi-scale extension can be found in the literature. Picard was the first to develop the general concept of a visual thesaurus, by transferring the main idea of a text dictionary to a visual dictionary [1]. One year later, she proposed examples of a visual dictionary based on texture, in particular the FourEyes system [2], but no experiment was carried out to show the quality of these systems. A first family of methods consists in building a visual dictionary from the feature vectors of regions of segmented images. In [3], the authors use a self-organizing map to select visual elements; in [4], SVMs are trained on image regions of a small number of images belonging to seven semantic categories. In [5,6], regions are clustered by similar visual features with competitive agglomeration clustering, and images are then represented as vectors based on this dictionary. The semantic content of those visual elements depends primarily on the quality of the segmentation.
2 Model Word Frequency - Inverse Image Frequency through a Multi-Scale Visual Dictionary
We propose to use a Multi-Scale Visual Dictionary to represent an image, where each visual word is intended to represent a semantic concept. Images are systematically decomposed into regions of different sizes, and regions are represented according to several features. These different aspects are fused in stages to obtain a more complex image representation, in which an image is a vector of visual word frequencies. This original multi-scale analysis is expected to be robust to the large variety of visual concepts. In a second step, we propose an adaptation of the common text-based paradigm. TF-IDF [7] is a classical information retrieval term weighting model, which estimates the importance of a term in a given textual document by multiplying the raw term frequency (TF) of the term in the document by the term's inverse document frequency (IDF) weight. Our image-based classifier is analogous to the TF-IDF approach, where we define a term
by a visual word, and we call this method Word Frequency - Inverse Image Frequency (WF-IIF).
2.1 Multi-Scale Visual Dictionary
Visual Atoms. Visual atoms, or visual elements, are intended to represent semantic concepts; they should be automatically computable, and an image should be automatically describable in terms of these visual elements. This visual representation should also be related to the content of the image. A visual element is an image area, i.e. images are split into a regular grid. The size of the grid is obviously a very important factor in the whole process: a smaller grid allows a more precise description with fewer visual elements, while a bigger grid may contain more information but requires a larger number of elements. We therefore propose a multi-scale process that integrates all the grid sizes and selects the best ones.
Global Visual Dictionary. Unlike the textual domain, there is no universal dictionary, so our first step is to automatically create a visual dictionary composed of a large set of visual words, each representing a sub-concept. Each visual element is represented by different feature vectors (color, texture, edge, ...). For each feature and grid size, we cluster the visual elements using the K-means algorithm, with a predefined number of clusters and the Euclidean distance, in order to group visual elements and to smooth some visual artefacts. K-means is one of the most popular iterative descent clustering methods; it is intended for situations in which all variables are quantitative, and the squared Euclidean distance is chosen as the dissimilarity measure. Then, for each cluster, we select the medoid to be a visual word w_i, and these words compose the visual dictionary of a feature, i.e. W = {w_i}, i = 1, ..., P, with P = K × nF × nG, where K, nF and nG denote the number of clusters in the K-means, the number of features computed for each block and the number of grids, respectively.
Image Transcription by Block Matching. Based on the visual dictionary, we replace each visual element by the nearest visual word (one of the medoids) in the dictionary. To match a block to a visual word, we find the visual word that minimizes a distance measure between the block's visual elements and the visual word. In this stage, every block of an image is matched to one of the visual words from the visual dictionary, and the image representation is then based on the frequency of the visual words within the image, for each feature. This is similar to the textual domain, where a textual document is represented by a vector in which each component is the frequency of occurrence of a specific textual word in that document.
Visual Vocabulary Reduction. In the textual domain, "bag-of-words" representations are surprisingly effective for text classification. The representation is high dimensional though, containing many words that are uninformative for text categorization, like "the", "a", etc. These uninformative words result in reduced
generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to remove the least relevant visual words from the bag-of-words representation. In a visual document, visual words do not have the same importance in determining the presence of a concept, so we want to select the most discriminative visual words for a given concept to compose a Visual Concept Dictionary associated with this concept. We use classical methods such as document frequency thresholding (DF), word frequency thresholding (WF), information gain (IG) [8], mutual information (MI) [9] and entropy-based reduction [10].
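The sketch below illustrates the dictionary construction and block transcription described above, assuming scikit-learn's KMeans and one descriptor matrix per feature and grid size; it is a minimal illustration, not the authors' implementation.

```python
# Sketch of the per-feature dictionary: cluster block descriptors with k-means and
# keep, for each cluster, the medoid (the real block closest to the centroid) as
# the visual word; then transcribe an image into a visual-word frequency histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(block_descriptors, k=250, seed=0):
    """block_descriptors: (n_blocks, dim) array for one feature and one grid size."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(block_descriptors)
    words = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        dists = np.linalg.norm(block_descriptors[members] - km.cluster_centers_[c], axis=1)
        words.append(block_descriptors[members[np.argmin(dists)]])   # the medoid
    return np.vstack(words)

def transcribe_image(block_descriptors, words):
    """Map each block to its nearest visual word and return a frequency histogram."""
    d = np.linalg.norm(block_descriptors[:, None, :] - words[None, :, :], axis=2)
    assignment = d.argmin(axis=1)
    hist = np.bincount(assignment, minlength=len(words)).astype(float)
    return hist / hist.sum()
```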
2.2 Vector Based Visual Concept Detection
We use the TF-IDF weighting scheme in the vector space model, together with the cosine similarity, to determine the similarity between a visual document and a concept. An image is represented as a vector whose dimensionality P is the number of words in the MSVD; each dimension corresponds to a separate visual word, and if a word occurs in the image its value in the vector is non-zero. Several different ways of computing word weights have been developed. One of the best known schemes is TF-IDF weighting: the term frequency TF is the number of times the term occurs in a document, while the IDF is the inverse of the number of documents in which the word occurs. In our case a document is an image, so we use the Word Frequency - Inverse Image Frequency (WF-IIF) model. The visual word count in a given image is simply the number of times a given visual word appears in that image. This count gives a measure of the importance of the visual word w^c_i within the particular image I_j. Thus we have the word frequency, defined as follows:

wf^c_{i,j} = \frac{n^c_{i,j}}{\sum_k n^c_{k,j}}    (1)

where n^c_{i,j} is the number of occurrences of the considered visual word w^c_i in image I_j for the concept c, and the denominator is the sum of the numbers of occurrences of all visual words in image I_j for the concept c. The inverse image frequency is a measure of the general importance of the visual word, obtained by dividing the number of all images by the number of images containing the visual word and then taking the logarithm of that quotient:

iif^c_i = \log \frac{|I^c|}{|\{I : w^c_i \in I^c\}|}    (2)

where |I^c| is the total number of images in the corpus in which the concept c appears, and |{I : w^c_i ∈ I^c}| is the number of images in which the visual word w^c_i appears (that is, n^c_{i,j} ≠ 0). Relevancy rankings of images in a visual keyword search can then be calculated, following document similarity theory, by comparing the angles between each image vector and the original query vector, where the query is represented by the same kind of vector as the images.
Here, for a particular image I to classify, the query vector is v^c_1 = E_j[wf^c_{i,j} iif^c_i], where wf^c_{i,j} and iif^c_i are computed off-line on the training set, and the image vector is v^c_2 = wf^c_i iif^c_i, where wf^c_i is computed on I. Using the cosine, the similarity cos θ_c between v^c_2 and the query vector v^c_1 can then be calculated.
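The whole WF-IIF scoring pipeline can be sketched as follows, assuming raw visual-word count matrices per image; averaging the weighted training vectors to obtain the per-concept query vector follows the description above, but the function names and array layout are illustrative assumptions.

```python
# Sketch of WF-IIF scoring: word frequencies per image (Eq. 1), inverse image
# frequencies (Eq. 2), a concept query vector averaged over the concept's training
# images, and a cosine similarity for a new image.
import numpy as np

def word_frequencies(counts):
    """counts: (n_images, n_words) raw visual-word counts."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)

def inverse_image_frequencies(counts):
    n_images = counts.shape[0]
    has_word = (counts > 0).sum(axis=0)
    return np.log(n_images / np.maximum(has_word, 1))

def concept_query_vector(train_counts):
    """train_counts: counts for the training images labeled with the concept."""
    wf = word_frequencies(train_counts)
    iif = inverse_image_frequencies(train_counts)
    return (wf * iif).mean(axis=0), iif

def concept_score(test_counts, query_vec, iif):
    v2 = word_frequencies(test_counts[None, :])[0] * iif
    denom = np.linalg.norm(query_vec) * np.linalg.norm(v2)
    return float(query_vec @ v2 / denom) if denom > 0 else 0.0
```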
3
Experiments
Experiments are conducted on the image data used in the Photo Annotation task of ImageCLEF 2009 [11]. It represents a total of 8000 images annotated with 53 concepts. The criterion commonly used to measure the ranking quality of a classification algorithm is the mean area under the ROC curve, averaged over all the concepts and denoted AUC. A large variety of features offers a better representation of concepts, since these concepts can differ greatly and combine very variable characteristics. For example, for a concept such as sky, sea or forest, colour or texture will have a big impact, while for a concept such as face, edge and colour will be favoured. So we extract an HSV histogram, an edge histogram, a Gabor filter histogram, generalized Fourier descriptors [12], and the profile entropy feature [13]. The size of the grid is obviously a very important factor in the whole process. A smaller grid allows a more precise description with fewer visual elements, while a bigger grid may contain more information but requires a larger number of elements. We choose to use a multi-scale process. In order to take full advantage of our multi-scale approach, we use different grid sizes: 1 × 1, 2 × 2, 4 × 4, 8 × 8, and 2 × 4, 4 × 2, 8 × 4, 4 × 8, so nG = 8. To construct our global visual dictionary, we must define the number of clusters, and likewise the number of visual words kept during the vocabulary reduction. These parameters were set by experimental results on the development test set. For the global visual dictionary, we varied the number of clusters from 50 to 2500. For the number of visual words in the visual concept dictionary, we tested values from 10 to 10000. We optimize our parameters on parts of the initial training set of 5000 images: it is split into a new training set of 2500 images, a validation set of 2000 images and a test set of 500 images.
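For illustration, a possible implementation of the multi-scale grid splitting and of the dictionary construction could look like the sketch below; it uses scikit-learn's K-means and approximates each cluster medoid by the member closest to the centroid, which is a simplification of the procedure described above, and the grid list and parameter values are only those reported in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

GRIDS = [(1, 1), (2, 2), (4, 4), (8, 8), (2, 4), (4, 2), (8, 4), (4, 8)]  # nG = 8

def grid_blocks(image, rows, cols):
    """Split an image (H x W x C array) into a rows x cols grid of blocks."""
    h, w = image.shape[0] // rows, image.shape[1] // cols
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def build_dictionary(block_features, n_clusters=250):
    """block_features: (n_blocks, dim) descriptors of one feature type and one grid size.
    Returns one representative (medoid-like) vector per K-means cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(block_features)
    words = []
    for k in range(n_clusters):
        members = block_features[km.labels_ == k]
        if len(members) == 0:
            continue
        d = np.linalg.norm(members - km.cluster_centers_[k], axis=1)
        words.append(members[np.argmin(d)])   # member closest to the centroid
    return np.vstack(words)
```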
4
Results
Figure 1 shows the AUC results of our method for different parameters. Based on these results, and in order to have the best compromise between AUC and computing time, we chose to use 250 clusters and 6000 visual words. With these parameters, we obtain an AUC equal to 0.668. We then trained our model again on the whole training set (5000 images) and tested it on the test set of 3000 images, obtaining an AUC equal to 0.688.
Fig. 1. AUC results on the validation set according to the number of visual words in the Visual Concept Dictionary, for 100, 250, 500, 1000 and 2500 clusters.
4.1
Vocabulary Reduction
Visual words do not have the same importance in determining the presence of a concept. We select the most discriminative visual words for a given concept to compose a Visual Concept Dictionary. Automatic word selection methods such as information gain (IG), mutual information (MI), and so on are commonly applied in text categorization. In order to select the words, we tested the various methods described previously, giving the results in Table 1. We see that the IG, MI and Ent methods perform similarly; we use information gain (IG).

Table 1. AUC and EER results for different vocabulary reduction methods

        WF      DF      IG      MI      Ent
AUC     0.680   0.668   0.688   0.680   0.687
EER     0.365   0.375   0.356   0.364   0.360
4.2
Comparison with Classical SVM
We compare our WF-IIF method with a classical SVM [14] using the RBF kernel and the one-against-all strategy, and exactly the same information as our MSVD. First, for each concept, we run Linear Discriminant Analysis (LDA) on the concatenation of all features to reduce the impact of the high dimensionality. Then, we train a support vector machine for every concept on the LDA feature, whose outputs are considered as the confidences with which the samples belong to the concept. We use the same sets as for the visual dictionary: a training set to train the SVM models, validation sets to optimize parameters, and the final test set to evaluate the method. We also compare with the best ImageCLEF 2009 system, although the results are not on the same data. The ISIS group [15] applies a system that is based on four main steps: a spatial pyramid approach with salient point detection, SIFT feature extraction, codebook transformation, and a final learning step based on an SVM with a χ2 kernel.
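A comparable LDA+SVM baseline can be sketched with scikit-learn as follows; the kernel parameters and the use of probability outputs as confidences are illustrative choices, not the exact settings of [14].

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_lda_svm(features, concept_labels, gamma=1.0, C=1.0):
    """One binary LDA+SVM model per concept.
    features: (n_images, dim) concatenated visual features;
    concept_labels: (n_images, n_concepts) binary annotation matrix."""
    models = []
    for c in range(concept_labels.shape[1]):
        y = concept_labels[:, c]
        lda = LinearDiscriminantAnalysis(n_components=1).fit(features, y)
        svm = SVC(kernel="rbf", gamma=gamma, C=C, probability=True).fit(lda.transform(features), y)
        models.append((lda, svm))
    return models

def predict_confidences(models, features):
    """Confidence of each concept for each test image."""
    return np.column_stack([svm.predict_proba(lda.transform(features))[:, 1]
                            for lda, svm in models])
```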
Fig. 2. AUC results by concept for the Visual Dictionary vs the LDA+SVM method and the best ImageCLEF2009 system.
Fig. 3. AUC results by concept frequency for the WF-IIF vs the LDA+SVM method (AUC plotted against concept frequency for the Visual Dictionary and SVM runs).
On average, the LDA+SVM method obtains an AUC equal to 0.653. In the ImageCLEF campaign evaluation, this method obtained 0.72. The performance is lower with our data set, which is a part of the global ImageCLEF data, so the comparison with the best results is biased.
5
Conclusion
We can see that on average WF-IIF is more competitive than LDA+SVM. In particular, better results are obtained especially for concepts with rare occurrence (#positives/#negatives ≤ 1/10), probably due to the lack of positive examples in the SVM training; see Figure 3, where concepts are sorted by the number of positive samples in the training set. Moreover, our WF-IIF model on the MSVD needs only a few minutes for training on a Pentium IV 3 GHz with 4 GB of RAM, compared with the LDA+SVM model, which costs more than 5 hours. The test processing of our WF-IIF on the MSVD is also faster than that of the LDA+SVM.
Acknowledgement. This work was partially supported by the French National Agency of Research: ANR-06-MDCA-002 AVEIR project and ANR Blanc ANCL.
References
1. Picard, R.W.: Toward a visual thesaurus. In: Springer-Verlag Workshops in Computing, MIRO (1995)
2. Picard, R.W.: A society of models for video and image libraries (1996)
3. Zhang, R., Zhang, Z.M.: Hidden semantic concept discovery in region based image retrieval. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 996–1001 (2004)
4. Lim, J.H.: Categorizing visual contents by matching visual "keywords". In: Huijsmans, D.P., Smeulders, A.W.M. (eds.) VISUAL 1999. LNCS, vol. 1614, pp. 367–374. Springer, Heidelberg (1999)
5. Fauqueur, J., Boujemaa, N.: Mental image search by boolean composition of region categories. In: Multimedia Tools and Applications, pp. 95–117 (2004)
6. Souvannavong, F., Hohl, L., Mérialdo, B., Huet, B.: Enhancing latent semantic analysis video object retrieval with structural information. In: IEEE International Conference on Image Processing, ICIP 2004, Singapore, October 24-27 (2004)
7. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York (1986)
8. Mitchell, T.: Machine Learning (October 1997)
9. Seymore, K., Chen, S., Rosenfeld, R.: Nonlinear interpolation of topic models for language model adaptation. In: Proceedings of ICSLP-1998, vol. 6, pp. 2503–2506 (1998)
10. Jensen, R., Shen, Q.: Fuzzy-rough data reduction with ant colony optimization. Fuzzy Sets and Systems (March 2004)
11. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large-scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
12. Smach, F., Lemaître, C., Gauthier, J.P., Miteran, J., Atri, M.: Generalized Fourier descriptors with applications to objects recognition in SVM context. J. Math. Imaging Vis. 30(1), 43–71 (2008)
13. Glotin, H., Zhao, Z., Ayache, S.: Efficient image concept indexing by harmonic and arithmetic profiles entropy. In: Proceedings of the 2009 IEEE International Conference on Image Processing (ICIP 2009), Cairo, Egypt, November 7-11 (2009)
14. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
15. van de Sande, K., Gevers, T., Smeulders, A.: The University of Amsterdam's concept detection system at ImageCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Exploring the Semantics behind a Collection to Improve Automated Image Annotation
Ainhoa Llorente, Enrico Motta, and Stefan Rüger
Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom
{a.llorente,e.motta,s.rueger}@open.ac.uk
Abstract. The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Best results correspond to approaches based on statistical correlation as they do not depend on a prior disambiguation phase like WordNet and Wikipedia. Further work needs to be done to assess whether proper disambiguation schemas might improve their performance.
1
Introduction
Early attempts in automated image annotation were focused on algorithms that explored the correlation between words and image features. More recently, some efforts have benefited from exploiting the correlation between words by computing semantic similarity measures. In this work, we use the terms semantic similarity and semantic relatedness interchangeably. Nevertheless, we refer to the definition by Miller and Charles [1], who consider semantic similarity as the degree to which two words can be interchanged in the same context. Thus, we propose a model that automatically refines the image annotation keywords generated by a non-parametric density estimation approach by considering semantic relatedness measures. The underlying problem that we attempt to correct is that annotations generated by probabilistic models present poor performance as a result of too many "noisy" keywords. By "noisy" keywords, we mean those which are not consistent with the rest of the image annotations and, in addition to that, are incorrect. Semantic measures will improve the accuracy of these probabilistic models, allowing these new combined semantic-based models to be further investigated. As there exist numerous semantic relatedness measures, and each one of them works with different knowledge bases, we extend the model presented in [2] to new measures that perform the knowledge extraction using WordNet, Wikipedia, and the World Wide Web through search engines.
The ultimate goal of this research is to explore how semantics can help an automated image annotation system. In order to achieve this, we examine several semantic relatedness measures, studying their effect on a subset of the MIR-Flickr 25k collection, the proposed dataset for the Photo Annotation Task [3] in the latest edition, 2009, of the ImageCLEF competition. The rest of this paper is structured as follows. Section 2 introduces our model as well as the applied semantic measures. Then, Section 3 describes the experiments carried out on the image collection provided by ImageCLEF2009. Section 4 discusses the results and finally, conclusions are presented in Section 5.
2
Model Description
The baseline approach is based on the probabilistic framework developed by Yavlinsky et al. [4], who used global features together with a non-parametric density estimation to model the conditional probability of an image given a word. The density estimation is accomplished using a Gaussian kernel. A key aspect of this approach is the global visual features used. The algorithm described combines the CIELAB colour feature with the Tamura texture. The process for extracting each of these features is as follows: each image is divided into nine equal rectangular tiles, and the mean and second central moment per channel are calculated in each tile. The resulting feature vector is obtained by concatenating all the vectors extracted in each tile. In what follows, some of the semantic relatedness measures used in this approach are introduced. Due to space constraints, we refer to exhaustive reviews found in the literature whenever appropriate.
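A minimal sketch of the colour part of this global feature (3 × 3 tiles, mean and second central moment per CIELAB channel) is given below; the RGB-to-CIELAB conversion via scikit-image and the function name are our own choices, and the Tamura texture component is omitted.

```python
import numpy as np
from skimage.color import rgb2lab

def tile_colour_moments(image_rgb, grid=3):
    """Mean and second central moment per CIELAB channel on a grid x grid tiling."""
    lab = rgb2lab(image_rgb)
    h, w = lab.shape[0] // grid, lab.shape[1] // grid
    feats = []
    for r in range(grid):
        for c in range(grid):
            tile = lab[r * h:(r + 1) * h, c * w:(c + 1) * w].reshape(-1, 3)
            feats.extend(tile.mean(axis=0))   # mean per channel
            feats.extend(tile.var(axis=0))    # second central moment per channel
    return np.asarray(feats)                  # 9 tiles x 6 values = 54 dimensions
```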
2.1
Training Set Correlation
This approach is introduced in [2] where the training set is computed to generate a co-occurrence matrix that represents the probabilities of the frequency of two vocabulary words appearing together in a given image. This algorithm was previously tested on the Corel5k dataset and in the collection provided by the 2008 ImageCLEF edition showing promising results. 2.2
Web-Based Correlation
The most important limitation affecting approaches that rely on keyword correlation in the training set is that they are limited to the scope of the topics represented in the collection. Consequently, a web-based approach is proposed that makes use of web search engines as a knowledge base. Thus, the semantic relatedness between concepts x and y is defined by Gracia and Mena [5] as:

rel(x, y) = e^{-2 · NWD(x, y)},   (1)

where NWD stands for Normalized Web Distance, which is a generalisation of the Normalized Google Distance defined by Cilibrasi and Vitányi [6].
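Assuming the page-hit counts f(x), f(y), f(x, y) and the index size N have already been obtained from a search engine (the querying itself is not shown here), the measure can be computed as in the following sketch.

```python
from math import exp, log

def normalized_web_distance(fx, fy, fxy, n_pages):
    """Normalized Google/Web Distance from page-hit counts (all counts assumed >= 1)."""
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n_pages) - min(lx, ly))

def web_relatedness(fx, fy, fxy, n_pages):
    """rel(x, y) = exp(-2 * NWD(x, y)), as in equation (1)."""
    return exp(-2.0 * normalized_web_distance(fx, fy, fxy, n_pages))
```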
2.3
WordNet Measures
A fair amount of thesaurus-based semantic relatedness measures were proposed and investigated on the WordNet hierarchy of nouns (see [7] for a detailed review). The best result was achieved by Jiang and Conrath using a combination of statistical measures and taxonomic analysis. This was accomplished using the list of 30 noun pairs proposed by Miller and Charles in [1]. During our training phase (Section 3.1), we applied several WordNet semantic measures (Jiang and Conrath [8], Hirst and St-Onge [9], Resnik [10] and, Adapted Lesk [11]) to 1,378 pair of words obtained from our vocabulary. The best performing was the adapted Lesk measure proposed by Banerjee and Pedersen, closely followed by Jiang and Conrath’s relatedness measure. Banerjee and Pedersen defined the extended gloss overlap measure which computes the relatedness between two synsets by comparing the glosses of synsets related to them through explicit relations provided by the thesaurus. 2.4
Wikipedia Measures
According to a review by Medelyan et al. [12], the computation of semantic relatedness using Wikipedia has been addressed from three different point of views; one that applies WordNet-based techniques to Wikipedia followed by [13]; another that uses vector model techniques to compare similarity of Wikipedia articles proposed by Gabrilovich and Markovitch in [14]; and, the final one, which explores the Wikipedia as a hyperlinked structure introduced by Milne and Witten in [15]. The approach adopted in this research is the last one as it is less computationally expensive than the others that work with the whole content of Wikipedia. Milne and Witten proposed their Wikipedia Link-based Measure (WLM) which extracts semantic relatedness measure between two concepts using the hyperlink structure of Wikipedia. Thus, the semantic relatedness between two concepts is estimated by computing the angle between the vectors of the links found between the Wikipedia’s articles whose title matches each one of the concepts.
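A simplified sketch of such a link-based relatedness is given below; unlike the original WLM, it uses unweighted binary link vectors and plain cosine similarity, so it should be read as an approximation of the measure rather than a faithful reimplementation.

```python
import numpy as np

def link_vector(article_links, all_articles):
    """Binary vector indicating which articles the given Wikipedia article links to."""
    index = {a: i for i, a in enumerate(all_articles)}
    v = np.zeros(len(all_articles))
    for target in article_links:
        if target in index:
            v[index[target]] = 1.0
    return v

def wikipedia_link_relatedness(links_a, links_b, all_articles):
    """Cosine of the angle between the two link vectors (WLM-style relatedness)."""
    va = link_vector(links_a, all_articles)
    vb = link_vector(links_b, all_articles)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom > 0 else 0.0
```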
3
Experimental Work
In this paper, we describe the experiments carried out for the Photo Annotation Task for the ImageCLEF2009 campaign. The main goal of this task is, as described in [3], given a training set of 5,000 images manually annotated with words coming from a vocabulary of 53 visual concepts, to automatically provide annotations for a test set of 13,000 images. 3.1
Training Phase
Before submitting our runs to ImageCLEF2009, we made a preliminary study about which method performs better. In order to accomplish this goal, we performed a 10-fold cross validation on the training set. Thus, we divided the training set of the collection into two parts: a training set of 4,500 images and a validation set of 500. The validation set was used to tune the model parameters. During this training phase, we use as evaluation measure the mean average precision (MAP), as it has been shown to have especially good discriminatory and stable capabilities among evaluation measures. For a given query, average precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over all queries. We consider as queries all the words that are able to annotate an image in the test set; in our case this is the whole vocabulary of 53 words. We evaluated the performance of several semantic measures using various knowledge sources, as indicated in Table 1. The final goal of this training phase is to select the best performing measure per method. As noted from the results, methods based on word correlation outperform methods based on a thesaurus such as WordNet or Wikipedia. The poorest performance corresponds to the Wikipedia Link-based Measure.

Table 1. Comparative performance on the held-out data for our proposed methods using different semantic relatedness measures. Results are expressed in terms of mean average precision (MAP). In the third column, Δ represents the percentage of improvement of the method over the baseline. Best performing results are marked with an asterisk.

Method                               MAP        Δ
Baseline                             0.2613     -
Training Set Correlation             0.2720     4.09%
Wikipedia Link-based Measure         0.2681     2.60%
Web-based Correlation (Yahoo)        0.2720     4.09%
Web-based Correlation (Google)*      0.2736*    4.71%*
WordNet: Hirst and St-Onge (HSO)     0.2675     2.37%
WordNet: Resnik (RES)                0.2685     2.76%
WordNet: Jiang and Conrath (JCN)     0.2720     4.09%
WordNet: Adapted Lesk (LESK)         0.2721     4.13%

3.2
Discussion
The fact that many words of the proposed vocabulary are not included in WordNet or in Wikipedia adds a further complication to the process of computing some semantic relatedness measures. Thus, we followed the same approach adopted by [16], which consists in replacing some words by others similar to them. However, these replacements were rather difficult to accomplish, as we needed to select a word semantically and, at the same time, visually similar to the original one. Especially difficult were the words that represent a negation such as "no visual season", "no visual place", etc. In other cases, the replacement consisted in finding the corresponding noun for a given adjective, as in the case of "indoor", "outdoor", "sunny", "overexposed" or "underexposed". In addition to that, the computation of some semantic relatedness measures implies a prior disambiguation task. This occurs, again, in the case of Wikipedia and WordNet. Both of them automatically assign to every word its most usual sense. In the case of WordNet, this sense corresponds to the first sense in the synset (word#n#1), while in Wikipedia it corresponds to the most probable sense of the word according to the content stored in the Wikipedia database. Surprisingly, both methods present similar disambiguation capabilities of around 70% accuracy, with WordNet being slightly better. Table 2 shows some unlucky examples. Unfortunately, the most popular sense of a word does not necessarily match the sense of the word attributed in our collection. Consequently, these inaccuracies in the disambiguation process translate into poor performance for the resulting methods. This explains the results of Table 1, where Google achieves the best performance, as it does not need to do any disambiguation task. This result is closely followed by word correlation using the training set as source of knowledge. Finally, and confirming our previous expectations, the WordNet and Wikipedia semantic relatedness measures obtained the lowest results. Among the WordNet results, the Jiang and Conrath (JCN) measure is narrowly beaten by Adapted Lesk.

Table 2. Examples of Word Sense Disambiguation (WSD) using Wikipedia and WordNet. The wrong disambiguations are highlighted in bold characters.

Word            | Wikipedia                 | WordNet: word#n#1
Indoor          | The Inside                | "the region that is inside of something"
Outdoor         | Outside (magazine)        | "the region that is outside of something"
Canvas          | Canvas                    | "the setting for a fictional account"
Still Life      | Still                     | "a static photograph"
Macro           | Macro (computer science)  | "a single computer instruction"
Overexposed     | Light                     | "electromagnetic radiation"
Underexposed    | Darkness                  | "absence of light or illumination"
Plants          | Plant                     | "building for industrial labor"
Partly Blurred  | Bokeh                     | "a hazy or indistinct representation"
Small Group     | Group (auto racing)       | "a number of entities considered as a unit"
Big Group       | Hunter-gatherer           | "a group of persons together in one place"
4
Analysis of Results
Due to the limitations on the number of runs to be submitted to the ImageCLEF2009 competition, we propose our top four performing runs according to the training process described in Section 3.1. At the end of it, the training and validation sets were merged again to form a new training set of 5,000 images that was used to predict the annotations in the test set of 13,000 images. Thus, we submitted the following runs: correlation based on the training set, web-based correlation using Google, semantic relatedness using WordNet based on the Adapted Lesk measure and, finally, the Wikipedia Link-based measure using Wikipedia. Evaluation of the results was done using the two metrics proposed by the ImageCLEF organisers. The first one is based on ROC curves and proposes as measures the Equal Error Rate (EER) and the Area under the Curve (AUC), while the second metric is the Ontology Score (OS) proposed by [17], which takes into consideration the hierarchical form adopted by the vocabulary. Table 3 shows the results obtained by the proposed algorithms. These results are rather in tune with the results previously computed during our training process. As expected, the best results correspond to word correlation, either using the training set or using a web-based search engine like Google. Results for the OS metric are presented in Table 4. They corroborate the previous results computed using the ranked retrieval metric or the metric based on ROC curves. The only variation is that, depending on the metric, Web-based correlation outperforms training set correlation. It is worth noting that the emphasis of this research is placed on the analysis of the performance of the different semantic relatedness measures more than on the baseline run. However, we were able to perform an additional run with a more adequate selection of image features together with a kernel function, obtaining a significantly better EER value of 0.309021. This result was achieved by combining Tamura and Gabor texture with HSV and CIELAB colour descriptors and using a Laplacian kernel function instead of the Gaussian mentioned before.

Table 3. Evaluation performance of the proposed algorithms under the EER and AUC metrics. A random run is included for comparison purposes. The best performing result is marked with an asterisk. Note that the lower the EER, the better the performance of the annotation algorithm.

Algorithm                        EER         AUC
Training Set Correlation*        0.352478*   0.689410*
Web-based Correlation (Google)   0.352485    0.689407
WordNet: Adapted Lesk            0.352612    0.689342
Wikipedia Link-based Measure     0.356945    0.684821
Random                           0.500280    0.499307

Table 4. Evaluation performance of the proposed algorithms under the Ontology Score (OS) metric, considering the agreement among annotators or without it. A random run is included for comparison purposes. The best performing result is marked with an asterisk. In this case, the higher the OS, the better the performance of the annotation algorithm.

Algorithm                        With Agreement   Without Agreement
Web-based Correlation (Google)*  0.6180272*       0.57583610*
Training Set Correlation         0.6179764        0.57577974
WordNet: Adapted Lesk            0.6172693        0.57497290
Wikipedia Link-based Measure     0.4205571        0.35027474
Random                           0.3843171        0.35097164
5
Conclusions
The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Several metrics are employed to evaluate the results: the MAP ranked retrieval metric, the ROC-curve-based EER metric and the proposed Ontology-based Score. Disparity among the results under the three metrics is not significant. Thus, we observe that the best performance is achieved using correlation approaches. This is due to the fact that they do not rely on a prior disambiguation process like WordNet and Wikipedia. Not surprisingly, the worst result corresponds to the semantic measure based on Wikipedia. The reasons behind it might be found in the strong dependency of the semantic relatedness measure on a proper word disambiguation. The disambiguation in Wikipedia is automatically performed by selecting the most probable sense of the word according to the content stored in the Wikipedia database. Most of the vocabulary words do not correspond to real visual features and at the same time present difficult semantics. Consequently, we predicted, and subsequently checked, lower results for concepts classified into categories such as "Seasons", "Time of the day", "Picture representation", "Illumination", "Quality Blurring" and especially the most subjective one, "Quality Aesthetics". Further analysis is needed to determine whether the performance of WordNet and Wikipedia can be improved by incorporating robust disambiguation schemas.

Acknowledgments. This work was partially funded by the EU-Pharos project under grant number IST-FP6-45035 and by Santander Universities.
References
1. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Journal of Language and Cognitive Processes 6, 1–28 (1991)
2. Llorente, A., Rüger, S.: Using second order statistics to enhance automated image annotation. In: Proceedings of the 31st European Conference on Information Retrieval, vol. 5478, pp. 570–577 (2009)
3. Nowak, S., Dunker, P.: Overview of the CLEF 2009 Large Scale – Visual Concept Detection and Annotation Task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
4. Yavlinsky, A., Schofield, E., Rüger, S.: Automated image annotation using global features and robust nonparametric density estimation. In: Proceedings of the International ACM Conference on Image and Video Retrieval, pp. 507–517 (2005)
5. Gracia, J., Mena, E.: Web-based measure of semantic relatedness. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 136–150. Springer, Heidelberg (2008)
6. Cilibrasi, R., Vitányi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
7. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1), 13–47 (2006)
8. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the International Conference on Research in Computational Linguistics (1997)
9. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms. In: WordNet: A Lexical Database for English, pp. 305–332. The MIT Press, Cambridge (1998)
10. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453 (1995)
11. Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence (2003)
12. Medelyan, O., Milne, D., Legg, C., Witten, I.H.: Mining meaning from Wikipedia. International Journal of Human-Computer Studies 67(9), 716–754 (2009)
13. Ponzetto, S., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research (JAIR) 30, 181–212 (2007)
14. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611 (2007)
15. Milne, D., Witten, I.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence (2008)
16. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., Soroa, A.: A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of NAACL-HLT (2009)
17. Nowak, S., Lukashevich, H.: Multilabel classification evaluation using ontology information. In: Proceedings of the ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (2009)
Multi-cue Discriminative Place Recognition Li Xing and Andrzej Pronobis Centre for Autonomous Systems, The Royal Institute of Technology SE100-44 Stockholm, Sweden {lixing,pronobis}@kth.se
Abstract. In this paper we report on our successful participation in the RobotVision challenge in the ImageCLEF 2009 campaign. We present a place recognition system that employs four different discriminative models trained on different global and local visual cues. In order to provide robust recognition, the outputs generated by the models are combined using a discriminative accumulation method. Moreover, the system is able to provide an indication of the confidence of its decision. We analyse the properties and performance of the system on the training and validation data and report the final score obtained on the test run which ranked first in the obligatory track of the RobotVision task.
1
Introduction
This paper presents the place recognition algorithm based on multiple visual cues that was applied to the RobotVision task of the ImageCLEF 2009 campaign. The task addressed the problem of visual indoor place recognition applied to robot topological localization. Participants were given training, validation and test sequences capturing the appearance of an office environment under various conditions [1]. The task was to build a system able to answer the question “where are you?” (I am in the kitchen, in the corridor, etc) when presented with a test sequence imaging rooms seen during training, or additional rooms that were not imaged in the training sequence. The results could be submitted for two separate tracks: (a) obligatory, in case of which each single image had to be classified independently; (b) optional, where the temporal continuity of the sequences could be exploited to improve the robustness of the system. For more information about the task and the dataset used for the challenge, we refer the reader to the RobotVision@ImageCLEF’09 overview paper [2]. The visual place recognition system presented in this paper obtained the highest score in the obligatory track and constituted a basis for our approach used in the optional track. The system relies on four discriminative models trained on different visual cues that capture both global and local appearance of a scene. In order to increase the robustness of the system, the cues are integrated efficiently using a high-level accumulation scheme that operates on the separate models
This work was supported by the EU FP7 integrated project ICT-215181-CogX. The support is gratefully acknowledged.
adapted to the properties of each cue. Additionally, in the optional track, we used a simple temporal accumulation technique which exploits the continuity of the image sequences to refine the results. Since the misclassifications were penalized in the competition, we experimented with an ignorance detection technique relying on the estimated confidence of the decision. Visual place recognition is a vastly researched topic in the robotics and computer vision communities and several different approaches have been proposed to the problem considered in the competition. The main differences between the approaches relate to the way the scene is perceived and thus the visual cues extracted from the input images. There are two main groups of approaches using either global or local image features. Typically, SIFT [3] and SURF [4] are applied as local features, either using a matching strategy [5,6] or the bag-of-words approach [7,8]. Global features are also commonly used for place recognition and such representations as gist of a scene [9], CRFH [10], or PACT [11] were proposed. Recently, several authors observed that robustness and efficiency of the recognition system can be improved by combining information provided by both types of cues (global and local) [5, 12]. Our approach belongs to this group and four different types of features previously used in the domain of place recognition have been used in the presented system. The rest of the paper gives a description of the structure and components of our place recognition system (Section 2). Then, we describe the initial experiments performed on the training and validation data (Section 3). We explain the procedure applied for parameter selection and study the properties of the cue integration and confidence estimation algorithms. Finally, we present the results obtained on the test sequence and our ranking in the competition (Section 4). The paper concludes with a summary and possible avenues for future research.
2
The Visual Place Recognition System
This section describes our approach to visual place classification. Our method is fully supervised and assumes that during training, each place (room) is represented by a collection of labeled data which captures its intrinsic visual properties under various viewpoints, at a fixed time and illumination setting. During testing, the algorithm is presented with data samples acquired under different conditions and after some time. The goal is to recognize correctly each single data sample provided to the system. The rest of the section describes the structure and components of the system. 2.1
System Overview
The architecture of the system is illustrated in Fig. 1. We use four different cues extracted independently from the visual input. We see that there is a separate path for each cue. Every path consists of two main building blocks: a feature extractor and a classifier. Thus separate decisions can be obtained for every cue. The outputs encoding the confidence of single-cue classifiers are combined using a discriminative accumulation scheme.
Fig. 1. Structure of the multi-cue visual place recognition system
2.2
Visual Features
The system relies on visual cues based on global and local image features. Global features are derived from the whole image and thus can capture general properties of the whole scene. In contrast, local features are computed locally, from distinct parts of an image. This makes them much more robust to occlusions and viewpoint variations. In order to capture different aspects of the environment, we combine cues produced by four different feature extractors.

Composed Receptive Field Histograms (CRFH). CRFH [13] is a multi-dimensional statistical representation (a histogram) of the occurrence of responses of several image descriptors applied to the whole image. Each dimension corresponds to one descriptor and the cells of the histogram count the pixels sharing similar responses of all descriptors. This approach allows capturing various properties of the image as well as relations that occur between them. On the basis of the evaluation in [10], we build the histograms from second order Gaussian derivative filters applied to the illumination channel at two scales.

PCA of Census Transform Histograms (PACT). The Census Transform (CT) [11] is a non-parametric local transform designed for establishing correspondence between local patches. The Census Transform compares the intensity values of a pixel with its eight neighboring pixels, as illustrated in Figure 2. A histogram of the CT values encodes both local and global information of the image. PACT [11] is a global representation that extracts the CT histograms for several image patches organized in a grid and applies Principal Component Analysis (PCA) to the resulting vector.

Scale Invariant Feature Transform (SIFT). As one of the local representations, we used a combination of the SIFT descriptor [3] and the scale, rotation and translation invariant Harris-Laplace corner detector [14]. The SIFT descriptor represents local image patches around interest points, characterized by coordinates in the scale space, in the form of histograms of gradient directions.
Fig. 2. Illustration of the Census Transform [11]
Speed-Up Robust Features (SURF). SURF [4] is a scale- and rotation-invariant local detector and descriptor which is designed to approximate the performance of previously proposed schemes while being much more computationally efficient. This is obtained by using integral images, a Hessian matrix-based measure for the detector and a distribution of Haar-wavelet responses for the descriptor. 2.3
Place Models
Based on its state-of-the-art performance in several visual recognition domains [15, 16], we used the Support Vector Machine classifier [17] to build the models of places for each cue. The choice of the kernel function is a key ingredient for the good performance of SVMs and we selected specialized kernels for each cue. Based on results reported in the literature, we chose in this paper the χ2 kernel [18] for CRFH, the Gaussian (RBF) kernel [17] for PACT and the match kernel [19] for both local features. In order to extend the binary SVM to multiple classes, we used the one-against-all strategy for which one SVM is trained for each class separating the class from all other classes. SVMs do not provide any out-of-the-box solution for estimating confidence of the decision; however, it is possible to derive confidence information and hypotheses ranking from the distances between the samples and the hyperplanes. In this work, we experimented with the distance-based methods proposed in [5], which define confidence as a measure of unambiguity of the final decision. 2.4
Cue Integration and Temporal Accumulation
As indicated in [5], different properties of visual cues result in different performance and error patterns on the place classification task. The role of the cue integration scheme is to exploit this fact in order to increase the overall performance. Our place recognition system uses the Discriminative Accumulation Scheme (DAS) [16] that was proposed for the place classification problem in [5]. It accumulates multiple cues by turning classifiers into experts. The basic idea is to consider real-valued outputs of a multi-class discriminative classifier as an indication of a soft decision for each class. Then, all of the outputs obtained from the various cues are summed together, i.e. linearly accumulated. In the presented system, this can be expressed by the equation O_Σ = a · O_CRFH + b · O_PACT + c · O_SIFT + d · O_SURF, where a, b, c, d are the weights assigned to each cue and a + b + c + d = 1. The vectors O represent the outputs of the multi-class classifiers for each cue. We used a very similar scheme to improve the robustness of the system operating on image sequences. For this, we exploited the continuity of the sequences
and accumulated the outputs (of a single cue or integrated cues) for the current sample and N previously classified samples. The result of accumulation was then used as the final decision of the system.
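A minimal sketch of the linear accumulation and of the temporal averaging over the last N frames is given below, using the weights eventually selected in Section 3.2; the data structures and function names are our own illustrative choices.

```python
import numpy as np

def das_integration(outputs, weights):
    """Discriminative accumulation: weighted sum of per-cue classifier output vectors.
    outputs: dict cue -> (n_classes,) array; weights: dict cue -> float summing to 1."""
    return sum(weights[c] * outputs[c] for c in outputs)

def temporal_accumulation(history, current, n_past=4):
    """Average the integrated outputs of the current frame and the n_past previous ones."""
    window = (history[-n_past:] + [current]) if history else [current]
    return np.mean(window, axis=0)

# weights selected in the experiments: a = 0.1 (CRFH), b = 0.15 (PACT), c = 0.75 (SIFT), d = 0 (SURF)
weights = {"CRFH": 0.1, "PACT": 0.15, "SIFT": 0.75, "SURF": 0.0}
```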
3
Experiments on the Training and Validation Data
We conducted several series of experiments on the training and validation data in order to analyze the behavior of our system and select parameters. We present the analysis and results in successive subsections. 3.1
Selection of the Model Parameters
The first set of experiments was aimed at finding the values of parameters of the place models, i.e. the SVM error penalty C and the kernel parameters. The experiments were performed separately for each visual cue (CRFH, PACT, SIFT and SURF). To find the parameters, we performed cross validation on the training and validation data. For every training set, we selected parameters that resulted in highest classification rate on all available test sets acquired under different conditions. The classification rate was calculated in a similar way as the final score used in the competition i.e. as the percentage of correctly classified images in the whole testing sequence. Figure 3 presents the results obtained for the experiments with the dum-night3 training set which was selected for the final run of the competition. It is apparent that the model based on the SIFT features provides the highest recognition rate on average. However, we can also see that different cues have different characteristics as their performance changes according to different patterns. This suggests that the overall performance of the system could be increased by integrating the outputs of the models. 3.2
Cue Integration and Temporal Accumulation
The next step was to integrate the outputs of the models and choose the proper values of the DAS weights for each model. We performed an exhaustive search for the weights on the training and validation data independently for each training set. Then, we selected the values that provided the highest average classification rate over all test sets. The results are presented in Figure 3. This weight selection procedure revealed that once SIFT is used as one of the cues, there is no benefit of adding SURF (the weight for SURF was selected to be 0). This is not surprising since SURF captures similar information as SIFT, while employing some heuristics in order to make the feature extraction process more efficient. According to the results presented in the previous section, those heuristics decrease the overall performance of the system, while not introducing any additional knowledge. Figure 4 illustrates how the average classification rates for the dum-night3 training set and all test sets changed for various values of the weights used for CRFH, PACT and SIFT (the weight used for SURF is assumed to be 0). The following weights were selected and used for further experiments: a = 0.1, b = 0.15, c = 0.75, d = 0. We performed similar experiments to find the number of past samples we should accumulate over in order to refine the results in case of the optional track. The results revealed that we obtain the highest score when 4 past test samples are accumulated with equal weights with the currently classified sample.

Fig. 3. Classification rates for the best model parameters and the dum-night3 training set (CRFH, PACT, SIFT, SURF and DAS on the dum-cloudy1, dum-cloudy2, dum-sunny1 and dum-sunny2 test sets). Results are given separately for each test set as well as averaged over all sets.

Fig. 4. Classification rates obtained for various values of the DAS weights a (CRFH), b (PACT) and c (SIFT).

3.3
Confidence Estimation
According to the performance measure used in the competition, the classification errors were penalized. Therefore, we experimented with an ignorance detection mechanism based on the confidence of the decision produced by the system. In order to simulate the case of unknown rooms in the test set, we always removed one room from the training set. Figure 5a-e presents the obtained average results. We gradually increased the value of confidence that is required in order to accept the decision of the system and measured the statistics of the accepted and rejected decisions. In both cases, we measured the percentage of test samples that were classified correctly, misclassified or unknown during training. We can see from the plots that the confidence thresholding procedure rejects mostly samples from unknown rooms and samples that would be incorrectly classified. This increases the classification rate for the accepted samples. At the same time, the plots show the score used for the competition calculated for the accepted samples only. If we use the penalty equal to 0.5 points for each misclassified sample (as used in the competition), the number of rejected errors must be twice as large as the number of rejected samples that would be classified correctly. As a result, the ignorance detection scheme provided only a slight improvement of the final score and we decided not to use confidence thresholding for the final run. However, as shown in Figure 5, if the penalty was increased to 1 point, the improvement would be significant.

Fig. 5. Average results of the experiments with confidence-based ignorance detection for separate cues and for the integrated cues ((a) CRFH, (b) PACT, (c) SIFT, (d) SURF, (e) DAS, (f) legend; each panel plots the percentage of samples and the score against the confidence threshold).

Table 1. Results and scores obtained on the final test set.

(a) Scores and ranks:

           Obligatory Track   Optional Track
Score      793.0              853.0
Rank       1                  4

(b) Confusion matrix. Values in brackets are for the optional track:

True \ Predicted   1-person Office  Corridor    2-person Office  Kitchen     Printer Area  Unkn. Room
1-person Office    119 (129)        25 (23)     12 (8)           4 (0)       0 (0)         0 (0)
Corridor           4 (2)            570 (580)   6 (3)            10 (6)      1 (0)         0 (0)
2-person Office    1 (0)            4 (0)       131 (134)        25 (27)     0 (0)         0 (0)
Kitchen            1 (0)            5 (0)       2 (0)            152 (161)   1 (0)         0 (0)
Printer Area       5 (0)            138 (139)   10 (7)           3 (2)       120 (128)     0 (0)
Unkn. Room         13 (14)          206 (206)   22 (24)          11 (6)      89 (91)       0 (0)
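Returning to the confidence-based rejection evaluated above, a minimal sketch of the thresholding step is given below; it uses the gap between the two largest accumulated outputs as the confidence value, which is one of the distance-based measures of unambiguity mentioned in Section 2.3, not necessarily the exact variant used in our experiments.

```python
import numpy as np

def classify_with_rejection(integrated_outputs, class_names, threshold):
    """Accept the top class only if its margin over the runner-up exceeds the threshold."""
    order = np.argsort(integrated_outputs)[::-1]
    confidence = integrated_outputs[order[0]] - integrated_outputs[order[1]]
    if confidence < threshold:
        return None, confidence           # rejected: possibly an unknown room
    return class_names[order[0]], confidence
```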
4
The Final Test Run
The test sequence and the ID of the training sequence (dum-cloudy3 ) were released in the final round of the competition. For the final run, we used the parameters identified on the training and validation data. In order to obtain the results for the obligatory track, we applied the models independently to each image in the test sequence and integrated the results using the selected weights. We did not perform ignorance detection. In order to obtain the results for the optional task, we applied the temporal averaging to the results submitted to
the obligatory track. Table 1a presents our scores and ranks in both tracks. Table 1b shows the confusion matrix for the test set. We can see that the temporal averaging filtered out many single misclassifications in the test sequence.
5
Conclusions
In this paper we presented our place recognition system applied to the RobotVision task of the ImageCLEF'09 campaign. Through the use of multiple visual cues integrated using a high-level discriminative accumulation scheme, we obtained a system that provided robust recognition despite different types of variations introduced by changing illumination and long-term human activity. The most difficult aspect of the task turned out to be the novel class detection. We showed that the confidence of the classifier can be used to reject unknown or misclassified samples. However, we did not provide any principled way to detect the cases when the classifier dealt with a novel room. Our future work will concentrate on that issue.
References
1. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. of IROS 2007 (2007)
2. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 Robot Vision task (2009)
3. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004)
4. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
5. Pronobis, A., Caputo, B.: Confidence-based cue integration for visual place recognition. In: Proc. of IROS 2007 (2007)
6. Valgren, C., Lilienthal, A.J.: Incremental spectral clustering and seasons: Appearance-based localization in outdoor env. In: Proc. of ICRA 2008 (2008)
7. Filliat, D.: A visual bag of words method for interactive qualitative localization and mapping. In: Proc. of ICRA 2007 (2007)
8. Cummins, M., Newman, P.: FAB-MAP: Probabilistic localization and mapping in the space of appearance. International Journal of Robotics Research 27(6) (2008)
9. Torralba, A., Murphy, K.P., Freeman, W.T., Rubin, M.A.: Context-based vision system for place and object recognition. In: Proc. of ICCV 2003 (2003)
10. Pronobis, A., Caputo, B., Jensfelt, P., Christensen, H.I.: A discriminative approach to robust visual place recognition. In: Proc. of IROS 2006 (2006)
11. Wu, J., Rehg, J.M.: Where am I: Place instance and category recognition using spatial PACT. In: Proc. of CVPR 2008 (2008)
12. Weiss, C., Tamimi, H., Masselli, A., Zell, A.: A hybrid approach for vision-based outdoor robot localization using global and local image features. In: Proc. of IROS 2007 (2007)
13. Linde, O., Lindeberg, T.: Object recognition using composed receptive field histograms of higher dimensionality. In: Proc. of ICPR 2004 (2004)
14. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: Proc. of ICCV 2001 (2001)
15. Pronobis, A., Martínez Mozos, O., Caputo, B.: SVM-based discriminative accumulation scheme for place recognition. In: Proc. of ICRA 2008 (2008)
16. Nilsback, M.E., Caputo, B.: Cue integration through discriminative accumulation. In: Proc. of CVPR 2004 (2004)
17. Cristianini, N., Taylor, J.S.: An Introduction to SVMs and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
18. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5) (1999)
19. Wallraven, C., Caputo, B., Graf, A.: Recognition with local features: the kernel recipe. In: Proc. of ICCV 2003 (2003)
MRIM-LIG at ImageCLEF 2009: Robotvision, Image Annotation and Retrieval Tasks
Trong-Ton Pham1, Loïc Maisonnasse2, Philippe Mulhem1, Jean-Pierre Chevallet1, Georges Quénot1, and Rami Al Batal1
1
Laboratoire Informatique de Grenoble (LIG), Grenoble University, CNRS, LIG 2 Laboratoire d’InfoRmatique en Image et Systemes d’information (LIRIS) {Trong-Ton.Pham,Philippe.Mulhem,Jean-Pierre.Chevallet}@imag.fr, {Georges.Quenot,Rami.Albatal}@imag.fr
Abstract. This paper mainly describes the experiments that have been conducted by the MRIM group at the LIG in Grenoble for the ImageCLEF 2009 campaign, focusing on the work done for the Robotvision task. The proposal for this task is to study the behaviour of a generative approach inspired by the language model of information retrieval. To fit the specificity of the Robotvision task, we added post-processing to take into account the fact that images belong only to a few classes (rooms) and that images are not independent from each other (i.e., the robot cannot be in three different rooms within one second). The results obtained still need improvement, but the applicability of such a language model to the Robotvision case is shown. Some results related to the Image Retrieval task and the Image Annotation task are also presented.
1
Introduction
We describe here the different experiments that have been conducted by the MRIM group at the LIG in Grenoble for the ImageCLEF 2009 campaign, and more specifically for the Robotvision task. Our goal for this task was to study the use of language models in a context where we try to guess in which room a robot is in a partially known environment. Language models for text retrieval were proposed ten years ago, and behave very well when all the data cannot be directly extracted from the corpus. We have already proposed such an application for image retrieval in [10], achieving very good results. We decided to focus on the challenging task represented by the Robotvision task in CLEF 2009. We also participated in the Image Retrieval and the Image Annotation tasks for CLEF 2009, and we briefly discuss, because of space constraints, some of our proposals and results. The paper is organized as follows. First we describe the Robotvision task in Section 2, our proposal based on language models and the results obtained. In this section, we focus on the features that were used to represent the images, before describing the language model defined on such a representation and the post-processing that took advantage of the specificity of the Robotvision task. Because the MRIM-LIG research group participated in two other image related
tasks, we briefly describe in Section 3 our main proposals and findings for the image annotation and the image retrieval tasks. We conclude in Section 4.
2
Robotvision Track
2.1
Task Description
The Robotvision task at CLEF 2009 [1] aims at determining "the topological location of a robot based on images acquired with a perspective camera mounted on a robot platform." A robot moves on a building floor, going across several (six) rooms, and an automatic process has to indicate, for each image of a video sequence shot by the robot, in which room the robot is. In the test video, an additional, unknown room (which was not given in the training set) is present and also has to be tagged automatically. The full video set is the IDOL video database [6].
Image Representation
We have applied a visual language modeling framework for the Robotvision task. This generative model is quite standard in the Information Retrieval field, and already lead to good results for visual scene recognition [10]. Before explaining in detail the language modeling approach, we fix some elements related to the feature extractions of images. To cover the different classes of features that could be relevant, we have extracted color, texture, and region of interest features in our proposal. These features are: HSV color histogram: we extract the color information from HSV color space. One image is represented by a concatenation of n×n histograms, according to non overlapping rectangular patches defined from a n×n grid applied on the image. Each histogram has 512 dimensions; Multi-scale canny edge histogram: we used Canny operator to detect the contour of objects as presented in [15]. An 80-dimensional vector was used to capture magnitudes and gradient of the contours for each patch. This information is extracted from a grid of m×m for each image; Color SIFT: SIFT features are extracted using D. Lowe’s detector [5]. Region around the keypoint is described by a 128-dimensional vector for each R, G, B channel. Based on the usual bag of visual words approach, we construct for each of the features above a visual vocabulary of 500 visual words using k-means clustering algorithm. Each visual word is designated to a concept c. Each image will then be represented using theses concepts and the language model proposed is built on these concepts. 2.3
2.3 Visual Language Modeling
The language modeling approach to information retrieval has existed since the end of the 1990s [11]. In this framework, the relevance status value of a document for a given query is estimated by the probability of generating the query from the document. Even though this approach was originally proposed for unigrams (i.e.
isolated terms), several extensions have been proposed to deal with n-grams (i.e. sequences of n terms) [12,13] and, more recently, with relationships between terms and graphs. Thus, [3] proposes (a) the use of a dependency parser to represent documents and queries, and (b) an extension of the language modeling approach to deal with such trees. [8,9] further extend this approach with a model compatible with general graphs, such as the ones obtained by a conceptual analysis of documents and queries. Other approaches (such as [2,4]) have respectively used probabilistic networks and kernels to capture spatial relationships between regions in an image. In the case of [2], the estimation of the region probabilities relies on an EM algorithm, which is sensitive to the initial probability values. In contrast, in the model we propose, the likelihood function is convex and has a global maximum. In the case of [4], the kernel used only considers the three closest regions to a given region. In [10], we presented the image as a probabilistic graph which allows capturing the visual complexity of an image. Images are represented by a set of weighted concepts, connected through a set of directed associations. The concepts aim at characterizing the content of the image whereas the associations express the spatial relations between concepts. Our assumption is that the concepts are represented by non-overlapping regions extracted from images. In this competition, the images acquired by the robot are of poor quality, and we decided not to take into account the relationships between concepts. We thus assume that each document image d (and equivalently each query image q) is represented by a set of weighted concepts WC. The concepts correspond to the visual words used to represent the image. The weight of a concept captures the number of occurrences of this concept in the image. Denoting C the set of concepts over the whole collection, WC can be defined as a set of pairs (c, w(c, d)), where c is an element of C and w(c, d) is the number of times c occurs in the document image d. We are then in a context similar to the usual language model for text retrieval. We then rely on a language model defined over concepts, as proposed in [7], which we refer to as the Conceptual Unigram Model. We assume that a query q or a document d is composed of a set WC of weighted concepts, each concept being conditionally independent of the others. Unlike [7], which computes a query likelihood, we evaluate the relevance status value rsv of a document image d for a query q using a generalized formula, the negative Kullback-Leibler divergence, noted D. This divergence is computed between two probability distributions: the document model Md computed over the document image d and the query model Mq computed over the query image q. Assuming the concept independence hypothesis, this leads to:

RSV_kld(q, d) = −D(M_q || M_d)    (1)
             ∝ Σ_{c_i ∈ C} P(c_i | M_q) log P(c_i | M_d)    (2)
where P(c_i | M_d) and P(c_i | M_q) are the probabilities of the concept c_i in the models estimated over the document d and the query q respectively. If we assume multinomial models for M_d and M_q, P(c_i | M_d) is estimated through maximum likelihood
(as is standard in the language modeling approach to IR), using Jelinek-Mercer smoothing:
P(c_i | M_d) = (1 − λ_u) · F_d(c_i) / F_d + λ_u · F_c(c_i) / F_c    (3)
where F_d(c_i) represents the sum of the weights of c_i in all graphs from the document image d, and F_d is the sum of all the document concept weights in d. The functions F_c are similar, but defined over the whole collection (i.e. over the union of all the images from all the documents of the collection). The parameter λ_u helps take into account reliable information when the information from a given document is scarce. The quantity P(c_i | M_q) is estimated through maximum likelihood without smoothing on the query. The final result L(q_i) for one query image i is a list of the images d_j from the learning set ranked according to the RSV_kld(q_i, d_j) value.
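The following sketch illustrates Equations (1)-(3) in Python; the concept-count dictionaries `doc_counts` and `coll_counts`, and the smoothing value λ_u = 0.5, are assumptions made for illustration only, since the paper does not report the parameter value.

```python
import math

def p_concept(c, doc_counts, coll_counts, lam=0.5):
    """P(c | M_d) with Jelinek-Mercer smoothing (Eq. 3)."""
    f_d = sum(doc_counts.values()) or 1
    f_c = sum(coll_counts.values()) or 1
    return (1 - lam) * doc_counts.get(c, 0) / f_d + lam * coll_counts.get(c, 0) / f_c

def rsv_kld(query_counts, doc_counts, coll_counts, lam=0.5):
    """Rank-equivalent form of -D(M_q || M_d): sum over query concepts (Eq. 1-2)."""
    f_q = sum(query_counts.values()) or 1
    score = 0.0
    for c, w in query_counts.items():
        p_q = w / f_q                 # maximum likelihood estimate, no smoothing on the query
        score += p_q * math.log(p_concept(c, doc_counts, coll_counts, lam))
    return score
```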
2.4 Post-Processing of the Results
As we just mentioned, in this basic case we may associate the query image with the room id of the best-ranked image. However, because we represent each image with several features and because we have several images of each room in the training set, we post-process this basic result:
– Fusion: an image is represented independently for each feature considered (color with a given grid, texture with a given grid, regions of interest). Each of these representations leads to different matching results using the language model. We choose to make a late fusion of the three results using a linear combination:

RSV(Q, D) = Σ_i RSV_kld(q_i, d_i)    (4)
where Q and D correspond to the image query and documents, and q_i and d_i describe the query and the document according to a feature i.
– Grouping training images by their room: assuming that the closest training image of a query image is not sufficient to determine the room because of its intrinsic ambiguity, we propose to group the results of the n best images for each room. We are then able to compute a ranked list of rooms RL instead of an image list for each query image:

RL_q = [(r, RSV_r(q, r))]    with    RSV_r(q, r) = Σ_{d ∈ f_n-best(q, r)} RSV(q, d)    (5)
where r corresponds to a room and f_n-best is a function that selects the n images with the best RSV belonging to the room r.
– Filtering the unknown room: in the test set of the Robotvision task, we know that one additional room is added. To tackle this point, we assume that if a room r is recognized, then the matching value for r is significantly larger than the matching values for the other rooms, especially compared to the room with the lowest matching value. So, if this difference is large (> β), we consider that there is a significant difference and we keep the tag r for the image. Otherwise we consider the image room tag as unknown. In our experiments, we fixed the threshold β to 0.003.
– Smoothing window: we exploit the visual continuity in a sequence of images by smoothing the result along the temporal axis. To do so, we use a flat smoothing window (i.e., all the images in the window have the same weight) centered on the current image. In the experiments, we chose a window width of w = 40 (i.e. 20 images before and after the classified image). A short code sketch of these post-processing steps follows the list.
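A minimal sketch of these post-processing steps is given below; it is an illustration rather than the authors' code, the room/score data structures are assumptions, and the temporal smoothing is rendered here as a majority vote over labels, which is only one possible reading of the flat window.

```python
from collections import defaultdict

def room_scores(ranked, n_best=15):
    """ranked: list of (room_id, rsv) pairs for one query image, best first."""
    per_room = defaultdict(list)
    for room, rsv in ranked:
        per_room[room].append(rsv)
    # sum the n best RSV values of each room (Eq. 5)
    return {room: sum(sorted(scores, reverse=True)[:n_best])
            for room, scores in per_room.items()}

def decide_room(scores, beta=0.003):
    """Tag as 'unknown' when the best room does not clearly dominate the worst one."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, worst = ordered[0], ordered[-1]
    return best[0] if best[1] - worst[1] > beta else "unknown"

def smooth(labels, w=40):
    """Flat window of width w centred on each frame (majority-vote reading)."""
    half, out = w // 2, []
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        out.append(max(set(window), key=window.count))
    return out
```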
2.5 Validating Process
The validation aims at evaluating the robustness of the algorithms to visual variations that occur over time due to changing conditions and human activity. We trained our system with the night3 condition set and tested it against all the other conditions from the validation set. Our objective was to understand the behavior of our system under changing conditions and with different types of features. We first study the models one by one. We built 3 different language models corresponding to the 3 types of visual features. The training set used is the night3 set. The models Mc and Me correspond to the color histogram and the edge histograms generated from a 5×5 grid. The model Ms corresponds to the color SIFT feature extracted from interest points. The recognition rates according to several validation sets are presented in Table 1.

Table 1. Results obtained with 3 visual language models (Mc, Me, Ms)

Train   Validation  HSV (Mc)  Edge (Me)  SIFT color (Ms)
night3  night2      84.24%    59.45%     79.20%
night3  cloudy2     39.33%    58.62%     60.60%
night3  sunny2      29.04%    52.37%     54.78%
We noticed that, in the same condition (e.g. night-night), the HSV color histogram model Mc outperforms the two other models. However, under different conditions, the result of this model drops significantly (from 84% to 29%). On the other hand, the edge model (Me) and the SIFT color model (Ms) are more robust to the change of conditions. In the worst condition (night-sunny), they still obtain a recognition rate of 52% for Me and 55% for Ms. As a result, we chose to consider only the edge histogram and SIFT features for the official runs. Then, we studied the impact of the post-processing on the ranked lists of the models Me and Ms on the recognition rate in Table 2.
Table 2. Result of the post-processing steps based on the 2 models Me and Ms

Train   Validation  Fusion  Regrouping  Filtering      Smoothing
night3  sunny2      62%     67% (n=15)  72% (β=0.003)  92% (k=20)
The fusion of the 2 models leads to an overall improvement of 8%. The regrouping step helps prominent rooms to stand out in the score list by averaging each room's n best scores. The filtering, using the threshold β=0.003, eliminates some of the uncertain decisions. Finally, the smoothing step with a window size of 40 increases the performance on a sequence of images significantly, by more than 20% compared to the initial result.
2.6 Submitted Runs and Results
For the official test, we constructed 3 models based on the validating process. We eliminated the HSV histogram model because of its poor performance across different lighting conditions and because there was little chance of having the same condition. We used the same visual vocabulary of 500 visual concepts generated for the night3 set. Each model provided a ranked result corresponding to the released test sequence. The post-processing steps were performed as in the validating process, employing the same parameters. The visual language models built for the competition are: Me1, a visual language model based on edge histograms extracted from a 10×10 patch division; Me2, a visual language model based on edge histograms extracted from a 5×5 patch division; and Ms, a visual language model based on color SIFT local features. Our test was performed on a quad-core 2.00 GHz computer with 8 GB of memory. The training took about 3 hours on the whole night3 set. Classification of the test sequence was executed in real time. Based on the 3 visual models constructed, we submitted 4 valid runs to the ImageCLEF evaluation (our runs with smoothing windows were not valid):
– 01-LIG-Me1Me2Ms: linear fusion of the results coming from the 3 models (score = 328). We consider this run as our baseline;
– 02-LIG-Me1Me2Ms-Rk15: re-ranking the result of 01-LIG-Me1Me2Ms with the regrouping of the top 15 scores for each room (score = 415);
– 03-LIG-Me1Me2Ms-Rk15-Fil003: if the difference between the 1st and the 4th scores in the ranked list is too small (i.e. below β = 0.003), we remove that image from the result list (score = 456.5);
– 05-LIG-Me1Ms-Rk15: same as 02-LIG-Me1Me2Ms-Rk15 but with the fusion of only 2 types of image representation (score = 25).
These results show that the grouping increases results by 27% compared to the baseline. Adding a filtering step after the grouping increases the results again, gaining more than 39% compared to the baseline. The use of SIFT features is also validated: the result obtained by the run 05-LIG-Me1Ms-Rk15 is not good, even after grouping the results by room. Our best run, 03-LIG-Me1Me2Ms-Rk15-Fil003, for the obligatory track is ranked at 12th place among 21 runs submitted
overall. We conclude from these results that post-processing is a must in the context of Robotvision room recognition.
3 Image Retrieval and Image Annotation Tasks Results
This paper focuses on the Robotvision task, but the MRIM-LIG group also submitted results for the image annotation and the image retrieval tasks. For the image annotation task, we tested a simple late fusion (selection of the best) based on three different sets of features: RGB colors, SIFT features, and an early fusion of HSV color space and Gabor filter energies. We tested two learning frameworks using SVM classifiers: a simple one-against-all, and a multiple one-against-all inspired by the work of Tahir, Kittler, Mikolajczyk and Yan called Inverse Random Under Sampling [14]. As post-processing, we applied to all our different runs a linear scaling so as to fit the a priori probabilities of the learning set. We afterwards took the concept hierarchy into account in the following way: a) when conflicts occur (for instance the tag Day and the tag Night are both associated with one image of the test set), we keep the tag with the larger value unchanged and decrease (linearly) the values of all the other conflicting tags; b) we propagate the concept values in a bottom-up way if the value of the generic concept is increased, otherwise we do not update the pre-existing values. The best result that we obtained was 0.384 for equal error rate (rank 34 of 74 runs) and 0.591 for recognition rate (rank 45 of 74). These results need to be studied further. For the image retrieval task, we focused on a way to generate subqueries, corresponding to potential clusters for the diversity process. We extracted the ten words most co-occurring with the query words, and used these words in conjunction with the initial query to generate sub-queries. One interesting result comes from the fact that, for a text+image run, the result we obtained for the 25 last queries (the ones for which we had to generate sub-queries) was ranked 6th. This result encourages us to further study the behavior of our proposal.
4 Conclusion
To summarize our work on the Robotvision task, we have presented a novel approach for the localization of a mobile robot using visual language modeling. Theoretically, this model fits within the standard language modeling approach, which is well developed for IR. At the same time, this model helps to capture the generality of the visual concepts associated with the regions from a single image or a sequence of images. The validation process has shown a good recognition rate of our system under different illumination conditions. We believe that a good extension of this model is possible in a real scenario of scene recognition (more precisely for robot self-localization). With the addition of more visual features and an increase in system robustness, this could be a suitable approach for future recognition systems. For the two other tasks in which we participated, we achieved average results. For the image retrieval task we will study the diversity algorithm more specifically in the future.
Acknowledgment
This work was partly supported by: a) the French National Agency of Research (ANR-06-MDCA-002), b) the Quaero Programme, funded by OSEO, French State agency for innovation, and c) the Région Rhône-Alpes (LIMA project).
References
1. Caputo, B., Pronobis, A., Jensfelt, P.: Overview of the CLEF 2009 robot vision track. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
2. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient learning and exhaustive recognition. In: Conference on Computer Vision and Pattern Recognition (2005)
3. Gao, J., Nie, J.-Y., Wu, G., Cao, G.: Dependence language model for information retrieval. In: ACM SIGIR 2004, pp. 170–177 (2004)
4. Gosselin, P., Cord, M., Philipp-Foliguet, S.: Kernels on bags of fuzzy regions for fast object retrieval. In: International Conference on Image Processing (2007)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 91–110 (2004)
6. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. IROS 2007 (2007)
7. Maisonnasse, L., Gaussier, E., Chevallet, J.P.: Model fusion in conceptual language modeling. In: ECIR 2009, pp. 240–251 (2009)
8. Maisonnasse, L., Gaussier, E., Chevallet, J.: Revisiting the dependence language model for information retrieval. In: Poster SIGIR 2007 (2007)
9. Maisonnasse, L., Gaussier, E., Chevallet, J.: Multiplying concept sources for graph modeling. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 585–592. Springer, Heidelberg (2008)
10. Pham, T.T., Maisonnasse, L., Mulhem, P., Gaussier, E.: Visual language model for scene recognition. In: Proceedings of SinFra 2009, Singapore (2009)
11. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: ACM SIGIR 1998, pp. 275–281 (1998)
12. Song, F., Croft, W.B.: General language model for information retrieval. In: CIKM 1999, pp. 316–321 (1999)
13. Srikanth, M., Srikanth, R.: Biterm language models for document retrieval. In: Research and Development in Information Retrieval, pp. 425–426 (2002)
14. Tahir, M.A., Kittler, J., Mikolajczyk, K., Yan, F.: A multiple expert approach to the class imbalance problem using inverse random under sampling. In: Multiple Classifier Systems, Reykjavik, Iceland, pp. 82–91 (2009)
15. Won, C.S., Park, D.K., Park, S.-J.: Efficient use of MPEG-7 edge histogram descriptor. ETRI Journal 24(1) (2002)
The ImageCLEF Management System

Ivan Eggel¹ and Henning Müller²

¹ Business Information Systems, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland
² Medical Informatics, University and Hospitals of Geneva, Switzerland
ivan.eggel@hevs.ch
Abstract. The ImageCLEF image retrieval track has been part of CLEF (Cross Language Evaluation Forum) since 2003. Organizing ImageCLEF, with its large participation of research groups, involves a considerable amount of work and data to manage. The goal of the management system described in this paper was to reduce manual work and professionalize the structures for organizing ImageCLEF. Having all ImageCLEF sub tracks in a single run submission system reduces the work of the organizers and makes submissions easier for participants. The system was developed as a web application using Java and JavaServer Faces (JSF) on Glassfish with a Postgres 8.3 database. The main functionality consists of user, collection and subtrack management as well as run submission. The system has two main user groups, participants and administrators. The main tasks for participants are to register for subtasks and then submit runs. Administrators create collections for the sub tasks and can define the data and constraints for submissions. The described system was used for ImageCLEF 2009 with 86 registered users and more than 300 submitted runs in 7 subtracks. The system has proved to significantly reduce manual work and will be used for upcoming ImageCLEF events and other evaluation campaigns.
1 Introduction
ImageCLEF is the cross-language image retrieval track, which is run as part of the Cross Language Evaluation Forum (CLEF). ImageCLEF¹ has seen participation from both academic and commercial research groups worldwide from communities including cross-language information retrieval (CLIR), content-based image retrieval (CBIR) and human-computer interaction. The main objective of ImageCLEF is to advance the field of image retrieval and offer evaluation in various fields of image information retrieval. The mixed use of text and visual features has been identified as important because little knowledge exists on such combinations, and most research groups work either on text or on images but only few work on both. By making visual and textual baseline results available, ImageCLEF gives participants data and tasks to obtain the information that they do not have themselves [1,2]. ImageCLEF 2009 was divided into 7 subtracks (tasks), each of which provides an image collection:
¹ http://www.imageclef.org/
– ImageCLEFmed: medical retrieval;
– ImageCLEFmed-annotation-IRMA: automatic medical image annotation task for the IRMA (Image Retrieval in Medical Applications) data set;
– ImageCLEFmed-annotation-nodules: automatic medical image annotation for lung nodules;
– ImageCLEFphoto: photographic retrieval;
– ImageCLEFphoto-annotation: annotation of images using a simple ontology;
– ImageCLEFwiki: image retrieval from a collection of Wikipedia images;
– ImageCLEFrobot: robotic image analysis.
ImageCLEF has been part of CLEF since 2003, with the number of registered research groups having grown from 4 in 2003 to 86 in 2009. Given the ever-growing number of participants, it has become increasingly difficult to manage the registration, the communication with participants and the run submission manually. The data includes a copyright agreement for CLEF, submitted runs, the tasks a user registered for, and contact details for each participant. Registered groups received passwords for the data download of each of the sub tasks, which were sent manually upon signature of the copyright agreement. The many manual steps created misunderstandings, data inconsistencies, and a large amount of email requests. After several years of experience with much manual work, a computer-based solution was created in 2009. In this paper we present the developed system, based on Java and JSF (JavaServer Faces), to manage ImageCLEF events without replacing other already existing tools such as Easychair² for review management or DIRECT, used to evaluate results in several other CLEF tasks [3]. The new system was developed to integrate into the ImageCLEF structure and to facilitate organizational issues. This includes a run submission interface that avoids every task developing its own solution.
2 Methods
For the implementation of the system we relied on Java and JSF running on Glassfish v2.1. For data storage a Postgres 8.2 database was employed. The bridge between Java and Postgres was established with a Postgres JDBC 3 driver. Other technologies used for client-side interaction were pure JavaScript and AJAX. The server used an Intel Xeon Dual Core 1.6 GHz processor with 2 GB of RAM and a total disk space of 244 GB, running on SuSE Linux.
3 Results
The ImageCLEF management system³ mainly handles 4 functions: the management of users, collections, sub tracks and runs. The possibility of dynamic sub track creation makes the system usable for other events, and participant data can be transferred from one event to another. Supporting a new event mainly requires setting up a new database, making the application flexible.
² http://www.easychair.org/
³ http://medgift.unige.ch:8080/ICPR2010/faces/Login.jsp
3.1 User Management
Account Types. Generally, there are two user groups in the management system: participants and administrators. Participants are users whose goal is to participate in one or more ImageCLEF tasks and submit runs. After the registration and the validation of the copyright agreement by the organizers, a user is allowed to submit runs. Administrators are users with the rights to set up and modify the system with essential data, e.g. creating subtracks or deleting users. They can also act as participants for run submissions. Usually, all ImageCLEF organizers have their own administrator accounts. To become an administrator the user needs to be registered as a participant. An existing administrator can then convert an existing participant account into an administrator account.
User Registration. Each participating group can register easily and quickly. A link for the registration on the initial login page guides the user to the registration process. For security reasons it is not possible to register as an administrator, so it is necessary to register as a participant first. To complete the registration, the following information needs to be provided:
– group name (e.g. name of association, university, etc.);
– group e-mail address (representative for the group);
– group address;
– group country;
– first name of contact person;
– last name of contact person;
– phone number of contact person (not mandatory);
– selection of sub tracks the participant wishes to participate in.
After submitting the registration form, the system validates all input fields and (in case of validity) stores the participant's registration information in the database; at the same time, the login password is sent to the participant by e-mail.
General Resources/Tasks of User Management. There are several resources and tasks for user management, which include viewing a list of all users and users' details, updating and deleting a user, as well as validating pending participant signatures. In Figure 1 the list of all users shows a table with users row by row. Every row represents a user, with the possibility to navigate to the detail and update pages by clicking the corresponding links in the table. There is also a delete button in each row, which removes the user from the database. Only administrators are allowed to delete participants; however, it is not permitted to remove another administrator account. It is possible for every user, regardless of being an administrator or a participant, to view a user detail page, with the restriction that participants are not able to see the list of submitted runs within another user's page (see Figure 2). The system also provides an update function. While participants can only update their own accounts, administrators are allowed to update all participants they wish to. Only administrators possess the authorization to validate a participant's signature for the copyright agreement.
Fig. 1. List of all the users, allowing sorting by various criteria and offering different views
Fig. 2. The view of the details of one user
3.2 Collection Management
A collection describes a dataset of images used for the retrieval. Since all subtracks are associated with a collection, the creation of a collection has to be performed before adding a sub track. Theoretically, the same collection can be part of several sub tracks. Any administrator can create new collections. For a new collection the user needs to provide information such as the name of the collection, the number of images in the collection and the address of its location on the web. Additionally, the user has to provide an imagenames file, i.e. a file containing the names of all images in the collection with one image name per line. Providing this file is essential to perform checks on run submissions, i.e. whether the images specified in a submitted run file are contained in the collection. Only administrators can perform updates on existing collections if necessary. The update page provides the possibility to change ordinary collection information as well as to exchange the imagenames file.
3.3 Subtrack Management
Each subtrack determines a beginning and an end date, preventing participants from submitting runs for this subtrack when the time period for submission is over. Every subtrack allows only a limited number of submitted runs per participant. Like all organizational tasks, creating a new subtrack is only possible for administrators. The interface for the creation of new subtracks asks the user to provide information such as the name of the subtrack, the maximal number of runs allowed, as well as the start and end dates of the task. Providing these dates prevents a participant from submitting runs for this task before the task starts or after the task has finished. It is equally important to select the collection associated with the subtrack, which demands the prior creation of at least one collection. In a task view, all submitted runs for the task are listed in a table (only accessible to administrators). Administrators also have the privilege to download all submitted runs for the task in one zip file. All participants in the subtrack are listed.
3.4 Runs
Run submission is one of the central functions of the presented system. Each participant has the opportunity to submit runs. Administrators can act as participants and thus submit runs as well. Figure 3 shows an example of a run submission. The main item of a run submission is the run file itself, which can be uploaded on the same page. After the file upload and before storing the metadata in the database, the system executes a run file validation. Due to varying file formats among the tasks, specific validators were created for each task. In case of an invalid file the transaction is discarded, i.e. the data is not stored in the system and an error message notifies the user, avoiding the submission of runs in an incorrect format. Likewise, the validator ensures that each image specified in the run file is part of the collection. All this avoids the submission of incorrect run files and thus manual work for the organizers.
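The validation logic can be illustrated by the following sketch; the real system is implemented in Java/JSF, so this Python version, as well as the assumed run-file layout with the image identifier in the last column, is purely illustrative.

```python
def load_image_names(imagenames_file):
    """Read the collection's imagenames file (one image name per line) into a set."""
    with open(imagenames_file, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def validate_run(run_file, collection_images):
    """Collect validation errors; an empty list means the run would be accepted."""
    errors = []
    with open(run_file, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                continue
            image_name = fields[-1]     # assumption: image id is the last column
            if image_name not in collection_images:
                errors.append(f"line {lineno}: '{image_name}' not in collection")
    return errors
```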
Fig. 3. Example of a run submission
Administrators have the possibility to see all submitted runs in a table, whereas ordinary participants are only allowed to see their own runs. The simplest way for an administrator to view his own or another user's submitted runs is to inspect the user's detail page. For administrators, a table with all submitted runs of all users also appears on the initial sub track page. A useful feature for administrators is the opportunity to download all runs of a subtrack in one zip file. The system generates (at runtime) a zip file including all runs of a particular task. The same page equally provides the facility to download a zipped file of run metadata XML files, with each file corresponding to a run. After submission it is still permitted to modify one's own runs by replacing the run file or by altering meta information on the run.
4 System Use in 2009
The registration interface of the system provided an easy way for users to register themselves for ImageCLEF 2009. The system counted 86 registered users from 30 countries. 10 of these users were also system administrators, the rest normal ImageCLEF participants. ImageCLEF 2009 consisted of 7 sub tracks (see Table 1). With 37 participants, the ImageCLEFphoto-annotation task had the largest number of participants, whereas the RobotVision task with its 16 participants recorded the smallest number. As shown in Table 1, participants of the ImageCLEFmed task submitted 124 runs in total, which was the highest number of submitted runs per
Table 1. ImageCLEF tasks with number of users and submitted runs

Task                              # users  # runs
ImageCLEFmed                      34       124
ImageCLEFmed-annotation-IRMA      23       19
ImageCLEFmed-annotation-nodules   20       0
ImageCLEFphoto                    34       0
ImageCLEFphoto-annotation         37       74
ImageCLEFwiki                     30       57
RobotVision                       16       32
TOTAL                             86       306
subtrack, although the task did not have the largest number of participants. The high number of submitted runs was partly due to ImageCLEFmed being divided into image-based and case-based topics, allowing groups to submit twice as many runs. Both ImageCLEFmed-annotation tasks as well as ImageCLEFphoto did not use the system's run submission interface and used other tools. However, it is foreseen that all tasks will use the system for run submission in the future. A total of 39 participants did not submit any run on the system. Some of these participants only participated in tasks that did not use the described interface, and others finally did not submit any runs at all. Sometimes groups registered with more than one email address; in these cases we asked the groups to remove the additional identifiers and have a unique submission point per group.
5 Conclusion
This paper briefly presents a solution to reduce manual and redundant work for benchmarking events such as ImageCLEF. The goal was to complement already existing systems such as DIRECT or Easychair and supply the missing functionality. All seven ImageCLEF tasks were integrated, and almost all participants who registered for ImageCLEF via the paper-based registration also registered electronically. Not all tasks used the provided run submission interface, but this is foreseen for the future. With 86 registered users and more than 300 submitted runs, the prototype system proved to work in a stable and reliable manner. Several small changes were made to the system based on comments from the users, particularly in the early registration phase. Reminder emails for forgotten passwords were added, as well as several views and restrictions of views on the data. In the first version, run file updates were not possible once the run was submitted; this was changed. The renaming of the original run file names by the system after submission, which was meant to unify the submitted names based on the identifiers given inside the files, caused confusion: some participants were then unable to properly identify their runs without a certain effort. To avoid this, the system will keep the original names of run files in the future. There is also more flexibility in the metadata for each of the runs before submission, but the goal is to harmonize this across tasks as much as possible.
The management system enormously reduced the manual interaction between participants and organizers of ImageCLEF. As the standard CLEF registration was still on paper with a signed copyright agreement, the electronic system made it possible to have one contact with participants and then make all information available at a single point of entry: the ImageCLEF web pages and, with them, the registration system. Passwords did not need to be sent to participants manually; access was organized through the system. Having a single submission interface also lowered the entry burden for participants of several sub tasks. Having only fully validated runs avoided a large amount of manual work for cleaning the data and contacting participants.
Acknowledgements
This work was partially supported by the BeMeVIS project of the University of Applied Sciences Western Switzerland (HES-SO).
References
1. Clough, P., Müller, H., Deselaers, T., Grubinger, M., Lehmann, T.M., Jensen, J., Hersh, W.: The CLEF 2005 cross-language image retrieval track. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, pp. 535–557. Springer, Heidelberg (2006)
2. Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.): CLEF 2007. LNCS, vol. 5152. Springer, Heidelberg (2008)
3. Nunzio, G.M.D., Ferro, N.: DIRECT: A system for evaluating information access components of digital libraries. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds.) ECDL 2005. LNCS, vol. 3652, pp. 483–484. Springer, Heidelberg (2005)
Interest Point and Segmentation-Based Photo Annotation

Bálint Daróczy, István Petrás, András A. Benczúr, Zsolt Fekete, Dávid Nemeskey, Dávid Siklósi, and Zsuzsa Weiner

Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences
{benczur,daroczyb,zsfekete,ndavid,petras,sdavid,weiner}@ilab.sztaki.hu
http://www.sztaki.hu
Abstract. Our approach to the ImageCLEF 2009 tasks is based on image segmentation, SIFT keypoints and Okapi BM25-based text retrieval. We use feature vectors to describe the visual content of an image segment, a keypoint or the entire image. The features include color histograms, a shape descriptor, a 2D Fourier transform of a segment and an orientation histogram of detected keypoints. We trained a Gaussian Mixture Model (GMM) to cluster the feature vectors extracted from the image segments and keypoints independently. The normalized Fisher gradient vector computed from the GMM of SIFT descriptors is a well-known technique to represent an image with a single vector. Novel to our method is the combination of Fisher vectors for keypoints with those of the image segments to improve classification accuracy. We also introduce correlation-based combining methods to further improve classification quality.
1 Introduction
In this paper we describe our approach to the ImageCLEF Photo, WikipediaMM and Photo Annotation 2009 evaluation campaigns [11,17,12]. The first two campaigns are ad-hoc image retrieval tasks: find as many relevant images as possible in the image collections. The third campaign requires image classification into 53 concepts organized in a small ontology. The key feature of our solution in the first two cases is to combine text-based and content-based image retrieval. Our method is similar to the one we applied in 2008 for ImageCLEF Photo [7]. Our CBIR method is based on the segmentation of the image and on the comparison of the features of the segments. We use the Hungarian Academy of Sciences search engine [3] as our information retrieval system; it is based on Okapi BM25 [16] and query expansion by a thesaurus.
2 Image Processing
We transform images into a feature space both to define their similarity for ad hoc retrieval and to apply classifiers over them for annotation. For image processing we employ both SIFT keypoints [9] and image segmentation [6,14,5,10]. While SIFT is a standard procedure, we describe our home-developed segmenter in more detail below. Our iterative segmentation algorithm [2] is based on a graph of the image pixels where the eight neighbors of a pixel are connected by edges. The weight of an edge is equal to the Euclidean distance of the pixels in the RGB space. We proceed in order of increasing edge weight as in a minimum spanning tree algorithm, except that we do not merge segments if their size and the similarity of their boundary edges are above a threshold. The algorithm consists of several iterations of the above minimum-spanning-tree-type procedure. In the first iteration we join strongly coherent pixels into segments. In further iterations we gradually increase the limits in order to enlarge segments and reach a required number of them. We performed color, shape, orientation and texture feature extraction over the segments and the environment of keypoints of images. This resulted in approximately 500–7000 keypoint descriptors in 128 dimensions and approximately two hundred segment descriptors in 350 dimensions. The following features were extracted for each segment: mean RGB histogram; mean HSV histogram; normalized RGB histogram; normalized HSV histogram; normalized contrast histogram; shape moments (up to 3rd order); DFT phase and amplitude. For ad hoc image retrieval we considered segmentation-based image similarity only. We extracted features for color histogram, shape and texture information for every segment. In addition we used contrast and 2D Fourier coefficients. An asymmetric distance function is defined in the above feature space as d(D_i, D_j) = min_k dist(S_ik, S_j), where {S_dt : t ≥ 1} denotes the set of segments of image D_d. Finally, the image similarity rank was obtained by subtracting the above distance from a sufficiently large constant.
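One literal reading of this distance is sketched below (our illustration, assuming NumPy arrays of segment descriptors); the exact aggregation of per-segment distances in the original system may differ.

```python
import numpy as np

def image_distance(segments_i, segments_j):
    """segments_*: (n_segments, n_features) arrays of segment feature vectors."""
    diff = segments_i[:, None, :] - segments_j[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distance of each segment pair
    # d(D_i, D_j): distance of the best-matching segment pair, as in the formula above
    return pairwise.min()

def similarity_rank(dist, large_const=1e6):
    """Similarity score obtained by subtracting the distance from a large constant."""
    return large_const - dist
```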
3 The Base Text Search Engine
We used the Hungarian Academy of Sciences search engine [3] as our information retrieval system, based on Okapi BM25 ranking [16] with the proximity of query terms taken into account [15,4]. We employed stopword removal and stemming by the Porter stemmer. We extended the stopword list with terms such as “photo” or “image” that are frequently used in annotations but do not have a distinctive meaning in this task. We applied query term weighting to distinguish definite and rough query terms; the latter may be obtained from the topic description or a thesaurus. We multiplied the BM25 score of each query term by its weight; the sum of the scores gave the final rank.
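A hedged sketch of this weighted BM25 scoring is shown below; the k1 and b values and the idf form are standard defaults rather than the values used by the authors, and the term-statistics dictionaries are assumed inputs.

```python
import math

def bm25_score(query_weights, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """query_weights: {term: weight}; doc_tf: {term: frequency in the document};
    df: {term: document frequency}; returns the weighted sum of BM25 term scores."""
    score = 0.0
    for term, w in query_weights.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += w * idf * norm      # each term's BM25 score scaled by its query weight
    return score
```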
We used a linear combination of the text-based and image-similarity-based scores for ad hoc retrieval. We considered the text-based score more accurate and used a small weight for the content-based score.
4 The WikipediaMM Task
We preprocessed the annotation text by regular expressions to remove author and copyright information. We made no differentiation between the title and the body of the annotation. Since file names often contain relevant keywords, often as substrings, we gave a score proportional to the length of the matching substring. Since the indexing of all substrings is infeasible, we only performed this step for those documents that already matched at least one query term in their body. For the WikipediaMM task we also deployed query expansion by an online thesaurus¹. We added groups of synonyms with reduced weight so that only the scores of the first few best-performing synonyms were added to the final score, to avoid overscoring long lists of synonyms.

Table 1. WikipediaMM ad hoc search evaluation

Run                                MAP     P10     P20
Image                              0.0068  0.0244  0.0144
Text                               0.1587  0.2668  0.2133
Image+Text                         0.1619  0.2778  0.2233
Text+Thesaurus                     0.1556  0.2800  0.2356
Text+Thesaurus lower weight        0.1656  0.2888  0.2399
Image+Text+Thesaurus lower weight  0.1684  0.2867  0.2355
1st place: DEUCENG, txt            0.2397  0.4000  0.3133
2nd place: LAHC, txt+img           0.2178  0.3378  0.2811
As seen in Table 1, our CBIR score improved performance in terms of MAP at the price of worse early precision, and expansion by the thesaurus improved the performance in a similar sense. The results of the winning and second-place teams are shown in the last rows.
5 The Photo Retrieval Task: Optimizing for Diversity
We preprocessed the annotation text by regular expressions to remove photographer and agency information. This step was particularly important to get rid of the false positives for Belgium-related queries, as the majority of the images have the Belga News Agency as their annotated source. Since the annotation was very noisy, we could only approximately cleanse the corpus.
¹ http://thesaurus.com/
Table 2. ImageCLEF Photo ad hoc search evaluation

Run               F-measure  P5    P20   CR5     CR20    MAP
Text CT           0.6449     0.5   0.64  0.5106  0.6363  0.49
Text              0.6394     0.52  0.68  0.4719  0.6430  0.50
Image+Text CT     0.6315     0.49  0.64  0.4319  0.6407  0.48
Image             0.1727     0.02  0.03  0.2282  0.2826  0
1st place: Xerox  0.80       –     –     –       –       0.29
2nd place: Inria  0.76       –     –     –       –       0.08
As the main difference from the WikipediaMM task, since almost all queries were related to names of people or places, we did not deploy the thesaurus. Some of the topics had a description (denoted by CT in the topic set as well as in Table 2) that we added with weight 0.1. We modified our method to achieve greater diversity within the top 20. For each topic in the ImageCLEF Photo set, relevant images were manually clustered into sub-topics. Evaluation was based on two measures: precision at 20 and cluster recall at rank 20, the percentage of different clusters represented in the top 20. The topics of this task were of two different types and we processed them separately in order to optimize for cluster recall. The first set of topics included subtopics; we merged the hit lists of the subtopics one by one. The last subtopic typically contained terms from other subtopics negated; we fed the query with negation into the retrieval engine. The other class of topics had no subtopics; here we proceeded as follows. Let Orig(i) be the i-th document (0 ≤ i < 999) and OrigSc(i) be the score of this element on the original list for a given query Q_j. We modified these scores by giving penalties to the scores of the documents based on their Kullback-Leibler divergence. We used the following algorithm.
Algorithm 1. Re-ranking
1. New(0) = Orig(0) and NewSc(0) = OrigSc(0)
2. For i = 1 to 20
   (a) New(i) = argmax_k {CL_i(k) | i ≤ k < 999}
   (b) NewSc(i) = max {CL_i(k) | i ≤ k < 999}
   (c) For ℓ = 0 to (i − 1): NewSc(ℓ) = NewSc(ℓ) + c(i)
Here CL_i(k) = OrigSc(k) + α Σ_{ℓ=0}^{i−1} KL(ℓ, k), where α is a tunable parameter and KL(ℓ, k) is the Kullback-Leibler distance of the ℓ-th and k-th documents. We used a correction term c(i) at Step (2c) to ensure that the new scores will also be in descending order.
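The following Python rendering of Algorithm 1 is an illustration under stated assumptions: the value of α and the exact form of the correction term c(i) are not given in the text, so illustrative choices are made here, and `kl` stands for a user-supplied Kullback-Leibler distance between two documents.

```python
def rerank(docs, scores, kl, alpha=0.1, top=20):
    """Greedy diversity re-ranking of a result list (our reading of Algorithm 1).

    docs:   documents in decreasing original-score order
    scores: the corresponding OrigSc values
    kl:     kl(d1, d2) -> Kullback-Leibler distance between two documents
    Only the first `top` positions are re-ordered; the tail stays unchanged.
    """
    docs, scores = list(docs), list(scores)
    new_scores = [scores[0]]
    for i in range(1, min(top, len(docs))):
        # CL_i(k): original score plus alpha times the KL distance to the
        # documents already placed at positions 0..i-1 (rewards diversity)
        def cl(k):
            return scores[k] + alpha * sum(kl(docs[l], docs[k]) for l in range(i))
        best_k = max(range(i, len(docs)), key=cl)
        best_score = cl(best_k)
        docs[i], docs[best_k] = docs[best_k], docs[i]
        scores[i], scores[best_k] = scores[best_k], scores[i]
        # correction term c(i): shift earlier scores so they stay above the new one
        c_i = max(0.0, best_score - new_scores[-1]) + 1e-9
        for l in range(i):
            new_scores[l] += c_i
        new_scores.append(best_score)
    return docs, new_scores
```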
6 The Photo Annotation Task
The Photo Annotation data consisted of 5000 annotated training images and 13000 test images. Our overall procedure is shown in Fig. 1. We used the bag-of-visual-words (BOV) approach with the Fisher kernel method for images [13,1]. The feature vectors were composed of SIFT keypoint descriptors and the color image segment descriptors such as shape and color histograms, as described in Section 2. We used a held-out set to rank each row from the Fisher kernel. After computing the results for all of the 53 concepts, a matrix of dimensionality N × 53 holds the concept detection results, where N is the number of images. The concept detection results from different kernels can be combined. We followed a correlation-based approach that exploits the connections between the concepts in the training annotation. It is described in Section 6.2.
Fig. 1. Our image annotation procedure
6.1 Feature Generation and Modeling
To reduce the size of the feature vectors we modeled them with g = 64 Gaussians. The classical EM algorithm with a diagonal covariance matrix assumption was used for the computation of the mixture parameters. To get fixed-sized image descriptors we computed (g − 1) + g × D × 2 dimensional normalized Fisher vectors per image [13,1], where D = 128 is the dimension of the low-level feature vectors. The t × t Fisher kernel matrix contained the L1 distances between all training images, where t = 5000 is the number of training images. We computed the Fisher kernels for several low-level feature type combinations, such as SIFT + image segments. We used the resulting Fisher kernels to train binary linear classifiers (the L2-regularized logistic regression classifier from the LibLinear package [8]) for each of the k = 53 concepts. For prediction we used the s × t kernel matrix with the trained linear classifiers, where s = 13000 denotes the number of test images.
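The construction of such normalized Fisher vectors can be sketched as follows; this is a simplified illustration based on the usual formulation [13], not the authors' code. scikit-learn is assumed, the L1 normalization matches the use of L1 distances in the kernel, and dropping one weight component to obtain g − 1 free weight parameters is one common convention.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors, g=64, seed=0):
    """Fit a diagonal-covariance GMM to a large sample of low-level descriptors."""
    gmm = GaussianMixture(n_components=g, covariance_type="diag", random_state=seed)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(gmm, x):
    """x: (T, D) descriptors of one image -> (g-1 + 2*g*D)-dimensional Fisher vector."""
    T, D = x.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (g,), (g,D), (g,D)
    gamma = gmm.predict_proba(x)                              # (T, g) posteriors
    s0 = gamma.sum(axis=0)                                    # soft counts per Gaussian
    # note: diff is (T, g, D); for very large T, process descriptors in batches
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    g_w = (s0 / T - w)[1:] / np.sqrt(w[1:])                   # g-1 free weight parameters
    fv = np.concatenate([g_w, g_mu.ravel(), g_var.ravel()])
    return fv / (np.abs(fv).sum() + 1e-12)                    # L1 normalization
```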
Fig. 2. Relations of the concepts. (a) Fragment of the CLEF 2009 ontology provided to the participants [11]. (b) Part of the auto-correlation matrix (denoted by CC in the text) of the training annotation, visualized as a graph. Positive weights mean positive correlation between concepts. (c) Part of the implication matrix (denoted by CI in the text). A connection may be expressed in verbal form, e.g. 'Landscape Nature is implied by Lake, Mountains and Water'. (d) Portrait implications. Compared with (a), it reflects the "hasPerson" relation.
6.2 Annotation-Based Combining
The ontology provided to the participants is an explicit description of the relations between the concepts. However, the user annotation of the training data contains implicit, weighted graphs of concepts. These graphs can be directed or undirected and can be used to enhance the results of the predictions. The following relations can be extracted from the user annotations.
Correlation: the co-occurrence of ConceptX and ConceptY is computed (Fig. 2b). Let A be the t × k annotation matrix. Each entry of A is either 0 or 1. Moreover, let C_C = [c_ij] be the k × k symmetric correlation matrix where c_ij = corr(a_i, a_j), a_i is the i-th column of A, and corr(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y) is the sample correlation coefficient.
Implication: ConceptB → ConceptA (see Fig. 2c and 2d).
Table 3. ImageCLEF 2009 Photo Annotation results. "Impl", "Cross" and "Cross and Impl Optimized" correspond to weighting with P_I, P_C and P_opt respectively.

Method                       EER       AUC
Segmentation                 0.372925  0.672872
SIFT                         0.350944  0.698616
SIFT + Segmentation          0.296315  0.771324
SIFT + Segmentation + Impl   0.291622  0.774395
SIFT + Segmentation + Cross  0.288599  0.776256
SIFT + Segm. + Cross + Impl  0.282956  0.780710
1st place: ISIS              0.234476  0.838699
2nd place: LEAR              0.249469  0.823105
For this, one computes the conditional probability matrix C_I = [c′_ij], where c′_ij = P(Concept_i | Concept_j) and i, j = 0..52 index the concepts. Based on the two matrices (or graphs) C_C and C_I, the relations of the concepts in the ontology can be characterized. The quality of the annotations can also be judged. For example: mutually exclusive concepts should have a negative correlation close to -100%; too many or too few edges may mean that the concept in the annotation is not discriminative enough. Using the matrices C_C and C_I we exploited the common knowledge of the annotations about the relationships between the concepts. With these matrices we re-weighted the output of the predictors. We also considered additional matrices produced by raising C_C and C_I to the p-th power element-wise, where p ∈ [1, 5]. Each concept corresponds to the same row in C_C, C_I and their element-wise powers. For a given concept the algorithm selected the row from C_C, C_I and their element-wise powers that yielded the maximal AUC. Let C_max denote the final weight matrix. Its rows are selected according to the following: C_max^i = max_{p,q}(C_C^i, C_I^i, (C_C^i)^p, (C_I^i)^q), where p, q ∈ [1..5], C^i denotes the i-th row of the matrix and i = 0..52 indexes the concepts. Let P denote the t × k matrix composed of the outputs of the predictors. Rows correspond to images, while columns correspond to concepts. The combined predictions were computed as P_C = P · C_C, P_I = P · C_I and P_opt = P · C_max. Our results are shown in Table 3. The AUC results of the best two teams were 0.838 and 0.823 respectively.
7 Conclusions
For image classification, we successfully combined a pure keypoint-based and a region-based method, two image processing algorithms that complement each other. Exploiting the training annotation improved the results. For image retrieval, our content-based score improved the text score in combination. The use of the thesaurus and other query expansion techniques increased the performance. We made minimal effort to optimize for diversity, while our results were strong in MAP.
References
1. Ah-Pine, J., Cifarelli, C., Clinchant, S., Csurka, G., Renders, J.: XRCE's participation to ImageCLEF 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) Evaluating Systems for Multilingual and Multimodal Information Access. LNCS, vol. 5706. Springer, Heidelberg (2009)
2. Daróczy, B., et al.: SZTAKI @ ImageCLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (2009)
3. Benczúr, A.A., Csalogány, K., Friedman, E., Fogaras, D., Sarlós, T., Uher, M., Windhager, E.: Searching a small national domain—preliminary report. In: Proceedings of the 12th World Wide Web Conference (WWW), Budapest, Hungary (2003), http://datamining.sztaki.hu/?q=en/en-publications
4. Büttcher, S., Clarke, C.L.A., Lushman, B.: Term proximity scoring for ad-hoc retrieval on very large text collections. In: SIGIR 2006, pp. 621–622. ACM Press, New York (2006)
5. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1026–1038 (2002)
6. Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. J. Mach. Learn. Res. 5, 913–939 (2004)
7. Daróczy, B., Fekete, Z., Brendel, M., Rácz, S., Benczúr, A., Siklósi, D., Pereszlényi, A.: Cross-modal image retrieval with parameter tuning. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
8. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research 9, 1871–1874 (2008)
9. Lowe, D.: Object recognition from local scale-invariant features. In: International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)
10. Lv, Q., Charikar, M., Li, K.: Image similarity search with compact data structures. In: CIKM 2004: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 208–217. ACM Press, New York (2004)
11. Nowak, S., Dunker, P.: Overview of the CLEF 2009 large scale visual concept detection and annotation task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
12. Paramita, M., Sanderson, M., Clough, P.: Diversity in photo retrieval: overview of the ImageCLEFPhoto task 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
13. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
14. Prasad, B.G., Biswas, K.K., Gupta, S.K.: Region-based image retrieval using integrated color, shape, and location index. Comput. Vis. Image Underst. 94(1-3), 193–233 (2004)
15. Rasolofo, Y., Savoy, J.: Term proximity scoring for keyword-based retrieval systems. In: Sebastiani, F. (ed.) ECIR 2003. LNCS, vol. 2633, pp. 207–218. Springer, Heidelberg (2003)
16. Robertson, S.E., Jones, K.S.: Relevance weighting of search terms. In: Document Retrieval Systems, pp. 143–160. Taylor Graham Publishing, London (1988)
17. Tsikrika, T., Kludas, J.: Overview of the WikipediaMM task at ImageCLEF 2009. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece (2009)
University of Jaén at ImageCLEF 2009: Medical and Photo Tasks

Miguel A. García-Cumbreras, Manuel Carlos Díaz-Galiano, María Teresa Martín-Valdivia, Arturo Montejo-Raez, and L. Alfonso Ureña-López

SINAI Research Group, Computer Science Department, University of Jaén, Spain
{magc,mcdiaz,maite,amontejo,laurena}@ujaen.es
Abstract. This paper describes the participation of the SINAI research group in the medical and photo retrieval ImageCLEF tasks. The approach for medical retrieval continues our use of the MeSH ontology for query expansion, but compares it to term expansion within the documents of the collection. Regarding the photo retrieval task, diversity of the top results has been pursued by applying a clustering algorithm. For both tasks, results and discussion are included. In general, no relevant findings were obtained with the novel approaches applied.
1 Introduction
This paper presents the participation of the SINAI research group in two different ImageCLEF tasks: the ImageCLEF medical retrieval task [5] and the ImageCLEF photo retrieval task [6]. For medical retrieval, our main goal is to study the expansion of the collection and the queries using the MeSH ontology. In previous years we have experimented with the expansion of the queries with medical ontologies [4,3]; now we want to compare those results with the expansion of terms in the collection. Other previous work includes the development of a system that tests different aspects, such as the application of Information Gain in order to improve the results [2], the expansion of the topics with the MeSH¹ ontology [3], and the expansion of the topics with the UMLS² metathesaurus, which uses less but more specific textual information [4]. As regards the photo retrieval approach, the task for 2009 is different from the evaluation in 2008, and the organizers give special value to the diversity of results. Given a query, the goal is to retrieve a relevant set of images at the top of a ranked list. Text and visual information can be used to improve the retrieval methods, and the main evaluation points are the use of pseudo-relevance feedback (PRF), query expansion, IR systems with different weighting functions, and clustering or filtering methods applied over the cluster terms. Our system makes use of text information, not visual information, to improve the retrieval
¹ http://www.nlm.nih.gov/mesh/
² http://www.nlm.nih.gov/research/umls/
methods, and a new method has been implemented to cluster images. The rest of the paper is organized in two sections, one for the medical task and another for the photo task. Each section includes the description of the system, the collection and the query treatment, the experiments and the results obtained. In addition, conclusions and discussion of further work are presented in Section 4.
2 Medical Task
2.1 System Description
A new collection was introduced in this task in 2008. We created different textual collections using information about this collection from the web [4]. In 2009, the medical task has been separated into two subtasks, one based on image retrieval (ad hoc) and another one based on medical case retrieval. For this reason, we have created three different textual collections, two collections for image based retrieval and one for case based retrieval. The collections are:
– C: contains the caption of the image, to be used in image based retrieval.
– CT: contains the caption of the image and the title of the article, to be used in image based retrieval.
– TA: contains the title and the text of the full article, to be used in medical case based retrieval.
We have experimented with expansion of the textual collection. Our initial aim was to expand the textual collection in the same way as the query expansion performed in last year's experiments [4]. Nevertheless, the time required to expand even the minimal textual collection with UMLS and the MetaMap software was excessive. For this reason, we have only used the MeSH ontology to expand collections and topics. From the collections used in image based retrieval, we have created two collections by expanding the text using the MeSH ontology. These new collections have been named CM and CTM, respectively. We have not used the collection with full articles for image based retrieval for two reasons: the first one is that we want to experiment with the expansion of the collection with minimal textual information; the second reason is that the expansion of a big textual collection requires a large amount of time.
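To make the collection-expansion step concrete, the following is a minimal sketch of the general idea (our illustration, not the SINAI code); mesh_synonyms is a hypothetical dictionary that would be loaded from the MeSH ontology.

```python
# Illustrative sketch (not the SINAI implementation) of expanding a text
# collection with an ontology: for every term that matches a MeSH entry,
# append its synonyms to the document text.
def expand_document(text, mesh_synonyms):
    tokens = text.lower().split()
    expansion = []
    for token in tokens:
        expansion.extend(mesh_synonyms.get(token, []))
    return text + " " + " ".join(expansion)

# Toy usage with a hypothetical dictionary entry:
mesh_synonyms = {"x-ray": ["radiography", "roentgenography"]}
print(expand_document("Chest x-ray of the patient", mesh_synonyms))
```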
2.2 Experiments
In our experiments we have expanded the whole set of topics, obtaining the following:
– t: original topic set, used in ad hoc retrieval.
– tM: topic set for ad hoc retrieval, expanded with the MeSH ontology.
– cbt: original topic set, used in case based retrieval.
– cbtM: topic set for case based retrieval, expanded with the MeSH ontology.
The dataset of the collection has been indexed using the Lemur IR system (http://www.lemurproject.org/), applying the KL-divergence weighting function and using Pseudo-Relevance Feedback (PRF). Table 1 shows the mean average precision (MAP) of the image based retrieval experiments. Contrary to expectations, these results show that query expansion does not improve the performance of the system. The expansion using only the collection with more textual information (the CTM collection) obtains the best results. The last row shows the best textual run in the competition (LIRIS maxMPTT extMPTT). Table 2 shows the results of the medical case based experiments. As we can see in this table, query expansion does not improve the results. The last row shows the best mixed-mode system in the competition (ceb-cases-essie2-automatic).

Table 1. MAP values of image based experiments
Collection   t        tM
C            0.3289   0.2754
CT           0.3569   0.3077
CM           0.3124   0.2838
CTM          0.3795   0.3286
Best         0.43     –

Table 2. Results of case based experiments
Collection   Topics   MAP
TA           cbt      0.2626
TA           cbtM     0.2605
Best         –        0.34

3 Photo Task
3.1 System Description
As mentioned in the introduction, in 2009 the main goal is to explore the benefits of increasing the diversity of results. Figure 1 shows a general scheme of the system developed. In the first module of the system, different sets of topics have been built by combining the title of the query, the words of each cluster and the title of the last cluster. In a second step the English collection has been preprocessed as usual (English stopword removal and Porter's stemmer [7]), and the documents have been indexed using the Lemur retrieval system with the Okapi weighting function and Pseudo-Relevance Feedback (PRF).
Fig. 1. General scheme of the SINAI system at ImagePhoto 2009
Then, these sets of topics were run over the IR system, and a list of relevant documents was obtained.
Clustering Subsystem. It has been found that retrieval performance increases when there is variability among the top results returned for a query; in some cases it is more desirable to have fewer but more varied items in this list [1]. In order to increase variability, a clustering system has been applied. The idea behind it is rather simple: re-arrange the most relevant documents so that documents belonging to different clusters are promoted to the top of the list. We have applied k-means to every list of results returned by the Lemur IR system. This has been done using the Rapid Miner tool (http://rapid-i.com/). The clustering algorithm groups these results, without any concern for ranking, into four different groups (this is the average number of clusters for training documents as specified in their metadata). Once each of the documents in the list has been labeled with its resulting cluster index, the list is reordered according to the result of the clustering process. Therefore, we fill the list by alternating documents from different clusters.
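A rough sketch of this cluster-based re-ranking, assuming scikit-learn and TF-IDF document vectors (the actual system used Rapid Miner, so this is only an illustration of the idea):

```python
# Sketch of cluster-based diversification: cluster the retrieved documents
# into k groups and interleave documents drawn from different clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def diversify(doc_ids, doc_texts, k=4):
    vectors = TfidfVectorizer().fit_transform(doc_texts)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    # Keep the original retrieval order within each cluster.
    buckets = [[d for d, l in zip(doc_ids, labels) if l == c] for c in range(k)]
    reranked = []
    while any(buckets):
        for bucket in buckets:          # alternate over clusters
            if bucket:
                reranked.append(bucket.pop(0))
    return reranked
```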
3.2 Experiments
In our ImagePhoto system we have tested the following configurations:
1. SINAI1 - Baseline. It is the baseline experiment. It uses Lemur as the IR system with automatic feedback. The weighting function applied was Okapi. The topic used is only the query title.
2. SINAI2 - title and final cluster. This experiment combines the query title with the title of the final cluster that appears in the topics file. Lemur also uses Okapi as the weighting function and PRF.
3. SINAI3 - title and all clusters. This experiment combines the query title with all the words that appear in the titles of all the clusters. Lemur also uses Okapi as the weighting function and PRF.
4. SINAI4 - clustering. The query title and each cluster title (except the last one, which combines all of them) are run against the index generated by the IR system. Several lists of relevant documents are retrieved, and the clustering module combines them to obtain the final list of relevant documents. The aim of this experiment is to increase the diversity of the retrieved results using a clustering algorithm.

Table 3 shows the results obtained in our four experiments. The last row shows the best system in the competition with only text (InfoComm group).

Table 3. SINAI experiment results for the ImagePhoto tasks
Experiment            CR10     P10     MAP      F-measure
sinai1 T TXT          0.4580   0.796   0.4454   0.5814
sinai2 TCT TXT        0.3798   0.58    0.3286   0.4590
sinai3 TCT TXT        0.5210   0.778   0.4567   0.6241
sinai4 TCT TXT        0.4356   0.474   0.2233   0.4540
LRI2R TI TXT (Best)   0.671    0.848   –        0.7492
4 Discussion and Conclusions
In the medical retrieval task, we have used topic and collection expansion. The topic expansion has been carried out in the same way as in the previous year. The expansion of the collection improves the results when the topics are expanded too. However, the obtained results are not satisfactory because we do not obtain the same results as in previous years. Although the collection is the same as in 2008, in 2009 we rebuilt the collection via the Web. This new collection is different from that generated in 2008. We conducted experiments with the 2008 collection and the results are similar to previous years, although the MAP values are lower than those obtained in 2009. This may indicate that there has been an error in the generation of the collection and the results are not relevant. In the photo retrieval task, we have experimented with different kinds of cluster combination. However, as we can see in Table 3, the application of clustering does not improve the results greatly. In fact, only the run SINAI3, which combines the original query title and the titles of all the clusters, overcomes
the baseline case SINAI1 that only uses the original title. Unfortunately, the experiment SINAI4 that applies our clustering and fusion approach has achieved the worst results. Thus, the obtained results show that it is necessary to continue investigating the clustering solution for diversity. In addition, the use of visual information could improve the final system.
Acknowledgements This work has been supported by the Regional Government of Andalucia (Spain) under excellence project GeOasis (P08-41999), the Spanish Government under project Text-Mess TIMOM (TIN2006-15265-C06-03) and the University of Jaen local project RFC/PP2008/UJA-08-16-14.
References
1. Chen, H., Karger, D.R.: Less is more: probabilistic models for retrieving fewer relevant documents. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 429–436. ACM, Seattle (2006)
2. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Montejo-Ráez, A., Ureña López, L.: Using Information Gain to Improve the ImageCLEF 2006 Collection. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 711–714. Springer, Heidelberg (2007)
3. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Montejo-Ráez, A., Ureña López, L.: Integrating MeSH Ontology to Improve Medical Information Retrieval. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 601–606. Springer, Heidelberg (2008)
4. Díaz-Galiano, M., García-Cumbreras, M., Martín-Valdivia, M., Ureña-López, L., Montejo-Ráez, A.: Query Expansion on Medical Image Retrieval: MeSH vs. UMLS. In: Peters, C., et al. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 732–735. Springer, Heidelberg (2009)
5. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn Jr., C.E., Hersh, W.: Overview of the CLEF 2009 medical image retrieval task (2009)
6. Paramita, M.L., Sanderson, M., Clough, P.: Diversity in photo retrieval: Overview of the ImageCLEFphoto task 2009 (2009)
7. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980), http://portal.acm.org/citation.cfm?id=275705
Overview of VideoCLEF 2009: New Perspectives on Speech-Based Multimedia Content Enrichment
Martha Larson (1), Eamonn Newman (2), and Gareth J.F. Jones (2)
(1) Multimedia Information Retrieval Lab, Delft University of Technology, 2628 CD Delft, Netherlands
(2) Centre for Digital Video Processing, Dublin City University, Dublin 9, Ireland
m.a.larson@tudelft.nl, {enewman,gjones}@computing.dcu.ie
Abstract. VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the “Beeldenstorm” collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called “Finding Related Resources Across Languages,” involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language “Beeldenstorm” collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.
1 Introduction
VideoCLEF 2009 (http://www.multimediaeval.org/videoclef09/videoclef09.html) was a track of the CLEF (http://www.clef-campaign.org) benchmark campaign devoted to tasks aimed at improving access to video content in multilingual environments.
The overall goal of the VideoCLEF benchmarking initiative, now referred to as "MediaEval" (http://www.multimediaeval.org), is to develop new, forward-looking multimedia retrieval tasks and data sets with which to evaluate these tasks. During VideoCLEF 2009, three tasks were carried out. The Subject Classification Task required participants to automatically tag videos with subject theme labels (e.g., 'factories,' 'physics,' 'poverty', 'cultural identity' and 'zoos'). The Affect Task, also called "Narrative peak detection," involved automatically detecting dramatic tension in short-form documentaries. Finally, "Finding Related Resources Across Languages," referred to as the Linking Task, required participants to automatically link video to Web content that is in a different language, but on the same subject. The data sets for these tasks contained Dutch-language television content supplied by the Netherlands Institute of Sound and Vision (http://www.beeldengeluid.nl; called in Dutch Beeld & Geluid), which is one of the largest audio/video archives in Europe. Each participating site had access to video data, speech recognition transcripts, shot boundaries, shot-level keyframes and archival metadata supplied by VideoCLEF. Sites developed their own approaches to the tasks and were allowed to choose the method and features that they found most appropriate. Seven groups made submissions of task results for evaluation. In 2009, the VideoCLEF track ran for the first time as a full track within the Cross-Language Evaluation Forum (CLEF) evaluation campaign. The track was piloted last year as VideoCLEF 2008 [11]. The VideoCLEF track is the successor to the Cross-Language Speech Retrieval (CL-SR) track, which ran at CLEF from 2005 to 2007 [12]. VideoCLEF seeks to extend the results of CL-SR to the broader challenge of video retrieval. VideoCLEF is intended to complement the TRECVid benchmark [15] by running tasks related to the topic or subject matter treated by video and emphasizing the importance of speech and language (e.g., via speech recognition transcripts). TRECVid has traditionally focused on objects, entities and scenes that are depicted in the visual channel. In contrast, VideoCLEF concentrates on what is described in a video, in other words, what a video is about. This paper describes the data sets and the tasks of VideoCLEF 2009 and summarizes the results achieved by the participating sites. We finish with a conclusion and an outlook for MediaEval 2010. For additional information concerning individual approaches used in 2009, please refer to the papers of the individual sites in this volume.
1.1 Data
VideoCLEF 2009 used two data sets, both containing Dutch-language television programs. Note that these programs are predominantly documentaries with the addition of some talk shows. This means that the data contains a great deal of conversational speech, including opinionated and subjective speech and speech that has been only loosely planned. In this way, the VideoCLEF data is different and more challenging than broadcast news data, which largely involves scripted speech.
The VideoCLEF 2009 Subject Classification Task ran on TRECVid 2007 and 2008 data from Beeld & Geluid. The Affect Task and Linking Task both ran on a data set containing material from the short-form documentary Beeldenstorm, also supplied by Beeld & Geluid. For both data sets, Dutch-language speech recognition transcripts were supplied by the University of Twente [5]. The shot segmentation and the shot-level keyframe data were provided by Dublin City University [1]. Further details are given in the following. TRECVid 2007/2008 data set. In 2009, VideoCLEF attempted to encourage cross-over from the TRECVid community by recycling the TRECVid data set (http://www-nlpir.nist.gov/projects/tv2007/tv2007.html#3) for the Subject Classification Task. Notice that the Subject Classification Task is a fundamentally different task from what ran at TRECVid in 2007 and 2008. Subject Classification involves automatically assigning subject labels to videos at the episode level. The subject matter of the entire video is important, not just the concepts visible in the visual channel and not just the shot-level topic. Classifying video, i.e., taking a video and assigning it a topic class subject label, is exactly what the archive staff does at Beeld & Geluid when they annotate video material that is to be stored in the archive. The class labels used for the VideoCLEF 2009 Subject Classification Task are a subset of the labels that are used by archive staff. As a result, we have gold standard topic class labels with which to evaluate classification. Additionally, we can be relatively certain that if these labels are already used for retrieval of material from the archive then they are relevant for video search in an archive setting, and, we assume, beyond. Original Dutch-language examples of subject labels can be examined in the archive's search engine at http://zoeken.beeldengeluid.nl: a keyword search returns a results list with a column labeled Trefwoorden (keywords); these are the topic class subject labels used in the archive. In the VideoCLEF 2009 Subject Classification Task, archivist-assigned subject labels were used as ground truth. The training set is a large subset of TRECVid 2007 and contains 212 videos. The test set is a large subset of TRECVid 2008 and contains 206 videos.
In total 46 labels were used: aanslagen (attacks), armoede (poverty), burgeroorlogen (civil wars), criminaliteit (crime), culturele identiteit (cultural identity), dagelijks leven (daily life), dieren (animals), dierentuinen (zoos), economie (economy), etnische minderheden (ethnic minorities), fabrieken (factories), families (families), gehandicapten (disabled), geneeskunde (medicine), geneesmiddelen (pharmaceutical drug), genocide (genocide), geschiedenis (history), gezinnen (families), havens (harbors), hersenen (brain), illegalen (undocumented immigrants), journalisten (journalist), kinderen (children), landschappen (landscapes), media (media), militairen (military personnel), musea (museums), muziek (music), natuur (nature), natuurkunde (physics), ouderen (seniors), pers (press), politiek (politics), processen (lawsuits), rechtszittingen (court hearings), reizen (travel), taal (language), verkiezingen (elections), verkiezingscampagnes (electoral campaigns), voedsel (food), voetbal (soccer), vogels (birds), vrouwen (women), wederopbouw (reconstruction), wetenschappelijk onderzoek (scientific research), ziekenhuizen (hospitals).
The videos are mostly drawn from a subset of the overall Beeld & Geluid collection called Academia (http://www.academia.nl/), which is a collection that was created for use in research and educational settings. The Academia collection currently contains about 7,000 hours of video. In general, each video in the training and test set is an individual episode of a television show. Their length varies widely, with the average length being around 30 minutes. Note that the VideoCLEF 2009 Subject Classification set excludes several videos in the TRECVid collection for which archival metadata was not available. We would also like to explicitly point out that participants were not required to make use of the training data set, but were free to collect their own training data if they wished.
Beeldenstorm data set. For both the Affect Task and the Linking Task a data set consisting of 45 episodes of the documentary series Beeldenstorm (Eng. Iconoclasm) was used. The Beeldenstorm series consists of short-form Dutch-language video documentaries about the visual arts. Each episode lasts approximately eight minutes. Beeldenstorm is hosted by Prof. Henk van Os, known and widely appreciated not only for his art expertise, but also for his narrative ability (http://www.avro.nl/tv/programmas a-z/beeldenstorm/). This data set is also supplied by Beeld & Geluid, but it is mutually exclusive with the TRECVid 2007/2008 data set. The narrative ability of Prof. van Os makes the Beeldenstorm set an interesting corpus to use for affect detection, and the domain of visual arts offers a wide number of possibilities for interesting multimedia links for the linking task. Finally, the fact that each episode is short makes it possible for assessors to watch the entire episode when creating the ground truth. Knowledge of the complete context is important for relevance judgments for cross-language related resources and also for defining narrative peaks. The ground truth for the Affect Task and Linking Task was created by a team of three Dutch-speaking assessors during a nine-day assessment and annotation event at Dublin City University referred to as Dublin Days. The videos were annotated with the ground truth with the support of the Anvil Video Annotation Research Tool (http://www.anvil-software.de/) [8]. Anvil makes it possible to generate frame-accurate video annotations in a graphic interface. Particularly important for our purposes was the support offered by Anvil for user-defined annotation schemes. Details of the ground truth creation are included in the discussions of the individual tasks in the following section.
2 Subject Classification Task
2.1 Task
The goal of the Subject Classification Task is automatic subject tagging. Semantic-theme-based subject tags are assigned automatically to videos. The purpose of
these tags is to make the videos findable to users who are searching and browsing the collection. The information needs (i.e., queries) of the users are not specified at the time of tagging. In VideoCLEF 2009, the Subject Classification Task had the specific goal of reproducing the subject labels that were hand assigned to the test set videos by archivists at Beeld & Geluid. Since these subject labels are currently in use to archive and retrieve video in the setting of a large archive, we are confident about their usefulness for search and browsing in real-world information retrieval scenarios. The Subject Classification Task was introduced during the VideoCLEF 2008 pilot [11]. In 2009, the number of videos in the collection was increased from 50 to 418 and the number of subject labels increased from 10 to 46.
2.2 Evaluation
The Subject Classification Task is evaluated using Mean Average Precision (MAP). This choice of score is motivated by the popularity of techniques that approach the subject tagging task as an information retrieval problem. These techniques return, for each subject label, a ranked list of videos that should receive that label. MAP is calculated by taking the mean of the Average Precision over all subject labels. For each subject label, precision scores are calculated by moving down the results list and calculating precision at each position where a relevant document is retrieved. Average Precision is calculated by taking the average of the precision at each position. Calculations were performed using version 8.1 of the trec_eval scoring package (http://trec.nist.gov/trec_eval).
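As an illustration of how this metric is computed, here is a simplified sketch (not the trec_eval implementation; it follows the usual convention of dividing by the total number of relevant items per label):

```python
# Simplified AP/MAP computation for the subject tagging evaluation.
def average_precision(ranked_videos, relevant):
    """AP for one subject label: precision is taken at each rank where a
    relevant video appears, then averaged over the relevant set."""
    hits, precisions = 0, []
    for rank, vid in enumerate(ranked_videos, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: {label: ranked list of video ids}; qrels: {label: set of relevant ids}."""
    return sum(average_precision(runs[l], qrels[l]) for l in qrels) / len(qrels)
```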
2.3 Techniques
Computer Science, Chemnitz University of Technology, Germany (see also [9]) The task was treated as an information retrieval task. The test set was indexed using an information retrieval system and was queried using the subject labels as queries. Documents returned as relevant to a given subject label were tagged with that label. The number of documents receiving a given label was controlled by a threshold. The submitted runs varied with respect to whether or not the archival metadata was indexed in addition to the speech recognition transcripts. They also varied with respect to whether expansion was applied to the class label (i.e., the query). Expansion was performed by augmenting the original query with the most frequent term occurring in the top five documents returned by an initial retrieval round. If fewer than two documents were returned, queries were expanded using a thesaurus. SINAI Research Group, University of Jaén, Spain (see also [13]) The SINAI group (SINAI stands for Sistemas Inteligentes de Acceso a la Información) approached the task as a categorization problem, training SVMs using the training data provided. One run, SINAI svm nometadata, extracted feature vectors from the speech transcripts alone, and one run, SINAI svm withmetadata, made use of both speech recognition transcripts and metadata.
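A simplified sketch of the "classification as retrieval" strategy described above (our illustration, not the Chemnitz system; the TF-IDF features and the threshold value are assumptions):

```python
# Index the transcripts, use each subject label as a query, and tag every
# video whose similarity score exceeds a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tag_videos(video_ids, transcripts, labels, threshold=0.05):
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(transcripts)    # one row per video
    label_matrix = vectorizer.transform(labels)           # one row per subject label
    scores = cosine_similarity(label_matrix, doc_matrix)  # labels x videos
    tags = {v: [] for v in video_ids}
    for li, label in enumerate(labels):
        for vi, vid in enumerate(video_ids):
            if scores[li, vi] >= threshold:   # threshold is arbitrary here
                tags[vid].append(label)
    return tags
```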
Computer Science, Alexandru Ioan Cuza University, Romania (see also [2]) A training set was created by using subject category labels to select documents from Wikipedia and also from the Web at large (using Google). The training set was used to create a category file for each subject category containing a set of informative terms representative of that category. Two categorization methods were applied: one made use of information retrieval techniques to match the speech recognition transcripts of the videos to the category files, and the other made use of a Naive Bayes multinomial classifier to classify the videos into the classes represented by the category files.
2.4 Results
The MAP results of the task are reported in Table 1. The results confirm the viability of techniques that approach the Subject Classification Task as an information retrieval task. Such techniques proved useful in VideoCLEF 2008 [11] and also provide the best results in 2009, where the size of the collection and the label set increased. Also, consistent with VideoCLEF 2008 observations, performance is better when archival metadata is used in addition to speech recognition transcripts.

Table 1. Subject Classification Results, Test Set
run ID                      MAP
cut1 sc asr baseline        0.0067
cut2 sc asr expanded        0.0842
cut3 sc asr meta baseline   0.2586
cut4 sc asr meta expanded   0.2531
cut5 sc asr meta expanded   0.3813
SINAI svm nometadata        0.0023
SINAI svm withmetadata      0.0028
In the wake of VideoCLEF 2008, we decided that we wanted to provide a training data set of videos accompanied by speech transcripts in 2009 to see whether training classifiers on data from the same domain as the test data would improve performance. The runs submitted this year demonstrate the efficacy of an approach that combines Web data and information retrieval techniques. A supervised approach which uses same-domain training data cannot easily achieve the same level of performance. These results leave open the question of how much training data is necessary in order for a supervised approach to compete with the information retrieval approach. In all cases, runs that make use of metadata outperform runs that make use of ASR transcripts only. These performance differences demonstrate the high value of using metadata, if available, to supplement ASR transcripts in order to generate class labels for videos.
The Alexandru Ioan Cuza University (UAIC) team reported results on the training set only. Note that because they train using data that they have collected themselves, the training set constitutes, for the purposes of their experiments, a separate, unseen test set. The results are not, however, directly comparable to those given in Table 1. We do not repeat them here, but rather refer the interested reader to the UAIC team paper [2]. Here, we include the comment that the best UAIC run involved using both general Web and Wikipedia training data and then combining the output of the information retrieval approach (which they find improves the quality of the first-best label) and the output of a Naive Bayes classifier (which they find contributes to the overall label quality). The UAIC results are consistent with our overall conclusion that it is better to collect training data from external sources rather than to use the training set. We believe that there are two possible sources to which we can attribute the failure of the training set to allow the training of high quality classifiers. First, the training set was relatively small, including only 212 videos. Although some semantic categories are represented by a fair number of video items, other categories may have as few as two items associated with them in the training set. Second, the transcripts of the training set contain a high level of speech recognition errors, which means that important terms might be mis-recognized and thus fail to occur in the transcripts at all, or fail to occur with the proper distribution. There is general awareness shared by VideoCLEF participants that although MAP is a useful tool, it may not be the ideal evaluation metric for this task. The reader can refer to the papers of the Chemnitz [9] and SINAI [13] teams for additional discussion and results reported with additional performance metrics. The ultimate goal of subject tagging is to generate a set of tags for each video that will allow users to find that video while searching or browsing. The utility of a tag assigned to a given video is therefore not entirely independent of the other tags assigned. Under the current formulation of the task, the presence or absence of the tag is the only information that is of use to the searcher. The ranking of a video in a list of videos that are assigned the same tag is for this reason not directly relevant to the utility of that tag for the user. Future work must necessarily involve developing appropriate metrics for evaluating the usefulness to users of the sets of tags assigned to multimedia items.
3 Affect Task
3.1 Task
The goal of the Affect Task at VideoCLEF 2009 was to automatically detect narrative peaks in documentaries. Narrative peaks were defined to be those places in a video where viewers report feeling a heightened emotional effect due to dramatic tension. This task was new in 2009. The ultimate aim of the Affect Task is to move beyond the information content of the video and to analyze the video with respect to characteristics that are important for viewers, but not related to the video topic.
Narrative peak detection builds on and extends work in affective analysis of video content carried out in the areas of sports and movies, cf. e.g., [4]. Viewers perceive an affective peak in sports videos due to tension arising from the spontaneous interaction of players within the constraints of the physical world and the rules and conventions of the game. Viewers perceive an affective peak in a movie due to the action or the plot line, which is carefully planned by the script writer and the filmmaker. Narrative peaks in documentaries are a new domain in so far as they cannot be considered to fall into either category. Documentaries convey information and often have storylines, but do not have the all-dominating plot trajectory of a movie. Documentaries often include extemporaneous narrative or interviews, and therefore also have a spontaneous component. The affective curve experienced by a viewer watching a documentary can be expected to be relatively subtly modulated. It is important to differentiate narrative peak detection from other cases of affect detection, such as hotspot detection in meetings. Hotspots are moments during meetings where people are highly involved in the discussion [16]. Hotspots can be self-reported by meeting participants or annotated in meeting video by viewers. In either case, it is the participant and not the viewer whose affective reaction is being detected. We chose the Beeldenstorm series for the narrative peak detection task in order to make the task as simple and straightforward as possible in its initial year. Beeldenstorm features a single speaker, the host Prof. van Os, and covers a topical domain, the visual arts, that is rich enough to be interesting, yet is relatively constrained. These characteristics help us to control for the effects of personal style of the host and of viewer familiarity with topic in the affect and appeal task. Further, as mentioned above, the fact that the documentaries are short makes it possible for annotators to watch them in their entirety when annotating narrative peaks.
3.2 Evaluation
For the purposes of evaluation, as mentioned above, three Dutch speakers annotated the Beeldenstorm collection by each identifying the three top narrative peaks in each video. Annotators were asked to mark the peaks where they felt the dramatic tension reached its highest level. They were not supplied with an explicit definition of a narrative peak. Instead, all annotators needed to form independent opinions of where they perceived narrative peaks. In order to make the task less abstract, they were supplied with the information that the Beeldenstorm series is associated with humorous and moving moments. They were told that they could use this information to formulate their notion of what constitutes a narrative peak. Peaks were required to be a maximum of ten seconds in length. Although the annotators did not consult with each other about specific peaks, the team did engage in discussion during the definition process. The discussion ensured that there was underlying consensus about the approach to the task.
In particular, it was necessary to check that annotators understood that a peak must be a high point in the storyline as measured by their perceptions of their own emotional reaction. Dramatic objects or facts in the spoken or visual content that were not part of the storyline as it was created by the narrator/producer were not considered narrative peaks. Regions in the video where the annotator guessed that the speaker or producer had intended there to be a peak, but where the annotator did not feel any dramatic tension, were not considered to be peaks. An example of this would be a joke that the annotator did not understand completely. The first two episodes for which the annotators defined peaks were discarded in order to assure that the annotators' perception of a narrative peak had stabilized. This warm-up exercise was particularly important in light of the fact that, at the end of the annotation effort, assessors reported that it was necessary to become familiar with the style and allow an affinity for the series to develop before they started to feel an emotional reaction to narrative peaks in the video. The peaks identified by the assessors were considered to be a reflection of underlying "true" peaks in the narrative of the video. We assumed that the variation between assessors is the result of noise due to effects such as personal idiosyncrasies. In order to generate a ground truth most highly reflective of "true" peaks, the peaks identified by the assessors were merged. The assessment team consisted of three members who each identified three peaks in 45 videos for a total of 405 marked peaks. The assessors were able to give a rough estimate of the minimum distance between peaks and, on the basis of their observations, it was decided to consider two peaks that overlapped by at least two seconds to be the same peak. After merging the peaks, 293 of the 405 peaks turned out to be distinct. The merging process was carried out by fitting a 10 second window to overlapping assessor peaks in order to ensure that merged peaks could never exceed the specified peak length of 10 seconds. Evaluation involved the application of two scoring methods, the point-based approach and the peak-based approach. Under point-based scoring, the peaks chosen by each assessor are assessed without merging. A hypothesized peak receives a point in every case in which it falls within eight seconds of an assessor peak. The run score is the total number of points earned by all peak hypotheses in the run. A single episode can earn a run between three points (assessors chose completely different peaks) and nine points (assessors all chose the same peaks). There are no episodes in the set that fall at either of these extremes. The distribution of the peaks in the files is such that the best possible run would earn 246 points. Under peak-based scoring, a hypothesis is counted as correct if it falls within an 8 second window of a peak representing a merger of assessor annotations. Three different types of merged reference peaks are defined for peak-based scoring. Three different peak-based scores are reported that differ in the number of assessors required to agree in order for a region in the video to be considered a peak. Of the 293 total peaks identified, 203 peaks are "personal peaks" (peaks identified by only one assessor), 90 are "pair peaks" (peaks that are identified by at least two assessors) and 22 are "general peaks" (peaks upon which all three assessors agreed).
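The point-based scoring can be sketched as follows (our reading of the rules, assuming peaks are represented by their start times in seconds):

```python
# Illustrative sketch of the point-based scoring, not the official script.
def point_based_score(hypothesized_peaks, assessor_peaks, window=8.0):
    """hypothesized_peaks: peak times proposed by a system for one episode.
    assessor_peaks: unmerged peak times from all assessors.
    A hypothesis earns one point for every assessor peak within the window."""
    score = 0
    for hyp in hypothesized_peaks:
        score += sum(1 for ref in assessor_peaks if abs(hyp - ref) <= window)
    return score
```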
3.3 Techniques
Narrative peak detection techniques were developed that used the visual channel, the audio channel and the speech recognition transcript. Each group took a different approach. Computer Science, Alexandru Ioan Cuza University, Romania (see also [2]) Based on the hypothesis that speakers raise their voices at narrative peaks, three runs were developed that made use of the intensity of the audio signal. A score was computed for each group of words that involved a comparison of intensity means and other statistics for sequential groups of words. The top three scoring points were hypothesized as peaks. Computer Vision and Multimedia Laboratory, University of Geneva, Switzerland (see also [7]) The assumption was made that dramatic peaks correspond to the introduction of a new topic and thus correspond to a change in word use as reflected in the speech recognition transcripts. Additionally, the video and audio channel effects assumed to be indicative of peaks were explored. Finally, a weighting was deployed that gave more emphasis to positions at which peaks were expected to occur based on the distribution of peaks in the development data. The weighting is used in unige-cvml1, unige-cvml2 and unige-cvml3. Run unige-cvml1 uses text features alone. Run unige-cvml3 uses text plus elevated speaker pitch. Run unige-cvml2 uses text, elevated pitch and quick changes in the video. Run unige-cvml4 uses text only and no weighting. Run unige-cvml5 sets peaks randomly to provide a random baseline for comparison. Delft University of Technology and University of Twente, Netherlands (see also [10]) Only features extracted from the speech transcripts were exploited. Run duotu09fix predicted peaks at fixed points chosen by analyzing the development data. Run duotu09ind used indicator words as cues of narrative peaks. Indicator words were chosen by analyzing the development data. Run duotu09rep applied the assumption that word repetition, reflecting the use of an important rhetorical device, would indicate a peak. Run duotu09pro used pronouns as indicators of audience-directed speech and assumed that high pronoun densities would correspond to points where viewers feel maximum involvement. Run duotu09rat exploited the affective scores of words, building on the hypothesis that use of affective speech characterizes narrative peaks.
3.4 Results
The results of the task are reported in Table 2. The results make clear that it is quite challenging to effectively support the detection of narrative peaks using audio and video features. Recall that unige-cvml5 is a randomly generated run. Most runs failed to yield results appreciably better than this random baseline. The best scoring approaches exploited the speech recognition transcripts, in particular the occurrence of pronouns reflecting audience-directed speech and the use of words with high affective ratings.
Table 2. Narrative peak detection results
run ID        point-based   peak-based ≥1 assessor   peak-based ≥2 assessors   peak-based 3 assessors
                            (personal peaks)         (pair peaks)              (general peaks)
duotu09fix    47            28                       8                         4
duotu09ind    55            38                       12                        2
duotu09rep    30            21                       7                         0
duotu09pro    63            44                       17                        4
duotu09rat    59            33                       18                        6
unige-cvml1   39            32                       6                         0
unige-cvml2   41            30                       11                        2
unige-cvml3   42            31                       8                         0
unige-cvml4   43            31                       9                         0
unige-cvml5   43            32                       8                         3
uaic-run1     33            26                       7                         2
uaic-run2     41            29                       10                        3
uaic-run3     33            24                       7                         2
Because of the newness of the Narrative Peak Detection Task, the method of scoring is still a subject of discussion. The scoring method was designed such that algorithms were given as much credit as possible for agreement between the peaks they hypothesized and the peaks chosen by the annotators. See the papers of individual participants [7] [10] for some additional discussion.
4 Linking Task
4.1 Task
The Linking Task, also called “Finding Related Resources Across Languages,” involves linking episodes of the Beeldenstorm documentary (Dutch language) to Wikipedia articles about related subject matter (English language). This task was new in 2009. Participants were supplied with 165 multimedia anchors, short (ca. 10 seconds) segments, pre-defined in the 45 episodes that make up the Beeldenstorm collection. For each anchor, participants were asked to automatically generate a list of English language Wikipedia pages relevant to the anchor, ordered from the most to the least relevant. Notice that this task was designed by the task organizers such that it goes beyond a named-entity linking task. Although a multimedia anchor may contain a named entity (e.g., a person, place or organization) that is mentioned in the speech channel, the anchors have been carefully chosen by the task organizers so that this is not always the case. The topic being discussed in the video at the point of the anchor may not be explicitly named. Also, the representation of a topic in the video may be split between the visual and the speech channel.
4.2 Evaluation
The ground truth for the linking task was created by the assessors. We adapted the four graded relevance levels used in [6] for application in the Linking Task. Level 3 links are referred to as primary links and are defined as "highly relevant – the page is the single page most relevant for supporting understanding of the video in the region of the anchor." There is only a single primary link per multimedia anchor, representing the one best page to which that anchor can be linked. Level 2 links are referred to as secondary links and are defined as "fairly relevant – the page treats a subtopic (aspect) of the video in the region of the anchor." The final two levels, Level 1 (defined as "marginally relevant – the page is not appropriate for the anchor") and Level 0 (defined as "irrelevant – the page is unrelated to the anchor"), were conflated and regarded as irrelevant. Links classified as Level 1 are generic links, e.g., "painting," or links involving a specific word that is mentioned but is not really central to the topic of the video at that point. Primary link evaluation. For each video, the primary link was defined by consensus among three assessors. The assessors were required to watch the entire episode so as to have the context to decide the primary link. Primary links were evaluated using recall (correct links/total links) and Mean Reciprocal Rank (MRR). Related resource evaluation. For each video, a set of related resources was defined. This set necessarily includes the primary link. It also includes other secondary links that the assessors found relevant. Only one assessor needed to find a secondary link relevant for it to be included. However, the assessors agreed on the general criteria to be applied when choosing a secondary link. Related resources were evaluated with MRR. The list of secondary links is not exhaustive; for this reason, no recall score is reported.
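As an illustration, the MRR evaluation used here can be sketched as follows (our own minimal sketch): the reciprocal rank of the first relevant target page, averaged over all multimedia anchors.

```python
# Minimal MRR sketch for the linking task.
def mean_reciprocal_rank(ranked_links, ground_truth):
    """ranked_links: {anchor_id: ordered list of Wikipedia page titles}.
    ground_truth: {anchor_id: set of relevant page titles}."""
    total = 0.0
    for anchor, ranking in ranked_links.items():
        relevant = ground_truth.get(anchor, set())
        rr = 0.0
        for rank, page in enumerate(ranking, start=1):
            if page in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_links) if ranked_links else 0.0
```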
4.3 Techniques
Centre for Digital Video Processing, Dublin City University, Ireland (see also [3]) The words spoken between the start point and the end point of the multimedia anchor (as transcribed in the speech recognition transcript) were used as a query and fired off against an index of Wikipedia. For dcu run1 and dcu run2 the Dutch Wikipedia was queried and the corresponding English page was returned. Stemming was applied in dcu run2. Dutch pages did not always have corresponding English pages. For dcu run3, the query was translated first and fired off against an English language Wikipedia index. For dcu run4 a Dutch query expanded using pseudo-relevance feedback was used. TNO Information and Communication Technology, Netherlands (see also [14]) A set of existing approaches was combined in order to implement a sophisticated baseline to provide a starting point for future research. A wikify tool was used to
find links in the Dutch speech recognition transcripts and in English translations of the transcripts. Particular attention was given to proper names, with one strategy giving preference to links to articles with proper-name titles and another strategy ensuring that proper name information was preserved under translation.
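An abstract sketch of the DCU-style strategy of querying the Dutch Wikipedia and crossing over via interlanguage links (our illustration; search_nl_wikipedia and interlanguage_link are hypothetical helpers standing in for an IR index and a link table extracted from a Wikipedia dump):

```python
# Sketch: query Dutch Wikipedia with the anchor transcript, then follow each
# article's interlanguage link to its English counterpart.
def link_anchor(anchor_transcript, search_nl_wikipedia, interlanguage_link, top_k=10):
    ranked_nl_pages = search_nl_wikipedia(anchor_transcript, top_k)
    english_targets = []
    for nl_title in ranked_nl_pages:
        en_title = interlanguage_link(nl_title, target_lang="en")
        if en_title is not None:   # not every Dutch page has an English page
            english_targets.append(en_title)
    return english_targets
```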
4.4 Results
The results of the task are reported in Table 3 (primary link evaluation) and Table 4 (related resource evaluation). The best run used a combination of different strategies, referred to by TNO as a "cocktail." The techniques applied by DCU achieved a lower overall score, but demonstrate that in general it is better not to translate the query, but rather to query Wikipedia in the source language and then cross over to the target language by using Wikipedia's own article-level links between languages. Note that the difference is in reality not as extreme as suggested by Table 3 (i.e., by dcu run1 vs. dcu run3). A subsequent version of the dcu run3 experiment (not reported in Table 3) that makes use of a version of Wikipedia that has been cleaned up by removing clutter (e.g., articles scheduled for deletion and meta-articles containing discussion) achieves an MRR of 0.171 for primary links. Insight into the difference between the DCU approach and the TNO approach is offered by an analysis that makes a query-by-query comparison between specific runs and average performance. DCU runs provide an improvement over average performance for more queries than the TNO run [14].

Table 3. Linking results: Primary link evaluation. Raw count correct and MRR.
run ID      raw   MRR
dcu run1    44    0.182
dcu run2    44    0.182
dcu run3    13    0.056
dcu run4    38    0.144
tno run1    57    0.230
tno run2    55    0.215
tno run3    58    0.251
tno run4    44    0.182
tno run5    47    0.197

Table 4. Linking results: Related resource evaluation. MRR.
run ID      MRR
dcu run1    0.268
dcu run2    0.275
dcu run3    0.090
dcu run4    0.190
tno run1    0.460
tno run2    0.428
tno run3    0.484
tno run4    0.392
tno run5    0.368
5 Conclusions and Outlook
In 2009, VideoCLEF participants carried out three tasks, Subject Classification, Narrative Peak Detection and Finding Related Resources Across Languages. These tasks generate enrichment for spoken content that can be used to provide improvement in multimedia access and retrieval. With the exception of the Narrative Peak Detection Task, participants concentrated largely on features derived from the speech recognition transcripts and
did not exploit other audio information or information derived from the visual channel. Looking towards next year, we will continue to encourage participants to use a wider range of features. We see the Subject Classification Task as developing increasingly towards a tag recommendation task, where systems are required to assign tags to videos. The tag set might not necessarily be known in advance. We expect that the formulation of this task as an information retrieval task will continue to prove useful and helpful, although we wish to move to metrics for evaluation that will better reflect the utility of the assigned tags for real-world search or browsing. In 2010, VideoCLEF will change its name to MediaEval (http://www.multimediaeval.org/) and its sponsorship will be taken over by PetaMedia (http://www.petamedia.eu/), a Network of Excellence dedicated to research and development aimed at improving multimedia access and retrieval. In 2010, several different data sets will be used. In particular, we introduce data sets containing creative commons data collected from the Web (predominantly English language) that will be used in addition to data sets from Beeld & Geluid (predominantly Dutch data). We will offer a tagging task, an affect task and a linking task as in 2009, but we will extend our task set to include new tasks, in particular geo-tagging and multimodal passage retrieval. The goal of MediaEval is to promote cooperation between sites and projects in the area of benchmarking, moving towards the common aim of "Innovation and Education via Evaluation."
Acknowledgements. We are grateful to TrebleCLEF (http://www.trebleclef.eu/), a Coordination Action of the European Commission's Seventh Framework Programme, for a grant that made possible the creation of a data set for the Narrative Peak Detection Task and the Linking Task. Thank you to the University of Twente for supplying the speech recognition transcripts and to the Netherlands Institute of Sound and Vision (Beeld & Geluid) for supplying the video. Thank you to Dublin City University for providing the shot segmentation and keyframes and also for hosting the team of Dutch-speaking video assessors during the Dublin Days event. We would also like to express our appreciation to Michael Kipp for use of the Anvil Video Annotation Research Tool. The work that went into the organization of VideoCLEF 2009 has been supported, in part, by the PetaMedia Network of Excellence and has received funding from the European Commission's Seventh Framework Programme under grant agreement no. 216444.
References
1. Calic, J., Sav, S., Izquierdo, E., Marlow, S., Murphy, N., O'Connor, N.: Temporal video segmentation for real-time key frame extraction. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP (2002)
2. Dobrilă, T.-A., Diaconaşu, M.-C., Lungu, I.-D., Iftene, A.: UAIC: Participation in VideoCLEF task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
3. Gyarmati, Á., Jones, G.J.F.: When to cross over? Cross-language linking using Wikipedia for VideoCLEF 2009. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
4. Hanjalic, A., Xu, L.-Q.: Affective video content representation and modeling. IEEE Transactions on Multimedia 7(1), 143–154 (2005)
5. Huijbregts, M., Ordelman, R., de Jong, F.: Annotation of heterogeneous multimedia content using automatic speech recognition. In: Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT (2007)
6. Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology 53(13), 1120–1129 (2002)
7. Kierkels, J.J.M., Soleymani, M., Pun, T.: Identification of narrative peaks in video clips: Text features perform best. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
8. Kipp, M.: Anvil – a generic annotation tool for multimodal dialogue. In: Proceedings of Eurospeech, pp. 1367–1370 (2001)
9. Kürsten, J., Eibl, M.: Video classification as IR task: Experiments and observations. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
10. Larson, M., Jochems, B., Smits, E., Ordelman, R.: A cocktail approach to the VideoCLEF 2009 linking task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
11. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) Evaluating Systems for Multilingual and Multimodal Information Access. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009)
12. Pecina, P., Hoffmannová, P., Jones, G.J.F., Zhang, Y., Oard, D.W.: Overview of the CLEF-2007 Cross-Language Speech Retrieval track. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 674–686. Springer, Heidelberg (2008)
13. Perea-Ortega, J.M., Montejo-Ráez, A., Martín-Valdivia, M.T., Ureña López, L.A.: Using Support Vector Machines as learning algorithm for video categorization. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
14. Raaijmakers, S., Versloot, C., de Wit, J.: A cocktail approach to the VideoCLEF 2009 linking task. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
15. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVID. In: Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR), pp. 321–330. ACM, New York (2006)
16. Wrede, B., Shriberg, E.: Spotting "hot spots" in meetings: Human judgments and prosodic cues. In: Proceedings of Eurospeech, pp. 2805–2808 (2003)
Methods for Classifying Videos by Subject and Detecting Narrative Peak Points
Tudor-Alexandru Dobrilă, Mihail-Ciprian Diaconaşu, Irina-Diana Lungu, and Adrian Iftene
UAIC: Faculty of Computer Science, "Alexandru Ioan Cuza" University, Romania
{tudor.dobrila,ciprian.diaconasu,diana.lungu,adiftene}@info.uaic.ro
Abstract. 2009 marked the first participation at the VideoCLEF evaluation campaign of UAIC (Universitatea "Alexandru Ioan Cuza", the "Al. I. Cuza" University of Iași). Our group built two separate systems for the "Subject Classification" and "Affect Detection" tasks. For the first task we created two resources, starting from Wikipedia pages and pages identified with Google, and used two tools for classification: Lucene and Weka. For the second task we extracted the audio component from a given video file using FFmpeg. After that, we computed the average amplitude of each word from the transcript by applying the Fast Fourier Transform algorithm in order to analyze the sound. A brief description of our systems' components is given in this paper.
1 Introduction
VideoCLEF 2009 (http://www.cdvp.dcu.ie/VideoCLEF/) required participants to carry out cross-language classification, retrieval and analysis tasks on a video collection containing documentaries and talk shows. In 2009, the collection extended the corpus used for the 2008 VideoCLEF pilot track. Two classification tasks were evaluated: "Subject Classification", which involves automatically tagging videos with subject labels, and "Affect and Appeal", which involves classifying videos according to characteristics beyond their semantic content. Our team participated in the following tasks: Subject Classification (in which participants had to automatically tag videos with subject labels such as 'Archeology', 'Dance', 'History', 'Music', etc.) and Affect Detection (in which participants had to identify narrative peaks, points within a video where viewers report increased dramatic tension, using a combination of video and speech/audio features).
2 Subject Classification In order to classify a video using its transcripts we perform four steps: (1) For each category we extract from Wikipedia and Google web pages related to the video; (2) From the documents obtained at Step 1 we extract only relevant words and compute 1 2
Univeristatea “Alexandru Ioan Cuza” (“Al. I. Cuza” University of Iași). VideoCLEF: http://www.cdvp.dcu.ie/VideoCLEF/
for each term a normalized value using its number of appearances; (3) We perform the same action as in Step 2 on the video transcripts; (4) The terms obtained at Step 2 are grouped into a list of categories given a priori and, using the list of words from Step 3 and a classification tool (Lucene or Weka), we classify the video into one of these categories.

Extract Relevant Words from Wikipedia: We used CategoryTree3 to analyze Wikipedia’s category structure as a tree. The query URL was created based on the language, the name of the category and the depth of the search within the tree. We performed queries for each category and obtained Wikipedia pages, which were later sorted by relevance. From the source of each page we extracted the content of the paragraph tags, transformed all words to lower case, lemmatized them and counted their number of appearances. In the end we computed, for each term in each category, a normalized score as the ratio between its number of appearances and the total number of appearances of all words in that category.

Extract Relevant Words from Google: This part is similar to the part performed on Wikipedia, except that terms from the “keywords” meta tag of relevant search results were extracted as well.

Lucene adds indexing and searching capabilities to applications [1]. Instead of directly indexing the files created at the previous steps for each category, we generated other files in which every word’s score is proportional to its number of appearances in the corresponding files. This way the score returned by Lucene is greater if the word from the file associated with a category has a higher number of appearances.

The Weka4 workbench [2] contains a collection of visualization tools and algorithms for data analysis and predictive modeling. For each category file (model file) and transcript file (test file), we create an ARFF file (Attribute-Relation File Format). Using a filter provided by the Weka tool, the content of the newly created files is transformed into instances. Each instance is classified by assigning it a score, and the one with the highest score is the result that Weka offers.

System Evaluation

In Table 1, we report the results of the evaluation in terms of mean average precision (MAP) using the trec_eval tool on training data5. We have evaluated nine runs, using different combinations of resources and classification algorithms.

Table 1. UAIC Runs on Training Data

Tools\Resources    Google   Wikipedia   Google and Wikipedia
Lucene             0.12     0.17        0.20
Weka               0.14     0.30        0.35
Lucene and Weka    0.19     0.33        0.45
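A minimal sketch of the normalized term scores (Steps 2–3) and the resulting category assignment (Step 4) is given below. It is an illustrative simplification of the Lucene/Weka-based classifiers actually used, with lemmatization replaced by plain whitespace tokenization.

```python
from collections import Counter

def normalized_scores(text):
    """Steps 2-3: lower-case, count appearances, normalize by the total count."""
    words = text.lower().split()              # lemmatization omitted in this sketch
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def classify(transcript, category_models):
    """Step 4: score the transcript against each category model and return
    the category whose terms overlap the transcript most strongly."""
    transcript_scores = normalized_scores(transcript)
    best_label, best_score = None, float("-inf")
    for label, model in category_models.items():
        score = sum(s * model.get(w, 0.0) for w, s in transcript_scores.items())
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# category_models maps each given category, e.g. "Music" or "Archeology",
# to normalized_scores() of the text harvested from Wikipedia and Google.
```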
3 CategoryTree: http://www.mediawiki.org/wiki/Extension:CategoryTree
4 Weka: http://www.cs.waikato.ac.nz/ml/weka/
5 During the evaluation campaign we did not send a run on test data; the data in this table were evaluated by us on the training files provided by the organizers.
The least conclusive results were obtained using Lucene and resources extracted from Google. Classification results using resources from Wikipedia with either the Lucene or the Weka tool are more representative, because the information extracted from this source is more concise. The best results were obtained when resources from both Google and Wikipedia were used. Lucene proved to be more useful when more results for a single input were needed, but the Weka tool using the Naive Bayes Multinomial classifier led to a single, more conclusive result. Combining both resources and the two tools is much more effective in terms of the accuracy of the results.
3 Affect Detection

Our work is based on the assumption that a narrative peak is a point in the video where the narrator raises his voice within a given phrase, in order to emphasize a certain idea. This means that a group of words is said more intensely than the preceding words and, since this applies in any language, we were able to develop a language-independent application using statistical analysis. This is why our approach is based on two aspects of the video: the sound and the ASR transcript.

The first step is the extraction of the audio from a given video file, which we accomplished with the use of FFmpeg6. We then computed the average amplitude of each word from the transcript, by applying the Fast Fourier Transform (FFT7) algorithm on the audio signal. The amplitude of a point in complex form X is defined as the ratio between the intensity of the frequency in X (as calculated by FFT) and the total number of points in the time-domain signal. FFT proved to be successful, because it helped establish the relation between neighboring words in terms of the way they are pronounced by the narrator.

Next, we computed a score for each group of words (spanning between 5 and 10 seconds) based on the previous group of words. The score is a weighted mean of several metrics, listed in Table 2. In the end, we considered only the top 3 scores, which were exported in .anvil format for later use in the Anvil Player. We submitted 3 runs with the following characteristics:

Table 2. Affect Detection: characteristics of UAIC runs

Run 1: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Quartile Coefficients of Dispersion of Current Group and Previous Group
Run 2: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Quartile Coefficients of Dispersion of Current Group and Previous Group
       • Ratio of Coefficients of Variation of Current Group and Previous Group
Run 3: • Ratio of Means of Amplitudes of Current Group and Previous Group
       • Ratio of Coefficients of Variation of Current Group and Previous Group
6 FFmpeg: http://ffmpeg.org/
7 FFT: http://en.wikipedia.org/wiki/Fast_Fourier_transform
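A rough sketch of the amplitude and group-score computation described above is shown below (using NumPy). The alignment of ASR words to audio samples, the weights of the weighted mean and the exact set of metrics are illustrative assumptions; only the two Run 1 metrics are shown.

```python
import numpy as np

def word_amplitude(samples):
    """Average amplitude of one word: |FFT bin| divided by the number of
    time-domain points, averaged over all bins."""
    spectrum = np.fft.fft(samples)
    return float(np.mean(np.abs(spectrum)) / len(samples))

def quartile_dispersion(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

def group_score(current, previous, w_mean=0.5, w_disp=0.5):
    """Run 1-style score of a 5-10 s word group relative to the previous group:
    a weighted mean of the ratio of mean amplitudes and the ratio of quartile
    coefficients of dispersion (the weights here are placeholders)."""
    ratio_means = np.mean(current) / np.mean(previous)
    ratio_disp = quartile_dispersion(current) / quartile_dispersion(previous)
    return w_mean * ratio_means + w_disp * ratio_disp

# current / previous: lists of word_amplitude() values for consecutive word groups;
# the three highest-scoring groups of an episode are reported as narrative peaks.
```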
In total, 60 hours of assessor time were devoted to creating the reference files of the narrative peaks for the 45 Beeldenstorm episodes used in the VideoCLEF 2009 Affect Task. Three assessors watched each of the 45 test files and marked their top three narrative peaks using the Anvil tool. Our best run (Run 2) was obtained when more statistical measures were incorporated into the final weighted sum that gave the score of a group of words. This could be improved by adding other metrics (e.g. the coefficient of correlation) and by properly adjusting the weights. Our method was successful when the narrator raised his voice in order to emphasize a certain idea, but failed when the semantic meaning of the words played an important role within a narrative peak.

Table 3. UAIC Runs Evaluation

Run ID   Point based scoring   Peaks based scoring 1   Peaks based scoring 2   Peaks based scoring 3
Run 1    33                    26                      7                       2
Run 2    41                    29                      10                      3
Run 3    33                    24                      7                       2
4 Conclusions

This paper presents UAIC’s system which took part in the VideoCLEF 2009 evaluation campaign. Our group built two separate systems for the “Subject Classification” and “Affect Detection” tasks. For the Subject Classification task we created two resources, built from Wikipedia pages and from results returned by the Google search engine. These resources are then used by the Lucene and Weka tools for classification. For the Affect Detection task we extracted the audio component from a given video file using FFmpeg. The audio signal is analyzed with the Fast Fourier Transform algorithm and scores are given to groups of neighboring words.
References

1. Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications Co. (2005)
2. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
3. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., Gonzalo, J., Jones, G.J.F., Müller, H., Tsikrika, T., Kalpathy-Kramer, J. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
Using Support Vector Machines as Learning Algorithm for Video Categorization

José Manuel Perea-Ortega, Arturo Montejo-Ráez, María Teresa Martín-Valdivia, and L. Alfonso Ureña-López

SINAI Research Group, Computer Science Department, University of Jaén, Campus Las Lagunillas, Edificio A3, E-23071, Jaén, Spain
{jmperea,amontejo,maite,laurena}@ujaen.es
http://sinai.ujaen.es
Abstract. This paper describes a supervised learning approach to classify Automatic Speech Recognition (ASR) transcripts from videos. A training collection was generated using the data provided by the VideoCLEF 2009 framework. These data contained metadata files about videos. The Support Vector Machines (SVM) learning algorithm was used in order to evaluate two main experiments: using the metadata files for generating the training corpus and without using them. The obtained results show the expected increase in precision due to the use of metadata in the classification of the test videos.
1 Introduction
Multimedia content-based retrieval is a challenging research field that has drawn significant attention in the multimedia research community [5]. With the rapid growth of multimedia data, methods for effective indexing and search of visual content are decisive. Specifically, interest in multimedia Information Retrieval (IR) systems has grown in recent years, as can be seen at conferences such as the ACM International Conference on Multimedia Information Retrieval (ACM MIR1) or the TREC Video Retrieval Evaluation (TRECVID2) conference. Our group has some experience in this field, using an approach based on the fusion of text-based retrieval and image-based retrieval [1]. Video categorization can be considered a subtask of multimedia content-based retrieval. VideoCLEF3 is a recent track of CLEF4 whose aim is to evaluate and improve access to video content in a multilingual environment. One of the main subtasks it proposes is the Subject Classification task, which is about automatically tagging videos with subject theme labels (e.g., “factories”, “poverty”, “cultural identity”, “zoos”, ...) [4].
1 http://press.liacs.nl/mir2008/index.html
2 http://www-nlpir.nist.gov/projects/trecvid
3 http://www.cdvp.dcu.ie/VideoCLEF
4 Cross Language Evaluation Forum, http://www.clef-campaign.org
In this paper, two experiments on the Subject Classification task are described. One main approach has been followed: supervised categorization using the SVM algorithm [2]. Additionally, two corpora have been generated: one using the metadata files provided by the VideoCLEF 2009 framework and one without them. The paper is organized as follows: Section 2 describes the approach followed in this work. Then, in Section 3, experiments and results are shown. Finally, in Section 4, the conclusions and further work are presented.
2 The Supervised Learning Approach
2.1 Generating the Training Data
The VideoCLEF 2009 Subject Classification task ran on TRECVid 2007/2008 data from Beeld & Geluid5. The training corpus consists of 262 XML files. These ASR files belong to VideoCLEF 2008 (50 files) and TRECVID 2007 (212 files). Additionally, there are some metadata files about the videos provided by the VideoCLEF organization [4]. For generating the training data, the content of the FreeTextAnnotation labels from the ASR files was extracted. Therefore, a TREC file per document was built. Additionally, the content of the description abstract labels from the metadata files was added to generate the learning corpus with metadata. Preprocessing of the training corpora consisted of filtering stopwords and applying a stemmer. Because all the original files are in Dutch, the Snowball stopword list for Dutch6, which contains 101 stopwords, and the Snowball Dutch stemmer7 were used.
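A sketch of the preprocessing step is shown below; it uses NLTK's Dutch stopword list and Snowball stemmer as stand-ins for the Snowball resources cited above, so the exact stopword set may differ slightly.

```python
import re
from nltk.corpus import stopwords               # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

DUTCH_STOPWORDS = set(stopwords.words("dutch"))
STEMMER = SnowballStemmer("dutch")

def preprocess(text):
    """Lower-case, tokenize, remove Dutch stopwords and stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return " ".join(STEMMER.stem(t) for t in tokens if t not in DUTCH_STOPWORDS)

# Each training document is built from the FreeTextAnnotation content of an ASR
# file, optionally extended with the corresponding metadata text.
```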
2.2 Using SVM as an ASR Classifier
Automatic tagging of videos with subject labels can be seen as a categorization problem, using the speech transcriptions of the test videos as the documents to classify. One of the successful uses of SVM algorithms is the task of text categorization into a fixed number of predefined categories based on document content. The commonly used representation of text documents in the field of IR provides a natural mapping for the construction of the Mercer kernels used in SVM algorithms. For the experiments and analysis carried out in this paper, the Rapid Miner8 framework was selected. This toolkit provides several machine learning algorithms, such as SVM, along with other interesting features. The learning algorithm selected for testing the supervised strategy is the Support Vector Machine [2]. SVM has been used in classification mode, with a 3-degree RBF kernel, the nu parameter equal to 0.5 and epsilon set to 0.0001, with the p-value at 0.1. The rest of the parameters were set to 0. A brief description of the experiments and their results using each generated corpus is given below.
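The same setup can be approximated with scikit-learn instead of Rapid Miner. The sketch below mirrors the parameter values quoted above (RBF kernel, degree 3, nu = 0.5, tolerance 0.0001), but it is not the authors' actual pipeline, and the toy documents are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import NuSVC

# Placeholder data; in practice these are the preprocessed transcripts and labels.
train_docs = ["muziek concert orkest", "opgraving archeoloog romeins", "dans ballet"]
train_labels = ["Music", "Archeology", "Dance"]
test_docs = ["het orkest speelt een concert"]

classifier = make_pipeline(
    TfidfVectorizer(),                              # bag-of-words / tf.idf mapping
    NuSVC(kernel="rbf", nu=0.5, degree=3, tol=1e-4),
)
classifier.fit(train_docs, train_labels)
print(classifier.predict(test_docs))                # predicted subject label(s)
```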
5 The Netherlands Institute of Sound and Vision (called Beeld & Geluid in Dutch)
6 http://snowball.tartarus.org/algorithms/dutch/stop.txt
7 http://snowball.tartarus.org/algorithms/dutch/stemmer.html
8 Rapid Miner is available from http://rapid-i.com
3 Experiments and Results
The Subject Classification task was introduced during VideoCLEF 2008 as a pilot task [3]. In 2009, the number of videos in the collection was increased from 50 to 418 and the number of subject labels from 10 to 46. This task is usually evaluated using Mean Average Precision (MAP), but the R-Precision measure has also been calculated. In 2008, the approach used in our participation in the VideoCLEF classification task was to use an Information Retrieval (IR) system as a classification architecture [7]. We collected topical data from the Internet by submitting the thematic class labels as queries to the Google search engine. The queries were derived from the speech transcripts, and a video was assigned the label corresponding to the top-ranked document returned when the video transcript text was used as a query. This approach was taken because the VideoCLEF 2008 collection provided development and test data, but no training data. In contrast, the approach followed in this paper is a first approximation to the automatic tagging of videos using a supervised learning scheme, for which the SVM algorithm has been selected. During the generation of the training corpus, two experiments have been evaluated: using the metadata files provided by the VideoCLEF organization and not using them. The results obtained are shown in Table 1.

Table 1. Experiments and results using SVM as learning algorithm

Learning corpus           MAP      R-prec
Using metadata            0.0028   0.0089
Without using metadata    0.0023   0.0061
Analyzing the results, it can be observed that using metadata during the generation of the training corpus improves the average precision of video classification by about 21.7% compared to not using metadata for generating the learning corpus. Consistent with the VideoCLEF 2008 observations, performance is better when archival metadata is used in addition to speech recognition transcripts.
4 Conclusions and Further Work
The use of metadata as a valuable source of information in text categorization was already applied some time ago, for example in the categorization of full-text papers enriched with their bibliographic records [6]. The results of the experiments suggest that training classifiers on speech transcripts from the same domain of videos could be a good strategy for the future. We expect to continue this work by applying a multi-label classifier instead of the multi-class SVM algorithm used so far. Additionally, the semantics of the speech transcriptions will also be investigated by studying how the inclusion of
synonyms from external resources such as WordNet9 affects the generated corpora and can further improve the performance of our system. On top of that, a method for detecting the linguistic register of the documents to be classified would serve as a selector for a suitable training corpus.
Acknowledgments

This paper has been partially supported by a grant from the Spanish Government, project TEXT-COOL 2.0 (TIN2009-13391-C04-02), a grant from the Andalusian Government, project GeOasis (P08-TIC-41999), and a grant from the University of Jaén, project RFC/PP2008/UJA-08-16-14.
References

1. Díaz-Galiano, M.C., Perea-Ortega, J.M., Martín-Valdivia, M.T., Montejo-Ráez, A., Ureña-López, L.: SINAI at TRECVID 2007. In: Proceedings of the TRECVID 2007 Workshop (2007)
2. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
3. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2008: Automatic Generation of Topic-Based Feeds for Dual Language Audio-Visual Content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009)
4. Larson, M., Newman, E., Jones, G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242. Springer, Heidelberg (2010)
5. Li, J., Chang, S.F., Lesk, M., Lienhart, R., Luo, J., Smeulders, A.W.M.: New challenges in multimedia research for the increasingly connected and fast growing digital society. In: Multimedia Information Retrieval, pp. 3–10. ACM, New York (2007)
6. Montejo-Ráez, A., Ureña-López, L.A., Steinberger, R.: Text categorization using bibliographic records: beyond document content. Sociedad Española para el Procesamiento del Lenguaje Natural (35) (2005)
7. Perea-Ortega, J.M., Montejo-Ráez, A., Díaz-Galiano, M.C., Martín-Valdivia, M.T., Ureña-López, L.A.: Using an Information Retrieval System for Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 927–930. Springer, Heidelberg (2009)
9 http://wordnet.princeton.edu/
Video Classification as IR Task: Experiments and Observations

Jens Kürsten and Maximilian Eibl

Chemnitz University of Technology, Faculty of Computer Science, Chair Computer Science and Media, Straße der Nationen 62, 09111 Chemnitz, Germany
{jens.kuersten,eibl}@cs.tu-chemnitz.de
Abstract. This paper describes experiments we conducted in conjunction with the VideoCLEF 2009 classification task. In our second participation in the task we experimented with treating classification as an IR problem and used the Xtrieval framework [1] to run our experiments. We confirmed that the IR approach achieves strong results although the data set was changed. We proposed an automatic threshold to limit the number of labels per document. Query expansion performed better than the corresponding baseline experiments in terms of mean average precision. We also found that combining the ASR transcriptions and the archival metadata improved the classification performance unless query expansion was used.
1 Introduction and Motivation
This article describes a system and its configuration, which we used for participation in the VideoCLEF classification task. The task [2] was to categorize dual-language video into 46 different classes based on provided ASR transcripts and additional archival metadata. Each of the given video documents can have no, one or even multiple labels. Hence the task can be characterized as a real-world scenario in the field of automatic classification. Our participation in the task is motivated by its close relation to our research project sachsMedia1. The main goals of the project are twofold. The first objective is the automatic extraction of low-level features from audio and video for automated annotation of poorly described content in archives. On the other hand, sachsMedia aims to support local TV stations in Saxony in replacing their analog distribution technology with innovative digital distribution services. A special problem of the broadcast companies is the accessibility of their archives for end users. The remainder of the article is organized as follows. In section 2 we briefly review existing approaches and describe the system architecture and its basic
1 Funded by the Entrepreneurial Regions program of the German Federal Ministry of Education and Research from April 2007 to March 2012.
configuration. In section 3 we present and interpret the results of preliminary and officially submitted experiments. A summary of our findings is given in section 4. The final section concludes the experiments with respect to our expectations and gives an outlook on future work.
2 System Architecture and Configuration
Since the classification task was an enhanced modification of last year’s VideoCLEF classification task [3], we give a brief review of previously used approaches. There were two distinct ways to approach the classification task: (a) collecting training data from external sources like general Web content or Wikipedia to train a text classifier, or (b) treating the problem as an information retrieval task. Villena-Román and Lana-Serrano [4] combined both ideas by obtaining training data from Wikipedia and assigning the class labels to the indexed training data. The metadata from the video documents were used as queries on the training corpus and the dominant label of the retrieved documents was assigned as the class label. Newman and Jones [5] as well as Perea-Ortega et al. [6] approached the problem as an IR task and achieved similarly strong performance. Kürsten et al. [7] and He et al. [8] tried to solve the problem with state-of-the-art classifiers like k-NN and SVM. Both used Wikipedia articles for training.
2.1 Resources
Given the impressions from last year’s evaluation and the huge success of the IR approaches, as well as the enhancement of the task to a larger number of class labels and more documents, we decided to treat the problem as an IR task to verify these results. Hence we used the Xtrieval framework [1] to create an index on the provided metadata. This index was composed of three fields, one with the ASR output, another with the archival metadata and a third containing both. A language-specific stopword list2 and the Dutch stemmer from the Snowball project3 were applied to process the tokens. We used the class labels to query our video document index. Within our framework we decided to use the Lucene4 retrieval core with its default vector-based IR model. An English thesaurus5 in combination with the Google AJAX language API6 was applied for query expansion purposes in the retrieval stage.
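The classification-as-IR idea can be illustrated independently of Lucene and Xtrieval: index one tf.idf document per video (ASR output, archival metadata, or both) and run each class label as a query. The sketch below uses scikit-learn and is a simplification, not the actual system configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_videos(docs, class_labels, labels_per_doc=1):
    """docs: one text string per video; class_labels: the 46 labels used as queries.
    Returns the top-scoring labels assigned to each document."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    query_matrix = vectorizer.transform(class_labels)
    scores = cosine_similarity(query_matrix, doc_matrix)     # label x document
    assigned = {i: [] for i in range(len(docs))}
    for li, label in enumerate(class_labels):
        for di in scores[li].argsort()[::-1]:
            if scores[li, di] > 0:
                assigned[di].append((label, scores[li, di]))
    # keep only the labels_per_doc best labels per document (the LpD limit)
    return {di: sorted(pairs, key=lambda p: -p[1])[:labels_per_doc]
            for di, pairs in assigned.items()}
```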
2.2 System Configuration and Parameters
The following list briefly explains our system parameters and their values in the experimental evaluation. Figure 1 illustrates the general workflow of the system.
2 http://snowball.tartarus.org/algorithms/dutch/stop.txt
3 http://snowball.tartarus.org/algorithms/dutch/stemmer.html
4 http://lucene.apache.org
5 http://de.openoffice.org/spellcheck/about-spellcheck-detail.html#thesaurus
6 http://code.google.com/apis/ajaxlanguage/documentation
– Source Field (SF): The metadata source was varied to indicate which source is most reliable and whether any combination yields an improvement of the classification or not.
– Multi-label Limit (LpD): The number of correct labels is usually related to the content of a video document. Therefore we investigated the relation between the number of assigned labels per document and the classification performance. Another related question is whether an automatic, content-specific threshold might be superior to fixed threshold values. Thus, we compared fixed thresholds to an automatic threshold (see Equation 1).
– Pseudo-Relevance Feedback (PRF): We performed some initial experiments on the training data to identify promising values for the number of terms and documents to use. We found that selecting a single term from a small set of only five documents was beneficial for this specific task and data set. Using more terms dramatically decreased the classification performance.
– Cross-Language Thesaurus Query Expansion (CLTQE): We used cross-language thesaurus query expansion for those queries which returned less than two documents. Again, only the first returned term was extracted and fed back to the system to reformulate the query, for the same reason as in the case of PRF.

The automatic threshold TLpD is based on the scores of the retrieved documents. Thereby RSVavg denotes the average score and RSVmax the maximum score of the documents retrieved, and Numdocs stands for the total number of documents retrieved for a specific class label. Please note that the explanation of the formula given in [9] was not correct.
Fig. 1. General System Architecture (workflow from the class labels through query formulation, token processing with stopword removal and stemming, query expansion via PRF and CLTQE, and the label-per-document limit, to the document list produced by the Xtrieval framework with the Lucene API)
T_{LpD} = RSV_{avg} + 2 \cdot \frac{RSV_{max} - RSV_{avg}}{Num_{docs}}    (1)
3
Experimental Setup and Results
In this section we report results that were obtained by running various system configurations on the test data. The experimental results on the training data are completely reported in [9]. Regarding the evaluation of the task we had a problem with calculating the measures. The MAP values reported by trec eval and our Xtrieval framework had marginal variations due to the fact that our system allows to return two documents with identical RSV. Unfortunately we were neither able to correct the behavior of our system nor could we find out when or why the trec eval tool reorders our result sets. Since the evaluation results had only small variations (see tables 1 and 2 in [9]) we do only report MAP values calculated by our framework to avoid confusion. Furthermore we present results for additional experiments that were not officially submitted. Column captions 2-5 of all result tables in the following subsections refer to specific system parameters that were introduced in section 2.2. Please note that the utilization of the threshold formula is denoted with x in column LpD. Experiments that were submitted for official evaluation are denoted with *. The performance of the experiments is reported with respect to overall sum of assigned labels (SumL), the average ratio of correct classifications (CR), average recall (AR) as well as mean average precision (MAP) and the F-Measure calculated over CR and AR. 3.1
Baseline Experiments
Table 1 contains results for our experiments without any query expansion. The only difference in the reported runs was the metadata source (SF) that was used in the retrieval stage. It is obvious that the best results in terms of AR and MAP were achieved when the ASR output and the archival metadata was used. The highest correct classification rate was obtained by using only archival metadata terms.
Video Classification as IR Task
381
Table 1. Results for Baseline Experiments ID SF LpD SumL CR AR MAP cut1 l1 base* asr 1 27 0.0741 0.0102 0.0104 cut2 l1 base meta 1 63 0.6349 0.2010 0.2003 cut3 l1 base* meta + asr 1 112 0.5000 0.2814 0.2541
3.2
F-Meas 0.0177 0.3053 0.3601
Experiments with Query Expansion
In the following list of experiments we used two types of query expansion. First we applied the PRF approach on all queries. It was briefly described in section 2.2. Additionally the CLTQE method was implemented to handle cases in which no or only few documents were returned. Table 2 is divided into 3 blocks depending on how many labels per document were allowed. It is obvious that using only archival metadata resulted in highest MAP. Average recall was similar for all experiments using archival metadata or combining archival metadata and ASR transcripts. Looking at the correct classification rate we observed that highest rates were achieved for experiments, where the number of assigned labels for each document were restricted to 1. Without this restriction the correct classification rate decreased dramatically. Using the proposed restriction formula from section 2.2 resulted in a balance of CR and MAP. The evaluation with respect to the F-Measure shows highest performance for the combination of archival metadata and ASR output. Table 2. Results using Query Expansion ID cut4 l0 qe cut5 l0 base cut6 l0 qe cut7 l1 qe* cut8 l1 base cut9 l1 qe* cut10 lx base cut11 lx qe
3.3
SF LpD SumL CR AR MAP F-Meas asr ∞ 1,571 0.0350 0.2764 0.1036 0.0621 meta ∞ 1,933 0.0792 0.7688 0.4391 0.1435 meta + asr ∞ 2,276 0.0690 0.7889 0.4389 0.1269 asr 1 158 0.1266 0.1005 0.0904 0.1120 meta 1 196 0.3776 0.3719 0.2867 0.3747 meta + asr 1 196 0.3622 0.3568 0.2561 0.3595 meta x 396 0.2879 0.5729 0.4115 0.3832 meta + asr x 482 0.2427 0.5879 0.4130 0.3436
Impact of Different Query Expansion Methods
This section deals with the effects of the two automatic expansion techniques. Therefore we switched PRF and CLTQE on and off for selected experiments from section 3.2 and aggregated the results. Table 3 is divided into 2 blocks corresponding to different values for threshold LpD, namely LpD=1 for 1 label per document and LpD=x, where formula (1) from section 2.2 was used.
382
J. K¨ ursten and M. Eibl Table 3. Comparing the Impact of Query Expansion Approaches
ID cut2 l1 base cut12 l1 base cut13 l1 base cut3 l1 base* cut14 l1 qe cut15 l1 qe cut16 lx base cut17 lx base cut18 lx base cut19 lx qe cut20 lx qe cut21 lx qe
SF meta meta meta meta meta meta meta meta meta meta meta meta
+ asr + asr + asr
+ asr + asr + asr
PRF CLTQE LpD SumL CR AR MAP F-Meas no no 1 63 0.6349 0.2010 0.2003 0.3053 yes no 1 195 0.3846 0.3769 0.3055 0.3807 no yes 1 68 0.6176 0.2111 0.2033 0.3146 no no 1 112 0.5000 0.2814 0.2541 0.3601 yes no 1 196 0.3622 0.3568 0.2619 0.3595 no yes 1 112 0.4821 0.2714 0.2275 0.3473 no no x 84 0.5714 0.2412 0.2386 0.3392 yes no x 366 0.3060 0.5628 0.4140 0.3965 no yes x 92 0.5543 0.2563 0.2418 0.3505 no no x 162 0.4383 0.3568 0.2978 0.3934 yes no x 466 0.2489 0.5829 0.4108 0.3489 no yes x 169 0.4083 0.3467 0.2707 0.3750
The results show that the automatic feedback approach is superior to the thesaurus expansion in all experiments. This observation complies with our expectation, because CLTQE was only used in rather rare cases, where no or only few documents matched the given class label. Interestingly using CLTQE results in very small gains in terms of MAP and only when the source field for retrieval was archival metadata (compare ID’s cut2 to cut13 and cut16 to cut18). The CLTQE approach decreased retrieval performance in experiments where both source fields were used. 3.4
General Observations and Interpretation
The best correct classification rates (CR) were achieved without using any form of query expansion (see ID’s cut2, cut3 and cut19) for all data sources used. The best overall CR was achieved by using only archival metadata in the retrieval phase (see ID cut2). Since the archival metadata fields contain intellectual annotations this is a very straightforward finding. Using archival metadata only also resulted in best performance in terms of MAP and AR. Nevertheless the gap to the best results when combining ASR output with archival metadata is very small (compare ID cut5 to cut6 or cut10 to cut11). Regarding our proposed automatic threshold calculation for limitation of the number of assigned labels per document the results are twofold. On the one hand there is a slight improvement in terms of MAP and AR compared to a fixed threshold LpD=1 assigned labels per document. On the other hand the overall correct classification rate (CR) decreases in the same magnitude as MAP and AR are increasing. The interpretation of our experimental results led us to the conclusion that using MAP for evaluating a multi-label classification task is somehow questionable. In our point of view the main reason is that MAP does not take into account the overall correct classification rate CR. Take a close look on the two best performing experiments using archival metadata and ASR transcriptions in table 2 (see ID’s cut6 and cut11). The difference in terms of MAP is about 6%, but
Video Classification as IR Task
383
the gain in terms of CR is about 352%. In our opinion in a real world scenario where assigning class labels to video documents should be completely automatic it would be essential to take into account the overall ratio of correctly assigned labels. We used the F-measure composed of AR and CR to derive an evaluation measure, which takes into account the overall precision of the classification, recall and the total number of assigned labels. Regarding the F-measure the best overall performance was achieved by using our proposed threshold formula on the archival metadata (see ID cut17). Nevertheless the gap between using intellectual metadata only and its combination with automatic metadata like ASR output was fairly small (compare ID’s cut17 to cut19 or cut12 to cut14).
4
Result Analysis - Summary
The following list provides a short summary of our observations and findings from the participation in the VideoCLEF classification task in 2009. – Classification as an IR task: According to the observations from last year, we conclude that treating the given task as a traditional IR task with some modifications is a quite successful approach. – Metadata Sources: Combining ASR output and archival metadata improves MAP and AR when no query expansion was used. However, best performance was achieved by querying archival metadata fields only and using QE. – Label Limits: We compared an automatically calculated threshold to low manual set thresholds and found that the automatic threshold works better in terms of MAP and AR. – Query Expansion: Automatic pseudo-relevance feedback improved the results in terms of MAP in all experiments. The impact of the CLTQE was very small and it even decreased performance when both fields (intellectual and automatic metadata) were queried. – Evaluation Measure: In our opinion using MAP as evaluation measure for a multi-label classification task is questionable. Therfore we also calculated the F-measure based on CR and AR.
5
Conclusion and Future Work
This year we used the Xtrieval framework for the VideoCLEF classification task. With our experimental evaluation we can confirm the observations from last year, where approaches treating the task as IR problem were most successful. We proposed an automatic threshold to limit the number of assigned labels per document to preserve high correct classification rates. This seems to be an issue that could be worked on in the future. A manual restriction of assigned labels per document is not an appropriate solution in a real world problem, where possibly hundreds of thousand video documents have to be labeled with maybe hundreds of different topic labels. Furthermore one could try to evaluate different retrieval models and try to combine the results from those models to gain a better overall
384
J. K¨ ursten and M. Eibl
performance. Finally, it should be evaluated whether assigning field boosts to the metadata sources could improve performance when intellectual annotations are combined with automatically extracted metadata.
Acknowledgments We would like to thank the VideoCLEF organizers and the Netherlands Institute of Sound and Vision (Beeld & Geluid) for providing the data sources for the task. This work was accomplished in conjunction with the project sachsMedia, which is funded by the Entrepreneurial Regions 8 program of the German Federal Ministry of Education and Research.
References 1. K¨ ursten, J., Wilhelm, T., Eibl, M.: Extensible Retrieval and Evaluation Framework: Xtrieval. In: Workshop Proceedings of LWA 2008: Lernen - Wissen - Adaption, W¨ urzburg (October 2008) 2. Larson, M., Newman, E., Jones, J.F.G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010) 3. Larson, M., Newman, E., Jones, J.F.G.: Overview of VideoCLEF 2008: Automatic Generation of Topic-based Feeds for Dual Language Audio-Visual Content. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 906–917. Springer, Heidelberg (2009) 4. Villena-Rom´ an, J., Lana-Serrano, S.: MIRACLE at VideoCLEF 2008: Topic Identification and Keyframe Extraction in Dual Language Videos. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 572–576. Springer, Heidelberg (2009) 5. Newman, E., Jones, G.J.F.: DCU at VideoClef 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 923–926. Springer, Heidelberg (2009) 6. Perea-Ortega, J.M., Montejo-Ra´ez, A., Mart´ın-Valdivia, M.T.: Using an Information Retrieval System for Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 927–930. Springer, Heidelberg (2009) 7. K¨ ursten, J., Richter, D., Eibl, M.: VideoCLEF 2008: ASR Classification with Wikipedia Categories. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 931–934. Springer, Heidelberg (2009) 8. He, J., Zhang, X., Weerkamp, W., Larson, M.: Metadata and Multilinguality in Video Classification. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Pe˜ nas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 935–938. Springer, Heidelberg (2009) 9. K¨ ursten, J., Eibl, M.: Chemnitz at VideoCLEF 2009: Experiments and Observations on Treating Classification as IR Task. In: Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 30-2 October (2009) 8
8 The Innovation Initiative for the New German Federal States.
Exploiting Speech Recognition Transcripts for Narrative Peak Detection in Short-Form Documentaries

Martha Larson1, Bart Jochems2, Ewine Smits1, and Roeland Ordelman2

1 Multimedia Information Retrieval Lab, Delft University of Technology, 2628 CD Delft, Netherlands
2 Human Media Interaction, University of Twente, 7500 AE Enschede, Netherlands
{m.a.larson,e.a.p.smits}@tudelft.nl, b.e.h.jochems@student.utwente.nl, ordelman@ewi.utwente.nl
Abstract. Narrative peaks are points at which the viewer perceives a spike in the level of dramatic tension within the narrative flow of a video. This paper reports on four approaches to narrative peak detection in television documentaries that were developed by a joint team consisting of members from Delft University of Technology and the University of Twente within the framework of the VideoCLEF 2009 Affect Detection task. The approaches make use of speech recognition transcripts and seek to exploit various sources of evidence in order to automatically identify narrative peaks. These sources include speaker style (word choice), stylistic devices (use of repetitions), strategies strengthening viewers’ feelings of involvement (direct audience address) and emotional speech. These approaches are compared to a challenging baseline that predicts the presence of narrative peaks at fixed points in the video, presumed to be dictated by natural narrative rhythm or production convention. Two approaches deliver top narrative peak detection results. One uses counts of personal pronouns to identify points in the video where viewers feel most directly involved. The other uses affective word ratings to calculate scores reflecting emotional language.
1 Introduction
While watching video content, viewers feel fluctuations in their emotional response that can be attributed to their perception of changes in the level of dramatic tension. In the literature on affective analysis of video, two types of content have received particular attention: sports games and movies [1]. These two cases differ with respect to the source of viewer-perceived dramatic tension. In the case of sports, tension spikes arise as a result of the unpredictable interactions of the players within the rules and physical constraints of the game. In the case of movies, dramatic tension is carefully crafted into the content by a team including scriptwriters, performers, special effects experts, directors and producers. The difference between the two cases is the amount and nature of human intention – i.e., premeditation, planning, intervention – involved in the
creation of the sequence of events that plays out over time (and space). We refer to that sequence as a narrative and to high points in the dramatic tension within that narrative as narrative peaks. We are interested in investigating a third case of video content, namely television documentaries. We consider documentaries to be a form of “edu-tainment,” whose purpose is both to inform and entertain the audience. The approaches described and tested here have been developed in order to detect narrative peaks within documentary videos. Our work differs in an important respect from previous work in the domains of sports and movies. Dramatic tension in documentaries is never completely spontaneous – the narrative curve follows a previously laid out plan, for example a script or an outline, that is carried out during the process of production. However, dramatic tension is characteristically less tightly controlled in a documentary than it would be in a movie. In a movie, the entire content is subordinated to the plot, whereas a documentary may follow one or more story lines, but it simultaneously pursues the goal of providing the viewer with factual subject matter. Because of these differences, we chose to dedicate separate and specific attention to the affective analysis of documentaries and in particular to the automatic detection of narrative peaks. This paper reports on joint work carried out by research groups at two universities in the Netherlands, Delft University of Technology and the University of Twente, on the Affect Detection task of the VideoCLEF1 track of the 2009 Cross-Language Evaluation Forum (CLEF)2 benchmark evaluations. The Affect Detection task involves automatically identifying narrative peaks in short-form documentaries. In the rest of this paper, we first give a brief description of the data and the task. Then, we present the approach that we took to the task and give the details of the algorithms used in each of the five runs that we submitted. We report the results achieved by these runs and then conclude with a summary and outlook.
2 Experimental Setup
2.1 Data Set and Task Definition
The data set for the VideoCLEF 2009 Affect Detection task consisted of 45 episodes from the Dutch-language short-form documentary series called Beeldenstorm (in English, ‘Iconoclasm’). The series treats topics in the visual arts, integrating elements from history, culture and current events. Beeldenstorm is hosted by Prof. Henk van Os, known not only for his art expertise, but also for his narrative ability. Henk van Os is highly acclaimed and appreciated in the Netherlands, where he has established his ability to appeal to a broad audience.3 Constraining the corpus to contain episodes from Beeldenstorm limits the spoken content to a single speaker speaking within the style of a single documentary
1 http://www.multimediaeval.org/videoclef09/videoclef09.html
2 http://www.clef-campaign.org/
3 http://www.avro.nl/tv/programmas a-z/beeldenstorm/
series. This limitation is imposed in order to help control effects that could be introduced by variability in style or skill. Experimentation on the ability of algorithms to transfer performance to other domains is planned for the future. An additional advantage of using the Beeldenstorm series is that the episodes are relatively short, approximately eight minutes in length. Because they are short, the assessors who create the ground truth for the test collection (discussed below) are able to watch each video in its entirety. It is essential for assessors to watch the entire video in order to judge relative rises in tension over the course of the narrative. In short, the Beeldenstorm program provides a highly suitable corpus for developing and evaluating algorithms for narrative peak detection.

Ground truth was created for Beeldenstorm by a team of assessors who speak Dutch natively or at an advanced level. The assessors were told that the Beeldenstorm series is known to contain humorous and moving moments and that they could use that information to formulate an opinion of what constitutes a narrative peak. They were asked to mark the three points in the video where their perception of the level of dramatic tension reached the highest peaks. Peaks were required to be a maximum of ten seconds in length.

For the Affect Detection task of VideoCLEF 2009, task participants were supplied with an example set containing five Beeldenstorm episodes in which example narrative peaks had been identified by a human assessor. On the basis of their observations and generalizations concerning the peaks marked in the example set, the task participants designed algorithms capable of automatically detecting similar peaks in the test set. The test set contained 45 videos and was mutually exclusive with the example set. Participants were required to identify the three highest peaks in each episode. Up to five different runs (i.e., system outputs created according to different experimental conditions) could be submitted. Further details about the data set and the Affect Detection task for VideoCLEF 2009 can be found in the track overview paper [3]. Participants were provided with additional resources accompanying the test data, including transcripts generated by an automatic speech recognition system [2]. Our approaches, described in the next section, focus on exploiting the contents of the speech transcripts for the purpose of automatically detecting narrative peaks.
2.2 Narrative Peak Detection Approaches
Our approaches consist of a sophisticated baseline and four other techniques for using speech recognition transcripts to automatically detect narrative peaks. We describe each algorithm in turn.

Fixing Time Points (duotu09fix). Our baseline approach duotu09fix4 hypothesizes fixed time points for three narrative peaks in each episode. These points were set at fixed distances from the start of each video: (1) 44 secs, (2) 7 mins 9 secs and (3) 3 mins 40 secs. They were selected by analyzing the peak
4 duotu is an acronym indicating the combined efforts of Delft University of Technology and the University of Twente.
positions in the example set and choosing three that appeared typical. They are independent of episode content and are the same for every episode. We chose this approach in order to establish a baseline against which our speech-transcript-based peak detection algorithms can be compared. Because the narrative structure of the episodes adheres to some basic patterns, presumably due to natural narrative rhythm or production convention, choosing fixed time points is actually a quite competitive approach and constitutes a challenging baseline.

Counting Indicator Words (duotu09ind). We viewed the example videos and examined the words that were spoken during the narrative peaks that the assessor had marked in these videos. We formulated the hypothesis that the speaker applies a narrow range of strategies for creating narrative peaks in the documentary. These strategies might be reflected in a relatively limited vocabulary of words that could be used as indicators in order to predict the position of narrative peaks. We compiled a list of narrative peak indicators by analyzing the words spoken during each of the example peaks, selecting words and word stems that seemed relatively independent of the topic at that point in the video and that could plausibly be characteristic of the general word use of the speaker during peaks. The duotu09ind algorithm detects narrative peaks using the following sequence of steps. First, a set of all possible peak candidates was established by moving a 10-second sliding window over the speech recognition transcripts, advancing the window by one word at each step. Each peak candidate is maximally 10 seconds in length, but can be shorter if the speech in the window lasts for less than the 10-second duration of the window. Peak candidates of less than three seconds in length are discarded. Then, the peak candidates are ranked with respect to the raw count of the indicator words that they contain. The size limitation of the sliding window already introduces a normalizing effect, and for this reason we do not undertake further normalization of the raw counts. Finally, peak candidates are chosen from the ranked list, starting at the top, until a total of three peaks has been selected. If a candidate has a midpoint that falls within eight seconds of the midpoint of a previously selected candidate in the list, that candidate is discarded and the next candidate from the list is considered instead.

Counting Word Repetitions (duotu09rep). Analysis of the word distributions in the example set suggested that repetition may be a stylistic device that is deployed to create peaks. The duotu09rep algorithm uses the same list of peak candidates described in the explanation of duotu09ind. The peak candidates are ranked by the number of occurrences they contain of words that occur multiple times. In order to eliminate the impact of function words, stop word removal is performed before the peak candidates are scored. Three peaks are selected starting from the top of the ranked list of peak candidates, using the same procedure as was described above.

Counting First and Second Person Pronouns (duotu09pro). We conjecture that dramatic tension rises along with the level to which the viewers feel that they are directly involved in the video content they are watching.
The duotu09pro approach identifies two possible conditions of heightened viewer involvement: when viewers feel that the speaker in the videos is addressing them directly or as individuals, or, second, when viewers feel that the speaker is sharing something personal. In the duotu09pro approach we use second person pronominal forms (e.g., u, ‘you’; uw, ‘your’) to identify audience-directed speech and first person pronominal forms (e.g., ik, ‘I’) to identify personal revelation of the speaker. The duotu09pro algorithm uses the same list of peak candidates and the same method of choosing from the ranked candidate lists that was used in duotu09ind and duotu09rep. For duotu09pro, the candidates are ranked according to the raw count of first and second person pronominal forms that they contain. Again, no normalization was applied to the raw count.

Calculating Affective Ratings (duotu09rat). The duotu09rat approach uses an affective rating score that is calculated in a straightforward manner using known affective levels of words in order to identify narrative peaks. The approach makes use of Whissell’s Dictionary of Affect in Language [5] as deployed in the implementation of [4], which is available online5. This dictionary of words and scores focuses on the scales of pleasantness and arousal levels. The scales are called evaluation and activation and they both range from -1.00 to 1.00. Under our approach, narrative peaks are identified with a high-arousal emotion combined with either a very pleasant or unpleasant emotion. In order to score words, we combine the evaluation and the activation scores into an overall affective word score, calculated using Equation 1:

wordscore = evaluation^2 + activation^2    (1)

If a certain word has a negative arousal, its wordscore is set to zero. In this way, wordscore captures high arousal only. In order to apply the dictionary, we first translate the Dutch-language speech recognition transcripts into English using the Google Language API6. The duotu09rat algorithm uses the same list of peak candidates used in duotu09ind, duotu09rep and duotu09pro. Candidates are ranked according to the average wordscore of the words that they contain, calculated using Equation 2:

rating = \frac{1}{N} \sum_{n=1}^{N} wordscore_n    (2)

Here, N is the number of words contained within a peak candidate that are included in Whissell’s Dictionary. Selection of peaks proceeds as in the other approaches, with the exception that the peak proximity condition was set to be more stringent: edges of peaks are required to be 4 secs apart from each other. The imposition of the more stringent condition reflects a design decision made in regard to the implementation and does not represent an optimized value. The wordscore curve for an example episode is illustrated in Figure 1. The peaks hypothesized by the system are indicated with circles.
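The candidate-generation and selection procedure shared by duotu09ind, duotu09rep, duotu09pro and duotu09rat can be sketched as below. Words are assumed to be (token, start, end) triples from the ASR transcript; the scoring function shown counts pronominal forms as in duotu09pro, and the pronoun list beyond ik/u/uw is an illustrative assumption.

```python
def peak_candidates(words, window=10.0, min_len=3.0):
    """Slide a 10-second window over the transcript, advancing by one word."""
    candidates = []
    for i, (_, start, _) in enumerate(words):
        group = [w for w in words[i:] if w[2] <= start + window]
        if group and group[-1][2] - start >= min_len:
            candidates.append(group)
    return candidates

PRONOUNS = {"ik", "mij", "mijn", "u", "uw"}   # first/second person forms (illustrative)

def pronoun_score(group):
    return sum(1 for token, _, _ in group if token.lower() in PRONOUNS)

def select_peaks(candidates, score_fn, n_peaks=3, min_gap=8.0):
    """Take candidates from the ranked list, skipping any whose midpoint lies
    within min_gap seconds of an already selected peak."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    chosen, midpoints = [], []
    for group in ranked:
        mid = (group[0][1] + group[-1][2]) / 2.0
        if all(abs(mid - m) > min_gap for m in midpoints):
            chosen.append(group)
            midpoints.append(mid)
        if len(chosen) == n_peaks:
            break
    return chosen

# peaks = select_peaks(peak_candidates(asr_words), pronoun_score)
```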
5 http://technology.calumet.purdue.edu/met/gneff/NeffPubl.html
6 http://code.google.com/intl/nl/apis/ajaxlanguage
Fig. 1. Plot of affective score over the course of an example video (Beeldenstorm episode Kluizenaars in de kunst, ‘Hermits in art’). The three top peaks identified by duotu09rat are marked with circles.
3 Experimental Results
We tested our five experimental approaches on the 45 videos in the test set. Evaluation of results was carried out by comparing the peak positions hypothesized by each experimental system with peak positions that were set by human assessors. In total, three assessors viewed each of the test videos and set peaks at the three points where they felt most highly affected by the narrative tension created by the video content. In total the assessors identified 293 distinct narrative peaks in the 45 test episodes. Peaks identified by different assessors were considered to be the same peak if they overlapped by at least two seconds. This value was set on the basis of observations by the assessors on characteristic distances between peaks. Overlapping peaks were merged by fitting the overlapped region with a ten-second window. This process was applied so that merged peaks could never exceed the specified peak length of ten seconds.

Two methods of scoring the experiments were applied, the point-based approach and the peak-based approach. Under point-based scoring, a peak hypothesis scores a point for each assessor who selected a reference peak that is within eight seconds of that hypothesis peak. The total number of points returned by the run is the reported run score. A single episode can earn a run between three points (assessors chose completely different peaks) and nine points (assessors all chose the same peaks). In reality, however, no episode falls at either of these extremes. The distribution of the peaks in the files is such that a perfect run would earn 246 points. Under peak-based scoring, the total number of correct peaks is reported as the run score. Three different types of reference peaks are defined for peak-based scoring. The difference is related to the number of assessors required to agree for a point in the video to be counted as a peak. Of the 293 peaks identified, 203 are “personal peaks” (peaks identified by only one assessor), 90 are “pair peaks” (peaks identified by at least two assessors) and 22 are “general peaks” (peaks upon which all three assessors agreed). Peak-based scores are reported separately for each of these types of peaks. A summary of the results of the evaluation is given in Table 1.
Table 1. Narrative peak detection results

measure                     duotu09fix   duotu09ind   duotu09rep   duotu09pro   duotu09rat
point-based                 47           55           30           63           59
peak-based ("personal")     28           38           21           44           33
peak-based ("pair")         8            12           7            17           18
peak-based ("general")      4            2            0            4            6
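For reference, the point-based scoring described above can be sketched as follows; hypothesis and reference peaks are (start, end) pairs in seconds, and measuring the eight-second tolerance between peak midpoints is a simplifying assumption of this sketch.

```python
def midpoint(peak):
    start, end = peak
    return (start + end) / 2.0

def point_based_score(hypotheses, assessor_peaks, tolerance=8.0):
    """One point for each assessor whose marked peak lies within `tolerance`
    seconds of a hypothesized peak; assessor_peaks holds one list per assessor."""
    points = 0
    for hyp in hypotheses:
        for peaks in assessor_peaks:
            if any(abs(midpoint(hyp) - midpoint(ref)) <= tolerance for ref in peaks):
                points += 1
    return points
```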
From these results it can be seen that duotu09pro, the approach that counted first and second person pronouns, and duotu09rat, the approach that made use of affective word scores are the best performing approaches. The approach relying on a list of peak indicator words, i.e., duotu09ind, performed surprisingly well considering that the list was formulated on the basis of a very limited number of examples.
4 Conclusion and Outlook
We have proposed four approaches to the automatic detection of narrative peaks in short-form documentaries and have evaluated these approaches within the framework of the VideoCLEF 2009 Affect Detection task, which uses a test set consisting of episodes from the Dutch-language documentary on the visual arts called Beeldenstorm. Our proposed approaches exploit speech recognition transcripts. The two most successful algorithms are based on the idea that narrative peaks are perceived where particularly emotional speech is being used (duotu09rat) or where the viewer feels specifically addressed by or involved in the video (duotu09pro). These two approaches easily beat both the random baseline and a challenging baseline approach hypothesizing narrative peaks at set positions in the video. Approaches based on capturing speaking style, either by using a set of indicator words typical for the speaker or by trying to determine where repetition is being used as a stylistic device, proved less helpful. However, the experiments reported here are not extensive enough to exclude the possibility that they would perform well given a different implementation. Future work will involve returning to many of the questions opened here; for example, while selecting peak-indicator words, we noticed that contrasts introduced by the word ‘but’ appear to often be associated with narrative peaks. Stylistic devices in addition to repetition, for example the use of questions, could also prove to be helpful. Under our approach, peak candidates are represented by their spoken content. We would also like to investigate enriching the representations of peak candidates using words derived from surrounding regions in the speech transcripts or from an appropriate external text collection. Finally,
we intend to develop peak detection methods based on the combination of information sources, in particular exploring whether information derived from pronoun occurrences can enhance affect-based ratings.
Acknowledgements. The work was carried out within the PetaMedia Network of Excellence and has received funding from the European Commission's 7th Framework Program under grant agreement no. 216444.
Identification of Narrative Peaks in Video Clips: Text Features Perform Best

Joep J.M. Kierkels1,2, Mohammad Soleymani2, and Thierry Pun2

1 Department of medical physics, TweeSteden hospital, 5042AD Tilburg, the Netherlands
2 Computer vision and multimedia laboratory (CVML), Computer Science Department, University of Geneva, Battelle Campus, Building A, 7 Route de Drize, CH-1227 Carouge, Geneva, Switzerland
jkierkels@tsz.nl, {mohammad.soleymani,thierry.pun}@unige.ch
Abstract. A methodology is proposed to identify narrative peaks in video clips. Three basic clip properties are evaluated, reflecting video-, audio- and text-related features of the clip. Furthermore, the expected distribution of narrative peaks throughout the clip is determined and exploited when predicting peaks. Results show that only the text-related feature, based on the usage of distinct words throughout the clip, and the expected peak distribution are of use when finding the peaks. On the training set, our best detector had an accuracy of 47% in finding narrative peaks. On the test set, this accuracy dropped to 24%.
1 Introduction

A challenging issue in content-based video analysis techniques is the detection of sections that evoke increased levels of interest or attention in viewers of videos. Once such sections are detected, a summary of a clip can be created which allows for faster browsing through relevant sections. This saves valuable time for any viewer who merely wants to see an overview of the clip. Past studies on highlight detection often focus on analyzing sports videos [1], in which highlights usually show abrupt changes in content features. Although clips usually contain audio, video, and spoken text content, many existing approaches focus on merely one of these [2;3]. In the current paper, we will attempt to compare and show results for all three modalities. The proposed methodology to identify narrative peaks in video clips was presented at the VideoCLEF 2009 subtask on "Affect and Appeal" [4]. The clips that were given in this subtask were all taken from a Dutch program called "Beeldenstorm". They were in Dutch, had durations between seven and nine minutes, consisted of video and audio, and had speech transcripts available. Detection accuracy was determined by comparison against manual annotations of narrative peaks provided by three annotators. The annotators were either native Dutch speakers or fluent in Dutch. Each annotator chose the three highest affective peaks of each episode.
While viewing the clips, finding clear indicators as to which specific audiovisual features could be used to identify narrative peaks was not straightforward, even by looking at the annotations that were provided with the training set. Furthermore, we noticed that there was little consistency among the annotators because, across annotators, more than three distinct narrative peaks were indicated for all clips. This led to the conclusion that tailoring any detection method to a single person's view on narrative peaks would not be fruitful, and hence we decided to work only with basic features. These features are expected to be indicators of narrative peaks that are common to most observers, including the annotators. Our approach for detecting peaks consists of a top-down search for relevant features, i.e., first we computed possibly relevant features and then we investigated which of these features really enhanced detection accuracy. Three different modalities were treated separately. First, video, in MPEG-1 format, was used to determine at what place in the clip frames showed the largest change compared to a preceding key frame. Second, audio, in MPEG layer 3 format, was used to determine at what place in the clip the speaker has an elevated pitch or an increased speech volume. Third, text, taken from the available metadata xml files in MPEG-7 format, was used to determine at what place in the clip the speaker introduced a new topic. In addition, the expected distribution of narrative peaks over clips was considered. Details on how all these steps were implemented are given in Section 2, followed by results of our approach on the given training data in Section 3. Discussion of the obtained results and evaluations is given in Section 4. In Section 5 several conclusions are drawn from these results. In the VideoCLEF subtask, the focus of detecting segments of increased interest is on the data, i.e., we analyze parts of the shown video clip to predict their impact on a viewer. There exists, however, a second approach to identifying segments of increased interest, which focuses not on the data but directly on the reactions of a viewer, e.g., by monitoring physiological activity such as heart rate [5] or by filming facial expressions [6]. Based on such reactions, the affective state of a viewer can be estimated and one can estimate levels of excitation, attention and interest in a viewer [7]. By themselves, physiological activity measures can thus be used to estimate interest, but they could also be used to validate the outcomes of data-based techniques.
2 Feature Extraction

For the different modalities, feature extraction will be described separately in the following subsections. As the topic of detecting affective peaks is quite unexplored, only basic features were implemented. This provides an initial idea of which features are useful, and future studies could focus on enhancing the relevant basic features. Feature extraction was implemented using MATLAB (Mathworks Inc).

2.1 Video Features

Our key assumption for video features was that dramatic tension is related to big changes in video. It is a film editor's choice to include such changes along time [8],
and this may be used to stress the importance of certain parts in the clip. The proposed narrative peak detector outputs a 10 s window of enhanced dramatic tension; for videos with a frame rate of 25 frames per second, frame-level precision is therefore unnecessarily fine and merely slows down computations. Hence only the key frames (I-frames) are treated.
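The exact frame-difference measure used for the video feature is not specified in the text recovered here, so the sketch below should be read as an assumed reconstruction in Python (the authors used MATLAB): I-frames are taken to be already decoded into grayscale arrays, the change between consecutive I-frames is measured as mean absolute pixel difference, and the 10 s smoothing, scaling and zero-mean step applied to the other modalities is assumed to apply here as well.

import numpy as np

def video_feature(keyframes, clip_len_s, win=10):
    # keyframes: list of (timestamp_in_seconds, grayscale I-frame as a 2-D array)
    video = np.zeros(clip_len_s)
    for (t0, f0), (t1, f1) in zip(keyframes, keyframes[1:]):
        sec = min(int(t1), clip_len_s - 1)
        # assumed change measure: mean absolute pixel difference to the preceding I-frame
        video[sec] = np.abs(f1.astype(float) - f0.astype(float)).mean()
    kernel = np.ones(win) / win
    smoothed = np.convolve(video, kernel, mode="same")   # average over a 10 s window
    peak = np.abs(smoothed).max()
    scaled = smoothed / peak if peak > 0 else smoothed    # maximum absolute value of one
    return scaled - scaled.mean()                         # set to zero mean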
Fig. 1. Illustration of single-modality feature values (arbitrary score plotted against time in seconds) computed over time. A: Video feature, B: Audio features, C: Text feature. All plots are based on the episode with the identification code (ID) "BG_37016".
2.2 Audio Features

The key assumption for audio was that a speaker has an elevated pitch or an increased speech volume when applying dramatic tension, as suggested in [9;10]. The audio is encoded at a 44.1 kHz sampling rate in MPEG layer 3 format. The audio signals only contain speech, except for short opening and closing credits at the start and end of each episode. The audio signal is divided into 0.5 s segments, for which the average pitch of the speaker's voice is computed by imposing a Kaiser window and applying a Fast Fourier Transform. In the transformed signal, the frequency with maximum power is determined and is assumed to be the average pitch of the speaker's voice over this window. Next, the difference in average pitch between subsequent segments is computed. If a segment's average pitch is less than 2.5 times as high as the pitch of the preceding segment, its pitch value is set to zero. This way, only those segments with a strong increase in pitch (a supposed indicator of dramatic tension) are kept. Speech volume is determined by computing the averaged absolute value of the audio signal within the 0.5 s segment. As a final step, the resulting signals for pitch and volume are both smoothed by averaging over a 10 s window, and each smoothed signal is scaled to have a maximum absolute value of one and subsequently to have a mean of zero. Next, they are down-sampled by a factor of 2, resulting in vectors audio1 and audio2 which both contain one value per second, as illustrated in Fig. 1B.

2.3 Text Features

The main assumption for text is that dramatic tension starts with the introduction of a new topic, and hence involves the introduction of new vocabulary related to this topic. Text transcripts are obtained from the available metadata xml files. The absolute
occurrence frequency for each word was computed. Words that occurred only once were considered to be non-specific and were ignored. Words that occurred more than five times were considered too general and were also ignored. The remaining set of words is considered to be topic-specific. Based on this set of words, we estimated where the changes in used vocabulary are the largest. A vector v filled with zeros was initialized, having a length equal to the number of seconds in the clip. For each remaining word, its first and last appearance in the metadata container was determined and rounded off to whole seconds; subsequently, all elements in v between the elements corresponding to these timestamps are increased by one. Again, the resulting vector v is averaged over a 10 s window, scaled and set to zero mean. The resulting vector text is illustrated in Fig. 1C.

2.4 Distribution of Narrative Peaks

A clip is directed by a program director and is intended to hold the attention of the viewer. To this end, it is expected that points of dramatic tension are distributed over the duration of the whole clip, and that not all moments during a clip are equally likely to have dramatic tension. For each dramatic tension point indicated by the annotators, its time of occurrence was determined (mean of start and stop timestamp) and a histogram, illustrated in Fig. 2, was created based on these occurrences. Based on this histogram, a weighting vector w was created for each recording. Vector w contains one element for each second of the clip. Each element's value is determined according to the histogram.
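Returning to the text feature of Sect. 2.3, the following Python sketch (ours, not the authors' MATLAB implementation) shows one way the word-span vector v could be computed; it assumes word-level timestamps are available from the metadata files, and all names are illustrative.

import numpy as np
from collections import Counter

def text_feature(word_times, clip_len_s, win=10):
    # word_times: list of (word, time_in_seconds) pairs taken from the transcript
    counts = Counter(w for w, _ in word_times)
    # ignore words occurring only once (non-specific) or more than five times (too general)
    topical = {w for w, c in counts.items() if 1 < c <= 5}
    v = np.zeros(clip_len_s)
    for w in topical:
        times = [t for word, t in word_times if word == w]
        first, last = int(round(min(times))), int(round(max(times)))
        v[first:last + 1] += 1      # the word is considered active between first and last use
    kernel = np.ones(win) / win
    v = np.convolve(v, kernel, mode="same")   # average over a 10 s window
    peak = np.abs(v).max()
    v = v / peak if peak > 0 else v           # scale to a maximum absolute value of one
    return v - v.mean()                       # set to zero mean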
Fig. 2. Histogram (peak count per time bin, time in seconds) that illustrates when dramatic tension points occur in the clips according to the annotators. Note that during the first several seconds there is no tension point at all.
2.5 Fusion and Selection

For fusion of the features, our approach consisted of giving equal importance to all used features. After fusion, the weights vector w can be applied and the final indicator of dramatic tension drama is derived as (shown for all three features):

drama = w · ( video + (audio1 + audio2)/2 + text )^T    (2)
The estimated three points of increased dramatic tension are then obtained by selecting the three maxima from drama. The three top estimates for dramatic points are constructed by selecting the intervals starting 5s before these peaks and ending 5s afterwards. If either the second or third highest point in drama is within 10s of the
highest point, the point is ignored in order to avoid having an overlap between the detected segments of increased dramatic tension. In those cases, the next highest point is used (provided that the new point is not within 10 s).

Table 1. Schemes for feature combinations

Scheme number  Used features       Weights
1              Video               Yes
2              Audio               Yes
3              Text                Yes
4              Video, Audio        Yes
5              Video, Text         Yes
6              Audio, Text         Yes
7              Video, Audio, Text  Yes
8              Text                No
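To make the fusion and selection step of Sect. 2.5 concrete, the sketch below applies Eq. (2) and then picks three non-overlapping maxima. It is a Python illustration written for this text, not the authors' code; in particular, the element-wise application of the weight vector w is an assumption.

import numpy as np

def select_peaks(video, audio1, audio2, text, w, n_peaks=3, min_gap_s=10):
    # all inputs are per-second vectors of equal length; w is the weighting vector
    drama = w * (video + (audio1 + audio2) / 2.0 + text)   # Eq. (2), applied element-wise
    order = np.argsort(drama)[::-1]                        # candidate seconds, best first
    chosen = []
    for sec in order:
        if all(abs(int(sec) - p) >= min_gap_s for p in chosen):
            chosen.append(int(sec))
        if len(chosen) == n_peaks:
            break
    # each estimate is the interval from 5 s before to 5 s after the selected peak
    return [(p - 5, p + 5) for p in chosen]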
3 Evaluation Schemes and Results

Different combinations of the derived features were made and subsequently evaluated against the training data. The schemes tested are listed in Table 1. If no weights are used (Scheme 8), vector w contains only ones. Scoring of evaluation results is performed based on agreement with the reviewers' annotations. Each time a detected peak coincides with (at least) one reviewer's annotation, a point is added. A maximum of three points can thus be scored per clip and, since there are five clips in the training set, the maximum score for any scheme is 15. The obtained scores are shown in Table 2.

Table 2. Results on the training set. The video ID codes in the dataset start with "BG_".

Scheme number  BG_36941  BG_37007  BG_37016  BG_37036  BG_37111  Total
1              0         0         1         1         1         3
2              2         1         1         1         1         6
3              2         1         1         2         1         7
4              0         1         2         1         1         5
5              1         2         2         1         0         6
6              2         1         1         2         1         7
7              1         1         2         1         0         5
8              0         1         1         1         0         3
4 Discussion

As can be seen in Table 2, the best-performing schemes on the training samples are scheme 3 and scheme 6, which both result in 7 accurately predicted narrative peaks and hence an accuracy of 47%. These two schemes both include the text-based feature and the weights vector. Scheme 6 also contains the audio-based feature but does not achieve increased accuracy from this inclusion. Considering that there is also strong disagreement between annotators, an accuracy of 47% (compared against the joint annotations of three annotators) shows the potential of using the automated narrative peak detector. The fact that this best-performing scheme is based only on a
text-based feature corresponds well to the initial observation that there is no clear audiovisual characteristic of a narrative peak when observing the clips. Five schemes have been evaluated using the test samples, mainly corresponding to some of the different schemes that were previously used in Table 1. The results of these five methods on the test data, and their explanations, are given in Table 3. For number 5, all narrative peaks were randomly selected (for comparison with random-level detection). Evaluation of these runs was performed in two ways: Peak-based (similar to the scoring system on the training data) and Point-based, which can be explained as follows: if a detected peak coincides with the annotations of more than one reviewer, multiple points are added. Hence the maximum score for a clip can be nine when the annotators fully agree on segments, while the maximum remains three when they fully disagree.

Table 3. Results on the test set

run number (scheme nr)  Score (Peak-based)  Score (Point-based)
1 (3)                   33                  39
2 (7)                   30                  41
3 (6)                   33                  42
4 (8)                   32                  43
5 (-)                   32                  43
The difference between the two scoring systems lies in the fact that the Point-based scoring system awards more than one point to segments which were selected by more than one annotator. If annotators agree on segments with increased dramatic tension, there will be (in total over three annotators) fewer annotated segments and hence the probability that by chance our automated approach selects an annotated segment will decrease. Therefore, awarding more points to the detection of these less probable segments seems logical. Moreover, a segment on which all annotators agree must be a really relevant segment of increased tension. On the other hand, this Point-based approach gives equal points to having just one correctly detected segment in a clip (annotated by all three annotators) and to detecting all three segments correctly (each of them by one annotator). Since our runs were selected based on the results that were obtained using the Peak-based scoring system, results on the test data are mainly compared to this scoring. First of all, it should be noted that results are never far better than random level, as can be seen by comparing to run number 5. Surprisingly, the Peak-based and Point-based scores show a distinctly different ranking of the runs. Run 1 performed the worst under the Point-based scoring, yet it performed best under the Peak-based scoring system. Based on the results obtained on the clips in the training set, it was expected that runs 1 and 3 would perform best. This is clearly reflected in the results we obtain when using the same evaluation method on the test clips, the Peak-based evaluation. However, with the Point-based scoring system this effect disappears. This may indicate that the main feature that we used, the text-based feature based on the introduction of a new topic, does not properly reflect the notion of dramatic tension for all annotators, but is biased towards a single annotator.
Each video clip in the dataset was only annotated for its top three narrative peaks. The lack of a fully annotated dataset with all possible narrative peaks made it difficult to study the effect of narrative peaks on low-level content features. Had all the narrative peaks at different levels been available on a larger dataset, the correlation between the corresponding low-level content features could have been computed. The significance of these features for estimating narrative peaks could therefore have been further investigated.
5 Conclusions

The narrative peak detection subtask described in the VideoCLEF 2009 Benchmark Evaluation has proven to be a challenging and difficult one. Failing to see obvious features when viewing the clips and only seeing a mild connection between new topics and dramatic tension peaks, we resorted to the detection of the start of new topics in the text annotations of the provided video clips and the use of some basic video- and audio-based features. In our initial evaluation based on the training clips, the text-based feature proved to be the most relevant one and hence our submitted evaluation runs were centered on this feature. When using a consistent evaluation of training and test clips, the text-based feature also led to our best results on the test data. The overall detection accuracy based on the text-based feature dropped from 47% correct detection on the training data to 24% on the test data. It should be stated that results on the test data were just mildly above random level. The randomly drawn results by chance performed better than random level; the simulated random-level results are 40 for the Point-based and 30 for the Peak-based scoring schemes. The reported results based on the Point-based scoring differed strongly from the results obtained using the scoring system that was employed on the training data. It was shown that, although using the peak distribution as a data-driven method enhanced the results on the training data, the same approach cannot be generalized due to its bias toward the annotations on the training samples. In fact, the number of narrative peaks is unknown for any given video. The most precise annotation of such documentary clips can be obtained from the original script writer and the narrator himself. Not having access to these resources, more annotators should annotate the videos. These annotators should be able to choose freely any number of narrative peaks. To improve the peak detection, a larger dataset is needed to compute the significance of correlations between features and narrative peaks. Given the challenging nature of the task, it is our strong belief that the indication that text-based features (related to the introduction of new topics) perform well is a valuable contribution in the search for an improved dramatic tension detector.

Acknowledgments. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2011] under grant agreement n° 216444 (see Article II.30. of the Grant Agreement), NoE PetaMedia. The work of Soleymani and Pun is supported in part by the Swiss National Science Foundation.
References
1. Hanjalic, A.: Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Transactions on Multimedia 7(6), 1114-1122 (2005)
2. Gao, Y., Wang, W.B., Yong, J.H., Gu, H.J.: Dynamic video summarization using two-level redundancy detection. Multimedia Tools and Applications 42(2), 233-250 (2009)
3. Otsuka, I., Nakane, K., Divakaran, A., Hatanaka, K., Ogawa, M.: A highlight scene detection and video summarization system using audio feature for a Personal Video Recorder. IEEE Transactions on Consumer Electronics 51(1), 112-116 (2005)
4. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
5. Soleymani, M., Chanel, G., Kierkels, J.J.M., Pun, T.: Affective Characterization of Movie Scenes Based on Multimedia Content Analysis and User's Physiological Emotional Responses. In: IEEE International Symposium on Multimedia (2008)
6. Valstar, M.F., Gunes, H., Pantic, M.: How to Distinguish Posed from Spontaneous Smiles using Geometric Features. In: ACM Int'l Conf. Multimodal Interfaces, ICMI 2007 (2007)
7. Kierkels, J.J.M., Pun, T.: Towards detection of interest during movie scenes. In: PetaMedia Workshop on Implicit, Human-Centered Tagging (HCT 2008), Abstract only (2008)
8. May, J., Dean, M.P., Barnard, P.J.: Using film cutting techniques in interface design. Human-Computer Interaction 18(4), 325-372 (2003)
9. Alku, P., Vintturi, J., Vilkman, E.: Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation. Speech Communication 38(3-4), 321-334 (2002)
10. Wennerstrom, A.: Intonation and evaluation in oral narratives. Journal of Pragmatics 33(8), 1183-1206 (2001)
A Cocktail Approach to the VideoCLEF'09 Linking Task

Stephan Raaijmakers, Corné Versloot, and Joost de Wit

TNO Information and Communication Technology, Delft, The Netherlands
{stephan.raaijmakers,corne.versloot,joost.dewit}@tno.nl
Abstract. In this paper, we describe the TNO approach to the Finding Related Resources or linking task of VideoCLEF'09; additional information about the task can be found in Larson et al. [7]. Our system consists of a weighted combination of off-the-shelf and proprietary modules, including the Wikipedia Miner toolkit of the University of Waikato. Using this cocktail of largely off-the-shelf technology allows for setting a baseline for future approaches to this task. This work is supported by the European IST Programme Project FP6-0033812; this paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.
1 Introduction
The Finding Related Resources or linking task of VideoCLEF'09 consists of relating Dutch automatically transcribed TV speech to English Wikipedia content. For a total of 45 video episodes, a total of 165 anchors (speech transcripts) have to be linked to related Wikipedia articles. Technology emerging from this task will contribute to a better understanding of Dutch video for non-native speakers. The TNO approach to this problem consists of a cocktail of off-the-shelf techniques. Central to our approach is the use of the Wikipedia Miner toolkit developed by researchers at the University of Waikato (http://wikipedia-miner.sourceforge.net; see Milne and Witten [9]). The so-called Wikifier functionality of the toolkit detects Wikipedia topics from raw text, and generates cross-links from input text to a relevance-ranked list of Wikipedia pages. Here, 'topic' means a Wikipedia topic label, i.e. an element from the Wikipedia ontology, e.g. 'Monarchy of Spain' or 'rebellion'. We investigated two possible options for bridging the gap between Dutch input text and English Wikipedia pages: translating queries to English prior to the detection of Wikipedia topics, and translating Wikipedia topics detected in Dutch texts to English Wikipedia topics. In the latter case, the use of Wikipedia allows for an abstraction of raw queries to Wikipedia topics, for which the translation process in theory is less complicated and error-prone. In addition, we deploy a specially developed part-of-speech tagger for uncapitalized speech transcripts that is used to reconstruct proper names.
2 Related Work
The problem of cross-lingual link detection is an established topic on the agenda of cross-lingual retrieval, e.g. in the Topic Detection and Tracking community (e.g. Chen and Ku [2]). Recently, Van Gael and Zhu [4] proposed a graph-based clustering method (correlation clustering) for cross-linking news articles in multiple languages to the same event. In Smet and Moens [3], a method is proposed for cross-linking resources to the same (news) events in Dutch and English using probabilistic (latent Dirichlet) topic models, omitting the need for translation services or dictionaries. The current problem, linking Dutch text to English Wikipedia pages, is related to this type of cross-lingual, event-based linking in the sense that Dutch ’text’ (speech transcripts) is to be linked to English text (Wikipedia pages) tagged for a certain ’event’ (the topic of the Wikipedia page). There are also strong connections with the relatively recent topic of learning to rank (e.g. Liu [8]), as the result of cross-linking is a list of ranked Wikipedia pages.
3 System Setup
In this section, we describe the setup of our system. We start with the description of the essential ingredients of our system, followed by the definition of a number of linking strategies based on these ingredients. The linking strategies are combined into scenarios for our runs (Sect. 4). Fig. 1 illustrates our setup. For the translation of Dutch text to English (following Adafre and de Rijke [1]), we used the Yahoo! Babel Fish translation service (http://babelfish.yahoo.com/). An example of the output of this service is the following:

- Dutch input: als in 1566 de beeldenstorm heeft plaatsgevonden, één van de grootste opstanden tegen de inquisitie, keert willem zich definitief tegen de koning van spanje
- English translation: if in 1566 the picture storm has taken place, one of the largest insurrections against the inquisitie, turn himself willem definitively against the king of Spain

Since people, organizations and locations often have entries in Wikipedia, accurate proper name detection seems important for this task. Erroneous translation to English of Dutch names (e.g. 'Frans Hals' becoming 'French Neck') should be avoided. Proper name detection prior to translation allows for exempting the detected names from translation. A complicating factor is formed by the fact that the transcribed speech in the various broadcastings is in lowercase, which makes the recognition of proper names challenging, since important capitalization features can no longer be used. To address this problem, we trained a maximum entropy part-of-speech tagger: an instance of the Stanford tagger (http://nlp.stanford.edu/software/tagger.shtml)
(see Toutanova and Manning [10]). The tagger was trained on a 700K part-of-speech tagged corpus of Dutch, after having decapitalized the training data. The feature space consists of a 5-cell bidirectional window addressing part-of-speech ambiguities and prefix and suffix features up to a size of 3. The imperfect English translation by Babel Fish was observed to be the main reason for erroneous Wikifier results. In order to omit the translation step, we ported the English Wikifier of the Wikipedia Miner toolkit to Dutch, for which we used the Dutch Wikipedia dump and Perl scripts provided by developers of the Wikipedia Miner toolkit. The resulting Dutch Wikifier ('NL Wikifier' in Fig. 1) has exactly the same functionality as the English version, but unfortunately contains far fewer pages than the English version (a factor of 6 fewer). Even so, the translation process is now narrowed down to translating detected Wikipedia topics (the output of the Dutch Wikifier) to English Wikipedia topics. For the latter, we implemented a simple database facility (to which we shall refer as 'the English Topic Finder') that uses the cross-lingual links between topics in the Wikipedia database for carrying out the translation of Dutch topics to English topics. An example of the output of the English and Dutch Wikifiers for the query presented above is the following:

- Output English Wikifier: 1566, Charles I of England, Image, Monarchy, Monarchy of Spain, Rebellion, Spain, The Picture
- Output Dutch Wikifier: 1566, Beeldenstorm, Inquisitie, Koning (*titel), Lijst van koningen van Spanje, Spanje, Willem I van Holland

The different rankings of the various detected topics are represented as a tag cloud with different font sizes, and can be extracted as numerical scores from the output. In order to be able to entirely bypass the Wikipedia Miner toolkit, we deployed the Lucene search engine (Hatcher and Gospodnetic [5]) for performing the matching of raw, translated text with Wikipedia pages. Lucene was used to index the Dutch Wikipedia with the standard Lucene indexing options. Dutch speech transcripts were simply provided to Lucene as a disjunctive (OR) query, with Lucene returning the best matching Dutch Wikipedia pages for the query. The set of techniques just described leads to a total of four basic linking strategies. Of the various combinatorial possibilities of these strategies, we selected five promising combinations, each of which corresponds to a submitted run. The basic linking strategies are the following. Strategy 1: proper names only (the top row in Fig. 1). Following proper name recognition, a quasi-document is created that only consists of all recognized proper names. The Dutch Wikifier is used to produce a ranked list of Dutch Wikipedia pages for this quasi-document. Subsequently, the topics of these pages are linked to English Wikipedia pages with the English Topic Finder. Strategy 2: proper names preservation (second row in Fig. 1). Dutch text is translated to English with Babel Fish. Any proper names found in the part-of-speech tagged
Fig. 1. TNO system setup
Dutch text are added to the translated text as untranslated text, after which the English Wikifier is applied, producing a ranked list of matching Wikipedia pages. Strategy 3: topic-to-topic linking (3rd row from the top in Fig. 1). The original Dutch text is wikified using the Dutch Wikifier, producing a ranked list of Wikipedia pages. The topics of these pages are subsequently linked to English Wikipedia pages with the English Topic finder. Strategy 4: text-to-page linking (bottom row in Fig. 1). After Lucene has matched queries with Dutch Wikipedia pages, the English Topic Finder tries to find the corresponding English Wikipedia pages for the Dutch topics in the pages returned by Lucene. This strategy omits the use of the Wikifier and was used as a fall-back option, in case none of the other modules delivered a result. A thresholded merging algorithm removes any results below a hand-estimated threshold and blends the remaining results into a single ordered list of Wikipedia topics, using again hand-estimated weights for the various sources of these results. Several different merging schemata were used for different runs; these are discussed in Sect. 4.
4 Run Scenarios
In this section, we describe the configurations of the 5 runs we submitted. We were specifically interested in the effect of proper name recognition, the relative contributions of the Dutch and English Wikifiers, and the effect of full-text Babel Fish translation as compared to a topic-to-topic translation approach. Run 1: All four linking strategies were used to produce the first run. A weighted merger (’Merger’ in Fig. 1) was used to merge the results from the different strategies. The merger works as follows: 1. English Wikipedia pages referring to proper names are uniformly ranked before all other results.
2. The rankings produced by the second linking strategy (rankEN) and third linking strategy (rankDU) for any returned Wikipedia page p are combined according to the following scheme:

rank(p) = ((rankEN(p) × 0.2) + (rankDU(p) × 0.8)) × 1.4    (1)
The Dutch score was found to be more relevant than the English one (hence the 0.8 vs. 0.2 weights). The sum of the Dutch and English score is boosted with an additional factor of 1.4, awarding the fact that both linking strategies come up with the same result. 3. Pages found by Linking Strat. 2 but not by Linking Strat. 3 are added to the result and their ranking score is boosted with a factor of 1.1. 4. Pages found by Linking Strat. 3 but not by Linking Strat. 2 are added to the result (but their ranking score is not boosted). 5. If Linking Strats. 1 to 3 did not produce results, the results of Linking Strat. 4 are added to the result. Run 2: Run 2 is the same as Run 1 with the exception that Linking Strat. 1 is left out. Run 3: Run 3 is similar to Run 1, but does not boost results at the merging stage, and averages the rankings of the second and third linking strategy. This means that the weights used by the merger in Run 1 (0.8, 0.2 and 1.4) are respectively 0.5, 0.5 and 1.0 for this run. Run 4: Run 4 only uses Linking Strats. 1 and 3. This means that no translation from Dutch to English is performed. In the result set, the Wikipedia pages returned by Linking Strat. 1 are ordered before the results from Linking Strat. 3. Run 5: Run 5 uses all linking strategies except Linking Strat. 1 (it omits proper name detection). In this run a different merging strategy is used: 1. If Linking Strat. 2 produces any results, add those to the final result set and then stop. 2. If Linking Strat. 2 produces no results, but Linking Strat. 3 does, add those to the final result and stop. 3. If none of the preceding linking strategies produces any results, add the results from Linking Strat. 4 to the final result set.
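A compact sketch of the Run 1 merger described above is given below (Python, written for this text). The thresholding step and the exact score scales are omitted, and the data structures are assumptions; only the weights of Eq. (1), the 1.1 boost and the ordering rules follow the description.

def merge_run1(proper_name_pages, rank_en, rank_du, fallback_pages):
    scores = {}
    for p in set(rank_en) | set(rank_du):
        if p in rank_en and p in rank_du:
            scores[p] = (rank_en[p] * 0.2 + rank_du[p] * 0.8) * 1.4   # Eq. (1)
        elif p in rank_en:
            scores[p] = rank_en[p] * 1.1    # found by strategy 2 only: boosted by 1.1
        else:
            scores[p] = rank_du[p]          # found by strategy 3 only: not boosted
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not proper_name_pages and not ranked:
        ranked = list(fallback_pages)       # strategy 4 results as a fall-back
    # proper-name pages (strategy 1) are uniformly ranked before all other results
    return list(proper_name_pages) + [p for p in ranked if p not in proper_name_pages]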
5 Results and Discussion
For VideoCLEF'09, two groups submitted runs for the linking task: Dublin City University (DCU) and TNO. Two evaluation methods were applied by the task organizers to the submitted results. A team of assessors first achieved consensus on a primary link (the most important or descriptive Wikipedia article), with a minimum consensus among 3 people.
Table 1. Left table: recall and MRR for the primary link evaluation (average DCU scores were 0.21 and 0.14, respectively). Right table: MRR for the secondary link evaluation (average DCU score was 0.21).

Primary link evaluation:
Run          Recall  MRR
1            0.345   0.23
2            0.333   0.215
3            0.352   0.251
4            0.267   0.182
5            0.285   0.197
Average TNO  0.32    0.215

Secondary link evaluation:
Run          MRR
1            0.46
2            0.428
3            0.484
4            0.392
5            0.368
Average TNO  0.43
All queries in each submitted run were scored for Mean Reciprocal Rank (MRR) for this primary link, as well as for recall. (For a response r = r_1, ..., r_|Q| to a ranking task, MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i, with rank_i the rank of answer r_i with respect to the correct answer.) Subsequently, the annotators agreed on a set of related resources that necessarily included the primary link, in addition to secondary relevant links (minimum consensus of one person). Since this list of secondary links is non-exhaustive, for this measure only Mean Reciprocal Rank is reported, and not recall. As it turns out, the unweighted combination of results (Run 3) outperforms all other runs, followed by the thresholded, weighted combination (Run 1). This indicates that the weights in the merging step are suboptimal; for subsequent runs, these weights can now be estimated from the ground truth data that has become available from the initial run of this task. Merging unweighted results is generally better than applying an if-then-else schema: Run 2 clearly outperforms Run 5. Omitting proper name recognition results in a noticeable drop of performance under both evaluation methods, underlining the importance of proper names for this task. This is in line with the findings of e.g. Chen and Ku [2]. For the primary links, leaving out the 'proper names only' strategy leads to a drop of MRR from 0.23 (Run 1) to 0.215 (Run 2). Leaving out text translation and 'proper name preservation' triggers a drop of MRR from 0.23 (Run 1) to 0.182 (Run 4). While various additional correlations between performance and experimental options are open to exploration here, these findings underline the importance of proper names for this task. In addition to the recall and MRR scores, the assessment team distributed the graded relevance scores (Kekäläinen and Järvelin [6]) assigned to all queries. In Figs. 2 and 3, we plotted the difference per query of the averaged relevance score to the total average obtained relevance scores for both DCU and TNO runs. For every video, we averaged the relevance scores of the hits reported by DCU and TNO. Subsequently, for every TNO run, we averaged relevance scores for every query, and measured the difference with the averaged DCU and TNO runs. For TNO, Runs 1 and 3 produce the best results, with only a small number of queries below the mean.
Fig. 2. Difference plots of the various TNO runs (tno_run1 to tno_run5) compared to the averaged relevance scores of DCU and TNO (x-axis: ordered queries; y-axis: difference with mean relevance score, TNO+DCU)
Fig. 3. Difference plots of the various DCU runs (dcu_run1 to dcu_run4) compared to the averaged relevance scores of DCU and TNO (x-axis: ordered queries; y-axis: difference with mean relevance score, TNO+DCU)
Most of the relevance results obtained from these runs are around the mean, showing that, from the perspective of relevance quality, our best runs produce average results. DCU, on the other hand, appears to produce a higher proportion of relatively high-quality relevance results.
6 Conclusions
In this contribution, we have taken a technological and off-the-shelf-oriented approach to the problem of linking Dutch transcripts to English Wikipedia pages.
Using a blend of commonly available software resources (Babel Fish, the Waikato Wikipedia Miner Toolkit, Lucene, and the Stanford maximum entropy partof-speech tagger), we demonstrated that an unweighted combination produces competitive results. We hope to have demonstrated that this low-entry approach can be used as a baseline level that can inspire future approaches to this problem. A more accurate estimation of weights for the contribution of several sources of information can be carried out in future benchmarks, now that the VideoClef annotators have produced ground truth ranking data.
References
1. Adafre, S.F., de Rijke, M.: Finding Similar Sentences across Multiple Languages in Wikipedia. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 62-69 (2006)
2. Chen, H.-H., Ku, L.-W.: An NLP & IR approach to topic detection. Kluwer Academic Publishers, Norwell (2002)
3. De Smet, W., Moens, M.-F.: Cross-language linking of news stories on the web using interlingual topic modelling. In: SWSM 2009: Proceedings of the 2nd ACM Workshop on Social Web Search and Mining, pp. 57-64. ACM, New York (2009)
4. van Gael, J., Zhu, X.: Correlation clustering for crosslingual link detection. In: Veloso, M.M. (ed.) IJCAI, pp. 1744-1749 (2007)
5. Hatcher, E., Gospodnetic, O.: Lucene in Action. In Action series. Manning Publications Co., Greenwich (2004)
6. Kekäläinen, J., Järvelin, K.: Using graded relevance assessments in IR evaluation. J. Am. Soc. Inf. Sci. Technol. 53(13), 1120-1129 (2002)
7. Larson, M., Newman, E., Jones, G.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
8. Liu, T.-Y.: Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3(3), 225-331 (2009)
9. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Mining (CIKM 2008), pp. 509-518. ACM Press, New York (2008)
10. Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70 (2000)
When to Cross Over? Cross-Language Linking Using Wikipedia for VideoCLEF 2009

Ágnes Gyarmati and Gareth J.F. Jones

Centre for Digital Video Processing, Dublin City University, Dublin 9, Ireland
{agyarmati,gjones}@computing.dcu.ie
Abstract. We describe Dublin City University (DCU)’s participation in the VideoCLEF 2009 Linking Task. Two approaches were implemented using the Lemur information retrieval toolkit. Both approaches first extracted a search query from the transcriptions of the Dutch TV broadcasts. One method first performed search on a Dutch Wikipedia archive, then followed links to corresponding pages in the English Wikipedia. The other method first translated the extracted query using machine translation and then searched the English Wikipedia collection directly. We found that using the original Dutch transcription query for searching the Dutch Wikipedia yielded better results.
1 Introduction
The VideoCLEF Linking Task involved locating content related to sections of an automated speech recognition (ASR) transcription cross-lingually. Elements of a Dutch ASR transcription were to be linked to related pages in an English Wikipedia collection [1]. We submitted four runs by implementing two different approaches to solve the task. Because of the difference between the source language (Dutch) and the target language (English), a switch between the languages is required at some point in the system. Our two approaches differed in the switching method. One approach performed the search in a Dutch Wikipedia archive with the exact words (either stemmed or not) and then returned the corresponding links pointing to the English Wikipedia pages. The other one first performed an automatic machine translation of the Dutch query into English; the translated query was then used to search the English Wikipedia archive directly.
2 System Description
For our experiments we used the Wikipedia dump dated May 30th 2009 for the English archive, and the dump dated May 31st 2009 for the Dutch Wikipedia collection. In a simple preprocessing phase, we eliminated some information irrelevant to the task, e.g. information about users, comments, links to other
languages we did not need. For indexing and retrieving, we used the Indri model of the open source Lemur Toolkit [2]. English texts were stemmed using Lemur's built-in stemmer, while Dutch texts were stemmed using Oleander's implementation [5] of Snowball's Dutch stemmer algorithm [6]. We used stopword lists provided by Snowball for both languages. Queries were formed based on sequences of words extracted from the ASR transcripts using the word timing information in the transcript file. For each of the anchors defined by the task, the transcript was searched from the anchor starting point until the given end point, and the word sequence between these boundaries was extracted as the query. These sequences were used directly as queries for retrieval from the Dutch collection. The Dutch Wikipedia's links pointing to the corresponding articles of the English version were returned as the solution for each anchor point in the transcript. For the other approach, queries were translated automatically from Dutch to English using the query translation component developed for the MultiMatch project [3]. This translation tool uses the WorldLingo machine translation engine augmented with a bilingual dictionary from the cultural heritage domain automatically extracted from the multilingual Wikipedia. The translated query was used to search the English Wikipedia archive.
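The anchor-based query extraction just described can be sketched as follows; the transcript format (per-word start times) and the variable names are assumptions made for this illustration, and only the boundary-based selection follows the description above.

def extract_query(transcript_words, anchor_start, anchor_end):
    # transcript_words: list of (start_time_in_seconds, word) pairs from the ASR output
    return " ".join(w for t, w in transcript_words if anchor_start <= t <= anchor_end)

# Example with made-up timings for one anchor:
words = [(12.0, "als"), (12.4, "in"), (12.7, "1566"), (13.2, "de"), (13.5, "beeldenstorm")]
print(extract_query(words, 12.0, 13.6))   # -> als in 1566 de beeldenstorm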
3 Run Configurations
Here we describe the four runs we submitted to the Linking Task, plus an additional one performed subsequently.

1. Dutch. The Dutch Wikipedia was indexed without stemming or stopping. Retrieval was performed on the Dutch collection, returning the relevant links from the English collection.
2. Dutch stemmed. Identical to Run 1, except that the Dutch Wikipedia text is stemmed and stopped as described in Sect. 2.
3. English. This run represents the second approach, with stop word removal and stemming applied to the English documents and queries. The translated query was applied to the indexed English Wikipedia.
4. Dutch with blind relevance feedback. This run is almost identical to Run 1, with a difference in parameter settings for Lemur to perform blind relevance feedback. Lemur/Indri uses a relevance model for query expansion; for details see [4]. The first 10 retrieved documents were assumed relevant and queries were expanded by 5 terms (a schematic sketch of this expansion step is given after this list).
5. English (referred to as Run 3′). This is an amended version of Run 3, with the difference of an improved preprocessing phase applied to the English Wikipedia, disregarding irrelevant pages as described in Sect. 4.
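The blind relevance feedback configuration of Run 4 (top 10 documents assumed relevant, 5 expansion terms) can be illustrated schematically as below. This is a generic term-frequency sketch and not the Indri relevance model actually used by Lemur; all names are ours.

from collections import Counter

def expand_query(query_terms, ranked_docs, fb_docs=10, fb_terms=5):
    # ranked_docs: retrieval results, each document given as a list of terms
    counts = Counter()
    seen = set(query_terms)
    for doc in ranked_docs[:fb_docs]:          # assume the top fb_docs documents are relevant
        counts.update(t for t in doc if t not in seen)
    expansion = [t for t, _ in counts.most_common(fb_terms)]
    return list(query_terms) + expansion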
4 Results
The Linking Task was assessed by the organisers as a known-item task. The top-ranked relevant link for each anchor is referred to as a primary link, and all other relevant links identified by the assessors as secondary links [1].
Table 1. Scores for Related Links

Run     Recall (prim)  MRR (prim)  MRR (sec)
Run 1   0.267          0.182       0.268
Run 2   0.267          0.182       0.275
Run 3   0.079          0.056       0.090
Run 4   0.230          0.144       0.190
Run 3′  0.230          0.171       -
Table 1 shows Recall and Mean Reciprocal Rank (MRR) for primary links, and MRR values for secondary links. Recall cannot be calculated for secondary links due to the lack of an exhaustive identification of secondary links. Table 1 also includes Run 3′, evaluated automatically using the same set of primary links as in the official evaluation. Secondary links have been omitted as we could not provide the required additional manual case-by-case evaluation by the assessors. Runs 1 and 2 achieved the highest scores. Although they do yield slightly different output, the decision of whether to stem and stop the text does not alter the results statistically for primary links, while stemming and stopping (Run 2) did improve results a little in finding secondary links. Run 4, which used blind relevance feedback to expand the queries, was not effective here. Setting the optimal parameters for this process would require further experimentation, and either this or alternative expansion methods may produce better results. The main problem of retrieving from the Dutch collection lies in the differences between the English and the Dutch versions of Wikipedia. Although the English site contains a significantly larger number of articles, there are articles that have no equivalent pages cross-lingually, due to different structuring or cultural differences. Systems 1, 2 and 4 might (and in fact did) come up with relevant links at some points which were lost when looking for a direct link to an English page. Thus a weak point of our approach is that some hits from the Dutch Wikipedia might get lost in the English output due to the lack of an equivalent English article. In the extreme case, our system might return no output at all if none of the hits for a given anchor are linked to any page in the English Wikipedia. Run 3 performed significantly worse. This might be due to two aspects of the switch to the English collection. First, the query text was translated automatically from Dutch to English, which in itself carries a risk of translation errors due to misinterpretation of the query or weaknesses in the translation dictionaries. While the MultiMatch translation tool has a vocabulary expanded to include many concepts from the domain of cultural heritage, there are many specialist concepts in the ASR transcription which are not included in its translation vocabulary. Approximately 3.5% of Dutch words were left untranslated (in addition to names). Some of these turned out to be important expressions, e.g. rariteitenkabinet 'cabinet of curiosities', which were in fact successfully retrieved by the systems for Runs 1 and 2 (although ranked lower than desired). The other main problem we encountered in Run 3 lay in the English Wikipedia and our limited experience concerning its structure. The downloadable dump
includes a large number of pages that look like useful articles, but are in fact not. These articles include old articles set for deletion and meta-articles containing discussion of an existing, previous or future article. We were not aware of these articles during the initial development phase, but this had a significant impact on our results: about 18.5% of the links returned in Run 3 proved to be invalid articles. Run 3′ reflects results where the English Wikipedia archive has been cleaned up to remove these irrelevant pages prior to indexing. As shown in Table 1, this cleanup produces a significant improvement in performance. A similar cleanup applied to the Dutch collection would produce a new ranking of Dutch documents. However, very few of the Dutch pages which would be deleted in cleanup are actually retrieved or have a link to English pages, and thus any changes in the Dutch archive will have no noticeable effect on evaluation of the overall system output.
5 Conclusions
In this paper we have outlined the two approaches used in our submissions to the Linking Task at VideoCLEF 2009. We found using the source language for retrieval to be more effective than switching to the target language in an early phase. This result may be different if translation of the query for the second method were to be improved. Both methods could be expected to benefit from the ongoing development of Wikipedia collections.
Acknowledgements This work is funded by a grant under the Science Foundation Ireland Research Frontiers Programme 2008. We are grateful to Eamonn Newman for assistance with the MultiMatch translation tool.
References
1. Larson, M., Newman, E., Jones, G.J.F.: Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In: Peters, C., et al. (eds.) CLEF 2009 Workshop, Part II. LNCS, vol. 6242, Springer, Heidelberg (2010)
2. The Lemur Toolkit, http://www.lemurproject.org/
3. Jones, G.J.F., Fantino, F., Newman, E., Zhang, Y.: Domain-Specific Query Translation for Multilingual Information Access Using Machine Translation Augmented With Dictionaries Mined From Wikipedia. In: Proceedings of the 2nd International Workshop on Cross Lingual Information Access - Addressing the Information Need of Multilingual Societies (CLIA-2008), Hyderabad, India, pp. 34-41 (2008)
4. Don, M.: Indri Retrieval Model Overview, http://ciir.cs.umass.edu/~metzler/indriretmodel.html
5. Oleander Stemming Library, http://sourceforge.net/projects/porterstemmers/
6. Snowball, http://snowball.tartarus.org/
Author Index
Adda, Gilles I-289 Agirre, Eneko I-36, I-166, I-273 Agosti, Maristella I-508 Ah-Pine, Julien II-124 Al Batal, Rami II-324 AleAhmad, Abolfazl I-110 Alegria, I˜ naki I-174 Alink, W. I-468 Alpkocak, Adil II-219 Anderka, Maik I-50 Ansa, Olatz I-273 Arafa, Waleed II-189 Araujo, Lourdes I-245, I-253 Arregi, Xabier I-273 Avni, Uri II-239 Azzopardi, Leif I-480 Bakke, Brian II-72, II-223 Barat, C´ecile II-164 Barbu Mititelu, Verginica I-257 Basile, Pierpaolo I-150 Batista, David I-305 Becks, Daniela I-491 Bedrick, Steven II-72, II-223 Benavent, Xaro II-142 Bencz´ ur, Andr´ as A. II-340 Benzineb, Karim I-502 Berber, Tolga II-219 Bergler, Sabine II-150 Bernard, Guillaume I-289 Bernhard, Delphine I-120, I-598 Besan¸con, Romaric I-342 Bilinski, Eric I-289 Binder, Alexander II-269 Blackwood, Graeme W. I-578 Borbinha, Jos´e I-90 Borges, Thyago Bohrer I-135 Boro¸s, Emanuela II-277 Bosca, Alessio I-544 Buscaldi, Davide I-128, I-197, I-223, I-438 Byrne, William I-578 Cabral, Lu´ıs Miguel I-212 Calabretto, Sylvie II-203
Can, Burcu I-641 Caputo, Annalina I-150 Caputo, Barbara II-85, II-110 Cardoso, Nuno I-305, I-318 Ceau¸su, Alexandru I-257 Cetin, Mujdat II-247 Chan, Erwin I-658 Chaudiron, St´ephane I-342 Chevallet, Jean-Pierre II-324 Chin, Pok II-37 Choukri, Khalid I-342 Clinchant, Stephane II-124 Clough, Paul II-13, II-45 Comas, Pere R. I-197, I-297 Cornacchia, Roberto I-468 Correa, Santiago I-223, I-438 Cristea, Dan I-229 Croitoru, Cosmina II-283 Csurka, Gabriela II-124 Damankesh, Asma I-366 Dar´ oczy, B´ alint II-340 Dehdari, Jon I-98 Denos, Nathalie I-354 de Pablo-S´ anchez, C´esar I-281 Deserno, Thomas M. II-85 de Ves, Esther II-142 de Vries, Arjen P. I-468 de Wit, Joost II-401 D’hondt, Eva I-497 Diacona¸su, Mihail-Ciprian II-369 D´ıaz-Galiano, Manuel Carlos I-381, II-185, II-348 Dimitrovski, Ivica II-231 Dini, Luca I-544 Di Nunzio, Giorgio Maria I-36, I-508 Dobril˘ a, Tudor-Alexandru II-369 Dolamic, Ljiljana I-102 Doran, Christine I-508 Dornescu, Iustin I-326 Dr˘ agu¸sanu, Cristian-Alexandru I-362 Ducottet, Christophe II-164 Dumont, Emilie II-299 Dunker, Peter II-94 Dˇzeroski, Saˇso II-231
414
Author Index
Eggel, Ivan  II-72, II-211, II-332
Eibl, Maximilian  I-570, II-377
El Demerdash, Osama  II-150
Ercil, Aytul  II-247
Fakeri-Tabrizi, Ali  II-291
Falquet, Gilles  I-502
Fautsch, Claire  I-476
Fekete, Zsolt  II-340
Feng, Yue  II-295
Fernández, Javi  I-158
Ferrés, Daniel  I-322
Ferro, Nicola  I-13, I-552
Flach, Peter  I-625, I-633
Fluhr, Christian  I-374
Forăscu, Corina  I-174
Forner, Pamela  I-174
Galibert, Olivier  I-197, I-289
Gallinari, Patrick  II-291
Gao, Yan  II-255
García-Cumbreras, Miguel A.  II-348
García-Serrano, Ana  II-142
Garrido, Guillermo  I-245, I-253
Garrote Salazar, Marta  I-281
Gaussier, Eric  I-354
Géry, Mathias  II-164
Gevers, Theo  II-261
Ghorab, M. Rami  I-518
Giampiccolo, Danilo  I-174
Glöckner, Ingo  I-265
Glotin, Hervé  II-299
Gobeill, Julien  I-444
Goh, Hanlin  II-287
Goldberger, Jacob  II-239
Golénia, Bruno  I-625, I-633
Gómez, José M.  I-158
Goñi, José Miguel  II-142
Gonzalo, Julio  II-13, II-21
Goyal, Anuj  II-133
Graf, Erik  I-480
Granados, Ruben  II-142
Granitzer, Michael  I-142
Greenspan, Hayit  II-239
Grigoriu, Alecsandru  I-362
Güld, Mark Oliver  II-85
Gurevych, Iryna  I-120, I-452
Guyot, Jacques  I-502
Gyarmati, Ágnes  II-409
Habibian, AmirHossein  I-110
Halvey, Martin  II-133, II-295
Hansen, Preben  I-460
Harman, Donna  I-552
Harrathi, Farah  II-203
Hartrumpf, Sven  I-310
Herbert, Benjamin  I-452
Hersh, William  II-72, II-223
Hollingshead, Kristy  I-649
Hu, Qinmin  II-195
Huang, Xiangji  II-195
Husarciuc, Maria  I-229
Ibrahim, Ragia  II-189
Iftene, Adrian  I-229, I-362, I-426, I-534, II-277, II-283, II-369
Inkpen, Diana  II-157
Ionescu, Ovidiu  I-426
Ion, Radu  I-257
Irimia, Elena  I-257
Izquierdo, Rubén  I-158
Jadidinejad, Amir Hossein  I-70, I-98
Järvelin, Anni  I-460
Järvelin, Antti  I-460
Jochems, Bart  II-385
Jones, Gareth J.F.  I-58, I-410, I-518, II-172, II-354, II-409
Jose, Joemon M.  II-133, II-295
Juffinger, Andreas  I-142
Kahn Jr., Charles E.  II-72
Kalpathy-Cramer, Jayashree  II-72, II-223
Karlgren, Jussi  II-13
Kawanabe, Motoaki  II-269
Kern, Roman  I-142
Kierkels, Joep J.M.  II-393
Kludas, Jana  II-60
Kocev, Dragi  II-231
Koelle, Ralph  I-538
Kohonen, Oskar  I-609
Kölle, Ralph  I-491
Kosseim, Leila  II-150
Kurimo, Mikko  I-578
Kürsten, Jens  I-570, II-377
Lagus, Krista  I-609
Laïb, Meriama  I-342
Lamm, Katrin  I-538
Langlais, Philippe  I-617
Largeron, Christine  II-164
Larson, Martha  II-354, II-385
Larson, Ray R.  I-86, I-334, I-566
Lavallée, Jean-François  I-617
Le Borgne, Hervé  II-177
Leelanupab, Teerapong  II-133
Lemaître, Cédric  II-164
Lestari Paramita, Monica  II-45
Leveling, Johannes  I-58, I-310, I-410, I-518, II-172
Li, Yiqun  II-255
Lignos, Constantine  I-658
Lin, Hongfei  II-195
Lipka, Nedim  I-50
Lopez de Lacalle, Maddalen  I-273
Llopis, Fernando  II-120
Llorente, Ainhoa  II-307
Lloret, Elena  II-29
Lopez, Patrice  I-430
López-Ostenero, Fernando  II-21
Lopez-Pellicer, Francisco J.  I-305
Losada, David E.  I-418
Loskovska, Suzana  II-231
Lungu, Irina-Diana  II-369
Machado, Jorge  I-90
Magdy, Walid  I-410
Mahmoudi, Fariborz  I-70, I-98
Maisonnasse, Loïc  II-203, II-324
Manandhar, Suresh  I-641
Mandl, Thomas  I-36, I-491, I-508, I-538
Mani, Inderjeet  I-508
Marcus, Mitchell P.  I-658
Martínez, Paloma  I-281
Martins, Bruno  I-90
Martín-Valdivia, María Teresa  II-185, II-348, II-373
Min, Jinming  II-172
Moëllic, Pierre-Alain  II-177
Monson, Christian  I-649, I-666
Montejo-Ráez, Arturo  I-381, II-348, II-373
Moreau, Nicolas  I-174, I-197
Moreira, Viviane P.  I-135
Moreno Schneider, Julián  I-281
Moriceau, Véronique  I-237
Moruz, Alex  I-229
Mostefa, Djamel  I-197, I-342
Motta, Enrico  II-307
Moulin, Christophe  II-164
Mulhem, Philippe  II-324
Müller, Henning  II-72, II-211, II-332
Muñoz, Rafael  II-120
Myoupo, Débora  II-177
Navarro-Colorado, Borja  II-29
Navarro, Sergio  II-120
Nemeskey, Dávid  II-340
Newman, Eamonn  II-354
Ngiam, Jiquan  II-287
Nowak, Stefanie  II-94
Oakes, Michael  I-526
Oancea, George-Răzvan  I-426
Ordelman, Roeland  II-385
Oroumchian, Farhad  I-366
Osenova, Petya  I-174
Otegi, Arantxa  I-36, I-166, I-273
Ozogur-Akyuz, Sureyya  II-247
Paris, Sébastien  II-299
Pasche, Emilie  I-444
Peinado, Víctor  II-13, II-21
Pelzer, Björn  I-265
Peñas, Anselmo  I-174, I-245, I-253
Perea-Ortega, José Manuel  I-381, II-185, II-373
Pérez-Iglesias, Joaquín  I-245, I-253
Peters, Carol  I-1, I-13, II-1
Petrás, István  II-340
Pham, Trong-Ton  II-324
Piroi, Florina  I-385
Pistol, Ionuț  I-229
Popescu, Adrian  II-177
Pronobis, Andrzej  II-110, II-315
Puchol-Blasco, Marcel  II-29
Pun, Thierry  II-393
Punitha, P.  II-133
Qamar, Ali Mustafa  I-354
Quénot, Georges  II-324
Raaijmakers, Stephan  II-401
Radhouani, Saïd  II-72, II-223
Roark, Brian  I-649, I-666
Roda, Giovanna  I-385
Rodrigo, Álvaro  I-174, I-245, I-253
Rodríguez, Horacio  I-322
Romary, Laurent  I-430
Ronald, John Anton Chrisostom  I-374
Roșca, George  II-277
Rosset, Sophie  I-197, I-289
Rossi, Aurélie  I-374
Rosso, Paolo  I-128, I-197, I-223, I-438
Roussey, Catherine  II-203
Ruch, Patrick  I-444
Rüger, Stefan  II-307
Ruiz, Miguel E.  II-37
Sanderson, Mark  II-45
Santos, Diana  I-212
Saralegi, Xabier  I-273
Savoy, Jacques  I-102, I-476
Schulz, Julia Maria  I-508
Semeraro, Giovanni  I-150
Shaalan, Khaled  I-366
Shakery, Azadeh  I-110
Siklósi, Dávid  II-340
Silva, Mário J.  I-305
Smeulders, Arnold W.M.  II-261
Smits, Ewine  II-385
Soldea, Octavian  II-247
Soleymani, Mohammad  II-393
Spiegler, Sebastian  I-625, I-633
Ștefănescu, Dan  I-257
Stein, Benno  I-50
Sutcliffe, Richard  I-174
Szarvas, György  I-452
Tait, John  I-385
Tannier, Xavier  I-237
Tchoukalov, Tzvetan  I-666
Teodoro, Douglas  I-444
Terol, Rafael M.  II-29
Timimi, Ismaïl  I-342
Tollari, Sabrina  II-291
Tomlinson, Stephen  I-78
Tommasi, Tatiana  II-85
Toucedo, José Carlos  I-418
Trandabăț, Diana  I-229
Tsikrika, Theodora  II-60
Tufiș, Dan  I-257
Turmo, Jordi  I-197, I-297
Turunen, Ville T.  I-578
Unay, Devrim  II-247
Ureña-López, L. Alfonso  I-381, II-185, II-348, II-373
Usunier, Nicolas  II-291
Vamanu, Loredana  II-283
van de Sande, Koen E.A.  II-261
van Rijsbergen, Keith  I-480
Vázquez, Sonia  II-29
Verberne, Suzan  I-497
Versloot, Corné  II-401
Vicente-Díez, María Teresa  I-281
Virpioja, Sami  I-578, I-609
Wade, Vincent  I-58, I-62, I-518
Weiner, Zsuzsa  II-340
Welter, Petra  II-85
Wilkins, Peter  II-172
Wolf, Elisabeth  I-120
Womser-Hacker, Christa  I-491
Xing, Li  II-110, II-315
Xu, Yan  I-526
Yang, Charles  I-658
Yeh, Alexander  I-508
Ye, Zheng  II-195
Zaragoza, Hugo  I-166, I-273
Zenz, Veronika  I-385
Zhao, Zhong-Qiu  II-299
Zhou, Dong  I-58, I-62, I-518
Zhou, Xin  II-211
Zhu, Qian  II-157
Zuccon, Guido  II-133